Identifying the UFO of Service Modeling and Virtualization

One of the nice things about UFOs is that until they land and present themselves, you can say what you like about them.  Abstraction in networking and the cloud is similar in many ways; because an abstraction is a representation, there’s no limit to what one can represent.  No limit, perhaps, but a potentially huge difference in value.  We can’t afford to leave our “abstraction UFO” in an unidentified state, so we’re going to explore the issues here.

A “virtual network” is an abstraction of a network, a “virtual host” an abstraction of a host.  Virtual private networks abstract private networks, so they’re a subset of virtual networks.  Software-defined networks are also abstractions of private networks, and so they’re a subset of virtual networks as well.  The point here is that virtualization is all about abstraction—defining something that appears real but isn’t—and then mapping it to resources so that it becomes real.  That two-step process makes virtualization very powerful and also very complicated.

What you abstract/virtualize has major implications because the agility of implementation and deployment/topology that virtualization provides tends to focus at and below the point you’re abstracting.  An easy example is in “hosting”.  You can abstract hosting by presenting hosting/orchestration tools with a single abstraction that represents a huge virtual host, and then map the specifics of deployment and management below that abstraction.  The details of the servers, the networking, even the orchestration are invisible inside that abstract virtual host.  On the other hand, if you abstract the network by creating a virtual network, you can define connectivity as you like but you have to explicitly decide how and where to host something, and then connect it.

We have examples of this today in the container space.  Docker and Linux container technology virtualize individual hosting points.  Kubernetes extends this by realizing those hosting points from a cluster rather than from a single host.  Mesos virtualizes the entire resource pool to make it appear as a single host, and the Kubernetes mainstream is clearly heading that way too.  In networking we have SDN and SD-WAN products that create virtual networks that can connect cloud elements, as well as service meshes that abstract added elements like load-balancing and scaling.

One obvious problem this elasticity of meaning creates is that the term “virtualization” tends to be used indiscriminately, disconnected from the specific level of abstraction being proposed or described.  More abstraction is different from less abstraction in some important ways, and we need to understand what those differences are to understand what imprecise terminology might be costing us.

If we looked at modern cloud infrastructure, including carrier cloud, we’d see a complex hierarchy of software, servers, network devices, and fiber/copper/wireless connections.  There are many different ways to build the structure, many different vendors and equipment models.  Deploying applications or services on such a structure creates a serious risk of “brittle” behavior.  That means that the process of deployment, if aware of the details of how the infrastructure hierarchy is built, will break if a part of it (even sometimes a small part) is changed.  This brittleness issue deters effective service and application lifecycle automation because it’s hard to define processes that don’t break when you change the technologies they operate on.

Abstraction came along as a solution to the problem.  To understand how, we need to start at the bottom, with the example of a network device.  Suppose we have five router vendors, two of which offer two different models each.  That’s a total of seven different “routers”, all of which have different properties.  Instead of requiring our management software layer to contend with knowing which router is where, why not create an abstraction called “Router” that’s realized through software plugins?  On the north side, the abstraction presents a standard set of router properties, and each plugin harmonizes one of the seven different router management information bases and control-language inputs to that abstraction.  Now one management toolkit can manage all these routers, and any others that are harmonized via a plugin.
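To make the plugin idea concrete, here’s a minimal, purely illustrative YAML sketch (not any standard schema, and the adapter names are invented): the northbound properties are the only thing the management layer ever sees, and each plugin maps them to one device family’s own MIB and control language.

```yaml
# Illustrative only: a "Router" abstraction with a uniform northbound face
# and per-vendor plugins that harmonize each device family to it.
abstraction: Router
northbound_properties:          # what the management layer sees, for every router
  - admin_state
  - interface_table
  - routing_table
  - traffic_counters
plugins:                        # hypothetical adapter names; one per device family
  - match: { vendor: VendorA }
    adapter: vendorA-snmp-driver
  - match: { vendor: VendorB, model: core-9000 }
    adapter: vendorB-netconf-driver
  - match: { vendor: VendorB, model: edge-2000 }
    adapter: vendorB-cli-driver
```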

We can take this to the next level now.  Suppose we have, instead of a physical router, a hosted instance of a router.  We can frame that router instance to work with our plugin, and it’s then managed like any real router.  However, we also have to deploy it, so what we really need is a management tool that recognizes all of the phases of deployment and management needed for both routers and router instances.  Such a tool, for example, might have a “Deploy” command to host a router in a given location, a command that would simply be ignored for a physical device that’s already in place.
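Here’s a hedged sketch of how the same abstraction might carry lifecycle behaviors, again purely illustrative: the tool issues the same commands whether the router is a box or a hosted instance, and each realization decides what a command means, with “deploy” being a no-op for the physical case.

```yaml
# Illustrative only: two realizations of the same "Router" abstraction.
realizations:
  physical_router:
    lifecycle:
      deploy: no-op                   # the box is already racked and cabled
      configure: via-plugin
      heal: raise-ticket              # field service, not redeployment
  hosted_router_instance:
    lifecycle:
      deploy: instantiate-on-cluster  # e.g. spin up the router software image
      configure: via-plugin
      heal: redeploy-and-reconfigure
      scale: add-instance             # something a physical box can't do
```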

In these examples, we have two sets of “management behaviors”, one that’s above the abstraction and always operates on the virtual image that abstraction presents, and another below it that represents the real-world stuff.  In our router instance example, our below-the-line behaviors not only have to map real-to-abstraction management properties, they have to handle things that happen in the hosted-instance world that don’t happen in the real-router world, like deployment and redeployment and perhaps scaling.  Hosting of functions and application components is already based on abstraction, so we have in virtual routers an example of abstractions that contain lower-level abstractions.

We’re not done.  A collection of routers can be abstracted too, just as a set of real devices can be.  Suppose we collect our router abstractions from the routers in each of our metro areas, as “NYC-Router”, “LA-Router” and so forth.  Since our routers are already abstracted, we have another example of an abstraction of abstractions, a hierarchy.

What’s the value of this kind of hierarchy?  Think of a service order process.  Somebody wants to buy an IP VPN that has terminations in a couple dozen cities.  There might be literally hundreds of cities in which the service could be provided, so do we build service descriptions for each possible combination of cities an order might include?  It would make more sense to say that an IP VPN was a “transit VPN” connected by metro router abstractions to each served city.  Now, an order is just a list of those metro areas, and when we select a metro area for service, the abstraction for that area will “decompose” to deploy the service using whatever routers, router instances, or even SD-WAN instances might be needed there.
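A hedged sketch of what such a hierarchical service model might look like (the structure and names here are hypothetical, not any standard): the order is nothing more than a transit element plus a list of metro abstractions, each of which decomposes on its own.

```yaml
# Illustrative only: an IP VPN order expressed against the hierarchy.
service: ip-vpn
elements:
  - transit-vpn                 # the core that ties the metros together
  - metro: NYC-Router           # each metro abstraction decomposes locally,
  - metro: LA-Router            # into real routers, router instances, or
  - metro: CHI-Router           # SD-WAN, whatever that metro actually has
```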

Speaking of deployments and operations tasks, one other thing this makes clear is that the nature of the hierarchy you build tends to separate high- and low-level operations tasks.  Whatever can be selected at the service level has to be visible at the service level, so if you let people order SD-WAN or MPLS VPN services explicitly, you probably need to have your metro and transit VPN elements divided by implementation.  If you don’t, it could still be possible to separate the options by pricing or other parametric data that you pass to each of the service-level objects, but having orderable items directly visible to service orchestration means that if they’re not uniformly available, you can tell that at order time.

Each abstraction in a model is responsible for presenting a uniform view to whatever is above it and harmonizing everything that is or could be below it.  If we assume a hierarchy like this, and if we assume intent-model principles apply to the definition of all the abstractions, then we can assume that the management of an abstraction can be based on general, model-centric principles as long as we’re dealing with an abstraction of abstractions.  When an abstraction decomposes into something other than a model set based on the same principles, we have to assume that management at that point has to be defined by the deployment process.  In other words, all common-structured abstraction models can be managed the same way, but when what’s inside a model uses different principles (or none at all, because it’s a direct interface to another management system), then different operations practices will prevail there, and deeper.
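A small illustrative sketch of that boundary, using the same hypothetical schema as above: as long as an element decomposes into more model elements, generic model-driven management applies; the moment it decomposes into a direct interface to some other system, that system’s practices take over below that point.

```yaml
# Illustrative only: two different decompositions of metro abstractions.
NYC-Router:
  decomposes_to:
    model_elements:             # more abstractions; generic model-driven
      - nyc-core-router         # management applies here and below
      - nyc-edge-router-pool
LA-Router:
  decomposes_to:
    external_interface:         # a hand-off to another management system;
      system: legacy-ems        # its own operations practices prevail below
      api: hypothetical-ems-api
```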

The way abstractions isolate management tools and practices is both a blessing and a curse.  Where there are established processes for lifecycle management, as with cloud components, you can use those within the model elements that represent cloud-hosted components and use something else more general elsewhere.  The problem is that this can lead to anarchy in the use of management tools.  It would be better to use common tools where possible, and adopt specialized and local solutions only when common tools won’t serve.

But how far do you virtualize?  The answer, in the case of a hierarchy, is that it may not matter much.  Any set of objects at a given level represents a virtual infrastructure at that level.  Abstract networks and you have network operations; abstract devices and you have device operations or element management.  That’s what makes abstractions very UFO-like but also very useful, and it’s also where they can have unexpected impacts.

I’ve been playing with the abstraction issue in networking since 2008, and what I’ve found is that it’s possible to support abstractions at almost any level, as long as you pick a place where you have a significant operations investment you’re trying to protect.  That’s probably why the NFV ISG decided to do what’s effectively device-level abstraction and present virtual devices to legacy EMS/NMS and OSS/BSS systems.  But what keeps this strategy merely “possible” rather than optimal is the failure to adopt a modeling hierarchy that makes it inherently flexible, and then to use that modeling (as the TMF proposed long ago) to steer events to operations processes.

The best approach, I think, is to understand that a continuum of hierarchical models, each representing an abstraction of the structure below it, can be adapted to operations tools at any level.  This is what I’ve called “derived operations”, because you derive an abstract picture of infrastructure by selecting the model segment where the tools match what the abstraction represents.  This is why modeling is so important to effective service operations automation, no matter what specific set of tasks you’re trying to automate.

Google and Apple May Need the Same Thing for Growth

One of the most important concepts in business is TAM, which most know stands for “total addressable market”.  TAM is the problem behind the revenue disappointment Google (Alphabet) reported early this week, and also the problem Apple confronted the next day.  Even though these two tech giants are in different markets, market principles still apply to them both, and both have the same choice to make now.

Google’s growth in ad revenue disappointed the Street, even though the company turned in revenue growth (19% in constant currency) that would have been the envy of most tech companies.  Here, though, the Street may have unrealistic expectations because ad revenue isn’t a pot of gold at the end of the rainbow.

Global ad revenues have, over time, tended to expand roughly at the same pace as global GDP, because companies were willing to spend about the same percentage of sales on advertising year over year.  The total ad spend reached about $740 billion at its peak, but it’s declined as advertisers have turned to digital ads, including Google’s, to target better for less.  Today, it stands at about $612 billion according to my model, and it’s growing again at about 4% per year.

If you want to grow revenues by 20% in a market that’s growing at 4%, you need to gain market share, which means that all the Googles and Facebooks of the world are fighting for the ad pie along with all the TV networks, radio stations, billboards, newspapers, and others who rely on advertising.  One sign of the fight is that we see more TV commercials per hour than ever before, simply because networks have to accommodate the competitive pressure by selling more time to get roughly the same money.

Competition between Google and Facebook for ad dollars isn’t surprising and in itself isn’t likely very destructive, but competition with the networks could be a problem.  Video is the largest source of traffic online, and a major source of ad revenue in itself.  Video content production by the networks is almost totally ad-funded, and if the networks are under pressure for ad revenues, then at the least programming quantity and quality are likely to suffer, and that suffering can then translate into lost viewership, more lost ad revenues, and eventually content starvation.  How many years of “I Love Lucy” reruns could we stomach?

Google is the market leader in ad revenue, which makes it particularly difficult to steal market share from others.  That’s why Street analysts are already saying Google has to look beyond advertising, and talking about why Google’s cloud initiatives have to succeed.  But there may be other, easier, choices, as we’ll see.

For Apple, the problem is obvious—smartphone market saturation.  There is a total addressable smartphone market, and there’s a subset of that market willing to pay a premium price for being cool.  We see some of these people any time we drive past an Apple store.  What Apple is now facing is that the cool-people subset of the smartphone market isn’t growing any faster than the market overall, and the problem with smartphones is that you can’t easily sell them to those who already have them.  Sure, you can try to get people to toss their phones every year for the latest and greatest, but this year’s “greatest” is successively more difficult to differentiate from last year’s, feature-wise.

To make matters worse, there’s a host of Apple competitors who offer quite serviceable phones at a third the price of an iPhone.  There are high-end competitors that many would rate as good as or better than iPhones, and that still cost perhaps 60% to 70% as much.  Obviously, it shouldn’t be a big surprise that iPhone sales are off by 17%, particularly when Apple itself says it’s going to get into things like subscription TV and wearables to raise revenues from other sources.

The big challenge Apple faces in its quest for additional revenue is the closed-ecosystem nature of Apple’s business.  No vendor out there is as restrictive about competing hardware, or about software and apps conforming to its policies.  Those who (like me) remember the old days of Apple computers versus the IBM PC know what happens when you try to keep everything to yourself.  IBM had an open architecture that spawned a whole PC industry, and it won decisively in market share.  Can Apple continue to look inward, focus on its own “cool” base of customers, and ignore the broader market?

If you’re a manufacturer of best-in-the-world left-handed golf clubs, you have a TAM challenge, just like Google and Apple do.  In a way, though, you may be better off because it’s probably obvious to you that you can’t base your revenue planning on genetic modification of future babies to enhance the number who are left-handed.  It’s not as clear-cut for Google and Apple, and if we’ve learned anything about public-company behavior in the last couple decades, it’s that they’ll do anything to avoid compromising the near-term for the long-term.  They might well admit (privately) that they can’t stick their heads in the sand for years, but a couple of quarters?  Hey, why not?  Quarters, of course, eventually add up to years, so both companies have to look to the long term even if they keep their visions close to the vest to keep the Street happy.

For Google, as for all OTT players with ad revenue dependence, the solution is paid services.  As an example, public cloud services to support consumer applications and business productivity represent an incremental trillion-dollar market, more than one and a half times the whole global ad spend.  What’s needed to support a win in that space is nothing more than what Google has excelled in doing for a decade—creating a cloud-native infrastructure model.

Google’s failing has been in the articulation of the vision.  Google’s Andromeda project work is the best and largest example of virtual-network and cloud-native deployment on the planet, but few know anything about it.  I asked a dozen Wall Street analysts whether they believed Andromeda could be the basis for a Google push toward new revenue, and all of them admitted they had no idea what I was talking about.  Google (like so many tech companies) is great at geek-speak and bad at Street-speak.  They need to productize their technology vision to open new revenue doors.

Apple obviously doesn’t have a geek-speak fixation.  They do have a long-standing “cool-speak” fixation, though.  I was doing some consulting for a big NYC bank when the commercials for Apple’s Lisa computer came out.  One showed a young guy going to work with his Irish Setter, sitting down at his Lisa, and being productive.  A couple of banking executives were talking with me about it when an executive VP joined us and asked what we were discussing.  When he was told, his response was, “What kind of company would let someone bring a dog to work?”  Moral: If you want to sell to banks, you have to say things bankers identify with, and geek-speak and cool-speak are both going to miss the mark.

The good-even-great news for Apple is that the Street and even their customers are prepared to accept that somehow they’ll pull a rabbit out of the hat on new revenue sources, even when logic seems to speak against the success of their initiatives.  They have the runway to launch something revolutionary, but to do that they need something to launch.

I’ve been critical of Apple’s cloud vision for years now, and I still believe that Apple needs a cloud strategy.  I’m not arguing that they need to offer public cloud computing services, but that they need to build out cloud infrastructure to support a more service-centric future.  Google has the needed technology but doesn’t have the technology salesmanship; Apple is the opposite.  Apple, to do TV, can’t just resell somebody else’s TV; the margins won’t satisfy investors.  But to do more than that, Apple would need to build out something like Google’s architecture, and they apparently don’t want to do that.

That makes wearables the near-term hope of Apple, but only that.  They’ll get a grace period because of their image, and they might get a bump from 5G starting in late 2020, but by 2022 they’ll need to have something really smart in place, and they don’t seem to have anything in motion that would produce it.  They can dabble in self-driving cars and subscription TV, but wearables would have to be their next big thing, and even that has its limits.  Watches are easy, and they’re having some success.  Augmented reality might offer something, but the investment needed in AR is enormous and the risks are even bigger.

In the long term, both Google and Apple need paid services, and both need both a platform and positioning to make them successful.  Google likely has the former, but Apple remains behind in its conceptualizing of the cloud.  The early focus on wearables could accentuate their device-myopic view of the world.  That could make Apple’s positioning acumen less valuable; knowing grass doesn’t make you a forester.

Is a New Model for Service Automation Emerging?

Is a new lifecycle automation concept championed by AT&T about to displace AT&T’s earlier ONAP?  According to several recent stories, Airship might be emerging as the new darling of automation.  If that’s the case, what could it mean for networking and transformation?

We’ve been struggling for six years to frame a rational transformation architecture for network operators.  From very early on (the fall of 2013 to be exact) the operators themselves have admitted that operations automation would have to be a critical piece of that.  Obviously, automating the steps in service management would reduce opex even for legacy network architectures, and for new architectures based on hosted virtual functions (NFV) it would be essential.  The problem is that NFV declared operations to be out of scope, and later initiatives like ETSI’s Zero-Touch Automation and AT&T/Linux Foundation ONAP have failed to accept the basic requirement of an automated operations system—which is that it’s necessarily event-driven.

Airship, which is reported to be a big piece of AT&T’s 5G plans, is an assembly of open-source components designed to “declaratively automate cloud provisioning.”  It includes familiar tools like Kubernetes and Helm (including OpenStack-Helm), and it straddles the container-VM-bare-metal deployment options.  The term “declarative” means that Airship is a form of “DevOps” that declares an end-state and then automatically works to achieve and sustain it, rather than requiring users to specify deployment steps in a script.  The declarations are in YAML form (Helm charts, for example) and they describe both deployment and updates to components.
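As a familiar example of the declarative style Airship builds on, here’s a standard Kubernetes Deployment manifest (the image name is made up): you declare the desired end-state, two running replicas, and the platform’s controllers work continuously to achieve and maintain that state rather than executing a one-time script.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vrouter
spec:
  replicas: 2                   # the declared end-state, not a procedure
  selector:
    matchLabels:
      app: vrouter
  template:
    metadata:
      labels:
        app: vrouter
    spec:
      containers:
        - name: vrouter
          image: registry.example.com/vnf/vrouter:1.4   # hypothetical image
```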

AT&T doesn’t seem to see Airship as a replacement for ONAP, based on the architecture paper they’ve offered (see the diagram on page 4).  They see Airship as an “undercloud” or platform tool that operates below both NFV and ONAP, which retain their own cloud-level open-source tools (like Nova and Neutron for NFV and Keystone and Swift for ONAP).  However, it seems to me that Airship is nibbling at the edges of both ONAP and NFV even today, and that further development of Airship and of other Kubernetes-ecosystem tools could bring the concepts even closer.  In fact, Airship might eat both NFV and ONAP from below.

Helm charts are the central element of this structure, in my view.  A “release” is a collection of container resources that deploys as a unit, and so it’s a reflection of what the NFV people would call a VNF deployment.  The Helm charts make up a “portable deployment”, in short, and it’s easy to see how people looking at this piece of the structure would conclude that Airship has everything you need to do NFV or ONAP, but that’s not yet true.
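To see why people draw the parallel, consider a hedged sketch of a values file for a hypothetical “vrouter” chart: a single helm install of the chart creates one release, and that release packages everything the hosted router function needs to deploy, and be updated, as a unit, which is roughly what NFV would call a VNF deployment.

```yaml
# Illustrative values.yaml for a hypothetical "vrouter" Helm chart; the
# chart's templates would render these into the Kubernetes resources that
# make up the release.
replicaCount: 2
image:
  repository: registry.example.com/vnf/vrouter   # hypothetical image location
  tag: "1.4"
service:
  type: ClusterIP
  port: 179              # e.g. a BGP control session to the function
resources:
  limits:
    cpu: "4"
    memory: 8Gi
```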

The critical missing pieces here are network-related.  Airship has a virtual-network framework (Calico) for the control-plane activity of Airship and the cloud elements, but the tenant network is outside the scope of Airship today.  Similarly, Airship doesn’t see “real” devices or network services.  Given that Airship is a cloud deployment/redeployment platform, those omissions aren’t surprising, but in network operator applications where hosted and physical device elements would be expected to intermix, it creates a void.

The structure in the figure I referenced in AT&T’s architecture paper reflects a broad issue in the orchestration of network services.  Cloud orchestration traditionally deals with hosted components.  Network orchestration traditionally deals with devices.  What deals with both?  NFV planted an early flag by presuming that network devices would transmute into virtual functions, managed largely as the traditional devices were.  That would render “service management” unchanged.  In this model, which AT&T’s application of Airship perpetuates, the cloud orchestration tools like Airship (and like NFV and ONAP above them) are responsible for realizing virtual devices on resources and managing them inside the virtual box.

This bicameral management model has collided with “service lifecycle automation” goals and even software architectures.  If services are created by coercing network functions (virtual or physical), then that coercion has to happen above the level of both software orchestration and device management, which is what puts ONAP where it is in functional-diagram terms.  There are then two layers of orchestration, and while that obvious redundancy seems wasteful, it’s not the big issue.  The big issue is the natural interdependence of the virtual and physical structures.

If a higher-level orchestration element is to deploy a service on mixed real/virtual infrastructure by invoking the appropriate management and orchestration (respectively) tools, then it has to know whether a given element is real or virtual.  If the management and orchestration tools are expected to “redeploy” or manage problems, then we have to presume that the new configuration might involve a different mixture of real and virtual elements.  Thus, low-level remediation, including what’s built into tools like Airship, is hamstrung in actually remediating, because it can’t fix a problem that impacts the higher-level real/virtual model.

Let’s illustrate this issue quickly.  Suppose the original deployment (“A”) includes Pittsburgh, PA, where there’s a good data center, and thus a virtual router is deployed there.  Suppose that virtual router fails because its server fails, and there is no available capacity to rehost it in Pittsburgh.  The best topological match to replace Pittsburgh is Steubenville, OH, where there is router capacity available but not hosting capacity.  We cannot use Airship to rehost if we use our “best” city, and if we do use Airship and demand a rehosting of the virtual router, we might end up with the instance hosted in, and traffic hauled through, Philadelphia, which isn’t the optimum location at all.

That’s the issue that AT&T, and Airship proponents in general, have to be looking at.  I can build resilient container-based services and use cloud tools to maintain them.  I can’t do that if I mix in physical network devices, because the tools don’t recognize network device management.  My declarative Airship/Helm models can’t represent the service overall, just the individual pieces of hosted functionality that might be integrated into it.

That’s not the end of the issues either.  Recall that AT&T makes a point of saying that they’re using Calico for control-plane virtual networking, but not tenant networking.  But think about it; in the cloud we do virtual networking for the service transactions, the equivalent of tenant networking.  If we presumed that cloud-native implementation of virtual functions was the rule, why would we not want to use a service mesh (like Istio) for the service data plane, the “tenant network?”  If we did that, then we’d be integrating Istio into Airship, in which case the scaling and locating of new or replacement instances would be done within Istio.  How, then, do we address mixed physical/virtual network infrastructure, given that Istio doesn’t do network devices?
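For what that might look like, here’s a standard Istio VirtualService (the service names are hypothetical): the mesh, not the underlay, decides how service traffic is steered, which is exactly why it has no notion of a physical router in the path.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: vrouter-data-plane
spec:
  hosts:
    - vrouter.tenant-a.svc.cluster.local    # hypothetical tenant service name
  http:
    - route:
        - destination:
            host: vrouter.tenant-a.svc.cluster.local
          weight: 100    # steering and scaling decisions live here, in the mesh
```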

The point is that the “network” is the “cloud network” in the future, and the current separation of real and virtual devices isn’t going to cut it as we move into that future.  It’s important that networking at the cloud level be purely virtual, which means that either there’s a total separation between the cloud network and the “real” network, or that the two are integrated rather than overlaid.  The former strategy is fine for network users, but it doesn’t address operator interests in virtualization of functions/devices rather than services.  The operators, then, have two choices to bring their own stars into alignment.

The first of these choices is to forget about “function orchestration” and build network models that define services in both real and virtual element terms.  That’s the approach that I think NFV should have taken, that I think the TMF has taken (and just not articulated well), and that operators would probably benefit most from promoting.  Use TOSCA to model services, use hierarchies and intent modeling to build services from functions that can then map (at the “resource layer”) to either real or virtual devices.  This is better, I think, because you could define one model-driven, event-based, orchestration and management platform for everything.
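A hedged sketch of the idea in TOSCA’s YAML form (the node type names are hypothetical; the overall structure follows the TOSCA simple profile): the service template describes functions and their relationships, and it’s the decomposition of each function at the resource layer, not the service model itself, that decides whether a real router or a hosted instance satisfies it.

```yaml
tosca_definitions_version: tosca_simple_yaml_1_2
topology_template:
  node_templates:
    nyc_metro_access:
      type: example.nodes.RouterFunction    # hypothetical abstract function type
      properties:
        location: NYC
        capacity_gbps: 10
      requirements:
        - dependency: transit_vpn            # function-level relationship only;
    transit_vpn:                             # real vs. virtual is resolved below
      type: example.nodes.TransitVPN         # hypothetical
      properties:
        topology: full-mesh
```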

The second of the choices is to try to keep service orchestration and function orchestration separate, which means two layers of orchestration and all of the issues I’ve already raised about coordinating problem responses between the layers.  However, this approach is going to mean defining a more event-driven model for that service-layer process set, and that’s going to obsolete things like ONAP and ETSI ZTA, neither of which is truly event-driven.

But before we feel too sorry for the operators in having to make a tough decision, let’s return to the theme of this post.  Airship could eat all the current strategies from below, because cloud-native initiatives either have to ignore the dichotomy of real/virtual resources or they have to absorb it.  We already see, even in Airship, a recognition that you can host containers on Linux bare metal or in VMs.  Is it really so far from that to the realization that we can harness either virtual routers or real ones?  I don’t think so, and I think we’ll see Airship, Kubernetes, Istio, and other tools expanding into the real/virtual space.  When they do, whatever they provide is what lifecycle automation will be based on, because the breadth of both support and adoption for it will swamp the smaller efforts of the operators.

Look to Helm for a sign of progress here.  Helm is “declarative” only in a minimal sense.  In its current form it’s not suitable for the kind of intent-driven hierarchical modeling that a service needs.  It wouldn’t take much to fix that, and there are good reasons to make the necessary changes even within Helm’s original cloud-centric mission.  AT&T might even be a catalyst for the changes, since it’s already seeing both the cloud and service dimensions of Helm.

AT&T may have launched both ONAP and what will eventually supplant it.  In my view that’s a good thing, because Airship is more cloud-native, and cloud-native is where we need to be heading.