AT&T’s Interesting Cloud Strategy: A Sign of the Future?

What kind of carrier cloud we finally get, and when we get it, will depend on what drives its deployment.  There are a lot of drivers of carrier cloud, and the network operators probably all emphasize them in different ways.  It also turns out that there are options for how carrier cloud services are hosted, but here we see two clear choices.  Network operators, it seems, see these two choices differently too, and we are starting to see some signs of how they assign drivers to hosting choices.  AT&T is presenting a particularly interesting (if somewhat confusing) example.

Multiplicity is one reason for confusion.  AT&T has announced public cloud deals with both IBM and Microsoft, and at the same time is committed to its own “Network Cloud”.  The IBM deal focuses on AT&T Business and services to enterprise customers, and the Microsoft deal makes Microsoft the preferred cloud provider for AT&T’s “non-network” applications outside AT&T Business.  The whole picture is confusing enough that AT&T published a blog post to explain it.

Most early deals between operators and cloud providers have been focused on providing cloud services to customers, meaning that the operators believe they have to take a market position in public cloud services before they have enough customer demand to justify building out to host those services themselves.  AT&T is taking things a bit further by moving some of its own stuff to a public cloud partner, but even after its blog to clarify its position, there’s still a question of just what a “non-network” application or workload is.

I think it’s pretty clear that data-plane connection services would be network workloads.  It’s not clear whether control-plane features associated with data-plane connection services are also network workloads, but that seems more likely than not.  It’s clear that AT&T’s operations and business support systems (OSS/BSS) activity is not considered a network workload, but network operations and service lifecycle management…who knows?  This is more than a semantic game, because if AT&T’s Network Cloud is to fully realize “carrier cloud” potential, it has to address all the demand drivers of carrier cloud, and we have to be able to decide if each driver is driving a “network” or “non-network” workload.

The demand drivers for carrier cloud are NFV/vCPE, personalization of advertising and video, 5G and mobile, network operator cloud services, contextual services, and IoT.  You can argue that the first one is at least substantially a data-plane or connection service mission, which makes it a “network workload”.  5G (if you believe in 5G Core) and mobile IMS/EPC would include some data-plane functions, but everything else is really a higher-level (OTT) service opportunity.  In the way AT&T sees things, that could mean that some or all of the hosting generated by these non-data-plane drivers could end up outsourced to the public cloud.

I want to emphasize here that I don’t believe most operators are determined to sustain a public cloud platform commitment for any of their “carrier cloud” services; even public cloud services might come in-house if they had the hosting resources in place.  The issue I think they all face is that of “first cost”.  To make carrier cloud useful, you have to deploy it where the drivers are operating, which could be anywhere.  That could mean a massive early build-out with a high cost, and much of it would sit idle while demand built.  Given operators’ sad record in marketing new things, that demand-building period could last a long time, perhaps long enough to render early infrastructure obsolete.  Better to rent than buy, at first, but why fatten the coffers of public cloud providers if you have your own infrastructure?  Even given their announcements, AT&T may decide, eventually, to shift some of its public cloud work back to the Network Cloud if they build enough economy of scale.

It’s my view, and I think most operators would agree, that all of the carrier cloud drivers except public cloud services demand the operator implement or acquire cloud-native functionality to support the opportunity.  That adds to the problem of first cost I cited earlier; not only do you have to build out data centers, you have to recognize and buy, or build, cloud-native applications/features to support the opportunities, which is another cost and takes a non-zero amount of time.  Operators are all over the place relative to when they might have “cloud-native” capability; it might be a month or it might be several years, depending on who you talk to and how much cloud-native capability qualifies to be counted.

A proving ground would be nice, but nobody wants to run a trial on something important and expensive.  However, as the AT&T story proves, operators are generating requirements that fall outside my six drivers, in areas like network and service management.  These probably would not be large enough to consider a true driver of a carrier cloud (and in my modeling they didn’t qualify, which is why they’re not included), but they could qualify for cloud hosting.  If that were done on a public cloud, not only would it bypass the first cost of infrastructure deployment, it might offer operators a way of learning cloud techniques before they dive into cloud-native.  Remember, we don’t have at this point a single architecture or toolkit to define cloud-native development or behavior.

I think that the notion of public cloud hosting for non-network services, which in my view could also include control-plane cloud-native elements associated with services, is a potential game-changer.  Public cloud providers have good economies of scale and (in combination, in particular) enormous geographical scope.  Using them could give an operator like AT&T a position in every market they were interested in serving, without building out a data center there.  However, there are some risks that will need to be addressed.

The primary risk is a loss of uniformity in feature/application architecture principles.  Every public cloud provider offers low-level hosting (IaaS and in at least some cases, container hosting) based on fairly standard models, but nearly all of them also offer higher-level “web services” that are not at all uniform.  An operator planning to host in the public cloud would have to accommodate the differences in the implementations of these higher-level services, or not use them and instead build or select tools to replace them.  Every cloud, including something like AT&T’s Network Cloud, is likely to need some web or platform services above simple IaaS or container hosting.

AT&T has expressed a commitment (in Airship, ONAP, and the recent relationship with Dell) to promoting, supporting, and in some cases initiating open-source projects that combine to build a framework for cloud-hosting of their network and network-related functionality.  According to AT&T, “Dell Technologies is working closely with AT&T to combine our joint telco industry best practices with decades of data center transformation experience to help service providers quickly roll out new breeds of experiential Edge and 5G services.”  This seems to suggest that Dell would be involved in the higher-layer carrier cloud demand drivers, and since Dell isn’t a public cloud provider that suggests these drivers would be addressed in AT&T’s Network Cloud.

Or maybe not.  Remember, the edge and hosted-element vision of the cloud is really platform-independent, and it will almost surely use open-source tools that could be ported to the public cloud.  Airship frames a platform model, currently below the level of full container-and-cloud-native development but likely inclusive of it at some point.  In my view, that would mean that AT&T could then apply their “experiential and 5G services” framework, when complete, to an arbitrary public cloud or to their own data centers.  For operators who don’t have any commitment to select/build a toolkit to frame their deployment in the cloud, the risk of incompatibility among cloud-provider tools could well be a deal-breaker…if they recognize it in time.

An associated risk is what I’ll call the “ONAP risk”.  The operator universe is inhabited primarily by box-thinkers, as proved by the fact that operator-centric standards activities keep falling into the trap of virtualizing boxes rather than services.  If you are trying to do the wrong thing, there is no right set of tools to do it with.  Almost every operator I interact with in any way says they do not have adequate technology resources for cloud-native projects, period.  As long as that’s true, it will be difficult for them to even select packaged solutions offered by others, because they’d have no strong framework within which to assess them and inadequate staff to conduct the review.

A third risk, specifically associated with public cloud services, is the portability problem.  As I’ve noted above, web services differ among cloud providers.  In public cloud applications as they’re currently developing, web services are almost surely a part of the cloud service that an enterprise purchases.  Thus, enterprise applications can quickly become non-portable, and that means an operator who serves enterprise cloud via a public cloud provider deal may end up locked in.  In addition, it may be difficult to port these users to an operator’s own public cloud infrastructure, unless the operator duplicates those web services, which it likely can’t do.

While these risks have to be managed, they can be, and I think that the public-cloud partnership concept has very significant benefits.  Without it, I’d be concerned that operators might never have the kind of safe on-ramp to their own cloud infrastructure that they need to have to address future service opportunities.  The key to addressing the risks is to establish a carrier-cloud platform model, and that should be a goal of operators and of the open-source and standards activities they support.

AT&T is trying hard to establish a leadership position in what they call the “cloud-first” approach, which includes public cloud in a broader mission than just offloading commercial cloud compute hosting to a third party.  They seem to be succeeding overall, and so they should be watched carefully as a barometer for where “transformation” at the network operator might take operators overall, and thus the industry.

The Real Process of Making the Cloud Business Case

Just what benefits drive or justify cloud adoption?  This is a question you’d think had been asked a million times since the dawn of the cloud era, but in surveys and discussions with CIOs, I’m finding that’s not been the case.  Companies had largely accepted the widely publicized view of cloud benefits until about 2016, and since then they’ve been demanding a stricter business case be made on cloud projects.  In the process, they’ve generated some data on the cloud and its role in future IT.  Some of it is very interesting.

I want to start this blog with a note, or warning if you like.  The factors that determine what value the cloud will bring to a given application or even a given company are highly variable.  The work I’ve done on the question of the value of the cloud reflects broad statistical trends, and those are useful to planners looking to make high-level decisions.  While I’m still getting a small number of new data points, they aren’t shifting the results enough to justify holding back on this discussion.  But remember, there’s nothing that my work can do to say whether a single application, or even a single company, can benefit from the cloud at all, much less how substantial the benefit can be.  Look at this material as general guidance, an almanac and not a cookbook for cloud adoption.

Policies evolve, so we have to open with some cloud history.  Companies tell me that when they first looked at the cloud, they were of the view that public cloud services would be “cheaper” than data center services (every story on the cloud said that, after all, and most still do).  The cost reduction was first seen as a server-versus-IaaS cost difference attributable to “economy of scale”, but then broadened to reflect total cost of ownership (TCO).  Many companies admit that this broadening was a response to the fact that early cloud projects based on narrow server-capital-cost-versus-IaaS-hosting quickly plucked the low apples and didn’t result in much cloud adoption at all.

The early cloud projects that did succeed were migration projects, largely related to the then-popular topic of “server consolidation”.  Many companies had been buying relatively inexpensive servers for specific line department missions, many of which could have been done in the data center, simply because the purchase authority of the line managers allowed the buy.  Those managers said it was easier (and sometimes cheaper, given how IT costs were allocated in their company) to buy and deploy a server and applications than to have IT handle them.  The result of this was a lot of under-utilized servers, and moving these to the cloud offered a cost savings.

That raises the first point about cloud benefits.  Everybody buys commercial off-the-shelf servers, and while big cloud providers may get a slightly better price than a big enterprise, the cost difference is almost never enough to cover expected cloud provider profit margins.  Thus, a server in the cloud is almost never going to be cheap enough relative to one on premises to justify a shift, unless the premises server is poorly utilized.  This truth came late to the CIOs I heard from; most said that they didn’t really understand the early cloud opportunity until 2014, and nearly a quarter said they hadn’t gotten their benefit facts straight even in 2016.

In the next phase of cloud justification, the “TCO phase”, companies argued that even though their own pure hardware and environmental costs were at least competitive with those of the cloud, the cost of operations management for application hosting achieved an “operations economy of scale” in the cloud that was very real.  And yes, this was true in a number of cases, but most companies quickly learned that IaaS public cloud services didn’t really impact the majority of their operations costs, which were associated with the software platform (OS and middleware) and application lifecycle management.  At the end of the day, based on my current data, the combination of capex and opex savings for application migration to the cloud is unlikely to make a business case for more than about ten percent of applications run traditionally in the data center.  Again, this seemed largely consistent with company experience as related in 2019, but it wasn’t clear at the time.

By about 2016, a slight majority of companies had realized that the problem of cloud justification based on cost was that application models that were likely highly beneficial to cloud hosting could never have been justified on the premises and thus weren’t there to be migrated.  Could it be, they thought, that the real benefits of the cloud aren’t focused on what we run today, but on what we don’t?  It turns out that was true, but who wants to redo every application to optimize its cloud compatibility?  For third-party software, you’d have to wait for the provider to do that in any case.  And about a third of companies told me they couldn’t realistically redo even their in-house applications because of the major cost, resource needs, etc.  Could they ease into the cloud, somehow?  Yes, and that ushered in the next period of cloud benefit analysis and justification.

Most companies were already looking at improving their applications (application modernization, or “appmod”), with “improving” meaning creating a better user quality of experience for their workers, customers, and even business partners, or what we’d call “UX” today.  In both 2016 and 2017, businesses reported this to be their highest development priority.  In most cases, this UX improvement stuff focused on the graphical user interface or mobile app interface to applications rather than on changes to the core (usually transaction-processing) piece of the applications.  That meant that if the cloud could be used to “front-end” traditional applications, the cloud-specific changes would be confined to areas where application modernization was already being funded.  This front-end-cloud mission is the current dominant model for enterprise cloud adoption.

Using the cloud as a front-end to traditional IT can generate at least a minimal cost savings for about 30% of current applications according to my model, and with the addition of the scalability and resilience benefits, most of these can then justify cloud adoption to front-end existing data center IT.  This doesn’t eliminate the data center applications, only the incremental development for the front-end modernization, but it’s still a market larger than the “migration” market for cloud services.  This is the next low-apple cloud opportunity, the one we’re working against today in most cases.  It’s also preparing us for the next big step.

As a front-end technology, the cloud’s ability to support scaling (and “un-scaling”) as workloads change, to replace failed components, and to facilitate rapid development practices (continuous integration/continuous delivery, or CI/CD) to optimize the response of applications to business change creates a powerful business case.  The front-end mission also had the benefit of focusing the cloud and open-source community on the development practices that promote all these benefits.  This started the current “container wave” of application development and deployment.

Containers, as I’ve noted before, are more like portable units of work than like an alternative to VMs.  Container hosting (which is an alternative to VM hosting) is more efficient, and this means that you can get more application bang for your hosting buck with them.  It doesn’t alter the cloud business case in a pure hosting-cost comparison because you can host containers in the data center too, but container operations and orchestration overall are a good match for small application components, leading even to what are called “microservices”.

What’s happening now is that we’re building on the container concept to create the software foundation for the next step in cloud justification evolution, which is arguably the big step—the concept we call “cloud-native”.  But cloud-native is a lockstep progression, a kind of technology three-legged race.  On one hand, we have the opportunities that line organizations have to incorporate a new model of application, one that was never constrained by either the limitations of the data center or the limitations of application architects trained in the data center.  On the other we have the toolkit necessary to build applications with.  As long as our tools build monolithic applications, they’re not suitable for cloud-native, and their limitations will hamstring the education of developers and CIOs in cloud-native principles.  About one business in five today thinks they’ve mastered cloud-native.

We already know that containers are the foundation of the cloud-native future.  Kubernetes has become the irreplaceable tool in container orchestration, to the point where no business that has organized IT should be considering anything else.  We’re starting to see a collection of tools coalesce around Kubernetes to create an ecosystem, the middleware kit that will be the platform on/for which cloud-native applications will develop.

Cloud-native will extend the targets for cloudification in two ways.  First, it will allow enterprises to modify the “front-of-the-back-end” part of their current IT applications to fit cloud principles.  That allows cloud benefits in elasticity and resiliency to be extended to more of the current application base.  Second, it will allow developers to frame new applications that are more contextual, meaning they tune themselves to the environment their users are inhabiting, making workers more productive and consumers more eager to spend.  The first piece of that shift generates a business case for about 35% of current applications (actually components of applications) to be migrated to the cloud.  The second eventually generates about a trillion dollars per year in additional IT spending by unlocking more productivity and revenue.  This is the long-term future of the cloud, making it coequal to traditional computing and data centers, then actually surpassing them…eventually.

But we’ve got to get back to that one-in-five who get it.  Cloud-native is, by a slight margin, an accepted goal for enterprises I’ve heard from, but interestingly it’s those who depend on third-party software that think it’s going to happen within the next two years.  Those who have to self-develop push the timeline out another year or two.  But third-party software vendors themselves seem less optimistic.

I think the combination of tool development and concept acceptance is critical to realizing this vision, and also in establishing the timeline.  It is in truth easier to push the third-party software market to cloud-native because the cost of change can be amortized.  Third-party software dominates the market these days, and that means that the developers of that software have a critical impact on the pace of cloud-native evolution.  However, software providers are going to drag their feet until there’s an established “cloud-native-platform-as-a-service”, or middleware and API set, that they can use for their own development and expect to have support for in their target buyers’ environments.  Until that’s in place, we can’t expect them to commit to a transformation to cloud-native, and we don’t have it yet.

This is a highly simplified vision of a very complicated process.  The key question at each phase of cloud evolution, including the current one, is how we balance the cost of adoption in that phase with the benefits cloud adoption would bring.  That’s what makes this so application-and-company-specific.  I’ve worked with and for companies who ran applications that hadn’t been changed in 20 years, and in fact couldn’t be changed now because they’d lost the source code or skills.  Other companies who have been working hard on things like CI/CD have probably componentized their software already, and could adopt cloud-native practices with much less effort, providing there was an established platform or framework they could work toward (back to our three-legged race).

How things would net out to a company under favorable conditions can be statistically assessed.  Ten percent of applications could be expected to be migrated to the cloud under a server consolidation mandate.  Another 30% of application code, created by projects aimed at improving UX, could be justified for cloud hosting if the UX improvements were made in a cloud-centric (or cloud-native) way.  Another 35% of application code could be shifted to the cloud by applying cloud-centric/native principles to the portion of data center apps that the cloud front-end touches.  That adds up to 75% of application code eventually being made cloud-ready (some of it will still run in the data center for governance or overall cost reasons).  In addition, every company could justify increasing IT spending by about 1.8 times if they built contextual applications to improve productivity and sales, using cloud-native principles.  That summarizes the statistical picture of cloud opportunity as my model presents it today.
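
To make the arithmetic behind that statistical picture concrete, here’s a tiny Python sketch.  It simply totals the percentages quoted above and applies the spending multiple; it is an illustration of the numbers, not of the underlying model itself.

```python
# Worked arithmetic for the statistical picture above.  The percentages are
# shares of current application code; the 1.8x figure applies to IT spending.
migration_share = 0.10    # server-consolidation-style migration to the cloud
front_end_share = 0.30    # UX/appmod front-ends built in a cloud-centric way
back_end_share  = 0.35    # "front-of-the-back-end" code reworked as cloud-native

cloud_ready = migration_share + front_end_share + back_end_share
print(f"Application code eventually cloud-ready: {cloud_ready:.0%}")  # 75%

baseline_it_spend = 1.0                        # normalized current IT spending
with_contextual = 1.8 * baseline_it_spend      # adding contextual applications
print(f"IT spending with contextual applications: {with_contextual:.1f}x baseline")
```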

Over how long?  I can’t say.  I pooh-poohed the whole idea of “migrating to the cloud” from the first, but in the heady days of cloud hype nobody believed that it wouldn’t sweep the data centers aside, so nobody thought about what the future would really need.  In fact, we didn’t see any rational tools to support a cloud-centric future develop until about 2016, about a decade after public cloud services came along.  That’s hardly a breathtaking drive along the cloud path so far, and if ignorance and hype stalled things early on, there’s little sign those two factors won’t continue to hurt us.

My guess, based on highly imperfect modeling?  We can expect to see the UX-improvement opportunity realization peaking in about 2022.  At about that time, we can expect to see real growth in the cloud-native contextualization opportunity, but that won’t hit its stride for another five years.  By the end of the next decade (by 2030), we can expect to have a true cloud-native world.

Optimizing the Virtual Implementation of 5G User Plane Function

One area of 5G and cloud intersection is the 5G User Plane, a component of the 5G Core architecture.  My readers know of my skepticism about 5G Core deployment overall, but I’m going to set that aside for a bit to deal with how the 5G User Plane could be considered a poster child for cloud-native thinking.  It’s also, in my own view, a poster child for why “network virtualization” is different from “virtual devices”.  In fact, the 5G UP may contain an example of how we got stuck in box networks in past virtual-network specification attempts.

In 4G, the Evolved Packet Core (EPC) handles both information exchange (the user-plane function) and mobility management (the control-plane function).  EPC with CUPS (Control-User-Plane Separation) provides a semi-independent planar structure, but there are a lot of interfaces between the control and user planes, to the point where most agree that CUPS is “S-in-name-only”.  5G’s Non-Stand-Alone (NSA) model glues 5G’s New Radio onto the EPC model, making many of the 5G benefits (increased user speed and cell capacity) available quickly.  This is the model of mobile 5G now deploying in most areas.

5G Core proposes a new model, one that simplifies control/user-plane separation and also simplifies the user plane overall.  In theory, the 5G Core model would offer an easier way to introduce cloud-native behavior, because the control plane, which really is a natural cloud-native application, is less fettered by continual interfaces with the data plane.  The data plane, or “user plane” of 5G Core, is also potentially easier to assess for cloud support because it’s more purely “data”.

There are two points of interface between the 5G CP and UP functions.  The first is the Access and Mobility Management Function (AMF) and the second is the Session Management Function (SMF).  These interface with the User Plane through three defined interfaces.  The User Plane in 5G is really a single dual-function element, the User Plane Function (UPF), which connects between the RAN and the data network.  This singular UPF has, in the 3GPP diagrams, a “side” facing the RAN and another facing the public data network (PDN).

Referencing the 3GPP diagram of the UPF, we see the Access Network (usually RAN) connecting to the data network (like the Internet) through a UPF that’s decomposed as an Access network element and a PDN session anchor element.  Obviously, these elements are likely distributed to the point where the terminations of their respective networks occur.  In an appliance/device view, they can be mapped to individual VMs or containers, as they are in the model shown in Intel’s material, which would imply a “UPF network” made up of hosted UPF instances, meaning virtual devices.  This same “virtual UPF instance” approach is widely referenced to illustrate selective traffic routing to different data networks or to different on-ramps of a data network.

The advantage of this literal approach to UPF is that if we presume that everyone implements a UPF as either a device or a virtual device, then the implementations could be interchangeable.  If we were to assume that the UPF was a network-level abstraction with multiple inputs from the (R)AN and multiple PDN outputs, then the implementation of that abstraction would be a potentially proprietary black box, and at the least would probably not be open to internal substitution of parts or vendors.

The question is whether, to avoid catching cold, we expose ourselves to a more serious disease.  If we want element-per-element substitution within our virtual UPF, then we must define an internal structure of elements, which means we now create that box network and turn a functional specification or picture into an implementation architecture.

It’s hard, given that we have few if any full-scale 5G UPF deployments, to say how serious our box disease would really be, but we can refer to an example I’ve used in the past to get a hint as to what we might be losing with it.  Google’s core network is SDN with a layer of BGP around it.  An SDN core with a layer of UPF interfaces around it could serve as a virtual UPF network, without dictating the structure of the interior.  This would open the UPF implementation not only to SDN, but also to any form of cloud-native hosting that could deliver the edge functionality.  It doesn’t say you can’t have virtual-device UPFs, even.

A virtual UPF network could attach to all ANs and all PDNs, do all the selective routing internally, and present a very simple vision—essentially one huge PDN—to the control plane.  We would be free to adopt any set of internal behaviors that would meet those external interfaces, which means we could do more to optimize cloud-native elements.  This, I think, is how cloud-centric players would do a 5G UPF.

If we were to pick something like SDN as the implementation, we could reference a standard SDN implementation, such as one based on OpenFlow with a central controller, to avoid the lock-in problem.  Further, a central controller could provide the linkage between the virtual UPF network and the control-plane elements (AMF and SMF).  This, I believe, would simplify the creation of a virtual control plane and the use of cloud-native elements.
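
To make that controller linkage concrete, here’s a minimal sketch, assuming a central controller inside the virtual UPF network that turns control-plane session events into path updates in the SDN core.  Every class and method name here is invented for illustration; 3GPP defines the actual SMF/UPF interaction (the N4 reference point), and nothing below should be read as that specification.

```python
# Illustrative sketch only: a central controller for a "virtual UPF network"
# that maps 5G control-plane session events onto SDN path updates, rather than
# onto per-box tunnels.  All names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Session:
    ue_id: str          # the user/device the session belongs to
    serving_cell: str   # current RAN attachment point
    data_network: str   # the PDN/DN the session anchors to

class VirtualUPFController:
    def __init__(self):
        self.paths = {}  # ue_id -> (ingress edge point, egress anchor point)

    def on_session_establish(self, s: Session):
        # The RAN attachment and target data network become edge points of the
        # virtual UPF; the controller programs a path between them in the core.
        ingress = f"edge@{s.serving_cell}"
        egress = f"anchor@{s.data_network}"
        self.paths[s.ue_id] = (ingress, egress)
        self._program_core(ingress, egress)

    def on_handover(self, s: Session, new_cell: str):
        # Mobility becomes a path update at the controller, not tunnel
        # re-anchoring at individual UPF instances.
        self.on_session_establish(Session(s.ue_id, new_cell, s.data_network))

    def _program_core(self, ingress: str, egress: str):
        print(f"program SDN core: {ingress} -> {egress}")

# Example: establish a session, then hand over to a new cell.
ctl = VirtualUPFController()
sess = Session("ue-1234", "cell-17", "internet")
ctl.on_session_establish(sess)
ctl.on_handover(sess, "cell-18")
```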

One reason to consider this is that with all the usual nomenclature changes we’ve come to expect from new standards, 5G UPF retains a lot of the concepts of EPC, including the tunneling.  SDN could manage session connections to the right cells without explicit tunnels, particularly if the mobility management control plane piece talked to a central SDN controller.  I argued even before 5G that we really needed to think about how to use SDN to implement a virtualization of the whole of EPC rather than using VNFs to replace appliances in the various EPC element roles.  I believe that’s even more true of 5G.  Why is it that we can’t seem to shake box-think, anyway?

I’m not necessarily against the notion of having an abstract black-box UPF that is implemented in the way that Intel and others (who do UPF-instance hosting) have described.  What I’m against is exposing the decomposed contents of the UPF when the black box itself isn’t being recognized.  Virtualization should always be abstraction and decomposition, and in the 5G Core UPF model, we’re seeing only the decomposition.  The NFV ISG went wrong by taking a functional architecture literally, because it foreclosed a true cloud solution.  The same could happen to 5G if we’re not careful.

Why the AT&T Airship Project is Important to Networks and the Cloud

There are some interesting things going on with Airship, an AT&T-promoted open-source project to deploy a virtualized hosting layer on top of bare metal.  For one thing, it’s getting support from a major server vendor, Dell.  For another, it’s closing a loophole in the overall vision of cloud hosting, and it’s addressing some issues that are particularly critical for carrier cloud.  Finally, it just might be showing that AT&T is getting serious about a cloud-centric vision of its own infrastructure evolution.

Virtualization in hosting, in simple terms, is the creation of an abstraction between real data centers and applications or hosted features, so that deployment and operations for software can be standardized across a mix of hardware options and locations.  This is how “the cloud” is built, but most of the focus of the cloud has been on how to build the applications and features.  The abstraction piece has been taken a bit for granted, other than some catchily named initiatives like “infrastructure as code”.  Airship was launched to fix that.

Airship is an open-source toolkit assembled from fairly popular pieces, including YAML, OpenStack Helm, Redfish, and Kubernetes.  With Airship, it’s possible to declare a kind of portable virtualization layer, and map it to infrastructure when data center resources are augmented.  For classic cloud data centers, where the hosting resources tend to be homogeneous and augmented incrementally by adding server racks, this isn’t a life-changing capability.  Where it makes a lot of sense is in its application to edge computing or an expanding carrier cloud.

If my models are right and carrier cloud represents a potential for more than a hundred thousand incremental data centers by 2030, it would be madness to expect the operators to pony up all the money for a single massive deployment.  What would happen is that data centers would be incrementally commissioned in areas where opportunities converged, at a pace that made business sense.  However, all this discrete commissioning has to add up to collectivized resources, which is where Airship can help.  Arbitrary edge data center plus Airship equals an added piece to carrier cloud.

There are two critical elements to the current Airship story.  One is Redfish, a universal RESTful API and data model set that’s designed to deliver open management and abstraction to all the hardware elements of a server farm.  Redfish lets a single set of management tools manage a variety of specialized hardware elements based on “class-based” management of things like disk drives or even specialty chips.  The other is the notion of Helm charts as descriptors for software tool deployment and integration.
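
For a sense of what “class-based” hardware management looks like in practice, here’s a minimal Python sketch that walks a Redfish service.  The endpoint and credentials are placeholders, and only a couple of properties are read; the DMTF Redfish schema is the authoritative reference for the resource model.

```python
# Illustrative Redfish walk: every conformant server exposes a RESTful tree
# rooted at /redfish/v1, so one set of tools can inventory mixed hardware.
# The host and credentials below are placeholders.
import requests

HOST = "https://bmc.example.net"          # hypothetical management controller
AUTH = ("admin", "password")              # placeholder credentials

systems = requests.get(f"{HOST}/redfish/v1/Systems", auth=AUTH, verify=False).json()
for member in systems.get("Members", []):
    system = requests.get(HOST + member["@odata.id"], auth=AUTH, verify=False).json()
    # The same properties answer the same questions regardless of vendor.
    print(system.get("Model"),
          system.get("MemorySummary", {}).get("TotalSystemMemoryGiB"))
```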

An Airship deployment links all the current cloud-native and container work, proceeding at what I’ve called a “dazzling pace”, to the necessarily earthbound data centers.  Integration of the application-and-feature abstractions of things like cloud-native components or even NFV’s virtual network functions (VNFs) is facilitated if there’s a uniform set of properties defined for the abstract resources on which things run.  Functionality is delivered from the top, but to realize it the resources at the bottom have to be organized into a common framework.  NFV failed to do that in its specifications, and the result has been major integration (“on-boarding”) problems.

It’s the Helm charts stuff that sets Airship apart from what might be called “pod” or “abstract pod” strategies that create a pure hardware-layer abstraction.  One thing the cloud work of the open-source community has shown us, particularly in the container, Kubernetes, and cloud-native space, is that the whole cloud-native software ecosystem is evolving almost faster than it can be documented.  The cloud-native applications and VNFs don’t deploy onto hardware, but onto virtual hardware and through a set of changing middleware tools.  Getting those tools deployed on the hardware is as necessary as getting the hardware abstracted.  Helm is used not only to deploy services, applications, or features, but also to deploy Airship itself and any necessary middleware.  With Helm charts, Airship can deploy a complete platform for applications and features.

I think the most useful thing about Airship’s evolution, from a carrier cloud perspective, is its solidification of the concept of a platform on which services/features are run.  In the age of the cloud, you don’t talk about “hosting” to mean “running on servers”, because there are a lot of pieces that have to be in place above the server, in order for things to run.  The thing that carrier cloud needs is not the standardization of servers, or hardware, or even just “operations tools” but that whole platform.  Otherwise, deployment of services will involve special integration, and deployment of new data centers will too.

The most problematic thing here is coming up with the makeup of that platform.  Remember that Airship permits unified platform deployment, but it’s not a platform in itself.  There are still a lot of moving parts that will have to be addressed, and as the concepts of cloud-native mature the number of these parts is going to explode.  Operators may want to “collectivize” their vision of platform contents, but with early service priorities highly variable among operators, that will be difficult to do.

A close second, problem-wise, is describing the software environment that the platform creates, as a set of APIs that applications, services, and features can then rely on and be written for.  Remember some of my early blogs, when I said that NFV really needed a form of platform-as-a-service API set?  That’s true of anything that’s expected to run on a portable platform environment.  You need to present a single, standard, model for application or virtual function integration, and until that happens you don’t resolve any of the on-boarding problems.

The abstraction centerpiece of virtualization has to adapt in two directions.  It first has to adapt the hosting environment to present itself as a single, common, set of capabilities, lifecycle-managed in a uniform way.  Then it has to present that common set of capabilities through a set of well-known and stable APIs so virtual functions and components can use them.  Only then can those functions and components be altered to conform, and only then does a standard hosting platform map to standardized feature and service deployment.
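
A hedged sketch of those two directions, using invented names rather than any proposed standard: a resource-side adapter that normalizes each hosting environment, and a single service-facing API that functions and applications are written against.

```python
# Illustration of the two-direction abstraction described above; the interface
# names are hypothetical, not an actual carrier-cloud platform specification.
from abc import ABC, abstractmethod

class ResourceAdapter(ABC):
    """Downward: make each data center or cloud look like common capabilities."""
    @abstractmethod
    def deploy(self, image: str, resources: dict) -> str: ...
    @abstractmethod
    def health(self, instance_id: str) -> str: ...

class PlatformAPI:
    """Upward: the single, stable surface that services and features code against."""
    def __init__(self, adapter: ResourceAdapter):
        self.adapter = adapter

    def run_feature(self, image: str, cpu: int = 1, mem_gb: int = 2) -> str:
        # Whatever sits below (bare metal via Airship, a public cloud, etc.),
        # the caller sees the same call and the same lifecycle behavior.
        return self.adapter.deploy(image, {"cpu": cpu, "mem_gb": mem_gb})

class BareMetalAdapter(ResourceAdapter):
    def deploy(self, image: str, resources: dict) -> str:
        return f"instance-of-{image}-on-bare-metal"
    def health(self, instance_id: str) -> str:
        return "healthy"

platform = PlatformAPI(BareMetalAdapter())
print(platform.run_feature("vnf-firewall:1.0"))
```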

AT&T could do some things to drive this process along.  Obviously, they could work to define that standard service-facing API set.  They could also work to make ONAP conform to the Airship platform framework, but in my view, ONAP has scalability issues that should be addressed first.  In any event, ONAP is already looking at Airship.  AT&T could even promote a rational view in the Common NFVI Telco Taskforce (CNTT), where they’re a member.  So far, I’m not hearing much about a CNTT/Airship connection, except perhaps via ONAP.

Another interesting question is whether the Airship concept could spread.  All cloud computing, but in particular hybrid cloud computing, has to deal with the issue of deploying a consistent platform on various data center resources, and it would be helpful if the service-facing aspect of that platform could deploy on the public cloud to create better application portability.  This could be a rightful mission for Airship given that deploying public cloud services on carrier cloud infrastructure is a mission most operators are at least considering.

All this interest comes with some concerns, obviously.  One is whether AT&T, having been the driver behind the ONAP design that lacks fundamental cloud-readiness, can take a positive cloud step.  The difference here is that Airship is assembled from cloud-community technology, where ONAP was an AT&T development project from the first.  I think that difference is more than enough to ensure Airship isn’t falling into the ONAP monolithic trap.

The second question is whether those necessary steps I outlined above will actually be taken.  Having a way to do something doesn’t guarantee it gets done.  There’s a lot of careful cloud-centric planning associated with the kind of future Airship enables.  AT&T and other operators obviously have a mixture of cloud-thinking and cloud-opaque people, and we don’t know which group will contribute the most to the actual deployment of carrier cloud infrastructure.  Remember that Airship is a loosely coupled toolkit, and “loosely coupled” means “human-integrated”.  If the cloud-opaque thinkers get the nod, those humans may find they’re not able to play their role.

Could Open Source Break the Negative IT Cycle?

Are we in for tech trouble?  Cisco’s quarter ended last week, and while their numbers were decent, their guidance disappointed.  Enterprise orders fell, and the service provider space continues to be weak.  Generally, Cisco has been a bright spot in the network and tech space, so we have to ask whether those spaces are at risk overall.  Enterprises and operators I’ve chatted with over the last month have contributed some views, and I’ll share and analyze them here.

The first point both groups made with me is that global economic concerns are rising, which tends to put a damper on capital programs even before there’s a convincing level of risk present.  The general view of both enterprises and operators is that trade tensions and geopolitical issues, including Brexit, are going to dampen growth.  The question is whether there will be any compensating positives.  In the US and EU, it seems likely that some form of economic stimulus will be applied, but that may not be enough.

Buyers overall believe that trade problems, in which category many include Brexit because of its impact on EU/UK trade, will always hurt tech sales, and that economic stimulus can’t make up for it completely because it can’t address the uncertainties that underlie the trade issues.  We have not had a global trade war, or currency war, in decades, and we thus have no relevant experience with how either/both would impact the tech industry or the economy overall.  Because buyers believe this, it becomes true to an extent by default.

The second point is also made by both groups, but in slightly different form.  Both enterprise technology and service provider infrastructure are having increased problems meeting corporate ROI expectations.  The providers express this as the “profit per bit” problem, and it’s now almost a decade old.  Enterprises are mixed in how they see this issue, and even the mixture is an interesting data point.

The majority of enterprises say that new IT investment to lower overall cost, meaning things like virtualization, is running out of gas.  Virtually all buyers focus early attention on the projects with the best business cases, the “low apples”.  As those are picked, the ROI on the remaining projects averages lower and lower, making it more difficult to get them approved.  This is true of both networking and IT, and it’s probably responsible for the decline in enterprise orders that Cisco reported.

A smaller percentage of enterprises say that the specific problem is with “the cloud”.  The simplistic model of cloud migration to reduce costs has not worked for the average project or business.  They now recognize that they have to think about applications, and application design and development, very differently to optimize cloud value, but they don’t know how to do that yet.

This second concern is the number one technical issue service providers cite, though frankly I wonder whether they’re just latching on to something to explain their problems.  According to the providers, they believed in the NFV Call for Action model, meaning the substitution of hosted features and off-the-shelf servers for proprietary devices.  They recognize, overall, that the initiatives aimed at doing this haven’t succeeded, and they accept (overall) that a successful transformation of this sort will require “cloud-native”.  Overall, they don’t know what that means, and so they’re stuck with no real projects to execute on.

The smallest group of enterprises say the problem isn’t cost as much as benefit, and that’s also the second-most-held position among operators.  ROI can be maximized by lowering the “I” or raising the “R”, which for enterprises would mean more projects to enhance productivity, and for providers, projects that would raise service revenues.  Both groups express the same doubt that they have the skills needed to identify things that would boost returns, or to execute projects to deploy them if they could be recognized.

“I know we really need to move up the value chain,” one operator market planner told me, “but that’s just not how we think.”  An enterprise architect said “We think of our business processes in terms of our IT capabilities, instead of planning new IT capabilities to shape how we do business.”  And these problems are becoming entrenched.  Service providers last considered above-the-connection services seriously in the early 1990s according to my surveys, and enterprises ran out of productivity insights just a decade later.  Neither group has recovered the initiative.

The third point that both enterprises and operators raise is the consumerization of tech.  Thirty years ago, business service revenues and consumer revenues were virtually equal parts of operator revenues.  Today, consumer revenues dominate, and consumer services set the infrastructure priorities overall.  The big successes in tech sales to non-operators are vendors who sell to consumers—think Apple and Microsoft and how their revenues divide.

You can see the impact of this in the cloud.  Insiders tell me that the majority of cloud spending today is actually directed at consumer empowerment not worker empowerment.  Social media and content startups have long dominated the cloud space, and only recently has enterprise use of the cloud started to show signs of life.  It’s not that the mass market is bad, but that the mass market doesn’t typically value “good” technology as much as good marketing.

Lurking within this point, of course, is the ad sponsorship trap.  As long as consumers resist paying directly for services, operators have little chance of breaking into that heady up-the-value-chain space.  They’re far behind in ad sponsorship, too far to catch up.  But ad sponsorship, even for the OTT successes, has its limits.  Total global ad spend on all channels and media is only about seven hundred billion dollars per year.  My models say that the revenue potential for cloud-based pay-for services is about triple that, but how do you promote something people have to buy when they want it all for nothing?

Consumers value price leadership highest, then “social impact”, and finally features or technology.  You can see a bit of this in the fact that Walmart was the big retail success of the quarter, and it’s clearly a price/value play.  But as understandable as the consumer priorities might be, and as easily proved as they are, they clearly impact technology products.  Further, the enterprise or service provider buyer is a consumer, like all the rest of us, and influenced by consumer tech attitudes.

All these are valid points, but the one both groups missed, and which is at least as valid, is the short-term thinking of vendors combined with user reliance on vendor initiatives.  If you look realistically at these points, what you see is a signal that we’ve had a revolutionary shift in how the tech market works, and we need a revolutionary response.  But revolutions are scary to buyers, and vendors don’t want to overhang their current profits to attempt to promote one.  At the same time, both enterprises and service providers admit that they do technology planning based primarily on vendor stimulation.  They rationalize it by saying that it makes no sense to build technology plans around a hypothetical product set; real products are needed, and only vendors can offer them.

Cisco commented on this particular point, in an indirect way, when it cited the fact that 5G wasn’t generating new vendor opportunity in the near term.  This reflects the fact that for vendors looking for service provider opportunity, all traditional methods of cost and revenue enhancement have failed to change the ROI equation, so the vendors are relying on standards-driven transformations.  We can’t make 5G look good, so we want to make it look mandatory, and that’s not working.

On the plus side, Cisco also cited “communications and collaboration”, and in particular “cognitive collaboration” as part of their desire to “accelerate the future of work”.  Might that be an example of a vendor trying to address the future and the evolution to reach it?  Sure, but it could just as easily be an example of Cisco’s well-known marketing hype.

Buyers need to take more control, or at least look like they’re about to do that.  Vendors could, and many do, run futuristic projects behind the curtains to avoid contaminating current sales.  However, they could tie the two in to create differentiation, providing that the process of educating the buyer to understand the future and how to evolve to it wasn’t prohibitively difficult.

Open-source projects could be one solution to this.  Think of open-source as “tech nerds do innovation”, which enterprises and service providers alike find a bit uncomfortable.  However, there are a lot of open-source projects, and despite their nerdy roots, they’ve recently accomplished more than proprietary planning and development has.  The question is whether they can make a more direct and solid business connection, delivering high-level benefits and making CxOs understand how they’ll do that.  So far, open-source has been great for the architecture-level stuff but has struggled with the true application layer.  If buyers can figure out how to get that benefit connection to work, then open-source might break the benefit stalemate…finally.

Cloud ROUTING versus Hosted Router Instances

I mentioned data plane feature hosting in my last blog, noting that we needed to spend some time looking at the connection-service elements and how we’d propose to make them candidates for hosted and cloud-native implementation.  I propose to start the ball rolling with this blog, and to do that we have to look at least a bit at the way we do connection services today.

Connection services are currently implemented using devices, and because most connection services are IP-based, those devices are usually routers.  A router network is a community that cooperates to learn its own structure and connectivity and deliver traffic among the endpoints of every device in the community.  The traffic is the data plane in our discussion, and the cooperation that takes place within the community is mediated by the control plane.

The union of these planes, the center of routing, is the routing or forwarding table.  This is a table in each router, containing addresses and masks that identify a set of packets and associate that set with a destination port.  A route from source to destination is created by the sum of the forwarding tables of the devices.  Router A sends packets of a given type out on port 3, which gets them to Router D, whose table routes them out on port 11 to Router M, and so forth.  Forwarding tables are created using routing protocols, part of our control plane, which advertise reachability in a sort-of-Name-That-Tune way: “I can reach Charlie in THREE hops!” in the simple case, or based on more complex metrics for modern protocols like OSPF.  Each device will keep its best routes to destinations in its forwarding table.
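
As a concrete (and deliberately simplified) illustration of a forwarding table, here’s a short Python sketch that does a longest-prefix-match lookup; the prefixes and port names are invented.

```python
# A toy forwarding table: each entry maps a prefix (address plus mask) to an
# output port, and forwarding picks the longest (most specific) matching prefix.
import ipaddress

forwarding_table = {
    ipaddress.ip_network("10.0.0.0/8"):  "port3",    # e.g. "toward Router D"
    ipaddress.ip_network("10.1.0.0/16"): "port11",   # more specific route wins
    ipaddress.ip_network("0.0.0.0/0"):   "port1",    # default route
}

def forward(destination: str) -> str:
    addr = ipaddress.ip_address(destination)
    matches = [net for net in forwarding_table if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)  # longest-prefix match
    return forwarding_table[best]

print(forward("10.1.2.3"))   # -> port11
print(forward("192.0.2.9"))  # -> port1 (default route)
```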

A data plane, then, is the collection of routes created by the sum of the forwarding tables.  In traditional router networks, the forwarding tables are created by the adaptive control-plane routing-protocol processes.  It’s been proposed that in OpenFlow SDN, the tables would be provided or updated by a central control agency, and adaptive routing-protocol exchanges would not happen.

There are about four dozen control-plane packet types (look up the ICMP or Internet Control Message Protocol for the list), and they divide into what could be called “status” packets and “route” packets.  A hosted functional instance of “routing” would have to pass the data plane based on forwarding table entries and do something with the ICMP packets, the control plane.  Let’s go back to OpenFlow for a moment and see how that works by examining the extreme case.  We’ll focus on the “configured forwarding” model of OpenFlow rather than on the “adaptive” model, but it would work either way.

In an OpenFlow device, the central control element will have loaded a forwarding table based on network topology and traffic policies, and then kept it current with conditions.  There would be no need to have control packet exchanges within an OpenFlow network, but you’d probably need them at the edge, where conventional IP connectivity and devices would be or could be connected.  Thus, we could envision a hosted OpenFlow device instance to consist of the forwarding processes of the data plane, the management exchanges with the central controller, and a set of control plane proxies that would generate or respond to ICMP/routing protocols as needed, based presumably on central controller information.  This approach is consistent with how Google, for example, uses SDN within its network core, where it emulates BGP at the core’s edge.
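
Here’s a minimal sketch of that hosted edge instance, assuming the structure just described: a controller-supplied forwarding table for the data plane and a local proxy for control-plane messages.  The message formats and method names are invented for illustration, not taken from OpenFlow or any real controller.

```python
# Illustrative edge instance: data-plane forwarding from a table pushed by a
# central controller, plus a local proxy for control-plane (ICMP-style) traffic
# so conventional IP neighbors at the edge see the behavior they expect.
class EdgeInstance:
    def __init__(self):
        self.flow_table = {}            # destination prefix string -> output port
        self.controller_events = []     # messages handed up to the controller

    def install_flows(self, entries: dict):
        """Called over the management channel by the central controller."""
        self.flow_table.update(entries)

    def handle(self, pkt: dict):
        if pkt["type"] == "icmp_echo":
            # Control-plane proxy: answer locally instead of passing ICMP
            # into the OpenFlow core.
            return {"type": "icmp_echo_reply", "dst": pkt["src"]}
        if pkt["type"] == "route_advert":
            # Routing-protocol chatter goes to the controller, which owns topology.
            self.controller_events.append(pkt)
            return None
        # Data plane: forward per the controller-installed table (exact match
        # here for brevity; a real table would use longest-prefix matching).
        port = self.flow_table.get(pkt["dst_prefix"], "drop")
        return {"forward_on": port, "packet": pkt}

edge = EdgeInstance()
edge.install_flows({"10.1.0.0/16": "core-port-2"})
print(edge.handle({"type": "data", "dst_prefix": "10.1.0.0/16"}))
print(edge.handle({"type": "icmp_echo", "src": "192.0.2.1"}))
```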

We can see from all of this that our “software-hosted router network” could be visualized in two ways.  First, it could be visualized as a “network of software-hosted routers”, each of which looked, from the outside in, exactly the way a real router would look.  This would be an abstract router model.  It could also be visualized as a “software-implemented router network” which, from the outside, looked not like a router but like a router network.  This is a distinction of critical importance when you look at how software-hosted, and especially cloud-native, technology would apply to data-plane connection services.

If we go back to the initial NFV concept (the “Call for Action” white paper in the fall of 2012), we see that the goal was to reduce capex by substituting software instances hosted on commercial servers for proprietary devices.  Within a year, the operators who authored that paper had largely agreed that this would not make enough difference in capex to be worth the effort.  Thus, I contend that the abstract router model of data plane connection service implementation is not going to offer enough benefits to justify the effort.  We have to look at the abstract router network model instead.

But what the heck is that?  As it happens, we have a general model of this approach that goes back decades, in the IETF RFC 2332 or Next Hop Resolution Protocol (NHRP).  This RFC posits a “Non-Broadcast Multi-Access” or NBMA network, surrounded by a ring of devices that adapt that network to “look” like a router network.  You could implement that NBMA network using a connection-oriented protocol (frame relay, ATM, or even ISDN data calls), or you could implement it using any arbitrary protocol of the type we’d today call “SDN”, including OpenFlow.

The data-plane procedure here is simple.  A packet arrives at our boundary ring, addressed to a destination somewhere on another ring device.  The arrival ring device looks up the destination and does the stuff necessary to get the packet to the ring device on which the destination is connected.  That device forwards the packet onward in an IP-centric way.  The ring devices are thus proxies for the control plane protocols and adapters for the data-plane connectivity.  This, I submit, is the right model for implementing hosted routing rather than hosted routers.
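
A minimal sketch of that ingress/egress procedure, with invented names: the ingress ring device resolves the destination to the egress ring device (the NHRP-style step) and hands the packet to whatever forwarding technology the core uses, tagged for that egress point.

```python
# Illustrative ring-device behavior around an NBMA-style core.  The resolution
# table and device names are hypothetical.
import ipaddress

# Which ring device fronts which destination prefix.
resolution_table = {
    ipaddress.ip_network("198.51.100.0/24"): "ring-east",
    ipaddress.ip_network("203.0.113.0/24"):  "ring-west",
}

def ingress(dst: str, payload: bytes) -> dict:
    addr = ipaddress.ip_address(dst)
    egress_device = next(dev for net, dev in resolution_table.items() if addr in net)
    # The core never sees IP routing; it just moves the frame to the named
    # egress ring device over whatever technology it uses (OpenFlow, optics...).
    return {"core_destination": egress_device, "payload": payload, "ip_dst": dst}

def egress(frame: dict) -> tuple:
    # The egress ring device strips the core encapsulation and forwards the
    # packet onward in the normal IP-centric way.
    return frame["ip_dst"], frame["payload"]

frame = ingress("198.51.100.7", b"hello")
print(frame["core_destination"])   # -> ring-east
print(egress(frame))
```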

What Google did, and what NHRP proposed, was just that, a model for a software-hosted router network, not for software-hosted routers.  With it, we’ve broken the device-specific model of IP networking into two pieces—the NBMA core where any good forwarding technology is fine, and an adapting ring of functionality that spoofs IP networks in the outward direction to match connection requirements, but matches the traffic to the NBMA requirements.

One obvious feature of this approach is that ring on/off-ramp technology is where we have to look to combine control- and data-plane behavior.  Inside the NBMA, we can probably deal only with data-plane handling.  Another feature is that since the NBMA’s technology is arbitrary, and any place we can put an on/off-ramp is a good place, we could assume that we’d give every user such a ramp point.  That would mean their control-plane relationships could be supported by a dedicated software element.  It could theoretically be scalable too, but it may not be necessary to scale the control-plane element if it’s offered on a per-user basis.

We could imagine our NBMA as being a geographic hierarchy, a network that picks up users near the access edge and delivers traffic to other access edge points.  It might have an edge/metro/core hierarchy, but as the traffic between edge points or metro points increases, we’d reach a point where it made sense to create a direct (in packet-optical terms) path between those higher-traffic points.  I’m envisioning this level of connectivity as being created by agile optics and SDN-like technology.  The hosted elements, in my view, would be between metro and edge, where traffic didn’t justify dedicated facilities and instead required aggregation for efficient use of transport.

This model creates a kind of symbiosis between SDN to define semi-permanent node points and routes, and hosted ring-element functionality to serve as the on/off-ramp technology.  Since the latter would handle either a single user (business) or a small access community (residential), the demands for data-plane bandwidth wouldn’t be excessive, and in any case some of the data-plane work could be done using an SDN element in each edge office (where access terminations are found).

We end up with what’s probably a metro-and-core network made up of fairly fixed optical/SDN elements, an edge network made up of hosted ring-element instances, and an edge-to-metro network that might include a mixture of the two technologies, even a mixture of fixed and instance-based elements.  This is what I think a connection service of the future might look like.  It wouldn’t change optical deployment, except perhaps to encourage an increase in capacity and agility.  It would promote electrical-layer grooming via SDN, likely OpenFlow.  It would permit an overlay network, in Ethernet, IP, or SD-WAN/SDN form.

To me, the decisive point here is that the deeper you go into a network, the more likely it is that the elements of the network don’t require repositioning, only resiliency.  If you want to create a more reliable metro/core, you do that with multiple paths, meaning more nodes, not by creating scalable nodes.  That’s because the optical paths between major locations are not “agile” or subject to software virtualization; you need facilities where trunks terminate.  Agility, and cloud-native, belong with the control plane and closer to the edge, where variability in activity and traffic might indeed justify virtualization of the node points.

Just because this is what I believe doesn’t mean it’s always true, of course.  Metaswitch, a company whose work in the IMS space (Project Clearwater, in the old days) I know and respect, has a different view of cloud-native VNFs and hosted 5G core.  They published an excellent white paper, available HERE, that talks about cloud-native development in NFV.  I agree with the principles they articulate, and I’ll do a later blog on their paper and approach, but I believe that they’re talking about a software-based router, not a software-abstracted router network, which is where I think we have to be.

The biggest challenge in virtualization lies in what you elect to virtualize.  Virtualizing network elements means building a conventional network of virtual elements.  Virtualizing a network means building a new model of networking.  Only the latter of these approaches qualifies as a revolution, and only a revolution will really optimize what the cloud can bring to networking.

Where is Cloud-Native NOT a Good Idea?

Cloud-native technology is important to everyone, and critical to many, but there’s already a trend toward seeing cloud-native as sweeping everything else from the tech world.  I think it’s certain that every enterprise will end up adopting cloud-native applications or application components, and that at least three quarters of all applications will have cloud-native elements, but there is no universal constant (that number which, when multiplied by your answer, yields the correct answer) and there’s no universal development paradigm either.

Are there things for which cloud-native is absolutely not the right answer?  Yes.  Are there things that most believe to be universal cloud-native benefits, and aren’t?  Yes.  Are there myths about cloud-native that, even where it’s a good fit, could still lead companies astray?  There sure are, and we’ll look at all these points in this blog.

To briefly reprise a point for those who haven’t followed this topic, in my blog or elsewhere, “cloud-native” is a model of application development where logic is divided into small “microservices” that are individually resilient and scalable.  This model, combined with cloud hosting, lets an application heal itself and adapt to changing workloads or even feature requirements.  Because the microservices are loosely coupled and written based on abstraction-centric “intent-modeled” design, they can be changed easily to accommodate new requirements and the changes can be introduced quickly and without major production impacts.

This definition is important because it opens a discussion on the issues that my earlier questions are intended to raise.  Thus, let’s look at the definition and apply it to the real world of application development.

You can envision a cloud-native application as a swirling set of logic elements that appear, grow, shrink, and are replaced as needed.  Work, from the user perspective, is accomplished by enlisting a bunch of these logic elements to complete a task.  There’s almost always an “orchestration” or “step function” process involved that sequences the cloud-native microservices, stringing them out in what we’d call a “workflow”.  From this vision alone, you can infer one of the truths of cloud-native, which is that cloud-native is a framework for interactive applications or application elements, elements that are essentially handling units of work that are often called “events”.  Cloud-native is not batch processing.
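
As a rough illustration of that workflow idea, here’s a minimal Python sketch of a step-function-style orchestrator.  The microservice names (validate_order and so on) are invented for the example; in a real deployment each step would be a separately hosted, network-invoked, independently scalable service rather than a local function.

    # Toy orchestrator: each "microservice" is a function that takes and returns a work
    # item, and the orchestrator strings them into a workflow for each incoming event.
    def validate_order(item):
        item["valid"] = bool(item.get("sku"))
        return item

    def price_order(item):
        item["price"] = 10.0 if item["valid"] else None
        return item

    def notify_customer(item):
        item["notified"] = True
        return item

    WORKFLOW = [validate_order, price_order, notify_customer]

    def handle_event(event):
        # Sequence the steps; each step is independently replaceable and scalable.
        item = dict(event)
        for step in WORKFLOW:
            item = step(item)
        return item

    print(handle_event({"sku": "ABC-123"}))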

An interactive application is one that has a close relationship with the behavior of humans or external systems.  Somebody hits a key, and (eventually) expects to get a response.  Somebody presses a button and (eventually) expects a gate to open.  The parenthetical here is offered to illustrate that the reaction is expected but not expected instantaneously.  Human think time is such that there is a tolerance for delay, and there’s a natural pace to the external system (how fast will a human move to the next task, or how quickly will the next truck present itself at the gate?) that means that delays in the processing, up to a point, won’t impact the application overall.

Contrast this with a batch application.  Here the data is already there, already stored, and the application gets a data element (a “record”) when it’s ready, processes it, perhaps outputs it, and then goes back to get another.  The difference is that with this batch model, any delay that’s introduced in the input-process-output sequence will accumulate, since it will lengthen the time between successive readings of the inputs, and that will lengthen the runtime of the application.

If I break a batch application into multiple network-connected microservices, I’ll add a transit delay for each of the connections, and those delays will accumulate.  Add ten milliseconds of network delay, and over a million records I accumulate ten thousand seconds of delay, or about 167 minutes.  Yet ten milliseconds of network delay would almost surely be unnoticeable in an interactive application.
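
The arithmetic behind that 167 minutes is worth spelling out, because it’s the whole argument against batch-style decomposition in a nutshell:

    # Delay accumulation in a batch workflow: one added network hop per record.
    per_hop_delay_s = 0.010          # ten milliseconds of network delay per record
    records = 1_000_000
    added_delay_s = per_hop_delay_s * records
    print(added_delay_s)             # 10000.0 seconds of accumulated runtime
    print(added_delay_s / 60)        # roughly 166.7 minutes added to the batch run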

Right here, then, we have an example of things we shouldn’t be looking at as “cloud-native” candidates.  Anything that works on stored data is almost surely not a cloud-native application candidate.  That doesn’t mean it couldn’t be broken up into components, only that the componentization should be limited to reduce the workflow-connection latency introduced by the process.  That would mean bigger microservices, which is at least a bit oxymoronic.  Not all componentized applications are microservice applications, and not all are cloud-native candidates.

A microservice is a small, scalable, unit of processing.  That means it’s probably made up of simple logic, not highly complex iterative calculations or something.  Given that, cloud-native applications have simple logic components that can be scaled, not so much to support one user’s needs but to support the collective needs of vast numbers of users.  It’s volume, not complexity of logic, that characterizes them.  That’s why interactive/event-based applications are good candidates.

Even interactive or event-driven applications aren’t always ideal candidates for a pure cloud-native implementation.  One key question is just how work is steered through microservices.  An ideal cloud-native application is one where the application can be represented as a system with finite states (the so-called “finite state machine”), and where each possible event, in each of the states, identifies a specific process for handling.  In this model, the system’s data model and state/event table orchestrate the event-to-process relationships.

Where the path of work depends on the work previously done, meaning the results of earlier microservices in the workflow, it may be necessary to visualize the application as a “main portion” that’s at least somewhat conventional or monolithic in structure, and which then invokes the necessary microservices as they’re identified.  This is a model that could well come about if you translated traditional transaction processing into a microservice-and-cloud-native form.

If we apply this to the service provider world, what you end up with is the realization that there are three different kinds of “processing” activity that could be expected.  We have data-plane connection-service activity, we have control-plane activity related to things like mobility management, and we have management activity that’s really related to the software that’s automating the service processes overall.

Data-plane connectivity is probably not something easily translated to a cloud-native form.  Distributing instances of a data-plane process dynamically to handle load requires dealing with packet sequencing and distributed state control, to ensure that you don’t end up mixing threads of different conversations (sometimes called “tail-ending”).  Some of the distributed-state problem relates to the handling of control packets, which, while suitable for cloud-native microservice handling, may present complications if the control packets have to impact data flows.

Control-packet handling that relates to things like mobility management is an ideal cloud-native application.  An individual user generates a very limited volume of these packets, and the processing needed per user or packet is limited.  Load is created not by process complexity but by sheer volume, and so we’re dealing with simple, highly scalable processes.  A perfect cloud-native model.

Management processes are, in my view, the big opportunity for cloud-native handling.  Service lifecycle automation is (or should be) an inherent state/event process.  Services have a finite number of states through which they cycle during operation, and a finite number of events that represent conditions to be handled.  Since lifecycle automation is logically the first thing that happens with a service, before we have any data plane or control plane interactions, it would make sense to start cloud-native with lifecycle automation.  Which, obviously, we’ve not done.

This little sequence from data plane to management plane also illustrates a point I made in a prior blog, which is that we need to spend a bit of time looking at just what (if anything) cloud-native could do for the data-plane features of network services.  Remember, everything doesn’t have to be cloud-native, but everything should be considered a candidate.  We don’t need or want to force-feed cloud-native into the data plane, but we don’t want to miss an opportunity to rethink how connection services are handled via hosted functions either.

VMware’s Plan to Own the Telco Cloud: Workable?

Does VMware have a plan to transform the carrier cloud?  It’s far from an idle question, especially now that IBM has acquired Red Hat.  Cloud computing is transforming under the stimulus of “cloud-native” thinking and planning, and there’s no vertical market where that’s needed more than the carrier cloud space.  Network operators have been locked in the dark ages of cloud thinking with NFV, and they’ve failed to really exploit any of the other half-dozen drivers of carrier cloud.  Could VMware have a plan?  Light Reading did a nice piece on their recent M&A, and it’s worth taking a deeper look at the topic overall.

The thing we call “carrier cloud” is really a combination of a single truth, six demand drivers, and a rather fuzzy set of technical visions.  The single truth is that connectivity services cannot possibly sustain operator profit goals over time, which means that operators need to climb the value chain by providing more than just bit-pushing.  The six demand drivers are virtual CPE, video/advertising features, 5G transformation, operator cloud computing services, contextual services, and IoT.  The fuzzy technical visions are…well…fuzzy.

Most operators realize that they need to exploit hosting features and functions better.  One mission, defined by NFV, is the replacement of legacy appliances/devices by hosted function instances—virtual network functions or VNFs.  The other missions, defined by the other drivers, are related to providing non-connection services, what we customarily call “higher-level” services.  Most operators had hoped that NFV would rush onto their scene, reducing costs by replacing expensive devices with commodity servers, and build out a pool of hosting resources that could then be exploited (via NFV technology) by the other higher-level demand drivers.  That was the plan.

The problem is that NFV hasn’t been particularly impactful as a driver of carrier cloud.  In fact, some operators (see HERE) doubt that NFV’s approach is the right one, and as that same article and many of my blogs show, so do I.  Since operators had generally seen the NFV work as driving both near-term appliance replacement and long-term higher-level services, that puts the whole carrier-cloud-transformation scheme at risk.

I’m getting into this background because it’s relevant to VMware’s carrier-cloud aspirations.  If carrier cloud equals NFV and NFV equals a market bust, then VMware has to pursue carrier cloud in such a way as to accommodate that truth.  If it doesn’t, then the failure, or even the slow-roll, of NFV could be a disaster.  That’s particularly true given that VMware competes with IBM/Red Hat, who also have their eye on the carrier-cloud space, and may move quickly once they get themselves organized.

What, then, is VMware doing?  Most obviously, they’ve put together a group responsible for the carrier cloud/service provider market.  They offer service providers an opportunity to resell VMware cloud technology, of course.  Most recently, as the first Light Reading piece I cited shows, they’ve done some M&A that seems directly aimed at operators, and perhaps in particular at the cloud-native aspects of carrier cloud.

Avi Networks is an application delivery controller (ADC) vendor, which in a practical sense means they’re a load-balancer.  One of the critical pieces of carrier cloud is a service mesh, and load balancing is central to service mesh technology.  If VMware wants to support true cloud-native development, then Avi is a critical part of that picture.  In addition, it’s already integrated with all the major cloud providers, which is important because it would let public cloud services offered from the carrier cloud play a role in multi-cloud enterprise applications.

The Bitfusion deal is something I mentioned in passing when it was first announced.  Bitfusion virtualizes specialized processor resources, like GPUs and even FPGAs.  There are a lot of applications for these specialized processors, and they could surely play a critical role in virtualizing data-plane functions in a carrier cloud.  I think VMware will be extending the original Bitfusion approach, and also its target markets.

Uhana is the most carrier-specific of the acquisitions, one VMware specifically aligns to their Telco Cloud and Edge Cloud activities.  They use AI to analyze telemetry from all parts of a mobile network in order to determine optimization practices appropriate to changes in the overall subscriber distribution, traffic distribution, trends, and so forth.  Uhana is also a powerful stream processor, capable of analyzing traffic flows in real time.  This could be critical for things like real-time massive multiplayer gaming and even IoT, but it could also be put into service as part of any event processing application, and even for service lifecycle automation.

Only Uhana of the three is explicitly put into the carrier cloud space, but all three would have obvious applications there.  However, I don’t see the three as indicating a specific and coordinated carrier cloud push, as much as I see them as a specific and coordinated counterpunch to the IBM/Red Hat thing.  It’s very clear that IBM is preparing for a big cloud-native push, and where carrier cloud comes into the picture is that it’s the largest prospective source of new cloud data centers for the next decade…if it happens.

Let’s dissect that statement.  “Cloud-native push”, meaning that both IBM and VMware understand that the big push in software is cloud-native development and deployment.  It’s a step of such magnitude that enterprises are less likely to try to assemble their own toolkit for it when a vendor steps up to offer a complete and supported solution.  “If it happens”, of course, reflects the uncertainty that the carrier cloud will ever live up to its potential.  You can’t ignore a potential hundred thousand incremental data centers, but you can’t bet on a vertical market that’s disappointed more technology sellers more times than perhaps all the other verticals combined.

That’s the dilemma that VMware and its competitors all face.  Cloud-native is becoming more than a slogan, thanks to the pace of advance in the tools that relate to microservice-centric development and containerized deployment.  However, the service provider space has hurtled down the wrong path on virtually everything it’s undertaken in the cloud area, which means that it’s been diverging from the technologies it will depend on in the long term.  Can vendors present the truth to the telcos now, and hope they’ll see it, or must they blow kisses at NFV and ONAP and hope the telcos will come to their senses on their own, in their own good time?

Logically, the answer is “do both”.  Telco, or carrier, cloud depends on getting the right answer soon enough to be able to apply it.  Waiting for the earth to take a couple of whirls won’t cut it, but neither will sticking the telcos’ collective noses in the excrement they’ve created for themselves while hoping for their repentance.  An evolutionary model, something that redeems at least some of the vision of the current path while getting things on the right path, is essential.  That’s where VMware would have to apply its planning and thinking.

Open-source software has been an astounding technical success, but in a promotional sense it’s been appealing more to technicians than to senior management.  In my own surveys and contacts with enterprise management, I’ve found few who can articulate the value proposition of cloud-native, much less explain the technical features.  In the service provider space it’s even worse, in part because cloud-native is fighting a culture that’s perhaps more entrenched than any tech culture in the market.  As I’ve said in the past, we don’t even have the words to explain cloud-native features and benefits, or the cloud-native application model.  VMware, or a competitor, is going to have to provide them, and more.

The current M&A is a good thing.  It makes sense to build out a solution you can sell before you start stimulating evolutionary buying.  However, “selling” is more than “having”.  VMware will need to be more evangelistic regarding its new cloud-native model, more explicit in defining how they believe telcos will adopt it and how the benefits will accrue, or whatever they do to actually create deployable assets won’t matter.  Nobody will try to deploy them.

Do We Have a Problem Just Describing an Event-Driven System?

Could one of our problems with a software-defined future be as simple as terminology?  When I noted in my blog that terms like “network management system” or “operations support system” were implicitly monolithic, implying a traditional single-structure application, I had a lot of operators contact me to agree.  Certainly, we have a long history of using terms that describe not only the “what” but the “how”, and when we use them, we may be cementing our vision of software into an obsolete model.

To me, the best example of this in the networking space is NFV.  Back in the summer of 2013, as NFV was evolving from a goal (replace devices with software instances hosted on commercial off-the-shelf servers) to an architecture, the NFV ISG took what seemed a logical step and published what they called an “End-to-End” or E2E model of NFV.  By this time, I’d been promoting the idea that NFV was explicitly a cloud computing application and needed to adopt what’s now called a “cloud-native” architecture.  I objected to the model because it described a monolithic application structure, but everyone at the time assured me this was a “functional” description.  Well, we ended up defining interfaces and implementations based on it anyway, and we lost all the cloud innovations (and even the TMF NGOSS Contract innovations) along the way.

Part of the problem with terminology in software-centric processes is that few people are software developers or architects.  We struggle to convey software-based frameworks to people who don’t know programming, and in our struggle we fall back on the “lowest common denominator”, concepts that everyone sort-of-gets.  We know what applications are, whether we’re programmers or not.  Think PowerPoint or Excel or Word.  Thus, we think applications, and when we do our examples are monolithic applications.

An “application” is the term we use for a collection of software components that interact with the outside world to perform a useful function.  In a monolithic application vision, the application presents a user interface, accepts requests and changes, displays conditions/results, and performs its function in a very cohesive way.  You can see a word processor as such a monolithic application.

You could also see it another way.  We have this “document”.  We have a place in that document that’s defined as the “cursor position”.  You have keys you can punch, and those keys represent “events” that are processed in the context of the document and the cursor position.  This is much closer to the vision of a word processor that the development team would necessarily have, an event-driven approach, but it’s not easily communicated to the kind of people we’d expect to be using word processors, meaning to non-programmer types.
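
For what it’s worth, that event-driven view of a word processor can be sketched in a few lines of Python.  This is a toy, not how any real editor is built, but it shows the document-plus-cursor context against which each keystroke event is processed.

    # Toy event-driven word processor: a document, a cursor position, and keystroke
    # "events" handled in that context.  Key names here are purely illustrative.
    def handle_key(doc, cursor, key):
        if key == "BACKSPACE":
            if cursor > 0:
                doc = doc[:cursor - 1] + doc[cursor:]
                cursor -= 1
        elif key == "LEFT":
            cursor = max(0, cursor - 1)
        else:                                    # an ordinary printable character
            doc = doc[:cursor] + key + doc[cursor:]
            cursor += 1
        return doc, cursor

    doc, cursor = "", 0
    for key in ["c", "a", "t", "BACKSPACE", "r"]:
        doc, cursor = handle_key(doc, cursor, key)
    print(doc)                                   # "car"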

Word processors fit easily into both worlds, descriptively speaking.  You can describe one functionally, as a monolithic document-centric application, or you could describe it as a state/event system.  But word processors are supporting a user who’s naturally single-threaded in thinking.  There is one source of events, one destination for results, and one context to consider.  Suppose that the functionality we’re building has to deal with a large number of asynchronous events and a significant number of contexts?  The implementation of this is going to get a lot more complicated very quickly, and it will be increasingly difficult for any non-programmer (and even many programmers) to gain a sense of the functional logic from the implementation description.

Traditional (meaning early) event processing tended to be based on a monolithic model for the application, fed by an “event queue” where events were posted as they happened and popped for processing when the application had the resources available to do the work.  The processing had to first identify what the event was, then what the event related to, then the context of the event relative to what the application was doing at the time.  If the system the application was designed to support had multiple independent pieces (two different access services and a core service, for example), there was also the problem of making sure that something you were doing in one area didn’t get stepped on while processing an event in another area.
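
A minimal sketch of that traditional model might look like the following (Python, with invented event fields).  Note how a single body of code has to work out what each event is, what it relates to, and what else might be going on at the same time.

    # Sketch of the monolithic event-queue model: events are posted as they happen and
    # popped when the application has the resources to process them.
    from queue import Queue

    event_queue = Queue()

    def post(event):
        event_queue.put(event)

    def run_once():
        while not event_queue.empty():
            event = event_queue.get()
            # One body of code identifies the event, then its subject, then its context.
            if event["kind"] == "fault" and event["area"] == "access":
                print("handling access fault for", event["service"])
            elif event["kind"] == "fault" and event["area"] == "core":
                print("handling core fault for", event["service"])
            else:
                print("ignoring", event)

    post({"kind": "fault", "area": "access", "service": "svc-1"})
    post({"kind": "fault", "area": "core", "service": "svc-1"})
    run_once()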

This model of application is still the basis for initiatives like the various NFV MANO implementations and ONAP.  It works as long as you don’t have too many events and too many interrelated contexts to deal with, so it’s OK for deployment but not OK at all for full service lifecycle automation.  For that, you need a different approach.

The basic technical solution to asynchronous event handling is a finite state machine.  We say that the system our application is managing has a specific set of discrete functional states, like “orderable”, “activating”, “active”, “fault”, and so forth.  It also has a series of events, like “Order”, “Operating”, “Failure”, “Decommission”, and so forth.  For every state/event combination, there is an appropriate set of things that should happen, and in most cases there’s also a “next state” to indicate what state should be entered when those things have been done.  The combination is expressed in a state/event table, and this kind of thing is fundamental to building protocol handlers, for example.
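
Here’s a toy state/event table in Python, using the states and events named above.  The handler functions are placeholders; the point is that the table, not the code, steers each event to the right process and next state.

    # Each table cell pairs a handler with a next state.
    def do_activate(svc):    print("activating", svc)
    def do_run(svc):         print("now active", svc)
    def do_remediate(svc):   print("remediating", svc)
    def do_teardown(svc):    print("decommissioning", svc)

    STATE_EVENT_TABLE = {
        ("orderable",  "Order"):        (do_activate,  "activating"),
        ("activating", "Operating"):    (do_run,       "active"),
        ("active",     "Failure"):      (do_remediate, "fault"),
        ("fault",      "Operating"):    (do_run,       "active"),
        ("active",     "Decommission"): (do_teardown,  "orderable"),
    }

    def handle(service, event):
        handler, next_state = STATE_EVENT_TABLE[(service["state"], event)]
        handler(service["name"])
        service["state"] = next_state

    svc = {"name": "vpn-42", "state": "orderable"}
    for ev in ["Order", "Operating", "Failure", "Operating"]:
        handle(svc, ev)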

The TMF’s NGOSS Contract innovation, which I’ve mentioned many times in blogs, said that this state/event table was part of a service contract, and that the contract then became the mechanism for steering service events to service processes.  This is a “data-driven” or “model-driven” approach to handling event-driven systems, and it’s a critical step forward.

The problem with the finite state machine model is that where the system consists of multiple interrelated but functionally independent pieces (like our access/core network segments), you have state/events for each piece, and you then need something to correlate all of this.  In my ExperiaSphere project (both the first phase in 2008 and the second in 2014), I presumed that a “service” was made up of functional assemblies that were themselves decomposed further, to the point where actual resources were committed.  This structural hierarchy, this successive decomposition, was how multiple contexts can be correlated.

If we have a functional element called “Access”, that functional element might decompose into an element called “Access Connection” and another called “Access Termination”.  Each of these might then decompose into something (“Ethernet Connection”, “MM-Wave Connection”) that eventually gets to actually creating a resource commitment.  The state of “Access” is determined by the state of what it decomposes into.  If the Access element, in the orderable state, gets an “Order” event, it would set itself in the “Setup” state, and send an Order to its subordinate objects.  When these objects complete their setup, they’d send an Operating event up to Access, which would then set its state to Active.  Events in this approach can be sent only to adjacent elements, superior or subordinate.
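
A minimal sketch of that decomposition, using the element names from the example, might look like this in Python.  The resource commitment at the leaves is faked, but the event flow is the point: Orders travel downward, Operating events travel upward, and events pass only between adjacent elements.

    # Hierarchical service model: each element talks only to its parent and children.
    class ModelElement:
        def __init__(self, name, children=None):
            self.name = name
            self.state = "orderable"
            self.parent = None
            self.children = children or []
            for c in self.children:
                c.parent = self

        def order(self):
            # "Order" event from the adjacent superior element.
            self.state = "setup"
            if not self.children:
                self.state = "active"            # leaf: commit actual resources here
                self.parent.operating()          # report only to the adjacent superior
            else:
                for c in self.children:
                    c.order()                    # send only to adjacent subordinates

        def operating(self):
            # "Operating" event from a subordinate; go active when all subordinates are.
            if all(c.state == "active" for c in self.children):
                self.state = "active"
                if self.parent:
                    self.parent.operating()

    access = ModelElement("Access", [ModelElement("Access Connection"),
                                     ModelElement("Access Termination")])
    access.order()
    print(access.state)                          # "active" once both subordinates report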

You can see from this very basic example that an implementation description of such a system would convey nothing to a non-programmer.  That’s almost surely how we ended up with monolithic implementations.  We tried to describe what we wanted in functional terms, because those terms would be more widely understood, but we then translated the functional model, which looks and is monolithic, directly into an implementation, and so implemented a monolith.

The cloud community has been advancing toward a true cloud-native model on a broad front, and here again there’s some hope from that quarter.  The advent of “serverless” computing (functions, lambdas, or whatever you’d like to call it) has launched some initiatives in event orchestration, the best known of which is Amazon’s Step Functions.  As they stand, they don’t fit the needs of service lifecycle automation, but they could easily evolve to fit, particularly given the fact that the whole space is only now developing and there’s a lot of interest from all the cloud providers and also from IBM/Red Hat and VMware.

It would be a major advance for cloud-native in general, and telecom software-centric thinking in particular, if the maturation of a technical approach came with a vision of how to describe these kinds of orchestrated-event systems so everyone could grasp the functional details.  It’s not an easy problem to solve, though, and we’ll have to keep an eye on the space to see how things shake out.

Is Cloud Spending Hitting a Plateau, or on the Verge of Breakout?

What’s happening in the cloud market?  It seems (from stories in both the technology and financial press) like growth in public cloud service revenues is slowing, and some have suggested the market is nearing a plateau.  The truth is more complex (as usual), because there are a number of forces acting on the same market and all its players, and some relatively new forces acting on the cloud buyer.  The cloud market overall has two parts, each of which has multiple phases.  It’s the interplay of all this that we have to be thinking about.

The two parts of the cloud market are the startup cloud and the enterprise cloud.  In the early days of cloud computing, we focused on the enterprise cloud opportunity, and on “migration” of applications to the cloud, largely as a part of the server consolidation trend.  At the time, my modeling was saying that the most we could hope for from the migration story was about 23% penetration into enterprise IT spending, and in point of fact we’ve not reached that yet.  The big early growth driver was the startup cloud part.

Social media, content, and other web-centric or “over-the-top” (OTT) startups need massive online capacity, but the venture capitalists who fund them don’t want to spend money on building gigantic server farms.  As a result, these startups turned quickly to public cloud services, most often from Amazon, and this became the big driver of cloud computing.  In large part that was because there was an OTT-social startup wave that corresponded to the early deployment of public cloud services.  The startup-wave market segment and phase has been our cloud opportunity source—up to this year.

Enterprises, as early as 2017, had been framing their public cloud use differently.  Rather than “migrating” applications to the cloud, they were migrating the front-end piece of applications to the cloud.  This was more easily justified because the growing mobility trend was demanding better integration between business applications and mobile devices, particularly smartphones.  The back-end transaction processing stuff that creates all the compliance risk and migration cost wasn’t part of the cloud picture, so this removed a barrier and launched the “front-end” phase of the enterprise cloud.  Microsoft’s Azure was the big beneficiary of this, since they’d focused on hybrid enterprise applications from the start.

We’re in the enterprise front-end phase now, in parallel with the startup-wave phase on the other side.  However, we now face a well-known phenomenon on both sides—the “low-apple” problem.  Everyone knows that projects require a cost/benefit analysis, and everyone knows that the early target projects tend to be the ones where cost/benefit is the clearest and biggest, meaning those proverbial low apples.  Like all low apples, the cloud ones have been getting picked, which means that the remaining projects have lower ROIs and are often more complex to implement and deploy.  This is slowing the growth of cloud computing, both because of the extended project cycles and that lower ROI.  Were this the end of the phases of the startup and enterprise cloud pieces of our puzzle, we could in fact expect to see a plateau.

It isn’t the end.  We have, in fact, about nine hundred billion dollars in potential incremental spending on the cloud yet to be realized.  The question, obviously, is what’s going to realize it, and the answer (using the modern term, which I realize is always dangerous) is cloud-native.

Cloud-native is an evolution of the conceptual foundation of front-end enterprise cloud.  The front-end model works because pulling out the web-like elements of an application from the old-line monolithic transaction processing models of current applications lets the front-end take better advantage of the things that the cloud does differently, and better.  The cloud-native model is a movement to define a software architecture that takes full advantage of the cloud, and that can be applied to new application development across the board.  That means enterprise and startup, front-end and transaction processing.  It also means things that are practical in the cloud and would not be practical in the data center, things like contextual applications and IoT.  This is what’s going to put those nine hundred billion dollars per year on the table.

Not right away, though, which is the important truth behind our so-called “plateau” in cloud spending.  Right now, my model says that fewer than four percent of developers and two percent of software architects can really say they can do “cloud-native” development.  In order for the cloud-native wave to really get going, we need ten times those numbers.  We also need better development management, starting with the team level and moving upward to the CIO and even the CEO and CFO.

The challenge for both the startup and the enterprise sides of the cloud opportunity is to address the new missions of the cloud, not the cloud way of addressing the old data center missions.  That shift involves rethinking the relationship between computing and its users.  As I’ve pointed out in previous blogs (most recently HERE), there have been three waves of spending since 1950, corresponding to three decisive changes in the relationship between computing and its users.  Technologists alone couldn’t drive IT spending growth to 1.4 times GDP growth or more; it required significant management buy-in, and not just to sign the checks.  We needed to rethink how workers work and how consumers buy, and that’s outside the realm of IT.

In a realistic sense, this means that the cloud-native movement and its financial benefits are going to arrive slowly, but we have some control over just how slow “slowly” really has to be.  The open-source software community is doing a good job of framing the right tools the right way, and I’ve talked to many CIOs who really understand what “cloud-native” means.  The only problem is that the pace of progress in cloud-native technology is breathtaking, and education in the space isn’t able to keep up.

The cloud-native ecosystem is made up of a mixture of middleware tools, several mixtures in fact.  Containers are a universal element, and the container orchestrator Kubernetes is near-universal, but there are management, monitoring, networking, load-balancing and development tools added and mixed up by various sponsors.  What’s most important in today’s world is the growing interest among cloud-native prospects in a complete ecosystem from a single source.  IBM, with its acquisition of Red Hat, wants to be that source, and of course so does VMware and other companies.  Just this week, Mesosphere (the commercial front of the Apache Mesos project) changed its name to D2IQ and will now focus on cloud-native and Kubernetes-ecosystem developments.

The single-ecosystem approach is critical because it offers both a set of comprehensive architectures for buyers to pick from, and organizations to promote those architectures and raise the visibility of cloud-native.  It would be naïve for me to suggest that this will automatically give rise to market education; it’s difficult to differentiate within a topic that buyers don’t even grasp as a whole.  Still, it would be nice, because it would shorten this apparent plateau in cloud spending growth.

There’s plenty of money to be had here, but to get at it, we’ll need to develop stuff and not just move or tweak it.  The necessary skills and tools are evolving, but the inertia of current software is considerable and the pool of labor that needs to be educated is too.  The benefits are also considerable, fortunately, and I think we can expect to see the pace of cloud spending return to its past trajectory in a year or so.