Scalability: Why We Don’t Have as Much as We Think

One of the most profound benefits of the cloud is the elasticity or scalability that virtualization of hosting resources can create.  Think of it as having a computer whose power expands and contracts with the level of presented work and you’ll get the idea.  The problem is that this kind of scalability isn’t a natural attribute of virtual hardware; you need to have a software framework in which the hardware deploys to harness it.  Yes, I know it’s customary to think of software as deploying in a hardware framework, but remember we’re talking about virtualization here.

Traditional networks don’t expand and contract that way.  Yes, you can allocate more capacity or resources to a given service or connection (providing you have some to spare where it’s needed), but you don’t typically make routers bigger or expand their number dynamically to create additional pathways for information flow when things get hot and heavy.  Operators are almost uniformly of the view that hosted network functions are scalable and elastic, and that this benefit is important to justifying things like NFV.

Cloud computing’s basic tools and software have evolved to recognize and realize this scalability benefit.  We know now that a scalable element has to “look” like a single element from the outside, that inside it must include some capability to balance work across the instances that scalable behavior creates, and that the scaling process has to be able to create and destroy those instances as needed, without disrupting the way work is processed.  These scalability features are a big part of what’s called “cloud-native” design, meaning designing software that exploits the special properties of the cloud rather than simply moving pre-cloud designs into it.

Scalability is based on a simple principle of abstraction.  A software element has to present itself to its partner elements as a single component, but resolve into as many components and elements as needed.  If multiple instances of something are to be able to handle work as though they were a singularity, they have to behave based on what might be called “collective” or “distributed” state.  To the extent that there is an order of outputs related to the order of inputs, that relationship has to be preserved no matter where the work goes.

This isn’t a new problem; IP networks, which are “connectionless” or “datagram” networks, can deliver packets out of order, so a layer (TCP) is added to resequence things or detect a missing piece.  TCP is an end-to-end layer, so how the intermediate pieces are connected and how many paths there are doesn’t matter.
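
To make that concrete, here’s a minimal sketch (in Python, with illustrative sequence numbers rather than real TCP mechanics) of the end-to-end resequencing idea: the receiver holds out-of-order arrivals until the gap fills, and a gap that never fills signals a missing piece.

```python
# Minimal sketch of end-to-end resequencing, in the spirit of what TCP does.
# Segment numbering and the delivery callback are illustrative, not TCP itself.

class ReorderBuffer:
    def __init__(self, deliver):
        self.expected = 0        # next sequence number we can deliver in order
        self.pending = {}        # out-of-order segments held until the gap fills
        self.deliver = deliver   # callback that receives in-order data

    def receive(self, seq, data):
        self.pending[seq] = data
        # Deliver everything that is now contiguous; anything else waits,
        # and a persistent gap signals a missing piece to recover.
        while self.expected in self.pending:
            self.deliver(self.pending.pop(self.expected))
            self.expected += 1

buf = ReorderBuffer(deliver=print)
for seq, data in [(1, "b"), (0, "a"), (3, "d"), (2, "c")]:
    buf.receive(seq, data)   # prints a, b, c, d in order despite arrival order
```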

Transactional applications pose a different state problem.  A transaction is usually an ordered sequence of steps at the user or functional level.  These steps might involve a sequence of datagrams, and those datagrams might be suitable for resequencing among themselves via TCP, but the steps themselves have an order too.  If an instance of a functional component handles the first step in a transaction, and a different one handles the second step, it’s still important that Step 2 doesn’t somehow get pushed ahead of Step 1, and also that nothing saved in the process of doing Step 1 is lost if Step 2 goes to another software instance for handling.
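
Here’s a minimal sketch of one way to handle that, assuming the per-transaction state is externalized to a shared store (a plain dictionary stands in for it here); any instance can then handle the next step, and the step counter keeps Step 2 from running ahead of Step 1.  The names and steps are mine, purely illustrative.

```python
# Sketch: per-transaction state kept outside the worker instances, so any
# instance can pick up the next step. The dict stands in for a shared store
# (e.g. a database or cache); names and steps are illustrative.

SHARED_STATE = {}   # transaction_id -> {"next_step": int, "data": dict}

def handle_step(transaction_id, step_number, payload):
    record = SHARED_STATE.setdefault(
        transaction_id, {"next_step": 1, "data": {}})
    if step_number != record["next_step"]:
        # Step arrived out of order; requeue rather than letting
        # a later step run ahead of an earlier one.
        return "requeue"
    record["data"][f"step{step_number}"] = payload   # nothing from Step 1 is lost
    record["next_step"] += 1
    return "done"

# Two different "instances" can call handle_step for the same transaction:
print(handle_step("tx42", 1, {"item": "widget"}))   # done
print(handle_step("tx42", 3, {"ship": "ground"}))   # requeue (step 2 missing)
print(handle_step("tx42", 2, {"qty": 5}))           # done
```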

This sort of thing may not seem like much of an issue, given that I’ve already said that NFV has focused on box-replacement missions and that VNFs are thus handling IP packets in most cases, just as routers would.  There are two reasons why that would be faulty reasoning.  First, as I’ve noted in earlier blogs, there are no future connection services that will deliver improved profit per bit.  Second, service lifecycle software, NFV software, and pretty much any management system software, is going to have to deal with massive numbers of events, and maintain at least chronological context while doing it.

Let’s say that there are ten thousand active services, and that there’s a fault that impacts perhaps 20% of those services.  That means two thousand impacts, and in many cases the “impact” is going to be signaled by every element of every service that recognizes a problem.  We could easily have tens of thousands of events generated by our fault, all flowing into the conceptual mouth of a service lifecycle automation system and expecting remediation.  So two thousand impacted services, twenty thousand events, as an example.

If you look at the pictures being drawn for NFV management and orchestration or ONAP, you’ll see a series of boxes that collectively represent a software automation system.  Emphasis on the singular “a” here.  The events come in in whatever order the timing of problem recognition and the electrical distance from problem to management system happen to dictate.  They’re not nicely arranged by service, nor are they guaranteed to arrive in order of their generation, even when they’re associated with the same service.

All this stuff is dumped into a queue, and as the system has the capacity to do something, it pops the queue to grab an event.  First question: what does the event belong to, service-wise?  Without knowing that, we can’t process it.  Second, where are we (if anywhere) in the process of handling problems, based on prior events we might have processed and prior steps we might have taken?  If we don’t establish that, we will see later events doing things that step on actions undertaken in response to earlier events.

This is the problem with monolithic models for operations or lifecycle automation.  If you don’t see ONAP that way, I recommend you review THIS piece in Light Reading.  The critical quote is “They’re able to get this huge elephant to work in the cloud in one virtual machine.  Given the criticism historically that ONAP is a monster that needs a supercomputer to run it and an army to install it, here you have 20 blokes running a scaled-down version. It’s not as heavy a lift as it’s made out to be.”

I agree that having a small footprint to run ONAP instead of a supercomputer is better, but I submit that this way lies madness.  It’s an admission that ONAP is based on a monolithic architecture, that there’s a point of processing through which all service events must funnel.  Return now to our twenty thousand service events.  How many per second could a single VM handle?  And think of an operator with tens of thousands of customers and services instead of my modest example.  What’s needed is scalability, and you can’t scale monolithic processes you’ve struggled to cram into a VM.

The exact opposite should be the goal; instead of making ONAP fit on a smaller server, ONAP should run as a series of coupled microservices, infinitely scalable and distributable.  My beef with the project has centered around the fact that it wasn’t designed that way, and the Dublin release proves it’s not going in that direction—probably ever.

Imagine now a nice service-model-driven approach.  The model is a hierarchy of the very type that the TMF envisioned with its SID model.  At the bottom of each branch in the hierarchy lies the process set that maps the model to real infrastructure, so real problems will always be recognized at these bottom points.  When that happens, we need to handle the associated event, and to do that, we look to the state/event table that’s resident in the model element that contains the fault.  In my terms, that “looking” is done by an instance of a Service Factory element, whose only source of information is the model and the event.  That means we can spin up the instance as needed.  There is no resident process.  In fact, logically, there is a process available for each of our two thousand services, if we want.

The Factory sees that it’s supposed to run Process X.  That process is also likely suitable for instantiation as needed, and its only inputs are the event and the service model that launched it.  The process might be the same for all two thousand services, but we can instantiate it as needed.  When the process completes, it can set a new state in the state/event table for the element that spawned it, and if necessary, it can generate an event to the model element above, or the element below.  The combined state/event tables keep everything synchronized.  No looking to see what something is associated with; it’s associated with what generated it.  No worrying about colliding activities, because the state tables maintain context.
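
A minimal sketch of that data-model-steered dispatching might look like the following; the states, events, and process names are my own illustrations, not anything from a specification.  The point is that the dispatcher itself holds no state, so it can be spun up per event.

```python
# Sketch of data-model-steered event handling. The model element carries its
# own state/event table; the "factory" is stateless, so any instance can run it.
# States, events, and process names are illustrative.

def scale_out(element, event):
    print(f"scaling {element['name']} in response to {event}")
    return "Scaling"            # new state recorded in the model

def ignore(element, event):
    return element["state"]     # no state change

model_element = {
    "name": "access-vnf",
    "state": "Active",
    "state_event_table": {
        ("Active", "overload"): scale_out,
        ("Scaling", "overload"): ignore,    # already scaling; don't collide
    },
}

def service_factory(element, event):
    """Stateless dispatcher: everything it needs is in the model and the event."""
    process = element["state_event_table"].get((element["state"], event))
    if process:
        element["state"] = process(element, event)

service_factory(model_element, "overload")   # scales, moves to "Scaling"
service_factory(model_element, "overload")   # second event is absorbed by state
```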

This model-driven approach with a service contract coupling events to processes is naturally scalable.  This is what the cloud would do.  This is not what we’re specifying in places like the NFV ISG or the ETSI Zero-touch group or within the multiple implementations of MANO and VNFM, and not within ECOMP.  We have defined things that are supposed to be cloud-hosted, and yet are not themselves cloud-ready or even cloud-compatible.

The TMF’s NGOSS Contract work promoted this approach, over a decade ago when the cloud was still largely a fantasy.  As a sad coincidence, the primary architect of this truly visionary approach, John Reilly, passed away recently.  I’d talked with John many times about NGOSS Contract and its applications and implications, and these conversations were the primary source of my own education on the notion of data-model steering of events.  John had it right all along, and so I’m taking this opportunity to shout out in his honor.

Thanks to John’s insight, we’ve had over a decade to build the right approach to lifecycle software and also to hosted features and functions.  I’d hate to think we wasted that time, and in fact the cloud community has embraced most of what John saw so long ago.  I’d like to think that the TMF and the network operator community will eventually do the same.

Why “Infrastructure Services” are as Important as NFVi

In this second of my pieces on the things we’re forgetting in carrier cloud, NFV, and lifecycle automation, I want to look at the issue of tenancy.  You may have noticed that VNFs, and the NFV processes in general, tend to be focused on single-tenant services.  Virtual CPE (vCPE), the current leader in VNF popularity, is obviously designed for the tenant whose premises is involved.

What makes this interesting (and perhaps critically important) is that many of the most credible applications for function virtualization aren’t single-tenant by nature, and even single-tenant services may include features that are used by multiple tenants.  Are these applications perhaps different enough in their requirements that the same framework used for single-tenant services isn’t optimum for them?  If so, there’s a barrier to adoption of NFV principles just where they may be most important.

Back in August of 2013, I did a presentation on this issue, describing the concept I called “Infrastructure Services”.  An infrastructure service is a functional element that isn’t instantiated for each tenant/service, but rather is shared in some way.  A good example of such a service is the Internet.  If you were to frame out a model of an SD-WAN service, you’d need to presume Internet connectivity at the target sites, and represent that connectivity in the model and management, but you’re not going to deploy an instance of the Internet every time someone orders SD-WAN.

One example of an infrastructure service that’s particularly important (and particularly unrepresented) is the set of elements making up the NFV software itself.  In a TMF-modeled deployment, we’d have a data model (the SID in TMF terms) that includes a state/event table to direct service events to operations processes.  The thing that processes a service data model to do the steering of events to processes is what I called the “Service Factory”, one of the infrastructure services.  Similarly, each of the operations processes could be an infrastructure service.  You could then say that even the service lifecycle software could be framed in the same way as the services it’s deploying; “Deploy” is a lifecycle state, as is “Order” and so forth.

One reason this is a good starting point for infrastructure services is that an operations process might be a “microservice”, fully stateless and thus fully scalable, and another operations process might be an infrastructure service that’s shared, like something that makes an entry in a billing journal.  Note that these services—either type—could be created outside NFV or could be NFV-created, providing that NFV had (as I was recommending it have) a means of specifying that a given service was “infrastructure” rather than per-tenant.

Given this, what can we say about the properties of an infrastructure service?  The best way to try to lay out the similarities/differences versus per-tenant VNF-like services is to look at what we can say about the latter first.

Recall from my other blogs on NFV that the NFV ISG has taken a very box-centric view of virtualization.  A virtual function is the hosted analog of a physical network function, meaning a device.  You connect virtual network functions by service chaining, which creates a linear path similar to what you’d find in a network of devices.  A per-tenant VNF is deployed on the activation of the service, and disappears when the service is deactivated.

An infrastructure service is deployed either as a persistent function, or as a function that comes to life when it’s used.  Both could be scalable and redeployable providing the implementation of the function was consistent with that goal.  It probably lives inside a subnet, the same structure that VM and container deployments presume to hold the components of an application.  It doesn’t expose the equivalent of an “interface” to a “device”, but rather an API.

In terms of access, you’d be able to access a per-tenant function from another function within the same service, or selectively from the outside if an API of a function were exposed.  Inside a subnet, as used by both OpenStack and Kubernetes/Docker, you have a private address space that lets the subnet elements communicate, and you have a means of exposing selective APIs to a broader address space, like a VPN or the Internet.  This is pretty much what the NFV community seems to be thinking about as the address policy.

An infrastructure service is by nature not limited to a single tenant/service instance, but it might not always be exposed to the world or to a VPN as the only option.  This, in fact, might be the most profound difference between infrastructure services and virtual functions, so let’s dig on it a bit.

A given service has a “service domain”, representing the address space in which its unique functions are deployed.  Think of this as the classic IPv4 private Class C range, 192.168.x.x.  There are over 65 thousand available addresses in this range, which should cover the needs of per-service deployed function addressing.

Next, it would be reasonable to say that every service buyer might have a “tenant domain”, an address space that exposes the APIs of the functions/services/elements that have to be shared.  Let’s say that this tenant space is a Class B private address, something like 172.y.x.x, where y is between 16 and 31.  That range has over a million available addresses, plenty to handle how cross-service functions could be addressed/exposed within a tenant’s service infrastructure. An example of such a function might be management systems.

What about the stuff outside the tenant’s stuff?  The Class A private address, 10.x.x.x, has almost 17 million available addresses.  We could imagine this space being divided up into four function groups.  The first would be used for the service lifecycle software application itself.  The second would represent the address space into which service- and tenant-specific APIs were exposed to allow lifecycle software to access them.  The third would be for the infrastructure elements (hosts, etc.), and the final one for those infrastructure services that were available to per-service, per-tenant, or lifecycle software use.  For this one, think IMS.
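
As a sketch of what that address plan could look like, here’s how Python’s ipaddress module might carve up the three private ranges; the specific prefix sizes for tenant slices and for the four function groups are my assumptions for illustration.

```python
import ipaddress

# Sketch of the address plan described above; the specific prefix lengths
# chosen for each slice are illustrative assumptions.

service_domain = ipaddress.ip_network("192.168.0.0/16")   # per-service functions
tenant_space   = ipaddress.ip_network("172.16.0.0/12")    # per-tenant shared APIs
operator_space = ipaddress.ip_network("10.0.0.0/8")       # everything outside tenants

# Divide the operator space into the four function groups mentioned above.
groups = list(operator_space.subnets(prefixlen_diff=2))   # four /10 blocks
function_groups = dict(zip(
    ["lifecycle_software", "exposed_service_apis",
     "infrastructure_elements", "shared_infrastructure_services"], groups))

# Hand each tenant its own slice of the tenant space (here, a /24 per tenant).
tenant_subnets = tenant_space.subnets(new_prefix=24)
tenant_domain = {f"tenant-{i}": next(tenant_subnets) for i in range(3)}

print(service_domain.num_addresses)        # 65536 per-service addresses
print(function_groups["shared_infrastructure_services"])
print(tenant_domain["tenant-1"])
```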

One of the important things about infrastructure services is that you don’t really deploy them as much as “register” them.  Infrastructure services are known by APIs, meaning network addresses.  The addresses may represent a persistent resource or one on-demand, but they will rarely represent something that’s deployed in the way VNFs are generally expected to be.  Thus, infrastructure services are really cloud components.
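
In other words, all an infrastructure service really needs is a registry entry that maps its name to the API address it answers at.  A minimal sketch, with purely illustrative service names and addresses:

```python
# Sketch: infrastructure services are known by their APIs (addresses), so a
# simple registry is enough to make them usable; nothing is deployed per tenant.
# Service names and endpoints are illustrative.

REGISTRY = {}

def register(name, endpoint, persistent=True):
    REGISTRY[name] = {"endpoint": endpoint, "persistent": persistent}

def lookup(name):
    entry = REGISTRY.get(name)
    if entry is None:
        raise KeyError(f"infrastructure service '{name}' not registered")
    return entry["endpoint"]

# A persistent shared element (think IMS) and an on-demand one register the
# same way; consumers only ever see the endpoint.
register("ims-core", "10.192.0.10:5060", persistent=True)
register("billing-journal", "10.192.0.20:8443", persistent=False)

print(lookup("ims-core"))
```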

This isn’t the first time I’ve said that many of the things we think of as “NFV” aren’t.  Applications like IMS, like CDNs, like most of IoT, are consumers of infrastructure services, not VNFs.  In fact, the first NFV ISG PoC involved an open-source IMS implementation, and it demonstrated that deploying an infrastructure service like IMS could be done using the same process as could be used for traditional VNFs, but that addressing and address assignment was subnet-based and thus “cloud-like” rather than service-chained and box-like.  Proof, I think, that even back in 2013 we had an implicit recognition that shared service components were more likely to follow cloud principles than VNF principles.

Infrastructure services aren’t theoretical, they’re real, and we have examples of them all through the current initiatives on carrier cloud, virtualization, and lifecycle automation.  Like virtual networking, infrastructure services are fundamental, and like virtual networking they’ve been largely ignored.  They were raised from the very first (in the first NFV PoC, as I’ve noted here) and they should have been an integrated part of NFV, but that didn’t happen.  It may happen now, at least in the sense that they may get back-filled into NFV’s framework, because of the attention being paid to mobile, 5G, content delivery, and carrier cloud services.  It makes sense to frame virtualization to address all its missions, and infrastructure services are certainly a mission to be addressed.

We Need a Virtual Network for Carrier Cloud!

It should be obvious that carrier cloud needs a network.  Perhaps it is, but if so there are plenty of network operators who haven’t really considered what that network would be, where they’d get it, or how the choice might impact their overall success in transformation.  We’ve come a long way down the line of hosting network features in carrier cloud without proper concern for this critical issue, so it’s time to take it on.

IP networking overall creates a flat, universal, address space that we call “the Internet” but that’s actually made up of a series of assigned IP address ranges.  The “public” IP addresses can be used by their owners to join in the Internet community or to create a virtual private network (VPN).  The dominant IP addressing model today is IPv4, and we’ve already got more things on the Internet than IPv4 can directly address.  One obvious solution to that is to adopt the newer IPv6 model, which creates a much larger address space (IPv6 supports 2 to the 128th power addresses, roughly 3.4 times ten to the 38th, far more than we’re likely to need), but IPv6 has been around for more than two decades and it’s still not the dominant model.

What’s saved us from IPv4 limitations so far are two concepts: “private” IP addresses and network address translation (NAT).  The IETF assigned three ranges of addresses (a Class A range, a B range and a C range) that could be used internally but not sent out onto the Internet, since they are not assigned to an owner and are thus not unique.  The NAT process lets devices that can generate an Internet query, but can’t be reached as an open resource, work behind a “stack” of addresses, one being the inside private address of the device and the other the outer “public” address of the gateway to the Internet.  Since most Internet growth has been in this type of request-only device (you can send a response to a NAT request, but you can’t originate a connection to a device with no public address), this has taken the heat off the IPv4 space.

The reason I’m going into this is that these same private IP addresses are widely used in application deployment.  Multi-component applications typically have a bunch of internal connections to pass work among the components, and a very few connections open to the outside world.  In the typical OpenStack and Kubernetes models, a private address space is assigned to an application, and the public APIs the applications need exposed are then translated to a public address.  Not only does this save on IP addresses, it also means that the application components that should never be addressed from the outside don’t have an outside address to reference.
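
A sketch of that selective-exposure pattern, with the dictionary standing in for NAT or a cloud gateway and all addresses illustrative:

```python
# Sketch of selective exposure: application components address each other in
# a private space, and only chosen APIs get a public mapping. Addresses and
# ports are illustrative; the dict stands in for NAT or a gateway/ingress.

PRIVATE_COMPONENTS = {
    "web-frontend": "192.168.1.10:8080",
    "order-service": "192.168.1.11:9000",   # internal only, never exposed
    "db": "192.168.1.12:5432",              # internal only, never exposed
}

PUBLIC_MAP = {
    "203.0.113.5:443": "web-frontend",      # the one API exposed to the outside
}

def route_inbound(public_addr):
    name = PUBLIC_MAP.get(public_addr)
    if name is None:
        return None                          # no public address, no way in
    return PRIVATE_COMPONENTS[name]

print(route_inbound("203.0.113.5:443"))      # reaches the frontend
print(route_inbound("203.0.113.5:5432"))     # None: the database has no outside address
```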

Cloud computing also stimulated the concept of “virtual networking”, which is a form of networking that creates an overlay of some sort on top of IP.  This layer partitions things much as private IP addresses do, except that there’s no theoretical limit on the size of a virtual network address space or on what can connect with what.  Public IP and NAT create IP-sanctioned partitioning of applications (or service features), and virtual networks create IP-independent partitioning.

Network operators today recognize two primary sources for virtual-network solutions.  One is VMware, whose NSX is based on the Nicira acquisition that represented the first virtual or “software-defined” networking product.  The other is Nokia, who acquired the Nuage virtual networking product.

The beauty of these virtual-network solutions is the IP-independence property I mentioned.  When you build an application, you’re building it within your own infrastructure.  You’re used to applying access control to keep out unauthorized users, and you don’t expect that your data center network itself is partitioned.  That’s not workable for a cloud provider, for security reasons.  Every tenant, meaning every user and perhaps every application, has to be separated.  That’s what gave rise to the virtual-network thing in the first place, and it remains the critical property.  If you can separate tenants with a virtual network, you can provide a secure hosting framework shared by many.  That’s true whether you’re providing public cloud services, providing feature hosting for an enterprise private service like a VPN, or providing a public voice/data service like 5G.

SD-WAN is another example of a virtual network.  Technology varies, but overall SD-WAN implementations provide a form of an overlay, which means that they can ride on top of IP connectivity.  They are not typically used to separate tenants in a data center, though.  However, both VMware and Nokia offer SD-WAN features on their virtual networks, which means that there are at least two providers who specifically merge multi-tenant data center networking and SD-WAN.

In a carrier cloud, you could see a variety of virtual-network approaches, even several of them at the same time.  You could use a virtual network for tenant isolation and overlay a public/private IP network model so you could take advantage of all the cloud application tools designed for that model.  You could use a virtual network for everything.  You could add an SD-WAN to a service built on a combination of virtual networks and public/private IP.  Name your choice, and someone could probably come up with a reason it might be useful.  But “every possible combination of network technology” is not an efficient approach to building carrier cloud or its services.  Some cohesive approach would be better.

The big complication in carrier cloud is the multiplistic nature of monitoring and management.  Think of Big Customer, an enterprise client.  They have four different services, all of which have some features hosted in Big Telco’s carrier cloud.  Each of the services should be deployed to ensure that their network service access point (NSAP) interfaces are addressable by the customer, and a suitable network management interface should also be exposed for the services.  Each of the services should be fully visible to the operator, obviously, so they can be deployed and managed.  Finally, all of the services may share some elements, so it might be necessary to make some of the service APIs that are not customer-visible visible to other components of those four services.  That’s a lot of specificity, and getting it demands a plan.

Here’s an example of one.  Every service is a unique “service subnet”.  That subnet exposes a set of APIs that must be customer-visible to make the service useful.  It also exposes a set of APIs (which might include some of the same ones) onto a “customer subnet” to facilitate sharing of resources at the customer level.  Finally, it exposes a set of APIs onto an “operator subnet” that provides a connection to carrier cloud orchestration, monitoring, and management.  What we need is a way to map these “subnets” to some sort of virtual-network or public/private IP network approach, so we know how everything has to be connected and can keep those connections intact through scaling, redeployment, and service changes.
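
A minimal sketch of such a plan might record, per service, which of its APIs belong on which subnet, so that scaling or redeployment can restore the same exposure; the service and API names here are purely illustrative.

```python
# Sketch of the exposure plan described above: each service records where each
# of its APIs must be visible, so redeployment can restore the same wiring.
# Service and API names are illustrative.

SERVICE_CATALOG = {
    "sdwan-vpn": {
        "nsap":        {"service", "customer"},    # customer-facing access point
        "mgmt":        {"customer", "operator"},   # customer-visible management
        "telemetry":   {"operator"},               # operator monitoring only
        "shared-auth": {"customer"},               # shared across this customer's services
    },
}

def apis_on_subnet(service, subnet):
    """List the APIs of a service that must be exposed on a given subnet."""
    return [api for api, subnets in SERVICE_CATALOG[service].items()
            if subnet in subnets]

print(apis_on_subnet("sdwan-vpn", "customer"))   # ['nsap', 'mgmt', 'shared-auth']
print(apis_on_subnet("sdwan-vpn", "operator"))   # ['mgmt', 'telemetry']
```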

Some overall carrier-cloud virtual-network approach is essential, or there’s a major risk that all the pieces of carrier cloud won’t come together.  Yet where is the planning at this level?  Have operators picked an approach?  Not according to the operators.  We seem to forget that the dynamism, the elasticity, the scalability of the cloud and virtualization can’t be met if we tie elastic hosting to rigid networking.

This isn’t the only issue that’s hampering progress in carrier cloud, NFV, and even SDN.  Over the next week, I hope to look at a couple other issues that will in some cases illustrate why we need virtual networking, and in other cases illustrate why we need to think through the whole concept of virtualization more carefully.

Does the Industry Need More Order in NFV Infrastructure?

There is no question that disorder is hard to manage.  Thus, it might seem reasonable to applaud the recent story that the GSMA is working to bring some order to the “Wild West” of NFVi, meaning NFV infrastructure.  Look deeper, though, and you’ll find that disorder should have been the goal of NFV all along, and that managing it rather than eliminating it is essential if NFV is ever to redeem itself.

The stripped-to-the-bare-bones model of NFV is a management and orchestration layer (MANO), a virtual infrastructure manager (VIM), and an infrastructure layer (NFVi).  This model is very similar to the model for cloud deployment of application components; a component selected by the orchestrator is mapped to infrastructure via an abstraction layer, which means via virtualization of resources.  Virtual network functions (VNFs) are orchestrated, through the VIM, onto NFVi.

If we look at the early models of this, specifically at OpenStack, we see that a “virtual host” is mapped to a “real host” by a plugin that harmonizes differences in the interfaces involved.  That’s true for actual hosting (OpenStack’s Nova) and also networking (Neutron).  If we assume that all virtual hosts have the same properties, meaning that any real host connected via a plugin can perform the role of hosting, this whole process is fine.
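
A sketch of that plugin-style abstraction, with class and method names that are mine and not OpenStack’s actual interfaces:

```python
# Sketch of plugin-style resource abstraction: the orchestrator asks for a
# "virtual host" and a plugin maps it onto whatever it fronts. Class and
# method names are illustrative, not OpenStack's actual interfaces.

from abc import ABC, abstractmethod

class HostPlugin(ABC):
    @abstractmethod
    def create_host(self, cpu, memory_gb):
        """Return an identifier for a real host satisfying the request."""

class BareMetalPlugin(HostPlugin):
    def create_host(self, cpu, memory_gb):
        return f"baremetal://rack3/node7 ({cpu} vCPU, {memory_gb} GB)"

class KvmPlugin(HostPlugin):
    def create_host(self, cpu, memory_gb):
        return f"kvm://cluster-a/vm-{cpu}x{memory_gb}"

def orchestrate(plugin: HostPlugin):
    # The caller sees only the abstract "virtual host"; the plugin hides the rest.
    return plugin.create_host(cpu=4, memory_gb=16)

print(orchestrate(KvmPlugin()))
print(orchestrate(BareMetalPlugin()))
```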

The issue arises when that’s not true.  I pointed out in yesterday’s blog that there were really two missions for “carrier cloud”, one being retail cloud hosting and the other being “internal” service function hosting.  I also noted that the latter mission might involve connection-like data plane requirements, more performance-stringent than typical event-or-transaction interfaces to application components.

The problem that the NFV community has been grappling with is that different applications with different missions could arguably demand different features from the infrastructure layer.  The current initiative proposes three different NFVi classifications, network-intensive (data plane), compute-intensive (including GPU) and a “nominal” model that matches traditional application hosting done by cloud providers today.  This, they propose, would reduce the variations in NFVi.

But why do that?  Does it really hurt if there’s some specialization in the NFVi side?  Remember that there are a bunch of parameters such as regulatory jurisdiction and power stability or isolation that could impact where a specific virtual function gets placed.  Why wouldn’t you address technology features in the same way?  There are two possible issues here.  One is that VNFs are being written to different hosting environments, and the other is that there are just too many different hosting environments to allow for efficient resource pooling.

The first issue, that VNFs might be written to a specific hosting platform, is in fact a real issue, but one that the NFV ISG should have addressed long ago at the specification level.  It is possible to create a VNF that needs a specific version of operating system and middleware, of course, but if VNFs are hosted in virtual machines then any essential set of middleware and OS can be bundled into the machine image.  If there are deeper dependencies between VNFs and the hosting platform hardware or software, then a single middleware framework for all VNFs to match to is needed, and should have been part of the spec all along.  Harmonizing NFVi isn’t the answer to that problem at all.

The second issue is a bit more complicated.  A resource pool in a cloud is efficient because of the principle of resource equivalence.  If a bunch of hosts can satisfy a given hosting request, it’s easier to load all the hosts fully and raise infrastructure utilization, which lowers capex and opex.  If you need a bunch of different hosts because you have a bunch of different VNF configurations at the hardware level, then your resource pool is divided and less efficient.  However….

…however, if there really are specialized hosting, networking, or other requirements (including power, regulatory, etc.), then those other requirements will either subdivide the three sub-pools, or we’d have to shoehorn everything into those three pools no matter what.  Both these approaches have issues.

How many blue left-handed widgets are there?  How many do you need there to be?  Any form of subdivision of hosting options, for any reason, creates the possibility that a given virtual function with a given requirement will find zero candidates to host in.  The more requirements there are, the harder it becomes to maintain enough candidate hosts for every possible requirement set.  If the need to control where something goes is real, then the risk of not being able to put it there is real, too.

The risk is greater if there are competing virtual functions that need only approximately the same NFVi.  If multiple characteristics are aggregated into a common pool, it’s difficult to prevent assignment from that pool from accidentally depleting all of a given sub-category.  Five blue-widget VNFs could consume all my blue-left-handed slots when in fact there were right-handed options available, but we didn’t know we needed to distinguish.
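
Here’s a small sketch of that depletion risk, using the invented “blue” and “left-handed” attributes: placement that checks only the coarse category burns the scarce hosts on requests that never needed them.

```python
# Sketch of the sub-pool depletion risk: placement that only checks the coarse
# category can consume scarce hosts on requests that didn't need them.
# Attributes ("blue", "left_handed") are invented for the illustration.

hosts = (
    [{"id": f"h{i}", "attrs": {"blue", "left_handed"}, "free": True} for i in range(5)] +
    [{"id": f"h{i+5}", "attrs": {"blue"}, "free": True} for i in range(20)]
)

def place(required, check_all_attrs):
    for h in hosts:
        attrs_ok = required <= h["attrs"] if check_all_attrs else "blue" in h["attrs"]
        if h["free"] and attrs_ok:
            h["free"] = False
            return h["id"]
    return None   # zero candidates: the risk the text describes

# Five requests that only needed "blue", placed with coarse matching,
# happen to consume every left-handed host first.
for _ in range(5):
    place({"blue"}, check_all_attrs=False)

# Now a request that genuinely needs a left-handed host finds none left.
print(place({"blue", "left_handed"}, check_all_attrs=True))   # None
```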

Another problem with the aggregate-into-three-categories approach is that it means that in order to achieve a larger resource pool for efficiency (which doesn’t even work if there are other criteria that impact hosting, as I’ve just noted), you end up creating categories that are OK on the average but will oversupply some VNFs with resources.  The loss in efficiency within the categories could well eradicate any benefit obtained by limiting the number of NFVi classifications to achieve overall resource pool efficiency.

To me, the real issue here is not NFVi, but VIM.  The NFV approach has been to have a single VIM, which means that VIM has to accommodate all possible hosting scenarios.  That doesn’t make a lot of sense if you have a distributed cloud infrastructure for your hosting, and it doesn’t make sense if you’re going to subdivide your NFVi or apply secondary selection criteria (power, location, whatever) to your hosting decisions.  From the first, I argued that any given service model should, for each model element, be able to identify a VIM that was responsible for deploying that element.

If the NFV ISG wants resource pool efficiency, I’d suggest they have to look at it more broadly.  Everything but the network-performance-critical categories of NFVi should probably be considered for container applications rather than for virtual machines.  VMs are great wasters of resources; you can run many more containers on a given host than VMs.  For specialized applications, rather than considering a “pool” of resources, the VIM for that class of VNF should map to a dedicated host or hosting complex.  Another advantage of having multiple VIMs.

My view continues to be that NFV needs to presume multiple infrastructure domains, each with its own VIM, and that VIM selection should be explicit within the service data model.  That lets a service architect decide how to apply resource virtualization to each new service, and how multiple hosting sites and even hosting providers are integrated.  However, at present, NFV doesn’t mandate a service model at all, which is the big problem, not NFVi multiplicity.

If a service model decomposes eventually to a set of resource commitments, the specific mechanism to do the committing isn’t a big deal.  You can have as many options as you like because the model’s bottom layer defines how actual resource commitment works.  Containers, VMs, bare metal, whatever you like.  Further, the process of model decomposition can make all the necessary selections on where and how to host based on service requirements and geography, which is also critical.  This is what we should be looking at first.
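
A minimal sketch of that kind of decomposition, with the model structure, field names, and VIM names all illustrative: each bottom-level element names the VIM responsible for committing its resources, and decomposition just walks the tree.

```python
# Sketch of model decomposition with explicit VIM selection per element.
# The model structure, field names, and VIM names are illustrative.

VIMS = {
    "openstack-east": lambda spec: f"VM in east: {spec}",
    "k8s-edge-12":    lambda spec: f"containers at edge-12: {spec}",
    "baremetal-core": lambda spec: f"dedicated host: {spec}",
}

service_model = {
    "name": "business-vpn",
    "children": [
        {"name": "vpn-core", "vim": "openstack-east", "spec": {"vcpu": 8}},
        {"name": "edge-fw",  "vim": "k8s-edge-12",    "spec": {"replicas": 2}},
        {"name": "dp-accel", "vim": "baremetal-core", "spec": {"nic": "25G"}},
    ],
}

def decompose(node):
    """Walk the model; bottom-level elements commit resources via their named VIM."""
    if "vim" in node:
        return [VIMS[node["vim"]](node["spec"])]
    results = []
    for child in node.get("children", []):
        results.extend(decompose(child))
    return results

for commitment in decompose(service_model):
    print(commitment)
```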

Two factors would then drive diversification of NFVi.  The first is legitimate requirements for different hosting/networking features for virtual functions.   The second is attempts by infrastructure vendors to differentiate themselves by promoting different options.  If you presume model-based decomposition as I’ve described, you can accommodate both.  Whatever selectivity is needed to deploy a service, a model can define.  Whenever a vendor presents a different hosting option, they can be required to provide a decomposition model that handles it, which then makes it compatible with the service deployment and automation process overall.

Some New Insights on “Transformation”

Transformation is surely the key topic for telco executives, and that’s been true for over a decade.  I had a series of interesting discussions with telcos over the last three months, hoping to gain some clarity on the nature of their transformation plans.  Perhaps I shouldn’t have been surprised to find that there was little real detail available.  In fact, one senior planner gave me what I think was the comment that encapsulated the whole experience.  “Transformation,” he said, “is the process that turns our business from what it is to what we want it to be.”

Hope springs eternal, but it rarely generates results.  The comments got me thinking about just what transformation was, and was not, and that set me to doing some modeling on the topic.  I want to convey the results here.

The first point to emerge was that it is no longer possible for service lifecycle automation to create revolutionary changes in profit per bit.  I noted this in a couple of prior blogs, but the new models show that the best one-third share of addressable opex reductions (the share representing the highest ROI) has already been reaped, using point technology and less broadly efficient means.  The opportunity to use those savings, which amounted to almost nine cents of every revenue dollar, to fund a massive opex automation project was lost.

This is perhaps the most critical truth in the whole transformation story.  Cost efficiencies are typically the easiest to address in projects because it’s possible to establish the benefit case clearly and usually possible to frame a software and process model that can achieve it.  The problem was that we waited for over five years from the time the need for service lifecycle automation was identified to even catch a glimpse of the right path, and that glimpse has come from the cloud software model, not from the standards process that telcos (sadly, and misguidedly) still believe in.

The point here is simple.  The low opex apples have been picked, and even if the telcos were to suddenly realize that they needed only to employ cloud measures and concepts to reduce opex, decisive changes aren’t available today at the same high ROIs that were available five years ago.

If you can’t improve profit per bit by reducing cost per bit, you have to raise revenue per bit.  Higher revenue per bit means more spending on services, of course.  The second point that emerged from the models is that there are no new forms of connection service that have any positive impact on revenues.  The Internet revolution created two service communities, one focused on creating experiences (which we call “the OTTs”) and the other on delivering the connectivity needed by the first.  Operators comfortably hunkered down on the second mission, leaving the OTTs to the first.  To fix that, there’s no option other than for telcos to start mining the OTT revenue opportunity in some way.

Mining the revenue opportunity is not another term to describe providing that vanilla connectivity whose revenue per bit can only continue to decline.  If 5G is five times or ten times the speed of 4G, will it cost five or ten times as much?  If so, nobody will buy it.  Thus, no new telco service technology can succeed if it expects to deliver a lot better connectivity for little or no increase in price.  Price divided by bits, after all, can get bigger only if price goes up faster than bits do, and we all know that’s never going to happen.

OK, what pushes bits?  Network equipment.  If we have non-bit services, those non-bit services won’t come from network equipment, they’ll come from experience-creating equipment.  What kind of gear is that?  Data center hosting systems and software platforms.

That establishes our third point, which is that the success of transformation can be measured by the shift of telco investment from network devices to carrier cloud.  If operators spend (as they do today) way over 95% of their capital budgets on network equipment, they’re not transforming, much less transformed.  My models say that there has to be a shift of at least 15% of capex to carrier cloud to validate the notion of transformation.

Getting there is, and has been, the big issue.  I’ve modeled opportunity drivers for carrier cloud for five years now, and the modeling has consistently shown three phases to carrier cloud opportunity maturation.  The first phase, ending in 2020, is driven by “visible” and near-term opportunities, including NFV, mobile modernization (IMS/EPC), and video/advertising.  This phase should have created about three thousand carrier cloud data centers by 2020, and is falling short by over 50%, largely because operators took too long to define an approach to the opportunities, and when they did, failed to define one that could be implemented using current market tools.

The second phase of opportunity, ending in 2023, is related to mobile 5G modernization.  This driver, of course, has been present in the first phase but it’s already failing to realize the full opportunity.  The largest driver remains video/advertising, which is as I’ve said falling short even now, and operators show no signs of recognizing what IoT should produce for them.  Thus, the model says we’ll fall further behind the opportunity curve in the second phase.

If that happens, operators will have little chance of realizing that 15% capex shift, and little chance of a successful transformation.  However, even if operators get on the right track in terms of thinking, they will still face that problem of reliance on traditional standards rather than on cloud software.

Carrier cloud is a two-dimensional issue, one dimension “functional” and the other “mission”.  The functional dimension relates to the hardware/software platform on which carrier cloud is to be hosted, and the mission dimension to whether the application of the carrier cloud infrastructure is “internal” to the operator (used to support service creation) or “retail” to a customer, including enterprises.  The two interplay in significant ways.

Cloud infrastructure today is largely based on a series of layers, with the bottom being an efficient hardware platform with all possible network acceleration.  The second layer is a very high-performance hypervisor that creates very efficient VMs.  Layer three is a virtual-network layer that creates the connectivity needed among the elements both below (the VMs) and above.  The “above” is a container layer, which would form the foundation for both operator internal services and retail and enterprise services.

The challenge in picking technology for these structures is that the current focus of the cloud market is the cloud buyer, not the cloud seller.  Carrier cloud is, as I’ve said above, a combination of an internal and a retail mission.  The retail mission isn’t about using cloud services but providing them, and the internal mission of operators has requirements that public cloud services overall may not recognize.  High data-plane speed is a requirement for some telco missions, and not for most retail missions.

You can see an example of this in the area of service mesh.  A service mesh is a framework in which microservices deploy, including load balancing and related features.  It’s an essential part of a retail microservice-cloud application, and it would likely figure in internal telco applications…except perhaps for the issue of performance.  The top service mesh, Istio, is far slower (perhaps ten times slower) than native microservice connectivity would be.  Linkerd, the service mesh du jour of the Cloud Native Computing Foundation, is faster but still five times slower than native.  Telcos looking for a carrier cloud service mesh should be looking at this point, but none of the planners I talked with in the last three months were aware of it, and few were even aware of service mesh.

Virtual networking is another major issue.  The most important work being done with virtual networks is in the SD-WAN space, which currently focuses on enterprise use rather than on how virtual networking separates tenants in a multi-tenant provider network.  In addition, given that “network services” that include connectivity have to merge public address spaces with hosting and tenant isolation, they present a whole new set of challenges.  Operators need to address these, because cloud technology continues to focus on the retail buyer.

The risk the telcos are taking, and the risk they must avoid at all costs, is that they resolve the near-term carrier cloud opportunities using the same piece-based solution strategy they employed for opex.  They would then never build a correct carrier cloud platform, and never be able to combine the hosting of the various drivers in an efficient, optimum, way.  If that happens, they’re in big trouble with both carrier cloud and transformation.