In Search of a Model for the Cloud-Native Telco

They say that beneath every fable, lie, and marketing message there lies a kernel of truth. Is that statement a self-proof or does it give too much credit to falsehood? I can’t assess that, and probably don’t need to, but I do think I could try to apply it to a question I got on a blog I referenced on LinkedIn. Paraphrasing, the question was whether there was a role that cloud-hosted, cloud-native, microservice-based technology could play in telecom. Well, is there? Not surprisingly, the answer is “Yes, but….”

We have to start with an attempt to categorize what “cloud hosting” and “cloud-native” mean, and in particular how they differ. There are really three models of cloud hosting: IaaS virtual machines, containers, and functions/microservices. I’ve presented them in order of their “closeness” to the real hardware, closest first, so it would be fair to say that the progression I offered represents increasing overhead that could be expected to be added to message processing.

We also have to ask what “a role” would mean. It doesn’t make a lot of sense to stick a hosted router instance somewhere in a network and call it a transformation. What operators generally want is a broadly useful strategy, one that can replace enough specialized (vendor-specific, proprietary, expensive) devices with cheap hosted instances to make a difference in capex overall. That puts a lot of pressure on the hardware, hardware that’s designed to host applications and not push bits.

Whatever the model, nearly all “cloud hosting” is based on general-purpose CPUs (x86, ARM), and we wouldn’t have switching chips from players like Broadcom as the basis for white-box network devices if general-purpose CPUs were up to the job. It is possible to use general-purpose servers, in or out of the cloud, to host router instances, but operators aren’t all that excited about the idea.

About a decade ago, I had a long meeting with a Tier One operator about reducing capex. Their hope was that router software (remember Vyatta?) running on servers could replace routers, and they bought licenses to test that idea out. What they found was that it was possible to host what were essentially edge router instances on servers, but anything that had to carry transit traffic (metro or core) needed special hardware.

It wasn’t long after that when the NFV “Call for Action” was published and the NFV ISG was launched. From the first, the emphasis was on “appliances” more than on network routers and switches, and many of the early PoCs focused on virtualizing and hosting what were normally CPE functions, like firewalls. This dodged a lot of the performance problem, but even those PoCs ended up turning to a non-cloud hosting model, that of “universal CPE” or uCPE. NFV’s mission there was simply to load software onto an edge device, which frankly made all the standards work overkill. Would this have happened if virtualizing CPE, which was well within server limits, was really transformational? I don’t think so.

Where does this leave the cloud router? Answer: If by cloud router you mean “hosted in the cloud” router, nowhere. There is only one viable “cloud router” in all the world, and it’s DriveNets’ Network Cloud cluster of white boxes, which don’t rely on general-purpose servers in any form. Public-cloud routing is not cost- or performance-effective. Neither, in most cases, are any server-hosted routers. The only operators who haven’t rejected the hosted-in-the-cloud approach are those who haven’t tested it seriously.

So what does this mean for all the talk about “cloud-native”? I blogged about an article that, referencing an analyst report, predicted operators would move nearly half their network traffic to a “cloud-native” hosted framework. I said that there were zero operators telling me anything like that, but I didn’t go too much into the “why” of it. The answer is that issue of cost/performance effectiveness.

But there’s a deeper question, too, one I’ve referenced before. The 3GPP 5G work, and the successor expansion of O-RAN, included the notion of hosting VNFs that handled both the “user plane” and the “control plane” of a 5G network. The standards separate those two planes, but the way the features/functions are divided makes the “user plane” of 5G different from a pure IP transport network, the “data plane”. I speculated that the best way to approach the requirements of the UPF might be to think of them as a functional extension to traditional routing. But what about the control plane? That’s the deeper question.

The control plane of a mobile network is largely aimed at managing registration of devices and managing mobility of those devices. Like an IP network’s control plane (yeah, 3GPP reused a term here and that can create some confusion), the control plane of 5G doesn’t carry traffic but rather carries signaling, and signaling is a whole different world in terms of cloud and cloud-native.

5G signaling is made up of internal messages to support the management of mobile networks. A 5G user sitting in a cafe in NYC watching a 4K video could be pushing a lot of bits but generating essentially zero signaling activity, because they’re in the same cell through the entire experience, and once the path to that user in that cell is established, there’s nothing much going on to require a lot of signal exchanges, at least not exchanges that impact 5G service elements (RAN-level messages might be exchanged). No registration, no mobility management. Thus, signaling message traffic is surely way lower than user data traffic, and that means it’s probably well within levels that cloud elements could handle.

In theory, if 5G signaling is a good application for cloud hosting, we could expect to use any of the three hosting models I cited. However, the way that the 5G standards are written creates “functional boxes” that have “functional interfaces” just as real devices would. That seems to favor the use of virtual devices, which in turn would favor hosting in either VM or container form. You could easily write software to play the role of a 5G signaling element and stick it in a VM or container in the cloud (or, of course, in a data center or other resource pool).

What about “cloud-native”? We can now turn to defining it, and the most-accepted (though not universally accepted) definition is that “cloud-native” means “built on microservices”, and “microservices” are stateless nubbins of functionality. It also means, or should mean, the more general “designed to optimally realize cloud benefits”. The question, IMHO, is whether it would be possible to meet both definitions with a 5G signaling application. The answer is “Not if you strictly conform to the 3GPP/O-RAN model”.

This is the same problem that the NFV ISG created for itself a decade ago, with their release of a functional specification. Defining functions as virtual devices steers implementations relentlessly toward that model, and that model is not cloud-native. I did a presentation for the NFV ISG that described a cloud-native implementation, and what it showed was a pool of primitive functions that could be individually invoked, not a collection of virtual devices. The assignment of functions to virtual devices converts cloud-native into a kind of monolith.

In the cloud-native model, the signal messages would be routed to the primitive function (the microservice) designed to handle them. Since microservices are stateless, the presumption would be that (for example) a mobile device would have a “record” associated with it, wherein we stored its state relative to the mobile service. That state record would be accessed by a “master” signal-message microservice to determine where the message would be steered, so we could say that it would contain a state/event table. There might be any number of instances of any or all of the signal-message microservices, and steering to the right one would likely be done through a service mesh. It’s also possible that signal messages would carry state, and thus would be steered only by the service mesh.
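To make the state/event idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the state names, the event names, the `DeviceRecord` fields, and the `steer` function are hypothetical and are not taken from the 3GPP specifications or any real implementation. An in-memory dictionary stands in for the external record store, and a plain function lookup stands in for whatever a service mesh would do to pick a microservice instance.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class DeviceRecord:
    """Per-device state kept outside the microservices (hypothetical fields)."""
    device_id: str
    state: str = "DEREGISTERED"
    cell: Optional[str] = None


# A handler is a stateless function: it receives the record and the message,
# updates the record, and returns the next state.
Handler = Callable[[DeviceRecord, dict], str]


def handle_register(record: DeviceRecord, msg: dict) -> str:
    record.cell = msg["cell"]
    return "REGISTERED"


def handle_handover(record: DeviceRecord, msg: dict) -> str:
    record.cell = msg["cell"]
    return "REGISTERED"


def handle_deregister(record: DeviceRecord, msg: dict) -> str:
    record.cell = None
    return "DEREGISTERED"


# The state/event table: (current state, event) -> microservice to invoke.
STATE_EVENT_TABLE: Dict[Tuple[str, str], Handler] = {
    ("DEREGISTERED", "register"): handle_register,
    ("REGISTERED", "handover"): handle_handover,
    ("REGISTERED", "deregister"): handle_deregister,
}

# Stand-in for an external record store shared by all instances.
RECORDS: Dict[str, DeviceRecord] = {}


def steer(msg: dict) -> str:
    """The 'master' role: look up the record, pick the handler, run it."""
    record = RECORDS.setdefault(msg["device_id"], DeviceRecord(msg["device_id"]))
    handler = STATE_EVENT_TABLE.get((record.state, msg["event"]))
    if handler is None:
        return "rejected"  # event is not valid in the current state
    record.state = handler(record, msg)
    return record.state
```

Because the handlers hold no state of their own, any instance of any of them could process any message, which is what lets the number of instances vary freely. A call such as `steer({"device_id": "ue-1", "event": "register", "cell": "A"})` creates the record on first use and moves it to the registered state.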

The next obvious question would be how this would tie to the “user plane” where there was a signaling-to-user-plane interplay, like for the UPFs in 5G. This is where you’d need a mechanism for a signal microservice to send a message to a “router sidecar” function that could induce data plane behavior. For example (and only as an example), we could assume that the “router” was an SDN switch and the “router sidecar” that was messaged from the signaling plane microservice was the SDN controller.
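The signaling-to-data-plane hand-off can be sketched the same way, again only as an example. The `RouterSidecar` and `SwitchStub` classes, the request fields, and the `on_handover` function below are all hypothetical names invented here; none of this is a real SDN controller API. The sidecar plays the controller role, and the switch is reduced to a simple forwarding table.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SwitchStub:
    """Stand-in for a data-plane element: maps a device address to an output port."""
    flows: Dict[str, str] = field(default_factory=dict)


@dataclass
class RouterSidecar:
    """Hypothetical controller role: turns signaling requests into flow changes."""
    switch: SwitchStub

    def apply(self, request: dict) -> None:
        action = request["action"]
        if action == "set_path":
            self.switch.flows[request["device_ip"]] = request["out_port"]
        elif action == "clear_path":
            self.switch.flows.pop(request["device_ip"], None)
        else:
            raise ValueError(f"unknown action: {action}")


def on_handover(sidecar: RouterSidecar, device_ip: str, new_port: str) -> None:
    """What a signaling microservice might do once a handover completes:
    ask the sidecar to repoint the device's traffic. It never touches the
    switch directly."""
    sidecar.apply({"action": "set_path", "device_ip": device_ip, "out_port": new_port})
```

The point of the split is that the signaling microservice only sends a message; whatever sits behind the sidecar, an SDN switch or something else, is free to change without touching the signaling code.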

My point here is that so far, all the work that telco standards groups have done has pushed features into virtual boxes, thus moving them away from being “cloud-native”. If cloud-native is the preferred model for the cloud, that’s a very bad thing. But even for non-communications applications, cloud-native isn’t always a good idea because all the inter-microservice messages add latency and likely add hosting costs as well. It is very possible that virtual devices would be cheaper to deploy in either public cloud or carrier cloud resource pools, and would function better as well, with lower latency. Frankly, I think that’s very likely to be true for 5G control plane features. That, in turn, would mean that we shouldn’t be talking about “cloud-native” as a presumptive goal for everything involved in telecom-in-the-cloud.

A “cloud” is a resource pool, and while a public cloud today is based on x86 or ARM because that’s where the demand is, there’s no reason why we couldn’t build clouds from any resource, including white boxes. One of the interesting points/questions is whether we could build a hybrid white-box cloud by linking an open data plane to a cloud-native control plane via a user plane that could induce UPF behavior from white-box devices. Another is whether the way to harmonize a virtual device and cloud-native is to say that virtual devices are made up of cloud-native elements that are tightly coupled and so can function as a single element. Maybe we need to think a bit more about what “carrier cloud” really means, if we want to get “cloud-native” right there.

The biggest question of all, IMHO, is whether we should be thinking about re-thinking. Is the old 4G LTE Evolved Packet Core tunnel mechanism even the optimum “user plane” model? We’ve evolved both technology and missions for mobile networks. Maybe, with some thought, we could do better, and that’s a topic I’ll look at in a later blog.