What’s the Value of Cloud-Native in Network Software?

I’m sure that you, like me, have read plenty recently about “cloud native” technology in telecom. Given that hype seems to be omnipresent in tech these days, we have to ask whether there’s more “cloud-native-washing” going on than actual cloud-native. Rather than trying to survey all the claims, why not start by asking what meaning the concept could have, and see where that leads us to look for its application? For a change, I propose to start at the bottom.

Networks are a web of trunks and nodes that create routes for traffic to follow. There’s typically a star-of-stars topology, where access tendrils collect at some metro point, and metro points then collect regionally. We’ve all seen examples of this, both as a whole and in pieces, over our careers.
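
To make that shape concrete, here’s a minimal Python sketch of a star-of-stars hierarchy; the node names are invented, and the routing helper just traces the only path such a topology offers:

```python
# A star-of-stars topology: access nodes homed to metro hubs, metro
# hubs homed to a regional core. All names are hypothetical.
topology = {
    "regional-core": ["metro-1", "metro-2"],
    "metro-1": ["access-1a", "access-1b", "access-1c"],
    "metro-2": ["access-2a", "access-2b"],
}

def route(leaf_a, leaf_b):
    """Trace the only path two access nodes have: up to a hub, across."""
    parent = {child: hub for hub, kids in topology.items() for child in kids}
    if parent[leaf_a] == parent[leaf_b]:     # same metro: hairpin at the hub
        return [leaf_a, parent[leaf_a], leaf_b]
    return [leaf_a, parent[leaf_a], "regional-core", parent[leaf_b], leaf_b]

print(route("access-1a", "access-2b"))
# ['access-1a', 'metro-1', 'regional-core', 'metro-2', 'access-2b']
```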

Cloud-native refers to an application model, a style of construction that’s optimized for virtual/cloud deployment. That means the applications are designed to be scalable under load and redeployable in case of failure. In the real world, cloud-native optimizes the benefits of virtualization, which is the abstraction of resources so that an application that maps to the virtual is transportable to the real.
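
As a rough illustration of that contract, here’s a hedged sketch, not any real orchestrator’s API, of the two properties: a declared replica count that a reconciliation loop restores after failure, and a scaling rule that sizes the pool to offered load:

```python
# A minimal, hypothetical sketch of the cloud-native contract: declare
# desired state, and a reconciliation loop re-creates failed instances
# and scales under load. Names and numbers are invented.
from dataclasses import dataclass, field

@dataclass
class Deployment:
    name: str
    desired_replicas: int
    running: list = field(default_factory=list)

    def reconcile(self):
        # Redeploy on failure: replace instances until actual == desired.
        while len(self.running) < self.desired_replicas:
            self.running.append(f"{self.name}-{len(self.running)}")

    def scale(self, load, per_instance_capacity):
        # Scale under load: size the pool to the offered load.
        self.desired_replicas = max(1, round(load / per_instance_capacity))
        self.reconcile()

vrouter_cp = Deployment("vrouter-control", desired_replicas=2)
vrouter_cp.reconcile()
vrouter_cp.running.pop()      # simulate an instance failure...
vrouter_cp.reconcile()        # ...and the loop restores it
vrouter_cp.scale(load=900.0, per_instance_capacity=250.0)
print(vrouter_cp.desired_replicas, vrouter_cp.running)
```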

Both these last paragraphs state the obvious (to many, anyway), but they also reveal the first of our truths about cloud-native in networking. Networks, at the bottom where those trunks live, are essentially nailed to the ground. We can’t just say “OK, host the virtual router over there” when there may not be a trunk, or the right trunk, in that location. The scope of real resources suitable for mapping to bottom-level network functions is limited. However, as we rise up from the physical layer, what we’d generally call the “data plane,” we find that we have more and more latitude with regard to where the real stuff we map to could be located.
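
A little sketch makes the constraint plain. The sites and trunks below are invented; the point is that a data-plane function with a trunk requirement has almost no placement choices, while a function without one can go anywhere:

```python
# Placement latitude as a filter: data-plane functions can only map to
# sites that terminate the right trunk; higher-layer functions take any
# site. Site and trunk names are invented for illustration.
sites = {
    "metro-pop-1": {"fiber-east", "fiber-west"},   # trunk endpoints here
    "metro-pop-2": {"fiber-north"},
    "regional-dc": set(),                          # compute only, no trunks
}

def candidate_sites(needs_trunk=None):
    """Where could this function be hosted, given its trunk requirement?"""
    if needs_trunk is None:                        # control/management layer
        return list(sites)                         # full latitude
    return [s for s, trunks in sites.items() if needs_trunk in trunks]

print(candidate_sites("fiber-east"))   # ['metro-pop-1'] -- nailed down
print(candidate_sites(None))           # all three sites -- wide latitude
```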

Take 5G as an example. Towers are where they are, and so is backhaul fiber. We need to have data-plane resources that are very tightly coupled to the endpoints of real transmission media. If we move upward to the control plane, we still have to be somewhere generally proximate to the media and data plane, but the O-RAN reference to “Near-” and “Non-real-time” RAN Intelligent Controller (RIC) shows that there is, even in O-RAN, a widening scope of hosting options as we climb up from the data plane.
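
We can caricature that widening scope in a few lines. The 10 ms-to-1 s near-real-time and over-1 s non-real-time control-loop budgets follow the usual O-RAN characterization; the hosting descriptions are my own shorthand:

```python
# Loose latency tiers mapped to hosting scope. Budgets reflect the usual
# O-RAN control-loop characterization; hosting labels are my shorthand.
def hosting_scope(loop_budget_s):
    """The wider a function's latency budget, the wider its hosting options."""
    if loop_budget_s < 0.010:
        return "cell site / data plane (nailed to the media)"
    if loop_budget_s < 1.0:
        return "Near-RT RIC: metro edge, proximate to the towers"
    return "Non-RT RIC: regional or central cloud"

for budget in (0.001, 0.1, 30.0):
    print(f"{budget:>6}s loop -> {hosting_scope(budget)}")
```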

Let’s now shift back to my comfortable top-down-think. What’s at the top of a network? The answer is “operations systems”: NMS, OSS, and BSS. These functions are not only high above the data plane, but their scope of operation is also network-wide, which means there’s probably no specific single location to be close to at all.

The wide scope of operation is important in itself. Networks are not homogeneous, nor are the impacts of a problem. If one of the goals of cloud-native design is scalability, then we could expect a greater need for cloud-native in areas where scaling is more likely to be needed, which would likely correspond to places with higher device, trunk, and customer density. If high availability is another goal, then we could expect it to be most needed where faults are more likely, which could be places with power and environmental issues.
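
If you wanted to rank where cloud-native would pay off, a toy scoring model might look like the following; the weights and site data are invented purely for illustration:

```python
# A hypothetical scoring sketch of the argument above: cloud-native's
# scaling benefit tracks density, and its availability benefit tracks
# fault likelihood. All figures are made up for illustration.
sites = [
    {"name": "dense-metro",  "density": 0.9, "fault_risk": 0.2},
    {"name": "suburban-pop", "density": 0.5, "fault_risk": 0.3},
    {"name": "remote-site",  "density": 0.1, "fault_risk": 0.8},
]

def cloud_native_value(site, w_scale=0.5, w_avail=0.5):
    return w_scale * site["density"] + w_avail * site["fault_risk"]

for s in sorted(sites, key=cloud_native_value, reverse=True):
    print(f'{s["name"]:>12}: {cloud_native_value(s):.2f}')
```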

What emerges from this little exercise is a representation of cloud-native value shaped like a kind of inverted jello cone. The tip of the cone represents the lower-layer functions, so the cone is sitting tip-downward on the network. The base of the cone, now at the top, represents high-level operational functions. The jello property means that we can push on the cone and shift the top a fair amount, representing flexibility in hosting, but the bottom piece is anchored and so moves very little.

In the data plane, at the bottom of our cone, we are accustomed to “nodes” being “devices” in a singular sense, and that’s not unreasonable given the fact that there’s little flexibility in where we place them relative to the trunk interfaces they support. Physical media demand physical interfaces, and if we were to split a router into cooperative components, those components would have to be fairly close to each other in order to avoid introducing a lot of latency. Thus, a chassis router model is a smart one, creating a kind of large virtual router from the composite behavior of elements. The control plane of this router could be further separated to support hosting it on a nearby pool of elements, but it’s not likely that anyone would attempt to justify a control-plane resource pool; a bit of redundancy would serve.
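
Here’s a hedged sketch of that splitting constraint, with illustrative latency budgets (the real numbers would depend on the device and the services): forwarding components need chassis-scale proximity, while the control plane tolerates looser placement:

```python
# The splitting constraint: forwarding-path components must sit within a
# tight latency budget of each other, so a "chassis" grouping emerges;
# the control plane tolerates a looser budget and can sit on nearby
# redundant hosts. All figures below are purely illustrative.
INTRA_CHASSIS_BUDGET_US = 50      # forwarding-path components
CONTROL_PLANE_BUDGET_US = 5_000   # control-plane separation

host_latency_us = {               # latency from the trunk endpoint
    "same-rack": 10,
    "same-building": 200,
    "metro-pool": 2_000,
}

for host, latency in host_latency_us.items():
    roles = []
    if latency <= INTRA_CHASSIS_BUDGET_US:
        roles.append("data plane")
    if latency <= CONTROL_PLANE_BUDGET_US:
        roles.append("control plane")
    print(f"{host:>14}: {', '.join(roles) or 'neither'}")
```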

If we step up a titch from the data plane, we’d find a kind of fork in the road. Taking one fork leads us to IP control-plane functionality, notably adaptive or centralized route control. Taking the other fork leads us to service control planes like those of 5G. In both these cases, we can assume that controlling latency is much less critical, and so the functions of the control plane could be more distributed, which means that they could likely justify a resource-pool strategy. The question would be where the pool was located, and that relates mostly to the scope-of-operations issue.

Most 5G functionality is likely to reside at the metro level, where 5G RAN (New Radio, or NR) and Core are joined. That means that any pooling of resources would have to be a form of edge computing, and one thing we’d need to consider is the relationship between “the edge” as a kind of latency-specialized cloud and the O-RAN O-Cloud, which is controlled by the RICs. Do we need, in edge computing, an architecture that makes that relationship explicit, or do we simply share resources in some mediated way? In any event, while this functionality could be implemented in cloud-native form, we’d need to consider what specific cloud model we were being native to.

IP control-plane hosting, the route-management part that’s not embedded with the data plane, is best considered by looking at the centralized control plane of SDN (OpenFlow and the ONF model). Centralized route control’s latency requirements are minimal, because even the alternative, adaptive topology convergence, takes considerable time to happen. It’s likely that we could host this in the cloud, and that it would be suitable for cloud-native development.
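
To see why, consider that the controller’s job on a failure is just to recompute paths, something adaptive convergence already takes seconds to do. A plain shortest-path recomputation (ordinary Dijkstra, not any specific OpenFlow controller) over an invented topology:

```python
# Why centralized route control tolerates hosting latency: on a link
# failure the controller simply recomputes paths, a seconds-scale task.
# The graph is invented; this is generic Dijkstra, nothing vendor-specific.
import heapq

def shortest_paths(graph, src):
    """Plain Dijkstra over a {node: {neighbor: cost}} adjacency map."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 1},
    "C": {"A": 4, "B": 1},
}
print(shortest_paths(graph, "A"))         # before the failure
del graph["A"]["B"]; del graph["B"]["A"]  # link A-B fails
print(shortest_paths(graph, "A"))         # controller recomputes routes
```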

It’s the management/operations stuff, the things at the top of our inverted cone, that have the greatest flexibility in terms of hosting location, because they’re typically linked to human interactions, and humans aren’t particularly real-time. NMS, OSS, and BSS systems could be viewed as IoT-like, event-driven processes at the NMS level, gradually becoming more transactional as we move to OSS and BSS. It would surely be possible to implement an NMS in cloud-native form, but the OSS/BSS piece is more complicated.
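
The event-driven character of the NMS level is what makes it cloud-native-friendly: a stateless handler pool can grow or shrink freely because any instance can take any event. A minimal sketch, with hypothetical event names:

```python
# An event-driven NMS caricature: stateless workers pull events from a
# shared queue, so scaling is just adding workers. Event names invented.
import queue
import threading

events = queue.Queue()

def nms_event_handler(worker_id):
    """Stateless: any instance can take any event, so scaling is cheap."""
    while True:
        event = events.get()
        if event is None:          # sentinel: shut this worker down
            return
        print(f"worker {worker_id} handled {event}")

threads = [threading.Thread(target=nms_event_handler, args=(i,))
           for i in range(3)]
for t in threads:
    t.start()
for alarm in ("link-down", "cpu-high", "link-up"):
    events.put(alarm)
for _ in threads:
    events.put(None)
for t in threads:
    t.join()
```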

Transactional applications have limited scalability because they require database activity to complete. If we have a half-dozen instances of a database-update component, we’ll have to exercise some distributed database update/access discipline to keep our repository from getting corrupted and to ensure users always get correct data on a read. This limit to the scalability benefit could mean that OSS/BSS applications could justify cloud-native at the front end, while tending toward more traditional structures that serialize processing as we move to the database side.
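
The lost-update problem is the crux. In the sketch below, any number of front-end instances can run concurrently, but updates to the shared record funnel through a single discipline; a simple lock stands in for the database’s concurrency control:

```python
# Scalable front ends, serialized updates: without the lock, concurrent
# read-modify-write cycles could lose updates and corrupt the record.
import threading

balance = {"value": 0}
db_lock = threading.Lock()

def front_end(increment):
    # Any number of these can run concurrently (the scalable part)...
    with db_lock:                  # ...but the update itself is serialized
        current = balance["value"]
        balance["value"] = current + increment

threads = [threading.Thread(target=front_end, args=(1,))
           for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balance["value"])   # 100 -- correct because the writes serialize
```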

The notion that network software must be “cloud-native”, or even the claim that making it so is a good thing, is an oversimplification. We surely have opportunities for “microservice” application models, including orchestration and service meshing, but the specific way that this would be done is likely to change as we move from data-plane handling upward to OSS/BSS. It would be smart to keep that in mind when you consider cloud-native claims for telecom/network software implementations.