Fitting the Cloud Industry’s Cloud-Native Vision to Networks

I had a really interesting talk with a senior operator technology planner about a term we hear all the time, “cloud-native”.  She suggested that we substitute another term for it, “cloud-naïve” because she believed that there was “zero realism” in how the cloud in general and cloud-native in particular were being considered.  If that’s true, we could be in for a long slog in carrier cloud, so let’s look both at what the truth is and what seems to be happening.

The cloud industry, meaning the development community that both builds cloud development and deployment tools and creates cloud applications, doesn’t have a single consistent definition of the term.  The most general one is that cloud-native means building and running applications to exploit the special characteristics of cloud computing.  The Cloud Native Computing Foundation (CNCF) says “Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.”

This isn’t much of a prescription for cloud-nativeness, so to speak.  The reason, perhaps, that we don’t have such a nice recipe is that it’s not a simple issue.  There are many applications that shouldn’t be done in the cloud at all, and many for which containers, service meshes, and microservices are a truly awful idea.  There are even multiple definitions of all three of these things, multiple paths through which they could be implemented.  Thus, it’s not surprising that operators are vexed by the whole topic.  If we’re going to take a stab at making things clear, we have to do what any application developer has to do in considering the cloud—work through the functionality of the application to see where the cloud could be applied, and how.

A network service typically has three layers of functionality.  The primary layer is the data plane, where information moves through the network from source to destination.  Above it is the control plane, where in-service signaling takes place, and above that is the management plane, where service operations are focused.  Obviously these three layers are related, but from an implementation perspective they are so different we could almost consider them to be different applications.  That’s important when we consider cloud-native behavior.

The data plane is all about packet forwarding, which in my view means that the design of features there and the way they’re committed to resources has to be highly performance-efficient and reliable.  So much so that one could argue that as you get deeper into the network and traffic levels rise, you might want to fulfill data plane requirements outside the general cloud resource pool.  That doesn’t mean you can’t edge-host them, or that you can’t use cloud hosting where traffic permits.  It does mean that specialized missions don’t fit with general resource pools.

The control plane is a kind of transitional thing.  If you look at the IP protocol, you find that packets are divided into those that pass end to end and those that are actioned within the network.  The former are data-plane elements, and the latter are IMHO best thought of as representing events.

Events are where we see an evolution of software thinking, in part at least because of the cloud.  In classic software, you’d have a monolithic application running somewhere, with an “event queue” to hold events as they are generated and from which the application pulls work when it’s done with the last thing it was doing.  More recently, we had a queue that was serviced by a number of “threads” to allow for parallel operation.  Today, we tend to see event processing in state/event terms.
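To make that evolution concrete, here’s a minimal sketch in Go of the middle step: a single event queue serviced by a pool of parallel workers.  The event names and worker count are purely illustrative, not drawn from any particular product.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a generic signal pulled off the queue by whichever
// worker happens to be free.
type Event struct {
	Kind string
	Data string
}

func main() {
	queue := make(chan Event, 16) // the "event queue"
	var wg sync.WaitGroup

	// A pool of "threads" (goroutines) servicing the same queue
	// to allow for parallel operation.
	for w := 1; w <= 3; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for ev := range queue {
				fmt.Printf("worker %d handled %s(%s)\n", id, ev.Kind, ev.Data)
			}
		}(w)
	}

	// Events are queued as they are generated.
	queue <- Event{"LinkDown", "port-3"}
	queue <- Event{"LinkUp", "port-3"}
	close(queue)
	wg.Wait()
}
```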

State/event processing is based on the notion that a functional system is made up of a number of “operating states”, one of which is the baseline “running” state, with the others representing transitions to and from the running state based on conditions signaled by events.  The states represent contexts in which events are interpreted.  Human conversation works this way; if you hear someone say “Can you…” you assume that the next phrase will be a request for you to do something.  “Did you…” would be interpreted as asking whether you did.  The response, obviously, has to be interpreted in context, and state provides that.

The other benefit of state/event processing is that it “atomizes” the application.  Each state/event intersection (state/event processing is usually visualized as a table with rows representing events and columns representing states, or vice versa) consists of a “run-process” and a “next-state”.  The processes are then specific to the state/event intersection.  If you presume that the service data model is available (it’s probably what contains the state/event table), then that data model has everything the process needs, and you can kick off a process anywhere, in any quantity, with access to that model alone.
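As an illustration of that structure, here’s a small Go sketch of a state/event table.  The states, events, and processes are hypothetical; the shape is the point—a run-process and a next-state at every intersection, driven off a piece of the data model.

```go
package main

import "fmt"

// Illustrative states and events for a service element.
type State string
type EventKind string

const (
	Ordering State = "ordering"
	Running  State = "running"
	Failed   State = "failed"
)

// Model is the piece of the service data model a process receives;
// it carries everything the process needs.
type Model struct {
	Name  string
	State State
}

// Cell is one state/event intersection: a run-process and a next-state.
type Cell struct {
	Run       func(m *Model)
	NextState State
}

// The state/event table itself: states on one axis, events on the other.
var table = map[State]map[EventKind]Cell{
	Ordering: {
		"Activate": {Run: func(m *Model) { fmt.Println("deploying", m.Name) }, NextState: Running},
	},
	Running: {
		"Fault": {Run: func(m *Model) { fmt.Println("recovering", m.Name) }, NextState: Failed},
	},
}

// Dispatch finds the intersection, runs the process, and records the next state.
func Dispatch(m *Model, ev EventKind) {
	if cell, ok := table[m.State][ev]; ok {
		cell.Run(m)
		m.State = cell.NextState
	}
}

func main() {
	m := &Model{Name: "vpn-service-7", State: Ordering}
	Dispatch(m, "Activate") // deploying vpn-service-7; state -> running
	Dispatch(m, "Fault")    // recovering vpn-service-7; state -> failed
	fmt.Println("final state:", m.State)
}
```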

An example of a control-plane process is the topology and route exchange that happens in IP.  The associated packets, coming from a number of port sources, drive what’s in effect a state/event engine with a data model that’s the routing table.  This model might exist in each router/node in a traditional IP network, or in a centralized server in an SDN implementation.
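Here’s a stripped-down illustration of that idea—not of any real routing protocol—showing route-update events arriving as control packets and being applied against a routing table that serves as the engine’s data model.

```go
package main

import "fmt"

// A toy route-update event; real protocols carry far more,
// so treat this purely as an illustration of the data model.
type RouteUpdate struct {
	Prefix   string
	NextHop  string
	Withdraw bool
}

// The routing table is the shared data model the control-plane
// "engine" maintains as update events arrive from its ports/peers.
type RoutingTable map[string]string // prefix -> next hop

func (rt RoutingTable) Apply(ev RouteUpdate) {
	if ev.Withdraw {
		delete(rt, ev.Prefix)
		return
	}
	rt[ev.Prefix] = ev.NextHop
}

func main() {
	rt := RoutingTable{}
	rt.Apply(RouteUpdate{Prefix: "10.1.0.0/16", NextHop: "192.0.2.1"})
	rt.Apply(RouteUpdate{Prefix: "10.2.0.0/16", NextHop: "192.0.2.9"})
	rt.Apply(RouteUpdate{Prefix: "10.1.0.0/16", Withdraw: true})
	fmt.Println(rt) // map[10.2.0.0/16:192.0.2.9]
}
```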

The point here is that control plane behavior relating to packets that are actioned rather than passed, if it’s managed using state/event processing, lends itself to containerized, microservice-based, service-mesh-connected handling.  Cloud-native, in other words.

The management layer is another place where we have a tradition of monolithic, serial, processes.  OSS/BSS systems have generally followed that model for ages, in fact, and most still do despite the protestations of vendors.  There’s a difference between having an event-driven management model and having a management system that processes events.

An event-driven management process is what I’ve described in my ExperiaSphere work.  A new service (or in theory any new application or combination of functions and features) is represented by a model that describes how the elements relate and cooperate.  Each element in this model has an associated dataset and state/event table.  The initial form of the model, the “orderable” form, is taken by an order entry or deployment system and instantiated as a “service” (or “application”).  The model is then “sent” an “Order” event, and things progress from there.
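Here’s a hypothetical sketch (in Go, and not taken from the actual ExperiaSphere implementation) of the shape of that model: each element carries its own dataset and a stand-in for its state/event handling, and an “Order” event sent to the top of the model cascades down through the related elements.

```go
package main

import "fmt"

// Element is a hypothetical model element: it has a dataset, a state,
// and relationships to the elements it cooperates with.
type Element struct {
	Name     string
	State    string
	Data     map[string]string // the element's dataset
	Children []*Element        // how the elements relate and cooperate
}

// Handle is a minimal stand-in for the element's state/event table:
// on "Order" in the orderable state, the element deploys itself and
// passes the event down to the elements beneath it.
func (e *Element) Handle(event string) {
	if event == "Order" && e.State == "orderable" {
		fmt.Println("deploying", e.Name, "with", e.Data)
		e.State = "active"
		for _, c := range e.Children {
			c.Handle("Order")
		}
	}
}

func main() {
	// The "orderable" form of a simple service model.
	service := &Element{
		Name:  "business-vpn",
		State: "orderable",
		Data:  map[string]string{"sla": "gold"},
		Children: []*Element{
			{Name: "access-leg", State: "orderable", Data: map[string]string{"site": "nyc"}},
			{Name: "core-vpn", State: "orderable", Data: map[string]string{"tech": "mpls"}},
		},
	}
	service.Handle("Order") // things progress from there
}
```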

As this process shows, the entire service operations cycle can be assembled using a model structure to represent the functional elements of the service and a set of events to step things through the lifecycle.  There’s no need for monolithic processes, because every element of operations is defined as the product of a state and an event.

In cloud-native terms, this is where microservices and scalability come in.  The entire state/event system is a grid into which simple processes are inserted.  Each of these processes knows only what it’s passed, meaning the piece of the service data model that includes the state/event table containing the process reference.  The process, when activated, can be standing by in a convenient place already, spun up because it’s newly needed, scaled because it’s already in use, and so forth.  Each process is stateless, meaning that any instance of it can serve as well as any other, since nothing is used by the process other than what’s passed to it.
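A minimal sketch of what “stateless” means in practice, with assumed field names: the process receives the relevant piece of the model, acts on it, and hands the result back, holding nothing between invocations, so any instance can do the work.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ModelPiece is the fragment of the service data model passed to the
// process; the fields here are illustrative assumptions.
type ModelPiece struct {
	Element string `json:"element"`
	State   string `json:"state"`
	Event   string `json:"event"`
}

// HandleEvent is stateless: everything it needs arrives in the input,
// and everything it decides goes back in the output, so a standing,
// scaled, or freshly spun-up instance serves equally well.
func HandleEvent(in []byte) ([]byte, error) {
	var piece ModelPiece
	if err := json.Unmarshal(in, &piece); err != nil {
		return nil, err
	}
	// A trivial "run-process": acknowledge the event and set the next state.
	piece.State = "active"
	piece.Event = ""
	return json.Marshal(piece)
}

func main() {
	out, _ := HandleEvent([]byte(`{"element":"access-leg","state":"ordering","event":"Activate"}`))
	fmt.Println(string(out))
}
```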

What this proves to me is that since scalability and resiliency demand that you be able to spin up a process when needed, and that process then has to serve in the same way as the original, you need some form of stateless behavior.  That, in turn, supports the assertion that microservices are a reasonable requirement for cloud-native control and data plane activity.

Packet forwarding requires a packet to forward, and a process that gets a packet and then dies can’t expect a replacement to forward it; the storage of the packet in transit is stateful behavior.  Further, spinning up multiple packet-handlers to respond to an increase in traffic within a given flow could create out-of-order arrivals, which not all protocols will tolerate.  So, again, I think the forwarding piece of the story is a specialized mission.

Applying this to modern NFV and zero-touch automation implementations, it seems to me that creating virtual network functions (VNFs) as an analog to a physical network function or appliance means that there is no multi-planar functional separation.  The physical device is not cloud-native, and transporting features from that device into the cloud intact won’t create the separation of layers I’ve described, so it’s going to be difficult to make it cloud-native.

The implementation of NFV in a management/deployment sense, and the overall zero-touch operations automation process, could be made cloud-native providing that you had state/event-based logic that managed context and allowed stateless processes.  The logic of operations wouldn’t be written into an application whose components you could diagram, though.  There would be none of the blocks you find in NFV or ONAP diagrams.  Instead, you’d have systems of processes built as microservices and integrated via a model structure.

There is an interesting and not only unanswered but unasked question here, which is what an ideal cloud-hosted data-plane element would look like.  It’s not as simple as it sounds, particularly since at some point (or points) in a data path you’d have to separate control and management traffic.  This question is part of the broader question of what a hosted network element should look like, how its planes should connect, and how it should be hosted.  It’s hard to take cloud-native network transformation discussions seriously when we’re a long way from that answer.