Navigating the Road to Cloud-Native Network Functions

The NFV community accepts the need to modernize, but it’s more difficult for them to say what “modern” looks like.  Certainly there’s pressure for change, but the pressure seems as much about tuning terminology as actually changing technology.  Nowhere is this more obvious than in the area of “cloud-native”.

Virtual Network Functions (VNFs), the meat and potatoes of NFV, run in virtual machines.  That VM link creates two specific issues.  First, the number of VMs you can host on a server is limited, which means the mechanism isn’t efficient for small VNFs.  Second, a VM carries with it the overhead of a full OS and middleware stack, which not only consumes resources but also increases the operations burden.

One proposed solution is to go to “CNFs”, which some have called “cloud network functions” and others “containerized network functions”.  The latter is the better definition, because the approach is really about making containers work for VNF hosting, but even here some cynicism has crept in.  The lingua franca of container orchestration is Kubernetes, but a fair chunk (and perhaps a dominant one) of the NFV community is looking more at the OpenStack platform, since OpenStack was a part of the original NFV spec.

The other solution is to go all the way to “cloud-native”, which is a challenge given that the term is tough to define even outside the telco world.  We can fairly say that “cloud-native” is not just sticking VNFs in containers, but what exactly is it, and what does it involve?  I’ve mentioned cloud-native network functions (CNNFs) in prior blogs, but not really addressed what’s involved.  Let’s at least give it a go now.

A CNNF, to be truly “cloud-native”, should be a microservice, which means it should be a fairly small element of functionality and should not store data or state internal to the code.  That allows it to be instantiated anywhere, and allows any instance to process a given unit of work.  The biggest problem we have in CNNF creation, though, may be less the definition and more the source of the code itself.
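To make that statelessness rule concrete, here’s a minimal sketch in Python.  Everything in it (the store, the names, the session counter) is illustrative rather than drawn from any real framework; the point is simply that a CNNF keeps all state in an external store, so any instance can process any unit of work.

```python
# Minimal sketch of CNNF statelessness (illustrative names, not a real
# framework).  All state lives outside the function, so any instance,
# anywhere, can handle any unit of work.

class ExternalStateStore:
    """Stand-in for a shared store such as a distributed key-value store."""
    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value

def handle_unit_of_work(store, session_id, packet):
    # Fetch per-session context from the shared store; nothing is kept
    # in instance variables or globals between calls.
    context = store.get(session_id, {"packets_seen": 0})
    context["packets_seen"] += 1
    store.put(session_id, context)
    # Placeholder for the real function's transformation of the packet.
    return packet
```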

When the first NFV ISG meeting was held in Silicon Valley in 2013, there was a fairly vocal dispute over the question of whether we needed to worry about decomposition of current code before we worried about how to compose services from VNFs.  A few in the meeting believed strongly that if current physical network functions (PNFs) hosted in devices were simply declared “virtual” by extracting the device code and hosting it, the value of NFV would be reduced.  Others, myself included, were concerned for four reasons.

First, there would likely be a considerable effort involved in decomposing current code, and vendors who owned PNFs wouldn’t be likely to be willing to undertake the effort for free.  That would likely raise the licensing fees on VNFs and impact the business case for NFV.

Second, there would likely be pressure to allow decomposed PNFs to be reassembled in different ways, even mixing vendor sources.  That would require a standardization of how PNFs were decomposed, and the vendor-mixing process would surely reduce vendor interest.

Third, it was clear that if you composed a service from a chained series of VNFs, the network latency associated with the VNF-to-VNF connections could impact performance to the point where the result wouldn’t be salable at all.

Finally, there were clearly some network functions that were inherently monolithic.  It’s hard to decompose the forwarding plane of a device at the end of a packet trunk.  What would be the strategy for handling those?

In the end, the decision was made not to require decomposition of existing PNFs, and that was probably the right decision.  However, the question of whether to support decomposed PNFs was never even taken up, and that has proved unfortunate, because had there been such a decision, we might have considered the CNNF concept earlier.

The four points above, in my view then as now, really mean that there is no single model that’s best for hosting VNFs.  The key point in support of CNNFs is that they’re not likely to be the only “NFs” to be hosted.  My own proposal was that there be a service model for each service, that there be an element of the model representing each network function, and that the element specify the “Infrastructure Manager” needed to deploy and manage it.  That still seems, in some form at least, to be the best, and perhaps the only workable, starting point for CNNFs.  That way, whatever is needed is specified.
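Here’s a hypothetical rendering of that model-element idea in Python; the field names and IM labels are my own assumptions, not anything from the NFV specs.  Each element of the service model simply names the Infrastructure Manager that can deploy and manage it.

```python
from dataclasses import dataclass, field

# Hypothetical service-model element: each network function names the
# Infrastructure Manager (IM) that deploys and manages it.  All names
# here are invented for illustration.

@dataclass
class ModelElement:
    name: str                    # the network function this element represents
    infrastructure_manager: str  # the IM that knows how to deploy/manage it
    parameters: dict = field(default_factory=dict)

service_model = [
    ModelElement("edge-forwarder", "white-box-im", {"site": "subscriber-edge"}),
    ModelElement("routing-control", "container-im", {"replicas": 2}),
    ModelElement("legacy-firewall", "vm-im", {"image": "fw-vnf-1.2"}),
]

# "Whatever is needed is specified": deployment hands each element to
# the IM it names, rather than assuming one hosting model for everything.
for element in service_model:
    print(f"{element.name} -> {element.infrastructure_manager} {element.parameters}")
```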

Some network functions should deploy in white boxes, some in bare-metal servers, some in VMs, and some in containers.  The deployment dimension comes first.  Second comes the orchestration and management dimension, and finally the functional dimension.  This order of addressing network functions matters, because if we disregard it, we end up missing something critical.

The orchestration and management processes used for deployment have to reflect the things on both sides.  Obviously, we have to deploy on what we’re targeting to deploy on.  Equally obvious, we need to deploy the function we want.  The nature of that function, and the target of deployment, establish the kind of management and orchestration we need, and indirectly that then relates to the whole issue of how we define CNFs and CNNFs, and what we do differently in each.

If we want to exploit cloud-native anywhere, I think we have to accept that the network layer divides into the data/forwarding plane and the control plane.  The former is really about fast packet throughput and so is almost surely linked to specialized hardware everywhere but the subscriber edge.  The latter is about processing events, which is what control-plane datagrams in IP are about.  The control plane is quite suitable for cloud-native implementation.  The management plane, the way all of the elements are configured and operationalized, is a vertical layer if we imagine the data/control planes to be horizontal.
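A toy classifier illustrates the split.  The protocol and port numbers below are the real IANA values, but the two-way punt-or-forward decision is my simplification of what a real device does.

```python
from typing import Optional

# IP protocol numbers for common control-plane traffic: ICMP, IGMP, OSPF.
CONTROL_PLANE_PROTOCOLS = {1, 2, 89}

def classify(ip_protocol: int, dst_port: Optional[int] = None) -> str:
    # BGP runs over TCP port 179, so it's identified by port, not protocol.
    if ip_protocol in CONTROL_PLANE_PROTOCOLS or dst_port == 179:
        return "control-plane"   # punt to event-driven, cloud-hosted processing
    return "data-plane"          # keep on the fast path, often in hardware

print(classify(89))       # OSPF -> control-plane
print(classify(6, 443))   # HTTPS over TCP -> data-plane
```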

The management-plane stuff is important, I think, because you can view management as being event-driven too.  However, if you are going to have event-driven management, you need some mechanism to steer events to processes.  The traditional approach of the event queue works for monolithic/stateful implementations, but it adds latency (while things wait in the queue), doesn’t easily support scaling under load (because it’s not stateless), and can create collisions when events arrive fast enough that something already in the queue changes conditions while you’re still processing an earlier event.  The TMF NGOSS Contract approach is the right one; service models (contracts) steer events to processes.
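In code, the NGOSS Contract idea reduces to a state/event table carried in the service model.  This sketch uses invented states, events, and process names, but it shows the steering: the model, not a monolithic manager, decides which process sees each event.

```python
# Sketch of contract-driven event steering.  States, events, and process
# names are invented for illustration.

STATE_EVENT_TABLE = {
    ("ordered",   "activate"):   "deploy_process",
    ("deploying", "deployed"):   "confirm_process",
    ("active",    "fault"):      "remediate_process",
    ("active",    "deactivate"): "teardown_process",
}

def steer(service_state, event):
    # The service model's table maps (state, event) to a process; there
    # is no queue for events to age in, and any instance can do the lookup.
    process = STATE_EVENT_TABLE.get((service_state, event))
    if process is None:
        raise ValueError(f"no process for {event!r} in state {service_state!r}")
    return process

print(steer("active", "fault"))   # -> remediate_process
```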

The event-driven processes can be stateless and cloud-native, but they can also be stateful and even monolithic, provided they are executed autonomously (asynchronously) so they don’t hold up the rest of the processing.  Thus, you could in theory kick off transaction processing from an event-driven model as long as the number of transactions wasn’t excessive.
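Here’s a small asyncio sketch of that “autonomous execution” point, with invented names: the event handler kicks off a stateful back-end transaction as a background task and returns immediately, so event processing is never blocked.

```python
import asyncio

async def stateful_transaction(job_id):
    # Stand-in for a long-running, stateful (perhaps monolithic) process.
    await asyncio.sleep(1)
    print(f"transaction {job_id} complete")

async def handle_event(job_id):
    # Fire-and-forget: the handler returns at once while the transaction
    # runs autonomously in the background.
    task = asyncio.create_task(stateful_transaction(job_id))
    print(f"event {job_id} handled; transaction running asynchronously")
    return task

async def main():
    tasks = [await handle_event(i) for i in range(3)]
    await asyncio.gather(*tasks)   # only so the demo exits cleanly

asyncio.run(main())
```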

The hosting of all of this will depend on the situation.  Despite what’s been said many times, containers are not a necessary or sufficient condition for cloud-native.  I think that in most cases, cloud-native implementations will be based on containers for efficiency reasons, but there are probably situations where VMs or even bare metal are better.  There’s no reason to set a specific hosting requirement, because if you have a model-and-event approach, the deployment and redeployment can be handled in state/event processes.  If you don’t (meaning you have an NFV-like Virtual Infrastructure Manager), then the VIM should be specific to the hosting type.  I do not agree with the NFV approach of having one VIM; there should be as many VIMs as needed.
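The “as many VIMs as needed” point might look like this sketch, where a registry keyed by hosting type dispatches to a VIM specific to that type; the class and key names are invented.

```python
# One VIM per hosting type, dispatched from a registry (invented names).

class ContainerVIM:
    def deploy(self, nf): print(f"containers: deploying {nf}")

class VmVIM:
    def deploy(self, nf): print(f"VMs: deploying {nf}")

class BareMetalVIM:
    def deploy(self, nf): print(f"bare metal: deploying {nf}")

VIM_REGISTRY = {
    "container":  ContainerVIM(),
    "vm":         VmVIM(),
    "bare-metal": BareMetalVIM(),
}

def deploy(nf, hosting_type):
    # Each hosting type gets its own VIM, rather than one VIM for all.
    VIM_REGISTRY[hosting_type].deploy(nf)

deploy("legacy-firewall", "vm")
deploy("routing-control", "container")
```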

And then there’s latency.  If you are going to have distributed features make up services, you have to pay attention to the impact of that distribution on the quality of experience (QoE).  Stringing three or four functions out in a service chain over some number of data centers is surely going to introduce more latency than having the same four functions locally resident in a device at the edge.  The whole service-chaining idea was silly, in my view, but if latency can kill NFV service chains, what might it do to services built on a distributed set of microservices?  If you’re not careful, the same thing.
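A back-of-the-envelope calculation shows the scale of the problem; the numbers here are assumptions I’ve picked for illustration, not measurements.

```python
# Illustrative latency budget: chained across data centers vs. local.
PER_HOP_NETWORK_MS = 5.0   # assumed inter-data-center hop latency
PER_FUNCTION_MS    = 0.5   # assumed processing time per function
FUNCTIONS_IN_CHAIN = 4

# Service chain: each function is reached over a network hop.
chained_ms = FUNCTIONS_IN_CHAIN * (PER_HOP_NETWORK_MS + PER_FUNCTION_MS)

# Same functions resident locally in one edge device: no network hops.
local_ms = FUNCTIONS_IN_CHAIN * PER_FUNCTION_MS

print(f"chained: {chained_ms} ms, local: {local_ms} ms")
# chained: 22.0 ms, local: 2.0 ms -- an order of magnitude apart
```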

CNFs do have value, because containers have value in comparison to higher-overhead VMs.  CNNFs would have more value, but I think realizing either is going to require serious architecting of the service and its components.  Separation of the control and data planes is likely critical for almost any network function success, for example.  Even with that, though, we need to be thinking about how the control plane of IP can be harnessed, and perhaps even combined with “higher-layer” stuff to do something useful.  Utility, after all, is the ultimate step in justifying any technology change.