Why We Need to Totally Rethink our VNF Strategy to Make NFV Succeed

If we need to apply advanced cloud principles to virtual network functions (VNFs), what exactly would that mean for VNF design, onboarding, and management?  This is an important question because, from the operators’ almost universal perspective, the current VNF onboarding process is broken.  Management, in my own view, is at least as broken.  Fixing these issues isn’t trivial, and the effort could be wasted if we fall behind the market with respect to supporting cloud-think.

A VNF is a hosted version of a network feature/function, something that in most cases would otherwise have been performed by a device.  The VNF, like the device, would have some number of port/trunk data-plane connections, and in most cases would also have one or more management connections.  Usually the latter are port addresses associated with the address of the device itself.  Thus, the device (and the VNF that represents it in NFV) is part of the user’s address space.
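
To make that connection model concrete, here is a minimal sketch (all names and addresses are illustrative, not drawn from any NFV specification) of a VNF presenting port/trunk data-plane connections plus a management connection that shares the device’s own address:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Connection:
    name: str     # e.g. "lan-port-1" or "snmp-mgmt" (illustrative)
    role: str     # "port", "trunk", or "management"
    address: str  # address at which the connection is reachable

@dataclass
class VNF:
    device_address: str
    connections: List[Connection] = field(default_factory=list)

firewall = VNF(
    device_address="10.0.5.1",
    connections=[
        Connection("lan-port-1", "port", "10.0.5.1"),
        Connection("wan-trunk-1", "trunk", "10.0.5.1"),
        # The management connection rides on the device address, which is why
        # the VNF, like the device it replaces, sits in the user's address space.
        Connection("snmp-mgmt", "management", "10.0.5.1"),
    ],
)

assert all(c.address == firewall.device_address for c in firewall.connections)
```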

One thing the cloud absolutely does not tolerate is exposing the management and control interfaces of the cloud’s control software (OpenStack, etc.) in the users’ address space.  That exposure would allow hacking of shared infrastructure, which can never be allowed to happen, and an element common to two different address spaces can easily become a router, passing real traffic between what are supposed to be isolated networks.

When you deploy a VNF, then, you are putting the logic in some address space.  You have two options: you can put the VNF in the user address space, where the port/trunk connections can be directly exposed, or you can put it in a private space and use a NAT facility to expose the port/trunk interfaces.  The latter is obviously the best practice, since it prevents the NFV hosting and control framework from appearing in the user address space.  Practically every home network uses a private IP address (192.168.x.x), and the second option follows that same model for VNFs.
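
As a rough illustration of the second option, here is a sketch, with purely illustrative addresses, of a NAT mapping that exposes only the VNF’s port/trunk interfaces into the user space and leaves the management interface unreachable from it:

```python
PRIVATE_VNF_ADDRESS = "192.168.10.2"   # the VNF lives here, in the private space
USER_FACING_ADDRESS = "203.0.113.7"    # what the user's address space sees (illustrative)

# (user-visible address, port) -> (private address, port)
nat_table = {
    (USER_FACING_ADDRESS, 8080): (PRIVATE_VNF_ADDRESS, 8080),  # data-plane port
    (USER_FACING_ADDRESS, 8443): (PRIVATE_VNF_ADDRESS, 8443),  # data-plane trunk
    # deliberately no entry for the management port (e.g. 161): it stays private
}

def translate(dst_address: str, dst_port: int):
    """Return the private endpoint for an inbound packet, or None if nothing is exposed."""
    return nat_table.get((dst_address, dst_port))

assert translate(USER_FACING_ADDRESS, 8080) == (PRIVATE_VNF_ADDRESS, 8080)
assert translate(USER_FACING_ADDRESS, 161) is None  # management never appears in user space
```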

In this approach, the management interfaces present an issue since they are now also in a private address space.  If you NAT them to expose them to the user, as would normally be the case, then you cede device management to the user, which makes it hard to provide automated service lifecycle management, even just to parameterize the VNF on deployment.  If you share the port somehow by NATing the NFV management system into the user space, you still have the problem of collision of commands/changes.  So logically you’d need to have a kind of two-step process.  First, you connect the NFV management software to the VNF management interfaces within the private address space.  Second, you create a proxy management port within the NFV software, which at the minimum accepts user management commands and serializes them with commands/requests generated by the NFV software.
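
A minimal sketch of that two-step idea, assuming hypothetical interfaces of my own (no real NFV or device-management API is being invoked), might look like this: the proxy accepts commands from both the user and the NFV management software and serializes them before anything reaches the VNF’s private management interface.

```python
import queue
import threading

class ProxyManagementPort:
    def __init__(self, send_to_vnf):
        # send_to_vnf stands in for whatever actually talks to the VNF's
        # management interface inside the private address space (step one).
        self._send_to_vnf = send_to_vnf
        self._commands = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit_from_user(self, command: str):
        self._commands.put(("user", command))

    def submit_from_nfv(self, command: str):
        self._commands.put(("nfv", command))

    def _drain(self):
        # One command at a time: user and NFV changes are serialized in arrival
        # order, so they cannot collide at the VNF's management interface.
        while True:
            source, command = self._commands.get()
            self._send_to_vnf(f"[{source}] {command}")
            self._commands.task_done()

if __name__ == "__main__":
    port = ProxyManagementPort(send_to_vnf=print)
    port.submit_from_user("set firewall rule allow tcp/443")
    port.submit_from_nfv("set logging-level debug")
    port._commands.join()
```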

In the cloud, whether it’s VMs or containers, everyone involved in deployment is intimately aware of address spaces, what goes into them, and how things are exposed from them.  Fundamental to cloud deployment is establishing addressability for the components of an application, both among themselves and with respect to their users.  This is where VNF onboarding should start, and yet we hear little about it.

Suppose we resolve addressing.  That’s all great, but it still doesn’t let you onboard VNFs easily.  The problem is that each VNF has different management requirements, even across different implementations of the same function.  That makes it hard to slot a given VNF in; there’s not likely to be a compatible “socket” for it.  Thus, the next step is to use that proxy management port process you’ve created to translate the VNF’s specific management interface into a generic one.  Think of this as a kind of device-class MIB.  All firewalls would share the same device-class MIB, and whether a real device or a VNF is managed by NFV, its real MIB is proxied by your management port process into (and, for writes, from) the device-class MIB.  Thus, the proxy management port includes a kind of plugin (of the kind we see in OpenStack Neutron) that adapts a generic interface to a specific one.
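
Here is a minimal sketch of that plugin idea, with a made-up device-class MIB and vendor variable names (nothing here comes from a real standard): the adapter proxies reads from the VNF’s own MIB into the device-class MIB and translates device-class writes back the other way.

```python
# Hypothetical generic "device-class MIB" for firewalls.
GENERIC_FIREWALL_MIB = {"admin_state", "rules_loaded", "packets_dropped"}

class VendorXFirewallPlugin:
    # generic variable name -> this vendor's specific variable name (both invented)
    _to_vendor = {
        "admin_state": "fwAdminStatus",
        "rules_loaded": "fwRuleCount",
        "packets_dropped": "fwDropTotal",
    }
    _to_generic = {v: k for k, v in _to_vendor.items()}

    def read(self, vendor_mib: dict) -> dict:
        """Proxy the VNF's real MIB into the device-class MIB."""
        return {self._to_generic[k]: v for k, v in vendor_mib.items()
                if k in self._to_generic}

    def write(self, generic_sets: dict) -> dict:
        """Translate device-class SETs into vendor-specific SETs."""
        return {self._to_vendor[k]: v for k, v in generic_sets.items()
                if k in GENERIC_FIREWALL_MIB}

plugin = VendorXFirewallPlugin()
print(plugin.read({"fwAdminStatus": "up", "fwDropTotal": 42}))  # generic view for NFV
print(plugin.write({"admin_state": "down"}))                    # vendor-specific SET
```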

A VNF that’s ready for onboarding would then consist of the VNF functional element(s) and the proxy management port “plugin” that adapts its MIB to the standard structure of the device class.  You could, of course, write a VNF to use the device-class standard, in which case all you’d have to do is NAT the management port into the NFV control/management address space.
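
Purely as an illustration (none of these field names come from any onboarding specification), an onboard-ready package might then be described like this:

```python
onboarding_package = {
    "device_class": "firewall",
    "functional_elements": ["vendorx-firewall.qcow2"],  # the VNF image(s); hypothetical name
    "management_plugin": "VendorXFirewallPlugin",       # adapts the vendor MIB to the device class
    "plugin_required": True,  # False if the VNF was written to the device-class MIB directly
}
```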

So, we’re done, right?  Sadly we’re not, as the cloud has already shown.  We have DevOps tools to deploy cloud elements, and while these tools go a long way toward standardizing the basic deployment task, you still have lifecycle management issues.  What do you do if you get an abnormal condition?  If you presume that there is a standard MIB for each device class, then it follows that each standard device class would also have a standard lifecycle response structure.  That means you’d have standard “events” generated by MIB conditions, and these would result in standard responses and state changes.  A response might in turn result in a SET for a MIB variable, in which case the plugin (stub) element in the proxy management port would translate the change on its way into the VNF (or device).
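
A minimal sketch of such a per-device-class lifecycle structure, with hypothetical states, events, and responses of my own, might look like this; a SET response would pass through the proxy plugin on its way into the VNF:

```python
# All state, event, and response names below are hypothetical.
LIFECYCLE_TABLE = {
    # (current state, event):      (next state,   response)
    ("active",   "overload"):       ("degraded",   {"set": {"admin_state": "throttle"}}),
    ("degraded", "load_normal"):    ("active",     {"set": {"admin_state": "up"}}),
    ("active",   "heartbeat_lost"): ("recovering", {"action": "redeploy"}),
}

def handle_event(state, event, translate_set=lambda sets: sets):
    """Apply the standard response; in practice translate_set is the proxy plugin's write()."""
    next_state, response = LIFECYCLE_TABLE.get((state, event), (state, None))
    if response and "set" in response:
        print("apply to VNF MIB:", translate_set(response["set"]))
    elif response:
        print("orchestrator action:", response["action"])
    return next_state

print(handle_event("active", "overload"))        # -> degraded, with a SET pushed to the VNF
print(handle_event("active", "heartbeat_lost"))  # -> recovering, with an orchestrator action
```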

Even this isn’t quite enough, because some lifecycle conditions aren’t internal to the VNF.  An example is horizontal scaling, which means instantiating multiple parallel instances of a function to improve load-handling.  At the minimum, horizontal scaling requires load balancing, but simple load balancing only works if you have truly stateless behavior.  Parallel routers (for example) are not truly stateless, because packets could pass each other on the parallel paths created, and that generates out-of-order arrivals that might or might not be tolerated by the application.  A good general model of scaling is more complicated.

Assume we have a general model of a scalable device: a striper function that divides a single input flow into multiple parallel flows and schedules each flow to a process instance; a series of those process instances; and a de-striper that combines the parallel flows into a single flow again, doing whatever is required to preserve packet ordering if that’s a requirement.  If you don’t have out-of-order problems, then the striper function is a simple algorithmic load-balancer, and no de-striper is required beyond collecting the process flows into a single output port.
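
Here is a minimal sketch of that striper / parallel-instance / de-striper model, under simplifying assumptions of my own (in-memory lists stand in for real flows, and sequence tags stand in for whatever resequencing mechanism would actually be used):

```python
from itertools import count

def striper(packets, n_instances):
    """Split one input flow into n parallel flows, tagging packets with sequence numbers."""
    flows = [[] for _ in range(n_instances)]
    seq = count()
    for packet in packets:
        tag = next(seq)
        flows[tag % n_instances].append((tag, packet))  # simple algorithmic scheduling
    return flows

def process_instance(flow):
    """Stand-in for one parallel instance of the scaled function."""
    return [(tag, packet.upper()) for tag, packet in flow]

def destriper(flows):
    """Merge the parallel flows back into one flow, restoring packet order by tag."""
    merged = [item for flow in flows for item in flow]
    return [packet for _, packet in sorted(merged)]

packets = ["pkt-a", "pkt-b", "pkt-c", "pkt-d", "pkt-e"]
parallel = [process_instance(flow) for flow in striper(packets, n_instances=2)]
print(destriper(parallel))  # ordering preserved despite parallel processing
```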

The point here is that we see all of this in the cloud even now, and in most of the service-chained VNFs we see an example of what might be called a pipelined microservice, particularly if we assume that operators would induce vendors to decompose device software into elements that would then be assembled.  Sure, a VNF component might simply be bound with others into a single image, but the cloud teaches us not to ignore the general case.  In that general case, the components of a VNF that replaces a typical port/trunk device are likely to live in a pipeline relationship, and we should be able to distribute them along the data flow.
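
As a toy sketch of that pipelined shape, with invented stage names, the decomposed components simply become stages strung along the data flow, whether they are bound into one image or distributed across hosts:

```python
def classify(packet):
    return {**packet, "class": "web" if packet["dport"] == 443 else "other"}

def firewall_stage(packet):
    return packet if packet["class"] == "web" else None  # drop anything not allowed

def shape(packet):
    return {**packet, "queue": "priority"}

PIPELINE = [classify, firewall_stage, shape]  # components chained along the data flow

def run_pipeline(packet):
    for stage in PIPELINE:
        packet = stage(packet)
        if packet is None:  # dropped somewhere along the path
            return None
    return packet

print(run_pipeline({"dport": 443}))  # passes every stage
print(run_pipeline({"dport": 23}))   # dropped at the firewall stage
```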

The cloud is not hosted virtualization, or if it is then it would have limited impact on IT over the long term.  NFV is not just a matter of sticking static functions in static boxes.  It needs the same level of dynamism as the cloud needs, but the cloud is already moving to support that kind of dynamism, and it’s already accepted the most basic truth, which is that static applications build static solutions no matter what you lay underneath them.  VNFs will have to be designed to optimize NFV just as applications have to be designed to optimize the cloud, and it’s way past time for us to accept this and act on the knowledge.