Is NFV Really the “Training Wheels” of Cloud-Native?

I saw something recently in Fierce Telecom, with the tag line that if NFV was training wheels, then cloud-native was the real deal.  It’s heartening in one sense that we’re facing the fact that NFV is behind the times, perhaps hopelessly so, but it’s an oversimplification to say that “cloud-native” is the future of NFV.  Unless, of course, we redefine what we mean by “cloud-native”.

The Cloud Native Computing Foundation (CNCF) says “Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.”  We could take this very literally, or rather loosely, and the difference between the two is the gap into which oversimplification of the NFV-to-cloud-native transformation might fall.

I’m a software architect by background, and I’ve developed, designed, and managed a lot of real-time projects.  Many of these have included building protocol handlers, software elements designed to process network protocols, including IP.  I can tell you from experience that these applications are incredibly sensitive to latency.  The biggest network problem I ever saw in a consulting role was created in an IP network through failure to consider latency, and it created a state-wide healthcare failure.

If we were to build an application designed to do what, broadly, NFV was designed to do, and if we implemented it as a containerized, microservice-based, service-meshed architecture, it is almost certain that the design would create delays of a magnitude that would make it useless for handling network traffic flows.  If that’s what we mean by “cloud-native”, then NFV is closer to the right answer for network functions.  But is it what we mean?  We need to look at the things in the CNCF definition and decide whether “cloud-native” means virtual functions are constructed entirely from that set of tools, or that there’s some broader principle that guides tool assignment, one that admits things like microservices and service meshes but doesn’t demand them.

Let’s start with containers.  I believe that containers are the foundation of future software development and deployment, no matter what the software is and no matter where it’s used.  Containers were around when NFV was launched, and there was no reason to make NFV so virtual-machine-specific, so in that sense, since containers are surely an element of cloud-native design, NFV could benefit from adopting them (which the ISG is sort-of-trying to do).  However, it is true that containers do not isolate their components from each other as well as VMs do, and it’s possible to write higher-performance applications in VMs than in containers.  It’s also possible to write better bare-metal ones than VM ones.  The thing NFV missed here isn’t “containers”, it’s the need to embrace whatever form of hosting is most effective.

OK, we’ll move on to microservices.  A microservice is a software component that’s designed to be scaled and replaced easily, which means that it doesn’t maintain operating variables within the software, where they’d be lost if another instance was spun up.  It doesn’t mean that they’re stateless, only that they keep state externally in some way.  Microservice concepts and design weren’t given that name twenty or thirty years ago, but they were around.
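The externalized-state point can be sketched in a few lines of code.  This is purely illustrative: the class and the dict standing in for an external store (think Redis or etcd) are my own hypothetical names, not from any real framework.

```python
# Sketch of externalized state: the service instance holds no operating
# variables of its own, so a replacement or scaled-out instance picks up
# exactly where the last one left off.  "state_store" stands in for an
# external store such as Redis; all names here are illustrative.

class StatelessCounterService:
    def __init__(self, state_store, key):
        self.store = state_store   # external state, survives this instance
        self.key = key

    def handle_request(self):
        # Read-modify-write against the external store; nothing is kept
        # in instance variables between calls.
        count = self.store.get(self.key, 0) + 1
        self.store[self.key] = count
        return count

state_store = {}                       # stands in for Redis/etcd/etc.
a = StatelessCounterService(state_store, "requests")
a.handle_request()                     # -> 1
a.handle_request()                     # -> 2

# "Spin up another instance": it sees the same state immediately.
b = StatelessCounterService(state_store, "requests")
b.handle_request()                     # -> 3
```

The instance is disposable; the state isn’t.  That’s the whole trick, and it was a known technique long before anyone called it a microservice.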

The problem with microservices is that if you divide features and then network-distribute them, you create latency and failure risk.  If you take a logical single function, one that has to be run entirely for every unit of work, then splitting that into five or so sequentially executed microservices isn’t going to accomplish anything but slowing it down.  Protocol handling is usually this kind of pipeline, so NFV is right in this area too.

Service meshes are ways of routing work to a web of distributed microservices; they unite microservices into applications.  With protocol handling, you don’t randomly distribute processes that handle work, you string them in the path between source and destination, and so the data path is the workflow.  You don’t need a mesh to unite pieces if (for latency reasons) you can’t divide things in the first place.  Another win for NFV.

At this point, you’d think I was disagreeing totally with the idea that NFV has to transform to cloud-native.  You’d also be wondering why, in that case, I’ve expressed such sharp criticism of how NFV has been implemented.  There are three reasons why NFV and cloud-native are training-wheels-versus-triathlon.

The first is that cloud design principles would have created a totally different way of thinking about NFV than the one the ISG devised, a much better way.  The key point is to think not of network functions, meaning things translated from device implementations of a monolithic model of handling, but rather of services-as-functions, which would treat network features like “routing” as a successive set of steps, each of which could be collected with others or distributed, as needed.  The point is that cloud-native doesn’t demand everything be microservices.  If it did, I’d say now and with great conviction that we’d never get there.  We couldn’t.

The second reason is that a network device is really not a network function, as the PNF-to-VNF translation NFV proposes would make it.  It’s really three functions.  The first is data-plane packet movement, the second is responses to control-plane packets, and the third is the management plane.  Just because we might, for reasons of latency, have to keep the data-plane implementation on an efficient hosting platform, even an open appliance or bare-metal server, doesn’t mean the other stuff shouldn’t be a candidate for microservices and service meshes.

The third reason is that we could actually consider cloud-native microservices and service meshes as a form of network routing.  What does a service mesh do if it isn’t routing work?  What does cloud-native scalability and resilience do, if not provide some of what adaptive routing was designed to do?  We might well want to look at the concept of each network function, from routing to firewalls, and ask whether the concept doesn’t have a new implementation option.  Could service-mesh-like discovery of services be applied to learn a “route” that is then followed without further use of a high-latency process?

Our risk here is to oversimplify the concept of cloud-native, making it into some silly formula like containers plus microservices times service mesh equals cloud-native.  Cloud-native is an application and deployment model designed for abstract pools of resources and based on resilient implementations and dynamic coupling.  We can’t get NFV to cloud-native with microservices, meshes, or even containers.  We can get there only by learning cloud-think.