Keeping Up with the Cloud: The Developments that MUST Guide SDN and NFV

When we design a transformation strategy for operators today, we’re really designing something to be deployed at scale in perhaps 2020 or 2021.  The telco world has the longest capital cycle of any technology sector, with elements that are expected to live for ten years in many cases, sometimes even more.  It’s critical for virtualization in any form to satisfy current requirements, but it’s just as critical that it support the longer-term trends.  Otherwise, transformation capital and effort are vulnerable to being fork-lifted out just as they’re getting started.  Are our “virtualization revolution” strategies, like SDN and NFV, looking forward?  I don’t think so, at least not far enough.

While the ten visionary operators who launched NFV back in 2012 didn’t envision it in these terms, what they were doing was applying the virtualization principles of the time to the problems of the time.  We do have a fair measure of how those problems are evolving, so we can presume that the requirements at the business level are known quantities.  The virtualization technology is another matter.

The very first NFV paper stated that “Network Functions Virtualization will leverage modern technologies such as those developed for cloud computing.”  At the time (October 2012) that meant leveraging IaaS hosting.  We are on the cusp of a cloud revolution that’s being created by going beyond IaaS in a decisive way.  Doesn’t that mission statement back in 2012 mean that NFV should leverage the technology elements of this revolution too?  Surely, given how long it would take to transform networks with NFV, the state of the cloud will have moved forward by the time it’s being broadly deployed.  Surely, NFV should then be based on the leading-edge cloud stuff that would prevail at that point.

The evolution of cloud computing, at a high level, takes it from being a passive outsourced-hosting framework to an active, fully distributed development platform for applications.  We’ve had platform-as-a-service clouds almost from the first (Microsoft Azure is the best-known), but what’s now happening is that IaaS is transforming into a kind of PaaS model that I’ve called “features-as-a-service” or FaaS to distinguish it from the past approach.  Both Amazon and Microsoft have added about two dozen features that in the old days we’d call “middleware” or “services”.  These, accessed through APIs, let developers build applications that are specific to the cloud.  Some facilitate agile, distributable cloud development, and others perform tasks that should be done differently (and often can be done better) in the cloud.
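To make “features accessed through APIs” concrete, here’s a minimal sketch using Amazon’s boto3 SDK to write to DynamoDB, one of those managed “middleware” features.  The table name and item fields are my own hypothetical choices, not anything from a real service design.

```python
import boto3

# DynamoDB is consumed as a service feature through an API call, not as
# software the developer deploys and operates. Table name and fields are
# hypothetical; credentials come from the normal AWS configuration chain.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
sessions = dynamodb.Table("service-sessions")

sessions.put_item(Item={
    "session_id": "abc-123",   # partition key assumed for this sketch
    "state": "active",
})
```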

This new vision of the cloud is already clearly visible in the announcements of the cloud providers (Amazon’s Greengrass and Snowball most recently).  My modeling says that the new cloud vision will transform cloud usage decisively right around the 2020 milestone, which is also when we could expect to see a business-justified NFV model deploying.  The signs of the cloud transformation will be even clearer in 2017 than they are today, and they’ll be inescapable by 2018.  Thus, IMHO, a failure to consider the impact of this transformation on carrier virtualization in all its guises could stall progress while carrier visions of virtualization catch up with the public cloud.

What would that catch-up involve?  What specific major advances in cloud computing should or must be incorporated in the SDN and NFV vision?  We can’t answer a question like that in detail in a single blog, but I can present the questions and issues here and develop them in subsequent blogs.  So, let’s get started.

The primary attribute of the cloud of the future is that it expects to host fully distributable and scalable elements.  IaaS is nothing but hosted server consolidation.  Future cloud computing applications will be written to be placed and replicated dynamically.  Even today’s NFV specifications envision this kind of model, but they don’t define the basic design features that either virtual network functions (VNFs) or the NFV control software itself would have to adopt to enable all that dynamism.

For the VNFs, the problem is pretty clear.  The current approach is almost identical to the way that applications would be run on IaaS.  Every application/VNF is different, and that means that no standard mechanism will deploy VNFs or connect them to management systems.  There’s a goal of making the VNFs horizontally scalable, but nothing is said about how exactly that could be done.  In the emerging FaaS space we have distributed load balancing and, most important, we have state control practices and tools to ensure that the applications that are expected to scale in and out don’t rely on data stored in each instance.
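To make the state-control point concrete, here is a minimal sketch, entirely my own illustration rather than anything from the NFV specs, of a VNF request handler written so that per-session state lives in a shared store instead of in the instance.  The Redis host name and the “processing” step are hypothetical.

```python
import redis

# Hedged sketch: session state lives in a shared store, not inside the VNF
# instance, so any replica can handle any request after a scale-out or
# scale-in event. Host name and "processing" below are illustrative only.
store = redis.Redis(host="state-store.example.net", port=6379)

def handle_request(session_id: str, payload: bytes) -> bytes:
    state = store.get(session_id) or b""      # read shared, not local, state
    new_state = state + payload               # stand-in for real VNF work
    store.set(session_id, new_state, ex=300)  # write back with a 5-minute TTL
    return new_state
```

Because no instance holds anything a peer can’t fetch, instances can be added or removed behind a load balancer without breaking sessions, which is exactly the property the current VNF model leaves undefined.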

To return to the Amazon announcement, the Greengrass software includes what Amazon calls “Lambda”, a name drawn from the “lambda expressions” of functional programming.  A lambda expression is software designed to operate on inputs and produce outputs without storing anything inside.  You can string lambdas together (“pipeline” them) to create a complex result, but the code is always stateless and simple.  That makes it ideal for basic data-flow operations in a distributed environment, because you can send such an expression anywhere to be hosted and replicate it as needed.  It’s waiting, it’s used, it’s gone.  If the cloud is moving to this, shouldn’t NFV be supporting it too?
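A toy Python sketch of the idea: each stage below is a pure function, so any copy of any stage, hosted anywhere, produces the same result.  The event structure is invented purely for illustration.

```python
from functools import reduce

# Two pure, stateless stages: output depends only on input, and nothing is
# stored inside either function between calls.
normalize = lambda event: {**event, "payload": event["payload"].strip().lower()}
classify  = lambda event: {**event, "alert": "error" in event["payload"]}

def pipeline(event, stages=(normalize, classify)):
    # Stateless stages can be strung together, replicated, and discarded
    # freely: it's waiting, it's used, it's gone.
    return reduce(lambda acc, stage: stage(acc), stages, event)

print(pipeline({"payload": "  Link ERROR on port 3  "}))
# {'payload': 'link error on port 3', 'alert': True}
```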

For the NFV control software, we have a more explicit problem.  The framework itself is described as a set of monolithic components.  There’s no indication that the NFV software itself can scale, or fail over, or be distributed.  What happens if a VNF fails?  The NFV people would tell you that it’s replaced dynamically.  What happens if the NFV MANO process fails?  Who replaces it, given that MANO is the element that’s supposed to do the replacing?
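The cloud already has well-known answers to that question, and a hedged sketch of one of them follows: run several replicas of the control process and let them contend for an expiring leader lock in a shared store, so a failed MANO instance is superseded automatically.  The store, key names, and timings here are all my own illustrative choices.

```python
import time
import redis

# Illustrative leader election: multiple control-process replicas run
# concurrently, but only the holder of an expiring lock does the work.
store = redis.Redis(host="state-store.example.net", port=6379)
INSTANCE_ID = "mano-replica-2"   # hypothetical identity of this replica

def i_am_leader() -> bool:
    # SET ... NX EX 10: acquire the lock only if no live leader holds it.
    if store.set("mano-leader", INSTANCE_ID, nx=True, ex=10):
        return True
    return store.get("mano-leader") == INSTANCE_ID.encode()

while True:
    if i_am_leader():
        store.expire("mano-leader", 10)   # refresh the lease while healthy
        # ... orchestration work would go here ...
    time.sleep(3)   # a standby takes over within one lease expiry of a crash
```

The point isn’t this particular pattern; it’s that the control software has to be designed on distributable cloud primitives from the start, and nothing in the current framework says that it is.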

A lower-level issue that’s just as difficult to rationalize is VMs versus containers.  From the first, the presumption was that VNFs would be hosted in virtual machines, largely because that was all that was available at the time.  Containers are now being considered, but they’re being fitted into a model that was developed with the limitations of VM deployment in mind.  Is that the right approach?

Container deployment and orchestration are different from VM deployment and orchestration.  OpenStack operates on hosts and VMs from a central point, and it’s essentially single-threaded; it does one thing at a time.  Docker runs per host, and container orchestration like Swarm or Kubernetes then coordinates across host boundaries; in fact, you have to explicitly pull the separate container hosts into a “swarm” or “cluster”.  That lends itself to a distributed model.
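To illustrate the difference in models, here’s a hedged sketch using the official Kubernetes Python client: rather than an orchestrator issuing one deployment step at a time, you declare a replica count and the cluster’s own control loops converge on it.  The image and names are hypothetical.

```python
from kubernetes import client, config

config.load_kube_config()   # cluster credentials from the usual kubeconfig

# Declare the desired state: three replicas of a (hypothetical) VNF image.
# The cluster's distributed control loops, not a central single-threaded
# orchestrator, place the instances and keep the count true across failures.
labels = {"app": "vnf-firewall"}
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="vnf-firewall"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(containers=[
                client.V1Container(name="vnf-firewall",
                                   image="example/vnf-firewall:1.0"),
            ]),
        ),
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default",
                                                body=deployment)
```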

I’m not suggesting that you could run fully distributed Docker/Kubernetes control processes today, but you could do more there than with VMs and OpenStack, and you could almost certainly transform the container environment into something fully distributable and scalable with less effort than would be required for VMs and OpenStack.  VMs are, of course, more tenant-isolated than containers, but if VNFs are carrier-onboarded and certified, do you need that isolation?  We need to decide, and then either weigh that loss against the gains of distributability or fix the isolation issues.

The final point is networking.  You could make a strong argument that both Amazon and Google built their clouds around their networks.  Google offers the most public model of what a cloud network should be, and it explicitly deals with address spaces, NAT, boundary gateway functions, and so forth.  We are building services with NFV, services that will include both virtual and non-virtual elements, services that need control-level and management-level isolation.  Google can do all of this; why are we not talking about it for NFV?  I don’t think it’s possible to build an agile public cloud without these features, and so we should be exploiting standard implementations of the features in SDN and NFV.  We are not.

Are the management ports of a VNF in the address space of the user service?  If not, how does the user change parameters?  If so, are the connections between that VNF and the NFV control element called the VNF Manager (VNFM) also in the service address space?  If not, how can the VNF be in two address spaces at once?  If so, doesn’t that mean that user-side network elements can hack the VNFM?
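One plausible shape of an answer, sketched purely for illustration with the docker-py SDK: give the VNF two interfaces by attaching it to a user-facing service network and to a separate, internal management network that only the VNFM can reach.  Every name and image here is hypothetical.

```python
import docker

client = docker.from_env()

# Two address spaces: a user-facing service network and an internal
# management network. internal=True keeps the management network off any
# external gateway. All names and the image are hypothetical.
service_net = client.networks.create("service-net", driver="bridge")
mgmt_net = client.networks.create("mgmt-net", driver="bridge", internal=True)

vnf = client.containers.run("example/vnf-firewall:1.0", detach=True,
                            network="service-net")
mgmt_net.connect(vnf)   # second interface, reachable only by the VNFM

# The VNF's management port now sits outside the user's address space, so
# user-side elements have no path to the VNFM.
```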

We already have an integration challenge with NFV, one widely recognized by network operators.  Whatever we don’t address in the specifications for SDN or (more importantly) NFV is going to become a bigger integration problem in the near term and an obsolescence problem in the long term.  The cloud is already moving forward.  SDN’s greatest successes have come in cloud deployments.  NFV’s success will depend on effectively piggybacking on cloud evolution.  We have a clear indication of what the cloud is capable of, and of where those capabilities are heading.  We need to align with those capabilities now, before the cost of alignment becomes too high or the lack of it threatens deployment.