Adapting NFV to Cloud-Native

Carrier cloud is IMHO the foundation of any rational network operator virtualization strategy.  It would make zero sense for operators to build out hosting infrastructure for specific applications or service missions.  This is the industry, after all, that has decried the notion of service-specific silos from the very first.  Capital and operations efficiency alike depend on a single resource pool.

I’ve noted in previous blogs that my biggest problem with the NFV community is its current and seemingly increasing divergence from the most relevant trends in cloud computing.  There is nothing that NFV has or needs that isn’t also needed in the cloud overall, and in fact nearly everything NFV has or needs is already provided in the cloud.  If we keep on our current track, we’ll build an NFV model that fails to take advantage of cloud development (in deployment and lifecycle management) and risks a separation between carrier cloud infrastructure and “NFV Infrastructure” or NFVi.  That would raise the capital and operations cost of NFV, and risk the whole concept.  We all know that would be bad, but can anything be done at this point?

I think there are two possible pathways to fixing the NFV situation.  The first is to limit NFV to deployment of VNFs on universal CPE (uCPE), and make a decision to separate uCPE from carrier cloud infrastructure.  By keeping NFV out of cloud hosting (in favor of per-site premises hosting), we could eliminate the risk of creating an NFVi silo cloud.  The second is to find some way of laying NFV on top of carrier cloud without separating the processes of NFV from cloud processes.  That requires making NFV overall into a “carrier cloud application”, not making VNFs into such an application.  All of this must also somehow address the “cloud-native” thrust we now see from operators, both in NFV specifically and in carrier cloud overall.  Let’s look at each approach to see what would be required.

Starting, obviously, with limiting VNFs to uCPE deployment.  That sounds drastic, but I’ve never liked the way that the NFV ISG glommed onto “service chaining” and “virtual CPE” as the prime focus within the cloud.  The basic problem is that any service termination needs some CPE, so you can’t pull everything into the cloud.  Consumer and small business/branch terminations need a WiFi hub, and the commercial products that fill that role include the basic firewall and termination-service features.  These cost between about $50 and $300, depending mostly on the WiFi features, so it seems clear that the vCPE mission is valid only for larger business-site terminations.  Supporting those terminations inside the cloud, using service chaining, custom NFVi, and an NFV-specific deployment and lifecycle automation process, seems unlikely to be justified.  uCPE is a better solution.

As I pointed out in a blog last week, though, vCPE/uCPE is very likely not a mission that really needs NFV.  You don’t need to service-chain within a device (presuming that service chaining overall is even useful, which I doubt).  Some order in how you load and manage uCPE would be helpful, but a simple platform spec could provide that.  In any event, it is a possible mission for NFV, and if operators believed in it, then at least some of the work of the ISG could be redeemed.

The second option is a lot more complicated.  Rather than doing what many had expected, which was to simply identify cloud strategies that NFV could leverage, the ISG framed a very specific model for management, deployment, configuration and parameterization, and even infrastructure.  That model didn’t align well even with the cloud framework of 2013, when the model evolved, and it’s not tracking further developments in the cloud.  Thus, the question is how easily we could retrofit NFV, not to past or even current cloud, but to the broad track of cloud evolution.

Let’s start with what I think would have to be the critical accommodation.  Right now, NFV expects to interact with its resource pool (NFV Infrastructure or NFVi) through the mediation of the Virtual Infrastructure Manager or VIM.  In order for this to track cloud evolution, we have to assume that the VIM offers a single virtual-host abstraction that’s mapped to the resource pool by a composable-infrastructure layer.  Think something like (and preferably based on) Apache Mesos with DC/OS, but the implementation specifics matter less than the architectural model.  Everything gets hosted on the virtual-host abstraction.  The mapping happens below, which means that a VNF has a “descriptor” that corresponds to the application description in cloud hosting.  There are also resource-level descriptors/policies that guide the mapping.

What I think would likely be required here is a kind of adapter function, one that presented a uniform hosting abstraction via the VIM API and accepted the VNF Descriptor (VNFD) as a parameter.  The VNFD would then be mapped into the parameters and policies required by the composable-infrastructure API below.  That’s helpful because the cloud itself may have agreed on the principle of infrastructure abstraction but not yet on the means of achieving it, and we don’t want to lock in an implementation at this level.
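To make the adapter idea concrete, here’s a minimal sketch of the shape such a function might take.  All the class and field names are invented for illustration; the real VNFD and composable-infrastructure APIs would carry far more detail.

```python
# Hypothetical sketch: a VIM-side adapter that exposes one uniform
# "virtual host" abstraction and translates a VNF Descriptor (VNFD)
# into the parameters a composable-infrastructure layer consumes.
# All names here are illustrative, not from any spec.

from dataclasses import dataclass, field

@dataclass
class VNFD:
    """Minimal slice of a VNF Descriptor: what the VNF needs to run."""
    name: str
    vcpus: int
    memory_gb: int
    placement_policies: dict = field(default_factory=dict)

@dataclass
class InfraRequest:
    """What the composable-infrastructure layer below actually sees."""
    image: str
    cpu: int
    mem_gb: int
    constraints: dict

class VirtualHostAdapter:
    """Presents a single virtual-host abstraction via the VIM API;
    the mapping to real resources happens below this boundary."""

    def deploy(self, vnfd: VNFD) -> InfraRequest:
        # Translate descriptor fields into infrastructure parameters.
        return InfraRequest(
            image=vnfd.name,
            cpu=vnfd.vcpus,
            mem_gb=vnfd.memory_gb,
            constraints=dict(vnfd.placement_policies),
        )

adapter = VirtualHostAdapter()
req = adapter.deploy(VNFD("firewall-vnf", vcpus=2, memory_gb=4,
                          placement_policies={"region": "edge-east"}))
print(req.cpu, req.constraints["region"])  # 2 edge-east
```

The point of the design is that everything above the adapter sees only the virtual-host abstraction; swapping the infrastructure layer underneath changes the translation, not the VIM API.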

This isn’t too destructive to the ISG work, but even here there’s the fact that the ISG is tossing in a lot of parameters to control the service lifecycle process that are not found in the cloud at all.  Some may be helpful, and those should be reviewed for inclusion in cloud implementations.  We need, for example, the ability to identify hosting points that depend on the same power source (so we don’t fail over to a host that shares a point of failure) and to define places where certain processing and storage can’t take place for regulatory reasons.  Many others, though, just diddle with hosting details to the point where they could compromise resource efficiency.
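The two useful policy examples could be expressed as simple resource-level constraints a cloud scheduler already knows how to enforce.  This sketch (with invented host attributes) filters backup candidates by power-domain diversity and regulatory region:

```python
# Illustrative sketch (attribute names are invented): resource-level
# policies covering the two examples in the text, namely power-domain
# diversity for backups and regulatory geofencing.

def eligible_hosts(hosts, primary, allowed_regions):
    """Hosts that can back up `primary`: a different power domain,
    and inside a region where the workload may legally run."""
    return [
        h for h in hosts
        if h["power_domain"] != primary["power_domain"]
        and h["region"] in allowed_regions
    ]

hosts = [
    {"name": "h1", "power_domain": "pdu-A", "region": "eu"},
    {"name": "h2", "power_domain": "pdu-A", "region": "eu"},  # same power feed
    {"name": "h3", "power_domain": "pdu-B", "region": "us"},  # wrong region
    {"name": "h4", "power_domain": "pdu-B", "region": "eu"},  # acceptable
]
backups = eligible_hosts(hosts, primary=hosts[0], allowed_regions={"eu"})
print([h["name"] for h in backups])  # ['h4']
```

Anything beyond constraints like these starts to second-guess the scheduler, which is where resource efficiency suffers.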

The next step is more difficult.  There is no reason whatsoever to stay with the NFV ISG concept of MANO, though orchestration is clearly important.  The reason for the departure is that the cloud is offering a number of much more flexible, powerful, and broadly supported orchestration options.  If NFV wants to get itself aligned with the cloud, it has to give up the notion of NFV-specific orchestration.

From MANO-less NFV, we move to the real knot of the problem, which is management.  Deployment automation is fairly straightforward, which may be what led the NFV ISG to try to define it independent of the cloud.  If you add in “lifecycle automation”, you make things a lot more complicated because you introduce a very broad set of events into the picture.

There are two levels of management issue with virtualization and carrier cloud.  One level is the “simple” task of reflecting the status of virtual network functions in the same way that you’d reflect physical device status.  The NFV ISG dealt with that, in a sense at least, with its concept of VNF Manager (VNFM).  The second level is the real problem: how do you deal with management requirements that exist for the virtual function but not for the physical device?  An example is a service-chained set of VNFs replacing a simple piece of access CPE.  The former exposes hosts and connections that are all hidden inside a box in the latter, so traditional management won’t handle them.

Once you accept that you need a broader vision of management to handle what happens inside virtual elements, it makes no sense to assume that you’d stick with the old operations tools and practices to handle the outside.  In order for that to work, the NFV community has to extend its view of service modeling (which has already embraced, sort of, the TOSCA cloud-centric approach) to include the TMF’s NGOSS-Contract vision of data models steering events to processes.  That requires a major shift in the way that VNFM happens today, because there is no longer a monolithic operations process handling things, but rather a series of processes coupled individually to events via TOSCA.
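The NGOSS-Contract idea of a data model steering events to processes can be reduced to a state/event table carried by each service-model element.  Here’s a minimal sketch; the states, events, and handler names are all invented for illustration:

```python
# A minimal sketch of the NGOSS-Contract pattern as described: each
# service-model element's state/event table steers events to individual
# processes, so there is no monolithic operations process in the loop.
# States, events, and handlers are invented for illustration.

def redeploy(elem): elem["state"] = "deploying"; return "redeploying"
def activate(elem): elem["state"] = "active"; return "activated"
def escalate(elem): return "escalated"

STATE_EVENT_TABLE = {
    ("deploying", "deploy_complete"): activate,
    ("active", "host_failed"): redeploy,
    ("active", "sla_violation"): escalate,
}

def handle(elem, event):
    """Look up (state, event) in the table and run that process."""
    handler = STATE_EVENT_TABLE.get((elem["state"], event))
    return handler(elem) if handler else "ignored"

fw = {"name": "vFirewall", "state": "active"}
print(handle(fw, "host_failed"))      # redeploying
print(fw["state"])                    # deploying
print(handle(fw, "deploy_complete"))  # activated
```

Because each (state, event) pair names its own process, lifecycle automation becomes a set of small, independently invoked handlers rather than one operations monolith, which is exactly the shift the TMF vision implies.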

This doesn’t mean that VNFM goes away, or even that the notion of using traditional management elements (the classic element-management hierarchy) goes away.  You can feed a management system with data obtained from a modeled service.  I proposed using a management repository with query-derived management views (“derived operations”) in several operator meetings, well before the NFV ISG started.  You can retain a management-agent approach where you want to, but remember that management systems that are supposed to be driving lifecycle automation have to be automatic, so the state/event NGOSS-Contract approach is absolutely critical.

Derived operations, the extraction of information from a database through a query interface, is also a reasonable way to deliver service model data to a process that’s the target of an event.  It’s also possible to simply offer the entire model, but that creates efficiency and security issues.  You could offer a process only the model data that represented the object the event was associated with, which is more secure and efficient, but that raises the question of where this data is stored.  The question of model-data distribution is one we can explore in a future blog.
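A rough sketch of what a query-derived view might look like, with an invented repository schema: the event-target process receives only the records scoped to its object, never the whole model.

```python
# Sketch of "derived operations" as described: management data lives in
# a repository, and an event-target process gets a query-derived view
# scoped to one object. The repository schema is invented.

REPOSITORY = [
    {"object": "vnf-1", "metric": "cpu", "value": 0.92},
    {"object": "vnf-1", "metric": "latency_ms", "value": 12},
    {"object": "vnf-2", "metric": "cpu", "value": 0.30},
]

def derived_view(repo, object_id, metrics=None):
    """Return only the records for one object (optionally one metric
    subset), rather than handing the whole model to the process."""
    return [
        r for r in repo
        if r["object"] == object_id
        and (metrics is None or r["metric"] in metrics)
    ]

view = derived_view(REPOSITORY, "vnf-1", metrics={"cpu"})
print(view)  # [{'object': 'vnf-1', 'metric': 'cpu', 'value': 0.92}]
```

The query interface is what keeps the approach both efficient (small views) and secure (no process sees data outside its scope).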

Can we expect the standards community, and the NFV ISG in particular, to adopt this approach?  I doubt it.  Technically, none of these things presents a major challenge.  We have cloud tools that do the right thing already, as I’ve pointed out.  What’s more likely to be a problem is the inertia of the standards processes involved—the NFV ISG and the ETSI ZTA (Zero-Touch Automation) activity.  It’s hard to admit that you’ve spent five years (in the case of the ISG) doing something and now have to discard or change most of it.  But the alternative is to spend even longer doing something that will surely be overtaken by events.