Can We Make ETSI NFV Valuable Even If It’s Not Optimal?

Network Functions Virtualization (NFV) has been a focus for operators for five years now.  Anyone who’s been following my blog knows I have disagreed with the approach the NFV ISG has taken, but that’s the approach it took.  The current model will never, in my view, be optimal, as I’ve said many times in past blogs and media interviews.  The question now is whether it can be useful in any way.  The answer is “Yes”, providing that the industry, and the ISG, take some steps quickly.  The goal of these steps is to address what could be serious issues without mandating a complete redesign of the software, which is now largely based on a literal interpretation of the ETSI ISG’s End-to-End model.

The current focus of NFV trials and deployments is virtual CPE (vCPE), which is the use of NFV to substitute for traditional network-edge appliances.  This focus has, IMHO, dominated the ISG to the point where they’ve framed the architecture around it.  However, actual deployments suggest that real-world vCPE differs from the conceptual model in the specs.  Because of the central role of vCPE in early NFV activity, it’s important that these issues be addressed.

What was conceptualized for vCPE was a series of cloud-hosted features, each in its own virtual machine, and each linked to the others in a “service chain”.  What we actually see today for most vCPE is a general-purpose edge device that is capable of receiving feature updates remotely.  This new general-purpose edge device is more agile than a set of fixed, purpose-built appliances.  Furthermore, the facilities for remote feature loading make a general-purpose edge device less likely to require field replacement if the user upgrades functionality.  If that’s what’s actually happening with vCPE, then we need to optimize the concept without major changes to the ETSI model or implementation.

Let’s start with actual hosting of vCPE features in the cloud, which was the original ETSI model.  The service-chain notion of features is completely impractical.  Every feature adds a hosting point and chain connection, which means every feature adds cost and complexity to the picture.  My suggestion here is that where cloud-hosting of features is contemplated, abandon service chaining in favor of deploying/redeploying a composite image of all the features used.  If a user has a firewall feature and adds an application acceleration feature, redeploy a software image that contains both to substitute for the image that supports only one feature.  Use the same VMs, the same connections.
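
To make the difference concrete, here’s a minimal sketch of what the new-image model might look like in orchestration logic, assuming a hypothetical feature-image catalog and a deploy function supplied by the orchestrator; none of these names come from the ETSI specs.

```python
# Hypothetical sketch of the "new-image" model: instead of adding a new VM and
# chain link per feature, rebuild a single composite image from the feature set
# and redeploy it in place.  FEATURE_IMAGES, composite_image, and upgrade_service
# are illustrative names, not part of any standard.

FEATURE_IMAGES = {
    "firewall": "vnf/firewall:1.4",
    "app-acceleration": "vnf/app-accel:2.1",
}

def composite_image(features):
    """Resolve the ordered feature list into one deployable image manifest."""
    return {
        "base": "vcpe-base:latest",
        "layers": [FEATURE_IMAGES[f] for f in features],
    }

def upgrade_service(vm_id, current_features, new_feature, deploy):
    """Swap the running image for one that includes the added feature,
    reusing the same VM and the same network connections."""
    features = current_features + [new_feature]
    deploy(vm_id, composite_image(features))   # in-place image replacement
    return features

# Example: a firewall-only customer adds application acceleration.
# upgrade_service("vm-42", ["firewall"], "app-acceleration", deploy=my_orchestrator)
```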

Some may argue that this is disruptive at the service level.  So is adding something to a service chain.  You can’t change the data plane without creating issues.  The point is that the new-image model requires much less operations intervention than the new-link model (you simply replace an image), and it doesn’t add hosting points and costs.  If the cost of multi-feature vCPE increases with each feature, then the price the user pays has to cover that cost, and that makes feature enhancement less attractive.  The ETSI ISG should endorse the new-image model for cloud-hosted vCPE.

Let’s now move to the market-dominant vCPE approach, which is a general-purpose edge device that substitutes for cloud resources.  Obviously, such a hosting point for vCPE doesn’t need additional hosting points and network connections to create a chain.  Each feature is in effect inserted into a “virtual slot” in an embedded-control computing device, where it runs.

One of the primary challenges in NFV is the onboarding of virtual functions and the interoperability of VNFs.  If every general-purpose edge device vendor takes its own path on the device’s hosting features and local operating system, you could end up needing a different VNF build for every vCPE device.  You need some standard presumption of a local operating system, a lightweight device-oriented Linux version for example, and you need some standard middleware that links the VNF to other VNFs in the same device, and to the NFV management processes.

What NFV could do here is define a standard middleware set to provide those “virtual slots” in the edge device and support the management of the features.  There should be a kind of two-plug mechanism for adding a feature.  One plug connects the feature component to the data plane in the designated place, and the other connects it to a standard management interface.  That interface then links to a management process that supplies management for all the features included.  Since the whole “chain” is in the box, it would be possible to cut in a new feature without significant (if any) data plane interruption.
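
A rough sketch of what that two-plug middleware might look like is below; the class and method names are my own illustrations, not anything defined by the ISG.

```python
# A minimal sketch of the "two-plug" idea: each feature loaded into a virtual
# slot exposes one connector for the data plane and one for management.

from abc import ABC, abstractmethod

class FeatureModule(ABC):
    """What a VNF vendor would implement to drop a feature into a slot."""

    @abstractmethod
    def attach_data_plane(self, ingress, egress):
        """Plug 1: splice the feature into the device's packet path."""

    @abstractmethod
    def attach_management(self, mgmt_channel):
        """Plug 2: connect the feature to the standard management interface."""

class EdgeDevice:
    """General-purpose edge device with ordered virtual slots."""

    def __init__(self):
        self.slots = []          # ordered feature modules

    def insert_feature(self, module: FeatureModule, position, mgmt_channel):
        # The whole "chain" lives inside the box, so insertion rewires local
        # connectors rather than redeploying VMs or network links.
        module.attach_management(mgmt_channel)
        self.slots.insert(position, module)
        self._rewire_data_plane()

    def _rewire_data_plane(self):
        # Connector indices stand in for real packet-path hooks in a device.
        for i, module in enumerate(self.slots):
            module.attach_data_plane(ingress=i, egress=i + 1)
```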

The same approach could be taken for what I’ll call the “virtual edge device”.  Here, instead of service-chaining a bunch of features to create agility, the customer buys a virtual edge device, a cloud element that accepts feature insertion into the same image/element.  Thus, the network service user is “leasing” a hosting point into which features can be dynamically added.  This preserves the efficiency of the new-image model while potentially offering feature insertion with no disruption at all.

The second point where the NFV community could inject some order is in that management plug.  The notion here is that there is a single, specific management process that’s resident with the component(s) and interacts with the rest of the NFV software.  That process has two standard APIs, one facing the NFV management system (VNFM) and the other facing the feature itself.  It is then the responsibility of any feature or VNF provider to offer a “stub” that connects its logic to the feature-side API.  That simplifies onboarding.
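
Here’s one way the resident management process and its vendor-supplied stub might be structured; the interfaces are hypothetical and only meant to show the division of responsibility.

```python
# Sketch of the single resident management process with its two APIs: one
# northbound toward the VNFM, one southbound toward the feature.  The vendor
# supplies only the stub that adapts its logic to the southbound API.

class FeatureStub:
    """Supplied by the VNF vendor: adapts vendor-specific logic to the
    standard feature-side API."""

    def get_status(self):
        return {"state": "up"}                      # vendor-specific probe

    def apply_parameters(self, params):
        pass                                        # vendor-specific config push

class ResidentManager:
    """The single management process co-resident with the hosted feature(s)."""

    def __init__(self, stubs):
        self.stubs = stubs                          # {feature name: FeatureStub}

    # Northbound API, facing the VNFM
    def report(self):
        return {name: stub.get_status() for name, stub in self.stubs.items()}

    def configure(self, name, params):
        self.stubs[name].apply_parameters(params)   # fan out to the right stub
```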

In theory, it would be possible to define a “feature API” for each class of feature, but I think the more logical approach would be to define a single API whose data model defines parameters by feature class and includes all the feature classes to be supported.  For example, the API might define a “Firewall” device class and the parameters associated with it, and an “Accelerator” class that likewise has its own parameters.  That would continue as a kind of “name-details” hierarchy for each feature class.  You would then pass parameters only for the class(es) you implemented.
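
As a simple illustration, such a data model could be expressed as a class-to-parameter dictionary, with a device submitting and validating only the branches it supports; the classes and parameters shown are examples, not a proposed standard.

```python
# Illustrative "name-details" hierarchy: one API data model that lists the
# supported feature classes and the parameters belonging to each.

FEATURE_CLASSES = {
    "Firewall": {
        "rule_set": list,        # ordered allow/deny rules
        "default_action": str,   # "allow" or "deny"
        "logging": bool,
    },
    "Accelerator": {
        "cache_size_mb": int,
        "compression": bool,
        "protocols": list,       # e.g. ["http", "tcp"]
    },
}

def validate(feature_class, params):
    """Check a submitted parameter set against the schema for its class."""
    schema = FEATURE_CLASSES[feature_class]
    return all(isinstance(params[k], t) for k, t in schema.items() if k in params)

# A firewall-only device would submit only the "Firewall" branch:
# validate("Firewall", {"default_action": "deny", "logging": True})  -> True
```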

The next suggestion is to formalize and structure the notion of a “virtual infrastructure manager”.  There is still a question in NFV as to whether there’s a single VIM for everything or a possible group of VIMs.  The single-VIM model is way too restrictive because it’s doubtful that vendors would cooperate to provide such a thing, and almost every vendor (not to mention every new technology) has different management properties.  To make matters worse, there’s no organized way in which lifecycle management is handled.

VIMs should become “infrastructure managers” or IMs, and they should present the same kind of generalized API set that I noted above for VNFM.  This time, though, the API model would present only a set of SLA-type parameters that would then allow higher-level management processes to manage any IM the same way.  The IM should have the option of either handling lifecycle events internally or passing them up the chain through that API to higher-level management.  This would organize how diverse infrastructure is handled (via separate IMs), how legacy devices are integrated with NFV (via separate IMs), and how management is vertically integrated while still accommodating remediation at a low level.
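
A skeletal illustration of that generalized IM contract, including the local-versus-escalated lifecycle choice, might look like this; the interface names are assumptions of mine, not drawn from the specs.

```python
# Sketch of a generalized infrastructure manager (IM) API: every IM presents
# the same SLA-style parameters upward, and decides per lifecycle event whether
# to remediate locally or escalate to higher-level management.

from abc import ABC, abstractmethod

class InfrastructureManager(ABC):
    @abstractmethod
    def sla_state(self) -> dict:
        """Uniform SLA-type view, e.g. {"availability": 0.9999, "latency_ms": 4}."""

    @abstractmethod
    def handle_event(self, event) -> bool:
        """Return True if handled locally, False to pass the event upward."""

class HigherLevelManager:
    def __init__(self, ims):
        self.ims = ims                      # any mix of NFV, legacy, and SDN IMs

    def on_event(self, im, event):
        if not im.handle_event(event):      # IM chose not to remediate locally
            self.remediate(im, event)       # vertical escalation path

    def remediate(self, im, event):
        print(f"escalated: {event} from {type(im).__name__}")
```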

The final suggestion is aimed at a problem I think is inherent in a strict implementation of the ETSI E2E model: scalability.  Software framed directly on the functional model of NFV would be a serialized set of elements whose performance would be limited and which would not scale easily under load.  This could create a major problem should the failure of some key piece of infrastructure cause a “fault cascade” that requires a lot of remediation and redeployment.  The only way to address this is to fragment NFV infrastructure and software into relatively contained domains that are harmonized from above.

In ETSI-modeled NFV, we have to assume that every data center has a minimum of one NFV software instance, including MANO, VNFM, and VIM.  If it’s a large data center, then the number of instances would depend on the number of servers.  IMHO, you would want to presume that you had an instance for every 250 servers or so.

To make this work, a service would have to be decomposed into instance-specific pieces and each piece then dispatched to the proper spot.  That means you would have a kind of hierarchy of implementation.  The easiest way to do this is to say that there is a federation VIM that’s responsible for taking a piece of service and, rather than deploying it, sending it to another NFV instance for deployment.  You could have as many federation VIMs and layers thereof as needed.
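
In code terms, a federation VIM would look like any other VIM from above but would dispatch rather than deploy; the sketch below is purely illustrative, with names of my own invention.

```python
# A rough sketch of the "federation VIM" idea: the service model is decomposed
# into per-domain pieces, and a federation VIM forwards each piece to the NFV
# instance (MANO/VNFM/VIM stack) that owns that domain instead of deploying it
# itself.  Federation VIMs can be stacked to any depth.

class NfvInstance:
    """One NFV software instance, roughly one per ~250 servers."""
    def __init__(self, domain):
        self.domain = domain

    def deploy(self, piece):
        print(f"{self.domain}: deploying {piece['name']}")

class FederationVim:
    """Looks like a VIM from above, but dispatches rather than deploys."""
    def __init__(self, children):
        self.children = children            # domain -> NfvInstance or FederationVim

    def deploy(self, piece):
        self.children[piece["domain"]].deploy(piece)

# Example: two data-center domains under one federation layer.
# fed = FederationVim({"dc-east": NfvInstance("dc-east"), "dc-west": NfvInstance("dc-west")})
# fed.deploy({"name": "vFirewall", "domain": "dc-east"})
```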

All of this doesn’t substitute completely for an efficient NFV software architecture.  I’ve blogged enough about that to demonstrate what I think the problems with current NFV models are, and what I think would have to be done at the foundation to get things really right.  These fixes won’t do that, but as I said at the opening of this blog, my goal isn’t to make current NFV great or even optimal, but rather to make it workable.  If that’s done, then we could at least hope that some deployment could occur, that fatal problems with NFV wouldn’t arise, and that successor implementations would have time to get it right at last.