So the NFV ISG Wants to Look at Being Cloud-Like: How?

The ETSI NFV ISG is holding a meeting, one of whose goals is to explore a more cloud-like model of NFV.  Obviously, I’d like to see that.  The question is what such a model would look like, and whether it (in some form) could be achieved from where we are now, without starting another four-year effort.  There are certainly some steps that could be taken.

A “cloud model of NFV” has two functional components.  First, the part of NFV that represents a deployed service would have to be made very “cloud-friendly”.  Second, the NFV software itself would have to be optimized to exploit the scalability, resiliency, and agility of the cloud.  We’ll take these in order.

The first step would actually benefit the cloud as well as NFV.  We need a cloud abstraction on which we can deploy, that represents everything that can host functions and applications.  The model today is about hosts or groups of hosts, and there are different mechanisms to deploy containers versus VMs and different processes within each.  All of this complicates the lifecycle management process.

The biggest NFV challenge here is dealing with virtual CPE (vCPE).  Stuff that’s hosted on the customer prem, in a cloud world, should look like a seamless extension of “the cloud”, and the same is true for public cloud services.  This is a federation problem, a problem of agreeing on a broad cloud abstraction and then agreeing to provide the mechanisms to implement it using whatever mixture of technology happens to be available.  The little boxes for vCPE, the edge servers Amazon uses in its Greengrass Lambda extension, and big enterprise data centers are all just the edge of “the cloud” and we need to treat them like that.

If we had a single abstraction to represent “the cloud” then we would radically simplify the higher-level management of services.  Lifecycle management would divide by “in-cloud” and “not-in-cloud” with the latter being the piece handled by legacy devices.  The highest-level service manager would simply hand off a blueprint for the cloud piece to the cloud abstraction and the various domains within that abstraction would be handed their pieces.  This not only simplifies management, it distributes work to improve performance.
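To make the hand-off concrete, here’s a minimal sketch of that idea in Python.  Everything in it is hypothetical (the class and element names aren’t from any NFV specification): a service manager splits a blueprint into in-cloud and not-in-cloud pieces, hands the cloud piece to one abstraction, and the abstraction routes each element to whatever domain happens to host it.

```python
# Hypothetical sketch of a "single cloud abstraction": the service
# manager divides a blueprint into in-cloud and not-in-cloud pieces,
# hands the cloud piece off once, and the abstraction dispatches each
# element to its domain (containers, VMs, edge boxes, and so on).
# All names here are illustrative, not part of any NFV specification.

from dataclasses import dataclass, field

@dataclass
class Element:
    name: str
    in_cloud: bool          # True if hosted, False if a legacy device
    region: str = "core"

@dataclass
class CloudAbstraction:
    # domain name -> list of element names dispatched to it
    domains: dict = field(default_factory=dict)

    def deploy(self, elements):
        for e in elements:
            # Each domain (OpenStack, a Docker cluster, an edge box...)
            # gets its own piece; the caller never sees the difference.
            self.domains.setdefault(e.region, []).append(e.name)

def manage_service(blueprint, cloud):
    cloud_piece = [e for e in blueprint if e.in_cloud]
    legacy_piece = [e for e in blueprint if not e.in_cloud]
    cloud.deploy(cloud_piece)              # one hand-off for everything hosted
    return [e.name for e in legacy_piece]  # left to legacy device management

blueprint = [
    Element("firewall-vnf", True, "edge"),
    Element("router-core", False),
    Element("dpi-vnf", True, "core"),
]
cloud = CloudAbstraction()
legacy = manage_service(blueprint, cloud)
print(cloud.domains)   # hosted elements, grouped by domain
print(legacy)          # ['router-core']
```

The point of the sketch is the division of labor: the high-level manager only ever talks to one abstraction, and the per-domain fan-out happens below it.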

Our next point is that “cloudy VNFs”, to coin an awkward term, should be for all intents and purposes cloud application components, no different from pieces of a payroll or CRM system.  If one breaks you can redeploy it somewhere else, and if it runs out of capacity you can replicate and load-balance it.  Is this possible?  Yes, but only potentially, because the attributes of a VNF that would make this possible aren’t necessarily there.

If I have a copy of an accounting system that runs out of capacity, can I just spin up another one?  The problem is that I have a database to update here, and that update process can’t be duplicated across multiple instances unless I have some mechanism for eliminating collisions that could result in erroneous data.  Systems like that are “stateful” meaning that they store stuff that will impact the way that subsequent steps/messages are interpreted.  A “stateless” system doesn’t have that, and so any copy can be made to process a unit of work.
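A toy contrast makes the distinction clearer.  This is illustrative only: the shared dictionary stands in for a back-end database with atomic updates, which is the assumption that makes the stateless version safe to replicate.

```python
# Toy contrast between a stateful and a stateless worker.  The shared
# "store" dict stands in for an external database with atomic updates;
# names are hypothetical, for illustration only.

class StatefulCounter:
    """Keeps its state locally: a second instance would diverge from
    the first, and concurrent updates could collide."""
    def __init__(self):
        self.total = 0

    def handle(self, amount):
        self.total += amount      # state lives inside this instance
        return self.total

def stateless_handle(store, key, amount):
    """Any copy of this function can process any unit of work,
    because all state lives in the shared external store."""
    store[key] = store.get(key, 0) + amount
    return store[key]

store = {}
# Two "instances" of the stateless worker can interleave safely,
# because neither holds anything between calls:
stateless_handle(store, "acct-1", 10)
stateless_handle(store, "acct-1", 5)
print(store["acct-1"])   # 15
```

The stateful counter isn’t wrong as software; it’s just not something you can spin up a second copy of without a collision-avoidance mechanism.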

A pure data-plane process, meaning get-a-packet-send-a-packet, is only potentially stateless.  Do you have the chance of queuing for congestion, or do you have flow control, or do you have ancillary control-plane processes invoked to manage the flow between you and partner elements?  If so then there is stateful behavior going on.  Some of these points have to be faced in any event; queuing creates a problem with lost data or out-of-order arrivals, but that also happens just by creating multiple paths or by replacing a device.  The point is that a VNF would have to be examined to determine if its properties were consistent with scaling, and new VNFs should be designed to offer optimum scalability and resiliency.
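The out-of-order problem can be sketched too.  Assume (hypothetically) that packets carry a sequence number: the forwarding step itself stays stateless, but something downstream has to hold state to repair the ordering that multiple paths or instances break.

```python
# Sketch of why "get-a-packet-send-a-packet" is only potentially
# stateless: with multiple instances or paths, packets arrive out of
# order, so a downstream element must hold state to reorder them.
# The packet format and sequence numbers are illustrative assumptions.

import heapq

def forward(packet):
    # A truly stateless data-plane step: output depends only on input.
    return {"seq": packet["seq"], "payload": packet["payload"].upper()}

class Reorderer:
    """Stateful element that repairs ordering after parallel paths.
    Releases packets only when the next expected sequence arrives."""
    def __init__(self):
        self.next_seq = 0
        self.pending = []          # min-heap keyed by sequence number

    def push(self, packet):
        heapq.heappush(self.pending, (packet["seq"], packet))
        released = []
        while self.pending and self.pending[0][0] == self.next_seq:
            released.append(heapq.heappop(self.pending)[1])
            self.next_seq += 1
        return released

# Two load-balanced instances of forward() deliver out of order:
arrivals = [forward({"seq": 1, "payload": "b"}),
            forward({"seq": 0, "payload": "a"}),
            forward({"seq": 2, "payload": "c"})]
r = Reorderer()
out = []
for p in arrivals:
    out.extend(r.push(p))
print([p["payload"] for p in out])   # ['A', 'B', 'C']
```

You can scale the stateless step freely; the price is that the state you evicted from it reappears somewhere else, here in the reorderer.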

We see this trend in the cloud with functional programming, lambdas, and microservices.  It’s possible to create stateless elements by keeping state and context in a back-end store, but software that was written to run in a single device never faced the scalability/resiliency issue, and so probably doesn’t do what’s necessary for statelessness.

Control-plane stuff is much worse.  If you report your state to a management process, it’s probably because it requested it.  Suppose you request state from Device Instance One, and Instance Two is spun up, and it gets the request and responds.  You may have been checking on the status of a loaded device to find out that it reports being unloaded.  In any event, you now have multiple devices, so how do you obtain meaningful status from the system of devices rather than from one of them, or each of them (when you may not know about the multiplicity)?
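One hedged sketch of a way out: a management proxy that never binds to a single instance, but aggregates per-instance reports into one system-level view.  The report fields and function names below are hypothetical.

```python
# Sketch of the control-plane problem: with multiple instances, asking
# "the device" for status is meaningless, so a management proxy has to
# aggregate across the whole set.  Status fields are hypothetical.

def system_status(instances):
    """Combine per-instance reports into one system-level view, so the
    manager never binds to a single (possibly freshly spun-up) instance."""
    loads = [i["load"] for i in instances]
    return {
        "instances": len(instances),
        "worst_load": max(loads),
        "avg_load": sum(loads) / len(loads),
        "healthy": all(i["up"] for i in instances),
    }

# Instance One is loaded; Instance Two was just spun up and is idle.
reports = [{"name": "one", "up": True, "load": 0.9},
           {"name": "two", "up": True, "load": 0.1}]
print(system_status(reports))
# A naive query might happen to hit instance "two" and conclude the
# system is idle; the aggregate view shows worst_load of 0.9.
```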

All this pales into insignificance when you look at the second piece of cloud-centric NFV, which is the NFV software itself.  Recall that the ETSI E2E model describes a transactional-looking framework that controls what looks like a domain of servers.  Is this model a data-center-specific model, meaning that there’s a reasonably small collection of devices, or does this model cover an entire operator infrastructure?  If it’s the former, then services will require some form of federation of the domains to cover the full geography.  If it’s the latter, then the single-instance model the E2E diagram describes could never work because it could never scale.

It’s pretty obvious that fixing the second problem would be more work than fixing the first, and would perhaps involve that first step anyway.  In the cloud, we’d handle deployment across multiple resource pools by a set of higher-layer processes, usually DevOps-based, that would activate individual instances of container systems like Docker (hosts or clusters) or VM systems like OpenStack.  Making the E2E model cloud-ready would mean creating fairly contained domains, each with their own MANO/VNFM/VIM software set, and then assigning a service to domains by decomposing and dispatching to the right place.

The notion of having “domains” would be a big help, I think.  That means that having a single abstraction for “the cloud” should be followed by having one for “the network”, and both these abstractions would then decompose into domains based on geography, management span of control, and administrative ownership.  Within each abstraction you’d have some logic that looks perhaps like NFV MANO—we need to decompose a service into “connections” and “hosting”.  You’d also have domain-specific stuff, like OpenStack or an NMS.  A high-level manager would orchestrate the service into high-level requests for abstract services, and those requests would invoke a second-level manager that divides the work by domain.
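The two levels can be sketched as a pair of functions.  This is a thought experiment, not any product’s API: the domain table, the “hosting”/“connection” split, and all the names are assumptions made for illustration.

```python
# Hypothetical two-level orchestration sketch: a high-level manager
# decomposes a service into abstract "connection" and "hosting"
# requests, and a second-level manager routes each request to a domain
# (a contained MANO/VIM set, an NMS, etc.) by kind and geography.
# The domain table and field names are illustrative assumptions.

DOMAINS = {
    ("hosting", "eu"): "mano-eu",      # a contained NFV MANO/VIM domain
    ("hosting", "us"): "mano-us",
    ("connection", "eu"): "nms-eu",    # a legacy network-management domain
    ("connection", "us"): "nms-us",
}

def decompose(service):
    """High level: turn a service model into abstract requests."""
    for part in service["parts"]:
        yield {"kind": part["kind"], "geo": part["geo"], "spec": part["spec"]}

def dispatch(requests):
    """Second level: pick a domain controller for each request."""
    plan = {}
    for r in requests:
        domain = DOMAINS[(r["kind"], r["geo"])]
        plan.setdefault(domain, []).append(r["spec"])
    return plan

service = {"parts": [
    {"kind": "hosting", "geo": "eu", "spec": "vCPE-image"},
    {"kind": "connection", "geo": "eu", "spec": "vpn-tunnel"},
    {"kind": "hosting", "geo": "us", "spec": "firewall-vnf"},
]}
plan = dispatch(decompose(service))
print(plan)
```

Each domain then runs its own piece in parallel, which is where the distribution-of-work benefit comes from.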

We don’t have that now, of course.  Logically, you could say that if we had a higher-layer system that could model and decompose, and if we created those limited NFV domains, we could get to the good place without major surgery on NFV.  There are some products out there that provide what’s needed to do the modeling and decomposing, but they don’t seem to be mandatory parts of NFV.

I’d love to be able to go to meetings like this, frankly, but the problem is that as an independent consultant I have to do work that pays the bills, and all standards processes involve a huge commitment in time.  To take a proposal like this to a meeting, I’d have to turn it into a contribution, defend it in a series of calls, run through revision cycles, and then face the probability that the majority of the body isn’t ready to make radical changes anyway.  So, instead I offer my thoughts in a form I can support, which is this blog.  In the end, the ISG has the ability to absorb as much of it as they like, and discard what they don’t.  That’s the same place formal contributions would end up anyway.