Don’t Ignore the Scalability and Resilience of SDN/NFV Control and Management!

It would be fair for you to wonder whether the notion of intent-based service modeling for transformed telco infrastructure is anything more than a debate over software architecture.  In fact, it may be the critical question, because so far we haven’t really addressed how the control software associated with virtualization in carrier infrastructure actually executes.  We’ve talked about scaling VNFs, and even about scaling controllers in SDN.  What about scaling the SDN/NFV control, orchestration, and management elements?  Could we bring down a whole network with a classic fault avalanche, or even just with a highly successful marketing campaign?  Does any of this work under load?

This isn’t an idle question.  If you look at the E2E architecture of the NFV ISG, you see a model that, if followed literally, would result in an application with a MANO component, a VIM component, and a VNFM component.  How does work get divided up among those?  Could you have two copies of a given component sharing the load?  There’s nothing in the material to assure that an implementation is anything but single-threaded, meaning that it processes one request at a time.

I think there are some basic NFV truths and some basic software truths that apply here, or should.  On the NFV side, it makes absolutely no sense to demand scalability under load and dynamic replacement of broken components at the VNF level, and then fail to provide either for the NFV software itself.  On the basic software side, we know how the cloud would approach the problem, so we have a formula that could be applied but has been largely ignored.

In the cloud, it’s widely understood that scalable components have to be stateless, meaning they must never retain data within the component from call to call.  Every time a component is invoked, it has to look like a fresh-from-the-library copy, because given the demands of scalability and availability management, it might well be one.  Microservices are an example of a modern software development trend (linked to the cloud but not dependent on it) that also mandates stateless behavior.
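
To make that concrete, here’s a minimal sketch of the pattern in Python; the function and field names (scale_out_step, desired_instances, and so on) are my own invention, not anything from an NFV specification.  The point is simply that the component holds no state of its own, so any copy, including one spun up a moment ago, can handle any call.

```python
# A stateless lifecycle step: all context arrives as arguments, and nothing
# is remembered between calls.  Any fresh copy of this component behaves
# identically, which is what makes scale-out and replacement safe.
def scale_out_step(event: dict, context: dict) -> dict:
    desired = context["desired_instances"]
    running = context["running_instances"]
    if running < desired:
        return {"action": "deploy", "count": desired - running}
    return {"action": "none"}
```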

This whole thing came up about a decade ago, with the work being done in the TMF on the Service Delivery Framework.  Operators expressed concern to me over whether the architecture being considered was practical:  “Tom, I want you to understand that we’re 100% behind implementable standards, but we’re not sure this is one of them,” was the comment from a big EU telco.  With the support of the concerned operators (and the equipment vendor Extreme Networks) I launched a Java project to prove out how you could build scalable service control.

The key to doing that, as I found in that project and as others have found in related areas, is the notion of back-end state control, meaning that all of the variables associated with the way a component handles a request are stored not in the component (which would make it stateful) but in a database.  That way, any instance of a component can go to the database and get everything it needs to fulfill the request it receives, and even if five different components process five successive stages of activity, the context is preserved.  If you get more requests than you can handle, you simply spin up more copies of the components that do the work.
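
Here’s a minimal sketch of back-end state control, with an in-memory SQLite table standing in for whatever shared database a real implementation would use; the table layout and function names are hypothetical.  Each invocation loads the full context, acts on it, and writes it back, so successive stages of a request can land on entirely different component instances.

```python
import json
import sqlite3

# A trivial "back end" keyed by service ID.  In practice this would be a
# shared, replicated store; an in-memory SQLite table is just a stand-in.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE service_state (service_id TEXT PRIMARY KEY, context TEXT)")

def load_context(service_id: str) -> dict:
    row = db.execute("SELECT context FROM service_state WHERE service_id = ?",
                     (service_id,)).fetchone()
    return json.loads(row[0]) if row else {}

def save_context(service_id: str, context: dict) -> None:
    db.execute("INSERT OR REPLACE INTO service_state VALUES (?, ?)",
               (service_id, json.dumps(context)))
    db.commit()

# Any instance of this worker can pick up any request, because the context
# it needs lives in the store, not in the process.
def handle_request(service_id: str, event: str) -> None:
    context = load_context(service_id)
    context["last_event"] = event
    context["stage"] = context.get("stage", 0) + 1
    save_context(service_id, context)
```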

You could shoehorn this approach into the strict structure of NFV’s MANO, but that wouldn’t be the right way (the cloud way) of doing it.  The TMF work on NGOSS Contract demonstrated that the data model that should be used for back-end state control is the service contract.  If that contract holds the state control data, and if all the elements of the service (what the TMF calls “Customer-Facing” and “Resource-Facing” Services, or CFS/RFS) store their state variables in it, then a copy of the service contract will provide the correct context to any software element processing any event.  That’s how this should be done.
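
To picture that, here’s an illustrative service contract sketch; the field layout is my invention, not the TMF’s.  Each CFS/RFS element carries its own state block, so a copy of the contract is, by itself, complete context for any process handling any event against the service.

```python
# An illustrative (not TMF-defined) service contract: a set of customer-
# facing and resource-facing elements, each carrying its own state block.
contract = {
    "service_id": "svc-001",
    "state": "deploying",
    "elements": [
        {
            "name": "vpn-core",        # a resource-facing service (RFS)
            "type": "RFS",
            "state": "active",
            "variables": {"hosting_point": "dc-east", "instances": 2},
        },
        {
            "name": "managed-access",  # a customer-facing service (CFS)
            "type": "CFS",
            "state": "deploying",
            "variables": {"access_speed_mbps": 100},
        },
    ],
}

def element_state(contract: dict, name: str) -> str:
    # Any process, handed only the contract, can recover its full context.
    return next(e["state"] for e in contract["elements"] if e["name"] == name)
```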

The ONF vision, as they explained it yesterday, provides state control in their model instances, and so do all my own initiatives in defining model-driven services.  If the “states” start with an “orderable” state and advance through the full service lifecycle, then all of the steps needed to deploy, redeploy, scale, replace, and remove services and service elements can be defined as the processes associated with events in those states.  If all those processes operate on the service contract data, then they can all be fully stateless, scalable, and dynamic.
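
A sketch of that state/event dispatching might look like the following; the states, events, and process names are invented for illustration.  Each table entry names a stateless process, and the model instance supplies all the context, so any copy of any process can run anywhere.

```python
# Hypothetical lifecycle states and a state/event table.  Each entry names
# a stateless process; the service model supplies all the context it needs.
def start_deployment(ctx): ctx["state"] = "deploying"
def activate(ctx):         ctx["state"] = "active"
def scale_out(ctx):        ctx["instances"] = ctx.get("instances", 1) + 1
def redeploy(ctx):         ctx["state"] = "deploying"
def tear_down(ctx):        ctx["state"] = "removed"

STATE_EVENT_TABLE = {
    ("orderable", "order"):  start_deployment,
    ("deploying", "ready"):  activate,
    ("active", "overload"):  scale_out,
    ("active", "fault"):     redeploy,
    ("active", "cancel"):    tear_down,
}

def dispatch(ctx: dict, event: str) -> None:
    process = STATE_EVENT_TABLE.get((ctx["state"], event))
    if process:
        process(ctx)  # stateless: all context is in ctx, none in the process

ctx = {"state": "orderable"}
dispatch(ctx, "order")   # ctx["state"] is now "deploying"
```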

Functionally, this can still map back to the NFV ISG’s E2E model, but the functions described in the model would be distributed in two ways: first, by separating their processes and integrating them with the model’s state/event tables as appropriate, and second, by allowing them to execute across as many instances as needed to spread the load or replace broken pieces.

There are some specific issues that would have to be addressed in a model-driven, state/event, service lifecycle management implementation like this.  Probably the most pressing is how you’d coordinate the assignment of finite resources.  You can’t have five or six or more parallel processes grabbing for hosting, storage, or network resources at the same time; some things may have to be serialized.  The heavy lifting of making deployment decisions and the like can still run in parallel, though.  And there are ways of managing collisions between requests for resources too.
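
One simple form of that serialization, sketched below under my own assumptions (a lock standing in for whatever per-resource-pool mechanism a real implementation would use): planning runs in parallel, and only the final commitment of finite capacity is serialized.

```python
import threading

# Planning (deciding where to deploy) can run in parallel across requests;
# only the final commitment of finite capacity is serialized.
pool_lock = threading.Lock()
capacity = {"dc-east": 10}  # hypothetical finite hosting slots

def plan_deployment(request: dict) -> str:
    # Heavy lifting: pick a hosting point.  Safe to run in parallel.
    return "dc-east"

def commit_resources(site: str, slots: int) -> bool:
    # Serialized: two requests can never both take the last slot.
    with pool_lock:
        if capacity[site] >= slots:
            capacity[site] -= slots
            return True
        return False

site = plan_deployment({"service": "svc-001"})
granted = commit_resources(site, 1)
```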

Every operator facility, whether it’s network or cloud, could be a control domain, and while multiple requests for resources in the same domain would have to be collision-proof, requests to different domains could run in parallel.  That reduces the impact of request collisions.  This is necessary in my distributed approach, but it’s also necessary in today’s monolithic model of NFV implementation.  Imagine trying to deploy national or international services through a single instance of MANO!
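
Extending the earlier sketch, one serialization point per control domain (the domain names are again invented) keeps intra-domain requests collision-proof while letting inter-domain work proceed in parallel:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One serialization point per control domain: requests within a domain
# queue up, but requests to different domains run fully in parallel.
domains = {
    "us-east": {"lock": threading.Lock(), "capacity": 8},
    "eu-west": {"lock": threading.Lock(), "capacity": 8},
}

def commit_in_domain(domain: str, slots: int) -> bool:
    d = domains[domain]
    with d["lock"]:
        if d["capacity"] >= slots:
            d["capacity"] -= slots
            return True
        return False

# A multi-domain service commits to each domain independently, so a
# national or international deployment isn't funneled through one point.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda name: commit_in_domain(name, 2), domains))
```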

The final point to make here is that “deployment” is simply one part of the service lifecycle.  If you assume that you deploy things using one set of logic and then sustain them using another, you’re begging for duplication of effort and, very likely, inconsistencies in handling.  Everything in a service lifecycle should be handled the same way and defined by the same model.  That’s true for VNFs, and it’s also true for the NFV control elements.

This isn’t a new issue, which perhaps is why it’s so frustrating.  In cloud computing today, we’re seeing all kinds of initiatives to create software that scales with workload and heals itself.  There’s no reason not to apply those principles to SDN and NFV, particularly when parts of NFV (the VNFs) are specifically supposed to have those properties.  There’s still time to work this sort of thing into designs, and it has to be done if we expect massive deployments of SDN/NFV technology.