Are The Multiple NFV MANO Candidates Helpful, or All Incomplete?

With the introduction of SK Telecom’s T-MANO into the mix, we have yet another promised improvement in the basic management and orchestration model for NFV.  In the past, I’ve tended to defend all these initiatives on the theory that somebody might end up getting it right.  Other than the merger of AT&T’s ECOMP and Open-O into ONAP, though, I’ve not seen much progress in that direction.  In fact, as I’ve noted, it’s not clear whether ONAP is the “right” answer, meaning the answer that unlocks enough NFV benefits to actually drive significant deployment.

In my view as a programmer, software architect, and former director of commercial software development, there are three basic problems that implementations of NFV MANO have experienced.  All of them arguably stem from limitations in the functional end-to-end model, though if you interpret the model as “functional” (as I’ve been assured it was intended to be), you could likely correct them.  A good implementation of NFV MANO would have to deal with these issues.

The first issue is taking an implied “serial” or sequential-step approach to the problem.  Functional models display the relationship between logical processes.  We have the NFV Orchestrator linked to the Virtual Infrastructure Manager (VIM) and to the VNF Manager (VNFM), for example.  It’s easy to interpret this as meaning the Orchestrator “calls” the other two, or is called by them, or both.  That, in fact, is generally how the model has been interpreted.

The problem with that is that deploying anything is either event-driven or serial/sequential in nature.  If the latter approach is chosen, because you’re following the ETSI E2E model literally, then you have to visualize these interactions (across the reference APIs) as a “call” and a “response”.  The Orchestrator calls the VIM to deploy something, and the VIM responds with either a “you-got-it” or an “it-failed”.  In the real world, this forces the calling process to wait for the result, or to keep checking back to see what the result was.  Either of these options ties up the calling process, which limits how much it can do at once.
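
To make the problem concrete, here’s a minimal Python sketch of that literal, serial reading of the model.  The class and method names (Vim, deploy, and so on) are my own illustrations, not anything from the ETSI specifications; the point is simply that the Orchestrator is frozen for the duration of the call.

```python
import time

# Hypothetical stand-in for a VIM; a real deployment could take minutes.
class Vim:
    def deploy(self, descriptor):
        time.sleep(2)            # simulate a slow deployment
        return "you-got-it"      # or "it-failed"

# A literal, serial reading of the E2E model: the Orchestrator calls
# the VIM and can do nothing else until the response comes back.
class Orchestrator:
    def __init__(self, vim):
        self.vim = vim

    def deploy_service(self, descriptor):
        # This call blocks; any OSS/BSS request or infrastructure event
        # that arrives in the meantime simply has to wait.
        return self.vim.deploy(descriptor)

print(Orchestrator(Vim()).deploy_service("vFirewall-service"))
```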

A solution to this, for at least some of the development, is “multi-threading”, which means that there can be multiple copies of a given process.  You can have several “Orchestrators” and “VIMs” running in parallel.  That’s great in terms of increasing the number of things you can do at once, but it creates a problem when you’re assigning finite resources.  How do you ensure that two threads don’t grab the same thing?
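
Here’s a hedged sketch of that finite-resource problem and the conventional fix, a lock that serializes allocation.  Everything here (ResourcePool, the host names) is hypothetical; note that the lock reintroduces a serial bottleneck at exactly the point the threads were supposed to parallelize.

```python
import threading

# A hypothetical finite resource pool shared by parallel deployment threads.
# Without the lock, two threads could both see the same host as free and
# both "grab" it.
class ResourcePool:
    def __init__(self, hosts):
        self.free_hosts = list(hosts)
        self.lock = threading.Lock()        # serializes allocation

    def allocate(self):
        with self.lock:                     # one thread allocates at a time
            return self.free_hosts.pop() if self.free_hosts else None

pool = ResourcePool(["host-1", "host-2"])
results = []

def deploy_worker(name):
    results.append((name, pool.allocate()))

threads = [threading.Thread(target=deploy_worker, args=(f"thread-{i}",))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Two threads get hosts, two get None; no host is ever handed out twice.
print(results)
```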

A second problem arises when, while you’re waiting for the hypothetical VIM to respond, another request arrives from somewhere else, like the OSS/BSS, or a condition arises in a piece of infrastructure you’ve already deployed.  You’re sitting waiting for your VIM, so how does this new condition get reported?  Most implementations simply hold the request (queue it, in programming terms) till the Orchestrator is free.  But what if the new condition means that you can no longer build the service as you envisioned?  The thing you’re waiting on the VIM for is now obsolete.
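
A small sketch of what that queuing costs you, using an invented event format: by the time the queued event is finally examined, the work the Orchestrator was blocked on may already be moot.

```python
from collections import deque

# Invented event format: while the Orchestrator is blocked on the VIM,
# everything that arrives just sits in this queue.
event_queue = deque()

# The Orchestrator is "busy" deploying svc-1 against its current plan...
event_queue.append({"service": "svc-1", "type": "trunk-failure",
                    "invalidates_plan": True})

# ...the VIM finally responds, and only now is the queue drained.
while event_queue:
    event = event_queue.popleft()
    if event["invalidates_plan"]:
        # The deployment we just finished waiting for was built against
        # a service plan this event has made obsolete.
        print(f"stale work: {event['type']} arrived while we were blocked")
```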

I wrestled with this almost a decade ago in the original ExperiaSphere project (done at the behest of some network operators who were involved with me in the TMF Service Delivery Framework work).  What I found was that controlling a service lifecycle even with multiple threads was very complicated.  We still see serialized single-thread pieces in OpenStack for the same reason.  You have to think in state/event terms when you have software that is intrinsically dependent on a bunch of “events” generated from asynchronous sources.  Every network fits that model, and so does every NFV implementation.

Event-driven software recognizes that every multi-step process has a defined set of “states” it can be in, like waiting for a response from a VIM.  When the Orchestrator in an event-driven implementation asks the VIM for something, it enters a “waiting-for-VIM” state, and there it awaits an event or signal from the VIM.  Protocol handlers and protocol stacks, including the IP stack, are traditionally implemented this way because you can’t predict what’s going to happen elsewhere; there are at least two parties involved in every communication.
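
Here’s a minimal state/event sketch in Python of the kind of structure I’m describing.  The states, events, and handler names are all invented for illustration; the essential feature is that nothing ever blocks, the table simply routes whatever event arrives against the current state.

```python
from enum import Enum, auto

class State(Enum):
    ORDERED = auto()
    WAITING_FOR_VIM = auto()
    ACTIVE = auto()
    FAULT = auto()

# Invented lifecycle processes; each just records the transition here.
def start_deploy(svc): svc["state"] = State.WAITING_FOR_VIM
def activate(svc):     svc["state"] = State.ACTIVE
def remediate(svc):    svc["state"] = State.FAULT

# The state/event table: (current state, event) -> process to run.
TRANSITIONS = {
    (State.ORDERED,         "deploy-request"): start_deploy,
    (State.WAITING_FOR_VIM, "vim-success"):    activate,
    (State.WAITING_FOR_VIM, "vim-failure"):    remediate,
    (State.ACTIVE,          "infra-alarm"):    remediate,
}

def on_event(svc, event):
    handler = TRANSITIONS.get((svc["state"], event))
    if handler:
        handler(svc)      # nothing blocks; we just react to what arrives

svc = {"id": "svc-1", "state": State.ORDERED}
for ev in ("deploy-request", "vim-success", "infra-alarm"):
    on_event(svc, ev)
    print(ev, "->", svc["state"].name)
```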

The big issue that serialized-versus-event-driven raises is scalability, particularly if a flood of issues arises out of some major event like a trunk or data center failure.  If every request has to be handled in a single-threaded way, then there’s a high probability that remediation of any large-scale problem will take a lot of time, far more than users would likely tolerate.

Issue number two is lack of an effective relationship between the virtual part of NFV and the real resources.  We still have people who read the ETSI End-to-End model and think there is a single virtual infrastructure manager, or that vendors would generate a VIM for their own equipment.  We still have people who think that you manage a virtual firewall by managing the software instance, which means that somehow the software instance has to be able to manage the real resources.  None of that is workable.

A single VIM can’t work because every server platform, every hypervisor, and every cloud stack has its own APIs and requirements.  Do we expect to see all of these implemented in one enormous VIM?  Then add in the now-recognized truth that we’ll also have to be able to manage some real network equipment, and you have a formula for a million silos.  We need multiple VIMs, and we need a way of linking them to a request for deployment; I’ll return to that point below.
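
As a sketch of what “multiple VIMs” could look like, here’s a hypothetical registry that routes a deployment request to the VIM responsible for that infrastructure type.  The VIM names and the routing key are assumptions of mine, not anything from the ETSI material.

```python
# Hypothetical VIM classes, one per infrastructure type, including a
# "VIM" that fronts real (physical) network equipment.
class OpenStackVim:
    def deploy(self, req): return f"OpenStack deployed {req}"

class VmwareVim:
    def deploy(self, req): return f"vSphere deployed {req}"

class LegacyNetworkVim:
    def deploy(self, req): return f"configured physical device for {req}"

# The registry links a resource class named in the deployment request
# to the VIM responsible for it.
VIM_REGISTRY = {
    "openstack": OpenStackVim(),
    "vmware":    VmwareVim(),
    "pnf":       LegacyNetworkVim(),
}

def deploy(resource_class, request):
    return VIM_REGISTRY[resource_class].deploy(request)

print(deploy("openstack", "vFirewall"))
print(deploy("pnf", "edge-router"))
```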

As far as managing software instances goes, meaning the VNF Manager, we have two problems.  One is that just as firewall devices all have their own management and configuration processes, so do the software versions of those devices.  That means every different firewall (or other VNF) would have to be paired with its own management tool.  Hardly a recipe for integration, onboarding, and interoperability.  What we need (and what I think SK Telecom is looking to provide) is a standard model for the management and configuration interface for any given device or virtual-device type.  You then map each proprietary or vendor-specific interface to that standard.
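
A rough illustration of that mapping, using the classic adapter pattern: a single standard firewall-management interface, with a per-vendor adapter that translates it to the proprietary API.  All of the names here are invented; onboarding a new vendor then means writing one adapter, not re-integrating every service.

```python
from abc import ABC, abstractmethod

# An invented "standard" management model for the firewall device class;
# every firewall VNF gets mapped onto this one interface.
class FirewallManager(ABC):
    @abstractmethod
    def set_rule(self, src, dst, action): ...

# Vendor A's VNF has its own proprietary API...
class VendorAFirewall:
    def push_acl(self, acl_string):
        print("vendorA:", acl_string)

# ...so onboarding means writing one adapter from the standard model
# to the proprietary calls, per vendor rather than per service.
class VendorAAdapter(FirewallManager):
    def __init__(self, device):
        self.device = device

    def set_rule(self, src, dst, action):
        self.device.push_acl(f"{action.upper()} {src} -> {dst}")

fw = VendorAAdapter(VendorAFirewall())
fw.set_rule("10.0.0.0/8", "any", "deny")   # same call for any vendor
```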

The issue of how the real resources get managed is much harder.  The problem is that resource pools are inherently multi-tenant.  You can’t have every VNF running on pooled resources diddling with the behavior of those common resources, even indirectly.  That means that you have to be able to use the VIM as an intermediary, which is possible in the ETSI E2E model.  However, this complicates the issue of serial/sequential processing versus state/event, because the Orchestrator is also using the VIM and might actually be in the process of working with the same service!
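
Here’s one way the intermediary role might look, sketched with invented names: the VIM scopes every management request to the resources the requesting tenant actually owns, so no VNF ever touches the shared pool directly.

```python
# Invented names throughout: the VIM mediates all changes to the shared
# pool and checks that the requester owns the resource being touched.
class SharedInfrastructure:
    def set_host_param(self, host, key, value):
        print(f"{host}: {key}={value}")

class VimMediator:
    def __init__(self, infra, tenant_hosts):
        self.infra = infra
        self.tenant_hosts = tenant_hosts     # tenant -> hosts it owns

    def request_change(self, tenant, host, key, value):
        if host not in self.tenant_hosts.get(tenant, set()):
            raise PermissionError(f"{tenant} may not touch {host}")
        self.infra.set_host_param(host, key, value)

vim = VimMediator(SharedInfrastructure(), {"tenant-a": {"host-1"}})
vim.request_change("tenant-a", "host-1", "vcpu_limit", 4)    # allowed
# vim.request_change("tenant-b", "host-1", "vcpu_limit", 8)  # would raise
```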

The third issue is a lack of “models”, particularly intent models, to represent key elements.  If you’ve waded through the two prior issues, you realize that if service lifecycle management is a fully asynchronous system (which it is) and has to be state/event processed, then there has to be some way of recording the per-service data and state information.  In fact, if a service has multiple parts (access and WAN, east and west, US and EU, etc.) then each part has to have its own repository of data and state.  The TMF recognized this in its NGOSS Contract work, which proposed to use the service data model to steer events to processes.
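
To show what NGOSS-Contract-style event steering might look like in miniature, here’s a sketch where each part of the service model carries its own data and state, and a steering table maps (state, event) pairs to processes.  The structure is my own illustration of the concept, not the TMF’s specification.

```python
# Illustrative only: each part of the service carries its own data and
# state, and the steering table maps (state, event) to a process.
service_model = {
    "name": "biz-vpn-1",
    "parts": {
        "access-us": {"state": "active",    "data": {"site": "NY"}},
        "wan-core":  {"state": "deploying", "data": {"provider": "carrier-x"}},
        "access-eu": {"state": "active",    "data": {"site": "Paris"}},
    },
}

STEERING = {
    ("deploying", "vim-success"): lambda part: part.update(state="active"),
    ("active",    "fault"):       lambda part: part.update(state="remediating"),
}

def steer(model, part_name, event):
    part = model["parts"][part_name]     # the repository for that part
    process = STEERING.get((part["state"], event))
    if process:
        process(part)

steer(service_model, "wan-core", "vim-success")
print(service_model["parts"]["wan-core"]["state"])   # -> active
```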

The model-driven or “data model” approach has a long history, and the most recent innovation has been the introduction of a specific “intent model”, which focuses on modeling elements of a service as functions described by what they are expected to do, not how they’re expected to do it.  There is nothing in the ETSI material on service modeling, and all of the implementations so far are light at best in that area.  Even ECOMP lacks a comprehensive model picture.  That’s unfortunate, because the right approach to service modeling, coupled with the notion of the model as an event-steering or state/event hub, is critical to the success of any NFV implementation, and in my view also critical to operational transformation and cost reduction for networking overall.
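
For illustration, here’s what an intent-modeled element might look like as a simple data structure; every field name is an assumption of mine.  The key property is that the model describes only the “what” (the SLA and externally visible properties) and hides the “how” behind the model boundary.

```python
# Every field name here is an assumption, sketching the "what, not how"
# boundary of an intent model.
firewall_intent = {
    "element": "firewall",
    "intent": {                           # the "what": externally visible
        "throughput_gbps": 10,
        "availability": 0.9999,
        "ports": {"inside": "LAN", "outside": "WAN"},
    },
    # The "how" stays behind the model boundary; any realization that
    # meets the intent is acceptable, virtual or physical.
    "candidate_realizations": ["vnf:vendorA-vFW", "pnf:applianceB"],
}
print(firewall_intent["intent"])
```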

One of the beauties of a service-model-driven, event-handling approach is that the processes themselves don’t store state internally, which means that any copy of the software can operate on the service data model and obtain the same result.  In short, the software would be fully scalable (with the exception of pieces designed to serialize resource requests, such as some of the OpenStack components).  This points up how important it is that the resource-related elements, which in the ETSI NFV context means the VIM, be well-designed to allow reliable resource control without sacrificing the ability to support multiple simultaneous requests.
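
A last sketch of why statelessness buys scalability: if the lifecycle process is a pure function of the service record and the event, any copy of it produces the same result, so you can run as many copies as the event load requires.  The record format and event names are again invented.

```python
# Invented record and event names: the handler is a pure function of the
# service data model and the event, holding no state of its own.
def handle_event(service_record, event):
    updated = dict(service_record)       # no hidden instance state
    if service_record["state"] == "deploying" and event == "vim-success":
        updated["state"] = "active"
    return updated

record = {"id": "svc-9", "state": "deploying"}

# Two identical copies of the process produce the same result, so copies
# can be spun up freely to absorb an event flood.
assert handle_event(record, "vim-success") == handle_event(record, "vim-success")
print(handle_event(record, "vim-success"))
```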

The challenge with NFV tests is that a limited service or deployment scope sets up a scenario where functional validation is possible but operational validation is not.  The three points I raise here are critical to the latter, and until we have implementations that address them, we don’t have a practical approach to NFV, to SDN, or to transformation overall.