Is NFV’s Virtual Network Function Manager the Wrong Approach?

I’ve noted before that the weak link in NFV is operations, or management if you prefer.  A big part of the problem, IMHO, is the need for the ISG to contain its efforts to meet its schedule for completing its Phase One work.  Another issue is the fact that the body didn’t approach NFV from the top down.  Management is a problem because so much of NFV’s near- and long-term value proposition depends on efficient operations.  Service agility means accelerating the service lifecycle—management.  Capex reductions are useful only if you don’t add on so much additional opex due to increased deployment complexity that you swamp the savings.

I’m not the only one who feels there’s a major issue here.  Last spring operators told me that they didn’t have confidence that they could make the business case for NFV and that management was the issue.  Some of their concerns are percolating into visibility in the industry now, and so I think we should do what the ISG didn’t and look at NFV management top-down.

To me, there are two simple high-level principles in play.  First, NFV infrastructure must, at the minimum, fit into current network operations and management practices.  Otherwise it will not be possible to replace physical network functions with virtual functions without changing operations, and that will stall early attempts to prove out benefits.  Second, to the extent that NFV is expected to deliver either service agility or operations efficiency benefits, it must provide improved operations practices that deliver sufficient efficiency impact overall.

If we step down from the first of these, we can see that the immediate consequence of harmony with existing practices is management equivalence between VNFs and PNFs.  I think this requirement was accepted by operators and vendors alike, and their response was the notion of the VNF Manager (VNFM).  If you could collect the management data from the VNFs you could present it to a management system in the same form a PNF would have presented it.  Thus, the theory goes, if you bind a VNFM element into a set of VNFs you can fulfill my first requirement.

Sadly, that’s not the case.  The problem here is that virtualization itself creates a set of challenges, foremost of which is the fact that a PNF is in a box with local, fixed, hardware assets.  The associated management elements of the PNF know their own hardware state because it’s locally available to be tested.  If we translate that functionality to VNF form, we run the functions in a connected set of virtual machines grabbed ad hoc from a resource pool.  How does the VNF learn what we grabbed, or how to interpret the status of things like VMs, hypervisors, data path accelerators, and Open vSwitch instances that were never part of the native hardware?  The fact is that the biggest management problem for NFV isn’t how to present VNF status to management systems, it’s how to determine the state of the resources.
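To make the resource-state problem concrete, here is a minimal sketch of what "knowing what we grabbed" might look like.  It assumes that something outside the VNF records which pooled resources each function was bound to at deployment, and then derives a PNF-style status from them.  The names (Resource, derive_status, the "firewall" binding) are hypothetical illustrations, not anything defined by the ISG.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Resource:
    kind: str        # e.g. "vm", "hypervisor", "vswitch" (illustrative labels)
    ident: str       # identifier of the pooled resource
    up: bool = True  # last known state of that resource


def derive_status(bindings: Dict[str, List[Resource]], vnf: str) -> str:
    """Derive a device-style status for a VNF from the resources bound to it.

    Assumption: the function is 'up' only if every bound resource is up.
    """
    resources = bindings.get(vnf, [])
    if not resources:
        # Nothing recorded at deployment time, so no status can be derived.
        return "unknown"
    failed = [r.ident for r in resources if not r.up]
    return "up" if not failed else "down: " + ", ".join(failed)


# Binding recorded when the virtual function was deployed (hypothetical data).
bindings = {
    "firewall": [Resource("vm", "vm-17"), Resource("vswitch", "ovs-3")],
}
```

The point of the sketch is that the binding table lives outside the VNF.  A function that was never told what it was hosted on has no way to build this view for itself.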

The problem with resource management linkage has created a response, of course.  When vendors talk about things like “policy management” for NFV what they are often saying is that their architecture decouples resources from services explicitly.  I won’t worry about how a slew of servers and VMs look to a management system that expects to manage a physical CPE gateway because I’ll manage the resources independent of the service and never report a fault.  Great, but think of what happens when a customer calls to report their service connection is down, and your CSRs say “Gee, on the average we have 99.999% uptime on our virtual devices so you must be mistaken.  Suck it up and send your payment, please!”

There are services like consumer broadband Internet that can be managed like this, because that’s how they’re managed already.  It is not how business services are managed, not how CPE is managed, not how elements of mobile infrastructure are managed.  For them, I contend that the current approach fails to meet the first requirement.

And guess what.  The first requirement only gets you in the game, preventing NFV from being more costly and less agile than what we have now.  We are asking for improved operations efficiency, and that raises two new realities.  First, you can’t make meaningful alterations to opex by diddling with one little piece of a service.  Just like you can’t alter the driving time from NYC to LA by changing one traffic light’s timing.  Second, you can’t make meaningful alterations to even a piece of opex if you don’t do anything different there.  We have decoupled operations and network processes today and if we want service automation we have to make operations event-driven.

Event-driven doesn’t mean that you simply componentize stuff so you can run it when an event occurs.  Event-driven processes need events, but they also need state, context.  A service ordered and not yet fulfilled is in (we could say) the “Unactivated” state.  Activate it and it transitions to “Activating” and then becomes “Ready”.  A fault in the “Activating” process has to be remedied but there’s no customer impact yet, so no operations processes like billing are impacted.  In the “Ready” state the same fault has to do something different—fail over, invoke escalation, offer a billing credit…you get the picture.

What is really needed for NFV is data-modeled operations where you define a service as a set of functional or structural objects, assign each object a set of states and define a set of outside events for each.  You then simply identify the processes that are to be run when you encounter a given event in a given state.  Those processes can be internal to NFV, they can be specialized for the VNF, they can be standard operations processes or management processes.
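A rough sketch of that idea, using the "Unactivated", "Activating", and "Ready" states from above, might look like the following.  The event names ("activate", "activated", "fault"), the process functions, and the choice to return to "Ready" after a fault is handled are all my assumptions for illustration; a real data model would define its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A process receives the object and the event, and returns the next state.
Process = Callable[["ServiceObject", str], str]


@dataclass
class ServiceObject:
    name: str
    state: str = "Unactivated"
    table: Dict[Tuple[str, str], Process] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def on(self, state: str, event: str, process: Process) -> None:
        """Bind a process to a (state, event) pair."""
        self.table[(state, event)] = process

    def handle(self, event: str) -> str:
        """Run whatever process is bound to this event in the current state."""
        process = self.table.get((self.state, event))
        if process is None:
            self.log.append(f"ignored {event} in {self.state}")
            return self.state
        self.state = process(self, event)
        return self.state


def start_activation(obj: ServiceObject, event: str) -> str:
    obj.log.append("deploying")
    return "Activating"


def finish_activation(obj: ServiceObject, event: str) -> str:
    obj.log.append("start billing")
    return "Ready"


def retry_deployment(obj: ServiceObject, event: str) -> str:
    # Fault before the service is live: fix it, no customer impact.
    obj.log.append("redeploy, no customer impact")
    return "Activating"


def fail_over_and_credit(obj: ServiceObject, event: str) -> str:
    # Same fault on a live service: a different process runs.
    obj.log.append("fail over, escalate, billing credit")
    return "Ready"  # assumes the failover succeeded


def build_gateway() -> ServiceObject:
    gw = ServiceObject("vCPE-gateway")  # hypothetical service object
    gw.on("Unactivated", "activate", start_activation)
    gw.on("Activating", "activated", finish_activation)
    gw.on("Activating", "fault", retry_deployment)
    gw.on("Ready", "fault", fail_over_and_credit)
    return gw
```

Notice that the same "fault" event runs a different process depending on the state the object is in, which is the context that simple componentization doesn't give you.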

State/event is the only way that NFV management can work, and it makes zero sense to assume that every VNF vendor would invent their own state/event handling.  It makes no sense that every vendor would invent their own way of accessing the status of underlying resources on which their VNFs were hosted, or that operators would let VNF-specific processes control shared-tenancy elements like servers and switches directly.  We can, with a single management strategy, fulfill both the resource-status-and-service coupling needs of NFV (my first requirement) and the operations efficiency gains (my second).  But we can’t do it the way it’s being looked at today.
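One way to picture a single strategy that still leaves room for vendors is a shared state/event lookup where a VNF-specific process overrides the common one only where it has been registered.  This is a sketch under my own assumptions; the table names, the "firewall" and "router" types, and the process descriptions are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, str]        # (state, event)
Process = Callable[[], str]  # returns a description of the action taken

# Processes every service object gets by default.
COMMON: Dict[Key, Process] = {
    ("Ready", "fault"): lambda: "common: fail over and open ticket",
    ("Activating", "fault"): lambda: "common: redeploy",
}

# Overrides a VNF vendor supplies for its own function type only.
VNF_SPECIFIC: Dict[str, Dict[Key, Process]] = {
    "firewall": {
        ("Ready", "fault"): lambda: "firewall: resync rule set, then fail over",
    },
}


def resolve(vnf_type: str, state: str, event: str) -> Optional[Process]:
    """Prefer a VNF-specific process, fall back to the common one."""
    key = (state, event)
    specific = VNF_SPECIFIC.get(vnf_type, {})
    return specific.get(key, COMMON.get(key))
```

For example, resolve("router", "Ready", "fault") falls through to the common process, while the same state and event for "firewall" pick up the vendor's override.  The vendor supplies a process, not its own state/event engine.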

This shouldn’t be about whether we have a VNF-specific model of management or a general model.  We need a state-event model of management that lets us introduce both “common” management processes and VNF-specific processes as needed.  Without that it’s going to be darn hard to meet NFV’s operations objectives, gain any service agility, or even sustain capex reductions.  All of NFV hinges on management, and that is the simple truth.  It’s also true that we’re not doing management right in NFV yet, at least not in a standards-defined way, and that poses a big risk for NFV deployment and success.