Can Effective NFV Management/Analytics Solve SDN’s Management Problem?

Both SDN and NFV deal with virtualization, with abstractions realized by instantiating something on real infrastructure.  Both have management issues that stem from those factors, which means that they share many of the same management problems.  Formally speaking, neither SDN nor NFV seems to be solving these problems, but there are signs that some of the NFV vendors are being forced to face them, and are facing them with analytics.  That may then solve the problems for SDN as well, and for the cloud.

You could argue that our first exposure to the reality of virtualization came from OpenStack.  In OpenStack, we had a series of models (CPU, network, image store, etc.) that represented collections of abstractions.  When something deployed you made the abstractions real.  DevOps, which also came into its own in the cloud even though the concept preceded the cloud, also recognized that abstract models were at least a viable approach to defining deployment.  OASIS TOSCA carries that forward today.

The basic problem abstractions create is that the abstraction represents the appearance, meaning what the user would see in a logical sense.  If you have a VPN, which is an abstraction, you expect to use and manage the VPN.  At some point you have to come to terms with the fact that the VPN is built from real elements, because you can’t fix real devices in the virtual world.  Coping with reality is always problematic, though, and virtual management is no exception.

My personal solution to this problem was what I called derived operations, which means that the management of an abstraction is done by creating a set of formulary bindings between management variables in the abstract plane and real variables from real resources.  It’s not unlike driving a car; you have controls that are logical for the behavior of driving, and these are linked to real car parts so that the logical change you command is converted into changes in the auto elements that make it real.

In one sense, derived operations is simple.  You could say “object status = worst-of(subordinate status)” or something similar.  In virtualization environments of course, you don’t know what the subordinates are until resources are allocated.  That means two levels of binding—you have to link abstract management variables to other abstract variables that will on deployment be linked to real variables.  You also have to accept the fact that in many cases you will have layers of abstraction created to facilitate composition.  Why define a VPN in detail when you can use a VPN abstraction as part of any service that uses VPNs?  But all of this is at least understood.
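A minimal sketch of the "worst-of" binding and the two levels of binding described above may help. The class names (`Status`, `Resource`, `ModelObject`) and the three status levels are assumptions made for illustration, not part of any SDN or NFV specification. An abstraction with nothing bound to it yet reports `OK`, which is also an assumption.

```python
from enum import IntEnum


class Status(IntEnum):
    # Hypothetical status levels; a higher value means a worse condition.
    OK = 0
    DEGRADED = 1
    FAULT = 2


class Resource:
    """A real resource whose status is set directly by monitoring."""

    def __init__(self, name, status=Status.OK):
        self.name = name
        self.status = status


class ModelObject:
    """An abstract object whose status is derived from its subordinates."""

    def __init__(self, name):
        self.name = name
        self.subordinates = []  # other ModelObjects, or Resources once deployed

    def bind(self, subordinate):
        self.subordinates.append(subordinate)

    @property
    def status(self):
        # object status = worst-of(subordinate status)
        if not self.subordinates:
            return Status.OK  # assumption: unbound abstractions report OK
        return max(s.status for s in self.subordinates)


# First level of binding: abstract to abstract, defined before deployment.
vpn = ModelObject("vpn")
service = ModelObject("business-service")
service.bind(vpn)

# Second level of binding: abstract to real, made when resources are allocated.
trunk = Resource("trunk-1")
router = Resource("router-7")
vpn.bind(trunk)
vpn.bind(router)

trunk.status = Status.FAULT
print(service.name, service.status.name)
```

Note that the status here is recomputed each time it is read, which is the polled style of management; nothing tells the service that the trunk changed.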

The next problem is more significant.  Say we have an abstract service object.  It has a formula set to describe its status.  We change the value of one of the resource status variables, one that probably impacts a dozen or so service objects.  How do we tell those objects that something changed?  If we’re in the world of “polled management” we assume that when somebody or something looks at our service object, we would refresh its variables by running the formulary bindings it contains.

Well, OK, but even that may not work.  It’s not efficient to keep running a management function just to see if something changed.  We’d be wasting a lot of cycles and potentially polling for state too many times from too many places.  What we need is the concept of an event.

Events are things that have to be handled, and the handling is usually described by referencing a table of “operating states” and events, the intersection of which identifies the process to be invoked.  We know how to do this sort of thing because virtually every protocol handler is written around such a table.  The challenge comes in distributed event sources.  Say a trunk that supports a thousand connections fails.  That failure has to be propagated up to the thousand service models that are impacted, but how does the resource management process know to do that?
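The state/event table idea can be sketched in a few lines. The state names, event names, and handler functions below are hypothetical, chosen only to show how the intersection of a state and an event selects the process to run.

```python
# Hypothetical handler processes; each one just records what it did.
def on_activate(service):
    service["log"].append("activated")


def on_fault(service):
    service["log"].append("fault handled")


def on_restored(service):
    service["log"].append("restored")


# (current state, event) -> (process to invoke, next state)
STATE_EVENT_TABLE = {
    ("ordered", "activate"): (on_activate, "active"),
    ("active", "fault"): (on_fault, "degraded"),
    ("degraded", "restored"): (on_restored, "active"),
}


def handle(service, event):
    """Look up the state/event intersection and run the process found there."""
    entry = STATE_EVENT_TABLE.get((service["state"], event))
    if entry is None:
        # Assumption: events with no table entry for this state are ignored.
        return False
    process, next_state = entry
    process(service)
    service["state"] = next_state
    return True


svc = {"name": "vpn-service", "state": "active", "log": []}
handle(svc, "fault")
print(svc["state"], svc["log"])
```

The table handles an event once it arrives at a service object; it says nothing about how a resource failure finds the right service objects in the first place, which is the distributed-source problem raised above.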

This is where analytics should come in.  Unfortunately, the use of analytics in SDN or NFV management has gotten trivialized because there are a number of ways it could be used, one of which is simply to support management of resources independent of services.  Remember the notion of “directory-enabled networking” where you have a featureless pool of capacity that you draw on up to a point determined by admission control?  Well that’s the way that independent analytics works.  Fix resource faults and let services take care of themselves.

If you want real service management you have to correlate resource events with service conditions, which means you have to provide some way of activating a service management process to analyze the formulary bindings that define variable relationships, and anchor some of those binding conditions as “events” to be processed.  If I find a status of “fault” here, generate an event.

When you consider this point, you’ve summarized what I think are the three requirements for SDN/NFV analytics:

  1. The proactive requirement, which says that analytics applied to resource conditions should be able to do proactive management to prevent faults from happening. Some of this is traditional capacity planning, some might drive admission control for new services, and some might impact service routing.
  2. The resource pool management requirement, which says that actual resource pools have to be managed as pools of real resources with real remediation through craft intervention as the goal. At some point you have to dispatch a tech to pull a board or jiggle a plug or something.
  3. The event analysis requirement, which says that analytics has to be able to detect resource events and launch a chain of service-level reactions by tracking the events along the formulary bindings up to the services.
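The event analysis requirement can be illustrated with a reverse index over the bindings: given a failed resource, walk upward through the objects that depend on it until service objects are reached. This is a sketch under stated assumptions; `BindingGraph`, its method names, and the object names are all hypothetical.

```python
from collections import defaultdict, deque


class BindingGraph:
    """Reverse index of bindings: for each object, who depends on it."""

    def __init__(self):
        self.parents = defaultdict(set)
        self.services = set()

    def bind(self, parent, child):
        # Record that `parent` derives some of its state from `child`.
        self.parents[child].add(parent)

    def mark_service(self, name):
        self.services.add(name)

    def affected_services(self, resource):
        """Breadth-first walk up the bindings from a resource to the services."""
        seen = {resource}
        queue = deque([resource])
        hit = set()
        while queue:
            node = queue.popleft()
            if node in self.services:
                hit.add(node)
            for parent in self.parents.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return hit


graph = BindingGraph()
graph.bind("vpn-a", "trunk-1")
graph.bind("vpn-b", "trunk-1")
graph.bind("vpn-c", "trunk-2")
graph.bind("svc-1", "vpn-a")
graph.bind("svc-2", "vpn-b")
graph.bind("svc-3", "vpn-c")
for name in ("svc-1", "svc-2", "svc-3"):
    graph.mark_service(name)

# A fault on trunk-1 becomes one service-level event per affected service.
events = [(svc, "fault") for svc in sorted(graph.affected_services("trunk-1"))]
print(events)
```

With a thousand connections on a trunk the walk is the same; only the size of the fan-out changes.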

The nature of the services being supported determines the priority of these three requirements for a given installation, but if you presume the current service mix then you have to presume all three requirements must be fulfilled.  Given that “service chaining” and “virtual CPE” both presume some level of service-level agreement because they’re likely first applied to business services, early analytics models for SDN/NFV management would have to address the event analysis requirement, which is the current stumbling block.

From an implementation perspective, it’s my view that no SDN/NFV analytics approach is useful if it doesn’t rely on a repository.  Real-time event-passing and management of services from the customer and customer-service side would generate too much management traffic and load the real devices at the bottom of the chain.  So I think all of this has to be based on a repository and query function, something that at least some of the current NFV implementations already support.
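As a rough illustration of the repository-and-query idea, the sketch below stores resource status in an in-memory SQLite database and answers service-level questions with a query, so the devices themselves are never polled by the service side. The table names, column names, and numeric status encoding (0 = OK, 2 = fault) are assumptions made for the example.

```python
import sqlite3

repo = sqlite3.connect(":memory:")
repo.execute(
    "CREATE TABLE resource_status ("
    "resource TEXT PRIMARY KEY, status INTEGER, updated REAL)"
)
repo.execute("CREATE TABLE binding (service TEXT, resource TEXT)")


def record_status(resource, status, timestamp):
    """Called by the collection side; devices are touched only here."""
    repo.execute(
        "INSERT OR REPLACE INTO resource_status VALUES (?, ?, ?)",
        (resource, status, timestamp),
    )


def service_status(service):
    """Worst-of status for a service, answered from the repository alone."""
    row = repo.execute(
        "SELECT MAX(rs.status) FROM binding b "
        "JOIN resource_status rs ON rs.resource = b.resource "
        "WHERE b.service = ?",
        (service,),
    ).fetchone()
    return row[0]  # None if the service has no bound resources on record


repo.executemany(
    "INSERT INTO binding VALUES (?, ?)",
    [("svc-1", "trunk-1"), ("svc-1", "router-7"), ("svc-2", "router-7")],
)
record_status("trunk-1", 2, 1000.0)   # hypothetical fault report
record_status("router-7", 0, 1000.0)  # hypothetical OK report

print(service_status("svc-1"), service_status("svc-2"))
```

Any number of customer or customer-service views can run the query without adding load to the devices at the bottom of the chain.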

Where this is important to SDN is that if you can describe SDN services as a modeled tree of abstract objects with formulary bindings to other objects and to the underlying resources, you can manage SDN exactly as you manage NFV.  For vendors who have a righteous model for NFV, that could be a huge benefit because SDN is both an element of any logical NFV deployment and a service strategy in and of itself.  Next time you look at management analytics, therefore, look for those three capabilities and how modeling and formulary binding can link them into a true service management strategy.