Is the New OPNFV Event Streams Project the Start of the Right Management Model?

One of those who comment regularly on my blog brought a news item to my attention.  The OPNFV project has a new activity, introduced by AT&T, called “Event Streams” and defined HERE.  The purpose of the project is to create a standard format for sending event data from the Service Assurance component of NFV to the management process for lifecycle management.  I’ve been very critical of NFV management, so the question now is whether Event Streams will address my concerns.  The short answer is “possibly, partly.”

The notion of events and event processing goes way back.  All protocol handlers treat messages as events, for example, and you can argue that even transaction processing is about "events" that represent things like bank deposits or inventory changes.  At the software level, the notion of an "event" is the basis for one form of exchanging information between processes, something sometimes called a "trigger" process.  The other popular form is called a "polled" process, because in that form a software element isn't signaled that something is happening; it checks periodically to see whether it is.
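
To make the distinction concrete, here's a minimal sketch of the two styles.  It's purely illustrative; the function names and loop shapes are my own assumptions, not anything Event Streams or OPNFV defines.

```python
import queue
import time

# Event-driven ("trigger") style: the handler runs only when something arrives.
def run_triggered(event_queue, handler):
    while True:
        event = event_queue.get()      # blocks until an event is delivered
        if event is None:              # sentinel meaning "shut down"
            break
        handler(event)

# Polled style: the element checks on a schedule whether anything has changed,
# whether or not anything actually did.
def run_polled(check_status, handler, interval_s=5.0, cycles=3):
    for _ in range(cycles):
        status = check_status()
        if status is not None:         # most polls find nothing to act on
            handler(status)
        time.sleep(interval_s)

# Minimal use of the triggered form: two events, then the stop sentinel.
q = queue.Queue()
for item in ("link-down", "link-up", None):
    q.put(item)
run_triggered(q, handler=lambda e: print("handled", e))
```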

Many of the traditional management and operations activities of networks have been more polled than triggered because provisioning was considered to be a linear process.  As networks got more complicated, more and more experts started talking about “event-driven” operations, meaning something that was triggered by conditions rather than written as a flow that checked on stuff.  So Event Streams could be a step in that direction.

A step far enough?  There are actually three things you need to make event-driven management work.  One, obviously, is the events.  The second is the concept of state, and the third is a way to address the natural hierarchy of the service itself.  If we can find all those things in NFV, we can be event-driven.  Events we now have, but what about the rest?

Let’s start with “state”.  State is an indication of context.  Suppose you and I are conversing, and I’m asking you questions that you answer.  If there’s a delay or if you don’t hear me, you might miss a question and I might ask the next one.  Your answer, correct in the context you had, is now incorrect.  But if you and I each have a recognized “state” like “Asking”, “ConfirmHearing”, and “Answering” then we can synchronize through difficulties.

In network operations and management, state defines where we are in a lifecycle.  We might be "Ordered", "Activating", or "Operating", and events mean different things in each state.  If I get an "Activate" in the "Ordered" state, it's the trigger for the normal next step of deployment.  If I get one in the "Operating" state, it's an indication of a lack of synchronicity between the OSS/BSS and the NFV processes.  It is, that is, only if I have a state defined in the first place.
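
Here's a hedged sketch of what a state/event table for a single element might look like, using the states from that example.  The table structure and handler names are my own illustration, not something the ETSI specs or Event Streams define.

```python
# Hypothetical state/event table for one service element.  The state and
# event names follow the text above; the handler functions are assumptions.

def begin_deployment(element):
    print(f"{element}: normal next step, starting deployment")
    return "Activating"

def flag_out_of_sync(element):
    print(f"{element}: unexpected Activate, OSS/BSS and NFV views disagree")
    return "Operating"          # stay in place, but raise an alarm

STATE_EVENT_TABLE = {
    ("Ordered",   "Activate"): begin_deployment,
    ("Operating", "Activate"): flag_out_of_sync,
}

def handle_event(state, event, element):
    action = STATE_EVENT_TABLE.get((state, event))
    if action is None:
        raise ValueError(f"no handler for {event!r} in state {state!r}")
    return action(element)

# The same event drives very different behavior depending on state.
handle_event("Ordered", "Activate", "Access-1")     # normal deployment step
handle_event("Operating", "Activate", "Access-1")   # synchronization alarm
```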

Let’s look now at a simple “service” consisting of a “VPN” component and a series of “Access” components.  The service won’t work if all the components aren’t working, so we could say that the service is in the “Operating” state when all the components are.  Logically, what should happen then is that when all the components are in the “Ordered” state, we’d send an “Activate” to the top-level “Service object”, and it would in turn generate an event to the subordinates to “Activate”.  When each had reported it was “Operating”, the service would enter the “Operating” state and generate an event to the OSS/BSS.
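
A toy version of that hierarchy, under my own assumptions about class, state, and event names, might look like this: the Service object propagates "Activate" downward and only enters "Operating" when every subordinate has reported in.

```python
# A toy version of the hierarchy above: a Service with a VPN and two Access
# components, coupled by events.  Names and structure are assumed.

class Component:
    def __init__(self, name):
        self.name = name
        self.state = "Ordered"

    def on_activate(self):
        # Real deployment would be asynchronous; the sketch jumps straight
        # to Operating to keep the flow visible.
        self.state = "Operating"
        return ("Operating", self.name)

class Service:
    def __init__(self, components):
        self.components = components
        self.state = "Ordered"

    def on_activate(self, notify_oss):
        self.state = "Activating"
        # Propagate the Activate event downward to every subordinate...
        reports = [c.on_activate() for c in self.components]
        # ...and enter Operating only when all of them have reported in.
        if all(state == "Operating" for state, _ in reports):
            self.state = "Operating"
            notify_oss(("Operating", "service"))

svc = Service([Component("VPN"), Component("Access-East"), Component("Access-West")])
svc.on_activate(notify_oss=print)    # the OSS/BSS sees ('Operating', 'service')
```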

So what we have here is a whole series of event-driven elements, contextualized (state and relationship) by some sort of object model that defines how stuff is related.  It’s not just one state/event process (what software nerds call “finite-state machines”) but a whole collection of such processes, event-coupled so that the behaviors are synchronized.

This concept is incredibly important, but it's not always obvious why, so here's an example.  Suppose that a single VNF inside an Access element fails and is going to re-deploy.  That access element would have to enter a new state, let's call it "Recovering", and the VNF that failed would have to signal the failure with an event.  Does that access element go non-operational immediately, or does it give the VNF some time?  Does it report even the recovery attempt to the service level via an event, or does it wait till it determines that the failure can't be remedied?  All of this stuff would normally be defined in state/event tables for each service element.  In the real world of SDN and NFV, every VNF deployed and every set of connections could be an element, so the model we're talking about could be multiple layers deep.
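
One possible answer to those questions, expressed as a sketch: enter "Recovering", give local redeployment a grace period, and only report upward if that fails.  The timeout value, names, and callbacks are assumptions for illustration, not anything the project specifies.

```python
import time

GRACE_PERIOD_S = 30.0    # assumed grace period before escalating

def handle_vnf_failure(element_name, try_redeploy, report_upward):
    state = "Recovering"                        # the element's new lifecycle state
    deadline = time.monotonic() + GRACE_PERIOD_S
    while time.monotonic() < deadline:
        if try_redeploy():                      # local recovery attempt
            state = "Operating"                 # absorbed locally, nothing escalates
            return state
        time.sleep(1.0)                         # back off before retrying
    # Grace period exhausted: only now does the service level see an event.
    state = "Failed"
    report_upward((element_name, state))
    return state

# A redeploy that succeeds on the first try never reaches the service level.
print(handle_vnf_failure("Access-East", try_redeploy=lambda: True,
                         report_upward=print))
```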

This has implications for building services.  If you have a three- or four-layer service model you’re building, every element in the model has to be able to communicate with the stuff above and below it through events, which means that they have to understand the same events and have to be able to respond as expected.  So what we really have to know about service elements in SDN or NFV is how their state/event processing works.

Obviously we don't know that today, because until now we haven't had even a consistent model of event exchange, which is what Event Streams would define.  But the project doesn't define states, nor does it define state/event tables or standardized responses.  Without those definitions an architect couldn't assemble a service from pieces, because they couldn't be sure that all the pieces would talk the same event language or interpret the context of lifecycles the same way.

The net of this is that Event Streams are enormously important to NFV, but they're a necessary condition and not a sufficient one.  We still don't have the right framework for service modeling, a framework in which every functional component of a service is represented by a model "object" that stores its state and the table that defines how events are handled in every possible state.

The question is whether we need that, or whether we could make VNF Managers perform the function.  Could we send them events?  There’s no current mandate that a VNFM process events at all, much less process some standard set of events.  If a VNFM contains state/event knowledge, then the “place” of the associated VNF in a service would have to be consistent or the state/event interpretation wouldn’t be right.  That means that our VNF inside an access element might not be portable to another access element because that element wanted to report “Recovering” or “Faulting” under different conditions.  IMHO, this stuff has to be in the model, not in the software, or the software won’t be truly composable.
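
Here's one way to picture "behavior in the model, not the software": the lifecycle is carried as data that travels with the service model, and a generic interpreter executes it, so any element carrying the same table behaves the same way wherever it's composed.  The field names here are hypothetical.

```python
# One way to picture "behavior in the model, not the software": the element's
# lifecycle is carried as data in the service model, and a generic interpreter
# executes it.  Field names are hypothetical.

ACCESS_ELEMENT_MODEL = {
    "name": "Access",
    "initial_state": "Ordered",
    "state_events": {
        "Ordered":    {"Activate":  "Activating"},
        "Activating": {"Operating": "Operating", "Fault": "Failed"},
        "Operating":  {"Fault":     "Recovering"},
        "Recovering": {"Recovered": "Operating", "GiveUp": "Failed"},
    },
}

def step(model, current_state, event):
    """Generic interpreter: look up the transition, or stay put if undefined."""
    return model["state_events"].get(current_state, {}).get(event, current_state)

state = ACCESS_ELEMENT_MODEL["initial_state"]
for ev in ["Activate", "Operating", "Fault", "Recovered"]:
    state = step(ACCESS_ELEMENT_MODEL, state, ev)
    print(ev, "->", state)
```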

I'm not trying to minimize the value of Event Streams here.  It's very important, provided that it provokes a complete discussion of state/event handling in network operations.  If it doesn't, then it's going to lead to a dead end.