Coupling Resource Conditions and Service SLAs in the Automation of Operations/Management

In a couple of past blogs, I’ve noted that operations automation is the key to both improved opex and to SDN/NFV deployment.  I’ve also said that to make it work, I think you have to model services as a series of hierarchical intent models synchronized with events through local state/event tables.  The goal is to be able to build everything in a service lifecycle as a state/event intersection, or as a set of synchronized intersections in linked models.  The key to this, of course, is being able to respond to “abnormal” service conditions, and that’s a complex problem.

If you looked at a single service object in a model, say “Firewall”, you would expect to see a state/event table to respond to things that could impact the SLA of “Firewall”.  Each condition that could impact an SLA would be reflected as an event, so that in the “operational” state, each of the events/conditions could trigger an appropriate action to remedy the problem and restore normal operation.  This framework is the key to operationalizing lifecycles through software automation.
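To make the state/event idea concrete, here is a minimal Python sketch of what such a table could look like for “Firewall”.  The state names, event names, and process names (deploy, remediate, confirm) are all assumptions made for illustration; nothing here is drawn from a standard or a product.

```python
from enum import Enum, auto


class State(Enum):
    ORDERED = auto()
    OPERATIONAL = auto()
    DEGRADED = auto()


class Event(Enum):
    ACTIVATE = auto()
    SLA_VIOLATION = auto()
    RESTORED = auto()


class ServiceObject:
    """An intent-model object driven by a local state/event table."""

    def __init__(self, name):
        self.name = name
        self.state = State.ORDERED
        self.log = []  # names of the processes that were run, in order
        # (state, event) -> (process to run, next state); hypothetical entries
        self.table = {
            (State.ORDERED, Event.ACTIVATE): (self.deploy, State.OPERATIONAL),
            (State.OPERATIONAL, Event.SLA_VIOLATION): (self.remediate, State.DEGRADED),
            (State.DEGRADED, Event.RESTORED): (self.confirm, State.OPERATIONAL),
        }

    def deploy(self):
        self.log.append("deploy")

    def remediate(self):
        self.log.append("remediate")

    def confirm(self):
        self.log.append("confirm")

    def handle(self, event):
        """Look up the state/event intersection and run its process."""
        entry = self.table.get((self.state, event))
        if entry is None:
            # No intersection defined: record the event and stay in place.
            self.log.append(f"ignored {event.name} in {self.state.name}")
            return
        process, next_state = entry
        process()
        self.state = next_state


fw = ServiceObject("Firewall")
for ev in (Event.ACTIVATE, Event.SLA_VIOLATION, Event.RESTORED):
    fw.handle(ev)
print(fw.state.name, fw.log)
```

The point of the design is that the lifecycle logic lives in the table, not in the processes, so changing what happens on an SLA violation means changing one table entry.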

Now, if you look “inside” the object “Firewall”, you might find a variety of devices, hosted software elements and the associated resources, or whatever.  You can now set a goal that however you decompose (meaning deploy or implement) “Firewall” you need to harmonize the native conditions of that implementation with the events that drive the “Firewall” object through its lifecycle processes.  If you can do that, then any implementation will look the same from above, and can be introduced freely as a means of realizing “Firewall” when it’s deployed.

This approach is what I called “derived operations” in many past blogs and presentations.  The principle, in summary, means that each function is an object or abstraction that has a set of defined states and responds to a set of defined events.  It is the responsibility of all implementations of the function to harmonize to this so that whatever happens below is reported in a fixed, open, interoperable framework.  This creates what’s effectively a chain of management derivations from level to level of a hierarchical model, so that a status change below is reflected upward if, at each level, it impacts the SLA.
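A rough sketch of that derivation chain, under stated assumptions, might look like the following.  The implementation names (“appliance”, “vnf”), the native condition names, and the standard event names are hypothetical; the only thing taken from the text is the rule that each implementation maps its native conditions to the function’s defined events, and that an event moves upward only when it impacts the SLA at that level.

```python
# Hypothetical mappings from implementation-native conditions to the
# standard events defined for the "Firewall" function.
NATIVE_TO_STANDARD = {
    "appliance": {"PSU_FAIL": "DEGRADED", "LINK_DOWN": "FAILED"},
    "vnf": {"VM_CPU_HIGH": "DEGRADED", "VM_CRASH": "FAILED"},
}


def harmonize(implementation, native_condition):
    """Translate a native condition into the function's standard event."""
    return NATIVE_TO_STANDARD[implementation][native_condition]


class IntentObject:
    """One level of a hierarchical service model."""

    def __init__(self, name, parent=None, tolerated=()):
        self.name = name
        self.parent = parent
        # Events this level can absorb without its SLA being impacted.
        self.tolerated = set(tolerated)
        self.received = []

    def report(self, event):
        self.received.append(event)
        if event in self.tolerated:
            return  # no SLA impact here, so nothing is reflected upward
        if self.parent is not None:
            self.parent.report(event)


service = IntentObject("VPNService")
firewall = IntentObject("Firewall", parent=service, tolerated={"DEGRADED"})

firewall.report(harmonize("vnf", "VM_CPU_HIGH"))     # absorbed at Firewall
firewall.report(harmonize("appliance", "LINK_DOWN"))  # reflected upward
print(service.received)
```

Note that the service level never sees a native condition, only the standard event, which is what lets either implementation stand in for the other.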

This sort of approach is good for services that have an explicit SLA, and in particular for services where the SLA is demanding or where customers can be expected to participate in monitoring and enforcing it.  It’s clearly inappropriate for consumer services because the resources associated with linking the service objects and deriving operations would be too expensive for the service cost to justify.  Fortunately, the approach of an intent-model object can be applied in other ways.

The notion of an SLA is imprecise, and we could easily broaden it to cover any level of guarantee or any framework of desired operations responses to service or network conditions.  Now our “Firewall” has a set of events/conditions that represent not necessarily guarantees but actions.  Something breaks, and you generate an automatic email of apology to the user.  The “response” doesn’t have to be remedial, after all, and that opens a very interesting door.

Suppose that we build our network, including whatever realizes our “Firewall” feature, to have a specific capacity of users and traffic and to deliver a specific uptime.  Suppose that we decide that everything that deals with those assumptions is contained within our resource pool, meaning that all remediation is based on remedying resource problems and not service problems.  If the resources are functioning according to the capacity plan, then the services are also functioning as expected.  In theory, we could have a “Firewall” object that didn’t have to respond to any events at all, or that only had to post a status somewhere that the user could access.  “Sorry there’s a problem; we’re working on it!”

There are other possibilities too.  We could say that an object like “Firewall” could be a source of a policy set that would govern behavior in the real world below.  The events that “Firewall” would have to field would then represent situations where the lower-layer processes reported a policy violation.  If the policies are never violated, no report is needed, and if the policy process is designed to handle violations “internally” rather than report them, this option reduces to the hands-off option just described.
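As an illustration of the policy option, the sketch below has “Firewall” hand down a policy set and receive back only the violations.  The Policy type, the metric names, and the handle_internally switch are assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    metric: str     # hypothetical metric name, e.g. "latency_ms"
    maximum: float  # highest value that still complies with the policy


def violations_to_report(policies, measurements, handle_internally=False):
    """Return the policy violations the lower layer reports upward.

    With handle_internally=True the lower layer deals with violations
    itself and reports nothing, which is the hands-off case.
    """
    violated = [
        p for p in policies
        if p.metric in measurements and measurements[p.metric] > p.maximum
    ]
    return [] if handle_internally else violated


firewall_policies = [Policy("latency_ms", 20.0), Policy("loss_pct", 0.5)]
measured = {"latency_ms": 35.0, "loss_pct": 0.1}

print(violations_to_report(firewall_policies, measured))
print(violations_to_report(firewall_policies, measured, handle_internally=True))
```

Each reported violation would then be an event for the “Firewall” object to field; an empty list means the object has nothing to do.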

It’s also possible to apply analytics to resource-level conditions, and from the results obtain service-level conditions.  This could allow the SLA-related processes to be incorporated in the model at a high level, which would simplify the lower-level model and also reduce or eliminate the need to have a standard set of events/conditions for each function class that’s composed into a service.
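A very simple form of this could be sketched as follows, assuming resource-level samples arrive as records with an up/down flag and a latency figure.  The record layout, the thresholds, and the condition names are illustrative assumptions; real analytics would be far richer.

```python
from statistics import mean


def service_condition(samples, max_mean_latency_ms=20.0, min_availability=0.99):
    """Derive one service-level condition from resource-level samples.

    Each sample is a dict with "up" (bool) and "latency_ms" (float).
    """
    if not samples:
        raise ValueError("need at least one sample")
    up = [s for s in samples if s["up"]]
    availability = len(up) / len(samples)
    if availability < min_availability:
        return "AVAILABILITY_VIOLATION"
    if up and mean([s["latency_ms"] for s in up]) > max_mean_latency_ms:
        return "LATENCY_VIOLATION"
    return "OK"


healthy = [{"up": True, "latency_ms": 10.0}] * 100
slow = [{"up": True, "latency_ms": 30.0}] * 100
flaky = [{"up": True, "latency_ms": 10.0}] * 90 + [{"up": False, "latency_ms": 0.0}] * 10

for name, data in (("healthy", healthy), ("slow", slow), ("flaky", flaky)):
    print(name, service_condition(data))
```

Because the service-level condition is computed in one place, the lower-level models do not need to agree on a common event vocabulary.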

Finally, if you had something like an SD-WAN overlay and could introduce end-to-end exchanges to obtain delay/loss information, you could create service-level monitoring even if you had no lower-level resource events coupled up to the service level.  Note that this wouldn’t address whether the knowledge that packet loss was occurring (for example) could be correlated with appropriate resource-level remediation.  The approach should be an adjunct to having fault management handled at the resource level.
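The arithmetic behind such end-to-end monitoring is simple enough to sketch.  This assumes the overlay endpoints send a known number of probes and record a round-trip time for each one that comes back; the function name and the result fields are hypothetical.

```python
def probe_summary(sent, rtts_ms):
    """Summarize end-to-end probes as loss percentage and mean round trip.

    sent     -- number of probes transmitted
    rtts_ms  -- round-trip times, in ms, of the probes that were answered
    """
    if sent <= 0:
        raise ValueError("sent must be positive")
    received = len(rtts_ms)
    if received > sent:
        raise ValueError("more replies than probes")
    loss_pct = 100.0 * (sent - received) / sent
    mean_rtt_ms = sum(rtts_ms) / received if received else None
    return {"loss_pct": loss_pct, "mean_rtt_ms": mean_rtt_ms}


# Ten probes sent, eight answered.
print(probe_summary(10, [10.0] * 8))
```

The result says that loss is happening, not where or why, which is why this works best alongside resource-level fault management and not in place of it.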

The point of all of this is that we can make management work across the whole range, from no coupling with services at all (the best-efforts extreme on the cheap end) to very tight coupling (a strict SLA on the other end).  The operations processes that we couple to events through the intent-modeled structure can be as complicated as the service revenues can justify.  If we can make things more efficient in hosting operations processes we can go down-market with more specific service-event-handling activity and produce better results for the customer.

The examples here also illustrate the importance of the service models and the state/event coupling to processes through the model.  A service is built with a model set, and the model set defines all of the operations processes needed to do everything in the service lifecycle.  SDN and NFV management, OSS/BSS, and even the NFV processes themselves (MANO, VNFM) are simply processes in state/event tables.  If you have a service model and its associated processes you can run the service.

Resource independence is also in there.  Anything that realizes an object like “Firewall” is indistinguishable from anything else that realizes it.  You can realize the function with a real box, with a virtual function hosted in agile CPE, or with a function hosted in the cloud (your cloud or anyone else’s).

Finally, VNF onboarding is at least facilitated.  A VNF that claims to realize “Firewall” has to be combined with a model that describes the state/event processes the VNF needs to be linked with.  The intent-model definition of “Firewall” in turn specifies what any implementation, including the one the VNF provides, has to expose upward as a uniform set of features.
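One way to picture the onboarding check is as a comparison between what the intent model requires and what the VNF’s accompanying model can actually generate.  The sketch below assumes the intent model lists required events and the VNF package supplies a native-to-standard event map; both structures and all the names in them are hypothetical.

```python
# Hypothetical intent-model definition: the events every "Firewall"
# implementation must be able to present upward.
FIREWALL_INTENT = {"events": {"DEGRADED", "FAILED", "RESTORED"}}


def missing_events(intent, vnf_event_map):
    """Return the required events the VNF's model cannot produce, sorted."""
    provided = set(vnf_event_map.values())
    return sorted(intent["events"] - provided)


# Hypothetical VNF package: native conditions mapped to standard events.
vendor_vnf_map = {
    "VM_CPU_HIGH": "DEGRADED",
    "VM_CRASH": "FAILED",
}

gaps = missing_events(FIREWALL_INTENT, vendor_vnf_map)
print("onboardable" if not gaps else f"missing: {gaps}")
```

A check like this would not prove the VNF works, only that it can be driven through the same state/event processes as any other “Firewall”.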

Operations automation can work this way.  I’m not saying it couldn’t work other ways as well, but this way is IMHO the way a software type would architect it if the problem were handed over.  Since service automation is a software task, that’s how we should be looking at it.

The TMF got part-way to this point with its NGOSS Contract approach, which linked processes to events using SOA (Service-Oriented Architecture, a more rigid predecessor to modern microservices) through the service contract as a data model.  It really hasn’t caught on, in part I think because the details weren’t addressed.  That’s what the TMF’s ZOOM project, aimed at operations automation, should be doing in my view, and whether they do it or not, it’s what the industry should be doing.

I think some are doing it.  Operators tell me that Netcracker uses something like this in their presentations, and a few tell me that Ericsson is now starting to do that too.  I think the presentation made last year by Huawei at the fall TMF meeting in the US exposes a similar viewpoint, and remember that Huawei has both the tools needed for low-level orchestration and also an OSS/BSS product.

It’s a shame that it’s taking so long to address these issues properly, because the failure to integrate software automation with operations and management has taken the largest pool of benefits off the table for now.  It’s also hampered both SDN and NFV deployment by making it difficult to prove that the additional complexity these technologies introduce won’t end up boosting opex.  If we’re going to see progress on transformation in 2017 we need to get going.