Software Architecture and Implementation in Zero-Touch Automation

I know that I’ve saved the hardest of the zero-touch issues for last.  The technical or software architecture almost has to be considered at the end, because you can’t talk about it without framing scope and functionality.  Those who read my last blog know that I touched on technical issues with the comment that a concurrent, model-driven, scalable process implementation of zero-touch was critical.  Here I’ll expand on that point, and raise other structural questions.

I want you to imagine for a moment the network and customer base of a large Tier One.  They have millions of customers, tens of millions of services, and tens of thousands of different offices and vaults where equipment is installed.  A typical service is a cooperative relationship that could easily require thousands of different devices and trunks.  And that’s without software virtualization and function hosting!

Every day, every hour or minute, something is happening in such an environment.  Worse yet, there are many critical and common elements in the infrastructure, things that when broken will impact many services and many customers.  Anyone who’s ever run a network operations center or even visited a NOC regularly will recognize the problems of a fault avalanche.  One thing triggers thousands of alarms, which of course means it triggers thousands of events.  It’s then up to the operator to remedy things.

Or it was.  Now it’s up to zero-touch automation, and our problems can be traced from the origin of that critical common fault to its ultimate remediation.

Suppose we have a big trunk failure.  It impacts thousands of services, but which ones are impacted?  The first question in any kind of automated response to lifecycle conditions is how you associate a condition with a lifecycle.  Some say you don’t.  The traditional approach to facility problems is to forget about service-level remediation and focus on “facility remediation.”  If Trunk A broke, either fix it or replace it, and if you do that you can put service Humpty back together again.  Stated without the nursery rhyme association, this approach says that you focus on resource management and forget services.

That doesn’t always work, of course.  Fixing Trunk A might require hours or even days, presuming some “cable-seeking backhoe” got into the mix.  Replacing it might or might not be possible, depending on the capacity Trunk A had, the residual capacity elsewhere, and the ability of that residual to meet the SLAs for the impacted services.  However, a replacement for Trunk A, if Trunk A was explicitly committed, is going to require that it be “uncommitted” and that whatever replaces it be connected instead.  The replacement might be a simple parallel trunk or, more likely, a more devious path through new nodal points, transiting new facilities.  That’s what we have to be able to automate.

There are five steps to zero-touch effectiveness.  One, awareness of the fault itself, which might come through direct management reports or through analytics.  Two, the correlation of the fault with impacted services so that per-service policies can be applied.  Three, the interpretation of the fault in the current service context, so appropriate responses can be mediated.  Four, the execution of the appropriate processes, and five, the analysis of the result of the remediation and, if necessary, the reporting of an unresolved problem to some higher-level process.
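
To make that sequence concrete, here’s a minimal Python sketch of the five steps chained into one pipeline.  Every name and data shape in it (the fault report fields, the service index, the SLA flag) is an illustrative assumption of mine, not something drawn from any real zero-touch implementation.

```python
# Illustrative sketch only: the five zero-touch steps as a simple pipeline.
# All names and data shapes are hypothetical.

def detect_fault(report):
    """Step 1: awareness -- normalize a raw management report or analytics alert."""
    return {"resource": report["resource"], "condition": report["condition"]}

def correlate_services(fault, service_index):
    """Step 2: correlation -- find the services that depend on the faulty resource."""
    return service_index.get(fault["resource"], [])

def interpret(fault, service):
    """Step 3: interpretation -- decide what the fault means in this service's context."""
    return "reroute" if service["sla"] == "protected" else "report"

def execute(action, service):
    """Step 4: execution -- run the selected remediation process (stubbed here)."""
    print(f"{service['id']}: executing {action}")
    return action == "reroute"          # pretend only rerouting resolves the fault

def analyze(success, service):
    """Step 5: analysis -- verify the result, escalate if remediation fell short."""
    if not success:
        print(f"{service['id']}: reporting unresolved problem to a higher-level process")

def handle(report, service_index):
    fault = detect_fault(report)
    for service in correlate_services(fault, service_index):
        action = interpret(fault, service)
        analyze(execute(action, service), service)

if __name__ == "__main__":
    index = {"trunk-a": [{"id": "svc-1", "sla": "protected"},
                         {"id": "svc-2", "sla": "best-effort"}]}
    handle({"resource": "trunk-a", "condition": "loss-of-signal"}, index)
```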

We’re assuming we have a fault report, so let’s start with our second point.  We need to know what services were impacted, and there are two ways that could happen.  We could learn about the trunk failure and have a correlation scheme whereby we know what services were assigned to it.  Or, we could wait for each service to recognize the fault in some way.  In either case, we have to prepare the mechanism.  Do we have fault correlation?  Do we have service-level awareness of the state of lower-level resources?  That’s going to depend on our architecture.
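
As one hedged illustration of “preparing the mechanism,” here’s a small Python sketch of a correlation index that is updated whenever a service commits to or releases a resource, so that a later trunk fault can be mapped to impacted services without scanning everything.  The class and method names are hypothetical, not drawn from any product.

```python
# Hypothetical correlation index: resource -> set of dependent services.
from collections import defaultdict

class CorrelationIndex:
    def __init__(self):
        self._services_by_resource = defaultdict(set)

    def commit(self, service_id, resource_id):
        """Record the binding when a service is assigned to a resource (e.g., Trunk A)."""
        self._services_by_resource[resource_id].add(service_id)

    def release(self, service_id, resource_id):
        """Remove the binding when it is torn down or replaced."""
        self._services_by_resource[resource_id].discard(service_id)

    def impacted(self, resource_id):
        """On a fault report: which services did this resource carry?"""
        return set(self._services_by_resource[resource_id])

index = CorrelationIndex()
index.commit("svc-1", "trunk-a")
index.commit("svc-2", "trunk-a")
print(index.impacted("trunk-a"))   # e.g. {'svc-1', 'svc-2'}
```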

But the worst is yet to come.  We have to address our fourth and fifth points, the execution of appropriate service processes in response to conditions/events, and the “kicking upstairs” of problems that can’t be handled within the black box of an intent model.  We also have to do this at scale.  Remember that, by one means or another, we’ve learned about a fault in a thousand services.  All of these services will now have to be reprocessed to reflect the loss of that trunk, and that has both a functional and a technical dimension.

Functionally, we need to be able to remediate a trunk loss at the level of service assessment where alternatives are defined and the original selection was made.  That means that whatever process selected the trunk now has to respond to its loss.  If it can’t be done at that level, we have to kick the issue upstairs, so to speak.  That requires that an understanding of how the service composition decisions were made be maintained for remediation, which means that either we had to have followed a specific model to build the service or we had to have recorded what we actually did.  We then have to be sure that we don’t make the same bad trunk decision again, either with the specific service component that reported the fault or with other components that might have relied on that same trunk.  Remember that the fault could have impacted multiple components of the same service set, and that it might be propagating up a tree of models in several places at once.
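
As a rough sketch of that functional behavior, imagine each intent-model element remembering the alternatives it had, the choice it made, and the choices it now knows are bad; it remediates locally when it can and escalates to its parent when it can’t.  Everything here, from the class name to the way alternatives are recorded, is my own simplified assumption rather than a description of any real implementation.

```python
# Hypothetical intent-model element: remediate locally if possible, else escalate.

class ModelElement:
    def __init__(self, name, alternatives, parent=None):
        self.name = name
        self.parent = parent
        self.alternatives = list(alternatives)   # recorded at composition time
        self.selected = self.alternatives[0] if self.alternatives else None
        self.excluded = set()                    # choices known to be bad

    def on_fault(self, failed_resource):
        """Primary event: a resource this element selected has failed."""
        self.excluded.add(failed_resource)
        if self.selected == failed_resource:
            self._reselect(reason=f"fault on {failed_resource}")

    def on_child_unresolved(self, child):
        """Internal event: a subordinate could not remediate within itself."""
        self.excluded.add(self.selected)         # the current decomposition is broken
        self._reselect(reason=f"{child.name} unresolved")

    def _reselect(self, reason):
        candidates = [a for a in self.alternatives if a not in self.excluded]
        if candidates:
            self.selected = candidates[0]        # remediate at this level
            print(f"{self.name}: re-selected {self.selected} ({reason})")
        elif self.parent is not None:
            print(f"{self.name}: cannot remediate, escalating ({reason})")
            self.parent.on_child_unresolved(self)  # kick the issue upstairs
        else:
            print(f"{self.name}: unresolved, reporting to operations ({reason})")

root = ModelElement("service-root", alternatives=["route-via-metro", "route-via-core"])
leg = ModelElement("access-leg", alternatives=["trunk-a"], parent=root)
leg.on_fault("trunk-a")   # no local alternative, so the issue is kicked upstairs
```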

The technical side of this is that if there is only one place where we can address service faults, one string of logic, then we have to expect that hundreds or thousands of events will end up queued for handling.  Worse, some of those events may well impact the same service sets, which means that we could, in theory, handle something by taking an action that a yet-to-be-processed event would invalidate.

What you need here is the notion of spawning a handling process for each event, as the event occurs.  Yes, you may have to apply some mechanism to accommodate the fact that hosting resources for these processes may be limited, but within those constraints you’re better off launching something per service.  That means that the intent model hierarchy for that service has to be viewed as a distributed-state database that somehow gives every process the information it needs to run, remediate, or pass on its own fault as an event.
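
Here’s a minimal sketch of that per-service spawning, assuming a bounded worker pool to stand in for limited hosting resources and an in-memory dictionary to stand in for the distributed-state database.  None of the names or limits reflect a real system.

```python
# Illustrative only: spawn one handler per impacted service, capped by a pool.
from concurrent.futures import ThreadPoolExecutor

MODEL_STORE = {                      # stand-in for a distributed-state database
    "svc-1": {"state": "active", "trunk": "trunk-a"},
    "svc-2": {"state": "active", "trunk": "trunk-a"},
}

def remediate(service_id, fault):
    model = MODEL_STORE[service_id]           # each handler pulls its own state
    model["state"] = "remediating"
    return f"{service_id}: handling {fault} independently of the other services"

def on_trunk_fault(fault, impacted_services, max_workers=8):
    # One handler per impacted service, with at most max_workers running at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(remediate, sid, fault) for sid in impacted_services]
        for future in futures:
            print(future.result())

on_trunk_fault("trunk-a failure", ["svc-1", "svc-2"])
```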

Dynamic spawning of a process from a model-mediated state/event intersection is critical in developing agile, scalable, zero-touch software.  It’s also a significant technology/architecture challenge.  In order for an event to spawn a process, there has to be an association between the two, similar to the one that players like Amazon, Microsoft, and Google offer in the triggers for their functional computing services.  The presumption, then, would be that each model element had a “base process” that was kicked off, and that in turn would activate other processes based on state and event.  However, for that to happen the event would have to be associated with the element, not just the service or the fault.
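
In miniature, a model-mediated state/event intersection might look like the table below: keyed by state and event, it names the process the element’s base process should spawn, in the spirit of the triggers functional-computing services use.  The table contents and process names are purely illustrative.

```python
# Hypothetical state/event table for one element type; nothing here is real.

def reroute(element):   print(f"{element['name']}: spawn reroute process")
def escalate(element):  print(f"{element['name']}: spawn escalation process")
def ignore(element):    pass

STATE_EVENT_TABLE = {
    ("active",      "trunk-fault"):    reroute,
    ("remediating", "trunk-fault"):    ignore,    # already being handled
    ("active",      "reroute-failed"): escalate,
}

def base_process(element, event):
    """The element's 'base process': look up the state/event intersection and
    hand off to whatever process it names."""
    handler = STATE_EVENT_TABLE.get((element["state"], event))
    if handler is None:
        print(f"{element['name']}: no handler for {event} in state {element['state']}")
        return
    handler(element)

element = {"name": "access-leg", "state": "active"}
base_process(element, "trunk-fault")      # spawns the reroute process
```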

If we go back to our notion of Trunk A faults, a Trunk A event might activate a “Trunk A distributor process” that knew what services depended on Trunk A.  That process might then kick off a service-specific process that would have to associate the fault with the correct element.  Might it pass the event down the chain?  Implementations of various types are possible.  It would also be possible to have the Trunk A distributor process “know” the specific service-model element that was dependent on the trunk, which would make mediation of the event by an intermediate service-model process unnecessary.
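
A hypothetical distributor along those lines might look like the sketch below, which assumes the dependency records already identify the specific model element within each service, so the event can be targeted directly.  The data layout and function names are assumptions for illustration only.

```python
# Illustrative distributor: fan a resource fault out to dependent model elements.

DEPENDENTS = {   # hypothetical contents: resource -> [(service, element), ...]
    "trunk-a": [("svc-1", "access-leg"), ("svc-2", "core-segment")],
}

def deliver(service_id, element_id, event):
    print(f"deliver {event} to {service_id}/{element_id}")

def trunk_distributor(resource_id, event):
    """Route a resource fault to the specific elements that depend on it."""
    for service_id, element_id in DEPENDENTS.get(resource_id, []):
        deliver(service_id, element_id, event)

trunk_distributor("trunk-a", "trunk-fault")
```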

So far, we’ve been considering the “primary events” of conditions from outside the service itself.  That would include resource/service faults detected in a variety of ways, but also “administrative events” generated by operations personnel or even by customers making service changes.  There are also “internal events” to consider.  When a single element in a complex intent-modeled service structure wants to report a condition that it has generated, perhaps as the result of processing some other event, it has to be able to generate an event to wake up another higher-level (or lower-level) process.

Primary events would typically be generated either at the top of the model hierarchy (administrative events) or at the bottom (resource conditions).  The rest of the events would be generated internally, and represent the passing of requirements down the chain to subordinates, or of issues up the chain to superiors.  Overall, this complex set of event flows is characteristic of asynchronous processes, and you either have to handle them that way or you end up creating some serialized, single-process solution that could leave things waiting for a response for a very long time.  Doing the former means having dynamic processes associated not with zero-touch automation overall, but with each service.
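
To illustrate the asynchronous alternative, here’s a small asyncio sketch in which each service gets its own event queue and worker, so a flood of events against one service doesn’t serialize handling for all the others.  The event shapes and queue wiring are assumptions made only to show the flow.

```python
# Illustrative per-service event queues; everything here is a simplification.
import asyncio

async def service_worker(service_id, queue):
    while True:
        event = await queue.get()
        if event is None:                      # shutdown signal
            break
        print(f"{service_id}: handling {event}")
        await asyncio.sleep(0)                 # stand-in for remediation work

async def main():
    queues = {sid: asyncio.Queue() for sid in ("svc-1", "svc-2")}
    workers = [asyncio.create_task(service_worker(sid, q))
               for sid, q in queues.items()]

    # Primary events arrive from the bottom (resource) and the top (administrative);
    # each is routed to the queue of the service it belongs to.
    await queues["svc-1"].put("trunk-fault")
    await queues["svc-2"].put("trunk-fault")
    await queues["svc-1"].put("admin-change")

    for q in queues.values():                  # drain and stop the workers
        await q.put(None)
    await asyncio.gather(*workers)

asyncio.run(main())
```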

I experimented with something like this in my original ExperiaSphere project, launched in response to some operator concerns about how the TMF’s Service Delivery Framework concept could be implemented.  My thought was that there was a “service factory” that spun out models for ordering, and any such factory could handle all of the events for any of the models it was associated with.  The service order data carried all the state and structure information needed.  In my CloudNFV project, EnterpriseWeb used their broader architecture to provide a kind of event fabric that also conveyed stored state and structure to a process that could be spun up as needed.

Ultimately, it should be obvious that zero-touch automation is an event-driven process and that it will therefore require event-driven processing.  Stateless elements with stored/distributed state coordination can be scaled easily and replaced easily.  They can also be made to address the complexities of concurrency, meaning the need to support events as they happen, even when they happen in floods.  Functional computing, of the kind Amazon, Microsoft, and Google have already introduced, is going to be a key piece of the implementation of zero-touch automation, or it’s not going to work at scale.
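
As a sketch of that stateless pattern, consider a handler that keeps nothing between invocations: it loads an element’s state from a shared store, acts on the event, and writes the state back, so any number of copies can be spun up or replaced at will.  The store and record layout are assumptions, not a reference to any particular functional-computing service.

```python
# Illustrative stateless handler with externalized state; layout is hypothetical.
import json

STATE_STORE = {"svc-1/access-leg": json.dumps({"state": "active", "trunk": "trunk-a"})}

def stateless_handler(element_key, event):
    """Could run anywhere -- as a container, or behind a cloud-function trigger."""
    record = json.loads(STATE_STORE[element_key])      # 1. load externalized state
    if event == "trunk-fault" and record["state"] == "active":
        record["state"] = "remediating"                # 2. act on the event
        record["trunk"] = None
    STATE_STORE[element_key] = json.dumps(record)      # 3. write state back
    return record

print(stateless_handler("svc-1/access-leg", "trunk-fault"))
```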

So, this frames the industry dilemma.  There is no question in my mind that only ONAP has the credibility and scope to be able to touch all the essential pieces of service lifecycle automation.  Absent that scope/scale, you can’t generate enough benefits to drive deployment.  But like the ETSI ISG, the ONAP people have presented a picture of their architecture that could be interpreted as being what I’ve called a linear process and not a model-driven state/event process supported by stateless elements.  Is that what the software is?  Remember that the ETSI picture was initially described as a functional model, which means that it didn’t have to be implemented exactly as shown.  What broke that down was a combination of literal interpretation of the diagram by some, and the fact that when you describe specific interfaces between components you implicitly define those components.  Remember black boxes?

I hope to have some further discussions with the ONAP people on the way their software is really built.  Once we know how it was done, we’ll know whether it can be made stateless and event-driven to the extent needed to make it scalable.  While everything in zero-touch automation could be done with a state/event-model-driven, stateless-process combination, it doesn’t all have to be.  You could, for example, “eventify” the critical virtualization piece and have that handle most events.  The interface to the less-used processes, like those of OSS/BSS, could remain based on traditional event queues.

This is the last of my series on zero-touch automation issues and architectures.  What does all this prove?  In my view, it proves that we have to start thinking about networks in a totally different way, for several reasons.  First, zero-touch software is a prime example of the new model of event-driven systems, and that’s not how standards people and even many software people think about it.  Second, this is a complicated space, and finding even one solution, much less a competitive basket of them, is going to be difficult.  It cries out for open source.  Finally, we are never going to get useful coverage of this topic if everyone in the media/analyst space relies on vendor press releases and “sponsored” papers.  We need to figure out a way of exploring the issues in an open forum without either limiting the details the dialog can raise or contaminating the process with special interests.

Operators, making that happen is up to you.