Organizing All Our “Automation” Concepts

Can we get some definitions here?  I’m as interested in software-directed operations processes as the next person (maybe more than most), but I confess that I’m getting buried in terms and concepts that clearly relate to the software-directed operations goal at the high level, but don’t seem to relate well, or consistently, to each other.  Time out for a taxonomy, I think.  This one is my own, gleaned from my long involvement in the space and an attempt at drawing consistent definitions from current usage.

What seems to be the universal goal of all of this is what’s called “zero-touch automation” (ZTA), meaning the application of software technology to respond, without human intervention, to abnormal operations conditions in applications or services.  Something happens and software fixes it.  Humans would get involved only if something arose that was so far out of scope that the automated system couldn’t be relied on to do the right thing, or perhaps anything at all.

There are other terms that seem to mean just about the same thing.  “Closed-loop automation” simply means that a report or event is processed to generate a response.  However, the term has been used most often in association with simple processes, something like “Light-beam-breaks-so-open-gate”.  Service (or application) lifecycle automation is a term that I’ve used, and ZTA does seem to have the same meaning that I at least assigned to the service lifecycle automation concept.  However, service lifecycle automation implies a cooperative system (the application or service and its associated resources) that has a specific lifecycle, meaning a progression of states that together form the evolution of functionality from “I-want-it” to “make-it-go-away.”

ZTA could be a broader concept, perhaps, but I don’t think so.  Most people who separate the two terms seem to apply the ZTA term to a “lifecycle system” that doesn’t have a beginning or end, meaning a steady-state thing like a wide-area network.  You’re going to build it and run it, and ZTA then means automating the response to the stuff that interferes with running it.  Services and applications have this “run” state as part of what we could call a “goal-state sequence”, the part that (hopefully) the service or application lives in the longest.

Whether we call our goal ZTA or service/application lifecycle automation, I do think that there are three elements that make it up.  The first element is our goal-state sequence, meaning some set of conditions that define what things should be or look like.  The second is an exception source, meaning a report of something that’s not as it should be, probably in the form of an event.  The final element is the response process, the software function that does something to remedy the exception.
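The three elements can be sketched as a small state/event table.  This is a minimal illustration under my own assumptions, not a reference design: the state names, event names, and process functions are all hypothetical, chosen only to show a goal-state sequence, an exception source, and a response process working together.

```python
from enum import Enum, auto


class State(Enum):
    """Hypothetical goal-state sequence, from "I-want-it" to "make-it-go-away"."""
    ORDERED = auto()
    DEPLOYING = auto()
    RUNNING = auto()
    DEGRADED = auto()
    RETIRED = auto()


class Event(Enum):
    """Hypothetical exception/event source."""
    ACTIVATE = auto()
    DEPLOYED = auto()
    FAULT = auto()
    RESTORED = auto()
    DECOMMISSION = auto()


# Response processes: each does its work (here, just logs) and returns the next state.
def deploy(log):
    log.append("deploy resources")
    return State.DEPLOYING


def confirm(log):
    log.append("confirm service is up")
    return State.RUNNING


def remediate(log):
    log.append("run remediation")
    return State.DEGRADED


def restore(log):
    log.append("confirm recovery")
    return State.RUNNING


def tear_down(log):
    log.append("release resources")
    return State.RETIRED


# The table binds (state, event) pairs to response processes.
PROCESS_TABLE = {
    (State.ORDERED, Event.ACTIVATE): deploy,
    (State.DEPLOYING, Event.DEPLOYED): confirm,
    (State.RUNNING, Event.FAULT): remediate,
    (State.DEGRADED, Event.RESTORED): restore,
    (State.RUNNING, Event.DECOMMISSION): tear_down,
}


def handle(state, event, log):
    """Look up and run the process for this state/event; escalate if none is defined."""
    process = PROCESS_TABLE.get((state, event))
    if process is None:
        log.append(f"escalate to human: {event.name} in {state.name}")
        return state
    return process(log)


if __name__ == "__main__":
    log = []
    state = State.ORDERED
    for event in (Event.ACTIVATE, Event.DEPLOYED, Event.FAULT,
                  Event.RESTORED, Event.DECOMMISSION):
        state = handle(state, event, log)
    print(state.name, log)
```

The only place a human appears is the missing table entry, which is the out-of-scope case described earlier.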

The idea of associating an exception to a process can be applied at many levels.  At the basic level, it could be an expansion to a simple closed-loop process, and here it would be fair to call it “machine learning”.  The idea is to let the system learn what a valid response to a condition is, mostly by watching what operators do but also perhaps by analyzing the results of best-effort estimates.  If you expand this idea to a complex system, you can see the glimmer of the meaning of ZTA.

A network or a cloud is a fairly vast interdependent system of components.  A lot of things can happen, a lot of them could happen at one time, and in many cases the response to a “happening” would be something that could involve several steps, any of which might fail during execution.  ZTA has to deal with the complex goal of restoring the goal-state, or at least getting as close to it as possible.  That is almost certain to involve a bunch of closed-loop processes coordinated in some way.  Think of machine learning at a larger scale, a system of machines rather than a single machine.

Look at the problem this way.  Suppose we have a machine we need to keep running.  We could define a “running state” in terms of the conditions of each component in the machine as they would be in that desired state.  We could then watch component conditions and respond, per component, when something wasn’t right.  But if the machine is truly complex and if the conditions we can measure don’t necessarily relate to a simple, single, component fault, how do we do things?

This is where I think the notion of lifecycle automation offers a better story.  Lifecycle automation presumes that there are multiple components, and components of components, forming a hierarchy that would be called a service or application model.  Each component is responsible for working, fixing itself based on internal processes, or reporting a fault.  If the last of the three happens, then the component into which the faulting component is composed has to remediate or report a fault in turn.
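A rough sketch of that hierarchy, assuming a simple tree in which each element either remediates a fault itself or passes it to the element it is composed into.  The `Element` class, the `handles` set, and the fault names are hypothetical, invented only for illustration.

```python
class Element:
    """One node in a hypothetical service model hierarchy."""

    def __init__(self, name, handles=(), children=()):
        self.name = name
        self.handles = set(handles)   # faults this element can remediate itself
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def report(self, fault, trail=None):
        """Remediate the fault here if possible, otherwise report it upward.

        Returns (outcome, trail), where trail lists the elements visited.
        """
        trail = [] if trail is None else trail
        trail.append(self.name)
        if fault in self.handles:
            return f"{self.name} remediated {fault}", trail
        if self.parent is None:
            # Top of the model with no remedy: this is where humans get involved.
            return f"unresolved {fault}: escalate to operations", trail
        return self.parent.report(fault, trail)


if __name__ == "__main__":
    port = Element("port-3")
    access = Element("access", handles={"port-flap"}, children=[port])
    service = Element("vpn-service", handles={"access-down"}, children=[access])
    print(port.report("port-flap"))
    print(port.report("access-down"))
    print(port.report("fiber-cut"))
```

Each element needs to know only its own remedies and its parent, which is what lets the fault find the right level without any central correlation step.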

We can inject “analytics” (another term) into our process now.  It’s easy to see how analytics might be a source of the exceptions we talked about, but the problem is the classic one of fault correlation.  You can learn about problems, but can analytics tell you the point at which remediation has to occur, the nature of the remediation, and the sequencing of multiple steps (some of which might fail)?

Where analytics might help is where a systemic problem is the sum of lower-level conditions that might not be a problem at the element level.  Nobody is “faulting” but perhaps everyone is on the outer edge of their delay budget, which means that somewhere up the line the budget might be exceeded collectively.  But even here, just knowing that’s happened doesn’t necessarily close any automation loops.  Nobody is violating their SLA, so do you tighten everyone’s SLA?  Not without breaking down the service, probably.
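To make the delay-budget case concrete, here is a small sketch with made-up numbers.  The element names, the per-element budgets, and the 80 ms end-to-end budget are all assumptions chosen for illustration.

```python
def check_delay(elements, end_to_end_budget_ms):
    """Compare per-element delay budgets with the collective end-to-end budget.

    elements maps a name to (measured_delay_ms, element_budget_ms).
    Returns (names violating their own budget, total delay, end-to-end exceeded?).
    """
    violations = [name for name, (delay, budget) in elements.items() if delay > budget]
    total = sum(delay for delay, _ in elements.values())
    return violations, total, total > end_to_end_budget_ms


if __name__ == "__main__":
    # Hypothetical measurements: every element is inside its own budget.
    elements = {
        "access": (18, 20),
        "metro": (28, 30),
        "core": (38, 40),
    }
    violations, total, exceeded = check_delay(elements, end_to_end_budget_ms=80)
    print(violations, total, exceeded)
```

With these numbers no element is in violation, yet the 84 ms total exceeds the 80 ms end-to-end budget.  The check detects the systemic condition but says nothing about which element should change, which is the gap in the loop.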

Analytics and closed-loop processes are probably used more often as attractive window-dressing on positioning strategies than as true solutions to the problem of operations automation.  The same is true of script-based tools, models that define configuration but don’t align with events and corresponding process invocations, and models that define only the singular state of “working”.  Too much interpretation is needed on any of these to get to the right response to a condition, and getting to that response is what ZTA is supposed to be about.

So AI is the answer, right?  There’s a whole definitional black hole in itself.  What does “artificial intelligence” mean?  Does it mean self-learning systems, or autonomous systems, or rule-based systems?  In a sense, to say that you’re going to use AI for ZTA is circular because ZTA is zero-touch and so is AI.  In another sense it’s falling back on analytics, because the presumption is that we can simply identify a problem and let AI solve it.  Could we teach an AI system the right way to diagnose and fix problems in a global network?  Perhaps, but if we did I suspect we’d be applying AI at the functional-element level, as a substitute for lower-level state/event handling.

A model-based service and application automation strategy can easily be distributed, since the data model contains everything that every process at every level needs.  Do we know what stateless, distributed, AI would look like?  We really don’t know what centralized AI would look like at this point, and while there have been impressive gains in AI, the basic concepts have changed little, and to say that we’re on the verge of a breakthrough that would let AI resolve ZTA issues is to say what’s been said so many times before…and proved to be too optimistic.

Where are we, then?  First, the difference between ZTA/lifecycle automation and other concepts is that the former two reflect a service- or application-level vision where other approaches are a fault/event vision.  The more complex the system, the more difficult it is to simply fix things that pop up, and so in our era of the cloud and virtualization, we have long exceeded the limits of “basic” automation strategies, including classic closed-loop or machine learning.  The key to ZTA is the goal-state sequence model, and that’s a key that we seem curiously reluctant to unlock.

This is why I get frustrated on this topic.  You have two choices with ZTA.  You can define a hierarchical model of the service, with each element in the model representing a functional system with defined states and faults, or you can define an AI process so smart that it can replace a whole operations center’s worth of humans in doing problem isolation and resolution.  Which of these do you think is possible today?  Unless we want to wait an indefinite period for ZTA benefits, we have to get real, today.