Dissecting the Challenges of Zero-Touch Automation

There’s no question that “zero-touch automation” is emerging as the new industry buzzword (or buzzwords), and it has the distinction, for better or worse, of being more complicated than most of our past ones.  Sure, at the high level at least, you can express the idea simply.  It’s about full automation of the service or application lifecycle, from conception to deployment and use.  The problem is that this broad scope implicates just about everything an operations organization does, and the mechanisms for automating all that stuff will require some fundamental changes in how we think about operations software.

ZTA (as I’ll call it to reduce the typing burden) has three specific issues.  First and foremost is scope of impact.  A lot of different people and activities are involved in lifecycle processes, and most of that work has already been supplied with tools that support the current, more manual, processes.  We need to be able to “automate” all of this, or we risk creating costs and reducing agility rather than improving either of those.  Second is the functional model.  It’s easy to say something like “AI will be used to make all decisions rather than having humans do the heavy lifting”, but we don’t have AI that can just be trained to do the right thing.  How does the automation process know what to do?  Finally, we have software architecture itself.  If software is driving the tasks of ZTA, then what kind of software is it?  I’m going to open each of these topics in this blog, and follow up with a separate blog on each.

“Automation” strategies for management of applications and network services have typically fallen far short of their goals in terms of reducing costs and improving agility.  The reasons for the shortfall almost always start with too limited a scope of impact.  Say that three organizations are involved in a given lifecycle process.  You grab what seems to be the low apple of the three and automate it, but the other two are left alone.  That means you have to make your automation fit a process flow that’s dominantly manual, and if you then move on to the other tasks you’ve not had the advantage of thinking of the flow of activity as a whole.  Thus, the sum of your three automation projects probably won’t be optimal.

The biggest fault line in scope that ZTA faces is the divide between network management and service management, embodied in the separation between network management systems (NMS) and the operations and business support systems (OSS/BSS).  Time and time again we have worked to simplify the two independently, and have created something that is, overall, more complicated and expensive.  A service or application lifecycle is an operations sequence that touches different job functions in different ways, and somehow that entire sequence has to be automated, not just pieces of it.

This doesn’t mean that we have to unify all management in an organizational sense, or that our automation solutions can’t recognize a difference between lifecycle processes in the NMS and OSS/BSS spaces.  It does mean that rather than leaving those two activities independent, or even merely interdependent, we make them co-dependent on a single organizing operational sequence.  That organizing concept is what guides the ZTA approach, and it brings us to the functional model.

Lifecycle software isn’t a new concept.  What we expect of ZTA models in the future is today delivered, in part, by what’s called “DevOps” software.  DevOps is short for “development/operations” and it describes the way that people who build software or software-driven processes communicate the deployment and operational needs of their processes to those who are responsible for running them.  There are two broad approaches to this: a “prescriptive” approach and a “descriptive” approach.  The first defines the specific steps to be taken to do something; the second defines the goal state of things, and then uses that to derive the steps needed to achieve that state.
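
Just to make the distinction concrete, here’s a minimal sketch in Python (a toy illustration of my own, not any particular DevOps tool’s actual API) of the two styles applied to something as simple as getting a firewall function into place:

```python
# A toy "host" whose state we can inspect and change; it stands in for a real system.
host = {"package:firewall": "absent", "config:fw": None, "service:firewall": "stopped"}

# Prescriptive ("imperative") style: spell out the exact steps, in order.
def deploy_prescriptively(target):
    target["package:firewall"] = "installed"    # step 1: install the package
    target["config:fw"] = "allow 443; deny *"   # step 2: write the configuration
    target["service:firewall"] = "running"      # step 3: start the service

# Descriptive ("declarative") style: state the goal; a reconciler closes any gap.
DESIRED = {
    "package:firewall": "installed",
    "config:fw": "allow 443; deny *",
    "service:firewall": "running",
}

def reconcile(target, desired):
    for key, goal in desired.items():
        if target.get(key) != goal:   # compare actual state with intended state
            target[key] = goal        # take whatever action closes the gap

reconcile(host, DESIRED)
print(host)   # same end state either way; the difference is how you express it
```

The end state is the same either way; the difference is that the descriptive reconciler can be re-run any time reality drifts from intent, which is exactly the behavior a lifecycle automation process needs.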

We seem to have accepted that the latter approach is best for ZTA.  “Intent modeling” of a service or application breaks something complex down into simple atoms, often using a hierarchical decomposition of the top level to work down to the bottom.  Each element in the model is responsible for delivering a specific functionality, and in response to any changes in the conditions that impact that responsibility, each is expected to take action internally or report failure up the line.
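
As a rough illustration of that hierarchical decomposition (a toy structure of my own, not anyone’s standard schema), picture a “vpn-service” intent model whose elements try to meet their own intent and escalate only when they can’t:

```python
# A toy intent-model element: it owns one piece of functionality, tries to meet
# its intent on its own, and reports failure up the line only when it can't.
class IntentElement:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def on_condition(self, condition):
        """Return True if the intent is still met after handling the condition."""
        if self.remediate(condition):
            return True                            # handled internally; the parent never sees it
        print(f"{self.name}: escalating '{condition}'")
        return False                               # escalate: now the parent must respond

    def remediate(self, condition):
        return condition != "hard-failure"         # placeholder for real remediation logic

# Hierarchical decomposition of a complex service into simpler functional atoms.
service = IntentElement("vpn-service", [
    IntentElement("access-network", [IntentElement("access-leg-east"),
                                     IntentElement("access-leg-west")]),
    IntentElement("core-vpn"),
    IntentElement("hosted-firewall"),
])

service.children[1].on_condition("hard-failure")   # core-vpn can't cope, so it escalates
```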

In network management, this has been seen for ages as a “state/event” problem.  For each element there are a number of possible states, one of which is the target state of the moment.  The goal of the software is to respond to “events” in such a way as to achieve an orderly transition to the goal state.  In OSS/BSS, work done by the TMF a decade ago launched the notion that a contract contained elements linking service events to operations processes (the “NGOSS Contract”, which morphed into their GB942 specification).  To make either approach work, you have to be able to componentize your processes so that they can be triggered individually by a state/event combination.  The state/event progressions determine process execution, not some abstract workflow implicit in the applications themselves.
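
Here’s a minimal sketch of that state/event idea (my own illustration, not the TMF’s actual GB942 mechanics): a table keyed by state and event decides which process component runs and what the next state is.

```python
# A toy state/event table for one model element: each (state, event) pair names
# the process component to run and the next state, so the table drives process
# execution rather than a workflow baked into the applications.
def order_connection(ctx): print("deploying", ctx["element"])
def confirm_active(ctx):   print("now active", ctx["element"])
def start_repair(ctx):     print("repairing", ctx["element"])

STATE_EVENT_TABLE = {
    ("ordered",   "activate"): (order_connection, "deploying"),
    ("deploying", "deployed"): (confirm_active,   "active"),
    ("active",    "fault"):    (start_repair,     "repairing"),
    ("repairing", "repaired"): (confirm_active,   "active"),
}

def handle_event(element_ctx, event):
    process, next_state = STATE_EVENT_TABLE[(element_ctx["state"], event)]
    process(element_ctx)               # invoke the componentized process
    element_ctx["state"] = next_state  # orderly transition toward the goal state

ctx = {"element": "access-leg-east", "state": "ordered"}
handle_event(ctx, "activate")          # -> deploying
handle_event(ctx, "deployed")          # -> active
```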

If we have a bunch of abstract events collecting around a functional intent model hierarchy that defines a service or application, it follows that the events are asynchronous in their appearance and that some processes might be referenced in multiple places.  That makes it important to have a software architecture that lets you define component processes and scale them as needed, and that drives us toward what’s usually called a microservices approach.

Microservices are little functional pieces that expect to run on demand, and that don’t themselves try to force “context” or stepwise operation on the stuff that’s invoking them.  If you want Process A to run on demand, and if you want to scale it on demand, it’s essential that the process doesn’t store something that would alter its behavior the next time it’s run.  Otherwise secondary events have to be sent to the same process instance that got the primary events, and the way the process works will vary in a way that the state/event model can’t anticipate.
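
A small sketch of the difference, with illustrative Python names of my own: the stateless version carries everything it needs in the request and returns everything it decided, while the stateful version quietly pins later events to one particular instance.

```python
# A stateless process component: everything it needs arrives with the request,
# and everything it decided goes back in the response, so any instance of it
# (including one just spun up to handle a load spike) can process any event.
def fault_handler(service_model: dict, event: dict) -> dict:
    updated = dict(service_model)              # work on a copy; keep nothing locally
    updated["state"] = "repairing"
    updated["last_event"] = event["type"]
    return updated                             # the new model state is the only output

# The anti-pattern: instance-local memory means later events must find THIS copy.
class StatefulHandler:
    def __init__(self):
        self.pending = {}                      # hidden context that scaling can't see

    def handle(self, service_id, event):
        self.pending[service_id] = event       # the follow-up event must hit this instance
```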

Like everything, the notion of microservices is easily trivialized.  Just saying you have the approach, of course, does nothing.  Even building software from microservices may not do anything either.  A scalable architecture helps overall scalability only if the stuff that’s subject to variable loads is the same stuff you’ve made scalable, and only if you can scale it to accommodate the loads that are expected or possible.

Consider an operator or cloud provider with a million services, meaning a million service models.  Even a routine event rate of a tenth of one percent per day puts a thousand service models in play, and a single large-scale failure could hit ten percent of the total base, a hundred thousand services at once, which would swamp most systems.  The fact is that every service model has to be considered an independent process set.  That’s much more like the event-driven thinking of the functional computing now evolving.

The ZTA of the future should really be thought of as a set of models, referencing processes, and dynamically linked to hosting resources when something happens.  Each service has the potential, at any point in the lifecycle, of being a process/resource consumer.  That means that in many cases we’ll have to be able to spin things up almost on a per-service basis.  Instead we tend to think of ZTA as a software engine that inputs model events and churns out answers.  That’s not going to work.
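
Pulling those threads together, here’s a rough, functional-computing-flavored sketch (illustrative names and a stand-in model store of my own, not a real ZTA implementation) of what per-service, event-driven dispatch might look like:

```python
# Per-service, event-driven dispatch in a functional style: load the service's
# model on demand, run the right stateless process for the event, store the
# result, and keep nothing resident per service between events.
MODEL_STORE = {"svc-0001": {"state": "active"}}       # stand-in for a model repository

def repair_process(model, event):
    return {**model, "state": "repairing", "cause": event["type"]}

HANDLERS = {("active", "fault"): repair_process}      # state/event to process mapping

def on_event(service_id, event):
    model = MODEL_STORE[service_id]                        # load that service's model
    process = HANDLERS[(model["state"], event["type"])]    # pick the process to run
    MODEL_STORE[service_id] = process(model, event)        # run it and persist the result

on_event("svc-0001", {"type": "fault"})
print(MODEL_STORE["svc-0001"])
```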

One reason all of this is important now is that Verizon has joined the ONAP project that AT&T’s ECOMP operations model anchors.  That makes it almost certain that ONAP will dominate open-source ZTA, which dominates operators’ plans.  If ZTA is going to be implemented, it will either be via the open-source ONAP project or through some integrated commercial tool set.  That means that we have to be able to fit ONAP to these three points, and that we have to use them to judge any competing solution suites or components thereof.

In past technology innovation adventures, including both SDN and NFV, we let ourselves get blinded by buzzwords, by anything that claimed a connection.  Automation has to be different because we’re running out of time to transform things.  In my next blog, I’m going to talk about the scope issue, and that will explain why I think that things like ONAP are so critical.  From there, we’ll move to the rest of the issues, ending with that most-difficult-of-all software architecture problem.  I hope you’ll find it all interesting!