What Role Can AI Play in Service Lifecycle Automation?

I hate to open another hype can of worms, but this is a question that has to be asked.  Is there a role for artificial intelligence (AI) in service lifecycle automation, virtualization, SDN, and NFV?  The notion raises the specter of a glittering robot sitting at a console in a network operations center, and surely, we’re going to be seeing media stories along these lines, because excitement and not realism is the goal.  Underneath the overstatement and underthinking, though, there may be some very interesting truths.

I asked an old friend who runs a big NOC my opening question, and the first response was a wry “I’m wondering if there’s a role for natural intelligence there!”  On further consideration, I got what I think is the starting point for the whole discussion, “It depends on what you mean by AI.”

What Wikipedia says about AI is this: “In computer science, the field of AI research defines itself as the study of ‘intelligent agents’: any device that perceives its environment and takes actions that maximize its chance of success at some goal.  Colloquially, the term ‘artificial intelligence’ is applied when a machine mimics ‘cognitive’ functions that humans associate with other human minds, such as ‘learning’ and ‘problem solving’.”

If we take the first, formal definition, it’s pretty clear that service lifecycle automation would fall within the scope of AI.  If we take the second, it’s a bit fuzzier, and to decode the fuzz we have to look deeper at the mission itself.

Service lifecycle automation is based on the notion that a given network service has a desired state of behavior, one that was sold to the service user and designed for by the network engineers who put the whole thing together.  The purpose of service lifecycle automation is to get a newly ordered service into that preferred state and keep it there for the lifetime of the service.  When the service ends, any capacity or resources would be returned and the service would no longer be available.

Not even a human operator would be able to perform this task without knowing what the preferred state of the service was, and also the current state of the service.  Generally, NOC personnel managing service lifecycles manually would respond to a condition that changed service state from the goal state, and that response would be expected to get things right again.  This process has become increasingly complicated as services and infrastructure become more complicated, and as a result there’s been growing focus on automating it.

DevOps tools are an example of automation of software deployment tasks, and much of network service lifecycle automation is based on similar concepts.  DevOps tools support either the recording of “scripts,” meaning series of steps that can be invoked manually or in response to an event, or the definition of a set of possible states and, for each state, the processes that would move the service back toward its goal state.

In AI terms, the person who puts together these scripts would be called a “subject matter expert”, and would be expected to sit down with a “knowledge engineer” to transfer expertise into machine-applicable form.  I would argue that this is what happens when you do DevOps, or when an expert defines the state/event and event/process relationships associated with service lifecycle management.  This is why I think that the first definition of AI is met by the kind of service lifecycle automation I’ve been blogging about for years.
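To make the state/event idea concrete, here’s a minimal sketch of the kind of table a subject matter expert’s knowledge might be reduced to.  All of the state, event, and handler names are hypothetical illustrations, not drawn from any particular DevOps or orchestration tool:

```python
# Minimal sketch of a state/event table for service lifecycle automation.
# State, event, and handler names are hypothetical illustrations.

def redeploy(service):
    print(f"redeploying {service}")
    return "activating"

def scale_out(service):
    print(f"scaling out {service}")
    return "active"

def tear_down(service):
    print(f"releasing resources for {service}")
    return "terminated"

# For each (state, event) pair the expert has defined, name the process
# to run; each process returns the state the service moves into.
TRANSITIONS = {
    ("active", "node_failed"): redeploy,
    ("active", "load_high"):   scale_out,
    ("active", "order_ended"): tear_down,
}

def handle_event(state, event, service):
    handler = TRANSITIONS.get((state, event))
    if handler is None:
        # No rule defined for this combination: escalate rather than guess.
        print(f"unhandled ({state}, {event}) -- escalating to NOC")
        return state
    return handler(service)
```

The point of the sketch is that everything the automation “knows” was put there explicitly by a human expert; any (state, event) pair nobody anticipated falls through to escalation.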

The real AI question, then, is that second part of the definition, coupled with the pithy comment of the NOC manager.  Would there be a value to a “cognitive” component to service lifecycle automation, one that perhaps involved “learning” and “problem-solving?”  If so, what might this component look like, how would it help, and how could it be implemented?

Most NOC people and operations specialists I’ve talked with say that they would not want a service lifecycle automation system to simply run off and do stuff in defiance of specific rules to the contrary, any more than they’d want ops personnel to do that.  That means that if we have a specific condition and specific instructions to do something when that condition arises, the AI system should do it.  No cognition there.

However, every NOC/operations type knows that there comes a time in service lifecycle management known as the “Oh **** moment”.  Everyone in technology operations has experienced one of these.  They usually happen for one of two reasons.  First, a bunch of bad things happen all at the same time.  In state/event terms, this means that you either have a flood of events or you have a combination of events that you never thought about, and didn’t create a state for.  The second reason is that you take what is supposed to be a restorative action and things get worse rather than better.

I saw the latter personally in the healthcare industry.  A seemingly minor parameter error was made in setting up a bunch of routers.  The result was that on the next full working day, connections between workers and critical applications began to fail.  The normal remedial process, which was to simply reset and restore the connections, made things worse.  The final step was to assume that the host software had gone bad and reload/restart.  That killed everything.

You can make a darn convincing argument that machine cognition could have been valuable at this point.  The same can be said in any situation where there are either a bunch of bad things (which implies common cause) or a remediation failure (which implies unexpected consequences).  However, it may well be that these issues are beyond the range of reasonable AI response, at least in the near term.

In my healthcare example, diagnosis of the problem required a combination of experience that was apparently uncommon enough not to be found in either the healthcare provider’s organization or the router vendor’s.  Might the necessary skills have been “taught” to AI?  Perhaps, if somebody were willing to fund an enormous collection of subject-matter experts and the even-more-expensive dumping of their brains into a database.

A real advance in AI application to service lifecycle management would have to come, IMHO, from a combination of two factors.  First, we’d need to be able to substitute massive analytics for the subject matter expert.  Collecting data from the history of a given network, or in fact from all networks, might allow us to create the inputs about correct and incorrect operation, and the impacts of various restorative steps, that a knowledge engineer would need.  Second, we need an on-ramp to this more AI-centric state that’s not overly threatening.

What would be useful and incremental, perhaps, is what could be called the “Oh ****” state and an event of the same name.  The reception of the so-named event in any operating state causes a transition to the you-know-what state, where a process set designed to bring order out of presumptive chaos is launched.  That implies a designed-in capability of restoring the state of everything, perhaps by some form of by-domain redeployment.
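The wildcard-state idea could be grafted onto an ordinary state/event table like this.  This is a sketch only; the event name, state name, and domain list are my own inventions:

```python
# Sketch: a panic event that is valid in *every* operating state and always
# transitions to a dedicated recovery state. All names are hypothetical.

PANIC_EVENT = "oh_no"
PANIC_STATE = "recovering"

def begin_domain_redeployment():
    # Restore known-good state domain by domain, rather than trying to
    # diagnose a chaotic situation event by event.
    for domain in ("access", "metro", "core"):
        print(f"redeploying domain: {domain}")

def normal_rules(state, event):
    # Stand-in for the conventional state/event table; here it simply
    # leaves the state unchanged.
    return state

def dispatch(state, event):
    # The panic event overrides whatever per-state rules exist.
    if event == PANIC_EVENT:
        begin_domain_redeployment()
        return PANIC_STATE
    return normal_rules(state, event)
```

The design choice worth noting is that the panic transition is checked before the normal table, so no combination of missing rules can prevent the system from falling back to wholesale restoration.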

There is an AI opportunity here, because it would be difficult for human operators to catalog the states of chaotic network infrastructure.  Analytics and AI principles could be used to match behavior patterns observed in the past with the way the situation developed and how it progressed.  This could then be used to decide what action to take.  In effect, AI becomes a backstop for policies, where policies simply haven’t, or can’t, be developed.
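As a sketch of what “matching behavior patterns” might mean mechanically, assume (and it’s a big assumption) that network conditions can be reduced to numeric feature vectors.  A nearest-neighbor lookup over historical incidents could then propose an action where no explicit policy exists, and escalate when nothing in history looks similar.  The feature encoding, incident records, and action names below are all hypothetical:

```python
# Sketch: nearest-neighbor matching of a current network condition against
# historical incidents, used as a backstop where no policy exists.
# The feature encoding and incident records are hypothetical.
import math

# (feature vector, action that worked) from past incidents.
HISTORY = [
    ((0.9, 0.1, 0.0), "redeploy_domain"),
    ((0.2, 0.8, 0.1), "reset_connections"),
    ((0.1, 0.1, 0.9), "rollback_config"),
]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def suggest_action(condition, threshold=0.5):
    best = min(HISTORY, key=lambda rec: distance(condition, rec[0]))
    if distance(condition, best[0]) > threshold:
        # Nothing in history is similar enough: defer to a human.
        return "escalate_to_human"
    return best[1]
```

The threshold is the safety valve: it is what keeps this from being AI “running off and doing stuff,” because a condition unlike anything seen before goes to the NOC rather than to an automated guess.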

From there, of course, you could expand the scope.  Over time, NOC personnel and operations executives might become more comfortable with the idea of having their own rules of network state and lifecycle management supplanted by machine decisions.  If the automated processes can prove themselves where humans can’t go, it’s much easier to commission them to perform routine tasks.  People, having experience with AI operating where policies can’t easily be formulated, will eventually allow AI to formulate policies in an ever-broader way.

This isn’t going to be an overnight process, though.  AI still has to work from some base of knowledge, and just gathering data isn’t going to deliver everything it needs.  Even a long historical data timeline isn’t the full answer; networks rarely operate with unchanged infrastructure for even a year.  We’ll need to examine the question of how to gather and analyze information on network conditions to get the most out of AI, and we’ll probably need humans to set policies for a long time to come.