Strategies for Automated Operations

There seem to be a lot of announcements around automated operations these days.  Many, perhaps even most, involve artificial intelligence and machine learning.  Obviously a part of this is due to the tried-and-true tactic of vendors jumping on hot terms for SEO, but part is also due to the fact that operations issues are increasingly recognized as being minefields of cost and churn.  To separate the hype from the truly useful, we need to frame a standard approach that’s optimized for reality.

Martin Creaner offered an assessment of telcos’ digital transformation progress that touches on this point in several of the areas he examines.  He makes a lot of good points, and while I don’t necessarily agree with everything, I think his view is at least as good as my own; you’d benefit from considering his piece in conjunction with what I write here.  As usual, I’m trying to take things from the top down, and that yields a different perspective.

Martin’s piece rates telcos’ transformation progress in ten areas on a 1-10 scale, and my rough calculation says he’s giving them an average of 6 out of 10.  I’d probably shave at least one and perhaps two points from that, but Martin’s higher scores are associated with things like 5G, which are undeniably important but don’t make up anything like the totality of operator services and responsibilities.  I also think that dividing transformation into ten areas can actually disguise a problem: the lack of a systematic overall approach.  We already have many areas where mobile and wireline are converging, and there are going to be more of them.  We really need to solve automated operations challenges throughout the telcos’ infrastructure.

The only solution to growing complexity is expanding automation.  Networks are getting more complex naturally, from growth in traffic and connected users and from the introduction of new technologies.  The technology creating the greatest complexity risk, in fact, is the one that holds the most promise for the future: hosted and composable functions.  So what do we do about growing complexity, and how can we introduce automation?  There are two basic options.

Option one is to apply automation, and likely AI/ML, directly to the point-of-management interfaces.  We have humans sitting in network operations centers (NOCs) today, and this approach is aimed at making the human burden (and its associated costs and errors) smaller by taking on tasks that are presented to humans, implicitly or explicitly, as part of orderly operations.

Option two is to restructure network-building into the assembly of networks and services from autonomous elements.  A network, in this approach, is created from self-managed components, and these “intercept” faults and issues before they’d ever reach the NOC and human visibility.  As a result, today’s point-of-management interfaces have a diminished (and likely diminishing further, as practices improve) mission.

The big plus of the first option is that it essentially creates an automation layer attached to current management systems at the point where they interface with the NOC, or above the NOC itself.  That means nothing below requires any retooling, so it conserves current investments in management and monitoring, and it obviously doesn’t imply any changes to network infrastructure itself.  A lot of network vendors are going this way, at least initially.

There are limitations to this first approach, though, and some could be significant.  To start with, it may be difficult to apply the new-layer solution at the top of each current management stack, because automated remediation may require actions that cross over between management systems.  That would mean that to be fully effective, the strategy has to be applied at what’s essentially the level between the NOC APIs and the humans.

The problem with any such high-level approach is scalability.  The complexity of network infrastructure and the scale of issues it can generate aren’t reduced; everything is still presented to the new automation layer to work on.  A massive outage could report tens of thousands of events in a short period, which calls into question whether the new layer could handle all the activity.

You can’t just have an event-handler at the top of the NOC stack (and the bottom of an event avalanche).  The problems of fault correlation and remedy interdependencies have to be addressed.  The former is a well-known problem, usually called “root cause analysis” to reflect the fact that a single fault can generate tens, hundreds, or thousands of related problems, none of which really needs to be fixed but all of which clutter the landscape.  The latter is rarely discussed, and it’s a major issue.
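To make the correlation side concrete, here’s a minimal root-cause-analysis sketch in Python.  It assumes faults propagate along a known dependency graph, so an avalanche of alarms can be collapsed to the one element that actually needs fixing; the topology and element names are hypothetical illustrations, not any vendor’s API.

```python
from collections import defaultdict

# Hypothetical dependency graph: child element -> the parent it relies on.
# A fault in a parent explains alarms raised by everything beneath it.
DEPENDS_ON = {
    "vpn-edge-1": "core-router-a",
    "vpn-edge-2": "core-router-a",
    "access-1": "vpn-edge-1",
    "access-2": "vpn-edge-2",
}

# An "avalanche" of alarms; only core-router-a actually needs repair.
alarms = ["access-1", "access-2", "vpn-edge-1", "vpn-edge-2", "core-router-a"]
faulted = set(alarms)

def root_of(element: str) -> str:
    """Walk up the dependency chain to the highest faulted ancestor."""
    while DEPENDS_ON.get(element) in faulted:
        element = DEPENDS_ON[element]
    return element

clusters = defaultdict(list)
for alarm in alarms:
    clusters[root_of(alarm)].append(alarm)

for root, symptoms in clusters.items():
    print(f"root cause: {root} (explains {len(symptoms)} alarms)")
```

Even this toy version shows why the overlay layer needs topology knowledge: without the dependency graph, all five alarms look like five separate problems.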

How do you know, when you take a step to solve a problem, that the step doesn’t collide with either other aspects of the same problem, or the effects of a parallel problem?  You can try to reroute traffic, for example, but it will be ineffective if the alternate route is already below spec, or has failed.  There has to be a classic “god’s-eye” view of a network to assure that the responses to a problem won’t create a different problem.
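A sketch of what that god’s-eye check might look like: before a reroute is committed, the remedy consults a shared, network-wide view of link state, so a fix for one problem can’t silently overload the path another fix depends on.  The link names, fields, and capacity figures are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Link:
    up: bool
    headroom_mbps: float  # spare capacity available right now

# Shared "god's-eye" state; in practice, a live network inventory.
GLOBAL_VIEW = {
    "alt-a": Link(up=True, headroom_mbps=50.0),    # already near its limit
    "alt-b": Link(up=True, headroom_mbps=400.0),
    "alt-c": Link(up=False, headroom_mbps=0.0),    # failed outright
}

def safe_reroute(demand_mbps: float, candidates: list[str]) -> str | None:
    """Pick an alternate only if it is up AND can absorb the demand,
    so the remedy doesn't just create a different problem."""
    for name in candidates:
        link = GLOBAL_VIEW[name]
        if link.up and link.headroom_mbps >= demand_mbps:
            link.headroom_mbps -= demand_mbps  # reserve before committing
            return name
    return None  # no safe remedy exists; escalate instead

print(safe_reroute(200.0, ["alt-a", "alt-c", "alt-b"]))  # -> "alt-b"
```

The reservation step is the important design point: it’s what keeps two parallel remediations from both claiming the same headroom.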

There are two possible paths to addressing this.  One is simulation, designed to determine what the outcome of a step would be by modeling the impact of taking it.  The other is failure modes, an approach that describes the different ways a network could operate in a problem state and uses events to determine which failure mode is present.  The failure mode would define a preferred response, which would be organized across all elements in the network, and could in fact be set by human policy in advance.
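The failure-mode path lends itself to a simple sketch: observed events are matched against predefined mode signatures, and each mode carries a response that humans approved in advance.  The mode names, event strings, and responses below are invented for illustration.

```python
# (mode name, event signature, pre-approved response set by human policy)
FAILURE_MODES = [
    ("core-link-loss", {"LOS:core-1", "BGP:session-down"},
     "shift traffic to the protection path; hold NOC ticket at low priority"),
    ("edge-overload", {"UTIL:edge>90", "QOS:sla-miss"},
     "rate-limit best-effort classes; add an edge instance"),
]

def classify(observed: set[str]) -> tuple[str, str]:
    """Return the best-matching failure mode and its planned response."""
    name, signature, response = max(
        FAILURE_MODES, key=lambda mode: len(mode[1] & observed))
    if not signature & observed:          # nothing matched at all
        return ("unknown-mode", "escalate to a human operator")
    return (name, response)

mode, action = classify({"LOS:core-1", "BGP:session-down", "QOS:sla-miss"})
print(f"{mode} -> {action}")              # core-link-loss -> shift traffic...
```

The appeal of this path is exactly what the paragraph above says: the response is organized network-wide and decided by policy before the outage, not improvised during it.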

OK, so the overlay approach has some potentially critical limitations.  How about the autonomous-network approach?  This one is much harder to assess, because saying “autonomous network” is like saying “multi-vendor network.”  One box may make a network “multi-vendor,” but one that’s split 50-50 is a lot more so.  With autonomous networking, the problem is defining both the scope of “network” and the meaning of “autonomy.”

If the autonomous network is simply a management domain, then what’s created is the same as adding a layer of automation to the top API of a current management system as it feeds into the NOC, which I covered above.  Rather than list all the things the term might mean that would produce sub-optimal behavior, let’s look instead at the right approach.

I think “autonomous network” means “intent-modeled network.”  You create functional black-box abstractions, like “VPN” or “access network” or “firewall,” and you define a specific set of external properties for each, including an SLA.  Any implementation (the inside of the black box) that meets the external properties is fine, and can replace any other.  Inside each abstract black box, an autonomous management process is responsible for meeting the SLA or reporting a failure, which then has to be handled at the next level up.
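Here’s a minimal sketch of that structure, assuming availability as the SLA metric.  The class names, the telemetry stand-in, and the remediation are all hypothetical; the point is the shape: external properties and an SLA on the outside, an autonomous manager on the inside, and failure reported upward only when local remediation can’t restore the SLA.

```python
from abc import ABC, abstractmethod

class IntentModel(ABC):
    """Black box: the outside sees only external properties, an SLA,
    and a pass/fail report; the inside is free to vary."""
    def __init__(self, sla_availability: float):
        self.sla_availability = sla_availability

    @abstractmethod
    def measured_availability(self) -> float: ...

    @abstractmethod
    def remediate(self) -> bool:
        """Autonomous internal management; True if the SLA is restored."""

    def check(self) -> bool:
        """Meet the SLA locally, or report failure to the layer above."""
        if self.measured_availability() >= self.sla_availability:
            return True
        return self.remediate()   # False means the parent must handle it

class MplsVpn(IntentModel):
    """One possible inside of the 'VPN' box; an SD-WAN implementation
    with the same external properties could replace it invisibly."""
    def measured_availability(self) -> float:
        return 0.989              # stand-in for real telemetry

    def remediate(self) -> bool:
        print("VPN-internal: rerouting to restore the SLA")
        return True

vpn = MplsVpn(sla_availability=0.999)
print("SLA met after local action:", vpn.check())
```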

The notion of building a network service by composing functions is obviously a shift in thinking, but in some ways that’s helpful here, because automated operations or “autonomous networking” is a shift in thinking too.  If you spend a bit of time defining your intent-modeled functions properly, and if you have a proper modeling language to express the result, you can define every operations task within the model, which means you can automate everything short of the handling of situations for which no handling is possible.  Even there, you could initiate an escalation, trigger a service credit, and so forth.
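To show what “every operations task within the model” could look like, here’s a toy sketch in which the model data, not the code, defines each lifecycle task, including the escalation and service-credit fallback just mentioned.  The service name, task names, and steps are invented; a real modeling language (TOSCA is one candidate) would be far richer.

```python
# Hypothetical intent model: the operations tasks live in data, not code.
MODEL = {
    "service": "business-vpn",
    "tasks": {
        "deploy":   ["allocate endpoints", "bind access networks", "verify SLA"],
        "heal":     ["re-run placement", "reroute traffic", "verify SLA"],
        "escalate": ["open NOC ticket", "trigger service credit"],
    },
}

def run(model: dict, task: str) -> None:
    """Generic interpreter: any task the model defines can be automated."""
    for step in model["tasks"][task]:
        print(f"[{model['service']}] {task}: {step}")

run(MODEL, "heal")
# If healing fails, the model still defines what happens next:
run(MODEL, "escalate")
```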

To get back to Martin’s piece, I think this represents the key qualification to his comments.  Not all autonomous networks are the same, and many aren’t really autonomous in the sense of being self-healing.  If we take early activities that could loosely be called progress toward true autonomy as evidence that automated operations is advancing, we’re presuming those initiatives are actually aimed at that goal and are being guided there by management.  I do not believe this is the case.

When I talk with operators themselves about automation efforts, they point to successes that are striking chiefly for their limited scope.  They’ve fixed this or that, but they’ve not addressed the problems of operations overall.  That may, in the short term, be a driver for the AI-overlay approach.  Putting a layer of AI at the NOC has the advantage of being systemic by nature, but I think the issues I’ve noted above mean it will eventually hit the wall without intent-modeled elements supporting it from below.  That’s the strategy I’m most interested in seeing.

I still think that Juniper may have a shot at getting there.  Their Paragon Automation has a lot of the pieces of an intent-modeled approach, and their Apstra acquisition would let them take things further.  They have a great AI tool in Mist too, but Paragon doesn’t cite any deep Apstra or Mist AI roots, so it’s not yet the fully structured and autonomous-management piece it could be.

We need to go back to a key point, which is that you can’t do effective operations automation in little grope-the-elephant-behind-the-screen pieces; you have to do it through a strategy that spreads to every part of the network.  It will be interesting to see if Juniper takes it further, and whether competitors like Cisco (with Crosswork) will push their own offerings in the right direction…or both.