What Benefits Do Users See in Applying AI to Netops?

Artificial intelligence is, in a sense, like a UFO. Until it lands and submits to inspection, you’re free to assign it whatever characteristics you find interesting. Network operations staff have only recently been facing an AI “landing”, and so they’re only starting to assign specific characteristics to it, from which they can derive an assessment of value. But they have started, so let’s look at what they’re seeing, or at least hoping for. What are our goals in “AI ops”?

There are a lot of stories about the use of AI in operations centers to respond to failures, and that certainly seems a valid application. AI could provide quick, automatic responses to problems, and could likely anticipate at least some of them. Sure, a sudden capacitor explosion in a router could create a zero-warning outage, but operations people say that most problems take a few minutes to develop. You’d expect operations to love this sort of thing, but apparently not so much.
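To illustrate that “few minutes to develop” point, here’s a minimal sketch of a trend-based early warning, assuming a metric like interface error counts sampled once a minute; the window size, threshold, and sample numbers are my own inventions, not anything operations people cited:

```python
from collections import deque

class EarlyWarning:
    """Watch a sliding window of samples and flag a sustained climb."""
    def __init__(self, window=5, slope_limit=2.0):
        self.samples = deque(maxlen=window)
        self.slope_limit = slope_limit  # allowed rise per sample (assumed)

    def observe(self, value):
        self.samples.append(value)
        if len(self.samples) == self.samples.maxlen:
            # average rise per sample across the window
            slope = (self.samples[-1] - self.samples[0]) / (len(self.samples) - 1)
            if slope > self.slope_limit:
                return f"warning: metric climbing at {slope:.1f}/sample"
        return None

ew = EarlyWarning()
for errors in [1, 2, 3, 8, 15]:  # hypothetical per-minute error counts
    alert = ew.observe(errors)
    if alert:
        print(alert)  # fires before the link actually fails
```

Nothing fancy is needed to beat a zero-warning failure; the point is that most problems leave a trail something could act on.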

The fact is that everyone from end users through carriers to cloud providers says that network change management is their top target, not “fault management” in the sense of recovering from failures. Almost all ops professionals say that their top problem is configuring the network (or even the computing resource pool) to fit the current work requirements, and that the growing complexity of their infrastructure means that it’s all too easy to make a mistake.

That “make a mistake” thinking may explain a lot here. An exploding capacitor in a router isn’t operations’ fault, but a configuration error is a human error that not only hurts the reputation of whoever made it, but also the reputations of their management and even those who planned out the operations practices and selected the tools. “Failure” is bad, but “error” is a lot worse because it can be pinned on so many.

In fact, the view of the role of AI in fault management may be tainted by this cover-your-you-know-what view. There’s a lot of reluctance to accept a fully automated AI response to a problem, and if you dig into it, you find that it stems from a fear that the operator will be held accountable for the AI decision. What the majority of ops people want is for an AI system to tell them there’s a problem and suggest what it is. The operator would then take over. A smaller number want AI to lay out a solution for human review and commitment. The smallest number want AI to just run with it, perhaps generating a notification, and this approach is usually considered suitable only for “minor” or “routine” issues.
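Those three comfort levels amount to an autonomy policy, and a sketch shows how simple the dispatching could be. Everything here, the level names, severity labels, and sample finding, is hypothetical:

```python
from enum import Enum
from dataclasses import dataclass

class AutonomyLevel(Enum):
    ADVISE_ONLY = 1     # report the problem and suggest a cause; humans act
    PROPOSE_FIX = 2     # lay out a fix for human review and commitment
    AUTO_REMEDIATE = 3  # act on its own, then notify

@dataclass
class Finding:
    description: str
    severity: str       # e.g. "minor", "major", "critical"
    proposed_fix: str

def handle_finding(finding: Finding, policy: dict) -> str:
    """Route an AI finding according to the operator-set autonomy policy."""
    level = policy.get(finding.severity, AutonomyLevel.ADVISE_ONLY)
    if level is AutonomyLevel.AUTO_REMEDIATE:
        return f"APPLYING: {finding.proposed_fix} (notification sent)"
    if level is AutonomyLevel.PROPOSE_FIX:
        return f"AWAITING APPROVAL: {finding.proposed_fix}"
    return f"ALERT: {finding.description}"

# Only "minor" issues get full autonomy, matching what ops people describe.
policy = {"minor": AutonomyLevel.AUTO_REMEDIATE,
          "major": AutonomyLevel.PROPOSE_FIX,
          "critical": AutonomyLevel.ADVISE_ONLY}
print(handle_finding(Finding("BGP session flap on edge-7", "minor",
                             "reset session and damp the route"), policy))
```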

The notion that an operations type would be tarred with the brush of AI error isn’t as far-fetched as it seems. Non-technical people often see “the computer” or “the network” or “AI” as a co-conspirator with whoever is programming or running it. In my first job as a programmer, a printer error had resulted in a visible check number at the upper right that differed from the MICR-encoded number at the bottom. This resulted in widespread reconciliation problems, and the internal auditor stormed up to me and shouted “Your computer is making mistakes and you’re covering up for it!”

If configuration management is really the goal, then what specifically do operations people want? Essentially, they’d like to be able to express a change in the same terms in which it was described to them. Generally, what this means is that a service or application has an implicit “goal state”, which is the way infrastructure is bound to the fulfillment of the service/application requirements. They’d like AI to take the goal state and transform it into the commands necessary to achieve it. When I hear them talk, I’m struck by how similar this is to the “declarative” model of DevOps: tell me what’s supposed to be there, and I’ll figure out the steps. Normal operations tends to be “imperative”: tell me the steps to take, and hope they add up to the goal.
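A toy example shows the distinction. In a declarative approach, the operator states the goal, and the tooling diffs it against reality to derive the imperative commands; the device names, VLAN, and command strings below are all invented for illustration:

```python
# Goal state: what the service requires to be true.
goal_state = {
    "vlan110": {"device": "leaf-3", "ports": {"eth1", "eth2", "eth5"}},
}

# Actual state: what's configured today (as discovered from the network).
actual_state = {
    "vlan110": {"device": "leaf-3", "ports": {"eth1", "eth2", "eth9"}},
}

def plan_changes(goal, actual):
    """Diff goal against actual and emit the commands that close the gap."""
    commands = []
    for name, want in goal.items():
        have = actual.get(name, {"ports": set()})
        for port in sorted(want["ports"] - have["ports"]):
            commands.append(f"{want['device']}: add {port} to {name}")
        for port in sorted(have["ports"] - want["ports"]):
            commands.append(f"{want['device']}: remove {port} from {name}")
    return commands

for cmd in plan_changes(goal_state, actual_state):
    print(cmd)
# leaf-3: add eth5 to vlan110
# leaf-3: remove eth9 from vlan110
```

The imperative alternative is the operator typing those commands directly and hoping nothing was missed; the declarative version makes the goal itself the input, which is what ops people are describing.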

Another thing that operations types say they want from AI is simplification. Infrastructure is complicated, and that complexity limits the ability of a human specialist to assimilate the data needed for what pilots call “situational awareness”. Think of yourself as captain of a ship; there’s a lot going on and if you try to grasp all the details, you’re lost. You expect subordinates to ingest the stuff under their control and spit out a summary, which you then combine with the reports of others to understand whether you’re sailing or sinking. Operations people think AI could play that role.

The “how” part is a bit vague, but from how they talk about it, I think they want some form of abstraction or intent modeling. There are logical functional divisions in almost every task; a ship has the bridge, the engine room, combat intelligence, and so forth. Networks and data centers have the same thing, though exactly what divisions would be most useful or relevant may vary. Could AI be given what’s essentially a zone of things to watch, a set of policies to interpret what it sees, and a set of “states” that it could then say are prevailing? It would seem that should be possible.
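In fact, the zone/policy/state idea maps onto a very small amount of code. This sketch assumes each zone ingests its own telemetry and reports a single summarized state upward; the metric names, thresholds, and state labels are mine:

```python
def summarize_zone(metrics: dict, policies: list) -> str:
    """Apply ordered policies to raw zone metrics; return the prevailing state."""
    for predicate, state in policies:
        if predicate(metrics):
            return state
    return "normal"

# Policies for a hypothetical "core" zone, checked in priority order.
core_policies = [
    (lambda m: m["link_loss_pct"] > 5.0, "degraded"),
    (lambda m: m["util_pct"] > 90.0, "congested"),
]

# The "captain" sees one word per zone, not the underlying detail.
print(summarize_zone({"link_loss_pct": 0.2, "util_pct": 94.0}, core_policies))
# congested
```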

The final thing that operations people want is complete, comprehensive journaling of AI activity. What happened, how was it interpreted, what action was recommended or taken, and what then happened? Part of this goes back to the CYA comment I made earlier; operations types who depend on AI have to be able to defend their own role if AI screws up. Part is also due to the fact that without understanding how a “wrong” choice came about, it’s impossible to ensure that the right one is made next time.
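A journal entry, at minimum, has to answer those four questions. Here’s a minimal sketch of what one might look like; the field names and the sample incident are assumptions, not any vendor’s format:

```python
import json
import datetime

def journal_entry(event, interpretation, action, mode, outcome):
    """Record the four questions: what happened, how it was read,
    what was recommended or taken, and what then happened."""
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "event": event,                    # what happened
        "interpretation": interpretation,  # how the AI read it
        "action": action,                  # what was recommended or taken
        "mode": mode,                      # "recommended" vs "executed"
        "outcome": outcome,                # what then happened
    }

entry = journal_entry(
    event="latency spike on app tier",
    interpretation="probable congestion on uplink trunk",
    action="shift flows to alternate path",
    mode="recommended",
    outcome="operator approved; latency normalized",
)
print(json.dumps(entry, indent=2))
```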

It’s surprising how little is said about journaling AI analysis and decisions, even when the capability actually exists. Journals are logs, and logs are among the most valuable tools in operations management, but an AI activity journal isn’t automatic; it has to be created by the developer of the AI system. Even if it is, there has to be comprehensive documentation on how to use it, or you can bet it won’t be used. A few operations people wryly commented that they needed an AI tool to analyze their AI journal.

The journaling issue raised what might be a sub-issue, which was the need to understand what was available for AI to analyze. Most organizations said they had no idea what data was actually available, what its timing was, or how it could be accessed. They had stuff they used, things they were familiar with, but they also had the uneasy feeling that if their AI was limited to knowing the very same things the operations people themselves knew, it probably wasn’t being used to full potential. A few very savvy types said they thought a vendor should provide a kind of information inventory that listed all the available information, its formats, its conditions of availability, and so forth. Yes, they said, all that was out there, but not in any single convenient place.
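What would such an inventory look like? Probably something as simple as a catalog. The entries below are illustrative guesses at the kinds of sources a network AI might draw on, not a standard of any kind:

```python
# A hypothetical information inventory: one catalog describing every data
# source an AI could draw on, its format, timing, and conditions of access.
inventory = [
    {"source": "device syslog", "format": "RFC 5424 text",
     "cadence": "streaming", "access": "syslog collector, ops-read role"},
    {"source": "interface counters", "format": "SNMP/gNMI telemetry",
     "cadence": "30s poll or on-change", "access": "telemetry broker"},
    {"source": "flow records", "format": "IPFIX",
     "cadence": "1-minute export", "access": "flow collector, 30-day retention"},
]

# An AI pipeline (or a human) can now answer "what could we be looking at?"
for item in inventory:
    print(f"{item['source']}: {item['format']} ({item['cadence']})")
```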

This point, an afterthought on the last suggested AI priority, might actually be the key to the whole AI success-or-failure story. Garbage in, garbage out, after all. That may be the reason why single-vendor AI strategies, those that link AI tools to the vendor’s own products, work the best. It may also be the guidepost for how to integrate other vendors and technologies into an AI system. You need that journal decoder to identify and characterize the important stuff, and also some control over what gets journaled in the first place.

Regarding that, I want to call out a point I made several years ago regarding SD-WAN implementations. Networks exist to serve goals, and business networks in particular have to be able to support the applications that are most important to the business, applications whose execution benefits likely justify a big part of the cost of the network. Session awareness, the ability to capture information on the user-to-application relationships being run, is critical in getting the most out of network security, but also in getting the most out of AI. Enterprises aren’t fully aware of the implications here, but some leaders tell me that knowing whether “critical” sessions are being impacted by a problem, and considering the routing of critical sessions during configuration changes, is likely a key to effective use of AI down the line.
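To make the session-awareness point concrete, here’s a minimal sketch of a pre-change check, assuming a session table with path data already exists somewhere; every name in it is hypothetical:

```python
# Sessions as user-to-application relationships, with the path each one
# currently takes and a business-criticality flag (all values invented).
sessions = [
    {"user": "trading-desk", "app": "order-entry",
     "path": ["edge-1", "core-2"], "critical": True},
    {"user": "back-office", "app": "email",
     "path": ["edge-3", "core-2"], "critical": False},
]

def impacted_critical_sessions(sessions, element):
    """Return critical sessions whose current path includes the element."""
    return [s for s in sessions if s["critical"] and element in s["path"]]

# Planned change on core-2: surface what matters before committing.
for s in impacted_critical_sessions(sessions, "core-2"):
    print(f"critical session at risk: {s['user']} -> {s['app']}")
```

A change-management AI with this kind of visibility could sequence or reroute around critical sessions automatically; without it, “critical” and “routine” traffic look exactly the same.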