Making Sure AI Operations Deliver on their Promise

With Tesla about to release its humanoid robot, we’re going to see more talk about the dangers of machines taking over. Do we actually face the risk that AI will get smart enough to become truly sentient and decide that its creators are an impediment to its future? I think we’re safe from that in the immediate future, but we do have to be thoughtful about how we use the kinds of AI we really can create, or we may have to worry about them too…for more pedestrian reasons.

Isaac Asimov was famous for his “Three Laws” of robotics, which obviously addressed an AI model that was to all intents and purposes self-aware and intelligent. For those who don’t remember, or weren’t sci-fi fans, they were:

  • A robot may not injure a human being or, through inaction, allow a human being to come to harm.
  • A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
  • A robot must protect its own existence as long as such protection does not conflict with the First or Second Laws.

A quick read of these shows that they were designed to protect us from robots, but suppose that robots and AI aren’t really the problem? A few early users of AI I’ve had chats with have offered some examples of real, present-day AI problems. They aren’t evidence that AI is more of a risk than we can tolerate, but rather that, like all technologies that hasten things, it can move too fast.

Another blast-from-the-past memory I have is a cartoon about the early days of IT, when it was common to measure computing power by comparing it to the work of human mathematicians. A software type comes home from work and says “I made a mistake today that would have taken a thousand mathematicians a hundred years to make!” It’s this cartoon, and not the Three Laws, that illustrates our challenge with controlling AI, and it’s not limited to things like neural networks either.

Tech is full of examples of software that’s designed to seek optimality. In networking, the process of route discovery is algorithmic, meaning it’s based on a mathematical procedure used to solve a specific, well-defined problem. The difference between algorithmic optimality and AI optimality is largely one of degree. An algorithm has a limited and specific mission, while AI is designed to support a broader mission, meaning more information, more assessments, more human-like thought processes.
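
To make that contrast concrete, here’s a minimal sketch of algorithmic route discovery, a Dijkstra-style shortest-path calculation over a made-up link-cost table. It isn’t any router vendor’s code; the topology and costs are invented purely for illustration.

```python
import heapq

def shortest_paths(links, source):
    """Dijkstra-style route discovery: least-cost paths from source.

    links: dict mapping node -> list of (neighbor, cost) tuples.
    Returns dict mapping node -> (total_cost, path).
    """
    best = {source: (0, [source])}
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node in visited:
            continue
        visited.add(node)
        for neighbor, link_cost in links.get(node, []):
            new_cost = cost + link_cost
            if neighbor not in best or new_cost < best[neighbor][0]:
                best[neighbor] = (new_cost, path + [neighbor])
                heapq.heappush(queue, (new_cost, neighbor, path + [neighbor]))
    return best

# Hypothetical topology: routing sees only link costs, nothing else.
topology = {
    "A": [("B", 1), ("C", 4)],
    "B": [("A", 1), ("C", 1), ("D", 5)],
    "C": [("A", 4), ("B", 1), ("D", 1)],
    "D": [("B", 5), ("C", 1)],
}
print(shortest_paths(topology, "A"))  # routes chosen purely on link cost
```

The algorithm’s mission is exactly one thing: minimize path cost. It knows nothing about application priority, maintenance windows, or business impact, which is the breadth AI is supposed to add.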

Routing algorithms make “mistakes”, which is perhaps the driving force behind the notion of SDN. You can create a more efficient network by planning routes explicitly to manage traffic and QoS, which is what Google does with its core network. However, central management a la SDN is difficult to scale. We could argue that AI in tech is designed to lift algorithmic optimization up a level. Maybe not to the top, where there are no issues, but at least closer to it.

According to the feedback I’ve gotten on AI, users see two distinct models, the advisory form and the direct-response form. In the former, AI generates information that’s intended to guide human action, and in the latter AI actually takes action. The direct-response form of AI obviously works more quickly and saves more human effort, but it presents more risk because the behavior of the AI system is likely to be difficult to override in real time. Users tell me that it’s easier to sell the advisory form of AI to both management and staff for reasons of perceived risk, and they also say that they’ve had more problems with direct-response AI.
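
One way to picture the two models, strictly as a sketch and not any product’s design: the same AI recommendation can either be queued for a human (advisory) or applied immediately (direct-response), and the distinction is essentially a policy switch. The Recommendation structure, the confidence threshold, and the callables below are my own inventions for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    ADVISORY = "advisory"        # AI recommends, a human applies
    DIRECT_RESPONSE = "direct"   # AI applies the action itself

@dataclass
class Recommendation:            # hypothetical structure, illustration only
    condition: str               # what the AI believes is wrong
    action: str                  # what it proposes to do about it
    confidence: float            # how sure it is

def handle(rec: Recommendation, mode: Mode, apply_action, notify_operator):
    """Route a recommendation according to the operating mode."""
    if mode is Mode.DIRECT_RESPONSE and rec.confidence >= 0.9:
        apply_action(rec.action)      # fast, but hard to override in real time
    else:
        notify_operator(rec)          # slower, but a human stays in the loop

# Example wiring with stand-in callables:
handle(
    Recommendation("link-utilization-high", "reroute-to-backup-path", 0.95),
    Mode.ADVISORY,
    apply_action=lambda a: print(f"applying {a}"),
    notify_operator=lambda r: print(f"operator review needed: {r.action}"),
)
```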

The problem with automated responses is that they aren’t always right, or they’re not always executed as quickly as expected. The former issue is the one most users report, and in most cases what they’re reporting is a response to a problem that doesn’t properly consider the network as a whole, or the state the network will be left in once the response is executed. That these problems boil down to “My AI didn’t do the right thing” is obvious; the issues behind that point need a bit of discussion.

Truly autonomous AI based on neural networks or built-in intelligence creates problems because of the way the underlying rules are created. The classic process involves a subject-matter expert and a “knowledge engineer” who work together to define a set of rules that will then be applied by AI. While a failure of either or both of these roles is an obvious risk, it’s not what enterprises say is at the root of the problem. That turns out to be biting off more than you can chew.

Most enterprise networks are complicated mixtures of technology, vendors, and missions. When you try to establish rules for AI handling, the policies that will guide AI interpretation of conditions and execution of actions, it’s easy to fall into a kind of “completeness trap”. You go to the netops people and ask what they’d like, and since they have a broad scope of responsibility, they offer a broad response. The problem is that this breadth means the subject-matter people (netops) will simply overlook some things, that some of what they describe won’t be perfectly understood by the knowledge engineers, or a combination of both.

Machine learning (ML) forms of AI fall into this same trap, for a slightly different reason. In an ML system, the root notion is that the system will learn from how things are handled. In order for that to work, the conditions that lead up to a netops intervention have to be fully described, and the action netops takes fully understood. In simple terms, an ML system for operations automation would say “when these-conditions occurred, this-action was taken, resulting in this-remediation-state.” All three of those placeholders would have to be fully defined, meaning that specific conditions could be assigned to the first and third and specific steps to the second.
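
In code terms, and purely as a sketch of that sentence (the field names are mine, not any ML product’s schema), every learning example would have to pin down all three of those placeholders:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class OperationsExample:
    """One training example: 'when these conditions occurred,
    this action was taken, resulting in this remediation state.'"""
    conditions: Dict[str, float]         # fully defined observed state (telemetry values)
    action: List[str]                    # the specific steps netops took
    remediation_state: Dict[str, float]  # fully defined state after the action

# Hypothetical record; real telemetry would be far richer, which is
# exactly the "how much is enough?" question raised below.
example = OperationsExample(
    conditions={"core-link-3.utilization": 0.97, "core-link-3.packet-loss": 0.02},
    action=["shift traffic class EF to backup path", "raise ticket for capacity review"],
    remediation_state={"core-link-3.utilization": 0.71, "core-link-3.packet-loss": 0.0},
)
```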

What, exactly, would fully define some network condition or state? Is it everything that has been gathered about the network? If so, that’s a daunting task. If not, then what specific information subset is appropriate? The bigger the bite of netops activity you take, the more difficult these questions become, to the point where it may be literally impossible to answer them. In addition, the more conditions there might be, the smaller the chance they’d appear during the learning period.

You can’t control network complexity, but you can control bite size. While, as I said above, “enterprise networks are complicated mixtures of technology, vendors, and missions,” it’s usually true that the elements can be divided into “zones”. There’s a data center network, a branch network, and likely a “switch network” and a “router network”. Maybe a Cisco network and a Juniper network. These zones can be treated separately by AI because netops assessments and actions normally take place within a zone.
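
Here’s a hedged sketch of what that “bite control” might look like in practice: conditions are tagged with the zone they belong to, and each zone carries its own deliberately small rule set. The zone names, condition names, and rule shapes are invented for illustration.

```python
from typing import Callable, Dict, List

# Each zone carries its own, deliberately small, set of condition -> action rules.
ZoneRules = Dict[str, Callable[[dict], List[str]]]

zones: Dict[str, ZoneRules] = {
    "data-center": {
        "leaf-switch-down": lambda c: [f"fail over to redundant leaf {c['switch']}"],
    },
    "branch": {
        "wan-link-degraded": lambda c: ["prefer LTE backup", "notify branch netops"],
    },
}

def assess(zone: str, condition_name: str, context: dict) -> List[str]:
    """Look up a remediation only within the zone the condition belongs to."""
    rules = zones.get(zone, {})
    handler = rules.get(condition_name)
    return handler(context) if handler else ["escalate to human netops"]

print(assess("data-center", "leaf-switch-down", {"switch": "leaf-7"}))
print(assess("branch", "unknown-condition", {}))
```

The payoff is that each zone’s rule set stays small enough that the completeness questions raised above become answerable.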

Both operators and enterprises are starting to like the “bite control” approach if they decide to try fully automated AI responses to network conditions, but more of them favor AI in an advisory role. Operators in particular see the highest value of AI as facilitating diagnosis of problems, the second-highest as recommending solutions, and automatically applying solutions as the lowest. Enterprises are a bit more likely to want fully automated responses, but not significantly more likely to try them in the near term.

The final point here is key, though. There are risks associated with AI, but those risks are really very similar to the risks of operations without AI. Humans make mistakes, and so does AI. Despite some early issues with fully autonomous operations, both network operators and enterprises say they are “fully committed” to the increased use of AI in their networks, and are similarly committed to AI in data center operations. There are some growing pains, but they’re not turning the potential AI users away.

We’re still kicking AI tires, it seems. That’s not a bad thing, because trusting a fully automatic response is a big step when you’re talking about a mission-critical system. To paraphrase that old cartoon, we don’t want our AI to make a mistake that would have taken a thousand netops specialists a hundred years to make.