What are the Options and Issues in AI in Networking?

It looks like our next overhyped concept will be AI.  To the “everything old is new again” crowd, this will be gratifying.  I worked on robotics concepts way back in the late 1970s, and also on a distributed-system speech recognition application that used AI principles.  Wikipedia says the idea was introduced in 1956, and there was at one time a formal approach to AI and everything.  Clearly, as the term gains media traction, we’re going to call anything that has even one automatic response to an event “AI”, so even simple sensor technologies are going to be cast in that direction.  It might be good to look at the space while we can still see at least the boundaries of reality.

In networking, AI positioning seems to be an evolution of “automation” and “analytics”, perhaps an amalgamation of the two concepts.  Automation is a broad term applied to almost anything that doesn’t require human intervention; to do something on a computer that was once done manually is to “automate” it.  Analytics is typically used to mean the application of statistical techniques to draw insight from masses of data.  “Closed-loop” is the term that’s often used to describe systems that employ analytics and automation in combination to respond to conditions without requiring a human mediator between condition and action.  AI is then an evolution of closed-loop technology, enhancing the ability to frame the “correct” response to conditions, meaning events.

There have been many definitions and “tests” for artificial intelligence, but they all seem to converge on the notion that AI systems have the ability to act as a human would, meaning that they can interpret events and learn behaviors.  We seem to be adopting a bit broader meaning today, and thus we could say that in popular usage, AI divides into two classes of things: autonomous or self-organizing systems that can act as a human would based on defined rules, and self-learning systems that can learn rules through observation of behavior.

The dividing line between these two categories is fuzzy.  For example, you could define a self-drive car in a pure autonomous sense, meaning that the logic of the vehicle would have specific rules (“If the closure rate with what is detected by front sensors exceeds value x, apply the brakes until it reaches zero.”) that would drive its operation.  You could, in theory, say that the same system could “remember” situations where it was overridden.  Or you could say that the car, by observing driver behavior, learned the preferred rules.  The first category is autonomous, the second might be called “expanded autonomous” and the final one “self-learning”.  I’ll use those terms in this piece.
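To make the distinction concrete, here’s a minimal sketch of the first two categories.  It’s purely illustrative; the closure-rate rule, the threshold, and the class names are my own assumptions, not a real vehicle design.

```python
# Minimal sketch of the "autonomous" category: a fixed rule drives the action.
# The threshold and names are illustrative assumptions, not a real ADAS design.

CLOSURE_THRESHOLD = 5.0  # closure rate (m/s) above which the rule fires


def autonomous_brake(closure_rate: float) -> str:
    """Pure rule-based behavior: brake whenever the closure rate exceeds x."""
    return "apply_brakes" if closure_rate > CLOSURE_THRESHOLD else "no_action"


class ExpandedAutonomous:
    """Same rule, but the system remembers situations where the driver overrode it."""

    def __init__(self):
        self.override_log = []  # conditions under which the driver intervened

    def decide(self, closure_rate: float, driver_override: bool = False) -> str:
        action = autonomous_brake(closure_rate)
        if driver_override:
            # Remember the condition; a "self-learning" system would go further
            # and derive preferred rules from this log rather than just storing it.
            self.override_log.append(closure_rate)
            action = "defer_to_driver"
        return action
```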

Equipped now, at least if you accept these terms as placeholders, to classify AI behaviors, we can look at what I think is the top issue in AI, the rule of override.  Nobody wants a self-driving car that goes maverick.  To adopt AI in any form you have to provide the system with a manual override, not only in the sense that there might be a “panic button” but in the sense that the information the user needs to make an override decision is available in a timely way.

This rule is at the top because it’s not only the most important but the most difficult.  You can see that in a self-drive car, the rule means simply that the controls of the vehicle remain functional and that the driver isn’t inhibited from either observing conditions around the vehicle or using the override function if it’s necessary.  In a network, the problem is that the objective of network automation is to replace manual activity.  If no humans remain to do the overriding, you clearly can’t apply our rule, but in the real world, network operations center personnel would likely always be available.  The goal of automation, then, would be to cut down on routine activity in the NOC so that only override tasks would be required of them.

That’s where the problem arises.  Aside from the question of whether NOC personnel would be drinking coffee and shooting the bull, unresponsive to the network state, there’s the question of what impact even offering an override would have on automation.  The network sees a failure through analysis of probe data.  It could respond to that failure in milliseconds if it were granted permission, but if an override is to be made practical you’d have to signal the NOC about your intent, provide the information and time needed for the operator to get the picture, and then either take action or let the operator decide on an alternative, which might mean you’d have to suggest some other options.  That could take minutes, and in many cases make the difference between a hiccup in service and an outage.
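One way to bound that delay is an override window: the automation proposes its remediation, notifies the NOC, and acts only if nobody vetoes within a fixed interval.  The sketch below assumes caller-supplied notify and apply functions and a 30-second window; all of those are my assumptions, chosen only to illustrate the pattern.

```python
import threading

# Sketch of an "override window": propose the remediation, notify the NOC, and
# act automatically unless an operator vetoes within a bounded interval.
# The callback interfaces and the 30-second default are assumptions.


class OverrideWindow:
    def __init__(self, veto_window_sec: float = 30.0):
        self.veto_window_sec = veto_window_sec
        self._vetoed = threading.Event()

    def veto(self):
        """Called from the NOC console (or an API) to block the proposed action."""
        self._vetoed.set()

    def execute_with_override(self, event, proposed_action, notify_noc, apply_action):
        # 1. Tell the NOC what was seen and what the automation intends to do.
        notify_noc(event, proposed_action)
        # 2. Give the operator a bounded amount of time to intervene.
        if self._vetoed.wait(timeout=self.veto_window_sec):
            return "deferred_to_operator"
        # 3. No veto arrived in time: take the automated action.
        apply_action(proposed_action)
        return "auto_remediated"
```

The window trades the milliseconds-scale response the automation could deliver for a bounded, rather than open-ended, wait on the human side.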

This problem isn’t unsolvable; router networks today do automatic topology discovery and exercise remedial behavior without human intervention.  However, the more an AI system does, meaning the broader its span of control, the greater the concern that it will do something wrong, very wrong.  To make AI both workable and acceptable, you need to provide even “self-learning” systems with rules.  Sci-Fi fans will perhaps recall Isaac Asimov’s “Three Laws of Robotics” as examples of policy constraints that operate even on highly intelligent AI elements, robots.  In network or IT applications, the purpose of the rules would be to guide behavior to fit within boundaries, and to define where crossing those boundaries had to be authorized from the outside.
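In practice, those boundary rules could be as simple as a set of policy predicates evaluated before any automated action is committed, with violations forcing an escalation rather than a silent block.  The specific rules and action fields below are invented for illustration.

```python
# Sketch of rules as policy constraints on an AI-driven controller: each rule is
# a predicate over the proposed action, and crossing a boundary demands outside
# authorization. The rule contents and action fields are illustrative assumptions.

POLICY_RULES = [
    ("no action may take down more than one core node",
     lambda action: action.get("core_nodes_affected", 0) <= 1),
    ("no action may reroute premium traffic onto best-effort paths",
     lambda action: not action.get("degrades_premium", False)),
]


def check_policy(action: dict):
    """Return (allowed, violated_rules); any violation requires external authorization."""
    violated = [desc for desc, in_bounds in POLICY_RULES if not in_bounds(action)]
    return (len(violated) == 0, violated)
```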

An alternative approach in the AI-to-network-or-IT space would be to let a self-learning system learn by observation and build its own rules.  If something came up for which no rule had been created (the system never observed the condition), it could interpolate behavior from existing rules or predefined policies, and at least alert the NOC that something special had happened that might need manual review.  You could also have any action such a system takes be “scored” by its impact on services overall, with the policy that impacts below a certain level could be “notify-only” and those above it might require explicit pre-authorization.
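A sketch of that scoring-and-threshold idea, with an invented impact scale and threshold value, might look like this:

```python
# Sketch of the "scored by impact" policy: low-impact actions proceed with a
# notification, high-impact ones wait for explicit pre-authorization, and
# conditions with no learned rule get flagged for review. The 0-to-1 impact
# scale and the threshold value are assumptions.

IMPACT_THRESHOLD = 0.2  # assumed boundary between "notify-only" and "pre-authorize"


def disposition(impact_score: float, rule_known: bool) -> str:
    if not rule_known:
        # No learned rule covers this condition: the system could interpolate
        # from neighboring rules, but it should still alert the NOC for review.
        return "alert_noc_for_review"
    if impact_score <= IMPACT_THRESHOLD:
        return "act_and_notify"            # low impact: act, then tell the NOC
    return "hold_for_preauthorization"     # high impact: wait for explicit approval
```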

All of this is going to take time, which is why I think we’ll likely see “AI” in networking applications focusing mostly on my “autonomous” system category.  If we take the whole notion of intent modeling and manifest it in some sort of reality, we have what should be autonomous processes (each intent-modeled element) organized into services, likely through higher layers of the model.  If all of this is somehow constrained by rules and motivated by goals, you end up with an autonomous system.
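Structurally, that could look like intent-modeled elements nested under higher model layers, each with a goal and local rules, and the service defined as the top of the hierarchy.  The field names and goals below are my assumptions, not any particular modeling standard.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Sketch of intent-modeled elements composed into a service through higher
# layers of the model: each element is an autonomous black box with a goal and
# local rules, and a parent aggregates its children. Names are assumptions.


@dataclass
class IntentElement:
    name: str
    goal: str                               # e.g. "keep core latency under target"
    rules: List[Callable[[dict], bool]] = field(default_factory=list)
    children: List["IntentElement"] = field(default_factory=list)

    def goal_met(self, state: dict) -> bool:
        # An element meets its intent if its own rules pass and every child does too.
        return all(rule(state) for rule in self.rules) and \
               all(child.goal_met(state) for child in self.children)


# The "service" is simply the top layer of the model.
service = IntentElement(
    name="vpn-service",
    goal="deliver the contracted VPN SLA",
    children=[
        IntentElement("access-element", "keep access links within SLA"),
        IntentElement("core-element", "keep core latency under target"),
    ],
)
```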

This leads me back to my old AI projects, particularly the robotics one.  In that project, my robot was a series of interdependent function controllers, each responsible for doing something “ordered” from the system level.  You said “move north” and the movement controller set about carrying that out, and if nothing intervened it would just keep things moving.  If something interfered, the “context controller” would report a near-term obstacle to avoid, and the movement controller would get an order to move around it, after which its original order of northward movement would prevail.  This illustrates the autonomous process, but it also demonstrates that when there are lots of layers of activity going on, you need to be able to scale autonomy like you’d scale any load-variable element.
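A toy version of that controller layering, using invented class and method names rather than the original design, shows how a short-term detour preempts a standing order and then gives it back:

```python
# Sketch of the layered-controller pattern: a standing order keeps the movement
# controller busy, the context controller can interpose a short detour, and the
# original order prevails once the detour is done. Names are illustrative only.


class MovementController:
    def __init__(self):
        self.standing_order = None
        self.detour = None

    def order(self, heading: str):
        self.standing_order = heading

    def avoid(self, detour_heading: str, steps: int):
        # Called on behalf of the context controller when it spots an obstacle.
        self.detour = (detour_heading, steps)

    def step(self) -> str:
        # The detour preempts the standing order; then the standing order prevails.
        if self.detour:
            heading, remaining = self.detour
            self.detour = (heading, remaining - 1) if remaining > 1 else None
            return f"move {heading} (avoiding obstacle)"
        return f"move {self.standing_order}"


mc = MovementController()
mc.order("north")
print(mc.step())               # move north
mc.avoid("east", steps=2)      # context controller reports an obstacle ahead
print(mc.step(), mc.step())    # two detour steps to the east
print(mc.step())               # back to moving north
```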

Returning to our network mission for AI, one potential barrier is the model of the service itself.  If events and processes join hands in the model, so to speak, then the model is an event destination that routes each event to the right place or places.  The question becomes whether the model itself can become a congestion point in the processing of events, whether events can pile up.  That’s more likely to happen if the processes that events are ultimately directed to are themselves single-threaded, because a given process would have to complete processing of one event before it could undertake the processing of a new one.
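One way to keep the model from becoming that choke point is to treat it purely as a router that drops events onto per-process queues, and to run more than one worker per process type so a single-threaded handler can’t stall everything behind it.  The routing table and process names in this sketch are assumptions.

```python
import queue
import threading

# Sketch of the congestion concern: the model routes events to process queues;
# if a target process is single-threaded its queue backs up, so running several
# worker instances per process type is one way to keep events from piling up.
# The routing table and process names are illustrative assumptions.

ROUTES = {"link_down": "topology_manager", "cpu_high": "capacity_manager"}
QUEUES = {name: queue.Queue() for name in set(ROUTES.values())}


def model_dispatch(event: dict):
    """The model as an event destination: route each event to the right process queue."""
    QUEUES[ROUTES[event["type"]]].put(event)


def worker(process_name: str):
    """One worker instance; start several per process so one handler isn't a bottleneck."""
    q = QUEUES[process_name]
    while True:
        event = q.get()
        # ... handle the event (placeholder for the actual remediation logic) ...
        q.task_done()


# e.g. two workers for the topology manager so link events don't queue behind one handler
for _ in range(2):
    threading.Thread(target=worker, args=("topology_manager",), daemon=True).start()
```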

This additional dimension of AI systems, which we could call “event density”, is something that’s slipped through the cracks, largely because so far most of the “orchestration” focus has been on NFV-like service-chain deployments.  If you move from two or three chained elements to services with hundreds of elements, add in the business processes that surround the network parts, and then try to automate the entire mess, you have an awful lot of things that generate events that could change the requirements for a lot of other things.  We need to take event density seriously, in short, when we assess automation and orchestration goals that go beyond basic NFV MANO.
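A crude back-of-envelope calculation shows why event density matters; every figure here is an assumption for illustration, not measured data:

```python
# Back-of-envelope sketch of "event density" under a correlated failure; all of
# these figures are assumptions chosen only to illustrate the multiplication.

elements_hit_by_fault = 50  # e.g. everything riding a single failed trunk
fan_out = 5                 # dependent elements whose state each failure event touches

burst = elements_hit_by_fault * fan_out
print(f"{burst} near-simultaneous event evaluations arrive at the model")  # 250
```

A few hundred correlated evaluations is easy to imagine; the point is that the number scales with both the size of the model and the fan-out of dependencies, not with the size of the original fault.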

And maybe we need to take it seriously even within basic NFV MANO.  There’s nothing more frustrating than a system whose limitations stay hidden until you’re really committed to it.  New applications of NFV will be more complex than the old ones, because nobody starts with the most complicated stuff.  We don’t want to find out, a year or so into an NFV commitment, that our solutions have run out of gas.