How useful would artificial intelligence and machine learning (AI/ML) really be in network lifecycle automation? The topic gets a lot of attention, and a lot of vendors have made claims about it, but the real benefits are difficult to assess. Part of the reason is that there are many ways AI/ML could be applied, and part is that we intuitively tend to think of AI as “HAL” in “2001: A Space Odyssey”.
If we cut through the jargon, the goal of AI/ML in network operations is to provide some form of root cause analysis, some kind of response optimization, or both. Lifecycle automation means detecting conditions, performing analysis to establish a most-likely cause, and then triggering remediation based on that cause. We need to break these three points down to assess what might be practical in the application of AI/ML to networks.
Detecting conditions is a function of monitoring, which in the real world is all about receiving events that signal conditions or condition changes. However, events don’t necessarily signal a need for action. Some events may in fact signal that previous actions are working, and others may signal the initiation of self-remediating processes down in the network. The key point is that in order to monitor effectively, you have to know what generated the event and how to read it. There are plenty of monitoring tools available, and we have long experience in monitoring, so there’s not much we need to worry about here as far as gathering information is concerned.
But is gathering events the same as “detecting conditions?” The first complicating factor in AI/ML is that detection of conditions and root-cause analysis are fuzzy areas, and you could rightfully assign some tasks to either of those overall steps. We therefore have to identify those fuzzily-positioned tasks and see where that takes us.
The first task on our list is that of event associations. A network is a system of complex, interconnected functionality. An event is a report of conditions in a particular place, because probes that report conditions have to be placed somewhere, and they read conditions where they’re placed. But what is that place, in the overall context of the network? What, for example, is the association between information gathered at Point A and Point B? Unless you know the network-topological relationship between those points, you can’t properly associate the results. Suppose the points are two ends of a trunk. Packet count discrepancies would indicate data loss, but having either count in isolation might tell you nothing. You need to be able to define associations rooted in topology, which then frame a kind of super-event from the relationship between sets of events.
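The trunk example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CounterEvent` type, the `TRUNKS` topology table, the probe names, and the 1% tolerance are all hypothetical, not taken from any particular monitoring tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CounterEvent:
    probe: str     # hypothetical probe identifier, e.g. "A"
    packets: int   # packets counted during one sampling interval

# Hypothetical topology: each trunk is named by the probes at its two ends
# (sending end first, receiving end second).
TRUNKS = {"trunk-1": ("A", "B")}

def trunk_loss_events(events, trunks=TRUNKS, tolerance=0.01):
    """Pair counter events by trunk and emit a super-event when loss exceeds tolerance."""
    by_probe = {e.probe: e for e in events}
    super_events = []
    for trunk, (tx_probe, rx_probe) in trunks.items():
        tx, rx = by_probe.get(tx_probe), by_probe.get(rx_probe)
        if tx is None or rx is None or tx.packets == 0:
            continue  # one count in isolation tells us nothing
        loss = (tx.packets - rx.packets) / tx.packets
        if loss > tolerance:
            super_events.append({"trunk": trunk, "kind": "packet_loss", "loss": loss})
    return super_events
```

Neither counter is meaningful alone. The super-event only exists because the topology table says Points A and B are two ends of the same trunk.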
Then there’s actionability. Not all events indicate a need for action to be taken. Some simply report conditions, and we may want to log them and even analyze them in a later step as a part of a machine learning process, but we don’t want to kick off an operations process based on the event. However, actionability is sometimes related to expected versus experienced conditions, where an event is actionable not because of what it is atomically, but what it is relative to what was expected. That gets us into context.
The third task on our list is state/event relationships. Many (perhaps most) events need context to be analyzed properly. If a trunk reports zero traffic, is that an error? It depends on whether there’s supposed to be traffic, and there might be no such expectation if we just took the trunk out of service or were diddling with the devices on either end. We don’t want to trigger remediation based on an event that’s been created by our own reaction to a prior event or set of events.
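A minimal sketch of that idea, assuming a hypothetical state/event table for a single trunk (the state names, event names, and actions here are invented for illustration):

```python
# Hypothetical state/event table for one trunk: the same event maps to a
# different response depending on the state we believe the trunk is in.
TRUNK_TABLE = {
    ("in_service", "zero_traffic"): "remediate",
    ("maintenance", "zero_traffic"): "log_only",
    ("in_service", "traffic_normal"): "log_only",
}

def classify(state, event, table=TRUNK_TABLE):
    """Return the response for an event in a given state; unknown pairs are only logged."""
    return table.get((state, event), "log_only")
```

With this table, `classify("in_service", "zero_traffic")` asks for remediation, while the same event during maintenance is simply logged, so our own maintenance action doesn’t trigger a second round of remediation.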
Our final task is event correlation and root cause analysis. This, in most cases, is going to involve the integration of all the prior task results, conducted with reference to the topology of the service or network in question. If an automated (or human-driven) operations process responds to events discretely, it’s likely to step all over itself by treating symptoms rather than the problem. A series of events that are related and contextualized should be examined to determine whether they stem from a specific condition. If they do, and if that deduction can be validated, then the right action is one that addresses that common, specific condition.
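One simple way to picture topology-based correlation, offered as a sketch rather than a real root-cause engine: given a hypothetical dependency map, treat an alarmed element as a probable root cause only if nothing it depends on is also alarmed. The `DEPENDS_ON` map and element names are assumptions made up for the example.

```python
# Hypothetical dependency map: each element lists what it directly depends on.
DEPENDS_ON = {
    "vpn-service": ["router-1", "router-2"],
    "router-1": ["trunk-1"],
    "router-2": ["trunk-1"],
    "trunk-1": [],
}

def upstream(element, depends_on):
    """All elements that `element` depends on, directly or transitively."""
    found, stack = set(), list(depends_on.get(element, []))
    while stack:
        dep = stack.pop()
        if dep not in found:
            found.add(dep)
            stack.extend(depends_on.get(dep, []))
    return found

def probable_root_causes(alarmed, depends_on=DEPENDS_ON):
    """Alarmed elements with no alarmed dependency; the rest are treated as symptoms."""
    alarmed = set(alarmed)
    return sorted(e for e in alarmed if not (upstream(e, depends_on) & alarmed))
```

If the service, both routers, and the trunk all raise alarms, this returns only `["trunk-1"]`, which is the condition to act on. The deduction still needs validating, as noted above, because alarms can be coincidental.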
As I noted at the start of this blog, there are a lot of ways that we could think about applying AI/ML to network operations, but the ways can be broken into two groups. The first group is the application of AI/ML directly to events, with contextualization and causal analysis done by successively analyzing the event stream through the tasks noted above. The second is the creation of a contextual map that itself guides the entire event collection and analysis process.
It’s probably obvious to most of you that the first approach is going to have scalability challenges. Everyone who ever spent any time in a network operations center (NOC) has seen the result of an event storm, where a single problem generates thousands of alerts that quickly swamp both the operator and the entire process of gathering and displaying events. The bigger the network, the bigger the storm can be. I don’t think this is the right way to go, AI/ML or not.
But what is a “contextual map” and how could we get one? A contextual map is a model that divides a network, or a set of service resources, into functional units, each of which can be depicted as a “black box” with external interfaces and behaviors. When an event is generated in this situation, instead of going to some master event-handler, it goes instead to a contextual handler defined within that black box. The underlying principle is that a network/service’s state is the combined state of the unitary functions that make it up. We deal with the functions individually, and collectivize the results.
Contextual maps are a dazzling insight, and they were postulated first by John Reilly in the TMF NGOSS Contract work. Events are steered to processes via the state/event table of an element in the service contract data model. The TMF service data model, the “SID”, wasn’t ideally structured for the mission, something John and I chatted about a number of times. TOSCA, the OASIS standard “Topology and Orchestration Specification for Cloud Applications”, appears to me to have the capability to define contextual maps that could then be used to guide the way that events are analyzed. I’m working through my own TOSCA guide (as time permits!) to validate the view and explain how it would work.
But what does this have to do with AI/ML? I believe that contextual mapping is an essential step in applying AI/ML to network operations, because it reduces the scope of a given analysis and permits hierarchical assessment of conditions. If Box 1 has an event, AI/ML can analyze within that scope, and if the analysis indicates the scope of the problem has to expand, then Box 1’s determination becomes an event to the next box up in the hierarchy. By this means, I can handle a wide range of events with independent AI/ML instances, because each is handling the interior of a black box. Wider scope just means kicking off an analysis at a higher-level box.
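Here is a rough sketch of that escalation pattern. The `BlackBox` class, the box names, and the two analyzer functions are hypothetical. The analyzers are trivial placeholders standing in for whatever AI/ML instance would actually sit inside each box.

```python
class BlackBox:
    """One functional unit of a contextual map, with an optional parent box."""

    def __init__(self, name, analyze, parent=None):
        self.name = name
        self.analyze = analyze   # callable(event) -> (handled: bool, summary: str)
        self.parent = parent

    def handle(self, event):
        """Analyze locally; if the problem exceeds this scope, pass a summary event up."""
        trail = []
        box = self
        while box is not None:
            handled, summary = box.analyze(event)
            trail.append((box.name, summary))
            if handled:
                break
            # This box's determination becomes the event seen by the next box up.
            event = {"source": box.name, "detail": summary}
            box = box.parent
        return trail

def access_analyzer(event):
    # Placeholder for a per-box AI/ML instance: only port flaps are local problems.
    if event.get("detail") == "port_flap":
        return True, "reset port"
    return False, "fault exceeds access scope"

def metro_analyzer(event):
    # Placeholder for the higher-level box's analysis.
    return True, "reroute around " + event["source"]

metro = BlackBox("metro", metro_analyzer)
access = BlackBox("access-1", access_analyzer, parent=metro)
```

Calling `access.handle({"source": "probe-7", "detail": "port_flap"})` stays inside the access box, while a `loss_of_signal` event is summarized and handed to the metro box. Each analyzer only ever sees events scoped to its own box.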
What, though, could AI/ML do within a given black box? If we have state/event tables, the boxes already contain (or, more accurately, their data model contains) the necessary indication-to-action representations. The obvious answer is that AI/ML could replace the state/event tables, which would mean that people would no longer have to sit down and figure out those event-to-process associations and the states related to handling them. Machine learning, AI, or both combined could be used to create the event-to-process mapping needed, which could make event handling a lot more agile and effective while preserving the value of the service data model as a “contextual map” for the network or service.
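To make the idea concrete, here is a deliberately crude sketch in which the table is derived from past incidents instead of being written by hand. Majority voting is only a stand-in for whatever learning method would really be used, and the `history` record format (state, event, action taken) is an assumption.

```python
from collections import Counter, defaultdict

def learn_table(history):
    """Build a (state, event) -> action table from past incidents by majority vote."""
    votes = defaultdict(Counter)
    for state, event, action in history:
        votes[(state, event)][action] += 1
    # Keep the most frequently chosen action for each (state, event) pair.
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

# Hypothetical incident records for one black box.
history = [
    ("in_service", "zero_traffic", "reroute"),
    ("in_service", "zero_traffic", "reroute"),
    ("in_service", "zero_traffic", "restart"),
    ("maintenance", "zero_traffic", "log_only"),
]
table = learn_table(history)
```

The result has the same shape as a hand-built state/event table, so the surrounding data model doesn’t have to change when the mapping is learned rather than authored.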
This still leaves us with a potentially open question, which is the last step of our process: identifying the appropriate/optimum action to take. People have talked about how ML might “learn” the best way to address a given fault (properly contextualized, of course!), but the problem with ML is that it has to learn from experience. While it’s learning, it either has to be disconnected from the action-generating processes or be expected to fail some number of times. It might take a long time for ML to get the experience it needs.
The good news here is that contextual mapping will reduce the learning period by containing the number of condition/action correlations that have to be learned. The notion of dividing complex tasks into a series of simpler ones is a common human response to being overwhelmed by the scope of a problem, so it’s a good way to manage AI/ML too. But even here, we may need some additional help.
I’ve noted in a couple of past blogs that we’re overlooking a powerful ally in network lifecycle automation—simulation. If a set of conditions, created by the correlation of events within a context, can generate a recommended action or set of actions, simulations of the inside of that contextual black box could be run to establish the likely result of each action identified. The result, particularly if it’s generated with a specific confidence level, could then either automatically trigger the optimum action, or present it for human assessment.
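A sketch of how that decision step might look, assuming a simulator that returns a confidence between 0 and 1 for each candidate action. The `choose_action` function, the 0.9 threshold, and the toy confidence values are all hypothetical; a dictionary lookup stands in for a real simulation run.

```python
def choose_action(candidates, simulate, auto_threshold=0.9):
    """Simulate each candidate action and return (action, confidence, mode)."""
    scored = [(simulate(action), action) for action in candidates]
    confidence, best = max(scored, key=lambda pair: pair[0])
    # High-confidence results trigger automatically; the rest go to a person.
    mode = "auto" if confidence >= auto_threshold else "human_review"
    return best, confidence, mode

# Toy stand-in for a simulation of the black box's interior: made-up
# confidence that each action restores service.
TOY_RESULTS = {"reroute": 0.95, "restart": 0.60}

decision = choose_action(["reroute", "restart"], TOY_RESULTS.get)
```

Here `decision` selects the reroute for automatic execution. Raising `auto_threshold` above 0.95 would send the same recommendation to a human for assessment.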
One obvious application of simulation would be creating baseline “states” that ML could be taught. This is good, this is impaired, and this is pretty bad. It might take a long time for all these conditions to be visible in a live network, available for ML to examine and learn from. Giving it a head start with simulation could speed the process considerably.
Simulation could also model the way a given action might impact the network/service. It could also, in theory, model the progression of faults. Since simulation requires a model on which to base its recommendations, it could be said to help enforce the notion of contextual mapping. That alone could be valuable, and it would make sense to leverage a simulation map or model to provide additional insight. Think of it as a kind of machine learning, with a similar application to operations processes overall.
The conclusion here is that while AI/ML could be useful, the way they’re used has to be firmly anchored in the topology or context of the network/service, or the application of AI/ML is likely to create scalability and relevance issues. You can’t accomplish anything by gluing the term onto a product. You have to integrate it with the network and service, and to do that, you have to be able to represent the structure to define the constraints of your AI/ML.
I think AI and ML could be very beneficial in service automation and other operations tasks. However, we still get stuck in “HAL” mode, expecting real human (or superhuman) intelligence. We’re not there yet, but there are still things that we could do to enhance the way AI/ML is applied to operations tasks. The same, of course, would be true for other tasks. It would be helpful to focus on these things instead of just AI/ML-washing everything, don’t you think?