How Do We Get To, and Optimize For, AIOps?

What’s the right model for AI-based operations, or AIOps? There are obviously a lot of different “operations” AI could be used in, so does the optimum model depend on the specific mission? Or, just maybe, is there at least a set of guidelines that would cut across all the missions? That would let us not only frame out a general toolkit for AIOps, but also hope for a toolkit that could carry across operations areas. Could we have One AI to Rule Them All? Maybe.

One thing I believe to be absolutely true is that the “Rule Them All” piece of that paraphrase is critical with AIOps. We are not, at least not in the near future, going to have a single AI element that can fully absorb the operation of a complex network and take it over in all kinds of conditions. AIOps is a hierarchy, by necessity. If you think of how humans work, that shouldn’t be surprising. When you’re walking along a corridor talking with a team member, you’re not plotting out each step and how to maintain your balance as you move your arms and change your center of gravity. You have an autonomous system that does that, one you can override but that’s happy to do what it’s supposed to do, on autopilot. So it must be with AIOps.

To me, then, the whole AIOps thing is about intent models, meaning black boxes. A cohesive collection of devices and software in some mix can be visualized as a single black box. Its properties are those it exposes, and its contents are up to it to manage. The goal of that management is to fulfill the SLA that’s at least implicit, if not explicit, in those exposed properties. You’re working (or, in my earlier example, walking) if you’re meeting the external conditions and properties that define working/walking.
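
To make the black-box idea concrete, here’s a minimal sketch in Python; the names (IntentModel, Sla, exposed_properties) are purely illustrative, not drawn from any real AIOps product or standard. The point is that the outside world sees only the exposed properties and the pass/fail verdict against the SLA, never the contents:

```python
# A minimal sketch of an intent-modeled "black box"; all names here
# are illustrative, not from any real AIOps product or standard.
from dataclasses import dataclass, field


@dataclass
class Sla:
    """The guarantee the black box makes at its exposed interfaces."""
    max_latency_ms: float
    min_availability: float  # e.g., 0.9999


@dataclass
class IntentModel:
    """Exposes properties and an SLA; hides how it meets them."""
    name: str
    sla: Sla
    exposed_properties: dict = field(default_factory=dict)

    def meets_sla(self) -> bool:
        # The internal management logic is the box's own business;
        # outsiders see only pass/fail against the guarantee.
        latency = self.exposed_properties.get("latency_ms", float("inf"))
        availability = self.exposed_properties.get("availability", 0.0)
        return (latency <= self.sla.max_latency_ms
                and availability >= self.sla.min_availability)
```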

So does each of these black boxes have its own AIOps? Not necessarily. That will depend on the value of AI at that point, which depends on the nature of the contents of the black box, the complexity of managing it to its SLA, and the available tools to do the managing. It also depends on whether a given black box has a close relationship with others, and a bunch of what I could describe as “cooperative factors” that we’ll get into.

We’ll get into them by supposing that the manager of our black box, having expended its best efforts, cannot meet its SLA. I broke, and I’m sorry. Well, sorry probably won’t cut it, so the question is how the broke-ness will be dealt with. The answer is “whoever has an interface with this black box.” An SLA is a guarantee, and obviously it’s a guarantee to the element that’s connected at the specific interface(s) the SLA is provided for. Our black box has a cooperating element or elements, and we can presume that those connected elements have SLAs and cooperating companions of their own. The question is whether there’s some clear circle of cooperation here, a group of black boxes that together form, perhaps, an outer-level black box.

Networks, and combinations of network-and-hosting, are often systems of systems, and so the idea of a black-box hierarchy here isn’t outlandish. But it’s hard not to see an entire network, or a big conglomerate of data center and network, as the top element of every hierarchy. In other words, does “cooperation” extend as far as connection does? Not necessarily.

I’ve done a lot of work in describing service models aimed at automating operations. What I determined is that there’s a difference between connection and cooperation in the way two black boxes interact. In the former case, a black box might have to be informed about an SLA fault in order to “know” that it might not be able to meet its own SLAs. In the latter case, the black box might be informed in order to mediate or participate in a remedy.
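
Here’s a sketch of that distinction, again with hypothetical names; the reassess_sla and participate_in_remedy handlers are assumptions about what a black box would expose, not any established API:

```python
# A sketch of "connection" versus "cooperation" fault notification;
# the relationship names and handler methods are hypothetical.
from enum import Enum, auto


class FaultRelationship(Enum):
    CONNECTED = auto()    # informed so it can reassess its own SLA
    COOPERATING = auto()  # informed so it can mediate or join a remedy


def notify_neighbors(fault_source, neighbors):
    """Propagate an SLA fault according to each neighbor's relationship."""
    for neighbor, relationship in neighbors:
        if relationship is FaultRelationship.CONNECTED:
            # Connection: the neighbor only learns its own SLA is at risk.
            neighbor.reassess_sla(fault_source)
        else:
            # Cooperation: the neighbor is asked to help with the remedy.
            neighbor.participate_in_remedy(fault_source)
```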

Let’s suppose our black box is an element in a service, and that its interface to a core network service fails. If it has a second interface to the same core network, and if that interface can be used to satisfy its SLA “upstream”, then the black box could remedy its own failure. If it doesn’t have another interface, then it would need to report a failure “upstream”. The question is then whether the upstream box has a connection to the core that doesn’t go back through the same failed box. If it does, then it has a remedy option, and we might well consider that “parent” box, along with whatever “child” boxes have parallel interfaces to the core, to be a cooperative system, and provide operations automation, meaning AIOps, for that system of boxes. In that case, we might see a complex hierarchy of black boxes.
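
The decision that paragraph walks through might look like this in a sketch; the interface and parent objects, and all of their methods, are assumed conventions for illustration, not any real management API:

```python
# A sketch of the remediation choice: self-remedy over a parallel
# interface if one can carry the SLA, otherwise escalate upstream.
# The interface/parent objects and their methods are hypothetical.


def handle_interface_failure(box, failed_interface):
    """Try a parallel path first; escalate only when none can serve."""
    alternates = [i for i in box.interfaces
                  if i is not failed_interface
                  and i.reaches(failed_interface.target)]
    for alternate in alternates:
        if alternate.can_satisfy(box.sla):
            box.reroute(failed_interface, alternate)
            return "remedied-locally"
    # No parallel interface can carry the SLA, so report the fault
    # upstream and let the parent look for a path that avoids this box.
    box.parent.report_fault(box, failed_interface)
    return "escalated"
```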

It’s also possible that there might be a black box system that’s created by management alone. An SLA violation might be reported not to the data-plane interface (or not just to it) but to a management system, one that’s an AIOps entity. In that case, the span of control of that management system would determine the boundary of the cooperative system, because it’s the thing that’s inducing the cooperation.
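
In sketch form, that boundary rule is almost trivial; the manager attribute is an illustrative convention, not a standard:

```python
# A sketch of management-induced cooperation: the cooperative system
# is simply the span of control of one management (AIOps) entity.


def cooperative_system(aiops_manager, boxes):
    """The boxes this manager controls form one cooperative system,
    whether or not their data-plane interfaces touch."""
    return [box for box in boxes if box.manager is aiops_manager]
```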

The purpose of this structure is to string AIOps elements through the mesh of things that forms the infrastructure being managed, an infrastructure almost surely bound together with a network. These elements would be charged with ensuring that the subordinate black boxes under their control meet their overall SLA. And, of course, there would be an AIOps element sitting at the top of the hierarchy. However, rather than being expected to understand all the pieces of that mesh of things, it only has to understand the properties of the black boxes just below it in the hierarchy. Those boxes need to understand what’s below them, and so forth until we reach the bottom.
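
Here’s a sketch of that one-level-at-a-time visibility rule; the AiopsElement class and its local SLA check are stand-ins for whatever a real element would do:

```python
# A sketch of the AIOps hierarchy: each element checks only the exposed
# SLA state of its direct children, never reaching inside them.


class AiopsElement:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def meets_sla(self) -> bool:
        # A real element would evaluate local telemetry here; this
        # stub stands in for that check.
        return True

    def assure(self) -> bool:
        """SLA assurance one level at a time, top to bottom."""
        # Only the children's exposed state is visible at this level;
        # each child applies the same rule to its own children.
        return self.meets_sla() and all(c.assure() for c in self.children)


# Usage: a two-level hierarchy. The top element never sees the leaves'
# internals, only whether each child reports itself as meeting its SLA.
leaves = [AiopsElement("access-east"), AiopsElement("access-west")]
core = AiopsElement("core", children=leaves)
top = AiopsElement("service", children=[core])
print(top.assure())  # True while every box meets its own SLA
```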

Effective AIOps depends on intent modeling, and intent modeling isn’t just defining black boxes that represent modeled elements; it’s also defining the way these black boxes are organized into a cooperative hierarchy. The “model” is as important as the “intent”. This hierarchy reduces the burden on each AIOps element, and it also permits the elements to be specialized to the specific infrastructure components they’re expected to control. As you rise in the hierarchy, the elements could become more generalized, since their mission would usually be one of finding alternatives to a primary element that had failed.

The approach I’ve described also provides for the use of AIOps in testing and rehabilitation of assets, and for broader-level diagnosis of the health of the overall system. A specialized tool could be “connected” to a separate management channel to do this, and that channel could (like the data interfaces) represent either a pathway to specific real assets or to an AIOps element there, prepared to act on behalf of the generalized higher-level tool. Testing and rehab, then, would be managed through the same sort of AIOps hierarchy as overall operations, though not necessarily one with exactly the same structure.
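
A sketch of that management channel follows; the delegation test is an assumption about how the far end might advertise a local AIOps agent, and every name here is illustrative:

```python
# A sketch of the test-and-rehab channel: a generalized tool sends a
# probe that either reaches a real asset directly or is answered by an
# AIOps element acting on its behalf.


def run_diagnostic(channel, probe):
    """Issue a probe over a management channel."""
    endpoint = channel.endpoint
    if hasattr(endpoint, "delegate_probe"):
        # An AIOps element fronts this part of the hierarchy: it runs
        # the probe against the assets it manages and summarizes.
        return endpoint.delegate_probe(probe)
    # Otherwise the channel terminates on the asset itself.
    return endpoint.execute(probe)
```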

We may be heading toward the notion of having AI deployed in a network or compute structure that’s hierarchical, but that doesn’t mean we have a hierarchy of intent models, or even that we have intent models at all. There’s not enough formalism with respect to the relationship between clusters of technology elements, their management tools, and the management of the entire infrastructure…at least not yet. Some vendors are making progress toward a holistic management vision, and such a vision is just as critical to AIOps as the AI itself.