Monitoring, Observability, and What Should Lie Beyond – Welcome to CIMI Corporation's Public Blog

We hear a lot these days about “observability”, and many in the network and IT space think this is just another cynical “create-another-category” move by some analyst firms. After all, the distinction between the new concept of observability and good old-fashioned monitoring seems pretty subtle and subjective. In fact, the most common remark I get from enterprise technologists on the topic of observability is “I thought all that was in monitoring”.

Well, guess what? We could make some distinctions between monitoring and observability, and in fact I’ll do that below. That’s not the most important thing, however. I propose that we really need to add another dimension to the whole debate, which I’ll call actionability. Knowing the what or the why is nice, but having some path forward is nicer, and that’s particularly true when we start talking about artificial intelligence or machine learning in operations management or planning.

I think a good definition of “observability” is that it’s a process that’s aimed at making information visible, where monitoring is getting the information. The real presumption behind that is that monitoring gets you what’s available and observability gets you what you need, but that’s where my problem comes in. How do you know what you need?

If what distinguishes observability is proactively getting what you need, then we may be missing the point with monitoring. Why gather and analyze information you don’t need, after all? I think we should be asking a broader question, which is how to determine the overall health of a cooperative system like a network, hosting infrastructure, or an application. We also need to think about what information we’d need to frame the remedial processes we’d have to launch if we decided things were unhealthy. That’s what I mean by actionability.

I’ve worked on many cooperative systems in my career, ranging from healthcare and funds transfer networks to international data networks, time sharing computing systems, and IoT. One of the consistent challenges in lifecycle management for these systems has been what I came to recognize as the actionability dimension. You can’t do lifecycle management if you don’t know something is wrong, you can’t do it if you lack the ability to perform remedial action, and you can’t do it if you can’t assess the progress of whatever action you take.

The solution to my past actionability challenges (even though I didn’t call them that at the time!) turned out to be state/event thinking. In my biggest funds transfer and credit network application, the problem was that the “operating state” of the devices on the network was more complicated than the state of the connection. Obviously, even a simple credit terminal has to deal with the possibility that a connection problem would leave the application and the retail activity generating the transaction in an out-of-sync state. With ATMs, it gets more complicated, because a combination of conditions could indicate an attempt to breach the device’s security.

What’s necessary in funds transfer applications is a broad model of device state, and a recognition that network connectivity state has to be considered within this broader model. If you think about that, you can start to deduce what you’d need to know to define the current state of a device, and also what you’d have to know to manage a transition back to its normal state. Today, we’d probably think of this situation as a series of hierarchical intent models. The highest-level model, the one representing the device, would have perhaps three sub-models. The first would be the physical-device state (is it powered, where are the switches and doors and so forth), the second would be the transaction state (where are we in withdrawal or deposit or inquiry), and the third would be the connection state. Each of these three would contribute to the device state, meaning that some conditions generated or recognized by the lower models would have to be signaled to the top layer. Looking at things this way lets us decide better just what we need to know and do, which is the actionability dimension.
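Here’s a minimal Python sketch of that hierarchy, offered as an illustration rather than a record of how the original funds transfer system was built. The class names, the three-level `Health` scale, and the rule that the device is only as healthy as its worst sub-model are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Health(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"
    FAULT = "fault"


# Assumed ordering, least to most severe.
SEVERITY = [Health.NORMAL, Health.DEGRADED, Health.FAULT]


@dataclass
class SubModel:
    """A lower-level intent model: physical, transaction, or connection."""
    name: str
    health: Health = Health.NORMAL
    parent: DeviceModel | None = field(default=None, repr=False)

    def report(self, health: Health) -> None:
        # Record the condition locally, then signal it to the top layer.
        self.health = health
        if self.parent is not None:
            self.parent.on_child_event(self)


@dataclass
class DeviceModel:
    """Top-level intent model representing the whole device."""
    children: dict[str, SubModel] = field(default_factory=dict)
    health: Health = Health.NORMAL
    events: list[str] = field(default_factory=list)

    def add(self, name: str) -> SubModel:
        child = SubModel(name, parent=self)
        self.children[name] = child
        return child

    def on_child_event(self, child: SubModel) -> None:
        # Keep a record of what was signaled, then recompute device state.
        self.events.append(f"{child.name}:{child.health.value}")
        # Assumed rule: the device takes the worst health of any sub-model.
        self.health = max(
            (c.health for c in self.children.values()), key=SEVERITY.index
        )


atm = DeviceModel()
physical = atm.add("physical")
transaction = atm.add("transaction")
connection = atm.add("connection")

connection.report(Health.DEGRADED)   # device becomes DEGRADED
physical.report(Health.FAULT)        # device becomes FAULT
physical.report(Health.NORMAL)       # device falls back to DEGRADED
```

A real device model would need richer rules than "worst wins", since a combination of sub-model conditions (a door open while a transaction is in progress, say) could mean something that no single sub-model can see on its own.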

I can’t tell you how many problems I’ve found in distributed systems or applications simply by trying to lay out the state/event processes. In some cases, I found that there was a state of a device/element that, once entered, couldn’t be reliably exited at all. In other cases, I found that there was a condition that could occur, but that didn’t result in a reliable event to signal it.
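Once the state/event table is written down, the first kind of problem can be checked mechanically. The sketch below assumes a hypothetical table for a simple terminal (the states and events are invented for illustration) and looks for states that have no path back to the normal state. It also lists state/event pairs the table doesn’t cover, which is a related gap but not the same thing as a condition that never raises an event at all.

```python
from collections import deque

# Hypothetical state/event table: (state, event) -> next state.
TRANSITIONS = {
    ("idle", "card_inserted"): "in_transaction",
    ("in_transaction", "completed"): "idle",
    ("in_transaction", "link_down"): "out_of_sync",
    ("out_of_sync", "link_up"): "idle",
    ("idle", "door_opened"): "tamper_lockout",
}


def states_that_cannot_reach(transitions, target):
    """Return the states from which no sequence of events leads to target."""
    states = set()
    reverse = {}  # next state -> set of states that can move into it
    for (src, _event), dst in transitions.items():
        states.update((src, dst))
        reverse.setdefault(dst, set()).add(src)

    # Walk the table backwards from the target state.
    reachable = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for prev in reverse.get(node, ()):
            if prev not in reachable:
                reachable.add(prev)
                queue.append(prev)
    return states - reachable


def unhandled_pairs(transitions, events):
    """Return (state, event) pairs for which the table defines nothing."""
    states = {s for (s, _e) in transitions} | set(transitions.values())
    return sorted(
        (s, e) for s in states for e in events if (s, e) not in transitions
    )


stuck = states_that_cannot_reach(TRANSITIONS, "idle")
gaps = unhandled_pairs(TRANSITIONS, ["link_down", "link_up"])
```

In this invented table, `stuck` contains only `tamper_lockout`, because nothing defines how the device gets out of that state. Whether that’s a bug or a deliberate "needs a technician" state is exactly the kind of question the exercise forces you to answer.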

This might sound like a commercial for observability, for introducing signaling points in applications to make a specific condition “observable”, but it’s more than that. As I said in my opening, the question of how you know what you need to observe comes into play. My original ExperiaSphere project logged what I knew were significant conditions, but that logging quickly became a major QoE bottleneck, so I had to make log-event creation selective, identifying classes of event and letting each class be switched on or off. I discovered that some of my log classes never got switched on at all, which means I was observing something that never mattered.
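A selective logger of that general kind might look like the following Python sketch. It isn’t the ExperiaSphere code; the class name, the event-class names, and the idea of tracking which classes were ever enabled are assumptions made to show the principle.

```python
class SelectiveLog:
    """Log events by class, with each class switched on or off on its own."""

    def __init__(self, classes):
        # Every event class starts off, so nothing is logged by default.
        self.enabled = {name: False for name in classes}
        self.ever_enabled = set()
        self.records = []

    def set_class(self, name, on):
        if name not in self.enabled:
            raise KeyError(f"unknown event class: {name}")
        self.enabled[name] = on
        if on:
            self.ever_enabled.add(name)

    def log(self, name, message):
        # Create the log record only if this class is currently on.
        if self.enabled[name]:
            self.records.append((name, message))
            return True
        return False

    def never_enabled(self):
        # Classes nobody ever turned on: candidates for removal.
        return sorted(set(self.enabled) - self.ever_enabled)


log = SelectiveLog(["state_change", "timeout", "retry"])
log.set_class("timeout", True)
log.log("timeout", "no response from element within limit")   # recorded
log.log("retry", "second attempt started")                    # dropped
unused = log.never_enabled()
```

The `never_enabled` report is the useful part. A class that stays off through a full cycle of real operations is a sign that the condition it watches was never needed for any decision.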

Another problem showed up with multi-tenant systems. The shared underlying infrastructure used by all tenants was necessarily something whose state impacted them all. As a result, it was common for the applications or services to check the state of infrastructure elements, and as the number of tenants multiplied, the overhead associated with this polling became a performance factor. It was necessary to have the status management of the infrastructure abstracted by a single system, which could then poll at regular intervals and expose its own APIs to tenants to prevent the infrastructure itself from being overloaded with status requests.

These points illustrate the risk of both monitoring and observability, which is that the processes succeed only if some planning in advance yields reasonable points from which to gather information, and if remedial steps are identified not just vaguely, but in the form of state transitions. All of this means (in today’s terms) an intent-model basis for structuring infrastructure, services, and applications. From that, it’s possible to ensure that the right information is available and that remedies can be applied without failures caused by neglecting the steps involved and what drives the system from step to step.

I have to wonder whether we’ve advanced both networking and applications to create complex distributed systems while we’re still thinking in terms of monoliths. We think of linear processes that, before and after doing something, insert a check of status, instead of thinking of an event-driven system. I see that thinking in ONAP, for example. It’s a comforting way of viewing these complex systems because it’s based on how humans inherently want to sequence and organize things, but the world is neither sequenced nor organized. Events represent a record of chaos, in effect, and if we want to control it we have to do more than just throw information at it. We have to organize our lifecycle tasks, and draw what’s needed into our processes. There’s no other way.