If a tree falls in the woods… We’ve all heard that old saying, and many of us remember debating it in our youth. Now, it seems, the story is skirting the edge of becoming a paradigm, and we even have a name for it: observability. Since I blogged yesterday on the risks of tech-hype, is this another example of it? Not necessarily, but the truths of observability could get tangled with the tales it’s generating, to the detriment of the whole concept.
If there’s a difference between information and knowledge, the difference lies in a combination of context and interpretation. In things like network and IT monitoring, the goal is to take action based on information gathered, and that depends on somehow converting that information into knowledge. I submit that the difference between “monitoring” and “observability” lies in that conversion.
Protocol just did a piece on observability, and while I think it’s interesting, I don’t think it grasped this essential truth. The piece connects monitoring at the operations level with interpreting information through the introduction of context at the application level, but it seems to focus on software changes as the target of the shift. It’s way broader than that.
Information is the sum of the statistical outputs of network and IT operations. You can count packets and accesses, measure response times, and do a bunch of other things, but it’s usually difficult to make a useful decision on the basis of this kind of data. As I noted above, there’s a need to inject context and then interpret the result. What I think provides the missing link, in the sense of creating an implementation anchor for the abstract term “observability”, is workflow.
Networks and IT infrastructure serve a set of missions, the missions that justify their deployment and operation. These missions can be related to specific information flows, usually between workers and applications but sometimes between application components, between “things” in an IoT sense, or between those things and applications. All these information flows are, in IT terms, workflows, and each of them has an implicit or explicit quality-of-experience (QoE) relationship that describes the sensitivity of the mission(s) to variations in the workflows. Usually, QoE is related to quantifiable properties like packet loss, latency, and availability. I submit that observability is really about knowing the QoE associated with a given workflow, meaning knowing those quantifiable properties.
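To make that concrete, here’s a minimal sketch of what a workflow-level QoE definition might look like. It’s illustrative only; the class and field names (Workflow, QoeTargets, QoeSample) are my own invention, not a reference to any particular tool.

```python
# A minimal sketch of the workflow/QoE relationship described above.
# Names (Workflow, QoeTargets, QoeSample) are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class QoeTargets:
    max_loss_pct: float      # tolerable packet loss, percent
    max_latency_ms: float    # tolerable one-way latency, milliseconds
    min_availability: float  # required availability, e.g. 0.999

@dataclass
class QoeSample:
    loss_pct: float
    latency_ms: float
    availability: float

@dataclass
class Workflow:
    name: str        # e.g. "order entry: branch workers -> ERP application"
    targets: QoeTargets

    def meets_qoe(self, sample: QoeSample) -> bool:
        """True if a measured sample satisfies this workflow's QoE targets."""
        return (sample.loss_pct <= self.targets.max_loss_pct
                and sample.latency_ms <= self.targets.max_latency_ms
                and sample.availability >= self.targets.min_availability)
```

In this framing, “observability” would mean being able to populate something like QoeSample for each workflow, continuously and accurately.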
Monitoring is the process of recording statistical data on the performance of something, and that truth establishes its limitations. I can count the number of packets I receive at a given point without any issue. However, that count isn’t really part of my basic QoE formula; I need packet loss. The loss of a packet can’t be counted where the packet didn’t show up; it has to be determined by comparing the number of packets sent with the number received. Even that’s an oversimplification, because it presumes that every packet “sent” was supposed to be received. An interface supporting multiple conversations can be analyzed with the sent/received difference only if there are no intervening points between sender and receiver that could divert some packets to a different path.
We usually end up reflecting this truth by measuring packet loss on a hop basis, meaning over a point-to-point path, and that works. But it creates another problem: we now have to know what paths our workflow is transiting to know whether it’s subject to a known packet loss on a hop, and even then, we don’t know whether the packets lost include the packets from our workflow.
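A quick sketch may help show what hop-level measurement gives you and what it doesn’t. The function names and the single-path assumption are mine, purely for illustration.

```python
# A sketch of hop-level loss measurement and its limitation, per the discussion above.
# Function names are hypothetical; the counters would come from interface statistics.

def hop_loss(sent: int, received: int) -> float:
    """Packet loss rate over one point-to-point hop. Valid only if every
    packet counted as 'sent' was supposed to arrive at the counting point,
    i.e. nothing was legitimately diverted to another path."""
    if sent == 0:
        return 0.0
    return (sent - received) / sent

def path_loss_estimate(hop_losses_on_path: list[float]) -> float:
    """Combine per-hop loss rates along a path, assuming (optimistically)
    that we actually know which path our workflow transits. Even then, this
    says how lossy the path was, not whether *our* packets were the ones lost."""
    survive = 1.0
    for loss in hop_losses_on_path:
        survive *= (1.0 - loss)
    return 1.0 - survive
```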
The simple answer to observability would be to eliminate the middleman in the QoE process. If the on-ramp agent for a particular workflow measured all the QoE quantities and the destination or off-ramp agent did the same, we could decide whether packets from that workflow were lost. Count what the agent for a particular direction sends and compare it with what’s received. Software can do this if the process is injected into the code, and that’s not difficult to do.
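Here’s roughly what that looks like in code, a sketch under the assumption that the agents are just counters injected at the workflow’s entry and exit points; the names are invented.

```python
# A sketch of the on-ramp/off-ramp idea: instrument the workflow at its
# endpoints rather than inferring loss from hop statistics.
# Class and function names are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class WorkflowAgent:
    """Counts the packets of one workflow, in one direction, at one endpoint."""
    packets: int = 0

    def observe(self, count: int = 1) -> None:
        self.packets += count

def workflow_loss(on_ramp: WorkflowAgent, off_ramp: WorkflowAgent) -> float:
    """Loss for this workflow alone: what the on-ramp sent versus what the
    off-ramp received, as a fraction of what was sent."""
    if on_ramp.packets == 0:
        return 0.0
    return (on_ramp.packets - off_ramp.packets) / on_ramp.packets
```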
But does this really constitute “observability”? We’re supposed to be able to take action on what we find, and how can we do that if all we know is that something got lost in transit? We don’t know where the workflow was supposed to go, path-wise, or where it actually went, or what conditions existed along the path.
Software development does indeed impact the observability process. If workflows are threads connecting microservices, as they are in the cloud-native model, then changing the software architecture would change the elements of the workflow. A new microservice could lose data, data could be lost getting to or from it…you get the picture. However, we still have the fundamental requirement to track what was supposed to happen versus what did, or we have no means of adopting a strategy of remediation. Logically, can we assume that all packet loss in a multi-microservice workflow arises because a microservice logic error dropped a packet?
Then there’s the question of “contaminated” versus “dropped”. A microservice might drop a packet because of a software error, but it’s probably more likely that it would process it incorrectly and pass it along. Thus, our workflow is prone to both packet loss and contamination. Packet error-checks will usually detect transmission garbling of a packet, so if a contaminated packet arrives with a valid error-check, we can assume the contamination was caused by software logic. Packet loss could be caused by either software or transmission/handling, and to decide which was the villain we’d either have to correlate packet counts at the hop level (meaning we’d have to know where the packet went), or infer it, perhaps from the fact that the problem occurs more often than network errors would make likely, or because it correlates with a new software element in the workflow or a change to an existing element.
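For what it’s worth, the triage logic described above is simple enough to sketch. The thresholds and names here are invented, not drawn from any real tool.

```python
# A sketch of the "dropped vs. contaminated" triage described above.
# The rules restate the inference in the text; thresholds and names are hypothetical.
from enum import Enum, auto

class Verdict(Enum):
    LIKELY_SOFTWARE = auto()
    LIKELY_TRANSMISSION = auto()
    INDETERMINATE = auto()

def classify_contamination(error_check_valid: bool) -> Verdict:
    """A packet that arrives mangled but with a valid error-check was most
    likely corrupted by software, since transmission damage would normally
    fail the check."""
    return Verdict.LIKELY_SOFTWARE if error_check_valid else Verdict.LIKELY_TRANSMISSION

def classify_loss(observed_loss: float,
                  expected_network_loss: float,
                  recent_software_change: bool) -> Verdict:
    """Loss can't be classified directly, so we infer: loss far above what the
    network should produce, or loss that correlates with a software change in
    the workflow, points at software. The 10x factor is an arbitrary example."""
    if recent_software_change or observed_loss > 10 * expected_network_loss:
        return Verdict.LIKELY_SOFTWARE
    if observed_loss <= expected_network_loss:
        return Verdict.LIKELY_TRANSMISSION
    return Verdict.INDETERMINATE
```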
I’m not suggesting that all these issues can’t be resolved. We can count packets everywhere they’re handled. We can track packet paths so that even adaptive changes to network topology won’t confuse us. If we lay out how this could be made to work, though, we’d see that the result is likely to generate more packets of statistical data than packets in the workflow, and we’ve added a whole new layer, or layers, of cost and complexity.
The alternative approach, perhaps the best approach, may already be in use. We don’t try to correlate faults, but to prevent or address them. If a network is self-healing, then there is no need for higher-level remediation should there be an issue with one of the network QoS parameters that relate to our mission QoE parameters. The network will fix itself, if repair is possible. The same could be done, through virtualization and the cloud, to fix issues with QoE created by server failures.
This aims us at a possible approach: intent modeling. If we can break down infrastructure (network and IT, data center and cloud) into manageable functional pieces and compose application hosting and connection from them, we can link mission and function easily, and then link functions with realizations.
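As a rough sketch of the shape of that idea (the structure and names are mine, not a reference to any specific intent-modeling standard):

```python
# A sketch of intent modeling: infrastructure decomposed into functional pieces
# that each promise an outcome and hide how it's realized. Names are hypothetical.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class IntentModel:
    """A functional piece (a VPN, a hosting pool, a cluster) described by what
    it promises rather than how it delivers it."""
    name: str
    intent_met: Callable[[], bool]   # does this piece currently meet its promise?
    remediate: Callable[[], None]    # its internal self-healing action
    children: List["IntentModel"] = field(default_factory=list)

    def reconcile(self) -> bool:
        """Check the intent; if it's broken, let the model try to heal itself.
        Higher layers see only whether the promise is being kept."""
        children_ok = [child.reconcile() for child in self.children]
        ok = all(children_ok) and self.intent_met()
        if not ok:
            self.remediate()
            ok = self.intent_met()
        return ok

@dataclass
class Mission:
    """Links a business mission to the functions that realize it."""
    name: str
    functions: List[IntentModel]

    def on_track(self) -> bool:
        return all([f.reconcile() for f in self.functions])
```

The design choice that matters here is that remediation lives inside each functional piece, which is what makes the higher-level observability question, “is the mission being served?”, tractable.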
If we presume this approach is an acceptable path to observability, then the Protocol story makes more sense. Take the network and data center faults out of the picture, and what’s left is software. The problem is that users aren’t sensitive to where problems originate, only to whether their expected experiences are disrupted. We need to address “observability” in a general way because its goal is necessarily general.
This whole debate, in my view, is another example of our failure to look at problems holistically. Fixing one flat tire out of four is progress in one sense, but not in the sense of the mission of the tires, or the car, or the driver. The observability discussion needs to be elevated or we’re wasting our time.