Translating the Philosophy of Complexity Management to Reality

Could we be missing something fundamental in IT and network design?  Everyone knows that the focus of good design is creating good outcomes, but shouldn’t at least equal attention go to preventing bad outcomes?  A LinkedIn contact of mine who’s contributed some highly useful (and exceptionally thoughtful) comments sent me a reference on “Design for Prevention” or D4P, a book that’s summarized in a paper available online HERE.  I think there’s some useful stuff here, and I want to try to apply it to networking.  To do that, I have to translate the philosophical approach of the referenced document into a methodology or architecture that can be applied to networking.

The document is a slog to get through (I can only imagine what the book is like).  It’s more about philosophy than about engineering in a technical sense, so I guess it would be fair to say it’s about engineering philosophy.  The technical sense is that there’s always a good outcome, a goal outcome, in a project.  The evolution from simple to complex doesn’t alter the number of good outcomes (one), but it does alter the number of possible bad outcomes.  In other words, there’s a good chance that without some specific steps to address the situation, complex systems will fail for reasons unforeseen.

The piece starts with an insight worthy of consideration: “[The] old world was characterized by the need to manage things – stone, wood, iron.  The new world is characterized by the need to manage complexity. Complexity is the very stuff of today’s world. This mismatch lies at the root of our incompetence.”—Stafford Beer.  I’ve been in tech for longer than most of my readers have lived, and I remember the very transformation Beer is talking about.  To get better, we get more complicated, and if we want to avoid being buried in the very stuff of our advances, we have to manage it better.  That’s where my opening question comes in; are we doing enough?

Better management, in the D4P concept, is really about controlling and preventing the bad outcomes that arise from complexity, through application of professional engineering discipline.  Add this to the usual goal-seeking, add foresight to hindsight, and you have something useful, even compelling, providing that you can make the philosophical approach the paper takes into something more actionable.  To fulfill my goal of linking philosophy to architecture, it will be necessary to control complexity in that architecture.

D4P’s thesis is that it’s not enough to try to design for the desired outcome; you also have to design to prevent unfavorable outcomes.  I think it might even be fair to say that there are situations where the “right” (or at least “best”) outcome is one that isn’t one of the bad ones.  With a whole industry focused on “winning”, though, how do we look at “not-losing” as a goal?  General MacArthur was asked his formula for success in offensive warfare, and he replied “Hit them where they ain’t”.  He was then asked for a strategy for defense, and he replied “Defeat”.  Applying this to network and IT projects, it means we have to take the offense against problems, not responding to them in operation but in planning.

Hitting them where they ain’t, in the D4P approach, means shifting from a hindsight view (fix a problem) to a foresight view (prevent a problem by anticipating it).  Obviously, preventing something from happening can be said to be a “foresight” approach, but of course you could say that about seeking a successful outcome.  How, in a complex system, do you manage complexity and discourage bad outcomes by thinking or planning ahead?  There are certainly philosophers among the software and network engineering community, but most of both groups have a pretty pragmatic set of goals.  We don’t want them to develop the philosophy of networking, we want a network.  There has to be some methodology that gives us the network within D4P constraints.

The centerpiece of the methodology seems to me to be the concept of a “standard of care”, a blueprint to achieve the goal of avoiding bad outcomes.  It’s at this point that I’ll leave the philosophical and propose the methodological.  I suggest that this concept is a bit like an intent model.  That’s not exactly where D4P goes, but I want to take a step of mapping the “philosophy” to current industry terms and thinking.  I also think that intent modeling, applied hierarchically, is a great tool for managing complexity.

D4P’s goal is to avoid getting trapped in a process rather than responding to objective data.  We don’t have to look far, or hard, to find examples of how that trap gets sprung on us in the networking space.  NFV is a good one, so is SDN, ZTA, and arguably some of the 5G work.  How, exactly, does this trap get sprung?  The paper gives non-IT comments, but you could translate them into IT terms and situations, which of course is what I propose to do here.

Complexity is the product of the combination of large numbers of cooperating elements in a system and large numbers of relationships among the elements.  I think that when faced with something like that, people are forced to try to apply organization to the mess, and when they do that, they often “anthropomorphize” the way the system would work.  They think of how they, or a team of humans, would do something.  In-boxes, human stepwise processes, out-boxes, and somebody to carry work from “out” to “in”.  That’s how people do things, and how you can get trapped in process.

This approach, this “practice”, has the effect of creating tight coupling between the cooperative elements, which means that the system’s complexity is directly reflected in the implementation of the software or network feature.  In IoT terms, what we’ve done is create a vast and complex “control loop”, and it’s hard to avoid having to ask questions like “Can I do this here, when something else might be operating on the same resource?”  Those questions, the need to ask them, are examples of not designing for prevention.

So many of our diagrams and architectures end up as monolithic visions because humans are monoliths.  The first thing I think needs to be done to achieve D4P is to come up with a better way of organizing this vast complexity.  That’s where I think that intent models come into play.  An intent model is a representation of functionality and not implementation.  That presents two benefits at the conceptualization stage of an IT or network project.  First, it lets you translate goal behavior to functional elements without worrying much about how the elements are implemented.  That frees the original organization of the complex elements from the details that make them complex, or from implementation assumptions that could contaminate the project by introducing too much “process” and not enough “data”.
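To make the functionality-versus-implementation point concrete, here’s a minimal sketch in Python.  All the names (ServiceIntent, VpnIntent, the SLA fields) are hypothetical illustrations, not anything from the paper; the point is only that the outside world sees a functional contract and an externally meaningful state, never the implementation.

```python
# A minimal sketch of an intent model: the interface exposes
# functionality (what the element must deliver), never implementation.
# Class and field names here are hypothetical, for illustration only.

from abc import ABC, abstractmethod


class ServiceIntent(ABC):
    """A functional element described by what it must deliver."""

    def __init__(self, name: str, sla: dict):
        self.name = name
        self.sla = sla  # e.g. {"latency_ms": 20, "availability": 0.9999}

    @abstractmethod
    def deploy(self) -> None:
        """How the goal is met is hidden inside the black box."""

    @abstractmethod
    def status(self) -> str:
        """Only externally meaningful state is visible."""


class VpnIntent(ServiceIntent):
    def deploy(self) -> None:
        # Implementation detail: could be MPLS, SD-WAN, an overlay...
        # the caller neither knows nor cares.
        self._ready = True

    def status(self) -> str:
        return "Operational" if getattr(self, "_ready", False) else "Pending"


vpn = VpnIntent("corp-vpn", {"latency_ms": 20})
vpn.deploy()
print(vpn.status())  # the outside world sees only the functional state
```

Because the project is organized around `ServiceIntent` and not `VpnIntent`’s internals, the implementation can change without contaminating the architecture above it.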

Artificial intelligence isn’t the answer to this problem.  An artificial human shuffling paper doesn’t do any better than a real one.  AI, applied to systems that are too complex, will have the same kind of problems that we’ve faced all along.  The virtue of modeling, intent modeling, is that you can subdivide systems, and by containing elements into subsystems, reduce the possible interactions…the complexity.

Intent models, functionality models, aren’t enough, of course.  You need functionality maps, meaning that you need to understand how the functions relate to each other.  The best way to do that is through the age-old concept of the workflow.  A workflow is an ordered set of process responses to an event or condition.  The presumption of a workflow-centric functionality map is that a description of the application or service, a “contract”, can define the relationship of the functions within the end-result service or application.  That was the essence of the TMF NGOSS Contract stuff.

In the NGOSS Contract, every “function” (using my term) is a contract element that has a state/event table associated with it.  That table identifies every meaningful operating state that the element can be in, and how every possible event the element could receive should be processed for each of those states.  Remember here that we’re still not saying how any process is implemented, we’re simply defining how the black boxes relate to each other and to the end result.
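A state/event table in this spirit can be sketched as a simple lookup from (state, event) pairs to an abstract process and a next state.  This is my own illustrative rendering, not the TMF’s formal model, and every state, event, and process name here is hypothetical.

```python
# A hypothetical state/event table for one contract element, in the
# spirit of the TMF NGOSS Contract: every (state, event) pair maps to
# an abstract process name and a successor state. All names invented.

STATE_EVENT_TABLE = {
    ("Ordered",     "activate"):      ("start_deployment", "Activating"),
    ("Activating",  "deploy_done"):   ("run_tests",        "Testing"),
    ("Testing",     "tests_passed"):  ("notify_ready",     "Operational"),
    ("Operational", "fault"):         ("remediate",        "Degraded"),
    ("Degraded",    "fault_cleared"): ("notify_restored",  "Operational"),
}


def handle(state: str, event: str) -> tuple:
    """Look up how this event is processed when in this state."""
    try:
        process, next_state = STATE_EVENT_TABLE[(state, event)]
    except KeyError:
        # An unhandled pair is a design gap -- exactly what D4P wants
        # surfaced at design time, not discovered in operation.
        raise ValueError(f"No handling defined for {event!r} in {state!r}")
    return process, next_state


print(handle("Operational", "fault"))  # ('remediate', 'Degraded')
```

Note that `handle` says nothing about how `remediate` is implemented; it only defines how the black boxes relate to each other and to the end result.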

The state/event table, in my view, is the key to introducing foresight and D4P principles to application and service design.  We can look at our elements/functions and define their meaningful states (meaningful, meaning visible from the outside), and we can define how the events associated with the elements are linked to abstract processes.  If we do this right, and the paper describes the philosophy associated with getting it right, we end up with something that not only recognizes the goal, but also handles unfavorable things.  We’ve created a goal-seeking framework for automation.

Does it really address the “design-for-prevention” paradigm, though?  We’ve done some of the work, I think, through intent-modeling and functional mapping, because we’ve organized the complexity without getting bogged down in implementation.  That reduces what I’ll call the “internal process problem”, the stuff associated with how you elect to organize your task in a complex world.  There’s another process issue, though, and we have to look at how it’s handled.

The very task of creating functional elements and functional maps is a kind of process.  The state/event table, because it has to link to processes, obviously has to define processes to link to.  In the approach I’m describing here, it is absolutely essential that the functional and functional-map pieces, and the event/process combinations, be thoroughly thought out.  One advantage of the state/event system is that it forces an architect to categorize how each event should be handled, and how events relate to a transition in operating states.  In any state/event table, there is typically one state, sometimes called “Operational”, that reflects the goal.  The other states are either steps along the way to that goal, or problems to be addressed or prevented.

At the functional map level, you prevent failures by defining all the events that are relevant to a function and associating a state/process progression with each.  Instead of letting some unexpected event create a major outage, you define every event in every state so nothing is unexpected.  You can do that because you have a contained problem—your function abstraction and your functional map are all you need to work with, no matter how complex the implementation is.  In IoT-ish terms, functional separation creates shorter control loops, because every function is a black box that produces a specific set of interfaces/behaviors at the boundary. No interior process exits the function.
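“Define every event in every state” is something you can actually audit at design time.  Here’s a sketch of such a completeness check, with hypothetical states, events, and table entries; a real project would run something like this against its full functional map.

```python
# Sketch of a design-time completeness check: verify that a state/event
# table defines a response for every event in every state, so nothing
# is "unexpected" in operation. All states/events are hypothetical.

STATES = ["Ordered", "Activating", "Operational", "Degraded"]
EVENTS = ["activate", "deploy_done", "fault", "fault_cleared"]

TABLE = {
    ("Ordered",     "activate"):      ("start_deployment", "Activating"),
    ("Activating",  "deploy_done"):   ("notify_ready",     "Operational"),
    ("Operational", "fault"):         ("remediate",        "Degraded"),
    ("Degraded",    "fault_cleared"): ("notify_restored",  "Operational"),
}


def coverage_gaps(states, events, table):
    """Return every (state, event) pair with no defined handling."""
    return [(s, e) for s in states for e in events if (s, e) not in table]


gaps = coverage_gaps(STATES, EVENTS, TABLE)
# 4 states x 4 events = 16 pairs, 4 defined, so 12 gaps to resolve.
print(f"{len(gaps)} unhandled state/event pairs")
```

Each reported gap is a foresight question the architect must answer before deployment, even if the answer is an explicit “ignore this event in this state”.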

But what about what’s inside the black box?  A function could “decompose” into one of two things—another function set, with its own “contract” and state/event tables, or a primitive implementation.  Whatever is inside, the goal is to meet the external interface(s) and SLA of the function.  If each of the functions is designed to completely represent its own internal state/event/process relationships in some way, then it’s a suitable implementation and it should also be D4P-compliant.
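The decomposition rule above can be sketched very compactly: a function is either a set of subordinate functions (each with its own contract) or a primitive implementation, and in either case only the boundary interface is visible.  This is my illustration, with invented names, not a formal model from the paper.

```python
# Sketch of hierarchical decomposition: a function decomposes into
# either a subordinate function set or a primitive implementation.
# Either way, callers see only the external contract. Names invented.

class Function:
    def __init__(self, name, children=None, primitive=None):
        self.name = name
        self.children = children or []  # subordinate function set...
        self.primitive = primitive      # ...or a concrete implementation

    def deploy(self) -> bool:
        """Meet the external contract, whatever is inside the box."""
        if self.primitive is not None:
            return self.primitive()
        return all(child.deploy() for child in self.children)


# Leaves are primitive; the parent's contract hides the structure.
access = Function("access", primitive=lambda: True)
core = Function("core", primitive=lambda: True)
vpn = Function("vpn-service", children=[access, core])
print(vpn.deploy())  # True: the contract is met; internals stay hidden
```

Because `vpn-service` could swap its children for a single primitive without changing its interface, each level can be made D4P-compliant independently.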

I’ve seen the result of a failure to provide D4P thinking, in a network protocol.  A new network architecture was introduced, and unlike the old architecture, the new one allowed for the queuing of packets, sometimes for a protracted period of time.  The protocol running over the network was designed for a point-to-point connection, meaning that there was nothing inside the network to queue, and therefore its state/event tables didn’t accommodate the situation when messages were delayed for a long period.  What happened was that, under load, messages were delayed so much that the other end of the connection had “timed out” and entered a different state.  Context between endpoints was lost, and the system finally decided it must have a bad software load, so everything rebooted.  That made queuing even worse, and down everything came.  The right answer was simple; don’t ever queue messages for this protocol, throw them away.  The protocol state/event process could handle that, but not a delayed delivery.
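The fix in that story can be sketched in a few lines.  This is not the actual protocol or network code, just a hypothetical illustration of the principle: if the protocol’s state/event design can recover from loss but not from long delay, the network should drop stale messages rather than queue them.

```python
# Illustrative sketch of the lesson from this failure: queue only
# messages the protocol's state machine can still use; convert long
# delay (unhandled) into loss (handled). MAX_AGE_S is hypothetical,
# tied to the protocol's timeout behavior.

import time

MAX_AGE_S = 1.0


def forward_or_drop(message: dict, queue: list) -> str:
    """Discard messages too old for the receiver's state machine."""
    if time.time() - message["sent_at"] > MAX_AGE_S:
        return "dropped"   # loss: the protocol was designed for this
    queue.append(message)
    return "queued"        # timely delivery: also designed for


q = []
fresh = {"sent_at": time.time()}
stale = {"sent_at": time.time() - 5.0}
print(forward_or_drop(fresh, q), forward_or_drop(stale, q))
```

The design insight is that “drop” maps the unforeseen condition back into a condition the state/event tables already cover, which is exactly the kind of foresight D4P asks for.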

I think this example illustrates why functionality maps and state/event/process specification is so important in preventing failures.  It also shows why it’s still not easy to get what you want.  Could people designing a protocol for direct-line connection have anticipated data delayed, intact, in flight and delivered many seconds after it was sent?  Clearly they didn’t.  Could people creating a new transport network model to replace physical lines with virtual paths have anticipated that their new architecture would introduce conditions that couldn’t have existed before, and thus fail when those conditions did happen?  Clearly they didn’t.

Near the end of the paper is another critical point: “Complexity reduction is elimination and unification.”  I think that’s what the approach I’m talking about here does, and why I think it’s a way to address D4P in the context of service and application virtualization.  That’s why it’s my way of taking D4P philosophy and converting it into a project methodology.

In the same place, I find one thing I disagree with, and it’s a thing that nicely frames the difficulty we face adopting this approach.  “Keep in mind that the best goal-seeking methods are scrutably connected to natural law and from that whence commeth your distinguishing difference and overwhelming advantage.”  Aside from validating my characterization of the piece as being pretty deep philosophy, this points out a problem.  Software architecture isn’t natural law, and cloud development and virtualization take us a long way out of the “natural”.  That’s the goal, in fact.  What is “virtual” if not “unnatural”?  We have to come to terms with the unnatural to make the future work for us.

I agree with the notion of D4P, and I agree with the philosophy behind it, a philosophy the paper discusses, but I’m not much of a philosopher myself.  The practical truth is that what we need to do is generalize our thinking within the constraints of intent models, functionality maps, and state/event/process associations, to ensure that we don’t treat things that are really like geometry’s theorems as geometry’s axioms.  I think that the process I’ve described has the advantage of encouraging us to do that, but it can’t work any better than we’re willing to make it work, and that may be so much of a change in mindset that many of our planners and architects will have trouble with the transition.