State/Event or Policies: Best Lifecycle Automation Option?

Any sort of lifecycle automation demands the generation of responses to conditions.  The industry has defined two broad approaches to that, the use of policies and the use of state/event logic.  Both these concepts have been around for (literally) half a century, so there’s plenty of experience with them.  There seems to be less experience in sorting out the plusses and minuses of each approach, so this is a good time to do that.

The basic problem with handling lifecycle automation, the series of tasks needed to keep a network or computing platform running properly, is that it involves a bunch of different conditions that arise in their own time, with variable relationships among them.  A condition might be anything that moves the technology target from its “normal” state of operation to any other state, and actions are things triggered to restore normalcy.

The oldest approach to dealing with this sort of thing comes from the software that handles network protocols.  A network protocol connects partner elements, and like human conversation, network connections only work if everyone is on the same page.

A typical data link protocol, for example, would have a “setup” state where the link is established, a “normal” state where it’s operating, an “error recovery” state where it tries to fix a problem, a “violation” state where it detects a sign the endpoints are out of context with each other, and a “shutting down” state where it’s going out of service.

Events in a network protocol are the messages received on the connection.  A data packet is an event, as is a request to enter the “normal” state after setup, or a request to repeat the last message.  Each event has to be interpreted in the context the state represents.  A data packet in the “normal” state is fine, but getting one in the “setup” state is a procedure error that shows a context problem.  So is a setup packet in the “normal” state, because things are set up already.

State/event tables are two-dimensional structures (state and event type), and each intersection typically defines a process to be invoked and the “next state” to be set.  When a message is received, it’s classified into one of the event types, and the event and current state then index the table.  You run the process, you set the current state to the “next state” value in the table, and you’re done.
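To make the table lookup concrete, here is a minimal Python sketch of a state/event table for the data link example. The state names come from the description above; the event names, the handler functions, and the choice to treat any undefined intersection as a procedure error are assumptions made for illustration, not a description of any real protocol.

```python
from enum import Enum, auto


class State(Enum):
    SETUP = auto()
    NORMAL = auto()
    ERROR_RECOVERY = auto()
    VIOLATION = auto()
    SHUTTING_DOWN = auto()


class Event(Enum):
    ENTER_NORMAL = auto()    # request to enter "normal" after setup
    DATA = auto()            # a data packet
    REPEAT_REQUEST = auto()  # request to repeat the last message
    SHUTDOWN = auto()        # request to take the link out of service


# Hypothetical processes; each one handles a single state/event intersection.
def start_link(msg):
    return "link up"


def deliver(msg):
    return f"delivered {msg}"


def resend_last(msg):
    return "resent last message"


def close_link(msg):
    return "link closing"


def procedure_error(msg):
    return "procedure error"


# Each intersection names the process to run and the next state to set.
TABLE = {
    (State.SETUP, Event.ENTER_NORMAL): (start_link, State.NORMAL),
    (State.NORMAL, Event.DATA): (deliver, State.NORMAL),
    (State.NORMAL, Event.REPEAT_REQUEST): (resend_last, State.ERROR_RECOVERY),
    (State.ERROR_RECOVERY, Event.DATA): (deliver, State.NORMAL),
    (State.NORMAL, Event.SHUTDOWN): (close_link, State.SHUTTING_DOWN),
}

# Assumption: anything not listed is out of context for the current state.
DEFAULT = (procedure_error, State.VIOLATION)


def handle(state, event, msg=None):
    """Index the table by (state, event), run the process, return the next state."""
    process, next_state = TABLE.get((state, event), DEFAULT)
    return process(msg), next_state
```

Calling `handle(State.SETUP, Event.DATA, "p1")` lands in the default entry, which is the "data packet during setup" procedure error described earlier.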

Policies are the other way of dealing with conditions.  A policy is a description of a condition tied to the description of an action.  It’s not unlike the programming statement (in many different programming languages) “if condition-exists then perform-action [else perform-other-action]”.  The “condition-exists” test can be compound, so you could test state and event through a policy and eliminate the whole state/event table concept.
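As a rough illustration of the condition/action pairing, the sketch below expresses the same kind of state-plus-event test as a small set of policies. The `Policy` class, the context dictionary, and the first-match-wins evaluation order are all assumptions chosen to keep the example short; real policy engines differ on each of these.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    name: str
    condition: Callable[[dict], bool]  # the "if condition-exists" part
    action: Callable[[dict], str]      # the "then perform-action" part


# Compound conditions test state and event together, replacing a table lookup.
POLICIES = [
    Policy(
        "data-in-normal",
        lambda c: c["state"] == "normal" and c["event"] == "data",
        lambda c: "deliver",
    ),
    Policy(
        "data-in-setup",
        lambda c: c["state"] == "setup" and c["event"] == "data",
        lambda c: "procedure-error",
    ),
]


def evaluate(context, policies=POLICIES, otherwise="no-action"):
    """Run the first policy whose condition holds (assumed: first match wins)."""
    for policy in policies:
        if policy.condition(context):
            return policy.action(context)
    return otherwise  # the "else perform-other-action" branch
```

Note that nothing in this structure forces the rules to cover every combination; a context that matches no policy simply falls through to `otherwise`.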

Policies are usually defined in sets, which are policies related to a specific condition or goal.  Access control is something that’s often handled by policies.  Each policy is a rule, and policy-driven systems will often have “rule editors” to create the rules and policy sets.  The policies/rules are often directly readable, making it easier to create one or to understand one you’re reviewing.

So OK, we’ve defined state/event and policy systems.  What are the plusses and minuses of each?  Let’s start with the state/event systems.

State/event tables are great for real-time unitary systems.  What that means is that the conditions/events happen when they happen, and that each system of event generation and response has its own state/event table.  I can model a network connection using a state/event table, but if I want to track a whole network, I’ll need to define a model that relates each element of the network to its own state/event table, and I’ll have to define events that can be used to signal one unitary system from another, a means of collectivizing and organizing the system’s behavior.  It’s possible to create n-dimensional state/event tables for complex environments, but the notion falls apart quickly in practice.  I never worked on one, and I’ve worked on a lot of state/event systems.
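One way to picture that explicit model is a hierarchy in which each element owns a table and can raise an event to its parent. The sketch below assumes a two-level hierarchy (links under a network), invented event names such as `child-failed`, and a table entry format of (next state, optional upward event). It is deliberately naive: a single repaired link returns the network to normal even if another link is still down, which is exactly the kind of collective behavior a real model would have to organize.

```python
class Element:
    """A unitary system with its own state/event table and an optional parent."""

    def __init__(self, name, table, state, parent=None):
        self.name = name
        self.table = table
        self.state = state
        self.parent = parent

    def handle(self, event):
        entry = self.table.get((self.state, event))
        if entry is None:
            return  # event has no meaning in the current state; ignore it
        self.state, upward = entry
        if upward is not None and self.parent is not None:
            # Signal the higher-level system with an event of its own.
            self.parent.handle(upward)


# (state, event) -> (next state, event to raise to the parent or None)
LINK_TABLE = {
    ("normal", "fault"): ("failed", "child-failed"),
    ("failed", "repaired"): ("normal", "child-restored"),
}
NETWORK_TABLE = {
    ("normal", "child-failed"): ("degraded", None),
    ("degraded", "child-restored"): ("normal", None),
}

network = Element("network", NETWORK_TABLE, "normal")
link_a = Element("link-a", LINK_TABLE, "normal", parent=network)
```

After `link_a.handle("fault")`, the link is in the `failed` state and the network has moved to `degraded` through its own table.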

If you tried to replicate unitary system (network connection, for example) state/event processes using policies, you’d run into a similar problem.  The policies would exist for every connection, but you’d have to reflect how those per-connection policies related to the higher-level system, and how higher-level policies would then be defined and executed.

In networks, the complexities of unitary versus systemic policies are often handled implicitly.  The presumption in many (nay, most) policy-based network implementations is that you have relatively autonomous elements composed of many unitary systems.  These deal with their own issues in their own ways, but those ways are influenced by a policy control point, and implemented via a policy enforcement point.  In other words, the structure of a policy-based system creates an implicit model that, for state/event systems, would have to be explicit.

This sounds sweet, but it raises two points.  The first is how those autonomous elements “deal with their own issues”.  Most networks rely on self-healing or adaptive behavior to maintain operation, and policies are then needed only to influence the way that behavior is tuned, or to deal with situations where resolution through self-healing isn’t possible.  The second is the risk of policy complexity overload.

Back in my early days of programming, before high-level languages were commonly used in business computing, one of the big problems in application design was the failure to efficiently organize tests of conditions.  Remember, there was no if/then/else structure in the language.  Often a program would check for a dozen conditions, and the data would present condition number 13, which would “fall through”.  My employer had a genius software type who invented a “decision table” language to organize the tests, and the resulting tables could be machine-validated to determine if they tested all possible conditions or, worse, tested things that prior tests had already ruled out.  If A=3 gets you started with this policy set, testing whether A=4 is clearly not a good sign of logical thinking.

Policies can present this same problem of test complexity.  You have policy rules to cover conditions, but are the rules complete and consistent?  This can be particularly difficult when there’s no central policy control point, so policies are distributed and are difficult to assess holistically.  This isn’t a defeat for policies, but a cry to support policies through central repositories where they can be examined for completeness and contradictions.
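The kind of machine validation described for decision tables can be sketched for a policy set, provided the rules sit in one place and the condition variables have a small, known set of values. Everything here is an assumption for illustration: the `DOMAIN` values, the rule format (name, partial conditions, action), and the definition of a contradiction as two matching rules that call for different actions.

```python
from itertools import product

# Assumed finite domain for each condition variable.
DOMAIN = {
    "state": ["setup", "normal"],
    "event": ["data", "setup-request"],
}

# Each rule: (name, conditions that must all hold, action).
RULES = [
    ("r1", {"state": "normal", "event": "data"}, "deliver"),
    ("r2", {"state": "setup", "event": "setup-request"}, "start"),
    ("r3", {"event": "data"}, "procedure-error"),  # overlaps r1
]


def matches(conditions, case):
    return all(case[key] == value for key, value in conditions.items())


def audit(rules, domain):
    """Enumerate every combination; report gaps and contradictory overlaps."""
    keys = list(domain)
    uncovered, conflicts = [], []
    for values in product(*(domain[key] for key in keys)):
        case = dict(zip(keys, values))
        hits = [(name, action) for name, conds, action in rules
                if matches(conds, case)]
        if not hits:
            uncovered.append(case)  # the "condition number 13" problem
        elif len({action for _, action in hits}) > 1:
            conflicts.append((case, [name for name, _ in hits]))
    return uncovered, conflicts


uncovered, conflicts = audit(RULES, DOMAIN)
```

With these sample rules, the audit reports that a setup request in the normal state is not covered by any rule, and that `r1` and `r3` disagree about a data packet in the normal state. Exhaustive enumeration only works for small domains, which is one more argument for keeping the policy set where it can be examined as a whole.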

My own views on this are probably understood by those who have followed this blog regularly.  I believe that it’s essential that we model services as a set of functional blocks, modeled using intent principles, and based on a formal “class inheritance” structure where a “node” might decompose into a “router”, then into a “Cisco router”, and a “network” might decompose into “access” and “core”.  This approach has been in place within the TMF (with various degrees of rigor in compliance) for well over a decade.  If this is done, then I believe that the (again-TMF-initiated) NGOSS Contract approach, which uses the contract data model to steer events through each functional element’s state/event table, is the way to go in implementation.

What makes me leery about policies is that the implementation threatens to lose real-time properties.  A state/event table is inherently event-driven, meaning real-time.  A “conditions test” rule is one that invites a “poll for status” implementation.  “If-condition-exists” is easily interpreted as meaning to go get the variables and run the test.

Policy implementations are also free to deviate from functional standards, meaning that they’re not automatically stateless.  A properly designed state/event-driven system uses stateless processes for the state/event intersection, making it resilient and fully scalable.
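A short sketch may help show what “stateless processes” means here. The process at each intersection keeps nothing between calls; it receives the element’s record from a data model and returns an updated copy, so any instance of the process could handle any event. The dictionary standing in for the data model, the record fields, and the event format are assumptions for illustration, not the NGOSS Contract format.

```python
# Stateless processes: all context arrives as arguments, nothing is kept in
# module-level or instance variables between calls.
def on_fault(record, event):
    return {**record, "state": "recovering", "last_fault": event["detail"]}


def on_restored(record, event):
    return {**record, "state": "normal"}


TABLE = {
    ("normal", "fault"): on_fault,
    ("recovering", "restored"): on_restored,
}


def steer(store, element_id, event):
    """Look up the element's record, pick the process from the table, save the result.

    `store` is a plain dict standing in for a shared data model (an assumption).
    """
    record = store[element_id]
    process = TABLE.get((record["state"], event["type"]))
    if process is not None:
        store[element_id] = process(record, event)
    return store[element_id]


store = {"core-1": {"state": "normal"}}
steer(store, "core-1", {"type": "fault", "detail": "link down"})
```

Because the state lives in the record rather than in the process, a failed process instance can be replaced and additional instances can be added without any handoff of context.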

So, I’m firmly in the state/event corner on this one.  I’m happy to analyze policy-based approaches if they’re documented well enough for me to assess them, but I’m going to be looking hard at the points I’ve cited here to see if they deal with them.  If they don’t, then in my view, they won’t form a reasonable basis for lifecycle automation.