Lifecycle automation is all about handling events with automated processes. Event interpretation has to be contextual, which means that event-handling has to recognize the specific context or state in which an event occurs before the event can be properly interpreted.
In my view, state/event tables have advantages over policies in controlling event-handling in lifecycle automation. They also have some disadvantages, and there are special issues that arise when a state/event system consists of multiple interrelated elements. I did extensive testing of state/event behavior over the last decade in my ExperiaSphere project, and I want to both explain what I learned and point out a possible improvement to the classic state/event-table model.
One of the problems that quickly emerges in any lifecycle automation project is the sheer number of devices/elements that make up a network or hosting system. If you consider a “service” or an “application” as being the product of ten thousand individual elements, each of which has its own operating states and events to consider, you quickly reach a number of possible interactions that explodes beyond any potential for handling. You have to think in terms of structures instead.
A typical service or application has a discrete number of functional pieces. For services, you have a collection of access networks that feed a core, and for applications you have front- and back-end elements, database and compute, etc. The first lesson I learned is that you have to model a service using a “function hierarchy”. Staying with networks (the same lessons apply to applications, but ExperiaSphere was a network project), you could model a service as “access” and “core”, and then further divide each based on geography/administration. If I had 30 different cities involved, I’d have 30 “access” subdivisions, and if I had three operators cooperating to create my core, I’d have three subdivisions there. Each of these subdivisions should be visualized as a black-box or intent-model element.
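To make the idea concrete, here’s a minimal sketch of such a function hierarchy. The FunctionElement class and the access/core layout are my own illustration, not ExperiaSphere code; the point is simply that a modest hierarchy tames what would otherwise be thousands of individual devices.

```python
# A minimal sketch of the "function hierarchy" idea: a service decomposed into
# black-box (intent-model) elements. Names and layout are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FunctionElement:
    name: str
    children: List["FunctionElement"] = field(default_factory=list)

    def add(self, child: "FunctionElement") -> "FunctionElement":
        self.children.append(child)
        return child

# A "master service" element with "access" and "core" beneath it.
service = FunctionElement("vpn-service")
access = service.add(FunctionElement("access"))
core = service.add(FunctionElement("core"))

# 30 access subdivisions by city, 3 core subdivisions by operator.
for city in range(1, 31):
    access.add(FunctionElement(f"access-city-{city:02d}"))
for operator in ("op-a", "op-b", "op-c"):
    core.add(FunctionElement(f"core-{operator}"))

def count(element: FunctionElement) -> int:
    return 1 + sum(count(c) for c in element.children)

print(count(service))  # 36 elements: 1 master, 2 high-level, 33 subdivisions
```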
The purpose of this hierarchy is to reduce the number of discrete elements that lifecycle automation has to model and handle. Each of the model elements in my example represents a cohesive collection of devices that serve a unified purpose. Further, because of the way that traditional networks and network protocols work, each of these model elements also represents a collection of devices that perform as a collective, most having adaptive behavior that’s self-healing.
In order for that to happen, the service has to be structured as a coupled model, each element/node of which represents a “composition/decomposition” engine. At the lowest level, the elements encapsulate real infrastructure via the exposed management APIs, and harmonize this with the “intent” of the elements that model them. At higher levels, the purpose of the engine is to mediate the flow of commands and events through the structure, so the elements behave collectively, as they must.
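As a rough illustration of the two roles such an engine can play, here’s a sketch of the node structure. The ModelNode/InfrastructureNode/CompositeNode names and the handle_event hook are assumptions of mine for illustration, not an ExperiaSphere or TMF API.

```python
# A sketch of the two kinds of model node described above: leaves bind intent
# to real management APIs, composites mediate command/event flow to children.
from abc import ABC, abstractmethod
from typing import Dict, List

class ModelNode(ABC):
    """A black-box element in the service model."""
    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def handle_event(self, event: str, variables: Dict[str, str]) -> None:
        ...

class InfrastructureNode(ModelNode):
    """Lowest level: binds the element's intent to real management APIs."""
    def handle_event(self, event: str, variables: Dict[str, str]) -> None:
        # A real implementation would call the exposed management APIs of the
        # devices behind the black box here.
        print(f"{self.name}: translating {event} into management-API calls")

class CompositeNode(ModelNode):
    """Higher levels: mediate command/event flow to and from children."""
    def __init__(self, name: str, children: List[ModelNode]):
        super().__init__(name)
        self.children = children

    def handle_event(self, event: str, variables: Dict[str, str]) -> None:
        # Commands flow downward only to adjacent (child) elements.
        for child in self.children:
            child.handle_event(event, variables)

# Illustrative use: an "access" composite with one city-level leaf beneath it.
access_city = InfrastructureNode("access-city-01")
access = CompositeNode("access", [access_city])
access.handle_event("activate", {"order-id": "1234"})
```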
A model of a service like the one I described would contain one “master service” element, two elements at the high level, and 33 at the next (30 under “access” and 3 under “core”). When the model is presented from a service order and “activated”, the order decomposes the two main elements into the proper lower-level elements based on service topology and carrier relationships. My testing showed that each of the elements (there are now 36 instantiated) should have its own state/event structure. In the work I did, it was possible to impose a standard state structure on all elements and presume a common set of events. Any condition that was recognized “inside” an element had to be communicated by mapping it to one of the standard events.
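Here’s a minimal sketch of what a standard state structure and common event set might look like. The specific state and event names, and the mapping of internal conditions, are illustrative assumptions rather than the actual set I used.

```python
# A sketch of the "standard states, common events" idea: every element uses
# the same small state set, and anything detected inside an element must be
# mapped onto one of the standard events before it can flow in the model.
from enum import Enum, auto

class State(Enum):
    ORDERED = auto()
    ACTIVATING = auto()
    WORKING = auto()
    NOT_WORKING = auto()
    OFFLINE = auto()
    DECOMMISSIONED = auto()

class Event(Enum):
    ACTIVATE = auto()
    ACTIVATED = auto()
    SLA_VIOLATION = auto()
    SLA_RESTORED = auto()
    DECOMMISSION = auto()
    PROCEDURE_ERROR = auto()

def map_internal_condition(raw_condition: str) -> Event:
    """Map an element-internal condition onto one of the standard events."""
    # e.g. a device trap or probe result seen inside the black box
    internal_to_standard = {
        "link-down": Event.SLA_VIOLATION,
        "link-up": Event.SLA_RESTORED,
        "deployed-ok": Event.ACTIVATED,
    }
    return internal_to_standard.get(raw_condition, Event.PROCEDURE_ERROR)

print(map_internal_condition("link-down"))  # Event.SLA_VIOLATION
```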
In this structure, events could flow only to adjacent elements. A lower-level element was always instantiated by the action of deploying the higher-level element, its “parent”, and parent and child elements “know” of each other and can communicate. No such knowledge or communication is possible beyond adjacent elements, because for the structure to be scalable and manageable you can’t presume to know what’s inside any element; it’s a black box or intent model.
The presumption in this structure is that each parent element instantiates a series of child elements, with the child elements presenting an implicit or explicit service-level agreement. The child is responsible for meeting the SLA, and if that cannot be done, for generating an event “upstream” to the parent. The parent can then do any number of things, including accepting a lower level of service, or replacing the child element by reprovisioning. That reprovisioning might be done via the same infrastructure, or it might seek a parallel option. It might even involve reprovisioning other elements to build the higher-level service represented by the parent in a different way.
An important point here is that when the child element changes state from “working” to “not-working” (or whatever you’d call them), it’s responsible for generating an event. If the parent wants to decommission the child at this point, it would issue an event to the child calling for that. If the parent cannot jigger things to work according to the SLA that it, in turn, provided to the highest level (the singular object representing the service as a whole), then it must report a change of state via an event, and its own parent must then take whatever action is available.
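A minimal sketch of that escalation pattern, assuming the “working”/“not-working” states mentioned above; the Element class, the event names, and the can_reprovision decision stub are all illustrative, not a prescribed interface.

```python
# A sketch of the parent/child event flow described above: a child that can't
# meet its SLA reports upstream; the parent may reprovision the child or, if
# it can't meet its own SLA either, change state and escalate to its parent.
from typing import Optional

class Element:
    def __init__(self, name: str, parent: Optional["Element"] = None):
        self.name = name
        self.parent = parent
        self.state = "working"

    def report_upstream(self, event: str) -> None:
        # Events can only flow to adjacent elements: here, the parent.
        if self.parent is not None:
            self.parent.on_child_event(self, event)

    def on_child_event(self, child: "Element", event: str) -> None:
        if event == "sla-violation":
            if self.can_reprovision(child):
                print(f"{self.name}: reprovisioning {child.name}")
                child.state = "working"
            else:
                # Can't meet our own SLA either: change state and escalate.
                print(f"{self.name}: cannot meet SLA, escalating")
                self.state = "not-working"
                self.report_upstream("sla-violation")

    def can_reprovision(self, child: "Element") -> bool:
        # Placeholder decision; a real model would consult topology/policy.
        return False

service = Element("service")
access = Element("access", parent=service)
city = Element("access-city-01", parent=access)

city.state = "not-working"
city.report_upstream("sla-violation")   # propagates upward until handled
```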
I presumed in my testing that all the variables associated with a given element’s operation were stored in the service model. I also presumed that the variables associated with a given model element would be delivered to a process instance activated as the result of the element having received an event. This is consistent with the NGOSS Contract proposal of the TMF, which I acknowledged as the source of the concept. Thus, any instance of that software process could respond to the event, including one just scaled up or created.
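In code terms, the idea is roughly the following sketch, assuming a simple JSON-style record for the element’s variables. The field names and the process_event function are mine; the point is only that the handler holds no state of its own, so any copy of it can pick up the event.

```python
# A sketch of the data-model-driven (NGOSS Contract-style) idea: all of an
# element's variables live in the service model and are delivered with the
# event, so any process instance (including one just spun up) can handle it.
import json

# Variables for one model element, as they might be stored in the service model.
element_record = {
    "element": "access-city-01",
    "state": "working",
    "sla": {"latency-ms": 30, "availability": 0.9995},
    "last-event": None,
}

def process_event(event: str, variables: dict) -> dict:
    """A stateless event handler: everything it needs arrives with the event."""
    variables = dict(variables)          # work on a copy, return the new record
    variables["last-event"] = event
    if event == "sla-violation":
        variables["state"] = "not-working"
    return variables

# The updated record is written back to the model, not kept in the process.
updated = process_event("sla-violation", element_record)
print(json.dumps(updated, indent=2))
```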
My tests all used a state/event table within each model element’s variable space. The table represented all possible process states and all possible events, and where a given state/event intersection was invalid or should not occur, the system response was to enter the “offline” state and report a “procedure error” event to the higher level.
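A minimal sketch of such a table, with the invalid-intersection behavior described above; the state, event, and process names are illustrative placeholders rather than the table I actually used.

```python
# A sketch of the per-element state/event table. Every valid state/event
# intersection names the process to run and the next state; any missing
# intersection is a procedure error: the element goes "offline" and reports
# upstream.
STATE_EVENT_TABLE = {
    ("ordered", "activate"):          ("deploy_element", "activating"),
    ("activating", "activated"):      ("confirm_sla", "working"),
    ("working", "sla-violation"):     ("report_upstream", "not-working"),
    ("not-working", "sla-restored"):  ("confirm_sla", "working"),
    ("working", "decommission"):      ("tear_down", "decommissioned"),
    ("not-working", "decommission"):  ("tear_down", "decommissioned"),
}

def dispatch(state: str, event: str) -> tuple:
    try:
        process, next_state = STATE_EVENT_TABLE[(state, event)]
    except KeyError:
        # Invalid intersection: go offline and raise a procedure-error event
        # to the parent element.
        return "report_procedure_error", "offline"
    return process, next_state

print(dispatch("working", "sla-violation"))  # ('report_upstream', 'not-working')
print(dispatch("ordered", "sla-violation"))  # ('report_procedure_error', 'offline')
```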
State/event tables like this were familiar to me from decades of work on protocol handlers, where they’re used routinely. Perhaps that familiarity makes them easy for me to read, but state/event relationships can also be conveyed in graph form. A system with two states could be represented as two ovals, labeled with the state (“working” and “not-working”, for example). Events are depicted as arrows from one state to another, labeled with the event. If necessary, process names can be inserted along the arrows as boxes. There are plenty of languages that could be used to describe this, and those descriptions could be decomposed into an implementation.
In many cases, looking back on my tests, a modeled service had two or three “progressions” that represented transitions from the “normal” state and back, through perhaps several intermediate states. This kind of service structure could be drawn out as a kind of flow chart or directed graph. If that were done, then each “node” in the graph would represent a state and the event arrows out of it would represent the processes and progressions. In that case, if the variables for the model elements included the “node-position” instead of the “state”, the graph itself would describe event handling and state progression. For some, this approach is more readable and better expresses the primary flow of events associated with a service lifecycle.
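Here’s a sketch of that graph-driven alternative, again with illustrative names: the lifecycle is expressed as nodes and labeled edges, and the element records its node position rather than a state code.

```python
# A sketch of the same lifecycle expressed as a directed graph rather than a
# table: each node is a state, each labeled edge is an event together with the
# process that handles it. Handling an event means following the matching
# outgoing edge from the element's current node position.
GRAPH = {
    "ordered":     {"activate":      ("deploy_element", "activating")},
    "activating":  {"activated":     ("confirm_sla", "working")},
    "working":     {"sla-violation": ("report_upstream", "not-working"),
                    "decommission":  ("tear_down", "decommissioned")},
    "not-working": {"sla-restored":  ("confirm_sla", "working"),
                    "decommission":  ("tear_down", "decommissioned")},
}

def follow(node_position: str, event: str) -> tuple:
    """Return (process, new node position) for an event at a node."""
    edges = GRAPH.get(node_position, {})
    return edges.get(event, ("report_procedure_error", "offline"))

print(follow("working", "sla-violation"))  # ('report_upstream', 'not-working')
```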
Another point the testing showed is that you don’t want to get too cute with the number of states and events you define. Less is almost always better. Operators tended to create states and events that were designed to reflect everything, when what they needed to do was reflect everything relevant to process activation. If two events are handled the same way, make them one. If two states handle all their events the same way, they need to be one state. Reducing the number of state/event intersections makes processing easier and also makes it easier to understand what’s happening.
A small number of states and events also contributes to the ability to construct a “flow graph” of multiple elements at once. An operations process would generate a fairly limited number of normal state/event flows across the entire spectrum of elements. That would be easier to inspect, and so make it easier to identify any illogical sequences. For example, there are only two logical “end-states” for a service. One is the normal operating state, and the second is the “disabled/decommissioned” state. All possible state/event flows should end up in one or the other of these. If some combination of states and events sticks you somewhere else, then there’s something wrong.
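That check is easy to automate against the graph form. Here’s a sketch using the illustrative lifecycle from the earlier examples: any reachable state that is neither an end-state nor has a way forward gets flagged.

```python
# A sketch of the sanity check suggested above: starting from the initial
# state, every reachable path should be able to end in one of the two
# legitimate end-states ("working" or "decommissioned" in this illustration).
GRAPH = {
    "ordered":     {"activate": "activating"},
    "activating":  {"activated": "working"},
    "working":     {"sla-violation": "not-working", "decommission": "decommissioned"},
    "not-working": {"sla-restored": "working", "decommission": "decommissioned"},
}
END_STATES = {"working", "decommissioned"}

def reachable(start: str) -> set:
    seen, stack = set(), [start]
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        stack.extend(GRAPH.get(state, {}).values())
    return seen

# Any reachable non-end state with no outgoing edges is a dead end, which
# signals a broken lifecycle design.
for state in reachable("ordered"):
    if state not in END_STATES and not GRAPH.get(state):
        print(f"dead end at {state}")
```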
Operators seem to be able to understand, and like, the state/event approach if it’s explained to them, but most of them don’t see it much in vendor presentations. I think that policies have been more broadly used, and I think that’s a shame because I don’t think we have the same mature thinking on how to do “stateless policy handling” of lifecycle events. Perhaps more exposure to the TOSCA policy approach will offer some examples of implementations we can review, or perhaps people will take the TMF approach more seriously (including the TMF itself!). We’re coming to a critical point in lifecycle automation; too much more spent on one-off opex improvement strategies will pick all the low apples to the point where funding a better and more general approach will be difficult.