A Software-Centric View of Service Orchestration and Automation

What would an intent-modeled service lifecycle automation system look like? I’ve often talked about service modeling as an element in such a system, but what about the actual software? A service model without the related software is the classic “day without sunshine”, and in the early 2000s, an operator group in Europe actually pointed out that one modeling initiative seemed useful but might not be suitable for implementation. I think we’re due to talk about that issue now.

The software framework for service modeling was first described (as far as I have been able to determine) in the TMF work on the “NGOSS Contract” (NGOSS standing for “New Generation Operations Systems and Software”). The fundamental notion of that work was that a service contract would act as the steering mechanism for lifecycle events. When an event occurred, the model would tell software where that event had to be steered, based on traditional state/event (or graph) theory. Making this work requires two software elements.
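
To make the steering idea concrete, here’s a minimal Python sketch of the kind of state/event table a contract could carry; the states, events, and process names are mine, purely for illustration, not anything the TMF defined.

```python
# A minimal sketch of NGOSS-style state/event steering; states, events, and
# process names are illustrative only. The contract carries a table mapping
# (current state, event type) to the process that should handle the event.

STATE_EVENT_TABLE = {
    ("ordered",  "activate"): "deploy_service",
    ("active",   "fault"):    "remediate_fault",
    ("active",   "modify"):   "change_service",
    ("degraded", "restored"): "confirm_restore",
}

def steer(current_state: str, event_type: str) -> str:
    """Return the name of the process the contract steers this event to."""
    return STATE_EVENT_TABLE.get((current_state, event_type), "log_and_ignore")

print(steer("active", "fault"))   # -> remediate_fault
```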

The first element is explicit in the NGOSS Contract vision; it’s the set of destination processes to which events are steered. Implicit in the approach is the notion that these processes are functions or microservices that would receive, via the contract, the combination of event data (from the original event) and contract data (from the contract). Thus, the processes have everything they need to run. Also implicit is the idea that the process would return a “next state” indication and potentially a refresh of some contract data. This all combines to mean that you can spin up a destination process wherever and whenever it’s needed, as many as needed. It’s fully scalable.
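
Put in code terms, a destination process could be as simple as the sketch below: a pure function of the event and the contract, returning a next state and any contract data it wants refreshed. The function and field names are illustrative assumptions on my part.

```python
# Sketch of a stateless destination process: it receives everything it needs
# in the call (event data plus contract data) and returns a next-state
# indication plus any contract data to refresh. Names are illustrative.

def remediate_fault(event: dict, contract: dict):
    """Handle a fault event; nothing is kept between invocations."""
    spare = contract.get("spare_resource")
    if spare:
        # Record the substitution in the contract, not in the process.
        return "active", {"in_use_resource": spare, "spare_resource": None}
    return "degraded", {"failed_resource": event["resource"]}

print(remediate_fault({"type": "fault", "resource": "B"},
                      {"spare_resource": "C"}))
# -> ('active', {'in_use_resource': 'C', 'spare_resource': None})
```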

The second element is implicit in the vision; there has to be software that makes the event-to-process connection via the contract. It’s this second element that creates the complexity in the implementation of a service lifecycle automation software system. To try to cut through the complexity, let’s start with what I’ll call the “ONAP Approach.”

The ONAP model is an orchestration system that is driven by events. The architecture is inherently monolithic, in that while there might be multiple instances of ONAP, the presumption is that the instances are controlling independent domains. Events generated within a domain are queued and processed, and to the extent that ONAP would admit to service modeling (which is a minimal admission at best, in my view), that monolith would then use a model to invoke a process, which you’ll recall was our first software element.

The obvious problem with this is that it doesn’t scale. If there are multiple events, the single process will have to queue them for handling, and thus it’s possible that the central event process would be overwhelmed by an event flood of the type that could occur with a massive failure.

As work accumulates, the state of the system as reflected by the set of generated events evolves (which is why the events were generated), but the state of the system as known by the central process will be whatever it was at the completion of the last event it processed. You could have Event T=10:47 sitting in the queue waiting to tell you that Resource C had failed, just as your processing of Event T=10:46 decides that Resource C must now substitute for the Resource B whose failure that event reported. If there is a single event queue and central process per administrative domain, then all services compete for that single resource set, and the chances of a delay that creates a state disconnect between the real network and the process expand.

One possible solution is to have an instance of the process associated with each service. I looked at that in my first attempt to implement NGOSS Contract for that group of EU operators, and while it was useful for high-value business services, it required that you dedicate a contract handler to every service, however little that service was worth to the operator. That limits its utility, obviously, so the next step was to see if you could make the central contract process itself into a function/microservice, something you could instantiate when you needed it.

My approach to this was to think of a contract like an order to build a bike. If you have a blueprint for Bike A, and your manufacturing facility has the ability to follow that blueprint, you can create an instance of Bike A from an order, right? So a “Service Factory” (to use my name) has the ability to fill a given set of contract orders. Give an instance of that factory an order, and it could fill it. Put in software process terms, I could spin up an instance of a compatible Service Factory when I received a lifecycle event, give it the event and the contract, and it could steer the event to the process. Remember, the contract is where all the data is stored, so there is no need for persistence of information within a Service Factory instance.
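
Here’s a minimal sketch of what such a Service Factory instance might look like; the structure and names are my own illustration of the idea, not production code.

```python
# A sketch of a Service Factory instance: it holds no persistent state of its
# own, only the contract and the set of processes it can call, and it steers
# each event to the process the contract's state/event table names.

class ServiceFactory:
    def __init__(self, contract: dict, processes: dict):
        self.contract = contract      # all persistent data lives in the contract
        self.processes = processes    # process name -> callable(event, contract)

    def handle(self, event: dict):
        key = (self.contract["state"], event["type"])
        process_name = self.contract["state_event_table"][key]
        next_state, updates = self.processes[process_name](event, self.contract)
        self.contract["state"] = next_state
        self.contract.update(updates)

# Tiny demonstration with an illustrative process.
def deploy_service(event, contract):
    return "active", {"deployed_at": event["time"]}

contract = {"state": "ordered",
            "state_event_table": {("ordered", "activate"): "deploy_service"}}
ServiceFactory(contract, {"deploy_service": deploy_service}).handle(
    {"type": "activate", "time": "10:45"})
print(contract["state"])   # -> active
```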

Then, let’s suppose that our bikes are built by a flexible manufacturing system that can follow any blueprint. A single Service Factory model can now be instantiated for anything. Putting this in software process terms, I can create a central “factory” process that can be instantiated any time I create an event and need to map it to a process. All I have to do is associate the event with a contract, but how does that happen?
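
Building on the factory sketch above, on-demand instantiation could be as simple as this; load_contract and save_contract are placeholders for whatever repository actually holds contract data, not a specific product.

```python
# A fresh factory instance can be spun up per event and discarded afterward.
# load_contract and save_contract are assumed placeholders for the contract
# repository; ServiceFactory is the sketch defined earlier.

def on_event(event, load_contract, save_contract, processes):
    contract = load_contract(event["contract_id"])      # fetch the contract
    ServiceFactory(contract, processes).handle(event)   # instance lives per event
    save_contract(event["contract_id"], contract)       # persist the updated data
```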

In my approach, it happens because, with two exceptions, all contract events are generated by contract processes. A service model is a hierarchy of intent model elements. Each element has a “parent” and (potentially) multiple “children”. Model elements can pass events only up or down a level within that hierarchy. Because it’s the event-handling processes that generate everything, including events to hierarchy partners, and because those processes are always given the associated contract’s data, the events they generate stay within a single contract. There’s no need to identify which contract is involved.
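
A minimal sketch of that hierarchy might look like this; the element names and the print-based event posting are purely illustrative.

```python
# A sketch of the model hierarchy: each element knows its parent and children,
# and can post events only one level up or down, always within one contract.

class ModelElement:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent:
            parent.children.append(self)

    def notify_parent(self, event):
        if self.parent:                                   # one level up only
            print(f"{self.name} -> {self.parent.name}: {event}")

    def notify_children(self, event):
        for child in self.children:                       # one level down only
            print(f"{self.name} -> {child.name}: {event}")

service = ModelElement("Service")
vpn = ModelElement("VPN", parent=service)
access = ModelElement("Access", parent=service)
vpn.notify_parent({"type": "sla_violation"})    # child informs its parent
service.notify_children({"type": "teardown"})   # parent informs its children
```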

It would be important to deal with collision risk on updating contract data from event-linked processes. You could serialize the event handling per service at the queuing level, you could send a process only the data associated with the model object that the event is associated with, you could provide a locking mechanism…the list goes on. My preferred approach was to say that an event-linked process could only alter the data for the part of the service model it was associated with.
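
My preferred option could be sketched like this: hand each event-linked process a view of the contract that reads freely but writes only into its own element’s section. The class and field names are assumptions for illustration.

```python
# A sketch of scoped contract writes: an event-linked process gets a view of
# the contract that reads anywhere but writes only into the section belonging
# to the model element the event is associated with. Names are illustrative.

class ScopedContractView:
    def __init__(self, contract: dict, element_name: str):
        self._contract = contract
        self._element = element_name

    def read(self, element_name: str) -> dict:
        return self._contract[element_name]          # reads can span the contract

    def write(self, key, value):
        # Writes land only in this element's own section of the contract.
        self._contract.setdefault(self._element, {})[key] = value

contract = {"VPN": {}, "Access": {}}
view = ScopedContractView(contract, "VPN")
view.write("state", "degraded")
print(contract)   # -> {'VPN': {'state': 'degraded'}, 'Access': {}}
```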

What about those exceptions, though? The first exception is the events generated from the top, meaning service-level events like adding, removing, or changing a service. Obviously these events are associated with a contract by the higher-level customer or customer-service-rep portal, so they have what they need. The second exception is a bit more complicated.

What happens when something fails? This would be an event generated at the bottom, and obviously there’s not necessarily a clear association between a resource and the service(s) that depend on it. My approach is to say that there is a clear separation between what I’ll call a “service” object and a “resource” object. The former are part of the service model and pass events among themselves. The latter represent a resource commitment, so they sit at the bottom of any resource-domain structure, where the rubber meets the road. Each object type has an SLA that it commits to, based on the fact that each is an intent model. A service object’s state with respect to the SLA would be determined by the events passed by its subordinate service objects; if I have three sub-objects, I’m meeting my SLA if they are.
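
The roll-up of SLA state from subordinates could be as simple as this sketch, with illustrative state names:

```python
# A sketch of SLA roll-up: a service object's state is derived entirely from
# the states its subordinate objects report via events.

def rollup_sla_state(child_states: dict) -> str:
    """The parent meets its SLA only if every subordinate meets its own."""
    if all(state == "meeting_sla" for state in child_states.values()):
        return "meeting_sla"
    return "violating_sla"

print(rollup_sla_state({"core": "meeting_sla",
                        "access": "meeting_sla",
                        "edge": "violating_sla"}))   # -> violating_sla
```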

Passing events within a service object hierarchy means having some central mechanism for posting events. Logically, this could be associated with the service contract itself, in the form of a pointer set that defines where events are pushed onto and popped from the queue. That way you could use a central event handler, redundant handlers, area- or service-type-specific handlers, or even a dedicated per-service handler.
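
A sketch of that pointer-set idea, using an in-memory queue as a stand-in for whatever event transport would really be used; the handler and field names are illustrative.

```python
# The contract names the queue its events are pushed to and popped from, so
# the handler behind that queue can be central, redundant, regional, or
# per-service without changing the model. The deque is a stand-in transport.

from collections import deque

queues = {"central-handler": deque()}             # assumed registry of handler queues
contract = {"event_handler": "central-handler"}   # the pointer lives in the contract

def post_event(contract, event):
    queues[contract["event_handler"]].append(event)    # push to the named handler

def next_event(contract):
    q = queues[contract["event_handler"]]
    return q.popleft() if q else None                  # pop for processing

post_event(contract, {"type": "fault", "resource": "B"})
print(next_event(contract))   # -> {'type': 'fault', 'resource': 'B'}
```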

Resource objects can’t make that assumption, because resource-to-service bindings are not exclusive and may not be visible unless we make them so. Thus, a resource object has to be capable of analyzing the resources it represents to establish whether they’re meeting their SLA. In my sample implementation, this object had only a timer event that kicked it off (it was a daemon process, in UNIX/Linux terms). When it ran, it checked the state of the resources it represented (their MIBs, perhaps) and generated an SLA violation event to its superior object if that was indicated.
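
Here’s a sketch of that kind of timer-driven resource object; the polling function is a stub standing in for real MIB reads, and the names and intervals are illustrative.

```python
# A sketch of a timer-driven resource object: on each tick it polls the
# resources it represents (a stub here; a real version would read MIBs) and
# raises an SLA-violation event to its superior if anything is out of bounds.

import time

def check_resources(resource_ids):
    """Stub for polling resource status; real code would query the devices."""
    return {rid: "up" for rid in resource_ids}

def resource_daemon(resource_ids, notify_superior, interval_sec=60, ticks=1):
    for _ in range(ticks):                       # a real daemon would loop forever
        status = check_resources(resource_ids)
        failed = [rid for rid, s in status.items() if s != "up"]
        if failed:
            notify_superior({"type": "sla_violation", "resources": failed})
        time.sleep(interval_sec)

# Example: one tick, zero delay; with all resources "up", no event is raised.
resource_daemon(["routerA", "routerB"], print, interval_sec=0, ticks=1)
```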

With these approaches, it’s possible to make both of the software elements of a lifecycle automation system into scalable microservices, which means you could build a service lifecycle automation system based on a service model with the kind of hierarchy I’ve described. Note that this would also work for application lifecycle automation. I’m not suggesting this is the only way to do the job, but I do think that without the attributes of an approach like this, we’re spinning our wheels in service or application orchestration.