Inside the Most Critical Insight of Service Lifecycle Automation

I mentioned John Reilly, a TMF guru, in a prior blog, with a sad note on his passing.  And speaking of notes, I took the time over the weekend to read my notes on our conversations.  I was struck again by John’s insights, particularly because so many of his points related to issues that hadn’t yet been raised and technologies that weren’t yet popular.  I’m writing this blog both as a tribute to John’s insights and in the hope that I can make the industry understand just how impressive…and important…those insights were.

It’s been my view from the first that John’s greatest insight came with his proposal known as “NGOSS Contract”.  NGOSS is “Next-Generation OSS”, of course, and John’s work was an attempt to create a model for the OSS of the future that was, as we’d say, “event-driven”.  Event-driven software responds in real time to events, which are signals of condition changes.  In the early part of this century, the OSS/BSS industry was viewing events as little more than transactions.  One came along, you queued it, and when you popped it off the queue in your own good time, you processed the event.  When you were finished, you popped the queue again.
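To make that pattern concrete, here’s a minimal Java sketch of the queue-and-process-in-order model I’m describing.  It’s my own illustration of the era’s thinking, not code from any actual OSS/BSS product:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// The "event as transaction" pattern: every event, whatever its source or
// urgency, lands in one queue and is processed strictly in arrival order
// by a single monolithic loop.
public class TransactionalEventLoop {
    record Event(String source, String description) {}

    private final BlockingQueue<Event> queue = new LinkedBlockingQueue<>();

    public void post(Event e) { queue.add(e); }   // producers just enqueue

    public void run() throws InterruptedException {
        while (true) {
            Event e = queue.take();   // pop it off "in your own good time"
            process(e);               // finish one before touching the next
        }
    }

    private void process(Event e) {
        System.out.println("Handling " + e.description() + " from " + e.source());
    }
}
```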

John’s problem with this approach was context.  Events are real-time signals, implying real-time responses.  Yes, queuing things up in the traditional way compromises the timing of real-time, but the big problem John saw was that a modern service was made up of a bunch of loosely coupled functional systems—access, metro, core, vendor domains, device differences, subnet models, and so forth.  All these systems were doing their cooperative thing, but doing it in an asynchronous way.  Each functional system had to be considered independent at one level.  If something seen in one of the functional systems required attention, it was logical to assume an event would be generated.  What this meant was that there might be two or three or more independent sets of changes happening in a network, each of which would impact multiple services and would require handling.  How would these asynchronous needs be addressed by conventional software that simply queued events and processed them in order of arrival?

If asynchronicity and independence were one issue, the opposite condition was another.  A service made up of a series of functional systems has to coordinate the way those systems are deployed and managed, which means that we have to be able to synchronize functional-system conditions at the appropriate times and places to ensure the service works overall.

There was also a challenge, or opportunity, relating to concurrency.  If functional systems were independent but coordinated pieces of a service, it made sense to process the lifecycle events for functional systems independently and concurrently.  Why feed them all into the queue for a big monolithic process?  What was needed was a mechanism to allow functional systems to be handled independently and then to provide a means of coordinating those times/places where they came together.  That would make the implementation much more scalable.

The answer, to John, started with the SID, the TMF’s Shared Information and Data model.  A service was represented as a contract, and the contract defined the service and a set of functional sub-services, each with a to-the-customer (“Customer-Facing”) and to-the-resource (“Resource-Facing”) specification.  Since there was a definition for each of the sub-services that made up a service, the SID could store the context of each, and could also store the way the sub-service contexts related to the service context.
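To picture that, here’s a rough Java sketch of a contract along those lines.  These are my own illustrative types, not the SID’s actual classes:

```java
import java.util.List;
import java.util.Map;

// An illustrative (not actual SID) contract model: a service is a contract
// composed of functional sub-services, each carrying a customer-facing and a
// resource-facing specification plus its own operating context.
record SubService(String name,
                  String customerFacingSpec,     // the "to-the-customer" view
                  String resourceFacingSpec,     // the "to-the-resource" view
                  Map<String, String> context) {}

record ServiceContract(String serviceId,
                       Map<String, String> serviceContext,  // service-level context
                       List<SubService> subServices) {}     // sub-service contexts it relates
```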

The applicable concept was a familiar one in protocol design: the state/event table.  A communications protocol has a number of specific states, like “Initializing”, “Communicating”, “Recovering” and so forth.  It also has a number of specifically recognized events, so it would be possible to build a kind of spreadsheet with the states as the rows and the events as the columns.  Within each state/event intersection “cell”, we could define the way we expected that event to be handled in that state.  That’s the way context can be maintained: as long as there is a single source of truth about the service (the SID’s Contract), all the asynchronous behaviors of our functional sub-services can be coordinated through their state/event tables.
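In code terms, such a table is just data.  Here’s an illustrative Java sketch, with states and events of my own invention rather than anything drawn from a TMF specification:

```java
import java.util.Map;

// A state/event table sketched as data: rows are states, columns are events,
// and each cell describes how that event should be handled in that state.
enum State { INITIALIZING, COMMUNICATING, RECOVERING }
enum Event { ACTIVATE, FAULT, RESTORED }

class StateEventTable {
    // (state, event) -> expected handling: the "spreadsheet cell"
    static final Map<State, Map<Event, String>> TABLE = Map.of(
        State.INITIALIZING,  Map.of(Event.ACTIVATE, "begin setup",
                                    Event.FAULT,    "abort and report"),
        State.COMMUNICATING, Map.of(Event.FAULT,    "enter recovery"),
        State.RECOVERING,    Map.of(Event.RESTORED, "resume communicating")
    );
}
```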

Of course, writing a spreadsheet won’t do much for event processing.  John’s solution was to say that the state/event cells would contain a link to the process that handled the combination.  This was an evolution of the SID, a shift from a passive role as “data” to an active role in event-to-process steering.  It was essential if OSS/BSS were to transition to a next-generation event-driven model, the “NGOSS” or “Next-Generation OSS”.  This created a new contract model, the “NGOSS Contract.”
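A minimal sketch of that steering, again with illustrative names rather than anything from the NGOSS Contract specification itself, might look like this:

```java
import java.util.Map;
import java.util.function.Consumer;

// The evolution: each cell holds not a description but a link to the process
// that handles the state/event combination. Dispatch becomes a table lookup.
class EventSteering {
    enum State { INITIALIZING, COMMUNICATING, RECOVERING }
    enum Evt { ACTIVATE, FAULT, RESTORED }

    // each cell is a runnable process reference, not passive data
    static final Map<State, Map<Evt, Consumer<String>>> TABLE = Map.of(
        State.COMMUNICATING, Map.of(
            Evt.FAULT, serviceId -> System.out.println("recover " + serviceId))
    );

    static void dispatch(State s, Evt e, String serviceId) {
        Consumer<String> process = TABLE.getOrDefault(s, Map.of()).get(e);
        if (process != null) process.accept(serviceId);  // steer event to its process
    }
}
```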

Rather than try to frame this in a new software development model (for which there would be no tools or support in place), John wanted to use the software development model emerging at the time, which was Service Oriented Architecture, or SOA.  An event, then, would be steered through the state/event table to the SOA service process that handled it.  That process would have the NGOSS Contract available, and from that all the information needed to correctly handle the event.

Suppose you got two events at the same time?  If they were associated with the same functional sub-service, they’d have to be handled sequentially, but if the state of that sub-service was changed by the first event (as it likely would be), the second event would be treated according to the new state.  If the second event was for a different functional sub-service, its handling could be done in parallel with that of the first.  Just as with microservices today, SOA services were presumed to be scalable, so you could spin up a new process to handle an event and let it die off (or keep it around for later) when the processing of the event was completed.
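Here’s a rough Java sketch of that concurrency rule, using one single-threaded lane per sub-service.  This is my own illustration of the principle, not SOA-era tooling:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Events for the same sub-service are serialized (so each sees the state its
// predecessor left behind), while events for different sub-services run in
// parallel.
class SubServiceDispatcher {
    // one single-threaded executor per functional sub-service
    private final Map<String, ExecutorService> lanes = new ConcurrentHashMap<>();

    void dispatch(String subService, Runnable handleEvent) {
        lanes.computeIfAbsent(subService,
                k -> Executors.newSingleThreadExecutor())
             .submit(handleEvent);  // same lane = sequential; different lanes = parallel
    }
}
```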

I built on John’s thinking in my work on the TMF’s Service Delivery Framework (SDF) project.  At the request of five operators in the group, I did a quick Java project (the first ExperiaSphere project) to demonstrate that this approach could be used in SDF.  I called the software framework that processed an NGOSS Contract a “Service Factory”.  In my implementation, the Contract had all the information needed for event processing, so you could spin up a copy of a suitable factory (one that could “build” your contract) anywhere and let it process the event as it came.  Only the contract itself was singular; everything else was “serverless” and scalable in today’s terms.  This wasn’t explicit in John’s work or in my discussions with him, but it was implied.
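In rough Java terms (illustrative types, not ExperiaSphere’s actual interfaces), the factory idea boils down to this:

```java
// The factory is stateless, so any copy can process an event anywhere,
// provided it is handed the single authoritative contract.
record Contract(String serviceId, String state) {}

interface ServiceFactory {
    // all state lives in (and returns with) the contract; the factory
    // instance itself holds none, which is what makes it "serverless"
    Contract handleEvent(Contract contract, String event);
}

// A factory that can "build" this contract type, spun up on demand
class SimpleFactory implements ServiceFactory {
    public Contract handleEvent(Contract c, String event) {
        String next = "fault".equals(event) ? "recovering" : c.state();
        return new Contract(c.serviceId(), next);  // updated contract is the only output
    }
}
```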

What John had come up with was the first (as far as I know) model for distributed orchestration of scalable process components.  Think of it as the prototype for things like Amazon’s Step Functions, still a decade in the future when John did his work.  Yes, a service lifecycle automation process was what John was thinking about, but the principle could be applied to any system that consists of autonomous functional elements that have their own independent state and also obey one or more layers of collective state.  A series of functional subsystems could be assembled into a higher-layer subsystem, represented with its own model and collectivizing the state of what’s below.
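You could sketch that “collective state” idea in a few lines of Java.  The aggregation rule here is my own illustration, not anything John specified:

```java
import java.util.List;

// A higher-layer element derives its state from the states of the elements
// below it. One possible rule: the parent is UP only if every child is UP,
// FAILED if any child has failed, and PENDING otherwise.
enum ElementState { UP, PENDING, FAILED }

class CollectiveState {
    static ElementState of(List<ElementState> children) {
        if (children.stream().anyMatch(s -> s == ElementState.FAILED))
            return ElementState.FAILED;
        if (children.stream().allMatch(s -> s == ElementState.UP))
            return ElementState.UP;
        return ElementState.PENDING;
    }
}
```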

You can manage any application like this.  You can represent any stateful process, including things like distributed application testing, as such a model and such a series of coupled processes.  It was a dazzling insight, one that, had the TMF and the industry truly caught on, could have changed the course of virtualization, NFV, OSS/BSS, and a bunch of other things.

And it still could.  We’ve wasted an appalling amount of time and far too many dollars, but the market still needs the concept.  Distributed state as the sum of arbitrary collections of sub-states is still the hardest thing to get right of everything the cloud demands, and it’s the most critical piece of service lifecycle automation.  If we’re ever going to use software, even AI, to manage services and reduce costs, John’s insights will be critical.