Here’s How Cloudification of Existing OSS/BSS/NMS Could Work

Can a monolithic application be cloudified?  That’s one of the critical questions for network operators, because so much of their operations/management software inventory is monolithic.  To make matters even more complicated (and, arguably, worse), most of the implementations of management tools for things like NFV and even zero-touch automation are in large part monolithic.  Are these elements now insurmountable barriers to transformation and modernization?  Maybe not.

There are two overall models for software that handles or is dependent on external conditions.  One is the “polled” model, where the software accesses an external status resource when it needs to know about the system it represents.  Many of the old Simple Network Management Protocol (SNMP) tools would poll the management information base (MIB) for device information.  The other model is the “event” model, where the external system generates a message or signal, traditionally called an “event”.  This event is then processed by the management system.
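Just to make the distinction concrete, here’s a rough Python sketch of the two models.  Nothing here is real SNMP library code; the function and handler names are purely illustrative.

```python
import time

def polled_manager(read_status):
    """Polled model: the manager asks for status on its own schedule."""
    while True:
        status = read_status()            # e.g., an SNMP GET against a MIB variable
        if status != "ok":
            handle_problem(status)
        time.sleep(60)                    # nothing happens between polls

def event_driven_manager(register_listener):
    """Event model: the managed system pushes a message when something changes."""
    register_listener(handle_problem)     # the manager simply waits to be told

def handle_problem(info):
    print("attention needed:", info)
```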

Most operations/management software today relies on the event model, which leads to them being called “event-driven”.  We’ve had a decades-long set of initiatives aimed at making OSS/BSS systems “event-driven”, and so if everything (or nearly everything) today has graduated to events, why are we talking about this issue in a discussion of cloud-native or cloud-friendly operations?  Because there are three different subsets of the event-driven model, and they’re very different.

If we have software that’s “single-threaded”, meaning that it processes one item of work at a time (as many transaction processing systems do), it can be made to support events by adding an event queue at the front of the software.  When an event happens, it’s pushed onto the queue, which in software terms is called a “FIFO” for “first-in-first-out”.  The queue preserves arrival order, so when the software “pops” the queue (when it has capacity to do more work), it gets the next event in order of arrival.
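A minimal sketch of that arrangement, with a purely illustrative event format, might look like this:

```python
from collections import deque

event_queue = deque()                     # the FIFO sitting in front of the worker

def push_event(event):
    event_queue.append(event)             # events land in arrival order

def run_worker(process_one_event):
    while event_queue:                    # pop only when there's capacity for more work
        event = event_queue.popleft()     # oldest event first: first-in, first-out
        process_one_event(event)

# push_event({"source": "SystemA", "type": "link-down"}); run_worker(print)
```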

This is an OK approach for what we could call “limited and self-limiting” events.  That means there’s a very small chance that the event source will generate another event before the first one is processed fully.  The problem of related events arises when an event is popped for processing, and a later event in the queue relating to the same source indicates a change in condition.  We now have to either risk undertaking a processing task with out-of-date status, or check the status of the event source in real time (either by polling it or by forward-scanning the event queue) before we process.  And, of course, where does this end?
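Here’s roughly what the forward-scan option could look like; the event fields are again hypothetical, and a real system would have to decide how far this kind of coalescing should go.

```python
def pop_fresh(event_queue):
    """Pop the oldest event, but discard it if a later event from the same source
    is already waiting; the later event reflects newer status, so acting on the
    older one would mean working from out-of-date conditions."""
    while event_queue:
        event = event_queue.popleft()
        superseded = any(later["source"] == event["source"] for later in event_queue)
        if not superseded:
            return event                  # nothing newer pending; safe to process
        # otherwise drop the stale event and keep scanning
    return None
```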

Where software has multiple threads, meaning that it can do several things at once, you can have each thread pop the event queue, which brings some concurrency to the picture.  However, it also introduces an issue with respect to the relationship among events.  If Instance One of the software pops an event associated with System A, and Instance Three pops another System A event a moment later, we have two different software processes potentially colliding in the handling of related events.

With multi-threaded event-handling systems, we still have the same problem of multiple events from the same source being queued and popped for processing, but we have the additional problem that different processes could all be working on the same system, driven by the events they happened to pop.
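A naive multi-threaded pop, sketched below, shows the issue: nothing in it keeps two instances from working on events from the same source at the same time.  The names are illustrative only.

```python
import queue, threading

events = queue.Queue()                    # one shared, thread-safe FIFO

def worker(instance_name, process):
    while True:
        event = events.get()              # any idle instance grabs the next event, so two
        process(instance_name, event)     # instances can end up handling events from the
        events.task_done()                # same source at once, with nothing coordinating them

for i in range(4):
    threading.Thread(target=worker, args=(f"instance-{i+1}", print), daemon=True).start()
```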

Both these problems have additional complications arising from implementation.  One problem is that multi-threaded but monolithic software runs in one place (because it is monolithic), and the sources of the events and the places where process actions might have to be applied could be quite a distance from the monolithic process.  Another is that even multi-threading an application doesn’t make it suitable for a wide range of loads; there are limits on the number of threads you can run in a given place.  The third sub-model of event processing addresses, at least to a degree, all of these issues.

Suppose that our event-driven system is fully described by a functional model that illustrates how features, events, and resources relate to each other.  Suppose further that the current state of the event-driven system, and all the rules and policies associated with it, are held in this model.  That model is then effectively a blueprint describing goal behavior, and a single central record of current state for all the subsystems within our overall modeled system, right?
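As a purely illustrative example, such a model might be as simple as a data structure like the one below; the element and policy names are mine, not any standard’s.

```python
# A much-simplified, hypothetical service model: structure, current state,
# and policy all live in data rather than inside any one running process.
service_model = {
    "service": "vpn-101",
    "elements": {
        "access-leg": {"state": "active",   "resource": "edge-router-7"},
        "core-path":  {"state": "degraded", "resource": "mpls-lsp-12"},
    },
    "policies": {
        "core-path": {"on-fail": "reroute"},
    },
}
```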

If that’s true, then any software process instance designed to work on the model, meaning to apply an event to the model, would be able to process that model correctly, and thus process events against it.  We could spin up an instance of the process set (which, in a presentation I did for the TMF’s SDF group about a decade ago, I described as a “service factory”) and it would be able to process the event by accessing the model.
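A sketch of such a process, built on the toy model above, could look like this; again, the field names are assumptions of mine, not a real implementation.

```python
def apply_event(model, event):
    """Any instance of this function can be spun up to handle an event, because
    everything it needs, the current state and the policy, is in the model."""
    element = model["elements"][event["element"]]
    policy = model["policies"].get(event["element"], {})
    if event["type"] == "fail" and policy.get("on-fail") == "reroute":
        element["state"] = "rerouting"    # the new state is recorded back in the model
    return model

# apply_event(service_model, {"element": "core-path", "type": "fail"})
```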

This process would allow a “master event source” to scale a service factory to handle events.  The events could be associated with different services/systems, or with different parts of the same one.  We’d still need a “locking” mechanism that would allow for the queuing of events, but only when the events were associated with the same subsystem and a process for that subsystem was already running.
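One possible (and deliberately simplified) form of that locking mechanism is sketched below; it assumes a single dispatcher is doing the gating, and the “element” field is just my illustrative way of identifying the subsystem.

```python
import queue
from collections import defaultdict

in_flight = set()                         # subsystems that already have a process running
held = defaultdict(queue.Queue)           # per-subsystem queues for events that must wait

def dispatch(event, start_process):
    subsystem = event["element"]
    if subsystem in in_flight:
        held[subsystem].put(event)        # same subsystem is busy: queue it, don't collide
    else:
        in_flight.add(subsystem)
        start_process(event)              # unrelated subsystems proceed concurrently

def process_complete(subsystem, start_process):
    if not held[subsystem].empty():
        start_process(held[subsystem].get())   # hand the next held event to a fresh process
    else:
        in_flight.discard(subsystem)           # nothing waiting: the subsystem is free again
```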

The critical piece of a service model is a state/event table.  Each element of the model would have such a table, and the table would identify the operating states (set by event processing) and the events that could be received.  The combination, by implementation tradition, would yield a process ID and a “next state” to indicate the state transition.  This is how monolithic stuff can be coupled to our third kind of event processing.
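A toy version of such a table, with state, event, and process names I’ve invented for illustration, might look like this:

```python
# A toy state/event table for one model element: the pair (current state,
# incoming event) yields the process to run and the next state to record.
STATE_EVENT_TABLE = {
    ("ordered",    "activate"): ("setup_process",     "activating"),
    ("activating", "ready"):    ("confirm_process",   "active"),
    ("active",     "fault"):    ("remediate_process", "repairing"),
    ("repairing",  "restored"): ("confirm_process",   "active"),
}

def lookup(state, event):
    return STATE_EVENT_TABLE.get((state, event))   # (process, next_state) or None
```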

Each of the process elements, threads, components, or whatever, in a monolithic application could now be fed by the service factory according to the state/event table in the model.  The service factory process then acts as the queuing mechanism, resolves collisions, handles scaling up front, and so forth.  You can still have event queues for the monolithic component, but if the service model and factory are properly designed, the conflicts and collisions are weeded out before something is queued.
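Here’s a sketch of how a service factory might use that table to drive the existing components; it reuses the toy model and table from the sketches above, and the process names remain hypothetical.

```python
def factory_dispatch(model, element_name, event_name, processes):
    """Service-factory sketch: consult the element's state/event table, run the
    indicated process, then record the next state back in the model.  The
    STATE_EVENT_TABLE and model layout come from the sketches above."""
    element = model["elements"][element_name]
    entry = STATE_EVENT_TABLE.get((element["state"], event_name))
    if entry is None:
        return                                      # the event means nothing in this state
    process_name, next_state = entry
    processes[process_name](element, event_name)    # the existing (possibly monolithic) component
    element["state"] = next_state                   # state lives in the model, not the process
```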

If the processes identified in the state/event table are designed to work from the data passed to them and to pass results back, they could likely be scaled in an almost-cloud-like way.  If not, the approach would at least define the specific goals of an operations software remake—turn the components into autonomous functions.  That alone would be an advance from today, where we have little specific guidance on what to look for in things like OSS/BSS modernization.
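For illustration, a handler written that way might be as simple as this (the field names are, again, mine):

```python
def remediate_process(element_data, event_name):
    """Written to use only what it is handed and to return its result, this handler
    keeps no state of its own, so extra copies can be started wherever there is
    capacity: the almost-cloud-like scaling described above."""
    return {"resource": element_data["resource"], "action": "reroute", "trigger": event_name}
```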

There’s nothing new about this approach.  As I noted above, I presented it to the TMF over a decade ago, and it’s also a fairly classic application of state/event and modeling technology.  All of this means that this could have been done at least a decade ago.  That doesn’t mean it couldn’t (and shouldn’t) be done now, but it does justify a bit of skepticism about the willingness of operators and vendors to take the necessary steps.

Either group could break the logjam.  AT&T still wields considerable influence on ONAP, and they could drive that initiative to do more.  I stopped taking ONAP briefings until (as I told them) they adopt a service-model-centric approach.  AT&T could accelerate that shift.  Any major vendor in the networking space could easily assemble the pieces of this approach, buy a couple of companies to make it real, and push the market in that direction.  Progress is possible, meaning that we could take a giant step toward cloud-native technology in carrier cloud infrastructure more easily and less expensively than most operators think would be possible.

It’s also worthwhile to note that this same approach could be applied to enterprise software to integrate cloud-ready elements.  The cloud providers could be the drivers of this, or it could be vendors like VMware, whose aspirations and capabilities in the cloud space are obvious to all.  VMware, of course, is also casting covetous eyes at the telecom space, so perhaps they could take a lead in both areas.  Competitor Red Hat, under the IBM umbrella, could also do something in this space.

I do think that broader redesign of operations software for the cloud would be valuable, but the big benefits, the majority of opex savings and process improvements, could be realized with this much smaller step.  I hope somebody finally decides to take it.