Scalability: Why We Don’t Have as Much as We Think

One of the most profound benefits of the cloud is the elasticity or scalability that virtualization of hosting resources can create.  Think of it as having a computer whose power expands and contracts with the level of presented work and you’ll get the idea.  The problem is that this kind of scalability isn’t a natural attribute of virtual hardware; you need to have a software framework in which the hardware deploys to harness it.  Yes, I know it’s customary to think of software as deploying in a hardware framework, but remember we’re talking about virtualization here.

Traditional networks don’t expand and contract that way.  Yes, you can allocate more capacity or resources to a given service or connection (providing you have some to spare where it’s needed), but you don’t typically make routers bigger or expand their number dynamically to create additional pathways for information flow when things get hot and heavy.  Operators are almost uniformly of the view that hosted network functions are scalable and elastic, and that this benefit is important to justifying things like NFV.

Cloud computing’s basic tools and software have evolved to recognize and realize this scalability benefit.  We know now that a scalable element has to “look” like a single element from the outside, that inside it must include some capability to balance work across the instances that scalable behavior creates, and that the scaling process has to be able to create and destroy those instances as needed, without disrupting the way work is processed.  These scalability features are a big part of what’s called “cloud-native” design, meaning software designed to exploit the special properties of the cloud rather than pre-cloud software simply moved into the cloud.
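
To make those three properties concrete, here’s a minimal Python sketch of the shape of the idea.  The names are my own invention, not anything from a real cloud toolkit: one handle() entry point faces the outside world, a trivial round-robin balancer sits inside, and instances can be created or destroyed without the caller noticing.

```python
import itertools

class ScalableElement:
    """Looks like a single element from the outside; inside, it balances
    work across worker instances that can be created or destroyed."""

    def __init__(self, worker_factory, initial_instances=1):
        self._factory = worker_factory
        self._workers = [worker_factory() for _ in range(initial_instances)]
        self._rr = itertools.cycle(range(len(self._workers)))  # simple round-robin

    def handle(self, work_item):
        # Callers see one handle() entry point, not the individual instances.
        worker = self._workers[next(self._rr) % len(self._workers)]
        return worker(work_item)

    def scale_to(self, n):
        # Create or destroy instances without changing the external interface.
        while len(self._workers) < n:
            self._workers.append(self._factory())
        del self._workers[n:]
        self._rr = itertools.cycle(range(len(self._workers)))


# Hypothetical usage: the "worker" here is just a function that processes an item.
element = ScalableElement(lambda: (lambda item: f"processed {item}"), initial_instances=2)
print(element.handle("packet-1"))
element.scale_to(4)   # elastic expansion under load
print(element.handle("packet-2"))
```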

Scalability is based on a simple principle of abstraction.  A software element has to present itself to its partner elements as a single component, but resolve internally into as many component instances as needed.  If multiple instances of something are to handle work as though they were a singularity, they have to behave based on what might be called “collective” or “distributed” state.  To the extent that there is an order of outputs related to the order of inputs, that relationship has to be preserved no matter where the work goes.

This isn’t a new problem; IP networks, which are “connectionless” or “datagram” networks, can deliver packets out of order, and so a layer (TCP) is added to sequence things or detect a missing piece.  TCP is an end-to-end layer, so how the intermediate pieces are connected and how many paths there are doesn’t matter.
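
A toy illustration of that end-to-end sequencing idea, in Python (my own simplification, not TCP’s actual mechanics): the receiver buffers out-of-order arrivals by sequence number, delivers whatever is contiguous, and anything still buffered marks a gap.

```python
class Resequencer:
    """Toy end-to-end resequencer in the spirit of TCP: intermediate paths
    don't matter; the receiver reorders by sequence number and can detect
    a missing piece."""

    def __init__(self):
        self.expected = 0
        self.buffer = {}          # out-of-order arrivals, keyed by sequence number

    def receive(self, seq, payload):
        delivered = []
        self.buffer[seq] = payload
        # Deliver everything that is now contiguous.
        while self.expected in self.buffer:
            delivered.append(self.buffer.pop(self.expected))
            self.expected += 1
        return delivered

    def missing(self):
        # Anything still buffered implies a gap starting at self.expected.
        return self.expected if self.buffer else None


rs = Resequencer()
print(rs.receive(1, "B"))   # [] -- arrived early, held
print(rs.receive(0, "A"))   # ['A', 'B'] -- gap filled, delivered in order
```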

Transactional applications pose a different state problem.  A transaction is usually an ordered sequence of steps at the user or functional level.  These steps might involve a sequence of datagrams, and those datagrams might be suitable for resequencing among themselves via TCP, but the steps themselves have an order too.  If an instance of a functional component handles the first step in a transaction, and a different one handles the second step, it’s still important that somehow Step 2 doesn’t get pushed ahead of Step 1, and also that nothing saved in the process of doing Step 1 is lost if Step 2 goes to another software instance for handling.
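
One common way of handling this, sketched below in Python with hypothetical names, is to keep the transaction’s context in a store shared by all the instances, so whichever instance gets Step 2 sees what Step 1 saved, and an out-of-order step is caught before it does any damage.

```python
# A minimal sketch (hypothetical names) of keeping transactional context
# outside the handling instances, so any instance can pick up Step 2.

transaction_store = {}   # in real systems this would be a shared/distributed store

def handle_step(txn_id, step_number, data):
    record = transaction_store.setdefault(txn_id, {"last_step": 0, "saved": {}})
    if step_number != record["last_step"] + 1:
        # Step arrived out of order; reject or queue it rather than
        # letting Step 2 get pushed ahead of Step 1.
        raise ValueError(f"step {step_number} out of order "
                         f"(expected {record['last_step'] + 1})")
    record["saved"][step_number] = data      # nothing saved in Step 1 is lost
    record["last_step"] = step_number
    return record


# Any two different "instances" (here, just two separate calls) see the same context.
handle_step("txn-42", 1, {"item": "router"})
print(handle_step("txn-42", 2, {"qty": 3}))
```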

This sort of thing may not seem like much of an issue, given that I’ve already said that NFV has focused on box-replacement missions and that VNFs are thus handling IP packets in most cases, just as routers would.  There are two reasons why that would be faulty reasoning.  First, as I’ve noted in earlier blogs, there are no future connection services that will deliver improved profit per bit.  Second, service lifecycle software, NFV software, and pretty much any management system software is going to have to deal with massive numbers of events, and maintain at least chronological context while doing so.

Let’s say that there are ten thousand active services, and that there’s a fault that impacts perhaps 20% of those services.  That means two thousand impacts, and in many cases the “impact” is going to be signaled by every element of every service that recognizes a problem.  We could easily have tens of thousands of events generated by our fault, all flowing into the conceptual mouth of a service lifecycle automation system and expecting remediation.  So two thousand impacted services, twenty thousand events, as an example.
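
The arithmetic, for anyone who wants to play with the assumptions (the ten events per impacted service is simply the figure that makes my example’s numbers work, not an operator statistic):

```python
# Back-of-the-envelope version of the example above.
services = 10_000
fault_fraction = 0.20
events_per_impacted_service = 10   # assumption chosen to match the text

impacted = int(services * fault_fraction)        # 2,000 impacted services
events = impacted * events_per_impacted_service  # 20,000 events to remediate
print(impacted, events)
```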

If you look at the pictures being drawn for NFV management and orchestration or ONAP, you’ll see a series of boxes that collectively represent a software automation system.  Emphasis on the singular “a” here.  The events arrive in whatever order the timing of problem recognition and the electrical distance from problem to management system happen to dictate.  They’re not nicely arranged by service, nor are they guaranteed to arrive in the order of their generation, even when they’re associated with the same service.

All this stuff is dumped into a queue, and as the system has the capacity to do something, it pops the queue to grab an event.  First question: what service does the event belong to?  Without knowing that, we can’t process it.  Second, where are we (if anywhere) in the process of handling the problem, based on prior events we might have processed and prior steps we might have taken?  If we don’t establish that, we’ll see later events doing things that step on actions undertaken in response to earlier events.
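
Here, caricatured in a few lines of Python (the names are mine, not ONAP’s), is that monolithic pattern: one queue, one processing loop, and per-event correlation and state lookups before anything useful can happen.

```python
from collections import deque

# A caricature (hypothetical names) of the monolithic pattern: one queue,
# one processing loop, and correlation work done per event.

event_queue = deque()
service_state = {}        # service_id -> current lifecycle state

def correlate(event):
    # In a monolith this lookup ("what service does this belong to?")
    # must happen before anything else can be done with the event.
    return event.get("service_id")

def process_events():
    while event_queue:
        event = event_queue.popleft()           # the single funnel point
        service_id = correlate(event)
        if service_id is None:
            continue                            # uncorrelatable event
        state = service_state.get(service_id, "operational")
        # Handling depends on the state established by *earlier* events;
        # a later event must not step on actions already under way.
        if state == "operational" and event["type"] == "fault":
            service_state[service_id] = "remediating"

event_queue.extend([{"service_id": "svc-7", "type": "fault"},
                    {"service_id": "svc-7", "type": "fault"}])
process_events()
print(service_state)      # {'svc-7': 'remediating'} -- the second fault was absorbed
```

The single popleft() is the funnel; scale the example up to twenty thousand events and every one of them waits its turn behind that one loop.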

This is the problem with monolithic models for operations or lifecycle automation.  If you don’t see ONAP that way, I recommend you review THIS piece in Light Reading.  The critical quote is “They’re able to get this huge elephant to work in the cloud in one virtual machine.  Given the criticism historically that ONAP is a monster that needs a supercomputer to run it and an army to install it, here you have 20 blokes running a scaled-down version. It’s not as heavy a lift as it’s made out to be.”

I agree that having a small footprint to run ONAP instead of a supercomputer is better, but I submit that this way lies madness.  It’s an admission that ONAP is based on a monolithic architecture, that there’s a point of processing through which all service events must funnel.  Return now to our twenty thousand service events.  How many events per second could a single VM handle?  And think of an operator with tens of thousands of customers and services instead of my modest example.  What’s needed is scalability, and you can’t scale monolithic processes you’ve struggled to cram into a single VM.

The exact opposite should be the goal; instead of making ONAP fit on a smaller server, ONAP should run as a series of coupled microservices, infinitely scalable and distributable.  My beef with the project has centered around the fact that it wasn’t designed that way, and the Dublin release proves it’s not going in that direction—probably ever.

Imagine now a nice service-model-driven approach.  The model is a hierarchy of the very type that the TMF envisioned with its SID model.  At the bottom of each branch in the hierarchy lies the process set that maps the model to real infrastructure, so real problems will always be recognized at these bottom points.  When that happens, we need to handle the associated event, and to do that, we look to the state/event table that’s resident in the model element that contains the fault.  In my terms, that “looking” is done by an instance of a Service Factory element, whose only source of information is the model and the event.  That means we can spin up the instance as needed.  There is no resident process.  In fact, logically, there is a process available for each of our two thousand services, if we want.

The Factory sees that it’s supposed to run Process X.  That process is also likely suitable for instantiation as needed, and its only inputs are the event and the service model that launched it.  The process might be the same for all two thousand services, but we can instantiate it as needed.  When the process completes, it can set a new state in the state/event table for the element that spawned it, and if necessary, it can generate an event to the model element above, or the element below.  The combined state/event tables keep everything synchronized.  No looking to see what something is associated with; it’s associated with what generated it.  No worrying about colliding activities, because the state tables maintain context.
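
Here’s a minimal Python sketch of that model-driven pattern.  All the class and process names are my own illustrations, not TMF or ONAP definitions, and the state/event table is cut down to two entries: the Service Factory is stateless, its only inputs are the model element and the event, and context lives entirely in the model.

```python
from dataclasses import dataclass, field

@dataclass
class ModelElement:
    name: str
    state: str = "operational"
    parent: "ModelElement | None" = None
    # state/event table: (state, event) -> (process name, next state)
    table: dict = field(default_factory=lambda: {
        ("operational", "fault"):    ("remediate", "remediating"),
        ("remediating", "restored"): ("confirm",   "operational"),
    })

def remediate(element, event):
    # The process's only inputs are the event and the model element; it could
    # run anywhere, and as many copies as needed could run at once.
    return {"target": element.parent, "type": "fault"} if element.parent else None

def confirm(element, event):
    return None

PROCESSES = {"remediate": remediate, "confirm": confirm}

def service_factory(element, event):
    """A Service Factory instance: spun up per event, stateless itself,
    steered entirely by the element's state/event table."""
    entry = element.table.get((element.state, event["type"]))
    if entry is None:
        return                                   # event not meaningful in this state
    process_name, next_state = entry
    upward_event = PROCESSES[process_name](element, event)
    element.state = next_state                   # context lives in the model, not the code
    if upward_event:
        service_factory(upward_event["target"], upward_event)

# Hypothetical two-level service model: a service with one access element.
service = ModelElement("service-42")
access = ModelElement("access-leg", parent=service)

service_factory(access, {"type": "fault"})
print(access.state, service.state)   # remediating remediating
```

Notice that a second “fault” arriving while an element is already in the “remediating” state simply finds no table entry and is ignored; that’s the collision protection the state/event tables provide, with no central queue or correlation step anywhere.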

This model-driven approach, with a service contract coupling events to processes, is naturally scalable.  This is what the cloud would do.  It is not what we’re specifying in places like the NFV ISG or the ETSI Zero-touch group, nor within the multiple implementations of MANO and VNFM, nor within ECOMP.  We have defined things that are supposed to be cloud-hosted, and yet are not themselves cloud-ready or even cloud-compatible.

The TMF’s NGOSS Contract work promoted this approach, over a decade ago when the cloud was still largely a fantasy.  As a sad coincidence, the primary architect of this truly visionary approach, John Reilly, passed away recently.  I’d talked with John many times about NGOSS Contract and its applications and implications, and these conversations were the primary source of my own education on the notion of data-model steering of events.  John had it right all along, and so I’m taking this opportunity to shout out in his honor.

Thanks to John’s insight, we’ve had over a decade to build the right approach to lifecycle software and also to hosted features and functions.  I’d hate to think we wasted that time, and in fact the cloud community has embraced most of what John saw so long ago.  I’d like to think that the TMF and the network operator community will eventually do the same.