Why “State” Matters in NFV and the Cloud

It’s time we looked more closely at the subject of “state”, not in a governmental sense but in the way that software elements behave, or should behave.  State, in a distributed system, is everything.  The term “state” is used in software design to indicate the notion of context, meaning where you are in a multi-step process.  Things that are “stateful” have specific context and “stateless” things don’t.  When you have states, you use them to mark where you are in a process that involves multiple steps, and you move from one state to another in response to some outside condition we could call an “event”.  Sounds simple, right?  It is anything but.  Where we’ve run into state issues most often in the networking space is NFV, because NFV deploys software functions and expects to provide resiliency by replacing them, or scalability by duplicating them.  There are two dimensions of state in NFV, lifecycle state and functional state, and both of them can be important.
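The state/event idea described above can be sketched as a simple lookup table.  This is a minimal illustration, not anything drawn from an NFV specification; the state names, the event names, and the `next_state` helper are all assumptions made up for the example.

```python
from enum import Enum, auto

class State(Enum):
    # Hypothetical lifecycle states, chosen only for illustration.
    ORDERED = auto()
    DEPLOYING = auto()
    ACTIVE = auto()
    FAILED = auto()

# (current state, event) -> next state
TRANSITIONS = {
    (State.ORDERED, "deploy"): State.DEPLOYING,
    (State.DEPLOYING, "deployed"): State.ACTIVE,
    (State.DEPLOYING, "error"): State.FAILED,
    (State.ACTIVE, "error"): State.FAILED,
    (State.FAILED, "deploy"): State.DEPLOYING,
}

def next_state(state, event):
    """Return the state that follows `state` on `event`.

    Events that have no meaning in the current state leave it unchanged.
    """
    return TRANSITIONS.get((state, event), state)

state = State.ORDERED
for event in ["deploy", "deployed", "error", "deploy"]:
    state = next_state(state, event)
print(state)  # State.DEPLOYING
```

The point of the table is that the same event ("deploy") means different things depending on context, and that context is exactly what "state" records.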

When I’ve described NFV deployment as a hierarchical model structure, I’ve noted that each of the elements in the model was an independent state machine, meaning that each piece of the model had its own state.  That state represented the lifecycle progress of the modeled service, so we can call it “lifecycle state”.  Lifecycle state is critical to any NFV framework because there are many places in a service lifecycle where “subordinate” behaviors have to be done before “superior” ones can be.  A service, at a high level, isn’t operational till all its pieces are, and so lifecycle state is critical in resolving dependencies like that.  Where lifecycle state gets complicated is during special events like horizontal scaling of the number of instances or replacement of an instance because something broke.
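The dependency rule above, that a "superior" element isn't operational until its "subordinates" are, can be sketched with each model element as its own small state machine.  This is a hedged illustration only; the `ModelElement` class, its state strings, and the event names are assumptions of mine, not part of any NFV standard.

```python
class ModelElement:
    """One element of a hypothetical hierarchical service model."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        self.state = "ordered"
        for child in self.children:
            child.parent = self

    def on_event(self, event):
        if event == "activated" and not self.children:
            # A leaf becomes active when its own deployment completes.
            self._set_state("active")
        elif event == "child_active":
            # A superior element becomes active only when every subordinate is.
            if all(c.state == "active" for c in self.children):
                self._set_state("active")

    def _set_state(self, new_state):
        self.state = new_state
        if new_state == "active" and self.parent is not None:
            # Report the change upward as an event to the parent.
            self.parent.on_event("child_active")

firewall = ModelElement("firewall")
router = ModelElement("router")
access = ModelElement("access", [firewall, router])
service = ModelElement("service", [access])

firewall.on_event("activated")
print(service.state)  # ordered (the router is not up yet)
router.on_event("activated")
print(service.state)  # active
```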

The complexity of scaling, in lifecycle state, lies in the scope of the process and in the mechanism for selecting an instance to receive a particular packet, which is load balancing.  When you instantiate a second instance of a scalable VNF, you probably have to introduce a load-balancer because there is now a choice of instances to make for each packet.  In effect, we have a service model with a load-balancer in it that is not yet active, and we have to activate it and connect it.
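As a rough sketch of that scaling step, the model below carries a load-balancer that stays dormant until a second instance exists.  The `ScalableFunction` class, the instance names, and the round-robin policy are all assumptions for illustration; a real load-balancer would normally pick an instance based on the packet or flow rather than simple rotation.

```python
import itertools

class ScalableFunction:
    """Hypothetical model of a VNF with a modeled-but-dormant load-balancer."""

    def __init__(self):
        self.instances = ["vnf-1"]
        self.load_balancer_active = False
        self._rotation = None

    def scale_out(self):
        """Add an instance, then activate and connect the load-balancer."""
        self.instances.append(f"vnf-{len(self.instances) + 1}")
        self.load_balancer_active = True
        self._rotation = itertools.cycle(list(self.instances))

    def route(self, packet):
        """Pick the instance that receives `packet` (ignored by round-robin)."""
        if not self.load_balancer_active:
            return self.instances[0]  # only one choice, nothing to balance
        return next(self._rotation)

function = ScalableFunction()
print(function.route("p0"))  # vnf-1
function.scale_out()
print([function.route(p) for p in ("p1", "p2", "p3")])  # ['vnf-1', 'vnf-2', 'vnf-1']
```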

In replacement, the problem depends on just how widespread the impact of your replacement has to be.  If you can replace a broken server with another in the same rack, there is a minimal amount of reconnection.  In that case, the deployment of the new VNF could make the correct connections.  However, if you had to put the new VNF somewhere quite distant, there are WAN connection requirements that local deployment could not hope to fulfill.  That means that you have to buck part of the replacement work “upward” to another element.  Which, of course, means that you had to model another element in the first place.
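The "buck it upward" behavior above can be sketched as a fault handler that tries to fix things within its own scope and otherwise passes the problem to the element above it.  Everything here is hypothetical: the `ModelElement` class, the `local_spares` counter, and the "rack-7" and "metro" names are assumptions, and a real parent element would also have to rebuild the WAN connections rather than just pick a host.

```python
class ModelElement:
    """One node in a hypothetical hierarchical service model."""

    def __init__(self, name, parent=None, local_spares=0):
        self.name = name
        self.parent = parent
        self.local_spares = local_spares  # hosts this element can redeploy onto

    def handle_fault(self, log):
        """Fix the fault within this element's scope, or pass it to the parent."""
        if self.local_spares > 0:
            self.local_spares -= 1
            log.append(f"{self.name}: redeployed within own scope")
            return self.name
        if self.parent is None:
            log.append(f"{self.name}: nothing left to try")
            return None
        log.append(f"{self.name}: escalating to {self.parent.name}")
        return self.parent.handle_fault(log)

metro = ModelElement("metro", local_spares=1)
rack = ModelElement("rack-7", parent=metro, local_spares=1)

log = []
first = rack.handle_fault(log)   # a spare exists in the same rack
second = rack.handle_fault(log)  # no local spare left, so the parent takes over
print(first, second)  # rack-7 metro
```

Note that the escalation only works because "metro" was modeled in the first place, which is the point made above.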

The rightful meaning of the term “orchestration” is the coordination of separate processes, and that’s what’s needed for lifecycle management.  Lifecycle state is an instrument in that coordination, a way of telling whether something is set up as expected and working as planned and, if it isn’t, of tracking it through a series of steps to get the thing going correctly.

The individual virtual network functions (VNFs) of NFV also have functional state, meaning that the VNF software, as part of its data-plane dialog with users and/or other VNFs, may have a state as well.  For example, a VNF that’s “session-aware”, meaning that it recognizes when a TCP session is underway, has to remember that the session has started and that it hasn’t yet ended.  If you’re actually processing a TCP flow, you will have to implement slow-start, recognize out-of-order arrivals, etc.  All of this is stateful behavior.

Stateful behavior in VNF functionality means that you can’t readily spawn a new or additional copy of a VNF and have it substitute for the original, because the new copy won’t necessarily “know” about things like a TCP session, and thus won’t behave as the original copy did.  Thus, functional statefulness can limit the ability of lifecycle processes to scale or replace VNFs.
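A toy example shows why a fresh copy can't simply stand in for the original.  This is a deliberately simplified sketch, nowhere near real TCP handling; the `SessionAwareVNF` class and the "SYN"/"DATA"/"FIN" packet kinds are assumptions used only to make the point.

```python
class SessionAwareVNF:
    """Hypothetical VNF that only passes traffic for sessions it saw start."""

    def __init__(self):
        self.open_sessions = set()  # functional state held inside the instance

    def handle(self, session_id, kind):
        if kind == "SYN":
            self.open_sessions.add(session_id)
            return "accept"
        if session_id not in self.open_sessions:
            return "drop"  # this instance never saw the session begin
        if kind == "FIN":
            self.open_sessions.discard(session_id)
        return "accept"

original = SessionAwareVNF()
original.handle("s1", "SYN")

replacement = SessionAwareVNF()  # a new copy, spawned by scaling or repair

print(original.handle("s1", "DATA"))     # accept
print(replacement.handle("s1", "DATA"))  # drop
```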

Functional state is difficult because it’s written into the VNFs.  You can impose lifecycle state from above, so to speak, because the VNFs themselves aren’t generally doing lifecycle stuff.  You can’t impose functional state because it’s part of the programming of the VNF.  This is why “functional programming” has to address state in some specific way; it’s used to create things that can be instantiated instantly, replaced instantly, and scaled in an unfettered way.  The processes of instantiating, replacing, and scaling are still lifecycle-state-driven, but the techniques used by the programmer to manage functional state still have to be considered, or you may create a second copy of something only to find that it breaks the process instead of helping performance.

To make things a bit more complex, you can have things that are “stateless” in a true sense, and things that have no internal state but are still stateful.   This is what Nokia is apparently proposing in its AirGile model (I’ll blog more on AirGile next week).  Most such models rely on what’s called “back-end state”, where an outside element like a database holds the stateful variables for a process.  That way, when you instantiate a new copy of something, you can restore the state of the old copy.
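Back-end state can be sketched as logic that keeps nothing in the instance itself and reads and writes its context through an outside store.  This is only an illustration of the general idea, not a description of AirGile; the `StateStore` and `ExternalizedVNF` classes and the key name are assumptions, and an in-memory dictionary stands in for the external database.

```python
class StateStore:
    """Stand-in for an external database; a dict keeps the sketch self-contained."""

    def __init__(self):
        self._data = {}

    def save(self, key, value):
        self._data[key] = set(value)

    def load(self, key):
        return set(self._data.get(key, ()))

class ExternalizedVNF:
    """Hypothetical VNF with no internal state; context lives in the store."""

    def __init__(self, store, key="vnf-1/sessions"):
        self.store = store
        self.key = key

    def handle(self, session_id, kind):
        sessions = self.store.load(self.key)  # restore state on every call
        if kind == "SYN":
            sessions.add(session_id)
        elif session_id not in sessions:
            return "drop"
        elif kind == "FIN":
            sessions.discard(session_id)
        self.store.save(self.key, sessions)   # write state back
        return "accept"

store = StateStore()
original = ExternalizedVNF(store)
original.handle("s1", "SYN")

replacement = ExternalizedVNF(store)     # new copy, same back-end state
print(replacement.handle("s1", "DATA"))  # accept
```

Every call here makes a round trip to the store, which is where the transport delay for back-end state comes from.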

The only negative about back-end state control is that there may be a delay associated with transporting the state, both in saving the state in the “master” copy and in moving that saved state to the point where a new copy is going to be instantiated.  This may have to be considered in some applications where the master state can change quickly, but in both scaling and fault recovery you can usually tolerate a reasonable delay.

Every NFV service has lifecycle state, but not every service or service element has functional (internal) state.  Things that are truly stateless, referencing back to functional programming, can be instantiated as needed and replicated as indicated, and nothing bad happens because every copy of the logic can stand in for every other copy, since nothing is saved during operation.  True stateless logic is harder to write but easier to operationalize because you don’t have to worry about back-end state control, which adds at least one lifecycle state to reflect the process of restoring state from that master copy.
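For contrast, here is a minimal sketch of truly stateless logic: the answer depends only on the packet presented, so any copy gives the same result and nothing has to be saved or restored.  The `filter_packet` function, the blocked address ranges, and the allowed ports are assumptions invented for the example.

```python
from ipaddress import ip_address, ip_network

# Hypothetical policy, fixed at deployment time and never changed at run time.
BLOCKED_SOURCES = (ip_network("10.0.0.0/8"), ip_network("192.168.0.0/16"))
ALLOWED_PORTS = (80, 443)

def filter_packet(src, dst_port):
    """Decide on a packet using only the packet itself; nothing is remembered."""
    if any(ip_address(src) in net for net in BLOCKED_SOURCES):
        return "drop"
    return "accept" if dst_port in ALLOWED_PORTS else "drop"

print(filter_packet("8.8.8.8", 443))   # accept
print(filter_packet("10.1.2.3", 443))  # drop
```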

While state is important for NFV, it’s not unique to NFV.  As I noted in my opening paragraph, state is an attribute of nearly all distributed systems, because the process of deploying such systems will always, at the least, involve lifecycle states on the components.  That means that logically we might want to think about cloud systems, applications, and services as being the same thing under the covers, and look for a solution to managing both lifecycle state and functional state that can be applied to any distributed (meaning, these days, cloud-distributed) system.

Lifecycle state, as I’ve noted in earlier blogs, can be managed by using what I’ve called representational intent, a model that stands in for the real software component and manages the lifecycle process as it relates both to the service management system overall and to the individual components.  In effect, the model becomes a proxy for the real stuff, letting something that doesn’t necessarily have a common lifecycle management capability (or even have any lifecycle awareness) be fit into a generalized service management or application management framework.
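The proxy idea above can be sketched as a small model object that wraps a component with no lifecycle awareness and presents a uniform set of states and events to the management system.  This is my own hedged illustration; `LegacyFirewall`, `IntentModel`, and the event names are hypothetical, and a real model would carry far more than three callbacks.

```python
class LegacyFirewall:
    """Hypothetical component with no lifecycle awareness of its own."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

class IntentModel:
    """Proxy that presents a uniform lifecycle interface for any component."""

    def __init__(self, name, start, stop, probe):
        self.name = name
        self._start, self._stop, self._probe = start, stop, probe
        self.state = "ordered"

    def on_event(self, event):
        if event == "activate":
            self._start()
            self.state = "active" if self._probe() else "failed"
        elif event == "deactivate":
            self._stop()
            self.state = "ordered"
        elif event == "health_check" and self.state == "active":
            if not self._probe():
                self.state = "failed"
        return self.state

firewall = LegacyFirewall()
model = IntentModel("fw", firewall.start, firewall.stop, lambda: firewall.running)

print(model.on_event("activate"))      # active
firewall.running = False               # simulate a crash the firewall never reports
print(model.on_event("health_check"))  # failed
```

The management system only ever talks to `IntentModel`, so anything that can be started, stopped, and probed can be fitted into the same framework.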

Data models, or small software stubs, can provide representational intent modeling and there have been a number of examples of this sort of lifecycle state modeling, all discussed HERE.  It’s not clear whether modeling could handle functional state, however, beyond perhaps setting up the state in a back-end state control system.  The statefulness of logic is a property of the logic itself, and even back-end state control would be difficult, to say the least, if the underlying software didn’t anticipate it.

I think it’s clear that for the broad distributed-system applications, some unified way of managing both lifecycle and functional state would be very valuable.  We don’t, at present, have a real consensus on how to manage either one separately, so that goal may be difficult to reach quickly.  In particular, functional state will demand a transition to functional programming or stateless microservices, and that may require a rewriting of software.  That, in turn, demands a new programming model and perhaps new middleware to harmonize development.

We’ve not paid nearly enough attention to state in NFV or in the cloud.  If we want to reap the maximum benefit from either, that has to change.