Does Nokia’s AirGile Advance Stateless Event-Based VNFs?

The notion of stateless microservices for networking and the cloud is hardly new.  I introduced some of the broad points on state in my blog last week, but the idea is much older.  Twitter pioneered the concepts, and Amazon, Google, and Microsoft have all deployed web services to support the model, which is aimed at event processing.  Now, Nokia has announced what it calls “AirGile”, which wraps the stateless-microservice notion into a massive 5G package.  It’s a bold approach, and there are some interesting points made in the material, but is this just too big…not to fail but to succeed?  Or is it something else?

I’ve blogged often on functional programming, lambdas, and microservices, and I won’t bore everyone by repeating what I said (you can do a search on the terms on my blog to find the references).  The short version is that the goal of these concepts, which are in many ways just different faces on the same coin, is to create software components that can be spun up as needed, where needed, and in any quantity, then disappear till the next time.  You can see how this would be perfect for event-handling, since events are themselves transitory stuff.

Events are the justification for AirGile, and certainly event-based systems are critical for the cloud.  It’s also likely that many of the NFV applications are really event applications, though this is less true of the vCPE stuff that dominates the NFV space today.  vCPE VNFs are in the data path, and microservices and functional programming are less relevant to that space than to control-plane or transactional stuff.  Nokia doesn’t make the distinction in their material.

Overall, the AirGile story is a bit hard to pull out of the material; the press release is HERE.  I pick out three elements—a developer program and API set, a microservice-based model that’s more “cloud-agile”, and a target application set that seems to be primarily 5G but might also include NFV (Alcatel-Lucent’s CloudBand).  Two of the three things have been around all along and are presumably tweaked for the AirGile story, so it’s actually the microservices stuff that’s new.  Unfortunately, there is virtually nothing said about that part in the material.  As a result, I’m going to have to do some heavy lifting to assess what’s going on here, meaning presume that there is actual useful thinking behind the story and see what it might be.

I think that this is really mostly about NFV, because NFV is mentioned in the release (CloudBand), is a proposed element in 5G deployment, and is based on cloud technology (OpenStack, for example).  NFV and the cloud have a common touch-point in that components of functionality are deployed in both—NFV as virtual network functions and the cloud as applications.  Microservices are components of software, and so you could say that a microservice architecture could serve both NFV and the cloud.  However, Nokia is a network vendor and not an application provider, so it’s the NFV side that seems to be the linchpin.  There, mobile services and 5G offer an alternative target to that vCPE stuff I mentioned above, an alternative that is easier to cast as an event-based application.  That, in the simplest terms, is how I see AirGile; do the next generation of NFV and focus on control-plane events.

If for Nokia AirGile is mostly an NFV story, then what is being deployed are microservices as VNFs, and in fact they do make that point in their material.  Paraphrasing, operators could create services by assembling microservice-functions, presumably components of VNFs, and do this in a more agile way.  True in general, since composition of applications from microservices is widely accepted as promoting agility.  So let’s take this “truth” (if I’m correct) and run with it.

VNFs today are not microservices, and Nokia can’t do anything from the outside to make them so.  A piece of software is stateless and “functional” only because it was written to be.  Thus, a focus on microservice-VNFs means a focus on NFV applications that don’t depend on transporting legacy physical device code or current network software into VNF form.  You can transport that stuff to a VNF, but you can’t make it a microservice without rewriting it.

Stateless, microservice-based VNFs are then the components of the 5G implementations and other network advances Nokia hopes to exploit.  This supposes a model of NFV that’s very different from today’s model, but remember that NFV today is really all about the single application we’d call “virtual CPE” or vCPE, created by service-chaining VNFs that support things like app acceleration, firewalls, encryption, and so forth.  vCPE is valuable if it can exploit the range of CPE features that are already out there, and so it’s essentially nailed to a legacy software model, not a microservice model.  Nokia, if AirGile is important, has to find new stuff to do with VNFs, and new development to support it.

The advantage of microservice-VNFs, which I’ll shorthand to mVNFs, is that they are inherently stateless.  A stateful component has stuff that implies a contextual awareness of past events, stored within.  If you replace a stateful component, you lose that stuff.  If you scale it, the new copies don’t have what the original had, and thus they might interpret the next message differently.  However, most network functions need “state”, at least in the sense that they store some variables, and Nokia seems to be planning to handle state by providing a back-end database where the variables are stored, keeping them out of the components.  This back-end state control is used routinely in the cloud, so this isn’t a radical departure from industry norms.
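To make the back-end state idea concrete, here is a minimal sketch in Python.  Nothing in it comes from Nokia’s material: the `StateStore` and `Instance` classes and the packet-counting logic are hypothetical, and an in-memory dictionary stands in for the back-end database.  The point is only that an instance holding no variables of its own can be swapped for any other instance between events.

```python
class StateStore:
    """Hypothetical back-end store; a dict stands in for a real database."""

    def __init__(self):
        self._data = {}

    def load(self, key):
        # Return a copy so callers cannot change stored state by accident.
        return dict(self._data.get(key, {"packets": 0}))

    def save(self, key, state):
        self._data[key] = dict(state)


class Instance:
    """Hypothetical mVNF instance: it keeps a handle to the store, nothing else."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id, event):
        # All context is fetched from the back end, updated, and written back.
        state = self.store.load(session_id)
        if event == "packet":
            state["packets"] += 1
        self.store.save(session_id, state)
        return state["packets"]


store = StateStore()
instance_a = Instance("a", store)
instance_b = Instance("b", store)  # a scaled-out or replacement copy

first = instance_a.handle("session-1", "packet")
second = instance_b.handle("session-1", "packet")
print(first, second)
```

Because the second event is handled by a different instance yet still sees the count left by the first, either copy can stand in for the other.  A real deployment would also have to deal with concurrent updates to the same key, which this sketch ignores.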

Still, we don’t have this sort of VNF hanging around the vCPE world, as I’ve said.  I don’t think that Nokia believes that vCPE is going to set the carrier world on fire, opportunity-wise, and you all know I don’t think so either.  They do, however, need some carrier for their strategy or it’s just a software architecture that only programmers would care about.  Their carrier is 5G.

To quote myself from a recent discussion with an operator, “5G is a UFO.  It’s not going to land on your desk and present itself for inspection, so you’re free to assign whatever characteristics to it that you like!”  There are more moving parts to 5G than perhaps to any other networking standard, and most of them are not yet baked.  Thus, just what role mVNFs might play, or even NFV might play, is uncertain, and vendors like Nokia can spin their tale without fear of (at least near-term) contradiction.  If NFV is big in 5G, and if mVNFs are a good idea for greenfield network functions implementation, then 5G is the right place for AirGile.  Before you decide that I’ve written off this whole AirGile thing as a cynical play, let me make two points that in many ways expose the greater truth.

First, operators are already questioning the value of an NFV future based on porting old device functionality to the cloud.  If everyone were happy to hunker down on old physical-device stuff and stay there for a decade, Nokia and others would have a major uphill battle to push an approach that requires a VNF rewrite to mVNFs.  That’s not the case, even today, and it will obviously be less so over time as NFV goals are tied to things like 5G or IoT.  5G is important to NFV to get VNFs out of the old vCPE model, which frankly I don’t think will ever result in a big success for NFV.  Rather than address something like IoT, which probably has more potential, Nokia is aiming at a target that has near-term operator commitment and standardization support.

Second, whether or not 5G even deploys, it is still very smart for Nokia to link AirGile to it.  Nobody has AirGile budgets or plans, but they do have both for 5G.  Further, 5G lets Nokia say things about AirGile in the context of an accepted problem/opportunity set, using 5G to explain the need for AirGile’s features.  It’s fine to say that VNFs will migrate to mVNFs, but many won’t believe that without an example of where that would likely happen.  5G is that place, and AirGile is at least on the right track.

The question then is what the mVNF concept will do to/for 5G and NFV, and even more how it might impact IoT, which is the champion among event-driven market opportunities.  I think that if NFV is to play any role whatsoever in 5G, it will have to be in mVNF form because the simple monolithic VNF model just doesn’t do the job in a large-scale dynamic deployment.  Thus, while we can’t say at this stage what 5G will look like, exactly, or when it will happen (even approximately), we can say that without mVNFs it probably won’t have much NFV inside.  And IoT without mVNFs is just not going to happen, period.

I think that we’re long overdue in thinking about the hardware, network, and software platform needed for a realistic carrier cloud platform, and 5G and IoT combine to represent almost 40% of the opportunity there.  Event-driven applications are even more important, representing a minimum of 74% of carrier cloud opportunity in the long term.  But that 74% isn’t NFV as we think of it, and that’s perhaps the biggest challenge for AirGile and Nokia.  They need to think not so much of NFV but of carrier cloud, and the story there isn’t really well developed.  Might Nokia have exposed the importance of event-driven carrier cloud and not owned the opportunity?  If so, they could have greased the skids for competitors.

We don’t have enough detail on AirGile to say whether it has that golden set of features needed, but it will probably stimulate a lot of reaction from other vendors, and we will hopefully end up a lot closer to a full event-driven architecture than we are today.  I think that Nokia may help drive that closure, but I wish they had offered more detail on their microservices framework.  That’s where the real news and value lie, and until we understand what Nokia plans with respect to events overall, we can’t evaluate just how important it could be to Nokia and to the industry.

That’s the “something else”.  AirGile might be vague because the topic of stateless software is pretty complex, certainly beyond the typical media story.  It might also be vague because it’s “slideware” or “vaporware”, or a placeholder for future detail and relevance.  We don’t know based on what’s been released, and I hope Nokia steps up and tells its story completely.

Why “State” Matters in NFV and the Cloud

It’s time we looked more closely at the subject of “state”, not in a governmental sense but in the way that software elements behave, or should behave.  State, in a distributed system, is everything.  The term “state” is used in software design to indicate the notion of context, meaning where you are in a multi-step process.  Things that are “stateful” have specific context and “stateless” things don’t.  When you have states, you use them to mark where you are in a process that involves multiple steps, and you move from one state to another in response to some outside condition we could call an “event”.  Sounds simple, right?  It is anything but.

Where we’ve run into state issues a lot in the networking space is NFV, because NFV deploys software functions and expects to provide resiliency by replacing them, or scalability by duplicating them.  There are two dimensions of state in NFV, and both of them could be important.

When I’ve described NFV deployment as a hierarchical model structure, I’ve noted that each of the elements in the model was an independent state machine, meaning that each piece of the model had its own state.  That state represented the lifecycle progress of the modeled service, so we can call it “lifecycle state”.  Lifecycle state is critical to any NFV framework because there are many places in a service lifecycle where “subordinate” behaviors have to be done before “superior” ones can be.  A service, at a high level, isn’t operational till all its pieces are, and so lifecycle state is critical in resolving dependencies like that.  Where lifecycle state gets complicated is during special events like horizontal scaling of the number of instances or replacement of an instance because something broke.
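As a rough illustration of that dependency rule, the sketch below models each element as its own small state machine and lets a superior element go active only after every subordinate has.  The class name, the three states, and the event names (`deploy`, `ready`, `child_active`) are all assumptions made for the example, not part of any NFV specification.

```python
from enum import Enum


class Lifecycle(Enum):
    ORDERED = "ordered"
    DEPLOYING = "deploying"
    ACTIVE = "active"


class ModelElement:
    """One node in a hypothetical hierarchical service model."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.parent = None
        self.state = Lifecycle.ORDERED
        for child in self.children:
            child.parent = self

    def on_event(self, event):
        """Move between lifecycle states in response to an event."""
        if event == "deploy" and self.state is Lifecycle.ORDERED:
            self.state = Lifecycle.DEPLOYING
            for child in self.children:
                child.on_event("deploy")
        elif event == "ready" and self.state is Lifecycle.DEPLOYING:
            # Only a leaf (something actually deployed) can report ready.
            if not self.children:
                self._activate()
        elif event == "child_active" and self.state is Lifecycle.DEPLOYING:
            # A superior element waits until all of its subordinates are active.
            if all(c.state is Lifecycle.ACTIVE for c in self.children):
                self._activate()

    def _activate(self):
        self.state = Lifecycle.ACTIVE
        if self.parent is not None:
            self.parent.on_event("child_active")


firewall = ModelElement("firewall")
router = ModelElement("router")
access = ModelElement("access", [firewall, router])
service = ModelElement("service", [access])

service.on_event("deploy")
firewall.on_event("ready")
state_after_one = service.state  # one subordinate is still deploying
router.on_event("ready")
print(state_after_one.value, service.state.value)
```

The service stays in its deploying state after the first subordinate reports in, and goes active only when the last one does.  Scaling and replacement would add further states and events to each element, which is where the complexity described above comes from.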

The complexity of scaling, in lifecycle state, lies in the scope of the process and the mechanism for selecting an instance to receive a particular packet—load balancing.  When you instantiate a second instance of a scalable VNF, you probably have to introduce a load-balancer because you now have a choice of instances to make.  In effect, we have a service model with a load-balancer in it, but not yet active, and we have to activate it and connect it.

In replacement, the problem depends on just how widespread the impact of your replacement has to be.  If you can replace a broken server with another in the same rack, there is a minimal amount of reconnection.  In that case, the deployment of the new VNF could make the correct connections.  However, if you had to put the new VNF somewhere quite distant, there are WAN connection requirements that local deployment could not hope to fulfill.  That means that you have to buck part of the replacement work “upward” to another element.  Which, of course, means that you had to model another element in the first place.
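The idea of bucking work upward can be sketched in a few lines.  This is a simplified picture under my own assumptions: the `PlacementElement` class, its `sites` attribute, and the site names are hypothetical, and a real model would make connections rather than just report which element took the job.

```python
class PlacementElement:
    """Hypothetical model element that can reconnect VNFs within its own scope."""

    def __init__(self, name, sites, parent=None):
        self.name = name
        self.sites = set(sites)  # sites this element can wire up by itself
        self.parent = parent

    def replace_vnf(self, vnf, new_site):
        """Return the name of the element that ends up handling the replacement."""
        if new_site in self.sites:
            return self.name
        if self.parent is None:
            raise LookupError(f"no element can place {vnf} at {new_site}")
        # Out of local scope: pass the work upward to the superior element.
        return self.parent.replace_vnf(vnf, new_site)


service = PlacementElement("service", ["rack-1", "rack-2", "remote-dc"])
local = PlacementElement("local-deployment", ["rack-1", "rack-2"], parent=service)

nearby = local.replace_vnf("firewall", "rack-2")
distant = local.replace_vnf("firewall", "remote-dc")
print(nearby, distant)
```

The same-rack replacement is handled locally, while the distant one lands on the superior element, which only works because that superior element was modeled in the first place.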

The rightful meaning of the term “orchestration” is the coordination of separate processes, and that’s what’s needed for lifecycle management.  Lifecycle state is an instrument in that coordination, a way of telling whether something is set up as expected and working as planned and, if it isn’t, of tracking it through a series of steps to get the thing going correctly.

The individual virtual network functions (VNFs) of NFV also have functional state, meaning that the VNF software, as part of its data-plane dialog with users and/or other VNFs, may have a state as well.  For example, a VNF that’s “session-aware”, meaning that it recognizes when a TCP session is underway, has to remember that the session has started and that it hasn’t yet ended.  If you’re actually processing a TCP flow, you will have to implement slow-start, recognize out-of-order arrivals, etc.  All of this is stateful behavior.

Stateful behavior in VNF functionality means that you can’t readily spawn a new or additional copy of a VNF and have it substitute for the original, because the new copy won’t necessarily “know” about things like a TCP session, and thus won’t behave as the original copy did.  Thus, functional statefulness can limit the ability of lifecycle processes to scale or replace VNFs.
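Here is a toy version of that problem, assuming a hypothetical session-aware function that drops any packet belonging to a session it did not see open.  The class and the packet labels are invented for the example and are far simpler than real TCP handling.

```python
class SessionAwareVNF:
    """Hypothetical stateful VNF: remembers which sessions are open."""

    def __init__(self):
        self.open_sessions = set()  # functional state held inside the instance

    def process(self, session_id, packet_type):
        if packet_type == "SYN":
            self.open_sessions.add(session_id)
            return "accept"
        if session_id not in self.open_sessions:
            # No record of this session starting, so the packet looks invalid.
            return "drop"
        if packet_type == "FIN":
            self.open_sessions.discard(session_id)
        return "accept"


original = SessionAwareVNF()
original.process("s1", "SYN")

# A freshly spawned copy starts with empty internal state.
replacement = SessionAwareVNF()

from_original = original.process("s1", "DATA")
from_replacement = replacement.process("s1", "DATA")
print(from_original, from_replacement)
```

The same packet is accepted by the original and dropped by the new copy, which is exactly the behavior that keeps lifecycle processes from freely scaling or replacing a stateful VNF.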

Functional state is difficult because it’s written into the VNFs.  You can impose lifecycle state from above, so to speak, because the VNFs themselves aren’t generally doing lifecycle stuff.  You can’t impose functional state because it’s part of the programming of the VNF.  This is why “functional programming” has to address state in some specific way; it’s used to create things that can be instantiated instantly, replaced instantly, and scaled in an unfettered way.  The processes of instantiating, replacing, and scaling are still lifecycle-state-driven, but the techniques used by the programmer to manage functional state still have to be considered, or you may create a second copy of something only to find that it breaks the process instead of helping performance.

To make things a bit more complex, you can have things that are “stateless” in a true sense, and things that have no internal state but are still stateful.  This is what Nokia is apparently proposing in its AirGile model (I’ll blog more on AirGile next week).  Most such models rely on what’s called “back-end state”, where an outside element like a database holds the stateful variables for a process.  That way, when you instantiate a new copy of something, you can restore the state of the old copy.

The only negative about back-end state control is that there may be a delay associated with transporting the state, both in saving the state in the “master” copy and in moving that saved state to the point where a new copy is going to be instantiated.  This may have to be considered in some applications where the master state can change quickly, but in both scaling and fault recovery you can usually tolerate a reasonable delay.

Every NFV service has lifecycle state, but not every service or service element has functional, internal, state.  Things that are truly stateless, referencing back to functional programming, can be instantiated as needed, replicated as indicated, and nothing bad happens because every copy of the logic can stand in for every other copy since nothing is saved during operation.  True stateless logic is harder to write but easier to operationalize because you don’t have to worry about back-end state control, which adds at least one lifecycle state to reflect the process of restoring state from that master copy.

While state is important for NFV, it’s not unique to NFV.  Harkening back to my opening paragraph, NFV isn’t the only thing that has state; it’s an attribute of nearly all distributed systems because the process of deploying such systems will always, at the least, involve lifecycle states on the components.  That means that logically we might want to think about cloud systems, applications, and services as being the same thing under the covers, and look for a solution to managing both lifecycle state and functional state that can be applied to any distributed (meaning, these days, cloud-distributed) system.

Lifecycle state, as I’ve noted in earlier blogs, can be managed by using what I’ve called representational intent, a model that stands in for the real software component and manages the lifecycle process as it relates both to the service management system overall and to the individual components.  In effect, the model becomes a proxy for the real stuff, letting something that doesn’t necessarily have a common lifecycle management capability (or even have any lifecycle awareness) be fit into a generalized service management or application management framework.
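A bare-bones sketch of that proxy role might look like the following.  The `LegacyComponent` and `IntentModel` classes, the state names, and the transition table are my own illustrative assumptions; the only thing the sketch is meant to show is a model giving a uniform lifecycle to software that has none.

```python
class LegacyComponent:
    """Stand-in for software with no lifecycle awareness of its own."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


class IntentModel:
    """Hypothetical proxy that exposes a uniform lifecycle for any component."""

    # (current state, event) -> next state
    TRANSITIONS = {
        ("idle", "activate"): "active",
        ("active", "fault"): "recovering",
        ("recovering", "restored"): "active",
        ("active", "deactivate"): "idle",
    }

    def __init__(self, component):
        self.component = component
        self.state = "idle"

    def on_event(self, event):
        next_state = self.TRANSITIONS.get((self.state, event))
        if next_state is None:
            return self.state  # ignore events that do not apply in this state
        # The model, not the component, decides what each state means.
        if next_state == "active":
            self.component.start()
        else:
            self.component.stop()
        self.state = next_state
        return self.state


component = LegacyComponent()
model = IntentModel(component)

after_activate = model.on_event("activate")
after_fault = model.on_event("fault")
running_during_recovery = component.running
after_restore = model.on_event("restored")
print(after_activate, after_fault, after_restore)
```

The management system only ever talks to the model, so the component underneath can be anything that the model knows how to start and stop.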

Data models, or small software stubs, can provide representational intent modeling and there have been a number of examples of this sort of lifecycle state modeling, all discussed HERE.  It’s not clear whether modeling could handle functional state, however, beyond perhaps setting up the state in a back-end state control system.  The statefulness of logic is a property of the logic itself, and even back-end state control would be difficult, to say the least, if the underlying software didn’t anticipate it.

I think it’s clear that for the broad distributed-system applications, some unified way of managing both lifecycle and functional state would be very valuable.  We don’t, at present, have a real consensus on how to manage either one separately, so that goal may be difficult to reach quickly.  In particular, managing functional state will demand a transition to functional programming or stateless microservices, and that may require a rewriting of software.  That, in turn, demands a new programming model and perhaps new middleware to harmonize development.

We’ve not paid nearly enough attention to state in NFV or in the cloud.  If we want to reap the maximum benefit from either, that has to change.