Is Carrier-Grade NFV Really Important?

OpenStack has been seen by most as an essential element in any NFV solution, but lately questions have been raised about whether OpenStack can make the grade, meaning “carrier-grade” or “five-nines” availability.  Light Reading ran an article on this, and Stratus recently published an NFV PoC that they say proves that OpenStack VIM mechanisms are insufficient to assure carrier-grade operation.  They go on to say that their PoC proves it’s possible to add resiliency and availability management as a platform service, and that doing so would reduce the cost and difficulty of meeting high-availability requirements.  The PoC is interesting on a number of fronts, some of which are troubling to classical NFV wisdom.

Let’s start with a bit of background.  People have generally recognized that specialized appliances used in networking could be made much more reliable/available than general-purpose servers.  That means that NFV implementations of features could be less reliable, and that could hurt NFV deployment.  Proponents of NFV have suggested this problem could be countered by engineering resiliency into the NFV configuration—parallel elements are more reliable than single elements.
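
To put rough numbers on that intuition, here’s a back-of-the-envelope sketch in Python, with the usual caveat that it assumes failures are independent, which they rarely are in real deployments:

```python
# Back-of-the-envelope availability of N parallel instances, assuming
# independent failures and that any surviving instance can carry the load.

def parallel_availability(single_instance: float, instances: int) -> float:
    return 1.0 - (1.0 - single_instance) ** instances

single = 0.999  # a single server-hosted instance at "three nines" (illustrative)

for n in (1, 2, 3):
    print(f"{n} instance(s): {parallel_availability(single, n):.9f}")

# 1 instance(s): 0.999000000
# 2 instance(s): 0.999999000
# 3 instance(s): 0.999999999
```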

The problem with the approach is twofold.  First, a network of redundant components deployed to avoid single points of failure is harder to build and more complicated to operate, which can raise the operations costs enough to threaten the business case if you’re not careful.  Second, if you define a fault as a condition visible in the service data plane, most faults can’t be prevented with parallel component deployment because some in-flight packets will be lost.  That problem is often described as “state management,” because a new instance of a process doesn’t always know what state the instance it’s replacing was in.

I blogged early on in the NFV cycle that you could not engineer availability through redundant deployment of VNFs alone, so I can hardly disagree with the primary point.  What Stratus is saying is that if you enhance the platform that hosts VNFs, you can do things like maintain state for stateful switchovers, which is essential to maintaining operation in the face of a fault.  I agree with that too.  Stratus’ message is that you can address both issues better than OpenStack can, delivering configuration-based availability by making the platform that hosts VNFs availability-aware.

Well, I’m not sure I can buy that point, not least because OpenStack is about deployment of VNFs, and most availability issues arise in the steady state, after OpenStack has done its work.  Yes, you can invoke it again to redeploy VNFs, but it seems to me that the questions of NFV reliability have to be solved at a deeper level than OpenStack alone, and that OpenStack may be taking the rap for a broader set of problems.
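
To make that concrete, here’s a minimal sketch, using the openstacksdk client, of what “invoking OpenStack again” amounts to; the cloud name and the image, flavor, and network IDs are placeholders I’ve invented.  Note what it doesn’t cover: detecting the fault, deciding to act, and restoring the VNF’s runtime state all have to come from somewhere other than OpenStack.

```python
# Minimal sketch: re-instantiating a failed VNF instance via openstacksdk.
# "my-nfvi" and the IMAGE/FLAVOR/NETWORK IDs are placeholders, not real values.

import openstack

conn = openstack.connect(cloud="my-nfvi")  # entry assumed to exist in clouds.yaml

def redeploy_vnf(failed_server_id: str):
    failed = conn.compute.get_server(failed_server_id)
    conn.compute.delete_server(failed)              # tear down the failed instance

    replacement = conn.compute.create_server(
        name=failed.name,
        image_id="IMAGE_UUID",                      # placeholder
        flavor_id="FLAVOR_UUID",                    # placeholder
        networks=[{"uuid": "NETWORK_UUID"}],        # placeholder
    )
    conn.compute.wait_for_server(replacement)
    # The replacement boots "cold": any session or flow state the old
    # instance held is gone unless the VNF kept it somewhere external.
    return replacement
```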

State maintenance isn’t a slam-dunk issue either.  Most stateful software these days likely uses “back-end” state control (Metaswitch uses this in its insightful implementation of IMS, Project Clearwater), and you can use back-end state control without OpenStack being aware or involved, and without any other special platform tools.
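
As an illustration of the pattern (and only the pattern; Clearwater’s actual design uses its own distributed stores, and the Redis host here is a stand-in I’ve made up), a back-end-state VNF might look something like this:

```python
# Sketch of "back-end" state control: the instance keeps no critical state
# locally, so a replacement instance can serve the same sessions.
# Redis is purely an illustrative external store; the host name is invented.

import json
import redis

store = redis.Redis(host="state-store.example.net", port=6379)

class SessionHandler:
    """Any instance of the VNF can serve any session; state lives out back."""

    def handle(self, session_id: str, event: dict) -> dict:
        raw = store.get(f"session:{session_id}")
        state = json.loads(raw) if raw else {"events_seen": 0}

        state["events_seen"] += 1    # stand-in for whatever per-session work applies
        state["last_event"] = event

        store.set(f"session:{session_id}", json.dumps(state))
        return state
```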

Worse, I don’t think that even state-aware platforms are going to be a suitable replacement for high-availability gear in all cases.  You can’t make router state universal across instances without duplicating the data stream, which is hardly a strategy to build an NFV business case on.  But of course routers already recover from the failure of devices or trunks, so we may not need routers to be fully paralleled in configuration-based availability management.  Which raises the question of whether “failures” that are routine in IP or Ethernet networks have to be afforded special handling just because the features involved are hosted as VNFs.

The final point is that we still have to consider whether five-nines is actually a necessary feature.  Availability is a feature, and like any other feature you have to trade it against cost to decide whether it has any buyer utility.  The contrast between the old PSTN and today’s mobile services is a good example; people are happy to pay less for cellular service even though it’s nowhere near as high-quality as wireline voice used to be.
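
It helps to remember what the nines actually buy.  A quick bit of arithmetic:

```python
# What the nines buy, expressed as allowed downtime per year.

MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines in (3, 4, 5):
    availability = 1 - 10 ** (-nines)
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{nines} nines ({availability:.5f}): ~{downtime:.0f} minutes of downtime per year")

# 3 nines (0.99900): ~526 minutes of downtime per year
# 4 nines (0.99990): ~53 minutes of downtime per year
# 5 nines (0.99999): ~5 minutes of downtime per year
```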

Two situations argue for high availability for VNFs.  One is multi-tenancy, meaning VNFs that deploy not just for a single customer but for a large number.  The other is interior network features like “virtual core routing” that might be associated with a large-scale network virtualization application.  The mainstream VNF stuff, which all falls into the category of “service chaining”, is much more problematic as a high-availability app.  Since Stratus is citing the benefits of their availability-platform approach to VNF providers, the credibility of that space is important, so we’ll deal with the classic VNF applications of service chaining first.
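
One reason service chaining is problematic as a high-availability application is simple serial arithmetic: a chain is up only when every element in it is up, so availabilities multiply.  A quick sketch (the four VNF roles are purely illustrative):

```python
# A service chain is a serial system: the chain is up only when every
# element is up, so per-element availabilities multiply.

from math import prod

def chain_availability(per_element: float, elements: int) -> float:
    return prod([per_element] * elements)

# Four chained VNFs (say firewall, NAT, DPI, WAN optimization), each
# individually at five nines:
print(f"4-element chain: {chain_availability(0.99999, 4):.6f}")
# 4-element chain: 0.999960  (the chain as a whole falls below five nines)
```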

Yes, it is true that if you could make state control a feature of the platform rather than something that VNFs have to handle on their own, VNF vendors would have an easier time.  As a software architect (fallen, to be sure, to the dark side of consulting!) I have a problem believing that you can manage the distributed state of multiple availability-managed components without knowing just what each component thinks its own state is.  There are plenty of variables in a program; which are state-critical?
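
To illustrate the point, here’s a purely hypothetical sketch of the kind of contract a platform-managed state service would need; none of these names come from Stratus or any real platform.  The platform can move the bytes around, but only the VNF author can say which variables matter:

```python
# Purely hypothetical contract for platform-managed state, to illustrate the
# point: the platform can checkpoint and restore bytes, but only the VNF
# author knows which of its variables are actually state-critical.

from abc import ABC, abstractmethod

class Checkpointable(ABC):
    @abstractmethod
    def export_state(self) -> dict:
        """Return only the state-critical variables."""

    @abstractmethod
    def import_state(self, state: dict) -> None:
        """Rebuild internal state in a replacement instance."""

class ToyFirewallVnf(Checkpointable):
    def __init__(self):
        self.active_flows = {}   # state-critical: must survive a switchover
        self.packets_seen = 0    # a counter we could probably afford to lose
        self._scratch = []       # definitely not state-critical

    def export_state(self) -> dict:
        return {"active_flows": self.active_flows}

    def import_state(self, state: dict) -> None:
        self.active_flows = state["active_flows"]
```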

Even more fundamentally, I doubt that service-chained VNFs, the focus of most VNF providers, really need carrier-grade availability.  These features have historically been provided by CPE, after all.  It’s also true that most of the new services operators feel they’re missing out on, the services OTTs are winning, have far less than five-nines reliability.  Why should the operators have to meet a different standard, likely at a higher cost?

Multi-tenant features like IMS or core routing would make sense as high-availability services, but again I wonder whether we should be listening to the voice of perhaps the most experienced VNF provider of all, Metaswitch.  They built resiliency in at the VNF level, and that means others could do the same.  Given the limitations of having a platform anticipate the state-management and other needs of a high-availability application, letting VNFs do their own thing makes the most sense.

I think platformized NFV is not only a good idea, it’s inevitable.  There will surely be a set of services made available to VNFs and VNF developers, and while it would be nice if we had a standard to describe this sort of thing, there is zero chance of getting anything useful in place in time to influence market development.  I’d like to see people argue for platform features for VNFs and not against OpenStack or any other platform element.  That means describing what they propose to offer under realistic conditions, and I don’t think Stratus has yet done that.

I also think that we’re ignoring the big question here, which is the complexity/cost question.  We’re acting like NFV deployment is some sort of divine mandate rather than something that has to be justified.  We propose features and capabilities that add both direct cost and complexity-induced costs to an NFV deployment equation that we know isn’t all that favorably balanced at best.  We can make VNFs do anything, including a lot of stuff they should not be doing.