What the Heck is “Carrier Grade”?

One of the interesting issues that I encountered at the HP Discover 2014 event this week was that of “carrier grade”, and I even had someone make a related comment on a prior blog of mine.  For ages, people have talked about how important it was to be “carrier grade” and offer “five-nines” reliability.  NFV certainly has to support the standard for reliability, and so does SDN, but do we know what that standard is?

There are two factors that influence carrier requirements for reliability.  One is the service-level agreement offered for the service (explicit or implicit) and the other is the operational cost of an outage.  You don’t want SLA violations because they will hurt your churn rate and often cost you money in reparations, and you don’t want failures that drive up opex.  So the question is how to achieve enough availability to suit these two requirements.

In the SLA area, we inherited the notion of five-nines from the old days of TDM.  In TDM networks, operators measured “severely errored seconds” and “error-free seconds” and corporate SLAs were stated in these terms.  Clearly this micro-managed SLA notion was going to create major reliability concerns, and if you’re writing SLAs with one-second granularity you can’t take the time to fail over to another device or path if something breaks.

Just try to buy an SLA with second-level granularity in VPNs or Ethernet.  In packet services of all types, we rely on what I’ll call “statistical SLAs” and not on highly granular ones.  A statistical SLA says that any event has some probability, including an outage, but that probability is low over time.  You write an SLA in order to reduce the violation rate, partly by managing availability but also partly by managing the granularity.  Packet SLAs usually measure outages over a long period—a week or a month.
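To put rough numbers on that, here is a small sketch of how an availability target turns into a downtime budget once you pick a measurement window.  The function name and the 30-day month are my own assumptions for illustration, not terms from any particular SLA.

```python
def downtime_budget_seconds(availability: float, window_seconds: float) -> float:
    """Seconds of outage allowed in a window at a given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * window_seconds


MONTH = 30 * 24 * 3600  # assumed 30-day measurement window, in seconds

for label, target in [("three nines", 0.999),
                      ("four nines", 0.9999),
                      ("five nines", 0.99999)]:
    budget = downtime_budget_seconds(target, MONTH)
    print(f"{label}: {budget:.1f} s of outage per 30-day month")
```

Five nines over a 30-day month works out to roughly 26 seconds of total outage.  Measured over a month, that budget can absorb a short break; measured second by second, it can't.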

It’s not totally about contractual stuff, though.  Anyone who uses a mobile phone knows that five nines is a joke; I don’t think I get one nine myself and I had major issues in hearing the other party in the last phone call I made (yesterday).  I also have regular issues with voice services over IP even with wireline access; maybe I have a nine or two there.  The Internet?  Forget it, nines-wise.  The fact is that we have accepted “best-efforts” services with no SLA at all for most of our communications usage.

So here we come to what could be called “the progress of the mythology” of SLAs.  We say “Five nines is crap; we don’t need highly available devices at all.  What we do is to fail over.”  That isn’t necessarily true either, for three reasons.

Reason one is that most alternate-path or alternate-device responses to failure will in fact break the service connection for at least some period of time.  If a packet connection breaks because of the loss of a device, packets in the flow are dropped and there is often a period of time when no path exists at all (before adaptive recovery finds a new one) during which more packets are dropped.  The point is that while we may accept this level of impairment, we cannot make a service five-nines in most cases through redundancy unless we have essentially hot standby.
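A back-of-the-envelope sketch shows how recovery time eats into the nines.  The failure counts and outage durations below are made-up inputs, and the model assumes every failure costs the same fixed outage; treat it as an illustration, not a measurement.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def availability_with_failover(failures_per_year: float,
                               outage_seconds: float) -> float:
    """Availability when each failure breaks service for outage_seconds."""
    downtime = failures_per_year * outage_seconds
    return max(0.0, 1.0 - downtime / SECONDS_PER_YEAR)


def max_outage_per_failure(target: float, failures_per_year: float) -> float:
    """Longest per-failure outage that still meets the availability target."""
    return (1.0 - target) * SECONDS_PER_YEAR / failures_per_year


# Hypothetical scenarios: (failures per year, seconds of outage per failure)
for failures, outage in [(4, 30), (12, 30), (12, 1)]:
    a = availability_with_failover(failures, outage)
    print(f"{failures} failures x {outage} s -> availability {a:.6f}")
```

With these assumed inputs, a service that fails once a month has to recover in about 26 seconds each time to stay at five nines for the year, which is the kind of number that pushes you toward something close to hot standby.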

Which brings us to reason number two.  The process of making something fail over fast enough to create a reasonable alternative to not failing at all involves both redundancy of facilities and agility of operations response.  Neither of these is free.  With NFV, we have an expectation that benefits will come from capex reduction, opex reduction, and service agility.  Two of the three benefits are impacted by infrastructure changes intended to improve availability.  So what we’re saying is that at some point, the cost of making something resilient is higher than the cost of making it reliable.  That’s more likely to happen as overall network complexity increases, and NFV’s substitution of chained resources for simple boxes creates more complexity in itself.

The third reason is that shared resources multiply problems as fast as they multiply users.  One device failing in the old days created one SLA violation.  A device supporting a thousand streams of service might fault a thousand services.  The process of recovery for all thousand services is unlikely to be fast enough and effective enough to resolve all SLA problems, and in fact finding new resources for them all may spread the issue toward all the service endpoints, which can create a storm of management interventions that further add to cost and complexity, and reduce those nines.

My point here isn’t that NFV is a bad idea, or that it’s not reliable.  What I’m saying is that all this talk about how many nines make a carrier grade is kind of useless.  We left absolutes behind when we moved to IP infrastructure from TDM.  We have to manage everything now, including availability.  Services have SLAs, and we have to guarantee SLAs within the economic framework set by the acceptable price of the service.  How reliable is NFV?  As reliable as users are willing to pay for it to be, and as carriers are willing to dip into profit margins to make it.  There is no standard to hold to here other than the standard of the marketplace: overall utility.

So what about servers?  First, there’s a myth that carrier-grade means NEBS-compliant.  NEBS was about power and RFI, not about availability.  You can have a piece of NEBS gear that needs a live-in tech.  Second, you can never make a box that isn’t five-nines into a virtual box that is five-nines by combining boxes.  The risk that two essential devices with the same reliability requirements create in combination is higher than the risk either of them poses alone; read the combinatory rules for MTBF and you’ll see what I mean.
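For the chained case, where every element is essential, the arithmetic is short enough to sketch.  This assumes independent failures and constant failure rates, which real hardware only approximates, and it covers only elements in series; it does not model a standby sitting in parallel.  The function names are mine.

```python
from math import prod


def series_availability(availabilities):
    """Chain of essential elements: up only when every element is up."""
    return prod(availabilities)


def series_mtbf(mtbfs_hours):
    """Failure rates of essential elements add, so the chain's MTBF is the
    reciprocal of the summed rates (assumes independent, constant rates)."""
    return 1.0 / sum(1.0 / m for m in mtbfs_hours)


# Hypothetical service chain of three five-nines elements
chain = [0.99999, 0.99999, 0.99999]
print(f"chain availability: {series_availability(chain):.6f}")

# Two hypothetical devices, each with a 100,000-hour MTBF
print(f"chain MTBF: {series_mtbf([100_000, 100_000]):.0f} hours")
```

Three five-nines elements in a chain come out at about 0.99997, which is no longer five nines, and two equal devices in series have half the MTBF of either one.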

Servers that will support stringent SLAs in NFV will have to have higher availability than those that don’t, or to be more accurate it will be more cost-effective to support SLAs through server reliability measures than by operational measures as the SLAs become more stringent.  Servers that will support a lot of tenants need to be more available too, because the multi-tenancy multiplies the overall number of SLAs at risk and complicates recovery.  So while every server vendor who wants to populate NFV infrastructure may not need “carrier grade” technology, you can be darn sure that every operator who deploys NFV is going to have to deploy some of it via carrier-grade servers.