How to Make SDN and NFV About Zeros Instead of Nines

We chase a lot of phantoms in the tech space, none as dangerous as the old “five-nines” paradigm.  Everyone obsesses about getting reliability/availability up to the standards of TDM.  That’s not going to happen if we actually go through with the kind of network transformation we’re talking about.  Five-nines is too expensive to meet, and we don’t have it anyway with mobile services.  What we have to worry about isn’t too few nines, but too many zeros.

Telstra, the Australian telecom giant, has been hammered in the press there for having multiple major outages in just a few days, outages where the number of customers with service was zero.  To me this proves that SDN, NFV, cloud, or other network technology evolutions are going to be judged by customers not by the number of dropped calls or video glitches (who doesn’t see those regularly?) but by the number of no-service periods and the number of impacted consumers.  That’s a whole different game.

The overall effect of our proposed network infrastructure changes would be to centralize and software-ize more things, to move away from the adaptive and unpredictable to the more manageable and from the proprietary and expensive to the commodity.  All of this is being proposed to reduce costs, so it’s ridiculous to think that operators would then engineer in that old five-nines standard.  Packet networks in general, and centralized-and-software networks in particular, are not going to meet that except by over-engineering that compromises the whole business case for change.  That’s not a problem, but what is a problem is the fact that the five-nines debate has covered up the question of the major outage.

One of my enterprise consulting engagements of the past involved a major healthcare company that had a “simple” network problem of delayed packets during periods of congestion.  The problem was that the protocol involved was very sensitive to delay, and when confronted by a delay of more than a couple of seconds it tended to let the endpoints get out of synchronization.  These endpoints then reset, which took down the device/user and forced a restart and recovery process, which in turn generated more packets and created more delay.  What ended up happening is that over ten thousand users, everyone in the whole medical complex, lost service and the vendor could not get it back.  They limped along for days until I showed them that it would be better to drop the packets than delay them.  One simple change and it worked again.
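
To make the drop-rather-than-delay idea concrete, here is a minimal Python sketch of a queue that discards anything that has waited longer than a delay budget instead of delivering it late.  The class name `DeadlineQueue`, the two-second budget, and the packet labels are illustrative assumptions, not details of the actual healthcare network or the change made there.

```python
from collections import deque


class DeadlineQueue:
    """FIFO queue that drops packets older than max_delay instead of delivering them late."""

    def __init__(self, max_delay):
        self.max_delay = max_delay      # delay budget in seconds (assumed value)
        self._queue = deque()           # holds (enqueue_time, packet) pairs
        self.dropped = 0

    def enqueue(self, packet, now):
        self._queue.append((now, packet))

    def dequeue(self, now):
        """Return the next packet still inside its delay budget, or None."""
        while self._queue:
            enqueued_at, packet = self._queue.popleft()
            if now - enqueued_at <= self.max_delay:
                return packet
            self.dropped += 1           # stale: a late packet would only trigger resets
        return None


# Hypothetical usage: "a" has waited 3.0 s and is dropped, "b" has waited 1.5 s.
q = DeadlineQueue(max_delay=2.0)
q.enqueue("a", now=0.0)
q.enqueue("b", now=1.5)
delivered = q.dequeue(now=3.0)
```

The point of the sketch is only that a stale packet costs more than a missing one when the endpoints treat delay as loss of synchronization.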

Think now of a central control or management process.  It’s doing its thing, and perhaps there’s a trunk problem or a problem with a data center, and a bunch of VNFs or SDN switches fail.  The controller/lifecycle manager now has to recover them.  The recovery takes resources, which creates a waiting list of service incidents to address, which leaves more switches or VNFs disconnected, which creates more failures…you can see where this goes.

There are quite a few “zero-service-creating” conditions in emerging models of the network.  There are also some pretty profound recovery questions.  If an SDN controller has to talk to switches to change forwarding tables, what happens when the failure breaks the switch-to-controller linkage?  If an NFV domain is being lifecycle-managed by a set of processes, what happens if they get cut off from what they manage?

I’m not a fan of adaptive device behavior as a means of addressing problems, and I’m not proposing that we engineer in the five-nines I’ve been pooh-poohing above.  What I think is clear is that we’ve left out an important concept in our advances in network technology, which is the notion of multi-planar infrastructure.  In the old days we had a control/signaling plane and a data plane.  With SDN and NFV we need to reaffirm these two planes, and add in the notion of a management plane because of lifecycle management dependencies.  The control/signaling plane and the management plane, and the processes that live there, do have to be five-nines or maybe even more, because if they are not there’s a risk that a failure will cascade into an explosion of problems that will swamp remediation by swamping or breaking the signaling/management connectivity.  Then we’re in zero-land.

We don’t really have an explicit notion of signaling/control and management planes in either SDN or NFV.  In SDN, we don’t know whether it would be possible to build a network that didn’t expose operators to having large chunks cut off from the controller.  In NFV we don’t know whether we can build a service whose signal/control/management components can’t be hacked.  We haven’t addressed the question of authenticating and hardening control exchanges.  Financial institutions do multi-phase commit and fail-safe transaction processing, but we haven’t introduced those requirements into the control/management exchanges of SDN or NFV.

What do we have to do?  Here are some basic rules:

  1. Management and control processes have to be horizontally scalable themselves, and the hardest part of that is being able to somehow prevent collision when several of the instances of the processes try to change the network at the same time. See my last point below.
  2. Every management/control connection must live on a virtual network that is isolated and highly reliable, not subject to problems with hacking or cross-talk resource competition from the data plane. This network has to connect the instances of management/control processes as they expand and contract with load.
  3. Every control/management transaction has to be journaled, signed for authenticity, and timestamped for action, so we know when we’ve gotten behind and we know how to treat situations when a resource reports a problem and requests help, and then for a protracted period hears nothing from its control/management process.
  4. There can never be multiple restoration/management processes running at the same instant on the same resources. One process has to own remediation and coordinate with the other processes that need it.
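
Rules 3 and 4 can be sketched in a few lines of Python.  This is a hedged illustration, not a design for any real controller: `ControlJournal`, `RemediationOwnership`, the shared key, and the five-second staleness limit are all hypothetical, and the ownership table is a single in-process dictionary, whereas a real deployment would need a distributed lock or consensus service behind it.

```python
import hashlib
import hmac
import json


class ControlJournal:
    """Append-only journal of signed, timestamped control transactions."""

    def __init__(self, key, max_age):
        self._key = key             # shared secret (assumed; key distribution not shown)
        self.max_age = max_age      # seconds after which a command is considered stale
        self.entries = []

    def _sign(self, body):
        payload = json.dumps(body, sort_keys=True).encode()
        return hmac.new(self._key, payload, hashlib.sha256).hexdigest()

    def record(self, resource, action, issued_at):
        body = {"resource": resource, "action": action, "issued_at": issued_at}
        entry = {**body, "signature": self._sign(body)}
        self.entries.append(entry)  # journaled before it is sent anywhere
        return entry

    def verify(self, entry, now):
        body = {k: entry[k] for k in ("resource", "action", "issued_at")}
        if not hmac.compare_digest(self._sign(body), entry["signature"]):
            return "rejected"       # not authentic
        if now - entry["issued_at"] > self.max_age:
            return "stale"          # we have fallen behind; do not act blindly
        return "ok"


class RemediationOwnership:
    """Tracks the single process that owns remediation of each resource."""

    def __init__(self):
        self._owners = {}

    def claim(self, resource, process_id):
        # The first claimant wins; later claimants must coordinate with the owner.
        return self._owners.setdefault(resource, process_id) == process_id

    def release(self, resource, process_id):
        if self._owners.get(resource) == process_id:
            del self._owners[resource]


# Hypothetical usage with made-up resource and process names.
journal = ControlJournal(key=b"demo-key", max_age=5.0)
entry = journal.record("switch-17", "reinstall-flows", issued_at=100.0)
fresh = journal.verify(entry, now=102.0)
late = journal.verify(entry, now=110.0)
forged = journal.verify({**entry, "action": "shutdown"}, now=102.0)

owners = RemediationOwnership()
first = owners.claim("switch-17", "mgr-a")
second = owners.claim("switch-17", "mgr-b")
```

Here `fresh` is accepted, `late` is flagged as stale, `forged` fails the signature check, and only `mgr-a` gets to remediate `switch-17` until it releases the claim.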

There are two general ways of doing what’s needed.  One is to approach the problem as one of redundant centralization, meaning you stay with the essential SDN/NFV model.  The other is to assume that you really don’t have centralized lifecycle management at all, but rather a form of distributed management.  It’s this second option that needs to be explored a bit, given that the path to the first has already been noted: you apply OLTP principles to SDN/NFV management and control.

If you’re going to distribute lifecycle management, you have two high-level options too.  One is to divide the network infrastructure into a series of control domains and let each of these domains manage the service elements that fall inside them.  The other is to forget “service lifecycles” for the most part, and manage the resource pools against a policy-set SLA that, if met, would then essentially guarantee that the services themselves were meeting their own SLAs.

A resource-management approach doesn’t eliminate the need for management/control, since presumably at least some of the resource-remediation processes would fail and require some SLA-escalation process at the least.  It could, however, reduce the lifecycle events that a service element had to address, and the chances that any lifecycle steps would actually require changes to infrastructure.  That could mitigate the difficulties of implementing centralized management and control by limiting what you’re actually managing and controlling.

The forget-lifecycles approach says that you use capacity planning to keep resources ahead of service needs, and you then manage resources against the capacity plan.  Services dip into an anonymous pool of resources and if something breaks you let resource-level management replace it.  Only if that can’t be done do you report a problem at the service level.
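
A minimal Python sketch of that idea follows.  Everything in it is assumed for illustration: the `ResourcePool` class, the `min_spares` floor standing in for a capacity plan, and the host and service names.  It shows only the order of operations described above, namely replace from the pool first and report to the service level only when the pool cannot help.

```python
class ResourcePool:
    """Anonymous pool managed against a capacity plan, not per-service lifecycles."""

    def __init__(self, spares, min_spares):
        self.spares = list(spares)
        self.min_spares = min_spares    # capacity-plan floor (assumed policy value)
        self.assigned = {}              # service name -> resource currently in use

    def allocate(self, service):
        """Give the service any free resource; it does not choose which one."""
        if not self.spares:
            return None
        self.assigned[service] = self.spares.pop()
        return self.assigned[service]

    def handle_failure(self, service):
        """Replace a failed resource from the pool; escalate only if that is impossible."""
        if self.assigned.pop(service, None) is None:
            return "unknown-service"
        if self.spares:
            self.assigned[service] = self.spares.pop()
            return "replaced"           # the service layer never hears about it
        return "escalate-to-service"    # resource-level remediation ran out of options

    def below_plan(self):
        """True when spare capacity has fallen under the planned floor."""
        return len(self.spares) < self.min_spares


# Hypothetical usage: two hosts, one service, two successive failures.
pool = ResourcePool(["host-1", "host-2"], min_spares=1)
pool.allocate("vpn-a")
first_failure = pool.handle_failure("vpn-a")
needs_capacity = pool.below_plan()
second_failure = pool.handle_failure("vpn-a")
```

The first failure is absorbed by the pool, which also drops it below the planned floor; the second has nothing left to draw on and becomes a service-level problem.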

Some services demand the second approach, including most consumer services, but I think that in the end a hierarchy of management is the best idea.  My own notion was to assign management processes at the service model level, with each object in the model capable of managing what happens at its own level, and with each object potentially assignable to its own independently hosted management process.  It’s not the only way to do this—you can apply generalized policy-dissemination-and-control mechanisms too.  But I think that we’re going to end up with a hierarchy of management for SDN and NFV, and that working toward that goal explicitly would help both technologies advance.
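
As a rough Python sketch of what a hierarchy of management could look like, assume each object in a service model knows which faults it can fix itself and passes everything else to its parent.  The `ModelObject` class, the fault names, and the two-level model are hypothetical, and the sketch ignores the independently hosted management processes mentioned above.

```python
class ModelObject:
    """One element of a service model that manages faults at its own level."""

    def __init__(self, name, can_fix=(), parent=None):
        self.name = name
        self.can_fix = set(can_fix)     # faults this level can remediate (assumed)
        self.parent = parent            # next level up in the model, if any

    def handle(self, fault):
        """Return the name of the object that handled the fault, or None."""
        if fault in self.can_fix:
            return self.name            # fixed locally; nothing propagates upward
        if self.parent is not None:
            return self.parent.handle(fault)
        return None                     # top of the hierarchy and still unhandled


# Hypothetical two-level model: a service with one access leg beneath it.
service = ModelObject("service", can_fix={"site-down"})
access = ModelObject("access-leg", can_fix={"vnf-crash"}, parent=service)

local = access.handle("vnf-crash")
escalated = access.handle("site-down")
unhandled = access.handle("fiber-cut")
```

Most faults stop at the lowest level that can deal with them, which is what keeps the upper levels from being swamped.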