Could SDN or NFV Save Us From Massive Outages?

Since the dual United Airlines and NYSE outages, I’ve gotten a lot of email about the stability of new network architectures.  While I don’t have any special insight into those incidents and so can’t (and won’t) speculate on how they were caused or how they could have been prevented, I do have some experience with network outages.  Are SDN and NFV going to make things easier for us, harder, or what?

The core reality of networking today is that it’s adaptive in operation.  Devices exchange information via control protocols, and the propagation of this information establishes forwarding rules that are then used to move packets around.  At the same time, management information moves to and from specific devices using the same data paths and forwarding rules.  The adaptive nature of today’s networks makes them vulnerable in two distinct ways.
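
To see the mechanics, here’s a minimal sketch of that adaptive exchange: a toy distance-vector model with hypothetical devices and link costs, not any particular protocol.  Each device repeatedly advertises its route table to its neighbors, and each neighbor keeps the cheapest path it hears about.

```python
# Toy model of adaptive route exchange: devices swap route costs with
# their neighbors until the tables stop changing (convergence).

INF = float("inf")

# Hypothetical topology: link costs between adjacent devices.
links = {("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 1, ("A", "D"): 5}

nodes = {n for pair in links for n in pair}
neighbors = {n: set() for n in nodes}
cost = {}
for (a, b), c in links.items():
    neighbors[a].add(b)
    neighbors[b].add(a)
    cost[(a, b)] = cost[(b, a)] = c

# Each device starts out knowing only itself.
table = {n: {d: (0 if d == n else INF) for d in nodes} for n in nodes}

changed = True
while changed:                        # keep exchanging until stable
    changed = False
    for n in nodes:
        for peer in neighbors[n]:     # one "advertisement" per neighbor
            for dest, d in table[peer].items():
                if cost[(n, peer)] + d < table[n][dest]:
                    table[n][dest] = cost[(n, peer)] + d
                    changed = True

print(table["A"])   # A learns D is cheaper via B and C (3) than direct (5)
```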

First, a device could be cut off from the rest of the network by a failure.  In this case, the device wouldn’t be accessible to management commands and thus could not be controlled, at least not until a pathway to that device was restored.  If the device had been contaminated by bad parameterization or software, the problem might prevent paths from ever being established, in which case you’d have to send someone to manually fix things (or provide an out-of-band management connection to every device).

The second issue is the bad apple problem.  You know (maybe) the old saying that “One bad apple spoils the barrel”.  The fact that devices in a legacy network derive most of their knowledge of topology and status from an exchange of information with adjacent devices means that a single device that’s contaminated could contaminate the whole network with bad information.  In most cases this means either that the device advertises “false” routes that are suboptimal or perhaps can’t even work, but it might also mean that the device floods partners with nonsense, ignores requests, and so forth.
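
The same toy model shows how far one bad apple can reach.  In this variant (again purely illustrative), one device lies, advertising zero cost to every destination, and the bad information spreads hop by hop until healthy devices are routing through the liar.

```python
# One contaminated device poisons every table in a healthy network.

INF = float("inf")
links = {("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 1}
nodes = {n for pair in links for n in pair}
neighbors = {n: set() for n in nodes}
cost = {}
for (a, b), c in links.items():
    neighbors[a].add(b)
    neighbors[b].add(a)
    cost[(a, b)] = cost[(b, a)] = c

table = {n: {d: (0 if d == n else INF) for d in nodes} for n in nodes}

BAD = "C"   # the bad apple

def advertisement(n):
    # Healthy devices advertise their real tables; the bad apple
    # claims it can reach everything for free.
    return {d: 0 for d in nodes} if n == BAD else table[n]

for _ in range(len(nodes)):           # enough rounds to reach every node
    for n in nodes:
        for peer in neighbors[n]:
            for dest, d in advertisement(peer).items():
                if cost[(n, peer)] + d < table[n][dest]:
                    table[n][dest] = cost[(n, peer)] + d

print(table["A"])   # A now thinks D costs 2 via the liar; the real path costs 3
```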

Both of these problems tend to happen for two reasons.  The first is that the device is parameterized incorrectly, meaning there’s a human-error dimension.  The largest network outage I ever saw in my career, which took over fifty sites down hard for more than 24 hours and kept at least a quarter of the sites failing at any given time for a week, was caused by a parameter error.  The second is a software problem.  We’ve all heard about how a software update to a device can cause it to behave badly with its neighbors.

Logically, the questions to ask for NFV and SDN are, first, how susceptible they’d be to this current pair of issues and, second, whether new issues might arise.  Let’s look at those points.

We have to set the stage here.  In SDN, we have a number of service models in play these days.  Classic OpenFlow SDN says that white-box switches have their forwarding managed entirely by a central SDN controller.  In some cases classic SDN is overlaid on legacy forwarding, meaning that there’s still adaptive topology management being done by the device but explicit forwarding control via OpenFlow is possible.  Other models (Cisco’s preferred approach) would retain legacy adaptive behavior completely and use policies to add software control over the process.
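
For concreteness, here’s roughly what pure central forwarding control looks like in code: a minimal sketch using the open-source Ryu controller and OpenFlow 1.3 (my choice of tooling for illustration, not something any SDN spec mandates).  When a switch connects, the controller pushes a default rule sending unmatched traffic up to itself; every subsequent forwarding decision originates centrally.

```python
# Minimal Ryu app: install a default flow rule when a switch connects.
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3

class MinimalController(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
    def on_switch_connect(self, ev):
        dp = ev.msg.datapath
        ofp, parser = dp.ofproto, dp.ofproto_parser
        # Lowest-priority rule: anything unmatched goes to the controller,
        # which then decides (centrally) what forwarding rules to install.
        match = parser.OFPMatch()
        actions = [parser.OFPActionOutput(ofp.OFPP_CONTROLLER,
                                          ofp.OFPCML_NO_BUFFER)]
        inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]
        dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=0,
                                      match=match, instructions=inst))
```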

In any model that retains adaptive behavior, we have the same risks that we have today.  If the model adds central SDN forwarding control, then we add the risks that such control might bring.  Primarily, the risk of central control is the failure of the central controller.  If a controller fails, then it can’t respond to network conditions.  That doesn’t mean the network fails, only that it can’t be changed to reflect failures or changes in traffic or connection topology.  The big question for an SDN controller, IMHO, is whether it’s stable under load.  My biggest-network-failure example was caused by a parameter error, but the reason it exploded was that the error caused a flood of problems that swamped a central control mechanism.  When that mechanism failed under load, everything broke, and since everything now had to be restored at once, the load never fell far enough for the controller to come back up.
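
There’s no standard answer to the stability question, but here’s a sketch of one defensive pattern (my own illustration, not code from any shipping controller): a bounded event queue that coalesces duplicate failure reports and sheds the excess, so a failure storm degrades the controller gracefully instead of toppling it.

```python
import collections

class BoundedEventQueue:
    """Coalesce duplicate events and shed excess load rather than let
    a failure storm grow without bound (illustrative only)."""

    def __init__(self, max_pending=1000):
        self.max_pending = max_pending
        self.pending = collections.OrderedDict()   # key -> latest event
        self.dropped = 0

    def offer(self, key, event):
        if key in self.pending:
            self.pending[key] = event    # coalesce: keep only latest state
        elif len(self.pending) < self.max_pending:
            self.pending[key] = event
        else:
            self.dropped += 1            # shed load; count it for alarms

    def drain(self):
        while self.pending:
            _, event = self.pending.popitem(last=False)
            yield event

# A port that flaps 10,000 times collapses into one pending event.
q = BoundedEventQueue()
for i in range(10_000):
    q.offer(("port-down", "switch-7", "eth3"), {"seq": i})
print(len(q.pending), q.dropped)   # -> 1 0
```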

Bad-apple device problems in SDN wouldn’t spread through the network’s topology and forwarding the way they do in adaptive networks, but if a device went maverick and didn’t execute forwarding updates correctly or at all, the central controller might not “understand” that the route it commanded hadn’t really been put in place.  I’ve not yet seen a demo or test of a controller that involved checking the integrity of routes and perhaps flagging a device as down if it’s not doing what it’s supposed to do.
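
What I’d want to see is something like the sketch below.  The controller calls (get_flow_packet_count, mark_device_suspect) are hypothetical placeholders, not any real controller’s API; the point is the logic of comparing flow counters along a commanded path to catch the hop that isn’t really forwarding.

```python
import time

def verify_route(controller, path_hops, flow_id, interval=5.0):
    """Walk a commanded path; a hop whose flow counter stays flat while
    its upstream neighbor's counter advances is not really forwarding."""
    before = {hop: controller.get_flow_packet_count(hop, flow_id)
              for hop in path_hops}
    time.sleep(interval)
    after = {hop: controller.get_flow_packet_count(hop, flow_id)
             for hop in path_hops}

    for upstream, downstream in zip(path_hops, path_hops[1:]):
        if (after[upstream] > before[upstream]
                and after[downstream] == before[downstream]):
            controller.mark_device_suspect(downstream, flow_id)
            return downstream    # the hop that swallowed the traffic
    return None                  # path looks healthy
```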

The cutoff problem in SDN carries the same kind of risk.  A device could be cut off because an adjacent device killed the route to the controller.  If the device is functional enough to do what it would presumably be designed to do (try other paths), and if other paths are available, you might still be able to restore contact.
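
On the device side, reasonable behavior would look something like this sketch (placeholder addresses and generic socket logic, not any vendor’s agent): when the control channel dies, walk every known controller path with backoff rather than sitting unreachable.

```python
import socket
import time

# Hypothetical list of controller endpoints reachable over different paths.
CONTROLLERS = [("10.0.0.1", 6653), ("10.0.1.1", 6653)]

def connect_to_controller(initial_delay=2.0, max_delay=60.0):
    delay = initial_delay
    while True:
        for host, port in CONTROLLERS:    # try every known path in turn
            try:
                return socket.create_connection((host, port), timeout=5)
            except OSError:
                continue                  # this path is down; try the next
        time.sleep(delay)                 # all paths failed; back off
        delay = min(delay * 2, max_delay)
```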

Overall, my feeling is that purist OpenFlow SDN is less at risk from traditional adaptive-behavior-related outages for the obvious reason that it relies on central control.  If the controller is designed properly and hosted reliably, and if the devices are set up to deal with loss of their path to the controller in a reasonable way, then I think you can say that classic SDN would be more reliable than legacy networks.

NFV is a bit more complicated.  NFV doesn’t aim at changing network control-plane behavior, so if you hosted VNFs that did switching and routing via NFV you’d simply substitute a software version of a device for an appliance version.  All the adaptive risks would be the same.  If you hosted SDN VNFs and centrally controlled them, you’d now have the SDN risks.  Where NFV is different is first in the issue of node reliability and second in the management plane.

Servers, data center switches, and intra-VNF paths in an aggregate configuration make NFV more complex and likely generate a lower MTBF than you’d have with a ruggedized appliance.  NFV could potentially have an improved MTTR because you could fail over components, but you’d see an outage in most cases.  We also don’t really have much data on how fast service could be restored and how an extensive failure like a data center drop would impact the ability to even find alternative resources.  Thus, it’s hard to say at this point just what NFV will really do to network availability.
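
Some back-of-envelope arithmetic shows the shape of the trade (all figures invented for illustration).  An element’s availability is MTBF divided by the sum of MTBF and MTTR, and a serial chain is only as available as the product of its elements, so the NFV chain’s result hinges almost entirely on whether the fast MTTR actually materializes.

```python
# Illustrative availability math; every number here is invented.

def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A ruggedized appliance: rarely fails, but hours to swap out.
appliance = availability(mtbf_hours=200_000, mttr_hours=4)

# An NFV chain (server + data center switch + hosted VNF): each element
# fails more often, but remediation by redeployment takes minutes.
nfv_chain = 1.0
for mtbf, mttr in [(50_000, 0.1), (100_000, 0.1), (20_000, 0.1)]:
    nfv_chain *= availability(mtbf, mttr)

print(f"appliance: {appliance:.6f}")   # ~0.999980
print(f"NFV chain: {nfv_chain:.6f}")   # ~0.999992, but only if MTTR holds
```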

On the management side it’s even more complicated.  In traditional networks, management and data paths are handled equally, meaning that you have connectivity for both or for neither.  In NFV, the presumption is that at least a big chunk, if not all, of the management data is carried on a subnetwork separated from the service data paths.  It’s not unlike the SS7 signaling network of the old phone network (which we still have in most of the world).  If we presume that VNFs are isolated to secure them from accidental or malicious hacking from the data plane, we now have a subnet for every VNF and management connections within (and likely to and from) those subnets.  Because NFV depends more on rapid remediation than on appliance-grade reliability for its availability, a loss of management integrity could hurt it more.

The net for NFV is that we don’t know.  We have not built an NFV network large enough to be confident we’ve exposed all the issues, and we haven’t fully identified the possible issues, much less tested them in credible configurations.  I think it would be possible to build NFV networks that were less susceptible to both the bad-apple and cutoff problems, but I’m not sure the practices to do that have been accepted and codified.

The net, IMHO, is this.  If we do both SDN and NFV right, we could reduce the kind of outages we’ve seen this week.  If we do them badly, deploying either or both would make things worse.  Since we have far less experience managing SDN and NFV than managing legacy networks, that tells me that we have to be graceful and gradual in our evolution, or we’ll make reporters a lot happier with dramatic stories than we’ll make customers happy with reliable networks.