Is There a Cloud/NFV Lesson in the Delta Outage?

How can it happen?  We hear today that a power failure at Delta Airlines’ Atlanta hub took their systems down for an extended period of time.  Obviously, not all of Delta’s ongoing problems were IT problems; once you mess up a ballet like airline flight and crew scheduling, it’s hard to get everyone dancing in order again.  However, a system outage is what started it all.

I had a client ask me how this was possible in this age of supposed high availability, and while I can’t answer that question specifically for the Delta outage, I did have an experience in the past that sheds some light on why something like a power hit can cause a massive, protracted mess, even without considering the scheduling aftermath.

The client was in the healthcare industry, and like Delta they had an “old” data center system supporting a large number of old terminal devices.  They had decided to use a modern IP network to concentrate their terminal traffic at key points and aggregate it for transport back to their data center location.  I’ve worked with the airline protocols, and they were similar to the ones used by the healthcare company.

The problem for my healthcare client started when, after setting up their new network over a weekend, they faced their first full, normal workday.  As traffic loads grew, there were reports of terminals losing communication, and those reports mounted quickly that Monday.  In fact, the number of terminals that had lost service was so large that the data center system, which was designed to self-diagnose problems, determined that there must be a problem with the software image and triggered a software reload.

The reload, of course, took all the terminals down, and that meant the entire network of almost 17 thousand terminals tried to come up at one time.  It didn’t come up.  In fact, for the next three days they couldn’t even get half the network running at once, no matter what they did.  They called me in a panic on Thursday, asking if I could help, but my schedule prevented me from giving them the full week on-site that they wanted until two weeks later.  I asked them to send me some specific information by express so I could get a head start on the analysis, and told them I’d give them a schedule of the interviews I needed within a week.

It only took a day to find the problem once I got there, and less than 8 hours to resolve it completely.  There were some ugly confrontations with vendors at meetings (including a memorable scene where a healthcare executive literally threw a binder of documents into the chest of a vendor executive), and a key manager for the healthcare company resigned.  It was the worst outage in their history.

So what caused it, you wonder?  Two things.

The first problem was that old devices, particularly old terminals that use old protocols, don’t work as expected on modern networks without considerable tweaking.  Router networks are not wires; they introduce significant delay at the best of times, and during periods of congestion it gets worse as messages queue up in buffers.  Old terminal protocols that think they’re running on a wire will send a message, wait a couple of seconds, and then assume something got lost in transit and send it again.  After a couple of tries, the terminal or the computer will try to recover by reinitializing the dialog.  That works nearly all the time…except when an old message is stuck in a buffer somewhere in a router network.

If you’re on a wire you might have to deal with a lost message, but things don’t pass each other on a wire.  When you get an out-of-sequence message, you assume something very bad has happened and you reinitialize.  As traffic in the healthcare network increased that Monday, more messages meant more delay and buffering, and eventually some of the terminals “timed out” and tried to start over.  Then the old messages came in, which caused them to start again…and again…and still more times.  And of course, when everything had to be restarted because the data center system reloaded, all of these restarts were synchronized and created even more traffic.  The network was swamped, nobody got logical message sequences, and nothing worked.
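
Here’s a minimal sketch of that failure mode in Python.  The protocol, timings, and numbers are all my own illustrative assumptions, not the client’s actual system: a terminal retransmits after a short timeout and reinitializes its dialog when a reply arrives out of sequence, while the “network” simply holds every message for a delay that grows under congestion, the way loaded router buffers did.

    import collections

    class BufferingNetwork:
        """The round trip to the host is modeled as a single delay: whatever the
        terminal sends comes back, carrying the same sequence number, some ticks
        later.  Under congestion the delay grows, so replies to abandoned
        exchanges can arrive long after the terminal has moved on."""
        def __init__(self):
            self.in_flight = collections.deque()      # (deliver_at, seq)

        def send(self, now, seq, delay):
            self.in_flight.append((now + delay, seq))

        def deliveries(self, now):
            ready = [m for m in self.in_flight if m[0] <= now]
            for m in ready:
                self.in_flight.remove(m)
            return [seq for _, seq in ready]

    class Terminal:
        TIMEOUT = 2      # "wait a couple of seconds, then send it again"
        RETRIES = 3      # "after a couple of tries, reinitialize the dialog"

        def __init__(self):
            self.seq, self.sent_at, self.tries, self.reinits = 0, None, 0, 0

        def reinitialize(self):
            self.reinits += 1
            self.seq += 1                 # fresh dialog, new expected sequence
            self.tries = 0

        def tick(self, now, net, delay):
            for seq in net.deliveries(now):
                if seq == self.seq:       # the reply we expected: move on
                    self.seq += 1
                    self.tries = 0
                else:                     # stale reply from an abandoned exchange:
                    self.reinitialize()   # the protocol assumes the worst
                self.sent_at = None
            if self.sent_at is None or now - self.sent_at >= self.TIMEOUT:
                if self.sent_at is not None:          # this send is a retry
                    self.tries += 1
                    if self.tries >= self.RETRIES:
                        self.reinitialize()
                net.send(now, self.seq, delay)        # every recovery means more traffic
                self.sent_at = now

    net, term = BufferingNetwork(), Terminal()
    for t in range(40):
        delay = 1 if t < 10 else 6        # congestion pushes delay past the timeout
        term.tick(t, net, delay)
    print("dialog reinitializations:", term.reinits)  # climbs once the delay exceeds the timeout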

The solution to the problem was simple: tell the routers not to buffer messages from the terminals, but to discard them instead.  A lost message can be recovered from, but one that arrives too late, after a recovery process has already started, will never be handled correctly.  One parameter change and everything worked.
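
In the same toy model, reusing the BufferingNetwork and Terminal classes from the sketch above, the fix amounts to an age limit: anything that has waited longer than the protocol is willing to wait gets discarded rather than delivered late.  (The real change was a router configuration parameter; this just expresses the same idea in the simulation.)

    class DroppingNetwork(BufferingNetwork):
        MAX_AGE = 2                                       # roughly the protocol timeout

        def send(self, now, seq, delay):
            self.in_flight.append((now + delay, seq, now))    # remember when it was sent

        def deliveries(self, now):
            ready = [m for m in self.in_flight if m[0] <= now]
            out = []
            for m in ready:
                self.in_flight.remove(m)
                deliver_at, seq, sent = m
                if deliver_at - sent <= self.MAX_AGE:     # fresh enough to be useful
                    out.append(seq)                       # late messages are dropped,
            return out                                    # never delivered out of order

    net, term = DroppingNetwork(), Terminal()
    for t in range(40):
        delay = 1 if t < 10 else 6
        term.tick(t, net, delay)
    print("dialog reinitializations:", term.reinits)      # now only from retry exhaustion,
                                                          # not from stale replies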

Power failures can create this same situation.  Delta had a power hit, and their backups failed to engage properly.  If my healthcare client had had a data center power failure that took down the computers, or even just took down the network connections, they’d have experienced exactly the same problem of synchronized recovery.  I can’t say whether Delta saw this problem, but they might have.

What about the second of the two causes?  That was a failure to address simultaneous-event overload in the design.  A network is usually designed with some significant margin over the peak traffic experienced; in the case of my healthcare client, it was designed for double the normal load.  The problem was that nobody considered what would happen if everything failed at the same time and then tried to restart at the same time.  The traffic generated during the restart was five times the normal peak load and well over double the design load.  Nobody thought about how recovering from a failure might change the traffic patterns.
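
The back-of-the-envelope arithmetic looks something like this; every figure below is an assumption chosen so the ratios match the ones in the story, not the client’s actual traffic numbers.

    normal_peak = 1_000                    # messages/sec at the normal busy hour (assumed)
    design_margin = 2.0                    # "designed for double the load"
    design_capacity = normal_peak * design_margin

    terminals = 17_000
    startup_msgs_per_terminal = 5          # sign-on, configuration, first poll... (assumed)
    restart_window_sec = 17                # everything tries to come up at once (assumed)

    restart_load = terminals * startup_msgs_per_terminal / restart_window_sec
    print(f"design capacity: {design_capacity:,.0f} msgs/sec")
    print(f"restart load:    {restart_load:,.0f} msgs/sec "
          f"({restart_load / normal_peak:.1f}x normal peak, "
          f"{restart_load / design_capacity:.1f}x design capacity)")
    # -> 5.0x the normal peak and 2.5x the design capacity with these assumed
    #    numbers: the shape of the overload described above.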

The cloud could resolve many or perhaps even all of these problems, but only if the cloud and the applications running in it were properly prepared.  Have we ever had a cloud outage created by a power hit?  Darn straight, but if we had a truly distributed cloud, one with hundreds or thousands of data centers, it’s unlikely they’d all take a power hit at once.  If devices needed special startup processes, could those processes be spawned in as many instances as needed to get things running, so the devices could be “turned over” to the main application in a good state?  Darn straight, provided we designed the applications that way.
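
As a sketch of what “spawn as many startup instances as needed” might mean, here’s a trivial sizing rule; the rule and every number in it are assumptions for illustration, not any particular cloud’s mechanism.

    import math

    def startup_handlers_needed(reconnecting_devices: int,
                                devices_per_handler: int = 200,
                                max_handlers: int = 500) -> int:
        """Scale the number of temporary startup-handler instances to the reconnect
        backlog instead of funneling every recovering device through one fixed
        front end; hand devices to the main application once they're in a good state."""
        needed = math.ceil(reconnecting_devices / devices_per_handler)
        return min(max(needed, 1), max_handlers)

    print(startup_handlers_needed(300))       # a quiet morning: 2 handlers
    print(startup_handlers_needed(17_000))    # everything restarts at once: 85 handlers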

Are we designing for high availability and resiliency today?  Some will say we are, but the interesting thing is that they’re designing for the failure of an application component, not for the overloading of the recovery processes themselves.  How many new instances will have to be deployed in response to a failure?  How many new connections will be needed?  How many threads can we run through OpenStack Nova or Neutron instances at a time, and how many instances of Nova or Neutron do we have available?
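
To make that question concrete, here’s the kind of arithmetic that rarely gets done.  Every figure is an assumption to be replaced with measured numbers, and I’m deliberately not modeling any specific orchestrator’s API.

    failed_instances = 5_000          # components lost to one power event (assumed)
    deploys_per_min_per_worker = 10   # measured throughput of one orchestration worker (assumed)
    orchestration_workers = 8         # parallel scheduler/API workers available (assumed)

    minutes_to_recover = failed_instances / (deploys_per_min_per_worker * orchestration_workers)
    print(f"time to redeploy everything: {minutes_to_recover:.0f} minutes")   # about an hour

    # The single-component failover test passes in seconds; the design question is
    # whether an hour of synchronized redeployment is acceptable, and if not, how
    # much more orchestration capacity (or how many more sites) it takes.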

A resilient system isn’t resilient if the processes that sustain its resiliency are themselves brittle.  It’s easy to say that we can replace a component if it fails, or scale one that’s under load, but just how many of these operations can be done at one time, and aren’t there conditions, like a power hit, that could generate the need for a massive, synchronized recovery?  Remember that our vision of the cloud has evolved considerably from hosted server consolidation, but our tools are still rooted in that trivial model.  Delta’s experience may prove that we need to think harder about this problem.