Two Tales, One Cloud

If you’re a cloud fan, which I am in at least the sense that I believe there’s a cloud in everyone’s future, it’s been a mixed week for news.  VMware has announced its Nicira-based NaaS platform, aimed I think at cloud data centers, and the move has gained a lot of traction among the “anybody-but-Cisco” crowd.  Meanwhile, a major Amazon outage has made more people wonder how the cloud can be more reliable when cloud outages seem a regular occurrence.

On the VMware side, I think that there’s an important move afoot, but not one as revolutionary as VMware might like to portray.  If you look at software-overlay SDN, the space Nicira launched, it’s evolving in two dimensions.  First, it’s spreading toward a complete network architecture by embracing more end-to-end capability.  Second, it’s becoming a more formal “network-as-a-service” framework, focusing on what it delivers more than how it’s delivered.

The challenge for VMware is that anything designed to piggyback on virtualization is going to be inhibited with respect to both these evolutions, and for the same reason: users.  Making NaaS work inside a data center or even at the heart of a cloud isn’t all that difficult, but you’re then either focusing on NaaS services that are horizontal inter-process connections or you’re doing one half of a network, the half where the applications or servers reside, and not the end where users connect.  With limited geographic scope you can’t be a total solution.

I think it’s very possible to construct a model for enterprise network services wherein each application runs in a subnet, each category of worker runs in a branch subnet, and application access control manages what connections are made between the branches and the applications.  VMware could do this, though I admit it would force them to create branch-level software SDN tools that would necessarily rely on software agents in end-system devices.  But would VMware’s “excited” new partners jump on a strategy that threatened network equipment?  “Anybody but Cisco” has more partner appeal than “Anything but routers!”

The thing is, all of this protective thinking is inhibiting realization of SDN opportunity by limiting the business case.  SDN isn’t one of those things that you can cut back on and still achieve full benefits.  The less there is of it, the less value it presents, the less revolution it creates.  For VMware and its partners, the big question is whether SDN in their new vision is really “new” enough, big enough, to make any difference.  What it might do is set Cisco up to turn the tables on them, because nobody will like little SDN in the long run.  Go big or go home.

With respect to Amazon, I think we’re seeing the inevitable collision of unrealistic expectations and market experiences.  Let me make a simple point.  Have twenty servers spread around your company with an MTBF of ten thousand hours each, and you can expect each server to fail on average about once every fourteen months.  There’s still a pretty good chance that at least one of the twenty will fail in any given month, so something will be down often.  Put the same 20 servers in a cloud data center behind a single cloud network with perhaps 20 devices in it and you have a whole new thing.  We can assume the same server MTBF, but if the network works only if a half-dozen devices all work, the MTBF of the network is a lot lower than that of the servers.  And when the network fails all the applications fail, something that would have been outlandishly improbable with a bunch of servers spread around.
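To put rough numbers on that, here is a minimal Python sketch of the arithmetic.  It assumes failures are independent and arrive at a constant rate (the usual exponential model), and it assumes each network device has the same ten-thousand-hour MTBF as a server, which is my guess and not a vendor figure.  The function names are just illustrative.

```python
import math

HOURS_PER_YEAR = 8760.0
HOURS_PER_MONTH = HOURS_PER_YEAR / 12.0


def prob_any_failure(mtbf_hours, count, window_hours):
    """Chance that at least one of `count` independent units fails in the window."""
    combined_rate = count / mtbf_hours  # failures per hour across all units
    return 1.0 - math.exp(-combined_rate * window_hours)


def series_mtbf(mtbfs_hours):
    """MTBF of a chain that is up only while every element in it is up."""
    return 1.0 / sum(1.0 / m for m in mtbfs_hours)


SERVER_MTBF = 10_000.0          # hours, from the example in the text
NETWORK_DEVICE_MTBF = 10_000.0  # hours, assumed; the text gives no figure

months_between_failures = SERVER_MTBF / HOURS_PER_MONTH
p_monthly = prob_any_failure(SERVER_MTBF, 20, HOURS_PER_MONTH)
network_mtbf = series_mtbf([NETWORK_DEVICE_MTBF] * 6)

print(f"One server fails about every {months_between_failures:.1f} months")
print(f"Chance at least one of 20 servers fails in a month: {p_monthly:.0%}")
print(f"MTBF of a six-device network path: {network_mtbf:.0f} hours")
```

Under those assumptions the six-device path has about one sixth the MTBF of a single server, and every application behind it goes down together when it fails.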

My point here is that the cloud is not intrinsically more reliable than discrete servers; it simply offers more and better mechanisms to control mean time to restore, or MTTR.  You may fail more often in the cloud but you’ll be able to recover faster.  If one of our 20 servers breaks it might take hours or days to get it back, because you have to send a tech, replace it, and then reload all the software.  Amazon was out less than an hour.  Could Amazon have engineered its cloud better, so the outage was shorter?  Perhaps, but it would then be less profitable, and we have to get away from our childish notion that everything online or in the cloud is a divine right we get at zero cost.  Companies either make money on services or they don’t offer them.
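The trade between failing more often and recovering faster can be sketched with the standard steady-state availability formula, MTBF / (MTBF + MTTR).  The repair times below are illustrative assumptions only: a full day to restore a discrete server, one hour to restore the cloud, and a cloud MTBF one sixth that of a single server.  None of these are measured figures for Amazon or anyone else.

```python
HOURS_PER_YEAR = 8760.0


def availability(mtbf_hours, mttr_hours):
    """Steady-state fraction of time a system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(mtbf_hours, mttr_hours):
    """Expected hours of outage per year implied by that availability."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Discrete server: fails rarely, but takes a day to restore (assumed).
discrete_downtime = downtime_hours_per_year(10_000.0, 24.0)

# Cloud: fails six times as often (assumed), but is back in an hour (assumed).
cloud_downtime = downtime_hours_per_year(10_000.0 / 6.0, 1.0)

print(f"Discrete server downtime: {discrete_downtime:.1f} hours/year")
print(f"Cloud downtime:           {cloud_downtime:.1f} hours/year")
```

With these made-up inputs the cloud fails more often yet accumulates less downtime per year, which is the whole point about MTTR.  Change the assumed repair times and the comparison can flip, which is why the details matter.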

The fault here isn’t Amazon’s; it’s ours.  We want to believe in the cloud, and we don’t want to fuss over complicated details like the formulas for calculating MTBF and MTTR for complex systems of devices.  The details of cloud computing are complicated, and the suppliers of the services and the vendors involved have little to gain by exposing them.  If buyers and the media don’t demand that exposure we’ll never get to those complexities, and never really have an idea of how reliable a given cloud service is, or could be.

The other point I think is interesting about the Amazon cloud outage is that we’ve had several of these now and the big news is the number of social-network sites that are out, not the companies that have lost all their compute capabilities.  It’s not that company outages aren’t important but that it’s likely all the big customers of Amazon’s cloud are social-network startups.  That’s not a bad thing, but it does mean that those who use Amazon’s cloud growth as a measure of cloud adoption by enterprises may be mixing apples and oranges.

Two tales, and both suggest that we’re not getting the most from the cloud because we’re not trying hard enough.
