Everyone has surely heard about Google’s massive outage, even those who weren’t directly impacted by it. The fact that “network congestion” could take down such a broad set of services—some essential to their users—is troubling, and it should be. The details of the cause will probably come out eventually, but it appears the issues were broader than Google’s cloud alone. Whatever they were, there are a lot of C-suite executives who are very worried this morning.
The biggest worry reported by execs who have been emailing me is that the cloud is less reliable than their own data centers. One said, “I’ve committed my company to a cloud strategy to improve availability, and now parts of my cloud failed while my data center is running fine.” Obviously this sort of thing could hurt the cloud, eventually. My model of buyer behavior says that the credibility of Google’s cloud, and of public cloud services in general, won’t be impacted by this one outage, but if cloud failures persist, that could change.
There’s a pretty solid chance that this problem, and in fact most of the publicized problems with the cloud, are really network problems rather than hosting problems. Cloud data centers, “hyperconverged” if you like, tend to be highly reliable because they have built-in redundancy. The problem with cloud computing isn’t the computing part at all, but the fact that network connectivity is essential to the cloud. In fact, connectivity is the cloud’s foundation.
IP networks are “adaptive”, meaning that they rely largely on cooperative behavior among devices to set up routes and respond to failures. Routers “advertise” reachability, and various protocols are used to exchange this information, which is then turned into routing tables. For ages, back into the ’70s in my own experience, there’s been a debate over whether this kind of autonomy is a good thing or a bad thing. The good part is that networks built on autonomous device behavior are self-adapting. The bad part is that if they get it wrong, it may be difficult to figure out what happened and how to remedy the problem, which seems to have been the case here.
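To make that autonomy concrete, here is a minimal Python sketch of distance-vector-style adaptation. It’s purely illustrative (the function and table layout are my own invention, not any real routing protocol implementation), but it shows how each device builds its routing table from what its neighbors advertise, and how a failure forces re-convergence.

    # Minimal sketch of distance-vector-style adaptation: each router keeps a
    # table of {destination: (cost, next_hop)} and updates it from neighbor
    # advertisements, roughly the way protocols like RIP adapt on their own.

    def update_table(table, neighbor, neighbor_table, link_cost):
        """Merge a neighbor's advertised routes into our own routing table."""
        changed = False
        for dest, (cost, _) in neighbor_table.items():
            new_cost = cost + link_cost
            if dest not in table or new_cost < table[dest][0]:
                table[dest] = (new_cost, neighbor)  # better route learned
                changed = True
        return changed

    # Router A learns routes from neighbor B over a link of cost 1.
    a_table = {"A": (0, "A")}
    b_table = {"B": (0, "B"), "C": (2, "C")}
    update_table(a_table, "B", b_table, link_cost=1)
    print(a_table)  # {'A': (0, 'A'), 'B': (1, 'B'), 'C': (3, 'B')}

    # If the link to B fails, A withdraws everything learned through B and has
    # to re-learn it from someone else; that self-adapting behavior is hard to
    # predict when thousands of routers are doing it at once.
    a_table = {dest: route for dest, route in a_table.items() if route[1] != "B"}
    print(a_table)  # {'A': (0, 'A')}

The point isn’t the code, it’s that the “right” routes are an emergent property of many devices making local decisions, which is exactly what makes a misbehaving network hard to reason about.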
Proponents of SDN have long felt that IP was a bit too much of a jolly band of brothers and too little an organized and controllable technology to build networks on. They’d like to replace IP with central control of routes, and if you look at that as a path to a more controllable, “traffic engineered” network, the notion seems valid. The problem is that nobody has proved it would work at anything approaching the scale of the Internet, and it’s far from clear that any operator would be willing to make the massive technology change that replacing traditional adaptive routing with SDN would involve.
The other truth here is that when the dust settles on this outage, we’re very likely to find that the fault was caused by a configuration problem, meaning human error. Yes, you could argue that configuring devices that, once set up, largely do their own thing makes it hard for network engineers to predict how the changes they make will relate to future network conditions. A number of operators have told me this disconnect is the cause of most of their significant network outages. The point is that human error isn’t going to be eliminated by moving to a networking model that’s based more on human traffic engineering decisions than on the decisions made by autonomous devices.
What this outage proves is that we really do need some form of AI in networking, not just in the “management” or “traffic engineering” parts of it but also in the configuration management part. In the cloud, meaning in the hosting part of the cloud, we’ve already recognized the underpinnings of a mature, reliable framework for host management, and they start with the notion of a “single source of truth” and a “declarative model”.
Declarative models describe the “goal-state” of a system and employ automated processes to converge on that (or another acceptable) goal state when something goes awry. We’ve been shifting increasingly to declarative DevOps, for example, in both VM and container hosting, and in the public cloud. In networking, the rule is still script-based or “imperative” management, where instead of saying what you want, you tell the network what to do. That process is much harder to add AI to, and it’s also risky to have scripts changing things while the devices themselves are adapting to conditions, including the conditions your changes create.
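As a deliberately simplified illustration, here is what the declarative pattern looks like in a few lines of Python. The names and the “network” are hypothetical; the point is the shape of the logic: you state the goal, observe the actual state, and a reconciliation loop computes what has to change.

    # Hypothetical goal-state: what we want the network to look like.
    desired = {"vlan10": {"mtu": 9000, "admin": "up"},
               "vlan20": {"mtu": 1500, "admin": "up"}}

    def observe():
        """Stand-in for reading live device state (via an API, telemetry, etc.)."""
        return {"vlan10": {"mtu": 1500, "admin": "up"}}  # drifted from the goal

    def reconcile(desired, actual):
        """Compute the actions needed to move the actual state to the goal state."""
        actions = []
        for name, goal in desired.items():
            current = actual.get(name)
            if current is None:
                actions.append(("create", name, goal))
            elif current != goal:
                actions.append(("update", name, goal))
        for name in actual:
            if name not in desired:
                actions.append(("delete", name))
        return actions

    print(reconcile(desired, observe()))
    # [('update', 'vlan10', {'mtu': 9000, 'admin': 'up'}),
    #  ('create', 'vlan20', {'mtu': 1500, 'admin': 'up'})]

An imperative script would instead encode the individual commands to run, which is exactly the thing that collides with devices that are adapting on their own.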
The single source of truth is perhaps the solution to this. Rather than have a dozen or more sources of configuration information, single-source-of-truth approaches keep everything (which should mean all the goal-states) in a common repository. Since Git is the most common repository, this approach is often called “GitOps” because it builds everything around the repository. Everything that needs a configuration has to get it from there, and the repository can examine the totality of data and ensure that everything is consistent.
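Continuing the same sketch, a single source of truth means those goal-states live together in one repository checkout, where they can be validated as a set before anything touches a device. The repository layout and file names below are hypothetical, just to show the idea of checking the totality of the data for consistency.

    import json
    from pathlib import Path

    # Local checkout of the (hypothetical) Git repository holding all goal-states.
    REPO = Path("network-config-repo")

    def load_goal_states(repo):
        """Read every device's declared goal-state from the repository."""
        return {path.stem: json.loads(path.read_text())
                for path in repo.glob("devices/*.json")}

    def check_consistency(goal_states):
        """Check the whole repository, not one device at a time: every link
        must name a peer device that is also declared somewhere in the repo."""
        errors = []
        for device, state in goal_states.items():
            for link in state.get("links", []):
                if link["peer"] not in goal_states:
                    errors.append(f"{device}: link to undeclared peer {link['peer']}")
        return errors

    if __name__ == "__main__":
        states = load_goal_states(REPO)
        problems = check_consistency(states)
        print(problems or "repository is internally consistent")

In a GitOps workflow, a commit to this repository would be what triggers the kind of reconciliation loop sketched above.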
Why, if the cloud has figured out GitOps, has this problem with cloud/network symbiosis not been solved? The big reason is that even though we’ve virtualized practically everything these days, we’ve somehow not really done that with the network itself. Even technologies like overlay SDN and SD-WAN, which provide virtual connectivity, do so by layering it onto the transport network, which is almost surely IP and which is treated as a separate resource, presumed to be available. Things like service mesh are only now starting to integrate network thinking and cloud thinking.
Connectivity is what makes the cloud a cloud and not just a big data center. We can’t make the cloud work if we can’t make the network it depends on work, and that network is the Internet. We need to figure out how to make GitOps-like technology converge with what network geeks often call “NetOps”, and we’re not there yet. Google’s outage shows we need to accelerate the process of getting there.