Google is another cloud or OTT company that’s done a lot of exciting things. In fact, Google’s infrastructure is a bit of a poster child for what the future telco cloud or Internet should look like. According to a recent SDxCentral piece, Google is both pleased and worried, and since Google is arguably the global leader in insight about cloud infrastructure, we need to look at their approach, their reasons to be happy, and their risks.
At the top level, what Google has done is to build an enormous, global, virtual data center using SDN and encapsulate it within a virtual BGP “autonomous system”. Inside the Google black box, SDN is used much as others use it in virtual data center networks, to build large-scale ad hoc connectivity with SDN tools. At the edge of the black box, Google emulates BGP routers using hosted software, and thus presents its services to users in a network-composable way. It’s a lot like intent-based networking; all of Google is an abstraction of an IP autonomous system, and inside that abstraction Google is free to do what works best.
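As an illustration of the “software BGP edge” idea, here is a minimal sketch, and only a sketch: it uses the open-source ExaBGP toolkit rather than anything we know Google actually runs, and the prefix and next-hop addresses are hypothetical documentation values. A hosted process announces a service prefix into the routing system and withdraws it if the service behind it goes unhealthy, so the outside world only ever sees an ordinary BGP route.

```python
#!/usr/bin/env python3
# Illustrative only: a hosted process that presents a service prefix to the outside
# world through BGP, using the open-source ExaBGP toolkit's text API (ExaBGP runs
# this script as a child process and reads routing commands from its stdout).
# This is NOT Google's implementation; prefix and next-hop below are hypothetical.
import sys
import time

SERVICE_PREFIX = "203.0.113.0/24"   # hypothetical service (e.g., anycast) prefix
NEXT_HOP = "192.0.2.1"              # hypothetical address at the edge of the "black box"

def service_is_healthy() -> bool:
    # Placeholder: a real deployment would probe the service behind the prefix here.
    return True

def main() -> None:
    # Announce the service prefix when the process starts...
    sys.stdout.write(f"announce route {SERVICE_PREFIX} next-hop {NEXT_HOP}\n")
    sys.stdout.flush()
    # ...and withdraw it if the backing service stops being healthy.
    while True:
        time.sleep(10)
        if not service_is_healthy():
            sys.stdout.write(f"withdraw route {SERVICE_PREFIX} next-hop {NEXT_HOP}\n")
            sys.stdout.flush()
            break

if __name__ == "__main__":
    main()
```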
If you were a network operator trying to plan any kind of carrier cloud application, you’d be crazy not to adopt a similar approach. There is no cloud infrastructure on the planet that is as massive, or that has proved its scalability and availability as convincingly, as Google’s. Why this wasn’t taken as the model for things like NFV, and as a use case for SDN, is anyone’s guess, but the fact is that it wasn’t.
The black-box abstraction that Google has used as the basis for its infrastructure is designed to make it possible to run something in any suitable location and have it connect to the user as a service. That means it’s a distributed resource pool, and the two parts of that phrase, “distributed” and “resource pool”, capture both the happy and the concerning pieces of the Google story and the SDxCentral article.
A resource pool is like having multiple check-out lines (“queues” in many countries other than the US) in a supermarket; shoppers don’t have to wait in line as long. There’s a whole body of queuing theory that lets you calculate things like service times and the efficiency of the system of queues, and the results can be plotted in “Erlang” curves. If you look at the math, what you find is that the ability of a pool to serve users doesn’t improve in a linear way as the size of the pool increases; the curve flattens out at some point. That means that eventually, adding more resources to the pool doesn’t improve efficiency any further.
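To make the flattening concrete, here is a small sketch using the classic Erlang-B formula. The grade-of-service target (at most 1% of requests blocked) and the pool sizes are illustrative assumptions, not figures from Google or the SDxCentral piece; the point is simply that achievable per-server utilization climbs quickly for small pools and then levels off.

```python
# Sketch of the "Erlang" effect: for a fixed grade of service, the achievable
# utilization of a pool rises quickly for small pools and then flattens out, so
# adding servers past a certain point buys very little extra efficiency.

def erlang_b(servers: int, offered_load: float) -> float:
    """Erlang-B blocking probability, computed with the standard recurrence."""
    b = 1.0
    for n in range(1, servers + 1):
        b = (offered_load * b) / (n + offered_load * b)
    return b

def max_utilization(servers: int, target_blocking: float = 0.01) -> float:
    """Largest per-server utilization achievable without exceeding the blocking target."""
    lo, hi = 0.0, float(servers)          # offered load in erlangs
    for _ in range(60):                   # simple bisection search
        mid = (lo + hi) / 2
        if erlang_b(servers, mid) <= target_blocking:
            lo = mid
        else:
            hi = mid
    carried = lo * (1 - erlang_b(servers, lo))
    return carried / servers

if __name__ == "__main__":
    for n in (1, 2, 5, 10, 50, 100, 500, 1000):
        print(f"{n:5d} servers -> utilization ~{max_utilization(n):.2f}")
```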
As it turns out, there are other brakes on the size of resource pools, the most important being the operational complexity of the distribution system that links work to resources: the network. Google realized about a decade ago that if they used traditional IP/Ethernet as their “internal” network, the lack of explicit traffic engineering in those protocols would create “non-deterministic” service behavior long before Mister Erlang caught them. That’s why Google was such an early leader in SDN thinking and in the use and management of distributed resource pools. The Google model has worked well up to now, but it’s threatened today by three factors, and addressing any of them is going to take some new initiatives.
Factor one is that Internet experiences are getting a lot more complicated, partly because of the desire to present users with rich visual experiences and partly because of the need to monetize much of the Internet experience through advertising. When you assemble a web page for viewing, you may be grabbing a half-dozen or more ads from many different locations. That process takes time, and the delay reduces the user’s quality of experience.
Google, to control this first factor, has to expand what “distributed” means. If you want to deliver a bunch of content pieces over the network and you don’t want the delay to become intolerable, you need to shorten the delivery path. That’s a primary driver for edge hosting or caching in advertising and media-related web experiences. Google can be expected to expand its own resource pool into edge computing, probably not by expanding the number of servers inside its current black box but by creating a new “edge-black-box” model that will work symbiotically with its current infrastructure. We don’t know yet exactly how this will work, but expect to see it by 2020.
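A back-of-the-envelope sketch of why the delivery path matters is below. The round-trip times, the number of ad objects, and the simple waves-of-fetches model are all assumptions made up for illustration, not measurements.

```python
# Rough model of ad-heavy page assembly: objects are fetched in waves of
# `parallelism` connections, each wave costing one round trip plus a fixed
# per-object service time. All numbers are illustrative.

def page_assembly_time(num_objects: int, rtt_ms: float,
                       parallelism: int = 6, per_object_ms: float = 20.0) -> float:
    waves = -(-num_objects // parallelism)          # ceiling division
    return waves * (rtt_ms + per_object_ms)

if __name__ == "__main__":
    ads = 8  # "a half-dozen or more" ad and content pieces
    print("distant data center (~80 ms RTT):", page_assembly_time(ads, 80.0), "ms")
    print("edge cache          (~10 ms RTT):", page_assembly_time(ads, 10.0), "ms")
```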
The second factor is the Erlang problem combined with the repeal of Moore’s Law. SDxCentral hints at this in their article with the comment on Moore’s Law. Compute power per CPU chip has been growing steadily for decades, but the laws of physics will eventually intervene here, just as the laws of math are intervening in resource pool efficiency. Think of the problem in relation to our supermarket check-out. We can, up to a point, make cashiers more efficient through automated scales and registers. That increases the overall capacity of our distributed system even if we’ve hit the Erlang limit on the gains we could expect from adding more lines. In server terms, faster chips improve performance enough to offset the Erlang limits. Once we lose those regular annual gains in CPU performance, our old friend Erlang bites us.
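A toy calculation makes the offset explicit. All the numbers here (server counts, a 20%-per-year per-chip gain, utilization figures) are hypothetical; the point is only that total capacity is roughly servers times per-server speed times achievable utilization, and once the utilization term has flattened, growth depends on the speed term.

```python
# Toy calculation: total pool capacity ~ servers x per-server speed x utilization.
# With annual per-chip gains, capacity keeps growing even after the Erlang
# utilization curve has flattened; without them, growth must come entirely from
# adding servers, and the flattened curve limits what that buys. Numbers are made up.

def capacity(servers: int, per_server_speed: float, utilization: float) -> float:
    return servers * per_server_speed * utilization

# Year 0: 1,000 servers, unit speed, utilization already near its plateau.
base = capacity(1000, 1.0, 0.90)

# Five years on, doubling the server count only nudges utilization (0.90 -> 0.93).
with_moore    = capacity(2000, 1.0 * 1.2**5, 0.93)   # ~20%/yr per-chip gains
without_moore = capacity(2000, 1.0,          0.93)   # per-chip gains gone

print(f"growth with per-chip gains:    {with_moore / base:.1f}x")
print(f"growth without per-chip gains: {without_moore / base:.1f}x")
```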
The final factor is the operational complexity associated with the distribution of resources. In rough terms, the complexity of a connected, interactive system grows faster than linearly with the number of elements; if the system is fully connected, it grows with the square of the number of elements. Google has, up to now, addressed this through abstractions that define resources and model services. I believe this system is hitting the wall, that Google knows it, and that they plan to address this issue, and the other factors as well, with a new model-driven approach.
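A few worked numbers show how quickly that square-law growth bites (the element counts are arbitrary):

```python
# In a fully connected system, the number of pairwise relationships that have to
# be managed grows with the square of the element count: n * (n - 1) / 2.

def pairwise_links(n: int) -> int:
    return n * (n - 1) // 2

for n in (10, 100, 1_000, 10_000):
    print(f"{n:6d} elements -> {pairwise_links(n):,} pairwise relationships")
```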
Distributed, grid, cloud, or whatever you call it, the application of collective compute and storage resources to complex problems has been the primary pathway toward reducing IT’s dependence on the classic notion that compute power would double every couple of years. The scale of things like the Internet and cloud computing is clearly taxing the basic model of resource pools, and Google and others are now looking for a more comprehensive solution.
The new solution, I think, is going to be a pair of modeled hierarchies, one for resources and one for applications/services. Services and applications will be defined as collections of features, totally composable, with some features offering multiple implementations, either to jiggle the cost/benefit or to address differences in hosting capability. The resources will offer a composable infrastructure abstraction, subdivided by the type of resource (hosting, network, database, etc.) and, in parallel, by any specialization for the location (edge, metro, core).
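To make the two-hierarchy idea a little more tangible, here is a purely illustrative sketch. It is not Google’s design; every class and name in it is hypothetical. It models a service as a collection of features, each feature with alternative implementations bound to resource classes that carry both a type and a location.

```python
# Hypothetical sketch of the "two hierarchies" idea: a service modeled as a
# composable collection of features (each with alternative implementations),
# and a resource model subdivided by type and by location class.
from dataclasses import dataclass, field

@dataclass
class ResourceClass:
    kind: str          # "hosting", "network", "database", ...
    location: str      # "edge", "metro", "core"

@dataclass
class Implementation:
    name: str
    requires: ResourceClass
    cost: float        # relative cost, used to jiggle the cost/benefit trade-off

@dataclass
class Feature:
    name: str
    implementations: list[Implementation]

    def pick(self, available: set[tuple[str, str]]) -> Implementation:
        """Choose the cheapest implementation whose resource class is available."""
        candidates = [i for i in self.implementations
                      if (i.requires.kind, i.requires.location) in available]
        return min(candidates, key=lambda i: i.cost)

@dataclass
class Service:
    name: str
    features: list[Feature] = field(default_factory=list)

# Example: a "video-ad" feature that can be served from the edge (faster, pricier)
# or from a core data center (cheaper, slower).
video_ad = Feature("video-ad", [
    Implementation("edge-cache", ResourceClass("hosting", "edge"), cost=3.0),
    Implementation("core-origin", ResourceClass("hosting", "core"), cost=1.0),
])
service = Service("ad-funded-page", [video_ad])
print(service.features[0].pick({("hosting", "edge"), ("hosting", "core")}).name)
```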
Hierarchical modeling is a fairly common mechanism for managing complexity. You assemble units into logical groups and manage each group as an abstract unit. That allows management tasks to be distributed to some point within each group, which limits both the scope of management control and the load on management processes. It doesn’t solve all problems, though, because the control-plane interactions needed to manage the relationships among service/application elements, and between those elements and the resources that host them, need special work.
I’ve speculated for some time that the convergence of many different cloud forces, including IoT and other carrier cloud applications, would require a broader “orchestration” layer. Today, we tend to think of orchestration as relating to the deployment, redeployment, and scaling of components. A broader layer implies unifying that orchestration task with other resource management and workflow analysis tasks that today sit outside the orchestrators. We are already seeing Google work toward some unification by creating a tighter bond between Kubernetes (traditional orchestration) and Istio (the service mesh that manages workflows among components). I expect the new model will take this convergence even further, to include serverless app orchestration as a subset of orchestration as a whole.
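As a sketch of what such a unified layer might look like, here is a hypothetical interface. None of the class or method names below correspond to a real Kubernetes, Istio, or Google API; they simply show deployment/scaling, workflow routing, and serverless invocation being handled through one abstraction rather than three separate tools.

```python
# Hypothetical unified orchestration interface: deployment/scaling (the
# traditional orchestration task), workflow routing between components (the
# service-mesh task), and event-driven serverless invocation, in one place.
from abc import ABC, abstractmethod

class UnifiedOrchestrator(ABC):
    @abstractmethod
    def deploy(self, component: str, replicas: int, location_hint: str) -> None:
        """Place and scale a component."""

    @abstractmethod
    def route(self, source: str, destination: str, policy: dict) -> None:
        """Steer workflows between components."""

    @abstractmethod
    def on_event(self, event: str, function: str) -> None:
        """Bind an event to a serverless function, as a subset of orchestration."""

class SketchOrchestrator(UnifiedOrchestrator):
    def deploy(self, component, replicas, location_hint):
        print(f"deploy {component} x{replicas} at {location_hint}")

    def route(self, source, destination, policy):
        print(f"route {source} -> {destination} with {policy}")

    def on_event(self, event, function):
        print(f"when {event}, invoke {function}")

orch = SketchOrchestrator()
orch.deploy("ad-composer", replicas=3, location_hint="edge")
orch.route("frontend", "ad-composer", {"latency_budget_ms": 50})
orch.on_event("new-ad-campaign", "recompile-targeting-model")
```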
In effect, what I think Google plans to do (because they have to) is to create the “backplane” of the first true network computer, the virtual machine that goes everywhere and does everything. They’re not alone in working toward that goal, of course; Amazon and Microsoft are both well along with their own plans and initiatives, but two things may give Google a bit of a leg up. First, they did things right architecturally from the first, and thus have a technical head start. Second, from a service/business perspective, they did things wrong from the first.
Google lags in the public cloud, and it’s clear now that public cloud services are a very valuable application of our future network computer. Google’s mistake was to expect too much from others; businesses can’t behave the way Google did, and does. As a result, Google tried to promote future-think at a pace far beyond companies’ ability to adopt it, or even understand it. Amazon and Microsoft have taken advantage of Google’s business mistake.
Amazon and Microsoft are still behind in technology, though, perhaps less because Google was smarter than because the very future-think vision that caused Google’s business mistake produced an architecture that got Google much closer to the finish line in creating that futuristic network computer. Google faced problems with network-delivered services before anyone else did, and solved them. Amazon clearly sees the need for a new model and is working toward it, with Microsoft perhaps a little behind in that particular race. Google is absolutely in front.
Will Google’s change in cloud service management fix the business problem and accelerate Google’s progress on the infrastructure evolution path they need to follow? I don’t know, but I’m pretty sure that we’re going to find out in 2019.