Real virtualization is based on clusters. Virtualization assigns tasks to resources, and it doesn’t make sense to go through the assignment process just to reach a resource pool that consists of a single host. Virtualization really implies a remapping of hosting across a pool of available servers, not a 1:1 assignment. In cloud and container computing, a pool of resources is known as a cluster, and so “cluster” is probably a reasonable general term to describe a resource pool. Is a cluster also a good baseline for things like carrier cloud? We’ll see.
Let’s start with a simple question: What the heck is a “cluster” exactly? It’s a collection of “worker resources” under the control of a “master resource”, cooperating to perform some mission. That’s also a pretty good definition of a resource pool, and of many collective computing resources we find these days, as long as you don’t get too literal. Not every resource pool has an explicit master resource, and most today have a fair variety of missions, but the cluster concept is important because we have so much open-source software targeting clusters of some type. Might some of that be pressed into service to support virtualization and the carrier cloud? Clustering as a natural pathway to virtualization is important by itself, but it could be critical if clustering technology is directly applicable to carrier missions.
There are actually many different ways of characterizing clusters, including by technology and by mission. For example, clusters could be homogeneous, based on a single operating system or container environment; they could be multifaceted in terms of their technologies; they could be designed for parallel computing or for high availability. We’re going to look at some of these cluster types in more detail here, but for now just be aware that it’s possible to specialize a cluster to a mission. One of the questions we have to ask when we evaluate cluster technology is whether that kind of specialization is useful or harmful overall. That’s particularly true when you talk about clustering as part of a public service, including carrier cloud.
To navigate through all this confusion, it’s probably best to start by saying that everything in cluster computing is based on the same general thing underneath; it’s all a bunch of resources. Assigning a structure or definition to a cluster is really a matter of understanding what we expect those worker resources to be cooperating to do, not so much in terms of a specific vertical mission as in terms of the software structure that’s expected to run on the cluster.
Everybody is most familiar with cloud computing, where a pool of resources creates a series of virtual hosting points that applications can use as though they were real. Most public cloud services use this kind of cluster, and in most cases the pool is made up of identical or very similar systems in terms of hardware and platform software. A task gets a “virtual server” whose power is always either less than a single host (a VM among many, or a container) or equal to one (bare metal). The tasks that run in a cloud don’t have any explicit relationship with each other, and public clouds presume the tasks are deliberately separated. Some describe the cloud as supporting a service relationship with users.
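To make that “a virtual server never exceeds a host” point concrete, here’s a minimal Python sketch of cloud-style placement. The names and capacities are purely illustrative assumptions, not any real cloud’s API:

from dataclasses import dataclass

@dataclass
class Host:
    name: str
    cpus: int
    free_cpus: int = 0

    def __post_init__(self):
        self.free_cpus = self.cpus

def place_virtual_server(hosts, requested_cpus):
    """Assign a virtual server to the first host with enough spare capacity."""
    for host in hosts:
        if host.free_cpus >= requested_cpus:
            host.free_cpus -= requested_cpus
            return host.name
    return None  # no single host can hold it; the cloud model stops here

pool = [Host("host-a", 16), Host("host-b", 16)]
print(place_virtual_server(pool, 4))   # fits within one host: "host-a"
print(place_virtual_server(pool, 32))  # bigger than any single host: None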
Grid computing is another special form of cluster computing, but this time the goal is to support an application that needs more resources than an entire server would provide. A grid application gets a lot of resources, so its virtual host is bigger than a physical host rather than the other way around. Unlike the cloud, which is designed for traditional programming tools and techniques, grid computing requires that applications be developed especially for that model of cluster use. There are commercial users of grid computing, but most of it is familiar to us from scientific applications. Grid applications are also specialized for the grid cluster hosting environment, and so it would be fairly difficult to base broad cloud services, or even function hosting, on the grid model of clusters.
One mission-related subdivision of the cluster concept is high availability, and this is the cluster model that applies most directly to things like function hosting for services. The chance of a single computer failing is finite. Add additional computers, and the probability of all of them failing at the same time diminishes exponentially. Some applications of clusters exploit this effect, and if virtualization means remapping needs to resources dynamically, then availability can be influenced strongly by the correct cluster design. That design starts by creating a large enough pool of resources that there’s always something available to step in if other resources fail. That, in turn, means you need some homogeneity in the pool. If every application/component has unique requirements that demand individualized hosting, you don’t have a pool of assignable resources.
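The availability arithmetic is simple enough to sketch. Assuming each host fails independently with the same probability (an idealization, and the 1% figure is just an illustrative assumption), a few lines of Python show how quickly the chance of losing the whole pool collapses as hosts are added:

def pool_availability(per_host_failure_prob: float, pool_size: int) -> float:
    """Probability that at least one host in the pool is still up,
    assuming independent failures: 1 - p**n."""
    return 1.0 - per_host_failure_prob ** pool_size

for n in (1, 2, 3, 5):
    print(f"{n} host(s): {pool_availability(0.01, n):.10f}")
# 1 host(s): 0.9900000000
# 2 host(s): 0.9999000000
# 3 host(s): 0.9999990000
# 5 host(s): 0.9999999999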
That doesn’t mean that all cluster hosts have to be totally homogeneous. In many cases, applications will fall into groups according to the resources they need, and if there is enough demand for each of these resource groups, you can still create a statistically efficient resource pool from each of them. However, it is always going to be easier and more efficient to build clusters from masses of equivalent resources because you’ll reach statistical efficiency with a smaller number of resources overall.
Making a cluster of resources into an efficient resource pool does demand something beyond just numbers. For example, “resource equivalence” is the fundamental notion within a pool of resources. You have to be able to make free substitution within the pool, which means not only that the resources themselves have to be fairly uniform, but also that the connectivity among them must not create sharp differences in application/component QoE depending on where you put things. The more you distribute the cluster’s resources, the harder it is to connect them without impacting QoE, because of propagation delay, serialization delay, and handling delay in the network.
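Here’s a rough Python sketch of those three delay terms. The constants (fiber propagation of roughly 200 km per millisecond, an assumed per-hop handling figure) are illustrative assumptions rather than measurements, but they show how geographic spread erodes resource equivalence:

def one_way_delay_ms(distance_km: float, packet_bytes: int, link_mbps: float,
                     hops: int, per_hop_ms: float = 0.05) -> float:
    propagation = distance_km / 200.0                   # ~200 km per ms in fiber
    serialization = (packet_bytes * 8) / (link_mbps * 1000.0)
    handling = hops * per_hop_ms                        # queuing/forwarding time
    return propagation + serialization + handling

# A pool member in the same data center versus one 1,500 km away:
print(one_way_delay_ms(0.1, 1500, 10_000, 2))    # roughly 0.1 ms
print(one_way_delay_ms(1500, 1500, 1_000, 12))   # roughly 8 ms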
You also need to be able to schedule work, meaning make assignments of resources to missions, based on whatever criteria the application might require and the pool can support. Once you’ve scheduled work, you have to deploy, meaning make the assignment between virtual and actual resources, and then you have to provide for lifecycle management that will detect abnormal conditions and take automatic steps to remedy them. These three capabilities have to be overlaid on basic clustering for the concept to be useful, and how well they work would likely determine the range of services and applications that clusters could enable.
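A structural sketch may help here. The class and method names below are hypothetical, not drawn from any particular cluster tool, but they show how scheduling, deployment, and lifecycle management layer on top of a raw pool of workers:

class ClusterControl:
    def __init__(self, workers):
        self.workers = workers           # the raw cluster: a list of resources
        self.assignments = {}            # task name -> worker name

    def schedule(self, needs):
        """Scheduling: pick a worker that satisfies the task's stated needs."""
        return next((w for w in self.workers if w["free"] >= needs), None)

    def deploy(self, task, worker):
        """Deployment: bind the virtual resource to an actual one."""
        worker["free"] -= task["needs"]
        self.assignments[task["name"]] = worker["name"]

    def lifecycle_tick(self, failed_workers):
        """Lifecycle management: detect abnormal conditions and remediate."""
        for task_name, worker_name in list(self.assignments.items()):
            if worker_name in failed_workers:
                print(f"remapping {task_name} off failed {worker_name}")
                # a fuller implementation would reschedule and redeploy here

ctl = ClusterControl([{"name": "w1", "free": 8}, {"name": "w2", "free": 8}])
task = {"name": "vnf-firewall", "needs": 2}
ctl.deploy(task, ctl.schedule(task["needs"]))
ctl.lifecycle_tick({"w1"})               # prints a remapping notice for w1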
If you’ve followed this (which I hope you have) you recognize that a lot of what clusters need is also what something like Network Functions Virtualization (NFV) needs. In fact, NFV has never been anything but a specific application of virtualization and clustering principles. Or perhaps I should say “NFV should never have been”, because obviously the ISG didn’t follow the path of basing its approach on cluster technology. That could have been done; unlike container-based solutions like DC/OS, cluster implementations were already available in open-source form when the ISG launched.
It’s not too late for NFV clusters, though. Most cluster strategies operate at a lower level than containers, working instead with bare metal or virtual machines. That might make it easier to adopt clusters explicitly in NFV, because VMs and bare metal offer better tenant isolation than containers, and both are also more flexible with respect to networking options. Finally, there are a lot of tools available for scheduling and deployment on clusters, which means there would be more choices for operators who wanted to try it out.
All cluster-based strategies would pose the same difficulty as OpenStack in mapping to NFV, though. Scheduling, deployment, and lifecycle management in cluster applications are typically designed to be uniform in implementation and efficient in execution. There are many things the NFV ISG has suggested as requirements for scheduling resources for VNFs, or for management, that don’t map directly to the cluster solutions to the same problems. The number of issues and the difficulty in addressing them will depend on the flexibility of the three software layers, mentioned above, that control cluster behavior.
When you rise to the level of lifecycle management in that structure, things get truly complicated, for the simple reason that lifecycle behavior is very service/application dependent. Scaling of components under load, for example, is something that some applications (including NFV) mandate, but others don’t use because of stateful-behavior complexities. Updating shared data is an example of stateful behavior, and obviously you can’t spin up an unlimited number of component instances to update a common database unless you can manage collisions. Cluster tools won’t generally address this level of issue, and in fact even the NFV specifications don’t do a particularly good job there.
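Here’s a minimal sketch of that collision problem, using optimistic version checking as one of several possible remedies. The in-memory store and all the names are illustrative assumptions, not any particular database API:

class VersionedStore:
    def __init__(self):
        self.value = 0
        self.version = 0

    def read(self):
        return self.value, self.version

    def write(self, new_value, expected_version):
        """Apply the update only if nothing has changed since the read."""
        if expected_version != self.version:
            return False                 # collision detected: caller must retry
        self.value, self.version = new_value, self.version + 1
        return True

store = VersionedStore()
val, ver = store.read()                  # scaled instance A reads value 0, version 0
print(store.write(val + 1, ver))         # instance A's update succeeds: True
print(store.write(val + 5, ver))         # instance B's stale update is rejected: False

Without some check like this, the second writer would silently overwrite the first, which is exactly why scaling stateful components isn’t something a generic cluster tool can automate for you.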
Clustering, as I said at the opening of this blog, is a critical element in virtualization of any kind. In later blogs, I’ll take a look at some clustering strategies and we’ll use them both to explain the issues and requirements, and to see what features apply best to network/cloud applications.