Should Amazon’s Nitro Open a Discussion about Virtualization?

Is virtualization already obsolete?  Or, at least, is it heading in that direction?  Amazon would like everyone to believe it is, as this SDxCentral article shows, and they may be right.  What Amazon and others do about it could benefit both the cloud and the data center.  Amazon’s own answer is its “Nitro” model, and since its approach to the problem of virtualization is already published, it’s a good place to start in understanding how virtualization principles could change radically.

Virtualization is one of an evolving set of strategies for sharing hardware infrastructure among users and applications.  “Multi-tasking” and “multi-programming” have been fixtures of computing infrastructure since the introduction of true mainframe computers over 50 years ago.  The difference between early multi-use strategies and today’s virtualization is primarily one of isolation and transparency.  The goal of server virtualization is to keep virtual users isolated from each other, so that a given user/application “sees” what appears to be a dedicated system and not a shared resource at all.

Amazon, arguably the company with the most experience in server virtualization, recognized that there were two factors that interfered with achieving the isolation/transparency goal.  The first was complexity and overhead in the hypervisor that mediated CPU and memory resources, and the second was the sharing of the I/O and network interfaces on the servers.  What Amazon hopes to do is fix those two problems.

The concept of a “thin hypervisor” is easily understood.  Hypervisors manage access to shared system resources, creating the “virtual machines” that applications then use.  If that management and sharing mediation consumes a lot of resources, there’s less left to divide among the virtual machines.  In theory, this is an issue with all VM applications, but the problem is most acute for public cloud providers, whose business model depends on achieving reasonable profits at as low a price point as possible.  Amazon’s “Nitro Hypervisor” focus here could remedy a problem we should have addressed more completely long ago.
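To see why the overhead matters so much to a cloud provider’s economics, here’s a minimal sketch in Python.  The overhead fractions are illustrative assumptions of mine, not Amazon’s published figures; the point is only the shape of the calculation.

```python
# Rough illustration of why hypervisor overhead matters to a cloud
# provider: whatever the hypervisor consumes can't be sold to tenants.
# The overhead fractions are hypothetical, chosen only to show the
# shape of the calculation.

def sellable_cores(physical_cores: int, hypervisor_overhead: float) -> float:
    """Capacity left for tenant VMs after the hypervisor takes its share."""
    return physical_cores * (1.0 - hypervisor_overhead)

cores = 64
for overhead in (0.30, 0.10, 0.02):   # "fat" hypervisor vs. a thin, Nitro-style one
    print(f"overhead {overhead:4.0%}: {sellable_cores(cores, overhead):5.1f} cores sellable")
```

At fleet scale, every point of overhead shaved off is capacity that can be sold rather than burned, which is why a thin hypervisor is a business issue and not just an engineering nicety.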

The I/O and network interface problem is more subtle.  One issue, of course, is that if all the VMs on a server share the same interfaces, then the capacity limits of those interfaces can cause queuing of applications for access, and that means VMs are not truly isolated.  This problem is greatest for lower-speed interfaces, so the general shift to faster I/O and network connections mitigates it to some degree.
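A toy queuing model makes the point.  The sketch below is my illustration with assumed numbers, not anything Amazon has published; it uses the simple M/M/1 formula to show how the latency every VM sees on a shared interface grows as total utilization climbs.

```python
# Toy M/M/1 queueing model of a shared network interface. As total
# utilization climbs, the latency every VM on that interface sees grows
# sharply -- the "noisy neighbor" effect that breaks the illusion of
# isolation. All numbers are illustrative, not measurements.

def mm1_latency_ms(service_rate_pps: float, arrival_rate_pps: float) -> float:
    """Mean time in system (ms) for an M/M/1 queue; needs arrivals < service rate."""
    assert arrival_rate_pps < service_rate_pps, "unstable at or above 100% utilization"
    return 1000.0 / (service_rate_pps - arrival_rate_pps)

nic_capacity = 1_000_000                      # packets/second the shared NIC can handle
for utilization in (0.50, 0.80, 0.95, 0.99):
    arrivals = utilization * nic_capacity
    print(f"{utilization:4.0%} utilized -> {mm1_latency_ms(nic_capacity, arrivals):7.3f} ms mean latency")
```

The absolute numbers aren’t the point; the non-linear growth is.  One busy tenant pushes the shared interface toward saturation, and every other tenant on that interface feels it.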

The second issue is more insidious.  Most I/O interfaces are “dumb”, which means that a lot of the I/O and network activity is passed off to the server CPU.  Access to a disk file or support of a network connection can cause an “interrupt”, or CPU event, for every record/packet.  In most cases, this event handling will consume more CPU time than even an inefficient hypervisor would.  Worse yet, applications that are heavy users of interrupt-driven interfaces drain CPU resources in a way that can starve co-loaded applications.  The goals of isolation and transparency are compromised.
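Some back-of-envelope arithmetic shows how quickly per-packet events add up.  The cycles-per-event figure below is an assumption for illustration only; real costs depend heavily on hardware, drivers, and interrupt coalescing.

```python
# Back-of-envelope arithmetic for per-packet event handling. The
# cycles-per-event figure is an assumption for illustration; real costs
# vary widely with hardware, drivers, and interrupt coalescing.

def cores_consumed(packets_per_sec: float, cycles_per_event: float,
                   cpu_hz: float = 3.0e9) -> float:
    """Fraction of one CPU core spent just servicing per-packet events."""
    return packets_per_sec * cycles_per_event / cpu_hz

# A 10 Gbps link carrying minimum-size packets tops out near 14.9M packets/second.
for pps in (1e6, 5e6, 14.8e6):
    print(f"{pps / 1e6:5.1f} Mpps -> {cores_consumed(pps, 2000):5.2f} cores of event handling")
```

Even with generous assumptions, a heavily used network connection can eat multiple cores’ worth of event handling, and those are cores the co-loaded applications thought they had.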

The solution is smart, or smarter, interfaces.  All forms of I/O, including network I/O, have an associated “protocol”, a standard for formatting and exchanging information.  In addition, all applications that use these I/O forms have a standard mechanism for accessing the media, implemented via an API.  There’s been a long-term trend toward making I/O and network interfaces smart, so they can implement the low-level protocols without CPU involvement; even PCs and phones use this today.  What Amazon wants to do is implement the high-level application-and-API mechanisms directly on the interface.  Smart interfaces then become smarter, and applications that use them heavily create far less CPU load and have less impact on shared resources in virtual environments.
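Conceptually, the change looks something like the Python sketch below: the application-facing API stays the same, and only the implementation behind it moves from CPU-mediated work to the interface card.  The class and method names are hypothetical, not any real driver or AWS interface.

```python
# Conceptual sketch of the offload idea: the application sees one API,
# and whether the protocol work happens on the host CPU or on a smart
# interface card is hidden behind it. Class and method names here are
# hypothetical, not any real driver or AWS interface.

from abc import ABC, abstractmethod

class BlockStore(ABC):
    @abstractmethod
    def write(self, offset: int, data: bytes) -> None: ...

class CpuMediatedStore(BlockStore):
    """Traditional path: the host CPU does framing and checksums, then
    services a completion interrupt for every request."""
    def write(self, offset: int, data: bytes) -> None:
        frame = offset.to_bytes(8, "big") + len(data).to_bytes(4, "big") + data
        checksum = sum(frame) & 0xFFFF            # protocol work burning CPU cycles
        # ...submit the frame, then handle a per-request completion interrupt...

class OffloadedStore(BlockStore):
    """Nitro-style path: hand a descriptor to the card, which does the
    framing, checksums, and completion handling itself."""
    def write(self, offset: int, data: bytes) -> None:
        descriptor = (offset, memoryview(data))   # no per-request protocol work on the CPU
        # ...place the descriptor on the card's queue and move on...

def flush_log(store: BlockStore, records: list[bytes]) -> None:
    """Application code is identical either way; only the CPU cost differs."""
    position = 0
    for record in records:
        store.write(position, record)
        position += len(record)

flush_log(OffloadedStore(), [b"alpha", b"beta"])   # same call works with either store
```

The design point is that the application doesn’t change; the boundary between “software” and “interface” moves, and the CPU load moves with it.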

One big benefit of the smarter Nitro interface cards is that you can add I/O and network connections to a server to upgrade the server’s throughput, without adding a proportional load on the server CPU to handle the data exchanges.  It seems to me that, in the limiting case, you could create what Amazon (and the article) call “near-bare-metal” performance by giving applications a dedicated interface card (or cards) rather than giving them dedicated servers.

Isolation also has to address the issue of intrusion—hacking or malicious applications.  Amazon offloads virtualization and security functions onto dedicated hardware, including the “Nitro Security Chip”.  Dedicated resources, not part of the server’s shared-resource partitioning, are used for these functions, which means they’re not easily attacked from within the virtual server environment; they’re not part of it.

What does this all mean for the cloud, and for cloud-centric movements like NFV?  One clear point is that a commercial off-the-shelf (COTS) server is really not an ideal platform for the cloud.  A cloud server is probably going to be different from a traditional server, and the differences are going to multiply over time as the missions of cloud virtualization and data center applications diverge.  That’s what hybrid cloud is already about, we should note.

Initially, I’d expect to see the differences focus on things like smarter interface cards, and perhaps more server backplane space to insert them.  As this trend evolves, servers themselves will necessarily be specialized by the interface complement they support.  Amazon’s Nitro page illustrates this point by showing no fewer than nine different “configurations” of EC2 instances.  That, in turn, requires that we enrich our capability to schedule application components or features onto “proper” resources.

I think this point is often missed in discussions of feature hosting and NFV.  There are clearly a lot of different “optimum” server configurations in the cloud, and it’s unrealistic to assume that supporting all possible applications/features on a single platform configuration would be economically optimal (or even viable).  As we subdivide the resource pool, we create a need for more complex rules for assigning components to hosting points.  NFV recognized this, but it chose to establish its own rules rather than work with established scheduling software.
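What “more complex rules” might look like, in miniature: the sketch below matches a component’s declared needs against the capabilities of a heterogeneous pool.  The capability labels are hypothetical names of mine; real schedulers express the same idea with node labels, selectors, and affinity rules.

```python
# Minimal sketch of constraint-based placement over a heterogeneous
# resource pool. The capability labels ("smart_nic", "local_nvme") are
# hypothetical; real schedulers express the same idea with node labels,
# selectors, and affinity rules.

HOSTS = {
    "host-a": {"smart_nic", "local_nvme"},
    "host-b": {"smart_nic"},
    "host-c": set(),                            # plain COTS box
}

def place(component: str, required: set[str],
          hosts: dict[str, set[str]] = HOSTS) -> str | None:
    """Return the first host whose capabilities cover the requirement set."""
    for host, capabilities in hosts.items():
        if required <= capabilities:
            return host
    return None                                 # no feasible host in this pool

print(place("virtual-firewall", {"smart_nic"}))                  # host-a
print(place("packet-core",      {"smart_nic", "local_nvme"}))    # host-a
print(place("gpu-transcoder",   {"gpu"}))                        # None -- the pool can't host it
```

The more specialized the pool becomes, the more often the answer is “no feasible host”, which is exactly why placement rules have to get richer as servers differentiate.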

Kubernetes has developed a rich set of features designed to control scheduling, and I think it’s richer than what’s available for VMs.  Containers, though, are less “isolated and transparent” than VMs, so container-hosting versions of platforms like Linux will probably evolve to include many of the “Nitro” features Amazon is defining.  Hosting platforms, like servers, will evolve to have very distinct flavors for virtualization versus application hosting.
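As a rough analogy to that richness, the sketch below uses a filter-then-score pattern: filtering removes hosts that can’t run a workload at all, and scoring picks the best of the survivors.  The Kubernetes scheduler is organized around a similar split, but this is a deliberate simplification of mine, not its actual API.

```python
# Filter-then-score placement, in miniature. Filtering drops nodes that
# can't run the workload at all; scoring picks the best survivor. This
# is a simplified analogy, not the real Kubernetes scheduler API.

from typing import Callable

Node = dict  # e.g. {"name": "n1", "labels": {"smart_nic"}, "free_cores": 12}

def schedule(needs: set[str], cores: int, nodes: list[Node],
             score: Callable[[Node], float]) -> str | None:
    feasible = [n for n in nodes
                if needs <= n["labels"] and n["free_cores"] >= cores]
    return max(feasible, key=score)["name"] if feasible else None

nodes = [
    {"name": "n1", "labels": {"smart_nic"}, "free_cores": 4},
    {"name": "n2", "labels": {"smart_nic"}, "free_cores": 24},
    {"name": "n3", "labels": set(),         "free_cores": 48},
]

# Score by spare capacity so placement also spreads the load.
print(schedule({"smart_nic"}, 2, nodes, score=lambda n: n["free_cores"]))   # -> n2
```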

Perhaps the most radical implication of the Nitro model is that it’s imperative to separate management- and control-plane resources from application/service resources to improve both security and stability.  IP networking has historically relied on in-band signaling, and that makes it harder to prevent attacks on fundamental network features, including topology and route management.  Evolving concepts like SDN (OpenFlow) perpetuate the presumption of in-band signaling, which leaves them vulnerable to a configuration change that cuts nodes off from central control.  How we should isolate and protect the control and management planes of evolving services is an important question, and one that will take some work to answer.