Server Architectures for the Cloud and NFV Aren’t as “Commercial” as We Think

Complexity is often the enemy of revolution because things that are simple enough to grasp quickly get better coverage and wider appreciation.  A good example is the way we talk about hosting virtual service elements on “COTS,” meaning “commercial off-the-shelf servers.”  From the term and its usage, you’d think there was a single model of server, a single set of capabilities.  That’s not likely to be true at all, and the truth could have some interesting consequences.

To understand hosting requirements for virtualized features or network elements, you have to start by separating them into data-plane services and signaling-plane services.  Data-plane services sit directly in the data path, and they include not only switches/routers but also things like firewalls or encryption services that have to operate on every packet.  Signaling-plane services operate on control packets or higher-layer packets that represent exchanges of network information.  There are obviously far fewer of these than there are data-plane packets that carry the actual information.

In the data plane, the paramount hosting requirements include high enough throughput to ensure you can handle the load of all the connections at once, low process latency to ensure you don’t introduce a lot of network delay, and high intrinsic reliability because you can’t fail over without creating a protracted service impact.
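To put “high enough throughput” and “low process latency” into numbers, here’s a back-of-envelope sketch; the 10 Gbps line rate and 64-byte packets are illustrative assumptions, not a benchmark of any particular server.

```python
# Rough per-packet time budget for a data-plane service.  The line rate and
# minimum packet size are illustrative assumptions, not measured figures.

LINE_RATE_BPS = 10e9          # assumed 10 Gbps link
FRAME_BITS = (64 + 20) * 8    # 64-byte frame plus ~20 bytes of preamble and inter-frame gap

packets_per_second = LINE_RATE_BPS / FRAME_BITS   # roughly 14.88 million packets/s
budget_ns = 1e9 / packets_per_second              # roughly 67 ns per packet

print(f"{packets_per_second / 1e6:.2f} Mpps -> {budget_ns:.0f} ns per packet")
```

Roughly 67 nanoseconds per packet doesn’t leave room for much per-packet software work, which is why the ideal data-plane box looks so different from a general-purpose server.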

If you looked at a box ideal for the data-plane mission, you’d see a high-throughput backplane to transfer packets between network adapters, high memory bandwidth, CPU requirements set entirely by the packet-switching load, and relatively modest disk I/O requirements.  Given that “COTS” is typically optimized for disk I/O and heavy computational load, this is actually quite a different box.  You’d want all of the data-plane acceleration capabilities out there, in both hardware and software.

Network adapter and data-plane throughput efficiency might not be enough.  Most network appliances (switches and routers) use special hardware features like content-addressable memory (CAM) to quickly process packet headers and determine the next hop to take (meaning which trunk to exit on).  Conventional CPU and memory technology could take a lot longer, and if the forwarding table is large enough you might want a CAM board or some special processor to assist in the lookup.  Otherwise network latency could be increased enough to impact some applications.
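For a sense of what the software side of that lookup involves, here’s a deliberately naive longest-prefix-match sketch; the routes and trunk names are invented, and real software routers use tries or similar multi-stage tables, but they still make several memory references per packet where a CAM resolves the match in essentially one hardware operation.

```python
import ipaddress

# A deliberately naive longest-prefix match in software, to show the kind of
# per-packet work a CAM resolves in a single hardware lookup.  The routes and
# trunk names below are made up for illustration.
ROUTES = [
    (ipaddress.ip_network("10.0.0.0/8"), "trunk-1"),
    (ipaddress.ip_network("10.1.0.0/16"), "trunk-2"),
    (ipaddress.ip_network("0.0.0.0/0"), "default-trunk"),
]

def next_hop(dst: str) -> str:
    addr = ipaddress.ip_address(dst)
    # Scan every prefix and keep the longest one that matches.  Real software
    # routers use smarter structures, but each lookup still touches memory
    # several times, which is where the added latency comes from.
    best = max((net for net, _ in ROUTES if addr in net), key=lambda n: n.prefixlen)
    return dict(ROUTES)[best]

print(next_hop("10.1.2.3"))   # trunk-2 (the /16 wins over the /8)
print(next_hop("192.0.2.1"))  # default-trunk
```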

The reliability issue is probably the one that gets misunderstood most.  We think of failover as the alternative to reliable hardware in the cloud, and that might be true for conventional transactional applications.  For data switching, the obvious problem is that the time required to spin up an alternative image and make the network connections needed to put it into the data path in place of the failure would certainly be noticed.  Because the fault would probably be detected at a higher level, adaptive recovery at that level might be initiated, which could then collide with efforts to replace the failed image.  The longer the failure lasts, the bigger the risk of this kind of cross-purpose recovery.  Thus, these boxes probably do have to be five-nines, and you could argue for even higher availability too.
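Here’s what those availability targets translate to in downtime; the recovery time used in the closing comment is an assumed figure for illustration, not a measured value.

```python
# Downtime budgets for a few availability targets.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for label, availability in [("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label}: about {downtime:.1f} minutes of downtime per year")

# Five nines allows roughly 5.3 minutes per year.  If respinning an image and
# re-stitching the data path takes, say, three minutes (an assumed value), a
# single failure uses most of that budget, which is why failover alone
# doesn't get you there.
```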

Horizontal scaling is less likely to be useful for data-plane applications, for three reasons.  First, it’s difficult to introduce a parallel path in the data plane because you have to add path-separation and recombination features, and simply inserting them can cause disruption because you temporarily break the connection.  Second, you’ll end up with out-of-order delivery in almost every case, and not everything downstream will resequence the packets.  Third, your performance limitations are more likely to be on the access or connection side, and unless you paralleled the whole data path you’ve not accomplished much.
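The usual workaround for the ordering problem is to pin each flow to a single path by hashing its header fields, as in the sketch below (the paths and flows are made up).  That preserves per-flow order, but it also means no individual flow ever gets more than one path’s worth of capacity, which reinforces the point that parallelism in the data plane buys less than it seems.

```python
import hashlib

# Sketch of per-flow path pinning: hash the flow's 5-tuple and use the result
# to choose a path.  Packets of the same flow always take the same path, so
# they stay in order, but that flow is then capped at one path's throughput.
PATHS = ["path-a", "path-b"]

def pick_path(src, dst, proto, sport, dport):
    key = f"{src}|{dst}|{proto}|{sport}|{dport}".encode()
    digest = hashlib.sha256(key).digest()
    return PATHS[digest[0] % len(PATHS)]

# Every packet of this (invented) flow lands on the same path...
print(pick_path("10.0.0.1", "10.0.0.2", "tcp", 40000, 443))
print(pick_path("10.0.0.1", "10.0.0.2", "tcp", 40000, 443))
# ...while a different flow may hash to the other path.
print(pick_path("10.0.0.3", "10.0.0.2", "tcp", 40001, 443))
```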

The final point in server design for data-plane service applications is the need to deliver uniform performance under load.  I’ve seen demos of some COTS servers in multi-trunk data-plane applications, and the problem you run into is that performance differs sharply between low and high load levels.  That means a server assigned to run more VMs will degrade the performance of all of them, so you can’t pack multiple VMs onto a box and still adhere to stringent SLAs.

The signaling-plane stuff is very different.  Control packets and management packets are relatively rare in a flow, and unlike data packets that essentially demand a uniform process—“Forward me!”—the signaling packets may spawn a fairly intensive process.  In many cases there will even be a requirement to access a database, as you’d see in mobile/IMS and EPC control-plane processing.  These processes are much more like classic COTS applications.
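To make the contrast concrete, here’s a toy signaling-plane handler in the spirit of an attach request: one message triggers a subscriber lookup and some per-subscriber logic rather than a per-packet forwarding decision.  The message format, subscriber table, and field names are all invented for illustration; a real IMS/EPC element is far more involved.

```python
# Toy control-plane handler: one signaling message spawns a lookup and some
# decision logic, which is classic transactional-application behavior.  The
# subscriber table and message fields below are invented; the dict stands in
# for a real database query.
SUBSCRIBERS = {
    "imsi-001": {"allowed": True, "qos_profile": "gold"},
    "imsi-002": {"allowed": False, "qos_profile": None},
}

def handle_attach_request(message: dict) -> dict:
    subscriber = SUBSCRIBERS.get(message["imsi"])
    if subscriber is None or not subscriber["allowed"]:
        return {"result": "reject", "cause": "subscriber-unknown-or-barred"}
    return {"result": "accept", "qos_profile": subscriber["qos_profile"]}

print(handle_attach_request({"imsi": "imsi-001"}))  # accept with the gold profile
print(handle_attach_request({"imsi": "imsi-999"}))  # reject
```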

You don’t need hardware reliability that’s as high for signaling-plane services, because you can spawn a new copy more easily and you can load-balance these services without interruption.  You also don’t need as much data-plane acceleration, because the signaling packet load is smaller, unless you plan on running a lot of different signaling applications on a single server.

Signaling-plane services are also good candidates for containers rather than virtual machines.  It’s easier to see data-plane services being VM-hosted because of their greater performance needs and their relatively static resource commitments.  Signaling-plane components need fewer resources and run more intermittently, and in some cases their requirements are even web-like or transactional.

This combination of data-plane and signaling-plane requirements makes resource deployment more complicated.  A single resource pool designed for data-plane services could pose higher costs for signaling-plane applications, which need fewer resources than the pool is built to provide.  Obviously a signaling-plane resource is sub-optimal in the data plane.  And if the resource pool is divided up by service type, then it’s not uniform and thus not as efficient as it could be.

You also create more complexity in deployment, because every application or virtual function has to be aligned with the right hosting paradigm, and the latency and cost of connection have to be managed in parallel with the hosting needs.  This doesn’t mean the task is impossible; the truth is that the ETSI ISG is already considering more factors in hosting VNFs than would likely pay back in performance or reliability.

It seems to me that the most likely impact of these data-plane versus signaling-plane issues would be the creation of two distinct resource pools and deployment environments, one designed to be high-performance and support static commitments, and one to be highly dynamic and scalable—more like what we tend to think of when we think of cloud or NFV.

The notion of COTS hosting everything isn’t reasonable unless we define “COTS” very loosely.  The mission for servers in both cloud computing and NFV varies widely, and optimizing both opex and capex demands we don’t try to make one size fit all.  Thus, simple web-server technology, even the stuff that’s considered in the Open Compute Project, isn’t going to be the right answer for all applications, and we need to accept that up front and plan accordingly.