Initiatives Take Hosting Beyond x64, and Maybe Define the Edge

If edge computing is different from cloud computing, then it seems likely that some technical elements get a different emphasis in those two spaces. One such element is fundamental to both: hosting. The differences, and the reasons for those differences, arise out of the mission of edge versus the mission of cloud.

There’s no reason to put hosting close to the edge other than to reduce latency. Edge is by nature going to be spread out, meaning more real-estate cost and more difficulty in achieving optimum economies of scale. The justification for accepting these challenges is that some applications require lower latency than can likely be achieved by relying on regional or central cloud data centers. Applications that require low latency tend to be event-driven rather than transactional (the traditional model of data center apps), and as I noted in an earlier blog, that often means a different CPU architecture, or even a GPU.

Architectural shifts don’t stop there, either. We already see white-box devices built with custom switching chips (Broadcom’s, for example), and there’s also growing interest in adding intelligence to interface cards to offload work and improve overall performance. This trend means it’s very likely we’ll see “servers” that offer something other than x64 hosting, and that include local interface intelligence as well as a collection of CPUs and GPUs.

Then there’s AI, which is spawning its own set of custom chips. Google’s Tensor chip, used in the new Pixel 6 family, has an integrated AI accelerator, and vendors like Cadence and Maxim Integrated have AI processors. Experience suggests that these, too, are likely to be integrated into server platforms, combined with all the other CPU, GPU, and networking chips and interface smarts.

AI is already popular as a facilitator for some IoT applications, and as we start to face IoT for real (rather than the for-ink, media-driven version), we may well come up with IoT processes that could justify yet another set of custom silicon. The point is that as the volume of anything increases, the pressure to process it more efficiently increases too. If we do face edge, cloud, and network revolutions, those revolutions could drive yet more diversity in custom chips.

Unless we assume that there’s only one chip vendor and one standard architecture for each of these chip types, there’s going to be an issue matching software to the hardware mixture of a given server. With three or four possible chip types, each in two or three different versions, the combinations developers would have to deal with could be daunting; you could end up with a dozen or more different software versions, even if you took steps to constrain the mixture of chips in your deployment.
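To make the arithmetic concrete, here’s a minimal sketch of how the build matrix grows. The chip categories and version labels are invented for illustration, not drawn from any real product catalog:

```python
from itertools import product

# Hypothetical chip categories and the versions a deployment might mix;
# none of these labels correspond to real products.
chip_options = {
    "cpu": ["x64", "arm"],
    "gpu": ["gpu-a", "gpu-b"],
    "ai_accelerator": ["npu-1", "npu-2", "none"],
    "smart_nic": ["p4-nic", "fixed-nic"],
}

# Every distinct combination is potentially a distinct build/test target.
combinations = list(product(*chip_options.values()))
print(f"{len(combinations)} possible platform configurations")  # 2*2*3*2 = 24
```

Even this modest mix yields two dozen configurations, which is why constraining the chip mixture in a deployment only softens the problem rather than eliminating it.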

All of this poses management challenges at several levels. Deployment and redeployment of edge software (microservices) has to accommodate the specific architecture of a given server platform if that architecture affects how efficiently some microservices will run on it, or whether they’ll run at all. You could tune current orchestration and service-mesh tools (Kubernetes, Linkerd, etc.) to match complex platform features with complex microservice hosting needs, but the more custom chips you introduce, the more combinations you can expect, and the harder the resulting puzzle is to manage.
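The core of that puzzle is a matching problem: filter the hosting points whose hardware satisfies a microservice’s declared needs. This is a toy illustration of that logic, not how any real orchestrator implements it; the node names and feature labels are assumptions:

```python
# A toy version of the matching an orchestrator would have to do: keep only
# the nodes whose hardware features cover a microservice's requirements.
nodes = {
    "edge-node-1": {"x64", "gpu", "smart-nic"},
    "edge-node-2": {"arm", "ai-accelerator"},
    "edge-node-3": {"x64", "ai-accelerator", "smart-nic"},
}

def eligible_nodes(required_features: set[str]) -> list[str]:
    """Return the nodes that offer every feature the microservice requires."""
    return [name for name, feats in nodes.items() if required_features <= feats]

# An inference microservice needing an AI accelerator plus a smart NIC
print(eligible_nodes({"ai-accelerator", "smart-nic"}))  # ['edge-node-3']
```

The more chip types and versions you add, the more feature labels and the fewer eligible nodes per service, which is exactly how resource-pool economies erode.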

The other level of management challenge this all creates is in the area of lifecycle or fault management. A multiplicity of platform configurations, combined with different data center switching vendors, can create a management problem in the very spot where you can’t afford one: where the real-time systems that generate events are hosted.

This is the kind of challenge that cries out for a virtualization/abstraction solution. The smart approach at both the development and operations level would be an abstraction of the functionality of a given chip type, with a “driver” (like the P4 driver in switching chips) that maps the abstract or virtual device onto one of multiple implementations of the basic functionality. We don’t have a good abstraction for much of anything today; even P4 isn’t supported universally (in fact, the most commonly used switching chips, from Broadcom, don’t support it). In the near term, it’s likely that white-box vendors will provide platforms with a specific chip/vendor mix and supply the necessary drivers. That could mean that software will initially support only a subset of the possible combinations, since there’s not even a good driver standard today.
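The abstraction-plus-driver idea itself is simple to sketch: software codes to an abstract device, and a per-vendor driver maps that abstraction onto whatever silicon is actually present. The class and method names below are hypothetical, chosen only to show the shape of the pattern:

```python
from abc import ABC, abstractmethod

class InferenceAccelerator(ABC):
    """Abstract AI-accelerator device the application codes against."""
    @abstractmethod
    def run(self, model: str, data: bytes) -> bytes: ...

class VendorADriver(InferenceAccelerator):
    def run(self, model: str, data: bytes) -> bytes:
        # Translate the abstract call into vendor A's runtime here.
        return data

class VendorBDriver(InferenceAccelerator):
    def run(self, model: str, data: bytes) -> bytes:
        # Same abstract contract, different silicon underneath.
        return data

def load_driver(platform: str) -> InferenceAccelerator:
    """Pick the driver that matches the chip actually installed."""
    return {"vendor-a": VendorADriver, "vendor-b": VendorBDriver}[platform]()
```

The hard part isn’t the pattern; it’s getting vendors to agree on what the abstract device looks like, which is the missing driver standard the paragraph above refers to.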

On the operations side, we’re already facing issues with both GPUs and NIC chips. These new problems are exacerbated by the fact that the “clusters” of hosting and switching that make up a resource pool don’t present a common operations model. Adding in specialized elements can only make that worse, and edge computing has enough of a challenge in creating a resource pool with reasonable economies as it is.

Juniper has been on a tear recently in pushing some innovative concepts, including their spring Cloud Metro initiative and their recent media/analyst/influencer event where they promoted a new vision of network AI. They’ve now announced an initiative with NVIDIA to extend “hosting” to the NIC, and to bring NIC intelligence under their universal-AI umbrella. Given the edge shift and the fact that Cloud Metro defines a harmony of network and edge computing, it’s not a surprise that this new capability is called “Juniper Edge Services Platform” (JESP). JESP APIs link with the Apstra software to extend management and orchestration into the NIC.

JESP is based on NVIDIA’s BlueField DPU, where “DPU” stands for “data processing unit”, meaning the server platform itself. The idea is to extend the data center network (which is deployed and managed through Juniper’s Apstra acquisition) right to the servers’ interfaces, where those “smart NIC” chips are likely to be deployed. My own data from operators suggests that by early 2024 more than half of all newly shipped NICs will be smart (Gartner is more optimistic; they say 2023). At the edge, however, smart NICs are likely to dominate even sooner.

NVIDIA also has a plan to accelerate software harmony via its DOCA software model, a model that extends beyond just interface smarts to include even the P4 driver for switching, and therefore could describe a network element (including a server) with a single abstraction. DOCA stands for “Datacenter-On-a-Chip Architecture”, and it’s intended to accelerate the BlueField DPU concept, but it looks to me like it would be easy to extend the model to a white-box system, too.

There are two levels to any abstraction: the mapping to resources and the exposure of functional APIs. We’re seeing a lot more abstraction on the resource side, but applications at the edge and in the cloud still depend on middleware resources as much as on hardware. It’s going to be interesting to see whether initiatives like DOCA and JESP will stimulate work on that middleware framework, something that would depend on thinking through the needs of edge hosting and how they differ from cloud hosting. If it does, we could actually see the edge advance, and quickly.
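To show the two-level split in miniature: the resource level maps work onto whatever silicon a driver exposes, while the functional level is the middleware API an edge application actually programs against. Everything here (the EdgeRuntime class, handle_event, the injected driver callable) is invented for illustration:

```python
from typing import Callable

class EdgeRuntime:
    """Functional-API level: what an edge application programs against."""
    def __init__(self, run_on_accelerator: Callable[[bytes], bytes]):
        # The resource-level mapping hides behind this injected callable,
        # which a platform-specific driver would supply.
        self._run = run_on_accelerator

    def handle_event(self, event: bytes) -> bytes:
        # Middleware concerns (event routing, placement, latency budget)
        # would live at this level, above any particular chip or driver.
        return self._run(event)

# Usage with a stand-in "driver" that just echoes its input.
runtime = EdgeRuntime(run_on_accelerator=lambda data: data)
print(runtime.handle_event(b"sensor-reading"))
```

Resource abstractions like DOCA address the lower layer; the open question raised above is who builds the upper, middleware layer for the edge.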