Special Missions, Special Chips, Special Clouds

Nvidia, whose RISC goals have been stymied by regulatory pushback against their purchase of ARM, is now saying that their initiatives in RISC and other non-traditional CPU models may only be getting started. The reason that’s the case is tied to what’s going on in computing overall. The reason it’s important is that it could easily impact the cloud, and especially multi- and hybrid-cloud.

When I got started in programming, an application was expected to capture a record of transactions by converting the real-time records (usually in paper form) into something machine-readable, then reading them to create a database that could be used for reporting. Gradually, we moved the role of the computer closer to the user, through online transaction processing, personal computing, smartphones, and so forth. We went, in processing terms, from batch processing to online, and then to real-time.

Applications obviously changed through this evolution, and so did how we run them. In a batch application, you get a “record” to process because you read it. The pace of the application was set by the application itself, and this was possible because the real activity that was being processed had already occurred, and the application was only making a machine-readable record of it available. Think “keypunching”, “card-to-tape”, and so forth. As we moved into online and real-time we confronted a new situation, one where the application didn’t set the pace.

There used to be a saying for this new age: “Data demands service”. Information was now being presented from the outside, pushed into the application by real business activity that had a characteristic speed and had to be supported at that speed, or productivity would suffer. This impacted application design for the obvious reason that developers had to be able to accommodate variable workflows, but it also impacted something else that’s critical, which is the user interface.

A punched-card record of a sales receipt places no burden on the person writing the receipt, nor does processing it affect that person. Introduce online/real-time into the picture, and we now have to think not about the finished product (the sales receipt) but about the sales process. A transaction with five fields can be gathered in 120 different orders (that’s 5 factorial), and probably 119 of them would mess up the efficiency of the sale. Not only that, but the way we capture each of those five fields has to be optimized to fit that same sales process. For different types of sales, the order and the field handling might differ even though the same data was collected. We need to compose a GUI, and we need to make the individual presentation and collection of information optimal.
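If you want to see where that 120 comes from, here’s a minimal sketch in Go that simply enumerates every ordering of five illustrative field names (the field names are my own placeholders, not anything from a real system); the count it prints is 5 factorial, or 120.

    package main

    import "fmt"

    // permutations returns every possible ordering of the given fields.
    func permutations(fields []string) [][]string {
        if len(fields) <= 1 {
            return [][]string{append([]string(nil), fields...)}
        }
        var result [][]string
        for i := range fields {
            // Put fields[i] first, then permute the remaining fields.
            rest := make([]string, 0, len(fields)-1)
            rest = append(rest, fields[:i]...)
            rest = append(rest, fields[i+1:]...)
            for _, p := range permutations(rest) {
                result = append(result, append([]string{fields[i]}, p...))
            }
        }
        return result
    }

    func main() {
        fields := []string{"customer", "item", "quantity", "price", "date"}
        fmt.Println("possible field orderings:", len(permutations(fields))) // prints 120
    }

Only one of those orderings matches the way a sale actually unfolds at the counter, which is the point: the GUI has to be composed around the process, not around the record.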

If you look at the instruction set of the computer systems used in batch processing, they were heavy on what was important to batch processing, which was the handling of “fields” of data, numbers, dates, and other stuff that you found in business records. If you look at the process of supporting a GUI in a real-time application, you probably aren’t doing any of that. The GUI is just getting the data into “transactional” form, doing what a keypuncher did from a receipt in a batch application.

The difference between composing a real-time portal to support user interaction and processing transactions created in such a portal was always profound, and distributed transaction processing often used personal computers or custom devices for the former and classic computers for the latter. That’s still the rule today, but we now see more and more cloud-hosted front-end user portals. The cloud is the GUI for more and more applications, offloading the complexity of user interactivity from the data center, where resources are likely not very scalable.

Different missions mean that a different processing model, and chip, are optimal for different places in the workflow. In the data center, there’s a need for transaction processing, analytics and some kinds of AI, and database-oriented stuff. Outward toward the user, it’s all about event-handling, graphics, and the kinds of AI that can enhance the user experience. In the AI space, for example, image and voice processing fit the outward-toward-the-user space, while machine learning and inference-engine processing fit closer to the data center. All of this encourages a multiplicity of chips, specialized for their expected placement in the flow. It also encourages a separation of processing resources according to what’s expected to be processed there, and that creates a potential divergence between chips for the cloud and chips for the data center.

Real-time stuff is best handled with RISC chips, like ARM. Data center transaction processing is best done on traditional x86/x64 chips. The cloud is already focused mostly on real-time tasks, and it’s getting more focused every day. The data center focus isn’t changing much, because the things that could change it, the GUI evolutions, are being done in the cloud. We’re already seeing cloud providers either build their own (as Amazon does) or buy (as the rest do) more GPUs and RISC CPUs, because those chips fit the real-time mission requirements better. We could end up with the majority of the cloud shifting toward GPUs and RISC CPUs.

One thing this might well do is solidify what’s now a de facto division of labor. If there are no real-time-oriented chips in the data center, we can’t do real-time processing efficiently there. That would mean the cloud would take over all those tasks, or they’d have to be handled through specialized devices (like IoT controllers). Similarly, moving data center apps to the cloud could be an issue if x64 chip resources there were limited or more costly.

The biggest impact could be failover in hybrid cloud missions. If the CPU architecture is largely the same everywhere, then you can run a given app in the cloud or in the data center as long as you have access to whatever APIs the developer elected to use. If not, then you can’t expect to run the same code in each of the two places, and you might not even be able to secure reasonable performance in both places even if you recompiled for the CPU that was there and maintained parallel versions of the components. This could also impact failover across clouds in multi-cloud, if the chip resources in the clouds involved were very different.
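As a hedged illustration of what “maintaining parallel versions” can mean in practice, here’s a minimal Go sketch. The source is identical for both targets; the standard Go cross-compilation commands in the comment build one binary for an x86/x64 data center pool and another for an ARM-based cloud pool (that deployment split is my assumption; only the build commands and runtime constants are standard Go).

    // Build the same source twice, once per target architecture:
    //   GOOS=linux GOARCH=amd64 go build -o app-amd64 .   (x86/x64 data center pool)
    //   GOOS=linux GOARCH=arm64 go build -o app-arm64 .   (ARM-based cloud pool)
    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        // runtime.GOOS and runtime.GOARCH report what this binary was compiled
        // for, which a failover health check could log to confirm the right
        // build is running in the right place.
        fmt.Printf("running the %s/%s build\n", runtime.GOOS, runtime.GOARCH)
    }

Recompiling like this buys functional portability, but it doesn’t guarantee an equivalent performance profile on both chip families, which is exactly the failover risk described above.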

We might also see “specialty clouds”, ones whose chips and web services were specific to a real-time mission set. This would offer new cloud players a more attractive market entry strategy than going head-to-head with the current giants across the board. Specialty clouds could also be linked to on-premises controllers/servers for things like IoT, which again might be an attractive way in. It could extend beyond RISC and GPUs to custom AI chips, too. Would a chip player perhaps think about this? Nvidia comes to mind.

The cloud isn’t taking over computing, and it isn’t going to eat the data center. It is, however, already taking over the important part, the piece of computing that touches the worker, the consumer, and our infrastructure. As that happens, it will increasingly specialize to that role, and the missions of the cloud and the data center will become fixed. We’re not looking at an elastic combination of resources in those places, but at two very different “virtual computing” frameworks, with linked but independent missions. That’s going to take some time for IT planners to get used to.