Creating an Optimum Platform for IoT Functions

There are a lot of axioms in the world, but in networking some of them are more like what we used to call “postulates” in my geometry days.  Postulates are statements presumed to be true, and it just might be that some of our cherished views fall more into that category.  That’s particularly true with IoT and event processing.

Let me start by saying that it’s an axiom that, all other things being equal, lower latency in event processing is better.  Latency lengthens the interval between a reported condition, meaning an event, and the arrival of the response at the appropriate control point.  Let that loop get long enough and you could do anything from alienating the user of the application to actually causing damage.  However, of course, all other things are rarely equal.  Networking and the cloud are trade-offs of various things against cost.

There’s another axiom, which is that when you have a problem with multiple contributing factors, the best course is to attack the factors that make the greatest contribution.  In IoT and event terms, that means that if there are many sources of incremental delay in a control loop, you should address the largest sources first.

Propagation delay, meaning the time it takes for an event to get to the place where it’s to be processed, is a source of latency.  As I’ve noted in other blogs, signals propagate in fiber at the rate of about 100,000 miles per second, which means they cross a hundred miles in a millisecond.  An autonomous car traveling at 60 mph would go 0.088 feet, or about an inch, in that time.  A hundred miles is probably a long way from a hypothetical edge, and a millisecond doesn’t sound like a lot to many people, including some who have commented on my blogs in the past.  On the surface, they could be right that propagation delay isn’t a big thing for IoT.  How about deeper than the surface?
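The arithmetic behind that paragraph is simple enough to show.  The fiber speed is the rough figure quoted above, and the car speed is an assumed 60 mph:

```python
# Back-of-the-envelope propagation arithmetic.
# Assumes light in fiber travels at roughly 100,000 miles/second
# (about two-thirds of c), as stated above.

FIBER_SPEED_MILES_PER_SEC = 100_000

def propagation_delay_ms(distance_miles: float) -> float:
    """One-way propagation delay over fiber, in milliseconds."""
    return distance_miles / FIBER_SPEED_MILES_PER_SEC * 1000

# 100 miles of fiber -> 1 ms
print(propagation_delay_ms(100))   # 1.0

# How far does a car at 60 mph travel in that millisecond?
# 60 mph = 88 feet/second, so 0.088 feet per millisecond.
CAR_SPEED_FT_PER_SEC = 88
print(CAR_SPEED_FT_PER_SEC * 0.001 * 12)  # ~1.06 inches
```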

I’ve done some modeling, informed by discussions with industrial control and event specialists, and the results are interesting.  They don’t say that edge computing isn’t justified, only that it may not be justified for the reasons we think it is.  That means there may be alternatives to edge computing for IoT that should be reviewed.

The standard mechanism discussed for IoT and event-handling tasks is the serverless, or “functional/lambda” model.  With this model, there’s a series of what we could call “function hosts” that have the software platform elements needed to host small stateless process elements.  Above these function hosts is an orchestration layer that uses event sources, policies, and resource availability to decide where a function will be run.  The function is then loaded and run, receiving the event.
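To picture that, here’s a toy sketch of the dispatch pattern: an orchestration layer picks a function host based on resource availability, then the function is loaded and run there.  The names here are illustrative, not any vendor’s actual API.

```python
from dataclasses import dataclass

@dataclass
class FunctionHost:
    name: str
    free_slots: int

@dataclass
class Orchestrator:
    hosts: list  # the "function hosts" described above

    def dispatch(self, event: dict, function_code):
        # Toy policy: pick the host with the most free capacity.
        host = max(self.hosts, key=lambda h: h.free_slots)
        host.free_slots -= 1
        # "Load and run": in a real platform this step -- pulling the code,
        # starting the runtime -- is where the cold-start delay lives.
        return function_code(event)

orch = Orchestrator(hosts=[FunctionHost("edge-1", 4), FunctionHost("edge-2", 7)])
print(orch.dispatch({"sensor": "valve-12", "reading": 42},
                    lambda ev: f"handled {ev['sensor']}"))
```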

My contacts in the event and IoT world say that their experience with functional computing is that the process of loading and running a function takes a lot more time than a millisecond.  In fact, the majority of reported load-and-run delays range from 100 to 200 milliseconds, and some run even longer.  Obviously that dwarfs propagation delay as a source of end-to-end delay.
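To make that concrete, here’s a sketch that puts the two delay sources side by side, using the figures from this piece: a millisecond of propagation per hundred miles versus a 100-200 ms cold start.  The breakdown is illustrative, not measured data.

```python
# Hypothetical control-loop delay budget, consistent with the numbers above.
loop_delays_ms = {
    "propagation, 100 miles out": 1.0,
    "function load-and-run (cold start)": 150.0,
    "event processing itself": 3.0,
    "propagation, 100 miles back": 1.0,
}

total = sum(loop_delays_ms.values())
for source, ms in sorted(loop_delays_ms.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{source:36s} {ms:6.1f} ms  ({ms / total:5.1%})")
# The cold start accounts for roughly 97% of this loop; moving the host
# 100 miles closer shaves off less than 1%.
```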

The obvious way to eliminate the functional computing delay is to eliminate the serverless concept.  If we had resident processes, running perhaps in containers, that were handling the events, there’d be no delay in loading and running them.  The problem is that events are likely to be relatively uncommon in the IoT world, making it expensive to have processes sitting around waiting for them to happen.
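Here’s a minimal sketch of that resident-process alternative, using a plain Python worker thread on an in-memory queue as a stand-in for a container sitting on an event source.  Per-event dispatch is immediate, but the process holds its resources even when nothing is happening.

```python
import queue
import threading

event_queue: "queue.Queue[dict]" = queue.Queue()

def resident_handler():
    while True:
        event = event_queue.get()   # blocks here, idle but holding its RAM
        if event is None:           # shutdown sentinel
            break
        print(f"handled {event['sensor']} immediately, no cold start")

worker = threading.Thread(target=resident_handler, daemon=True)
worker.start()

event_queue.put({"sensor": "valve-12", "reading": 42})
event_queue.put(None)
worker.join()
```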

An enhancement to this approach would be to customize our function-hosting platforms, perhaps with very large RAM to hold a lot of processes-in-waiting, and also to eliminate a lot of the normal overhead that even containers have.  That’s workable because functions are a very lightweight form of logic, not requiring a lot of middleware or operating system services.  Most of my contacts think this approach could cut the load-and-run delay to less than 10 milliseconds.  The efficacy of this process-in-waiting model depends on how many different processes are involved.  This is perhaps one place where edge computing could come in.
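In code terms, the process-in-waiting idea amounts to paying the load cost once, up front, and turning per-event dispatch into a table lookup.  The sketch below is a toy version; the event types and handlers are hypothetical.

```python
import time

# Pre-load phase: pay the load cost once, before any events arrive.
def valve_overpressure(ev): return f"close valve {ev['id']}"
def belt_jam(ev):           return f"stop belt {ev['id']}"

warm_pool = {
    "valve.overpressure": valve_overpressure,
    "belt.jam": belt_jam,
}

def dispatch(event: dict):
    handler = warm_pool.get(event["type"])
    if handler is None:
        raise KeyError(f"no process-in-waiting for {event['type']}")
    return handler(event)

start = time.perf_counter()
print(dispatch({"type": "valve.overpressure", "id": 12}))
print(f"dispatch took {(time.perf_counter() - start) * 1e3:.3f} ms")
# Microseconds per event, versus 100+ ms for a cold start.
```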

It’s reasonable to assume that different kinds of events would be present at different frequencies depending on the specific location being monitored.  Putting it another way, it’s reasonable that in many cases, there might be only a small number of “processes-in-waiting” for a given area, based on the specific applications of IoT there.  It’s also reasonable to assume that the most time-critical stuff could be hosted via processes-in-waiting, with less time-critical processing and insight hosted deeper and more conventionally.  Thus, process-in-waiting combined with edge computing could make sense because it reduces load-and-run latency, not propagation delay.
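That tiering could be expressed as a simple routing policy: a small, site-specific set of time-critical event classes pinned to the edge’s warm pool, with everything else shipped deeper.  The class names and tier labels below are purely illustrative.

```python
# Hypothetical site policy: which event classes stay at the edge.
EDGE_CRITICAL = {"valve.overpressure", "belt.jam"}   # small, site-specific set

def route(event: dict) -> str:
    if event["type"] in EDGE_CRITICAL:
        return "edge-warm-pool"   # process-in-waiting, single-digit ms
    return "regional-cloud"       # conventional hosting, latency-tolerant

print(route({"type": "valve.overpressure"}))  # edge-warm-pool
print(route({"type": "usage.trend"}))         # regional-cloud
```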

There’s some evidence that load-and-run latency is worse when you go to “deeper” and larger data centers.  A few contacts who have tried to test that presumption report that larger resource pools appear to take longer to orchestrate, increasing latency to as much as twice that of a small “local” or edge platform.  There’s not much good data on this front, though.

My modeling says that if you were to host functions in a process-in-waiting model with a customized OS platform, you could fit one hundred times as many in a server as you could “normal” containers, and perhaps forty times as many as with a streamlined, function-optimized container, without having to load-and-run for each event.  The number of processes-in-waiting could be increased by adding RAM, and beyond the RAM limit you could still reduce latency by staging them on high-speed solid-state storage.
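Here’s rough capacity arithmetic behind those density claims.  The per-instance memory footprints are hypothetical, picked so the ratios match the 100x and 40x figures from the modeling; the point is the mechanism, with RAM as the limiting resource.

```python
SERVER_RAM_GB = 256  # hypothetical server

footprint_mb = {
    "normal container": 400,              # hypothetical; 400/4 = 100x below
    "function-optimized container": 160,  # hypothetical; 160/4 = 40x below
    "process-in-waiting": 4,              # hypothetical lightweight function
}

for kind, mb in footprint_mb.items():
    count = SERVER_RAM_GB * 1024 // mb
    print(f"{kind:30s} ~{count:6d} per server")

# More RAM raises the in-waiting count linearly; past the RAM limit,
# high-speed SSD can hold the overflow at some latency cost.
```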

What all this suggests is that we should examine exactly what event processing requires in terms of function loads and runs, and customize function hosting according to what we find.  The challenge that poses for everyone is obvious: we run the risk of specializing the resource pool to the point where we can’t easily switch resources from broad application or feature hosting missions to event missions.  It may be possible to resolve that issue by looking at function hosting as a specialized container mission.

I think the Amazon and Microsoft models, which orchestrate function operations on top of a separate hosting model, could probably be applied to a container system as easily as to a specialized function-hosting system.  To me, that seems the best path forward.