Just What the Heck is an Event-Driven System Anyway?

In a number of my past blogs, I’ve talked about the value of an event-driven model for cloud and NFV deployment.  Since then I’ve gotten a few requests to explain just what the difference is between traditional and event-based models.  It’s a challenge to do that without dipping deeply into programming details, but I’m going to take a shot at it here.

Envision an assembly line in operation.  We have, let’s say, a car chassis enter the line at the beginning, and we then bolt various things to it along the way until an automobile emerges from the end.  We can, if we have a specific goal of the number of cars per hour, determine how fast the line will have to move to fulfill the production target.  This is pretty much how cars have been built since the days of Henry Ford, and we think of this as an orderly and well-controlled process.

We can draw an analogy between this approach and the way that a service would be created in the days of manual operations processes.  You get a service order.  You start taking steps to fulfill the order, in sequence.  When the steps are taken, the service is ready to turn over.  This process can be called a “workflow”, and it’s how manual things get done, including assembly lines.  When we translate manual processes to software, we tend to replicate this notion of a workflow because…well…it works and it fits our natural model of how things are done.

Let’s assume now that as our assembly line is humming along, we have a glitch in door delivery.  We can’t bolt the door in place because it’s not there.  We can’t do later steps that depend on door installation.  What are our options to meet our goals?

One option could be to have a number of assembly lines running in parallel, which means that we can stall a line and others will pick up the slack.  The problem with this is that it’s an overprovisioning strategy, meaning that we’re committing more resources than needed to allow for the failure of one of our workflows.  It’s going to leave facilities idle because if everything is working OK we’d have to throttle back on production or we’d build more cars than we need.  Another option is just to let the line stall while we get a door.  In that case, not only are we at risk of missing our target, we’re spending money on facilities that sit idle because work can’t get to them.

Suppose we kind of shunt the car that’s missing a door off to a siding and let other stuff pass?  We could do that, providing that we have some way of getting the door to the siding and getting it installed, then reintroducing the car to the flow.  But is there a gap in the flow, other than the one we created by putting our door-waiting car onto the siding?  Do we have to stop the line to create a gap?  If so, we’ve wasted the slot we created by side-lining the car, and we’ve introduced the need to synchronize the door with the car, then the car with the line.

Then let’s suppose that we have a shortage of another material down the line.  We get our door, we attach it, we stall things to get it back into the flow, and two steps later we find that we have to sideline it again.  Perhaps the car we held up didn’t have that later part dependency.  We could have done better by holding the car with the missing door till the later parts were ready.

You can apply this to service-building of course.  A single linear workflow is fine if everything goes in a linear way, but can that be expected?  We have a service order.  We’re waiting on some provisioning step, and if we have a single-threaded process we halt activity till we get what we want.  Meanwhile the customer alters or cancels the order and we don’t even need to take the step.  In a linear workflow, we’d have to write tests into the software to check for pending changes and cancellations while we were waiting for something to complete.  This makes the software “brittle” because every service might have different dependencies, different things to check for.
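For readers who do want a small dip into programming details, here is a minimal Python sketch of that linear workflow.  Everything in it is an assumption made for illustration: the step names, the `order` dictionary, and the `wait_for` and `fulfill` helpers are hypothetical, and the “waiting” is simulated by polling a function rather than talking to a real provisioning system.  The point is only that the cancellation test has to be written into the wait itself.

```python
def wait_for(order, is_done, max_polls=5):
    """Hold the whole workflow on one step (hypothetical helper).

    The wait loop has to remember to test for cancellation itself;
    nothing in a linear flow delivers that news to it.
    """
    for _ in range(max_polls):
        if order.get("cancelled"):
            return "cancelled"
        if is_done():
            return "done"
    return "timed out"


def fulfill(order, steps):
    """Run the steps strictly in sequence, stalling on each one."""
    for name, is_done in steps:
        outcome = wait_for(order, is_done)
        if outcome != "done":
            return f"{name}: {outcome}"
    return "ready"


# Illustrative steps; each lambda stands in for "has this step finished?"
steps = [
    ("allocate_port", lambda: True),
    ("provision_access", lambda: False),  # simulated step that never completes
    ("activate_billing", lambda: True),
]

print(fulfill({"id": "order-1"}, steps))
print(fulfill({"id": "order-2", "cancelled": True}, steps))
```

Every new kind of change or dependency means another test inside `wait_for`, and a service with different dependencies needs a different set of tests.  That is where the brittleness comes from.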

In software, we can have what’s called “multi-threading” which is analogous to having those parallel assembly lines.  When something is stalled waiting, it can be held up while another thread keeps going.  But just like assembly lines, you need to have resources for the threads and you’re wasting time and adding complexity by having things held up.  In addition, you still have the problem of knowing what you’re waiting on—if an order is waiting for something and is changed, or is dependent on something that’s still “down the line” and not available either, you have more waste.

What you need to make this thread-multiplying work is the ability to respond to the availability of something and handle it in an appropriate way.  In the world of car-building it’s hard to envision this because it would mean that every car would be built in a little “pod”, and that the start of the process would be the launching of a get-part activity based on the car’s parts list.  You’d then deal with the parts as they arrive and fit.  Obviously this is an event-driven approach, and just as obviously this would be hard to justify in physical manufacturing.

If we enter the software world, where building a facility isn’t an issue, we have a different story.  You can have “orders” that drive a task of marshaling resources, and when you have everything you need you can release the “product” for use.  However, since you’ve lost the inherent context of a linear workflow, you need your software order process to somehow keep track of where it is.

You normally build an event-driven model using what are called “state-event tables”.  You define specific “states” representing stages of progress.  You define events as things that have to be accommodated.  In each state, you assign each event to a process that does the right thing for that combination.  If you are in the state “WaitingForProvisioning” and you’ve activated a series of steps to allocate resources and you get a cancellation, you activate the “ReleaseResources” process.  If you’re in the state “WaitingForActivation” and you receive an Activate event, you start assigning resources and enter WaitingForProvisioning.  If you were to receive another Activate in that state, you know something is awry in your software.
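A state-event table is easy to picture as a lookup keyed on the (state, event) pair.  The Python sketch below is one way it might look, not a reference implementation.  The WaitingForActivation and WaitingForProvisioning states, the Activate and cancel events, and the ReleaseResources process come from the description above; the other states, events, and process names are assumptions added so the example runs.

```python
from enum import Enum, auto


class State(Enum):
    WAITING_FOR_ACTIVATION = auto()
    WAITING_FOR_PROVISIONING = auto()
    ACTIVE = auto()       # assumed state, not named in the text
    CANCELLED = auto()    # assumed state
    ERROR = auto()        # assumed state


class Event(Enum):
    ACTIVATE = auto()
    PROVISIONED = auto()  # assumed event
    CANCEL = auto()


# Each process records what it did and returns the next state.
def start_provisioning(order):
    order["log"].append("StartProvisioning")
    return State.WAITING_FOR_PROVISIONING


def release_resources(order):
    order["log"].append("ReleaseResources")
    return State.CANCELLED


def mark_active(order):
    order["log"].append("MarkActive")
    return State.ACTIVE


def discard_order(order):
    order["log"].append("DiscardOrder")
    return State.CANCELLED


def flag_error(order):
    order["log"].append("FlagError")
    return State.ERROR


# The state-event table: (current state, event) -> process to run.
STATE_EVENT_TABLE = {
    (State.WAITING_FOR_ACTIVATION, Event.ACTIVATE): start_provisioning,
    (State.WAITING_FOR_ACTIVATION, Event.CANCEL): discard_order,
    (State.WAITING_FOR_PROVISIONING, Event.PROVISIONED): mark_active,
    (State.WAITING_FOR_PROVISIONING, Event.CANCEL): release_resources,
}


def new_order():
    return {"state": State.WAITING_FOR_ACTIVATION, "log": []}


def handle(order, event):
    """Look up the process for this state/event pair and run it.

    Any pair missing from the table is treated as a software error.
    """
    process = STATE_EVENT_TABLE.get((order["state"], event), flag_error)
    order["state"] = process(order)
    return order["state"]


order = new_order()
handle(order, Event.ACTIVATE)
handle(order, Event.CANCEL)
print(order["state"].name, order["log"])
```

Notice that nothing ever waits.  The order simply records its state, and whatever event arrives next is handled in that context, so a cancellation during provisioning needs no special polling.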

Where this gets important with respect to NFV and the cloud is that we have tended to follow linear workflow models when we built a lot of the software and standards.  Not only does it make the software brittle, it makes it less efficient in handling large numbers of requests.

One of the subtle problems that can be created in this area is “interfaces” or “APIs” that are workflow-presumptive.  An example from the NFV world is easy to find.  We have an interface between a VNF and the VNF manager, and that interface is an API.  Unless the interface is explicitly handling the passing of events then the presumption is that it’s part of a workflow, and we’re back to our assembly line and missing door.
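The difference between the two kinds of interface can be sketched in a few lines of Python.  This is a generic illustration and not the actual VNF-to-VNF-manager interface from the NFV specifications; the class names, method names, and event names are all hypothetical.

```python
import queue


class BlockingManager:
    """Workflow-presumptive: the caller asks, then waits for the answer."""

    def configure(self, vnf_id):
        # A real call would stall the caller here until the work finished.
        return f"{vnf_id}: configured"


class EventInterface:
    """Event-passing: callers post what happened and move on."""

    def __init__(self):
        self.inbox = queue.Queue()

    def post(self, source, event, detail=None):
        self.inbox.put((source, event, detail))

    def drain(self, handler):
        """Hand each queued event to a handler, in arrival order."""
        handled = 0
        while True:
            try:
                source, event, detail = self.inbox.get_nowait()
            except queue.Empty:
                return handled
            handler(source, event, detail)
            handled += 1


seen = []
api = EventInterface()
api.post("vnf-1", "ConfigComplete")
api.post("vnf-2", "Fault", {"code": 7})
api.drain(lambda source, event, detail: seen.append((source, event)))
print(seen)
```

With the first style, the caller’s position in its own code is the only record of where the service stands.  With the second, that context has to live somewhere explicit, which is exactly what a state-event table provides.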

DevOps, the deployment management framework used in the cloud, has begun to adopt a limited event-driven model to couple infrastructure issues to application deployment models.  To me, this shows that the industry is starting to accept event-driven systems.  It proves that you can retrofit events to current structures, but surely if you want to be event-driven it’s better to get there holistically and from the start.

The benefits of an event-driven software system for service lifecycle management are profound, and so are the differences between software designed for workflows and software designed to be event-driven.  We need, for both the cloud and for NFV, a decision on whether we’re willing to forgo the benefits of event-driven software, because we are doing that by default.

Think for a moment about my service-model-pod concept.  Any application, any service, any management process fits in it.  You can spawn models for each service and the model keeps track of the service parameters and state, and it’s the conduit through which events are coupled.  If you assume microservices or Lambdas are the processes, then you can spawn them as needed.  There’s no risk of swamping management systems with fault events or scaling demands created by a popular app or activity.
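Here is a rough Python sketch of the pod idea, under stated assumptions.  The `ServiceModel` class, the state and event names, and the three process functions are all hypothetical; the functions stand in for microservices or Lambdas, and the in-memory dictionary stands in for wherever the models would really be stored.  What it is meant to show is the division of labor: the model holds all the context, and the processes hold none.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceModel:
    """One 'pod' per service: holds parameters and state, no logic."""
    service_id: str
    state: str = "Ordered"
    params: dict = field(default_factory=dict)
    history: list = field(default_factory=list)


# Stateless processes: everything they need arrives in the model
# and the event payload, so any copy of a process can handle any service.
def deploy(model, payload):
    model.params.update(payload)
    return "Active"


def scale_out(model, payload):
    model.params["instances"] = model.params.get("instances", 1) + 1
    return "Active"


def tear_down(model, payload):
    model.params.clear()
    return "Removed"


PROCESSES = {
    ("Ordered", "Activate"): deploy,
    ("Active", "Overload"): scale_out,
    ("Active", "Cancel"): tear_down,
}

models = {}  # service_id -> ServiceModel


def dispatch(service_id, event, payload=None):
    """Route an event through the service's own model."""
    if service_id not in models:
        models[service_id] = ServiceModel(service_id)
    model = models[service_id]
    process = PROCESSES.get((model.state, event))
    if process is None:
        model.history.append(f"ignored {event} in {model.state}")
    else:
        model.state = process(model, payload or {})
        model.history.append(f"{event} -> {model.state}")
    return model


dispatch("svc-a", "Activate", {"instances": 1})
dispatch("svc-b", "Activate", {"instances": 1})
dispatch("svc-a", "Overload")
print(models["svc-a"].params, models["svc-b"].params)
```

Because the processes keep no state of their own, you could run as many copies of each as the event load requires.  The sketch doesn’t demonstrate that scaling, but it is the property that lets them be spawned as needed.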

This is what an event-driven structure could look like.  If we have this, whether it’s in cloud deployment or NFV provisioning, we have the ultimate in scalability and efficiency.  An implementation based on event-handling would beat everything else.  We don’t have one today, but as I’ve noted we’re starting to see little steps toward recognizing the model as superior, and we may get one yet, even soon.  If we do, whoever fields it will reap some significant benefits.