Have We Missed a Fundamental Point on State?

One of the big and hidden problems with cloud-native and microservices is state.  True cloud-native elements should be fully scalable and resilient, and that combination mandates that these elements save nothing within themselves, even the context of a dialog with a user.  “State” or “context” is intrinsic to virtually all business applications, including applications designed to support online retail operations or customer support.  Thus, getting state/context into cloud-native is critical, and we just might have missed something fundamental.

We’ve known for a long time that “stateful” behavior was important in applications.  Level 3 switches, application gateway controllers, and load balancers often mention “stateful” behavior as part of their feature set.  That’s because a transaction usually involves a number of back-and-forth exchanges, and if the same instance of a process doesn’t handle them all, it’s possible that transaction processing will get tied up in knots and even create security issues.

The cloud today is most often used by businesses not to host everything, but to host the front-end piece of transaction processing.  Some of that is complicated by the need to support inherently stateful behavior within the cloud.  There are two pieces to that: state control of stateless microservices, and cloud storage of the data needed by all the possible instances of these front-end application components.

There are three broad ways to provide state to stateless components of an application.  The first and easiest to understand is to send dialog state information inward from the GUI or application, as part of each “event” that’s being generated.  Think of that as meaning that you tell a stateless process “this data is the response to a presentation of account information for update”, where you send not only the data the user updated but also the information needed by the process to interpret that data.  Any instance of the process can receive that message and handle it.
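
To make that concrete, here is a minimal Python sketch of front-end state control.  The message layout (a "context" object carried alongside the user's "data"), the field names, and the handle_event function are all assumptions of mine for illustration, not any particular toolkit's format.

```python
import json

def handle_event(message: str) -> dict:
    """Stateless handler: everything needed to interpret the data
    arrives inside the message itself, so any instance can process it."""
    event = json.loads(message)
    context = event["context"]  # dialog state supplied by the front end (hypothetical layout)
    data = event["data"]        # what the user actually entered

    if context["step"] == "account_update_response":
        return {
            "action": "update_account",
            "account": context["account_id"],
            "fields": data,
        }
    raise ValueError(f"unknown dialog step: {context['step']}")

# The GUI sends the user's changes plus the context that explains them.
message = json.dumps({
    "context": {"step": "account_update_response", "account_id": "A-1001"},
    "data": {"email": "user@example.com"},
})
result = handle_event(message)
print(result["action"])  # update_account
```

Nothing is held between calls, which is the whole point, but notice that the front end now has to build and carry that context on every exchange.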

The second mechanism is to have the stages of contextual transaction processing recorded in a database and accessed by the cloud-native process when it receives some work.  This is “back-end” state control because the state database is “behind” the process.  You still need some sort of transaction ID to link each new message with the back-end data store.
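
A rough Python sketch of back-end state control might look like the following.  The in-memory dictionary stands in for the shared state database, and the stage names, message types, and handle_message function are hypothetical; a real deployment would use an external store that every instance can reach.

```python
# Stand-in for the shared state database (assumption: a real system would
# use an external store reachable by every process instance).
STATE_DB = {}

def handle_message(txn_id: str, payload: dict) -> str:
    """Stateless handler: load the context by transaction ID,
    advance it, and write it back before returning."""
    record = STATE_DB.get(txn_id, {"stage": "start", "items": []})
    stage, kind = record["stage"], payload["type"]

    if kind == "add_item" and stage in ("start", "cart_open"):
        record["items"].append(payload["item"])
        record["stage"] = "cart_open"
    elif kind == "checkout" and stage == "cart_open":
        record["stage"] = "awaiting_payment"
    else:
        raise ValueError(f"message {kind!r} not valid in stage {stage!r}")

    STATE_DB[txn_id] = record
    return record["stage"]

# Two messages for the same transaction could land on different instances;
# the transaction ID is what ties them together.
print(handle_message("txn-42", {"type": "add_item", "item": "book"}))  # cart_open
print(handle_message("txn-42", {"type": "checkout"}))                  # awaiting_payment
```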

The final mechanism is “orchestration”, a kind of distributed state control.  Amazon provides this with Step Functions, which plots out the sequence of steps a given transaction goes through and runs stateless processes within a model framework of those steps.  While orchestration doesn’t have an explicit link to a database, the model and sequencing have to be stored somewhere.
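
The idea can be sketched in a few lines of Python.  To be clear, this is not the Step Functions API or its state language; it's a toy model of my own in which the step names, the MODEL table, and the run function are all hypothetical.  The point is only that the sequencing lives in the model, not in the stateless steps.

```python
from typing import Callable, Optional

# Stateless steps: each takes the transaction data and returns updated data.
def validate(data: dict) -> dict:
    return {**data, "valid": data["amount"] > 0}

def charge(data: dict) -> dict:
    return {**data, "charged": True}

def reject(data: dict) -> dict:
    return {**data, "charged": False}

Step = Callable[[dict], dict]
Chooser = Callable[[dict], Optional[str]]

# The model: step name -> (function to run, rule that picks the next step).
# A next step of None ends the transaction.
MODEL = {
    "validate": (validate, lambda d: "charge" if d["valid"] else "reject"),
    "charge": (charge, lambda d: None),
    "reject": (reject, lambda d: None),
}

def run(model: dict, start: str, data: dict) -> dict:
    """The orchestrator holds the position in the sequence; the steps hold nothing."""
    step: Optional[str] = start
    while step is not None:
        func, choose_next = model[step]
        data = func(data)
        step = choose_next(data)
    return data

print(run(MODEL, "validate", {"amount": 10})["charged"])  # True
print(run(MODEL, "validate", {"amount": 0})["charged"])   # False
```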

Orchestration is interesting because it’s a link between two related but still separate concepts: microservices and functions (lambdas).  The two get conflated all the time, but generally a microservice is persistent except under unusual conditions.  Microservices can scale, and heal by redeployment, but they are available while the application is running.  Functions are kind of hyper, meaning that they are transient in-and-out things.  A function loads when you need it and disappears when it’s done.  That makes it “serverless” and makes the issues of state management more acute, because functions can pop up anywhere.  We’ll get to that below.

The precise mechanisms available for these options vary among cloud providers and also among “tool providers”, which presents some challenges for developers.  Front-end state control is the most portable approach in terms of working on any cloud and with any toolkit used to build and deploy microservices, but it introduces more work on the user interface to maintain state.  Back-end state control poses a risk if the state database can be separated (in distance/latency terms) from the process instances accessing it.  Orchestration state control is quite implementation-specific, and so all three mechanisms have challenges.

As I noted above, persistent data storage in the cloud is another state/context issue.  If multiple instances are to access a state database, the access probably has to be mediated to prevent collisions when they update it.  This is a fairly classic problem that some database technologies address within themselves, so there are solutions available.  However, there’s been a lot of interest in cloud storage and data management recently, both among public cloud providers and among third-party vendors.
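
One common way to mediate those updates is a version check on each write, often called optimistic concurrency.  The Python sketch below is an illustration under assumptions: the StateStore class and its read/write methods are hypothetical, the store is an in-memory dictionary, and a real database would have to perform the check and the write as a single atomic operation.

```python
class VersionConflict(Exception):
    """Raised when another instance updated the record first."""

class StateStore:
    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        # Unknown keys read as version 0 with no value.
        return self._rows.get(key, (0, None))

    def write(self, key, expected_version, value):
        # Assumption: in a real database this compare-and-write is atomic.
        current_version, _ = self._rows.get(key, (0, None))
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._rows[key] = (current_version + 1, value)
        return current_version + 1

store = StateStore()
# Two instances read the same record at the same version...
version_a, _ = store.read("txn-42")
version_b, _ = store.read("txn-42")
# ...the first write wins, and the second is refused rather than silently lost.
store.write("txn-42", version_a, {"stage": "cart_open"})
try:
    store.write("txn-42", version_b, {"stage": "cancelled"})
except VersionConflict as err:
    print("retry needed:", err)
```

The losing instance would re-read the record and decide what to do with the newer state, which is the mediation step.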

We have some recent news in the space.  One startup, Reduxio, is about to launch a cloud-native storage and data management tool that can provide persistence of state/context information without sacrificing cloud-native benefits.  The KubeCon event last week focused in part on stateful support in Kubernetes and state-centric advances in the Kubernetes ecosystem, and Robin.io announced a collaboration with Red Hat and OpenShift to enhance stateful application capability.

All this is good news for cloud-native in general, and for broader use of public cloud services by enterprises as well.  The challenge is that we’re still not really homed in on a total solution.  The tools I just cited are great strategies for cloud databases to hold state and other data that can be expected to be updated with user activity, but they don’t totally address the question of state.  And since we’re still seeing multiple implementation options among public cloud providers, state management and persistent data strategies aren’t portable without some tuning.

There’s also the broader framework of deployment of this stuff to consider.  Kubernetes is the go-to orchestration approach for containerized applications, and you can get a number of Kubernetes-based solutions that work across multiple clouds as well as the data center.  For connecting services, we also have a number of approaches, including Istio, which I’ve noted in past blogs.  Logically we’d like to see stateful behavior integrated into/with both Kubernetes and Istio.

The question is what kind of stateful behavior we’re talking about.  Remember our earlier comment about microservices versus functions?  While functional computing is usually associated with event processing, an event is really just something external that demands software attention.  We have events in protocols all the time; every message is one.  We interpret events by associating them with processes through a state/event table.  You can surely frame such a table as an orchestration model for IoT events, but you can also frame a transactional dialog as a sequence of events, as noted above, and that’s what I think raises the big question.
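
For anyone who hasn't worked with one, a state/event table is simple enough to sketch in Python.  The states, events, and handler functions here are hypothetical, chosen to look like a transactional dialog; an IoT version would differ only in the names.

```python
# Handlers are stateless: they act on the context they are handed.
def greet(ctx):
    ctx["log"].append("greet")

def show_form(ctx):
    ctx["log"].append("show_form")

def save(ctx):
    ctx["log"].append("save")

# (current state, event) -> (process to run, next state)
TABLE = {
    ("idle", "login"): (greet, "menu"),
    ("menu", "select_update"): (show_form, "editing"),
    ("editing", "submit"): (save, "menu"),
}

def dispatch(state: str, event: str, ctx: dict) -> str:
    """Look up the process for this state/event pair, run it, return the new state."""
    handler, next_state = TABLE[(state, event)]
    handler(ctx)
    return next_state

state, ctx = "idle", {"log": []}
for event in ["login", "select_update", "submit"]:
    state = dispatch(state, event, ctx)
print(state)       # menu
print(ctx["log"])  # ['greet', 'show_form', 'save']
```

The table doesn't care whether a handler is a transient function or a persistent microservice, and it doesn't care whether the event came from a sensor or a user clicking "submit".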

Orchestration of functions combines the process of function hosting and state control.  You put a function somewhere, run it, and contextualize it through a distributed state control feature.  Could you, in theory, if a particular kind of event was regularly occurring, elect to keep the function in place?  Surely.  Could you, if you had a microservice that was rarely used, elect to unload it and reload it when needed?  Surely.

The point is that we’ve slipped into state control using a model that’s different for transient functions versus persistent microservices, and that’s a difference of application and conditions and not one that is as naturally polarized as we’ve made it.  That was a mistake, and we’re not moving as quickly to correct it as we should be.