It may be that the only thing worse than a trite “prescription” for something is a complete lack of definition for that same “something”. We may be in that situation with “cloud-native”. Worse, because we don’t really have a broadly accepted definition of what cloud-native is, we may be facing major challenges in achieving it.
The definitional part may be easier than we think. “Cloud-native” to me is an application model that works optimally in public cloud services, but is difficult or impossible to achieve in a data center. The trick in getting anything from this definition is to confront what’s truly different about public cloud.
This can’t be something subtle, like economics. Even if public cloud were universally cheaper than data centers (which it is not), that very universality would mean the benefit applied to any application, so you couldn’t accept “more economical” as the fundamental property of “cloud-native”. After all, “native” means “emerging uniquely from”. What are the “native” properties of the cloud? I think we can define four of them.
Number one is an elastic relationship between an application component and the resources that host it. Elasticity here means both “scalable”, in the sense that the instances of a component can multiply (infinitely, cost permitting) based on load, and “resilient”, meaning that a component can be replaced seamlessly if it breaks.
This is the most profound of the defining points of cloud-native, because it reflects the focus of the whole idea. People think that containers equal cloud-native, but the fact is that you can run anything in a container, and run it the same way it would run in a data center. There’s nothing uniquely “cloud” about containers, nor should there be. A container is a mechanism to simplify component deployment by constraining what a component can call upon from its hosting environment. A virtual machine is a general hosting concept; it looks like a server. A container looks like a specific kind of server, deployed in a specific network access framework.
I think you can make a strong argument that a cloud-native application should be based on containers, not because no other basis is possible, but because other hosting frameworks are (relatively) impractical. By presuming a specific but not particularly constraining framework within which componentized applications run, container-based systems enhance application lifecycle management.
In order for a component to be freely scalable (additional instances can be created in response to load) and resilient (the component can be replaced without impact if it fails), it cannot store information within itself. Otherwise, a new or additional instance would behave differently, because the stored data wouldn’t be there. This is sometimes called “stateless” or “functional” behavior.
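To make the distinction concrete, here’s a minimal sketch in Python. The component names and payloads are invented for illustration; the point is only that a stateful instance accumulates history while a stateless one does not.

```python
# A stateful component: each instance accumulates history, so a
# replacement or additional instance would behave differently.
class StatefulCounter:
    def __init__(self):
        self.count = 0

    def handle(self, request):
        self.count += 1
        return f"{request}:{self.count}"

# A stateless component: the output depends only on the inputs, so any
# instance (new, old, or one of a thousand) gives the same answer.
def stateless_handler(request, count):
    return f"{request}:{count + 1}"

a = StatefulCounter()
b = StatefulCounter()
a.handle("req")
print(a.handle("req"))  # "req:2" -- depends on this instance's history
print(b.handle("req"))  # "req:1" -- a "replacement" behaves differently

print(stateless_handler("req", 1))  # "req:2" no matter which instance runs it
```

Any state the stateless version needs (the count, here) must be passed in or held in an external store, which is exactly what makes its instances interchangeable.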
It’s difficult to build a transaction processing application entirely with stateless components; cashing a check or buying a product is inherently stateful. Hybrid cloud applications reflect this by creating an elastic front-end component set in the cloud, then feeding transactions to a transactional back-end usually run in the data center. That architecture is critical for cloud-native behavior.
The next thing you need is a mechanism for virtualizing the cloud-resident components. Containers, as I’ve said, aren’t about that. A “virtual component” looks like a single functional element whether it’s been running an hour or has just spun up, and whether there’s only one of them or a thousand. To backstop the ability to scale and replace components because they’re stateless, we need a way of binding a request to a specific component instance. That means finding the darn thing, since it might have been spun up anywhere, and also selecting an instance if the component has scaled, via a load-balancing mechanism.
You can do discovery and load-balancing already; in fact, many of the “Layer 4-7 switches” of the recent past were server load-balancing tools. The thing about cloud-native is that you don’t have servers in any fixed sense, so you don’t have a data center in which to sit load-balancing switches. You need distributed load balancing and discovery, and that means you need service mesh technology.
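The discovery-plus-load-balancing idea can be sketched in a few lines of Python. This is not how any real mesh is implemented; the registry, service name, and endpoints are all invented to show the binding step: a request is resolved to one concrete instance from whatever set happens to exist at that moment.

```python
import itertools

# A toy service registry: instances register wherever they spin up, and
# each request is bound to one concrete instance, round-robin.
class ServiceRegistry:
    def __init__(self):
        self._instances = {}  # service name -> list of endpoints
        self._cursors = {}    # service name -> round-robin iterator

    def register(self, service, endpoint):
        self._instances.setdefault(service, []).append(endpoint)
        self._cursors[service] = itertools.cycle(self._instances[service])

    def deregister(self, service, endpoint):
        # A failed instance is simply removed; because components are
        # stateless, no in-flight state is lost with it.
        self._instances[service].remove(endpoint)
        self._cursors[service] = itertools.cycle(self._instances[service])

    def resolve(self, service):
        # Bind this request to a specific component instance.
        return next(self._cursors[service])

registry = ServiceRegistry()
registry.register("checkout", "10.0.0.1:8080")
registry.register("checkout", "10.0.0.2:8080")

print(registry.resolve("checkout"))  # 10.0.0.1:8080
print(registry.resolve("checkout"))  # 10.0.0.2:8080
print(registry.resolve("checkout"))  # 10.0.0.1:8080 (round-robin wraps)
```

A real mesh distributes this registry and does the resolution close to the caller, rather than in one central table, but the contract is the same: the caller names a service, not a server.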
A service mesh is both a concept and a tool. Conceptually, a service mesh is a kind of fabric into which you deploy stateless components (microservices), and which provides those components the necessary discovery and load-balancing capability. As a tool, service meshes today are based on what are called “sidecars”, small software elements that act as “adapters”. The sidecar binds with (connects to) the microservice on one end, and with the fabric and related software tools that establish the service mesh on the other. Sidecars mean that microservices don’t each need to be written to include service-mesh logic.
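The division of labor a sidecar creates can be sketched as follows. This is an in-process caricature, not any real mesh’s API; the service, the `Sidecar` class, and the retry policy are all invented. What it shows is the key point from above: the business logic contains no mesh logic at all.

```python
# Plain stateless business logic; it knows nothing about the mesh.
def price_service(request):
    return {"item": request["item"], "price": 42}

# The "sidecar" adapter: mesh concerns (here, just retries) wrap the
# service from outside. In a real mesh this layer would also handle
# discovery, load balancing, encryption, and telemetry.
class Sidecar:
    def __init__(self, service, retries=2):
        self.service = service
        self.retries = retries

    def call(self, request):
        for attempt in range(self.retries + 1):
            try:
                return self.service(request)
            except ConnectionError:
                if attempt == self.retries:
                    raise  # give up after the final retry

proxy = Sidecar(price_service)
print(proxy.call({"item": "widget"}))  # {'item': 'widget', 'price': 42}
```

Swap in a different `Sidecar` and the service code doesn’t change, which is the whole argument for sidecars over baking mesh logic into every microservice.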
Istio is the most popular of the service mesh tools today, with Linkerd in second place. However, unlike container orchestration, which has decisively centered on Kubernetes, there’s still a chance that something better will come along in the service mesh space. Right now, service meshes tend to introduce considerable latency.
The third thing that cloud-native implementations need is serverless or “functional” capability. In programming, a “function” is an expression whose output depends on its inputs and nothing else. That’s a subset of the “stateless” behavior described above. The functional property is important for serverless cloud because, in that model, a component/microservice is loaded only when it’s invoked, so there’s no precommitted resource set to manage.
Serverless isn’t for everything, of course. The more often a microservice is invoked, the greater the chance that it would be cheaper just to instantiate the component permanently and pay for the resources full-time. But agility and resiliency are the fundamental properties of the cloud, and serverless extends these properties to applications that are so bursty in use that they wouldn’t be economical in the cloud at all without a different payment model.
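The economic trade-off described above is just arithmetic, and a back-of-envelope version makes it concrete. The prices below are made-up placeholders, not any provider’s actual rates; only the shape of the comparison matters.

```python
# Hypothetical prices -- placeholders, not any cloud provider's rates.
SERVERLESS_COST_PER_INVOCATION = 0.0000002  # $ per invocation
ALWAYS_ON_COST_PER_MONTH = 30.0             # $ per month, permanent instance

def monthly_serverless_cost(invocations_per_month):
    # Serverless: you pay per invocation and nothing when idle.
    return invocations_per_month * SERVERLESS_COST_PER_INVOCATION

# Past this invocation rate, the permanent instance is cheaper.
break_even = ALWAYS_ON_COST_PER_MONTH / SERVERLESS_COST_PER_INVOCATION
print(f"break-even at {break_even:,.0f} invocations/month")

# A bursty component far below break-even is cheaper serverless...
print(monthly_serverless_cost(1_000_000) < ALWAYS_ON_COST_PER_MONTH)    # True
# ...while a hot one past break-even is cheaper as a permanent instance.
print(monthly_serverless_cost(200_000_000) > ALWAYS_ON_COST_PER_MONTH)  # True
```

Under these invented rates the crossover sits at 150 million invocations a month; with real pricing the number moves, but the bursty-versus-hot logic is the same.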
Serverless raises the last of our cloud-native needs, though the need only came to light with the invention of serverless cloud. In fact, it applies to all forms of truly stateless cloud-native computing. We’ll call it sequencing.
An “application” is usually considered to be an ordered set of actions along a “workflow”. The question is, how does Microservice A know what’s supposed to run next? Workflows stitch components into applications, but what defines them? We’ve had some sequencing strategies for decades, but cloud-native needs a complete set.
Most applications are sequenced by internal logic, meaning that there’s some master component that “calls” the others at the time and in the order needed. That’s sometimes called a “service broker” and sometimes referred to as a “storefront design pattern”. Some applications are sequenced by “source routing”, meaning that an ordered list of tasks is attached to a unit of work, and at each step the current item is stripped off the list, exposing the next item to be performed. In some systems, each unit of work has a specific ID, and a state/event table associates processes with the state and “event” ID.
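The source-routing strategy in particular is simple enough to sketch. The task names and payload below are invented; the mechanism is the one described above: an ordered task list travels with the unit of work, and each step strips off the current item so the next one runs next.

```python
# Illustrative task components -- each one stateless.
def validate(work):
    work["valid"] = True
    return work

def price(work):
    work["total"] = work["qty"] * 10
    return work

def confirm(work):
    work["status"] = "confirmed"
    return work

TASKS = {"validate": validate, "price": price, "confirm": confirm}

def dispatch(work):
    # Source routing: the route travels with the work item, so no
    # master "service broker" needs to know the whole workflow.
    while work["route"]:
        step = work["route"].pop(0)  # strip the current item off the list
        work = TASKS[step](work)
    return work

order = {"qty": 3, "route": ["validate", "price", "confirm"]}
print(dispatch(order))
# {'qty': 3, 'route': [], 'valid': True, 'total': 30, 'status': 'confirmed'}
```

The state/event alternative would replace the traveling route with a table keyed by the work item’s ID and current state; the broker pattern would move the whole loop into one master component.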
There are tools available to support all of these options, and there’s really no “best” single strategy for sequencing, so it’s unlikely that we’ll see one supreme strategy emerge. What’s critical is that all strategies be possible.
That closes an important loop. Cloud-native is an application architecture, a model of development and deployment that maximizes the benefits of cloud computing. Like all application architectures, it has to be there in its entirety, not in pieces that vendors find convenient. That means that when you read that somebody has a “cloud-native” approach, you need to understand how their approach fits into all these cloud-native requirements. Most don’t, and that means they’re blowing smoke.