Building and Being Cloud-Native

Suppose I want to develop a true cloud-native version of a network service or feature, maybe even an application? I’ve noted in past blogs that the term “cloud-native” is rapidly becoming the greatest “wash” in the industry; vendors, operators, analysts, reporters, and editors all spread it over anything related to the cloud. That’s a bad idea, but correcting the problem depends in part on having a technical framework for the real thing. I can’t do a full cloud-native design tutorial in a blog, but I can lay out the main points, and I’ll do it in terms of trade-offs.

The first tradeoff is componentization versus performance. It’s nearly impossible to make a monolithic application scalable and resilient, so to achieve the cloud-native goal, we have to divide an application into components that can be duplicated under increased load, and replaced with little or no impact if there’s a failure. Components are separate software pieces coupled through APIs and networks, and this generates latency. The more components, the more network connections add to overall latency, and the more likely it is that the application’s QoE will fall below acceptable levels.
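
To put rough numbers on that, here is a minimal sketch of how latency accumulates as the same work is split across more components. It assumes a simple linear chain with a fixed cost per network hop; the function name and the millisecond figures are purely illustrative, not measurements from any real service.

```python
def end_to_end_latency_ms(processing_ms, hop_ms):
    """Total latency of a linear chain of components.

    processing_ms: per-component processing time, in chain order.
    hop_ms: assumed fixed network cost between adjacent components.
    """
    hops = max(len(processing_ms) - 1, 0)  # one hop between each adjacent pair
    return sum(processing_ms) + hops * hop_ms


# The same 12 ms of work, as one component and then as four.
monolith = end_to_end_latency_ms([12], hop_ms=2)
split_four_ways = end_to_end_latency_ms([3, 3, 3, 3], hop_ms=2)
```

In this toy model the work itself doesn't get any slower, but the four-way split pays for three extra hops, which is exactly the cost that has to be weighed against the scaling and resilience benefits.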

The second tradeoff is transactions versus events. Commercial activities are rooted in transactions, which are designed to record a significant activity. A “transaction” is a request for processing, one that could ask for the contents of something or provide an update to something. They’re typically things that make a persistent change in a record, like the balance in your bank account or the number of widgets still in stock. Events are signals of a real-world condition change. They’re “atomic” and they are more likely to signal a change in the state of some physical system, like a door opening or a server failing.

The third tradeoff is platform versus specialized. Application software is supported by “system” or “platform” software, which is software designed to perform tasks that are likely common across many applications. If there are going to be many cloud-native applications, then a platform to handle them will reduce overall development and maintenance effort and make operations more consistent and efficient. However, a Swiss army knife doesn’t drive screws or cut wood as well as a screwdriver or saw, respectively. A general toolkit is likely to impose more overhead and more restrictions on applications than a specialized implementation of what might be considered a “system” function.

The first of our tradeoffs relates to how componentized we’ll make software to make it possible to scale and replace components. Approaching this issue means understanding what “atomic services” really are. Just because you can take Task A and Task B and write separate code for them doesn’t mean they should be microservices. The question is whether anything is gained by making Task A and B separate. Is the load impact on the two different in any way, meaning they might scale differently? Will dividing the tasks make it easier to accommodate a failure, or will two separate components both have to be running in order for anything useful to be done? The latter means componentizing would increase total fault risk.
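
The fault-risk point can be sketched with simple availability arithmetic. This assumes component failures are independent, and the 99.9% figure is an illustrative assumption rather than data from any real deployment; the function names are hypothetical.

```python
from math import prod


def all_required_availability(availabilities):
    """Availability when every component must be running for useful work."""
    return prod(availabilities)


def any_sufficient_availability(availabilities):
    """Availability when any one of the instances can do the work alone."""
    return 1 - prod(1 - a for a in availabilities)


# Task A and Task B split apart, each 99.9% available, both required.
both_needed = all_required_availability([0.999, 0.999])

# Two interchangeable instances of one component, either can serve.
either_works = any_sufficient_availability([0.999, 0.999])
```

Under those assumptions, splitting into two pieces that both have to run lowers availability to roughly 99.8%, while two interchangeable instances raise it to roughly 99.9999%. Componentizing only helps resilience when the pieces can fail or scale independently.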

The biggest mistake most cloud-native architects make is over-componentizing. Often it’s as simple as believing that if microservices are good, more of them is better. Sometimes it’s a byproduct of component reuse policies; smaller units of functionality are easier to reuse. What you should do is divide your cloud application into the smallest truly independent functional units as a starting point, and be prepared to recombine or subdivide if you find out your assumptions were wrong.

The second tradeoff is transactional versus event-driven, and this is actually the most complex tradeoff in technical terms, particularly for network function applications of cloud-native. A network is a complex structure with a lot of things going on, some of which are indications of a problem. It’s critical that these things are accommodated by the software, which means first that the software has to be notified of them, and second that the notifications are interpreted correctly. That’s a fundamentally event-driven process. On the other hand, the standards bodies involved with telecom/networking, notably the 3GPP, O-RAN Alliance, and ETSI, tend to define interfaces/APIs between their elements that have transactional properties. An example is using an API to check for status, which is “polling”. This tends to impose a synchronous model of execution, because you would typically expect to wait for a response. Events are posted, so they encourage an asynchronous mode of execution; they happen when they happen. Software has to handle both situations.
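
The difference between polling and posted events can be sketched in a few lines. The Device class below is a hypothetical stand-in for a network element, not an interface from 3GPP, O-RAN, or ETSI, and the event names are assumptions made for illustration.

```python
import queue


class Device:
    """Hypothetical network element supporting both interaction styles."""

    def __init__(self):
        self.link_up = True
        self._subscribers = []

    def get_status(self):
        # Transactional style: the caller asks and waits for the answer.
        return {"link_up": self.link_up}

    def subscribe(self, handler):
        # Event style: the caller registers once and is told about changes.
        self._subscribers.append(handler)

    def set_link(self, up):
        changed = up != self.link_up
        self.link_up = up
        if changed:
            event = {"type": "link_up" if up else "link_down"}
            for handler in self._subscribers:
                handler(event)  # posted when it happens, not when asked


events = queue.Queue()        # events wait here until software is ready
device = Device()
device.subscribe(events.put)

polled = device.get_status()  # synchronous request and response
device.set_link(False)        # a real-world change posts an event
posted = events.get_nowait()  # consumed later, asynchronously
```

The poll only reveals the state at the moment of asking, so a failure between polls goes unnoticed until the next one. The posted event arrives on its own schedule, which is why the consuming software needs a queue or similar mechanism to absorb it.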

Event processing is a complex topic in itself, but the most important piece is state or context. The difference between transactional and event-driven applications is that the former are stateful and the latter are usually stateless. A transaction is part of a context between the originator and the system, whereas an event is simply the notification that something happened. This distinction is important because if something is contextual, then components processing the message have to be able to sustain their knowledge of context, of state, somehow. That’s a problem for cloud-native.

Suppose I have a request that’s assigned to a specific component for handling, but while that component is working, it fails. I can spin up another instance in the cloud easily enough, but what happens to the first request? The second instance doesn’t know anything about the request, or about the first instance’s role in it. Even if the first instance remembered that it had sent a record to the user to be updated, the second one knows nothing about it. Somehow, state has to be provided to that second instance or resiliency through replacement won’t work.

In an event-driven system, the events drive specific atomic tasks. If context is important, it’s kept either by the client or in a back-end database, where each request is given a unique ID that can be used to look up where things are in the sequence of actions. That means that the design facilitates both replacement and scaling, and in fact any instance of a microservice could field a compatible request and you’d get the same result.
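
Here is a minimal sketch of the back-end approach, assuming a plain in-memory dictionary as a stand-in for the database. The class names, the step names, and the request ID are all hypothetical.

```python
class StateStore:
    """Stand-in for a back-end database, keyed by request ID."""

    def __init__(self):
        self._rows = {}

    def load(self, request_id):
        # Return a copy so workers never share in-memory state.
        return dict(self._rows.get(request_id, {"step": 0}))

    def save(self, request_id, state):
        self._rows[request_id] = dict(state)


# Hypothetical sequence of atomic actions for one request.
STEPS = ["validate", "send_record", "await_update", "commit"]


class Worker:
    """Stateless instance: it remembers nothing between calls."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, request_id):
        state = self.store.load(request_id)   # where are we in the sequence?
        if state["step"] >= len(STEPS):
            return "done"
        action = STEPS[state["step"]]
        state["step"] += 1
        state["last_worker"] = self.name
        self.store.save(request_id, state)    # context lives outside the worker
        return action


store = StateStore()
first = Worker("instance-1", store)
done_by_first = [first.handle("req-42"), first.handle("req-42")]

del first                                     # simulate the instance failing
second = Worker("instance-2", store)
done_by_second = [second.handle("req-42") for _ in range(3)]
```

Because the position in the sequence is looked up by request ID rather than held in the worker, the replacement instance resumes at the third step without knowing the first instance ever existed. The same property lets several instances share the load.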

You can make a system designed to support events recognize the sequence of events in a transaction, and if you do it right, you can make that system fully cloud-native, meaning it will exploit cloud scalability and resilience fully. It is very difficult to make an intrinsically stateful transactional system cloud-native, except by making it contextual-event-driven.

The last of our tradeoffs is the platform-versus-custom one. This is rarely an either/or choice; most cloud-native software will use some packaged system software, particularly the operating system. The majority of cloud-native development uses “middleware” software that provides broad services like deployment (Kubernetes) and discovery and connectivity among microservices (service meshes like Istio, Linkerd), but monitoring and management software may also be involved. Where the big question arises is in the specifics, and in particular in how state is managed.

In the middleware platform area, the question is just how generalized the cloud-native strategy has to be. In the current 5G hosting example, should architects be thinking about something specific like O-RAN, which doesn’t fully exploit a broad middleware toolkit because it doesn’t have to, or about edge computing, which depends on the broadest application base? I’m of the view that they have to be thinking about the latter.

The specifics of cloud-native applications in networking have to reflect the fact that a network is a community, and a hierarchy. There are functional elements, today’s switches and routers, owned by various stakeholders, and all of these have to cooperate in moving traffic. A service is almost always coerced out of this complex structure, and so what we have in effect is a “platform”, the network, hosted on a “platform”, the device and server infrastructure.

This relates to state control particularly. The best way to think of a network service is as a modeled collection of intent models, each of which has its own operating state. That yields a state/event organization of tasks, where a table identifies each state and how each possible event is treated in it. That’s the model the TMF NGOSS Contract first suggested for network automation. It relates to cloud-native because the state/event intersections define atomic actions that can be mapped to cloud-native microservices, and because the service model contains all the data, so the microservices can be stateless, scalable, and resilient. I think that other methods of state control (such as client-side, back-end, or step-function) may be better for other applications in the cloud, and so I don’t think that state/event logic should be a general part of a cloud-native platform. Let the applications manage state themselves, in the way that works best.
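
A state/event table of this kind can be sketched as follows. This is an illustration of the general idea only, not the TMF NGOSS Contract data model; the states, events, and handler names are hypothetical.

```python
# Stateless actions: everything they need arrives in the model element
# and the event, and everything they produce is written back to the element.
def activate(element, event):
    element["log"].append("activating")


def report_fault(element, event):
    element["log"].append(f"fault: {event['detail']}")


def restore(element, event):
    element["log"].append("restored")


# (current state, event type) -> (action, next state)
STATE_EVENT_TABLE = {
    ("ordered", "activate"): (activate, "active"),
    ("active", "fault"): (report_fault, "degraded"),
    ("degraded", "repaired"): (restore, "active"),
}


def dispatch(element, event):
    """Look up the state/event intersection and run its action."""
    key = (element["state"], event["type"])
    if key not in STATE_EVENT_TABLE:
        element["log"].append(f"ignored {event['type']} in {element['state']}")
        return element
    action, next_state = STATE_EVENT_TABLE[key]
    action(element, event)
    element["state"] = next_state
    return element


# One element of a service model; it carries its own state and data.
element = {"name": "access-segment", "state": "ordered", "log": []}
for ev in [
    {"type": "activate"},
    {"type": "fault", "detail": "link down"},
    {"type": "fault", "detail": "duplicate report"},
    {"type": "repaired"},
]:
    dispatch(element, ev)
```

Each table entry is a candidate for its own microservice, and since the element record holds all the context, any instance of that microservice could process the event. In a real deployment the element would live in a shared repository rather than a local dictionary.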

If we sum this all up, the lesson is that you have to design an application to be able to take advantage of the cloud, and that design will have to address the tradeoff points I outlined. As I said above, and have said in other blogs, the fact that the term “cloud-native” is applied to things that aren’t architected to actually be scalable and resilient is a big risk, because we’re expecting early applications in telecommunications (like 5G RAN and O-RAN) to drive edge deployments that can then be leveraged for other services and experiences. If we don’t design those telecom applications to cloud-native standards, we may create the wrong platforms for broader use.

One challenge we have to address here was noted earlier: the interfaces defined in most telecom standards, including 5G, tend to be transactional rather than event-based. This doesn’t mean that the entire software framework has to be transactional, but it does mean that there’s a risk that the interface specifications will induce architects to think of the specifications as defining boxes rather than elastic and agile processes. I’m going to try to get information on the real cloud-native state of some of the 5G stuff, and I’ll report what’s really being done, if I’m successful.