The Future of the Service Mesh in Telecom Applications

Are service meshes perfect for telecom? Some articles have suggested they are, or at least are essential. I’ve also heard from telecom users who stop just short of saying that service meshes are the death of telecom applications. It’s useful to see why there’s such a divergence in viewpoint, and also to see whether there’s a final decision to be made on the topic.

Let’s start by looking at what we mean by a service mesh. Cloud-native applications are composed, semi-dynamically, from a collection of agile microservices. The “agile” qualifier means that the microservices are designed to be freely distributable across a resource pool, with the ability to scale with load and to replicate to repair a failure. Picture, then, this dynamic system of microservices stitched by workflows. Nice picture, but to get to the gritty details, how does the work find the microservices? We have to address things in a network, and in order to pass work along, we have to know where we’re passing it, in addressing terms.

The problem with this in a cloud-native microservice world is that it may be a challenge to know where something is, how many instances of it there are, which one we should be using, and so forth. Reliable and efficient communication among microservices is essential to high-quality experiences, and you can’t let every development team come up with its own approach, particularly since cloud-native design favors a lot of reuse of components.

Think of a service mesh as a kind of control-plane-and-data-plane structure. Rather than having each microservice implement whatever technologies the service mesh might include, each is instead equipped with a sidecar or proxy element that represents the service in the mesh. The service stays in a “functional plane” while all the communication is handled through the sidecar. If you change service mesh technologies, the worst you’d face is changing sidecars. Usually even that’s not required, because most service meshes use a common sidecar technology such as Lyft’s Envoy.

To understand what a service mesh was designed for, consider the concept of the API proxy as an alternative. Widely used in applications, an API proxy sits between users and services and presents a constant API to the user side, while mapping the service side to whatever instance should be run. This adds load balancing and failure recovery. A service mesh is a distributed form of the same concept, but the principle is very similar, and we’ll see why that’s important.
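To make the API proxy idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular proxy product works: the `Instance` and `ApiProxy` names, the round-robin policy, and the addresses are all hypothetical. The caller always talks to the proxy, and the proxy decides which instance actually gets the work, skipping any instance marked unhealthy.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instance:
    """One running copy of a service (hypothetical model for illustration)."""
    address: str
    healthy: bool = True


@dataclass
class ApiProxy:
    """Presents one stable entry point and picks a backend instance per call."""
    instances: List[Instance] = field(default_factory=list)
    _next: int = 0  # position of the round-robin cursor

    def route(self) -> str:
        """Return the address of the next healthy instance, round-robin."""
        # Try each instance at most once, starting from the cursor.
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next % len(self.instances)]
            self._next += 1
            if candidate.healthy:
                return candidate.address
        raise RuntimeError("no healthy instance available")


# Usage: two instances share the load; when one fails, traffic shifts.
proxy = ApiProxy([Instance("10.0.0.1:8080"), Instance("10.0.0.2:8080")])
first = proxy.route()
second = proxy.route()
proxy.instances[0].healthy = False  # simulate a failure
after_failure = proxy.route()
```

The user side never sees which instance was chosen, which is what makes load balancing and failure recovery possible behind a constant API. A service mesh moves this same decision out of one central proxy and into the sidecars attached to each service.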

There are way more API proxies in the world than service meshes. The reason is that for “basic” service-to-user mapping, an API proxy is fine. In fact, it may well be a lot better, because a service mesh adds latency to something that may already have more than an ample supply—a microservice deployment. Every service is a hop no matter how you discover the route of the workflow, and so transit delay accumulates. Add in the process of sidecar proxy handling and you add in more latency. If we could assume that a given application was more about “services” than “microservices”, meaning that there were fewer components to lace into a workflow, then API proxies would be fine and meshes would be overkill.
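The way latency accumulates can be sketched with simple arithmetic. The function below is a rough model, not a measurement: the function name, the per-hop figures, and the assumption that each hop crosses two sidecars (the caller's on the way out and the callee's on the way in) are all illustrative choices, and real numbers depend heavily on the deployment.

```python
def workflow_latency_ms(hops: int,
                        network_ms_per_hop: float,
                        service_ms_per_hop: float,
                        sidecar_ms_per_traversal: float = 0.0) -> float:
    """Estimate end-to-end latency for a strictly sequential workflow.

    Assumes every hop costs the same and that, when a mesh is present,
    each hop passes through two sidecars (outbound and inbound).
    """
    per_hop = (network_ms_per_hop
               + service_ms_per_hop
               + 2 * sidecar_ms_per_traversal)
    return hops * per_hop


# Illustrative figures only: 1 ms network transit and 2 ms of processing
# per hop, with 0.5 ms added per sidecar traversal when a mesh is used.
few_services_no_mesh = workflow_latency_ms(3, 1.0, 2.0)
many_microservices_mesh = workflow_latency_ms(12, 1.0, 2.0, 0.5)
```

With these made-up inputs, three coarse services without a mesh come to 9 ms, while twelve microservices behind sidecars come to 48 ms. The sidecar overhead matters, but most of the difference comes from the number of hops, which is the point about "services" versus "microservices".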

Let’s look at 5G, and in particular Open RAN. We have, in the specification, five or six functional components, depending on whether you count the Non-Real-Time RIC as part of Open RAN or part of the orchestration function that’s separate from it. What that means is that compliant Open RAN implementations have five or six services/microservices (whatever you choose to call them). If those were to be broken into microservices, you’d be unable to expose the latter without being non-compliant. So, it would be my view that Open RAN, at least, doesn’t require microservices be used.

Now look at the APIs. I contend that the message flows defined (shown in diagrams with nice descriptive names starting with “Q” and “E” and “F”) are explicitly steered by the specification. You could easily implement this using an API proxy, which could handle auto-scaling and resilience.

My point is that if all you’re doing in 5G is implementing the Open RAN or 3GPP specifications, there simply isn’t enough component complexity to require either microservices or service meshes. Does this mean that meshes and telecom aren’t perfect together? Not yet.

There’s 5G, and then there’s 5G as an ecosystem. By the former, I mean the communications and connectivity features mandated by the standards. By the latter, I mean the system of applications, features, and tools that create application value, manage the infrastructure, and manage the services. If you look back at an Open RAN reference with fresh eyes, you’ll notice that the RAN Intelligent Controller (RIC) is divided into six or so elements of an “application layer”. Those might be microservices, right? Then there are the applications that have justified 5G, like (hypothetically) IoT. They may also benefit from a cloud-native, microservice-based design.

Or not. We still haven’t accepted a basic truth, which is that it’s not only possible but easy to decompose applications too far, to create so many coupled components that the network delay created is insurmountable, even without any additional processing or handling at the mesh or proxy level. It’s been a practice in software for decades to write modular code. A dozen modules combined into a single load loses the granular scalability of the same dozen distributed in the cloud, but gains efficiency. Before we decide that service meshes are essential in telecom, and in particular in 5G, we have to decide whether we’re “meshing” to accommodate a decision to decompose more than we should have.

We may be victims of our organizations here. Development and operations have traditionally been separate, to the point where the practices that unify them to deploy applications efficiently turned into an industry with its own name: DevOps. I’ve talked with a lot of enterprise operations people who are moaning about the lack of awareness of operations efficiency among developers. They see over-componentization as a contributor to degradation of QoE, and in many cases to an increase in operations cost and complexity too. We could see the same thing in the telco world if we rush forward to componentization with minimal benefits.

Not every new technology is universally valuable, and some aren’t even particularly useful. When the value of service meshes is linked to the value of cloud-native, and when we’re trying to apply cloud-native principles to an infrastructure model like 5G that was designed around virtual boxes, not real virtual functions, we’re constraining the benefits by limiting the implementation.

However, and it’s a big “however”, we have to accept that there are indeed higher-layer services that depend on microservices, cloud-native, and service mesh. We may not want to make them universal in telecom, but we probably have to prepare to adopt them where they make sense. That means that, like cloud computing, cloud-native and service mesh may coexist with other more traditional development models for years, and maybe decades.