Making the Cloud-Heads and Bell-Heads See Eye to Eye

It’s pretty clear that cloud providers think they have a shot at the gold (forget brass) ring in terms of telco services. I hear the signals of interest from both cloud provider and telecom contacts, but I wonder whether either side understands just how wide the gulf between them really is. Over the last two weeks, I’ve gotten a lot of commentary from cloud provider telecom insiders and from the telcos themselves. What I’ve found sure looks like a combination of one group who knows the right approach but not what to do, and another group with the opposite problem.

The cloud people know how to do cloud-native. They understand how to build modern applications for the cloud, and how to deploy and manage them. The telcos know how to do network services. They understand the challenges of keeping a vast cooperative device system working, and profitable. Making the twain meet is more than a philosophical challenge, in no small part because both parties are locked in their own private worlds. Neither understands that while they speak different languages and thus risk misunderstandings, there's a more fundamental problem. They're both wrong.

When I talk to cloud people about telecom and cloud-native design for services, they draw on their experience with social media and say that they have the answer. Social media is huge. It scales. It works. It makes money. All of that is true, but it's not really relevant to telecom providers, because what the cloud and social media people assume as a kind of unstructured, undefined, taken-for-granted piece is exactly what the telcos are trying to sell.

Social media is indeed huge. Billions of users are on it, and the systems and software scale to support them. Players like Twitter and Netflix and Google have pioneered network software models that work at this level, and while there have been glitches in all the social media platforms from time to time, users aren’t rushing to the exits, so the software works.

The problem is that social media expects underlying connectivity provided by somebody else. They don’t call the space “over-the-top” or OTT for nothing. The software that cloud and social media types provide is application software, not network software. The principles of operation for social media are different. All of the events associated with social media are generated by people, and for the most part those people connect within small clusters. Twitter doesn’t have to organize the whole Twitter universe to manage a Tweet, only the followers of the person doing the Tweeting.

In networks, events are more likely to be generated by problems, or at least abnormal conditions. A problem could crop up anywhere, and remediation has to be organized across a large set of cooperative devices. Not only that, there’s likely a strict time constraint on operations to consider. If you post a Tweet, how long would you expect to wait for a response? Your followers might be doing any number of things (including ignoring you), and so a delay could be considered normal. With network features and services, a delay is often the indication of a lost response or a failed element, and you can’t wait minutes to see what’s happening and set about getting it fixed.
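That time-constraint point can be made concrete with a small sketch. In a network context, silence past a deadline is itself an event to act on, not something to wait out. This is a hypothetical illustration, not any real operator's code; the function and queue names are my own.

```python
import queue

def await_response(responses: "queue.Queue[str]", timeout_s: float) -> str:
    """Return the response if it arrives in time; otherwise treat the
    silence as a fault event that needs remediation."""
    try:
        return responses.get(timeout=timeout_s)
    except queue.Empty:
        # In a network service, a response this late usually means a lost
        # message or a failed element, so we synthesize a fault event.
        return "FAULT: response deadline exceeded"

q: "queue.Queue[str]" = queue.Queue()
print(await_response(q, timeout_s=0.1))  # nobody responded: a fault, not a shrug
```

A social-media system could simply wait longer; a network lifecycle system has to convert the timeout into a new event and start fixing things.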

Social media scales, of course. You have those billions in the Twitterverse or on Facebook, and they all do stuff. The thing is, a Tweet or post on Facebook is atomic. The processing isn’t complicated; you look at the list of followers and you send the post/Tweet to them, replicating the message as needed. Where this processing happens is pretty open; you could in theory have the originating device do a lot of the heavy lifting, and surely you could have a resource pool from which you could grab capacity to do the replication and routing. With networks, the events have potentially universal scope, and there’s a big question about where you process them, and how.
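The "atomic" nature of a post is easy to show. Handling one post touches only the poster's follower list, so any worker in a resource pool can process it with no coordination against other posts. A minimal sketch, with made-up names and data:

```python
from typing import Dict, List

# Toy follower graph; in a real system this would be a lookup service.
followers: Dict[str, List[str]] = {
    "alice": ["bob", "carol"],
    "bob": ["alice"],
}

def fan_out(author: str, post: str) -> Dict[str, str]:
    """Replicate a post to the author's followers. Note that no global
    state is read or written: the operation is self-contained."""
    return {f: post for f in followers.get(author, [])}

print(fan_out("alice", "hello"))  # {'bob': 'hello', 'carol': 'hello'}
```

Nothing here has universal scope, which is exactly why it parallelizes so well, and exactly what a network fault does not give you.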

A network problem often generates a whole chain of events, which is why we have fault correlation. The events may relate to a common cause (fault correlation is all about that), but there’s also the problem of coincidence, meaning events whose responses are likely to involve some of the same elements. How do you mediate remediation of multiple faults when some of the things that are involved are shared among them? How do you prevent resources from being allocated twice, or more?
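One crude way to frame the double-allocation problem is an all-or-nothing claim on shared resources: a remediation either gets everything it needs or backs off and re-plans. This is a deliberately simplified sketch (single-threaded, invented names), not a proposal for how an orchestrator should actually do it:

```python
from typing import List, Set

# Resources already committed to some in-flight remediation.
claimed: Set[str] = set()

def try_remediate(fault: str, needed: List[str]) -> bool:
    """Atomically claim every needed resource, or claim none of them.
    A False return means a collision: this fault must wait or re-plan."""
    if any(r in claimed for r in needed):
        return False
    claimed.update(needed)  # the claim succeeds as a unit
    return True

print(try_remediate("link-1-down", ["spare-A", "spare-B"]))  # True
print(try_remediate("link-2-down", ["spare-B"]))             # False: spare-B taken
```

Even this toy version shows the mediation problem: the second fault is real and urgent, but its remedy collides with the first, and something has to arbitrate.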

Operators have understood the difference, and have fallen back on the model we might call “virtual humans”. We have people in a network operations center (NOC), and before the heady days of operations automation, problems got reported to this pool of humans and humans took care of them. If there were multiple problems, the humans took them in order, or by priority. If there was a risk of colliding remedies, the humans coordinated. No wonder we have monolithic software models in projects like NFV and ONAP!
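The "virtual humans" pattern can be sketched in a few lines: every problem lands in one priority queue, and a single serial worker takes them in order, just as NOC staff would. Serialization sidesteps the collision problem, but it caps throughput, which is the monolithic-model trade-off. A hypothetical illustration:

```python
import queue

# One queue of problems, one serial handler: the NOC, in software.
events: "queue.PriorityQueue[tuple[int, str]]" = queue.PriorityQueue()
events.put((2, "high CPU on router-7"))
events.put((1, "fiber cut on span-3"))  # lower number = higher priority

handled = []
while not events.empty():
    priority, problem = events.get()  # one problem at a time, in order
    handled.append(problem)

print(handled)  # fiber cut first, then the CPU alarm
```

Because only one remediation is ever in flight, nothing can be allocated twice; the price is that every other problem waits its turn.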

The cloud community has created cloud-native, microservice-based applications fairly easily, in no small part because their problem is easy. You could draw out the logic of a post-to-followers social-media framework on a napkin. It’s a lot harder to do that for network lifecycle automation, which is what things like 5G O-RAN are really about. How do you organize all the pieces? You could layer on run-time workflow orchestration, service meshes, and so forth, but how do you deal with collisions in remediation, or with prioritization of problems?

I don’t believe for a moment that you couldn’t make telecom services into cloud-native applications, but I’ve become less certain that either cloud providers or network operators have a clue as to how to proceed. Certainly the telcos have demonstrated that they can build applications only by assuming that there’s a single thread of processing fed by a queue of events. Just like humans. Will the cloud providers see the telco world as Tweets, just as the telcos see it as NOC-resident humans? If so, their approach won’t work any better.

There are a lot of suggested solutions to the two-different-worlds dilemma we face. Graph theory, state/event tables, you name it. Most of them would probably work, but do either the cloud providers or the telcos understand that they’re needed, and would either group know how to apply them?
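To make the state/event-table suggestion concrete: the idea is that the next action depends explicitly on the pair (current state, event), so lifecycle behavior is written down as data rather than buried in one monolithic thread. The states, events, and action names below are invented for illustration:

```python
from typing import Dict, Tuple

Transition = Tuple[str, str]  # (next state, action to invoke)

# A state/event table for one hypothetical service element.
table: Dict[Tuple[str, str], Transition] = {
    ("active", "fault"):    ("repairing", "start-remediation"),
    ("repairing", "fixed"): ("active", "log-and-resume"),
    ("repairing", "fault"): ("repairing", "queue-secondary-fault"),
}

def step(state: str, event: str) -> Transition:
    """Look up the transition; unknown combinations are ignored rather
    than crashing the element."""
    return table.get((state, event), (state, "ignore"))

print(step("active", "fault"))     # ('repairing', 'start-remediation')
print(step("repairing", "fixed"))  # ('active', 'log-and-resume')
```

Because the table is just data, each element can carry its own copy and process its own events, which is what makes the model distributable in a way a single NOC-style thread is not.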

We’re rushing to a decision on whether to host the future of networking, stuff like 5G, in the cloud, without knowing whether anyone really understands the problem or would recognize the solution. Among my telco contacts, most still think NFV or ONAP, with their monolithic “anthropic” approach, are perfectly suitable. Among my cloud contacts, most think that an event is an event, whether it’s a Tweet or a network fault.

You can hire people who understand the telco market. You can hire managers who can run telco projects, but this works better for selling something than for implementing it. Microsoft so far seems to have the best approach: buy companies that do telco stuff and run them in your cloud. Even that, though, depends on being able to spot good implementations.

We may, as all the coverage suggests, have a land rush starting here, a rush for cloud providers to gain critical traction and positioning with telcos as the biggest single application of cloud computing—hosting service features—evolves. Do those who are running know where they’re heading? That’s a whole different question.