Looking at Event-Driven Myths

I’m always eager to look at anything on event-driven architectures, so the piece in New Stack caught my attention. It’s focused on “three myths” regarding EDA, and debunking them, but there are points along the way that relate to the value of event-driven models for telecom software. That should be a critical topic right now in the 5G and even edge computing space, and it’s not getting the attention it deserves. I do have to point out that I think the article promotes its own mythology, so we’ll have to navigate through that.

My biggest problems with the article are 1) it doesn’t really explain the rationale behind event-driven applications, and 2) it focuses on a specific subset of such applications, a subset where the biggest question is getting a process associated with an event in a serverless-cloud model. That combination makes it less insightful than it could be, and it omits many of the things that make event-driven models ideal for network/service lifecycle management.

Event-driven systems recognize that real-world conditions can be described by relating new things that are happening to the state of a system, as set by what happened before. We all use event-driven thinking to manage our lives; in fact, most of what we do, even conversations, are event-driven. That’s because life is contextual, meaning that things that happen are interpreted in context, based on what’s happened before.

An explicit requirement of event-driven systems is the concept of state, which is the condition of the system as determined by those past happenings. My first exposure to event-driven systems was writing protocol handlers, and for these, the “events” were messages from the other side of the connection, and the “state” was reflective of the progress of getting the connection working, exchanging information, or recovering from a problem. The way I did this, and the way almost all protocol handlers are written, is to establish a state/event table that defines, for each state, the processing to be done for a given event (in this case, message).
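
To make that concrete, here’s a minimal sketch of what a state/event table looks like in code. The states, events, and handler functions are illustrative assumptions, not any particular protocol; the point is that the table, not the handlers, decides what runs when a message arrives in a given state.

```python
# A minimal state/event table sketch; states, events, and handlers are
# illustrative, not taken from any specific protocol.

def start_link(ctx):       # hypothetical handler: begin link setup
    ctx["state"] = "SETUP"
    return ctx

def resync_link(ctx):      # hypothetical handler: partner lost sync, restart setup
    ctx["state"] = "SETUP"
    return ctx

def accept_data(ctx):      # hypothetical handler: normal data transfer
    return ctx

STATE_EVENT_TABLE = {
    # (current state, event) -> process to run
    ("IDLE",     "LINK-INIT"): start_link,
    ("TRANSFER", "LINK-INIT"): resync_link,   # same event, different context
    ("TRANSFER", "DATA"):      accept_data,
}

def dispatch(ctx, event):
    """Look up the process for this state/event pair and run it."""
    return STATE_EVENT_TABLE[(ctx["state"], event)](ctx)
```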

Suppose for a moment that we didn’t want to do this, that we wanted something other than state/event processing. The fact that we’re not organizing processing around a state/event table doesn’t mean that the thing we’re doing isn’t state/event-oriented. In our protocol handler example, we might receive a “LINK-INIT” message to set up a connection. If we get such a message, can’t we just set the link up? No, because it might already be set up and in the data transfer state, in which case the message indicates our partner in communication has lost sync with us. So we’d have to check link state by testing the variable that determines it.

What’s wrong with that, you might ask? Well, that link state variable that we’re testing means that our processing of the LINK-INIT event is now stateful. If we invoke a process defined in a state/event table, and we pass it our link variables with the event, the process is stateless, which means we can spin it up when we need it, spin up many copies, and so forth. This is how cloud-native software is supposed to work.
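
As a sketch of that stateless pattern (my example, not the article’s), imagine that the dispatcher attaches the link variables to the event before invoking whichever process the table selects. The handler then stores nothing locally, so any number of copies can be spun up and any copy can process any event.

```python
# A sketch of a stateless LINK-INIT process: the state rides along with
# the event, so any instance of this function can handle it.

def process_link_init(event):
    link = event["link_vars"]            # state arrives with the work
    if link["state"] == "TRANSFER":
        action = "resync"                # partner lost sync; restart the handshake
    else:
        action = "setup"                 # normal first-time link setup
    link["state"] = "SETUP"
    return link, action

# Illustrative event, as a dispatcher might construct it.
event = {"type": "LINK-INIT", "link_vars": {"state": "TRANSFER", "peer": "B"}}
print(process_link_init(event))
```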

State/event programming isn’t something everyone does, and it’s also not something everyone understands. That leads to resistance, which leads to objections. Myth One in the article is that event-driven systems are difficult to manage. It’s true that this view is widely held, but the article doesn’t address the issue in general; it again focuses on specific cloud/serverless applications. It says that event-brokers that link events to the associated process aren’t necessarily complicated. True, but we need to generalize the story for network and service lifecycle enthusiasts to get anything from it.

In a protocol handler, the event-broker equivalent is the processing of the event against the state/event table entries. People think this is difficult to manage because they think that it’s hard to address all the possible state/event combinations and define processes for them. Well, tough it out, people. If a system is naturally contextual, there is no alternative to looking at the possible contexts and how events relate to them. State/event tables make it pretty easy to see what might happen and how to handle it, because every possible state/event combination has to be assigned to a process. In a system that simply checks variables to determine state, there’s a risk that the software doesn’t anticipate all the possible variable states, and thus will fail in operation.
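
One nice side effect is that completeness can be checked mechanically. A rough sketch, again with made-up state and event names: enumerate every state/event pair and flag any pair the table doesn’t cover, so nothing is left to an unanticipated variable combination.

```python
# Completeness check for a state/event table; names are illustrative.

STATES = ["IDLE", "SETUP", "TRANSFER", "RECOVERY"]
EVENTS = ["LINK-INIT", "DATA", "DISCONNECT", "TIMEOUT"]

TABLE = {
    ("IDLE", "LINK-INIT"): "start_setup",
    ("TRANSFER", "DATA"):  "accept_data",
    # ...remaining entries would be filled in here...
}

missing = [(s, e) for s in STATES for e in EVENTS if (s, e) not in TABLE]
for pair in missing:
    print("No process assigned for", pair)   # each gap is a bug waiting to happen
```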

Myth Two in the piece should probably have been Myth One because everything comes down to this particular point, which is that event-driven systems are difficult to understand because they’re asynchronous and loosely coupled. Nice, neat, linear transaction-like processing is so much easier. Yes it is, for people who never did event-driven development, but as I’ve already pointed out, if the system is generating events that are interpreted based on state, making it transactional/synchronous and tightly coupled will make it stateful and unable to fully exploit the cloud. Do you want cloud benefits or not? If you do, then you’ve got to suck it up here too.

The thing that the piece doesn’t note, but which is critical in networking, is that the event-driven, state/event-based, approach may be the only practical way to handle complex services. Suppose that a service had five functional sub-services, which is completely realistic given geographic and technology spread. These five elements have to cooperate to make the service work, and each of them is an independent subsystem under the service, so the service state is determined by the state of the five. In a state/event model, the service is a hierarchy, where a superior element is responsible for selecting and commissioning subordinates, and is also responsible for remediation should a subordinate fail and be unable to self-correct. It’s easy to represent this kind of structure with state/event tables in a service data model (the TMF did this elegantly with NGOSS Contract well over a decade ago). Think of how complicated it would be to test variables in processing an event when you had to consider not only the state of yourself, but the state of superior and subordinate elements!
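
Here’s a rough sketch of that hierarchical idea, loosely in the spirit of a data-model-driven approach like NGOSS Contract. Element names, states, and events are assumptions for illustration; the point is that each element keeps its own state, and a superior’s state/event handling includes remediation when a subordinate reports a failure it can’t correct.

```python
# Illustrative hierarchical service model; names and events are assumed.

SERVICE = {
    "name": "service",
    "state": "ACTIVE",
    "children": [
        {"name": "access-east",  "state": "ACTIVE"},
        {"name": "core-transit", "state": "ACTIVE"},
        {"name": "access-west",  "state": "ACTIVE"},
    ],
}

def handle_child_event(parent, child_name, event):
    """The superior's state/event logic for events reported by subordinates."""
    child = next(c for c in parent["children"] if c["name"] == child_name)
    if event == "FAULT-UNRECOVERABLE":
        child["state"] = "FAILED"
        parent["state"] = "REMEDIATING"   # superior re-selects/recommissions the child
    elif event == "RESTORED":
        child["state"] = "ACTIVE"
        if all(c["state"] == "ACTIVE" for c in parent["children"]):
            parent["state"] = "ACTIVE"
    return parent

handle_child_event(SERVICE, "access-west", "FAULT-UNRECOVERABLE")
```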

That leads to the last of our myths, which is that event-driven software is difficult to test and debug. As someone who’s written quite a bit of state/event-table software, I dispute that totally. It’s far easier to debug a hierarchy of state/event systems defined in a data model than to debug “spaghetti code” transactional logic to accomplish the same thing. With a table and data model, everything you need is in one place. In transactional code, it’s spread all over the place.

The worst enterprise network failure I ever saw as a consultant, a major disaster in healthcare, puzzled vendor and user experts alike. I resolved it because I understood state/event processing and that the system that had failed was a state/event system. Others who didn’t happen to have experience in that area couldn’t find the problem, and there’s no way of knowing how bad things might have gotten had nobody who was called in happened to have written protocol handlers.

What is true is that for state/event handling to work, it has to be done right, and this is where the lack of experience in the approach can hurt. Let’s go back to human conversation. Suppose you’re told to do something. The first question is whether you’re already doing something else, or are unable to do it because of a lack of skill or resources. The second question is the steps involved in doing it, and if the “something” involves linking multiple separate tasks, then each of them has to be organized and strung in order. We do this all the time, and so all we’re doing in state/event systems is coding what we’ve lived. Yes, it’s a shift of thinking, but it’s not transformational. I’ve run development teams and taught state/event development to the members, and I never had a programmer who couldn’t understand it.

I did have a lot who didn’t understand it at first, though, and I had a few who, when not supervised into the state/event approach, went back to old-style transactional code. Like most everything in software, event-driven systems have to be built around an architecture, an architecture that’s supported with the right tools and implemented to take full advantage of the benefits of the selected hosting environment, like the cloud.

For those who want to understand how a data-model-centric, state/event system would work through the entire process of service definition through use, I refer you to the tutorials on my ExperiaSphere work. Please note that all the ideas contained there are open, released by me for use without restrictions, but that the material itself and some of the terminology (like “ExperiaSphere”) are trademarks of CIMI Corporation, and you can use them only with permission. We won’t charge for that, but we will review your use to ensure that you are using the terms/materials in conformance to our rules, which are laid out at the start of each presentation.

One thing the article demonstrates is that the software industry may have too narrow a perspective on event-driven systems. It’s a bit of the classic “groping the elephant behind the screen” story; everyone is seeing the concept in relation to their own specific (and narrow) application. Event-driven systems are a fundamental change in software architecture, not because they’re new (they’ve been around for at least 40 years) but because the model has been confined to specialized applications. The cloud is now popularizing the approach, and a holistic appreciation of it, its benefits, and its risks is essential for proper use of the cloud.

What is IoT and How Big is the Impact?

How big is IoT? Obviously, it’s one of the most hyped of all technologies, with heady estimates of spending rampant and all manner of vague business propositions to back those estimates up. It turns out that it’s very difficult to estimate just what IoT’s impact would be, for a variety of reasons. I’ve undertaken a massive amount of modeling just to get some information on the US market, and also talked with a large number of enterprises who would theoretically be the ones spending that money. The results have been interesting, and frustrating.

If we consider IoT to be anything where a device communicates without a human pushing real or virtual keys, the global market in 2020 appears to have been about $800 billion. My model suggests that that total market will grow only about 8% in 2021, making it roughly $870 billion, and even in 2022, my model says that the number will barely top $900 billion. My model also says that most of the current IoT spending isn’t what would have been considered IoT at all under the logical definition that IoT means that you have a “thing” that is on the “Internet”. It’s largely in-home WiFi-based technology.

But IoT is, well, a highly diffused space. On the one hand, people who want to propose giant numbers and rates of adoption will include anything that communicates in the space. Consumer IoT, the household gadgets that connect to phones or to “assistant” applications, obviously represents a potentially enormous market, one that could easily hit a billion devices. However, these devices are assigned private subnet addresses in nearly every case, and their impact on Internet traffic and services is actually very limited. One consumer gadget technologist confided that their application, which involved some video exchange, used far less bandwidth in a month than an hour of streaming video used. A thermostat application would be a pimple on a single streaming show.

Consumer security applications are a bit of an IoT gray area. The majority of home security systems will use local wiring or short-range RF protocols like Zigbee, and they involve the “Internet” and literal IoT only insofar as they may provide for smartphone control, which almost always uses IP and often involves the Internet. In a sense, these applications are similar to smart thermostats in that their sensing and operation are separate, with the Internet involved in the latter but rarely in the former.

Smart buildings tend to differ from consumer IoT more in the way that information is integrated and used than in how it’s gathered. In fact, many smart buildings use residential-like technology up to a local controller, where the information is aggregated and where local actions are then supported. The Internet may be used to connect these local controllers back to an actual IT system, which might include providing for smartphone control and notifications.

Industrial IoT, which is the largest segment of the IoT space outside the consumer, is more like what we’d think of as “classical” IoT. Even IIoT, to use the popular term, is similar in many ways to smart-building systems. There are sensors, usually connected through wiring or a specialty RF protocol, and messages are passed from them to a local controller that will often generate local responses to control elements, keeping latency low and eliminating the risk of a communications failure creating havoc. From that local controller, data is often then passed on to a corporate information system where several applications may be involved.
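
A rough sketch of that local-controller pattern, with purely hypothetical sensor, actuator, and uplink functions: the control response happens locally so latency stays low and a communications failure can’t create havoc, while summaries are forwarded to the corporate systems.

```python
# Hypothetical IIoT local-controller cycle; all functions are stand-ins.

def read_sensor():                 # poll a wired or specialty-RF sensor
    return {"temp_c": 81.0}

def actuate_cooling(on):           # local control action, no WAN involved
    print("cooling", "on" if on else "off")

def forward_to_it(sample):         # batch or stream to corporate applications
    print("uplink", sample)

def controller_cycle():
    sample = read_sensor()
    actuate_cooling(sample["temp_c"] > 80.0)   # local response keeps latency low
    forward_to_it(sample)                      # analytics happen upstream

controller_cycle()
```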

The original notion of IoT, which was a rich set of public sensors that, like the Internet, could be exploited by third-party OTT-like players, has never emerged for obvious profit, security, and privacy reasons. What might replace that, at least for some missions, are the set of IoT services that would be related to a totally different set of new services, ones I’ve called contextual services, but that I’ll generalize to “visionary IoT” here. Some of this visionary IoT could be deployed by enterprises, and some might be deployed by network operators, cloud providers, or even governments at some level, in order to create a broad set of new applications for both businesses and consumers.

While IoT advocates who want to propose large markets tend to lump all these categories together, those who want to propose significant impacts, significant changes to how we live and work, will either cherry-pick segments of the total market and generalize them, or simply make vague statements about impact without linking them to anything in particular. This makes it very difficult to size the IoT market or predict IoT spending, and I’ve worked since early in 2021 to try to come up with something useful. Here’s what I’ve decided.

The potential for significant impact from IoT aligns primarily with what I’ve called “contextual services”, which means that the largest value of IoT lies in its ability to reflect conditions and actions between the real world and a digital twin or virtual-reality parallel. The problem is that visionary IoT services involve a lot more than just IoT, and there is little or no awareness of the broad concept or of what’s inside it. That means that it’s extremely difficult to model a market scenario with any credibility, unless you make assumptions.

The first critical assumption is that our visionary version of IoT will not create a universe of public sensors, but will instead be based on services created from sensors deployed by companies who plan to profit from their sales. The most likely model for these services is the one used by public cloud providers to offer cloud-resident “web services” that developers build on.

The second critical assumption is that visionary IoT will demand support from other applications that can convey “mission context”. IoT can tell us a lot, but it’s not enough to uncover why we’re doing what we’re doing, or even what our specific purpose might be. Traveling along (walking or riding) is something that IoT could likely detect, and it could perhaps extrapolate our destination based on path and past behavior. If we could tie in the fact that we had a reservation at a restaurant in the direction we’re heading, and that we had just had a conversation with someone we often dined with, IoT could offer a much better sense of what we were trying to do, and offer us more useful guidance.
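
A toy sketch of how mission context might work, with inputs I’m inventing for illustration: motion data alone gives a direction, and it’s the reservation and recent-conversation signals that turn that into a useful guess about intent.

```python
# Toy "mission context" inference; all inputs are invented for illustration.

def infer_intent(places_ahead, reservations, recent_contacts):
    for r in reservations:
        if r["place"] in places_ahead:
            who = ", ".join(recent_contacts) or "someone"
            return f"Probably heading to {r['place']} ({r['time']}) with {who}"
    return "Destination unclear from motion data alone"

print(infer_intent(
    places_ahead=["Bistro 45", "Main St Garage"],        # extrapolated from path
    reservations=[{"place": "Bistro 45", "time": "7:30pm"}],
    recent_contacts=["a frequent dinner companion"],
))
```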

The final critical assumption is that visionary IoT will pass through a traditional tech investment adoption curve, one that rises to a peak then falls back to a maintenance level. This is how past IT investment cycles have progressed, and there’s no reason to think that IoT would be any different. With this assumption, it’s possible to judge the market impact by knowing roughly when the transformation will start and how high the peak will go.

[Figure: visionary IoT spending model. Source: CIMI Corporation, figures in $billions, US]

The figure above shows the result of my modeling. According to my model, the visionary form of IoT we’ve been expecting (or at least hoping for) emerges in three phases. First (through this year), there’s early deployment of IoT equipment (sensors and local controllers) that prepares for a more visionary use of the technology. In 2022 and 2023, this results in a deployment of computing services and equipment, and then of software, that exploit those deployments. This starts to reduce the portion of visionary IoT spending that relates purely to IoT elements, and it reflects the period when the architectural model for visionary IoT emerges, the model that will drive the cycle’s growth. I think it’s likely that cloud providers will create this model and it will tie local IoT processing into the public cloud.

The year 2025 represents the real start of the IoT cycle, with a significant increase in total visionary IoT investment as the architectural model takes hold. The cycle peaks in 2027, and then slowly declines through 2030, which is the limit of the model to project. IoT element spending is stable as a percentage of total spending from 2028, and software gains slightly while hosting declines slightly through the end of the model. This is because the market, at this point, is driven by expanded use of contextual applications, largely a matter of software.

The model also shows other shifts, in the “who” and the “how”. As we advance to 2024, we see first a period where visionary IoT is more focused on per-company strategies, and then (in 2024) a shift to a more ecosystemic model. This shift drives a significant increase in IoT software spending, associated with creating a broader model of how IoT data is used and shared (I’ve called this the “information field” approach in prior blogs). From just past the midpoint of 2024, this broader IoT model is what’s driving the market.

Another shift related to this one is the shift from premises-focused IoT hosting to cloud-and-edge-focused. This really starts in 2023 according to my model, and is based on cloud-provider improvements to their “private edge” strategy. By 2026, though, we’re seeing the majority of hosting growth coming from adopting true edge computing as a service, and that strategy dominates by the peak in 2027.

A couple of closing points are appropriate here. First, while I’m pleased with how my model has worked, no model is perfect, and in this case in particular there are so many variables and constraints involved that I can’t say I’m happy with the outcome. I’ve had to create constraining assumptions to prevent the model from “oscillating”, meaning showing wild and unrealistic swings, and that means that the constraints I’ve cited here have influenced the outcome. Until we have a clear market direction that will set its own constraints, this is unavoidable if I want to model at all. Second, the model suggests that the range of possible outcomes varies from about 12% of the numbers in the figure to almost five times those numbers, even within the constraints I’ve applied. That reflects the fact that even with constraints, there are questions about who and what drives the space, and answering these in a different way could have a major impact on the results. Visionary IoT could indeed be a revolution, or it could be a total dud.

What I do hope the exercise here could do is identify the things we have to validate about visionary IoT (my constraints) and identify critical parts of the evolution to success (the phase points I call out above) that need to be somehow supported with market initiatives. It’s likely to be how well, and how quickly, we can accomplish these things that will determine how impactful visionary IoT will be.

Finally, remember that I’m not trying to forecast everything that advocates want to call “IoT”; there are no boundaries to that. This is “visionary IoT”, the kind that involves the creation of some sort of IoT ecosystem that can be broadly exploited without risks to privacy, security, or public policy. I will revisit these numbers from time to time if conditions change.

Architecture Issues in 5G Hosting and Edge Computing

Separating the control and user/data planes is a fixture of 5G, but it opens some questions. One is whether the implementation of the two is also separate, whether software and hardware are disaggregated or co-dependent, and another is what specific hardware platform(s) might be considered for each. Perhaps the most important is just what a virtual 5G deployment should look like. There are no firm answers to these questions, but there are plenty of important points we can review.

Let’s start by contrasting the 5G virtualization approach to the NFV approach. NFV necessarily started with the appliances, the physical network functions (PNFs) that were the presumptive source of the logic to be virtualized and deployed. That was the mandate of the NFV ISG from the first. There was no current technology set for 5G to start with, but the presumption of 5G was that NFV would be used to deploy the virtual elements. NFV may therefore have influenced how the 3GPP framed 5G features.

There is no mandate for control/data-plane separation in NFV, but despite the fact that the control and user planes are separated in 5G, the 5G user plane contains both the IP control and data planes. The 5G user plane elements (RAN, UPF, XUF) are really bridging points between 5G and the presumptive IP network. 5G control plane elements are therefore really “service-plane” functions, at a higher level than anything we’d have traditionally considered the control plane.

For the service/control plane pieces, the 5G and O-RAN diagrams depict a series of functional elements that could be viewed as monolithic VNFs. These VNFs have interfaces that tend to look a bit more transactional than event-driven, and so a literal interpretation of the diagrams could lead to an implementation of 5G/O-RAN that wouldn’t be componentized below the level of the blocks in the standards diagrams. The structure has clearly influenced how people think about functions in 5G/O-RAN.

A LinkedIn post on O-RAN cites a Heavy Reading survey on “four types of Open RAN deployment options”, and the types are differentiated by what standard-box elements are put where. There is no discussion of how the functionality of a standard-box itself could be implemented as a set of microservices. Interestingly, the responses to the poll on which was the preferred approach showed nearly equal support for each, which to me suggests that people actually didn’t understand the issue and simply picked an option at random. Whatever the reason behind the poll, though, it’s the boxes and not microservices that are being distributed, and that could create a model of deployment that didn’t fully realize the benefits of the cloud.

What would the 5G cloud actually be, though? I think 5G features should be considered to be a set of distributed microservices from which “applications” that represent each of the functional boxes in the specs are assembled. The microservices themselves would be hosted on something appropriate for the performance, scalability, and availability goals of each microservice. Thus, there would be no need to look at types of deployment options; stuff deploys to where it has to.

5G functionality is really likely to be concentrated in metro and access networks. As we move inward from cell sites, we would first encounter places where hosting could deliver low latency but where pools of resources were unlikely to be cost-effective. These locations, from the tower inward to the backhaul, would likely be served by some sort of appliance or white box. Within the actual metro aggregation network, resources to host features in a true cloud resource pool could likely be made available. Thus, the stuff that 5G specs or O-RAN specs call the “control plane” could be hosted there.
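
In other words, placement would follow the latency budget of each function rather than a fixed deployment option. The sketch below is purely illustrative; the tier names and millisecond thresholds are my assumptions, not anything taken from the 5G or O-RAN specs.

```python
# Illustrative placement-by-latency logic; thresholds and tiers are assumed.

def place(latency_budget_ms):
    if latency_budget_ms < 1.0:
        return "white box at or near the cell site"
    if latency_budget_ms < 10.0:
        return "appliance/small cluster along the backhaul"
    return "metro cloud resource pool"

for fn, budget_ms in [("tight real-time RAN functions", 0.5),
                      ("near-real-time functions", 8.0),
                      ("control/service-plane functions", 50.0)]:
    print(fn, "->", place(budget_ms))
```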

There is no question that the 5G control plane (or service plane) can be run on fairly traditional hardware, either on dedicated devices or in the cloud as IaaS or containers. There is also ample evidence that the IP control plane can be run that way too. The question with regard to hardware arises with the user/data plane.

We know from the old days of NFV that even a decade ago it was possible to support data-plane handling with an x86 processor and the proper network adapter, at least for moderate data rates. We know from white-box design today that you can stuff a switching chip into a white box and build a competitive router (DriveNets has done it all the way to the core router level, through cluster routing). The only issue for metro 5G hardware in the data plane is the issue of resource pool efficiency.

As I pointed out earlier in this blog, resource pools exist where there’s enough activity to justify them, which means that they’re unlikely to be helpful close to the tower but could be helpful in the metro, particularly if you believe that 5G hosting is an on-ramp to justifying edge infrastructure that would serve other missions. The resource pool is most efficient where resources are homogeneous; a pool of disjoint resources is really a disjoint set of pools, and less efficient as the size of the individual “sub-pools” declines. So specialization of hardware in the pool could reduce efficiency, and also make edge missions that didn’t require special switching chips less profitable because those chips add to cost.

Separating the control/service planes from the data plane is a help in this sense, because it means that the more traditional control/service-handling missions are decoupled from the specialized data switching missions by design. It’s also true that since higher-volume trunks pass through data-plane devices, the device relationships to the trunks tend to mean that you can’t simply scale or replace a data-plane element by redeployment/orchestration. Something physical has to be done. Thus, you could imagine that the data plane would be supported by white-box configurations controlled by a cloud-hosted, pooled-resource control/service plane.

Perhaps the biggest question is how all this would look within a metro area. A metro area is almost certainly made up of multiple aggregation points that could be connected in a hierarchy, a mesh, or any number of ways. If we viewed each of these aggregation points as a hosting point for 5G (RAN and Core), then we might see each as a kind of single virtual element. O-RAN divides 5G RAN hosting according to latency demand—we have RAN Intelligent Controllers (RICs) for near-real-time and non-real-time features. That implies that the connectivity performance of the former differs from that of the latter, which suggests that perhaps each aggregation point has a “near-real-time” internal set of paths and a non-real-time external one.

It looks to me like the efficiency of an O-RAN deployment for 5G would be higher the more elements were supported within it, which means that connection performance could determine O-RAN implementation efficiency. That would mean that a fabric-like metro connectivity framework with high bandwidth and low latency could be highly valuable. If that’s the case, then there’s a mission for specialized network configurations and perhaps specialized network products in 5G metro deployments.

This same metro-cluster concept might be critical for edge computing, particularly missions like gaming. Remember that the location of “the edge” for any set of applications and users depends on the distribution of the users. If an entire metro looks like a single hosting point with full resource equivalence, then the number of potential users within that metro is its total population, and the number of credible applications that could be hosted entirely within the metro deployment would be larger.

There’s obviously some thinking required on the nature of 5G hosting, including not only the hardware but the topology and performance of the resource interconnection. There’s even more thinking required when we toss in the notion that 5G leads to a general edge capability that has to support a set of missions we can’t yet fully define or even characterize. Horizontal traffic within “the edge” is a critical element in setting the limits of latency there, and how we support that traffic may decide whether the edge has enough suitable missions to make a business case.

Microsoft’s Private MEC Could be a Game Changer

Microsoft is arguably the leader in enterprise cloud computing, given that a big chunk of Amazon’s cloud business comes from startups and social-network players. Now, Microsoft wants to be the leader in carrier cloud too, or so it seems. There’s no question that Microsoft has the credentials for carrier cloud, technology-wise, and they also seem to understand some of the thorny market issues. The questions are, first, whether that’s enough to give Microsoft a real edge, and second, whether other cloud players will work harder to counter the move.

Microsoft’s initiatives have coalesced in its private multi-access edge computing (MEC) offering, and the nomenclature is interesting in itself. Rather than pushing a specific 5G or carrier cloud strategy, they’re pushing edge computing software tools as a general offering, and then enriching that offering with tools that support, for example, the 5G hosting mission. There’s also a strong dose of “exploit on-premises equipment”, supplementing the cloud, a promise of white-box support, and a touch of carrier cloud in their material. All of this reflects their understanding of the basic truth that 5G is the likely driver of edge computing deployments, but it’s not the exclusive mission.

Edge computing is pretty much what the name suggests, computing that’s close to the point of activity, rather than tucked away in a distant data center. The business challenge of edge computing is the fact that to be “close” to the edge, there’d have to be a lot of hosting points, which means a lot of cost. A cloud provider like Microsoft would have to think twice about making a massive edge investment in the hope that something would come along to justify it. The technical challenge is that if edge facilities aren’t part of the public cloud, then what tools are available to build edge applications, and how would they relate to traditional public cloud application resources?

The reason there’s so much attention paid to 5G and edge computing in combination is that 5G hosting represents a large and logical application for the edge. It could be the most significant driver of edge hosting, and whatever tools and techniques are selected to support it would be available for other applications that benefit from low latency to leverage. Thus, if a public cloud provider can support edge computing and 5G with a nice tool set, allow users to deploy their own edge hosting based on that tool set, and expand their own edge hosting resources as opportunities allow, they’d be in a great position. That’s Microsoft’s goal in a nutshell.

One thing that makes Microsoft’s private MEC a technical winner is the company’s previous acquisition of Affirmed and Metaswitch, and (in my view), the latter in particular. Metaswitch has been a player in cloud-hosted cellular infrastructure for over a decade, so they’ve had the most experience of any vendor I know in virtualizing and optimizing 4G and 5G functionality for deployment in the cloud. Microsoft is offering both the Affirmed and Metaswitch software as core mobile solutions packages in their private MEC framework.

Because the software is also available separately, and because the whole Microsoft offering can be deployed onto premises equipment (which also means “onto operator servers”), Microsoft isn’t presenting mobile operators with a choice of adopting their stuff and getting locked into public cloud, or rejecting it and having to deploy their own infrastructure or using IaaS or container hosting without special latency support. While there are operators who are looking at hosting 5G functions in the public cloud, there’s a lot of 5G that’s going to be hosted on white boxes, and operators may deploy their own data centers in at least major metro areas. Microsoft’s approach supports that.

Mobile operators, particularly the larger ones, are always antsy about relying on a third party for hosting services, but they get even more skittish when they believe their stuff is swirling around in a vast public cloud pool along with a bunch of other customers’ applications. Operators have always believed their requirements for function hosting are distinct, which is one reason they’re focused on things like NFV rather than on the equivalent cloud management and orchestration tools. Microsoft even reflects that bias, offering the Azure Network Function Manager to deploy features (including mobile and SD-WAN) to compatible edge elements. The name of the offering should appeal to the NFV folks.

Interestingly, Microsoft doesn’t highlight NFV explicitly in their material, which you’ll see if you follow the reference link earlier in this blog. It seems that what Microsoft is intending is to create a network function store, where anyone can obtain edge features for hosting in Azure’s Stack Edge Pro elements, and I think likely also in a private MEC or even the Azure cloud. That would mean Microsoft intends to bridge in two dimensions—supporting NFV or cloud-native, and supporting CPE (SD-WAN, SASE, uCPE, vCPE) or edge/cloud.

If that’s true, it raises the question of whether Microsoft might be the first player to offer a pathway to integrate white-box technology into edge hosting. The Microsoft material says “Specialized network functions, such as mobile packet core, can now be deployed on the Azure Stack Edge device with faster data path performance on the access point network and user plane network.” This can surely be interpreted as offering separation of control and data/user planes, with each assigned to specialized technology to enhance performance. Might it then address the TIP DDBR principles?

Another technical point raised by Microsoft’s private MEC is whether 5G’s virtualization is going to be enhanced by cloud-think. A poll on LinkedIn recently asked how the elements of Open RAN would be distributed, giving four specific options that were actually ways that virtual boxes might be organized. If Microsoft has its way, the elements of Open RAN might be distributed to anywhere that’s convenient, even to different places in different networks, cities, or even towers. Virtual elements are hosted in distributable abstractions, after all.

The net of this is that Microsoft has definitely taken steps to make 5G hosting and even NFV hosting into a cloud-centric process. This is Azure more than NFV; NFV is just an application. The store concept also promises to unload some of the challenges of onboarding, because store applications would be Microsoft-certified against Azure principles, including whatever specialized principles might apply where MEC or Stack Edge hosting is supported.

Light Reading is suggesting that “A big reason such hyperscale companies are investing in areas like edge computing and private wireless networking is because enterprises across a wide variety of sectors are increasingly looking to build their own private wireless networks.” I think that’s a bit of hopeful editorial/reportorial thinking. Enterprises are being pushed by vendors to consider private wireless, but I don’t think that the business case for that has really improved that much from years past. I think the real reason is that the public cloud providers know that operator deployment of 5G could be the first large-scale driver of edge computing, and they want edge to be a subset of cloud, not a competitor.

One obvious question is how the private MEC strategy would address the Rakuten criticism of telecom software. Metaswitch, at least, is a cloud-native implementation of 5G features, but it’s not a complete telecom software suite, since it lacks the OSS/BSS piece. However, if private MEC tools support cloud-native application models, then there’s at least a chance that the Microsoft offering, including the store, would promote cloud-native solutions to a broader set of telecom applications, including OSS/BSS. At the least, it might solidify a model for how to build telco software in cloud-native form, a model someone might pick up. Not necessarily, nor quickly, but maybe.

Every Microsoft cloud competitor is also seeing the 5G and MEC opportunity, but right now Microsoft may have not only a technical edge, but also a slight credibility edge. Operators I’ve talked with say that they would rank Microsoft on top by a hair, then Google, then Amazon, and then IBM. That includes both their view of the cloud provider’s 5G hosting and edge capability and their view of the provider as a trusted partner. If you look at credibility alone, Microsoft gains over the others, and if you look at trust alone, IBM gains over Amazon. Given all of that, it looks like Google is the competitor most likely to mess up Microsoft’s plan.

The benefit Google has, according to operators, is a combination of trust and an advantage in knowledge of cloud-native technology and network technology. However, my information predates the latest Microsoft MEC announcements. Microsoft is getting better, fast, and that will put a lot of pressure on a lot of competitive players.

Is Rakuten’s Indictment of Telecom Software on Target?

Here’s a bold statement for you: “There is no magic that an Amazon Web Services, Google, and Microsoft could enable because the underlying software architecture is absolutely flawed. It needs to evolve.” This, from the CTO of Rakuten Mobile, as quoted in an SDxCentral piece. I agree, of course, and in fact I’ve been trying for months to get a reading on what senior planners and architects in the telecom space think about “cloud-native”.

Let’s start with the “why” question. Why is underlying telecom software absolutely flawed? The operator planners/architects offered three reasons. First, operations software (the OSS/BSS) is core business software that in telecom, as in other verticals, tends to change very slowly. One good reason is that vendors don’t want to throw the space open to new competitors with massive changes. Second, telecom standards still tend to drive the formulation of technology concepts, and standards bodies are neither anxious to start over in design, nor particularly equipped with the skills needed to do it. Third, the network operators are unable to push for change because they don’t understand the new model.

One operator gave me an interesting example that crosses into all of these points. The example was 5G, and the operator noted that the 5G architectural model has a distinct device orientation. There are functional elements that are represented by boxes in the diagrams, and those boxes connect with each other through lines that represent APIs or interfaces. In the models, operations systems are either totally ignored or represented as what the planner said were “boxes in the sky”, high-level elements that were represented only as general capabilities.

I’ve noted in earlier blogs that when you draw a functional diagram that’s based on boxes connected by interfaces, you tend to guide implementations along those lines. As my planner contact noted, it’s difficult to draw a representation of a microservice-based implementation of 5G that conveys any sense of functional behavior. You end up with one big box that says “5G” or “O-RAN”, and that’s decomposed inside into microservices. Not only that, to make that structure “open”, you have to define a lot more APIs and message flow relationships. The old functional-box model is appealing because it can be grasped easily. It’s just not easy to turn it into a true cloud-native structure.

Rakuten’s point is that, for a variety of unspecified reasons, we’ve not turned much of anything in telecom software into a cloud-native structure, and because of that we don’t have what he calls “elasticity”, which means scalability, agility, and resilience. The properties of cloud computing don’t push through bad software design to somehow empower the end result; you have to design the software to exploit the cloud. But who does that?

Every incumbent wants evolution rather than revolution, for obvious reasons, like the fact that it preserves your win. That’s particularly true in the OSS/BSS space, because incumbency there is almost like an annuity; win and you reap the benefits for your career lifetime. A less-obvious reason is that box-in-the-sky point. Network management and operations has to link to service and business management and operations, and those linkages tend to be preserved through migrations and technology changes, because it limits the scope of things you need to change when something new comes along. That creates the problem of the “sea anchor effect”.

A sea anchor is something you toss over the stern to create a consistent backward pull to counter weather conditions. When you have a monolithic legacy OSS/BSS framework that presents interfaces to network operations that you need/want to preserve, those interfaces are a sea anchor. Not only do they slow you down, they also impose a direction on you. In particular, if those interfaces are the classic transactional, polling-for-status, sort of thing, it makes it difficult to couple to event-driven systems in network operations. Thus, legacy architectures for OSS/BSS, and legacy OSS/BSS thinking, tends to act as a brake on cloud-native designs for lifecycle management.

Operations systems are database-intensive, and the fact that Rakuten has invested in Robin.io, a specialist in cloud databases, likely indicates that they’re thinking a lot about a cloud-native implementation even at the OSS/BSS level. Not only would this make operations systems more resilient and agile, it would make it easier to couple them to event-driven, cloud-native, network orchestration and management.

Just making OSS/BSS cloud-native wouldn’t necessarily cure the problem of the old-style interfaces. Operations interfaces contributed to the problem that the NFV ISG had; the ISG ruled operations systems out of scope, so it had to support those interfaces as they were. The ISG also drew up the traditional functional-box diagram, which was then taken as a literal structure, creating a monolithic box network rather than a cloud-native one. What they should have done was to adopt cloud technology from the first.

Software inertia bites even here, though. Most “network functions” that are virtualized are derived from appliances, which NFV calls “physical network functions” (PNFs). Appliance vendors immediately showed their own versions of OSS/BSS legacy-think. First, they wanted minimal adaptation of their device software to virtualize it, and second, they wanted most of the money they’d have gotten for an appliance. The latter goal tended to price the popular versions of PNFs like firewall right out of the market, and also increase the “onboarding” cost and difficulty. The former goal meant that the virtual functions were derived from monolithic, non-cloud-native, versions.

So does this mean that all of telecom is doomed to “monolithism” forever? No, but it does probably mean that telecom is going to have to start thinking not just about “cloud-native” but about the way that legacy core business software (OSS/BSS in this case) relates to new software. Enterprises today are increasing their use of the cloud, but not by making legacy applications cloud-native. Instead, they’re front-ending legacy applications with new cloud technology.

5G and O-RAN represent an opportunity for telecom, a chance to look at software in a modern way and to apply the cloud front-end strategy used by enterprises to their own OSS/BSS. Today, though, we don’t have a consistently cloud-native model for 5G or O-RAN, both of which still have the OSS/BSS interface ties and both of which still present structure diagrams that boil down to box networks with box-friendly interfaces, rather than microservices and distributed features.

Rakuten’s CTO is right, and in fact isn’t going quite far enough in his indictment. Not only is telco software flawed, the telco software process is flawed, which makes it much harder to create a new model for software, a cloud-native model. Consensus processes like standards activities require a broadly held industry vision, and in the telco world, we don’t have it. Will some vendor or cloud provider step up on this one? Among the cloud providers, Microsoft seems the most likely to try to fix the cloud-native problem, and I’ll look at how their 5G/O-RAN solution seems to be evolving in my next blog.

Distributed Service Elements and Risk Management

In our search for edge computing justifications, are we pushing the complexity of cloud and Internet hosting too far? Some recent developments suggest that we may be opening a new set of dependencies for applications, creating an availability challenge that cries out for some architectural solution to complex component relationships. A solution we don’t seem to have, that most may not even care about, but that we need badly.

Web pages have gotten more interactive and more complicated over time, and that has created a number of challenges, the primary one being performance. The more visual items there are to be served, the more time it takes to assemble them all and deliver them for viewing. Almost a decade ago, startups were already being created to build intermediate caching technology, assembling a page at the server end (server-side rendering, or SSR) or somewhere in the middle, and delivering the completely assembled page to the client. This process was called “pre-rendering”. Interactivity made that more difficult to do, of course.

One of the impacts of the web-page-complexity issue was the creation of CDNs, and the use of CDN storage for more than video. Caching routine image elements, for example, could prevent the delays associated with pulling them from a company website. Your web page could then be viewed as two sections: a variable part that the company served from their data center because it contained variable data, and a bunch of image stuff cached locally. This approach contributed to the impact of Fastly’s outage; anything that needed a cached element was now going to fail.

There’s still plenty of impetus for improving the visual experience of web pages, though. Back in 2016, we saw the development of Jamstack, a model of web interaction that replaces the old web server, app server, and databases with a CDN and microservices. It’s a successor to the pre-rendering approach that combines its static-piece support, drawn from CDNs, with APIs, JavaScript, and microservices that could, for example, be hosted in edge computing. One article on Jamstack talks about “possibilities that a great unbundling or decoupling that next-generation web development could bring”, and that statement is both exciting and problematic.

Yes, it’s exciting that things like Jamstack could enhance the Internet experience. Jamstack and edge computing could turn web pages into distributed applications that could in theory span a range of resources and a range of providers. There is absolutely no question that this sort of thing could (and has) enhanced web sites and improved QoE for their users. It also both demonstrates and multiplies risks.

We tend to think of distributed models of experiences as being resilient and scalable, and that’s true in some cases. However, if an experience must draw from, let’s say, three distinct resource sets to be complete, then a failure of any of them causes it to fail, and that’s where the “problematic” side comes in. If we say that each of these resource sets has the same reliability R (the probability of being up and running), then the chances of all three being up are R³. Since R is always less than one (let’s make it 0.95 for this example), cubing it will give a smaller number—in this case, 0.857. Were we to say that we had three additional composing resources, the chances of them all working would be less than 3 out of 4.

There’s a lesson here in distributed applications and distributed web processing alike. If you create something that depends on the availability of multiple things, then you have to ensure that each of those things has a proportionally higher level of availability, or you risk creating something that has an unacceptably high risk of failing completely.

Creating higher availability means shifting our dependent relationships from “all of these” to “any of these”. If we take the same three resource sets, but say that any of the three can fulfill the experience on its own, then for the experience to fail, we’d need all three of them to fail. If our original probability of working was 0.95, then our chance of failing was 0.05. The chance of all three failing is that number cubed, which is 1.25 in ten thousand, pretty good odds. Cloud computing has higher availability because you can fail over to another hosting point.
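
For anyone who wants to check the arithmetic, here’s the calculation spelled out; the 0.95 figure is just the example value used above.

```python
# Availability arithmetic from the example above (R = 0.95 is illustrative).

R = 0.95

all_of_three = R ** 3                # all three resource sets must be up
print(round(all_of_three, 3))        # 0.857

all_of_six = R ** 6                  # three more composing resources added
print(round(all_of_six, 3))          # 0.735, i.e. less than 3 out of 4

any_of_three = 1 - (1 - R) ** 3      # any one of the three is enough
print(any_of_three)                  # 0.999875; failure odds of 1.25 in 10,000
```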

Complex applications, and complex web pages, can build dependencies and complexities quickly and without any fanfare. In fact, if we think of complex structures that are composed from other complex structures, we might not even see the risks that we’re building in by including some seemingly simple component/element. When we partner with another provider of features, we inherit their risks. How many times do you suppose we know what that failure risk actually is?

We’ve seen symptoms of this problem already in past reports of outages. Most of our recent problems in network services have come from something that seems unrelated to the core logic of the services, from a small software bug that was introduced in some backwater element, and that then spread to have a broader impact. If we can’t deal with this sort of thing better, we risk a steady decline in service and application availability.

There’s another problem that’s harder to quantify but perhaps scarier to contemplate. Having a widely distributed and complex application means having a larger attack surface. Every API, every hosting point and administration, poses a risk. This risk accumulates as the numbers grow, and as before, when we partner to draw something from another party, we inherit the security problems there. We probably know less about the security risk than about the availability risk, too.

What this means for the future, particularly the future of the cloud, edge computing, and applications based on either of the two, is that we need to be very aware of the availability and security metrics associated with all the things we’re going to composite together to create an experience.

Composing experiences or applications from multiple components is a fundamental part of exploiting the features and benefits of the cloud. Reducing coding by sharing components across applications or services is a fundamental part of a modern response to managing development costs and reducing the time it takes to create or change applications. We can’t lose these benefits, so we have to think harder about risk management. At the application level, we have some experience with this in the web services offered by cloud providers. At the service level, including the service of the Internet, we’re still a bit in the wild west.

At the application level, there’s some awareness of the issues of component sharing, but my limited input from enterprises suggests that this awareness tends to be confined to what we might call “data center development”, meaning the core applications and not the front-end pieces that are increasingly cloud-hosted. In the cloud, I hear about project failures created in part by excessive componentization and by a failure to assess the availability, performance, and security impacts of components. Cloud provider web services sometimes pose a problem, but usually it’s because the developers haven’t taken the impact of using them into account.

At the service level, we’re caught in three issues. First, service providers are less likely to have a cadre of experienced cloud developers on hand, and so are more likely to create problems through lack of experience. Second, telco standards tend to reflect a box-network bias, meaning that services are created through the cooperation of “functional monoliths” rather than elastic, agile, individual features. Finally, a lot of decisions about how to create service experiences are made by people focusing on the experience and not on the framework that builds it, which makes risk assessment more difficult.

A good edge computing strategy should address these risks, particularly if we believe (as most do) that 5G service hosting will be the early driver of edge deployment. Recognition of that truth seems limited; a recent article in Protocol on Jamstack doesn’t even mention either the availability or security risk potential. We should look closely at all the emerging edge options and measure how they could introduce availability and security risks, and then deal with how to communicate and mitigate risks, so we don’t end up with wider-scope, higher-impact, problems down the road.

Will TIP’s Requirements for Disaggregated Routing Help?

The Telecom Infrastructure Project (TIP) has done some interesting things, and one of the most recent and interesting of the bunch is their release of the Distributed Disaggregated Backbone Router (DDBR) requirements document. TIP has been a significant force in open-model network hardware, and so anything they do is important, and we need to take a look inside the document to see what insights we can find.

The “why” of the project covers what we’ve come to know as the traditional justifications for open-model network elements, much the same as those cited by other projects, like O-RAN. There is an almost-universal view among network equipment buyers that the vendors are locking them in, stalling innovation, failing to control prices, and so forth. You could fairly argue that DDBR is aimed at not doing what operators in particular think their vendors have been doing for years.

That doesn’t mean that the justifications aren’t valid. Every business tries to maximize its own revenues, and so there’s a lot of truth behind the assertion that vendors are doing that at the expense of customers. The best way to stop that is to open the market up, eliminating both proprietary hardware and software in favor of open implementations. Competition among the players who produce open solutions, or parts of those solutions, will then drive innovation and control prices.

One interesting justification stands out from the rest: “Taking advantage of Software Defining Network to make the network operation simpler, give tools for automation, enhance the capabilities of our network, and introduce a set of capabilities that today are not present.” This makes SDN at least an implicit requirement for the DDBR, and we’ll get more into that below.

The paper moves on to what’s essentially an indictment of chassis routers, and in particular the monolithic model of routing that dominates the market. While major router vendors like Cisco and Juniper have “disaggregated” in the sense that they separate hardware and software and permit either of those elements to be used with an open-model version of the other, the point of the paper is that monolithic is monolithic no matter how you wordsmith it. For backbone routing in particular, the scalability and resilience issues work against the monolithic model, and to reap the benefits of DDBR that the paper opened with, you need a different approach.

That approach is based on the spine-and-leaf or hierarchical routing/switching model that’s common in the data center already. You have some number of “spine” or higher-level boxes that connect to “leaf” edge boxes by multi-homing, and the combination supports any-to-any connectivity. This works well in the data center, but there are potential issues when you apply it to a backbone router.

When traffic enters a spine-and-leaf hierarchy at a leaf, it has to be distributed to “exit leaves”. If the entry and exit ports happen to be on the same leaf device, the traffic makes the jump via the device’s internal switching. If not, the traffic has to jump to a spine device and then back to the exit leaf device. This creates two potential issues: capacity constraints and latency.

If we assumed that every inbound flow had to exit on a different leaf device, then all traffic would be going through the spines. You’d need more spine devices and more leaf-to-spine multi-home connections to build the network, and the total capacity of the complex would be determined by the aggregate spine capacity. You also have to worry about whether the connections between leaves and spines are overloaded, and whether spine-to-spine links might be required.

The latency issue is created by the fact that the device connections are network trunks and not backplanes or fabric, and as you move through the hierarchy of devices, each hop introduces some latency. How much accumulates depends on the number of hops (and thus on the depth of the hierarchy) and the length of the connections, meaning how “distributed” the disaggregated complex is.
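
To put rough numbers on these two issues, here’s a minimal back-of-envelope sketch in Python. The device counts, port speeds, and per-hop latency figures are invented for illustration and aren’t taken from the TIP paper.

def spine_capacity_needed(leaf_count, ports_per_leaf, port_gbps):
    # Worst case: every inbound flow exits on a different leaf, so all
    # traffic has to cross the spine layer at least once.
    return leaf_count * ports_per_leaf * port_gbps

def added_latency_us(device_hops, per_hop_us, fiber_km):
    # Each device hop adds switching latency; each trunk adds propagation
    # delay of roughly 5 microseconds per kilometer of fiber.
    return device_hops * per_hop_us + fiber_km * 5.0

print(spine_capacity_needed(leaf_count=8, ports_per_leaf=32, port_gbps=100))  # 25,600 Gbps of spine capacity
print(added_latency_us(device_hops=2, per_hop_us=3.0, fiber_km=10))           # 56 microseconds added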

With the right set of white boxes and some care in assigning ports to leaf devices, you can control a lot of these issues, and I don’t think that either creates a true barrier to the concept. As I’ve pointed out on other blogs, we have white-box switches with very high-performance switching chips even today, and things are likely to get even better as competition drives innovation at the chip level.

Software is loaded onto the devices using the Open Network Install Environment (ONIE), which defines how a network OS gets loaded. The rest of the software framework hasn’t so far been specified in detail, the presumption being that the drive for excellence (and market share) will produce multiple solutions, competition, and innovation. There are, however, some baseline presumptions in the document about the software structure.

The software framework for the new DDBR is the most interesting piece of the story. The software piece starts with the separation of control and data planes. The spine-and-leaf hierarchy is purely data-plane, and the control plane is hosted either on a standard server or in the cloud. The paper presumes container implementation and cloud-native software for the cloud side, but it seems likely that a more general VM-and-IaaS approach would also be supported, and that bare-metal execution of a cloud-compatible software model is what would provide the dedicated-server hosting capability.

The paper isn’t explicit with respect to what “control-plane” functions are handled in the separate server/cloud device. The heavy lifting of topology and route management is explicit, though. The paper is also not explicit about just what’s meant by “SDN” in terms of the relationship between the control and data plane elements of DDBR.

Most of us think of SDN in terms of the ONF project and the OpenFlow protocol for switch control. There is no specific reference to either in the paper, and I can’t find an OpenFlow reference in any of TIP’s documents, including the blogs. Thus, I can’t say whether TIP’s DDBR is making OpenFlow an option or not; since it’s not shown in the figures in the paper, even where connections between an “SDN controller” and the data plane devices are represented, and it’s not called out in the protocol requirements either, I think it’s clearly not a requirement. My guess is that they’re accepting any standard mechanism for the exchanges that would relay forwarding tables to the devices.

The document isn’t specific about management features, but it does relate SDN’s ability to support network abstraction to the ease of interfacing with OSS/BSS. I think the goal is to require that the DDBR cluster look like a single device and present the same sort of management features a core chassis router would present.

So what do I think about this, overall? Let me offer some points.

First, I think that this further cements the open networking trends we’ve seen over the last five years or so, and that I’ve been blogging about. While there are still steps needed to transform a requirements document into a technical specification, we’re on the path toward an extension of open-model networking that takes us beyond the largely 5G-specific focus of efforts up to now. That’s good for users, and not so good for vendors.

Second, I think that the document still has a bit too much telco-think woven into it. The paper has this quote: “running on an on-prem x86 server or as a container in a cloud-native fashion”, which suggests that cloud-native requires container hosting or that all you need for cloud-native is a container, neither of which is true. There’s also all manner of network-protocol-specific stuff in the paper but not much about the cloud. I hope the next spec will correct this.

Third, this should be a clear signal to network equipment vendors that the world is changing, and that they have to change with it. The Internet is data service populism, it’s the largest source of IP traffic, and it’s a market where feature differentiation for vendors is virtually impossible. Connectivity is a commodity, which means that vendor margins are dependent on stepping beyond and above it, to something that’s meaningful to the user.

Finally, this is a signal to NFV-and-hosting players, to white-box network players, and to network software players that this market is going to open up and get competitive. Nobody can rest on their laurels at this point. Remember that DDBR is a unified hardware/software cluster, not just the hardware, and so it may well make it harder for someone who wants to be just a white-box player, or just a software player, to engage. That could have special implications in the 5G space, where many software elements are being presented without being specific about what they run on.

It’s positive to have a set of requirements for something as critical as distributed routing, but I’d have liked to have seen this set more explicitly aligned with modern cloud practices. The step of turning the requirements into specs, the next step according to TIP, will be absolutely critical.

Building and Being Cloud-Native

Suppose I want to develop a true cloud-native version of a network service or feature, maybe even an application? I’ve noted in past blogs that the term “cloud-native” is rapidly becoming the greatest “wash” in the industry; vendors, operators, analysts, reporters, and editors all spread it over anything related to the cloud. That’s a bad idea, but correcting the problem depends in part on having a technical framework for the real thing. I can’t do a full cloud-native design tutorial in a blog, but I can lay out the main points, and I’ll do it in terms of trade-offs.

The first tradeoff is componentization versus performance. It’s nearly impossible to make a monolithic application scalable and resilient, so to achieve the cloud-native goal we have to divide an application into components that can be duplicated under increased load, and replaced with little or no impact if there’s a failure. Components are separate software pieces coupled through APIs and networks, and this coupling generates latency. The more components there are, the more network connections add to overall latency, and the more likely it is that the application’s QoE will fall below acceptable levels.

The second tradeoff is transactions versus events. Commercial activities are rooted in transactions, which are designed to record a significant activity. A “transaction” is a request for processing, one that could ask for the contents of something or provide an update to something. They’re typically things that make a persistent change in a record, like the balance in your bank account or the number of widgets still in stock. Events are signals of a real-world condition change. They’re “atomic” and they are more likely to signal a change in the state of some physical system, like a door opening or a server failing.

The third tradeoff is platform versus specialized. Application software is supported by “system” or “platform” software, which is software designed to perform tasks that are likely common across many applications. If there are going to be many cloud-native applications, then a platform to handle them will reduce overall development and maintenance effort and make operations more consistent and efficient. However, a Swiss army knife doesn’t drive screws or cut wood as well as a screwdriver or saw, respectively. A general toolkit is likely to impose more overhead and more restrictions on applications than a specialized implementation of what might be considered a “system” function.

The first of our tradeoffs relates to how componentized we’ll make software to make it possible to scale and replace components. Approaching this issue means understanding what “atomic services” really are. Just because you can take Task A and Task B and write separate code for them doesn’t mean they should be microservices. The question is whether anything is gained by making Task A and B separate. Is the load impact on the two different in any way, meaning they might scale differently? Will dividing the tasks make it easier to accommodate a failure, or will two separate components both have to be running in order for anything useful to be done? The latter means componentizing would increase total fault risk.

The biggest mistake most cloud-native architects make is over-componentizing. Often it’s as simple as believing that if microservices are good, more of them is better. Sometimes it’s a byproduct of component reuse policies; smaller units of functionality are easier to reuse. What you should do is divide your cloud application into the smallest truly independent functional units as a starting point, and be prepared to recombine or subdivide if you find out your assumptions were wrong.

The second tradeoff is transactional versus event-driven, and this is actually the most complex tradeoff in technical terms, particularly for network function applications of cloud-native. A network is a complex structure with a lot of things going on, some of which are indications of a problem. It’s critical that these things are accommodated by the software, which means first that there has to be notification of them, and second that the notifications generated are interpreted correctly. That’s a fundamentally event-driven process. On the other hand, the standards bodies involved with telecom/networking, notably the 3GPP, O-RAN Alliance, and ETSI, tend to define interfaces/APIs between their elements that have transactional properties. An example is using an API to check for status, which is “polling”. This tends to impose a synchronous model of execution, because you would typically expect to wait for a response. Events are posted, so they encourage an asynchronous mode of execution; they happen when they happen. Software has to handle both situations.
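
To make the contrast concrete, here’s a minimal Python sketch of the two styles; the function names and the five-second polling interval are illustrative assumptions, not anything defined by the 3GPP, O-RAN, or ETSI.

import queue
import time

def handle(item):
    print("processing", item)

def poll_loop(get_status, interval_sec=5.0):
    # Transactional/polling style: ask, wait synchronously for the answer, repeat.
    while True:
        status = get_status()     # request/response; the caller blocks until it returns
        handle(status)
        time.sleep(interval_sec)

def event_loop(events: queue.Queue):
    # Event-driven style: do nothing until something is posted, whenever that happens.
    while True:
        event = events.get()      # blocks until an event arrives
        handle(event)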

Event processing is a complex topic in itself, but the most important piece is state or context. The difference between transactional and event-driven applications is that the former is stateful and the latter is usually stateless. A transaction is part of a context between the originator and the system, while an event is simply the notification that something happened. This distinction is important because if something is contextual, then components processing the message have to be able to sustain their knowledge of context, of state, somehow. That’s a problem for cloud-native.

Suppose I have a request that’s assigned to a specific component for handling, but while that component is working, it fails. I can spin up another instance in the cloud easily enough, but what happens to the first request? The second instance doesn’t know anything about the request, or about the first instance’s role in it. Even if the first instance remembered that it had sent a record to the user to be updated, the second one knows nothing about that. Somehow, state has to be provided to that second instance or resiliency through replacement won’t work.

In an event-driven system, the events drive specific atomic tasks. If context is important, it is kept either by the client or in a back-end database where each request is given a unique ID that can be used to look up where things are in the sequence of actions. That means that the design facilitates both replacement and scaling, and in fact any instance of a microservice could field a compatible request and you’d get the same result.
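
Here’s a minimal sketch of that back-end approach in Python, with a plain dictionary standing in for whatever shared database or cache a real system would use; the request states and event names are invented for illustration.

state_store = {}   # request_id -> saved context, shared by every instance

def handle_event(request_id, event):
    # Context comes from the shared store, not from the instance's memory, so any
    # instance (including a freshly spun-up replacement) produces the same result.
    context = state_store.get(request_id, {"step": "start"})
    if context["step"] == "start" and event == "order-received":
        context["step"] = "awaiting-payment"
    elif context["step"] == "awaiting-payment" and event == "payment-confirmed":
        context["step"] = "complete"
    state_store[request_id] = context
    return context["step"]

print(handle_event("req-1", "order-received"))     # awaiting-payment
print(handle_event("req-1", "payment-confirmed"))  # complete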

You can make a system designed to support events recognize the sequence of events in a transaction, and if you do it right, you can make that system fully cloud-native, meaning it will exploit cloud scalability and resilience fully. It is very difficult to make an intrinsically stateful transactional system cloud-native, except by making it contextual and event-driven.

The last of our tradeoffs is the platform-versus-custom one. This is rarely an either/or choice; most cloud-native software will use some packaged system software, particularly the operating system. The majority of cloud-native development uses “middleware” software that provides broad services like deployment (Kubernetes) and discovery and connectivity among microservices (service meshes like Istio, Linkerd), but monitoring and management software may also be involved. Where the big question arises is in the specifics, and in particular in how state is managed.

In the middleware platform area, the question is just how generalized the cloud-native strategy has to be. In the current 5G hosting example, should architects be thinking about something specific like O-RAN, which doesn’t fully exploit a broad middleware toolkit because it doesn’t have to, or about edge computing, which depends on the broadest application base? I’m of the view that they have to think in terms of the latter.

The specifics of cloud-native applications in networking have to reflect the fact that a network is a community, and a hierarchy. There are functional elements, today’s switches and routers, owned by various stakeholders, and all of these have to cooperate in moving traffic. A service is almost always coerced out of this complex structure, and so what we have in effect is a “platform”, the network, hosted on a “platform”, the device and server infrastructure.

This relates to state control particularly. The best way to think of a network service is as a modeled collection of intent models, each of which has its own operating state. That yields a state/event organization of tasks, where a table identifies each state and how each possible event is treated in it. That’s the model the TMF NGOSS Contract first suggested for network automation. It relates to cloud-native because the state/event intersections define atomic actions that can be mapped to cloud-native microservices, and because the service model contains all the data, so the microservices can be stateless, scalable, and resilient. I think that other methods of state control (such as client-side, back-end, or step-function) may be better for other applications in the cloud, and so I don’t think that state/event logic should be a general part of a cloud-native platform. Let the applications manage state themselves, in the way that works best.
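
Here’s a minimal Python sketch of that state/event organization; the states, events, and handler names are invented for illustration and aren’t drawn from the TMF material.

def on_deploy(model):
    model["state"] = "deploying"
    return model

def on_activate(model):
    model["state"] = "active"
    return model

def on_fault(model):
    model["state"] = "recovering"
    return model

STATE_EVENT_TABLE = {
    ("ordered",   "deploy-request"):  on_deploy,
    ("deploying", "deploy-complete"): on_activate,
    ("active",    "fault-reported"):  on_fault,
}

def dispatch(model, event):
    # The service model carries all the data, so the handler invoked here could
    # run as any instance of a stateless, scalable microservice.
    handler = STATE_EVENT_TABLE.get((model["state"], event))
    return handler(model) if handler else model

service = {"name": "example-service", "state": "ordered"}
service = dispatch(service, "deploy-request")   # state is now "deploying"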

If we sum this all up, the lesson is that you have to design an application to be able to take advantage of the cloud, and that design will have to address the tradeoff points I outlined. As I said above, and have said in other blogs, the fact that the term “cloud-native” is applied to things that aren’t architected to actually be scalable and resilient is a big risk, because we’re expecting early applications in telecommunications (like 5G RAN and O-RAN) to drive edge deployments that can then be leveraged for other services and experiences. If we don’t design those telecom applications to cloud-native standards, we may create the wrong platforms for broader use.

One challenge we have to address here was noted earlier; that the interfaces defined in most telecom standards, including 5G, tend to be transactional rather than event-based. This doesn’t mean that the entire software framework has to be transactional, but it does mean that there’s a risk that the interface specifications will induce architects to think of the specifications as defining boxes rather than elastic and agile processes. I’m going to try to get information on the real cloud-native state of some of the 5G stuff, and I’ll report what’s really being done, if I’m successful.

Proprietary vs Open 5G

Is there a reason to have proprietary technology in the 5G RAN? Is virtualization and disaggregation a good idea, or are there real costs and risks? Ericsson was quoted in an SDxCentral piece that outlines the case for a balance between open and proprietary. I want to come down on the issue a bit differently.

The sense of the Ericsson position is “‘There’s no doubt that when you build an integrated, purpose built system-on-a-chip (SoC) that you’re going to get better performance and get a better cost profile in the network,’ Jejdling said. As such, he expects vRAN and tightly integrated RAN, which comprises the vast majority of network infrastructure today, to operate in parallel in a hybrid environment.” I could agree with that, with qualifications, if the implied position it represents were true, but it isn’t.

If we build 5G based on a monolithic model, where appliances implement both the control and user planes of 5G, then proprietary SoC technology might well offer some benefits. I would argue that such a model is a fundamental mistake, and that the right model for the network of the future would separate control and data/user plane handling. The only problem with the vision of open RAN (O-RAN) in 5G is that we’re not taking the separation far enough. The core issue is that “data/user plane” point.

5G defines a “control plane” that represents the exchanges among the APIs defined for 5G by the 3GPP and by O-RAN. The “user plane” is really rooted in IP connectivity, and IP has its own control plane and data plane separation. Concepts like OpenFlow, and vendors like DriveNets, have touted the separation of control and data planes in IP networks as a benefit, and I believe it is, because it allows the performance-critical process of packet forwarding to be separated from the control-plane handling that’s much more like traditional event processing.

We have a number of excellent high-performance switching chips, notably the Jericho line from Broadcom, and if you separate the data and control planes you can use them for the former, and traditional hosted (x86) cloud processes for the latter. I believe firmly that the benefits of an open, cloud-native implementation of the IP control plane outweigh any advantages of custom chips there. The best 5G implementation, then, might well be a combination of a white-box user/data plane and a cloud-hosted control plane. That’s obviously not what Ericsson is proposing.

It’s also not exactly what either the 3GPP or O-RAN is promoting. Prevailing-think on 5G is that virtual functions would provide the implementation of many/most features, which presumably would include the “user plane” and IP. You could implement the UP on white boxes, of course, but the standards are vague on how a decision to do that might impact deployment and management. That’s a problem in itself, IMHO.

Virtualization that fixes the relationship between virtual functions and real hardware is a bit of a paper tiger. Ideally, 5G should be based on the presumption that each of the boxes in the 3GPP or O-RAN diagrams is an abstraction that could be implemented in whatever way best fits operator requirements. There’s no reason to say that everything has to follow the same model, only that any loss of efficiency (capital or operational) created by using multiple models is considered and accommodated in the business case.

One thing this means is that concepts like NFV shouldn’t be a requirement for 5G, or for O-RAN. NFV dictates deployment and management relationships and that violates our theory that the abstractions associated with 5G (or any other carrier cloud or edge application) should be compatible with any useful realization. However, it doesn’t mean that we can just shout “cloud-native” and wave garlic at the operations werewolf to fend it off. There needs to be some attention given to the question of just how we create “universal” abstractions that can be deployed on whatever works and managed through whatever is available.

We also need to be thinking about the relationship between the various “control planes”. IP and 5G/O-RAN have their own control planes, but O-RAN defines the notion of a RAN Intelligent Controller or RIC, and there’s both a near- and a non-real-time RIC defined, which implies two levels of control plane for 5G, giving us three planes in total. Other services, including streaming video, would introduce their own “control planes”, meaning layers of functionality designed for service control and not for packet data movement. If everything in a data network comes down to pushing packets, which arguably it does, then we have to make sure that all these bosses don’t give conflicting instructions to the packet-pushers.

In fact, we should be thinking about how “layers” in networking work in the current age. The venerable OSI model was designed to describe open systems interconnect, but even a cursory look at the layers (all seven, not counting the layers that have been subdivided) shows that the definition is really about application-to-user communication overall. What is the overall model for a separate-control-plane world?

Is it possible to define abstractions for the control planes that, like the OSI model, are represented by interfaces between the layers? It seems to me that any attempt to attach specific interfaces or features to abstractions in a virtual world limits the world you’ve worked to create. However, there could be “administrative” exchanges that could be defined. Could some rules about how layers communicate be both possible and helpful? I don’t know, frankly, but I suspect that it would be possible…with a big “if”.

The “if” is “if we could create some high-level models of control behaviors.” We know that control planes in IP exercise route control. We know that in 5G, they manage mobility. We know that CDNs work in part because a user’s request for content is redirected to an IP address that represents the optimum source. There are behavioral models we could identify, and we might be able to generalize them to create some of those high-level abstractions.
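
Purely as a thought experiment, one way to express those behavioral models is as a common abstract interface that each control plane implements in its own way. The Python sketch below is speculative; none of the class names or fields come from TIP, the 3GPP, or O-RAN.

from abc import ABC, abstractmethod

class ControlBehavior(ABC):
    # A generalized "control behavior": take a service-level request and return
    # instructions that the packet-pushing data plane can act on.
    @abstractmethod
    def decide(self, request: dict) -> dict:
        ...

class RouteControl(ControlBehavior):
    def decide(self, request):
        return {"action": "install-route", "prefix": request["prefix"]}

class MobilityControl(ControlBehavior):
    def decide(self, request):
        return {"action": "update-anchor", "device": request["device"]}

class ContentRedirection(ControlBehavior):
    def decide(self, request):
        return {"action": "resolve-to-cache", "url": request["url"]}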

This is a process that has to be thought through carefully, so we don’t accidentally limit the opportunity scope of the cloud by limiting the abstract features and services it can support when we define models. Still, it’s a process that’s essential to guide thinking and development of both edge computing and carrier cloud. We’re on the verge of creating an edge mission with 5G RAN, and we can’t afford to get off to a bad start.

The Real Lessons from Fastly

The recent Fastly problem generated a raft of stories featuring pundits claiming this or that lesson. It was interesting to me that none of the stories addressed what I think is the key point in the whole mess, which is that the Internet isn’t what we think it is. A corollary is that the cloud isn’t what we think it is, either, so maybe we need to look at what the two really are.

There’s a lot in a name, often a lot of misconceptions. We talk about “the Internet” as though we’re talking about a single network, and the Internet isn’t that. It’s actually a kind of multidimensional federation. The connectivity is created by the interconnection of independent Internet Service Providers (ISPs) and what we perceive as the service of the Internet is created through multiple technologies.

Internet connections are made through ISPs, and these ISPs have “peering agreements” that define how they interconnect. ISPs, or at least the major ones, also connect through national exchange points that provide connectivity where no private peering agreements exist. Each ISP has its own peering pathway, and so some aspects of your Internet experience vary depending on the ISP or ISPs you use.

The primary service of the Internet, from an objective technology perspective, is connectivity. The primary service of the Internet, from the perspective of its users, is experience delivery. In fact, if we judge Internet use by traffic, the primary service is video streaming, and how video streaming actually works is very different from how people think it works.

When you access an Internet element, a provider of an experience, you do it via a URL, which is a symbolic reference to something that, to be used, has to be translated to an Internet address. That translation is normally done using the Domain Name System (DNS). If URLs were translated to the web servers of the provider of a web page or a video, the traffic associated with the experience would have to transit all the ISPs and peering points between user and experience. The result would be significant latency and congestion potential, and the core of the Internet would be jammed with video traffic.

ISPs follow what’s called a “bill and keep” practice; everyone charges their own users and keeps all the money. ISPs who would generate a lot of demand for video would load down the ISPs who hosted the video sources, and the traffic wouldn’t be paid for. Some ISPs might refuse to peer with video-greedy ISPs. To address both the delay/congestion issue and the billing issue, Internet content that’s widely used and generates a lot of traffic can be moved to a Content Delivery Network, or CDN.

CDNs are a community of caches, which are servers and storage where content can be staged. When CDN content is requested, instead of decoding the URL to the actual owner of the content, the URL is decoded to the address of the best-positioned cache point for the content, and that cache point delivers it. In most cases, the cache points are directly connected to “access ISPs” who provide consumer Internet service, and so the majority of the Internet is bypassed completely. Content providers pay for this service, and it improves the quality of experience of their users.
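
You can see this from any machine: resolving a CDN-fronted hostname typically returns addresses that belong to a nearby cache rather than to the content owner’s own servers, and the answer varies with your network and location. A small Python sketch, with a placeholder hostname:

import socket

def resolve(hostname):
    # Ask the local DNS resolver; for CDN-fronted names the answer usually
    # points at a cache near you rather than at the origin servers.
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

print(resolve("www.example.com"))   # substitute any CDN-fronted site you use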

Fastly isn’t an ISP, they’re a CDN provider. Their problem didn’t take down the Internet, as has been widely stated, or even a portion of it. What happened was that it took down the content that Fastly was caching, including the specific websites that were noted in various articles. Do CDN providers’ failures put their customers at risk? Sure, obviously, just as any provider of hosting will put those who use the hosting capability at risk should they fail…just like the cloud.

CDN applications and cloud applications have a lot of similarity, and the relationship between cloud providers and the Internet is the same as the relationship between CDN providers, like Fastly, and the Internet. If Amazon’s cloud goes down, the applications that are hosted in it are lost to the applications’ owners and users.

The core of the problem with Fastly is that the CDN service it provides is part of the Internet experience even though it’s not strictly part of the Internet. That opens up an interesting and problematic point, which is that if public cloud providers absorb some of the features of another service, as Fastly has absorbed some features of the Internet experience, then a cloud failure would create a failure of the service it supported. For example, if cloud providers hosted 5G O-RAN and the cloud failed, then 5G O-RAN would fail for the customers of the network operator who used the cloud.

Most network operators are obsessed with reliability, the proverbial five-nines mindset. The cloud, the Internet, and the CDN communities are much less so, and part of the reason is that network services have been traditionally subject to regulation, where the other services have not. Another part is that service experiences created by a community of providers are usually subject to failure if any of those providers fail. The reliability of a “series connection”, one that reflects dependency on all elements functioning, is always lower than the reliability of any given element.
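
The arithmetic behind that last point is simple enough to show directly; the availability figures in this Python sketch are illustrative, not measured.

def series_availability(*element_availabilities):
    # A chain of dependent elements is only up when every element is up, so
    # overall availability is the product of the individual figures.
    result = 1.0
    for availability in element_availabilities:
        result *= availability
    return result

# e.g. access ISP, CDN provider, and origin cloud each at "three nines"
print(series_availability(0.999, 0.999, 0.999))   # ~0.997, roughly a day of downtime per year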

Do network operators have it right, thinking that cloud and Internet technology isn’t the answer as a foundation for telecommunications services? That depends on what you’re willing to trade for those five nines, and whether an NFV-and-operator-centric vision of hosting would offer any better level of reliability. Frankly, I don’t think it would.

The real lesson of Fastly is that we need to accept that the Internet, and cloud computing, create much more complex foundations for the experiences we depend on for work and life. Complicated things break more easily, and so we can expect that one price of our Internet and cloud riches is a growing risk that what we need is going to break.

Facing that risk doesn’t mean accepting it, though. One truth about cloud computing is that it’s been working to deliver better availability even with a higher level of application or service complexity. What we call “cloud-native” design is a path to that, but the reality of cloud-native is that almost everything that claims to be so really isn’t at all. There’s way more hype than reality. My biggest concern about things like cloud hosting of 5G and O-RAN, and about edge computing, is that we’ll accept the marketectural view of cloud-native rather than the architectural view. If we do, then we’re going to see more of these Fastly-like incidents.