An ONF-Sponsored Event on Open-Source 5G

How much can open source do for 5G?  That’s a question a lot of operators are asking (and a lot of vendors, too).  The ONF, which is taking a much bigger/broader role these days, thinks it has some answers, so it’s well worth looking at them.  This is (topically speaking) a kind of follow-on to my blog yesterday on the TelecomTV cloud-native conference, so if you’ve not read that, you might want to read it first!

Nobody doubts the motivation behind the open movement in 5G.  Network operators have faced declining profit per bit for almost 15 years now.  Mobile was, for a time, the only space that was immune, because usage-based pricing in some form is still available in mobile services.  Now, even mobile services are seeing the same problems, and since 5G represents at least a major build-out, it’s both a major transformation opportunity and a major business risk.

If operators cannot build 5G using an open model, then what’s likely the last major network service transformation of this decade will reinforce the closed, proprietary, model.  Incremental transformation is a contradiction in terms, so getting out of that reinforced and renewed lock-in will be difficult.  The opportunity to create openness is now.

The presentations in the ONF session are a mixture of vendor and operator views, with the former obviously representing a positioning the vendors think they can win at.  Still, they’re useful in laying out what an open model would look like, and they’re also interesting, in that they almost uniformly illustrate the continued risk of what I’ll call the “functional-to-architectural” transformation, the thing that likely doomed NFV to failure.  I won’t call out individual vendors here because the issue is universal, but I will explain the problem.

We’re layer-obsessed in networking, and have been ever since the Basic Reference Model for Open Systems Interconnection (the “OSI model”) emerged in the late 1970s.  When we draw 5G, we draw layers, and these layers are useful in that they define a building-up of functionality, starting with the very basics and moving to the (hopefully) sublime experiences.  There’s no question that in 5G, “services” and “management” and so forth are at a higher functional layer than the user and control planes of 5G Core.

The problem is in how literally we take the diagrams.  How dramatic a leap is it for some poor architect, looking at a diagram that shows the function they’re trying to design as a box in a stack of boxes, to end up designing exactly that, a box?  Monolithic implementations tend to develop out of simple functional diagrams taken too literally.  That was a fatal risk for NFV, and it’s still a risk for 5G, open or otherwise.  The universality of this form of explanation in the material is proof that we have to guard against the risk.

Probably a reasonable way to look at the material, which is obviously diverse in both target and approach, is to frame it as a kind of point-and-counterpoint between what I’ll describe as the “operator vision” and the “vendor vision”.  As I said, I’m not going to name people or firms here because I don’t want to personalize comments, only make observations.

Let’s start with an operator seen almost universally as an innovator in 5G; their presentation is a reasonable place to set the stage.  The first architecture diagram shows the high-stakes game we’re in; they see the future 5G infrastructure as a combination of edge and central data centers.  That’s why “carrier cloud” had/has the potential to build out a hundred thousand incremental data centers, according to my model.  These map, in functional terms, to a resource pool for NFVI, showing how reluctant operators are to abandon the NFV notion.

The critical business claim made here is that the open approach creates a 40% reduction in capex and a 30% reduction in opex.  Neither of these stated benefits is consistent with the NFV experience data I’ve seen globally; capex reductions have rarely exceeded 25%, and opex has been the same or even a bit higher in the open approach.  Since the data cited was for OpenRAN, I think the specialized application elements may offer better savings than the NFV average.  That may be true for 5G on a broader scale too.

Why OpenRAN might generate at least capital savings, and other benefits as well, is explained by a second operator presentation.  They cite four “disruptors” in the telecom sector.  First is competition from the hyperscalers (cloud providers, predominantly), second the evolution of technology overall toward software and the cloud, third the more-demanding customer experience expectations, and finally the lack of innovation created by vendor consolidation and loss of competition.

All these factors are interesting, but the last one may be especially so.  As network operators have responded to declining profit per bit with pressure on infrastructure pricing and projects, vendors have suffered.  This suffering leads to consolidation.  One of today’s vendors, Nokia, is the sum of three previous major network vendors and many smaller ones.

The same thing that’s leading to consolidation and loss of vendor innovation is also leading to “incrementalism” in network infrastructure.  A massive change in a network requires a massive business case.  If vendor innovation is indeed being stifled, there is little or no chance that the kind of technical shift that creates a massive business case would be presented by any vendor.  That, I think, is the real justification for looking for another model, something to replace a competitive field of proprietary giants.

The same second operator cites three key ingredients in a solution to their problems.  The first is disaggregation, which they’ve taken to mean the breaking up of monolithic devices into virtualized functions.  Second is orchestration to automate the inevitably more complicated operations associated with disaggregated functions, and third is open APIs to expose critical capabilities for exploitation by evolving services and techniques.

The final point this operator makes is that the “edge cloud” is going to be a key point in differentiating telcos from hyperscalers/cloud providers.  This raises the question of why so many operators are partnering with public cloud providers, and seem to be stalling on making any progress in carrier cloud at all, much less mobile edge.  It also suggests that either the operator believes that hyperscalers will enter the “carrier cloud” market, perhaps offering 5G-as-a-service, or that the telcos will inevitably have to climb up the value chain to compete on the hyperscalers’ own turf.

That’s particularly interesting given that a third operator has a public-cloud partner at the center of their own architecture.  Fortunately, this operator may offer an explanation, too.  They show the edge cloud, presumably owned by them, connecting to a public cloud.  This would suggest that operators are almost universally interested in public cloud as a supplementary or transitional resource, which of course would be good news for vendors if it’s true.

Speaking of vendors, this is a good place to start thinking and talking about them, and about their approach to the open 5G theme.  As I noted above, there’s still what is to me a disquieting tendency for vendors to hunker down on the NFV story, despite the fact that in another of the recent online events, Google admitted that NFV had succeeded primarily in changing the mindset of telcos, not through adoption.  One operator did retain NFV in their diagrams, but the others were more “cloud” oriented, generalizing their goal as “exploiting cloud technology” or even “cloud-native”.  I think there are a lot of people who don’t want to face up to the fact that NFV was a seven-year boondoggle, but they’ll quietly accept that something beyond it is needed.  One vendor presentation implies that with a platform layer that hosts “containers” and “NFV”.

The ONF presented its own view of what at least part of that “something beyond” might be, which is an SDN-centric vision of routing.  They have an SDN controller talking to a bunch of P4 Stratum switches, running applications like BGP and perhaps even 5G.  This is surely a step in a different direction, but I have concerns about whether it’s a better one, because of the implicit centralization.

I’m all for control/data-plane separation, as readers of my blog surely know.  I’m all for specializing forwarding devices for the data plane.  But I’m not for centralizing the control plane, because we have absolutely no convincing information to prove that central forwarding control can work at network scale.  You need hierarchies or federation, and those would take some work to define.  We may well not have time for that work to be done.

I’m also concerned about later elements of the ONF presentation, in particular the way they seem to be coupling 5G to the picture.  They introduce policy control and enforcement, which to me makes no sense if you assume you have complete and central control of forwarding.  An SDN-like mechanism, or any mechanism designed to provide dynamic forwarding control, should present its capabilities as network-as-a-service, to be consumed by the higher elements, including 5G.

What I see at the vendor (or “source”) level overall is a tendency to draw diagrams and propose platforms rather than to define complete solutions.  It’s easy to show the future as a set of indefinite (and therefore presumably infinitely flexible) APIs leading up to a limitless set of services and refinements.  There is a sense that there has to be a “fabric” or “mesh” of some sort that lives above the forwarding process and hosts the control plane(s) (both the IP and 5G ones), but there is no proposed open-source solution for those elements.

The thread that ties all the material together is a thread of omission.  We don’t have a specific structure for hosting a separate control plane in a cloud-native, practical, way.  We don’t have an architecture to define how that control plane would work, how its elements would combine to do its job of controlling forwarding and keeping track of topology and status.  Google has done some of this in Andromeda, its SDN/BGP4 core, but it’s not a general solution and Google has never said it was.

Innovation, the innovation that the second operator said had been lost, is needed in this very area.  Without specificity in the framework of that cloud-native universe of floating functionality that lives above forwarding, we’re not going to have a practical transformed 5G, or much of anything else.

We also may have to get specific with respect to “open” in networks.  Does every piece of hardware have to be based on off-the-shelf stuff?  Does all the software have to be open-source?  We can’t achieve that today and still have something that actually works within reasonable performance and cost limits.  There’s still a lot of room for innovation, and just because the giants of the past won’t or can’t innovate doesn’t mean the startups of today, and the future, shouldn’t be allowed to give it a go.  They may well prove that they deserve that chance.

Looming in the background here is the growing interest of public cloud providers in offering 5G.  Hosting 5G in the cloud could still rely on open implementations of 5G, but since the cloud can already host almost anything, such a basic approach wouldn’t offer a cloud provider much differentiation.  They’re all clearly angling for a role in supplying “5G Core-as-a-Service”, which is why Microsoft recently bought Metaswitch, a vendor with a 5G software stack.  Can the cloud providers’ as-a-service approach defeat the open-source movement, or will operators see it as trading lock-in to traditional mobile infrastructure vendors for lock-in to cloud providers?

An open network doesn’t lock you in.  That’s the simple definition I think we have to accept, at least for now.  Since it’s the cost of the physical devices, and the contribution of any annual-subscription software, that creates lock-in, we have to match approaches to the test of controlling these two factors if we want to preserve openness…and that still leaves the question of innovation.

The ONF presentation showed innovation, but in a direction that’s likely to raise a serious risk in performance and scalability.  Yesterday, we heard about the DriveNets win in AT&T’s core.  DriveNets has, at least within one of their clusters, a cloud-hosted and separate control plane.  Could this spread between clusters, become something like a distributed form of SDN controller, and resolve the problems with the future model of networking that shows the most promise?  I hope to blog more on them, as soon as I can get the information I need.  This might be a critical innovation, even if DriveNets software isn’t open-source.

If operators want to open everything, to eliminate any vendor profit motive in building network equipment, they’ll need to accept that the innovation baton will necessarily pass to them.  Right now, by their own admission, they are completely unprepared to provide the technical and architectural support needed to play that role.  That means that their vision of the network of the future doesn’t just acknowledge the loss of the old proprietary innovators, but the fact that new ones will be needed, and new visions of “openness” to accommodate them.

Analyzing the TelecomTV Take on Cloud-Native

Industry events are good, providing you can time-shift to watch them.  That’s particularly true for analysts, because drinking bathwater is not an attractive image, whether it’s your own or not.  The term could be said to apply to any number of things that create a kind of intellectual recycling, a belief set that builds on itself rather than on outside needs and opportunities.  Like all market pundits, I try to avoid it, and one way to do that without becoming a conference gadfly is to watch replays of events.

TelecomTV had a nice series on a topic dear to my heart, cloud-native and telco cloud, and I want to review what I think it meant.  As always, I’m putting my own stamp on this, hopefully to explain, add perspective, and sometimes disagree.  Tomorrow’s blog will look at another conference series, this one on 5G and open-source, sponsored by the ONF.

The thing that seemed to cross all the sessions, in my own view, was that there’s still a lot of haze surrounding the notion of “cloud-native”.  There are some who think that it means something designed to be hosted, period.  Some think that it’s about reorganizing things a bit to allow the telcos to create a kind of hybrid-/multi-cloud framework of their own, much like other verticals.  Some think it’s about adopting cloud-like development practices and tools, like continuous integration and delivery (CI/CD).  Finally, of course, some think it’s what it really is, which is the disaggregation of functions and applications into microservices to build new models of old things.

Perhaps the most important thing about the conference videos is that each demonstrates there are people, in just about every organization, who see cloud-native as it really is.  Some of the speakers commented that they had real cloud-native dialogs with every operator they talked with, which would certainly be a good sign.  But I’ve had conversations with operators that proved cloud-native isn’t the universal language there, so how your cloud-native discussion goes will depend a lot on who you happen to be talking with.

Another thread that cut across sessions is that operators recognize that they are not really cloud-qualified.  Not only are they fairly sure they don’t know how to build their own carrier cloud, they’re not entirely sure they know how to consume someone else’s cloud effectively.  My sense, from the comments, is that most of the operators don’t think this is going to be an easy problem to solve, either.  Since the fact that operators participated meaningfully in all the sessions proves that some people in the operator space do get it (and refer again to my last paragraph here), the problem is obviously breadth of knowledge.  Network people, to paraphrase one operator, don’t do cloud.

What’s perhaps the most important “technical” point from the conference is that 5G is very likely the big driver of cloud-native awareness among operators.  The fact that 5G separates the control plane and the user plane was cited often, probably at least once in every session where cloud-native drivers were brought up.  Sadly, this good news is dampened by the fact that the comment was often accompanied by two points I think are at least sub-optimal, if not wrong.

The first point is that 5G “requires” cloud-native.  It doesn’t; it requires (meaning specifies) NFV, which is itself struggling toward cloud-native understanding.  5G is surely an opportunity to introduce and leverage cloud-native development, but it’s still a decision that operators and vendors will have to make explicitly, and so far, I’d say that most are not making it.  If 5G is truly a big driver for cloud-native, then it may be getting wasted as we speak.

The second point is that “control plane” is one of those highly ambiguous terms.  The term usually seems to mean what used to be called “signaling”, the process of controlling communication and facilities, rather than the channel that passed information end-to-end.  What makes this definition a risk in the 5G context is that 5G networks run on IP, which has its own “control plane”.  The 5G User Plane, then, consists of the IP control plane and the IP data plane (or forwarding plane).  Since we have a lot of discussions about separating the control plane in IP networks, it’s easy to see how people might think that 5G mandates that.  It does, but only with its own signaling, which is a higher layer than that of the IP control plane.

What 5G does, most explicitly, is to separate its own control and user planes.  That creates two incentives, one implicit and one explicit, for IP networks to do the same.

The first incentive is that if 5G Core represents the state of the art in operator thinking, then it’s a powerful architectural reference for other services.  That doesn’t mean just new services or just old ones; it means both.  We should expect to see control/user separation across the board, because it’s what operators think is the right approach.

The explicit incentive is that 5G Core presents interfaces (N1, N2, and N4) between the control and user planes.  If the user plane is IP, then you can argue that an IP network should be able to support these interfaces, meaning that the AMF/SMF (access and mobility management and session management) elements could rightfully be seen as network services built on those interfaces.  If we assume that the IP network has an independent, cloud-hosted, IP control plane, then the Nx interfaces are essentially APIs that invoke network service features.  This sure sounds like network-as-a-service, and it would represent a model of how other services, old and new, interface with the IP network.
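To make that concrete, here’s a minimal sketch of what “Nx interfaces become NaaS APIs” might look like.  Nothing below comes from the 3GPP specs; the class, method names, and parameters are all hypothetical, intended only to illustrate how a cloud-hosted IP control plane might expose session and mobility features as consumable services.

```python
# Hypothetical NaaS facade over an independent, cloud-hosted IP control plane.
from dataclasses import dataclass
from itertools import count

@dataclass
class Session:
    subscriber: str   # who the session belongs to
    qos_class: str    # requested handling, e.g. "low-latency"
    cell: str         # current point of attachment

class NetworkAsAService:
    def __init__(self):
        self._sessions = {}
        self._ids = count(1)

    def create_session(self, subscriber: str, qos_class: str, cell: str) -> int:
        # Roughly the role SMF plays over N4: establish forwarding state
        # for a subscriber session and return a handle to it.
        sid = next(self._ids)
        self._sessions[sid] = Session(subscriber, qos_class, cell)
        return sid

    def update_mobility(self, sid: int, new_cell: str) -> None:
        # Roughly AMF-style mobility handling: re-point the forwarding path.
        self._sessions[sid].cell = new_cell

    def release_session(self, sid: int) -> None:
        del self._sessions[sid]

naas = NetworkAsAService()
sid = naas.create_session("imsi-001", "low-latency", "cell-17")
naas.update_mobility(sid, "cell-18")  # the 5G control plane consumes the API
```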

The reason this matters for cloud-native wasn’t really brought out in the conference, but I’ll suggest one.  The higher you go in terms of service layers, from connectivity up to applications and experiences, the more cloud-like your architecture had better be, to conform to application trends.  If the Nx interfaces are the boundary between “legacy” connection services and “over-the-top” services, then they represent the place where cloud behavior starts to make sense.  That argues for considering the IP control plane and the 5G (or other service) control plane as sublayers, with a common and cloud-native implementation.

Vendors in the sessions were more likely to see cloud-native in its real, and realistic, form than the operators.  One vendor even talked about state control, which is critical for cloud-native but something most don’t understand at all.  But even the vendors had the view that 5G Core was written to be cloud-native, and I don’t see anything in the specs that admits of that interpretation.

Another area where vendors had a distinctive (and understandable) combined view was that operators really needed to have their own cloud, eventually.  Most operators also seemed to agree with this, but it seems like the cloud software vendors are recognizing that public cloud hosting of “carrier cloud” applications could siphon off a lot of their opportunity, and they’re particularly sensitive to the loss of the “edge” opportunity.

If you were to dissect my forecast of carrier cloud data center opportunity, over 70% of the data centers would be expected to deploy at the edge, largely in central offices and mobile backhaul concentration points.  Given that, and given that the software vendors would face a significant opportunity loss were the operators to outsource their edge requirements to public cloud providers, all the software vendors saw evolution of carrier cloud as starting with 5G and quickly becoming edge-centric.  They also saw the biggest public cloud outsource opportunities in the operators’ administration and operations systems, not in hosting network features or functions.

Cloud-native is here, providing that we’re willing to accept a rather loosey-goosey definition of what “cloud-native” is.  I’m grateful to TelecomTV for running these kinds of events, and for making the material available online afterward.  It gives us all a stylized but still grounded view of what the operators and vendors most committed to changing things are doing and thinking.  We’re not there yet, not in the transformed future, but we can see it through the weeds now, and that’s progress enough to applaud.

Network Equipment Differentiation: Still Possible?

Differentiation has always been a challenge, but it seems like the problems vendors and providers face in standing out from the masses are getting worse.  Part of that is because there are more and more areas that could offer differentiation, more difficulties making buyers understand the issues in those areas, and more overlap across the areas.  What’s a vendor to do, and what might be the result of all this “doing”, or at least trying?

Forty years ago, computing platforms were almost totally proprietary.  If you bought a machine from DEC, you bought into the whole DEC ecosystem from hardware up to networking products.  Networks are still largely built from proprietary devices that are still monoliths, bundles of everything, but computing has been shifting more to an open model, and networking is starting to move in that direction too.  Does everyone surrender to the Will of the Masses here, or are there still spaces where a good iconoclast vendor could stand out?

The obvious one is “what comes next?”  If we’re losing network equipment differentiation because of the transformation of networking, then the path of transformation and even the end game might well be differentiators.  In fact, they’re probably the first level of selection that will separate the good from the…deceased.

The current view of network transformation threatens traditional vendor positioning.  Some call this future transformed state “cloud”, some say it’s “software”, but in either case what we’re really talking about is shifting away from “box networking” meaning routers, to something that hosts network features.  Vendors in the network space have grown and prospered selling boxes, but if the future network is cobbled together from a bunch of Lego blocks, what do those vendors sell?  A Lego block is a lot less differentiated than a Lego castle, so unbundling network boxes into separate pieces seems to cry out “commoditization” in a loud (and scary) voice.  How loud and scary the future will turn out to be depends on exactly how the evolution happens.

The router-vendor vision of the future is simple: yes, hardware and software have to be separated.  We separated them by unbundling the router software and making it an annual license, a subscription.  In theory you could run other software on our boxes (good luck finding any, and getting support will likely require miraculous intervention).  In theory, you could run our software on other hardware (same good wishes).  But, hey, we’re trying.

The router-software vision is simple too.  A router transforms into a router instance running on a commercial off-the-shelf (COTS) server.  The server won’t deliver the same performance as a customized hardware platform, but you could claim this was a “cloud-enabled” solution because you could run the router software on a VM in the cloud.  Expect performance there to be even worse, of course.

White-box people have a germ of a truly new vision.  You run generic router software on a custom device, a device that is augmented with the chips and other technology needed to let it at least approach the performance of dedicated routers.  This is probably what most people today are seeing as the right path, but it has the disadvantage of leaving the cloud behind.  Network operators believe in the cloud, even if they don’t know exactly what it is.

There are some in-the-network differentiation opportunities, of course.  White boxes can be differentiated by the chip set included in them, which could make it possible to create a cluster device whose individual elements are optimized to the mission.  Some might look like brawny forwarding engines, others like lookup engines, and others like cloud application hosts.  Again, leveraging this would require the decomposition of “routing” as a function into abstract elements that could then be hosted where they fit best.

Chips themselves could be differentiators, too.  The challenge with creating your own custom chips (which major router vendors do for their systems) is that they can compromise your ability to claim an “open” architecture.  A dispersed cluster of routing elements like the one I just described could still be fully open if it exposed the proper APIs to permit customization, and of course, generic white boxes are inherently open.  Stick a custom chip in a white box and it’s open only to the extent that the chip is generally exploitable, and if it is, then it’s not differentiating.

The P4 language and the ONF Stratum architecture model could offer a way to maintain an open claim while using custom silicon nobody else could get.  P4 is a flow-management language, an API and “driver” that harmonizes different chips to a common programming language.  That means that you could support a standard flow manipulation approach with custom, differentiable, silicon.  It would also mean that other stuff could be run on your box, and that your stuff could be run on other boxes, providing that everything used P4.
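To illustrate the point (and only that; what follows is a toy in Python, not real P4 or Stratum code), here’s the shape of the idea: one generic flow-rule abstraction, with each vendor’s silicon supplying its own “driver” underneath.

```python
# Toy stand-in for the P4/Stratum idea: a common match-action programming
# model harmonizing different, differentiated chips.
from abc import ABC, abstractmethod

class FlowRule:
    def __init__(self, match: dict, action: str):
        self.match = match    # e.g. {"dst_prefix": "10.0.0.0/8"}
        self.action = action  # e.g. "forward:port3" or "drop"

class ChipBackend(ABC):
    """What each chip must implement to accept common rules."""
    @abstractmethod
    def install(self, rule: FlowRule) -> None: ...

class VendorAChip(ChipBackend):
    def install(self, rule: FlowRule) -> None:
        # Compile the generic rule into this chip's proprietary table format.
        print(f"[chip A] TCAM entry: {rule.match} -> {rule.action}")

class VendorBChip(ChipBackend):
    def install(self, rule: FlowRule) -> None:
        # Same rule, entirely different hardware realization.
        print(f"[chip B] pipeline stage: {rule.match} -> {rule.action}")

# The operator's program doesn't change when the silicon does.
for chip in (VendorAChip(), VendorBChip()):
    chip.install(FlowRule({"dst_prefix": "10.0.0.0/8"}, "forward:port3"))
```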

So far, all of these options come with plusses and minuses, but a big minus for them all is that they’re really tweaking the old router model.  It’s not enough to just run a monolithic control plane and a monolithic data plane on either the same device or tightly coupled devices.  The future network disaggregates right down to the feature level.  You separate the control, management, and data planes and you host the former two on the cloud, using cloud-native techniques.  The data plane gets mapped to specialized white-box devices.  We get the best of the white-box world without the performance risks, and we make the cloud the centerpiece.

The common thread in network operator transformation discussions is “the cloud”.  There’s an implicit belief that if cloud principles were to be applied to building networks, the cloud could transform networks as it’s transformed IT.  If every operator believes that, then every differentiating position has to be somehow tied back to the cloud.  That’s the big failing of so many of the approaches to differentiation I’ve outlined above.  They don’t end up with a cloud, and that’s where every operator thinks they have to be heading.  But how do you apply cloud-think to networks?

For about two decades now, there have been criticisms of the IP routing process, mostly of the elements that relate to what would be called the “control plane”.  Things like MPLS, Next-Hop Resolution Protocol, and more recently the various area-routing approaches, are all based on refinements to the adaptive routing process.  SDN proposed to replace all of this with a central routing instance, a place that collected topology and status and sent out updates to all the forwarding tables impacted by changes.  Could the essential piece of uniting cloud and network be the replacement of the traditional IP control plane with something else?  SDN without the centralization risk?

You could surely pull the control plane out of a single box and create a cluster that lets the control and data planes evolve in cloud-native form, and as separately as the two would need to, given the extreme differences in their requirements.  DriveNets, who won a Light Reading startup award, does this, and they’ve recently claimed a PoC that illustrates some pretty profound benefits.  The story suggests that they could extend this separate control plane multi-dimensionally: upward, to climb toward actually making the network experience-aware, and horizontally, across multiple devices/clusters, to create a distributed form of SDN’s central controller.  Forwarding, in this vision, is simply a kind of transport layer, controlled from above via the cloud.

That new model is what I think is the ultimate differentiator.  The biggest minus (or plus, depending on your perspective) is that traditional router vendors surely see this as being differentiating via total disruption.  If you change everything, then you’re telling customers that they might as well look at other alternatives than you, as long as they’re starting fresh.  Can a router vendor develop an evolutionary revolution?  I have to wonder why, if they could, they haven’t done it already, when it’s so clear that buyers are eager for something truly transformational.

The smart thing for router vendors, then, would be to accept what’s happening and begin a transformation of their own approach.  I can already pull forwarding (the data plane) and the control plane of routing apart.  If I were a router vendor, I could then implement that cloud-ready control plane, support my current product line, and evolve myself toward cheaper, simpler, “white-box-like” (or even white-box) devices.

It’s not going to be as good a business, though, because it can’t be.  You can’t make buyers spend more on a transformation when their primary transformation goal is to spend less.  Router vendors will have to accept that their sales, and their organizations, are inevitably going to get smaller. “Shrinkage” is tough to sell as a mantra in growth-driven Silicon Valley, but it’s preferable to vanishing.  Mainframe computers and even minicomputers once drove the IT market, and neither does so today.  DEC, Data General, Perkin-Elmer, RCA, CDC, and others all had to learn that success is riding the wave of change, not being drowned by it.  So do today’s network vendors, and just because past initiatives aimed at this radical sort of transformation failed, it doesn’t mean that eventually, even accidentally, someone won’t get it right.

We Can’t Put Off Thinking About Latency!

If latency is important, just what constitutes “good” and “bad” latency levels?  How does latency figure into network and application design, and what are the sources of latency that we can best control?  I’ve talked about latency before, but over the last couple months I’ve been collecting information on latency that should let me get a bit more precise, more quantitative.

Latency is a factor in two rather different ways.  In “control” applications, latency determines the length of the “control loop”, which is the time it takes for an event to be recognized and acted on.  In transactional and information transfer applications, latency determines the time it takes to acknowledge the transfer of something.  The difference is important because the impact of latency in these two areas is very different.

Control-loop latency is best understood by relating it to human reaction time.  People react to different sensory stimuli differently, but for visual stimulus, the average reaction time is about 250 milliseconds.  Auditory reaction time is shorter, at about 170ms, and touch the shortest of all at 150ms.  In control-loop processes whose behavior can be related to human behavior (automation), these represent the maximum latency that could be experienced without a perception of delay.

Transactional or information transfer latency is much more variable, because transactional latency can be related to human reaction time while transfer latency is purely a system reaction.  Online transaction processing and data entry can be just as latency-sensitive as a control loop.  Years ago, I developed a data entry application that required workers to achieve a high speed.  We found that they actually entered data faster if they were not shown the prompts on the screen, because they tended to read and absorb them even when experience told them the order of field entry.  But information transfer latency can be worse; if messages are sent at the pace that acknowledgments can be received, even latencies of less than 50ms can impact application performance.

The sources of latency in an actual networked application are just as complex, maybe even more so.  There is what can be called “initiation latency”, which represents the time it takes to convert a real-world condition into an event.  Then we have “transmission latency”, which is the time it takes to get an event to or from the processing point, and then the “process latency”, which is the cumulative delay in actually processing an event through whatever number of stages are defined.  Finally, we have “termination latency”, which is the delay in activating the control system that creates the real-world reaction.

The problem we tend to have in dealing with latency is rooted in the tendency to simplify things by omitting one, or even most, of the sources of latency in a discussion.  For example, if you send an event from an IoT device on 4G to a serverless element in the public cloud, you might experience a total delay of 300ms (the average reported to me by a dozen enterprises who have tested serverless).  If 5G can reduce latency by 75%, as some have proposed, does that mean I could see my latency drop to 75ms?  No, because 200 of the 300ms of latency is associated with the serverless load-and-process delay.  Only 100ms is due to the network connection, so the most I could hope for is a drop to 225ms.
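The arithmetic is worth doing explicitly, so here it is as a few lines of Python, using the figures cited above:

```python
# Back-of-envelope decomposition of the serverless IoT example above.
end_to_end_ms = 300        # observed total: IoT event in, response out
process_ms = 200           # serverless load-and-process share
network_ms = end_to_end_ms - process_ms   # the network contributes 100 ms

network_after_5g_ms = network_ms * (1 - 0.75)  # a 75% cut in network latency
best_case_ms = process_ms + network_after_5g_ms
print(best_case_ms)  # 225.0, not 75: 5G can't touch the process share
```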

The key point here is that you always have to separate “network” and “process” latencies, and expect new technology to impact only the area that the technology is changing.  IP networks with high-speed paths tend to have a low latency, so it’s very possible that the majority of network latency lies in the edge connection.  But mobile edge latency even without 5G averages only about 70ms (global average), compared to just under half that for wireline, and under 20ms for FTTH.  Processing latency varies according to a bunch of factors, and for application design it’s those factors that will likely dominate.

There are four factors associated with process latency, and they bear an interesting resemblance to the factors involved in latency overall.  First there’s “scheduling latency”, which is the delay in getting the event/message to the process point.  Second, there’s “deployment latency”, which is the time needed to put the target process in a runnable state.  Third is the actual process latency, and fourth the “return latency”, associated with getting the response back onto the network and onward to the target.  All of these can be influenced by application design and where and how things are deployed.
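A simple budget model shows why the deployment factor in particular matters; the numbers below are placeholders, not measurements:

```python
# Sketch of the four process-latency factors as an additive budget.
from dataclasses import dataclass

@dataclass
class ProcessLatency:
    scheduling_ms: float   # getting the event to the process point
    deployment_ms: float   # making the target process runnable (0 if resident)
    process_ms: float      # actually executing the logic
    return_ms: float       # getting the response back onto the network

    def total(self) -> float:
        return (self.scheduling_ms + self.deployment_ms
                + self.process_ms + self.return_ms)

resident = ProcessLatency(5, 0, 20, 5)      # warm, pre-deployed process
serverless = ProcessLatency(5, 150, 20, 5)  # cold start dominates
print(resident.total(), serverless.total())  # 30 versus 180
```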

The practical reality in latency management is that it starts with a process hierarchy.  This reality has led to all sorts of hype around the concept of edge computing, and while there is an edge computing element involved in latency management, it’s in most cases not the kind of “edge-of-the-cloud” or “edge computing service” that we hear about.

The first step in latency management is to create a local event handler for the quick responses that make up most “real-time” demands.  Opening a gate based on the arrival of a truck, or the reading of an enabling RFID on a bumper, is a local matter.  Everything, in fact, is a “local matter” unless it either draws on a data source that can’t be locally maintained, or requires more processing than a local device can provide.  In IoT, this local event handler would likely be a small industrial computer, even a single-board computer (SBC).

The goal is to place the local event handler just where the name suggests, local to the event source.  You don’t want it in the cloud, or even in a special edge data center, but right there: onboard a vehicle, in a factory, and so forth.  The closer it is, the less latency is added to the most latency-critical tasks, because that’s where you’ll want to move them.

In network terms, meaning virtual or cloud-network terms, you want to ensure that your local event handler is co-located with the network event source.  It can be literally in the same box, or it can be in a rack or cluster or even data center.  What you’re looking for is to shorten the communications path, so you don’t eat up your delay budget moving stuff around.

The second step is to measure the delay budget of what cannot be handled locally.  Once you’ve put what you can inside a local event handler, nothing further can be done to reduce latency for the tasks assigned to it, so there’s no sense worrying about that stuff.  It’s what can’t be done locally that you have to consider.  For each “deeper” event interaction, there will be a latency budget associated with its processing.  What you’ll likely find is that event-handling tasks will fall into categories according to their delay budgets.

The local-control stuff should be seen as “real-time” with latency budgets between 10 and 40ms, which means on the average as fast as any human reaction.  At the next level, the data I get from enterprises says that the budget range is between 40 and 150ms, and most enterprises recognize that there is a third level with a budget of 150 to 500ms.
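Those three budget ranges imply a simple dispatching rule.  Here’s a sketch; the tier names and the dispatch mechanics are just illustrations of the ranges above, not a product design:

```python
# Pick a processing tier by delay budget, per the three levels above.
def select_tier(budget_ms: float) -> str:
    if budget_ms <= 40:    # "real-time": local event handler only
        return "local"
    if budget_ms <= 150:   # second level: nearby, but not necessarily edge-DC
        return "regional"
    if budget_ms <= 500:   # third level: deeper cloud processing is fine
        return "deep"
    return "batch"         # beyond 500 ms, latency isn't the constraint

for task, budget in [("open-gate", 30), ("update-manifest", 120),
                     ("analytics", 400)]:
    print(task, "->", select_tier(budget))
```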

In terms of architecture for latency-sensitive applications, this division suggests that you’d want a local controller (as local as you can make it) that hands off to another process that is resident and waiting.  The next level could in theory be serverless, or consist of distributed microservices, or whatever, but it’s almost certain that such a structure, using today’s tools for orchestration and connectivity, couldn’t meet the budget requirements.  The data I have on cloud access suggests that it’s not necessary for even the intermediary-stage (40-150ms) processing to be in a special edge data center, only that it not be processed so distant from the local event handler that the hop latency gets excessive.

The latency issue, then, is a lot more complicated than it seems.  5G isn’t going to solve it, nor will any other single development, because of the spread of sources.  However, there are some lessons that I think should be learned from all of this.

The first one is that we’re being too cavalier with modern orchestration, serverless, and service mesh technology as applied to things like IoT or even protocol control planes.  Often these technologies will generate latencies far greater than even the third-level maximum of 500ms, and right now I’m doubtful that a true cloud implementation of microservice-based event handling using a service mesh could meet the second-level standard even under good conditions.  It would never meet the first-level standard.  Serverless could be even worse.  We need to be thinking seriously about the fundamental latency of our cloud technologies, especially when we’re componentizing the event path.

The second lesson is that application design to create a series of hierarchical control-loop paths is critical if there’s to be any hope of creating a responsive event-driven application.  You need to have “cutoff points” where you stage processing to respond to events at that point, rather than pass them deeper.  That may involve prepositioning data in digested form, but it might also mean “anticipatory triggers” in applications like IoT.  If you have to look up a truck’s bill of lading, you don’t wait till it presents itself at the gate to the loading dock.  Read the RFID on the access road in, so you can just open the gate and direct the vehicle as needed.
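The truck-gate example reduces to a few lines of code.  Everything here is hypothetical, including the lookup, but it shows the shape of an anticipatory trigger: do the slow work on the access road, so the gate decision is a fast local hit.

```python
# Anticipatory trigger: prefetch on the access-road read, decide at the gate.
def slow_bill_of_lading_lookup(tag: str) -> dict:
    return {"tag": tag, "dock": 7}   # stands in for a multi-hundred-ms query

prefetched = {}  # staged, pre-digested data keyed by RFID tag

def on_access_road_read(tag: str) -> None:
    # Plenty of time here: the deep lookup runs while the truck drives in.
    prefetched[tag] = slow_bill_of_lading_lookup(tag)

def on_gate_read(tag: str) -> str:
    # Latency-critical path: a local dictionary hit, no deep query.
    return "open" if tag in prefetched else "hold"

on_access_road_read("TRUCK-42")
print(on_gate_read("TRUCK-42"))  # open
```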

The third lesson is that, as always, we’re oversimplifying.  We are not going to build any new technology for any new application or mission using 500-word clickbait as our guiding light.  Buyers need to understand a technology innovation well enough to define their business case and assess the implementation process to manage risks.  It’s getting hard to do that, both because technology issues are getting more complex, and because our resources are becoming more superficial by the day.

I’ve done a lot of recent work in assessing the architectures of distributed applications, especially cloud and cloud-native ones.  What I’ve found is that there isn’t nearly enough attention being paid to the length of control loops, the QoE of users, or the latency impact of componentization and event/workflows.  I think we’re, in an architectural sense, still transitioning between the monolithic age and the distributed age, and cloud-native is creating a push to change that may be in danger of outrunning our experience.  I’m not saying we need to slow down, but I am saying we need to take software architecture and deployment environment for cloud-native very seriously.

Navigating the Road to Cloud-Native Network Functions

The NFV community accepts the need to modernize, but it’s more difficult for them to say what “modern” looks like.  Certainly there’s pressure for change, but the pressure seems as much about tuning terminology as actually changing technology.  Nowhere is this more obvious than in the area of “cloud-native”.

Virtual Network Functions (VNFs), the meat and potatoes of NFV, run in virtual machines.  That VM link generates two specific issues.  First, the number of VMs you can host on a server is limited, which means that the mechanism isn’t efficient for small VNFs.  Second, a VM carries with it the overhead of the whole OS and middleware stack, which not only fills up resources, it increases the operations burden.

One proposed solution is to go to “CNFs”, which some have called “cloud network functions” and some “containerized network functions”.  The latter is the better definition, because the approach is really about making containers work for VNF hosting, but even here there’s room for cynicism.  The lingua franca of container orchestration is Kubernetes, but a fair chunk (and perhaps a dominant one) of the NFV community is looking more at the OpenStack platform, since OpenStack was a part of the original NFV spec.

The other solution is to go all the way to “cloud-native”, which is a challenge given that the term is tough to define even outside the telco world.  We can fairly say that “cloud-native” is not just sticking VNFs in containers, but what exactly is it, and what does it involve?  I’ve mentioned cloud-native network functions (CNNFs) in prior blogs, but not really addressed what’s involved.  Let’s at least give it a go now.

A CNNF, to be truly “cloud-native”, should be a microservice, which means that it should be a fairly small element of functionality and that it should not store data or state internal to the code.  That allows it to be instantiated anywhere, and allows any instance to process a given unit of work.  The biggest problem we have in CNNF creation, though, may be less the definition and more the source of the code itself.
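Here’s a minimal sketch of that statelessness rule.  I’m assuming an external state store (Redis, etcd, or the like); a plain dictionary stands in for it, and the function and event shape are invented for illustration.

```python
# Stateless CNNF sketch: state lives outside the function, never inside it.
state_store = {}  # stand-in for a shared, external state service

def handle_flow_event(flow_id: str, event: dict) -> dict:
    state = state_store.get(flow_id, {"packets": 0})  # fetch, don't hold
    state["packets"] += event.get("count", 1)
    state_store[flow_id] = state                      # write back externally
    return {"flow": flow_id, "packets": state["packets"]}

# Because no state lives inside the function between calls, a scheduler can
# run N copies anywhere and route any event to any copy.
print(handle_flow_event("f1", {"count": 3}))  # {'flow': 'f1', 'packets': 3}
```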

When the first NFV ISG meeting was held in Silicon Valley in 2013, there was a fairly vocal dispute over the question of whether we needed to worry about decomposition of current code before we worried about how to compose services from VNFs.  A few in the meeting believed strongly that if current physical network functions (PNFs) hosted in devices were simply declared “virtual” by extracting the device code and hosting it, the value of NFV would be reduced.  Others, myself included, were concerned, for four reasons.

First, there would likely be a considerable effort involved in decomposing current code, and vendors who owned PNFs wouldn’t be likely to be willing to undertake the effort for free.  That would likely raise the licensing fees on VNFs and impact the business case for NFV.

Second, there would likely be pressure to allow decomposed PNFs to be reassembled in different ways, even mixing vendor sources.  That would require a standardization of how PNFs were decomposed, and the vendor-mixing process would surely reduce vendor interest.

Third, it was clear that if you composed a service from a chained series of VNFs, the network latency associated with the VNF-to-VNF connections could impact performance to the point where the result wouldn’t be salable at all.

Finally, there were clearly some network functions that were inherently monolithic.  It’s hard to decompose the forwarding plane of a device at the end of a packet trunk.  What would be the strategy for handling those?

In the end, the decision was made to not require decomposition of existing PNFs, and that was probably the right decision.  However, no decision was even considered on whether to support the notion of decomposed PNFs, and that has proved to be unfortunate, because had there been such a decision, we might have considered the CNNF concept earlier.

The four points above, in my view then as now, really mean that there is no single model that’s best for hosting VNFs.  The key point in support of CNNFs is that they’re not likely to be the only “NFs” to be hosted.  My own proposal was that there be a service model for each service, that there be an element of the model representing any network function, and that the element specify the “Infrastructure Manager” needed to deploy and manage it.  That still seems, in some form, at least, to be the best and only starting point for CNNFs.  That way, whatever is needed is specified.

Some network functions should deploy in white boxes.  Some in bare-metal servers, some in VMs, some in containers.  The deployment dimension comes first.  Second would come the orchestration and management dimension, and finally the functional dimension.  This order of addressing the issue of network functions is important, because if we disregard it, we end up missing something critical.

The orchestration and management processes used for deployment have to reflect the things on both sides.  Obviously, we have to deploy on what we’re targeting to deploy on.  Equally obvious, we need to deploy the function we want.  The nature of that function, and the target of deployment, establish the kind of management and orchestration we need, and indirectly that then relates to the whole issue of how we define CNFs and CNNFs, and what we do differently in each.

If we want to exploit cloud-native anywhere, I think we have to accept that the network layer divides into the data/forwarding plane and the control plane.  The former is really about fast packet throughput and so is almost surely linked to specialized hardware everywhere but the subscriber edge.  The latter is about processing events, which is what control-plane datagrams in IP are about.  The control plane is quite suitable for cloud-native implementation.  The management plane, the way all of the elements are configured and operationalized, is a vertical layer if we imagine the data/control planes to be horizontal.

The management-plane stuff is important, I think, because you can view management as being event-driven too.  However, if you are going to have event-driven management, you need some mechanism to steer events to processes.  The traditional approach of the event queue works for monolithic/stateful implementations, but it adds latency (while things wait in the queue), doesn’t easily support scaling under load (because it’s not stateless), and can create collisions when events come fast enough that there’s something in the queue that changes conditions while you’re trying to process something that came before.  The TMF NGOSS Contract approach is the right one; service models (contracts) steer events to processes.
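A toy version of the NGOSS Contract idea makes the point.  The states, events, and handler processes below are invented; what matters is that the service model’s state, not a queue, decides which process sees the event.

```python
# Service-model event steering: (state, event) -> process, per NGOSS Contract.
def on_activate(svc):
    svc["state"] = "active"
    return "deployed"

def on_fault(svc):
    svc["state"] = "degraded"
    return "remediation started"

def on_fault_while_degraded(svc):
    return "escalated to operations"

STATE_EVENT_TABLE = {
    ("ordered", "activate"): on_activate,
    ("active", "fault"): on_fault,
    ("degraded", "fault"): on_fault_while_degraded,
}

def steer(svc: dict, event: str) -> str:
    handler = STATE_EVENT_TABLE.get((svc["state"], event))
    return handler(svc) if handler else "event ignored in this state"

svc = {"state": "ordered"}
print(steer(svc, "activate"))  # deployed
print(steer(svc, "fault"))     # remediation started
print(steer(svc, "fault"))     # escalated to operations
```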

The event-driven processes can be stateless and cloud native, and they can also be stateful and even monolithic, providing that they are executed autonomously (asynchronously) so they don’t hold up the rest of the processing.  Thus, you could in theory kick off transaction processing from an event-driven model as long as the number of transactions wasn’t excessive.

The hosting of all of this will depend on the situation.  Despite what’s been said many times, containers are not a necessary or sufficient condition for cloud-native.  I think that in most cases, cloud-native implementations will be based on containers for efficiency reasons, but there are probably situations where VMs or even bare metal are better.  There’s no reason to set a specific hosting requirement, because if you have a model-and-event approach, the deployment and redeployment can be handled in state/event processes.  If you don’t (meaning you have an NFV-like Virtual Infrastructure Manager), then the VIM should be specific to the hosting type.  I do not agree with the NFV approach of having one VIM; there should be as many VIMs as needed.
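The “as many VIMs as needed” point is easy to sketch: the service model names a hosting type, and a registry maps that type to the right infrastructure manager.  The VIM classes here are placeholders, not real orchestration code.

```python
# One VIM per hosting type, selected by the service model.
class ContainerVIM:
    def deploy(self, nf: str) -> str: return f"container deploy of {nf}"

class VmVIM:
    def deploy(self, nf: str) -> str: return f"VM image boot for {nf}"

class WhiteBoxVIM:
    def deploy(self, nf: str) -> str: return f"config push of {nf} to a white box"

VIM_REGISTRY = {
    "container": ContainerVIM(),
    "vm": VmVIM(),
    "white-box": WhiteBoxVIM(),
}

def deploy_network_function(nf: str, hosting: str) -> str:
    # The model element specifies its own hosting; the registry does the rest.
    return VIM_REGISTRY[hosting].deploy(nf)

print(deploy_network_function("5g-upf", "white-box"))
```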

And then there’s latency.  If you are going to have distributed features make up services, you have to pay attention to the impact of the distribution process on the quality of experience (QoE).  Stringing three or four functions out in a service chain over some number of data centers is surely going to introduce more latency than having the same four processes locally resident in a device at the edge.  The whole idea was silly, in my view, but if latency can kill NFV service chains, what might it do to services built on a distributed set of microservices?  If you’re not careful, the same thing.

CNFs do have value, because containers have a value in comparison to the higher-overhead VMs.  CNNFs would have more value, but I think that realizing either is going to require serious architecting of the service and its components.  Separation of control and data planes is likely critical for almost any network function success, for example.  Even with that, though, we need to be thinking about how the control plane of IP can be harnessed, and perhaps even combined with “higher-layer” stuff to do something useful.  Utility, after all, is the ultimate step in justifying any technology change.

Are Network Vendors Jinxed on M&A?

Why do so many vendors mess up acquisitions?  It’s always been a relevant question because, well, vendors always seem to mess them up.  It’s relevant now because of Ericsson’s announcement that it was acquiring Cradlepoint, a deal that could be a poster child for a number of the issues I’m going to raise.

There are two broad reasons why a tech acquisition makes sense.  The first is that you’re buying revenue or a unique customer base, and the second is that you’re buying a position in a market growing so fast that you can’t wait to develop your own product.  While it might seem obvious, the key for any vendor looking to acquire is to first identify which of these is their motivation, and second to protect that proposed benefit with specific steps.

Buying revenue is a pretty straightforward reason, and if the deal is being shepherded by the CFO it’s likely to make sense.  The key is usually to look at the price-to-revenue ratio of both companies.  The ideal situation comes when a company trading at a low price-to-revenue ratio is acquired by a company with a higher one.  The deal alone will apply the buyer’s multiple to the seller’s revenue, and it’s a win.
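The multiple-arbitrage arithmetic is simple; here it is with made-up numbers:

```python
# Why a low-multiple seller is a "win" for a high-multiple buyer.
buyer_price_to_revenue = 5.0   # buyer trades at 5x revenue
seller_price_to_revenue = 2.0  # target trades at 2x revenue
seller_revenue_b = 1.0         # target revenue, in $B

standalone_value = seller_revenue_b * seller_price_to_revenue  # $2B alone
absorbed_value = seller_revenue_b * buyer_price_to_revenue     # $5B inside buyer
print(standalone_value, absorbed_value)  # the spread is the win
```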

Buying a customer base isn’t nearly as easy to navigate.  There has to be a sense of symbiosis, meaning that you could expect to sell your product into the other company’s base.  You obviously have to be able to keep the base intact, the base has to consist of a reasonable number of real prospects for your products, and you have to be able to deliver the message through the sales channel.  Sometimes the “old” sales channel won’t be able to absorb the “new” message, and if they can’t and you decide to augment or replace them, you may lose the base.

If you want real complexity, a real potential for a mess, though, it’s impossible to beat the situations where someone buys a company to get a position in a critical future technology.  This can fail for a whole bunch of reasons, so let’s just go through them, starting with the ones I’ve seen the most.

The biggest issue I’ve seen by far is lack of any real vision of the future, driving, opportunity.  Network vendors based in the US (not to name any names!) have a tendency to buy companies more for sales objection management than anything else.  “Gosh, I’d have won that deal if we’d had one of those darn widgets,” the salesperson tells the CEO.  The CEO thinks a moment, then says “Well, darn it, we’ll just buy us a widget company!” and does.  Maybe it’s not quite this stupid (at least, not always), but what does seem to be a universal problem is accepting sales input for what should be a strategic decision.

You have to start a future-strategic-position deal with a clear definition of the position you’re depending on.  A market is an ecosystem, so don’t think “IoT drives 5G” as being a statement of a strategic vision.  It’s a hope, unless you can frame out what the market that creates the drive is, who the players for the critical roles will be, and why those players will accept those roles.

You can easily see how believing that “IoT drives 5G” (for example) could lead a 5G player to think about buying into the IoT space, but that would likely be wise only by happy accident.  There are a lot of steps, players, and products that need to connect those two dots, and unless the buying company is darn sure they can make all those steps happen, the decision is a major risk.

A good question to ask is “is the company I’m buying already seeing revenue growth from the opportunity I think the deal will help me exploit?”  If the answer isn’t a decisive “Yes!” then there’s probably some pieces missing in the opportunity ecosystem.  Too many companies mistake media hype for revenue growth here.  The media will run any exciting story.  They will seek out people who can say something that makes it more exciting.  If they ask an analyst how big the market is, and they get a ten-billion-dollar-per-year answer, they’ll tell the next analyst they ask that the bid is ten billion, and ask if they’re willing to raise.  That’s how we get future market estimates that rival the global GDP.

Even if you have a viable opportunity and a credible ecosystem, you still have to decide whether you can exploit it.  If an emerging opportunity is so credible and immediate that you can’t wait to build something of your own, and yet you have no product available to support it, then you’re probably not very good at strategizing that market.  Is the company you’re buying any better at it?  Where are the smarts needed going to come from?  Can either company quickly generate a lot of media buzz (get some analysts to help if you can’t!)?  Can you put together a strategy that you can execute on quickly enough?  All these are critical questions.

All of the answers to all the questions may depend on another point, which happens to be another of the reasons why company acquisitions go wrong.  Can you, acquiring a new company, merge the workforces and cultures quickly enough, and with minimal resentment, so that a new working team can be created?  Assume the answer is “No!” unless you’re ready to work hard, and when your whole reason for the merger is to reduce combined headcount, you’re really in it deep.

A decent number of M&As are driven by what’s called “consolidation”.  The theme is that Company A and B are competitors, both with decent market share.  If one buys the other, the resulting company would have the combined revenue of the two, and would be able to cut a bunch of cost, including workforce and real estate.  That’s arguably what brought Alcatel and Lucent together, and that marriage is still in counseling, even after the whole package ended up inside Nokia.  Any time there are going to be job cuts associated with M&A, you can assume that all those who see themselves at risk will start looking around “just in case”.  If they find something, they may not wait to see how the dice fall.  Those who are most likely to move are those with the best skills, who can attract the most favorable offers.  In short, the very ones you needed to keep.

Symbiotic M&A is fairly common in the software space, but even there it poses challenges.  The VMware decision to buy Pivotal, for example, is going to take some positioning, marketing, and strategy finesse to make it work.  In the network space, I think it’s fairly rare to see a good M&A.  For some companies, it never seems to happen.

That’s what Ericsson should be thinking about.  Ericsson has never been a marketing powerhouse.  Like most telco suppliers, they’re not strategic wizards either.  Imagine, given the formula for successful M&A, how difficult the job becomes if you don’t understand much about strategy and can’t communicate what you do know in any case!  There has to be a real vision of the future behind almost any M&A deal that’s destined to succeed.  Ericsson needs to promote that vision in every way, or they’ve just tossed a bundle of money away.

Event-Driven OSS/BSS: Useful, Possible…?

Why are operators interested in event-driven OSS/BSS?  Maybe I should qualify that as “some operators” or even “some people in some operators”.  In the last ten days, I’ve heard (for the second time) a sharp dispute within an operator regarding the future of OSS/BSS.  I’ve also heard vendors and analysts wonder what’s really going on.  Here’s my response.

First, it’s important to remember that OSS/BSS systems were originally “batch” operations, meaning that their inputs weren’t even created in real time.  People processed orders and entered them, or field personnel went out and installed or fixed things, and then changed the inventories.  In short, OSS/BSS started off like all IT started, with something like keypunching and card reading.

Over time, what we call “OSS/BSS” today evolved as a transaction processing application set built for the network operator or communications service provider.  Online transaction processing (OLTP) replaced batch processing for the human-to-systems interaction model.  Eventually, as it did in other verticals, this OLTP was front-ended by web portals for customer order and self-service, and even for use by both the operators’ office staff and field personnel.

The recent trends (recent in telco-speak, meaning within the last decade) that have been pushing some people and operators toward change have emerged largely from the network side rather than from the service side.  Physical provisioning used to be the rule.  If you wanted a phone line, somebody came out with a toolbelt and diddled in some strange-shaped can sitting (usually at a crooked angle) somewhere on the lawn.  You still need some physical provisioning, of course, but with IP networks the “service” is delivered over access facilities often set in place earlier.  Most of what has to be handled relates to the network side, now automated rather than manually configured.

So now, we have a network management system that actually tweaks all the hidden boxes in various places and creates our service.  You make service changes as box-tweaks, and if there’s a problem a box manager tells the service manager about it, so the customer care portal can be updated and so an SLA violation can (sometimes) be considered for escalation or charge-back.

You can glue this new stuff onto the existing transaction-and-front-end-portal stuff, of course, and that’s what most operators have done.  Changing out an OSS/BSS system would be, for most operators, about as stressful as changing out a demand-deposit accounting system for retail banks.  The people who don’t want to see OSS/BSS revolution represent the group of operators where this stress dominates.

There are two issues that are driving some operators and some operator planners to doubt this approach.  First, there’s the long-standing operator fear of lock-in.  Remember how hard changing OSS/BSS systems would be?  For a vendor it’s like buying a bond—regular payments and little risk.  Second, as services created via the NMS get more complex and change more often, the relationship between OSS/BSS and NMS gets tighter, and the OSS/BSS can constrain what could be offered to customers.

When you have users at portals, customer service reps at transaction screens or portals of their own, and network condition changes all pouring into the same essentially monolithic applications, no good can come from it.  Collisions in commands and conditions can bring about truly bad results, and this is why service complexity tends to favor a modernized approach to OSS/BSS.  It’s also why you hear about making things “event-driven”.

But being event-driven opens other doors.  If we go back (as I know I often do!) to the TMF NGOSS Contract model, we find an approach that links network and operations events to processes via a contract data model.  Event “X” in Service State “3” means “Run Service Bravo”.  This has major implications for lock-in, and even for whether there needs to be anything we’d call an OSS/BSS at all.
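
To make the state/event idea concrete, here’s a minimal sketch of NGOSS-Contract-style event steering in Python.  Everything here is my own hypothetical illustration, not an API from any TMF specification: a contract record holds the service state, and a state/event table maps each (state, event) pair to the process that should run.

```python
# Minimal sketch of contract-mediated event steering; all names are
# hypothetical illustrations, not TMF-defined interfaces.

def run_service_bravo(contract, event):
    # A stateless "event service": it reads and writes only the
    # contract data model, so any instance could handle this event.
    contract["last_event"] = event["type"]
    return "4"  # the next service state

# The contract data model: all service context lives here, not in code.
contract = {"service_id": "svc-001", "state": "3", "last_event": None}

# Event "X" in Service State "3" means "Run Service Bravo".
state_event_table = {
    ("3", "X"): run_service_bravo,
}

def handle_event(contract, event):
    process = state_event_table.get((contract["state"], event["type"]))
    if process is None:
        raise ValueError("no process bound to this state/event pair")
    contract["state"] = process(contract, event)

handle_event(contract, {"type": "X"})
print(contract["state"])  # -> "4"
```

The point of the table is that the steering logic is data, not code; swapping one event service for another is an edit to the model, which is why the approach cuts against vendor lock-in.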

What the NGOSS Contract does is to “service-ize” operations software.  Instead of a big monolithic chunk of code, you have a bunch of services/microservices that do very specific things.  A software vendor could offer “event services” instead of monolithic OSS/BSS systems.  Some of the event services might actually look a lot like a retail portal or an analytics tool, so some general-purpose “horizontal” software could be included where appropriate.  Operators could mix and match, which may be why vendors really hate this approach.

It may also explain why the whole TMF NGOSS Contract thing didn’t take off when it came along well over a decade ago.  The TMF has recently shown some signs it would like to resurrect the concept and make something of it, but in order for that to happen, the network operator members of the body will have to out-muscle the OSS/BSS vendors.  In most international groups I’ve been involved with, the operators are novices about manipulating group processes, so this is going to be a difficult challenge both for the operators and for the TMF.

What happens here could be important for the OSS/BSS space, providing that the TMF does advance the NGOSS Contract notion and that it’s actually implemented.  Remember that this was advanced once and wasn’t implemented, so we can’t take TMF acceptance as an indication of an actual product change.  If we do get this change, it could be the beginning of a period of rationalization of operations and business support software, which most would probably agree is long overdue.  It might also have other impacts.

The first is that this change might percolate into service provider operations overall.  Network management is even more event-centric than OSS/BSS, and yet the NMS and even zero-touch automation models currently evolving are really as monolithic as OSS/BSS systems.  If operators see the advantages of event-driven OSS/BSS, can they fail to see that extending the principle into network management, and thus to operations overall, would benefit them significantly?

The next question is whether a “contract-data-model” approach, combined with event-to-process steering by the model, could be used in other applications.  Operations processes have generally been converging on the same transactional-and-portal shift that applications overall have followed, so could operations software lead the rest of the space into a contract-driven approach?  If so, it would be a total software revolution.

The use of the contract-data-model approach would guide the path toward the use of microservices in application software.  Almost anything that deals with the real world, including traditional transaction processing, can be re-visualized as event-driven.  The limitations the approach puts on the processes (namely, that they’re stateless insofar as they operate only on the contract data model itself) could encourage functional decomposition to the appropriate level.  The result would be resilient and scalable because any instance of any given process could handle the events that the state/event relationship targeted to it.
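
A rough sketch of why that statelessness matters, again with hypothetical names: because a process reads and writes only the contract record it’s handed, any replica can service any event, so events can simply be fanned out across a pool of identical workers.

```python
# Sketch: stateless processes scale out because any instance can
# handle any event; all context rides in the contract data model.
from concurrent.futures import ThreadPoolExecutor

def process_event(contract, event):
    # Stateless: no module- or instance-level state is touched; the
    # outcome depends only on the (contract, event) pair passed in.
    return {**contract, "state": "done", "handled": event["type"]}

contracts = [{"id": i, "state": "active"} for i in range(4)]
events = [{"contract": c, "type": "X"} for c in contracts]

# Any of the pool's identical workers can take any of the events.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda e: process_event(e["contract"], e), events))

print([r["state"] for r in results])  # -> ['done', 'done', 'done', 'done']
```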

Might this revolutionize the cloud, even create a kind of SaaS-as-microservices future?  Not so fast.  All event-driven systems have an inherent sensitivity to latency, because the in-flight time of data creates a window in which simultaneous events can’t be contextualized.  The same problem occurs (more often) in monolithic software implementations of event systems, where events have to be queued for processing when resources are available, and this loss of context is one reason why that monolithic approach isn’t suitable for event-driven systems, including those of network and service management (which is why I don’t like ONAP).

Apart from the contextual problems, event-driven systems have to manage latency to prevent workflows from accumulating too much of it as they pass through a sea of microservices.  One of the reasons why it’s important to view a network as a series of hierarchical black-box intent models is to control the scope of event flows, so that you don’t end up with excessive response times.  If you want to believe in SaaS-for-everything and event-driven at the same time, you’d have to think carefully about how the contract data models and state/event tables were structured, and plot the event flows and workflows carefully.  Of course, you don’t have to go event-driven to make OSS/BSS a SaaS project; monolithic software can be hosted in the cloud and offered as a service, but that’s another topic.
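
Here’s a hedged sketch of that scope-control idea, purely my own illustration rather than any standard API: each black-box intent model handles its own events internally and emits a single summarized event upward only when it can’t meet its SLA, so event traffic never floods the whole hierarchy.

```python
# Sketch: hierarchical black-box intent models that contain event scope.
class IntentModel:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def remediate(self, event):
        # Hypothetical local repair; True means the SLA is restored.
        return event.get("severity", 0) < 5

    def handle_event(self, event):
        if self.remediate(event):
            print(f"{self.name}: handled inside the box, nothing escapes")
        elif self.parent is not None:
            # Only one summarized event crosses the model boundary.
            self.parent.handle_event({"type": "sla_violation",
                                      "source": self.name,
                                      "severity": event.get("severity", 0)})
        else:
            print(f"{self.name}: top-level escalation to operations")

service = IntentModel("service")
vpn = IntentModel("vpn", parent=service)
vpn.handle_event({"type": "link_fail", "severity": 3})  # stays local
vpn.handle_event({"type": "link_fail", "severity": 9})  # escalates once
```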

I believe that OSS/BSS systems are inevitably moving from specialized purpose-built software to collections of horizontal tools.  The key, for operators, is to recognize that at the same time there’s an event-driven dimension moving the needle, and if they ignore the latter trend, they may end up with a bunch of connected monoliths instead of a collection of services and microservices, and in the cloud age that would be a very bad outcome.

Why Separating the Control and Data Planes is Important

The separation between control and data planes is a pretty-well-accepted concept, but one complicating factor in the picture is that the term “control plane” isn’t used consistently.  Originally, the “control plane” was the set of protocol messages used to control data exchanges, and so it was (as an example) a reference to the in-band exchanges that are used in IP networks for topology and status updates.  With the advent of more complex services, including 5G, the term “control plane” references the exchanges that manage the mobile network and infrastructure, riding above the “user plane”, which is now the IP network and thus includes the data plane and the old control plane!

About a year ago, I did a presentation to some operators that used a slide defining four layers of a network.  I put the data plane at the bottom and the control plane next, then the service plane, which I said was a better name for the layer of signaling that mediates the IP data flows of the lower two layers.  At the top of it all I put the experience plane, which provides the actual experiences that users of a network are seeking—video, websites, etc.

We actually have examples of things in each of these four areas today, and the way the planes relate to each other and to the user is important in understanding the best approach to defining future network architectures and infrastructure.  However, as soon as you start looking at that problem, you confront yet another “plane” that runs vertically along all these horizontal layers.  That’s the virtualization plane, and it contains the signaling that manages whatever hosting and deployment processes are used to realize the functionality of the other planes on a cloud infrastructure.

When we talk about things like 5G, we’re dealing mostly with service functions, and so we’re dealing mostly with things that live in the three lower planes.  My experience plane is relevant to 5G only in the sense that the things in it are part of the public data network that 5G user plane activity is connecting with.  The specs don’t define how we actually achieve virtualization, except through reference to NFV, which actually does a pretty poor job of defining the way we’d want to deploy and manage functional components in any of those planes.

You can see a bit of this, by inference, in my blog about Ericsson’s complaints about Open RAN security.  Since we’re not dealing with the actual mechanisms of virtualization in an effective way, there are indeed potential security issues that emerge.  Not just with Open RAN (as I noted), though.  All of the network planes have the potential for security issues at the functional level (a function could be compromised), and to the extent that they interconnect elements or virtualize them, they also present security risks at the implementation level, meaning an attack on infrastructure or the orchestration and management interfaces.

The special challenge of virtualization, meaning the replacement of fixed appliances like routers or firewalls by hosted features (single features or chains or combinations thereof) that do the same thing, is that we’ve exposed a lot of additional stuff that not only has to be secure, but has to be somehow created and managed without generating a lot of operations burden.

This is why I’ve always liked the notion of “fractal service assembly” or “intent modeling”.  If we presumed that a given feature or function was represented by a functional abstraction linked to an opaque (black-box) implementation, and if we could make the resulting model enforce its own SLA within the black box, then the complexities of implementation are separated from the functional vision of the service.  Since that vision is what users of the service are paying for, that creates a level where service provider and service consumer speak the same (functional) language, and the devilish details are hidden in those black boxes.

One of the advantages of the black-box approach is that creating an abstraction called (for example) “Router-Element” with the same interfaces and management APIs as a real router would let you map those 1:1 to a real router.  Before you say “So what?”, the point is that you could map any software instance of a router, any SDN implementation of a router, or any collection of all of the above that did routing, all to that abstraction.  You now have an evolution strategy, but that isn’t the topic of this blog, so let’s move on.
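
Here’s a minimal sketch of that “Router-Element” idea (the class and method names are my own hypothetical choices): one functional abstraction that service logic binds to, with a physical router and an SDN implementation both hidden behind it.

```python
# Sketch: one functional abstraction, many opaque implementations.
from abc import ABC, abstractmethod

class RouterElement(ABC):
    """The functional abstraction: the only interfaces services see."""
    @abstractmethod
    def add_route(self, prefix: str, next_hop: str) -> None: ...
    @abstractmethod
    def status(self) -> str: ...

class PhysicalRouter(RouterElement):
    def add_route(self, prefix, next_hop):
        pass  # would push configuration to the device's management API
    def status(self):
        return "up"

class SdnRouteService(RouterElement):
    def add_route(self, prefix, next_hop):
        pass  # would install flow rules through an SDN controller
    def status(self):
        return "up"

def provision(element: RouterElement) -> str:
    # Service-layer code binds to the abstraction, never to the box.
    element.add_route("10.1.0.0/16", "10.0.0.1")
    return element.status()

print(provision(PhysicalRouter()))   # same service logic...
print(provision(SdnRouteService()))  # ...regardless of what's inside
```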

The relevant advantage is that we now have, because the implementation of what’s inside the black box is opaque, a unit of functionality that doesn’t assert any of the interfaces or APIs that the implementation (versus the functionality) requires.  We can contain the scope of the internal service and control plane layers of my model.  We can also now define a virtualization-plane process that operates only on the functions.  Inside the black boxes there may be virtualization, but it’s invisible.  This is what fractal service assembly is about.  You divide things up, make each thing opaque and autonomous, and then deal with a network of things and not one of specific technologies.

If the service plane, then, deals with “thing networks”, it becomes independent of the implementation of the things.  We can build services without worrying about implementations.  We can, by asserting proper interfaces from the data and control planes into the service plane, construct service interfaces for 5G’s “control plane” to consume.  Our black-box service plane is then the 5G user plane, and our service plane has a layer that creates the proper 5G interfaces, then defines the 5G processes that consume them.  Those, like our lower level, can be functional black boxes that hide the implementation.

This is the important point here, the conclusion I’m aiming for.  Every layer in our approach is a black box to the layers above, just like the OSI model mandated in 1974.  We have to define the binding interfaces between layers, represent them as intent-modeled interfaces, and then focus on how to implement the inside of those black boxes, both in terms of current technology and in terms of the various visions of the future.  The approaches that can do that will save us, and the others will be simply more marketing pap for the masses.

Is Ericsson Right About Open RAN Security?

Vendors love to rain on open initiatives, so it’s not surprising that Ericsson, perhaps the leading 5G vendor, is now casting clouds and shade on Open RAN.  Specifically, they’re warning about the security risks it might present.  Is this yet another of those cynical vendor ploys, or is there something to the issue?  In particular, are there practices that potential Open RAN users should look for, or look out for?

Open RAN is just what the name suggests, a model for implementing the 3GPP 5G RAN (or New Radio, NR) spec in an open way, rather than a way that encourages monolithic devices.  It’s gained a lot of traction in the media (see HERE and HERE in just the last week), and a lot of operators are also paying attention.  That, of course, is likely to arouse vendor concerns.

The fundamental objection Ericsson makes in the blog I referenced above is that Open RAN creates an “expanded threat surface.”  The key quote is “The introduction of new and additional touch points in Open RAN architecture, along with the decoupling of hardware and software, has the potential to expand the threat and attack surface of the network in numerous ways….”

I have a major problem with this, frankly.  What it sounds like is a plea to abandon software/hardware systems in favor of proprietary appliances.  Every 5G network, or any other kind of network, is controlled at the service and network levels by software running on servers, which sounds like decoupling of hardware and software to me.  Not to mention the fact that there’s probably not a single initiative in all of networking whose purpose is to bring software and hardware together in an appliance.

But let’s look at the specifics, and to do that, we have to use a reference for an Open RAN architecture, so I’m providing this figure from the O-RAN Software Community.  One specific point is the hardware/software decoupling, which is IMHO total nonsense, so we won’t pursue that further.

The first specific complaint Ericsson makes is that “New interfaces increase threat surface – for example, open fronthaul, A1, E2, etc.”  On the surface, this might seem at least possibly true, but I think that doesn’t survive examination.

The A1 interface, as the figure shows, links the Near-Real-Time RAN Intelligent Controller (RIC) element with orchestration and service management.  Every device in any network is going to have a management interface.  Every software element in an application or virtual-function-based service (remember that 5G is supposed to be based on NFV) has orchestration and management.  To claim this is an attribute specific to Open RAN is silly.  If your implementation of 5G doesn’t protect management and control-plane interfaces, you have a problem whether you’re Open RAN or not.

The E2 interface(s) are similar; they are part of the 5G “Control Plane” and should be totally separated from the user plane.  Who’s attacking there?  They’d have to be part of the 5G infrastructure, and again, if you have no protection for the components of your own infrastructure, there’s no hope for security in any form.

“Management interfaces may not be secured to industry best practices?”  Sure, that’s true, and it’s true for all the management and control-plane interfaces in any compliant 5G implementation.  One reason to separate data and control planes is to prevent security problems.  5G mandates that.  Sorry, but this is silly too.

The final point Ericsson makes is “(not exclusive to Open RAN):  adherence to Open Source best practices.”  The parenthetical qualifier says it all.  If this is a blog about Open RAN security problems, why include something that’s not specific to Open RAN?  In any event, open-source best practices aren’t necessarily best for network function virtualization in any form; they tend to be oriented toward IT and applications.

It is true that there are orchestration (DevOps), administration, and security measures appropriate to the design, development, and deployment of the Open RAN elements.  The same issues exist with VNFs, or management system components, or any other piece of software that’s used in a network service application.  Again, the solution to any Open RAN issues would be to address the issues for everything that’s software-implemented in 5G, NFV, OSS/BSS, NMS, ZTA, and all the rest.  We don’t even know at this point whether cloud solutions for deployment (like Kubernetes) or for event movement (like the Istio service mesh) could fully address the requirements of a network implementation.  Do we think that this means that we should turn back the clock to the network equivalent of the Old Stone Age and commit to a future of bearskins, stone knives, and network appliances?

So, is this a cynical play or a real definition of issues?  Both.  On the one hand, there is nothing in Ericsson’s blog that IMHO points to an Open RAN-specific problem with security.  What they’re saying is that if you’re stupid enough to fail to secure your own control and management interfaces, Open RAN will add to the things you’ve been stupid about.  True, but in a security sense, not relevant, which brings us to the other hand.  3GPP, NFV, SDN, and just about every other initiative in the telco space for the last decade, has played fast and loose with the question of how you isolate the critical control- and management-plane functions.  If they’re even addressable in the data plane, it’s a problem, period.  Even if they’re not, there needs to be a barrier between the components that could be provided as part of a service, and the components that are part of the service infrastructure.

I’ve complained about the lack of address space discipline in many of my blogs, relating most often to NFV (Ericsson was a member of the NFV ISG, and they didn’t stand up in support).  There’s a general tendency among telco people to ignore the issue, and I think it’s been ignored in the 3GPP too.  Any service control plane or “Level 3 control plane” (the two are different; in 3GPP 5G terms, the “user plane” is the Level 3 data plane and control plane, and the “control plane” is a higher-level 5G layer) should be separate from the data/user plane.  That can be accomplished by maintaining them in a separate address space (one of the RFC 1918 private IP address ranges) or by defining a virtual private network or SD-WAN.
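
As a crude illustration of that address-space discipline (the addresses and names here are hypothetical, and the sketch assumes the host actually has an interface in the management network), the listener below binds only to an RFC 1918 management address, so control traffic simply isn’t reachable from the data plane’s address space.

```python
# Sketch: keep management interfaces out of the data plane's address space.
import ipaddress
import socket

MGMT_NET = ipaddress.ip_network("10.255.0.0/16")  # private management space
MGMT_ADDR = "10.255.0.1"                          # hypothetical mgmt address

def open_mgmt_listener(port=8443):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((MGMT_ADDR, port))  # unreachable from data-plane addresses
    s.listen()
    return s

def is_mgmt_peer(addr: str) -> bool:
    # Defense in depth: reject peers outside the management space even
    # if a packet somehow arrives on the wrong side.
    return ipaddress.ip_address(addr) in MGMT_NET

print(is_mgmt_peer("10.255.3.7"))   # True: inside the management space
print(is_mgmt_peer("203.0.113.9"))  # False: data-plane/public address
```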

That same thing holds true for software elements and their deployment.  If you have a lazy process of development and certification for security and compliance, nobody has to attack you from the outside, they can simply sell a piece of malware into your framework and let it eat everything from the inside out.  A new generation of parasitism?  That’s not what we should be doing, but Open RAN doesn’t create the risk.

It may be true that the Open RAN stuff adds interfaces and APIs and even components to the threat surface, but there are more than enough interfaces there already to constitute a threat, and I don’t think that Open RAN increases the threat level in any meaningful way.  It’s surely true that if one were to apply address-space or virtual-network separation for the control layers of 5G, however it’s implemented, the result would protect all those interfaces/APIs and defend the threat surface universally, as long as you didn’t for some inexplicable reason keep some of the interfaces outside the scope of your remedy.

There are two lessons to be learned from Ericsson’s paper, I think.  First, these are troubling times for network operators and network equipment vendors, and in those situations, we have to be particularly wary of any statements or documents that clearly serve vendors’ interests, even if they’re couched in service-to-the-industry terms.  Vendors have always had blinders on with respect to things that touch their bottom lines, just as pretty much everyone else has always had.  If this is a surprise to anyone, welcome to the real world.

The second lesson is more important.  In the network operator or telco space, we are seeing a movement to the cloud that’s guided by bodies and people who don’t really understand the cloud.  That’s perhaps understandable too, because the technologists in both the vendor and operator spaces have focused on network-building as a task of assembling and coordinating boxes.  Once you make an explicit commitment to virtualization and functions, you are entering a whole new world.  Add in an explicit commitment to the cloud, and it’s immeasurably worse.  This is a field for specialists, and those specialists rarely work for telcos or their vendors.

At some point, the networking community is going to have to make a choice.  They can keep blundering around in the pseudo-cloud world they’ve created, a world that bears little resemblance to the real cloud world.  They can sit back and wait for the real cloud world to solve the problems of virtualized networking, either as a mission or by accident.  Or…they can learn enough about the real world of the cloud to guide themselves to their own best outcome.  The real message of the Ericsson complaint about Open RAN security is that it’s time to make that choice, or it will be made by default, and not in the optimum way.

I don’t want to beat Ericsson up unfairly on this; every single major network vendor engages in at least a subtle marketing sabotage campaign on threatening initiatives.  I’ve seen real arguments between vendors and operators at standards meetings, around points that would threaten a vendor’s incumbency in the name of openness.  I also want to note that the actual Ericsson blog is more a qualified list of concerns than the kind of overt attack on Open RAN that many stories on it described.  Still, if the goal of the piece was to call attention to the risks of virtual infrastructure, it should have been done in a more general and constructive way.

Oracle’s Quarter and the Future of SaaS

The biggest players in a space always set the tone, but they don’t always tell the story.  Oracle last week turned in a good quarter, and they’re in many ways an outlier in the public cloud space.  They don’t figure in most media cloud discussions as they rank number five in most public cloud reports, but they do represent a fairly unique blend of traditional (IaaS) and SaaS cloud services, and they do better in the latter.  They also represent a company that’s made the transition from selling software to selling cloud services pretty well.  They’re also the (apparent, but with unspecified terms) winner in the TikTok battle, which I’ll leave out of this blog pending more detail.

Financially speaking, Oracle reported a successful quarter, not only beating estimates but also beating last year’s pre-COVID results.  They also guided higher for the next quarter.  The bad news (at least somewhat bad) is that their latest quarter was below both of the prior quarters, so they did have some exposure to the COVID problem.  They may have a better exposure to the recovery, though, and it’s the market factors that create that exposure that make them interesting.

The impact of COVID on enterprise IT is still evolving, but it’s pretty clear that the virus and lockdown have influenced budget planning.  Today, over two-thirds of enterprises tell me that they plan to shift more spending to IT-as-an-expense, away from traditional capital-centric IT.  They like the idea of being able to grow and shrink spending to match their expected revenues, and they realize that a hit like the one they got from COVID could be weathered more easily were they more agile.

Some software vendors have addressed this trend, visible even before COVID, by shifting software to a subscription-license basis rather than one-time payment plus maintenance, with licensing based in some way on usage rather than something static like company size.  However, the most obvious way to address a need to shift from capital-IT to expense-IT is to use the cloud.  But….

…there is a big difference in the agility of cloud services.  Traditional cloud services will largely displace data center equipment and/or data center incremental investment.  Since cloud hosting tends to consume either newly developed software or software that’s moved from somewhere else, it doesn’t impact software costs (unless you’re unlucky enough to have a strange software license that might actually charge more for cloud hosting).  The kind of cloud service that best fits the goal of shifting IT costs from capital to expense is SaaS, which happens to be the big focus of Oracle.

Big, but not exclusive.  Oracle also has an IaaS cloud offering, though their market share in IaaS (which I estimate at about 6%) is less than their share of SaaS (which is about 12% if you include Microsoft Office 365 as a SaaS offering, or over 28% if you don’t).  This year, I’ve noticed that Oracle is cropping up in comments enterprises make to me about cloud planning, which it didn’t do much in 2019.  Companies who were willing to talk about the reason had an interesting tale to tell.

Shifting to a SaaS model isn’t easy for enterprises, for the obvious reason that whatever applications account for the largest part of their IT budgets aren’t provided in SaaS form.  They have to change applications to move to SaaS, and that sort of change can create major pushback from line departments whose users see different interfaces and often different process flows.  That’s problem enough, but there’s another problem rarely talked about in SaaS transformations.

Moving applications to SaaS form, to save money, has to displace IT resources, which means that the data center is likely to have less overall capacity.  Moving applications to SaaS may also have an impact on how the applications integrate with other applications that haven’t been moved, and perhaps can’t be moved yet.  Some users are telling me that Oracle’s SaaS/IaaS combination, combined with Oracle’s features and skills at integrating the two inside Oracle’s cloud, facilitates their shift to SaaS.  It might then be that Oracle’s approach will gain them not only SaaS market share but IaaS market share as well.

Oracle cited some indication of this on their call.  After citing analyst firms’ notice of their SaaS offering, Ellison says that “…what’s interesting is that those same analysts are beginning to take notice of the technical quality and customer satisfaction associated with Oracle’s Cloud infrastructure as a service business.”  It’s likely that the analysts, like me, were hearing from users about their strategies for increasing SaaS adoption.

Oracle, of course, had the advantage of having applications they could move to the cloud in SaaS form, and it’s actually fairly easy to spawn a user-specific application instance on top of an IaaS service and frame the offering as SaaS.  Over time, you could then redo critical parts of the application to facilitate more efficient use of the compute platform.
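
As a rough sketch of that single-tenant-SaaS-on-IaaS pattern (the launch_vm call below is a stand-in for whatever compute API the provider actually offers, and every name is hypothetical), each customer simply gets their own application instance and endpoint.

```python
# Sketch: framing per-tenant application instances on IaaS as "SaaS".
def launch_vm(image: str, tag: str) -> str:
    # Stand-in for a real IaaS compute call; returns an instance handle.
    return f"vm-{tag}"

def provision_tenant(tenant_id: str) -> dict:
    instance = launch_vm(image="erp-app-v1", tag=tenant_id)
    return {
        "tenant": tenant_id,
        "instance": instance,
        "endpoint": f"https://{tenant_id}.example-saas.invalid",
    }

print(provision_tenant("acme"))
```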

The ability to create a symbiosis between a SaaS and IaaS strategy is helpful, but you need SaaS to drive the bus here.  The big question in both software applications and cloud computing is whether the COVID challenge will result in a major shift toward SaaS, which would imply a major shift away from enterprise custom development in favor of packaged software.  It’s not rational to assume there’s much money in framing an enterprise’s own software as SaaS; it has to be a vertical or horizontal package.

Enterprises don’t write nearly as much of their own software today as they did in the past.  I can easily remember a period when the larger enterprises wrote most of their own applications, and when even a “fourth-generation language” to facilitate development (what we’d call “low-code” today) was a revolution.  A prerequisite for acceptance of SaaS is an acceptance of canned applications to replace self-developed stuff.

That’s not something that develops instantly, and we have to remember that we’ve had SaaS for some time.  There have been local successes in SaaS, mostly in the very applications that Oracle and Salesforce compete in, but broader SaaS has to come from broader and more vertically focused applications.  That’s something Oracle may be thinking about, but I think that class of SaaS, and the full realization of Oracle’s ambitions in the cloud, may take some time.