Why Optimum Transformation Depends on a Hidden Metric

I know I talk a lot about “demand density” in my blogs, so much so that you might think I’m saying it’s the essential underpinning of network evolution.  Well, it kind of is.  The future of networking is written in dollars, not in bits, and demand density is the fundamental dollar reality.  I’ve been working through modeling just how it’s impacting us today, and how it will continue to impact us over the rest of this decade, and I want to share the results here.

To start with, the “demand density” concept is one I developed over a decade ago in response to what seemed like a simple question: “Why do some countries have so much better broadband than others?”  I took a lot of economic data on the major market areas and ran a series of models to try to find correlations.  It was obvious that the potential ROI of network infrastructure in a given area (like a country) depends on the combination of the economic potential of the area and the extent to which public right of way is available.

The model showed that the former factor was significant if you related it to GDP per square mile of inhabited area, and the latter to the road and rail miles within the country.  Demand density was the combination of these factors, and I’ve always expressed it in relation to the US demand density, set to 1.0.  The easiest way to understand the importance of demand density is to take two extremes, the very low and the very high, and compare them.
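
To make that concrete, here’s a minimal sketch of how a metric like this might be computed.  The exact way the two factors are combined is glossed over in the text, so treat the formula below (a simple product of the two intensities, normalized to the US raw value) and all of the numbers as illustrative assumptions, not my published model.

```python
# Illustrative only: the combination formula and the inputs are assumptions.
def demand_density(gdp_usd, inhabited_sq_miles, right_of_way_miles, us_raw_value):
    """Combine economic concentration with right-of-way availability,
    normalized so the US reference value comes out at 1.0."""
    economic_intensity = gdp_usd / inhabited_sq_miles         # GDP per inhabited square mile
    row_intensity = right_of_way_miles / inhabited_sq_miles   # road/rail miles per square mile
    return (economic_intensity * row_intensity) / us_raw_value

# Hypothetical inputs, chosen only to show the shape of the calculation.
US_RAW = (21.0e12 / 1.5e6) * (4.2e6 / 1.5e6)
print(demand_density(21.0e12, 1.5e6, 4.2e6, US_RAW))   # 1.0 by construction
print(demand_density(1.5e12, 2.5e6, 1.0e6, US_RAW))    # a much lower-density area
```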

When a country has a very high demand density, its economic potential is highly concentrated, meaning that the infrastructure investment needed to connect the country and realize that economic potential is relatively low.  ROI on infrastructure is high, so countries with high demand density are likely not to have felt the profit-per-bit squeeze at this point.  The high economic return of infrastructure makes it easier to invest in new services, and so service innovation is high.  Overall, infrastructure planning is likely to be aimed at revenue generation more than at cost control.

Operations costs are also lower in these high-demand-density countries.  On average, the “process opex” costs associated with service and network operations in countries with a demand density greater than 7 are about 40% lower than those of a country with a demand density of 1.  That’s partly because human resources are more efficient when their range of activity is small; they spend less time moving around, and a central pool of resources can support a greater number of users and devices.  The other part is that it’s possible to oversupply resources where demand density is high, because of that higher ROI on infrastructure.  More resources mean less resource management.

The low-demand-density country is in the opposite situation.  Because demand is spread over a larger geography, connecting users is more expensive and infrastructure ROI is lower.  That translates to quick compression of profit-per-bit, and in the extreme cases makes it difficult to sustain investment in infrastructure at all.  Countries with demand densities below 0.33 have opex costs that average 20% higher than those of countries with a demand density of 1.0, because human resource usage is relatively inefficient.

I’ve used “countries” here because it’s generally easier to get economic data at a country level, and because most countries have been served by a national carrier.  In the US, it’s fairly easy to get data by state, and there are multiple operators serving the US.  A quick look at two, AT&T and Verizon, is another window into the importance of demand density.

AT&T’s demand density is 1.3, which puts it in the “relatively low” value range.  Comparing it with country data, it ranks roughly the same as Chile.  Verizon’s demand density is 11.5, which ranks slightly below that of Japan, in the “high” range.  Verizon has been aggressive in deploying FTTH and AT&T has not, because the former’s demand density suggests it could profitably connect about 40% of its customers with fiber, while the latter’s data says it could connect only 22%.  AT&T recently said it was dropping new DSL deployments, and its demand density is such that fiber support for those customers would not likely be profitable.  You don’t have to look further than the stock prices for the two companies to see the difference in how the financial markets see them.

AT&T has been perhaps the most aggressive large operator in the world on infrastructure transformation, particularly in the deployment of “white-box” technology.  That’s obviously aimed at reducing capex, but the largest component of capex for an ISP is the access network.  AT&T’s low demand density means its access network technology options are crippled.  DSL has to reach too far, and fiber to the home is too expensive.

This is where 5G mobile and 5G millimeter wave come in.  In areas where demand density is low, 5G technology could provide an alternative to copper-loop or FTTH but with a much lower “pass cost”, meaning the cost to bring service into an area so customers can then be connected.  5G in any form reduces the access cost component of capex, which can relieve pressure on the overall capital budget.  More significantly, it lets an operator raise its per-user bandwidth at a lower cost, making it more competitive and opening the opportunity to deliver new services, like streaming video.

5G seems most valuable as a tool in improving service cost and profit where demand densities are below about 3, which would correspond to US states like Vermont and West Virginia or countries like Italy.  Where demand density is higher, fiber becomes more practical and the impact of 5G on overall profit per bit is likely to be steadily less.

This is important in transformation planning, because it divides network transformation goals into “zones”.  Where demand density is high (greater than 5), profit per bit is not under immediate pressure and neither significant network transformation nor significant 5G exploitation is likely to be needed in the near term (to 2023).  Where it’s between 3 and 5, 5G is likely a competitive and service opportunity driver.  Between 1 and 3, 5G and general network cost effectiveness combine to create the transformation drivers, and between 0.2 and 1.0, transformation has to be pervasive in both access and core, in order to control profit compression.
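
Those zone boundaries reduce to a simple lookup.  The sketch below just restates the thresholds from the paragraph above, with the planning emphasis paraphrased in the return strings.

```python
def transformation_zone(demand_density):
    """Map a demand density (US = 1.0) to the planning 'zone' described above."""
    if demand_density > 5:
        return "low near-term pressure: neither major transformation nor 5G exploitation urgent"
    if demand_density > 3:
        return "5G as a competitive and service-opportunity driver"
    if demand_density > 1:
        return "5G plus general network cost-effectiveness drive transformation"
    if demand_density >= 0.2:
        return "pervasive transformation in access and core to control profit compression"
    return "below the modeled range; sustaining infrastructure investment is itself in doubt"

for d in (11.5, 4.0, 1.3, 0.2):
    print(d, "->", transformation_zone(d))
```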

Obviously, pure mobile operators are going to be transformed primarily through the evolution of the mobile backhaul and core networks, so 5G standards would likely dominate transformation planning goals.  Where demand density comes in is in the area of cell-size planning.  High demand densities mean efficient backhaul even if microcells are used, and so 5G networks would likely trend toward smaller cell sizes and larger numbers of cells.  This would also favor the introduction of bandwidth-intensive applications of 5G because per-user bandwidth could be higher.  In low-demand-density areas, cell sizes would likely be larger to contain overall deployment costs, which would also reduce per-user bandwidth available and limit the new services that 5G could support.

Millimeter-wave 5G/FTTN hybrids would seem most valuable where demand densities hover in the 1-4 range, too low for large-scale FTTH but high enough that the range limitations of 5G/FTTN wouldn’t be a killer and so that delivered bandwidth could remain at a competitive level.  As demand density falls to the low end of that range, 5G mobile infrastructure to serve fixed locations would become increasingly economical, and as it rises to (and above) the high end, FTTH becomes competitive.

This last issue of transformation focus may be the most important near-term factor influenced by demand density.  Is an operator’s network planning mobile-first or network-first?  For vendors prospecting network operators, that’s a big question because it relates to both sales strategy and product planning focus.  Obviously, operators whose planning is dominated by 5G issues aren’t as likely to respond to a generalized network transformation story; they want to hear 5G specifics.  The opposite is also true.  However, there are exceptions.  Operators whose demand density favors microcells would be doing a lot of backhaul and aggregation, and thus would build a core that looked a lot like that of a wireline network.  Those with large 5G mobile cells could be doing so much aggregation in the backhaul network that they’d almost be dropping their mobile traffic on the edge of their core.

One thing I think is clear in all of this is that demand density has always been an issue.  Australia, who went to a not-for-profit NBN experiment, has a demand density of 0.2.  AT&T, who just committed to an open-model network (see my blog HERE) has a demand density of 1.3.  The general curve of profit compression that operators always draw will reach the critical point of inadequate ROI quicker for those whose demand density is lower, and measures to contain capex and opex will be taken there first, as they have been already.

Another thing that’s clear is that transformation strategies aren’t going to be uniform when demand density is not.  It’s simplistic to believe that a salesforce could be armed with a single song to sing worldwide, and succeed in selling their products across the globe.  There’s always been a need to tune messages, and with operator budget predictions down almost universally, this is probably a good time to pay special attention to that tuning.  Factoring demand density into the picture can be a big help.

Filling the Holes in Opex Reduction Strategies

Vendors are finally discovering the virtue of opex reduction.  Cisco has included the network sector in its overall AI/ML strategy, complementing “intent-based networks”.  Juniper’s Mist AI deal, combined with their recent acquisition of Netrounds, shows that they’re looking at more ways to apply automation, testing, monitoring, and other stuff that qualifies as operations automation.  The reason, of course, is that if operators are looking to cut costs to reduce profit-per-bit squeeze, cutting opex might prevent them from cutting capex, which hits vendors’ own bottom lines.

Vendor desire to support operator cost reduction in places where the reduction doesn’t impact operators’ spending on the vendors is understandable.  That doesn’t make it automatically achievable, or mean that any measure that has a claim to enhancing operations would necessarily reduce opex significantly, or at all.  I’ve done a lot of modeling on opex, and this modeling suggests that there are some basic truths about opex reduction that need to be addressed any time opex enhancements are claimed as a benefit.

The biggest truth about opex is, for operators, a sad one.  Recall from prior blogs that I’ve noted that “process opex”, the cost related to customer and infrastructure operations, is already a greater percentage of revenue than capex is.  Despite operators moving to address the problem starting in 2016, and despite measures taken largely in 2018, opex costs have continued to grow even as capex has been reduced.

The biggest and saddest truth for vendors is that securing opex improvements is getting harder.  Aside from 2018, when major changes to installation and field support were initiated and created a major impact on opex overall, further technical improvements have only made the growth curve of opex a bit less steep.  The curve has never, even in 2018, turned downward, and it’s continuing to grow in 2020.  The biggest challenge is that the one area where cost containment seems to have worked this year is network operations, which is exactly where most of the announcements (like Juniper’s) are targeted.

It would be easy to say that service lifecycle automation and other forms of operations cost management are already out of gas, but that oversimplifies, even though it’s a view that could be gleaned from the data.  A “deeper glean” results in a deeper understanding, which is that while service automation is important, it has to be viewed from the top and not from the bottom.

If you take a kind of horizontal slice across all the categories of “process opex”, the largest contributor by far is customer care and retention, which accounts for an astonishing 40% of all opex.  You could argue that making network operations automated is important primarily if it reduces customer care and retention costs, and for that to be true, you have to be able to translate the netops benefits directly into the customer domain, which even in 2020 is a disaster.  The reason we had a brief improvement in process opex in 2018 is that operators wrung human cost out of the process.  The reason it didn’t last is that the system changes made were simply badly done.

Let me offer a personal example.  I recently had a Verizon FiOS outage, caused (I believe) by a construction crew cutting the fiber cable.  Having lost wireline Internet, I went to Verizon’s website to report and resolve the problem.  Thanks to that 2018 change, there was no human interaction involved; I was told to run through an automated troubleshooter, and candidly it was a truly frustrating and terrible experience.

First, the troubleshooter asked me whether I was having a power failure.  It then told me there were no known outages in my area, and had me go to the location of the optical terminal.  Now, there was a red light on the ONT, but they never asked for status.  Instead I had to make sure the ONT had power (duh, there’s a red light on) and then reboot the ONT to see if that fixed it.  It didn’t, so they told me it would now take a service call, and gave me a bunch of disclaimers on how I’d have to pay for the call if this turned out to be my fault.  Then it asked if I still wanted to schedule a service call.  I said I did, and it told me there were no slots available for the next two weeks, so I’d have to be connected to a human.

It never connected me.  I tried to call the support number and got another (this time voice-driven) troubleshooter, which proceeded to tell me that there was an outage in my area and it would be repaired by 6 AM the following morning.  Good service?  Hardly, but the problem wasn’t service automation.

There’s an automated element at the end of the broadband connection, the ONT.  There is no reason why Verizon could not have known that some number of customers’ ONTs had gone away, and from the distribution have been able to determine the likely location of the problem.  They had my cell number, so they could have texted me to say there was an outage, that they’d have it repaired overnight, and that they were sorry for the problem.  That would have left a good taste in my mouth, and reduced the chances that I’d look for another broadband provider.  It would have unloaded their troubleshooting system too, and it would have required nothing that would qualify as a network automation enhancement.
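
The kind of correlation that would have been needed isn’t exotic.  Here’s a sketch, assuming only that the operator knows which distribution segment each ONT hangs off and has a contact number on file; the function, the threshold, and the data are illustrative, not anything Verizon actually runs.

```python
from collections import Counter

def infer_outages(offline_onts, ont_to_segment, onts_per_segment, threshold=0.5):
    """Group loss-of-signal events by distribution segment; if a large share of a
    segment's ONTs go dark together, treat it as a plant outage rather than a
    per-customer problem, and notify the affected customers proactively."""
    dark = Counter(ont_to_segment[ont] for ont in offline_onts)
    return [seg for seg, count in dark.items()
            if count / onts_per_segment[seg] >= threshold]

# Hypothetical data: 40 of segment F7's 50 ONTs offline at once suggests a cut cable.
onts_per_segment = {"F7": 50, "F8": 60}
ont_to_segment = {f"ont-{i}": "F7" for i in range(40)}
print(infer_outages(ont_to_segment.keys(), ont_to_segment, onts_per_segment))  # ['F7'] -> text those customers
```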

Start with the customer, with how to proactively notify them of problems and how to give them the correct answers and procedures.  When you’ve done that, think about what information could be provided to drive and improve that customer relationship.  When that’s done, think about what might have been done to prevent the problem.  The proverbial “cable-seeking backhoe” isn’t going to be service-automated out of existence, nor is the repair of the cable, which shows that some of the most common service problems aren’t even related to the network’s operation.  We absolutely have to fix customer care.

This doesn’t say that you don’t need service lifecycle automation, though, and there’s both a “shallow” and “deep” reason that you do.  Let’s start with the shallow one first.

Some customer problems are caused by network problems, and in particular they’re caused by operator error.  A half-dozen operators out of a group of 44 told me that operator error was their largest cause of network problems, but almost 30 of them wouldn’t say what their largest cause was, so it could well be that operator error is the largest source of errors overall.  Misconfiguration lies at or near the top of the list.  Lifecycle automation, by reducing human intervention, reduces misadventures in that intervention (to add a bit of conversational interest to the picture).

The second, deep, reason is that in an effort to reduce capex, we’re developing a more complex infrastructure framework.  A real router is a box, and everything about managing it is inside the box.  A virtual router is a virtual box, and it still has to be managed as a box, but its hosting environment, its orchestration, and even the management processes associated with hosting, also have to be managed.  If we break our box into a chain of virtual-features, we have even more things to manage.  Management costs money, both in wages and benefits, and in the errors that raise customer care and retention costs.

You can see what might be a sign of that in the operations numbers for this year.  While network operations costs are up about 16% over the last five years, IT operations costs are up over 20%.  Given that these costs are all “internal”, meaning there’s no contribution to direct customer interactions, installations, or repairs, that’s a significant shift.  Could it be the indication that greater adoption of virtualization is creating more complex infrastructure at the IT level, and that the biggest contribution that service lifecycle automation could make is in controlling the increase in opex related to this increased complexity?

Then there’s the final point, which is the impact of demand density on opex and on opex reduction strategies.  We already know that wireline has a higher opex contribution than wireless, but it’s also clear that in areas where demand density (roughly, opportunity per square mile) is high, opex is lower because craft efficiency is higher and the cost of infrastructure redundancy is lower.  As demand density falls, there’s a tendency to conserve infrastructure to manage capex, which means opex tends to rise because of loss of reserve resources.  It’s possible that this factor could impact capex-centric approaches to improving profit per bit; if new-model networks are cheaper to buy and more expensive to operate, what’s the net benefit?

The fact is that you can’t let opex rise, in large part because a rise in opex is often a sign that customer care is suffering.  It’s possible that a customer-care-centric approach to operations, even without massive changes in lifecycle automation, could improve opex as much as new automated technology could.  It’s also possible that wrapping new service lifecycle automation in an outmoded customer care portal and practice set could bury any benefits the new system could generate.

My problem with Verizon couldn’t have been resolved by an automated system if it was indeed caused by nearby construction.  No new systems or AI were required to do a better job, only more insight into designing the customer care portal.  I’m not saying that we should forget service lifecycle automation and focus on customer portals, but we can’t forget the latter while chasing the former.

Is There Really an “Edge” to the Cloud at All?

Where is the best place to host computing power?  Answer: Where it’s needed.  Where’s the most economical place?  Answer: Probably not where it’s needed.  The dilemma of the cloud, then, is how to balance optimality in QoE and the business case.  I’m going to propose that this dilemma changes the nature of the cloud, and that a better definition of the “edge” is a “local” processing resource.  A Light Reading article quotes Dish as saying that for 5G, “the edge is everywhere”.  Is that true, or is it true that the cloud is subsuming the edge?

The relationship between humans and applications has been a constant process of increasing intimacy.  We started off by keypunching in stuff we’d already done manually, evolved to online transaction processing, empowered ourselves with our own desktop computers, and now we carry a computer with us that’s more powerful than the one we started reading punched cards with.  How can you see this other than the classic paso doble, circling ever closer?  It proves that we’ve historically benefitted from having our compute intimately available.

The cloud, in this context, is then a bit counterintuitive.  We collectivize computing, which seems to be pulling it away from us.  Of course, now the cloud people want to be edge people too, getting them back in our faces, but I digress.  The point here is that economics favors pooled resources and performance favors dedicated, personalized, resources.

We could probably do some flashy math that modeled the attractive force of economics and the attractive force of personalization to come up with a surface that represented a good compromise, but the problem is more complicated than that.  The reason is that we’ve learned that most applications exist in layers, some of which offer significant benefits if they’re pushed out toward us, and others where economics will overwhelm such a movement.

IoT is a good place to see the effect of this.  An IoT application that turns a sensor signal into a command to open a gate is one where having the application close to the event could save someone from hitting the gate.  However, a mechanical switch could have done this cheaper.  In the real world, our application likely involves a complex set of interactions that might actually start well before the gate sensor is reached.  We might read an RFID from a truck as it turns into an access road, look up what’s supposed to be on it, and then route it to the correct gate, and onward to the correct off- or on-load point.

This application might sound similar to that first simple example, but it changes the economic/QoE tradeoff of compute placement.  Unless the access road is a couple paces long, we’d have plenty of time to send our RFID truck code almost anywhere in the world for a lookup.  Since we’re guiding the truck, we can open the gates based on our guidance, and so a lot of the real-time response nature of the application is gone.

Network services offer similar examples of layers of that economics-to-QoE trade, and one place where that’s particularly true is in the handling of control-plane messages.  Where is the best place to handle a control plane message?  The answer seems simple—the edge—but the control plane in IP is largely a hop function not an end-to-end function.  There are hops everywhere there’s a node, a router.

Let’s look at things another way.  We call on a cloud-native implementation of a control-plane function.  We go through service discovery and run it, and it happens that the instance we run was deployed twelve time zones away.  What’s the chance that this sort of allocation is going to create a favorable network behavior, particularly if it’s multiplied by all the control-plane packets that might be seen in a given place?

One solution is to create what could be called a “local cloud”, a cloud that contains a set of hosting resources that are logically linked to a given source of messages, like a set of data-plane switches.  Grabbing a resource from this would provide some resource pool benefits versus fixed allocation of resources to features, and it wouldn’t require a major shift in thinking in the area of cloud hosting overall.

Where broader cloud-think comes in is where we either have to overflow out of our “local cloud” for resources, or where we have local-cloud missions that tie logically to deeper functionality.  If the IP control plane is a “local cloud” mission, how local is the 5G control plane, and how local is the application set that might be built on 5G?  Do we push these things out across those twelve time zones?  Logically there would be another set of resources that might be less “local” but would sure as the dickens not be “distant”.

The cloud is hierarchical, or at least it should be.  There is no “edge” in a sense, because what’s “edge” to one application might be deep core to another.  There’s a topology, defined for a given application, that represents the way things needing a deeper resource (either because the shallow ones ran out or because the current task isn’t that latency-sensitive) would be connected with it.

This view would present some profound issues in cloud orchestration, because while we do have mechanisms for steering containers to specific nodes (hosting points), those mechanisms aren’t based on the location where the user is, or the nature of the “deep request” tree I’ve described.  This issue, in fact, starts to make the cloud look more “serverless”, or at least “server-independent”.

How should something like this work?  The answer is that when a request for processing is made, a request for something like control-packet handling, the request would be matched to a “tree” that’s rooted in the location from which the request originated (or to where the response is needed).  The request would include a latency requirement, plus any special features that the processing might require.  The orchestration process would parse the tree looking for something that fit.  It should give preference to instances of the needed process that were already loaded and ready, and it should account for how long it would take to instantiate the process if it weren’t loaded anywhere.  All that would eventually either match a location, triggering a deployment, or create a fault.
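
Here’s a minimal sketch of that selection logic, assuming the “tree” has been flattened into tiers ordered from the request’s origin outward, each tier carrying a latency estimate, spare capacity, and a set of already-loaded processes.  All the names and numbers are hypothetical.

```python
def place(request, tiers):
    """Pick the shallowest tier that meets the latency budget, preferring tiers where
    the process is already loaded; otherwise charge the instantiation delay."""
    budget = request["latency_budget_ms"]
    for tier in tiers:  # ordered from shallowest (most local) to deepest (most central)
        ready = request["process"] in tier["loaded"]
        startup = 0 if ready else tier["instantiation_ms"]
        if tier["latency_ms"] + startup <= budget and tier["capacity"] > 0:
            return tier["name"], ("reuse" if ready else "deploy")
    return None, "fault"  # nothing in the tree can meet the requirement

tiers = [
    {"name": "local-cloud", "latency_ms": 2, "instantiation_ms": 200, "capacity": 1, "loaded": {"ctl-handler"}},
    {"name": "metro-edge", "latency_ms": 8, "instantiation_ms": 150, "capacity": 4, "loaded": set()},
    {"name": "regional", "latency_ms": 30, "instantiation_ms": 100, "capacity": 50, "loaded": set()},
]
print(place({"process": "ctl-handler", "latency_budget_ms": 10}, tiers))  # ('local-cloud', 'reuse')
```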

The functionality I’m proposing here is perhaps somewhere between a service mesh and an orchestrator, but with a bit of new stuff thrown in.  There might be a way to diddle through some of this using current tools in the orchestration package (Kubernetes has a bunch of features to help select nodes to match pods), but I don’t see a foolproof way of covering all the situations that could arise.  Kubernetes has its affinities, taints, and tolerations, but they do a better job of getting a pod to a place where other related pods might be located, and they’re limited to node awareness, rather than supporting a general characteristic like “In my data center” or “in-rack”.  It might also be possible to create multiple clusters, each representing a “local collection” of resources, and use federation policies to steer things, but I’m not sure that would work if the deployment was triggered within a specific cluster.  I welcome anyone with more detailed experience in this area to provide me a reference!

Another point is that if service mesh technology is used to map messages to processes, should that mapping consider the same issue of proximity or location?  There may be process instances available in multiple locations, which is why load-balancing is a common feature of service meshes.  The selection of the optimum currently available instance is as important as picking the optimum point to instantiate something that’s not currently hosted.
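
The instance-selection side could apply the same reasoning.  A sketch, assuming the mesh knows a rough caller-to-instance latency and a current load figure for each candidate (both fields are hypothetical):

```python
def pick_instance(instances, latency_budget_ms):
    """Prefer the closest instance that meets the budget, breaking ties by load;
    if nothing meets the budget, degrade gracefully rather than fail."""
    eligible = [i for i in instances if i["latency_ms"] <= latency_budget_ms]
    pool = eligible or instances
    return min(pool, key=lambda i: (i["latency_ms"], i["load"]))["name"]

instances = [
    {"name": "cp-a", "latency_ms": 3, "load": 0.9},
    {"name": "cp-b", "latency_ms": 3, "load": 0.2},
    {"name": "cp-c", "latency_ms": 45, "load": 0.1},
]
print(pick_instance(instances, 10))  # cp-b: close enough, and lightly loaded
```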

Why is all this important?  Because it’s kind of useless to be talking about reducing things like mobile-network latency when you might well introduce a lot more latency by making bad choices on where to host the processes.  Event-driven applications of any sort are the ones most likely to have latency issues, and they’re also the ones that often have specific “locality” to them.  We may need to look a bit deeper into this as we plot the evolution of serverless, container orchestration, and service mesh, especially in IoT and telecom.

A First Step to an Open-Model Network Future

Do you think that core routing is just for big routers?  Think again.  DriveNets, a startup who developed a “Network Cloud” cloud-routing solution, and AT&T co-announced (HERE and HERE) that DriveNets “is providing the software-based core routing solution for AT&T, the largest backbone in the US.”  That could fairly be called a blockbuster announcement (covered HERE and HERE and HERE), I think.  It’s also likely the first step in actually realizing “transformation” of network infrastructure.

I’ve mentioned DriveNets in a couple past blogs, relating to the fact that they separated the control and data planes, and I think that particular attribute is the core of their success.  What the AT&T deal does is validate both the company and the general notion of a network beyond proprietary routers.  That’s obviously going to create some competitive angst among vendors, and likely renew hope among operators.  The basics of their story have already been captured in my references, so I want to dig deeper, based on material from the company, to see just how revolutionary this might be.

Looking at the big picture first, the AT&T decision puts real buyer dollars behind a software-centric vision of networking, one that’s had its bumps in the road as standards efforts to create the architecture for the new system have failed to catch on.  Some operators I’ve talked with were enthusiastic about the shift in technology from routers to software, but concerned that they wouldn’t see a viable product in the near term.  DriveNets may now have relieved that concern, because now they have a very big reference account, an operator using DriveNets in the most critical of all missions.  That’s a pretty big revolution right there.

The next revolutionary truth is that it’s software that creates the DriveNets technology.  While DriveNets runs on white boxes, what makes it different is control/data separation, and how a kind of local cloud hosts the control plane.  The data plane is hosted on white-box switches based on Broadcom’s Jericho chip, and the control plane on a more generic-looking series of certified white-box devices.  White-box devices also provide the external interfaces and connect with the data-plane fabric.  You can add white boxes as needed to augment resources for any of these missions.

Different-sized routers are created by combining the elements as needed into a “cluster”, and capacities up to 768Tbps can be supported with the newest generation of the chip (the AT&T deal goes up to 128Tbps).  AT&T says that they’ll have future announcements for other applications running on the same white-box devices, and that offers strong support for the notion that this is an open approach.  This is an important point I’ll return to later in this blog.

One obvious benefit of this new model of networking is that the same white boxes (except for extreme-edge applications that would use a single device) can be used to build up what’s effectively an infinite series of router models, so the same boxes are spares for everything.  Another benefit is that no matter how complex a cluster is, it looks like a single device both at the interface-and-topology level and at the management level.  But the less-obvious benefits may be the best of all.

Here’s a good one.  The control-plane software is a series of cloud-native microservices hosted in containers and connected with a secret-sauce, optimized, service mesh for message control.  This gives DriveNets (and, in this case, AT&T) the ability to benefit from the CI/CD (continuous integration and continuous delivery) experience of cloud applications.  In fact, the control-plane software and DriveNets software overall is based on cloud principles, which is where operators have been saying they want to go for ages.  There are none of the complicated, multi-forked code trains that have haunted traditional routers for ages.  All the software microservices are available, loaded when needed, and changed as needed.

And another.  When you get a new Network Cloud cluster, or just a new box, it’s plug and play.  When it’s first turned up, it goes to a DriveNets cloud service to get basic software, and then contacts a local management server for its configuration, setup, and details.  Sounds a lot like GitOps in the cloud, right?
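
As a sketch of that kind of zero-touch flow (the classes, endpoints, and steps below are my own generic illustration, not DriveNets’ actual protocol):

```python
class VendorCloud:                       # hypothetical stand-in for the vendor bootstrap service
    def fetch_base_image(self, dev): return f"base-image-for-{dev}"

class LocalMgmt:                         # hypothetical stand-in for the local management server
    def fetch_config(self, dev): return {"role": "fabric", "device": dev}
    def report_ready(self, dev): print(f"{dev} registered with cluster")

def bootstrap(device_id, vendor_cloud, local_mgmt):
    """Generic zero-touch flow: base software from the vendor service, then
    role-specific configuration from the local management server."""
    image = vendor_cloud.fetch_base_image(device_id)   # step 1: get the basic software
    print(f"installing {image}")
    cfg = local_mgmt.fetch_config(device_id)           # step 2: configuration, setup, details
    print(f"applying role '{cfg['role']}'")
    local_mgmt.report_ready(device_id)                 # the device is now part of the cluster

bootstrap("wb-0042", VendorCloud(), LocalMgmt())
```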

And still more.  The data-plane boxes can be configured with fast-failover paths that allow for incredibly quick reaction time to faults.  The control plane, in its cloud-cluster of white boxes, will reconverge on the new configuration.  More complex network issues that require the control plane benefit from having the cluster’s internal configuration tunable to support the overall connection topology that’s required to return to normal operation.  From the outside, it’s a single device.

And another…the operating system (DNOS) and orchestrator (DNOR) elements combine with the Network Cloud Controller and Network Cloud Management elements to provide for internal lifecycle management.  Everything that’s inside the cluster, which is thus a classic “black box”, is managed by the cluster software to meet the external interface and (implicit) SLA.  The fact that the cluster is a cloud is invisible to the outside management framework, so the architecture doesn’t add to complexity or increase opex.

To recap the market impact, what we have is a validation that a major Tier One (AT&T) who has been committed to open-model networking is satisfied that there is an implementation of that concept that’s credible enough to bet their core network on.  We also have a pretty strong statement that an “open-model” network is a network composed of white-box devices that can, at least in theory, host anything.

Remember that AT&T says they may host other software on the same white boxes.  I think that means that to AT&T, openness means protection against stranded investment, not necessarily that every component of the solution (the non-capital components in particular) is open.  You don’t need to have open-source software on white boxes to be “open”, but you do have to be able to run multiple classes of software on whatever white boxes you select.

I’m a strategist, a futurist if you like, so for me it’s always the major strategic implications that matter.  Where is the future open-model network heading?  What’s the next level up, strategy-wise, that we could derive from the announcement?  I think it may arise from another announcement I blogged on just last week.  If we go back to that blog on the ONF conference, there was an ONF presentation on “The Network as a Programmable Platform”, which proposed an SDN-centric vision of the network of the future.  As the title suggests, it was a “programmable” network.

One figure in that presentation shows the SDN Controller with a series of “northbound” APIs linked to “applications”, one of which is BGP4.  What the figure is proposing is that BGP4 implementations, running as a separate application, could control the forwarding of SDN OpenFlow white-box switches, and create what looks like a BGP core.  You could do the same thing, according to the ONF vision, to implement the 5G interfaces between their “control plane” (which I remind you is not the same as the IP control plane) and an IP network.  This is almost exactly what Google’s Andromeda project did for Google.
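
To make the pattern concrete, here’s a rough sketch of a routing protocol running as a northbound “application” that expresses its results as forwarding entries pushed through a controller.  The controller API and names are my own stand-ins, not anything from the ONF material.

```python
class StubController:
    """Hypothetical stand-in for an SDN controller's northbound API."""
    def switches_toward(self, next_hop): return ["sw1"]
    def install_flow(self, switch, match, action):
        print(f"{switch}: {match['ipv4_dst']} -> {action['output']}")

def bgp_app_update(controller, best_routes):
    # The 'BGP4 application': route selection happens off-switch, and the winning
    # routes are expressed as forwarding entries pushed to the white-box switches.
    for prefix, next_hop in best_routes.items():
        for switch in controller.switches_toward(next_hop):
            controller.install_flow(switch, {"ipv4_dst": prefix}, {"output": next_hop})

bgp_app_update(StubController(), {"203.0.113.0/24": "peer-2"})
```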

Why would AT&T not have selected an ONF implementation, then?  They’ve supported, and contributed elements to, the ONF solution.  The answer, I think, could be simple:  there is no validated implementation of the ONF solution available commercially.  It may be that DriveNets is seen by AT&T as the closest thing to that utopian model that’s available today, and of course they may also believe that DriveNets is close enough that they could evolve to the model faster than someone (starting with just the ONF diagrams to work with) could implement it.

Why could they think that?  If you dig into the DriveNets material (particularly their Tech Field Day stuff), the architecture of their separated control plane is characterized as a web of cloud-native microservices.  These work together to do a lot of things, one of which is creating what looks like a single router from the combined behavior of a lot of separate devices.  And DriveNets, in the video, says “Twice the number of network operators say they expect radical change in network architecture within three years, versus those who say they do not.”  They also say that their approach “builds networks like hyperscaler clouds.”

Let me see…radical changes are needed, build networks like hyperscaler clouds, consolidate multiple devices into a single virtual view?  This sounds like a network-programmability model that doesn’t rely on SDN controllers, single or a federation.  Do all that at the network level and you’ve implemented the bottom half of the ONF vision.

How about the top half, those northbound APIs?  There’s no detail on exactly what northbound APIs are currently exposed by DriveNets, but since their control plane is made up of microservices, there’s no reason why any API couldn’t be added fairly easily.  The current DriveNets cluster has to have a single forwarding table from which it would derive the forwarding tables for each of their data plane fabric devices, so could that table be used to create a network-as-a-service offering, including BGP4?  It seems possible.
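
Here’s a rough illustration of the kind of derivation that paragraph speculates about (my construction, not DriveNets’ design): a single cluster-level forwarding table sliced into per-fabric-element tables, the same table a network-as-a-service API could expose northbound.

```python
def derive_device_tables(cluster_fib, port_to_device):
    """Partition one cluster-wide forwarding table into per-device tables,
    keyed by which data-plane element owns the egress port.  Illustrative only."""
    per_device = {}
    for prefix, egress_port in cluster_fib.items():
        device = port_to_device[egress_port]
        per_device.setdefault(device, {})[prefix] = egress_port
    return per_device

cluster_fib = {"10.0.0.0/8": "p1", "192.0.2.0/24": "p7"}   # hypothetical entries
port_to_device = {"p1": "fabric-a", "p7": "fabric-b"}
print(derive_device_tables(cluster_fib, port_to_device))
```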

New microservices could be developed by DriveNets, by partners, and even by customers, and these microservices could extend the whole DriveNets model.  ONF OpenFlow control?  It could be done.  Network-as-a-service APIs to support the mobility management and media access elements of 5G, via the N2 and N4 interfaces?  It could be done.  I’m not saying it will be, only that it seems possible, just as it’s possible that the ONF model could be implemented.

“Could” being the operative word.  The problem with the ONF model, as I said in that blog on their conference referenced above, is that central SDN controller.  That’s a massive scalability and single-point-of-failure problem.  Federated SDN controllers are a logical step I called out years ago when this issue was raised, but it’s not been developed (you can see that by the fact that the ONF’s pitch doesn’t reference it).  There is no industry standards initiative in the history of telecom that developed something in under two years, so the ONF solution can only be realized if somebody simply extends it on their own.  DriveNets’ extension to programmability and an ONF implementation are both “coulds”.

Even without being able to create a programmable network like the ONF diagram shows, DriveNets has made a tremendous advance here.  It might also demonstrate that there are at least two ways to create that programmable network, the yet-to-be-defined SDN controller federation model and the “network-wide control plane” model.  Two options to achieve our goals are surely better than one.

Customers or opportunities are another place where more is better, and to achieve that, any competitor in the new-model network space is going to have to confront the business case question.  AT&T has an unusually high sensitivity to infrastructure costs, owing to its overall low demand density.  It would be logical for them to be on the leading edge of an open-model network revolution.  How far behind might the other operators be?  That’s likely to depend on what value proposition, beyond the simple “the new network costs less than routers”, is presented to prospects.  Most operators with high demand densities won’t face the issues AT&T is facing for another couple years, but there are other factors that could drive them to a transformation decision before that.  It’s a question of who, if anyone, presents those other factors.

We are going to have a transformation in networking eventually, for every operator, and I think the AT&T/DriveNets deal makes that clear.  New models work and they’re cheaper, and every Wall Street research firm I know of (and I know of a lot of them) expects telco spending to be at least slightly off for the balance of this year, and of 2021 as well.   In fact, they don’t really see anything to turn that trend around.  Even operators with high demand densities and correspondingly lower pressure on capex savings will still not throw money away when a cheaper and better option is generated.  Opportunities for a new strategy are growing.

So are alternatives to what that strategy might look like.  We’re going to have a bit of a race in fielding a solution, between cheaper routers, white-boxes with simple router instances aboard, clusters with separated control and data planes like DriveNets, and SDN/ONF-based solutions.  The combination of opportunity and competition means there’s a race to pick the right prospects and tell the right story.  It’s going to be an interesting 2021 for sure.

Does “Lean NFV” Move the NFV Ball?

NFV has seen a lot of movement recently, but all movement isn’t progress.  I noted in earlier blogs that NFV still shows up in a lot of telco diagrams on implementation of 5G, including OpenRAN, and it’s also included in vendor diagrams of their support for telco cloud, even cloud-native.  The problem is that NFV isn’t really suited for that broad an application set, a view I hold firmly and one many operators share.  Thus, it’s not surprising that there’s interest in doing something to redeem NFV.

One initiative getting a lot of attention these days is Lean NFV, whose white paper is now in the 2020 “C” revision.  I told Fierce Telecom about my reservations regarding the concept, and I want to dig into their latest material to see if there’s anything there to either resolve or harden them.  The MEF’s support, after all, could mean something.  At least, it might take NFV out of a venue that didn’t produce into one that still might have a chance.

The first subhead in the paper I referenced above is a good start: “Where did we go wrong?”  The paragraph under that heading is promising, if a bit lacking in specificity.  That specificity, we could hope, will come in the rest of the document.  The main theme is that NFV is too complex, in no small part because it was never truly architected (my words) to define the pieces and how they all fit.  The functional diagram became an implementation guide, which created something that’s “too closely coupled to how the rest of the infrastructure is managed,” to quote the paper.

This is perhaps the best sign of all, because the biggest single problem with NFV was indeed the relentless effort to advance it by containing its scope.  The rest of the telco network and operations world presented a bunch of potential tie points, and rather than define something that was optimum for the mission of virtualizing functions, the ISG optimized the use of those legacy tie points.  But does Lean NFV do any better?

Lean NFV defines three pieces of functionality that represent goals to be addressed by Lean NFV, and if there’s to be a benefit to the concept, it has to come because these are different/better than the ISG model.  The first is the NFV manager, which manages not only the VNFs but “the end-to-end NFV service chains”.  The paper takes pains to say that this isn’t necessarily a monolithic software structure; it could be a collection of management functions.  The second is the “computational infrastructure”, which is I think an unnecessarily complicated way of saying the “resource pool” or “cloud”.  The third is the VNFs themselves, which the paper says might have their own EMS for “VNF-specific configuration”.

The way that Lean NFV proposes to be different/better is by concentrating on what it describes as the “three points of integration”, “when the NFV manager is integrated with the existing computational infrastructure, when VNFs are integrated with the NFV manager, and when coordination is required between the various components of the NFV manager.”  It proposes to use a key-value store to serve as a kind of “repository” to unify things at these three critical points, and leave the rest of the implementation to float as the competitive market dictates.  The paper goes on to describe how the three critical integration points would be addressed, and simplified, by the key-value approach.
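
To illustrate the kind of coordination the paper seems to have in mind (the keys and structure below are mine, not Lean NFV’s), the manager, the infrastructure, and a VNF would meet only at the store:

```python
class KeyValueStore:
    """Toy stand-in for the shared store Lean NFV proposes as the integration point."""
    def __init__(self): self._data = {}
    def put(self, key, value): self._data[key] = value
    def get(self, key): return self._data.get(key)

store = KeyValueStore()
# NFV manager side: publish desired state for a VNF instance (key names are hypothetical).
store.put("vnf/firewall-1/desired", {"image": "fw:2.1", "cpu": 2})
# Infrastructure/VNF side: read the desired state, act on it, write back what was observed.
desired = store.get("vnf/firewall-1/desired")
store.put("vnf/firewall-1/observed", {"status": "running", "cpu": desired["cpu"]})
print(store.get("vnf/firewall-1/observed"))
```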

What I propose to do to assess this is to forget, for the moment, the specifics of how the integration is to be done (the key-value store) and look instead at what the solution is supposed to be delivering to each of these three integration areas.  If the deliverables aren’t suitable, it doesn’t matter how they’re achieved.  If they are, we can look at whether key-value stores are the best approach.

The first suggestion, regarding the interface to computational resources, is certainly sensible.  The original NFV was very OpenStack-centric, and what the paper proposes is to, in effect, turn the whole computational-resource thing into a kind of intent model.  You define some APIs that represent the basic features that all forms of infrastructure manager should support, and then you allow the implementation to fill in what’s inside that black box.  All of the goals of the paragraph describing this are sensible and, I think, important to the success of NFV.

The second suggestion relates to the NFV manager, and I will take the liberty of reading it as an endorsement of a data-model-driven coupling of events to processes.  The data model serves as the interface between processes, implying that it sets a standard for data that all processes adhere to regardless of their internal representations.  This can all, at the high level, be related to the TMF NGOSS Contract work that I think is the seminal material on this topic.
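
A minimal sketch of what data-model-driven coupling means in that NGOSS-Contract spirit: the service model, not the code, decides which process handles which event in which state.  The states, events, and handler below are illustrative.

```python
def handle(model_element, event, handlers):
    """Look up the process for (state, event) in the element's own table; any
    process instance can then run the step statelessly against the model."""
    process_name = model_element["events"].get((model_element["state"], event))
    if process_name is None:
        return model_element  # event not meaningful in this state; ignore or log
    return handlers[process_name](model_element)

def activate(elem):
    elem["state"] = "active"
    return elem

vpn_leg = {"state": "ordered",
           "events": {("ordered", "resources_ready"): "activate"}}
print(handle(vpn_leg, "resources_ready", {"activate": activate}))  # -> state 'active'
```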

The third suggestion is the one I have the most concern about, and it may well be the most important.  Lean NFV suggests that the issues with VNF onboarding relate to the configuration information and the fact that VNFs use old and often proprietary interfaces.  Lean NFV will provide “a universal and scalable mechanism for bidirectional communication between NFV management systems and VNFs”, which I believe is saying that the data model will set a standard to “rewrite” VNFs.  I don’t think that there’s much interest among vendors in rewriting, so I’m not comfortable with this approach, even at the high level.

OK, where this has taken us is to accept two of the three “goal-level” pieces of Lean NFV, but not the third.  That leads to the question of whether the key-value store approach is the way to approach those goals, and in my view it is not.  I have to say, reluctantly, that I think the Lean NFV process makes the same kind of mistakes as the original NFV did.  They’re wrong, differently.

One problem is that a key-value store doesn’t define the relationship between services, VNFs, and infrastructure.  Yes, it describes how to parameterize stuff, but a service is a graph, not a list.  It shows relationships and not just values.  In order to commit compute resources or connect things, I need a data model that lets me know what needs to be connected and how things have to be deployed.  I told the ONAP people that until they were model-driven, I would decline to take further briefings.  The same goes here.

The second issue is the lack of explicit event-driven behavior.  APIs are static and tend to encourage “synchronous” (call-and-wait) behavior, where events dictate an asynchronous “call-me-back” approach.  Not only does Lean NFV not mandate events, it provides no specific mechanism to associate events to processes.  It suggests that microcontrollers could “watch” for changes in the key-value store, which makes the implementation more like a poll-for-status approach, something we know isn’t scalable to large-scale networks and services.

The biggest problem, though, is that we’re still not addressing the basic question of what a virtual network function is.  Recall, in a prior blog, that I noted that there was early interest in decomposing current “physical network functions” (things like router or firewall code) into logical features, and then permitting recomposition via cloud behavior.  If we decide that a VNF is the entire logic of a device, then making it virtual does nothing but let us run it on a different device.  There may be subtle performance and economic differences when we look at hosting a VNF in a white box (uCPE), a commercial server, a container, a VM, or even serverless, but will this be enough to “transform”?

There are some good ideas here, starting with a pretty straightforward recognition of what’s wrong with ETSI NFV.  The problem I see is that, like the ISG, the Lean NFV people got fixated on a concept rather than embarking on a quest to match virtual functions to modern abstract resource pools like the cloud.  In the case of Lean NFV, the concept was the key-value store.  To a hammer, everything looks like a nail, and so it appears that the Lean NFV strategy was shaped by the proposed solution rather than the other way around.

There’s still time to fix this.  The paper is very light on implementation details, as I’ve noted before regarding the Lean NFV initiative.  That could mean I’ve missed the mark, but it could also give the group the chance to consider some of these points.  The goals are good, the way to achieve them isn’t so good.  I’m really hopeful that the organization will move to fix things, because there’s a lot of wasted motion in the NFV space, and this at least has some potential.

The problem is that if “Lean NFV” were in fact to adopt my suggestions, it might still be “Lean” but it would have moved itself rather far from ETSI NFV.  There’s never been a standard that a telco couldn’t place too much reliance upon, for too long.  NFV is surely not one to break that rule.

An ONF-Sponsored Event on Open-Source 5G

How much can open source do for 5G?  That’s a question a lot of operators are asking (and a lot of vendors, too).  The ONF, who is taking a much bigger/broader role these days, thinks they have some answers, so it’s well worth looking at them.  This is (topically speaking) a kind of follow-on to my blog yesterday on the TelecomTV cloud-native conference, so if you’ve not read that, you might want to read it first!

Nobody doubts the motivation behind the open movement in 5G.  Network operators have faced declining profit per bit for almost 15 years now.  Mobile was, for a time, the only space that was immune, because usage-based pricing in some form is still available in mobile services.  Now, even mobile services are seeing the same problems, and since 5G represents at least a major build-out, it’s both a major transformation opportunity and a major business risk.

If operators cannot build 5G using an open model, then what’s likely the last major network service transformation of this decade will reinforce the closed, proprietary, model.  Incremental transformation is a contradiction in terms, so getting out of that reinforced and renewed lock-in will be difficult.  The opportunity to create openness is now.

The presentations in the ONF session are a mixture of vendor and operator views, with the former obviously representing a positioning the vendors think they can win at.  Still, they’re useful in laying out what an open model would look like, and they’re also interesting, in that they almost uniformly illustrate the continued risk of what I’ll call the “functional-to-architectural” transformation, the thing that likely doomed NFV to failure.  I won’t call out individual vendors here because the issue is universal, but I will explain the problem.

We’re layer-obsessed in networking, ever since the Basic Reference Model for Open Systems Interconnection (the “OSI model”) emerged in the late 1970s.  When we draw 5G, we draw layers, and these layers are useful in that they define a building-up of functionality starting with the very basics and moving to the (hopefully) sublime experiences.  There’s no question that in 5G, “services” and “management” and so forth are at a higher functional layer than the user and control planes of 5G Core.

The problem is in how literally we take the diagrams.  How dramatic a leap is it for some poor architect, looking at a diagram that shows the function they’re trying to design as a box in a stack of boxes, to end up designing a box?  Monolithic implementations tend to develop out of simple functional diagrams taken too literally.  That was a fatal risk for NFV, and it’s still a risk for 5G, open or otherwise.  The universality of this form of explaining things in the material is proof we have to guard against that risk.

Probably a reasonable way to look at the material, which is obviously diverse in both target and approach, is to frame it as a kind of point-and-counterpoint between what I’ll describe as the “operator vision” and the “vendor vision”.  As I said, I’m not going to name people or firms here because I don’t want to personalize comments, only make observations.

Let’s start with an operator seen almost universally as an innovator in 5G, so it’s reasonable to start with their presentation to set the stage.  The first architecture diagram shows the high-stakes game we’re in; they see the future 5G infrastructure as being a combination of edge and central data centers.  That’s why “carrier cloud” had/has the potential to build out a hundred thousand incremental data centers, according to my model.  These map, in functional terms, to a resource pool for NFVI, showing how reluctant operators are to abandon the NFV notion.

The critical business claim made here is that the open approach creates a 40% reduction in capex and a 30% reduction in opex.  Neither of these stated benefits is consistent with the NFV experience data I’ve seen globally; capex reductions have rarely exceeded 25% and opex has been the same or even a bit higher in the open approach.  Since the data cited was for OpenRAN, I think the specialized application elements may offer better savings than the NFV average.  That may be true for 5G on a broader scale too.

Why OpenRAN might generate at least capital savings, and other benefits as well, is explained by a second operator presentation.  They cite four “disruptors” in the telecom sector.  First is competition from the hyperscalers (cloud providers, predominantly), second the evolution of technology overall toward software and the cloud, third the more-demanding customer experience expectations, and finally the lack of innovation created by vendor consolidation and loss of competition.

All these factors are interesting, but the last one may be especially so.  As network operators have responded to declining profit per bit with pressure on infrastructure pricing and projects, vendors have suffered.  This suffering leads to consolidation.  One of today’s vendors, Nokia, is the sum of three previous major network vendors and many smaller ones.

The same thing that’s leading to consolidation and loss of vendor innovation is also leading to “incrementalism” in network infrastructure.  A massive change in a network requires a massive business case.  If vendor innovation is indeed being stifled, there is little or no chance that the kind of technical shift that creates a massive business case would be presented by any vendor.  That, I think, is the real justification for looking for another model, something to replace a competitive field of proprietary giants.

The same second operator cites three key ingredients in a solution to their problems.  The first is disaggregation, which they’ve taken to mean the breaking up of monolithic devices into virtualized functions.  Second is orchestration to automate the inevitably more complicated operations associated with disaggregated functions, and third is open APIs to expose critical capabilities for exploitation by evolving services and techniques.

The final point this operator makes is that the “edge cloud” is going to be a key point in differentiating telcos from hyperscalers/cloud providers.  This raises the question of why so many operators are partnering with public cloud providers, and seem to be stalling on making any progress in carrier cloud at all, much less mobile edge.  It also suggests that either the operator believes that hyperscalers will enter the “carrier cloud” market, perhaps offering 5G-as-a-service, or that the telcos will inevitably have to climb up the value chain to compete on the hyperscalers’ own turf.

Particularly when a third operator has a public-cloud partner at the center of their own architecture.  Fortunately, this operator may offer an explanation, too.  They show the edge cloud, presumably owned by them, connecting to a public cloud.  This would suggest that operators are almost universally interested in public cloud as a supplementary or transitional resource, which of course would be good news for vendors if it’s true.

Speaking of vendors, this is a good place to start thinking and talking about them, and about their approach to the open 5G theme.  As I noted above, there’s still what is to me a disquieting tendency for vendors to hunker down on the NFV story, despite the fact that in another of the recent online events, Google admitted that NFV had succeeded primarily in changing the mindset of telcos, not through adoption.  One operator did retain NFV in their diagrams, but the others were more “cloud” oriented, generalizing their goal as “exploiting cloud technology” or even “cloud-native”.  I think there are a lot of people who don’t want to face up to the fact that NFV was a seven-year boondoggle, but they’ll quietly accept that something beyond it is needed.  One vendor presentation implies as much, with a platform layer that hosts “containers” and “NFV”.

The ONF presented its own view of what at least part of that “something beyond” might be, which is an SDN-centric vision of routing.  They have an SDN controller talking to a bunch of P4 Stratum switches, running applications like BGP and perhaps even 5G.  This is surely a step in a different direction, but I have concerns about whether it’s a better one, because of the implicit centralization.

I’m all for control/data-plane separation, as readers of my blog surely know.  I’m all for specializing forwarding devices for the data plane.  But I’m not for centralizing the control plane, because we have absolutely no convincing information to prove that central forwarding control can work at a network scale.  You need hierarchies or federation, and those would need some work to get defined.  We may well not have time for that work to be done.

I’m also concerned about later elements of the ONF presentation, in particular the way they seem to be coupling 5G to the picture.  They introduce policy control and enforcement, which to me makes no sense if you assume you have complete and central control of forwarding.  An SDN-like mechanism, or any mechanism designed to provide dynamic forwarding control, should present its capabilities as network-as-a-service, to be consumed by the higher elements, including 5G.

What I see at the vendor (or "source") level overall is a tendency to draw diagrams and propose platforms rather than to define complete solutions.  It's easy to show the future as a set of indefinite (and therefore presumably infinitely flexible) APIs leading up to a limitless set of services and refinements.  There is a sense that there has to be a "fabric" or "mesh" of some sort that lives above the forwarding process and hosts the control plane(s) (both the IP and 5G ones), but there is no proposed open-source solution for those elements.

The thread that ties all the material together is a thread of omission.  We don't have a specific structure for hosting a separate control plane in a cloud-native, practical way.  We don't have an architecture to define how that control plane would work, how its elements would combine to do its job of controlling forwarding and keeping track of topology and status.  Google has done some of this in Andromeda, its SDN/BGP4 core, but it's not a general solution and Google has never said it was.

Innovation, the innovation that the second operator said had been lost, is needed in this very area.  Without specificity in the framework of that cloud-native universe of floating functionality that lives above forwarding, we’re not going to have a practical transformed 5G, or much of anything else.

We also may have to get specific with respect to "open" in networks.  Does every piece of hardware have to be based on off-the-shelf stuff?  Does all the software have to be open-source?  We cannot achieve that today, consistent with having something that actually works within reasonable performance and cost limits.  There's still a lot of room for innovation, and just because the giants of the past won't or can't innovate doesn't mean the startups of today, and the future, shouldn't be allowed to give it a go.  It may well prove that they should get their chance.

Looming in the background here is the growing interest of public cloud providers in offering 5G.  Hosting 5G in the cloud could still rely on open implementations of 5G, but since the cloud can already host almost anything, such a basic approach wouldn't offer a cloud provider much differentiation.  They're all clearly angling for a role in supplying "5G Core-as-a-Service", which is why Microsoft recently bought Metaswitch, a vendor with a 5G software stack.  Can the cloud providers' as-a-service approach defeat the open-source movement, or will operators see it as simply replacing lock-in to traditional mobile infrastructure vendors with lock-in to cloud providers?

An open network doesn’t lock you in.  That’s the simple definition I think we have to accept, at least for now.  Since it’s the cost of the physical devices, and the contribution of any annual-subscription software, that creates lock-in, we have to match approaches to the test of controlling these two factors if we want to preserve openness…and that still leaves the question of innovation.

The ONF presentation showed innovation, but in a direction that’s likely to raise a serious risk in performance and scalability.  Yesterday, we heard about the DriveNets win in AT&T’s core.  DriveNets has, at least within one of their clusters, a cloud-hosted and separate control plane.  Could this spread between clusters, become something like a distributed form of SDN controller, and resolve the problems with the future model of networking that shows the most promise?  I hope to blog more on them, as soon as I can get the information I need.  This might be a critical innovation, even if DriveNets software isn’t open-source.

If operators want to open everything, to eliminate any vendor profit motive in building network equipment, they'll need to accept that the innovation baton will necessarily pass to them.  Right now, by their own admission, they are completely unprepared to provide the technical and architectural support needed to play that role.  That means their vision of the network of the future has to acknowledge not just the loss of the old proprietary innovators, but the fact that new ones will be needed, along with new visions of "openness" to accommodate them.

Analyzing the Telecom TV Take on Cloud-Native

Industry events are good, providing you can time-shift to watch them.  That’s particularly true for analysts, because drinking bathwater is not an attractive image, whether it’s your own or not.  The term could be said to apply to any number of things that create a kind of intellectual recycling, a belief set that builds on itself rather than on outside needs and opportunities.  Like all market pundits, I try to avoid it, and one way to do that without becoming a conference gadfly is to watch replays of events.

TelecomTV had a nice series on a topic dear to my heart, cloud-native and telco cloud, and I want to review what I think it meant.  As always, I’m putting my own stamp on this, hopefully to explain, add perspective, and sometimes disagree.  Tomorrow’s blog will look at another conference series, this one on 5G and open-source, sponsored by the ONF.

The thing that seemed to cross all the sessions, in my own view, was that there’s still a lot of haze surrounding the notion of “cloud-native”.  There are some who think that it means something designed to be hosted, period.  Some think that it’s about reorganizing things a bit to allow the telcos to create a kind of hybrid-/multi-cloud framework of their own, much like other verticals.  Some think it’s about adopting cloud-like development practices and tools, like continuous integration and delivery (CI/CD).  Finally, of course, some think it’s what it really is, which is the disaggregation of functions and applications into microservices to build new models of old things.

Perhaps the most important thing about the conference videos is that they demonstrate that just about every organization has people who see cloud-native for what it is.  Some of the speakers commented that they had real cloud-native dialogs with every operator they talked with, which would certainly be a good sign.  Since I know that I, at least, have had conversations with every operator that proved cloud-native wasn't the universal language there, a lot of your cloud-native discussion will depend on who you happen to be talking with.

Another thread that cut across sessions is that operators recognize that they are not really cloud-qualified.  Not only are they fairly sure they don't know how to build their own carrier cloud, they're not entirely sure they know how to consume someone else's cloud effectively.  My sense, from the comments, is that most of the operators don't think this is going to be an easy problem to solve, either.  Since operators participated meaningfully in all the sessions, which proves that some people in the operator space do get it (refer again to my last paragraph), the problem is obviously breadth of knowledge.  Network people, to paraphrase one operator, don't do cloud.

What’s perhaps the most important “technical” point from the conference is that 5G is very likely the big driver of cloud-native awareness among operators.  The fact that 5G separates the control plane and the user plane was cited often, probably at least once in every session where cloud-native drivers were brought up.  Sadly, this good news is dampened by the fact that the comment was often accompanied by two points I think are at least sub-optimal, if not wrong.

The first point is that 5G "requires" cloud-native.  It doesn't; it requires (meaning specifies) NFV, which is itself struggling toward cloud-native understanding.  5G is surely an opportunity to introduce and leverage cloud-native development, but it's still a decision that operators and vendors will have to make explicitly, and so far, I'd say that most are not making it.  If 5G is truly a big driver for cloud-native, then it may be getting wasted as we speak.

The second point is that “control plane” is one of those highly ambiguous terms.  The term usually seems to mean what used to be called “signaling”, the process of controlling communication and facilities, rather than the channel that passed information end-to-end.  What makes this definition a risk in the 5G context is that 5G networks run on IP, which has its own “control plane”.  The 5G User Plane, then, consists of the IP control plane and the IP data plane (or forwarding plane).  Since we have a lot of discussions about separating the control plane in IP networks, it’s easy to see how people might think that 5G mandates that.  It does, but only with its own signaling, which is a higher layer than that of the IP control plane.

What 5G does, most explicitly, is to separate its own control and user planes.  That creates two incentives, one implicit and one explicit, for IP networks to do the same.

The first incentive is that if 5G Core represents the state of the art in operator thinking, then it’s a powerful architectural reference for other services.  That doesn’t just mean new or old services, but both.  We should expect to see control/user separation across the board, because it’s what operators think is the right approach.

The explicit incentive is that 5G Core presents interfaces (N1, N2, and N4) between the control and user planes.  If the user plane is IP, then you can argue that an IP network should be able to support these interfaces, meaning that the AMF/SMF (access and mobility management and session management) elements could rightfully be seen as network services built on those interfaces.  If we assume that the IP network has an independent, cloud-hosted IP control plane, then the Nx interfaces are essentially APIs that invoke network service features.  This sure sounds like network-as-a-service, and it would represent a model of how other services, old and new, interface with the IP network.
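
To illustrate the point about the Nx interfaces acting as APIs into a network-as-a-service layer, here's a minimal sketch.  The names (NetworkAsAService, create_session, and so on) are my own invention for illustration, not anything defined in the 3GPP specs.

    # Hypothetical sketch: a cloud-hosted IP control plane exposing NaaS calls
    # that a 5G control-plane element (an SMF-like session manager) consumes.
    class NetworkAsAService:
        def __init__(self):
            self.sessions = {}

        def create_session(self, user, qos_profile):
            # a real implementation would program user-plane forwarding here
            session_id = len(self.sessions) + 1
            self.sessions[session_id] = {"user": user, "qos": qos_profile}
            return session_id

        def modify_session(self, session_id, qos_profile):
            self.sessions[session_id]["qos"] = qos_profile

    class SessionManager:
        """An SMF-like consumer of the NaaS APIs."""
        def __init__(self, naas):
            self.naas = naas

        def attach(self, user):
            return self.naas.create_session(user, qos_profile="default")

    naas = NetworkAsAService()
    session = SessionManager(naas).attach("imsi-001")
    naas.modify_session(session, qos_profile="low-latency")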

The reason this matters for cloud-native wasn’t really brought out in the conference, but I’ll suggest one.  The higher you go in terms of service layers, from connectivity up to applications and experiences, the more cloud-like your architecture had better be, to conform to application trends.  If the Nx interfaces are the boundary between “legacy” connection services and “over-the-top” services, then they represent the place where cloud behavior starts to make sense.  That argues for considering the IP control plane and the 5G (or other service) control plane as sublayers, with a common and cloud-native implementation.

Vendors in the sessions were more likely to see cloud-native in its real, and realistic, form than the operators.  One vendor even talked about state control, which is critical for cloud-native but something most don't understand at all.  But even the vendors had the view that 5G Core was written to be cloud-native, and I don't see anything in the specs that admits of that interpretation.

Another area where vendors had a distinctive (and understandable) combined view was that operators really needed to have their own cloud, eventually.  Most operators also seemed to agree with this, but it seems like the cloud software vendors are recognizing that public cloud hosting of "carrier cloud" applications could siphon off a lot of their opportunity, and they're particularly sensitive to the loss of the "edge" opportunity.

If you were to dissect my forecast of carrier cloud data center opportunity, over 70% of the data centers would be expected to deploy at the edge, largely in central offices and mobile backhaul concentration points.  Given that, and given that the software vendors would face a significant opportunity loss were the operators to outsource their edge requirements to public cloud providers, all the software vendors saw evolution of carrier cloud as starting with 5G and quickly becoming edge-centric.  They also saw the biggest public cloud outsource opportunities in the operators’ administration and operations systems, not in hosting network features or functions.

Cloud-native is here, providing that we're willing to accept a rather loosey-goosey definition of what "cloud-native" is.  I'm grateful to TelecomTV for running these kinds of events, and for making the material available online afterward.  It gives us all a stylized but still grounded view of what the operators and vendors most committed to changing things are doing and thinking.  We're not there yet, not in the transformed future, but we can see it through the weeds now, and that's progress enough to applaud.

Network Equipment Differentiation: Still Possible?

Differentiation has always been a challenge, but it seems like the problems vendors and providers face in standing out from the masses are getting worse.  Part of that is because there are more and more areas that could offer differentiation, more difficulties making buyers understand the issues in those areas, and more overlap across the areas.  What's a vendor to do, and what might be the result of all this "doing", or at least trying?

Forty years ago, computing platforms were almost totally proprietary.  If you bought a machine from DEC, you bought into the whole DEC ecosystem from hardware up to networking products.  Networks are still largely built from proprietary devices that are still monoliths, bundles of everything, but computing has been shifting more to an open model, and networking is starting to move in that direction too.  Does everyone surrender to the Will of the Masses here, or are there still spaces where a good iconoclast vendor could stand out?

The obvious one is “what comes next?”  If we’re losing network equipment differentiation because of the transformation of networking, then the path of transformation and even the end game might well be differentiators.  In fact, they’re probably the first level of selection that will separate the good from the…deceased.

The current view of network transformation threatens traditional vendor positioning.  Some call this future transformed state “cloud”, some say it’s “software”, but in either case what we’re really talking about is shifting away from “box networking” meaning routers, to something that hosts network features.  Vendors in the network space have grown and prospered selling boxes, but if the future network is cobbled together from a bunch of Lego blocks, what do those vendors sell?  A Lego block is a lot less differentiated than a Lego castle, so unbundling network boxes into separate pieces seems to cry out “commoditization” in a loud (and scary) voice.  How loud and scary the future will turn out to be depends on exactly how the evolution happens.

The router-vendor vision of the future is simple: yes, hardware and software have to be separated.  We separated them by unbundling the router software and making it an annual license, a subscription.  In theory, you could run other software on our boxes (good luck finding any, and getting support will likely require miraculous intervention).  In theory, you could run our software on other hardware (same good wishes).  But, hey, we're trying.

The router-software vision is simple too.  A router transforms into a router instance running on a commercial off-the-shelf server (COTS).  The server won't deliver the same performance as a customized hardware platform, but you could claim this was a "cloud-enabled" solution because you could run the router software on a VM in the cloud.  Expect performance there to be even worse, of course.

White-box people have a germ of a truly new vision.  You run generic router software on a custom device, a device that is augmented with the chips and other technology needed to let it at least approach the performance of dedicated routers.  This is probably what most people today are seeing as the right path, but it has the disadvantage of leaving the cloud behind.  Network operators believe in the cloud, even if they don’t know exactly what it is.

There are some in-the-network differentiation opportunities, of course.  White boxes can be differentiated by the chip set included in them, which could make it possible to create a cluster device whose individual elements are optimized to the mission.  Some might look like brawny forwarding engines, others like lookup engines, and others like cloud application hosts.  Again, leveraging this would require the decomposition of “routing” as a function into abstract elements that could then be hosted where they fit best.

Chips themselves could be differentiators, too.  The challenge with creating your own custom chips (which major router vendors do for their systems) is that they can compromise your ability to claim an “open” architecture.  A dispersed cluster of routing elements like the one I just described could still be fully open if it exposed the proper APIs to permit customization, and of course, generic white boxes are inherently open.  Stick a custom chip in a white box and it’s open only to the extent that the chip is generally exploitable, and if it is, then it’s not differentiating.

The P4 language and the ONF Stratum architecture model could offer a way to maintain an open claim while using custom silicon nobody else could get.  P4 is a flow-management language, an API and “driver” that harmonizes different chips to a common programming language.  That means that you could support a standard flow manipulation approach with custom, differentiable, silicon.  It would also mean that other stuff could be run on your box, and that your stuff could be run on other boxes, providing that everything used P4.
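
The essence of that harmonization is a common match-action abstraction over dissimilar silicon.  The sketch below is not P4; it's a Python cartoon of the idea, with hypothetical names, just to show how one chip-independent "program" can be installed through different drivers.

    # Cartoon of a match-action abstraction over different forwarding chips.
    class ForwardingChip:
        def install(self, table, match, action):
            raise NotImplementedError          # every chip driver implements this

    class MerchantChip(ForwardingChip):
        def install(self, table, match, action):
            print(f"merchant silicon: {table} {match} -> {action}")

    class CustomChip(ForwardingChip):
        def install(self, table, match, action):
            print(f"custom silicon:   {table} {match} -> {action}")

    def program_ipv4_forwarding(chip, routes):
        # the forwarding "program" is chip-independent; the driver is not
        for prefix, next_hop in routes.items():
            chip.install("ipv4_lpm", {"dst": prefix}, {"set_next_hop": next_hop})

    routes = {"10.1.0.0/16": "port1", "10.2.0.0/16": "port2"}
    program_ipv4_forwarding(MerchantChip(), routes)
    program_ipv4_forwarding(CustomChip(), routes)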

So far, all of these options come with plusses and minuses, but a big minus for them all is that they’re really tweaking the old router model.  It’s not enough to just run a monolithic control plane and a monolithic data plane on either the same device or tightly coupled devices.  The future network disaggregates right down to the feature level.  You separate the control, management, and data planes and you host the former two on the cloud, using cloud-native techniques.  The data plane gets mapped to specialized white-box devices.  We get the best of the white-box world without the performance risks, and we make the cloud the centerpiece.

The common thread in network operator transformation discussions is “the cloud”.  There’s an implicit belief that if cloud principles were to be applied to building networks, the cloud could transform networks as it’s transformed IT.  If every operator believes that, then every differentiating position has to be somehow tied back to the cloud.  That’s the big failing of so many of the approaches to differentiation I’ve outlined above.  They don’t end up with a cloud, and that’s where every operator thinks they have to be heading.  But how do you apply cloud-think to networks?

For about two decades now, there have been criticisms of the IP routing process, aimed primarily at the elements that would be called the "control plane".  Things like MPLS, Next-Hop Resolution Protocol, and more recently the various area-routing approaches are all based on refinements to the adaptive routing process.  SDN proposed to replace all of this with a central routing instance, a place that collected topology and status and sent out updates to all the forwarding tables impacted by changes.  Could the essential piece of uniting cloud and network be the replacement of the traditional IP control plane with something else?  SDN without the centralization risk?
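
To make the central-instance model concrete, here's a minimal sketch, purely illustrative and not any vendor's implementation, of a controller that holds topology, computes paths, and derives the forwarding entries it would push to every device.  The scaling concern is what happens when the topology, the device count, and the rate of change all get large.

    # Minimal central-controller sketch: hold topology, compute paths,
    # derive the next-hop entries that would be pushed to each device.
    from collections import deque

    topology = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}

    def shortest_path(topo, src, dst):
        queue, seen = deque([[src]]), {src}   # BFS; real controllers weight links
        while queue:
            path = queue.popleft()
            if path[-1] == dst:
                return path
            for nxt in topo[path[-1]]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None

    def compute_fibs(topo):
        fibs = {node: {} for node in topo}
        for src in topo:
            for dst in topo:
                if src != dst:
                    fibs[src][dst] = shortest_path(topo, src, dst)[1]  # next hop
        return fibs

    for device, entries in compute_fibs(topology).items():
        print(device, entries)   # the "push updates to forwarding tables" step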

You could surely pull the control plane out of a single box and create a cluster that lets the control and data planes evolve in cloud-native form, while keeping the two as separate as they need to be, given the extreme differences in control- and data-plane requirements.  DriveNets, who won a Light Reading startup award, does this, and they've recently claimed a PoC that illustrates some pretty profound benefits.  The story suggests that they could extend this separate control plane multi-dimensionally, upward to climb toward actually making the network experience-aware, and horizontally across multiple devices/clusters to create a distributed form of SDN's central controller.  Forwarding, in this vision, is simply a kind of transport layer, controlled from above via the cloud.

That new model is what I think is the ultimate differentiator.  The biggest minus (or plus, depending on your perspective) is that traditional router vendors surely see this as being differentiating via total disruption.  If you change everything, then you’re telling customers that they might as well look at other alternatives than you, as long as they’re starting fresh.  Can a router vendor develop an evolutionary revolution?  I have to wonder why, if they could, they haven’t done it already, when it’s so clear that buyers are eager for something truly transformational.

The smart thing for router vendors, then, would be to accept what’s happening and begin a transformation of their own approach.  I can already pull forwarding (the data plane) and the control plane of routing apart.  If I were a router vendor, I could then implement that cloud-ready control plane, support my current product line, and evolve myself toward cheaper, simpler, “white-box-like” (or even white-box) devices.

It’s not going to be as good a business, though, because it can’t be.  You can’t make buyers spend more on a transformation when their primary transformation goal is to spend less.  Router vendors will have to accept that their sales, and their organizations, are inevitably going to get smaller. “Shrinkage” is tough to sell as a mantra in growth-driven Silicon Valley, but it’s preferable to vanishing.  Mainframe computers and even minicomputers once drove the IT market, and neither do that today.  DEC, Data General, Perkin-Elmer, RCA, CDC, and others all had to learn that success is riding the wave of change, not being drowned by it.  So do today’s network vendors, and just because past initiatives aimed at this radical sort of transformation failed, it doesn’t mean that eventually, even accidentally, someone won’t get it right.

We Can’t Put Off Thinking About Latency!

If latency is important, just what constitutes “good” and “bad” latency levels?  How does latency figure into network and application design, and what are the sources of latency that we can best control?  I’ve talked about latency before, but over the last couple months I’ve been collecting information on latency that should let me get a bit more precise, more quantitative.

Latency is a factor in two rather different ways.  In "control" applications, latency determines the length of the "control loop", which is the time it takes for an event to be recognized and acted on.  In transactional and information transfer applications, latency determines the time it takes to acknowledge the transfer of something.  The difference here is important because the impact of latency in these areas is very different.

Control-loop latency is best understood by relating it to human reaction time.  People react to different sensory stimuli differently, but for visual stimulus, the average reaction time is about 250 milliseconds.  Auditory reaction time is shorter, at about 170ms, and touch the shortest of all at 150ms.  In control-loop processes whose behavior can be related to human behavior (automation), these represent the maximum latency that could be experienced without a perception of delay.

Transactional or information transfer latency is much more variable, because the former source can be related to human reaction time and the latter is purely a system reaction.  Online transaction processing and data entry can be shown to be just as latency-sensitive as a control loop.  Years ago, I developed a data entry application that required workers to achieve a high speed.  We found that they actually were able to enter data faster if they were not shown the prompts on the screen, because they tended to read and absorb them even when experience told them the order of field entry.  But information transfer latency can be worse; if messages are sent at the pace that acknowledgments can be received, even latencies of less than 50ms can impact application performance.

The sources of latency in an actual networked application are just as complex, maybe even more so.  There is what can be called "initiation latency", which represents the time it takes to convert a real-world condition into an event.  Then we have "transmission latency", which is the time it takes to get an event to or from the processing point, and then the "process latency", which is the cumulative delay in actually processing an event through whatever number of stages are defined.  Finally, we have "termination latency", which is the delay in activating the control system that creates the real-world reaction.

The problem we tend to have in dealing with latency is rooted in the tendency to simplify things by omitting one, or even most, of the sources of latency in a discussion.  For example, if you send an event from an IoT device on 4G to a serverless element in the public cloud, you might experience a total delay of 300ms (the average reported to me by a dozen enterprises who have tested serverless).  If 5G can reduce latency by 75%, as some have proposed, does that mean I could see my latency drop to 75ms?  No, because 200 of the 300ms of latency is associated with the serverless load-and-process delay.  Only 100ms is due to the network connection, so the most I could hope for is a drop to 225ms.
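
The arithmetic is worth spelling out, because it's easy to apply a percentage improvement to the wrong base.  A trivial sketch using the numbers above:

    # Apply the 5G improvement only to the network component, not the total.
    network_ms = 100                             # access/transport latency
    process_ms = 200                             # serverless load-and-process delay
    total_ms = network_ms + process_ms           # the 300ms observed today

    network_after_5g = network_ms * (1 - 0.75)   # the claimed 75% network reduction
    total_after_5g = network_after_5g + process_ms
    print(total_after_5g)                        # 225.0, not 75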

The key point here is that you always have to separate “network” and “process” latencies, and expect new technology to impact only the area that the technology is changing.  IP networks with high-speed paths tend to have a low latency, so it’s very possible that the majority of network latency lies in the edge connection.  But mobile edge latency even without 5G averages only about 70ms (global average), compared to just under half that for wireline, and under 20ms for FTTH.  Processing latency varies according to a bunch of factors, and for application design it’s those factors that will likely dominate.

There are four factors associated with process latency, and they bear an interesting resemblance to the factors involved in latency overall.  First there’s “scheduling latency”, which is the delay in getting the event/message to the process point.  Second, there’s “deployment latency”, which is the time needed to put the target process in a runnable state.  Third is the actual process latency, and fourth the “return latency”, associated with getting the response back onto the network and onward to the target.  All of these can be influenced by application design and where and how things are deployed.

The practical reality in latency management is that it starts with a process hierarchy.  This reality has led to all sorts of hype around the concept of edge computing, and while there is an edge computing element involved in latency management, it’s in most cases not the kind of “edge-of-the-cloud” or “edge computing service” that we hear about.

The first step in latency management is to create a local event handler for the quick responses that make up most “real-time” demands.  Opening a gate based on the arrival of a truck, or the reading of an enabling RFID on a bumper, is a local matter.  Everything, in fact, is a “local matter” unless it either draws on a data source that can’t be locally maintained, or requires more processing than a local device can provide.  In IoT, this local event handler would likely be a small industrial computer, even a single-board computer (SBC).

The goal is to place the local event handler just where the name suggests, which is local to the event source.  You don't want it in the cloud or in a special edge data center; you want it right there.  Onboard a vehicle, in a factory, etc.  The closer it is, the less latency is added to the most latency-critical tasks, because that's where you'll want to move them.

In network terms, meaning virtual or cloud-network terms, you want to ensure that your local event handler is co-located with the network event source.  It can be literally in the same box, or it can be in a rack or cluster or even data center.  What you’re looking for is to shorten the communications path, so you don’t eat up your delay budget moving stuff around.

The second step is to measure the delay budget of what cannot be handled locally.  Once you’ve put what you can inside a local event handler, nothing further can be done to reduce latency for the tasks assigned to it, so there’s no sense worrying about that stuff.  It’s what can’t be done locally that you have to consider.  For each “deeper” event interaction, there will be a latency budget associated with its processing.  What you’ll likely find is that event-handling tasks will fall into categories according to their delay budgets.

The local-control stuff should be seen as “real-time” with latency budgets between 10 and 40ms, which means on the average as fast as any human reaction.  At the next level, the data I get from enterprises says that the budget range is between 40 and 150ms, and most enterprises recognize that there is a third level with a budget of 150 to 500ms.
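
One simple way to use these numbers is to classify each event-handling task by its latency budget and stage it accordingly.  The tier boundaries below are the ones just cited; the tasks and function names are hypothetical.

    # Classify tasks into processing tiers by their latency budget (in ms).
    def tier_for_budget(budget_ms):
        if budget_ms <= 40:
            return "local event handler (real-time, 10-40ms)"
        if budget_ms <= 150:
            return "resident second-level process (40-150ms)"
        if budget_ms <= 500:
            return "deeper third-level process (150-500ms)"
        return "not latency-critical"

    tasks = {"open gate": 30, "update schedule": 120, "reconcile billing": 400}
    for task, budget in tasks.items():
        print(f"{task}: {tier_for_budget(budget)}")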

In terms of architecture for latency-sensitive applications, this division suggests that you'd want to have a local controller (as local as you can make it) that hands off to another process that is resident and waiting.  The next level of the process can be serverless, consist of distributed microservices, or whatever, but it's almost certain that this kind of structure, using today's tools for orchestration and connectivity, couldn't meet the budget requirements.  The data I have on cloud access suggests that it's not necessary for even the intermediary-stage (40-150ms) processing to be in a special edge data center, only that it not be processed so far from the local event handler that the hop latency gets excessive.

The latency issue, then, is a lot more complicated than it seems.  5G isn’t going to solve it, nor will any other single development, because of the spread of sources.  However, there are some lessons that I think should be learned from all of this.

The first one is that we're being too cavalier with modern orchestration, serverless, and service mesh technology as applied to things like IoT or even protocol control planes.  Often these technologies will generate latencies far greater than even the third-level maximum of 500ms, and right now I'm doubtful that a true cloud implementation of microservice-based event handling using a service mesh could meet the second-level standard even under good conditions.  It would never meet the first-level standard.  Serverless could be even worse.  We need to be thinking seriously about the fundamental latency of our cloud technologies, especially when we're componentizing the event path.

The second lesson is that application design to create a series of hierarchical control-loop paths is critical if there’s to be any hope of creating a responsive event-driven application.  You need to have “cutoff points” where you stage processing to respond to events at that point, rather than pass them deeper.  That may involve prepositioning data in digested form, but it might also mean “anticipatory triggers” in applications like IoT.  If you have to look up a truck’s bill of lading, you don’t wait till it presents itself at the gate to the loading dock.  Read the RFID on the access road in, so you can just open the gate and direct the vehicle as needed.
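
Here's a minimal sketch of that kind of cutoff-point design, using the truck-and-gate example; the anticipatory read on the access road pre-positions the lookup result so the gate decision stays a purely local matter.  All of the names and timings are hypothetical.

    # Cutoff point: the local handler answers from pre-positioned data and only
    # escalates (and blows its budget) if the anticipatory trigger never fired.
    import time

    local_cache = {}                         # bill-of-lading data staged at the gate

    def lookup_bill_of_lading(rfid):
        time.sleep(0.2)                      # stands in for a ~200ms deep lookup
        return {"rfid": rfid, "dock": 7}

    def anticipatory_trigger(rfid):
        # fired when the RFID is read on the access road, ahead of the gate
        local_cache[rfid] = lookup_bill_of_lading(rfid)

    def gate_event(rfid):
        if rfid in local_cache:              # local decision, tens-of-ms budget
            return f"open gate, direct to dock {local_cache[rfid]['dock']}"
        return lookup_bill_of_lading(rfid)   # escalation: the budget is gone

    anticipatory_trigger("TRUCK-42")         # read on the way in
    print(gate_event("TRUCK-42"))            # now a purely local matter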

The third lesson is that, as always, we’re oversimplifying.  We are not going to build any new technology for any new application or mission using 500-word clickbait as our guiding light.  Buyers need to understand a technology innovation well enough to define their business case and assess the implementation process to manage risks.  It’s getting hard to do that, both because technology issues are getting more complex, and because our resources are becoming more superficial by the day.

I’ve done a lot of recent work in assessing the architectures of distributed applications, especially cloud and cloud-native ones.  What I’ve found is that there isn’t nearly enough attention being paid to the length of control loops, the QoE of users, or the latency impact of componentization and event/workflows.  I think we’re, in an architectural sense, still transitioning between the monolithic age and the distributed age, and cloud-native is creating a push to change that may be in danger of outrunning our experience.  I’m not saying we need to slow down, but I am saying we need to take software architecture and deployment environment for cloud-native very seriously.

Navigating the Road to Cloud-Native Network Functions

The NFV community accepts the need to modernize, but it’s more difficult for them to say what “modern” looks like.  Certainly there’s pressure for change, but the pressure seems as much about tuning terminology as actually changing technology.  Nowhere is this more obvious than in the area of “cloud-native”.

Virtual Network Functions (VNFs), the meat and potatoes of NFV, run in virtual machines.  That VM link generates two specific issues.  First, the number of VMs you can host on a server is limited, which means that the mechanism isn’t efficient for small VNFs.  Second, a VM carries with it the overhead of the whole OS and middleware stack, which not only fills up resources, it increases the operations burden.

One proposed solution is to go to "CNFs", which some have called "cloud network functions" and some "containerized network functions".  The latter would be a better definition because the approach is really about making containers work for VNF hosting, but even here we're seeing some cynicism creep in.  The lingua franca of container orchestration is Kubernetes, but a fair chunk (and perhaps a dominant one) of the NFV community is looking more at the OpenStack platform, since OpenStack was a part of the original NFV spec.

The other solution is to go all the way to “cloud-native”, which is a challenge given that the term is tough to define even outside the telco world.  We can fairly say that “cloud-native” is not just sticking VNFs in containers, but what exactly is it, and what does it involve?  I’ve mentioned cloud-native network functions (CNNFs) in prior blogs, but not really addressed what’s involved.  Let’s at least give it a go now.

A CNNF, to be truly “cloud-native” should be a microservice, which means that it should be a fairly small element of functionality and that it should not store data or state internal to the code.  That allows it to be instantiated anywhere, and allows any instance to process a given unit of work.  The biggest problem we have in CNNF creation, though, may be less the definition and more the source of the code itself.
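
A minimal sketch of the statelessness point: keep the state in an external store so any instance, including a freshly scaled one, can process the next unit of work.  The store and the function names here are hypothetical stand-ins, not a reference CNNF design.

    # A stateless "network function": all state lives in an external store,
    # so any instance can handle any event for any session.
    class StateStore:
        """Stands in for an external back-end (database, distributed cache, etc.)."""
        def __init__(self):
            self._data = {}
        def get(self, key):
            return self._data.get(key, {})
        def put(self, key, value):
            self._data[key] = value

    def handle_event(store, session_id, event):
        state = store.get(session_id)        # fetch state, never hold it locally
        count = state.get("events", 0) + 1
        store.put(session_id, {"events": count, "last": event})
        return f"session {session_id}: {count} events, last was {event}"

    store = StateStore()
    print(handle_event(store, "s1", "attach"))     # instance A could run this...
    print(handle_event(store, "s1", "handover"))   # ...and instance B this one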

When the first NFV ISG meeting was held in Silicon Valley in 2013, there was a fairly vocal dispute over the question of whether we needed to worry about decomposition of current code before we worried about how to compose services from VNFs.  A few in the meeting believed strongly that if current physical network functions (PNFs) hosted in devices were simply declared "virtual" by extracting the device code and hosting it, the value of NFV would be reduced.  Others, myself included, were concerned for four reasons.

First, there would likely be a considerable effort involved in decomposing current code, and vendors who owned PNFs wouldn’t be likely to be willing to undertake the effort for free.  That would likely raise the licensing fees on VNFs and impact the business case for NFV.

Second, there would likely be pressure to allow decomposed PNFs to be reassembled in different ways, even mixing vendor sources.  That would require a standardization of how PNFs were decomposed, and the vendor-mixing process would surely reduce vendor interest.

Third, it was clear that if you composed a service from a chained series of VNFs, the network latency associated with the VNF-to-VNF connections could impact performance to the point where the result wouldn’t be salable at all.

Finally, there were clearly some network functions that were inherently monolithic.  It’s hard to decompose the forwarding plane of a device at the end of a packet trunk.  What would be the strategy for handling those?

In the end, the decision was made to not require decomposition of existing PNFs, and that was probably the right decision.  However, no decision was even considered on whether to support the notion of decomposed PNFs, and that has proved to be unfortunate, because had there been such a decision, we might have considered the CNNF concept earlier.

The four points above, in my view then as now, really mean that there is no single model that’s best for hosting VNFs.  The key point in support of CNNFs is that they’re not likely to be the only “NFs” to be hosted.  My own proposal was that there be a service model for each service, that there be an element of the model representing any network function, and that the element specify the “Infrastructure Manager” needed to deploy and manage it.  That still seems, in some form, at least, to be the best and only starting point for CNNFs.  That way, whatever is needed is specified.
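
A minimal sketch of what that service-model idea might look like: each element of the model names the infrastructure manager that deploys it, so white-box, container, and cloud-native elements can coexist in one service.  The structure and names are my own illustration, not an NFV ISG artifact.

    # A service model whose elements each declare their own infrastructure manager.
    service_model = {
        "service": "business-vpn",
        "elements": [
            {"function": "edge-termination", "infrastructure_manager": "white-box-im"},
            {"function": "firewall",         "infrastructure_manager": "container-im"},
            {"function": "routing-control",  "infrastructure_manager": "cloud-native-im"},
        ],
    }

    deployers = {   # one deployment handler per infrastructure manager type
        "white-box-im":    lambda f: f"flash {f} image to a white box",
        "container-im":    lambda f: f"schedule {f} as a container",
        "cloud-native-im": lambda f: f"deploy {f} as scalable microservices",
    }

    for element in service_model["elements"]:
        print(deployers[element["infrastructure_manager"]](element["function"]))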

Some network functions should deploy in white boxes, some in bare-metal servers, some in VMs, and some in containers.  The deployment dimension comes first.  Second would come the orchestration and management dimension, and finally the functional dimension.  This order of addressing the issue of network functions is important, because if we disregard it, we end up missing something critical.

The orchestration and management processes used for deployment have to reflect the things on both sides.  Obviously, we have to deploy on what we’re targeting to deploy on.  Equally obvious, we need to deploy the function we want.  The nature of that function, and the target of deployment, establish the kind of management and orchestration we need, and indirectly that then relates to the whole issue of how we define CNFs and CNNFs, and what we do differently in each.

If we want to exploit cloud-native anywhere, I think we have to accept that the network layer divides into the data/forwarding plane and the control plane.  The former is really about fast packet throughput and so is almost surely linked to specialized hardware everywhere but the subscriber edge.  The latter is about processing events, which is what control-plane datagrams in IP are about.  The control plane is quite suitable for cloud-native implementation.  The management plane, the way all of the elements are configured and operationalized, is a vertical layer if we imagine the data/control planes to be horizontal.

The management-plane stuff is important, I think, because you can view management as being event-driven too.  However, if you are going to have event-driven management, you need some mechanism to steer events to processes.  The traditional approach of the event queue works for monolithic/stateful implementations, but it adds latency (while things wait in the queue), doesn’t easily support scaling under load (because it’s not stateless), and can create collisions when events come fast enough that there’s something in the queue that changes conditions while you’re trying to process something that came before.  The TMF NGOSS Contract approach is the right one; service models (contracts) steer events to processes.
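
The NGOSS Contract idea can be sketched in a few lines: the service model carries a state/event table, and that table, not a queue, decides which process sees each event.  The states, events, and processes below are illustrative only.

    # State/event steering: the service contract maps (state, event) to a process.
    def mark_active(svc):    print(f"{svc}: activation done");  return "active"
    def scale_out(svc):      print(f"{svc}: scaling out");      return "active"
    def redeploy(svc):       print(f"{svc}: redeploying");      return "deploying"

    contract = {                               # (current state, event) -> process
        ("deploying", "deploy-complete"): mark_active,
        ("active",    "overload"):        scale_out,
        ("active",    "fault"):           redeploy,
    }

    def handle(service, state, event):
        process = contract.get((state, event))
        return process(service) if process else state   # ignore meaningless events

    state = "deploying"
    for event in ["deploy-complete", "overload", "fault"]:
        state = handle("vpn-123", state, event)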

The event-driven processes can be stateless and cloud native, and they can also be stateful and even monolithic, providing that they are executed autonomously (asynchronously) so they don’t hold up the rest of the processing.  Thus, you could in theory kick off transaction processing from an event-driven model as long as the number of transactions wasn’t excessive.

The hosting of all of this will depend on the situation.  Despite what’s been said many times, containers are not a necessary or sufficient condition for cloud-native.  I think that in most cases, cloud-native implementations will be based on containers for efficiency reasons, but there are probably situations where VMs or even bare metal are better.  There’s no reason to set a specific hosting requirement, because if you have a model-and-event approach, the deployment and redeployment can be handled in state/event processes.  If you don’t (meaning you have an NFV-like Virtual Infrastructure Manager), then the VIM should be specific to the hosting type.  I do not agree with the NFV approach of having one VIM; there should be as many VIMs as needed.

And then there’s latency.  If you are going to have distributed features make up services, you have to pay attention to the impact of the distribution process on the quality of experience (QoE).  Stringing three or four functions out in a service chain over some number of data centers is surely going to introduce more latency than having the same four processes locally resident in a device at the edge.  The whole idea was silly, in my view, but if latency can kill NFV service chains, what might it do to services built on a distributed set of microservices?  If you’re not careful, the same thing.

CNFs do have value, because containers have a value in comparison to the higher-overhead VMs.  CNNFs would have more value, but I think that realizing either is going to require serious architecting of the service and its components.  Separation of control and data planes is likely critical for almost any network function success, for example.  Even with that, though, we need to be thinking about how the control plane of IP can be harnessed, and perhaps even combined with “higher-layer” stuff to do something useful.  Utility, after all, is the ultimate step in justifying any technology change.