How a Separate Control and Data Plane Would Work

How would a separate control plane for IP work?  What would it facilitate?  It’s pretty obvious that if you were to separate the control and data planes of IP, you could tune the implementation of each of these independently, creating the basis for a disaggregated model of routing versus the traditional node-centric IP approach, but why bother?  To answer these questions, we have to go back in time to early attempts to work non-IP “core networks” into an IP network.

The classic, and most-technology-agnostic, view of a separate control plane is shown in the figure below.  In it, and in all the other figures in this blog, the blue arrows represent data-plane traffic and the red ones the control-plane traffic.  As the figure shows, the control plane sets the forwarding rules that govern data-plane movement.  Control-plane traffic (at least that traffic that’s related to forwarding control or state) is extracted at the entry interface and passed to the control plane processing element.  At the exit interface, control-plane traffic is synthesized from the data the processing element retains.

The “Classic” Separate Control Plane
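To make that division concrete, here's a minimal sketch of the classic model in Python.  Every class, field, and protocol name here is invented for illustration; the point is only the shape of the split: the data plane forwards from a table it doesn't compute, and the control plane consumes the extracted control traffic and programs that table.

```python
# Minimal sketch of a split control/data plane (all names are illustrative).
CONTROL_PROTOCOLS = {"OSPF", "BGP"}   # traffic "related to forwarding control or state"

class ControlPlane:
    """Processing element that owns routing state and programs forwarding."""
    def __init__(self):
        self.routes = {}              # prefix -> egress port

    def handle(self, packet, forwarding_table):
        # A real implementation would run a routing protocol here; this sketch
        # just installs whatever route the control packet happens to advertise.
        prefix, port = packet["prefix"], packet["port"]
        self.routes[prefix] = port
        forwarding_table[prefix] = port          # push the rule down to the data plane

class DataPlane:
    """Dumb forwarder: extract control traffic, forward the rest by table lookup."""
    def __init__(self, control_plane):
        self.control_plane = control_plane
        self.forwarding_table = {}               # prefix -> egress port, set from above

    def receive(self, packet):
        if packet.get("protocol") in CONTROL_PROTOCOLS:
            self.control_plane.handle(packet, self.forwarding_table)
            return "extracted to control plane"
        port = self.forwarding_table.get(packet["prefix"])
        return f"forwarded on port {port}" if port else "dropped (no route)"

cp = ControlPlane()
dp = DataPlane(cp)
print(dp.receive({"protocol": "BGP", "prefix": "10.0.0.0/8", "port": 3}))
print(dp.receive({"protocol": "DATA", "prefix": "10.0.0.0/8"}))
```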

The earliest example of something like this is decades old.  Back in the 1990s, when ATM and frame relay were just coming out, there was interest in exploiting the expected widespread deployment of one or both of these networks to carry IP.  Either protocol created “virtual circuits” analogous to voice calls, and so the question was how to relate the destination of an IP packet that had to travel across one of these networks (called “Non-Broadcast Multi-Access” networks) with the IP exit point associated with the destination.  The result was the Next-Hop Resolution Protocol, or NHRP.

NHRP’s Approach Uses a Control Server

The figure above visualizes NHRP operation.  The IP users are expected to be in “Logical IP Subnets” or LISs, and a LIS is a Layer 2 enclave, meaning it doesn’t contain routers.  The gateway router for the LIS is an NHRP client, and each client registers its subnet with the NHRP server.  When an NHRP client receives a packet for another LIS, it interrogates the server for the NBMA address of the NHRP client that serves the destination.  The originating client then establishes a virtual connection with the destination client, and the packets are passed.  Eventually, if not used, the connection will time out.
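For the curious, here's a toy model of that registration-and-resolution exchange.  It isn't the RFC 2332 wire protocol; the classes, fields, and the NBMA address format are stand-ins meant only to show the flow: register, resolve, establish a virtual circuit, and let it age out if unused.

```python
# Toy model of NHRP-style next-hop resolution (not the RFC 2332 protocol itself).
import time

class NHRPServer:
    def __init__(self):
        self.registrations = {}                  # IP prefix -> NBMA address

    def register(self, prefix, nbma_address):
        self.registrations[prefix] = nbma_address

    def resolve(self, prefix):
        return self.registrations.get(prefix)

class NHRPClient:
    def __init__(self, name, server, vc_timeout=300):
        self.name, self.server, self.vc_timeout = name, server, vc_timeout
        self.virtual_circuits = {}               # NBMA address -> last-used time

    def send(self, dest_prefix, packet):
        nbma = self.server.resolve(dest_prefix)  # interrogate the server
        if nbma is None:
            return "no exit point known; drop or use a routed path"
        now = time.time()
        # reuse the VC if it exists and hasn't timed out, else "establish" one
        if nbma not in self.virtual_circuits or now - self.virtual_circuits[nbma] > self.vc_timeout:
            print(f"{self.name}: establishing virtual circuit to {nbma}")
        self.virtual_circuits[nbma] = now
        return f"{packet} delivered over VC to {nbma}"

server = NHRPServer()
server.register("192.168.2.0/24", nbma_address="atm:47.0005.80")   # placeholder address
ingress = NHRPClient("gateway-A", server)
print(ingress.send("192.168.2.0/24", "IP packet"))
```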

NHRP never had much use because frame relay and ATM failed to gain broad deployment, so things went quiet for a while.  When Software-Defined Networking was introduced, with the ONF as its champion, it proposed a different model of non-IP network than ATM or frame relay had proposed, and so it required a different strategy.

The goal of SDN was to separate the IP control plane from forwarding to facilitate centralized, effective, traffic engineering.  This was done by dividing IP packet handling into a forwarding-plane element and a control-plane element.  The forwarding plane was expected to be implemented by commodity white-box switches equipped with the OpenFlow protocol, and the control plane was to be implemented using a central SDN controller.

The SDN Model of Separating the Control Plane

In operation, SDN would create a series of protocol-independent tunnels within the SDN domain.  At the boundary, control packets that related in any way to status or topology would be handled by the SDN controller, and all other packets would be passed onto an SDN tunnel based on a forwarding table that was maintained via OpenFlow from the SDN controller.
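A minimal sketch of that split follows, assuming a drastically simplified rule format (real OpenFlow match fields and flow-mod messages are far richer): a central controller programs match/action rules into white-box switches, and a table miss punts the packet back to the controller.

```python
# Sketch of the ONF split: a central controller computes paths over its view of
# the domain and pushes match/action rules to white-box switches.

class WhiteBoxSwitch:
    def __init__(self, name):
        self.name = name
        self.flow_table = []                       # ordered (match, action) pairs

    def install_rule(self, match, action):         # stands in for an OpenFlow flow-mod
        self.flow_table.append((match, action))

    def forward(self, packet):
        for match, action in self.flow_table:
            if packet.get("dst", "").startswith(match):
                return action
        return "punt to controller"                # table miss

class SDNController:
    def __init__(self, switches):
        self.switches = switches

    def build_tunnel(self, dst_prefix, path):
        # program each hop of a protocol-independent tunnel across the domain
        for switch_name, out_port in path:
            self.switches[switch_name].install_rule(dst_prefix, f"out:{out_port}")

switches = {"s1": WhiteBoxSwitch("s1"), "s2": WhiteBoxSwitch("s2")}
controller = SDNController(switches)
controller.build_tunnel("10.1.", [("s1", 2), ("s2", 4)])
print(switches["s1"].forward({"dst": "10.1.0.5"}))    # -> out:2
print(switches["s1"].forward({"dst": "172.16.0.1"}))  # -> punt to controller
```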

While the principal goal of SDN was traffic engineering, it was quickly recognized that if the SDN controller provided “northbound APIs” that allowed for external application control of the global forwarding table and the individual tables of the forwarding switches, the result would allow for application control of forwarding.  This is the current SDN concept, the one that was presented in the recent SDN presentation by the ONF, referenced in one of my earlier blogs.

This SDN model introduced the value of programmability to the ONF approach.  Things like the IP protocols (notably BGP, the protocol used to link autonomous systems, or ASes, in IP) and even the 5G N2/N4 interfaces could now be mapped directly to forwarding rules.  However, the external applications that controlled forwarding were still external, and the IP control plane was still living inside that SDN controller.

The fact that Lumina Networks closed its doors even as it had engagements with some telcos should be a strong indicator that the “SDN controller” approach has issues that making it more programmable won’t resolve.  In fact, I think the biggest lesson to be learned from Lumina is that the monolithic controller isn’t a suitable framework.  How, then, do we achieve programmability?

Google had (and has) its own take on SDN, one that involves both traffic engineering and a form of network function virtualization.  Called “Andromeda”, the Google model was designed to create what’s turned out to be two Google backbone networks, one (B2) carrying Internet-facing traffic and the other (B4) carrying the inter-data-center traffic involved in building experiences.  Andromeda in its current form (2.2) is really a combination of SDN and what could be described as a service mesh.  Both the IP control plane and what were the “external applications” in SDN are now “microfeatures” implemented as microservices and hosted on a fabric controlled by SDN and OpenFlow.  The latency of the fabric is very low, and it’s used both to connect processes and to pass IP protocol streams (the two data sources for those two Google backbones).

Google Andromeda and Microfeatures

With Andromeda, virtual networks are built on top of an SDN “fabric”, and each of these networks is independent.  The early examples of Andromeda show them being based on the “private” IP address spaces, in fact.  Networks are built from the rack upward, with the basic unit of both networking and compute being a data center.  Some of Google’s virtual networks extend across multiple locations, even throughout the world. 

Andromeda could make a rightful claim to being the first and only true cloud-native implementation of virtual functions.  The reliance on microfeatures (implemented as cloud-native microservices) means that the control plane is extensible not only to new types of IP interactions (new protocols for topology/status, for example) but also “vertically” to envelop the range of external applications that might be added to IP.

An example of this flexibility can be found in support for 5G.  The N2/N4 interfaces of 5G pass between the control plane (of 5G) and the “user plane”, which is IP.  It would be possible to implement these interfaces as internal microservice APIs or events, coupled through the fabric between microfeatures.  It would also be possible to have 5G N2/N4 directly influence forwarding table entries, or draw on data contained in those tables.  For mobility management, could this mean that an SDN conduit created by OpenFlow could replace a traditional tunnel?  It would seem so.
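Here's a notional sketch of that idea, with event types and field names invented for illustration rather than drawn from the 3GPP N4/PFCP specifications: a session event arriving on the microfeature fabric is translated directly into a forwarding entry, and a handover becomes a simple repointing of the conduit rather than a new tunnel.

```python
# Notional sketch: a 5G-style session event arrives as an internal event on the
# microfeature fabric and is translated directly into a forwarding entry.

class ForwardingTable:
    def __init__(self):
        self.entries = {}                       # UE IP -> conduit / next hop

    def set_conduit(self, ue_ip, conduit):
        self.entries[ue_ip] = conduit

class MobilityMicrofeature:
    """Consumes session events and programs the shared forwarding table."""
    def __init__(self, table):
        self.table = table

    def on_session_event(self, event):
        if event["type"] == "session-establish":
            self.table.set_conduit(event["ue_ip"], event["conduit"])
        elif event["type"] == "handover":
            # mobility handled by repointing the conduit, not re-tunneling
            self.table.set_conduit(event["ue_ip"], event["new_conduit"])

table = ForwardingTable()
feature = MobilityMicrofeature(table)
feature.on_session_event({"type": "session-establish", "ue_ip": "10.45.0.7", "conduit": "edge-site-1"})
feature.on_session_event({"type": "handover", "ue_ip": "10.45.0.7", "new_conduit": "edge-site-2"})
print(table.entries)   # {'10.45.0.7': 'edge-site-2'}
```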

It’s worthwhile to stop here a moment to contrast the ONF/SDN approach and the Google Andromeda approach.  It seems to hinge on two points—the people and the perceived mission.  The ONF SDN model was created by network people, for the purposes of building a network to connect users.  The Google Andromeda approach was created by cloud people to build and connect experiences.  In short, Google was building a network for the real mission of the Internet, experience delivery, while the ONF was still building a connection network.

I think the combination of the ONF concept and the Google Andromeda concept illustrates the evolution of networking.  If operators are confined to providing connectivity, they’re disconnected from new revenue sources.  As a cloud provider, an experience-centric company, Google designed a network model that fit the cloud.  In point of fact, they built the network around the cloud.

I’ve blogged about Andromeda dozens of times, because it’s always seemed to me to be the leading edge of network-think.  It still is, and so I think that new-model and open-model networking is going to have to converge in architecture to an Andromeda model.  Andromeda’s big contribution is that by isolating the data plane and converting it to simple forwarding, it allows cloud-hosted forwarding control in any form to be added above.  Since networks and network protocols are differentiated not by their data plane but by their forwarding control, it makes networks as agile as they can be, as agile as the cloud.

Where “above” turns out to be is a question that combines the issues of separating the control plane and the issues of “locality” (refer to my blog HERE).  I think the Andromeda model, which is only one of the initiatives Google has undertaken to improve “experience latency” in the cloud, demonstrates that there really should be cooperation between the network and any “service mesh” or microfeature cloud, to secure reasonable cumulative latency.  To make that happen, the process of placing features, or calling on feature-as-a-service, has to consider the overall latency issues, including issues relating to network delay or feature deployment.

There’s also the question of what specific microfeatures would be offered in the supercontrol-plane model.  Obviously, you need to have a central topology map for the scope of the supercontrol-plane and obviously you have to be able to extract control-plane packets from the interface and route them to the supercontrol-plane, then return synthesized packets to the exit interfaces.  How should all this be done, meaning how “microfeatured” should we go?  There’s a lot of room for differentiation here, in part because this is where most of the real service revenue potential of the new model would be created.  Could an entire CDN, for example, be migrated into the supercontrol-plane, or an entire 5G control plane layer?
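One way to picture a supercontrol-plane is as a dispatcher over pluggable microfeatures.  The sketch below is purely illustrative (the handler names and packet fields are my own inventions, not anyone's product): extracted control packets are routed to whichever microfeature claims them, and the handlers return the packets to be synthesized at the exit interfaces.

```python
# Sketch of a supercontrol-plane as a dispatcher over pluggable microfeatures.

class SuperControlPlane:
    def __init__(self):
        self.topology = {}          # the central topology map for the domain
        self.microfeatures = {}     # feature name -> handler callable

    def register(self, name, handler):
        self.microfeatures[name] = handler

    def on_control_packet(self, interface, packet):
        handler = self.microfeatures.get(packet["feature"])
        if handler is None:
            return []                       # no feature claims it; discard
        # a handler returns packets to synthesize at exit interfaces
        return handler(self.topology, interface, packet)

def bgp_feature(topology, interface, packet):
    topology[packet["prefix"]] = interface          # learn reachability
    return [{"exit": "peer-ifaces", "feature": "bgp", "advertise": packet["prefix"]}]

def cdn_redirect_feature(topology, interface, packet):
    # an "entire CDN" migrated into the control plane could answer requests here
    return [{"exit": interface, "feature": "cdn", "redirect_to": "nearest-cache"}]

scp = SuperControlPlane()
scp.register("bgp", bgp_feature)
scp.register("cdn", cdn_redirect_feature)
print(scp.on_control_packet("if-1", {"feature": "bgp", "prefix": "203.0.113.0/24"}))
print(scp.on_control_packet("if-2", {"feature": "cdn", "request": "video-1"}))
```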

A supercontrol-plane strategy that would allow new-model networking to be linked to both revenues in general and 5G in particular would be a powerful driver for the new model and white boxes.  By linking white boxes to both the largest source of greenfield deployment (5G) and the strongest overall driver for modernization (new revenue), supercontrol-plane technology could be the most significant driver for change in the whole of networking, and thus the greatest source of competition and opportunity.  Once vendors figure this out, there will likely be a land-rush positioning effort…to some variant on one of the approaches I’ve outlined.  Which?  Who knows?

While all of this clearly complicates the question of what the new network model might be, the fact is that it’s still a good sign.  These issues were never unimportant, only unrecognized.  We’re now seeing the market frame out choices, and in doing that they’re defining what differentiates the choices from each other.  Differentiation is the combination of difference and value, and that’s a good combination to lead us into a new network age.

How Will Cisco Respond to Open-Model Networking?

Cisco is facing a revolution that would totally change their business model, a revolution that will devalue traditional routers.  They’re already seeing the signs of a business revolution, in fact.  Thus, the question isn’t whether they’ll respond (they must) but how they’ll respond, and where that response might lead the rest of the industry.

To software, obviously, or at least to more software.  There are growing signs that Cisco is going way deeper into software.  For years, Cisco has been the only major network vendor that provides servers and platform software tools, and its recent acquisitions (Portshift, BabbleLabs, Modcam) have been more in the IT space than in the network space.  It’s not surprising that Cisco would be watching the IT side, given that it faces a major challenge in its network equipment business.  What may be surprising is that Cisco seems focused not on applying IT to networks, but on applications overall.  The question is whether what “seems” to be true really is.

For literally half a century, networking has been a darling of CFOs.  Information penned up in data centers could be released, via a network connection, to reach workers and enhance their performance.  There was so much pent-up demand that few network projects really faced much resistance.  It was almost “Build it, in case they come!”

The good times are all gone, as the song goes.  Network operators face compression between revenue and cost per bit, reducing return on infrastructure investment and putting massive pressure on capital budgets.  Enterprises, having unlocked most of that confined information resource base, are now having difficulties justifying spending more network dollars without additional, specific, benefits.  The most significant result of this combination has been a shift of focus among buyers, toward “open-model” networks.

An open-model network is a network created by a combination of white-box switches and separate network software.  White-box switches have a long history, going back at least as far as SDN, but the Open Compute Project and the Telecom Infra Project are current supporters of the concept, and so is the ONF (with Stratum).  Because the white-box device can run multiple software packages, it’s almost like a server in its ability to be generalized.  It’s built from commercially available parts, based on open specifications, and so there’s a competitive market, unlike that for proprietary switches and routers.

Network operators, in particular, have been increasingly interested in this space because it promised a break from the classic vendor-lock-in problem that they believe has driven up costs for them, even as revenue per bit has fallen.  The original SDN white-box approach was somewhat attractive in the data center, but the central controller wasn’t popular for WAN applications.  Now, with players like DriveNets pushing a cluster-cloud router based on white-box technology, it’s clear that routers will be under direct threat too.

Cisco has lost business to white boxes already, and with the AT&T/DriveNets deal demonstrating that a Tier One operator is willing to bet their core network on them, further interest among operators is inevitable.  Capital budgets for networking were already slipping, and white boxes could make things immeasurably worse.  No wonder Cisco feels pressure, especially from investors.

Logically, there are two steps that Cisco could take to relieve their own stock-price worries.  The first is to increase their revenues outside their core device sales.  They started doing this years ago with things like WebEx and Unified Computing System (UCS) servers, and they’ve also been unbundling their IOS network operating system to run on white boxes, and as a subscription offering.  The second is to try to beat the white-box movement at its own game.

Just selling white boxes, or promoting IOS as a white-box OS, wouldn’t generate much for Cisco.  You have to be able to add value, not just replicate router networks with a cheaper platform.  As I pointed out in an earlier blog (HERE), the white-box space and the SDN movement combine to argue for a strict separation of the IP control plane and the data plane.  The devices that can host the control plane look very much like standard cloud servers, and the data-plane devices are custom white-box switches with chips designed to create high-performance forwarding.  It’s very much like the SDN model of the forwarding devices and the central controller, except that the control plane isn’t necessarily implemented by a central controller at all.  DriveNets hosts the control plane via microservices in what we could describe as a “cluster-cloud”.  Google’s Andromeda composes control planes (and other experience-level components) from microfeatures clustered around an SDN core, and older concepts like the Next-Hop Resolution Protocol (NHRP) describe how to deliver IP routing from what they call an NBMA (Non-Broadcast Multi-Access) network.  In short, we have no shortage of specialized non-centralized IP control planes (I’ll get into more detail on some of these in a later blog).

Referring again to my earlier blog, the IP control plane is only one of several “control planes” that act to influence forwarding behavior and create network services.  IMS/EPC in 4G networks, and the control/user-plane separation in 5G NR and Core, have service control layers, and it’s not hard to see how these new service-control elements and the IP control plane could be combined in a cloud implementation.  Given that, the first question is, “Does Cisco see it too?”  The second is “Does Cisco think they could survive it?”  It’s obvious they can, and do, see it, so the survival question is the relevant one.

The SDN model uses white-box forwarding devices as slaves to control logic, and vendors like Cisco have generally (and reluctantly) provided for SDN OpenFlow control of their routers and switches.  It’s looked good in the media and hasn’t hurt much because there was no practical control-logic strategy that could span more than a data center.  The problem is that white-box switches like the kind AT&T describes in its press release on its disaggregated core are way cheaper than routers, so a new and practical implementation of a separate control plane to create that “control logic” would validate white boxes network-wide.

One story on the AT&T deal with DriveNets frames the risk to Cisco in terms of Cisco’s Silicon One chip strategy, which demands IOS integration.  That’s not the risk, in my view.  The risk is that a new network model, with a separate control plane expanded to support service coordination, could make packet forwarding a total commodity and provide a mechanism for services at any level to directly manipulate forwarding behavior as needed.  You could argue that the network of the future would become not an IP network (necessarily) but a forwarding plane in search of a service.  If you want to talk commoditization, this is what it would look like if taken to the ultimate level.

And that, friends, is likely what’s on Cisco’s mind.  Cisco has always seen itself as a “fast follower” and not a leader, meaning that it’s wanted to leverage trends that have established themselves rather than try to create their own trends.  That’s probably particularly true when the trend we’re talking about could hurt Cisco, and all router vendors, significantly.  And when the market doesn’t have a clear model of how this new combined “supercontrol-plane” would work, why would Cisco want to teach it that critical lesson?  Why commoditize your own base?

Only because it’s inevitable, and that may explain Cisco’s current thinking.  Server vendors like Dell and HPE, software giants like IBM/Red Hat and VMware, and cloud providers like Amazon, Google, and Microsoft, could all field their own offerings in this area.  So could startups like DriveNets.  Once that happens, Cisco can no longer prevent the secret from getting out.  To the extent that this new-model network is truly best (which I believe it is), Cisco now has to choose between losing its current customers to Cisco’s own successor new-model implementation, or losing to someone else’s.

OK, suppose this is Cisco’s thinking.  What characterizes this new supercontrol-plane?  It’s cloud-hosted, and it integrates applications and experiences directly with forwarding.  It’s really mostly an application, right?  Things like Kubernetes, containers and container security, and even application features like text processing, all live in the cloud, and very possibly either inside (or highly integrated with) this new supercontrol-plane element.  If Cisco has to face the truth of this new element at some point, it makes sense to get its software framework ready to exploit it.

But can a fast-follower strategy work with this kind of disruption?  It might.  The whole reason behind white-box switches and disaggregation of software and hardware is to ensure that the capital assets that build network infrastructure are open.  It’s the hardware that creates a financial boat anchor on advances.  Open it up, and you cut the anchor rope.  But remember that any network operator will already have routers in place.  If they’re Cisco routers, and if Cisco can make its current routers compatible with its supercontrol-plane concept, then Cisco has a leg up, financially, on competitors who’d have to displace Cisco’s routers and force operators to take the write-down.

Finally, if Silicon One is a Cisco asset to be protected, isn’t it one that could be leveraged?  Cisco could build white-box forwarding devices, if white-box forwarding is the model of the future.  Sure, they’d lose revenue relative to selling chassis routers, but if they could make that up by feeding service applications into their supercontrol-plane, that could be OK.  In any event, they can’t stick their finger in the open-model dike and think it will hold forever.

Timing issues represent the big risk to Cisco.  Fast following when you’re doing layoffs and your stock has been downgraded can be a major risk if the player you let take the lead decides to do things perfectly.  I wouldn’t count Cisco out at this point; they still have some pathways to a strong position in the new-model network era, but they’re going to have to accept that they’ll never be the Cisco they were, and sometimes that sort of thing creates a culture shock management can’t get past.  They’ll need to overcome that shock, and be prepared to jump out in front if it looks like a serious rival for new-model network leadership is emerging.

Tracking the White-Box Revolution

Sometimes the real story in something is deeper than the apparent news.  Nobody should be surprised  by the decision by AT&T to suspend any new DSL broadband connections.  This is surely proof that DSL is dead, even for the skeptics, but DSL isn’t the real issue.  The real issue is what’s behind the AT&T decision, and what it means to the market overall.  AT&T is telling us a more complex, and more important, story.

The fundamental truth about DSL is that, like a lot of telecom technology, it was focused on leveraging rather than on innovating.  “Digital subscriber loop” says it all; the goal was to make the twisted-pair copper loop plant that had been used for plain old telephone service (POTS) into a data delivery conduit.  At best, that was a tall order, and while there have been improvements to DSL technology that drove its potential bandwidth into the 25 Mbps range, that couldn’t be achieved for a lot of the current loops because of excessive length or the use of old technology to feed the digital subscriber line access multiplexers (DSLAMs).

The biggest problem DSL faced was that the technology limitations collided with the increased appetite for consumer access bandwidth.  Early attempts to push live TV over DSL involved complex systems (now largely replaced by streaming), and 25 Mbps wasn’t fast enough to support multiple HD/UHD feeds to the same household, at a time when that was a routine requirement.  Competition from cable and from fiber-based variants (including, today, millimeter-wave 5G) means that there’s little value in trying to keep those old copper loops chugging along.

OK, that’s it with DSL.  Let’s move on to AT&T and its own issues.  AT&T had the misfortune to be aligned as the only telco competitor to Verizon in wireline broadband.  As I’ve noted in past blogs, Verizon’s territory has about 7 times the potential to recover costs on access infrastructure as AT&T’s.  Early on, cable was delivering home video and data and AT&T was not, which forced them to try to provide video, leading eventually to their DirecTV deal (which AT&T is trying to sell off, and which is attracting low bids so far).  They’re now looking to lay off people at WarnerMedia, their recent acquisition, to cut costs.

AT&T has no choice but to cut costs, because its low demand density (see THIS blog for more) has already put it at or near the critical point in profit-per-bit shrinkage.  While AT&T may have business issues, they have an aggressive white-box strategy.  Their recent announcement of an open-model white-box core using DriveNets software (HERE) is one step, but if they can’t address the critical issue of customer access better than DSL can, they’re done.  The only thing that can do that is 5G, and so I think it’s clear that there will be no operator more committed to 5G in 2021 than AT&T, and that’s going to have a significant impact on the market.

Recall from my white-box blog reference that AT&T’s view is that an “open” network is one that doesn’t commit an operator to proprietary devices.  The AT&T talk at a Linux Foundation event suggests that their primary focus is on leveraging white boxes everywhere (disaggregated routing is part of the strategy).  That means that AT&T is going to be a major Tier One adopter of things like OpenRAN and open, hosted, 5G options overall.

There couldn’t be a better validation of the technology shift, though as I’ve noted, AT&T’s demand density places it closer to the point of no return (no ROI) on the profit-compression curve than most other operators.  That means that those other operators will either have more time to face the music, or need another decision driver to get things to move faster, but I think they’ll all see the handwriting on the wall.

For the major telco network vendors, this isn’t good news, and in fact it’s bad news for every major network vendor of any sort, even the OSS/BSS players.  As nearly everyone in the industry knows, vendor strategy has focused on getting their camel-nose into every buyer tent and then fattening up so there’s no room for anyone else.  The problem with open-model networking is that it admits everyone by default, because the hardware investment and its depreciation period has been the anchor point for our camel.  Take that away and it’s easy for buyers to do best-of-breed, which means a lot more innovation and a lot less account control.

We’ve already seen signs of major-vendor angst with open-model networking.  Cisco’s weak comment to Scott Raynovich on the DriveNets deal, that “IOS-XR is already deployed in a few AT&T networks as white box software,” is hardly a stirring defense of proprietary routers.  Ericsson did a blog attacking OpenRAN security.  The fact is that no matter what the major vendors say, the cat is out of the bag now, and with its escape it reveals some key questions.

The first of these questions is how much of a role will open-source network software play?  AT&T has demonstrated that it’s looking at open hardware as the key; either open-source or proprietary software is fine as long as it runs on open hardware.  That would seem to admit open-source solutions for everything and perhaps kill off proprietary stuff in all forms—like a technology version of the Permian Extinction.  The problem with that exciting (at least for those lifeforms who survive) theory is that there really aren’t any open-source solutions to the broad network feature problem.  Yes, there are tools like Kubernetes and service mesh and Linux and so forth, but those are the platforms to run the virtual features, not the features themselves.  That virtual feature space is wide open.

Can open-source fill it?  Not unless a whole generation of startups collectively sticks their heads in the sand.  Consensus advance into a revolutionary position is difficult; it’s easier to see revolution through infiltration, meaning that startups with their own vision and no committees to spend days on questions like “When we say ‘we believe…’ in a paper, who are ‘we?’” (a question I actually heard on a project), can drive things forward very quickly.

The second of these questions is are there any fundamental truths that this new open-model wave will have to embrace?  Obviously, the follow-on question would be “what are they”, so let’s admit that the answer to the first is “Yes!” and move to the follow-on.

The first of our specific fundamental truths is that all open-model software strategies, particularly those that exploit white-box technology, have to separate the control and data planes.  The control plane of a network is totally different from the data plane.  The data plane has to be an efficient packet-pusher, something that’s analogous to the flow switches in the ONF OpenFlow SDN model.  AT&T talked about those boxes and their requirements in the Linux Foundation talk.  The control plane is where pushed packets combine with connectivity behavior to become services.  It’s the petri dish where the future value propositions of the network are grown, and multiply.

The second of our specific truths is that the concept of control planes, now defined in multiple ways by multiple bodies and initiatives, has to somehow converge into a cooperative system.  What 5G calls a “control plane” and “user plane” defines a control plane above the IP control plane, which is actually part of 5G’s user plane.  The common nomenclature is, at one level, unfortunate; everything can’t have the same name or names have no value.  At another level, it’s evocative, a step toward an important truth.

Networking used to be about connecting things, and now it’s about delivery.  What the user wants isn’t the network, it’s what the network gets them to.  Thus, the true data plane of a network is a slave to the set of service-and-experience-coordinating things that live above, in that “control plane”.  The term could in fact be taken to mean the total set of coordinating technologies that turn bit-pushing into real value.  Because that broader control plane is all about content and experience, it’s also all about the cloud in terms of implementation.

Who and how, though?  The ONF approach uses an SDN controller with a set of northbound APIs that feed a series of service-specific implementations above, a layered approach.  But while you can use this to show a BGP “service” controlling the network via the SDN controller, BGP is almost a feature of the interfaces to the real BGP networks, not a central element.  Where do the packets originate and how are they steered to the interfaces?  In any event, the central SDN controller concept is dead.  What still lives?

This raises what might be the most profound competitive-race question of the entire open-model network area; is this “control plane” a set of layers as it’s currently implicitly built, or is it a floating web of microfeatures from which services are composed?  Why should we think of 5G and IP services as layers, when in truth the goal of both is simply to control forwarding, a data-plane function?  Is this new supercontrol-plane where all services now truly live?  Are IP and 5G user-plane services both composed from naked forwarding, rather than 5G user-plane being composed from IP?

These questions now raise the third question, which is what are the big cloud players going to do in this new open-model network situation?  By “big cloud players” I of course mean the cloud providers (Amazon, Google, IBM, Microsoft, and Oracle), and also the cloud-platform players (IBM/Red Hat, VMware, and HPE), and even players like Intel, eager for 5G revenue, whose former Wind River subsidiary offers a platform-hosting option.  And, finally, I’d include Cisco, whose recent M&A seems to be aimed at least in part at the cloud and not the network.

It’s in the cloud provider community that we see what might be the current best example of that supercontrol-plane, something that realizes a lot of the ONF vision and avoids centralized SDN controllers.  Google Andromeda is very, very, close to the goal line here.  It’s a data-plane fabric that’s tightly bound to a set of servers that host virtual features that could live anywhere in the application stack, from support for the IP control-plane features to application components.  Google sees everything as a construct of microfeatures, connected by low-latency, high-speed, data paths that can be spun up and maintained easily and cheaply.  These microfeatures are themselves resilient.  Google says “NFV” a lot, but their NFV is a long way past the ISG NFV.  In fact, it’s pretty close to what ISG NFV should have been.

Andromeda’s NFV is really “NFV as a service”, which seems to mean that microfeatures implemented as microservices can be hosted within Google’s cloud and bound into an experience as a separate feature, rather than being orchestrated into each service instance.  That means that each microfeature is scalable and resilient in and of itself.  This sure sounds like supercontrol-plane functionality to me, and it could give Google an edge.  Of course, other cloud providers know about Google Andromeda (it dates back over five years), so they may have their own similar stuff out there in the wings.
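The distinction I'm drawing, sketched below with invented class names, is between orchestrating a private copy of each feature into every service instance and binding service instances by reference to one shared microfeature that scales and recovers on its own.  This is my reading of the "as-a-service" idea, not Google's published design.

```python
# Sketch of "NFV as a service": service instances bind by reference to one
# shared, independently scaled microfeature instead of embedding a copy.

class SharedMicrofeature:
    """One logical feature, scaled on its own, shared by many services."""
    def __init__(self, name, replicas=2):
        self.name, self.replicas = name, replicas
        self.calls = 0

    def invoke(self, request):
        self.calls += 1                      # load is pooled across all users
        return f"{self.name}[replica {self.calls % self.replicas}] handled {request}"

class ServiceInstance:
    def __init__(self, customer, features):
        self.customer = customer
        self.features = features             # references, not embedded copies

    def handle(self, request):
        return [f.invoke(f"{self.customer}:{request}") for f in self.features]

firewall = SharedMicrofeature("firewall", replicas=3)
nat = SharedMicrofeature("nat", replicas=2)
svc_a = ServiceInstance("customer-a", [firewall, nat])
svc_b = ServiceInstance("customer-b", [firewall])   # same firewall, no re-orchestration
print(svc_a.handle("flow-1"))
print(svc_b.handle("flow-2"))
```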

The cloud-platform vendors (IBM/Red Hat and VMware, plus the server players like Dell and HPE) would probably love to build the supercontrol-plane stuff.  They’d also love to host 5G features.  So far, though, these platform giants are unable to field the specific network microfeature solutions that ride on top of the platforms and justify the whole stack.  Instead, most cite NFV as the source of hostable functionality.  NFV, besides not being truly useful, is a trap for these vendors because it leads them to depend on an outside ecosystem to contribute the essential functionality that would justify their entry into the new open-network space.  They might as well wait for an open-source solution.

And this raises the final question:  Can any of this happen without the development of a complete, sellable, ecosystem that fully realizes the open-model network?  The answer to that, IMHO, is “No!”  There is no question that if we open up the “control plane” of IP, consolidate it with other control planes like the one in 5G, frame out the ONF vision of a programmable network and network-as-a-service, and then stick this all in the cloud, we’ve created something with a lot of moving parts.  Yes, it’s an integration challenge and that’s one issue, but a greater issue is the fact that there are so many moving parts that operators don’t even know what they are, or whether anyone provides them.   For operations focus, I recommend that people keep an eye on models and TOSCA (an OASIS standard for Topology and Orchestration Specification for Cloud Applications) and tools related to it.

Big vendors with established open-source experience (all of the vendors I named above fit that) will do the logical thing and assemble an ecosystem, perhaps developing the critical supercontrol-plane tool and perhaps contributing it as an open-source project.  They’ll then name and sell the entire ecosystem, because they already do that, and already know how it’s done.

This will be the challenge for the startups who could innovate this new open-model space into final victory.  No startup I’m aware of even knows all the pieces of that complete and sellable ecosystem, much less has a ghost of a chance of getting the VC funding (and the time) needed to build it.  Can they somehow coalesce to assemble the pieces?  Interesting question, given that many of them will see the others as arch-rivals.  Can they find a big-player partner among my list of cloud or cloud-platform vendors?  We’ll see.

Another interesting question is what Cisco will do.  They’re alone among the major network vendors in having a software/server position to exploit.  Could Cisco implement the supercontrol-plane element as cloud software, and then promote it both for basic OpenFlow boxes and for Cisco routers?  We’ll get to Cisco in a blog later on.

I think AT&T’s DSL move is both a symptom and a driver of change.  It’s necessitated by the increasingly dire situation AT&T is in with respect to profit per bit, just as AT&T’s white-box strategy is driven by that issue.  But it drives 5G, and 5G then drives major potential changes in the control-to-data-plane relationship, changes that could also impact white-box networks.  Everything that goes around comes around to going around again, I guess.  We’ll see where this complex game settles out pretty quickly, I think.  Likely by 1Q21.

What’s Really Behind the IBM Spin-Out?

The news that IBM will spin off its managed infrastructure services business into a new company created a pop for its stock, but what does this mean (if anything) for the IT market?  In particular, what does it mean for cloud computing?  It’s not as simple as it might seem.

The basic coverage theme for the deal, encouraged no doubt by IBM’s own press release, is that this is going to take IBM another step away from “legacy” to “hybrid cloud”.  “IBM Slashes Legacy to Focus on Hybrid Cloud” says SDxCentral, and other sites had a similar headline.  Click bait?  From that, it wouldn’t be unreasonable to think that the deal was spinning out all of IBM’s mainframe computer business, along with related software and services.  “IBM trashes everything about itself except Red Hat and the Cloud,” right?  Wrong.  What IBM is spinning off is its “managed infrastructure services” business, which they say is $19 billion annually.  It doesn’t include either software or hardware elements.

In its SEC filing today, IBM said that customer needs for applications and infrastructure services were diverging.  It’s pretty clear what applications are, and “infrastructure services” are the collection of managed services described HERE on IBM’s site.  It’s a fair hodgepodge of technical and professional services relating to just about everything associated with data centers, but not the cloud.  Thus, it could be more accurate to say that IBM is spinning off its managed services except for those involved in hybrid cloud.

If this seems like an almost-political level of spin to you, you’re not alone.  I think IBM is doing a smart thing, but they’re wringing the best PR angle out of the move rather than providing the literal truth, which I think is more related to optimizing Red Hat and “NewCo” than anything else.  It’s about combined shareholder value.

Before Red Hat, IBM was a creaky old organization with a good self-image, credibility among aging CxOs, and little or no path to serious engagement of new customers.  Red Hat gave IBM something modern to push through its admittedly powerful sales engagements, and it also gave IBM one of the best platforms for attracting the attention of new prospects.  In short, Red Hat was, and is, IBM’s future, and one of the things it does rather nicely is to link IBM’s cloud business to that future too.

A lot of what builds and goes into the cloud is open-source, and Red Hat is a natural leader in that sector of software.  Not only that, a lot of that cloud stuff is still tied back to the data center, creating the “real hybrid cloud” model that’s been there all along and never written about much.  There will be very, very, few “cloud enterprises” if we take that to mean 100% cloud, but every enterprise will be a hybrid cloud enterprise.  No surprise, then, that IBM has picked that notion up in its PR regarding the spin-out.

But why spin anything out?  IBM has Red Hat, after all, and this isn’t some sort of nuclear business atomic-organization theory where one company coming into the IBM atom has to knock another piece out.  However, an organization offering managed infrastructure services really needs to be somewhat technology-agnostic to reach its full potential, and if IBM is going whole-hog to Red Hat and associated technology, it can hardly present an impartial view of the industry on its sales calls.  There’s a mission conflict here, one that could hurt the managed infrastructure services business if it stays, as well as the rest of IBM’s business.  The new company, “NewCo” in the SEC filings, will be a pure managed infrastructure services company, free now to manage any infrastructure as an impartial professional team, without fear that IBM product/technology bias will scare non-IBM accounts away.

Before we write all IBM’s hybrid cloud singing and dancing off as market hype, though, we have to recognize that while linking this deal to hybrid cloud may be an excursion away from strict facts, it is a fact that IBM is highly committed to hybrid cloud.  Why?  Because, as I just said, every enterprise will be a hybrid cloud enterprise, and if we define “enterprise” as a ginormous company, IBM has engaged a higher percentage of them than any other IT firm, and has a deeper engagement level with most.  Given that the market hype on the cloud has totally distorted reality, IBM is in a position to pick the low apples while all its competitors are still staring into the sun…not to mix metaphors here.

Here’s a basic truth to consider.  If the cloud is an abstraction of the thing we call “servers”, then the future of software and computing isn’t based on hardware at all.  Even “infrastructure” (in the sense of computing and related platform software) is shaped by the symbiotic relationship the infrastructure has with the cloud, and applications are “cloud applications,” not mainframe or minicomputer applications, even if they actually run (in part or entirely) on mainframes or minicomputers.  The hybrid cloud is the IT of the future, period.

IBM, the old and original IBM, has solid account control with many if not most of the enterprises.  It has, with Red Hat, something to sell them, and something from which they can create a respectable hybrid cloud.  The core business of IBM-the-new-and-exciting depends on making sure that Red Hat shift works, while at the same time trying to preserve overall shareholder value.  Remember, current shareholders will get some NewCo too, and IBM wants that to do well.  Inside IBM, the goals of managed infrastructure services and hybrid cloud would likely work at cross-purposes often enough to be a problem.  Thus, spin it out.

The hybrid cloud lesson here is clear, but another lesson we can draw is also important.  The age when enterprise IT and “ordinary” IT differed at the core technology level—mainframes versus minis or even PCs—is over.  The enterprise is a hybrid cloud consumer, and hybrid cloud (as I noted earlier) is computing abstracted.  One platform to rule them all.

Why Optimum Transformation Depends on a Hidden Metric

I know I talk a lot about “demand density” in my blogs, so much that you could think I’m saying it’s the essential underpinning of network evolution.  Well, it kind of is.  The future of networking is written in dollars, not in bits, and demand density is the fundamental dollar reality.  I’ve been working through modeling just how it’s impacting us today, and how it will continue to impact us over the rest of this decade, and I want to share it here.

To start with, the “demand density” concept is one I developed over a decade ago in response to what seemed like a simple question: “Why do some countries have so much better broadband than others?”  I took a lot of economic data on the major market areas and ran a series of models to try to find correlations.  It was obvious that the potential ROI of network infrastructure in a given area (like a country) depends on the combination of the economic potential of the area and the extent to which public right of way is available.

The model showed that the former factor was significant if you related it to GDP per square mile of inhabited area, and the latter to the road and rail miles within the country.  Demand density was the combination of these factors, and I’ve always expressed it in relation to the US demand density, set to 1.0.  The easiest way to understand the importance of demand density is to take two extremes, the very low and the very high, and compare them.
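For readers who want to play with the idea, here's a toy proxy for the calculation.  The real model is more involved than this, and the numbers below are placeholders rather than actual economic data; the sketch only captures the shape of the metric: economic potential per inhabited square mile, scaled by right-of-way availability, normalized so the US comes out at 1.0.

```python
# Toy proxy for "demand density" (the actual model is more detailed).
# All numeric inputs below are placeholders, not real economic data.

def raw_density(gdp_busd, inhabited_sq_mi, row_miles):
    economic_potential = gdp_busd / inhabited_sq_mi        # $B per inhabited square mile
    right_of_way = row_miles / inhabited_sq_mi             # road/rail miles per square mile
    return economic_potential * right_of_way

def demand_density(country, reference):
    """Normalize so the reference (the US in this blog) equals 1.0."""
    return raw_density(**country) / raw_density(**reference)

US = {"gdp_busd": 21000, "inhabited_sq_mi": 1_500_000, "row_miles": 4_200_000}
EXAMPLE = {"gdp_busd": 2000, "inhabited_sq_mi": 90_000, "row_miles": 300_000}

print(round(demand_density(EXAMPLE, US), 2))   # demand density relative to US = 1.0
```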

When a country has a very high demand density, its economic potential is highly concentrated, meaning that the infrastructure investment needed to connect the country and realize that economic potential is relatively low.  ROI on infrastructure is high, so countries with high demand density are likely not to have felt the profit-per-bit squeeze at this point.  The high economic return of infrastructure makes it easier to invest in new services, and so service innovation is high.  Overall, infrastructure planning is likely to be aimed at revenue generation more than at cost control.

Operations costs are also lower in these high-demand-density countries.  On the average, the “process opex” costs associated with service and network operations in countries with a demand density of greater than 7 are 40% less than those of a country with a demand density of 1.  That’s partly because human resources are more efficient when their range of activity is small; they spend less time moving around and a central pool of resources can support a greater number of users and devices.  The other part is that it’s possible to oversupply resources where demand density is high, because of that higher ROI on infrastructure.  More resources mean less resource management.

The low-demand-density country is in the opposite situation.  Because demand is spread over a larger geography, connecting users is more expensive and infrastructure ROI is lower.  That translates to quick compression of profit-per-bit, and in the extreme cases makes it difficult to sustain investment in infrastructure at all.  Countries with demand densities below 0.33 have opex costs that average 20% higher than those with a demand density of 1.0, because human resource usage is relatively inefficient.

I’ve used “countries” here because it’s generally easier to get economic data at a country level, and because most countries have been served by a national carrier.  In the US, it’s fairly easy to get data by state, and there are multiple operators serving the US.  A quick look at two, AT&T and Verizon, is another window into the importance of demand density.

AT&T’s demand density is 1.3, which means that it fits into the “relatively low” value range.  Comparing it with country data, it ranks roughly the same as Chile.  Verizon’s demand density is 11.5, which ranks slightly below that of Japan, in the “high” range.  Verizon has been aggressive in deploying FTTH and AT&T has not, because the former’s demand density suggests it could profitably connect about 40% of its customers with fiber, and the latter’s data says they could connect only 22%.  AT&T has recently said it is dropping new DSL deployments, because its demand density suggests that fiber support for those customers would not likely be profitable.  You don’t have to look further than the stock prices of the two companies to see the difference in how the financial markets see them.

AT&T has been perhaps the most aggressive large operator in the world on infrastructure transformation, particularly in the deployment of “white-box” technology.  That’s obviously aimed at reducing capex, but the largest component of capex for an ISP is the access network.  AT&T’s low demand density means its access network technology options are crippled.  DSL has to reach too far, and fiber to the home is too expensive.

This is where 5G mobile and 5G millimeter wave come in.  In areas where demand density is low, 5G technology could provide an alternative to copper-loop or FTTH but with a much lower “pass cost”, meaning the cost to bring service into an area so customers can then be connected.  5G in any form reduces the access cost component of capex, which can relieve pressure on the overall capital budget.  More significantly, it lets an operator raise its per-user bandwidth at a lower cost, making it more competitive and opening the opportunity to deliver new services, like streaming video.
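A simple way to see the "pass cost" argument is to work the arithmetic: cost per connected customer is the pass cost divided by the take rate, plus the per-customer connect cost.  The figures in the sketch below are placeholders chosen only to illustrate the relationship, not actual deployment costs.

```python
# Toy pass-cost comparison: cost per connected customer falls out of the pass
# cost (bringing service past a home), the connect cost, and the take rate.
# The figures below are placeholders, not real deployment economics.

def cost_per_customer(pass_cost, connect_cost, take_rate):
    return pass_cost / take_rate + connect_cost

scenarios = {
    "FTTH":           {"pass_cost": 800, "connect_cost": 600, "take_rate": 0.35},
    "5G mmWave/FTTN": {"pass_cost": 300, "connect_cost": 300, "take_rate": 0.35},
}
for name, s in scenarios.items():
    print(f"{name}: ~${cost_per_customer(**s):,.0f} per connected customer")
```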

5G seems most valuable as a tool in improving service cost and profit where demand densities are below about 3, which would correspond to US states like Vermont and West Virginia or countries like Italy.  Where demand density is higher, fiber becomes more practical and the impact of 5G on overall profit per bit is likely to be steadily less.

This is important in transformation planning, because it divides network transformation goals into “zones”.  Where demand density is high (greater than 5), profit per bit is not under immediate pressure and neither significant network transformation nor significant 5G exploitation is likely to be needed in the near term (to 2023).  Where it’s between 3 and 5, 5G is likely a competitive and service opportunity driver.  Between 1 and 3, 5G and general network cost-effectiveness combine to create the transformation drivers, and between 0.2 and 1.0, transformation has to be pervasive in both access and core, in order to control profit compression.
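Those zones can be summarized as a simple lookup, shown below; the boundaries and wording come from the paragraph above, and the operator figures are the ones cited elsewhere in this blog.

```python
# Sketch of the "transformation zones" described above, mapping a demand
# density (US = 1.0) to the planning posture the text associates with it.

def transformation_zone(demand_density):
    if demand_density > 5:
        return "low near-term pressure; limited transformation or 5G need to ~2023"
    if demand_density >= 3:
        return "5G as a competitive and service-opportunity driver"
    if demand_density >= 1:
        return "5G plus general network cost-effectiveness drive transformation"
    if demand_density >= 0.2:
        return "pervasive transformation needed in both access and core"
    return "below the modeled range; ROI likely inadequate without subsidy"

for operator, dd in [("Verizon", 11.5), ("AT&T", 1.3), ("Australia/NBN", 0.2)]:
    print(f"{operator} ({dd}): {transformation_zone(dd)}")
```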

Obviously, pure mobile operators are going to be transformed primarily through the evolution of the mobile backhaul and core networks, so 5G standards would likely dominate transformation planning goals.  Where demand density comes in is in the area of cell-size planning.  High demand densities mean efficient backhaul even if microcells are used, and so 5G networks would likely trend toward smaller cell sizes and larger numbers of cells.  This would also favor the introduction of bandwidth-intensive applications of 5G because per-user bandwidth could be higher.  In low-demand-density areas, cell sizes would likely be larger to contain overall deployment costs, which would also reduce per-user bandwidth available and limit the new services that 5G could support.

Millimeter-wave 5G/FTTN hybrids would seem most valuable where demand densities hover in the 1-4 range, too low for large-scale FTTH but high enough that the range limitations of 5G/FTTN wouldn’t be a killer and so that delivered bandwidth could remain at a competitive level.  As demand density falls to the low end of that range, 5G mobile infrastructure to serve fixed locations would become increasingly economical, and as it rises to (and above) the high end, FTTH becomes competitive.

This last issue of transformation focus may be the most important near-term factor influenced by demand density.  Is a mobile network mobile-first or network-first in planning?  For vendors prospecting network operators, that’s a big question because it relates to both sales strategy and product planning focus.  Obviously, operators whose planning is dominated by 5G issues aren’t as likely to respond to a generalized network transformation story; they want to hear 5G specifics.  The opposite is also true.  However, there are exceptions.  Operators whose demand density favors microcells would be doing a lot of backhaul and aggregation, and thus would build a core that looked a lot like that of a wireline network.  Those with large 5G mobile cells could be doing so much aggregation in the backhaul network that they’d almost be dropping their mobile traffic on the edge of their core.

One thing I think is clear in all of this is that demand density has always been an issue.  Australia, which went to a not-for-profit NBN experiment, has a demand density of 0.2.  AT&T, which just committed to an open-model network (see my blog HERE), has a demand density of 1.3.  The general curve of profit compression that operators always draw will reach the critical point of inadequate ROI quicker for those whose demand density is lower, and measures to contain capex and opex will be taken there first, as they have been already.

Another thing that’s clear is that transformation strategies aren’t going to be uniform when demand density is not.  It’s simplistic to believe that a salesforce could be armed with a single song to sing worldwide, and succeed in selling their products across the globe.  There’s always been a need to tune messages, and with operator budget predictions down almost universally, this is probably a good time to pay special attention to that tuning.  Factoring demand density into the picture can be a big help.

Filling the Holes in Opex Reduction Strategies

Vendors are finally discovering the virtue of opex reduction.  Cisco has included the network sector in its overall AI/ML strategy, complementing “intent-based networks”.  Juniper’s Mist AI deal, combined with their recent acquisition of Netrounds, shows that they’re looking at more ways to apply automation, testing, monitoring, and other stuff that qualifies as operations automation.  The reason, of course, is that if operators are looking to cut costs to reduce profit-per-bit squeeze, cutting opex might prevent them from cutting capex, which hits vendors’ own bottom lines.

Vendor desire to support operator cost reduction in places where the reduction doesn’t impact operators’ spending on the vendors is understandable.  That doesn’t make it automatically achievable, or mean that any measure that has a claim to enhancing operations would necessarily reduce opex significantly, or at all.  I’ve done a lot of modeling on opex, and this modeling suggests that there are some basic truths about opex reduction that need to be addressed any time opex enhancements are claimed as a benefit.

The biggest truth about opex is, for operators, a sad one.  Recall from prior blogs that I’ve noted that “process opex”, the cost related to customer and infrastructure operations, is already a greater percentage of revenue than capex is.  Starting in 2016 when operators moved to address the problem, and despite measures taken largely in 2018, opex costs have continued to grow even when capex has been reduced.

The biggest and saddest truth for vendors is that securing opex improvements is getting harder.  Aside from 2018, when major changes to installation and field support created a significant impact on opex overall, technical improvements have only made the opex growth curve a bit less steep.  The curve has never turned downward, even in 2018, and it’s continuing to rise in 2020.  The bigger problem is that the one area where cost containment does seem to have worked this year is network operations, which is exactly where most of the announcements (like Juniper’s) are targeted.

It would be easy to say that service lifecycle automation and other forms of operations cost management are already out of gas, but that oversimplifies, even though it’s a view that could be gleaned from the data.  A “deeper glean” results in a deeper understanding, which is that while service automation is important, it has to be viewed from the top and not from the bottom.

If you take a kind of horizontal slice across all the categories of “process opex”, the largest contributor by far is customer care and retention, which accounts for an astonishing 40% of all opex.  You could argue that making network operations automated is important primarily if it reduces customer care and retention costs, and for that to be true, you have to be able to translate the netops benefits directly into the customer domain, which even in 2020 is a disaster.  The reason we had a brief improvement in process opex in 2018 is that operators wrung human cost out of the process.  The reason why it didn’t last is because the system changes made were simply badly done.

Let me offer a personal example.  I recently had a Verizon FiOS outage, caused (I believe) by a construction crew cutting the fiber cable.  Having lost wireline Internet, I went to Verizon’s website to report and resolve the problem.  Thanks to that 2018 change, there was no human interaction involved; I was told to run through an automated troubleshooter, and candidly it was a truly frustrating and terrible experience.

First, the troubleshooter asked me whether I was having a power failure.  It then told me there were no known outages in my area, and had me go to the location of the optical network terminal (ONT).  Now, there was a red light on the ONT, but it never asked for status.  Instead, I had to first make sure the ONT had power (duh, there’s a red light on) and then reboot the ONT to see if that fixed it.  It didn’t, so it told me it would now take a service call, and gave me a bunch of disclaimers about how I’d have to pay for the call if the problem turned out to be my fault.  Then it asked if I still wanted to schedule a service call.  I said I did, and it told me there were no slots available for the next two weeks, so I’d have to be connected to a human.

It never connected me.  I tried to call the support number and got another (this time voice-driven) troubleshooter, which proceeded to tell me that there was an outage in my area and it would be repaired by 6 AM the following morning.  Good service?  Hardly, but the problem wasn’t service automation.

There’s an automated element at the end of the broadband connection, the ONT.  There is no reason Verizon could not have known that a number of customers’ ONTs had gone dark and, from their distribution, determined the likely location of the problem.  They had my cell number, so they could have texted me to say there was an outage, that they’d have it repaired overnight, and that they were sorry for the problem.  That would have left a good taste in my mouth, and reduced the chances that I’d look for another broadband provider.  It would have unloaded their troubleshooting system too, and it would have required nothing that would qualify as a network automation enhancement.
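
To make that concrete, here’s a minimal sketch in Python of the kind of correlation I’m describing.  Everything in it (the event fields, the segment identifiers, the notification hook) is hypothetical and has nothing to do with Verizon’s actual systems; the point is simply that a burst of dark ONTs on one distribution segment signals a plant problem, and the customers on that segment can be told so proactively.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def detect_segment_outages(ont_events, window_minutes=5, threshold=3):
    """Group recent loss-of-signal events by the distribution segment that
    serves each customer; a burst on one segment suggests a plant problem
    (a cut fiber, say) rather than a customer-premises issue."""
    cutoff = datetime.utcnow() - timedelta(minutes=window_minutes)
    by_segment = defaultdict(list)
    for event in ont_events:
        if event["time"] >= cutoff:
            by_segment[event["segment"]].append(event["customer_id"])
    return {seg: custs for seg, custs in by_segment.items()
            if len(custs) >= threshold}

def notify_customers(outages, send_sms):
    """Tell affected customers what's known before they have to fight a
    troubleshooter or wait on hold."""
    for segment, customers in outages.items():
        for customer in customers:
            send_sms(customer, f"We've detected an outage affecting your area "
                               f"(segment {segment}) and crews are being "
                               f"dispatched.  No action is needed on your part.")
```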

Start with the customer, with how to proactively notify them of problems and how to give them the correct answers and procedures.  When you’ve done that, think about what information could be provided to drive and improve that customer relationship.  When that’s done, think about what might have been done to prevent the problem.  The proverbial “cable-seeking backhoe” isn’t going to be service-automated out of existence, nor is the repair of the cable, which shows that some of the most common service problems aren’t even related to the network’s operation.  We absolutely have to fix customer care.

This doesn’t say that you don’t need service lifecycle automation, though, and there’s both a “shallow” and “deep” reason that you do.  Let’s start with the shallow one first.

Some customer problems are caused by network problems, and in particular they’re caused by operator error.  A half-dozen operators out of a group of 44 told me that operator error was their largest cause of network problems, but almost 30 of them wouldn’t say what their largest cause was, so it could well be that operator error is the largest source of errors overall.  Misconfiguration lies at or near the top of the list.  Lifecycle automation, by reducing human intervention, reduces misadventures in that intervention (to add a bit of conversational interest to the picture).

The second, deep, reason is that in an effort to reduce capex, we’re developing a more complex infrastructure framework.  A real router is a box, and everything about managing it is inside the box.  A virtual router is a virtual box, and it still has to be managed as a box, but its hosting environment, its orchestration, and even the management processes associated with hosting, also have to be managed.  If we break our box into a chain of virtual-features, we have even more things to manage.  Management costs money, both in wages and benefits, and in the errors that raise customer care and retention costs.

You can see what might be a sign of that in the operations numbers for this year.  While network operations costs are up about 16% over the last five years, IT operations costs are up over 20%.  Given that these costs are all “internal”, meaning there’s no contribution to direct customer interactions, installations, or repairs, that’s a significant shift.  Could it be the indication that greater adoption of virtualization is creating more complex infrastructure at the IT level, and that the biggest contribution that service lifecycle automation could make is in controlling the increase in opex related to this increased complexity?

Then there’s the final point, which is the impact of demand density on opex and on opex reduction strategies.  We already know that wireline has a higher opex contribution than wireless, but it’s also clear that in areas where demand density (roughly, opportunity per square mile) is high, opex is lower because craft efficiency is higher and the cost of infrastructure redundancy is lower.  As demand density falls, there’s a tendency to conserve infrastructure to manage capex, which means opex tends to rise because of loss of reserve resources.  It’s possible that this factor could impact capex-centric approaches to improving profit per bit; if new-model networks are cheaper to buy and more expensive to operate, what’s the net benefit?

The fact is that you can’t let opex rise, in large part because a rise in opex is often a sign that customer care is suffering.  It’s possible that a customer-care-centric approach to operations, even without massive changes in lifecycle automation, could improve opex as much as new automated technology could.  It’s also possible that wrapping new service lifecycle automation in an outmoded customer care portal and practice set could bury any benefits the new system could generate.

My Verizon outage couldn’t have been resolved by an automated system if it was indeed caused by nearby construction.  No new systems or AI were required to do a better job, only more insight into designing the customer care portal.  I’m not saying that we should forget service lifecycle automation and focus on customer portals, but we can’t forget the latter while chasing the former.

Is There Really an “Edge” to the Cloud at All?

Where is the best place to host computing power?  Answer: Where it’s needed.  Where’s the most economical place?  Answer: Probably not where it’s needed.  The dilemma of the cloud, then, is how to balance optimality in QoE and the business case.  I’m going to propose that this dilemma changes the nature of the cloud, and that a better definition of the “edge” is simply a “local” processing resource.  A Light Reading article quotes Dish as saying that for 5G, “the edge is everywhere”.  Is that true, or is it true that the cloud is subsuming the edge?

The relationship between humans and applications has been a constant process of increasing intimacy.  We started off by keypunching in stuff we’d already done manually, evolved to online transaction processing, empowered ourselves with our own desktop computers, and now we carry a computer with us that’s more powerful than the ones that read those punched cards.  How can you see this other than as the classic paso doble, circling ever closer?  It suggests that we’ve historically benefitted from having our compute intimately available.

The cloud, in this context, is then a bit counterintuitive.  We collectivize computing, which seems to be pulling it away from us.  Of course, now the cloud people want to be edge people too, getting them back in our faces, but I digress.  The point here is that economics favors pooled resources and performance favors dedicated, personalized, resources.

We could probably do some flashy math that modeled the attractive force of economics and the attractive force of personalization to come up with a surface that represented a good compromise, but the problem is more complicated than that.  The reason is that we’ve learned that most applications exist in layers, some of which offer significant benefits if they’re pushed out toward us, and others where economics will overwhelm such a movement.

IoT is a good place to see the effect of this.  An IoT application that turns a sensor signal into a command to open a gate is one where having the application close to the event could save someone from hitting the gate.  However, a mechanical switch could have done this cheaper.  In the real world, our application likely involves a complex set of interactions that might actually start well before the gate sensor is reached.  We might read an RFID from a truck as it turns into an access road, look up what’s supposed to be on it, and then route it to the correct gate, and onward to the correct off- or on-load point.

This application might sound similar to that first simple example, but it changes the economic/QoE tradeoff of compute placement.  Unless the access road is a couple paces long, we’d have plenty of time to send our RFID truck code almost anywhere in the world for a lookup.  Since we’re guiding the truck, we can open the gates based on our guidance, and so a lot of the real-time response nature of the application is gone.

Network services offer similar examples of layers of that economics-to-QoE trade, and one place where that’s particularly true is in the handling of control-plane messages.  Where is the best place to handle a control-plane message?  The answer seems simple—the edge—but the control plane in IP is largely a hop function, not an end-to-end function.  There are hops everywhere there’s a node, a router.

Let’s look at things another way.  We call on a cloud-native implementation of a control-plane function.  We go through service discovery and run it, and it happens that the instance we run was deployed twelve time zones away.  What’s the chance that this sort of allocation is going to create a favorable network behavior, particularly if it’s multiplied by all the control-plane packets that might be seen in a given place?

One solution is to create what could be called a “local cloud”, a cloud that contains a set of hosting resources that are logically linked to a given source of messages, like a set of data-plane switches.  Grabbing a resource from this would provide some resource pool benefits versus fixed allocation of resources to features, and it wouldn’t require a major shift in thinking in the area of cloud hosting overall.

Where broader cloud-think comes in is where we either have to overflow out of our “local cloud” for resources, or where we have local-cloud missions that tie logically to deeper functionality.  If the IP control plane is a “local cloud” mission, how local is the 5G control plane, and how local is the application set that might be built on 5G?  Do we push these things out across those twelve time zones?  Logically there would be another set of resources that might be less “local” but would sure as the dickens not be “distant”.

The cloud is hierarchical, or at least it should be.  There is no “edge” in a sense, because what’s “edge” to one application might be deep core to another.  For a given application, there’s a defined topology that represents the way things needing a deeper resource (either because the shallow ones ran out or because the current task isn’t that latency-sensitive) would be connected with it.

This view would present some profound issues in cloud orchestration, because while we do have mechanisms for steering containers to specific nodes (hosting points), those mechanisms aren’t based on the location where the user is, or the nature of the “deep request” tree I’ve described.  This issue, in fact, starts to make the cloud look more “serverless”, or at least “server-independent”.

How should something like this work?  The answer is that when a request for processing is made, a request for something like control-packet handling, the request would be matched to a “tree” that’s rooted in the location from which the request originated (or to which the response is needed).  The request would include a latency requirement, plus any special features that the processing might require.  The orchestration process would parse the tree looking for something that fit.  It should prefer instances of the needed process that were already loaded and ready, and it should account for how long it would take to instantiate the process if it weren’t loaded anywhere.  All that would eventually either match a location, triggering a deployment, or create a fault.
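
Here’s a minimal sketch, in Python, of that matching logic.  The “tree” is flattened into tiers ordered by latency from the request’s root, and all the field names (latency_ms, warm_processes, cold_start_ms, and so on) are assumptions of mine, not anyone’s actual orchestration API.

```python
def place_request(tree, process, max_latency_ms, features=()):
    """Walk a latency-ordered set of hosting tiers rooted at the request's
    origin.  Prefer tiers where the process is already loaded ("warm");
    otherwise charge the cold-start time against the latency budget.
    Return a (tier, warm?) placement or raise a fault."""
    best_cold = None
    for tier in sorted(tree, key=lambda t: t["latency_ms"]):
        if tier["latency_ms"] > max_latency_ms:
            break                                  # deeper tiers are only slower
        if not set(features) <= set(tier["features"]):
            continue                               # missing a required feature
        warm = process in tier["warm_processes"]
        startup = 0 if warm else tier["cold_start_ms"]
        if tier["latency_ms"] + startup <= max_latency_ms:
            if warm:
                return tier["name"], True          # warm and in budget: take it
            best_cold = best_cold or (tier["name"], False)
    if best_cold:
        return best_cold                           # triggers a deployment there
    raise RuntimeError("no hosting point satisfies the latency requirement")
```

The real decision would obviously weigh cost and current load as well, but the shape of the logic, walking outward from the root until the latency budget runs out, is the point.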

The functionality I’m proposing here is perhaps somewhere between a service mesh and an orchestrator, but with a bit of new stuff thrown in.  There might be a way to diddle through some of this using current tools in the orchestration package (Kubernetes has a bunch of features to help select nodes to match pods), but I don’t see a foolproof way of covering all the situations that could arise.  Kubernetes has its affinities, taints, and tolerations, but they do a better job of getting a pod to a place where other related pods might be located, and they’re oriented toward node- and label-level awareness rather than a general locality characteristic like “in my data center” or “in-rack”.  It might also be possible to create multiple clusters, each representing a “local collection” of resources, and use federation policies to steer things, but I’m not sure that would work if the deployment was triggered within a specific cluster.  I welcome anyone with more detailed experience in this area to provide me a reference!

Another point is that if service mesh technology is used to map messages to processes, should that mapping consider the same issue of proximity or location?  There may be process instances available in multiple locations, which is why load-balancing is a common feature of service meshes.  The selection of the optimum currently available instance is as important as picking the optimum point to instantiate something that’s not currently hosted.
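
The instance-selection side could be sketched just as simply; again, the field names below are hypothetical, and the weighting is purely illustrative.

```python
def pick_instance(instances, caller_zone):
    """Choose among currently running instances by trading off reported load
    against proximity to the caller, rather than round-robin alone."""
    def cost(instance):
        proximity = 0.0 if instance["zone"] == caller_zone else instance["rtt_ms"]
        return proximity + instance["load"] * 10.0    # load weight is illustrative
    return min(instances, key=cost)

# A nearby but busy instance still beats a distant idle one in this weighting.
print(pick_instance([{"zone": "edge-1", "rtt_ms": 2, "load": 0.9},
                     {"zone": "core", "rtt_ms": 40, "load": 0.1}], "edge-1"))
```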

Why is all this important?  Because it’s kind of useless to be talking about reducing things like mobile-network latency when you might well introduce a lot more latency by making bad choices on where to host the processes.  Event-driven applications of any sort are the ones most likely to have latency issues, and they’re also the ones that often have specific “locality” to them.  We may need to look a bit deeper into this as we plot the evolution of serverless, container orchestration, and service mesh, especially in IoT and telecom.

A First Step to an Open-Model Network Future

Do you think that core routing is just for big routers?  Think again.  DriveNets, a startup that developed the “Network Cloud” cloud-routing solution, and AT&T co-announced (HERE and HERE) that DriveNets “is providing the software-based core routing solution for AT&T, the largest backbone in the US.”  That could fairly be called a blockbuster announcement (covered HERE and HERE and HERE), I think.  It’s also likely the first step in actually realizing “transformation” of network infrastructure.

I’ve mentioned DriveNets in a couple of past blogs, relating to the fact that they separated the control and data planes, and I think that particular attribute is the core of their success.  What the AT&T deal does is validate both the company and the general notion of a network beyond proprietary routers.  That’s obviously going to create some competitive angst among vendors, and likely renew hope among operators.  The basics of their story have already been captured in my references, so I want to dig deeper, based on material from the company, to see just how revolutionary this might be.

Looking at the big picture first, the AT&T decision puts real buyer dollars behind a software-centric vision of networking, one that’s had its bumps in the road as standards efforts to create the architecture for the new system have failed to catch on.  Some operators I’ve talked with were enthusiastic about the shift in technology from routers to software, but concerned that they wouldn’t see a viable product in the near term.  DriveNets may now have relieved that concern, because now they have a very big reference account, an operator using DriveNets in the most critical of all missions.  That’s a pretty big revolution right there.

The next revolutionary truth is that it’s software that creates the DriveNets technology.  While DriveNets runs on white boxes, what makes it different is control/data separation, and the way a kind of local cloud hosts the control plane.  The data plane is hosted on white-box switches based on Broadcom’s Jericho chip, and the control plane on a more generic-looking series of certified white-box devices.  White-box devices also provide the external interfaces and connect with the data-plane fabric.  You can add white boxes as needed to augment resources for any of these missions.

Different-sized routers are created by combining the elements as needed into a “cluster”, and capacities up to 768Tbps can be supported with the newest generation of the chip (the AT&T deal goes up to 128Tbps).  AT&T says that they’ll have future announcements for other applications running on the same white-box devices, and that offers strong support for the notion that this is an open approach.  This is an important point I’ll return to later in this blog.

One obvious benefit of this new model of networking is that the same white boxes (except for extreme-edge applications that would use a single device) can be used to build up what’s effectively an infinite series of router models, so the same boxes are spares for everything.  Another benefit is that no matter how complex a cluster is, it looks like a single device both at the interface-and-topology level and at the management level.  But the less-obvious benefits may be the best of all.

Here’s a good one.  The control-plane software is a series of cloud-native microservices hosted in containers and connected with a secret-sauce, optimized service mesh for message control.  This gives DriveNets (and, in this case, AT&T) the ability to benefit from the CI/CD (continuous integration and continuous delivery) experience of cloud applications.  In fact, the control-plane software, and the DriveNets software overall, is based on cloud principles, which is where operators have been saying they want to go for ages.  There are none of the complicated, multi-forked code trains that have haunted traditional routers.  All the software microservices are available, loaded when needed, and changed as needed.

And another.  When you get a new Network Cloud cluster, or just a new box, it’s plug and play.  When it’s first turned up, it will go to a DriveNets cloud service to get basic software, and it then contacts a local management server for its configuration, setup, and details.  Sounds a lot like GitOps in the cloud, right?
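
I don’t have the internals of the DriveNets flow, but a generic zero-touch bootstrap, which is the GitOps parallel I’m drawing, would look something like this sketch (every object and method here is hypothetical):

```python
def bootstrap(device, vendor_cloud, local_manager):
    """Generic zero-touch provisioning: pull base software from a vendor
    cloud service, then reconcile the device against a locally managed,
    declared configuration, the way a GitOps agent reconciles against a
    repository."""
    image = vendor_cloud.fetch_base_image(device.model, device.serial)
    device.install(image)
    desired = local_manager.get_config(device.serial)   # the declared state
    if desired != device.running_config():
        device.apply(desired)                           # converge to it
```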

And still more.  The data-plane boxes can be configured with fast-failover paths that allow for incredibly quick reaction time to faults.  The control plane, in its cloud-cluster of white boxes, will reconverge on the new configuration.  More complex network issues that require the control plane benefit from having the cluster’s internal configuration tunable to support the overall connection topology that’s required to return to normal operation.  From the outside, it’s a single device.

And another…the operating system (DNOS) and orchestrator (DNOR) elements combine with the Network Cloud Controller and Network Cloud Management elements to provide for internal lifecycle management.  Everything that’s inside the cluster, which is thus a classic “black box”, is managed by the cluster software to meet the external interface and (implicit) SLA.  The fact that the cluster is a cloud is invisible to the outside management framework, so the architecture doesn’t add to complexity or increase opex.

To recap the market impact, what we have is a validation that a major Tier One (AT&T) who has been committed to open-model networking is satisfied that there is an implementation of that concept that’s credible enough to bet their core network on.  We also have a pretty strong statement that an “open-model” network is a network composed of white-box devices that can, at least in theory, host anything.

Remember that AT&T says they may host other software on the same white boxes.  I think that means that to AT&T, openness means protection against stranded investment, not necessarily that every component of the solution, the non-capital components in particular, is open.  You don’t need to have open-source software on white boxes to be “open”, but you do have to be able to run multiple classes of software on whatever white boxes you select.

I’m a strategist, a futurist if you like, so for me it’s always the major strategic implications that matter.  Where is the future open-model network heading?  What’s the next level up, strategy-wise, that we could derive from the announcement?  I think it may arise from another announcement I blogged on just last week.  If we go back to that blog on the ONF conference, there was an ONF presentation on “The Network as a Programmable Platform”, which proposed an SDN-centric vision of the network of the future.  As the title suggests, it was a “programmable” network.

One figure in that presentation shows the SDN Controller with a series of “northbound” APIs linked to “applications”, one of which is BGP4.  What the figure is proposing is that BGP4 implementations, running as a separate application, could control the forwarding of SDN OpenFlow white-box switches, and create what looks like a BGP core.  You could do the same thing, according to the ONF vision, to implement the 5G interfaces between their “control plane” (which I remind you is not the same as the IP control plane) and an IP network.  This is almost exactly what Google’s Andromeda project did for Google.
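
To make the “BGP as a controller application” idea concrete, here’s a toy sketch of the translation step, routes learned by a BGP speaker becoming abstract match/action flow entries for white-box switches.  This is my own illustration, not the ONF’s or Google’s code, and the structures are invented for the example.

```python
def routes_to_flows(bgp_rib, egress_port_map):
    """Translate BGP-learned prefixes into abstract flow entries of the sort
    an SDN controller would push down to its switches."""
    flows = []
    for prefix, next_hop in bgp_rib.items():
        flows.append({
            "match": {"eth_type": 0x0800, "ipv4_dst": prefix},
            "actions": [{"output": egress_port_map[next_hop]}],
            "priority": 100,
        })
    return flows

# Two prefixes learned from two peers, mapped to the switch ports facing them.
rib = {"203.0.113.0/24": "peer-a", "198.51.100.0/24": "peer-b"}
ports = {"peer-a": 3, "peer-b": 7}
print(routes_to_flows(rib, ports))
```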

Why would AT&T not have selected an ONF implementation, then?  They’ve supported, and contributed elements to, the ONF solution.  The answer, I think, could be simple:  there is no validated implementation of the ONF solution available commercially.  It may be that DriveNets is seen by AT&T as the closest thing to that utopian model that’s available today, and of course they may also believe that DriveNets is close enough that they could evolve to the model faster than someone starting with just the ONF diagrams could implement it.

Why could they think that?  If you dig into the DriveNets material (particularly their Tech Field Day stuff), the architecture of their separated control plane is characterized as a web of cloud-native microservices.  These work together to do a lot of things, one of which is creating what looks like a single router from the combined behavior of a lot of separate devices.  And DriveNets, in the video, says “Twice the number of network operators say they expect radical change in network architecture within three years, versus those who say they do not.”  They also say that their approach “builds networks like hyperscaler clouds.”

Let me see…radical changes are needed, build networks like hyperscaler clouds, consolidate multiple devices into a single virtual view?  This sounds like a network-programmability model that doesn’t rely on SDN controllers, single or federated.  Do all that at the network level and you’ve implemented the bottom half of the ONF vision.

How about the top half, those northbound APIs?  There’s no detail on exactly what northbound APIs are currently exposed by DriveNets, but since their control plane is made up of microservices, there’s no reason why any API couldn’t be added fairly easily.  The current DriveNets cluster has to have a single forwarding table from which it would derive the forwarding tables for each of its data-plane fabric devices, so could that table be used to create a network-as-a-service offering, including BGP4?  It seems possible.

New microservices could be developed by DriveNets, by partners, and even by customers, and these microservices could extend the whole DriveNets model.  ONF OpenFlow control?  It could be done.  Network-as-a-service APIs to support the mobility management and media access elements of 5G, via the N2 and N4 interfaces?  It could be done.  I’m not saying it will be, only that it seems possible, just as it’s possible that the ONF model could be implemented.

“Could” being the operative word.  The problem with the ONF model, as I said in that blog on their conference referenced above, is that central SDN controller.  That’s a massive scalability and single-point-of-failure problem.  A federation of SDN controllers is a logical step I called out years ago when this issue was first raised, but it hasn’t been developed (you can see that in the fact that the ONF’s pitch doesn’t reference it).  There is no industry standards initiative in the history of telecom that developed something in under two years, so the ONF solution can only be realized if somebody simply extends it on their own.  A DriveNets extension to programmability and an ONF implementation are both “coulds”.

Even without being able to create a programmable network like the ONF diagram shows, DriveNets has made a tremendous advance here.  It might also demonstrate that there are at least two ways to create that programmable network, the yet-to-be-defined SDN controller federation model and the “network-wide control plane” model.  Two options to achieve our goals are surely better than one.

Customers or opportunities are another place where more is better, and to achieve that, any competitor in the new-model network space is going to have to confront the business case question.  AT&T has an unusually high sensitivity to infrastructure costs, owing to its overall low demand density.  It would be logical for them to be on the leading edge of an open-model network revolution.  How far behind might the other operators be?  That’s likely to depend on what value proposition, beyond the simple “the new network costs less than routers”, is presented to prospects.  Most operators with high demand densities won’t face the issues AT&T is facing for another couple years, but there are other factors that could drive them to a transformation decision before that.  It’s a question of who, if anyone, presents those other factors.

We are going to have a transformation in networking eventually, for every operator, and I think the AT&T/DriveNets deal makes that clear.  New models work and they’re cheaper, and every Wall Street research firm I know of (and I know of a lot of them) expects telco spending to be at least slightly off for the balance of this year, and for 2021 as well.  In fact, they don’t really see anything to turn that trend around.  Even operators with high demand densities and correspondingly lower pressure for capex savings will still not throw money away when a cheaper and better option is available.  Opportunities for a new strategy are growing.

So are alternatives to what that strategy might look like.  We’re going to have a bit of a race in fielding a solution, between cheaper routers, white-boxes with simple router instances aboard, clusters with separated control and data planes like DriveNets, and SDN/ONF-based solutions.  The combination of opportunity and competition means there’s a race to pick the right prospects and tell the right story.  It’s going to be an interesting 2021 for sure.

Does “Lean NFV” Move the NFV Ball?

NFV has seen a lot of movement recently, but all movement isn’t progress.  I noted in earlier blogs that NFV still shows up in a lot of telco diagrams on implementation of 5G, including OpenRAN, and it’s also included in vendor diagrams of their support for telco cloud, even cloud-native.  The problem is that NFV isn’t really suited for that broad an application set, a view I hold firmly and one many operators share.  Thus, it’s not surprising that there’s interest in doing something to redeem NFV.

One initiative getting a lot of attention these days is Lean NFV, whose white paper is now in the 2020 “C” revision.  I told Fierce Telecom about my reservations regarding the concept, and I want to dig into their latest material to see if there’s anything there to either resolve or harden them.  The MEF’s support, after all, could mean something.  At least, it might take NFV out of a venue that didn’t produce into one that still might have a chance.

The first subhead in the paper I referenced above is a good start: “Where did we go wrong?”  The paragraph under that heading is promising, if a bit lacking in specificity.  The specifics, we could hope, would come in the rest of the document.  The main theme is that NFV is too complex, in no small part because it was never truly architected (my words) to define the pieces and how they all fit.  The functional diagram became an implementation guide, which created something that’s “too closely coupled to how the rest of the infrastructure is managed,” to quote the paper.

This is perhaps the best sign of all, because the biggest single problem with NFV was indeed the relentless effort to advance it by containing its scope.  The rest of the telco network and operations world presented a bunch of potential tie points, and rather than define something that was optimum for the mission of virtualizing functions, the ISG optimized the use of those legacy tie points.  But does Lean NFV do any better?

Lean NFV defines three pieces of functionality that represent goals to be addressed by Lean NFV, and if there’s to be a benefit to the concept, it has to come because these are different/better than the ISG model.  The first is the NFV manager, which manages not only the VNFs but “the end-to-end NFV service chains”.  The paper takes pains to say that this isn’t necessarily a monolithic software structure; it could be a collection of management functions.  The second is the “computational infrastructure”, which is I think an unnecessarily complicated way of saying the “resource pool” or “cloud”.  The third is the VNFs themselves, which the paper says might have their own EMS for “VNF-specific configuration”.

The way that Lean NFV proposes to be different/better is by concentrating on what it describes as the “three points of integration”, “when the NFV manager is integrated with the existing computational infrastructure, when VNFs are integrated with the NFV manager, and when coordination is required between the various components of the NFV manager.”  It proposes to use a key-value store to serve as a kind of “repository” to unify things at these three critical points, and leave the rest of the implementation to float as the competitive market dictates.  The paper goes on to describe how the three critical integration points would be addressed, and simplified, by the key-value approach.
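
To picture what’s being proposed, imagine the NFV manager and the infrastructure manager coordinating purely through shared keys.  This is my own minimal sketch, not Lean NFV code, and the key layout is invented.

```python
# A toy in-memory "store" standing in for the Lean NFV coordination point.
kv = {}

# The NFV manager writes desired state for a VNF as flat keys...
kv["vnf/firewall-1/desired"] = "running"
kv["vnf/firewall-1/vcpus"] = "4"
kv["vnf/firewall-1/image"] = "fw:2.1"

# ...and an infrastructure-side "microcontroller" watches for them and acts.
def reconcile(store, deploy):
    for key, value in store.items():
        if key.endswith("/desired") and value == "running":
            vnf = key.split("/")[1]
            deploy(vnf, vcpus=int(store[f"vnf/{vnf}/vcpus"]),
                   image=store[f"vnf/{vnf}/image"])

reconcile(kv, lambda vnf, **spec: print(f"deploying {vnf} with {spec}"))
```

Notice what’s missing from the store: any statement of how firewall-1 relates to anything else, which is where my concerns below come in.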

What I propose to do to assess this is to forget, for the moment, the specifics of how the integration is to be done (the key-value store) and look instead at what the solution is supposed to be delivering to each of these three integration areas.  If the deliverables aren’t suitable, it doesn’t matter how they’re achieved.  If they are, we can look at whether key-value stores are the best approach.

The first suggestion, regarding the interface to computational resources, is certainly sensible.  The original NFV was very OpenStack-centric, and what the paper proposes is to, in effect, turn the whole computational-resource thing into a kind of intent model.  You define some APIs that represent the basic features that all forms of infrastructure manager should support, and then you allow the implementation to fill in what’s inside that black box.  All of the goals of the paragraph describing this are sensible and, I think, important to the success of NFV.

The second suggestion relates to the NFV manager, and I will take the liberty of reading it as an endorsement of a data-model-driven coupling of events to processes.  The data model serves as the interface between processes, implying that it sets a standard for data that all processes adhere to regardless of their internal representations.  This can all, at the high level, be related to the TMF NGOSS Contract work that I think is the seminal material on this topic.
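
A minimal sketch of what I mean by data-model-driven coupling of events to processes, in the NGOSS-Contract spirit; the states, events, and process names are purely illustrative.

```python
# Each element of a service model carries a state/event table naming the
# process to run; the model, not the code, decides what handles what.
service_element = {
    "name": "vpn-core",
    "state": "active",
    "state_event_table": {
        ("active", "fault"):    "run_remediation",
        ("active", "change"):   "run_reconfiguration",
        ("deploying", "ready"): "activate",
    },
}

def handle_event(element, event, processes):
    key = (element["state"], event)
    process_name = element["state_event_table"].get(key)
    if process_name is None:
        return   # the event isn't meaningful in this state; discard it
    processes[process_name](element)   # dispatched asynchronously in a real system

handle_event(service_element, "fault",
             {"run_remediation": lambda e: print(f"remediating {e['name']}")})
```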

The third suggestion is the one I have the most concern about, and it may well be the most important.  Lean NFV suggests that the issues with VNF onboarding relate to the configuration information and the fact that VNFs use old and often proprietary interfaces.  Lean NFV will provide “a universal and scalable mechanism for bidirectional communication between NFV management systems and VNFs”, which I believe is saying that the data model will set a standard to “rewrite” VNFs.  I don’t think that there’s much interest among vendors in rewriting, so I’m not comfortable with this approach, even at the high level.

OK, where this has taken us is to accept two of the three “goal-level” pieces of Lean NFV, but not the third.  That leads to the question of whether the key-value store is the way to achieve those goals, and in my view it is not.  I have to say, reluctantly, that I think the Lean NFV process makes the same kind of mistakes as the original NFV did.  They’re wrong, differently.

One problem is that a key-value store doesn’t define the relationship between services, VNFs, and infrastructure.  Yes, it describes how to parameterize stuff, but a service is a graph, not a list.  A graph shows relationships and not just values.  In order to commit compute resources or connect things, I need a data model that lets me know what needs to be connected and how things have to be deployed.  I told the ONAP people that until they were model-driven, I would decline to take further briefings.  The same goes here.
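
Contrast the flat keys in the earlier sketch with even a trivial graph-structured model, where the relationships are first-class.  Again, this is illustrative only, not a proposal for a specific schema.

```python
# A service is a graph: nodes are functions and connections, edges say what
# binds to what, and a deployment plan falls out of the structure.
service_model = {
    "nodes": {
        "access-a": {"type": "vnf", "image": "vfw:2.1"},
        "core-vpn": {"type": "connection", "technology": "mpls"},
        "access-b": {"type": "vnf", "image": "vfw:2.1"},
    },
    "edges": [
        ("access-a", "core-vpn"),
        ("core-vpn", "access-b"),
    ],
}

def deployment_plan(model):
    """Walk the edges to see what has to exist before what gets connected."""
    for left, right in model["edges"]:
        yield f"ensure {left} and {right} are deployed, then bind them"

for step in deployment_plan(service_model):
    print(step)
```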

The second issue is the lack of explicit event-driven behavior.  APIs are static and tend to encourage “synchronous” (call-and-wait) behavior, whereas events dictate an asynchronous “call-me-back” approach.  Not only does Lean NFV not mandate events, it provides no specific mechanism to associate events with processes.  It suggests that microcontrollers could “watch” for changes in the key-value store, which makes the implementation more of a poll-for-status approach, something we know isn’t scalable to large networks and services.

The biggest problem, though, is that we’re still not addressing the basic question of what a virtual network function is.  Recall from a prior blog that I noted there was early interest in decomposing current “physical network functions” (things like router or firewall code) into logical features, and then permitting recomposition via cloud behavior.  If we decide that a VNF is the entire logic of a device, then making it virtual does nothing but let us run it on a different device.  There may be subtle differences in performance and economics when we look at hosting a VNF in a white box (uCPE), on a commercial server, in a container, in a VM, or even serverless, but will that be enough to “transform”?

There are some good ideas here, starting with a pretty frank recognition of what’s wrong with ETSI NFV.  The problem I see is that, like the ISG, the Lean NFV people got fixated on a concept rather than embarking on a quest to match virtual functions to modern abstract resource pools like the cloud.  In the case of Lean NFV, the concept was the key-value store.  To a hammer, everything looks like a nail, and so it appears that the Lean NFV strategy was shaped by the proposed solution rather than the other way around.

There’s still time to fix this.  The paper is very light on implementation details, as I’ve noted before regarding the Lean NFV initiative.  That could mean I’ve missed the mark, but it could also give the group the chance to consider some of these points.  The goals are good, the way to achieve them isn’t so good.  I’m really hopeful that the organization will move to fix things, because there’s a lot of wasted motion in the NFV space, and this at least has some potential.

The problem is that if “Lean NFV” were in fact to adopt my suggestions, it might still be “Lean” but it would have moved itself rather far from ETSI NFV.  There’s never been a standard that a telco couldn’t place too much reliance upon, for too long.  NFV is surely not one to break that rule.