A New Age in Virtual Networking?

Sometimes a term gets so entrenched that we take its meaning for granted.  That seems to have happened with “virtual network”, despite the fact that just what the term means and how one might be created has changed radically over the years.  In the last year, I asked almost a hundred enterprise and service provider network planners what a “virtual network” was, and there wasn’t nearly as much agreement as I thought there’d be.

Conceptually, a “virtual network” is to today’s IP network what “virtual machine” is to a bare-metal server.  It looks like a network from the perspective of a user, but it’s hosted on a real network rather than being a direct property of it.  There are many drivers for virtual networks, which probably accounts for the multiplicity of meanings assigned to the term, but there’s one underlying issue that seems to cross over all the boundaries.

Real networks, at least real IP networks, were designed to connect sites rather than people.  They’re also a combination of Level 2 and Level 3 concepts—a “subnet” is presumed to be on the same Level 2 network and the real routing process starts when you exit it, via the default gateway.  The base concept of real IP networks worked fine as long as we didn’t have a lot of individuals and servers we expected to connect.  When we did, we ended up having to gimmick the IP address space to make IPv4 go further, and we created what were arguably the first “virtual networks” to separate tenant users in a shared data center or cloud.
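
To make that split concrete, here's a minimal sketch (the addresses are purely illustrative) of the decision every IP host makes for outbound traffic: if the destination is on the local subnet, it's delivered directly at Level 2; if not, it goes to the default gateway and real routing begins.

```python
import ipaddress

# Illustrative host configuration; addresses are hypothetical.
local_if = ipaddress.ip_interface("192.168.10.25/24")   # host address + mask
default_gateway = ipaddress.ip_address("192.168.10.1")

def next_hop(destination: str) -> str:
    """Return where the packet goes first: direct Level 2 delivery or the gateway."""
    dest = ipaddress.ip_address(destination)
    if dest in local_if.network:          # same subnet: no routing involved
        return f"deliver directly to {dest} on the local Level 2 network"
    return f"forward to default gateway {default_gateway}"

print(next_hop("192.168.10.77"))   # same subnet -> direct delivery
print(next_hop("172.16.4.9"))      # off-subnet -> via the gateway, where routing starts
```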

Another problem that’s grown up in recent years is the classic “what-it-is-where-it-is” question.  IP addresses are linked to a network service access point, which is typically the gateway router for a site.  A user of the network, in a different site, would have a different address.  In mobile networks, having a smartphone roam to another cell means having it leave the place where its connection is made, so mobility management uses tunnels to follow the user, which is a form of virtual networking.

The what/where dilemma can also complicate security.  IP networks are permissive in a connection sense, meaning they presume any address can send something to any other, and this has created a whole security industry.  Prior to IP, the dominant enterprise network protocol was IBM’s Systems Network Architecture (SNA), which used a central element (the System Services Control Point) to authorize “sessions” within the network, with a session being a relationship between network parties, meaning users, rather than network components.  That security industry, layered on top of the installed base of IP devices, has made it harder and harder to change IP in a fundamental way, which has again boosted the notion of virtual networking.

Then there’s the big issue, which is “best efforts”.  IP does support traffic engineering (MPLS, for example) but typically in the carrier network and not the user endpoints.  A branch office and even a headquarters location doesn’t have MPLS connectivity.  Traffic from all sources tends to compete for resources equally, which means that if there are resource limitations (and what network doesn’t have them?) you end up with congestion that can impact the CxO planning meeting as easily as someone’s take-a-break streaming video.

There have been proposals to change IP to address almost all these issues, but the installed base of devices and clients, combined with the challenges of standardizing anything in a reasonable time, has limited the effectiveness of these changes, and most are still in the proposal stage.  So, in a practical sense, we could say that virtual networks are the result of the need to create a more controllable connection experience without changing the properties of the IP network that’s providing raw connectivity.

Building a virtual network is different from defining what the term means.  There are two broad models of virtual network currently in play, and I think it’s likely they’ll represent future models of virtual networking as well.  One is the software-defined network, where forwarding behavior is controlled by something other than inter-device adaptive exchanges, and where routes can be created on demand between any points.  The other is the overlay network, where a new protocol layer is added on top of IP, and where that layer actually provides connectivity for users based on a different set of rules than IP would use.

The SDN option, which is obviously favored by the ONF, falls into what they call “programmable networks”, which means that the forwarding rules that lace from device to device to create a route are programmed in explicitly.  Today, the presumption is that this happens from a central (probably redundant) SDN controller.  In the future, it might happen from a separate cloud-hosted control plane.  The advantage of this is that the controller establishes the connectivity, and it can fulfill somewhat the same role as the SSCP did in those old-time SNA networks (which, by the way, still operate in some IBM sites).
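
As a toy illustration of what “programmed in explicitly” means (the device names and rule format here are invented, not any real controller API), the controller computes a path and installs a match/action entry in each device along it:

```python
# Toy SDN controller sketch: device names, topology, and rule format are hypothetical.
flow_tables = {"switch-A": [], "switch-B": [], "switch-C": []}

def install_route(path, match, out_ports):
    """Push one match/action entry to every device along a centrally computed path."""
    for device, port in zip(path, out_ports):
        flow_tables[device].append({"match": match, "action": f"output:{port}"})

# The controller decides that traffic for 10.1.2.0/24 follows A -> B -> C.
install_route(
    path=["switch-A", "switch-B", "switch-C"],
    match={"dst_prefix": "10.1.2.0/24"},
    out_ports=[3, 1, 7],
)

for device, rules in flow_tables.items():
    print(device, rules)
```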

As straightforward and attractive as this may sound, it still has its issues.  The first is that because SDN is a network change, it’s only available where operators support it.  That means that a global enterprise would almost certainly not be able to use the SDN approach to create a custom connectivity service over their entire geography.  The second is that we have no experience to speak of on whether the SDN concept can scale to very large networks, or on whether we could add enough entries to a flow switch (the SDN router) to accommodate individual sessions.

The overlay network option is already in use, in both general virtual-network applications (VMware’s NSX, Nokia/Nuage, etc.) and in the form of SD-WAN.  Overlay networks (like the mobility management features of mobile networks) take the form of “tunnels” (I’m putting the term in quotes for reasons soon to be clear) and “nodes” where the tunnels terminate and cross-connecting traffic is possible.  This means that connectivity, to the user, is created above IP and you can manage it any way you like.

What you like may not be great, though, when you get to the details.  Overlay virtual networks add another header to the data packets, which has the effect of lowering the link bandwidth available for data.  Header overhead depends on packet size, but for small packets it can reach 50% or more.  In addition, everywhere you terminate an overlay tunnel you need processing power.  The more complex the process, the more power you need.
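
As a rough illustration of the overhead point, using VXLAN-style header sizes (other overlays differ), the added headers are a fixed per-packet cost, so small packets take the biggest percentage hit:

```python
# Rough overhead arithmetic for a VXLAN-style overlay; header sizes are typical, not universal.
OUTER_HEADERS = 14 + 20 + 8 + 8   # outer Ethernet + outer IPv4 + UDP + VXLAN = 50 bytes

for payload in (64, 100, 512, 1400):
    overhead = OUTER_HEADERS / payload
    print(f"{payload:5d}-byte inner packet -> {overhead:.0%} added overhead")

# Small packets (think VoIP or telemetry) can see 50% or more added;
# near-MTU packets see only a few percent.
```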

It’s logical to ask at this point whether we really have an either/or here.  Why couldn’t somebody provide both implementations in parallel?  You could build a virtual overlay network end-to-end everywhere, and you could customize the underlying connectivity the virtual network is overlaid on using SDN.

Now for the reason for all those in-quotes terms I’ve been using, and promising to justify.  Juniper Networks has its own SDN (Contrail), and they just completed their acquisition of what I’ve always said was the best SD-WAN vendor, 128 Technology.  What 128T brings to the table is session awareness, which means that they know the identity of the user and application/data resources, and so can classify traffic flows between the two as “sessions”, and then prioritize resources for the ones that are important.  Because 128T doesn’t use tunnels for a physical overlay (they have a “session overlay” that has minimal overhead), they don’t consume as much bandwidth and their termination overhead is also minimal.

What Contrail brings is the ability to manipulate a lower-level transport property set so that actual IP connectivity and the SLA are at least somewhat controllable.  With the addition of Juniper’s Mist AI to the picture for user support and problem resolution, you have a pretty interesting, even compelling, story.  You can imagine a network that’s vertically integrated between virtual, experience-and-user-oriented, connectivity and a virtualization layer that’s overlaid on transport IP.  From user, potentially, to core, with full integration and full visibility and support.

If, of course, this is a line Juniper pursues.  The good news is that I think they will, because I think competitors will move on the space quickly, whether Juniper takes a lead with 128T or not.  That means that while Juniper may surrender first-mover opportunities to define the space, they’re going to have to get there eventually.  They might as well make the move now, and get the most possible benefit, because it could be a very significant benefit indeed.

What’s Behind Cisco’s Container-Centric Software Strategy?

Cisco loves containers.  There’s no question that container and software-related acquisitions have dominated Cisco’s recent M&A, but it’s sure reasonable to wonder what they hope to gain.  Does Cisco think they can become a competitor to cloud software and server companies, are they betting on hosted network elements, or what?  Cisco’s Banzai Cloud acquisition last month is perhaps a “tell” regarding Cisco’s direction.

Cisco has been in the server business since 2009, with its Unified Computing System (UCS) line.  At first, UCS was pretty much a server story, but since then, and especially within the last couple years, Cisco has been picking up platform software to augment their hardware position.  Given that Cisco always tells whatever story seems likely to gather the most PR momentum, it’s never been really clear where they wanted to go with the stuff.

I think that, early on, Cisco’s foray into servers came out of their data center switching business.  Cisco is, and always has been, a very sales-directed company.  The IT organization tends to be the buyer of data center switching rather than the networking organization, and so Cisco’s salespeople were calling on a new cadre of prospects for their rack switching.  Given that data center switches in general, and top-of-rack systems in particular, connect server farms, the demand for them comes as a result of an expansion in the server area.  Salespeople ran back to Chambers (CEO of Cisco at the time) and suggested Cisco get into the server business.

UCS has generated respectable but not spectacular revenue for Cisco, but it reached its peak growth around 2015, and the largest number of UCS customers were in the software and technology space.  UCS has fallen in market share in the last several years according to most analysts’ reports.  This coincides with the sudden growth in cloud computing and containers, and that’s what raises our questions regarding Cisco’s motives.

Cisco might be doing nothing more than aligning UCS with current platform directions.  Users increasingly want to buy hosting platforms, which include both the servers and the necessary operating system and middleware tools.  Even more significant is the user’s focus on the platform software for the hosting value proposition; the servers are just cost centers to be negotiated into “margin marginalization”.  Since Cisco doesn’t want to be in a commodity business, it makes sense to build the value-add.

The Banzai deal may be the “tell” for this view, as I’ve already suggested.  Banzai was focusing on enterprise cloud-native development and deployment.  If Cisco wants to be a hosting platform player for the enterprise, building their credibility in the original UCS mission, then jumping out ahead of the current market is critical; there’s too much competition for vanilla containers.  Differentiation would help Cisco sustain margins.

The only problem with this is that IBM/Red Hat and VMware are also jumping into the cloud-native space, and from a position of an established data center vendor.  Their approach is to replace the software platform while being server agnostic, meaning that to compete with them, Cisco would have to either sell software without UCS servers, or displace existing servers to move UCS servers in.  The former means going head-to-head with established vendors, and the latter would be a tough sell to enterprise CFOs.

So, what are they doing?  A second possibility is that Cisco is shifting its focus to a future convergence of network and cloud.  Remember that Cisco’s revenues are overwhelmingly from network equipment, and their best profit margins have been in the router space.  With routers under price pressure, and with network operators and enterprises both looking at open-model networks, Cisco’s core business is under pressure.  Could it be that Cisco is looking to sell servers to buyers who have new network missions that involve servers and network equipment?  Think “carrier cloud”.

Carrier cloud is kind of like the Seven Cities of Gold; everyone “knew” they were out there, but nobody ever found them.  The potential of carrier cloud is enormous, over 100,000 data centers by 2030, containing millions of servers.  It would be, if deployed, the largest single source of new server purchases in the global market.  Best of all, from Cisco’s perspective, carrier cloud is sold to carriers, people Cisco has been selling to for decades.

The problem with this is that operators are far from a strong financial commitment to carrier cloud.  Most of them see applications of “carrier cloud”, but few of them are confident at this point that they can make a business case for them, or even assign a cost to them, given a lack of understanding of just how the applications would work.  NFV was the only carrier cloud driver that operators really understood, and it failed to develop any credibility.  It’s not like Cisco, sales-driven as it is, to spend a lot of sales resources educating a market that they know will be competitive if buyers learn the ropes.

5G and Open RAN might be the next opportunity for carrier cloud, and for Cisco.  Here, the opportunity and the execution on it would develop pretty quickly, there’s funding/budget in place, and there’s clear market momentum.  Cisco could well see an opportunity to grab hold of this next driver, and by doing so gain control over carrier cloud.  They might also be able to use 5G and Open RAN to cement a position in the “separate control plane” model.  Cisco disaggregates IOS and its router hardware, but to make their position real, they need to separate the control plane and extend it, at least in part, to the cloud.

The problem with this is that it’s still a big reach for a company that’s never been a software innovator.  I think it’s more likely that a Cisco competitor would jump on this opportunity, in which case Cisco likely sees it the same way and would probably not invest a lot of resources at this point.  Fast-follower is their role of choice.

What does that leave?  “Eliminate the impossible, and what’s left, however improbable, must be the answer.”  I think that while none of the possible Cisco motives for container interest are impossible, the one least improbable is the first one, that Cisco is seeking broader data center traction.  A recent Network World piece seems to reinforce Cisco’s interest in the enterprise data center as their primary motivator.

One good reason is the one already cited; Cisco has engagement with the buyer in that space already.  Another good reason is that Cisco thinks the enterprise, or at least the leading-edge players in that space, are likely to move faster than the service providers.  Service provider profit per bit challenges are profound, and it may take years for them to evolve a strategy and fund it.

A final, possibly critical point is that carrier cloud is more “cloud” than “carrier”.  If there is a credible market for carrier cloud in the future, it will involve service features based more on traditional public cloud technology than on network technology.  Thus, a Cisco initiative to address near-term cloud-native opportunity for the enterprise today could pay dividends for carrier cloud initiatives in the future.

Can a Fiber-Centric Strategy Help AT&T?

Is AT&T right about fiber?  The CFO says it’s a “three for one” revenue opportunity, which is why the operator says they’re likely to add to their fiber inventory.  One clear message you can draw from that is that a one-for-one or two-for-one might not have been enough, which leads us, I think, to the real reasons why AT&T is looking for more glass in the ground.

Consumer fiber, meaning FTTH, requires a fairly high demand density to justify, because its “pass cost”, or cost just to bring the fiber to the area of the customer to allow for connection when there’s an order, is high.  Operators put the FTTH pass cost in the over-five-hundred-dollar range, and at that level, there’s way too much area that residential fiber can’t easily reach.

If you can’t make residential fiber broadly cost-effective, perhaps you can gang it with other fiber applications, notably things like 5G backhaul and multi-tenant applications like the strip malls the article talks about.  If you look at everything other than large-site fiber as being an application of passive optical networking, you can see that just getting PON justification in a given area could open that area up to FTTH at a low enough incremental cost to make it profitable.

Of the possible non-residential fiber drivers that could be leveraged, the most interesting could be microcells for 5G and fiber nodes for 5G millimeter wave.  The former mission is valuable in both more rural settings and in high-density retail areas, and the latter in suburban locations with highly variable demand density, where pass costs for FTTH could limit how much of the suburb you could cover with high-speed broadband service.

Every restaurant and retail location knows that customers like WiFi, and restaurants in particular almost have to offer it.  People often use WiFi in restaurants to watch videos and do pretty much the same things they’d do at home, but in a larger concentration.  If you spread a few 5G microcells around a heavily strip-malled area you could feed them with fiber, getting fiber closer to the residential areas they served.

From there, you could then consider the 5G/FTTN hybrid model.  By extending strip-mall feeds to a local 5G millimeter-wave node, you could reach residential and even small business sites up to about a mile away for high-speed broadband delivery.  Each decent-sized strip mall could be a multi-purpose fiber node that could support even 5G mobile services, enhancing total capacity and service credibility.  In fact, the combination of 5G/FTTN and 5G mobile could be a killer in the suburbs, and of course it facilitates wider-ranging fiber deployment.

Additional cells also improve the chances of open-model networks, and 5G in particular.  One of the factors operators cite to justify proprietary 5G RAN is that open implementations don’t yet support the massive MIMO that could improve cell capacity.  With more, smaller cells, there’s less pressure to provide very high-capacity cells.  In fact, this may be a major factor in AT&T’s dense-cell strategy; they also have a major commitment to open-model 5G.

A lot of fiber could also ease the classic capacity-versus-complexity tradeoff in network design.  If you have a lot of capacity, you need less traffic management and complexity at Level 3, which is where most opex costs are generated.  You can also probably rely more on generic white boxes for your data paths, as long as they can support the capacity you need.  The effect of combining open-model IP networking with higher optical capacity and density is to shift your capex toward fiber and reduce your opex.

There’s no question in my mind that AT&T is right about using more fiber, creating more 5G nodes.  There is still some question on whether the move can really fully address AT&T’s rather unique issues as a Tier One.  To understand why, and what might help, we have to dig a bit into those issues.

Most Tier One operators evolved from wireline carriers who served populous regions.  AT&T is fairly unique in that its wireline territory has more widely dispersed populations; rival Verizon has seven times the demand density, meaning much more opportunity per unit of geographic area.  In cities and typical suburbs, AT&T and Verizon are comparable, but in more distant suburbs and rural areas, AT&T is far less dense.  When Verizon started with its FiOS plans, it shed some of the localities where there was no chance FiOS could be profitable, to eliminate the problem of having some customers who could get great wireline broadband and others who could not.

Wireless is different, of course, and more so if you factor in the 5G/FTTN hybrid.  Instead of having a pass cost of about $500 per home/business for FTTH, your pass cost could drop to less than $100, provided that you had reasonable residential density within a one-mile radius of the node.  That would cover about 80% of the thin suburban locations.  Add in mobile-model 5G, with a range of 8-20 miles from the tower, and you have the potential to cover your entire territory with acceptable pass costs.
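
Here’s a back-of-envelope sketch of why the node model changes the math; every number in it is an assumption for illustration, not an operator figure:

```python
import math

# All inputs are illustrative assumptions, not AT&T or industry figures.
node_cost = 75_000          # fiber feed plus 5G mm-wave node, hypothetical
radius_miles = 1.0          # useful mm-wave reach from the node
homes_per_sq_mile = 300     # a thin-ish suburban density

coverage_area = math.pi * radius_miles ** 2
homes_passed = coverage_area * homes_per_sq_mile
pass_cost_per_home = node_cost / homes_passed

print(f"Homes passed per node: {homes_passed:.0f}")
print(f"Pass cost per home:    ${pass_cost_per_home:.0f}")
# With these assumptions, one node passes roughly 940 homes at about $80 each,
# versus an FTTH pass cost in the $500 range.
```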

That’s why the decision by AT&T to drop DSL is smart.  They have too many thinly populated areas to sell off everything where FTTH won’t work, so they have to find something that does work, and the 5G option is their best answer.  In fact, AT&T’s network business salvation lies in focusing on 5G infrastructure for wireless and wireline and using FTTH only where the customer density is very high.  If 5G mm wave works out, in fact, they might well be better off not using FTTN anywhere.  Going full 5G would improve their demand density problems significantly, to the point where their effective density would triple.

That’s not enough for them to be profitable, in the long run, from delivering broadband alone.  Instead of Verizon being ahead by seven times, they’d be only a bit more than double AT&T’s effective density.  AT&T would still get some kicker from their Time Warner acquisition, but they’ll still need new revenue streams.  If they move totally to a 5G model, meaning a pure packet model, they would be committed to streaming video, which they’ve already failed to capitalize on.  Can they do better, in streaming and elsewhere?

Maybe, because the open-model, separate-control-plane network would also potentially address their new-revenue challenge.  5G has some control-plane cloud-hosting potential (white boxes are still the best approach for the data plane), and future services built on contextual/personalization processing and IoT are all dependent on mobile access.  If AT&T did intelligent network modeling for this combination of a pure 5G future and new contextual/IoT services, they could get pretty well out in front on generating new revenue, credible and significant new revenue, in fact.

Can they do that?  AT&T has been, perhaps more than any network operator, a driver of open-model networking.  They’ve not always been the most insightful driver, though.  There is a risk that their white-box emphasis will focus them again on boxes rather than on software, that they’ll view software as just a necessary component of white-box networking rather than its real justification.  If they can learn the software lesson, or if a vendor can teach it to them, they’ll have a shot at a future.

The Street is mixed on AT&T today.  Some love their dividend and see them as a safe play, and some say that beyond the dividend-of-the-moment, there may be bad moments ahead.  I think that what AT&T has done so far has not secured their future.  I think fiber enhancement and even open-model networking won’t secure it either.  But I think that these measures have bought them at least two or three years to do the right thing.  It’s just a matter of their identifying it, then executing on it.

There’s not much awareness among operator planners regarding the architecture of a monetizable future service set, or the network/cloud symbiosis needed to create and sustain it.  There’s also, apparently, a serious shortage of cloud software architecture skills in operator organizations.  Finally, operators still see “services” as being the result of combined device behaviors rather than the creation of software functionality.  I think AT&T is working as hard as any operator to deal with all these issues, but they need to get moving; remember their cost management measures will buy them three years at the most.

More on the Evolution of Open RAN

The news for Open RAN just keeps getting better, but we all know that news is an imperfect reflection of market reality.  There is indeed a lot of good news and important stuff, but there are also some major questions being raised by operators, and implied by the very “good news” we’re happy about.

Operators are getting more and more committed to an Open RAN approach.  There have been many announcements of vendor and operator support, and the US House passed (with almost unheard-of unanimity) an Open RAN bill designed to improve the credibility of technology elements that claim adherence.  As I blogged last week, operators are seeing “Open RAN” as the gateway to a more open network model.

To me, the most important development in Open RAN is the Dish deployment.  Dish obviously intends to be the fourth big mobile operator in the US, and it’s solidly in the Open RAN camp as far as infrastructure.  They’ve had to pull together a lot of technology pieces to make this work, and while that illustrates the integration challenge of open technology (which is almost always built up from individual projects), it also illustrates that the challenge can be and is being met, which means there’s a prototype approach out there for other operators to learn from.

One thing we can learn already is that there’s still an expectation that Open RAN will involve white boxes that use some custom chips.  Qualcomm is a recent Dish partner in their initiative, and other chip vendors (including Intel) are expecting to reap some new opportunities by supplying 5G-tuned technology elements.  That raises the question of how “the cloud” and “white boxes” will be combined in an Open RAN initiative, and how that cooperation will evolve as operators look beyond 5G RAN for open-model network candidates.

We know that as you get toward the edge of a network, the number of users to connect per geographic point gets smaller.  You can push resources toward the edge, but you can’t go further than the number of users per pool of resources would justify.  It follows that as you move toward the edge, there is less chance that your “open” strategy can consist entirely of cloud-hosted features and functions.  You’ll start to see white boxes.

Interestingly, the same is true throughout the data path.  While it may be possible to create a server that, when combined with some kind of superchip augmentation to traditional CPUs, would be able to push packets as fast as a white-box forwarding device, it’s not clear that there would be much value.  Data paths are threaded through a combination of nodes and fiber, and the latter kind of goes where it goes.  You know where you need nodes, which is at the place where multiple trunks terminate.  White boxes make sense there.

This combination seems to me to suggest that it’s likely that white boxes will play a very large role in not only Open RAN, but in whatever builds out from it to create wider-scale open-model networks.  In fact, if we forgot OSI models and network philosophy, we might be able to see the future network as the “white-box layer” with the “cloud-feature layer” on top.

There is, in 5G RAN and Open RAN, a concept of a “near-real-time” element, which acknowledges that there are pieces of the 5G control plane that have to be married more tightly to the user plane.  All over-the-top services, from video streaming to social media, also demonstrate that the entire OTT space operates further from real time than almost anything that’s part of the network.  We also know that OTT applications are users of the network, not part of it.

If we map this to our device layers, we can say that white boxes are likely to handle the “near-real-time” pieces of feature/function distribution, and the cloud the higher layers.  We could also presume that the cloud could be handling things like backing up or scaling some of the near-real-time parts, given that an overload or failure would likely result in more disruption than a slightly longer data/control pathway to a deeper cloud element.  Particularly if edge computing is where that overflow stuff is hosted.

Edge computing, in fact, may be justified less by the new things like IoT that might be done with it, than by the requirements of an open network and reliable hosting of features.  That is more likely true if we start to think about how 5G’s control plane and user plane align with IP’s control plane and data plane.

We create a whole new layer, a tunnel layer, in wireless networks to accommodate the fact that cellular network users move between cells while in a session of any sort.  To preserve the session, we have a system that detects the movement and moves the tunnel accordingly.  Since the user’s session traffic is going to the tunnel and not to the IP address of the cell they started in, moving the tunnel moves the traffic.  But if we have superchips that do packet forwarding based on instructions from the 5G control plane and mobility management, why couldn’t the data network offer an interface to convert those instructions (the 5G N2 and N4 interfaces) to forwarding table changes?  No tunnels.
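
Here’s a sketch of the idea, with invented interfaces rather than a real N2/N4 implementation: a mobility event simply rewrites the forwarding entry for the user’s address, so nothing has to be tunneled or re-tunneled.

```python
# Hypothetical sketch: mobility events drive forwarding-table updates directly,
# instead of moving a GTP tunnel endpoint. Names and interfaces are invented.
forwarding_table = {
    "10.20.0.15": "cell-site-router-A",   # UE address -> next hop toward its current cell
}

def on_handover(ue_address: str, target_cell_router: str):
    """What a control-plane-to-forwarding adapter for N2-style handover events might do."""
    old = forwarding_table.get(ue_address)
    forwarding_table[ue_address] = target_cell_router
    print(f"{ue_address}: next hop {old} -> {target_cell_router} (no tunnel moved)")

on_handover("10.20.0.15", "cell-site-router-B")
```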

We do something similar to this in content delivery networks.  When you click on a content URL, the decoding isn’t like that of an ordinary URL that links you to a specific IP address, it’s a dynamic relationship-building process that links you to the optimum cache point for what you’re looking for.  Again, we could do that with the data-plane forwarding process.
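
A toy version of that dynamic relationship-building step might look like the sketch below; real CDNs use DNS tricks, anycast, or redirects, and the cache names, locations, and distance metric here are made up.

```python
# Illustrative request-routing sketch: map a client's location to its best cache.
# Cache names, coordinates, and the distance metric are all invented.
caches = {"cache-east": (40.7, -74.0), "cache-central": (41.9, -87.6), "cache-west": (37.8, -122.4)}

def resolve_content_url(client_location):
    """Pick the 'optimum' cache for this client instead of a fixed IP address."""
    def squared_distance(cache):
        lat, lon = caches[cache]
        return (lat - client_location[0]) ** 2 + (lon - client_location[1]) ** 2
    return min(caches, key=squared_distance)

print(resolve_content_url((39.0, -77.0)))    # a client near the east coast -> cache-east
print(resolve_content_url((45.5, -122.7)))   # a client in the northwest -> cache-west
```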

Even cloud computing might have a similar enhancement.  Right now, things like Kubernetes (at a primitive level) and service meshes like Istio (at a sophisticated level) do load balancing and discovery tasks to allow messages to reach dynamic and scalable microservices.  Why not let that happen down in the chips?

What I think is emerging from 5G, and from other developing network missions, is the recognition that there’s a kind of middle-ground between “in the network” and “on the network”, a “partnered-with-the-network” piece that I’ve generally assigned to the category of “network-as-a-service” because it slaves connectivity to something that’s not part of traditional IP route determination and handling.  As we morph from the white-box piece toward the cloud, we’re changing the relationship between “control” pieces and “data” pieces, and we’re flattening OSI layers to subduct more stuff down to the chip level, where we can do the functions efficiently.

It’s facile to say, as the 3GPP does, that features of 5G are hosted with NFV, but that’s a mistake because it means that we’re evolving into a future where low-level packet handling is getting a lot more efficient and agile, while we’re nailing ourselves to architecture models designed when it wasn’t.  Things like NFV, or arguments like whether VNFs should be CNFs or CNNFs, are implementation details that should be based on specific tradeoffs in latency versus efficiency.  One size does not fit all.

The challenge of this vision goes back to our hardware layers.  We have white boxes that will obviously have to host components of the network.  We have edge systems that will provide more localized supplementary hosting, both as backup to box stuff and as a feature repository for things that have to be somewhat close but not totally local to the data plane.  We have cloud technology to host more OTT-like elements of services.  If this were all homogeneous, it would be easy to see how pieces are deployed and coordinated, but it’s not.

The further out we go from traditional cloud data centers, the less likely it is that the totality of cloud software will be available for us to draw on.  There are already versions of container and orchestration (including Kubernetes) software designed for a smaller resource footprint.  There are therefore already indications that one toolkit may not fit all projects in the open-model network future.  How do we harmonize that?  Multiple orchestration systems fall outside even today’s concept of “federation” of orchestration.

This is what I think will become the battleground of future network infrastructure.  It’s not about finding a solution, but recognizing there’s no single solution to find.  We need to federate the swirling layers of functionality without constraining how any of them are built or used.  It’s the ultimate application of abstraction and intent modeling, waiting for an architect to lay out the blueprint.

Who?  The public cloud providers, the software platform players like VMware and Red Hat, and the white-box software newcomers like DriveNets, all have a place in the story to exploit.  It’s not a race for a solution as much as a race for a vision, and those kinds of races are the hardest to call, and the most interesting to watch.

Making Sense of SASE Trends

What do users and providers think about SASE?  Besides, perhaps, that they hate acronyms and the constant creation of “new categories” for the famous quadrant charts?  I usually try to get a feeling for this sort of question at the end of the year, because even where there’s no formal tech planning cycle at year’s end, there’s still a lot of anticipation.  The new budgets, after all, are coming.

Secure Access Service Edge is definitely draped with a healthy dose of hype.  Every vendor knows that it’s very difficult to sell anything into enterprise or service provider organizations without some editorial visibility to generate sales interest, and new terms, even without new concepts to back them up, generate media attention.  Editorial mentions sell website visits, website visits sell sales calls, and sales calls sell products and services.  This direct (and cynical) connection makes it all too easy to miss some really important things that are starting to happen.  In the case of SASE, though, there’s a lot of reality that’s in danger of being buried in the hype.

What, first of all, is a “service edge?”  In traditional networking it used to be called the “dmarc”, short for “point of demarcation”, or the boundary between the operator service and business network infrastructure.  If you sell a service that doesn’t connect right to the user (which few business services do) then you have a problem of responsibility that can be solved only if you can point to a spot and say “my stuff starts here, and beyond that is your stuff”.

There’s still a lot of dmarc in SASE, but the SASE concept reflects the realization that connecting to the user is the real goal, whoever is responsible for it.  There’s a risk that by saying that carrier responsibility stops at a given point, an unwary buyer might think all responsibility stops there, and on-site or in-company networking is the wild west.  That’s particularly bad in a security sense, because nearly every business network today is connected to the Internet, and even non-Internet elements can be hacked.

SASE is a new kind of dmarc, a service portal that includes both traditional connection features and features that are required to make connected users and resources safe, prioritize traffic, optimize resources, and so forth.  The technical core of SASE is often SD-WAN, but it also includes web gateway security, cloud access security, firewall services, and other stuff.  It’s really this loose mixture that creates a lot of the hype and confusion on the topic.  We need to look at buyer views to get the truth.

One obvious truth is that SASE, which of course means Secure Access Service Edge, is largely about security as far as everyone is concerned.  Interestingly, though, there really isn’t a specific buyer view on how that comes about, on what makes SASE different from other kinds of service edges in technology terms.  They realize that edge security is important, but they’re hazy on how they’d get it.  Part of that may come back to the fact that SASE is functionally a bit of a hodgepodge of “product categories”, and part to the fact that the buyer has a goal but is still searching for how to fulfill it.

Beyond the obvious security point, what buyers say about SASE is that it’s comprehensive, it’s virtual, and it’s part of a harmonious vision of future services that we could call network-as-a-service or NaaS.  To understand what they mean, and what this means to the network space, we have to look at each of those terms.

“Comprehensive” is simple.  Enterprises want a secure access service edge strategy that works for all their edges, which means that it can’t be something that doesn’t handle branches, home workers, cloud hosting, data centers, or anything else that’s either a provider or consumer of services.  It’s this attribute that’s most often assigned to SD-WAN technology, but most SD-WANs aren’t secure.  In fact, security is the main thing buyers think separates SASE from SD-WAN.

The second piece, “virtual”, is a bit harder to explain.  Buyers generally saw SASE more as an architecture or model describing a distributed collection of features than as a single box that did everything.  They didn’t object to boxes where they needed a device, but they can’t put boxes inside the cloud, and some aspects of SASE evolution seem to them to point to a more distributed model, which gets us to our next (and last) buyer point.

The third piece, a NaaS vision of the future of SASE, is more my characterization than a clear buyer-stated requirement, even though about a third of buyers will use the term at some point.  Most buyers will creep up on the NaaS notion in discussion, but not name it.  What they visualize is kind of what you could call a “cloud model”, where distributed pieces of stuff (the “virtual” part) are assembled by uniting the services these pieces create.  There’s a bit more, though.  That same third of buyers who think about NaaS are also seeing the role of SASE expand beyond security, and that may answer a lot of questions about what vendors seem to be thinking.

The clear goal, name notwithstanding, is to refine the notion of “services” by personalizing it to user and application needs.  Network as a service implies a level of customization, a means of setting enterprise priorities on user/application relationships and ensuring that the higher priorities get higher resource priorities.  In fact, prioritization or even explicit QoS is seen as part of SASE’s mission by almost all the users who recognize NaaS.  The “secure” piece is just a part of the issue of controlling information flows to ensure they meet business goals.

User-specific performance, whether it’s in priority form or in absolute QoS form, requires a way of identifying the user and what they’re doing.  In fact, what would be ideal is a framework where users and applications could both be grouped into categories that would allow for mass assignment of priority handling.  Location, home-versus-branch, role, and other factors are examples of user categories.  Many systems, including Microsoft’s Active Directory, provide a way of classifying users, but that doesn’t mean you’d be able to recognize their traffic.  It seems as though SASE evolution is going to have to lead to a place where user identification is supported at the traffic level.
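
A minimal sketch of the kind of category-to-priority mapping I’m describing; the group names, application classes, and priority markings are hypothetical, and the hard part, recognizing the user at the traffic level, is simply assumed here.

```python
# Hypothetical policy lookup: user category + application category -> handling class.
# Group and class names are invented for illustration.
priority_policy = {
    ("executive", "video-conference"): "EF",     # expedited forwarding
    ("branch-user", "erp"):            "AF41",   # high-priority assured forwarding
    ("home-worker", "saas"):           "AF21",
}
DEFAULT_CLASS = "BE"                             # best effort

def classify(user_group: str, app_category: str) -> str:
    """Return the traffic class for this user/application pair, defaulting to best effort."""
    return priority_policy.get((user_group, app_category), DEFAULT_CLASS)

print(classify("executive", "video-conference"))   # EF
print(classify("home-worker", "streaming-video"))  # BE
```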

That then raises a security question.  Do you base the “SA” in “SASE” on the current mechanism of intercept/blocking via firewall?  Permissive connection meets insurmountable object?  Or do you employ a zero-trust mechanism where only authorized connections are carried?  If you’re going to do explicit zero-trust security, it would make sense to have as much knowledge of both user and application as possible, so that you could infer default connection and priority settings based on what’s known about both users and the applications or sites they’re trying to access.
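
To put the contrast in the simplest terms: a permissive network forwards anything not explicitly blocked, while a zero-trust model forwards nothing that isn’t explicitly authorized.  A sketch, with an invented session-authorization table:

```python
# Zero-trust admission sketch: default deny, with explicit (user, resource) authorizations.
# The authorization table and its entries are illustrative only.
authorized_sessions = {
    ("alice", "erp.internal"),
    ("alice", "mail.internal"),
    ("bob",   "mail.internal"),
}

def admit(user: str, resource: str) -> bool:
    """Zero-trust check: only carry sessions that are explicitly authorized."""
    return (user, resource) in authorized_sessions

print(admit("alice", "erp.internal"))   # True: authorized session is carried
print(admit("bob",   "erp.internal"))   # False: no rule, so the connection never forms
```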

That raises the always-fun topic of support.  It’s always been true that to a user, “the network” was whatever wasn’t on their desk (or in their hand).  You could describe this view as a local-/distant-resource separation.  In our modern world, more and more of the experiences we’re seeking, the applications we’re running, involve distant resources.  This creates a growing problem with quality of experience management for the IT professionals who are generating what people are looking at.

A secure edge that can’t deliver applications with the appropriate levels of QoE isn’t an advance, according to a growing number of users.  In fact, over half of all buyers are concerned about the QoE impact of enhancing security.  They know they need security, but they need what they’re doing, running, viewing, to be supported, too.  If you take the security and priority/QoS stuff down to user/application granularity, the challenge of support explodes.

If users see IT as local/distant, and most of what they consider IT is out of sight, how does this impact vendors?  There seems to be a polarization among network vendors, with most staying in their traditional lane and focusing on the transformation of the network as a connectivity framework.  The main outlier is Cisco, who may believe that by strengthening its position in cloud computing platforms, it can leverage the coming symbiotic relationship between network and cloud.

Juniper may be emerging as the most interesting player in this new game.  With their Netrounds and Mist acquisitions, they created an assurance-centric, operations-based, platform that combined with a Juniper edge to create the first “virtual SASE” model.  While the details of their handling of 128 Technology post-acquisition aren’t yet clear (because it isn’t closed yet), 128T has all the features needed to create a super-SASE solution when combined with Juniper’s other assets.  Could Juniper be thinking of a model of networking where the services themselves, and the experiences and applications the services connect with users, become the center of the network universe?  I think that may very well be what they have in mind, in which case they could become a true force in network evolution…if they get it right.

SASE is proof that the concept of dissecting unified missions into discontinuous product sets is good for analyst firms and vendors but bad for users.  The same forces that have acted to create SASE continue to operate, to add to the mission.  Might we end up with “Secure, Assured, Self-Managed, Access Service Edge”?  SASASE is a kind of catchy, musical, acronym, but I think we’re heading for “NaaS Service Edge” instead.

That might create an interesting market dynamic, because there are multiple paths to get to a NaaS.  An edge solution is logical because the edge can assemble functional pieces at the user connection point, but a deeper strategy could provide more efficiency, and introduce “cloud elements” of functionality more easily.  Whatever the pathway, it may be that the NaaS model is the real goal in the end, even for SASE.

What are Operators Planning for New Technology in 2021?

Operators usually do a tech planning cycle that runs from about mid-September to mid-November.  The ones I’ve been tracking (about 50) are now done with their cycles, so this is a perfect time to look at what operators think they need to be doing, and facing, in 2021.

Tech planning and budget planning, I must note, are related but not the same.  Most operators will plan out any major technology initiatives or potential new directions, and then work their business cases in the budget cycle.  The tech cycle is designed to ensure that they understand how to acquire and deploy something new, and what they could expect to get from it in broad terms, before they spend a lot of time running numbers on the candidates.  Nearly all major network technology shifts in the past have started with a tech planning item.

One thing I find highly interesting (and unusual) in this year’s planning cycle is that two tech initiatives came up as not only tied for the top, but co-dependent.  The issues were 5G deployment and open-model networking.

Obviously, all operators are at least committed to 5G and 37 of the ones on my list were actively deploying 5G.  The reason this is important is that when you’re looking at operator technology initiatives, it’s not the brilliance of the technology that matters, but how well the technology is funded.  Nobody questions 5G funding credibility for 2021, period.  That makes 5G almost unique, and that makes things that are tied to 5G automatic concept winners.

The linkage between 5G and open-model networking is a winner for two reasons beyond simple association.  First, operators recognize that there is little or no near-term incremental revenue credibility to 5G deployment.  Of the operators I’ve chatted with, only 12 suggested that they believed they could see “new revenue” from 5G in 2021, and only 26 thought they’d see it in 2022.  Frankly, I think both these numbers are about double what’s really likely to happen.  Second, operators know that because 5G will happen, whatever 5G is built on will tend to get nailed to the financial ground with a stake about five to seven years long.  It’s easy to extend current technology directions, but much harder to change them.

One thing that this last point has done is to commit operators to a “standard first, open second” approach to planning.  They want to make sure that everything they deploy conforms to the structural model of the 3GPP 5G RAN and Core specs, and then they want to maximize the number of open components within those models.  This vision eliminates (or at least reduces) the risk that the operator might have to forklift an entire early deployment to adopt an open-model approach, if such an approach were to be unavailable or impractical at the time of the first deployment.  You can introduce new open-model solutions to 5G elements on a per-element basis because the standard guarantees the interfaces needed to connect them.

But is open-model 5G a stepping-stone toward open-model networking overall?  That’s a complicated question that only about half of operators seem to have considered, or considered in any organized way.  Clearly the functional elements of 5G, the 3GPP building-blocks, are specialized and so not likely to be reused in non-5G applications.  What operators think should or will be reused are the tools associated with the deployment, operationalization, and modernization of open-model 5G.  The principles, then, of something like Open RAN or Cloud RAN or whatever, should be principles that could be extended to the network at large, to all its services and missions, in the future.

This point seems to be a goal flying in the face of details, or vice versa.  A bit less than half the operators had really looked at the question of the technology needed for open-model networks, both 5G and in the broader context.  The others were saying things like “both will be supported by virtual functions” or “we expect to increase our use of white boxes”, without the kind of details that prove careful consideration has been involved.

Among those that have actually thought things out, there’s a debate that’s not fully resolved, and a few who don’t think any resolution is possible.  They’re aware that there are NFV-style VNFs in play, and actually called out in the 3GPP 5G stuff.  They also know that there’s something called a “containerized network function” and something called a “cloud-native network function”, and it’s their view that neither of these things are defined with any rigor.  They also know that it’s almost certain that no hosted network function of any sort is going to replace high-capacity data-path devices like switches and routers.  Any open-model approach there will necessarily be based on white boxes.

To me, the white-box piece of this story is the most critical.  Networks carry packets, which means that virtually all network elements have a data-plane component.  It’s credible to think that a hosted function of some sort could provide individual user data planes (though it’s not, to these key operators, credible that this would be a net savings for both capex and opex).  It is not credible, according to the operators, to believe hosted routers will replace all proprietary routers, whereas it is entirely credible that white-box routers could.  Thus, the open-model network of the future is going to have a large number of white boxes, and it’s likely that the biggest piece of that network—the aggregation and core IP stuff of today—will be white-box-based if it’s open.

For this group, the question is whether the source of the device is the only difference between a white-box router and a proprietary one.  Open, disaggregated software running on a variety of white boxes that are 1:1 substitutes for proprietary devices is one choice.  Router complexes (such as those of DriveNets, who won AT&T’s core) are another choice.  SDN flow switches and a controller layer are a third.

One operator planner put it very well; “The question here is whether an open network is open at the device level, [meaning] based on open implementations of traditional elements, or open at the functional level, meaning based on open implementations of service features, in non-traditional ways.”  Both paths lead to some white-box elements, but one path means a lot more of them.

Another issue that this “literati” group is beginning to deal with is the notion of the control plane as the feature layer of the future network, whatever the implementation model.  IP has a control plane, one that SDN centralizes.  5G (and 4G) separated the “mobile control plane” and the “user plane”, which means defining a second control plane.  Services like video delivery have a series of CDN-related features that could be collectively defined as a control plane, and cloud computing creates something like a control plane for orchestration, service mesh, and other stuff.  Are all these control planes going to get more tightly coupled, even as the data plane becomes more uncoupled?

This may be a question that’s tied into the other priority consideration from this tech cycle; “carrier cloud”.  Operators used to see carrier cloud as being their implementation of public cloud, justified by selling cloud computing services to enterprises.  They thought that hosting NFV or 5G on it was just a mission for an infrastructure they saw as inevitable and already justified.  Now, obviously, there is no realistic chance for operators to compete with public cloud providers.  There may not be a realistic mission to host NFV or 5G in the cloud at all; white boxes might be the answer.  Should operators even be thinking about carrier cloud as their own resource pool, or is “carrier cloud” the set of things they outsource to the public cloud providers they used to be thinking of competing with?

Almost all operators I’ve chatted with believe they cannot deploy “carrier cloud” to address any near-term service or technology mission.  That would generate an unacceptable first cost to achieve coverage of the service area and reasonable economy of scale.  They think they have to start in the public cloud, which of course makes public cloud providers happy.  But the big question the literati are asking is what is it that we host there?

Cloud providers want to provide a 5G solution more than a 5G platform in the cloud.  Microsoft is a good example; they’ve acquired both Affirmed and Metaswitch to be able to sell 5G control-plane services, not just a place to put them.  The smaller operators are increasingly OK with that approach, but the larger operators are looking harder at the risk of a major lock-in problem.  Better, they think, to create a 5G platform and 5G hosted feature set, and then have the public cloud providers host it with minimal specialization of the implementation to any given cloud.  That way, the operators can use multiple providers, switch providers, or pull everything or part of everything off the public cloud and pull it back to self-hosting in the original carrier cloud sense.  There will be an exploration of the business case for these competing approaches in 2021.

There’s also going to be an exploration of just what we’re heading for with all these control planes.  While it’s true that the OSI concept of protocol layering means that every layer’s service is seen as the “user plane” or “data plane” to the layer above, modern trends (including 4G and 5G) illustrate that in many cases, higher-layer control functions are actually influencing lower-level behavior.  Mobility is the perfect example.  If that’s the case, should we be thinking of a “control plane” as a collection of service coordinating features that collectively influence forwarding?  Would it look something like “edge computing”, where some control-plane features would be hosted proximate to the data plane interfaces and others deeper?  The future of services might depend on how this issue is resolved.

The unified control plane may be the critical element in a strategy that unifies white boxes and hosted features and functions.  If there’s a kind of floating function repository that migrates around through hosting options ranging from on-device to in-cloud, then you really have defined a cloud with a broader scope, one that is less likely to be outsourced to cloud providers and one that opens the door to new revenues via the network-as-a-service story I’ve blogged about.  About a quarter of operators are now at least aware of the NaaS potential, but nobody had it on their agenda for this year’s cycle.

The final issue that’s come along is service lifecycle automation.  This has the distinction of being the longest-running topic of technology cycles for the decade, illustrating its importance to operators and their perception that not much progress is being made.  Operators say that a big part of this problem is the multiplicity of operations groups within their organization, something that carrier cloud could actually increase.

Today, operators have OSS/BSS under the CIO, and NMS under the COO.  In theory, systems of either type could be adapted to support operations overall, but while both the TMF and some vendors on the NMS side have encouraged a unified view in some way, nobody has followed through.  The thought that NFV, which created the “cloud operations” requirement, could end up subducting both came up early on, but that never happened either.  The missteps along the way to nowhere account for most of the lost time on the topic.

Today, open source is ironically a bigger problem.  AT&T’s surrendering of its own operations initiative to the Linux Foundation (where it became ONAP) made things worse, because ONAP doesn’t have any of the required accommodations for event-driven, full-scope lifecycle automation.  There are very few operators who’ll admit that (four out of my group), and even those operators don’t know what to do to fix the problem.  Thus, we can expect to see this issue on the tech planning calendar in 2021 unless a vendor steps in and does the right thing.

A Statistical View of US Market IT Spending Trends, with a Projection

What is going to happen with information technology?  There are a lot of people who wonder about that, who have a significant stake in the answer to that question.  There are a lot of answers, too, most of which are simply guesses.  Well, there’s no way to make a forecast without an element of judgment, but as I said in a prior blog, it is possible to get solid government statistics on a lot of things, including IT.  If you use those statistics you can get a picture of the past and present evolutionary trends, and it’s those trends that create the future.

I’ve been working with US market data from government sources for decades, and recently I’ve been pulling together information on the broad set of changes we’ve seen in those decades, with the goal of determining just where overall IT spending and technology change might be heading.  The credibility of this stuff has to be established with past history, so I’ve started back in the early days of IT and taken things forward to today.  From today, I’ve tried to model objectively the decisive shifts that past history has offered, and from those to paint a picture of the future.

All of the charts in this blog are based on source data on private fixed asset investment by type, selecting the categories for computers and peripherals, network/communications equipment, and software.  Charts showing spending represent billions of US dollars.  Those showing rate of change of spending show the ratio of increase or decrease, so a value of 0.1 means a ten percent increase over the prior year.
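
To be explicit about how those rate-of-change values are derived (the spending figures below are placeholders, not the source data):

```python
# How the rate-of-change values are computed; the spending figures are placeholders.
spending = {2016: 100.0, 2017: 110.0, 2018: 121.0, 2019: 118.0}  # $B, illustrative

years = sorted(spending)
for prev, curr in zip(years, years[1:]):
    growth = spending[curr] / spending[prev] - 1.0
    print(f"{curr}: {growth:+.2f}")   # +0.10 means a 10% increase over the prior year
```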

The Cycles of IT Spending and Productivity

The chart above, which I’ve included in past blogs, shows the broad trends in IT spending in business IT since the dawn of commercial computing.  The three cyclical trends in the chart represent periods when a new computing paradigm enhanced business productivity significantly enough to justify boosting spending to accelerate the paradigm’s adoption.  This chart, showing the rate of change of IT spending, makes the cyclical trends clearer, and it’s these three cyclical shifts that have combined to drive our current levels of spending.  The primary technology changes representing each cycle are noted, and I’ll repeat them in other charts below for reference.

The problem we face in IT is also illustrated in the chart.  Since the last cycle crashed with the dot-com crash at the end of the 1990s, we’ve not had another productivity cycle.  Why that would be true isn’t obvious from the cyclical chart, so we have to look deeper.

IT Spending by Type, Showing a Resource-versus-Software Constraint Point

The chart above is a busy one, but important.  It shows the rate of change in investment in hardware (computers and peripherals), network equipment, and software for the critical period starting in 1961, when large-scale commercial computing began.  If you squint and look at the chart, you'll notice that for the majority of the period, hardware and network investment growth tend to lead software investment growth.  That reflects a period when hard resources were the limiting factor in IT empowerment.  Starting after the 1999-2000 dot-com crash, though, we see a shift, one where software spending growth leads growth in both hardware and network spending.

Let’s noodle the meaning of the timing shift the chart illustrates.  In the period starting around 1977 we saw minicomputer technology explode, increasing hardware investment and lowering the unit computing cost to the point where true online transaction processing was cost-efficient as well as productivity-enhancing.  Software spending increased shortly thereafter as applications to exploit the new paradigm developed.  In 1994, we saw a boost in hardware spending launched by the personal computer’s adoption.  Software applications for the PC exploded in 1997 and beyond, and online Internet applications did the same in 1999-2000, increasing network equipment spending growth.

The dot-com crash caused all IT spending to drop, but it also represented a major shift in thinking to that software-defined era we’re now in.  At around 2000, advances in chip technology meant that we had the technology to create more computing power than we had productivity benefits to consume it.  There was a sudden stabilization in hardware spending growth, with little change in spending year over year.  This coincides with my own survey work’s identification of a sudden change in the source of IT budgets.  Prior to 2000, IT budgets tended to come from a mixture of modernization and project sources, with the latter opening new productivity-linked applications.  Beyond 2000, there was never again a significant contribution of new productivity-linked applications.

There was still software advance, as the chart shows, but this advance related to the fundamental platform shifts we saw—first virtualization and then cloud computing.  Software was building a new framework for hardware to run within.  That framework had the initial effect of dampening hardware spending because the new compute paradigm was more resource-efficient.  At the very end of the period, 2016-2018, we saw a big jump in hardware spending growth rates, coinciding with the massive build-out of hyperscaler public cloud technology.  The dip in 2019 isn’t a dip in spending, but in spending growth; public cloud built out its initial deployment and was now in a process of orderly growth.

The network spending curve also contributes insights here.  The shift in network infrastructure during the 1970s coincided with the rise of OLTP, and the growth period in the 1990s coincided with the shift from TDM and POTS to IP infrastructure.  Thereafter, network investment has been roughly aligned with computer and peripheral investment, which suggests that hardware and network are now slaved to software innovation.  Absent a new productivity paradigm, the result has been a focus on using software to reduce hardware and network investment.  We've seen the result of that for years now.

Software is what’s going to change, and the chart above illustrates that software is on approximately a shorter four-to-five-year modernization cycle.  That means that the next opportunity for a productivity cycle is likely to come at the point where the next software cycle could begin, which would be approximately 2022-2023.

CIMI’s Modeling of the “Next Wave”

The chart above shows my modeling forecast for the rate of change in IT investment by businesses, in each of the three categories, for the period beyond 2019 up to the limit of current data reliability.  The projection shows a significant uptick in software spending, the peak of which is in 2023, and it shows network and IT spending rising in synchrony.  This is because a real productivity-driven cycle requires collateral investment in all the elements to deliver empowerment to workers.

In fact, the model shows both IT and network spending increasing faster at the end of my forecast period, showing that the impact of the upcoming cycle is still advancing.  If the cycle starts in 2023, then it would be 2027 before it peaks.  By 2030 we’d have wound down the cycle, meaning that IT spending would then resume its normal modernization-and-replacement rate.

The question of what might drive the cycle, and even the question of whether anything will drive it, is something that statistics and projections can't answer.  The best that I can do is say when the ingredients are right, but if you look at the earlier figures, you see that we've had software peaks in 2007, 2012, and 2018, and none of them kicked off a productivity cycle.  There is no objective reason for that, nor is there an objective basis for assuming that 2023 will kick one off, so we need to look at what's behind the curves.

The past cycles were launched because of pent-up productivity benefits awaiting resources for realization.  We were applying IT to traditional business problems in evolutionary ways, so a cycle was in a sense a resource-driven paradigm.  Minis, then PCs, reduced unit cost of empowerment.  The problem now isn’t resources, it’s that we’re awaiting a model, an architecture, that links resources to productivity in a new way.  A new paradigm, in short.  My presumption (or prediction, or guess, or whatever term you’re comfortable with) is that the new paradigm is a continuation of the past trend toward successive cycles that bring computing closer to the worker.  As I’ve said in the past, this would make the next step one of point-of-activity empowerment.

Behind this new paradigm is a lot of technology.  What you’re trying to do is to construct a virtual world in parallel with the real one the worker, buyer, or seller inhabits.  This virtual world can be understood by and manipulated through technology, and it’s anchored to the real world through…you guessed it…the Internet of Things.  IoT provides us with the ability to sense the real from the virtual, to locate us and everyone we’re interacting with in both worlds, and to relate positions and relative movements to goals and tactics.  Augmented reality provides a window that lets us see the real world through or with the virtual-world overlay.

There are obviously a lot of moving parts to this, but the software needed could be generated easily with today’s tools, particularly cloud and possibly edge computing.  The challenge for today is the IoT framework.  We need sensors, we need the ability to analyze video to recognize things, and we need a business model that deploys the stuff.

Raw IT Spending by Type ($billions US)

The final chart above shows US market IT spending ($billions US) by major category, rather than rate of change.  It shows (if you look carefully) that we've already started to see a climb in computer/peripheral spending from the low in 2016, owing to cloud computing.  Network spending is also increasing, though more slowly.  My model says that both network and computer spending will accelerate starting in 2023 if the new paradigm arrives.

It’s less likely to arrive, though, if we keep groping pieces of it rather than addressing the whole picture.  The new paradigm could validate 5G investment.  It could create massive IoT spending, edge computing could become real, and AI and ML could explode.  Our mistake is postulating the gains of these technologies in the absence of a driver.  The value of computing is created by the combination of hardware, network, and software.  We have to get the technologies in all these areas singing the same tune, or we’re going to see stagnation.  Who will step forward to make this happen?  There is no question that someone will, but the “when” and “who” are still up in the air.

Hardware Abstraction, Software Portability, and Cloud Efficiency

One of the factors that limits software portability is custom hardware.  While most servers are based on standard CPU chips, the move toward GPU and FPGA acceleration in servers, and toward custom silicon in various forms for white-box switching, means that custom chip diversity is already limiting software portability.  The solution, at least in some eyes, is an intermediary API that standardizes the interface between software and specialized silicon.  There are two popular examples today, the P4 flow programming standard and Intel's new oneAPI.

The problem of hardware specialization has been around a long time, and in fact if you’re a PC user, you can bet that you are already benefitting from the notion of a standard, intermediary, API.  Graphics chips used in display adapters have very different technologies, and if these differences percolated up to the level of gaming and video playback, you could almost bet that there’d be a lot less variety in any application space that involved video.

In the PC world, we name this intermediation process after the piece of technology that creates it: "drivers".  There are PC drivers for just about every possible kind of device, from disk storage to multimedia and audio.  These share a common general approach to the problem of "intermediarization", which is to adapt a hardware interface to a "standard" API that software can then reference.  That's the same approach that both P4 and oneAPI take.
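
For readers who prefer to see the pattern rather than read about it, here's a minimal Python sketch of that driver idea: a "standard" API that application code references, with per-vendor adapters doing the hardware-specific translation underneath.  The class and method names are purely illustrative, not any real driver framework.

```python
# Conceptual sketch of the "driver" pattern: a standard API for applications,
# with vendor-specific adapters underneath.  All names here are hypothetical.

from abc import ABC, abstractmethod

class DisplayDriver(ABC):
    """The 'standard' API that application software references."""
    @abstractmethod
    def draw_frame(self, pixels: bytes) -> None:
        ...

class VendorAChip(DisplayDriver):
    def draw_frame(self, pixels: bytes) -> None:
        # Translate the standard call into Vendor A's hardware-specific commands
        print(f"VendorA: blitting {len(pixels)} bytes via command queue")

class VendorBChip(DisplayDriver):
    def draw_frame(self, pixels: bytes) -> None:
        # Same standard call, entirely different hardware path underneath
        print(f"VendorB: DMA transfer of {len(pixels)} bytes")

def render(driver: DisplayDriver) -> None:
    driver.draw_frame(b"\x00" * 1024)   # application code never changes

render(VendorAChip())
render(VendorBChip())
```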

The upside of this is obvious; without intermediary adaptation, software would have to be released in a bewildering number of versions to accommodate differences in configuration, which would likely end software portability as we know it.  Intermediary adaptation also encourages open-model networking by making lock-in more difficult and making an open-source version of something as readily usable as a proprietary product with a zillion marketing dollars behind it.

There’s a downside, too; several in fact.  One is efficiency.  Trying to stuff many different approaches into a single API is a bit like trying to define a single edge device that supports everything, the often-derided “god-box” of the past.  Jack of all trades, master of none, says the old saw, and it’s true often enough to be an issue.  Another is innovation; it’s easy for an intermediary API framework to define a limited vision of functionality that can’t then be expanded without losing the compatibility that the API was intended to create.  A third is competing standards, where multiple vendors with different views of how the intermediation should evolve will present different “standards”, diluting early efforts to promote portability.  We still have multiple graphic chip standards, like OpenGL and DirectX.

P4, the first of the two intermediation specifications I'll look at here, is a poster child for a lot of the positives and negatives.  P4 is a flow-programming language, meaning that it defines not only an intermediating layer between chips and software, but also a language in which to express chip-level forwarding behavior.  Since both "routing" (Level 3) and "switching" (Level 2) packet handling are protocol-specific forwarding techniques, P4 can make it possible to define forwarding rules in a way that can be adapted to all forwarding, and (potentially) to all chips, or even no chips at all.

The name is alliterative; it stands for "Programming Protocol-Independent Packet Processors", the title of the paper that first described the concept in 2014.  The first commercial P4 implementation was arguably from Barefoot Networks, since acquired by Intel, and Intel is arguably the major commercial force behind P4 today.  However, it's an open specification that any chip vendor could adopt and any developer could work with.

A P4 driver converts the P4 language to chip-specific commands, in pretty much the same way that a graphics driver converts OpenGL commands.  For those in the software space, you can recognize the similarity between P4 and something like Java or (way back) Pascal.  In effect, P4 creates a "flow virtual machine" and the language to program it.
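
To illustrate the "flow virtual machine" notion, here's a conceptual sketch written in Python rather than actual P4: forwarding expressed as match-action rules that a driver would compile down to whatever chip (or software switch) sits underneath.  The table fields and rules are illustrative only.

```python
# Conceptual match-action forwarding sketch (Python, not actual P4).
# Field names and rules are illustrative placeholders.

import ipaddress

table = [
    # (match fields, action, action parameters)
    ({"eth.type": 0x0800, "ipv4.dst": "10.1.0.0/16"}, "forward", {"port": 3}),
    ({"eth.type": 0x0800, "ipv4.dst": "0.0.0.0/0"},   "forward", {"port": 1}),
]

def matches(rule_fields, packet):
    for field, value in rule_fields.items():
        if field.endswith(".dst") and "/" in str(value):
            # Prefix match for destination addresses
            if ipaddress.ip_address(packet[field]) not in ipaddress.ip_network(value):
                return False
        elif packet.get(field) != value:
            return False
    return True

def process(packet):
    for fields, action, params in table:   # first matching rule wins; ordering stands in for priority
        if matches(fields, packet):
            return action, params
    return "drop", {}

print(process({"eth.type": 0x0800, "ipv4.dst": "10.1.42.7"}))   # ('forward', {'port': 3})
```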

The ONF has embraced P4 as part of its Stratum model of open and SDN networking.  Many vendors have also embraced P4, but the arguable market leader in the flow-chip space, Broadcom, started its own P4-like concept with its Network Programming Language, or NPL.  There’s an open consortium behind NPL just as there is with P4.

P4 and NPL aren’t compatible, but that may not be the big question in the flow-switch space.  A flow switch is part of a network, meaning that there’s a network architecture that aligns flow-switch behavior collectively to suit a service.  A good example is SDN versus adaptive IP routing.  You could build a P4 or NPL application for either of these, and the result would be portable across switches with the appropriate language support.  However, an SDN flow switch isn’t workable in an adaptive IP network; the behaviors at the network level don’t align.  It’s like metric versus English wrenches and nuts.

Intel’s oneAPI, like P4, emerges as a response to the need to support innovation in optimum hardware design while preserving software compatibility.  The specific problem here is the specialized processors like GPUs that have growing application in areas like AI and computation-intensive image processing and other missions.  As already noted, different graphics chips have different interfaces, which means that software designed for one won’t work on another.

This problem is particularly acute in cloud computing, because a resource pool that consists in part of specialized processors is likely to evolve rather than being fork-lifted in.  There may be multiple processors involved, some that have emerged as more powerful successors and others that have been empowered by some new specialized mission.  The result is a mixture of processors, which means that the resource pool is fragmented and getting applications to the hosts that have the right chip combination is more difficult.

The oneAPI framework supports GPUs, CPUs, FPGAs, and in theory, any accelerator/processor technology.  Intel calls this their XPU vision, and it includes both a library/API set designed to allow XPU programming in any language, and a new language for parallel processing, Data Parallel C++ or DPC++.  Like P4, oneAPI is expected to gain support from a variety of XPU vendors, but just as Broadcom decided to ride its own horse in the P4 space, AMD, NVIDIA, and others may do the same with oneAPI.  Intel has some university support for creating the abstraction layer needed for other XPUs, though, and it seems likely that there will be ports of oneAPI for the other vendors’ chips.  It’s not yet clear whether any of these other vendors will try to promote their own approach, though.

The presumption of the oneAPI model is that there is a “host” general-purpose computer chip and a series of parallel XPUs that work to augment the host’s capabilities.  The term used for these related chips in oneAPI documentation is devices.  An XPU server thus has both traditional server capability and XPU/oneAPI capability.  The host chip is expected to act as a kind of window on the wider world, organizing the “devices” in its support and blending them into the global concept of a cloud application.
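
Here's a conceptual sketch, in Python rather than DPC++, of that host-plus-devices arrangement: the host enumerates heterogeneous "devices" and hands work to whichever one advertises the needed capability.  The classes and capability names are my own illustrations, not part of the oneAPI specification.

```python
# Conceptual host-plus-devices sketch (Python, not DPC++/oneAPI itself).
# Class and capability names are hypothetical illustrations.

class Device:
    def __init__(self, kind, capabilities):
        self.kind = kind                      # "cpu", "gpu", "fpga", ...
        self.capabilities = set(capabilities)

    def run(self, kernel_name, data):
        print(f"{self.kind}: executing {kernel_name} on {len(data)} items")
        return [x * 2 for x in data]          # stand-in for the real computation

class Host:
    """The general-purpose chip that organizes its devices into one application view."""
    def __init__(self, devices):
        self.devices = devices

    def submit(self, kernel_name, required_capability, data):
        for device in self.devices:
            if required_capability in device.capabilities:
                return device.run(kernel_name, data)
        # Fall back to the first listed device (the CPU in this example)
        return self.devices[0].run(kernel_name, data)

host = Host([Device("cpu", {"general"}), Device("gpu", {"general", "parallel-float"})])
host.submit("vector_scale", "parallel-float", list(range(8)))   # lands on the GPU
```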

I’m a supporter of both these concepts, not only for what they are but for what they might lead to.  For example, could we see Intel integrate P4 into oneAPI, creating a model for a white-box switch that includes both its “main processor” and its switching fabric?  Or, best of all, could we be heading for a vision of a “cloud hosting unit” or CHU?  This would be the abstraction of a true cloud platform-as-a-service, offering perhaps its own “Cloud Parallel C++” language and its own APIs?

Cloud-native development depends on a lot of tools, most of which have multiple open-source implementations.  Public cloud providers have their own slant on almost everything cloud-native too, and all this variability creates a challenge for development, forcing almost all cloud-native work to fit to a specific hardware/software model that locks the user in.  Since lock-in is the number one fear of cloud users, that’s a major problem.

I’ve said for years that we should be thinking of the cloud as a unified virtual computer, but the challenge with that step is that there’s no universal language or library to program that unified virtual computer with.  Could someone develop that, and so set the cloud on its own pedestal at the top of all these parallel/virtualized-hardware abstractions?  Could that be the dawn of the real age of cloud computing?  I think it could be, and I think that we’re seeing the early steps toward that future in things like P4 and oneAPI.

Why Polling and Surveys Often Fail

Probably the only election topic most agree on is polling.  For two presidential elections in a row, we’ve had major failures of polls to predict outcomes, so it’s no surprise that people are disenchanted with the process.  What’s more surprising is that many people don’t realize that all surveys are as inherently flawed as polling is.  If getting a view of the “future” is important to you, it should be important for you to understand why it usually doesn’t work out well.

I started doing surveys back in 1982; a big survey was in fact the first source of revenue for CIMI Corporation.  I had to identify 300 large users of digital data services (TDM DDS and T1, in those days) and get information on their use and issues.  The project worked fine, and because it was so difficult to find people to survey, I kept in touch with the survey base to keep them engaged for future projects.  To do that, I posed questions to them twice a year on various network topics, and by 1997 I had a nice baseline of predictions on future network decisions.

Nice, but increasingly wrong.  In fact, the data showed that by 1997 there was likely no statistical correlation between what my surveyed users said they were planning to do with their networks and equipment and what they actually did.  Obviously, that was a major disappointment for me, but it did get me involved in modeling buyer decisions based on more objective metrics rather than asking them to guess.

That’s the first key truth to surveys.  You can ask somebody something that’s objective and observable that’s within their own scope of responsibility, but don’t ask them for planned future actions or about things outside their own areas.  Few people or organizations have any skill at figuring out what they’ll do much more than a year out, and most have their head down, focusing on their own jobs.

The next key truth emerges from another set of experiences.  First, I was asked by Bellcore to look over a survey they'd run as a trial, because there were some extremely troubling responses.  I did, and what I said was that the survey presumed that the person being surveyed was a specialist in the technology being asked about, but never qualified them.  I suggested that, instead of asking whether their company "used ATM", they ask what speed the ATM connection was running at.  When they did, the great majority said they were using "9600 bps ATM", obviously confusing asynchronous modem operation with Asynchronous Transfer Mode, which was what Bellcore actually wanted to know about.

Second, a major network publication approached me about a survey they did, and as in my first example they were concerned by some of the results.  I looked things over, and sure enough, 33 percent of the survey respondents reported they were using gigabit Ethernet.  At the time there were no gigabit Ethernet products offered; the standard wasn't even fully defined.  What happened here?  What I found in my own survey experience is that people want to sound smart and involved when they're surveyed.  Ask them about a new technology and about a third will give you the answer they think makes them look good, even if the situation they describe is impossible.

The third key truth is what I’ll call the “diamond in the rough” paradox.  A big organization has tens of thousands of employees, and in the IT part of the company alone there are hundreds of people.  What do you suppose the chances are that all of these people know everything a company is doing or planning?  Zero, but what percentage of surveys actually effectively target the questions to people who are likely to know the answer?  Zero.

A big network equipment vendor had a well-known analyst firm do a survey and report for them, one that cost a boatload of dollars.  As in other cases, they were troubled by what they saw in the results and asked me to audit the survey and results against my own data and suggest what might have gone wrong.  The problem was that the survey did in fact ask the right companies, but made no attempt to engage the right people.  As a result, fully three-quarters of those who responded had nothing to do with the technology at issue, and their answers to the questions were totally irrelevant.

Sometimes the survey people even know that.  While I was auditing a survey of LAN use in the early days, I happened to hear one of the survey people talking to a target.  “Do you use Ethernet or Token Ring?” they asked, and apparently the party they had called had no idea.  The surveyperson helpfully said, “Well, feel around the back of your computer.  Do you feel a fat cord?”  Apparently the response was affirmative, and so the user was put down as an Ethernet user (that’s what Ethernet was delivered on at the time).  In fact, the cord was likely the power cord, so I asked the person doing the survey, and they said that almost nobody had the answer, so they were given that follow-up to use.  Guess how good those results were!

Then there’s the “bulls**t bidding war” problem.  People like getting their names in print.  A reporter calls an analyst firm and says “What do you think the 2021 market for 5G is?  XYZ Corp says it’s a billion dollars.”  The analyst firm being called knows that the reporter wouldn’t be making the call if they weren’t interested in a bigger number, so they say “We have it as two-point-two billion.”  This goes on until nobody will raise the bid, and so the highest estimate gets into the article.  The same problem happens with reporter calls with users; “When do you expect to be 100% cloud?” or “What percentage of your computing is going to the cloud?” is going to generate shorter timelines and higher percentages.

I know there are a lot of people who believe in surveys, and I'm among those who do, as long as the survey is designed and conducted properly.  I only survey people I know well enough to know their skills and roles, and with whom I've had a relationship for some time.  I try to use a stable base of people, and if somebody leaves, I get them to introduce me to their replacement.  But even with that, most of the questions that get answered in survey articles and reports are questions I'd never ask, because even the best survey connections couldn't get a good answer to them.  Most businesses respond to market trends that are largely unpredictable and most often tactical, so why ask them what they plan to be doing in three or five years?

This is why I’m a fan of using data from the past to forecast the future.  If you’re careful in what data you collect and how it’s used, the past measurements and the trends they expose are more authoritative than anything you could get by asking somebody for their plans.  They can also tell you how the mass of companies in the area you’re interested have responded to changes, and that’s a better measure of the future than asking people to guess.

The data on IT spending growth cycles I blogged about HERE is an example.  I found that there was a cyclical pattern to the growth in IT spending, both year-over-year and 5-year-smoothed, and both on its own and compared with GDP over the same period.  I found that there were notable new technology advances in the early part of each cycle, and my own surveys showed that in those cycles, companies reported that a larger portion of their IT spending came from new projects rather than budgeted modernization.  All that suggests a productivity driver for each wave.  It was particularly significant that after the last wave, there was never a time when project spending (justified by new benefits) contributed significantly to IT spending overall.
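
For the curious, the smoothing involved is nothing exotic.  A sketch like the one below, with placeholder growth numbers, shows the two views: raw year-over-year change and a trailing five-year average of the same series.

```python
# Sketch of the two views mentioned above: raw year-over-year growth and a
# trailing five-year average of the same series.  Values are placeholders.

growth = [0.08, 0.12, 0.03, -0.02, 0.05, 0.09, 0.11, 0.04]   # YoY rate of change

def smoothed(series, window=5):
    """Trailing moving average: each point averages the window ending at that year."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

print(smoothed(growth))
```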

You can get government data on an amazing number of things.  Want to know what industries are most likely to use cloud computing, adopt IoT, rely on mobile workers and WFH?  Massage the government data and you can find out.  Where might empowerment of workers focus, either in industry or geography?  It’s there.  Where will high-speed broadband be profitable, and so available?  That’s there too.  Best of all, careful use of the data generates results that tend to hold up.  My demand density work and IT cycles work started fifteen years ago, and predictions made from the work in the early days have come true most of the time.

The final issue with surveys and polls is that they tend to show what whoever is paying for them wants to see.  I used to get RFPs from some of the big “market report” firms, and they’d be in a form like “Develop a report validating the two billion dollar per year market for ‘x’”, which is hardly a request for an unbiased solution.  You could try to ask if surveys are really unbiased, if all those who respond are fully qualified.  Good luck with that.  A better way might be to get a quote on a report of your own.  If your company can go to a firm and request research to back up a position, you can be sure others can do the same, and you should therefore distrust the information that comes from these sources.  I stopped selling syndicated research because people didn’t want to pay for the truth.

No survey is, or can be, perfect.  No model will always get things right.  There is no absolutely reliable way to know what a given company or person or group thereof will do, even months in the future.  There are just ways that are worse than others, and the best you can do is to try to latch on to one of the less-worse methods, and if it’s not good enough, opt for flexibility in your planning.  So, the moral here is don’t tell me what your survey showed, or someone else’s survey showed, or what you project to be the truth, because I don’t believe it.  So, of course, you’re free not to believe me either.

Integration Woes and Complexity in 5G and Beyond

If ever there was a clear picture of the open-model 5G challenge, THIS Light Reading piece may provide it.  The chart the article offers shows the things that Dish had to integrate in order to build an open-model 5G network.  It doesn't look like a lot of fun, and it's generated a fair amount of industry skepticism.  The old saw, "A camel is a horse designed by a committee" surely comes to mind, and that raises the question of how to avoid camels in an open-model network future.

It’s not just 5G we have to worry about.  The issues that created a table of elements for 5G has done similar things in other areas.  We need to understand why the problem is so pervasive, and what might be done about it.

Networking is perhaps the only technology space that can define ten products based entirely on open standards, and still find that there are almost no interfaces in common.  This sad truth arises from the fact that network transformation projects tend to be very narrow in scope, which means that they have to fit in with a bunch of similar projects to add up to a complete definition of a network.  Why is that, and what (if anything) can be done about it?

I recall a meeting of the NFV ISG in the spring of 2013, where the discussion of the scope of the project included the statement that it was important to limit the scope to the basic question of how you deployed and managed virtual functions as software elements, leaving the question of broader management of the network out of scope.   This was done, it was said, to ensure that the task would be completed within 2 years.  Obviously, it wasn’t completed in that time, and obviously the question of how a virtualized set of functions was managed overall was important.  We didn’t get what we needed, and we didn’t meet the timing goals either.

The decision to make the functional management of a virtual-function network out of scope wasn’t actually a stupid one, if viewed in the proper context.  The difficulty was, and still is, that these kinds of decisions are rarely put in any real context at all.  Without the context, there’s plenty of room for a non-stupid concept to be a bad one nevertheless.

We were, at that moment in 2013, managing real devices, meaning what the NFV ISG called “physical network functions” or PNFs.  The presumption that was inherent in the views of that meeting was that it was critical to preserve, intact, the relationship between the management system and the PNFs when the PNFs were virtualized.  Since the process of deploying and connecting virtual functions would clearly not have been part of the management of PNFs, that was what the NFV ISG ruled as its mission.  However, that decision created a second layer of management, adding to the number of things that had to be present and integrated for NFV to work.

Networks are collections of stuff (the name, in fact, implies that).  Each element of that stuff is subject to review in terms of whether it provides optimum cost/benefit for the network operator.  This means that there's a strong tendency to look at stuff-change piecemeal, and to contain the spread of impact to protect the investment in technology areas where there's no particular need for a change.  That's the likely foundation of the thinking that led to NFV as it is, and the same pattern of thinking prevails in all areas of network transformation today.  That means that even things like 5G and Open RAN are, to a degree, victims of PNF-think or scope containment.

Does that mean that we’re so locked into the past that we can’t transform at all?  Is transformation possible if we do nothing but reshape the nature of the PNFs, the boxes, without changing anything else?  Those are really subsets of the big question, which is whether we can transform networks by organizing different forms of the same thing, in the same ways as before.  Or, do we have to rethink how networks can work differently within, to open new possibilities and new opportunities?  We need to create an agile model of the new network, somehow, so that scope gaps aren’t inevitable, so that we can fit things that belong together into a common model.

SDN was inherently more transformational than NFV.  Why?  Because SDN proposed that an IP network that looked like the IP networks of old at the point of external interface, but that worked differently and (in some ways) more efficiently within, was the best answer.  The black-box theory, in action.  What’s inside the box doesn’t matter, only its external properties.

Despite SDN’s natural transformation potential, though, it hasn’t transformed the WAN.  Why that is, I think, is that SDN is illustrating the problem of true transformation in networking, which is that spread of impact.  I build a network that’s not based on native IP within itself, but I have to interface with stuff that thinks it is native IP.  Thus, I have to build native IP behavior at the edge, which is a new task.  I can’t manage the SDN flow switches and controllers using IP device management tools because they aren’t IP devices, so I have to emulate management interfaces somehow.  More scope, more pieces needed to create a complete network, more integration…you get the picture.

Scope issues have left us with an awful choice.  We can either do something scope-contained, and produce something that doesn't change enough things to be transformational, or we can do something scope-expanding, and never get around to doing all the new things we now need.

There’s been a solution to this scope problem for a long time, illustrated by Google’s Andromeda and B4 backbone.  You wrap SDN in a BGP black box, treat it as perhaps a super-BGP-router with invisible internal SDN management, and you’re now an IP core.  That’s a great strategy for a revolutionary company with ample technical skills and money, but not perhaps as great for network operators with constrained skill sets and enormous sunk costs.

This is where I think the ONF is heading with its own SDN and SDN-related initiatives.  We actually have tools and software that can create a thin veneer of IP over what’s little more than a forwarding network, something like SDN.  We can compose interfaces today, with the right stuff, just as Google did.  The ONF doesn’t include the capability now, but might they?

DriveNets, known mostly for the fact that their software (running on white boxes) is the core of AT&T’s network, is perhaps the dawn of another approach.  Rather than applying SDN principles inside a network-wide black box, why not apply at least some of (and perhaps a growing number of) those principles to building a kind of composite device, a cluster of white boxes that looks and works like a unified device?

SDN separates the control and data planes, and it’s this separation that lets it work with composed interfaces, because it’s control-plane packets that make an IP interface unique, that give it the features not just of a network but of an IP network, a router network.  DriveNets does the same thing, separating the cluster control plane.  It makes a bunch of white boxes in a cluster look like a router, so we could say that the SDN model is going from the outside or top, downward/inward, and the DriveNets model is moving from inside to…where?

The reason this discussion is important to our opening topic, which is the 5G challenge, is that 5G standards, Open RAN, and other related efforts are making the same sort of mistake SDN and NFV made, which is creating a scope boundary that they don't resolve.  We have a 5G "control plane" and "user plane", and the 5G user plane is the whole of an IP network.  We have interfaces like N4 that go to a User Plane Function (UPF), which ultimately depends on IP transport, but the UPF isn't a standard IP network element or feature.  5G's control plane is an overlay on IP, almost like a new protocol layer, because that's pretty much what it's trying to be.  Mobility management works by creating tunnels that can follow a user from cell to cell, to get around IP's address-to-location, who-it-is-where-it-is dilemma.

Why not get around that by composing the capability via a separate control plane?  Today it’s hard to introduce things like segment routing into IP because you have to update all the devices.  If there is no knowledge of the control plane within the forwarding devices, if all control-plane functions lived in the cloud and controlled forwarding, couldn’t things like tunnels and segments be composed in instead of being layered on?

Separating the IP control and data planes is fundamental to SDN, and also to DriveNets' software.  It lets you build up a network from specialized devices for forwarding and other devices for hosting the control plane elements, which means that you can do a "cloud control plane", something that I think the ONF intends but hasn't yet done, and something DriveNets has definitely done.  But the separate control plane also lets you compose interface behaviors, as Google does with Andromeda and B4, and it would also be a doorway to creating network-as-a-service implementations of stuff like the 5G user plane elements.

If you can create a forwarding path from “A” to “B” at will by controlling the forwarding devices’ tables, do you need to build a tunnel the way both 4G/EPC and 5G UPF do?  If you can compose interfaces, couldn’t you compose the N4 interface (for example) and present it directly from your transformed IP network to the 5G control plane?  Might you not even be able to host the entire 5G control plane and user plane within the same cloud?
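
To make the "compose the path, skip the tunnel" idea concrete, here's a minimal sketch of a separated, centrally hosted control plane computing a route and pushing forwarding entries into simple forwarding devices.  The topology, device model, and programming interface are hypothetical illustrations, not any vendor's or standard's actual API.

```python
# Sketch of a separated control plane composing a forwarding path A -> B by
# programming simple forwarding devices.  Topology and APIs are hypothetical.

from collections import deque

class ForwardingDevice:
    def __init__(self, name):
        self.name, self.table = name, {}        # table: destination prefix -> next hop

    def install(self, prefix, next_hop):
        self.table[prefix] = next_hop
        print(f"{self.name}: {prefix} -> {next_hop}")

def shortest_path(links, src, dst):
    """Plain BFS over an adjacency map; returns the hop-by-hop path."""
    prev, queue, seen = {}, deque([src]), {src}
    while queue:
        node = queue.popleft()
        if node == dst:
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return list(reversed(path))
        for nbr in links[node]:
            if nbr not in seen:
                seen.add(nbr); prev[nbr] = node; queue.append(nbr)
    return None

# Controller-side "composition": program every device along the path from A to B.
links = {"A": ["X"], "X": ["A", "Y", "B"], "Y": ["X", "B"], "B": ["X", "Y"]}
devices = {name: ForwardingDevice(name) for name in links}
path = shortest_path(links, "A", "B")
for here, nxt in zip(path, path[1:]):
    devices[here].install("198.51.100.0/24", nxt)
```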

I think the answer to layer complexity, scope complexity, integration complexity, and management complexity in 5G and all transformed networks is composability of the control plane.  Yes, it raises the question of whether this sort of capability could promote proprietary implementations, but if we want to create open yet composable specifications, we can work on that problem explicitly.  That at least creates a plausible, if not certain, path to a solution.  I submit that just multiplying elements, layers, and pieces of a solution doesn't lead us to anywhere we're going to like.