How Two Initiatives Could Change the Face of Networking

Remember the AT&T open-source white box switch software?  The dNOS white-box operating system was announced just three months ago, and open-sourced a month later.  Now it has a competing venture from the ONF, called “Stratum”.  What are the two approaches doing, and what impact might they have on the SDN market?  Are they even that different?  We’ll take a look at all that here.

SDN has always been a rather fuzzy space.  Back at the beginning five years ago, I said there were really three different models of SDN: a “software-controls-legacy-router” model favored by vendors like Cisco, an overlay model supported first by Nicira (which was then acquired by VMware), and an explicit centrally controlled forwarding model defined by the Open Networking Foundation (ONF).  This is still true today, and some operators have been frustrated by the fact that the first two models have dominated the market, in no small part because (the operators feel) vendors haven’t supported the “best” and most cost-transforming model, the ONF approach.  White-box switches, simple generic hardware devices to replace proprietary switches and routers, were the response, and there were two problems with that, one practical/market and one technical.

The practical problem is the need to displace current router technology with a new central-controlled forwarding paradigm.  That kind of transformation makes operators really antsy, and the benefits of central control can’t be realized (even if you accept them) unless you pretty much fork-lift out the traditional technology and replace it.  That kind of massive sunk-cost risk makes operators even antsier.

The technical problem with white-box switches is that hardware alone isn’t going to create a software-defined network solution.  Switch software is needed, and that was what AT&T was saying in their dNOS announcement.  dNOS stands for “Disaggregated Network Operating System”, and AT&T thinks it will create a platform that, when combined with merchant silicon advances, ensures that network devices will be open and will fully exploit new capabilities.  It’s about “innovation” for AT&T, according to their paper, and the fact that it would break the proprietary lock that vendors have on large transport IP devices probably doesn’t hurt either.

dNOS devices (routers, in AT&T’s terminology) have a distinct separation between control and data planes, well-defined APIs between the two, and vertical APIs that link to central management.  It’s a three-layer structure where the top layer (Applications) provides implementation of routing protocols and management elements, the middle layer (Shared Infrastructure and Data) does the management of the database functions and chassis resources, and the bottom (Forwarding and Hardware Abstractions) provides models of the functionality of the hardware, which can then be mapped to the specific devices supported.  The model is based in part on the P4 language, which lets you describe data-plane behavior exactly, rather than influencing generic data planes in very limited ways.  We’ll come back to P4 below.

In theory, the dNOS model can be applied to devices at any level, but AT&T’s proposal is clearly aimed at the “core router” and MPLS handling.  The white paper doesn’t make a big thing of OpenFlow, for example, and instead focuses on what for all practical purposes seems to be a router instance hosted on a specialized but still commercial-silicon-based chassis rather than on a general-purpose computer.  Because the dNOS model focuses on traditional routing behavior, it’s easy for AT&T to integrate dNOS devices with existing networks.

The fact that AT&T doesn’t mention OpenFlow might be an indicator of why the ONF is now looking independently at something new.  “Stratum” is the result, a project that the ONF says “Delivers on the ‘Software-Defined’ vision of SDN.”  Stratum, like dNOS, is expecting that the data plane devices will become fully agile and programmable (in P4 in Stratum, as in dNOS) and that capability means that it’s not necessary, or even useful, to classify network devices based on their “protocol”.  Protocol behavior is a P4 program, and so you need to rethink the relationship between devices and device control to accommodate the additional flexibility.

Stratum is a kind of P4 abstraction layer that sits on top of the various silicon implementations and understands P4 commands and how to implement them.  This is going to require a plugin process, I presume, to match the Stratum layer to the specifics of the hardware.  Stratum also provides a management and operations framework in which the P4 interpreter runs, and it exposes interfaces northbound via these features to connect to higher-level network applications.  These interfaces are different from the old ONF model, not because changing them was the goal but because you need a new API model to manage a P4-based forwarding abstraction.
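To make the plugin idea concrete, here is a minimal Python sketch of how one abstraction layer might sit over different silicon.  Everything in it is an assumption made for illustration: the class names (ChipDriver, VendorADriver, VendorBDriver, AbstractionLayer), the method names, and the "native" command strings are all hypothetical, and none of it is taken from the actual Stratum code or from any vendor SDK.

```python
from abc import ABC, abstractmethod


class ChipDriver(ABC):
    """Hypothetical plugin interface: one subclass per silicon family."""

    @abstractmethod
    def install(self, table: str, match: str, action: str) -> str:
        """Translate a generic table entry into a chip-specific operation."""


class VendorADriver(ChipDriver):
    # Invented command format, standing in for one vendor's SDK call.
    def install(self, table, match, action):
        return f"A-sdk: tcam_write({table}, {match}, {action})"


class VendorBDriver(ChipDriver):
    # A second invented format, to show the same entry mapped differently.
    def install(self, table, match, action):
        return f"B-sdk: set_rule table={table} key={match} do={action}"


class AbstractionLayer:
    """Northbound side is identical whatever driver is plugged in below."""

    def __init__(self, driver: ChipDriver):
        self.driver = driver
        self.log = []  # record of native operations issued to the chip

    def write_entry(self, table, match, action):
        native = self.driver.install(table, match, action)
        self.log.append(native)
        return native


# The same northbound call, issued against two different chips.
box_a = AbstractionLayer(VendorADriver())
box_b = AbstractionLayer(VendorBDriver())
op_a = box_a.write_entry("ipv4", "10.0.0.0/8", "fwd:1")
op_b = box_b.write_entry("ipv4", "10.0.0.0/8", "fwd:1")
```

The point of the sketch is only that the network application above never sees which driver is in use; whether the real plugin boundary looks anything like this is a guess on my part.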

The ONF documents point out that while OpenFlow was designed to control forwarding behavior, it wasn’t created to define it.  You could write a P4 program to enable it to implement OpenFlow, you could write one to implement Ethernet switching or IP routing (including MPLS), and you could write one to implement any arbitrary set of forwarding behavior you liked, controlled by any combination of in-band adaptive exchanges (like IP) or centrally managed processes (like OpenFlow SDN).  Thus, in a sense, P4 opens a wider range of behaviors that include everything we have now both in the legacy IP/Ethernet and SDN worlds.  In fact, you could write an overlay SDN program or SD-WAN program in P4 and control either/both of those behaviors in a generic device.
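The difference between controlling forwarding and defining it can be shown with a toy model.  The Python below is not P4 and is not how a real switch chip works; it is a rough sketch in which a "program" declares the match-action tables (which field is matched, and how) while a control plane only fills in the entries.  The Table and Device classes, the field names, and the port numbers are all invented for the example.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network


@dataclass
class Table:
    """One match-action stage; the loaded program fixes key and match kind."""
    name: str
    key: str    # packet field this table matches on
    kind: str   # "exact" or "lpm" (longest-prefix match)
    entries: list = field(default_factory=list)  # filled by the control plane

    def add_entry(self, match, port):
        self.entries.append((match, port))

    def lookup(self, packet):
        value = packet.get(self.key)
        if value is None:
            return None
        if self.kind == "exact":
            for match, port in self.entries:
                if match == value:
                    return port
            return None
        # lpm: among matching prefixes, keep the one with the longest mask
        best = None
        for match, port in self.entries:
            net = ip_network(match)
            if ip_address(value) in net:
                if best is None or net.prefixlen > best[0]:
                    best = (net.prefixlen, port)
        return best[1] if best else None


class Device:
    """A generic box whose behavior depends entirely on the loaded pipeline."""

    def __init__(self, pipeline):
        self.tables = {t.name: t for t in pipeline}
        self.order = [t.name for t in pipeline]

    def forward(self, packet):
        for name in self.order:
            port = self.tables[name].lookup(packet)
            if port is not None:
                return port
        return "drop"


# Same Device class, two different "programs": an Ethernet switch...
switch = Device([Table("l2", key="dst_mac", kind="exact")])
switch.tables["l2"].add_entry("aa:bb:cc:00:00:01", 3)

# ...and an IP router using longest-prefix match.
router = Device([Table("ipv4", key="dst_ip", kind="lpm")])
router.tables["ipv4"].add_entry("10.0.0.0/8", 1)
router.tables["ipv4"].add_entry("10.1.0.0/16", 2)
```

In this picture, something like OpenFlow corresponds to the add_entry calls, while the P4-style step is the choice of tables handed to Device in the first place.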

There are some new APIs in Stratum, but software APIs are way different from protocols like OpenFlow or hardware interfaces.  It’s fairly easy to map between software APIs, and so it wouldn’t be rocket science to adapt many network software stacks to Stratum.  The ONF offers examples in this kind of mapping, using popular projects like ONAP and popular initiatives like 5G.  But as I said before, APIs aren’t the story, it’s really P4 that makes both these initiatives different.  The ONF just calls this out more directly.

The ONF says that this eliminates the notion of “Black-” or “White-Box” switches, because a generic switch is a P4 machine that’s as much of either the adaptive past or the explicit future as you like.  While P4 is a per-device language, it can be used to define “logical” pipelines that could correspond to tunnels or virtual routes, simply by coordinating the behavior of the P4 switching elements across multiple devices.  P4 is do-it-yourself forwarding, broadening the range of what a “router” could do to the very limits of silicon technology.  Stratum codifies the transformation that P4 creates.
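The idea of a logical pipeline built from per-device behavior can be sketched in a few lines of Python.  This is an illustration under my own assumptions, not anything from the ONF material: the topology, the device names (s1, s2, s3, h2), the port numbers, and the helper functions program_path and trace are all hypothetical.

```python
def program_path(path, links, dst):
    """Build per-device entries {device: {dst: out_port}} along one path.

    `links` maps (device, neighbor) to the output port on `device`.
    """
    entries = {}
    for here, nxt in zip(path, path[1:]):
        entries[here] = {dst: links[(here, nxt)]}
    return entries


def trace(entries, links, start, dst, max_hops=16):
    """Follow the installed entries hop by hop and return the nodes visited."""
    # Invert the link map so that (device, port) gives the neighbor reached.
    neighbor = {(dev, port): nxt for (dev, nxt), port in links.items()}
    hops, here = [start], start
    while here in entries and dst in entries[here] and len(hops) <= max_hops:
        here = neighbor[(here, entries[here][dst])]
        hops.append(here)
    return hops


# Invented three-switch topology ending at host h2.
links = {("s1", "s2"): 2, ("s2", "s3"): 1, ("s3", "h2"): 4}
entries = program_path(["s1", "s2", "s3", "h2"], links, dst="h2")
route = trace(entries, links, start="s1", dst="h2")
```

No single switch here knows about the route; each holds one entry, and the "tunnel" exists only because something coordinated those entries across the boxes.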

How does Stratum relate to dNOS?  You can get a part of the answer by looking at the Google partnership in Stratum development.  Google is providing the foundation code for Stratum, and it is used by Google internally to host its Espresso network operating system.  The ONF provides a set of diagrams that depict a variety of network OSs sitting on Stratum, some aimed at more SDN-ish missions and some at traditional switching/routing.  In truth, a P4 Stratum base means that the devices themselves are non-mission-specific, and the way that P4 defines data plane behavior (which is controlled by the NOS above) is where the mission comes in.

That makes dNOS a “subset” of Stratum, with Stratum forming that bottom abstraction layer of dNOS.  dNOS, in my view, is then one of the NOS options that Stratum could support.  The two projects, then, are more complementary than contradictory or competitive.  That might make them even more important, because it suggests a couple of very critical changes are coming in the network equipment space.  These could give a new decoding to the “RINO” acronym—“Router in Name Only”.

The first and most important of these changes is that the combination of P4 and a generic appliance architecture, something like that of the Facebook-spawned Open Compute stuff, would open the network equipment market to pretty much anyone.  Clearly, the network operators in both projects would like to be able to make equipment a commodity play.  Clearly, their participation is a signal they’re prepared to push that goal aggressively.  Clearly, they will eventually succeed.

The second point is that there will still have to be some over-arching framework for the deployment of P4-compatible boxes in a network.  One set of box-specific forwarding rules does not make packet delivery possible.  The box-level activity has to mediate across all the box interfaces, either coordinated by a set of policies fixed in the box through a P4 program, or coordinated centrally through procedures similarly defined by P4.  Since these same box-to-box procedures need to be coordinated with those of existing equipment during a transition, that gives current vendors an edge—if they play it smart.

Which brings up the third point, historic incumbent stupidity driven by a desire to maintain current revenue flows at all costs.  This P4-driven approach is going to eventually destroy traditional network equipment models.  If the current router vendors can accept that and offer respectable transformation strategies and software products, they will retain a position (yes, a diminished one) with their customers.  If they don’t do that, then they’ll spawn a whole legion of new competitors who will do the right thing, and they’ll lose all their positions.

Point four involves Google.  Remember that Google has already deployed a new-model SDN forwarding in its network, and adapted to router protocols like BGP at the boundary point.  Google is contributing code here, and perhaps also the mindset of creating a true SDN core with a thin router veneer.  That would allow operators to transform to pure SDN much faster.

The final point is that the network vendors who sell to those operators will now all need to find new positions, to offset their loss of traditional device revenues.  One logical place to do that is below, in the optical layer.  A second is in the building of generic boxes backed by a known vendor and carrier-quality fabrication and components.  A third is the cloud, and the fourth is the radio-access network (RAN, and for 5G the New Radio or NR).  Cisco is obviously already jumping at the first and third of these spaces.  Other network incumbents will have to move quickly to cement their own choices.

This isn’t going to be an easy transition for network vendors, but it’s pretty obvious that with the exploding interest of operators in this new “agile-box” technology, combined with the fact that there are real specifications (P4) and hardware advances behind it, something is going to come of it.  The good choices for vendors are likely to be used up pretty quickly, so my advice to the network equipment vendors is simple—don’t delay.