If the Data Center Drives Networking Overall, What’s the Destination?

For decades, enterprises have told me that the data center drives the network.  The vendors they said had the greatest influence on their overall network strategic planning were those who had the greatest strategic influence in the data center.  For the last ten years, the data center has been evolving to meet the requirements of componentized applications, virtualization, and cloud hosting.  In recent Wall Street reports, analysts have commented on this trend and cited it as the basis for software-defined network (SDN) growth.  What do data center users, particularly cloud providers and network operators, think?

Both large enterprises and service providers (including telecom, cable, and cloud providers) have long noted that the trend in software was toward componentized applications.  These applications generated the same “vertical” app-to-user traffic as monolithic applications, but they also generated “horizontal” inter-application traffic through component workflows.  Since a unit of work passes, on average, twice in the vertical direction (in and out) but could pass through four or five components in the horizontal direction, it was clear that we were heading toward a time when horizontal traffic could outstrip vertical traffic.
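
To make the arithmetic concrete, here’s a back-of-the-envelope sketch in Python.  The hop counts and per-transaction message size are illustrative assumptions, not measurements; the point is simply that a handful of component-to-component hops quickly dominates the two vertical hops.

```python
# Back-of-the-envelope comparison of "vertical" (user-to-app) versus
# "horizontal" (component-to-component) traffic for one unit of work.
# Hop counts and message size are illustrative assumptions.

def traffic_per_transaction(vertical_hops=2, horizontal_hops=5, unit_kb=10):
    """Return (vertical_kb, horizontal_kb) generated by one transaction."""
    return vertical_hops * unit_kb, horizontal_hops * unit_kb

v, h = traffic_per_transaction()
print(f"vertical: {v} KB, horizontal: {h} KB, ratio: {h / v:.1f}x")
```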

Both classes of data center users also realized that resiliency and scalability were far more reliable, and far less costly in resources and performance, if they were confined within a single data center and required no WAN connections.  “Local” component workflows are much better than workflows that involve remotely connected resources.  Thus, the expected horizontal traffic would grow up within data centers more than between them, and the natures of data center and data-center-interconnect traffic would diverge over time.

Data center switching concepts have lagged this horizontal-emphasis transformation.  Historically, there have been two considerations associated with building data center switching systems.  The first is that the capacity of a given switch is limited by backplane speed, and the second is that Ethernet bridging protocols have historically limited a switch to a single trunk path to another switch.  About ten years ago, some network vendors (including Juniper, whose announcement I attended) began to offer fabric solutions.  Fabrics also developed in the storage market (remember InfiniBand?).  By traditional definition, a fabric is a switch that can connect every port to every other port without blocking and without major differences in transit performance.

You don’t necessarily need a monolithic structure to create a fabric.  The combination of top-of-rack and transport switches can do nearly as well, as long as you address the issue of multiple pathways between switches.  There are protocol enhancements to Ethernet bridging that do that, but an increasingly popular approach is to use SDN.  What SDN does is allow the creation of ad hoc Level 2 virtual networks.  Combined with a low-latency, non-blocking switching system in a data center, that lets tenant and application networks be created, repaired, and scaled dynamically, with minimal concern about whether the resulting workflows will be efficient.  Finally, you can use hosting policies that reflect horizontal traffic levels to ensure you don’t overload switch-to-switch trunks.  This is a workable way to extend basic switching, but obviously a non-blocking approach makes it easier to find optimum places to host capacity.
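
As a sketch of what such a hosting policy might look like, the fragment below prefers to co-locate a component with its workflow partners and only crosses a switch-to-switch trunk when the trunk has headroom for the component’s expected horizontal traffic.  The switch names, slot counts, and headroom figures are all illustrative assumptions, not a real controller API.

```python
# A minimal hosting-policy sketch: co-locate with workflow partners when
# possible; otherwise pick the switch whose trunk to the partner has the
# most headroom after carrying this component's horizontal traffic.

def place_component(expected_gbps, partner_switch, switches, trunk_headroom_gbps):
    """Return the switch chosen for a new component."""
    # First choice: co-locate with the partners (no trunk traffic at all).
    if switches[partner_switch]["free_slots"] > 0:
        return partner_switch
    # Otherwise rank the remaining switches by leftover trunk headroom.
    candidates = [
        (trunk_headroom_gbps[(sw, partner_switch)] - expected_gbps, sw)
        for sw, info in switches.items()
        if sw != partner_switch and info["free_slots"] > 0
    ]
    headroom, best = max(candidates)
    if headroom < 0:
        raise RuntimeError("no placement without overloading a trunk")
    return best

switches = {"tor1": {"free_slots": 0}, "tor2": {"free_slots": 4}, "tor3": {"free_slots": 2}}
trunks = {("tor2", "tor1"): 40.0, ("tor3", "tor1"): 12.0}
print(place_component(8.0, "tor1", switches, trunks))  # -> tor2
```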

The big question for data-center-centric network planners is the one raised by edge computing.  The efficiency of a data center as a resource pool is roughly expressed by an Erlang C curve, which says that even though adding servers doesn’t create a uniform increase in efficiency, there is a critical level of resources needed to provide a reasonable hope of fulfilling a hosting request.  Edge computing would naturally distribute resources, and in the early days of edge computing it might be difficult to assign enough servers to a given edge data center to reach reasonable Erlang efficiency.  If you can’t host a given component in the right place, then hosting it in the next-best place could mean a significant increase in the cost and delay associated with the horizontal connections to that component.
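
A minimal sketch of that pooling effect, using the classical Erlang C waiting-probability formula: the same total offered load, split across several small edge pools, leaves each request much more likely to wait (or be hosted somewhere worse) than in one large pool.  The server counts and load figures are illustrative assumptions.

```python
from math import factorial

def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that a request must wait (Erlang C), for 'servers'
    identical servers and 'offered_load' expressed in Erlangs."""
    if offered_load >= servers:
        return 1.0  # overloaded pool: every request waits
    top = (offered_load ** servers / factorial(servers)) * (servers / (servers - offered_load))
    bottom = sum(offered_load ** k / factorial(k) for k in range(servers)) + top
    return top / bottom

# Same 80% occupancy, concentrated versus spread across five edge sites:
print(f"1 x 50 servers : P(wait) = {erlang_c(50, 40):.3f}")
print(f"5 x 10 servers : P(wait) = {erlang_c(10, 8):.3f}  (per edge site)")
```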

One implication for networking overall is that a move to edge computing would be most effective if it were accompanied by a parallel improvement in edge transport connectivity: data center interconnect using high-capacity pipes.  You don’t necessarily have to mesh every edge point; it would probably be enough to join edges by connecting each one to its two nearest neighbors.  That would give every edge point a virtual resource pool consisting of three edge data centers.  Data center interconnect (DCI) facilities aimed at creating this kind of trio-modeled collection would go a long way toward mitigating the efficiency and availability risks associated with the smaller data centers.

Another implication for networking is that if edge data centers, like other cloud data centers, are multi-tenant, then DCI connections will have to extend the multi-tenant mechanisms across data center boundaries.  There’s nothing inherently wrong with doing that, provided the trunks are efficient, but it could increase the size of the virtual switch and the need for reliable switch control, at both the real-device and virtual-device levels.

Span of control for SDN switching has always been recognized as a potential issue, for a number of reasons.  First, the scalability of central controllers is something that almost all enterprises and operators are wary about.  How many events can they handle, how many events might be generated by common failures, and so forth?  This could be handled through the use of federated SDN controllers, but that approach is very new.  We don’t fully understand how federating SDN could impact things like setting up virtual tenants across a DCI.  Is there a risk of collision, or does the federation add excessive delays?

Does data center virtualization stop at the data center boundary, or extend to partner data centers?  Does it extend outward even from there, to things like cell sites and content sources?  Those are the questions that are now going to be asked, and the answers may determine the future of SDN and the model for metro networking.  There are powerful inertial forces established in networks by legacy technology investment.  Open-box operating systems and open switch/routing software could create a more agile future without risking all of that incumbent investment.  The P4 forwarding language could then support modernization and a whole new model, which could include some SDN.  Or SDN might get its federation act together and leverage what it’s already won—the data center.  That’s still where the strategic balance of power for networking is found.

Where DevOps Needs to Be

What’s next in DevOps, or orchestration or automation or whatever you like to call it?  That’s a question that was asked (in a more limited form) in a new DevOps LinkedIn group, and I think it’s an important one if we address it the right way.  At the DevOps or tool level, a question like that is just a descent into hype, a speculation about what might happen based on the premise that technology evolves by itself.  It doesn’t.  We have to step beyond the old hype that technology drives reality and explore where reality is driving technology.

The term “DevOps” is a problem in the current age, because it embodies two fundamental errors.  The first is that this is all about transporting developer knowledge into application operations practices.  Developers know best, they know what the application needs, and so if we communicate that need to operations, we get optimality.  The second is that the goal is a transfer of knowledge, when the real goal is the automation of what we know we need to do.

The first problem is one of “inclusion”.  Imagine the deployment of yesteryear (when, yes, the Lone Ranger rode), a time when we deployed transactional applications in a very basic (often monolithic) form on bare metal.  Now imagine a world of componentization, microservices, lambdas and functional computing, virtualization, containers, clouds, serverless, events, and more.  How likely is it that the practices of that early era would, through simple evolution, meet the current needs of the IT market?

Are developers of applications the drivers behind cloud adoption?  Do they understand the differences between containers and bare metal?  Do they know why we scale applications, why stateless behavior is good for some components?  The first truth of the current era is that we need OpsDev as much as DevOps.  But that truth doesn’t go far enough.

Application scalability and resiliency aren’t just things that operations people can introduce on a whim.  They are ways of meeting application requirements set by the ultimate arbiter of truth: the line department users and management.  That’s also true of whether we need to empower mobile workers, what we empower them with, how sensors and controllers could fit into business processes, and other stuff we think of as developer territory.  We need UseOps and UseDev as well as DevOps and OpsDev.  There’s a trinity here: users, operations, development.  They all need to be represented.

And they all need to be orchestrated or automated.  We can’t talk about automated tools without considering how the humans work with them.  Optimum “automation” has to look at the business process efficiency overall, not simply at what the most efficient IT operation might be.  That might be awful for the workers.  Similarly, automation of tasks has to automate enough to ensure that what’s left for the worker isn’t harder to do than the job would have been without any automation at all.

Early in my career as a software developer/architect, I spent a lot of time working out a way of making a certain IT process more efficient.  My boss looked at it, then showed me that simply doing something totally different at the IT level created a better business process overall.  It was an important lesson.

The second problem is simpler in one sense and more complex in the other.  My favorite-of-all-time CIO quote was “Tom, you need to understand.  I don’t want decision support, I want to be told what to do!”  Put another way, we don’t want “knowledge” or “guidance” as much as we want things taken care of.  That’s what automation is all about, after all.

The question, then, is how we go about automating something.  DevOps has always had two different models, one that describes the steps needed to accomplish something, and the other that describes the “something” we’re trying to accomplish.  “Prescriptive” versus “declarative”, or “transformation” versus “end-state”.  If you took a fairly loose definition of what kinds of tools fit into traditional DevOps, you could probably say that the majority of it is prescriptive today.  But prescriptive models are very hard to define in highly complex environments.  What, exactly, is the script trying to do?  The more variable the starting and ending points, the more scripts you need and the harder it is to organize things.
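
A toy contrast of the two models, in Python rather than any particular DevOps tool (the resource names and states are hypothetical, and real tools on either side are far richer).  The prescriptive version is an ordered script that only makes sense if you know the starting point; the declarative version names the end state and leaves a reconciliation step to work out the path from wherever the system currently is.

```python
# Prescriptive: an ordered script of steps. You must know the starting
# point for the steps to make sense.
prescriptive_deploy = [
    "pull image app:v2",
    "stop app:v1 containers",
    "start 3 containers of app:v2",
    "register app:v2 with the load balancer",
]

# Declarative: a description of the end state. The tool works out the
# steps from wherever the system happens to be now.
declarative_deploy = {"component": "app", "version": "v2", "replicas": 3}

def reconcile(desired: dict, observed: dict) -> list[str]:
    """The heart of a declarative engine: diff desired against observed
    state and emit only the actions needed to close the gap."""
    actions = []
    if observed.get("version") != desired["version"]:
        actions.append(f"roll containers to {desired['component']}:{desired['version']}")
    gap = desired["replicas"] - observed.get("replicas", 0)
    if gap:
        actions.append(f"{'add' if gap > 0 else 'remove'} {abs(gap)} replica(s)")
    return actions

print(reconcile(declarative_deploy, {"version": "v1", "replicas": 2}))
```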

Declarative approaches have had their own challenges, because once you decide that you’re not going to build scripts/instructions for handling something, you need to decide how the automated process knows what to do to get from where you are to that glorious, desired, end state you’ve declared.  We are seeing clear recognition of this challenge in “zero-touch automation” in cloud-hosted carrier services today.  Do we think that services are inherently more complex than applications, and so have a need for this heavy thinking that apps don’t?  Why is that, given that what we are hosting future services in is the same cloud as what we’ll host future applications in?  DevOps, at the basic level of deployment of features, has been way ahead of comparable initiatives in the service provider space (like Network Functions Virtualization’s Management and Orchestration or MANO processes).  Why get behind now?  DevOps has to be careful not to ignore “orchestration”, just as orchestration needed to embrace DevOps.

Two things are emerging in the service provider orchestration discussions that need to be in DevOps discussions as well.  The first is the notion of intent modeling and hierarchical orchestration.  This lets you define abstract steps that can be refined according to the specifics of where deployment is taking place or what the SLA needs of a service/application happen to be.  The second is the notion of scalable, distributed, execution.  It makes little sense to deploy scalable, resilient, applications using monolithic tools.
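
Here’s a minimal sketch of what intent modeling with hierarchical orchestration might look like: an abstract intent decomposes into child intents, and each leaf picks a concrete recipe based on where it’s being deployed and the SLA it has to meet.  The class, service, and site names are illustrative assumptions, not any particular orchestrator’s API.

```python
# Hierarchical intent sketch: interior intents delegate to children,
# leaf intents choose an implementation per site and SLA.

class Intent:
    def __init__(self, name, sla_ms=None, children=None):
        self.name, self.sla_ms, self.children = name, sla_ms, children or []

    def realize(self, site: str) -> list[str]:
        """Refine this intent into concrete steps for a given site."""
        if self.children:  # interior node: delegate to child intents
            steps = []
            for child in self.children:
                steps += child.realize(site)
            return steps
        # Leaf node: pick an implementation according to site and SLA.
        flavor = "edge-accelerated" if (self.sla_ms or 100) < 20 else "standard"
        return [f"deploy {self.name} ({flavor}) at {site}"]

service = Intent("video-service", children=[
    Intent("packager", sla_ms=50),
    Intent("cache", sla_ms=10),
])
print("\n".join(service.realize("metro-east-03")))
```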

The reason these two points should be important to DevOps is that we’re already seeing a major shift of DevOps focus, from static deployment to dynamic and continuous lifecycle management.  DevOps tools have dabbled in this shift through the support of “events”, but not gone far enough.  Modeling a service lifecycle in an environment where deployment and redeployment and scaling are almost continuous, and where resource pools include not only the data center but multicloud, is beyond current tool capabilities.

Event recognition is really a key to a future DevOps transformation.  First, events combine with intent-model principles to establish the notion of a coordinated multi-layer, multi-element, lifecycle management process.  This, as I’ve said, is the key point that the service provider orchestration advocates are contributing to lifecycle management.  Second, events combine with AI to create the kind of automated system people are seeing as the future of deployment and lifecycle management.

You can’t have AI without events, because events in some form are what AI systems react to.  But most important, you can’t have AI without goal-state awareness.  The only rational mission for AI in lifecycle management is the convergence of a complex system on the goal state.  State/event systems recognize the goal state by defining it and defining the “non-goal-states” and their event progression toward the goal.  AI systems would presumably replace explicit state/event structures with an automated state-event-interpretation process, something like complex event processing.  Whatever happens at the detail level, nothing can come out of this unless the system knows what the goal-state is.  That means that DevOps should be aggressively moving toward an end-state-and-event approach.  It’s moving there, but hardly aggressively.
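
To make the goal-state point concrete, here’s a minimal state/event sketch: the model declares its goal state explicitly, and every event either moves the system toward that state or triggers a remediation defined for the off-goal state it lands in.  The states, events, and transitions are illustrative assumptions; a real lifecycle model would have many more of each.

```python
# Goal-state-driven state/event lifecycle sketch.

GOAL = "active"
TRANSITIONS = {
    ("ordered",   "deploy_request"): "deploying",
    ("deploying", "deploy_done"):    "active",
    ("active",    "fault"):          "degraded",
    ("degraded",  "redeploy_done"):  "active",
}
REMEDIATION = {            # what automation does when it is off-goal
    "deploying": "wait for deployment to complete",
    "degraded":  "redeploy failed components",
}

def handle(state: str, event: str) -> str:
    new_state = TRANSITIONS.get((state, event), state)
    note = "goal state reached" if new_state == GOAL else REMEDIATION.get(new_state, "no action")
    print(f"{state} --{event}--> {new_state}: {note}")
    return new_state

s = "ordered"
for ev in ["deploy_request", "deploy_done", "fault", "redeploy_done"]:
    s = handle(s, ev)
```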

The future is complicated, and if there’s a failing that’s plagued tech for decades, it’s that we’ve always underestimated the complexity of new technology, particularly in networking.  Automation is critical to sustaining reasonable operating costs and preventing human error from contaminating service reputations.  DevOps has led the way, and it needs to continue that, because applications and service features are all going the same direction—componentization and cloud hosting.

The Two Drivers to Network Change, and How Each is Doing

Hosting service features in some form is going to happen.  The timing is fuzzy, and the specifics of the technology to be used are perhaps even fuzzier, but it’s going to happen.  This is a good point in our hosting-features evolution to think a bit about the options available and the things that might help select among them.  We’ll start with the broad approaches and go on from there.

Feature hosting really started with hosted router instances.  I can recall talking with Tier One operators about the value proposition for hosted routers and switches back in 2012, and by 2013 the operators were both interested in and investing in software routers and switches.  These early efforts focused on hosting a router or switch, rather than on hosting features in a broader sense.

Software-defined networking (SDN) came along at about the same time, and it was directed at creating a different model for forwarding packets, something to replace the adaptive routing behavior of current networks with a centrally controlled forwarding process.  You could say that this initiative was a kind of “router/switch as an intent model” because the goal was to present traditional interfaces and services but use a different technology to forward packets, under the covers.

Network Functions Virtualization (NFV), which emerged in late 2012 and matured starting in 2013, took a broader view.  NFV said that services were made up of specific features, and that these features could be hosted on servers and connected through networking to produce something that looked like virtual devices.  The initial focus of NFV, and what most (including me) would say is its continued focus, was on features above Layers 2 and 3 (switching and routing).  Firewalls and other endpoint service features were enveloped in “virtual CPE” (vCPE), and features of mobile networks were converted from appliance form to hosted form.

Software-defined WAN, or SD-WAN, took yet another approach, this one aimed at abstracting the service away from the connection infrastructure that hosts it.  In many ways, SD-WAN is an evolution of legacy concepts of “tunneling”, “virtual wires”, or (as my late friend Ping Pan called them) “pseudowires”.  You build virtual pathways between endpoints over whatever connectivity is available at L2 or L3, and your service presentation is made by the elements themselves (usually software components hosted on or in something), independent of the underlying transport.

All of these initiatives came along from the feature side, but there have also been hosting-side changes.  Virtualization, cloud computing, and containers represented ways of packing more features onto a physical server, thus improving the economics of feature hosting.  The recent announcements in “white box” operating systems (the P4 forwarding-programming language; AT&T’s dNOS, now a Linux Foundation project called “DANOS”, both names meaning “disaggregated network operating system”; and the ONF’s Stratum) represent ways of exploiting off-the-shelf servers or open white-box switches to host arbitrary functionality, based on legacy features or new forwarding paradigms.

Picking the key options out of this list demands some scenario modeling, which I’ve been working to do.  My model says that there are two primary pathways to the hosted model of networks.  The first is the subduction path, where SD-WAN technology establishes service experiences and effectively disintermediates the underlying infrastructure.  That infrastructure then evolves based on some combination of the other options.  The second is the modernization path, and here we have the P4-virtual-device model being adopted in places where major infrastructure evolution is created by “outside forces”, like 5G or IoT.

The subduction path says that service-layer enhancements are valuable enough to induce network operators, managed service providers (MSPs) and even end users to adopt SD-WAN to reap service-layer benefits independent of operator infrastructure.  The advantage this approach has is the breadth of its support; there are almost two-dozen vendors and many service provider and MSP adopters of the technology.  It doesn’t impact existing infrastructure so there’s no risk of displacing undepreciated assets, and it provides a level of service-layer visibility that’s lacking in most of today’s services.

The many-cooks asset is also SD-WAN’s biggest liability.  Prospective SD-WAN users say that they have a hard time digging out a rational market vision from the host of competing positioning statements from vendors.  Nobody is singing melody in the SD-WAN song.  On the other hand, legacy vendors tend to softly push against the technology, for the obvious reason that being disintermediated isn’t exactly a positive thing.

The modernization path is almost the opposite.  On this track, the driving force is doing a better job of building infrastructure in places where new builds or greenfields are seen as likely to emerge.  As I said in yesterday’s blog, the easiest place to introduce something new is where something new is needed.  However, the best of the something-news would be something seen as more evolutionary than revolutionary.  AT&T’s commitment to put white-box DANOS-based devices in cell sites is an example; these are much more likely to be open routers than devices based on a new forwarding paradigm (like SDN/OpenFlow).

The evolutionary approach to transformation is the biggest asset for the open-box-OS approach, and the biggest liability it has is the fact that evolution is hitched to another star, the star of whatever’s forcing the new-infrastructure deployment.  How long will 5G take?  How about IoT?  Given that little rational thought seems to be focused on presenting a model for either one that makes a business case for all the stakeholders, the answer could be “very long indeed”, in which case the modernization path to a hosting strategy fails.

What about SDN and NFV as drivers?  Forget it.  SDN and NFV face a common problem, and each also has its own unique problems.  Neither appears to be moving toward a solution.

The big common problem is “concept scope”.  What, exactly, is SDN or NFV?  Every vendor who has any role for software in networking calls their strategy a form of SDN, and every vendor who has a software feature or virtual device running on any platform calls it NFV.  The lack of an accepted singular model makes it hard to postulate a network goal, much less an evolution to achieve it.

SDN’s unique problem is the lack of a validated and scalable central-control strategy.  We can make SDN work in any data center.  We can probably make it work in transport missions.  Can we make it work in a VPN?  Some say yes, and others say no.  Could it scale to the Internet level?  Most say it cannot, and all admit that there are a lot of proof points needed and few clear paths to getting them.  If SDN can’t be everything, we’d need to know exactly what it can be, and what we’d need it to be, to expand its adoption.  We’re not going to get those answers soon, if ever.  Is it a coincidence that the ONF, who promoted OpenFlow-controlled SDN, is advancing Stratum, a general hosting-side solution?

NFV’s big problem is that it’s taken so long to do anything that there’s nothing uniquely useful left to do.  The limited scope of the ISG efforts makes NFV little more than a crippled form of orchestration or DevOps, in a world where cloud and virtualization initiatives have built something bigger and better, and gained a much larger base of adoption.  There’s nothing in the NFV work that isn’t, shouldn’t, or couldn’t be in cloud work.  The whole focus of NFV seems to be virtual CPE, which the open-box-OS solutions like DANOS or Stratum do much better.  There was a lot of good, insightful, powerful thinking early on, but whatever we can learn from NFV has already been learned, implemented better, and adopted elsewhere.

On the resource-driven side of networking, the modernization path, hosting is now, and always will be, about the cloud.  Cloud initiatives, though, will be supplemented by business-level zero-touch automation, which might evolve from the ONAP work, from ETSI’s ZTA group, from a newly enlightened and invigorated TMF, or perhaps from some new initiative.  But even here, it’s obvious that orchestration in the cloud is growing upward from the resource level.  Give it long enough, and it will produce an application set that makes telecom and network services into nothing but an application.

What about the race for influence supremacy among our possible drivers of change?  The subduction model of change is in my view reactive; it means that no major new services are emerging, and that 5G and IoT don’t generate massive new deployments of devices, just trivial access/RAN changes.  The modernization model is proactive; it means that major new deployments are happening that can leverage new technology.

Right now, the model says that the odds slightly favor the proactive modernization model because overall economics are good and there’s a viable set of 5G/IoT deployments.  However, the decisive period is 2019-2020, because it’s then that the modernization-driven deployments will have to achieve some mass.  If that doesn’t happen, then SD-WAN will become the vehicle of the future.

Bootstrapping Network Transformation

We all know the phrase “You can’t get there from here.”  It’s a joke to illustrate that some logical destinations don’t have any realistic migration path.  Some wonder whether a new network model of any kind, whether it’s white-box-and-OpenFlow, hosted routing, or NFV, is practical because of the migration difficulties.  Operators themselves are mixed on the point, but they do offer some useful insights.  Another useful data point is a move that giant Intel is making.  Can we “get there”, and just where are we trying to get?

One of the biggest issues with network transformation is the value and stability of the transition period.  Since nobody is going to fork-lift everything, what you end up doing in transformation is introducing new elements that, initially in particular, are a few coins in a gravel pit.  It’s hard for these early small changes to make any difference overall, so it’s hard for the transition phases to prove anything about the value of transitioning in the first place.

Let’s start with the financial side.  Say you have a billion dollars’ worth of infrastructure.  You have a five-year write-down, so 20% of it is available for replacement in a given year.  Say you also expect to add 10% more every year, so in net you have about 30% of that billion available.  Now you get a project to “modernize” your network.  Logically, you’d expect to be able to replace old with new to the tune of 30% of infrastructure, right?  According to operators, that’s not the case.
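
The arithmetic, spelled out with the same illustrative numbers (and the two-percent field-trial figure operators cite in the next paragraph):

```python
# $1B installed base, five-year straight-line write-down, 10% annual
# growth, versus what a controlled field trial actually touches.

installed_base = 1_000_000_000        # dollars of current infrastructure
write_down_years = 5
annual_replacement = installed_base / write_down_years   # 20% -> $200M
annual_expansion = 0.10 * installed_base                  # 10% -> $100M
theoretically_addressable = annual_replacement + annual_expansion

field_trial_share = 0.02              # operators: ~2% of total infrastructure
field_trial_spend = field_trial_share * installed_base

print(f"Theoretically addressable per year: ${theoretically_addressable/1e6:.0f}M")
print(f"Typical controlled field trial:     ${field_trial_spend/1e6:.0f}M")
```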

Normal practice for a “modernization” would be to do a series of trials that ended in a controlled field trial.  Operators say that such a trial, on our hypothetical billion-dollar infrastructure, would likely involve no more than two percent of the total, regardless of the financial displacement theories.  That controlled field trial would be tasked with proving the technology and business case, and that could be hard.

What is the business case, meaning the value proposition?  If it’s a simple matter of “Box Y costs 30% less than Box X and is fully equivalent”, then you could in fact prove out the substitution with a field trial of that size.  You pick some displaceable, representative boxes, do the substitution, and see whether the stated full equivalence is real.  If that happy simplicity is complicated by things like “Box Y isn’t fully equivalent” or “part of the savings relates to operations or agility improvements”, then the scale of the trial may not be sufficient to prove things out.  It’s this focus on the relationship between features/capabilities, benefits, and proof points that brings technology into things.

Justifying transformation gets easier if you can take a portion of a network that was somewhat autonomous and implement it fully using Box Y.  In other words, if part of your 10% expansion strategy could be separated from the rest and used as the basis for the field trial, you might be able to treat it as a little network of its own and get a much better idea of how it could make the benefit case.

In some cases, the 10% expansion targets might include a mission or two that lends itself to this approach.  5G is a good example.  If operators plan 5G trials or deployment in 2019 and the plans include using white-box cell-site switching, it’s very possible you could define the specific requirements for interfacing those devices so that a single cell or a set of nearby cells might be able to form a sandbox for testing new technology.

Operators tell me that this is the way they’d like to prove in new technology.  You don’t fork-lift, because you don’t want to write down too many assets.  You don’t just replace aging gear with transformed boxes either, because aging gear is spread out through the network, so the replacements can’t work symbiotically with each other.  The need to fit them into all kinds of different places also raises risks.  So, what you want to do is find a place where something new can be done a new way.

This is one reason why 5G and similar initiatives that have at least some budgeting support could be critical for new technologies, and it probably explains why operators like AT&T are so interested in applying open-networking principles at the 5G edge.  Edges are always a convenient place to test things out, because new technologies are easier to integrate there.  There are also more devices at the edge, which means that edge locations may be better for validating the operations savings and practices associated with a new technology.

IP networks also have a natural structure of Autonomous Systems, or domains.  You’d use an interior gateway protocol (like OSPF) within a domain, and an exterior gateway protocol (like BGP) at domain boundaries.  The reason this is important is that domains are opaque except for what the exterior gateway protocol makes visible, which means you could almost view them as intent models.  “I serve routes,” says the BGP intent model, but how it does that isn’t revealed or relevant.  Google uses this in its SDN network implementation, surrounding an SDN route collection with a ring of “BGP emulators” that advertise connectivity the way real routers would, inside the black box of the BGP intent-model-like abstraction.  That means you could implement a domain (hopefully a small one) as a pilot sandbox, and it could be used with the rest of your IP network just as if it were built with traditional routers (assuming, of course, that the new technology worked!).
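
A toy rendering of that “domain as an intent model” idea, with hypothetical class names and route data: from the outside, every domain is just something that advertises reachability; whether the inside is traditional routers or an SDN pilot sandbox is hidden behind the same presentation.

```python
# "I serve routes": two domain implementations behind one abstraction.

class Domain:
    """Abstract intent model: advertises reachability, hides the inside."""
    def advertise(self) -> dict[str, str]:
        raise NotImplementedError

class LegacyRouterDomain(Domain):
    def advertise(self):
        return {"10.1.0.0/16": "AS65001"}   # learned by ordinary BGP/OSPF

class SdnPilotDomain(Domain):
    """Centrally controlled inside; a 'BGP emulator' speaks for it outside."""
    def advertise(self):
        return {"10.2.0.0/16": "AS65002"}   # same external presentation

# The rest of the IP network neither knows nor cares which is which:
for d in (LegacyRouterDomain(), SdnPilotDomain()):
    print(d.advertise())
```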

Cloud computing, hosting of features using NFV, content delivery, and a bunch of other applications we already know about would be plausible sources for new domains, and would thus generate new candidates for the pilot testing of new technologies.  Best of all, these new service elements tend to be focused in the metro network, which is also where operators are planning the most fiber deployment to upgrade transport capacity.  The network of the future is very likely to be almost the opposite of the network of the present, with most of its capacity in metro enclaves at the edge, and a smaller capacity fabric linking the enclaves.  That’s the logical model for edge computing, event processing and IoT, and content delivery.

IoT is the big potential driver, more important to next-gen infrastructure than even 5G.  While 5G claims a lot of architectural shifts, most are inside 5G Core, the part that’s on the tail of the standards-progress train and that has the most problematic business case.  IoT is a compute-intensive application that relies on short control loops and thus promotes edge computing over centralized cloud data centers.  It boosts metro capacity needs and also creates the very kind of domain-enclaves that lend themselves to testing new technology options.

Interestingly, Intel just announced it was spinning off Wind River to have the latter focus on IoT, edge computing, and intelligent devices.  This is the kind of thing that could accelerate IoT adoption by addressing the real issues, almost all of which relate to edge processing rather than inventorying and managing sensors.  However, Wind River has been focusing on operating systems and middleware in general terms, rather than on assembling any specific enabling technologies.  Would they create a model of functional computing to rival Amazon’s Lambda and Step Functions?  They could, but whether they’ll take that bold step remains to be seen.

Operators are slightly more optimistic about IoT as a driver of next-gen technology and transformation than they are about 5G, but only slightly.  This, despite the fact that the service-demand-side credibility of IoT is much higher than that of 5G, which is more likely to make current cellular networks faster and a bit more efficient than to launch myriads of new services.  I think a big part of the reason lies in the classic devil-you-know.  What an IoT service infrastructure looks like is tough to visualize.  For Wind River, and for other vendors who need transformation to succeed, making IoT easier to visualize may be their best mission statement.

Monitoring and Testing in the Age of Open Networks

Every change you make to a network changes how you look at it.  That changes what you look at it with, meaning the technologies, tools, and practices involved in monitoring and testing.  Networks are a cooperative association of devices, coordinated in some way to fit a specific mission.  Getting them to do that, or determining whether they’re doing it or not, depends on the mission and the means of coordination.  That’s changing rapidly, and testing and monitoring have to change with it.

There are three broad models of networking in use today.  The first is the adaptive model, where devices exchange peer information to discover routes and destinations.  This is how IP networks, including the Internet, work.  The second is the static model, where destinations and pathways (routes) are explicitly defined in a tabular way, and the third is the central model, where destinations and routes are centrally controlled but dynamically set based on policies and conditions.  Each of these models has different monitoring and testing challenges.

IP networks are the baseline technology approach today, and in IP networks the devices themselves will typically exchange information on who they can reach and how efficient the pathway is.  Since the networks adapt to changes in conditions, including traffic loading, it’s traditionally been difficult to know just exactly where a given flow of traffic is going, and so it’s difficult to say whether it’s impacted by some trunk or node condition.

Advances like MPLS (multi-protocol label switching) have allowed IP networks to define virtual trunks that create a more meshed topology, and by routing these trunks explicitly you can gain greater control over how routes are drawn through a network.  This, of course, creates what’s effectively another layer of protocols, and a new list of things that might be subject to monitoring and testing.

The static model is in a sense an extreme end-game of these virtual-trunk practices.  The theory is that if a “good” connectivity model can be defined as a series of forwarding rules/policies/tables in all the devices, then you can keep it in place unless something breaks, and if something breaks you fix it.  The fixing delay might be prohibitive in a “real” world of nodes and circuits, but if we’re dealing with virtual elements then fixing something is a matter of virtual reconfiguration.  No big thing.

Static models have their disadvantages, the biggest of which being that they don’t work well if there are a lot of new destinations added, old destinations deleted, or destinations moving around in connectivity terms.  It takes time to update all the distributed forwarding rules, and while the update is in progress you risk having illogical and non-functional intermediate states created.  This also happens in IP, where adaptive routing changes can create pathways to nowhere as the new topology “converges”.  It can also mean your monitoring and testing end up focusing on somewhere the traffic isn’t going.

Central control models try to thread the needle between the approaches by providing a set of policies for forwarding, but also a controller where the policies are determined, stored, and distributed.  OpenFlow SDN is a modern example of central control, and for those who, like me, have a foot in networking history, IBM’s Systems Network Architecture (SNA) was a centrally controlled network approach.

Central control has issues too, not the least of which is the loss of connection between a network node and the controller.  Without a control path, forwarding policies can’t be updated, and it’s possible that without their being updated, you can’t connect a control path.  Classic Catch-22.  There’s also the fact that the load on the central controller can become acute when it’s presented with packets for which no forwarding rules have been defined, and is then expected to define them.  During periods of high dynamism, when new things are being added or a lot of things are breaking, this loading issue could take a controller down.

The new paradigm of programmatic control of forwarding, via languages like P4, opens up even more variability.  What is needed to control a P4 network?  Answer: it depends on how the network is programmed, which means it would be very challenging to evolve testing and monitoring based on traditional techniques to keep pace with what P4 might produce in network forwarding approaches.  How do you address that?

At a broad scale, the answer is analytics.  You analyze overall behavior, because with hosted VPN or Internet services, you have no other option.  Enterprise management and monitoring inevitably evolves to analytics, because there are no nodes or trunks visible to you.  Networking is a numbers game, and that’s also true as things transform to intent models.  Even specific, provisioned, services can be reduced to analytics when you divide them into an intent hierarchy.

Not so for those who build networks, and who face the open-network revolution.  There, the only possible solution to the new agile-forwarding problem is also a solution to the problems of the current three network models.  What you need to do is forget the notion of packet interception and inspection, which depends on knowing what’s supposed to be happening on the data path.  Instead you focus on what the nodes themselves are doing.  The network, whether it’s the network of the present or some abstract network of the future, is the sum of its policies.  Those policies are applied in the nodes, so know the nodes and you know the state of the network.

Monitoring, then, is really going to turn into examining the node’s behavior.  That means two things, really.  One is the rules that determine how packets are handled, and the other is the conditions of the connections and the data plane behavior of the nodes that do the connecting.  The first is a kind of sophisticated policy interpretation and analysis process, and the second is dependent on in-band telemetry.  P4 assumes both can be made available, and you can make the two available with some customization even in the current three network models.

The ideal foundation for P4 testing and monitoring would be a “P4-network simulator” that lets you model the nodes and policies and test the behavior of the network under the variable conditions the P4 policies are sensitive to.  The same simulator would generate pseudo-in-band telemetry; how much was generated, when, and what it recorded could then be analyzed and correlated with load conditions.  Every P4 device has to have a “P4 compiler” anyway, so building a simulator wouldn’t be rocket science.
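
A heavily simplified sketch of the idea, assuming nothing about real P4 tooling: simulated nodes hold forwarding rules, a packet is pushed through them hop by hop, and each hop appends pseudo-in-band telemetry (a timestamp and a stand-in queue depth) that an analytics stage could later compare with telemetry from the live network.  All names and the load model are illustrative assumptions.

```python
import time, random

class SimNode:
    def __init__(self, name, rules):
        self.name, self.rules = name, rules     # rules: dst prefix -> next hop

    def forward(self, packet):
        """Forward the packet and record pseudo-in-band telemetry."""
        next_hop = self.rules.get(packet["dst"])
        packet["telemetry"].append({
            "node": self.name,
            "ts": time.time(),
            "queue_depth": random.randint(0, 50),   # stand-in for a load model
        })
        return next_hop

def run(packet, nodes, start):
    hop = start
    while hop is not None:
        hop = nodes[hop].forward(packet)
    return packet["telemetry"]

nodes = {
    "edge-a": SimNode("edge-a", {"10.2.0.0/16": "core-1"}),
    "core-1": SimNode("core-1", {"10.2.0.0/16": "edge-b"}),
    "edge-b": SimNode("edge-b", {}),                # destination-attached node
}
trace = run({"dst": "10.2.0.0/16", "telemetry": []}, nodes, "edge-a")
for hop in trace:
    print(hop["node"], f"queue={hop['queue_depth']}")
```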

There’s a P4 suggestion on monitoring that uses a fairly broad set of in-band features that not only provide information on the flow, but also on the forwarding path rules.  I’m sure that this would be helpful in “debugging” a P4 program, but I think the simulator would be of greater value, and the in-band telemetry could then focus on end-to-end performance information, obtained by timestamping the messages and also recording things like queue depth along the way.

I think future testing and monitoring for all network models should converge on this approach.  Do a simulation and test it with telemetry comparisons.  For network models whose forwarding rules aren’t distributed from a central point, the goal would require that you obtain the current forwarding rules (time-stamped) to feed the simulator.  That means reading routing tables.  You could also, for legacy network models, read adaptive discovery information from the data paths, but this has limited value except at area boundaries, where such information can be made available.  Snooping on every path/trunk is complicated, and of diminishing value.

How about probes and protocol analysis?  I think both concepts have been slowly diminishing in importance too; first because fewer people can read them and second because of the difficulty in matching probing to virtual networks.  It is possible to introduce probe processes more easily now, but more difficult to interpret the results and avoid impacting performance.  I think that “virtual probes” have a place in creating a simulator or analytics framework for monitoring and testing, but that their value outside that environment will continue to decline over time.

There are network simulators out there, of course.  As far as I know, none of them are specifically P4 simulators, which wouldn’t necessarily rule them out as the framework for modern network monitoring and testing.  It would make them more general, but less able to model the specifics of our new forwarding-programmable model of devices.  Most are used in research and university environments, though, meaning that they’re not intended to model things at scale for operational analysis.  Discrete event simulation, the core of most of these models, is difficult to apply at scale for the obvious reason that processing simulated events at scale is as big a problem as actually moving the equivalent traffic.

Some experts in the operator space tell me that there’s interest in the notion of moving beyond discrete event simulation to a broader model.  You first build elements whose behavior you do model traditionally, but you then link them in a different way to create something that can actually represent network behavior in real time without requiring a sea of supercomputers.  We’re not quite there, but we’re at least looking at the right stuff.

I think that’s true with testing and monitoring.  Virtualization inevitably changes how we build networks, so it’s going to change how we monitor and test them too.  Intent modeling could have a profound impact by framing SLAs for components, which enhances the role of analytics.  Smaller components can also be simulated more easily, and the simulations can then be “federated” to model behavior overall.  As in other spaces in our open-network future, vendors have been dragging their feet a bit.  That’s diminishing the opportunity to get testing and monitoring right, shifting the industry toward predictive analytics instead.  It could still be turned around, but the longer that takes the harder it will be.

Who Wins, and Loses, in Open-Model Networking

Open feature software and hardware platforms for devices could revolutionize networking, at many levels.  I blogged about the service impacts earlier, so now it’s time to look at what could happen to the industry overall.  I think everyone understands that the changes would be radical, but not universal.  Even in the vendor space, there could be some winners, even spectacular winners.

Industry leader Cisco has already responded to the open-device threat, promising to make at least some network functionality available by hosting a version of its IOS software on open platforms.  This seems to be a classic incumbent response to things like AT&T’s dNOS (now under the Linux Foundation) and the ONF’s Stratum, meaning that it goes part-way toward creating a commercial alternative to an open framework, but not quite enough to be fully disruptive to Cisco’s business.

The Stratum approach to open network software seems to me to be the most organized.  You have a series of hardware-function abstractions that are mapped, via plugins, to specific hardware implementations.  The software, which is based on the P4 packet processing language, runs on top of that, so device features are customized at the software level while all potential hardware benefits are exploited below.  This seems to be the logical, best way to approach the problem.  However, it’s also the most disruptive.

The Stratum approach opens the widest possible door for open network devices, so of course it poses the greatest threat to vendors like Cisco.  First, open hardware that’s effectively supported could by itself cut profits by about 25% if all Cisco’s hardware business were impacted.  Then there’s the fact that the IOS software would have to compete with an open community whose products had lower cost and a whole host of feature developers working to enhance it.

Am I then advocating “Open IOS?”  Not hardly, at least not at this point.  Such a move by Cisco would immediately validate a shift to open technology that, as I said in a prior blog, has yet to find its optimum justification path.  Why, if you’re worried about losing revenues/profits to an open platform, make such a platform inevitable?

If you’re not an incumbent router/switch giant like Cisco, though, such a move would be very smart.  Dell EMC and Metaswitch have joined to offer a “composable network” framework that is very similar conceptually to the Stratum model or dNOS.  Their material shows the Cisco approach of “disaggregation”, or separation of hardware and software, as an intermediate step on the way to the happy future of full openness, characterized by the ability to interchange elements of the network software stack from top to bottom and to leverage silicon advances.  The software runs on OPX-compatible switch hardware, including the Dell S and Z series.

The duality of the disaggregated and composable options is the focus of the current tension in adoption.  Cisco’s approach says “Hey, you know our stuff and you have a ton of our gear in your networks already.  You know you can’t fork-lift all that undepreciated stuff, so why not let us give you something that makes that unnecessary?”  The Metaswitch and Stratum/dNOS approach still has to address that problem of near-term displacement, so Cisco could in theory gain traction for its disaggregation model before any more convulsive approach gets anywhere.

As I said in an earlier blog, the challenge is that you can’t realize capex reduction through premature displacement.  A tossed switch that’s not written down costs more than the purchase price of the replacement, and that additional cost would have to be covered by something else.  Future service benefits aren’t currently credible according to operators, so the only pathway is opex reduction.  While it is true that open, composable, devices could be managed more efficiently than devices are managed today, what we’ve seen so far in service lifecycle automation has focused on managing “virtual routers” that are mapped to the same management techniques as the real ones.  Thus, any improvements in opex efficiency would be just as available to legacy networks.

Intel/Wind River may be coming at the issue from another angle.  They’ve announced that they’re contributing portions of the Wind River Titanium Cloud software relating to edge computing to the “Akraino Edge Stack, the open source community creating an open source software stack that supports high-availability cloud services optimized for edge computing systems and applications.  These contributions supplement AT&T code designed for carrier-scale edge computing applications running in virtual machines and containers to support reliability and performance requirements.”  If I’m right about “overlay services” being the easy pathway to future new services, and if the most obvious place to look for displacement opportunities is the edge of the network, where the most devices are, then this deal might be able to frame those future overlay services as cloud-edge services, not network services.

Microsoft has also just told its employees that it’s forming new development teams and realigning efforts to focus more on intelligent cloud and edge, which includes AI and edge computing.  This is a clear indicator that IT companies, both software and hardware, are going to look a lot more closely into edge computing and the use of “intelligence” as a service, which again suggests that many of the new overlay service opportunities will be addressed from the cloud and IT side.

It seems to me that this combination of happenings proves two things.  First, any near-term success for an open model for networking is going to depend on minimizing displacement risk.  That means that you either have to focus on spots where there’s little or nothing to displace (AT&T’s 5G edge, or the network edge in general), or you have to cover the cost with benefits.  Second, we don’t have any clear pathway to realizing those covering benefits so far.  We still don’t know how to manage complex distributed systems cooperating to create services.

This opens the “competitive point” in our discussion.  If incumbents like Cisco want to minimize the damage to their bottom line, they will need to frame highly efficient zero-touch automation of the full-stack service lifecycle, and do it in a way that fully leverages legacy devices.  That would radically reduce the risk that an open-model device could include management features that would tap that opex reduction benefit in new ways.  On the opposite side, if Metaswitch/Dell EMC and Intel/Wind River want to foment a revolution, they need to build an open management strategy that’s stellar in its efficiency, and incorporate it with great fanfare into their tools.  That would as much as triple the number of devices that could be targeted for replacement by open nodes in the next three years.

Nobody survives as long as Cisco has by being stupid, but IBM juggled the present and future successfully for a longer period by far, and IBM seems to have slipped up in managing the current virtualization-and-cloud shift.  Cisco could mess up in facing the current open-device challenge.  That’s particularly true given that Cisco tends to spread a soothing oil of positioning over a troubled market, relying on the fact that it’s a leader in networking to protect itself until change is inescapable.  This has been called the “fast-follower” strategy, and this time it poses a risk of being a “first-left-behind” strategy instead.  Some very big players are already moving.

Who might win, then?  In a period of market convulsion, winners are often selected by serendipity.  There are always avenues of change that pan out faster, pay off better, than average.  Some smaller companies will be accidentally positioned in one such niche, and a smaller number may see the opportunities in advance and get positioned.  For mid-tier vendors, this figure-it-out-first approach could be particularly rewarding because they have some market credibility to leverage.  Smaller players will need to position with great verve and flair and aim to be acquired by companies like Cisco when the opportunities become clear.

One candidate group of smaller players is the SD-WAN vendors.  Most of them realize by now that overlay services are the game, not simply extending VPNs beyond MPLS.  Most have also realized that hosted functionality is better than fixed appliances, because they have to embrace putting SD-WAN endpoints in the cloud.  Thus, they have all the pieces needed, technically speaking, to address the opportunity.

The problem this group faces is their VCs.  The litany of “laser focus” that I’ve heard from VCs for decades hasn’t changed in modern times, and it reflects the notion that you toss ten million or so at somebody and see if they get traction.  If not, you let them die.  There have been some M&A deals in the space, which may induce the VCs with remaining companies to stay the traditional, narrow, course.

Another candidate is the SDN players, but in my view they have more baggage to contend with.  Yes, they typically have more backing (those who have survived), but the fact is that the open movement is an impediment to early SDN adoption, or broader adoption.  If you can make legacy routing and switching open, why test out totally new centrally controlled paradigms?

Specialty startups focusing on open-model networking seem unlikely at this point; it’s hard to raise money to promote commoditization.  Other players, especially those in the SD-WAN space and some SDN players, could jump on this, and of course hardware vendors might focus on building the appliances and supplying supported but open software.  The big players?  They’ll likely wait till something really gets going, then buy the startups.  That will give them technology options, but for their business model, there’s no stopping the open movement now.