A Look at Where SD-WAN May Be Heading

We now have a pretty good idea what SD-WAN looks like, at least in its basic form.  We can infer, from things like the MEF draft SD-WAN standard, what some SD-WAN future developments might be.  What we really need to do is explore the future of SD-WAN in a more formalized way.  SD-WAN, as I’ve said, is a step toward Network-as-a-Service (NaaS), and so we could expect NaaS needs to drive SD-WAN evolution increasingly, as the relationship between the two becomes clearer.

NaaS is a concept of virtual networking where connectivity is created on demand.  There are two inferences in that statement.  First, it presumes that there are demands for connectivity that an open and promiscuous IP network cannot fulfill.  Generally, that would mean that connection policies were a key part of the requirements.  Second, it presumes that “connectivity” in both business and consumer applications involves building up what has been called “closed user groups”.  These are communities whose members can interconnect with each other, but for which membership is an explicit requirement for that connectivity.
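To make the closed-user-group idea concrete, here’s a minimal sketch (in Python, with hypothetical group and endpoint names) of connectivity that’s gated on explicit membership rather than on open IP reachability:

```python
# Minimal sketch of a closed user group (CUG): connectivity is allowed only
# when both endpoints are explicit members of a common group.  Group and
# endpoint names are hypothetical.

cug_membership = {
    "partners-eu": {"hq-erp", "partner-a-gw", "partner-b-gw"},
    "branch-vpn":  {"hq-erp", "branch-12", "branch-47"},
}

def may_connect(src: str, dst: str) -> bool:
    """Allow a connection only if src and dst share at least one closed user group."""
    return any(src in members and dst in members
               for members in cug_membership.values())

print(may_connect("branch-12", "hq-erp"))        # True: both are in "branch-vpn"
print(may_connect("branch-12", "partner-a-gw"))  # False: no shared group
```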

A traditional SD-WAN model involves the use of what I’ve called “SD-WAN nodes”, which form the boundary between SD-WAN services and legacy IP connectivity.  These nodes are deployed, either as appliances or software instances, where on- and off-ramp capabilities are needed, and every SD-WAN user has its own unique set of nodes.  This structure differs from that of IP networks, including the Internet, because the connectivity layer is deployed on a per-user basis, and any resource sharing among users happens below, in the transport infrastructure on which SD-WAN services ride.
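As a rough illustration of that structure, here’s a sketch (all customer, site, and transport names are hypothetical) of per-user node sets riding on shared transport:

```python
# Rough sketch of the per-user SD-WAN node model: each customer has its own
# set of edge nodes (appliances or software instances), while the transport
# networks underneath are shared.  All names are hypothetical.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SdwanNode:
    site: str
    transport: str          # the shared network this node rides on

@dataclass
class SdwanService:
    customer: str
    nodes: List[SdwanNode] = field(default_factory=list)

shared_transports = {"internet", "mpls-vpn-1"}   # shared below the overlay

acme = SdwanService("acme", [
    SdwanNode("hq", "mpls-vpn-1"),
    SdwanNode("branch-3", "internet"),
])
globex = SdwanService("globex", [
    SdwanNode("hq", "internet"),     # same transport, but a separate node set
])

# The connectivity layer (the node sets) is per customer; only the transports
# are common to both services.
print(acme.nodes, globex.nodes, shared_transports)
```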

The current SD-WAN model works fine as a company VPN, either combining Internet and MPLS VPN services or running alone as an overlay on Internet connectivity.  Where it falls short of NaaS requirements is in its support for community services that either cross user boundaries or bind users to resources that are inherently shared among users.  Partner connectivity is an example of the first, and things like content delivery networks are an example of the second.

To make SD-WAN a full NaaS strategy, we’d need to look at three “needs”, in some combination.  First, a standard way of connecting transport networks so that SD-WAN could overlay a mixture of such networks.  Second, an SD-WAN node that could support multiple SD-WAN communities.  And third, a means of connecting users when the NaaS communities and SD-WAN networks aren’t in a 1:1 relationship.  We’ll look at each of these in order.

Starting, of course, with the bridging requirement.  Any NaaS that’s truly a NaaS has to be able to connect users wherever they are, which means whatever transport network they happen to be on.  That means there has to be a means of bridging multiple transport networks, including the Internet, IP MPLS, Ethernet, and even, in theory, TDM.  A bridging strategy of any sort has to have a node with an interface on each of the networks it bridges, but what exactly happens in that node depends on the overall connectivity strategy.

One choice is to have an SD-WAN node that can support an overlay on any of the target transport technologies.  There are “universal” pseudowire or tunneling protocols that will work over almost anything, and one of those could be used.  Another option is to select a “transport protocol”, the logical option being IP, and bridge IP over the other transport options.

Those same two transport options could work if we had a specialized bridging node that was aware only of the transport choice and simply passed the SD-WAN overlays transparently.  However, this choice would mean having the SD-WAN implementations zero in on a compatible transport mechanism that could overlay any physical network.  IP would be the only universal option, so we’d need to decide how to create an IP-like transport overlay on which SD-WAN overlays could be hosted.  The overhead of this could become an issue, and of course so could cooperation among SD-WAN vendors.
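To put the encapsulation question in concrete terms, here’s a hedged sketch of a node that wraps overlay traffic per transport, plus a simple bridge between two transports; the transport names and encapsulation labels are my own illustrative assumptions, not a description of any product or standard:

```python
# Hedged sketch of the "pick an overlay encapsulation per transport" idea.
# The transport names and encapsulation labels are illustrative assumptions,
# not a description of any product or standard.

ENCAPS_BY_TRANSPORT = {
    "internet": "tunnel-over-public-ip",
    "mpls-vpn": "tunnel-over-vpn-ip",
    "ethernet": "ip-over-ethernet-then-tunnel",
    "tdm":      "ip-over-tdm-then-tunnel",   # "even in theory TDM"
}

def encapsulate(payload: bytes, transport: str) -> bytes:
    """Wrap an SD-WAN overlay packet for the given transport."""
    if transport not in ENCAPS_BY_TRANSPORT:
        raise ValueError(f"no overlay defined for transport {transport!r}")
    return ENCAPS_BY_TRANSPORT[transport].encode() + b"|" + payload

def bridge(packet: bytes, egress: str) -> bytes:
    """A bridging node: strip the ingress encapsulation, re-wrap for egress."""
    inner = packet.split(b"|", 1)[1]
    return encapsulate(inner, egress)

pkt = encapsulate(b"flow-1234-data", "ethernet")
print(bridge(pkt, "internet"))
```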

A special case of our last option is the answer to the second of my three “needs”, an SD-WAN node that could support multiple SD-WAN communities in parallel.  Think of it as a “shared bridge” or “shared gateway”.  There are two possible models of such a device, one that supports multiple communities based on the same SD-WAN implementation, and one that supports communities with diverse SD-WAN technology choices.  The former option isn’t feasible in a market where there are already two dozen or more implementations, so we have to focus on the latter.

Supporting two different SD-WAN technologies within a device means mapping between them for connectivity, and the only practical way to do that in a market as diverse as SD-WAN is to presume you’re going to harmonize everything to a common standard.  The MEF’s draft SD-WAN spec, while still only a draft, seems to have taken some useful steps in defining what such a standard would look like.  SD-WAN is about forwarding application flows, so the MEF defines the notion of application flows and then allows policies to assign flow identities to packets that arrive.  One could assume that if you harmonized both the on- and off-ramp for a flow to the MEF specs, you could enter from one implementation and exit onto another.
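As a rough sketch of what “policies assign flow identities to packets” could look like in practice, here’s an illustrative classifier; the flow names and match rules are invented for illustration and aren’t taken from the MEF draft:

```python
# Illustrative sketch of "policies assign flow identities to packets".  The
# flow names, match fields, and rules are invented for illustration; they are
# not taken from the MEF draft itself.

from typing import Optional

FLOW_POLICIES = [
    # (flow identity, match predicate over packet metadata)
    ("voice",    lambda p: p.get("dscp") == 46),
    ("saas-crm", lambda p: p.get("dst_fqdn", "").endswith("crm.example.com")),
    ("bulk",     lambda p: p.get("dst_port") in (20, 21, 873)),
]

def classify(packet: dict) -> Optional[str]:
    """Return the first flow identity whose policy matches the packet."""
    for flow_id, match in FLOW_POLICIES:
        if match(packet):
            return flow_id
    return None    # unclassified traffic gets default forwarding

print(classify({"dscp": 46, "dst_port": 5060}))                       # voice
print(classify({"dst_fqdn": "eu.crm.example.com", "dst_port": 443}))  # saas-crm
```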

The fault in this happy picture lies in the fact that there are major variations in the way current SD-WAN implementations treat the “application flow” concept.  Most provide little real support for flows, offering only a handful of flow-assignment policies.  This is a problem because interworking nodes like those I’ve been describing, even ones that harmonize to a common standard, have to be able to harmonize both ingress and egress to that standard, and that can’t happen if the features aren’t congruent.

They aren’t, and in particular they aren’t congruent with respect to connection policy controls.  The specific NaaS issue is those closed user groups.  If application flow policies can’t identify specific user/user or user/resource relationships, then there is no NaaS because there are no closed user groups.  In addition, there’s no natural security for shared-node resources, because user sessions aren’t recognized and so can’t be kept separate.  Today perhaps three or four SD-WAN implementations have any real relationship- or identity-based policy controls.  Those can perhaps be made to interwork with each other, but for the rest?  It’s going to be tough.

That’s a problem in gateways, but a greater problem in creating an explicit interworking approach among the SD-WAN implementations.  The harmonize-through-convergence-on-a-standard model is also an answer to the third of our NaaS “needs”, the ability to map NaaS membership independent of SD-WAN technology.  Any node could provide SD-WAN intermediation between vendors if the vendors would cooperate, which is doubtful in a competitive market.  On the other hand, if the MEF approach is popular and credible, and if it actually defines the required interworking fully, there would be pressure on SD-WAN vendors to support it, which would then make it possible to build a “harmonizer”.  In fact, it’s not unreasonable to assume that there would be an open-source initiative to do just that, with each vendor contributing its own mapping interface.
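Here’s a hedged sketch of what such a harmonizer might look like, with each (hypothetical) vendor contributing an adapter that maps its native flow representation to and from a common, MEF-like form:

```python
# Sketch of a "harmonizer": each vendor contributes a mapping interface that
# translates its own flow/policy representation to and from a common,
# MEF-like standard form.  Vendor and field names are hypothetical.

from abc import ABC, abstractmethod

class VendorAdapter(ABC):
    @abstractmethod
    def to_standard(self, native: dict) -> dict: ...
    @abstractmethod
    def from_standard(self, standard: dict) -> dict: ...

class VendorA(VendorAdapter):
    def to_standard(self, native):
        return {"flow_id": native["app"], "policy": native["rule"]}
    def from_standard(self, standard):
        return {"app": standard["flow_id"], "rule": standard["policy"]}

class VendorB(VendorAdapter):
    def to_standard(self, native):
        return {"flow_id": native["application_name"], "policy": native["treatment"]}
    def from_standard(self, standard):
        return {"application_name": standard["flow_id"], "treatment": standard["policy"]}

def harmonize(flow: dict, ingress: VendorAdapter, egress: VendorAdapter) -> dict:
    """Enter from one implementation, exit onto another via the standard form."""
    return egress.from_standard(ingress.to_standard(flow))

print(harmonize({"app": "voice", "rule": "low-latency"}, VendorA(), VendorB()))
```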

The challenges all of this will face in a competitive market are clear, and it’s equally clear that time isn’t going to make matters easier.  Thus, the pace of the MEF’s work is as important as its completeness in addressing the mapping issues.  I’m guessing it will be some point in 2020 before the final standards are available and the market can even assess whether the NaaS points I’ve made can be supported with them.  If they aren’t, we may have to wait for some other approach to come along, and that will impact NaaS progress at a time when we really need progress to be made.

From Climbing the Value Chain to Descending It?

We all know that in a market that’s commoditizing as networking seems to be, it’s common to see “horizontal aggregation” of players, where competitors bulk up to improve efficiency as market profit margins decline.  We now seem to be entering an era of “vertical aggregation”, where companies buy not competitors but rather symbiotic players in an attempt to gain competitive advantage.  The publicized media acquisitions of Comcast and AT&T are examples of this, and Amazon is now rumored to be looking at buying Boost from Sprint/T-Mobile.  Is this just a maturation of the old M&A trend, or something new?  Two factors may be the determinants.

In a small pond, you might well expect one big fish to survive simply by eating the others.  This is a pretty good picture of the horizontal M&A model, and if it were left to market dynamics to control events, we’d probably end up with little or no competition.  The first complicating factor is regulation.  Almost all industrial countries have anti-trust regulations, and most generally follow established economic theory that says that you should have three or four viable competitors in a market.  Policy may thus prevent those natural market forces from taking hold.

The other factor is the economic benefit of horizontal aggregation.  You can reduce waste by cutting things like the competitive overbuild of network infrastructure and the redundant staffing that result when you divide a market into a bunch of competitive enclaves, but suppose that’s not enough?  Ultimately the costs of supporting services will plateau at a certain level; you need some infrastructure and staff.  If willingness to pay for services isn’t rising, and if new market opportunities aren’t developing, then profits will also plateau, which makes the stock market very antsy.  They want growth in earnings, and that’s that.

Buying a non-competitive business is a possible answer.  A number of operators, including Telefonica, have gone to emerging markets to deploy services, having exhausted the potential for growth in their home markets and facing a lot of competition there as well.  The problem with this approach is that you’re essentially buying revenue, and the strategy will hit the same wall as your new market area reaches cost optimization and addressable-market limits.  Not every emerging market is a good one, and the good ones get snapped up fast.

The symbiotic approach is what’s now emerging as a complete and credible strategy.  In the old days, networks were focused on connection as a service.  Today, they’re focused on experience delivery.  Internet traffic and profits come far more often from delivering video than from making calls or sending emails.  Thus, it’s logical to assume that network operators would look to climb the value chain and obtain the experiences themselves.

An operator who owns the experience now has the profit margin from the experience plus the profit margin from the connectivity.  Some of the organizations’ staffs can be consolidated, but the big value is that there’s less disintermediation of operator interests when you have both what’s wanted and how it’s delivered.  The OTT isn’t stealing your profits because you are the OTT, or at least part of the OTT.  You’re climbing the value chain toward what the buyer really wants.  All very logical.

What’s not as clearly logical is the trend in the opposite direction, where instead of climbing the value chain, you descend it.  Google Fiber is an example of this, and Amazon’s acquisition of a mobile operator would be another.  Generally, the experience-level stuff has a better return than the connection-and-delivery stuff, so why would an experience player decide to dive into the mud?

One possible reason is that you’re afraid that your own business success will be impacted by the increased profit pressure on those lower on the value chain.  Google’s business wouldn’t be possible without network operators to provide mobile and wireline services.  Google, Amazon, Facebook, Netflix, and all the rest, would love operators to just suck it up and invest no matter how lousy their return on infrastructure is.  But what if they won’t?

Google demonstrated the “what-if” when they fielded the notion they might bid on wireless spectrum.  They did another demonstration with Google Fiber, which was a big media hit but was deployed only in situations so rare that there could have been no hope of it changing the market.  Amazon might be doing that today, or it might not.

Mobile services, mobile broadband, have been for years the growth and profit engine of network operators.  In general, wireless capex has continued to grow at least modestly while wireline has generally been in a slow decline.  So one reason why an OTT might be willing to dive into the mobile service space is that it’s still green-like-money grass and not quite mud.  One reason, but not a convincing one.

The problem with mobile services is that they’re more competitive.  How many times have we heard about a big telco running out to deploy wireline infrastructure in another telco’s territory?  We do have poaching of high-value business customers, but not of mass-market consumers.  But mobile services are easily offered out of region, and so their margins are quickly being eroded by the extra competition.  The grass is greener than wireline’s grass, but it’s showing signs of mucking up.

There’s a better reason, though.  From the perspective of “new” services, it’s hard to find much happening except in the mobile space.  Mobile broadband is really the driver of social media.  It’s really the driver of streaming video.  If there’s a future to augmented reality, it’s probably a future that will come about because of a mobile offering.  Same with IoT.  You can’t immerse yourself in, or change your life with, something that you can only access between dinner and bedtime.  Mobile is where you are, everywhere you are, at any time.

Which may be what Amazon is thinking.  Nobody believes that Amazon can eat all of retail.  Nobody believes it can eat all of video and audio either, and yet the Street agonizes over how Amazon can continue to deliver great quarterly growth when their core markets aren’t growing.  So perhaps Amazon is thinking that they’ll find new markets, and that might well mean finding new mobile-centric services.

Amazon failed in its phone attempt.  Might they see 5G as an opportunity to launch a new run at that market?  If so, might they also see that if a 5G-centric service happened to be something that Amazon could sell as an experience to users, as a cloud feature to partners, they’d be able to double- or triple-dip?

There’s even a chance that this kind of move would help relax the regulatory problems facing the big OTTs.  For all the coverage we have of unfair practices from these guys, the fact is that they’re fighting for an ad market that has little upside.  The global ad market is always going to be a lot smaller than the actual market for goods and services that the ad budgets are shooting to influence.  Amazon is already in a good place because they sell stuff, not ads.  They could be in a better place if they were driving a networking revolution.  Imagine people liking Amazon again!

If Amazon thinks it would be able to create a 5G revolution and profit from it, then acquiring Boost would make a lot of sense.  However, there would have to be some understanding that 5G would be part of the deal, and perhaps that would mean even an MVNO piece of a possible 5G millimeter-wave deal of the kind T-Mobile was promising.  There’d also have to be some specific brass-ring application for Amazon to exploit.

We have never had an MVNO relationship built with the goal of exploiting a new service rather than a legacy service.  Does Amazon hope to do just that, and if they do, might that actually give life to the network slicing aspect of 5G?  Remember that network slicing creates a more complete asset segregation than a normal MVNO relationship would, and that could be something Amazon would be interested in.  Could Amazon hope to create a 5G network-slice MVNO over both wireless and wireline?  Even more interesting, and we’ll have to see what actually develops.

And how would this play for Google, with Fi?  Google already has an MVNO relationship with both Sprint and T-Mobile.  If Amazon could exploit a new MVNO deal (via Boost) surely Google could do the same.  And if that were to come about, we’d have a real business case for network slicing, for 5G core, and for a more revolutionary 5G deployment model.  It’s worth thinking about.

What We Should Take Away from Google’s Outage

Everyone has surely heard about Google’s massive outage, even those who weren’t directly impacted by it.  The fact that “network congestion” could take down such a broad set of services—some essential to their users—is troubling, and it should be.  The details of the cause will probably eventually come out, but it appears the issues were broader than Google’s cloud alone.  Whatever they were, there are a lot of C-suite executives who are very worried this morning.

The biggest worry reported by execs who have been emailing me is that the cloud is less reliable than their own data centers.  One said “I’ve committed my company to a cloud strategy to improve availability, and now parts of my cloud failed while my data center is running fine.”  Obviously this sort of thing could hurt the cloud, eventually.  My model of buyer behavior says that the credibility of Google’s cloud and public cloud services in general won’t be impacted by this outage, but if the problems with cloud failures persist, it could happen.

There’s a pretty solid chance that this problem, and in fact most of the publicized problems with the cloud, are really network problems rather than hosting problems.  Cloud data centers, “hyperconverged” if you like, tend to be highly reliable because they have built-in redundancy.  The problem with cloud computing isn’t the computing part at all, but the fact that network connectivity is essential in the cloud.  In fact, it’s the foundation of the cloud.

IP networks are “adaptive”, meaning that they rely largely on cooperative behavior among devices to set up routes and respond to failures.  Routers “advertise” reachability and various protocols are used to exchange this information, which is then turned into routing tables.  For ages, back into the ‘70s in my own experience, there’s been a debate on whether this kind of autonomy is a good thing or a bad thing.  The good part is that networks built on autonomous device behavior are self-adapting.  The bad part is that if they get it wrong, it may be difficult to figure out what happened and how to remedy the problem, which seems to have been the case here.
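To show what that autonomy means, here’s a toy sketch of reachability advertisements being folded into a routing table; it’s a simplified distance-vector illustration, not any specific routing protocol:

```python
# Toy illustration of adaptive routing: each router advertises what it can
# reach, neighbors fold those advertisements into their routing tables, and a
# bad advertisement propagates just as readily as a good one.  This is a
# simplified distance-vector sketch, not any specific protocol.

def merge_advertisement(table: dict, neighbor: str, advert: dict) -> dict:
    """Update a routing table from a neighbor's reachability advertisement."""
    for prefix, cost in advert.items():
        new_cost = cost + 1                       # one extra hop via the neighbor
        if prefix not in table or new_cost < table[prefix][0]:
            table[prefix] = (new_cost, neighbor)  # (cost, next hop)
    return table

router_a = {"10.0.1.0/24": (0, "local")}
advert_from_b = {"10.0.2.0/24": 0, "10.0.3.0/24": 1}

merge_advertisement(router_a, "router-b", advert_from_b)
print(router_a)
# A misconfigured advertisement (say, router-b wrongly claiming a prefix it
# can't reach) would be absorbed in exactly the same way; that's the downside
# of autonomy.
```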

Proponents of SDN have long felt that IP was a bit too much of a jolly band of brothers and too little an organized and controllable technology to build networks on.  They’d like to replace IP with central control of routes, and if you look at that as a path to a more controllable, “traffic engineered” network, the notion seems valid.  The problem is that nobody has proved it would work at anything approaching the scale of the Internet, and it’s far from clear that any operator would be willing to make the massive technology change that adopting SDN to replace traditional adaptive routing would involve.

The other truth here is that when the dust settles on this outage, we’re very likely to find that the fault was caused by a configuration problem, meaning human error.  Yes, you could argue that configuring devices that, once they’re set up, largely do their own thing makes it hard for network engineers to predict the relationship between the changes they make and future network conditions.  A number of operators have told me this disconnect is the cause of most of their significant network outages.  The point is that human error isn’t going to be eliminated by moving to a networking model that’s based more on human traffic engineering decisions than on the decisions made by autonomous devices.

The thing this outage proves is that we really do need some form of AI in networking, not just in the “management” or “traffic engineering” parts of it but also in the configuration management part.  In the cloud, meaning in the hosting part of the cloud, we’ve already recognized the underpinnings of a mature, reliable framework for host management, and they start with the notion of a “single source of truth” and a “declarative model”.

Declarative models describe the “goal-state” of a system, and employ automated processes to converge on that (or another acceptable) goal state when something goes awry.  We’ve been shifting increasingly to declarative DevOps, for example, in both VM and container hosting, and in the public cloud.  In networking, the rule is still script-based or “imperative” management, where instead of saying what you want, you tell the network what to do.  That process is much harder to add AI to, but it’s also hard to have scripts changing things while devices themselves are adapting to conditions, including the ones you create with your changes.
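Here’s a minimal sketch of the declarative goal-state idea, with hypothetical interface names; you record what you want and a reconciler computes the steps to converge on it, where an imperative script would simply list the commands to run:

```python
# Minimal sketch of the declarative "goal-state" idea: you record what you
# want, and an automated reconciler computes the actions needed to converge
# the actual state toward it.  Interface names and states are hypothetical;
# an imperative script would instead list the commands to run, step by step.

goal_state   = {"ge-0/0/1": "down", "ge-0/0/2": "up"}
actual_state = {"ge-0/0/1": "up",   "ge-0/0/2": "up"}

def reconcile(actual: dict, goal: dict) -> list:
    """Return the actions needed to move the actual state to the goal state."""
    return [f"set {intf} {wanted}"
            for intf, wanted in goal.items()
            if actual.get(intf) != wanted]

print(reconcile(actual_state, goal_state))   # ['set ge-0/0/1 down']
# Run the reconciler again after the change has been applied and it returns
# an empty list: the system has converged on the goal state.
```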

The single source of truth is perhaps the solution to this.  Rather than have a dozen or more sources of configuration information, single-source-of-truth approaches keep everything (which should mean all the goal-states) in a common repository.  Since Git is the most common repository, this approach is often called “GitOps” because it builds everything around the repository.  Everything that needs a configuration has to get it from there, and the repository can examine the totality of data and ensure that everything is consistent.
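And a hedged sketch of the single-source-of-truth idea: every device pulls its configuration from one repository, and the repository can check the whole set for consistency before anything is applied.  The repository contents and the duplicate-address rule here are purely illustrative:

```python
# Sketch of the single-source-of-truth idea: every device pulls its
# configuration from one repository, and the repository can check the whole
# set for consistency before anything is pushed.  The contents and the
# duplicate-address rule are purely illustrative.

repository = {
    "router-1": {"interface": "ge-0/0/0", "ip": "10.1.1.1/30"},
    "router-2": {"interface": "ge-0/0/0", "ip": "10.1.1.2/30"},
    "router-3": {"interface": "ge-0/0/0", "ip": "10.1.1.1/30"},  # conflict!
}

def check_consistency(repo: dict) -> list:
    """Flag device configurations that conflict across the whole repository."""
    seen, problems = {}, []
    for device, cfg in repo.items():
        if cfg["ip"] in seen:
            problems.append(f"{device} duplicates {seen[cfg['ip']]}'s address {cfg['ip']}")
        else:
            seen[cfg["ip"]] = device
    return problems

def fetch_config(repo: dict, device: str) -> dict:
    """Devices get their configuration only from the repository."""
    return repo[device]

print(check_consistency(repository))
print(fetch_config(repository, "router-2"))
```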

Why, if the cloud has figured out GitOps, has this problem with cloud/network symbiosis not been solved?  The big reason is that even though we’ve virtualized practically everything these days, we’ve somehow not really done that with the network itself.  Even technologies like overlay SDN and SD-WAN, which provide virtual connectivity, do so by layering it onto the transport network, which is first of all almost surely IP and second of all is treated as a separate resource, presumed to be available.  Things like service mesh are just now starting to integrate network thinking and cloud thinking.

Connectivity is what makes the cloud a cloud and not just a big data center.  We can’t make the cloud work if we can’t make the network it depends on work, and that network is the Internet.  We need to figure out how to make GitOps-like technology converge with what network geeks often call “NetOps”, and we’re not there yet.  Google’s outage shows we need to accelerate the process of getting there.