Are Skill Issues a Hidden Problem in Zero-Touch?

Could we be missing a big requirement in zero-touch?  In a recent conversation I had with a Tier Two operations executive, this question came up, and I think it’s a fair one.  It also illustrates that the world of telcos is more varied than we’d think.

Nearly all the emphasis on zero-touch service lifecycle automation is on its potential to radically reduce operations costs.  This is important first because opex is larger than capex for most operators, and second because new technologies designed to reduce capex (like NFV) risk raising complexity and opex in turn.  There’s nothing wrong with this perspective, but my recent conversation suggests it’s incomplete.

“Our problem isn’t how much touch operations needs, it’s who’s available to do the touching.”  That’s the insight my executive contact presented.  His point is that smaller telcos have major challenges in sustaining an adequate base of skilled operations personnel.  That means that zero-touch might have an added benefit of reducing errors and supporting more complex network technology choices, where limitations in skilled labor impact operations practices.  It also means that some zero-touch approaches might themselves be too complicated to administer.

My all-time favorite tech cartoon comes from the ‘60s.  At the time, it was popular to measure computing power in terms of how many years it might take some number of mathematicians to solve a given problem, versus the computer.  In the cartoon, a programmer comes home from work, tosses a briefcase on the sofa, and says to his spouse “I made a mistake today that would have taken a hundred mathematicians a thousand years to make!”  This reflects the downside of automation.  If it goes wrong, it can do truly awful things so fast that humans might never be able to respond.  If the humans don’t have the requisite skills…well, you get the picture.

Having machines run amok is a popular science fiction theme, but rampant machine errors aren’t confined to robots or artificial intelligence.  Any automated process is, by design, trying to respond to events in the shortest possible time.  Zero-touch systems necessarily do that, and so they’re subject to running amok too.  My contact is concerned that if zero-touch automation somehow isn’t set up right, it could make one of those thousand-year mistakes.

Most automated operations tools are rule-based, meaning they look for events and associate them with a set of rules, which in turn describe actions to be taken.  It’s pretty obvious that messing up the rules would mess up the results.  What could be done to reduce this risk?
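
To make the rule-based idea concrete, here’s a minimal sketch of what an event-to-rule-to-action loop might look like; the event names, thresholds, and actions are purely hypothetical, not taken from any real operations tool.

```python
# Minimal sketch of a rule-based event handler (illustrative only; the
# event type, condition, and action are hypothetical).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    matches: Callable[[dict], bool]   # does this rule apply to the event?
    action: Callable[[dict], None]    # what to do when it does

def handle_event(event: dict, rules: List[Rule]) -> None:
    """Scan the rule set and fire every rule whose condition matches."""
    for rule in rules:
        if rule.matches(event):
            rule.action(event)

# Example: reroute traffic when a trunk reports sustained congestion.
rules = [
    Rule(
        matches=lambda e: e.get("type") == "trunk_congestion" and e.get("util", 0) > 0.9,
        action=lambda e: print(f"Rerouting traffic away from {e['trunk_id']}"),
    ),
]

handle_event({"type": "trunk_congestion", "trunk_id": "T-104", "util": 0.95}, rules)
```

The fragility is easy to see: the entire behavior of the system lives in those rule conditions and actions, so a mistake in authoring them propagates at machine speed.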

The pat answer is “AI”, but artificial intelligence still has to somehow learn about the real world.  A proper AI strategy is probably a subset of a broader set of approaches that we could call “state-based” or “goal-seeking”.  These systems define a network as existing in a number of possible states, one of which is the goal state.  The system responds to events by transitioning into a new state, and when in that new state, a set of rules describes how to seek the goal state again.
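
As a rough illustration of the state-based idea, here’s a small sketch in which events move a service out of its goal state and each non-goal state carries a remediation aimed at getting back; the states, events, and remediations are invented for the example.

```python
# Hedged sketch of a "goal-seeking" state model: events move the service out
# of the goal state, and each non-goal state carries a remediation step that
# tries to move it back.  States and remediations here are hypothetical.
GOAL = "operational"

TRANSITIONS = {
    ("operational", "link_failure"): "degraded",
    ("degraded", "link_restored"): "operational",
    ("degraded", "node_failure"): "failed",
    ("failed", "node_restored"): "degraded",
}

REMEDIATIONS = {
    "degraded": "reroute around the failed link",
    "failed": "redeploy affected functions on healthy nodes",
}

def on_event(state: str, event: str) -> str:
    new_state = TRANSITIONS.get((state, event), state)
    if new_state != GOAL:
        print(f"State '{new_state}': remediation -> {REMEDIATIONS[new_state]}")
    return new_state

state = on_event(GOAL, "link_failure")     # -> degraded, reroute
state = on_event(state, "link_restored")   # -> back to the goal state
```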

This approach is similar to “intent modeling” in that it can be implemented via an intent model.  It could also be implemented using AI, but however it’s implemented there’s still some skill required in setting up the state transition rules.  The advantage is that the state-based approach is both holistic (it deals with network state, not discrete events) and easier for people to visualize.  One disadvantage is that the state transition rules tend to be “hidden” and easily forgotten.

The big problem with state-based modeling for zero-touch automation is that complex networks have a lot of moving parts, and thus a lot of failure modes to represent.  Similarly, the task of restoring the goal state is highly complex, to the point where you’d have to expect someone to author rules to do it, which takes you back to the rule-based approach.  As always, we then have the complexity of the rules and the difficulty of creating them without skilled personnel.

The approach that my contact liked was simulation.  If we presumed that we had a simulation of the network, and that the simulation allowed for the input of conditions, we could use the simulation both to create rules and to validate actions before they were taken.  In fact, the marriage of a good simulation strategy and any zero-touch approach seems natural, to the point that it’s hard to see how we could have come this far without trying it out more seriously.

Simulation of this sort depends on a library of elements that can then be combined graphically to “draw” a network.  The elements would include devices and trunk lines, of course, and ideally there should be elements to represent any real-world network component.  There would also have to be “load” and “condition” simulators to allow the simulated network to be “operated” and conditions tried out.
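
Here’s a hedged sketch of how that “try it in the simulator first” gate might look; the Simulation class, its capacity model, and the reroute action are stand-ins for whatever a real simulator and zero-touch system would actually provide.

```python
# A sketch of using a simulation as a "dry run" gate before a zero-touch
# system commits an action.  The Simulation class and its methods are
# hypothetical stand-ins for a real simulator.
import copy

class Simulation:
    def __init__(self, topology: dict, load: dict):
        self.topology = topology      # trunk capacities, drawn from a library
        self.load = load              # offered traffic per trunk

    def apply(self, action: dict) -> None:
        # e.g. {"reroute": ("T-1", "T-2")} moves load from one trunk to another
        src, dst = action["reroute"]
        self.load[dst] = self.load.get(dst, 0) + self.load.pop(src, 0)

    def violates_policy(self) -> bool:
        # Flag any trunk pushed past its capacity.
        return any(self.load.get(t, 0) > cap for t, cap in self.topology.items())

def validate_action(sim: Simulation, action: dict) -> bool:
    """Try the action on a copy of the simulated network before committing."""
    trial = copy.deepcopy(sim)
    trial.apply(action)
    return not trial.violates_policy()

sim = Simulation(topology={"T-1": 10, "T-2": 10}, load={"T-1": 8, "T-2": 6})
print(validate_action(sim, {"reroute": ("T-1", "T-2")}))  # False: T-2 would overload
```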

We have simulators, of course, and even complete simulation languages and libraries.  There are several dozen “network simulators”, including a number of free ones.  Some have decent library support, though it’s often necessary to define your own models for devices if the library doesn’t include what you’re running in your network.  The Tier Two and Three operators aren’t uniformly aware of these tools, and in my surveys the number who have any experience with them is down in the statistical noise level.

Those who have tried them don’t report much success.  The problem is that there’s nothing much out there to describe how to integrate simulations with zero-touch automation or even operations in general.  For Tier Ones, this is a problem I’ve often heard about, and for Tier Two and Three operators, it’s probably insurmountable without outside help.

AI machine learning or neural networks might also offer a solution here, but it’s unclear just how effective either would be absent a skilled team to provide information.  Most AI relies on subject matter experts who “know” the right approaches, and “knowledge engineers” who then create an AI application that can then deal with conditions in the field.  If a telco has limited access to a skilled labor pool, those experts could be hard to come by.

I think simulation is likely the best approach here, providing we can get some broader industry support for the idea, and in particular can get some standards and APIs that allow simulation to be integrated with operations tools, including zero-touch automation.

I also think that the problem of skill levels needed to establish and sustain zero-touch automation goes beyond Tier Twos and Threes.  I’ve worked on rule-based systems for quite a while, and the challenge that they pose is related to the complexity of the ecosystem they’re applied to.  A single trunk or device can be sustained in service with simple rules.  As you add elements, the rule complexity rises with the square of the number of elements, simply because networks are an interdependent system.  I wonder how many big Tier Ones will face this truth as they try for their own zero-touch applications.  Maybe the skills needed are beyond even them.
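
A quick back-of-the-envelope illustration of that scaling point: if every pair of interdependent elements is a potential interaction the rule set has to cover, the pair count grows roughly with the square of the element count.

```python
# Illustration of why rule complexity scales roughly with the square of the
# element count: each pair of interdependent elements is a potential
# interaction to cover (figures are illustrative, not operator data).
def pairwise_interactions(n_elements: int) -> int:
    return n_elements * (n_elements - 1) // 2

for n in (2, 10, 100, 1000):
    print(f"{n:>5} elements -> {pairwise_interactions(n):>8} element pairs to consider")
```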

Should we Break Up Big Tech?

Should we break up big tech?  That’s a question that’s being asked both politically and by tech companies themselves.  A thoughtful article on the topic raises some good points, but I think there are both useful amplifications of the points the article makes, and some points it doesn’t make that need to be considered.

The political and technology camps are looking at the issue from very different perspectives.  The political problem with big tech is really about big ad sponsorship.  People’s information, even their privacy, is compromised by companies who push for profits by improving ad targeting.  The tech industry’s own questions center on whether innovation is being stifled by the control exercised by the market giants in the space.

The Boston University article points out that the legal basis for breaking up big tech is questionable, at least under current law, and that a decision to try could threaten shareholders unfairly, perhaps even compromising some pension funds and nest eggs seniors rely on.  The legal question I leave to the lawyers.  What I want to look at instead is the practicality, the value, and the impact of a decision to break up some of the biggest and most successful companies of our age.

I know a lot of people who are all for privacy, complete protection of personal information, suspension of all tracking, and so forth.  Most of the same people say they’re not willing to pay for social media availability, for doing searches, or for viewing online content.  We are addicted to free stuff, and the truth is that most don’t care very much about how the “free stuff” is really paid for.  The business model of the Internet overall, even the players and their relationship with each other, is hazy in the minds of the great majority of Internet users.

The fact is that online stuff costs money, which means somebody is paying, which means that somebody is getting value from the “free” experience in some way.  That way is usually advertising.  Just as network TV is paid for by ads, so is much of our online experience.  Ads are successful because they work, and they work best if the ads are presented to actual prospective buyers, at the right time, when they’re considering a purchase.  This is why web companies gather user information, and why, if you search for a product on Google, you’re likely to suddenly see ads for that product popping up on your news and weather pages.

We have a perfect and absolute solution to gathering personal data and tracking our activities.  Pay for what we want online instead.  Obviously that’s not going to happen, and it’s useful to look at this extreme-case example because it means that the richness of our free online experiences is directly related to what we decide to surrender.  I suspect, based on my own contacts, that any attempt to limit big tech’s personal information-gathering would lose support if online experiences were impacted.

Your news and weather sites are going to show you ads, or you won’t see the news or weather.  If they show you random ads with no real connection to your interests or needs, they’ll get paid less for them and thus show you more ads to increase their revenue.  What works best for you?

The article makes another very important point, which is that web firms have an “inverted value” model, one that encourages others to create stuff by making that stuff more accessible.  How many web sites would there be without search engines to find them when we’re looking for a product, service, or information?  If we had to know the URL for everything we saw and did online, we’d have a poor (even impoverished) Internet.

Amazon isn’t a social media company, but the same arguments can be made for it.  Amazon promotes what you likely want to buy, because it’s more relevant for you and more profitable for Amazon.  If Amazon couldn’t track your interests, they’d still show you stuff just as your news and weather sites would.  Likely more of it, again because the untargeted stuff is less valuable.  Your choice?

Amazon is a kind of bridge company here, because it’s cited by tech companies as an example of somebody who’s stifling innovation by swamping out new competitors in its space and by politicians because of the impact it has on storefront retail.  Does the “Amazon is too big” argument stand up?

I remember the Sears Catalog.  We got it at home when I was a boy, and I spent a lot of time browsing it and wishing I had the money to buy what I found.  We don’t have Sears Catalogs today; they were swamped out by storefront retail in the form of strip and facility malls.  Retail economic evolution, in short, made them a casualty.  Online retail puts the same kind of pressure on storefronts as storefront proliferation put on catalog sales.  If storefronts fail, it will be because the buyer prefers an online model.

But there is a problem here.  A bunch of storefronts compete with each other.  A single giant online retailer can essentially destroy competition.  It’s relatively easy to open a hardware store, but very difficult to open an Amazon competitor.  The bigger Amazon gets, the harder it is to create competition, and if all other retail avenues are closed by Amazon’s power, who then protects consumers against predatory pricing?

The same argument could be made for cloud computing.  A giant cloud provider has economies of scale, name recognition, and other benefits that are very difficult to match if you’re just getting into the cloud business.  Does an Amazon kill off innovative cloud companies?

Perhaps, but GM sells a lot of different car brands and nobody is breaking it up.  There’s a value in economy of scale, and generally the theorists have postulated that the “optimum” market has three competitors.  We have more than three online sellers (Amazon and Walmart are the largest, but nearly every major storefront business maintains a web storefront too).

Then there are the VCs, who firmly believe that startups need exit strategies that earn big paydays for founders and investors.  Who buys up the startups?  Those big companies we’ve been talking about.  Without an acquisition exit, the only option is an IPO, and that would be much more difficult.  If the innovation in a market is related to the number of startups, you could argue that big companies who buy startups are promoting innovation.

Or you could argue they’re stifling it.  Are companies buying up innovative startups to promote their innovation or ensure it never makes it to market?  Look at the record of M&A by the big networking companies and you’ll see that most startup acquisitions they make don’t get to market, or at least don’t seem to make an impact.  Could this be a buy-and-kill strategy in action?

I think that the “break-up-big-tech” argument has three flaws.  First, the main point the article cited, which is the impact on shareholders.  Second, the problems we have with privacy are problems with the ad sponsorship business model, not the firms that implement it.  Finally, even if we accept either the privacy or innovation arguments and the consequences of our accepting them, it’s far from clear that breaking up big tech would actually do anything.

We almost certainly need to look at the way we regulate online business.  We may need new laws, and we may eventually have to apply anti-trust measures to current tech giants.  Whatever we do, we have to do thoughtfully and not fall prey to slogan-engineering.

Amazon’s IoT Initiative: Good or Bad?

“Standards are great, they break the ice, dabbling this way is oh so nice…”  These paraphrased words from an old song could be an anthem of IoT.  The problem is that breaking ice and dabbling don’t get things installed.  Indeed, the first step in that is actually having something to install.  Amazon, one of the web giants already under pressure for anti-competitive behavior, is launching something that could well add fuel to that fire, but it could also put a realistic face on IoT, which is something we darn sure need.

The source of all this potential for good and bad is Sidewalk, an Amazon-created device protocol that’s designed to let people build simple IoT networks from cheap devices, over fairly large areas.  Sidewalk will join a bunch of other protocols like X10, Zigbee, Insteon, and of course Bluetooth, WiFi, and 5G’s own IoT protocols, including NB-IoT.  There’s not a lot of side-by-side (no pun intended) information available on how Sidewalk matches up with these old and modern competitors, but there are some general comments.

It may not matter much how the technologies match up, though.  What Amazon is promising is to use its (acquired) Ring brand to promote a host of devices operating over Sidewalk.  It’s very possible, even likely, that Amazon will deploy its own IoT ecosystem, something like Insteon did years ago in the home control space.  Amazon is also likely to target home control, but Sidewalk has broader potential.

The name expresses Amazon’s desire to deal with the short-range limitations of current in-building IoT protocols, but without the higher device costs of full WiFi or (worse) 5G.  Extend your IoT to the “sidewalk” and beyond, Amazon invites users.  The range of Sidewalk is somewhat variable depending on the situation, but up to a half-mile is reasonable and in some cases, it could be even longer.  For those of us who lack vast estates, it’s hard to imagine that even the shorter range could fail to support devices ranged over a property.

You’ve got to open any Sidewalk discussions with a caution.  IoT protocols aren’t something you download the Library of Congress over, or use for interactive 6K video editing.  They don’t replace WiFi, including WiFi 6, for local broadband connectivity and they don’t replace 5G (or 4G, or whatever) for cellular service.  They’re designed to make it possible to connect cheap sensors and controllers into a network, probably one built around a “hub” that lets you manage all the information that’s collected, and all the stuff the controllers let you do.  So don’t get swept away by comparisons between Sidewalk and other protocols with very different missions.  Cheap, simple, and secure is what Sidewalk is about.
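
For illustration only, here’s a minimal sketch of the hub role described above, mapping sensor events onto controller actions through user-defined bindings; the device names and the binding format are assumptions, not anything Amazon has published.

```python
# Hypothetical sketch of an IoT "hub": collect events from cheap sensors and
# map them onto controller actions according to user-defined bindings.
bindings = {
    # (sensor_id, event) -> (controller_id, command)
    ("driveway_motion", "motion_detected"): ("porch_light", "turn_on"),
    ("mailbox_sensor", "opened"): ("hall_speaker", "chime"),
}

def hub_dispatch(sensor_id: str, event: str) -> None:
    target = bindings.get((sensor_id, event))
    if target is None:
        return  # unbound events are simply ignored (or logged)
    controller_id, command = target
    print(f"hub: {sensor_id}/{event} -> {controller_id}.{command}()")

hub_dispatch("driveway_motion", "motion_detected")
```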

That, and products.  Amazon’s decision to launch a new protocol that’s integrated with new devices (in the Ring family) is a smart one, because it makes adoption of Sidewalk a byproduct of building a larger Ring ecosystem, something Ring users are already somewhat committed to doing.  Amazon will also make the specifications available to other device-builders, which means that surely others will jump on the bandwagon to leverage Ring’s footprint.  Despite the pledge of public specs, though, Sidewalk is an Amazon development and not an industry standard.  That’s the downside, so now let’s look at the upsides.

More Ring, further away, is the obvious mantra of Sidewalk, but there are other themes less obvious.  For one, it seems to me that it’s a certainty that Amazon will be providing cloud services in support of the broader Sidewalk mission.  Ring devices already offer cloud-hosting of events.  If a Sidewalk home/building network combines with a WiFi gateway (that can also provide custom programming for how sensor events influence controllers to do things like turn on lights), then it can have its own relationship with Amazon’s cloud.  I believe, from what rumors I hear, that Amazon intends to provide a range of cloud services for Sidewalk, ranging from simple extensions to the old Ring services to actual AWS features that let people or third-party developers build applications around Sidewalk gear.

That would be an interesting extension in itself, but there could be more.  Listen to a quote from Amazon’s blog on Sidewalk (italics mine): “just a week ago Amazon employees and their friends and family joined together to conduct a test using 700 hundred Ring lighting products which support 900 MHz connections. Employees installed these devices around their home as typical customers do, and in just days, these individual network points combined to support a secure low-bandwidth 900 MHz network for things like lights and sensors that covered much of the Los Angeles Basin, one of the largest metropolitan regions in the United States by land area.”

Home networks combining?  It sounds like an IoT mesh, a step that would take home control beyond the home and securely into the classic IoT domain.  If you don’t think this is proof enough of Amazon’s broader “federation ambition” how about their first Sidewalk device: “this week we announced Fetch, a compact, lightweight device that will clip to your pet’s collar and help ensure they’re safe. If your dog wanders outside a perimeter you’ve set using the Ring app, Fetch will let you know. In the future, expanding the Amazon Sidewalk network will provide customers with even more capabilities like real-time location information, helping you quickly reunite with your lost pet.”

What good is a dog-locating tool that lets you either find your dog in your own yard, or find out that it’s left?  That’s what most owners of wayward dogs can do already.  “Real-time location information” is valuable if the location is somewhere other than on your patio.  Amazon is obviously intending to provide a means of linking their Sidewalk networks and of controlling what information can be shared among them.  This is a real step beyond home control.

My sources tell me that Amazon is looking beyond the home, not only in geographic terms but also in mission terms.  Sidewalk is a strong candidate for smart building applications because of its greater range.  Campus applications would also be possible, and perhaps easy if my rumors about linking multiple facility networks into a federated complex are accurate.

A federation of home/facility IoT networks would be highly interesting, if done right.  Obviously few people are going to be interested in having their neighbors control their thermostats or turn their lights on and off, but it is very likely that users of facility IoT would be interested in something like “groups”, where people and/or companies would define specific IoT communities with specific policies for event- or controller-sharing.  Obviously, that would require formidable security protections, not only in ensuring that only specified “shares” were allowed but also in making sure that people understand the implications of the sharing strategy they set.  Amazon would need to protect itself against liability here.
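
Here’s a hedged sketch of how such group-based sharing policies might be expressed and enforced; the group names, policy shape, and checking function are entirely hypothetical.

```python
# Sketch of the "groups" idea: each facility network declares what it is
# willing to share, and a federation layer checks every cross-network request
# against that policy.  All names and the policy shape are assumptions.
from dataclasses import dataclass, field
from typing import Set

@dataclass
class SharePolicy:
    group: str                                              # the IoT community joined
    shared_events: Set[str] = field(default_factory=set)    # e.g. {"pet_location"}
    shared_controls: Set[str] = field(default_factory=set)  # usually empty!

def allow_cross_network(requester_group: str, owner: SharePolicy,
                        kind: str, name: str) -> bool:
    """Only members of the same group, and only explicitly shared items."""
    if requester_group != owner.group:
        return False
    allowed = owner.shared_events if kind == "event" else owner.shared_controls
    return name in allowed

home = SharePolicy(group="maple-street", shared_events={"pet_location"})
print(allow_cross_network("maple-street", home, "event", "pet_location"))  # True
print(allow_cross_network("maple-street", home, "control", "thermostat"))  # False
```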

If this is what Amazon plans, it would be a practical on-ramp to a broader model of IoT, even one approaching the utopian (and unrealistic) one of having all sensors and controllers open on the Internet.  A model, I might add, that would have product/vendor backing from a powerful source.  Anyone with sensors or controllers could “federate” them to unrestricted use, though of course that would render them open to DDoS attacks and other hacks.  Just what things might be done to protect this open model could be tested in Amazon’s live testbed, on the Sidewalk, so to speak.  I’d like to see Amazon make a go of this, and to see others copy the approach.  We could see a realistic path to at least a broad form of IoT at last.

Open Devices and the New Network Model

I want to pick up on yesterday’s blog about the “new network”, to illustrate how a network operator is responding to the pressures of profit per bit on conventional connection/access services.  Remember that operators have been facing declining profit per bit for over a decade, and this pressure is the force behind declining budgets for network enhancement and for various initiatives to lower costs.  Among those are initiatives to replace proprietary devices, which operators feel are inordinately expensive, with open software, servers, or white boxes.

AT&T has done a lot of interesting things to open up networking and reduce capex and opex, the latest of which is DDC, its distributed disaggregated chassis white-box form factor.  It’s getting a lot of attention in the telco world, as a recent Fierce Telecom article shows, and it may in fact be a game-changer in the open device or “white-box” movement.  It’s also likely to create major problems for vendors, particularly if you combine it with the “new network model” I blogged about yesterday.

AT&T is no stranger to open-model network equipment and software.  It’s already released an open operating system and open-white-box design for a 5G edge router, and it’s been active in open-source software for network transformation.  DDC is an expansion to the AT&T open model, a framework to create a highly scalable hardware design suited for more extensive deployment in 5G and other networks.  It’s also a potential factor in the reshaping of the network that I blogged about, which is important because (deliberately or otherwise) AT&T’s getting out in front of the “new network” while vendors seem to be behind.

As I noted in that earlier blog, modern networks are transforming to a new model that replaces a hierarchy of core routers with a series of on-ramp (access) routers and gateway routers that offer the bridge between the service and access portions of the future network.  Both the on-ramp and gateway routers would be variable in size/capacity depending on the area they were serving.  In this network, agile optics and electrical tunnels would create a mesh within the service network, and would provide the aggregation of electrical-layer traffic onto lambdas (wavelengths) and then fiber.  Routing importance would be diminished, with many features pushed down into the opto-electrical layer.

AT&T’s DDC seems to aim not only at today’s evolving needs, but at this future model.  What AT&T has done is create a model of router that replaces chassis-backplane design with clustering and cabling.  This approach isn’t as good for true transit router missions where any incoming traffic could go in theory to any output trunk, because the cross-connecting of the distributed elements would need too much capacity for traditional external interfaces to handle easily.  It would work in either on-ramp or gateway missions where most incoming traffic was going to a small number (often only one) of output trunks.  The differences in mission would likely require only changes to the configuration of the distributed cluster and to the software hosted on the devices.

AT&T’s description of the DDC concept illustrates this point.  They first explain the chassis-backplane model, then say “But now, the line cards and fabric cards are implemented as stand-alone white boxes, each with their own power supplies, fans and controllers, and the backplane connectivity is replaced with external cabling. This approach enables massive horizontal scale-out as the system capacity is no longer limited by the physical dimensions of the chassis or the electrical conductance of the backplane.”  This is saying that you can expand the “router” cluster through external connections, which of course means that you can’t assume the DDC creates a true non-blocking fabric, because some of those external paths would likely congest.  As a traditional core router, this would be a problem.  As an aggregator, a gateway or on-ramp in my new-network model, it’s far less problematic.
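
One speculative way to picture the DDC idea in data terms: the same white-box unit plays either a line-card or a fabric role, and a cluster is just a parts list plus the external cabling that replaces the backplane.  The box roles, counts, and cabling in the sketch below are illustrative assumptions, not AT&T’s actual design.

```python
# Hypothetical data model for a disaggregated cluster: stand-alone white
# boxes in "line-card" and "fabric" roles, joined by external cabling.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class WhiteBox:
    name: str
    role: str          # "line-card" (access facing) or "fabric" (interconnect)

@dataclass
class DDCCluster:
    boxes: List[WhiteBox]
    cables: List[Tuple[str, str]]   # external cabling replaces the backplane

def gateway_cluster(n_line_cards: int) -> DDCCluster:
    """Aggregation-style cluster: many line-card boxes feeding a small fabric."""
    line_cards = [WhiteBox(f"lc{i}", "line-card") for i in range(n_line_cards)]
    fabric = [WhiteBox("fab0", "fabric")]
    cables = [(lc.name, "fab0") for lc in line_cards]
    return DDCCluster(line_cards + fabric, cables)

cluster = gateway_cluster(4)
print([b.name for b in cluster.boxes], cluster.cables)
```

Scaling out is then a matter of adding boxes and cables to the parts list, which is the point of the quote above; the trade-off is that those external paths are not a guaranteed non-blocking fabric.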

Aggregating to a point, meaning to a very limited number of destinations, is more like a hierarchy than a mesh, and that architecture has been used in data center networks for decades.  You can still provide failover paths within the distributed device mesh (as you can with some data center LAN extensions), but the nice thing about it is that the DDC is built from white-box routers, not from line cards, and those white-box routers are themselves suitable as on-ramp routers where traffic is limited.  They can also be aggregated into small DDC clusters, to serve higher-traffic missions and to provide growth potential (scale-out).

Imagine for a moment that the new-network vision I described in that past blog, and the DDC, combine.  We then have what’s essentially a one-device IP network, where the white-box elements of the DDC are the only real devices needed, and they’re just connected to suit the mission.  It’s obviously a lot easier for a white-box model of network devices to sweep the opportunities off the table when there’s really only one “network device” in use above the opto-electrical layer, connected into different structures to support those two new-network-model missions of access on-ramp and service gateway.

There are two things needed, in addition to DDC, to make this all work.  One is an architecture model of this future network, a model that shows all the elements, both “logical” or virtual and physical.  The other is the agile lower-layer devices that create the opto-electric mesh.  AT&T is working hard on the former piece, and the industry may be converging on, or at least recognizing, the latter.

I think it’s likely that the network gurus at AT&T already understand how the logical and physical elements of this new model would fit.  Remember that they first announced “dNOS”, a disaggregated network operating system, which became the Linux Foundation DANOS project.  Now we have the DDC, which is also “disaggregated”.  They’re working toward an overall model here, for sure.

The opto-electrical piece is something I mentioned in a prior blog about Cisco’s decision to acquire Acacia and Ciena’s challenge in maintaining margins.  I noted that pure optics was plumbing, pure electrical layer was under terminal price and open-model pressure (like DDC), and so both the electrical and optical players would inevitably fight over the middle, the agile lower-layer trunking that would mesh electrical elements and dumb down the higher layer.  Ciena, Infinera, and Cisco all have the technology and contacts to make a go of this agile opto-electric (in the optical vendors’ view) or electro-optical (in Cisco’s view) layer.  The hold-up is the classic “Do I mess with my current quarter in the name of securing my future or hope for divine intervention down the line?” question.

The optical guys have demonstrated they know about this new network model, but are helpless to make the internal transformation needed to address it.  If you’ve done the wrong thing consistently for five or six years, you may still theoretically be able to turn yourself around, but the odds are against it.  Recent announcements by both Ciena and Infinera suggest they’re determined to stay where they are, layer-wise.

Cisco is more of a wild card.  You can do a lot with Acacia’s stuff, including pretty much everything that needs to be done in the opto-electrical layer.  You can also just build a bottom layer in a router, assuming that all the opto-electrical features will be offered only by the big boxes.  That’s hardly an open approach, but it would be a smart move for Cisco.  For the buyer, it might seem to limit the structure of the opto-electrical layer, but in most cases, it may be that the logical place for those opto-electric features will be the same places where there are gateways and on-ramps.  After all, you can’t stick network gear out in parking lots; you need real estate.

Whatever structure might be proposed for the opto-electrical network, the time available to propose it is limited.  Right now, the way this opto-electric magic would work is territory open for the taking, opportunity ripe for positioning.  The “hidden battle for the hidden layer” as I said in a prior blog, is a battle that the combatants are (for now) fairly free to define as they like.  The early positioning in this area, if compelling enough, is going to set the tone for the market, a tone that later entrants will have a very difficult time changing.  This isn’t the first time that we’ve had a small window to try to do great things, but it might well be the most important.

A solid opto-electrical open-device reference design, one that fits into AT&T’s “disaggregated” family, would solve all the router-layer problems…except for those of the vendors.  AT&T has led operators in coming up with open-model network technology.  I’ve not always agreed with their approach (ONAP is an example where I totally disagree), but both DANOS and DDC are spot-on conceptually.  It’s clear that these developments are direct results of vendor pricing and intransigence.  Vendors could stimulate the development of further members in the disaggregated family by continuing with their past attitudes, seen by operators as opportunistic foot-dragging.  It may well be that the network vendors, with their heads in the quarterly-earnings-cycle sand, have already lost the chance to respond to operator pressure to reduce network costs.

Which would leave us with enhancing benefits, which would mean revenue for operators and productivity gains for enterprises.  Operators really need hosted OTT-like experiential services, which their vendors have avoided pursuing.  Enterprise network spending has stagnated because of a lack of new productivity benefits that could drive increased spending.  Network vendors have known about this for a full decade (I know because I told them, in some cases right to the CEO’s face), and they’ve not responded with architectures and products designed to broaden the benefit base for network investment.

Unlocking new benefits may be the next battleground, if cost management fails for traditional services.  If connection services can’t keep the lights on for operators, then they’ll have to climb the value chain, and implement the higher-layer, benefit-enhancing, stuff that vendors have been ignoring.  There are already signs that both operators and enterprises are seeing cloud providers and cloud vendors as the go-to resources for new-benefit planning.  Most of the cloud is driven by open source.  Déjà vu, anyone?

The Future Model of the Future Network: Harnessing the Hidden Layer

We don’t build networks like we used to.  That fundamental fact should illustrate why it’s important to look at how we should build them, given the changes in both technology and demand that have driven networking in a different direction than it followed in even the recent past.  The “right” answer to any network question is dictated by the relationship between capabilities and requirements, and that relationship is changing very quickly.  To understand that, we have to go back to look at what was true, then forward to look at what is, and will be.

Harken back fifty years, to the dawn of packet networking.  We had networks then, to be sure, but they were telephone networks that had been pressed into service to carry data.  It was an imperfect marriage, but at the time that didn’t matter much because data demand was very limited.  People called each other on the phone, and that’s what “communications networks” were all about.

Telephone networks have two important features, created by both human behavior and the pricing policies of the time.  One is “locality”.  Calls tended to be substitutes for physical meetings among people who did meet physically at times, but not all the time.  Long-distance calling was relatively rare because of that substitution factor, but also because long-distance was expensive.  Thus, most traffic went from the copper loop access to the central office, and then back to another “local” loop.  The other is “connectivity”; the purpose of the network was to connect people, playing no real role in their relationship beyond that connection.

When the worldwide web came along, it generated a real consumer data application.  We all know now that the sudden rush to get online transformed the mission of networks from handling calls to handling IP packets, but there were other important transformations that were missed.

The biggest impact was that the Internet and the web are about experiences, not connections.  Users go online to get things, such as retail access or content.  The ability to connect through the Internet exists, but it’s not the experience that generates most of the traffic.  The “connectivity” property is devalued, in short.

And so is “locality”.  Your average web server or content portal isn’t in your neighborhood.  Your traffic is far less likely to terminate on the same facility that it originates on, and that means that the “access network” is really not a part of the data or service network, the Internet.  It’s not responsible for connecting to a bunch of different places, just for getting you onto the public data network that can make those connections.

Mobile networks, when they came along, created another mandate for the access network, which was to accommodate the mobility of the users of the network.  This mobility was a property of the devices and users, not a new destination, and so it again created a mission for access—and now metro—networks that had nothing to do with where the traffic was eventually going, which was still to that public data network.

Now let’s look at the technology side.  From the time of digital voice, we’ve had services that, on a per-user per-instance basis, used less bandwidth than the trunks could carry.  Voice calls used 64 kbps channel capacity, and we had T1 (in the US) and T3 that were 24 and over 600 times that.  Logically, if you build a communications network, you want to avoid trenching trunk lines that are under-utilized, so we developed aggregation models to combine traffic onto trunks.  That aggregation originally took place in the central offices where loops terminated, but it migrated outward to fiber remotes, cell towers, and so forth.

Even the advent of packet networks and the Internet hasn’t eliminated aggregation.  We have “core” and “edge” routers as proof of that.  At the same time, we’ve had different aggregation technologies in the access network, including (gasp!) ATM!  The point is that access and core, or access and service, were beginning to diverge in both mission and technology.

When is aggregation a good idea?  Answer: when the unit cost of transport on an aggregated trunk is much lower than it would be on a per-user or per-destination basis.  If there’s enough traffic directly from point A to point B that the transport economy of scale isn’t much worse than would be produced by further aggregation, then run the direct trunk.  And fiber economies are changing with new technologies like agile optics and even DWDM.
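
As a toy example of that trade-off, the sketch below compares cost per unit of capacity on a lightly loaded direct trunk against a heavily aggregated one; the dollar and traffic figures are invented purely to show the shape of the comparison.

```python
# Toy comparison of the aggregation decision: aggregate when the unit cost on
# a shared trunk is much lower than on a direct one.  All figures are invented.
def unit_cost(trunk_cost_per_month: float, traffic_gbps: float) -> float:
    """Cost per Gbps carried on a trunk."""
    return trunk_cost_per_month / traffic_gbps

direct = unit_cost(trunk_cost_per_month=10_000, traffic_gbps=40)       # A-to-B direct fiber
aggregated = unit_cost(trunk_cost_per_month=25_000, traffic_gbps=400)  # shared backbone

print(f"direct:     ${direct:,.2f} per Gbps")      # $250.00 per Gbps
print(f"aggregated: ${aggregated:,.2f} per Gbps")  # $62.50 per Gbps

# If direct-trunk traffic grows (or fiber economics improve) until the two
# figures converge, the direct trunk wins and the aggregation layer loses value.
```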

One consequence of all of this is that networks divide into two zones, one for access and one for service.  In the access zone, everything aggregates toward the service-zone (public data network) gateway point.  In the service zone, the goal is to give all those gateway points as direct a path to primary resources (those that generate the most delivery bandwidth) as possible.  The presumption, though, is that service resources are distributed through the service network, not placed in a nice convenient central point.  Any of the service network gateway points thus has to be able to get to any service resource, while access network on-ramps need only connect with the service gateway points.

All of the pathways between service gateway points and service resources in the service network are likely to carry enough traffic to justify a direct trunk, meaning at least that it probably wouldn’t make sense to pass the traffic through a transit router.  That makes each path a candidate for either agile optics or a low-level groomed electrical “tunnel”.  Networks would then look like a mesh of low-layer connections between what were essentially large access routers.  However, “large” here wouldn’t be large relative to today’s core routers; those big transit boxes would be devalued by the mesh of opto-electrical paths below.

The “tunnel model” could also be called the “mesh model” because the goal would be to establish Level 1 or 2 tunnels between every element in the network, not the users but the edge points or resources.  If you are “of” the network, you get connected to the other “of’s”.  There could be exceptions to this, for example, to eliminate the need to connect content resources to each other, and there could be a default traditional multi-transit-router path to get low-level traffic around.  The mainstream stuff, though, would bypass the traditional transit routing and go edge to edge.  The impacts of this on optics and routing are pretty obvious, but we should look at each of the main ones anyway.

First and foremost, we’d see a tapping off of features and dollars from the router layer, to be subducted downward into that Level 1 or 2 tunnel or “opto-electric” agile-grooming layer.  This layer would do the steering between edge points on the network, and would also supply fault tolerance and scalability features.  This approach could benefit optical network vendors, as I said in a prior blog, but those vendors have had the opportunity to push this model for at least a decade and have done nothing.  Probably, that will continue.

Second, we’d see a transformation of routers.  Instead of big “core” routers that perform transport missions, we’d have “on-ramps” to get users or resources onto an access network, and “gateways” to get between an access network and a service network, or to connect mass resources directly onto a service network.  Transit routers typically need any-to-any connectivity, but on-ramp and gateway routers really just pass a bunch of input ports to a small number of output ports.  Remember, every virtual pipe in our opto-electrical mesh isn’t a different interface; the traffic is simply groomed or aggregated at Level 1 or 2.

The third thing we could expect is a simplification of the process of routing.  What’s the best route in a resilient mesh?  There isn’t any, because every destination has its own unique tunnel so logically there’s only one path and no need to select.  Think of all the adaptive behavior that goes away.  In addition, the opto-electrical mesh is actually protocol-independent, so we could almost treat it as an SDN and say we’re “emulating” IP features at the edge.  The tunnels might actually be SDN, of course, or anything else, or any mixture of things.  They’re featureless at the IP level.
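
To show how little “routing” is left in that picture, here’s a sketch of an edge device whose entire route decision is a static destination-to-tunnel map with an optional fallback path; the identifiers are hypothetical.

```python
# Sketch of how routing collapses in a fully meshed tunnel model: the edge
# device's "route table" is just a static map from destination edge to tunnel,
# with no adaptive path selection.  Identifiers are hypothetical.
TUNNEL_MAP = {
    "edge-nyc": "tunnel-7",
    "edge-chi": "tunnel-12",
    "edge-lax": "tunnel-31",
}

DEFAULT_TRANSIT_PATH = "legacy-transit"   # fallback for low-volume traffic

def select_output(destination_edge: str) -> str:
    # One destination, one tunnel: there is nothing to "compute".
    return TUNNEL_MAP.get(destination_edge, DEFAULT_TRANSIT_PATH)

print(select_output("edge-chi"))      # tunnel-12
print(select_output("edge-unknown"))  # legacy-transit
```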

That leads to the next thing, which is the simplification of device software.  There’s a lot less work to do, there are different kinds of work (the “emulation” of IP features) to do, and there’s every reason to think that different forwarding models might be overlaid on the same tunnel structure.  VPNs, the Internet, VLANs, or whatever you like, could be supported on the same opto-electrical core.  In my view, in fact, the presumption would be that the higher-layer devices, the “routers-in-name-only”, would be IP control and management emulators combined with agile forwarding capabilities, like those P4 would support.

And then there’s the final thing, which is consolidation of functions into a smaller number of devices.  We have an explicit optical layer and routing layer in networks today, and we’re looking at the Level 1 and 2 “grooming” or “tunnel” layer.  We can’t end up with three layers of devices, because that would destroy any capital or operational economy.  We have to end up with less.  The solution to this new network model, in device terms, has to cut costs somewhere, in some way.  We can eliminate layers, consolidate electrical and optical.  We can build out optical and dumb down electrical, with simplified routing and open-model devices and software.  We have to do one, or both.  Things can’t go on as they have been going, because that’s lowering operator profit per bit and threatening network expansion.

All of this is driven by the assumption that profit per bit has to be improved by improving cost per bit.  Operators have faced declining revenue per bit for a decade or more, and the most obvious counter to this is to reduce cost per bit so profit per bit stabilizes.  We’re seeing lots of different steps to address cost reduction, but what I’m calling the “new network” is one of the most obvious but also most profound.  If networks are fundamentally different today, both because of the demands placed on them and because of the technology that builds them, then why should today’s networks look, topologically speaking, like models of the past?

Inertia, of course, but whose inertia?  Operators blame vendors for not responding to their profit issues, and instead supporting their own profits.  If this is true (which at least in part it is), then vendors are doomed to eventual disappointment.  You can’t expect your customers to run at a loss so you can earn improved profits.  The new network model I’m describing may not be a policy of operators, a united vision that spans the globe, but it’s still a reality because what drives it is a reality.

Operators are at a critical juncture.  Do they believe, really believe, their vendors are scamming them?  If so, then an open model for devices and networks is the answer.  Some operators are already embracing that.  Others may well follow.  Tomorrow we’ll look at the way the leading operator in open transformation is proceeding, and how their approach might (or might not) sweep the market.

Virtualization and Cost Reduction

Is virtualization a cost-saving strategy?  An article in Light Reading on Monday talks about whether a single-vendor virtual network is more likely to save money than a multi-vendor network (Huawei is a major source for the data).  That’s a reasonable question, perhaps, but I don’t think it’s the central question here.  That honor goes to the question I opened with, and there are other better questions that big question introduces.

The operators I recently connected with were all over the map regarding whether “virtualization” saved money, or was even intended to save money.  The reason was that there are many different views of what constitutes “virtualization”.  Is it NFV?  If so, most operators don’t think it will deliver substantial bottom-line savings.  Is it SDN?  Operators generally think SDN can save money.  Is it white-box switch technology?  Nearly all operators think white boxes will save, but most doubt they’re “virtualization”.  But the big point I got was that most operators are seeing virtualization as a means of creating agility or resiliency, and thus aren’t assigning savings goals to it.  To me, it’s the SDN, white-box, and virtualization-goals topics that need some exploration, but I’m going to attack them in the opposite order.

Almost every operator sees virtualization as a key element in any feature-hosting model, whether they believe in NFV broadly or don’t.  Right now, there are three camps on virtual-feature-hosting strategies.  The smallest, which is represented to a degree in most operators but discounted in terms of bottom-line impact, is “true” NFV.  The next camp is the 5G cloud-native group, who believe that 5G will employ cloud-native technology for feature hosting but won’t conform to the NFV ISG specs, and the last group is the group that believes they’ll rely on virtualization, probably in vSwitch and SDN form, in all their data center strategies.

A few, but growing, number of operators see virtualization as a general tool in making their transport networks cheaper.  “Transport” here means largely mobile metro or IP core, and in both cases the operators are looking primarily at a form of SDN, often roughly mirroring Google’s Andromeda model of an SDN core surrounded by an IP/BGP “emulator” edge that lets their structure fit into the Internet seamlessly.  They’re, in a real sense, doing “network virtualization” as I’ve described it, abstracting the entire transport network as a black box that looks like IP from the outside.

At the access network level, the greatest interest is in the white-box model.  Operators are generally more white-box focused where there’s a large number of new devices likely to be deployed, which means both the 5G edge and CPE.  This separation is interesting because it connects this approach to the NFV camp in feature hosting.  In fact, about a third of operators see the possibility that white-box uCPE and NFV could combine usefully.  That’s far less likely for missions like 5G edge, where operators overall tend to favor the AT&T-launched DANOS model of a highly programmable edge software OS that marries to a white-box switch.

You might wonder what this has to do with multi-vendor, and the answer at the high level is “not much”.  Most of the operator virtualization objectives are implicitly (at the least) linked to a no-vendor or open model.  But do single- or multi-vendor virtual networks really create significant differences in TCO, as Huawei’s comments in the article suggest?  That’s a lot more complicated.

Virtual networks are typically transport overlays, meaning they ride on something else, something “real”.  If the networks involve something that has to be explicitly hosted, then they introduce capex.  In all cases, since the virtual network is still a network in terms of behavior, they all introduce opex.  But the numbers Huawei suggests seem to be averages across a universe with an enormous variation in experiences.

Huawei tells Light Reading that virtualization will initially cause costs to rise, and then with single-vendor implementation, costs will fall to 91% of the original amount.  Even if we consider the “original amount” to be the pre-virtualization costs, every operator knows darn well that they’d never approve a project that delivered a 9% reduction in TCO.  In fact, the article from Light Reading I cited on Monday quoted an operator as saying that NFV projects had to deliver a 40% reduction in TCO to even be interesting to talk about.  But the reality is, as I noted, that most virtualization isn’t really about lowering TCO, it’s about improving agility and scalability, or about avoiding cost rather than lowering a current cost.

What about the multi-vendor assertion?  I go back to my point that most virtualization processes are really about getting rid of vendors in favor of open platforms, both hardware (white-box) and software (virtual features/functions or software switching/routing).  What I think Huawei may be doing (besides an expected bit of self-promotion) is relating that if you have an application for virtualization that’s pretty broad, as 5G would be, you could approach that application in single- or multi-vendor form.  In the latter case, there could indeed be cost issues according to my own operator sources, but it’s more complicated than being just a “multi-vendor” problem.

There are, as they say, many ways to skin a cat (no disrespect to cats intended!), and vendors tend to look for differentiation, which means there’s probably a different way selected for each of the major suppliers you might involve in 5G.  The 5G connection is critical here, first because 5G is Huawei’s biggest current target, and second because 5G is at the moment incredibly vague overall, and particularly so in terms of how virtualization is used.  In fact, 5G could well be the space that proves that most of the “NFV” implementations being proposed are “ninos”, meaning “NFV in name only.”  They don’t abide by the NFV ISG spec because the spec is of no value in the applications of virtual functions to 5G.  It does bring a feeling of openness and comfort to wary operators, though.

Going back to Google Andromeda, the thing about virtualization is that it either has to be packaged inside a network abstraction that, at the edge, conforms to standard industry protocols and practices (IP/BGP in Google’s case), or you have to adopt a specific virtual-network standard.  What’s that?  Actually, the problem with virtualization standards is that we don’t have a broad, universal, picture of a virtual network ecosystem so we don’t really know what “standards” we should expect.  That makes the standards idea a kind of wild west myth, and so it could well be dangerous to consider a multi-vendor virtual network.

Even trying to create rational standards and interoperability could be politically dangerous for vendors, too.  Back when NFV looked more promising, HPE created a press tornado by proposing to emphasize specific vendor relationships in its implementation of NFV in order to evolve to a secure, orderly, and deployable community of elements.  It didn’t sound “open” and HPE got trashed, but of course it’s more open than the single-vendor approach.  The HPE strategy was proof that we weren’t completely addressing the integration issues of virtual networking, leaving too much to be handled by individual vendors.

Standardizing virtualization interfaces at this point would be difficult, because we don’t have a single source of them, and because many of the attempts that have been made aren’t likely to be effective.  NFV wouldn’t have onboarding problems if it had effective virtualization interfaces to standardize.  Multiply NFV’s issues by the ones that could be created by ETSI ZTA, 3GPP 5G, and ONF SDN, and you can see how much disorder even “standards” could create.

What we need may be less a virtualization standard than a virtualization architecture.  Let’s assume that we’re going to virtualize a device, a box.  The model is that we have one or more hosted elements “inside” and these behave so as to present an external interface set that fully matches the device the virtualization/abstraction represents.  We could call this an “abstraction domain” (AD).  The model could then be extended by saying that a collection of ADs that operate as a unit is also an AD, presenting the kind of interface that a network/subnetwork of devices would present.  Inside an AD, then, can be anything that can be harmonized to the external interface requirements of the thing the AD’s abstraction represents.  You abstract a box, you look like a box, and if you abstract a network or subnetwork, that’s what you look like.  Inside is whatever works.
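
Here’s a hedged sketch of how that nesting might be expressed; the interface names and the presentation method are placeholders for illustration, not a proposed standard.

```python
# Sketch of the "abstraction domain" (AD) idea: a device AD and a network AD
# expose the same kind of external interface, and an AD can contain other ADs.
from dataclasses import dataclass, field
from typing import List

@dataclass
class AbstractionDomain:
    name: str
    external_interfaces: List[str]              # what the AD looks like from outside
    members: List["AbstractionDomain"] = field(default_factory=list)  # inner ADs, if any

    def presents_as(self) -> str:
        kind = "network/subnetwork" if self.members else "device"
        return f"{self.name} presents as a {kind} via {self.external_interfaces}"

# A virtual router AD (hosted elements inside, router-like interfaces outside)...
vrouter = AbstractionDomain("vRouter-1", ["BGP", "SNMP"])
# ...collected into a subnetwork AD that itself presents router-like interfaces.
metro = AbstractionDomain("Metro-East", ["BGP", "SNMP"], members=[vrouter])

print(vrouter.presents_as())
print(metro.presents_as())
```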

This approach, I think, is more realistic than one that tries to define a new set of control/management protocols for the virtualization process.  A side benefit is that it creates interoperability where we already need it and have it, at the device/network level.  I think that if 5G virtualization is ever going to amount to anything, this is how we’ll need it to be done.

One thing I think is clear about 5G and transformation alike, which is that operators are committed to open models.  Sorry, Huawei, but I reject the notion of a single-vendor solution as the ideal for optimizing TCO.  Why bother with open technology if that’s the approach you take with it?  We need to make open approaches work, not only for operators but for the market at large.