Should We Focus on Portal-Based Support, not Lifecycle Automation?

The biggest operations problem carriers face is more debatable than you might think.  The CFOs and most CEOs say it’s high operations cost, but some in the operations and CIO groups think it’s really customer portal problems.  Telcos and cable companies alike have been cutting operations costs by cutting headcount, largely by eliminating humans as tech and support contacts in favor of automated systems.  That shift has created serious issues for many of them, to the point where some say that churn created by service problems is costing more than they’ve saved by reducing the number of human agents.

Let’s look back to roughly 2016.  By this point, it was clear to most operators that SDN and NFV were not going to be effective in transforming network infrastructure to improve profit per bit.  The real problem wasn’t capex, the target of both SDN and NFV, it was opex.  Not only is opex a larger portion of every operator revenue dollar than capex, but SDN and NFV’s capex savings were eroded by opex growth.  But every attempt by operators to improve opex through a generalized service lifecycle automation strategy failed just as SDN and NFV had before them.

The response of operators was predictable.  Pick low apples.  There were areas of operation that were human-labor-intensive, and many of those areas could be addressed (they believed) through automated systems, through portals that let customers interact with support systems without a customer service rep acting as an intermediary.  These low apples had the advantage of being obvious and also being technically contained.  You didn’t need an over-arching lifecycle automation system that you couldn’t really seem to get off the ground.

Now, we’re seeing the results of this one-off operations cost targeting strategy.  Any major problem, according to three Tier Ones, results in a whole community of unhappy users, and this creates the largest risk to churn that operators face.  Sure, competitors can lure customers with special pricing or offerings, but those can be easily countered by simply meeting their terms.  On the other hand, operators say that two outages, combined with bad support experience, will mean that a significant portion of the impacted users will actively seek an alternative provider, and will stay with you only if none can be found.

Operators say that churn accounts for almost a third of opex, and it’s clear to most of them at this point that their strategies for reducing opex in other areas have increased churn risk.  I did a general survey of my contacts asking not about their business networking and IT but their personal mobile and wireline broadband, and I got some interesting results.

Over a period of a year, a third of them had experienced what they called a “significant outage”, which meant that they lost their connectivity for more than 2 hours during a time when they relied on it.  They had experienced this sort of thing before, but what was interesting was that every one of them said that the response of their operator to the 2020 outages was far worse than before, and just over half said that it was unsatisfactory.  About a third said they’d tried to change providers.  I can’t claim statistical rigor here, but even assuming this represents a kind of network-consumer elite view, it’s troubling to say the least.

One almost-universal complaint is that operator support of broadband often implicitly depends on the availability of the service that’s failed.  My home or office broadband is down, but the support portal requires access to the Internet.  When I try to access it via mobile phone, it’s difficult to communicate the identity of the service that’s failed.  Maybe I need a bill.  Maybe I’m hunkered down in a closet in my basement trying to read a little label on the router.  Maybe I don’t have mobile service in that closet.  Almost 80% of my contacts said that just getting to a proper support page for the service that was involved was difficult.

The second most frequently reported problem (cited by over half of users) was that the support system misreported the status of the service involved.  There were three specific complaints, each cited almost an equal number of times.  First, the system reported that the service was operational, when clearly it was not.  Second, the system ran through a series of tests the user was expected to run, and at the end of the tests had no suggested remedy.  Third, there was a systemic outage (the “cable-seeking backhoe” problem) and the support system failed to note it.

When a problem was eventually acknowledged, users said they were run through a dialog telling them that if this was their problem they’d have to pay for support.  This often took several screens and several acknowledgements that you’d read and accepted this or that statement.  They were then asked if they wanted to schedule a service call, and if they did, were told that it would take an average of about a week to get it.  This, mind you, in cases where a whole neighborhood was out.

Users also said that there were major inconsistencies in the information they were given.  In almost two-thirds of cases, the online support system conveyed incorrect status, and the automated voice response system, if used, offered correct information.  In the cases where the problem was systemic (impacting multiple users), voice response usually provided a reasonable estimate for remediation.

International use of mobile devices and domestic wireline broadband services for consumers and small business sites are the two areas where users are most likely to be unhappy with online customer support.  They’re not the only areas, though.  Enterprises tell me that if they have a problem, they expect to be able to call a human account rep for support.  They’re willing to do simple things to change a service online, but they want problem resolution done via a customer service rep.  Why?  “Because the online stuff” either “takes too long”, “doesn’t work”, or “requires too much skill to use.”

Broad call-center systems can be helpful in addressing this problem, but they’d have to include online support and also have to provide an agile linkage to operator OSS/BSS and perhaps even NMS systems to get the correct information.  The big benefits of this path are the fact that it tends to harmonize support look-and-feel, reduces the cost of multiple solutions, and can provide for escalation to a voice call, providing that online support can be handled through the same system.  However, to be effective these systems still have to meet the criteria I describe below.  This is the path most operators are currently exploring.

What’s the take-away on this?  I think there are two truths that emerge in any serious exploration of customer support.  First, we don’t have portals down pat, and their poor design makes things harder than they need to be.  Second, we need some sort of local agent process to do things on the user’s premises.  That might mean providing an app or browser extension, or it may require a feature set in the user’s device.

Suppose that a broadband access point for wireline Internet had a QR code on it.  Suppose your bill had the code.  Suppose you had an app on your phone, linked to your wireline broadband account, that could immediately provide information to the support connection via mobile broadband, if it was wireline that failed?  Suppose that everything about service state was recorded in a common database (which it usually is) and accessed by agents, voice response, or online support, via the same set of APIs?  That gets us started.
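To make that “same set of APIs” point concrete, here’s a minimal sketch, in Python, of what a channel-agnostic status service might look like.  Everything in it (the ServiceStatusStore class, the field names, the sample data) is invented for illustration; it’s not a description of any operator’s OSS/BSS.

```python
# Hypothetical sketch: one status store, one API, many support channels.
# Names (ServiceStatusStore, get_status, channel list) are illustrative only.
from dataclasses import dataclass

@dataclass
class ServiceStatus:
    service_id: str      # e.g., scanned from a QR code on the router or bill
    state: str           # "up", "degraded", "outage"
    systemic: bool       # True if a neighborhood/area outage is known
    estimated_fix: str   # remediation estimate, shared by all channels

class ServiceStatusStore:
    """Common status database; every channel reads the same record."""
    def __init__(self):
        self._records = {}

    def update(self, status: ServiceStatus):
        self._records[status.service_id] = status

    def get_status(self, service_id: str) -> ServiceStatus:
        return self._records.get(
            service_id,
            ServiceStatus(service_id, "unknown", False, "unknown"))

# Portal, voice response, and the CSR desktop all call the same API, so a
# QR-code scan from a mobile app can't disagree with the agent's screen.
store = ServiceStatusStore()
store.update(ServiceStatus("acct-1234-wireline", "outage", True, "ETA 4 hours"))
for channel in ("portal", "voice-response", "csr-desktop"):
    s = store.get_status("acct-1234-wireline")
    print(channel, s.state, "systemic" if s.systemic else "isolated", s.estimated_fix)
```

The design point is simply that every channel is a different view of the same record, so the inconsistencies users reported above can’t arise from the data layer.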

Portals are another dimension of the problem, and the solution.  Too many portals are ill-conceived.  They take information that might be available to a CSR and sanitize it for user consumption, forgetting that the CSR has access to other information, and is trained to respond to problems that a given user may be seeing for the first time.  You really need to think of a portal as a compositor of multiple information APIs, something that can compose an information view from all the available facts.  In software terms, it’s a “Storefront design pattern”, and you can change the user experience quickly by changing the “storefront” appearance and the way it integrates its API-based information streams.
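Here’s a small, hypothetical sketch of that storefront idea: the portal composes one view from several information APIs, and you can re-skin or re-compose it without touching the back ends.  The source functions are placeholders I’ve made up, not real operator interfaces.

```python
# Hypothetical "storefront" compositor: the portal composes a view from several
# information APIs. The sources (network_status, billing, field_ops) are invented.
def network_status(service_id):
    return {"state": "outage", "systemic": True}

def billing(service_id):
    return {"account_ok": True, "support_tier": "standard"}

def field_ops(service_id):
    return {"truck_roll_eta": "tomorrow 10:00"}

class SupportStorefront:
    """Composes one user-facing view from many back-end information streams."""
    def __init__(self, sources):
        self.sources = sources   # swap or reorder sources without touching back ends

    def compose_view(self, service_id):
        view = {}
        for source in self.sources:
            view.update(source(service_id))
        return view

portal = SupportStorefront([network_status, billing, field_ops])
print(portal.compose_view("acct-1234-wireline"))
```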

I think that the combination of an agile multi-source portal design and local agents to facilitate the exchange of critical information is fundamental to good customer support.  That this combination is harder than pasting a layer of information distribution on top of current CSR resources is obvious, but that the simpler approach is at risk of failing is even more so.

What about that systematic service lifecycle automation option?  This was, is, and will likely always be my own preferred approach, and I think that by the time operators get through their one-offs and fixes and corrections, they’ll have paid more for their anemic efforts than they would have for a good solution.  The problem is that they don’t know how to get to a good solution, and the fact that we’ve fumbled the ball on this for a decade now suggests that train has left the station.  If that’s true, then we need to get customer care fixed, and quickly.

Can We Think Our Way to Better Broadband?

Could better knowledge, better data, create better broadband?  That’s a question that the FCC is looking at, according to a story in Light Reading.  Conceptually, the idea is an outgrowth of a general FCC look at the role of AI, undertaken by the Technical Advisory Council.  The FCC isn’t noted for its speed relative to the market, and so the fact that this idea is just an application of a general TAC investigation of AI doesn’t bode well for quick answers.  Since I’ve attempted to use statistics to assess broadband potential, I’d like to take a look at what seems to be missing in such efforts.

The profitability of broadband is the biggest determinant of its availability.  Globally, telecom has been “privatized” for decades, and that means that providers are typically public companies with a profit motive, not regulated monopolies or agencies of the government.  That means that a given market geography is likely to have broadband proportional to the opportunity the market area represents to operators.  That would equate to the ROI the broadband services could earn there.

The opportunity that a given area would represent is fairly easy to classify; it’s related to the gross domestic product (GDP) of that area, to the household income, to local retail sales, and so forth.  This sort of data can be obtained fairly easily on a macro basis, by country or area within it (down to the state level, using the US as an example).  Neighborhood-level data is difficult to obtain because it’s expensive to develop and often is collected only through census-taking, often done only once a decade and so usually fairly out-of-date.

The real problem is the cost picture.  You can gain a reasonable idea of the cost and benefit relationship if you alter your “opportunity” assessment to be on a per-unit-area basis.  My “demand density” calculations do that, assessing opportunity per square mile in the US and relating demand density in either parts of the US or other countries to the US metric (the US has a demand density of 1.0 in my work).  This works fairly well where demand is fairly concentrated, which in my own model means that demand density is at least 2.0 on my scale, but it can fail with lower demand densities.
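As a toy illustration of the normalization (not my actual model or data), here’s how a demand-density figure could be expressed relative to a US baseline of 1.0.  Every number in the sketch is a placeholder.

```python
# Toy illustration of normalizing "demand density" (opportunity per unit area)
# against a US baseline of 1.0. All numbers are placeholders, not real data.
def demand_density(opportunity_units, area_sq_miles):
    return opportunity_units / area_sq_miles

# Placeholder national totals, used only to set the 1.0 reference point.
US_BASELINE = demand_density(opportunity_units=21_000_000, area_sq_miles=3_800_000)

def relative_density(opportunity_units, area_sq_miles):
    """Express a region's density relative to the US = 1.0 reference."""
    return demand_density(opportunity_units, area_sq_miles) / US_BASELINE

# A hypothetical dense metro versus a hypothetical rural region.
print(round(relative_density(2_500_000, 100_000), 2))   # well above 1.0
print(round(relative_density(150_000, 200_000), 2))     # well below 1.0
```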

The problem is that as demand density falls, the question of how you’re going to deliver broadband becomes critical.  A good example is fiber to the home (FTTH).  Based on current passive optical network technology, FTTH can be profitable where demand densities are around 5.0 or higher.  CATV infrastructure can be profitable at demand densities of around 2.0 and higher, and 5G/FTTN hybrids could be profitable in about the same 2.0-and-up range.  However, these technologies have limitations in range of coverage, so average demand densities won’t tell the story.

What we’d really need to have is demand density data that, in US terms, was per zipcode or even tighter.  Think of it as a “subdivision” issue.  When you concentrate people, either in terms of where they live or in terms of where they work and shop, you create not only a concentration of demand but a simplification in demand fulfillment.  More stuff works in the concentrations than would work outside.

A home in the country, even a home of a big earner/spender, is almost a one-off in terms of provisioning a service, because it may be a mile or so from other dwellings and thus isn’t able to share provisioning with others.  The only thing that’s going to connect this kind of place profitably is an over-the-air service with decent range.  A home of the same value, in a residential subdivision with lots that average 15,000 square feet, is part of a collection that could be served by perhaps a single head-end, a PON deployment, CATV, and 5G/FTTN all at once.

There are also often issues of public policy involved.  The US is hardly unique in its debates about the “digital divide”, the difference in broadband empowerment between rural and urban/suburban areas.  Almost all countries with respectable telecom infrastructure will have these three zones of habitation, but the nature of the three will vary.  Australia and Japan have cities, suburbs, and rural areas, but for Australia even the cities and suburbs are of generally lower density.  In Japan, demand density at the national level makes good broadband almost inevitable.  In Australia, they’ve resorted to a not-for-profit company to semi-nationalize broadband in an effort to improve quality for all.

There may be a special role for AI even without any better data to work with, though.  One example is that AI might be used in conjunction with mapping software to identify business and residential locations and create opportunity density and access efficiency data.  While this might not be completely up-to-date, it would still be a significant improvement over what’s available today to guide public policy.  Most operators have their own databases showing connections and business or residential locations, and this would also be suitable as a jumping-off point for AI analysis of broadband ROI, which could help their own service and infrastructure planning.

What, exactly, could be done with this data, whatever its source?  Let’s start by saying that the best single tool would be what I’d call an ROI gradient.  Think of it as a three-dimensional map whose first two dimensions are classic map dimensions and whose third is the ROI potential of each point.  The surface created would point out the best and worst areas, and if we presumed that an ROI surface was created for each technology option based on its technical limitations and estimated pass and connect costs, we could get a pretty good idea of where service would be profitable and where it might not be.
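A rough sketch of how such an ROI gradient might be computed is below.  The per-technology thresholds loosely echo the 5.0 and 2.0 figures mentioned earlier, but the cost and revenue factors are invented for illustration, and the “ROI” is only a proxy, not a modeled result.

```python
# Rough sketch of an "ROI gradient": for each cell of a map grid, estimate an
# ROI proxy per access technology from demand density plus assumed costs.
# All thresholds, cost, and revenue factors are invented for illustration.
TECHNOLOGIES = {
    # name: (min workable demand density, relative cost per household passed,
    #        relative revenue potential)
    "FTTH (PON)":     (5.0, 1.00, 1.50),
    "CATV/DOCSIS":    (2.0, 0.60, 0.80),
    "5G/FTTN":        (2.0, 0.55, 0.70),
    "Fixed wireless": (0.5, 0.30, 0.35),
}

def roi_estimate(density, min_density, rel_cost, rel_revenue):
    """Crude ROI proxy; None if the technology isn't workable in this cell."""
    if density < min_density:
        return None
    return round(density * rel_revenue / rel_cost, 2)

def roi_surface(density_grid):
    """Build a per-cell map of the best technology and its ROI proxy."""
    surface = []
    for row in density_grid:
        out_row = []
        for density in row:
            options = {name: roi_estimate(density, *params)
                       for name, params in TECHNOLOGIES.items()}
            viable = {name: roi for name, roi in options.items() if roi is not None}
            best = max(viable, key=viable.get) if viable else "no viable option"
            out_row.append((best, viable.get(best)))
        surface.append(out_row)
    return surface

# A 2x3 toy grid: dense urban on the left, thinning toward rural on the right.
grid = [[8.0, 3.0, 0.8],
        [6.5, 2.2, 0.3]]
for row in roi_surface(grid):
    print(row)
```

The output is exactly the kind of surface described above: fiber wins in the dense cells, cheaper technologies win in the middle, and the last cell shows where no option clears the bar without subsidy.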

For operators, something like this would let them know the best way to address a given demographic or geographic market segment, and whether there were any ways good enough to make service there fruitful absent any subsidization.  Governments, of course, could use the same data to target subsidies where they’d actually make a difference, and to set rules on technology used to ensure that subsidies were actually delivering the optimum technical choice for the area.  That would reduce the risk of “stranded subsidies” where funds are allocated to some technology that, in the longer term, would be unlikely to be considered competitive in terms of service delivery capability.

The barrier to better broadband through knowledge, though, really starts with accepting that it’s actually the only path to it at all.  There’s enormous economic and demographic/geographic diversity in most countries, and even more when you consider the essential locality of service technologies.  We’ve managed to ignore this for well over a decade, which shows we could likely find a way to continue to ignore it today.  If we do, then we’ll always have a “digital divide”, and it may get worse over time.

Network Transformation: Outside-in or Inside-out?

Inside-out or outside-in?  That’s the biggest question in network transformation, for operators and for vendors alike.  It’s a question that involves assessing a lot of complex factors, ranging from the nature of new network technology ideas, sunk costs and financial depreciation, budgets, mindsets, hopes, and fears.  The balance of these forces is constantly shifting, but there are some signs that things are stabilizing, and that’s important for everyone in the field.

It’s been about two decades since our last “transformation” in networking, the point where consumer broadband and the Internet combined to shift operator investment decisively from TDM toward packet and IP.  That transformation was driven by opportunity, the first opportunity to turn consumers into an army of broadband buyers.

Since then, operators have found themselves trapped in a snare of their own making.  Consumers were unwilling to pay proportionally for capacity, so the competitive dynamic of the market forced operators to provide more capacity for little or no incremental revenue.  This declining profit per bit has put a damper on telecom spending for years now, and it’s the major factor cited by the network equipment vendors as they moan to Wall Street about systemic forces that are hurting their growth.

We’ve had many initiatives that promised to improve profit per bit by reducing cost per bit.  Some have focused on streamlining operations, and these have been largely successful, but they’ve impacted customer support and operator reputation.  Some have focused on reducing capex, and things like SDN and NFV have failed to deliver on the hopes of their proponents.  Other things, like “white box” networking, are showing promise.

The inside/outside situation I opened with is a result of the tension between these “inside” network infrastructure movements and another force, the 5G modernization force that’s attacking network infrastructure from the outside.  The consequences of the collision of these forces are themselves significant, and when a winner is decided, it will likely determine the fate of a bunch of products, or even vendors.

The “inside” force in network transformation has, in theory, both a capex and opex dimension.  On the capex side, “transformation” has recently been focused primarily on white-box networking, largely because the current proprietary-box networks can be evolved more easily if we presume box-for-box substitution.  The barrier to real transformation has been the fact that the residual depreciation on network devices has a wide range, and where it’s large, it’s difficult for operators to bear the write-down cost.

Operators also relate a significant problem of finger-pointing that arises when you introduce one (or a few) new boxes into the router-network mix.  One operator said “We never had a fault isolation problem before, and now we have a problem with every fault.”  They believe the incumbent vendor or vendors are trying to discredit the new devices, which is likely true.

The opex dimension is, of course, related to this.  One solid truth in network operations is that “Change is bad until proven otherwise!”  People are conditioned to a given set of practices and tools, and any changes in either will generate confusion, pushback, and likely higher costs, at least for a while.  For that reason, an increasing number of white-box proponents are focusing on operations transparency, which means looking like an incumbent box in all respects.

Box-wise substitution, even with measures to mitigate operations issues, is a kind of death by a thousand cuts, because every new box brings a new flood of protests from incumbent vendors and their supporters.  Given that major router vendors have certification programs whose certificates play a big role in professional advancement and in securing job security and mobility, there are a lot of those supporters in any operator organization.

One could argue that the larger problem in the mister-inside evolution of networks is the lack of a new network architectural model.  We are surely short-changing transformation by insisting that everything look the same as it always did, and run the same as well.  Operators admit that their operations automation projects tend to bite off little low-apple chunks of cost rather than addressing the overall operations picture, but they can’t do better without knowing how “better” would fit together.  That’s true on the capex side too; one reason both SDN and NFV fell short was lack of a credible overall model of how a software-defined or virtual-functionalized network actually looked.

We’re seeing some successes in the inside-out space, where there’s a credible driver to making a change to an entire part of a network—a geographical area or something like the IP core network.  The core has proven a fruitful target because core routing is distinctively different (increasingly it’s about shuffling label-switched paths) and often doesn’t involve a huge number of devices.  Where operators’ demand density is low, radical core-network capex reduction looks really attractive.

What’s pretty clear at this point is that the majority of operators, particularly the larger ones, are having a problem with even core-replacement strategies.  Some (Tier Threes in particular) don’t really have much of a core.  Some have high enough demand density that their infrastructure is fairly cost-efficient.  For the rest, it’s a long slog through the CFO’s approval process, and that’s why the “outside” strategy looks better for some.

The big advantages of the outside-in transformation, based on 5G, are first that 5G is budgeted, second that 5G standards call for function hosting, and third that there are already open-model 5G options in play.  Open RAN is gaining a lot of traction, for example.  But for many, the big advantage is that “services” and any personalization of handling are traditionally edge functions, and that’s where 5G infrastructure goes.  In fact, 5G may be converging wireline and wireless at the edge.

I think that the real benefit of a 5G-inward approach to transforming network infrastructure is the combination of the practical and theoretical from the benefits I listed above.  On the practical side, 5G is budgeted and happening on a broad scale.  On the theoretical side, 5G’s scope matches the part of the network where service value is created; the deeper you go, the more things look like simple bit-pushing.  If I’m right, then Open RAN is the single most important networking initiative of our time.

Something with this much positive potential just has to have a downside, of course.  The big downside for the outside-with-5G approach is that 5G’s business case is sketchy, budgeting notwithstanding.  There is surely a value for 5G RAN, in the subscriber density per cell and in 5G/FTTN hybrids for fixed wireline service replacement.  Beyond that, you’re entering the Great Question Mark, where CFOs dig past your high-sounding statements and demand to see hard ROI numbers.  They’re not easy to come by, and that means that 5G is at risk for being discredited as a broad-scale technology revolution.  That could mean that the service potential that 5G could bring will never come.

It would seem that the logical response to the inside/outside dilemma would be a “both-sides” approach, meaning the combining of 5G-driven edge initiatives with more generalized modernization strategies, including white-box networking.  We already have white-box initiatives at the edge, after all, driven by 5G.  Couldn’t they be pushed deeper?  Sure, obviously, but the question is whether we could promote any additional business cases for doing that.  The one place that seems fruitful in the uniting of our two sides is the control plane.

I’ve pointed out the benefits of separating the IP control plane, and uniting it with 5G’s control plane, in prior blogs, notably HERE.  Topologically speaking, a new control plane model is the logical place to unite everything in a transformed network; it spans everything (or could).  Technically speaking, a united control plane would allow service features and benefits, applied at the edge, to be coupled with deeper behavior, as deep as the business case permitted.  It would also provide a mechanism to coordinate edge behaviors throughout a service, even if the core didn’t participate.

Most sellable experiences are rooted in hosted content or applications, meaning they’re run in a data center somewhere.  If 5G edge features were coupled to data center networking features, then all those experiences could be optimized at both the requestor and supplier endpoints.  Would it be possible to use a separate IP control plane, merged with 5G’s control plane, for this?

Yes, but.  The “but” is that something like this is almost certainly going to require either standardization or open-source implementation.  If we look at IP topology exchanges today, they’re adaptive on a per-hop basis, meaning that routers pass things along in a kind of vast daisy chain.  Could we simply say that a combination of core topology and edge reachability could be more widely advertised?  Could we propagate reachability directly from the edge to other edges?  Sure, if we agreed on a mechanism that could be widely supported.
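Purely as a thought experiment, here’s what a wider, edge-to-edge reachability advertisement might carry if such a mechanism were agreed.  The message format is entirely hypothetical; it isn’t drawn from any standard or vendor implementation.

```python
# Thought-experiment sketch only: what an edge-to-edge reachability advertisement
# might carry if a separated IP/5G control plane propagated edge state directly,
# rather than hop by hop. The message format is entirely hypothetical.
import json
import time

def build_edge_advertisement(edge_id, prefixes, core_paths):
    return {
        "edge": edge_id,
        "timestamp": time.time(),
        "reachable_prefixes": prefixes,   # what this edge can deliver to
        "core_paths": core_paths,         # e.g., label-switched paths it can use
    }

def merge_advertisements(adverts):
    """A receiving edge builds a picture of all peer edges from their adverts."""
    return {a["edge"]: a["reachable_prefixes"] for a in adverts}

ads = [
    build_edge_advertisement("edge-east", ["203.0.113.0/24"], ["lsp-1"]),
    build_edge_advertisement("edge-west", ["198.51.100.0/24"], ["lsp-2"]),
]
print(json.dumps(merge_advertisements(ads), indent=2))
```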

That’s one reason why 5G’s outside-in value may be decisive.  5G could converge all broadband under a single edge approach, either because of 5G/FTTN hybrids that used some 5G control plane features, or simply for effective roaming between WiFi and 5G.  The concept of an edge partnership over a more passive core could be an outcome.

That suggests that the biggest value for a separate control plane for IP is that it facilitates 5G control plane integration.  If that’s true, then it means that most vendors who advocate control plane separation will have to take a harder look at integrating 5G control plane features, and that could give rise to a major shift in how service control planes in other areas, ranging from CDNs to service meshes and cloud computing, are handled.

That may have to happen quickly too.  The market in 2021 is going to be very competitive, and a decisive link between open-model 5G and open, transformative, network technology in general would be pretty compelling to operators looking for profit-per-bit relief.  The earliest well-stated positions in the space could be decisive.

Strategies for Automated Operations

There seem to be a lot of announcements around automated operations these days.  Many, perhaps even most, involve artificial intelligence and machine learning.  Obviously a part of this is due to the tried-and-true tactic of vendors jumping on hot terms for SEO, but part is also due to the fact that operations issues are increasingly recognized as being minefields of cost and churn.  To separate the hype from the truly useful, we need to frame a standard approach that’s optimized for reality.

Martin Creaner offered an assessment of telcos’ digital transformation progress that touches on this point with a number of his areas of assessment.  He makes a lot of good points, so while I don’t necessarily agree with everything, I think his view is at least as good as my own, and you’d benefit from considering the piece in conjunction with what I write here.  As usual, I’m trying to take things from the top down, and that yields a different perspective.

Martin’s piece rates telcos’ transformation progress in ten areas on a 1-10 scale, and my rough calculation says he’s giving them an average of 6 out of 10.  I’d probably shave at least one and perhaps two points from that, but Martin’s higher scores are associated with things like 5G, which are undeniably important but don’t make up anything like the totality of operator services and responsibilities.  I also think that there’s an issue that separating transformation into ten areas can actually disguise, which is lack of systematism in the overall approach.  I think we already have many areas where mobile and wireline are converging, and there are going to be more of them.  We really need to solve automated operations challenges throughout the telcos’ infrastructure.

The only solution to growing complexity is expanding automation.  Networks are getting complex naturally, from growth in traffic and connected users and from the introduction of new technologies.  The one creating the greatest risk of complexity, in fact, is the one that holds the most promise for the future, the concept of hosted and composable functions.  What do we do about growing complexity; how can we introduce automation?  There are two basic options.

Option one is to apply automation, and likely AI/ML, directly to the point-of-management interfaces.  We have humans sitting in network operations centers (NOCs) today, and this approach is aimed at making the human burden (and its associated costs and errors) smaller by taking on tasks that are presented to humans, implicitly or explicitly, as part of orderly operations.

Option two is to restructure network-building into the assembly of networks and services from autonomous elements.  A network, in this approach, is created by self-managed components, and these “intercept” faults and issues before they’d ever reach the NOC and human visibility.  As a result, today’s point-of-management interfaces have a diminished (and likely diminishing further, as practices improve) mission.

The big plus of the first option is that it’s essentially creating an automation layer attached to current management systems at the point where they interface with the NOC, or above the NOC itself.  That means nothing below requires any retooling, so it’s conservative of current investments in management and monitoring, and obviously it doesn’t imply any changes to network infrastructure itself.  A lot of network vendors are going this way, at least initially.

There are limitations to this first approach, and some could be significant.  First, it may be difficult to apply the new-layer solution at the top of each current management stack, because the automated remediation may require actions that cross over between management systems.  That would mean that to be fully effective, the strategy has to be applied at what’s essentially the “between NOC APIs and humans” level.

The problem with any such high-level approach is scalability.  The complexity of network infrastructure and the scale of issues that could be generated isn’t impacted; the stuff is all presented to the new automation layer to work on.  This could mean tens of thousands of events could be reported during a massive outage in a short period, and it calls into question whether the new layer could handle all the activity.

You can’t just have an event-handler at the top of the NOC stack (and the bottom of an event avalanche).  The problems of fault correlation and remedy interdependencies have to be addressed.  The former is an often-discussed problem, often called “root cause analysis” to reflect that often a single problem can generate tens, hundreds, or thousands of related problems, none of which really need to be fixed but all of which clutter the landscape.  The latter is rarely discussed, and it’s a major issue.

How do you know, when you take a step to solve a problem, that the step doesn’t collide with either other aspects of the same problem, or the effects of a parallel problem?  You can try to reroute traffic, for example, but it will be ineffective if the alternate route is already below spec, or has failed.  There has to be a classic “god’s eye” view of a network to assure the responses to a problem won’t create a different problem.

There are two possible paths to addressing this.  One is simulation, designed to determine what the outcome of a step would be by modeling the impact of taking it.  The other is failure modes, which try to describe different ways that a network could operate in a problem state, and use events to define what failure mode is present.  The failure mode would define a preferred response, which would be organized across all elements in the network, and could in fact be set by human policy in advance.
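A simple sketch of the failure-mode path might look like the following: incoming events are matched against named failure modes, and the mode selects a response policy set in advance.  The modes, events, and responses below are invented for illustration, not drawn from any product.

```python
# Simple sketch of the "failure mode" idea: incoming events are matched to a
# named failure mode, and the mode selects a pre-agreed, network-wide response.
# The modes, event names, and responses are invented for illustration.
FAILURE_MODES = [
    # (mode name, events that signal it, pre-agreed response)
    ("core-link-loss", {"core_link_down"},                   "reroute-core-traffic"),
    ("access-outage",  {"olt_down", "many_ont_unreachable"}, "dispatch-and-notify"),
    ("congestion",     {"queue_depth_high", "packet_loss"},  "shed-best-effort"),
]

def classify(observed_events):
    """Pick the failure mode whose signature best matches the observed events."""
    best_mode, best_overlap, response = None, 0, None
    for mode, signature, action in FAILURE_MODES:
        overlap = len(signature & observed_events)
        if overlap > best_overlap:
            best_mode, best_overlap, response = mode, overlap, action
    return best_mode, response

# An event avalanche collapses into one mode and one coordinated response.
events = {"olt_down", "many_ont_unreachable", "packet_loss"}
print(classify(events))
```

The point of the sketch is the collapse: thousands of correlated events reduce to a single mode, and the response for that mode is decided by policy, not improvised per event.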

OK, so this approach has some potentially critical limitations.  How about the autonomous-network approach?  This one is much harder to assess because saying “autonomous network” is like saying “multi-vendor network.”  One box may make a network “multi-vendor” but one that’s 50-50 split is a lot more so.  With autonomous networking, the problem is defining both the scope of “network” and the meaning of “autonomy”.

If the autonomous network is a management domain, then what’s created is the same as adding a layer of automation to the top API of a current management system as it feeds into the NOC.  I covered that above.  Rather than list all the things that the term might mean, and that would produce sub-optimal behavior, let’s look instead at the right approach.

I think “autonomous network” means “intent-modeled network”.  You create functional black-box abstractions, like “VPN” or “access network” or “firewall”, and you define a specific set of external properties for each, including an SLA.  Any implementation (the inside of the black box) that meets the external properties is fine, and can replace any other.  Inside each abstract black box, an autonomous management process is responsible for meeting the SLA or reporting a failure, which then has to be handled at the higher level.
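To make the black-box idea a bit more tangible, here’s a skeletal sketch of an intent-modeled element: it exposes an SLA, remediates internally when it can, and escalates when it can’t.  The class and method names are mine, invented for illustration, not any standard’s.

```python
# Skeletal sketch of an intent-modeled ("black box") element: it exposes only
# external properties and an SLA, handles faults internally where it can, and
# escalates to its enclosing abstraction when it can't. Names are illustrative.
class IntentElement:
    def __init__(self, name, sla_availability, parent=None):
        self.name = name
        self.sla_availability = sla_availability   # the external promise, e.g. 0.999
        self.parent = parent
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def report_fault(self, detail, can_self_heal):
        if can_self_heal:
            # Autonomous remediation inside the black box; nothing reaches the NOC.
            return f"{self.name}: remediated internally ({detail})"
        if self.parent:
            # The SLA can't be met locally; escalate to the enclosing abstraction.
            return self.parent.report_fault(f"{self.name}: {detail}", can_self_heal=False)
        return f"service level: escalate to operations / issue service credit ({detail})"

service = IntentElement("VPN-service", 0.999)
access = service.add(IntentElement("access-network", 0.9995))
print(access.report_fault("redundant uplink lost", can_self_heal=True))
print(access.report_fault("both uplinks lost", can_self_heal=False))
```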

The notion of building a network service by composing functions is obviously a shift in thinking, but in some ways that’s helpful here because so is automated operations or “autonomous networking”.  If you spend a bit of time defining your intent-modeled functions properly, and if you have a proper modeling language to express the result, you can define every operations task within the model, which means you can automate everything short of the handling of situations for which no handling is possible.  Even there, you could initiate an escalation, trigger a service credit, etc.

To get back to Martin’s piece, I think this represents the key qualification to his comments.  All autonomous networks are not the same, and many aren’t really autonomous in the sense that they are self-healing.  If we credit some early activities that could loosely be called progress toward true autonomy as an indication that automated operations is making progress, we presume that these initiatives are actually aimed at that goal and being guided there by management.  I do not believe this is the case.

When I talk with operators themselves about automation efforts, they point to successes that are striking in that they’re limited-scope in nature.  They’ve fixed this or that, but they’ve not addressed the problems of operations overall.  That may, in the short term, be a driver for the AI-overlay approach.  Putting a layer of AI at the NOC has the advantage of being systemic by nature, but I think that the issues I’ve noted above will eventually mean it will hit the wall without lower intent-modeled element support.  That’s the strategy I’m most interested in seeing.

I still think that Juniper may have a shot at getting there.  Their Paragon Automation has a lot of the pieces of an intent-modeled approach, and their Apstra acquisition would let them take things further.  They have a great AI tool in Mist too, but Paragon doesn’t cite any deep Apstra or Mist AI roots, so it’s not yet the fully structured and autonomous-management piece it could be.

We need to go back to a key point, which is that you can’t do effective operations automation in little grope-the-elephant-behind-the-screen pieces, you have to do it through a strategy that spreads to every part of the network.  It will be interesting to see if Juniper takes it further, and whether competitors like Cisco (with Crosswork) will push their own offerings in the right direction…or both.

Services, Traffic, and NaaS

There is a profound difference between two things we think of as key in networking.  There are “services”, which are the things that users consume, and there’s “traffic”, which is what the network carries.  What has happened over the last 40 years is that we’ve gotten focused on the latter, when it’s the former that keeps the lights on.  The question now, for both network operators and vendors, is how that’s going to get changed.

A good place to start with this tale is a contrast that’s been meaningless for decades, the difference between “connectionless” and “connection-oriented”.  When we make a phone call, we’re calling another party (or parties, in a conference call).  When we access a website for e-commerce, or watch a video, we’re entering into a specific relationship with a specific source or partner.  These are examples of “services” because they represent what we’re trying to do, and thus what we’re willing to pay for.

In a very real sense, these services represent “connections”, though it’s better these days for reasons we’ll see shortly to consider them “sessions”.  In the early days of networking, we carried this connection relationship downward to the network.  The phone network set up connections, and successor technologies like ISDN, frame relay, and ATM were all based on connections.

The problem is that connections don’t scale, at least not at the network (or “traffic”) level.  When you have connection-oriented networking, you “set up a connection” which essentially threads a persistent route through the network.  The nodes along the way have to “remember” the forwarding rules for that persistent route, which means they’re aware of each connection.  There have been various strategies to aggregate routes, but the fact is that no matter what you do, the property of connection-ness at the service level is incompatible with efficient traffic-handling.

The solution to this was IP, which is “connectionless”.  In an IP network, all the packets that are associated with service-level “sessions” are aggregated and handled simply based on the destination interface.  The network doesn’t know anything about “sessions”, only “packets”.  The nodes (the routers) in the network only have to know about topology, not relationships.  It’s proven to be highly scalable.
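The scaling contrast can be illustrated with a toy example: a connection-oriented node keeps state per session, while an IP router keeps state only per destination.  The tables below are invented and grossly simplified.

```python
# Toy contrast between connection-oriented and connectionless forwarding state.
# A connection-oriented node keeps an entry per session; an IP router keeps an
# entry per destination prefix, no matter how many sessions ride over it.

# Connection-oriented: state grows with the number of active sessions.
connection_table = {
    ("user-a", "video-server", 5001): "port-3",
    ("user-b", "web-shop", 5002): "port-3",
    ("user-c", "video-server", 5003): "port-7",
    # ...one entry per session, potentially millions...
}

# Connectionless (IP): state grows only with topology and prefixes.
routing_table = {
    "203.0.113.0/24": "port-3",
    "198.51.100.0/24": "port-7",
}

def forward_ip(destination_prefix):
    """The router looks at the packet's destination, not the session it belongs to."""
    return routing_table.get(destination_prefix, "drop")

# Every session toward this prefix shares the same single routing entry.
print(forward_ip("203.0.113.0/24"))
```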

The problem is that scalability and traffic efficiency aren’t the only properties of networks that we have to think about.  One of these other issues has been around for a long time; traffic management based on application needs.  Almost every business, and probably every consumer, faces challenges when they mix traffic that’s sensitive to delay and/or packet loss, and traffic that’s well-suited to IP’s best-efforts nature.  The other issue has been around too, but it’s only recently getting fully recognized; security/compliance.

Back in those early connection-oriented days, the ruling technology for data networking was IBM’s Systems Network Architecture (SNA).  In SNA, all “sessions” were explicitly authorized by the mainframe, via something called the “Systems Services Control Point” or SSCP.  If you didn’t have SSCP consent, you didn’t connect.  This created an easy way to policy-manage connectivity built into the network architecture.

IP networks have to add security on, because it’s not intrinsic in the network model.  Firewalls, which bar attempts to connect by barring packets with specific characteristics, aren’t exactly perfect, nor are they cheap.  More recently, there’s been interest in “zero-trust” security and compliance, where sessions have to be authorized by the modern equivalent of the old-time SSCP.  Juniper recently acquired 128 Technology, the firm I’ve always said was the leader in this space.

The big development in both security and compliance is some form of session awareness, the ability of a network element to recognize one of those “sessions” and treat it based on its needs and on the policies of the business or service involved.  What this may be creeping up on, and I hope it is, is a useful implementation model for “network-as-a-service” or NaaS.

With NaaS, though, we’re facing the all-too-common problem of hype versus reality.  The properties of NaaS are difficult to understand and communicate, which allows almost anything to be called “NaaS” with little fear of contradiction.  That’s too bad, because the lack of a consensus model for NaaS could hurt us all, particularly in the consumer space.

Businesses can adopt NaaS, even true session-aware NaaS, without a major problem even if there’s a hundred different implementations.  They pick one and run with it.  For network operators, cloud providers, content providers, and so forth, the lack of consistency is problematic because network resources at the broad-market level necessarily converge to interwork and to support consistent services.

What would be needed to provide a common-model NaaS?  Obviously you’d have to start with a common requirements set, and today we don’t even have that.  That’s sad, because there are only three requirements for a useful, broad, NaaS model.

The first requirement is that NaaS has to look like IP to all network users and devices.  Whatever is done has to be done without requiring upgrades to existing technologies that connect, for the obvious reason that if changes were required there’s little chance NaaS could bootstrap into today’s online world.

Requirement number two is that NaaS has to be session-aware and apply policies for handling and connectivity based on that awareness.  Otherwise, it doesn’t move the ball in any of the areas where pure IP falls short.

Number three on our requirements list is that NaaS has to provide an effective, efficient, and scalable means of defining, storing, and applying policies.  It has to avoid being weakened by any barriers to creating connectivity control.

The last of these three is the most difficult to achieve.  We already have at least one implementation of SD-WAN (128 Technology) that fits the first two.  While that implementation has successfully been scaled to corporate VPN size, it’s not been tried yet at the scale of a public carrier.  We also need to accept the fact that public carriers would probably not adopt a proprietary model without some resistance, so it would be ideal if we had an open definition of policy exchange.  We already have standards for policy control and enforcement points (PCP and PEP), so it’s probably not rocket science to provide an adapting technology to utilize them, or to implement them directly.
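Here’s a bare-bones sketch of requirements two and three working together: a session is classified, and a stored policy decides whether it connects and how it’s handled.  The policy schema is something I’ve invented for illustration; it isn’t drawn from the PCP/PEP standards or from any vendor’s implementation.

```python
# Bare-bones sketch of session-aware policy application: the session is
# classified, and a stored policy decides connectivity and handling.
# The policy schema below is invented for illustration.
POLICIES = {
    # (user role, destination service): (allowed, handling class)
    ("finance", "erp"):   (True,  "low-latency"),
    ("guest",   "erp"):   (False, None),
    ("any",     "video"): (True,  "best-effort"),
}

def authorize_session(user_role, destination):
    """Zero-trust style check: no matching policy means no connectivity."""
    for key in ((user_role, destination), ("any", destination)):
        if key in POLICIES:
            return POLICIES[key]
    return (False, None)

for session in [("finance", "erp"), ("guest", "erp"), ("guest", "video")]:
    allowed, handling = authorize_session(*session)
    print(session, "connect" if allowed else "denied", handling or "")
```

The hard part, as noted above, isn’t the lookup; it’s defining, storing, and distributing those policies at carrier scale.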

There’s still one question, though, which is whether NaaS could also be supported at the network level rather than via a form of IP (real or virtual) overlay.  It doesn’t seem likely that sessions could be mapped at Internet scale all the way through from end to end, no matter how transport infrastructure evolves, but that’s really not necessary.  In most cases, congestion is going to be encountered near to the edge, and often right at the service demarcation.  Probably, the most that would need to be done at the IP level is to provide QoS- and security/compliance-specific pathways, which isn’t far from what network slicing in 5G is supposed to do.  It will be interesting to see what Juniper, both a router vendor and the new acquirer of 128 Technology, will do in this area.  Or what their competitors might do.

There’s still a lot of work to be done on NaaS if we’re to advance it from where we are today, but we’ve come a long way in the right direction.  If networking is truly to be a service, it’s essential that we be able to classify traffic according to the relationships between users and resources.  That step has now been taken, and it’s going to be very interesting to see where we go next.

Are Vertical or Horizontal Markets the Key to Operator Success?

So is Google going vertical?  It sure seems like the deal they did with TELUS takes that direction.  It also seems that Google might be responding to the changes that COVID has created, not the simple WFH issues but the more profound issues of reshaping stakeholder and employee relationships to make them work better under any set of market conditions.  There may even be a connection between this deal and Telia’s recent “global IoT” announcement (and yes, I know Telia’s a different organization).

TELUS, of course, is an operator, and the Google Cloud deal is therefore a deal that’s designed to help operators address new service opportunities.  That’s the most important part of this story, in fact, because for years the network equipment vendors have sat on their (…hands…) and watched operators struggle with climbing the value chain.  Why encourage them, the vendors thought, because we’re not what’s needed on the next level.  Well, people, there are other players in the industry who are prepared to step up.  Inside this win for Google is a big, big, loss for those network vendors.

Operators have struggled themselves with that next level, of course.  According to the operators themselves, their problem is a mixture of organizational ossification, fear of changing a business model their great-grandparents could have related to, and an almost complete lack of understanding of the basic tools that are needed to take that giant, scary, step.  It’s easy to say that “The cloud is the future”, but obviously realizing the future has to be more than putting your hand on your heart and chanting “Cloud…cloud…cloud!”

IT vendors, of both servers and software, have tended to believe that since other verticals have been able to adopt the cloud, telecom should be able to as well.  The problem of course is that enterprise adoption of the cloud is a straightforward process of running stuff they need, and may already be running in some form, in that cloud.  For operators, you need to convert this to services, an intermediary step.  Many services, and likely all of the most compelling, relate to vertical markets.  Operators don’t know vertical markets as well as server/software players do.  When you sell something whose primary virtue is not getting in the way of what’s running over it, you’re understandably weak on the details of those applications.

There are some cloud-technology things that are almost surely going to underpin any vertical application the partnership might elect to chase.  One is obvious—hybrid cloud.  Very few enterprises will avoid public cloud services, and very few will go totally into public cloud for their IT.  That means everyone who’s a prospect for these applications will want to deploy them as a hybrid.  In fact, what it really means is that these are likely to be a series of SaaS (and perhaps PaaS) offerings that are designed to be integrated with both the data center and any enterprise public cloud the buyer might want.

Things like edge computing are less obvious, though I suspect that both TELUS and Google believe that the edge is going to be a part of the story.  The press release is pretty vague on whether TELUS will be providing Google with edge-hosting real estate, whether the vertical-market services will involve the edge, or what.  It does seem like Google may be hosting some of the 5G elements, though.

An even-less obvious thing is “why verticals?”  Yes, as I said earlier, the most valuable applications for a business tend to be their “vertical” ones, because what makes them a member of a vertical is their key business focus.  Apps, or services, that support that are of primary importance.  But from the perspective of sales, that’s not necessarily a good thing.  Core business applications are really touchy elements to enterprises.

Vertical markets are also the best places for “viral spread” of technology to develop.  Companies trust the opinions of others in the same business over any other information source, and companies in the same business sector are also more likely to interact in trade shows and forums, and to exchange employees.  References from within the same sector are thus much more valuable, and the best way to get them is to attack a sector as a vertical.

The Telia IoT announcement seems at one level to have little in common with the TELUS one, but what it does have in common is that it’s an application, in this case, IoT.  While vertical applications have the greatest value, horizontal ones like IoT are second, and “platform” announcements that rely on integrators or self-development to adapt to business needs are at the bottom.  Is this just a case of second-place efforts?

No.  It’s a demonstration that “services” have to be more than connectivity.  Operators have always been aware of higher-level or over-the-top services, and in fact have watched them explode while their own businesses dip into profit-per-bit challenges.  But operators view OTT-like services the way a Dark Ages peasant would view a noble; distant, admirable, and untouchable.  Partnerships are a great way to get into something you’re afraid of doing by yourself, and if you do have to take an uncomfortable step, dipping a toe into horizontal solutions is less frightening.

I think the TELUS approach is much more likely to bear fruit.  First, horizontal IoT as supplied by a network operator is notoriously focused on selling more 5G services, which of course is just what the IoT users are trying to avoid using.  Second, horizontal IoT services are almost as difficult for a business to consume as a platform service, because these services need application partnerships to make them work.  It may not be “build it and they will come”, but it’s in some ways worse.  “Show them a blueprint and they will come!”

The big advantage of the TELUS deal may well be Google.  As I’ve noted in past blogs, Google has always been the least frightening of the possible partners from a telco perspective, and that includes “cloud partners”.  Google is also the cloud player who is the most agnostic with respect to things like multi-cloud, hybrid cloud, and even cloud-related technology deployed totally on the premises.  That means that there’s way less risk of lock-in, which is something operators are always seeing behind every bush.

For both TELUS and Google, though, this deal is a major risk.  Any high-profile partnership is equal parts of opportunity and risk.  Do it right and it’s a reference approach you can take to a lot of banks worldwide.  Do it wrong and it’s a potential disaster that will haunt your footsteps wherever you go.  Performing on the potential of the deal is almost surely going to be up to Google, because they’re the cloud and application specialists.  If they can get this right, they could launch themselves into the lead in the race for cloud providers to sign up telco partners.

What’s the Relationship Between Containers and “Cloud-Native?”

If you don’t like something, redefine it.  If you’d like something different, redefine things to suit your fancy.  OK, that sounds like basic marketing, but it also has a tendency to confuse complex issues, as well as buyers who face them.  We’re seeing this in the area that’s absolutely the most important thing in IT today, the cloud.

Fundamentally, what is a “cloud”?  The majority of enterprises say that a cloud is a public computing service, something that offers Infrastructure-, Platform-, or Software-as-a-Service.  That’s synonymous with “public cloud”, in other words.  What is “multi-cloud”?  It’s having multiple public cloud providers.  “Hybrid cloud” is a combination of public cloud and data center.  All these definitions are supported by almost three-quarters of the enterprise IT planners I’ve been in contact with over the last six months.

Now, in February of 2021, we’re seeing some weakening of these definitional norms.  One example is that the term “multi-cloud” is increasingly applied to hybrid clouds, and I think this shift can be attributed to attempts to make “multi-cloud” marketing relevant to hybrid clouds (which are almost universal) versus multi-public-cloud use (which is way less common than many companies say it is).

There’s a lot of popular focus on multi-cloud, generated in no small part by media stories, and a lot of it is based on the “locked in” or “single point of failure” arguments.  The largest driver of multi-cloud has never been that, it’s been one or both of two factors.  The first is that all public clouds don’t cover all geographies well, so sometimes users had to use more than one provider.  The largest cloud providers now cover pretty much the same geographies, so this is less of a factor.  The second is that all clouds don’t offer the same features, and some applications need a specific feature set.  This is not only still the case, but a growing one.  Cloud providers, in some areas, are diverging in features more rapidly than a year ago.

The feature differences between public clouds are actually a voice speaking against multi-cloud, though.  Early public cloud use was really a kind of re-hosting push, moving something that ran in the data center into the cloud.  IaaS services in the cloud, analogous to VM usage in the data center, are pretty universal, meaning that there are no cloud features that impact development and execution, only operations.  Today, over 80% of new cloud applications are written for the cloud, and since every cloud provider offers at least a slightly different implementation of features, if not different features, these applications are not portable.

This means that the largest number of “multi-clouds” emerging, if we accept the old multiple-public-clouds definition, are created by application-specific cloud needs.  That’s important because these multi-clouds are less likely to be sharing things between clouds; the applications are separate.  It also means that the relationship between the cloud(s) and the data centers—the hybrid cloud—is really important, and that relationship is as much operational as programmatic.

A cloud is a resource pool abstracted to look like a computer, one that might be bare (IaaS), have OS and middleware (PaaS), or run an application component set (SaaS).  That’s fine, but what runs on these platforms?  An application or an application component has to deploy on something.  What?  The old answer was “VMs”, but the new answer, increasingly, is “containers”, and that introduces our second terminological shift.

Is a “container” a “cloud”?  Not unless we change how we define clouds, but that’s not necessarily a barrier.  Logically, though, containers relate to things that you host, and clouds to what you host on.  Moving to containers doesn’t mean the same thing as moving to the cloud, and formatting applications or components into containers doesn’t make them “cloud-native”, despite claims from many in the NFV community to the contrary.

A container packages applications and dependent elements (like databases) into a deployable unit that’s substantially infrastructure-independent because everything that’s needed for deployment is in the container.  Containers, and Kubernetes orchestration of container deployments, are becoming a standard approach both for the data center and for the cloud.  Thus, over time, both public cloud and data center are becoming container hosting options.  So, according to some vendors and stories/reports, a hybrid cloud is a premises container “cloud” and one or more public clouds.

What’s really happening here is that containers have become a deployment option that works across both cloud and data center.  Operationally speaking, this could be a step toward unifying the cloud(s) and the data center, but remember that public cloud services are a combination of hosting (including managed container or Kubernetes services) and those cloud-provider PaaS-like web service features.  Those features may make a given container/component non-portable.

This explains why some of the initiatives of the public cloud providers to support user-supplied edge hosting are, or include, a way to extend the web-service framework of the cloud provider to the user edge.  One could expect that these same sorts of initiatives could extend cloud-provider services into the data center, which would then mean that the containers would be fully portable across that boundary.  Since these hybrid-edge facilities of cloud providers are specific to each provider, using them would reduce public cloud portability.

This creates, in a sense, a layered model of hybrid application deployment.  At the top of the layer stack are the application/component elements.  These have a relationship with one or more “platform-layer” elements, representing the web services or middleware services that are not universally available.  Those, in turn, ride on the container layer, which rides on “hosting” layer elements representing each public cloud and the data center(s).
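One compact way to picture that layer stack is as a simple data structure, as in the hypothetical sketch below.  The entries are examples I’ve made up, not a catalog of any provider’s services.

```python
# Compact picture of the layered hybrid model described above, expressed as a
# simple data structure. The specific entries are examples, not a catalog.
hybrid_stack = {
    "application": ["order-entry", "analytics-frontend"],
    "platform":    ["managed database", "event streaming", "ML inference"],  # not universal across clouds
    "container":   ["Kubernetes clusters (one per cloud or data center)"],
    "hosting":     ["public cloud A", "public cloud B", "enterprise data center"],
}

def portability_note(component, uses_platform_services):
    """A component that leans on provider-specific platform services is less portable."""
    if uses_platform_services:
        return f"{component}: portable only where the same platform services exist"
    return f"{component}: portable across any container host"

print(portability_note("analytics-frontend", uses_platform_services=True))
print(portability_note("order-entry", uses_platform_services=False))
```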

This also explains why “Kubernetes federation” tools like Google’s Anthos are so much in the news.  In Kubernetes, the general rule is that public clouds and data centers are independent domains (clusters), so some form of cross-cluster federation is needed to coordinate anything that has to cross the boundaries.  It’s perfectly possible to create static bridges between clusters without federating them, but if you want to be able to deploy across the cluster boundaries, you’ll need a federation tool like Anthos.  Google has been working hard to get Anthos supported on all public clouds, and of course it’s already available for data center deployments.
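
For contrast, here’s what the non-federated, “static bridge” approach looks like in practice: the same Deployment pushed by hand into two independent clusters.  This is a rough sketch that assumes the official kubernetes Python client and two hypothetical kubeconfig contexts; a federation layer like Anthos would replace this kind of loop with a single control plane and policy model.

```python
# Static bridging: push one Deployment to two independent clusters by hand.
# Assumes the official `kubernetes` Python client and kubeconfig contexts
# named "onprem" and "gke-cloud" (both hypothetical).
from kubernetes import client, config

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "portal-frontend"},
    "spec": {
        "replicas": 2,
        "selector": {"matchLabels": {"app": "portal-frontend"}},
        "template": {
            "metadata": {"labels": {"app": "portal-frontend"}},
            "spec": {"containers": [{"name": "web", "image": "example/portal:1.0"}]},
        },
    },
}

for context in ("onprem", "gke-cloud"):
    api_client = config.new_client_from_config(context=context)  # one client per cluster
    apps = client.AppsV1Api(api_client)
    apps.create_namespaced_deployment(namespace="default", body=deployment)
    print(f"deployed to {context}")
```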

I think that containers are a logical evolution of server virtualization, and that they’d have grown in popularity with or without collateral interest in public cloud services.  Public cloud does seem to make them inevitable, though, and there has certainly been some cross-influencing between cloud and container.  That shouldn’t encourage us to conflate the two, in particular as we’re discussing “cloud-native”.  There are a lot of things needed to convert a “container” strategy for the cloud to cloud-native, and to miss them is to risk a lot of the cloud-native benefits.

IBM May be Advancing the Cause of AI (and its Own!)

Is low- or no-code the secret to AI success?  IBM thinks it might be, and so it’s launching another of its Cloud Paks, this one called “Palantir Cloud Pak for Data”, that combines the two.  Obviously, Palantir, and apparently primarily Palantir Foundry, is a key element in this, but the Cloud Pak still involves IBM products and professionals too.  Then, of course, there’s the question of whether the new Cloud Pak could help IBM realize the gains it expected from AI, in terms of revenue and customer traction.

When I’ve talked with enterprises about AI, they’ve been optimistically pessimistic.  They’re confident that AI (Artificial Intelligence) and ML (Machine Learning) could do a lot for their company.  About a third say they have some experience with AI or ML in what are essentially limited and packaged solutions.  While none of them said this experience was life- (or business-) changing, almost three-quarters said it was valuable.

They’re pessimistic about how easy it will be to make that goodness happen, and some who have tried it out have been disappointed.  Even enterprises with local AI/ML success admit that they aren’t confident that they could combine a bunch of limited and local AI/ML applications into something that addresses their business overall.  That, they believe, would require customization, which they associate with software development.  One CIO said that “AI and ML are kind of like programming.  You know programming can be great for your business, but just knowing that doesn’t move the ball much.”

That’s a fairly insightful view, actually, because using AI and ML really is a lot like programming unless you get a package that digests one or both into the specifics of the mission you’re targeting.  There are AI/ML tools just like there are software development tools, but many organizations have absolutely no knowledge of AI/ML and no internal skills to use the tools available.  That’s likely what IBM has in mind to address with its Palantir Cloud Pak for Data.

Cloud Pak for Data is IBM’s original offering, and it’s the foundation of the new solution.  It provides data collection and organization, along with Watson Knowledge Catalog for unifying all data assets and for enforcing compliance and security policy.  Palantir lays a new layer on that result, apparently via the Catalog.  Watson Studio also generates AI models for use in various what-if scenarios, and Palantir builds an object-oriented framework on top, along with a set of semantics for accessing and referencing both data and AI models.  This, says IBM, builds a “digital twin” of the real-world operation.

What IBM and Palantir seem to be doing here is creating a kind of generalized data-analytics-centric AI framework that can be mapped to all or part of a business operation.  In other words, there’s a “create-the-digital-twin” task and then a “draw-on-it-for-insight” task.  I don’t have any direct enterprise feedback on the approach, which is obviously fairly new, but it does seem to at least attempt to address the problems of AI/ML use I’ve noted above.

What I think IBM intends here is to provide professional services (at least optionally) to assist businesses in building their digital twins with the Cloud Pak.  With that done, and with the staff perhaps indoctrinated in the way the digital twin works, the semantic layer of Palantir would be tapped to provide ways of pulling useful results from their data, via those AI/ML models and using vertical-market specialties and special tools created by Palantir.

I think the digital-twin notion is a strong one.  If you could create an AI/ML-empowered model of a business, with important elements mapped as objects and assembled into a model of the company’s processes overall, a lot of good things could fall out of that organizing step alone, and AI/ML could be guided better to consider the important areas in the context of their real-world relationships.  Not every butterfly flapping its wings in Japan starts a hurricane off the East Coast, after all.  Some limits on analytic correlation could be important in weeding out the chaff.
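
To show what I mean, here’s a toy Python sketch of the digital-twin idea.  It is emphatically not IBM’s or Palantir’s object model; the class, attribute, and model names are all hypothetical.  The point is the shape: business entities become objects, models attach to the objects they describe, and insight is drawn through the twin rather than from raw tables.

```python
# A rough sketch of the digital-twin idea; all names are hypothetical.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class TwinObject:
    name: str
    attributes: Dict[str, float] = field(default_factory=dict)
    models: Dict[str, Callable[[Dict[str, float]], float]] = field(default_factory=dict)
    related: List["TwinObject"] = field(default_factory=list)  # real-world relationships bound correlation

    def ask(self, question: str) -> float:
        """Run an attached model against this object's current state."""
        return self.models[question](self.attributes)

# Hypothetical example: a warehouse object with a trivial stock-out "model".
warehouse = TwinObject(
    "warehouse-east",
    attributes={"inventory": 1200.0, "daily_demand": 150.0},
    models={"days_of_stock": lambda a: a["inventory"] / a["daily_demand"]},
)
print(warehouse.ask("days_of_stock"))  # 8.0 -- insight drawn via the twin, not via raw tables
```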

The enterprises I’ve talked with who use Palantir do like it, and find it valuable, but the majority of my contacts have never used it.  Neither IBM’s nor Palantir’s website offers a really strong view of just what’s going on here, or of how the new Cloud Pak could be used and extended.  That could indicate that the target customers are those companies with strong IBM account positions and an appetite for consuming detailed demonstrations and sales material.

Is that bad, you could ask?  It depends on just how much impact you’re hoping for.  IBM Cloud Paks aren’t small business offerings; even most mid-sized businesses wouldn’t be candidates, so loss of a bit of populist appeal might not be a big deal.  However, almost half of all enterprises I talk with are not major IBM accounts, and they might be interested in something like this if it were a bit more accessible.

I thought Red Hat was a compelling acquisition for IBM because it extended IBM’s total addressable market.  I’m not sure that IBM can succeed with AI/ML unless it harnesses a lot of that incremental, Red-Hat-created breadth, and I’m not sure that this offering can do that.

What it might do is serve as a blueprint for others, though.  I could see a Dell or an HPE framing a broader offering with similar elements and offering it to all the enterprises and many of the SMBs as well.  That would have a major positive impact on AI/ML adoption, but it would have a negative impact on IBM.

Most companies are not going to do their own AI programming.  A low-code/no-code tool would be useful, but given the nature of AI/ML I’m not sure that traditional programming models for low-code/no-code would serve usefully.  The digital twin concept seems to me to be a really good way of doing things.  You create a model of a company or process and then build AI/ML enhancements around it, then create a front-end that lets you harness the value it creates.  That’s what Palantir Cloud Pak for Data really does in the end, and the idea is great.

Not so much the positioning and marketing.  IBM is one of the smartest companies I’ve ever worked with, but smartness can be a real problem about as often as it’s a real benefit.  Somebody up in the dazzling IQ clouds doesn’t communicate well with the masses, and markets are made up of the masses or they’re not much of a market.  With better positioning, and possibly offered in a true SaaS form, this could be something strong.  They’re just not there yet, and they’re certainly giving competitors ideas they’ll likely run with.  Maybe further than IBM will.

That’s especially true because it’s becoming obvious to many different kinds of companies that new technology adoption requires more than just throwing a pile of tools and APIs at a user and hoping they figure out something useful to do with it all.  There are other developments in other sectors of the IT world that are emerging, and it’s only a matter of time before somebody figures out that what’s really needed are useful services, not frameworks.  The frameworks are essential in that they build the useful services, but a clever vendor could jump the line and go right to the service dimension.  IBM is trying that in AI, and we’ll see more of that as 2021 rolls on.

What are Businesses Doing on the Cloud?

What are businesses really doing with the cloud?  According to a piece on Medium, not necessarily what you think.  Some of the numbers it cites are interesting, even fascinating.  Some are misleading or perhaps just wrong, at least according to my own enterprise data.  Can the numbers support what I think is the primary goal of the piece, which is to promote Google’s Anthos?  We’ll look at the numbers, and the value of Anthos, here.

Let’s start with some statistics from the piece.  Google says, according to the article, that 73% of “organizations” use the cloud.  What I’ve found is that that’s roughly true for “multi-site businesses” (my figure is 67%); among enterprises (the large multi-site businesses with at least 7,000 employees and 15 or more sites), the number is 100% for all practical purposes.  Google also says that only 10% of companies have “moved workloads” to the cloud, and that’s a bit more difficult to parse.

My data has shown consistently that only about 23% of workloads are eligible to run in the cloud.  If 10% had been moved, that’s nearly half of all that could be moved, and I think that is high, unless we think of “moved” differently, and that leads us to what’s really the main point of the article, the supremacy of hybrid cloud.

Google, again quoting the article, says that 80% of enterprises have hybrid cloud.  Again, if we use my definition of an enterprise, 100% of enterprise cloud users are hybrid cloud users, and since essentially 100% of enterprises (actually somewhere over 99%, though I’m doubtful about the remainder) use the cloud, that would mean essentially all enterprises are hybrid cloud users.

Finally, according to Google as cited in the article, “80% of those who have moved to the cloud have moved workloads back on-premise because it proved to be operationally expensive to run in the cloud.”  This seems to require some qualification.  It’s roughly true that 80% of all enterprises (84%, in fact, in my data) who use the cloud have moved some things off the cloud, and unexpected operations cost is the largest reason cited.  However, in none of these cases has all the work been moved, and only 12% said they moved “a significant application” representing “more than 10% of their cloud usage” back to the data center.

The author and Google seem to be setting up a case for Anthos here, and while I find some of the data troubling, the fact is that my own data confirms the basic business case for Anthos.  If you are an enterprise, you are a hybrid cloud user.  Further, as I’ve noted recently, the current trend is toward much greater cloud use.  While 100% of enterprises use hybrid cloud, the cloud was actually involved in only about 34% of applications as of November 2020.  Users tell me that number will increase to about 45% in 2021.  The driver, both for “past cloud” and “future cloud” hybridization, is the use of the cloud to provide an agile “portal” to information from one or more (and, more often, multiple) applications.  This “portalization” of applications is the largest driver of cloud growth.

What enterprises are trying to do in 2021 is create a flexible way of projecting data from multiple applications to multiple stakeholders, based on role and company security and compliance policies.  This means drawing from a set of APIs that represent connection points to legacy business applications running in the data center, or perhaps from a piece running in the cloud, or both.  Containers are the go-to strategy for both cloud and data center these days, and so this model creates a hybrid cloud but with an emphasis on the agility of the cloud in presenting composed user interfaces.
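
Here’s a small Python sketch of that portalization pattern: a cloud front-end composing several back-end application APIs into a single, role-filtered view.  The endpoint URLs, role policy, and application names are all hypothetical, and a real deployment would add authentication, caching, and error handling; this is just the shape of the thing.

```python
# Portalization sketch: compose multiple legacy-application APIs into one
# role-filtered view.  URLs and policy are illustrative only.
import requests

BACKENDS = {  # connection points into legacy applications (hypothetical URLs)
    "orders": "https://dc.example.com/api/orders/summary",
    "billing": "https://dc.example.com/api/billing/summary",
    "hr": "https://dc.example.com/api/hr/headcount",
}

ROLE_POLICY = {  # which application views each role may see
    "sales": ["orders", "billing"],
    "executive": ["orders", "billing", "hr"],
}

def compose_portal_view(role: str) -> dict:
    """Pull only the APIs the role is entitled to and merge them into one response."""
    view = {}
    for app in ROLE_POLICY.get(role, []):
        resp = requests.get(BACKENDS[app], timeout=5)
        resp.raise_for_status()
        view[app] = resp.json()
    return view

# Example: a salesperson's portal sees orders and billing, but never HR data.
# (Uncomment once the backend URLs point at real services.)
# print(compose_portal_view("sales"))
```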

Anthos is Google’s orchestrator-of-orchestrator approach to hybrid cloud.  The challenge with hybrid cloud when applied to containers is that Kubernetes cloud and premises domains are separate, and users want them to somehow come together.  Google, like other cloud providers, has offered managed Kubernetes services (GKE, in Google’s case) to make the public cloud side of the hybrid easier, and Anthos builds on GKE to create “federated” Kubernetes domains for hybrid and even multi-cloud.

Anthos creates a cloud-resident Kubernetes control plane that’s then managing Kubernetes nodes that are deployed, well, wherever they are.  This has created some initial concern among users, because they see their hybrid cloud, which to enterprises is mostly the data center with a ring of cloudiness around it for portal creation, controlled by Google.  That seems a bad idea until you get to some other enterprise statistics that the article doesn’t include.

Remember that the thing that’s always driven public cloud use by enterprises is the creation of front-end pieces for legacy data center applications that enterprises have no intention of moving.  This trend is accelerating in response to the post-COVID realization that projecting application information through portals that are customizable and multi-application offers big advantages.  The net is that there really isn’t much “cloudiness” going on inside the data center, and there’s not a lot of cross-border migration of components.

The statistics I’ve gathered show why.  Among those enterprises who moved applications back out of the cloud for cost reasons, the underlying technical common ground was that the applications were either too tightly tied to the data center or used a lot of fancy public-cloud technology.  In both cases, the impact of the fanciness or the tight linkage was to create unexpected costs.  One or both of these reasons were cited by users in 73% of cases, in fact.

Why have a hybrid cloud at all, then?  You could, after all, have a public cloud front-end that drops transactionalized data in a queue to be serviced by data center applications, without any “hybrid cloud” elements in the data center.  There are two reasons, enterprises say.  First, they have an increasing commitment to containerization of their data center applications that has nothing to do with the cloud.  Containers have a significant operational advantage in that they can package an application in deployable form, including all its dependencies, and deploy it as a whole.  That not only saves effort, it reduces errors.  Second, they are finding that a highly portalized cloud may in fact create a need for a more agile interface between it and the data center.

A simple example will serve to illustrate the final point.  If you’re going to merge multiple applications into a single virtual application to supply data to the cloud (a “storefront design pattern” in software would be an example), you might want to do the merging and apply security measures within the data center, where you have better security/compliance controls.
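
A minimal sketch of that storefront idea, assuming hypothetical field names and a made-up redaction rule: the data center merges records from several internal systems into one virtual application and applies compliance policy before anything is handed to the cloud portal.

```python
# Storefront sketch: merge per-application records on-prem and enforce policy
# before the result ever reaches the cloud portal.  Field names are illustrative.
REDACTED_FIELDS = {"ssn", "credit_card"}  # compliance policy enforced in the data center

def storefront_lookup(customer_id: str, crm_record: dict, billing_record: dict) -> dict:
    """Merge per-application records into one view, stripping controlled fields."""
    merged = {"customer_id": customer_id, **crm_record, **billing_record}
    return {k: v for k, v in merged.items() if k not in REDACTED_FIELDS}

# The cloud portal only ever sees the merged, policy-filtered record.
print(storefront_lookup(
    "C-1001",
    crm_record={"name": "Acme Corp", "segment": "enterprise", "ssn": "000-00-0000"},
    billing_record={"balance": 1250.00, "credit_card": "4111-0000-0000-0000"},
))
```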

If we have an agile cloud framing information for users, partners, customers, and even (in a few cases) regulators and accounting firms, then containers, and in particular managed and perhaps even “serverless” containers, are valuable assets.  Your own agile on-ramps for those cloud elements then need to be orchestrated in harmony with the cloud pieces themselves.  Thus, Anthos may be a good idea.  I think that the Anthos story is, in fact, better than the one the article presents, and that a complete and realistic picture of cloud usage makes it even better.

Does Cisco Have it Together on Optical Disaggregation?

Does Cisco come to mind when talking “open” in networking?  Probably not, but they’re nevertheless touting an open model in optical transport.  They’re also talking “disaggregation”.  Is that an attempt to jump on a popular term (used by DriveNets, who beat Cisco in an AT&T core router deal)?  We’ll have to dig a bit to get the answers to this one, starting with some history and Cisco’s blog on the topic.

In the history department, Cisco recently redid its Acacia acquisition deal, responding to a tiff between the two companies likely generated by Acacia’s increased perceived valuation.  According to the press release, Acacia “designs and manufactures high-speed, optical interconnect technologies that allow webscale companies, service providers, and data center operators to meet the fast-growing consumer demands for data.”  Likely, the Acacia deal is behind Cisco’s blog.

In that blog, Cisco touts its representation in the Telecom Infrastructure Project’s Open Optical & Packet Transport (OOPT) group, which according to its website “works on the definition of open technologies, architectures and interfaces in Optical and IP Networking.”  OOPT provides a reference framework for an optical network on that site, a diagram that shows a string of elements that are in-scope for OOPT.  There’s also a list of working groups, including one for disaggregated aggregation and core routers.  Based on the diagrams, it seems the focus of the activity overall, including the router piece, is “mobile infrastructure”, since the chain of elements the diagram shows starts in cell sites and ends in “the Internet”.

The specific project focus of the blog is TIP Phoenix, a project to develop an optical transponder that sits at the front and back ends of the optical chain in the diagram.  According to OOPT, it’s “An open white-box L0/L1 transponder that operators can deploy on top/together with their existing line systems to increase the capacity of their optical networks.”  Acacia’s technology certainly fits in that mission, and of course Cisco has its own branded offerings in optical networking, notably the NCS 1004.

All of this seems directed mostly at mobile infrastructure in general, and 5G buildouts in particular.  Operators have expressed continued interest in white-box solutions to the 5G buildout, and some (like AT&T) are committed to them.  This is important to vendors because 5G is a budgeted project for operators, and largely greenfield besides.  The easiest place to introduce new technology into a network is a greenfield buildout, because nothing in place has to be written down, something that would complicate the business case.  In short, 5G infrastructure could be the part of the tent that the white-box camel’s nose could most likely enter.

This is what I think is the core of Cisco’s interest in OOPT, an interest that’s two-dimensional.  First, 5G is in fact going to create a significant additional investment (eventually) in access infrastructure, the chain of elements shown in the OOPT chart.  It behooves Cisco to have something to field in that space, both for mobile 5G deployment and for 5G/FTTN mm-wave hybrids.  Otherwise, they cede a big investment area to white boxes.  Second, the architecture for a generalized packet-optical framework is being worked on in the OOPT, and the last thing Cisco would want is for the initiative to move forward under control of competing vendors, promoting competing ideas.

Most people who are or have been involved in standards initiatives would say that Cisco isn’t known for submerging its own interests.  Some would say they’re obstructive.  An open, disaggregated framework for packet access, aggregation, and transport would certainly not be in Cisco’s interests, even if it stayed confined to mobile infrastructure.  Probably, it would not stay confined.  In fact, DriveNets’ win in the AT&T core might reasonably be linked to AT&T’s white-box focus for 5G, showing that white boxes can spread like a virus if you let them in.

Operators clearly like disaggregated, open, mobile networks.  It may be, given AT&T’s position in white-box 5G and the fact that they designed the Distributed Disaggregated Chassis (DDC) model on which DriveNets’ solution is based, that 5G thinking primes an operator for broader white-box deployment.  The spread of OOPT, then, could be a big threat to vendors like Cisco, not only in 5G infrastructure but in IP infrastructure overall.

The problem OOPT poses is that it defines 5G infrastructure from the tower to the Internet, which means that there’s an explicit edge and aggregation element to it.  It’s the latter that’s important for router vendors, because while you don’t need a router on a linear path from tower to something (there are no route alternatives to take), as soon as you hit an aggregation device you have an implicit need for path selection.  That makes this 5G aggregation element a right and proper router, and one with a strong packet-optical flavor.  Might it then ease its way over the Internet boundary into Internet aggregation?

Operator IP networks today are most often based strongly on MPLS to create “virtual transport” routes within the IP technology framework.  If you were to take packet optical and push it higher, might it take over the MPLS role?  If that were to happen, might IP networks literally have the heart cut out of them?  Scary stuff for a router vendor.

On the other hand, could you not take packet optical directly to the router, providing packet-optical transport integration in the routers themselves?  You can combine optics and routing either by pushing packet handling down into optical transport or by pushing optical transport up into packet/routing.  The latter approach could be a heady option for that same router vendor.

We know that at least one operator (AT&T) was prepared to redo their core based on white-box disaggregated routing.  We know a lot of operators are prepared to build 5G greenfield infrastructure based on white boxes, and it’s likely that the “aggregation” router could look pretty much identical to the skim layer on the Internet core.  If Cisco can hold off the white-box wave at that device, they can contain the risk.  If they can use Acacia technology to push an optically-capable router outward via a packet optical chain, to the tower or access edge, they could do more than contain, they could win.

A strong integrated packet optical story would make things very tough for all of Cisco’s competitors, whether they’re white-box or not.  The traditional mobile infrastructure players need to fend off white boxes and open 5G.  If Cisco embraces openness beyond the aggregation router, but offers a linked packet optical story outward to the tower and inward through the core, they step on a lot of toes, even those of the optical-layer players like Ciena and Infinera.

Could Cisco be biting off more than it can chew by trying this (if they really are)?  Perhaps, but if you’re a giant vendor, it might be wise to turn a bunch of little problems into a big one that you have the mass to solve.  Maybe they need to think even bigger.  Recent stories of integrated open 5G involve a dozen or more elements.  Just putting them all together is a challenge, and if Cisco takes that on, and adds in their packet optical chain and aggregation router, they could field a network architecture that answers everyone’s problems.

They’ve beaten open models before, after all.