How Operator Planners View Carrier Cloud

Even based on early numbers from network vendors, it’s clear that network operators are at least slow-rolling, and likely constraining, capital spending.  While it’s easy to blame the pandemic, network usage for most consumers and businesses has increased since the lockdowns began.  It’s obvious that operators don’t make incremental money on traffic absent usage pricing, so the pandemic is proving that revenue per bit for traditional services is much more likely to continue to fall than to suddenly start rising.

In my recent blog, I noted that TV as a revenue kicker was losing its luster.  VoIP services from operators are also in decline, which means that there are no “traditional” add-on services to broadband access that can be expected to fill the developing gap between revenue and needs.  My expectation, and the expectation of most of the thoughtful carrier planners I hear from, has been that higher-level services would do the filling.  That means “carrier cloud” in some form.

There are two basic models of carrier cloud emerging.  The “classic” model is that the operators deploy their own cloud computing data centers, justified by a mix of virtual-function applications relating to current services and an increasing contribution of the new higher-level services that could provide the revenue kicker needed.  The “outsource” model is that the operators buy cloud computing services from public cloud providers, fulfilling their virtual-function and higher-level needs.  I’ve been getting some insight from carrier planners on the question of which option seems best, and why.

A year ago, every single carrier planner I talked with said they would be deploying their own carrier cloud.  Today, the number is hovering around half, and that’s an astonishing shift in so short a period of time, especially given the traditional inertia of the network operators.  What’s responsible for this shift, according to those key planners?

The biggest factor, according to the planners, is increased sensitivity of executives in general, and CFOs in particular, to high first costs and stranded asset risks.  Every planner who said that classic carrier cloud was less likely to be deployed now versus a year ago, or who said that it would be deployed more slowly or on a smaller scale, cited this factor.  The causes of the increased sensitivity were much more varied.

Tied for the top reason were pressure from the financial markets and a lack of confidence in a stable architecture for classic carrier cloud.  Like the vendors who support them, network operators have to report their financials to shareholders and regulators, and a decline in profit equates to a risk of a decline in share price.  Embarking on a new hosting model for service features creates a risk of stranding assets, and even suggesting that future path would create a demand for an explanation of how it would work, an explanation operators are unsure they could provide.

It’s my own view that the architecture problem is the fundamental one.  The problems with revenue per bit were very clear to operators as far back as 2012, and for some even in 2010.  The idea that some form of open-model technology to replace vendor-proprietary (and high-priced) network equipment could improve profits was discussed widely in 2012.  Then, and now, operators were unsure about just what their open-model network would look like.

Open devices, a combination of white-box hardware and open-source software, are currently the most appealing choice for operators because the approach doesn’t stretch their current understanding of network-building or operations practices.  An open device could be switched in 1:1 with a proprietary device, with a promise of minimal (if any) changes overall.  Cloud-hosting functions instead was recognized as having a greater impact on capex because of the obvious economy-of-scale benefits at the hardware level, but operators are not currently comfortable that they understand how to accomplish cloud-hosting of virtual features/functions.  They’re also not comfortable that they understand the operations complexity and cost implications.

According to the planners, this has caused most operators to shift their strategy from hosting of virtual features in a cloud to open devices.  The latter approach, obviously, doesn’t justify carrier cloud data centers, and given that rather enormous change in thinking, it reveals the financial risk of jumping into the classic model of carrier cloud.

The obvious question at this point is what the lack of classic carrier cloud means to those higher-level service opportunities, the ones that could provide the revenue kicker and enhance profit per bit.  The planners are mixed on this point too, and again almost evenly divided.

The two explanations are that higher-layer opportunities for carrier cloud, when they emerge, will justify the carrier cloud buildout without any seeding of data centers by virtual-feature hosting, or that outsourcing cloud requirements to the public cloud will fulfill the needs of higher-layer services.  In other words, the planners say they’re split between “build” and “buy”, but that may not be exactly the case.

Most of the planners say that they think they will “eventually” start to build out carrier cloud on their own, in the classic model.  Thus, the question is whether they start to do that as soon as they have a credible architecture and a credible target business case, or whether they start building the business case by hosting on public cloud, and then move to self-hosting as their confidence in both technology and business case grows.

My sense is that the common thread in this dualism is a lack of architecture comfort.  Those who think they will start their buildout when they’re confident obviously have that issue in mind.  But even those who think they’ll start in the public cloud and move over have architecture concerns, as I’ve noted.  And there’s a special concern that the latter group should have, but apparently doesn’t recognize yet.

Applications aren’t automatically compatible with any given public cloud; you have to write them for the target.  In most cases, the more you use specialized cloud features to enhance your application, the less portable it becomes.  Could operators build a higher-level service on a public cloud and then simply move it?  It would depend on how it was built.

The few operator planners who have apparently considered this issue are split into three groups (but this is a small number, so there’s not a lot of statistical confidence in the breakout).  One group thinks they can define a cloud model and tools and simply host on IaaS, which would facilitate their movement to their own cloud by eliminating any special cloud-specific features.  One group thinks that they’ll simply adapt the applications when they decide to move them, and a third group thinks they’ll simply keep higher-level services on a public cloud provider, using classic-modeled carrier cloud only for virtual feature hosting in things like 5G.

The weak point for all of them is the architecture issue.  Early operator attempts to define function hosting deviated from public cloud and cloud software conventions, defining their own elements instead of reusing either the architecture or the tools.  They also left many issues uncovered, including the optimum model for deploying software-based, versus appliance-based, network functionality.  The key planners know this, but they’re not aware of initiatives that are on track to address the problem.  Thus, they don’t know how carrier cloud applications will be built, and without knowing that, where they can be run.

It’s very possible that this will have a major impact, to the point of settling the build/buy decision for operators once and for all.  Operators admit to looking more at SaaS contributions from public cloud providers as solutions for their higher-level services, and even for function hosting.  That approach would bypass the architecture question by letting public cloud providers build the applications and simply expose the interfaces for consumption.  And this is pretty much what the “web services” offerings of the public cloud providers already do, so we might wonder whether Amazon, Microsoft, and Google will be offering carrier cloud service-as-a-service in the future.

5G Needs to Overcome the Hype, and Here’s How

The problem with 5G is overreach.  We start off with a network technology that, realistically speaking, has a 100% probability of deployment.  We then try to gild the lily, insisting that every 3GPP feature be deployed, and that everything we do with 5G be something new, never done before, never possible.  We want the rollout to be immediate, and we want it to be exciting.  The result is that among the operators I talk with, there’s a persistent feeling that 5G might “fail”.

The biggest factor in this overreach is the desire by 5G vendors to promote their technology and accelerate its deployment by using the media.  It won’t surprise anyone to hear that stories are more likely to be run if they introduce something new and exciting.  Given that, a story that “5G supports the same video viewing, the same web access, and the same applications” as 4G did is hardly likely to make the top of the editorial hit parade.  That’s sad, because initially that’s exactly what 5G is going to have to do, and what’s going to have to justify its deployment.

5G supports more bandwidth and more customers per cell site than 4G, which is more than enough to guarantee that operators will deploy it anywhere cellular services are widely used.  As more phones become 5G-capable, there will be more pressure to deploy 5G in sites where its specific features aren’t necessarily needed, or even useful.  We’ll end up with 5G pretty much everywhere, but it might well take a decade to achieve that.  It’s this slow evolution that has vendors worried, and it leaves network reporters and editors struggling to write something.  In fact, recent skeptical stories about 5G can be attributed to the inevitable evolution of coverage.  First, you overhype something to get readers interested.  Then it underperforms your overhype.  Then you write stories about how it failed.

Vendors, who often will freely admit to their role in this as long as you promise not to quote them, feel that they have little to lose by pushing 5G ahead of realistic use cases.  They know they’ll win in a decade, and that “discrediting” 5G in the media won’t really delay the realistic deployment timeline.  If hype could accelerate it, with no risk of deceleration, why not go for it?

Well, there are actually some issues created by the 5G hype cycle, issues that I think vendors should consider.  The obvious issue is consumer credibility.  If people’s expectations for 5G are inflated by hype, it’s likely early adopters won’t be able to report much positive about their experience, which will then make it harder to socialize 5G to a larger user base.  But that’s not the real problem, which is that we really do need to identify things that only 5G can do, and the hype is hurting rather than helping.

If you read up on 5G use cases, you find all kinds of things that sound exciting, but that require a rather significant cooperative investment by 5G service providers and technology providers who will support the use cases with user-side technology.  What we need to look for instead are 5G use cases with two specific characteristics.  First, we need some that can be exploited with limited add-on technology on the user side, stuff less complicated than buying an autonomous vehicle.  Second, we need some that can be exploited without developing a software/content ecosystem that would itself have to be justified.

Self-drive cars are a great example of overreach.  Nobody is going to deploy 5G hoping that someday, vehicle manufacturers will sell self-drive cars, and that a lot of drivers will buy them.  There have to be steps in the process or it won’t fly (or drive).

The first logical step is to apply autonomous vehicle technology to vehicles that operate within a specific facility or space, and under common central control.  Warehouse material handling is a good example, and so are locomotives.  These kinds of vehicles could either be retrofitted with control technology or built new with the technology included (the likely path in the warehouse example).  Using private 5G or even network slices, it would be possible to control these vehicles via 5G, and while there would still likely be a human operator to override controls if a problem occurred, it would create a standard control mechanism that could then be gradually applied to other applications.

Another logical step would be to define a connected-car element that used 5G to communicate, and initially focus it on telemetry.  The module would receive data from the car’s computer on the status of all the monitored systems, including the current speed.  It would also have GPS capability to report location data and the ability to accept multiple camera inputs (a driver’s view and an interior view, for example).  All this information would be coded in a standard way and made available, subject to both subscription and security checks, to applications via 5G.  Something like this could be built as an after-market add-on as well as built into new vehicles.  If it had a control-out channel with a standard interface, it could then be used for vehicle control where control systems were available to link with.
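
To make the idea concrete, here’s a minimal sketch of what such a standardized telemetry record and its publish step might look like.  The field names, the JSON encoding, and the publish() stub are assumptions for illustration; no standard module of this kind has actually been defined.

```python
# Hypothetical sketch of a standardized connected-car telemetry record.
# Field names and the publish() stub are illustrative assumptions, not a real specification.
import json
import time
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class VehicleTelemetry:
    vehicle_id: str
    timestamp: float = field(default_factory=time.time)
    speed_kph: float = 0.0
    gps_lat: Optional[float] = None
    gps_lon: Optional[float] = None
    system_status: dict = field(default_factory=dict)      # e.g. {"engine": "ok", "brakes": "ok"}
    camera_feeds: List[str] = field(default_factory=list)  # URIs of driver/interior camera streams

    def to_message(self) -> str:
        """Encode the record in a standard (here, JSON) form for transport over 5G."""
        return json.dumps(asdict(self))

def publish(record: VehicleTelemetry, subscriber_token: str) -> None:
    """Stand-in for the subscription/security check and the 5G publish step."""
    if not subscriber_token:               # placeholder access control
        raise PermissionError("subscription or security check failed")
    print(record.to_message())             # a real module would send this over the 5G link

# Example: one telemetry report from a retrofitted vehicle.
publish(VehicleTelemetry(vehicle_id="veh-001", speed_kph=52.0,
                         gps_lat=40.71, gps_lon=-74.01,
                         system_status={"engine": "ok"},
                         camera_feeds=["rtsp://example/driver-cam"]),
        subscriber_token="demo-token")
```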

Robotic surgery is another of the 5G hype applications, one I’ve blogged about before.  The problem, of course, is that there are neither enough robots available nor enough procedures likely to be performed to justify the use case.  But were we to define a telemedicine element, a module that would communicate over 5G and support standard interfaces for medical monitoring and even for control of devices with current remote capability, uptake would be easier.  We could still add an interface for future robotic control to future-proof the gadget.

We already have a large number of people worldwide who require medical monitoring outside a hospital.  During a disaster, we could well have a lot more of them.  Basic combinations of telemedicine modules and monitors or controllable elements could be deployed to homes and nursing homes, with more complex sets of equipment made available in trailers for remote setup during disaster relief.

Telemedicine is also a viable option for care in areas where having a full staff of medical professionals is impractical or isn’t cost-effective.  A trained technician with a full-service kit of telemedicine tools could provide far more diagnostic and treatment choices than videoconference-based telemedicine, which we’ve already accepted for some missions.

The key points to my examples here are, first, that they’re a subset of the pop culture use cases, second, that they focus on something that can be achieved near-term, and finally that they are all examples of a mission-first approach.  5G as a communication technology needs something to communicate with.  We need to get that something deployable (financially and physically) in quantity before we start talking about it as a 5G driver.  I’m not saying we need roaming telemedicine clinics in search of 5G services, but that we need to be able to deploy them immediately where 5G services are made available.  Otherwise they’re not a use case.

Operators or vendors who want to accelerate 5G should look at the examples I’ve offered here.  There are similar examples available for all the hyped-up use cases for 5G.  All of them could be made rational in a stroke, and while the rational version might not be as interesting in click-bait terms as the pop culture version, a phantom use case generates a phantom deployment.  Reality matters.

Changing Broadband Tastes and the Digital Divide

Our taste for broadband is changing, or to be more precise, our taste for what we use it for is changing.  This is particularly true when you broaden the topic to our consumption of “wireline” services.  Things like channelized TV, once the mainstay of wireline services, are being dropped by customers.  Operators are looking at Internet broadband, for perhaps the first time, as the true digital dialtone of the future.  But what does this mean, both for the companies and for their customers, and how do things like COVID-19 impact the situation?

The biggest factor in the shift in consumer broadband is the widespread availability of streaming video services with large libraries of material.  Amazon, Hulu, and Netflix offer tens of thousands of programs, movies, and specials, and these services have offered users an opportunity to do something other than “watch what’s on.”

This streaming alternative has become more attractive for many because of rising discontent over broadcast/real-time TV.  Online advertising and the shift of viewers away to streaming video have combined to reduce the revenue of networks, and the logical way to compensate is to sell more commercial time.  Reduced revenue has also encouraged content producers to cut back on the number of new shows, which of course means there’s often nothing “new” on at all.  Every winter, every summer, there’s a period in which new episodes of shows aren’t aired, so viewers have to find something else.  When they do, the question is whether that something else is better, and some don’t go back to the scheduled programming.

The obvious impact of this is to reduce the incentive to include linear RF delivery to the home in a bundle.  Most fiber systems deployed by telcos have included linear RF in parallel with broadband data.  CATV cable was designed initially to deliver only linear RF, and the latest DOCSIS specification (4.0) supports 10 Gbps downstream and 6 Gbps upstream.  Linear RF is still supported, but it seems pretty clear that a big reason is customer compatibility.  Many of the cableco technology planners I’ve talked with say that they’d love to move to a pure broadband digital plant even now.

A lot of telco planners have already gotten to the “pure digital” model, from the supply side.  The 5G millimeter wave hybrid with FTTN would eliminate linear RF delivery, forcing those who adopt it to stream video exclusively.  That’s not seen as a major hardship by many, but what it does is encourage the content producers (the networks) to play streaming video suppliers against one another, or even to (as CBS has done in the US) go to streaming their own material.  In short, the comfortable linear model is being displaced by a new delivery model, where we don’t yet have the experience needed to predict winners and losers.

You can see the signs of this in the Verizon quarterly call.  Verizon added 59,000 broadband customers in the quarter, but lost 84,000 TV customers and 94,000 digital voice customers.  Because broadband and streaming are increasingly the only options customers need, and because streaming fulfills more and more of their content needs, they’re shifting to a cheaper plan from Verizon, a plan that provides only that digital dialtone.

Over time, what seems likely to happen is that more and more video will shift to streaming on-demand form.  Sports and news are the only two forms of television that almost demand real-time delivery.  With all the streaming giants doing their own series production and releasing the material a full season at a time rather than in a specific time slot, series TV seems destined to move to the same model.  That would almost guarantee nobody would want to deploy linear RF infrastructure, and we could expect everything to move to broadband streaming.

Some are saying that broadband streaming would create explosive growth in “the Internet”, but of course the majority of the impact would be on the access/metro infrastructure, as far back as the content delivery networks.  We could, if everything watched today were to be delivered via broadband access, see access/metro traffic levels more than double.

One of the benefits of this for the operators is the simplification of the customer demarcation, and the boosting of self-install potential.  COVID-19 has demonstrated that having installers enter the home is a big potential issue, and a complex plant requiring in-home CATV cable almost guarantees that professional installation would be required.  On the other hand, it might be possible to set up an external termination for a broadband data connection, to which a homeowner could attach local wiring and a wireless router.  Some operators think it might even be possible to externally mount the wireless router, and provide homeowners with in-home repeaters they could simply place and plug in.

The impact of all these things on wireline broadband and 5G depends on the operator, and in particular on the demand density of the operator’s territory.  Again, presuming that the average demand density of the US is 1.0, demand densities of roughly 5 or more could support consumer wireline broadband profitably with no additional revenue kickers.  Where demand density is less than 2, it will be challenging to make traditional wireline broadband profitable, and since that’s the situation AT&T finds themselves in, you can understand their challenges.

When an operator can’t make broadband profitable on its own, the traditional revenue kicker of television has (as I’ve already noted above) lost its luster.  AT&T, in my view, is spending too much time and management cycles trying to make a TV offering work, but buying a content producer may have been their smart move.  Similarly, Comcast’s earlier move in the same direction has certainly provided them some revenue upside to buffer them against negative trends in the TV delivery space.

The reason this TV-or-no-TV-kicker stuff is important is that for the lucky (Verizon, with a demand density of 11), you can accept TV losses as long as you can figure out a way of deploying competitive broadband to your base.  FiOS probably can’t expand its footprint much further, so Verizon’s future depends (quite literally) on the 5G/FTTN hybrid that they’ve already been pushing.

For the unlucky, which includes AT&T, the question is whether 5G in some form can save them.  All operators’ wireline territories are a mixture of good-demand-density and low-density areas.  For many, including AT&T, there will be wireline competition already active in some of the good areas, with more likely to come.  Imagine AT&T if it finds itself non-competitive in the areas of high demand density, and with no opportunity elsewhere.  That’s the risk that Stankey will have to address in some way.  Cost cutting will buy him some time, but it’s not a long-term strategy.

To return to the point of competition, if TV delivery isn’t the automatic win it used to be, and if broadband fixed wireless is the future solution to home broadband, then there’s little to prevent any player from entering a market with strong demand density.  We’re already seeing small ISPs springing up to deliver fixed wireless broadband.  Any big operator could surely jump into another territory and cherry-pick, and if content companies can be bought by operators, would it be possible for content companies to instead become operators?  Remember Google Fiber; why not Google 5G/FTTN?  I wouldn’t bet against it.  Same for Amazon, or Apple, or even Facebook.

Where this leaves those service areas with low demand densities remains to be seen.  We don’t know for sure how broadly “mobile-oriented” 5G can project what’s intended to be a wireline replacement.  How many low-density areas are high-density enough to be served profitably with 5G in some form can be calculated only with difficulty; real experience in the field is needed.  But nobody should believe that all areas can be served that way, and if that’s the case, then we may be unable to close the digital divide.

Even for higher-density areas and operators, the impact of losing kicker revenue from video content and voice services may be a big blow.  I believe that the most critical space in the market will be the zone between densities of about 2 (twice the average of the US) and about 4, the point where broadband Internet is profitable enough to sustain a business.  There’s a lot of global geography in that space, and if it can’t be made to support profitable “wireline” home service, a lot of countries are going to have to think hard about policy changes.

State/Event or Policies: Best Lifecycle Automation Option?

Any sort of lifecycle automation demands the generation of responses to conditions.  The industry has defined two broad approaches to that, the use of policies and the use of state/event logic.  Both these concepts have been around for (literally) half a century, so there’s plenty of experience with them.  There seems to be less experience in sorting out the plusses and minuses of each approach, so this is a good time to do that.

The basic problem with handling lifecycle automation, the series of tasks needed to keep a network or computing platform running properly, is that it involves a bunch of different conditions that arise in their own time, with variable relationships among them.  A condition might be anything that moves the technology target from its “normal” state of operation to any other state, and actions are things triggered to restore normalcy.

The oldest approach to dealing with this sort of thing comes from the software that handles network protocols.  A network protocol connects partner elements, and like human conversation, network connections only work if everyone is on the same page.

A typical data link protocol, for example, would have a “setup” state where the link is established, a “normal” state where it’s operating, an “error recovery” state where it tries to fix a problem, a “violation” state where it detects a sign the endpoints are out of context with each other, and a “shutting down” state where it’s going out of service.

Events in a network protocol are the messages received on the connection.  A data packet is an event, as is a request to enter the “normal” state after setup, or a request to repeat the last message.  Each event has to be interpreted in the context the state represents.  A data packet in the “normal” state is fine, but getting one in the “setup” state is a procedure error that shows a context problem.  So is a setup packet in the “normal” state, because things are set up already.

State/event tables are two-dimensional structures (state and event type), and each intersection typically defines a process to be invoked and the “next state” to be set.  When a message is received, it’s classified into one of the event types, and the event and current state then index the table.  You run the process, you set the current state to the “next state” value in the table, and you’re done.
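
To make the mechanism concrete, here’s a minimal sketch of a state/event table in code, using the data-link states described above.  The process functions and event names are placeholders for illustration, not any particular protocol’s definitions.

```python
# Minimal state/event table sketch.  Each (state, event) intersection names a process
# to run and the next state to enter; the processes here are illustrative stubs.

def finish_setup(ctx): print("link established")
def deliver_data(ctx): print("packet delivered")
def protocol_error(ctx): print("procedure error: event out of context")
def start_shutdown(ctx): print("shutting down")

# TABLE[state][event] -> (process, next_state)
TABLE = {
    "setup": {
        "setup_ack": (finish_setup, "normal"),
        "data":      (protocol_error, "violation"),
    },
    "normal": {
        "data":       (deliver_data, "normal"),
        "setup":      (protocol_error, "violation"),
        "disconnect": (start_shutdown, "shutting_down"),
    },
}

def handle_event(state, event, ctx=None):
    """Index the table with (state, event), run the process, return the next state."""
    process, next_state = TABLE[state][event]
    process(ctx)
    return next_state

state = "setup"
for event in ["setup_ack", "data", "disconnect"]:
    state = handle_event(state, event)
print("final state:", state)
```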

Policies are the other way of dealing with conditions.  A policy is a description of a condition tied to the description of an action.  It’s not unlike the programming statement (found in many programming languages) “if condition-exists then perform-action [else perform-other-action]”.  The “condition-exists” test can be compound, so you could test state and event through a policy and eliminate the whole state/event table concept.

Policies are usually defined in sets, which are policies related to a specific condition or goal.  Access control is something that’s often handled by policies.  Each policy is a rule, and policy-driven systems will often have “rule editors” to create the rules and policy sets.  The policies/rules are often directly readable, making it easier to create one or to understand one you’re reviewing.
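
For comparison, here’s an equally minimal sketch of policies as condition/action rules.  The rules and variables are invented for illustration; real policy systems express rules in their own declarative syntax rather than in code.

```python
# Policy rules sketched as (condition, action) pairs evaluated against observed variables.
# The conditions and actions here are invented examples.
policies = [
    (lambda v: v["cpu_load"] > 0.9,                 lambda v: print("action: scale out the service")),
    (lambda v: v["role"] not in v["allowed_roles"], lambda v: print("action: deny access")),
]

def evaluate(variables):
    """Run the action of every rule whose condition matches the current variables."""
    for condition, action in policies:
        if condition(variables):
            action(variables)

evaluate({"cpu_load": 0.95, "role": "guest", "allowed_roles": ["admin", "operator"]})
```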

So OK, we’ve defined state/event and policy systems.  What are the plusses and minuses of each?  Let’s start with the state/event systems.

State/event tables are great for real-time unitary systems.  What that means is that the conditions/events happen when they happen, and that each system of event generation and response has its own state/event table.  I can model a network connection using a state/event table, but if I want to track a whole network, I’ll need to define a model that relates each element of the network to its own state/event table, and I’ll have to define events that can be used to signal one unitary system from another, a means of collectivizing and organizing the system’s behavior.  It’s possible to create n-dimensional state/event tables for complex environments, but the notion falls apart quickly in practice.  I never worked on one, and I’ve worked on a lot of state/event systems.

If you tried to replicate unitary system (network connection, for example) state/event processes using policies, you’d run into a similar problem.  The policies would exist for every connection, but you’d have to reflect how those per-connection policies related to the higher-level system, and how higher-level policies would then be defined and executed.

In networks, the complexities of unitary versus systemic policies are often handled implicitly.  The presumption in many (nay, most) policy-based network implementations is that you have relatively autonomous elements composed of many unitary systems.  These deal with their own issues in their own ways, but those ways are influenced by a policy control point, and implemented via a policy enforcement point.  In other words, the structure of a policy-based system creates an implicit model that, for state/event systems, would have to be explicit.

This sounds sweet, but it raises two points.  The first is how those autonomous elements “deal with their own issues”.  Most networks rely on self-healing or adaptive behavior to maintain operation, and policies are then needed only to influence the way that behavior is tuned, or to deal with situations where resolution through self-healing isn’t possible.  The second is the risk of policy complexity overload.

Back in my early days of programming, before high-level languages were commonly used in business computing, one of the big problems in application design was the failure to efficiently organize tests of conditions.  Remember, there was no if/then/else structure in those languages.  Often a program would check for a dozen conditions, and the data would present condition number 13, which would “fall through”.  My employer had a genius software type who invented a “decision table” language to organize the tests, and the resulting tables could be machine-validated to determine whether they tested all possible conditions or, worse, tested things that prior tests had already ruled out.  If A=3 gets you started with this policy set, testing whether A=4 is clearly not a good sign of logical thinking.

Policies can present this same problem of test complexity.  You have policy rules to cover conditions, but are the rules complete and consistent?  This can be particularly difficult when there’s no central policy control point, so policies are distributed and are difficult to assess holistically.  This isn’t a defeat for policies, but a cry to support policies through central repositories where they can be examined for completeness and contradictions.

My own views on this are probably understood by those who have followed this blog regularly.  I believe that it’s essential that we model services as a set of functional blocks, modeled using intent principles and based on a formal “class inheritance” structure where a “node” might decompose into a “router”, then into a “Cisco router”, and a “network” might decompose into “access” and “core”.  This approach has been in place within the TMF (with various degrees of rigor in compliance) for well over a decade.  If this is done, then I believe that the (again-TMF-initiated) NGOSS Contract approach, which uses the contract data model to steer events through each functional element’s state/event table, is the way to go in implementation.
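
A schematic sketch of that structure, with invented class, event, and table names, might look like the following: each functional block is an intent-modeled element carrying its own state/event table, and events are steered through the model’s decomposition.

```python
# Schematic sketch: a service modeled as intent-based functional blocks, each carrying
# its own state/event table (NGOSS-Contract style).  Names and tables are illustrative.

def activate(block): print(f"activating {block.name}")
def report_fault(block): print(f"fault reported by {block.name}")

GENERIC_TABLE = {
    ("ordered", "deploy"): (activate, "active"),
    ("active", "fault"):   (report_fault, "degraded"),
}

class FunctionalBlock:
    """An intent-modeled element: internals hidden, behavior driven by its state/event table."""
    def __init__(self, name, table=GENERIC_TABLE, state="ordered"):
        self.name, self.table, self.state = name, table, state
        self.children = []

    def handle(self, event):
        process, next_state = self.table.get((self.state, event), (None, self.state))
        if process:
            process(self)
            self.state = next_state
        for child in self.children:        # events propagate down the decomposition
            child.handle(event)

# "network" decomposes into "access" and "core"; "core" could decompose further into nodes.
network = FunctionalBlock("network")
network.children = [FunctionalBlock("access"), FunctionalBlock("core")]

network.handle("deploy")
print([(b.name, b.state) for b in [network] + network.children])
```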

What makes me leery about policies is that the implementation threatens to lose real-time properties.  A state/event table is inherently event-driven, meaning real-time.  A “conditions test” rule is one that invites a “poll for status” implementation.  “If-condition-exists” is easily interpreted as meaning to go get the variables and run the test.

Policy implementations are also free to deviate from functional standards, meaning that they’re not automatically stateless.  A properly designed state/event-driven system uses stateless processes for the state/event intersection, making it resilient and fully scalable.

So, I’m firmly in the state/event corner on this one.  I’m happy to analyze policy-based approaches if they’re documented well enough for me to assess them, but I’m going to be looking hard at the points I’ve cited here to see if they deal with them.  If they don’t, then in my view, they won’t form a reasonable basis for lifecycle automation.

Is There a General Approach to Automating Tech Lifecycles?

If the biggest problem in information technology is complexity, could the biggest question be whether artificial intelligence in some form is the solution?  We may have to answer that question as soon as this year, partly because the evolution of IT is bringing us to a critical point, and partly because the pandemic has raised some difficult issues.

In March, I asked a bunch of my enterprise operations friends what they believed was the biggest source of new operations challenges.  I asked them to generalize rather than say something specific, but almost everyone gave me points of both kinds, so I’ll present the responses that way.

The biggest new source of challenges in a general sense was optimization of resources.  The overall view was that it’s easy to run applications and support connectivity if you don’t care much about resource costs.  You bury your problems in capacity.  Ops types say that over the last decade, there’s been increasing pressure to do more for less, to reduce the relentless expansion in server and network capacity to contain costs.  The result has been greater complexity, and more operations issues.

Take virtualization, the number one specific point ops types cite.  We used to have a server, an operating system, and applications.  Now we might have a server, a hypervisor to create virtual machines, deployment/orchestration tools, and an application.  The result might be that we use servers more efficiently, but we’ve complicated what’s needed for an application to run.  Complexity increases the management attention needed, the number of events to be handled, and the number of ways the pieces could be fit together (many of which won’t work).

You might wonder why so many companies would decide to trade higher opex for lower capex, even to the point where the costs might exceed the savings.  According to operations professionals, the reason is that operations organizations tend to be staffed to handle peak-problem conditions and to ensure a very high level of availability and QoE.  The additional issues created by resource optimization are often handled, at least until a peak-problem condition occurs.  Line organizations might tolerate erosion in availability and QoE.  There’s residual capacity in most operations organizations, a safety factor against a major issue.  Companies are knowingly or unwittingly using it up.

They’re also turning to a strategy that’s creating our second biggest challenge, point-of-problem operations aids.  When an operations organization is overloaded, the problems don’t all hit at once.  There are almost always specific areas where they manifest first, and when that happens even operations types think in terms of fixing the problem.  If your pipe is leaking, you don’t consider replumbing your home, you fix the leak.  The result can be a lot of patched pipes, and the same thing is true in operations organizations.

The problem is that every new and specialized tool introduces more complexity.  Point-of-problem technology is what the name suggests, a tool focused on a specific issue.  There may be related issues, ones that cause the target issue or ones that the target issue can cause.  There may be issues that look like the target issue but are really something else.  All these issue relationships are outside the scope of a limited tool, and so they’re handled by human agents.  Operations resource needs then increase.

On the compute side, operations personnel cite virtualization again as a specific example of point-of-problem tool explosion, and because containers are such a focus right now, they cite container operations as their biggest concern.  Containers are a feature of Linux.  Container orchestration is handled by Kubernetes, but you have to monitor things so you need monitoring tools.  Kubernetes may not be able to efficiently orchestrate disparate data centers or hybrid/multi-cloud without a federation tool.  You might need a special version for edge computing.  What tends to happen as container interest grows is that bigger and bigger missions are selected for containers, which create new issues, generate a need for new tools, and the next thing you know, you have half-a-dozen new software tools in play.

A big part of the expansion in tools relates to new missions, and the new missions create the third of our operations issue drivers; new features or requirements create complexity.  Most network and IT operations organizations have seen a continual expansion in the features and services they’re required to support, required because new technology missions have created new demands.  Ops types here cite security as the biggest example of this.

One ops type in the network space, talking about using firewalls and access/forwarding control to protect critical applications and information, noted that line organizations were told they had to advise operations of changes in personnel, roles, and in some cases even workspaces, to keep the access control technology up to date.  “Every day, somebody forgot.  Every day, somebody lost access, or somebody got access to what they weren’t supposed to.  They told me it was ‘too difficult’ to do what they were supposed to do.”

Security also generates a new layer of technology, and often several different and interlocking layers in both network and IT/software.  All this means more complexity at the operations level, and because security crosses over between IT and network operations, there’s also a need for coordination between two different organizations.

The pandemic is an almost-tie for security in this particular area of operations complexity generation.  Application access, information relationships to workers, and workflow coordination through a combination of human and automated processes have all been impacted by the work-from-home shift the virus forced.  Not every company is yet prepared to assume it will have to condition its operations to respond to another pandemic, or to decide whether to improve its response ad hoc or to create a new and more responsive model of IT.  Over the rest of this year, most say they’ll make their decision, and implement it in 2021.  That should give us a target date to solve the operations challenges we’ve raised here.

The common element here is that more stuff means more management of stuff, and most everything going on in networking and computing is creating more stuff.  This stuff explosion is behind the interest in what’s called “operations lifecycle automation”.  We see the most refined version of this in some of the modern DevOps tools, particularly those that use the “declarative” goal-state model of design.  These tools can accept events from the outside, and use them to trigger specific operations steps.
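
As a rough illustration of the declarative, goal-state idea (the component names and counts below are invented), the tool holds a desired state and derives the operations steps whenever an event reports the observed state:

```python
# Sketch of declarative goal-state handling: compare desired state with observed state
# and emit the remediation steps.  Components and counts are invented for illustration.

desired = {"web": 3, "db": 1}          # desired replica count per component

def reconcile(observed):
    """Return the steps needed to move the observed state toward the desired state."""
    steps = []
    for component, want in desired.items():
        have = observed.get(component, 0)
        if have < want:
            steps.append(f"start {want - have} instance(s) of {component}")
        elif have > want:
            steps.append(f"stop {have - want} instance(s) of {component}")
    return steps

# An incoming event carries the observed state; the tool derives the steps automatically.
print(reconcile({"web": 1, "db": 2}))   # ['start 2 instance(s) of web', 'stop 1 instance(s) of db']
```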

In networking, state/event tables and events can be combined (as the TMF did with their NGOSS Contract work) to steer conditions to the processes that handle them.  The TMF SID and the OASIS TOSCA models both support at least a form of state/event handling.  My ExperiaSphere project presents a fairly detailed architecture for using state/event-centric service modeling to provide for automation of a network/service lifecycle.

Given all of this, you might wonder why we don’t see much happening around operations lifecycle automation in the real world.  Over 80% of my ops contacts say they don’t use it significantly, if at all.  Why?

The problem is that operations lifecycle automation based on goal-state or state/event handling is what you could call anticipatory problem handling.  A state/event table is an easy example of this problem.  You have to define the various states that your system/network might be in, all the events that would have to be handled, and how each event would be handled in each state.  If we’d started our automation efforts when we first saw glimmerings of our management-of-stuff problem, we’d have been fine; our knowledge and the problem could have grown together.  Retrofitting the solution onto the current situation demands a lot of the ops organization, not to mention the tools they rely on.

Proponents of AI say that what we need here is some fuzzy logic or neural network that does what all those human ops specialists are expected to do, which is analyze the “stuff” and then automatically take the necessary steps in response to conditions.  Well, maybe, but none of the ops types I’ve talked with think that something like this could be created within three to five years.  I tend to agree.  We may, and probably will, reach a point in the future where machine judgment can equal or surpass human judgment in operations tasks, but we can’t wait for that to happen.

The real hope, the real possible benefit, of AI would be its ability to construct operations lifecycle automation policies by examining the real world.  In other words, what we really need is machine learning.  State/event processes, and even state/event automation of operations processes, are based on principles long used in areas like network protocols and are well understood.  What’s needed is definition of the states and the events.

In order to apply ML to operations, we would need to have the ability to look at the network from the point where current operations processes are applied—the operations center.  We’d have to be able to capture everything that operators see and do, and analyze these to correlate and optimize.  It’s not a small task, but it’s a task that’s clearly within the capabilities of the current ML software and practices.  In other words, we could realistically hope to frame an ML solution to operations automation with what we have today, providing we could monitor from the operations center to obtain the necessary state and event information.
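
As a toy sketch of that idea, suppose we had captured operations-center history as pairs of observed conditions and the operator action taken; a simple classifier could then propose the action an operator would likely take next time.  The features, labels, and data below are invented, and a real system would need far richer inputs than this.

```python
# Toy sketch of the ML idea: learn, from captured operations-center history, which operator
# action followed a given set of observed conditions.  Data and labels are invented.
from sklearn.tree import DecisionTreeClassifier

# Each row: [link_utilization, error_rate, cpu_load] observed when the operator acted.
observations = [
    [0.95, 0.00, 0.30],
    [0.40, 0.20, 0.35],
    [0.50, 0.01, 0.97],
]
operator_actions = ["reroute_traffic", "reset_interface", "scale_out_host"]

model = DecisionTreeClassifier().fit(observations, operator_actions)

# Later, a new condition arrives and the model proposes the step an operator would take.
print(model.predict([[0.92, 0.01, 0.40]]))   # likely ['reroute_traffic'] with this toy data
```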

This is where I think we should be focusing our attention.  There is no way of knowing if we could create a true AI “entity” that could replicate human operations judgment at any point in the near future.  We could likely apply the ML solution in 2020 if the industry decided to put its collective efforts behind the initiative.  ETSI’s Zero-touch network and Service Management (ZSM) group might be a place to do that, but we already have a term for the general transformation of ML principles to the real world—Machine Learning Operations or MLOps—and vendors like Spell, IBM, and Iguazio have at least some capabilities in the space that could be exploited.

The biggest barrier to the ML solution is the multiplicity of computer and network environments that would have to be handled.  It’s not going to be possible to devise a state/event structure that works for every data center, every network, as long as we have different events, states, and devices within.  Were we to devise a generalized network modeling framework based on intent-model principles, the framework itself would harmonize the external parameterization and event generation interfaces, define the events and states, and immediately present an opportunity for a universal solution.

Would it be better to wait for that?  I don’t think so.  I think we should try ML and MLOps on networks and IT as quickly as possible, so we don’t hit the wall on managing complexity down the line.

What IBM’s Hybrid Cloud Positioning Means to IT

Can we know anything about open source in a pandemic world?  IBM is the first big tech name to report earnings since the virus struck, and with the acquisition of Red Hat, they’re also the giant in the open-source enterprise software space.  Do they offer us any clues about the way that the open-source software space, and even cloud software, might be impacted by the virus and shutdown?  We’ll see.

IBM’s new CEO, Arvind Krishna, set the stage in the call by calling out the “unprecedented global public health crisis.”  COVID-19 caught the world by surprise with the pace of its spread and the impact it had on locking down the world economy.  No company was likely to have planned for this, including IBM, and IBM was in fact in the midst of a leadership change and restructuring when it happened.  But just because IBM opened with the pandemic doesn’t mean that their focus for the future is to address work-from-home in future pandemics.  It’s to create a new application model that would be more agile if more WFH is needed, for any reason.  It’s about technology.

As far as technology direction is concerned, Krishna was also clear.  IBM wants to be the hybrid cloud focus of tech, the go-to company for anyone who is serious about hybrid cloud.  In a realistic sense, that means every enterprise that has any cloud computing aspirations at all, so this is a significant broadening of scope for IBM.  Remember, IBM had shed its hardware lines except the mainframe stuff, and it was rapidly contracting to become a mainframe company in a cloud world.  No longer.

On its earnings call, IBM’s Krishna made the same point I’d cited from an internal IBM memo when he took the reins.  Hybrid cloud is the fourth pillar of IBM’s success, the thing that takes over from the traditional hardware, software, and middleware business to propel IBM into a leadership position…again.  In their early days, IBM’s motto (they even gave out notebooks with it on the cover) was “Think.”  Now, it could be “Think Hybrid Cloud.”

Krishna also makes a point I touched on in my blog yesterday; buyers are looking for sellers who will weather the pandemic and shutdown storm.  “As our clients adjust to this new normal, they need a partner they can trust. IBM is that partner.”  I think it’s IBM’s intention to rebuild the position they once had in strategic influence in the data center.

Strategic influence has been the most important factor in IT vendor success for literally decades.  From the time I started measuring it in 1989, IBM led the pack by a large margin.  In fact, their early influence was 40% higher than any other vendor’s, which meant IBM could set the tech agenda and sell into the climate it created.  Since 2011, though, its influence has steadily dropped, to the point where, prior to the Red Hat deal, there were four companies with greater influence.  Red Hat was one, and of course IBM now owns it.

Enterprises have told me from the first that they strongly favor cloud strategies based on open-source components.  Part of the reason is their growing fear of lock-in, at a time when developments are coming fast and a vendor’s position might suddenly collapse.  Open-source levels the playing field, increasing competition among vendors and mobility among buyers.  Red Hat obviously has the pieces to the hybrid puzzle, so all that’s needed is to put them together.

The key to the hybrid cloud position, and the strategic influence it brings, is the notion that there is such a thing as a hybrid cloud platform.  As strange as it may seem, vendors have up to now failed to grasp that an application is designed to run on something, and the “something” it runs on is a platform.  If the hybrid cloud is to become the computing strategy of the future, it has to have a platform.

If there is a singular best hybrid cloud platform, the vendor who offers it is in the driver’s seat.  IBM intends to be that vendor, by accepting that the hybrid cloud is an “architectural battle”, then winning it.  Krishna says “we give clients that unique ability to build mission critical applications once and run them anywhere. Together with Red Hat, we are establishing Linux, Containers and Kubernetes as the new standard. This is winning the architectural battle for hybrid cloud.”

Red Hat, even before IBM’s acquisition, was known as an aggressive promoter of the single-supplier strategy, the concept that a sci-fi book I once read called a “circular trust”.  Red Hat Enterprise Linux works best with Red Hat middleware, which works best with OpenShift for containers and the cloud, which works best with RHEL hosts…you get the picture.  Hybrid cloud is the value proposition that validates this circularity, because it requires all the components.  If applications are designed for a hybrid-cloud architecture, then hybrid cloud architecture defines and unifies the entire Red Hat product line, and its value proposition.

Hybrid cloud is critical to IBM, but so is artificial intelligence (AI).  In fact, IBM executives like Krishna tend to mention the two together most of the time.  AI has a powerful appeal as a means of addressing complexity in all its forms, and it’s something IBM has pushed (with things like Watson) for years.  But the AI emphasis in the first earnings call after the start of the pandemic shows that IBM sees AI as a means of assuring operations when lockdowns disrupt the normal operations practices of their customers.

A lot of AI positioning by IBM has been almost philosophical in nature.  Analytics and AI will help you make better decisions, spot problems, address opportunities.  Nice, but not really compelling.  Now, AI can help you keep your data centers running and your customer and supply-chain portals running even when your staff is forced to work from home.  Nothing abstract there.

I think IBM wants its AI to be almost like an IBM account team of old; always on site, always consulted, always influencing, always a part of the business.  AI can be a path to managing the complexity that enterprises perceive exists in deploying hybrid cloud, containers, Kubernetes.  Add AI to hybrid cloud, assume you can move application components between cloud and data center, and you have a kind of “my-business-as-a-service” framework.  When companies fear capex, they can migrate more to the cloud and make it into opex.  When they’re confident, they can pull it back again.

Promoting this through sales/marketing is obviously a challenge, and IBM thinks it has a solution, which they describe as a “more technical approach to selling.”  While they don’t break it out this way in any forum I’ve seen (including the earnings call), enterprises tell me that it seems to have four focus areas.

The first one is that IBM recognizes that it needs to focus on a deeper sell, because cloud and hybrid cloud decisions go deeper, in technical terms, than the executive teams of enterprises.  IBM account efforts have typically focused on those executive teams, and now they have to be prepared to go deeper, to present technology details.  Red Hat has faced that all along, of course, so their experience helps here.

The second focus area is demonstrations and proof-of-concept work.  IBM doesn’t want to fall into the sales trap of showing that something can work, they want to show something working.  The critical piece of promotional material that links executive teams and project teams is an actual demonstration of an actual product.  That sort of thing proves that the solution isn’t too far away, which is important when it’s needed quickly.

This is extended by the “virtual garage” concept.  Instead of having IBM consultants, developers, and client personnel set up a kind of garage shop to build an implementation, IBM proposes to do it virtually.  This seems a great way of addressing sales/marketing in a pandemic world, but they appear to have been working on this approach even before COVID-19.  If you want account control, you have to get account presence, and the broader base Red Hat brings just doesn’t lend itself to traditional major account teams.

The final focus area is public cloud services.  Hybrid cloud hybridizes the data center and the public cloud.  Even if IBM thinks they can own the data center, having a hybrid cloud architecture that compels the introduction of a hungry potential competitor like Amazon, Google, or Microsoft invites problems.  IBM needs to promote its own public cloud to secure the full benefits of its hybrid cloud positioning.  It can’t be an exclusive relationship, but it can be preferential.

Looming over these four areas, and the entire hybrid cloud question, is whether we’re inevitably going to embrace a hybrid cloud platform as a service.  Right now, applications that are divided between public and private cloud hosting tend to be made up of components either designed for the data center or for the cloud.  Yes, there are some that can be moved between the two, but there’s a distinct front-end/back-end structure.  That’s due in large part to the fact that development practices for the two “ends” are different.  If some vendor came up with a hybrid PaaS with its own set of APIs, applications could be written entirely to the PaaS, and the PaaS APIs could be mapped either to the cloud or the data center as needed.
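
A sketch of what that could look like is below; the API, class names, and backends are invented, since no such hybrid PaaS exists today.  The application is written against one abstract interface, and the platform maps the call to a public-cloud or data-center backend.

```python
# Sketch of the hybrid-PaaS idea: applications call one abstract API, and the platform maps
# the call to a public-cloud or data-center backend.  All names here are hypothetical.
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """The single API an application is written against."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class CloudObjectStore(ObjectStore):
    def __init__(self): self._blobs = {}          # stand-in for a public-cloud object service
    def put(self, key, data): self._blobs[key] = data
    def get(self, key): return self._blobs[key]

class DataCenterObjectStore(ObjectStore):
    def __init__(self): self._blobs = {}          # stand-in for on-premises storage
    def put(self, key, data): self._blobs[key] = data
    def get(self, key): return self._blobs[key]

def run_app(store: ObjectStore):
    """The application never knows which side of the hybrid it is running on."""
    store.put("order-123", b"pending")
    return store.get("order-123")

print(run_app(CloudObjectStore()))       # same application code...
print(run_app(DataCenterObjectStore()))  # ...different hosting target
```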

This would be a profound change in development, but if IBM is right about the hybrid cloud’s importance in future IT, it may be a change that IBM will not only have to support, but lead.

Coronavirus: What Now?

Our unhappy pandemic is fading…slowly…for now.  It’s not clear when and how things will be opened up again, but it seems likely that will be starting within a month in the US.  Other countries have already started opening up.  The impact of the pandemic is historic, but we won’t know all of the details, likely, until the end of the summer.  It may well change our lives, our behaviors for a very long time, perhaps for a generation.

What can we now expect for technology?  I blogged about the impact of the virus HERE and HERE, but I’m not going to reference these past blogs now.  Instead I’ll try to apply what we’ve seen and learned to the rest of 2020.

One thing we know now is that the global economy is terribly vulnerable to a pandemic that is serious enough to pose a risk to life on a broad scale, and for which we have no vaccine or mitigating treatment.  Coronavirus alone has created three incidents in the last decade or so, and of course we have various flus and even Ebola.  I think we accepted that disease was a risk, but I don’t think we ever thought of it as a natural disaster capable of collapsing economies.  It probably didn’t happen this time, and probably won’t even in 2021, but it could have.  That means it likely will at some point.

In a world whose supply chains, tourism, and business travel link every country, it doesn’t take much for something bad and local to become bad and global.  Some will say that this means we should cultivate work from home and take other steps to limit the travel that spreads diseases, and while that could be a good idea for other reasons, it’s not going to stop a future pandemic.  People will travel, goods will move, and even a little would be enough.

Technology could play a role in protecting us, not so much by replacing real travel with virtual travel, but by monitoring conditions.  We’ll have to navigate legitimate issues of privacy, but all societies rely on a surrender-something-for-more-gain theory.  I saw infrared camera monitoring in Africa during the Ebola crisis, and it’s still in place.  We didn’t have it here, even when I returned from Africa during Ebola, and that’s probably not smart if we don’t want a future economic collapse.

Google and Apple, I think, have the germ of a good idea in mobile-device contact tracing.  It will be important that it not turn into a way of stalking someone, but if we could be aware of interactions close enough to spread disease, and then check the affected people, we could have a head start on the next one.

If we can test quickly.  Some countries did quite well with testing, isolation, social distancing.  The US, not so well.  The most important technical advances we could seek now are those that might apply things like AI to virology/epidemiology.  A rapid test for a disease, and for immunity among those who might have had it without symptoms, could frame a strong and realistic response to a new virus (or a new outbreak of COVID-19) short of shutting down the global economy.

Testing and monitoring might have global political consequences.  Many viruses get their start through crossovers between animals and humans, crossovers that often happen because of local cultural practices that have been in place for generations.  When globalism combines with this localism, does it create a risk that has to be monitored across national and political boundaries?  Do we need to not only watch our own people for signs of disease, but watch others in case they’re developing something that could spread to us?  How would we deal with a country that refused to allow monitoring and refused to abandon its local cultural practices?

What about remote work and work-from-home?  I don’t think that fear of another pandemic is going to motivate companies to shift their policies radically for the long term, but it’s likely that we’ll see a very strong shift in attitude.  Companies will want to have a productive WFH strategy in place, to be executed if needed.  That strategy, if it’s good enough, may also have the effect of empowering home work for a larger percentage of the workforce, but it will be the strength and value of the strategy and not the fear of pandemic that will do this.

For the Internet, we’ve learned that while we use the term as though there was one universal service spreading globally, we really have local pockets of “internet” that combine to create “Internet”.  Where those pockets have CATV or fiber access, the Internet rode out the situation quite well.  I personally never noticed any issues at all, but I know people who had major problems even doing conference calls.

We have to decide whether we want to have a true base level of broadband access available for home use.  If we decide we do, we’ll have to figure out how to fund the copper-loop replacements.  If we don’t, we’ll have to accept that WFH isn’t going to work in some areas, and so companies there will be at risk if there’s a shutdown in the future.

We also have to examine another online phenomenon, which is the ad sponsorship issue.  If what we can do online, even what we can view on TV, depends on how much ad revenue it can generate, we’re creating a filter on reality.  Many people tell me that pandemic information on social media has been manipulative, wrong, horrible.  Maybe; I didn’t look.  I do believe that news coverage of the pandemic and the response has been bad, so bad that I’ve stopped watching some material completely.  For me, it’s not political, it’s about truth and accuracy.  If I think somebody is saying something to generate views/clicks, I’m out of there.  How do we keep people informed if the information resources are watching their ad revenues and not the real world?

Then there’s the problem of potential advertising collapse.  If all the world is ad-funded, what happens when retail sales take a big hit, as they have in the pandemic?  With the decline in ad revenue, we risk losing the things advertising sponsors.  What happens to our online lives if the sponsors cut back?  We need to think about that before the sponsorship is gone.  Is broadband a true public utility?  We just moved in the other direction.  Are search and news?  That’s a big step.

Finally, there’s tech budgeting.  There’s a lot of talk now that the wonderful thing about “moving everything to the cloud” is that your costs could expand and contract as your business does.  If a pandemic creates an economic crash that dries up your revenues, you could lower your costs too.  To a degree, a small degree, that’s true.  But most companies would tell you that IT spending was a small piece of their operating budget.  The virus isn’t going to move everything, or even a lot more, to the cloud…at least not directly.

The pandemic has shown that retail front-ends, either for sales or customer service, have to be a lot more flexible and agile.  Thus, we can expect to see more cloud front-end development coming along, even this year.  Enterprises that have given me data on this say they expect to spend 18% more on cloud front-end development and deployment in 2020 than they’d budgeted.

That could be the bright spot, budget-wise.  The same companies say that they will slow-roll capital projects in IT, especially projects that don’t meet their “urgent” definitions.  They want to see whether the virus will kick up again in the fall, and how fast their own customer base responds to the gradual re-opening of their economies.  They’re not saying they’re cutting costs, only that they’re delaying previously approved spending.  Most think their IT spending this year will be off by an average of about 2.5%, versus an expected gain of roughly the same amount, a swing of about 5% against expectations.

The impact of all of this on tech vendors will vary.  I think that the largest of the players in the space will be favored.  Businesses will be wary of the stability of their technology suppliers, fearing loss of support or even failures to deliver.  Bigger firms with bigger names have an easier time promoting customer engagement at a time when sales calls are either risky or downright forbidden.  IBM’s quarterly report was light on revenue versus expectations, but they beat on earnings.  This could make them a preferred partner over smaller open-source players who are yet to report, and who may be impacted more.

Many of the other issues arising from the pandemic are beyond anything but guessing.  Will it hurt or help 5G?  Neither, probably, but who knows?  Will consumer sales of tech be stronger than business?  Probably, but probably not as strong as they would have been.  Will public policies on broadband change?  Probably not, but of course we can expect to see some more committees and studies.

We’re not out of the woods on this yet, friends, but it’s not too early to start thinking about the next one down the line.  We can do better.

Learning a Lesson from Frontier

Frontier Communications has filed for Chapter 11, which isn’t a huge surprise to most of us who’ve followed the fortunes of the company from the first.  Following hasn’t been easy, because Frontier is a hodgepodge of smaller phone companies, has gone through a number of name changes, and has explored a number of different business models.  The question is whether it went wrong along its convoluted path, or whether it’s a victim of a systemic problem that more telcos might face.

Frontier’s collection of customers spans more than two dozen states.  It has a very small FTTH base it acquired from Verizon, and an equally small cable system base in Connecticut.  The great majority of its customers are supported on copper loop.  That meant that either Frontier had to commit to DSL broadband, or to a significant capital program to upgrade these customers to fiber.  Since one financial analyst categorized the various acquisitions Frontier made as “scavenger sales”, meaning that they represented customers who were considered to have lower ARPU potential than average, earning a return on that kind of upgrade was problematic.

Frontier seems to have realized its problems relate to its dependence on copper plant.  In a report made to the West Virginia PUC, the company admitted to almost a million potential network problems that related to the state of the copper plant and the result of decades of accumulated maintenance patch-ups.  They told the SEC that “significant under-investment” in fiber “created headwinds that the company is repositioning itself to reverse,” according to a Fierce Telecom report.

The problem, of course, is that fiber to the home has a very high “pass cost”, meaning that the cost associated with just getting fiber into the vicinity of a home, so a connection can be made if a customer signs up, is very high.  In rural areas, and even in many suburbs, that cost could well be prohibitive, particularly since the live TV services that most telco and cable broadband providers have tapped to raise ARPU are losing ground to streaming TV.

The key question is ROI, which obviously is a combination of potential revenue and cost to connect.  The economic potential of a geography is roughly related to its GDP and the density of rights of way within it, which is a measure of the density of population.  If we normalize the US demand density as 1.0, we find that countries where good broadband is universal have demand densities from 4.5 upward.  Within the US, we find that 14 of the 50 states have demand densities in that 4.5-and-up range, and that Frontier’s DSL customers are concentrated in states where demand density is near or even below the US average.
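
To illustrate the normalization, here’s a small sketch that treats demand density simply as GDP per square mile relative to the US baseline.  That is a simplification of the idea (the exact weighting of GDP and rights-of-way isn’t spelled out here), and the numbers in the example are purely illustrative, not real statistics.

# Rough sketch of "demand density": economic potential per unit of area,
# normalized so the US equals 1.0.  This is an interpretation for illustration
# only; the precise weighting used in the text isn't defined here.
def demand_density(gdp: float, area_sq_mi: float,
                   us_gdp: float, us_area_sq_mi: float) -> float:
    """GDP per square mile, expressed relative to the US baseline."""
    return (gdp / area_sq_mi) / (us_gdp / us_area_sq_mi)

# Illustrative numbers only: a geography with 4x the US's GDP per square mile
# scores 4.0, which is still short of the 4.5-and-up range cited above.
print(demand_density(gdp=4.0, area_sq_mi=1.0, us_gdp=1.0, us_area_sq_mi=1.0))  # 4.0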

Revenue realization from broadband depends on being able to sell useful speeds, and the standard for that is video.  DSL is rarely considered competitive for streaming video delivery, and that’s now a must-have for most home broadband prospects.  About half of all the complaints that the West Virginia report cited were related to DSL performance not meeting the Frontier promises, and even had the promises been met, streaming video would likely have been problematic for many.

The final nail in the profit coffin is the fact that copper-loop voice communications is also in decline, as more and more users shift to mobile services.  Consider that about half of the West Virginia complaints were about DSL alone, another 20% were about both DSL and telephony, and 29% were about telephone service only.  That means nearly half of the complaints involved Plain Old Telephone Service (POTS).  If you can’t even deliver POTS effectively on copper loop, what good is that plant?

This isn’t a new problem.  Back in the late 1980s, a BellSouth expert told me that the copper loop plant was the big risk for the company, and that unless something were done to shorten the loop through fiber to the node, it was unlikely loop quality could be maintained.  The expert suggested that the plant had perhaps fifteen years of life remaining, and of course we’re now past double that lifespan.

There have been plenty of technological advances aimed at delivering DSL at competitive speeds, but the fact is that the condition of the loop plant and the length of the loop have an enormous impact on what those technologies can actually deliver.  Many of the deals through which Frontier acquired its current customers were offered by telcos that apparently believed they couldn’t make money on the customers they were selling off.  Loop condition and potential ARPU were obviously the problem.

Frontier, in the Fierce Telecom article cited above, said they had 11 million DSL customers, and that a plan to convert 3 million to fiber would cost $1.4 billion through 2024.  I think it’s safe to assume that Frontier cherry-picked those 3 million, and that they represented the best potential.  Think about what might be required for the average, or most challenging, customers.

This is the same dilemma faced by Australia, which took the unusual step of shifting broadband access to a semi-government company, the National Broadband Network (NBN).  They expect to have about 5 million of the 12 million NBN customers on fiber this year, when the NBN rollout is complete.  Experts in Australia complain that the fiber-to-the-node approach that BellSouth advocated before consumer broadband was even a market issue is now inadequate, and that in any event the loop plant itself is “a run-down network.”

It’s time to face an important truth, which is that the current consumer and business broadband opportunity cannot earn a respectable ROI on the cost of a full fiber modernization.  We are not going to close the digital divide by putting everyone on fiber, even were we to follow Australia into an almost-nationalized approach.  Maybe aggressive investment at the dawn of consumer broadband would have paid back; not now.  That means no “wireline” approach is going to work across a typical country, state, or service area with a demand density below 4.5, and even 4.5 seems a bare minimum for success.

Wires won’t work, but wireless might.  In fact, the biggest impact 5G could have on consumers and small businesses is likely to be not the changes it might (or might not) make in the mobile experience, but what it could do as a substitute for copper-loop wireline.  How far the benefits of 5G could extend into the “wireline” world is hard to say, but here are some basic presumptions we could make.

First, the millimeter-wave 5G hybrid with FTTN can deliver speeds well over 100 Mbps at distances up to around four miles, according to the data I’ve been able to see.  A single node could then serve a practical coverage area of about 12 square miles.  Using a density of 300 dwelling units per square mile (a suburban average), that comes to about 3,600 prospective customers per node.  5G/FTTN distances would be greater if the node were placed on a high spot in the topography, on a tall tower, and so forth.  Operators tell me that this approach would be suitable for about 80% of suburban populations.

Second, standard 5G cells should deliver about 20 Gbps of aggregate capacity to a distance of 10 miles, depending on the frequency used.  That would mean a practical coverage area of about 80 square miles, and since “rural” areas are considered to have about 50-100 dwelling units per square mile (call it 75 on average), that would mean a total of roughly 6,000 prospective “wireline” customers per cell, in addition to whatever mobile service might also be provided.  The question, which is difficult to answer at this point, is whether it would be practical to cover 80% or even 50% of the rural population with 5G mobile-like service.  The difficulty lies in the fact that there are too many ways of defining “rural”, and no matter what definition you use, the ways that rural populations are concentrated are highly variable.  Some studies, though, admit that three-quarters of the cost of covering a diverse service area like a state or country could well come from covering rural customers.
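
For what it’s worth, here’s a back-of-envelope check on those per-site numbers, using the coverage areas and dwelling-unit densities cited above (12 square miles at 300 units per square mile, 80 square miles at 75 units per square mile) rather than any real radio-planning model.

# Quick arithmetic check on the per-node and per-cell prospect counts, using
# the coverage figures from the text, not a propagation or capacity model.
def prospects_per_site(coverage_sq_mi: float, dwellings_per_sq_mi: float) -> int:
    return round(coverage_sq_mi * dwellings_per_sq_mi)

suburban = prospects_per_site(coverage_sq_mi=12, dwellings_per_sq_mi=300)
rural    = prospects_per_site(coverage_sq_mi=80, dwellings_per_sq_mi=75)
print(suburban)  # 3600 prospective customers per mm-wave/FTTN node
print(rural)     # 6000 prospective customers per standard 5G cell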

It looks to me like about 7 or 8 million of the 11 million customers Frontier has could be covered with 5G in some form, but it’s unlikely at this point that Frontier could acquire the spectrum it would need.  They didn’t even bid in last year’s millimeter-wave spectrum auctions, and they’re not a mobile player at all.  They’d also need to run fiber to all the cell sites and nodes.  All this to overcome the fact that DSL won’t cut it, and even if you get past that issue with 5G, you have to ask whether the customer base would pay enough for super-broadband to generate an ROI.

Frontier didn’t have a sound business model, in my view.  Instead, they were chasing a view of telecom as a perpetual cash cow, neglecting both the influence of mobile services and the need for truly high-speed broadband to support video.  Now it may well be too late for them.  It’s hard to model the cost of a 5G strategy for Frontier, but I’d estimate we’d be talking about ten billion or more, and who’s going to give it to them?

We can’t retrospectively fit the right decisions and business model on Frontier.  What we need to do instead is prevent the problem from spreading.  Copper loop will never provide the kind of broadband people want, and small telcos will never be able to afford to change it out without the kind of massive government subsidies we’re not likely to see in our age of the pandemic.  Regulators need to think about how to get copper out of the plant, and how to encourage larger telcos to buy smaller ones, rather than selling off their less-profitable pieces.

Does Virtualization, in or outside 5G, pose a Security/Compliance Risk?

In yesterday’s blog, I looked at the operations considerations that arise when you build a “network” by hosting feature/function instances on infrastructure.  I pointed out that the process creates two explicit layers—“functional” and “infrastructure”—and one implicit one, the layer representing the binding between the two that actually creates the service.  We explored the operations impact of this in yesterday’s blog, so today we’re looking at the security/governance impact.

Some aspects of 5G security are considered within the specifications themselves.  These relate to the “traditional” security issues of protecting the exposed interfaces identified, so they’re necessary but not necessarily sufficient.  The implementation of 5G, just like the implementation of any new network model that hosts features on a pool of resources instead of fixing them within devices, creates a new set of interfaces.  These are neither normally exposed nor considered in the 5G specs, so we have to dig into them by digging into the model of a virtual-feature-hosting infrastructure.

Just as we can visualize two “layers” of our virtual network, we can divide connectivity by those two layers.  Since we’re presuming that the functional layer represents the equivalent of the device network, the connectivity in that layer would be similar to (or identical with) what we’d see in a device network.  The infrastructure layer, on the other hand, represents the connectivity in the pool of resources we can commit to function hosting.

In the functional layer of the network, we have the service domain network, the connectivity that’s explicitly visible to service users.  Think data plane, control plane, and management plane for device networks and you get the idea.  If we have a “functional element” in our functional layer, and that element exposes interfaces that are visible within the service, then those connections are part of the service domain.  Part of this network supports connectivity to the management/lifecycle processes at the service level.

The infrastructure domain network is the network that connects the pool elements and their associated management/lifecycle tools.  Generally speaking, this network should be totally isolated from the service domain network, because the infrastructure layer is shared by all users and services, and so exposing it to an individual user/service would constitute a major security/governance breach.  This is our on-ramp to our discussion today.

The greatest barrier to intrusion in any network is access control.  If you can’t address the network or something on it, then you can’t attack it.  The corollary is that the more attackable things there are on a network, the more points of attack are possible.  The fact that virtual hosting creates an infrastructure layer with its own network means that that network creates attack vectors of its own, vectors that have to be shortstopped.

The best starting point for security and governance is to assign a private IP address to anything in the infrastructure network.  These addresses (defined for IPv4 in RFC 1918 and for IPv6 in RFC 4193) are not routed on public IP networks like the Internet, so anything with a private IP address has to be gated (NATed) onto a public IP network, meaning the address has to be translated.  Anything that isn’t explicitly translated can’t be addressed from outside, which lets us separate the service domain and infrastructure domain networks.
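
As a quick illustration, Python’s standard ipaddress module already treats the RFC 1918 and RFC 4193 ranges (among other reserved blocks) as private, so a deployment tool could at least sanity-check that infrastructure-domain endpoints aren’t sitting on publicly routable space.  This is a sketch of that check, nothing more.

# Sanity-check that an infrastructure-domain endpoint is on non-routable space.
# The ipaddress module in the Python standard library recognizes RFC 1918
# (IPv4) and RFC 4193 unique-local (IPv6) addresses, among other reserved ranges.
import ipaddress

def is_infrastructure_safe(addr: str) -> bool:
    """True if the address is private and therefore not routable on the public Internet."""
    return ipaddress.ip_address(addr).is_private

print(is_infrastructure_safe("10.20.30.40"))    # True  (RFC 1918)
print(is_infrastructure_safe("fd12:3456::1"))   # True  (RFC 4193 unique local)
print(is_infrastructure_safe("8.8.8.8"))        # False (public; shouldn't appear in the infrastructure domain)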

Of course, in theory almost anything can be made immune to hacking, and we know that theories have a way of not being applied.  In this case, there are two common loopholes.  First, anything that lives on both network domains can shunt things between them.  Second, if something can be planted on a private IP network, it can then relay or introduce traffic there.

The obvious response to the security/governance issues related to virtual functions is meticulous control over the software at all levels.  Of paramount importance is the software that manages the deployment, the “orchestration” tools that might reside in a variety of places.  Anything that has the right to deploy onto infrastructure can deploy malware.

Even virtual functions themselves can be an issue.  At the very least, malicious functions could disrupt a service.  They might also be able to disrupt the host, attacking via holes in container or VM security.  But in some cases, the risk could be higher, because a virtual function could live in two worlds.  If the function represents a piece of data/control-plane functionality, it necessarily lives on the service domain network.  If it has to be addressed/manipulated by the infrastructure lifecycle components, it might also live on (be visible on) the infrastructure domain network.  As such, it could be that badly behaved portal between the two that could admit all manner of bad things.

There are some strategies that could limit this kind of risk.  The obvious one is that any component of lifecycle management must be considered to be highly secure, its source authenticated and its operations validated in a lab before deployment.  Having a guarantor to sue if things go wrong is always a good strategy too.  A less obvious but potentially valuable requirement is that no virtual function/feature element that would likely have to be introduced with less certification should ever be allowed to present an interface directly onto the infrastructure domain network.  If orchestration and lifecycle management needs access to this sort of software, it has to be via an intermediary element, something as simple as an API proxy or gateway or as complex as an abstraction tool designed to harmonize all functions of a given class to a common set of interfaces.
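
Here’s a minimal sketch of what such an intermediary might look like, assuming a hypothetical proxy that exposes only a vetted set of lifecycle operations and relies on per-function-class adapters.  The operation names and classes are illustrative, not drawn from any real orchestration product.

# Sketch of an "intermediary element": lifecycle tools never talk to a virtual
# function directly.  They call a proxy that exposes only vetted operations and
# translates them for each function class via a certified adapter.
class FunctionProxy:
    ALLOWED_OPS = {"health_check", "scale", "restart"}   # the vetted operations

    def __init__(self, adapters):
        # adapters maps a function class name to a callable(op) -> result
        self._adapters = adapters

    def invoke(self, function_class: str, op: str):
        if op not in self.ALLOWED_OPS:
            raise PermissionError(f"operation '{op}' is not exposed to the infrastructure domain")
        adapter = self._adapters.get(function_class)
        if adapter is None:
            raise LookupError(f"no certified adapter for function class '{function_class}'")
        return adapter(op)

# Example: a firewall-VNF adapter that harmonizes its private interface
def firewall_adapter(op: str):
    return f"firewall: executed {op}"

proxy = FunctionProxy({"firewall": firewall_adapter})
print(proxy.invoke("firewall", "health_check"))   # allowed, routed via the adapter
# proxy.invoke("firewall", "open_shell")          # would raise PermissionError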

5G complicates the picture because of network slicing, which is a partitioning of an operator’s 5G network into what are supposed to be independent subnetworks, which are then essentially tenants of the main network, in the sense that users of cloud computing are tenants of the cloud provider’s infrastructure.

I think the best approach with 5G is to assume that each network slice will have its own “infrastructure layer” and domain network, separate from the main 5G Core resource network.  If all slices were truly ships in the night, then application of the layer-domain principles above should provide a pathway to securing each slice.  However, it seems certain that some elements of 5G software will have to coordinate across, and thus themselves cross, boundaries.

There is no doubt that virtualization of network features, to create a secure framework, must include practices to standardize and tightly control interactions among the software components.  The “standardize” part is particularly important, because the more latitude is offered in the way that software is introduced into a virtual framework, the more difficult it will be to determine whether the software is behaving properly.  “No boundaries” isn’t a formula for successful child care, and it’s not for successful virtualization either.

Logically speaking, all of the slices in 5G networks should interact with infrastructure via a virtualization/abstraction layer.  Given that, I think we should assume that such an abstraction layer would be advisable in all cases.  I don’t mean just “virtual machines” or “containers” here, but the ecosystem of tools and APIs that represent resources as they’d be expected to be used by consuming processes within a functional layer.  Similarly, each slice should contain a functional layer whose interactions with the 5G core functions (registration for example) are abstracted and presented by an API that is visible to the network operator as well as the slice operator—a NAT gateway or proxy.
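
To make that concrete, here’s a sketch under my own assumptions (nothing here comes from the 5G specifications): the slice tenant sees only opaque handles, and an operator-owned gateway translates them to real infrastructure addresses, much as NAT hides an internal host.

# Sketch of a per-slice abstraction/NAT-style gateway.  The slice addresses
# resources by opaque handles; only the operator-side gateway can translate a
# handle back to the real (private) infrastructure address.
class SliceGateway:
    def __init__(self, slice_id: str):
        self.slice_id = slice_id
        self._handle_to_real = {}     # opaque handle -> real infrastructure address

    def expose(self, real_address: str) -> str:
        handle = f"{self.slice_id}-res-{len(self._handle_to_real) + 1}"
        self._handle_to_real[handle] = real_address
        return handle                 # this is all the slice tenant ever sees

    def resolve(self, handle: str) -> str:
        # operator-side translation; never visible to the slice tenant
        return self._handle_to_real[handle]

gw = SliceGateway("slice-iot")
handle = gw.expose("10.0.17.42")      # private infrastructure address, hidden from the tenant
print(handle)                         # e.g. "slice-iot-res-1"
print(gw.resolve(handle))             # operator-side lookup back to 10.0.17.42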

It may be that 5G, like so much in our modern world, risks being a victim of a form of virtualization pushed by standards bodies that aren’t able to do much beyond spelling the word.  We should have realized by now that when you promote virtual functionality in networking, you have to make sure you abstract everything that the virtual world touches.  A coupling to “real” stuff is inefficient, and in security and compliance terms it’s also very dangerous.

5G, Virtualization, and Complexity: The Ops Dimension

One operator offered an interesting view of hosted-function networking: “In some ways I hope it’s different from devices, and in others I hope it isn’t.”  The “I-hope-it-is” position is based on the economic driver for using hosted virtual functions; it would be cheaper than proprietary appliances.  The “I-hope-it-isn’t” position relates to the fear that the network operators don’t have a handle on the operational issues associated with function hosting, including operations cost and complexity and security/compliance.

If you host something instead of buying the same functionality in device form, you’ve generated a second tier or layer of infrastructure, below the functionality.  It’s convenient to use this model as the basis for our discussion here.  We have a “functional” layer that represents the way the functions themselves interact with each other and with their own EMS/NMS/SMS structure.  We have an “infrastructure” layer that represents the interactions involved in sustaining the hosting infrastructure and sustaining the lifecycle of the functions.  Got it?

This dualism creates two obvious impacts.  The first is that whatever you’d expect to do to manage the functional behavior, you now have the additional requirement of managing the infrastructure layer.  Nearly all the operators who have experience with this tell me that infrastructure lifecycle operations is more complex and expensive than functional, meaning device, management.  That shouldn’t be a surprise given the number of things that make up a cloud platform, but somehow it seems to have surprised most operators.  The second impact, at least as often overlooked, is in security and governance.

Both these impacts are important in next-gen networking, and more so where network technology relies explicitly on “virtualization” of devices/functions.  5G is such a place, but hardly the only one.  In this blog, I’m going to talk about operations considerations arising from the structure/function dualism, and I’ll address security in the next one.

One factor that complicates both issues is that having two “layers” actually means having three.  Yes, we have functional and infrastructure, but we also have what I’ve called a “binding layer”, the relationship between the two that’s established by provisioning applications onto infrastructure.  The relationship between the two layers is important for the obvious reason that you have to host functionality on infrastructure, but also because problems in either layer have to be considered in the context of the other, and perhaps fixed there.

The operational challenge created by the two separate layers has already been noted; you have two layers to operationalize instead of one and that’s pretty easily understood.  The binding layer introduces a different set of issues, issues relating to the relationship between the other two layers.  Since that relationship is at least somewhat transient and dynamic, normal management problems are exacerbated, and since the two layers are in a sense both independent and interdependent, the problems can be profound.

The binding process, the linkage of the layers into a service-based relationship, can be subsumed into the infrastructure layer or treated through a separate piece of “orchestration” software.  The latter seems to be the prevailing practice, so let’s assume it’s in place.  When a service is created, the functional elements are deployed (by orchestration), and any necessary pathways for information (data, control, and management planes) are connected.  How this is done could be very important, and that may or may not be “remembered” in any useful way.  Whether that hurts depends on which of two broad management approaches is taken.

The first of these options is to assume that the top layer is responsible for the SLA and also for remediation.  The bottom layer is only responsible for fixing failures that aren’t functional failures.  If something in the infrastructure layer breaks and that breaks functionality, the presumption is that the functional layer will “re-bind” the functional element to infrastructure, which fixes the functional problem.  It’s then up to the infrastructure layer to actually fix the real problem.  If the infrastructure breakage can be remedied within the infrastructure layer through some simple replacement or re-parameterization, it never reaches the status of a functional failure at all.  Ships sort-of-in-the-night.

“Orchestration” in this approach sits outside both layers, creating the original binding between the layers but not really using whatever knowledge it might have obtained out of its process of hosting and connecting things.  That means that there’s really no “systemic” vision of what a service is built from.  You can probably replace a component that has failed even without that knowledge, but a broader multi-component failure could leave you with no understanding of how to connect the new with the remaining non-failed elements.
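
A minimal sketch of this first option, with purely illustrative names, might look like the following: the functional layer simply re-binds a failed element to another host from the pool, and because no service-wide map is kept, anything beyond one-for-one replacement is out of reach.

# Sketch of the "ships in the night" option: the functional layer re-binds a
# failed element to a fresh host.  Only the current binding is remembered, so
# a single element can be replaced, but there is no service-wide model to
# reason over if a multi-component failure occurs.
import random

class FunctionalElement:
    def __init__(self, name: str):
        self.name = name
        self.host = None               # the current (and only remembered) binding

    def rebind(self, host_pool: list):
        self.host = random.choice(host_pool)   # pick any healthy host; no history kept
        return self.host

def on_host_failure(failed_host: str, elements: list, host_pool: list):
    healthy = [h for h in host_pool if h != failed_host]
    for elem in elements:
        if elem.host == failed_host:
            elem.rebind(healthy)       # functional layer fixes itself...
    # ...and the infrastructure layer separately repairs or replaces failed_host.

fw = FunctionalElement("firewall")
fw.rebind(["host-a", "host-b", "host-c"])
on_host_failure(fw.host, [fw], ["host-a", "host-b", "host-c"])
print(fw.host)   # rebound to one of the surviving hosts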

The second option is to assume that the binding layer is the source of remediation, of lifecycle management.  In this option, the binding process, and the “binding layer” are the real objective.  We need a map, a model, of infrastructure, stated in service terms.  The presumption is that there’s a “template” for service-building that is followed, a blueprint on how to commit infrastructure to the functional layer.  This is filled in with details, the specific commitments of resources, as deployment progresses.

With this approach, the model/map is available throughout the lifecycle of the service.  It’s possible to use it to rebuild a piece of the service, even a large piece, because the connections between elements are recorded, and where the rebuilding impacts those connections, they can then be rebuilt as well.  If there’s a problem at either the functional or infrastructure level, it’s possible to correlate the impact across the binding layer because we’ve recorded the bindings.

The problem with this approach is that it requires a very sophisticated model, one that doesn’t try to model topology directly, but at some level has to incorporate it in at least virtual terms.  Some modeling approaches, like YANG, are more topology-related and not particularly suited to modeling bindings.  TOSCA can be enhanced to represent/record the structure of a “service” from a binding perspective, and a few firms have been playing with that.  I’ve used XML to do the modeling in the experiments I’ve run because it lets me control the way that relationships are represented without imposing any assumptions by the syntax of the modeling language.
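
To illustrate the second option in the spirit of that XML approach, here’s a sketch in which the service model records each functional element’s binding and connections as deployment proceeds, so an infrastructure failure can later be correlated back to the functional layer.  The element and attribute names are examples only, not a real or complete schema.

# Sketch of a binding-aware service model: the service template is filled in
# with bindings (host) and connections (peer) at deploy time, and the recorded
# model can answer "what was bound to this failed host?"
import xml.etree.ElementTree as ET

service = ET.Element("service", name="vpn-branch-17")
for fname, host, connects_to in [("vFirewall", "host-a", "vRouter"),
                                 ("vRouter",   "host-b", "vFirewall")]:
    fe = ET.SubElement(service, "functional-element", name=fname)
    ET.SubElement(fe, "binding", host=host)              # filled in at deploy time
    ET.SubElement(fe, "connection", peer=connects_to)    # recorded data-plane linkage

def elements_bound_to(model: ET.Element, host: str):
    """Correlate an infrastructure failure back to the functional layer."""
    return [fe.get("name") for fe in model.findall("functional-element")
            if fe.find("binding").get("host") == host]

print(elements_bound_to(service, "host-a"))   # ['vFirewall']: what to rebuild if host-a fails
print(ET.tostring(service, encoding="unicode"))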

The take-away here is that virtual-function infrastructure is going to add operations complexity somewhere.  Either you add “technology complexity” to the modeling phase, and carry it through deployment and lifecycle management, or you add complexity to the lifecycle management process when you encounter a condition that crosses the border between the functional and the infrastructure layers.  Right now, it’s my sense that we’re dodging the problem by not recognizing it exists, and if 5G really does commit us to broad-scale virtualization of functions/features, we’ll have to fix this border problem when we try to deploy and exploit it.