Protecting Networks from Built-In Backdoor Hacks

Could a vendor plant back-door code in a network device?  That’s a question that’s been asked many times, especially recently.  The answer is obviously “Yes!” so the more relevant question might be “Is there anything that could be done, realistically, about that?”  That’s a harder question to answer, but we may now need to think about answering it, or deciding it can’t be answered.

I’m not going to get into the politics of this.  Everyone knows that Huawei has been accused of having the potential, or the intention, of slipping in a backdoor that could allow it to intercept traffic or bring down a network.  Most know that the US has been accused of similar things.  Believe what you like, and defend what you must; it’s the defending we’ll focus on here.

A “backdoor” is a way to gain access to software, either loaded or embedded, for some unauthorized purpose.  Programmers have been accused of building backdoors into their code for reasons ranging from spying or theft to holding a client hostage against non-payment.  Backdoors are designed to go undetected, but the level of subterfuge needed to stay hidden depends on how much effort is spent trying to uncover the hack.

The simplest backdoors were things like administrative passwords that were built in and hidden, so someone could log on with full access rights.  This sort of thing can usually be spotted by code analysis, especially if source code is available.  If we assume that somebody wants to do something truly sinister and not get caught, it’s unlikely they’d use that approach.

The more sophisticated approach to creating a backdoor involves mimicking something that happens in normal software due to poor programming practices.  Buffer overflow is the classic example.  Programs read information into memory areas called “buffers”.  If the message that’s read in is longer than the buffer, and there’s no protection against that, the data at the end of the message runs out of the buffer and into whatever lies beyond it.  That’s “buffer overflow”.

The hacking comes in when you put code at the end of the message, and that code overwrites whatever happens to lie beyond the buffer.  If that code is then executed, it can open a door to a broader hack, often by jumping to a short program earlier in the packet.  This sort of thing happens regularly with commercial software, which is proof that it’s not easy to find, but if the computer chip and operating system understand the difference between “code memory” and “data memory”, the hack could be caught when the overflow occurs.  Again, it would probably not be a foolproof approach for someone who didn’t want to get caught.
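
As a toy illustration only (Python, with a purely invented flat “memory” layout, nothing like a real device and certainly not exploit code), here’s the difference between an unchecked and a checked copy into a fixed-size buffer:

```python
# Toy model: a flat "memory" where a fixed-size input buffer sits directly
# in front of a region that holds "code".  Sizes and layout are invented.
BUFFER_SIZE = 16
memory = bytearray(b"\x00" * BUFFER_SIZE) + bytearray(b"LEGITIMATE-CODE!")

def read_message_unchecked(message: bytes) -> None:
    """Copies the message into the buffer with no length check."""
    memory[0:len(message)] = message        # writes past BUFFER_SIZE into whatever follows

def read_message_checked(message: bytes) -> None:
    """Copies only as much of the message as the buffer can hold."""
    memory[0:BUFFER_SIZE] = message[:BUFFER_SIZE].ljust(BUFFER_SIZE, b"\x00")

attack = b"A" * BUFFER_SIZE + b"INJECTED-PAYLOAD"   # longer than the buffer
read_message_unchecked(attack)
print(memory[BUFFER_SIZE:])   # the "code" region now holds b'INJECTED-PAYLOAD'
```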

The really sinister approach would be a lot harder to catch.  Suppose your network device has (as most high-end ones do) a bunch of custom chips that perform specialized packet analysis or data forwarding management.  These chips, in effect, have their own software hidden in silicon.  This software could respond to some external trigger, like a sequence of bits in a series of packets, by triggering something seemingly normal at the software level.  The sequence could also do things like alter forwarding instructions.  It would be difficult enough to know that a hack like this had been perpetrated after the fact, examining the results.  It could be near-impossible to find it ahead of time.

I don’t think that code reviews of the kind various companies and countries have proposed would have a realistic chance of finding a backdoor hack, and I certainly don’t think that such a review could prove there was none.  Does that mean there is truly no way to say that network devices, or other IT devices and software, don’t have backdoor hacks waiting to be activated?  Maybe no ironclad way, but there are surely some things that could be done to reduce the risk, and they all fall into what I’ll call “functional isolation”.

The idea behind functional isolation is that hardware and software components are more easily hacked when the boundaries between the elements are blurred.  A chip, for example, is a greater risk if it can interact with device software or manipulate memory directly.  Software elements are riskier if they have broad ability to invoke other components or change memory contents.  If you limit these interactions through functional isolation, you reduce the risk of a hack.  Furthermore, if you have a functional boundary, you need to ensure that a single vendor doesn’t control how that boundary can be crossed.

One obvious step to achieve functional isolation is to protect each “function” by making it impossible to access the memory of other functions; code protection through memory management.  This has to be a feature of both the operating system and the CPU chip, though, and for devices that aren’t built that way, the obvious may be impractical.

One easier standard to enforce would be to say that any network device has to have its hardware and software unbundled, with the specifications for the interface available in open form, and with an accompanying open-source software implementation available for the hardware.  This wouldn’t prevent proprietary coupling of hardware and software, because the all-proprietary option would still be available, but it would allow a buyer to separate the hardware and software if they believed they needed to do so for security.

A better level of security could be created by isolating hardware from software through a standard software intermediary, a “driver”.  Many specialized interfaces and hardware elements are connected to the operating system or software via a driver, and if the driver has specific, standard interfaces to what’s above it, and if it’s secure, then it’s harder for a chip to introduce a hack.

Let me offer a specific example.  Many network devices can be expected to use custom forwarding chips.  If the device vendor supplies both those chips and the software that uses them, there’s a risk of a backdoor created by having a chip open an avenue of attack that the software then follows up.  Suppose, though, that all network devices that use special forwarding chips had to use a third-party driver to support the chip and make it compatible with the P4 forwarding language.  If the driver is certified carefully, the risk of a chip-created backdoor is reduced, because the software won’t follow up on an attack.
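
To make the boundary idea concrete, here’s a minimal Python sketch of what such a certified driver interface might look like.  Everything here (the class and method names, the chip object) is hypothetical, not drawn from P4 or any actual driver specification; the point is simply that the forwarding software reaches the chip only through a small, auditable set of calls.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ForwardingEntry:
    """One match/action rule, roughly in the spirit of a P4 table entry."""
    table: str
    match_key: str
    action: str

class CertifiedForwardingDriver:
    """Hypothetical third-party driver: the only path between the forwarding
    software above and the vendor's chip below.  Anything not on this
    interface (raw memory access, undocumented registers) is unreachable."""

    def __init__(self, chip):
        self._chip = chip                        # vendor silicon, treated as a black box

    def install_entry(self, entry: ForwardingEntry) -> None:
        self._chip.program(entry.table, entry.match_key, entry.action)

    def remove_entry(self, entry: ForwardingEntry) -> None:
        self._chip.unprogram(entry.table, entry.match_key)

    def read_counters(self, table: str) -> dict:
        return dict(self._chip.counters(table))  # returned as a copy, not live chip state
```

If the driver is the only thing that needs certification, the certification job shrinks to the size of this interface rather than the size of the chip.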

Functional computing (lambda, serverless, microservices, whatever you’d like to call them) could also reduce the risk of a backdoor hack.  A “function” in programming terms is a process whose output depends only on its inputs:  result=functionof(inputs), in other words.  If functions cannot store things internally or put them in some central store, then it’s harder to contaminate execution without coopting a bunch of functions to support your nefarious deeds.  It’s easier to scan functional code for hacks, too, because they work specifically and singularly on what’s passed to them, and they produce only the specified result.
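
In code terms the distinction is simple.  Here’s a minimal Python sketch of result=functionof(inputs), contrasted with a stateful version that’s much harder to audit:

```python
# Stateless: the result depends only on what's passed in, so the code can be
# scanned in isolation and any instance can handle any request.
def classify_packet(header: dict, policy: dict) -> str:
    return policy.get(header.get("dscp", 0), "best-effort")

# Stateful: the result also depends on hidden internal history, which is
# exactly the kind of side channel a backdoor could hide behind.
_seen = []

def classify_packet_stateful(header: dict, policy: dict) -> str:
    _seen.append(header)                       # hidden state accumulates here
    if len(_seen) > 1000:                      # behavior now depends on history,
        return "suspect"                       # not just on the inputs
    return policy.get(header.get("dscp", 0), "best-effort")
```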

The challenge for operators or businesses that fear backdoor hacks is that vendors are likely to be slow to adopt these kinds of practices.  That would leave the same choice that buyers have today: trust the vendor, or go somewhere else for products.

Does Microsoft Plan on Building the Carrier Cloud?

The Microsoft purchase of Affirmed Networks is yet another indicator that much of the transformation operators plan will be implemented not on “carrier cloud” but on public cloud providers.  This might be very bad for vendors who’d hoped to fill all those carrier cloud data centers with equipment and software, but good news for 5G.  The operative word, of course, is “might”.

Affirmed Networks is a virtual-mobile-infrastructure player, in somewhat the same space as Metaswitch, a company I’m familiar with from the older days of NFV.  They provide both current 4G mobile network and 5G network support, and also offer mobile-edge computing, IoT, and WiFi virtualization.  In my analysis of carrier cloud, the mobile space accounts for about 16% of the total opportunity driver through 2030, and for almost 20% of the driver in the next three years.  Of the “traditional” or “evolutionary” drivers of carrier cloud, the ones that relate to services network operators already offer and are comfortable with, it’s the best of the near-term lot.

That makes mobile services particularly critical to carrier cloud, because without near-term deployment support, the less-comfortable application classifications, including IoT and contextual services, would have to justify the early carrier cloud data center buildout, which my modeling says would be unlikely.

Operators liked the idea of a carrier-cloud-hosted mobile network because it would generate more competition and lower prices.  The challenge has always been defining exactly how one created the software, and for that, operators have depended on vendors.  All the major mobile infrastructure vendors (Ericsson, Nokia, Huawei) have their strategies, and innovative players like Affirmed and Metaswitch do too.

Mobile cores work pretty much the same way at the functional level, meaning that they accommodate the fact that smartphone users and other mobile users have to be reached in the cell they’re currently located in.  The packets addressed to a phone have to be routed to the cell site, not the phone directly.  This is handled through the use of tunneling between the packet network gateway (to the Internet) and the cell sites.  This process is facilitated by mobility management and registration, and of course by the tunnel management.

The big difference between 5G Core and the 4G Evolved Packet Core is that the latter is based on the traditional “network appliance” model.  Virtualization and hosting are fine, but you virtualize/host the software instance of the appropriate EPC device.  5G Core is based on a service model, wherein physical interfaces or ports give way (at least in terms of specification) to APIs.  A service-based mobile core is of course more elastic and resilient, and it fits better with the notion of network slicing, which creates what are effectively virtual mobile networks on top of a single set of radio/backhaul/network facilities.

It’s easy, though, to lose sight of the forest of carrier cloud servers in the trees of mobile function software instances, so to speak.  The big capital investment for mobile operators would be the carrier cloud data centers.  Since we’ve seen (and I’ve blogged about) a growing number of partnerships between operators and cloud providers that at least suggested the former might be very happy to offload carrier cloud data centers to the public providers, it’s not surprising that one such provider, Microsoft, has bought a mobile infrastructure function player in Affirmed Networks.

Any cloud provider should love the idea that they might be the recipient of carrier cloud data center outsourcing, and all the more so if they could support operators’ outsourcing entire carrier cloud applications.  Microsoft should love the idea more than most, perhaps, because mobile network functions are not direct enterprise services but services that enable other services—mobile communications, in this case.  Any public cloud provider would be a credible option to host them, because it’s not likely the implementations would require any of the web services that cloud providers use to differentiate their specific hosting options.  Google, hoping to elbow its way into the top two or even the top spot in the cloud market, would be especially interested.

Cloud players would love mobile infrastructure software wins for another reason, which is the already-noted point that 4G/5G infrastructure is the dominant driver for carrier cloud deployment through 2023.  If network operators outsource the 5G function hosting to cloud providers, that largest driver of carrier cloud deployment disappears, making it much more likely that operators will at least start their later-developing carrier cloud projects in the public cloud instead.  Then, of course, it’s more likely that those projects will stay in the public cloud, which will then make it harder to start the next wave on carrier cloud infrastructure.  That could shift most of the carrier cloud applications to public cloud providers.

All of this raises an interesting question, though.  Do the public cloud providers have a truly effective implementation of the kind of function hosting 5G would drive?  We hear about “cloud-native” all the time, but there’s no broadly accepted definition of what the term means, and no objective way to apply it to products whose details are not available for inspection.  A big part of the problem is that when we’re talking about infrastructure that’s shared among multiple users, there are multiple layers in even the term “service”, which obviously impact service lifecycle management.

Back in the 2000s, the first body I know to have encountered this issue took up the debate.  The IPsphere Forum, responding to the interests of the EU operators who were members, started to look at how the principles of IPsphere could be applied to mobile/IMS/EPC.  The paramount question was what the “service” was.  Clearly, doing lifecycle management on individual mobile calls wouldn’t be practical.  Do you do lifecycle management on customers/phone numbers, or do you lifecycle-manage IMS/EPC itself?

The view that emerged was that the “service” in a shared-infrastructure relationship was collective, meaning that you applied lifecycle management to the components of the infrastructure and not to the commitment those components made to individual customers or customer experiences.  I think this is how 5G would have to be supported too, but that decision would seem to disconnect 5G from the concept of “service modeling” and a “service contract” that I’ve always said was the centerpiece of modern thought on service lifecycle management.  What does the TMF concept of the “NGOSS Contract” mean to the collective 5G core?

In my ExperiaSphere project, I called these collective network functions “foundation services” because, once deployed, they became a part of the infrastructure.  Foundation services, IMHO, look more like applications than like network functions in the familiar sense.  You don’t deploy a bank deposit and check application (“demand deposit accounting” or DDA) every time someone makes a deposit or cashes a check.  It’s there, available for use.  The same would be true of mobile-network-related functions from 4G (IMS/EPC) or 5G (Core).  Thus, there’s less a TMF-like service contract for them than an application model.

There’s no reason why this application model couldn’t serve the same role as that NGOSS contract, though.  A lot of the elements might not seem to make sense (there’s a commercial side and a resource side to the TMF data model), but if you think about the notion of network slicing, you could see that a slice might well have a commercial side and a resource side.  If you want symmetry, you could say that the commercial side of an on-us service would be a null, but the central point of the NGOSS contract approach is the fact that the “contract” mediates between events and processes, and so it defines how lifecycle management works.

The design of this event-coupling piece will, of course, determine the way the processes have to be developed.  If, as I’ve presumed in ExperiaSphere, the processes are fed their portion of the data model when they’re activated, then the processes have all the information they need to run.  To paraphrase programming nomenclature, result=functionof(inputs).  There is no dependency on data stored by the function, so it doesn’t need to store any, and that means that the processes can be true stateless functions, or lambdas, or microservices, depending on what term you like.

This is what would be needed for an implementation of a 5G core to be “cloud-native”.  You need stateless processes that can be spun up or replaced as needed.  You need to have some mechanism for orchestrating these processes through stateful analysis of event streams.  The model approach provides all of that.

What’s not clear is whether Affirmed used the model-and-function approach.  If they didn’t, then Microsoft would have to do what might turn out to be a significant amount of software maintenance to create something that’s cloud-native.  If Microsoft didn’t make that investment, it would risk having a competing public cloud provider build a clearly better 5G solution.

It’s possible Microsoft would take that risk.  It’s possible Microsoft would expend the effort needed to optimize Affirmed’s stuff for 5G.  It’s possible they don’t even care about 5G and bought Affirmed for another reason (I doubt there is another reason good enough, but hey, it’s their company).  We’re not going to know all the details till we see more of what Microsoft does with its new acquisition, but if carrier cloud could justify a hundred thousand global data centers, it could surely boost revenues for public cloud providers too.  If there’s a race among public cloud providers to get 5G and other carrier cloud applications implemented, we might see progress a lot faster than we’d see if we waited for the telcos or their traditional vendors.

Can We Simplify Networks to Improve Economics?

We have network infrastructure, and we’ve had several models describing how it might be built better, made more efficient.  So far, none have really transformed networks.  We knew, through decades of FCC data on spending on telecommunications, what percentage of after-tax income people were prepared to devote to network services.  We now know that the percentage hasn’t changed much.  Thus, we have a picture of network services where revenues are static and where cost controls have proven difficult.  What happens now?

I’d followed the FCC’s “Statistics of Communications Common Carriers” report for all the years it was published.  In it, the FCC offered the insight that telecom services accounted for 2.2% of disposable (after-tax) income for consumers.  The most recent information suggests that it’s between 2.5% and 2.7%, and IMHO the difference from the number decades ago is within the margin of error for the calculations.  From this, I think we can say that there’s no credible indication that consumers will pay more over time for communications.  Business spending on networks, as all operators know, is under considerable cost pressure, and things like SD-WAN promise to increase that pressure.  Revenue gains over time, therefore, are difficult to validate.  Losses might be more credible.

Operators, again IMHO, have known this for a long time, and their goal has been to reduce costs—both capex and opex.  Capex currently accounts for about 20 cents of each revenue dollar.  Another 18 cents are returned to shareholders.  “Process opex”, meaning the operations costs associated with infrastructure and services, accounted for about 29 cents per revenue dollar in 2018, the year that it peaked.  Under pressure from tactical cost-savings measures, it’s fallen to about 27 cents in 2020.

Capex reductions of about 20% have been the goal of operators, meaning a reduction of 4 cents per revenue dollar.  SDN and NFV (the “First Model” of change) were both aimed at framing a different model of networking based on more open, commodity technology, but neither has produced any significant reduction in capex, and neither has been widely adopted.  Most recently, operators have looked at a combination of open-source network software and open-model “white-box” hardware (the “Second Model”).  Operators indicate that this combination appears to have the potential to constrain capex growth, and perhaps to reduce capex by 10% if all the stars aligned.  The smart planners tell me that’s not enough either.

Some operators have been exploring what I’ll call the Third Model, and I’ve mentioned it in some earlier blogs.  This model says that the network of the future should be built primarily as an enormous optical capacity reservoir, capped with a minimalist service overlay.  You reduce both capex and complexity by burying network problems in oversupply, which is becoming cheaper than trying to manage a lower level of capacity.

How much of what goes on inside an IP network relates to actually forwarding packets?  My estimate is 15%.  The rest is associated with adaptive traffic management and overall network management.  If we stripped out all of the latter stuff, could we not propose that a “router” was very much like an SDN forwarding device?  If the forwarding tables in this stripped-down white box were maintained from a central control point (which even Segment Routing proposes as an approach), the entire service layer of the network could almost be a chip, which would surely cut the capex of operators.

The capacity-reservoir approach presumes that if there’s ample capacity in the transport network, and transport-layer facilities to reroute traffic if a trunk fails, the service layer doesn’t see many (if any) changes in status, and the central controller has nothing to respond to.  That eliminates scalability issues; there can be no floods of alarms to handle, no need for mass topology changes that require resetting a bunch of forwarding tables.

The central control point would then look like what?  Answer: a simulator.  When something happens, the control point models the new conditions (which, due to the enormous residual capacity, involves only a few conditions to model) and defines the new goal state.  It then gracefully adapts the service layer.  No hurry; the transport layer has already taken care of major issues, so we’re just tweaking things.
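
Here’s a rough Python sketch of that control-point loop.  The event shape, the topology model, and the “push” step are all invented for illustration; the point is that the controller recomputes a goal state off to the side and then rolls it out gracefully, instead of having every device react hop by hop.

```python
from collections import deque

def forwarding_table(topology: dict, source: str) -> dict:
    """For one node, compute destination -> first hop over shortest (hop-count) paths."""
    first_hop = {n: n for n in topology.get(source, [])}
    queue, visited = deque(first_hop), {source} | set(first_hop)
    while queue:
        node = queue.popleft()
        for nxt in topology.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                first_hop[nxt] = first_hop[node]
                queue.append(nxt)
    return first_hop

class CentralController:
    """Toy 'simulator' control point: on a transport event it models the new
    topology, computes goal-state forwarding tables, and pushes them out
    gracefully (the push is just a print here)."""

    def __init__(self, topology: dict):
        self.topology = topology

    def on_transport_event(self, failed_link: tuple) -> None:
        a, b = failed_link
        self.topology[a] = [n for n in self.topology[a] if n != b]
        self.topology[b] = [n for n in self.topology[b] if n != a]
        goal_state = {node: forwarding_table(self.topology, node) for node in self.topology}
        for node, table in goal_state.items():          # no hurry: one node at a time
            print(f"update {node}: {table}")

topology = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
CentralController(topology).on_transport_event(("A", "B"))
```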

From the opex perspective, this has the advantage of creating a simpler network to operationalize.  Our Second Model of white-box open networking still has the same devices as traditional IP networks do, and so there’s no reason to think it has any inherent opex advantage.  The Third Model would have very little complexity to manage, and all the real logic would be in the central control point.  In most respects it would be an evolution of the original SDN/OpenFlow model, with much less pressure on the control point and much less concern about how it would scale.

We could combine this with the notion of network virtualization and intent modeling to make it better.  We could say that IP networks are made up of domains of various types and sizes.  Some domains are “domains of domains”.  If each domain had its own controller, and each controller had the objective of making its associated domain look like a single giant router, then the scalability issues would go away completely, and you’d also have the ability to take any number of domains and translate them to the new model while retaining compatibility with existing infrastructure.
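
A hedged sketch of the domain idea, with hypothetical names: each domain presents the same “one big router” face to the outside, whether it contains devices or other domains, so controllers can nest without any one of them having to scale to the whole network.

```python
class Domain:
    """Intent-modeled domain: from the outside it looks like one big router
    with a set of edge ports; inside it may hold devices or other domains."""

    def __init__(self, name: str, children=None):
        self.name = name
        self.children = children or []        # sub-domains; empty means a leaf

    def edge_ports(self) -> list:
        if not self.children:                 # a leaf domain exposes its own ports
            return [f"{self.name}:port0"]
        ports = []                            # a domain-of-domains aggregates its
        for child in self.children:           # children's edge ports
            ports.extend(child.edge_ports())
        return ports

    def connect(self, src_port: str, dst_port: str) -> None:
        # The external contract: carry traffic between two edge ports.  How
        # that's done inside (SDN, MPLS, legacy routers) is invisible outside.
        print(f"{self.name}: {src_port} -> {dst_port}")

metro = Domain("metro-east", [Domain("pop-1"), Domain("pop-2")])
national = Domain("national", [Domain("core"), metro])
print(national.edge_ports())    # ['core:port0', 'pop-1:port0', 'pop-2:port0']
```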

Google did something much like this with its Andromeda SDN model, which surrounded SDN with a ring of “BGP emulators” that proxied the IP control-plane protocols into and out of the Andromeda core.  In effect, a control-plane packet like a BGP update becomes a request for, or an input to, control-point information.  The domain is a black box, one that looks like an AS (in this case) but is really SDN.

It’s likely we could tune up SDN and OpenFlow to optimize for this model.  It’s likely that we could work out a way to migrate MPLS and segment routing to this approach too, which further defines an evolution from the current router infrastructure to the Third Model.

Making the Third Model work doesn’t require much invention, then.  We have large-scale implementations of all the pieces of the puzzle.  We have transport networking that includes agile optics, using, for example, the MEF 3.0 model.  This lets the transport layer present what’s essentially a constant QoS to the service layer.  We have SDN/OpenFlow implementations.  We have the P4 programming language that lets us build forwarding devices from suitable chips.  We have central control point software, and we have suitable IP control-plane emulation tools.

I think the picture would be better with some intent-based model-and-management overlay, because it would make the transformation more efficient.  The problem is that such an approach has been possible for a decade and never taken seriously.  It’s hard for me to promise it would be taken seriously now.  The Third Model I’ve cited has the advantage of seeking simplicity.  Maybe that’s what everyone has craved all along.

A Deeper Dive into COVID-19’s Impact on Networking

I blogged yesterday about the general impact scenarios for COVID-19.  Today I want to look at some more network-specific points, in particular the general impact on network capacity, the capex response, and the impact on 5G planning.  It’s easy to say that network operators, faced with significant traffic growth, will respond with massive capital plans, but we all know that bits don’t push capex upward; only revenues do that.

In the scenarios I outlined yesterday, service providers were likely to respond to COVID in the two-month-shutdown scenario.  Traffic increases would create a competitive churn risk, and so the impact would be greatest in the areas where competition was greatest.  The impact, however, is most likely to be one of accelerating the current 2020 budget spending rather than augmenting it, because there is still little indication that buyers are willing to spend more for broadband.

If the shutdown period lasts longer than 2 months, it’s likely to put more and more cost pressure on consumers and businesses alike.  Consumers would then start looking for cheaper services, first for mobile broadband, then TV, and then wireline broadband.  The result of this would be a gain in customers for those operators who have naturally favorable cost positions, but overall a price war that would eventually result in a dip in industry revenues.  That dip would almost surely prevent any incremental budget improvements for 2020 at least, and likely make 2021 budgets more conservative.

In the long term, consumer spending on home broadband will be under pressure, with cable providers forced to change plans to allow lower entry prices, or to drop TV in favor of streaming support.  It’s not likely that any home broadband provider will be increasing customer bandwidth through infrastructure investment; there is no long-term indication that willingness to pay will rise even when COVID is controlled, and if it’s not, household incomes would not support paying more.

On the business side, a shutdown at the two-month point would certainly ignite another review of network service spending.  How this would impact specific services is harder to say, because it would depend on the nature of the businesses and how they used their branch locations.  If businesses were required to work from home even in non-retail situations, branch offices would then be empty, and the cost of networking them would be difficult to sustain unless there was a clear end to the shutdown in sight.  In that situation, businesses would likely ask operators for contract relief, and any who had renewals in process would almost surely rethink their strategies.

This could eventually produce a major uptick for SD-WAN services, particularly in MSP form.  Businesses would be much more likely to embark on an SD-WAN VPN adventure that didn’t involve purchase/licensing, favoring a service approach in general, and in particular one with a fairly short contract period.  Remember the branches are presumed empty here.  SD-WAN providers, then, should be thinking about this scenario when they plan their 2020 sales/marketing activities.

Collaborative tools are already benefitting from the WFH pressure of the virus, but don’t expect to see a huge and long-lasting boom.  The reason is that traditional collaborative tools aren’t really optimized for a WFH model.  Some of the collaborative players have been telling me for years that the majority of their users are working from office locations, and using web conferences or collaborative video meetings just to avoid travel.  These are actually mobile workers in most cases, and they make up only about 8% of the workforce overall.

It’s likely that at around the two-month point in a shutdown, collaborative vendors would start thinking about a WFH-tuned model of collaboration, and initiate a product plan to address it.  The opportunity is most likely to emerge in as-a-service form, meaning a version of web meeting or conferencing, rather than as a product, because users would be unlikely to accept a license/purchase track to getting the capability.

The complicating factor here is the duration of the shutdown.  Up to about the three- or four-month point, businesses are likely to believe that normal business behavior will be reestablished shortly after the virus situation resolves.  Beyond that, the problem is that layoffs would likely be anticipating a long-term decline in revenues, and there’s no need for a laid-off employee to work from home.

Overall, it appears that as long as the virus is contained within 3-4 months, network spending at the service provider level could be sustained in the long term and even grow in the short term.  Beyond that, things become difficult to predict.

The virus’s impact on 5G is difficult to predict from the start.  There are two forms of 5G to consider: mobile 5G and the 5G millimeter-wave/FTTN hybrid.  Each has its own issues and opportunities.

For the 5G mobile space, the problem is that consumers have to get a 5G phone to take advantage of the service, and that means spending more money, unless someone subsidizes 5G smartphones to the point where they’re zero cost.  For that to happen, operators would have to believe that there was incremental revenue for 5G service.  Verizon in the US charges more for 5G, but it’s far from certain they’d be able to sustain this position in a competitive market even without the impact of the virus on consumer willingness to pay.

If 5G does gain any mobile traction, it seems certain that it will be in the NSA (non-standalone) form, mixing 5G New Radio with LTE IMS and EPC.  5G Core doesn’t seem likely to deploy in a pandemic world, for lack of a convincing revenue upside to justify the cost.  Again, I’m sure that 5G proponents can identify things you could do with 5G in a pandemic, but that’s not the question.  The question is who will have the money to pay for them.

On the 5G/FTTN hybrid, the same issue applies.  Consumers under price pressure are looking to spend less.  Spending less means the pie gets smaller, and so industry costs (capex and opex) are under pressure even without any budget increases.  However, competitive pressure created by consumers shopping for cheaper broadband might justify an entry into home broadband via 5G/FTTN as an alternative to losing the customer to somebody else.

The theory here is that the virus would, if it’s not quickly contained, result in a shift to more home viewing.  That could mean that households who don’t stream video today could become streaming prospects, and if their current broadband service infrastructure could not deliver it with acceptable QoE, they would bail to another provider who could, presuming competition were available.  Remember that 5G/FTTN could be deployed by some ISPs even out of their home regions, so competition might indeed become available.  Also, 5G mobile could be used in rural or thin-population areas.

The big question, in my view, is whether network operators would see and take advantage of the higher-level service opportunities being generated.  There is little question that primary and secondary schools, and businesses preparing for WFH in a future pandemic (or a recurrence of the current one), would be interested in having as-a-service support for projecting their activities into the home.  The network operators could benefit from this, enough to justify more investment in infrastructure, providing they won the business.  Unfortunately, network operators have proved totally unwilling or unable to conceptualize this kind of service or even recognize the opportunity.  Thus, I believe that what the pandemic will do is not to accelerate “carrier cloud” but to sign its death warrant.  By accelerating the opportunity for above-the-connection services, the virus will assure that the cloud providers will win the business.

Carrier cloud was, five years ago, the most credible explosive driver of data center deployment, offering the potential of one hundred thousand incremental data centers by 2030.  If this driver is lost, it doesn’t kill the servers and software those data centers represented, but it does reduce the total capex.  Larger pools of resources owned by a half-dozen public cloud providers don’t have the competitive overbuild of carrier cloud.  Most of all, though, the loss of carrier cloud confines operators to connection services for a very long time.

It’s possible that some operators will see the light on this, but let’s go back to the stock market for a moment.  Telcos in particular are highly valued as defensive stocks, which (with the exception of AT&T, perhaps) has been demonstrated in the recent movement of stocks.  Will the telcos embark on a massive capital program to fund carrier cloud when it’s clear they have no real idea how to realize any of the opportunities that would justify the investment?  Probably not.

For network vendors, I think the answer is clear.  In the network operator space, bet on modest capacity augmentation in the near term, managed services and SaaS-type services in the mid-term, and competition and “public utility” status in the longer term.  Don’t expect 5G mobile to benefit from the virus; it’s more likely that anything beyond the NSA form or the FTTN/millimeter-wave hybrid form would be damped down.  For the enterprise, expect a short-term interest in WFH, but in the long term, expect that “cloudsourcing” will be more popular, and things like SD-WAN will become an almost-default model.

Stay well, everyone.

The Impact of COVID-19 on Tech

I keep getting notes about the bargains in tech and telecom stocks, followed by the promise that stocks could dip another 30%.  I keep seeing stories about how we could hit 20% or 30% unemployment, which would surely make anything you’d pay for almost any stock today the furthest thing from a bargain.  The recovery will be “V” shaped, or maybe “U” or “W” shaped, or maybe it will be the Great Depression all over again.  We need to avoid overreacting, or we need to start taking this seriously.  The nice thing about all this is that you can find support for any viewpoint you like out there, somewhere.  Nearly all of them, of course, will turn out to be wrong.  What should a company do, in the tech or telecom space, to plan for an optimum response?

Starting with the facts would be nice, but the truth is that we don’t have many to work with.  What we have with COVID is a dangerous disease, but not a deadly one.  The problem is that we don’t really know much about how it spreads, and because we haven’t had good systematic testing of populations as well as potential victims, we don’t even know how many people have it, which means we’re judging the disease almost totally by the population who have had enough symptoms or reasons for testing to have qualified.  In the US, with about 40,000 cases and 455 deaths, we have a fatality rate of about 1.1%, but that’s not definitive because we don’t know how many people are actually infected.

A piece I saw this AM said that there were now estimated to be eleven times the number of cases that have been reported.  This morning I saw that there were about 340 thousand cases worldwide, with roughly 15 thousand deaths.  If we assume that 11x multiplier to account for non-symptomatic victims, that would mean there are already three and three-quarter million cases.  The fatality rate would then be (neglecting disease timing) about 0.4%, which isn’t much worse than a bad flu season.  The problem is that we can’t make that assumption.  If the rate is more like the 2% to 3% level some say it is, and if a conservative 40% of the population were to get it, then we could see three million deaths in the US, compared to perhaps 40 thousand from seasonal flu.
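
For clarity, the arithmetic behind those numbers (the 11x multiplier is the report’s estimate, not mine):

```python
reported_cases = 340_000      # worldwide reported cases at the time of writing
deaths = 15_000               # worldwide deaths at the time of writing
multiplier = 11               # reported estimate of actual-to-reported cases

estimated_cases = reported_cases * multiplier      # 3,740,000
rate_vs_reported = deaths / reported_cases         # ~4.4% against reported cases only
rate_vs_estimated = deaths / estimated_cases       # ~0.4% if the multiplier holds
print(f"{estimated_cases:,} estimated cases, fatality rate {rate_vs_estimated:.1%}")
```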

The business impact of COVID is created not by the disease but by our steps to contain its spread, given that for now we have neither vaccine nor treatment.  Closing businesses and enforcing social distancing has the effect of not only impacting buyer confidence, but also direct sale of goods and services.  That means a reduction in spending by the companies who produce and sell goods and provide services, which of course then daisy-chains through the economy at large.  The impact of that depends on how deep the decline in the economy overall is, and how long it lasts.

I tried to run my model, which many of you know is based on buyer behavior, to see how various things might play out.  Here’s what I found.

First, the assumptions.  The direct impact of COVID is, as I’ve said, largely due to containment efforts, meaning closing of stores, suspension of flights, broader business closing, social distancing, and so forth.  These hit as soon as the measures are imposed, and the result is first a loss of revenue from reduced sales, and second the layoff of workers, which then reduces their ability and willingness to spend, further hitting company revenues.  The indirect impact is the fear factor, the conservatism in buyers as they face a situation that’s at least unfamiliar and perhaps even downright scary.  These play out over time, and here’s what the modeling says.

Even a complete closure of retail businesses and services for about a month would have a minimal long-term impact on companies.  Credit issues could develop, both for individual businesses (particularly SMBs) and systemically, but these can likely be weathered for a short period, particularly with central bank support like we’re seeing.  If we were able to safely suspend restrictions after that, and there were no reversal in the positive trend in cases created by the lifting of social distancing, we could expect a “V” recovery.

From a tech industry perspective, the 1-month scenario would create a spurt of interest in WFH technology and general conferencing, lasting several months, but this would likely taper off.  Budgets for remote work products and services would be increased slightly for 2020.  IT spending and network spending for the year would see relatively little impact, and so the key in positioning would be to promote approaches to empowerment that could be implemented quickly and with modest carrying cost.  Think SaaS strategies versus product or perpetual license approaches.

A shutdown of around two months would create a growing number of small business failures, and it would also begin to impact the supply chains.  Consumer confidence would take a strong hit at this point, and the impact would likely affect buying behaviors for several months beyond.  As we’re coming into summer, vacation plans that had to be suspended would create a psychological impact at the least, and if travel insurance or forbearance on the part of companies didn’t refund the charges, chances are that vacations and their associated revenues would be lost.  Many would cancel early, taking no chance they’d lose the money.

Overall, this is the point at which we could call things a “U” recovery.  We could expect both Q1 and Q2 GDP to take a hit, with most in Q2.  The stimulus packages being proposed would have to aim at this 60-day point, with the goal of improving confidence and also sustaining stock prices.  Companies will likely accept some hit in share prices if they have confidence that there will be stimulus, but should that fail to happen, then a continued decline in stock prices would almost surely trigger larger and larger layoffs.

The biggest tech impact of this scenario would be in education, primary and secondary schools.  A two-month shutdown will close schools long enough to have compromised education for the school year, and there will be considerable interest in deploying technology to permit mass home schooling (where that’s not already possible).  This is likely to look a lot like the web-meeting technology, with some controls and changes to suit classroom social behaviors.

The two-month scenario, having greater impact, would surely raise interest in remote work overall, and begin to generate opportunities for solutions that allowed companies to create “jobspaces” that were virtual and portable, but these would have to be in a form that could evolve from the short-term tactics of the 1-month approach, given that during the early product/service assessment phase, buyers wouldn’t know that the impact of the virus would last beyond the first month.  Economic pressure on companies would be enough to cause them to reduce their spending, and so all capital budgets would be reduced for the entire year.  Unless a vaccine were to be made widely available by year’s end, 2021 budgets would also be cut.

Service providers, in contrast, would likely pull the trigger on modest network expansion programs, and even though a recovery at this point could eliminate the need, most think they’d likely go forward with the programs in the event that the virus reappeared in the fall, which is possible.  For network vendors, this could make the service provider space the sweet spot.

Cloud providers could also see a sharper uptick in their own sales, as companies will be reluctant to commit to capital programs for IT expansion, preferring a cloud/service option.  Some cloud providers think this could create a longer-term shift to the cloud, and are gaming out pricing scenarios to match it.

At the three-month point, we don’t really have any nice graphic-letter example to describe the recovery.  At this point, we can expect that a significant number of small businesses, and even some franchise operations, would have failed, and this will impact retail real estate and put greater stress on the credit markets through defaults.  Again, a stimulus package could help, but by this point consumer confidence and behavior would likely change even in the longer term.  The economic impact of the virus would extend well into the fall, easing only around the first of 2021.  Stock prices would be unable to recover their early-2020 levels, devaluing 401Ks and IRAs, and impacting those who depend on them for income.

At this point, GDP for the entire year is likely to take a hit.  Three months from now is only early summer, and if it takes that long to get over the current virus phase, the risk of a fall outbreak will surely suspend interest in travel, and wage and income losses in the three months will cramp consumers for the balance of the year when combined with their fear the virus will start up again in the fall.  Restaurants, travel, and entertainment will take much longer to recover in this scenario, and so the economic damage will now extend into Q3, which will then fall into negative growth as well.  Enterprises will surely cut their overall 2020 spending and put a conservative slant on 2021 spending as well.

Network operators may now face a more serious level of pressure.  If consumers continue to stay at home and network demands aren’t reduced, more spending for capacity may be required, and price wars on broadband and TV may also be forthcoming.  While network services aren’t typically as hard-hit in a downturn as capital spending, there still may be pressure to downsize home TV, network, and even wireless plans.

The cloud providers would be the bright spot here, since businesses will cut their capital budgets and hold off on any projects that involve capex.  On the other hand, tactical business services, especially SaaS, will look very smart if they can take up the slack and perhaps even offer a market advantage.

And finally, we come to the case where the virus isn’t contained in four or more months.  Now, candidly, things start to look bad.  There is little chance that if we have no response to the virus in four months, we’d have credible reason to believe it wouldn’t kick off even worse in the fall, or simply continue to get worse until a vaccine was validated.  That would mean a sharp economic decline, and the analogy to the Great Depression might seem valid.  I don’t think it will be quite like that, even if the virus worst-case happens.  There’s too much government can do to pull things out, and so what we could expect is several years of very anemic growth.

Which of these scenarios will we see?  I wish I had an answer to that, but because we lack so much basic data on the virus, I’ve got nothing concrete to feed into the modeling.  Is there a risk that the economic impact of our attempts to mitigate the spread of the virus will be worse than the disease?  Surely, and there’s also a risk that the virus, even with the current measures in place, will eventually create a massive death rate.  We can’t know which risk is greater at this point, or tell where in that spectrum any specific decisions on lifting or imposing restrictions will fall.

One thing that’s clear to me is that we need to take better advantage of our virtual-world capabilities.  We already live in two realities, our real world and a network-integrated social framework we loosely call “online”.  There have been only limited attempts to somehow integrate these two, to allow one of the worlds to influence or augment or even replace the other.  We should learn a lesson here, and take the time to think this through.  We still don’t know how COVID will play out, and how many COVID-like events we can expect down the road.

NFV: The History of Wrong and Right

We may need a summary of my cloud-centric view of network infrastructure.  I blog four days of every week, and I don’t want to repeat myself, but some who don’t follow me regularly may have a hard time assembling the complex picture from multiple blogs.  It is a complex picture, too, one that can’t be turned into a sexy 500-word piece.  Thus, this isn’t going to be one.  In fact, it will probably take 500 words just to set the stage.

The traditional view of networking divides software and hardware.  Devices like routers, switches, optical transport devices, CPE, and so forth are the hardware side, and network management and operations management tools (the classic NMS, OSS, and BSS) are the software.  Operators have been complaining about both these areas for at least a decade.

On the software side, most operators have said that OSS/BSS systems are monolithic, meaning they’re big centralized applications that don’t scale well and are difficult to customize and change.  They believe that network management is too operations-center focused, making it dependent on expensive and error-prone human processes.  On the hardware side, they believe vendors have been gouging them on pricing and trying to lock them in to a vendor’s specific products, which then exacerbates the gouging.

On a more subtle level, experts on traffic handling (especially Google) have believed that the traditional model of IP network response to topology changes and congestion, which is adaptive route management supported via local status exchanges, makes traffic management difficult and leads to underutilization in some areas and congestion in others.

The traditional view of computing is that software frames the applications that deliver value, and hardware hosts the software.  Virtualization in the computing and IT world focuses on creating “virtual hardware”, using things like virtual-machine (VM) or container technology.  This allows software to be assigned to virtual hardware drawn from a pool of resources, eliminating the 1:1 relationship between software and host that’s been traditional.

Network operators use software and computing in their operations systems, and the expertise operators have with software is found in the CIO organization, which is responsible for the OSS/BSS stuff.  The “network” side, under the COO, is responsible for running the networks, and the “science and technology” people under the CTO are responsible for network technology and service evolution planning.  The way these groups operate has been defined for a generation, and the fact that network hardware is a massive investment has focused the CTO/COO groups on creating standards to ensure interoperability among devices, reducing that lock-in and gouging risk.

In the first five years or so of the new millennium, John Reilly of the TMF came up with what should have been the defining notion in network transformation, the “NGOSS Contract”.  John’s idea was to use a data model (the TMF’s SID) to define how events (changes in network or service conditions) were coupled to the appropriate processes.  The original diagram (remember this was in the opening decade of the 21st century) postulated using the SOA standards, but the principle would work just the same with modern APIs.

The TMF embarked on another initiative in roughly 2008, called the Service Delivery Framework (SDF) for composing services from multiple functional components.  I was a member of the SDF group, and toward the end of the work, five Tier One operators who were also in the TMF asked me to do a proof-of-concept execution of SDF principles.  This resulted in my original ExperiaSphere project, which used XML to define the service contract and Java to implement the processes.  Some of the concepts and terminology of ExperiaSphere were incorporated into TMF SDF.

In roughly 2010, we started to see various efforts to virtualize networking.  Software-defined networks (SDN) proposed to replace expensive proprietary hardware with cheaper “white-box” forwarding devices, controlled by a central hosted software element via a new protocol, OpenFlow.  The most important feature of SDN was that it “virtualized” an entire IP network, not the devices themselves.  This approach is symbiotic with Google’s SDN deployment, which creates a kind of virtual giant BGP router with BGP at the edge and SDN inside.

NFV came along at the end of 2012, maturing into the Industry Specification Group (ISG) in 2013.  The goal of NFV was to replace “appliances”, meaning largely edge CPE boxes, with cloud-hosted instances.  NFV was thus the first network-side initiative to adopt a “software” model, combining software (virtual network functions, or VNFs), virtualization (VMs and OpenStack), and hardware resource pools (NFV Infrastructure or NFVi).

NFV had a goal of virtualizing devices, not services, and further was focused not on “infrastructure” devices shared among services and customers, but rather on devices that were more customer-specific.  While there were use cases in the NFV ISG that included 5G or mobile services, the body never really addressed the difference between a per-customer-per-service element and an element used collectively.  While they adopted the early cloud VM strategy of OpenStack, they ignored the established DevOps tools in favor of defining their own management and orchestration (MANO).  All this tuned NFV to be a slave to OSS/BSS/NMS practices, which encouraged it to follow the same largely monolithic application model.

While all this was going on, the cloud community was exploding with hosted features designed to create cloud-specific applications, the precursor to what we’d now call “cloud-native”.  VMs and infrastructure as a service (IaaS) were giving way to containers (Docker) and container orchestration (Kubernetes), and microservices and serverless computing added a lot of elasticity and dynamism to the cloud picture.  By 2017 it was fairly clear that containers were going to define the baseline strategy for applications, and by 2018 that Kubernetes would define orchestration for containers, thus making it the central tool in application deployment.

That NFV was taking the wrong tack was raised from the very first US meeting in 2013.  The TMF’s GB932 NGOSS Contract approach postulated a model-driven, event-coupled process back in about 2008, and the very first proof-of-concept approved by the ISG demonstrated this approach.  Despite this, the early NFV software implementations adopted by the network operators were all traditional and monolithic.  I reflected the TMF vision in my ExperiaSphere Phase II project, and there are extensive tutorials on that project available online.

My goal with ExperiaSphere was to expand on the TMF service modeling to reflect modern intent-model principles.  A “function” in ExperiaSphere had a consistent interface, meaning that any implementation of a given function had to be equivalent in all ways (thus, it was the responsibility of the organization implementing the function to meet the function’s interface specifications).  That meant that a function could be realized by a device or device system, a hosted software instance or a multi-component implementation, and so forth.  I also made it clear that each modeled element of a service had its own state/event table and its own set of associated processes to run when an event was received and matched to the table.  Finally, I released the specifications for open use, without attribution to me or CIMI Corporation, as long as the term “ExperiaSphere” and my documentation were not included or referenced without permission.
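
Here’s a minimal Python sketch of the state/event idea as described here; the states, events, and process names are invented for illustration, not taken from the ExperiaSphere documentation.

```python
# Each modeled service element carries its own state/event table.  The current
# state plus the event type select the process to run; the process returns the
# element's next state.
def deploy(element, event):    print(f"deploying {element['name']}");  return "ACTIVE"
def repair(element, event):    print(f"repairing {element['name']}");  return "ACTIVE"
def escalate(element, event):  print(f"escalating {element['name']}"); return "FAULT"

STATE_EVENT_TABLE = {
    ("ORDERED", "activate"):   deploy,
    ("ACTIVE",  "fault"):      repair,
    ("ACTIVE",  "hard-fault"): escalate,
}

def handle_event(element: dict, event: dict) -> None:
    process = STATE_EVENT_TABLE.get((element["state"], event["type"]))
    if process:                                      # unmatched events are ignored here;
        element["state"] = process(element, event)   # a real model would log or escalate them

vpn_access = {"name": "vpn-access-leg", "state": "ORDERED"}
handle_event(vpn_access, {"type": "activate"})   # ORDERED + activate -> deploy -> ACTIVE
handle_event(vpn_access, {"type": "fault"})      # ACTIVE + fault -> repair
```

Because each process is handed the element data it needs when it’s invoked, any stateless instance of “deploy” or “repair” could pick up the work, which is the scalability point made below.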

If you presume that the TMF framework for event-to-process binding is accepted, you have something that would look very much like the microservice-centric public-cloud vision we now see.  The “service contract”, meaning the data model describing the commercial and infrastructure commitments of the service, becomes the thing that maintains state and directs the reaction of the service to events.  If the processes themselves are truly microservices, this allows any instance of a process to serve when the related state/event condition occurs, which in turn lets you spin up processes as needed.  The result is something that’s highly scalable.

The monolithic and traditional model of management starts with EMS, then moves upward to NMS and SMS, with the latter becoming the domain of operators’ OSS/BSS systems.  If the implementation of NFV is constrained by the EMS/NMS/OSS/BSS progression, then it is not inherently scalable or resilient.  You can make it more so by componentizing the structure, but you can’t make it like the cloud if you start with a non-cloud model.  The lack of a service-contract-centric, TMF-based approach also complicates “onboarding” and integration, because it doesn’t build in a way of requiring that all the elements of a service fit in a standardized (modeled) way.

By 2018, operators and vendors were starting to recognize that the NFV ISG model was too monolithic and rigid, requiring too much customization of VNFs and too much specialization of infrastructure.  The cloud-container approach, meanwhile, was demonstrating that it could be easy to onboard components, to deploy on varied resource pools, etc.  This led some operators to promote the notion of “container network functions”, taking the view that if you containerized VNFs (into CNFs), you’d inherit all the benefits of the cloud-container-Kubernetes revolution.  Another group tried to standardize resource classes, thinking that this would make the NFV approach of resource pools and virtual infrastructure managers workable.

Neither of these approaches is workable, in fact.  NFV launched the convergence of network and cloud, but did so without knowing what the cloud was.  As a result, its approach never supported its own goal, because it let its specifications diverge from the sweep of cloud technology, which ultimately answered all the questions of function deployment in a way that’s demonstrably commercially viable, because it’s used daily in the cloud.

The cloud is a community approach to problem-solving, and this kind of approach always leads to a bit of churning.  I have a foot in both worlds, and I think that everything network operators need to fulfill both the SDN and NFV missions optimally is already available in open-source form.  All that’s needed is to integrate it.  We have open-source TOSCA-based service modeling.  We have Kubernetes orchestration, which can be driven by TOSCA models.  We have monitoring and tools for lifecycle automation, and perhaps best of all, we have application-centric implementations of function deployment that are totally compatible with the higher-level (above the network) services that most operators believe have to be exploited to create an upturn in their revenue line.

A cloud-centric NFV would be one based on the prevailing cloud tools, and conforming to the trends in how those tools are applied.  There is little to gain (if anything) from trying to retrofit cloud concepts onto NFV, because NFV wasn’t really a cloud concept from the first.  It would be fairly easy to simply adopt the combination of TOSCA, the “Kubernetes ecosystem”, microservices, SDN and segment routing control plane principles, and build a cloud-ready alternative.  In fact, it would take less effort than has already been spent trying to support things like CNFs and NFVI classes, not to mention the earlier NFV efforts.

I’m frustrated by where we are with this, as you can probably tell.  I’ve fought these issues since I first drafted a response to the Call for Action that launched the NFV ISG in 2012, through that first meeting in the Bay Area in 2013, and through that first PoC.  This was a failure of process, because some of us at least tried to warn people that we were heading to the wrong place.  Recognizing that process failure is important, because the cloud software movement has succeeded largely because it didn’t have a formal process to fail.  Innovation by example and iteration is the rule in open source.  It should have been so in NFV, and the concept of NFV can rise from the implementation ashes only if NFV forgets “standards” and “specifications” and embraces open source, the cloud, and intent modeling.

I’m not going to blog further on NFV, unless it’s to say that somebody got smart and launched an open-source NFV project that starts with the TOSCA and Kubernetes foundation that’s already in place.  If that happens, I’ll enjoy blogging about its success.

We Don’t Need to Modernize NFV, We Need to Move Beyond It

It seems impossible to shake the debate on “containerized” virtual network functions for NFV, even as we should be debating the generalization of cloud-ready models.  Red Hat and Intel, both with a considerable upside if network devices were suddenly turned into hosted virtual functions, have launched an onboarding service and test bed to facilitate “CNFs”.  I’d be the last person to say that we don’t need to think more about containerizing virtual functions, versus nailing them to virtual machines, but I think we’re at risk of focusing on a limited symptom here.  Some more important points are becoming obvious, and it’s hard to see how they’re still being ignored.

If you were to look at a very general vision of NFV, you might expect it to look a lot like DevOps.  You start with a declarative model of a service, and from that you invoke specific steps to manage the lifecycle, from deployment through teardown.  The specific tasks would be carried out by integrated tools that could deploy on containers, virtual machines, or bare metal, and those same tools would be able to parameterize and configure both systems (virtual or otherwise) and networks.
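As a rough illustration only, here’s what that general vision might reduce to in code; the model fields, step names, and artifacts are hypothetical stand-ins for what a real TOSCA-style descriptor and toolchain would carry.

```python
# A toy version of "a declarative model drives the lifecycle steps"; the model
# fields and step names are invented, standing in for a TOSCA-style descriptor.

service_model = {
    "name": "branch-access",
    "elements": [
        {"name": "firewall", "hosting": "container", "artifact": "fw:1.2"},
        {"name": "router",   "hosting": "vm",        "artifact": "vrouter.qcow2"},
    ],
}

LIFECYCLE = ["deploy", "configure", "monitor", "teardown"]

def run_step(element: dict, step: str) -> None:
    # A real toolchain would pick a container, VM, or bare-metal driver here,
    # based on the element's declared hosting type.
    print(f"{step}: {element['name']} on {element['hosting']}")

for step in LIFECYCLE:
    for element in service_model["elements"]:
        run_step(element, step)
```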

I was involved in NFV from the first, and in those early days the vision was in fact fairly generalized.  Even when it was codified into the “End-to-End model” by the NFV ISG, you still had fairly high-level concepts represented by blocks.  We had “management and orchestration” (MANO), VNF Manager (VNFM), Virtual Infrastructure Manager (VIM), and so forth.  It was really the case studies that created the issues, I think, combined with the fact that the NFV ISG was populated mostly by network types doing a software/cloud job.

What issues are we talking about?  I contend that there were three.  First, NFV focused on virtualizing devices that deployed within a single customer’s service, not on building multi-service infrastructure.  Second, NFV presumed that VNFs would deploy within VMs.  Third, NFV permitted or even encouraged a strict translation of the functional E2E model into reference code.  We’ll take each of these, and their impacts on market requirements and NFV suitability, below.

So on to our first issue.  From the very first, a series of interlocking and seemingly harmless decisions relating to what NFV virtualized conspired to put NFV in a subordinate rather than transformative mode.  A “virtual network function” was determined to be the software-hosted equivalent of a “physical network function”, which in turn was a device.  A VNF was to be managed by the same element management and higher-layer tools as those devices were.  The only exception was that features that might typically have been delivered by daisy chaining several CPE devices were to be implemented as “service chains” of VNFs, which eventually turned into co-hosting multiple VNFs in a piece of universal CPE (uCPE).

The problem with this sequence is that virtualizing CPE (vCPE) impacts only the cost of some business services, not the vast majority of infrastructure capex.  SDN had taken the approach that “routing” was possible without “routers”, and sought to define how that might work.  NFV could have enveloped the SDN initiatives, defining IP networks as enormous virtual routers and letting everyone innovate with respect to what was inside the black box, but it didn’t.  As a result, NFV missed the critical notion of abstraction that’s central first to virtualization and then to the cloud.  Without a powerful abstraction concept at its core, NFV is not going to be able to define a cloud-native toolkit.

Even if we were to extend NFV to multi-service infrastructure by deploying, for example, virtual routers, the majority of the NFV work focuses on deployment, which for multi-service virtual routers is largely a one-off issue.  If a virtual router could be an entire network, we might have been able to apply lifecycle management automation within a virtual-router black box, but with only real devices to abstract, we’re left with the same management structure that devices had.  The rest of the service lifecycle isn’t in scope for NFV at all, since device, network, and service management are all ceded to a higher-level process.  Thus, we can’t expect opex to be impacted, and without a significant capex/opex benefit, we’ve accomplished nothing.

The second issue started with a natural assumption.  Hosting virtual functions on a one-per-server basis wasn’t likely to have a favorable impact on capex.  Virtual machines were the rage in the 2013 timeframe, and OpenStack was the open-model approach to deploying things in virtual machines.  It’s not surprising that NFV presumed both VMs and OpenStack, but things went a bit south from there.

The biggest problem was the Virtual Infrastructure Manager.  Over the course of the proof-of-concepts introduced to the NFV ISG, the VIM got a bit conflated with OpenStack.  The more people thought about OpenStack VIMs, the less that thinking applied to the general question of how you host something.  The singular flavor of OpenStack percolated upward to the VIM, making it specialized not only to VMs but to OpenStack itself.

“Singular” is a nice segue to the final issue.  Functionally, the VIM sits at the bottom of a deployment, translating an abstract view of something into a software-to-VM commitment.  We could have salvaged a lot of NFV had the functional diagram of NFV not been strictly translated into reference implementations.  There is a single block each for MANO, VNFM, and VIM, and to many that meant these were three monolithic elements.  Functional had become architectural.

For the specific issue of VNF versus CNF, the singularity of the VIM creates a massive and unnecessary problem.  Any service “object” or abstraction needs to be realized, and so a service model would be expected to define the software element responsible for that realization.  If a VIM were a logical structure, one as varied as the options for realizing a functional element, there would be no need to fret about VNFs and CNFs.  If MANO were similarly logical, we could map its functionality to any of the DevOps tools, or to Kubernetes for containers.  The decision to translate functional blocks into explicit implementations robbed NFV of that flexibility.
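Here’s a sketch of what a “logical” VIM could look like, assuming a simple registry keyed by the hosting type the model declares.  The function names are invented, and real realizers would of course call OpenStack, Kubernetes, or a white-box agent rather than print.

```python
# A sketch of a "logical VIM": the model declares a hosting type, and a registry
# maps that type to whichever realizer actually deploys it.  Names are invented.

from typing import Callable, Dict

def openstack_vim(element: dict) -> None:
    print(f"nova boot {element['name']}")                     # VM-style realization

def kubernetes_vim(element: dict) -> None:
    print(f"kubectl apply deployment/{element['name']}")      # container realization

VIM_REGISTRY: Dict[str, Callable[[dict], None]] = {
    "vm": openstack_vim,
    "container": kubernetes_vim,
}

def realize(element: dict) -> None:
    # The service model, not the orchestrator, decides which VIM applies.
    VIM_REGISTRY[element["hosting"]](element)

realize({"name": "vfirewall", "hosting": "container"})
realize({"name": "vrouter",   "hosting": "vm"})
```

With something like this, the VNF-versus-CNF question collapses into a one-line entry in the service model.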

If we go back to that initial, simple, NFV model I opened with, we should expect that an abstract service would decompose into simple service elements, which in turn would be realized either by committing them to a device or system of devices, or by deploying virtual functions.  The three issues I’ve cited here interfered with our capability to realize this simple approach with a conformant NFV implementation.  That is the problem.  The whole CNF/VNF thing is simply a symptom of that problem.

I’m raising this point because we’re creating a risk with this kind of discussion, not a reduction in risk.  The risk is that we’ll spin our wheels trying to cobble together a vision for a CNF that can be glued onto the NFV model as the ISG framed it, a model that is simply not conformant to cloud principles.  How do monoliths like MANO and VNFM and VIM create an elastic cloud?  They’re not even elastic themselves.

What these points add up to is simple.  First, hosting virtual instances of the kind of devices normally considered “infrastructure” doesn’t even require NFV because the devices have to be hosted where the trunks terminate.  Second, hosting virtual functions in white boxes doesn’t require NFV, because the white boxes replace appliances that go in static locations (again).  Third, virtual function missions associated with individual users are beneficial only for high-value business services, and won’t move the capex/opex ball.  Fourth, dynamic cloud functionality is going to be delivered through cloud software, not through something like NFV, defined by the network community.  That’s especially true given that operators seem to be moving away from building their own clouds.

We could easily make NFV container-ready, simply by taking the right position with respect to the VIM and allowing any number of model-specified VIMs to be used in a given service.  The question is “Why bother?”  Why try to create another cloud deployment model when we already have an exploding ecosystem of container software and Kubernetes?  What the operator community should do is let NFV be a limited business-service strategy, and move on.  We don’t need to make NFV “cloud-ready”; we need to skip it and move to the cloud.

Vendors Fight the Kubernetes Wars

Containers and Kubernetes rule, which makes it unsurprising that they figure in a lot of recent announcements.  In fact, VMware and HPE have both launched major initiatives that focus on the “dynamic duo of IT”.  Both of them show why the combination is important, but they also show that it’s still possible to undershoot the opportunity.  Both companies face the classic server/software dilemma, and that may be where both have gone just a bit astray.

HPE is one of the true server giants in the market, and like another (Dell) it’s better known for its personal computer products than for its servers.  In HPE’s case that’s problematic because those PCs are actually made by its sibling company, HP, which also handles the printer line.  In any event, HPE has great IT credentials, but it shares the general problem server vendors have, which is differentiation.  Hardware, in the IT world, is something you run software on.

VMware is Dell’s partially captive software play.  They were (and still are, though less so) the acknowledged giant in data center virtualization, and they also have a strong networking portfolio.  They were an early adherent to the container-and-Kubernetes play, and it’s been my view for some time that they have the strongest product set to play in the future world of containers and the cloud.  Their recent positioning, though, and even their recent M&A, has been a bit murky.

The challenge that both companies face is that enterprise computing means hybrid cloud computing.  None of the enterprises I’ve talked or worked with have any interest in moving everything to the cloud.  Most realize that their cloud computing activities will be primarily related to the front-end handling of mobile/web extensions of their core applications.  Most are also interested in using SaaS for internal activities like CRM.  However, data center IT focuses on marrying servers to software, and the cloud marries virtual machines to software.  Containers make an interesting bridge because you can host them in VMs or on bare metal, and because cloud providers offer managed container services.

A bridge is still connected to the banks, of course, and for HPE and VMware/Dell, the “banks” are the current installed base of data center gear, and the emerging cloud-native world.  All the action, the positioning and marketing opportunities, the editorial mentions, and so forth, are in cloud-native.  Much of the money is still in data center iron.  Both HPE and VMware/Dell have to stay a bit agnostic at one level, but not so much that they seem to have stuck their heads in the sand.  IBM, after all, bought Red Hat, a software competitor in both data center and cloud, because they lost their broad appeal when they focused on mainframes in data centers.

The central element in the HPE announcement is Container Platform 5.0, which differs from other Kubernetes-based strategies in that it doesn’t require VMs in the data center.  HPE says it’s resolved the “noisy neighbor” and security problems that can arise when containers are deployed directly on bare metal, problems that some users have gone to VM intermediary hosting to resolve.  VMware, of course, is one of those companies, and it’s hard not to see the HPE move as anything but a swat at VMware’s strategy.

Positioning by counterpunching competitors’ positions is always a risk; they get to set the agenda.  It’s a special risk in this case, though, and also a special opportunity at the same time.  That’s because VMware’s offering, the fleshed-out Tanzu announcement, brings Kubernetes, VMware’s Pivotal acquisition, and vSphere all into the same picture.  The result is a loss of focus in the way Tanzu is presented.  HPE, counterpunching, could be drawn into unfocused positioning too, or it could take advantage of VMware’s muddle to gain some extra mindshare.

The big question for both “companies”, and for the industry as a whole, is just how to play “hybrid cloud”.  For the next three years at least, the great majority of IT spending will be focused on core business applications and the data centers they run in.  During those same three years, the great majority of CIO and development project focus will be on public cloud front-end technology.  It’s the classic issue of “I’m putting five-thousand-dollar windows on a half-million-dollar house”; do you focus on the house (which isn’t what’s changing but that’s where the investment is), or the windows (which is what the buyer is actually doing)?

VMware has focused more on the house.  Tanzu clearly provides a migration path for vSphere, and a credible future for the Pivotal Cloud Foundry stuff.  It actually offers at least a credible cloud-native element, because Cloud Foundry is a cloud-native runtime platform.  The problem is that the positioning sounds more like a migration strategy than an on-ramp for broader-market users.  That’s problematic because vSphere isn’t the biggest source of hybrid cloud interest.

HPE seems to take on the vSphere-migration piece with their singular focus on containers on bare metal.  I happen to think that this is really a better strategy, and one that’s better aligned with the broad hybrid cloud market opportunity, but it’s also one of those differentiating details that isn’t going to get a lot of media/management attention unless it’s impeccably packaged…which HPE didn’t do.

It seems to me that both companies started off by raising the Kubernetes flag, but both then forgot to salute.  Hybrid containerized clouds are built on the presumption of some sort of effective federation of multiple Kubernetes domains, or they commit the user to deploying their own “in-house” Kubernetes on public IaaS services rather than on managed Kubernetes services.  The federation piece, the notion of One Kubernetes to Rule them All, should have been critical in both announcements, and it was not.
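To show how thin the “federation” story is without a purpose-built layer like Anthos or KubeFed, here’s a naive stand-in using the Kubernetes Python client: the same Deployment pushed to every kube-context in a list.  The context names are assumptions; a real federation layer would add placement policy, drift detection, and rollback.

```python
# Not Anthos or KubeFed, just a naive stand-in for the "one deployment model
# across every cluster" idea: push the same Deployment to each kube-context.
# Requires the 'kubernetes' Python client; context names are hypothetical.

from kubernetes import client, config

CONTEXTS = ["datacenter-east", "gke-frontend", "aks-frontend"]  # hypothetical contexts

def web_deployment() -> client.V1Deployment:
    labels = {"app": "web"}
    return client.V1Deployment(
        metadata=client.V1ObjectMeta(name="web", labels=labels),
        spec=client.V1DeploymentSpec(
            replicas=2,
            selector=client.V1LabelSelector(match_labels=labels),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels=labels),
                spec=client.V1PodSpec(
                    containers=[client.V1Container(name="web", image="nginx:1.25")]
                ),
            ),
        ),
    )

for ctx in CONTEXTS:
    # One kubeconfig, many clusters: build a client per context and apply the
    # identical Deployment everywhere.
    api_client = config.new_client_from_config(context=ctx)
    apps = client.AppsV1Api(api_client)
    apps.create_namespaced_deployment(namespace="default", body=web_deployment())
```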

Hybrid cloud really requires a unified model of deployment across all data center clusters and all public clouds in use.  Google, with Anthos, and IBM with Kabanero, have approached this and perhaps achieved it.  It’s certain that both will be putting together a story that’s complete and compelling.  While Google’s enterprise position is relatively weak, IBM/Red Hat have a position that’s been strong and is growing stronger.

VMware has addressed this within Tanzu by, in a sense, broadening the definition of “service mesh”.  Traditionally, service meshes are purely microservice virtualization fabrics, Istio and linkerd being the two main examples.  These are based on sidecar technology and they offer microservice communications, discovery, and security built through the sidecars.

VMware emphasizes NSX, its SDN technology (acquired with Nicira), and Global Namespaces (GNS).  Because Tanzu is recent and Tanzu Service Mesh has only just been released for purchase, it’s hard to know exactly what the limits of the approach are.  It seems able to connect not only microservices but also containers and VMs, and it seems to rely on deploying Kubernetes clusters in the cloud rather than on using cloud-provider managed Kubernetes services.

HPE takes a similar position; if you deploy HPE Container Platform in the cloud, then you can manage it seamlessly with your data center tools.  There’s no specific federation support provided, and they also don’t make a specific reference to service mesh, though of course their Container Platform could be used with either Istio or linkerd.

Interestingly, nobody seems to be taking the true “high road”.  Containers are a model of application development and deployment that can be adapted easily to both the public cloud and the data center.  The container model is not prescriptive on “cloud-native” behavior; it works fine for monolithic applications too, as well as for the old-line SOA applications that rely on higher-level (bigger, more complex) services and stateful behaviors.  Kubernetes and containers are thus a great operational framework, but we still need a development framework, an application architecture for the modern container-and-Kubernetes world.  Microservices are not the universal answer.

That could work to VMware’s advantage, if they could position their NSX-mesh approach as a universal hybrid-cloud networking model.  They’ve taken some baby steps in that direction, but it’s almost like they’re unwilling to make a virtual-networking story into a centerpiece of a cloud strategy.  That’s odd because SD-WAN (which VMware has rolled into NSX via VeloCloud) is currently differentiated on its ability to support the connection of cloud-hosted components to a corporate VPN.

That point could be transformational in the SD-WAN space, not because vendors don’t currently support the cloud, but because most vendors don’t position their SD-WAN offerings as true virtual networks.  I’ve noted in the past that SD-WAN is an application of a broader virtual-network position, and that ultimately that broad positioning was going to win the day.  We may be approaching the time when that happens, particularly if VMware gets smart on its own NSX story.

Can Vendors Counter Open Network Models?

A lot of companies in the networking space are “reorganizing” or “refocusing”, and most of them say they’re getting more into “software” and “services”.  Their intentions are good, but their plans might be as vague as I’ve summarized them to be.  In 2020, every company in networking is going to have to face significant pressures, and the COVID-19 problems are only going to exacerbate what would have been significant challenges, and risks, even with a healthy global economy.

What’s the future of network equipment?  That’s the real question, and it’s a difficult question to answer, in no small part because people aren’t very good at staring risk in the face if there’s a way to stick their head in the sand instead.  Most senior executives in the network equipment space learned their skills in the heyday of supply-side infrastructure.  We had IT and information, and we needed a better mechanism to deliver it to workers.  Networking was that mechanism, but technical limitations and a lack of mature service deployments were barriers.  Pent-up demand means supply-side drivers, and that’s what we had…till recently.

For about ten years, there have been clear signs that enterprises and operators were both running into cost/benefit problems with incremental network spending.  On the enterprise side, the balance of funding for IT and networking, historically a combination of sustaining budgets for existing applications and services and project budgets for advancing productivity, slid more and more into the budget-only category.  For some companies, a full third of all past IT and network funding was thus put into question.  For service providers, revenue per bit has declined, and more stringent capex and opex cuts are necessary to keep ROI on infrastructure from going negative.

The biggest problem that networking has, which is therefore the biggest problem that network vendors have, is that networking is the truck that delivers your Amazon order, not the order itself.  The value proposition isn’t what networking is but what it delivers.  That, to a very significant degree, puts the means of raising budgets and improving network spending beyond the reach of network vendors themselves.

The biggest challenge for the network operators is how to rise up the value chain, and sadly they seem stuck in the notion that adding lanes to the highway, or introducing a highway with adjustable lanes, is progress.  Earth to Operators: It’s still delivering the goods, not being the goods.  There is no connectivity value chain above a bit pipe, so operators need to be thinking about what they can do besides push bits, and vendors need to be helping.

I’ve talked about what I thought was above the network, value-wise.  It’s the contextualization services that could improve personal viewing, communications, ad targeting, and worker productivity.  Rather than beat you over the head with that (again!), let me just say that the OSI model long recognized that applications were the top of the value chain, the highest of the higher layers.  In today’s world, that almost surely means some set of cloud-native, cloud-hosted, tools that create something people actually want.  If that’s true, then companies like Cisco will eventually have to look above things like ACI to cloud-application tools.  “Eventually” isn’t necessarily “immediately”, though, and even companies with strong higher-layer strategies evolving as we speak may need a mid-life kicker.

If basic network equipment is going to be under increased price pressure, then the emergence of an open-model approach is inevitable.  In fact, we’ve been seeing efforts to establish open-model networking in the network operator segment of the market for a decade.  Back in early 2013, I had a meeting with most of the senior management of a big telco, and they were looking even then at having router software hosted on server platforms.  Their thinking was representative of the explorations that eventually drove NFV.

NFV’s problem, from an infrastructure efficiency perspective, was that it was designed to support a lot of dynamism and elasticity, in an environment where virtual functions were dedicated to users.  The real problem wasn’t that at all, it was supporting multi-user instances of “router functionality”.  Today, operators see that coming about through a combination of open-source network software and white-box devices built to work together.

The separation of “software” from “hardware” here is one driver behind vendors’ proposed shift to a more software-focused business model.  Software is where features live, even in appliances like routers, and so it seems likely that the hardware side of the duo of the future will commoditize.  In fact, some may hope it does just that, because that would create a sharp reduction in capex, stave off Huawei competition, and still leave an opportunity in “software” and “services”.

The obvious question is whether it’s possible to differentiate vendor software for gadgets like routers, from open-source software for the same devices.  Cisco, always a leader in thinking about how to optimize sales in any situation, has addressed this challenge through the notion of a control/management ecosystem.  Basic forwarding devices live in a management and policy world defined by initiatives like ACI.  Even if this world is created by open standards to appease network operators’ fears of lock-in, it would take time for the open-source movement to address this new higher layer of network software functionality.  Let’s call this new layer the policy layer, for reasons you’ll see below.

The task of creating a policy layer for an IP network in open-source form would be somewhat like creating the Kubernetes ecosystem; you assemble symbiotic elements around a central, pivotal, element.  However, while Kubernetes is likely to play a role in policy-layer definition, it wouldn’t be the central element because most open-device deployments would, like real routers, be physical appliances in a static location.  Kubernetes is a great anchor point for cloud application models because it’s about deploying the pieces.  What, in a network control and management layer, is the critical anchor point?

I (again) resort to the world of poetry; Alexander Pope in particular.  “Who sees with equal eye as god of all…” is Pope’s question, and if Pope were a network person, he’d likely say “the central controller”.  I think that networks are built around a vision, a vision of universal connectivity but subject to policy controls that could limit what of the connectable universe can really be connected, and how well.  Think of this as a sort of SDN controller, or as the controller that some propose to have in a segment routing approach.  This controller may act through agents, it may be physically distributed even though it’s logically singular, but it’s the place where everything in a network should be looking.

Policies define the goals and constraints associated with the behavior of a system of devices or elements, which of course is what a network is.  Policies define the services, the way infrastructure cooperates, and the steps that should be taken to remedy issues.  Policies, properly done, can transform something so adaptive and permissive it’s a risk in itself (like an open IP network) into a useful tool.  Finally, policies may define the way we unite cloud thinking and network thinking, which is crucial for the future of networks.  Network vendors, especially Cisco, have been pushing policies for a long time, and it may be that policies are about to pay off.
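Here’s a toy rendering of that idea, with invented fields and names: a policy is a scope, a goal, and a remedy, and a logically central controller evaluates telemetry against the goals and hands remedies to its agents.

```python
# A toy rendering of "policies as goals plus remedies" evaluated by a logically
# central controller; the fields, scopes, and metrics are invented for illustration.

from dataclasses import dataclass
from typing import List

@dataclass
class Policy:
    scope: str          # which service or domain the policy governs
    goal: str           # e.g. "latency_ms <= 30"
    remedy: str         # action the controller should drive when the goal fails

POLICIES = [
    Policy(scope="vpn-gold", goal="latency_ms <= 30", remedy="reroute via segment-list B"),
    Policy(scope="vpn-gold", goal="loss_pct <= 0.1",  remedy="raise ticket and throttle best-effort"),
]

def evaluate(scope: str, telemetry: dict) -> List[str]:
    """Return the remedies the controller (or its agents) should apply."""
    actions = []
    for p in POLICIES:
        if p.scope != scope:
            continue
        metric, op, limit = p.goal.split()
        value, limit = telemetry[metric], float(limit)
        ok = value <= limit if op == "<=" else value >= limit
        if not ok:
            actions.append(p.remedy)
    return actions

print(evaluate("vpn-gold", {"latency_ms": 42.0, "loss_pct": 0.05}))
# ['reroute via segment-list B']
```

The point of the sketch is that the controller is logically singular even if the remedies are executed by distributed agents across different technologies.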

Of course, policies can’t be a bunch of disconnected aspirations.  Any policy system requires a means of applying the policies across discontinuous pieces of infrastructure, whether the discontinuity is due to differences in technology or implementation, or due to differences in ownership and administration.  If Kubernetes contributes the notion of an anchor to our future policy layer, it also contributes the notion of federation…at least sort of.  I came across the idea of federation, meaning the process of creating central governance for a distributed set of domains, in the networking standards activities of the early 2000s.  It’s not well-known in networking today, but the approach we see in concepts like Google’s Anthos looks broadly useful in network federations too.

You could view this future equal-eye approach as one that starts with the overall policy-layer models, and from them drives lower-layer functions appropriate to the network technology in use in various places.  We don’t have to presume uniform implementation of networks, only universal susceptibility to management, if we have something that can apply policies.  The most mature example of this is the cloud, which suggests that cloud tools designed to support some sort of hybrid cloud or multi-cloud federation could be the jumping-off point for a good policy story.  Cisco’s ACI and prior policy initiatives align better with the older network-to-network policy federation of the past than they do with the cloud.

Interestingly, Cisco rival Juniper bought HTBASE, one of several companies that specialize in creating cloud virtual infrastructure.  It was a strong move, weakly carried through, but it probably took out one of the few possible targets for network vendor acquisition in the space.  Cisco, to get a real position above the network in the cloud, would have to buy up somebody a lot bigger, and that would be pretty difficult right now.  For the other network vendors, things are even sketchier.  Nokia and Ericsson are laser focused on 5G, which isn’t anywhere close to where “higher layer value story” would mandate they be thinking.

This would have been a great time for a startup, or better yet a VC stable of related startups, to ride in and save the day.  There doesn’t seem to be any such thing on the horizon, though.  VCs have avoided the infrastructure space for over a decade, and they’ve avoided substantive-value companies in favor of an option for a quick flip, which this sort of thing wouldn’t create.  It seems to me that the cloud providers themselves are the ones most likely to drive things here.

Cloud providers will likely build the framework for contextual services, as I’ve already noted.  They might also end up building the framework of the control/management ecosystem that will reign over the open-model technologies that will be a growing chunk of the network of the future.  However, vendors like Cisco or even Juniper could give them a run for their money.

We’re clearly going to take a global economic hit this year, a hit that could make buyers even more reluctant to jump off into a Field of Dreams investment in infrastructure.  If that’s the case, then a policy-layer-centric vision of the future might be an essential step in bridging the gap between a network that hauls stuff and one that is stuff.

Obviously, It’s Time to Think More about Remote Work

What lessons should tech be learning from COVID-19?  Not the personal lessons that employees and employers learn about contagion and how to prevent it; we know most of them by now.  Not the supply-chain lessons, which in the end could be resolved only if everyone made everything themselves.  Not even lessons relating to statistical analysis and AI in understanding the virus itself; others are far more qualified than I in those areas.  Instead, I want to think about lessons related to how we use technology.  Specifically, we need to learn some lessons about remote/home work.

We don’t gather government data on the specific number of workers who aren’t tied to a workplace, or even the job categories that permit some flexibility.  There’s a temptation to use data on mobile workers to project how many workers might really be able to work well from anywhere—including home—but my surveys and modeling say that doesn’t work.  For example, the largest percentage of mobile workers of all the occupation classes is in the healthcare field, and while these workers are not tied to desks, they are tied to places.  In contrast, office workers have among the lowest mobility factors of all worker types, yet they seem to be the category that could most easily be “unplugged” to work from home or anywhere else.

Many have now recognized that unplugging workers in the face of a dangerous and contagious disease would be very wise.  Any social gathering that involves close personal contact poses a risk of spreading the disease, and if there’s no alternative to sending workers home to avoid contact, a lack of support for unplugged work means the business is effectively shut down.

Some measures to support unplugged work have been applied for some time.  Website support for sales and service not only allows self-help measures, but contact initiated from these websites is easily directed to people with little regard for where they’re located.  Call center support of distributed pools of agents is similarly an old concept, and widely used.

One thing these initiatives have demonstrated is that it’s not enough to have an agent answer a call; you need to supply the agent with access to the information and applications that are needed to respond to the caller in an effective way.  This has resulted in coordinated information/application access for agents, based on entry of an account number or other identifying information.  It works.

The problem is that most workers are not agents handling support or sales calls.  Their roles in their business are highly variable, and the information they need to do their jobs is likewise variable.  In many cases, their work practices include meetings and face-to-face discussions with others.  Getting these kinds of workers unplugged is a lot more complicated.  If you take the worker out of the workplace, can they work at all?  Most today say they can, but only with reduced productivity; that might not have to be the case.

I was the head of an “Unplugged” conference decades ago, a conference dedicated to exploiting technology to reduce or eliminate the implicit connection between work and the workplace.  We say “I’m going to work” because we are going, meaning we’re not there already.  Work is something we do, and at the same time the place we do it, a central collecting point for workers.  The most important question, perhaps, for business today is whether that conflation of activity and place is necessary, and where it isn’t, what technology could do to disconnect the two.

The big mistake in approaching this is to assume that we’re looking for a kind of back-up strategy for working at home.  Hey, there’s a contagious disease or a natural disaster, so we “work from home”.  Home isn’t where we’re used to working, and transplanting information and application access to another place, or providing a nice printer at home, isn’t going to change that.  Instead, we should be looking for a new model of working that doesn’t care whether we’re “home” or “at work”.  As long as we view remote work as requiring different procedures, processes, and tools than “at-work” work does, we’re in trouble.

We have software aimed at supporting remote work, but it’s really focused more on remote communications, collaboration, and project management.  It doesn’t alter the fundamental assumption that what we do at work is what we need to be doing at home, and that’s simply not possible.  You cannot duplicate a work environment at home, even where physical interaction isn’t an explicit part of the process.  What you need to do is to change how we work while “at-work” to be more portable.

A big part of the challenge in doing that is that we don’t have a model for portable work, and that is in part due to the fact that we don’t know much about the workers who might be candidates.  I saw recently that someone had said that over 70% of workers were now working at home, which is patently ridiculous.  How many workers could work at home, or in an arbitrary remote location, and what industry are they in?  How can we approach this?

My own attempt at modeling the unplugged potential for worker categories suggests that about a third of the workforce of a typical industrial economy would be theoretical candidates for unplugged status, if properly supported by technology tools.  That doesn’t address every worker or every job category, but if we could create a “virtual workplace” that could be “hosted” in our usual offices or anywhere else, equally, we could make it a lot easier to respond to the kind of emergency we’re now in.  If every industry and every job classification had to justify its own toolkit, we’d never get anywhere.  We need a baseline approach to the problem, one that can then be customized to fit the specific needs of an industry and job class.

IMHO, the simplest approach we could take here is to assume that an unplugged worker still had a workplace, but that the workplace was virtual rather than real.  A virtual workplace could be constructed to suit conditions, which among other things means it could be made to initially resemble the real workplace as much as possible, and then transform over time to an optimized model as workers became comfortable with the approach.

To understand this in terms of IT, I want to recall my “jobspace” concept.  The sum of information a worker needs to perform the work assigned is the worker’s “jobspace”.  A virtual workplace is then a composable model of information presentation that supports that jobspace.  What might this look like, at least at the high level?

I propose that a jobspace is made up of a series of what I’ll call “taskspaces”.  The tasks are the specific assignments that make up a worker’s job, and it’s these tasks that create specific information or application dependencies.  A task is then a set of information/application interactions that relate to a real-world business activity.  To prepare a report is a task, and so is to get approval for something or contact someone to make a sales pitch.

It’s convenient to think of a task as being rooted, technically, in a series of screens the worker would access on whatever device or devices were available and suitable.  I’m going to call this series of screens a panel.  The virtual workplace of an unplugged worker consists of a number of panels, through which the taskspaces the worker needs are supported and presented, and these add up to the jobspace.  Got it?

Workers who are managed would be given assignments in this approach by having a taskspace panel sent to them.  My implementation theory is that these panels would be in blockchain form, and that they would store information input and output, so they’d be an enduring record of actions taken.  They’d also convey access rights where such rights weren’t permanently assigned to the worker involved.  An assigned task could be the responsibility of several workers, in which case the panel would not only present the screens of the primary worker, but also those of any cooperative workers.  It would also define their collaboration.
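Here’s a loose sketch of the panel idea, with illustrative names only: a panel carries the screens for a task, grants temporary access rights, and keeps a hash-chained (blockchain-style) record of the actions taken.

```python
# A loose sketch of the panel/taskspace idea: a panel carries the screens for a
# task, conveys access rights, and keeps a hash-chained record of actions taken.
# All names and fields here are illustrative, not a specification.

import hashlib, json, time
from dataclasses import dataclass, field
from typing import List

@dataclass
class Panel:
    task: str
    workers: List[str]
    screens: List[str]
    access_rights: List[str]
    chain: List[dict] = field(default_factory=list)

    def record(self, worker: str, action: str, payload: dict) -> None:
        # Each entry links to the hash of the previous one, so the record of
        # inputs and outputs endures and can't be silently rewritten.
        prev_hash = self.chain[-1]["hash"] if self.chain else "genesis"
        entry = {"worker": worker, "action": action, "payload": payload,
                 "ts": time.time(), "prev": prev_hash}
        entry["hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        self.chain.append(entry)

quarterly = Panel(
    task="prepare Q3 revenue report",
    workers=["analyst-1", "manager-2"],
    screens=["revenue-query", "draft-report", "approval"],
    access_rights=["read:finance-db", "write:reports"],
)
quarterly.record("analyst-1", "submitted draft", {"doc": "q3-draft-v1"})
quarterly.record("manager-2", "approved", {"doc": "q3-draft-v1"})
```

A jobspace, in these terms, is simply the set of panels assigned to a worker at any given time, wherever that worker happens to be sitting.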

Some of this may sound familiar to some people; it’s not too far from what Google proposed with Google Wave.  I actually suggested an approach like this to the Google moderator of the developer forum I was a member of while Wave was still active.  Wave, as most of you know, didn’t get off the ground because Google relied almost entirely on outside contributions rather than doing something on their own.

Another aspect of the panels approach is that you could consider a panel to be “fractal”, meaning that it’s composed of information-, application-, and mission-specific elements.  Some of these could be contributed by vendors or service providers, and some reusable within an enterprise or even an industry, mediated in the latter case perhaps by industry groups.  Collectively, panels represent business activity, but they’re framed in IT-virtual terms rather than as a set of business policies or job procedures.

It’s now time to get to the second reason why creating a rigid work-to-workplace connection is bad.  Injecting information empowerment into a manual process, or making the coupling between information and work more intimate, changes the optimum model of worker behavior.  If the worker is immersed in the same old place, same old desk, with the same old coffee-machine companions, the revolutionary nature of empowerment can get buried in sameness.  We’ve seen that happen before.

The modeling theory behind all of this is that it’s possible to track the progression of IT, and in particular the progression in productivity and productivity-targeted spending, by analyzing the extent to which IT is integrated with work.  In the earliest days, IT was applied almost retrospectively; we captured transactions by “keypunching” commercial paper.  That evolved to online transaction processing and the substitution of electronic data interchange (EDI) for commercial paper exchanges.  As this transformation evolved, work became more defined by the tools than the other way around, but at each step in the evolution, there was a delay in uptake because of the inertia of the past.

We’re overdue for a productivity empowerment revolution, and we don’t want to stifle it by failing to account for its needs as we try to unplug workers.  The panel approach, because it serves as the information window onto the taskspaces and jobspaces, would be agile enough to accommodate a different model of worker-to-information coupling.  The virtual workplace matches well with the virtual-world model of IoT and productivity empowerment, for example.  It’s possible that if we did our homework (no pun intended) we could end up making work away from the workplace as productive as, or even more productive than, traditional in-the-office work practices.

We’ve had coronaviruses before (SARS and MERS), and we’ve had pandemics (H1N1).  With more humans on the planet, and more interconnected supply chains and tourism, we’ve shrunk the world while its population exploded.  Epidemiologists would say we’re simply too close for comfort, and so it’s naïve to think we’re not going to see this same kind of thing happen again.  Indeed, COVID-19 may recur this fall, even if it does ease for the summer.  Until we have a good vaccine and treatment strategy for a contagious disease, we’re constrained to handle it through social isolation.  It would be nice if technology could help us deal with less socially intensive work practices.

The traditional home-work tools I’ve reviewed don’t do what I think is essential, which is create that virtual, portable, work environment.  It may be difficult to get something like I’ve described in time to address COVID-19, but just as the virus has been a warning signal to public health, it should be one to employers.  Disease thrives on proximity, and where productivity depends on proximity as well, we have an obvious and perhaps unacceptable level of risk.  That can be fixed, and the fix might also launch us into a new era of productivity-driven IT investment.