Can NFV Deliver the TCO Reductions Operators Need?

Truth sometimes hits you in the face.  So it is with NFV savings.  A Light Reading story quotes Goran Car of Croatia’s Hrvatski Telekom as saying that every function they virtualize has to reduce TCO by 40%.  This is particularly interesting, given that most of the ten operators who created the “Call for Action” paper on NFV back in 2012 said as early as October 2013 that even a 25% reduction in capex wasn’t worth it.  “We can get that beating Huawei up on price,” was the key comment.  According to my recent conversations with operators, they still feel the same way, and in fact their TCO reduction goal is much the same as that cited in the story.  Enterprises, by the way, also want to see cost improvements between 32% and 45% before they’ll approve new projects, so the savings goal range is pretty universal.

Operators think that NFV is falling way short of the savings they’d hoped for, and they cite three reasons for that.  We’ll look at each of them here.  Reason one is that licensing the VNFs is costing much more than expected.  Operators told me they had been offered VNF licenses that would have cost them at least three-quarters of what an appliance would cost.  Reason number two is that operations complexity, including VNF integration, is boosting NFV opex instead of generating savings.  In fact, many operators said that net TCO, including capex and opex, was actually higher in some trials.  Finally, reason three is that expected capital economies of scale were not being obtained.  Part of this is because so many early applications rely on uCPE boxes, dedicated to a customer/service and looking for all the world like appliances.

Every one of these issues was raised with operators in 2013, the first year of work on NFV.  Obviously they didn’t get addressed, and so we need to look at each to see what could be done.

VNF licensing was certain to be a problem.  Operators, like everyone in the industry, it seems, want to improve their own profit margins but are oblivious to what their measures would do to their suppliers.  It was never likely that vendors who had successful network appliances such as firewalls would license their software (which is, after all, their real proprietary value) for a fraction of the appliance price.  Surprise, surprise, those vendors want to charge more.

There has never been any solution to this problem other than to exploit open-source software.  In the summer of 2013, I suggested it was critical that NFV “can draw immediately on open-source tools without requiring forking of the projects and modification of code.”  Had this been done, there would be plenty of open-source options with zero licensing cost.  Note that Metaswitch, who contributed an open-source IMS implementation to the first ETSI NFV proof-of-concept, has gone on to be a successful provider of open-source elements to operators.

Another contributor to VNF cost problems is the high cost of VNF integration.  Again citing 2013 documents, the goal should have been to first define NFVi properly, then define the software framework into which VNFs would integrate.  “Develop the requirements for an NFV-compliant data center, including the hardware issues that would impact performance and stability, the platform software issues, and the mechanism for virtualization” was the first of the recommendations, and “to be able to create a VNF by ‘wrapping’ any software component that exposed an interface, any hardware device, or any deployed service” was the second.  These points are being addressed only now, six years after they were first raised, and even now they aren’t being done optimally.

Our second reason for insufficient NFV savings is an operations cost overrun.  There are two sources of operations complexity and cost in an NFV deployment.  One is the specific costs associated with deployment and maintenance of the VNFs as independent functional elements.  This would include the deployment, scaling, redeployment, diagnosis of problems, etc.  The other is the cost associated with the management of the service lifecycle overall.  To be successful, the first of these opex cost sources would have to trend to zero, since any VNF-specific operations cost is incremental to the operation of networks that had no VNFs.  If, in addressing VNF operations costs, overall service lifecycle operations costs could also be addressed, then opex reduction would contribute directly to lower TCO.

From that summer of 2013 document, the relevant goal:  “Establish the optimum way of defining management practices as “co-functions” of the service, composed in the same way and at the same time, and representing the service and its elements in the optimum way based on one or more existing sets of operations practices, or in a new way optimized for the virtual world.  The functionality created by combining VNFs defines not only how a service works but how it should/must be managed, and the only way to harmonize management with functionality and at the same time create a link between virtualized elements and real services is to link them at the time of deployment by composing both side-by-side.”

Build management as you build services.  Seems logical, but instead what was done was to presume that virtualization only built devices, meaning that NFV created virtual equivalents of physical boxes, which were then networked and managed in the way they always had been.  This meant that all the VNF-specific management tasks had to take place with little or no knowledge of the state of the rest of the service, and that NFV’s management tools could not reduce opex overall.  The best they could do was ensure it didn’t increase, and then only if we presumed that VNF management as a task consumed no resources and cost no money.

The opex failure is the biggest problem with NFV’s business case.  It started with a lack of an effective abstraction of NFVi and the software structure into which VNFs would plug.  It was exacerbated by a decision to target NFV at virtualizing only devices, not services, and it reached its peak in insulating virtualization from the operations lifecycle, and vice versa.  If this issue isn’t addressed, there is no chance NFV could be broadly useful.

The last of the issues was the failure of NFV to establish expected economies of scale.  The problem here is due to a focus on a very limited but easily understood use case, the “virtual CPE” use case.  This, in turn, evolved in part from that “virtual-box” focus I mentioned earlier.  What kind of “box” is present in the greatest numbers in any network?  Answer: the premises device that acts as the termination of the service, the “customer premises equipment” piece.  That’s vCPE in NFV terms.

The problem is that the most numerous kind of CPE is the broadband hub that terminates residential and small-business services.  These devices are available to operators for less than $50, and some report prices as low as $35.  They include a firewall, a DHCP server, a WiFi base station, and more.  Be kind and say there are two or three virtual functions inside.  Two or three VNFs, meaning two or three VMs.  You need “service chaining” (how much time in NFV has been wasted on that!).  All this to replace a device that will probably be installed for five to six years, for an annualized cost of perhaps ten bucks.
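
To make the arithmetic concrete, here is a minimal sketch in Python, using assumed figures from the ranges above (a $50 hub and a five-year installed life), of how that annualized number falls out:

# Rough annualized-cost arithmetic for a residential broadband hub.
# The unit price and service life here are assumptions from the ranges cited above.
unit_price_usd = 50.00        # reported prices run $35 to $50
service_life_years = 5        # installed life of five to six years

annualized_cost = unit_price_usd / service_life_years
print(f"Annualized hardware cost: ${annualized_cost:.2f} per year")
# Prints: Annualized hardware cost: $10.00 per year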

Opex alone would eat the ten bucks, and then you have the cost of the NFVi.  “I lose money on every deal but I make it up in volume,” is the old sales joke.  It’s no joke here.  You can’t have economy of scale when the base cost of what you’re replacing is so low you can’t even support the NFV equivalent at the price.  And if you implement vCPE on dedicated uCPE boxes, there’s no cloud infrastructure on which to develop capital or operations economies of scale.

The first NFV proof-of-concept in the summer of 2013 used Metaswitch’s Project Clearwater open-source IMS as the target, because its VNFs were cloud-ready and could thus be hosted on carrier cloud infrastructure, where both capital and operations economies of scale could be measured.  The focus of most PoC work, though, was vCPE, and it’s no surprise that’s where most early NFV deployments have been focused too.

Why, if all these problems were raised, and proposals to solve them were presented, did we do NFV with no solutions for them, and thus leave them to bedevil operators today?  Part of the problem was the inevitable bias introduced into any standards group by money.  Vendors can spend massive budgets on these bodies because they can use them to promote their views and perhaps seize some advantages by pushing things in directions that favor their own solutions.  This kind of pressure tends first to encourage the body to take the easy path (because to vendors, it’s the profits in the next quarter that drive their stock price), and second to create a diffusion of strategy, because a standard approach doesn’t support vendor differentiation.  If all firewalls are interchangeable, then what’s to keep somebody from changing mine out?

So we have three totally avoidable problems with NFV impacting the TCO.  We didn’t avoid them, so can they be fixed?  Yes, but.

The easiest thing to fix would be the open-source VNF problem.  What operators would need to do is to first establish the software framework I mentioned in relation to operations cost overruns, then contribute developer resources to the open-source projects for the main VNF classes (like firewalls) to make them compatible with the established framework.  This sort of thing wouldn’t disrupt current NFV specifications, only augment them.

The next-easiest thing to fix would be the economy-of-scale problem.  I believe that the preponderance of vCPE projects is partly or even largely driven by opportunistic factors like vendor support or easy validation of the effort put into NFV so far.  Operators need to accept that without cloud hosting, efficient cloud hosting, NFV fails.  They need to prioritize support for projects that are actually cloud-hosted, not those based on uCPE.  This could be done fairly easily, though making true cloud-native VNFs happen goes back to that notion of a software framework.

The hardest thing to fix, sadly, is the operations problem that includes the need for that software framework.  We do have service models (in TOSCA, usually) that can define full end-to-end, multi-technology services.  These would have to be augmented in two ways to make them effective as solutions to the operations problem.

The first way is to build that software framework for VNFs, and also for NFVi.  Good network function software has to be an interplay of abstractions—feature abstractions hosted on infrastructure abstractions.  We have neither of the two now, and we need both of them.  The feature abstractions would create the “plugins” that would match VNFs to baseline NFV MANO elements, and the NFVi abstractions would make any hosting environment look like a “resource pool”.
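
To illustrate what that interplay of abstractions might look like, here is a minimal Python sketch.  The class and method names are hypothetical, not from any NFV specification; the point is that the deployment logic only ever touches the two abstractions, never a vendor VNF or a specific hosting stack directly.

from abc import ABC, abstractmethod

class InfrastructureAbstraction(ABC):
    """Hypothetical NFVi abstraction: any hosting environment looks like a resource pool."""
    @abstractmethod
    def allocate(self, cpu, memory_gb):
        """Reserve capacity and return an opaque host handle."""
    @abstractmethod
    def release(self, host_handle):
        """Return the capacity to the pool."""

class FeatureAbstraction(ABC):
    """Hypothetical VNF-side abstraction: the 'plugin' a MANO element would drive."""
    @abstractmethod
    def deploy(self, infra): ...
    @abstractmethod
    def scale(self, instances): ...
    @abstractmethod
    def health(self): ...

class ToyResourcePool(InfrastructureAbstraction):
    """Toy pool; a real one would wrap OpenStack, Kubernetes, bare metal, and so on."""
    def __init__(self):
        self._next_id = 0
    def allocate(self, cpu, memory_gb):
        self._next_id += 1
        return f"host-{self._next_id}"   # pretend we reserved cpu/memory somewhere
    def release(self, host_handle):
        pass

class VirtualFirewall(FeatureAbstraction):
    """Toy feature: any vendor's firewall VNF would present this same surface."""
    def __init__(self):
        self.hosts = []
    def deploy(self, infra):
        self.hosts.append(infra.allocate(cpu=2, memory_gb=4))
    def scale(self, instances):
        pass
    def health(self):
        return bool(self.hosts)

# A MANO-like deployer only ever sees the two abstractions:
pool = ToyResourcePool()
vnf = VirtualFirewall()
vnf.deploy(pool)
print(vnf.health())   # True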

The second way is to integrate lifecycle management into service models using state/event tables.  A service is a set of cooperative functional elements.  Each element eventually decomposes into a set of resource commitments, and the manner of this decomposition can be (and in fact must be) represented by a hierarchical model, something that looks like an organization chart.  Each element in the model is an independent state machine, each accepting events from its neighbors, generating events to its neighbors, and interpreting those events based on its own process state.  You can model lifecycle automation this way, and I submit it’s hard to model it any other way.
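
Here is a minimal sketch, in Python, of what one element of such a model might look like.  The states, events, and handler names are mine, purely for illustration; the point is the (state, event) table and the fact that each element talks to its parent and children only through events.

# Minimal sketch of a service-model element as a state/event machine.
# States, events, and handlers are illustrative, not from any standard.

class ModelElement:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.state = "ORDERED"
        if parent:
            parent.children.append(self)

    # The state/event table: (state, event) -> (handler, next_state)
    TABLE = {
        ("ORDERED",   "activate"):    ("start_deployment", "DEPLOYING"),
        ("DEPLOYING", "deploy_done"): ("report_up",        "ACTIVE"),
        ("ACTIVE",    "fault"):       ("start_redeploy",   "DEPLOYING"),
        ("ACTIVE",    "teardown"):    ("release",          "TERMINATED"),
    }

    def handle(self, event):
        handler_name, next_state = self.TABLE.get((self.state, event), (None, self.state))
        if handler_name:
            getattr(self, handler_name)()
            self.state = next_state

    # Handlers: real ones would commit resources or generate events to neighbors.
    def start_deployment(self):
        for child in self.children:     # push the event down the hierarchy
            child.handle("activate")
    def report_up(self):
        if self.parent:                 # push a status event up the hierarchy
            self.parent.handle("deploy_done")
    def start_redeploy(self):
        print(f"{self.name}: redeploying after fault")
    def release(self):
        print(f"{self.name}: released")

# An "organization chart" of elements: a service decomposing into two features.
service = ModelElement("vpn-service")
access  = ModelElement("access-feature", parent=service)
core    = ModelElement("core-feature", parent=service)

service.handle("activate")     # cascades activation to the children
access.handle("deploy_done")   # a child reporting success notifies its parent
# (a real element would wait for all children before going ACTIVE)
print(service.state, access.state, core.state)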

None of this is technically difficult, but it’s apparently politically difficult.  NFV proponents don’t like to toss out a lot of what they’ve done, or let changes expose the truth that it wasn’t the optimum approach.  Vendors who have pushed things in their favor don’t want things to push them back.  But every day we let this go is a day we move NFV further from the truth, further from the point where remedy is practical, or even possible.  Those who want NFV to succeed need to get on with the task of fixing the problems operators are now clearly recognizing.  After all, it’s those operators who have to be convinced to deploy it.

The Changing World of Operator-Provided SD-WAN

The excitement about SD-WAN as an operator offering doesn’t surprise me.  My own research showed that for all of this year, the operator sales channel for SD-WAN was the fastest growing.  My modeling shows that it’s very likely to make up two-thirds of all SD-WAN sales by the end of next year.  But despite this rosy picture, operators don’t have a free run here.  In fact, they face some major risks, which we’ll have to dig a bit to expose and understand.

SD-WAN adoption has multiple drivers according to enterprises I’ve dealt with, and many are influenced by more than one of them.  Today, over 85% of all SD-WAN buyers say that their use of SD-WAN is primarily to connect small sites to company VPNs.  About 60% say that the high cost of MPLS VPNs is a motivator.  Just over 70% say that they can’t get MPLS VPNs at all their sites.  Less than 5% of companies say they’re replacing MPLS VPNs with SD-WAN, and about the same number cite cloud connectivity as a driver.  However, buyers cite high MPLS costs and MPLS replacement twice as often as they did at the start of this year, and cloud connectivity as an SD-WAN driver is cited more than twice as often as it was in January.

Operators also have multiple drivers for offering SD-WAN.  The primary driver (over three-quarters of operators cite it) is competition; they fear that managed service providers who are approached to connect small sites (not connected to the corporate VPN) via SD-WAN will try to sweeten their own pies by offering to replace MPLS VPNs in other small sites.  A close second (with over two-thirds) is to gain additional revenue from sites that aren’t candidates for MPLS VPNs.  The third (cited by, coincidentally, a third of operators) is to contain the support impact of VPN connectivity problems created by SD-WAN and MPLS in combination.  Operators think they’ll be on the hook to troubleshoot even an MSP’s or an enterprise’s own SD-WAN setup.  Only 15% of operators say they worry about losing revenue, but that number has also doubled since January.

If you mash all this information up, what you get is an SD-WAN market that started with a limited goal and is transforming into a potential VPN revolution.  Buyers are far more willing to toss MPLS aside than they were at the start of the year.  That’s increasing operator concern that they’ll lose money in the net with SD-WAN.  One operator said that the profit from a single, small, MPLS VPN site was equal to the profit from five SD-WAN sites.  Buyers are also driven to SD-WAN by their increased use of the cloud, in the form of hybrid cloud, and if you presume that most enterprises will adopt and expand hybrid cloud, then cloud SD-WAN will be the nose of a potentially bigger camel under the operators’ tent.  That could lead to more MPLS VPN displacement if users get comfortable with SD-WAN’s cloud mission.

You can see market validation of the concern about hybrid cloud in things like Stateless’ software-defined interconnect offering, a strategy for (among other things) extending VPNs to the cloud without SD-WAN.  Some industry pubs even cite Stateless’ approach as an SD-WAN competitor, which it is not.  Do we think operator concerns about hybrid cloud pulling SD-WAN through and displacing more MPLS VPNs are unrelated to this Stateless announcement?  I sure don’t.

There are only three possible pathways for operators to follow, given their fear of net revenue loss with SD-WAN.  One, obviously, is to stick their heads in the sand and stop selling SD-WAN, but as the operator who cited the profit difference between SD-WAN and MPLS VPN sites noted, the loss of an MPLS site to a competitor cuts profit even more than losing it to another of your own services.  The second is to try to raise SD-WAN revenues, and the third is to try to limit the MPLS VPN displacement risk.  It’s these last two that operators seem to be considering.

Charging more for SD-WAN without changing the nature of the services offered has very little credibility, but a full third of operators say they’re essentially looking at the option.  The most popular notion would be some sort of enhanced over-the-Internet transport option, but some operators are wary of how that could be done without running into net neutrality issues.  The prevailing idea is to create something like a local gateway on the Internet, to which SD-WAN traffic would be targeted, and which would then put that traffic on a “superhighway” (to paraphrase an operator supporter of the idea).  There are also ideas linked to 5G evolution and infrastructure.

Creating new features to justify new charges is an idea almost all operators say they’re looking into.  Cloud integration is such a feature, but operators admit that it’s table stakes at this point.  Additional security, traffic prioritization, and explicit connection control are all being reviewed too.  Some operators think that SD-WAN could be a Trojan Horse for NFV, via a piece of universal CPE (uCPE) on premises justified by SD-WAN then multi-tasked with other feature hosting.

Linking these new service features to some form of uCPE seems to offer an option in the “contain-the-MPLS-loss” approach too.  One truth about SD-WAN to date is that whatever it does, it does to the traffic it actually handles.  Traffic among users and resources still on the MPLS VPN is not typically impacted by SD-WAN at all.  Sticking a nice uCPE box into every site would give operators a VPN-plus story everywhere, and it could also revitalize the moribund NFV space.

Be careful what you wish for, though, if you’re an NFV advocate.  It’s very clear that NFV architecture is a long way from optimal, and in my view it’s actually over the line into non-functional.  The work on creating NFVi “classes” to eliminate the variety of hosting requirements set by VNFs is a stark admission the ISG’s approach failed to properly abstract NFVi in the first place.  If we were to get a lot of uCPE deployments, it would make any changes to NFV more difficult by posing the classic stranded-installed-base risk.

The problem with all the happy SD-WAN operator outcomes is the long-standing tendency for operators to be motivated more by competition than by opportunity.  You can see that in their SD-WAN behavior to date, in fact.  If operators play defense on SD-WAN, they inevitably fall behind, and this clearly isn’t a market you can afford to fall behind in.

An Organized Look at OSS/BSS Transformation

Light Reading did a nice piece on OSS/BSS in the cloud, which happens to be the next of my “what operators think” topics, arising from my almost-two-months of exchanges with a variety of operators globally.  The LR article covers some things I didn’t get into, and I got into more detail with operators on some other topics.  This all adds up to some points I think are interesting.

Let me open with what I didn’t cover in my talks with my operator contacts: the operator politics.  I’ve noted in past blogs that there is a real tension within operators on the topic of OSS/BSS.  The CIOs are indeed a conservative bunch, but that’s true of almost all the telco executives I’ve dealt with.  The unique problem that CIOs have is that OSS/BSS change is, to quote a comment from decades ago, “All cost and no benefit.  The best you can hope for is that nobody will know you did it.”  But conservative or not, CIOs are under pressure to modernize in some way, and it’s pretty clear that cloud-hosting more of their operations systems has a lot of CEO and CFO backing.

A cloud-hosting decision for OSS/BSS has wider impacts than the CIO organization.  The reason why OSS/BSS as a cloud application is important overall is that it’s a primary driver in the early network operator decisions to partner with public cloud providers.  Progress in cloud-hosting OSS/BSS would enhance these relationships, teach operators cloud-think, and perhaps accelerate the move to true carrier-cloud infrastructure.  A slow roll here might push back operators’ cloud plans for a long time.

Very few operators have really done much with OSS/BSS transformation to the cloud, no matter what steps you define as “transformation”.  Smaller operators, who have a harder time sustaining their own data center resources for hosting operations, have been more aggressive in migration.  Operators have also been looking to bypass traditional operations systems entirely for new services, and what’s happening in both these areas is interesting.

Perhaps the biggest question in OSS/BSS migration to the cloud is whether “migrating”, meaning simply moving, makes sense.  For the last decade, there’s been a powerful and growing set of operator insiders who believe that legacy OSS/BSS is really a bad idea.  Thus, to “migrate” what they have is to waste effort cloud-hosting something that’s going to be replaced, inevitably.  Of the operators I’ve communicated with, almost four out of five expected that “major changes” to their overall operations software model would be likely within three years.

There are major changes and major changes, of course.  All software tends to change over time, and some of the changes seem profound but are actually simple while others seem simple and send everyone back to the drawing board.  Most applications will tolerate GUI changes, and many will even tolerate having new data elements added.  What’s hard to absorb is a basic change in mission, but interestingly the OSS/BSS is somewhat insulated from that extreme fate by a structural truth.

To understand why this is, you have to look at operations software the way operators themselves are starting to look at it.  Don’t think of the difference between “OSS” and “BSS” or the antecedents of either of the two.  Think in terms of the way functionality is organized.  There are essentially three layers of operations software, and each of these layers is under a different set of pressures.  As a result, cloudification of each could be driven in different ways and different directions.

At the bottom are the batch systems that do things like billing and analytics.  These systems operate off databases and it’s highly questionable in the minds of two out of three operators that they’d be candidates for change or cloud-hosting.  Generally, the data held by and used by these systems are important and regulated, and so the operators are wary of pushing them into the cloud.

The middle layer of software is the transaction processing that does the updates to these databases.  The transactions come from a variety of sources, including customer service reps, customer online portals, and network/service operations systems.  This layer is much like enterprise OLTP software, and there’s been ongoing work to use the cloud, web, and mobile devices to support at least the front-end portion of these applications.

The top layer is the real-time operations layer, the layer responsible for actually supporting the service lifecycles and the resources that support the services.  This layer handles service and network events to generate transactions, trigger operations tasks, and perform whatever automated operations responses are available.  Obviously, this is a layer that’s subject to a lot of pressure for innovation, and thus is the one changing the fastest under business pressure.

One of the differences between Tier One and Tier Three operators is the extent to which these layers are supported by commercial off-the-shelf general business software.  Some Tier Threes use simple tools to handle their operations tasks; I’ve visited Tier Threes who used Microsoft Office components for everything.  Tier Ones are more likely to use OSS/BSS products, supplemented in many cases by stuff they’ve developed on their own.  Obviously, the more an operator relies on third-party software, the more they’re committed to the cloud strategies of their software provider.  That doesn’t mean that it would be impossible for an operator to migrate a commercial software package to the cloud without vendor help, but it would be a fork-lift migration rather than something optimized for cloud hosting, which most operators know is less than optimal.

Another difference is the pace of change/innovation in the three layers.  Over three-quarters of operators say their batch-layer stuff is fairly stable, and most Tier Threes and over two-thirds of Tier Twos say that’s also true of the rest of their layers.  The Tier Ones, with more services, more regulations, more infrastructure, and more competitive pressure, have seen considerable change in their middle-layer transactional stuff already, focused on improving productivity and customer service.  They’re also the ones looking at radical changes in the top layer.

It’s the pace of expected change that raises the biggest questions about OSS/BSS migration to the cloud.  There are few vertical markets whose core business software structure has been as static for as long as network operators’ has been.  Those who lived through the early days of the Bell System will recognize pieces of SS7 and Class 5 switching in some OSS/BSS implementations even today.  The times, as they say, are a’changing now.  Would even the batch layer be designed as it is, were today’s requirements the drivers and not the requirements of yesteryear?

If operator transformation is driven, as operator standards have been, from the bottom up, technology changes (including SDN and NFV) would have the greatest immediate impact on the top layer, where events were turned into transactions and where any form of operations automation would likely hit.  If transformation were driven from the top, from the introduction of new services, then the biggest impact would come in the middle transactional layer.  The difference here could be significant.

Transactional impact would mean that changes in the middle layer would make it a candidate for cloud hosting and even optimal cloud hosting.  Since new services would likely involve new real-time events, the top layer would also be a candidate for cloud-native modernization.  If we have mostly real-time operations event impact, the tendency would be to try to preserve the transactions generated by the top layer, to minimize the ripple impact on the transaction layer.  That would tend to confine software modernization to the top layer.

Closed-loop or “zero-touch” automation could have a broad impact on both the top and middle layers of OSS/BSS.  At the high or “goal” level, the hope is that a customer portal could be the source of service moves, adds, and changes, and also a means of linking users directly with the operations processes intended to rectify service problems.  This could cut out the CSR and the associated transactions.  At the middle level, the hope is to link condition changes to remedial processes without human mediation, which would eliminate the need to dispatch craft personnel and reduce the role of the network operations center.  At the bottom level, it would tend to eat events in the top layer of the software, making that layer much larger and more important, and perhaps eventually even turning the middle layer into a kind of stub between real-time and batch.  Even the batch layer could be reduced because the use of portals would reduce the need for commercial paper and the associated processing.
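
As a rough sketch of the middle-level idea, linking condition changes to remedial processes without human mediation, here is a toy dispatcher in Python.  The condition names and remedies are hypothetical; the point is that only unrecognized conditions ever reach a human.

# Toy closed-loop dispatcher: a condition change is routed straight to a
# remedial process instead of becoming a ticket for a CSR or the NOC.
# Event names and remedies here are hypothetical.

def restart_vnf(evt):      return f"restarted {evt['element']}"
def reroute_traffic(evt):  return f"rerouted around {evt['element']}"
def notify_portal(evt):    return f"posted status for {evt['customer']} to the self-service portal"

REMEDIATION = {
    "vnf_unresponsive": [restart_vnf, notify_portal],
    "link_degraded":    [reroute_traffic, notify_portal],
}

def handle_condition(event):
    actions = REMEDIATION.get(event["condition"], [])
    if not actions:
        return ["escalate to operations staff"]   # only unknown conditions reach humans
    return [action(event) for action in actions]

print(handle_condition({"condition": "link_degraded", "element": "edge-7", "customer": "acme"}))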

Then there’s pricing.  The majority of the batch or bottom layer of OSS/BSS relates to the paperwork in service ordering and changes, and to billing.  The majority of the billing transactions relate to service changes that have billable impact.  As the transactional middle layer declines, transactions also decline.  That means less for the batch layer to do.  If you combine fixed-price services with all the other factors here, you can see why a lot of people think OSS/BSS is a dinosaur.

But we’re ignoring the new services.  I think it’s true that traditional OSS/BSS will decline gracefully into a largely irrelevant state, if operators remain fixated on selling only connectivity services.  If they go more OTT-ish, I could see a lot of new requirements.  Would they be fulfilled by the OSS/BSS?  That depends on how quickly operators move to these new services.  If I were an OSS/BSS vendor, I’d be erecting “Start Selling New Stuff!” billboards outside my customers’ executive offices.

Injecting Some Reality at the Edge

Because we just came off an edge computing conference, we’ve got a lot of edge computing stories this week.  A big part of this is the usual tech hype cycle, of course.  When you have ad sponsorship, the site owners have an incentive to create links that are clicked often and so serve ads often.  Stories that build on each other are called “rolling thunder”, and the easiest way to get that effect is to focus on a topic and make it exciting.

This doesn’t mean that edge is all B.S. (though it probably means that what we read about it is).  In my operator discussions over the last month, I heard from plenty of operators who thought edge computing was one of their future differentiators, or that it was critical to one or more of their future services.  What I didn’t hear was a lot of specificity.  “Edge?  Oh, yeah, we’re going to do that,” was a typical comment, and one hardly comforting in its detail.

Edge computing, broadly, means computing placed close to the point of information.  For users, it means placing computing close to work, and that of course has had its problems over the years.  We wouldn’t have needed server consolidation had we not had too many servers deployed “at the edge”.  Enterprises realized quickly that utilization of these edge resources was low and support costs were high, and a lot of early cloud interest came about because of enterprise-created “edge computing”.

Today’s discussions of edge computing are more related to edge-as-a-service, meaning edge computing services offered by public providers.  Because network operators would be logical candidates for offering edge computing services, “the edge” has been expanded to mean applications of edge computing to operator missions, meaning to building services, and not just as a form of public computing.  Edge, for most operators, is part of the “carrier cloud”.

Most, but not all.  About three-quarters of operators think that carrier cloud will include or even be dominated by edge computing resources.  Since that’s very possibly true, it’s a good sign.  Less of a good sign is the fact that slightly less than half of operators understand that edge computing is really cloud computing.  The quarter who don’t see edge and carrier cloud as congruent think of edge computing in the old enterprise sense—move compute close to the work.  They see edge applications as being rather static, pinned in place by the link between edge computing and low latency.

Logically, standards groups could be expected to resolve a basic point like the mission of edge computing.  We have an ETSI “edge computing” group, and Light Reading ran an interesting story about some concerns operators have on the group’s activity.  Most center on things like the Multi-Access Edge Computing (MEC) work.  I have my own concerns, both about MEC and MEC compliance, and about edge computing overall.  At the least, I don’t think they’re dealing with the question of basic edge mission, and things go downhill fast when you don’t know why you’re doing something.

My biggest concern is we have, with edge computing standards, yet another case of developing detailed technical specifications from the bottom up.  How do you do API specs without having functional elements with specific behaviors that are accessed via those APIs?  You don’t, but how do you develop a model of functional elements without starting with the missions and constraints on edge computing?  You don’t, but we did just that.

Let’s look at edge logically.  First, it would be lunacy to think that edge computing was anything other than a special case of the cloud, a class of resources and “foundation services” that existed as part of cloud infrastructure and that happened to be located at the edge of the network, which for operators would mean the central offices and perhaps some metro locations.  Second, the applications that drive the edge are a subset of the same applications that would drive carrier cloud.

I’ve blogged often and in detail about what those applications are.  I could make assumptions about how the services needed by those applications would look, and from that I could define APIs that would expose features that could be composed into those services and support those applications.  That’s not been done.  Instead, what we’ve done is attempt to define generalized functional APIs, which necessarily focus not on what edge computing does but on how edge computing does it.

Some say that the “application” of MEC is actually mobile edge computing (sadly, the same acronym!), but mobile edge isn’t an application at all.  It’s an assumption that because 5G would “reduce latency”, it marries with a compute resource set close to the edge in order to take advantage of that reduced latency.  But will 5G really reduce latency, enough to matter?  You can’t say without postulating a specific application, and we have none really identified.

The best edge applications would deal with personalization and contextualization of services, and in a related sense to IoT.  If a software architect were to look at those areas, they’d first frame what functionality the three areas would require, decompose that into a service model that would present as a set of hostable features, and then define APIs for them.  Equipped with that, developers could build the applications.  Without that, developers are faced with a set of APIs that represent a framework for deployment and management of something they can’t identify a business case for.  Sounds a lot like ETSI NFV ISG repeated, just as I feared in yesterday’s blog.

Telefonica, according to the article in Light Reading, is chiding the vendors who responded to their RFI on edge computing, but what did they expect?  The MEC work offers nothing truly useful, and vendors are confronted with the choice of either doing all the heavy lifting to turn low-level, management-oriented APIs into real services (and giving all their competitors the benefit of their work), or selling stuff focused on the only thing MEC does offer (the wrong thing) and not being helpful.  Obviously they took the latter option.

This stuff is starting to look like an operator conspiracy to do useless stuff, and a media conspiracy to make it sound important.  That, from someone who hates conspiracy theories.  What it really represents is something harder to fix.  It’s a collision of self-interest and self-delusion.  The media is paid for hype, and operators want to believe that they can shove transformation back into the virtual version of the same old boxes and the same old services they grew up on.  Supply-side thinking leads to bottom-up design, and ad sponsorship to hype waves.  QED.

Why is this process so broken?  Ultimately, as I suggested above, the operators as the buyers need to be responsible for enforcing rationality if they’re going to enforce anything.  Nothing has been done in transformation standards except what operators have pushed for, so every mistake in approach can be traced to their doing the wrong thing.  It would be better not to have standards at all, to let open-source strategies develop and then select from them.  But now, given that they have already developed in the cloud community, it would be best to simply adopt what’s working.

Just because this problem isn’t caused by vendors (directly, at least; they do certainly contribute to the hype waves!) doesn’t mean vendors couldn’t or shouldn’t fix it.  We already see some vendors positioning ecosystemic solutions to cloud-native development, and VMware is also positioning itself directly to network operators.  I’d like to see more of that, more vendors assembling the tools needed.  I’d also like to see all the vendors doing more to elevate their features and APIs, looking upward to the applications that will have to be there, earning revenue, or nothing will happen…ever.

Some Operator Views on NFV versus Cloud-Native

Do operators think cloud-native is better than NFV?  Do they think NFV could be done in a cloud-native way?  What do they think they’ll end up deploying most?  These are questions I tried to get answered in my recent exchanges with network operators.  I was a bit less successful than I’d hoped, but I think what I learned was very useful.

To the first point, two of every three operators saw cloud-native as “more important than NFV.”  The advantage, in their mind, came from the fact that NFV doesn’t fully exploit the operations tools that the cloud increasingly relies on.  It’s not that they didn’t think NFV would work, but rather that they thought it was too much work.  They think containers and Kubernetes are the right approach, and that NFV has taken another approach.

I’d expected a different response, not that cloud-native was less important but that it was important for another reason.  It’s been my view that function virtualization under NFV is too device-centric, too intent on preserving the old device-network model and its problems.  Obviously I believe that to be true, but operators are focusing on a short-term issue, which is how to keep virtualization from creating a killing increase in opex.

“NFV is too complicated,” said one operator.  “Everything is a one-off,” said another.  When you press on these points, it’s clear that the concerns are that VNF onboarding is almost like a systems integration task, and “management and orchestration” in NFV doesn’t do enough managing and orchestrating.  To me, these things are simply expected consequences of a poor overall architecture, one that didn’t consider the full scope of NFV impact because the ISG kept operations out of scope at the time the critical architectural model was devised.

If you expect to manage virtual functions with device management tools and principles, you end up replicating the real devices via virtual devices.  That means that your management and operations focus is on the virtualization of each box, not on the collective operational efficiency of the network.  NFV wasn’t designed to make network operations better, only to make it device-like.  But it’s the failure to fully operationalize that worries operators.

The same thing is true with on-boarding.  They see the process as being too resource-intensive, but they don’t see that as a failure of the architecture (which it is), but as a set of missing tools.  It’s possible to make on-boarding of VNFs, their integration into a functional and operationally stable network, less labor-intensive, but it’s not as easy as it should be, because there was no mechanism provided to abstract VNF functionality in a systematic way, to make all “firewalls” look, operationally, like a single class of VNFs with common deployment and operations rules.  That would have solved the on-boarding problem.
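
Here is a minimal sketch, in Python, of what that kind of VNF class abstraction could look like.  The class name and methods are hypothetical, not from any specification; the point is that onboarding a vendor's firewall reduces to implementing one fixed surface, and the operations logic never changes.

from abc import ABC, abstractmethod

class FirewallVNF(ABC):
    """Hypothetical class definition: every firewall VNF onboards against this,
    so deployment and operations tooling never sees vendor-specific details."""
    @abstractmethod
    def apply_ruleset(self, rules): ...
    @abstractmethod
    def get_counters(self): ...
    @abstractmethod
    def heal(self): ...

class VendorAFirewall(FirewallVNF):
    """One vendor's implementation; onboarding is just filling in these methods."""
    def __init__(self):
        self.rules = []
    def apply_ruleset(self, rules):
        self.rules = list(rules)
    def get_counters(self):
        return {"rules_installed": len(self.rules), "packets_dropped": 0}
    def heal(self):
        pass

def operate(fw):
    """Common operations logic: works for any conforming firewall, with no per-vendor integration."""
    fw.apply_ruleset([{"action": "deny", "port": 23}])
    return fw.get_counters()

print(operate(VendorAFirewall()))   # {'rules_installed': 1, 'packets_dropped': 0}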

On the second question, operators do believe that NFV could be done via cloud-native principles.  The problem is that they think it would take a long time.  Half the operators said it would take “years”, a third “one year”, and the rest “more than six months”.  This illustrates two things.  First, that most operators think cloud-native NFV would be almost a redo of the original effort, gaining perhaps a bit of execution efficiency from experience, but still a long slog.  Second, that cloud-native principles would result in something totally different.  That point is what separates the operators in their responses, in fact.  The “years” camp think that you need to start over, and the “more than six months” camp think that you could retrofit cloud-native principles on NFV without making fundamental architectural changes.

The fact that cloud-native NFV is expected to take a long time means that a lot of operators think they’ll end up deploying both “NFV” and “cloud-native”, which means they don’t think the two will converge technically in the next couple of years.  Over three-quarters of operators say that “cloud-native” deployment of network service features would be possible using the tools the cloud community have already developed.  A slight majority of this group think that some policies and practices would be needed to augment these tools to fit into the special concerns of network operators.  The rest think that you could use cloud tools “in the same way cloud providers do”.  They see the OTTs and cloud providers as models in how cloud-native should be used.

What are the “special concerns” of operators?  Primarily reliability and availability, say the operators.  Almost all operators say that transformation to virtual elements would be possible only if all current network SLAs could be met.  This, given that more and more user communications takes place over best-efforts services.  Those services that aren’t totally OTT, like voice calls, are declining in importance.  I think this is important, because it suggests that not only are operators fixated on virtualization as the creation of 1:1-mapped-to-devices virtual boxes, they’re fixated on a service model that’s out of sync with current market reality.

This view of cloud-native means that operators do see it separate from NFV in deployment terms.  For things like virtual CPE, operators think NFV will prevail.  For implementing new services “above” the network (meaning OTT-like services) they think cloud-native will prevail.  They also think that, over time, there will be more and more of those OTT-like services, and thus that NFV will deploy less often.  About a third of operators think cloud-native will naturally displace NFV over time, even in vCPE applications.

You might see these views as validations of many of the things I’ve said in past blogs, but I have to admit that there’s a big hole in the story that’s troubling me.  Operators are unhappy with the symptoms of NFV’s issues as they are visible today.  They have a very tactical view of what’s wrong with NFV, which means that they really don’t have a proper planning mindset to formulate a successor cloud-native model, or even to assess one.  That’s bad, because outside a few technical specialists in the cloud-native world, the cloud-native model isn’t well-understood anywhere.

We do have a sort-of-ad-hoc vision of the cloud-native story, and that vision is getting clearer as major vendors start assembling and marketing collections of open-source technology that collectively address all the cloud-native requirements.  Still, the vision is so hazy that most enterprises (to whom the efforts of vendors are largely targeted) don’t understand it.  Operator planners have less exposure than even enterprise planners, and more ossified thought processes too.  This is why operators think a full cloud-native transformation will take a long time.

This is also likely why so many operators are looking to partnerships with the cloud providers as a means of realizing their own near-term opportunities for cloud-hosting of features and functions.  They may learn enough from those to get their own infrastructure right.  But while they’re learning, much of what they do with cloud-native technology, whether it’s in something old like NFV, something current like 5G, or something future like IoT, is likely to flounder.

The biggest risk we face in transformation is this planning-lags-behind-markets problem.  Operators need to understand what the future will look like, a future they know instinctively has been shaped by the cloud providers.  If they don’t, then they’ll keep doing NFV-like projects and making the same old mistakes.  We can fix the mistakes of the past only if we don’t keep repeating them.

Principles for the Creation of Optimum Virtual Networks

What exactly are the attributes of a good virtual network?  This is perhaps the most critical question in our industry, because virtual networking is the foundation not only of operator service initiatives like NFV but also of cloud-native applications.  You’d think, given this, that we would have nailed down the characteristics of virtual networking by now, but perhaps because virtual networks are…well…virtual, we haven’t.

The early virtual networks were segments of Ethernet or IP (VLANs and VPNs).  Over time, we evolved a model that built “virtual networks” primarily through the use of virtual trunks, pseudowires if you like.  MPLS VPNs are IP VPNs built that way.  So, in a sense, we’d virtualized practically everything but the actual network.  SDN kind-of-proposed to do that through the use of generalized forwarding with central control of the forwarding tables, but isn’t that really control-plane virtualization only?

The point is that we’re struggling to decide what a virtual network is, or at least how to approach one.  Rather than try to nail down implementations of something that shouldn’t really be nailed down, I’d like to propose some principles that need to be followed.

First Principle:  A virtual network must define a service interface, the connection to its users, that’s compatible with the current technologies and services in use.  We are already, not only as an industry but as a global economy, totally committed to networking.  There are hundreds of billions of dollars’ worth of equipment and software installed, and nobody is going to trash all that investment and start over.

What makes this a bit complicated in practice is that IP networks present a number of sub-services to the primary IP-as-the-Internet model we all see constantly.  Service providers have interconnections with each other, for example, that rely on the Border Gateway Protocol (BGP), and however much many would like to see a better approach, we can’t disrupt current IP networking by failing to support all the interfaces it’s committed to.

That leads to our Second Principle, which is that a virtual network must be capable of supporting new service interfaces if those interfaces become valuable/desirable, with minimal (if any) changes to the deeper elements of the network.  This means, in practice, that it’s the edge elements of a virtual network that are responsible for creating the service interface(s).

A corollary to this is that the internal elements of a virtual network should be, insofar as possible, service-interface-independent.  In practice, that would mean that control-plane and management-plane elements of the service interface and the internal elements of the network should be independent of each other.  If specific control-plane behavior from a service interface is promulgated through the internal elements of a virtual network, that will tend to bind internal behavior to a specific external network interface.

The Third Principle is that virtual networks must allow any internal structure or function that can satisfy the service interface.  A network, first and foremost, is a connectivity resource, and as long as the external connectivity properties are satisfied and the external control/management interfaces can be synthesized at the point of service connection, that’s fine.

This is the principle that may be viewed as colliding with my long-standing conviction that there’s a difference between a network of virtual devices and a virtual network.  A router network is a network of devices, and if I use hosted router instances or even white-box technology with P4, I could construct a “router network” from somewhat-virtual elements.  The advantage of this approach is that it could allow a more graceful transition from the “real router network” to a true virtual network.

The operative word here is “could”.  It’s my view that if the design of the virtual network, and the virtual elements within, was framed explicitly to evolve from a device-specific substitution of virtual for real to a network-wide substitution of device-centric to truly virtual, then that’s fine.  The design just has to provide for that, for example, by saying that once an area of the network has been replaced with virtual devices, it can then act as a single virtual device with respect to the rest of the network, but shed its device-specific behavior within the area that’s been virtualized.

The Fourth Principle is that a virtual network must be capable of defining connectivity in any way, as long as that definition doesn’t compromise the requirement to sustain current service interfaces.  That means that connectivity is a policy matter, something that can be altered by the user and/or provider of the service to limit what can actually be passed.  It means that a virtual network can either apply policy filters to the service or provide inline features to augment it, within the general service interface it must support.  The new capabilities might include the application of encryption or forms of deeper inspection.

Looking again to the practical side, this would mean that in-line services like firewalls could become an integral part of the service of a virtual network, rather than be provided through something external (outside the service interface).  The exercise of this principle, then, would be the logical way for operators to build “new” services that were extensions of current connection services, as a means of gaining new revenue opportunities.

This principle would also allow a network service to be augmented by something that’s actually a kind of virtual destination.  DNS is already this sort of augmentation; it’s not strictly a data-plane IP feature, but it’s a logical part of any useful IP service.  Again, this is a pathway to augmenting existing services, perhaps even a way to introduce OTT-like features or provide IoT-related services.

One particular asset related to this principle is the ability to place cloud-native features, even within a service mesh, as a virtual destination in a service.  Thus, it’s this principle that would be the connection between traditional operator connectivity services and most of the carrier cloud services.  In a way, it’s the operators’ bridge to future services overall.

These principles add up to what we could call agile abstraction.  We know virtual things are “black boxes” or “intent models” whose properties are only known through their interfaces.  Ideally, a virtual network would be a nest of these abstractions, which is what creates agility.  The entire network might represent a virtual device, and within it we might have communities of black boxes that decompose further into smaller communities, even to individual virtual devices.  If a network provides services, the services can present themselves through the outer interfaces and be implemented within in any way and any place that’s convenient.
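
A tiny Python sketch of that nesting idea, with made-up names: each black box exposes only a connect capability, and how it satisfies that capability, by decomposing into smaller boxes or eventually into virtual devices, is invisible from outside.

# Toy nested "black box" / intent-model structure. Names are illustrative.

class BlackBox:
    """Exposes only a service interface; how it connects things is hidden inside."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def connect(self, a, b):
        # The external promise: endpoints a and b get connectivity.
        # Internally, delegate to whichever child (sub-box or device) is convenient.
        target = self.children[0] if self.children else self
        if target is not self:
            return f"{self.name} -> " + target.connect(a, b)
        return f"{self.name} connects {a}<->{b}"

# A whole network as one box, decomposing into areas, then virtual devices.
network = BlackBox("virtual-network", [
    BlackBox("metro-area", [BlackBox("virtual-router-1")]),
    BlackBox("core-area",  [BlackBox("virtual-router-9")]),
])

print(network.connect("site-A", "site-B"))
# virtual-network -> metro-area -> virtual-router-1 connects site-A<->site-B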

If the cloud is a fully elastic set of resources, capable of doing anything and being anywhere, then how can that cloud be built and used without this kind of virtual network?  Think about it.

Are Enterprises Ready for Cloud-Native Applications?

One of the interesting questions raised during my recent enterprise Q&A related to the adoption of cloud-native technology for traditional business applications.  An enterprise who’d gone unusually far in assessing cloud-native, to the point of starting a small application test, had developers suspend writing code to answer a basic question about architecture.  “How does this application divide into cloud-native microservices?”  They could picture this web of cloud-deployed components, but they were wary about the impact of all the network connections among components.  That was a good question, but it was the iceberg-tip of a better one.

Let’s say we have to process a million orders in an eight-hour period.  Back in the old days of data processing, the paperwork would have been captured via data entry, and the assembled orders (“batched”) would have been read and sequentially applied.  In this kind of batch processing, the time it takes to process a single order, multiplied by one million, becomes the time it takes to run the job.  Every disk I/O, every component processing time, is serialized.  A hundred milliseconds of delay (a tenth of a second) equates to a hundred thousand seconds of accumulated delay.
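
The arithmetic, as a quick sketch:

# Serialized delay in the batch model: per-order latency accumulates across the run.
orders = 1_000_000
per_order_delay_s = 0.100            # a hundred milliseconds of delay per order

total_delay_s = orders * per_order_delay_s
print(f"{total_delay_s:,.0f} seconds of accumulated delay "
      f"(~{total_delay_s / 3600:.0f} hours added to the batch run)")
# Prints: 100,000 seconds of accumulated delay (~28 hours added to the batch run)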

When we do the same million orders in real time, things are different.  The order processing is now coupled to the actual transaction, typically involving things like human customers and human salespeople pushing real merchandise around.  Toss a hundred milliseconds into this mix and you don’t see anything different where the dollar meets the point-of-sale device.

The reason I’m bringing up this seemingly tangential point is that the nature of an IT application, fundamentally, depends on the way it relates to the real world.  In the ‘60s and ‘70s when most commercial activity was batched, we had to worry a lot about process efficiency.  Concentration of data into data centers was smart because the penalty for distribution was latency, which accumulated in those old application models.  With OLTP we could forgive latency more easily, which is a big change.

But OLTP doesn’t justify latency, nor is “having latency-tolerant distributed applications” a goal in itself.  OLTP was justified by the improved quality of information and the improved productivity of the workers involved in commercial activity.  What happened with OLTP was that we created a different application model that interacted with the real commercial world in a different way.  That model prioritized different IT issues, required different trade-offs.

The reason this is relevant for cloud-native is illustrated by a question: What would happen if we applied the OLTP model of applications to batch processing, if we let a new technical model get type-cast into an old application model?  Answer, hundreds of thousands of seconds of extended runtimes.  You should not expect to take the application model used for OLTP and fork-lift it into a cloud-native implementation.  You have to go back and ask the question where’s the benefit, then decide how that benefit could be best achieved.  That’s when you look at cloud-native.

The “bigger question” that I opened with is how you know which applications justify cloud-native treatment.  The application mentioned by my enterprise friend wasn’t a good candidate, period, and that’s probably why developers were struggling to map it to cloud-native.  To determine what is a good cloud-native application, you have to know both the benefit the application will bring and the benefit that cloud-native will bring to the application.  Only a favorable combination of the two can actually justify a cloud-native implementation.

Batch-modeled applications are lousy cloud-native applications because distributing stateless pieces only builds up latency.  OLTP applications may be “good” candidates for cloud-native in a technical sense, but you’d have to know what cloud-native brought to the table to know whether the new model would benefit a given OLTP application, or just “work” for it.  The best way to find out what’s a good technical candidate is to look at what cloud-native does that traditional technical models for application development don’t.

Most people agree that cloud-native technology works best when applied to simple signals from the real world, what are usually called “events”.  A transaction can be dissected into a series of events, each representing a step a user takes in interacting with a commercial application.  The benefit I get from that kind of interaction is that I can then take the processes for each step and make them so simple as to be stateless and fully scalable and distributable.  What is that worth?  That’s a question that needs to be answered in the enterprise trial cloud-native application I mentioned above.
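
To make that concrete, here is a minimal Python sketch of a transaction dissected into a series of events, with hypothetical event types and handlers.  Each handler is stateless (all context travels with the event stream), so any instance, anywhere, can process any step and be scaled or replaced freely.

# Toy event-driven order flow: each handler is stateless, so any instance,
# anywhere, can process any event. Event names and fields are illustrative.

def handle_item_selected(event, context):
    return {**context, "items": context.get("items", []) + [event["sku"]]}

def handle_payment_offered(event, context):
    return {**context, "paid": True, "method": event["method"]}

def handle_order_confirmed(event, context):
    return {**context, "status": "confirmed"}

HANDLERS = {
    "item_selected":   handle_item_selected,
    "payment_offered": handle_payment_offered,
    "order_confirmed": handle_order_confirmed,
}

def process(events):
    context = {}                       # state lives in the event stream/context store,
    for event in events:               # not in the handler processes themselves
        context = HANDLERS[event["type"]](event, context)
    return context

print(process([
    {"type": "item_selected", "sku": "ABC-1"},
    {"type": "payment_offered", "method": "card"},
    {"type": "order_confirmed"},
]))
# {'items': ['ABC-1'], 'paid': True, 'method': 'card', 'status': 'confirmed'}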

Then there’s the cost or risk.  The event nature of the application is likely to generate increased network traffic among components.  There’s latency associated with that traffic, and there’s a risk that the loss of connectivity will create a failure that simple resiliency at the cloud-native component level can’t address (you can’t replace something when there’s no connectivity, and if you did, you’d not know it!)

The enterprise I talked with admits that they didn’t consider any of this when they picked their trial application target, and it’s now clear to them that they picked the wrong target if their goal was to try a cloud-native development.  But in a sense, they picked a great target, because it showed them an important truth, which is that everything doesn’t benefit from cloud-native.

We have batch applications today, including a lot of what we call “analytics”.  These operate on stored data and conform to one class of benefit/risk tradeoffs.  We have OLTP today, working as the front-end to an IT-enabled business and supporting workers, customers, and partners.  That we have both at the same time proves that one model didn’t replace the other.  Instead, one set of missions—the real-time missions—spawned a new technical application model.  That new model then absorbed the missions that justified it.  That’s what will happen with cloud-native.

Does that mean all the attention we (and, yes, I) have given cloud-native isn’t justified?  I don’t think so.  IT evolution isn’t driven by what stays the same, but by what changes.  When new technical options are introduced, they let us address missions differently, and even create new missions.  The new technical options will likely reduce some commitments to past options, but many will stay in place.  Still, the changes (other than the inevitable cost optimization) in our IT infrastructure will come due to the new technical options and their associated missions.  We need to know what the new infrastructure will need, and how we’ll run it.

Then there are those new missions.  We do things with OLTP today, in its manifestation as web commerce, that open new business opportunities and revenues impossible to realize using batch processing.  We will do things with cloud-native that could not be done using OLTP, and those things are what will drive the spending, the technology changes, and ultimately the vendors.

The likely common thread with these new missions isn’t “IoT” or “autonomous vehicles” or any single technology, but rather the symbiosis of a lot of things we’ve atomized in stories.  It’s “the real world”, the environment in which we all live and work.  Batch processing put IT a long way from the worker, and OLTP let us (literally) touch IT.  The next step is for us to live information technology, or at least live with and within it.  In order to do that, real-world knowledge has to be gathered and correlated with behavior, business and personal.

We see this today.  If I search online for a camera system or component, I start seeing ads for that class of product almost instantly when I visit web pages or do related searches.  The Internet ad world has absorbed my interest and responded.  Whether this process represents an invasion of privacy or not, it’s an example of contextualization.  The next step in IT is to introduce context to applications, in order to make those applications more effective and productive.  To the extent that this is successful, the benefits will justify the cost of development and deployment.

All of this raises questions for cloud providers, whether current or aspiring.  If users are not really ready for cloud-native, then offerings that take advantage of it will have to be more fully architected to lower the barriers to adoption.  That also means they’ll be more easily protected from competition, since there are no standards for the “services” that might frame a cloud-native application.  Perhaps this is where cloud providers will get in, and get ahead, in the competition.  If things stay development-focused and developer-limited, it’s the traditional software vendors who are in the driver’s seat for the cloud.

Cloud-Native Ecosystems Morph to Cloud-Native Products

What we need is cloud for the masses.  We’ve failed to realize almost all the heady projections about cloud computing, not because cloud computing couldn’t have met them, but because we couldn’t meet cloud computing on a reasonable footing.  Fortunately, that’s starting to change.  The most profound thing going on in the cloud today isn’t technology advance, it’s populism.

Every advance we’ve had in the cloud, from web services to serverless to containers to Kubernetes to service mesh, has increased the capability of the cloud while also increasing the complexity associated with development and deployment.  That web publications don’t get the technology is a given, but what’s truly interesting is that enterprises don’t get it either.  I noted earlier this week that operators were struggling with what “cloud-native” meant, and enterprises are struggling even more.

I’ve recounted my experience with CIOs and senior development people many times in my blogs.  They aren’t charged with assessing all the leading-edge technical issues and developments, but with keeping the applications their company’s business depends on running, and running cheaply.  When somebody promises a revolution, they might be intellectually curious about it, but their mission isn’t to drive tech revolutions, but to respond to business revolutions.  In technology terms, they look for things they can adopt for which they can make a strong business case.  They look for products.

In one of my recent enterprise exchanges, I was told by a technology planner that “My company doesn’t want to adopt technologies, they want to adopt products.”  That seems pretty obvious on its face; you buy products, install them, and use them, and that’s how IT has worked for generations.  The problem we have starts with the fact that the technologies are what get the ink, and it ends with the fact that the association between business goals, technologies, and products is getting very difficult to navigate.

Kubernetes is a great example of technology-ink-getting strength.  You can’t read anything about application development or operations today without reading about Kubernetes, so it’s pretty obvious that if you’re a vendor who’s promoting a set of development/deployment tools, you’d better be talking Kubernetes somewhere.  Kubernetes, of course, is open-source and so it’s hardly a big differentiator.  Further, users agree that it’s complex, and experts agree that it’s important in no small part because it’s the center of a growing ecosystem.  The “Kubernetes ecosystem” is just starting to emerge as a technology in itself, and yet it’s already morphing into a “cloud-native ecosystem.”  Imagine how enterprise technologists feel!

Why do we have this confusion?  Enterprises say that the “ecosystem” model emerged over the last year because of a split in the necessary feature-function-benefit approach to sales.  Businesses start at the “benefit” side of the chain, vendors at the “feature” end, and they meet (hopefully, eventually) at the function in the middle.  Container technology generated a lot of features that enterprises frankly didn’t understand well.  Eventually, as enterprise questions and concerns created a better vision of how containers might be applied, vendors started pushing a combination of tools that aligned better with specific IT functions.  At the same time, this helped enterprises visualize how functions would benefit them, how an ROI would be generated.

We know now, for example, that containers are a new application development and hosting model.  You write container-optimized applications, and you deploy them on arbitrary resources using an orchestration tool like Kubernetes.  If you can make Kubernetes work across your resource set, you can make containers work.  We also know that as you broaden the choice in resources, adding public cloud to the data center, you necessarily remix your application design process to take advantage of the cloud as well as containers.  The fusion of cloud and containers is the foundation of the notion of an ecosystem-as-an-offering.
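As a rough illustration of “deploy them on arbitrary resources using an orchestration tool like Kubernetes,” here’s a hypothetical example using the official Kubernetes Python client.  The image name, labels, and replica count are invented for the sketch, and in practice most shops would express the same thing as YAML applied with kubectl; the point is that the application asks for replicas and lets the cluster decide where they run.

from kubernetes import client, config

# Hypothetical example: ask Kubernetes to run three replicas of a containerized
# service; the cluster decides which nodes (data center or cloud) actually host them.
config.load_kube_config()  # uses the local kubeconfig context

container = client.V1Container(
    name="web",
    image="example.com/shop/web:1.4",          # illustrative image
    ports=[client.V1ContainerPort(container_port=8080)],
)
pod_template = client.V1PodTemplateSpec(
    metadata=client.V1ObjectMeta(labels={"app": "web"}),
    spec=client.V1PodSpec(containers=[container]),
)
deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="web"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "web"}),
        template=pod_template,
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)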

The ultimate in that fusion is cloud-native.  While containers allow for elastic, agile deployment, scaling, and redeployment, realizing that with basic Kubernetes alone still leaves a lot of custom work to do.  The addition of things like service mesh (Istio) and serverless (Knative) goes a long way toward submerging the complexity of cloud-native into the tool ecosystem.  Kubernetes ecosystems, as “functions”, have presented cloud-native in a productized (or productizable) form.  In that form, enterprises are starting to understand the benefits, and so we have the beginning of the realization of real sales potential here.

Sales, of course, implies vendors to make them.  Does the cloud-native ecosystem then spell the death of open-source populism?  Yeah, it sort of does, not that there ever really was such a thing.  Open source is a technology-development model that eliminates vendor differentiation, but it didn’t (and doesn’t) eliminate vendors.  Users need more than just a pretty picture of the future.  They need a blueprint for achieving it, and for that a casual open community is unlikely to serve.  You need vendors, because you need products.

The two vendors who are leading the productization charge here are IBM/Red Hat and VMware.  Both companies have the strong open-source, cloud-centric, mindset needed to pull together the right pieces.  Of the two, I think VMware has the clearest internal vision, which isn’t surprising given that IBM’s credentials are divided between the personnel and management of two very different firms just recently semi-joined by M&A.  I also think that among the cloud providers, Google has the best understanding of where things are going, and so it forms a kind of third player in the productization game.

The war of words between IBM/Red Hat and VMware, already visible, is clear evidence of the stakes in this new game.  The cloud, and cloud-native, has the potential for adding a trillion dollars a year to overall IT spending.  In the hosting, software, and virtual networking space, it could increase vendor revenues by 50%.  Given all the buying pressure we see in the market, this is big news indeed, and it’s why IBM bought Red Hat and why VMware has been on an M&A tear of its own to build up its cloud-native assets.

But if, as many say, cloud-native is really complicated, something even network operators can’t get their heads around, isn’t a focus on cloud-native ecosystem-building counterproductive?  Aren’t vendors likely to be mired in long sales cycles and education commitments?  No, because we’re reaching that feature-function-benefit meet-in-the-middle point.  Buyers are seeing, in the new ecosystems, that product they’ve longed for, a manifestation of cloud technology that you can touch, that you can install.

IBM and VMware are coming at this from very different places.  IBM has enormous enterprise sales influence, a presence in major accounts that’s the envy of many competitors, but they have fairly fossilized views of applications and data centers, and virtually no historic presence in cloud-native technology.  Their Kabanero announcement this summer was their first credible excursion into ecosystem-building.  VMware has a good customer base, though more in the VM than the container area, but they obviously know what a cloud-native ecosystem looks like and they’re buying up the pieces at a time when their value hasn’t been recognized.

The competition between these two companies, and the likelihood that others, like HPE and Microsoft, will earnestly join in, is good for the cloud, the users, and the industry overall.  We can’t hope for full cloud success if cloud-native technology is an arcane discipline attended by robed acolytes and accompanied by mystic rituals.  This has to be a mass-market technology, something that’s as readily adopted as servers and even PCs, for cloud-native and the cloud to realize their full potential.  The creation of vendor ecosystems is a big step.  Yes, one-stop shops are often seen as invitations to lock-in, but absent some group of people framing the cloud for widespread use, there’s nothing happening at all.

So far, everything that’s critical to cloud-native is open-source, and that prevents true proprietary lock-in from happening.  It doesn’t mean that some vendor couldn’t assemble so compelling a set of tools, backed by such incredible planning insight and pre- and post-sale support, that they run away with the market and create a dynasty.  Is that the price of fully realizing the cloud?  What open-source has always lacked was marketing.

Some will say open-source and cloud-native don’t need marketing, don’t need vendor support or productization.  They’ll cite Linux, but Linux isn’t a good example; it wasn’t marketed in a product sense, and it became a household word because hardware vendors needed an OS for their servers.  What pulls the cloud-native technology future through into the present?  It’s going to take marketing and education.  The growing ecosystem-building product focus of key vendors shows we may be on the verge of getting what the cloud needs.

What Issues Shape Operators’ Tech Plans for 2020?

Technology planning, for operators, is the traditional start of their budget cycle.  While budgets are usually calendar-year, technology planning typically starts in the second half of September and runs through mid-November.  The priorities set in each of these planning seasons set the tone for spending not only in the coming budget cycle, but for several cycles beyond.

I usually do a fall planning analysis (THIS is the analysis of last year’s results), sometimes starting early in the cycle to look at things that operators are already watching.  This is one of those years, because there are a lot of questions and issues on their minds.  What I want to do here is introduce the only really new issue operators mentioned, then explore where operators are on the technologies they focused on during the last planning cycle.

The new issue is SD-WAN.  Operators were not really looking at SD-WAN last year because they saw it as another of those annoying over-the-top, leverage-my-services exploitations.  Most still do, at least in part, but all the Tier Ones and about a third of Tier Two operators are thinking they may have to be a bit more proactive with SD-WAN.

One reason is the recognition that SD-WAN is going to happen whether the operators offer it or not.  In fact, a couple of Tier One service strategists told me they now believed it was a mistake not to have jumped aggressively on the SD-WAN bandwagon as soon as the technology emerged.  “If we’d entered the market ourselves, we might have discouraged competition in that critical early period,” one strategist said sadly.  Instead, the operators let managed service providers and enterprises disintermediate them, then belatedly realized that they’d lose less if they displaced their own MPLS VPNs than if they let others displace them.

A better reason is defined by AT&T’s comments on achieving over three-quarters replacement of MPLS with SDN in the core.  A service, operators are coming to realize, is the demarcation interface and the service-level agreement (SLA), not the internals.  Virtual networking can offer lower capex and opex, but you have to preserve the service interface to the user.  SD-WAN can, in theory, run over any transport technology that provides connectivity among the sites being served.  Operators know that even without SD-WAN from MSPs, enterprise price pressure on VPNs was cutting profits alarmingly, demanding a corresponding reduction in cost.  SD-WAN could provide it.
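To illustrate the point that the service is the demarcation interface and the SLA rather than the internals, here’s a hypothetical Python sketch of the kind of per-path decision an SD-WAN edge might make.  The transport names, SLA numbers, costs, and measurements are all invented for illustration; no vendor’s actual logic is implied.

# Hypothetical SD-WAN path selection: keep the service (the SLA) constant
# while freely choosing among whatever transports happen to be available.

SLA = {"max_latency_ms": 80, "max_loss_pct": 0.5}

def meets_sla(metrics: dict) -> bool:
    return (metrics["latency_ms"] <= SLA["max_latency_ms"]
            and metrics["loss_pct"] <= SLA["max_loss_pct"])

def pick_path(paths: dict) -> str:
    """Prefer the cheapest transport that still meets the SLA."""
    compliant = {name: m for name, m in paths.items() if meets_sla(m)}
    candidates = compliant or paths           # degrade gracefully if none comply
    return min(candidates, key=lambda n: candidates[n]["cost"])

if __name__ == "__main__":
    measured = {   # illustrative measurements, not real data
        "mpls":      {"latency_ms": 35, "loss_pct": 0.1, "cost": 10},
        "broadband": {"latency_ms": 60, "loss_pct": 0.3, "cost": 2},
        "lte":       {"latency_ms": 95, "loss_pct": 1.2, "cost": 5},
    }
    print(pick_path(measured))   # -> "broadband": cheapest path inside the SLA

The service the user sees, the SLA, stays constant; which transport carries the traffic at any moment is an internal matter, which is exactly the property that lets operators put cheaper transport underneath an MPLS-class service.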

The question is how SD-WAN should be positioned.  The conservative operator planners think it’s a pure low-end VPN offering, something that companies would embrace (reluctantly) for sites where MPLS VPNs were too costly or unavailable.  The very aggressive say it’s the foundation of a whole new service vision, something that lets them add security and other features to gain revenue, without adding a lot of cost.  The middle-ground operators say it’s the model for future VPN services, a model that displaces MPLS eventually and lets things like SDN in.  This planning cycle, operators hope to dig out a strategy for the next couple years.

Last year’s big question was 5G, and while there’s still a lot of budget interest in 5G, the technology planning for 5G isn’t showing very much change.  Operators are committed to a 5G RAN, and many are also committed to millimeter-wave, but few have committed to 5G core.  What I’m hearing now is that 5G core is on the back burner for now, at least in the general case, because operators don’t expect much from it prior to 2021.

There is one question that’s emerged relative to 5G Core, or at least related to it.  Many operators recognize that the current structure of mobile metro infrastructure, which is what “core” really means in mobile terms, is based on a fairly primitive view of virtualization.  5G Core, which mandates NFV, is based on a model of virtualization that operators, at the very least, are skeptical about.  They wonder whether there’s a virtual-network model for metro in general that would “subduct” the 5G Core issues.  I hasten to note that this isn’t a universal concern; about a third of my Tier One contacts have it, but few Tier Two or Three operators are thinking about universal metro infrastructure now.

Millimeter wave is also an issue that impacts mostly Tier Ones, and impacts more where video/TV delivery competition is a major issue, as it is in the US.  For decades, everyone accepted the view that TV was the golden goose of wireline, the service that justified the connections and kept the lights on for operators.  Over time, video margins have shrunk, partly because of competition and partly because OTT streaming services that don’t mimic fixed linear-TV schedules have combined with mobility to disrupt viewing habits.  Networks are also seeing revenue pressure as advertisers shift to cheaper and more easily targeted online ads.

The video problems are one reason why companies like AT&T and Comcast have been buying content companies—networks or studios.  As pricing pressure mounts, consolidation in the value chain is inevitable, and smart at least to a point.  However, it’s clear that broadband Internet is becoming the flagship of wireline, and in order to keep profits up, operators need to get costs down.  Millimeter wave promises high bandwidth at a much lower cost than fiber to the home (FTTH), using a hybrid of 5G radio and fiber to the node (FTTN).  Verizon just said that most customers will be able to self-install, which could be a major boon to operators.

The question for many operators in 2019 is whether the 5G/FTTN hybrid will cut costs enough, and the answer will of course depend on the nature of the service area.  Where demand density (a measure of the revenue opportunity available per unit of infrastructure needed to “pass” customers) is high, 5G/FTTN is almost sure to deploy.  As demand density declines, the problem is that even 5G/FTTN’s lower pass cost relative to FTTH may not be enough, particularly if there is already CATV cable deployed.  DSL against CATV is, in the long term, a losing battle.

There’s also a linkage between the virtualization and 5G/FTTN issue and last year’s second priority, which was carrier cloud.  The questions operators had last year were primarily circling around the drivers of carrier cloud.  Operators now admit they were hoping that early drivers would solidify the technical framework for carrier cloud, but they also admit that they didn’t get any early drivers to speak of.  Thus, they don’t know what the technical model for carrier cloud should/will be.

This is where “cloud-native” comes in.  Last year, it wasn’t a term that operators were using much, but this year most Tier Ones realize that it’s not enough to “migrate” to the cloud, you have to develop for it.  That, of course, makes the technical model for carrier cloud critically important.  It’s even more important when you consider that, lacking any single and convincing early driver of carrier cloud, cloud infrastructure and strategy for operators is at risk of being dispersed to death, divided among a lot of implementations that don’t establish economies of scale in opex or capex.

Carrier cloud is important this year too, with cloud-native the leading topic, but even Tier Ones don’t seem to have a handle on things.  One important development the operators themselves see is that some vendors are clearly interested in promoting what might be called a “cloud-native” ecosystem.  There was a time when containers and Kubernetes defined the cloud, as far as operators were concerned, and so also defined “cloud-native”.  Then they realized you had to supplement Kubernetes, what I’ve called the “Kubernetes ecosystem”, to make cloud-native complete.  Now, some are thinking that perhaps we’re looking for a “cloud-native ecosystem” instead of a Kubernetes ecosystem.

The good news is that operators also think that vendors are moving in that direction.  They cited IBM/Red Hat and VMware/Dell as examples of companies that, to them, are positioning a cloud-native framework and are large and credible (big enough to sue and collect) companies.  These vendors were among the list that operators wanted as NFV partners during NFV’s heyday, in fact.  Now they seem like they’re sprouting cloud-native wings.

But are the operators ready to take flight?  Last year, the technology planning cycle seemed to have both a sense of purpose and a sense of urgency.  This year, I’m getting a mixture of frustration, boredom, and cynicism.  More planners are saying in 2019 that they don’t really know what to be planning for in the coming year than at any point in the last fifteen years, back to the dawn of “transformation”.  They know what’s happening, they know they don’t really like it, but they don’t know what they should really be doing…yet.

This is what, in my view, the vendors (and VMware in particular) are trying to address.  Operators have long seen their role as applying the technology made available to them, rather than influencing what that technology was and did.  In something like cloud-native, where there’s no question that the industry in general is struggling, operators are struggling more.  Many, perhaps even most, would love to adopt a model of cloud-native and base their future “carrier-cloud” deployments on it.  If only they had one to adopt!  Vendors would love to have them commit and spend, of course, and I think where VMware is leading the pack is in combining these points.  Give the operators something they can believe in, and they’ll adopt it.

Enterprises are in the same boat, of course, which is why the ecosystemic approach to cloud-native makes so much sense.  There are literally hundreds of thousands of data centers, a trillion incremental revenue dollars, and whole ranges of applications like IoT that depend on a cloud-native approach.  Operators are usually laggards in adoption, but they represent such an enormous piece of the pie and they’re under such stringent profit pressure in their own businesses that they could look very promising indeed.

Cloud-native and carrier cloud, overall, are the big technology planning issues, because they raise the most technical questions.  They also frame the way that things like 5G and IoT are likely to develop, which frames future revenue planning.  For all their importance, though, operators (somewhat ruefully) admit they’ve not made much progress in prior years.  They hope 2019’s cycle will be different, and of course, so do the vendors who hope for relief from budget constraints and pushed-out project plans.

They may not get that relief.  I don’t see many vendors effectively pushing an ecosystem for cloud-native and carrier cloud.  Without that vendor support, I don’t think the technology planners can really frame a model for the cloud in time for financial budgeting in early 2020.  With all the political and economic distractions that year will bring, technical uncertainty certainly won’t be a help, so if vendors want a rosy 2020, they probably need to get their positioning in order.

Taking a Mission-Focused Look at AI

I blogged last week on the reality of edge computing, and I think that it’s time to take a mission-focused look at AI too.  We tend, in all of tech, to focus entirely on a new technology rather than asking just what the technology will accomplish.  As a result, tech promises turn into hype more than realization.  Let’s forget AI tech for a moment and look at what we might do with it, focusing as always on things that could emerge, near-term, and drive AI adoption.

AI, of course, means “artificial intelligence”, and so all AI missions should be places where human intelligence, meaning something like judgment, would be beneficial but cannot be readily applied.  The reason for that could be cost (we couldn’t afford to have a human standing around to make decisions), lack of expertise (a suitable human isn’t likely available), or speed and consistency of action (a human would waffle and make inconsistent decisions).  These three reasons are the ones given by enterprises most often when they’re asked about building an AI business case.

One of the problems we have with “AI-washing”, meaning the claims for application of AI technology where there is no mission (no advantage) and/or no basis (it’s not true), is that it’s riskless.  If somebody says they have a new AI feature, they get coverage.  I’ve not seen any case where a claim was investigated, debunked, and the truth then published.  In fact, just as edge computing has inherited the mantle of the cloud when cloud stories got blasé, AI seems to be inheriting the title of “best general technology.”   Unfortunately, AI isn’t a general technology.

Here’s the thing.  Suppose we have a keypad to give us entry into a gated area.  We enter a code and a pound sign, and if it’s correct we open the gate.  It’s difficult to make an objective case for AI here, given that interpreting a simple code is hardly something that requires human intelligence.  But some will argue that since the keypad and gate replace a human guard in a shack, this must be AI.  Automation isn’t always AI; all automation reduces human effort and, to some degree, displaces human judgment, but AI applications require something more.

Suppose we take a different tack with the keypad-and-gate thing.  We use cameras, analyze the image of the vehicles, the license plates, the faces of the people.  We could construct a solution to our gated-area problem that could actually benefit from AI, but it would be AI that benefits and not the application.  We don’t need all of that to open a gate, and probably couldn’t justify the cost.  This approach doesn’t meet the three tests of AI value I noted above.

Let’s look at some other popular applications.  How about where “AI” is used to scan data looking for correlations?  We used to call this “analytics”, and so we have to distinguish between simple analytics and AI if we’re to dig out an AI mission.  Go back to the rules.  Could a human, with proper expertise, pore through vast assemblies of raw data looking for patterns that suggest something interesting?  Obviously not, because we’d have a problem finding the right humans, and likely a greater one getting them to focus on a vast assembly of raw data for long enough to do any good.  This meets all of our AI value tests.

What we can draw from these two examples is that AI is more likely needed to hash through a lot of stuff to dig out value than to support a very simple task.  AI is then a logical companion to something like IoT or personalization and contextualization.  The reason is that tasks that try to support and optimize human behavior are almost certain to justify artificial intelligence, because it’s human intelligence they’re supporting.  The limiting factor is the extent to which support is really needed, as our keypad example shows.

Might there then be no such thing as a “little AI” or applications of AI on a very small scale?  I think that’s likely true.  I also think it’s likely that a task that’s performed regularly by thousands of people based on the application of a rule set isn’t an application crying out for “machine learning” at the site of each person.  Why not have a system learn the rules instead, in one place over a reasonable period, and then build a rule-driven or policy system to enforce what we learned?  In other words, AI and ML might be justified in putting a system together, but not needed in the operation of the system.

This is probably the biggest issue in assessing the value of machine learning.  If the handling of a few variables always results in the same decision, then you don’t need machine learning where the action is happening.  But even for other forms of AI, like neural networks, what’s still true is that AI is valuable where set policies aren’t effective.  In most cases, that’s where there are a lot of unforeseen conditions.  If one and one always make two, you don’t need AI to figure it out.
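As a minimal sketch of the “learn the rules in one place, then enforce them with a simple policy” split, here’s a hypothetical Python example.  The overheating data, the safety margin, and the shutdown rule are invented for illustration; the central step could just as well be a full machine-learning pipeline, but what ships to the point of activity is only the resulting rule.

# Hypothetical: learn a simple rule centrally, then enforce it with plain logic.

# --- Central "learning" step (run once, offline) -------------------------
history = [  # invented (temperature_C, did_overheat) observations
    (62, False), (71, False), (78, False), (84, True), (88, True), (91, True),
]
# Pick the lowest temperature at which overheating was ever observed,
# and back off a small safety margin.
threshold_c = min(t for t, overheated in history if overheated) - 2

# --- Point-of-activity enforcement (no ML, no history, just the rule) -----
def should_shut_down(current_temp_c: float) -> bool:
    return current_temp_c >= threshold_c

if __name__ == "__main__":
    print(threshold_c)              # 82
    print(should_shut_down(79.0))   # False
    print(should_shut_down(85.0))   # True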

What justifies point-of-activity AI, then, is a combination of things.  First and foremost, variability.  Human judgment is able to extrapolate based on many variables.  If there aren’t many variables, if there isn’t any major need for extrapolation, then simple policy-based handling of conditions is fine (AI might still be useful in setting up those policies).  If conditions are highly variable, especially if they’re diverse, then a human could make a better decision than a set of policies, and an artificial intelligence could likely do better too.

At this point, we can see that AI missions can be divided into two groups—those that apply AI to the point of activity, and those that use AI to set policies that will later be enforced at the point of activity.  Both of these are valid missions, and so admitting that some AI isn’t going to live in your phone or in your car doesn’t mean AI has somehow failed (except perhaps that it might have failed the hype contest!)

We can also see another division of missions, into missions that are analyzing information to look for patterns or abnormalities, and missions that are acting or proposing actions to conditions or changes.  Microsoft’s IntelliCode is an example of the first of these two missions.  It’s a code review product that’s designed to pick out badly structured program statements and segments, and it’s based on a massive analysis of public repositories of code that set a practice baseline.  Things that deviate from patterns that are clear in the baseline are flagged.  Most of the AI attention, of course, goes to the latter group, which tries to identify appropriate actions to take.  IntelliCode does a bit of this too.

And there’s yet another division—between missions that involve learning and those that involve inference.  A learning mission means that the application will track both conditions and actions, and will “learn” what conditions trigger a consistent action sequence.  That sequence can then be initiated with little or no oversight.  An inference mission is one where AI works to emulate how human assessments and decisions are made, and can then act (or recommend actions) based on conditions.
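Here’s a toy Python sketch of the “learning mission” side of that distinction.  The conditions, actions, and the three-consistent-observations rule are all hypothetical; the point is simply that the system watches condition-and-action pairs and only automates an action once the mapping has proven consistent.

from collections import defaultdict

# Hypothetical learning mission: watch which action follows each condition,
# and automate the action once the mapping has proven consistent.

CONSISTENCY_THRESHOLD = 3   # how many agreeing observations before we act alone

observations = defaultdict(set)
counts = defaultdict(int)

def observe(condition: str, operator_action: str) -> None:
    """Record what a human operator did when a condition occurred."""
    observations[condition].add(operator_action)
    counts[condition] += 1

def automated_action(condition: str):
    """Return the learned action if it's consistent and well-attested, else None."""
    actions = observations[condition]
    if len(actions) == 1 and counts[condition] >= CONSISTENCY_THRESHOLD:
        return next(iter(actions))
    return None   # still learning, or the humans disagree: keep a human in the loop

if __name__ == "__main__":
    for _ in range(3):
        observe("link_congested", "reroute_traffic")
    print(automated_action("link_congested"))   # "reroute_traffic"
    print(automated_action("fan_failure"))      # None: never observed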

Inference is the aspect of AI that might have the greatest value, and also the one that likely generates most of the “my-robots-are-out-of-control” risks (or myths).  Actual human intelligence is a mixture of learning and inference, but I think that it’s inference that dominates.  People extrapolate, taking rules that were defined to deal with a certain situation and applying them to other situations that “seem” related.  Inference is what separates policies from intelligence.  We could argue that for humans, learning is a shorthand way of improving inference by imparting both the rules for specific things and the relationships among things.

Would AI necessarily lead to something that might mimic humans?  We already have software that can engage in a dialog with someone on a specific topic, or play a game like chess.  That shows the problem of emulating humans may well be more one of scale than of functionality.  Could learning-plus-inference give us something scalable?  Could cloud-elastic resources combined with inference actually create something that, when examined from the outside, looked human?  Not today, but likely in the not-too-distant future.  Should we fear that?  Every tech advance justifies fear of some aspect of its use or misuse.  We just have to get ahead of the development and think the constraints through.