Why is Microsoft Taking its Own Service Mesh Path?

Microsoft has decided to go its own way on service mesh, or at least to try.  Given that Istio, a Google development but still open-source, is the de facto standard in service mesh, why would Microsoft make that decision?  Would a service mesh battle help the cloud-native space?  Is there a carrier cloud dimension to all of this?  Interesting things to speculate about, so let’s speculate.

One thing that doesn’t seem to demand much speculating is that the announcement timing is related to Google’s decision not to contribute Istio to the Cloud Native Computing Foundation (CNCF).  The fact that Microsoft will contribute its Open Service Mesh to the CNCF is the most obvious point made in the announcement.  Its support of the CNCF-member-supported Service Mesh Interface (SMI) API is the second-most-obvious point.  The third?  It’s a “lightweight” implementation of service mesh.

All of the service mesh technologies in general use are based on something like the Envoy proxy, which is a software sidecar, an agent that links to a microservice and provides the microservice with basic connectivity features.  Microsoft proposed last year to make SMI a universal API, an open way of communicating with services (and proxies), thus enhancing service portability among mesh implementations.  It seems pretty clear that Microsoft hopes that the combination of a lightweight service mesh and an open API will get Open Service Mesh a lot of traction.
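To make the SMI point concrete, here’s a minimal sketch of an SMI-style traffic-split declaration, expressed as a Python dictionary rather than Kubernetes YAML.  The resource kind and field layout follow the general shape of the published SMI spec, but the API version string, service names, and weights are illustrative assumptions, not taken from Microsoft’s announcement.

```python
# Illustrative sketch of an SMI-style TrafficSplit resource, written as a
# Python dict for readability.  Field names follow the SMI spec's general
# shape; the apiVersion string and service names are assumptions.
traffic_split = {
    "apiVersion": "split.smi-spec.io/v1alpha2",   # assumed version string
    "kind": "TrafficSplit",
    "metadata": {"name": "checkout-rollout"},
    "spec": {
        "service": "checkout",                    # root service clients call
        "backends": [
            {"service": "checkout-v1", "weight": 90},
            {"service": "checkout-v2", "weight": 10},
        ],
    },
}

def backend_weights(split: dict) -> dict:
    """Return the declared traffic weights, keyed by backend service."""
    return {b["service"]: b["weight"] for b in split["spec"]["backends"]}

if __name__ == "__main__":
    print(backend_weights(traffic_split))  # {'checkout-v1': 90, 'checkout-v2': 10}
```

The point of an interface like this is that any compliant mesh, whichever proxy it uses underneath, could honor the same declaration.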

That raises our first truly interesting question, which is “Why?”  What could it be about service mesh that matters so much?  Microsoft adopted Kubernetes, another Google project, after all.  The answer here, I believe, is that service mesh is the foundation of cloud-native, as much as Kubernetes is the foundation of container deployments.  Kubernetes is heading toward being a de facto standard, and if the market is heading to cloud-native, you can’t differentiate your cloud-native approach by adopting the same tools as everyone else.

Despite the way the market has talked about the cloud and cloud-native, it’s always been really a development issue.  The cloud has features that the typical data center doesn’t have.  If those features are to be fully exploited, applications have to be written to take full advantage of them, which is what “cloud-native” means.  The application was written to run in the cloud, not moved there from somewhere it was already running.

The central adaptation of components to cloud-native status is making them into microservices, which are stateless (in some way) features that can be spun up and run, taken down, moved, and so forth, as conditions demand.  Having a bunch of application features floating out there in the ether isn’t exactly how most application developers think, and the tools to make this work start with an abstraction of connectivity, or virtual connection fabric.  That’s what service meshes provide.

The reason this is important is that the cloud’s role in computing underwent a major transformation just last year, when enterprises recognized that a true “hybrid cloud” was a combination of their current data center and a cloud front-end component set that provided the user interface to browsers and mobile devices.  This insight showed that you don’t “move something to the cloud”, but rather create something new in the cloud that mates with your current applications.  That thinking is a stepping-stone to the realization that cloud-native applications would have to be truly different.  What makes them different, as enterprises will eventually realize, starts with service meshes.

In July, I blogged about why Google might want to retain development direction control over Istio and other projects that it elected not to submit to the CNCF.  My suggestion was that Google sees an architecture for cloud-native emerging, sees Istio as a key piece of that architecture, and wants to make sure Istio develops according to Google’s cloud-native vision.  Their decision on governance control wouldn’t prevent others from exploiting Istio, but it would prevent having Istio go off-track, which implies Google has a track for it.  Which implies it’s strategically important to Google, which implies it would be to Microsoft too.

The next interesting insight we can derive here relates to the “lightweight” property of Microsoft’s Open Service Mesh.  Why would a lightweight version of service mesh be needed when there are at least three well-known full-feature ones (Istio, Linkerd, and Consul)?  The answer is that all the current service meshes were designed with the full mission of service mesh in mind.  Enterprises, having just figured out that there really was such a thing as “cloud-native development”, are hardly pushing the technical boundaries of service mesh thinking with their current projects.  You don’t buy a backhoe when you need to plant a bush.

But you may need a backhoe eventually, if you’re going to dig enough holes.  I think Open Service Mesh is a basic framework, not a complete competitor to things like Istio but a framework on which further feature development can be built.  The CNCF, then, would be not only a way of rubbing Google’s nose in the you-know-what, but a way of encouraging open development of advanced service mesh features, based on the expanding experience and needs of enterprise cloud prospects and customers.

Whether this is going to work will obviously depend on two things.  The first is the pace at which enterprises gain insight into the totality of cloud-native development techniques and their value propositions.  The second is the pace at which open contributions to extend Open Service Mesh are offered and accepted.  If the first horse in our race runs way ahead, then Open Service Mesh quickly becomes a shovel in a backhoe race.  If the second outpaces the first, then Open Service Mesh can hope to address user needs as they develop, never confronting enterprises or enterprise software developers with more complexity than current applications demand.

If we rethink our horserace in terms of Google versus Microsoft, we could frame the race in terms of the pace of education versus the pace of development.  Google needs to accelerate the full cloud-native understanding of the development community.  They also need to ensure their own cloud can support the full range of Istio features, and do so with reasonable ease of use and cost.  Microsoft needs to prime the pump on Open Service Mesh enhancement, seeding any reasonable project with resources and contributing insight into how those features should guide the expansion of service mesh overall.

Where, then, does “carrier cloud” fit into all of this?  Recent surveys have suggested that there are no new standards areas that operators feel are critical to them; they reject most by a 2:1 margin or better.  Surveys also suggest that operators believe that OSS/BSS modernization is hopelessly stalled (I found the same attitudes back in 2013).  Operator initiatives in software design (NFV, ONAP) don’t suggest a lot of experience with or understanding of cloud-native development.  Thus, I think, it would be safe to say that the operators are in no danger of pushing the boundaries of a basic service mesh.

And yet…5G function hosting is clearly a mission that all the public cloud providers think they could make money on.  What could we say is the cause of carrier cloud outsource interest, if not lack of understanding of carrier cloud by the carriers themselves?  We therefore could have the classic situation of a demand driver whose “demander” can’t muster much sophistication in meeting the demand themselves.  That could favor an Open Service Mesh model; simple things for simple folk.

Then there’s the mission of 5G Core hosting itself.  What telco standards group or open-source activity has framed out a carrier-cloud-ready architecture?  There’s no reason to think that the 3GPP did, or could do, that.  We could use a service mesh in the 5G control plane for sure.  We could (IMHO) adapt it to serve as an agile data plane too.  The technical requirements of service mesh in 5G control-plane applications are fairly basic, though (again, IMHO) still beyond the very basic Open Service Mesh feature set.  The adaptation of Open Service Mesh to the control-plane mission would be simple to accomplish, and adding data plane not much harder.

That’s if you know what you’re doing, of course.  Given that the 3GPP stuff isn’t really framed for cloud-native, some architecting of the structure, preserving the standard interfaces, will be necessary.  Microsoft may well be in the best position to do all this adapting and adding.  They bought Metaswitch, who has more experience with open, software-based, IMS/EPC/5G implementation than pretty much anyone out there.

The process could give Microsoft an edge in 5G, but also an edge in the battle for the hearts and minds of evolving service-mesh users.  5G signaling is a great event-driven application, a good test for a service mesh and microservice implementation.  As Microsoft/Metaswitch goes through it, Microsoft could learn a lot about what’s really needed in Open Service Mesh enhancement, and make sure their own cloud service is optimized to provide it.

The pressure, then, is likely on Google first, and Amazon and IBM as well.  Google needs to make 5G an Istio application, with a strong framework and whatever specialization is required.  They have perhaps the best microservice and service mesh people in the industry, but I’m not sure where they are with respect to 5G details.  They’ll need those skills if they want to ride 5G and carrier cloud to a broader service mesh (and public cloud) victory.

Or they could try to jump ahead, of course.  There’s still the possibility of educating the enterprise buyer, but there are a lot of enterprises to be educated, and only a dozen or so big network operators who collectively account for the majority of the 100,000 potential new carrier cloud data centers.  It’s the opportunity to get operators to outsource those data centers, and then to build on that base, that seems the most direct path to success here.  It will be interesting to see who takes it.

What Can We Really Expect from AI/ML?

How useful would artificial intelligence and machine learning (AI/ML) really be in network lifecycle automation?  The topic gets a lot of attention, and a lot of vendors have made claims about it, but the real benefits are actually difficult to assess.  Part of the reason is that there are many ways AI/ML could be applied, and part is that we tend to intuitively think of AI as “Hal” in “2001: A Space Odyssey”.

If we cut through the jargon, the goal of AI/ML in network operations is to provide some form of root cause analysis, some kind of response optimization, or both.  Lifecycle automation means detecting conditions, performing analysis to establish a most-likely cause, and then triggering remediation based on that cause.  We need to break these three points down to assess what might be practical in the application of AI/ML to networks.

Detecting conditions is a function of monitoring, which in the real world is all about receiving events that signal conditions or condition changes.  However, events don’t necessarily signal a need for action.  Some events may signal that previous actions are working, in fact, and others may signal the initiation of self-remediating processes down in the network.  The key point for monitoring is that in order to monitor effectively, you have to know what generated the event and how to read it.  There are plenty of monitoring tools available, and we have long experience in monitoring, so there’s not much we need to worry about here as far as gathering information is concerned.

But is gathering events the same as “detecting conditions?”  The first complicating factor in AI/ML is that detection of conditions and root-cause analysis are fuzzy areas, and you could rightfully assign some tasks to either of those overall steps.  We therefore have to identify those fuzzily-positioned tasks and see where that takes us.

The first task on our list is that of event association.  A network is a complex system of interconnected functionality.  An event is a report of conditions in a particular place, because probes that report conditions have to be placed somewhere and read conditions where they’re placed.  But what is that place, in the overall context of the network?  What, for example, is the association between information gathered at Point A and Point B?  Unless you know the network-topological relationship of those points, you can’t properly associate the results.  Suppose the points are the two ends of a trunk.  A packet count discrepancy would indicate data loss, but either count in isolation might tell you nothing.  You need to be able to define associations, rooted in topology, that then frame a kind of super-event out of the relationship between sets of events.
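To make the trunk example concrete, here’s a minimal sketch of topology-rooted association.  The topology table, loss threshold, and event names are hypothetical, not drawn from any monitoring product.

```python
from dataclasses import dataclass

@dataclass
class CounterEvent:
    """A raw event: a packet count read at one probe point."""
    point: str          # e.g. "A" or "B"
    packets: int

# Hypothetical topology: Point A and Point B are the two ends of trunk T1.
TRUNK_ENDPOINTS = {"T1": ("A", "B")}

def associate(trunk, events, loss_threshold=0.01):
    """Combine the per-point counters into a 'super-event' for the trunk.

    Neither counter means much alone; only the topological association
    (these two points bound the same trunk) lets us infer loss.
    """
    a_point, b_point = TRUNK_ENDPOINTS[trunk]
    sent, received = events[a_point].packets, events[b_point].packets
    loss = (sent - received) / sent if sent else 0.0
    if loss > loss_threshold:
        return {"trunk": trunk, "condition": "packet-loss", "loss_ratio": loss}
    return {"trunk": trunk, "condition": "normal", "loss_ratio": loss}

if __name__ == "__main__":
    readings = {"A": CounterEvent("A", 100_000), "B": CounterEvent("B", 97_500)}
    print(associate("T1", readings))   # trunk-level super-event: 2.5% loss
```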

Then there’s actionability.  Not all events indicate a need for action to be taken.  Some simply report conditions, and we may want to log them and even analyze them in a later step as a part of a machine learning process, but we don’t want to kick off an operations process based on the event.  However, actionability is sometimes related to expected versus experienced conditions, where an event is actionable not because of what it is atomically, but because of what it is relative to what was expected.  That gets us into context.

The third task on our list is state/event relationships.  Many (perhaps most) events need context to be analyzed properly.  If a trunk reports zero traffic, is that an error?  It depends on whether there’s supposed to be traffic, and there might be no such expectation if we just took the trunk out of service or were diddling with the devices on either end.  We don’t want to trigger remediation based on an event that’s been created by our own reaction to a prior event or set of events.
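A minimal sketch of a state/event table that captures this kind of context follows; the states, events, and actions are all illustrative assumptions.

```python
# A toy state/event table: the same event maps to different actions
# depending on the operational state of the trunk.  States, events, and
# actions here are illustrative, not drawn from any standard.
def log_only(trunk, event):
    print(f"{trunk}: '{event}' logged, no action (expected in this state)")

def open_remediation(trunk, event):
    print(f"{trunk}: '{event}' is actionable, opening remediation")

STATE_EVENT_TABLE = {
    ("in-service",     "zero-traffic"): open_remediation,  # unexpected: act
    ("out-of-service", "zero-traffic"): log_only,          # expected: ignore
    ("in-service",     "traffic-restored"): log_only,      # prior action working
}

def handle(trunk, state, event):
    """Dispatch an event through the table; unknown pairs are just logged."""
    action = STATE_EVENT_TABLE.get((state, event), log_only)
    action(trunk, event)

if __name__ == "__main__":
    handle("T1", "out-of-service", "zero-traffic")   # we took it down ourselves
    handle("T1", "in-service", "zero-traffic")       # this one needs remediation
```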

Our final task is event correlation and root cause analysis.  This, in most cases, is going to involve the integration of all the prior task results, conducted with reference to the topology of the service or network in question.  If an automated (or human-driven) operations process responds to events discretely, it’s likely to step all over itself by treating symptoms rather than the problem.  A series of events that are related and contextualized should be examined to determine whether they stem from a specific condition.  If they do, and if that deduction can be validated, then the right action is one to address that common, specific, condition.

As I noted at the start of this blog, there are a lot of ways that we could think about applying AI/ML to network operations, but the ways can be broken into two groups.  The first group is the application of AI/ML directly to events, with contextualization and causal analysis done by successively analyzing the event stream through the tasks noted above.  The second is the creation of a contextual map that itself guides the entire event collection and analysis process.

It’s probably obvious to most of you that the first approach is going to have scalability challenges.  Everyone who ever spent any time in a network operations center (NOC) has seen the result of an event storm, where a single problem generates thousands of alerts that quickly swamp both the operator and the entire process of gathering and displaying events.  The bigger the network, the bigger the storm can be.  I don’t think this is the right way to go, AI/ML or not.

But what is a “contextual map” and how could we get one?  A contextual map is a model that divides a network, or a set of service resources, into functional units, each of which can be depicted as a “black box” with external interfaces and behaviors.  When an event is generated in this situation, instead of going to some master event-handler, it goes instead to a contextual handler defined within that black box.  The underlying principle is that a network/service’s state is the combined state of the unitary functions that make it up.  We deal with the functions individually, and collectivize the results.

Contextual maps are a dazzling insight, and they were postulated first by John Reilly in the TMF NGOSS Contract work.  Events are steered to processes via the state/event table of an element in the service contract data model.  The TMF service data model, the “SID,” wasn’t ideally structured for the mission, something John and I chatted about a number of times.  TOSCA, the OASIS standard “Topology and Orchestration Specification for Cloud Applications”, appears to me to have the capability to define contextual maps that could then be used to guide the way that events are analyzed.  I’m working through my own TOSCA guide (as time permits!) to validate the view and explain how it would work.

But what does this have to do with AI/ML?  I believe that contextual mapping is an essential step in applying AI/ML to network operations, because it reduces the scope of a given analysis and permits hierarchical assessment of conditions.  If Box 1 has an event, AI/ML can analyze within that scope, and if the analysis indicates the scope of the problem has to expand, then Box 1’s determination becomes an event to the next box up in the hierarchy.  By this means, I can handle a wide range of events with independent AI/ML instances, because each is handling the interior of a black box.  Wider scope just means kicking off an analysis at a higher-level box.
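Here’s a minimal sketch of that escalation pattern.  The box names are placeholders, and the per-box “analysis” is a stand-in for whatever AI/ML would actually do inside each black box.

```python
class BlackBox:
    """A contextual-map element: handles its own events, escalates what it can't."""
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent

    def analyze(self, event):
        """Placeholder for the per-box AI/ML analysis.

        Returns True if the condition can be resolved within this box's scope.
        Here we just pretend a box can only fix events tagged with its name.
        """
        return event.get("scope") == self.name

    def handle(self, event):
        if self.analyze(event):
            print(f"{self.name}: handled {event['condition']} locally")
        elif self.parent:
            # The local determination becomes an event to the next box up.
            print(f"{self.name}: escalating {event['condition']} to {self.parent.name}")
            self.parent.handle({**event, "raised_by": self.name})
        else:
            print(f"{self.name}: no parent, flagging for human review")

if __name__ == "__main__":
    service = BlackBox("service-layer")
    box1 = BlackBox("box-1", parent=service)
    box1.handle({"condition": "packet-loss", "scope": "box-1"})        # stays local
    box1.handle({"condition": "route-flap", "scope": "service-layer"}) # escalates
```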

What, though, could AI/ML do within a given black box?  If we have state/event tables, the boxes already contain (or, more accurately, their data model contains) the necessary indication-to-action representations.  The obvious answer is that AI/ML could replace the state/event tables, which would mean that having people sit down and figure out those event-to-process associations, and the states related to handling them, would no longer be required.  Machine learning, AI, or both combined could be used to create the event-to-process mapping needed, which could generate a lot more agile and effective event-handling while preserving the value of the service data model as a “contextual map” for the network or service.
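Purely as an illustration of the idea, and not a claim about how a production system would do it, “learning” the table instead of authoring it might look something like this, with ML reduced here to simple frequency counting over observed outcomes.

```python
from collections import Counter, defaultdict

class LearnedStateEventTable:
    """Learns a (state, event) -> process mapping from observed outcomes.

    This is a stand-in for ML: it just remembers which process operators
    (or prior automation) most often used for each (state, event) pair.
    """
    def __init__(self):
        self.history = defaultdict(Counter)

    def observe(self, state, event, process):
        self.history[(state, event)][process] += 1

    def recommend(self, state, event):
        seen = self.history.get((state, event))
        return seen.most_common(1)[0][0] if seen else None

if __name__ == "__main__":
    table = LearnedStateEventTable()
    table.observe("in-service", "zero-traffic", "restart-interface")
    table.observe("in-service", "zero-traffic", "restart-interface")
    table.observe("in-service", "zero-traffic", "reroute")
    print(table.recommend("in-service", "zero-traffic"))   # restart-interface
    print(table.recommend("in-service", "unknown-event"))  # None: escalate or ask a human
```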

This still leaves us with a potentially open question, which is the last step of our process—identifying the appropriate/optimum action to take.  People have talked about how ML might “learn” the best way to address a given fault (properly contextualized, of course!), but the problem with ML is that it has to learn from experience, and while it’s learning it’s either got to be disconnected from the action-generating processes, or it has to be expected to fail some number of times.  It might take a long time for ML to get the experience it needs.

The good news here is that contextual mapping will reduce the learning period by containing the number of condition/action correlations that have to be learned.  The notion of dividing complex tasks into a series of simpler ones is a common human response to being overwhelmed by the scope of a problem, so it’s a good way to manage AI/ML too.  But even here, we may need some additional help.

I’ve noted in a couple of past blogs that we’re overlooking a powerful ally in network lifecycle automation—simulation.  If a set of conditions, created by the correlation of events within a context, can generate a recommended action or set of actions, simulations of the inside of that contextual black box could be run to establish the likely result of each action identified.  The result, particularly if it’s generated with a specific confidence level, could then either automatically trigger the optimum action, or present it for human assessment.

One obvious application of simulation would be creating baseline “states” that ML could be taught.  This is good, this is impaired, and this is pretty bad.  It might take a long time for all these conditions to be visible in a live network, available for ML to examine and learn from.  Giving it a head start with simulation could speed the process considerably.
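A minimal sketch of the idea follows, with an invented toy model of a trunk standing in for a real simulator; the formulas and thresholds are purely illustrative.

```python
import random

def simulate_trunk(load):
    """Toy simulation: invent loss and latency for a given offered load.

    The formulas are placeholders; a real simulator would model queuing,
    failures, and protocol behavior for the black box being studied.
    """
    loss = max(0.0, (load - 0.8) * 0.2) + random.uniform(0.0, 0.002)
    latency_ms = 5 + 40 * load ** 3 + random.uniform(0.0, 1.0)
    return {"load": load, "loss": loss, "latency_ms": latency_ms}

def label(sample):
    """Assign the baseline state an ML model would later be taught."""
    if sample["loss"] > 0.02 or sample["latency_ms"] > 40:
        return "bad"
    if sample["loss"] > 0.005 or sample["latency_ms"] > 20:
        return "impaired"
    return "good"

if __name__ == "__main__":
    training_set = []
    for _ in range(1000):
        sample = simulate_trunk(load=random.uniform(0.1, 1.0))
        training_set.append((sample, label(sample)))
    # These labeled samples give the ML stage a head start, instead of
    # waiting for every condition to occur in the live network.
    print(training_set[0])
```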

Simulation could also model the way a given action might impact the network/service.  It could also, in theory, model the progression of faults.  Since simulation requires a model on which to base its recommendations, it could be said to help enforce the notion of contextual mapping.  That alone could be valuable, and it would make sense to leverage a simulation map or model to provide additional insight.  Think of it as a kind of machine learning, with a similar application to operations processes overall.

The conclusion here is that while AI/ML could be useful, the way they’re used has to be firmly anchored in the topology or context of the network/service, or the application of AI/ML is likely to create scalability and relevance issues.  You can’t accomplish anything by gluing the term onto a product.  You have to integrate it with the network and service, and to do that, you have to be able to represent the structure to define the constraints of your AI/ML.

I think AI and ML could be very beneficial in service automation and other operations tasks.  However, we still get stuck in “Hal” mode, expecting real human (or superhuman) intelligence.  We’re not there yet, but there are still things that we could do to enhance the way AI/ML is applied to operations tasks.  The same, of course, would be true for other tasks.  It would be helpful to focus on these things instead of just AI/ML-washing everything, don’t you think?

What Cloud Provider Revenue Growth for the Quarter Might Mean

Amazon’s cloud computing growth in the latest quarter dropped to 29%.  Microsoft’s cloud revenue growth also slowed, but it still grew 47%.  Google’s cloud revenue growth increased, posting a 43% gain.  Are we seeing some important shifts in the public cloud space?  I think so, but most of them have their roots in a broader shift in the cloud market that’s been happening for at least three quarters.

It’s interesting that, with everyone saying that the cloud is a bright spot in the lockdown-and-virus recession, the top cloud providers saw less-than-usual revenue growth.  I attribute this in part to the fact that, as we’ll see below, corporate cloud buyers account for only a portion of cloud revenues.  Many of the “online” companies consuming cloud services are ad-driven, and that group is clearly impacted by a decline in advertising resulting from the lockdown-linked retail slump.  But this isn’t the whole story.

None of the cloud providers offer a good breakdown of how their cloud revenues are derived, but most people realize that public cloud revenues are historically the combination of web-company revenues (OTTs, often social media sites or related services) and enterprise cloud spending.  In the early days of cloud computing, the first of these sources outstripped the second by margins of better than 3:1.  Amazon has always gotten a greater piece of the OTT-cloud business, and so it jumped out to an early lead.  Amazon’s greater exposure to this space is also likely a contributor to its slowing revenue growth in the quarter.

Cloud computing for enterprises, during this early period, was hampered by the lack of a realistic and practical model of cloud adoption.  The early expectation, the one that got all the media attention, was that enterprises would “move applications to the cloud”, meaning shift entire applications out of the data center to the cloud.  That was a practical response for some applications, but not the mission-critical ones that make up most of enterprise IT.  As a result, public cloud adoption by enterprises was slower than the cloud growth numbers (which included the OTT piece) showed.

In 2019, enterprises started to understand the ways in which public cloud computing could augment their data center apps rather than replace them.  The “hybrid cloud” talk that emerged was an imperfect description of what was starting to happen, which was that enterprises were adding cloud front-end pieces to legacy applications to enhance their accessibility via browsers or mobile apps.  The back end of these applications, the data center piece, wasn’t really cloud at all; it was the same stuff that had run in the data center all along.

The front/back application model resulted in significant growth in cloud adoption by enterprises in 2019, and that resulted in Microsoft, the public cloud provider with the best credentials for enterprise cloud applications, gaining market share on Amazon.  It also gave other providers of public cloud service, notably IBM and Google, a bit of a road map to how to achieve better public cloud sales.

In 2020, obviously, COVID came along, and the uncertainty associated with the virus and lockdown discouraged companies from making new capital investments in IT.  Expensing public cloud services made more sense.  In addition, COVID and the lockdown created a massive push for work-from-home (WFH), which in turn created a need for secure and productive interfaces to core applications that could be extended to remote workers.  This is the dominant driver in the cloud market wars today.

The growth rate decline for both Amazon and Microsoft is a reflection of the fact that the front/back model of cloud applications came as a surprise to everyone, including users.  While most enterprises now understand “hybrid cloud” means front/back application separation between cloud and data center, most have only begun to deploy based on this new realization.  I think that the next two quarters will show cloud revenue gains based on the exploitation of this model.

The front/back model of applications that’s now dominating is different from “hybrid cloud” because the latter term implies public and private clouds, and as I’ve noted, the current model is almost always implemented using a cloud front-end and legacy data center deployment in the back.  Since Amazon, in particular, initially viewed hybrid cloud more literally, their approach presumed that users would structure on-premises hosting to mimic AWS cloud services, which wasn’t the model that was succeeding.  Amazon is currently working to adapt their approach (more on that below), and the other cloud providers are doing likewise.  The way they approach the future, the pace at which their approach can address opportunity, and the extent to which the future opportunities align with each provider’s approach, will decide who gains share and who loses.

A second market kicker that’s emerging is the public cloud provider recognition that “carrier cloud” could represent a truly enormous public cloud opportunity.  My modeling said about eight years ago that carrier cloud could, by 2030 and if all the drivers were realized, generate 100,000 incremental data centers, most at the edge.  It would then be the largest single driver of cloud services.  When, in 2020, operators started to show signs they might outsource some of their own carrier cloud applications, all the public cloud providers saw gold in them thar hills, as they say.

There isn’t an explicit need for cloud provider strategies for enterprises and for carrier cloud to line up; the positioning of the two would likely be very different, for example.  Some technological harmony might be helpful if the cloud provider intended to launch a sweeping strategy for the future cloud.  Let’s look at the providers and see what’s what.

Amazon has no specific account presence at the data center level, and rather than trying to develop one, it seems to want to establish a series of partnerships with key providers of data center technology.  VMware is an example; vSphere is a dominant data center platform, and Amazon’s been working with VMware to provide a means of linking vSphere and AWS efficiently.

One emerging force in this effort is container and Kubernetes technology.  Enterprises realize that containers are a good way to deploy both data center and cloud applications, and while harmony in deployment isn’t essential in creating a front/back cloud application model, it would offer a way to provide users with the opportunity to burst application components across the cloud/data-center boundary, which could facilitate further cloud migration down the line.

Microsoft is in a better basic position for the enterprise cloud than Amazon, having better data center account engagement and a better front/back integration capability almost from the first.  However, they’ve also been enhancing their container approach, focusing strongly on “federation” of cloud container and Kubernetes deployment with the data-center counterpart.

In Microsoft’s carrier cloud story, there’s a strong element of edge computing.  Edge computing could also be used by Microsoft to create a kind of bridge between cloud and data center, particularly where development of applications focuses more on cloud-native technology, making it possible for the cloud architecture to leak over the boundary to the data center.

Google seems to have the most “futuristic” approach.  They’re building a cloud-native platform and even creating development tools needed to optimize for that vision.  Their theory, which is valid, is that it’s relatively easy to link cloud and data center simply to have a workflow cross over.  The question is how both sides of the boundary might optimize for the new cloud/data-center relationship.  If applications can be built and deployed for a universal application PaaS-like model spanning all hosting options, then future applications can exploit the cloud without change, so if enterprises lose their compliance and security fears, even mission-critical elements might qualify for cloud hosting.

Architectures can cut both ways, as I’ve noted in past blogs.  Container hosting is a credible requirement for our universal application architecture, and that could take the form of managed Kubernetes federated across into the data center, or in the form of data center Kubernetes driving a cloud implementation.  In the latter case, there’s little opportunity for cloud providers to differentiate themselves, which may be why the providers are looking at making some fast moves.  Wait too long and a vendor like VMware or IBM/Red Hat might just come out with a cloud-independent vision.  That could level the cloud playing field.

More Carrier Cloud Action, with No Outsourcing

The latest in the carrier-cloud 5G story could be big, but not for the mainstream network infrastructure vendors.  Dish announced it had picked VMware’s Telco Cloud solution for its 5G network, a platform that will be hosting the Mavenir open RAN software and almost certainly their 5G Core elements as well.  VMware will serve as a prime integrator for other components of 5G and carrier cloud.  The operative question is whether the story just “could be big” or “will be”.

This deal is important for both VMware and Dish.  VMware needs to establish itself as a credible carrier cloud infrastructure vendor, a vendor who can build and integrate a 5G and carrier cloud ecosystem.  Dish needs something to save it from the continued decline of the cable-and-satellite TV business.

Dish is suffering from major losses in its satellite TV business, and has been struggling to figure out what it could do for its next act.  They had a satellite IoT strategy that, predictably, went nowhere because the initiative seemed to be driven more by IoT hype than any respectable business planning.  5G mobile is its next attempt, and obviously you have to take the prior IoT flop into account when judging the credibility of its initiatives there.

Still, Dish has made some sensible moves.  They acquired Boost Mobile from T-Mobile, which gave them access to T-Mobile’s network for seven years, and they’ve purchased over $20 billion in spectrum over the last ten years or so.  The Boost deal (T-Mobile had to sell Boost as a part of the Sprint/T-Mobile merger approvals process) gives Dish time to frame a logical 5G strategy free from the pressure that’s impacting other 5G players and hopefuls.

From the start, it’s appeared that Dish intended to use that time to formulate a 5G infrastructure plan based (as much as possible) on the cloud, on software, and on open-model networking principles.  Mavenir is one of the few companies who offer a 5G software solution from RAN to core, and they inked that deal in April of this year.  The decision to use VMware’s Telco Cloud software as the infrastructure platform gives credence to what I’ve heard, which is that Dish intends to implement the full 5G stack, including 5G Core software.  However, neither Dish’s nor VMware’s PR have made that point explicitly.

The first question this all raises is whether Dish is serious this time, and I think it is.  They’ve got no real options, other than to simply fade into the sunset.  While their satellite TV business is still supporting millions of customers, the business is deteriorating, along with margins and profits.  The deal for Boost and the spectrum acquisitions would seem to commit Dish to 5G mobile.

The second question is whether Dish can make this work, serious or not.  Are they still perhaps hopeful that 5G IoT will ride up to save them?  I think there’s an element of that thinking that still prevails, but it’s more a kicker on other drivers, or an attempt to redeem a costly blunder.  I think that the strategy of Boost plus spectrum plus Mavenir plus VMware suggests that they realize two important truths about their opportunity.

Truth One is that if they’re going to succeed in 5G, they’re going to have to manage infrastructure and operations costs relentlessly.  In the near term, there is little or no chance that any meaningful 5G technical differentiation will be possible, and little chance that any “killer app” will emerge that they could latch onto.  That means that profitable service offerings will have to be profitable because they have lower costs.  Buying the same technology as current 5G mobile competitors to achieve cost primacy would be a fairly foolish approach.

Truth Two, though, is that cost management vanishes to a point, along with those who hope to survive on that alone.  There will have to be something available as a service revenue kicker, and that something will almost surely have to be a form of over-the-top service set that leverages the agility of cloud-native technology.  The same drivers that dominate the future of “carrier cloud” in general are the opportunities that Dish has to try to exploit.  Thus, they need to be thinking “carrier cloud” from the first.

Recognizing a truth is a necessary condition for successfully exploiting it, but not a sufficient one.  Network operators have, for literally decades, ignored reality in their strategies for “transformation”.  The biggest advantage that Dish may have over those operators is the lack of history, the lack of a culture that’s bound the operators into “connection-think” with respect to services.  But even with a lack of negative bias, Dish will need positive skills to make this work, and from what I hear, they do not have them internally.  That means they’ll be dependent on Mavenir and VMware to supply them.

Both these companies have strong credentials in the specific areas where Dish has engaged them.  Mavenir knows 5G virtualization and infrastructure, including open RAN.  VMware knows cloud-native infrastructure and hosting.  That’s enough for Dish to realize the implications of Truth One, as I’ve outlined it above.

Who knows the rest of carrier cloud?  Truth Two says that Truth One is a transitional strategy, which implies that there has to be a transition to something else.  Carrier cloud depends, in the long term, on the successful exploitation of IoT-related services, personalization, and contextualization.  I’ve blogged about all of these in the past so I won’t bore you with more of it here.  The point is that there are enormous service opportunities associated with each of these things, opportunities the operators could realize with their own retail OTT offerings (subject to regulatory approval) or via wholesale features that they’d let OTTs compose at the retail level.

I don’t have any reason to believe that either VMware or Mavenir have any capabilities in these new areas.  They might, but given the fact that very few operators or vendors have demonstrated any vision beyond connectivity, it’s safer to presume that they don’t have the skills yet either.  That means that they’d have to develop them, and quickly.

The challenge this sort of thing creates for Dish is pretty obvious.  If they raise a banner but fail to raise an army, they signal their intentions to a host of others who may not make the same mistake.  Of particular concern to Dish’s hopes, and the hopes of Mavenir and VMware, are the public cloud providers Google, IBM, and Microsoft.  All of these players have clear carrier-cloud engagement hopes, and plans to exploit those hopes.

Google has had recent success in promoting relationships with telcos to outsource some carrier cloud applications relating to 5G.  Microsoft acquired Metaswitch, who like Mavenir is a pioneer in virtualized mobile infrastructure.  IBM has Red Hat, who can muster similar tools and capabilities to those of VMware.

VMware might benefit from Amazon’s inclusion of them in what appears to be an emerging telco and carrier cloud strategy, but Amazon is the cloud partner the operators seem to fear the most.  Still, Amazon hosts more OTT services in the cloud than any other cloud provider, perhaps as many as all the others combined.

One potentially ominous sign, IMHO, is the comment carried in Light Reading that “The companies explained that they will work together to test and certify vendors’ network functions as they are installed in Dish’s network, such as those from other Dish vendors like Mavenir and Altiostar.”  This could be interpreted as a dependence on NFV for the service functions.  NFV isn’t useful in 5G infrastructure, though it could be cobbled together to work.  It’s totally useless as a framework for other future carrier cloud services, which have no connection with devices and thus don’t fit the NFV mold of transforming devices into virtual devices.

VMware has this slant in its own Telco Cloud material, and I think it represents their biggest risk, both overall and as far as the success of the Dish initiative is concerned.  At the least, an NFV story tends to submerge the benefits of asserting cloud-native hosting and service creation, since NFV has yet to define any meaningful cloud-native initiative.  Ericsson, the leading “traditional” 5G vendor, is already buffing up its own 5G cloud-native story.  Thus, coming up with a real strategy for carrier-cloud function hosting that works for connection service elements and OTT service elements should be VMware’s, Dish’s and even Mavenir’s top priority.

Why is Google Credible as a Telco Cloud Partner?

Why does Google seem to have momentum in carrier edge computing relationships?  This piece in Light Reading highlights a general if low-key set of successes Google has enjoyed in the network operator space.  If anything, their momentum in the space seems to be building.  Why are they doing so well, so suddenly?

First-off, there’s nothing sudden about it.  A series of operators asked me to feel Google out for partnership well over a decade ago, but at the time Google’s management wasn’t interested.  The reason they had provisionally picked Google was interesting, though.

First, Google wasn’t Amazon, who operators almost universally feared.  This was the primary reason why Google was preferred a decade ago.  Because Amazon was (and is) the leading provider of public cloud services, operators saw them as being the most likely long-term competitor with regard to any cloud plans operators might develop.  Microsoft has been seen mostly as a provider of enterprise cloud services, though today it’s clear that Microsoft has broader aspirations.  Google was the clear choice, if somewhat by default.

The second point (more applicable today than in years past) is that Google is seen as having the most experience in areas where operator interest is high.  Google’s cloud network is the world’s largest SDN application.  Google developed Kubernetes and Istio, both of which are now seen as critical pieces of the cloud-native world.  Google’s content delivery processes, linked to its YouTube services family, are also one of the best examples of edge computing currently deployed.

These two early plusses for Google have only gained credibility over time, and they now combine with some new factors, some on the network operators’ side and some on Google’s.

The first new-ish development is Google’s willingness to consider a partnership with operators.  My sources say that arose in earnest about five years ago, and it was due in part to a recognition that operators had a natural advantage in edge computing—they had real estate at the edge.  More recently (in the last year) Google was aware that other cloud providers were showing interest in hosting carrier cloud applications, which could have given a competitor an unassailable lead in the cloud space.

From the operators’ side, the big change is the realization that they don’t have a clue as to how to proceed with carrier cloud deployment.  The vendors they trust, the network equipment vendors they know, don’t have a clue either.  The vendors who do understand the mission (Red Hat, VMware, and so forth) don’t understand the carriers, and the carriers have little or no experience.  In addition, since the operators have no internal cloud software expertise, they’re in no position to assess any solutions.  Outsourcing of some sort makes sense.

An additional operator concern is 5G.  There was a hope, for a time at least, that 5G initiatives could be confined to the NSA (non-standalone) RAN-only upgrades, but operators now believe that for competitive reasons and to harness hoped-for 5G specific benefits, they’ll need to implement 5G Core.  That means having somewhere to host it, and to preserve latency goals for 5G applications, that somewhere has to be the edge.  Imagine fearful operator planners confronting the need to deploy tens of thousands of edge data centers.  Not a happy picture.

The union of these interests lies in the final benefit of outsourcing; incremental commitment.  Operators don’t know how much 5G Core deployment they might push because they don’t know the rate of 5G adoption or the pace at which new 5G-specific (or facilitating) applications might evolve.  If things are slow, they could end up with their own edge data centers sitting idle till they’re obsolete.  If the opportunities develop quickly, they could end up being delayed in addressing them by a lack of edge computing capacity.

Google could really be on to something here, IMHO.  There’s no question that if you had to pick a single player who has a thorough understanding of the cloud and cloud-native development, who understands how to apply hosted function technology to IP networks, who understands next-gen services, Google would be at or near the top of the list.  That doesn’t mean they have an automatic win, though.  What does Google need to buff up their chances of total victory?

Thing one is to recognize that network operators are the world’s most experienced tire-kickers.  Over 80% of all operator RFPs don’t end up delivering any significant production deployments.  The process of assessing technology and the process of deploying it are so separated in most operators that the two organizations may not even like each other.  The “assessors” tend to be in the driver’s seat in early proof-of-concept deals.  Google needs to make sure they get broadly engaged, early on.

The second thing is that the operators’ goals for 5G are actually beyond the scope of their influence on the market.  Operators cannot make IoT or augmented reality or contextualized services happen.  There has to be a broad-market commitment to the concept.  Operator notions of how to build these communities center on announcing developer programs that really don’t offer much benefit to the developers at all.  Google needs to be able to frame things like IoT in terms of new carrier cloud services, since operators can’t.

The third thing is that operators don’t really know what they want from carrier cloud, and don’t know how to find out.  Current interest focuses on 5G mobile-edge because 5G deployment is a given for operators in this market environment.  But what justifies further carrier cloud deployment once you’ve hosted 5G Core features?  When would operators decide to pull the hosting in-house?  What you hear from operators is platitudes like “transformation”, which is clearly too vague to serve as the basis for a plan.  Google needs to understand where operators could and should take carrier cloud, and ensure that there’s always a new application on the horizon to renew interest in outsourcing.

The final thing is that operators don’t really want to sell enterprises cloud computing, they just want to let them buy.  Many of these cloud deals involve the cloud provider becoming a partner in selling cloud computing to enterprises.  The cloud providers usually see this as a way to create a channel partner, but that would only be true if the operators were really trying to sell.  They’re happy to take orders, but where would operator sales people get the training and contacts to actually sell cloud computing services?  If Google expects something to happen here, they have to be prepared to prime the pump.

Then, of course, there’s competition.  Google is far from the only game in town, not as a cloud provider and not as a carrier cloud architecture contender.  There’s clearly an opportunity out there, and the same issues that Google would have to face could be faced by a competitor, perhaps quicker and more effectively.

Amazon, never the favored partner, has been working hard to establish itself in the space, and Amazon has a lot of edge experience and content delivery capabilities on their own.  They held a Telecom Symposium that got not only network operator participation but also the participation of vendors, including integrators, OSS/BSS players, and some network vendors.  While this doesn’t guarantee Amazon can bring an ecosystem of its own to play, it at least indicates it might have the credentials to attract one.

Microsoft wants to be the carrier cloud outsource player of choice, and wants it badly.  They have their own program aimed at victory, and they made an incredibly smart play acquiring Metaswitch, a software company with specific expertise in mobile infrastructure virtualization, including 5G.  IBM also wants the prize, and they’re working to frame their own cloud plus Red Hat tools into a carrier cloud framework that could not only be run on IBM’s cloud but also hosted on operator data centers.

VMware has designs on carrier cloud too.  Dell, the senior partner in VMware, had at one time committed to a group of operators that they would take a big position in function virtualization—including having Michael Dell make the announcement.  It didn’t happen, but VMware seems now to be carrying the torch.  VMware has a good relationship with all the public cloud providers, including Google, and they could convert Tanzu into a cloud-portable strategy that would also allow operators to pull some hosting back into their own data centers.

Service Automation: OSS/BSS or ZSM?

Are we seeing a hidden battle between operations automation alternatives?  On the one hand, there are clearly many developments in the OSS/BSS space, driven by vendors like Amdocs who want to reduce operations costs and improve operations practices, by enhancing traditional operations applications.  On the other hand, some operators are still looking for near-revolutionary changes in lifecycle automation, through things like ETSI ZSM or ONAP.  The balance of these two approaches could be very important.  In fact, it already has changed the nature of lifecycle automation.

One fundamental truth in network cost of ownership is that capex, for most operators, is lower than opex.  In fact, operators spend only about 20 cents per revenue dollar on capex, and they spend over 40% more than that on opex.  In 2016, when I started analyzing opex cost trends, service lifecycle automation could have saved operators an average of 7 cents on each revenue dollar, equivalent to cutting capex by a third.
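To put rough numbers on the figures cited above (my own modeled estimates, not reported financials), a quick back-of-the-envelope calculation:

```python
# Rough arithmetic on the cost figures cited above (per revenue dollar).
capex = 0.20                       # ~20 cents of each revenue dollar
opex = capex * 1.40                # "over 40% more than that" => ~28 cents
automation_savings = 0.07          # 2016 estimate of addressable opex

print(f"capex:   {capex:.2f} per revenue dollar")
print(f"opex:    {opex:.2f} per revenue dollar (or more)")
print(f"savings: {automation_savings:.2f}, i.e. {automation_savings / capex:.0%} "
      f"of capex -- 'equivalent to cutting capex by a third'")
```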

Things like SDN and NFV were aimed primarily at capex reduction, and that in fact has been one of the issues.  The actual benefit of hosting functions on a cloud versus discrete devices has proven to be far less than 20%, and a lot of operators report that benefit is erased by the greater operational complexity of hosted-function networking.  It’s therefore not surprising that as the hopes for a capex revolution driven by virtualization waned, operators became sensitive to opex reduction opportunities.

Service lifecycle automation, meaning the handling of service events through software processes rather than manual intervention, has the advantage of scope of impact, and the same thing is a disadvantage.  Retooling operations systems to be driven by centralized automation platforms of any sort is the kind of change that makes operators very antsy.

That’s particularly true when there’s really no service lifecycle automation model to touch and feel.  In 2016, when the opex challenge really emerged in earnest, we had no progress in standards and little progress with the proto-ONAP framework.  Sadly, it’s my personal view that we’re in much the same situation today.  I do not believe that ETSI is on the right track, or even a survivable track, with ZSM, and I don’t think ONAP would scale to perform the lifecycle automation tasks we’re going to confront.

That’s where the OSS/BSS alternative comes in.  Vendors and operators both realized that “opex reductions” or “lifecycle automation” was really about being able to cut headcount.  Yes, you could frame a true service lifecycle automation to do that optimally, if you knew what you were doing.  Neither telcos nor telco equipment vendors apparently had the confidence they did.  You could also tweak the current operations systems to handle the current network-to-ops relationships better, and leave the network and network-related event-handling alone.

This isn’t, in the short term at least, a dumb notion.  Of that seven cents per revenue dollar that’s on the table for full-scale lifecycle automation, about three cents could be achieved by tweaking the OSS/BSS.  Some additional savings can be had by framing services to require less lifecycle automation; less dependence on SLAs, customer portals to reduce operator personnel needs, and so forth.  Overall, operators have generally been able to hold their ground on opex, and in many cases have even been able to reduce it over time.  At least four of those seven cents are now largely off the table.

That doesn’t mean that ZSM (or what I’d consider a better model of lifecycle automation) is dead.  What it likely does is tie lifecycle automation success to the widespread use of carrier cloud technology.  The substitution of functions for devices as the building blocks of services demonstrably creates more service complexity.  However, even carrier cloud success might not create ZSM or ONAP success.

The cloud community, including Google, Microsoft, Amazon, Red Hat, and VMware, are all working feverishly to enhance the basic Kubernetes ecosystem.  That process will shortly create a framework for lifecycle automation for nearly all componentized applications, the only possible exception being the components associated with data-plane handling.  The exact nature of data-plane functions is still up in the air; most operators favor the notion of a white box rather than a commercial server.  Given that, only widespread NFV adoption using cloud hosting would be likely to accelerate the need for the ZSM or ONAP model of service lifecycle automation.  Otherwise the cloud-centric approach would serve better.

White box data-plane functions would really look like devices with somewhat elastic software loads, similar to the NFV uCPE model.  These applications don’t really impose a different management model for function-based services; the services are just based on open devices rather than vendor platforms.  I doubt whether the differences are sufficient to justify any new management model; we already manage devices in networks.

This seems to have been one of the original goals of NFV; if you focus on virtual devices you can employ device management for at least the higher-level management functions.  The only remaining task is the management of how the virtual elements are hosted and combined, which is a more limited mission.  It could have been a reasonable approach had it been more explicitly articulated and if the consequences of the approach (divided management) had been dealt with, by (for example) embedding the collection of functions within a virtual device in an intent-modeled element.

The OSS/BSS players seem to be sticking with the device-management approach, and that may well be because they don’t see a widespread operator push for deploying their own carrier cloud resources.  Until you commit to carrier cloud on a broad scale, meaning beyond NFV and 5G Core, you have no real need to consider how you manage naked functions.  That’s because it’s the follow-on drivers for carrier cloud, like IoT, that lack real-device network models, and so are the ones likely to produce function-based services.  Where we have devices, the OSS/BSS-centric solution is workable, or can be made so.

As is often the case, where we end up with regard to operations automation will likely depend on just how far operators take “carrier cloud” and function-based services.  That seems likely to depend on whether operators stay within their narrow connection-services comfort zone, or step out into a broader vision of the services they could provide.

Why Function Integration Needs to Pick an Approach

“What are we supposed to integrate?”  That’s a question a senior planner at a big telco posed to me, in response to a blog where I’d commented that virtualization increased the need for integration services.  The point caught me by surprise, because I realized she was right.  Integration is daunting at best, but one of the challenges of virtualization is that it’s not always clear what there is to integrate.  Without some specifics in that regard, the problem is so open-ended as to be unsolvable.

In a device network, it’s pretty obvious that we integrate devices.  Devices have physical interfaces, and so you make connections from one to the other via those interfaces.  You parameterize the devices so the setup of the interfaces is compatible, and you select control-plane protocols (like BGP) so everyone is talking the same language.  We all know that this isn’t always easy (how many frustrations, collectively across all its users, has BGP alone generated?) but it’s at least possible.

When we move from device networks to networks that consist of a series of hosted virtual functions, things get a lot more complicated.  Routers build router networks, so functions build function networks—it seems logical and simple.  The problem is that “router” is a specific device and “function” is a generic concept.  What “functions” are even supposed to be connected?  How do they get connected?

Standards and specifications, in a general sense, aren’t a useful answer.  First, you really can’t standardize across the entire possible scope of “functions”.  The interface needed for data-plane functions, for example, might still have to look like traditional network physical interfaces, such as Ethernet.  The interface needed for authentication functions might have to be an API.  Second, there are so many possible functions that it’s hard to see how any given body could be accepted to standardize them all.  Finally, there’s way too much time needed, time we don’t have unless we want virtualization to be an artifact of the next decade.

A final issue here is one of visualization.  It’s easy to visualize a device, or even a virtual device.  It’s a lot harder to visualize a function or feature.  I’ve talked to many network professionals who simply cannot grasp the notion of what might be called “naked functions”, of building up a network by collecting and integrating individual features.  If that’s hard, how hard would it be to organize all the pieces so we could at least talk integration confidently?

I’ve been thinking about this issue a lot, and it appears to me that there are two basic possibilities in defining a framework for virtual-function integration, including the ever-popular topic of “onboarding”.  One is to define a “model hierarchy” approach that divides functions into relevant groups, or classes, and provides a model system based on that approach.  The other is to forget trying to classify anything at all, and instead devise an “orchestration model” that defines how stuff is supposed to be put together.

We see examples of the first approach where we have a generalized software module that includes “plugins” to specialize it to something specific.  OpenStack uses this approach.  The challenge is to avoid ending up with thousands of plugins because the inputs to the plugin process were never defined in an orderly way.  That’s where the idea of a hierarchy of classes comes in.

All network functions, in this approach, descend from a superclass we could call “network-function”.  This class would be assigned a number of specific properties; it might, for example, expose a function you could call that returns the function’s specific identity and properties.  In the original ExperiaSphere project, I included this as the function “Speak”.  Most properties and features, though, would come from subordinate classes that extend the superclass.  We could, for example, define four subclasses.  The first is “flow-thru-function”, indicating that the function is presumed to be part of a data plane that flows traffic in and out.  The second is “control-plane-function”, which handles the peer exchanges that mediate behavior (BGP would fall into this).  The third is “management-plane-function”, where management control is applied.  The final subclass is “organizational-function”, covering functions intended to be part of the glue that binds a network of functions together.
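
As a minimal sketch of the idea (only “Speak” and the four subclass names come from the text above; every other class and method name below is illustrative, not drawn from any specification), the hierarchy might look like this in Python:

    class NetworkFunction:
        """Superclass for all hostable network functions (illustrative only)."""
        def __init__(self, identity: str, properties: dict):
            self.identity = identity
            self.properties = properties

        def speak(self) -> dict:
            # The "Speak" idea: return the function's identity and properties on request.
            return {"identity": self.identity, "properties": self.properties}

    class FlowThruFunction(NetworkFunction):
        """Data-plane function: traffic flows in and out."""
        def handle_packet(self, packet: bytes) -> bytes:
            raise NotImplementedError

        def shunt_control_packet(self, packet: bytes) -> None:
            # The "T" connection: divert control packets toward a control-plane function.
            raise NotImplementedError

    class ControlPlaneFunction(NetworkFunction):
        """Handles peer exchanges that mediate behavior (BGP would fall here)."""
        def handle_peer_message(self, message: bytes) -> None:
            raise NotImplementedError

    class ManagementPlaneFunction(NetworkFunction):
        """Where management control is applied."""
        def apply_policy(self, policy: dict) -> None:
            raise NotImplementedError

    class OrganizationalFunction(NetworkFunction):
        """Glue that binds a network of functions together."""
        def bind(self, functions: list) -> None:
            raise NotImplementedError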

If we look a bit deeper here, we can find some interesting points.  First, there are going to be many cases where network functions depend on others.  Here, “flow-thru-function” is almost certain to include a control-packet shunt facility, a kind of “T” connection.  This would feed the “control-plane-function” in our example, providing for the handling of control packets.  Since control and data planes have to be separated for handling in our example, we could require at least some flow-thru-functions to do that separation themselves, rather than requiring a separate function to do it, which would add cost and latency.

The second point is that we would need to define essential interfaces and APIs for each of our classes.  The goal of doing this based on a class hierarchy is to simplify the process of adapting specific implementations of a class, or implementations of a subordinate class, to software designed to lifecycle-manage the class overall.  If we know what interfaces/APIs a “firewall-function” has, and we write software that assumes those APIs, then all we have to do is adapt any implementations to those same interfaces/APIs.

The last point raises another useful one.  To build our hierarchy of classes, we still need to define some basic assumptions about what network functions do and how they relate.  We also need vendors/products to align with the classes.  If both of these are done, then integrating a function requires only the creation of whatever “plugin” code is needed to make the function’s interfaces conform to the class standard.  Vendors would provide the mapping or adapting plugins as a condition of bidding for the business.
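
To make the adaptation step concrete, here is a hedged sketch of what such a vendor plugin might look like, assuming the hypothetical class interfaces sketched above and an equally hypothetical vendor firewall API (none of these names refer to a real product):

    class VendorXFirewallPlugin:
        """Adapter: class-standard calls in, vendor-specific calls out (all names hypothetical)."""

        def __init__(self, vendor_api):
            self.vendor_api = vendor_api             # the vendor's proprietary client object

        # The interface lifecycle-management software expects of a "firewall-function":
        def speak(self) -> dict:
            return {"identity": "vendorX-firewall", "class": "firewall-function"}

        def handle_packet(self, packet: bytes) -> bytes:
            return self.vendor_api.filter(packet)    # hypothetical vendor call

        def shunt_control_packet(self, packet: bytes) -> None:
            self.vendor_api.punt_to_control(packet)  # hypothetical vendor call

The lifecycle software is written once against the class-standard interface; the vendor supplies the thin mapping layer.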

The other approach is simpler on one hand and more complicated on the other.  It’s simpler because you don’t bother defining hierarchies or classes.  It’s more complicated…well…because you didn’t.  In fact, it’s complicated to explain it without referencing something.

If you hark back to my state/event-based concept of service management, you’ll recall that my presumption was that a service, made up of a collection of lower-level functions/elements, would be represented by a model.  Each model element, which in our example here would correspond to a function, has an associated state/event table that relates its operating states and events to the processes that are supposed to handle them.  Remember?

OK, then, what the “orchestration model” says is that if a vendor provides the set of processes activated by all the state/event combinations, then those processes can take into account any special data models or APIs or whatever.  The process set does the integration.
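
A minimal sketch of how that might look (the states, events, and process names here are placeholders of my own, not anyone’s standard):

    def deploy(element):    print(f"deploying {element}")
    def activate(element):  print(f"activating {element}")
    def remediate(element): print(f"remediating {element}")
    def teardown(element):  print(f"tearing down {element}")

    # Each (state, event) pair names the vendor-supplied process that handles it.
    STATE_EVENT_TABLE = {
        ("ordered",  "deploy"):   deploy,
        ("deployed", "activate"): activate,
        ("active",   "fault"):    remediate,
        ("active",   "cancel"):   teardown,
    }

    def handle_event(element: str, state: str, event: str) -> None:
        # Steer the event to whatever process the table names.
        process = STATE_EVENT_TABLE.get((state, event))
        if process is None:
            raise ValueError(f"no handler for state={state!r}, event={event!r}")
        process(element)

    handle_event("firewall-1", "active", "fault")   # -> "remediating firewall-1"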

Well, it sort-of does.  You still have to define your states and events, and you still have to agree on how events flow between adjacent elements.  That seems a lot less work than building a class hierarchy, but even here we have to be wary of appearances.  If there are a lot of vendors and a lot of functions, a lot of work will end up being done; had we taken the time to put together our class hierarchy, we might have done some simple adapting and largely reused all those processes.

A class-hierarchy approach organizes functions, following the kind of practices that have been used for decades in software development to create reusable components.  By structuring functional interfaces against a “class reference”, it reduces the variability in interfaces associated with lifecycle management.  That limits how much integration work would be needed for actual management processes.  The orchestration model risks creating practices so specialized that you almost have to redo the management software itself to accommodate the variations in how you deploy and manage functions.  Class hierarchies seem likely to be the best approach, but the approach flies in the face of telco thinking and, while it’s been known and available from the first days of “transformation”, it never got much traction with the telcos.  The orchestration model almost admits to a loss of control over functions and deals with it as well as possible.

Our choice, then, seems a bit stark.  We can either pick a class-hierarchy approach, which demands a lot of up-front work that, given the pace of telecom activity to date, could well take half a decade, or we can launch a simpler initiative that could end up demanding much more work if the notion of function hosting actually catches on.  If we could find a forum in which to develop a class hierarchy, I’d bet on that approach.  Where that forum might be, and how we might exploit it, are as much a mystery to me as ever.

I think I know what to do here, and I think many others with a software background know as well.  We’ve known all along, and the telco processes have managed to tamp down the progress of that knowledge.  Unless we admit to that truth, pick a strategy that actually shows long-term potential and isn’t just an easy first step, and then support that approach with real dollars and enforced compliance with the rules the strategy requires, we’ll be talking about this when the new computer science graduates are ending their careers.

An Assessment of Four Key ETSI Standards Initiatives in Transformation

The topic of telco transformation is important, perhaps even critical, so it’s good it’s getting more attention.  The obvious question is whether “attention” is the same as “activity”, and whether “movement” is equivalent to “progress”.  One recent piece, posted on LinkedIn, is a standards update from ETSI created by the chairs of particular groups involved in telco transformation, and so frames a good way of assessing just what “attention” means in the telco world.  From there, who knows?  We might even be able to comment on progress.

The paper I’m referencing is an ETSI document, and I want to start by saying that there are a lot of hard-working and earnest people involved in these ETSI standards.  My problem isn’t in their commitment, their goals, or their efforts, it’s in the lack of useful results.  I participated in ETSI NFV for years, creating the group that launched the first approved proof-of-concept.  As I said in the past, I believe firmly that the group got off on the wrong track, and that’s why I’m interested in the update the paper presents.  Has anything changed?

The document describes four specific standards initiatives: NFV, Multi-access Edge Computing (MEC), Experiential Networked Intelligence (ENI), and Zero-touch Network and Service Management (ZSM).  I’ll look at each of them below, but limit my NFV comments to any new points raised by the current state of the specifications.  I do have to start with a little goal-setting.

Transformation, to me, is about moving away from building networks and services by connecting devices together.  That’s my overall premise here, and the premise that forms my basis for assessing these four initiatives.  To get beyond devices, we have to create “naked functions”, meaning disaggregated, hostable features that we can instantiate and interconnect as needed.  There should be no constraints on where that instantiation happens: data centers, public clouds, and so forth.

This last point is critical, because it’s the goal of software architecture overall.  The term most often used to describe it is “cloud-native”, not because the software has to be instantiated in the cloud, but because it’s designed to fully exploit the virtual, elastic nature of the cloud.  You can give up cloud hosting with cloud-native software, if you want to pay the price.  You can’t gain the full benefit of the cloud without cloud-native software, though.

Moving to our four specific areas, we’ll start with the developments in NFV.  The key point the document makes with regard to Release 4 developments is “Consolidation of the infrastructural aspects, by a redefinition of the NFV infrastructure (NFVI) abstraction….”  My problem with this is that in virtualization, you don’t enhance things by narrowing or subdividing your hosting target, but rather by improving the way hosting abstractions are mapped onto whatever infrastructure lies underneath.  In NFV, that mapping is handled by the Virtual Infrastructure Manager (VIM).

Originally, the VIM was seen as a single component, but the illustration in the paper says “VIM(s)”, which admits to the likelihood that there would be multiple VIMs depending on the specific infrastructure.  That’s progress, but it still leaves the question of how you associate a VIM with NFVI and the specific functions you’re deploying.  In my own ExperiaSphere model, the association was made by the model, but it’s not clear to me how this would work today with NFV.
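
As a hedged illustration of what “the model makes the association” could mean (the names and structure below are hypothetical; they are not drawn from the NFV specifications or from ExperiaSphere itself):

    VIM_REGISTRY = {
        "openstack-core": "vim-openstack",    # e.g., data-center NFVI
        "k8s-edge":       "vim-kubernetes",   # e.g., containerized edge NFVI
        "public-cloud":   "vim-cloud",        # e.g., a public-cloud hosting domain
    }

    model_element = {
        "name": "firewall-vnf",
        "infrastructure_class": "k8s-edge",   # declared in the service model, not in code
    }

    def select_vim(element: dict) -> str:
        # The model, not the orchestration code, decides which VIM applies.
        infra = element["infrastructure_class"]
        if infra not in VIM_REGISTRY:
            raise ValueError(f"no VIM registered for infrastructure class {infra!r}")
        return VIM_REGISTRY[infra]

    print(select_vim(model_element))          # -> vim-kubernetes

The point is that adding a new hosting option means registering a new VIM and referencing it in the model, not reworking the orchestration logic.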

The paper makes it clear that regardless of the changes made in NFV, it’s still intended to manage the virtual functions that replace “physical network functions” (PNFs), meaning devices.  Its lifecycle processes and management divide the realization of a function (hosting and lifecycle management of hosting-related elements) from the management of the things that are virtualized—the PNFs.  That facilitates the introduction of virtual functions into real-world networks, but it also bifurcates lifecycle management, which I think limits automation potential.

The next of our four standards areas is “Multi-access Edge Computing” or MEC.  The ETSI approach to this is curious, to say the least.  The goal is “to enable a self-contained MEC cloud which can exist in different cloud environments….”  To make this applicable to NFV, the specification proposes a class of VNF (the “MEC Platform”) that is itself deployed as a VNF and then contains the NFV VNFs.  This establishes the notion that VNFs can be elements of infrastructure (NFVI, specifically), and it creates a whole new issue set in defining, creating, and managing the relationships between the “platform” class of VNFs and the “functional” class we already have.

This is so far removed from the trends in cloud computing that I suspect cloud architects would be aghast at the notion.  The MEC platform should be a class of pooled resources, perhaps supported by a different VIM, but likely nothing more than a special type of host that would (in Kubernetes, for example) be selected or avoided (by taints, tolerations, affinities, etc.) via parameters.
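
For illustration only, here is how that parameter-driven selection might look, written as a Python dict standing in for a fragment of a Kubernetes pod spec (the label and taint names are hypothetical, though the spec fields themselves are standard Kubernetes scheduling parameters):

    pod_spec_fragment = {
        "nodeSelector": {"topology.example.com/site-class": "edge"},
        "tolerations": [{
            "key": "example.com/edge-only",   # hypothetical taint applied to edge nodes
            "operator": "Exists",
            "effect": "NoSchedule",
        }],
        "affinity": {
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [{
                        "matchExpressions": [{
                            "key": "topology.example.com/latency-zone",
                            "operator": "In",
                            "values": ["metro-edge"],
                        }],
                    }],
                },
            },
        },
    }

In other words, “edge” is a property of the resource pool, not a new class of VNF.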

The MEC concept seems to me to be moving carriers further from the principles of cloud computing, which are evolving quickly and effectively in support of both public and hybrid cloud applications.  If operators believe that they can host things like 5G Core features in the public cloud, why would they not flock to cloud principles?  NFV started things off wrong here, and MEC seems to be perpetuating that wrong direction.

Our next concept is Experiential Networked Intelligence (ENI), which the ETSI paper describes as “an architecture to enable closed-loop network operations and management leveraging AI.”  The goal appears to be to define a mechanism where an AI/ML intermediary would respond to conditions in the network by generating recommendations or commands to pass along to current or evolving management systems and processes.

Like NFV’s management bifurcation, this seems aimed at adapting AI/ML to current systems, but it raises a lot of questions (too many to detail here).  One question is how you’d coordinate the response to an issue that spans multiple elements or requires changes to multiple elements in order to remediate.  Another is how you “suggest” something to an API linked to an automated process.

To me, the logical way to look at AI/ML in service management is to presume the service is made up of “intent models” which enforce an SLA internally.  The enforcement of that SLA, being inside the black box, can take any form that works, including AI/ML.  In other words, we really need to redefine how we think of service lifecycle management in order to apply AI to it.  That doesn’t mean we have to scrap OSS/BSS or NMS systems, but obviously we have to change these systems somewhat if there are automated processes running between them and services/infrastructure.
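
A hedged sketch of what an intent-modeled element with internal SLA enforcement might look like (the class, the SLA metric, and the toy logic are all mine, purely for illustration):

    import random

    class IntentElement:
        """Black box: public status outside, opaque SLA enforcement inside (illustrative)."""

        def __init__(self, name: str, sla_max_latency_ms: float):
            self.name = name
            self.sla_max_latency_ms = sla_max_latency_ms

        def status(self) -> dict:
            # The only thing the outside world sees is whether the SLA is met.
            latency = self._measure_latency()
            if latency > self.sla_max_latency_ms:
                self._remediate(latency)              # internal; could be AI/ML-driven
                latency = self._measure_latency()
            return {"element": self.name, "sla_met": latency <= self.sla_max_latency_ms}

        def _measure_latency(self) -> float:
            return random.uniform(5.0, 25.0)          # stand-in for real telemetry

        def _remediate(self, observed_latency: float) -> None:
            # A script, a rules engine, or an ML policy; anything that works is fine,
            # because it's invisible from outside the black box.
            pass

    print(IntentElement("edge-core-path", sla_max_latency_ms=20.0).status())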

That brings us to our final concept area, Zero-touch network and Service Management (ZSM).  You can say that I’m seeing an NFV monster in every closet here, but I believe that ZSM suffers from the initial issue that sent NFV off-track, which is the attempt to depict functionality that then ends up being turned into an implementation description.

I absolutely reject any notion that a monolithic management process, or set of processes built into a monolithic management application, could properly automate a complex multi-service network that includes both network devices and hosted features/functions.  I’ve described the issue in past blogs so I won’t go over it again here, but it’s easily resolved by applying the principles of TMF NGOSS Contract, a concept over a decade old.  In an NGOSS Contract statement of a ZSM implementation, the contract data model would be the “integration fabric” depicted in the paper.  Absent that insight, I don’t think a useful implementation can be derived from the approach, and certainly not an optimum one.

What, then, is the basic problem, the thing that unites the issues I’ve cited here?  I think it’s simple.  If you are defining a future architecture, you define it for the future and adapt it to the present.  Transition is justified by transformation, not the other way around.  What the telcos, and ETSI, should be doing is defining a cloud-native model for future networks and services, and then adapting that model to serve during the period when we’re evolving from devices to functions.

Intent modeling and NGOSS Contract would make that not only possible, but easy.  Intent modeling says that elements of a service, whether based on PNFs or VNFs, can be structured as black boxes whose external properties are public and whose internal behaviors are opaque as long as the model element’s SLA is maintained.  NGOSS Contract says that the service data model, which describes the service as a collection of “sub-services” or elements, steers service events to service processes.  That means that any number of processes can be run, anywhere that’s convenient, and driven from and synchronized by that data model.

The TMF hasn’t done a great job in promoting NGOSS Contract, which perhaps is why ETSI and operators have failed to recognize its potential.  Perhaps the best way to leverage the new initiative launched by TelecomTV and other sponsors would be to frame a discussion around how to adapt to the cloud-native, intent-modeled, NGOSS-Contract-mediated model of service lifecycle automation.  While the original TMF architect of NGOSS Contract (John Reilly) has sadly passed, I’m sure there are others in the body who could represent the concept at such a discussion.

The ETSI document was posted on LinkedIn by one of the authors of the paper on accelerating innovation in telecom that I blogged about on Monday of last week.  It may be that the two events, taken together, demonstrate a real desire by the telco standards community to get things on track.  I applaud their determination if that’s the case, but I’m sorry, people.  This isn’t the way.

A Possible Way to Avoid Direct Subsidies for Rural Broadband

Is it possible to estimate broadband coverage potential for new technologies?  I’ve blogged many times about the effect of “demand density” (roughly, a measure of how many opportunity dollars a mile of infrastructure would pass) on the economics of broadband.  Where demand density is high, it’s possible to deliver broadband using things like FTTH because cost/opportunity ratios are favorable.  Where it’s low, cost has to be ruthlessly constrained to get coverage, or subsidies are needed.

We know from experience that, using my metrics for demand density, an average density of about 4.0 will permit quality broadband under “natural market” conditions for at least 90% of households.  Where demand density falls to about 2.0, “thin” areas, meaning those with low populations and economic power, will be difficult to support profitably, so broadband penetration is likely to fall below 80%.  At densities approaching 1.0, penetration will fall to 70% or less without special measures.

The characteristics of wireline infrastructure are usually the limiting factor here.  If broadband deployment costs were very low, then a low economic value passed per mile of infrastructure would still create a reasonable ROI.  Obviously, running any form of physical media to homes and businesses, even with a hierarchy of aggregation points, is going to be more costly where prospective customers are widely distributed.  Almost all urban areas could be served with wireline broadband, while most deep-rural areas (densities of less than 5 households per square mile) would be difficult to serve unless the service value per household was quite high.

Public policy is almost certainly not going to permit operators to cherry-pick these low-density areas based on potential revenue, but that would be difficult in any case because the revenue that could be earned per household depends on the services the household would likely consume.

What is the service value of a household?  Here we have to be careful, because an increasing percentage of the total online service dollars spent per household don’t go to the provider of broadband access.  For example, many households that used to spend around $150 per month on TV, phone, and Internet have dropped everything but Internet and now spend less than $70 per month.  Sure, they may get Hulu and even a live TV streaming service, and spend another $70 or even more, but the broadband operator doesn’t get that.

Generally, the preferred relationship for broadband in US markets seems to be a household revenue stream (all services monthly bill) that’s roughly equal to one third of the combination of pass cost (per-household neighborhood wiring) plus connect cost.  Today, average pass costs run roughly $250 and average connect costs roughly $200, for a total of $450.  That would mean a household revenue stream of $150 is needed, on the average.
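
As a quick worked check of that arithmetic (the dollar figures are the ones quoted above; the calculation is just the one-third rule):

    pass_cost_per_household = 250.0      # neighborhood wiring, per household passed
    connect_cost_per_household = 200.0   # drop/installation for a subscribing household

    capital_per_household = pass_cost_per_household + connect_cost_per_household
    target_monthly_revenue = capital_per_household / 3.0   # the one-third rule

    print(f"capital per household: ${capital_per_household:.0f}")     # $450
    print(f"target monthly revenue: ${target_monthly_revenue:.0f}")   # $150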

In US urban and suburban areas, it’s getting more difficult to hit that monthly revenue target, but it’s still largely possible.  Household densities even in the suburbs tend to run between 300 and 600 households per square mile, which is usually ample to support profitable broadband.  As you move into rural areas, though, household densities fall to an average of less than 100 per square mile, down to (as previously noted) 5 or fewer.

Wireline infrastructure is rarely able to deliver suitable ROI below densities of 150 households per square mile.  Even at higher household densities, 500 or more, it’s often necessary today for developers to either share costs or promise exclusivity to induce broadband providers to offer quality infrastructure for new subdivisions.

5G millimeter wave, just beginning to deploy, is typically based on a combination of short-haul 5G and fiber-to-the-node (FTTN).  The overall cost will depend in large part on whether there are suitable node points where fiber is already available or can be introduced at reasonable cost.  Operators tell me they believe that, on average, it should be possible to serve household densities of between 100 and 200 per square mile at monthly revenues of $120 or more per household, since self-installation is a practical option here.  This would cover a slightly broader swath, from low-density suburbs to high-density rural areas.

The problem here is that 5G/FTTN tends to work down to demand densities somewhere in the 2.5-3.5 range, which is better than the roughly 4.0 floor for traditional technologies but still far too high to address many countries and most rural areas.  For those, the only solution is to rely on cellular technologies with greater range.

Studies worldwide suggest that 5G in traditional cellular form (macrocells in low-density areas, moving to smaller cells in suburbs and cities) could deliver 25 Mbps to 35 Mbps per household at acceptable ROIs, and many operators and vendors say that these numbers could probably be doubled through careful site placement and RF engineering.  My models suggest that using traditional 5G, it would be possible to support demand densities down to as low as 0.8, without any special government support.

The “would be possible” qualifier is important here, and so is the 0.8 demand density floor.  The “possible” issue relates to the fact that while it’s possible to hit minimal ROI targets on demand densities below 1.0, it’s not clear whether minimal ROI could actually get anyone interested in deployment.  With every operator chasing revenue, many leaving their traditional territories to seek opportunities half a world away, would they flock to rural areas?  Maybe not.

With respect to the 0.8 limit, the problem is that there are a lot of areas that fall well below that.  In the US, there are 18 states with demand densities below that limit, and that’s entire states.  Within well over 80% of states there are areas with demand densities below 0.8.  Does this mean that even in the US, widespread issues with broadband quality are inevitable without government support?  Yes.  Does it mean the support has to be direct subsidization?  Perhaps not.

You can swing the ROI upward by lowering the cost of infrastructure.  The biggest cost factors in using 5G (in either form) to improve broadband service in low-demand-density areas are spectrum costs and the cost of providing fiber connections to cell towers and nodes.  Both of these costs could be reduced by government programs.  For example, governments could provide 5G spectrum at low or no cost to those who would offer wireline-substitute broadband at 40 Mbps or more, and they could trench fiber along public routes whenever construction is underway, then offer fiber capacity under the same terms.
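
To illustrate the leverage (every number below is a hypothetical placeholder, not an output of my demand-density model), a simple break-even calculation shows how removing spectrum and backhaul costs lowers the household density an operator needs to justify deployment:

    def breakeven_density(annual_cost_per_sq_mile: float,
                          monthly_arpu: float,
                          take_rate: float,
                          target_margin: float = 0.2) -> float:
        """Households per square mile needed for revenue to cover cost plus margin."""
        annual_revenue_per_household = monthly_arpu * 12 * take_rate
        return annual_cost_per_sq_mile * (1 + target_margin) / annual_revenue_per_household

    # With spectrum amortization and leased backhaul fiber in the cost base (hypothetical):
    with_costs = breakeven_density(annual_cost_per_sq_mile=30_000, monthly_arpu=60, take_rate=0.5)

    # With government-provided spectrum and open-access fiber along public routes (hypothetical):
    with_subsidy = breakeven_density(annual_cost_per_sq_mile=18_000, monthly_arpu=60, take_rate=0.5)

    print(f"break-even: {with_costs:.0f} households/sq mile without help, "
          f"{with_subsidy:.0f} with subsidized spectrum and fiber")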

This could be an alternative to direct subsidies.  I’ve not been able to model the impact of the approach, because there are so many country-specific variables and because low-level data on population and economic density isn’t always available.  Still, it would appear from my efforts that it could pull over 90% of the US into a zone where ROIs on even rural broadband could be reasonable, enough at least to make it possible for existing wireless operators to serve rural areas profitably.

Reading Cloud Patterns from IBM’s Quarter

One of the expected impacts of COVID is pressure on long-term capital projects.  These pressures would tend to favor a shift toward various expense-based strategies to achieve the same overall goals, and in application hosting that would mean a shift from data center commitments to public cloud.

As it happens, 2020 was already destined to see a growth in public cloud services to enterprises, because the model of using the cloud as a front-end technology to adapt legacy applications to mobile/browser access was maturing.  This “hybrid cloud” approach is why Microsoft was gaining traction over Amazon as the cloud provider of choice for enterprises.

As I noted briefly in my blog yesterday, IBM surprised the Street and many IT experts by turning in a sterling quarter, fed in no small part by its IBM Cloud service.  I want to look at their quarter and what it might teach us about the way that cloud services, cloud technology, and data center technology are all evolving.  In particular, I want to look more deeply at the premise of yesterday’s blog—that Google had a cloud-native strategy that it planned to ride to success in tomorrow’s world of IT.

Let me start with a high-level comment.  Yes, IBM’s successful quarter was largely due to its Red Hat acquisition.  Even IBM’s cloud success can be linked back to that, but why do we think they bought Red Hat in the first place?  IBM had a good base relationship with huge companies, a good service organization, and a good brand.  They needed more populism in their product set, and they got it.  We need to understand how they’re exploiting the new situation.

One of IBM CEO Arvind Krishna’s early comments is a good way to start.  He indicated that “we are seeing an increased opportunity for large transformational projects.  These are projects where IBM has a unique value proposition….”  There is no question that COVID and the lockdown have opened the door for a different, far less centralized, model of how workers interact and cooperate.  As I said in an earlier blog, this new approach will survive the pandemic, changing our practices forever.  I think Krishna’s transformational opportunities focus on adapting to the new model of work.

The related point, IBM’s unique value proposition, is also predictable.  If you’re going to do something transformational, you don’t cobble together a bunch of loosely related technologies, or trust your future to some player who might fold in a minute under the very pressures of pandemic you’re responding to yourself.  You pick a trusted giant, and IBM not only fits that bill, they’ve been the most consistently trusted IT player for half a century.

Now let’s look at the next fascinating Krishna comment: “Only 20% of the workloads have moved to the cloud. The other 80% are mission critical workloads that are far more difficult to move, as a massive opportunity in front of us to capture these workloads. When I say hybrid cloud, I’m talking about an inter-operable IT environment across on-premise, private and publicly operated cloud environments, in most cases from multiple vendors.”  This really needs some deep examination!

I’m gratified to see the comment on workloads already migrated to the cloud, admittedly in part because his numbers almost mimic my own data and even earlier modeling.  The most important reason why public cloud for enterprises isn’t an Amazon lake is that 80%.  It’s not moving soon, and so hybrid cloud services have to augment the existing mission-critical stuff rather than replace it.  But, that “inter-operable IT environment” Krishna is talking about is the cloud-native framework that my blog yesterday suggested was Google’s goal.  So, it appears IBM is saying that the future of that 80% mission-critical application set depends on a new environment for applications that sheds technology and location specificity.  Build once to run anywhere.

What’s in that framework?  Containers and Kubernetes, obviously (and Krishna in fact mentions both).  Linux, open-source software, OpenShift, Red Hat stuff, not surprisingly.  What IBM seems to be doing is framing that inter-operable IT environment in terms of software components it already has and which are considered open industry assets.  IBM could reasonably believe it could lift the Red Hat portfolio to that new IT environment level, making all of it an element in the future of IT.

What isn’t in the framework may be just as important.  Nowhere on the call does Krishna suggest that the new framework is “cloud-native” (he never mentions the term), nor does it include a service mesh (never mentioned) or an application platform that’s intrinsically portable, like Angular.  In other words, none of the stuff that Google may be relying on is a part of the IBM story.  That doesn’t mean that Google is on the wrong track; it might mean IBM doesn’t want to make it appear that Google is on the right track.

The risk this poses for IBM is pretty simple.  If there are in fact technology pillars that have to hold up the new application framework, then IBM has to be an early leader in those areas or they risk losing control of what they admit to be the future of IT.  It seems, at one level at least, foolish to take a risk like that, so why might IBM be willing to do so?

The first reason is their nice quarter, and the unique value proposition they’re citing for those current transformational projects.  It’s the wild west out there in the hybrid cloud; let IBM be your sheriff.  IBM is clearly reaping benefits in the here and now, and so the last thing they’d want to do is push the fight off for a year or more, losing revenue and momentum along the way.

The second reason is that Red Hat is unique in having complete platform and application solutions.  If future transformational applications have to be built on a new framework, IBM’s Red Hat assets might require a lot of rebuilding.  Google has no inventory of stuff at risk, so they can not only afford to risk a transformation of the architecture of future applications, they’d actually benefit from one.

The third reason is that, absent a transformational architecture for transformational applications, building those applications would likely involve more integration work.  Guess who has a booming enterprise services business?  IBM!  Quoting Krishna again, “you’ll recall that nearly half the hybrid cloud opportunity lies in services.”  Nothing kills a professional services opportunity like a nice, fully integrated architecture.

I think that IBM’s success this quarter, and its views on why it succeeded, demonstrate that we’re likely heading into a polarization period in hybrid cloud.  One camp, the camp IBM is in, sees the hybrid future as an adaptation of existing open applications to a new architecture, via professional services and container suites (OpenShift).  The other camp, which I believe Google is in, sees the future as the definition of a true, universal, cloud-native application framework that has to be supported from development to deployment.

An interesting kind-of-parallel dynamic is the swirling (and confusing) telco cloud space.  It is very possible, even likely, that the first and biggest opportunity to introduce a sweeping new application architecture into the cloud world would be the telco or carrier cloud.  The current market conditions and trends suggest that carrier cloud is both an opportunity for outsourcing to the public cloud and a new hosting mission to justify a new architecture.  It certainly represents a complex hybrid-cloud opportunity, a fusion of the two hosting options.

IBM sees all of this; Krishna said “we have continued to deliver a series of new innovations in the last quarter. We launched our new Edge and Telco network cloud solutions, built on Red Hat OpenStack and Red Hat OpenShift, that enable clients to run workloads anywhere from a data center to multiple clouds to the edge.”  So, of course, do all the other public cloud vendors, and so do HPE and VMware, both of whom could be credible sources of new-architecture elements.  And, of course, with every possible advance of cloud technology into the world of telecom, we have pushback.

A recent story suggests that Kubernetes, the centerpiece of the container world, may not be the right orchestrator for NFV, citing a Cisco expert to buttress the point.  The issue, to me, seems linked to the fact that containers aren’t an ideal data-plane element and don’t fit the NFV model.  OK, but software instances of data-plane functionality hosted on commercial servers aren’t ideal either; white boxes designed for the mission would surely be better.  And the NFV model doesn’t seem to fit well with its own mission; most VNFs get hosted outside the cloud, not in it.  Containerized Network Functions (CNFs) are different from ordinary containers, if they really are different, only because the NFV community chose to make them so.  Nevertheless, the result of this could be a slowing of cloud-native adoption by operators, which would limit their ability to realize carrier cloud opportunities beyond 5G and NFV.

From the perspective of telco cloud services, IBM, then, may be taking a risk, but so are Google and those relying on some sensible carrier cloud thinking.  By taking their winnings when they can, IBM may emerge as the smart player at the table, particularly if the carrier cloud space descends into the disorder we’re becoming accustomed to seeing in the telco world.

I think that, in the net, the cloud opportunity generated in our post-COVID world will overcome the carrier cloud uncertainties.  Carrier cloud is less likely to be a decisive driver, not only because the carriers continue to fumble on the issue, but because COVID-related changes are clearly on the rise in the cloud space.  In that world, forces seem evenly balanced between IBM’s integration-transformational approach and Google’s (by my hypothesis, anyway) architectural approach.  I think the latter offers greater cloud penetration and more overall tech opportunity in the long term, but if I’m right about Google’s intentions, they need to start announcing their new direction and winning some planner hearts and minds.  Planning requires an understanding of an approach, while IBM’s approach requires only sales account control.