We Need to Think About More than Two-Layer Hybrid Clouds

There are layers of clouds in the sky.  Similarly, for a number of reasons, there are (or should be) layers in the cloud applications and infrastructure that companies are adopting or planning.  Yes, there are some applications, and even some businesses, that may view cloud and data center as a single resource, or adopt one or the other exclusively.  For most, though, we’re back to layers, so we need to understand what they are and how to manage them.

When I’ve surveyed businesses about their applications, or talked with them about application design and hosting trends, they’ve pointed out that most of their “applications” have a decidedly two-layer structure.  The bottom layer is the core logic of business applications, things like transaction processing and database maintenance.  The top layer is what we could call the “presentation” layer, where we create user access to applications, run analytics to generate reports, and so forth.  This is a good starting point because it’s the layer structure that businesses report spontaneously, making it a kind of observational perspective.

When we thought about cloud hosting in the early days of the cloud, everyone was thinking about transporting the entire two-layer structure to the cloud.  While there are applications that businesses were happy to transport en masse, most companies ran into issues with the bottom layer, where information security and regulatory or business policy compliance concerns tend to arise.

The current growth in cloud success, and the pace of adoption of what we call “hybrid cloud”, is due to the recognition by businesses that it’s possible to move the top layer to the cloud and leave the bottom layer in the data center.  This creates what I’ve called in my blogs the “front-end/back-end” approach to application-building.  A cloud front-end piece, designed for flexibility, resiliency, and agility, is married to a back-end portion that continues to be run in the data center.  This eliminates the compliance issues of the “move-to-the-cloud” approach, and adds flexibility to the GUI to support mobile devices and browsers, workers and customers, with customized portals.  Better insight leads to better adoption.

The obvious question is whether understanding layers better, or perhaps looking at more than “high clouds and low clouds”, would further improve our insights.  Why does the cloud work better for what I’ve called the “front-end” piece?  Is the back-end piece in the data center because it doesn’t suit the cloud, because of security/compliance issues, or what?

The interface between application and user is critically important, and even small improvements there result in significant improvements in overall quality of experience.  If we look at a typical transaction, we find that it tends to fall into four phases.  First, the user selects an action they want to take.  Second, they obtain baseline data to support the action.  Third, they do something that updates that data, and finally they get a visual confirmation of success (or an indication of failure).  These phases take place at human speeds, and a complete transaction might take a minute or more.

Front-end/back-end exchanges happen in the last three of our four phases.  In a traditional implementation, we click on a link to get a product description, and the back-end system delivers the associated information.  We do an explicit or implicit update (clicking a “buy” button is an implicit update of inventory), and the back-end system relieves the inventory, adds an item to accounts receivable, or processes a credit card.  We then get back a confirmation or refusal.

One thing users point out about the way these four phases are implemented is that the APIs involved are usually inherited from applications where the users were highly trusted.  Often there’s a generalized API that lets a requestor inquire about status, read existing data, write new data, etc.  This approach creates an important problem in our cloud layers discussion.  The database itself is always a protected asset.  When an application that has database rights is created, meaning when an API is developed, the API now has to be protected too.  This is analogous to expanding the base layer of our cloud layer stack, which keeps things in the data center.

Even where security/compliance isn’t an issue, things can get stuck in the data center layer of our application because of the cost of moving them out.  If an application routinely accesses a database, then there’s a good chance that the cost of access across the cloud boundary is going to be an incentive to keep the application in the data center, along with the data.  However, with transaction processing (as opposed to analytics), the database access is really done only to retrieve a very small number of records.  If we run a database query from the cloud, we may drag many records into the cloud and not just the ones we want, depending on how the database is accessed.

This essentially defines another layer, a set of application components that aren’t inherently tied to the data center, but which are effectively tied because of cloud pricing policies associated with the data access across the cloud boundary.  Eliminate the access charge, or eliminate a mass amount of data doing the crossing, and you can move these components to the cloud.  That could be accomplished, in some cases at least, by sending a query from the cloud to the data center and database, and then returning only the result.  Recall that in our four steps of transaction processing, we really saw only one record and updated that record.  If the query process is run locally to the database, there are only limited charges for exchanging the result of the query across the cloud boundary.
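The boundary-cost point can be sketched in a few lines of Python.  This is purely illustrative (the record counts, names, and cost proxy are all invented): the naive approach filters records on the cloud side of the boundary, while the query service runs next to the database and sends back only the one record requested.

```python
# Minimal sketch of cross-boundary data cost in the hybrid model.
# All names and numbers are invented for illustration.

# Simulated data-center database: thousands of records.
INVENTORY = {sku: {"sku": sku, "qty": sku % 50} for sku in range(10_000)}

def naive_cloud_query(sku):
    """Cloud-side filter: every record crosses the boundary before filtering."""
    all_records = list(INVENTORY.values())   # whole table "crosses" the boundary
    bytes_moved = len(all_records)           # proxy for egress cost
    match = next(r for r in all_records if r["sku"] == sku)
    return match, bytes_moved

def local_query_service(sku):
    """Data-center-side service: only the single result crosses the boundary."""
    match = INVENTORY[sku]                   # query runs local to the data
    bytes_moved = 1
    return match, bytes_moved

record, cost_naive = naive_cloud_query(42)
record2, cost_local = local_query_service(42)
assert record == record2                     # same answer either way
print(cost_naive, cost_local)                # → 10000 1
```

The two functions return the same record; the difference is entirely in what crosses the cloud boundary, which is exactly the layer-defining cost the text describes.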

In some cases, developers have created secondary databases that can be made cloud-resident, but another solution is to recognize that in our four phases of transaction processing, the whole database really isn’t being accessed, just a small, select set of records.  The problem of mixing the layers of our application arises because we don’t re-architect the APIs to fit the cloud model.  We don’t contain risk where we expose it, which is always the most effective way to deal with risk.

API restructuring would enhance security in itself, but it might also make it possible to move more of the transactional application into the cloud without additional loss of security.  The same would hold true for compliance concerns; if API restructuring could reduce compliance risks of cloud usage for a portion of an application, then compliance would not be a barrier to cloud migration.  Some pieces of core applications, if properly protected at the API level to limit what could be exposed there, could migrate to the cloud.  Another layer is created.

The point to all of this is that the prevailing hybrid cloud front-end/back-end model is a bit either/or, more so than needed.  We’re making assumptions in creating the two pieces that reflect old application design practices rather than modern requirements.  We could modernize our thinking, and if we did so, we could create more layers, layers of components that could live on either or both sides of the cloud boundary.  That could enhance application QoE considerably, and boost cloud prospects at the same time.

This also illustrates why cloud-native design is both important and different.  If you want to be able to create layers of application functionality based on mission and cost management, you’ll need to control the APIs that provide the linkage, or you can accidentally freeze your options into those original front-end/back-end polar layers, and that will defeat cloud-native implementation, no matter what techniques you decide to apply.

Why is Microsoft Taking its Own Service Mesh Path?

Microsoft has decided to go its own way on service mesh, or at least to try.  Given that Istio, a Google development but still open-source, is the de facto standard in service mesh, why would Microsoft make that decision?  Would a service mesh battle help the cloud-native space?  Is there a carrier cloud dimension to all of this?  Interesting things to speculate about, so let’s speculate.

One thing that doesn’t seem to demand much speculating is that the announcement timing is related to Google’s decision not to contribute Istio to the Cloud Native Computing Foundation (CNCF).  The fact that Microsoft will contribute its Open Service Mesh to the CNCF is the most obvious point made in the announcement.  Its support of CNCF-member-supported Service Mesh Interface API (SMI) is the second-most-obvious point.  The third?  It’s a “lightweight” implementation of service mesh.

All of the service mesh technologies in general use are based on something like the Envoy proxy, which is a software sidecar, an agent that links to a microservice and provides the microservice with basic connectivity features.  Microsoft proposed last year to make SMI a universal API, an open way of communicating with services (and proxies), and thus enhancing service portability among mesh implementations.  It seems pretty clear that Microsoft hopes that the combination of a lightweight service mesh and an open API will get Open Service Mesh a lot of traction.

That raises our first truly interesting question, which is “Why?”  What could it be about service mesh that matters so much?  Microsoft adopted Kubernetes, another Google project, after all.  The answer here, I believe, is that service mesh is the foundation of cloud-native, as much as Kubernetes is the foundation of container deployments.  Kubernetes is heading toward being a de facto standard, and if the market is heading to cloud-native, you can’t differentiate your cloud-native approach by adopting the same tools as everyone else.

Despite the way the market has talked about the cloud and cloud-native, it’s always been really a development issue.  The cloud has features that the typical data center doesn’t have.  If those features are to be fully exploited, applications have to be written to take full advantage of them, which is what “cloud-native” means.  The application was written to run in the cloud, not moved there from somewhere it was already running.

The central adaptation of components to cloud-native status is making them into microservices, which are stateless (in some way) features that can be spun up and run, taken down, moved, and so forth, as conditions demand.  Having a bunch of application features floating out there in the ether isn’t exactly how most application developers think, and the tools to make this work start with an abstraction of connectivity, or virtual connection fabric.  That’s what service meshes provide.

The reason this is important is that the cloud’s role in computing underwent a major transformation just last year, when enterprises recognized that a true “hybrid cloud” was a combination of their current data center and a cloud front-end component set that provided the user interface to browsers and mobile devices.  This insight showed that you don’t “move something to the cloud”, but rather create something new in the cloud that mates with your current applications.  That thinking is a stepping-stone to the realization that cloud-native applications would have to be truly different.  What makes them different, as enterprises will eventually realize, starts with service meshes.

In July, I blogged about why Google might want to retain development direction control over Istio and other projects that it elected not to submit to the CNCF.  My suggestion was that Google sees an architecture for cloud-native emerging, sees Istio as a key piece of that architecture, and wants to make sure Istio develops according to Google’s cloud-native vision.  Their decision on governance control wouldn’t prevent others from exploiting Istio, but it would prevent having Istio go off-track, which implies Google has a track for it.  Which implies it’s strategically important to Google, which implies it would be to Microsoft too.

The next interesting insight we can derive here relates to the “lightweight” property of Microsoft’s Open Service Mesh.  Why would a lightweight version of service mesh be needed when there are at least three well-known full-feature ones (Istio, Linkerd, and Consul)?  The answer is that all the current service meshes were designed with the full mission of service mesh in mind.  Enterprises, having just figured out that there really was such a thing as “cloud-native development”, are hardly pushing the technical boundaries of service mesh thinking with their current projects.  You don’t buy a backhoe when you need to plant a bush.

But you may need a backhoe eventually, if you’re going to dig enough holes.  I think Open Service Mesh is a basic framework, not a complete competitor to things like Istio but a framework on which further feature development can be built.  The CNCF, then, would be not only a way of rubbing Google’s nose in the you-know-what, but a way of encouraging open development of advanced service mesh features, based on the expanding experience and needs of enterprise cloud prospects and customers.

Whether this is going to work will obviously depend on two things.  The first is the pace at which enterprises gain insight into the totality of cloud-native development techniques and their value propositions.  The second is the pace at which open contributions to extend Open Service Mesh are offered and accepted.  If the first horse in our race runs way ahead, then Open Service Mesh quickly becomes a shovel in a backhoe race.  If the second outpaces the first, then Open Service Mesh can hope to address user needs as they develop, never confronting enterprises or enterprise software developers with more complexity than current applications demand.

If we rethink our horserace in terms of Google versus Microsoft, we could frame the race in terms of the pace of education versus the pace of development.  Google needs to accelerate the full cloud-native understanding of the development community.  They also need to ensure their own cloud can support the full range of Istio features, and do so with reasonable ease of use and cost.  Microsoft needs to prime the pump on Open Service Mesh enhancement, seeding any reasonable project with resources and contributing insight into the way new features should guide the expansion of service mesh overall.

Where, then, does “carrier cloud” fit into all of this?  Recent surveys have suggested that there are no new standards areas that operators feel are critical to them; they reject most by a 2:1 margin or better.  Surveys also suggest that operators believe that OSS/BSS modernization is hopelessly stalled (I found the same attitudes back in 2013).  Operator initiatives in software design (NFV, ONAP) don’t suggest a lot of experience with or understanding of cloud-native development.  Thus, I think, it would be safe to say that the operators are in no danger of pushing the boundaries of a basic service mesh.

And yet…5G function hosting is clearly a mission that all the public cloud providers think they could make money on.  What could we say is the cause of carrier cloud outsource interest, if not lack of understanding of carrier cloud by the carriers themselves?  We therefore could have the classic situation of a demand driver whose “demander” can’t muster much sophistication in meeting the demand themselves.  That could favor an Open Service Mesh model; simple things for simple folk.

Then there’s the mission of 5G Core hosting itself.  What telco standards group or open-source activity has framed out a carrier-cloud-ready architecture?  There’s no reason to think that the 3GPP did, or could do, that.  We could use a service mesh in the 5G control plane for sure.  We could (IMHO) adapt it to serve as an agile data plane too.  The technical requirements of service mesh in 5G control-plane applications are fairly basic, though (again, IMHO) still beyond the very basic Open Service Mesh feature set.  The adaptation of Open Service Mesh to the control-plane mission would be simple to accomplish, and adding data plane not much harder.

That’s if you know what you’re doing, of course.  Given that the 3GPP stuff isn’t really framed for cloud-native, some architecting of the structure, preserving the standard interfaces, will be necessary.  Microsoft may well be in the best position to do all this adapting and adding.  They bought Metaswitch, who has more experience with open, software-based, IMS/EPC/5G implementation than pretty much anyone out there.

The process could give Microsoft an edge in 5G, but also an edge in the battle for the hearts and minds of evolving service-mesh users.  5G signaling is a great event-driven application, a good test for a service mesh and microservice implementation.  As Microsoft/Metaswitch goes through it, Microsoft could learn a lot about what’s really needed in Open Service Mesh enhancement, and make sure their own cloud service is optimized to provide it.

The pressure, then, is likely on Google first, and Amazon and IBM as well.  Google needs to make 5G an Istio application, with a strong framework and whatever specialization is required.  They have perhaps the best microservice and service mesh people in the industry, but I’m not sure where they are with respect to 5G details.  They’ll need those skills if they want to ride 5G and carrier cloud to a broader service mesh (and public cloud) victory.

Or they could try to jump ahead, of course.  There’s still the possibility of educating the enterprise buyer, but there are a lot of enterprises to be educated, and only a dozen or so big network operators who collectively account for the majority of the 100,000 potential new carrier cloud data centers.  It’s the opportunity to get operators to outsource those data centers, and then to build on that base, that seems the most direct path to success here.  It will be interesting to see who takes it.

What Can We Really Expect from AI/ML?

How useful would artificial intelligence and machine learning (AI/ML) really be in network lifecycle automation?  The topic gets a lot of attention, and a lot of vendors have made claims about it, but the real benefits are actually difficult to assess.  Part of the reason is that there are many ways AI/ML could be applied, and part the fact that we tend to intuitively think of AI as “Hal” in “2001: A Space Odyssey”.

If we cut through the jargon, the goal of AI/ML in network operations is to provide some form of root cause analysis, some kind of response optimization, or both.  Lifecycle automation means detecting conditions, performing analysis to establish a most-likely cause, and then triggering remediation based on that cause.  We need to break these three points down to assess what might be practical in the application of AI/ML to networks.
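The three steps can be sketched as a toy pipeline.  All the condition names, severities, and rules below are invented for illustration; a real system would draw on actual telemetry and a far richer rule base.

```python
# A minimal sketch of the lifecycle automation loop: detect conditions,
# establish a most-likely cause, then trigger remediation keyed to that cause.

def detect(events):
    """Keep only events that signal an actionable condition change."""
    return [e for e in events if e["severity"] >= 2]

def analyze(conditions):
    """Toy root-cause rule: many alarms under one node suggest node failure."""
    nodes = [c["node"] for c in conditions]
    for node in set(nodes):
        if nodes.count(node) >= 3:
            return {"cause": "node-failure", "node": node}
    return {"cause": "isolated-fault", "node": nodes[0] if nodes else None}

def remediate(diagnosis):
    """Map the established cause (not the raw symptoms) to an action."""
    actions = {"node-failure": "reroute-around-node",
               "isolated-fault": "reset-link"}
    return actions[diagnosis["cause"]]

events = [{"node": "A", "severity": 3}] * 3 + [{"node": "B", "severity": 1}]
print(remediate(analyze(detect(events))))   # → reroute-around-node
```

Note that remediation keys off the diagnosed cause rather than individual events, which is the distinction the rest of this discussion turns on.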

Detecting conditions is a function of monitoring, which in the real world is all about receiving events that signal conditions or condition changes.  However, events don’t necessarily signal a need for action.  Some events may signal that previous actions are working, in fact, and others may signal the initiation of self-remediating processes down in the network.  The key point for monitoring is that in order to monitor effectively, you have to know what generated the event and how to read it.  There are plenty of monitoring tools available, and we have long experience in monitoring, so there’s not much we need to worry about here as far as gathering information is concerned.

But is gathering events the same as “detecting conditions?”  The first complicating factor in AI/ML is that detection of conditions and root-cause analysis are fuzzy areas, and you could rightfully assign some tasks to either of those overall steps.  We therefore have to identify those fuzzily-positioned tasks and see where that takes us.

The first task on our list is that of event association.  A network is a complex system of interconnected functionality.  An event is a report of conditions in a particular place, because probes that report conditions have to be placed somewhere and read conditions where they’re placed.  But what is that place, in the overall context of the network?  What, for example, is the association between information gathered at Point A and Point B?  Unless you know the network-topological relationship of those points, you can’t make proper association of the results.  Suppose the points are two ends of a trunk.  A packet count discrepancy would indicate data loss, but having either count in isolation might tell you nothing.  You need to be able to define associations, rooted in topology, that frame a kind of super-event from the relationship between sets of events.
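The trunk example can be sketched directly.  This is a minimal illustration with invented names: the topology table is what turns two meaningless per-point counters into a per-trunk “super-event”.

```python
# Topology-rooted event association: combine packet counts from the two
# ends of a trunk into a loss super-event. All names are illustrative.

TOPOLOGY = {("A", "B"): "trunk-1"}   # Points A and B are ends of one trunk

def associate(events):
    """Turn per-point packet counts into per-trunk loss super-events."""
    counts = {e["point"]: e["packets"] for e in events}
    super_events = []
    for (src, dst), trunk in TOPOLOGY.items():
        if src in counts and dst in counts:
            loss = counts[src] - counts[dst]
            if loss > 0:
                super_events.append({"trunk": trunk, "loss": loss})
    return super_events

events = [{"point": "A", "packets": 1000}, {"point": "B", "packets": 950}]
print(associate(events))   # → [{'trunk': 'trunk-1', 'loss': 50}]
```

Without the `TOPOLOGY` entry, neither count means anything; with it, the pair becomes one actionable indication of loss on a specific trunk.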

Then there’s actionability.  Not all events indicate a need for action to be taken.  Some simply report conditions, and we may want to log them and even analyze them in a later step as a part of a machine learning process, but we don’t want to kick off an operations process based on the event.  However, actionability is sometimes related to expected versus experienced conditions, where an event is actionable not because of what it is atomically, but what it is relative to what was expected.  That gets us into context.

The third task on our list is state/event relationships.  Many (perhaps most) events need context to be analyzed properly.  If a trunk reports zero traffic, is that an error?  It depends on whether there’s supposed to be traffic, and there might be no such expectation if we just took the trunk out of service or were diddling with the devices on either end.  We don’t want to trigger remediation based on an event that’s been created by our own reaction to a prior event or set of events.
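The zero-traffic example maps naturally onto a state/event table.  The table below is purely illustrative (the states, events, and actions are invented), but it shows how the same event produces different actions depending on context.

```python
# A sketch of state/event context: zero traffic on a trunk is a fault only
# if the trunk is supposed to be carrying traffic.

STATE_EVENT = {
    ("in-service",  "zero-traffic"): "raise-alarm",
    ("maintenance", "zero-traffic"): "ignore",        # we took it down ourselves
    ("in-service",  "traffic-ok"):   "ignore",
    ("maintenance", "traffic-ok"):   "raise-alarm",   # unexpected traffic!
}

def handle(state, event):
    """Look up the action for an event in the current operating state."""
    return STATE_EVENT[(state, event)]

print(handle("in-service", "zero-traffic"))    # → raise-alarm
print(handle("maintenance", "zero-traffic"))   # → ignore
```

The event is identical in both calls; only the state, the expectation, changes the outcome, which is exactly why discrete event handling without context misfires.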

Our final task is event correlation and root cause analysis.  This, in most cases, is going to involve the integration of all the prior task results, conducted with reference to the topology of the service or network in question.  If an automated (or human-driven) operations process responds to events discretely, it’s likely to step all over itself by treating symptoms rather than the problem.  A series of events that are related and contextualized should be examined to determine whether they stem from a specific condition.  If they do, and if that deduction can be validated, then the right action is one to address that common, specific, condition.

As I noted at the start of this blog, there are a lot of ways that we could think about applying AI/ML to network operations, but the ways can be broken into two groups.  The first group is the application of AI/ML directly to events, with contextualization and causal analysis done by successively analyzing the event stream through the tasks noted above.  The second is the creation of a contextual map that itself guides the entire event collection and analysis process.

It’s probably obvious to most of you that the first approach is going to have scalability challenges.  Everyone who ever spent any time in a network operations center (NOC) has seen the result of an event storm, where a single problem generates thousands of alerts that quickly swamp both the operator and the entire process of gathering and displaying events.  The bigger the network, the bigger the storm can be.  I don’t think this is the right way to go, AI/ML or not.

But what is a “contextual map” and how could we get one?  A contextual map is a model that divides a network, or a set of service resources, into functional units, each of which can be depicted as a “black box” with external interfaces and behaviors.  When an event is generated in this situation, instead of going to some master event-handler, it goes instead to a contextual handler defined within that black box.  The underlying principle is that a network/service’s state is the combined state of the unitary functions that make it up.  We deal with the functions individually, and collectivize the results.

Contextual maps are a dazzling insight, and they were postulated first by John Reilly in the TMF NGOSS Contract work.  Events are steered to processes via the state/event table of an element in the service contract data model.  The TMF service data model, the “SID”, wasn’t ideally structured for the mission, something John and I chatted about a number of times.  TOSCA, the OASIS standard “Topology and Orchestration Specification for Cloud Applications”, appears to me to have the capability to define contextual maps that could then be used to guide the way that events are analyzed.  I’m working through my own TOSCA guide (as time permits!) to validate the view and explain how it would work.

But what does this have to do with AI/ML?  I believe that contextual mapping is an essential step in applying AI/ML to network operations, because it reduces the scope of a given analysis and permits hierarchical assessment of conditions.  If Box 1 has an event, AI/ML can analyze within that scope, and if the analysis indicates the scope of the problem has to expand, then Box 1’s determination becomes an event to the next box up in the hierarchy.  By this means, I can handle a wide range of events with independent AI/ML instances, because each is handling the interior of a black box.  Wider scope just means kicking off an analysis at a higher-level box.
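The hierarchical escalation idea can be sketched as a chain of black boxes.  The box names, rules, and events here are invented; the point is that each box analyzes only within its own scope, and an unresolved determination becomes an event to the box above.

```python
# A sketch of the hierarchical "contextual map": each black box handles
# events in its scope; out-of-scope events escalate to the parent box.

class Box:
    def __init__(self, name, parent=None, can_resolve=()):
        self.name = name
        self.parent = parent
        self.can_resolve = set(can_resolve)

    def analyze(self, event):
        if event in self.can_resolve:
            return f"{self.name} resolved {event}"
        if self.parent:                    # out of scope: escalate upward
            return self.parent.analyze(f"unresolved-{event}")
        return f"{self.name} logged {event} for operator review"

network = Box("network-box")
region  = Box("region-box", parent=network, can_resolve={"unresolved-link-flap"})
access  = Box("access-box", parent=region,  can_resolve={"port-down"})

print(access.analyze("port-down"))    # → access-box resolved port-down
print(access.analyze("link-flap"))    # → region-box resolved unresolved-link-flap
```

Each `analyze` call here stands in for an independent AI/ML instance working the interior of one black box; widening the scope is just a call to the next level up, not a bigger model.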

What, though, could AI/ML do within a given black box?  If we have state/event tables, the boxes already contain (or, more accurately, their data model contains) the necessary indication-to-action representations.  The obvious answer is that AI/ML could replace the state/event tables, which would mean that having people sit down and figure out those event-to-process associations, and the states related to handling them, would no longer be required.  Machine learning, AI, or both combined could be used to create the event-to-process mapping needed, which could generate a lot more agile and effective event-handling while preserving the value of the service data model as a “contextual map” for the network or service.
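The substitution of learning for hand-built tables can be sketched very simply.  This is a deliberately naive stand-in for machine learning (it just remembers which action most often succeeded per state/event pair); a real system would use a proper learner with far richer features, but the role it plays, replacing the hand-authored table, is the same.

```python
# A sketch of an ML-style replacement for a hand-built state/event table:
# learn the event-to-action mapping from logged outcomes.

from collections import Counter, defaultdict

class LearnedTable:
    def __init__(self):
        self.outcomes = defaultdict(Counter)   # (state, event) -> action counts

    def observe(self, state, event, action, succeeded):
        """Record an outcome; only successes reinforce the mapping."""
        if succeeded:
            self.outcomes[(state, event)][action] += 1

    def lookup(self, state, event):
        """Return the most-often-successful action, or a safe default."""
        history = self.outcomes.get((state, event))
        return history.most_common(1)[0][0] if history else "escalate-to-operator"

table = LearnedTable()
table.observe("in-service", "zero-traffic", "reset-link", True)
table.observe("in-service", "zero-traffic", "reset-link", True)
table.observe("in-service", "zero-traffic", "reroute", False)
print(table.lookup("in-service", "zero-traffic"))   # → reset-link
print(table.lookup("in-service", "high-loss"))      # → escalate-to-operator
```

The safe default for unseen conditions matters: until the mapping is learned, the system defers rather than guesses, which connects directly to the learning-period problem discussed next.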

This still leaves us with a potentially open question, which is the last step of our process—identifying the appropriate/optimum action to take.  People have talked about how ML might “learn” the best way to address a given fault (properly contextualized, of course!), but the problem with ML is that it has to learn from experience, and while it’s learning it’s either got to be disconnected from the action-generating processes, or it has to be expected to fail some number of times.  It might take a long time for ML to get the experience it needs.

The good news here is that contextual mapping will reduce the learning period by containing the number of condition/action correlations that have to be learned.  The notion of dividing complex tasks into a series of simpler ones is a common human response to being overwhelmed by the scope of a problem, so it’s a good way to manage AI/ML too.  But even here, we may need some additional help.

I’ve noted in a couple of past blogs that we’re overlooking a powerful ally in network lifecycle automation—simulation.  If a set of conditions, created by the correlation of events within a context, can generate a recommended action or set of actions, simulations of the inside of that contextual black box could be run to establish the likely result of each action identified.  The result, particularly if it’s generated with a specific confidence level, could then either automatically trigger the optimum action, or present it for human assessment.
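A simulation-assisted action chooser might look something like the sketch below.  The simulator here is a toy with invented success rates standing in for a real model of the black box’s interior; the structural point is the confidence threshold that decides between automatic action and human referral.

```python
# A sketch of simulation-scored remediation: simulate each candidate action,
# then auto-apply or refer to a human based on the best predicted confidence.

import random

def simulate(action, trials=200, seed=7):
    """Toy simulator: fraction of trials in which the action fixed the fault."""
    rng = random.Random(seed)
    success_rate = {"reset-link": 0.55, "reroute": 0.90, "do-nothing": 0.05}[action]
    return sum(rng.random() < success_rate for _ in range(trials)) / trials

def choose_action(candidates, auto_threshold=0.8):
    """Score candidates by simulation; act automatically only if confident."""
    scored = sorted(((simulate(a), a) for a in candidates), reverse=True)
    confidence, best = scored[0]
    if confidence >= auto_threshold:
        return ("auto-apply", best)
    return ("refer-to-operator", best)

print(choose_action(["reset-link", "reroute", "do-nothing"]))
```

If no candidate clears the threshold, the same machinery presents the best option for human assessment instead, which is the dual-mode behavior suggested above.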

One obvious application of simulation would be creating baseline “states” that ML could be taught.  This is good, this is impaired, and this is pretty bad.  It might take a long time for all these conditions to be visible in a live network, available for ML to examine and learn from.  Giving it a head start with simulation could speed the process considerably.

Simulation could also model the way a given action might impact the network/service.  It could also, in theory, model the progression of faults.  Since simulation requires a model on which to base its recommendations, it could be said to help enforce the notion of contextual mapping.  That alone could be valuable, and it would make sense to leverage a simulation map or model to provide additional insight.  Think of it as a kind of machine learning, with a similar application to operations processes overall.

The conclusion here is that while AI/ML could be useful, the way they’re used has to be firmly anchored in the topology or context of the network/service, or the application of AI/ML is likely to create scalability and relevance issues.  You can’t accomplish anything by gluing the term onto a product.  You have to integrate it with the network and service, and to do that, you have to be able to represent the structure to define the constraints of your AI/ML.

I think AI and ML could be very beneficial in service automation and other operations tasks.  However, we still get stuck in “Hal” mode, expecting real human (or superhuman) intelligence.  We’re not there yet, but there are still things that we could do to enhance the way AI/ML is applied to operations tasks.  The same, of course, would be true for other tasks.  It would be helpful to focus on these things instead of just AI/ML-washing everything, don’t you think?

What Cloud Provider Revenue Growth for the Quarter Might Mean

Amazon’s cloud computing growth in the latest quarter dropped to 29%.  Microsoft’s cloud revenue growth also slowed, but it still grew 47%.  Google’s cloud revenue growth increased, posting a 43% gain.  Are we seeing some important shifts in the public cloud space?  I think so, but most of them have their roots in a broader shift in the cloud market that’s been happening for at least three quarters.

It’s interesting that, with everyone saying that the cloud is a bright spot in the lockdown-and-virus recession, the top cloud providers saw less-than-usual revenue growth.  I attribute this in part to the fact that, as we’ll see below, corporate cloud buyers account for only a portion of cloud revenues.  Many of the “online” companies consuming cloud services are ad-driven, and that group is clearly impacted by a decline in advertising resulting from the lockdown-linked retail slump.  But this isn’t the whole story.

None of the cloud providers offer a good breakdown of how their cloud revenues are derived, but most people realize that public cloud revenues are historically the combination of web-company revenues (OTTs, often social media sites or related services) and enterprise cloud spending.  In the early days of cloud computing, the first of these sources outstripped the second by margins of better than 3:1.  Amazon has always gotten a greater piece of the OTT-cloud business, and so it jumped out to an early lead.  Amazon’s greater exposure to this space is also likely a contributor to its slowing revenue growth in the quarter.

Cloud computing for enterprises, during this early period, was hampered by the lack of a realistic and practical model of cloud adoption.  The early expectation, the one that got all the media attention, was that enterprises would “move applications to the cloud”, meaning shift entire applications out of the data center to the cloud.  That was a practical response for some applications, but not the mission-critical ones that make up most of enterprise IT.  As a result, public cloud adoption by enterprises was slower than the cloud growth numbers (which included the OTT piece) showed.

In 2019, enterprises started to understand the ways in which public cloud computing could augment their data center apps rather than replace them.  The “hybrid cloud” talk that emerged was an imperfect description of what was starting to happen, which was that enterprises were adding cloud front-end pieces to legacy applications to enhance their accessibility via browsers or mobile apps.  The back end of these applications, the data center piece, wasn’t really cloud at all; it was the same stuff that had run in the data center all along.

The front/back application model resulted in significant growth in cloud adoption by enterprises in 2019, and that resulted in Microsoft, the public cloud provider with the best credentials for enterprise cloud applications, gaining market share on Amazon.  It also gave other providers of public cloud service, notably IBM and Google, a bit of a road map to how to achieve better public cloud sales.

In 2020, obviously, COVID came along, and the uncertainty associated with the virus and lockdown discouraged companies from making new capital investments in IT.  Expensing public cloud services made more sense.  In addition, COVID and the lockdown created a massive push for work-from-home (WFH), which in turn demanded secure and productive interfaces to core applications that could be pushed out to remote workers.  This is the dominant driver in the cloud market wars today.

The growth rate decline for both Amazon and Microsoft is a reflection of the fact that the front/back model of cloud applications came as a surprise to everyone, including users.  While most enterprises now understand that “hybrid cloud” means front/back application separation between cloud and data center, most have only begun to deploy based on this new realization.  I think that the next two quarters will show cloud revenue gains based on the exploitation of this model.

The front/back model of applications that’s now dominating is different from “hybrid cloud” because the latter term implies public and private clouds, and as I’ve noted, the current model is almost always implemented using a cloud front-end and legacy data center deployment in the back.  Since Amazon, in particular, initially viewed hybrid cloud more literally, their approach presumed that users would structure on-premises hosting to mimic AWS cloud services, which wasn’t the model that was succeeding.  Amazon is currently working to adapt their approach (more on that below), and the other cloud providers are doing likewise.  The way they approach the future, the pace at which their approach can address opportunity, and the extent to which the future opportunities align with each provider’s approach, will decide who gains share and who loses.
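To make the front/back separation concrete, here’s a minimal Python sketch of the division of labor.  All names and data here are hypothetical; in practice the back-end call would be an API request crossing the cloud/data-center boundary, not a local function.

```python
# Sketch of the front/back split: the front end runs in the cloud and
# handles presentation; the back end stays in the data center and owns
# transaction processing.  All names and records are illustrative.

def backend_lookup(account_id):
    """Stand-in for a data-center service; really an API call across
    the cloud/data-center boundary."""
    records = {"A100": {"balance": 250.0, "status": "open"}}
    return records.get(account_id)

def frontend_view(account_id, client="browser"):
    """Cloud front end: adapts the back-end record per client type,
    which is where the customized-portal flexibility comes from."""
    record = backend_lookup(account_id)
    if record is None:
        return {"error": "not found"}
    if client == "mobile":
        # Mobile portal gets a trimmed payload
        return {"bal": record["balance"]}
    return {"balance": record["balance"], "status": record["status"]}
```

The point of the split is visible in `frontend_view`: the presentation logic can change per device or per audience without touching the compliance-sensitive back end.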

A second market kicker that’s emerging is the public cloud provider recognition that “carrier cloud” could represent a truly enormous public cloud opportunity.  My modeling said about eight years ago that carrier cloud could, by 2030 and if all the drivers were realized, generate 100,000 incremental data centers, most at the edge.  It would then be the largest single driver of cloud services.  When, in 2020, operators started to show signs they might outsource some of their own carrier cloud applications, all the public cloud providers saw gold in them thar hills, as they say.

There isn’t an explicit need for cloud provider strategies for enterprises and for carrier cloud to line up; the positioning of the two would likely be very different, for example.  Some technological harmony might be helpful if the cloud provider intended to launch a sweeping strategy for the future cloud.  Let’s look at the providers and see what’s what.

Amazon has no specific account presence at the data center level, and rather than trying to develop one, it seems to want to establish a series of partnerships with key providers of data center technology.  VMware is an example; vSphere is a dominant data center platform, and Amazon’s been working with VMware to provide a means of linking vSphere and AWS efficiently.

One emerging force in this effort is container and Kubernetes technology.  Enterprises realize that containers are a good way to deploy both data center and cloud applications, and while harmony in deployment isn’t essential in creating a front/back cloud application model, it would offer a way to provide users with the opportunity to burst application components across the cloud/data-center boundary, which could facilitate further cloud migration down the line.
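The “bursting” idea can be sketched in a few lines of Python.  This is illustrative only, not how Kubernetes actually schedules; a real deployment would rely on cluster scheduling and federation, but the placement logic it approximates looks like this:

```python
# Illustrative sketch of burst placement: prefer the data center until
# its capacity is exhausted, then spill components across the boundary
# to the cloud.  Not a real scheduler; names are hypothetical.

def place_components(components, dc_capacity):
    """components: list of (name, size) tuples; returns a mapping of
    component name to hosting location."""
    placement = {}
    used = 0
    for name, size in components:
        if used + size <= dc_capacity:
            placement[name] = "datacenter"
            used += size
        else:
            placement[name] = "cloud"
    return placement
```

Because containers make a component look the same on either side of the boundary, a decision like this can be made at deployment time rather than baked into the application.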

Microsoft is in a better basic position for the enterprise cloud than Amazon, having better data center account engagement and a better front/back integration capability almost from the first.  However, they’ve also been enhancing their container approach, focusing strongly on “federation” of cloud container and Kubernetes deployment with the data-center counterpart.

In Microsoft’s carrier cloud story, there’s a strong element of edge computing.  Edge computing could also be used by Microsoft to create a kind of bridge between cloud and data center, particularly where development of applications focuses more on cloud-native technology, making it possible for the cloud architecture to leak over the boundary to the data center.

Google seems to have the most “futuristic” approach.  They’re building a cloud-native platform and even creating development tools needed to optimize for that vision.  Their theory, which is valid, is that it’s relatively easy to link cloud and data center simply to have a workflow cross over.  The question is how both sides of the boundary might optimize for the new cloud/data-center relationship.  If applications can be built and deployed for a universal application PaaS-like model spanning all hosting options, then future applications can exploit the cloud without change, so if enterprises lose their compliance and security fears, even mission-critical elements might qualify for cloud hosting.
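The “universal application PaaS” notion can be sketched simply: application code targets one deployment interface, and the hosting target is a configuration choice, not an application decision.  This is a hedged, hypothetical sketch of the portability property, not any provider’s actual API.

```python
# Sketch of a "universal PaaS" abstraction: the same component list
# deploys unchanged to any hosting target.  All names are hypothetical.

class HostingTarget:
    """Represents any hosting option: public cloud, edge, data center."""
    def __init__(self, name):
        self.name = name
        self.deployed = []

    def deploy(self, component):
        self.deployed.append(component)
        return f"{component}@{self.name}"

def deploy_app(components, target):
    # Applications never reference a specific host; swapping the
    # target moves the app across the cloud/data-center boundary
    # without code changes.
    return [target.deploy(c) for c in components]
```

If mission-critical components could be written against an abstraction like this, the compliance decision (cloud versus data center) becomes reversible, which is exactly the door the paragraph above describes.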

Architectures can cut both ways, as I’ve noted in past blogs.  Container hosting is a credible requirement for our universal application architecture, and that could take the form of managed Kubernetes federated across into the data center, or in the form of data center Kubernetes driving a cloud implementation.  In the latter case, there’s little opportunity for cloud providers to differentiate themselves, which may be why the providers are looking at making some fast moves.  Wait too long and a vendor like VMware or IBM/Red Hat might just come out with a cloud-independent vision.  That could level the cloud playing field.

More Carrier Cloud Action, with No Outsourcing

The latest in the carrier-cloud 5G story could be big, but not for the mainstream network infrastructure vendors.  Dish announced it had picked VMware’s Telco Cloud solution for its 5G network, a platform that will be hosting the Mavenir open RAN software and almost certainly their 5G Core elements as well.  VMware will serve as a prime integrator for other components of 5G and carrier cloud.  The operative question is whether the story just “could be big” or “will be”.

This deal is important for both VMware and Dish.  VMware needs to establish itself as a credible carrier cloud infrastructure vendor, a vendor who can build and integrate a 5G and carrier cloud ecosystem.  Dish needs something to save it from the continued decline of the cable-and-satellite TV business.

Dish is suffering from major losses in its satellite TV business, and has been struggling to figure out what it could do for its next act.  They had a satellite IoT strategy that, predictably, went nowhere because the initiative seemed to be driven more by IoT hype than any respectable business planning.  5G mobile is its next attempt, and obviously you have to take the prior IoT flop into account when judging the credibility of its initiatives there.

Still, Dish has made some sensible moves.  They acquired Boost Mobile from T-Mobile, which gave them access to T-Mobile’s network for seven years, and they’ve purchased over $20 billion in spectrum over the last ten years or so.  The Boost deal (T-Mobile had to sell Boost as a part of the Sprint/T-Mobile merger approvals process) gives Dish time to frame a logical 5G strategy free from the pressure that’s impacting other 5G players and hopefuls.

From the start, it’s appeared that Dish intended to use that time to formulate a 5G infrastructure plan based (as much as possible) on the cloud, on software, and on open-model networking principles.  Mavenir is one of the few companies who offer a 5G software solution from RAN to core, and they inked that deal in April of this year.  The decision to use VMware’s Telco Cloud software as the infrastructure platform gives credence to what I’ve heard, which is that Dish intends to implement the full 5G stack, including 5G Core software.  However, neither Dish’s nor VMware’s PR have made that point explicitly.

The first question this all raises is whether Dish is serious this time, and I think it is.  They’ve got no real options, other than to simply fade into the sunset.  While their satellite TV business is still supporting millions of customers, the business is deteriorating, along with margins and profits.  The deal for Boost and the spectrum acquisitions would seem to commit Dish to 5G mobile.

The second question is whether Dish can make this work, serious or not.  Are they still perhaps hopeful that 5G IoT will ride up to save them?  I think there’s an element of that thinking that still prevails, but it’s more a kicker on other drivers, or an attempt to redeem a costly blunder.  I think that the strategy of Boost plus spectrum plus Mavenir plus VMware suggests that they realize two important truths about their opportunity.

Truth One is that if they’re going to succeed in 5G, they’re going to have to manage infrastructure and operations costs relentlessly.  In the near term, there is little or no chance that any meaningful 5G technical differentiation will be possible, and little chance that any “killer app” will emerge that they could latch onto.  That means that profitable service offerings will have to be profitable because they have lower costs.  Buying the same technology as current 5G mobile competitors to achieve cost primacy would be a fairly foolish approach.

Truth Two, though, is that cost advantages shrink to a vanishing point over time, along with those who hope to survive on cost alone.  There will have to be something available as a service revenue kicker, and that something will almost surely have to be a form of over-the-top service set that leverages the agility of cloud-native technology.  The same drivers that dominate the future of “carrier cloud” in general are the opportunities that Dish has to try to exploit.  Thus, they need to be thinking “carrier cloud” from the first.

Realizing a truth is a necessary condition for successfully exploiting it, but not a sufficient condition.  Network operators have, for literally decades, ignored reality in their strategies for “transformation”.  The biggest advantage that Dish may have over those operators is the lack of history, the lack of a culture that’s locked the operators into “connection-think” with respect to services.  But even with a lack of negative bias, Dish will need positive skills to make this work, and from what I hear, they do not have them internally.  That means they’ll be dependent on Mavenir and VMware to supply them.

Both these companies have strong credentials in the specific areas where Dish has engaged them.  Mavenir knows 5G virtualization and infrastructure, including open RAN.  VMware knows cloud-native infrastructure and hosting.  That’s enough for Dish to realize the implications of Truth One, as I’ve outlined it above.

Who knows the rest of carrier cloud?  Truth Two says that Truth One is a transitional strategy, which implies that there has to be a transition to something else.  Carrier cloud depends, in the long term, on the successful exploitation of IoT-related services, personalization, and contextualization.  I’ve blogged about all of these in the past so I won’t bore you with more of it here.  The point is that there are enormous service opportunities associated with each of these things, opportunities the operators could realize with their own retail OTT offerings (subject to regulatory approval) or via wholesale features that they’d let OTTs compose at the retail level.

I don’t have any reason to believe that either VMware or Mavenir have any capabilities in these new areas.  They might, but given the fact that very few operators or vendors have demonstrated any vision beyond connectivity, it’s safer to presume that they don’t have the skills yet either.  That means that they’d have to develop them, and quickly.

The challenge this sort of thing creates for Dish is pretty obvious.  If they raise a banner but fail to raise an army, they signal their intentions to a host of others who may not make the same mistake.  Of particular concern to Dish’s hopes, and the hopes of Mavenir and VMware, are the public cloud providers Google, IBM, and Microsoft.  All of these players have clear carrier-cloud engagement hopes, and plans to exploit those hopes.

Google has had recent success in promoting relationships with telcos to outsource some carrier cloud applications relating to 5G.  Microsoft acquired Metaswitch, who like Mavenir is a pioneer in virtualized mobile infrastructure.  IBM has Red Hat, who can muster similar tools and capabilities to those of VMware.

VMware might benefit from Amazon’s inclusion of them in what appears to be an emerging telco and carrier cloud strategy, but Amazon is the cloud partner the operators seem to fear the most.  Still, Amazon hosts more OTT services in the cloud than any other cloud provider, perhaps as many as all the others combined.

One potentially ominous sign, IMHO, is the comment carried in Light Reading that “The companies explained that they will work together to test and certify vendors’ network functions as they are installed in Dish’s network, such as those from other Dish vendors like Mavenir and Altiostar.”  This could be interpreted as a dependence on NFV for the service functions.  NFV isn’t useful in 5G infrastructure, though it could be cobbled together to work.  It’s totally useless as a framework for other future carrier cloud services, which have no connection with devices and thus don’t fit the NFV mold of transforming devices into virtual devices.

VMware has this slant in its own Telco Cloud material, and I think it represents their biggest risk, both overall and for the success of the Dish initiative.  At the least, an NFV story tends to submerge the benefits of asserting cloud-native hosting and service creation, since NFV has yet to define any meaningful cloud-native initiative.  Ericsson, the leading “traditional” 5G vendor, is already buffing up its own 5G cloud-native story.  Thus, coming up with a real strategy for carrier-cloud function hosting that works for connection service elements and OTT service elements should be VMware’s, Dish’s, and even Mavenir’s top priority.