Making the Best of Multi-Cloud

Just how good an idea is multi-cloud?  If you follow the topic in online tech sites, you’d think that everyone was totally committed to multi-cloud, and that it was therefore the preferred approach.  My own data, as I’ve noted in a prior blog, doesn’t agree with popular reports, and my own experience in hybrid and multi-cloud consulting and research suggests that there are issues with multi-cloud that often catch enterprises by surprise.  Most of them could be significantly mitigated by simply planning properly.

Part of the problem with multi-cloud is deciding just what it means.  To most, “multi-cloud” means the use of two or more public cloud providers, with “public cloud provider” meaning any provider of cloud services.  Since SaaS is a cloud service and Salesforce’s offerings are widely used, this tends to make everyone who uses another public cloud, like AWS or Azure or Google or IBM, a multi-cloud user.  I don’t think that’s a helpful definition, so what I propose here is that we ignore SaaS in the determination of multi-cloud.  SaaS should, from a security, integration, and operations perspective, be considered separately anyway.

According to my research, and mindful of my definition above, multi-cloud usage divides sharply based on motivation.  The largest multi-national corporations tend to use multi-cloud because they can’t properly cover their global market area with a single cloud provider.  Among this group, comprising perhaps a thousand companies globally, multi-cloud is almost universal.  The other motivation is for backup, and here there are perhaps another two thousand companies spread across a wider size range.

A universal issue in multi-cloud is security.  As THIS Light Reading piece notes, there are security issues associated with multi-cloud, but they emerge from multiple sources and have to be addressed in different ways depending on cloud usage.

The universal security issue for multi-cloud is increased attack surface.  An “attack surface” is the set of places where an attack could be mounted, and obviously the more clouds you have, the more attack surface you have.  Every cloud provider has security risks, so the more providers you have, the more risks you’re exposed to.  It’s almost always possible to mitigate some of these risks, but not all of them.  An internal flaw in a cloud provider’s own defenses compromises that provider’s users, and it can’t be fixed easily by those users.

A second fairly common multi-cloud security issue is that differences in cloud provider architecture can make it difficult to create a common security framework across provider boundaries.  If you write components for optimal performance in AWS, the differences in how web services are invoked and managed mean that the same code almost surely won’t run on Azure.  You now have two versions of the code, which means two places where you have to deal with security and two different ways it has to be done.  The cloud-specific code problem bites in other areas, as we’ll see.

A related problem is that shedding cloud-provider web services to promote a common code set across clouds (sticking to vanilla IaaS and adding your own middleware for containers, IoT, and so forth) puts a greater security burden on the user.  All platform tools have the potential to open security holes, and if users have to supply everything above a naked virtual machine, there are plenty of places they could mess up.  Thus, users are often forced to choose: use cloud-provider web services and accept the higher cost and loss of code portability, or roll their own platform toolkit and accept the risk that they’ll fail to address security and operations challenges.
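To make the “roll your own” tradeoff concrete, here’s a minimal Python sketch of the kind of thin storage abstraction a multi-cloud team ends up writing, assuming the standard boto3 and azure-storage-blob SDKs (the interface and class names of the abstraction itself are purely illustrative).  The point isn’t the handful of adapter lines; it’s that every adapter like this is code the user now has to secure, patch, and operate.

    # A thin, user-owned storage abstraction: the price of portability is that
    # every line here is now the user's security and operations responsibility.
    from abc import ABC, abstractmethod

    class ObjectStore(ABC):
        @abstractmethod
        def put(self, key: str, data: bytes) -> None: ...
        @abstractmethod
        def get(self, key: str) -> bytes: ...

    class AwsS3Store(ObjectStore):
        def __init__(self, bucket: str):
            import boto3                                  # standard AWS SDK
            self._s3 = boto3.client("s3")
            self._bucket = bucket
        def put(self, key: str, data: bytes) -> None:
            self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)
        def get(self, key: str) -> bytes:
            return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()

    class AzureBlobStore(ObjectStore):
        def __init__(self, conn_str: str, container: str):
            from azure.storage.blob import BlobServiceClient   # azure-storage-blob v12
            self._client = BlobServiceClient.from_connection_string(conn_str) \
                                            .get_container_client(container)
        def put(self, key: str, data: bytes) -> None:
            self._client.upload_blob(name=key, data=data, overwrite=True)
        def get(self, key: str) -> bytes:
            return self._client.download_blob(key).readall()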

The other common problem with multi-cloud is operations and code compatibility.  As I suggested above, multi-cloud users are often stuck with a choice between building their own cloud middleware and then deploying it across multiple clouds, or keeping different application versions to account for the difference between web services in Cloud A and Cloud B.

Companies who adopt multiple clouds for multiple geographies alone will often be able to maintain independent code for each cloud, because there’s no expectation of redeploying components of applications across cloud boundaries.  Even in cases where one cloud is expected to back up another, enterprises say it’s often possible to maintain separate application components for each cloud, either because no cloud-provider features other than basic hosting are used, or because the cloud components aren’t changing rapidly enough to create a development bottleneck.  Things are more complicated if multiple clouds are seen as a part of a common resource pool, allocated freely as needed.

As we get into more complex cloud-native applications, it’s going to be very difficult to maintain multiple application versions.  Many experts agree that the future of cloud computing is a unified cloud resource pool to which everything contributes, including enterprise data centers and multiple cloud providers.  Even bare-metal hosting providers are likely to get enveloped in this concept, but it poses pretty significant challenges.

The most obvious challenge is the need to define a single application platform (a PaaS, in a real sense) that can be deployed onto everything, so application components can be hosted freely on whatever is best, without changes in the code or in operating practices.  Container architectures (Docker, Kubernetes, and the rest of the ecosystem) are the best underpinning for this sort of thing, since virtual machines provide too much latitude to customize middleware, making application portability problematic.
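As an illustration of what “deployed onto everything” might look like in practice, here’s a hedged sketch using the official Kubernetes Python client; the kubeconfig context names are hypothetical.  The same deployment object is pushed, unchanged, to any conformant cluster simply by switching contexts, which is exactly the portability the single-platform idea depends on.

    from kubernetes import client, config

    def deploy(context: str) -> None:
        # Same manifest, different cluster: only the kubeconfig context changes.
        config.load_kube_config(context=context)
        apps = client.AppsV1Api()
        web = client.V1Container(
            name="web",
            image="registry.example.com/web:1.4",        # illustrative image name
            ports=[client.V1ContainerPort(container_port=8080)])
        deployment = client.V1Deployment(
            metadata=client.V1ObjectMeta(name="web"),
            spec=client.V1DeploymentSpec(
                replicas=3,
                selector=client.V1LabelSelector(match_labels={"app": "web"}),
                template=client.V1PodTemplateSpec(
                    metadata=client.V1ObjectMeta(labels={"app": "web"}),
                    spec=client.V1PodSpec(containers=[web]))))
        apps.create_namespaced_deployment(namespace="default", body=deployment)

    # Hypothetical context names for an AWS cluster, an Azure cluster, and the data center.
    for ctx in ("aws-cluster", "azure-cluster", "on-prem-cluster"):
        deploy(ctx)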

Even with containers and the whole of the Kubernetes ecosystem, there’s still the problem of those web services offered by cloud providers.  Every cloud provider knows that basic IaaS is a commodity, meaning it’s under tremendous price pressure and offers no “stickiness” to hold a customer once they’ve committed.  Adding custom web services encourages applications to depend on them, so making the applications portable is the opposite of the cloud provider goal.

We are moving, slowly and irregularly, to develop a set of tools that are cloud-provider-independent.  Knative is a deployable serverless toolkit that at least somewhat replaces cloud provider serverless/functional computing services.  We also have deployable database and cloud storage options that replace cloud provider tools, and there are even specialized tools for IoT and other applications.  The problem with all of these is that they need to be integrated, which even large enterprises find difficult.

The net of this is that multi-cloud can’t reach its full potential unless there’s a single, integrated alternative to cloud-provider web services, a suite.  That can come only from a software vendor who builds out from the Kubernetes ecosystem, and who has credibility with enterprises as a platform provider.  The obvious contenders are IBM/Red Hat, VMware, and HPE.  Other up-and-coming possibilities would include Suse, which just acquired Rancher.

The major suite players are already striking alliances with cloud providers, creating a hybrid cloud framework that’s actually a multi-cloud web service suite in disguise.  If Cloud A and Cloud B can host OpenShift, for example, and so can enterprise data centers, then ultimately Cloud A and Cloud B can become commoditized by the loss of their own web-service differentiators.  This could be the way that the suite vendors gain cloud ascendency.

Or not, of course.  There’s a lot of work to be done turning the limited cloud suites of today into full-blown web-service alternatives for hybrid and multi-cloud.  The fact that the reward for success here would be truly enormous, even transformational to the industry at large, means that it’s very likely that every suite vendor will take a shot, and at least one might succeed.

The Reality of Autonomous Behavior

Autonomous behavior is often cited as a 5G, low-latency, or edge computing application.  That’s a vast oversimplification, in my view, and to understand why, we have to look at the way human reactions take place.  That should be the model for autonomous behavior, and it will demonstrate just what could and should be done with low latency or at the edge.  A good application to use in our example is autonomous vehicles, since that’s the most-hyped example out there.

Suppose you’re driving down a country road, and a deer steps out onto the road a hundred yards ahead.  You see the deer, and you likely take a second or two to see if it’s going to move away.  If it crosses into the field or brush, you likely slow a bit in case it reverses course.  If it doesn’t move off the road, you then take your foot off the accelerator, and if it still doesn’t move as you get closer, you apply the brakes.  This is a fairly typical encounter, and you’d probably do the same thing with a vehicle or person on the road.

Now suppose you’re a hundred feet from the point where the deer (or car, or person) comes out onto the road.  You don’t take time to consider here; instead, you immediately take your foot off the gas and prepare to brake.  If you don’t see the subject move off the road very quickly, you apply the brakes.  Again, a typical action.

Finally, suppose that you’re on that same road and a deer jumps out 20 feet in front of you.  You immediately jump on the brakes aggressively, because that’s what would be needed to avoid a possible collision.  Hopefully you don’t have this sort of experience often, but it does happen.

Let’s now try to categorize these three reactions, with the goal of deciding just what and where the reactions are processed.

We could call the first example a reasoned response.  There is a trigger, and the trigger sets up an assessment of the situation (when you “took a second or two”).  The action results from the assessment, not directly from the trigger.  After the assessed action, you’d have another assessment, perhaps several, in a kind of loop, until you either pass the point of risk or stop the vehicle.

The second one we can call a reaction.  Here, the trigger stimulates an immediate response, followed by an assessment of whether that response was appropriate.  From there, the situation is assessed as it would be in the first case.

The final case could be called a synapse, which is a direct connection from stimulus to response.  There is no assessment until the action, the “panic stop,” is complete.

If we want to complete our autonomy framework, we need to add a fourth thing, something completely different, which is a plan.  Suppose you’re following a route or heading to a specific destination.  You’ll still respond to conditions as in our first three examples, but in addition you’ll be sensitive to other triggers, such as the fact that you’re coming up on a turn or that traffic is getting heavy on your planned route, or perhaps that you’ve been looking for gas or a rest stop, and one is coming up.  What we have here is a set of different triggers, things that represent more systemic conditions.
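For those who prefer code to prose, here’s a rough sketch of the taxonomy in Python.  The time-to-contact thresholds are illustrative assumptions on my part, not engineering values; the point is only that the trigger itself, not the technology fashion of the moment, determines which class of response is required.

    from enum import Enum

    class ResponseClass(Enum):
        SYNAPSE = "synapse"      # direct stimulus-to-response, no assessment until done
        REACTION = "reaction"    # respond first, then assess
        REASONED = "reasoned"    # assess first, act, then re-assess in a loop
        PLAN = "plan"            # systemic triggers: route, traffic, fuel

    def classify_trigger(time_to_contact_s: float, systemic: bool = False) -> ResponseClass:
        """Illustrative thresholds only; real boundaries would be tuned per vehicle and speed."""
        if systemic:
            return ResponseClass.PLAN          # the upcoming turn, the rest stop
        if time_to_contact_s < 1.0:
            return ResponseClass.SYNAPSE       # the deer 20 feet ahead
        if time_to_contact_s < 3.0:
            return ResponseClass.REACTION      # the deer a hundred feet ahead
        return ResponseClass.REASONED          # the deer a hundred yards ahead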

Some 5G proponents will look at this and suggest that all of it is a formula for edge-computing-based, low-latency applications, but I think we have to put these four things into an autonomous context and validate those claims.  To do that, I propose to take them in order of “immediacy”.

The synapse-type response is obviously the one that’s most critical in terms of trigger-to-response latency.  In human-response terms, this is the kind of thing where you hope your own reaction time is fast enough.  The thing is, we already have on-vehicle automatic braking systems that provide the trigger-response combination.  Why would we elect to offload this sort of thing to a network-connected software element, when all manner of things would risk a delay and an accident?  In my view, there is absolutely no credibility to the synapse-type response justifying 5G, low-latency connectivity, or edge computing.

The reaction response differs from the synapse in two ways.  First, the response to the trigger doesn’t have to be as instantaneous.  It’s not instinct to hit the brakes as much as a fast assessment of the situation.  Second, the conditions that could give rise to the reaction response are more complex to assess.  The deer is 100 feet ahead, and so what specific technology lets us know that it’s a moving obstacle that’s now on the road, or perhaps about to be on the road?

The question here isn’t one of response timing as much as the assessment of the need for a response.  A radar or ultrasonic picture warning of proximity is easy to put on-vehicle, but for our reaction scenario, we’d almost surely need to have some form of image analysis.  The question is whether the analysis should be on-vehicle where we would have direct access to camera data, or whether it should be remote, in which case we’d have to have real-time network connection to the point of analysis.

I don’t think that having autonomous vehicles stream real-time video from every vehicle is a practical strategy, and certainly not in the near term.  Thus, where the visual scene has to be analyzed to provide input into autonomous behavior, the handling should be embedded in the vehicle.  Given that doorbells and cameras can be made to recognize faces, eyes, and even animals, I don’t think this is a tall order.

Is it possible that our reactive recognition might be a collaborative function?  Could a vehicle system perform an analysis of the scene, and then send the result to a cloud function that would perform an AI analysis on the results?  Yes, that would indeed be a nice approach.  The relevant frames could be abstracted to focus, for example, on what is moving on or toward the road, eliminating other distractions.  Think of a kind of wire-frame modeling.  This abstraction could be forwarded to an AI system that compares it to other “incidents” in order to classify it, with the result (an action) returned.  The response time doesn’t have to be instant, so this could be a credible 5G and/or edge computing mission.

The reasoned response would be quite similar, and in fact it could be handled by the same kind of logic.  All that’s required is that the initial AI assessment return a kind of “look again in x seconds” result, which would then repeat the analysis at that future point.  It might also set what could be called a “vigilant” state, where the autonomous system (like a human driver) would be watching more carefully, meaning it would be more likely to interpret a condition as requiring a reaction.
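Here’s a hedged sketch of how that collaborative reaction/reasoned loop might be structured; everything in it, from the object fields to the distance thresholds, is illustrative.  The on-vehicle side abstracts the scene into a compact “wire-frame” summary, ships only that summary to an edge-hosted classifier, and acts on the returned assessment, which carries the “look again in x seconds” and “vigilant” results described above.

    import time
    from dataclasses import dataclass
    from enum import Enum
    from typing import List, Optional

    class Action(Enum):
        NONE = "none"
        COAST = "coast"            # take your foot off the accelerator
        BRAKE = "brake"

    @dataclass
    class TrackedObject:           # the abstracted "wire-frame" view, not raw video
        kind: str                  # e.g. "animal", "vehicle", "person"
        distance_ft: float
        on_or_toward_road: bool

    @dataclass
    class Assessment:
        action: Action
        recheck_in_s: Optional[float]   # the "look again in x seconds" result
        vigilant: bool                  # bias later triggers toward a reaction response

    def classify_scene(objects: List[TrackedObject]) -> Assessment:
        """Stand-in for the edge-hosted AI that compares the abstracted scene
        to prior 'incidents'.  Thresholds are illustrative, not engineering values."""
        threats = [o for o in objects if o.on_or_toward_road]
        if not threats:
            return Assessment(Action.NONE, recheck_in_s=None, vigilant=False)
        if min(o.distance_ft for o in threats) < 150:
            return Assessment(Action.BRAKE, recheck_in_s=0.5, vigilant=True)
        return Assessment(Action.COAST, recheck_in_s=2.0, vigilant=True)

    def reactive_loop(abstract_scene, send_to_edge, apply_action):
        """On-vehicle side: abstract the scene locally, ship the compact summary,
        act on the returned assessment, and re-assess on the suggested schedule."""
        vigilant = False
        while True:
            assessment = send_to_edge(abstract_scene())
            apply_action(assessment.action)
            vigilant = assessment.vigilant
            time.sleep(assessment.recheck_in_s or (0.5 if vigilant else 2.0))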

Introducing planning changes the dynamic somewhat, but not necessarily as much as might be thought.  The reaction time for planned trigger-action combinations can be slower, of course, which makes it practical to do more off-vehicle.  The problem is that we already have GPS systems for cars that do most of the planning work for us.  Modern ones will get traffic updates and suggest taking alternative routes, too.  I think it’s possible that real-time data collection from vehicles could be assembled and aggregated to produce better results than we get today from a GPS, but this isn’t a 5G or edge computing mission; there’s no requirement for sub-second response.

There’s another dimension to autonomous behavior that has to be considered too, and it’s rarely mentioned.  What is the fail-safe procedure?  What happens if a self-driving vehicle loses its sense of self, if an autonomous big rig barreling down the highway suddenly finds itself unable to operate normally because something broke, or because an essential data connection was lost?  We already know that even today’s driver-assist systems, explicitly stated not to be suitable for autonomous operation, result in drivers sleeping at the wheel.  We can’t rely on driver intervention or attention, and don’t suggest an alarm to wake the driver; who knows what their reaction would be?

We need two fail-safe procedures, in fact.  One would be targeted at dealing with a system failure of the autonomous element, and the other with some major catastrophic problem that could result in having too many autonomous decisions colliding because nobody knows what others are doing.  We’ll take the latter first, because it has an easier answer.

It may be that the strongest argument for central control, even with distributed autonomy as the rule, would be the ability to mediate a response to some catastrophic event.  A major accident, a bridge failure, or any number of things that could totally disrupt traffic, could result in synchronized behavior from similar autonomous systems.  If everyone turns left at the next intersection to avoid a major traffic accident, the avoidance conditions could become worse than the trigger.

The system failure problem is hard, and there’s no getting away from that.  If autonomous systems fail no more often than vehicle mechanical systems or human drivers do, the failures could be statistically acceptable but still enough to create public backlash and even insurance penalties.  If they fail more often, it’s not likely that the technology could survive the bad publicity.  The issues could be mitigated if a failure produced a graceful response.

I think that it’s logical to assume that our synapse systems should be fully redundant, and that if they detected a failure of the higher-layer functions, they should flash all the vehicle lights and slowly pull over into a parking lot, the shoulder, or the curb.  Obviously sounding a warning to the passengers and/or human driver would also be smart.  It would also be smart to have such a fault reported to any centralized traffic or vehicle control function, to facilitate the management of evasion by nearby vehicles.
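A minimal sketch of that fail-safe logic follows, assuming a hypothetical low-level vehicle control interface.  A redundant, synapse-level watchdog monitors heartbeats from the higher-layer autonomy functions and, if they stop arriving, executes the graceful-stop sequence and reports the fault so nearby vehicles can be managed around the stopped one.

    import time

    class FailSafeWatchdog:
        """Synapse-level fail-safe; the heartbeat interval and control calls are illustrative."""
        HEARTBEAT_TIMEOUT_S = 0.5

        def __init__(self, vehicle, report_fault):
            self._vehicle = vehicle              # hypothetical low-level control interface
            self._report_fault = report_fault    # callback to central traffic/vehicle control
            self._last_heartbeat = time.monotonic()

        def heartbeat(self):
            """Called periodically by the higher-layer autonomous functions."""
            self._last_heartbeat = time.monotonic()

        def check(self):
            """Polled from the redundant synapse-level controller."""
            if time.monotonic() - self._last_heartbeat > self.HEARTBEAT_TIMEOUT_S:
                self.graceful_stop("higher-layer autonomy unresponsive")

        def graceful_stop(self, reason: str):
            self._vehicle.flash_hazard_lights()
            self._vehicle.sound_cabin_alert()
            self._vehicle.pull_over_slowly()     # shoulder, curb, or parking area
            self._report_fault(reason)           # let nearby vehicles be steered around us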

Where does all this leave us?  I think that the majority of the things that 5G or edge computing or low-latency advocates cite to justify their technology of choice in support of autonomous vehicles are nonsense.  However, there are clearly things that aren’t cited at all that would ultimately justify 5G and edge computing and latency management.  These things, if not handled properly, could threaten the value of autonomous vehicles, not to mention the public at large.  If proponents of 5G, low latency, and edge computing are serious, they should look to these issues instead of hyping what’s easy and sensational.

An Example to Facilitate Understanding Service Models

I post my blogs on LinkedIn to provide a forum for discussion, and on my blog on ONAP and its issues, Paul-André Raymond posted an interesting insight: “There is something more easily understandable about Monolithic architecture. It takes an effort to for most people to appreciate a distributed solution.”  I agree, and that made me wonder whether I could explain the issues of distributed lifecycle automation better.

We tend to personify things, meaning that we take automated processes and talk about them as though they were human, as though they were reflections of us as an individual.  We are monoliths, and so we think of things like lifecycle automation in a monolithic way.  We describe functions, and we assume those functions are assembled in a grand instance—us.  We’re not distributed systems, so we don’t relate naturally to how they’d accomplish the same task.

The interesting thing is that we work as a distributed system most of the time.  Imagine an army of workers building a skyscraper.  The workers are divided into teams, and the teams are grouped into activities that might be related to the floor being worked on or the craft involved.  There are then engineers and architects who organize things at the higher levels.

In a highly organized workforce, there is a strict hierarchy.  People have individual assignments, their “local” team has a collective assignment and a leader, and so forth up the organization.  If a worker has a problem or question, it’s kicked to the local leader, and if two teams have to coordinate something, the joint leader above does the coordinating.  This is also how the military works, in most cases.

Can these organizational lessons be applied to services, applications, and other stuff that has to be managed automatically?  I think so, but let’s frame out a mechanism to prove the point.  We’ll start from the bottom, but first let’s frame a unifying principle.  We have to represent the people in our workforce, and we’ll do that by presuming that each is represented by an “object” which is an intent model.  These models hide what’s within, but assert specific and defined interfaces and properties.  It’s those interfaces and properties that distinguish one from another, as well as their position in the hierarchy we’ll develop.

A worker is an atomic and assignable resource, and so we’ll say that at the lowest level, our worker-like intent model will represent a discrete and assignable unit of functionality.  In a network or application, the logical boundary of this lowest level would be the boundaries of local and autonomous behavior.  If we have a set of devices that make up an autonomous system, one that’s collectively functional and is assigned to a task as a unit, we’d build an intent model around it.

In a workforce, a worker does the assigned work, dealing with any issues that arise that fall within the worker’s assignment and skill level.  If something falls outside that range, the worker kicks the problem upstairs.  They were given the job because of what they can do, and they do it or they report a failure.  So it is with our service/application element—it has properties that are relied upon to assign it a mission, and it’s expected to perform that mission or report a service-level agreement violation.

The next level of our hierarchy is the “team leader” level.  The team leader is responsible for the team’s work.  The leader monitors the quality, addresses problems, and if necessary, kicks those problems upstairs again.  Translating this to a service/application hierarchy, a “team-leader” element monitors the state of the subordinate “worker” elements by watching for SLA violations.  If one is reported, then the “team-leader” element can “assign another worker”, or in technology terms break down the failed element and rebuild it.

I’ve mentioned the notion of “kicking” a problem or issue several times, and I’ve also mentioned “assignments”.  In technology terms, these would be communicated via an event.  Events are the instructional coupling between hierarchical elements in our structure, just as communications forms the coupling in a cooperative workforce.  And just as in a workforce, you have to obey the chain of command.  A given element can receive “issue events” only from its own subordinates, and can make “reports” only to its direct superiors.  This prevents a collision of action resulting from multiple superiors giving instructions (conflicting, of course) to a subordinate element.
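Here’s a small Python sketch of that chain-of-command rule; the element names and the single “SLA_VIOLATION” event are illustrative.  An element accepts issue events only from its direct subordinates, handles what falls within its jurisdiction, and kicks everything else upstairs.

    class Element:
        """Sketch of an intent-modeled element; names and behaviors are illustrative."""
        def __init__(self, name, superior=None):
            self.name = name
            self.superior = superior
            self.subordinates = []
            if superior:
                superior.subordinates.append(self)

        def raise_event(self, event):
            """Chain of command: issues go only to the direct superior."""
            if self.superior:
                self.superior.handle_event(self, event)

        def handle_event(self, source, event):
            if source not in self.subordinates:
                raise PermissionError("events accepted only from direct subordinates")
            if event == "SLA_VIOLATION":
                self.rebuild(source)          # "assign another worker"
            else:
                self.raise_event(event)       # outside our jurisdiction: kick it upstairs

        def rebuild(self, subordinate):
            print(f"{self.name}: tearing down and redeploying {subordinate.name}")

    # A worker reports an SLA violation; its team leader handles it locally.
    service = Element("service")
    team = Element("access-team", superior=service)
    worker = Element("vCPE-worker", superior=team)
    worker.raise_event("SLA_VIOLATION")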

At every level in our hierarchy, this pattern is followed.  Events from below signal issues to be handled, and the handler will attempt to deal with them within its jurisdiction, meaning as long as the changes the handler proposes remain confined to its own subordinate elements.  For example, if a “virtual firewall” element reported a failure, the superior handler element could tear it down and rebuild it.  If a “virtual-IP” network element failed, not only could that element be replaced, but the interfaces to other IP elements (representing network gateways between them) could be rebuilt.  But if something had to be changed in an adjacent access-network element as a result, that requirement would have to be kicked up to the lowest common handler element.

Each of our elements is an intent model, and since each has events to manage, each would also have specific defined states and defined event/state/process relationships.  When our virtual firewall reported a fault, for example, the superior element would see that fault in the “operating” state, and would perform a function to handle an operating fault in that element.

The relationship between superior and subordinate elements opens an interesting question about modeling.  It appears to me that, in most cases, it would be advisable for a superior model to maintain a state/event table for each of its subordinate relationships, since in theory these relationships would be operating asynchronously, and a master one for itself.  “Nested” state/event tables would be a solution.  It’s also possible that there would be a benefit to modeling specific interfaces between subordinate elements, network-to-network interfaces or NNIs in the terminology of early packet networks.
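Here’s a sketch of what “nested” state/event tables might look like in practice, with illustrative states and events.  The superior keeps a master state for itself plus one state/event table per subordinate relationship, and any state/event combination the table doesn’t cover is escalated to its own superior.

    # Per-subordinate state/event table: (state, event) -> process to invoke.
    SUBORDINATE_TABLE = {
        ("ordering",  "DEPLOYED"):      "activate_and_mark_operating",
        ("operating", "SLA_VIOLATION"): "teardown_and_redeploy",
        ("operating", "DELETED"):       "mark_terminated",
    }

    class SuperiorElement:
        def __init__(self, subordinates):
            self.state = "operating"                             # master state for itself
            self.sub_state = {s: "ordering" for s in subordinates}

        def on_event(self, subordinate, event):
            action = SUBORDINATE_TABLE.get((self.sub_state[subordinate], event))
            if action is None:
                return "escalate_to_superior"                    # not ours to fix
            return action                                        # dispatch to that process

    superior = SuperiorElement(["access-net", "core-net"])
    print(superior.on_event("access-net", "DEPLOYED"))           # activate_and_mark_operating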

A final point here:  It’s obvious from this description that our service model is modeling element relationships and not topology or functionality.  The models are hierarchies that show the flow of service events and responsibilities, not the flow of traffic.  It is very difficult to get a traffic topology model to function as a service model in lifecycle terms, which is why the TMF came up with the whole idea of “responsibility modeling” over a decade ago.

The Future of the Cloud and Network is…

Things change in networking and the cloud, and a big part of the reason is that new concepts that were being tried by a few experts are now becoming part of the mass market.

The mass market is different.  Early adopters are typically firms that have deep technical resource pools, a situation that the typical business user of technology can only envy.  When technology starts to broaden out, to seek its maximum accessible market, it has to be popularized, meaning it has to deal effectively with the challenges that the next tier of users will face, and deal with them on those users’ behalf.  So it is with containers and Kubernetes.

The service provider (or network operator) market is different, too.  Network operators represent an enormous incremental opportunity for new technology, as they confront a market that goes beyond connectivity into virtual resources and automated operations.  Winning the “carrier cloud” space could be worth a hundred thousand data centers, according to my model.  Whoever wins the carrier cloud space gets that revenue, but carrier cloud will demand highly efficient hosting and create its own challenges.  Kubernetes and containers will have to face this set of challenges too.

We’ve recently seen companies arguing over how great, or early, or both, their Kubernetes positions are.  That’s not the sort of marketing that’s going to win the hearts and minds of practical CIOs, of course.  What enterprise CIOs and operator CTOs want to know is how well a Kubernetes provider is doing keeping pace with the things that are changing their specific requirements, creating their risks and opportunities.  Extending Kubernetes to the future is, first and foremost, about extending Kubernetes.

Kubernetes so far has been a point solution to exploiting the inherent agility of containers.  People who have found application drivers for containerization ran into the orchestration wall, and there was Kubernetes.  Today, both the mass market and the giant carrier cloud market are trying to find the application drivers themselves.  There is a broader market now, so an easy answer is to try to make containers and Kubernetes easier, and the easiest way to do that is to establish a suite of some sort, a suite that combines Kubernetes with other tools to match the direction that both mass and carrier-cloud markets are evolving.

The leading Kubernetes players, suite-wise, have been Red Hat and VMware, with OpenShift and Tanzu, respectively.  OpenShift was arguably the first of the container/Kubernetes suites, and Tanzu came along when VMware accepted that Kubernetes was going to eat the world.  Now, HPE has entered the picture with its own Ezmeral software strategy, competing with the two incumbents.  The striking thing about this is that Red Hat was bought by IBM, Dell owns most of VMware, and now HPE—another IT giant—is getting into the picture.  It’s hard to escape the sense that Kubernetes and its suites are proxies in a broader battle for the data center.

Ezmeral is important to HPE, but perhaps mostly in a reactive sense, as THIS SDxCentral piece suggests.  As the Kubernetes market broadens, the difficulty prospects have in assembling their own Kubernetes environments increases, which is the justification for the suites in the first place.  For a server company like HPE, the lack of a Kubernetes suite means dragging in other players to get a complete container solution in place.  Since those other players are increasingly HPE competitors, that’s clearly not a good strategy.

Reactive steps tend to shoot behind the duck, though, because what you’re reacting to is what competitors are doing and not what they will be doing, what they’re currently planning.  Kubernetes has become the hub of a growing functional ecosystem, and even a container suite that’s aimed at unifying and integrating the basic container deployment framework doesn’t necessarily address the rest of that ecosystem.  What does?

We come now to those two markets, the mass market and the carrier cloud market.  The first is evolving with the broad interest in container adoption for microservice-based applications, where containers could fill a void between serverless applications and persistent deployment, providing scalability and resiliency without the potential cost explosions that serverless cloud can bring.  The second is driven by the knowledge that virtualizing service provider infrastructure to create that “carrier cloud” is perhaps the hottest potential single vertical-focused market opportunity out there.

We’re going to look at these two markets because they’ll create the missions that will then change cloud, container, and orchestration tools.  What those changes will be will then determine who gets the power, and the money, that comes from threading the needle of market variables to sew a golden cloak.

The microservice mission for containers foresees an application model that consists of a fairly large number of components, all used at a highly variable rate, meaning that balancing processing with workload levels is critical.  The presumption, given that a long string of components would create latency issues for many applications, is that while there are many possible components to be run for a given trigger, there aren’t many that will run.  Think of a couple hundred possible components, two to five of which are needed for a given trigger.

The technical foundation of this model is a service mesh, which is a combination of connectivity, discovery, and load-balancing all rolled into a common structure.  Istio is the up-and-coming service mesh champ as of now, and it’s a Google invention that pairs well with Kubernetes.  The Istio/Kubernetes combination prepares a business for the evolution of traditional IT missions to something more dynamic and interactive.

The beauty of this approach is that containers can then support applications that are very persistent, meaning they run all or nearly all the time, but also applications and components that are fairly transient.  That range of application can be extended by adding in a serverless scheduler, like Knative, that will associate events with containerized components and run them when the event comes along.

You might wonder why Knative is useful, given that Kubernetes alone, and Istio in combination with it, offer scaling, resilience, and load-balancing.  The reason is that sometimes there are many event-triggered processes that, collectively, run often, but individually run only rarely.  Think of a hundred processes, of which five or six are running at a given time.  When these processes are short-duration, containers, and even the service mesh combination, aren’t efficient.  Adding a serverless layer means that the same resource set (in my example, the resources needed to run five or six processes, with a reasonable safety margin) can support all hundred processes.
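The arithmetic is easy to sketch; the 50-percent safety margin below is an assumption on my part, not a recommendation.

    import math

    def pool_size(total_processes: int, avg_concurrent: int, safety_margin: float = 0.5):
        """Container 'slots' a shared serverless layer needs, versus one
        always-on container per process.  Inputs are illustrative."""
        shared = math.ceil(avg_concurrent * (1 + safety_margin))
        return shared, total_processes

    shared, always_on = pool_size(total_processes=100, avg_concurrent=6)
    print(f"serverless layer: {shared} slots versus {always_on} always-on containers")
    # -> serverless layer: 9 slots versus 100 always-on containers

Nine slots instead of a hundred always-on containers is the kind of efficiency that justifies the extra layer.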

As a final point, a broader cloud/hybrid-cloud commitment by enterprises opens the opportunity for additional tools that build on and extend that commitment.  Things like cloud backup and disaster recovery for hybrid cloud applications (the target of VMware’s Datrium acquisition) might benefit directly from an augmented platform, but they would surely benefit from greater container/cloud commitment.

Let’s move on now to the second of our exploitable trends, carrier cloud.  Network operators have been working for nearly a decade on creating services based on software instances of features/functions rather than on proprietary service-specific network hardware.  NFV (Network Functions Virtualization) is probably the best-known example.  In order to have software instances of functions, one must have a place to host them, and the term “carrier cloud” or “telco cloud” has been used to describe that resource pool.

For at least five years now, most people (including me) have assumed that the operators would build out their own cloud infrastructure to provide carrier cloud hosting.  I did modeling in 2013 that predicted that carrier cloud hosting would be, by 2030, the largest incremental source of new data centers.  These would be primarily edge data centers, too, which makes carrier cloud the biggest edge application.

The thing that makes carrier cloud big isn’t hosting NFV, though, but hosting network service features in general, including higher-layer features associated with things like IoT.  Thus, the early function-hosting opportunity is only the tip of the iceberg.  Where the real substance emerges is in the first “big” application of carrier cloud hosting, 5G.  Since 5G mandates virtualization across all its elements, implementation of “complete” or “standalone” 5G (5G Core) would mandate a virtual-function deployment of some sort.

The “of some sort” qualifier here is important.  If you refer back to my carrier cloud opportunity and driver model, you’d see that even in 2020, hosting NFV-style virtual functions accounts for less than a third of overall carrier cloud opportunity, and that by 2030 it would account for only 4% of incremental opportunity.  The question that poses is whether NFV principles would be a very small tail that somehow manages to wag an enormous dog, or whether future opportunities will drive a different model.  I think that the trends in 5G “carrier cloud” answer that question.

There are two important truths with respect to 5G hosting.  First, operators have indicated significant and unexpected willingness to lean on public cloud providers for their 5G Core hosting.  Second, the big 5G network vendors are emphasizing “cloud-native” implementations, which, while they don’t foreclose NFV, don’t emphasize it.

NFV is the wrong approach to carrier cloud.  Carrier cloud applications are really bicameral; they have a “data-plane” and “control-plane” brain, and the two think differently.  The control plane applications look much like traditional IT applications, and their evolution would follow the model I’ve described above.  Data plane applications are really less about dynamic hosting of functions than about supporting low-latency, high-traffic, links between fairly static instance locations.  I’ve described my vision for a cloud-native 5G Core HERE.

This truth (if you believe I’m right) combines with another truth (for which, again, your belief is optional).  Carrier initiatives for next-gen networking are, in general, a failure.  I refer here to my blog on ONAP, which is the proposed foundation for next-gen operations automation.  The problem is that operators are generally stuck in box-land, an age of device networks that their own initiatives are aimed at escaping.  That sets the stage for the big question associated with the carrier-cloud trend; do you make it look like a cloud initiative or an operator initiative?

I believe that the cloud providers have decided that “carrier cloud” is a special case of cloud.  They are going to blow kisses at carrier standards, knowing full well that the operators themselves will accept any strong and well-integrated implementation that even sort-of-promises not to diss their own standards initiatives.  They believe that will work, and they hold that belief because they know operators are asking them to host 5G in their public clouds.

For the Kubernetes suite players, the nature of carrier cloud is critical.  Do they go along with the cloud providers and bet on the cloud connection?  Do they believe that operators will hew to their own standards when they come to their senses, and thus bet their carrier cloud positioning on NFV?  The issue is more significant than some might believe.

If you buy into a cloud-centric vision of carrier cloud, you inherit the cloud’s notion of service-based deployments, its monitoring and management, and its ability to link to legacy software applications.  If you don’t, you have to assume that these things are going to come from the carrier side, meaning things like NFV and ONAP.  That means your approach stands or falls based on the success of those standards initiatives.  It means you’re bucking the cloud providers.

Could you adapt NFV and ONAP to containers, Kubernetes, service meshes, and the cloud?  Possibly, but remember that these initiatives are years along their path, have been paying lip service to cloud-centricity, and have yet to make any real progress in getting to it.  It sounds to me like a bad bet.

VMware has probably the longest-standing position with respect to carrier cloud, and their material has a distinct carrier-standards bias, despite the fact that VMware got a number one ranking in Cloud Systems and Service Management and clearly has a strong position in the cloud.  Red Hat’s website also cites their NFV credentials.  HPE doesn’t say much about carrier cloud in its Ezmeral material at this point, but has historically taken a very standards-centric position on NFV and carrier cloud.  Thus, our suite vendors are so far betting on something that could tend to separate the two trends in Kubernetes, with the applications going toward the Istio/Knative model and the carrier cloud side going to a totally different NFV-and-ONAP direction.  This separation could reduce the symbiosis between the two trends, and thus reduce the collective benefits that a single approach to the future cloud could yield.

Kubernetes and NFV can coexist, which has been Metaswitch’s point for some time.  They went so far as to say that the future of NFV Infrastructure (NFVi) was Kubernetes.  Metaswitch was acquired by Microsoft.  Ponder that, doubters.  CableLabs and Altran have combined to create the Adrenaline project, which unites feature hosting and Kubernetes.  Ponder that, too.  The point is that there’s a lot of evidence that a cloud-centric carrier-cloud model (as opposed to a standards-centric one) is evolving already.

Summary?  The best of all possible worlds for the Kubernetes suite players is that both network operators and enterprises shift toward the same Kubernetes ecosystem in the long term.  That would mean that both trends in the cloud could be addressed with a single, expanding, product offering.  My model says that the “unique” network operator drivers of carrier cloud will be almost inconsequential within a year or so, compared to the higher-layer service drivers that we see emerging.  That’s what I think the public cloud providers are betting on, and I think the suite players should bet the same way.