Why Does Google Want to Retain Development Control of those Three Projects?

A piece in Protocol on Google’s desire to control some of its open-source projects’ directions made me wonder why Google was willing to release Kubernetes to the CNCF but wants to hold back control of the direction of Istio, Angular, and Gerrit.  What do a service mesh, an application platform, and a code review tool have in common, if anything?  There might not be a common thread here, of course.  But Google is a strategic player, so is there something strategic about those particular projects, something Google needs to keep on track for its own future?  Could there even be one single, common thing?  It’s worth taking a look.

To be clear, Google isn’t making these projects proprietary; it’s just retaining governance of their development.  To some, that’s a small distinction, and fears have been raised that popular projects might end up under Google’s iron control.  Why these three specific projects, though?

Istio is a fairly easy one to understand.  Istio is a service mesh, a technology that does just what the name suggests, which is to provide a means of accessing a community of service instances that expand and contract with load, balancing the work and managing instances as needed.

What makes service meshes critical to Google is cloud-native development.  Stuff that’s designed for the cloud has to have a means of linking the pieces of an application to create a workflow, even though the pieces will vary and even the instance of a given component will vary.

Service mesh technology is also the upper layer in implementations of serverless computing that are integrated with a cloud software stack rather than delivered as cloud provider web services.  Google would certainly be concerned that open community interest could drive Istio in a direction that doesn’t fit the long-term Google cloud-native vision.

What’s the right direction?  One obvious candidate is “Istio federation”.  Recall that the current rage in Kubernetes revolves around means of “federating” or combining autonomous Kubernetes domains.  Service mesh technology, as an overlay to Kubernetes, might also benefit from the same kind of federating framework.  It would also create a bridge between, say, a Google Cloud with an Istio in-cloud feature set, and Istio software in a data center.

Another thing Google might be especially interested in is reducing Istio latency.  Complex service-mesh relationships could introduce significant delay, and that would limit the value of service mesh in many business applications.  Improving service mesh latency could also improve serverless computing frameworks, notably Knative.  Serverless doesn’t quite mean the same thing in a service-mesh framework because you’d still perhaps end up managing container hosts, but it does increase the number of services that a given configuration (cluster) can support.

We might get a glimpse of Google’s applications for Istio by looking at the next software package, Angular.  The concept of Angular evolved from an original JavaScript-centric tool (now popularly called “AngularJS”) to the current model of a web-based application platform built on TypeScript, an enhancement to JavaScript to allow for explicit typing.  Because Angular is portable to most mobile, desktop, and browser environments, it can be used to build what are essentially universal applications.
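To illustrate what explicit typing adds (this snippet is just my own sketch of TypeScript itself, not anything drawn from Angular), consider:

```typescript
// A typed record and a typed function: the compiler catches mismatches that
// plain JavaScript would only surface at runtime.
interface ServiceStatus {
  name: string;
  healthy: boolean;
  latencyMs: number;
}

function summarize(statuses: ServiceStatus[]): string {
  const unhealthy = statuses.filter(s => !s.healthy).length;
  return `${statuses.length} services, ${unhealthy} unhealthy`;
}

// summarize([{ name: "catalog", healthy: "yes" }]) would fail to compile;
// the mistake is caught before the code ever runs.
console.log(summarize([{ name: "catalog", healthy: true, latencyMs: 42 }]));
```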

There are two interesting things about Angular, one organizational and one technical.  The organizational thing is that it’s a complete rewrite of the original AngularJS stuff, by the same development team.  That shows that the team, having seen the potential of their approach, decided to start over and build a better model to widen its capabilities.  The technical thing is that Angular’s approach is very web-service-ish, which means that it might be a very nice way to build applications that would end up running as a service mesh.

Angular has been part of a reference microservices platform that included Istio and built an application from a distributed set of microservices in a mesh.  This would create a cloud-native model for a web-based application, but using a language that could take pieces of (or all of) the Angular code and host them on a device or a PC.

I have to wonder if Google is seeing a future for Angular as the way of creating applications that are distributable or fixed-hosted almost at will, allowing an application to become independent of where it’s supposed to be run.  If you could describe application functionality that way, you’d have a “PaaS” (actually a language and middleware) that could be applied to all the current models of application hosting, but also to the new cloud-native microservice model.  That would sure fit well with Istio plans, and explain why Google needs to stay at the helm of Angular development.

The connection between Istio and Angular seems plausible, but Gerrit is a bit harder to fit in.  Gerrit is an alternative to GitHub, a modernized repository model that’s designed specifically to facilitate code review.  Organizations used to GitHub often find Gerrit jarring at first, and some at least will abandon it after the initial difficulties overwhelm them.  It’s best to start with just a few (even one) main repositories and get used to the process overall before you try to roll Gerrit out across a whole development team.

Without getting into the details of either Gerrit or code review, can we say anything about why Google might see Gerrit as very strategic?  Well, think for a moment about the process of development, particularly rapid development, in a world of meshed microservices.  You are very likely to have multiple change tracks impacting some of the same components, and you surely need to make sure that all the code is reviewed carefully for compatibility with current and future (re-use) missions.

As I said up front, Google might have three reasons for three independent decisions on open-source direction in play.  The reasons might be totally trivial, too.  I think Google might also be looking at the future of software, to a concept fundamental to the cloud and maybe to all future development—the notion of virtualization.  If software is a collection of cooperative components that can be packaged to run in one place, or packaged to be distributed over a vast and agile resource pool, then it’s likely that developing software is going to have to change profoundly.

Would Google care about that, though?  It might, if the mapping of that virtual-application model to cloud computing is going to create the next major wave of cloud adoption.  Google is third in the public cloud game, and some even have IBM contending with Google for that position.  If Google wants to gain ground instead of losing it, would it benefit Google’s own cloud service evolution to know where application development overall is heading?

That’s what I think may be the key to Google’s desire to retain control over the direction of these three projects.  Google isn’t trying to stifle these technologies, it’s trying to promote them, but collaterally trying to align the direction of the projects (to which Google is by far the major contributor) with Google’s own plans for its public cloud service.

The early public cloud applications were little more than server consolidation onto hosted resources.  The current phase is about building cloud front-ends to existing applications.  Amazon has lost ground in the current phase because hybrid cloud isn’t its strength, and Microsoft and IBM are now more direct Google rivals.  IBM killed their quarterly earnings, and IBM Cloud made a strong contribution.  That has to be telling Google that if the hybrid-cloud game stays dominant, IBM and Microsoft will push Amazon and Google down.  They need a new game, one in which everything is up for grabs, and Google could come out a winner.  Real cloud-native could be that game, and these three projects could be the deciding factor for Google.

Can “Government Broadband” be Made to Work?

Are all government broadband initiatives doomed?  After I commented a bit on Australia’s NBN as an example of why you can’t count on a form of nationalization to save broadband, I got additional material on Australia, as well as commentary from people involved in various government-linked broadband initiatives here in the US.  I think the sum of the material sheds light on why so many (most, or even all?) such plans end up in failure.  That could serve to deter some, and perhaps guide the initiatives of governments determined to give it a try.

The single point I got from all my sources is that all studies commissioned to evaluate government broadband programs will be flawed.  In every case where such a study was done, the study forecast a shorter period to full deployment, better outcome, and lower costs than were experienced.  According to my sources, the average study missed the mark by almost 100% in each of these areas.  That seems too much of an error to be simple estimate difficulties; there seems to be a systemic issue.  In fact, there are several.

The issue most sources cite for study errors is that there is a desired outcome for the study, and it’s telegraphed to the bidders.  I’ve seen this in study RFPs I’ve received in the past, when my company routinely bid on these sorts of things.  “Develop a report demonstrating the billion-dollar-per-year market in xyz” is an extreme example, but it’s a near-quote of one such RFP’s opening.

The second main source of study errors is that the organization requesting the study has no ability to judge the methodology proposed or the quality of resources to be committed.  It does little good for an organization or government entity to request a study when they wouldn’t even know a plausible path to a good outcome, and yet the majority of studies are commissioned by organizations with no internal qualifications to assess the results.  In some cases, that’s not a barrier because (as my first point illustrates) the desired result is already known, and the study is just going through the motions.  In other cases, the organization requesting the study is simply duped.

The second-most-cited reason for the failure of government broadband projects is that a vendor or integrator misleads the government body about the capabilities of the technology.  Everyone who’s ever done any kind of RFP knows that vendors will push their capabilities to (and often past) their limits.  “To a hammer, everything looks like a nail” is an old saw that illustrates the problem.  Go to a WiFi specialist and you get a WiFi-centric solution, whether it’s best or not.

This is the biggest technical problem with government broadband.  Sometimes it’s the result of underestimating the pace of progress in technology relative to the timeline of the project.  If you embark on a five-year effort to do something, the fast-moving world of network technology is likely to render your early product examples obsolete before the end of the project is reached.  Sometimes, there are fundamental architectural issues that should have been recognized and were simply missed, or swept under the rug.

The third-most-cited source of problems with government broadband is lack of flexibility in dealing with unexpected issues.  This covers a number of more specific points.  First, government projects tended to push issues under the rug as they arose to avoid compromising the plan, which made the issues nearly impossible to address when they finally blew up.  Second, government projects were slow to adapt the plan to changes in conditions that clearly indicated adaptation was necessary.  Third, government broadband didn’t properly consider new technical options when they arose.

Then, of course, there’s the general complaint that all government broadband is too political.  This issue came out very clearly in Australia’s NBN, where the whole topic was a political pawn.  Politics tends to polarize the decision-makers on extreme opposite sides of any issue, and with broadband that tends to promote a kind of all-or-nothing mindset at every step of the project.

The input I got suggests that most involved in government broadband projects agreed with my point, which was that the best strategy was likely incentive payments to competing operators to induce the behavior the government wanted, rather than shouldering the free market aside and taking over.  A number of Australia’s operators tell me that they believe that the broadband situation would be far better had the government done nothing at all, and that a positive approach to dealing with the specific issues of a low-demand-density market would have served far better.

What, then, could a government do to optimize their chances of succeeding?  There are some specific points that seem to be consistent with the experiences my contacts related.

The step that’s suggested most often is perhaps the simplest:  Governments need to contract for a service level.  The most-cited success story in government/network partnerships is the one involving Google Fiber.  People will argue that Google cherry-picks its sites, but that’s not a reason to say that Google Fiber isn’t a good approach, only a reason to say it can’t be the only one.

Google Fiber tends to go after areas that have reasonable demand density but are under-served by traditional telco and cableco providers.  That there are such areas is proof that the competitive market doesn’t always create optimum strategies.  Some telco/cableco planners have confided that many, even most, Google Fiber targets were considered for new high-speed broadband, but that the market areas were too small to create sufficient profit, and there was a fear that other nearby areas would complain.

New technology of some sort, however, is almost surely required for improving broadband service quality in low-demand-density areas.  There’s too often a focus on reusing the copper-loop technology left behind by the old voice telephone services.  Rarely can this plant sustain commercially useful broadband quality, so a bid for a given service level has to be assessed considering the real capabilities of the technology to be used.

Perhaps the most important lesson of Google Fiber is that if a network company can exploit new technology to serve an area, they should be encouraged to do that, even to the point where the encouragement is a partnership with government.  I think that millimeter-wave 5G in conjunction with FTTN could well open up many new areas to high-speed broadband.  Since technology companies are more likely to understand this than governments, a corollary to this point is that governments should encourage approaches by network companies rather than pushing for something specific.

The second step is only a little behind the first:  Think small, meaning try to get local government initiatives before you look for something broader.  A specific city or county is more likely to be suitable for a given broadband strategy than an entire country.  A solution that’s applied nationally tends to encourage the spread of the approach to areas that really weren’t disadvantaged in the first place.  Did Australia have to create a national NBN, or should they have instead focused on regional solutions where Telstra and others weren’t able to create commercial services profitably?

It may be that “little government” is always going to do better with broadband programs, not because its people know more, but because they recognize their own limitations more readily.  It may also be true that the best thing a national government can do for broadband is to step aside and let the little guys drive the bus.  That, in my own view, is the lesson Australia should teach us all.

A Response to “Accelerating Innovation in the Telecom Arena”

Nobody doubts that the telecom ecosystem is in need of more innovation.  That’s why it’s a good sign that there’s a group of people working to promote just that.  They’ve published a paper (available HERE) and they’re inviting comments, so I’m offering mine in this blog.  Obviously, I invite the authors and anyone else to comment on my views on LinkedIn, where a link to this will be posted.

The paper is eerily similar to the “Call for Action” paper that, in 2012, kicked off NFV.  That the initiative seems to be promoted by Telecom TV also echoes the fact that Light Reading was a big promoter of NFV, including events and even a foundation that planned to do integration/onboarding.  These points don’t invalidate what the paper contains, but they do justify a close look.

Let’s take one, then.  A key question at the end of the paper is “What did we get wrong?”, so I’ll offer my view of that point first.  The biggest problem with the paper is that it’s a suggestion of process reform.  It doesn’t propose a specific technology or a specific approach, but instead talks about why innovation is important and how it’s been stifled.  With respect, many know the answers to that already, and yet we are where we are.

The paper makes two specific process suggestions: R&D partnerships with vendors and improvements to the standards process.  I heartily agree with the latter point; how many times have I said in my blog that telco standards were fatally flawed, mired in formalism, prone to excessive delays, and lacking in participants with the specific skills needed to “softwarize” the telco space?

NFV was softwarization, but it never really developed a carrier-cloud-centric vision.  Instead, it focused on universal CPE, and as 5G virtual features became credible, operators started looking at outsourcing that mission to public cloud providers.  No smaller, innovative, players there, and in fact really small and innovative players would have a major problem even participating in carrier standards initiatives.  I’ve been involved in several, but only when I’d sold some intellectual property to fund my own activity.  None of them made me any money, and I suspect that most small and innovative companies would be unable to fund their participation.

I could support the notion of improved R&D partnerships, if we knew what they looked like and what they sought to create.  What would the framework for those partnerships be, though?  Standards clearly don’t work, so open-source?  ONAP was an operator open-source project, and it went wrong for nearly the same technical reasons NFV did.  There’s not enough detail to assess whether there’s even a specific goal in mind for these partnerships.

Except for a single statement that I have to add to the “Where did we go wrong” point, with a big exclamation mark.  “Softwarization of the infrastructure (i.e. NFV, SDN, cloud) has, in theory, created opportunities for smaller, more innovative players, to participate in the telecommunications supplier ecosystem, but there remain significant barriers to their participation, including the systemic ones identified above.”  It’s clear from this statement, and other statements in the paper, that the goal here is to improve the operators’ ability to profit from their current network role.  Innovation in connection services, in short, is the objective.

I’ve said this before, but let me say it here with emphasis.  There is no way that any enhancements to connectivity services, created through optimization of data network infrastructure, can help operators enhance their profitability in the long term.  Bits are the ultimate commodity, and as long as that’s what you sell, you’re fighting a losing battle against becoming dirt.  Value is created by what the network supports in the way of experiences, not by how that stuff is pushed around.  Yes, it’s necessary that there be bit-pushers, but the role will never be any more valuable than plumbing.  The best you can hope for is to remain invisible.

As long as telcos continue to hunker down on the old familiar connection services and try to somehow make them highly differentiable, highly profitable, there’s nothing innovation can do except reduce cost.  How much cost reduction can be wrung out?  Cost management, folks, vanishes to a point.  There is no near-term limit to the potential revenue growth associated with new and valuable experiences.  The moral to that is clear, at least to me.  Get into the experience business.

Smaller, more innovative, players have tried to break into telecom for at least two decades, with data-network equipment that would threaten the incumbents.  It didn’t work; the giants of 20 years ago are the giants of today, with allowances made for M&A.  The places where smaller players have actually butted heads with incumbents and been successful have been above basic bit-pushing.  Trying to reignite the network equipment innovation war that small companies already lost, by moving the battlefield to software, just invites another generation of startup failures.  Anyway, what VC would fund it when they can fund a social network or e-commerce company?

Innovation, meaning innovators, goes where the rewards are highest.  Telco-centric initiatives are not that space, as VC interest demonstrates.  Will the operators fund initiatives to improve their infrastructure?  I don’t think so; certainly none have indicated to me that they’re interested, and without some financial incentives for those innovators, innovation will continue to be in short supply.

If operators are serious about bringing more innovation into their operations, they need to start by setting innovative service goals.  Different ways to push bits are really not all that different from each other.  Different ways of making money off bits are a totally different thing.  The whole of the Internet and the OTT industries was founded on that, so it’s perhaps the most proven approach to enhancing revenue we could find out there.

There is value to things like NFV hosting and 5G Core virtualization, as drivers to an expanded carrier cloud infrastructure that then becomes a resource for the other future higher-layer services an operator might want to provide.  However, we don’t need innovation initiatives to do this sort of thing.  We have cloud technology that’s evolving at a dazzling pace and addressing all the real issues.  I’ve blogged in the past about how cloud-native control-plane behavior, combined with white-box switching, could address connection-network requirements fully.  We don’t need to be able to spin up routers anywhere, we just need to spin them up at the termination of trunks.  But even white boxes can be part of a resource pool, supported by current cloud technologies like Kubernetes.

We don’t need NFV, nor do we need an NFV successor.  We can now turn to the cloud community for the solution to all our problems, even problems that frankly the operators don’t even know they have yet.  Where innovation is needed is in the area of service planning.  Bits are ones and zeros, so differentiating yourself with them is a wasted exercise.  It’s what they can carry that matters, and so innovative operators need to look there, and leave cloud infrastructure to people who are already experts.

Does Nokia Really Have a New Switching Strategy for the Cloud?

The heart of a cloud is the data center, obviously.  The heart of a cloud network is therefore a data center network, and as network equipment vendors struggle for profit growth, data center networking looks very promising.  Given that, it’s not very surprising that Nokia is taking a position in the space.  The question is whether it can turn a position into a win, and it won’t be easy.

Network vendors have been under profit pressure, in no small part because network operators and enterprises have been under pressure themselves.  The one bright area of growth has been the cloud space, and as I noted above, the data center network is the centerpiece of any public cloud network-building plan.  The same, by the way, is true for enterprises; my own research for well over two decades has shown that enterprises tend to build their networks out from the data center.  That gives those vendors with a data center position a big advantage.

The “well over two decades” point is critical here, at least for Nokia.  The position of the data center network has been known to the smarter vendors for at least that long, and players like Cisco and Juniper have been focusing increasingly on switching revenue for a very long time.  Nokia is surely coming from behind, and that’s the biggest reason why their move into the space is a risk.

Every market tends to commoditize over time, with the pace of commoditization depending on the extent to which new features and capabilities can be introduced to justify additional spending.  Data center switching is surely less feature-rich than routing, and routing is already commoditizing, so it follows that the switching space has been under price pressure.  In fact, Facebook has been pushing its FBOSS open switching for five years now, and “white-box” switching was a big feature in the ONF’s OpenFlow SDN stuff.

There’s also the problem of sales/marketing.  First and foremost, there has never been a carrier network equipment vendor who was good at marketing, given that their buyers were awful at it.  Nokia is no exception.  Then there’s the fact that to market effectively, even if you know the general principles of marketing, you have to know a lot about your buyer.  That intimate knowledge is going to come only from direct relationships, meaning sales contacts.  If you don’t call on a given prospect, you’re unlikely to know much about them, and if you’ve had no products to sell them, you’re unlikely to have called on them.  Is Nokia a household word among cloud providers?  Hardly.

This, for Nokia, sure looks like the classic long pass on fourth down in the last minute of a football game you’re losing.  What in the world would make them go for it, other than that very football-game-like desperation?  There are two possibilities, and making either of them work will demand Nokia make some big changes.

The obvious possibility is that Nokia is indeed in that last-pass situation.  They’re behind both Ericsson and Huawei in the mobile space, and 5G is the only hope for carrier network equipment.  The cloud providers are an opportunity, but so are large enterprises with their own data centers.  Rivals Cisco and Juniper have enterprise sales of data center switches, and that gives them an advantage over a rival who might focus only on the cloud providers and operators.  Could Nokia be looking to get into the switching space more broadly?  Maybe.

The other possibility is that Nokia is reacting to the significant developments in the carrier cloud space.  Network operators are committed to virtualization, and in today’s world the commitment is visible both in NFV in general, and in virtualization in 5G in particular.  Future opportunities like IoT seem almost certain to demand hosting, and so it’s long been said that carrier cloud could be the largest single source of new data center deployments—including by me.  The problem is that the carriers themselves have been extraordinarily (even for them) slow in developing any real carrier cloud plan, much less a commitment.  They’re now increasingly favoring outsourcing of at least the early carrier cloud applications to the public cloud providers.  Could Nokia see that outsourcing as a foot in the public cloud door today, and also see a future push by operators to return to hosting their own carrier cloud apps?  Maybe.

If Nokia wants to be a broad-market data center switching player, their biggest challenge is that they don’t call on enterprises today and have little or no name recognition.  To succeed, they’d need an incredibly well-done marketing program, with a great (possibly dedicated) website presence and great collateral.  Without this, the burden placed on their sales personnel would be impossibly large, and it would be difficult to compensate them for the time needed to develop accounts.

Targeting only the public cloud providers might make things just a bit easier from a sales-marketing perspective, but this group has been looking more and more toward white-box switching and price pressure on the deals would be intense.  Because the cloud provider buyers are quite platform-savvy, Nokia would need a highly qualified sales support staff to handle questions and step in when needed.

The last of the options seems best, at least in terms of opportunity versus effort.  Nokia is knowledgeable about the carrier cloud opportunity, more so than nearly all the public cloud providers, which means they actually have an asset they could push through sales/marketing channels.  Nokia, as a credible 5G player, has the same sort of positioning advantage in direct sales of carrier cloud infrastructure, so they could credibly tell a story of transition—start small with a public cloud using Nokia-approved technology, then migrate to your own clouds—to operators.

If all Nokia had was switches, this would be a lost battle for them.  They do have more, however, including what they call “a new and modern Network Operating System”, a version of Linux designed to take advantage of microservices and cloud-native operation to unite a system of data center switches.  This, obviously, could be extended to other devices at some point, though Nokia isn’t saying it will be.  This makes Nokia’s story a story of cloud switching, which could be compelling.

“Could be”, but in order for the story to deliver, it has to overcome that sales/marketing challenge.  Nokia, like Alcatel and Lucent before it, has always had a very geeky approach to the world, one that hasn’t so far been able to overcome the basic truth that no matter how smart you are, you can’t win with a strategy that assumes your buyers are smart too.

Smart switches may be a necessary condition for a smart cloud, a cloud that’s capable of being elastic, efficient, and operationalizable, but they’re not a sufficient one.  The cloud is an ecosystem, and you can’t optimize it by tweaking just the connective tissue.

The good news?  Sales/marketing problems are relatively easy to fix.  All it takes is determination and a good, systematic, approach.  The bad news?  Through the evolution of three companies, Nokia hasn’t been able to fix them.  We’ll see how they do this time.

Making the Best of Multi-Cloud

Just how good an idea is multi-cloud?  If you follow the topic in online tech sites, you’d think that everyone was totally committed to multi-cloud, and that it was therefore the preferred approach.  My own data, as I’ve noted in a prior blog, doesn’t agree with popular reports, and my own experience in hybrid and multi-cloud consulting and research suggests that there are issues with multi-cloud that often catch enterprises by surprise.  Most of them could be significantly mitigated by simply planning properly.

Part of the problem with multi-cloud is deciding just what it means.  To most, “multi-cloud” means the use of two or more public cloud providers, with “public cloud provider” meaning a provider of a cloud service.  Since SaaS is a cloud service and Salesforce’s offerings are widely used, this tends to make everyone who uses another public cloud, like AWS or Azure or Google or IBM, a multi-cloud user.  I don’t think that’s a helpful definition, so what I propose here is that we ignore SaaS in the determination of multi-cloud.  SaaS should, from a security, integration, and operations perspective, be considered separately anyway.

According to my research, and mindful of my definition above, multi-cloud usage divides sharply based on motivation.  The largest multi-national corporations tend to use multi-cloud because they can’t properly cover their global market area with a single cloud provider.  Among this group, comprising perhaps a thousand companies globally, multi-cloud is almost universal.  The other motivation is for backup, and here there are perhaps another two thousand companies spread across a wider size range.

A universal issue in multi-cloud is security.  As THIS Light Reading piece notes, there are security issues associated with multi-cloud, but they emerge from multiple sources and have to be addressed in different ways depending on cloud usage.

The universal security issue for multi-cloud is increased attack surface.  An “attack surface” is the set of places where an attack could be mounted, and obviously the more clouds you have, the larger that surface becomes.  Every cloud provider has security risks, so the more providers you have, the more risks you’re exposed to.  It’s almost always possible to mitigate some of these risks, but not all of them.  An internal flaw in a cloud provider’s own defenses compromises users, and can’t be fixed easily by those users.

A second fairly common multi-cloud security issue is that differences in cloud provider architectures can make it difficult to create a common secure framework across provider boundaries.  If you write some components for optimal performance in AWS, the differences in how the web services are invoked and managed will mean that the same code almost surely won’t run on Azure.  There are now two versions of the code, and that means two places where you have to deal with security and two different ways it has to be done.  The cloud-specific code problem bites in other areas, as we’ll see.
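One common way to keep those two code versions from spreading through the whole application is to confine the provider-specific calls behind a shared interface.  Here’s a minimal sketch of that idea in TypeScript; all of the names are hypothetical and the actual provider SDK calls are stubbed out rather than shown:

```typescript
// A provider-neutral contract the application codes against.
interface ObjectStore {
  put(key: string, data: Uint8Array): Promise<void>;
  get(key: string): Promise<Uint8Array>;
}

// One thin adapter per cloud; only these classes would touch the provider
// SDKs (omitted here), so provider-specific security review and testing
// are confined to a small surface.
class AwsObjectStore implements ObjectStore {
  async put(key: string, data: Uint8Array): Promise<void> { /* call the S3 SDK here */ }
  async get(key: string): Promise<Uint8Array> { /* call the S3 SDK here */ return new Uint8Array(); }
}

class AzureObjectStore implements ObjectStore {
  async put(key: string, data: Uint8Array): Promise<void> { /* call the Blob Storage SDK here */ }
  async get(key: string): Promise<Uint8Array> { /* call the Blob Storage SDK here */ return new Uint8Array(); }
}

// Application code depends only on the interface, so it runs unchanged on either cloud.
async function archiveReport(store: ObjectStore, id: string, body: Uint8Array) {
  await store.put(`reports/${id}`, body);
}
```

The catch, as the next point notes, is that the adapters and everything above them become your responsibility rather than the provider’s.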

A related problem is that shedding cloud-provider web services to promote a common code set across clouds (sticking to vanilla IaaS and adding in your own middleware for containers, IoT, and so forth) puts a greater security burden on the user.  All platform tools have the potential to open security holes, and if users have to supply everything above a naked virtual machine, there are plenty of places they could mess up.  Thus, the user is often forced to choose between using cloud-provider web services, accepting the higher cost and loss of code portability, or rolling their own platform toolkit, accepting the risk that they’ll overlook security and operations challenges.

The other common problem with multi-cloud is operations and code compatibility.  As I suggested above, multi-cloud users are often stuck with a choice between building their own cloud middleware and then deploying it across multiple clouds, or keeping different application versions to account for the difference between web services in Cloud A and Cloud B.

Companies who adopt multiple clouds for multiple geographies alone will often be able to maintain independent code for each cloud, because there’s no expectation of redeploying components of applications across cloud boundaries.  Even in cases where one cloud is expected to back up another, enterprises say it’s often possible to maintain separate application components for each cloud, either because no cloud-provider features other than basic hosting are used, or because the cloud components aren’t changing rapidly enough to create a development bottleneck.  Things are more complicated if multiple clouds are seen as a part of a common resource pool, allocated freely as needed.

As we get into more complex cloud-native applications, it’s going to be very difficult to maintain multiple application versions.  Many experts agree that the future of cloud computing is a unified cloud resource pool to which everything contributes, including enterprise data centers and multiple cloud providers.  Even bare-metal hosting providers are likely to get enveloped in this concept, but it poses pretty significant challenges.

The most obvious challenge is the need to define a single application platform (a PaaS, in a real sense) that can be deployed onto everything, so application components can be hosted freely on whatever is best, without changes in the code or in operating practices.  Container architectures (Docker, Kubernetes, and the rest of the ecosystem) are the best underpinning for this sort of thing, since virtual machines provide too much latitude to customize middleware, making application portability problematic.

Even with containers and the whole of the Kubernetes ecosystem, there’s still the problem of those web services offered by cloud providers.  Every cloud provider knows that basic IaaS is a commodity, meaning it’s under tremendous price pressure and it offers no “stickiness” in holding a customer in once they’ve committed.  Adding custom web services encourages applications to depend on them, so making the applications portable is the opposite of the cloud provider goal.

We are moving, slowly and irregularly, to develop a set of tools that are cloud-provider-independent.  Knative is a deployable serverless toolkit that at least somewhat replaces cloud provider serverless/functional computing services.  We also have deployable database and cloud storage options that replace cloud provider tools, and there are even specialized tools for IoT and other applications.  The problem with all of these is that they need to be integrated, which even large enterprises find difficult.

The net of this is that multi-cloud can’t reach its full potential unless there’s a single, integrated, alternative to cloud-provider web services, a suite.  That can come only from a software vendor who builds out from the Kubernetes ecosystem, and who has credibility with enterprises as a platform provider.  The obvious contenders are IBM/Red Hat, VMware, and HPE.  Other up-and-coming possibilities would include Suse, who just acquired Rancher.

The major suite players are already striking alliances with cloud providers, creating a hybrid cloud framework that’s actually a multi-cloud web service suite in disguise.  If Cloud A and Cloud B can host OpenShift, for example, and so can enterprise data centers, then ultimately Cloud A and Cloud B can become commoditized by the loss of their own web-service differentiators.  This could be the way that the suite vendors gain cloud ascendency.

Or not, of course.  There’s a lot of work to be done turning the limited cloud suites of today into full-blown web-service alternatives for hybrid and multi-cloud.  The fact that the reward for success here would be truly enormous, even transformational to the industry at large, means that it’s very likely that every suite vendor will take a shot, and at least one might succeed.

The Reality of Autonomous Behavior

Autonomous behavior is often cited as a 5G, low-latency, or edge computing application.  That’s a vast oversimplification, in my view, and to understand why, we have to look at the way human reactions take place.  That should be the model for autonomous behavior, and it will demonstrate just what could and should be done with low latency or at the edge.  A good application to use in our example is autonomous vehicles, since that’s the most-hyped example out there.

Suppose you’re driving down a country road, and a deer steps out onto the road a hundred yards ahead.  You see the deer, and you likely take a second or two to see if it’s going to move away.  If it crosses into the field or brush, you likely slow a bit in case it reverses course.  If it doesn’t move off the road, you then take your foot off the accelerator, and if it still doesn’t move as you get closer, you apply the brakes.  This is a fairly typical encounter, and you’d probably do the same thing with a vehicle or person on the road.

Now suppose you’re a hundred feet from the point where the deer (or car, or person) comes out onto the road.  You don’t take time to consider here, but instead immediately take your foot off the gas and prepare to brake.  If you don’t see the subject move off the road very quickly, you apply the brakes.  Again, a typical action.

Finally, suppose that you’re on that same road and a deer jumps out 20 feet in front of you.  You immediately jump on the brakes aggressively, because that’s what would be needed to avoid a possible collision.  Hopefully you don’t have this sort of experience often, but it does happen.

Let’s now try to categorize these three reactions, with the goal of deciding just what and where the reactions are processed.

We could call the first example a reasoned response.  There is a trigger, and the trigger sets up an assessment of the situation (when you “took a second or two”).  The action results from the assessment, not from the initial trigger.  After the assessed action, you’d have another assessment, perhaps several, in a kind of loop, until you either pass the point of risk or stop the vehicle.

The second one, we can call a reaction.  Here, the trigger stimulates a response, then an assessment of whether the response is appropriate.  The response is then assessed as it would be in the first case.

The final case could be called a synapse, which is a direct connection from stimulus to response.  There is no assessment until the action, the “panic stop”, is complete.

If we want to complete our autonomy framework, we need to add a fourth thing, something completely different, which is a plan.  Suppose you’re following a route or heading to a specific destination.  You’ll still respond to conditions as in our first three examples, but in addition you’ll be sensitive to other triggers, such as the fact that you’re coming up on a turn or that traffic is getting heavy on your planned route, or perhaps that you’ve been looking for gas or a rest stop, and one is coming up.  What we have here is a set of different triggers, things that represent more systemic conditions.

Some 5G proponents will look at this and suggest that all of it is a formula for edge-computing-based, low-latency, applications, but I think we have to put these four things into an autonomous context and validate those claims.  To do that, I propose to take them in order of “immediacy”.

The synapse-type response is obviously the one that’s most critical in terms of trigger-to-response latency.  In human-response terms, this is the kind of thing where you hope your own reaction time is fast enough.  The thing is, we already have on-vehicle automatic braking systems that provide the trigger-response combination.  Why would we elect to offload this sort of thing to a network-connected software element, when all manner of things would risk a delay and an accident?  In my view, there is absolutely no credibility to the synapse-type response justifying 5G, low-latency connectivity, or edge computing.

The reaction response differs from the synapse in two ways.  First, the response to the trigger doesn’t have to be as instantaneous.  It’s not instinct to hit the brakes as much as a fast assessment of the situation.  Second, the conditions that could give rise to the reaction response are more complex to assess.  The deer is 100 feet ahead, and so what specific technology lets us know that it’s a moving obstacle that’s now on the road, or perhaps about to be on the road?

The question here isn’t one of response timing as much as the assessment of the need for a response.  A radar or ultrasonic picture warning of proximity is easy to put on-vehicle, but for our reaction scenario, we’d almost surely need to have some form of image analysis.  The question is whether the analysis should be on-vehicle where we would have direct access to camera data, or whether it should be remote, in which case we’d have to have real-time network connection to the point of analysis.

I don’t think an autonomous vehicle strategy that demands real-time video from every vehicle is practical, certainly not in the near term.  Thus, where the visual scene has to be analyzed to provide input into autonomous behavior, the handling should be embedded in the vehicle.  Given that doorbells and cameras can be made to recognize faces, eyes, and even animals, I don’t think this is a tall order.

Is it possible that our reactive recognition might be a collaborative function?  Could a vehicle system perform an analysis of the scene, and then send the result to a cloud function that would perform an AI analysis on it?  Yes, that would indeed be a nice approach.  The relevant frames could be abstracted to focus, for example, on what is moving on or toward the road, eliminating other distractions.  Think of a kind of wire-frame modeling.  This could be forwarded to an AI system that would compare it to other “incidents”, allowing it to be classified, and the result (an action) returned.  The response time doesn’t have to be instant, but it could be a credible 5G and/or edge computing mission.

The reasoned response would be quite similar, and in fact it could be handled by the same kind of logic.  All that’s required is that the initial AI assessment return a kind of “look again in x seconds” result, which would then repeat the analysis at that future point.  It might also set what could be called a “vigilant” state, where the autonomous system (like a human driver) would be watching more carefully, meaning would be more likely to interpret a condition as requiring a reaction.
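To make that division of labor concrete, here’s a rough sketch of what the vehicle-to-edge exchange might look like.  Every name, field, and endpoint here is hypothetical; the point is the shape of the interaction, not any real API:

```typescript
// Abstracted scene summary produced on-vehicle (the "wire-frame" idea):
// only moving objects near the roadway are described; no raw video leaves the vehicle.
interface SceneSummary {
  vehicleSpeedMps: number;
  objects: { kind: "unknown" | "animal" | "vehicle" | "person"; distanceM: number; movingTowardRoad: boolean }[];
}

// What the edge/cloud AI classifier might return.
interface Assessment {
  action: "continue" | "slow" | "brake";
  reassessInSeconds?: number;   // the "look again in x seconds" result
  vigilant: boolean;            // bias later assessments toward a reaction
}

// Hypothetical edge endpoint; the round trip must fit the reaction budget,
// which is where low-latency connectivity and edge hosting would matter.
async function assessScene(summary: SceneSummary): Promise<Assessment> {
  const resp = await fetch("https://edge.example.net/assess", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(summary),
  });
  return (await resp.json()) as Assessment;
}
```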

Introducing planning changes the dynamic somewhat, but not necessarily as much as might be thought.  The reaction time for planned trigger-action combinations can be slower, of course, which makes it practical to do more off-vehicle.  The problem is that we already have GPS systems for cars that do most of the planning work for us.  Modern ones will get traffic updates and suggest taking alternative routes, too.  I think it’s possible that real-time data collection from vehicles could be assembled and aggregated to produce better results than we get today from a GPS, but this isn’t a 5G or edge computing mission; there’s no requirement for sub-second response.

There’s another dimension to autonomous behavior that has to be considered too, and it’s rarely mentioned.  What is the fail-safe procedure?  What happens if a self-driving vehicle loses its sense of self, if an autonomous big rig barreling down the highway suddenly finds itself unable to operate normally because something broke, or because an essential data connection was lost?  We already know that even today’s driver-assist systems, explicitly stated not to be suitable for autonomous operation, result in drivers sleeping at the wheel.  We can’t rely on driver intervention or attention, and don’t suggest an alarm to wake the driver; who knows what their reaction would be?

We need two fail-safe procedures, in fact.  One would be targeted at dealing with a system failure of the autonomous element, and the other with some major catastrophic problem that could result in having too many autonomous decisions colliding because nobody knows what others are doing.  We’ll take the latter first, because it has an easier answer.

It may be that the strongest argument for central control, even with distributed autonomy as the rule, would be the ability to mediate a response to some catastrophic event.  A major accident, a bridge failure, or any number of things that could totally disrupt traffic, could result in synchronized behavior from similar autonomous systems.  If everyone turns left at the next intersection to avoid a major traffic accident, the avoidance conditions could become worse than the trigger.

The system failure problem is hard, and there’s no getting away from that.  If autonomous systems fail no more often than vehicle mechanics or human drivers, the failures could be statistically acceptable but still enough to create public backlash and even insurance penalties.  If they fail more often, then it’s not likely that the technology could survive the bad publicity.  The issues could be mitigated if the failure produced a graceful response.

I think that it’s logical to assume that our synapse systems should be fully redundant, and that if they detected a failure of the higher-layer functions, they should flash all the vehicle lights and slowly pull over into a parking lot, the shoulder, or the curb.  Obviously sounding a warning to the passengers and/or human driver would also be smart.  It would also be smart to have such a fault reported to any centralized traffic or vehicle control function, to facilitate the management of evasion by nearby vehicles.
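Here’s a minimal sketch of how that graceful degradation might be organized, assuming a redundant synapse layer that watches a heartbeat from the higher-layer autonomy functions; the control calls are hypothetical stubs:

```typescript
// Hypothetical vehicle controls the synapse layer can still drive after a higher-layer failure.
type VehicleControls = {
  flashHazards(): void;
  soundCabinWarning(): void;
  pullOverSlowly(): void;
  reportFault(detail: string): void;   // to a central traffic/vehicle function, if one exists
};

// The redundant synapse layer watches a heartbeat from the higher-layer autonomy
// functions and executes the safe-stop sequence if the heartbeat stops.
function superviseAutonomy(controls: VehicleControls, heartbeatTimeoutMs = 500) {
  let lastHeartbeat = Date.now();

  const watchdog = setInterval(() => {
    if (Date.now() - lastHeartbeat > heartbeatTimeoutMs) {
      clearInterval(watchdog);
      controls.flashHazards();
      controls.soundCabinWarning();
      controls.pullOverSlowly();
      controls.reportFault("higher-layer autonomy lost; vehicle executing safe stop");
    }
  }, heartbeatTimeoutMs / 2);

  // The healthy higher layer calls this periodically to prove it's alive.
  return () => { lastHeartbeat = Date.now(); };
}
```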

Where does all this leave us?  I think that the majority of the things that 5G or edge computing or low-latency advocates cite to justify their technology of choice in support of autonomous vehicles are nonsense.  However, there are clearly things that aren’t cited at all that would ultimately justify 5G and edge computing and latency management.  These things, if not handled properly, could threaten the value of autonomous vehicles, not to mention the public at large.  If proponents of 5G, low latency, and edge computing are serious, they should look to these issues instead of hyping what’s easy and sensational.

An Example to Facilitate Understanding Service Models

I post my blogs on LinkedIn to provide a forum for discussion, and on my blog on ONAP and its issues, Paul-André Raymond posted an interesting insight: “There is something more easily understandable about Monolithic architecture. It takes an effort for most people to appreciate a distributed solution.”  I agree, and that made me wonder whether I could explain the issues of distributed lifecycle automation better.

We tend to personify things, meaning that we take automated processes and talk about them as though they were human, as though they were reflections of us as an individual.  We are monoliths, and so we think of things like lifecycle automation in a monolithic way.  We describe functions, and we assume those functions are assembled in a grand instance—us.  We’re not distributed systems, so we don’t relate naturally to how they’d accomplish the same task.

The interesting thing is that we work as a distributed system most of the time.  Imagine an army of workers building a skyscraper.  The workers are divided into teams, and the teams are grouped into activities that might be related to the floor being worked on or the craft involved.  There are then engineers and architects who organize things at the higher levels.

In a highly organized workforce, there is a strict hierarchy.  People have individual assignments, their “local” team has a collective assignment and a leader, and so forth up the organization.  If a worker has a problem or question, it’s kicked to the local leader, and if two teams have to coordinate something, the joint leader above does the coordinating.  This is also how the military works, in most cases.

Can these organizational lessons be applied to services, applications, and other stuff that has to be managed automatically?  I think so, but let’s frame out a mechanism to prove the point.  We’ll start from the bottom, but first let’s frame a unifying principle.  We have to represent the people in our workforce, and we’ll do that by presuming that each is represented by an “object” which is an intent model.  These models hide what’s within, but assert specific and defined interfaces and properties.  It’s those interfaces and properties that distinguish one from another, as well as their position in the hierarchy we’ll develop.

A worker is an atomic and assignable resource, and so we’ll say that at the lowest level, our worker-like intent model will represent a discrete and assignable unit of functionality.  In a network or application, the logical boundary of this lowest level would be the boundaries of local and autonomous behavior.  If we have a set of devices that make up an autonomous system, one that’s collectively functional and is assigned to a task as a unit, we’d build an intent model around it.

In a workforce, a worker does the assigned work, dealing with any issues that arise that fall within the worker’s assignment and skill level.  If something falls outside that range, the worker kicks the problem upstairs.  They were given the job because of what they can do, and they do it or they report a failure.  So it is with our service/application element: it has properties that are relied upon to assign it a mission, and it’s expected to perform that mission or report a service-level agreement violation.
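If we wanted to express that “worker” contract in software terms, it might look something like the sketch below.  The names are mine, not drawn from any modeling standard; the point is that only the properties, the mission assignment, and the SLA-violation report are visible from the outside:

```typescript
// What a worker reports upward when it can't meet its commitment.
interface SlaViolation {
  elementId: string;
  detail: string;
}

// The externally visible contract of a "worker" intent model: what it promises,
// not how it does it.  Everything inside the model stays hidden.
interface WorkerElement {
  readonly id: string;
  readonly properties: Record<string, string>;          // what it can do; used to assign it a mission
  assign(mission: string): void;                        // it gets the job because of its properties
  onSlaViolation(handler: (v: SlaViolation) => void): void;  // the only thing it tells its superior
}
```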

The next level of our hierarchy is the “team leader” level.  The team leader is responsible for the team’s work.  The leader monitors the quality, addresses problems, and if necessary, kicks those problems upstairs again.  Translating this to a service/application hierarchy, a “team-leader” element monitors the state of the subordinate “worker” elements by watching for SLA violations.  If one is reported, then the “team-leader” element can “assign another worker”, or in technology terms break down the failed element and rebuild it.

I’ve mentioned the notion of “kicking” a problem or issue several times, and I’ve also mentioned “assignments”.  In technology terms, these would be communicated via an event.  Events are the instructional coupling between hierarchical elements in our structure, just as communications forms the coupling in a cooperative workforce.  And just as in a workforce, you have to obey the chain of command.  A given element can receive “issue events” only from its own subordinates, and can make “reports” only to its direct superiors.  This prevents a collision of action resulting from multiple superiors giving instructions (conflicting, of course) to a subordinate element.

At every level in our hierarchy, this pattern is followed.  Events from below signal issues to be handled, and the handler will attempt to deal with them within its jurisdiction, meaning as long as the changes the handler proposes remain confined to its own subordinate elements.  For example, if a “virtual firewall” element reported a failure, the superior handler element could tear it down and rebuild it.  If a “virtual-IP” network element failed, not only could that element be replaced, but the interfaces to other IP elements (representing network gateways between them) could be rebuilt.  But if something had to be changed in an adjacent access-network element as a result, that requirement would have to be kicked up to the lowest common handler element.

Each of our elements is an intent model, and since each has events to manage, each would also have specific defined states and defined event/state/process relationships.  When our virtual firewall reported a fault, for example, the superior element would see that fault in the “operating” state, and would perform a function to handle an operating fault in that element.

The relationship between superior and subordinate elements opens an interesting question about modeling.  It appears to me that, in most cases, it would be advisable for a superior model to maintain a state/event table for each of its subordinate relationships, since in theory these relationships would be operating asynchronously, and a master one for itself.  “Nested” state/event tables would be a solution.  It’s also possible that there would be a benefit to modeling specific interfaces between subordinate elements, network-to-network interfaces or NNIs in the terminology of early packet networks.
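Here’s a sketch of what those nested state/event tables might look like for a superior element, one table per subordinate relationship plus a master table for itself.  The structure is illustrative only, not taken from any particular modeling language:

```typescript
type ElementState = "ordering" | "operating" | "fault" | "rebuilding";
type ElementEvent = "activated" | "slaViolation" | "rebuilt" | "childReport";

// A state/event table maps (state, event) to a handler process.
type StateEventTable = {
  [S in ElementState]?: { [E in ElementEvent]?: (elementId: string) => void };
};

interface SuperiorElement {
  id: string;
  masterState: ElementState;
  masterTable: StateEventTable;                         // the element's own lifecycle
  subordinateTables: Map<string, { state: ElementState; table: StateEventTable }>;
}

// Events may only flow up the chain of command: a superior accepts events
// from its own subordinates and reports only to its own superior.
function dispatch(superior: SuperiorElement, fromChildId: string, event: ElementEvent) {
  const child = superior.subordinateTables.get(fromChildId);
  if (!child) throw new Error(`event from non-subordinate ${fromChildId} rejected`);
  const handler = child.table[child.state]?.[event];
  if (handler) handler(fromChildId);                    // e.g. tear down and rebuild the child
  // otherwise kick the issue upstairs via the superior's own report path (not shown)
}
```

The dispatch check also enforces the chain-of-command rule from earlier: an element accepts issue events only from its own subordinates.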

A final point here:  It’s obvious from this description that our service model is modeling element relationships and not topology or functionality.  The models are hierarchies that show the flow of service events and responsibilities, not the flow of traffic.  It is very difficult to get a traffic topology model to function as a service model in lifecycle terms, which is why the TMF came up with the whole idea of “responsibility modeling” over a decade ago.

The Future of the Cloud and Network is…

Things change in networking and the cloud, and a big part of the reason is that new concepts that were being tried by a few experts are now becoming part of the mass market.

The mass market is different.  Early adopters are typically firms that have deep technical resource pools, a situation that the typical business user of technology can only envy.  When technology starts to broaden out, to seek the maximum accessible market, it has to be popularized, to deal effectively with the challenges that the next tier of users will face, and to face those challenges on the users’ behalf.  So it is with containers and Kubernetes.

The service provider (or network operator) market is different, too.  Network operators represent an enormous incremental opportunity for new technology, as they confront a market beyond connectivity, virtual resources, and automated operations.  Winning the “carrier cloud” space could be worth a hundred thousand data centers, according to my model.  Whoever gets the carrier cloud space gets that revenue, and carrier cloud will demand highly efficient hosting and create its own challenges.  Kubernetes and containers will have to face this set of challenges too.

We’ve recently seen companies arguing over how great, or early, or both, their Kubernetes positions are.  That’s not the sort of marketing that’s going to win the hearts and minds of practical CIOs, of course.  What enterprise CIOs and operator CTOs want to know is how well a Kubernetes provider is doing keeping pace with the things that are changing their specific requirements, creating their risks and opportunities.  Extending Kubernetes to the future is, first and foremost, about extending Kubernetes.

Kubernetes so far has been a point solution to exploiting the inherent agility of containers.  People who found application drivers for containerization ran into the orchestration wall, and there was Kubernetes.  Today, both the mass market and the giant carrier cloud market are trying to find the application drivers themselves.  There is a broader market now, so an easy answer is to try to make containers and Kubernetes easier, and the easiest way to do that is to establish a suite of some sort, a suite that combines Kubernetes with other tools to match the direction in which both the mass and carrier-cloud markets are evolving.

The leading Kubernetes players, suite-wise, have been Red Hat and VMware, with OpenShift and Tanzu, respectively.  OpenShift was arguably the first of the container/Kubernetes suites, and Tanzu came along when VMware accepted that Kubernetes was going to eat the world.  Now, HPE has entered the picture with its own Ezmeral software strategy, competing with the two incumbents.  The striking thing about this is that Red Hat was bought by IBM, Dell owns most of VMware, and now HPE—another IT giant—is getting into the picture.  It’s hard to escape the sense that Kubernetes and its suites are proxies in a broader battle for the data center.

Ezmeral is important to HPE, but perhaps mostly in a reactive sense, as THIS SDxCentral piece suggests.  As the Kubernetes market broadens, the difficulty prospects have in assembling their own Kubernetes toolchains increases, which is the justification for the suites in the first place.  For a server company like HPE, the lack of a Kubernetes suite means dragging in other players to get a complete container solution in place.  Since those other players are increasingly HPE competitors, that's clearly not a good strategy.

Reactive steps tend to shoot behind the duck, though, because what you’re reacting to is what competitors are doing and not what they will be doing, what they’re currently planning.  Kubernetes has become the hub of a growing functional ecosystem, and even a container suite that’s aimed at unifying and integrating the basic container deployment framework doesn’t necessarily address the rest of that ecosystem.  What does?

We come now to those two markets, the mass market and the carrier cloud market.  The first is evolving with the broad interest in container adoption for microservice-based applications, where containers could fill a void between serverless applications and persistent deployment, providing scalability and resiliency without the potential cost explosions that serverless cloud can bring.  The second is driven by the knowledge that virtualizing service provider infrastructure to create that “carrier cloud” is perhaps the hottest potential single vertical-focused market opportunity out there.

We’re going to look at these two markets because they’ll create the missions that will then change cloud, container, and orchestration tools.  What those changes will be will then determine who gets the power, and the money, that comes from threading the needle of market variables to sew a golden cloak.

The microservice mission for containers foresees an application model that consists of a fairly large number of components, all used at a highly variable rate, meaning that balancing processing with workload levels is critical.  The presumption, given that a long string of components would create latency issues for many applications, is that while there are many possible components to be run for a given trigger, there aren’t many that will run.  Think of a couple hundred possible components, two to five of which are needed for a given trigger.

The technical foundation of this model is a service mesh, which is a combination of connectivity, discovery, and load-balancing all rolled into a common structure.  Istio is the up-and-coming service mesh champ as of now, and it’s a Google invention that pairs well with Kubernetes.  The Istio/Kubernetes combination prepares a business for the evolution of traditional IT missions to something more dynamic and interactive.
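
To see what "connectivity, discovery, and load-balancing rolled into a common structure" actually takes off the application's hands, here's a toy sketch of that bundled logic.  In a real Istio deployment this lives in sidecar proxies and declarative configuration rather than in application code, and the registry, addresses, and service names below are hypothetical.

```python
import itertools

# Toy illustration of the three things a service mesh bundles together:
# resolve a service by name (discovery), pick an instance (load balancing),
# and retry on failure (resilient connectivity).

REGISTRY = {
    "payments": ["10.0.0.4:8080", "10.0.0.7:8080", "10.0.0.9:8080"],
}
FLAKY = {"10.0.0.4:8080"}  # pretend this instance is down

_round_robin = {name: itertools.cycle(addrs) for name, addrs in REGISTRY.items()}

def send(instance, request):
    # stand-in for an actual network call
    if instance in FLAKY:
        raise ConnectionError(f"{instance} unreachable")
    return f"{instance} handled {request}"

def call_service(name, request, retries=2):
    for _ in range(retries + 1):
        instance = next(_round_robin[name])     # discovery + load balancing
        try:
            return send(instance, request)      # connectivity
        except ConnectionError:
            continue                            # resilience: try the next instance
    raise RuntimeError(f"no healthy instance of {name}")

print(call_service("payments", {"amount": 10}))  # fails over to 10.0.0.7
```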

The beauty of this approach is that containers can then support applications that are very persistent, meaning they run all or nearly all the time, but also applications and components that are fairly transient.  That range of application can be extended by adding in a serverless scheduler, like Knative, that will associate events with containerized components and run them when the event comes along.

You might wonder why Knative is useful, given that Kubernetes alone, and Istio in combination with it, offer scaling, resilience, and load-balancing.  The reason is that sometimes there are many event-triggered processes that, collectively, run often, but individually run only rarely.  Think of a hundred processes, of which five or six are running at a given time.  When these processes are short-duration, containers alone, and even the container/service-mesh combination, aren't an efficient use of resources.  Adding a serverless layer means that the same resource set (in my example, the resources needed to run five or six processes, with a reasonable safety margin) can support that hundred processes.
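
A back-of-the-envelope version of that pooling argument, using the numbers from the example above plus a per-process cost and safety margin that are my own assumptions, looks like this:

```python
# Illustrative arithmetic: persistent containers for every process versus a
# serverless pool sized for peak concurrency plus an assumed safety margin.

processes = 100          # event-triggered processes defined
concurrent = 6           # typically running at any given moment
safety_margin = 2.0      # assumed headroom factor
units_per_process = 1.0  # assumed resource units per running instance

persistent_deployment = processes * units_per_process
serverless_pool = concurrent * units_per_process * safety_margin

print(f"always-on containers: {persistent_deployment:.0f} units")  # 100
print(f"event-driven pool:    {serverless_pool:.0f} units")        # 12
```

The exact ratio depends on the assumptions, but the shape of the benefit doesn't; the pool is sized for concurrency, not for the process inventory.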

As a final point, a broader cloud/hybrid-cloud commitment by enterprises opens the opportunity for additional tools that build on and extend that commitment.  Things like cloud backup and disaster recovery for hybrid cloud applications (the target of VMware’s Datrium acquisition) might benefit directly from an augmented platform, but they would surely benefit from greater container/cloud commitment.

Let’s move on now to the second of our exploitable trends, carrier cloud.  Network operators have been working for nearly a decade on creating services based on software instances of features/functions rather than on proprietary service-specific network hardware.  NFV (Network Functions Virtualization) is probably the best-known example.  In order to have software instances of functions, one must have a place to host them, and the term “carrier cloud” or “telco cloud” has been used to describe that resource pool.

For at least five years now, most people (including me) have assumed that the operators would build out their own cloud infrastructure to provide carrier cloud hosting.  I did modeling in 2013 that predicted that carrier cloud hosting would be, by 2030, the largest incremental source of new data centers.  These would be primarily edge data centers, too, which makes carrier cloud the biggest edge application.

The thing that makes carrier cloud big isn’t hosting NFV, though, but hosting network service features in general, including higher-layer features associated with things like IoT.  Thus, the early function-hosting opportunity is only the tip of the iceberg.  Where the real substance emerges is in the first “big” application of carrier cloud hosting, 5G.  Since 5G mandates virtualization across all its elements, implementation of “complete” or “standalone” 5G (5G Core) would mandate a virtual-function deployment of some sort.

The “of some sort” qualifier here is important.  If you refer back to my carrier cloud opportunity and driver model, you’d see that even in 2020, hosting NFV-style virtual functions accounts for less than a third of overall carrier cloud opportunity, and that by 2030 it would account for only 4% of incremental opportunity.  The question that poses is whether NFV principles would be a very small tail that somehow manages to wag an enormous dog, or whether future opportunities will drive a different model.  I think that the trends in 5G “carrier cloud” answer that question.

There are two important truths with respect to 5G hosting.  First, operators have indicated significant and unexpected willingness to lean on public cloud providers for their 5G Core hosting.  Second, the big 5G network vendors are emphasizing "cloud-native" implementations, which, while it doesn't foreclose NFV, doesn't emphasize it.

NFV is the wrong approach to carrier cloud.  Carrier cloud applications are really bicameral; they have a “data-plane” and “control-plane” brain, and the two think differently.  The control plane applications look much like traditional IT applications, and their evolution would follow the model I’ve described above.  Data plane applications are really less about dynamic hosting of functions than about supporting low-latency, high-traffic, links between fairly static instance locations.  I’ve described my vision for a cloud-native 5G Core HERE.

This truth (if you believe I’m right) combines with another truth (for which, again, your belief is optional).  Carrier initiatives for next-gen networking are, in general, a failure.  I refer here to my blog on ONAP, which is the proposed foundation for next-gen operations automation.  The problem is that operators are generally stuck in box-land, an age of device networks that their own initiatives are aimed at escaping.  That sets the stage for the big question associated with the carrier-cloud trend; do you make it look like a cloud initiative or an operator initiative?

I believe that the cloud providers have decided that “carrier cloud” is a special case of cloud.  They are going to blow kisses at carrier standards, knowing full well that the operators themselves will accept any strong and well-integrated implementation that even sort-of-promises not to diss their own standards initiatives.  They believe that will work, and they hold that belief because they know operators are asking them to host 5G in their public clouds.

For the Kubernetes suite players, the nature of carrier cloud is critical.  Do they go along with the cloud providers and bet on the cloud connection?  Do they believe that operators will hew to their own standards when they come to their senses, and thus bet their carrier cloud positioning on NFV?  The issue is more significant than some might believe.

If you buy into a cloud-centric vision of carrier cloud, you inherit the cloud's notion of service-based deployments, its monitoring and management, and its ability to link to legacy software applications.  If you don't, you have to assume that these things are going to come from the carrier side, meaning things like NFV and ONAP.  That means your approach stands or falls based on the success of those standards initiatives.  It means you're bucking the cloud providers.

Could you adapt NFV and ONAP to containers, Kubernetes, service meshes, and the cloud?  Possibly, but remember that these initiatives are years along their path, have been paying lip service to cloud-centricity, and have yet to make any real progress in getting to it.  It sounds to me like a bad bet.

VMware has probably the longest-standing position with respect to carrier cloud, and their material has a distinct carrier-standards bias, despite the fact that VMware got a number one ranking in Cloud Systems and Service Management and clearly has a strong position in the cloud.  Red Hat’s website also cites their NFV credentials.  HPE doesn’t say much about carrier cloud in its Ezmeral material at this point, but has historically taken a very standards-centric position on NFV and carrier cloud.  Thus, our suite vendors are so far betting on something that could tend to separate the two trends in Kubernetes, with the applications going toward the Istio/Knative model and the carrier cloud side going to a totally different NFV-and-ONAP direction.  This separation could reduce the symbiosis between the two trends, and thus reduce the collective benefits that a single approach to the future cloud could yield.

Kubernetes and NFV can coexist, which has been Metaswitch’s point for some time.  They went so far as to say that the future of NFV Infrastructure (NFVi) was Kubernetes.  Metaswitch was acquired by Microsoft.  Ponder that, doubters.  CableLabs and Altran have combined to create the Adrenaline project, which unites feature hosting and Kubernetes.  Ponder that, too.  The point is that there’s a lot of evidence that a cloud-centric carrier-cloud model (as opposed to a standards-centric one) is evolving already.

Summary?  The best of all possible worlds for the Kubernetes suite players is that both network operators and enterprises shift toward the same Kubernetes ecosystem in the long term.  That would mean that both trends in the cloud could be addressed with a single, expanding, product offering.  My model says that the “unique” network operator drivers of carrier cloud will be almost inconsequential within a year or so, compared to the higher-layer service drivers that we see emerging.  That’s what I think the public cloud providers are betting on, and I think the suite players should bet the same way.

Is ONAP Advancing or Digging a Deeper Hole?

The announcement of ONAP’s Frankfurt Release last month raised a lot of questions from my contacts and clients.  There is no question that the release improves ONAP overall, but it still doesn’t change the underlying architecture of the platform.  I’ve described ONAP as a “monolithic” model of zero-touch operations automation, and said that model is simply not the right approach.  In a discussion with an EU operator, I got some insights into how to explain the difference between ONAP and the (far superior, in my view) TMF NGOSS Contract model.

We think of networks as vast interconnected collections of devices, which is true at the physical level.  At the operations level, though, a network is a vast, swirling, cloud of events.  An event is an asynchronous signal of a condition or condition change, most often a change that represents a significant shift in state/status.  In a pure manual operations world, human operators in a network operations center (NOC) would respond to these events by making changes in the configuration or parameterization of the devices in the network.

An automated lifecycle management system, like those humans in the NOC, has to deal with events, and as usual there's more than one way to do that.  The obvious solution is to create what could be called an "automated NOC", a place where events are collected as always, and where some automated processes then do what the human operators would do.  I'll call this the "AutoNOC" approach.

The problem with AutoNOC is that by inheriting the human/NOC model, it inherited most of the problems that model created.  Two examples will illustrate the overall issue set.

Example One:  A major trunk fails.  This creates an interruption of connectivity that impacts many different services, users, and devices.  All the higher-layer elements that depend on the trunk will generate events to signal the problem, and these events will flood the NOC to the point where there’s a good chance that the staff, or the AutoNOC process, will simply be swamped.

Example Two:  An outside condition like a brownout or power failure occurs as the result of a storm, and there are intermittent failures over a wide area as a result.  The events signaling the first of these problems are still being processed when events signaling later failures arrive, and the recovery processes they initiate then collide with each other.

What we really need to fix this problem is to rethink our notion of AutoNOC operation.  The problem is that we have a central resource set that has to see and handle all our stuff.  Wouldn’t it be nice if we had a bunch of eager-beaver ops types spread about, and when a problem occurred, one could be committed to solve the problem?  Each of our ops types would have a communicator to synchronize their efforts, and to ensure that we didn’t have a collision of recovery steps.  That, as it happens, is the TMF NGOSS Contract approach.

The insight that the NGOSS Contract brought to the table was how to deploy and coordinate all those “virtual ops” beavers we discussed.  With this approach, every event was associated with the service contract (hence the name “Next-Generation OSS Contract”), and in the service contract there would be an element associated with the particular thing that generated the event.  The association would consist of a list of events, states, and processes.  When an event comes along, the NGOSS Contract identifies the operations process to run, based on the event and state.  That process, presumed to be stateless and operating only on the contract data, can be spun up anywhere.  It’s a microservice (though that concept didn’t really exist when the idea was first advanced).
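
A minimal sketch of that contract-driven dispatch, with a contract layout, states, and process names of my own invention rather than anything the TMF defines, might look like this.  The point to notice is that the handler carries no state of its own; everything it needs lives in the contract data, so any instance, spun up anywhere, can process the event.

```python
# The service contract carries, per modeled element, its current state and an
# event/state/process map.  Handlers are stateless; they read and update only
# the contract data.  All names here are illustrative.

CONTRACT = {
    "virtual-firewall": {
        "state": "operating",
        "table": {
            ("operating", "fault"):  "rebuild_element",
            ("rebuilding", "ready"): "resume_operation",
        },
    },
}

def rebuild_element(contract, element, event):
    contract[element]["state"] = "rebuilding"   # the state change lives in the data
    return "issue rebuild commands"

def resume_operation(contract, element, event):
    contract[element]["state"] = "operating"
    return "report restoration upward"

PROCESSES = {"rebuild_element": rebuild_element,
             "resume_operation": resume_operation}

def dispatch(contract, element, event):
    """Any instance of this function, anywhere, can handle the event."""
    entry = contract[element]
    process_name = entry["table"][(entry["state"], event)]
    return PROCESSES[process_name](contract, element, event)

print(dispatch(CONTRACT, "virtual-firewall", "fault"))  # state becomes "rebuilding"
print(dispatch(CONTRACT, "virtual-firewall", "ready"))  # state back to "operating"
```

Because the first "fault" moved the element to "rebuilding", a second "fault" would be looked up against a different row of the table (or no row at all), which is exactly how the data model keeps concurrent handlers from colliding.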

The beauty of this should be obvious.  First, everything is infinitely scalable and resilient, since any instance of a state/event process can handle the event.  If you have two events of the same type, you could spin up two process instances.  However, if the processing of an event launches steps that change the state of the network, you could have the first event change the state so that a second event of the same kind would be handled differently.  The data model synchronizes all our ops beavers, and the state/event distribution lets us spin up beavers as needed.

What do these processes do?  Anything.  The obvious thing is that they handle the specific event in a way appropriate to the current state of the element.  That could involve sending commands to network elements, sending alerts to higher levels, or both.  In either case, the commands/alerts could be seen as events of their own.  The model structure defines the place where repair is local, where it’s global, and where it’s not possible and some form of higher-layer intervention is required.

I’ve blogged extensively on service models, and my ExperiaSphere project has a bunch of tutorials on how they work, in detail, so I won’t repeat that piece here.  Suffice it to say that if you define a proper service model and a proper state/event structure, you can create a completely cloud-native, completely elastic, completely composable framework for service automation.

Now contrast this with AutoNOC.  A classic implementation of that approach would mean we had an event queue that received events from the wide world and then pushed them through a serial process to handle them.  The immediate problem with this is that the process isn't scalable, so a bunch of events are likely to pile up in the queue, which creates two problems: the obvious one of delay in handling, and a less obvious one of event correlation.

What happens if you’re processing Item X on the queue, building up a new topology to respond to some failure, and Item X+1 happens to reflect a failure in the thing that you’re now planning to use?  Or it reflects that the impacted element has restored its operation?  The point is that events delayed are events whose context is potentially lost, which means that if you are doing something stateful in processing an event, you may have to look ahead in the queue to see if there’s another event impacting what you’re trying to do.  That way, obviously, is madness in terms of processing.
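
Here's an illustrative sketch of that trap.  The events and the "reroute plan" are invented, but they show why a serial handler ends up peeking into the queue just to keep its own work from being overtaken:

```python
from collections import deque

# A serial AutoNOC-style loop: by the time event N is handled, event N+1 may
# have invalidated the plan, so the handler is forced to look ahead.

queue = deque([
    {"kind": "trunk-fail",    "target": "trunk-A"},
    {"kind": "trunk-fail",    "target": "trunk-B"},   # arrives while #1 waits
    {"kind": "trunk-restore", "target": "trunk-A"},   # makes #1 moot
])

def plan_reroute(failed, pending_events):
    alternate = "trunk-B" if failed == "trunk-A" else "trunk-A"
    for pending in pending_events:                    # the "madness": peek ahead
        if pending == {"kind": "trunk-fail", "target": alternate}:
            return f"cannot reroute {failed}: planned alternate {alternate} is also down"
        if pending == {"kind": "trunk-restore", "target": failed}:
            return f"reroute of {failed} is already moot"
    return f"reroute {failed} via {alternate}"

while queue:
    event = queue.popleft()
    if event["kind"] == "trunk-fail":
        print(plan_reroute(event["target"], queue))
```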

My view is that ONAP is an AutoNOC process.  Yes, they are providing integration points for new services, features, and issues, but if the NGOSS Contract model were used, all that would be needed is a few new microservices to process the new state/events, or perhaps a "talker" that generates a signal to an OSS/BSS process at the appropriate point in service event processing.  Even customization of a tool for service lifecycle automation would be easier.

My concern here is simple.  Is ONAP, trying to advance further along functionally, simply digging a deeper hole?  The wrong architecture is the wrong architecture.  If you don’t fix that problem immediately (and they had it from the first) then you risk throwing away all enhancement work, because the bigger a monolith gets, the uglier its behavior becomes.  Somebody needs to do this right.