Should Managed Network Services Take a Cloud Lesson?

Kubernetes has been almost as revolutionary as containers, and almost synonymous with that concept too. As great as Kubernetes is, though, it’s often seen as a complexity black hole, and that complexity is surely a barrier both to its adoption and to cloud-native development. Google doesn’t like that, largely because cloud-native support is perhaps the biggest differentiator for Google’s cloud services. Well, Google is now doing something about Kubernetes complexity with GKE Autopilot. You can read about it HERE and HERE. I don’t want to cover Autopilot as a specific container technology; others will surely do that. I want to look at the market implications instead, and in particular what it means for networking, but that still requires a bit of Kubernetes and Autopilot introduction.

The most important thing to know about GKE Autopilot is that it is GKE Autopilot, not “Kubernetes Autopilot”. It’s not intended to be a generalized operationalization layer that simplifies Kubernetes, but rather an added feature of Google Kubernetes Engine, the managed Kubernetes service in Google Cloud. What Google is addressing with Autopilot is the fact that even managed Kubernetes may be too much Kubernetes for some users.

Any tool that’s designed to automate deployment and redeployment is replacing a manual process that’s complicated enough to be problematic; otherwise, it wouldn’t be needed. The problem is that setting up the automated process can itself be complicated, particularly when the thing being automated is shooting at a moving, evolving target. That’s what GKE, and now Autopilot, are all about.

In a very real sense, both GKE and Autopilot are part of a broad trend I’ve talked about before. Both networking and IT are muddled by a constant wave of innovation. Yes, that innovation has added many valuable features, and tools have improved based on the experience of real users. But in many cases startups have been responsible for creating the newest and best stuff, and that has created a challenge of complexity in simply integrating it all.

Functionally, Autopilot extends the GKE concept of a managed Kubernetes control plane to nodes and pods. With vanilla GKE, a user gets the managed control plane but still defines their own cluster configuration based on their needs, and the GKE SLA doesn’t extend to nodes and pods. Autopilot lets a user define their cluster in “Autopilot mode”, where all the best practices for security, reliability, and availability are baked in by Google’s Site Reliability Engineering (SRE) practice. Autopilot makes GKE more of a true managed service.
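To give a sense of what “Autopilot mode” means in practice, here’s a minimal sketch using Google’s google-cloud-container Python client to request an Autopilot-mode cluster. The project and region values are placeholders, and the exact field names depend on your client library version, so treat this as illustrative rather than a recipe.

```python
# Minimal sketch: request a GKE cluster in Autopilot mode via the
# google-cloud-container client. Project/region values are placeholders.
from google.cloud import container_v1


def create_autopilot_cluster(project_id: str, region: str, name: str):
    client = container_v1.ClusterManagerClient()

    # In Autopilot mode the user describes the cluster, not its node pools;
    # Google's SRE-derived defaults handle nodes, scaling, and hardening.
    cluster = container_v1.Cluster(
        name=name,
        autopilot=container_v1.Autopilot(enabled=True),
    )

    request = container_v1.CreateClusterRequest(
        parent=f"projects/{project_id}/locations/{region}",
        cluster=cluster,
    )
    # Returns a long-running operation; poll it or watch the console.
    return client.create_cluster(request=request)


# Example (placeholder identifiers):
# create_autopilot_cluster("my-project", "us-central1", "autopilot-demo")
```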

Which, apparently, is exactly what users want, and that’s my segue into the market implications. I know I’ve noted this user quote in a past blog, but it’s one of my favorites because it’s evocative. “I don’t want decision support,” a bank CIO told me, “I want to be told what to do!” Tech is a strange land to be a stranger in, according to almost every decision-maker I’ve talked with. Another quote from the same industry shows one good reason to feel that way: “You’ve gotta understand that if I screw this up, there’s fifty banks I could never work for!” A little exposure goes a long way, and I think Google recognizes that.

The broader market is likely coming to the same realization, and that has major implications for the networking space. I think the same forces driving Google to launch Autopilot will drive operators to launch a better set of managed services. Autopilot is likely a strong step toward the right answer for IT services, but does it go all the way? And what about network services?

Managed services have been around for a very long time, but most managed services are created by adding a management professional service to a traditional network service. The managed service provider (MSP) may integrate or develop special software to reduce their own human cost and improve service features, but the nature of the service is still pretty much what it was.

When SD-WAN came along, managed services got an additional kicker, because most SD-WAN deployments involved sites where there were no skilled technical people at all, much less network professionals. In many cases the sites were rural, or scattered across multiple countries, some of which were third-world locations. SD-WAN alone roughly doubled the number of MSPs.

It also introduced a new approach to the whole managed service story, one that added additional features to the basic connectivity offered by SD-WAN. This feature expansion came about because buyers had other virtual-network missions they wanted to address, like connecting applications hosted in the cloud, where extending the company VPN isn’t possible. Then, of course, there was the usual competitive-dynamic issue; SD-WAN vendors wanted to add features to attract customers, shorten the sales cycle, and potentially build margins.

One of the key elements in a managed service is the SLA process, and that process depends on gathering operating statistics. MSPs have to guarantee something, and both users and MSPs need some way of monitoring what’s being guaranteed. Operating statistics are also critical in any automation of SD-WAN management.
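As a rough illustration of that dependency, the sketch below rolls edge telemetry samples up into the kind of numbers an SLA report would be built on. The field names, thresholds, and class names are all hypothetical, not any MSP’s or vendor’s actual schema.

```python
# Hypothetical sketch: turning edge telemetry samples into SLA-style metrics.
# Field names and thresholds are illustrative, not any vendor's schema.
from dataclasses import dataclass
from typing import List


@dataclass
class EdgeSample:
    timestamp: float      # epoch seconds
    latency_ms: float     # round-trip latency measured at the edge agent
    loss_pct: float       # packet loss over the sample interval
    reachable: bool       # did the edge see the service at all?


def sla_summary(samples: List[EdgeSample],
                latency_target_ms: float = 80.0,
                loss_target_pct: float = 0.5) -> dict:
    """Summarize a reporting period against illustrative SLA targets."""
    if not samples:
        return {"availability_pct": 0.0, "latency_ok_pct": 0.0, "loss_ok_pct": 0.0}
    total = len(samples)
    return {
        "availability_pct": 100.0 * sum(s.reachable for s in samples) / total,
        "latency_ok_pct": 100.0 * sum(s.latency_ms <= latency_target_ms for s in samples) / total,
        "loss_ok_pct": 100.0 * sum(s.loss_pct <= loss_target_pct for s in samples) / total,
    }
```

The point of the structure is that both the MSP and the customer can compute the same summary from the same telemetry stream, which is what makes an SLA verifiable rather than merely asserted.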

The reason why SD-WAN has become a kind of focus point in the whole MSP telemetry game is that, in order for SD-WAN to work, it has to have an agent element that sits essentially at the point of user connection. This is usually a critical boundary in the SLA game, since MSPs will either set it as the limit of their SLA or charge for taking responsibility deeper into the user network.

This might be a part of the reason why Juniper, which acquired SD-WAN vendor 128 Technology earlier, has now announced it’s integrating 128 Technology’s data (which includes session-specific detail) with its Mist AI and Marvis virtual network assistant technologies. Enterprises could use the combination to enhance their own network management, but of course MSPs could also use the technology to buttress their own SLAs.

The big question is whether this sort of thing could move networks to parity with IT infrastructure in terms of managed services. If you look back at Google’s GKE and Autopilot combination, you see not only a strong SLA framework, but a divorcing of the user from a lot of explicit infrastructure responsibility or knowledge. An Autopilot-mode cluster is almost an intent model; it has external properties and an internal realization of those properties, but users don’t worry about the latter once they’ve defined the former. Could something like this be brought to networks?
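To make the intent-model point concrete, here’s a toy sketch of the pattern: the user states external properties, and a realization layer, which the user never inspects, maps them to infrastructure. Every name here is invented for illustration; it isn’t GKE’s or anyone else’s actual interface.

```python
# Toy sketch of the intent-model pattern: external properties in,
# hidden realization underneath. All names are invented for illustration.
from dataclasses import dataclass


@dataclass
class ServiceIntent:
    # External properties the user cares about...
    workload: str                # e.g. a container image or service name
    min_replicas: int
    availability_target: float   # e.g. 0.999
    region: str


class Realization:
    """Opaque handle: the user sees status, not nodes, pods, or circuits."""
    def __init__(self, intent: ServiceIntent):
        self._intent = intent
        # ...internal mapping to node pools, paths, or capacity goes here...

    def status(self) -> str:
        return f"{self._intent.workload}: meeting {self._intent.availability_target:.3f} target"


def realize(intent: ServiceIntent) -> Realization:
    # The provider's best practices live behind this call, not in user config.
    return Realization(intent)


handle = realize(ServiceIntent("storefront", 3, 0.999, "us-central1"))
print(handle.status())
```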

My favorite concept, network-as-a-service or NaaS, would be one way. A session-aware edge element could be viewed either as a way to generate a connection “overlay” that could integrate with an MPLS VPN, or as a way to request a connection, which would surely look a lot like NaaS. The difference between the two viewpoints would be that the second assumes the session-aware edge is universally deployed, not just deployed where SD-WAN’s traditional small-site connectivity enhancement is appropriate.
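A NaaS request from a session-aware edge might look something like the sketch below; the ConnectionRequest shape and the broker interface are entirely hypothetical, meant only to show that “request a connection” is an intent with properties and an SLA, not an overlay configuration exercise.

```python
# Hypothetical NaaS-style connection request from a session-aware edge.
# The request shape and the broker interface are invented for illustration.
from dataclasses import dataclass, field
import uuid


@dataclass
class ConnectionRequest:
    source_site: str
    destination: str           # another site, a SaaS endpoint, or a cloud VPC
    max_latency_ms: float
    min_bandwidth_mbps: float
    encrypted: bool = True


@dataclass
class Connection:
    request: ConnectionRequest
    connection_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    state: str = "pending"


class NaasBroker:
    """Stand-in for whatever element actually grants connections."""
    def request_connection(self, req: ConnectionRequest) -> Connection:
        # A real broker would check policy, pick a path (overlay, MPLS,
        # cloud interconnect), and return an SLA-bearing connection object.
        return Connection(request=req, state="active")


conn = NaasBroker().request_connection(
    ConnectionRequest("branch-042", "erp.cloud.example.com", 60.0, 25.0))
print(conn.connection_id, conn.state)
```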

That same question would apply to the telemetry used in an SLA. MSPs selling “managed SD-WAN” and adding feature enhancements for differentiation would surely, at some point, be interested in selling “universal VPN” services, which would envelop both traditional SD-WAN sites and MPLS VPN sites. The latter might then be targets for SD-WAN introduction to replace MPLS, or they might simply be targets for augmented features independent of the VPN connectivity.

128 Technology was taking a step in this direction, I think, with its L3 NID concept. Late last year, around the time the deal with Juniper was announced, 128 Technology created a kind of split model of its SD-WAN edge, one that started with what’s essentially a telemetry generator, the L3 NID, and then let you add other features on top of it, including traditional SD-WAN VPN features. It’s not clear what Juniper intends to do with this approach, but it would appear to be a step toward providing universal telemetry for MSPs, and perhaps add-on session awareness that could be used to create NaaS-like connection awareness without changing the interior behavior of the network at all.
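The split-model idea could be sketched as simple composition: a base element that only observes sessions and emits telemetry, with VPN or NaaS features layered on top of that same view. This is my own illustration of the architectural idea, not 128 Technology’s or Juniper’s actual software structure.

```python
# Illustration of the split model: a telemetry-only base element with
# optional features layered on top. Invented classes, not Juniper/128T code.
from dataclasses import dataclass
from typing import List


@dataclass
class SessionRecord:
    session_id: str
    src: str
    dst: str
    latency_ms: float
    bytes_sent: int


class TelemetryNid:
    """Base element: watches sessions and emits records, nothing more."""
    def __init__(self):
        self._records: List[SessionRecord] = []

    def observe(self, record: SessionRecord) -> None:
        self._records.append(record)

    def export(self) -> List[SessionRecord]:
        return list(self._records)


class SdwanVpnFeature:
    """Add-on feature: uses the same session view to pick overlay paths."""
    def __init__(self, nid: TelemetryNid):
        self._nid = nid

    def preferred_path(self, dst: str) -> str:
        # Toy policy: route chatty destinations over the 'premium' path.
        traffic = sum(r.bytes_sent for r in self._nid.export() if r.dst == dst)
        return "premium" if traffic > 1_000_000 else "best-effort"
```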

If users want hands-free containerized applications, they’d seem likely to want something similar on the network side. In fact, you could argue that one of the reasons why things like managed container/Kubernetes services work is that they virtualize the network piece. Is this a signal that we really do need a similar approach to managed services on the network side? Might cloud providers like Google start to think about providing it? Remember, anything that’s at the point of user connection to the network can add features to facilitate managed-service capability. Cloud edge included? Could be.