What Exactly is “Cloud-Native” and How Do We Get There?

We are still hearing feedback from operators on the importance of “cloud-native” in things like NFV.  As an example, Fierce Telecom has run THIS piece on the topic.  I’m still concerned that we’re jumping onto a terminological bandwagon in the belief that adopting the term will, by itself, have a tangible effect, and that’s not the case.  We need to look at the specific problems of NFV, and the way “cloud-native” might impact them.

As I’ve noted before, one foundational issue that NFV fell prey to from the first was focusing on the transformation of physical network functions (PNFs) into virtual network functions (VNFs).  The underlying presumption of that focus was that a network is built from a very static set of “network functions” that are currently embodied in specialized devices and should instead be implemented as hosted software instances.  This foundational point gives rise to the next generation of issues.

The first of those second-gen issues is that the fundamental nature of the network, as a collection of connected network functions, doesn’t get transformed by NFV.  There is a difference between a “router” and a “virtual router”, but the more you get fixated on the notion that the two will be used the same way, the less you get from the process of virtualization.  For example, a network of virtual routers, hosted in fixed places and replaced by re-hosting when something breaks, behaves a lot like a network of real routers.

The second of our issues is that the narrow focus on PNF-to-VNF limits the scope of NFV’s impact.  The network functions are still used and managed the same way, which means that the only thing NFV can really do is improve the efficiency of the PNF-to-VNF transformation itself; the rest stays the same.  Operations improvements are limited, and we’ve long since learned that it’s got to be opex efficiency and agility improvements that make a broad business case for NFV.

The PNF-to-VNF transformation is itself an issue.  If there exists a “network function” or NF that can live in multiple forms, physical and virtual, then software architects would say that you have to start your transformation architecture by defining NFs as a kind of super-intent-model, the implementation of which then has to be matched to the interfaces the NF specifies.  Onboarding is then the process of implementing the NF and meeting those interfaces, which is at least a specific task.  However, the NFV ISG didn’t do that.
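To make that concrete, here’s a minimal sketch in Python of the NF-as-super-intent-model idea.  This is my own illustration, not anything the NFV ISG defined; all the class and method names are hypothetical.

from abc import ABC, abstractmethod

class NetworkFunction(ABC):
    # Abstract NF: declares the interfaces and the intent (SLA) it
    # exposes.  Any implementation, physical box or hosted software,
    # must satisfy this contract; how it does so is hidden inside.

    @abstractmethod
    def ports(self) -> dict:
        # The connection interfaces the NF presents (name -> type).
        ...

    @abstractmethod
    def sla(self) -> dict:
        # The intent: what the NF promises (availability, latency, etc.).
        ...

class FirewallNF(NetworkFunction):
    # One abstract NF type; both the PNF and any VNF onboard against this.
    def ports(self) -> dict:
        return {"inside": "L3", "outside": "L3", "mgmt": "REST"}
    def sla(self) -> dict:
        return {"availability": 0.9999, "max_latency_ms": 2}

class VendorFirewallVNF(FirewallNF):
    # Onboarding = implementing the NF and meeting its interfaces; the
    # vendor's hosted instance plugs in behind the same abstraction.
    def __init__(self, image_ref: str):
        self.image_ref = image_ref   # e.g. a container or VM image

# A deployed VNF is then just one implementation of the abstract NF:
vnf = VendorFirewallVNF(image_ref="registry.example/fw:1.0")
assert vnf.ports()["mgmt"] == "REST"

The point of the sketch is that the transformation architecture starts with the NF contract; a PNF adapter or a vendor VNF is interchangeable behind it, which is what would have made onboarding that specific, testable task.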

Cloud-native advocates seem to be suggesting that their approach could resolve all these problems, but that assertion has its own fundamental flaw, which is that you can’t be just partly cloud-native.

I could define and develop an implementation of a VNF that would be cloud-native, and at the same time would fall into every one of the issue traps I’ve just noted.  That’s because the problem isn’t just how a VNF is implemented, it’s how NFV works as a software system.  A cloud-native implementation of the wrong architecture is still wrong…and the fact it might be more efficiently wrong won’t make it right.  Let’s face a simple truth here.  There is no way of doing an effective cloud-native VNF within the current NFV architecture model.

Can we have a cloud-native VNF being controlled by a monolithic MANO or VNFM?  Can we grow apples on an orange tree?  Here again, it’s not a matter of saying that we’d employ cloud-native tactics to implement MANO or VNFM.  I submit that the biggest thing wrong with something like NFV’s MANO or VNFM isn’t how they are implemented, but that we think they exist as discrete elements at all.

A service should be represented by an abstract data model, defining a related collection of functional elements that correspond to network functions, whether they exist as PNFs today or are invented new to support emerging service opportunities.  This model, as the TMF has long described it, is used to associate service events with service processes.  What NFV wants to call “MANO” or “VNFM” isn’t a software element; it’s a composed event/process relationship set.  One service event is a “Deploy” order, for example.  That event activates the process of committing implementations to the NFs of the abstract model, which is “orchestration”.  If an error occurs, that error is an event that then activates a set of processes to correct it, which is “management”.  There is no MANO or VNFM, only a model-coordinated event-to-process-set relationship.  We compose management like we compose services.
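Here’s a rough Python sketch of that model-coordinated event-to-process relationship.  Again, it’s my own illustration with hypothetical names, not a TMF or NFV artifact.

# Each element of the abstract service model carries its own event map;
# "MANO" and "VNFM" dissolve into composed event/process relationships.

def commit_implementations(node, event):
    # "Orchestration": bind an implementation to this model element.
    print(f"orchestrating {node['name']}: committing an implementation")

def correct_fault(node, event):
    # "Management": run the corrective processes the model composes.
    print(f"managing {node['name']}: correcting {event['detail']}")

service_model = {
    "name": "vpn-service",
    "event_map": {
        "Deploy": [commit_implementations],
        "Fault":  [correct_fault],
    },
}

def handle_event(node, event):
    # Steer a service event to the process set the model composes for it.
    for process in node["event_map"].get(event["type"], []):
        process(node, event)

handle_event(service_model, {"type": "Deploy"})
handle_event(service_model, {"type": "Fault", "detail": "host failure"})

Nothing in this sketch is a resident MANO or VNFM component; “orchestration” and “management” are just rows in the model’s event map, which is what composing management the way we compose services amounts to.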

This also explains my concerns about something like ONAP.  It was clear from the genesis of ONAP in AT&T’s ECOMP that it was, like NFV, based on the presumption of a set of connected and specific software processes, not on an event-to-process-via-model approach.  In fact, it didn’t really have a good or complete model structure.  About a year ago, I blogged that the ONAP people were promising to integrate data modeling, and I was concerned about the pace of that integration and the extent to which it could fundamentally shift a monolithic design to an event-driven one.  We are still not where ONAP said it wanted to be, and I’m more concerned than ever that the project is proving that evolving to the event-to-process-via-model approach isn’t just very difficult, it may be impossible.

Web giants like Facebook, Google, and Twitter have designed “cloud-native” applications, and they didn’t do it by taking monolithic software and somehow transforming it.  NFV, ZTA, and 3GPP aren’t going to get to cloud-native any differently, if they really hope to get there.  All these players invented technologies and defined new architectures to get massive scalability and feature agility.  That’s how network operators have to do it.  That’s what “cloud-native” really means.

I gently disagree with Telus’ Bryce Mitchell (quoted in the Fierce Telecom piece) that this is a matter of management mindset.  This is a matter of software architecture, and if we expect telecom managers to drive software design, we have a very long and unhappy road ahead.