Is a New Model for Service Automation Emerging?

Is a new lifecycle automation concept championed by AT&T about to displace AT&T’s earlier ONAP?  According to several recent stories, Airship might be emerging as the new darling of automation.  If that’s the case, what could it mean for networking and transformation?

We’ve been struggling for six years to frame a rational transformation architecture for network operators.  From very early on (the fall of 2013 to be exact) the operators themselves have admitted that operations automation would have to be a critical piece of that.  Obviously, automating the steps in service management would reduce opex even for legacy network architectures, and for new architectures based on hosted virtual functions (NFV) it would be essential.  The problem is that NFV declared operations to be out of scope, and later initiatives like ETSI’s Zero-Touch Automation and AT&T/Linux Foundation ONAP have failed to accept the basic requirement of an automated operations system—which is that it’s necessarily event-driven.

Airship, which is reported to be a big piece of AT&T’s 5G plans, is an assembly of open-source components designed to “declaratively automate cloud provisioning.”  It includes familiar tools like Kubernetes and Helm (including OpenStack-Helm), and it straddles the container-VM-bare-metal deployment options.  The term “declarative” means that Airship is a form of “DevOps” that declares an end-state and then automatically works to achieve and sustain it, rather than requiring users to specify deployment steps in a script.  The declarations are in YAML form (Helm charts, for example) and they describe both deployment and updates to components.
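To make “declarative” concrete, here’s a minimal sketch of the idea in Python.  It isn’t Airship or Helm code; the `Cluster` class, the `reconcile` function, and the component names are all hypothetical, invented only to show how a declared end-state is compared against observed state and converted into actions.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for observed state: component name -> running replica count."""
    running: dict = field(default_factory=dict)


def reconcile(desired: dict, cluster: Cluster) -> list:
    """Drive observed state toward the declared end-state; return the actions taken."""
    actions = []
    for name, want in desired.items():
        have = cluster.running.get(name, 0)
        if have < want:
            actions.append(f"start {want - have} x {name}")
        elif have > want:
            actions.append(f"stop {have - want} x {name}")
        cluster.running[name] = want
    # Anything running that the declaration does not mention gets removed.
    for name in sorted(set(cluster.running) - set(desired)):
        actions.append(f"remove {name}")
        del cluster.running[name]
    return actions


# The "declaration": what should exist, with no steps on how to get there.
desired_state = {"vrouter": 2, "firewall": 1}
cluster = Cluster(running={"vrouter": 1, "legacy-probe": 1})

first_pass = reconcile(desired_state, cluster)   # does the work
second_pass = reconcile(desired_state, cluster)  # nothing left to do
```

The second pass returning no actions is the point: the same declaration can be applied over and over, which is what lets a tool “sustain” the end-state rather than just reach it once.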

AT&T doesn’t seem to see Airship as a replacement for ONAP, based on the architecture paper they’ve offered (see the diagram on page 4).  They see Airship as an “undercloud” or platform tool that operates below both NFV and ONAP, which retain their own cloud-level open-source tools (like Nova and Neutron for NFV and Keystone and Swift for ONAP).  However, it seems to me that Airship is nibbling at the edges of both ONAP and NFV even today, and that further development of Airship and of other Kubernetes-ecosystem tools could bring the concepts even closer.  In fact, Airship might eat both NFV and ONAP from below.

Helm charts are the central element of this structure, in my view.  A “release” is a collection of container resources that deploy as a unit, and so it’s a reflection of what the NFV people would call a VNF deployment.  The Helm charts make up a “portable deployment”, in short, and it’s easy to see how people looking at this piece of the structure would conclude that Airship has everything you need to do NFV or ONAP, but that’s not yet true.

The critical missing pieces here are network-related.  Airship has a virtual-network framework (Calico) for the control-plane activity of Airship and the cloud elements, but the tenant network is outside the scope of Airship today.  Similarly, Airship doesn’t see “real” devices or network services.  Given that Airship is a cloud deployment/redeployment platform, those omissions aren’t surprising, but in network operator applications where hosted and physical device elements would be expected to intermix, they create a void.

The structure in the figure I referenced in AT&T’s architecture paper reflects a broad issue in the orchestration of network services.  Cloud orchestration traditionally deals with hosted components.  Network orchestration traditionally deals with devices.  What deals with both?  NFV planted an early flag by presuming that network devices would transmute into virtual functions, managed largely as the traditional devices were.  That would render “service management” unchanged.  In this model, which AT&T’s application of Airship perpetuates, the cloud orchestration tools like Airship (and like NFV and ONAP above them) are responsible for realizing virtual devices on resources and managing them inside the virtual box.

This bicameral management model has collided with “service lifecycle automation” goals and even software architectures.  If services are created by coercing network functions (virtual or physical) then that coercion has to happen above the level of both software orchestration and device management, which is what puts ONAP up where it is in functional diagram terms.  There are then two orchestrations, and while that obvious redundancy seems wasteful, it’s not the big issue.  The big issue is the natural interdependence of the virtual and physical structures.

If a higher-level orchestration element is to deploy a service on a mixed real/virtual infrastructure by invoking the appropriate management or orchestration tools for each, then it has to know whether a given element is real or virtual.  If the management and orchestration tools are expected to “redeploy” or manage problems, then we have to presume that the new configuration might involve a different mixture of the real and virtual elements.  Thus, low-level remediation, including what’s built into tools like Airship, is hamstrung in actually remediating because it can’t fix a problem that impacts the higher-level real/virtual model.

Let’s illustrate this issue quickly.  Suppose the original deployment (“A”) includes Pittsburgh, PA, where there’s a good data center and thus a virtual router is deployed there.  Suppose that virtual router fails because its server fails, and there is no available capacity to rehost it in Pittsburgh.  The best topological match to replace Pittsburgh is Steubenville, OH, and there is router capacity available there but not hosting capacity.  We cannot use Airship to rehost if we use our “best” city, and if we do use Airship and demand a rehosting of the virtual router, we might end up with traffic through Philadelphia, which isn’t the optimum location at all.
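The Pittsburgh example can be sketched in a few lines of Python.  Everything here is an assumption for illustration: the `Site` class, the `topology_cost` numbers, and the capacity figures are invented (the example above doesn’t say whether Pittsburgh has a spare physical router, so I’ve assumed it doesn’t), and neither function represents anything Airship actually exposes.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    topology_cost: int   # hypothetical metric: lower is a better topological fit
    free_hosting: int    # spare server capacity for a virtual router
    free_routers: int    # spare physical router capacity


SITES = [
    Site("Pittsburgh", 0, free_hosting=0, free_routers=0),    # failed, no spare capacity
    Site("Steubenville", 1, free_hosting=0, free_routers=1),  # best match, physical only
    Site("Philadelphia", 5, free_hosting=3, free_routers=0),  # hosting, but poorly placed
]


def rehost_virtual_only(sites):
    """What a cloud-only tool can see: sites with hosting capacity."""
    candidates = [s for s in sites if s.free_hosting > 0]
    best = min(candidates, key=lambda s: s.topology_cost, default=None)
    return (best.name, "virtual") if best else None


def replace_at_service_level(sites):
    """What a service-level model could see: any site that can supply a router."""
    candidates = [s for s in sites if s.free_hosting > 0 or s.free_routers > 0]
    best = min(candidates, key=lambda s: s.topology_cost, default=None)
    if best is None:
        return None
    return (best.name, "virtual" if best.free_hosting > 0 else "physical")


cloud_choice = rehost_virtual_only(SITES)          # ("Philadelphia", "virtual")
service_choice = replace_at_service_level(SITES)   # ("Steubenville", "physical")
```

The two functions differ only in which candidates they’re allowed to consider, and that’s enough to send the traffic to a different city.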

That’s the issue that AT&T, and Airship proponents in general, have to be looking at.  I can build resilient container-based services and use cloud tools to maintain them.  I can’t do that if I mix in physical network devices, because the tools don’t recognize network device management.  My declarative Airship/Helm models can’t represent the service overall, just the individual pieces of hosted functionality that might be integrated into it.

That’s not the end of the issues either.  Recall that AT&T makes a point of saying that they’re using Calico for control-plane virtual networking, but not tenant networking.  But think about it; in the cloud we do virtual networking for the service transactions, the equivalent of tenant networking.  If we presumed that cloud-native implementation of virtual functions was the rule, why would we not want to use a service mesh (like Istio) for the service data plane, the “tenant network?”  If we did that, then we’d be integrating Istio into Airship, in which case the scaling and locating of new or replacement instances would be done within Istio.  How, then, do we address mixed physical/virtual network infrastructure, given that Istio doesn’t do network devices?

The point is that the “network” is the “cloud network” in the future, and the current separation of real and virtual devices isn’t going to cut it as we move into that future.  It’s important that networking at the cloud level be purely virtual, which means that either there’s a total separation between the cloud network and the “real” network, or that the two are integrated rather than overlaid.  The former strategy is fine for network users, but it doesn’t address operator interests in virtualization of functions/devices rather than services.  The operators, then, have two choices to bring their own stars into alignment.

The first of these choices is to forget about “function orchestration” and build network models that define services in both real and virtual element terms.  That’s the approach that I think NFV should have taken, that I think the TMF has taken (and just not articulated well), and that operators would probably benefit most from promoting.  Use TOSCA to model services, use hierarchies and intent modeling to build services from functions that can then map (at the “resource layer”) to either real or virtual devices.  This is better, I think, because you could define one model-driven, event-based, orchestration and management platform for everything.
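As a rough sketch of what that hierarchy might look like, here’s a small Python model.  It’s plain Python standing in for a TOSCA-style model, not TOSCA syntax, and the `IntentElement` class, the `RESOURCES` table, and the element names are all hypothetical.  The upper layers describe the service in terms of functions; only the binding step at the bottom knows whether a function is realized by a real or a virtual device.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IntentElement:
    """A node in the service model; leaves name a function at a site."""
    name: str
    children: List["IntentElement"] = field(default_factory=list)
    function: str = ""   # set only on leaves
    site: str = ""       # set only on leaves

    def decompose(self, bind):
        """Walk the hierarchy and bind each leaf at the resource layer."""
        if not self.children:
            return [(self.name, bind(self.function, self.site))]
        plan = []
        for child in self.children:
            plan.extend(child.decompose(bind))
        return plan


# Hypothetical resource layer: what each site can offer for a given function.
RESOURCES = {
    ("router", "Pittsburgh"): "virtual",
    ("router", "Steubenville"): "physical",
}


def bind(function, site):
    """Resource-layer mapping; the layers above never see real versus virtual."""
    return RESOURCES[(function, site)]


service = IntentElement("vpn-service", children=[
    IntentElement("core", children=[
        IntentElement("pgh-router", function="router", site="Pittsburgh"),
        IntentElement("stb-router", function="router", site="Steubenville"),
    ]),
])

plan = service.decompose(bind)
```

Swapping a virtual router for a physical one then means changing an entry at the resource layer, not rewriting the service model above it.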

The second of the choices is to try to keep service orchestration and function orchestration separate, which means two layers of orchestration and all of the issues I’ve already raised of coordinating problem responses between the layers.  However, this approach is going to mean defining a more event-driven approach to that service layer process set, and that’s going to obsolete things like ONAP and ETSI ZTA, neither of which is truly event-driven.

But before we feel too sorry for the operators in having to make a tough decision, let’s return to the theme of this post.  Airship could eat all the current strategies from below, because cloud-native initiatives either have to ignore the dichotomy of real/virtual resources or they have to absorb it.  We already see, even in Airship, a recognition that you can host containers on Linux and bare metal or in VMs.  Is it really so far from that to the realization that we can either harness virtual routers or real ones?  I don’t think so, and I think we’ll see Airship, Kubernetes, Istio, and other tools expanding into the real/virtual space.  When they do, whatever they provide is what lifecycle automation will be based on, because the breadth of both support and adoption for it will swamp the smaller efforts of the operators.

Look to Helm for a sign of progress here.  Helm is “declarative” only in a minimal sense.  In its current form it’s not suitable for the kind of intent-driven hierarchical modeling that a service needs.  It wouldn’t take much to fix that, and there are good reasons to make the necessary changes even within the original cloud-centric mission of Helm.  AT&T might even be a catalyst for the changes, since it’s seeing both the cloud and service dimension of Helm already.

AT&T may have launched ONAP, and also what will eventually supplant it, and in my view that’s a good thing, because Airship is more cloud-native, and cloud-native is where we need to be heading.