Some Hidden Truths About Service Automation

The number one issue with making an NFV business case is that of service automation.  Any talk about “operations efficiency” or “service agility” that doesn’t start with the assumption that service activity is automated is just a waste of breath.  In addition, if we don’t have very precise notions of how we’d manage the incremental aspects of deployment and ongoing operation for VNF-based services, we can’t argue these services would reduce TCO relative to current device-based approaches.

While service automation is hardly a hidden issue, there are hidden aspects to its implementation, both in the way standards address it and the way vendors do.  It’s easy to claim service automation success or management completeness when there’s nothing to measure either claim against.  What I propose to do here is to offer some reference points, things that have been there all along but not discussed to any extent.

First and foremost, you cannot do service automation if you can’t tie processes to service events.  Automation means taking software-driven steps instead of human steps, in response to a condition requiring action.  Those conditions are signaled by events.  Automated platforms for service and network management have to be event-driven, period.  This is why, back in early 2013, I said that the foundation of the CloudNFV initiative I launched was the TMF’s NGOSS Contract/GB942.  That was the first, and is still the most relevant, of the specifications that provided for event-steering to processes via a data model.

The notion of handling events, or being “event-driven” as it’s often called, has been on the list of OSS/BSS changes users wanted to see for years, perhaps more than a decade.  It’s been in the TMF lexicon for at least seven or eight years, but my friends in the TMF tell me that implementations of the approach are rare to non-existent.  That’s a shame, because the NGOSS Contract notion should be the foundation of modern network operations and management.

Second, being event-driven means recognizing the notion of state or context in a service-related process.  Events have to be interpreted based on what’s expected, and the notion of expectation is reflected in a “state”.  A service that’s ordered but not yet deployed is in the “Ordered” state, for example, and so an “Activate” event could be expected to deploy it.  An “Order” event, in contrast, would be a process error.
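
To make that concrete, here’s a minimal sketch of the state/event steering idea in the GB942 spirit.  The names and structure are my own illustration, not anything taken from the TMF specification:

```python
# Minimal sketch of GB942-style event steering: a state/event table maps
# (current state, event) pairs to the operations process that should run.
# All names here are illustrative, not from any spec.

def deploy_service(service):
    print(f"deploying {service['name']}")
    service["state"] = "Active"

def log_process_error(service):
    print(f"unexpected event for {service['name']} in state {service['state']}")

# In the real case this table would live in the service data model, not in
# code: software just looks up the process bound to (state, event).
STATE_EVENT_TABLE = {
    ("Ordered", "Activate"): deploy_service,
}

def handle_event(service, event):
    process = STATE_EVENT_TABLE.get((service["state"], event), log_process_error)
    process(service)

svc = {"name": "vpn-example", "state": "Ordered"}
handle_event(svc, "Order")     # unexpected in the "Ordered" state: process error
handle_event(svc, "Activate")  # expected: steers to the deployment process
```

The point is that the steering lives in data rather than in code, so changing operations behavior means changing the model, not rewriting software.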

What makes state complicated is that services are made up of a bunch of related, connected functions.  A VPN has a central VPN service and a bunch of access services, one for each endpoint.  You have to recognize that there is a “state” for each of these elements because they all have to be managed independently first, then collectively.  Thus, we have to visualize a service as a structure at a high level, broken down into lower-level structures, and recognize that each of these has to have “state” and “events” that then have to be associated with service processes.
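
Extending the earlier sketch, each element in the service hierarchy would carry its own state and its own state/event handling.  Again, the element and event names are invented for illustration:

```python
# A sketch of hierarchical service state, assuming a simple tree of elements.
# Each element has its own state, so lower levels can be automated
# independently first, then the collective (parent) state rolled up.

class ServiceElement:
    def __init__(self, name, children=None):
        self.name = name
        self.state = "Ordered"
        self.children = children or []

    def handle_event(self, event):
        # Per-element steering: the same event means different things
        # depending on this element's own state.
        if (self.state, event) == ("Ordered", "Activate"):
            for child in self.children:   # deploy the lower levels first
                child.handle_event("Activate")
            self.state = "Active"
            print(f"{self.name} is now Active")
        else:
            print(f"{self.name}: unexpected {event} in state {self.state}")

# A VPN service: one central VPN element plus one access element per endpoint.
vpn_core = ServiceElement("vpn-core")
service = ServiceElement(
    "vpn-service",
    [vpn_core] + [ServiceElement(f"access-{i}") for i in range(3)],
)
service.handle_event("Activate")
```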

The next service automation issue is that multiple services and multiple users/tenants cannot exist in the same address space.  Service A and Service B might both use exactly the same virtual functions, but they have to be separated from each other or neither users nor management systems can securely and reliably control both of them.

In the cloud, players like Amazon and Google launched their own virtual networking initiatives to ensure that tenants in the cloud could deploy and connect things as though these applications and users were alone in the world, even though they shared resources.  NFV should have taken that issue up from the first, but it’s still not described in a satisfactory way.  The logical answer would be to adopt the Amazon/Google model, which is based on the notion of a series of private address spaces (RFC 1918) and selective mapping of elements/ports from one such space to another (from tenant/service to NFV management, for example) and to public (user-visible) addresses.  But if this is going to be done, we need to know who assigns the addresses and how the multiple mappings are handled.
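
To show what that mapping question means in practice, here’s a sketch of the kind of per-tenant space and selective exposure I’m describing.  The addresses and element names are invented:

```python
# Sketch of the Amazon/Google-style model as I read it: each tenant/service
# gets its own RFC 1918 space, and only selected element/ports are mapped
# into another space (NFV management, or the public side).  All addresses
# and names below are invented for illustration.

tenant_space = {
    "cidr": "10.0.0.0/16",           # private per-tenant space; it may
    "elements": {                    # overlap every other tenant's space
        "firewall-vnf": "10.0.1.10",
        "router-vnf":   "10.0.1.11",
    },
}

# Selective mappings expose only chosen ports into other address spaces.
mappings = [
    {"from": ("firewall-vnf", 161), "to_space": "nfv-management",
     "to": ("172.16.5.10", 20161)},
    {"from": ("router-vnf", 443),   "to_space": "public",
     "to": ("203.0.113.7", 443)},
]

def resolve(space_name, addr_port):
    """Find what a mapped address in a given space points back to."""
    for m in mappings:
        if m["to_space"] == space_name and m["to"] == addr_port:
            return m["from"]
    return None                      # unmapped means invisible outside the tenant

print(resolve("nfv-management", ("172.16.5.10", 20161)))  # ('firewall-vnf', 161)
print(resolve("public", ("203.0.113.9", 22)))             # None: never exposed
```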

This problem is particularly incendiary in the area of management.  If a virtual function is to reflect its state through a management interface like an SNMP MIB (for example), then you’d have to let it see the state of the resources used to host it.  How do you let that happen without opening the management portals of the resources to the VNF?  If you do that, you risk at the least flooding resource management APIs with requests from thousands of VNFs, and at worst letting one VNF change the state of shared resources so that other users and VNFs are affected.  This problem needed to be fixed up front; it’s not fixed yet.
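
One plausible fix, sketched here purely to illustrate the shape of a solution rather than anything specified, is a mediation layer that answers VNF state queries from a shared, filtered cache so the real resource APIs never face the VNFs directly:

```python
# A mediation-layer sketch, under my own assumptions: one component polls the
# real resource APIs, caches the results, filters reads to each VNF's own
# resources, and refuses writes to shared resources outright.

import time

class ManagementMediator:
    def __init__(self, poll_fn, ttl=5.0):
        self._poll_fn = poll_fn      # the ONE client of the real resource API
        self._ttl = ttl
        self._cache = {}             # resource -> (timestamp, state)

    def read_state(self, vnf_id, resource, allocations):
        # Only resources actually allocated to this VNF are visible to it.
        if resource not in allocations.get(vnf_id, ()):
            raise PermissionError(f"{vnf_id} cannot see {resource}")
        ts, state = self._cache.get(resource, (0.0, None))
        if time.time() - ts > self._ttl:     # refresh at most once per TTL,
            state = self._poll_fn(resource)  # however many VNFs are asking
            self._cache[resource] = (time.time(), state)
        return state

    def write_state(self, vnf_id, resource, value):
        raise PermissionError("VNFs may not alter shared resources")

# Illustrative use: a fake resource API plus an allocation map.
mediator = ManagementMediator(poll_fn=lambda r: {"resource": r, "status": "up"})
allocations = {"vnf-42": {"host-7"}}
print(mediator.read_state("vnf-42", "host-7", allocations))
```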

The fourth issue is that of brittle implementations of service descriptions or models.  Ideally, a service should be seen at two levels, the logical or functional level and the deployed or structural level.  This has been reflected in the TMF SID data model as “customer-facing” and “resource-facing” services.  The goal of this kind of explicit modeling is to prevent implementation dependencies at the deployment level from breaking service descriptions.  If you have a service in your catalog with structural references included, changes in the network or server pool could break the definition, making it impossible to deploy.
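
Here’s a sketch of what the two-level split buys you, using invented field names rather than the actual SID structures:

```python
# Customer-facing vs. resource-facing, in the spirit of the TMF SID (the
# structures here are my own invention).  The catalog entry names only
# functions; how each is realized is bound at deployment time, so resource
# pool changes can't break the catalog definition.

customer_facing = {                  # what the customer ordered: pure function
    "service": "Business-VPN",
    "elements": ["VPN-Core", "Access-Connection"],
}

resource_facing = {                  # how each function is realized today
    "VPN-Core": {"type": "vnf", "image": "vrouter-3.1", "pool": "dc-east"},
    "Access-Connection": {"type": "legacy", "device": "edge-router"},
}

def deploy(order):
    for element in order["elements"]:
        binding = resource_facing[element]          # resolved at deploy time,
        print(f"deploying {element} as {binding}")  # not baked into the catalog

deploy(customer_facing)
# Swapping vrouter-3.1 for another implementation changes resource_facing
# only; the catalog entry and anything referencing it are untouched.
```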

Implementation dependencies can also create the problem of “ships-in-the-night” or “colliding” remediation processes.  When something breaks in a service, it’s common to find that the failure causes other elements elsewhere to perceive a failure.  When a high-level piece of a service can see a low-level condition directly, it’s easy to find the high-level process taking action at the same time as other, lower-level automated activities are responding.  Fault correlation is difficult if you have to apply it across what should be opaque pieces of service functionality without risking this sort of collision.

A final aspect of “brittleness” is that if management processes are defined on a per-function basis (which is the ISG recommendation) then it’s very possible for two different implementations of a virtual function to present two different sets of management tools, and require different craft or automation processes to handle them.  The classic way to avoid this, which is to build service automation intelligence into the per-function management element, can create enormous duplication of code and execution overhead and still leave pieces of management processes inconsistently implemented.

The last of the service automation issues is the lack of an effective service modeling framework.  A service model that represents the natural hierarchy of service/resource elements can contain, per element, the management hooks and state/event tables needed for service automation.  We have no such model with NFV, even though that should have been done up front.  You can see that in the fact that the NFV ISG convened a conference of SDOs to address the need for a unified model.  Without a model, we have no framework in which process automation can operate, so everything would have to be custom-integrated in order to work.  Or it would have to be very limited in scope and in its ability to deliver benefits.

This is going to be a hard problem to solve because SDOs are essentially companies these days, and companies need profits.  SDOs get revenue from membership, events, etc. and this means there’s an incentive to jump into areas that are hot and would attract opportunities for new revenue.  So these bodies are now supposed to parcel out service modeling responsibility fairly?  Good luck with that.  Most of them also hold material internally, for members only, until the issues are fully resolved, which means nobody gets to comment or see what’s evolving (unless, of course, they join and often pay).

The good news on the model front is that there are only a few rules (which I’ve cited in other blogs) that a model would have to support to ensure that process automation strategies could be made portable even across model implementations.  You need processes to be defined as microservices, and you need each model element itself to be a microservice-implemented intent model.  But even these easy steps have yet to be taken in any organized way.  I know that ADVA/Overture, Alcatel-Lucent/Nokia, Ciena, HP, Huawei, and Oracle have the general tools needed to do what’s needed, but it’s hard to pull the detail from their public material.  Thus, it’s hard to say exactly how their service automation solutions would work in the real world.  Huawei, at the fall TMF event, offered the greatest insight into service automation, but it wasn’t clear how much of it was productized.
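
Here’s how those two rules might look in skeletal form; this is my own sketch of the pattern, not any vendor’s implementation:

```python
# Sketch of the two rules: each model element is an intent model exposed as
# a microservice (properties in, SLA out, internals opaque), and operations
# processes are themselves stateless microservices steered by the model.
# Every name here is invented for illustration.

from dataclasses import dataclass, field

@dataclass
class IntentModel:
    name: str
    properties: dict                 # what is asked of the element
    sla: dict                        # what the element promises in return
    state: str = "Ordered"
    state_event_table: dict = field(default_factory=dict)

    def on_event(self, event):
        # Steering: look up the process bound to (state, event); any
        # conforming process microservice is portable across models.
        process = self.state_event_table.get((self.state, event))
        if process:
            process(self)
        # How the SLA is met stays hidden from higher layers.

def activate_process(element):       # a stateless "process microservice"
    element.state = "Active"
    print(f"{element.name} active, SLA: {element.sla}")

vpn = IntentModel("vpn-core", {"sites": 4}, {"availability": "99.99%"})
vpn.state_event_table[("Ordered", "Activate")] = activate_process
vpn.on_event("Activate")
```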

The real world, of course, is where this has to be proven out.  Almost every operator I’ve talked with tells me that they need a strong NFV trial that incorporates full-scope service automation by sometime in Q3 if they’re to deploy in quantity in 2017.  Time is getting shorter.  We already see operators thinking about broad modernization rather than about NFV specifically, and if NFV doesn’t address the critical service automation issues, it may make up a smaller piece of operator deployment than many think, or hope.