What Has to Happen for Service Automation to Work

NFV is going to succeed, if we define success as having some level of deployment.  What’s less certain is whether it will succeed optimally, meaning reach its full potential.  For that to happen, NFV has to be able to deliver both operations efficiency and service agility on a scale large enough to impact operators’ revenue/cost compression.  Not only is operations efficiency itself a major source of cost reduction (far more than capex savings would ever likely be), but it’s an important part of any new service revenue strategy.  That’s because new services, and new ways of doing old services, introduce virtualization and the connection of virtual elements, and that increases complexity.  Only better service automation can keep those costs under control.  It makes no sense to use software to define services and then fail to exploit software-based service and network management.

I think pretty much everyone in the vendor and operator community agrees with these points (yes, there are still a few who want to present the simpler but less credible capex reduction argument).  Consensus doesn’t guarantee success, though.  We’re starting to see some real service automation solutions emerge, and from these we can derive some principles that good approaches must obey.  They’ll help operators pick the best approaches, and help vendors define good ones.

The first principle is that virtualization and service automation combine effectively only if we presume service and resource management are kept loosely coupled.  Customers who buy services are buying features, not implementations.  When features mapped in an obvious way to devices (as with IP and Ethernet services), we could link service and resource management directly.  Virtualization tears the resource side so far away from the functionality that buyers of the service could never understand their status in resource terms (“What?  My server is down!  I didn’t know I had one!” or “Tunnels inside my firewall?  Termites?”).

Software types would probably view service and resource management as an example of a pair of event-coupled finite-state machines.  Both service and resource management would be dominated by what could be viewed as “private” events, handled without much impact on the other system.  A few events on each side would necessarily generate a linking event to the other.  In a general sense, services that had very detailed SLAs (and probably relatively high costs) would be more tightly coupled between service and resource management, so faults in one would almost always tickle the other.  Services that were totally best-effort might have no linkage at all, or simply an “up/down” status change.

Where the coupling is loose, service events arise only when a major problem has occurred at the resource level, one that could affect customer status or billing.  The pool of resources is maintained independently of services, based on overall capacity planning and resource-level fault handling.  Where coupling is tight, fault management is service-specific, and so is the response to resource state changes.  The pool is still managed for overall capacity and faults, but remediation moves more into the service domain.
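To make that coupling idea concrete, here’s a minimal Python sketch of a pair of event-coupled state machines.  The states, event names, and the tight_coupling flag are illustrative assumptions of mine, not a description of any product; the point is simply that most events stay private to one machine and only a few cross over.

```python
from enum import Enum, auto

class ResourceState(Enum):
    ACTIVE = auto()
    DEGRADED = auto()
    FAILED = auto()

class ServiceState(Enum):
    IN_SLA = auto()
    AT_RISK = auto()
    VIOLATED = auto()

class ServiceFSM:
    """Service-side state machine; it sees only feature-level status."""
    def __init__(self):
        self.state = ServiceState.IN_SLA

    def handle(self, event):
        if event == "resource_degraded":
            self.state = ServiceState.AT_RISK      # might trigger a customer notice
        elif event == "resource_down":
            self.state = ServiceState.VIOLATED     # billing/SLA-credit logic would go here
        elif event == "resource_restored":
            self.state = ServiceState.IN_SLA

class ResourceFSM:
    """Resource-side state machine; most events stay private here."""
    def __init__(self, service_fsm, tight_coupling=False):
        self.state = ResourceState.ACTIVE
        self.service_fsm = service_fsm
        self.tight_coupling = tight_coupling       # detailed-SLA services couple tightly

    def handle(self, event):
        if event == "minor_fault":
            self.state = ResourceState.DEGRADED
            if self.tight_coupling:                # loose coupling absorbs this privately
                self.service_fsm.handle("resource_degraded")
        elif event == "hard_fault":
            self.state = ResourceState.FAILED
            self.service_fsm.handle("resource_down")   # linking event in either mode
        elif event == "repaired":
            self.state = ResourceState.ACTIVE
            self.service_fsm.handle("resource_restored")

# A best-effort service absorbs minor faults; a premium one would be told about them.
svc = ServiceFSM()
res = ResourceFSM(svc, tight_coupling=False)
res.handle("minor_fault")
assert svc.state is ServiceState.IN_SLA
```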

The second principle of efficient service automation is that you cannot allow either fault avalanches or colliding remedies.  Automated systems require event handling, which means both “passing” events and “handling” them.  In the passing phase, an event associated with a major trunk or a data center might represent hundreds or thousands of faults, and if that many faults are generated at once, the result could swamp the handling processes.  Even if there’s a manageable number of events to handle, you still have to be sure that the handling processes don’t collide with each other, which could result in collisions in resource allocation, delays, or errors.  Traditionally, network management has faced both these problems, with varying degrees of success.

Fault correlation tools are often used to respond to low-level problems that generate many higher-level events, but in virtual environments it may be smarter to control the generation of events in the first place.  I’m an advocate of the notion of a hierarchy of service objects, each representing an intent model with an associated SLA.  If a fault is generated at a low level, remediation should take place there, with an event passed up the chain only if that early-stage effort fails.
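A sketch of that escalation rule follows, assuming each intent-modeled object gets a local remediation hook; the object names and the fault format are hypothetical, chosen only to show faults being absorbed low in the hierarchy and summarized on the way up.

```python
class IntentObject:
    """One node in a hierarchy of service objects, each an intent model with
    an SLA.  Faults are handled locally; only unresolved ones escalate."""
    def __init__(self, name, parent=None, remediator=None):
        self.name = name
        self.parent = parent
        self.remediator = remediator   # callable(fault) -> bool, e.g. redeploy or reroute

    def on_fault(self, fault):
        if self.remediator and self.remediator(fault):
            return f"{self.name}: handled locally"      # private event, nothing propagates
        if self.parent is not None:
            # Escalate one summarized event, not the raw fault storm.
            return self.parent.on_fault({"from": self.name, "detail": fault})
        return f"{self.name}: service-level event raised"

# Example: a VNF-level object tries to fix things before the service layer hears anything.
service = IntentObject("vpn-service")
site = IntentObject("site-access", parent=service)
vnf = IntentObject("firewall-vnf", parent=site,
                   remediator=lambda fault: fault.get("severity") == "minor")

print(vnf.on_fault({"severity": "minor"}))   # handled locally
print(vnf.on_fault({"severity": "major"}))   # escalates up to the service object
```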

Collisions between management processes seeking to remediate problems, or between those processes and new activations, are historically handled by serialization, meaning that you ensure that in a given resource domain only one process is actually diddling with the hardware/software functionality that’s being shared or pooled.  Obviously, having a single serialized handling chain for an entire network would put you at serious risk in performance and availability, but if we have too many handling chains, we have to worry about whether actions in one would impact actions in another.
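One way to picture that serialization is a single worker per resource domain, with every configuration action funneled through its queue.  This is a sketch under that assumption, not a prescription for how an orchestrator should be built.

```python
import queue
import threading

class DomainSerializer:
    """Serializes all configuration actions against one resource domain:
    only this worker thread ever touches the shared hardware/software."""
    def __init__(self, domain_name):
        self.domain = domain_name
        self._work = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, action, *args):
        """Service or management processes enqueue actions instead of acting directly."""
        self._work.put((action, args))

    def _run(self):
        while True:
            action, args = self._work.get()
            try:
                action(*args)          # one action at a time per domain
            finally:
                self._work.task_done()

    def drain(self):
        self._work.join()             # wait until the domain's queue is empty
```

The obvious design tension is exactly the one above: one serializer per network is safe but slow, many serializers are fast but can still collide on resources that span domains.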

An example of this is where two services are trying to allocate capacity on a data center interconnect trunk.  They might both “check” status and find capacity available, create a service topology, and then one would fail because it lost the race to actually commit the resources.  The loser would then have to rebuild the service based on the new state of resources.  Too many of these collisions could generate significant delay.

Fixing the handler-collision problem in a large, complex system isn’t easy.  One thing that could help is to avoid service deployment techniques that look for capacity first and then allocate it only when the full configuration is known; that introduces an interval of uncertainty between the two steps that raises the risk of collision.  Another approach is to allocate the “scarce” resources first, which suggests that service elements more likely to be in contention should be identified and prioritized during the deployment process.  A third is to commit a resource at the moment its status is checked, even before the actual setup can be completed.
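The difference between the racy pattern and that third fix might look like this in a sketch; the class, the capacity units, and the method names are illustrative assumptions, not anyone’s API.

```python
import threading

class TrunkCapacity:
    """Shared data-center-interconnect capacity, showing a racy
    check-then-commit pattern and a reserve-at-check alternative."""
    def __init__(self, total_gbps):
        self._free = total_gbps
        self._lock = threading.Lock()

    # Racy pattern: a service checks, builds its topology, then commits.
    # The gap between check() and commit() is where two services can both
    # decide the same capacity is theirs; the loser must rebuild.
    def check(self, gbps):
        return self._free >= gbps

    def commit(self, gbps):
        with self._lock:
            if self._free < gbps:
                return False
            self._free -= gbps
            return True

    # Third fix: commit the resource at the moment its status is checked.
    def reserve(self, gbps):
        with self._lock:
            if self._free < gbps:
                return False
            self._free -= gbps
            return True

    def release(self, gbps):
        with self._lock:
            self._free += gbps

# Two services racing for the last 10G: with check/commit, both pass the check
# and one commit fails late; with reserve(), the loser learns up front.
trunk = TrunkCapacity(total_gbps=10)
print(trunk.check(10), trunk.check(10))      # True True: the illusion of room for two
print(trunk.reserve(10), trunk.reserve(10))  # True False: collision caught immediately
```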

The final principle of service automation is that the processes have to be able to handle the full scope of services being deployed.  This means not only that you have to be able to automate a service completely and from end to end across all legacy and evolving technologies, but also that you have to be able to do this at the event rate appropriate to the size of the customer base.

The biggest misconception about new technologies like SDN and NFV is that you can bring about wholesale changes in cost with limited deployments.  In areas like operations efficiency and service agility, you gain very little if you deploy a few little pieces of the new buried in the cotton of the old.  Not only that, the new service lifecycle and operations/management practices these areas bring are jarring changes for those weaned on traditional models, which means you can end up losing instead of gaining in the early days.

It seems to me inescapable that if you plan on a new operations or service lifecycle model, you’re going to have to roll that model out faster than you could afford to change infrastructure.  That means it has to support the old stuff first, and perhaps for a very long time.

The other issue is one of scale.  We have absolutely no experience building either SDN or NFV infrastructure at the scale of even a single large corporate customer.  Most users know that technologies like OpenStack are “single-threaded,” meaning that a domain has only one element actually deploying anything at any point in time.  We can’t build large networks with single-threaded technology, so we’re going to have to parallelize SDN and NFV on a very large scale.  How do we interconnect the islands, separate the functions and features, commit the resources, and track the faults?  It can be done, I’m convinced, but I’m not the one who has to be convinced.  It’s the operations VPs and CIOs and CFOs and CEOs of network operators.
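If we do have to parallelize, one hedged sketch of how the islands might be tied together is a dispatcher that shards deployment work across many single-threaded domains.  The sharding rule and the names below are assumptions for illustration only, not a claim about how OpenStack or any orchestrator actually scales.

```python
import hashlib

class DomainDispatcher:
    """Spreads deployment requests across many single-threaded orchestration
    domains (say, one OpenStack instance each) so no one serialized deployer
    becomes the bottleneck for the whole network."""
    def __init__(self, domains):
        self.domains = domains          # per-domain orchestrator handles or endpoints

    def pick(self, service_id):
        # A stable hash keeps a given service's work inside one domain,
        # so cross-domain coordination is needed only for interconnect resources.
        digest = hashlib.sha256(service_id.encode()).hexdigest()
        return self.domains[int(digest, 16) % len(self.domains)]

dispatcher = DomainDispatcher(["dc-east", "dc-west", "dc-central"])
print(dispatcher.pick("customer-42-vpn"))   # always lands in the same domain
```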

I’ve noted both needs and solutions here, which means the problems can be solved.  It’s harder for vendors to sell complex value propositions, but companies in the past didn’t succeed by taking the easy way out.  In any case, while a complete service automation story is a bit harder to tell, it can be told even today, and easily advanced to drive major-league NFV success.  That’s a worthwhile goal.