Lifecycle Automation, Granularity, and Scope

One of the moving parts in lifecycle automation is scope, by which I mean the extent to which an action impacts the network or a service.  It should be fundamental to our thinking about lifecycles and lifecycle automation, but it’s almost never discussed.  We’re going to discuss it today.

Imagine yourself driving down a long, lonely road.  You round a bend and find that a large tree has fallen across the road.  You get out and inspect the tree, and you realize you have a choice to make.  You have a simple saw, and it will take quite a while to remove enough branches to get past.  You could turn back and look for another route, but you don’t know how far out of your way that will take you, or whether that other route might also be blocked.  That’s the “scope decision”: is removing the obvious barrier a better choice than finding a route that isn’t blocked?

A service fails because a critical network element has failed.  Should we try to replace that element alone, or as small a number of elements as we can, to restore service?  Or should we simply tear the service down and re-provision it, knowing that approach will route around all failures, not just the one we happen to have encountered?  That’s another “scope decision”, and it might be of critical importance in lifecycle automation.

The complexity of a remediation process is directly proportional to how narrowly the remedy is scoped.  In order to restore a service using the smallest possible number of changes or new elements, I have to understand the way the service was deployed, understand where the broken element fits, and then be able to “back out” the smallest possible number of my decisions and remake them.  This requires a service map, a model that describes the way the service was first created.  That model could tell me what I need to do in order to make minimal changes to restore service.
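
To make the idea concrete, here’s a minimal sketch of what such a service map might record; the class and function names are my own invention, not any standard or product API.

```python
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class ServiceElement:
    """One node in a hypothetical service map: what was chosen when the
    element was deployed, and where it sits in the hierarchy."""
    name: str                          # e.g. "VPN", "VPN-CORE", "core-node-3"
    resource: Optional[str] = None     # the resource selected at deployment time
    parent: Optional["ServiceElement"] = None
    children: List["ServiceElement"] = field(default_factory=list)

    def add(self, child: "ServiceElement") -> "ServiceElement":
        child.parent = self
        self.children.append(child)
        return child

def decisions_to_back_out(failed: ServiceElement,
                          scope: ServiceElement) -> List[ServiceElement]:
    """Walk up from the failed element to the chosen scope boundary and list
    the deployment decisions that would have to be undone and remade."""
    path, node = [], failed
    while node is not None:
        path.append(node)
        if node is scope:
            break
        node = node.parent
    return path
```

Backing out at the failed element itself touches one decision; choosing the service root as the scope touches every decision beneath it, which is the scope question in miniature.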

A service model divides a service into discrete elements in a hierarchical way.  We have, perhaps, something called “VPN-CORE”, and that particular element has failed because something within it failed.  It might have a dozen sub-components, chosen from a larger list when the service was deployed.  So, do we fix the sub-component that broke?  If so, we’ll have to rebuild the service connections within and to/from VPN-CORE.  We can do that, but clearly it’s this kind of model-guided process that makes lifecycle automation possible.
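
Continuing the sketch above, fixing the broken sub-component in place might look roughly like this; the provision and connect calls are placeholders for whatever the real deployment system exposes, not real APIs.

```python
def repair_in_place(parent: ServiceElement, broken: ServiceElement,
                    provision, connect) -> ServiceElement:
    """Replace one failed sub-component of `parent`, then re-make the
    connections within and to/from the parent.  `provision` and `connect`
    are hypothetical hooks into the deployment system."""
    parent.children.remove(broken)
    replacement = provision(broken.name)      # redeploy only the broken piece
    parent.add(replacement)
    # Even though only one piece changed, its adjacencies must be rebuilt.
    for sibling in parent.children:
        if sibling is not replacement:
            connect(replacement, sibling)
    connect(replacement, parent)
    return replacement
```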

The other option, of course, is to start over.  Suppose we simply tear down the service completely and redeploy?  Suppose the services were built that way, built so that they didn’t leave committed resources hanging around that had to be removed?  Suppose everything were probabilistic and capacity-planned?  We’d simply restore any “capacity withdrawal” we’d made when we decommissioned the failed service, and withdraw capacity again when we commissioned it anew.
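
If services really were built that way, the start-over bookkeeping could be almost trivial.  A purely illustrative sketch, with an invented capacity-plan class rather than anything drawn from a real planning tool:

```python
class CapacityPlan:
    """A hypothetical capacity-planned resource pool: services withdraw
    capacity when commissioned and restore it when decommissioned."""
    def __init__(self, total: float):
        self.total = total
        self.withdrawn = 0.0

    def commission(self, demand: float) -> bool:
        if self.withdrawn + demand > self.total:
            return False                 # the capacity plan can't absorb this service
        self.withdrawn += demand         # make the capacity withdrawal
        return True

    def decommission(self, demand: float) -> None:
        self.withdrawn -= demand         # restore the withdrawal we made

def start_over(plan: CapacityPlan, demand: float) -> bool:
    """The 'start-over' remedy: tear down, restore capacity, withdraw again."""
    plan.decommission(demand)
    return plan.commission(demand)
```

Nothing here has to know how the failed service was originally composed, which is exactly the appeal of the approach.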

The question we’re asking with our scope discussion is whether the overall effort associated with limited-scope remediation of a failure is worth the cost and complexity.  Or, alternatively, how granular should we presume deployment and redeployment should be, in order to optimize the complexity-versus-benefit trade-off?  That question requires that we look at the benefits of highly granular remediation and redeployment.

The most obvious benefit would be limiting service disruption.  Depending on the service we’re talking about, a limited-scope remediation of a fault could leave the rest of the service up and running, while a total redeployment would create a universal fault.  In our VPN example, it’s probable that if we redeployed an entire VPN-CORE element, we’d break all the VPN connections for the period of the redeployment.  I don’t think this would be acceptable, so we can presume that a complete redeployment isn’t in the cards.

This is logical in the framework of our long-dark-road analogy.  If getting around the tree could be avoided only by going all the way back to the starting point of our journey, we’d probably not like that option much.  In service terms, we probably could not accept a remedy approach that took down more customers than the original fault.

Another benefit of granular remediation is that it avoids churning resource commitments.  That churn is, at the least, wasteful of both effort and capacity, since we’d have a period in which the resources weren’t available for use.  We also run a risk, if we churn resources, that we can’t get the right resources back because something else grabbed them between our release and our request for redeployment.  If we worked to limit that lost-my-place risk by holding the resources until redeployment, we’d increase the risk of wasting capacity, because we couldn’t be sure the new deployment would even select the same resources.
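
The lost-my-place risk and the wasted-capacity risk pull in opposite directions.  A rough illustration, assuming a hypothetical resource manager with release and allocate operations; nothing here corresponds to a real orchestrator API:

```python
def churn_resources(pool, old_resources, hold_until_replaced: bool):
    """Illustrates the churn trade-off.  If we release immediately, another
    service may grab the same resources before we redeploy (lost-my-place).
    If we hold them until the replacement is allocated, we waste capacity
    for the duration of the hold.  `pool` and its methods are stand-ins."""
    if hold_until_replaced:
        # Wasteful: old resources sit idle while the new placement is chosen.
        new_resources = pool.allocate(prefer=old_resources)
        pool.release(old_resources)
    else:
        # Risky: between release and allocate, someone else may take our place.
        pool.release(old_resources)
        new_resources = pool.allocate(prefer=old_resources)
    return new_resources
```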

The biggest problem with granular remediation is the complexity it introduces.  Going back to our roadblock analogy, if you focus entirely on getting around the local obstacle, you can easily end up cutting a lot of trees when another route could have avoided every one of them.  But if you think only in terms of local remedies, how do you know what the global conditions are, or when another, broader option is better?  Hierarchical decomposition of service models addresses this problem.

If “VPN” is decomposed into a half-dozen elements, the failure of any one of them can either be handled locally, or you can break everything down and recompose.  The former can be more complicated, because every time you break down one piece you likely have to reconnect the other pieces anyway, and you may lose sight of the fact that enough is broken that full recomposition is needed.  Should we then say that all faults are reported upward to the higher “VPN” element, which can either command the lower-level (broken) piece to fix itself, or simply start over at its own level?
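
The “report upward” option at the end of that question might look something like the sketch below, continuing the earlier model: the parent element receives the fault and decides between commanding local repairs and recomposing itself.  The decision rule (fraction of sub-components broken) and the hook names are purely illustrative.

```python
def handle_fault(parent: ServiceElement, broken: list,
                 repair_child, recompose, recompose_threshold: float = 0.5):
    """Faults are reported upward; the parent either commands each broken
    piece to fix itself or starts over at its own level.  `repair_child`
    and `recompose` are hypothetical hooks, and the threshold is arbitrary."""
    if len(broken) / len(parent.children) >= recompose_threshold:
        return recompose(parent)          # enough is broken: recompose the whole element
    for child in broken:
        repair_child(parent, child)       # otherwise fix each broken piece locally
    return parent
```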

It’s clear to me that the overall problem with granular remediation is that it requires a very strong service modeling strategy.  I don’t think that should be a problem, because I think such a strategy is important for other reasons, like its ability to support multiple approaches to the same service feature during a period of technology migration.  It also seems important to have a granular service model available to guide service setup.  Despite the value, though, the difficulty of effective granular modeling seems to be what’s stalling the lifecycle automation efforts of vendors and operators alike.  ONAP is a good example.

I don’t like the “start-over” choice, in part because it’s not completely clear to me that it avoids service modeling.  In order for it to work, we need a completely homogeneous infrastructure or we need fully autonomous components that bind themselves together (adaptive, in other words) to do things or fix things.  I’m not sure this can be done in an era where we’re talking about migrating from proprietary boxes to hosted software feature instances.  But if we don’t figure out, and accept, and promote, a broad service modeling strategy, it may be our only option going forward.

You can bend infrastructure and service level agreements to eliminate the need for granular remediation and service modeling.  You can probably bend your deployments of modernized hosted-feature technology for the same purpose.  You can possibly even handle networks and services that cross administrative boundaries and accommodate persistent differences in technology from one network to another.  All these qualifiers make me nervous.  We have TOSCA-based service models today; why not focus on the modeling-and-granular-remediation approach, and define how it works?  It would greatly advance our progress toward lifecycle automation.