Sub-Service Management as a Long-Term SDN/NFV Strategy

For my last topic in the exploration of operator lessons from early SDN/NFV activity, I want to pursue one of the favorite trite topics of vendors: “customer experience”.  I watched a Cisco video on the topic from the New IP conference, and while it didn’t IMHO demonstrate much insight, it did illustrate the truth that customer experience matters.  I just wish people did more than pay lip service to it.

Customer experience management in a service sense is a superset of what used to be called SLA management, and it reflects the fact that most information delivered these days isn’t subject to a formal SLA at all.  What we have instead is this fuzzy and elastic conception of quality of experience, which is the classic “I-know-it-when-I-see-it” concept.  Obviously you can’t manage to a purely subjective standard, so we need to put some boundaries on the notion and also frame concepts for managing what we find.

QoE is different from SLAs not only in that it’s usually not based on an enforceable contract (which, if it were, would transition us to SLA management) but in that it’s more statistical.  People typically manage for SLA and engineer for QoE.  Most practical customer experience management approaches are based on analytics, and the goal is to sustain operation in a statistical zone where customers are unlikely to abandon their operator because they’re unhappy.  That’s a very soft concept, depending on a bunch of factors that include whether the customer was upset before the latest issue and whether the customer sees a practical alternative that can be easily realized.

Sprint and T-Mobile have launched campaigns that illustrate the QoE challenge.  If I believe that some significant percentage of my competitors’ customers (and likely my own as well) are dissatisfied with service but unwilling to go through the financial and procedural hassle of changing, then I’ll make it easy for competitors’ customers to change, and I may even give them an incentive.  Competition is the goad behind customer experience management programs; if your competitor can induce churn then you have a problem, no matter what your absolute measurements say.

Operators recognize that services like Carrier Ethernet are usually based on identifiable resource commitments, which means that you can monitor the resources associated with the service and not just guess in a probabilistic sense what experience a user has based on gross resource behavior.  In consumer services there are no fixed commitments, and so you have to do things differently and manage the resource pool as a whole.
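To make the "manage the pool" idea concrete, here is a minimal sketch of a statistical pool check.  It assumes that per-session throughput is an acceptable QoE proxy and that the operator picks a minimum throughput and a target share of sessions that must meet it; SessionSample, pool_in_zone, and both thresholds are hypothetical names for illustration, not anything operators have described.  No individual customer is tracked, only the share of sampled sessions that land in the acceptable zone.

```python
from dataclasses import dataclass


@dataclass
class SessionSample:
    """One measured consumer session drawn from a shared resource pool."""
    session_id: str
    throughput_mbps: float  # assumed QoE proxy; real QoE is far fuzzier


def pool_in_zone(samples, min_mbps, target_fraction):
    """Return (fraction_ok, in_zone) for a pool of sampled sessions.

    The pool, not any individual customer, is judged: it is "in zone" when
    at least target_fraction of sessions meet the minimum throughput.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ok = sum(1 for s in samples if s.throughput_mbps >= min_mbps)
    fraction_ok = ok / len(samples)
    return fraction_ok, fraction_ok >= target_fraction
```

With four sessions at 12, 3.5, 8, and 25 Mbps and a 5 Mbps floor, three of four pass, so the pool is in zone against a 70% target but not against a 90% one.  That is the sense in which QoE is engineered statistically rather than managed per service.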

NFV, according to operators, has collided with both practice sets.  For business services, dynamic resource assignment and automated operations are great, but they introduce new variables into the picture.  Here, NFV is mostly about deriving service state from virtual resource state.  That’s a problem that can be solved fairly easily if you look at it correctly.  The consumer problem is different because we have no specific virtual resource state to derive from.
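For the business case, deriving service state from virtual resource state can be as simple as keeping a binding from each service to the virtual resources committed to it.  The sketch below assumes a hypothetical BINDINGS table, a three-level State, and a "worst bound resource wins" rule; none of these come from a standard.  Because the binding is explicit, a resource fault can also be followed forward to the services it affects.

```python
from enum import IntEnum


class State(IntEnum):
    # Ordered so that max() selects the worst state.
    UP = 0
    DEGRADED = 1
    DOWN = 2


# Hypothetical binding: each business service is committed to specific
# virtual resources (e.g. VNF instances), Carrier Ethernet style.
BINDINGS = {
    "ethernet-site-a": ["vnf-fw-1", "vnf-rtr-1"],
    "ethernet-site-b": ["vnf-rtr-1", "vnf-rtr-2"],
}


def service_state(service, resource_states, bindings=BINDINGS):
    """Derive service state as the worst state of its bound resources.

    A resource with no reported state is assumed DOWN (a conservative choice).
    """
    return max(resource_states.get(r, State.DOWN) for r in bindings[service])


def affected_services(resource, bindings=BINDINGS):
    """Follow the binding forward from a faulted resource to its services."""
    return sorted(s for s, rs in bindings.items() if resource in rs)
```

This only works because the resource-to-service connections are fixed; with pooled consumer resources there is no such table to consult.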

What operators would like to avoid is “whack-a-mole” management where they diddle with resource pool behavior to achieve the smallest number of complaints.  That sort of thing might work if you could converge on your optimum answer quickly, and if resource state were then stable enough that you didn’t have to keep revisiting your numbers.  Neither is likely true.

One possible answer that operators are looking at, but have not yet been able to validate in a full trial, is correlating service and resource analytics.  If you have a quirky blip on your resource analytics dashboard, you could presume with fairly low risk of error that service issues at that time were correlated with the blip.  Thus, you could work to remedy the service problems by remediating the resource blip, even if you didn’t understand the full causal relationships.  The barrier to this mechanism is not only that it’s not easy to test the correlations today, but that it’s not even easy to gather the service-side analytics.  Measurement of QoE, you’ll recall from earlier comments, is measuring “windy”.  It’s in the eye of the beholder.
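A minimal sketch of the correlation idea follows, assuming the hard part (gathering service-side analytics) is already solved and reduced to a list of complaint timestamps.  Both event streams are counted in fixed time bins and compared with a Pearson correlation; the function names and the binning approach are illustrative assumptions, not a method operators have validated.  A high coefficient says only that blips and complaints move together, which is exactly the "without full causal understanding" bet described above.

```python
import math


def bin_counts(timestamps, start, interval, n_bins):
    """Count events per fixed-width time bin (timestamps in seconds)."""
    counts = [0] * n_bins
    for t in timestamps:
        i = int((t - start) // interval)
        if 0 <= i < n_bins:  # events outside the window are ignored
            counts[i] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is flat."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def correlate(resource_events, service_events, start, interval, n_bins):
    """Correlate resource-side anomalies with service-side complaints."""
    r = bin_counts(resource_events, start, interval, n_bins)
    s = bin_counts(service_events, start, interval, n_bins)
    return pearson(r, s)
```

Resource blips at 5, 15, and 16 seconds against complaints at 6, 17, 18, and 19 seconds, binned at 10 seconds, give a coefficient close to 1, so remediating the resource blip becomes the obvious first move.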

Most of the operators I’ve talked with are now of the view that NFV management, SDN management, and probably management overall are going to be driven down the same path by the same notions (QoE substitutes for SLA, multi-tenancy substitutes for dedicated, virtualized substitutes for real), and that they need a new approach.  A few of the “literati” are now looking at what I’ll call “sub-service management”.

Sub-service management says that a “service” is a collection of logical functions/behaviors that are individually set to at least a loose performance standard.  The responsibility of service automation is to get each functional element to conform to its expectations.  Each element is also responsible for contributing a “management view” in the direction of the user, perhaps in the simple form of a gauge whose red-to-green range runs from non-conforming to beating the specifications.
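Here is one way the sub-service model might be sketched.  It assumes each functional element carries a single latency target as its "loose performance standard" and exposes a three-color gauge; the latency metric, the yellow band, and the 80% "beating spec" margin are all hypothetical choices made for illustration.  The service's user-facing management view is just the collection of its elements' gauges.

```python
from dataclasses import dataclass


@dataclass
class SubService:
    """A logical function within a service, held to a loose standard."""
    name: str
    spec_latency_ms: float      # assumed performance target for this function
    measured_latency_ms: float  # latest observation from service automation

    def conforms(self) -> bool:
        return self.measured_latency_ms <= self.spec_latency_ms

    def gauge(self) -> str:
        """User-facing gauge: red = non-conforming, green = beating spec.

        Yellow (meeting spec without a 20% margin) is an illustrative band.
        """
        ratio = self.measured_latency_ms / self.spec_latency_ms
        if ratio > 1.0:
            return "red"
        if ratio > 0.8:
            return "yellow"
        return "green"


@dataclass
class Service:
    """A service as a composition of functional sub-services."""
    name: str
    parts: list

    def management_view(self) -> dict:
        # Only functional elements are shown; no resources leak into the view.
        return {p.name: p.gauge() for p in self.parts}
```

A content delivery service built from a cache, an origin fetch, and DNS would then show the user three gauges, with no reference to which servers or VNFs sit underneath.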

If something goes wrong with a sub-service function we launch automated processes to remediate, and at the same time we look at the service through the user-side management viewer to see if something visible has gone bad.  If so, we treat this as a QoE issue.  We don’t try to associate user service processes with resource remediation processes.
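The decoupling in the paragraph above can be sketched as a fault handler that does two independent things and never tries to link them.  The handler name and its callback parameters (remediate, user_view, open_qoe_issue) are hypothetical, and remediation is shown as a synchronous call for simplicity, where a real system would launch it asynchronously.

```python
from typing import Callable, Dict, List


def handle_subservice_fault(
    service_name: str,
    faulted_part: str,
    remediate: Callable[[str], None],
    user_view: Callable[[], Dict[str, str]],
    open_qoe_issue: Callable[[str, List[str]], None],
) -> List[str]:
    """React to a fault in one functional element of a service.

    1. Launch automated remediation for the faulted element.
    2. Independently, read the user-side management view; anything that
       shows red is treated as a QoE issue.  No attempt is made to map the
       QoE issue back to the resource remediation.
    """
    remediate(faulted_part)
    visible = sorted(name for name, colour in user_view().items() if colour == "red")
    if visible:
        open_qoe_issue(service_name, visible)
    return visible
```

Note that the QoE issue may name a different element than the one that faulted; the handler deliberately does not care.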

The insight of sub-service management is that if you aren’t going to have fixed, dedicated, resource-to-service connections with clear fault transmission from resource to service, then you can’t work backwards from service faults to find resource problems.  The correlation may be barely possible for business services but it’s not possible for consumer services because the costs won’t scale.

There are barriers to sub-service management, though.  One is that we don’t have a clear notion of a service as a combination of functional atoms.  ETSI conflates low- and high-level structuring of resources and so makes it difficult to take a service like “content delivery” and pick out functional pieces that are then composed to create services.  And because only functionality can ever be meaningful to a service user, it’s hard to present a user management view.  Another is that there is no real notion of “derived operations”, meaning the generation of high-level management state through an expression-based set of lower-level resource states.
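As an illustration of what "derived operations" could mean, the sketch below computes each functional element's state as an expression over lower-level resource states, rather than by a fixed rule.  The 0.0-to-1.0 health scores, the resource names, and both derivation rules (best of two redundant origins, average of load-shared caches) are hypothetical assumptions; the point is only that each element gets its own expression, and the service state is then derived from the functional layer.

```python
from typing import Callable, Dict

# Health score per resource: 0.0 (failed) to 1.0 (fully healthy).
ResourceStates = Dict[str, float]

# Hypothetical derivation rules: each functional element's state is an
# expression over lower-level resource states.
DERIVATIONS: Dict[str, Callable[[ResourceStates], float]] = {
    # Redundant origin servers: the function is as healthy as the better one.
    "origin": lambda r: max(r["origin-a"], r["origin-b"]),
    # Load-shared cache nodes: average health across the tier.
    "cache": lambda r: (r["cache-1"] + r["cache-2"] + r["cache-3"]) / 3,
}


def derive(resources: ResourceStates, rules=DERIVATIONS) -> Dict[str, float]:
    """Compute every functional element's state from the resource layer."""
    return {name: rule(resources) for name, rule in rules.items()}


def service_health(functional: Dict[str, float]) -> float:
    """The service is only as good as its weakest functional element."""
    return min(functional.values())
```

A failed origin-a leaves the origin function fully healthy because origin-b covers it, while one half-healthy cache node pulls the cache function (and therefore the service) down to about 0.83.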

I don’t think that it will be difficult to address any of these points, and I think the only reason why we’ve not done that so far is that we’ve focused on testing the mechanisms of NFV rather than testing the benefit realization.  As I’ve said in earlier blogs, the focus of PoCs and trials is now shifting and we’re looking at the right areas.  It’s just a matter of who will come up with an elegant solution first.