Making Analytics Work as the Basis for Management of Virtual Services

Anyone used to the SLAs of TDM services knows that SLAs have changed.  We used to talk about “error-free seconds”; now we usually talk about monthly uptime.  Mobile devices have changed our view of things like call quality—“Can you hear me now?” is almost the classic example of the trend toward calling anything without a fatal impairment “good”.  People accept that, and I’m not complaining, but this trend might explain why Juniper bought AppFormix, and it might also help us understand how analytics and SLAs combine in SDN and NFV services.

Any “virtual” service, meaning any service that partitions resources or hosts components, has a potential issue with SLAs.  First, sharing any resource depends on statistical multiplexing, meaning sizing a pool of real resources against expected overall utilization.  There are no fixed assignments, so it’s not possible to write an exact SLA of the type we had with TDM, where resources were dedicated.  Second, it’s difficult to express the state of a virtual service in terms of the behavior of the real resources.  In many cases, the real resource isn’t even something the service user expects to be there—servers in a firewall service are a good example.

Sharing resources also creates issues in management, no matter what your specific SLA goals might be.  If a thousand virtual services utilize a single physical element, there’s a risk that the management systems of all those virtual services will try to read status from, or even write control changes to, the shared element.  At best, this creates what’s almost a management-level DDoS attack, where the “attack” is really a flood of legitimate (but ill-advised) status inquiries.  At worst, one user might optimize their service at the expense of others sharing the resource.

Early on in the evolution of virtual services, I suggested that the solution to both problems lies in what I called “derived operations”.  The real status of real functional elements—shared or not—would be stored in a common repository by a status-gathering process independent of users.  That repository would then be queried for status and status trends, meaning that analytic processes would run against it.  A management query becomes a database query, and the relationship between each real-resource status and the overall status of a dependent service is expressed in the formula through which repository data is analyzed.  VPN status equals the sum of the status of the VPN’s elements, so to speak.
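To make that concrete, here’s a minimal sketch of the derived-operations idea in Python.  Everything in it is hypothetical—the resource names, the health values, the roll-up functions—and the only point is that service status is a formula evaluated over repository data, not a direct poll of devices.

```python
from statistics import mean

# Repository: resource id -> latest collected health sample
# (0.0 = failed, 1.0 = fully healthy); a collector process keeps this current.
repository = {
    "trunk-17": 1.0,
    "trunk-42": 0.6,
    "vrouter-3": 1.0,
}

def derived_status(resource_ids, formula=min):
    """Evaluate a service-level status formula over repository entries.
    'formula' expresses how element status rolls up (worst-of, average, ...)."""
    return formula(repository[r] for r in resource_ids)

# "VPN status equals the sum of the status of the VPN's elements", so to speak:
vpn_resources = ["trunk-17", "trunk-42", "vrouter-3"]
print(derived_status(vpn_resources))         # worst-of roll-up -> 0.6
print(derived_status(vpn_resources, mean))   # average roll-up
```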

A lot of possible SLA and management views can derive from this model.  One is the classic analytics-centric management vision.  Many NFV implementations propose analytics as the source of the events that drive service lifecycle management.  Obviously, you have to be able to do database dips and perform over-time correlative analysis to derive actionable events from mass status information.  If the trend in link utilization is a determinant of whether to declare the link “congested”, we need to see the utilization data as a sequence over time, not just at this instant.  Many operators and vendors want to manage all virtual services this way, even those based entirely on legacy network elements.
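A small sketch of what that over-time analysis might look like, with an invented window size and threshold: an event is emitted only when utilization stays high across the whole window, not on a single spike.

```python
from collections import deque

WINDOW = 5          # number of samples to consider (illustrative)
THRESHOLD = 0.8     # 80% utilization (illustrative)

recent = deque(maxlen=WINDOW)

def on_sample(utilization):
    """Record a utilization sample; return a 'congested' event only if the
    whole window is above threshold, i.e. a sustained trend."""
    recent.append(utilization)
    if len(recent) == WINDOW and all(u > THRESHOLD for u in recent):
        return {"event": "link-congested", "window_avg": sum(recent) / WINDOW}
    return None

for sample in [0.7, 0.85, 0.9, 0.88, 0.91, 0.93]:
    event = on_sample(sample)
    if event:
        print(event)
```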

The issue with the pure analytics vision is that a “fault” or a “condition” is in the eye of the beholder.  There is no single standard for event generation.  Even the hard failure of a given trunk might not create a service event if the service’s traffic doesn’t transit the failed resource.  Thus, not every fault correlates with a given SLA, and that means you have to be able to understand which faults really have to be handled.
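The filtering implied here is easy to illustrate.  In this hypothetical fragment (all service and resource names made up), a trunk failure becomes a service event only for services whose resource assignments actually include the failed trunk.

```python
# Map each service to the resources it actually depends on.
service_resources = {
    "vpn-acme": {"trunk-17", "vrouter-3"},
    "vpn-bravo": {"trunk-42", "vrouter-3"},
}

def services_impacted(failed_resource):
    """A resource fault is a service event only for dependent services."""
    return [svc for svc, res in service_resources.items() if failed_resource in res]

print(services_impacted("trunk-42"))   # only vpn-bravo sees an SLA event
```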

Fault correlation for effective event generation is what I meant by the notion that repository data would be run through formulas to synthesize the status of virtual elements.  However, the relationship between virtual services and real resources is variable, reflecting the current state of deployment.  That means the status formulas have to be adjusted as service resource assignments change.  When you provision a service (or redeploy it), you essentially build the status formula for it.  Actually, you build an entire set of them, one for each event you want to generate.
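As a sketch of that provisioning-time step (the names, thresholds, and event types are all hypothetical), each deployment or redeployment rebinds a set of event formulas to the resources actually assigned:

```python
def build_formula_set(assigned_resources):
    """Return event-name -> predicate over a repository snapshot, bound to the
    current resource assignment.  Rebuilt on every (re)deployment."""
    return {
        "service-down": lambda repo: any(repo[r] == 0.0 for r in assigned_resources),
        "service-degraded": lambda repo: min(repo[r] for r in assigned_resources) < 0.7,
    }

formulas = build_formula_set(["trunk-17", "vrouter-3"])
repo_snapshot = {"trunk-17": 0.6, "vrouter-3": 1.0}

events = [name for name, test in formulas.items() if test(repo_snapshot)]
print(events)   # ['service-degraded']
```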

This model of derived operations is, IMHO, the most important element in any analytics/repository model of virtual service management.  Yes, you need a repository, and yes, you need a collector to populate it.  But with only these and a fixed analytics model to generate events, you still have nothing.  A fixed model can’t really differentiate between a condition and an event, the former being something that happens and the latter being something that impacts some service’s (or services’) lifecycle.

A formula-linked approach to derived operations is a critical step, but not the only one.  You still have the problem of distributing the information, and here one of the management issues of the NFV model emerges.  VNFs, which represent the “service” element, have an autonomous component of the VNF manager co-resident, and that component would be a “logical” place to dip into a repository for the status of related resources.  The problem is that it’s not clear how the VNF (which plays no role in the hosting or connection decisions associated with deployment/redeployment) would know what resources it had.  Even if it did, you can’t have every VNF polling for status when it feels threatened; you have the same pseudo-DDoS problem that arises if you poll resources directly.

Anyone wanting to get cloud or virtual network service analytics right has to get events to the service lifecycle management processes.  That means you can’t use general-purpose tools that aren’t linked to the building of a service, so any vendor who buys an analytics player will have to augment the basic tools with some specific mechanism to author and modify those status formulas.  You also can’t let the lifecycle processes of individual parts of the service operate autonomously; they are not the entire service, and so their actions have to be coordinated.

The process of formula-building would be easier if we presumed that there was, for each class of resource and each class of virtual function, a baseline “MIB” that all member elements were expected to populate.  We could then envision the “repository” as holding a time-series of element MIBs, and it becomes the responsibility of the collector functions to map the real variables of the devices they collect from onto the appropriate class MIB for storage in the repository.
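A sketch of that collector-side mapping, assuming a hypothetical “link” class MIB; the vendor variable names are modeled loosely on SNMP IF-MIB objects, and the “_rate” fields are invented for the example:

```python
import time

# Baseline class MIB for "link" elements (illustrative field set).
CLASS_MIB_LINK = ["admin_status", "oper_status", "utilization", "error_rate"]

def normalize_link_sample(vendor_vars):
    """Map raw, device-specific variables onto the link-class MIB."""
    return {
        "admin_status": vendor_vars.get("ifAdminStatus"),
        "oper_status": vendor_vars.get("ifOperStatus"),
        "utilization": vendor_vars.get("ifHCInOctets_rate", 0) / vendor_vars.get("ifSpeed", 1),
        "error_rate": vendor_vars.get("ifInErrors_rate", 0),
    }

def collect(repository, element_id, vendor_vars):
    """Append a time-stamped, class-normalized record to the repository."""
    repository.setdefault(element_id, []).append(
        {"ts": time.time(), **normalize_link_sample(vendor_vars)}
    )

repo = {}
collect(repo, "trunk-17", {"ifAdminStatus": 1, "ifOperStatus": 1,
                           "ifHCInOctets_rate": 8.0e8, "ifSpeed": 1.0e9})
print(repo["trunk-17"][-1]["utilization"])   # 0.8
```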

If we can get resource status standardized by class and collected in a repository (the never-approved IETF work on “Infrastructure to Application Exposure”, or i2aex, could have done this), then we can use any number of publish-and-subscribe models to disseminate the relevant conditions as events.  Then all we need is state/event-driven lifecycle management to organize everything.
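Those last two pieces could be sketched as below: a simple publish/subscribe fan-out of repository-derived events, feeding a state/event table that drives each service’s lifecycle.  The states, events, and actions are hypothetical placeholders, not a reference to any particular NFV implementation.

```python
from collections import defaultdict

subscribers = defaultdict(list)      # event name -> list of handlers

def subscribe(event_name, handler):
    subscribers[event_name].append(handler)

def publish(event_name, payload):
    """Fan a repository-derived event out to every subscribed lifecycle process."""
    for handler in subscribers[event_name]:
        handler(payload)

# State/event table for one service's lifecycle: (state, event) -> (next state, action)
STATE_TABLE = {
    ("active", "service-degraded"): ("repairing", "redeploy"),
    ("repairing", "repair-complete"): ("active", "none"),
}

class ServiceLifecycle:
    def __init__(self, name):
        self.name, self.state = name, "active"

    def on_event(self, payload):
        key = (self.state, payload["event"])
        if key in STATE_TABLE:
            self.state, action = STATE_TABLE[key]
            print(f"{self.name}: state -> {self.state}, action = {action}")

svc = ServiceLifecycle("vpn-acme")
subscribe("service-degraded", svc.on_event)
publish("service-degraded", {"event": "service-degraded"})
```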

Right now, analytics as a solution to the challenges of virtualization-driven services is a cop-out, because it’s not being taken far enough.  We’ve had too many of these cop-outs in virtualization (SDN, NFV, whatever) so far, and we eventually have to deal with the critical issues of software automation of the service lifecycle.  If we don’t, we’ll never have broad virtualization-based transformation.