If we still had typewriters these days, I’d bet that the typewriter vendors would be claiming support for NFV because they put “N”, “F”, and “V” keys on their keyboards. A few might even add an extra “V” key (we have “VNFs” in “NFV” implementations) and call what they do “advanced NFV support.” It’s no surprise, then, that we’re hearing a lot about the role of analytics and big data in NFV. But is any of this stuff true or relevant? Some is, some isn’t, so overall we need to look pretty hard at how analytics plays in NFV.
In networking, analytics and big data relate to gathering network information from all the possible status sources and correlating it for review. This is obviously two steps (gathering and correlating), and both require a bit of attention.
Gathering status information from a network means accessing the MIBs (in whatever format is required) and extracting the data. With NFV, one of the possible complications here is that of visibility. Is all of the management data created by NFV elements and resources visible in a common address space? Resource information, being multi-tenant and owned by the operator, is likely not available in the service address space, and even if it were the information would likely be compartmentalized. Service information, from either customer-dedicated “real” devices or per-customer VNFs, might be visible only as a MIB in the service address space, not accessible to a central collection process.
Even if we address the visibility issue we have the question of availability. We don’t want analytics processes polling MIBs that are already being polled by management systems, and perhaps by VNFs and users too. The problem of exposing “infrastructure” and associated management data to applications and users was taken up by the IETF (Infrastructure to Application Exposure, i2aex), but it hasn’t advanced there. We do have commercial and open-source tools (OpenNMS for example) that use the i2aex solution: a central repository where all management data is deposited and from which all application/user management access is derived via query.
It seems to me that this repository model is the right answer for a number of reasons. If we stick with our gathering-phase comments for now, a repository controls the amount of polling of the real device MIBs, which allows designers to plan better for management traffic and per-element access loading. The query mechanism means that in theory we could present MIB data in any address space and in any convenient form.
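To make that concrete, here’s a rough sketch (in Python, purely illustrative; the Repository class and the read_mib hook are my own inventions, not OpenNMS or any i2aex artifact) of the repository model: one controlled polling pass deposits time-stamped MIB data, and every application or user request is answered from the store by query rather than by touching the devices.

```python
import time
from collections import defaultdict

class Repository:
    """Central store for management data; consumers query this, never the devices."""
    def __init__(self):
        self._records = defaultdict(list)   # (element, oid) -> [(timestamp, value)]

    def deposit(self, element, oid, value):
        self._records[(element, oid)].append((time.time(), value))

    def query(self, element, oid, since=0.0):
        """Serve application/user management access from stored data only."""
        return [(ts, v) for ts, v in self._records[(element, oid)] if ts >= since]

def poll_cycle(repo, elements, oids, read_mib):
    """One planned polling pass; read_mib is whatever SNMP/REST access is actually in use."""
    for element in elements:
        for oid in oids:
            repo.deposit(element, oid, read_mib(element, oid))
```

Because all the polling happens in poll_cycle, the designer controls management traffic and per-element access loading, and the query side can present the data in whatever address space or form is convenient.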
Assuming we have the data gathered and accessible, we have to correlate it to make it useful. A central truth of management is that you have to make a high-level decision regarding what you are going to use as your service resource planning paradigm. Here there are three options. You can simply manage resources, let services vie for them without constraints, and offer best-efforts services. Second, you can manage resources but exercise some form of “service admission control” to limit the commitment of resources to your planned level, which lets you offer broad grades of service. Finally, you can map the specific way that services and resources are linked—provision services, in short. The approach you take determines what you’d do with your data.
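As an aside on that second option, service admission control can be as simple as checking a planned commitment ceiling before accepting a service. This little sketch assumes made-up resource pools and an arbitrary 80% ceiling; it’s meant only to show the shape of the decision, not how any operator actually sets policy.

```python
def admit(service_demand, committed, capacity, ceiling=0.8):
    """Return True if committing this service keeps every resource pool
    under the planned commitment level (here, an assumed 80% of capacity)."""
    return all(
        committed.get(r, 0) + need <= ceiling * capacity[r]
        for r, need in service_demand.items()
    )

# Example with invented numbers: 70 of 100 vCPUs committed, a 20-vCPU ask is refused.
print(admit({"vcpu": 20}, {"vcpu": 70}, {"vcpu": 100}))   # False
```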
In our first two options, we don’t necessarily need to know anything about service-to-resource relationships, so we can do capacity planning by examining resource usage and service demand. This is a “pure” big data or analytics problem, big because you’d likely need to look at everything and at frequent intervals, time-stamping and recording each combination. You’d then examine trends, peaks, valleys, and so forth—classical usage-based capacity planning.
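For what it’s worth, a bare-bones version of that usage-based capacity planning, working from the time-stamped records the repository already holds, might look like the sketch below. The record fields and the hourly interval are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean

def capacity_profile(samples, interval=3600):
    """samples: iterable of (timestamp, resource, utilization) records."""
    buckets = defaultdict(list)
    for ts, resource, util in samples:
        buckets[(resource, int(ts // interval))].append(util)

    per_resource = defaultdict(list)
    for (resource, _), utils in buckets.items():
        per_resource[resource].append(max(utils))          # peak within each interval

    return {r: {"busiest_interval_peak": max(peaks), "mean_interval_peak": mean(peaks)}
            for r, peaks in per_resource.items()}
```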
Even here, though, we may have some issues relating to services. Capacity planning typically demands that we be able to relate service needs to resource usage, and unless the services are totally homogeneous in terms of how they use resources, we would need to understand a bit about the service mix, and also about any policies that were used to link services to resources. An easy example is that if an operator is looking at the question of provisioning more service features via NFV, they have to know what resource usage can be attributed to the NFV approach relative to the “legacy” approach.
But it’s our final category that really cements the need for service context. When a service is provisioned, we are making a fixed resource assignment. Unless we presume that all NFV VNFs are multi-tenant, at least some provisioning of this sort is inevitable in NFV deployments. If there is a fixed resource commitment to a VNF or to any part of a provisioned service, we have to be able to derive the status of the service from the composite status of the assigned resources, and also to correlate resource faults with SLA violations at the service level. I’ve talked about this issue in other blogs so I won’t dig into it here.
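Just to show the shape of that derivation, here’s a hedged sketch of computing a service’s status from the resources bound to it at provisioning time. The status values and the “how many degraded resources before the SLA breaks” threshold are invented for the example; in practice they’d come from the service data model itself.

```python
def service_status(assigned_resources, resource_status, max_degraded=0):
    """assigned_resources: resource ids bound to this service at provisioning.
    resource_status: current status per resource id ('ok', 'degraded', 'failed')."""
    states = [resource_status.get(r, "failed") for r in assigned_resources]
    if any(s == "failed" for s in states):
        return "sla-violation"            # a hard resource fault maps straight to the service
    if sum(s == "degraded" for s in states) > max_degraded:
        return "sla-violation"
    if any(s == "degraded" for s in states):
        return "degraded"
    return "ok"
```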
There’s also a third thing to consider in our big-data-and-analytics discussion, beyond collecting and correlating: actionability. If we want to do service automation, meaning really operations automation, we have to be able to visualize the service as a finite-state machine of many layers, and we have to assume that the service data model will tell us how to steer events to processes depending on what’s happening and what it’s happening to. Analytics can easily feed things like capacity planning, but if we want to make near-real-time or real-time changes based on analytics, we have to have mechanisms to inform the service lifecycle processes or the resource processes that we want something done. For that to work, the notification has to be an event, and it has to be handled in a way that depends on the state of the resources and services we might be impacting. So we’re back to state/event yet again.
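Here’s a toy state/event table in that spirit: an analytics alert is just another event, steered to a process according to the current state of the service object it targets. The states, events, and handler processes are made up purely for illustration.

```python
def scale_out(svc):   print(f"scaling out {svc}")
def open_ticket(svc): print(f"opening trouble ticket for {svc}")
def ignore(svc):      pass

# (current state, event) -> (next state, process to run)
STATE_EVENT_TABLE = {
    ("active",   "capacity-alert"): ("active",   scale_out),
    ("active",   "resource-fault"): ("degraded", open_ticket),
    ("degraded", "capacity-alert"): ("degraded", ignore),     # already being handled
    ("degraded", "resource-ok"):    ("active",   ignore),
}

def handle_event(service, state, event):
    next_state, action = STATE_EVENT_TABLE.get((state, event), (state, ignore))
    action(service)
    return next_state
```

The point isn’t the table itself; it’s that the same analytics feed behaves differently depending on what state the service is in, which is exactly what a flat “alert goes to a dashboard” model can’t do.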
I think the discussion of analytics and big data with NFV deployment is a clear demonstration that we’re building tires without having designed the car. Management implementations are necessarily slaves of management policy, which is in turn a slave of service policy. I think the right answer is actually visible here if you look at things that way, so we need to start demanding that vendors take that approach.