Like most players in the network space, Nokia is eyeing SDN and NFV with a mixture of hope and fear. I’d guess for Nokia it may be a bit more of the latter, because major changes in the market could upset the merger of Nokia and Alcatel-Lucent, the latter being an example of almost perpetual merger trauma itself. Now, Nokia has announced…drum roll…a new acquisition, Deepfield, to improve its network and service automation. The obvious question is whether this makes any sense in the SDN/NFV space. A less obvious question is whether it makes sense without SDN or NFV.
Virtualization creates a natural disruption in network and service management because the actual resources being used don’t look like the virtual elements that the resources are assigned to support. A virtual router is router software running in a data center, on servers, using hypervisors, vSwitches, real data center switches, and a bunch of other stuff that no router user would expect to see. Because of this disconnect, there’s a real debate going on over just how you manage virtual-based networks. The disagreement lies in just how the virtual and real get coordinated.
If you looked at a hypothetical configuration of a totally virtualized (using NFV and SDN) IP VPN service, you’d see so much IT that it would look like application hosting. Imagine what happens, then, when you have a “server fail” event. Do you tell the user’s router management system that you have a server failure? Hardly. Broadly, your realistic options are to try to relate a “real” resource failure to a virtual condition, or to just fix everything underneath the virtual covers and never report a failure unless it’s so dire that causal factors are moot.
To put the latter option more elegantly, one approach to virtualization management is to manage the virtual elements as intent models with SLAs. You simply make things happen inside the black box to meet the SLAs, and everyone is happy. However, managing this way has its own either/or choice: do you manage explicitly or probabilistically?
Explicit management means knowing what the “virtual to real” bindings are, and relating the specific resource conditions associated with a given service to the state of the service. You can do this for very high-value stuff, but it’s clearly difficult and expensive. The alternative is to play the numbers, and that (in case you were wondering if I’d gotten totally off-point) is where big data, analytics, and Deepfield come in.
Probabilistic network management is based on the idea that if you have a capacity plan that defines a grade of service, and if you populate your network with resources to meet the goals of that plan, then any operation that stays within your plan’s expected boundaries meets the SLAs with an acceptable probability. Somewhere, out there, you say, are the resources you need, and you know that because you planned for the need.
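To make the “play the numbers” idea concrete, here’s a minimal sketch in Python. It assumes, purely for illustration, that each subscriber is independently active with the same fixed probability and needs one unit of capacity when active. Real traffic is nowhere near that tidy, and the function names and numbers are hypothetical, not anything Nokia or Deepfield has published.

```python
from math import comb


def probability_within_plan(subscribers: int, active_probability: float,
                            capacity_units: int) -> float:
    """Chance that simultaneous demand fits inside the planned capacity.

    Assumption (illustrative only): each subscriber is independently active
    with the same probability and consumes one capacity unit when active,
    so demand follows a binomial distribution.
    """
    top = min(capacity_units, subscribers)
    return sum(
        comb(subscribers, k)
        * active_probability ** k
        * (1 - active_probability) ** (subscribers - k)
        for k in range(top + 1)
    )


def units_needed(subscribers: int, active_probability: float,
                 target: float) -> int:
    """Smallest capacity whose chance of being sufficient reaches the target."""
    for capacity in range(subscribers + 1):
        if probability_within_plan(subscribers, active_probability,
                                   capacity) >= target:
            return capacity
    return subscribers  # fall back to one unit per subscriber
```

With 10 subscribers who are each active half the time, `units_needed(10, 0.5, 0.99)` comes out to 9: you can hit a 99% grade of service without provisioning for everyone at once, which is the whole bet behind a capacity plan.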
This only works, of course, if you didn’t mess up the plan, and if your resources don’t get messed up themselves. Whether a massive, adaptive, multi-tenant, multi-application network or service is running right, and just how it’s wrong if it isn’t, is a complex question, so you need to look at a bunch of metrics and do some heavy analytics. The more you can see and analyze, the more likely you’ll obtain a correct current state of the network and services. If you have a decent baseline of normal or acceptable states, that gets you a much higher probability that your wing-and-a-prayer SLA is actually being met when you think it is.
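The baseline comparison can be sketched very simply. The Python below assumes a single metric and a plain standard-deviation test against a set of “known good” samples; real analytics would look at many metrics and use far richer models, and the function name and the threshold of 3 are my own illustrative choices.

```python
from statistics import mean, stdev


def deviates_from_baseline(baseline_samples, current_value, threshold=3.0):
    """Return True when current_value sits more than `threshold` standard
    deviations away from the mean of the baseline samples.

    Assumption (illustrative only): the baseline holds at least two samples
    of one metric taken while the network was in an acceptable state.
    """
    center = mean(baseline_samples)
    spread = stdev(baseline_samples)  # sample standard deviation
    if spread == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return current_value != center
    return abs(current_value - center) / spread > threshold
```

Given a baseline of link-utilization readings hovering around 10, a reading of 10.5 passes quietly while a reading of 20 gets flagged. The point isn’t the statistics, which are trivial here, but that the test is only as good as the baseline you feed it.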
Many people in the industry, and particularly in the telco space, think explicit management is the right answer. That’s why we hear so much about “five-nines” today. The fact is that almost none of the broadband service we consume today can be assured at that level, and almost none of it is explicitly managed. Routers and Ethernet switches don’t deliver by-the-second SLAs, and in fact the move to SDN was largely justified by the desire to impose some traffic management (besides what MPLS offers) on the picture. In consumption terms, consumer broadband is swamping everything else already, it’s only going to get worse, and consumers will trade service quality for price to a great degree. Thus, it’s my view that probabilistic management is going to win.
That doesn’t mean that all you need to manage networks is big data, though. While probabilistic management based on resource state is the basis for a reasonable management strategy for both SDN and NFV, there’s still a potential gap that could kill you, which I’ll call the “anti-butterfly-wings” gap.
You know the old saw that if a butterfly’s wings flap in Japan, it can create a cascade impact that alters weather in New York. That might be true, but we also could say that a typhoon in Japan might cause no weather change at all in nearby Korea. The point is that a network resource pool is vast, and if something is buggered in a given area of the pool there’s a good chance that nothing much is impacted. You can’t cry “Wolf!” in a service management sense just because something somewhere broke.
That’s where Deepfield might help. Their approach adds endpoint, application, or service awareness to the mass of resource data that you’d have with any big-data network statistics app. That means that faults can be contextualized by user, service/application, etc. The result isn’t as precise as explicit management, but it’s certainly enough to drive an SLA as good as or better than what’s currently available in IP or Ethernet.
The interesting thing about this approach, which Nokia’s move might popularize, is the notion of a kind of “resource-push” management model. Instead of having the service layer keep track of SLAs and draw on resource state to get the necessary data, the resource layer could push management state to the services based on context. At the least, it could let service-layer processes know that something might be amiss.
At the most, it opens a model of management where you can prioritize the remedial processes you’re invoking based on the likely impact on customers, services, or applications. That would be enormously helpful in preventing unnecessary cascades of management events arising from a common condition; you could notify services in priority order to sync their own responses. More important, you could initiate remedies immediately at the resource level, and perhaps not report a service fault at all.
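Here’s a rough Python sketch of what a “resource-push” step with priority ordering might look like. Everything in it is hypothetical: the `Service` fields, the impact score (customers times an SLA weight), and the resource-to-service mapping stand in for whatever context a real analytics layer would supply, and none of it is drawn from Deepfield’s actual product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    customers: int      # how many customers depend on the service
    sla_weight: float   # hypothetical measure of how strict the SLA is


def impact_score(service: Service) -> float:
    """Illustrative impact measure: more customers and stricter SLAs rank higher."""
    return service.customers * service.sla_weight


def services_to_notify(failed_resource, context, services):
    """Order the services touched by a resource fault, highest impact first.

    `context` maps a resource name to the names of the services that the
    analytics layer associates with it. A resource with no associated
    services yields an empty list, so nothing is pushed upward.
    """
    affected = context.get(failed_resource, ())
    return sorted((services[name] for name in affected),
                  key=impact_score, reverse=True)


services = {
    "vpn": Service("vpn", customers=40, sla_weight=5.0),
    "broadband": Service("broadband", customers=1000, sla_weight=0.1),
    "video": Service("video", customers=300, sla_weight=1.0),
}
context = {"server-17": ["vpn", "broadband", "video"], "server-99": []}

for service in services_to_notify("server-17", context, services):
    print(service.name)
```

A fault on the hypothetical `server-17` pushes notifications in impact order, while a fault on `server-99`, which carries nothing anyone cares about, pushes nothing at all. That second case is the anti-butterfly-wings gap being closed.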
That’s the key point. Successful service management in SDN and NFV is successful not because it correctly, or at least logically, reflects the fault to the user. It’s successful because no SLA violations occur, and so no faults are reported. It will be interesting to see how Nokia plays this, though. Most M&A done these days fails because the buyer under-plays its new asset.