How SDN and NFV Impact Netops

The impact of SDN and NFV on OSS/BSS is a topic that obsesses many operators and also a topic I’ve blogged about extensively.  There’s no question it’s important, but there’s another kind of operations too—network operations.  It’s not always obvious, but both SDN and NFV would have profound impacts on network operations and the operations center—the NOC.  Some of the impacts could even threaten the classic model we call “FCAPS” for “Fault, Configuration, Accounting, Performance, and Security”.

In today’s world, network operations is responsible for sustaining the services of the network and planning network changes to respond to future (expected) conditions.  The ISO definition of the management model is the source of the FCAPS acronym, and it reflects the five principal management tasks that make up network operations.  For enterprises, this model is pretty much all of operations, since most enterprises don’t have OSS/BSS elements.

To put netops, as many call it, into a broader perspective, it’s a function that’s typically below OSS/BSS and is made up of three layers—element management, network management, and service management.  Most people would put OSS/BSS layers on top, which means that service management on the netops stack interfaces with the bottom of OSS/BSS.  Operations support systems and business operations “consume” netops services.  Netops practices can be divided by the FCAPS categories, but both enterprises and service providers tend to employ a kind of mission-based framework based more on the three layers.

Virtualization in any form distorts the classic management picture because it breaks the convenient feature-device connection.  A “router” isn’t a discrete device in SDN or NFV; it’s a behavior set imposed on a forwarding device by central intelligence (SDN) or it’s a software function hosted on some VM or in a container (NFV).  So, in effect, we could say that both SDN and NFV create yet another kind of layered structure.  At the bottom is the resource pool, in the middle are the realized virtualizations of functions/features, and at the top are the cooperative feature relationships.  In a general way, the top layer of this virtualization stack maps to the bottom (element) layer of the old netops stack.
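The three virtualization layers can be sketched as simple lookup tables.  All names here are illustrative, not drawn from any SDN or NFV specification:

```python
# Bottom layer: the multi-tenant resource pool (hosts, VMs, containers).
resource_pool = {
    "vm-101": {"host": "server-a", "status": "up"},
    "vm-102": {"host": "server-b", "status": "up"},
}

# Middle layer: virtual functions/features realized on those resources.
realized_functions = {
    "vrouter-1": {"hosted_on": "vm-101"},
    "vfirewall-1": {"hosted_on": "vm-102"},
}

# Top layer: cooperative feature relationships -- this is what maps to the
# "element" at the bottom of the old ISO netops stack.
feature_relationships = {
    "branch-service-element": ["vrouter-1", "vfirewall-1"],
}
```

The point of the sketch is that nothing in the bottom layer says what it is doing for the layers above; that linkage has to be maintained explicitly.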

It’s easy to apply the five FCAPS disciplines to the old netops stack, or at least it’s something everyone is comfortable with and something that’s well supported by tools.  If SDN and NFV could just as easily map to FCAPS we’d be home free, but it’s pretty obvious that they don’t.

Take the classic “fault”.  In traditional netops, a fault is something that happens to a device or a trunk, and it represents aberrant behavior, something outside its design range of conditions.  At one level this is true for SDN and NFV as well, but the problem is that there is no hard correlation between fault and feature, so you can’t track the issue up the stack.  A VM fails, which means that the functionality based on it disappears.  It may be difficult to tell, looking (in management terms) at the VM, just what functionality that was and where it was being used.
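Tracking a fault up the stack is essentially a matter of maintaining explicit bindings from resources to features and from features to services.  A minimal sketch, with hypothetical names:

```python
# Hypothetical binding tables; a real system would derive these from its
# orchestration records, not hard-code them.
vm_to_functions = {"vm-101": ["vrouter-1"]}
function_to_services = {"vrouter-1": ["dept-a-vpn", "dept-b-vpn"]}

def services_affected_by(vm_id):
    """Trace a VM fault upward to the services that depended on it."""
    affected = set()
    for fn in vm_to_functions.get(vm_id, []):
        affected.update(function_to_services.get(fn, []))
    return affected
```

Without tables like these, a VM failure event carries no service meaning at all, which is exactly the correlation gap described above.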

We can still use classic FCAPS, meaning classic netops, with SDN and NFV as long as we preserve the assumption that the top of our SDN/NFV stack is the “element” at the bottom of the ISO model.  That’s what I’ve called the “virtual device” model in the past.  The problem is that when we get to the ISO “element”, virtualization has transformed it from a box we can send a tech to work on into a series of software relationships.  Not only that, most of the relationships involve multi-tenant resource pools and are designed to be self-healing.

One logical response to this problem at the enterprise level is to re-target netops management practices at the virtualization stack, particularly at the pooled resources, and treat the ISO netops stuff almost like an enterprise OSS/BSS.  This could be called asynchronous management because the presumption is that pooled resources would be managed to conform to capacity planning metrics and self-healing processes (scaling, failover) would be presumed to do everything possible for service restoration within those constraints.  A failure of the virtualized version of a service/device would then be a hard fault.
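In code terms, asynchronous management amounts to letting self-healing run inside the capacity-planned headroom and escalating a hard fault only when the pool can’t absorb the failure.  A toy sketch; the capacity units and outcomes are invented for illustration:

```python
def handle_function_failure(function_id, spare_capacity, capacity_needed):
    """Self-healing sketch: redeploy a failed function within the pool's
    capacity-planned headroom; raise a hard fault only when that fails."""
    if spare_capacity >= capacity_needed:
        # The pool absorbs the failure silently; netops never sees it.
        return ("redeployed", spare_capacity - capacity_needed)
    # The virtualized service/device itself has now failed: a hard fault.
    return ("hard-fault", spare_capacity)
```

The interesting operational property is that most faults never surface above the pool; only the residue that exceeds the planned headroom reaches netops.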

This seems to me to be a reasonable way of approaching netops, but it does open the question of how those initial capacity planning constraints are developed.  Analytics would be the obvious answer, but to get reasonable capacity planning boundaries for a resource pool would require both “service” and “resource” information and a clear correlation between the two.  You’d want to have data on service use and quality of experience, and correlate that with resource commitments and loading states.
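One way to picture that correlation is to join per-service quality-of-experience data with the service-to-resource bindings, so you can see how experience tracks each resource’s load.  A simplified sketch with made-up numbers:

```python
# Hypothetical analytics inputs: per-service QoE scores (0..1),
# service-to-resource bindings, and per-resource load (0..1).
service_qoe = {"svc-1": 0.95, "svc-2": 0.70}
service_to_resources = {"svc-1": ["vm-101"], "svc-2": ["vm-101", "vm-102"]}
resource_load = {"vm-101": 0.85, "vm-102": 0.40}

def mean_qoe_by_resource():
    """Average the QoE of all services bound to each resource, giving a
    crude service/resource correlation for capacity planning."""
    samples = {}
    for svc, resources in service_to_resources.items():
        for r in resources:
            samples.setdefault(r, []).append(service_qoe[svc])
    return {r: sum(vals) / len(vals) for r, vals in samples.items()}
```

Even a crude join like this exposes the kind of boundary you’d want: if heavily loaded resources consistently show depressed QoE, the capacity plan is too aggressive.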

Not only that, we’d probably need resource/service correlations to trace the inevitable problems that get past the statistical management of pooled resources.  Everyone knows that absent solid resource commitments per service, SLAs are just probability games.  What happens when you roll snake-eyes?  It’s important in pool planning to be able to analyze when the plans failed, and understand what has to be done (one option being roll the dice again, meaning accept low-probability events) when they do.
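The “probability game” is easy to quantify in a toy model: if a service’s function chain needs every one of its pooled resources to be up, the service’s availability is the product of the individual availabilities, so the SLA can never be better than that product.  A sketch:

```python
def service_availability(resource_availabilities):
    """Availability of a service whose function chain requires ALL of the
    listed resources to be up simultaneously (independence assumed)."""
    availability = 1.0
    for a in resource_availabilities:
        availability *= a
    return availability

# Three "three-nines" resources in series: the chain is worse than any
# single resource -- roughly 0.997, i.e. about 26 hours of risk per year.
chain_availability = service_availability([0.999, 0.999, 0.999])
```

The independence assumption is the optimistic case; shared hosts or shared trunks make the real number worse, which is exactly why post-mortem analysis of failed plans matters.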

There’s also the question of what happens when somebody contacts the NOC to complain about a problem with a network service.  In the past, NOC personnel would have a reasonable chance of correlating a report of a service problem with a network condition.  In a virtualized world, that correlation would have to be based on these same service/resource bindings.  Otherwise an irate VP calling the NOC about loss of service for a department might get a response like, “Gee, it appears that one of your virtual routers is down; we’re dispatching a virtual tech to fix it!”

To take this to the network operator domain now, we can see that if we presume that there exists a netops process and toolkit, and if we assume that it has the ability to track resource-to-service connections, we could then ask the question of whether OSS/BSS needed to know much about things like SDN and NFV.  If we said that the “boundary” element between the old ISO stack and the new virtualization stack was our service/resource border, we could assume that operations processes could work only with these abstract boundary elements, which are technology-opaque.
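In software terms, a technology-opaque boundary element is just an abstraction that hides which realizer (legacy, SDN, or NFV) sits underneath it.  A hypothetical sketch of what OSS/BSS would see:

```python
class BoundaryElement:
    """Abstract service/resource boundary element: OSS/BSS sees only the
    name and the deploy() contract, never the technology underneath."""
    def __init__(self, name, realizer):
        self.name = name
        self._realizer = realizer  # technology-specific backend, hidden

    def deploy(self):
        return self._realizer()

# Illustrative realizers; real ones would drive an EMS, an SDN controller,
# or NFV orchestration respectively.
def legacy_realizer():
    return "configured legacy router"

def nfv_realizer():
    return "instantiated VNF on resource pool"

vpn_access = BoundaryElement("vpn-access", nfv_realizer)
```

Swapping `nfv_realizer` for `legacy_realizer` changes nothing above the boundary, which is the whole point of making the element technology-opaque.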

This then backs into operators’ long-standing view that you could orchestrate inside OSS/BSS, inside network management, or at the boundary of the two.  The process of orchestration would change depending on where you put the function, and the demands on orchestration would also change.  For example, if OSS/BSS “knows” anything about service technology and can call for SDN or legacy resources as needed, lower-level network processes don’t have to pick one versus the other for a given service.  If operations doesn’t know, meaning orchestration is below that level, then lower-level orchestration has to distinguish among implementation options.  And of course the opposite is true; if you can’t orchestrate resources universally at a low level, then the task of doing that has to be pushed upward—toward the OSS/BSS for network operators, or perhaps totally out of the network management process for enterprises, into limbo.  There is no higher layer in enterprise management to cede orchestration to, so it would end up being a vendor-specific issue.

This point puts the question of NFV scope into perspective.  If you can’t orchestrate legacy, SDN, and NFV behaviors within NFV orchestration you have effectively called for another layer of orchestration but not defined it or assigned responsibility to anybody in particular.  That not only hurts NFV for network operators, it might have a very negative impact on SDN/NFV applications in the enterprise.