Service Assurance in the Network of the Future

One of the persistent questions with both SDN and NFV is how the service management or lifecycle management processes would work.  Any time that a network service requires cooperative behavior among functional elements, the presumption is that all the elements have to be functioning.  Even with standard services, meaning services over legacy networks, that can be a challenge.  It’s even more complicated with SDN and NFV.

Today’s networks are multi-tenant in nature, meaning that they share transmission/connection facilities to at least some degree.  Further, today’s networks are based on protocols that discover state and topology through adaptive exchanges, so routing is dynamic and it’s often not possible to know just where a particular user’s flows are going.  In most cases these days, the functional state of the network is determined by those adaptive processes: users “see” the results of the status/topology exchanges in some way and can determine whether a connection has been lost, or they simply lose connectivity outright.

QoS is particularly fuzzy.  Unless you have a mechanism for measuring it end-to-end, there’s little chance that you can determine exactly what’s happening with respect to delay or packet loss.  Most operator guarantees of QoS are based on performance management through traffic engineering, and on capacity planning.  You design a network to offer users a given QoS, and you assume that if nothing is reporting a fault the users are getting it.

It’s tempting to look at this process as incredibly disorderly, particularly when you contrast it with TDM services, which, because they dedicated resources to the user, could define state and QoS with great precision at any point.  However, it’s not fair to expect SDN or NFV to do better than the current state of management, particularly if users expect lower prices down the line, and operators expect lower opex.

The basic challenge SDN poses in even replicating current management knowledge is that, by design, adaptive exchanges don’t determine routes; in fact, they don’t happen at all.  If that’s the case, then there is no way of knowing the state of the devices unless the central controller or some other central management element knows it, which of course means that the devices have to provide that state.  An SDN controller has to know network topology and the state of the nodes and trunks under its control.  If it does, then the controller can construct the same knowledge of overall network conditions that the network used to acquire through adaptive exchanges, and you could replicate management data and SLAs.
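To make that concrete, here is a minimal sketch (in Python) of a controller-side store that holds device-reported state and reconstructs route health from it.  The class and method names (TopologyStore, report_state, path_is_up) are illustrative assumptions, not any real controller’s API.

```python
# A minimal sketch, assuming devices push their own status reports to
# the controller; nothing here is discovered adaptively.

class TopologyStore:
    """Holds the controller's view of node and trunk state."""

    def __init__(self):
        self.node_state = {}   # node_id -> "up" | "down"
        self.link_state = {}   # (node_a, node_b) -> "up" | "down"

    def report_state(self, element, state):
        """Called when a device reports its own status; the controller
        cannot infer this adaptively, so the device must provide it."""
        if isinstance(element, tuple):
            self.link_state[element] = state
        else:
            self.node_state[element] = state

    def path_is_up(self, path):
        """Reconstruct the management knowledge a legacy network would
        gain from adaptive exchanges: a route is good only if every
        node and trunk along it is good."""
        nodes_ok = all(self.node_state.get(n) == "up" for n in path)
        links_ok = all(
            self.link_state.get((a, b)) == "up"
            or self.link_state.get((b, a)) == "up"
            for a, b in zip(path, path[1:])
        )
        return nodes_ok and links_ok
```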

NFV creates a different set of problems.  With NFV the service depends in part on functions hosted on resource pools, and these are expected to offer at least some level of “automatic recovery” from faults, whether that happens by instantiating a replacement copy, moving something, reconnecting something, or scaling something under load.  This self-repair means that a fault might exist at the atomic function level, but you don’t want to recover from it at the service level until whatever’s happening internally has completed.
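One way to picture that behavior is a gate that suppresses service-level escalation while a VNF’s local recovery is running.  The sketch below is a minimal illustration under that assumption; the hold-down interval and all the names are hypothetical.

```python
import time

# Assumed hold-down: how long we let local self-repair run before
# the fault escalates to the service level anyway.
REMEDIATION_HOLDDOWN_SECS = 30.0

class VnfFaultGate:
    """Suppresses service-level fault handling while a VNF's local
    recovery (re-instantiate, move, reconnect, scale) is in progress."""

    def __init__(self):
        self.remediating_since = {}  # vnf_id -> start timestamp

    def local_recovery_started(self, vnf_id):
        self.remediating_since[vnf_id] = time.monotonic()

    def local_recovery_finished(self, vnf_id):
        self.remediating_since.pop(vnf_id, None)

    def should_escalate(self, vnf_id):
        """Escalate to the service level only if no local remediation
        is running, or if it has overrun its hold-down interval."""
        started = self.remediating_since.get(vnf_id)
        if started is None:
            return True
        return time.monotonic() - started > REMEDIATION_HOLDDOWN_SECS
```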

The self-remediation model of NFV has, within the NFV ISG and IMHO, led to a presumption that lifecycle management is the responsibility of the individual virtual network functions.  Each function contains a local instance of a VNF management process, which would presumably act as a bridge between the state of the underlying resources and the state of the VNF itself.  The problem, of course, is that the service consists of more than that single VNF, and the state of the service still has to be composited.
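A rough sketch of what compositing might look like: each VNF manager reports a local state, and the service state is derived across all of them.  The state names and severity ordering here are assumptions for illustration.

```python
# Assumed severity ordering for per-VNF states; a real system would
# define these in the service model.
SEVERITY = {"up": 0, "degraded": 1, "remediating": 2, "down": 3}

def composite_service_state(vnf_states):
    """Derive a single service state as the worst state reported by
    any VNF in the service; no single VNF manager can know this on
    its own."""
    if not vnf_states:
        return "down"   # no reporting elements: assume the worst
    return max(vnf_states.values(), key=lambda s: SEVERITY[s])

# Example: one self-repairing VNF drags the whole service state down.
print(composite_service_state(
    {"firewall": "up", "nat": "remediating", "ids": "up"}
))  # -> "remediating"
```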

The operators’ architectures for NFV and SDN deployment, now emerging in some detail, show that operators presume there is in the network (or at least in every domain) a centralized service assurance function.  This function collects management information from the real stuff, and it also provides a means of correlating that data with service state and generating (in some way) fault notifications to the service processes.  It seems that this approach is going to dominate real SDN and NFV deployment, but the exact structure and features of service assurance aren’t fully described yet.

What seems to have emerged is that service assurance is a combination of three functional elements: aggregation of resource status, service correlation, and event generation.  In the first of these, management data is collected from the things that directly generate it, and in some cases at least the data is stored/cached.  An analytics process operates on this data to drive what are essentially two parallel processes: resource management and service management.  The resource management process is aimed at remedying problems with physical elements like devices, servers, and trunks.  The service management process is designed to address SLA faults, and so it could just as easily replace a resource in a service as require that it be fixed; in fact, replacement would be the normal course.
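Here is a minimal sketch of those three elements wired together: status is aggregated and cached, an analytics step correlates a resource fault with the services that use the resource, and events are generated for the two parallel processes.  The class and the service_map structure are illustrative assumptions, not any operator’s architecture.

```python
# A minimal sketch of the three functional elements: resource-status
# aggregation, service correlation, and event generation.

class ServiceAssurance:
    def __init__(self, service_map):
        # service_map: resource_id -> list of service_ids using it
        self.service_map = service_map
        self.resource_status = {}   # aggregation/cache of raw status

    def ingest(self, resource_id, status):
        """Element 1: collect and cache management data from the
        things that directly generate it."""
        self.resource_status[resource_id] = status
        if status != "ok":
            self.analyze(resource_id, status)

    def analyze(self, resource_id, status):
        """Element 2: correlate the resource fault with service state,
        driving the two parallel processes."""
        resource_events = [("repair", resource_id)]
        service_events = [
            ("reconfigure", svc)
            for svc in self.service_map.get(resource_id, [])
        ]
        self.emit(resource_events + service_events)

    def emit(self, events):
        """Element 3: generate notifications toward the processes that
        own the fix (repair the element, or replace it in the service)."""
        for kind, target in events:
            print(f"event: {kind} -> {target}")

# Example: one server fault yields a repair event plus a reconfigure
# event for each service that rode on it.
sa = ServiceAssurance({"server-7": ["svc-101", "svc-102"]})
sa.ingest("server-7", "down")
```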

Service management in both SDN and NFV is analogous to end-to-end adaptive recovery as found in legacy networks.  You are going to “fix” a problem by reconfiguration of the service topology and not by actually repairing something.  If something is broken, that becomes a job for the resource management processes.

Resource management doesn’t appear to present any unusual challenges.  You have device state for everything, and so if something breaks you can fix it.  It’s service management that poses a problem because you have to know what to reconfigure and how to reconfigure it.

The easiest way to determine whether a service has faulted is to presume that something internal to the service detects it, or that the service users report it.  Again, this may seem primitive, but it’s not really a major shift from what happens now.  If this approach is taken, then the only requirement is a problem analysis process that establishes not what specifically has happened, but what can be done to remedy the fault by reconfiguration.  The alternative is to assume that the service assurance function can identify the thing that’s broken and the services that are impacted.

Both of these options end up in the same place.  We need some way of knowing when a virtual function or SDN route has failed.  We need a recovery process aimed at replacing what has broken (and perhaps a dispatch task to send a tech to fix the real problem).  We need a notification process that gives the user a signal of conditions comparable to what they’d get from a legacy network service.  That frames the reality of service assurance.
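Framed as code, that requirement set might look like the interface below.  This is just a way of pinning down the three requirements, not a standard or proposed API.

```python
from abc import ABC, abstractmethod

class AssuranceRequirements(ABC):
    """The three requirements of service assurance, as an interface."""

    @abstractmethod
    def detect_failure(self, element_id) -> bool:
        """Know when a virtual function or SDN route has failed."""

    @abstractmethod
    def recover(self, element_id):
        """Replace what has broken; perhaps also dispatch a tech to
        fix the underlying real problem."""

    @abstractmethod
    def notify_user(self, service_id, condition):
        """Signal conditions comparable to what a legacy network
        service would surface to its user."""
```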

I think that the failing of both SDN and NFV management to date lies in this requirement set.  How, if internal network behavior is not determined by adaptive exchange, does the service user find out about reachability and state?  If SDN replaces a switch/router network, who generates the management data that each device would normally exchange?  In NFV how do we reflect a virtual function failure when the user may not be directly connected to the function, but somewhere earlier/later in the service chain?

The big question, though, is one of service configuration and reconfiguration.  We cannot assume that every failed white box or server hosting a VNF can be recovered locally.  What happens when we have to change the configuration of the service enough that the elements outside the failing domain have to be changed to reflect the reconfiguration?  If we move a VNF to another data center, don’t we have to reconnect the WAN paths?  Same with SDN domains.  This is why the issue of recovery is more than one of event generation or standardization.  You have to be able to interpret faults, yes, but you also have to be able to direct the event to a point where knowledge of the service topology exists, so that automated processes can reconnect everything.  Where is that point?

In the service model, or it’s not anywhere.  Lifecycle management is really a form of DevOps, and in particular of the declarative model, where the desired end-state of a service is maintained and compared with the current state.  This is why we need to focus quickly on how a service is modeled end-to-end, and on integrating that model with service assurance, for both legacy and “new” technologies.
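As a closing illustration, here is a minimal sketch of that declarative comparison: the service model holds the desired end-state, and a reconcile step emits whatever redeploy/reconnect actions are needed to converge, including rebuilding WAN paths when a VNF moves.  The model fields and action names are assumptions for illustration.

```python
# A minimal sketch of declarative lifecycle management: compare the
# modeled end-state with the observed state and emit the actions
# needed to converge.

def reconcile(desired, current):
    """desired/current: dicts of component -> placement (e.g. a data
    center).  Returns the actions needed to restore the modeled
    end-state, including reconnecting paths outside the failing
    domain."""
    actions = []
    for component, placement in desired.items():
        if current.get(component) != placement:
            actions.append(("redeploy", component, placement))
            # A moved or re-hosted VNF also forces the WAN paths
            # that reach it to be reconnected.
            actions.append(("reconnect", component, placement))
    return actions

# Example: a VNF lost in dc-a is redeployed per the model and its WAN
# connections rebuilt, without per-device fault interpretation.
print(reconcile(
    desired={"vnf-fw": "dc-a", "vnf-nat": "dc-b"},
    current={"vnf-fw": None, "vnf-nat": "dc-b"},
))
```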