Service and Resource Management in an SDN/NFV Age

I mentioned in my blog yesterday that there was a distinct difference between “service management” and “resource management” in networks, and it’s worth taking some time to explore this because it impacts both SDN and NFV.  In fact, this difference may be at the heart of the whole notion of management transformation, the argument on whether we need “new” OSS/BSS approaches or simply need changes to current ones.

In the good old days of TDM networks, users had dedicated capacity and fixed paths.  That meant that it was possible to provide real-time information at a highly granular level, and some (like me) remember the days when you could get “severely errored seconds” and “error-free seconds” data.  When you got a service-level agreement (SLA) it could be written down to the conditions within an hour or even minute, because you had the data.

Packet networking changed all of this with the notion of a shared network and oversubscription.  One of the issues with TDM was that you paid 24×7 for capacity and might use it only for 20% or so of that time.  With packet networks, users’ traffic intermingled and this allowed more efficient use of resources.  It also meant that the notion of precise management information was forever compromised.  In packet networks, it would be very difficult and expensive to recover the exact state of routes and traffic loads at any given time.  Operators responded by extending their SLA guarantee periods—a day, a week, a month.  Packet networking is all about averages, including management of packet networks.
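To put a rough number on the oversubscription trade-off, here is a minimal sketch that assumes each user is independently active for a fixed fraction of the time (a simple binomial model, which real traffic will not follow exactly). The function name and parameters are hypothetical, chosen only for illustration.

```python
from math import comb


def overload_probability(users, active_fraction, capacity_users):
    """Chance that more than `capacity_users` of `users` are active at once.

    Assumes each user is active independently with probability
    `active_fraction` (an illustrative binomial model, not a traffic study).
    """
    p = active_fraction
    return sum(
        comb(users, k) * p**k * (1 - p) ** (users - k)
        for k in range(capacity_users + 1, users + 1)
    )


# 50 users who are each active about 20% of the time, sharing capacity
# sized for 20 simultaneous users instead of all 50.
risk = overload_probability(users=50, active_fraction=0.2, capacity_users=20)
print(f"Probability of exceeding capacity: {risk:.6f}")
```

The result is a probability rather than a guarantee, which is exactly why SLAs on shared networks end up being written around averages over longer periods.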

This is where the service/resource management differences arose.  The common principle of packet networks is to design for acceptable (within the collective SLAs) behavior and then assume it as long as all the network’s resources are operating within their design limits.  So you managed resources, but you also sought to have some mechanism of spotting issues with customer services so that you could be proactive in handling them.  Hence, the service/resource management split; you need both to offer SLAs and reasonable/acceptable levels of customer care and response.

The ability to deliver an SLA from a shared-resource packet network depends in large part on your ability to design the network to operate within a given behavioral zone, and to detect and remedy situations when it doesn’t.  That means a combination of telemetry and analytics, and the two have to be more sophisticated as the nature of the resource-sharing gets more complicated.  To the extent that SDN or NFV introduce new dimensions in resource sharing (and both clearly do) you need better telemetry and analytics to ensure that you can recognize “network resource” problems and remedy them.  That gives you an acceptable response to service problems: you meet SLAs on the average, based on policies on violations that your own performance management design has set for you.
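As a minimal sketch of the "operate within design limits" idea, the check below averages telemetry samples per resource and flags anything running above its design limit. The resource names, the utilization metric, and the limits are all assumptions made for illustration; a real analytics layer would be far richer.

```python
from statistics import mean


def out_of_limits(samples, design_limits):
    """Return resources whose average utilization exceeds their design limit.

    `samples` maps a resource name to a list of utilization readings (0..1);
    `design_limits` maps the same names to the limit the network was
    engineered for. Both structures are hypothetical.
    """
    breaches = {}
    for resource, readings in samples.items():
        average = mean(readings)
        limit = design_limits[resource]
        if average > limit:
            breaches[resource] = (average, limit)
    return breaches


telemetry = {"trunk-a": [0.5, 0.9, 1.0], "trunk-b": [0.2, 0.3]}
limits = {"trunk-a": 0.7, "trunk-b": 0.7}
for name, (average, limit) in out_of_limits(telemetry, limits).items():
    print(f"{name}: average {average:.2f} exceeds design limit {limit:.2f}")
```

Note that this is resource management, not service management: nothing here says which customers are affected, only that a resource has left its designed behavioral zone.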

However, SDN and NFV both change the picture of resource-sharing just a bit.  First, I’ll use an SDN example.  If you assign specific forwarding paths to specific traffic from specific user endpoints, from a central control point, you presumably know where the traffic is going at any point in time.  You don’t know that in an IP network today because of adaptive routing.  So could you write a better, meaning tighter, SLA?  Perhaps.  Now for NFV, if you have a shared resource (hosted facilities) emulating a dedicated device, have you created a situation where your SLA will be less precise because your user thinks they’re managing something that’s dedicated to them, and in fact is not?

In our SDN example, we could in theory derive pretty detailed SLA data for a user’s service by looking at the status of the specific devices and trunks we’d assigned traffic to.  However, that raises the question of mechanism.  Every forwarding path (route) through an SDN network has a specific resource inventory, and we know what that is at the central control point.  But is the status of the network the sum of all the route states?  Surely, but how do we summarize and present them?  Management at the service level should now be viewed as a kind of composite: a gross state derived from the average conditions based on some algorithm, with a drill-down to path-level state as needed.  That’s not what we have today.  And if SDN is offered using something other than central control, or if parts of the network are centralized and parts are not, how do we derive management then?
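One way to picture that composite is sketched below: each path takes the state of its worst resource, and the service gets a gross state rolled up from its paths, with the per-path detail kept for drill-down. The state levels, the roll-up rule, and every name here are assumptions for illustration, not a description of any controller's actual API.

```python
from dataclasses import dataclass
from enum import IntEnum


class State(IntEnum):
    """Hypothetical health levels, ordered so a larger value is worse."""
    OK = 0
    DEGRADED = 1
    FAILED = 2


@dataclass
class Path:
    """A forwarding path and the resources the controller assigned to it."""
    name: str
    resources: list


def path_state(path, resource_states):
    """A path is only as healthy as its worst device or trunk."""
    return max((resource_states[r] for r in path.resources), default=State.OK)


def service_view(paths, resource_states):
    """Return (gross service state, per-path states for drill-down)."""
    detail = {p.name: path_state(p, resource_states) for p in paths}
    states = list(detail.values())
    # Assumed roll-up rule: failed only if every path has failed,
    # degraded if any path is less than fully healthy.
    if states and all(s == State.FAILED for s in states):
        gross = State.FAILED
    elif any(s != State.OK for s in states):
        gross = State.DEGRADED
    else:
        gross = State.OK
    return gross, detail


resource_states = {
    "sw1": State.OK, "sw2": State.OK,
    "trunk-a": State.DEGRADED, "trunk-b": State.OK,
}
paths = [
    Path("primary", ["sw1", "trunk-a", "sw2"]),
    Path("backup", ["sw1", "trunk-b", "sw2"]),
]
gross, detail = service_view(paths, resource_states)
print(gross.name, {name: state.name for name, state in detail.items()})
```

The roll-up rule is the "some algorithm" part, and it is a policy decision; a different operator might weight paths by the traffic they carry rather than treating them equally.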

In NFV, the big issue is the collision of today’s management practices and interfaces with virtual infrastructure.  A user can manage a dedicated device, but their management of a virtual device has to be exercised within the constraints imposed by the fact that the resources are shared.  I can never let a user or a service component exercise unfettered management on a resource that happens to host a part of their service because I have to assume that could compromise other users and services.
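A minimal sketch of that constraint is a gate that sits between the user's management requests and the shared infrastructure. The action names, tenant names, and ownership table below are hypothetical; the point is only that a request must be both service-scoped and aimed at something the requester actually owns.

```python
# Actions assumed safe to expose because they touch only the virtual function.
ALLOWED_ACTIONS = frozenset({"get_status", "update_config", "restart_function"})


def authorize(tenant_functions, tenant, action, target):
    """Permit an action only if it is service-scoped and the target
    is one of the tenant's own virtual functions."""
    owns_target = target in tenant_functions.get(tenant, set())
    return action in ALLOWED_ACTIONS and owns_target


ownership = {"custA": {"vfw-A"}, "custB": {"vfw-B"}}
print(authorize(ownership, "custA", "restart_function", "vfw-A"))  # own function
print(authorize(ownership, "custA", "reboot_host", "host1"))       # shared host
print(authorize(ownership, "custA", "get_status", "vfw-B"))        # someone else's
```

Only the first request is permitted; the other two would reach into a shared host or another customer's function.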

All of this adds up to a need for a different management view.  Logically what I want to do is to gather all the data on resource state that I can get, at all levels.  What I then have to do is to correlate that data to reflect the role of a given resource set in a given service, and present my results in an either/or/both sense.  On the one hand, I have to replicate as best I can the management interfaces that might already be consumed for pre-SDN-and-NFV services.  They still may be in use, both at the carrier and user levels.  On the other hand, I have to present the full range of data that I may now have, in a useful form, for those management applications that can benefit.  This is what “virtualizing management” means.
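The correlate-then-present idea can be sketched in a few lines. Here resource data is collected once, grouped by the services each resource supports, and then rendered two ways: a legacy-style view that looks like a single device's up/down status, and a fuller view for applications that can use more. The metric names, service names, and view formats are all assumptions for illustration.

```python
def correlate(resource_metrics, service_map):
    """Group raw resource metrics by the services that use each resource."""
    return {
        service: {r: resource_metrics[r] for r in resources if r in resource_metrics}
        for service, resources in service_map.items()
    }


def legacy_view(per_resource):
    """Collapse hosting resources into the status a dedicated device would show."""
    up = bool(per_resource) and all(m["up"] for m in per_resource.values())
    return {"oper_status": "up" if up else "down"}


def full_view(per_resource):
    """Expose the detail available behind the service, plus a simple summary."""
    worst_cpu = max((m["cpu"] for m in per_resource.values()), default=None)
    return {"resources": per_resource, "worst_cpu": worst_cpu}


resource_metrics = {
    "host1": {"up": True, "cpu": 0.55},
    "vswitch3": {"up": True, "cpu": 0.20},
    "host2": {"up": False, "cpu": 0.0},
}
service_map = {
    "vFirewall-custA": ["host1", "vswitch3"],
    "vRouter-custB": ["host2", "vswitch3"],
}

by_service = correlate(resource_metrics, service_map)
for service, metrics in by_service.items():
    print(service, legacy_view(metrics), full_view(metrics)["worst_cpu"])
```

Both views are derived from the same collected data, so a resource shared by two services (the virtual switch here) is polled once and presented separately in each service's context.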

What we need to have, for both SDN and NFV, is a complete picture of how this resource/service management division and composition process will work.  We need to make it as flexible as we can, and to reflect the fact that things are going to get even more complicated as we evolve to realize SDN and NFV fully.