Policy Management for SDN and NFV

One thing about hot trends that my surveys say frustrates users is the tendency to talk about something without ever defining it properly or explaining how it would work.  We’ve all seen that with SDN and NFV.  It doesn’t have to be one of the super-revolution trends, either; I’ve noticed that we’re hearing more about “policies” in networking, yet it’s hard to nail down what people are really talking about.  That’s too bad, because policies could be critical in both SDN and NFV.

To paraphrase a famous saying, “ex pluribus, chaos” (“from many, chaos”, for those who don’t read Latin and don’t want to bother running a translation program).  Many of the things we do in network deployment and management work fine when you’re not doing too many of them.  The same is true of many strategies for monitoring things like sensors.  But if you’re confronted with millions and millions of things to deal with, they don’t work so well.

Perhaps the decisive problem with technologies like frame relay and ATM was that “connections” requiring network devices to know something about specific user sessions simply don’t scale to the level of the Internet.  In a connectionless network, gadgets shuffle packets one at a time with no knowledge of the overall state of the session.  If you want to add a particular QoS to the mix, you define a class or grade of service and tag traffic appropriately.  This creates a small number of functional “sub-networks”, and you can manage them.
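To make the tagging idea concrete, here is a minimal Python sketch, assuming a hypothetical edge classifier that maps packets to one of three grades of service (“A”, “B”, “C”) by destination port.  The `Packet` type, the rules, and the grade names are illustrative assumptions, not any standard; real networks typically carry such a tag in a header field like the DiffServ code point.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    dst_port: int
    grade: str = ""  # grade-of-service tag, set once at the network edge

# Hypothetical edge classification rules, checked in order; first match wins.
CLASSIFIERS = [
    (lambda p: p.dst_port == 5060, "A"),       # e.g. SIP voice signaling
    (lambda p: p.dst_port in (80, 443), "B"),  # e.g. web traffic
]
DEFAULT_GRADE = "C"  # everything else is treated as best effort

def tag(packet: Packet) -> Packet:
    """Return a copy of the packet tagged with its grade of service."""
    for matches, grade in CLASSIFIERS:
        if matches(packet):
            return replace(packet, grade=grade)
    return replace(packet, grade=DEFAULT_GRADE)
```

Interior devices would then key their handling on `grade` alone, so the state they hold grows with the number of grades rather than the number of sessions.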

The notion of “policy management” emerges from this approach.  Instead of saying “connection x gets this QoS”, you say “packets meeting these criteria are handled thusly”, which means that when conditions impact the delivery of a specific grade of service, you have a set of policies that define what you do.  In effect, policy management disconnects QoS from individual services and moves it toward infrastructure.  You don’t manage individual sessions or relationships; you manage the collective conditions that set QoS for the appropriately tagged items.
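A minimal Python sketch of the two halves of that idea: a per-grade handling table (“packets meeting these criteria are handled thusly”) and a policy table keyed by grade and collective condition.  Every grade, condition, and action name here is a hypothetical placeholder.

```python
# Hypothetical per-grade handling: what every device does with a tagged packet.
HANDLING = {
    "A": {"queue": "priority", "drop_precedence": "low"},
    "B": {"queue": "assured", "drop_precedence": "medium"},
    "C": {"queue": "best_effort", "drop_precedence": "high"},
}

# Hypothetical policies: (grade, condition) -> action.  Conditions describe the
# collective state of a grade, never the state of an individual session.
POLICIES = {
    ("A", "congested"): "reroute_grade_A_to_backup_paths",
    ("A", "degraded"): "raise_alarm",
    ("B", "congested"): "throttle_grade_C_admission",
    ("C", "congested"): "no_action",
}

def select_policy(grade: str, condition: str) -> str:
    """Pick the action for a grade-of-service condition; default is to log it."""
    return POLICIES.get((grade, condition), "log_only")
```

Nothing in `POLICIES` mentions a session; the table’s size depends only on the number of grades and conditions.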

Policy management could obviously reduce the complexity of SDN management because it would focus management processes on sustaining grade-of-service behavior.  It presumes that if a packet is admitted to a grade of service (we could in theory block packets to enforce design constraints like total load), it will be handled correctly as long as all the gadgets associated with that grade of service are performing their tasks.  You could think of it as presumptive QoS: unless the network reports that something is wrong, I can presume it’s all going right.  This lets SDN design focus on establishing the traffic management rules for the various grades of service, and SDN management focus on making sure that everything enforcing those rules is working.  If it isn’t, there are policies to say what to do, and the limited number of grades of service makes this scalable.
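Presumptive QoS could be sketched like this, assuming a hypothetical `GradeOfService` object that enforces a design load limit at admission and treats the grade as healthy until some element reports a fault.  The capacity figure and the fault-reporting calls are assumptions for illustration.

```python
class GradeOfService:
    """Admission control and presumptive health for one grade of service."""

    def __init__(self, name: str, capacity_mbps: float):
        self.name = name
        self.capacity_mbps = capacity_mbps  # hypothetical design load limit
        self.admitted_mbps = 0.0
        self.faults: set[str] = set()       # elements currently reporting a problem

    def admit(self, demand_mbps: float) -> bool:
        """Admit traffic only while the grade stays within its design load."""
        if self.admitted_mbps + demand_mbps > self.capacity_mbps:
            return False
        self.admitted_mbps += demand_mbps
        return True

    def report_fault(self, element: str) -> None:
        self.faults.add(element)

    def clear_fault(self, element: str) -> None:
        self.faults.discard(element)

    def presumed_ok(self) -> bool:
        """Presumptive QoS: unless something reported a fault, assume all is well."""
        return not self.faults
```

Management only has to watch for `report_fault` calls per grade; admitted traffic is not tracked session by session.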

The nice thing about policy-managed SDN is that you can gather information from anyplace where traffic can be monitored, feed that back to some correlating logic, and pick a policy to do what’s needed.  The devices themselves need not be “managed” at all; you just infer device condition from traffic.
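As a rough illustration of inferring device condition from traffic, this Python sketch assumes hypothetical monitoring points (`mon1` to `mon3`) whose paths cross known devices, and a loss objective of 1%.  It flags any grade with a monitor over the objective and, as a simple heuristic, suspects the devices common to every bad path.  The topology, threshold, and heuristic are all assumptions; a production correlator would need to be far more careful.

```python
# Hypothetical monitoring points and the devices each monitored path crosses.
PATH_ELEMENTS = {
    "mon1": {"sw1", "sw2"},
    "mon2": {"sw2", "sw3"},
    "mon3": {"sw3", "sw4"},
}
LOSS_THRESHOLD_PCT = 1.0  # assumed per-grade loss objective

def correlate(observations):
    """Infer degraded grades and suspect devices from traffic reports alone.

    observations: iterable of (monitor, grade, loss_pct) tuples.
    """
    degraded, bad_paths = set(), []
    for monitor, grade, loss in observations:
        if loss > LOSS_THRESHOLD_PCT:
            degraded.add(grade)
            bad_paths.append(PATH_ELEMENTS[monitor])
    # Heuristic: devices shared by every bad path are the most likely culprits.
    suspects = set.intersection(*bad_paths) if bad_paths else set()
    return degraded, suspects
```

With loss reported on `mon1` and `mon2` but not `mon3`, the only device on both bad paths is `sw2`, so it becomes the suspect without ever being polled.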

For NFV things are a little more complicated, and the reason is simple.  NFV in its ETSI ISG form presumes that virtual functions can be managed equivalently to real devices, which means that the virtual-element behaviors have to map somehow to a MIB that then interprets and presents them as appropriate status variables.  The question is whether it’s possible, or effective, to do this if the paradigm of network management in place doesn’t allow you to look at the specific state of a specific session.  If two virtual network functions are linked in a service chain, do I need to know what is wrong with the linking path if it’s degraded?  Can I infer, based on the fact that I’ve assigned that path to grade-of-service “C” and that “C” is degraded in a general sense, that this path is degraded?
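The inference the question describes can be written down directly.  In this sketch the service-chain links, their grade assignments, and the `State` values are hypothetical; the function reports a link’s state purely from the aggregate state of its grade, with no per-session data.

```python
from enum import Enum

class State(Enum):
    OK = "ok"
    DEGRADED = "degraded"
    UNKNOWN = "unknown"

# Hypothetical service-chain links between VNFs, each assigned a grade of service.
CHAIN_LINKS = {
    ("firewall", "nat"): "C",
    ("nat", "load_balancer"): "B",
}

def inferred_link_state(link, grade_state):
    """Infer a link's state from its grade; no per-session data is consulted."""
    grade = CHAIN_LINKS.get(link)
    if grade is None:
        return State.UNKNOWN
    return grade_state.get(grade, State.UNKNOWN)
```

The sketch makes the assumption explicit: it is only valid if every path in grade “C” really does share the grade’s aggregate condition, which is exactly what is in question.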

The challenge in NFV is that while it’s not explicitly necessary to know the state of connecting resources, it’s explicit that NFV management knows the state of hosting resources.  If I construct a management view of a virtual function by combining the detailed state of its hosting resources with a kind of fudge number representing what I think the state of connection resources is, based on aggregate grade-of-service state, what kind of result do I end up with?  If I undertake remediation for a problem and pick hosting locations carefully to maximize QoS, can grade-of-service state alone ensure I haven’t made a bad choice because of connecting resources?
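Here is one hypothetical way to build the composite view the paragraph describes, assuming health scores between 0.0 and 1.0: detailed per-resource hosting metrics, combined by taking the minimum with a per-grade connection estimate (the “fudge number”).  The grade estimates and the min-combination rule are assumptions, not an NFV-specified method.

```python
# Hypothetical 0.0-1.0 health scores.  Hosting health is measured in detail;
# connection health is only an estimate taken from aggregate grade state.
GRADE_HEALTH_ESTIMATE = {"A": 0.99, "B": 0.95, "C": 0.80}

def vnf_health(host_metrics: dict[str, float], link_grades: list[str]) -> float:
    """Combine detailed hosting state with a grade-based connection estimate."""
    hosting = min(host_metrics.values())  # weakest hosting resource dominates
    connection = min((GRADE_HEALTH_ESTIMATE[g] for g in link_grades), default=1.0)
    return min(hosting, connection)
```

Two candidate hosts reached over grade “C” links get the same connection estimate, 0.80, so this view cannot tell them apart on connectivity; that is the blind spot the remediation question points at.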

One thing that policy management is really good for is sustaining services and QoS across administrative or technical boundaries.  If a network is made up of three “zones” each of which is managed autonomously, it’s very helpful to be able to communicate service behaviors and handling policies among the zone-keepers so that a consistent experience can be provided.  Each zone remains a “black box” but the properties of the boxes can be shared, and reports on properties and deviations can be made among the zones.
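A minimal sketch of zones as black boxes, assuming a hypothetical `Zone` class that publishes per-grade properties (here just latency) and pushes deviation reports to its peers.  Peers never see inside a zone; they only compose advertised properties and read reports.

```python
from dataclasses import dataclass, field

@dataclass
class Zone:
    """An autonomously managed zone, visible to its peers only as a black box."""
    name: str
    # Published grade-of-service properties, e.g. {"A": {"latency_ms": 10.0}}.
    advertised: dict[str, dict[str, float]]
    peers: list["Zone"] = field(default_factory=list)
    inbox: list[tuple[str, str, str]] = field(default_factory=list)

    def report_deviation(self, grade: str, detail: str) -> None:
        """Tell every peer that a published grade property is not being met."""
        for peer in self.peers:
            peer.inbox.append((self.name, grade, detail))

def end_to_end_latency(zones: list[Zone], grade: str) -> float:
    """Compose advertised per-zone latencies for a grade across a path of zones."""
    return sum(z.advertised[grade]["latency_ms"] for z in zones)
```

A path crossing zones that advertise 10, 5, and 12 ms for grade “A” can be expected to deliver about 27 ms end to end, and any zone that can’t meet its figure says so to the others.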

I think it’s very likely that policy management could sustain SDN service management, and because of the “zone” benefit it would be a good way to organize manageable SDN domains into an overall network.  Since SDN should be able to abstract any service if it’s exploited correctly, it should also be possible to manage NFV connection services, both interior to the VNFs and outside them, using policy management.  The challenge is knowing how to do it.

The relationship between SDN and NFV is important for a lot of reasons, and it might be critical in determining how NFV would exploit grade-of-service handling and policy management.  Policy management wouldn’t necessarily change the way NFV recognizes service events for each of the virtual components, because server failures and other hosting issues would still need remediation under at least some conditions.  But if we were to visualize NFV as consuming “network-as-a-service” and “hosting-as-a-service”, we could possibly create a simple management framework for NFV.  That could answer a lot of the questions that stand in the way of making a strong NFV business case, so exploring the question might be a good topic for PoCs.
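To sketch what that framework might look like, assume hypothetical `NetworkAsAService` and `HostingAsAService` interfaces: hosting exposes per-host state, while the network exposes only grade-of-service state.  The NFV service then remediates hosting problems per VNF and hands network problems to the NaaS side’s own policies.  The interfaces and action strings are illustrative assumptions, not anything defined by the ETSI ISG.

```python
from dataclasses import dataclass
from typing import Protocol

class NetworkAsAService(Protocol):
    """Hypothetical NaaS interface: exposes only grade-of-service state."""
    def grade_state(self, grade: str) -> str: ...

class HostingAsAService(Protocol):
    """Hypothetical HaaS interface: exposes the state of individual hosts."""
    def host_state(self, host: str) -> str: ...

@dataclass
class NfvService:
    naas: NetworkAsAService
    haas: HostingAsAService
    vnf_hosts: dict[str, str]  # VNF name -> host it runs on
    grades: set[str]           # grades of service its connections use

    def actions(self) -> list[str]:
        """Hosting faults are remediated per VNF; network faults go to NaaS policy."""
        acts = [f"redeploy {vnf}" for vnf, host in sorted(self.vnf_hosts.items())
                if self.haas.host_state(host) != "ok"]
        acts += [f"notify naas: grade {g}" for g in sorted(self.grades)
                 if self.naas.grade_state(g) != "ok"]
        return acts
```

The NFV side never reasons about individual connection paths; it sees only whether the grades it consumes are healthy, which is what keeps the framework simple.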