It’s pretty clear to me from comments I’ve gotten on my blog that there are a lot of questions out there on the topic of management and operations. When we talk about things like SDN and NFV we can talk about management practices in a general way, but we can really only talk about “operations” if we can integrate management practices with business practices. So today let’s take a look at the topic and see if we can sort out a framework for evaluating the needs of new network technologies.
You can group services into two broad classes—assured and best-efforts. Assured services are those that guarantee some level of service quality and availability, and best-efforts services are those that…well…just expect the operator to do their best to get something from point “A” to point “B”.
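To make the distinction concrete, here’s a minimal sketch in Python (all the names, like ServiceClass and SlaTarget, are invented for illustration) of how the two classes might be modeled: an assured service carries explicit targets, while a best-efforts service carries none.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ServiceClass(Enum):
    ASSURED = "assured"            # explicit quality/availability guarantees
    BEST_EFFORTS = "best-efforts"  # no hard guarantees, just "do your best"

@dataclass
class SlaTarget:
    max_loss_pct: float      # e.g. 0.1 means no more than 0.1% packet loss
    max_delay_ms: float      # mean one-way delay bound
    min_availability: float  # e.g. 0.9999 for "four nines"

@dataclass
class Service:
    name: str
    service_class: ServiceClass
    sla: Optional[SlaTarget] = None  # present only for assured services

# An assured VPN carries hard targets; consumer broadband carries none.
vpn = Service("enterprise-vpn", ServiceClass.ASSURED,
              SlaTarget(max_loss_pct=0.1, max_delay_ms=50, min_availability=0.9999))
broadband = Service("consumer-broadband", ServiceClass.BEST_EFFORTS)
```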
Best-efforts may not have a hard standard of quality or availability, but it’s not “no-effort”. An ISP that had significant numbers of delivery failures could never survive in a competitive market, so independent of the service classes above there’s still a general need to manage network behavior so as to get most stuff to its destination.
Here we can also say there are two broad approaches, provisioned QoS and engineered QoS. The former means that a service is assigned resources according to the service guarantees made, and sustaining those guarantees means validating whether the committed resources are meeting them. The latter means that we play a statistical game. We engineer the infrastructure to a given probability of packet loss or mean delay or failure rate/duration, and we understand that most of the time we’ll be OK. Sometimes we won’t. In all cases, though, what we’ve done is to calculate the service quality based on resource conditions. We assure resources against the parameters that we define as “good-enough-effort” for a given service, not the services themselves.
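As a rough illustration of the statistical game, here’s a sketch that uses the textbook M/M/1 queue as a stand-in for a real traffic model (an assumption made purely for illustration): given an offered load and a link capacity, we can estimate mean delay and the chance of overflowing a buffer, and call the engineering “good enough” if both fall inside the targets.

```python
# Sketch of engineered QoS as a statistical calculation, using the
# textbook M/M/1 queue as a stand-in for a real traffic model.
def mm1_engineering_check(arrival_rate: float, service_rate: float,
                          buffer_packets: int,
                          target_mean_delay: float,
                          target_overflow_prob: float) -> bool:
    """Return True if the link, engineered this way, meets its targets."""
    rho = arrival_rate / service_rate            # utilization
    if rho >= 1.0:
        return False                             # unstable: queue grows forever
    mean_delay = 1.0 / (service_rate - arrival_rate)   # mean time in system
    overflow_prob = rho ** (buffer_packets + 1)        # P(more than B in system)
    return mean_delay <= target_mean_delay and overflow_prob <= target_overflow_prob

# 8000 pkt/s offered against 10000 pkt/s capacity with a 50-packet buffer:
# we accept a small, calculated chance of loss rather than per-service guarantees.
print(mm1_engineering_check(8000, 10000, 50,
                            target_mean_delay=0.005,        # 5 ms
                            target_overflow_prob=1e-4))     # True
```

The point isn’t the queueing model; it’s that the quality calculation runs against resource conditions, not against individual services.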
Where we have provisioned QoS, we assure services specifically and we do resource substitution or adaptation based on service events. Where we have engineered QoS, we build a network to operate within statistical boundaries, and we respond to conditions with what? Generally, with policy management. Policies play two broad roles—they ensure that the resources are operating within their designed limits through traffic admission controls, and they respond to internal conditions by applying predefined broad remedies.
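A hedged sketch of those two roles might look like this (PolicyEngine and the remedy table are invented names): one function gates traffic admission against designed limits, and the other maps internal conditions to predefined broad remedies.

```python
# Sketch of the two broad policy roles: admission control to keep
# resources inside designed limits, and predefined remedies for events.
class PolicyEngine:
    def __init__(self, max_utilization: float):
        self.max_utilization = max_utilization
        # Predefined, broad remedies keyed by resource condition.
        self.remedies = {
            "link-congested":  "reroute-low-priority-traffic",
            "node-overheated": "shed-best-efforts-load",
            "link-down":       "activate-preplanned-backup-path",
        }

    def admit(self, current_utilization: float, flow_bandwidth: float,
              capacity: float) -> bool:
        """Admission control: refuse traffic that would exceed design limits."""
        projected = current_utilization + flow_bandwidth / capacity
        return projected <= self.max_utilization

    def respond(self, condition: str) -> str:
        """Apply the predefined remedy for an internal condition."""
        return self.remedies.get(condition, "alert-operations-staff")

engine = PolicyEngine(max_utilization=0.8)
print(engine.admit(current_utilization=0.7, flow_bandwidth=50, capacity=1000))  # True: projects to 0.75
print(engine.respond("link-congested"))  # "reroute-low-priority-traffic"
```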
So can we apply this to SDN and NFV? Let’s see.
In the case of SDN, the immediate problem we have is in the definition of SDN. Does SDN mean “software defined” in the sense that network behavior is centrally software-managed without local device adaptation (the original notion) or do we mean it’s “software controlled” meaning that software can control network behavior more precisely?
If SDN is the former, then we have a bicameral model of service behavior. We have real devices (whether they’re hosted/virtual or physical) that have to be running and working in order for traffic to be passed from place to place, and we have central route/forwarding control that has to be right or packets fly off to the wrong places or into the bit bucket. The advent of central forwarding control means that we know the rules in one place, the routes in one place, but it doesn’t mean we’re less dependent than before on retrieving and delivering status information. In fact, one of the biggest issues in centralized SDN is how you secure management connectivity at all times without adaptive behavior to provide default paths. Without management connectivity you don’t know what’s going on, and can’t do anything about it.
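To see why that matters, here’s a toy sketch (the topology and names are invented) that checks whether the controller still has a management path to every switch; anything that drops out of the reachable set is a device we can neither observe nor repair.

```python
from collections import deque

# Toy management network: adjacency of the controller and its switches.
topology = {
    "controller": ["s1"],
    "s1": ["controller", "s2", "s3"],
    "s2": ["s1", "s4"],
    "s3": ["s1"],
    "s4": ["s2"],
}

def unreachable_from_controller(adjacency: dict, failed_links: set) -> set:
    """Return nodes with no surviving management path from the controller."""
    seen, queue = {"controller"}, deque(["controller"])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            link = frozenset((node, neighbor))
            if neighbor not in seen and link not in failed_links:
                seen.add(neighbor)
                queue.append(neighbor)
    return set(adjacency) - seen

# If the s1-s2 link fails, s2 and s4 go dark: we can't see them or fix them.
print(unreachable_from_controller(topology, {frozenset(("s1", "s2"))}))  # {'s2', 's4'}
```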
In the software-controlled model, we presumably still have adaptive behavior and thus have default routing. Arguably this model means, in management terms, that we are providing more service-centricity while sustaining essentially an adaptive, policy-managed resource set. It’s my view that any software-controlled SDN models are really models that will ultimately rely on better (more granular in terms of services and service grades) policy control. This model relies more on predictive analytics for one simple reason: if you can’t figure out exactly how a given service will impact resources and thus other services, you can’t reliably provide for software control at the service and application level (which is the only meaningful place to provide it). So we do traffic engineering on service/application flows based on understanding network conditions and how they’ll change with the introduction of the new service/flow.
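A minimal sketch of that predictive step (the link loads and candidate paths are invented) might look like this: before admitting a new service flow, project what it would do to every link it crosses, and admit it only if the prediction stays inside the engineered bounds.

```python
# Sketch of predictive flow placement: project the impact of a new
# service flow on every link it would cross before committing to it.
link_load_mbps = {"a-b": 600.0, "b-c": 850.0, "a-c": 300.0}
link_capacity_mbps = {"a-b": 1000.0, "b-c": 1000.0, "a-c": 1000.0}

def predict_and_admit(path: list, flow_mbps: float, ceiling: float = 0.9) -> bool:
    """Admit the flow only if every link on its path stays under the ceiling."""
    for link in path:
        projected = (link_load_mbps[link] + flow_mbps) / link_capacity_mbps[link]
        if projected > ceiling:
            return False          # this placement would violate the engineering
    for link in path:             # commit the placement only after all checks pass
        link_load_mbps[link] += flow_mbps
    return True

print(predict_and_admit(["a-b", "b-c"], 200.0))  # False: b-c would hit 105%
print(predict_and_admit(["a-c"], 200.0))         # True: a-c reaches only 50%
```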
In the central-control model, we can still use predictive analytics but we also have to provide the baseline assembly of route status and traffic patterns that come from our two previously mentioned sources. However, we’ll use this information not to manipulate routes or queues but rather to create/connect paths for specific services or flows. We may also use policy management, but more in the form of automated service responses to resource events. There are plenty of early SDN examples of predefining failure modes to create fast transitions from a normal state to a failure-response state when something goes wrong.
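Here’s a sketch of that pattern (the states and paths are invented): the failure responses are computed ahead of time, so when the resource event arrives the transition is a table lookup rather than a fresh computation.

```python
# Sketch of predefined failure modes: responses are computed in advance,
# so a resource event triggers a fast table lookup, not a recomputation.
normal_paths = {"flow-1": ["a", "b", "c"], "flow-2": ["a", "c"]}

# Precomputed per-failure forwarding states, built during planning.
failure_modes = {
    "link-b-c-down": {"flow-1": ["a", "d", "c"], "flow-2": ["a", "c"]},
    "node-a-down":   {"flow-1": ["e", "b", "c"], "flow-2": ["e", "c"]},
}

current_state = dict(normal_paths)

def on_resource_event(event: str) -> None:
    """Transition straight from the normal state to the failure-response state."""
    global current_state
    preplanned = failure_modes.get(event)
    if preplanned is not None:
        current_state = dict(preplanned)   # one swap, no path computation
    # unknown events would fall through to slower, general-purpose handling

on_resource_event("link-b-c-down")
print(current_state["flow-1"])  # ['a', 'd', 'c'], the preplanned detour
```

The price of that speed is planning: something, human or machine, has to have enumerated the failure modes in advance.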
I think it’s clear from SDN that we do have a role for analytics and also a role for policy management in the models. How about NFV?
NFV replaces boxes with software/hosting pairs, much as centralized SDN would replace switching/routing with OpenFlow switches and central software. You would have to manage NFV elements just as much (though not necessarily in the same way) as we managed real devices in the same mission before NFV came along. We can’t manage the resources on which we host virtual functions and assume that the software is perfect, any more than we could presume in SDN that just because all our OpenFlow switches were running fine, we had coherent forwarding rules to deliver traffic or correctly operating software, whether central or in the devices. If the services are provisioned per-user, though, then service management in a virtual world doesn’t necessarily change much, because a service event can trigger a resource-assignment-remediation response.
But NFV adds a problem dimension in that a virtual device may have properties that real ones didn’t. Scalability and repositioning are examples of this. So we have an additional dimension of things that might get done, and there’s arguably a policy-management and even an analytics link to doing them. Scale-in/out and redeployment on command are management functions that can be automated responses to events. Is that policy management?
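Here’s one way to picture it (the event names and actions are invented): the new lifecycle verbs a virtual function supports, like scale-out and redeploy, become automated responses wired to events, which looks a lot like policy management extended to a richer vocabulary of actions.

```python
# Sketch of NFV lifecycle automation: virtual-function verbs that real
# boxes never had (scale out, redeploy) wired up as responses to events.
class VirtualFunction:
    def __init__(self, name: str, host: str):
        self.name, self.host, self.instances = name, host, 1

    def scale_out(self) -> None:
        self.instances += 1
        print(f"{self.name}: scaled out to {self.instances} instances")

    def scale_in(self) -> None:
        self.instances = max(1, self.instances - 1)
        print(f"{self.name}: scaled in to {self.instances} instances")

    def redeploy(self, new_host: str) -> None:
        self.host = new_host
        print(f"{self.name}: redeployed to {self.host}")

# Event-to-action bindings: policy management with a wider action vocabulary.
def handle(vf: VirtualFunction, event: str) -> None:
    actions = {
        "cpu-high":     vf.scale_out,
        "cpu-low":      vf.scale_in,
        "host-failing": lambda: vf.redeploy("standby-host"),
    }
    actions.get(event, lambda: print(f"{vf.name}: no automated response"))()

firewall = VirtualFunction("virtual-firewall", "host-1")
handle(firewall, "cpu-high")      # scale-out as an automated response
handle(firewall, "host-failing")  # redeploy as an automated response
```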
To me, this is the core of the whole story of SDN/NFV management. We are really talking about service automation here, even today, before we have either SDN or NFV much in the picture. Today, most service automation is either applied at the OSS/BSS provisioning level or exercised through policy management (admission control, grade-of-service assignment). In the mobile world we have those tools. The place where we might be going wrong is in assuming that the mobile world gives us the only example of how we implement policy management. I think the best thing to do here is to say that we’re heading for a future of service automation, and that this future automation strategy will draw on traffic data, past and present, and device status (also past and present) in some measure to make machine decisions about how to handle conditions in real time. If we apply service automation to the same kinds of services that today’s PCC-based policy management or path computation functions address, we could well have the same implementation. But not all future NFV and SDN management missions map to this basic model, and that’s why we have to expand our conception of management when we expand into SDN and NFV.
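To close with something concrete, here’s a hedged sketch of that decision loop (all names are invented): a service-automation function that weighs current telemetry against its own history and picks an action, whether the underlying machinery is PCC-style policy, SDN path control, or NFV lifecycle management.

```python
from statistics import mean

# Sketch of the general service-automation decision: combine past and
# present traffic/status data into a real-time choice of action.
history = []  # past utilization samples, fed by telemetry collection

def decide(current_utilization: float, device_ok: bool) -> str:
    history.append(current_utilization)
    baseline = mean(history[-100:])   # "the past": the recent operating norm
    if not device_ok:
        return "apply-preplanned-failure-mode"      # SDN-style fast transition
    if current_utilization > 1.5 * baseline and current_utilization > 0.7:
        return "scale-out-or-reroute"               # NFV/SDN-style remedy
    return "no-action"

for sample in (0.40, 0.42, 0.41, 0.43):
    decide(sample, device_ok=True)                  # build up the baseline
print(decide(0.75, device_ok=True))   # "scale-out-or-reroute": well above the norm
print(decide(0.45, device_ok=False))  # "apply-preplanned-failure-mode"
```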