Can We Scale SDN and NFV?

Over the holiday weekend I got an email from a network operator friend who offered a comment on the state of SDN and NFV.  The point was simple: it’s not completely accurate to say that the PoCs and trials so far have validated either SDN or NFV functionality.  The reason isn’t that these efforts have left functions out, but that they’ve not addressed operation at scale.

SDN provably works in a data center.  There are a few NFV models that could in theory deploy in a kind of minimalist way (service chaining using edge hosting, for example) but nobody thinks SDN or NFV can survive on those models alone.  We have to make both SDN and NFV work in massive deployments.  How?

The first step, on which my operator friend and I agree, is to formalize the process of “domains” and “federation”.  If we think of SDN as being a domain of switches under a controller, we’ve defined an SDN domain.  If we think of NFV as being a collection of NFVI under a single MANO instance, we’ve defined an NFV domain.  The point is that we know that there will be multi-domain networks built using both technologies, so we need to understand how services can cross domain boundaries.

The current mechanisms for SDN federation rely mostly on legacy protocol features, BGP being one example.  This isn’t a long-term approach to the issue because it doesn’t let services be built across multiple domains using the full set of SDN features.  In SDN it seems obvious that we need to have a control hierarchy, and/or define an “interdomain” protocol set that would let controllers cooperate to establish services across a connected set of domains.
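To make the control-hierarchy idea concrete, here is a minimal sketch of a parent controller stitching a service across two domains.  Everything in it is hypothetical (the class names, the `connect` call, the port labels); it is not drawn from any SDN specification and only shows the shape of the relationship, where the parent sees each domain as a set of border ports and nothing more.

```python
class DomainController:
    """Hypothetical controller for one SDN domain; internals stay hidden."""

    def __init__(self, name, border_ports):
        self.name = name
        self.border_ports = set(border_ports)

    def connect(self, ingress, egress):
        """Build a path between two border ports and return an opaque segment ID."""
        if not {ingress, egress} <= self.border_ports:
            raise ValueError(f"{self.name}: unknown border port")
        return f"{self.name}:{ingress}->{egress}"


class ParentController:
    """Hypothetical higher-level controller that sees domains only as black boxes."""

    def __init__(self, domains):
        self.domains = {d.name: d for d in domains}

    def build_service(self, hops):
        """hops is a list of (domain name, ingress port, egress port) tuples."""
        return [self.domains[name].connect(i, e) for name, i, e in hops]


# Assumed two-domain topology, for illustration only.
metro = DomainController("metro", ["m1", "m2"])
core = DomainController("core", ["c1", "c2"])
parent = ParentController([metro, core])

segments = parent.build_service([("metro", "m1", "m2"), ("core", "c1", "c2")])
print(segments)
```

The parent never learns which switches sit inside a domain, which is what would let a domain be resized or replaced without the layer above noticing.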

NFV has no real federation capability as yet, though the NFV ISG has deliberated the question of interconnection of domains.  The sense I get from operators is that they are seeing cooperation as being primarily an exchange of resources rather than cooperative functionality.  An operator might deploy on NFVI provided by a partner, for example, but they’d use their own MANO.  That doesn’t seem to be a good long-term approach either, in no small part because of the next point.

Which is that the limits of domain size and performance have to be established.  How many switches can an SDN controller manage, and how many service creations and changes can be managed/orchestrated by NFV MANO?  It’s possible to extrapolate based on controllable events like service setups just how many events a controller or MANO instance could support.  The problem is that it’s difficult in the context of limited trials to understand how many events might be created under a full set of real-world network conditions.

Suppose we have ten thousand customers with service-chained cloud-hosted components.  We’re using NFV to orchestrate the hosting and SDN to make connections.  Even ten thousand customers probably generate only a few moves, adds, and changes in a day so this isn’t much of a problem for either technology.  Now suppose that we have a major trunk or data center failure.  We have a good number of those ten thousand service customers looking for automatic remediation.  Few would believe that a single instance of an SDN or NFV controller could absorb that load, particularly when the users would be scattered about a metro area or larger.
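A back-of-the-envelope calculation shows how wide the gap between those two conditions could be.  Only the ten thousand customers comes from the scenario above; the change rate, the share of customers hit by the failure, the number of control actions per customer, and the restoration window are all assumptions picked for illustration.

```python
def events_per_second(customers, events_per_customer, window_seconds):
    """Average event rate if each customer generates the given events in the window."""
    return customers * events_per_customer / window_seconds


CUSTOMERS = 10_000  # from the scenario in the text

# Assumption: 0.01 moves, adds, and changes per customer per day (100 per day overall).
steady = events_per_second(CUSTOMERS, 0.01, 24 * 3600)

# Assumptions: 3,000 customers affected by the failure, 3 control actions each
# (for example re-host, re-connect, verify), all wanted within 60 seconds.
AFFECTED = 3_000
burst = events_per_second(AFFECTED, 3, 60)

print(f"steady state : {steady:.5f} events/s")
print(f"failure burst: {burst:.1f} events/s")
print(f"ratio        : {burst / steady:,.0f}x")
```

Under these made-up numbers the burst runs about five orders of magnitude above the steady state, and that is the load a trial with a handful of service setups never exercises.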

You can’t have multiple co-equal controller instances trying to allocate the same pool of resources.   How does Guy A know that Guy B took capacity on a given trunk?  What happens if we had conflicting assignments of capacity?  There has to be some higher-level process that “knows” that when you have a massive failure you have to start by trying to replace the massive facilities that failed, and then move upward for remediation that will ultimately reach the service level.  What process is that in either SDN or NFV?
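The Guy A and Guy B problem can be shown in a few lines.  This is a sketch under stated assumptions: `TrunkCapacityBroker` is a hypothetical single point of truth for trunk capacity, not a component of any SDN or NFV specification, and the trunk name and capacity figures are invented.

```python
import threading


class TrunkCapacityBroker:
    """Hypothetical single owner of trunk capacity; a lock serializes every grant."""

    def __init__(self, capacity_by_trunk):
        self._free = dict(capacity_by_trunk)
        self._lock = threading.Lock()

    def reserve(self, trunk, amount):
        """Atomically grant `amount` on `trunk`, or refuse if too little is left."""
        with self._lock:
            if self._free.get(trunk, 0) < amount:
                return False
            self._free[trunk] -= amount
            return True

    def free(self, trunk):
        """Return the capacity still unallocated on `trunk`."""
        with self._lock:
            return self._free.get(trunk, 0)


broker = TrunkCapacityBroker({"trunk-1": 10})  # assumed: 10 units of capacity
results = []


def controller(name):
    # Each controller instance wants 6 units; only one request can fit.
    results.append((name, broker.reserve("trunk-1", 6)))


threads = [threading.Thread(target=controller, args=(n,)) for n in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results, broker.free("trunk-1"))
```

Exactly one of the two requests is granted, so the conflict goes away, but every grant now passes through one lock.  That lock is the serialization point whose throughput would bound the size of the domain.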

Even if we have controllable, federated, domains we still have to be able to engineer and test the combination without creating a national-scale communications disaster to prove we can handle one.  That means that both SDN and NFV domains have to be engineered to appear as true black-box functional elements so that domain management doesn’t have to dig into the details of what’s out there to understand how to hand off to it or replace it.  It also means that element behavior has to be designed to meet domain behavior standards.  We’re talking a lot about SDN or NFV performance, but we don’t really have a standard against which we can measure it.  We don’t know what a domain has to do, and thus don’t know what elements of that domain have to do to make the domain work.

Part of this engineering is dealing with the impact of various technology options and proposed specification elements on “domainability” of the whole.  For example, we know that SDN has two modes of route control.  One says the controller simply lays out routes based on a central topology understanding and analytics on device behavior.  The other says that when a packet is presented at a switch with new forwarding needs, the switch asks the controller for handling.  I think everyone understands that there are strengths and limitations to both these modes, but do we know what they are to the point where we could size controllers and domains?  I doubt it.
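A toy count of controller messages under each mode hints at why sizing is hard.  The three-switch path, the four provisioned flows, and the traffic pattern are assumptions for illustration; real switches also expire rules and lose table state, which this sketch ignores.

```python
SWITCHES = ["s1", "s2", "s3"]             # hypothetical path through one domain
FLOWS = [f"flow-{i}" for i in range(4)]   # hypothetical flows provisioned on it


def proactive_controller_messages(switches, flows):
    """Controller pushes one rule per (switch, flow) before any traffic arrives."""
    return len(switches) * len(flows)


def reactive_controller_messages(packets):
    """A switch asks the controller once for each flow it has no rule for yet."""
    installed = set()
    messages = 0
    for switch, flow in packets:
        if (switch, flow) not in installed:
            messages += 1
            installed.add((switch, flow))
    return messages


# Assumption: only two of the four flows ever send traffic, 100 packets each.
active = FLOWS[:2]
packets = [(s, f) for f in active for _ in range(100) for s in SWITCHES]

proactive = proactive_controller_messages(SWITCHES, FLOWS)
reactive = reactive_controller_messages(packets)
print(proactive, reactive)
```

Here the proactive mode costs more messages in total but on the controller's own schedule, while the reactive mode costs fewer but at moments chosen by traffic.  After a failure that invalidates forwarding state, those reactive requests would arrive together, and that is the case a controller would have to be sized for.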

Part of the problem goes back to federation.  We know based on cloud computing and OpenStack experience that there’s a limit to the size of a domain that a given instance of OpenStack can support.  Some of the elements are single-threaded, and it’s hard to see how you avoid that when (as I noted above) resource grants have to be coordinated to make sure several control points don’t grab the same thing.  How does that serialization happen, and what are the performance implications of the mechanism we’ve selected?

My operator friend is right; we’ve not really dug into at-scale SDN or NFV as strongly as we should have, and as a result most prospective users (and some actual users) don’t understand what might happen if an event creates a flood of changes.  I’ve had plenty of experiences in the networking industry where a small situation caused a cascade of problems that swamped the whole of a network.  The worst example I ever saw of an enterprise network failure, one that impacted almost sixty locations and over ten thousand workers in a mission-critical field, started with a link error.  The escalation to disaster wasn’t caused by the error spreading, but by the remediation overloading critical management elements way past the point of surviving or failing gracefully.  Neither SDN nor NFV can afford that, and while I think both technologies can be made to scale, federate, and survive, I don’t think we’re as close to being able to do that as we should be.