SDN Management: As Forgotten as NFV Management

I’ve talked quite a bit in 2014 about the operations issues of NFV, but much less about those associated with SDN.  We’re at the end of the year now, so I’d like to rectify that, at least in part, by addressing SDN operationalization.  There are links to NFV, of course, but also SDN-specific issues.

To open this up, we have to acknowledge the three primary “models” of SDN.  We have the traditionalist OpenFlow ONF model, the overlay “Nicira” model, and the “API-driven” Cisco model.  Each of these has its own issues, and we’ll address them separately, but some common management points will emerge.

OpenFlow SDN presumes a series of virtual or white-box elements whose forwarding is controlled by an SDN Controller component.  Network paths have to be set up by that controller, which means that there’s a little bit of a chicken-and-egg issue with respect to a “boot from bare metal” startup.  You have to establish the paths either adaptively (by having a switch forward an unknown header to the controller for instructions) or explicitly based on a controller-preferred routing plan.  In either case, you have to deal with the fact that in a “cold start”, there is no controller path except where a controller happens to be directly connected to a switch.  So you have to build the paths to control the path-building, starting where you have access and moving outward.
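To make the two options concrete, here’s a minimal Python sketch of a toy controller that can build forwarding state either reactively (a switch punts an unknown destination up for instructions) or proactively (from a controller-preferred routing plan).  The classes and names are my own illustration, not OpenFlow or any real controller API.

```python
# Toy illustration only: a controller that builds forwarding state either
# reactively (switch punts an unknown destination) or proactively (from a
# precomputed routing plan).  Names are hypothetical, not OpenFlow APIs.

class ToySwitch:
    def __init__(self, name):
        self.name = name
        self.flow_table = {}            # dst -> out_port

    def install_flow(self, dst, out_port):
        self.flow_table[dst] = out_port

class ToyController:
    def __init__(self, routing_plan):
        # routing_plan: {(switch_name, dst): out_port}, computed centrally
        self.routing_plan = routing_plan

    def proactive_setup(self, switches):
        """Push the whole controller-preferred plan before traffic flows."""
        for (sw_name, dst), port in self.routing_plan.items():
            switches[sw_name].install_flow(dst, port)

    def on_unknown_packet(self, switch, dst):
        """Reactive path: the switch asks 'what do I do with dst?'."""
        port = self.routing_plan.get((switch.name, dst))
        if port is not None:
            switch.install_flow(dst, port)   # cache the decision in the switch
        return port

switches = {"s1": ToySwitch("s1"), "s2": ToySwitch("s2")}
ctl = ToyController({("s1", "10.0.0.2"): 2, ("s2", "10.0.0.2"): 1})
ctl.proactive_setup(switches)                              # explicit, plan-driven
print(ctl.on_unknown_packet(switches["s1"], "10.0.0.9"))   # adaptive miss -> None
```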

In an operations sense, this means that telemetry from the management information bases of the real or virtual devices involved has to work its way along like anything else.  There will be a settling period after startup, but presumably that will end when all paths are established, and this should include management paths.  However, when there’s a problem it will be necessary for the controller to prioritize getting the paths from “devices” to controller established, followed by paths to and from the management ports.  How different the latter would be from the establishment of “service paths” depends on whether we’re seeing SDN being used simply to replicate IP network connectivity or being used to partition things in some way.
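To make the ordering concrete, here is a small sketch of how a controller might sequence path restoration after a cold start or major failure: control paths first, then management/telemetry paths, then ordinary service paths.  The priority scheme is my own illustration, not something the ONF specifications prescribe.

```python
# Illustrative only: sequence path re-establishment so the controller can
# reach its switches before anything else, and management telemetry comes
# back before ordinary service traffic.  Priorities are assumptions.

from enum import IntEnum

class PathClass(IntEnum):
    CONTROL = 0        # device <-> controller, needed before anything else
    MANAGEMENT = 1     # device MIB/telemetry <-> management ports
    SERVICE = 2        # customer-facing connectivity

pending_paths = [
    ("edge-3 -> core-1", PathClass.SERVICE),
    ("core-1 -> controller", PathClass.CONTROL),
    ("edge-3 MIB -> mgmt-collector", PathClass.MANAGEMENT),
    ("edge-3 -> controller", PathClass.CONTROL),
]

def restore(paths):
    # Sort by class, then set each path up "outward" from what is reachable.
    for desc, cls in sorted(paths, key=lambda p: p[1]):
        print(f"establishing {cls.name:<10} path: {desc}")

restore(pending_paths)
```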

However this is done, there are issues in coordinating the controller, the devices, and the management model.  Remember that the controller is charged with the responsibility for restoration of service, which means the controller should be a consumer of management data.  If a node or a trunk fails, it would be best if the controller knew and could respond by initiating a failure-mode forwarding topology.  You don’t want the management system and the controller stepping on each other, so I think it’s logical to assume that the management systems would view SDN networks through the controller.  In my own SDN model, the bottom layer or “Topologizer” is responsible for sustaining an operational/management view of the SDN infrastructure.  This is consumed by “SDN Central” to create services but could also be consumed on the side by an OSS/BSS/NMS interface.
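A sketch of that relationship might look like the following: the controller consumes device and trunk telemetry, reacts to failures itself, and exposes a single operational view (what I call the “Topologizer”) that both service creation and an OSS/BSS/NMS interface read, rather than having management systems touch devices behind the controller’s back.  The structure and names are mine, purely illustrative.

```python
# Illustrative sketch: the controller consumes device/trunk status and keeps
# the authoritative topology view; management systems read that view rather
# than polling devices behind the controller's back.

class Topologizer:
    """Operational/management view of SDN infrastructure (my own naming)."""
    def __init__(self):
        self.status = {}               # element -> "up" | "down"

    def update(self, element, state):
        self.status[element] = state

    def snapshot(self):
        return dict(self.status)       # consumed by SDN Central or OSS/BSS/NMS

class Controller:
    def __init__(self, topologizer):
        self.topo = topologizer
        self.forwarding_mode = "normal"

    def on_telemetry(self, element, state):
        self.topo.update(element, state)
        if state == "down":
            # The controller, not the NMS, reacts: shift to a failure-mode topology.
            self.forwarding_mode = f"failure-mode (around {element})"

topo = Topologizer()
ctl = Controller(topo)
ctl.on_telemetry("trunk core-1/core-2", "down")
print(ctl.forwarding_mode)     # failure-mode (around trunk core-1/core-2)
print(topo.snapshot())         # same view the OSS/BSS would consume
```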

The boundary between the Topologizer and SDN Central in my model, where services are aligned with resources, is also a useful place to support SDN management connections.  Service management requires that a customer-facing vision be aligned with a resource-facing vision (to use old TMF terms) to get meaningful service status.  So if we take the model I reference in the YouTube video link already provided, you could fulfill operations needs by taking management links from the bottom two layers and pulling them into a contextual processing element that would look a lot like what I’ve portrayed for NFV management: “derived operations.”
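In code terms, “derived operations” amounts to recording which resources each service was bound to when it was set up, and then computing service status as a roll-up over the status of just those resources.  The following is a minimal sketch under my own naming; it is not a TMF or NFV ISG artifact.

```python
# Minimal sketch of "derived operations": service status is synthesized from
# the resources the service was bound to at setup time.  Names are illustrative.

resource_status = {                   # resource-facing view (from the Topologizer)
    "vswitch-7": "up",
    "trunk-12": "degraded",
    "vswitch-9": "up",
}

service_bindings = {                  # customer-facing view -> bound resources
    "vpn-acme-east": ["vswitch-7", "trunk-12"],
    "vpn-acme-west": ["vswitch-9"],
}

def derived_service_status(service):
    """Worst-case roll-up of the bound resources' states."""
    states = [resource_status[r] for r in service_bindings[service]]
    if "down" in states:
        return "down"
    if "degraded" in states:
        return "degraded"
    return "up"

for svc in service_bindings:
    print(svc, "->", derived_service_status(svc))
```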

If we look at overlay SDN we see that the challenge here is the one I just mentioned—aligning the customer- and resource-facing visions.  Overlay SDN simply rides on underlying switching/routing as traffic.  There is a logical overlay-network topology that can be aligned with the physical network by knowing where the logical elements of the overlay are hosted.  However, that doesn’t tell us anything about the network paths.

Logically, overlay SDN should be managed differently (because it probably has to be).  It’s easier if you presume the “real” network is a black box that asserts service connections with associated SLAs.  You manage that black box to meet the SLAs, but you don’t try to associate a specific service failure with a specific network failure; you assume your capacity management or SLA management processes will address everything that can be fixed.
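Managed as a black box, the underlay is judged only against the SLA it asserts: you check delivered metrics against committed ones and hand violations to capacity/SLA management, without trying to pin a given service fault on a given device.  A toy sketch of that check, with invented metric names:

```python
# Toy sketch of black-box SLA management: compare delivered metrics against
# the SLA the underlay asserts; no attempt to correlate to specific devices.

sla = {"latency_ms": 20.0, "loss_pct": 0.1, "availability_pct": 99.9}

def sla_check(measured):
    """Return the SLA terms currently being violated (metric names invented)."""
    violations = []
    if measured["latency_ms"] > sla["latency_ms"]:
        violations.append("latency")
    if measured["loss_pct"] > sla["loss_pct"]:
        violations.append("loss")
    if measured["availability_pct"] < sla["availability_pct"]:
        violations.append("availability")
    return violations

print(sla_check({"latency_ms": 35.0, "loss_pct": 0.05, "availability_pct": 99.95}))
# ['latency']  -> hand off to capacity/SLA management, not fault correlation
```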

If we assumed an SDN connection overlay riding on an OpenFlow, centrally controlled SDN transport network, we could presume we had the tools needed for service-to-network coordination, provided a service model created the overlay network and my SDN “Cloudifier” layer associated it with a “physical SDN” service.  This suggests that even though an overlay SDN network is semi-independent of the underlying network in a topology sense, you may have to make it more dependent by associating the overlay connection topology with the underlying routes, or you’ll lose your ability to do management fault correlation or even effective traffic engineering by moving overlay routes.  Keep this point in mind when we move to our last model of SDN.
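Here is a sketch of what that association buys you: if each overlay tunnel records the underlay path it currently rides, an underlay fault can be mapped back to the overlay connections it impacts, which is exactly the fault correlation (and route-shifting traffic engineering) you lose when the overlay stays fully independent.  All names below are illustrative.

```python
# Illustrative sketch: bind each overlay tunnel to the underlay route it
# currently traverses, so an underlay fault can be correlated back to the
# overlay connections (and the services) it affects.

overlay_to_underlay = {
    # overlay tunnel        underlay hops it currently rides
    "tenantA:web->db":      ["leaf-1", "spine-2", "leaf-4"],
    "tenantB:app->cache":   ["leaf-3", "spine-2", "leaf-5"],
    "tenantB:app->db":      ["leaf-3", "spine-1", "leaf-4"],
}

def impacted_tunnels(failed_element):
    """Which overlay tunnels ride the failed underlay element?"""
    return [t for t, hops in overlay_to_underlay.items() if failed_element in hops]

print(impacted_tunnels("spine-2"))
# ['tenantA:web->db', 'tenantB:app->cache']  -> candidates to reroute or alarm
```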

The “API model” SDN picture is easier in some ways and harder in others.  Here the presumption is that “services” are policy-induced behavior sets applied through a chain of controllers/enforcers down to the device level.  This is in effect a black-box model because the service is essentially a set of policy invocations that then drive lower- and lower-level elements as appropriate.  It’s “distributed central control” in that the policy control is central but decomposition and enforcement are distributed.  When you want to know the state of something, you’d have to plumb the depths of the policy hierarchy, so to speak.
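The “distributed central control” idea can be sketched as a chain of controllers/enforcers, each translating the policy it receives into narrower policies for the level below until device-level actions result.  The levels and policy fields here are hypothetical, and nothing in the sketch reflects any vendor’s actual APIs.

```python
# Hypothetical sketch of policy-model SDN: a service-level policy is decomposed
# level by level (domain controllers, then device enforcers) until it becomes
# device-level actions.  Nothing here reflects any vendor's real API.

def decompose(policy, level):
    """Turn one policy at this level into narrower policies one level down."""
    if level == "service":
        return [dict(policy, scope=d) for d in ("domain-east", "domain-west")]
    if level == "domain":
        return [dict(policy, device=dev) for dev in ("sw-1", "sw-2")]
    return [policy]                       # device level: nothing left to split

def push(policy, levels=("service", "domain", "device")):
    current = [policy]
    for level in levels:
        nxt = []
        for p in current:
            nxt.extend(decompose(p, level))
        current = nxt
    return current                        # device-level enforcement actions

print(len(push({"app": "voice", "treatment": "low-latency"})))   # 4 actions
```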

Presumably, management variables would be introduced into this distributed-policy system at an appropriate, meaning local, level.  Presumably, failures at a given level would create something that rose to the level above so alternatives could be picked, since the “failure” should have been handled with alternative resources at the original level had that been possible.  The problem, obviously, is all this presumption.  We really need a better understanding of how policy distribution, status distribution, and device and service state are related.  Until we do, we can’t say much about management here, but we can assume it would follow the general model of a “topology map,” a “service map,” and an intersection of the two that defines the management targets from which we have to obtain status.
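That escalation presumption can also be sketched: each policy level tries local remediation against the resources it controls, and only when nothing local works does the failure rise to the level above.  The logic below is purely an assumed behavior, since (as noted) how policy, status, and state relate in this model isn’t yet well defined.

```python
# Assumed behavior only: each policy level tries to remediate a failure with
# resources it controls; if it cannot, the failure escalates one level up.

levels = [
    {"name": "device",  "spare_capacity": False},
    {"name": "domain",  "spare_capacity": True},
    {"name": "service", "spare_capacity": True},
]

def handle_failure(event):
    for level in levels:                          # bottom-up through the chain
        if level["spare_capacity"]:
            return f"{event} remediated at {level['name']} level"
        print(f"{event}: no local alternative at {level['name']}, escalating")
    return f"{event} unresolved; service-level alarm raised"

print(handle_failure("port sw-1/3 down"))
# port sw-1/3 down: no local alternative at device, escalating
# port sw-1/3 down remediated at domain level
```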

The common thread here is that all the “SDN” mechanisms (not surprisingly) abstract services away from resources.  So, of course, does traditional switching/routing.  But remember that one of the goals of SDN was to create greater determinism, and that goal could end up being met in name only if we lose the management connections between the services and the resources that “determine” service behavior.  We’ve underplayed SDN management, perhaps even more than we’ve done for NFV management.

NFV management principles could save things, though.  I believe that the principles of “derived operations” that synthesize a service-to-resource management connection by recording the binding made when the abstraction/virtualization of a service is realized in NFV could be applied just as easily to SDN.  The problem, for now at least, is that nobody on either the SDN or NFV side is doing this, and I think that getting this bridge built will be the most important single issue for SDN to address in 2015.

I wish you all a very happy and prosperous New Year!