There are, as you’re all aware at this point, a lot of open questions in the SDN and NFV world. Recently I covered one of them, the operations impact of both technologies. While that’s undoubtedly the most significant open issue, it’s not the only one. Another is the broad question of how networks and network operators are bound together. Protocols have been layered for longer than most network engineers have been alive, and operators have interconnected their networks for even longer. We have to be able to understand how protocol layers and internetworking work in any successor network architecture because we’re not going to eliminate either one.
Layered protocols presume that a given layer (Level 2 or 3 in a practical sense) consumes the services of the layer below, and those services abstract the protocol stack at all lower layers. If we look at how this process works today, we find an important point. A protocol layer either does what its service says it will do, or it reports a failure. Implicitly that means that a protocol layer will attempt to recover from a problem in its own domain and will report a failure when it cannot.
In the good old days of data link protocols like HDLC, this meant that if a data link problem occurred the link protocol would ask for a retransmission and keep that up until it either got a successful packet transfer or hit a limit on the number of attempts. For SDN at any given layer we’d have to assume that this practice was followed, meaning that it’s the responsibility of the lower layers to do their own thing and report a failure upward only when it can’t be fixed. That’s logical because we typically aggregate traffic as we go downward, and we’d not want to have a bunch of high-level recovery processes trying to fix a problem that really exists below.
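That retry-then-escalate discipline can be captured in a few lines. The sketch below is purely illustrative (the class and names are mine, not from any standard): each layer retries within its own domain up to a limit, and only when local recovery is exhausted does it report the fault to the layer above.

```python
# Minimal sketch of the layered recovery rule: retry locally, escalate
# only on exhaustion. MAX_RETRIES and the Layer class are illustrative.

MAX_RETRIES = 3

class Layer:
    def __init__(self, name, upper=None):
        self.name = name
        self.upper = upper  # layer to notify on unrecoverable failure

    def recover_locally(self, fault):
        """Attempt repair in this layer's own domain (stubbed here)."""
        return False  # pretend this layer has no spare capacity

    def handle_fault(self, fault):
        for attempt in range(MAX_RETRIES):
            if self.recover_locally(fault):
                print(f"{self.name}: recovered on attempt {attempt + 1}")
                return True
        # Local recovery exhausted: report upward, HDLC-style
        print(f"{self.name}: unrecoverable, reporting to upper layer")
        if self.upper is not None:
            return self.upper.handle_fault(fault)
        return False

l3 = Layer("L3")
l2 = Layer("L2", upper=l3)
optical = Layer("optical", upper=l2)
optical.handle_fault("fiber cut")
```

The key property is that the escalation path mirrors the layering: a fault only climbs the stack when the layer that owns it has given up.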
This could conflict with some highly integrated models of SDN control, though. If an SDN controller is managing a whole stack of layers, then the question is whether it recognizes the natural layer relationships that we’ve always built protocols on. Let’s look at an example. Suppose we have a three-layer SDN stack, optical, L2, and L3. We have an optical failure, which obviously faults all of the L2 paths over it, and all the L3 paths over those L2 paths. What should happen is that the optical layer recovers if possible, so if there’s a spare lambda or whatever that can be pressed into service, we’d like to see the traffic routed over it at the optical level. Since the low-level path that the higher layers expect has been fixed, there are no faults above (assuming reasonable responsiveness).
But will the controller, which has built all the forwarding rules at all the levels, sustain all the presumptions of the old OSI stack? If not, then it’s possible that the optical fault would be actioned by other layers, even multiple layers. That’s not necessarily a crisis, but it’s harder to come up with failure modes for networks if you presume there’s no formal layered structure. Where SDN controllers reach across layers, we should require that the layer-to-layer relationships be modeled as before. Otherwise we have to rethink a lot of stuff in fault handling that we’ve taken for granted since the mid-70s.
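One way a cross-layer controller could honor those old layer relationships is to try recovery bottom-up, stopping at the first layer that can absorb the fault. The sketch below is hypothetical (no real SDN controller API is being quoted): a fault that the optical layer can reroute never becomes visible to L2 or L3 at all.

```python
# Illustrative sketch of bottom-up fault handling in a multi-layer SDN
# controller. LayerDomain and its fields are invented for this example.

class LayerDomain:
    def __init__(self, name, spare_capacity):
        self.name = name
        self.spare_capacity = spare_capacity  # e.g. spare lambdas at optical

    def try_reroute(self, fault):
        if self.spare_capacity > 0:
            self.spare_capacity -= 1
            return True
        return False

def handle_stack_fault(layers, fault):
    """layers is ordered bottom-up, e.g. optical, then L2, then L3."""
    for layer in layers:
        if layer.try_reroute(fault):
            # The lowest capable layer absorbs the fault; higher-layer
            # paths are restored implicitly and see no fault at all.
            return layer.name
    return None  # no layer could recover; the service itself has failed

stack = [LayerDomain("optical", spare_capacity=1),
         LayerDomain("L2", spare_capacity=2),
         LayerDomain("L3", spare_capacity=4)]
print(handle_stack_fault(stack, "fiber cut"))  # the optical layer absorbs it
```

The point is not the code but the invariant it enforces: even when one controller builds the forwarding rules at every level, recovery responsibility still flows upward only on failure, exactly as the OSI-era stack assumed.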
If layers are the big issue for SDN, then the big issue for NFV is the relationships between providers. Telecom evolved within national boundaries, and even today there is no operator that can deliver its own connections over its own infrastructure everywhere in the world. We routinely interconnect among operators to deliver networking on a large geographic scale, and when we add in cloud computing and NFV feature hosting, we add the complication of perhaps sharing resources beyond transport/connection.
So suppose we have a simple service, a VPN that includes five sites in the US and five in Europe. Even presuming that every site on each continent can be connected by a single operator, we have a minimum of two operators that have to be interconnected. We also have to ask whether the “VPN” part of the service is provided by one operator with sites from the other backhauled to the VPN, or whether we have two interconnected VPNs. All of this would have to be accommodated in any automated service provisioning.
Now we add in a firewall and NAT element at all 10 sites. Do we host these proximate to the access points, in which case we have two different NFV hosting frameworks? Do we host them inside the “VPN operator” if we’ve decided there’s only one VPN and two access/backhaul providers? Does a provider who offers the hosting also offer the VNFs for the two NFV-supported elements, or does one provider “sell” the VNFs into hosting resources provided by the other? All of this complicates the question of deployment, and if this sort of cross-provider relationship is routine, then it’s hard to claim we understand NFV requirements if we don’t address it (which, at present, we do not).
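The questions in that paragraph are really questions about a service model. A hedged sketch of how the cross-provider decomposition might look as data follows; the structure and every field name here are hypothetical, chosen only to show that “who hosts which VNF” becomes an explicit property of the model rather than an implicit deployment decision.

```python
# Hypothetical cross-provider service model for the 10-site VPN example.
# Operator names, field names, and structure are invented for illustration.

service = {
    "name": "transatlantic-vpn",
    "retail_operator": "OperatorA",  # the operator who sold the service
    "components": [
        {"type": "vpn", "provider": "OperatorA", "sites": 5,  # US sites
         "vnfs": [{"function": "firewall", "hosted_by": "OperatorA"},
                  {"function": "nat", "hosted_by": "OperatorA"}]},
        {"type": "access-backhaul", "provider": "OperatorB", "sites": 5,  # EU
         # Alternative: OperatorA could "sell" these VNFs into hosting
         # resources provided by OperatorB; the model makes that visible.
         "vnfs": [{"function": "firewall", "hosted_by": "OperatorB"},
                  {"function": "nat", "hosted_by": "OperatorB"}]},
    ],
}

def hosting_domains(svc):
    """How many distinct NFV hosting frameworks does this service span?"""
    return {vnf["hosted_by"]
            for component in svc["components"]
            for vnf in component["vnfs"]}

print(hosting_domains(service))  # two domains, so two frameworks to coordinate
```

Once the decomposition is explicit like this, automated provisioning at least has something concrete to act on, whichever answer the operators choose.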
But this isn’t the only issue. What happens now if there’s a problem with the “VPN service?” The operator who sold the service doesn’t own all of the assets used to fulfill it. How does that operator respond to a problem when they can’t see those assets? And would another operator really open up visibility into its network or cloud so they could be seen?
There is one common element here, which is that the service of a network has to be viewed as a black-box product exported to its user, and that product must include an SLA and a means of reporting against it. The first point, in my view, argues against a lot of vertical integration of an SDN protocol stack, and the second says that the user of a service manages the SLA. The owner of the resources manages the resources, which are what’s inside the black box.
Making this work gets back to the notion of formal abstractions. A “service” has to be composed by manipulating models that represent service/feature abstractions. Each model has to define how it’s created (deployment) and also how it’s managed. This approach is explicit in TOSCA, for example, which is why I like it as a starting point for management/orchestration, but you can do the same thing in virtually any descriptive model, including ordinary XML. If we take this approach, then layers of protocol can be organized as successive (vertically aligned) black boxes and inter-provider connections represented simply as horizontal structures. The “function” of access or the “function” of transport is independent of its realization at the consumer level, so we don’t have to care about who produces it.
I think we’ve missed this notion in both SDN and NFV because we’ve started at the bottom, and that sort of thing encourages too much vertical integration because we’re trying to build upward from details to principles. While the specific problems SDN and NFV face because of this issue differ, the solution would be the same—it’s all in the modeling.