New SLAs and New Management Paradigms for the Software-Defined Era

There is no shortage of things we inherit from legacy telecom.  An increasing number of them are millstones around the neck of transformation, and many of the drags relate to management and SLA practices.  Those who hanker for the stringent days of TDM SLAs should consider going back in time, but remember that 50 Mbps or so of access bandwidth used to cost almost ten thousand dollars a month.  For those with more price sensitivity than that, it’s better to consider more modern approaches to management and SLAs, particularly if you’re looking ahead to transformation.

All management boils down to running infrastructure to meet commitments.  When infrastructure and services were based on time-division multiplexing or fixed and dedicated capacity, the management processes were focused on finding capacity to allocate and then ensuring that your allocation met the specified error-and-availability goals.  Packet networking, which came along in the ‘60s, started the shift to a different model, because packet networks are based on statistical multiplexing of traffic.  Data traffic doesn’t have a 100% duty cycle, so you can fit the peaks of one flow into the valleys of another.  That can multiply effective capacity considerably, but it breaks the notion of “allocation” of resources because nobody really gets a guarantee.  There’s no resource exclusivity.
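
To make the multiplexing arithmetic concrete, here’s a minimal sketch in Python, with invented numbers: twenty bursty flows, each peaking at 10 Mbps with a 30% duty cycle, share a 100 Mbps link that a TDM-style allocation would have sized at 200 Mbps.  Most of the time everyone fits; occasionally they don’t, which is exactly why nobody gets a guarantee.

    import random

    # Twenty bursty flows share one 100 Mbps link.  A dedicated (TDM-style)
    # allocation would need 20 x 10 = 200 Mbps of capacity.
    FLOWS, PEAK_MBPS, DUTY, LINK_MBPS = 20, 10, 0.3, 100
    TRIALS = 100_000

    overloads = 0
    for _ in range(TRIALS):
        # In any instant, each flow is either at its peak rate or idle.
        demand = sum(PEAK_MBPS for _ in range(FLOWS) if random.random() < DUTY)
        if demand > LINK_MBPS:
            overloads += 1

    print(f"Aggregate peak demand: {FLOWS * PEAK_MBPS} Mbps on a {LINK_MBPS} Mbps link")
    print(f"Instantaneous overload probability: {overloads / TRIALS:.4f}")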

A packet SLA has to reflect this by abandoning what you could call “instantaneous state”, the notion that at every moment there’s a guarantee, in favor of “average conditions”.  Over a period of time (a day, a week, a month), you can expect that effective capacity planning will deliver to a given packet user a dependable level of performance and availability.  At any given moment, it may (and probably will) not.
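
A hypothetical month of daily packet-loss samples shows the difference.  The numbers below are invented; the point is that eight individual days would breach an “instantaneous” 0.5% target, yet the monthly average, which is what the packet SLA actually commits to, is met.

    # A packet SLA commits to averages over a period, not to every instant.
    # Hypothetical daily packet-loss samples over a 32-day "month", against
    # an SLA target of 0.5% average loss.
    daily_loss_pct = [0.1, 0.2, 1.4, 0.1, 0.3, 0.2, 0.9, 0.1] * 4  # 32 "days"
    SLA_AVG_LOSS_PCT = 0.5

    monthly_avg = sum(daily_loss_pct) / len(daily_loss_pct)
    bad_days = sum(1 for d in daily_loss_pct if d > SLA_AVG_LOSS_PCT)

    print(f"Days exceeding the target in the moment: {bad_days}")
    print(f"Monthly average: {monthly_avg:.2f}% -> "
          f"{'meets' if monthly_avg <= SLA_AVG_LOSS_PCT else 'violates'} the SLA")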

TDM-style SLAs have to be based on measurement of current conditions because it’s those conditions being guaranteed.  Packet SLAs have to be based on analysis of network conditions and traffic trends versus the capacity plan.  It’s more about analytics than about measurement, strictly speaking, because measurements are simply an input into the analysis aimed at answering the Great Question of Packet SLAs: “Will the service perform, on the average and over the committed period, to the guarantees?”  Remediation isn’t about “fixing the problem” at the SLA level, as much as bringing the trend back into line.
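
What that analysis might look like, as a sketch with illustrative numbers: fit a trend to utilization samples, project it over the remainder of the committed period, and remediate if the projection crosses the capacity plan’s ceiling.

    # Analytics rather than measurement: fit a trend to utilization samples
    # and ask whether the link stays inside the capacity plan for the rest
    # of the commitment period.  All numbers are illustrative.
    weekly_util = [0.52, 0.55, 0.57, 0.61]   # utilization, weeks 1-4
    PLAN_CEILING = 0.70                       # capacity-plan design limit
    WEEKS_REMAINING = 8

    # Least-squares slope for y = a*x + b over x = 0..n-1.
    n = len(weekly_util)
    xs = range(n)
    x_mean, y_mean = (n - 1) / 2, sum(weekly_util) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, weekly_util)) \
            / sum((x - x_mean) ** 2 for x in xs)
    projected = weekly_util[-1] + slope * WEEKS_REMAINING

    print(f"Trend: +{slope:.3f} utilization/week, "
          f"projected {projected:.2f} vs plan ceiling {PLAN_CEILING}")
    if projected > PLAN_CEILING:
        print("Remediate now: bring the trend back into line before the SLA breaks.")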

Another management issue that has evolved with packet network maturity is that of “stateful versus stateless core” behavior.  Packet protocols have been offered in both “connectionless” and “connection-oriented” modes.  Connection-oriented packet networks, including frame relay and ATM, offer SLA behavior that falls somewhere between TDM and connectionless IP.  When a “connection” is made in a packet network, the network reserves resources along the path.

The problem is that if a core network element breaks, that breaks all the connections associated with it, resulting in a tear-down backward toward the endpoints and restoration of the connections using a different route.  In the core, there could be tens of thousands of such connections.  Connectionless protocols don’t reserve resources that way, and there’s no stateful behavior in the core.  Arguably, one reason why IP has dominated is that a core fault creates a fairly awesome flood of related management issues and breaks a lot of SLAs, some because nodes are overloaded with connection tear-down and setup.
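
A back-of-the-envelope comparison, with purely illustrative numbers, shows the scale of that flood:

    # Why a stateful core fails badly: a toy count of the signaling work
    # one core-node failure triggers.  Illustrative numbers only.
    CONNECTIONS_THROUGH_NODE = 50_000
    HOPS_PER_PATH = 6  # nodes holding per-connection state on each path

    # Connection-oriented: every affected connection is torn down hop by
    # hop back toward the endpoints, then re-signaled over a new route.
    teardown_msgs = CONNECTIONS_THROUGH_NODE * HOPS_PER_PATH
    setup_msgs = CONNECTIONS_THROUGH_NODE * HOPS_PER_PATH
    print(f"Stateful core: ~{teardown_msgs + setup_msgs:,} signaling events")

    # Connectionless: no per-flow state in the core; the routing protocol
    # converges once and packets simply follow the new paths.
    print("Stateless core: one routing-topology update, zero per-flow signaling")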

We’ve had packet SLAs for ages; nobody writes TDM SLAs for IP or Ethernet networks.  Yet we seem stuck on TDM notions like “five nines”, a concept that at the packet service level is hardly relevant because it’s hardly likely to be achieved unless you define what those nines apply to rather generously.  We’ve learned from the Internet to write applications that tolerate variable rates of packet loss and a range of latencies.  It was that tolerance that let operators pass on connection-oriented packet protocols so as to avoid the issues of stateful core networks, issues that could have included a flood-of-problems-induced collapse of parts of the network and a worse and more widespread SLA violation.
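
It helps to put numbers on the nines.  The little conversion below shows why “five nines” is such a demanding target: it allows about five minutes of downtime a year, where two nines allows more than three days.

    # What "nines" buy you: permitted downtime per year at each
    # availability level.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for nines in range(2, 6):
        availability = 1 - 10 ** -nines
        downtime_min = MINUTES_PER_YEAR * (1 - availability)
        print(f"{nines} nines ({availability:.5f}): "
              f"{downtime_min:,.1f} minutes/year of allowed downtime")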

We now have to think about managing networks evolving to central management (SDN) of traffic, and hosted device instances that replace physical devices.  There, it’s particularly dangerous to apply the older TDM concepts, because service management almost has to be presented to the service buyer in familiar device-centric terms, and many faults and conditions in evolved networks won’t relate to those devices at all.  We need, at this point, to decide to break the remaining bonds between service management and SLAs on the one hand, and the explicit state of the specific resources underneath the services on the other.

In the best of all possible worlds, if you want management to be as easy, and service operations costs as low, as possible, you’d build infrastructure according to a capacity plan, exercise basic admission control to keep things running within the design parameters, and as long as you were meeting your capacity plan goals, nobody would see an SLA violation at all.  That happy situation would be highly desirable in transformed infrastructure, because it’s far easier than trying to link specific services, specific resources, and specific conditions to form an SLA.  As I pointed out yesterday, though, there are issues.
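
A minimal sketch of that admission-control idea, assuming a hypothetical capacity plan that says SLAs hold as long as committed load stays under 70% of raw capacity:

    # Accept a service only while the capacity plan's design load is
    # respected, so accepted users never see an SLA violation.  Names and
    # thresholds are hypothetical.
    class AdmissionController:
        def __init__(self, link_mbps: float, design_utilization: float = 0.7):
            # The plan says SLAs hold while average committed load stays
            # below design_utilization of raw capacity.
            self.budget = link_mbps * design_utilization
            self.committed = 0.0

        def admit(self, avg_demand_mbps: float) -> bool:
            if self.committed + avg_demand_mbps > self.budget:
                return False          # refuse rather than degrade everyone
            self.committed += avg_demand_mbps
            return True

    ac = AdmissionController(link_mbps=1000)
    print([ac.admit(d) for d in (200, 300, 150, 100)])  # the last is refused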

Adaptive IP networks do what their name suggests, which is adapt.  When you have a network that’s centrally traffic managed like SDN, or you have resources in NFV that have to be deployed and scaled on demand, you have a resource commitment to make.  Where real resources are committed—whether they’re SDN routes or NFV hosting points—you have a consumable that’s getting explicitly consumed.  You can’t have multiple activities grabbing the same thing at the same time or things are going to get ugly.  That forces serialization of the requests for this sort of change in infrastructure state, which creates single points of processing that can become bottlenecks in responding to network conditions or customer requests.  These same points can be single points of failure.
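
Here’s a toy illustration of that serialization, with all the names hypothetical: concurrent requests for the same hosting point funnel through a single commitment worker, which is exactly the bottleneck and single point of failure just described.

    import queue
    import threading

    # Two requests must not grab the same hosting point or route at once,
    # so state changes funnel through one commitment worker.
    commit_queue: "queue.Queue[tuple[str, str]]" = queue.Queue()
    allocations: dict[str, str] = {}   # resource -> owning service

    def commitment_worker():
        while True:
            service, resource = commit_queue.get()
            if resource not in allocations:      # safe: only this thread writes
                allocations[resource] = service
            commit_queue.task_done()

    threading.Thread(target=commitment_worker, daemon=True).start()

    # Many concurrent requesters, one serialized decision point.
    for svc in ("vpn-a", "vpn-b", "vpn-c"):
        commit_queue.put((svc, "host-42"))
    commit_queue.join()
    print(allocations)   # exactly one service wins host-42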

In the long run, we need to work out a good strategy for making more of the SDN and NFV control processes scalable and resilient.  For now, we can try to narrow the scope of control for a given OpenStack instance or SDN controller, and “federate” them through a modeling structure that divides up the work to ensure things operate at the pace needed.  As SDN and NFV mature, we’re likely to need to rethink how we build controllers and OpenStack instances, so that they are themselves built from components that adhere to cloud principles.
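
One possible shape for that federation, purely as a sketch: a model layer that partitions scope and dispatches each request to the controller instance that owns that slice of the infrastructure.

    import zlib

    # Hypothetical federation layer: each resource domain maps to the
    # controller (SDN or OpenStack) owning that slice of the network.  A
    # real model would carve up topology, not hash domain names.
    CONTROLLERS = ["ctl-east", "ctl-west", "ctl-core"]

    def controller_for(resource_domain: str) -> str:
        # A stable hash keeps the mapping consistent across runs.
        return CONTROLLERS[zlib.crc32(resource_domain.encode()) % len(CONTROLLERS)]

    for domain in ("metro-nyc", "metro-sf", "core-backbone"):
        print(domain, "->", controller_for(domain))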

If you’d tried to sell a two-nines service to businesses thirty years ago, you’d have tanked.  Today, almost all large companies rely heavily on services with about that level of quality.  We had a packet revolution.  Now we’re proposing a software-centric revolution, and it’s time we recognized that constraining services to the standards of even the recent past (much less falling back to “five nines”) is no more likely to be a good strategy now than it was at the time of the TDM/packet transition.  This time, the incentive to change may well be improved operations efficiency, and given that process opex is approaching 30 cents of every revenue dollar, that should be incentive enough.