Operators tell me that they are still struggling with the details of service lifecycle automation, even though they’re broadly convinced it’s a critical step in improving profit per bit. There are a lot of questions, but two seem to be rising to the top of everyone’s list, so exploring both the questions and the possible answers could be valuable for operators, and thus for the industry.
Service lifecycle automation means creating software processes to respond to service and network conditions, rather than relying on a mixture of network operations center reactions and both contemporaneous and delayed responses in operations systems. “Something happens, and the software reacts…completely, across all the impacted systems and processes, dealing both with service restoration and SLA escalation and remediation,” is how one operator put it.
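To make that operator’s statement concrete, here’s a minimal sketch of what “the software reacts…completely” implies; all the names and functions below are hypothetical stand-ins for real orchestration, OSS/BSS, and billing integrations, not anyone’s actual API. The point is that one event drives restoration, SLA escalation, and remediation together, rather than feeding a NOC queue.

```python
# Minimal sketch of event-driven lifecycle handling; every name here is hypothetical.
from dataclasses import dataclass

@dataclass
class ServiceEvent:
    service_id: str
    condition: str        # e.g. "link-down", "sla-violation"
    source: str           # the element that reported the condition

def handle_event(event: ServiceEvent) -> None:
    """React completely: restoration, SLA escalation, and remediation in one flow."""
    restore_service(event.service_id, event.condition)   # network-side action
    escalate_sla(event.service_id, event.condition)      # customer/OSS-BSS side
    record_remediation(event)                            # audit trail for credits/billing

# Stand-ins for real system integrations.
def restore_service(service_id: str, condition: str) -> None: ...
def escalate_sla(service_id: str, condition: str) -> None: ...
def record_remediation(event: ServiceEvent) -> None: ...

handle_event(ServiceEvent("vpn-1234", "link-down", source="edge-router-3"))
```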
That statement raises the first of the operator questions: how do you get service lifecycle management to cover the whole waterfront, so to speak? There’s a lot of concern that we’ve been addressing “automation” in a series of very small and disconnected silos. We look at “new” issues in orchestration and management, like those created by NFV hosting, but we don’t address the broader service/network relationships that still depend on legacy elements. Operators point out that wherever silos exist, and however or why they were created, they bring issues of efficiency and accuracy.
The silo problem has deep roots, unfortunately. As networking evolved from circuit (TDM) to packet (IP), there was an immediate bifurcation of “operations” and “network management”. Most of this can be traced to the fact that traditional OSS/BSS systems weren’t really designed to react to dynamic topology changes, congestion, and other real-time network events. The old saw that OSS/BSS systems need to be “more event-driven” dates from this period.
The separation of operations and management has tended to make OSS/BSS more “BSS-focused”, meaning more focused on billing, order management, and other customer-facing activities. This polarization is accentuated by operators’ focus on “portals”, front-end elements that give customers access to service and billing data. Since you can’t have customers diddling with network resources in a packet world, this portal-centric planning forces a clear delineation of the boundary between business operations and network management.
One way to address this problem is to follow the old TMF model of the NGOSS Contract, which has more recently morphed into the TMF053 NGOSS Technology Neutral Architecture (TNA). With this model, operations systems are implicitly divided into processes that are linked to events via the contract data model. Thus, in theory, you could orchestrate operations processes through event-to-process modeling. The same approach would work for service lifecycle automation, which would provide a unified solution, and you could in theory combine both in a single service model. Operators like a single modeling language but are leery of having one model define both management and OSS/BSS lifecycles.
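Here’s a rough sketch of that event-to-process idea, assuming a simple state/event table carried in the contract data model; the structure and names are my illustration, not anything defined by TMF053. The model says, for each lifecycle state, which operations process a given event should trigger.

```python
# Sketch of event-to-process steering via a contract data model (hypothetical structure).
contract_model = {
    "service": "vpn-1234",
    "state": "active",
    "event_map": {                                  # (state, event) -> operations process
        ("active", "fault"):        "run_fault_handling",
        ("active", "order-change"): "run_change_order",
        ("degraded", "restored"):   "run_close_incident",
    },
}

processes = {
    "run_fault_handling": lambda ctx: print("dispatch fault handling for", ctx["service"]),
    "run_change_order":   lambda ctx: print("dispatch change order for", ctx["service"]),
    "run_close_incident": lambda ctx: print("close incident for", ctx["service"]),
}

def dispatch(model: dict, event: str) -> None:
    """Steer an event to the process the model names for the current state."""
    process_name = model["event_map"].get((model["state"], event))
    if process_name:
        processes[process_name](model)

dispatch(contract_model, "fault")   # -> dispatch fault handling for vpn-1234
```

The orchestration logic itself stays generic; all the service-specific behavior lives in the data model, which is the essence of the NGOSS Contract approach.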
That raises the operators’ second question, which is about events. Lifecycle management is all about event-handling, and a nice data-model-driven approach to steering events to processes doesn’t help if you can’t generate the events or agree on what they mean. This is particularly important with multi-tenant infrastructure, where the relationship between a service and infrastructure conditions may be difficult to establish, and where the cost of correlating the two can be impossible to justify financially.
An “event” is a condition that requires handling, and it’s obvious that there are questions on what conditions should create events and how the events should be coded. Ideally, you’d like to see service lifecycle events standardized at all levels, including how a pair of models—one managing network resources for services and the other operations processes for customers—would use events to synchronize their behavior. Operators have no idea how that’s going to happen.
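Nobody has defined that standard, but a sketch of what a standardized lifecycle event might carry shows why it matters; the field names and condition codes below are purely illustrative assumptions, not an existing specification. Both the resource-facing model and the customer-facing model would have to agree on the same vocabulary for synchronization to work.

```python
# Illustrative lifecycle event format; fields and condition codes are assumptions, not a standard.
from dataclasses import dataclass, field
from datetime import datetime, timezone

STANDARD_CONDITIONS = {"sla-at-risk", "sla-violated", "restored", "capacity-warning"}

@dataclass
class LifecycleEvent:
    service_id: str
    condition: str                     # must come from the shared vocabulary
    emitted_by: str                    # "resource-model" or "operations-model"
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def __post_init__(self):
        if self.condition not in STANDARD_CONDITIONS:
            raise ValueError(f"non-standard condition: {self.condition}")

# The resource-side model raises the event; the OSS/BSS-side model consumes it.
event = LifecycleEvent("vpn-1234", "sla-at-risk", emitted_by="resource-model")
```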
Events are critical in service automation, because they’re the triggers. You can’t automate the handling of something you don’t know about, and “know about” here means having an unambiguous, actionable signal. If something that’s dedicated to, and an explicit part of, a given service breaks, it’s easy to envision that such a signal could be produced, though differences in vendor implementations might require standardization of error conditions. Where shared or virtualized resources fail, it’s another matter.
One problem is that there are a lot of different sources of “events”, and many of them report on different pieces of infrastructure with different status conditions. A server might report one set of events and a fiber trunk another. How do you correlate the two, or even understand what they mean? For example, a server might report an overheat warning. Does that mean anything in terms of capability? If a server reports overheating and a fiber trunk reports overloading, do the two have a common cause?
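There’s no agreed answer, but a naive sketch of what correlation has to do illustrates how much context is needed before “overheat” and “overload” can be declared related. Everything here is invented for illustration: the topology map, the element names, and the time window are all assumptions.

```python
# Naive correlation sketch: two raw events are candidates for a common cause only if they
# occur close together in time and share an upstream element in a (hypothetical) topology map.
from datetime import datetime, timedelta

topology = {                           # which infrastructure elements depend on which
    "server-17":     {"rack-3", "power-feed-B"},
    "fiber-trunk-9": {"conduit-12", "power-feed-B"},
}

def common_cause_candidates(ev_a: dict, ev_b: dict, window: timedelta = timedelta(minutes=5)):
    """Return shared upstream elements if the events are close enough in time."""
    if abs(ev_a["time"] - ev_b["time"]) > window:
        return set()
    return topology.get(ev_a["element"], set()) & topology.get(ev_b["element"], set())

base = datetime.now()
overheat = {"element": "server-17",     "condition": "overheat", "time": base}
overload = {"element": "fiber-trunk-9", "condition": "overload", "time": base + timedelta(minutes=2)}

print(common_cause_candidates(overheat, overload))   # -> {'power-feed-B'}
```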
Another problem is that a condition in infrastructure doesn’t always impact all the services, so you have to understand what the scope of impact is. A fiber failure surely impacts the services that happen to use the fiber, but which services are those? To find out for an IP service, you’d have to understand the addresses involved in the service and the way those addresses were routed. And it’s more than just the destination address. Two different users accessing the same server might find that one is impacted by a fiber failure and the other is not.
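A sketch of the scope-of-impact question makes the dependency plain: you can only answer “which services does this fiber carry?” if something maintains per-service path state. The per-service path records below are assumed to exist, which is exactly the hard part in real IP networks.

```python
# Scope-of-impact sketch; assumes per-service path records that real networks rarely keep.
service_paths = {
    "vpn-1001":      ["edge-a", "fiber-7", "core-1", "edge-b"],
    "vpn-1002":      ["edge-a", "fiber-8", "core-2", "edge-c"],
    "internet-2001": ["edge-d", "fiber-7", "core-1", "peer-x"],
}

def impacted_services(failed_element: str) -> list:
    """List the services whose current path traverses the failed element."""
    return [svc for svc, path in service_paths.items() if failed_element in path]

print(impacted_services("fiber-7"))   # -> ['vpn-1001', 'internet-2001']
```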
“Analytics” is the most problematic source of events. Analytics has a presumptive multiplicity of dimensions that separates it from simple status reporting, and those added dimensions make it harder to say what a given analytics “prediction” or “event” would mean to services. Analytics might say that, last week, these conditions led to this result. We already know that the result might be difficult to associate with the state of a given service, but now we add the question of whether the conditions last week had any specific relevance to that service last week. If not, is there a reason to think we should care now? Do we then do service-specific correlations for analytics?
Event correlation is critical to the notion of service automation, because it’s critical to establishing service conditions. Infrastructure or resource management is a lot easier because the event relationship with infrastructure is clear; that’s where the events are generated. This means it would probably make sense to separate services from infrastructure, so that the services of infrastructure (what I’ve always called the “resource layer”) are presented as services to the service layer. Then you need only determine whether a resource service (a “behavior”, in my terminology) is meeting its SLA, and you can readily generate standard events relating to SLA conformance.
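Here’s a minimal sketch of that separation; the “behavior” abstraction is mine and the thresholds and names are invented. The resource layer exposes a behavior with an SLA, and the only events it raises upward are standard conformance events, never raw infrastructure alarms.

```python
# Sketch of a resource-layer "behavior" that hides infrastructure detail and emits only
# standard SLA-conformance events upward; thresholds and names are illustrative.
class Behavior:
    def __init__(self, name: str, sla: dict):
        self.name = name
        self.sla = sla                         # e.g. {"latency_ms": 30, "availability": 0.9999}

    def evaluate(self, measured: dict) -> list:
        """Compare measurements to the SLA and return standard events, if any."""
        events = []
        if measured.get("latency_ms", 0) > self.sla["latency_ms"]:
            events.append(f"{self.name}:sla-violated:latency")
        if measured.get("availability", 1.0) < self.sla["availability"]:
            events.append(f"{self.name}:sla-violated:availability")
        return events or [f"{self.name}:sla-conformant"]

vpn_transport = Behavior("vpn-transport", {"latency_ms": 30, "availability": 0.9999})
print(vpn_transport.evaluate({"latency_ms": 42, "availability": 0.9999}))
# -> ['vpn-transport:sla-violated:latency']
```

The service layer never sees the overheating server or the overloaded trunk; it sees only whether the behavior it bought is conforming to its SLA.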
This leaves us with what are effectively three layers of modeling and orchestration: the OSS/BSS layer, the network services layer, and the resource/behavior layer. The multiplication makes operators uneasy at one level and comforts them at another. Lots of layers seem to indicate unneeded complexity (likely true), but the layers better reflect current structures and dodge some important near-term issues.
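One way to picture the three layers is as a simple containment hierarchy; the structure below is my illustration, not a standard, and the product and behavior names are invented. Each layer models only the layer beneath it as a set of services, so the complexity is contained rather than eliminated.

```python
# Illustrative three-layer model: OSS/BSS commercial service -> network services -> behaviors.
service_model = {
    "layer": "oss-bss",
    "product": "business-vpn-gold",
    "composed_of": [
        {
            "layer": "network-service",
            "service": "ip-vpn",
            "composed_of": [
                {"layer": "resource-behavior", "behavior": "vpn-transport"},
                {"layer": "resource-behavior", "behavior": "access-tail"},
            ],
        }
    ],
}

def behaviors_of(model: dict) -> list:
    """Walk the hierarchy and list the resource behaviors a commercial service depends on."""
    if model.get("layer") == "resource-behavior":
        return [model["behavior"]]
    return [b for child in model.get("composed_of", []) for b in behaviors_of(child)]

print(behaviors_of(service_model))   # -> ['vpn-transport', 'access-tail']
```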
We don’t have nearly enough dialog on these topics, according to operators. Even those involved in SDN and NFV trials say that their activity is focused at a lower level, rather than addressing the broad operations relationships that we really need to see addressed. I’m hoping that the expanding work of operators like AT&T and Telefonica will at least expose all the issues, and perhaps also offer some preliminary solutions. Eventually, I think that operator initiatives will be the drivers to getting real discussions going here; vendors don’t seem willing to take the lead.