Applying the Lessons of OSS/BSS Transformation

Talking with operators about OSS/BSS is interesting, and I’ve blogged about operator views on the topic before.  Those conversations, combined with my own long exposure to the space and some recent deep-dive discussions, have given me a perspective of my own.  This is my view of where we are in OSS/BSS modernization, and what might make things better.

You have to start any OSS/BSS discussion with a bit of history, because that’s what’s guided the evolution.  In the very early days, the ‘70s and early ‘80s, operations systems were primarily mechanisms for managing what operators called “craft” activity, the manual steps associated with service setup and changes.  These were the days when operators sold constant-bit-rate or “time-division multiplexed” (TDM) trunks, stuff like 56 kbps Dataphone Digital Service (DDS), 1.544 Mbps “T1”, or 45 Mbps T3.

When packet services emerged, starting with frame relay and (a little bit of) ATM and then moving to IP, the effect was to shift services from per-customer bandwidth provisioning to multiplexed, shared-network provisioning.  You still had to provision access connections (as you do today), but the real service was created from the management of shared infrastructure.  In nearly all cases, this infrastructure was capacity-planned to support service goals overall, and perhaps some admission control was added to keep customers within the design parameters of the network.

Politics entered the picture here.  The original operations systems were essentially “clerical” applications, and services were now direct offshoots of network behavior.  The clerical side of things was managed by the CIO, who ran the business-supporting applications in the operator vertical just as CIOs did in other industries.  The network side of things was managed by network operations, with technology enhancements managed by the Chief Technology Officer (CTO) and staff.  The result was a kind of separation between operations and the network, and the OSS/BSS tended to be directed more at “services” than at “networks”.

It seems to me that this separation is fundamental to the issues of service lifecycle automation today.  A “service” has a service-level agreement (SLA), and it draws on the network to meet the SLA, but it almost never directly controls the network.  Furthermore, the network almost never relies on service faults as signals for remediation; it has its own network operations processes that are triggered by faults, either failures or congestion/performance issues.

What this has done, without much fanfare, is separate “lifecycle automation” into two areas, one focusing on the service relationship with the customer and one on the network’s operations.  That has some profound implications.

The most obvious problem this separation creates is that a single strategy for lifecycle automation is going to run up against the barrier between service and network.  An OSS/BSS lifecycle automation approach is going to be limited in its ability to manage network problems, because network problems are in the network operations domain, a separate organization and a separate set of tools.  Network operations automation is likewise going to be limited in its ability to integrate service-related issues, including fundamental things like SLA violations.

There are two possible paths that might be followed.  First, you could declare one of the systems the “master” and mandate that it manage all service and network lifecycle issues.  Second, you could create a boundary function between the service and network areas, and have both sides build to meet in the middle.  Both of these approaches would require a high-level mandate for the CIO and network operations people to play nice.  A number of CEO-staff-level teams were in fact formed about ten years ago, as part of transformation efforts, to provide exactly that, but the initiatives have almost universally failed to drive cross-organizational consensus and cooperation.

The most promising approach seems to be the second, the “meet in the middle” model, but even that approach has variations.  One is to presume that the two sides will remain totally independent, creating what are in technical terms two finite-state machines linked by an agreed set of state/event assumptions.  The other is to presume a single “machine” that defines all service and network events and states, and invokes network and service processes as needed.
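To make the second variation concrete, here is a minimal sketch (in Go, with every state, event, and process name invented for illustration) of a single “machine”: one state/event table that spans both domains and dispatches to whichever service or network process the transition calls for.

```go
package main

import "fmt"

// In the single-machine variant, one state/event table spans both domains
// and dispatches to whichever process (service or network) the transition needs.
type State string
type Event string

const (
	Ordered    State = "ORDERED"
	Activating State = "ACTIVATING"
	Active     State = "ACTIVE"
	Degraded   State = "DEGRADED"

	OrderAccepted   Event = "ORDER_ACCEPTED"   // originates on the service side
	ResourcesReady  Event = "RESOURCES_READY"  // originates on the network side
	FaultUnresolved Event = "FAULT_UNRESOLVED" // originates on the network side
)

// A handler stands in for any operations process, service or network,
// keyed by the (state, event) pair that should invoke it.
type handler func() State

var transitions = map[State]map[Event]handler{
	Ordered: {
		OrderAccepted: func() State { fmt.Println("invoke network provisioning"); return Activating },
	},
	Activating: {
		ResourcesReady: func() State { fmt.Println("invoke service activation and billing"); return Active },
	},
	Active: {
		FaultUnresolved: func() State { fmt.Println("invoke SLA handling"); return Degraded },
	},
}

// dispatch applies an event to the current state, ignoring undefined transitions.
func dispatch(s State, e Event) State {
	if h, ok := transitions[s][e]; ok {
		return h()
	}
	return s
}

func main() {
	state := Ordered
	for _, e := range []Event{OrderAccepted, ResourcesReady, FaultUnresolved} {
		state = dispatch(state, e)
		fmt.Println("now in state:", state)
	}
}
```

The point isn’t the table itself; it’s that a single authority owns all the states and events for both domains.  The two-machine variant, sketched below, splits that authority.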

The “two-different-worlds-colliding” approach is the most politically attractive, and also the one that best fits if you presume that vendors’ offerings will have to be the driver behind any implementation of lifecycle automation.  A service automation model and a network automation model can run asynchronously, communicating an event from one model to the other if something happens that crosses the wall between service and network.  A service change is an example of a service-to-network event, and a fault that could not be remediated by the network’s own operations processes is an example of a network-to-service event.
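Here is an equally minimal sketch of that two-machine variant, again in Go and again with invented names: each side runs its own machine and keeps its own states private, and only the agreed boundary events (a service change going down, an unremediated fault coming up) cross the wall.

```go
package main

import (
	"fmt"
	"time"
)

// BoundaryEvent is the agreed vocabulary that crosses the service/network wall.
// Everything else stays private to its own domain.
type BoundaryEvent string

const (
	ServiceChange     BoundaryEvent = "SERVICE_CHANGE"     // service -> network
	UnremediatedFault BoundaryEvent = "UNREMEDIATED_FAULT" // network -> service
)

// serviceMachine reacts only to customer actions and to faults the network
// could not fix on its own.
func serviceMachine(toNetwork chan<- BoundaryEvent, fromNetwork <-chan BoundaryEvent) {
	state := "ACTIVE"
	// A customer-initiated change is pushed down as a boundary event.
	toNetwork <- ServiceChange
	for ev := range fromNetwork {
		if ev == UnremediatedFault {
			state = "SLA_AT_RISK" // service-side remediation or credit decision happens here
			fmt.Println("service machine state:", state)
		}
	}
}

// networkMachine runs its own fault handling; only unresolved problems escalate.
func networkMachine(toService chan<- BoundaryEvent, fromService <-chan BoundaryEvent) {
	for ev := range fromService {
		if ev == ServiceChange {
			fmt.Println("network machine: provisioning the change")
			// Simulate a fault that network operations could not remediate.
			toService <- UnremediatedFault
		}
	}
}

func main() {
	down := make(chan BoundaryEvent, 1) // service -> network
	up := make(chan BoundaryEvent, 1)   // network -> service
	go networkMachine(up, down)
	go serviceMachine(down, up)
	time.Sleep(100 * time.Millisecond) // let the toy exchange play out
}
```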

The notion of an “intent model” is a useful one in this situation, because an intent model exposes properties rather than implementations, and is thus abstract enough to represent the service/network boundary.  The only issues are supporting a two-directional event flow and agreeing on which properties get exposed and which events get generated; most intent-model interfaces today don’t exchange events at all, but rather expose a simple API or API set.
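As a sketch of what an event-capable intent-model boundary might look like (purely illustrative; these types aren’t drawn from any standard or product), the interface below exposes properties and an intent-setting call, plus a subscription hook so events can flow back across the boundary rather than being polled for.

```go
package intentmodel

// Properties is what the model exposes: goals and status, never implementation.
type Properties struct {
	BandwidthMbps int
	LatencyMsMax  int
	Availability  float64 // e.g. 0.9999
	SLACompliant  bool
}

// Event is anything the model owner wants to push across the boundary,
// such as an SLA violation it could not remediate internally.
type Event struct {
	Kind   string // e.g. "SLA_VIOLATION", "CHANGE_COMPLETE"
	Detail string
}

// IntentModel is the boundary contract. Note that it is two-directional:
// the consumer can set intent and read properties, and it can also
// subscribe to events instead of relying on a simple request/response API.
type IntentModel interface {
	Properties() Properties        // read the exposed properties
	SetIntent(p Properties) error  // request a change in intent
	Subscribe(handler func(Event)) // receive boundary events asynchronously
}
```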

I always liked the common model approach, but recognizing the need for a boundary point between services and networks, I divided things (in my ExperiaSphere model) into a service and resource model domain pair.  Network “behaviors” were composed into services, and the network was responsible for behavior SLAs, while the service layer handled customer SLAs.  If the network couldn’t deliver on the behaviors’ SLAs, a reported fault would give the service layer a chance to either accept the fault (and perhaps issue a credit to the customer) or attempt to recompose the service using behaviors that still offered a suitable (or better-than-the-failure-mode) SLA.
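The decision logic that last step implies looks something like the following sketch.  It’s the shape of the idea, not ExperiaSphere code, and the types and fields are invented: when a behavior reports it can no longer meet its SLA, the service layer looks for alternative behaviors that still satisfy the customer SLA, and if none exist it accepts the fault and flags a credit.

```go
package servicelayer

// Behavior is a network-exposed building block with its own SLA commitment.
type Behavior struct {
	Name         string
	LatencyMsMax int  // the SLA the network commits to for this behavior
	Healthy      bool // false once the network reports an unremediated fault
}

// Outcome describes how the service layer handled a behavior fault.
type Outcome struct {
	Recomposed bool
	CreditDue  bool
	Chosen     *Behavior
}

// HandleBehaviorFault tries to recompose the service from alternative behaviors
// that still meet the customer SLA; if none qualify, it accepts the fault and
// flags a customer credit.
func HandleBehaviorFault(customerLatencyMsMax int, alternatives []Behavior) Outcome {
	for i := range alternatives {
		alt := &alternatives[i]
		if alt.Healthy && alt.LatencyMsMax <= customerLatencyMsMax {
			return Outcome{Recomposed: true, Chosen: alt}
		}
	}
	// No alternative meets the customer SLA: accept the fault, credit the customer.
	return Outcome{CreditDue: true}
}
```

A fuller version would also weigh “better-than-the-failure-mode” options, picking the best available behavior even when it falls short of the customer SLA.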

This could be done with two separate models, of course.  However it’s done, I think it’s the logical way to bridge the service/network gap.  That makes it a great place for operators to focus their efforts, and it would also be a good place for the IT and cloud vendors who want to take control of next-gen operator infrastructure to spend some effort, and money.

Any lifecycle automation approach that’s comprehensive, and that unites the service and network domains, would be better than what we have now.  I think we could rely on vendors to address the scope of lifecycle automation within the service and network domains, as long as the two were kept separate.  What’s more difficult is the piece that unites them.

The task of integrating operations elements, within the service or network zones or across the boundary, has been largely focused on APIs.  That’s a problem because the truth is that most API work has ended up encouraging a tight coupling between the two elements—what programmers call a synchronous relationship.  In cloud-native behavior, synchronicity is a no-no because it ties up processing resources waiting for things to happen.  Event-triggered approaches are asynchronous by nature.  What we need to see now is an event-centric vision of operations processes.  I’d like to see that overall, but for many vendors it may be too much of a transformation to complete quickly.  I’d settle for seeing the service/resource linkage done via events, and seeing operators put pressure on vendors to define the boundary relationship that way.
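To illustrate the difference, here’s a toy contrast in Go (names invented): the synchronous version ties up the calling process until the network side answers, while the event-style version posts the request and frees the caller, with the result handled whenever it arrives.

```go
package main

import (
	"fmt"
	"time"
)

// Synchronous style: the OSS/BSS thread blocks until the network answers.
func provisionSync(request string) string {
	time.Sleep(50 * time.Millisecond) // the caller is stuck waiting the whole time
	return "done: " + request
}

// Event-driven style: post the request and hand back a channel; the caller
// is free immediately, and a handler picks up the result when it arrives.
func provisionAsync(request string) <-chan string {
	result := make(chan string, 1)
	go func() {
		time.Sleep(50 * time.Millisecond) // work happens off the caller's thread
		result <- "done: " + request
	}()
	return result
}

func main() {
	fmt.Println(provisionSync("vpn-site-add")) // caller blocked here

	done := provisionAsync("vpn-site-add") // caller continues immediately
	fmt.Println("caller free to do other work")
	fmt.Println(<-done) // the "event" is consumed when convenient
}
```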