The Management Side of Resource Abstraction

One interesting question a client of mine raised recently is the impact of infrastructure abstraction on management tools and practices.  What goes on inside a black box, meaning an abstract intent model of a feature or application component, is opaque.  The more that you put inside, the more opacity you generate.  How do you manage something like this?

Abstraction, as a piece of virtualization, allows a “client-side” user of something to see a consistent interface and set of properties, while that view is mapped to a variety of implementations on the “resource side”.  The client-side view can (and, in many or even most cases, should) include an abstract management interface as well as the normal “data-plane” interface(s).  Management, then, can be an explicit property of abstraction of infrastructure.

That doesn’t mean that management is unaffected.  To start with, when you have an abstraction based on an intent model, you are managing the abstraction.  That means that your management practices are now disconnected from the real devices/resources and connected instead to the abstract behaviors that you’ve generated.  That changes the way that real resource management works too.

You can visualize an intent-modeled service or application as an organization chart.  You have a top-level block that represents the service/application overall.  That’s divided into major component blocks, and each of those into minor blocks.  You continue this inverted tree downward until each branch reaches an actual implementation, meaning that the intent model at that level encloses actual resource control and not another layer of modeling.

From the perspective of a service user, this can be a very nice thing.  Services/applications are made up of things that are generally related to functionality, which is how a user would tend to see them.  If we assume that each level of our structure consists of objects that are designed to meet their own SLA through self-healing or to report a fault up the line, this hierarchy serves as the basis both for automatic remediation at each level and for fault reporting (when remediation isn’t possible) that identifies what’s broken.  Users know about a functional breach when the included functions can’t self-heal, and they know what that breach means because they recognize the nature of the model that failed.
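To make the structure concrete, here’s a minimal sketch of that organization-chart hierarchy and its “self-heal or report upward” behavior.  The class and method names (IntentNode, report_fault, and so on) are purely illustrative, not drawn from any real modeling standard.

```python
# A minimal sketch of an intent-model hierarchy: each node either meets
# its own SLA through self-healing or reports the fault up the line.
# All names here are invented for illustration.

class IntentNode:
    def __init__(self, name, children=None, can_self_heal=False):
        self.name = name
        self.children = children or []   # sub-models; empty at a resource-control leaf
        self.can_self_heal = can_self_heal
        self.parent = None
        for child in self.children:
            child.parent = self

    def self_heal(self):
        """Attempt local remediation; a real model would re-provision here."""
        return self.can_self_heal

    def report_fault(self):
        """Handle an SLA violation: heal within the model or escalate upward."""
        if self.self_heal():
            print(f"{self.name}: remediated internally, SLA restored")
        elif self.parent is not None:
            print(f"{self.name}: cannot remediate, reporting up the line")
            self.parent.report_fault()
        else:
            print(f"{self.name}: top-level SLA violation, notify the service user")

# Build the inverted tree: service at the top, resource control at the leaves.
leaf = IntentNode("access-connection")                 # encloses real resource control
region = IntentNode("metro-aggregation", [leaf], can_self_heal=True)
service = IntentNode("vpn-service", [region])

leaf.report_fault()   # escalates once; metro-aggregation self-heals
```

The key design point is that a node knows nothing about what’s inside its children; it sees only SLA compliance or an escalated fault.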

The challenge is that, as I’ve noted before, you can’t send a real technician to fix a virtual problem.  Users of applications or services with no remediation role may love the abstraction-driven management model, but it doesn’t always help those who have to manage real infrastructure resources.  That means that you have to be able to relate the abstraction-driven model to the real world, or you have to assume that the two are sort-of-parallel universes.

The parallel-universe theory is the simplest.  Resources themselves can be assigned SLAs to meet, SLAs that are then input into the capacity plans that commit SLAs to the bottom level of the abstraction model.  As long as the resources are meeting their SLA, and if your capacity plans are correct, you can assume that the higher-level abstractions are meeting theirs.  Thus, the resource managers worry about sustaining resources and not applications or services.

This isn’t much different from the way that IP services work today.  While MPLS adds traffic management capability and enhances the kind of SLAs you can write, the model of management is to plan the services and manage the infrastructure.  Thus, there’s no reason this can’t work.  In fact, it’s well-suited to the analytics-driven vision that many have of network management.  You capacity-plan your infrastructure, keep service adds within your capacity plan, and the plan ensures that there’s enough network capacity available to handle things.  If something breaks, you remediate at the level where the break occurs, meaning you fix the real infrastructure without regard for the services side.
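As an illustration of that “plan the services, manage the infrastructure” discipline, here’s a hypothetical sketch of admission control against a capacity plan.  The resource names and numbers are invented; the point is simply that service adds are checked against planned headroom and never touch resources directly.

```python
# Illustrative only: a capacity plan expressed as planned capacity per
# resource pool.  Service adds are admitted against the plan; keeping the
# pools meeting their own SLAs is a separate, resource-side process.

capacity_plan = {"core-gbps": 400, "edge-ports": 1000}   # hypothetical planned capacity
committed = {"core-gbps": 310, "edge-ports": 840}        # already sold against the plan

def admit_service(demand):
    """Admit a service add only if it stays within the capacity plan."""
    for resource, amount in demand.items():
        if committed.get(resource, 0) + amount > capacity_plan.get(resource, 0):
            return False   # would breach the plan; reject, or re-plan capacity
    for resource, amount in demand.items():
        committed[resource] += amount
    return True

print(admit_service({"core-gbps": 50, "edge-ports": 20}))   # True: fits the plan
print(admit_service({"core-gbps": 100}))                    # False: 360 + 100 > 400
```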

What about service management, then?  The answer is that service management becomes more a process of redressing SLA violations and issuing refunds.  If something breaks, you can tell what functional element in the service “organization chart” broke, which at least tells the customer where the fault is.  You presume your infrastructure-level remediation process will fix it, and you do whatever the SLA requires to address customer ire.

The more complicated approach is to provide a management linkage across the “organization chart” model hierarchy.  The TMF offers this sort of thing with its MTOSI (Multi-Technology Operations System Interface), which lets a management system parse through the structure of manageable elements in another system.  What it means is that if an element in a model reports an SLA violation, it’s because what’s inside it has failed and can’t be automatically remediated.  The logical response is to find out what the failure was, which means tracing down from where you are (where the fault was reported as being beyond remedy) to where the fault occurred, then digging into the “why”.
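Here’s a sketch of that drill-down, assuming the model is a simple tree where each element flags whether it’s in violation.  MTOSI defines its own interfaces and data model; this just illustrates the traversal idea with invented names.

```python
# A sketch of drilling down through the model hierarchy from the element
# that reported an unremediated SLA violation to the faulty leaf.
# The node shape is invented; MTOSI defines its own interfaces.

def trace_fault(node):
    """Depth-first trace from the reporting element down to the failed leaf."""
    if not node["children"]:                  # leaf: encloses real resource control
        return [node["name"]] if node["faulty"] else None
    for child in node["children"]:
        if child["faulty"]:                   # follow the branch reporting trouble
            path = trace_fault(child)
            if path:
                return [node["name"]] + path
    return None

model = {
    "name": "vpn-service", "faulty": True, "children": [
        {"name": "access", "faulty": False, "children": []},
        {"name": "core-transport", "faulty": True, "children": [
            {"name": "mpls-lsp-12", "faulty": True, "children": []},
        ]},
    ],
}

print(trace_fault(model))   # ['vpn-service', 'core-transport', 'mpls-lsp-12']
```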

One technical problem with this second approach is the potential for overloading the management APIs of lower-level elements when something breaks that has multiple service-level impacts—the classic “cascaded fault” problem.  I’ve proposed (in my ExperiaSphere work and elsewhere) that management information be extracted from individual APIs using an agent process that stores the data in a database, to be extracted by queries that present the data in whatever format is best.  This would ensure that the rate of access to the real APIs was controlled, and the database processes could be made as multi-threaded as needed to handle the designed volume of management queries.
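A rough sketch of that agent-and-database decoupling follows, with invented names (poll_device stands in for whatever real management API the agent would actually read):

```python
# Sketch of the agent idea: one polling process reads each real management
# API at a controlled rate and stores the result; management queries hit
# the store, never the device APIs, so a cascaded fault cannot flood them.

import time
import threading

store = {}                    # the "database": latest status per element
store_lock = threading.Lock()

def poll_device(element):
    """Placeholder for a real management-API read (SNMP, NETCONF, etc.)."""
    return {"element": element, "status": "ok", "read_at": time.time()}

def agent(elements, interval=5.0, cycles=1):
    """Poll each element at a fixed rate, writing results into the store."""
    for _ in range(cycles):
        for element in elements:
            data = poll_device(element)
            with store_lock:
                store[element] = data
        time.sleep(interval)

def query(element):
    """Service-side query: served from the store, no device API touched."""
    with store_lock:
        return store.get(element)

agent(["router-1", "router-2"], interval=0.0, cycles=1)
print(query("router-1"))
```

However many service-level queries arrive at once, the real APIs see only the agent’s fixed polling rate.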

A more serious problem with this second approach exists at the process level.  Service-driven remediation of infrastructure problems can result in spinning many wheels, particularly if a fault creates widespread service impact.  You don’t want customer service reps or (worse) individual customer portals to be launching any infrastructure-level initiatives, or you’ll risk having thousands of cooks spoiling the soup, as they say.

I think it’s logical to assume, again as I’ve done in ExperiaSphere, that service-level management visibility would be blocked at a certain point while parsing the “organization chart”, letting CSRs and customers know enough to feel confident that the problem had been identified, but not so far as to make individual resources visible at the service management level.
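A sketch of what that barrier might look like when a CSR or customer portal walks the model; the resource_boundary flag is an invented convention marking where service-level visibility stops.

```python
# Sketch of a visibility barrier: a CSR or customer portal can walk the
# model only down to the service/resource boundary; anything below it is
# summarized rather than exposed.

def service_view(node, depth=0):
    """Render the model tree, stopping where resource detail begins."""
    print("  " * depth + node["name"] + (" [FAULT]" if node["faulty"] else ""))
    if node.get("resource_boundary"):
        print("  " * (depth + 1) + "(resource detail withheld; remediation in progress)")
        return
    for child in node.get("children", []):
        service_view(child, depth + 1)

model = {
    "name": "vpn-service", "faulty": True, "children": [
        {"name": "core-transport", "faulty": True, "resource_boundary": True,
         "children": [{"name": "mpls-lsp-12", "faulty": True, "children": []}]},
    ],
}

service_view(model)   # shows the fault's location, but not mpls-lsp-12
```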

Overall, I think that virtualization and resource abstraction tend to pull traditional management apart, and the more abstract things are, the further apart the two pieces get.  Both applications and services need to manage quality of experience, and that means understanding what that is (at the high level) and dissecting issues to isolate the functional pieces responsible.  However, remediation has to occur where real resources are visible.

You need service or application management because a network’s value derives from the value of its services.  The “organization chart” model can provide it.  You also need resource management to address the actual state of infrastructure.  It’s possible to do resource management based on the same kind of modeling hierarchy, provided that you have a barrier to visibility between services and resources at the appropriate point (in ExperiaSphere, that was handled by separating the service and resource layers).  It’s also possible to manage resources against an independent SLA using tools that suit the resource process.  That conserves current tools and practices, and it may be the path of least resistance toward the future.