What, When, and How to Use T&M With SDN and NFV

There have been a number of online articles recently on the relationship between testing and monitoring (T&M) and NFV.  In the CloudNFV project I headed from 2013 to early 2014, I did some work defining these issues in more detail, and though the results were never incorporated in the PoC, they do offer some input on the issue.  The most interesting thing is that we have to review our entire testing, monitoring, and management practice set in light of virtual-network technologies.

One clear truth about T&M in a virtual age is that just as virtualization separates the abstract from the real in a clearly defined way, so it separates T&M tools and practices—or should.  There is “Service T&M” and “Resource T&M”, each focusing on a specific area and each with a specific mission.  The focus and mission differences dictate the technology implications of both.

Service T&M, since it operates at the service layer, lives above the abstract service/resource boundary, or in a sense on the “outside” of the intent models, where only guaranteed behaviors are visible and not how they’re fulfilled.  Obviously Resource T&M has to focus on the real resources, and so should always live “inside” the intent models.  Put another way, Service T&M tests and monitors virtual/functional things, and Resource T&M tests actual things.
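To make that split concrete, here’s a minimal Python sketch of the outside/inside division.  The names (IntentModel, service_view, resource_view) are my own hypothetical illustrations, not part of any standard:

```python
# A minimal sketch, with hypothetical names, of the outside/inside split of an
# intent model: the service layer sees guaranteed behaviors; bindings stay hidden.

class IntentModel:
    def __init__(self, name, sla):
        self.name = name
        self.sla = sla            # guaranteed behaviors, e.g. {"latency_ms": 20}
        self.state = "active"     # externally visible operational state
        self._bindings = []       # committed resources, hidden from the service layer

    def service_view(self):
        # the "outside": all that Service T&M is allowed to see
        return {"name": self.name, "state": self.state, "sla": self.sla}

    def resource_view(self):
        # the "inside": what Resource T&M tests and monitors
        return list(self._bindings)

vr = IntentModel("virtual-router", {"latency_ms": 20})
vr._bindings.append("vm-cluster-a")
print(vr.service_view())   # no trace of vm-cluster-a here
```

The underscore on _bindings is the whole argument in miniature: the service layer gets state and SLA guarantees, never the resource list.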

A complication to both kinds of T&M is the boundary point.  It’s not unreasonable to think that Service T&M practices would eventually lead to an intent model boundary, and to penetrate it you’d have to know service-to-resource correlations.  Similarly, it’s not unreasonable to think that tracing a resource issue might lead you to wonder what service impacts it was having.  We’ll deal with boundary-crossings in both directions.

Let’s start with the easy stuff.  Resource T&M, being aimed at testing real paths and devices, is at one level at least similar or even identical to traditional T&M.  In my consideration of T&M for NFV, my conclusion was that the only area where NFV (and, in my view, SDN) differed in Resource T&M was the introduction of a new class of resources—the hosting sites or servers.  These are the foundation points for NFV, and they also represent a complexity in the boundary-condition problem.

If a service like a VPN is viewed from the top (outside the intent model) it can easily be represented as a single virtual router.  That’s true of most network-connection services.  Similarly, a chain of virtual functions that make up vCPE could be visualized (you guessed it!) as “virtual CPE”.  However, the inside-the-model view would be (in both cases) a bunch of VMs or containers likely linked by SDN paths.  The transition from service to resource is not obvious.

The solution I proposed in CloudNFV (and ExperiaSphere) was to have a “soft” boundary between the service and resource layers, where instead of having a bottom-level service intent model decompose into resources directly, it decomposes into virtual devices.  A resource architect can then formulate how they want to expose virtual devices for composition into services, and however that happens, the relationship between the virtual devices and the real resources is set.  Orchestration and decomposition then take place on both sides of the boundary, driven by different but related missions.
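A sketch may help here.  The two catalogs below are hypothetical stand-ins for the two decompositions; nothing about the names or structure comes from the actual CloudNFV work, but the shape of the idea (orchestration on both sides of a virtual-device boundary) is the point:

```python
# A hedged sketch of the "soft boundary": a bottom-level service model decomposes
# into virtual devices, and each virtual device (separately) binds to resources.
# All names here are hypothetical illustrations, not CloudNFV code.

SERVICE_CATALOG = {
    # service-layer decomposition: service model -> virtual devices
    "vpn-service": ["virtual-router"],
}

RESOURCE_CATALOG = {
    # resource-layer decomposition: virtual device -> committed resources
    "virtual-router": ["vm-cluster-a", "sdn-path-12"],
}

def decompose_service(service_name):
    """Orchestrate on both sides of the boundary: the service side stops at
    virtual devices; the resource side maps those devices to real resources."""
    devices = SERVICE_CATALOG[service_name]
    return {dev: RESOURCE_CATALOG[dev] for dev in devices}

print(decompose_service("vpn-service"))
# {'virtual-router': ['vm-cluster-a', 'sdn-path-12']}
```

The service architect composes against "virtual-router" and never sees the VM cluster or SDN path; the resource architect can re-map the virtual device without touching the service model.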

Service T&M is more complicated, and the top reason for that is that it’s far from clear whether you actually want to do it.  Everyone used to crank the engine by hand to start a car, but we don’t put cranks on cars any longer.  The point is that many of the old-line ways of managing networks and dealing with problems don’t really relate to the way we have to do it in an age of orchestration and service automation.

Operators themselves were (and still are) very conflicted on the role of T&M, even at the Resource level.  The trend seems clear; they would rather manage resources against a capacity plan based on policies and perhaps on admission control—the mobile model, in short.  If that’s what you’re doing, then you probably don’t want to do specific resource testing much, if at all.  On the service side, the issue is a little more complicated because it’s largely driven by the fear of a major enterprise client calling and shouting “What do you mean all your lights are green?  My &#$* service is down!” when they have no way of “testing” or “measuring” just what’s going on.

After some deliberation, my own conclusion was that service-layer T&M should really consist of the following, in order of value and lack of complications:

  1. The ability to trace the binding to resources by following the state of the intent models from the service level down to the resources.  Any inconsistencies in state could then be identified.
  2. The ability to employ packet inspection on a path/connection and gather statistics.
  3. The ability to “see” or tap a path/connection, subject to stringent governance.

In the model I believe to be workable, the central thesis is that it’s useless to test a virtual resource; you can’t send a real tech to fix one.  The goal, then, is to validate the bindings by looking at how each intent model in the service structure has been decomposed, and establishing whether the state of each is logical given the conditions reported from below it.  For example, if a model representing a virtual router shows a fault, then the higher-level models that include it should also be examined for their fault state.  This lets you uncover problems that relate to the transfer of status in a virtual-network model, before you start worrying about resource state.

At the resource level, the tracing of state can link the status of the intent model that represents the resource (in my example here, the virtual router model) to the status of the resources that were committed.  This could involve a dip into a status database, or a direct query of the management interface of the resource(s) involved.
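Here’s a hedged sketch of that downward walk.  The tree layout, the STATUS_DB dictionary, and query_resource_status() are all assumptions I’m making for illustration; in a real system the leaf query would be the status-database dip or management-interface poll just described:

```python
# Sketch of binding validation: walk the decomposition tree top-down and flag
# any model whose reported state disagrees with what sits beneath it.

def query_resource_status(resource_id):
    # stand-in for a status-database dip or a management-interface query
    return STATUS_DB.get(resource_id, "unknown")

def validate(node):
    """Return (model, issue) pairs for every model whose state disagrees
    with the layer below it."""
    issues = []
    if node.get("resources"):                  # resource-facing model: query reality
        child_states = [query_resource_status(r) for r in node["resources"]]
    else:                                      # interior model: nested intent models
        child_states = [child["state"] for child in node["children"]]
        for child in node["children"]:
            issues.extend(validate(child))
    # "active" above a non-active layer means status transfer is broken
    if node["state"] == "active" and any(s != "active" for s in child_states):
        issues.append((node["name"], "state inconsistent with layer below"))
    return issues

STATUS_DB = {"vm-cluster-a": "failed", "sdn-path-12": "active"}

service = {
    "name": "vpn-service", "state": "active",
    "children": [{
        "name": "virtual-router", "state": "active",
        "resources": ["vm-cluster-a", "sdn-path-12"],
    }],
}

print(validate(service))
# [('virtual-router', 'state inconsistent with layer below')]
```

Note what the example catches: the virtual router reports “active” while one of its committed resources has failed, which is precisely a status-transfer fault in the bindings rather than a resource problem per se.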

The packet-inspection mission has interesting SDN/NFV connotations, for “inspection-as-a-service” in particular.  Inspection is a combination of creating a tap point and then linking it to a packet inspection engine, and both of these could be done using virtual elements.  Any virtual switch/router could be made to expose a tap, and once there is one, it’s not difficult to pipe the data to a place where you have static instances of inspectors, or to spawn a local inspector where you need it.  You could extend this to data injection without too much of a problem, but data injection in today’s network protocols has always been more problematic; it’s easy to do something that creates an instability.
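As a sketch of how little machinery the tap-plus-engine combination really needs, here’s a hypothetical version in Python.  The Tap and Inspector classes are mine, standing in for (say) a vSwitch mirror port and a DPI engine; note that this sketch only reads and counts, never injects:

```python
# Hypothetical sketch of "inspection-as-a-service": expose a tap on a virtual
# switch, then pipe it to a static inspector pool or a locally spawned instance.

class Tap:
    """A tap point exposed by a virtual switch/router (illustrative)."""
    def __init__(self, vswitch, port):
        self.source = f"{vswitch}:{port}"     # the mirror point being tapped

class Inspector:
    """A packet-inspection engine instance (stand-in for a real DPI engine)."""
    def __init__(self, location):
        self.location = location
        self.packets_seen = 0
    def feed(self, packet):
        self.packets_seen += 1                # gather statistics only; no injection

def attach_inspector(tap, static_pool=None):
    """Pipe the tap to a static inspector if one is available,
    or spawn a local inspector right where the tap is."""
    if static_pool:
        return static_pool[0]
    return Inspector(location=tap.source)

tap = Tap("vswitch-7", "port-3")
engine = attach_inspector(tap)                # no pool given: spawn locally
engine.feed(b"\x00" * 64)
print(engine.location, engine.packets_seen)   # vswitch-7:port-3 1
```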

Based on some real experience, I think that any discussion of SDN/NFV T&M that doesn’t focus first on binding is a waste of time.  In SDN, you need to know how a route is constructed from successive forwarding instructions in devices.  In NFV, you need to know where something is hosted and what connection resources are used to link the pieces.  I believe that if service models are constructed logically, the models themselves will provide access to the information you need to trace functionality, and little more will be required.  Where more is required, the packet-inspection-as-a-service approach can supplement binding tracing as needed.
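On the SDN side, that binding trace is concrete enough to sketch.  The FIB dictionary below is a hypothetical stand-in for per-device flow rules you’d pull from a controller; the trace just follows successive forwarding instructions, and the failure cases it reports (missing entry, loop) are exactly the broken bindings you’re looking for:

```python
# Sketch of SDN binding tracing: reconstruct a route by following successive
# forwarding entries device by device. The FIB is an illustrative stand-in.

FIB = {
    # device: {destination prefix: next device (None = delivered locally)}
    "sw-1": {"10.0.2.0/24": "sw-2"},
    "sw-2": {"10.0.2.0/24": "sw-3"},
    "sw-3": {"10.0.2.0/24": None},
}

def trace_route(start, prefix, max_hops=16):
    """Follow forwarding instructions hop by hop; a missing entry or a loop
    is exactly the kind of broken binding the trace is meant to expose."""
    path, device = [start], start
    for _ in range(max_hops):
        nxt = FIB.get(device, {}).get(prefix, "no-entry")
        if nxt == "no-entry":
            return path + ["<missing forwarding entry>"]
        if nxt is None:
            return path                       # delivered: route fully bound
        if nxt in path:
            return path + [nxt, "<loop>"]
        path.append(nxt)
        device = nxt
    return path + ["<hop limit exceeded>"]

print(trace_route("sw-1", "10.0.2.0/24"))     # ['sw-1', 'sw-2', 'sw-3']
```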

If bindings are important, then service models and the nested-intent-model approach are critical.  The state of a service today is directly related to the state of the devices that make it up.  Whether the service of the future is built from virtual functions, virtual devices, or virtual routes, the same dependency will exist then.  The most logical way to determine the status of a given intent model is to look at the state of the things underneath it, the things it decomposes into, and continue that downward progression until you find the problem.  If you can’t do that, then you might as well throw darts at a network resource map and see where each one lands.

But let’s get to the top of the issue.  All of this, IMHO, demonstrates how profoundly virtualization technology changes network operations and management, or should, or perhaps must.  Nobody should doubt that virtualization is more complex than old-fashioned fixed assets.  That additional complexity will swamp any capex benefits in opex unless we’re very careful with service automation.  T&M as we know it is irreconcilable with service automation; you can’t remedy problems with low-touch opex practices by touching them.  However, those who want to practice traditional T&M on real resources can continue to do so, perhaps as a last resort.  What we should be worrying about is reconnecting the automated analysis of service/resource behavior at that “slash point”, the boundary that virtualization will always create.

Could it be that the biggest opportunity in SDN and NFV is one that’s related to doing the kind of deep-thinking-and-heavy-lifting stuff that’s essential in framing virtual technologies in profitable service practices?  If so, then I think that the modeling and binding approaches are the most critical things in any of these new technologies, and ironically the least developed.  I looked at the major vendors who can make a business case for broad deployment of NFV, for example, and found that of the seven that now exist, all have either incomplete modeling/binding approaches or have recently changed theirs.  Yet these should be at the top of the heap in terms of software design and evolution.  We still have a lot of catching up to do.