In Search of a Paradigm for Virtual Testing and Monitoring

Virtualization changes a lot of stuff in IT and in networking, for two principal reasons.  One is that it breaks the traditional ties between functionality (in the form of software) and resources (both servers and the associated network connection elements).  The other is that it creates resource relationships that don’t map to physical links or paths.  The end result of virtualization is something highly flexible and agile, but also significantly more complicated.

When SDN and NFV came along, one of the things I marveled at was the way test and monitoring players approached them.  The big question they asked me was “What new protocols are going to be used?”, as if you could understand NFV by intercepting the MANO-to-VIM interface.  The real question was how you could gain some understanding of network behavior when all the network elements and pathways were agile and virtual.

Back in the summer of 2013, when I was Chief Architect for the CloudNFV initiative, I prepared a document on a model for testing/monitoring as a service.  The approach aimed to leverage the concept of “derived operations”, the primary outgrowth of the original ExperiaSphere project and the associated TMF presentations, to answer the real question: “How do you test/monitor a virtual network?”  There was never a partner for that phase, so the document was never released, but I think the basic principles are valid, and they serve as a primer on at least one way of approaching the problem.

Like ExperiaSphere, CloudNFV was based on “repository-based management”, where all management data was collected in a repository and delivered through management proxies and queries against that database, in whatever form was helpful.  A server or switch, for example, would have its MIB polled by an agent that would then store the data (including a time-stamp) in the repository.  When somebody wanted to look at switch state, they’d query the repository and get the relevant information.
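
To make that concrete, here’s a minimal sketch of the idea in Python.  Everything in it (the ManagementRepository class, the poll_mib stand-in, the field names) is my own invention for illustration, not something drawn from the CloudNFV document.

```python
import time

class ManagementRepository:
    """Toy repository: every status report is stored as a time-stamped record."""
    def __init__(self):
        self.records = []          # list of dicts, newest appended last

    def store(self, resource_id, data):
        self.records.append({"resource": resource_id,
                             "time": time.time(),
                             "data": data})

    def latest(self, resource_id):
        """A 'management query': return the most recent record for a resource."""
        matches = [r for r in self.records if r["resource"] == resource_id]
        return matches[-1] if matches else None

def poll_mib(resource_id):
    # Stand-in for an SNMP poll of a real switch or server MIB.
    return {"ifInOctets": 123456, "ifOutOctets": 654321, "operStatus": "up"}

repo = ManagementRepository()
repo.store("switch-7", poll_mib("switch-7"))     # the polling agent writes...
print(repo.latest("switch-7"))                   # ...and a management proxy reads.
```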

What made this “derived” operations was the idea that a service model described a set of objects representing functionality, either atomic (like a device or VNFC) or collective (like a subnetwork).  Each object in the model could define a set of management variables whose values were derived from subordinate objects’ variables using any expression that was useful.  In this way, the critical pieces of a service model, the “nodes”, could be managed as though they were real, which is good because in a virtual world the abstraction (the service model) is the “realest” thing there is.
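
Here’s a small, hypothetical illustration of a derived variable.  The ServiceObject class and the particular derivation expressions are mine, purely to show the shape of the idea: a collective object’s status and latency are computed from its subordinate objects rather than read from any real device.

```python
class ServiceObject:
    """A node in the service model: it may have children, and it may define
    management variables as expressions over its children's variables."""
    def __init__(self, name, children=None, derivations=None, variables=None):
        self.name = name
        self.children = children or []
        self.variables = variables or {}        # 'real' variables for atomic objects
        self.derivations = derivations or {}    # variable name -> function(children)

    def get(self, var):
        if var in self.derivations:
            return self.derivations[var](self.children)
        return self.variables.get(var)

# Two atomic objects whose status would normally come from the repository.
vnfc_a = ServiceObject("firewall-vnfc", variables={"status": "up", "latency_ms": 4})
vnfc_b = ServiceObject("dhcp-vnfc", variables={"status": "degraded", "latency_ms": 9})

# A collective object whose variables are derived, not measured directly.
subnet = ServiceObject(
    "branch-subnet",
    children=[vnfc_a, vnfc_b],
    derivations={
        "status": lambda kids: ("up" if all(k.get("status") == "up" for k in kids)
                                else "degraded"),
        "latency_ms": lambda kids: max(k.get("latency_ms") for k in kids),
    },
)

print(subnet.get("status"))       # 'degraded' -- derived from the VNFCs
print(subnet.get("latency_ms"))   # 9
```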

The real solution to monitoring virtual networks is to take advantage of this concept.  With derived operations, a “probe” that can report on traffic conditions or other state information is simply a contributor to the repository, like anything else that has real status.  You “read” a probe by doing a query.  The trick lies in knowing which probe to read, and I think the solution to that problem exposes some interesting points about NFV management in general.

When an abstract “function” is assigned to a real resource, we call that “deployment” and we call the collective decision set that deploys stuff in NFV “orchestration”.  It follows that orchestration builds resource bindings, and that at the time of deployment we “know” where the abstraction’s resources are—because we just put them there.  The core concept of derived operations is to record the bindings when you create them.  We know, then, that a given object has “real” management relationships with certain resources.
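
A sketch of that principle, with invented names; the only point is that the binding table is written at deployment time, so management never has to go discover where a function lives after the fact.

```python
# Binding table: service object -> the real resources orchestration picked for it.
bindings = {}

def deploy(service_object, candidate_hosts):
    """Hypothetical orchestration step: pick a host and record the binding."""
    host = candidate_hosts[0]                     # placement policy would go here
    bindings.setdefault(service_object, []).append(host)
    return host

deploy("vpn-site-12/firewall", ["server-3", "server-9"])

# Later, management follows the recorded binding instead of discovering it:
print(bindings["vpn-site-12/firewall"])           # ['server-3']
```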

Monitoring is a little different, or it could be.  One approach to monitoring would be to build probes into service descriptions.  If we have places where we can read traffic using RMON or DPI or something similar, we can exercise those capabilities as though they were any other “function”.  A probe can be what (or one of the things that) a service object actually deploys; a subnet can include a probe just as it can include a tunnel or a machine image.  Modeled with the service, the probe contributes management data like anything else.  Using this model would be similar to routing traffic through a conventional probe point.
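
As a hypothetical example of a probe modeled as part of a service (the Probe class and the repository list here are placeholders of my own), the probe is just another contributor of time-stamped records:

```python
import time

repository = []   # same repository idea as before, kept trivial here

class Probe:
    """A probe deployed as part of a service object, e.g. inside a subnet model.
    It contributes records to the repository just like a device agent would."""
    def __init__(self, service_id, attach_point):
        self.service_id = service_id
        self.attach_point = attach_point

    def report(self, counters):
        repository.append({"service": self.service_id,
                           "probe": self.attach_point,
                           "time": time.time(),
                           "data": counters})

# The subnet's deployment recipe includes a probe on its access trunk.
probe = Probe(service_id="vpn-1234", attach_point="subnet-A/access-trunk")
probe.report({"rx_pps": 42000, "tx_pps": 39000, "drops": 3})

print(repository[-1]["data"])
```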

The thing is, you could do even more.  In a virtual world, why not virtual probes?  We could scatter probes through real infrastructure or designate points where a probe could be loaded.  When somebody wanted to look at traffic, they’d do the virtual equivalent of attaching a data line monitor to a real connection.

To make virtual probes work, we need to understand probe-to-service relationships, because in a virtual world we can’t allow service users to see foundation resources, or they’d see other users’ traffic.  So what we’d have to do is follow the resource bindings to find the real probe points the service is entitled to see, and then use a “probe viewer” limited to querying the repository for traffic data associated with the service involved.
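
A toy version of that scoping rule might look like this; the data and names are invented, but the logic is the point: the viewer follows the service’s recorded bindings and filters the repository so only that service’s traffic records come back.

```python
# Resource bindings recorded at deployment time (service -> probe points it may see).
service_bindings = {
    "vpn-1234": ["probe-edge-3", "probe-core-1"],
}

# The shared repository holds traffic records for every service.
repository = [
    {"probe": "probe-edge-3", "service": "vpn-1234", "rx_pps": 42000},
    {"probe": "probe-core-1", "service": "vpn-5678", "rx_pps": 91000},
]

def probe_view(service_id):
    """A 'probe viewer' limited to the probes bound to the service, and to
    records tagged with that service, so users never see foundation resources
    or other customers' traffic."""
    visible_probes = set(service_bindings.get(service_id, []))
    return [r for r in repository
            if r["probe"] in visible_probes and r["service"] == service_id]

print(probe_view("vpn-1234"))   # only the vpn-1234 record on probe-edge-3
```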

One of the things that helps make this work is the notion of modeling resources in a way similar to that used for modeling services.  An operator’s resource pool is an object that “advertises” bindings to service objects, each binding representing some functional element of a service for which the resource pool has a recipe for deployment and management.  When a service is created, the service object “asks” for a binding from the resource model and gets the binding that matches its functionality and other policy constraints, like location.  That’s how, in the best of all possible worlds, we can deploy a 20-site VPN with firewall and DHCP support when some sites can use hosted VNF service chains and others have or need real CPE.  The service architect can’t be asked to know that stuff, but the deployment process has to reflect it.  The service/resource model binding is where the physical constraints of infrastructure are matched to the functional constraints of services.
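
Here’s a deliberately simplified sketch of the advertisement-and-binding step, with made-up advertisements: the same abstract “firewall” function decomposes into a hosted service chain in one location and real CPE in another, without the service architect having to know which.

```python
# Advertisements from the resource model: what it can deploy, and where.
advertisements = [
    {"function": "firewall", "how": "vnf-service-chain",
     "locations": ["metro-east", "metro-west"]},
    {"function": "firewall", "how": "real-cpe",
     "locations": ["rural-north"]},
    {"function": "dhcp", "how": "vnf-service-chain",
     "locations": ["metro-east", "metro-west", "rural-north"]},
]

def bind(function, location):
    """A service object 'asks' for a binding; the first advertisement that
    matches both the function and the location policy wins."""
    for ad in advertisements:
        if ad["function"] == function and location in ad["locations"]:
            return ad["how"]
    raise LookupError(f"no resource can deliver {function} at {location}")

# The same abstract service decomposes differently per site:
print(bind("firewall", "metro-east"))    # vnf-service-chain
print(bind("firewall", "rural-north"))   # real-cpe
```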

And so, as it happens, is monitoring.  Infrastructure can “advertise” monitoring and even test-data injection points, and a service object or a monitoring-and-testing-as-a-service could then bind to the correct probe point.  IMHO, this is how you have to make testing and monitoring work in a virtual world.  I think the vendors aren’t supporting this kind of model in no small part because we’ve not codified “derived operations” and repository-based management data delivery, so the mechanisms (those resource bindings and the management derivation expressions) aren’t available to exploit.

I think this whole virtual-monitoring and monitoring-as-a-service story proves an important point: if you start something off with a high-level vision and work down to implementation in a logical way, then everything that has to be done can be done logically.  That’s going to be important to NFV and SDN networks in the future, because network operators and users are not going to forgo the tools they depend on today just because they’ve moved to a virtual world.