Why We Need to Pay More Attention to “Events”

One of the big issues in zero-touch automation is event generation.  Since management of services and infrastructure is all about responding to events, getting events to respond to is critical and fundamental.  It's not so much that we don't know how to generate events as that we don't necessarily know how to generate "good" ones.  Then there's the fact that the nature of events is tied to the flexible, fuzzy relationship between virtual service elements and infrastructure.

Virtualization overall is a matter of creating and exploiting multi-tenant infrastructure, which isn't far from how networking has worked since packet switching replaced circuit switching.  With virtualization, you have services based on virtual resources that map to real ones in some possibly complex way, and that mapping process is what creates the event confusion.

Suppose a “real” trunk connection fails.  That connection might be a part of tens, hundreds, thousands of virtual resource relationships, either as a direct transport conduit or as a pathway over which “interior” virtual connections pass.  In either case, the failure of the trunk is clearly an “event” that has to be handled, and this raises the first of many event questions.

You can’t fix virtual resources, at least not conventionally, and so at one level the right approach to this failure is to fix the real trunk.  In fact, that probably has to be done in any case.  The problem is that the SLAs of some or all of the services impacted might require remediation faster than trunk repair could be expected to complete.  In traditional adaptive networks, in fact, we’d reroute traffic around the failure.  That means that the trunk failure event has to be somehow reflected into the virtual world.

If we assume that services were created from components modeled on intent-model principles, if we further assume that these services supported an SLA, and if we assume as well that each intent-model element of a service could generate an event when the SLA it offered was not met, then we could assume that all service events could be created through intent modeling.  Obviously, few of these assumptions can actually be made in the practical world, because intent modeling is far from a universally accepted principle, and even where it is accepted there is a wide variety of implementation practices.
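As a minimal sketch of this idea (all names here are hypothetical, not drawn from any standard), an intent-modeled element could compare delivered metrics against its offered SLA and emit an event only when that SLA is violated:

```python
from dataclasses import dataclass

@dataclass
class SLA:
    max_latency_ms: float
    min_availability: float

@dataclass
class Event:
    element: str
    detail: str

class IntentElement:
    """An intent-modeled service element that keeps its internals opaque
    but emits an event whenever its offered SLA is not met."""
    def __init__(self, name, sla, event_sink):
        self.name = name
        self.sla = sla
        self.event_sink = event_sink  # any callable that accepts an Event

    def report_metrics(self, latency_ms, availability):
        # Internal conditions stay hidden; only SLA violations escape as events.
        if latency_ms > self.sla.max_latency_ms:
            self.event_sink(Event(self.name, f"latency {latency_ms}ms exceeds SLA"))
        if availability < self.sla.min_availability:
            self.event_sink(Event(self.name, f"availability {availability} below SLA"))

events = []
vpn_core = IntentElement("vpn-core", SLA(max_latency_ms=20.0, min_availability=0.999),
                         events.append)
vpn_core.report_metrics(latency_ms=35.0, availability=0.9995)  # only latency violates
```

The point of the sketch is the boundary: the element decides for itself when its SLA is broken, and the outside world sees only the resulting event, not the internal state.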

A standard intent/event strategy would be extremely valuable in zero-touch automation. In fact, if intent modeling contributed nothing to the process other than a uniform event generation approach, it would be a highly useful step to take. There are a number of bodies including ETSI and the TMF that could perhaps be expected to generate a set of event model relationships, but so far none of them appear committed to doing that. Given that, we have to look deeper and find more universally applicable but intrinsically complicated solutions.

For services where an explicit commitment between a resource and a virtual element is made, the correlation of a resource event to events associated with virtual elements is fairly straightforward.  When you bind a virtual resource to a real one, you need to create a record of that fact.  Think of the record as being simply a pairing:  Virtual>Real.  When our trunk fails, a database table of these pairings gives you every association between the trunk resource and its virtual subordinates.
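A sketch makes the pairing concrete (the identifiers are invented for illustration): a table of Virtual>Real bindings, queried on a real-resource failure to find every virtual subordinate that needs an event.

```python
# Each binding records one Virtual>Real pairing, as described in the text.
bindings = [
    ("vpn-a/tunnel-1", "trunk-17"),
    ("vpn-b/tunnel-4", "trunk-17"),
    ("vpn-a/tunnel-2", "trunk-9"),
]

def affected_virtuals(failed_resource, bindings):
    """Return every virtual element bound to a failed real resource."""
    return [virt for virt, real in bindings if real == failed_resource]

# A failure of trunk-17 fans out to all of its virtual subordinates.
affected = affected_virtuals("trunk-17", bindings)
```

In practice this would be an indexed database table rather than a list scan, but the lookup-and-fan-out pattern is the same.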

This point illustrates some very important truths about events.  Most important, it shows that if you want to have service-level responses to specific resource conditions, then you need to define a virtual-to-resource binding.  That means that building services with specific SLA remediation expectations means either accepting SLAs from underlying facilities and foregoing explicit remediation, or building the services on virtual elements that can be explicitly bound.

This latter requirement obviously demands an example.  Say you have a VPN service that maps to a resource-layer multipoint service made up of a thousand devices and thousands of trunks.  You can’t expect to remediate a trunk failure in that situation because your mapping isn’t to the right level of resource.  If instead you build your VPN from virtualized elements that are specifically mapped to hosting points, tunnels, etc., then you can design your zero-touch system to be able to report a failure of the bound resources to the service level for remediation.

The second point the binding pairing shows is that for multi-tenant services it quickly becomes impractical to bind tenant service SLAs to resource state in order to generate events.  Multi-tenancy is inherently resource sharing based on some set of rules, and you can't bypass the rules and the shared resources by binding across them.  Multi-tenant services abstract the per-tenant services above shared, service-level virtual resources, and those virtual resources are in turn mapped to the real ones.

Obviously, things would be easier if we could presume that “virtual” failures were detectable as events.  Take the example of a virtual wire that transits a half-dozen trunks and nodes.  The higher-level response to a fault in this wire would be the same whatever real resource failed—reroute.  Monitoring some virtual resources is straightforward; if the resource is a virtual device or software-hosted feature instance, it can expose a management interface that could itself be (directly or indirectly) an event source.  If the resource is a virtual wire, the options are more complicated.

There have been two general approaches to making virtual wires event sources.  One is what could be called the "admission point" approach.  If you can monitor entry/egress to a virtual wire, you can probably inject test packets or tag normal ones, and by doing that get a measure of the health of the path.  The other is to actually peek at the traffic along the way.
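The admission-point approach can be sketched as follows (a toy model, not any product's mechanism): packets are tagged with a sequence number and timestamp at the ingress of the virtual wire, and matched up at egress to estimate loss and latency for the path as a whole.

```python
class AdmissionMonitor:
    """Tags packets at the ingress of a virtual wire and matches them at
    egress to estimate loss and latency for the end-to-end path."""
    def __init__(self):
        self.sent = {}        # seq -> ingress timestamp
        self.latencies = []   # matched one-way delays

    def ingress(self, seq, now):
        self.sent[seq] = now

    def egress(self, seq, now):
        t0 = self.sent.pop(seq, None)
        if t0 is not None:
            self.latencies.append(now - t0)

    def health(self):
        # Anything still unmatched in self.sent never arrived: count it as loss.
        avg = sum(self.latencies) / len(self.latencies) if self.latencies else None
        return {"lost": len(self.sent), "avg_latency": avg}

mon = AdmissionMonitor()
for seq in (1, 2, 3):
    mon.ingress(seq, now=0.0)
mon.egress(1, now=0.010)
mon.egress(2, now=0.012)
report = mon.health()  # packet 3 never arrived, so it counts as lost
```

Note that the monitor sees only the two admission points; whichever real trunk or node along the way actually failed, the virtual wire's health degrades the same way, which is exactly why the approach fits virtual-resource event generation.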

Until recently, the latter approach has been generally rejected for its impact on performance of the link.  However, there have been dramatic improvements in merchant silicon to facilitate packet processing, even where “deep header” inspection is required.  The P4 language I mentioned in a previous blog is designed to facilitate the description of “flow programs” that could perform monitoring/tagging functions as well as determine forwarding paths.

P4 and the associated silicon, likely used in embedded appliance or network adapter missions, could provide a better way of doing admission-point tagging and analysis, and a fairly practical way of doing deep-header inspection too.  It's well within the realm of possibility that P4-driven silicon could provide monitoring of the virtual pathways that services of all kinds are likely to use.  If the silicon were powerful enough, it would be possible to make an interface or P4 switch flow-aware down to the specific virtual pipe level, and it's certainly possible to gather statistics and conditions on "class-of-service" trunks that aggregate more detailed flows along well-traveled routes.
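P4 itself programs switch pipelines, but the kind of per-flow accounting it enables can be sketched in a language-neutral way (the flow identifiers below are invented): counters keyed by virtual-pipe ID, with an aggregate view for a class-of-service trunk made up of member flows.

```python
from collections import defaultdict

class FlowStats:
    """Toy model of the per-flow counters a P4 pipeline could maintain:
    packet and byte counts keyed by a virtual-pipe identifier."""
    def __init__(self):
        self.packets = defaultdict(int)
        self.octets = defaultdict(int)

    def observe(self, flow_id, size):
        # In real silicon this update would run per packet in the data plane.
        self.packets[flow_id] += 1
        self.octets[flow_id] += size

    def trunk_view(self, member_flows):
        # Aggregate member flows into a class-of-service trunk view.
        return (sum(self.packets[f] for f in member_flows),
                sum(self.octets[f] for f in member_flows))

stats = FlowStats()
stats.observe("pipe-a", 1500)
stats.observe("pipe-a", 800)
stats.observe("pipe-b", 400)
pkts, octets = stats.trunk_view(["pipe-a", "pipe-b"])
```

The design point is the two levels of visibility: fine-grained per-pipe counters where the silicon can afford them, and aggregate trunk counters where it can't.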

Having virtual-flow analytics would simplify the problem of generating relevant service events considerably.  The same technology might also be able to offer better responses to problems, if the virtual flows could be quickly diverted to “standby” routes or failure mode states, for example.  However, doing this, even with P4, requires coordination.  Switch programming still programs only one switch at a time, and so you need to coordinate behavior across switches to create a flow.  You also need to be able to relate per-switch programs to flow behavior when you’re analyzing virtual pipes.  Thus, there’s still work to do.

There’s also some confusion to address.  Many believe that “intent modeling” would eliminate the concern about events overall, or would eliminate at least the need to couple hardware conditions to service events.  That’s not really true.  Intent models can enforce an SLA within themselves, but that enforcement is almost certainly going to require the same kinds of management as we’ve talked about here.  The model makes the event exchange opaque from the outside, but it doesn’t eliminate it.

The net here is that events are way too important to be left to almost accidental or peripheral discussions, which is where they are now.  The same is true of "state," meaning the current condition of a cooperative system.  If you define states and events, you define the process context you're expecting, and we can't do zero-touch automation without that kind of definition.  We really need to press the bodies claiming to be discussing zero-touch automation to frame their event expectations early, and we then need to examine them closely to be sure they match our conception of the relationship between resources and services.