Monitoring and Testing in the Age of Open Networks

Every change you make to a network changes how you look at it.  That, in turn, changes what you look at it with, meaning the technologies, tools, and practices involved in monitoring and testing.  A network is a cooperative association of devices, coordinated in some way to fit a specific mission.  Getting them to do that, or determining whether they’re doing it or not, depends on the mission and the means of coordination.  That’s changing rapidly, and testing and monitoring have to change with it.

There are three broad models of networking in use today.  The first is the adaptive model, where devices exchange peer information to discover routes and destinations.  This is how IP networks, including the Internet, work.  The second is the static model, where destinations and pathways (routes) are explicitly defined in a tabular way, and the third is the central model, where destinations and routes are centrally controlled but dynamically set based on policies and conditions.  Each of these models has different monitoring and testing challenges.

IP networks are the baseline technology approach today, and in IP networks the devices themselves will typically exchange information on who they can reach and how efficient the pathway is.  Since the networks adapt to changes in conditions, including traffic loading, it’s traditionally been difficult to know exactly where a given flow of traffic is going, and so it’s difficult to say whether it’s impacted by some trunk or node condition.

Advances like MPLS (multi-protocol label switching) have allowed IP networks to define virtual trunks that create a more meshed topology, and by routing these trunks explicitly you can gain greater control over how routes are drawn through a network.  This, of course, creates what’s effectively another layer of protocols, and a new list of things that might be subject to monitoring and testing.

The static model is in a sense an extreme end-game of these virtual-trunk practices.  The theory is that if a “good” connectivity model can be defined as a series of forwarding rules/policies/tables in all the devices, then you can keep it in place unless something breaks, and if something breaks you fix it.  The fixing delay might be prohibitive in a “real” world of nodes and circuits, but if we’re dealing with virtual elements then fixing something is a matter of virtual reconfiguration.  No big thing.
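The static model’s “series of forwarding rules/tables in all the devices” can be pictured with a minimal sketch.  This is purely illustrative (the node names, prefix, and `trace_path` helper are mine, not any product’s): per-node tables are chained to derive the end-to-end path, and a missing rule is exactly the kind of broken intermediate state the next paragraph worries about.

```python
# Illustrative sketch: a "static model" network as per-node forwarding
# tables, with end-to-end paths derived by chaining the table lookups.

TABLES = {  # node -> {destination_prefix: next_hop}
    "A": {"10.1.0.0/16": "B"},
    "B": {"10.1.0.0/16": "C"},
    "C": {"10.1.0.0/16": "local"},
}

def trace_path(src, dest_prefix, tables):
    """Follow the static forwarding rules from src until delivery or a gap."""
    path, node = [src], src
    while True:
        hop = tables.get(node, {}).get(dest_prefix)
        if hop is None:
            return path, "no rule"      # a broken intermediate state
        if hop == "local":
            return path, "delivered"
        path.append(hop)
        node = hop

print(trace_path("A", "10.1.0.0/16", TABLES))  # (['A', 'B', 'C'], 'delivered')
```

Updating such a network means rewriting these tables everywhere, node by node, which is where the consistency problems below come from.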

Static models have their disadvantages, the biggest of which is that they don’t work well if there are a lot of new destinations added, old destinations deleted, or destinations moving around in connectivity terms.  It takes time to update all the distributed forwarding rules, and while the update is in progress you risk having illogical and non-functional intermediate states created.  This also happens in IP, where adaptive routing changes can create pathways to nowhere as the new topology “converges”.  It can also mean your monitoring and testing end up focusing on somewhere the traffic isn’t going.

Central control models try to thread the needle between the approaches by providing a set of policies for forwarding, but also a controller where the policies are determined, stored, and distributed.  OpenFlow SDN is a modern example of central control, and for those who, like me, have a leg in networking history, IBM’s System Network Architecture (SNA) was a centrally controlled network approach.

Central control has issues too, not the least of which is the loss of connection between a network node and the controller.  Without a control path, forwarding policies can’t be updated, and without updated policies you may not be able to re-establish a control path.  Classic Catch-22.  There’s also the fact that the load on the central controller can become acute when it’s presented with packets for which no forwarding rules have been defined, and then expected to define them.  During periods of high dynamism, when new things are being added or a lot of things are breaking, this loading issue could take a controller down.
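The controller-loading point can be sketched in a few lines.  This is a toy model, not an actual OpenFlow API: a node with no matching rule punts the packet to the controller, which computes and installs one.  Every such miss is controller work, so a burst of new destinations (or broken paths invalidating old rules) turns directly into controller load.

```python
# Toy model of reactive flow setup in a centrally controlled network.
# Class and field names are illustrative, not any real SDN library.

class Controller:
    def __init__(self, policy):
        self.policy = policy          # dest -> port: the central policy store
        self.packet_ins = 0           # count of "packet-in" events (load)

    def handle_miss(self, node, dest):
        self.packet_ins += 1
        rule = self.policy.get(dest)
        if rule is not None:
            node.rules[dest] = rule   # push the computed rule down to the node
        return rule

class Node:
    def __init__(self, controller):
        self.rules = {}
        self.controller = controller

    def forward(self, dest):
        if dest not in self.rules:    # table miss -> controller round trip
            if self.controller.handle_miss(self, dest) is None:
                return "drop"
        return self.rules[dest]

ctl = Controller({"10.1.1.1": "port2"})
n = Node(ctl)
n.forward("10.1.1.1")
n.forward("10.1.1.1")
print(ctl.packet_ins)  # 1 -- only the first packet generates controller load
```

In steady state the cached rules absorb the traffic; it’s the periods of high dynamism, where the cache keeps missing, that threaten the controller.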

The new paradigm of programmatic control of forwarding, via languages like P4, opens up even more variability.  What is needed to control a P4 network?  Answer: it depends on how it’s programmed, which means it would be very challenging to evolve testing and monitoring based on traditional techniques to keep pace with what P4 might produce in network forwarding approaches.  How do you address that?

At a broad scale, the answer is analytics.  You analyze overall behavior, because with hosted VPN or Internet services you have no other option.  Enterprise management and monitoring inevitably evolves to analytics, because there are no nodes or trunks visible to you.  Networking is a numbers game, and that’s also true as things transform to intent models.  Even specific, provisioned services can be reduced to analytics when you divide them into an intent hierarchy.

Not so for those who build networks, and who face the open-network revolution.  There, the only possible solution to the new agile-forwarding problem is also a solution to the problems of the current three network models.  What you need to do is forget the notion of packet interception and inspection, which depends on knowing what’s supposed to be happening on the data path.  Instead you focus on what the nodes themselves are doing.  The network, whether it’s the network of the present or some abstract network of the future, is the sum of its policies.  Those policies are applied in the nodes, so know the nodes and you know the state of the network.

Monitoring, then, is really going to turn into examining the node’s behavior.  That means two things, really.  One is the rules that determine how packets are handled, and the other is the conditions of the connections and the data plane behavior of the nodes that do the connecting.  The first is a kind of sophisticated policy interpretation and analysis process, and the second is dependent on in-band telemetry.  P4 assumes both can be made available, and you can make the two available with some customization even in the current three network models.

The ideal foundation for P4 testing and monitoring would be a “P4-network simulator” that let you model the nodes and policies and test the behavior of the network under the variable conditions the P4 policies were sensitive to.  The same simulator would generate pseudo-in-band telemetry; how much was generated, when, and what it recorded could then be analyzed and correlated with load conditions.  Every P4 device has to have a “P4 compiler” anyway, so building a simulator wouldn’t be rocket science.
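To make the simulator idea concrete, here’s a deliberately tiny sketch of the shape such a thing might take.  Everything here is hypothetical (the node names, rule format, and record fields are mine): each simulated node applies its match/action rules, the compiled policies, and emits a pseudo-telemetry record as each packet passes through, so the telemetry log can be correlated back to the policies that produced it.

```python
# Toy illustration of the "P4-network simulator" idea: nodes apply
# match/action rules and emit pseudo-telemetry as packets pass through.

import time

def make_node(name, rules, telemetry):
    """Build a simulated node from its (hypothetical) match/action rules."""
    def process(packet):
        action = rules.get(packet["dest"], "drop")   # table miss -> drop
        telemetry.append({"node": name, "dest": packet["dest"],
                          "action": action, "ts": time.time()})
        return action
    return process

telemetry = []
edge = make_node("edge1", {"10.0.0.1": "forward:core1"}, telemetry)
core = make_node("core1", {"10.0.0.1": "forward:host"}, telemetry)

pkt = {"dest": "10.0.0.1"}
for node in (edge, core):
    node(pkt)

# The pseudo-telemetry log is now available for analysis and correlation.
for rec in telemetry:
    print(rec["node"], rec["action"])
```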

There’s a P4 suggestion on monitoring that uses a fairly broad set of in-band features that not only provide information on the flow, but also on the forwarding path rules.  I’m sure that this would be helpful in “debugging” a P4 program, but I think the simulator would be of greater value, and the in-band telemetry could then focus on end-to-end performance information, obtained by timestamping the messages and also recording things like queue depth along the way.
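The “timestamping the messages and recording things like queue depth along the way” idea looks roughly like this in miniature.  The field names are illustrative, not the actual P4 INT header layout: each hop stamps the packet with its own metadata, and the collector at the far end reconstructs per-hop latency from the deltas between successive stamps.

```python
# Sketch of in-band telemetry accumulation: each hop appends a metadata
# stamp (timestamp, queue depth) to the packet for the collector to read.

import random
import time

def hop(packet, node_name, queue_depth):
    """Simulate one node stamping the packet with its telemetry metadata."""
    packet["int_stack"].append({
        "node": node_name,
        "ts": time.time(),
        "queue_depth": queue_depth,
    })
    return packet

pkt = {"payload": b"data", "int_stack": []}
for name in ("edge1", "core1", "core2", "edge2"):
    hop(pkt, name, queue_depth=random.randint(0, 64))

# At the collector, per-hop latency is the delta between successive stamps.
stamps = pkt["int_stack"]
latencies = [b["ts"] - a["ts"] for a, b in zip(stamps, stamps[1:])]
print(len(stamps), len(latencies))  # 4 3
```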

I think future testing and monitoring for all network models should converge on this approach.  Do a simulation and test it with telemetry comparisons.  For network models that don’t have forwarding rules distributed from a central point, the goal would require that you obtain the current forwarding rules (time-stamped) to feed the simulator.  That means reading routing tables.  You could also, for legacy network models, read adaptive discovery information from the data paths, but this has limited value except at area boundaries where such information can be made available.  Snooping on every path/trunk is complicated, and of diminishing value.

How about probes and protocol analysis?  I think both concepts have been slowly diminishing in importance too, first because fewer people can read the results, and second because of the difficulty in matching probing to virtual networks.  It is possible to introduce probe processes more easily now, but more difficult to interpret the results and avoid impacting performance.  I think that “virtual probes” have a place in creating a simulator or analytics framework for monitoring and testing, but that their value outside that environment will continue to decline over time.

There are network simulators out there, of course.  As far as I know, they’re not specific P4 simulators, which wouldn’t necessarily rule them out as the framework for modern network monitoring and testing.  That would make them more general, but less able to model the specifics of our new forwarding-programmable model of devices.  Most are used in research and university environments, though, meaning that they’re not intended to model things at scale for operational analysis.  Discrete event simulation, the core of most of the models, is difficult to apply at scale for the obvious reason that processing simulated events at scale is as big a problem as actually moving the equivalent traffic.
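The discrete-event core those simulators share is a simple mechanism, which is exactly why it scales badly.  A minimal sketch (my own, not any particular simulator’s API): a time-ordered event queue is popped one event at a time, and handlers schedule further events.  Every simulated packet arrival is an event to process, so simulating heavy traffic costs roughly as much work as carrying it.

```python
# Minimal discrete-event simulation loop: pop the earliest event, handle
# it, and let the handler schedule new events on the same queue.

import heapq

def run(events, handlers, until=float("inf")):
    """events: heap of (time, seq, kind, data); handlers may push more."""
    log = []
    while events:
        t, _, kind, data = heapq.heappop(events)
        if t > until:
            break
        log.append((t, kind))
        for new in handlers[kind](t, data):
            heapq.heappush(events, new)
    return log

def on_arrival(t, data):
    # each packet arrival schedules a departure 0.5 time units later
    return [(t + 0.5, 1, "depart", data)]

def on_depart(t, data):
    return []

evs = [(0.0, 0, "arrive", "pkt1"), (1.0, 0, "arrive", "pkt2")]
heapq.heapify(evs)
trace = run(evs, {"arrive": on_arrival, "depart": on_depart})
print(trace)  # [(0.0, 'arrive'), (0.5, 'depart'), (1.0, 'arrive'), (1.5, 'depart')]
```

Two packets produce four events here; millions of flows produce event counts no single event loop can keep up with, which is the scaling wall the text describes.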

Some experts in the operator space tell me that there’s interest in the notion of moving beyond discrete event simulation to a broader model.  You first build elements whose behavior you do model traditionally, but you then link them in a different way to create something that can actually represent network behavior in real time without requiring a sea of supercomputers.  We’re not quite there, but we’re at least looking at the right stuff.

I think that’s true with testing and monitoring.  Virtualization inevitably changes how we build networks, so it’s going to change how we monitor and test them too.  Intent modeling could have a profound impact by framing SLAs for components, which enhances the role of analytics.  Smaller components can also be simulated more easily, and the simulations can then be “federated” to model behavior overall.  As in other spaces in our open-network future, vendors have been dragging their feet a bit.  That’s diminishing the opportunity to get testing and monitoring right, shifting the industry toward predictive analytics instead.  It could still be turned around, but the longer that takes the harder it will be.