Simulation, AI, and Testing in the NaaS of the Future

Virtual networking, or network-as-a-service (NaaS), makes connectivity easier, but it complicates the automation of lifecycle processes.  The problem is that when “services” are created on top of connectivity rather than through the same devices that provide that connectivity, you lose some insight into service behavior.  You also lose pretty much all of the traditional ways of doing management.  There’s talk about “monitoring”, “analytics”, “simulation”, and “artificial intelligence” to fill the gap, but little detail about how that might work.

The basic principle of automated service lifecycle management is that there is a series of generated events that represent changes in conditions.  These events are analyzed so that a cause and a correction can be determined, and a remediation process is then launched to restore normal (or at least optimal under current conditions) behavior.  This is the process into which we have to fit any new tools we hope to adopt.
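To keep that process concrete as we go, here’s a minimal sketch of the event/analysis/remediation loop.  Everything in it (the Event fields, the conditions, the action names) is a hypothetical illustration, not any particular vendor’s implementation:

```python
# A minimal sketch of the event-driven lifecycle loop described above.
# All names (Event, analyze, remediate) are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class Event:
    source: str     # which element reported the change in conditions
    condition: str  # e.g. "link_utilization_high"

def analyze(event: Event) -> str:
    """Determine a probable cause and select a remediation."""
    if event.condition == "link_utilization_high":
        return "reroute_traffic"
    return "escalate_to_operator"

def remediate(action: str) -> None:
    """Launch the process that restores normal (or optimal) behavior."""
    print(f"launching remediation: {action}")

def lifecycle_loop(events):
    for event in events:
        remediate(analyze(event))

lifecycle_loop([Event("core-link-7", "link_utilization_high")])
```

Each of the tools discussed below slots into one of those stages: generating the events, interpreting them, or picking the remediation.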

The tools?  These also need some definition.  Monitoring is the real-time examination of network behavior to identify deviations from the desired operating state.  Such deviations are then coded as events, things that represent alerts that have to be handled in some way.  Analytics is the broader process of historical trend analysis or correlation analysis, aimed at deepening the insight you gain from traditional monitoring, usually to detect things earlier than monitoring would.  Simulation is the process of modeling the behavior of the network under a specific set of conditions, and it can be used either to help predict how current monitoring/analytics conditions will play out (giving you even more advance notice) or to predict how effective a given remedial process will be.  Artificial intelligence is similarly a tool to interpret our “events”, either to predict how a problem will unfold or to identify the reaction most likely to be successful.  Got it?
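To make the monitoring definition concrete: raw readings get compared against the desired operating state, and deviations are coded as events.  A hedged sketch, with thresholds and metric names invented for illustration:

```python
# Hypothetical monitoring sketch: compare live readings against the desired
# operating state and code any deviation as an event for later handling.
DESIRED_STATE = {"link_utilization": 0.70, "packet_loss": 0.001}  # assumed

def monitor(readings: dict) -> list:
    events = []
    for metric, limit in DESIRED_STATE.items():
        value = readings.get(metric, 0.0)
        if value > limit:
            events.append({"condition": f"{metric}_exceeded",
                           "observed": value, "limit": limit})
    return events

# One metric deviates, so exactly one event is coded.
print(monitor({"link_utilization": 0.85, "packet_loss": 0.0005}))
```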

We can apply our tools to our lifecycle model now, starting with “event” generation.  Let’s say that the network has a goal state, which is implicit in nearly all operations processes.  There’s a network plan, perhaps a capacity plan, that is based on assumptions about traffic and connectivity.  That plan implicitly defines a network behavior set, and that behavior set is the goal state.  If current network behavior deviates from it, the plan is presumptively invalid and some remediation is required.  Monitoring and analytics are the traditional and the leading-edge (respectively) ways of detecting a deviation.
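Analytics, in this framing, is what notices the deviation before monitoring would.  A hedged sketch: fit a trend to recent utilization samples and project when the capacity plan’s assumed ceiling will be crossed.  The ceiling and the samples are invented for illustration:

```python
# Hedged analytics sketch: project a utilization trend against the capacity
# plan's assumed ceiling. The ceiling and samples are invented.
PLANNED_LIMIT = 0.80  # utilization ceiling the capacity plan assumed

def hours_until_breach(history):
    """Fit a linear trend to hourly utilization samples and project when
    the planned limit will be crossed (None if the trend is flat/falling)."""
    n = len(history)
    mean_x, mean_y = (n - 1) / 2, sum(history) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history)) \
            / sum((x - mean_x) ** 2 for x in range(n))
    if slope <= 0:
        return None
    return (PLANNED_LIMIT - history[-1]) / slope

# Utilization creeping upward: monitoring sees nothing yet, analytics does.
print(hours_until_breach([0.52, 0.55, 0.59, 0.62, 0.66]))  # -> 4.0 hours
```

In this example the current reading (66%) is still below the planned 80% ceiling, so monitoring stays quiet, but the trend projects a breach in about four hours.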

Simulation could be a powerful tool at this point, because if the network’s presumptions about connectivity and traffic are known, those presumptions can be used to set up a simulation of normal network behavior.  What simulation brings to the table is the opportunity to model detailed conditions in the network as traffic and connectivity demands change.  That provides a specific way of generating early warnings of the conditions that are truly significant, versus things that won’t make much of a difference or will be handled through adaptive network behavior.
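A hedged sketch of that idea, using a toy two-path topology with invented capacities: the simulator separates demand the network will absorb adaptively from demand that deserves an early warning:

```python
# Hedged simulation sketch: play projected demand against a toy two-path
# topology to separate what adaptive routing absorbs from what deserves
# an early warning. Topology and capacities are invented.
CAPACITY = {"path_a": 10.0, "path_b": 10.0}  # Gbps, assumed

def simulate(demand_gbps: float) -> str:
    """Model what the network would do with the projected demand."""
    if demand_gbps <= CAPACITY["path_a"]:
        return "normal: primary path carries the load"
    if demand_gbps <= CAPACITY["path_a"] + CAPACITY["path_b"]:
        # Adaptive behavior handles this; not worth an alarm.
        return "absorbed: overflow reroutes to path_b"
    return "EARLY WARNING: projected demand exceeds total capacity"

for projected in (8.0, 14.0, 22.0):
    print(projected, "->", simulate(projected))
```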

AI can also be injected at this point.  If we have a good simulation of “normal” behavior and we also have a simulation of how the current set of conditions is likely to impact the network, we can then use AI to establish a preemptive (or, if we’re late in recognizing things, a reactive) response.  That response could then be fed into the simulator (if there’s time) to see if it produces a satisfactory outcome.  We could also use AI to define a set of possible responses to events, then simulate each to see which works best.  The results could then be fed back into the AI (machine learning) to help rule out actions that shouldn’t be considered.
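Here’s a hedged sketch of that propose-simulate-select loop, including the feedback step that rules out poor actions.  The candidate responses and simulator scores are invented stand-ins for a real AI policy and a real simulator:

```python
# Hedged sketch of the propose-simulate-select loop. The candidate actions
# and simulator scores are invented stand-ins for a real AI and simulator.
RULED_OUT = set()  # machine-learning feedback: actions not worth considering

def propose_responses(condition: str) -> list:
    """Stand-in for an AI policy that suggests remediations."""
    candidates = ["reroute", "add_capacity", "throttle_low_priority"]
    return [c for c in candidates if c not in RULED_OUT]

def simulate_outcome(response: str) -> float:
    """Stand-in for a simulator scoring a response (1.0 = full recovery)."""
    return {"reroute": 0.9, "add_capacity": 0.7,
            "throttle_low_priority": 0.2}[response]

def respond(condition: str) -> str:
    scored = [(simulate_outcome(r), r) for r in propose_responses(condition)]
    for score, response in scored:
        if score < 0.5:          # feed poor outcomes back to rule them out
            RULED_OUT.add(response)
    return max(scored)[1]

print(respond("congestion"))  # -> "reroute"
print(RULED_OUT)              # -> {"throttle_low_priority"}
```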

In the response area, it’s AI that comes to the fore.  Machine learning is often represented as a graph, where steps, conditions, and processes are linked in a specific way.  Think of this as an augmentation of normal state/event behavior.  Because all service lifecycle processes should (nay, I would say “must!”) be based on state/event handling, augmenting that handling to provide a more advanced analysis of the state/event relationship would be expected to improve lifecycle automation.
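A hedged sketch of what that augmentation might look like: a conventional state/event table, with a learned score (a stub here) breaking the tie when more than one handler could apply.  States, events, and scores are invented for illustration:

```python
# Hedged sketch: a classic state/event table, augmented so a learned score
# (a stub here) breaks ties when more than one handler could apply.
STATE_EVENT_TABLE = {
    ("active", "congestion"):   ["reroute", "throttle"],
    ("active", "link_down"):    ["failover"],
    ("degraded", "congestion"): ["add_capacity", "reroute"],
}

def learned_score(state, event, action) -> float:
    """Stand-in for an ML model trained on past remediation outcomes."""
    history = {("active", "congestion", "reroute"): 0.92,
               ("active", "congestion", "throttle"): 0.41}
    return history.get((state, event, action), 0.5)

def handle(state: str, event: str) -> str:
    actions = STATE_EVENT_TABLE[(state, event)]
    return max(actions, key=lambda a: learned_score(state, event, a))

print(handle("active", "congestion"))  # -> "reroute"
```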

That doesn’t mean simulation isn’t helpful here too, either to supplement AI by providing insight into the way proposed responses will impact services, or to offer a path to predicting network behavior when conditions fall outside what AI has been prepared to handle.  Most useful “process automation AI” is likely to be a form of machine learning, and simulation can help with that by constraining the possible solutions and by offering something almost judgment-like for problems where specific rules haven’t been defined.
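One way to picture that supplementary role, as a hedged sketch: act on the AI’s answer when it’s confident, and fall back to simulating the candidate actions when the condition falls outside what the model was trained on.  The model, the confidence threshold, and the actions are all invented:

```python
# Hedged sketch: trust the AI when it is confident, fall back to simulation
# when the condition falls outside what it was trained on. All invented.
def ai_response(condition: str):
    """Stand-in for a trained model returning (action, confidence)."""
    known = {"congestion": ("reroute", 0.9)}
    return known.get(condition, ("none", 0.0))

def simulate_candidates(condition: str) -> str:
    """Stand-in for what-if simulation over the allowed actions."""
    return "add_capacity"  # assume this scored best in simulation

def choose(condition: str, threshold: float = 0.6) -> str:
    action, confidence = ai_response(condition)
    if confidence >= threshold:
        return action
    return simulate_candidates(condition)  # the judgment-like fallback

print(choose("congestion"))       # trained case -> "reroute"
print(choose("fiber_cut_storm"))  # untrained case -> simulated answer
```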

It should be clear here that the combination of AI and simulation serves in lifecycle automation almost as a substitute for human judgment or “intuition”.  The more complex the network is, and the more complex the services are, the more difficult it is to set up specific plans to counter unexpected conditions; there are too many of them and the results are too complex.  It should also be clear that neither simulation nor AI is the primary tool of lifecycle automation.  We still need good overall service modeling (intent modeling), and we still need state/event or possibly graph-based interpretation of conditions.

All of this, we should note, is likely to be managed best at the facility level rather than at the virtual network level.  NaaS is a wonderful thing for operations, but less so as the target of remediation.  Real faults can’t be corrected in the virtual world; the best you could do is rebuild the virtual structure to accommodate real conditions.  That could create the classic fault avalanche, and in any event it would be easier and less process-intensive to remediate what can really be fixed, since there’s less of it and the actions are more likely to be decisive.
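The “there’s less of it” point is easy to illustrate.  In this hedged sketch, with an invented mapping, an avalanche of virtual-path alarms collapses to a single physical fault worth remediating:

```python
# Hedged sketch of why facility-level remediation is less process-intensive:
# many virtual (NaaS) alarms often collapse to one physical fault.
# The mapping below is invented for illustration.
VIRTUAL_TO_FACILITY = {
    "vpn-12/path-3": "fiber-span-9",
    "vpn-40/path-1": "fiber-span-9",
    "vpn-77/path-2": "fiber-span-9",
}

def root_facilities(virtual_alarms):
    """Collapse an avalanche of virtual alarms to the real faults beneath."""
    return {VIRTUAL_TO_FACILITY[alarm] for alarm in virtual_alarms}

alarms = ["vpn-12/path-3", "vpn-40/path-1", "vpn-77/path-2"]
print(root_facilities(alarms))  # three virtual alarms, one real repair
```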

There’s an impact on testing here, too.  The industry has been struggling with the question of whether you should do “connection network” testing even in IP networks, where finding a traffic flow means examining routing tables to see where it’s supposed to go.  In NaaS-modeled networks, we have a clean separation of connectivity from the associated underlayment, but even that underlayment may be based on IP and adaptive behavior.

One possibility for testing is the “deep dive” model; you go to the bottom layer that’s essential for everything else and test the facilities there.  This is difficult to separate from monitoring, though.  Another possibility is to maintain connection-level testing, but that’s going to be much more difficult unless you have a standard mechanism for setting up an overlay network.  Today in SD-WAN, for example, we have dozens of different overlay encapsulation approaches; can we expect test vendors to adapt to them all?

The best approach is probably to provide what could be called a “testing API”, which needn’t be anything more complicated than a way to establish a specific pathway and then examine how traffic flows through it.  Since this pathway would look like a standard connection to the NaaS layer, you’d be able to get it in a standardized way even if the underlying encapsulation of the overlay network differed from vendor to vendor.
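A hedged sketch of what such an API might offer, with a hypothetical class and methods (this is not a real vendor interface): create a pathway, inject probe traffic, and read back how it flowed, all without ever seeing the underlying encapsulation:

```python
# Hedged sketch of a "testing API" at the NaaS layer. The class and methods
# are hypothetical, not a real vendor interface; encapsulation stays hidden.
class NaaSTestAPI:
    def create_test_path(self, src: str, dst: str) -> str:
        """Ask the NaaS layer for a pathway that looks like any connection."""
        return f"test-path:{src}->{dst}"

    def inject(self, path: str, packets: int) -> None:
        """Send probe traffic through the pathway."""
        print(f"injecting {packets} probe packets on {path}")

    def observe(self, path: str) -> dict:
        """Report how traffic flowed; the values here are illustrative."""
        return {"delivered": 1000, "lost": 0, "mean_latency_ms": 12.4}

api = NaaSTestAPI()
path = api.create_test_path("site-a", "site-b")
api.inject(path, 1000)
print(api.observe(path))
```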

We still have to contend with the reality that virtual problems get fixed in the real world.  Do we really want to do “testing” anywhere except at the connection edge of a NaaS or at the physical facilities level?  What do we expect to learn there?  That’s something testing vendors will need to think about, particularly as we expand our goals for automation of service lifecycle management.  There’s little point to that automation if we then expect networking personnel to dive into bitstreams.

Again, simulation and AI can help.  We need to think more in terms of what abnormal conditions would look like and how they should be remediated, and less in terms of geeks with data line analyzers digging into protocols.  The rational goals for network automation, like the implementations of network modeling, should focus on intent-modeled elements that self-remedy within their scope.  If we set that goal, then AI and simulation are going to prove to be very powerful tools for us in the coming years.
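To close with a hedged sketch of that goal: an intent-modeled element that self-remedies within its own scope and escalates only when it can’t meet its intent.  The class, the intent, and the conditions are invented for illustration:

```python
# Hedged sketch of an intent-modeled element that self-remedies within its
# scope and escalates only when it can't meet its intent. Names invented.
class IntentElement:
    def __init__(self, name: str, intent: str, local_actions: list):
        self.name, self.intent, self.local_actions = name, intent, local_actions

    def can_fix(self, condition: str, action: str) -> bool:
        # Stand-in for AI/simulation judging whether the action restores intent.
        return condition == "congestion" and action == "reroute"

    def handle(self, condition: str) -> str:
        for action in self.local_actions:
            if self.can_fix(condition, action):
                return f"{self.name}: self-remedied via {action}"
        return f"{self.name}: escalating {condition!r} to the parent model"

edge = IntentElement("metro-edge-4", "deliver 10Gbps", ["reroute", "throttle"])
print(edge.handle("congestion"))  # handled within scope
print(edge.handle("fiber_cut"))   # outside scope, escalate
```

Scale that pattern across a model hierarchy and you have lifecycle automation that rarely needs a human, which is exactly where AI and simulation earn their keep.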