Are Skill Issues a Hidden Problem in Zero-Touch?

Could we be missing a big requirement in zero-touch?  This question came up in a recent conversation I had with a Tier Two operations executive, and I think it’s a fair one.  It also illustrates that the world of telcos is more varied than we’d think.

Nearly all the emphasis on zero-touch service lifecycle automation is on its potential to radically reduce operations costs.  This is important first because opex is larger than capex for most operators, and second because new technologies designed to reduce capex (like NFV) risk raising complexity and opex in turn.  There’s nothing wrong with this perspective, but my recent conversation suggests it’s incomplete.

“Our problem isn’t how much touch operations needs, it’s who’s available to do the touching.”  That’s the insight my executive contact presented.  His point is that smaller telcos have major challenges in sustaining an adequate base of skilled operations personnel.  That means that zero-touch might have an added benefit of reducing errors and supporting more complex network technology choices, where limitations in skilled labor impact operations practices.  It also means that some zero-touch approaches might themselves be too complicated to administer.

My all-time favorite tech cartoon comes from the ’60s.  At the time, it was popular to measure computing power in terms of how many years it might take some number of mathematicians to solve a given problem, versus the computer.  In the cartoon, a programmer comes home from work, tosses a briefcase on the sofa, and says to his spouse “I made a mistake today that would have taken a hundred mathematicians a thousand years to make!”  This reflects the downside of automation.  If it goes wrong, it can do truly awful things so fast that humans might never be able to respond.  If the humans don’t have the requisite skills…well, you get the picture.

Having machines run amok is a popular science fiction theme, but rampant machine errors aren’t confined to robots or artificial intelligence.  Any automated process is, by design, trying to respond to events in the shortest possible time.  Zero-touch systems necessarily do that, and so they’re subject to running amok too.  My contact is concerned that if zero-touch automation somehow isn’t set up right, it could make one of those thousand-year mistakes.

Most automated operations tools are rule-based, meaning they look for events and associate them with a set of rules, which in turn describe actions to be taken.  It’s pretty obvious that messing up the rules would mess up the results.  What could be done to reduce this risk?
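To make the risk concrete, here is a minimal sketch of a rule-based engine in Python.  The event kinds, element names, and rules are all hypothetical, invented for illustration and not taken from any real operations tool.  The second rule is deliberately mis-authored: it was meant to fire only on device failures, but its condition is too broad, so a simple link failure also triggers a device reboot.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Event:
    kind: str     # hypothetical event type, e.g. "link_down"
    element: str  # hypothetical name of the affected element


@dataclass
class Rule:
    name: str
    matches: Callable[[Event], bool]  # condition tested against each event
    action: Callable[[Event], str]    # description of the action to take


def run_rules(rules: List[Rule], event: Event) -> List[str]:
    """Return the action of every rule whose condition matches the event."""
    return [rule.action(event) for rule in rules if rule.matches(event)]


RULES = [
    Rule(
        "reroute-on-link-down",
        lambda e: e.kind == "link_down",
        lambda e: f"reroute traffic away from {e.element}",
    ),
    Rule(
        # Intended to match only "device_failed"; the condition below is a
        # deliberate authoring mistake that matches almost everything.
        "reboot-on-device-failure",
        lambda e: e.kind != "device_ok",
        lambda e: f"reboot device attached to {e.element}",
    ),
]

actions = run_rules(RULES, Event("link_down", "trunk-7"))
for action in actions:
    print(action)
```

Nothing in the engine itself can tell that the second action is wrong.  Someone who understands the network has to spot it, which is exactly the skill problem.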

The pat answer is “AI”, but artificial intelligence still has to somehow learn about the real world.  A proper AI strategy is probably a subset of a broader class of approaches that we could call “state-based” or “goal-seeking”.  These systems define a network as existing in a number of possible states, one of which is the goal state.  The system responds to events by transitioning into a new state, and when in that new state, a set of rules describes how to seek the goal state again.

This approach is similar to “intent modeling” in that it can be implemented via an intent model.  It could also be implemented using AI, but however it’s implemented there’s still some skill required in setting up the state transition rules.  The advantage is that the state-based approach is both holistic (it deals with network state, not discrete events) and easier for people to visualize.  One disadvantage is that the state transition rules tend to be “hidden” and easily forgotten.
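The state-based idea can be sketched in a few lines of Python.  This is only an illustration under simplifying assumptions: the three states, the events, and the remediation actions are hypothetical, and each action is assumed to succeed and produce the event it was meant to produce.

```python
GOAL = "normal"

# (current state, event) -> next state.  All names are hypothetical.
TRANSITIONS = {
    ("normal", "link_down"): "degraded",
    ("degraded", "link_down"): "isolated",
    ("degraded", "link_up"): "normal",
    ("isolated", "link_up"): "degraded",
}

# state -> (remediation action, event that action is assumed to produce)
GOAL_SEEKING = {
    "degraded": ("restore failed link", "link_up"),
    "isolated": ("activate backup link", "link_up"),
}


def on_event(state, event):
    """Move to the next state; unknown (state, event) pairs leave it unchanged."""
    return TRANSITIONS.get((state, event), state)


def seek_goal(state, max_steps=10):
    """Apply per-state remediation until the goal state is reached."""
    actions = []
    for _ in range(max_steps):
        if state == GOAL:
            break
        action, expected_event = GOAL_SEEKING[state]
        actions.append(action)
        # Assumption: the action works and yields the expected event.
        state = on_event(state, expected_event)
    return state, actions


state = on_event("normal", "link_down")
state = on_event(state, "link_down")
final_state, plan = seek_goal(state)
print(state, "->", final_state, "via", plan)
```

Even in this toy, the two tables are where the skill goes.  They are also the “hidden” part: the logic lives in data that is easy to forget once it has been written.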

The big problem with state-based modeling for zero-touch automation is that complex networks have a lot of moving parts, and thus a lot of failure modes to represent.  Similarly, the task of restoring the goal state is highly complex, to the point where you’d need someone to author rules to do it, which takes you back to the rule-based approach.  As always, we then have the complexity of the rules and the difficulty of creating them without skilled personnel.

The approach that my contact liked was simulation.  If we presumed that we had a simulation of the network, and that the simulation allowed for the input of conditions, we could use the simulation both to create rules and to validate actions before they were taken.  In fact, the marriage of a good simulation strategy and any zero-touch approach seems natural, to the point that it’s hard to see how we could have come this far without trying it out more seriously.
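As a rough sketch of what “validate actions before they were taken” could look like, the Python below applies a proposed action to a copy of a toy network model and checks the result before anything touches the real network.  The topology, the node names, and the single check (does the network stay connected?) are assumptions made for illustration.  A real simulator would model load, capacity, and protocol behavior as well.

```python
from collections import deque


def is_connected(nodes, links):
    """Breadth-first search over undirected links; True if all nodes are reachable."""
    if not nodes:
        return True
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == set(nodes)


def simulate_remove_link(links, link):
    """Return a copy of the link set without the given link (either direction)."""
    return {l for l in links if frozenset(l) != frozenset(link)}


def validate_action(nodes, links, link_to_remove):
    """Approve taking a link out of service only if the simulated network stays connected."""
    simulated = simulate_remove_link(links, link_to_remove)
    return is_connected(nodes, simulated)


# Hypothetical three-node networks: a ring with a spare path, and a chain without one.
nodes = {"A", "B", "C"}
ring = {("A", "B"), ("B", "C"), ("A", "C")}
chain = {("A", "B"), ("B", "C")}

ring_ok = validate_action(nodes, ring, ("A", "B"))
chain_ok = validate_action(nodes, chain, ("B", "C"))
print("ring:", ring_ok, "chain:", chain_ok)
```

The same check could gate rule authoring as well as live actions: a rule whose action fails in simulation never gets deployed.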

Simulation of this sort depends on a library of elements that can then be combined graphically to “draw” a network.  The elements would include devices and trunk lines, of course, and ideally there should be elements to represent any real-world network component.  There would also have to be “load” and “condition” simulators to allow the simulated network to be “operated” and conditions tried out.

We have simulators, of course, and even complete simulation languages and libraries.  There are several dozen “network simulators”, including a number of free ones.  Some have decent library support, though it’s often necessary to define your own models for devices if the library doesn’t include what you’re running in your network.  The Tier Two and Three operators aren’t uniformly aware of these tools, and in my surveys the number who have any experience with them is down at the statistical noise level.

Those who have tried them don’t report much success.  The problem is that there’s nothing much out there to describe how to integrate simulations with zero-touch automation or even operations in general.  For Tier Ones, this is a problem I’ve often heard about, and for Tier Two and Three operators, it’s probably insurmountable without outside help.

AI, in the form of machine learning or neural networks, might also offer a solution here, but it’s unclear just how effective either would be absent a skilled team to provide information.  Most AI relies on subject matter experts who “know” the right approaches, and “knowledge engineers” who create an AI application that can then deal with conditions in the field.  If a telco has limited access to a skilled labor pool, those experts could be hard to come by.

I think simulation is likely the best approach here, provided we can get some broader industry support for the idea, and in particular can get some standards and APIs that allow simulation to be integrated with operations tools, including zero-touch automation.

I also think that the problem of skill levels needed to establish and sustain zero-touch automation goes beyond Tier Twos and Threes.  I’ve worked on rule-based systems for quite a while, and the challenge that they pose is related to the complexity of the ecosystem they’re applied to.  A single trunk or device can be sustained in service with simple rules.  As you add elements, the rule complexity rises roughly with the square of the number of elements, simply because networks are interdependent systems in which any element can potentially affect any other.  I wonder how many big Tier Ones will face this truth as they try for their own zero-touch applications.  Maybe the skills needed are beyond even them.
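One way to read the “square of the number of elements” point is to count the pairs of elements that might interact.  The short Python sketch below assumes, purely for illustration, that every pair of elements could need its own rule.  Real networks are sparser than that, but the growth pattern is the concern.

```python
from itertools import combinations


def pairwise_interactions(n):
    """Count unordered pairs among n elements, i.e. n * (n - 1) / 2."""
    return sum(1 for _ in combinations(range(n), 2))


# Hypothetical network sizes, chosen only to show the growth pattern.
for n in (1, 10, 100):
    print(n, "elements ->", pairwise_interactions(n), "potential interactions")
```

Under that assumption, ten times as many elements means roughly a hundred times as many interactions to think through.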