Why Networking Needs to Catch Software Operations Fever

Remember when we used to talk things over? I’m not talking about relationships, but about faults and problems. Have we, as an industry, gotten too focused on real-time analysis and forgotten “retrospection?” There’s an interesting article on this, as it relates to software operations, that we could think about applying to networking. Note that the article distinguishes between “retrospectives” and what it calls “debriefings” because software engineers use the former term for a type of formal design-team review.

I remember working on a big real-time project with a lead architect from a computer vendor. We had a failure and he immediately ran around trying to set up the conditions so the problem could be observed. First, he couldn't get the problem to happen because the conditions were wrong. Then the attempt to observe changed the timing. I got disgusted pretty quickly, so I started rooting around in the aftermath, and in about ten minutes of discussions and exploration, we found and fixed the problem.

Networking is really a perfect place to apply this sort of retrospection for fault and root-cause determination. With network problems, recreating the conditions is often impossible; you rarely know what was actually happening, so setting it up again is wishful thinking. Network problems are often timing problems, which makes them very hard to diagnose. I wondered whether enterprises were thinking this way, so I went back through my recent discussions and found some interesting truths.

I’d chatted with 44 enterprises on matters relating to their procedures for fault management and root cause analysis. Of that group, only 4 had volunteered that their procedures included an interview with the user. Only 9 had interviewed the network operations people who worked on the problem after the fact, to establish whether they believed that the root cause had been identified. Two enterprises, by the way, did both—five percent.

Any real-time system, networks included, needs to be assessed in context, which means finding out what the context was. When we have a PC application problem to assess, it's common to ask the user "What were you trying to do?" Wouldn't it make sense to capture that information on the networking side?

The argument could be made (and in fact has been, many times) that interviews don't do any good because there's too much going on in the network that causes problems elsewhere. The old "butterfly's wings" argument. It sounds good, but of those 44 enterprises, 30 said that in most cases a network problem was caused by a "predictable source", meaning an action someone took or a condition someone knew about. Given that only 11 of the group had interviewed anyone at all, it's clear that these predictable-source outages were determined to be predictable only after considerable analysis.

I went back to a half-dozen enterprises after I'd read the article I referenced and asked them a few simple questions: what usually gets reported to you when there's a network problem, what gets carried from that contact into a trouble ticket, and how carefully do you establish the exact time of the problem? I guess you know how that went.

Everyone agreed that trouble tickets were condensed digests of the report itself. They also agreed that what was retained was a description of the problem, not of what was going on before it. Finally, they agreed that their help desks recorded the time of the report, not the time of the incident, which made it very difficult to correlate the problem with activities taking place outside the awareness of the user who reported it.
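Just to make the point concrete, here's a minimal sketch in Python of what a richer trouble ticket might capture. The field names are purely my own invention, not any ticket product's schema: the incident time is kept separate from the report time, and the user's own account of what they were trying to do survives the help-desk distillation.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class TroubleTicket:
    """Illustrative ticket record; all field names are assumptions, not a product schema."""
    ticket_id: str
    reported_at: datetime       # when the help desk logged the call
    incident_at: datetime       # when the user says the problem actually occurred
    symptom: str                # the help desk's distillation of the report
    user_account: str           # the user's own words: "what were you trying to do?"
    preceding_activity: List[str] = field(default_factory=list)   # what was going on just before
    debrief_notes: List[str] = field(default_factory=list)        # post-resolution interview notes

# Example: the incident happened well before it was reported, so any correlation
# against logs or change records has to use incident_at, not reported_at.
ticket = TroubleTicket(
    ticket_id="T-1042",
    reported_at=datetime(2021, 3, 4, 10, 15),
    incident_at=datetime(2021, 3, 4, 9, 40),
    symptom="Order-entry application timeouts",
    user_account="Trying to submit a batch of orders before the 10am cutoff",
    preceding_activity=["Monthly reporting run started at 9:30"],
)
```

Nothing exotic there; the value is simply that the human context gets recorded and the correlation clock points at the incident, not the phone call.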

Another interesting point was that only 5 of the 44 enterprises said they routinely analyzed problem reports in the context of general application and user activity. These five said they would review application logs to determine, for example, what was being run and by how many users. Three said they would contact other users of an application if the person who reported the problem indicated they were running, or trying to run, it. Almost all of the 44 admitted they could probably do better at communicating with the IT operations team when analyzing a network problem; most said that happened only if the IT team referred the problem to them.
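Reviewing application logs against a reported problem isn't complicated, either. Here's a hedged sketch that assumes plain log lines beginning with an ISO-8601 timestamp (a format I'm assuming purely for illustration; real log formats vary), pulling out the activity in the window leading up to the incident:

```python
from datetime import datetime, timedelta

def activity_in_window(log_lines, incident_at, window_minutes=15):
    """Return log entries whose timestamps fall in the window leading up to the incident.

    Assumes each line starts with an ISO-8601 timestamp, e.g.
    '2021-03-04T09:35:12 orders-app user=jsmith action=batch_submit'.
    Purely illustrative; adapt the parsing to whatever the real logs look like.
    """
    start = incident_at - timedelta(minutes=window_minutes)
    hits = []
    for line in log_lines:
        stamp = line.split(" ", 1)[0]
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that don't start with a parseable timestamp
        if start <= when <= incident_at:
            hits.append(line)
    return hits

# Usage: what was running in the 15 minutes before the user's reported incident time?
logs = [
    "2021-03-04T09:30:01 reporting-app action=month_end_run_start",
    "2021-03-04T09:38:45 orders-app user=jsmith action=batch_submit",
]
print(activity_in_window(logs, datetime(2021, 3, 4, 9, 40)))
```

The point is that once you have the incident time and the user's context, connecting them to application activity is a small amount of work, not a research project.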

A failure to solicit user feedback, the human impression of what happened, is particularly scary when you consider that the share of problems this group of 44 enterprises said were directly caused by human error was an astounding 78%. It's also scary when you consider that all 44 said that retrospective interviewing "might help" accelerate lasting corrections.

Some organizations (about a third of my group of 44) treat problem analysis as a formal collaborative task, but only among network professionals. The remainder may contact network operations personnel ad hoc during fault analysis, but they have no specific tools or procedures to support those conversations.

Formalism in the sense of specific tools to encourage collaborative problem analysis and resolution seems critical in another way. As the article notes, having a group hug isn’t going to make long-standing changes in your network. Discussions have to end in recommendations, and everything needs to be documented. One network operations lead I was chatting with spontaneously admitted that “I think sometimes we have to solve the same problem three or four times before it gets recognized and documented.”

That comment suggests that facilitating network problem resolution isn't simply a matter of communication and collaboration. You also need to record the result, and in fact you should treat every step involved, from gathering information about the problem, through the attempts to isolate it, and onward to the solution, as important enough to record and index. This might be a place where artificial intelligence and machine learning could come in, as long as we had proper records for them to operate on.
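As a sketch only, with names and fields that are entirely my own assumptions rather than a reference to any real tool, the investigation trail could be captured as structured data so it can be indexed and later mined, by people or by a machine-learning pipeline:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class InvestigationStep:
    """One action taken during fault analysis and what it showed."""
    when: datetime
    action: str       # e.g. "interviewed reporting user", "checked WAN link utilization"
    outcome: str      # what the step revealed, even if it ruled something out

@dataclass
class FaultRecord:
    """The full trail from first report to fix, kept as a referenceable unit."""
    ticket_id: str
    symptoms: List[str]
    steps: List[InvestigationStep] = field(default_factory=list)
    root_cause: str = ""
    fix: str = ""
    index_terms: List[str] = field(default_factory=list)  # terms for later retrieval
```

The specific fields matter less than the principle: every attempt, successful or not, becomes data that can be found again instead of living in someone's memory, which is exactly what "solving the same problem three or four times" tells you is missing.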

Almost every enterprise has a trouble-ticket system designed to track problem resolution, but users report these systems are rarely useful as a reference when analyzing new problems. Do you index tickets by symptoms? Then a lot of unrelated stuff gets connected. How about by the area of the network involved? At the start of a fault analysis, you don't know what area is involved.
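To see why symptom indexing alone falls short, consider this toy example (the tickets and keywords are invented for illustration): a generic term like "slow" happily connects problems that have nothing to do with each other.

```python
# Minimal illustration of symptom-keyword lookup; tickets and keywords are invented.
past_tickets = {
    "T-0881": "VPN slow for remote users after firmware update",
    "T-0912": "Order-entry slow during month-end reporting run",
    "T-0955": "Wi-Fi slow in building C, later traced to a rogue access point",
}

def tickets_matching(symptom_keywords):
    """Return ticket IDs whose symptom text contains any of the given keywords."""
    hits = []
    for ticket_id, symptom in past_tickets.items():
        if any(word.lower() in symptom.lower() for word in symptom_keywords):
            hits.append(ticket_id)
    return hits

# A vague keyword matches all three unrelated problems.
print(tickets_matching(["slow"]))   # ['T-0881', 'T-0912', 'T-0955']
```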

There are enterprises that keep detailed network fault analysis records, but none of the enterprises I talked with said they were in that group, though as I noted above, many realized they needed a system to record what they saw, what they tried, and what worked. One company said they had tried using a simple text document to track a problem, for the logical reason that the tool was readily available, but said it quickly generated either long, meandering documents that spanned multiple problems or tiny isolated ones that couldn't be linked easily to a new instance of a problem, and so were never referenced.

A couple of my contacts said they wondered whether software development tools or project management tools could serve here, but they hadn’t tried to use them. They really seem at a bit of a loss with regard to how to move forward, how to modernize fault management to make networks more available and responsive to business needs. It makes you wonder why things seem to have changed so much.

Two reasons, I think. First, companies are a lot more reliant on networking these days. Even before COVID there was a gradual increase in network dependency, driven by the increased use of online customer sales and support. COVID accelerated that and introduced work-from-home (WFH), essentially remaking project team dynamics. Second, networks have changed. In the early days (which were really only 20 or 30 years ago), companies built networks from nodes and trunks. Today they rely more and more on virtual networks, and these are becoming more powerful and more abstract with the advent of SD-WAN and the increased use of the cloud. Technology may have a five-year lifespan, but human practices tend to live a lot longer.

Fault management is a piece of network operations, which brings us back to the opening point and the article on debriefings in software operations. The software world, perhaps because it's the force driving its own bus, has done a better job of managing the growing complexity of virtualized resources than the networking people have. It's obviously time to catch up.