Filling the Holes in Opex Reduction Strategies

Vendors are finally discovering the virtue of opex reduction.  Cisco has included the network sector in its overall AI/ML strategy, complementing “intent-based networks”.  Juniper’s Mist AI deal, combined with their recent acquisition of Netrounds, shows that they’re looking at more ways to apply automation, testing, monitoring, and other capabilities that qualify as operations automation.  The reason, of course, is that if operators are looking to cut costs to relieve the profit-per-bit squeeze, cutting opex might prevent them from cutting capex, which hits vendors’ own bottom lines.

Vendor desire to support operator cost reduction in places where the reduction doesn’t impact operators’ spending on the vendors is understandable.  That doesn’t make it automatically achievable, or mean that any measure that has a claim to enhancing operations would necessarily reduce opex significantly, or at all.  I’ve done a lot of modeling on opex, and this modeling suggests that there are some basic truths about opex reduction that need to be addressed any time opex enhancements are claimed as a benefit.

The biggest truth about opex is, for operators, a sad one.  Recall from prior blogs that I’ve noted that “process opex”, the cost related to customer and infrastructure operations, is already a greater percentage of revenue than capex is.  Starting in 2016 when operators moved to address the problem, and despite measures taken largely in 2018, opex costs have continued to grow even when capex has been reduced.

The biggest and saddest truth for vendors is that securing opex improvements is getting harder.  Apart from 2018, when major changes to installation and field support created a major impact on opex overall, technical improvements have only made the growth curve of opex a bit less steep.  The curve has never, even in 2018, turned back, and opex is continuing to grow in 2020.  The biggest challenge is that the one area where cost containment seems to have worked this year, network operations, is exactly where most of the announcements (like Juniper’s) are targeted.

It would be easy to say that service lifecycle automation and other forms of operations cost management are already out of gas, but that oversimplifies, even though it’s a view that could be gleaned from the data.  A “deeper glean” results in a deeper understanding, which is that while service automation is important, it has to be viewed from the top down, not from the bottom up.

If you take a kind of horizontal slice across all the categories of “process opex”, the largest contributor by far is customer care and retention, which accounts for an astonishing 40% of all opex.  You could argue that automating network operations matters primarily if it reduces customer care and retention costs, and for that to be true, you have to be able to translate the netops benefits directly into the customer domain, which even in 2020 is a disaster.  The reason we had a brief improvement in process opex in 2018 is that operators wrung human cost out of the process.  The reason it didn’t last is that the system changes were simply badly done.

Let me offer a personal example.  I recently had a Verizon FiOS outage, caused (I believe) by a construction crew cutting the fiber cable.  Having lost wireline Internet, I went to Verizon’s website to report and resolve the problem.  Thanks to that 2018 change, there was no human interaction involved; I was told to run through an automated troubleshooter, and candidly it was a truly frustrating and terrible experience.

First, the troubleshooter asked me whether I was having a power failure.  It then told me there were no known outages in my area, and had me go to the location of the optical network terminal (ONT).  Now, there was a red light on the ONT, but the troubleshooter never asked for its status.  Instead I had first to make sure the ONT had power (duh, the red light was on) and second to reboot the ONT to see if that fixed it.  It didn’t, so I was told the problem would now take a service call, along with a bunch of disclaimers about how I’d have to pay for the call if the fault turned out to be mine.  Then it asked if I still wanted to schedule a service call.  I said I did, and it told me there were no slots available for the next two weeks, so I’d have to be connected to a human.

It never connected me.  I tried to call the support number and got another (this time voice-driven) troubleshooter, which proceeded to tell me that there was an outage in my area and it would be repaired by 6 AM the following morning.  Good service?  Hardly, but the problem wasn’t service automation.

There’s an automated element at the end of the broadband connection, the ONT.  There is no reason why Verizon could not have known that some number of customers’ ONTs had gone away, and from the distribution have been able to determine the likely location of the problem.  They had my cell number, so they could have texted me to say there was an outage, that they’d have it repaired overnight, and that they were sorry for the problem.  That would have left a good taste in my mouth, and reduced the chances that I’d look for another broadband provider.  It would have unloaded their troubleshooting system too, and it would have required nothing that would qualify as a network automation enhancement.
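The proactive-notification idea above can be sketched in a few lines.  This is a hypothetical illustration, not Verizon’s actual system: the event fields (`segment`, `status`, `cell`), the function names, and the cluster threshold are all my own assumptions about how ONT loss-of-signal events might be grouped to infer a shared-plant outage and trigger a customer text.

```python
from collections import Counter

def detect_likely_outages(ont_events, min_cluster=5):
    """Group ONTs that have gone dark by their serving segment; a
    cluster of losses on one segment suggests a shared-plant outage
    (e.g., a cut fiber) rather than individual premises problems."""
    dark_by_segment = Counter(
        ev["segment"] for ev in ont_events if ev["status"] == "dark"
    )
    return [seg for seg, n in dark_by_segment.items() if n >= min_cluster]

def notify_customers(ont_events, outage_segments, send_sms):
    """Proactively text every customer on an affected segment, instead
    of waiting for them to fight through a troubleshooter."""
    for ev in ont_events:
        if ev["segment"] in outage_segments:
            send_sms(ev["cell"],
                     "We're aware of an outage in your area and expect "
                     "service to be restored overnight. We're sorry for "
                     "the disruption.")
```

The point of the sketch is that nothing here touches the network itself; it only correlates telemetry the operator already has with customer contact information it already holds.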

Start with the customer, with how to proactively notify them of problems and how to give them the correct answers and procedures.  When you’ve done that, think about what information could be provided to drive and improve that customer relationship.  When that’s done, think about what might have been done to prevent the problem.  The proverbial “cable-seeking backhoe” isn’t going to be service-automated out of existence, nor is the repair of the cable, which shows that some of the most common service problems aren’t even related to the network’s operation.  We absolutely have to fix customer care.

This doesn’t say that you don’t need service lifecycle automation, though, and there’s both a “shallow” and “deep” reason that you do.  Let’s start with the shallow one first.

Some customer problems are caused by network problems, and in particular by operator error.  A half-dozen operators out of a group of 44 told me that operator error was their largest cause of network problems, but almost 30 of them wouldn’t say what their largest cause was, so it could well be that operator error is the largest source of errors overall.  Misconfiguration lies at or near the top of the list.  Lifecycle automation, by reducing human intervention, reduces the misadventures that intervention can create.
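One common way automation attacks misconfiguration is to validate a candidate configuration against invariants before it ever touches a device.  The sketch below is illustrative only; the invariant names, the dict-based config shape, and the checks themselves are assumptions I’ve made for the example, not any particular vendor’s API.

```python
def validate_config(candidate, invariants):
    """Run every invariant check against a candidate configuration and
    return the names of the checks that failed."""
    return [name for name, check in invariants.items()
            if not check(candidate)]

# Hypothetical invariants for a router config expressed as a dict.
INVARIANTS = {
    "mtu_in_range": lambda c: 576 <= c.get("mtu", 1500) <= 9216,
    "bgp_asn_set": lambda c: c.get("bgp", {}).get("asn") is not None,
    "no_duplicate_ips": lambda c: len(c.get("interfaces", [])) ==
        len({i["ip"] for i in c.get("interfaces", [])}),
}

def apply_if_valid(candidate, apply_fn):
    """Apply the config only if every invariant passes; otherwise
    reject it before a human error reaches the network."""
    failures = validate_config(candidate, INVARIANTS)
    if failures:
        raise ValueError(f"config rejected: {failures}")
    apply_fn(candidate)
```

The design choice that matters is that the human’s role shifts from typing changes into live devices to maintaining the invariant set, which is exactly the kind of intervention-reduction the survey responses point at.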

The second, deep, reason is that in an effort to reduce capex, we’re developing a more complex infrastructure framework.  A real router is a box, and everything about managing it is inside the box.  A virtual router is a virtual box, and it still has to be managed as a box, but its hosting environment, its orchestration, and even the management processes associated with hosting, also have to be managed.  If we break our box into a chain of virtual features, we have even more things to manage.  Management costs money, both in wages and benefits, and in the errors that raise customer care and retention costs.
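The growth in managed elements can be made concrete with a toy enumeration.  The element lists below are illustrative assumptions of mine, not measured inventories; the point is only how quickly the footprint multiplies as the box is virtualized and then decomposed.

```python
# Each deployment model lists the things that must be independently
# managed; counts are illustrative, not measured.
MODELS = {
    "physical_router": ["router"],
    "virtual_router": ["router-software", "host", "hypervisor",
                       "orchestrator"],
    "service_chain_3_features": ["feature-a", "feature-b", "feature-c",
                                 "host-a", "host-b", "host-c",
                                 "hypervisor", "orchestrator",
                                 "chain-connectivity"],
}

def management_footprint(model):
    """Number of independently managed elements in a deployment model."""
    return len(MODELS[model])
```

Even this crude count shows the footprint roughly quadrupling from physical to virtual and more than doubling again for a feature chain, which is the cost pressure lifecycle automation would have to absorb.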

You can see what might be a sign of that in the operations numbers for this year.  While network operations costs are up about 16% over the last five years, IT operations costs are up over 20%.  Given that these costs are all “internal”, meaning there’s no contribution to direct customer interactions, installations, or repairs, that’s a significant shift.  Could this be an indication that greater adoption of virtualization is creating more complex infrastructure at the IT level, and that the biggest contribution service lifecycle automation could make is in controlling the increase in opex related to this increased complexity?

Then there’s the final point, which is the impact of demand density on opex and on opex reduction strategies.  We already know that wireline has a higher opex contribution than wireless, but it’s also clear that in areas where demand density (roughly, opportunity per square mile) is high, opex is lower because craft efficiency is higher and the cost of infrastructure redundancy is lower.  As demand density falls, there’s a tendency to conserve infrastructure to manage capex, which means opex tends to rise because of loss of reserve resources.  It’s possible that this factor could impact capex-centric approaches to improving profit per bit; if new-model networks are cheaper to buy and more expensive to operate, what’s the net benefit?

The fact is that you can’t let opex rise, in large part because a rise in opex is often a sign that customer care is suffering.  It’s possible that a customer-care-centric approach to operations, even without massive changes in lifecycle automation, could improve opex as much as new automated technology could.  It’s also possible that wrapping new service lifecycle automation in an outmoded customer care portal and practice set could bury any benefits the new system could generate.

My Verizon outage couldn’t have been fixed by an automated system if it was indeed caused by nearby construction.  No new systems or AI were required to do a better job of handling it, only more insight into designing the customer care portal.  I’m not saying that we should forget service lifecycle automation and focus on customer portals, but we can’t forget the latter while chasing the former.