Truth sometimes hits you in the face. So it is with NFV savings. A Light Reading story quotes Goran Car of Croatia’s Hrvatski Telekom as saying that every function they virtualize has to reduce TCO by 40%. This is particularly interesting, given that most of the ten operators who created the “Call for Action” paper on NFV back in 2012 said as early as October 2013 that even a 25% reduction in capex wasn’t worth it. “We can get that beating Huawei up on price,” was the key comment. According to my recent conversations with operators, they still feel the same way, and in fact their TCO reduction goal is much the same as that cited in the story. Enterprises, by the way, also want to see cost improvements between 32% and 45% before they’ll approve new projects, so the savings goal range is pretty universal.
Operators think that NFV is falling way short of the savings they’d hoped for, and they cite three reasons for that. We’ll look at each of them here. Reason one is that licensing the VNFs is costing much more than expected. Operators told me they had been offered VNF licenses that would have cost them at least three-quarters of what an appliance would cost. Reason number two is that operations complexity, including VNF integration, is boosting NFV opex instead of generating savings. In fact, many operators said that net TCO, including capex and opex, was actually higher in some trials. Finally, reason three is that expected capital economies of scale were not being obtained. Part of this is because so many early applications rely on uCPE boxes, dedicated to a customer/service and looking for all the world like appliances.
Every one of these issues was raised with operators in 2013, the first year of work on NFV. Obviously they didn’t get addressed, and so we need to look at each to see what could be done.
VNF licensing was certain to be a problem. Operators, like everyone in the industry, it seems, want to improve their own profit margins but are oblivious to what their measures would do to their suppliers. It was never likely that vendors with successful network appliances such as firewalls would license their software (which is, after all, their real proprietary value) for a fraction of the appliance price. Surprise, surprise, those vendors want to charge more.
There has never been any solution to this problem other than to exploit open-source software. In the summer of 2013, I suggested it was critical that NFV “can draw immediately on open-source tools without requiring forking of the projects and modification of code.” Had this been done, there would be plenty of open-source options with zero licensing cost. Note that Metaswitch, who contributed an open-source IMS implementation to the first ETSI NFV proof-of-concept, has gone on to be a successful provider of open-source elements to operators.
Another cause of VNF licensing cost problems is the high cost of integrating VNFs. Again citing the 2013 documents, the goal should have been to first define NFVi properly, then define the software framework into which VNFs would integrate. “Develop the requirements for an NFV-compliant data center, including the hardware issues that would impact performance and stability, the platform software issues, and the mechanism for virtualization” was the first of the recommendations, and “to be able to create a VNF by ‘wrapping’ any software component that exposed an interface, any hardware device, or any deployed service” was the second. These points are being addressed only now, six years after they were first raised, and even now aren’t being done optimally.
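To make that second recommendation concrete, here’s a minimal sketch of what such a wrapper contract might look like. It is purely illustrative; the names (VnfWrapper, Endpoint, the lifecycle hooks) are mine, not anything defined in the ETSI specifications.

```python
# Hypothetical sketch of the "wrap anything" idea: one uniform VNF shell
# around any software component, hardware device, or deployed service
# that exposes an interface. Names are illustrative only.
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Endpoint:
    """A network-visible interface the wrapped component exposes."""
    name: str
    protocol: str          # e.g. "http", "sip", "netconf"
    port: int


@dataclass
class VnfWrapper:
    """Presents whatever sits inside to the orchestrator through one
    uniform lifecycle contract."""
    component_id: str
    endpoints: Dict[str, Endpoint] = field(default_factory=dict)
    start: Callable[[], None] = lambda: None
    stop: Callable[[], None] = lambda: None
    health: Callable[[], bool] = lambda: True

    def describe(self) -> dict:
        # What orchestration and management see, regardless of what
        # is actually inside the wrapper.
        return {
            "id": self.component_id,
            "endpoints": {n: vars(e) for n, e in self.endpoints.items()},
            "healthy": self.health(),
        }
```

The point is that deployment and management tools would program against the wrapper contract alone, so any component that could be wrapped would be a VNF without custom integration.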
Our second reason for insufficient NFV savings is an operations cost overrun. There are two sources of operations complexity and cost in an NFV deployment. One is the specific cost associated with deployment and maintenance of the VNFs as independent functional elements, which includes deployment, scaling, redeployment, diagnosis of problems, and so forth. The other is the cost associated with the management of the service lifecycle overall. For NFV to be successful, the first of these opex cost sources would have to trend toward zero, since VNF operations are incremental to the operation of networks that had no VNFs. If, in addressing VNF operations costs, overall service lifecycle operations costs could also be addressed, then opex reduction would contribute directly to lower TCO.
From that summer of 2013 document, the relevant goal: “Establish the optimum way of defining management practices as “co-functions” of the service, composed in the same way and at the same time, and representing the service and its elements in the optimum way based on one or more existing sets of operations practices, or in a new way optimized for the virtual world. The functionality created by combining VNFs defines not only how a service works but how it should/must be managed, and the only way to harmonize management with functionality and at the same time create a link between virtualized elements and real services is to link them at the time of deployment by composing both side-by-side.”
Build management as you build services. Seems logical, but instead what was done was to presume virtualization only built devices, meaning that NFV created virtual equivalents of physical boxes, which were then networked and managed in the way they always were. This meant that all the VNF-specific management tasks had to take place with little or no knowledge of the state of the rest of the service, and that NFV’s management tools could not reduce opex overall. The best they could do was ensure it didn’t increase, and then only if we presumed that VNF management as a task consumed no resources and cost no money.
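To illustrate the “co-function” idea in the quote above, here is a hedged sketch of composing management alongside functionality at deployment time. All of the names (ServiceElement, ManagementCoFunction, the image string) are hypothetical, used only to show the side-by-side composition.

```python
# Illustrative sketch of "build management as you build services": every
# functional element is deployed together with a management co-function
# bound to it at composition time, not bolted on afterward.
from dataclasses import dataclass
from typing import List


@dataclass
class ManagementCoFunction:
    target: str                 # the element this co-function manages
    probes: List[str]           # what it watches (state, KPIs, faults)
    remediations: List[str]     # what it may do (rescale, redeploy, alert)


@dataclass
class ServiceElement:
    name: str
    vnf_image: str
    management: ManagementCoFunction


def compose_service(elements: List[ServiceElement]) -> dict:
    """Compose functionality and management side by side, so each element
    is deployed already knowing how it is to be managed."""
    return {
        e.name: {"deploy": e.vnf_image, "manage": vars(e.management)}
        for e in elements
    }


# Example: a firewall element whose management is defined with it.
svc = compose_service([
    ServiceElement(
        name="edge-firewall",
        vnf_image="oss/firewall:1.0",          # hypothetical image name
        management=ManagementCoFunction(
            target="edge-firewall",
            probes=["cpu", "session-count", "fault-events"],
            remediations=["scale-out", "redeploy"],
        ),
    ),
])
```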
The opex failure is the biggest problem with NFV’s business case. It started with a lack of an effective abstraction of NFVi and the software structure into which VNFs would plug. It was exacerbated by a decision to target NFV at virtualizing only devices, not services, and it reached its peak in insulating virtualization from the operations lifecycle, and vice versa. If this issue isn’t addressed, there is no chance NFV could be broadly useful.
The last of the issues was the failure of NFV to establish expected economies of scale. The problem here is due to a focus on a very limited but easily understood use case, the “virtual CPE” use case. This, in turn, evolved in part from that “virtual-box” focus I mentioned earlier. What kind of “box” is present in the greatest numbers in any network? Answer: the premises device that acts as the termination of the service, the “customer premises equipment” piece. That’s vCPE in NFV terms.
The problem is that the most numerous kind of CPE is the broadband hub that terminates residential and small-business services. These devices are available to operators for less than $50, and some report prices as low as $35. They include a firewall, a DHCP server, a WiFi base station, and more. Be kind and say there are two or three virtual functions inside. Two or three VNFs, meaning two or three VMs. You need “service chaining” (how much time in NFV has been wasted on that!). All this to replace a device that will probably be installed for five to six years, for an annualized cost of perhaps ten bucks.
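Rough arithmetic with those numbers makes the point:

\[
\frac{\$50}{5\ \text{years}} = \$10\ \text{per year}, \qquad
\frac{\$35}{6\ \text{years}} \approx \$5.80\ \text{per year}.
\]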
Opex alone would eat the ten bucks, and then you have the cost of the NFVi. “I lose money on every deal but I make it up in volume,” is the old sales joke. It’s no joke here. You can’t have economy of scale when the base cost of what you’re replacing is so low you can’t even support the NFV equivalent at the price. And if you build vCPE on premises uCPE boxes, there’s no cloud infrastructure on which to develop capital or operations economies of scale.
The first NFV proof-of-concept in the summer of 2013 used Metaswitch’s Project Clearwater open-source IMS as the target, because its VNFs were cloud-ready and could thus be hosted on carrier cloud infrastructure, where both capital and operations economies of scale could be measured. The focus of most PoC work, though, was vCPE, and it’s no surprise that’s where most early NFV deployments have been focused too.
Why, if all these problems were raised, and proposals to solve them were presented, did we do NFV with no solutions for them, and thus leave them to bedevil operators today? Part of the problem was the inevitable bias introduced into any standards group by money. Vendors can spend massive budgets on these bodies because they can use them to promote their views and perhaps seize some advantage by pushing things in directions that favor their own solutions. This kind of pressure tends first to encourage the body to take the easy path (because to vendors, it’s the profits in the next quarter that drive their stock price), and second to create a diffusion of strategy, because a standard approach doesn’t support vendor differentiation. If all firewalls are interchangeable, then what’s to keep somebody from changing mine out?
So we have three totally avoidable problems with NFV impacting the TCO. We didn’t avoid them, so can they be fixed? Yes, but.
The easiest thing to fix would be the open-source VNF problem. What operators would need to do is to first establish the software framework I mentioned in relation to operations cost overruns, then contribute developer resources to the open-source projects for the main VNF classes (like firewalls) to make them compatible with the established framework. This sort of thing wouldn’t disrupt current NFV specifications, only augment them.
The next-easiest thing to fix would be the economy-of-scale problem. I believe that the preponderance of vCPE projects is partly or even largely driven by opportunistic factors like vendor support or easy validation of the effort put into NFV so far. Operators need to accept that without cloud hosting, efficient cloud hosting, NFV fails. They need to prioritize support for projects that are actually cloud-hosted, not those based on uCPE. This could be done fairly easily, though making true cloud-native VNFs happen goes back to that notion of a software framework.
The hardest thing to fix, sadly, is the operations problem that includes the need for that software framework. We do have service models (in TOSCA, usually) that can define full end-to-end, multi-technology services. These would have to be augmented in two ways to make them effective as solutions to the operations problem.
The first way is to build that software framework for VNFs, and also for NFVi. Good network function software has to be an interplay of abstractions—feature abstractions hosted on infrastructure abstractions. We have neither of the two now, and we need both of them. The feature abstractions would create the “plugins” that would match VNFs to baseline NFV MANO elements, and the NFVi abstractions would make any hosting environment look like a “resource pool”.
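As a rough illustration of what those two abstractions might look like in software, here is a hedged sketch. The interfaces and method names (FeatureAbstraction, InfrastructurePool, allocate) are assumptions of mine, not anything defined in NFV MANO; the point is simply that VNFs would program against the first contract and every NFVi would hide behind the second.

```python
# Hypothetical sketch of the two abstraction layers: a feature abstraction
# the orchestrator binds VNFs through, and an infrastructure abstraction
# that makes any hosting environment look like a generic resource pool.
from abc import ABC, abstractmethod


class FeatureAbstraction(ABC):
    """What a VNF looks like to the rest of NFV: the 'plugin' contract."""

    @abstractmethod
    def deploy(self, pool: "InfrastructurePool") -> None: ...

    @abstractmethod
    def scale(self, instances: int) -> None: ...

    @abstractmethod
    def status(self) -> str: ...


class InfrastructurePool(ABC):
    """What any NFVi looks like to a feature: a generic resource pool."""

    @abstractmethod
    def allocate(self, vcpu: int, memory_gb: int) -> str:
        """Return an opaque host handle; it could be a VM, a container,
        or bare metal, and the feature never needs to know which."""

    @abstractmethod
    def release(self, handle: str) -> None: ...
```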
The second way is to integrate lifecycle management into service models using state/event tables. A service is a set of cooperative functional elements. Each element eventually decomposes into a set of resource commitments, and the manner of this decomposition can be (and in fact must be) represented by a hierarchical model, something that looks like an organization chart. Each element in the model is an independent state machine, accepting events from its neighbors, generating events to its neighbors, and interpreting those events based on its own process state. You can model lifecycle automation this way, and I submit it’s hard to model it any other way.
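A minimal sketch of what that looks like in code, assuming nothing beyond the description above; the states, events, and handler names are illustrative:

```python
# Model-driven lifecycle automation: each element of a hierarchical service
# model is an independent state machine driven by a state/event table.
from typing import Callable, Dict, List, Optional, Tuple

# (current state, event) -> (next state, process to run)
StateEventTable = Dict[Tuple[str, str], Tuple[str, Callable[["ModelElement"], None]]]


class ModelElement:
    """One node in the hierarchical service model: an independent state machine."""

    def __init__(self, name: str, table: StateEventTable,
                 children: Optional[List["ModelElement"]] = None):
        self.name = name
        self.state = "ordered"
        self.table = table
        self.children = children or []
        self.parent: Optional["ModelElement"] = None
        for child in self.children:
            child.parent = self

    def handle(self, event: str) -> None:
        key = (self.state, event)
        if key not in self.table:
            return                      # event has no meaning in this state
        next_state, process = self.table[key]
        self.state = next_state
        process(self)

    # Elements signal only their neighbors (parent and children),
    # never arbitrary parts of the model.
    def notify_parent(self, event: str) -> None:
        if self.parent:
            self.parent.handle(event)

    def notify_children(self, event: str) -> None:
        for child in self.children:
            child.handle(event)


def deploy(elem: ModelElement) -> None:
    # A real process would commit resources here. An interior element
    # cascades activation downward; a leaf signals itself that it's done.
    if elem.children:
        elem.notify_children("activate")
    else:
        elem.handle("deployed")


def report_active(elem: ModelElement) -> None:
    elem.notify_parent("child-active")


# Illustrative fragment of a state/event table.
TABLE: StateEventTable = {
    ("ordered", "activate"): ("activating", deploy),
    ("activating", "deployed"): ("active", report_active),
    ("activating", "child-active"): ("active", report_active),
}

# A two-level model: a service element decomposing into one VNF element.
service = ModelElement("vpn-service", TABLE,
                       children=[ModelElement("edge-vnf", TABLE)])
service.handle("activate")      # both elements end in the "active" state
```

Because every element interprets events only against its own table, management behavior is composed with the model itself, which is the link between virtualized elements and real services the 2013 goal called for.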
None of this is technically difficult, but it’s apparently politically difficult. NFV proponents don’t like to toss out a lot of what they’ve done, or let changes expose the truth that it wasn’t the optimum approach. Vendors who have pushed things in their favor don’t want things to push them back. But every day we let this go is a day we move NFV further from the truth, further from the point where remedy is practical, or even possible. Those who want NFV to succeed need to get on with the task of fixing the problems operators are now clearly recognizing. After all, it’s those operators who have to be convinced to deploy it.