Why (and How) Infrastructure Managers are Critical in NFV Management

In a number of recent blogs I’ve talked about the critical value of intent modeling to NFV.  I’d like to extend that notion to the management plane, and show how intent modeling could bridge NFV, network management, and service operations automation into a single (hopefully glorious) whole.

In the network age, management has always had a curious dualism.  Network management has always been drawn as a bottom-up evolution from “element” management of single devices, through “network” management of collective device communities, to “service” management as the management of cooperative-device-based experiences.  But at the same time, “service” management, in the sense of service provider operations, has always started with the service and then dissected it into “customer-” and “resource-facing” components.

The unifying piece to the management puzzle is the device, which has always been the root of whatever you were looking at in terms of a management structure.  Wherever you start, management comes down to controlling the functional elements of the network.  Except that virtualization removes “devices” and replaces them with distributed collections of functions.

The thing is, a virtual device (no matter how complicated it is functionally) is a black box that replicates the real device it was modeled on.  If you look at a network of virtual devices “from the top” of the management stack (including from the service operations side) you’d really want to see the management properties of the functionality, not the implementation.  From the bottom where your responsibility is to dispatch a tech to fix something, you’d need to see the physical stuff.

This picture illustrates the dichotomy of virtualization management.  You still have the top-down and bottom-up management orientations, but they don’t meet cleanly because the resource view and the service view converge not on something real but rather on something abstract.

If we visualize our virtual device as an intent model, it has all the familiar properties.  It has functionality, it has ports, and it has an SLA manifested in a MIB-like collection of variables.  You could assemble a set of intent-modeled virtual devices into a network and you could then expect to manage it from the top as you’d managed before.  From the bottom, you’d have the problem of real resources that used to be invisible pieces of a device now being open, connected service elements.  The virtual device is then really a black box with a service inside.

Might there be a better way of visualizing networks made up of virtual devices?  Well, one property of a black box is that you can’t see inside it.  Why then couldn’t you define “black boxes” representing arbitrary useful functional pieces that had nothing to do with real network devices at all?  After all, if you’re virtualizing everything inside a black box, why not virtualize a community of functions rather than the functions individually?

This gives rise to a new structure for networks.  We have services, which divide into “features”.  These features divide into “behaviors” that represent cooperative activities of real resources, and the behaviors in turn decompose into the real resources.  In effect, what this would do is create a very service-centric view of “services”, meaning a functional view rather than one based on how resources are assembled.  The task of assembling resources goes to the bottom of the stack.  All composition is functional, and when you’ve decomposed your composition to deploy it, you enter the structural domain at the last minute.

This approach leads to a different view of management, because if you assemble intent models to do something you have to have intent-model SLAs to manage against, but they have to be related somehow to those black-box elements inside them.  To see how this could work, let’s start by drawing a block that’s an intent model for Service X.  Alongside it, we have one of those cute printer-listing graphics that represents the MIB variables that define the intent model’s behavior—what it presents to an outside management interface.

But where do they come from?  From subordinate intent models.  We can visualize Services A and B “below” Service X, meaning that they are decomposed from it.  The variables for Services A and B must then be available to Service X for use in deriving its own variables.  We might have a notion that Service X Availability equals Service A Availability plus Service B Availability (simplified example, I realize!).  That means that if Services A and B are also black boxes that contain either lower-level intent models or real resources, the SLA variables for these services are then derived from their subordinate elements.
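
To make that derivation concrete, here’s a minimal Python sketch, purely illustrative and not any vendor’s or standard’s implementation, of an intent model whose SLA variables are computed by an active expression over its direct subordinates, using the simplified availability example above:

```python
# A purely illustrative sketch (not any vendor's or standard's implementation)
# of an intent model whose SLA/MIB variables are derived, via an "active
# expression", from the variables of its direct subordinates only.

class IntentModel:
    """A black-box function: a name, its own variables, and subordinate models."""
    def __init__(self, name, variables=None, subordinates=None, active_expr=None):
        self.name = name
        self.variables = dict(variables or {})      # this model's SLA/MIB variables
        self.subordinates = list(subordinates or [])
        self.active_expr = active_expr              # maps child variables to ours

    def sla(self):
        if self.active_expr:
            child_vars = {c.name: c.sla() for c in self.subordinates}
            self.variables.update(self.active_expr(child_vars))
        return dict(self.variables)

# The (admittedly simplified) availability example from the text.
service_a = IntentModel("ServiceA", {"Availability": 0.49})
service_b = IntentModel("ServiceB", {"Availability": 0.49})
service_x = IntentModel(
    "ServiceX",
    subordinates=[service_a, service_b],
    active_expr=lambda kids: {"Availability": kids["ServiceA"]["Availability"]
                              + kids["ServiceB"]["Availability"]},
)

print(service_x.sla())   # {'Availability': 0.98}
```

Service X’s expression sees only A and B; whatever is inside them stays opaque, which is exactly the black-box property the approach depends on.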

This notion of service composition would be something like virtual-device composition except that you don’t really try to map to devices but to aspects of services.  It’s more operations-friendly because aspects of services are what you sell to customers.  I would argue that in the world of virtualization, it’s also more resource-management friendly because the relationship between resource state (as reflected in resource variables) and the service state it rolls up into is explicit.

How would you compose a service?  From intent models, meaning from objects representing functions.  These objects would have “ports”, variables/SLAs, and functionality.  The variables could include parameter values and policies that could be set for the model, and those would then propagate downward to eventually reach the realizing resources.  Any set of intent models that had the same outside properties would be equivalent, so operators could define “VPN” in terms of multiple implementation approaches and substitute any such model where VPN is used.  You could also decompose variably based on policies passed down, so a VPN delivered in a city without virtualization could be realized on legacy infrastructure.
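
As a hypothetical illustration of that substitution idea, the sketch below holds two implementations of the same “VPN” model in a catalog and lets a policy passed downward pick between them; all of the names are invented:

```python
# A hypothetical catalog sketch: any implementation with the same outside
# properties can stand in for the "VPN" intent model, and a policy passed
# downward decides which one is used.  All names are invented.

CATALOG = {
    "VPN": [
        {"impl": "vpn-nfv-chain",   "requires": {"virtualization": True}},
        {"impl": "vpn-legacy-mpls", "requires": {"virtualization": False}},
    ]
}

def decompose(model_name, site_policy):
    """Pick the first cataloged implementation whose requirements the site meets."""
    for candidate in CATALOG[model_name]:
        if all(site_policy.get(k) == v for k, v in candidate["requires"].items()):
            return candidate["impl"]
    raise LookupError(f"no implementation of {model_name} fits {site_policy}")

# A city without NFV data centers realizes the same "VPN" model on legacy gear.
print(decompose("VPN", {"virtualization": False}))   # vpn-legacy-mpls
print(decompose("VPN", {"virtualization": True}))    # vpn-nfv-chain
```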

In this approach, interoperability and interworking are at the intent-model level.  Any vendor who could provide the implementation of an intent model could provide the resources to realize it, so the intent-model market could be highly competitive.  Management of any intent model is always the same because the variables of that model are the same no matter how it’s realized.

The key to making this work is “specificational” in nature.  First, you have to define a set of intent models that represent functionally useful service components.  We have many such today, but operators could define more on their own or through standards bodies.  Second, you have to enforce a variable-name convention for each model, and create an active expression that relates the variables of a model to the variables generated by its subordinates (or internal structure).  This cannot be allowed to go further than an adjacent model because it’s too difficult to prevent brittle structures or consistency problems when you dive many layers down to grab a variable.   Each black box sees only the black boxes directly inside it; the deeper ones are opaque as they should be.

Now you can see how management works.  Any object/intent model can be examined like a virtual device.  The active expressions linking superior and subordinate models can be traversed upward to find impact or downward to find faults.  If it’s considered useful, it would be possible to standardize the SLAs/MIBs of certain intent models and even standardize active flows that represent management relationships.  All of that could facilitate plug-and-play distribution of capabilities, and even federation among operators.
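
A toy sketch of that traversal, with invented element names, just to show that the upward (impact) and downward (fault) walks are simple operations over the model-to-model links:

```python
# A toy sketch of traversing the links between superior and subordinate
# models: upward from a failed element to the services it affects, downward
# from a degraded service to the likely culprits.  Names are invented.

PARENTS  = {"BehaviorVPNCore": ["FeatureVPN"], "FeatureVPN": ["ServiceX"]}
CHILDREN = {"ServiceX": ["FeatureVPN"], "FeatureVPN": ["BehaviorVPNCore"]}

def walk(start, links):
    """Collect everything reachable from start, one adjacent layer at a time."""
    found, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for nxt in links.get(node, []):
            if nxt not in found:
                found.add(nxt)
                frontier.append(nxt)
    return found

print(walk("BehaviorVPNCore", PARENTS))   # impact: {'FeatureVPN', 'ServiceX'}
print(walk("ServiceX", CHILDREN))         # faults: {'FeatureVPN', 'BehaviorVPNCore'}
```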

We may actually be heading in this direction.  Both the SDN and NFV communities are increasingly accepting of intent models, and an organized description of such a model would IMHO have to include both the notion of an SLA/MIB structure and the active data flows I’ve described.  It’s a question of how long it might take.  If we could get to this point quickly we could solve both service and network management issues with SDN and NFV and secure the service agility and operations efficiency benefits that operators want.  Some vendors are close to implementing something like this, too.  It will be interesting to see if they jump out to claim a leading position even before standards groups get around to formalizing things.  There’s a lot at stake for everyone.

Why NFV’s VIMs May Matter More than Infrastructure Alone

Everyone knows what MANO means to NFV and many know what NFVI is, but even those who know what “VIM” stands for (Virtual Infrastructure Manager) may not have thought through the role that component plays and how variations on implementation could impact NFV deployment.  There are a lot of dimensions to the notion, and all of them are important.  Perhaps the most important point about “VIMs” is that how they end up being defined will likely set the dimensions of orchestration.

In the ISG documents, a VIM is responsible for the link between orchestration and management (NFVO and VNFM, respectively) and the infrastructure (NFVI).  One of the points I’ve often made is that a VIM should be a special class of Infrastructure Manager, in effect a vIM.  Other classes of IM would represent non-virtualized assets, including legacy technology.

The biggest open question about an IM is its scope, meaning how granularly NFVI is represented.  You could envision a single giant VIM representing everything (which is kind of what the ETSI material suggests) or you could envision IMs that represented classes of gear, different data centers, or even just different groups of servers.  There are two reasons why IM scope is important: competitive reasons and orchestration reasons.

Competitively, the “ideal” picture for IMs would be that there could be any number of them, each representing an arbitrary collection of resources.  This would allow an operator to use any kind of gear for NFV as long as the vendor provided a suitable VIM.  If we envisioned this giant singular IM, then any vendor who could dominate either infrastructure or the VIM-to-orchestration-and-management relationship would be able to dictate the terms through which equipment could be introduced.

The flip-side issue is that if you divide up the IM role, then the higher-layer functions have to be able to model service relationships well enough to apportion specific infrastructure tasks to the correct IM.  Having only one IM (or vIM) means that you can declare yourself as having management and orchestration without actually having much ability to model or orchestrate at all.  You fob off the tasks to the Great VIM in the Sky and the rest of MANO is simply a conduit to pass requests downward to the superIM.

I think this point is one of the reasons why we have different “classes” of NFV vendor.  The majority do little to model and orchestrate, and thus presume a single IM or a very small number of them.  Most of the “orchestration” functionality ends up in the IM by default, where it’s handled by something like OpenStack.  OpenStack is the right answer for implementing vIMs, but it’s not necessarily helpful for legacy infrastructure management and it’s certainly not sufficient to manage a community of IMs and vIMs.  The few who do NFV “right” IMHO are the ones who can orchestrate above multiple VIMs.

You can probably see that the ability to support, meaning orchestrate among, multiple IMs and vIMs would be critical to achieving full service operations automation.  Absent the ability to use multiple IMs you can’t accommodate the mix of vendors and devices found in networks today, which means you can’t apply service operations automation except in a green field.  That flies in the face of the notion that service operations automation should lead us to a cohesive NFV future.

Modeling is the key to making multiple IMs work, but not just modeling at the service level above the IM/vIM.  The only logical way to connect IMs to management and orchestration is to use intent models to describe the service goal being set for the IM.  You give an IM an intent model and it translates the model based on the infrastructure it supports.  Since I believe that service operations automation itself demands intent modeling above the IM, it’s fair to wonder what exactly the relationship between IMs/vIMs and management and orchestration models would be.
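
Here’s a hedged sketch of what that hand-off could look like, with hypothetical IM classes standing in for an OpenStack-based vIM and a legacy-gear IM; the interface shown is illustrative, not anything ETSI has defined:

```python
# A hedged sketch of "give an IM an intent model and let it translate it".
# OpenStackVIM and LegacyEMSIM are placeholders, not real integrations, and
# the interface is illustrative rather than anything ETSI has defined.

from abc import ABC, abstractmethod

class InfrastructureManager(ABC):
    @abstractmethod
    def realize(self, intent: dict) -> str:
        """Translate an intent model into actions on this IM's own resources."""

class OpenStackVIM(InfrastructureManager):
    def realize(self, intent):
        # A real vIM would drive something like Nova/Neutron; we just report the plan.
        return f"boot VMs for {intent['function']} in {intent['site']}"

class LegacyEMSIM(InfrastructureManager):
    def realize(self, intent):
        # A legacy IM would push EMS or NETCONF configuration instead.
        return f"provision {intent['function']} on installed gear in {intent['site']}"

# Orchestration above the IMs only decides which IM owns the intent.
IM_REGISTRY = {"dc-east": OpenStackVIM(), "metro-west": LegacyEMSIM()}

def dispatch(intent):
    return IM_REGISTRY[intent["site"]].realize(intent)

print(dispatch({"function": "vFirewall", "site": "dc-east"}))
print(dispatch({"function": "VLAN-access", "site": "metro-west"}))
```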

My own work on this issue, going back to about 2008, has long suggested that there are two explicit “domains”, service and resource.  This is also reflected in the TMF SID, with customer-facing and resource-facing service components.  The boundary between the two isn’t strictly “resources”, though—at least not as I’d see it.  Any composition of service elements into a service would likely, at the boundaries, create a need to actually set up an interface or something.  To me, the resource/service boundary is administrative—it’s functional versus structural within an operator.  Customer processes, being service-related, live on the functional/service side, and operator equipment processes live on the resource side.

Resource-side modeling is a great place to reflect many of the constraints (and anti-constraints) that the ISG has been working on.  Most network cost and efficiency modeling would logically be at the site level, not the server level, so you might gain a lot of efficiency by first deciding what data centers to site VNFs in, then dispatching orders to the optimum ones.  This would also let you deploy multiple instances of things like OpenStack or OpenDaylight, which could improve performance.
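
A toy example of that two-stage idea, assuming invented sites and costs: pick the data center on site-level criteria first, then hand the order to that site’s own IM:

```python
# A toy illustration of site-level-first placement: pick the cheapest data
# center that meets the constraints, then hand the order to that site's own
# IM (which could run its own OpenStack or OpenDaylight instance).
# The sites, costs, and constraints are all invented.

SITES = [
    {"name": "dc-east",  "cost_per_vnf": 4.0, "latency_ms": 12, "free_hosts": 80},
    {"name": "dc-west",  "cost_per_vnf": 3.2, "latency_ms": 35, "free_hosts": 10},
    {"name": "dc-metro", "cost_per_vnf": 5.5, "latency_ms": 6,  "free_hosts": 200},
]

def pick_site(max_latency_ms, hosts_needed):
    eligible = [s for s in SITES
                if s["latency_ms"] <= max_latency_ms and s["free_hosts"] >= hosts_needed]
    if not eligible:
        raise LookupError("no site satisfies the constraints")
    return min(eligible, key=lambda s: s["cost_per_vnf"])

site = pick_site(max_latency_ms=20, hosts_needed=5)
print(f"dispatch the order to {site['name']}'s IM")   # dc-east wins on cost
```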

Customer-based or customer-facing services are easy to visualize; they would be components that are priced.  Resource-facing services would likely be based on exposed management interfaces and administrative/management boundaries.  The boundary point between the two, clear in this sense, might be fuzzy from a modeling perspective.  For example, you might separate VLAN access services by city as part of the customer-facing model, or do so in the resource-facing model.  You could even envision decomposition of a customer-facing VLAN access service into multiple resource-facing ones, one for each city involved, based on what infrastructure happened to be deployed there.

From this point, it seems clear that object-based composition/decomposition could take place on both sides of the service/resource boundary, just for different reasons.  As noted earlier, most operators would probably build up resource-facing models from management APIs—if you have a management system domain then that’s probably a logical IM domain too.  But decomposing a service to resources could involve infrastructure decisions different from decomposing a service to lower-level service structures.  Both could be seen as policy-driven but different policies and policy goals would likely apply.

I think that if you start with the presumption that there have to be many Infrastructure Managers, you end up creating a case for intent modeling and the extension of these models broadly in both the service and resource domain.  At the very bottom you have things like EMSs or OpenDaylight or OpenStack, but I think that policy decisions to enforce NFV principles should be exercised above the IM level, and IMs should be focused on commissioning their own specific resources.  That creates the mix of service/resource models that some savvy operators have already been asking for.

A final point to consider in IM/vIM design is serialization of deployment processes.  You can’t have a bunch of independent orchestration tasks assigning the same pool of resources in parallel.  Somewhere you have to create a single queue in which all the resource requests for a domain have to stand till it’s their turn.  That avoids conflicting assignments.  It’s easy to do this if you have IM/vIM separation by domain, but if you have a giant IM/vIM, somewhere inside it will have to serialize every request made to it, which makes it a potential single point of processing (and failure).
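
A minimal sketch of per-domain serialization, assuming a simple in-memory resource count; the point is only that each domain has a single queue (here, a lock) while separate domains proceed in parallel:

```python
# A minimal sketch of per-domain serialization: each IM domain owns one lock,
# so concurrent orchestration tasks queue up per domain instead of grabbing
# the same pool in parallel.  Domain names and counts are invented.

import threading
from collections import defaultdict

DOMAIN_LOCKS = defaultdict(threading.Lock)
FREE_HOSTS = {"dc-east": 3}

def assign_host(domain):
    # Requests for one domain serialize here; other domains proceed in parallel.
    with DOMAIN_LOCKS[domain]:
        if FREE_HOSTS[domain] > 0:
            FREE_HOSTS[domain] -= 1
            return True
        return False

results = []
threads = [threading.Thread(target=lambda: results.append(assign_host("dc-east")))
           for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results.count(True), "of 5 requests granted")   # never more than the 3 free hosts
```

With one giant IM, every request in the network funnels through that single serialization point.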

Many of you are probably considering the fact that the structure I’m describing might contain half-a-dozen to several dozen models, and will wonder about the complexity.  Well, yes, it is a complex model, but the complexity arises from the multiplicity of management systems, technologies, vendors, rules and policies, and so forth.  And of course you can do this without my proposed “complex” model, using software that I can promise you as a former software architect would be much more complex.  You can write good general code to decompose models.  To work from parameters and data files and to try to anticipate all the new issues and directions without models?  Good luck.

To me, it’s clear that diverse infrastructure—servers of different kinds, different cloud software, different network equipment making connections among VNFs and to users—would demand multiple VIMs even under the limited ETSI vision of supporting legacy elements.  That vision is evolving and expanding, and with it the need to have many IMs and vIMs.  Once you get to that conclusion, then orchestration at the higher layer is more complicated and more essential, and models are the only path that would work.

Why is Network-Building Still “Business as Usual?”

If we tried to come up with a phrase that captured the carrier directions expressed so far in their financials and those of the prime network vendors, a good suggestion would be “business as usual.”  There’s been no suggestion of major suppression of current capital plans, no indications of shifts in technology that might signal a provocative change in infrastructure planning.  We are entering the technology planning period for the budget cycle of 2016, a critical one in validating steps to reverse the revenue/cost-per-bit crunch operators predict.  Why isn’t there something more going on?

It’s perfectly possible that one reason is that operators were being alarmists with their 2017 crossover prediction.  Financial analysts and hedge funds live quarter to quarter, but it seems pretty likely to me that they’d be worried at least a little if there were a well-known crisis in the offing.  Somebody might take flight and cause a big dip in stock prices.  But I think it’s a bit more likely than not that the 2017 consensus date for a revenue/cost crunch is as good an estimate as the operators could offer.

Something that ranges over into the certainty area is that operators are responding by putting vendors under price pressure and buying more from Huawei, the price leader.  Except in deals involving US operators, where Huawei isn’t a player, we’ve seen most vendors complain of pricing pressures and at least a modest slowing of deals.  Ciena said as much yesterday on their call, though they say it’s not a systemic trend but rather a timing issue for a couple of players.

Another almost-sure-thing reason is that the operations groups that do the current network procurements haven’t been told to do much different.  VPs of ops told me, when I contacted them through the summer, that they were not much engaged in new SDN or NFV stuff at this point.  As they see it, new technology options are still proving out in the lab (hopefully).  Their focus is more on actual funded changes like enhancements to mobile infrastructure.

The question, the big one, is whether one reason operators are staying the course is that it’s the only course they have.  We’ve talked, as an industry, about massive changes in network infrastructure but for all the talk it’s hard to define just what a next-gen infrastructure would look like.  Harder, perhaps, to explain how we’d fund the change-over.

That’s the real point, I believe, because in our rush to endorse new network technologies we’ve forgotten a message from the past.  The notion of transformation of telecom infrastructure isn’t new.  We had analog telephony, then digital and TDM, and then the “IP convergence”.  What would we see if we looked back to the past and asked how the changes came about, particularly that last one to IP?

Packet networking was proposed in a Rand Corporation study in 1966, and we had international standard packet protocols and the OSI model just a decade later.  We also had the foundations of the Internet.  None of the stuff that evolved in that period was intended as a replacement for TDM.  That role was envisioned for Asynchronous Transfer Mode, or ATM.

The theory behind ATM at the technical level isn’t relevant here, so I’ll just summarize it.  You break down information into “cells” that are small enough so that the delay you’d experience waiting for a cell to be sent or received is small.  That lets you jump priority stuff in front of that which isn’t a priority, which lets you mingle time-sensitive stuff like voice (or video) with data.  This, in turn, lets you build a common network for all traffic types.  ATM was actually intended to replace the public network, designed for it in fact, and there was an enormous wave of interest in ATM.  I know because I was part of it.

I learned something from ATM, not from its success but from its failure.  There was nothing technically wrong with ATM.  There was nothing wrong with the notion that a single converged network would be more economical as the foundation of a shift of consumer interest from voice to data. The problem was that the transition to ATM was impractical.  Wherever you start with ATM, you deliver technology without community.  You can’t talk ATM with somebody because early deployment would be unlikely to involve both you and your desired partners in communication.  You needed to toss out the old and put in the new, and that’s a very hard pill to swallow for operators.

Why did IP then win?  It wasn’t technical superiority.  It won because it was pulled through by a service—the Internet.  Operators wanted consumer data, and the Internet gave it to them.  The revenue potential of the Internet could fund the deployment of what was then an overlay network based on IP.  More Internet, more IP, until we reached the point where we had so much IP that it became a viable service framework, a competitor to what had previously been its carrier technology—TDM.  We got to IP through the revenue subsidies of the Internet.

What revenue funds the currently anticipated infrastructure transformation?  We don’t have a candidate that has that same potential.  The benefits of SDN or NFV are subtle, and we have no history as an industry in exploiting subtle benefits, or even harnessing them.  That means, in my view, that we either have to find some camel’s-nose service to pull through the change as the Internet did for IP, or we have to learn to systematize the change.  I’ve offered an example of both in recent blogs.

IoT and agile cloud computing could both be candidates for the camel role.  We could gain almost a trillion dollars in revenues worldwide from these services.  We’re slowly exploiting the cloud already, and while it would help if we had a realistic understanding of where we’re going with it, we’ll eventually muddle into a good place.  IoT is more complicated because we have absolutely no backing for a truly practical model, but I think eventually it will happen too.

That “eventually” qualifier is the critical one here.  We probably can’t expect any new service to take off as fast as the Internet did, and even the Internet took a decade or more to socialize IP to infrastructure-level deployment.  My point with the notion of service operations automation is that we could do better.  If we build, through a combination of cloud features for infrastructure and enlightened software data modeling, a petri dish with an ideal growth medium in it, we could build many new services and attract many new revenue sources.  This could then drive the evolution of infrastructure forward as surely as one giant camel could have, and a lot faster.

Consumerism has blinded us to a reality, which is the reality of justification.  I buy a new camera not because I need one but because I want it.  That works for discretionary personal expenses up to a point, but it’s not likely the financial industry would reward, or even tolerate, a decision by operators to deploy SDN or NFV for no reason other than it was nice and shiny and new.  It’s my belief that we can accelerate change by understanding how it has to be paid for.  A useless technology will meet no financial test, but both SDN and NFV can be justified.  If we want earnings calls that cite explosive deployment growth in these new things, we’ll have to accept the need for that justification and get working on making it happen.

The Technical Steps to Achieve Service Operations Automation

If the concept of service operations automation is critical to NFV success and the NFV ISG doesn’t describe how to do it, how do you get it done?  I commented in an earlier blog that service operations could be orchestrated either within the OSS/BSS or within MANO.  The “best” place might establish where to look for NFV winners, so let’s explore the question further.

Service operations gets mired in legacy issues in the media.  We hear about OSS/BSS and we glaze over because we know the next comments are going to be about billing systems and order portals and so forth.  In truth, the billing and front-ending of OSS/BSS is fairly pedestrian, and there are plenty of implementations (proprietary and open-source).  These can be refined and tuned, but they can’t automate the service process by themselves.

Operations automation of any sort boils down to automating the response of operations systems to events.  These events can be generated within the service framework itself, in the form of a move, add, or change for example, or in the resource framework, meaning they’d pop up as a result of conditions that emerged during operation.

Automating event-handling has been considered for some time.  As far as I can determine, the seminal work was done around 2008 by the TMF in their NGOSS Contract discussions, which grew into the GB942 specification.  The picture painted by the TMF was simple but revolutionary: events are steered to operations processes through the mediation of a contract data model.  In my view, this is the baseline requirement for service automation.  Things like billing systems respond to events, so the model passes events to them; things like order entry systems generate events to be processed.

The reason why the TMF established the notion of a contract data model (the SID, in TMF terms) as the mediating point is that the automation of service events is impossible without event context.  Suppose somebody runs into a network operations center that’s supporting a million customers using a thousand routers and ten thousand ports and circuits, and says “a line is down!”  It’s not very helpful in establishing the proper response even in a human-driven process.  The fundamental notion of the TMF’s GB942 was that the contract data model would provide the context.

For that to work, we have to be able to steer events in both the service-framework and resource-framework sense.  It’s pretty easy to see how service events are properly steered because these events originate in software that has contract awareness.  You don’t randomly add sites to VPNs; somebody orders them, and that associates the event with a service data model.  The problem lies with resource-framework events.
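
One way to picture that steering is sketched below; the contract data model, the bindings, and the process names are all invented for illustration:

```python
# A sketch of GB942-style steering.  The contract data model, the bindings,
# and the process names are all invented for illustration.

CONTRACT_MODEL = {
    "order-1001": {
        "VPN-Feature": {"add-site": "activation_process",
                        "sla-violation": "customer_care_process"},
    }
}

RESOURCE_BINDINGS = {                      # resource id -> (contract, element) pairs
    "edge-router-7": [("order-1001", "VPN-Feature")],
}

def handle_service_event(contract_id, element, event):
    # Service-side events arrive with contract context already attached.
    process = CONTRACT_MODEL[contract_id][element][event]
    return f"{event} on {element} steered to {process}"

def handle_resource_event(resource_id, event):
    # Without the binding, "a line is down!" has no service context at all.
    return [handle_service_event(c, e, event)
            for c, e in RESOURCE_BINDINGS.get(resource_id, [])]

print(handle_service_event("order-1001", "VPN-Feature", "add-site"))
print(handle_resource_event("edge-router-7", "sla-violation"))
```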

In the old days, there was a 1:1 association between resources and services.  You ordered a data service and got TDM connections, which were yours alone.  The emergence of packet technology introduced the problem of shared resources.  Because a packet network isn’t supporting any given service or user with any given resource (it adapts to its own conditions with routing, for example), it’s difficult to say when something breaks that the break is causing this or that service fault.  This was one of the primary drivers for a separation of management functions in networking—we had “operations” meaning service operations and OSS/BSS, and we had “network management” meaning the NOC (network operations center) sustaining the pool of shared resources as a pool.

“Event-driven” OSS/BSS is a concept that’s emerged in part from GB942 and in part because of the issues of resource-framework events.  The goal is to tie resource events somehow into the event-steering capabilities of the service data model.  It’s not a particularly easy task, and most operators didn’t follow it through, which is a reason why GB942 isn’t often implemented (only one of almost 50 operators I talked with said they did anything with it).

This is where things stood when NFV came along, and NFV’s virtualization made things worse.  The challenge here is that the resources that are generating events don’t even map to logical components of the services.  A “firewall” isn’t a device but a set of software-hosted functions on VMs, running on servers and connected into a chain with SDN tunnels.  Or maybe it’s a real device.  You see the problem.

Virtualization created the explicit need for two things that were probably useful all along, but not critical.  One was a hierarchically structured service data model made up of cataloged standard components.  You needed to be able to define how events were handled according to how pieces of the service were implemented, and this is easy to do if you can catalog the pieces along with their handling rules, then assemble them into services.  The other was explicit binding of service components to the resources that fulfill them, at the management level.

NFV’s data model wasn’t defined in the ISG’s Phase One work, but the body seems to be leaning toward the modern concept of an “intent model”, with abstract features, connection points, and an SLA.  This structure can be created with the TMF SID.  NFV also doesn’t define the binding process, and while the TMF SID could almost surely record bindings, the process for doing that wasn’t described.  Binding, then, is the missing link.

There are two requirements for binding resources to services.  First, you must bind through the chain of service component structures you’ve created to define the service in the first place.  A service should “bind” to resources not directly but through its highest-level components, and so forth down to the real resources.  Second, you must bind indirectly to the resources themselves to preserve security and stability in multi-tenant operations.  A service as a representative of a specific user cannot “see” or “control” aspects of resource behavior when the resource is shared, unless the actions are mediated by policy.
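
A small sketch of those two rules, assuming an invented service hierarchy: bindings are resolved down the component chain, and reads against a shared resource are allowed only when policy permits:

```python
# A sketch of the two binding rules: (1) a service reaches resources only
# through its own chain of components, and (2) access to a shared resource
# is mediated by policy rather than exposed directly.  Structure is invented.

SERVICE_TREE = {
    "ServiceX": ["FeatureVPN"],
    "FeatureVPN": ["BehaviorCore"],
    "BehaviorCore": ["resource:mpls-core-1"],   # only the lowest layer binds to resources
}

SHARED = {"resource:mpls-core-1"}               # multi-tenant resources

def resolve_bindings(node):
    """Follow the component chain downward until real resources are reached."""
    children = SERVICE_TREE.get(node, [])
    if not children:
        return [node]
    return [leaf for child in children for leaf in resolve_bindings(child)]

def read_resource_state(service, resource, policy):
    if resource in SHARED and not policy.get("allow_shared_read", False):
        return f"{service}: access to {resource} denied; query the resource domain instead"
    return f"{service}: state of {resource} returned"

for r in resolve_bindings("ServiceX"):
    print(read_resource_state("ServiceX", r, policy={"allow_shared_read": False}))
```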

So this is what you need to be looking for in an implementation of NFV that can address service operations automation—effective modeling but most of all effective binding.  Who has it?

Right now, I know enough about the implementations of NFV presented by Alcatel-Lucent, HP, Oracle, and Overture Networks to say that these companies could do service operations automation with at-most-minimal effort.  Of the four, I have the most detail on the modeling and binding for HP and Overture and therefore the most confidence in my views with regard to those two—they can do the job.  I have no information to suggest that the OSS/BSS players out there have achieved the same capabilities, so right now I’d say that NFV is in the lead.

What challenges their lead is that all the good stuff is out of scope to the standards work, while it’s definitely in-scope to the TMF.  I’m not impressed by the pace of the TMF’s ZOOM project, which has spent over a year doing what should have been largely done when it started.  But…I think a couple of good months of work for a couple people could totally define the TMF approach, and that would be possible for the TMF.  I don’t think the ISG can move that fast, which means that vendors like those I’ve named are operating in a never-never land between standards, so to speak.  They might end up defining a mechanism de facto, or demonstrating that no specific standard is even needed in the binding-and-model area.

The deciding factor in whether service operations automation is slaved to MANO or to the OSS/BSS may be history.  MANO is a new concept with a lot of industry momentum.  OSS/BSS is legacy, and while it doesn’t have far to go, it’s had the potential to do all of this all along.  The same effort by the same couple of people could have generated all these goodies back in 2008, and it still hasn’t been done.  We have four plausible implementations on the NFV side now.  If those four vendors can hook into service operations automation they could make the business case for NFV, and perhaps change OSS/BSS forever in the process.

Five NFV Missions that Build From Service Operations Success

In my last blog I outlined an approach to making an NFV business case that was based on prioritizing service operations with legacy infrastructure.  This, I noted, would provide a unifying umbrella of lifecycle services that NFV-based applications could then draw on.  Since the operations modernization would have been paid for by service operations cost reductions, NFV would not have to bear the burden.  That would facilitate making the business case for individual applications of NFV, and at the same time unify these disconnected NFV pieces.

The obvious question stemming from this vision is just what NFV applications might then be validated.  According to operators, there are five that seem broadly important, so let’s look at how these would benefit from the service operations umbrella.

The most obvious NFV opportunity that might be stimulated by service ops is virtual CPE or vCPE.  This app is sometimes presented as a generalized-device-on-premises-hosted set of functions, sometimes as a cloud-hosted set, and sometimes as an evolution from the first to the second.  vCPE is fairly easy to justify for a managed service provider (one who uses third-party transport and links it to managed services and extra features) because agile service feature selection is a clear value proposition.  Broader-based telcos might find the service interesting but lacking in the scope of benefits that would be needed to impact EBITDA.

A service operations umbrella would necessarily include a service portal, automatic provisioning and modifications of the service features, and self-care features.  These features, coming to vCPE at no incremental cost, would make the application valuable to any operator for business services.  It’s likely that efficient service operations could even make it useful to SMBs using consumer-like broadband connections.  For some applications, including content (more on this later), home control and medical monitoring, vCPE could be justified all the way to the consumer.

A second NFV opportunity that’s already generating interest and even sales is virtualization of mobile infrastructure.  Even for 4G/LTE services, operators are visualizing software-defined radio, self-optimizing elements, flexible topologies in the evolved packet core (EPC), and add-on services tightly coupled to IMS.  5G could increase the credibility of all of these applications, but particularly the latter.

Operations automation is critical for the success of any mobile infrastructure business case for NFV because mobile infrastructure is multi-tenant and so the dynamism of individual functions is limited.  Most function deployment would come in response to changes in traffic load or as a response to an outage.  Mobile infrastructure is going to be highly integrated with legacy equipment, but also with other services like content delivery and IoT.

What makes add-on services interesting is that IMS-mediated applications would likely be much more dynamic than core IMS or EPC components.  You can imagine mobile collaborative applications involving content sharing or IoT integration with mobility as generating a lot of per-user or per-group activity.  Content-viewer collaboration, sometimes called “social content”, is also a good application because it involves both the indexing and sharing of content clips and the socialization of the finds among a group.

The third opportunity area that service operations would enable is elastic bandwidth and connectivity.  The business key to services that tactically build on baseline VPNs or VLANs is that they don’t undermine the revenue stream of the base service.  My research suggests that this can be done if such services are always tied to an over-speed access connection (which you need for delivery anyway), if they have a pricing strategy that encourages longer-term commitments rather than every-day up-and-down movement, and if they’re invoked through a self-service portal that can offer capacity or new site connectivity not only on demand but also on a schedule, or even when a management system detects certain conditions.

The differentiating point for this kind of service is its ability to augment private-network services and to do so at the same high level of availability and QoS.  While those factors can’t really be guaranteed using NFV, the NFV-to-SDN tie could help assure them.

IoT is a fourth opportunity, though it’s more complicated to realize.  As I’ve noted in other blogs, IoT is an application that demands harnessing of a large variety of already-deployed sensor/controller elements and their conversion into a data repository that can then be queried.  It’s true that there may be new sensor/controller elements put online via 4/5G, but there are literally billions of devices already out there and a failure to enlist them in early applications would generate a major question of how new deployments would be capitalized and by whom.

The most significant value of IoT lies in its capability to provide context to mobile users, which is the basis for many valuable applications.  You can visualize the world as an interlocking and overlaid series of information fields that users move through or to (or away from) and of which they are selectively aware.  The information available can be about the physical area, traffic conditions, nearby events that might be of interest or create risk, commercial opportunities, and so forth.  The information from a given user’s mobile device could guide the user’s path, and if a community of users is interacting, the collective information from them all could help arrange meetings.  Emergency responders and any worker-dispatch function could draw on the data too.

The final opportunity is tactical cloud computing, another activity that emerges from either physical or social mobility.  One of the profound truths of IT spending by enterprises is that it has always peaked in growth when new productivity paradigms become visible.  Mobility is likely to be the next opportunity for such a paradigm, because engaging the worker with information at the point of worker activity is the ultimate marriage of information processing and information delivery.  But mobile workers are mobile and their information needs are tactical and transient, which isn’t ideal for traditional application computing models.  It’s better for the cloud, particularly if elements of the application and data can be staged to follow the worker and anticipate the next mission.

It’s clear that this concept would grow out of business use of IoT, and it might also grow out of maturing collaboration models focused on mobile workers as well.  It would probably be difficult to socialize it by itself, which is why I place it last on the list.

The big long-term opportunities for NFV are in the last two areas, and that’s an important point to make because it’s a validation of the premise of this whole exercise.  To make shareholders happy is the primary mission of public-company business policy, and that for network operators means raising EBITDA, which means lowering opex significantly.  You can’t do that except through broad changes in service operations, and you can’t justify broad changes with narrow service targets.  Since broad service targets are too radical to start an NFV story with, you are stuck unless you decouple service operations from the rest of NFV and give it priority.  We don’t have to forget NFV or pass on its real benefits, but we do have to apply the lesson it’s taught about orchestration to service operations first.  Once that happens, then all the NFV priorities of various types of operators, all the customer directions they might follow, unite in a single technical framework that can propel NFV and operator business success.  That’s the only way we’re going to get this done.

How an NFV Sales Story Can Get Wall Street “Tingly Inside”

Remember from yesterday’s blog that the goal of an NFV business case should be to “make the Street all tingly inside.”  That means that NFV’s business case has to be made in two interdependent but still separate tracks—one to justify new capex with high ROI and the other to create an opex-improving umbrella to improve EBITDA for the operators.

Classic NFV evolution proposes to combine the two, creating a significant EBITDA gain and high ROI in any incremental capex.  That’s a problem at two levels.  First, we have no broad model for NFV deployment—we have a series of specific options like vCPE or mobile infrastructure, but no unifying model.  Thus, it’s hard to obtain enough mass to get any significant savings to improve EBITDA.  Second, if we achieve gains through infrastructure modernization to an NFV form, we necessarily have a large capex if we have a large network scope to apply savings to.  That’s hard to reconcile with high overall ROI and harder to manage in risk terms.

My proposal is simple.  We first attack EBITDA by attacking what financial analysts mean when they say “opex”, and we do that by focusing NFV principles on legacy infrastructure.  We then virtualize functions where we can secure a higher ROI, either because of extra cost efficiencies or because of incremental revenue gains.

We have to start with that opex point.  Any discussion about improving “opex” should start with the fact that the word really has two meanings.  We in the networking industry think of opex as “operations expense” but the operators’ CFOs and the financial markets think of it as “operating expense”.  Network operations is only one component of this broader opex definition.  The other two are service payments (roaming, etc.) and service operations costs.  It’s this last group that we need to consider.

Service operations or “service management” is the set of tasks/systems that support the retail or wholesale customer relationships of a network operator.  It starts with an order and ends when the service contract expires and the associated resources are released.  Service operations links with network operations for the realization of the service relationship being purchased.  Through the service lifecycle there are events from both the service and network side that are triggers for action on the other side, so the two operations areas are interdependent.

It is highly unlikely that you could do anything in the way of a new service, or change the operating expenses of an operator, without impacting service operations.  In fact, in terms of total costs, service operations makes up about two-thirds of total management costs (service and network), which in turn make up about half of opex.  Service operations is implemented through OSS/BSS systems and is under the operators’ CIO organizations.

Given that a third of all opex is service management, you can conclude that if we were to modernize OSS/BSS systems to provide full service automation capabilities and simply cut the cost of service operations in half, we’d make an impact equal to eliminating network operations costs completely.
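
As a quick back-of-the-envelope check of those fractions (and of the roughly 17% figure that comes up later in this piece):

```python
# Back-of-the-envelope check of the fractions cited here (the cost shares are
# the author's estimates), and of the roughly 17% figure used later on.

opex = 1.0                                  # total operating expense, normalized
management = 0.5 * opex                     # service + network management: ~half of opex
service_ops = (2 / 3) * management          # ~two-thirds of management costs
network_ops = management - service_ops      # the remaining ~one-sixth of opex

savings = 0.5 * service_ops                 # cut service operations cost in half
print(round(service_ops, 3))                # 0.333: about a third of opex
print(round(savings, 3), round(network_ops, 3))    # 0.167 vs 0.167: equal to all of network ops
print(f"{savings / opex:.0%} of total opex saved")  # ~17%
```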

How does this relate to NFV?  Well, as I’ve noted in earlier blogs, NFV declared management/operations to be out of scope and it hasn’t generated recommendations either on how to broadly improve service operations or how to integrate NFV with service operations.  As it happens, service-to-network operations integration has been a growing problem.  NFV would make it worse.

The problem with NFV, or even with packet networks, is that the connection between “service” and “network” isn’t fixed, because resources are multi-tenant.  If you can’t connect resource and service management, you can’t automate customer care—and that’s the largest single component of operations expenses, bigger than network operations.  The more multi-tenancy we have, the looser we couple services to resources through virtualization, the more we risk operations disconnect that fouls our service automation nest.  We can’t let network and service management diverge or we lose automation, so we have to figure out what can converge them.

This makes a critical point in the evolution to NFV.  Any NFV business case will have to address service operations efficiency, and thus should apply NFV orchestration principles to network infrastructure (devices or servers) end-to-end to organize and activate service and network operations tasks.  If you can’t drive service operations modernization, you can’t make any NFV business case at all because you can’t change operating expenses and EBITDA.  At best, you can ride on somebody else’s business case—if they’ll let you.

If you want a tingle from the Street, start by telling them you can cut service operations costs by half by applying NFV orchestration principles to legacy infrastructure and service management.  That would cut total operating expenses by about 17%, raising EBITDA accordingly.  And you’ve not deployed any NFV, any servers, at all.

That’s right!  Orchestration of service operations, if you can represent legacy devices in the infrastructure model, can operate without any virtual functions.  That’s why you can credibly claim a 50% reduction in service operations costs.  You might think you’ve just killed NFV, but what you’ve really done is get the second half of that tingle.

A service operations gain like this does two things.  First, it offers a powerful collection of risk-reducing measures.  How many NFV salespeople spend time telling a prospect that the hosting of functions won’t impact SLAs or increase the cost of customer care?  If we have a modern, automated system for service management we have a solution to these objections.  Second, all new services will create profit to the extent that they overcome cost issues.  As we’ve seen, most costs lie in the service operations side of the organization.  Since new services are covered under our new service operations umbrella, they inherit its efficiencies.

What new services?  It doesn’t matter nearly as much.  In NFV today we have a bunch of NFV islands trying to coalesce into a benefit continent.  Build the continent with service operations tools and you have the natural collection already in place.  You can now fit individual NFV applications in underneath without having to invent a whole service lifecycle process for them.  If you like vCPE or virtual mobile infrastructure, and if you can make a business case for that application alone, you can deploy it.  A service operations umbrella could save the whole lab trial and PoC process and transition it to a consistent, operationalized field trial.

Who wins with this sort of thing?  You can come at service operations in two ways.  First, you could expand the current OSS/BSS systems to address the necessary service automation processes.  Second, you could expand the current implementations of NFV to include process integration.

Most OSS/BSS systems are at least loosely compatible with the TMF’s model—the extended Telecom Operations Map or eTOM.  Most also support the TMF SID data model.  These are a necessary step for both approaches—you need to be able to map operations processes, which means defining them, and you need centralized service MIBs.  For the OSS/BSS systems to go the rest of the way, they need two things.  One is to be event-driven, which the TMF has defined at a high level in its NGOSS Contract/GB942 specification.  This is rarely implemented, unfortunately.  The other thing OSS/BSS needs is dynamic binding between resource and service domains.  I proposed to the TMF that they adopt the notion of an explicit binding domain to do this, but they didn’t like the idea.

For NFV, the key requirement is to have some sort of service model, something that, like the TMF SID, could define a service as an association of functional components with synchronized operation.  This model could then record resource bindings and steer events, which combine to make service operations automation possible.

The key for both these approaches is that they have to apply to legacy infrastructure or we miss most of the service operations processes and savings.  That means that the best NFV trials to conduct are those with no NFV at all, trials of service orchestration with legacy components that can then be expanded to demonstrate they work with NFV components too.

As far as NFV vendors go, I believe that Alcatel-Lucent, HP, Huawei, Oracle, and Overture Networks could model and orchestrate the required elements, with perhaps some tuning of the interfaces with legacy elements.  Of the group, I’m confident that HP has the right modeling strategy and that both Alcatel-Lucent and Oracle have the required operations bent.  Oracle is involved in some high-level transformation projects that could actually lead it to the service operations conclusion the industry needs.

Why haven’t we done this already if it’s such a good idea?  A big part of the reason is the NFV ISG’s scope issues; the trials and PoCs have tended to match the scope set by the early work.  However that was never a requirement; you were always able to propose a broader scope.  I did with the original CloudNFV project, though the documents were revised to limit scope once I left my role as chief architect, and that was the first PoC approved.

Why not now, then?  Because vendors want simple sales value propositions with big payoffs, and revamping service operations is not only a complex process, it’s a process that doesn’t sell a lot of hosting.  Everyone wants a big, easy win.  There can be a very big win here with NFV, but it’s not going to be an easy one.  All that means is that it’s time to face reality and get started.