Top Network Issues for 2016: My Personal View

We’re closing in on 2016 and what will surely be a pivotal year in terms of network operator strategy.  I’ve already blogged about the results of the operators’ fall planning cycle, and I think that offers a pretty clear view of their technology plans.  Because even the best of plans can be contaminated by internal biases (personal and company), I want to offer my own analysis of the year to come.

First and foremost, 2016 is a year when a combination of operations efficiency and new revenue opportunity will come to dominate network operator planning.  The industry has tended to focus on capital budgets, in no small part because Wall Street uses network operator capex as a means of assessing the opportunities in the network equipment space.  However, the Street’s own measure of network operator financials (EBITDA) omits capex, and in any event operators have known for at least two years that there are really no technology measures they could take that would effectively address capex.  That leaves new revenues or lower operations costs to address the revenue/cost-per-bit convergence.

In one sense, this shift in sentiment seems to offer some significant opportunities.  The fact is that the overwhelming majority of the stuff needed to either enhance revenues or reduce operations costs is concentrated in OSS/BSS systems under the CIO.  With one super-sponsor and one technology target, it’s far easier to build a consensus and implement it.  On this basis, you could assume progress would come quickly in 2016.

The problem is that CIOs are the most glacial of all operator executives.  Operations systems have greater inertia than even network equipment.  Thus, despite the fact that top-down planners of network transformation have long accepted the need to drive those top-down changes through OSS/BSS, very little has been done.  Everyone seems to get bogged down in details, meaning that even in OSS/BSS, where the top-down focus has been clear, people have started technology changes at the bottom and hoped for convergence somewhere along the way.

What seems to be driving changes to this practice in 2016 is the fact that operators are now committed to the “wing and a prayer” strategy of NFV deployment.  “I don’t know whether we understand what an NFV transformation would look like, as a company,” one CIO told me.  “We are committed to evolving from trials toward something, but I don’t really know what that ‘something’ is or exactly how we’ll approach it.”  Put in technology terms, operators are committed to building up from their trials and PoCs, which are overwhelmingly focused on a single service concept, and in the majority of cases on vCPE.  So, here’s the key point for 2016: service-specific silos are not only here to stay, they’re on the rise.

While no operator wants to admit that they’ve invented in NFV a new way to build silos, that’s pretty much what’s happened.  What a few operators are now recognizing is that silo convergence in the new age is probably a technical problem that OSS/BSS could solve.  If that is true, then the silo-convergence mission at the technical level might lead (accidentally?) to a pathway to implement the top-down strategy operators have been grasping for.  Where do you converge silos other than at their top?

The problem with this approach, for vendors at least, is that the big bucks in NFV benefits terms are only available at the silo-top.  If we’re relying on OSS/BSS to converge the silos and deliver the benefits, then the benefit delivery will occur largely after the silos are deployed.  That means well beyond 2016, even discounting the time it might take for CIOs to frame a silo-convergence strategy.

If nobody converges silos explicitly then there’s only one other option to escape them, and I’ll call it the cult of the ginormous silo.  If NFV deployment is the success of a thousand Band-Aids, then one really big one could be a game-changer.  There are, as I’ve noted, only two opportunities for a ginormous silo—mobile/content and IoT.  Mobility (as we’ll see below) is a key factor in another technology shift, and IoT is a darling of the media, but neither of them has shaped a solid vision of how it might change operator costs or revenues decisively.  The best anyone has done is to say “sell 4/5G to every human, and when you run out sell to the gadgets directly.”  Presumably 3D printing would then combine with robots to create self-reproducing gadgets to continue the trend.  Not very insightful.

If nobody pushes the ginormous silo approach effectively, then vCPE will dominate early services.  Here we are likely to generate an interesting dynamic, because the great majority of credible vCPE opportunities are associated with business sites (the remainder are linked to content delivery, which then links back to mobile and social services).  Enterprises have their own technology plans, focused as always on enhancing worker productivity.  The technology vehicle for this is mobility, a transitioning of workers from a desktop-centric “bring work to your workplace” approach to a mobile-centric “take your workplace to where you’re working” model.

Mobility, particularly when viewed as something you’re transitioning to, almost demands a connection network or VPN that is independent of, rather than created with, physical network connectivity.  Read, then, a virtual network.  I would contend that social-based services that focus on the person and not on the network have this same need.  In both cases, the logical thing to do is to build an overlay VPN to accomplish what you need.  Unlike most of these “SDN” networks, this overlay would have to be supported on every device, and in parallel with the “normal” Internet connectivity.  You could tunnel over the Internet or create something alongside it—either would work—but you don’t want to be pushing Internet traffic onto the VPN only to egress it somewhere else.
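
To make the overlay notion concrete, here’s a minimal sketch in Python of the kind of encapsulation involved.  The header layout and names are my own invention for illustration; no real tunneling standard (VXLAN, GRE, or otherwise) is implied.

```python
import socket
import struct

# Toy overlay header: (virtual-network ID, logical source, logical dest).
# This is an illustrative format, not any standard's wire encoding.
OVERLAY_HEADER = struct.Struct("!IHH")

def encapsulate(vnet_id: int, src: int, dst: int, payload: bytes) -> bytes:
    """Wrap a payload in the overlay header for tunneling over UDP."""
    return OVERLAY_HEADER.pack(vnet_id, src, dst) + payload

def decapsulate(datagram: bytes):
    """Recover the overlay fields and payload from a received datagram."""
    vnet_id, src, dst = OVERLAY_HEADER.unpack_from(datagram)
    return vnet_id, src, dst, datagram[OVERLAY_HEADER.size:]

def send_overlay(sock: socket.socket, peer: tuple, vnet_id: int,
                 src: int, dst: int, payload: bytes) -> None:
    """The overlay rides the ordinary Internet connection in parallel with
    normal traffic: same physical interface, separate UDP socket."""
    sock.sendto(encapsulate(vnet_id, src, dst, payload), peer)
```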

This raises the last of my 2016 technology issues, which is the cloudification of the network.  You can already see that mobile services are impacted by the notion of personal digital assistants that would offer users answers to questions and not access to information.  IoT will develop and extend this trend by building an analytics framework to collate and interpret sensor contextual inputs and combine them with social and retail data.  The result will be an increasing shift from website delivery to answer delivery, from search to agent handling.  That will subduct more and more of the Internet into the cloud, or rather put it behind a cloud edge.

This would create a profound shift in the dynamics of the Internet.  If you could sell the user an agent that answers all their questions, what value do anyone else’s search engines or ad placements have?  Everyone knows that nobody is going to replace Google or Yahoo, but suppose that’s not even the goal any longer?  Suppose that “over-the-top” is now under the covers?

There are network implications too.  A shift of traffic from website-to-user to website-to-agent wouldn’t eliminate access networking because bandwidth is still key for content delivery.  What it would do is create a massive intra-cloud traffic flow, one between data centers.  That could radically increase the high-capacity trunking needs of operators while making the access network more of a simple on-ramp to the cloud.  Big pipe, but a dumb subordinate.

So these are the things I think we should be watching next year.  I know I will be, and I promise to keep you posted on how I’m seeing things develop.

How the “Trial-Silo” NFV Strategy of Operators for 2016 Impacts Vendors

NFV trials are not uniform exercises; they don’t involve the same set of players or reflect the same issues and service goals.  I noted in a blog earlier this week that operators had decided (often by default) that they would pursue evolving PoCs toward services even though it would likely result in silos, given the large differences in players and focuses.  That dynamic could change the fortunes of some of the notable NFV players, and so I want to take a look at them.

First, as I’d noted, an uncoupled vision of NFV evolution would favor OSS/BSS vendors like Amdocs and Ericsson, who can couple silos together above the NFV level.  Neither of these two has been a “full service” NFV player, but the evolution of operator planning would validate their above-the-fray stance.  The problem for this group of players is that it’s a weak position to sell.  Do operators know, or want to admit, that their NFV trials aren’t likely to promote a common strategy?  Do you, as a vendor, want to cram that truth down their throats?

Pure-play OSS companies like Amdocs seem to understand that their best approach is to focus not specifically on silo integration but rather on operations efficiency and service agility.  Most NFV stories are way beyond being “weak” in reaping these benefits, yet those benefits are the ones operators say will have to drive deployment of NFV in the long run.

Ericsson, on the other hand, is already the most-accepted integrator of the carrier space.  They see the disorder developing and know that eventually all NFV trials will have to converge on common operations and management practices or neither agility nor efficiency will be attained.  They know how to do it, not necessarily in an optimum way but certainly in a profitable way, so they can simply let things play out.

Some of the six vendors who can really do NFV (Alcatel-Lucent, Ciena, HPE, Huawei, Oracle, and Overture) have also taken positions (deliberately or by default) that the new silo model favors.  Huawei wins if NFV fails completely, because price concessions are the only other answer to the narrowing carrier profit margins that become critical in 2017.  Overture has always promoted a service-specific (Carrier Ethernet, MSP, and vCPE) vision of NFV but its technology could be used to spread a unifying ops layer down the line.  Oracle has taken an ops-centric position from the first, so if operations integration wins they get an automatic boost.

For the other vendors on the list, it’s more complicated, so let’s look at them in alphabetical order.

Alcatel-Lucent probably knows that its best approach is to accept silos because it could have the biggest one.  Mobile services and content delivery will be where most operator profit and cost focus falls, because they’re already the largest opportunity and the largest capex target.  Alcatel-Lucent can afford to hang back in other areas because they realize that on scale of deployment alone, they’d win.  It would be easier and cheaper to fit something like vCPE into an NFV infrastructure justified and deployed based on mobile/content than the other way around.

Ciena has an extremely interesting opportunity arising from the trial-diversity-and-silo-creating strategies of operators.  They are the only vendor of the six who has actively embraced silo consolidation and federation as capabilities.  You could put Blue Planet in above other NFV silo implementations of services and use it to harmonize management and operations, and Ciena says so.  That latter point is important because operators tell me they want to exercise vendor capabilities that the vendors own up to in public, not those that are pronounced in whispers in smoke-filled rooms.  Since Ciena also has an optics-up SDN/NFV story, that gives Ciena two shots at achieving a compelling NFV position.

HP (now HPE) has the strongest technical solution to NFV, but they’ve focused on a PoC-and-trial process that doesn’t let them demonstrate their scope effectively.  They may see the trial-and-PoC silo approach as a validation of their own strategy of standing tall in PoCs even if the PoCs don’t make a business case.  I know enough about their modeling and orchestration capability to know they could almost certainly orchestrate across diverse silo implementations, but they have not made that capability a fixture of their positioning.  That means they are at risk of being confined by their sales efforts to a mission far less capable and important than their software technology could attain.  But marketing assertions have little inertia; all HP has to do to fix their problems is assemble a nice orchestra and sing their hearts out.

The big question, one raised yet again by the Light Reading interoperability test, is whether silo NFV can make enough of a business case even when confined to sweet-spot services.  How long will a service trial for a single service take before operators can say it will pay back?  Answer: We don’t know, nor do the operators.  Some, particularly if you talk to the CFOs, aren’t sure that any silo service will pay back and aren’t sure that anything can unite the silos.

Operators didn’t make a proactive decision to pick at NFV opportunity scraps; they were offered no alternative.  None of the vendors who could tell a full story were willing to do that, probably because such a story would have been complicated to sell both in objective business case terms and in terms of operator political engagement.  At this point, based on what operators have told me about their fall planning results, I think it’s too late to make a broad case for NFV.  A trial process comprehensive enough to do that could not now be initiated and completed in 2016.

It’s probably also too late for a new camel to stick its nose in.  I noted in a prior blog that IoT was potentially the killer app for NFV, and interestingly companies like GE Digital have already launched cloud-IoT services that to me clearly demonstrate NFV’s value to the “right” IoT story.  But again, IoT is complicated and operators have gotten off on the wrong track (as Verizon did), misled by opportunism.  I don’t think that IoT/NFV trials would make even as much progress in 2016 as a full-solution operations-driven NFV trial would.

In the midst of this, OPNFV says it’s ready to take on MANO and Light Reading says the same, for its next phase of testing.  Too late.  You can’t really test the underlayment of something without knowing its higher-level requirements.  The ISG, to define a proper operations/management framework, would have to obsolete much of the work it’s already done.  OPNFV has to have something to implement from, and anything they do right at the MANO level will expose limitations in their NFVI and VIM models.  So we are not going to get a unifying standard in time to drive market consensus.

Which takes us back to the silos.  Operators at this point will continue with their PoCs and trials because they have no other options that will mature in time.  Even though these trials will create silos, operators are committed to hoping they can somehow be unified.  That hope is now the opportunity every NFV vendor has to position for.

What Can We Learn from the Light Reading NFV Tests?

Light Reading has published the first of what they promise will be a series of NFV tests, run in combination with EANTC, and the results are summarized HERE.  I think there are some useful insights in the results, but I also think that there are fundamental limitations in the approach.  I’m not going to berate Light Reading for the latter, nor simply parrot the former.  Instead I’ll pick some main points for analysis.  First, I have to summarize what I perceive was done.

This first set of tests was targeted at interoperability at the deployment level, meaning VNFs deploying on NFVI.  Most of the functional focus, meaning software, would thus fall into the area of the Virtual Infrastructure Manager (VIM), which in all the tests was based on OpenStack.  I can’t find any indication of testing of MANO, nor of any program to test operations/management integration at a higher level.

This is the basis for my “fundamental limitation” point.  NFV is a technology, which means that you have to be able to make it functional, and the Light Reading test does make some good points on the functional side.  It’s also a business solution, though; something that has to address a problem or opportunity by delivering benefits and meeting ROI goals.  We cannot know from the test whether NFV could do that, and I contend that implementations that can’t make a business case are irrelevant no matter how well they perform against purely technical standards.

Now let’s get to the points made by the tests themselves.  As I said, I think some of the results were not only useful but highly evocative, though I don’t think what I saw as important matched Light Reading’s priorities.

The first point is that OpenStack is not a plug-and-play approach to deployment.  This is no surprise to me because we had issues of this sort in the CloudNFV project.  The problem is that a server platform plus OpenStack is a sea of middleware and options, any of which can derail deployment and operation.  The report quotes an EANTC source:  “There were tons of interop issues despite the fact that all NFVIs were based on OpenStack.”

The lesson here isn’t that OpenStack isn’t “ready to play an interop role” (to quote the report) but that it’s not sufficient to guarantee interoperability.  That’s true of a lot of (most of, in my view) network-related middleware.  There are dependencies that have to be resolved, options that have to be picked, and none of this is something that operators or VNF vendors really want to worry about.

What we have here isn’t an OpenStack failure but an abstraction failure.  The VIM should represent the NFV Infrastructure (NFVI) no matter what is below, presenting a common set of features to support deployment and management.  Clearly that didn’t happen, and it’s a flaw not in OpenStack but in the ISG specifications.  All infrastructure should look the same “above” the VIM, converted by the VIM into an intent model that can represent anything on which VNFs can deploy.   The specifications are not sufficient for that to happen, and the notion of a fully abstract intent model is absent in any event.
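
Here’s a minimal sketch of what a true VIM abstraction might look like, with invented method names (this is not the ISG’s interface definition).  The point is that an OpenStack realization keeps every implementation detail below the line:

```python
from abc import ABC, abstractmethod

class VirtualInfrastructureManager(ABC):
    """Sketch of a VIM as an intent model: everything above sees only
    these operations, never Nova, Heat, or any other OpenStack detail.
    Method names are illustrative, not taken from the ETSI ISG specs."""

    @abstractmethod
    def deploy(self, vnf_image: str, resources: dict) -> str:
        """Instantiate a VNF somewhere on the NFVI; return an opaque handle."""

    @abstractmethod
    def connect(self, handle_a: str, port_a: str,
                handle_b: str, port_b: str) -> None:
        """Join two VNF ports; the mechanism used is hidden below the VIM."""

    @abstractmethod
    def status(self, handle: str) -> dict:
        """Report state in a uniform, implementation-neutral form."""

class OpenStackVIM(VirtualInfrastructureManager):
    """One realization; Nova/Neutron/Heat calls would live only in here,
    so nothing above the VIM can be impacted by the specifics."""
    def deploy(self, vnf_image, resources):
        raise NotImplementedError("Nova/Heat calls stay behind this line")
    def connect(self, handle_a, port_a, handle_b, port_b):
        raise NotImplementedError
    def status(self, handle):
        raise NotImplementedError
```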

You can see another indication of this abstraction failure in the discussion of interoperability issues.  There are discussions of OpenStack Nova, Heat scripts, and so forth.  None of this should matter; a VNF should never “see” or be impacted by specifics of the implementation.  That’s what VIMs are supposed to cover, creating a universal platform for hosting.  It is unacceptable that this did not happen, period.

The next point is that NFV management is broken.  I’ve noted all along that the decision to create a hazy management framework that includes external or VNF-integrated VNF Managers (VNFMs) has negative ramifications.  The tests show the decision has a ramification I didn’t predict: a dependence on a tie between VIM management and OpenStack that was never fully described and isn’t really present.  The VIM abstraction must represent NFVI management in a consistent way regardless of how the NFVI is implemented and how the VIM uses it.  The tests show that too much of OpenStack is exposed in the deployment and management processes, which makes all of the service lifecycle stages “brittle,” meaning subject to failure if changes occur underneath.

The model that’s been adopted for VNFM almost guarantees that lifecycle management would have to be integrated tightly with the implementation of the VIM, and perhaps (reading between the lines of the report) even down to the actual OpenStack deployment details.  That means that it will be very difficult to port VNFs across implementations of NFVI.  The VNFMs would likely not port because they can’t exercise an intent-model-level set of management facilities.

I also have concerns that this level of coupling between VNFM and infrastructure will create major security and compliance problems.  If VNFMs have to run Heat scripts, then how do we insulate an OpenStack instance from incorrect or insecure practices?  Can we prevent one VNFM (which represents one vendor’s notion of a service for a single user) from diddling with stuff that impacts a broader range of services and users?

The third issue raised in the tests was that NFV spends too much time describing logical inter-VNF relationships and not enough time describing how the VNFs themselves expect to be deployed on the network.  This is a problem that came up very early in ISG deliberations.  Every software-implemented network feature expects certain network connections; they’re written into the software itself.  What the Light Reading test showed is that nobody really thought about the network framework in which VNFs run, and that made it very difficult to properly connect the elements of VNFs or link them to the rest of the service.

The most common example of a VNF deployment framework would be an IP Subnet, which would be a Level 2 network (Ethernet) served by a DHCP server for address assignment, a DNS server for URL resolution, and a default gateway to reach the outside world.  The virtual function components could be connected within this by tunneling between ports or simply by parameterizing them to know where to send their own traffic.  To know that traffic is supposed to follow a chain A-B-C without knowing how any of these are actually connected does no good, and the testing showed that.
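
A sketch of what such a deployment framework might capture appears below.  The field and class names are assumptions of mine, not drawn from any NFV specification:

```python
from dataclasses import dataclass
from ipaddress import IPv4Network

@dataclass
class SubnetFramework:
    """The network context a VNF expects: the piece the tests showed
    was missing from the chain description."""
    cidr: IPv4Network        # the Level 2/IP subnet the VNFs share
    dhcp_server: str         # address assignment
    dns_server: str          # URL resolution
    default_gateway: str     # reach to the outside world

@dataclass
class ChainLink:
    """One hop of the A-B-C chain, with its concrete connection method."""
    upstream_port: str       # e.g. "firewall.trunk" (hypothetical naming)
    downstream_port: str     # e.g. "router.port"
    mechanism: str           # "tunnel" or "parameterized" addressing

# A chain description is only deployable when both pieces are present.
frame = SubnetFramework(IPv4Network("10.0.5.0/24"),
                        dhcp_server="10.0.5.2",
                        dns_server="10.0.5.3",
                        default_gateway="10.0.5.1")
link = ChainLink("firewall.trunk", "router.port", mechanism="tunnel")
```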

But this is only the tip of the iceberg.  As I’ve said in prior blogs, you need a specific virtual networking address and membership model for NFV, just as you need one for the cloud.  Amazon and Google have their own, and Google has described its approach in detail (Andromeda).  Without such a model you can’t define how management elements address each other, for example, and how NFV components are separated from service components.

All of this results from the bottom-up approach taken for NFV specifications.  Nobody would design software like that these days, and while nobody disputes that NFV implementation is going to come through software, we’ve somehow suspended software design principles.  What we are seeing is the inevitable “where does that happen?” problem that always arises when you build things from primitive low-level elements without a guiding set of principles that converge you on your own benefit cases.

So where does this leave us?  First, I think the Light Reading test found more useful things than I’d thought it might.  This was diluted a bit by the fact that the most useful findings weren’t recognized or interpreted properly, but then tests aren’t supposed to solve problems, only to identify them.  Second, I think the test shows not that NFV is fairly interoperable (they say 64% of test combinations passed) but that we have not defined a real, deployable model for NFV at all.  Truth be told, the tests show that NFV silos are inevitable at this point, because operators could never wait for the issues above to be resolved through a consensus process of any sort.

But this isn’t bad (surprisingly).  The fact is that the operators are largely reconciled to service-specific, trial-specific NFV silos, to be integrated down the road, likely by operations processes.  The points of the test are helpful in identifying what those unifying operations processes will have to contend with.  However, I think that PoCs or trials are the real forums for validating functionality of anything, particularly NFV, and that these vehicles will have to show results for operators no matter what third-party processes like Light Reading’s tests might show.

The Relationship Between Content Delivery and SDN/NFV: It’s About Mobility

Operators have recognized from the first that video probably represents their largest incremental opportunity.  There’s also been a lot of hype around the video market, particularly focusing on the notion that Internet OTT delivery of video would displace all other video forms.  Like most popular notions, this is almost completely unsupported and even illogical, but it’s obscuring the real issues and opportunities.  Let’s look at the truth instead.

Channelized material represents the great majority of video consumed today, and this material is delivered in what is often called “linear” or “linear RF” form, to a set-top box and to a TV through an HDMI cable or coax.  For material that is delivered on a fixed schedule this is a highly efficient model, and it’s my view that assertions that any company with either fiber to the home or CATV in place would abandon it are inaccurate.  Imagine the traffic generated by a million viewers of a popular TV show if all those views were IP-based.

Where IP is increasingly a factor is in delivery of video that is either off-schedule (on demand) or delivered to a mobile device or computer.  The popularity of mobile video has grown as people have become dependent on their smartphones and attuned to having entertainment access through them.  I think that the trend toward online delivery of traditional TV-on-demand reflects the fact that mobile use of video creates a platform with favorable exploitation costs—better than you could have in trying to build linear-RF solutions to TV on demand.

If mobile drives on-demand viewing trends, then it’s fair to ask how mobile changes video content delivery overall.  There are clearly “back-end” issues impacting content delivery networks (CDNs), but there are also viewing-habit changes that could have a profound impact on video overall.

Wireline content delivery is well understood.  The access network is an aggregation hierarchy, with consumer connections combined into central office trunks, and then further aggregated at the metro/regional level.  The goal for most CDNs was not to optimize this hierarchical structure, but to avoid the variable and often significant performance penalties that would arise were content to be streamed over Internet peering connections.  Instead, the CDN caches content close to the head-end point of an aggregation network.

With traditional CDNs, a user who clicks on a URL for video is redirected (via DNS) to a cache whose location is based on whatever quality of experience or infrastructure optimization strategies the network operator applies.  The content is then delivered from that point throughout the experience.  Cache points can be aggregated into a hierarchy too, with edge cache points refreshed by flow-through video and deeper ones filled in anticipation of need.
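
Here’s a toy sketch of that redirection logic.  The topology table, addresses, and load figures are invented for illustration; a real CDN would use the operator’s own QoE and infrastructure data:

```python
# Map client-address prefixes to cache points; fall back on load when
# the client's region is unknown. All values here are hypothetical.
CACHES = {
    "metro-east": {"addr": "203.0.113.10", "load": 0.40},
    "metro-west": {"addr": "203.0.113.20", "load": 0.75},
}
CLIENT_TOPOLOGY = {"10.1.": "metro-east", "10.2.": "metro-west"}

def resolve_content_request(client_ip: str) -> str:
    """Return the cache address the content URL should resolve to."""
    for prefix, cache_name in CLIENT_TOPOLOGY.items():
        if client_ip.startswith(prefix):
            return CACHES[cache_name]["addr"]
    # unknown region: fall back to the least-loaded cache
    return min(CACHES.values(), key=lambda c: c["load"])["addr"]

assert resolve_content_request("10.1.7.44") == "203.0.113.10"
```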

Mobile video changes this in two important ways.  First, the mobile user is mobile and so the optimum cache location is likely to change over time.  This is particularly important because mobile infrastructure has many aggregation points as towers aggregate into Evolved Packet Core elements and onward to gateways.  It’s also typically more expensive and more easily overconsumed than wireline.  Second, the mobile user is often highly socially driven in terms of consumption.  People don’t usually share their living rooms and view collectively, but almost every mobile user will be a part of an active social group (virtual or real) and the habits and behaviors of others in the group will impact what any member decides to view.

For both SDN and NFV, the dynamism inherent in mobile CDN operation presents an opportunity.  SDN could be a cheaper and faster way of defining transient paths between the cell sites supporting mobile users and their content cache points.  NFV could be used to spin up new cache points or even to build “video distributor” points that would repeat video streams to multiple users.  Some network operators have expressed interest in what could be called “scheduled on-demand” viewing, where a mobile user joins a community waiting for delivery of something at a specific time.  This would enable video delivery “forking” to be established.  Fast-forward, pause, or rewind would require special handling but operators say some market tests show users could accept a loss of these features for special pricing or special material (live streaming, for example).

Dynamic CDN is fine, but it’s the social-viewing changes that could have a profound impact.  Twitter is often used today to comment on TV shows, either among friends or in real time to the program source.  This model doesn’t require additional technical support, but with some new features the social integration of content could be enhanced.

Tweeting on scheduled TV relies on the schedule establishing an ad hoc community of users whose experience is synchronized by transmission.  Most useful enhanced social features/services would be associated with establishing simultaneous viewing for a community of interacting users (like the “forking” mentioned above), which would ensure that they could all view the same thing at the same time.  Users could then interact with each other on the content-shared experience, and could invite others to join in the experience.

A variation on this social sharing would use metadata coding of the content.  Coding of scenes by an ID number or timestamp would allow a social user to refer to a scene and to synchronize others to that specific point.  It could also be used to jump from one video to another based on the coding—to find other scenes of a similar type, with the same actors/actresses, etc.  It would also be possible to rate scenes and find scenes based on the ratings of all users, of a demographic subset, or among a mobile user’s social contacts.
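
A sketch of what such scene coding might look like is below.  The schema and the player object’s methods are my assumptions, purely illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SceneTag:
    """Hypothetical scene-coding record; a guess at the minimum such
    metadata might contain."""
    content_id: str        # which title
    scene_id: int          # position within it
    timestamp_s: float     # player seek offset
    descriptors: tuple     # e.g. ("car-chase", "actor:J.Smith")

def sync_to_scene(player, tag: SceneTag) -> None:
    """Jump a viewer's player to a referenced scene; player.load and
    player.seek are assumed methods on an assumed player object."""
    player.load(tag.content_id)
    player.seek(tag.timestamp_s)

def scenes_like(catalog, tag: SceneTag):
    """Find scenes sharing any descriptor: the 'jump by coding' idea."""
    wanted = set(tag.descriptors)
    return [t for t in catalog if wanted & set(t.descriptors)]
```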

You can see that social video viewing/sharing would change the dynamic of content delivery.  Most obviously, you’d probably want to spawn a process to fork such sharing experiences from a common cache if the people were together, and perhaps even to make each viewer a parallel controller of the experience—one person pauses for all, for example.  You might also want to create multiple independent video pipes to a viewer if they’re browsing by metadata, and you’d need a database application to actually do the browsing.

As video content viewing becomes more social, it starts to look a bit like videoconferencing, and I think that over time these two applications would combine.  A videoconference might be a multi-stream cache source with each participant a “sink” for whichever streams they wanted to view.  They could then rewind, pause, etc.  And there are already many applications in healthcare (as a vertical) and compliance (as a horizontal) where metadata coding of conference content would be highly valuable.

Video of this sort could become a powerful driver for both SDN and NFV, but I don’t think it would be easy to make it a killer app, the app that pulls through a deployment.  Consumer video is incredibly price-sensitive, and operators will be pressed to make a business case for mass deployment in a market with low margins like that.  Still, if I were a mobile and video vendor (like Alcatel-Lucent) I might be taking a serious look at this opportunity.  At the least it would guarantee engagement with the media and consumer population.

I think the video opportunity for SDN and NFV shows something important at a high level, which is that NFV and SDN are not driven as much by “virtualization” as by dynamism.  A virtual router in a fixed place in a network for five years doesn’t need SDN or NFV, but if you have to spin one up and connect it for a brief mission, service automation efficiency is the difference between being profitable and not.  That’s an important point to remember, because most of our current network missions are static, and a fear of thinking outside the “box,” in the sense of physical-network constraints and missions, could compromise the long-term benefits of both SDN and NFV.

APIs for NFV Operation: A High-Level Vision

There are a lot of technical questions raised by NFV and even the cloud, questions that we cannot yet answer and that are not even widely recognized.  One of the most obvious is how the component elements are stitched together.  In NFV, it’s called “service chaining”.  Traditionally you’d link devices using communications services, but how to link software virtual devices or features isn’t necessarily obvious from past tradition.  I think we need to look at this problem, as we should look at all problems, from the top down.

A good generalization for network services is that every device operates at three levels.  First, it has a data plane, carrying the traffic it passes according to the functional commitments intrinsic to the device.  Second, it has a control/signaling plane that mediates pair-wise connections, and finally it has a management plane that controls its behavior and reports its status.

In NFV, I would contend that we must always have a management portal for every function we deploy, and also that every “connection” between functions must support the control/signaling interface.  A data-plane connection is required for functions that pass traffic, but is not a universal requirement.  Interesting, then, is the fact that we tend to think of service chaining only in terms of connecting the data paths of functions into a linear progression.

Because we have to be able to pass information for all three of our planes, we have to be able to support a network connection for whichever of the three are present.  This connection carries the information, but doesn’t define its structure, and that’s why the issue of application programming interfaces (APIs) is important.  An API defines the structure of the exchanges in “transactional” or request/response or notification form, more than it does the communications channel over which they are completed.

I believe that all management plane connections would be made via an API.  I also believe that all signaling/control connections should be made via APIs.  Data plane connections would not require an API, only a communications channel, but that channel would be mediated by a linked control interface.  Thus, we could draw a view of a “virtual function” as being a black box with a single management portal, and with a set of “interfaces” that would each have a control API port and an optional data port.  If the device recognized different types of interfaces (port and trunk, user and network, etc.) then we would have a grouping of interfaces by type.

Working through an example with this model might help.  Let’s suppose we have a function called “firewall” designed to pass traffic between Port and Trunk.  This function would then have a management port (Firewall-Management) with an API defined.  It would also have a Firewall-Port and Firewall-Trunk interface, each consisting of a control API and a data plane connection.
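
Expressed as a minimal data model (with hypothetical names and endpoints of my own), the firewall black box might look like this:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Interface:
    """An interface pairs a control API with an optional data-plane port."""
    name: str                  # "Port", "Trunk", ...
    control_api: str           # endpoint for the control/signaling API
    data_port: Optional[str]   # data-plane attachment; None if control-only

@dataclass
class FunctionModel:
    """The black box: one management portal plus typed interfaces."""
    name: str
    management_api: str
    interfaces: list

# All endpoints below are hypothetical, for illustration only.
firewall = FunctionModel(
    name="firewall",
    management_api="https://mgr.example/firewall",
    interfaces=[
        Interface("Port",  "https://fw.example/ctl/port",  data_port="eth0"),
        Interface("Trunk", "https://fw.example/ctl/trunk", data_port="eth1"),
    ],
)
```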

Let’s suppose we had such a thing in a catalog.  What would have to be known to let us stitch “firewall” into a service chain?  We’d need to know the API-level and connection-level information.  The latter would be a matter of knowing what OSI layer was supported for the data path (Ethernet, IP) and how we would address the ports to be connected, and this is a place where I think some foresight in design would be very helpful.

First, I think that logically we’d want to recognize specific classes of functions.  For example, we might say we have functions designed for data path chaining (DPC, I’ll call it), others to provide control services (CTLS), and so forth.  I’d contend that each function class should have two standards for APIs—one standard representing how that class is managed (the management portal) and one that defines the broad use of the control-plane API.  So our firewall function, as a DPC, would have management exchanges defined by a broad DPC format, with specificity added through an extension for “firewall” functions.  Think of this as being similar to how SNMP would work.

The management plane should also have some of the TMF MTOSI (Multi-Technology Operations Systems Interface) flavor, in that it should be possible to make an “inventory request” of a function of any type and receive a list of its available interfaces and a description of its capabilities.  So our firewall would report, if queried, that it is a DPC device of functional class “FIREWALL”, and has a Port and Trunk interface both of which are a control/data pairing and supported via an IP address and port.
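
The inventory reply might look something like the sketch below; the JSON shape is an assumption for illustration, not the actual MTOSI schema:

```python
import json

# A guessed-at inventory reply for the firewall example; addresses
# and field names are invented, not from MTOSI or the ISG specs.
inventory_reply = {
    "function_class": "DPC",
    "functional_type": "FIREWALL",
    "interfaces": [
        {"name": "Port",  "kind": "control+data",
         "address": "198.51.100.7:9001"},
        {"name": "Trunk", "kind": "control+data",
         "address": "198.51.100.7:9002"},
    ],
}
print(json.dumps(inventory_reply, indent=2))
```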

This to me argues for a hierarchy of definitions, where we first define function classes, then subclasses, and so forth.  All DPC functions and all CTLS functions would have a few common management API functions (to report status) but more functions would be specific to the type of function.  A given implementation of a function might also have an “extended API” that adds capabilities, each of which would have to be specified as optional or required so the function could be chained properly.

An important point here is that the management APIs should be aimed at making the function manageable, not at managing the service lifecycle or service-linked resources.  Experience has shown that pooled assets cannot be managed by independent processes without colliding on actions or allocations.  That’s long been recognized with things like OpenStack, for example.  We need to centralize, which means that we need to reflect the state of functions to a central manager and not reflect resource state to the functions.

To continue with the example of the firewall, let’s look at the initial deployment.  When we deploy a virtual function, we’d check the catalog to determine what specific VNFs were available to match the service requirements, then use the catalog data on the function (which would in my view match the MTOSI-like inventory) to pick one.  We’d then use the catalog information to deploy the function and make the necessary connections.  Each link in the chain would require connecting the control and data planes for the functions.
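
A sketch of that deployment walk-through, assuming the hypothetical catalog entries and VIM operations from my earlier examples, might look like this:

```python
def deploy_chain(catalog, required_types, vim):
    """Pick, deploy, and connect a chain of functions. The catalog
    entries and the vim object (with deploy/connect methods, as
    sketched earlier) are hypothetical."""
    chain = []
    for ftype in required_types:                 # e.g. ["FIREWALL", "NAT"]
        candidates = [f for f in catalog
                      if f["functional_type"] == ftype]
        choice = candidates[0]                   # selection policy elided
        handle = vim.deploy(choice["image"], choice["resources"])
        chain.append((choice, handle))
    # each link needs both planes connected, not just the data path
    for (_, h_up), (_, h_down) in zip(chain, chain[1:]):
        vim.connect(h_up, "Trunk", h_down, "Port")           # data plane
        vim.connect(h_up, "Trunk-ctl", h_down, "Port-ctl")   # control plane
    return chain
```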

In our firewall, the control link on the PORT side would likely be something GUI-friendly (JSON, perhaps) while that on the TRUNK side would be designed to “talk” with the adjacent chained element, so that two functions could report their use of the interface or communicate their state to their partner.  We might envision this interface as JSON, as an XML payload exchange, etc. but there are potential issues that also could impact the management interface.

Most control and management interfaces are envisioned as RESTful in some sense, meaning that they are client-server in nature and stateless in operation.  The latter is fine, but the former raises the question of duplex operation.  A function needs to accept management commands, which in REST terms would make it a server/resource.  It also needs to push notifications to the management system, which would make it a client.  We’d need to define either a pair of logical ports, one in each direction, or use an interface capable of bidirectional operation.
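
Here’s a minimal sketch of that duplex pairing, using Python’s standard library; the manager URL and the command path are hypothetical:

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

MANAGER_EVENTS_URL = "http://manager.example/events"   # hypothetical URL

def notify(event: dict) -> None:
    """Function acting as REST client: push a notification to the manager."""
    req = urllib.request.Request(
        MANAGER_EVENTS_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

class CommandHandler(BaseHTTPRequestHandler):
    """Function acting as REST server: accept management commands."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        command = json.loads(self.rfile.read(length) or b"{}")
        # ...dispatch 'command' against the function's state machine...
        self.send_response(204)
        self.end_headers()

# To run the command listener:
# HTTPServer(("0.0.0.0", 9000), CommandHandler).serve_forever()
```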

What interface fits the bill?  In my view, it’s not necessary or perhaps even useful to worry about that.  The key things that we have to support in any management or control API are a defined function set, a defined information model, and a consistent set of interactions.  We could use a large number of data models and APIs to accomplish the same goals, and that’s fine as long as we’re really working to the same goals.  To me, that mandates that our basic function classes (DPC and CTLS in this example) define states and events for their implementations, and that we map the events to API message exchanges that drive the process.

How might this work?  Once a function deploys on resources and is connected by the NFV management processes, we could say the function is in the DEPLOYED state.  In that state its management interface is looking for a command to INITIALIZE, which would trigger the function’s setup and parameterization, and might also result in the function sending a control message to its adjacent elements as it comes up.
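
A toy state/event dispatcher along those lines might look like this; the states, events, and handler are illustrative, not a proposed standard set:

```python
def on_initialize(fn: dict) -> str:
    """INITIALIZE in DEPLOYED: start setup/parameterization, and possibly
    signal adjacent elements as the function comes up."""
    fn["state"] = "INITIALIZING"
    return "setup-started"

# (state, event) -> handler; a function class would define this table.
TRANSITIONS = {
    ("DEPLOYED", "INITIALIZE"): on_initialize,
    # ("INITIALIZING", "SETUP_COMPLETE"): ...and so on
}

def handle_event(fn: dict, event: str) -> str:
    handler = TRANSITIONS.get((fn["state"], event))
    if handler is None:
        raise ValueError(f"event {event} invalid in state {fn['state']}")
    return handler(fn)

fn = {"name": "firewall", "state": "DEPLOYED"}
handle_event(fn, "INITIALIZE")      # fn["state"] is now "INITIALIZING"
```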

This sounds complicated, doesn’t it?  It is.  However, the complexity is necessary if we want to build services from a library of open, agile, selectable, functions.  The fact that we’ve done very little on this issue to date doesn’t mean that it’s not important, it just means we still have a lot of work to do in realizing the high-level goals set for NFV.

What’s Ahead for NFV in 2016?

The fall is an important time for operators, because they have traditionally embarked on a technical planning cycle starting mid-September and running forward to mid-November.  This is the stuff that will then be used to justify their budget proposals for the coming year.  We’ve now finished that cycle for the 2016 budget, and I’ve been getting feedback from operators on what they’ve decided with respect to NFV.

The good news for NFV is that over 90% of operators say that they will continue to fund NFV trials in 2016.  Very few are saying their current projects will be ending, even to be replaced by other activities.  The bad news, though, is that only about 15% of the operators said they were confident they would deploy “NFV services” in 2016, and less than half thought they would get to field trials.

The operators who have made the most progress toward NFV have made it by tightly bounding their NFV missions.  Of the operators who said they would be deploying services, well over 90% said their target service was “vCPE”, with about two-thirds of these basing their deployment on premises hosting of agile features rather than in-the-cloud service chaining.  Managed service providers (MSPs) who add a management/integration layer to wholesaled transport from others make up about half the vCPE group.

The reason this is important is reflected in the next point.  Only about a fifth of the operators who say they plan deployments in 2016 say they have plans to extend their initial NFV commitment to other service areas—even beyond 2016.  Further, while nearly all this deploying-services group describe what they are doing as “NFV”, fewer than 1 in 10 base their deployment plans on a real, full, NFV implementation.

A Tier One who offers managed services gave the best description of what’s happening.  NFV, says the CFO of this operator, is attractive as a framework for delivering portal-driven agile feature delivery to customers.  They see managed services as an opportunity to climb the value chain, but their vision of implementation is really more about a service catalog and portal linked to a simple CPE management system that can load features than about full NFV.

This operator does have further NFV plans, though.  They are looking at services in IoT and mobile content for the future, and they see something more like “real” NFV as a pathway to deploy services in both areas almost on a trial basis.  “It’s the fast-fail model,” the CFO says.  “We deploy generalized infrastructure and we use that to test service concepts the only way that really works…by trying them in the market.”  This is what that particular operator needs service agility for.

There is a question of whether even this is “real NFV” though, and even in my reference operator.  The CFO admits that the operations organization and CIO have only been recently engaged in the activity, and that they are still unsure as to how they’d operationalize an NFV infrastructure and its services.  Their early target services are those where either there is no specific SLA experience in the market to conform to, or where SLAs are largely met by managing the infrastructure (NFV and legacy) as a pool of resources.  “NFV management for us is data center management plus network management,” my CIO friend admits.  “We don’t have a strategy yet that can unify them.”

All of this seems to point to an NFV evolution that’s far from the traditional way services and infrastructure have evolved in the past.  We have not been able to make a broad business case for NFV yet, though at least half-a-dozen vendors could in my view deliver the necessary elements.  Given that, what has happened is that operators have effectively extended the early trials and PoCs that aim toward a specific service-of-opportunity, hoping to prove out something useful along the way.  But doesn’t this present that old bugaboo of the silo risk?  It sort of does, and here we have some interesting commentary.

“NFV silos don’t worry us as much as silos of legacy gear,” the CFO admits.  “NFV’s primary cost is the new data centers—the servers.  The secondary cost is operations integration.  We don’t have the primary cost at risk at all with NFV because the infrastructure isn’t service-specific.  We don’t have ops integration yet so that’s not a potential sunk cost.  So we have a risk-less silo.”

That’s not a long-term strategy, as well over 80% of operators with specific NFV plans for 2016 will admit.  They believe they will be evolving toward that utopian true NFV goal.  The problem, they say, is that they can’t reach it yet.

What?  Do operators think nobody can deliver on NFV as a whole?  That’s pretty much what CFOs and CIOs think, for sure.  When I tell operators that six specific vendors would be able to make such a business case, they smile at me and say “Tom, it’s great that you can tell us that and even name the vendors, but you have to understand that we can’t deploy based on your assurances.  The vendors have to make the claim and back it up, and they are not doing that.”

That was the key result of my discussions, I think.  Operators are saying that it doesn’t matter if you have a complete NFV strategy, because you probably can’t or won’t sell it to the operators.  Everyone has gotten so focused on islands of NFV that the whole process has devolved into letting clumps of stuff, driven by wind and current, aggregate into little islands that will eventually make up a significant land mass.

In a vendor sense, who then does the aggregating, the building of that full NFV?  Not the early vendors, and in fact probably not even vendors who have NFV solutions.  Operators are seeing this increasingly as an operations step taken after early per-service trials.  Because CIOs haven’t been fully engaged so far, we have operations-less NFV, and the CIOs want to add operations by adding OSS/BSS tools above whatever NFV stuff lives below in the service layer.

If all this holds in 2016 it could have profound impact on vendors.  Smaller players might hope to drive service-specific NFV solutions that could in time be collected at the OSS/BSS level.  Larger vendors, though, will have to present something much more OSS/BSS-intensive as their central element because only that story would appeal to the CIOs who will likely drive the next step—the step beyond 2017 when NFV silos are collected.

I don’t think this is the optimum way to do things.  I think it will string NFV deployment out for five years when it could have been done in three.  I also think that we’ll sacrifice a lot of efficiency and agility by failing to fully integrate infrastructure orchestration and management with operations orchestration.  I’d have bet the outcome would have been different, but I’d also have bet that the vendors who can do full NFV right now would have pushed their capabilities more aggressively.  Now, according to operators, the opportunity to have a MANO-driven merger of infrastructure and operations has passed.

This isn’t going to stop NFV.  The risk that operators would drop it and look for solutions elsewhere doesn’t appear to be significant according to my discussions with operators themselves.  But it will change it, and how those changes will work through NFV functions and standards, and how it will impact vendor and market dynamics, will be one of my blog focuses for the coming year.