Some General Thoughts on SDN/NFV Security

SDN and NFV security are issues that rank high with network operators and users, but “security” is an issue that ranks high with everyone and ranking doesn’t always equate with rational action.  Of 48 operators I’ve talked with over the last six months, all place SDN and NFV security among their top three issues.  Enterprises, in my past surveys, have also ranked network security in the top three in over 80% of cases.  But only 38% of enterprises in my survey said they had specific and effective strategies for network security, and only four of the 48 operators said they even had a handle on SDN/NFV security.

A big part of the problem of coming to terms with SDN or NFV security is the lack of a really effective model of security issues.  My surveys have shown that enterprises and operators accept that security includes connection security that provides freedom from interception or injection of information on networks, information security that protects information in database form, and software security that protects programs and components from induced bad behavior.  There’s clearly an overlap in the categories, indicating to me that our notion of security is to secure things rather than to address classes of risks.

Another issue with SDN and NFV security is that it’s easy to get wrapped around what could be called the “legacy axle.”  SDN and NFV both produce network services, and all network services have some common security issues—the need to prevent either interception of traffic or unauthorized injection of traffic, for example.  One might be tempted, with both SDN and NFV, to catalog all of the network security issues currently recognized and then rush out to address them in an SDN or NFV context.  I’d submit that exercise could take quite a while, and it might or might not really hit all the SDN/NFV issues.

There might be another way, which is what I’ll call “differential security”.  SDN and NFV are different in certain ways, and those differences are what will generate incremental differences in SDN and NFV security.  If we ensure that SDN and NFV implementations deal with securing their differential issues, then we should end up with at least what we’d have had with legacy services.  If SDN or NFV have to do more, then we’ll need a map of specific security issues to be addressed and mechanisms proposed to address them.

All software-controlled or software-defined network technologies have an obvious security dependence on the software.  Anything that’s going to be loaded and run can immediately start doing stuff, at least some of which might be malicious.  For SDN and NFV, the paramount security concern should be software authentication.  Every piece of software run has to be authenticated, meaning that the software copy must be a true copy of an explicit release by an accredited organization/entity.  Furthermore, every parameter submitted to control the operation of the software must be similarly authenticated.
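Here’s a minimal sketch, in Python, of what that rule implies in practice, assuming the operator publishes a manifest of approved release digests and keeps the verification key inside its own control software; the names, the HMAC choice, and the file layout are my own illustrative assumptions, not anything drawn from a spec.

import hashlib
import hmac

CONTROL_KEY = b"operator-control-plane-secret"   # held only by the operator's control software

def manifest_is_authentic(manifest_bytes: bytes, signature: str) -> bool:
    # The release manifest itself must be verified before any digest in it is trusted.
    expected = hmac.new(CONTROL_KEY, manifest_bytes, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

def package_is_authentic(package_bytes: bytes, approved_digest: str) -> bool:
    # A software copy is loadable only if it matches the digest of an explicit release.
    return hashlib.sha256(package_bytes).hexdigest() == approved_digest

package = b"...vFirewall image bytes..."                  # hypothetical VNF package
release_digest = hashlib.sha256(package).hexdigest()      # what the manifest would record
assert package_is_authentic(package, release_digest)      # refuse to load on any failure

The same two checks would apply to parameter files, per the second half of the rule above.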

There’s a corollary rule here that I think could be almost as important as the primary rule.  No piece of software and no software parameter file can ever be loaded except by the control software of the network owner.  There can be no situations where service-level software can load other service-level software.  If that happens then the authentication process is broken because there’s no guarantee that proper procedures will be followed.

Software authentication is essential but not sufficient to protect SDN or NFV.  The problem is that software-defining as a process has a natural span of control—a given piece of software is entitled to do specific stuff.  To let it do more is to invite errors at best and malice at worst.  Thus, all SDN and NFV software must be sandboxed to ensure that it can’t spread its wings beyond its intended scope.  In most cases this will mean controlling the software’s interfaces, the connections to the outside world.

The most logical way to build a sandbox for SDN and NFV is to presume that you’re going to expose all your interfaces onto a virtual private network or within a private address space, and then explicitly “gate” selected interfaces into the wide world.  You absolutely cannot make your control interfaces accessible in the public address space, and that includes interfaces between virtual components.
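To illustrate, here’s a small sketch of that “default-private, explicitly gated” posture expressed as data; the address space, component, and interface names are all hypothetical.

from dataclasses import dataclass, field

PRIVATE_NET = "10.100.0.0/16"    # assumed operator-internal address space

@dataclass
class ComponentInterfaces:
    name: str
    interfaces: dict = field(default_factory=dict)       # interface name -> purpose
    public_whitelist: set = field(default_factory=set)   # the only gated exceptions

    def exposure_plan(self):
        # Everything defaults to the private space; only whitelisted interfaces go public.
        return {iface: ("public" if iface in self.public_whitelist
                        else f"private ({PRIVATE_NET})")
                for iface in self.interfaces}

vnf = ComponentInterfaces(
    name="vFirewall",
    interfaces={"data_in": "user traffic", "data_out": "user traffic",
                "mgmt": "configuration", "scale_ctl": "lifecycle control"},
    public_whitelist={"data_in", "data_out"})   # control and management are never public

print(vnf.exposure_plan())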

This is particularly important for “management interfaces” because NFV in particular presents a risk of “leakage” of untrusted elements into a trust-heavy environment.  VNFs are caught between two worlds; they are not part of the carrier’s NFV software any more than applications bought from third parties are part of a company’s data center.  If management interfaces are exposed within a service then VNFs become a possible portal for malware.

Even in a world of sandboxed, authentic software we still have some risk of an inside job.  When you can’t prevent a problem, at least identify the culprit reliably so you can punish them as a deterrent.  That means that every operational change made to a software system has to be attributable via a state audit process.  You sign a “change request” and submit it, and the change is then stamped with your signature.  In fact, you could make a strong argument for all management-plane messages to be digitally signed.
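To make the attributability idea concrete, here’s a sketch of signing and verifying a change request; the field names and the per-operator HMAC key are assumptions, and a production system would more likely use asymmetric signatures so the verifier never holds the signing secret.

import hashlib
import hmac
import json
import time

def sign_change_request(request: dict, operator_id: str, key: bytes) -> dict:
    # Stamp the request with the submitter's identity and a signature over the whole body.
    body = dict(request, operator=operator_id, timestamp=time.time())
    payload = json.dumps(body, sort_keys=True).encode()
    body["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return body

def verify_change_request(signed: dict, key: bytes) -> bool:
    # A state audit recomputes the signature to attribute the change to its signer.
    body = {k: v for k, v in signed.items() if k != "signature"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])

audit_log = [sign_change_request({"action": "scale_out", "target": "vEPC-SGW-3"},
                                 "ops.jsmith", b"per-operator-secret")]
assert verify_change_request(audit_log[0], b"per-operator-secret")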

Attributability is one of the SDN/NFV topics that I think have gotten short shrift in discussions.  One reason it’s important to talk about it is that not all signature strategies are really capable of being applied to a broad and loosely coupled community of elements.  Yet that’s what SDN and NFV would create.  I’d like to see some strong recommendations in this area.

Boundary issues are my final point.  SDN and NFV have to interwork with legacy technology at various points, and these points represent places where management and control are different on each side of the mirror.  Information like reachability and topology data in IP and even Ethernet may have to be generated or converted at border points, and with this comes the risk of something spoofing itself into creating a failure or even a breach.  To the extent that SDN and NFV create borders, they have to be examined carefully to ensure that they’re secured.

The sum of the SDN/NFV security situation is that SDN and NFV create multi-element “black box ecosystems” that represent device functionality of the past.  We have to ensure that these black boxes don’t create increased levels of risk versus the devices they displace.  Otherwise black boxes become black hats.

How SDN and NFV Impact Netops

The impact of SDN and NFV on OSS/BSS is a topic that obsesses many operators and also a topic I’ve blogged about extensively.  There’s no question it’s important, but there’s another kind of operations too—network operations.  It’s not always obvious, but both SDN and NFV would have profound impacts on network operations and the operations center—the NOC.  Some of the impacts could even threaten the classic model we call “FCAPS” for “Fault, Configuration, Accounting, Performance, and Security”.

In today’s world, network operations is responsible for sustaining the services of the network and planning network changes to respond to future (expected) conditions.  The ISO definition of the management model is the source of the FCAPS acronym and it reflects the five principal management tasks that make up network operations.  For enterprises, this model is pretty much all of operations, since most enterprises don’t have OSS/BSS elements.

To put netops, as many call it, into a broader perspective, it’s a function that’s typically below OSS/BSS and is made up of three layers—element management, network management, and service management.  Most people would put the OSS/BSS layers on top, which means that service management at the top of the netops stack interfaces with the bottom of the OSS/BSS.  Operations support systems and business operations “consume” netops services.  Netops practices can be divided by the FCAPS categories, but both enterprises and service providers tend to employ a kind of mission-based framework based more on the three layers.

Virtualization in any form distorts the classic management picture because it breaks the convenient feature-device connection.  A “router” isn’t a discrete device in SDN or NFV, it’s a behavior set imposed on a forwarding device by central intelligence (SDN) or it’s a software function hosted on some VM or in a container (NFV).  So, in effect, we could say that both SDN and NFV create yet another kind of layered structure.  At the bottom is the resource pool, in the middle is the realized virtualizations of functions/features, and at the top are the cooperative feature relationships.  In a general way, the top layer of this virtualization stack maps to the bottom (element) layer of the old netops stack.
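Here’s a toy rendering of that three-layer structure, just to show how the top layer could be presented to netops as an “element”; the classes and instances are illustrative, not any standard’s information model.

from dataclasses import dataclass
from typing import List

@dataclass
class Resource:              # bottom layer: a pooled host, VM, or white-box switch
    resource_id: str
    kind: str

@dataclass
class VirtualFunction:       # middle layer: a feature realized on pooled resources
    name: str
    hosted_on: List[Resource]

@dataclass
class FeatureRelationship:   # top layer: what the old netops stack would treat as an "element"
    element_name: str
    members: List[VirtualFunction]

router_element = FeatureRelationship(
    "virtual-router-east",
    [VirtualFunction("forwarding", [Resource("sw-07", "white-box")]),
     VirtualFunction("route-control", [Resource("vm-231", "VM")])])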

It’s easy to apply the five FCAPS disciplines to the old netops stack, or at least it’s something everyone is comfortable with and something that’s well supported by tools.  If SDN and NFV could just as easily map to FCAPS we’d be home free, but it’s pretty obvious that they don’t.

Take the classic “fault”.  In traditional netops, a fault is something that happens to a device or a trunk, and it represents aberrant behavior, something outside its design range of conditions.  At one level this is true for SDN and NFV as well, but the problem is that there is no hard correlation between fault and feature, so you can’t track the issue up the stack.  A VM fails, which means that the functionality based on it disappears.  It may be difficult to tell, looking (in management terms) at the VM, just what functionality that was and where it was being used.

We can still use classic FCAPS, meaning classic netops, with SDN and NFV as long as we preserve the assumption that the top of our SDN/NFV stack is the “element” at the bottom of the ISO model.  That’s what I’ve called the “virtual device” model of the past.  The problem is that when we get to the ISO “element” virtualization has transformed it from a box we can send a tech to work on, into a series of software relationships.  Not only that, most of the relationships involve multi-tenant resource pools and are designed to be self-healing.

One logical response to this problem at the enterprise level is to re-target netops management practices at the virtualization stack, particularly at the pooled resources, and treat the ISO netops stuff almost like an enterprise OSS/BSS.  This could be called asynchronous management because the presumption is that pooled resources would be managed to conform to capacity planning metrics and self-healing processes (scaling, failover) would be presumed to do everything possible for service restoration within those constraints.  A failure of the virtualized version of a service/device would then be a hard fault.

This seems to me to be a reasonable way of approaching netops, but it does open the question of how those initial capacity planning constraints are developed.  Analytics would be the obvious answer, but to get reasonable capacity planning boundaries for a resource pool would require both “service” and “resource” information and a clear correlation between the two.  You’d want to have data on service use and quality of experience, and correlate that with resource commitments and loading states.
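As a rough sketch of the kind of correlation I mean, you could imagine joining per-service quality records with the loading of the resources each service was bound to; the field names, thresholds, and data below are invented for illustration.

service_qoe = [
    {"service": "vpn-acme", "qoe": 0.93, "resources": ["host-1", "host-4"]},
    {"service": "vpn-zeta", "qoe": 0.71, "resources": ["host-4"]},
]
resource_load = {"host-1": 0.55, "host-4": 0.91}

def correlate(qoe_records, load_map, qoe_floor=0.8):
    # Flag resources committed to underperforming services, along with their loading.
    suspects = {}
    for rec in qoe_records:
        if rec["qoe"] < qoe_floor:
            for res in rec["resources"]:
                suspects.setdefault(res, []).append((rec["service"], load_map[res]))
    return suspects

print(correlate(service_qoe, resource_load))
# {'host-4': [('vpn-zeta', 0.91)]} -> a candidate capacity-planning constraint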

Not only that, we’d probably need resource/service correlations to trace the inevitable problems that get past the statistical management of pooled resources.  Everyone knows that absent solid resource commitments per service, SLAs are just probability games.  What happens when you roll snake-eyes?  It’s important in pool planning to be able to analyze when the plans failed, and understand what has to be done (one option being roll the dice again, meaning accept low-probability events) when they do.

There’s also the question of what happens when somebody contacts the NOC to complain about a problem with a network service.  In the past, NOC personnel would have a reasonable chance of correlating a report of a service problem with a network condition.  In a virtualized world, that correlation would have to be based on these same service/resource bindings.  Otherwise an irate VP calling the NOC about loss of service for a department might get a response like, “Gee, it appears that one of your virtual routers is down; we’re dispatching a virtual tech to fix it!”
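The service/resource bindings behind that correlation could be as simple as the table sketched below, where the resources, features, and services are all hypothetical; the point is only that a resource fault can be translated into a service-level statement a caller would recognize.

bindings = {
    "vm-231": [("virtual-router-east", "vpn-acme"), ("virtual-router-east", "vpn-zeta")],
    "vm-118": [("vIMS-cscf-2", "mobile-voice")],
}

def service_impact(failed_resource: str):
    # Translate "vm-231 is down" into the services and features a customer would name.
    return [f"{service}: feature '{feature}' degraded, restoration in progress"
            for feature, service in bindings.get(failed_resource, [])]

print(service_impact("vm-231"))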

To take this to the network operator domain now, we can see that if we presume that there exists a netops process and toolkit, and if we assume that it has the ability to track resource-to-service connections, we could then ask the question of whether OSS/BSS needed to know much about things like SDN and NFV.  If we said that the “boundary” element between the old ISO stack and the new virtualization stack was our service/resource border, we could assume that operations processes could work only with these abstract boundary elements, which are technology-opaque.

This then backs into operators’ long-standing view that you could orchestrate inside OSS/BSS, inside network management, or at the boundary of the two.  The process of orchestration would change depending on where you put the function, and the demands on orchestration would also change.  For example, if OSS/BSS “knows” anything about service technology and can call for SDN or legacy resources as needed, lower-level network processes don’t have to pick one versus the other for a given service.  If operations doesn’t know, meaning orchestration is below that level, then lower-level orchestration has to distinguish among implementation options.  And of course the opposite is true; if you can’t orchestrate resources universally at a low level, then the task of doing that has to be pushed upward toward the OSS/BSS in network operators, or perhaps totally out of the network management process of enterprises, into limbo.  There is no higher layer in enterprise management to cede orchestration to, so it would end up being a vendor-specific issue.

This point puts the question of NFV scope into perspective.  If you can’t orchestrate legacy, SDN, and NFV behaviors within NFV orchestration you have effectively called for another layer of orchestration but not defined it or assigned responsibility to anybody in particular.  That not only hurts NFV for network operators, it might have a very negative impact on SDN/NFV applications in the enterprise.

An Operator’s View of Service/Resource Modeling

I had an interesting exchange with a big national carrier on the subject of management integration and unified modeling of services.  I’ve noted in past blogs that I was a fan of having a unified service model, something that described everything from the tip-top retail experience to the lowest-level deployment.  I pointed out that such a model should make management integration easier, but also that it would be possible to have integrated management without a unified model.  You’d need only firm abstractions understood by both sides at critical boundary points.

My operator friend had given some thought to the same point, and reached somewhat different conclusions.  According to the operator, there is a distinct boundary point between service management and resource management.  Above the line, so to speak, you are focused on conformance to an explicit or implied SLA.  It’s about what the user of the service is experiencing, not about the state of the resources underneath.  In fact, you could have significant faults at the resource level that might not generate any perceptible service impact, in which case nothing needs to be done in service terms.

Taking this even further, the operator says that as you move into virtualized resources, where a service feature is created by assembling hardware/software tools that don’t, in management terms, directly resemble the original feature, the boundary between services and resources gets even more dramatic.  So dramatic, says the operator, that you may not want anyone to cross it at all.

Let’s presume that we have a “service layer” and a “resource layer” both of which are modeled for automated deployment and assurance.  The boundary between the two layers is a set of abstractions that are the products of resource behaviors.  An example might be a “VPN” or a “subnet” or a “VLAN”.  The services are built from these abstractions, and it’s the role of the resource layer to make them real.

The operator’s vision here can now be explained a bit more clearly.  A “VPN” abstraction has a set of management variables that are logical for the abstraction.  The service management side has to be able to relate the state of those variables up the line to the customer or the customer service rep.  However, there is no reason for the customer or CSR to dive down through the boundary to see how a given “abstraction property” was derived from the state of resources.  After all, what’s underneath the abstraction covers is likely a shared resource pool that you don’t want people diddling in to begin with.
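Here’s a minimal sketch of that boundary object: a “VPN” abstraction whose management variables are derived from, but never expose, the shared pool beneath it.  The variable names and the stub resource view are my own assumptions.

class _ResourceView:                       # stand-in for shared-pool detail below the line
    site_count = 4
    committed_rate = 200
    def compute_availability(self): return 0.9997
    def worst_path_latency(self): return 38.0

class VpnAbstraction:
    """Only abstraction-level variables ever cross the service/resource boundary."""
    def __init__(self, resource_view):
        self._resources = resource_view    # private; a customer or CSR never dives below this

    def management_state(self):
        return {"availability": self._resources.compute_availability(),
                "site_count": self._resources.site_count,
                "latency_ms": self._resources.worst_path_latency(),
                "throughput_mbps": self._resources.committed_rate}

print(VpnAbstraction(_ResourceView()).management_state())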

There’s a lot to say for this view, and it may be particularly important in understanding the two camps of operations modernization, the “I want to preserve my OSS/BSS processes and tools” and the “I want to start OSS/BSS over from scratch” groups.  If you take my operator friend’s perspective, you can see that things like SDN and NFV can be viewed as “below the line”, down in the resource domain where operations systems and processes that are (after all) primarily about selling and sustaining the service don’t go.

I kind of alluded to this approach in a past blog as the “virtual device” model, but I think it’s a bit more complicated than the term suggests.  The operator is saying that “networks” are responsible for delivering abstract features from which we construct services.  There may be many ways a given feature could be created, but however many there are and however different they might be, the goal should be to harmonize them to a single common abstraction with a single set of properties.  Management above the line relates those properties to the overall state of the service as defined by the service’s SLA, and management below the line tries to ensure that the abstraction’s “abstraction-level agreement” (again, explicit or implied) is met.  Both parties build to the same boundary, almost the way an NNI would work.

The difference between this and what I was thinking of as a virtual device approach is that in general the abstractions would be network behaviors, cooperative relationships involving multiple devices and connections.  My thought was to make “virtual devices” that mapped primarily to real devices.  I have to agree with my operator’s view that the abstraction-boundary model makes more sense than the virtual device model because it fixes the role of OSS/BSS at the service level and lets fulfillment go forward as it has to, given the resources used.

The value of this independence is that “services” know only of network behaviors that are “advertised” at the boundary.  Most new network technologies, including SDN and NFV, are considered primarily as alternative ways of doing stuff that’s already being done.  We have IP and Ethernet networks; SDN gives us a different way to build them.  We have firewalls and NAT and DNS; NFV gives us a different way to deploy those features.  In both cases, though, the service is made up of features and not devices or technologies.

Oracle, interestingly, has been touting something it’s working on in conjunction with the MEF, called lifecycle service orchestration or LSO.  The Oracle model shares much with Alcatel-Lucent’s vision of how operations and NFV or SDN coexist, but interestingly my operator friend says that Oracle, Alcatel-Lucent, and the MEF don’t articulate that critical notion of a boundary abstraction to the satisfaction of at least this particular operator.

The abstraction-boundary approach would make a two-tier management and orchestration model easier to do and set boundaries on implementation that would largely eliminate the risks of not having a single modeling approach to build both sides of the border.  In fact you could argue that it would allow vendors or operators to build service-layer and resource-layer structures using the stuff that made the most sense.  Two models to rule them all instead of one.

Or maybe two dozen?  In theory, anything that could present an abstraction in the proper form to the service layer would be fine as an implementation of the resource layer.  The abstraction model could at least admit to multiple vendor implementations of lower-layer orchestration and resource cooperation.  It could even in theory encourage them because the abstraction and the existence of a higher-layer service model would harmonize them into a single service.  It’s a bit like telling a programmer to write code to a specific functional interface; it doesn’t matter what language is used because from the outside looking in, all you see is the interface.
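To push the programming analogy one step further, here’s what such a functional interface could look like, with two interchangeable resource-layer realizations behind it; every class name and value here is hypothetical.

from abc import ABC, abstractmethod

class VpnResourceBoundary(ABC):
    # The service layer depends only on this abstract boundary.
    @abstractmethod
    def deploy(self, sites: list, bandwidth_mbps: int) -> str: ...
    @abstractmethod
    def management_state(self, vpn_id: str) -> dict: ...

class MplsVpnImpl(VpnResourceBoundary):        # a legacy realization
    def deploy(self, sites, bandwidth_mbps): return "mpls-001"
    def management_state(self, vpn_id): return {"availability": 0.9995}

class SdnOverlayVpnImpl(VpnResourceBoundary):  # an SDN/NFV realization
    def deploy(self, sites, bandwidth_mbps): return "sdn-001"
    def management_state(self, vpn_id): return {"availability": 0.9996}

def provision(boundary: VpnResourceBoundary):
    # From the outside looking in, all the service layer sees is the interface.
    vpn_id = boundary.deploy(["NYC", "LON"], 100)
    return boundary.management_state(vpn_id)

print(provision(MplsVpnImpl()), provision(SdnOverlayVpnImpl()))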

I’m not quite ready to give up on my “single model” approach, though.  Obviously you can create an abstract-based boundary between services and resources using a single model too; you can create such a boundary anywhere you like whatever model you use.  The advantage of the single model is, first, that you can traverse the model from top to bottom with the same tools and, second, that everyone is working with the same tools everywhere.  I confess that current practices might make traversing an entire model less useful than I’ve believed it would be, but we still have to see how practices adapt to new technologies before we can be sure.

This is the sort of input I think we should be trying to solicit from operators.  There are many out there who are giving serious thought to how all of our revolutionary technologies would work, and many are basing their views on early experimentation.  This sort of thing can only strengthen our vision of the future, and the strength of that vision will be a big part in getting it financed.

What Oracle Teaches Us About the Cloud

Oracle reported their numbers on Wednesday, and the results weren’t pretty by Street standards.  The company missed pretty much across the board, and in particular in Europe.  Oracle blamed foreign exchange for much of their problem, but the general financial-industry consensus is that it’s deeper than that, including dragging hardware, poor execution, and perhaps a lack of an effective cloud strategy overall.  One Street report said that few would doubt that Oracle will be a major cloud player.  That may well be true, but many (including me) doubt what the profit implications of being a major cloud player will turn out to be.

The overwhelming majority of business IT executives and general management teams say that the reason for cloud adoption is reduced cost.  OK, but if that’s true then we also have to ask how vendors and providers will end up making money on the cloud.  Remember that old joke, “I’m losing money on every deal but I’m making it up in volume?”  For any provider of cloud services or equipment, particularly the CFO in such a company, this is a bad joke indeed.

Business spending on IT (compute and networking) is made up of two components: what can be called “budget” spending and what can be called “project” spending.  Budget spending is allocated to sustain the IT operations in place, operations that were previously justified by business improvements they supported.  Project spending is allocated to reap new business benefits by adding IT features or capabilities.

Any buyer would look at this picture and come to two important conclusions.  First, the smartest thing for me to do with budget spending is to reduce my budget by getting a better deal on technology.  In fact, on the average, companies say they would like to see budget spending decline by about 5% year over year.  Second, I need to justify my project spending carefully because it first has to meet my company targets for ROI and second it will be contributing (through modernization and replacement) to budget spending for some time forward.

CIOs and IT planners understand budget spending pretty well; it boils down to getting more bang for your buck.  They generally don’t understand project spending all that well because in many cases they don’t have a good opportunity model in place.  Projects are driven by benefits, and if you can’t identify opportunities to reap benefits you’ll not justify projects.  That means specific benefits, with quantifiable outcomes, not catch-phrases like “agility”.  Oracle’s first problem is that they have not effectively presented project-oriented benefits, so their cloud offerings tend to fall into the budget buying that’s based only on reducing cost.  Oracle is the cost getting reduced.

Historically, project spending has been driven by what could be called “waves”.  When computers came along and batch processing of information contributed to better management reports and operations, we had a wave.  Distributed computing produced a second wave, the minicomputer and real-time applications.  Personal computing produced a third wave.  With each wave, the new compute paradigm was widely socialized in technical media and the essence of the value proposition for that new paradigm was quickly spread through management.  Project spending ramped up, until eventually all of the new benefits were reaped.  At that point, project spending declined and IT spending overall dipped relative to GDP growth.  You can see this sinusoidal set of peaks and valleys for the three waves we’ve had.

The challenge of project spending today is that there is no clear fourth wave, nor IMHO is there a clear way to socialize a candidate for it.  There’s no real trade media insight any more.  The climate of technology today is controlled by budget spending, which you’ll recall is an area where the target is always to lower IT cost.  If the cloud is important, it’s because it lowers IT costs.  Same with SDN, NFV, or whatever.  No new benefits, no new spending, and what you face is a period of commoditization and decline.  Oracle’s second problem is that the natural direction of IT coverage in the media today, the natural focus of CIOs, is cost reduction.

But does it have to be this way?  The interesting thing about past cycles is that each of them has shown about the same peak/valley relationship.  The problem is that while in the past one valley was followed within a few years by the ramp to a new peak, since 2002 we’ve had no such ramping—we’ve stayed near the historical lowest ratio of IT spending growth to GDP growth.  There has been no driver to a new cycle, and because we’ve had over a decade of cost-driven IT most of the management on both the seller and buyer side have lost the charts to productivity benefits’ safe harbor.  That includes Oracle.

Part of the problem is that each past productivity cycle was driven, and all future cycles likely will be driven, by a new way of supporting workers.  In the past, the change was easily visualized by everyone.  No computing to batch computing—easy to understand.  Batch to real-time—also easy.  Same with remote real time to personal computing.  You can see that each of these trends seems to move computing forward in the worker production cycle—closer to the worker in a physical sense.  But with personal computing we’re there at the worker.  Do you go inside for the next cycle?  You see the visualization problem.

I’ve grappled myself with understanding what the next cycle driver might be.  Clearly it has to have something to do with work and workflow in an IT sense, because the best way to look at past cycles is really not where the computer is but how it plays in the worker’s activity.  My current thinking is that the next cycle would be driven by point-of-activity empowerment.  The difference would be that the worker would no longer even be going to IT, they would be taking IT with them.  Thus, it’s about mobility.

Even if I’m right about mobility, though, it’s not as easy a thing to exploit.  If PCs are a new productivity wave, you buy PCs.  What do you buy if mobility is the next wave?  Mobile phones or tablets aren’t the answer.  First, users have them already—from their company or personally.  Second, the application of PCs to productivity was clear.  Microsoft Office and Adobe’s Creative Suite were the embodiment of PC-based productivity enhancement.  What’s the brass-ring mobile app?  For the first time, perhaps, our next wave is enabled by something (mobility) but is realized in a more complicated way, through a lot of changes in a lot of areas.

Sales stories based on complicated value propositions have always been difficult, and in an age where an insightful online article is about 500 words long, it’s almost impossible to guide planners through mobile-enhanced productivity changes.  Oracle’s sales failure is probably more a marketing failure, because complex new stories have to be told first, broadly, and inspirationally, in the marketing channel.  A salesperson will never be able to drag a buyer out of total ignorance of a benefit case into the state of good customer-hood.

It’s not that Oracle got to the cloud too late, but that they’re apparently (at the marketing and sales levels) getting to it wrong.  At the root of Oracle’s problems is the fact that they’re seeing the future as a cloud transition.  It’s not; it’s a mobile transition that will convert the cloud from a cost-saving-and-revenue-killing strategy to a strategy to gain new benefits and increase total IT spending.  They’re not the only ones with that problem.

The cloud, mobility, and virtualization can change the world and probably will eventually.  The question for companies is whether they’ll be able to navigate those changes, and the long-standing tendency to take an easy sales argument like “Saves money!” and run with it in the near term in the hope of gaining market share is hurting them.  You’ve got to be committed to revolution and not contraction.  True for Oracle, true for us all.

NFV’s Virtual Node Opportunity Could be Significant

I’ve blogged now about “edge-based” and “interior” NFV service opportunities, and in the latter I noted that I was going to treat the case of “interior nodes” separately.  Many of you will probably understand why that is the case, but I hope to show everyone why nodal services are different, and perhaps generate some discussion even among those who’ve known that all along.

A network is a structure designed to generate connectivity through a combination of trunks and nodes.  Nodes provide data passage among trunks, what in the general sense is called “forwarding” based on addresses.  Trunks connect the nodes.  In the old days before virtualization, it was pretty obvious what each of these two things was and most important where they were.

The thing we call “virtualization” (again in its most general sense) has changed networking gradually by allowing for “virtual trunks” that were really paths composited from segments of media.  The segments could be parallel, like lambdas, or sequential as in strung along in a line.  In the OSI model, a given layer offers abstract services to the higher layer and creates them within its own domain as needed, so “virtual circuits,” paths, and tunnels are equivalent to physical-layer trunks if that’s how the services are pictured.  We’ve had networks based on that for ages.

We’ve also had virtual nodes, in a sense, through VLAN and VPN services.  These services look to the user like a kind of device and replace the classic node-and-trunk per-user structures.  SDN can augment these virtual-node services because it can provide customized forwarding control that could either structure traditional services in a different way or build new ones with new forwarding rules.  You can also simply host a bridge or router instance in a server or VM and create a software node.

The thing that’s common to both legacy and virtual-node models today is that the topology of the network, including the placement of nodes, is fairly static.  You may have a real router or an SDN white box or a software router, and you may let it work adaptively or control forwarding explicitly, but it kind of is where it is.  In theory, NFV could change that and the question is under what circumstances a change would be useful.

There are a lot of reasons why a network topology could change, the most obvious of which is that the optimum location for nodes was impacted either by changes in traffic flows or by changes in the underlying transport properties on which the trunks were based.  The former is obviously the most “dynamic” but you can see the problem immediately; “traffic flows” in aggregate may not be that dynamic.  On the other hand, suppose we returned to the notion of networks of old, trunks and nodes, and simply substituted “virtual” into each term?  On a per-user basis, topology optimization would be a lot more useful.

If we view VPNs and VLANs as being created by special features at Level 3 and 2 (respectively) by routers and switches then we lose dynamism value by aggregating stuff.  If we build private networks, which is what VPNs and VLANs build, in the “old” way with virtual switches and routers then we personalize topology to the service and users.

Virtual trunks of any sort could be married to NFV-deployed virtual routers/switches to create an instant and different model of a virtual private network.  Not only that, the nodes could be moved around, replicated for performance, etc. providing that you could tolerate the disruption in the packet flow.  Obviously a performance problem or a device failure would create a disruption anyway, so it’s a matter of degree.
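A toy placement calculation shows why per-service topology optimization becomes practical once the nodes are virtual; the data centers, latencies, and traffic figures below are invented for illustration.

latency_ms = {   # candidate hosting data center -> per-site latency
    "dc-east": {"nyc": 5, "chi": 18, "lax": 62},
    "dc-central": {"nyc": 22, "chi": 6, "lax": 40},
    "dc-west": {"nyc": 65, "chi": 42, "lax": 4},
}
traffic_gbps = {"nyc": 3.0, "chi": 1.0, "lax": 0.5}   # per-site offered load

def best_location(latency, traffic):
    # Pick the hosting point that minimizes traffic-weighted latency to the sites.
    def cost(dc):
        return sum(latency[dc][site] * load for site, load in traffic.items())
    return min(latency, key=cost)

print(best_location(latency_ms, traffic_gbps))   # -> 'dc-east' for this traffic profile

Re-running that calculation as flows shift is what would make a per-service topology genuinely dynamic.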

This model of a virtual private network/LAN could be connected to user sites through a virtual pipe, which would mean their on-ramp router was in the cloud, or through a virtual trunk from a (likely virtual) router hosted on premises.  That could be customer- or carrier-owned.  Since the interior router on-ramp would exist in either case, this looks like what I’ve called the “agent model” of service access; a cloud agent represents each network site and you connect with your agent to get the service.

One of the interesting consequences of this model is that connection services live totally in NFV data centers; you can only host virtual routers where you have hosting resources available.  That facilitates the introduction of other NFV services since you have direct access to user traffic flows from the places where new service features could be placed.  You’d never have to “hairpin” traffic to get to a service chain, for example.

If the model is carried to its logical conclusion and all business virtual network services are hosted this way, you also have a natural way of connecting to applications the operator hosts in their own cloud computing service.  The carrier would have a competitive advantage because they’d have direct connection with customer flows; no additional delay or risk of failure would be introduced if cloud applications were co-located with virtual routers carrying the private network traffic.

There are also, obviously, operational questions that would have to be answered.  One big one is whether multiplying the router instances needed to build private networks (versus providing for them out of shared devices) generates a risk of greater opex.  I think that if we assumed that our trunk underlayment was engineered for reliability and also had path recovery capabilities, you might actually end up with lower operations costs.  You’d also eliminate some security issues by partitioning traffic by customer/service below the router layer.  But we need work on this point.

We also need work on understanding how the model would apply to multi-user service subnets.  If a service is supported on its own subnet like an application, and if users are gated onto that subnet as needed, how would the fact that the users weren’t of the same company impact the topology and costs?  That would also help answer the question of how mobile users would impact the VPN/VLAN picture.  Does the operator provide a mobile virtual on-ramp in multiple metro areas, and if so how would that be priced and how would it impact traffic and operations?

I believe that virtual routers and switches could have a profound impact on network services, starting with VPNs and VLANs, but I also believe that both operators and vendors have tended to think in too linear a way regarding how they could help.  You can never take full advantage of virtualization if you create virtual things that exactly mirror the capabilities of physical devices already in use, then place the virtual things in the same place you put the physical devices.

In some ways, virtual switching and routing deployed by NFV and organized and accessed via SDN could be a killer app for both SDN and NFV.  That would be particularly true where an operator had a lot of new customers or moves, adds, and changes and so would be refreshing infrastructure more regularly.  Of all the NFV opportunities, deployment of virtual nodes has received the least attention.  I think that should change.

Exploring “Natural-Interior” Applications of NFV

I blogged yesterday about the vCPE model of services, talking both about its role in NFV and how it might have a life outside/beyond NFV.  Some of you were interested in what might be said about other new service models, and in particular how “interior” rather than edge models could work.  My information on this topic is more speculative because operators have focused on vCPE and have really looked at only a few “interior” models, but I’m happy to offer what I have and draw some tentative conclusions.

Edge-related services, in my vernacular, are services that are normally associated with devices connected at or near the point of service connection.  Interior services are thus the opposite; ones that don’t normally look like an extension of the access device.  I’m not classifying service chains that support natural-edge functions as “interior services”; they’re virtualized services that happen to be hosted deeper.  For my discussions I want to focus on the natural habitat of the service, not its hosted manifestation.  I also want to exclude, for a later blog, the topic of interior services of routing, switching, and other connection functions.

Operators who have looked at NFV-driven service evolution have identified a number of possibilities for interior services, which I’ll list in order of operator mentions:

  1. Mobility, meaning IMS and EPC. This is the runaway winner in terms of interest.
  2. Content delivery networks (CDN) and the associated elements.
  3. “Distributed” network features such as load balancing and application delivery control.
  4. Hosting services, including web, email, collaboration, and cloud computing.

While there’s widespread interest in the mobility category of interior NFV targets, operators confess that there’s still a lot of confusion in terms of the “what” and the “why”.  Operators have tended to divide into three groups by their primary goal.  The largest is the “agility, elasticity, and cost” group that’s targeting a virtualized replacement of fixed IMS/EPC elements.  This group isn’t seeing a major change in the conceptual structure of IMS/EPC, only elastic hosting of RAN, mobility management elements, HSS elements, and EPC components.  The second “re-architect” group would like to see all the mobility elements redone architecturally based on current insights and current technology.  The final group, “applications”, wants to focus on extending mobile services with applications and look to NFV to host the application components and integrate them with IMS/EPC.

Perhaps the most interesting of the “re-architect” concepts is framing the whole notion of EPC in SDN terms.  Tunneling in the classic EPC sense may not be needed if you can control forwarding explicitly.  In addition, if you presume that EPC components like PGWs and SGWs are virtual and cloud-hosted, might they not end up being adjacent for all practical purposes?  If so what’s the value of having them as separate entities?

The content delivery area is less conflicted but still somewhat diverse in interests.  It divides into two segments: operators primarily worried about mobile content delivery, and operators looking at IP video primarily as a home viewing offering.  The first group is blending CDN and IMS/EPC virtualization into a single goal, and the second is looking at CDN and perhaps set-top box virtualization.

TV Everywhere is the primary focus of the first group, obviously.  Operators who have multichannel video services are very interested in leveraging their assets, and those who don’t are looking perhaps to partnership relationships with others who do.  A CDN, IMS-like user certification, and a strong virtual EPC could make TV Everywhere, well, go “everywhere.”

Interest in distributed network features is the least focused, in no small part because the category itself is so diffuse.  For all the specific feature targets there are two sub-classes—a group that wants to architect the concept for the virtual age and a group that simply wants to virtual-host the existing elements.

Some distributed features, like DHCP and DNS, already exist and would likely be simply virtualized and made scalable in some form.  But in both cases it would be helpful to explore how scaling for performance would be handled given that both DHCP and DNS servers are dynamically updated, so scaled-out copies have to stay consistent.  Other features like load balancing and application performance management (“higher OSI layer” services) really have to be re-architected to be optimally useful when they’re employed in the cloud, particularly when they’re in front of a scalable community of components.

Many operators believe that the logical structure of these higher-layer, traffic-and-application-aware services is a distributed DPI capability.  Most don’t see DPI being added to every trunk, but rather see it as something that would be hosted in multiple data centers and then used in a cooperative system of forwarding management.  Some believe that DPI would also be coordinated with DNS/DHCP services, particularly for load balancing.

The last category of interior features is the hosted elements of today—things like web servers, mail, SMS/IM, collaboration, etc.  Even cloud computing services could fall into this category.  You might find it surprising that this item is last on the list, but the fact is that these services are typically not provided by the network (regulated) entity and most NFV planners come from that side of the house.

The questions that these services raise in the network sense are only starting to emerge.  If you assume that the hosted element/service is either “on the Internet” or on a VPN/VLAN, then you could assume that the service might simply be made visible in the proper address space.  However, some operators are looking at things like “bicameral email” where an email server has a VPN presence and a public SMTP presence and a security element between.  Here we have something live in both worlds—Internet and VPN.

That model is of growing interest for hosted elements and applications.  I’ve mentioned the idea of having “application-specific VPNs” created through SDN where every application or class of application has its own subnet, and where users are connected onto subnets based on what they’re allowed to access.  This same approach, meaning the creation of what are essentially service-specific subnetworks, would also be valid for NaaS based on SDN technology, and some technology planners in the operator world think that some sort of SDN/NFV hybrid of connectivity and features would likely represent what all future services might look like.
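As a sketch of the application-specific subnet idea, a user’s connectivity is simply the set of subnets their entitlements map onto; the addresses, applications, and users below are made up.

app_subnets = {
    "erp": "10.20.1.0/24",
    "email": "10.20.2.0/24",
    "collab": "10.20.3.0/24",
}
entitlements = {
    "alice": {"erp", "email"},
    "bob": {"email", "collab"},
}

def connection_set(user: str):
    # The subnets an SDN controller would be asked to gate this user onto.
    return {app: app_subnets[app] for app in sorted(entitlements.get(user, set()))}

print(connection_set("alice"))   # {'email': '10.20.2.0/24', 'erp': '10.20.1.0/24'}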

This model could unite the edge and interior visions.  It might also be the key to creating a useful mobile services model, one that can exploit things like IoT, location, social context, and so forth.  That may seem counter-intuitive, so let’s look a bit at why that’s true.

If you think about NFV services, and in particular the notion of “interior” models, what you see is the notion of a service that’s almost everything-independent.  It lives out there on its own hosted by some set of network-linked systems and represents connectivity or other experiences that a buyer finds valuable.  The service is delivered by making an edge connection.  We’d think of that connection as being one of a number of fixed-branch-site access pipes because that’s the typical VPN or VLAN model, but it would be perfectly feasible to envision a service as a kind of “meet-me” where users access a service on-ramp and obtain the service when needed.

That model suits mobility in two ways—first it is easily adapted to a roaming user, and second it adapts to the notion of a “personal assistant” or mobile-agency service models.  It also suits the notion of a universal dmarc with NaaS on-ramps ready to be connected when needed, and with NFV-based services also there to be accessed as an application/service-specific subnetwork.  In fact, if the model were in place, users would have a virtual persona in the network that they could contact from any device, and obtain the services they count on.

One interesting question which (like many, I confess) I don’t have a good answer to is whether edge and interior models of NFV services can grow naturally together to support this symbiotic model.  So far, trials and use cases for both SDN and NFV are fairly contained and so something like this would be a major change of approach.  But without a vehicle that moves both edge and interior forward, we might end up missing the chance to help both missions, operators and vendors, and the market overall.

Some Important Trends Wrapping Around vCPE

Virtual CPE (vCPE) is one of the hotter topics these days, and even though it has its roots in NFV the concept seems to be taking some tottering steps on its own.  As it does, it may be exposing some evolutionary trends that could supplement or replace aspects of the NFV value proposition, and even help us understand where services are heading.

vCPE is part of a very well-established trend toward “service-sourcing” in networks.  In the old days, businesses built networks from circuits and nodal devices—routers most recently.  As IP convergence and Carrier Ethernet matured, this model was gradually replaced by an edge-networking model, where an operator provided architected multi-site connectivity (VLANs or VPNs) and users simply attached to it at each site.  In most cases the edge devices were essentially members in the architected connectivity operators provided, and in some cases they were even “customer-located equipment” or CLE, provided by and owned by the operator but located on the premises.

The benefit of architected connectivity is that you don’t have to build a network from trunks and nodes on your own.  The value proposition for this type of service is much the same as for the cloud, so if it’s good to buy architected L2 or L3 services, then it follows it might be helpful to buy higher-layer services as well.  In many cases, users of VLANs and VPNs purchase equipment or features to add to basic L2/L3 connectivity, and it’s these devices that were among the primary targets of the early NFV work.  Why buy and install devices that add specific functionality and have to be maintained, updated, changed in and out, when you can do that virtually, as-a-service?

Generalizing service deliveries, meaning true networking-as-a-service, requires a general access model or you end up with access connections waiting for something to connect with.  Elasticity of services demands, in a realistic world, an access pipe with enough capacity and capability to terminate all the variable stuff you plan to use/sell.  Adva’s latest agile access notion, which is only vaguely NFV-ish in detail, is a good example of an agile dmarc.  You could in theory set up multiple terminations on a trunk—L2, L3, or a combination of the two.  This also seems compatible with Brocade’s elastic edge concept, which uses a hosted router/firewall inside the carrier cloud to serve as the L3 ramp to a simple L2 dmarc.  It’s not NFV but it might be important in exploiting what NFV can do.

Another approach that’s being touted is the “agile CPE” model, where a device that has general hosting capability is used to replace multiple special-function devices like firewalls and routers.  This could look a lot like NFV in that you could add “VNFs” to the box to add functionality, but unlike classic NFV there’s no need to host the VNFs in a cloud.  RAD, Overture, and other vendors are now offering boxes that do this.

Silver Peak just announced another slant on next-gen vCPE; their Unity EdgeConnect uses the Internet to create secure VPNs that can either augment or replace standard carrier VPNs (MPLS).  Their approach lets users employ a single access connection to support the cloud, the Internet, and VPNs.  It wouldn’t be difficult to add hosted NFV-ish features to such a model, either.

If the virtual CPE model is based on hosting at the dmarc, then it’s logical to associate it with more versatile edge devices.  Logical but maybe not necessary.  Any server with the right network card could terminate a carrier connection and be used to host connection-point or network-based functions, of course.  Given the fact that operators are looking to replace specialized network devices with servers, it seems inevitable that there’d be interest in vCPE that’s hosted on such a box at the termination point, whoever owns it.

And whoever supplies it.  If vCPE is a good idea, if higher-layer services are a good idea, then the dmarc might be very valuable real estate indeed.  There’s a lot of stuff that is best done there, either on a transient basis (like “service chaining”, where local function hosting could alleviate massive first-cost problems associated with building cloud data centers for VNFs) or permanently.  Firewalls and encryption obviously belong close to the service termination, and voice services benefit from having a virtual IP PBX locally for intra-facility calls.  Application acceleration belongs at the dmarc too, or very close to it.  Get a good camel’s nose under the dmarc tent and there’s a lot of follow-on business you could do.

You might even have a lot of carrier partnership opportunities.  A lot of cloud services and even advanced network services could benefit from local representation, and the owner of the dmarc device might have a chance to promote deals with network/cloud providers based on their incumbency.  If the device vendor could induce the user to take a “vCPEaaS” deal, retaining ownership and management of the edge, then the opportunity could get a lot larger.

Something like this could be a boon not only to Carrier Ethernet companies who have been finding it harder to get into the NFV-and-SDN revolution, but even for consumer equipment vendors.  There’s no reason why a Level 2 service interface with multiple L3 terminations couldn’t work with any consumer broadband service model—telco or cable.  You could even envision models where a user’s mobile broadband service has multiple virtual terminations, perhaps for direct access to wearables or simply to add service features.

This is where vCPE cuts both ways, in NFV or even SDN terms.  Because vCPE could be an agent for services it could promote services by facilitating their use.  vCPE, for example, could make services “appear” at a dmarc, a trivial-sounding but essential element in getting user access to those services.  I’ve already noted that vCPE could be used as an on-ramp to NFV-type service chaining in the cloud, offering operators a way of providing VNF hosting before they have enough customers to justify cloud data centers.  But sometimes the access road is as far as you need to go.  vCPE could tap off a significant segment of early NFV opportunity if those VNFs you start hosting in vCPE stay there and never move to a carrier cloud at all.   Similarly, SDN services could be terminated in vCPE to make them easy to access, or you could use CPE to create what looked like SDN from legacy service technologies.

“Real” NFV offers broader services, greater agility, and lower mass-cost points, providing that you actually design your implementation to do all of that.  As vCPE becomes hotter, proponents of real NFV will have to pay more attention to how to differentiate their expressway from the on-ramp, rather than just promoting the notion of chaining VNFs.  And vCPE vendors may make that hard to do, if they learn their own positioning lessons in the crucible of competition.  Or maybe they’ll form a common cause.  It could be an interesting marriage, or struggle, to come.

Can a New Kind of Industry Group Solve NFV’s Problems?

For those who, like me, find the current NFV activity diffused and missing the point more often than not, the prospect of yet another body with their fingers in the pie stirs mixed emotions.  The New IP Agency (NIA) was launched at the Light Reading Big Telecom Event last week with (according to the group) four missions: information, communications, testing, and education.  Since NIA is born out of a realistic picture of where we are with NFV, there’s hope.  Since we already have a couple dozen groups working on NFV, there’s risk NIA will simply be another of them.

On the “hope” side, there is no question that there are major problems developing with NFV.  The original mission of NFV was to reduce costs by substituting lower-cost hosting for higher-cost appliances—a capex-driven argument.  Most of the operators I met with discounted this within the first year of NFV’s existence; they said there weren’t enough savings attainable with that model versus just forcing vendors to discount their gear.  NFV’s benefit goals have shifted to opex reduction and service agility, both of which could be substantial.

The operative word is “could”, because while the goals shifted the scope of NFV’s efforts did not.  The specifications still focus on the narrow range of services where large-scale substitution of virtual functions for appliances has been made.  Since legacy equipment is outside NFV’s scope, you can’t do end-to-end orchestration or management within the specs, which means that opex efficiency and service agility are largely out of reach—in standards terms.  There is also a very limited vision of how services are modeled and managed even inside the scope of the spec, and that’s largely what contributed to the situation Light Reading called the “Orchestration zoo”.

It would be possible, in my view, to assemble a complete model for NFV that embraced end-to-end services and even organized management and operations processes.  My ExperiaSphere example shows that you could build such a thing largely from open-source elements and that it could support the ISG’s work as well as embrace TMF and OASIS cloud specifications.  The TMF has run some Catalyst demos on operations integration, and there are at least four NFV vendors whose orchestration strategies could provide enough integration breadth to capture the full benefit case.  Arguably all of these conform to standards.

Andrew Tanenbaum once quipped that the wonderful thing about standards is that there are so many of them to choose from.  NIA doesn’t propose to write more, but I wonder whether it could even hope to do the choosing.  What it could do is to shine some light on the real problems and the real issues.

Example:  An “orchestration zoo” isn’t necessarily a disaster.  The cost of orchestration tools is hardly significant in the scope of NFV infrastructure costs.  If multiple orchestration tools delivered better benefits then they could be better than having a single standardized one.  What we can’t afford, and what NIA would have to help the industry to avoid, is either a loss of benefits or a loss of infrastructure economies.  Multiple orchestration approaches could lead to that more easily than a unified approach, but if we managed how orchestration processes related to both infrastructure and operations processes we could avoid real harm.

Example:  Service modeling is really important both in preserving benefits and preserving economies of scale in infrastructure.  A good service model both describes deployment and records results of deployment.  You need it to organize the components of a service, and to reorganize them if you have to scale horizontally under load or replace a piece.  You need a service model to start with goals—an intent model.  You need it to decompose downward in an orderly way, with each transition between modeling approaches represented by a hard service/intent-modeled boundary.  You need to end up with an abstract vision of hosting and connection services that doesn’t expose technology specifics upward to services, but models it as features across all possible implementations.  There’s been next to no discussion of service modeling, and NIA could help with that.
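For what it’s worth, here’s a compact sketch of the kind of intent-modeled hierarchy I’m describing, where each node states what it intends to deliver, records what deployment produced, and decomposes until resource-facing leaves are reached; the structure and field names are illustrative, not any standard’s.

service_model = {
    "intent": {"name": "business-vpn", "sla": {"availability": 0.999}},
    "recorded": {},                      # filled in as deployment proceeds
    "children": [
        {"intent": {"name": "vpn-core", "type": "connection"},
         "recorded": {}, "children": []},
        {"intent": {"name": "vFirewall", "type": "hosted-feature"},
         "recorded": {}, "children": []},
    ],
}

def decompose(node, deploy_leaf):
    # Walk the model top-down; only resource-facing leaves touch technology specifics.
    if not node["children"]:
        node["recorded"] = deploy_leaf(node["intent"])
    else:
        for child in node["children"]:
            decompose(child, deploy_leaf)

decompose(service_model, lambda intent: {"status": "deployed", "for": intent["name"]})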

Example:  Current PoCs focus too much on individual services and not enough on broad opportunities or systemic infrastructure choices.  That’s because we started with the notion of use cases, which were service-based applications of NFV.  Since there was no notion of unity, only of use cases, it’s not surprising that we exploded the number of “orchestrators.”  We’re beyond that now, or should be.  A PoC can validate technology, but deployment requires we validate the totality of carrier business practices, the whole service lifecycle, and that we do so inside a universal vision of NFV that we can reuse for the next service.  NIA could help PoCs “grow up”, elevate themselves into business cases by embracing both the other horizontal pieces of a service and the vertical components of operations processes.

Where NIA could help might be easier to decide than how it could help.  We don’t know how it will be constituted, other than that Light Reading will have at least a temporary board membership.  From the early sponsorship and support it appears likely it will have vendors involved, and also likely that it will charge for membership as bodies like the ONF do.  One thing that’s unusual and could be exploited is the notion of an explicit partnership with a publisher, and perhaps through that an analyst firm (Heavy Reading).  It’s my view that simply having good coverage, meaning objective and thorough coverage, of NFV could be helpful by putting the right issues out front in the discussions.

In my view, the critical factor in making NIA successful is whether vendors who offer SDN and NFV products see it as a path to making the business case, not just proving the technology.  Operators tell me that in the great majority of cases current PoCs don’t have the budget for deployment, and operator CFOs say that the PoCs, at least as currently constituted, don’t make the business case needed to get that budget either.  Vendors need to sell something here, not just play in the science sandbox, so they could mutually gain from efforts to elevate the mission of trials so that it includes the operations efficiency and meaningful new-service and service-agility elements.

Obviously that could be tricky.  Vendor cooperation tends to fall down at the first sign of somebody being able to drive things to a competitive advantage.  The “zoo” Light Reading describes wouldn’t be an issue if vendors were in accord with respect to how to approach NFV or SDN.  In fact, as we know, there are vendors who’d like to see both ideas fall flat on their faces.  I witnessed some sharp operator-to-vendor exchanges even in the early days of NFV as operators tried to assure that vendors wouldn’t poison the NFV well.  If NIA falls prey to that kind of behavior then there would be no consensus to drive the industry toward, and likely no useful outcome.  How can it avoid that, though?  Ultimately all the industry forums are driven by whoever funds them most.

Light Reading’s participation, or that of any publication or analyst group, will also raise concerns.  Everyone wants to be a sponsor of a popular forum, after all, and the NIA charter says that the body will buy services from Light Reading for at least the first year.  How will that impact the way NIA issues are covered by other publications, or treated by other analyst firms?  Might other firms want to launch their own initiatives, or at least cast a few stones in the direction of an initiative launched by a competitor?  Light Reading has always been fair in their coverage of my own NFV-related activities and I don’t think they’d be unfair in covering NIA, but will everyone believe that?

I’ll be honest here.  If I believed that the path to SDN and NFV success was clear and that we had a good chance of getting or being on it with the initiatives underway, my skepticism about NIA would overcome my enthusiasm for it and I’d probably be against the notion.  The problem is that the founders, including Light Reading, are absolutely right; something is needed.  I guess that unless and until NIA proves it can’t do something positive, I’ll be supporting whatever it can do.

How NFVI Deployment Might Progress

NFV is going to require deploying VNFs on something; the spec calls the resource pool to be used “NFV Infrastructure” or NFVI.  Obviously most NFVI is going to be data centers and servers and switches, but not all of it, and even where the expected data centers are deployed there’s the question of how many, how big, and what’s in them.  I’ve been culling through operator information on their NFV plans and come up with some insights on how they’re seeing NFVI.

First, there seem to be three primary camps in terms of NFVI specifics.  One group I’ll call the “edge-and-inward” group, another the “everywhere” group, and the third the “central pool” group.  At the moment most operators put themselves in the middle group, but not by a huge margin, and they’re even more evenly divided when you ask where they think they’ll be with respect to NFVI in the next couple of years.

The “edge-and-inward” group is focused primarily on virtual CPE.  About half of this group thinks that their early NFV applications will involve hosting virtual functions on customer-located equipment (CLE): carrier-owned, generalized boxes that amount to an NFVI-in-a-box approach.  This model is useful for two reasons.  First, it puts most of the chained features associated with vCPE close to the customer edge, which is where they’re expected to be.  Second, it offers costs that scale with customers—there’s no resource pool to build and then hope you can justify.

Where the “and-inward” part comes in is that at some point the customer density and service diversity associated with virtual CPE would justify hosting some features further inward.  Since this is customer and revenue driven, it doesn’t worry the operators in cost-scaling terms, and they could site small data centers in COs where there were a lot of customers and services.  Over time, these small-edge centers might then be backstopped by deeper metro centers.  In short, this model builds NFVI resources as needed.

Some of the operators in this group expect that services could eventually be offered from the deeper hosting points only, eliminating the NFVI-in-a-box CLE in favor of dumb service terminations.  The same group also notes that some functions, like DNS, DHCP, and even application acceleration, fit better when hosted deeper because they’re inherently multi-site services.

This is the group that has the slight edge in early deployments, meaning for the next several years.  Obviously one reason is that while NFV is waiting to prove itself out as a broadly beneficial model, you don’t want to start tossing data centers all over your real estate.  In the long run, though, operators think that NFVI-in-a-box would be a specialized piece of functionality for high-value sites and customers.  For everyone else, it’s the larger resource pool with better economies of scale that makes sense.

The second group is the “everywhere” group, so named because when I asked one member of the group where they’d put NFVI, the answer was “everywhere we have real estate”.  This group expects to distribute NFV functions efficiently and widely and to have function hosting move around to suit demand and traffic trends.

Most of the operators who put themselves in this group are looking at a diverse early service target set.  Most have some virtual CPE, nearly all have mobile infrastructure as a target, and many also have content delivery and even cloud computing aspirations.  Their plan is to harmonize all of their possible hosting around a common set of elements that create a large and efficient (in capital cost and opex) resource pool.

Obviously most edge-and-inward players end up in this category unless their NFV strategies fail, and that’s why this group is the largest overall when you look at the longer-term state of infrastructure.  The group has the fewest early adherents (by a slight margin) because most operators are concerned they lack the breadth of applications/services to justify the deployment.

The central group is the smallest of the three, both in the near term and the long term—but again not by a huge margin.  This group is made up of operators who have very specialized metro-centric service targets, either aspects of mobile infrastructure or large-business vCPE.  Some also have cloud computing services in place or planned.  All of them serve geographies where users are concentrated; there’s no vast sprawl to cover.

The service targets for this group seem vague because they’re not particularly focused.  The sense I have is that the group believes that NFV success depends on “everywhere” and thinks that you should start “everywhere” somewhere other than in a bunch of NFVIs-in-a-box out on customer sites.  Some have cloud infrastructure already, and plan to exploit that, and a few even plan to start hosting in the same data centers that currently support their operations systems.

What goes into the data centers varies as much as the data center strategies vary.  The “everywhere” group has the greatest range of possible server configurations, and the central group (not surprisingly) wants to standardize on as small a range of configurations as possible.  However, all the groups seem to agree that somewhere between 10% and 20% of servers will be specialized to a mission.  For software, Linux and VMs are the most popular choices.  VMware gets a nod mostly from centralists, and containers are seen as the emerging strategy by about a third of operators, with not much difference in perspective among the three groups.

For switching, there’s little indication in my data that operators are running out to find white-box alternatives to standard data center switching.  They do see themselves using vSwitches for connecting among VNFs but they seem to favor established vendors with more traditional products for the big iron in the data center.  “If Cisco made a white-box switch I’d be interested,” one operator joked.

Of the three groups, the edge-and-inward guys are the ones least concerned about validating the total NFV benefit case in current trials, because they have relatively little sunk-cost risk.  However, the CFOs of operators in this group are actually more concerned about the long-term NFV business case than those of the other groups.  Their reasoning is that their current trials won’t tell them enough to know whether NFV is really going to pay back, and if it doesn’t they could end up with a lot of little NFVI-in-a-box deployments that will gradually become too expensive to sustain.

You can probably guess who ends up with the largest number of data centers and servers—the “everywhere” group.  Given that a lot of operators start in that camp and that some at least in the other two camps will migrate in the “everywhere” direction, it follows that for NFVI vendors the smart move is to try to get customers at least comfortable with the everywhere deployment model as quickly as possible.

Right now, executing on that strategy seems to involve the ability to demonstrate a strong and broad NFV benefit case and building as big a VNF ecosystem as you can.  Financial management wants to see an indicator that benefits can be harvested in large quantities before they deploy/spend in large quantities.  CTO and marketing types want to see a lot of use cases that can be demonstrated (says the CFO, demonstrated within that strong benefit context).

All of the approaches to NFVI could lead to broad deployment of NFV, but they get there by different means, and it may be that the credibility of the path is more important than the destination for that reason.  If we presume proper orchestration for operations efficiency and agility (a whole different story I’ve blogged about before), then matching VNFs to early needs is the biggest factor in determining how quickly and how far NFVI can extend, and the extensibility of NFVI is critical in developing a broad commitment to NFV.

Can We Really Support Service Agility in NFV?

I blogged yesterday about the need to create mission-specific “upperware” to facilitate the development of new services and experiences.  The point there was that NFV is not enough for that; you have to be able to develop and organize functional atoms for assembly according to some application model or all you’re doing is pushing VMs around.

If upperware is what generates new services, then NFV’s service agility has to lie in its support of upperware.  That assumption, I think, offers us an opportunity to look in a more specific and useful way at the “services” NFV actually offers and how it offers them.  That can help us prepare for service agility by properly considering NFV features, and also to assess where additional work is needed.

An “upperware application” would be a collection of virtual functions deployed in response to a need or request.  If we presume that NFV deployments are designed to be activated by a service order, then clearly any need or request could be made to activate NFV because it could trigger such an order.  The order-based path to “deployment” of upperware would be essential in any event, since it could be used to validate the availability of advertising revenue or direct payment for the service being committed.
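
As a concrete illustration of that order-based path, here’s a minimal sketch.  Everything in it (the ServiceOrder fields, the funding check, the activate function) is a hypothetical stand-in, assuming only what the paragraph above describes: a request is commercially validated before it is allowed to trigger any NFV deployment.

```python
# A minimal sketch of the order-based path to upperware deployment: the
# request is commercially validated (payment or advertising backing) before
# any NFV deployment is triggered.  All names here are hypothetical.

from dataclasses import dataclass


@dataclass
class ServiceOrder:
    customer: str
    upperware_package: str      # names the collection of virtual functions
    funding_source: str         # "subscription", "advertising", etc.


def funding_is_confirmed(order: ServiceOrder) -> bool:
    # Stand-in for a billing/ad-revenue check; a real implementation would
    # call out to OSS/BSS systems.
    return order.funding_source in {"subscription", "advertising"}


def activate(order: ServiceOrder) -> str:
    """Only a validated order is allowed to trigger NFV deployment."""
    if not funding_is_confirmed(order):
        raise PermissionError("order rejected: no confirmed revenue source")
    # Here the orchestrator would be asked to deploy the VNF collection.
    return f"deploying {order.upperware_package} for {order.customer}"


if __name__ == "__main__":
    print(activate(ServiceOrder("acme", "ad-sponsored-video", "advertising")))
```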

From the user’s perspective, this “service” isn’t just VNFs, it’s a complete packaged experience that will almost certainly contain components that aren’t SDN or NFV at all, but rather “legacy” services created by traditional network or hosting facilities.  We’d need to commit these too, or we have no experience to sell, which is why it’s essential that we think of NFV deployment in the context of complete service/experience deployment.  That means that you either have to extend the orchestration and management model of NFV software to embrace these non-VNF elements, or you have to consider NFV to be subordinate to a higher-level operations orchestration process.
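
The second option, treating NFV as one subordinate domain under a higher-level operations orchestrator, might look something like the following sketch.  The class names and the two-domain split are illustrative assumptions, not anything defined in the NFV specifications; the point is simply that a complete service dispatches each modeled component to whichever domain owns it, legacy or virtual.

```python
# Illustrative sketch: NFV as one subordinate domain under a higher-level
# operations orchestrator that also commits "legacy" network and hosting
# elements.  Class names are hypothetical.

from typing import Dict, List


class NfvDomain:
    def deploy(self, component: Dict) -> str:
        return f"MANO deploys VNF '{component['name']}'"


class LegacyDomain:
    def deploy(self, component: Dict) -> str:
        return f"EMS/NMS provisions legacy element '{component['name']}'"


class ServiceOrchestrator:
    """Dispatches each modeled component to the domain that owns it."""
    def __init__(self):
        self.domains = {"vnf": NfvDomain(), "legacy": LegacyDomain()}

    def deploy_service(self, components: List[Dict]) -> List[str]:
        return [self.domains[c["type"]].deploy(c) for c in components]


if __name__ == "__main__":
    svc = [
        {"type": "legacy", "name": "carrier-ethernet-access"},
        {"type": "vnf", "name": "virtual-firewall"},
        {"type": "vnf", "name": "virtual-router"},
    ]
    for step in ServiceOrchestrator().deploy_service(svc):
        print(step)
```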

OK, let’s say we have our upperware-based service deployed.  In the orderly operation of the service, we might find that additional resources are needed for a virtual function, and so we’d want to scale that function horizontally.  The presumption of the ISG is that we’d either detect the need to scale within the functional VNFs and communicate it to VNF Management, or that a VNF Manager would detect the need.  The Manager would then initiate a scale-out.

This poses some questions.  First and foremost, committing resources requires trusting the requesting agent.  Would a network operator let some piece of service logic, presumably specific not only to a service but to a customer, draw without limits on infrastructure?  Hardly, which means that the request for resources would have to be validated against a template that says exactly what is allowed.  This is a centralized function, obviously, and that raises the question of whether centralized functions should make the determination to scale in the first place.
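
A sketch of that validation step might look like the following.  The ScalingTemplate and ScaleRequest structures are hypothetical; the point is that a central policy check, not the requesting VNF or its manager, decides whether a scaling request stays within what the service is entitled to.

```python
# Minimal sketch of template-based validation: no VNF or VNF Manager commits
# resources directly; a central policy check compares every scaling request
# against a per-service template.  Names and limits are hypothetical.

from dataclasses import dataclass


@dataclass
class ScalingTemplate:
    vnf_name: str
    max_instances: int
    max_vcpus_per_instance: int


@dataclass
class ScaleRequest:
    vnf_name: str
    requested_instances: int
    vcpus_per_instance: int


def validate(request: ScaleRequest, template: ScalingTemplate) -> bool:
    """Return True only if the request stays inside what the template allows."""
    return (
        request.vnf_name == template.vnf_name
        and request.requested_instances <= template.max_instances
        and request.vcpus_per_instance <= template.max_vcpus_per_instance
    )


if __name__ == "__main__":
    tmpl = ScalingTemplate("web-front-end", max_instances=4, max_vcpus_per_instance=2)
    ok = ScaleRequest("web-front-end", 3, 2)
    greedy = ScaleRequest("web-front-end", 40, 8)
    print(validate(ok, tmpl))      # True  -> central orchestrator may proceed
    print(validate(greedy, tmpl))  # False -> request refused
```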

The second question is how the scaling is actually done.  It’s easy to talk about spinning up a VM somewhere, but you have to connect that VM into the service as a whole.  That means not only connecting it to existing elements in an appropriate way, but also insuring that the new instance of the function can share the load with the old.  That requires having load-balancing somewhere, and possibly also requires some form of “load convergence” where multiple instances of a front-end component must feed a common back-end element.
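
Here is a minimal, purely illustrative sketch of those scale-out mechanics: the new instance has to be registered behind a load balancer before it can share any of the work.  None of this is drawn from the ISG specifications; it simply shows why “spin up a VM” is only part of the job.

```python
# Illustrative scale-out mechanics: adding an instance is not enough, the new
# copy has to be wired into the service and a load balancer has to spread
# work across old and new.  All names are hypothetical.

from typing import List


class LoadBalancer:
    """Front-ends a horizontally scaled VNF and spreads traffic across copies."""
    def __init__(self):
        self.backends: List[str] = []
        self._next = 0

    def add_backend(self, instance: str) -> None:
        self.backends.append(instance)

    def route(self, packet: str) -> str:
        backend = self.backends[self._next % len(self.backends)]
        self._next += 1
        return f"{packet} -> {backend}"


def scale_out(lb: LoadBalancer, base_name: str) -> str:
    """Spin up a new instance (stubbed) and connect it behind the balancer."""
    new_instance = f"{base_name}-{len(lb.backends) + 1}"
    lb.add_backend(new_instance)          # without this step the copy carries no load
    return new_instance


if __name__ == "__main__":
    lb = LoadBalancer()
    lb.add_backend("front-end-1")         # original single instance
    scale_out(lb, "front-end")            # load now shared across two copies
    for pkt in ("pkt-a", "pkt-b", "pkt-c"):
        print(lb.route(pkt))
```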

The third point is optimization.  When you start spinning up (or tearing down) components to respond to load changes, you’re changing the topology of the service.  Some of what you do is very likely to impact the portion of the service outside NFV’s scope.  Front-end service elements, for example, are likely close to the points of traffic origination, so scaling them might impact the actual user connection.  Think of a load-balancer that’s sharing work across a horizontally scaled front-end VNF.  You used to be connected to a specific single instance of that VNF, but now the connection has to be to the load-balancer, which is likely not in the same place.  That means that scaling inside the NFV domain has to drive a change in service routing outside it.  And that means you may have to consider that out-of-domain connection when optimizing the location of the load-balancing virtual function you need.

I think this makes it clear that the problems of supporting upperware with service agility are going to be hard to solve without some holistic service model.  A service, in our SDN/NFV future, has to be completely modeled, period.  If it’s not, then you can’t automate any deployment or change that impacts stuff outside the NFV domain.  You may not need a common model (I think one would help a lot) but you darn sure need at least a hierarchical model that provides for specific linkage of the lower-level NFV domain modeling to the higher-level service domain modeling.

I also think that this picture makes it clear just how risky the whole notion of having separate, multiple VNF Managers could get.  The service provider will have to trust-manage anything that can impact network security and stability.  Giving a VNF Manager the ability to commit resources or even to test their status gives that manager the power to destroy.  That creates a major certification function, a kind of “VNF lifecycle management” corresponding to software Application Lifecycle Management, that has to prove that a given new or revised VNF Manager does only what it’s supposed to do.

What it’s supposed to do could get complicated, too.  Does every VNF that might scale get a preemptive load-balancer element deployed so the scaling can be done quickly?  That’s a lot of resources, but the responsiveness of a service to load changes might depend on having such a balancer already in place, and if it’s not there then a load change would probably disrupt the data path for at least the time needed to reconnect the service through the load balancer.  And of course failures and faults create even more complication because while we can predict what components we might want to horizontally scale, we can’t easily predict which ones will break.
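
To show the shape of that trade-off, here is a small, purely illustrative calculation.  The cost figures and the expected-value decision rule are my own assumptions for the sake of the example; the real inputs would come from an operator’s own SLA and resource economics.

```python
# Purely illustrative trade-off: pre-position a load balancer for every
# scalable VNF (pay for idle resources) or insert it on demand (pay with a
# data-path disruption)?  The figures and the rule are assumptions.

def pre_deploy_balancer(
    monthly_idle_cost: float,        # cost of a balancer that may never be used
    scale_probability: float,        # likelihood the VNF scales in a month
    disruption_cost: float,          # cost (SLA credits, churn risk) of a reconnect
) -> bool:
    """Pre-deploy when the expected disruption cost exceeds the idle cost."""
    expected_disruption = scale_probability * disruption_cost
    return expected_disruption > monthly_idle_cost


if __name__ == "__main__":
    # A high-value business service: scaling is likely and disruption is costly.
    print(pre_deploy_balancer(monthly_idle_cost=50.0,
                              scale_probability=0.6,
                              disruption_cost=400.0))   # True -> pre-deploy
    # A low-end consumer service: scaling is rare and disruption is cheap.
    print(pre_deploy_balancer(monthly_idle_cost=50.0,
                              scale_probability=0.05,
                              disruption_cost=100.0))   # False -> insert on demand
```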

The biggest effect of this sort of assessment of future requirements is that it demonstrates we’re not fully addressing current ones.  Which of the functions that upperware might expect are we willing to say will not be available early on?  No scaling?  No fault management?  No changes in service by adding a feature?  Aren’t all those things missions that we’ve used to justify NFV?

If NFV doesn’t deliver service agility then it makes no real contribution to operator revenues, it can only cut costs.  But without the ability to model a service overall, can it really even do that?  I think it’s clear that we need to think services with NFV and not just functions.