How a New Alliance Could Drive NFV into the Real World

One of the things that came out of Light Reading’s NFV Everywhere event was a sense that the critical MANO element of NFV has somehow lost its luster.  There have been some articles on that topic, and one of the most active members of the NFV ISG has posted a blog as well.  IMHO, the truth lies somewhere between, and interestingly we have an announcement to report on that may weigh in and decide the issue.

MANO is the true innovation of NFV.  The basic notion is that a service is constructed by somehow coercing cooperative behavior from a set of resources, a process called “orchestration” (and the “O” in MANO).  The orchestration also coordinates the management of both services and resources (hence the “MAN”).  At this high level, MANO deserves all the plaudits you can heap on it.

At another level, MANO fell short, or rather the ISG did.  From the very first, it should have been obvious that you cannot orchestrate services and not orchestrate management, that virtualization is either universal or it’s been nailed to the ground somewhere.  Operations and management processes from both the NMS/NOC and OSS/BSS world needed to be involved, and these were declared out of scope.  That was a critically bad move.

Even the operators who launched NFV in the first place had abandoned the notion that NFV could be driven by capex benefits alone within a year of their initial white paper.  That left “operations efficiency” and “service agility” as the benefits, and operators knew in the fall of 2013 that these benefits could arise only from automation of the entire service lifecycle.  That means not only integrating NFV management and operations with orchestration from the OSS/BSS down, it means integrating legacy equipment as well.  And remember that both legacy equipment and operations were out of scope to NFV.

But is MANO irrelevant, and if so why?  There are things about MANO as it has evolved that I don’t agree with, but none of it matters because the scope issue is absolutely fatal.  Everything happening now with MANO is really a response to the fact that NFV can’t make its own business case, which means something outside it has to take over.  As for MANO itself, well, since it can’t go “up” to the OSS/BSS level, it’s effectively contracting to become nothing more than a kind of hazy partner to OpenStack.

OpenStack reflects its own good-and-bad.  Rightfully, OpenStack is just an implementation of the NFV Virtual Infrastructure Manager (VIM).  It’s available, popular, and it seems to “orchestrate”, and if you believe the ISG that operations is out of scope to NFV, then it’s sufficient.  That’s resulted in a lot of vendors jumping on OpenStack as being MANO, and that’s helped obscure the fact that something even more than “classic ISG MANO” is needed.  It is needed, because without MANO on a very large scale, larger than the ISG paints it, there will be no NFV business case except for very specialized services and operators.

This is where the announcement comes in.  Overture Networks announced the “integration of Ensemble Service Orchestrator (ESO) with Wind River® Titanium Server, providing carrier-class NFV management and orchestration on the Titanium Server virtualized network infrastructure solution.”  To make this happen, Overture has stepped up from what was initially a very limited vision of NFV, hosted on their own custom carrier-Ethernet CPE, to become a real NFV player.  We thus have the best platform available for NFV hosting in Wind River, combined with one of the very few NFV implementations that can actually make a business case.  The combination will presumably be available through both the Overture Harmony and Wind River Titanium Cloud ecosystems.

What makes this significant is that it’s a kind of populist-NFV-in-a-brown-paper-bag.  Other vendors who don’t have a real NFV implementation (vendors other than Alcatel-Lucent, HP, Oracle, and Overture, and probably Ciena and Huawei) can now get the components to supplement whatever they do have.  With the right documentation and approach, they can turn NFV PoCs and trials into field trials and deployments.  The other NFV solutions are tied to major vendors with their own agendas, which makes these less attractive to the NFV masses.  Other computer vendors (who don’t want to pull in HP), other network equipment vendors (who don’t want Alcatel-Lucent or Huawei), and OSS/BSS players who won’t like any of the other NFV solutions, now have somebody to partner with.

I don’t think this is an indication that Wind River and Intel are ready to field their own NFV approach (though I’d love for them to).  Wind River has major partners like HP with strong NFV stories of their own, and it makes sense for Wind River and Intel to hedge their bets…sort of.  This might be a classic arms-merchant strategy—start a war and arm all sides.  Intel’s chips are likely winners no matter who does NFV, but they could lose if nobody does.  The Overture NFV approach is pretty portable; any server or software vendor could adopt it, which is good for Intel.

There’s still work to do here, though.  Even with Overture’s software there is still a significant integration task associated with making NFV work in a large-scale carrier application.  Operations integration and legacy devices will have to be specialized to the needs of the operator.  I don’t think Wind River is signing on for that, so the next question will be whether Overture can attract some big integrator interest, or maybe a buyer for the company itself.

At the least, this should elevate the NFV dialog.  The Light Reading event reinforces my own discussions with operators; they’re looking for a shift to a business-case-driven approach.  Even those vendors who could make an NFV business case have been silent on what would be involved, and thus unable to prove they could do it.  Silence now is going to be the opposite of golden, because a lot of new players could grab the Overture/Wind River combination, sing pretty, and become a player.  Everyone who, technically speaking, already is a player will have to sing too, and all the singing should finally expose the reality of what is needed for NFV.

Any way this shakes out will be interesting.  NFV needs a business case, and Overture could make the NFV business case, or at least they could articulate all the required pieces.  They have the Masergy deal in the bag as a proof point for their capabilities.  That’s a lot of credentials for a small player, and you can see there could be a lot of interesting outcomes.

What NFV Standards Have to Address to Make the Business Case Work

According to a recent piece on SDxCentral, “Operators don’t want the TM Forum to get lost in NFV technicalities.  They want the focus on making money.”  That’s an interesting and insightful couple of sentences, for several reasons, and it may be a signal of a sea change in NFV.  The obvious question is who’s going to chart the change.

I’ve noted in prior blogs that many of the operators believe that the NFV ISG process has gone off-track, getting bogged down in little details of implementation that are actually outside the original spec-finding mandate for the group.  At the same time, the body hasn’t found the time to address critical issues in operations integration and federation, issues that could be the foundation of a real business case.  That’s left these critical issues out in the cold, to be resolved (if they can be) elsewhere.

One logical place to resolve them is the TM Forum, the body charged with operations standards.  The article in SDxCentral is about the TMF, and specifically about the “Zero-touch Operations and Orchestration Management” or ZOOM project that is intended to get the TMF on board with virtual network resources and services.  The acronym’s decoding is highly relevant to what the goal should be—we need operations and automation to be zero-touch and inclusive of the virtualization/orchestration processes of SDN and NFV.

So the TMF is riding to the rescue of NFV?  To be fair to the NFV ISG, the TMF hasn’t set the world on fire in this space either.  ZOOM goes back over a year, and in my personal view it’s not made tremendous progress in that time.  In the TMF case, as with the ISG, the problem is likely one of the bias of the people actually involved.

NFV is a software strategy, a software architecture, a software problem.  It takes software to solve it, and neither the ISG nor the TMF has really approached this as a software issue.  The ISG got mired in standards, and because old-line standards had to be highly specific in some areas and were generally directed at interfaces and little else, the focus of the ISG was directed at interfaces and details thereof.  The TMF got mired in its own architecture, in making NFV and virtualization fit within it.

But there is progress being made.  Some of the work on Phase Two of the NFV ISG program is very helpful and insightful, particularly as it relates to intent modeling.  There is also work underway in the TMF to modernize some aspects of their event-handling and the application map that frames operations processes overall.  The question is what this will add up to being.

In my view, the TMF has the inside track on creating something comprehensive.  While there’s useful work in the ISG, such as a harmonization of operator views in favor of intent modeling and a recognition that they need operations integration and federation, it’s hard to see how they could get to the right place on that range of topics in a short time.  The TMF, on the other hand, has nobody to fear but themselves.

The core of the right approach to NFV is a comprehensive service data model that extends from the top service layer to the bottom resource layer in a unified structure.  The TMF SID might not be the way I’d personally have built that model, but it’s quite suitable.  There is nothing being talked about in the ISG here except YANG, which I think is a waste.  TOSCA is probably the best approach, and one that might win out in the market regardless of what others do.

The next point is the notion of a full intent model representation for every customer- and resource-facing element of a service.  That means defining ports and SLA as well as functionality.  I don’t see any reason why the TMF SID could not accommodate this, though they’d have to make some specific recommendations to ensure compatibility.  The more difficult issue would be the notion of derived operations views created by active expressions linking the model SLAs.  You could do this in SID, but here there’s certainly a need for an explanatory project.

Following on the expression-linking of SLAs, NFV demands active-chain binding of pool resources to intent-model abstractions that represent them in the service domain.  This can be done using SID, but how elastic bindings and active expressions of any sort are handled is in my view a project not unlike the original NGOSS Contract work that gave rise to GB942.  It should have been started a year ago.

On both the prior points, the notion of “active” expressions or bindings is outside the range of normal data models.  A traditional data element has a value.  An active element has a value that is determined by a process executed at the time of reference.  Active elements considerably enhance the utility of virtualization-operationalization technology because they allow a reference to a “high-level” variable to drive the derivation of that variable rather than having the variable constantly updated (or out of date!).
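
To make the distinction concrete, here is a minimal Python sketch of a stored value versus an active one.  Nothing here comes from SID, TOSCA, or any ISG document; the ModelElement class, the element names, and the latency figures are all made up for illustration.  The only point is that the service-level value is derived when somebody asks for it.

```python
class ModelElement:
    """Hypothetical data-model node.

    An attribute may hold a plain stored value or an "active" expression,
    which is any zero-argument callable evaluated at the time of reference.
    """

    def __init__(self, name):
        self.name = name
        self._attrs = {}

    def set(self, key, value):
        self._attrs[key] = value

    def get(self, key):
        value = self._attrs[key]
        # Active elements are computed on reference; traditional ones are returned as stored.
        return value() if callable(value) else value


# Two resource-level elements with traditional (stored) values.
access = ModelElement("access-leg")
core = ModelElement("core-leg")
access.set("latency_ms", 4.0)
core.set("latency_ms", 11.0)

# A service-level element whose latency is an active expression over the lower layer.
service = ModelElement("vpn-service")
service.set("latency_ms", lambda: access.get("latency_ms") + core.get("latency_ms"))

first = service.get("latency_ms")    # derived from the current resource values
core.set("latency_ms", 20.0)         # a resource changes underneath
second = service.get("latency_ms")   # re-derived on reference, no update was pushed upward
```

Nobody had to push the change in the core leg up to the service element; the high-level reference drove the derivation, which is the whole appeal of the approach.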

The next point is state/event synchronization of service lifecycle processes across the whole data model.  Every element of a service must have its own lifecycle states, respond to its own events, and synchronize the stuff that depends on its operation.  To me, this is a core feature of any valid service data model.  Every object in the model is an intent-modeled sub-service “atom” that has its own properties, lifecycle, and management.  This means that every lifecycle activity designed to operate on an object operates on everything, top to bottom, high-level to detail.  It simplifies implementation considerably.
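
A rough Python sketch of what I mean follows.  The state names, the event names, and the rule that a parent is active only when all of its children are active are my own simplifying assumptions, not anything taken from the ISG or TMF material.  Each object handles its own events and then synchronizes whatever depends on it.

```python
from enum import Enum


class State(Enum):
    ORDERED = "ordered"
    ACTIVE = "active"
    DEGRADED = "degraded"


class ServiceObject:
    """Hypothetical intent-modeled "atom" with its own lifecycle state."""

    def __init__(self, name, children=()):
        self.name = name
        self.state = State.ORDERED
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

    def handle(self, event):
        """Apply one event to this object, then let the parent resynchronize."""
        if event == "activated":
            self.state = State.ACTIVE
        elif event == "fault":
            self.state = State.DEGRADED
        if self.parent is not None:
            self.parent.child_changed()

    def child_changed(self):
        """Recompute this object's state from its children and propagate upward."""
        states = {child.state for child in self.children}
        if State.DEGRADED in states:
            new_state = State.DEGRADED
        elif states == {State.ACTIVE}:
            new_state = State.ACTIVE
        else:
            new_state = State.ORDERED
        if new_state != self.state:
            self.state = new_state
            if self.parent is not None:
                self.parent.child_changed()


# A three-level model: service -> site -> resource-facing pieces.
vcpe = ServiceObject("vCPE")
tunnel = ServiceObject("tunnel")
site = ServiceObject("site", [vcpe, tunnel])
vpn = ServiceObject("vpn", [site])

vcpe.handle("activated")
tunnel.handle("activated")
after_activation = vpn.state   # both leaves are active, so the whole chain is active

tunnel.handle("fault")
after_fault = vpn.state        # the fault rolls up through site to the service
```

The same two methods run at every level, which is why a uniform object model simplifies implementation: there is no separate lifecycle logic for the top of the model versus the bottom.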

A top-down approach to the challenge of NFV would have produced something like these points back in 2013.  OK, that’s off the table now, so the question is what process might be close enough to the right answer to do something by next year, which is probably the drop-dead date to preempt operators’ changing from NFV to something else.

None of this stuff is even mentioned in the Phase One ISG material, so it’s my view that it’s doubtful that the body could address all these points in a year or less, even if we assumed that members wanted to (which would be a dangerous assumption).  I think the ISG contributed the critical MANO notion, but somebody else will have to carry the torch from here.

The TMF might be the group, but if there’s a body on earth that outdoes the NFV ISG for glacial movement, the TMF is that body.  In fact, given their history, they arguably set the standard.  The only hope the TMF has IMHO is that vendors and operators will present Catalysts that will move the ball even if the formal processes of the TMF are mired in the usual political and personal wrangling.  That could happen, but even here we have the barrier of publication.  The TMF is a membership-fee body like the ONF, but unlike the ONF its stuff doesn’t roll out to full public view quickly, if at all.  A lot of good work the body has done has been submerged, where it never influences the broad media and market.

Remember my blog on the various CxOs in the operator world and NFV’s progress in engaging them?  Well, engagement is made primarily through public marketing and media channels these days.  Standards groups either sing or they risk irrelevance through simple lack of ability to build decisive momentum in buyer organizations.  If the TMF does all the right things with ZOOM and other modernizing initiatives, and fails to publicize the results, they may as well have done nothing.

The media and “public processes” bring their own baggage to this.  The fact is that 1) no technology revolution is going to generate legitimate stories at the rate the media needs them and 2) no media type is going to write a story at the depth needed to actually influence a major decision, nor would the editor run such a piece—it would be too big.  You get press to sell website visits, and you get press by riding the hype wave.  NFV has already passed the crest in that regard, which means that engagement by the TMF or by vendors to get their story out will now be harder.  The good news goes back to my opening quote, though.  Buyers now know what they want—monetization and a business case—and if they search websites they can still find enough to make progress possible.

Who Might Stand in the SDN/NFV Winners’ Circle at the End of the Race

In my last blog I pointed out that we really needed to understand what the end-game network would look like to effectively drive SDN and NFV.  Today I’d like to look at how the possible end-games relate to the vendors in the game.  Who might win, or even drive, a specific SDN/NFV future?

There seem to be three possible SDN scenarios that lead to my utopian SDN model.  They are the data-center application and service path, the up-from-the-depths path, and the magnetic service path.

In the data center application and service path, SDN develops as an application and tenant segregation concept within a cloud data center.  This requires a fairly sophisticated virtual networking approach, much like Amazon and Google (whose Andromeda approach is fairly public) have developed.  Network virtualization in a data center can be leveraged to extend beyond the data center, and that would be the way this specific path could be realized.

The critical step in this path to SDN is that extension.  Even now, we have vendors like Nuage who have an overlay model of SDN that can easily be extended to the edge, and so Alcatel-Lucent is obviously a competitor who could make hay with this opportunity.  Brocade, who has a stronger data center positioning for SDN, could leverage its Vyatta offering with edge-hosted virtual routing and tunnels to build a similar uniform end-to-end virtual network model.  Juniper, with Contrail, could pair SDN and fabric switching to at least create the data center part, but it’s not clear how easily they could take the next step.

My view is that this approach to SDN success is the most “open” meaning that virtually any data center SDN strategy could be extended to complete virtual networking through either product extensions or partnerships with any number of players.  Open-source tools like OVS could also play a role, as could an open-source router.  The very breadth of possibilities may render this path, ironically, less attractive.  That’s because anyone who takes the trouble to educate the market in a broad SDN transformation via the data center will carry a lot of competitors along too.  Who wants to do that?

The second possible path for SDN is the “up-from-the-depths” path, which means starting with agile optics and optical grooming, extending that in some way to provide better granularity of tunnel management at the electrical layer (virtual L1), and then marrying these L1 partition-based networks with virtual switching and routing.  This model is easily applicable to business data services and with some care to mobile and CDN applications.  It’s a natural play for a network operator, in other words.

You need to be an optical player to win here, and in theory Alcatel-Lucent, Ciena, and Infinera would be obvious candidates.  Vendors like Juniper and Cisco could also jump in because the critical extension in this model is that electrical L1 grooming, which even someone who isn’t really an optical giant could do.  What you do need is to be able to add electrical L1 grooming to fat lambdas.

If I had to bet on who’d take a shot at this path it would be one of the pure-play optical types like Ciena and Infinera.  This path would undermine carrier router and switch deployment in favor of per-customer or per-service virtual switching and routing.  That would make it less attractive to Alcatel-Lucent, Cisco, or Juniper.  The wild card on this path would be NFV, of which more will be said below.

The final SDN path is the notion of a “magnet service”, something new and profitable.  One could argue that the only qualifying candidate here is the Internet of Things, but IoT is so mired in useless hype that I’m not sure how easily a logical approach could be promoted.  In any event, it’s very likely that the only people capable of using IoT to pull through massive SDN would be a service provider.  Most of them seem caught in the trap of thinking that every controller and sensor in the future will have a dedicated 5G connection that they can sell a user for twenty bucks a month or so.  Count the sensors and controllers in an average home and you can see how far that would get.  I think that while almost any SDN player could claim to support IoT as an SDN driver, I’m doubtful that it can drive SDN by itself.  Again, more below.

On the NFV side, the possible paths to the future are the operations umbrella path, the getting-cloudier path, and the old favorite magnetic service path.

The best way to make NFV deploy quickly and widely would be to transform service and network operations with model-based orchestration and automation.  That’s been clear from the very first, but even vendors with good credentials were slow to push on this approach.  Oracle came along early this year to take advantage of the void in operations positioning, and other vendors are now saying (and doing) more.  The functional leader in this space is HP, then Overture Networks, then Alcatel-Lucent and Oracle.  There are other players who appear to be working in this area, notably Ciena and Huawei, but I can’t yet say just where either really is on the topic of operations integration.

The reason strong operations integration would be so compelling is that both operations efficiency and service agility benefits—the primary drivers of NFV in operators’ view—depend on service/network management automation and coordination to reduce human interaction and speed the handling of service and network events.  If this capability is provided at a high level (the OSS/BSS and NFV MANO levels) then making the business case for NFV on a per-service, per-PoC, basis would be much easier.  The classic example is “virtual CPE” hosted on an agile premises device.  With an operations umbrella on top, this model is easily validated based on first-cost controls and agility alone.  Without that umbrella you can’t even confidently assert that it saves anything at all.

I think we’re going to get some action in this space, but I’m concerned about the timing.  The only way to socialize the operations umbrella strategy is through the CIO, one of the executives that hasn’t been much engaged in NFV trials.  Most vendors aren’t calling on the CIO regarding NFV, and they don’t know whether to position operations orchestration as an OSS/BSS transformation or ride on NFV’s coat-tails.  If there’s not some quick action this fall, we may go into 2016 budgeting with no clear path to an operations-driven field trial.

The cloud-driven approach to NFV deployment says that nothing much in the network (meaning connection) services area is going to move the ball fast and far enough.  Operators still see cloud computing as a major revenue opportunity, and if they pursue it they will be building cloud data centers that look structurally like NFV infrastructure and that present deployment and operations challenges similar to those posed by VNFs.  Operating the cloud efficiently at scale is arguably an NFV-like problem.

HP and Brocade could both drive this evolution in theory.  HP has both SDN and NFV assets, and both would be valuable in a cloud-path NFV success story.  However, vendors have been generally soft on making the connection between NFV and the cloud.  This may be a situation where a player like Intel, who offers foundation tools for the cloud and for NFVI, might want to step up and make the point to ensure this path gets a fair hearing.

The final topic for NFV is the old familiar “magnet service” and as is the case with SDN, the most credible (perhaps the only credible) magnet is IoT.  With NFV, there may be more magnetism available, because NFV could deploy the query and collection functions needed to gather information from sensors with arbitrary connection options (often through intermediary gateways) and send information to controllers similarly connected.  There is a network mission here, but in many ways it’s a mission that’s very cloud-like or service-like, which is what makes it an ideal NFV story.

Alcatel-Lucent, HP, and Oracle are all developing (and in all cases, have developed at least in a preliminary sense) an IoT architecture that would at least involve NFV.  None of them are complete at this point, or at least not completely disclosed.  I think this is largely because we’ve still not accepted a truly comprehensive and sensible model for IoT overall.  It’s going to take a bold vendor to step up and blow against the prevailing media winds, but that’s an essential step if we ever hope to realize IoT potential.  And if a vendor does prevail in promoting a sane model, they’ll have a chance to establish themselves both as an SDN and NFV kingpin.

It seems clear that IoT is the wild-card stimulus for both SDN and NFV, in part because it is a stimulus for both.  No matter how many current hopefuls we have in the space, the fact is that we have no convincing positioning or fully articulated strategy (HP is the closest) relating IoT to either SDN or NFV, so anyone can claim the prize.

We have at least competing claims in all the driver areas, in fact, and that means that we’re entering the critical year for both SDN and NFV without a clear winner, and perhaps given the multiple possible paths to success, without a clear leader.  It should make for an interesting fall and Q1, but I think things will shake out sharply in the second half of next year.



Approaching the SDN/NFV End-Game

OK, I admit to liking old songs and poetry, so you’ll probably not be surprised if I quote a song title: “How deep is the ocean, how high is the sky?”  I don’t propose to blog on oceans or skies here, but depth and breadth are interesting questions to pose to SDN and NFV.  We might need to ask ourselves another seemingly trivial question for both technologies, which is “What does the SDN/NFV end-game look like and how do we get there?”

We have about a trillion dollars in network assets out there, about a fifth of which is depreciated in a given year.  The capital budget of operators is running a bit lower, so at the moment we’re gradually drawing down on the “installed base” that offers a lot of inertia.  At the same time, that capital budget is reinvesting in the same technology, so inertia isn’t decreasing significantly.  SDN and NFV have to overcome that inertia.
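
The inertia argument can be put in rough numbers with a few lines of Python.  The $1,000 billion base and the one-fifth depreciation rate are the figures above; the $190 billion annual capex is only my stand-in for “a bit lower”, and treating depreciation as a flat percentage of whatever remains is a simplification.  The new_tech_share parameter is the hypothetical fraction of capex that goes to SDN/NFV rather than legacy gear.

```python
def project_base(years, base=1000.0, depreciation_rate=0.20,
                 capex=190.0, new_tech_share=0.0):
    """Return (legacy, new) asset values in $ billions after `years` annual cycles.

    Each year both pools lose `depreciation_rate` of their value, then the
    capital budget is split between legacy gear and new technology.
    """
    legacy, new = base, 0.0
    for _ in range(years):
        legacy = legacy * (1 - depreciation_rate) + capex * (1 - new_tech_share)
        new = new * (1 - depreciation_rate) + capex * new_tech_share
    return legacy, new


# All capex reinvested in the same technology: the legacy base barely moves.
legacy_same, new_same = project_base(years=5, new_tech_share=0.0)

# Every capex dollar shifted to new technology: the legacy base drains quickly.
legacy_shift, new_shift = project_base(years=5, new_tech_share=1.0)

for label, legacy, new in [("reinvest in legacy", legacy_same, new_same),
                           ("shift all capex", legacy_shift, new_shift)]:
    print(f"{label}: legacy ${legacy:.0f}B, new ${new:.0f}B after 5 years")
```

Under these assumptions the legacy base only erodes if the capital budget itself changes direction, which is the inertia problem in a nutshell.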

If we just use the rough numbers I presented in the last ‘graph, then you can easily see the issue and perhaps also see the path to change.  It’s unlikely that we’d achieve major SDN or NFV success if operators keep buying the legacy gear.  That means that SDN and NFV have to be presented in an evolutionary posture, as something you can migrate to gracefully.  At the same time, though, operators aren’t really interested in quiet evolution, at least not in the long term.  Technology changes present risks that can be justified only by significant upside.

SDN has at one level accepted and addressed the need to balance evolution and revolution.  You can control many legacy switches using OpenFlow.  That lets users invest in SDN at the controller and management level and apply that investment to current network devices.  As those devices age, they can be replaced with white-box gear that uses SDN and only SDN.  At least that’s the theory.  The practice has so far been somewhat stalled in the data center, where the impact on cost and revenue is limited.

For NFV, it’s been a harder row to hoe.  While you can argue capex savings for NFV on an incremental deployment, the fact is that NFV is more complicated than point-of-service devices would be for the same features.  That means that operations efficiency has to be better at least to the point that the incremental complexity is covered.  And remember that most operators don’t believe in capex as a driving NFV benefit.  Service agility and operations efficiency, the benefits du jour, both appear difficult to attain unless you address a service from end to end and top to bottom.  How do you square that with the need to contain early cost and risk?

All migrations are justified by the destination.  A couple million African wildebeest don’t swing south toward the Mara River and face the crocs to starve in a different place.  We’d probably have an easier time postulating the migration strategy for SDN/NFV if we knew what a full deployment would be like.

What does an “SDN network” look like?  Obviously it can’t be an Ethernet or router network that adds in some OpenFlow control; that doesn’t move the cost or benefit ball much.  You could in fact build what looked like IP or Ethernet services using SDN.  You could also build services that looked the same at a service demarcation but were created very differently.  Application- and service-specific subnetworks, added to an enhanced virtual router at the customer edge, could frame services in a totally different way and revolutionize networking.  One option presents limited migration risk and limited benefit.  The other seems to go the other way.  Which model is best?

NFV poses a similar question.  From the first, the focus of NFV has been in deploying virtual functions that replace physical appliances operating above switching/routing.  So let’s assume we do that.  How much of the network capital budget is associated with that kind of technology?  A bit more than ten percent, according to operators.  Even if we can harness opex and agility benefits, how many will be available if we touch only a tenth of the gear?  You can contain NFV to that simple mission, or you can try to address the opex and agility goals even if it means extending what we mean by “NFV” significantly.  Again, one way offers risk management and the other a much better benefit case.  What’s the best approach?

So what are the answers?  I think we can best start with what can’t be the answers.  SDN and NFV for network operators cannot be a pure overlay strategy that rides on current switching/routing infrastructure without much change in that base.  How do you add something on top of the original model and by supplementation make it cheaper?  We have to displace switching and routing on a large scale for there to be large-scale SDN and NFV success, and whether we like that or not (and router vendor employees who read this probably won’t) it’s still the truth.

A second truth is that we are not going to replace current network routers and switches with servers and virtual routers.  Many of those current products are simply too big.  Terabit routers have been a reality for a long time, but we don’t have much experience with terabit servers.  Virtual switches and routers clearly have to play a big role in the future, but not as 1:1 replacements.

We need to look from the blue-sky future into the deep here.  The most obvious of the network technology trends has been that of agile optics and the displacement of traditional core-router aggregation with agile optical cores.  We should expect that this trend will continue, and as it does it joins up with SDN and NFV to create that future vision.

Agile physical-layer technology lets you dumb down switching/routing because it subducts error recovery responsibility from the higher layer.  Furthermore, if you can partition users and services economically at the agile optical layer, you could build business services using pipes and virtual routers/switches.  That opens the SDN opportunity to deploy a simple forwarding tunnel over optics, with no real L2/L3 involved, and use that tunnel with virtual switching/routing on a per-user, per-application, and per-service basis.

NFV can play a role here by deploying those L2/L3 virtual elements.  Absent a connection mission like this, NFV is stuck in higher-layer functionality where it can’t easily change the cost or benefit structure of basic services.  But if we build service and application networks one-off using partitioned L1 technology, we need the higher layers to deploy.  These missions, as I’ve pointed out, are less demanding of the devices hosting the virtual routers because they’re limited in scope to a single user or service.  We’ll still need to aggregate stuff, but not nearly as much device-based switching/routing is needed.

A lot of virtual routing could be needed.  Every service edge for every business and every consumer would have in this model a virtual router that provided the user with the specific tunnel/service-network access they needed.  For VPNs you’d have edge virtual routers and floating internal ones that were placed to optimize traffic flows and resource usage.  It’s a different model of optimization.  Forget finding the best path among a nest of routers, you find the best path and nest routers to fit it.

The biggest problem we have with all of this isn’t carrier culture.  Vendor resistance to this approach would be even more problematic because it prevents vendors from accepting a radical change.  And underneath both the carrier and vendor resistance is human resistance.  We have generations of network mavens who have known nothing but IP or Ethernet.  They simply cannot grasp a different model.

Well, we have to make a choice.  The future of networking will be the same as the present if we insist on building future networks using current principles.  We can’t bring the sky and ocean together without making rain.

Brocade’s Step Toward SDN’s Future: A Good Start

Yesterday, Brocade announced enhancements to its SDN Controller that advanced SDN in an operations sense.  I think these were important; they move SDN toward the architecture it should have been from the first.  They also show us just how far SDN still has to go for it to achieve all it can.

What Brocade has announced is a Flow Manager and a Topology Manager that essentially ride the SDN controller and provide a way of visualizing the structure of SDN switches and the way that consecutive forwarding processes (imposed per-switch as the OpenFlow approach mandates) add up to supporting a flow or route.  The products are highly visual, meaning that you can easily see what’s happening and manipulate connectivity to suit your specific needs.

Way back in the early days of SDN, I hypothesized an architecture of three layers.  On the bottom was the topologizer, an element responsible for determining what the physical device and trunk structure of the network was.  In the middle was what I called SDN central, the critical piece that aligned flows and topology.  The top layer was what I called the “cloudifier”, a layer that would frame SDN’s almost infinitely variable connectivity options into services that could really be consumed.  The “northbound interface” of the SDN controller would logically fit below these layers.

Brocade has taken an important step by providing us with specific implementations of the bottom two of my layers.  Instead of blowing kisses at a vague notion of a northbound interface, users can see what they are doing.  It’s an important step in operationalizing SDN for sure.  But it’s not a home run, at least not yet.

One specific statement from Brocade’s website sets the stage for the “miles-to-go-before-we-sleep” part of this discussion.  “Software-Defined Networking provides dynamic, end-to-end views and control of data center traffic.”  Obviously SDN isn’t limited to the data center, but it does seem as though Brocade is telling us something important.  SDN has been successful largely within the data center, and so SDN’s longer-term success will likely depend on extending from that base into the rest of the network.  I described such an evolution for network operators in my blog yesterday.  SDN has to make the transition to the network from the data center, for both enterprises and operators, if it’s to be really important.

The notion of flows and topologies that Brocade has introduced could play a role in that.  The first thing you have to do to get SDN out of the data center is to automate operations and management.  If we can use a GUI to drop flow paths where we want them, it is clearly possible to provide a tool that would define the paths based on policy.  I don’t think this would be a big technical issue for Brocade.  The larger problem to address is the management side.

Inside a data center you have unusually high reliability and you can probably stand on a chair and survey your network domain, looking presumably for smoke or some sign of failure.  The number of flows you have to support is also likely to be limited.  Get into the wider network and you need to have automated service management, and that means you have to be able to associate physical device and trunk conditions to connectivity.  Brocade takes a step in that direction too, because they have in a flow a map of the resource commitments.  If you drew data from MIBs along a flow, you could figure out what the flow state was.  If you had a flow problem you could trace it to the likely fault point.
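
To make the flow-tracing idea concrete, here's a minimal sketch.  It assumes a flow is recorded as an ordered list of hops and that some poller has already pulled an up/down status for each hop from the device MIBs.  The names (Hop, flow_state, likely_fault_point) and the status values are hypothetical, not anything Brocade has published.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Hop:
    device: str        # switch holding the forwarding entry for this flow
    out_port: str      # port/trunk the flow leaves on
    oper_status: str   # "up" or "down", as polled from the device MIB (assumed)

def flow_state(hops: List[Hop]) -> str:
    """A flow is only up if every resource it is committed to is up."""
    return "up" if all(h.oper_status == "up" for h in hops) else "down"

def likely_fault_point(hops: List[Hop]) -> Optional[Hop]:
    """Return the first failed hop along the path, or None if the flow is healthy."""
    for hop in hops:
        if hop.oper_status != "up":
            return hop
    return None

# Illustrative flow across three switches, with a fault in the middle.
flow = [
    Hop("sw1", "eth1", "up"),
    Hop("sw2", "eth4", "down"),
    Hop("sw3", "eth2", "up"),
]
```

The point of the sketch is that the flow record itself is the map from service to resources; without it, the MIB data has nothing to be correlated against.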

The notion of a flow problem leads to the question of a flow standard, meaning an SLA, and the need to connect flows into groups to make up a logical service.  Brocade still needs to mature a vision of my “cloudifier”, the service layer that converts application/user requests for cohesive network behaviors into a series of flows that can map to the real device topology.

There may be help on the way here.  The ONF is taking up the notion of an intent model for northbound interfaces.  A true intent model would include an SLA, and that implies that an implementation would be able to collect management variables not only for a flow but for a set of related flows, and present them as a composite service state to be compared with the performance guarantees set for the service.  Brocade could implement this, and if they did they’d climb the SDN value chain up to the top.
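
A rough sketch of what that composite comparison might look like follows.  The metric names, the SLA fields, and the choice to roll flows up by taking the worst case of each metric are all assumptions made for illustration; the ONF work may define this quite differently.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FlowStats:
    name: str
    latency_ms: float   # measured latency for this flow
    loss_pct: float     # measured packet loss for this flow

@dataclass
class SLA:
    max_latency_ms: float
    max_loss_pct: float

def composite_state(flows: List[FlowStats]) -> FlowStats:
    """Roll related flows into one service-level view (worst case per metric)."""
    return FlowStats(
        name="service",
        latency_ms=max(f.latency_ms for f in flows),
        loss_pct=max(f.loss_pct for f in flows),
    )

def meets_sla(flows: List[FlowStats], sla: SLA) -> bool:
    """Compare the composite service state with the guarantees in the SLA."""
    state = composite_state(flows)
    return (state.latency_ms <= sla.max_latency_ms
            and state.loss_pct <= sla.max_loss_pct)

# Two hypothetical flows that together make up one logical service.
service_flows = [FlowStats("site-a", 10.0, 0.1), FlowStats("site-b", 30.0, 0.0)]
```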

A lot of the things needed to achieve SDN’s optimum future are at least partially supported within the SDN controller, at least the OpenDaylight version that Brocade and many others use.  What’s missing?  Interestingly, it might be the vision of service/resource management and operations that many have talked about for both SDN and NFV in the operator space.

In an SDN age, enterprises are “network operators” at least to the extent that they build SDN overlays on more traditional services.  In the future, if operators themselves expose “cloudifier” types of service interfaces to permit flow composition into services, the enterprises will still have to harmonize those with internal site connectivity.  We already know that the “enterprise network” is the sum of its LANs and WANs.  We may not need all the tools for enterprise SDN that we’d need for operator SDN, but it’s easy to see that we’d want a subset of those tools.  And it’s easier to subset something if you’ve got it to begin with, in a complete and cohesive form.

I think Brocade’s move is demonstrating that “network management” for the enterprise is going to shift just as decisively as service and network operations are shifting for the operators.  I also think that a future SDN network where users compose their connectivity by application and job type is going to demand a complete rethinking of how we know what a network is.  The “I’ll-know-it-when-I-see-it” model isn’t going to work in the virtual world.  Brocade may be working on the replacement.

How SDN Could Jump Over NFV in Deployment

SDN came along well before NFV, and there are many more SDN implementations than there are “real” (meaning actually complete and useful) NFV implementations.  Despite this, SDN became a bit of a junior partner to NFV, at least among network operators.  Even in my own deployment models, it’s clear that the easiest path to SDN deployment would be in support of an NFV-driven transformation.  But suppose NFV doesn’t keep pace?  Is there a path to SDN success that bypasses NFV success?  Yes, there is; there are several, in fact.

To start off, there’s the cloud.  There’s a relationship between NFV and the cloud, in two dimensions.  NFV requires cloud hosting in almost all its large-scale success scenarios.  The cloud could benefit from NFV’s ability to manage dynamic application configuration and resource changes and simplify operations overall.  But the cloud is also different.  It’s a framework for operators to sell application/IT services and a new revenue source.

Virtualization at the network level is absolutely critical for cloud infrastructure, whether you’re using it for yourself or selling services from it.  Google and Amazon both have developed sophisticated network virtualization capabilities as part of their cloud offering, and even NFV demands much more power in network virtualization than either vendors or operators admit.  SDN could be driven by cloud virtualization faster than NFV advances.

The cloud lesson of Google and Amazon is important because it demonstrates that in the cloud, virtual networking has different properties than standard networks do.  The further SDN features are advanced beyond the traditional, presuming those advances are valuable or critical, the more likely SDN is to justify itself.  And we know that cloud virtual networking is very different.  For example, it deals with inside-outside address space differences and endpoints that reside in multiple address spaces.  Those aren’t common features for IP or Ethernet.

The strongest proof point for the potential of the cloud to drive SDN is that it’s had funding.  Cloud computing was for several years the leading project for operators.  It led mobile enhancement and content distribution, largely because the path to monetization seemed clear.  The biggest argument against it is that the cloud doesn’t lead any longer; monetization has been more difficult than operators had expected.  Still, SDN that grows out of cloud data centers could drive a major SDN commitment by operators.

Cloud SDN is primarily aimed at user application tenancy control, and the interesting thing about that mission is that application-specific networks take a step toward what could be called application networking versus the site networking of today.  That would lead to a big potential growth driver for SDN as a carrier service.

If applications each live on their own partitioned virtual network, then you can give users access to the application by giving them access to that network.  The sum of a user’s application rights defines their network connectivity.  A user “tunnels” from wherever they are in geography or client address space terms to reach the on-ramp of the application virtual network structure.  This model is more secure, more stable, and more profitable for the operator because it’s more valuable to the user.  In theory any form of SDN that allows for explicit forwarding control could do this, but some vendors (Alcatel-Lucent’s Nuage for example) have made more of a point of highlighting the features needed.

When you look at this kind of network model, you can see another point of leverage that leads to yet another expansion of SDN opportunity.  If the user’s “access tunnels” jump onto a virtual network that extends outward near to their location, we’ve created a series of parallel virtual networks that have a large geographic scope and yet don’t touch each other.  They’re effectively partitioned at a lower layer than IP or Ethernet, by the “virtual network structure” whatever it is—the thing SDN created.

Well, if we’re partitioning everybody, every service, every application and user, then we’ve got a lot of small address spaces and not one big one.  Traditional switching/routing services demand big boxes because they involve large numbers of users and large amounts of traffic.  It’s unrealistic to think that even a big company network could be hosted using virtual routers and servers, and certainly not a service provider network.  But if we partition, we have little networks that fit very nicely into virtual switch/routing services.

Think about that.  We build VPNs and VLANs today largely by segmenting the connection space of big networks.  We could build the same thing by simply building a private IP or Ethernet network within an SDN physical partition.  And with that step, we could significantly reduce the investment in switches and routers and further extend the utility of SDN.  We push down a lot of the aggregation and basic network connectivity features, along with resilience and SLA management, to a lower layer based on SDN and cheaper to build and operate.

This can be very ad hoc, too.  Think of “Network-for-a-Day” as a service.  Define a community of interest and a task to be performed.  Link the people and resources into an ad-hoc NfaD configuration that’s independent, secure, and has its own SLA.  When you’re done, it goes away.  Nobody has to worry about persistence of connectivity when the project is over, about someone from the project still having access rights.  The network and rights have disappeared together, at day’s end.

IoT could be revolutionized by this.  Think of a sensor-community, a control-community.  Each of these is made up of sub-communities with their own policies, and each feeds a big-data repository that enforces policy-based access.  We now have process-communities that can be linked to these other communities and their repositories, persistently or for a short period.  We can charge for the linkage because there is no presumptive access, because networks are now partitioned and if somebody’s not in yours they are invisible, and vice versa.

Same with content.  A CDN is now a virtual structure.  Users can watch videos and interact with others, in their own narrow community as they define it or in some predefined community.  Different classes of users, or even different users, can have different cache sources and partnerships during viewing.  Same for wireless or wireline, your own wireless or roaming.

For a lot of vendors, this evolution would be a major problem (Cisco or even Alcatel-Lucent might find it pretty destructive of their router business).  For others like Brocade or Plexxi or Big Switch it could be a real windfall, though they might have to do some extra positioning or even product enhancement to get all the benefits on the table.

For operators, SDN has that same “maybe-or-maybe-not” flavor.  Virtual networks, like all things virtual, add a dimension of operations complexity that if left unchecked might compromise not only the SDN business case but the whole service framework.  This makes yet again a point I’ve often made; you need an operations framework for next-gen services or your gains will at best be illusory and at worst could be proven to be losses.

Like NFV, SDN is something that gets better the more of it you do, and that means that it can be a challenge to start small and get anywhere.  The good news for SDN is that there are plenty of places to start, and some of them aren’t all that small.  With a little help from the cloud, SDN could actually overtake NFV in deployments at least for a time, which could perhaps mean SDN principles would influence NFV planning and not the other way around.  Turnabout is fair play!

Looking a Bit Deeper at the NFV Business Case

I got over a hundred emails after my series on making the business case for NFV.  A few didn’t like it (all of these were from the vendor community) but most who contacted me either wanted to say it was helpful or ask for a bit more detail on the process.  OK, here goes.

You have to start an NFV business case at the high level.  Hosting functions is a major shift.  It demands operators shift from a familiar device model to a software model, and it comes along at a time when operators are trying to make systemic changes in cost and revenues to accommodate the crunch they’re experiencing in revenue-versus-cost-per-bit.  There’s risk here, and you have to be able to demonstrate that benefits scale with risk.

The best place to start is to define specific benefit targets.  You have to reduce costs, increase revenues, or both, so that means “cost” and “revenue” are the high-level targets.  For either (or both) of these targets, you’ll need to assess an addressable opportunity and forecast a penetration.

Cost targets are relatively easy.  As I pointed out in a past blog, most operators are judged by the financial markets based on EBITDA, which measures earnings before capital spending and depreciation are considered.  This focus means that unless the operator is relatively small, it’s going to be harder to make a pure capex business case.  In any event, the problem with capex-driven changes in profit per bit is that you’d have to make a massive infrastructure change to address most of the cost, and that’s just not going to happen in an industry with a trillion dollars of installed infrastructure.  Operators also say their own estimates are that you can save 20% or less with hosted functions versus custom devices; they’d rather beat vendors up on price for that amount than risk new technology.

Operators spend about 12% of their revenue dollars on operations costs, about a third of which is network operations and the other two-thirds service operations.  The big question to ask for your cost targeting business case is the scope of the impact of your NFV deployment.  For example, if you propose to target a service that accounts for one one-hundredth of the devices deployed, you can’t expect a revolutionary benefit.  If your NFV impacts the service lifecycle for services that account for ten percent of service operations effort, that’s the limit of your gains.

Even if you have a target for cost control that you can quantify, you may not be able to address all of it.  The best example of this is the network operations savings targets.  Most NFV deployment will demand a change in network operations to be sure, but that change may not be a savings source.  For example, if you’re selling virtual CPE that will reduce truck rolls by letting people change services by loading new device features, you can only consider the truck rolls that are necessitated by service changes, not total truck rolls.  You still have to get a device on premises to terminate the service, and you can only save truck rolls in cases where feature additions would be driving them.

The service operations side is the hardest one to address.  If you think you’re going to save service operations effort you have to first measure the effort associated with the service lifecycle, from order through in-service support.  How many interactions are there with the customer?  How many are you going to eliminate?  If carrier Ethernet is a service target, and if it represents 30% of customer service interactions, cutting its cost by half will save 15% of service operations effort.  You’d be surprised how many NFV business cases say “Service operations is 12 cents of every revenue dollar and we’ll eliminate that cost with virtual CPE” when the defensible vCPE target is only a tenth of customers.
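
The arithmetic is simple enough to sketch.  The first function below is just the carrier Ethernet calculation above.  The second chains it with the figures cited earlier (operations at about 12% of revenue, two-thirds of that being service operations) to express the saving as a share of revenue; that chaining is my own illustrative assumption about how the numbers combine, and the function names are hypothetical.

```python
OPS_SHARE_OF_REVENUE = 0.12    # operations cost as a share of revenue (figure from the text)
SERVICE_OPS_FRACTION = 2 / 3   # portion of operations cost that is service operations

def service_ops_savings(interaction_share: float, cost_reduction: float) -> float:
    """Fraction of total service operations effort saved by the targeted service."""
    return interaction_share * cost_reduction

def savings_per_revenue_dollar(interaction_share: float, cost_reduction: float) -> float:
    """The same saving expressed as a share of every revenue dollar."""
    return (OPS_SHARE_OF_REVENUE * SERVICE_OPS_FRACTION
            * service_ops_savings(interaction_share, cost_reduction))

# Carrier Ethernet: 30% of interactions, cost cut by half.
carrier_ethernet = service_ops_savings(0.30, 0.50)         # 15% of service operations effort
as_revenue_share = savings_per_revenue_dollar(0.30, 0.50)  # roughly 1.2% of revenue
```

On those assumptions the defensible number is a little over a penny per revenue dollar, which is a long way from the “12 cents” some business cases claim.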

On the revenue side, it’s more complicated because you have to consider the addressable market.  Again, the “new” versus “established” customer issue can bite you.  If you reduce time to revenue, you can claim x weeks of new revenue per new customer or new feature for a reduction of x in deployment time.  That won’t apply to customers you’re already earning revenue on, only ones that are having a new service rollout.  And it doesn’t happen every year for the same customer, so don’t multiply that “x-week revenue” gain by the total number of customers!
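
Here’s that trap in a few lines of code.  Every number below is made up purely for illustration (customer counts, weekly revenue, weeks saved); only the structure of the calculation comes from the argument above.

```python
def time_to_revenue_gain(weeks_saved: float,
                         weekly_revenue_per_rollout: float,
                         new_rollouts_per_year: int) -> float:
    """Revenue pulled forward by faster deployment, counted only for new rollouts."""
    return weeks_saved * weekly_revenue_per_rollout * new_rollouts_per_year

# Hypothetical operator: 10,000 customers, but only 800 new service rollouts a year.
TOTAL_CUSTOMERS = 10_000
NEW_ROLLOUTS = 800

defensible = time_to_revenue_gain(3, 500.0, NEW_ROLLOUTS)    # counts new rollouts only
inflated = time_to_revenue_gain(3, 500.0, TOTAL_CUSTOMERS)   # the mistake: all customers
overstatement = inflated / defensible                        # how far the bad case is off
```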

Truly new services are also complicated because of the difficulty in assessing total addressable market.  Most operators don’t have good numbers on opportunities in their areas, but most could get them.  How many business services could you sell?  You can get demographic information by location and SIC/NAICS to estimate the population of buyers.  You can use market data from other areas, other providers, to estimate optimum penetration.

If you go through the motions of a business case, you’re going to end up realizing that the primary challenge in making one for NFV is that improvements in operations or agility are difficult to secure without making a systemic change in practices.  Unless your service target for NFV is very contained, you may be introducing a new set of operations requirements to all network and service management personnel but gaining efficiency for only a small percentage of their interactions.  Thus, the fact that you have to start small with NFV works against creating a big benefit up front, and that makes it hard to justify a big up-front cost.

The easiest way to make a business case for NFV work, as I’ve said, is to first target the orchestration and optimization of service and network management tasks through software automation.  This can be done without deploying any NFV at all, but it can also obviously be tied to an early NFV task.  The operations automation will easily generate enough savings to not only make the early NFV business case, it will probably generate enough to fund considerable future growth in NFV deployment.

If you can’t target wholesale changes in operations/management activity, then the easiest approach is to target opportunities that have very contained operations/management teams.  If business Ethernet services are sold, supported, and managed by an independent group, that group can be targeted because you can identify the costs and limit the impact/scope of your changes to the place where benefits are available.  If the same service is supported by a general team that does all manner of other services, it will be harder to isolate your benefits and make them plausible.

The watchword is think contained targets.  Managed service providers who lease transport from third parties and integrate services with CPE or across carrier boundaries are probably the easiest early NFV targets.  Virtual CPE may be their only real “equipment” and operating it their only operations cost.  MVNOs would be easier targets than real mobile operators for the same reason, and mobile operators easier targets than operators who mixed mobile and wireline services in the same metro area.

NFV as an evolution and NFV as a revolution are hard to reconcile in a business case.  In trying to do that, many vendors and operators have either manufactured savings by distorting numbers, or presented something so pedestrian in terms of value to profit-per-bit that operators yawn instead of cheer.  You can get this right, but you have to think about it more and take the process seriously.

What Has to Happen for Service Automation to Work

NFV is going to succeed, if we define success as having some level of deployment.  What’s less certain is whether it will succeed optimally, meaning reach its full potential.  For that to happen, NFV has to be able to deliver both operations efficiency and service agility on a scale large enough to impact operators’ revenue/cost compression.  Not only is operations efficiency itself a major source of cost reduction (far more than capex would ever likely be) but it’s an important part of any new service revenue strategy.  That’s because new services, and new ways of doing old services, introduce virtualization and connection of virtual elements, and that increases complexity.  Only better service automation could keep costs under control.  It makes no sense to use software to define services and then fail to exploit software-based service and network management.

I think pretty much everyone in the vendor and operator community agrees with these points (yes, there are still a few who want to present the simpler but less credible capex reduction argument).  Consensus doesn’t guarantee success, though.  We’re starting to see some real service automation solutions emerge, and from these we can set some principles that good approaches must obey.  They’ll help operators pick the best approaches, and help vendors define good ones.

The first principle is that virtualization and service automation combine effectively only if we presume service and resource management are kept loosely coupled.  It’s obvious that customers who buy services are buying features, not implementations.  When the features were linked in an obvious way to devices (IP and Ethernet services) we could link service and resource management.  Virtualization tears the resource side so far away from the functionality that buyers of the service could never understand their status in resource terms (“What?  My server is down!  I didn’t know I had one!” or “Tunnels inside my firewall?  Termites?”)

Software types would probably view service and resource management as an example of a pair of event-coupled finite-state machines.  Both service and resource management would be dominated by what could be viewed as “private” events, handled without much impact on the other system.  A few events on each side would necessarily generate a linking event to the other.  In a general sense, services that had very detailed SLAs (and probably relatively high costs) would be more tightly coupled between service and resource management, so faults in one would almost always tickle the other.  Services that were totally best-effort might have no linkage at all, or simply an “up/down” status change.
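
For the software types, a minimal sketch of two event-coupled state machines might look like this.  The states, event names, and the Machine class are all hypothetical; the only thing taken from the discussion above is the shape: most events are private, and a few are configured to generate a linking event to the other side.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]  # (current state, event)

class Machine:
    def __init__(self, name: str, state: str,
                 transitions: Dict[Key, str],
                 linking: Dict[Key, str]):
        self.name = name
        self.state = state
        self.transitions = transitions   # private state changes
        self.linking = linking           # (state, event) pairs that notify the peer
        self.peer: Optional["Machine"] = None

    def handle(self, event: str) -> None:
        key = (self.state, event)
        outbound = self.linking.get(key)
        # Events with no transition defined leave the state unchanged.
        self.state = self.transitions.get(key, self.state)
        if outbound and self.peer:
            self.peer.handle(outbound)

resource = Machine(
    "resource", "normal",
    transitions={
        ("normal", "link_flap"): "normal",          # private: absorbed by the pool
        ("normal", "host_failed"): "recovering",    # private: try to fix it locally
        ("recovering", "redeployed"): "normal",
        ("recovering", "redeploy_failed"): "failed",
    },
    linking={("recovering", "redeploy_failed"): "sla_violation"},
)
service = Machine(
    "service", "active",
    transitions={("active", "sla_violation"): "degraded"},
    linking={},
)
resource.peer = service
```

A tightly coupled service would simply have more entries in the linking table; a best-effort one might have a single up/down entry or none.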

Where the coupling is loose, service events arise only when a major problem has occurred at the resource level, a problem that could impact customer status or billing.  The pool of resources is maintained independently of services, based on overall capacity planning and resource-level fault handling.  Where coupling is tight, fault management is service-specific and so is a response to resource state changes.  The pool is still managed for overall capacity, and faults, but remediation is now moved more to the service domain.

The second principle of efficient service automation is that you cannot allow either fault avalanches or colliding remedies.  Automated systems require event-handling, which means both “passing” events and “handling” them.  In the passage phase, an event associated with a major trunk or a data center might represent hundreds or thousands of faults, and if this number of faults is generated at any point, the result could swamp handling processes.  Even if there’s a manageable number of events to handle, you still have to be sure that the handling processes don’t collide with each other, which could result in collision in resource allocation, delays, or errors.  Traditionally, network management has faced both these problems, and with varying degrees of success.

Fault correlation tools are often used to respond to problems at a low level that generate many high-level events, but in virtual environments it may be smarter to work to control the generation of events in the first place.  I’m an advocate of the notion of a hierarchy of service objects, each representing an intent model with an associated SLA.  If faults are generated at a low level, remediation should take place there with the passage of an event up the chain dependent on the failure of this early stage effort.
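
Here’s a minimal sketch of that hierarchy, assuming each service object knows its parent and has some local test for whether it can remediate a given fault.  The class, the fault names, and the three-level example are hypothetical; a real intent model would carry an SLA and much more.

```python
from typing import Callable, List, Optional

class ServiceObject:
    def __init__(self, name: str,
                 parent: Optional["ServiceObject"] = None,
                 can_remediate: Callable[[str], bool] = lambda fault: False):
        self.name = name
        self.parent = parent
        self.can_remediate = can_remediate
        self.events_seen: List[str] = []   # faults that actually reached this level

    def report(self, fault: str) -> str:
        """Try to fix the fault here; pass it up the chain only if that fails."""
        self.events_seen.append(fault)
        if self.can_remediate(fault):
            return self.name
        if self.parent is None:
            return "unresolved"
        return self.parent.report(fault)

# Hypothetical three-level service: VNF inside a site inside a VPN service.
service = ServiceObject("vpn-service")
site = ServiceObject("site-access", parent=service,
                     can_remediate=lambda f: f == "vnf_crash")
vnf = ServiceObject("firewall-vnf", parent=site,
                    can_remediate=lambda f: f == "process_hung")
```

Faults handled low in the tree never generate events higher up, which is the whole point: the top-level service object only hears about what nobody below it could fix.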

Collisions in management processes seeking to remediate problems, or collision between those processes and new activations, are historically handled by serialization, meaning that you ensure that in a given resource domain you have only one process actually diddling with the hardware/software functionality that’s being shared or pooled.  Obviously having a single serialized handling chain for an entire network would put you at serious risk in performance and availability, but if we have too many chains of handling available, we have to worry about whether actions in one would impact the actions of another.

An example of this is where two services are trying to allocate capacity on a data center interconnect trunk.  They might both “check” status and find capacity available, create a service topology, and then have one fail because it lost the race to actually commit the resources.  That loser would then have to rebuild the service based on the new state of resources.  Too many of these collisions could generate significant delay.

Fixing the handler-collision problem in a large complex system isn’t easy.  One thing that could help is to avoid service deployment techniques that rely on looking for capacity first and then allocating it when the full configuration is known.  That introduces an interval of uncertainty between the two that raises the risk of collision.  Another approach is to allocate the “scarce” resources first, which suggests that service elements that are more in contention should be identified for prioritizing during the deployment process.  A third is to commit the resource when its status is checked, even before actual setup can be completed.
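
The third approach, commit-on-check, can be sketched in a few lines.  This assumes a single trunk object guarded by a lock, which is a simplification; the Trunk class and its methods are hypothetical and stand in for whatever resource manager actually owns the capacity.

```python
import threading

class Trunk:
    def __init__(self, capacity_mbps: int):
        self._free = capacity_mbps
        self._lock = threading.Lock()

    def check(self, mbps: int) -> bool:
        """Check-only: the answer can be stale by the time the caller commits."""
        return self._free >= mbps

    def reserve(self, mbps: int) -> bool:
        """Check and commit in one step, so two services cannot both win."""
        with self._lock:
            if self._free < mbps:
                return False
            self._free -= mbps
            return True

    def release(self, mbps: int) -> None:
        """Give capacity back if the rest of the service setup fails."""
        with self._lock:
            self._free += mbps

trunk = Trunk(capacity_mbps=100)

# Two services each want 80 Mbps.  With check-only, both are told "yes".
both_see_capacity = trunk.check(80) and trunk.check(80)

# With commit-on-check, the loser finds out immediately, before building anything.
first = trunk.reserve(80)
second = trunk.reserve(80)
```

The cost of this design is that a reservation has to be released if the rest of the deployment fails, so the handler needs a reliable cleanup path.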

The final principle of service automation is that the processes have to be able to handle the full scope of services being deployed.  This means not only that you have to be able to automate a service completely and from end to end across all legacy and evolving technologies, but also that you have to be able to do this at the event rate appropriate to the size of the customer base.

The biggest misconception about new technologies like SDN and NFV is that you can bring about wholesale changes in cost with limited deployments.  In areas like operations efficiency and service agility, you gain very little if you deploy a few little pieces of the new buried in the cotton of the old.  Not only that, these new areas with new service lifecycle and operations/management practices are jarring changes to those weaned on traditional models, which means that you can end up losing instead of gaining in the early days.

It seems to me inescapable that if you plan on a new operations or service lifecycle model, you’re going to have to roll that model out faster than you could afford to change infrastructure.  That means it has to support the old stuff first, and perhaps for a very long time.

The other issue is one of scale.  We have absolutely no experience building either SDN or NFV infrastructure at the scale of even a single large corporate customer.  Most users know that technologies like OpenStack are “single-thread” meaning that a domain has only one element that’s actually deploying anything at any point in time.  We can’t build large networks with single-threaded technology, so we’re going to have to parallel SDN and NFV on a very large scale.  How do we interconnect the islands, separate the functions and features, commit the resources, track the faults?  It can be done, I’m convinced, but I’m not the one that has to be convinced.  It’s the operations VPs and CIOs and CFOs and CEOs of network operators.

I’ve noted needs and solutions here, which means that there are solutions.  It’s harder for vendors to sell complex value propositions, but companies in the past didn’t succeed by taking the easy way out.  In any case, while a complete service automation story is a bit harder, it can be told even today, and easily advanced to drive major-league NFV success.  That’s a worthwhile goal.

Comparing the NFV Data-Model Strategies of Key Vendors

I think that most of my readers realize by now that I think the data modeling associated with NFV is absolutely critical for its success.  Sadly, few of the players involved in NFV say much about their approach to the models, and I’ve not been able to get the same level of detail from all of those I’ve asked.  I do think it’s possible to make some general comments on the modeling approaches of the few vendors who have credentials in making an NFV business case, and so I’ll do that here.  To do this, I’ll introduce the key points of an NFV data model and then talk about how vendors appear to address them.  If I cover your own approach in a way you think is incorrect, provide me documentation to support your position and I’ll review it.

The first aspect of a data model is genesis.  All of the NFV models stem from something, and while the source doesn’t dictate the results, it shapes how far the approach can get and how fast.  The primary model sources are TMF SID, cloud/TOSCA, and that inevitable category, other.

Alcatel-Lucent and Oracle appear to have adopted the TMF SID approach to NFV modeling.  This model is widely implemented and has plenty of capabilities, but the extent to which the detailed features of the model are actually incorporated in someone’s implementation is both variable and difficult to determine.  For example, the TMF project to introduce Service Oriented Architecture (SOA) principles, GB942, is fairly rare in actual deployments, yet it is critical in modeling the relationship between events and processes.

HP is, as far as I can determine, the only announced NFV player who uses the cloud/TOSCA approach.  That’s surprising given the fact that NFV deployment is most like cloud deployment, and the TOSCA model defined by OASIS is an international standard.  IBM uses it for its own cloud orchestrator architecture, for example.  I think TOSCA is the leading-edge approach to NFV modeling, but it does have to be extended a bit to be made workable for defining legacy-equipment-based network services.

In the “other” category we have Overture, who uses the open-source distributed graph modeling architecture, Titan.  This is an interesting approach because Titan is all about relationships and structural pictures thereof, rather than being tied to a specific network or cloud activity.  It’s probably the most extensible of all the approaches.

The second aspect of the data model is catalogability, to coin a term.  To make an NFV model work, you have to be able to define a service catalog of pieces, assemble these into service templates, and then instantiate the templates into contracts when somebody orders something.  All of the models described by the key NFV vendors have this capability, but those based on the TMF SID have the longest history of supporting this approach, since SID is the model long used by OSS/BSS systems and vendors.
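
To make the catalog-to-template-to-contract chain concrete, here is a minimal Python sketch.  It is not drawn from any vendor’s product or from the SID; every class, field, and item name in it is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass(frozen=True)
class CatalogItem:
    """A reusable piece of a service (hypothetical structure)."""
    name: str
    kind: str  # e.g. "vnf" or "legacy"


@dataclass(frozen=True)
class ServiceTemplate:
    """An orderable service assembled from catalog items."""
    name: str
    items: tuple


@dataclass
class Contract:
    """A template instantiated for one customer order."""
    contract_id: int
    customer: str
    template: ServiceTemplate
    parameters: dict = field(default_factory=dict)


class ServiceCatalog:
    def __init__(self):
        self._items = {}
        self._templates = {}
        self._ids = count(1)  # contract numbers start at 1

    def add_item(self, item):
        self._items[item.name] = item

    def define_template(self, name, item_names):
        # Raises KeyError if a template references an uncataloged piece.
        items = tuple(self._items[n] for n in item_names)
        template = ServiceTemplate(name, items)
        self._templates[name] = template
        return template

    def order(self, customer, template_name, **parameters):
        # Instantiating a template for a customer yields a contract.
        template = self._templates[template_name]
        return Contract(next(self._ids), customer, template, dict(parameters))


catalog = ServiceCatalog()
catalog.add_item(CatalogItem("virtual-firewall", "vnf"))
catalog.add_item(CatalogItem("edge-router", "legacy"))
catalog.define_template("secure-access", ["virtual-firewall", "edge-router"])
contract = catalog.order("acme", "secure-access", bandwidth_mbps=100)
print(contract.contract_id, [item.kind for item in contract.template.items])
```

The point of the sketch is only the separation of concerns: pieces are defined once, templates reference pieces by name, and a contract is the per-customer instance that lifecycle processes would later act on.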

Both HP and Overture have the ability to define models of substructures of a service and assemble them.  Either approach appears to be suitable, but both lack the long deployment history of SID.  The integration of operations processes with these non-SID models also has to be addressed explicitly, since there’s no TMF underlayment to appeal to for integrating features with operations.  HP also provides for inheritance from base objects, and they appear to be alone in modeling resource structures as well as service structures.  I think SID models might be able to do that, but I can’t find an example in the material for our TMF-based vendors.

The third aspect of the data model is process integration.  In order to synchronize and support efficient network and service management, you somehow have to link NFV to NMS and OSS/BSS processes.  There are two basic ways to do that: the virtual device approach and what I’ll call the explicit event steering approach.  Details on how vendors do this stuff are extremely sparse, so I can’t be as definitive as I’d like.

It appears that both Alcatel-Lucent and Oracle have adopted the virtual-device approach.  NFV sits at the bottom of a resource structure defined in the SID, and operations and management processes couple to SID as usual.  The goal of NFV deployment implicitly is to make the deployed elements look like a virtual device which can then be managed as usual.  In theory, the GB942 event coupling to SOA interfaces via the SID is available, but I’ve got no details on whether it is implemented by either vendor.  Since GB942 implementation is rare, the answer is “probably not”.  This combination means that service automation steps “inside” VNFs are probably managed totally by VNF Managers, and may be opaque to OSS/BSS/NMS.  I can’t be sure.

HP’s approach seems also to rely on VNFMs for lifecycle processes at the NFV level, but they have a general capability to link events to processes through what they call a monitor in their SiteScope management component.  You can establish monitor handlers, transgression handlers, trigger handlers, etc.  It would appear to me that this could be used to integrate lifecycle processes within NFV to larger management and operations domains, though I don’t have an example of that being done.

Overture has the clearest definition of process integration.  They employ a GB942-ish mechanism that uses a business process language and service bus to steer events generated by their management analytics layer, or another source, to the correct operations/management processes.  The approach seems clear and extensible.
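
As a rough illustration of what explicit event steering means, the Python sketch below routes an event to operations processes based on the state of the service element and the event type.  This is my own simplification, not Overture’s implementation and not the GB942 specification; the `EventBus` class, the handler functions, and the state and event names are all hypothetical.

```python
from collections import defaultdict


class EventBus:
    """Toy stand-in for a service bus that steers events to processes."""

    def __init__(self):
        # (element state, event type) -> list of process callables
        self._routes = defaultdict(list)

    def route(self, element_state, event_type, process):
        self._routes[(element_state, event_type)].append(process)

    def publish(self, element_state, event):
        # Look up the processes bound to this state/event pair and run them.
        processes = self._routes.get((element_state, event["type"]), [])
        return [process(event) for process in processes]


# Hypothetical operations/management processes.
def open_trouble_ticket(event):
    return f"ticket opened for {event['source']}"


def redeploy_vnf(event):
    return f"redeploy requested for {event['source']}"


bus = EventBus()
bus.route("active", "fault", open_trouble_ticket)
bus.route("active", "fault", redeploy_vnf)

results = bus.publish("active", {"type": "fault", "source": "vFW-1"})
print(results)
```

The design choice worth noticing is that the binding between events and processes lives in data (the routing table) rather than inside any one VNF Manager, which is what would let OSS/BSS and NMS processes see lifecycle events instead of having them hidden behind a virtual device.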

The next area to consider is management data integration.  To me, the baseline requirement for this is establishing a management repository where status information from resources and services is collected.  Everyone seems to do that, with Alcatel-Lucent and HP integrating their management data with their own platforms, and Overture using a system that unites open-source NoSQL, Cassandra, and Titan.  In theory all these approaches could generate management data correctly, and I believe that both HP and Overture could present custom management of modeled service elements.  For the rest, I don’t have the data to form a clear view.

The final point is support for modeled legacy elements.  In HP this capability is explicit in the modeling, and HP is the only player that I can say that for.  HP’s data model incorporates legacy components explicitly so you can define service solutions in a mixture of legacy and SDN terms.  Since HP’s NFV Director is derived from their legacy control product line, it has full capabilities to control legacy devices.  HP can also operate through OpenStack and ODL.

With Alcatel-Lucent and Oracle, modeling legacy elements is implicit in their SID basis, meaning that their processes are really legacy processes into which NFV is introduced so there’s little question legacy networking could be accommodated.  Under the virtual devices that NFV represents, both support OpenStack and ODL.

Overture can support legacy devices in a number of ways.  First, the Titan model is technology-agnostic so there’s no data model issue.  The model can drive a control-layer technology set to actually deploy and connect resources using legacy or NFV (or SDN) technologies.  Overture has their own network controller that (obviously) supports their own equipment and can be augmented through plugins to support third-party elements.  They can also work through OpenStack and ODL to support legacy devices that have suitable plugins for either of those environments.

The data models used for services are, as I’ve said, the most important thing about how a vendor implements NFV, and yet we don’t talk about them at all.  Public material on these models is limited for all the vendors I’ve listed.  What would be great would be an example of a service modeled in their architecture, containing both VNFs and legacy elements, and linked to NMS and operations processes.  If any of these four vendors (or anyone else) wants to send me such a picture for open use and publication then I’d love to see it and I’d almost surely blog on it.

What can we learn about NFV from this?  Well, if the most important thing about NFV is something nobody is talking about and few have any support for at all, we’re not exactly in the happy place.  There is no way you could do a responsible analysis of NFV implementations without addressing the points I’ve outlined here, which means without defining an explicit data model to describe services and resources.  If you see an analysis without these points (whether from a vendor, the media, or an analyst) it’s incomplete, period.  I hope my comments here will help improve coverage on this important topic!

What’s the REAL Impact of Virtualization on Network Security?

When I was a newly-minted programmer I saw a cartoon in the Saturday Evening Post (yes, it was a long time ago!).  A programmer came home from the office, tossed his briefcase on the sofa, and said to his wife “I made a mistake today that would have taken a thousand mathematicians a hundred years to make!”  This goes to show that automation doesn’t always make things better when it makes them faster.  Sometimes it doesn’t make them better at all.

Automation is a necessary part of virtualization, and we can certainly envision how a little service automation done wrong could create chaos.  One area stands out in terms of risk, though, and that’s security/compliance.  We learned with the cloud that security is different in a virtual world.  Will it be different for SDN and NFV, and how?  What will we do about it?  There are no solid answers to these questions yet, but we can see some signposts worth reading.

Virtualization separates function from realization, from resources.  When you create services with virtualization, you are freed from some of the barriers of service-building that have hampered us in the past, most notably the financial and human cost of making changes to accommodate needs or problems.  Adding a firewall to a virtual network is easier than adding one to a device network.  However, you also create issues to go with your opportunities.  The potential for malware in the virtual world is very much higher because you’ve converted physical devices with embedded code into something not very different from cloud components—something easier to get at.

I would propose that security in the virtual world is a combination of what we could call “service security” and “resource security”.  The former would involve changes to the “black-box” or from-the-outside-in experience of the service and how those changes would either improve or reduce security.  The latter would involve the same plus-or-minus, but relating to the specific processes of realizing the resource commitments associated with each function.  Service security relates to the data plane and any control or management traffic that co-resides with it, and resource security relates to the plane where virtualization’s resources are connected.

Service security differences would have to arise from one of three problems—spoofing an endpoint, joining a limited-connectivity service, or intercepting traffic.   To the extent that these are risks exposed within the data plane of the service (or co-resident control/management planes) you would have the same exposure or less with SDN or NFV.  SDN, for example, could replace adaptive discovery with explicit connectivity.  Thus, at the basic service level, I think you could argue that virtualization doesn’t have any negative impact and could have a positive one.

Virtualization’s agility does permit added security features.  NFV explicitly aims at being able to chain in security services, and this approach has been advocated for enterprises in a pure SDN world by Fortinet.  You could augment network security by chaining in a firewall, encryption, virus-scanning on emails, DNS-based access control, or other stuff without sending a tech out or asking the customer to make a device connection on premises.  However, remember that you can have all these features today using traditional hardware, and that’s usually the case with businesses and even consumers.  It might be easier to add something with virtualization, but in the end we end up in much the same place.

If we carried virtualization to its logical end, which is application- and service-specific networks built using OpenFlow and augmented with NFV, you could see the end to open connectivity and the dawn of very controlled connectivity, almost closed-user-group-like in its capabilities.  Given this, I believe that at the service level at least, virtualization is likely to make security better over time, and perhaps so much better in the long term that security as we know it ceases to be an issue.  I want to stress the point that this isn’t a near-term outcome, but it does offer hope for the future.

If service security could be better with virtualization, the resource side of virtualization is another story.  At some point in the setup of a virtualization-based service, you have to commit real resources and connect them.  This all happens in multi-tenant pools, where the isolation of tenants from each other and management/control pathways from tenant data planes can’t be taken for granted.

The first fundamental question for either SDN or NFV is the extent to which the resource domain’s control traffic is isolated from the service data plane.  If you can address or attack resource-domain components then there’s a security risk of monumental proportions being added when you add virtual infrastructure, and nothing much you do at the service level is going to mitigate it.  I think you have to think of the resource domain as the “signaling network” of virtualization and presume it is absolutely isolated from the services.  My suggestion was to use a combination of physical-layer partitioning, virtual routing, and private IP addresses.  Other approaches would work too.
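
The addressing part of that suggestion can be checked mechanically.  The sketch below uses Python’s standard `ipaddress` module to flag resource-domain networks that are either not private address space or that overlap a service network.  It covers only the IP addressing piece; physical-layer partitioning and virtual routing are outside what it models, and the function name and sample prefixes are assumptions made for illustration.

```python
import ipaddress


def check_resource_isolation(resource_nets, service_nets):
    """Return a list of addressing problems in the resource domain."""
    problems = []
    resource = [ipaddress.ip_network(n) for n in resource_nets]
    service = [ipaddress.ip_network(n) for n in service_nets]
    for r in resource:
        # The signaling network should live in private address space.
        if not r.is_private:
            problems.append(f"{r} is not private address space")
        # It should also never share addresses with a service data plane.
        for s in service:
            if r.overlaps(s):
                problems.append(f"{r} overlaps service network {s}")
    return problems


# Hypothetical address plan with two deliberate mistakes.
problems = check_resource_isolation(
    resource_nets=["10.10.0.0/16", "8.8.8.0/24"],
    service_nets=["10.10.4.0/24", "192.168.1.0/24"],
)
for p in problems:
    print(p)
```

A clean plan, say a resource domain of `10.20.0.0/16` against the same service networks, would return an empty list.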

If you isolate the signaling associated with virtualization, then your incremental resource risks probably come from either malware or “maloperators.”  There is always a risk of both these things in any operations center, but with virtualization you have the problem of needing a secure signaling network and then needing to let things onto it.  The VNFs in NFV and perhaps even management elements (the VNF Managers) would be supplied by third parties.  They might, either through malice or through error, provide a pathway through which security problems could flow into the virtualization signaling network.  From there, there could be a lot of damage done depending on how isolated individual VNF or service subnets of virtual resources were from each other and from central control.

To me, there are three different “control/management domains” in an SDN/NFV network.  One is the service data plane, which is visible to the user.  The second is the service virtual resource domain, which is not visible to the user and is used to mediate service-specific resource connections.  The third is the global management plane, which is separate from both the above.  Some software elements might have visibility in more than one such plane, but with careful control.  See Google’s Andromeda virtual network control for a good example of how I think this part has to work.
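
One way to picture “visibility in more than one plane, but with careful control” is as an explicit whitelist of which kinds of software element may attach to which domains.  The Python sketch below is a hypothetical policy of my own, not taken from Andromeda or from any NFV specification; the role names and the plane assignments are assumptions made purely for illustration.

```python
from enum import Enum


class Plane(Enum):
    SERVICE_DATA = "service data plane"
    SERVICE_RESOURCE = "service virtual resource domain"
    GLOBAL_MANAGEMENT = "global management plane"


# Hypothetical policy: the planes each element role is allowed to see.
ALLOWED = {
    "user_endpoint": {Plane.SERVICE_DATA},
    "vnf": {Plane.SERVICE_DATA, Plane.SERVICE_RESOURCE},
    "vnf_manager": {Plane.SERVICE_RESOURCE, Plane.GLOBAL_MANAGEMENT},
    "orchestrator": {Plane.GLOBAL_MANAGEMENT},
}


def violations(role, attached_planes):
    """Return the planes an element is attached to but not permitted in."""
    allowed = ALLOWED.get(role, set())  # unknown roles are allowed nothing
    return set(attached_planes) - allowed


# A third-party VNF that has somehow been given a management-plane address.
bad = violations("vnf", {Plane.SERVICE_DATA, Plane.GLOBAL_MANAGEMENT})
print(sorted(p.value for p in bad))
```

Defaulting unknown roles to no access is deliberate: anything let onto the signaling network without a declared role is treated as a violation rather than trusted.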

Andromeda illustrates a basic truth about virtualization, which is that there’s a network inside the “black box” abstractions that virtualization depends on.  That network could boost flexibility, agility, resilience, and just about everything else good about networking, but it could also generate its own set of vulnerabilities.  Yet, despite the fact that both Amazon and Google have shown us true black-box virtual networking with their cloud offerings, we’re still ignoring the issue.

The critical point here is that most virtualization security is going to come down to proper management of the address spaces behind the services, in the resource domain.  Get this part right and you have the basis for specifications on how each management process has to address its resources and partner processes, and where and how it crosses over into other domains.  Get it wrong and I think you have no satisfactory framework for design and development of the core applications of SDN or NFV and you can’t ever hope to get security right.

We need an architecture for SDN and NFV, in the fullest sense of the word.  We should have started our process with one, and we aren’t going to get one now by simply cobbling products or ideas together.  The question is whether current work, done without a suitable framework to fit in, can be retrofitted into one.  If not, then will the current work create so much inertia and confusion that we can’t remedy the issues?  That would be a shame, but we have to accept it’s possible and work to prevent it from happening.