Putting the ETSI NFV Architecture Through a Hypothetical Scenario Set

Hopefully your interest in NFV management prompted you to read yesterday’s blog and you’re ready to follow up.  If not, you may want to review it before you read this one because I’m building on the last one with only a very brief level-set!

Let’s assume we have a VNF with four components, one of which is horizontally scalable and sits in front of the other three, which are in line.  You can draw this out as four boxes left to right, with the user on the extreme left and the “service” interior on the right.  This is supported using a subnet with private IP addressing (like the usual 192.168.1.x).  The leftmost VNF has a port exposed via a Google-Andromeda-or-Amazon-Elastic-IP-like mechanism, which for brevity here I’ll just call “SuperNAT”.  Similarly, the rightmost has an exposed port for service connection.
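To make the scenario easier to refer back to, here is a minimal Python sketch of the four-box picture.  The component names, specific addresses, and port numbers are illustrative assumptions and don't come from any ETSI document; only the shape (one scalable front element, three in-line elements, two exposed ports) follows the description above.

```python
from dataclasses import dataclass
from typing import Optional
import ipaddress

# Hypothetical private subnet for the VNF, matching the 192.168.1.x example.
SUBNET = ipaddress.ip_network("192.168.1.0/24")

@dataclass
class Component:
    name: str                           # illustrative name only
    private_ip: str                     # address inside the private subnet
    scalable: bool = False              # True for the horizontally scalable element
    exposed_port: Optional[int] = None  # published through "SuperNAT" if set

# Left to right: the user-facing element first, the service side last.
chain = [
    Component("front", "192.168.1.10", scalable=True, exposed_port=443),
    Component("inline-1", "192.168.1.11"),
    Component("inline-2", "192.168.1.12"),
    Component("inline-3", "192.168.1.13", exposed_port=8443),
]

def exposed(components):
    """Names of components that publish a port outside the subnet."""
    return [c.name for c in components if c.exposed_port is not None]

def all_in_subnet(components, subnet):
    """True if every component address belongs to the private subnet."""
    return all(ipaddress.ip_address(c.private_ip) in subnet for c in components)
```

Only the two ends of the chain show up in `exposed(chain)`; everything else stays invisible outside the subnet.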

Let’s assume that we have a lot of load on our VNF on-ramp element.  The first obvious question is how we know that.  In the ETSI model we have Element Managers (EM) that are associated with the VNFs and we also have a VNF Manager.  It would seem logical that if the VNFs themselves were capable of understanding their own load profiles, EMs could communicate a need for scaling.  If not, it would have to come from “outside” meaning that the state of actual network and hosting resources might be used.

Whatever the source, scaling would have to be initiated as a lifecycle process, and the VNFM would drive the VIM to allocate additional instances.  That much is clear.  What is less clear is what would happen in a case like my example here, where in order to support multiple instances of our head-end VNF we’d likely need to load-balance.  We now need an additional functional component not part of the original picture.  How does that get instantiated?  Normally, the NFV Orchestrator would be responsible for this sort of decision.  Remember that we have a coordinated need to deploy the load balancer and to then reconnect the front-end elements, including the connection to the user.  (Note that for service availability reasons we might presume that every service with scaling had a predefined load-balancing element in the configuration to prevent interruptions during this reconnect phase.)
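The coordination problem can be sketched as an ordered list of lifecycle steps.  This is only a sketch under assumptions: the function name, the step wording, and the instance naming scheme are all hypothetical, and nothing here says which ETSI element (Orchestrator or VNFM) would actually own each step.

```python
def scale_out(instances, has_load_balancer):
    """Plan the steps to add one front-end instance.

    instances: names of front-end instances already running.
    has_load_balancer: whether a load balancer was predefined in the service.
    Returns (ordered steps, new instance list).
    """
    steps = []
    if not has_load_balancer:
        # The extra functional component that was not in the original picture.
        steps.append("deploy load balancer")
        for name in instances:
            steps.append(f"register {name} with load balancer")
        # The user connection has to be re-pointed, which is the risky part.
        steps.append("move user-facing address to load balancer")
    new_name = f"front-{len(instances) + 1}"
    steps.append(f"deploy {new_name}")
    steps.append(f"register {new_name} with load balancer")
    return steps, instances + [new_name]
```

Calling `scale_out(["front-1"], False)` produces five steps including the reconnection of the user, while the same call with a predefined load balancer produces only two, which is the argument for including the load balancer in the configuration from the start.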

Faults are more complicated.  If something in NFVI breaks, then we have two possible paths forward.  First, we could assume that the fault would be recognized by the VNFs themselves based on conditions that would be visible to them on their interconnected pathways.  So if a VNF fails in our hypothetical service, the VNFs connected to it would presumably recognize the problem.  The other possibility is that the fault would be recognized by the infrastructure management systems, whatever they are.

The ETSI spec suggests that a VIM could notify a VNF manager of “changes in state”, and one might presume that this would mean that the VNFM could either undertake to replace the item on its own, or could notify an EM in the VNF, which would then start remediation.  It seems to me that if you have VNFMs and EMs in the picture, you have to let both of them know what’s being managed.

In a fault situation, we’re assuming that everything in real remediation terms is getting done by the VNFM, just as we did in the scaling example above.  That’s reasonable given that the VNFM seems to have all the parametric data on the service, but it kind of makes the Orchestrator look like a rump function.  I’d like to see a model where all of this was collected into a single software structure; I think it’s going to be difficult to build something with the separation of functions and the exposure of interfaces that the ETSI model defines, given the ambiguity of roles.

You can see the security ambiguity I talked about in the last blog more clearly now.  VNFMs have the ability to command resources, which means that to control a VNFM is to control infrastructure, at least in some sense.  The challenge here is that if the VNFM is specialized to the service itself, meaning we have per-service or per-VNF VNFMs, or even just if we have proprietary VNFMs, we’ve relaxed security on the network.  A single service, or worse a single service instance and its associated customer, has the ability to call on infrastructure.

I understand that this could in theory be prevented, meaning that you could “authenticate” requests.  The problem is that it’s hard to know what’s authentic.  Remediation or scaling requests additional resources, which obviously impacts the shared pool.  Under what conditions does a VNFM have the right to do that?  Who enforces the conditions?  If we say that the VNFM enforces its own security, we’ve just justified having no security at all because that principle runs afoul of all the firewall and management integrity checks traditionally built into networks.
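One way to picture enforcement that lives outside the VNFM is a gate that sits between every VNFM and the VIM and checks each request against a quota.  The sketch below is purely hypothetical: `ResourceRequest`, `PolicyGate`, and the vCPU quota are assumptions made for illustration, and the ETSI material defines no such element.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceRequest:
    vnfm_id: str      # which VNFM is asking (hypothetical identifier)
    service_id: str   # which service it claims to act for
    vcpus: int        # resources requested from the shared pool

class PolicyGate:
    """Hypothetical enforcement point owned by the operator, not the VNFM."""

    def __init__(self, quotas):
        # quotas: {(vnfm_id, service_id): maximum vCPUs that pair may hold}
        self.quotas = dict(quotas)
        self.used = {}

    def authorize(self, req):
        key = (req.vnfm_id, req.service_id)
        limit = self.quotas.get(key)
        if limit is None or req.vcpus <= 0:
            return False  # unknown VNFM/service pair, or a nonsense request
        if self.used.get(key, 0) + req.vcpus > limit:
            return False  # would exceed what this service may draw from the pool
        self.used[key] = self.used.get(key, 0) + req.vcpus
        return True
```

With a quota of 8 vCPUs for one VNFM and service, a request for 6 is granted and a following request for 4 is refused.  The point is only that the decision is made by something the VNFM does not control.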

Then there’s operations integration.  We are spinning up additional VMs in scaling, and in remediation we’re replacing components due to a fault that would very possibly create an SLA violation.  It’s hard to see how both these conditions wouldn’t have to be reflected broadly, and in three specific places in particular: the service model for NFV, the network operations center, and OSS/BSS.

Even for “NFV operations”, we have to maintain an accurate model of the service or we can’t respond to future change requirements.  Imagine the challenges of fault management if scaled components didn’t show up on the service model.  How does that happen, though?

I also wonder how a NOC finds out about a problem with a VNF.  You could say that overloading of a VNF isn’t a NOC problem, but if NOCs are still expected to respond to customer complaints, how do they see the conditions that the NFV service itself is responding to?

Then there’s how this gets integrated with OSS/BSS.  If a customer calls and asks something about the service, can a customer service rep dig into the details of the current service model and state?  Right now there’s no interface expressed for that, or any specific detail on the model itself.

You might get the idea from this that I’m saying that NFV won’t work as described.  I’m not saying that, but I am saying that I doubt that the ETSI model could be taken literally.  Operators tell me that all of their PoC and lab work is building out from basic ETSI descriptions into what’s essentially ad hoc extensions of NFV to cover all the bases.  That’s not necessarily a bad thing; innovation and multiple approaches can be valuable.  It does tend to negate the standards, though, because these innovations could very well be PoC-specific, service-specific, vendor-specific, and thus generate a bunch of silos.

What’s needed here?  Well, the simple answer is that we need to define the abstractions themselves—the service models that MANO would use, the models that are used by MANO to drive the VIMs—and we need complete flow diagrams to describe explicitly what happens under the kind of conditions I’ve outlined here.  You can’t define an architecture without testing your structure with the kind of things it’s expected to handle.  That’s routine in software design.  It needs to be done with NFV, and quickly.

NFV Management Discussion Phase One: NFV as a World of Subnets

NFV management has never been my favorite part of NFV, and I’ve groused about it here fairly regularly.  It’s probably time to talk about the issues in more detail, and so I’m going to do an as-yet-undetermined number of blogs in a series about the issue.

To get this straight, we have to set the stage.  NFV presumes that virtual network functions (VNFs) are collections of components that are hosted and connected during the deployment process by the NFV Orchestrator.  The management, meaning lifecycle management, of this collection is the responsibility of the VNF Manager or VNFM.

VNFs would have to be collected in some sort of subnetwork, and this is shown in the ETSI End to End Architecture Document’s Figure 3.  The easiest way to think of this would be as an IP subnet, though no specific reference to a network structure is provided in the document.  I’m presuming one here because it’s difficult to talk about management issues when you don’t have any specific way to reference the things you’re managing.

In our hypothetical NFV subnet we’d have a bunch of hosted software components (VNFCs) that are linked somehow.  The ETSI material calls the relationship a forwarding graph, but I’m not sure that doesn’t presume simple linear service chains.  Even if you had chains, you’d need to have something to chain through, meaning a network service that offered connectivity to the component.  Using this the elements inside a subnet would all be able to talk to each other, presuming they had an address reference.  Our components will also have to be visible to the real world, at least in part.  The ETSI Figure 3 shows endpoints connected to VNF1 and VNF3, which presumes that these endpoint connections on the VNFs are “visible” in the service address space, outside the VNF subnet.

Security, compliance, and sanity seem to dictate the presumption that our subnet is based on something like an RFC 1918 address space (assuming IPv4), so the VNFs would all be invisible to the outside world.  To make some of the ports on VNFs visible, we’d have something like NAT to translate between one of the private addresses and a public address.  We’d also need a DHCP function to assign addresses and a DNS to allow the components to see each other.  If we do this, then every VNF lives in its own private universe, secure from visibility to other VNFs and even to its own service address space.  It shares only ports it is designed to share, and only to other subnets it specifically elects to link with.

So where we are is as follows:  Something like Figure 3 would probably be set up by defining a private subnet with its own DNS and DHCP, and with NAT to convert between the internal addresses it wants to publish in the service data space, and addresses in that space.
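A small Python sketch of that setup, using only the standard `ipaddress` module.  The NAT table contents and the function names are assumptions for illustration; 203.0.113.0/24 is a reserved documentation range standing in here for the service address space.

```python
import ipaddress

# The three private IPv4 blocks defined by RFC 1918.
RFC1918 = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

def is_rfc1918(addr):
    """True if addr falls in one of the RFC 1918 private blocks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918)

# Hypothetical NAT table: (service-space address, port) -> (private address, port).
NAT_TABLE = {
    ("203.0.113.5", 443): ("192.168.1.10", 8443),
}

def translate_inbound(public_ip, port):
    """Return the private endpoint for a published port, or None if unpublished."""
    return NAT_TABLE.get((public_ip, port))
```

Anything without an entry in the table, such as `translate_inbound("203.0.113.5", 22)`, simply has no path into the subnet, which is the "private universe" property described above.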

We’re not done with subnets, though.  We have to be able to deploy this stuff, right?  Thus, we have to presume that there is an infrastructure network where all of the resources live.  We also have to assume that either in this network or in yet another network we have all the elements of NFV software, which means MANO, EMSs, and OSS/BSS connections (the actual OSS/BSS could be elsewhere but we have to be able to reach it).

You’re probably wondering why I’m getting into this, and the answer is that the framework we presume has to be there to deploy NFV will also have to support management of NFV services once they’re deployed.  We have to be able to harmonize the role of VNFM within this structure, and if we have any issues we have to get them addressed.

Management starts with the presumption that there is, included with a VNF, an Element Manager that performs all the VNF’s typical management functions.  This EM links with the VNFM for resource information and to provide lifecycle management.  The VNFM would go to the Virtual Infrastructure Manager to deploy something, such as a scale-out.  However, it appears that the NFV Orchestrator also goes to the VIM for deployment.  To start with, wouldn’t it be logical to say that “deployment” was a lifecycle stage?  Yes, but if an EM has to request lifecycle management, that can’t happen till the EM, which presumably co-deploys with the VNF, is actually deployed to do the requesting.

Apart from this we have some challenges of addressing and security.  It’s reasonable to assume that the EM talks to the VNFM through one of those NATted interfaces, so we can at least make the connection.  As long as there’s some record of the EMS address so that the VNFM could contact it, presuming it needs to, we are fine.

An issue arises if we look deeper into the VNFM proposal.  There’s a goal of supporting multiple VNFMs, so that VNF-specific VNFMs could be offered.  The reason given is that the task of lifecycle managing a VNF could be pretty specialized, and that may be true.  However, we now have to look at the addressing and security issues.

If a VNFM is provided by a VNF vendor, where does it live?  You have three options.   You can put it inside the VNF, inside the private subnet where MANO and the rest of the software lives, or in some third disconnected subnet.  What are the implications?

If you put the VNFM inside the VNF then we’re letting VNFs manage their own lifecycles, allocate resources, etc.  We have to give the VNFM a link to the VIM, which means that the VNF can “see” infrastructure directly and control real resources.  I think this is a serious security and stability problem.

If we put the VNFM inside the MANO subnet, we’re letting vendors add service-specific software inside NFV’s control software, where there are no barriers to what it could do.  That is IMHO a far worse issue with security and stability.

If we put the VNFM in its own subnet we’re still giving that subnet access to a VIM, and while that could be made more secure than the first (and second) options, it’s still not ideal.  The VNFM still can directly control resources.

My conclusion here is that we need to be looking at NFV deployments like cloud applications in a multi-tenant world.  Amazon and Google both provide a mechanism much like I’ve described to create subnets where components are hosted using private IP and then use NAT or “elastic IP” addresses to map to addresses visible outside.  We have to be able to draw a picture of an NFV deployment as a set of IP subnets which interconnect in some way.  Google offers such a picture in its own Andromeda architecture.  If we can draw the subnet structure of NFV we can see at least what connections between private spaces and public spaces are, and whether these connections are sufficient to permit NFV software to function as needed and secure enough to be acceptable by operators.

Functionality is yet another matter.  The best way to look at how this would work is to look at a deployment, a horizontal scaling, and a fault response.  That’s what I’ll do in later blogs.

Could SDN or NFV Save Us From Massive Outages?

Since the dual United Airlines and NYSE outages I’ve gotten a lot of email about the stability of new network architectures.  While I don’t have any special insight into those incidents and so can’t (and won’t) speculate on how they were caused or how they could be prevented, I do have some experience with network outages.  Are SDN and NFV going to make things easier for us, harder, or what?

The core reality of networking today is that it’s adaptive in operation.  Devices exchange information via control protocols, and the propagation of this information establishes forwarding rules that are then used to move packets around.  At the same time, management information moves to and from specific devices using the same data paths and forwarding rules.  The adaptive nature of today’s networks makes them vulnerable in two distinct ways.

First, a device could be cut off from the rest of the network by a failure.  In this case, the device wouldn’t be accessible to management commands and thus could not be controlled, at least not until a pathway to that device was restored.  If the device had been contaminated by bad parameterization or software, the problem might prevent paths from ever being established, in which case you’d have to send someone to manually fix things (or provide an out-of-band management connection to every device).

The second issue is the bad apple problem.  You know (maybe) the old saying that “One bad apple spoils the barrel”.  The fact that devices in a legacy network derive most of their knowledge of topology and status from an exchange of information with adjacent devices means that a single device that’s contaminated could contaminate the whole network with bad information.  In most cases this means either that the device advertises “false” routes that are suboptimal or perhaps can’t even work, but it might also mean that the device floods partners with nonsense, ignores requests, and so forth.
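The bad-apple effect can be shown with a toy distance-vector calculation.  This is a deliberately simplified sketch, not a model of any real routing protocol: the four-node line topology, the unit link costs, and the `converge` function are all assumptions made for illustration.

```python
def converge(links, advertised, rounds=10):
    """Toy distance-vector convergence toward a single destination.

    links: {node: [neighbors]}, every link has cost 1.
    advertised: {node: cost that node claims}, either a genuine direct
    attachment to the destination or a false claim by a bad device.
    Returns {node: next hop}, with None for nodes using their own claim.
    """
    cost = {n: advertised.get(n, float("inf")) for n in links}
    next_hop = {n: None for n in links}
    for _ in range(rounds):
        for n in links:
            if n in advertised:
                continue  # this node sticks to whatever it advertises
            for nb in links[n]:
                if cost[nb] + 1 < cost[n]:
                    cost[n] = cost[nb] + 1
                    next_hop[n] = nb
    return next_hop

# Line topology A - B - C - D; only D genuinely reaches the destination (cost 2).
LINKS = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

honest = converge(LINKS, {"D": 2})
poisoned = converge(LINKS, {"D": 2, "A": 0})  # A falsely claims cost 0
```

In the honest case B forwards toward C and C toward D.  Once A advertises its false route, both B and C forward toward A instead, so a single device has redirected traffic it never touches directly.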

Both these problems tend to happen for two reasons.  First, the device is parameterized incorrectly, meaning that there’s a human error dimension.  The largest network outage I ever saw in my career, which took over fifty sites down hard for over 24 hours and caused failures of at least a quarter of sites at any given time for a week, was caused by a parameter error.  The second issue is a software problem.  We’ve all heard about how software updates to a device cause it to behave badly with its neighbors.

Logically, the questions to be asked for NFV and SDN are, first, how susceptible they’d be to the current pair of issues and second whether there might be new issues arising.  Let’s look at those points.

We have to set the stage here.  In SDN, we have a number of models of service in play these days.  Classic OpenFlow SDN says that white-box switches have their forwarding managed entirely by a central SDN controller.  In some cases classic SDN is overlaid on legacy forwarding, meaning that there’s still adaptive topology management being done by the device but explicit forwarding control via OpenFlow is possible.  Some other models (Cisco’s preferred approach) would utilize legacy adaptive behavior completely and use policies to add software control over the process.

In any model that retains adaptive behavior, we have the same risks that we have today.  If the model adds central SDN forwarding control, then we add the risks that such control might add.  Primarily, the risk of central control is the failure of the central controller.  If a controller fails, then it can’t respond to network conditions.  That doesn’t mean the network fails, only that it can’t be changed to reflect failures or changes in traffic or connection topology.  The big question for an SDN controller, IMHO, is whether it’s stable under load.  My biggest-network-failure example was caused by a parameter error, but the reason it exploded was that the error caused a flood of problems that swamped a central control mechanism.  When that failed under load, everything broke, and since everything now had to be restored, the controller never came up.

Bad-apple device problems in SDN wouldn’t impact the topology and forwarding of the network, but if a device went maverick and didn’t do forwarding updates correctly or at all, the central controller might not “understand” that the route it commanded hadn’t really worked.  I’ve not yet seen a demo/test of a controller that involved checking the integrity of routes and perhaps flagging a device as being down if it’s not doing what it’s supposed to do.
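The kind of integrity check described here could look something like the following sketch.  The data layout and the function name are hypothetical, and the sketch glosses over how a controller would actually read back installed rules (and whether a misbehaving device would report them honestly).

```python
def find_suspect_devices(intended, reported):
    """Compare the rules the controller commanded with what devices report.

    intended: {device: {match: action}} as pushed by the controller.
    reported: {device: {match: action}} as read back from each device;
              a device that did not answer is simply absent.
    Returns a sorted list of devices whose state does not match.
    """
    suspects = []
    for device, rules in intended.items():
        if reported.get(device) != rules:
            suspects.append(device)
    return sorted(suspects)

# Hypothetical example: sw2 never applied its update, sw3 did not answer.
intended = {
    "sw1": {"10.0.0.0/24": "port2"},
    "sw2": {"10.0.0.0/24": "port3"},
    "sw3": {"10.0.0.0/24": "port1"},
}
reported = {
    "sw1": {"10.0.0.0/24": "port2"},
    "sw2": {"10.0.0.0/24": "port9"},
}
suspects = find_suspect_devices(intended, reported)
```

Here `suspects` contains `sw2` and `sw3`, which a controller could then treat as down when computing routes.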

The cutoff problem in SDN has the same kind of risk.  A device could be cut off because an adjacent device killed the route to the controller.  If the device is functional enough to do what it would likely be supposed to do (try other paths) and if there were other paths available, you still might be able to restore contact.

Overall, my feeling is that purist OpenFlow SDN is less at risk from traditional adaptive-behavior-related outages for the obvious reason that it relies on central control.  If the controller is designed properly, hosted reliably, and if the devices are set up to deal with path loss to the controller in a reasonable way, then I think you can say that classic SDN would be more reliable than legacy networks.

NFV is a bit more complicated.  NFV doesn’t aim at changing network control-plane behavior, so if you hosted VNFs that did switching and routing via NFV you’d simply substitute a software version of a device for an appliance version.  All the adaptive risks would be the same.  If you hosted SDN VNFs and centrally controlled them, you’d now have the SDN risks.  Where NFV is different is first in the issue of node reliability and second in the management plane.

Servers, data center switches, and intra-VNF paths in an aggregate configuration make NFV more complex and likely generate a lower MTBF than you’d have with a ruggedized appliance.  NFV could potentially have an improved MTTR because you could fail over components, but you’d see an outage in most cases.  We also don’t really have much data on how fast service could be restored and how an extensive failure like a data center drop would impact the ability to even find alternative resources.  Thus, it’s hard to say at this point just what NFV will really do to network availability.
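The trade-off between MTBF and MTTR can be made concrete with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR), and the rule that elements in series multiply their availabilities.  The figures used below are invented purely for illustration and are not measurements of any appliance or NFV deployment.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one element: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series(*availabilities):
    """Availability of elements that must all work (a chain in series)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Hypothetical figures: one appliance with a high MTBF but a slow repair,
# versus a hosted function that depends on a server, a data center switch,
# and an intra-VNF path, each with a lower MTBF but a fast failover.
appliance = availability(mtbf_hours=50_000, mttr_hours=4)
hosted_element = availability(mtbf_hours=10_000, mttr_hours=0.1)
hosted_chain = series(hosted_element, hosted_element, hosted_element)
```

With these made-up numbers the hosted chain comes out ahead of the appliance, but only because the assumed repair time is so short.  Raise the hosted MTTR to 4 hours and the chain falls well behind, which is why the missing data on restoration time matters so much.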

On the management side it’s even more complicated.  In traditional networks, management and data paths are handled equally, meaning that you have connectivity for both or for neither.  In NFV, the presumption is that at least a big chunk if not all of the management data is carried on a subnetwork separated from the service data paths.  It’s not unlike the SS7 signaling network of the old phone network (which we still have in most of the world).  If we presume that VNFs are isolated to secure them from accidental or malicious hacking from the data plane, we now have a subnet for every VNF and management connections within (and likely to and from) those subnets.  Because NFV depends on better remediation for its availability more than reliable appliance strategies, loss of management integrity could hurt it more.

The net for NFV is that we don’t know.  We have not built an NFV network large enough to be confident we’ve exposed all the issues.  We haven’t identified possible issues fully, and tested them in credible configurations.  I think that it would be possible to build NFV networks that were less susceptible to both the bad-apple and cut-off network problems, but I’m not sure the practices to do that have been accepted and codified.

The net, IMHO, is this.  If we do both SDN and NFV right, we could reduce the kind of outages we’ve seen this week.  If we do them badly, deploying either or both would make things worse.  Since we have far less experience managing SDN and NFV than managing legacy networks, that tells me that we have to be graceful and gradual in our evolution, or we’ll make reporters a lot happier with dramatic stories than we’ll make customers happy with reliable networks.

Why IoT is Probably the Killer App for NFV

One of the ironies of NFV is that its greatest success may be coming from deployments that are actually not NFV at all.  A part of this is due to normal market dynamics; you always try to pick the low apples first.  Another part is due to the scope limitations I’ve blogged about before; holistic benefits demand a holistic solution, and standards-based NFV doesn’t cover enough ground yet.  One interesting question is whether the current dynamic could help or hurt long-term NFV deployment.

Most of the publicized NFV service strategies are based on a CPE-hosted functions model, rather than the cloud-or-virtualization-hosted model that operators first envisioned and that the ISG is working to define.  In the CPE-hosted model, a user is given a premises box that is capable of being loaded with feature software from a central management system.  This box then provides the “virtual functions” on demand, easily updated to new versions, deleted when no longer needed, or augmented when conditions demand.

Operators tell me that the primary reason for this shift is the problem with “first cost” in NFV deployments.  If you presume central hosting of VNFs in NFV you need something to host them on, and unless you want to hairpin traffic a considerable distance you’ll need those hosting points at least proximate to the points of user connection.  For a network operator rolling out an NFV-based service over three or four or thirty or forty metro areas, this means an early commitment to multiple cloud data centers with enough servers to create suitable economy of scale.  That cost will rack up immediately, and the operator will then have to wait while marketing and opportunity combine to create buyers, which is why it’s called “first cost”.

When you use CPE hosting, you deploy a hosting point on the customer premises, and the cost is incurred only when you have compensating revenue.  Costs continue to scale with revenues through the life of the service.  When you use CPE hosting, you also eliminate some of the problems associated with shared infrastructure.  Security is easier to address because you don’t have multiple users sharing servers.  There’s less chance of one user’s service impacting another by having them load the server hosting a component of both users’ services.  Management is easier because there is a real box that contains everything.

If you look a bit deeper into the CPE-first movement, you see it’s a reflection of those shifting value propositions for NFV.  We started saying that shared-server efficiencies would reduce capex, and clearly the CPE-driven approach wouldn’t have shared servers at all.  That means that the justification, the business case, has to be developed by reduction of opex or improvement in the revenue flow.  That’s good news for operators because it suggests that the two NFV benefits that seem now to be the most credible are in fact credible enough to overcome any need for capex reduction at all.

CPE-first deployment really doesn’t need “NFV” at all, it needs only a management system to push software images into the CPE.  We have other successes with “NFV” in IMS and EPC, but these are deployments of multi-tenant assets that are actually likely to look more like cloud computing mediated by NFV management than like the kind of per-user-and-service NFV everyone expects.  The bad news for NFV, then, is that while the benefits are being proven we’re not really validating the full architecture.

If you look at the NFV specifications, you see a significant amount of work put into the details of picking the right place to host stuff, into creating high levels of availability, and so forth.  Tuning, in short.  If NFV is either about CPE-hosted VNFs or largely persistent multi-tenant VNFs, then are these microtunings useful?  We’re not making a resource pool selection in most cases.  That suggests that we may be paying too much attention to optimizing something that’s not currently proving to be important at all.

The big question, of course, is whether we can get to NFV deployment in a more “as-we’ve imagined” sense from what’s actually happening.  It’s a hard question to answer because you can look at the evolution two different ways.

If vendors who support either of the current strategies are capable of delivering centrally hosted NFV and supporting the standards, and if they are not tempted to under-commit to traditional NFV implementation points by the fact that they are making money doing something different, then everything we have going on could grease the skids for NFV progress.  If those vendors focus on the limited needs of their early service successes, then the broader features of NFV may end up becoming at best “options” to be offered later or at an additional cost.  We’d create a pseudo-NFV, or even (as some of my blog readers have suggested recently) a whole bunch of walled garden strategies because we don’t need the NFV standards much at all.

It seems to me that the answer to what happens in NFV evolution is going to come from the way that the NFV business case is made.  The current successes, if you look at them in benefit-harnessing terms, are successful because they address a special case of something.  The CPE-hosted approach, for example, addresses service agility in the context of business users whose “agility” needs focus on connection-point features and whose service value is high.  The IMS/EPC examples address operations improvements in a multi-tenant service set that simplifies the operations changes needed, versus per-user-per-service deployments.

Open, multi-vendor, revolutionary NFV has to be more pervasive than our current applications.  I think we’ve validated the notion that operations efficiencies and service agility can justify something that’s at least NFV-like if not a full NFV deployment.  We have to go the rest of the way, which means that we have to be able to exploit the benefits of service agility and operations efficiency more broadly.  IMHO, this is what all the NFV efforts should be directed at doing.

It’s an “enoughness” problem.  It’s not that we can’t help operations efficiency today, but that we can’t help enough of it in scope terms, and help it enough in terms of cost impact. In theory we could get there by expanding NFV’s scope into operations orchestration and legacy equipment, but I’m not sure we have time for that gradualism.  We need to find a broader trigger for NFV opportunity, one that exposes benefits on a broad scale but can still be implemented at least somewhat gracefully.  That trigger is probably IoT.

Something like IoT, truly and fully modeled as an NFV application and supported with a credible set of products, is the key to NFV’s broad success because such a service set would be a model for broad NFV deployment as well as a source of near-term drivers.  So what we may be proving here is that a truly comprehensive IoT implementation is going to be the thing that gets NFV moving, that keeps it from becoming nothing more than a series of vague specifications to guide specialized deployments.

We have to do more with NFV to get more from it.  We also have to get out of gardens or we’ll inevitably get walled in.

Could ALU’s Recent “NFV Wins” Show a Shift in Focus for the Industry?

Alcatel-Lucent has collected a number of important NFV deals recently.  The company won a pair in China (China Mobile and China Unicom) and also won an expanded deal with European innovator Telefonica.  There seems to be a common element in these deals—voice services and mobility.  I think that says a lot about NFV and how it will deploy.

Up to now, we’ve seen what I’ll call an NFV-directed vision of NFV deployment, meaning that NFV benefits were expected to directly drive a roll-out.  Operators might differ in terms of the specific benefit targeted or their path to realizing it, but in all cases it was an NFV benefit doing the heavy lifting.

I’d suggest that Alcatel-Lucent’s recent success is due in large part to a shift in NFV drivers.  What we’re seeing is actually more a cloud-centric vision of the future, a future where what we’re trying to achieve is cloud hosting of multi-tenant elements or services rather than an NFV-centric deployment of service-specific functional instances.  NFV’s benefit is not in “enabling” this but in making it efficient and agile.

In my view, the original mandate for NFV—capital cost efficiencies—has been largely ineffective at moving the ball.  There’s not enough savings to be had even under ideal conditions, and those ideal conditions could be achieved only if we somehow made the more complex VNF-based services operationally effective enough to be as cheap or cheaper to run as legacy equivalents.  That leaves operations efficiency and service agility as the benefits essential to NFV success.  The point now seems to be that these benefits aren’t conclusively linked to NFV.

There is little credibility to the notion that customers would pay for NFV itself—they’d pay for services that NFV deploys.  Those services, in the form of VNFs, provide what’s sold to users and NFV provides how the sale is realized through deployment and lifecycle management.  A lot of this view has been expressed through past assertions that VNFs were critical in NFV—you need functions and features to build service value.

If there’s little chance customers would buy NFV there is zero credibility to the notion that you can create agility or operations efficiency for just a piece of a service.  We’re going to have legacy elements forever.  We’re going to have services made up of many VNFs, even many of what could be called “images” that are cooperative collections of VNFs assembled to create services quickly.  Because of scope limitations in the ISG from the first, management of legacy elements is out of scope and management of complex services has yet to be conclusively addressed.

The reason these points are relevant is that operators absolutely need relief from their revenue/cost-per-bit margin compression.  That induces them to look for solutions to the new-service and efficient-operations problems.  Alcatel-Lucent offers (in its virtual IMS and Rapport platform) a strong voice and collaboration services framework that builds easily on mobile investment but also introduces a true value proposition for IP voice services for non-mobile applications.  They have, in Motive, an operations framework that can orchestrate and manage services end to end regardless of the mix of legacy and VNF elements.  So what we’re seeing is Alcatel-Lucent exploiting its strengths, and pulling NFV along with it.

You might wonder why that’s true, and here I think we come to an important point.  Hosted service or application features need to have as many automated management and deployment features as you can introduce or they quickly fall into the same margin-compression problem we have for bit services.  Furthermore, it’s clear that in cloud computing as in NFV overall, you can’t make a durable value proposition if you can’t build highly dynamic services that are increasingly directed at people, not at sites.  That means extemporaneous service delivery, which by even the standards of “traditional” standardized NFV could be difficult to achieve.  Alcatel-Lucent knows, as most NFV leaders do, that some superset of ETSI NFV is going to be critical, so they want to control as much of the early opportunity as possible so they have a commanding lead when things get moving.

This isn’t a shift that’s limited to a single vendor and a couple operators.  I’ve blogged all this year about the fact that NFV is taking a more operations-centric tack in the real world, even though that shift is much harder to detect in the progress of current PoCs or in vendor announcements.  The vendors who have the greatest NFV traction (Alcatel-Lucent, HP, and Oracle in alphabetical order) are all vendors who have an exceptionally strong operations story.  An operations story that is in fact outside the scope of official NFV.  Oracle, who in my view is functionally in third place on the list, is nevertheless the vendor growing fastest in recognition by operators, because they’ve made an operations-first strategy their cornerstone.

The shift that Alcatel-Lucent’s recent wins highlight may just be a leading indicator of a face-off on NFV in the near future.  Alcatel-Lucent is the only network vendor of the leaders, and thus Alcatel-Lucent has network/service assets (like Rapport and IMS) that can easily be pushed into near-term opportunities for new services and new revenues.  That’s harder for the other two vendors because they lack the incumbent service infrastructure position.  However, vendors like HP and Oracle could frame a broader story through partnerships.  I think we can see a little of that in HP’s IoT approach, which I think has tremendous potential both as a service element and as a blueprint for handling multi-tenant features deployed through NFV.

This year is critical for the operators because most say that in 2017 their revenue-and-cost lines will cross over for connection services.  That can’t be remedied without some significant shift in infrastructure policy in 2016, and it’s hard to see how an operator could drive that sort of change without having some direct and positive field trial experiences in 2015.  These guys have got to get going, they’re not going to wait for mature standards because they can’t, and they’re going to follow the path of greatest reward and least risk.  That path may lead away from traditional NFV issues into OSS/BSS, service creation, mobility, content delivery, and (yes) IMS.  If it does, then Alcatel-Lucent is going to have a lot of reasons to be happy.

Why SDN and NFV Shouldn’t Force Us to Abandon OSI Layers

In the idealistic vision of the future network (a vision I still hope can be realized), NFV forms an operational- and feature-enhancing umbrella over SDN to create agile services that improve efficiency and add greater value than the basic connection services of today.  This vision would require some significant expansions in scope for NFV; primarily, NFV would have to be given the ability to orchestrate and manage legacy elements of infrastructure.  It’s not a pipe dream that this could happen because we have vendors who already offer this broader scope.

There will always be real network elements, and likely there will always be legacy L2/L3 services as well.  Most of the “SDN” and all of the “NFV” to date are hybrids with legacy technology.  We’ve started to hear about how legacy technology extends NFV hosting of features by connecting the feature subnetworks with the users of the service.  It also creates the underlayment for many of the virtual connectivity features, and this raises the interesting question of whether NFV should be considered to be multi-layered and vertically integrated.  NFV, via SDN, makes vSwitch connections.  Could it also control the real switches underneath?

One reason this is important is that fault management in even today’s services has to contend with the problem of “common cause”.  If a fiber trunk is attacked by the classic “cable-seeking backhoe” then there’s a physical layer outage, a Layer 2 outage, and a Layer 3 outage even if we limit ourselves to classic OSI.  The break could generate a flood of failures at the service level.  If NFV is responsible for remediation of service faults, does NFV have to “know” that fiber trunk outages should be addressed at the fiber level through rerouting and not by re-framing every service over the trunk individually to use a different path?  In a network operations center today, we’d see fault correlation activity to try to prevent this fight-the-symptoms-not-the-disease syndrome.  How would that work in NFV?
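To make the common-cause idea concrete, here is a minimal Python sketch of the kind of correlation a NOC does today.  Everything in it is an assumption for illustration: the service and trunk names, the `SERVICE_TRUNKS` binding table, and the simple "two or more failed services share a trunk" threshold.  Real correlation engines use far richer topology and timing data.

```python
from collections import defaultdict

# Hypothetical service-to-trunk bindings; names are illustrative only.
SERVICE_TRUNKS = {
    "vpn-a": ["trunk-1", "trunk-2"],
    "vpn-b": ["trunk-1"],
    "voice-c": ["trunk-1", "trunk-3"],
    "video-d": ["trunk-4"],
}


def correlate(failed_services, bindings, threshold=2):
    """Group service-level faults by the trunks they share.

    Returns (suspects, unexplained): a dict of trunks carried by at least
    `threshold` failed services, and the failed services that touch no
    suspect trunk and so still need individual attention.
    """
    hits = defaultdict(set)
    for svc in failed_services:
        for trunk in bindings.get(svc, []):
            hits[trunk].add(svc)
    suspects = {t: sorted(s) for t, s in hits.items() if len(s) >= threshold}
    explained = {svc for svcs in suspects.values() for svc in svcs}
    unexplained = sorted(set(failed_services) - explained)
    return suspects, unexplained
```

With this toy data, a report that all four services have failed points at trunk-1 as the likely shared cause for three of them, so remediation would start with one trunk reroute rather than three separate service rebuilds.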

Even at a more mundane level, we have to wonder whether a universally capable and operationally optimizing NFV implementation wouldn’t be used to provision underlying facilities so their operation would be optimized too.  In our hypothetical data center with vSwitches, why wouldn’t we use NFV to provision the physical devices?  Don’t say there aren’t any either; even SDN would demand white-box facilities and SDN depends on having control paths to the switches.

Then there’s multi-tenant.  Suppose we decide to set up a kind of super-IMS using NFV, as many are already proposing and as Alcatel-Lucent is already being contracted to do.  IMS isn’t single-service-per-instance.  You set up an IMS for an operator, not for every call, but how does NFV’s deployment of a multi-tenant resource provide for integrating that resource with other services and applications?  If super-IMS exposed APIs for other services to use, how would those APIs be made available to the other services?

One concept that I think is absolutely critical in addressing these kinds of issues is that all network services have to be framed as network-as-a-service abstractions.  A service at any level is a black box, known by its properties not by its contents (which are invisible).  What follows from that is that the NaaS abstraction has management properties and state which are derived from those invisible contents.  The user of the service “knows” whether the black box has failed, but not how the failure happened.  I think this vision is at least somewhat accepted in both SDN and NFV, but not completely, because we don’t address the notion of layered services even though all services today are layered.

In layered services, a given level (the “retail service” for example) is composited from lower layers.  The user layer is a NaaS, but so are the lower layers.  The retail service would not then have visibility down to the bottom of the stack of infrastructure actually used, but only to the level of the black box combination below.  It would be responsible to remedy faults reported by its own black box and present a fault to its own (retail) user if that isn’t possible.

In fault management terms, this could have profound implications.  If a “retail NaaS” sits on a couple of “component NaaS” services that in turn exercise “transport NaaS” services, then a lower-layer fiber fault causes not a retail fault but a transport NaaS fault.  The lower layer would be given the opportunity to correct the problem, in which case you’d have a report of an interruption but not a failure at the retail level.  If the lower layer (transport) can’t fix things, then the problem would escalate to the “component NaaS” level for remediation, reaching the retail level only if nothing can be done below.
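The escalation order described here can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the `NaaSLayer` class, the fault labels, and the per-layer repair policies are all hypothetical, and a real layer would decide whether it can repair a fault from its own internal state, not from a string match.

```python
class NaaSLayer:
    """A black-box service layer that may or may not repair a fault itself."""

    def __init__(self, name, can_repair):
        self.name = name
        self.can_repair = can_repair  # callable(fault) -> bool; hypothetical policy


def escalate(fault, layers):
    """Offer the fault to each layer in turn, lowest layer first.

    Returns (layer_name, "interruption") if some layer repairs it, or
    (None, "failure") if even the top layer cannot, in which case the
    retail user sees a failure.
    """
    for layer in layers:
        if layer.can_repair(fault):
            return layer.name, "interruption"
    return None, "failure"


# Bottom-up stack with made-up repair policies.
stack = [
    NaaSLayer("transport", lambda f: f == "fiber-cut-with-spare-path"),
    NaaSLayer("component", lambda f: f in {"fiber-cut-no-spare", "host-down"}),
    NaaSLayer("retail", lambda f: False),
]
```

A fiber cut with a spare path is absorbed by the transport layer and reaches the retail level only as an interruption.  A fault nobody below can fix reaches the retail user as a failure.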

Our visions of SDN and NFV are both, by this measure, too vertically integrated.  We are expecting to allocate resources at a primitive level and not through adopting the NaaS services created by lower layers.  One thing that I think the “right” vision of NaaS would have done is make the whole SDN/NFV integration question moot.  NFV does not, ever, exercise SDN.  It exercises NaaS abstractions that can be fulfilled by SDN.  We need not, should not, focus on the implementation because that would mean our NFV principles would violate the principles of layered networking that are the foundation of packet communication of all types today.

In this framework, the absolutely critical element is that black box.  A black box is defined by its properties, as seen from the outside, so what we need to be thinking about is how we describe this in technology terms.  The most logical answer, I’ve suggested in the past, is the notion of a “recipe”.  If you’re making Margarita Shrimp, you have a black-box abstraction in hand.  The recipe name is the name of the abstraction, and the recipe is a procedure that realizes the outcome (produces the dish) when invoked.

We don’t need to name every possible dish to cook, nor does NFV or SDN have to name all its possible abstractions.  We have to be able to assign a “dish name” and provide a recipe for it no matter what it is.  The biggest hole in the notions of SDN and NFV as they are today is that we’re not focused on the notion of either producing or consuming black-box abstractions.  Without that notion we can’t do layers, and we vertically integrate services so that a common low-level fault blows up into an avalanche of service failures before anything really tries to deal with it.
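A "named abstraction with a recipe" could be as simple as a registry that maps a name to a procedure.  The Python sketch below is one hypothetical way to express that.  The recipe names (`ip-subnet`, `service-chain`), the parameters, and the returned dictionaries are invented for illustration and aren't drawn from any SDN or NFV specification.

```python
RECIPES = {}


def recipe(name):
    """Decorator that registers a procedure under an abstraction name."""
    def register(fn):
        RECIPES[name] = fn
        return fn
    return register


def invoke(name, **params):
    """Realize a named abstraction; the caller never sees how it is built."""
    if name not in RECIPES:
        raise KeyError(f"no recipe named {name!r}")
    return RECIPES[name](**params)


@recipe("ip-subnet")
def build_subnet(cidr):
    # A lower-layer abstraction; SDN or legacy gear could fulfill it.
    return {"kind": "ip-subnet", "cidr": cidr}


@recipe("service-chain")
def build_chain(functions, cidr):
    # A higher-layer recipe consumes the lower abstraction by name only.
    return {
        "kind": "service-chain",
        "network": invoke("ip-subnet", cidr=cidr),
        "functions": list(functions),
    }
```

The point of the design is that `build_chain` asks for "ip-subnet" by name and has no idea what fulfills it, which is the layering property the black-box model is meant to preserve.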

To me, the lessons of our layered past dictate we have a layered future.  That means that we have to think about the basic principles of isolation, and adhere to them on one hand while making sure that we don’t constrain future services by depending on fixed service models based on older technologies.  The way to accomplish that is simple.  Named abstractions with recipes.  If that concept can be brought successfully into both SDN and NFV, we’ll take a giant step toward saving ourselves from a lot of problems—including those scalability issues I blogged about yesterday.

Can We Scale SDN and NFV?

Over the holiday weekend I got an email from a network operator friend who offered a comment on the state of SDN and NFV.  The point was simple; it’s not completely accurate to say that the PoCs and trials so far have validated either SDN or NFV functionality.  The reason isn’t that these efforts have left functions out, but that they’ve not addressed operation at scale.

SDN provably works in a data center.  There are a few NFV models that could in theory deploy in a kind of minimalist way (service chaining using edge hosting, for example) but nobody thinks SDN or NFV can survive on those models alone.  We have to make both SDN and NFV work in massive deployments.  How?

The first step, on which my operator friend and I agree, is to formalize the process of “domains” and “federation”.  If we think of SDN as being a domain of switches under a controller, we’ve defined an SDN domain.  If we think of NFV as being a collection of NFVI under a single MANO instance, we’ve defined an NFV domain.  The point is that we know that there will be multi-domain networks built using both technologies, so we need to understand how services can cross domain boundaries.

The current mechanisms for SDN federation rely mostly on legacy protocol features—BGP is an example.  This isn’t a long term approach to the issue because it doesn’t let services be built across multiple domains using the full set of SDN features.  In SDN it seems obvious that we need to have a control hierarchy, and/or define an “interdomain” protocol set that would let controllers cooperate to establish services across a connected set of domains.

NFV has no real federation capability as yet, though the NFV ISG has deliberated the question of interconnection of domains.  The sense I get from operators is that they are seeing cooperation as being primarily an exchange of resources rather than cooperative functionality.  An operator might deploy on NFVI provided by a partner, for example, but they’d use their own MANO.  That doesn’t seem to be a good long-term approach either, in no small part because of the next point.

Which is that the limits of domain size and performance have to be established.  How many switches can an SDN controller manage, and how many service creations and changes can be managed/orchestrated by NFV MANO?  It’s possible to extrapolate based on controllable events like service setups just how many events a controller or MANO instance could support.  The problem is that it’s difficult in the context of limited trials to understand how many events might be created under a full set of real-world network conditions.

Suppose we have ten thousand customers with service-chained cloud-hosted components.  We’re using NFV to orchestrate the hosting and SDN to make connections.  Even ten thousand customers probably generate only a few moves, adds, and changes in a day so this isn’t much of a problem for either technology.  Now suppose that we have a major trunk or data center failure.  We have a good number of those ten thousand service customers looking for automatic remediation.  Few would believe that a single instance of an SDN or NFV controller could absorb that load, particularly when the users would be scattered about a metro area or larger.

You can’t have multiple co-equal controller instances trying to allocate the same pool of resources.   How does Guy A know that Guy B took capacity on a given trunk?  What happens if we had conflicting assignments of capacity?  There has to be some higher-level process that “knows” that when you have a massive failure you have to start by trying to replace the massive facilities that failed, and then move upward for remediation that will ultimately reach the service level.  What process is that in either SDN or NFV?

Even if we have controllable, federated, domains we still have to be able to engineer and test the combination without creating a national-scale communications disaster to prove we can handle one.  That means that both SDN and NFV domains have to be engineered to appear as true black-box functional elements so that domain management doesn’t have to dig into the details of what’s out there to understand how to hand off to it or replace it.  It also means that element behavior has to be designed to meet domain behavior standards.  We’re talking a lot about SDN or NFV performance, but we don’t really have a standard against which we can measure it.  We don’t know what a domain has to do, and thus don’t know what elements of that domain have to do to make the domain work.

Part of this engineering is dealing with the impact of various technology options and proposed specification elements on “domainability” of the whole.  For example, we know that SDN has two modes of route control.  One says the controller simply lays out routes based on a central topology understanding and analytics on device behavior.  The other says that when a packet is presented at a switch with new forwarding needs, the switch asks the controller for handling.  I think everyone understands that there are strengths and limitations to both these modes, but do we know what they are to the point where we could size controllers and domains?  I doubt it.
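The two route-control modes differ mainly in when the controller gets involved, and a toy model makes that visible.  The Python sketch below is an assumption-laden illustration, not OpenFlow: the `Controller` and `Switch` classes, the route table, and the query counter are all hypothetical.

```python
class Controller:
    """Holds the central view of which output port reaches each destination."""

    def __init__(self, routes):
        self.routes = dict(routes)
        self.queries = 0  # number of times a switch had to ask for handling

    def lookup(self, dst):
        self.queries += 1
        return self.routes.get(dst)


class Switch:
    def __init__(self, controller):
        self.controller = controller
        self.table = {}

    def preload(self):
        """Proactive mode: the controller lays out all routes in advance."""
        self.table.update(self.controller.routes)

    def forward(self, dst):
        """Reactive mode shows up as a table miss: ask the controller, then cache."""
        if dst not in self.table:
            port = self.controller.lookup(dst)
            if port is None:
                return None  # no route known; the packet is dropped
            self.table[dst] = port
        return self.table[dst]
```

Counting `queries` under a realistic traffic mix is the sort of measurement that would be needed to size a controller.  The preloaded switch never asks, while the reactive one asks once per new destination, and that load can spike when a failure invalidates many table entries at once.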

Part of the problem goes back to federation.  We know based on cloud computing and OpenStack experience that there’s a limit to the size of a domain that a given instance of OpenStack can support.  Some of the elements are single-thread, and it’s hard to see how you avoid that when (as I noted above) resource grants have to be coordinated to make sure several control points don’t grab the same thing.  How does that serialization happen, and what are the performance implications for the mechanism we’ve selected?
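One simple form of that serialization is a single arbiter that hands out capacity under a lock.  The Python sketch below shows the idea in a single process.  The `TrunkCapacity` class, the trunk name, and the capacity numbers are hypothetical, and a real multi-domain system would need a distributed equivalent with its own performance costs.

```python
import threading


class TrunkCapacity:
    """Single arbiter for capacity on shared trunks; grants are serialized."""

    def __init__(self, capacity_by_trunk):
        self._free = dict(capacity_by_trunk)
        self._lock = threading.Lock()

    def grant(self, trunk, amount):
        """Atomically reserve `amount` on `trunk`; return True if granted."""
        with self._lock:
            if self._free.get(trunk, 0) >= amount:
                self._free[trunk] -= amount
                return True
            return False

    def free(self, trunk):
        """Report the capacity still unreserved on `trunk`."""
        with self._lock:
            return self._free.get(trunk, 0)


def run_controllers(pool, trunk, amount, count):
    """Simulate `count` controllers requesting capacity at the same time."""
    results = []

    def worker():
        results.append(pool.grant(trunk, amount))

    threads = [threading.Thread(target=worker) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The lock guarantees that two controllers can never both be granted the same capacity, but it also means every grant waits its turn.  That queueing is the performance cost that would have to be measured under a failure-driven burst of requests.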

My operator friend is right; we’ve not really dug into at-scale SDN or NFV as strongly as we should have, and as a result most prospective users (and some actual users) don’t understand what might happen if an event creates a flood of changes.  I’ve had plenty of experiences in the networking industry where a small situation caused a cascade of problems that swamped the whole of a network.  The worst example I ever saw of an enterprise network failure, one that impacted almost sixty locations and over ten thousand workers in a mission-critical field, started with a link error.  The escalation to disaster wasn’t caused because the error spread, but by the fact that the remediation overloaded critical management elements way past the point of surviving or failing gracefully.  Neither SDN nor NFV can afford that, and while I think both technologies can be made to scale, federate, and survive, I don’t think we’re as close to being able to do that as we should be.

Are Cisco’s “Six Pillars of IoT” a Strategy or a Placeholder?

I guess that like many out there, I regard Cisco announcements with a mixture of interest and cynicism.  I remember well the times when Cisco would announce a “five-phase strategy” that was (whatever the technology focus) always something that they were already in Phase Two of and never something that was actually delivered in “Phase Five” at the end.  In effect, it was a placeholder for a real Cisco push, one that would develop when Cisco was sure the market was ready.  Cisco is a “fast follower”, remember?

Cisco seems to have outgrown that five-phase approach, but I admit that when I saw their latest IoT announcement had six pillars, I was taken back to the good old days in a different wrapper.  The obvious question is whether that cynicism is warranted, and the answer in my opinion is “Yes”, at least at some level.  That means we have to look a bit deeper.  For those who want a specific reference to the Cisco announcement to follow my points, their press release is HERE.

The six pillars Cisco identifies are network connectivity, fog computing, security, data analytics, management and automation, and application enablement platform.  At a very high level this is a fair statement of IoT needs, but hardly one that’s insightful.  I’ve had some discussions with network operators and they have some fairly specific views of how IoT has to be done.  You can fit the operator visions into Cisco’s model as long as you stay vague.

Operators start with the notion that IoT has to partition sensor, control, and application elements in private networks.  The stability, security, privacy, and governance risks of “unbridled IoT” are so profound that nobody in their right mind would accept them.  If you start IoT with the presumption that things that talk to you about conditions (sensors), things that command actions or processes (control elements), and things that can convert one into the other are partitioned and exchange information only in a very controlled way, you’re covering the basic risks.

Their second point is that to applications, IoT is not a network of devices at all, but a hosted repository of “big data”.  You cannot make IoT work if you assume that applications just cast about in the world for stuff to talk to (or listen to).  You have to structure information so that applications can find and make use of stuff without threatening the underlying elements in any way (deliberate or by error).

Their third point is that IoT is most likely to be useful when it’s seen as a tool in exploiting mobility-generated opportunities.  Sensors and controllers out there in the world make sense if you’re out there too, not sitting at a desk somewhere.  That said, it’s not necessary for IoT and mobile services to be considered a single problem.  LTE connection of a sensor or controller is just a question of finding the most convenient and cost-effective approach, not a question of creating an enforced unity of implementation.

The final point is that even if IoT is pervasive and important, it can’t be a service silo.  Whatever architecture defines support for IoT has to be generalized enough to be applied elsewhere, and tools from elsewhere have to make up as much of the IoT framework as possible.

You can see even at a glance that it’s possible to map Cisco’s approach to these four points.  The challenge is that the mapping isn’t convincing because Cisco hasn’t supplied a lot of detail to go with the basic notion of pillars.  This is where I think Cisco is falling back to “followership” in their approach, even though they’ve made the notion of IoT (as Internet of Everything) into a marketing slogan.

At a high level, it looks like IoT to Cisco is a combination of some new-age networking concepts and “fog computing”, which is Cisco’s name for deployment of hosted service elements at or close to the edge.  That seems to me to be a given, so it’s not telling me much.  The big question is how the applications (in the “fog”) and the sensors and controllers (in some sort of network) are linked.  Increasingly, operators I’ve talked with make the point that indiscriminate deployment of IoT elements raises way too many regulatory issues, and they’re rapidly settling on an IoT-as-a-database model.  Cisco’s “data analytics” isn’t defined that way in their announcement, but it could be positioned like that.

Cisco’s 15 IoT product announcements don’t provide a lot of clarity either.  Most of them are related to what I’d call simple issues—connecting stuff and managing stuff.  The notion of “fog data services” again skirts the edge of IoT reality, but it’s tied to a “data-in-motion” model that to me seems to suggest that IoT is a set of data flows and not a database.

What I’m wondering is whether “six pillars” is a Cisco solution to something that they see as having firm demand but vague deployment models, where “five stages” were to address something vague in a demand sense.  I personally think that large-scale rollout of IoT is more likely to happen through network operators than through any other source, but Cisco may think there are other early models that might make up in timing what they lack in convincing scale.  It’s in a way similar to the situation with NFV, which exposes “can it work” issues before it exposes “can it pay off” issues.

I think the risk to Cisco here is that they’ve grabbed apples that are a bit too low, IoT-wise.  My experience with operators shows that there are real early opportunities for a realistic IoT model based on the points I described.  I’m looking at a couple now, in fact, that map to those points almost exactly.  And Cisco’s not the only player talking in this space.  HP did an IoT announcement at MWC and followed up with more detail in their Discover event in June, and the HP approach seems to map to the operator points pretty directly.  As a major cloud and NFV player, HP is certainly in a position to get their story out there, and their story included using their IoT architecture as a framework for other services, including virtually any form of contextual service to mobile users.

Another risk is created by Cisco’s dance around NFV support.  Most operators think that applications using IoT data would be deployed and sustained through NFV, but that’s not part of Cisco’s six pillars.  They talk about management, but in the general sense that’s appropriate to the “I-don’t-know-the-buyer” position I suggested they might have.  While an NFV tie here would make their IoT story palatable to operators they may feel the specificity would turn off other possible early adopters.

Which raises an Oracle risk.  Oracle was interviewed in a New IP story on NFV that opened with the comment that NFV hadn’t progressed to make the business case for its own deployment, and that more operations centricity was the right path to address that.  It’s been clear for a while that Oracle’s own NFV approach is OSS/BSS-driven and analytics-driven.  It’s a small step from analytics that deliver management data to analytics that deliver IoT data.

It’s smart for Cisco to spread its marketing wings wide in a world that’s changing as rapidly as networking and IT are likely to be changing.  Their fast-follower approach has also worked pretty well for them over time.  The risk, though, is that any kind of follower can wait too long to start, and watch others cross the finish line well ahead.  Whether that will happen here depends on the pace of the market, which means the collective pace of Cisco’s competitors.  That’s a risk, no question about it, to Cisco.

Some Early M&A Signals on the Impact of Virtualization in Networking

SDN and NFV, meaning “network virtualization”, are obviously going to have a significant impact on networking overall, even parts of networking that might not seem to be obvious targets.  We’ve had some announcements and M&A that illustrate this, and that offer us a chance to think about just how profoundly network virtualization could change things.

One of the most interesting M&A moves was Cisco’s announcement that they’re acquiring OpenDNS.  Many of us are familiar with OpenDNS as an alternative provider of DNS services.  While it’s not widely known, many of the “Internet problems” users experience aren’t their ISP’s network but their ISP’s DNS.  The default behavior for most Internet clients is to obtain a DNS address from the provider, and that will almost always be the provider’s own DNS.  If it’s overloaded or down, you’re in trouble.  OpenDNS and Google DNS are alternatives that will nearly always work better for you.

That’s not why Cisco bought them, of course.  While most people know OpenDNS for…well, obviously, DNS services, they got into security services starting three or four years ago, and it’s security that Cisco is most interested in.  Given that Cisco has a pretty thriving security business you might wonder why, and I think that SDN and NFV are a part of the mix.

The big problem with Cisco’s security strategy, and almost everyone else’s, is that it depends on devices or functions that become a part of the network.  In an age of virtualization, it’s harder for this approach to work, not because you can’t put functions into a virtual network but because you can put anyone’s functions there.  The security function/appliance space is going to get very crowded, competitive, and commoditized.

OpenDNS is almost an analytic view of security, derived from understanding Internet addressing and activity. It’s holistic, it’s outside the traditional “network” of a user, and it’s an asset that would be much harder for a competitor to commoditize.  It also works under nearly all of the foreseeable virtualized network models, even models that use SDN to segment networks into application or service-specific pieces (it’s not as useful in that case, IMHO, but it could still add value).

Perhaps the most interesting thing about OpenDNS’ approach is that it would in theory be possible to link the data that OpenDNS provides (via convenient APIs) with remediation software that might involve controlling legacy Cisco gear or even an SDN controller.  If OpenDNS tools detected a DDoS attack it would be able to quench it, at least at a point close to the site being attacked.  If the capability to quench was offered by operators as a service, it’s possible you could quench close to the source.

It’s also possible to use DNS tools to back-check IP addresses that are contacted by malware or to check source IP addresses of intruders.  It’s not a normal DNS function, but if you assume that an access device has the ability to validate “new” incoming IP addresses or ones emerging from apps, it could reduce intrusions and keep Trojans from calling home.

You also have to wonder whether Cisco might have its eye on other DNS-based services that would be impacted by network virtualization.  Load balancing is essential in NFV if you’re going to have failover or scaling of VNFs, and we know from the Metaswitch Project Clearwater example that you can do the job with a modified DNS.

Of course, all of this might be idle speculation.  Cisco has bought a lot of companies that could have presented great strategic stories but nothing came of them.  We’ll have to track the developments, and in particular how Cisco positions the security APIs, to get an idea.

The other interesting announcement is, in comparative industry terms, “deeper” because it involves network monitoring.  NetScout is “combining” its monitoring business with Danaher’s communications business, which includes Tektronix Communications, Arbor Networks, and Fluke Networks.  Tektronix Communications has a broad portfolio of carrier-oriented stuff including some monitoring.  Fluke Networks has monitoring products, and Arbor Networks is primarily a security company.  The combination of these companies would create the biggest monitoring player by far and bring in related network technologies too.

Network monitoring hasn’t been exactly a hot sector, in no small part because most traditional monitoring tasks are simply too difficult for users to undertake even without the complication of virtualization.  The advent of things like the cloud and SDN and NFV has generally caught the monitoring community unawares.  The question I used to get about monitoring in the virtual age was “what do the protocols look like?”, indicating that people thought all they had to do was understand the format of a new protocol or two.

SDN and NFV have a profound impact on monitoring.  You almost have to think of the future in terms of “virtual probes” because everything in SDN and NFV moves around, and you don’t want to hairpin through physical probe points.  I proposed the notion of “Monitoring as a Service” in the CloudNFV work in 2013, but nothing came of the effort.

MaaS was based on the idea that if you have virtualization in place on a large scale, you can deploy monitoring virtually and avoid the hairpinning.  You could also establish specific probe points where you’d equipped your network with either taps or your infrastructure with high-performance hosting so that introducing DPI-based monitoring would be of limited impact.  You could also link in edge elements that had knowledge of packet associations with services or applications, and of course tie in the service-to-resource bindings.

IMHO, there is no way to make traditional monitoring into a viable business for the same reason that security can’t keep on in the same old way.  SDN and NFV change the game too much, and without a strategy to incorporate those changes into products, the NetScout/Danaher combination is simply consolidation.

We’ve not seen the end of this.  There are going to be massive changes down the line, starting as early as next year, if SDN and NFV build as much momentum as they could.  These two industry events prove that the big and small, “shallow” in technical terms and “deep”, are all going to have to face a virtual future unless both our revolutions stall for lack of support.  Defensive vendors may hope for that outcome, but opportunity is its own reward and some vendors will surely take the aggressive track, leading the industry with them.

Can NFV and SDN Standards Learn from the Market?

I’ve commented in a number of my blogs that the standards processes for both SDN and NFV have failed to address their respective issues to full effect.  The result is that we’re not thinking about either technology in the optimum way, and are at risk of under-shooting the opportunities both could generate.  There are some lessons out there in the market on what the right way might be, too.

One of the most interesting aspects of this swing-and-miss is that the problem may well come simply from the fact that we have “standards processes for both SDN and NFV.”  There’s more symbiosis between the two than many think, and it may be true that neither can succeed without the other.  There’s some evidence of this basic truth in, of all things, the way that OTT giants Amazon and Google have approached the problem.

Amazon and Google are both cloud providers, and both have the challenge of building applications for customers in a protected multi-tenant way.  That sounds a lot like the control and feature hosting frameworks of SDN and NFV.  When the two cloud giants conceptualized their model, their first step was to visualize components of applications running in a virtual network, which they presumed would be an IP subnet based on an RFC 1918 address space.

RFC 1918 is a standard that sets aside some IP addresses for “private” or internal use.  These address blocks (in the older classful terms, one Class A network, 16 Class B networks, and 256 Class C networks; in CIDR terms 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16) are not routed on public networks, and so can’t be accessed from the outside except through NAT.  The presumption of both Amazon and Google is that you build complexes of components into apps or services within a private address space and expose (via NAT) only the addresses that should be accessible from the outside.
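To make the address arithmetic concrete, here is a minimal Python sketch using the standard `ipaddress` module.  The three blocks are the ones RFC 1918 reserves; the helper names (`is_rfc1918`, `classful_networks`) are my own, chosen just for illustration.

```python
import ipaddress

# The three private blocks reserved by RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),      # one Class A network
    ipaddress.ip_network("172.16.0.0/12"),   # 16 contiguous Class B networks
    ipaddress.ip_network("192.168.0.0/16"),  # 256 contiguous Class C networks
]


def is_rfc1918(addr: str) -> bool:
    """Return True if addr falls inside one of the RFC 1918 private blocks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in block for block in RFC1918_BLOCKS)


def classful_networks(block: ipaddress.IPv4Network, classful_prefix: int) -> int:
    """Count how many classful networks (/8, /16 or /24) fit in a block."""
    return 2 ** (classful_prefix - block.prefixlen)


if __name__ == "__main__":
    for addr in ("192.168.1.10", "172.32.0.1", "8.8.8.8"):
        print(addr, "private" if is_rfc1918(addr) else "public")
    counts = [classful_networks(b, p) for b, p in zip(RFC1918_BLOCKS, (8, 16, 24))]
    print(counts)
```

Anything for which `is_rfc1918` returns true is an address a component could use inside its private subnet; only the few endpoints you deliberately map through NAT would ever be visible outside.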

Logically this should have been done for management/control in both SDN and NFV, and NFV in particular should have established this private virtual network model for the hosting of VNFs.  The notion of “forwarding graphs” that’s crept into NFV is IMHO an unnecessary distraction from a basic truth that major cloud vendors have accepted from the first.

OpenStack, which most NFV implementations use, has also accepted this model in a general sense; cloud applications are normally deployed on a subnet (via Neutron) and exposed through a gateway.  Within such a subnet or private virtual network, application components communicate as they like.  You could still provision tunnels between components where the relationship between elements was determined by provisioning rather than by how the elements functioned, of course, but in most cases complex topologies would be created not by defining them but by how the components of an application/service naturally interrelated.

A service/application model like the private virtual network model of Amazon and Google could provide a framework in which security, compliance, and management issues could be considered more effectively.  When you create a “VNF” and “host” it, it has to be addressable, and how that happens will both set your risk profile and expose your connectivity requirements.  For example, if you expect a VNFM component of a VNF to access resource information about its own platforms, you’d have to cross over the NAT boundary with that data—twice perhaps if you assume that your resources were all protected in private address spaces too.  This is exactly the situation Google describes in detail in its Andromeda network virtualization approach.

Another lesson to be learned is the strategy for resource independence.  Amazon and Google abstract infrastructure through a control layer, so that hosting and connection resources appear homogeneous.  A collection of resources with a control agent to manage the abstraction-to-reality connection is the way that new resources get presented to the cloud.  NFV doesn’t quite do that.

In NFV, we have four issues with resource abstraction:

  1. A Virtual Infrastructure Manager (VIM) is only now evolving into a general “Infrastructure Manager” model that admits anything into the orchestration/management mix, not just virtualized stuff. Everyone in the operator space has long realized that you need to mix virtual and real resources in deployments, so that generalization is critical.
  2. In the ETSI model, a VIM/IM is a part of MANO, when logically the VIM/IM and the NFVI components it represents should be a combined plug-and-play unit. Anyone who offers NFVI should be able to pair their offering with an IM that supports some set of abstractions, and the result should be equivalent to anything else that offers those abstractions.
  3. You can’t have resource abstraction without a specific definition of abstractions you expect to support. If a given “offering” has hosting capability, then it has to define some specific virtual-host abstraction and map that to VMs or containers or bare metal as appropriate.  We should have a set of required abstractions at this point, and we don’t.
  4. You can’t manage below an abstraction unless you manage through that abstraction, meaning that abstraction management is decomposed into and composed from management of what’s underneath, what’s realized. Unless you want to assume that actual resources are forever opaque to NFV management and that the pool of resources is managed independently without service correlation, you need to exercise management of realized abstractions through the element that realizes them, the IM.  The current ETSI model doesn’t make that clear.
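The four points above can be sketched in a few lines of Python.  This is only an illustration of the idea, not anything taken from the ETSI documents: the `InfrastructureManager` interface, the `VirtualHost` abstraction, and the two toy IMs are all hypothetical names of my own.  The point is that the orchestration side deploys and manages through the abstraction, and each IM maps it to whatever is really underneath.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class VirtualHost:
    """The abstraction orchestration sees; the realization stays below the IM."""
    name: str
    realized_as: str   # e.g. "vm", "container", "bare-metal"
    resource_id: str   # identifier meaningful only to the IM that issued it


class InfrastructureManager(ABC):
    """Hypothetical IM interface, paired with one NFVI offering."""

    @abstractmethod
    def deploy_host(self, name: str) -> VirtualHost: ...

    @abstractmethod
    def host_status(self, host: VirtualHost) -> str: ...


class VmIM(InfrastructureManager):
    """Toy IM that realizes virtual hosts as VMs."""

    def __init__(self):
        self._vms = {}

    def deploy_host(self, name):
        vm_id = f"vm-{len(self._vms) + 1}"
        self._vms[vm_id] = "running"
        return VirtualHost(name, "vm", vm_id)

    def host_status(self, host):
        return "up" if self._vms.get(host.resource_id) == "running" else "down"


class BareMetalIM(InfrastructureManager):
    """Toy IM that realizes virtual hosts on real servers."""

    def __init__(self, servers):
        self._free = list(servers)
        self._powered = {}

    def deploy_host(self, name):
        server = self._free.pop(0)
        self._powered[server] = True
        return VirtualHost(name, "bare-metal", server)

    def host_status(self, host):
        return "up" if self._powered.get(host.resource_id) else "down"


def deploy_and_check(im: InfrastructureManager, name: str) -> str:
    """Orchestration-side logic: identical whichever IM is plugged in."""
    host = im.deploy_host(name)
    return im.host_status(host)
```

Calling `deploy_and_check(VmIM(), "vnf-a")` and `deploy_and_check(BareMetalIM(["server-1"]), "vnf-b")` runs the same orchestration logic against virtual and real resources, which is the plug-and-play equivalence I'm arguing for.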

Google’s Andromeda, in particular, seems to derive NFV and SDN requirements from a common mission.  What would Andromeda say about SDN, for example?  It’s an abstraction for execution of NaaS.  It’s also a mechanism for building those private virtual networks.

There are some things NFV could learn from other sources, including the TMF.  I’ve long been a fan of the “NGOSS Contract” approach to management, where management of events is directed to processes through the intermediation of the service contract.  That should have been a fundamental principle for virtualization management from the first.  The TMF also has a solution to how to define the myriad of service characteristics without creating an explosion of variables that threatens to bloat all of the parameter files.  IMHO, the ETSI work is at risk of that right now.
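As a rough illustration of contract-mediated event handling, here is a small Python sketch.  It assumes the contract carries a state/event table that names the process to run; that table layout, the event names, and the handler functions are my own assumptions for the example, not something the TMF material prescribes.

```python
def scale_out(contract, event):
    """Hypothetical lifecycle process: add one instance."""
    contract["instances"] += 1
    return "scaled-out"


def redeploy(contract, event):
    """Hypothetical lifecycle process: start recovery after a failure."""
    contract["state"] = "recovering"
    return "redeploying"


def make_contract():
    """Build an example service contract holding state and the event-to-process map."""
    return {
        "service": "example-vnf-service",
        "state": "active",
        "instances": 1,
        "processes": {
            ("active", "overload"): scale_out,
            ("active", "host-failure"): redeploy,
        },
    }


def handle_event(contract, event):
    """Steer an event to a process by looking it up in the contract itself."""
    process = contract["processes"].get((contract["state"], event))
    if process is None:
        return "ignored"   # the contract defines no process for this state/event pair
    return process(contract, event)
```

The event handler knows nothing about scaling or redeployment; the contract decides which process sees which event, and in which service state.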

For quite a while the TMF data model (SID) has supported the use of “Characteristics” which means a dynamic run-time assignment of variables, a dynamic attribute pattern.  It should be possible, using the TMF approach, to define resource- or service-specific variables and pass them along without making specific by-name accommodation.  What’s required is consistency in production and consumption of the variables, which is needed in any case.
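A dynamic attribute pattern of this kind can be sketched as follows.  This is a minimal illustration and not the SID model itself: the `Characteristic` and `ServiceEntity` classes and the variable names in the usage lines are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Characteristic:
    """A run-time name/value pair attached to an entity."""
    name: str
    value: Any
    unit: str = ""


@dataclass
class ServiceEntity:
    """An entity whose variables are assigned at run time, not fixed in a schema."""
    name: str
    characteristics: Dict[str, Characteristic] = field(default_factory=dict)

    def set_characteristic(self, name: str, value: Any, unit: str = "") -> None:
        self.characteristics[name] = Characteristic(name, value, unit)

    def get_characteristic(self, name: str, default: Any = None) -> Any:
        found = self.characteristics.get(name)
        return default if found is None else found.value


# A producer attaches resource-specific variables without any schema change...
firewall = ServiceEntity("example-firewall")
firewall.set_characteristic("max_sessions", 50000)
firewall.set_characteristic("throughput", 2, unit="Gbps")

# ...and a consumer that knows the agreed name reads it; everything in
# between just passes the entity along without naming the variable.
limit = firewall.get_characteristic("max_sessions")
```

Nothing between the producer and the consumer needs a by-name field for `max_sessions`; the only requirement is that the two ends agree on the name and its meaning.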

I think there are plenty of operators who agree these capabilities are needed, at least in some form.  I don’t think they’re going to get them from the NFV ISG, because standards groups in any form are dominated by vendors; there are more vendors, and vendors have more money to spend.  They’re not going to get them from the OPNFV effort either, because open-source projects are dominated by contributors (who are most likely vendors for all the reasons just cited).

The New IP Agency might help here by shining a light on the difference between where we are in NFV and SDN and where we need to be.  Most likely, as I’ve said before, vendors will end up driving beneficial changes as well as being barriers.  Some vendors, particularly computer vendors, have nothing to lose in a transition to virtual technologies in networking.  While the incumbent equipment vendors surely do, they can’t sit on their hands if there are other vendors who are happy (for opportunistic reasons) to push things forward.

In some way or another, all these points are going to have to be raised.  They should have been considered earlier, of course, and the price of that omission is that some of the current stuff is sub-optimal to the point where it may have to be modified to be useful.  I think standards groups globally should explore the lessons of SDN and NFV and recognize that we have to write standards in a different way if we’re to gain any benefit in a reasonable time.