Approaching the SDN/NFV End-Game

OK, I admit to liking old songs and poetry, so you’ll probably not be surprised if I quote a song title: “How deep is the ocean, how high is the sky?”  I don’t propose to blog on oceans or skies here, but depth and breadth are interesting questions to pose about SDN and NFV.  We might need to ask ourselves another seemingly trivial question for both technologies, which is “What does the SDN/NFV end-game look like, and how do we get there?”

We have about a trillion dollars in network assets out there, about a fifth of which is depreciated in a given year.  Operators’ capital budgets are running a bit lower than that, so at the moment we’re gradually drawing down the “installed base,” but that base still carries a lot of inertia.  At the same time, the capital budget is being reinvested in the same technology, so the inertia isn’t decreasing significantly.  SDN and NFV have to overcome that inertia.
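
To put rough numbers on that inertia, here’s a minimal back-of-the-envelope sketch in Python.  The figures (a trillion-dollar base, a fifth depreciating each year, a capital budget a bit below that, all of it reinvested in legacy gear) are the illustrative ones above, not operator data.

```python
# Back-of-the-envelope sketch of installed-base inertia.
# All figures are illustrative assumptions, not operator data.

installed_base = 1_000_000_000_000   # ~$1T of network assets in service
depreciation_rate = 0.20             # roughly a fifth depreciates each year
annual_capex = 180_000_000_000       # capital budget running a bit below depreciation
legacy_share_of_capex = 1.0          # today: capex reinvested in the same technology

for year in range(1, 6):
    retired = installed_base * depreciation_rate
    reinvested_legacy = annual_capex * legacy_share_of_capex
    installed_base = installed_base - retired + reinvested_legacy
    print(f"Year {year}: legacy installed base ~${installed_base / 1e9:,.0f}B")

# Even after five years the legacy base has barely moved; shrink
# legacy_share_of_capex and the drawdown accelerates sharply.
```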

If we just use the rough numbers I presented in the last ‘graph, then you can easily see the issue and perhaps also see the path to change.  It’s unlikely that we’d achieve major SDN or NFV success if operators keep buying the legacy gear.  That means that SDN and NFV have to be presented in an evolutionary posture, as something you can migrate to gracefully.  At the same time, though, operators aren’t really interested in quiet evolution—at least not in the long term.  Technology changes present risks that can be justified only by significant upside.

SDN has at one level accepted and addressed the need to balance evolution and revolution.  You can control many legacy switches using OpenFlow.  That lets users invest in SDN at the controller and management level and apply that investment to current network devices.  As those devices age, they can be replaced with white-box gear that uses SDN and only SDN.  At least that’s the theory.  The practice has so far been somewhat stalled in the data center, where the impact on cost and revenue is limited.

For NFV, it’s been a harder row to hoe.  While you can argue capex savings for NFV on an incremental deployment, the fact is that NFV is more complicated than point-of-service devices would be for the same features.  That means that operations efficiency has to be better, at least to the point that the incremental complexity is covered.  And remember that most operators don’t believe in capex as a driving NFV benefit.  Service agility and operations efficiency, the benefits du jour, both appear difficult to attain unless you address a service from end to end and top to bottom.  How do you square that with the need to contain early cost and risk?

All migrations are justified by the destination.  A couple million African wildebeest don’t swing south toward the Mara River and face the crocs to starve in a different place.  We’d probably have an easier time postulating the migration strategy for SDN/NFV if we knew what a full deployment would be like.

What does an “SDN network” look like?  Obviously it can’t be an Ethernet or router network that adds in some OpenFlow control; that doesn’t move the cost or benefit ball much.  You could in fact build what looked like IP or Ethernet services using SDN.  You could also build services that looked the same at a service demarcation but were created very differently.  Application- and service-specific subnetworks, added to an enhanced virtual router at the customer edge, could frame services in a totally different way and revolutionize networking.  One option presents limited migration risk and limited benefit.  The other seems to go the other way.  Which model is best?

NFV poses a similar question.  From the first, the focus of NFV has been on deploying virtual functions that replace physical appliances operating above switching/routing.  So let’s assume we do that.  How much of the network capital budget is associated with that kind of technology?  A bit more than ten percent, according to operators.  Even if we can harness opex and agility benefits, how much of them will be available if we touch only a tenth of the gear?  You can contain NFV to that simple mission, or you can try to address the opex and agility goals even if it means extending what we mean by “NFV” significantly.  Again, one way offers risk management and the other a much better benefit case.  What’s the best approach?

So what are the answers?  I think we can best start with what can’t be the answers.  SDN and NFV for network operators cannot be a pure overlay strategy that rides on current switching/routing infrastructure without much change in that base.  How do you add something on top of the original model and by supplementation make it cheaper?  We have to displace switching and routing on a large scale for there to be large-scale SDN and NFV success, and whether we like that or not (and router vendor employees who read this probably won’t) it’s still the truth.

A second truth is that we are not going to replace current network routers and switches with servers and virtual routers.  Many of those current products are simply too big.  Terabit routers have been a reality for a long time, but we don’t have much experience with terabit servers.  Virtual switches and routers clearly have to play a big role in the future, but not as 1:1 replacements.

We need to look from the blue-sky future into the deep here.  The most obvious of the network technology trends has been that of agile optics and the displacement of traditional core-router aggregation with agile optical cores.  We should expect that this trend will continue, and as it does it joins up with SDN and NFV to create that future vision.

Agile physical-layer technology lets you dumb down switching/routing because it subducts error recovery responsibility from the higher layer.  Furthermore, if you can partition users and services economically at the agile optical layer, you could build business services using pipes and virtual routers/switches.  That opens the SDN opportunity to deploy a simple forwarding tunnel over optics, with no real L2/L3 involved, and use that tunnel with virtual switching/routing on a per-user, per-application, and per-service basis.

NFV can play a role here by deploying those L2/L3 virtual elements.  Absent a connection mission like this, NFV is stuck in higher-layer functionality where it can’t easily change the cost or benefit structure of basic services.  But if we build service and application networks one-off using partitioned L1 technology, we need the higher layers to deploy.  These missions, as I’ve pointed out, are less demanding of the devices hosting the virtual routers because they’re limited in scope to a single user or service.  We’d still need to aggregate traffic, but nowhere near as much device-based switching/routing would be needed.

A lot of virtual routing could be needed.  Every service edge for every business and every consumer would, in this model, have a virtual router that provided the user with the specific tunnel/service-network access they needed.  For VPNs you’d have edge virtual routers and floating internal ones that were placed to optimize traffic flows and resource usage.  It’s a different model of optimization.  Forget finding the best path among a nest of routers; you find the best path and nest routers to fit it.
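
A minimal sketch of that inverted optimization, assuming a hypothetical weighted topology of optical sites: compute the best path first (plain Dijkstra here), then “nest” per-service virtual routers onto the sites along it.

```python
import heapq

# Hypothetical agile-optical topology: site -> {neighbor: link cost}
topology = {
    "edge-A": {"metro-1": 1, "metro-2": 4},
    "metro-1": {"edge-A": 1, "core": 2},
    "metro-2": {"edge-A": 4, "core": 1},
    "core":    {"metro-1": 2, "metro-2": 1, "edge-B": 2},
    "edge-B":  {"core": 2},
}

def best_path(graph, src, dst):
    """Plain Dijkstra: find the least-cost path over the optical topology."""
    dist, prev = {src: 0}, {}
    queue = [(0, src)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr], prev[nbr] = nd, node
                heapq.heappush(queue, (nd, nbr))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return list(reversed(path))

# Find the best path first, then "nest" per-service virtual routers onto it.
path = best_path(topology, "edge-A", "edge-B")
virtual_routers = {site: f"vRouter(service=VPN-42, site={site})" for site in path}
print(" -> ".join(path))      # edge-A -> metro-1 -> core -> edge-B
print(virtual_routers)
```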

The biggest problem we have with all of this isn’t carrier culture.  Vendor resistance to this approach would be even more problematic, because accepting it would mean accepting a radical change to the vendors’ own businesses.  And underneath both the carrier and vendor resistance is human resistance.  We have generations of network mavens who have known nothing but IP or Ethernet.  They simply cannot grasp a different model.

Well, we have to make a choice.  The future of networking will be the same as the present if we insist on building future networks using current principles.  We can’t bring the sky and ocean together without making rain.

Brocade’s Step Toward SDN’s Future: A Good Start

Yesterday, Brocade announced enhancements to its SDN Controller that advanced SDN in an operations sense.  I think these were important; they move SDN toward the architecture it should have been from the first.  They also show us just how far SDN still has to go for it to achieve all it can.

What Brocade has announced is a Flow Manager and a Topology Manager that essentially ride the SDN controller and provide a way of visualizing the structure of SDN switches and the way that consecutive forwarding processes (imposed per-switch as the OpenFlow approach mandates) add up to supporting a flow or route.  The products are highly visual, meaning that you can easily see what’s happening and manipulate connectivity to suit your specific needs.

Way back in the early days of SDN, I hypothesized an architecture of three layers.  On the bottom was the topologizer, an element responsible for determining what the physical device and trunk structure of the network was.  In the middle was what I called SDN central, the critical piece that aligned flows and topology.  The top layer was what I called the “cloudifier”, a layer that would frame SDN’s almost infinitely variable connectivity options into services that could really be consumed.  The “northbound interface” of the SDN controller would logically fit below these layers.
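
For what it’s worth, here’s a minimal Python sketch of how I’ve always pictured those three layers fitting together.  The class and method names are mine, invented for illustration; they’re not Brocade’s or anyone else’s API.

```python
class Topologizer:
    """Bottom layer: discover the physical device/trunk structure (stubbed here)."""
    def discover(self):
        return {"sw1": ["sw2"], "sw2": ["sw1", "sw3"], "sw3": ["sw2"]}

class SDNCentral:
    """Middle layer: align requested flows with the discovered topology."""
    def __init__(self, topologizer):
        self.topology = topologizer.discover()
    def place_flow(self, src, dst):
        # A real implementation would compute a path and push per-switch
        # forwarding entries; here we just record the request against the topology.
        return {"flow": (src, dst), "switches": list(self.topology)}

class Cloudifier:
    """Top layer: frame raw connectivity options into consumable services."""
    def __init__(self, central):
        self.central = central
    def order_service(self, endpoints):
        # A "service" is just the set of flows needed to join its endpoints.
        return [self.central.place_flow(a, b)
                for a, b in zip(endpoints, endpoints[1:])]

service = Cloudifier(SDNCentral(Topologizer())).order_service(["sw1", "sw3"])
print(service)
```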

Brocade has taken an important step by providing us with specific implementations of the bottom two of my layers.  Instead of blowing kisses at a vague notion of a northbound interface, users can see what they are doing.  It’s an important step in operationalizing SDN for sure.  But it’s not a home run, at least not yet.

One specific statement from Brocade’s website sets the stage for the “miles-to-go-before-we-sleep” part of this discussion.  “Software-Defined Networking provides dynamic, end-to-end views and control of data center traffic.”  Obviously SDN isn’t limited to the data center, but it does seem as though Brocade is telling us something important.  SDN has been successful largely within the data center, and so SDN’s longer-term success will likely depend on extending from that base into the rest of the network.  I described such an evolution for network operators in my blog yesterday.  SDN has to make the transition to the network from the data center, for both enterprises and operators, if it’s to be really important.

The notion of flows and topologies that Brocade has introduced could play a role in that.  The first thing you have to do to get SDN out of the data center is to automate operations and management.  If we can use a GUI to drop flow paths where we want them, it is clearly possible to provide a tool that would define the paths based on policy.  I don’t think this would be a big technical issue for Brocade.  The larger problem to address is the management side.

Inside a data center you have unusually high reliability and you can probably stand on a chair and survey your network domain, looking presumably for smoke or some sign of failure.  The number of flows you have to support is also likely to be limited.  Get into the wider network and you need automated service management, and that means you have to be able to associate physical device and trunk conditions with connectivity.  Brocade takes a step in that direction too, because a flow gives them a map of the resource commitments.  If you drew data from MIBs along a flow, you could figure out what the flow state was.  If you had a flow problem you could trace it to the likely fault point.
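
Here’s a hedged sketch of that idea: walk the hops of a flow, pull (stubbed) MIB-style status for each device, and derive a flow state plus a likely fault point.  The data and variable names are invented for illustration.

```python
# Hypothetical per-hop status, as it might be pulled from device MIBs.
flow_hops = [
    {"device": "sw-edge-1", "oper_status": "up",   "util": 0.41},
    {"device": "sw-agg-3",  "oper_status": "up",   "util": 0.97},
    {"device": "sw-core-2", "oper_status": "down", "util": 0.0},
]

def flow_state(hops, util_threshold=0.9):
    """Derive a flow-level state and the likely fault point from hop status."""
    down = [h for h in hops if h["oper_status"] != "up"]
    if down:
        return "failed", down[0]["device"]        # hard fault wins
    hot = [h for h in hops if h["util"] > util_threshold]
    if hot:
        return "degraded", hot[0]["device"]       # congestion, no outage
    return "healthy", None

state, fault_point = flow_state(flow_hops)
print(state, fault_point)   # -> failed sw-core-2
```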

The notion of a flow problem leads to the question of a flow standard, meaning an SLA, and the need to connect flows into groups to make up a logical service.  Brocade still needs to mature a vision of my “cloudifier”, the service layer that converts application/user requests for cohesive network behaviors into a series of flows that can map to the real device topology.

There may be help on the way here.  The ONF is taking up the notion of an intent model for northbound interfaces.  A true intent model would include an SLA, and that implies that an implementation would be able to collect management variables not only for a flow but for a set of related flows, and present them as a composite service state to be compared with the performance guarantees set for the service.  Brocade could implement this, and if they did they’d climb the SDN value chain up to the top.
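
A minimal sketch of what that top layer might do: roll per-flow measurements up into one composite service state and compare it against the SLA in the intent model.  Names, metrics, and thresholds here are hypothetical.

```python
# Hypothetical intent model: an SLA plus the set of flows realizing the service.
service_intent = {
    "name": "vpn-acme-east",
    "sla": {"min_availability": 0.999, "max_latency_ms": 30},
    "flows": [
        {"id": "flow-1", "available": True,  "latency_ms": 12},
        {"id": "flow-2", "available": True,  "latency_ms": 41},
        {"id": "flow-3", "available": False, "latency_ms": None},
    ],
}

def composite_state(intent):
    """Fold per-flow measurements into one service-level state versus the SLA."""
    flows = intent["flows"]
    up = [f for f in flows if f["available"]]
    availability = len(up) / len(flows)          # crude roll-up, for illustration only
    worst_latency = max(f["latency_ms"] for f in up)
    violations = []
    if availability < intent["sla"]["min_availability"]:
        violations.append(f"availability {availability:.3f}")
    if worst_latency > intent["sla"]["max_latency_ms"]:
        violations.append(f"latency {worst_latency}ms")
    return "in-SLA" if not violations else "SLA violation: " + ", ".join(violations)

print(composite_state(service_intent))
```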

A lot of the things needed to achieve SDN’s optimum future are at least partially supported within the SDN controller, at least in the OpenDaylight version that Brocade and many others use.  What’s missing?  Interestingly, it might be the vision of service/resource management and operations that many have talked about for both SDN and NFV in the operator space.

In an SDN age, enterprises are “network operators” at least to the extent that they build SDN overlays on more traditional services.  In the future, if operators themselves expose “cloudifier” types of service interfaces to permit flow composition into services, the enterprises will still have to harmonize those with internal site connectivity.  We already know that the “enterprise network” is the sum of its LANs and WANs.  We may not need all the tools for enterprise SDN that we’d need for operator SDN, but it’s easy to see that we’d want a subset of those tools.  And it’s easier to subset something if you’ve got it to begin with, in a complete and cohesive form.

I think Brocade’s move is demonstrating that “network management” for the enterprise is going to shift just as decisively as service and network operations are shifting for the operators.  I also think that a future SDN network where users compose their connectivity by application and job type is going to demand a complete rethinking of how we know what a network is.  The “I’ll-know-it-when-I-see-it” model isn’t going to work in the virtual world.  Brocade may be working on the replacement.

How SDN Could Jump Over NFV in Deployment

SDN came along well before NFV, and there are many SDN implementations compared with “real” (meaning actually complete and useful) NFV.  Despite this, SDN became a bit of a junior partner to NFV at least among network operators.  Even in my own deployment models, it’s clear that the easiest path to SDN deployment would be in support of an NFV-driven transformation.  But suppose NFV doesn’t keep pace?  Is there a path to SDN success that bypasses NFV success?  Yes, there is—there are several, in fact.

To start off, there’s the cloud.  There’s a relationship between NFV and the cloud, in two dimensions.  NFV requires cloud hosting in almost all its large-scale success scenarios.  The cloud could benefit from NFV’s ability to manage dynamic application configuration and resource changes and simplify operations overall.  But the cloud is also different.  It’s a framework for operators to sell application/IT services and a new revenue source.

Virtualization at the network level is absolutely critical for cloud infrastructure, whether you’re using it for yourself or selling services from it.  Google and Amazon both have developed sophisticated network virtualization capabilities as part of their cloud offering, and even NFV demands much more power in network virtualization than either vendors or operators admit.  SDN could be driven by cloud virtualization faster than NFV advances.

The cloud lesson of Google and Amazon is important because it demonstrates that in the cloud, virtual networking has different properties than standard networks do.  The further SDN features advance beyond the traditional (presuming those advances are valuable or critical), the more likely SDN is to justify itself.  And we know that cloud virtual networking is very different.  For example, it deals with inside-outside address space differences and endpoints that reside in multiple address spaces.  Those aren’t common features for IP or Ethernet.

The strongest proof point for the potential of the cloud to drive SDN is that it’s had funding.  Cloud computing was for several years the leading project for operators.  It led mobile enhancement and content distribution, largely because the path to monetization seemed clear.  The biggest argument against it is that the cloud doesn’t lead any longer; monetization has been more difficult than operators had expected.  Still, SDN that grows out of cloud data centers could drive a major SDN commitment by operators.

Cloud SDN is primarily aimed at user application tenancy control, and the interesting thing about that mission is that application-specific networks take a step toward what could be called application networking versus the site networking of today.  That would lead to a big potential growth driver for SDN as a carrier service.

If applications each live on their own partitioned virtual network, then you can give users access to the application by giving them access to that network.  The sum of a user’s application rights defines their network connectivity.  A user “tunnels” from wherever they are in geography or client address space terms to reach the on-ramp of the application virtual network structure.  This model is more secure, more stable, more profitable for the operator because it’s more valuable to the user.  In theory any form of SDN that allows for explicit forwarding control could do this, but some vendors (Alcatel-Lucent’s Nuage for example) have made more of a point of highlighting the features needed.
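
A minimal sketch of “the sum of a user’s application rights defines their network connectivity,” with hypothetical users and application networks:

```python
# Hypothetical application-specific virtual networks and per-user rights.
app_networks = {
    "crm":       "vnet-crm-10.1.0.0/16",
    "erp":       "vnet-erp-10.2.0.0/16",
    "analytics": "vnet-analytics-10.3.0.0/16",
}

user_rights = {
    "alice": {"crm", "analytics"},
    "bob":   {"erp"},
}

def connectivity_for(user):
    """A user's reachable networks are exactly the applications they hold rights to."""
    return sorted(app_networks[app] for app in user_rights.get(user, set()))

print(connectivity_for("alice"))  # two application on-ramps
print(connectivity_for("carol"))  # no rights -> no connectivity at all
```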

When you look at this kind of network model, you can see another point of leverage that leads to yet another expansion of SDN opportunity.  If the user’s “access tunnels” jump onto a virtual network that extends outward near to their location, we’ve created a series of parallel virtual networks that have a large geographic scope and yet don’t touch each other.  They’re effectively partitioned at a lower layer than IP or Ethernet, by the “virtual network structure” whatever it is—the thing SDN created.

Well, if we’re partitioning everybody, every service, every application and user, then we’ve got a lot of small address spaces and not one big one.  Traditional switching/routing services demand big boxes because they involve large numbers of users and large amounts of traffic.  It’s unrealistic to think that even a big company network could be hosted using virtual routers and servers, and certainly not a service provider network.  But if we partition, we have little networks that fit very nicely into virtual switch/routing services.

Think about that.  We build VPNs and VLANs today largely by segmenting the connection space of big networks.  We could build the same thing by simply building a private IP or Ethernet network within an SDN physical partition.  And with that step, we could significantly reduce the investment in switches and routers and further extend the utility of SDN.  We push down a lot of the aggregation and basic network connectivity features, along with resilience and SLA management, to a lower layer based on SDN and cheaper to build and operate.

This can be very ad hoc, too.  Think of “Network-for-a-Day” as a service.  Define a community of interest and a task to be performed.  Link the people and resources into an ad-hoc NfaD configuration that’s independent, secure, and has its own SLA.  When you’re done, it goes away.  Nobody has to worry about persistence of connectivity when the project is over, about someone from the project still having access rights.  The network and rights have disappeared together, at day’s end.
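
Here’s a sketch of the Network-for-a-Day idea as a simple lifecycle object; everything in it (names, fields, the 24-hour default) is hypothetical.

```python
from datetime import datetime, timedelta

class NetworkForADay:
    """Ephemeral task network: members, resources, and access rights all expire together."""
    def __init__(self, task, members, resources, lifetime_hours=24):
        self.task = task
        self.members = set(members)
        self.resources = set(resources)
        self.expires_at = datetime.now() + timedelta(hours=lifetime_hours)

    def can_access(self, user, now=None):
        now = now or datetime.now()
        # The network and the rights to it vanish at the same instant.
        return now < self.expires_at and user in self.members

nfad = NetworkForADay("audit-Q3", ["alice", "bob"], ["erp-db", "report-store"])
print(nfad.can_access("alice"))                                      # True while the task runs
print(nfad.can_access("alice", now=nfad.expires_at + timedelta(1)))  # False: the network is gone
```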

IoT could be revolutionized by this.  Think of a sensor-community, a control-community.  Each of these is made up of sub-communities with their own policies, and each feeds a big-data repository that enforces policy-based access.  We now have process-communities that can be linked to these other communities and their repositories, persistently or for a short period.  We can charge for the linkage because there is no presumptive access, because networks are now partitioned and if somebody’s not in yours they are invisible, and vice versa.

Same with content.  A CDN is now a virtual structure.  Users can watch videos and interact with others, in their own narrow community as they define it or in some predefined community.  Different classes of users, or even different users, can have different cache sources and partnerships during viewing.  Same for wireless or wireline, your own wireless or roaming.

For a lot of vendors, this evolution would be a major problem (Cisco or even Alcatel-Lucent might find it pretty destructive of their router business).  For others like Brocade or Plexxi or Big Switch it could be a real windfall, though they might have to do some extra positioning or even product enhancement to get all the benefits on the table.

For operators, SDN has that same “maybe-or-maybe-not” flavor.  Virtual networks, like all things virtual, add a dimension of operations complexity that if left unchecked might compromise not only the SDN business case but the whole service framework.  This makes yet again a point I’ve often made; you need an operations framework for next-gen services or your gains will at best be illusory and at worst could be proven to be losses.

Like NFV, SDN is something that gets better the more of it you do, and that means that it can be a challenge to start small and get anywhere.  The good news for SDN is that there are plenty of places to start, and some of them aren’t all that small.  With a little help from the cloud, SDN could actually overtake NFV in deployments at least for a time, which could perhaps mean SDN principles would influence NFV planning and not the other way around.  Turnabout is fair play!

Looking a Bit Deeper at the NFV Business Case

I got over a hundred emails after my series on making the business case for NFV.  A few didn’t like it (all of these were from the vendor community) but most who contacted me either wanted to say it was helpful or ask for a bit more detail on the process.  OK, here goes.

You have to start an NFV business case at the high level.  Hosting functions is a major shift.  It demands that operators move from a familiar device model to a software model, and it comes along at a time when operators are trying to make systemic changes in cost and revenue to accommodate the crunch they’re experiencing in revenue-versus-cost-per-bit.  There’s risk here, and you have to be able to demonstrate that benefits scale with risk.

The best place to start is to define specific benefit targets.  You have to reduce costs, increase revenues, or both, so that means “cost” and “revenue” are the high-level targets.  For either (or both) these targets, you’ll need to assess an addressable opportunity and forecast a penetration.

Cost targets are relatively easy.  As I pointed out in a past blog, most operators are judged by the financial markets based on EBITDA, a measure of earnings that excludes depreciation and thus mutes the impact of capital spending.  This focus means that unless the operator is relatively small, it’s going to be harder to make a pure capex business case.  In any event, the problem with capex-driven changes in profit per bit is that you’d have to make a massive infrastructure change to address most of the cost, and that’s just not going to happen in an industry with a trillion dollars of installed infrastructure.  Operators also say their own estimates are that you can save 20% or less with hosted functions versus custom devices; they’d rather beat vendors up on price for that amount than risk new technology.

Operators spend about 12% of their revenue dollars on operations costs, about a third of which is network operations and the other two-thirds service operations.  The big question to ask for your cost targeting business case is the scope of the impact of your NFV deployment.  For example, if you propose to target a service that accounts for one one-hundredth of the devices deployed, you can’t expect a revolutionary benefit.  If your NFV impacts the service lifecycle for services that account for ten percent of service operations effort, that’s the limit of your gains.

Even if you have a target for cost control that you can quantify, you may not be able to address all of it.  The best example of this is the network operations savings targets.  Most NFV deployment will demand a change in network operations to be sure, but that change may not be a savings source.  For example, if you’re selling virtual CPE that will reduce truck rolls by letting people change services by loading new device features, you can only consider the truck rolls that are necessitated by service changes, not total truck rolls.  You still have to get a device on premises to terminate the service, and you can only save truck rolls in cases where feature additions would be driving them.

The service operations side is the hardest one to address.  If you think you’re going to save service operations effort you have to first measure the effort associated with the service lifecycle, from order through in-service support.  How many interactions are there with the customer?  How many are you going to eliminate?  If carrier Ethernet is a service target, and if it represents 30% of customer service interactions, cutting its cost by half will save 15% of service operations effort.  You’d be surprised how many NFV business cases say “Service operations is 12 cents of every revenue dollar and we’ll eliminate that cost with virtual CPE” when the defensible vCPE target is only a tenth of customers.
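
To make the scoping arithmetic explicit, here’s a small worked example using the illustrative percentages above (operations at 12% of revenue, two-thirds of that in service operations, a target service at 30% of interactions, a 50% efficiency gain).  Only the way you split your own numbers matters, not these.

```python
# Worked example of scoping an opex business case.  All inputs are the
# illustrative figures from the text, not real operator data.
revenue = 1_000_000_000                 # assume $1B in annual revenue
ops_cost = 0.12 * revenue               # ~12 cents of every revenue dollar
service_ops = ops_cost * (2 / 3)        # two-thirds of that is service operations

target_share = 0.30                     # carrier Ethernet: 30% of customer interactions
efficiency_gain = 0.50                  # cut its handling cost in half

savings = service_ops * target_share * efficiency_gain
print(f"Addressable service-ops cost: ${service_ops:,.0f}")
print(f"Defensible annual savings:    ${savings:,.0f} "
      f"({savings / service_ops:.0%} of service operations effort)")
```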

On the revenue side, it’s more complicated because you have to consider the addressable market.  Again, the “new” versus “established” customer issue can bite you.  If you reduce time to revenue, you can claim x weeks of new revenue per new customer or new feature for a reduction of x in deployment time.  That won’t apply to customers you’re already earning revenue on, only ones that are having a new service rollout.  And it doesn’t happen every year for the same customer, so don’t multiply that “x-week revenue” gain by the total number of customers!

Truly new services are also complicated because of the difficulty in assessing total addressable market.  Most operators don’t have good numbers on opportunities in their areas, but most could get it.  How many business services could you sell?  You can get demographic information by location and SIC/NAICS to estimate the population of buyers.  You can use market data from other areas, other providers, to estimate optimum penetration.

If you go through the motions of a business case, you’re going to end up realizing that the primary challenge in making one for NFV is that improvements in operations or agility are difficult to secure without making a systemic change in practices.  Unless your service target for NFV is very contained, you may be introducing a new set of operations requirements to all network and service management personnel but gaining efficiency for only a small percentage of their interactions.  Thus, the fact that you have to start small with NFV works against creating a big benefit up front, and that makes it hard to justify a big up-front cost.

The easiest way to make a business case for NFV work, as I’ve said, is to first target the orchestration and optimization of service and network management tasks through software automation.  This can be done without deploying any NFV at all, but it can also obviously be tied to an early NFV task.  The operations automation will easily generate enough savings to not only make the early NFV business case, it will probably generate enough to fund considerable future growth in NFV deployment.

If you can’t target wholesale changes in operations/management activity, then the easiest approach is to target opportunities that have very contained operations/management teams.  If business Ethernet services are sold, supported, and managed by an independent group, that group can be targeted because you can identify the costs and limit the impact/scope of your changes to the place where benefits are available.  If the same service is supported by a general team that does all manner of other services, it will be harder to isolate your benefits and make them plausible.

The watchword is to think in terms of contained targets.  Managed service providers who lease transport from third parties and integrate services with CPE or across carrier boundaries are probably the easiest early NFV targets.  Virtual CPE may be their only real “equipment” and operating it their only operations cost.  MVNOs would be easier targets than real mobile operators for the same reason, and mobile operators easier targets than operators who mix mobile and wireline services in the same metro area.

NFV as an evolution and NFV as a revolution are hard to reconcile in a business case.  In trying to do that, many vendors and operators have either manufactured savings by distorting numbers, or presented something so pedestrian in terms of value to profit-per-bit that operators yawn instead of cheer.  You can get this right, but you have to think about it more and take the process seriously.

What Has to Happen for Service Automation to Work

NFV is going to succeed, if we define success as having some level of deployment.  What’s less certain is whether it will succeed optimally, meaning reach its full potential.  For that to happen, NFV has to be able to deliver both operations efficiency and service agility on a scale large enough to impact operators’ revenue/cost compression.  Not only is operations efficiency itself a major source of cost reduction (far more than capex would ever likely be) but it’s an important part of any new service revenue strategy.  That’s because new services, and new ways of doing old services, introduce virtualization and connection of virtual elements, and that increases complexity.  Only better service automation could keep costs under control.  It makes no sense to use software to define services and then fail to exploit software-based service and network management.

I think pretty much everyone in the vendor and operator community agrees with these points (yes, there are still a few who want to present the simpler but less credible capex reduction argument).  Consensus doesn’t guarantee success, though.  We’re starting to see some real service automation solutions emerge, and from these we can set some principles that good approaches must obey.  They’ll help operators pick the best approaches, and help vendors define good ones.

The first principle is that virtualization and service automation combine effectively only if we presume service and resource management are kept loosely coupled.  It’s obvious that customers who buy services are buying features, not implementations.  When the features were linked in an obvious way to devices (IP and Ethernet services) we could link service and resource management.  Virtualization tears the resource side so far away from the functionality that buyers of the service could never understand their status in resource terms (“What?  My server is down!  I didn’t know I had one!” or “Tunnels inside my firewall?  Termites?”)

Software types would probably view service and resource management as an example of a pair of event-coupled finite-state machines.  Both service and resource management would be dominated by what could be viewed as “private” events, handled without much impact on the other system.  A few events on each side would necessarily generate a linking event to the other.  In a general sense, services that had very detailed SLAs (and probably relatively high costs) would be more tightly coupled between service and resource management, so faults in one would almost always tickle the other.  Services that were totally best-effort might have no linkage at all, or simply an “up/down” status change.

Where the coupling is loose, service events arise only when a major problem has occurred at the resource level, a problem that could impact customer status or billing.  The pool of resources is maintained independently of services, based on overall capacity planning and resource-level fault handling.  Where coupling is tight, fault management is service-specific and so is a response to resource state changes.  The pool is still managed for overall capacity, and faults, but remediation is now moved more to the service domain.
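
A minimal sketch of that event-coupled view: two state machines, each absorbing its own “private” events, with only a few resource events crossing over into the service side.  The states and events are invented for illustration.

```python
class ResourceManager:
    """Handles resource-level events; escalates only the ones services care about."""
    CROSSING_EVENTS = {"capacity_exhausted", "unrecoverable_fault"}

    def __init__(self, service_mgr):
        self.state = "normal"
        self.service_mgr = service_mgr

    def handle(self, event):
        if event == "link_flap":
            self.state = "rerouting"                         # private: absorbed locally
        elif event in self.CROSSING_EVENTS:
            self.state = "degraded"
            self.service_mgr.handle("resource_degraded")     # the linking event

class ServiceManager:
    """Handles service-level events; mostly ignorant of resource details."""
    def __init__(self):
        self.state = "in_service"

    def handle(self, event):
        if event == "resource_degraded":
            self.state = "sla_at_risk"       # the only resource-driven transition here
        elif event == "customer_change_order":
            self.state = "reprovisioning"    # private to the service side

svc = ServiceManager()
res = ResourceManager(svc)
res.handle("link_flap")               # absorbed: the service never notices
res.handle("capacity_exhausted")      # crosses the coupling boundary
print(res.state, svc.state)           # degraded sla_at_risk
```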

The second principle of efficient service automation is that you cannot allow either fault avalanches or colliding remedies.  Automated systems require event-handling, which means both “passing” events and “handling” them.  In the passage phase, an event associated with a major trunk or a data center might represent hundreds or thousands of faults, and if that many faults are generated at any point, the result could swamp handling processes.  Even if there’s a manageable number of events to handle, you still have to be sure that the handling processes don’t collide with each other, which could result in collisions in resource allocation, delays, or errors.  Traditionally, network management has faced both these problems, with varying degrees of success.

Fault correlation tools are often used to respond to problems at a low level that generate many high-level events, but in virtual environments it may be smarter to work to control the generation of events in the first place.  I’m an advocate of the notion of a hierarchy of service objects, each representing an intent model with an associated SLA.  If faults are generated at a low level, remediation should take place there with the passage of an event up the chain dependent on the failure of this early stage effort.
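
Here’s a sketch of “remediate locally, escalate a single event on failure” over a hierarchy of intent-modeled service objects; the structure and outcomes are hypothetical.

```python
class ServiceObject:
    """One node in a hierarchy of intent models: try local remediation first,
    and pass a single event up the chain only if that fails."""
    def __init__(self, name, parent=None, can_self_heal=False):
        self.name, self.parent, self.can_self_heal = name, parent, can_self_heal

    def on_fault(self, fault):
        if self.can_self_heal:
            print(f"{self.name}: remediated '{fault}' locally, no event emitted")
        elif self.parent:
            print(f"{self.name}: local remediation failed, escalating one event")
            self.parent.on_fault(f"{self.name}:{fault}")
        else:
            print(f"{self.name}: top of chain, raising service-level alarm")

service = ServiceObject("vpn-service")
region_east = ServiceObject("region-east", parent=service, can_self_heal=True)
access = ServiceObject("access-leg-17", parent=ServiceObject("region-west", parent=service))

access.on_fault("port_down")        # escalates twice, ends as one service alarm
region_east.on_fault("trunk_errors")  # absorbed: a flood of low-level faults never leaves here
```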

Collisions between management processes seeking to remediate problems, or between those processes and new activations, are historically handled by serialization, meaning that you ensure that in a given resource domain you have only one process actually diddling with the hardware/software functionality that’s being shared or pooled.  Obviously having a single serialized handling chain for an entire network would put you at serious risk in performance and availability, but if we have too many handling chains available, we have to worry about whether actions in one would impact the actions of another.

An example of this is where two services are trying to allocate capacity on a data center interconnect trunk.  They might both “check” status and find capacity available, create a service topology, and then have one fail because it lost the race to actually commit the resources.  That loser would then have to rebuild the service based on the new state of resources.  Too many of these collisions could generate significant delay.

Fixing the handler-collision problem in a large complex system isn’t easy.  One thing that could help is to avoid service deployment techniques that rely on looking for capacity first and then allocating it when the full configuration is known.  That introduces an interval of uncertainty between the two steps that raises the risk of collision.  Another approach is to allocate the “scarce” resources first, which suggests that service elements that are more in contention should be identified for prioritization during the deployment process.  A third is to commit the resource when its status is checked, even before actual setup can be completed.
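
A sketch of that third option, committing the resource at the moment its status is checked so two services can’t both “see” the same spare capacity.  A simple lock stands in for whatever serialization a real resource domain would use.

```python
import threading

class TrunkCapacity:
    """Check-and-commit in one step, so concurrent deployments can't both
    grab the same headroom on a data center interconnect trunk."""
    def __init__(self, capacity_gbps):
        self.free = capacity_gbps
        self._lock = threading.Lock()

    def reserve(self, gbps):
        with self._lock:                # serialize just this resource, not the whole network
            if self.free >= gbps:
                self.free -= gbps       # committed at check time
                return True
            return False                # the loser finds out now, before building its topology

dci = TrunkCapacity(100)
print(dci.reserve(60))   # True: service A commits 60G as it checks
print(dci.reserve(60))   # False: service B must replan now, not after deployment
print(dci.free)          # 40
```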

The final principle of service automation is that the processes have to be able to handle the full scope of services being deployed.  This means not only that you have to be able to automate a service completely and from end to end across all legacy and evolving technologies, but also that you have to be able to do this at the event rate appropriate to the size of the customer base.

The biggest misconception about new technologies like SDN and NFV is that you can bring about wholesale changes in cost with limited deployments.  In areas like operations efficiency and service agility, you gain very little if you deploy a few little pieces of the new buried in the cotton of the old.  Not only that, these new areas with new service lifecycle and operations/management practices are jarring changes to those weaned on traditional models, which means that you can end up losing instead of gaining in the early days.

It seems to me inescapable that if you plan on a new operations or service lifecycle model, you’re going to have to roll that model out faster than you could afford to change infrastructure.  That means it has to support the old stuff first, and perhaps for a very long time.

The other issue is one of scale.  We have absolutely no experience building either SDN or NFV infrastructure at the scale of even a single large corporate customer.  Most users know that technologies like OpenStack are “single-thread” meaning that a domain has only one element that’s actually deploying anything at any point in time.  We can’t build large networks with single-threaded technology, so we’re going to have to parallel SDN and NFV on a very large scale.  How do we interconnect the islands, separate the functions and features, commit the resources, track the faults?  It can be done, I’m convinced, but I’m not the one that has to be convinced.  It’s the operations VPs and CIOs and CFOs and CEOs of network operators.

I’ve noted needs and solutions here, which means that there are solutions.  It’s harder for vendors to sell complex value propositions, but companies in the past didn’t succeed by taking the easy way out.  In any case, while a complete service automation story is a bit harder, it can be told even today, and easily advanced to drive major-league NFV success.  That’s a worthwhile goal.

Comparing the NFV Data-Model Strategies of Key Vendors

I think that most of my readers realize by now that I think the data modeling associated with NFV is absolutely critical for its success.  Sadly, few of the players involved in NFV say much about their approach to the models, and I’ve not been able to get the same level of detail from all of those I’ve asked.  I do think it’s possible to make some general comments on the modeling approaches of the few vendors who have credentials in making an NFV business case, and so I’ll do that here.  To do this, I’ll introduce the key points of an NFV data model and then talk about how vendors appear to address them.  If I cover your own approach in a way you think is incorrect, provide me documentation to support your position and I’ll review it.

The first aspect of a data model is genesis.  All of the NFV models stem from something, and while the source doesn’t dictate the results, it shapes how far the approach can get and how fast.  The primary model sources are TMF SID, cloud/TOSCA, and that inevitable category, other.

Alcatel-Lucent and Oracle appear to have adopted the TMF SID approach to NFV modeling.  This model is widely implemented and has plenty of capabilities, but the extent to which the detailed features of the model are actually incorporated in someone’s implementation is both variable and difficult to determine.  For example, the TMF project to introduce Service Oriented Architecture (SOA) principles, GB942, is fairly rare in actual deployments, yet it is critical in modeling the relationship between events and processes.

HP is, as far as I can determine, the only announced NFV player who uses the cloud/TOSCA approach.  That’s surprising given the fact that NFV deployment is most like cloud deployment, and the TOSCA model defined by OASIS is an international standard.  IBM uses it for its own cloud orchestrator architecture, for example.  I think TOSCA is the leading-edge approach to NFV modeling, but it does have to be extended a bit to be made workable for defining legacy-equipment-based network services.

In the “other” category we have Overture, who uses the open-source distributed graph modeling architecture, Titan.  This is an interesting approach because Titan is all about relationships and structural pictures thereof, rather than being tied to a specific network or cloud activity.  It’s probably the most extensible of all the approaches.

The second aspect of the data model is catalogability, to coin a term.  To make an NFV model work, you have to be able to define a service catalog of pieces, assemble these into service templates, and then instantiate the templates into contracts when somebody orders something.  All of the models described by the key NFV vendors have this capability, but those based on the TMF SID have the most historicity in supporting this approach, since SID is the model long used by OSS/BSS systems and vendors.
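
To make “catalogability” concrete, here’s a hedged sketch of the catalog-to-template-to-contract progression in generic Python rather than any vendor’s SID or TOSCA dialect; all the names are invented.

```python
# Catalog: reusable service building blocks (intent-model style, vendor-neutral sketch).
catalog = {
    "vpn-core":   {"params": ["sites", "bandwidth_mbps"]},
    "vfirewall":  {"params": ["ruleset"]},
    "access-leg": {"params": ["site", "access_type"]},
}

# Template: an assembly of catalog pieces describing a sellable service.
managed_vpn_template = ["access-leg", "vpn-core", "vfirewall"]

def instantiate(template, order):
    """Turn a template plus an order into a contract: one record per element,
    carrying the parameters that deployment will need."""
    return [{"element": name,
             "params": {p: order.get(p) for p in catalog[name]["params"]}}
            for name in template]

contract = instantiate(managed_vpn_template,
                       {"sites": 4, "bandwidth_mbps": 100,
                        "ruleset": "default-deny", "site": "Chicago",
                        "access_type": "ethernet"})
for element in contract:
    print(element)
```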

Both HP and Overture have the ability to define models of substructures of a service and assemble them.  Either approach appears to be suitable, but they lack the long deployment history of SID, and the integration of operations processes with these non-SID models has to be addressed explicitly since there’s no TMF underlayment to appeal to for feature integration with operations.  HP also provides for inheritance from base objects, and they appear to be alone in modeling resource structures as well as service structures.  I think SID models might be able to do that, but I can’t find an example in the material for our TMF-based vendors.

The third aspect of the data model is process integration.  In order to synchronize and support efficient network and service management, you somehow have to link NFV to NMS and OSS/BSS processes.  There are two basic ways to do that—one being the virtual device approach and the other what I’ll call the explicit event steering approach.  Details on how vendors do this stuff are extremely sparse, so I can’t be as definitive as I’d like.

It appears that both Alcatel-Lucent and Oracle have adopted the virtual-device approach.  NFV sits at the bottom of a resource structure defined in the SID, and operations and management processes couple to SID as usual.  The goal of NFV deployment implicitly is to make the deployed elements look like a virtual device which can then be managed as usual.  In theory, the GB942 event coupling to SOA interfaces via the SID is available, but I’ve got no details on whether it is implemented by either vendor.  Since GB942 implementation is rare, the answer is “probably not”.  This combination means that service automation steps “inside” VNFs is probably managed totally by VNF Managers, and may be opaque to OSS/BSS/NMS.  I can’t be sure.

HP’s approach seems also to rely on VNFMs for lifecycle processes at the NFV level, but they have a general capability to link events to processes through what they call a monitor in their SiteScope management component.  You can establish monitor handlers, transgression handlers, trigger handlers, etc.  It would appear to me that this could be used to integrate lifecycle processes within NFV to larger management and operations domains, though I don’t have an example of that being done.

Overture has the clearest definition of process integration.  They employ a GB942-ish mechanism that uses a business process language and service bus to steer events generated by their management analytics layer, or another source, to the correct operations/management processes.  The approach seems clear and extensible.

The next area to consider is management data integration.  To me, the baseline requirement for this is the establishment of a management repository where status information from resources and services is collected.  Everyone seems to do that, with Alcatel-Lucent and HP integrating their management data with their own platforms, and Overture using a system that unites the open-source NoSQL database Cassandra with Titan.  In theory all these approaches could generate management data correctly, and I believe that both HP and Overture could present custom management of modeled service elements.  On the rest, I don’t have the data to form a clear view.

The final point is support for modeled legacy elements.  In HP this capability is explicit in the modeling, and HP is the only player that I can say that for.  HP’s data model incorporates legacy components explicitly so you can define service solutions in a mixture of legacy and SDN terms.  Since HP’s NFV Director is derived from their legacy control product line, it has full capabilities to control legacy devices.  HP can also operate through OpenStack and ODL.

With Alcatel-Lucent and Oracle, modeling legacy elements is implicit in their SID basis, meaning that their processes are really legacy processes into which NFV is introduced so there’s little question legacy networking could be accommodated.  Under the virtual devices that NFV represents, both support OpenStack and ODL.

Overture can support legacy devices in a number of ways.  First, the Titan model is technology-agnostic so there’s no data model issue.  The model can drive a control-layer technology set to actually deploy and connect resources using legacy or NFV (or SDN) technologies.  Overture has their own network controller that (obviously) supports their own equipment and can be augmented through plugins to support third-party elements.  They can also work through OpenStack and ODL to support legacy devices that have suitable plugins for either of those environments.

The data models used for services are, as I’ve said, the most important thing about how a vendor implements NFV, and yet we don’t talk about them at all.  Public material on these models is limited for all the vendors I’ve listed.  What would be great would be an example of a service modeled in their architecture, containing both VNFs and legacy elements, and linked to NMS and operations processes.  If any of these four vendors (or anyone else) wants to send me such a picture for open use and publication then I’d love to see it and I’d almost surely blog on it.

What can we learn about NFV from this?  Well, if the most important thing about NFV is something nobody is talking about and few have any support for at all, we’re not exactly in the happy place.  There is no way you could do a responsible analysis of NFV implementations without addressing the points I’ve outlined here, which means without defining an explicit data model to describe services and resources.  If you see an analysis without these points (whether from a vendor, the media, or an analyst) it’s incomplete, period.  I hope my comments here will help improve coverage on this important topic!

What’s the REAL Impact of Virtualization on Network Security?

When I was a newly-minted programmer I saw a cartoon in the Saturday Evening Post (yes, it was a long time ago!)  A programmer came home from the office, tossed his briefcase on the sofa, and said to his wife “I made a mistake today that would have taken a thousand mathematicians a hundred years to make!”  This goes to show that automation doesn’t always make things better when it makes them faster.  Sometimes it doesn’t make them better at all.

Automation is a necessary part of virtualization, and we can certainly envision how a little service automation done wrong could create chaos.  One area stands out in terms of risk, though, and that’s security/compliance.  We learned with the cloud that security is different in a virtual world.  Will it be different for SDN and NFV, and how?  What will we do about it?  There are no solid answers to these questions yet, but we can see some signposts worth reading.

Virtualization separates function from realization, from resources.  When you create services with virtualization, you are freed from some of the barriers of service-building that have hampered us in the past, most notably the financial and human cost of making changes to accommodate needs or problems.  Adding a firewall to a virtual network is easier than adding one to a device network.  However, you also create issues to go with your opportunities.  The potential for malware in the virtual world is very much higher because you’ve converted physical devices with embedded code into something not very different from cloud components—something easier to get at.

I would propose that security in the virtual world is a combination of what we could call “service security” and “resource security”.  The former would involve changes to the “black-box” or from-the-outside-in experience of the service and how those changes would either improve or reduce security.  The latter would involve the same plus-or-minus, but relating to the specific processes of realizing the resource commitments associated with each function.  Service security relates to the data plane and any control or management traffic that co-resides with it, and resource security relates to the plane where virtualization’s resources are connected.

Service security differences would have to arise from one of three problems—spoofing an endpoint, joining a limited-connectivity service, or intercepting traffic.   To the extent that these are risks exposed within the data plane of the service (or co-resident control/management planes) you would have the same exposure or less with SDN or NFV.  SDN, for example, could replace adaptive discovery with explicit connectivity.  Thus, at the basic service level, I think you could argue that virtualization doesn’t have any negative impact and could have a positive one.

Virtualization’s agility does permit added security features.  NFV explicitly aims at being able to chain in security services, and this approach has been advocated for enterprises in a pure SDN world by Fortinet.  You could augment network security by chaining in a firewall, encryption, virus-scanning on emails, DNS-based access control, or other stuff without sending a tech out or asking the customer to make a device connection on premises.  However, remember that you can have all these features today using traditional hardware, and that’s usually the case with businesses and even consumers.  It might be easier to add something with virtualization, but in the end we end up in much the same place.

If we carried virtualization to its logical end, which is application- and service-specific networks built using OpenFlow and augmented with NFV, you could see the end to open connectivity and the dawn of very controlled connectivity, almost closed-user-group-like in its capabilities.  Given this, I believe that at the service level at least, virtualization is likely to make security better over time, and perhaps so much better in the long term that security as we know it ceases to be an issue.  I want to stress the point that this isn’t a near-term outcome, but it does offer hope for the future.

If service security could be better with virtualization, the resource side of virtualization is another story.  At some point in the setup of a virtualization-based service, you have to commit real resources and connect them.  This all happens in multi-tenant pools, where the isolation of tenants from each other and management/control pathways from tenant data planes can’t be taken for granted.

The first fundamental question for either SDN or NFV is the extent to which the resource domain’s control traffic is isolated from the service data plane.  If you can address or attack resource-domain components then there’s a security risk of monumental proportions being added when you add virtual infrastructure, and nothing much you do at the service level is going to mitigate it.  I think you have to think of the resource domain as the “signaling network” of virtualization and presume it is absolutely isolated from the services.  My suggestion was to use a combination of physical-layer partitioning, virtual routing, and private IP addresses.  Other approaches would work too.

If you isolate the signaling associated with virtualization, then your incremental resource risks probably come from either malware or “maloperators.”  There is always a risk of both these things in any operations center, but with virtualization you have the problem of needing a secure signaling network and then needing to let things onto it.  The VNFs in NFV and perhaps even management elements (the VNF Managers) would be supplied by third parties.  They might, either through malice or through error, provide a pathway through which security problems could flow into the virtualization signaling network.  From there, there could be a lot of damage done depending on how isolated individual VNF or service subnets of virtual resources were from each other and from central control.

To me, there are three different “control/management domains” in an SDN/NFV network.  One is the service data plane, which is visible to the user.  The second is the service virtual resource domain, which is not visible to the user and is used to mediate service-specific resource connections.  The third is the global management plane, which is separate from both the above.  Some software elements might have visibility in more than one such plane, but with careful control.  See Google’s Andromeda virtual network control for a good example of how I think this part has to work.

Andromeda illustrates a basic truth about virtualization, which is that there’s a network inside the “black box” abstractions that virtualization depends on.  That network could boost flexibility, agility, resilience, and just about everything else good about networking, but it could also generate its own set of vulnerabilities.  Yet, despite the fact that both Amazon and Google have shown us true black-box virtual networking with their cloud offerings, we’re still ignoring the issue.

The critical point here is that most virtualization security is going to come down to proper management of the address spaces behind the services, in the resource domain.  Get this part right and you have the basis for specifications on how each management process has to address its resources and partner processes, and where and how it crosses over into other domains.  Get it wrong and I think you have no satisfactory framework for design and development of the core applications of SDN or NFV and you can’t ever hope to get security right.

We need an architecture for SDN and NFV, in the fullest sense of the word.  We should have started our process with one, and we aren’t going to get one now by simply cobbling products or ideas together.  The question is whether current work, done without a suitable framework to fit in, can be retrofitted into one.  If not, then will the current work create so much inertia and confusion that we can’t remedy the issues?  That would be a shame, but we have to accept it’s possible and work to prevent it from happening.

Why (and How) Infrastructure Managers are Critical in NFV Management

In a number of recent blogs I’ve talked about the critical value of intent modeling to NFV.  I’d like to extend that notion to the management plane, and show how intent modeling could bridge NFV, network management, and service operations automation into a single (hopefully glorious) whole.

In the network age, management has always had a curious dualism.  Network management has always been drawn as a bottom-up evolution from “element” management of single devices, through “network” management of collective device communities, to “service” management as the management of cooperative-device-based experiences.  But at the same time, “service” management, in the sense of service provider operations, has always started with the service and then dissected it into “customer-” and “resource-facing” components.

The unifying piece to the management puzzle is the device, which has always been the root of whatever you were looking at in terms of a management structure.  Wherever you start, management comes down to controlling the functional elements of the network.  Except that virtualization removes “devices” and replaces them with distributed collections of functions.

The thing is, a virtual device (no matter how complicated it is functionally) is a black box that replicates the real device it was modeled on.  If you look at a network of virtual devices “from the top” of the management stack (including from the service operations side) you’d really want to see the management properties of the functionality, not the implementation.  From the bottom where your responsibility is to dispatch a tech to fix something, you’d need to see the physical stuff.

This picture illustrates the dichotomy of virtualization management.  You still have top-and-bottom management stack orientation, but they don’t meet cleanly because the resource view and the service view converge not on something real but rather on something abstract.

If we visualize our virtual device as an intent model, it has all the familiar properties.  It has functionality, it has ports, and it has an SLA manifest in a MIB collection of variables.  You could assemble a set of intent-modeled virtual devices into a network and you could then expect to manage it from the top as you’d managed before.  From the bottom, you’d have the problem of real resources that used to be invisible pieces of a device now being open, connected, service elements.  The virtual device is then really a black box with a service inside.

Might there be a better way of visualizing networks made up of virtual devices?  Well, one property of a black box is that you can’t see inside it.  Why then couldn’t you define “black boxes” representing arbitrary useful functional pieces that had nothing to do with real network devices at all?  After all, if you’re virtualizing everything inside a black box, why not virtualize a community of functions rather than the functions individually?

This gives rise to a new structure for networks.  We have services, which divide into “features”.  These features divide into “behaviors” that represent cooperative activities of real resources, and the behaviors in turn decompose onto the real resources themselves.  In effect, this creates a very service-centric view of “services”, meaning a functional view rather than one based on how resources are assembled.  The task of assembling resources goes to the bottom of the stack.  All composition is functional, and only when you’ve decomposed your composition far enough to deploy it do you enter the structural domain, at the last minute.
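To make that structure concrete, here’s a minimal Python sketch of the idea; every name, field, and number is purely illustrative and nothing here comes from any standard.  An intent model exposes ports and SLA variables, hides everything else, and decomposes only into the models directly inside it.

# Hypothetical sketch of the service -> feature -> behavior -> resource
# hierarchy described above; names and fields are illustrative only.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class IntentModel:
    """A black box: visible ports and SLA variables, hidden internals."""
    name: str
    ports: List[str] = field(default_factory=list)
    sla: Dict[str, float] = field(default_factory=dict)          # externally visible variables
    children: List["IntentModel"] = field(default_factory=list)  # subordinate models, opaque below one level

    def decompose(self) -> List["IntentModel"]:
        # Only the directly contained models are visible; deeper layers stay opaque.
        return self.children

# Functional composition: a "service" made of "features" made of "behaviors".
vpn_behavior = IntentModel("vpn-core-behavior", ports=["pe1", "pe2"], sla={"availability": 0.9995})
access_behavior = IntentModel("ethernet-access-behavior", ports=["uni"], sla={"availability": 0.999})
vpn_feature = IntentModel("vpn-feature", ports=["site-a", "site-b"], children=[vpn_behavior, access_behavior])
service = IntentModel("business-vpn-service", ports=["customer-demarc"], children=[vpn_feature])

The point of the sketch is simply that composition happens entirely among these functional objects; only the bottom layer would ever touch a real device or a cloud API.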

This approach leads to a different view of management, because if you assemble intent models to do something you have to have intent-model SLAs to manage against, but they have to be related somehow to those black-box elements inside them.  To see how this could work, let’s start by drawing a block that’s an intent model for Service X.  Alongside it, we have one of those cute printer-listing graphics that represents the MIB variables that define the intent model’s behavior—what it presents to an outside management interface.

But where do they come from?  From subordinate intent models.  We can visualize Services A and B “below” Service X, meaning that they are decomposed from it.  The variables for Services A and B must then be available to Service X for use in deriving its own variables.  We might have a notion that Service X Availability equals Service A Availability plus Service B Availability (a simplified example, I realize!).  That means that if Services A and B are also black boxes that contain either lower-level intent models or real resources, the SLA variables for these services are then derived from their subordinate elements.
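Here’s a tiny worked example of that derivation, using the same simplified arithmetic as above; real availability math would differ and the numbers are invented, so treat this as a sketch of the pattern, not the formula.

# Illustrative only: deriving a superior model's SLA variables from its
# immediate subordinates via "active expressions" (the arithmetic mirrors
# the simplified example in the text, not real availability math).
def derive_service_x(variables_a: dict, variables_b: dict) -> dict:
    return {
        "availability": variables_a["availability"] + variables_b["availability"],
        "latency_ms": variables_a["latency_ms"] + variables_b["latency_ms"],
    }

service_a = {"availability": 0.499, "latency_ms": 12.0}
service_b = {"availability": 0.499, "latency_ms": 8.0}
service_x = derive_service_x(service_a, service_b)   # {'availability': 0.998, 'latency_ms': 20.0}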

This notion of service composition would be something like virtual-device composition, except that you don’t really try to map to devices but to aspects of services.  It’s more operations-friendly because aspects of services are what you sell to customers.  I would argue that in the world of virtualization it’s also more resource-management-friendly, because the relationship between resource state (as reflected in resource variables) and service state (where that state is ultimately reflected) is explicit.

How would you compose a service?  From intent models, meaning from objects representing functions.  These objects would have “ports”, variables/SLAs, and functionality.  The variables could include parameter values and policies that could be set for the model, and those would then propagate downward to eventually reach the realizing resources.  Any set of intent models that had the same outside properties would be equivalent, so operators could define “VPN” in terms of multiple implementation approaches and substitute any such model wherever “VPN” is used.  You could also decompose variably based on policies passed down, so a VPN in a city without virtualization could be realized on legacy infrastructure.
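Here’s one hedged way that substitution might look in code.  The registry, the two implementations, and the policy flag are all hypothetical, but they show how anything presenting the same outside properties could be swapped in under a policy passed down from above.

# Hypothetical registry of interchangeable "VPN" implementations. Anything
# that honors the same ports and SLA variables can be substituted freely.
from typing import Callable, Dict

VPN_IMPLEMENTATIONS: Dict[str, Callable[[dict], dict]] = {}

def register(name: str):
    def wrapper(fn):
        VPN_IMPLEMENTATIONS[name] = fn
        return fn
    return wrapper

@register("virtual")
def deploy_vnf_vpn(params: dict) -> dict:
    return {"realized_on": "hosted vRouter instances", "sites": params["sites"]}

@register("legacy")
def deploy_mpls_vpn(params: dict) -> dict:
    return {"realized_on": "existing MPLS PE routers", "sites": params["sites"]}

def decompose_vpn(params: dict, policy: dict) -> dict:
    # A policy passed down from the parent model picks the realization;
    # the external view of the "VPN" model is identical either way.
    choice = "virtual" if policy.get("virtualization_available", True) else "legacy"
    return VPN_IMPLEMENTATIONS[choice](params)

print(decompose_vpn({"sites": ["NYC", "Boston"]}, {"virtualization_available": False}))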

In this approach, interoperability and interworking are at the intent-model level.  Any vendor who could provide the implementation of an intent model could provide the resources to realize it, so the intent-model market could be highly competitive.  Management of any intent model is always the same, because the variables of that model are the same no matter how it’s realized.

The key to making this work is “specificational” in nature.  First, you have to define a set of intent models that represent functionally useful service components.  We have many such models today, but operators could define more on their own or through standards bodies.  Second, you have to enforce a variable-name convention for each model, and create an active expression that relates the variables of a model to the variables generated by its subordinates (or internal structure).  This cannot be allowed to go further than an adjacent model, because it’s too difficult to prevent brittle structures or consistency problems when you dive many layers down to grab a variable.  Each black box sees only the black boxes directly inside it; the deeper ones are opaque, as they should be.

Now you can see how management works.  Any object/intent model can be examined like a virtual device.  The active expressions linking superior and subordinate models can be traversed upward to find impact or downward to find faults.  If it’s considered useful, it would be possible to standardize the SLAs/MIBs of certain intent models and even standardize the active flows that represent management relationships.  All of that could facilitate plug-and-play distribution of capabilities, and even federation among operators.
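A rough sketch of that traversal, again with invented names, assuming each model records only its direct parent and children:

# Sketch of fault/impact traversal across adjacent intent models only.
# A model knows just its direct neighbors; everything deeper stays opaque.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ModelNode:
    name: str
    healthy: bool = True
    parent: Optional["ModelNode"] = None
    children: List["ModelNode"] = field(default_factory=list)

    def add(self, child: "ModelNode") -> "ModelNode":
        child.parent = self
        self.children.append(child)
        return child

def find_faults(node: ModelNode) -> List[str]:
    """Walk downward to locate the unhealthy leaves behind a degraded SLA."""
    if not node.children:
        return [node.name] if not node.healthy else []
    return [name for child in node.children for name in find_faults(child)]

def impacted_services(node: ModelNode) -> List[str]:
    """Walk upward to see which superior models a resource fault touches."""
    path = []
    while node.parent is not None:
        node = node.parent
        path.append(node.name)
    return path

service = ModelNode("Service X")
feature = service.add(ModelNode("Service A"))
behavior = feature.add(ModelNode("vFirewall behavior", healthy=False))
print(find_faults(service))          # ['vFirewall behavior']
print(impacted_services(behavior))   # ['Service A', 'Service X']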

We may actually be heading in this direction.  Both the SDN and NFV communities are increasingly accepting of intent models, and an organized description of such a model would IMHO have to include both the notion of an SLA/MIB structure and the active data flows I’ve described.  It’s a question of how long it might take.  If we could get to this point quickly we could solve both service and network management issues with SDN and NFV and secure the service agility and operations efficiency benefits that operators want.  Some vendors are close to implementing something like this, too.  It will be interesting to see if they jump out to claim a leading position even before standards groups get around to formalizing things.  There’s a lot at stake for everyone.

Why NFV’s VIMs May Matter More than Infrastructure Alone

Everyone knows what MANO means to NFV and many know what NFVI is, but even those who know what “VIM” stands for (Virtual Infrastructure Manager) may not have thought through the role that component plays and how variations in implementation could impact NFV deployment.  There are a lot of dimensions to the notion, and all of them are important.  Perhaps the most important point about “VIMs” is that how they end up being defined will likely set the dimensions of orchestration.

In the ISG documents, a VIM is responsible for the link between orchestration and management (NFVO and VNFM, respectively) and the infrastructure (NFVI).  One of the points I’ve often made is that a VIM should be a special class of Infrastructure Manager, in effect a vIM.  Other classes of IM would represent non-virtualized assets, including legacy technology.

The biggest open question about an IM is its scope, meaning how granular its view of NFVI is.  You could envision a single giant VIM representing everything (which is kind of what the ETSI material suggests), or you could envision IMs that represented classes of gear, different data centers, or even just different groups of servers.  IM scope is important for two reasons: competition and orchestration.

Competitively, the “ideal” picture for IMs would be that there could be any number of them, each representing an arbitrary collection of resources.  This would allow an operator to use any kind of gear for NFV as long as the vendor provided a suitable VIM.  If we envisioned this giant singular IM, then any vendor who could dominate either infrastructure or the VIM-to-orchestration-and-management relationship would be able to dictate the terms through which equipment could be introduced.

The flip-side issue is that if you divide up the IM role, then the higher-layer functions have to be able to model service relationships well enough to apportion specific infrastructure tasks to the correct IM.  Having only one IM (or vIM) means that you can declare yourself as having management and orchestration without actually having much ability to model or orchestrate at all.  You fob off the tasks to the Great VIM in the Sky and the rest of MANO is simply a conduit to pass requests downward to the superIM.

I think this point is one of the reasons why we have different “classes” of NFV vendor.  The majority do little to model and orchestrate, and thus presume a single IM or a very small number of them.  Most of the “orchestration” functionality ends up in the IM by default, where it’s handled by something like OpenStack.  OpenStack is the right answer for implementing vIMs, but it’s not necessarily helpful for legacy infrastructure management and it’s certainly not sufficient to manage a community of IMs and vIMs.  The few who do NFV “right” IMHO are the ones who can orchestrate above multiple VIMs.

You can probably see that the ability to support, meaning orchestrate among, multiple IMs and vIMs would be critical to achieving full service operations automation.  Absent the ability to use multiple IMs you can’t accommodate the mix of vendors and devices found in networks today, which means you can’t apply service operations automation except in a green field.  That flies in the face of the notion that service operations automation should lead us to a cohesive NFV future.

Modeling is the key to making multiple IMs work, but not just modeling at the service level above the IM/vIM.  The only logical way to connect IMs to management and orchestration is to use intent models to describe the service goal being set for the IM.  You give an IM an intent model and it translates the model based on the infrastructure it supports.  Since I believe that service operations automation itself demands intent modeling above the IM, it’s fair to wonder what exactly the relationship between IMs/vIMs and management and orchestration models would be.
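As a sketch of what that contract might look like (the class and method names are mine, not from the ISG documents), every IM exposes the same “realize this intent” interface and hides how its own infrastructure actually gets driven.

# Hypothetical Infrastructure Manager abstraction: every IM accepts the same
# intent-model description and translates it for the gear it actually controls.
from abc import ABC, abstractmethod

class InfrastructureManager(ABC):
    @abstractmethod
    def realize(self, intent: dict) -> dict:
        """Translate an intent model into infrastructure-specific actions."""

class VirtualIM(InfrastructureManager):          # a "vIM" fronting cloud resources
    def realize(self, intent: dict) -> dict:
        # In practice this would drive something like OpenStack APIs.
        return {"action": "boot VNF instances", "count": intent["instances"]}

class LegacyIM(InfrastructureManager):           # fronting an EMS/NMS for installed gear
    def realize(self, intent: dict) -> dict:
        return {"action": "provision via EMS", "circuit": intent["endpoints"]}

def dispatch(intent: dict, managers: dict) -> dict:
    # Higher-layer orchestration models the service well enough to pick the right IM.
    return managers[intent["domain"]].realize(intent)

managers = {"dc-east": VirtualIM(), "metro-boston": LegacyIM()}
print(dispatch({"domain": "dc-east", "instances": 3}, managers))

The useful property is that the higher layer never needs to know whether a domain is virtual or legacy; it only has to model the service well enough to route each intent to the right manager.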

My own work on this issue, going back to about 2008, has long suggested that there are two explicit “domains”, service and resource.  This is also reflected in the TMF SID, with customer-facing and resource-facing service components.  The boundary between the two isn’t strictly “resources”, though—at least not as I’d see it.  Any composition of service elements into a service would likely, at the boundaries, create a need to actually set up an interface or something.  To me, the resource/service boundary is administrative—it’s functional versus structural within an operator.  Customer processes, being service-related, live on the functional/service side, and operator equipment processes live on the resource side.

Resource-side modeling is a great place to reflect many of the constraints (and anti-constraints) that the ISG has been working on.  Most network cost and efficiency modeling would logically be done at the site level, not the server level, so you might gain a lot of efficiency by first deciding which data centers to site VNFs in, then dispatching orders to the optimal ones.  This would also let you deploy multiple instances of things like OpenStack or OpenDaylight, which could improve performance.
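A simple illustration of that two-stage approach, with made-up sites and cost numbers: pick the data center on site-level economics first, then hand the order to that site’s own IM.

# Illustrative two-stage placement: choose the data center on site-level
# economics first, then dispatch the deployment order to that site's own
# IM (which might wrap its own OpenStack or OpenDaylight instance).
SITES = {
    "dc-east":  {"cost_index": 1.0, "free_capacity": 120},
    "dc-west":  {"cost_index": 0.8, "free_capacity": 40},
    "dc-metro": {"cost_index": 1.3, "free_capacity": 300},
}

def choose_site(vnf_count: int) -> str:
    eligible = {name: s for name, s in SITES.items() if s["free_capacity"] >= vnf_count}
    return min(eligible, key=lambda name: eligible[name]["cost_index"])

def place_vnfs(vnf_count: int) -> dict:
    site = choose_site(vnf_count)
    # A per-site IM (not shown) would now receive the order and do server-level placement.
    return {"site": site, "order": f"deploy {vnf_count} VNFs"}

print(place_vnfs(60))   # picks dc-east: the cheapest site that still has room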

Customer-based or customer-facing services are easy to visualize; they would be components that are priced.  Resource-facing services would likely be based on exposed management interfaces and administrative/management boundaries.  The boundary point between the two, clear in this sense, might be fuzzy from a modeling perspective.  For example, you might separate VLAN access services by city as part of the customer-facing model, or do so in the resource-facing model.  You could even envision decomposing a customer-facing VLAN access service into multiple resource-facing ones, one for each city involved, based on what infrastructure happened to be deployed there.

From this point, it seems clear that object-based composition/decomposition could take place on both sides of the service/resource boundary, just for different reasons.  As noted earlier, most operators would probably build up resource-facing models from management APIs—if you have a management system domain then that’s probably a logical IM domain too.  But decomposing a service to resources could involve infrastructure decisions different from decomposing a service to lower-level service structures.  Both could be seen as policy-driven but different policies and policy goals would likely apply.

I think that if you start with the presumption that there have to be many Infrastructure Managers, you end up creating a case for intent modeling and the extension of these models broadly in both the service and resource domain.  At the very bottom you have things like EMSs or OpenDaylight or OpenStack, but I think that policy decisions to enforce NFV principles should be exercised above the IM level, and IMs should be focused on commissioning their own specific resources.  That creates the mix of service/resource models that some savvy operators have already been asking for.

A final point to consider in IM/vIM design is serialization of deployment processes.  You can’t have a bunch of independent orchestration tasks assigning the same pool of resources in parallel.  Somewhere you have to create a single queue in which all the resource requests for a domain have to stand till it’s their turn.  That avoids conflicting assignments.  It’s easy to do this if you have IM/vIM separation by domain, but if you have a giant IM/vIM, somewhere inside it will have to serialize every request made to it, which makes it a potential single point of processing (and failure).
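Here’s a minimal sketch of that serialization discipline, assuming one worker per IM domain draining a single queue; the domain names and requests are invented.

# Sketch of per-domain serialization: each IM domain gets one worker that
# drains a single queue, so resource assignments within a domain never race.
import queue
import threading

class DomainSerializer:
    def __init__(self, domain: str):
        self.domain = domain
        self.requests: "queue.Queue[dict]" = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, request: dict) -> None:
        self.requests.put(request)

    def _worker(self) -> None:
        while True:
            request = self.requests.get()        # one assignment at a time per domain
            print(f"[{self.domain}] assigning resources for {request['service']}")
            self.requests.task_done()

east = DomainSerializer("dc-east")
west = DomainSerializer("dc-west")
for i in range(3):
    east.submit({"service": f"vpn-{i}"})         # serialized within dc-east
    west.submit({"service": f"firewall-{i}"})    # but parallel across domains

east.requests.join()
west.requests.join()

Because serialization here is per domain, dividing up the IM role also divides the bottleneck; a single giant vIM would have to push every request in the network through one such queue.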

Many of you are probably considering the fact that the structure I’m describing might contain half a dozen to several dozen models, and will wonder about the complexity.  Well, yes, it is a complex model, but the complexity arises from the multiplicity of management systems, technologies, vendors, rules and policies, and so forth.  And of course you can do this without my proposed “complex” model, using software that, I can promise you as a former software architect, would be much more complex.  You can write good, general code to decompose models.  Trying to work from parameters and data files, and to anticipate all the new issues and directions without models?  Good luck.

To me, it’s clear that diverse infrastructure—servers of different kinds, different cloud software, different network equipment making connections among VNFs and to users—would demand multiple VIMs even under the limited ETSI vision of supporting legacy elements.  That vision is evolving and expanding, and with it the need to have many IMs and vIMs.  Once you get to that conclusion, then orchestration at the higher layer is more complicated and more essential, and models are the only path that would work.

Why is Network-Building Still “Business as Usual”?

If we tried to come up with a phrase that captured carrier directions as expressed so far in their financials and those of the prime network vendors, a good suggestion would be “business as usual.”  There’s been no suggestion of major suppression of current capital plans, no indication of technology shifts that might signal a provocative change in infrastructure planning.  We are entering the technology planning period for the budget cycle of 2016, a critical one in validating steps to reverse the revenue/cost-per-bit crunch operators predict.  Why isn’t there something more going on?

It’s perfectly possible that one reason is that operators were being alarmists with their 2017 crossover prediction.  Financial analysts and hedge funds live quarter to quarter, but it seems pretty likely to me that they’d be worried at least a little if there were a well-known impending crisis in the offing.  Somebody might take flight and cause a big dip in stock prices.  But I think it’s a bit more likely than not that the 2017 consensus date for a revenue/cost crunch is as good an estimate as the operators could offer.

Something that ranges over into the certainty area is that operators are responding by putting vendors under price pressure and buying more from Huawei, the price leader.  Except in deals involving US operators, where Huawei isn’t a player, we’ve seen most vendors complain of pricing pressure and at least a modest slowing of deals.  Ciena said as much yesterday on their earnings call, though they say it’s not a systemic trend but rather a timing issue for a couple of players.

Another almost-sure-thing reason is that the operations groups that do the current network procurements haven’t been told to do much different.  VPs of ops told me, when I contacted them through the summer, that they were not much engaged in new SDN or NFV stuff at this point.  As they see it, new technology options are still proving out in the lab (hopefully).  Their focus is more on actual funded changes like enhancements to mobile infrastructure.

The question, the big one, is whether one reason operators are staying the course is that it’s the only course they have.  We’ve talked, as an industry, about massive changes in network infrastructure but for all the talk it’s hard to define just what a next-gen infrastructure would look like.  Harder, perhaps, to explain how we’d fund the change-over.

That’s the real point, I believe, because in our rush to endorse new network technologies we’ve forgotten a message from the past.  The notion of transformation of telecom infrastructure isn’t new.  We had analog telephony, then digital and TDM, and then the “IP convergence”.  What would we see if we looked back to the past and asked how the changes came about, particularly that last one to IP?

Packet networking was proposed in a Rand Corporation study in 1966, and we had international standard packet protocols and the OSI model just a decade later.  We also had the foundations of the Internet.  None of the stuff that evolved in that period was intended as a replacement for TDM.  That role was envisioned for Asynchronous Transfer Mode, or ATM.

The theory behind ATM at the technical level isn’t relevant here, so I’ll just summarize it.  You break down information into “cells” that are small enough so that the delay you’d experience waiting for a cell to be sent or received is small.  That lets you jump priority stuff in front of that which isn’t a priority, which lets you mingle time-sensitive stuff like voice (or video) with data.  This, in turn, lets you build a common network for all traffic types.  ATM was actually intended to replace the public network, designed for it in fact, and there was an enormous wave of interest in ATM.  I know because I was part of it.

I learned something from ATM, not from its success but from its failure.  There was nothing technically wrong with ATM.  There was nothing wrong with the notion that a single converged network would be more economical as the foundation of a shift of consumer interest from voice to data. The problem was that the transition to ATM was impractical.  Wherever you start with ATM, you deliver technology without community.  You can’t talk ATM with somebody because early deployment would be unlikely to involve both you and your desired partners in communication.  You needed to toss out the old and put in the new, and that’s a very hard pill to swallow for operators.

Why did IP then win?  It wasn’t technical superiority.  It won because it was pulled through by a service—the Internet.  Operators wanted consumer data, and the Internet gave it to them.  The revenue potential of the Internet could fund the deployment of what was then an overlay network based on IP.  More Internet, more IP, until we reached the point where we had so much IP that it became a viable service framework, a competitor to what had previously been its carrier technology—TDM.  We got to IP through the revenue subsidies of the Internet.

What revenue funds the currently anticipated infrastructure transformation?  We don’t have a candidate with that same potential.  The benefits of SDN or NFV are subtle, and we have no history as an industry of exploiting subtle benefits, or even harnessing them.  That means, in my view, that we either have to find some camel’s-nose service to pull through the change as the Internet did for IP, or we have to learn to systematize the change.  I’ve offered an example of both in recent blogs.

IoT and agile cloud computing could both be candidates for the camel role.  We could gain almost a trillion dollars in revenues worldwide from these services.  We’re slowly exploiting the cloud already, and while it would help if we had a realistic understanding of where we’re going with it, we’ll eventually muddle into a good place.  IoT is more complicated because we have absolutely no backing for a truly practical model, but I think eventually it will happen too.

That “eventually” qualifier is the critical one here.  We probably can’t expect any new service to take off as fast as the Internet did, and even the Internet took a decade or more to socialize IP to infrastructure-level deployment.  My point with the notion of service operations automation is that we could do better.  If we build, through a combination of cloud features for infrastructure and enlightened software data modeling, a petri dish with an ideal growth medium in it, we could grow many new services and attract many new revenue sources.  This could then drive the evolution of infrastructure forward as surely as one giant camel could have, and a lot faster.

Consumerism has blinded us to a reality, which is the reality of justification.  I buy a new camera not because I need one but because I want it.  That works for discretionary personal expenses up to a point, but it’s not likely the financial industry would reward, or even tolerate, a decision by operators to deploy SDN or NFV for no reason other than that it was nice and shiny and new.  It’s my belief that we can accelerate change by understanding how it has to be paid for.  A useless technology will meet no financial test, but both SDN and NFV can be justified.  If we want earnings calls that cite explosive deployment growth in these new things, we’ll have to accept the need for that justification and get to work on making it happen.