A Realistic if Unsatisfying View of the “Market” for SDN and NFV

You can hardly pick up (virtually) an online publication these days without seeing an extravagant market forecast on NFV.  I don’t have much faith in forecasts in general; they usually turn out to be aimed at validating the largest market possible because the buyers of the report are usually vendors.  NFV is particularly problematic, though, and SDN follows only slightly behind in terms of forecast difficulty.  There’s a lot of junk science out there.

Gartner made news this week by saying that neither SDN nor NFV is a market at all, a position I also disagree with.  Yes, SDN is an architecture.  So is IP, and it’s a market.  Yes, NFV is a deployment option.  So is cloud computing, and even Gartner agrees that’s a market.  So we can’t define our way out of facing the basic question of what the two technologies are likely to do.  We can’t fall back on avoidance.

OK, then let’s give the analysis of the SDN and NFV market our best shot.

SDN is the substitution of central control for packet forwarding based on adaptive topology exchange.  If we stay with that purist definition, then SDN advances in two distinct phases.  In phase one we have the enclave model of SDN, where we apply SDN to connection problems that are limited enough to avoid the intrinsic truth that one controller couldn’t possibly software-define the Internet.  In phase two we develop or stumble on a reasonable federation model for SDN that preserves SDN value across enclave boundaries by avoiding the need to harmonize to pre-SDN network-to-network interfaces.

In the enclave SDN phase, my own model says we’re limited to penetrating about 4% of total switch/router spending.  Right now it looks like SDN would be most likely to find its place in cloud data centers first, followed by a slow advance into metro infrastructure.  During that advance, SDN would break out of the enclaves and become a more broadly useful technology.  In phase two, my model says that it could penetrate almost 75% of the switch/router market, but this will take quite a while.

When?  The current timing my model offers is that phase one SDN cannot possibly achieve full penetration until 2018, and that by 2020 phase two would be well underway, with SDN then owning about 11% of switching/routing.  However, it’s important to note that the growth beyond that level would come largely from SDN control of virtual switch/router elements.  SDN, in short, wins by having lower OSI levels take on more of a role in grooming and fault response, and the partitioning of services and applications would then enable the use of virtual switch/routers.

NFV is a lot more complicated, primarily because it’s far from clear what specific drivers for NFV would look like.  You have to take two views—based on current benefit expectations and based on “magic bullet” expansion.  In both cases you have to look at NFV as being first a “market” in that there would be spending on NFV technology, and second as a market transformation driver that’s shifting carrier capex from network equipment to data center equipment.

Capital savings was the first NFV benefit to be cited, and most operators have abandoned it at this point.  For good reason, I think, because my own model says that if that’s the justification for NFV then the maximum impact of NFV would be to shift about a percent of capex to servers.  It’s a pimple on the market posterior at that level.

If opex can be targeted, things get more interesting.  A full solution to service orchestration could justify a lot of incremental NFV spending, but interestingly enough it wouldn’t result in an enormous change in how operator capex is distributed.  In order for NFV to be operationally efficient it has to spread its wings to cover legacy infrastructure.  In doing that it creates a significant benefit even where network equipment remains the same.  What drives the transformational spending changes is the fact that the improved operations practices can then support the third benefit.

Which is new services—“service agility” in the current vernacular.  An agile and efficient operations framework would start to open new revenue opportunities for operators, which would be fulfilled increasingly through the use of servers.  In my model, this benefit-driven but unfocused drive to NFV creates a shift to servers that eventually ends up involving about a third of all carrier spending by 2025.

You may have noticed that I’ve not talked about what gets spent on specific NFV technology, and that was also true for SDN control.  In neither case are the expenditures a significant piece of carrier spending, and my model won’t forecast something that small accurately.  A lot of this will be open source, and some will be given away by vendors who will make money elsewhere.  The transformation will impact OSS/BSS spending, though, to the point where NFV-fashioned orchestration will make up almost 40% of that market, also by 2025.

My magic bullet approach does a bit better.  Recall from prior blogs that the magic bullet says that there are a small number of successful NFV applications (vCPE and mobile infrastructure, with perhaps IoT coming along).  Under this model, early NFV success is promoted in the low-apple areas where it makes sense, and these successes justify near-term changes more efficiently because they limit exposure to areas where infrastructure changes don’t pan out in immediate revenue or savings.

In this model, NFV shifts about 2% of carrier capex to servers by 2019 and that number grows to 5% by 2020.  If we could start to address IoT in respectable terms by 2016, that 2019 number would be 5% and 2020 would be 11%.  OSS/BSS impact would also be larger; 65% of that market would be orchestration-driven by 2025.

The combination of SDN and NFV would have significant impact on carrier capex.  Up to around 2020, the combination would actually increase capex slightly year over year as operators have to pre-capitalize to some degree.  By 2022 that’s turned around and there’s a decline at about the same rate, with capex stabilizing at just a bit more than 90% of current levels by 2030.

So there we are, at least as my current model shows things.  Many won’t like the lack of absolute numeric precision in my results, but I don’t think it’s possible to do an absolute forecast—not because SDN and NFV aren’t markets but because they are both alternatives to legacy equipment and their adoption will be driven by their benefits and by overall capital budgets.  I can’t forecast the latter—there are too many variables.

The most critical time for SDN and NFV is the period between now and 2018.  Any technology option that can prove its business case will drive infrastructure spending in its own direction at that point, and that which can’t prove out will be put on the back burner as operators look for better capital strategies going forward.  The moral for SDN and NFV vendors is simple.  You can’t just ride the wave here, because you’re going to have to be the wavemaker.  Otherwise something else wins.

Making Network Revolutions into Realized Revolutions

The notion that things are changing, perhaps a bit too fast for comfort, is hardly a modern phenomenon nor one confined to tech.  One of my favorite poems (Arthur Guiterman’s On the Vanity of Earthly Greatness) starts with the provocative line “The tusks that clashed in mighty brawls of mastodons…are billiard balls.”  Change, and not change for the better (from the mastodon perspective, anyway).  Our tech giants should give it a read, perhaps.

IBM was once the unchallenged leader in terms of strategic influence, with a score so high that their vision alone could drive purchase decisions.  Today they’ve fallen into a tie for third place among IT vendors, and their influence plus a couple of bucks will get you a Starbucks.  Cisco’s score has also fallen, and so has that of Microsoft, which just reported a record loss.  In fact, all of the tech leaders are suffering, which makes you wonder just what kind of market we’re going to have in even a few years.  It also makes you wonder how we got to this point.

The root cause is likely simple.  Business technology has gotten subducted under mass market technology.  Selling expensive computers to the Fortune 500 is a decent business but nothing compared to selling smartphones to the masses of the earth.  If you look at any technology forum these days, even those supposedly focused on business IT, you find it dominated by consumer stories, consumer comments, and even consumer ads.  Absent any means of engaging the market, business IT is bound to languish.

A related issue is the death of productivity-enhancing innovations.  We’ve seen regular cycles of IT growth in the past, driven by new compute paradigms.  We ended the last cycle in about 2002 and nothing has started up since.  From that point, buyers became increasingly focused on sustaining the productivity gains they’d already achieved, and the only “benefit” a seller could offer was a lower cost of achieving those goals.

You might look at this trend and see it as the dawn of the Age of IT Populism, where masses of startups rise (waving white boxes, presumably) and shake the councils of the mighty.  Well, not so fast.  While the major IT vendors are losing strategic influence and often missing revenue targets as well, they’re still sustaining market share, particularly if you take them as a group.  We might not consult IBM’s crystal ball before making decisions, but we still sign their sales orders.

It’s this contradiction that sets the tone for networking for the balance of this decade, I think.  Buyers are unhappy with their vendors; they espouse open products, open source, and standards, and they fear “lock-in” and vendor influence.  Yet they still do pretty much the same stuff as before when the PO comes out.  Perhaps the reason is an old saw; “Nobody ever lost their job by buying IBM!”  You can substitute your big vendor of choice for “IBM” here and the sense is the same.  Changing from a traditional supplier, abandoning an industry leader, is a risky decision.  We’re not being paid to take risks here, says the buyer.

Which gets us back to benefits.  Suppose that private airplanes cost a thousand bucks.  Would that create an explosion in air travel?  No, but it would darn sure drive up the salaries of pilots.  What’s happening as the cost of equipment and software buckles under competitive pressure is a shift of cost focus from the cheapening capital elements of tech to skilled tech labor—professional services and integration.  We’re making something different into the limiting factor, not eliminating the limits.

In my pilot/plane example, you can see that mass-market airplanes by themselves don’t create a mass market because you can’t train people to fly them safely.  What you need to have is a product focus shift, one that takes the functional aspects of IT or network equipment for granted (computers compute, routers push bits) and concentrates on making the stuff bulletproof in terms of installation, operation, and support.

Thirty years ago, network management was said to be 18% of network spending.  In 2014 when I asked enterprises for their number, I got almost 38%.  Gear got cheaper, more gear is bought, more gear is more operationally complex…you get the picture.

Put this way, the revolutions like SDN and NFV are shooting at the wrong duck and only the cloud has captured a glimmer of reality.  Not with IaaS, which probably consumes more tech humans than the alternative, but with SaaS.  If people want an application, sell them the application in the form they’ll use it in, not a heap of technology they can (hopefully) assemble into something that (hopefully) will do what they want.

It’s not that SDN and NFV can’t simplify things, but that we’re not seeing them as what the cloud says they’ll have to be to succeed.  This is all about network-as-a-service, not network as technology choices.  We’re back to the populist airplane thing; just making something cheaper doesn’t make it more consumable, more populist.

Almost two years ago I had a meeting with a half-dozen EU operators on the goals of NFV, and the point that was raised loud and clear was that if you wanted to do something significant with NFV, you had to address the service lifecycle.  And that’s what they say now.  One operator says “If you want ‘service agility’ then you want a complete opportunity-to-revenue lifecycle measured in days and not in years.”  Well, for all the discussions about agility, we’ve not addressed that complete lifecycle.

Part of the solution is to frame our new technology revolutions in a service management model.  The goal of all this iron is to sell services, so you need to understand how the new pieces will address that broad goal.  Making a five-hour difference in provisioning a service that’s taken you two years to frame into market terms is hardly noticeable, much less revolutionary.

NaaS is programming more than it’s anything else.  Software development principles have to guide service development if we’re going to make a massive difference in agility.  We’ve had years of good science built around agile software and now we’re trying to build agile services in activities that have nothing to do with software and have little or no participation from software people.  A software architect would never do SDN or NFV as it’s been done.

This is an easy problem to fix, too.  The software industry spent nearly two decades evolving to the modern state of modular development and as-a-service or microservice or whatever.  We have the results of all of that, and with “software” as the core of nearly every change we believe is coming in networking, it should be easy to conceptualize future services as the facile assembly of useful pre-designed components.  SDN and NFV both play a role in componentized services—connecting the components and deploying them, respectively.  What we’re missing is how the components are built, the “upperware” platforms needed to facilitate their development, and the management systems that can do drag-and-drop service creation as easily as we can do drag-and-drop development.

This is where vendors could really help.  At least three of the big NFV names (Alcatel-Lucent, HP, and Oracle) have software frameworks and skill sets.  All three have the tools needed to create an agile framework.  Let’s get to it, gang.

What Operators Think of SDN Deployment Models and What that Says about the Future

I had an interesting exchange with a planner in a mid-sized carrier, and got some insight into how network operators are seeing SDN.  Drawing on an exchange with some other operators, my contact gave me a tutorial on the “models” of deployment the operators see as promising.  Some are familiar, and some of the approaches we think of (and read/hear about) most often seem to be getting discounted a bit in the real world.

The model that got mentioned first by operators in this exchange was the “vSwitch-plus” model.  In cloud or NFV data centers, SDN and OpenFlow are often used to set up the connectivity at the virtual level, through control of vSwitches.  Operators like that explicit model of setup—you connect what you say you want to connect—and so they are looking at a white-box data center where data center switches are OpenFlow-programmed as well.

One interesting point about this particular SDN application is that it’s seen primarily as a tool for reducing errors and advancing security and compliance.  Operators didn’t think the cost savings involved would excite senior management, and they didn’t expect operations costs to be materially impacted either, except perhaps as it relates to the security/compliance area.
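To make the “explicit connection” point concrete, here is a minimal sketch of how a controller application might install a forwarding rule in an OpenFlow vSwitch or white-box switch.  It uses the open-source Ryu controller framework purely for illustration (the operators didn’t name a controller), and the subnet and port numbers are hypothetical; the point is simply that nothing forwards until a rule explicitly says so.

```python
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3


class ExplicitPathSetup(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
    def switch_features_handler(self, ev):
        datapath = ev.msg.datapath
        ofproto = datapath.ofproto
        parser = datapath.ofproto_parser
        # Forward one tenant's subnet out a specific port; traffic not matched
        # by an explicit rule like this simply isn't connected.
        match = parser.OFPMatch(eth_type=0x0800,
                                ipv4_dst=("10.1.0.0", "255.255.0.0"))
        actions = [parser.OFPActionOutput(2)]
        inst = [parser.OFPInstructionActions(ofproto.OFPIT_APPLY_ACTIONS, actions)]
        mod = parser.OFPFlowMod(datapath=datapath, priority=100,
                                match=match, instructions=inst)
        datapath.send_msg(mod)
```

Contrast that with adaptive L2/L3 behavior, where connectivity exists by default and has to be filtered out; that contrast is exactly what the operators said they liked for error reduction and compliance.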

The second model that got some attention was the metro grooming model.  Operators made the point that it was becoming a practice to use lower-layer tunnels to separate traffic, services, applications, and users.  Dedicated physical media like fiber or copper Ethernet can provide that separation, but they lack the granularity needed because advances in technology keep boosting the capacity physical media can support.  SDN could provide what might be called a “protocol-less tunnel”, an extension of physical media.

Where these tunnels should be supported is clear to operators too: primarily in fiber, and in particular in agile-optics gear.  There is already a strong interest in redefining networking as a series of parallel Level 3 universes separated at the physical level, rather than as a universally connected Level 3 world as Internet advocates might see it.  Right now this interest is focused in the metro because (logical, huh?) that’s where most of the capex is flowing.  It’s also true that metro is the focus of a lot of different services, and also a focus of net neutrality planning for operators.

Neutrality is a big reason to worry here.  If you were to build a hypothetically fully connected IP network as your metro foundation, you could (in the US and Europe in particular) find yourself defending any special service capacity against neutrality complaints because what you did would look and act much like an extension of the Internet.

There’s also an opex dimension here, but it’s more indirect.  Operators say that the opex cost per handled bit is highest at Level 3 and declines as you drop through to the OSI basement.  They also say that compartmentalized IP, particularly with topology and resiliency ceded to agile lower layers, is cheaper to run and the per-device traffic (being that of a single compartment) is low enough to make a hosted router/switch solution practical.

Some of you may recall a blog I did on this topic a while back—making the point that the network of the future could be far more virtual than we think if we compartmentalized applications and services at a virtual Level 1.  Maybe I’m self-fulfilling here, but I see some of that working its way into operator planning as this group sees it, at least.

The third model of SDN is one of end to end service separation.  The thought is that this would build out from metro commitments, meaning that operators would establish metro grooming as above, then extend it to do things like provision Carrier Ethernet using explicit OpenFlow forwarding.  These Level-1-separated services at the metro level would then be interconnected using “core grooming” to create end-to-end services.

I had hoped to see some interest in building cloud computing services this way.  It’s easy to say that you could combine the vSwitch-plus and metro grooming infrastructure choices to build a cloud data center with an application-specific network, then tunnel it to CPE in multiple sites to offer access.  This model didn’t get much attention, though.  It may be that it combines too many service areas.  Most operators run their clouds and their networks independently.  It may also be that they’re not seeing the end-game for that approach yet, and so can’t really justify diving into it.

One thing I didn’t hear from my operator source here was a goal of displacing traditional switching and routing with SDN.  I was somewhat surprised by the very focused interest, and wondered if that were simply an indication of the operators identifying low-apple opportunities from which they could build to their real goal.  No, they say—these low-apple opportunities are the real goal.  Strategic use of SDN, meaning the conceptualizing of a pure, true, SDN model for switching and routing, is simply not something they were thinking about.

I wonder whether this might not have a significant impact on SDN, but also on NFV and even carrier cloud computing.  In a way, it makes sense.  Here are the operators, saying that they are facing profit compression because of the commoditization of bit services.  Do they then re-architect bit services to offer better cost points in a technology-driven revolution, or do they just focus on the stuff that’s 1) costing a lot and 2) currently being refreshed or capitalized?  The latter, they say.

If this is the case, then SDN needs to focus a lot more on “interworking”, both in a vertical sense with the L2/L3 stuff and horizontally from network segment to network segment.  Explicit SDN interworking, beyond what can be accomplished with the network-to-network gateway processes of L2 and L3, is essential if you’re trying to do internetworking below L2 and L3, after all.

For NFV, the challenge is that the views expressed by these operators challenge the notion of new services driving substantial NFV deployment.  This group would simply do an NFV-ish (probably more cloud-like, or agile-CPE-like in the case of vCPE) low-apple implementation that’s not even particularly designed to go anywhere else.  To build NFV on a systemic scale you’d then need some overriding operations and orchestration benefit.

For the carrier cloud services, things could be really tough.  There really are no obvious low apples for cloud services.  IaaS is generalized but not much more profitable than pushing bits.  PaaS and SaaS require a specific market target that operators might have trouble even finding and later have problems addressing in an engaging way.

Maybe we’re asking too much here.  Maybe we need a technology vision of the future that operators can build toward without actually endorsing or even knowing about.  That goes against my own planning-intensive grain, but the market will ultimately decide what it’s willing to do.

The Five Stages of VNFs

VNFs, meaning virtual network functions, are important to NFV.  Without them there’s no possible business justification to be had, no matter how good our infrastructure or orchestration and management might be.  Well, we all know there are supposed to be five stages of grief.  I contend that there are five stages of VNF too, and our progress through them—as vendors and as an industry—may decide whether we can forestall the other five stages.

The first stage of VNFs is the billboard stage.  In this stage, VNF vendors eagerly seek publicity and in many cases do that by linking themselves to any vendor who can spell “ecosystem” (even if they get a few letters wrong).  The reason for this is fairly obvious; VNFs can’t deploy except as a part of a broader NFV ecosystem and it’s far from clear early on who the winners in that space might be.

Most VNF providers are in the billboard stage now, and they’re there because there’s little barrier to being a partner to the NFV masses and little they can do to get traction except as a part of either an NFV trial or a larger RFP.  Most will probably never get out of this phase, because the primo players at the ecosystem level aren’t interested in a cattle call; they want offerings they can justify integration and trial efforts on.  Many are called but few are chosen here.

The second stage of VNFs is the see-what-sticks phase.  A VNF vendor enters this phase, typically from the billboard phase, when they start to understand that real engagement is going to involve a significant commitment of resources to every “ecosystem partner” they think has a shot.  Since most probably won’t qualify, this phase is really about weeding out the chaff so you can focus on the few great ones.

This is the phase where a vendor typically learns the details of operationalizing NFV, and the way those details will impact VNFs.  In many cases this will generate a significant amount of new development.  In some it may bring relief.  Some VNF vendors will realize that their “success” depends less on an ecosystem partner than they might have thought because the nature of their product (part of a business services vCPE chain, for example) allows them to deploy with little or no real NFV linkage.  Those VNF vendors will wait here a while as things develop on a broader NFV front.

Stage three is the ride a magic bullet phase.  Here, VNF vendors discover that NFV opportunities that seem real are focused on a very small set of services, justifying a small set of VNFs.  Right now, for example, the two magic bullets are mobile and business-related vCPE.  IoT could be added.  Where a VNF vendor happens to have such an offering, this offering now becomes the Road to the Promised Land, and (mixing a few metaphors) they’ll forsake all others to get their key VNF or VNFs buffed up and ready.

Some vendors, of course, won’t be that lucky.  For that group, the only hope is to find some connection between what they can do and the key opportunity drivers that are emerging.  Security, application acceleration, and even collaboration players will all want to make their stuff look like a part of the magic bullets being recognized.  Most won’t make a decisive connection.

Stage four is one of the most critical of the VNF stages, the second effort stage.  It’s not that we’re asking VNF vendors to have another desperate run at the goal, but that this is the stage where everyone at every level recognizes that one magic bullet doesn’t win a war.  NFV has to be broad-based to gain enough in benefits to justify the enormous industry effort.

This is where NFV gets real, because it’s relatively easy in the current immature world of NFV specifications and implementations to make a single VNF work.  The question is whether you can do two, and this is the stage where that has to happen.  Most operators say that they’re going to expect a pathway to their second VNF about the time they start field trials on the first, and that they’ll be looking for the features that make their NFV solution inclusive during those trials.

Which brings us to stage five of VNFs, which is reckoning.  Not every VNF vendor who manages to get to stage four will move on, in no small part because many will have hitched their wagon to the wrong star, focusing on an ecosystem that doesn’t have the breadth.  The big question for the industry is whether enough VNFs pass through this stage without washing out.  Not every NFV trial or ecosystem will succeed, but I suspect that if the odds aren’t way better than 50:50 even in the early stages, there’s going to be a lot of blowback.

If you’re a VNF vendor, you need to be looking at the stages one or two ahead of where you are, at the least.  The most critical stage of all, IMHO, is stage three, where VNF vendors will have to prove not only that they partnered with the right ecosystem vendor(s) but also that they’ve envisioned a lively market opportunity in which their VNF can play.  Next-most-critical is stage four, where a VNF provider with one story figures out that they need two good ones at the minimum.

The obvious question is where we are, stage-wise, and it’s difficult to answer that.  Every provider of VNFs has an independent vision and program, though the maturity of both varies considerably.  My rough surveys suggest that two-thirds of all purported providers of VNFs have little other than a hopeful eye on the future.  One told me that “VNFs are aspirations not products”, citing the fact that there was only an immature vision of deployment.

Of the third who actually have something, the two focus areas are vCPE and mobility/IMS/EPC.  There is a pretty solid business case for agile mobile infrastructure, even out to cloud RAN (C-RAN).  The interesting point is that mobile infrastructure is more a cloud application than an NFV application; the specifics of NFV are really not needed to justify early deployment and make a business case.  The problem is that it might be difficult to develop a second act from a pure-cloud vision of mobility.  Other apps require more dynamic deployment and management, and thus depend more on NFV.

vCPE is also knotty in terms of its ability to pull through full NFV.  The “edge-hosted” model offers some benefit versus standard hardware, and like mobility it could stand alone (in this case, as an agile-edge application set) without much real NFV involvement.  Again, the challenge would be in transitioning it to a broader opportunity base.  The edge features of business services are not themselves highly dynamic, and once you’ve deployed what a customer is willing to pay for, might a custom device that’s operationally simpler be a better approach?  We don’t have enough data to determine that right now.

Here’s an IoT Approach that Works (but Nobody Sells it)

I said in a comment on an earlier blog that I thought all the IoT approaches touted so far were irrational.  In earlier blogs I’ve noted my view that IoT had to be viewed more as a big-data application than as a network.  A few of you have asked me to expand on my own view of IoT, and so that’s what I propose to do here.

“Classic” IoT is a vague model of device connection whereby sensors and controllers of various kinds are attached directly to the Internet.  Once there, they’re available to fuel a whole series of new applications.  For proponents of this vision, the question is how we support LTE or WiFi interfaces to all these gadgets.  There are a lot of issues associated with this model, from a public policy perspective, from the perspective of ROI on IoT applications, and from a simple technology-ecosystem perspective.  We have to start an IoT discussion by addressing these issues.

In policy terms, it’s clear that just putting a bunch of sensors and controllers on the Internet would create a massive challenge in security and privacy.   Imagine how much fun hackers would have diddling with the traffic lights in New York, shutting down lanes on bridges and tunnels, and perhaps even impacting pipeline valves and the power grid.  Imagine how much easier it would be for stalkers (or worse) to track prospective victims by looking at the security and traffic cameras.  Happy fishbowl, everyone.  Obviously there’s no chance this could be allowed.

Perhaps, then, it’s fortunate that it’s far from clear who’d have an interest in deploying this stuff.  State and local governments in many areas have found they can’t even get permission or funding to set up traffic cameras.  Public utilities already have sensor and controller connectivity, but it’s shielded from the very open environment the IoT proposes to foster and they’d hardly be looking at magnifying their vulnerability.  Private companies would look at the IoT model and ask how they could possibly earn a return by just publishing data or allowing control openly.

The technical challenges fall into two groups, one relating to both policy and ROI and the other to utility.  On the policy/ROI side, the problem is that the more sophisticated you make a sensor or controller the more it costs and the more power it will need.  If you have a home security system you probably use inexpensive wired sensors for your doors and windows, and maybe for motion detection.  These probably cost about twenty bucks a pop.  Imagine an IoT world, where each of these sensors is online through WiFi or LTE, and each is equipped with a firewall and network-based DDoS protection to prevent attack.  You’re probably looking at five times the cost, plus you’ll either have to power the stuff or change batteries a lot.

The utility issue arises from the fact that a given sensor is just an IP address in the classic IoT model.  How is that sensor put into a useful context?  For example, if it’s a traffic sensor, what road and milepost is it located at, and what format is its information in?  Is it counting cars, measuring speeds, or both?  How do we know it’s even what it purports to be?  It might be spoofed by some hacker, or presented by an enterprising rest stop owner who wants to divert traffic by making an expressway look jammed.

In my opinion, IoT isn’t a movement at the network level, but rather an architecture built around a big-data model.  Imagine a database where information from known and authenticated contributors is collected and structured.  The contributions could include traffic sensors, home sensors, even locations of mobile devices.  All the data would be contributed based on policy-defined limits on use.  Those who wanted to use the data would do a big-data query that would be policy-validated to ensure it meets security and privacy rules, and would receive what they needed—historically or in real time.  Control elements would be represented by write-enabled variables, and accessing them would also be policy controlled.
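As a thought experiment, here’s a minimal sketch of what the consumer side of such a repository might look like.  All of the names, fields, and policy rules are hypothetical; the point is simply that the owner’s usage policy travels with the data and gets checked before anything is returned.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class UsagePolicy:
    allowed_purposes: set      # e.g. {"traffic-planning", "navigation"}
    minimum_aggregation: int   # smallest result set the owner allows to be released
    realtime_allowed: bool


@dataclass
class SensorReading:
    sensor_id: str
    road: str
    milepost: float
    vehicles_per_minute: int
    timestamp: datetime
    policy: UsagePolicy


def query_traffic(repo: List[SensorReading], road: str, purpose: str, realtime: bool):
    """Return readings only where the contributing owner's policy permits this use."""
    permitted = [
        r for r in repo
        if r.road == road
        and purpose in r.policy.allowed_purposes
        and (r.policy.realtime_allowed or not realtime)
    ]
    # Enforce the strictest aggregation floor before anything leaves the repository.
    if permitted and len(permitted) < max(r.policy.minimum_aggregation for r in permitted):
        return []
    return permitted
```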

Where’s the network?  Behind the database.  Any owner of sensors could contribute information into the big-data repository, but they would control the contribution and be able to state policies on how their data could be used.  The “network” connecting their sensors could be anything that’s convenient, meaning that all of the current sensors and controllers that are networked using any protocol could be admitted to the IoT repository through a gateway.  No need to make sensors directly visible online, or to change sensor technology to support direct Internet visibility.

This sort of IoT could be visualized as a collection of “device subnets” that would use any suitable technology to attach sensors and controllers.  Each would have a gateway through which the data was pumped into the IoT repository, and the gateway would manage the policies and formatting.  The IoT repository would be an online database query service—a web service.  It might be linked onto a company VPN, to a cloud application set, or made available on the Internet.
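To illustrate the contributor side, here’s a minimal sketch of a device-subnet gateway pushing a normalized reading, with its owner’s policy attached, up to a repository.  The endpoint URL and record format are purely illustrative assumptions; the sensors behind the gateway could be speaking any local protocol at all.

```python
import json
import urllib.request

# Illustrative repository endpoint; in practice this would be whatever web
# service the repository operator exposes.
REPOSITORY_URL = "https://iot-repo.example.com/contribute"


def publish_reading(sensor_id, value, policy):
    """Normalize a local sensor reading and contribute it, policy attached."""
    record = {
        "sensor_id": sensor_id,
        "value": value,
        "policy": policy,   # owner-defined usage limits travel with the data
    }
    req = urllib.request.Request(
        REPOSITORY_URL,
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Local loop (not shown): poll the device subnet over its native protocol,
# normalize each reading, then call something like:
# publish_reading("door-07", {"open": False}, {"allowed_purposes": ["home-security"]})
```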

You can probably see the similarities between this model and the web.  Anyone can put up a website; anyone could “put up” a device subnet directly, or contribute to one of any number of IoT repositories subject to their policies.  Anyone could access what’s put up, subject to whatever policy limits the owner imposes.  The commercial terms of any of these relationships could be whatever the market sets.

IMHO, it would be the IoT repositories that would establish the value of the whole picture.  Any cloud provider could establish one, of course, including Amazon, Apple, Google, and Microsoft.  Interconnect players like Equinix could build them, and network operators could as well.  For some of the players like Amazon, Apple and Google, you could see their repository exploiting the mobile devices they offer (directly or as a platform).  Auto manufacturers could join somebody’s repository or start their own.  Same with home security companies, federal, state, and local government, and even public utilities.

What about standards?  Well, if we presume the IoT Repository model, and if we presume that we’re accessing primarily those devices with large installed populations, standards shouldn’t really be much of an issue.  A query can format data in any way that’s convenient, unlike an interface.

This model is also easily federated.  We have hotel and airline sites today, and discount travel sites that create front-ends to their models, and even a couple that front-end the front-ends.  We could build gateways between IoT repositories, high-level repositories that cull specialized data from others, or repositories that do specialized analytics.  Think cottage industry.

One of the most interesting points about this model is that it raises what might be called the “utility IoT” approach.  A company deploys a bunch of sensors and controllers and pays for the effort by 1) contributing the data to repositories and/or 2) developing and deploying their own IoT repository where they charge for access.  Doing this would be easier for telcos and public utilities, which have historically accepted low internal rates of return and tolerated high first costs, but in theory any player could bootstrap into it.

This isn’t classic IoT, it’s not a universe where new OTTs mine sensors that somehow magically appear, magically create ROI, and magically generate traffic and equipment revenues.  It’s somewhere I think we could get to, and that seems a better approach to me.

NFV Management’s Final Dimension–OSS/BSS/NMS Integration

In prior blogs I looked at the NFV deployment model and the way that management as ETSI defined it would presumably work within a “typical” deployment.  The question this last of my more detailed explorations of NFV management will deal with is how “NFV management” relates to management and operations in a broader sense.  You can’t, after all, support services by managing only NFV infrastructure.  You almost certainly can’t build them that way either.

There’s no single management and operations model in play today among operators, but whatever is out there has to deal with those two areas in some way.  “Management” is normally applied to the physical resources used to build services, and “operations” to the business processes and commercial tasks related to service sale and maintenance.  It wouldn’t be unreasonable to say that operations is a customer-facing process and tool set, while management faces resources.  Since the TMF links these two in its SID data model, it should be clear that many view management as sitting “under” operations.  The fact that many services today are still provisioned through NMSs says that many see them as separate.

Another TMF concept is useful in understanding management integration.  The enhanced Telecom Operations Map, or eTOM, is a picture of the steps associated with creating, selling, sustaining, and terminating a service.  There are a number of eTOM references, depending on whether you are or are not a TMF member, but here’s a basic public vision.  eTOM is divided into levels or layers, and at the most detailed level it’s a pretty comprehensive picture of what has to be done from soup to nuts, service-wise.

In the real world, most eTOM activities are intermingled between human and automated tasks, and between operations and management tools (using my previous division of the two).  From low-level eTOM, one could almost picture service operations as a modular function, where different pieces might be implemented different ways and in different places.  As part of a service, NFV has to integrate in some way with eTOM.

How?  NFV, in the strict construction of the ETSI ISG, is a set of specifications that define how real network functions hosted in traditional devices could instead be deployed as cooperative software elements on some agile resource set.  The operative part of “NFV” that threatens the traditional management/operations model is the “virtual” part.  In effect, virtualization of any sort creates an intermediary.  We used to have customer-facing and resource-facing pieces, remember?  Well, now we have this “virtual” piece that might look like a resource from the customer side, a customer from the resource side, or all or none of the above.

In the ETSI E2E architecture, there is an implicit vision of how virtualization and management combine.  We have an Element Manager that’s almost cohabiting with VNFs and is responsible for management of the VNFs themselves in the “customer direction”.  We have a VNF Manager that is (via some intermediary elements) responsible for managing the resource relationships with the VNFs.  Presumably, though this isn’t stated explicitly, we have resource management tools and practices aimed at the NFV Infrastructure as a pool of devices.

IMHO, the ETSI activity has focused most of its specification work on the VNF Manager piece as the “management” approach.  This is consistent with what I’ve called a “black-box” view of network functionality.  A VNF is a function.  A function is managed as a function, not as a collection of chips (today) or software (under NFV).  What happens to make software into the manageable function we expect is largely the VNFM’s problem, and largely what ETSI worries about.  We could draw this out if you like.  Make a box all the way on the right and call it “traditional management/operations”.  Draw a box to the left of that with a bidirectional arrow connecting it, and call the new box “ETSI Element Manager”.  Continue working leftward with a box called “VNFs”, then one called “VNFM”, and finally one called “VIM/NFVI”, and you have the picture.

This picture doesn’t necessarily represent a break in any management model.  If we assume that the ETSI EM depicts the functional model of the underlying structure completely and accurately then we could substitute a VNF implementation for a real device 1:1 and nobody would care.  The devil is in the details.

Here’s an example.  We can horizontally scale components in NFV, right?  That’s supposed to be one of the benefits.  You don’t horizontally scale chips or devices on demand, so the current management model for Real_Widget wouldn’t have the properties of Virtual_Widget I’d like to sell, whatever a widget is.  However, I could in theory build a new Widget-MIB that had the fields necessary to represent incremental NFV functionality, and if my management system could contend with that extra data I’d still be fine.
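To make the Widget-MIB idea a little more tangible, here’s a minimal sketch of what such an extended virtual-device MIB might contain.  The structure and field names are entirely hypothetical; the point is only that the “real” fields carry over unchanged and the scaling-related fields are additions the management system would have to learn to read.

```python
from dataclasses import dataclass


@dataclass
class WidgetMib:
    # Fields inherited from the real-device MIB
    oper_status: str          # "up" / "down"
    packets_forwarded: int
    errors: int
    # NFV-specific additions for the virtual version
    instance_count: int       # how many horizontally scaled copies exist
    scaling_state: str        # "stable", "scaling-out", "scaling-in"
    host_pool: str            # which resource pool the instances run in


def record_scale_event(mib: WidgetMib, new_count: int) -> WidgetMib:
    """Update the virtual-only fields when the VNF scales; real fields are untouched."""
    mib.scaling_state = "scaling-out" if new_count > mib.instance_count else "scaling-in"
    mib.instance_count = new_count
    return mib
```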

Another issue less easily fixed is in the concept of FCAPS, which is traditionally seen as the high-level vision of “network management”.  All of the letters in the acronym represent something that had a single logical meaning in the old device days, but has two meanings in the world of NFV.  What’s a “fault”?  Is it a failure of the virtual device, meaning that we’ve exhausted the automatic remedies for replacement/reconfiguration of VNFCs that NFV might offer, or a failure of an underlying resource?

We could assume operations integration with FCAPS would work if we applied the acronym to the virtual world.  In the real world, downward to the resources, we have a problem of correlation because the relationship between resource faults and virtual device faults depends on how we’ve allocated resources and the extent to which we attempt automatic remediation.

Which raises the challenge of virtualization.  If we want operations to know about real problems, real resources, real capacities and cost accounting, then we have to dip below the virtual.  We have to somehow tie operations processes to the deeper reality.  That’s also true of management processes, because as we travel down the traditional service-network-element-management stack in a virtual world, we find there’s a basement, which is the virtual-to-resource mapping.

ETSI talks in general terms about operations/management relationships with the NFV software, but the interfaces for these are not defined, nor are there any solid rules for how the relationships would be structured.  The TMF has a good opening approach in its customer/resource-facing service model and in the NGOSS Contract notion (now part of GB942, the TMF Business Services Suite) of steering service events to suitable processes through the intermediary of a service contract data model, but the specifics of this aren’t really clear even for the real-device world, and that part of the TMF model is (according to my operator sources) rarely implemented.

In a standards sense, then, we’re not solving the problem yet.  Unfortunately, we can’t just ignore management integration because there will surely be no pure NFV service early on, and likely never a pure NFV service even down the line.  There are going to be legacy devices in networks for a very long time, likely forever.  Given that, and given that operations efficiencies and service agility aren’t very meaningful if you confine either or both to just a piece of a service, we need to harmonize management completely.  Here and there, federated and solo, NFV and legacy, applications and services, transport and connection.  Services to users have few boundaries even now, and management can’t have them either.

So here’s where I think we are.  There are only two ways to make a management connection from top to bottom.  One is to build “virtual-device MIBs” that could be based on current “real” MIBs but that would reflect data elements that represented any new service features, costs, or conditions that would arise in an NFV world.  We’d then have to populate these fields from real resource information as the service progressed through its lifecycle.  The other is to provide operations/management coupling through the virtual layer into the real resources.  My own work has always focused on the second of these approaches because I’m leery of having resources living behind a perpetual mask, but there’s no question that it would be easier to attack the former approach than the latter.

If this second approach is taken, then the service data model could be supplemented with the information collected when binding service components to each other and to resources.  These bindings could be traversed to dive into more detail on service state.  You could also, at any level of “object” in the model, describe the state/event relationships that would fulfill the TMF concept of mapping events to services.  It’s obviously more complicated, but if you did this you could define any current or newly developed operations process at any state/event intersection, and provide full integration of management components from top to bottom.
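Here’s a minimal sketch of that second approach, assuming a service model object that carries both its resource bindings and a state/event table.  The handlers, states, and field names are mine, purely for illustration, and are not drawn from any TMF or ETSI schema.

```python
# Each object in the service model maps (state, event) pairs to the operations
# process that should run; bindings can be traversed for more detail.

def notify_billing(obj, event):
    print(f"bill adjustment for {obj['name']} after {event}")

def redeploy(obj, event):
    print(f"redeploying {obj['name']} in response to {event}")

def escalate(obj, event):
    print(f"escalating {obj['name']} to the NOC for {event}")


SERVICE_OBJECT = {
    "name": "vpn-core",
    "state": "active",
    "bindings": ["vnf:firewall-3", "host:dc1-server-12"],   # traversable for deeper detail
    "state_event_table": {
        ("active", "resource-fault"): redeploy,
        ("active", "sla-violation"): notify_billing,
        ("degraded", "resource-fault"): escalate,
    },
}


def handle_event(obj, event):
    handler = obj["state_event_table"].get((obj["state"], event))
    if handler:
        handler(obj, event)


handle_event(SERVICE_OBJECT, "resource-fault")
```

The appeal of this structure is exactly what the paragraph above describes: any current or newly developed operations process can be dropped into a state/event intersection without changing the model itself.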

We have to do either a virtual-device-MIB or data-coupled management model; I don’t believe any other options even exist.  Unfortunately, I don’t think we have a convincing model for either in place; not in the ETSI ISG or TMF.  So I’d like to see operators and vendors cooperating (perhaps even in PoCs and lab trials) to explore the consequences of each approach and the alternatives for implementation.

The Three Paths to NFV Victory (and the Risk of Detours)

NFV is turning out to be a lot more complicated than it first appeared, and that’s particularly true in the area most critical to vendors—the business case.  While the question of making a broad business case for NFV is weeding out a lot of secondary players, it’s not deciding a market leader yet.  In fact, it’s not even clear what a winning strategy will be.  We have three options out there, and now’s a good time to look at them.

What most operators want from NFV is what I’ll call a systemic model for deployment, something that can justify a broad commitment to NFV (and almost always SDN, collaterally) and bring NFV to the largest collection of services and customers.  The average operator I’ve talked with thinks that systemic NFV could touch as much as 75% of all customers.

In order for systemic NFV to work, you have to be able to deliver operations efficiencies and service agility, because operators (particularly CFOs) say that capex improvements won’t create enough momentum or even fully justify NFV complexity.  That means you need to extend ETSI-modeled NFV both into legacy infrastructure and into OSS/BSS orchestration.  You also have to be able to host a large number of diverse VNFs.

From a sales perspective, systemic NFV is definitely a “hang in there” proposition.  The sheer scope of the success goal means that nearly everyone who signs anything will have to sign off on systemic NFV.  It will touch every piece of the network, every major vendor, every craft practice and operations/management software tool.  It’s also so big that it’s hard to grasp it, and many proponents of this model are trapped in small-scale on-ramp projects that might or might not lead to a realization of the broad goal.

The second approach to NFV success is the magic bullet model.  Rather than trying to build up NFV to a broad base through a wide range of services, magic-bullet proponents seek to identify a killer app, a single service that has so profound a benefit case that it can carry NFV into deployment by itself.  Once this app has greased the NFV skids, other applications can then follow along.

Magic bullets, to succeed, have to be both accurate enough and massive enough, and that’s the current rub.  The obvious candidate for a magic-bullet attack is mobile services, because mobile infrastructure is still the capex focus.  It’s easier to deploy a new technology where money is still being spent on a large scale than to displace already-bought stuff elsewhere.  The question is whether mobile is the right target.

The risks of mobility lie in extensibility in the service domain.  Yeah, we can apply NFV to manage costs in mobile networks, and perhaps even to improve operations efficiency, but service agility goals demand services that don’t yet exist.  IMS and EPC are candidates for early NFV exploitation, but they’re specialized multi-tenant applications.  Services built to demonstrate agility would have to be built both on IMS and on NFV to be relevant, and right now we use both IMS and NFV only for efficient hosting—we don’t have a model of service-building.

The third NFV strategy is what I’ll call (given my penchant for quoting old poems and music) the September-Song approach.  “…I let the old earth take a couple of whirls…” is the relevant theme.  Septemberish NFV advocates are essentially saying that NFV is inevitable, that somebody will hit on the magic formula for deployment.  That somebody will then spawn explosive NFV growth, which will create an explosive growth in demand for something NFV consumes a lot of—servers, data center switches, software licenses.

If you’re a platform (server, OS) vendor, there’s something to be said for the wait-and-see approach, because 1) you don’t have to go out and create and merchandise a full NFV solution and 2) you don’t face the risk of alienating the players who do manage to make a business case.  It’s a kind of arms-merchant approach to the NFV wars, because you have something everyone will need.

The obvious problem the Septemberists face is the risk that an NFV magician who’s able to make the business case will sell servers and software too.  That could happen both for a systemic NFV player or a magic-bullet player, and the result would be that Septemberists would have to fight their way into a deal whose business case is under the control of another vendor.

We’ve had examples of competitive evolution for all these approaches recently, which I think proves that none of them are off the table yet.  That also means none are winning convincingly.

HP is the paramount player in the systemic camp.  Their OpenNFV has legacy device orchestration, OSS/BSS and NMS integration, a strong ecosystem, a good on-boarding model, and good engagement in a variety of trials to prove out service breadth.  Their problem has been that they’ve become perhaps a bit obsessed with the trials and have underplayed their systemic assets.  That’s easy to do because it’s hard to make something like NFV operations efficiency exciting.  In the service agility area, services are VNFs and you can’t be seen to favor a given partner if you’re a partnership-driven ecosystem.

HP doubled down on VNF partnerships this week with a big NEC announcement.  One thing this shows is that larger players like NEC see HP as a viable platform going forward, an endorsement that’s likely to play well with operators and with other prospective partners.  But the press release on the deal didn’t mention any specific services, which means that it doesn’t add a lot of near-term impetus to HP’s drive.

In the magic-bullet class of NFV player, Alcatel-Lucent has been making news through NFV-ready IMS and IMS-related offerings.  A highly focused mobile drive has given Alcatel-Lucent a presence even in accounts where another vendor (HP, for example, with Telefonica) already had a win.  Alcatel-Lucent has, in its Rapport collaboration framework, an application platform to facilitate service creation that’s NFV- and IMS-compatible, and so it addresses the limitations of early mobile-service targeting I noted above.

The challenge is that platforms do not a service make.  IMS has been a theoretical platform for rich communications services for a decade and it’s not killing off OTT competitors.  Part of the problem is that it’s not entirely clear what platform capabilities Alcatel-Lucent’s Rapport and IMS actually bring to service developers or VNF developers, nor is it clear how NFV and IMS cooperate to be greater than the sum of the parts.  Alcatel-Lucent needs to make all that clear.

The Septemberist giant is of course Intel.  An optimal deployment of NFV could generate over a hundred thousand new data centers, ten times that number of new servers, and a heck of a lot of new CPU chips.  Intel is the clear leader to pick up the NFV financial marbles because they’re a part of almost any credible winning strategy.

To address some of the risks on the platform side, Intel has been pushing its Wind River Titanium Server strategy, and recently won a Nokia validation that might signal a firm link with the leading magic-bullet player—Alcatel-Lucent—when/if the Alcatel-Lucent/Nokia deal closes.  Wind River is also a platform partner for systemic leader HP.

For all of this, Intel still hasn’t taken a step toward making the business case.  Yes, they have the right hardware to deploy NFV.  Yes they have the right software platform.  They don’t contribute much to the direct business case, though, and so they are still at risk for a slow-roll NFV that undershoots potential, or the introduction of a competitor who is able to take advantage of the slow roll to get into the game.

So where does this leave us?  I think that it will be difficult, though not impossible, for any player—even HP—to make a pure systemic run at the NFV opportunity.  It’s probably too late to socialize that complex story, though I think they need to try.  I think that mobile is going to be hard to use as a truly universal magic bullet because it doesn’t hit enough operators, and doesn’t hit hard enough to push universal adoption, unless you build a service framework on it.  And I think that waiting for somebody else to win and hoping to ride on their coat-tails is always an unacceptable risk.

Something evolving from mobile has to be the answer, and I think that “something” is the always-overplayed Internet of Things.  All three of our giants are trying to come to terms with a real IoT architecture, and I think that whoever wins it can win NFV too, as long as they make what should be the obvious connections.  That would create a truly massive win, perhaps the largest in networking since the early days of IP convergence.

Putting the ETSI NFV Architecture Through a Hypothetical Scenario Set

Hopefully your interest in NFV management prompted you to read yesterday’s blog and you’re ready to follow up.  If not, you may want to review it before you read this one because I’m building on the last one with only a very brief level-set!

Let’s assume we have a VNF with four components, one of which is horizontally scalable and sits in front of the other three, which are in line.  You can draw this out as four boxes left to right, with the user on the extreme left and the “service” interior on the right.  This is deployed on a subnet with private IP addresses (the usual 192.168.1.x sort).  The leftmost VNF has a port exposed via some Google-Andromeda- or Amazon-Elastic-IP-like mechanism, which for brevity here I’ll just call “SuperNAT”.  Similarly, the rightmost has an exposed port for service connection.
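To keep the example straight, here’s a minimal sketch of that hypothetical service written as a simple descriptor.  The structure and field names are mine, not drawn from the ETSI templates, and the public addresses are illustrative.

```python
# Illustrative descriptor for the hypothetical four-component service.
SERVICE = {
    "subnet": "192.168.1.0/24",
    "vnfs": [
        {"name": "on-ramp", "addr": "192.168.1.10", "scalable": True},
        {"name": "stage-2", "addr": "192.168.1.11", "scalable": False},
        {"name": "stage-3", "addr": "192.168.1.12", "scalable": False},
        {"name": "stage-4", "addr": "192.168.1.13", "scalable": False},
    ],
    "exposed_ports": {
        # "SuperNAT" bindings: private address/port mapped to an externally visible one
        "user-side":    {"vnf": "on-ramp", "private": "192.168.1.10:5000", "public": "203.0.113.5:5000"},
        "service-side": {"vnf": "stage-4", "private": "192.168.1.13:5001", "public": "203.0.113.6:5001"},
    },
}
```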

Let’s assume that we have a lot of load on our VNF on-ramp element.  The first obvious question is how we know that.  In the ETSI model we have Element Managers (EMs) that are associated with the VNFs, and we also have a VNF Manager.  It would seem logical that if the VNFs themselves were capable of understanding their own load profiles, EMs could communicate a need for scaling.  If not, it would have to come from “outside”, meaning that the state of actual network and hosting resources might be used.

Whatever the source, scaling would have to be initiated as a lifecycle process, and the VNFM would drive the VIM to allocate additional instances.  That much is clear.  What is less clear is what would happen in a case like my example here, where in order to support multiple instances of our head-end VNF we’d likely need to load-balance.  We now need an additional functional component not part of the original picture.  How does that get instantiated?  Normally, the NFV Orchestrator would be responsible for this sort of decision.  Remember that we have a coordinated need to deploy the load balancer and to then reconnect the front-end elements, including the connection to the user.  (Note that for service availability reasons we might presume that every service with scaling had a predefined load-balancing element in the configuration to prevent interruptions during this reconnect phase.)
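Building on the descriptor sketched above, here’s a minimal illustration of that scale-out sequence, including the point at which a load balancer has to be introduced and the user-facing connection repointed.  The VIM here is a stub, and the calls and field names are hypothetical, not ETSI-defined interfaces.

```python
class StubVim:
    """Stand-in for the Virtualized Infrastructure Manager used in this sketch."""
    def __init__(self):
        self.next_host = 100

    def allocate_instance(self, name):
        self.next_host += 1
        return f"192.168.1.{self.next_host}"   # new address on the service subnet


def scale_out(service, vnf_name, vim):
    """Add an instance of a scalable VNF; on the first scale-out, insert a
    load balancer and repoint the exposed user-side port at it."""
    vnf = next(v for v in service["vnfs"] if v["name"] == vnf_name)
    vnf.setdefault("instances", [vnf["addr"]]).append(vim.allocate_instance(vnf_name))

    if "load_balancer" not in vnf:
        # A new functional component not in the original service graph; in the
        # ETSI model this is presumably an Orchestrator-level decision.
        vnf["load_balancer"] = vim.allocate_instance(vnf_name + "-lb")
        service["exposed_ports"]["user-side"]["private"] = vnf["load_balancer"] + ":5000"

# scale_out(SERVICE, "on-ramp", StubVim())   # SERVICE is the descriptor from the earlier sketch
```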

Faults are more complicated.  If something in NFVI breaks, then we have two possible paths forward.  First, we could assume that the fault would be recognized by the VNFs themselves based on conditions that would be visible to them on their interconnected pathways.  So if a VNF fails in our hypothetical service, the VNFs connected to it would presumably recognize the problem.  The other possibility is that the fault would be recognized by the infrastructure management systems, whatever they are.

The ETSI spec suggests that a VIM could notify a VNF manager of “changes in state”, and one might presume that this would mean that the VNFM could either undertake to replace the item on its own, or could notify an EM in the VNF, which would then start remediation.  It seems to me that if you have VNFMs and EMs in the picture, you have to let both of them know what’s being managed.
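
A sketch of that fault path might look like the following.  Which box actually owns each step is exactly the ambiguity I’m describing, and the calls here (vim.instantiate, element_manager.notify, and so on) are placeholders I’ve invented for illustration.

```python
def on_vim_state_change(event, vim, element_manager, service_model):
    """Hypothetical handler, run by the VNFM, for a VIM "change in state" notice."""
    failed = service_model.find_component(event["resource_id"])
    if failed is None:
        return  # the failed resource isn't part of this service

    # Keep the VNF-level view accurate by telling the EM...
    element_manager.notify({"event": "component-failed",
                            "component": failed.name})

    # ...and let the VNFM drive replacement through the VIM.
    replacement = vim.instantiate(image=failed.name, subnet=service_model.subnet)
    service_model.replace(failed, replacement)
```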

In a fault situation, we’re assuming that everything in real remediation terms is getting done by the VNFM, just as we did in the scaling example above.  That’s reasonable given that the VNFM seems to have all the parametric data on the service, but it kind of makes the Orchestrator look like a rump function.  I’d like to see a model where all of this was collected into a single software structure; I think it’s going to be difficult to build something with the separation of functions and the exposure of interfaces that the ETSI model defines, given the ambiguity of roles.

You can see the security ambiguity I talked about in the last blog more clearly now.  VNFMs have the ability to command resources, which means that to control a VNFM is to control infrastructure, at least in some sense.  The challenge here is that if the VNFM is specialized to the service itself, meaning we have per-service or per-VNF VNFMs, or even just if we have proprietary VNFMs, we’ve relaxed security on the network.  A single service, or worse a single service instance and its associated customer, has the ability to call on infrastructure.

I understand that this could in theory be prevented, meaning that you could “authenticate” requests.  The problem is that it’s hard to know what’s authentic.  Remediation or scaling requests additional resources, which obviously impacts the shared pool.  Under what conditions does a VNFM have the right to do that?  Who enforces the conditions?  If we say that the VNFM enforces its own security, we’ve just justified having no security at all, because that principle runs afoul of all the firewall and management integrity checks traditionally built into networks.
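
One way to picture what enforcement might look like is a policy check that sits in front of the VIM, outside the VNFM itself.  This is purely an assumption on my part; nothing like it is defined in the spec, which is part of the problem.

```python
def authorize_resource_request(request, service_quota):
    """Return True only if this VNFM may consume more of the shared pool.
    The request fields and the service_quota object are illustrative."""
    if request["vnfm_id"] not in service_quota.trusted_vnfms:
        return False   # only registered VNFMs may ask for resources at all
    if request["requested_instances"] + service_quota.used > service_quota.limit:
        return False   # per-service cap keeps one service from draining the pool
    return True
```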

Then there’s operations integration.  In scaling, we’re spinning up additional VMs; in remediation, we’re replacing components because of a fault that could very well create an SLA violation.  It’s hard to see how both of these conditions wouldn’t have to be reflected, at a minimum, in three specific places: the service model for NFV, the network operations center, and the OSS/BSS.

Even for “NFV operations”, we have to maintain an accurate model of the service or we can’t respond to future change requirements.  Imagine the challenges of fault management if scaled components didn’t show up in the service model.  How does that model get updated, though?

I also wonder how a NOC finds out about a problem with a VNF.  You could say that overloading of a VNF isn’t a NOC problem, but if NOCs are still expected to respond to customer complaints, how do they see the conditions that the NFV service itself is responding to?

Then there’s how this gets integrated with OSS/BSS.  If a customer calls and asks something about the service, can a customer service rep dig into the details of the current service model and state?  Right now there’s no interface expressed for that, or any specific detail on the model itself.

You might get the idea from this that I’m saying that NFV won’t work as described.  I’m not saying that, but I am saying that I doubt that the ETSI model could be taken literally.  Operators tell me that all of their PoC and lab work is building out from basic ETSI descriptions into what are essentially ad hoc extensions of NFV to cover all the bases.  That’s not necessarily a bad thing; innovation and multiple approaches can be valuable.  It does tend to negate the standards, though, because these innovations could very well be PoC-specific, service-specific, and vendor-specific, and thus generate a bunch of silos.

What’s needed here?  Well, the simple answer is that we need to define the abstractions themselves—the service models that MANO would use, the models that are used by MANO to drive the VIMs—and we need complete flow diagrams to describe explicitly what happens under the kind of conditions I’ve outlined here.  You can’t define an architecture without testing your structure with the kind of things it’s expected to handle.  That’s routine in software design.  It needs to be done with NFV, and quickly.
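
To show what I mean by defining the abstractions, here’s a toy service model for the hypothetical VNF in the scenarios above: the components, the scaling rule, and the fault policy are stated explicitly, so an implementation could be walked through the flows I described.  The field names are my own invention, not ETSI’s.

```python
service_descriptor = {
    "service": "hypothetical-vnf",
    "subnet": "192.168.1.0/24",
    "components": [
        {"name": "on-ramp", "image": "on-ramp", "min": 1, "max": 4,
         "scaling": {"metric": "load", "scale_out_at": 0.80, "scale_in_at": 0.30},
         "expose": {"port": 443, "via": "SuperNAT"}},
        {"name": "stage-1", "image": "stage-1", "min": 1, "max": 1},
        {"name": "stage-2", "image": "stage-2", "min": 1, "max": 1},
        {"name": "service", "image": "service", "min": 1, "max": 1,
         "expose": {"port": 8080, "via": "SuperNAT"}},
    ],
    # What happens on a fault, and who has to be told about it.
    "fault_policy": {"on_component_failure": "replace",
                     "notify": ["EM", "NOC", "OSS/BSS"]},
}
```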

NFV Management Discussion Phase One: NFV as a World of Subnets

NFV management has never been my favorite part of NFV, and I’ve groused about it here fairly regularly.  It’s probably time to talk about the issues in more detail, and so I’m going to do an as-yet-undetermined number of blogs in a series about the issue.

To get this straight, we have to set the stage.  NFV presumes that virtual network functions (VNFs) are collections of components that are hosted and connected during the deployment process by the NFV Orchestrator.  The management, meaning lifecycle management, of this collection is the responsibility of the VNF Manager or VNFM.

VNFs would have to be collected in some sort of subnetwork, and this is shown in the ETSI End to End Architecture Document’s Figure 3.  The easiest way to think of this would be as an IP subnet, though no specific reference to a network structure is provided in the document.  I’m presuming one here because it’s difficult to talk about management issues when you don’t have any specific way to reference the things you’re managing.

In our hypothetical NFV subnet we’d have a bunch of hosted software components (VNFCs) that are linked somehow.  The ETSI material calls the relationship a forwarding graph, but I’m not sure that doesn’t presume simple linear service chains.  Even if you had chains, you’d need something to chain through, meaning a network service that offered connectivity to the components.  Using this, the elements inside a subnet would all be able to talk to each other, presuming they had an address reference.  Our components will also have to be visible to the real world, at least in part.  The ETSI Figure 3 shows endpoints connected to VNF1 and VNF3, which presumes that these endpoint connections on the VNFs are “visible” in the service address space, outside the VNF subnet.

Security, compliance, and sanity seem to dictate the presumption that our subnet is based on something like an RFC 1918 address space (assuming IPv4), so the VNFs would all be invisible to the outside world.  To make some of the ports on VNFs visible, we’d have something like NAT to translate between one of the private addresses and a public address.  We’d also need a DHCP function to assign addresses and a DNS to allow the components to see each other.  If we do this, then every VNF lives in its own private universe, secure from visibility to other VNFs and even to its own service address space.  It shares only ports it is designed to share, and only to other subnets it specifically elects to link with.

So where we are is as follows:  Something like Figure 3 would probably be set up by defining a private subnet with its own DNS and DHCP, and with NAT to convert between the internal addresses it wants to publish in the service data space, and addresses in that space.
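
As a sketch, those setup steps might look like this; the helper calls (create_subnet, add_dhcp, add_dns, add_nat_rule) are invented for illustration rather than drawn from any real API.

```python
def deploy_vnf_subnet(net):
    subnet = net.create_subnet("192.168.1.0/24")   # RFC 1918 private space
    net.add_dhcp(subnet)                           # assign component addresses
    net.add_dns(subnet)                            # let VNFCs find each other
    # Publish only the ports the VNF is designed to share into the service
    # address space; everything else stays invisible from outside.
    net.add_nat_rule(public="203.0.113.5:443", private="192.168.1.10:443")
    net.add_nat_rule(public="203.0.113.6:8080", private="192.168.1.13:8080")
    return subnet
```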

We’re not done with subnets, though.  We have to be able to deploy this stuff, right?  Thus, we have to presume that there is an infrastructure network where all of the resources live.  We also have to assume that either in this network or in yet another network we have all the elements of NFV software, which means MANO, EMSs, and OSS/BSS connections (the actual OSS/BSS could be elsewhere but we have to be able to reach it).

You’re probably wondering why I’m getting into this, and the answer is that the framework we presume has to be there to deploy NFV will also have to support management of NFV services once they’re deployed.  We have to be able to harmonize the role of VNFM within this structure, and if we have any issues we have to get them addressed.

Management starts with the presumption that there is, included with a VNF, an Element Manager that performs all the VNF’s typical management functions.  This EM links with the VNFM for resource information and to provide lifecycle management.  The VNFM would go to the Virtual Infrastructure Manager to deploy something, such as a scale-out.  However, it also appears that the NFV Orchestrator goes to the VIM for deployment.  To start with, wouldn’t it be logical to say that “deployment” was a lifecycle stage?  Yes, but if an EM has to request lifecycle management, that can’t happen until the EM, which presumably co-deploys with the VNF, is actually deployed to do the requesting.
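
In other words, the ordering seems to have to be something like the sketch below: the Orchestrator drives the initial deployment through the VIM, and only once the VNF (and its EM) exists is lifecycle control handed to the VNFM.  The calls here are placeholders, not defined interfaces.

```python
def initial_deploy(orchestrator, vim, vnfm, descriptor):
    # The Orchestrator does the first deployment; there's no EM yet to ask for it.
    subnet = vim.create_subnet(descriptor["subnet"])
    for component in descriptor["components"]:
        vim.instantiate(image=component["image"], subnet=subnet)
    # Only now is there an EM to talk to, so lifecycle control can pass to the VNFM.
    vnfm.register_service(descriptor, subnet)
```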

Apart from this we have some challenges of addressing and security.  It’s reasonable to assume that the EM talks to the VNFM through one of those NATted interfaces, so we can at least make the connection.  As long as there’s some record of the EMS address so that the VNFM could contact it, presuming it needs to, we are fine.

An issue arises if we look deeper into the VNFM proposal.  There’s a goal of supporting multiple VNFMs, so that VNF-specific VNFMs could be offered.  The reason given is that the task of lifecycle managing a VNF could be pretty specialized, and that may be true.  However, we now have to look at the addressing and security issues.

If a VNFM is provided by a VNF vendor, where does it live?  You have three options.   You can put it inside the VNF, inside the private subnet where MANO and the rest of the software lives, or in some third disconnected subnet.  What are the implications?

If you put the VNFM inside the VNF then we’re letting VNFs manage their own lifecycles, allocate resources, etc.  We have to give the VNFM a link to the VIM, which means that the VNF can “see” infrastructure directly and control real resources.  I think this is a serious security and stability problem.

If we put the VNFM inside the MANO subnet, we’re letting vendors add service-specific software inside NFV’s control software, where there are no barriers to what it could do.  That is IMHO a far worse issue with security and stability.

If we put the VNFM in its own subnet we’re still giving that subnet access to a VIM, and while that could be made more secure than the first (and second) options, it’s still not ideal.  The VNFM still can directly control resources.

My conclusion here is that we need to be looking at NFV deployments like cloud applications in a multi-tenant world.  Amazon and Google both provide a mechanism much like the one I’ve described, creating subnets where components are hosted using private IP and then using NAT or “elastic IP” addresses to map to addresses visible outside.  We have to be able to draw a picture of an NFV deployment as a set of IP subnets which interconnect in some way.  Google offers such a picture in its own Andromeda architecture.  If we can draw the subnet structure of NFV, we can at least see what the connections between private spaces and public spaces are, and whether those connections are sufficient to permit NFV software to function as needed and secure enough to be acceptable to operators.

Functionality is yet another matter.  The best way to look at how this would work is to look at a deployment, a horizontal scaling, and a fault response.  That’s what I’ll do in later blogs.

Could SDN or NFV Save Us From Massive Outages?

Since the dual United Airlines and NYSE outages I’ve gotten a lot of email about the stability of new network architectures.  While I don’t have any special insight into those incidents and so can’t (and won’t) speculate on how they were caused or how they could be prevented, I do have some experience with network outages.  Are SDN and NFV going to make things easier for us, harder, or what?

The core reality of networking today is that it’s adaptive in operation.  Devices exchange information via control protocols, and the propagation of this information establishes forwarding rules that are then used to move packets around.  At the same time, management information moves to and from specific devices using the same data paths and forwarding rules.  The adaptive nature of today’s networks makes them vulnerable in two distinct ways.

First, a device could be cut off from the rest of the network by a failure.  In this case, the device wouldn’t be accessible to management commands and thus could not be controlled, at least not until a pathway to that device was restored.  If the device had been contaminated by bad parameterization or software, the problem might prevent paths from ever being established, in which case you’d have to send someone to manually fix things (or provide an out-of-band management connection to every device).

The second issue is the bad apple problem.  You know (maybe) the old saying that “one bad apple spoils the barrel.”  The fact that devices in a legacy network derive most of their knowledge of topology and status from an exchange of information with adjacent devices means that a single contaminated device could contaminate the whole network with bad information.  In most cases this means that the device advertises “false” routes that are suboptimal or perhaps can’t even work, but it might also mean that the device floods partners with nonsense, ignores requests, and so forth.

Both these problems tend to happen for two reasons.  First, the device is parameterized incorrectly, meaning that there’s a human error dimension.  The largest network outage I ever saw in my career, which took over fifty sites down hard for over 24 hours and caused failures at a quarter or more of the sites at any given time for a week, was caused by a parameter error.  The second reason is a software problem.  We’ve all heard about how a software update to a device can cause it to behave badly with its neighbors.

Logically, the questions to be asked for NFV and SDN are, first, how susceptible they’d be to the current pair of issues and second whether there might be new issues arising.  Let’s look at those points.

We have to set the stage here.  In SDN, we have a number of models of service in play these days.  Classic OpenFlow SDN says that white-box switches have their forwarding managed entirely by a central SDN controller.  In some cases classic SDN is overlaid on legacy forwarding, meaning that there’s still adaptive topology management being done by the device but explicit forwarding control via OpenFlow is possible.  Some other models (Cisco’s preferred approach) would utilize legacy adaptive behavior completely and use policies to add software control over the process.

In any model that retains adaptive behavior, we have the same risks that we have today.  If the model adds central SDN forwarding control, then we add the risks that such control might add.  Primarily, the risk of central control is the failure of the central controller.  If a controller fails, then it can’t respond to network conditions.  That doesn’t mean the network fails, only that it can’t be changed to reflect failures or changes in traffic or connection topology.  The big question for an SDN controller, IMHO, is whether it’s stable under load.  My biggest-network-failure example was caused by a parameter error, but the reason it exploded was that the error caused a flood of problems that swamped a central control mechanism.  When that failed under load, everything broke, and since everything now had to be restored, the controller never came up.
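
The stable-under-load point can be illustrated with a generic pattern: bound the controller’s event queue and process it on a budget, so a flood of failure events degrades gracefully instead of collapsing the control point.  This is a sketch of the pattern only, not a description of how any actual SDN controller behaves.

```python
from collections import deque

class BoundedEventQueue:
    """Toy event queue for a central controller facing an event storm."""
    def __init__(self, max_pending=10000):
        self.pending = deque()
        self.max_pending = max_pending
        self.dropped = 0

    def offer(self, event):
        if len(self.pending) >= self.max_pending:
            self.dropped += 1          # shed load rather than collapse
            return False
        self.pending.append(event)
        return True

    def drain(self, handle, budget=100):
        # Process at most `budget` events per cycle so the controller keeps
        # responding to new conditions even while it works off a backlog.
        for _ in range(min(budget, len(self.pending))):
            handle(self.pending.popleft())
```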

Bad-apple device problems in SDN wouldn’t impact the topology and forwarding of the network, but if a device went maverick and didn’t do forwarding updates correctly or at all, the central controller might not “understand” that the route it commanded hadn’t really worked.  I’ve not yet seen a demo/test of a controller that involved checking the integrity of routes and perhaps flagging a device as being down if it’s not doing what it’s supposed to do.
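
What I’d like to see is something like the check sketched below: after commanding a path, read back each device’s actual flow state and flag any device that isn’t doing what it was told.  The read_flows and mark_down calls are hypothetical placeholders, not OpenFlow or controller APIs.

```python
def verify_path(controller, path, expected_flows):
    """Confirm that every device on a commanded path actually installed its flows."""
    for device in path:
        actual = controller.read_flows(device)
        missing = [flow for flow in expected_flows[device] if flow not in actual]
        if missing:
            controller.mark_down(device)   # treat the maverick device as failed
            return False
    return True
```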

The cutoff problem in SDN has the same kind of risk.  A device could be cut off because an adjacent device killed the route to the controller.  If the device is functional enough to do what it would presumably be designed to do (try other paths), and if other paths are available, you might still be able to restore contact.

Overall, my feeling is that purist OpenFlow SDN is less at risk from traditional adaptive-behavior-related outages, for the obvious reason that it relies on central control.  If the controller is designed properly, hosted reliably, and the devices are set up to deal with a loss of their path to the controller in a reasonable way, then I think you can say that classic SDN would be more reliable than legacy networks.

NFV is a bit more complicated.  NFV doesn’t aim at changing network control-plane behavior, so if you hosted VNFs that did switching and routing via NFV you’d simply substitute a software version of a device for an appliance version.  All the adaptive risks would be the same.  If you hosted SDN VNFs and centrally controlled them, you’d now have the SDN risks.  Where NFV is different is first in the issue of node reliability and second in the management plane.

Servers, data center switches, and intra-VNF paths in an aggregate configuration make NFV more complex and likely generate a lower MTBF than you’d have with a ruggedized appliance.  NFV could potentially have an improved MTTR because you could fail over components, but you’d see an outage in most cases.  We also don’t really have much data on how fast service could be restored and how an extensive failure like a data center drop would impact the ability to even find alternative resources.  Thus, it’s hard to say at this point just what NFV will really do to network availability.

On the management side it’s even more complicated.  In traditional networks, management and data paths are handled equally, meaning that you have connectivity for both or for neither.  In NFV, the presumption is that at least a big chunk if not all of the management data is carried on a subnetwork separated from the service data paths.  It’s not unlike the SS7 signaling network of the old phone network (which we still have in most of the world).  If we presume that VNFs are isolated to secure them from accidental or malicious hacking from the data plane, we now have a subnet for every VNF and management connections within (and likely to and from) those subnets.  Because NFV depends on better remediation for its availability, rather than on ruggedized appliances, loss of management integrity could hurt it more.

The net for NFV is that we don’t know.  We have not built an NFV network large enough to be confident we’ve exposed all the issues.  We haven’t fully identified the possible issues, or tested them in credible configurations.  I think that it would be possible to build NFV networks that were less susceptible to both the bad-apple and cut-off problems, but I’m not sure the practices to do that have been accepted and codified.

The net, IMHO, is this.  If we do both SDN and NFV right, we could reduce the kind of outages we’ve seen this week.  If we do them badly, deploying either or both would make things worse.  Since we have far less experience managing SDN and NFV than managing legacy networks, that tells me that we have to be graceful and gradual in our evolution, or we’ll make reporters a lot happier with dramatic stories than we’ll make customers happy with reliable networks.