Stepping Up to, and Beyond, NFV

As we start to hear more about NFV from the application and services side, it’s becoming clear that there are different views of NFV insofar as its relationship with device/appliance networks, hosted functions, and the cloud.  From a benefits perspective it’s important to understand these differences because any specific NFV benefit can drive things forward only as far as the overall NFV model is accepted, and that will depend in part on its marginal utility, meaning how much better it is than the other alternatives.  And how well it does, if it gets going, will depend on the utility it can demonstrate toward the step beyond.

IMS and EPC are popular targets of NFV today.  Both arise out of the 3GPP evolution of mobile networking from circuit voice to packet multimedia.  Both specifications include functional elements like CSCF and MME and eNodeB that are typically mapped into appliances in current deployments.  What we’re now hearing are proposals to translate these to virtual functions to be managed and orchestrated by NFV principles.

But how, exactly?  What we see today is mostly a set of 1:1 mappings of IMS/EPC elements to software images, to be loaded in the places the devices were installed before.  If the goal of NFV is indeed what it was back in October 2012 when it kicked off—reduce capex by substituting hosting for appliances—then this is fine.  But operators have said many times that operations efficiency and service agility are really the goals.  So how far can NFV go to achieve them in the “mapped IMS/EPC” example?

We can do some things, of course.  We know from recent announcements that you can spin up additional EPC nodes to handle call loads, and that’s a good thing, but there’s an obvious question:  would we design EPC around the same functional divisions today, knowing what we know about building agile applications and services?  Probably not.  You can’t spawn pieces of a PGW or SGW device, for example, so you’d more likely create functional subdivisions specifically to optimize horizontal scaling of components.  Metaswitch’s Project Clearwater IMS shows that optimized functionality doesn’t necessarily fit the device boundaries of the past; you have to support the critical interfaces, but you can still do the interior stuff in a more modern and virtual way.

When you map 1:1 between appliances and hosted functions you’re really not doing much more than porting elements of IMS/EPC to servers.  Perhaps, if you support cloud hosting, you can say that you’re a cloud player, but to say you’re NFV you have to do something more dynamic than sticking an app in a server VM and leaving it there until it breaks, at least if you hope to drive NFV forward in a useful way.  So you can say that what differentiates the cloud model of network functions from the hosted model is resource dynamism, and what differentiates the NFV model from the cloud model is functional dynamism.  That’s why I think you have to assume functional modeling and orchestration are part of service-building in an NFV world.
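
To make “functional modeling” a bit more concrete, here’s a minimal sketch in Python of a service described as functions with scaling policies rather than as a fixed set of boxes; the names and fields are entirely hypothetical, not any standard descriptor format, but it’s the kind of thing an orchestrator could actually act on.

```python
# Hypothetical, minimal "functional model" of a service: the service is a set
# of functions with scaling policies, not a list of fixed boxes.  Names and
# fields are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FunctionSpec:
    name: str                  # e.g. "session-front-end" (illustrative)
    image: str                 # software image the orchestrator would deploy
    min_instances: int = 1
    max_instances: int = 1     # >1 means the function may scale horizontally
    scale_metric: str = ""     # telemetry that triggers scaling
    scale_threshold: float = 0.0

@dataclass
class ServiceModel:
    name: str
    functions: List[FunctionSpec] = field(default_factory=list)

    def scalable_functions(self) -> List[FunctionSpec]:
        """The pieces an orchestrator could treat dynamically."""
        return [f for f in self.functions if f.max_instances > f.min_instances]

# A 1:1 "mapped" EPC has nothing an orchestrator can do dynamically; a
# functionally modeled one exposes which pieces can be spun up on demand.
mapped_epc = ServiceModel("mapped-epc", [FunctionSpec("vSGW", "sgw-image")])
modeled_epc = ServiceModel("modeled-epc", [
    FunctionSpec("session-front-end", "fe-image", 1, 8, "sessions", 0.8),
    FunctionSpec("bearer-forwarder", "fwd-image", 2, 16, "throughput", 0.7),
])
print([f.name for f in mapped_epc.scalable_functions()])    # []
print([f.name for f in modeled_epc.scalable_functions()])   # both functions
```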

Functional dynamism is an expression of the variability of component relationships in a service.  You can get something highly dynamic because it has to scale or vary significantly for performance/reliability reasons, or because its actual functionality varies over a short period of time.  The former is a response to a combination of resource reliability and variation in workloads, and the latter is a function of the duration of the service relationship.

Functional dynamism is an important concept for NFV because it expresses the extent to which a service really benefits from virtualization of resources.  The range across which functional dynamism operates also determines just how sophisticated an NFV MANO function would have to be in order to optimize the binding of resources to services.  The fact is that no matter what you do to provide scaling, functional dynamism potential isn’t at its highest for multi-tenant applications like IMS/EPC for the simple reason that these applications aren’t instantiated per user but per service, and variability is somewhat dampened by the law of large numbers.  Yes, you’ll have macro events like a sporting event letting out, but these events happen on a schedule of days, not seconds.

If I’m right about functional dynamism, then it’s also fair to say that the popular “service chaining” applications may not be ideal for NFV either.  Even though business branch access isn’t inherently multi-tenant (you’re likely to see everyone with their own instance of a firewall, for example), the stuff is sold on a multi-year contract.  That means the components are likely to be put into place and stay there unless something breaks, so we’re back to almost the same variability profile as multi-tenancy.  And think about it.  How valuable is resource optimization and orchestration if you do it once a year?  You can’t save enough per occurrence to make much of a difference in revenue or cost, so there’s no driving benefit.

What all of this would mean is that the low apples for NFV may be too low; they may already be on the ground and so they don’t justify automating the “picking” process.  Yes, I know operators are all excited about the potential for service automation, but the benefit case for it is fragile if you focus your attention on things that are just not done that often.  That’s a good reason why you can’t afford to focus your MANO benefits on the part of a service that’s actually based on virtual functions—you won’t address enough in your automation efforts to change costs or improve agility enough to drive things forward.

But here’s the thing.  We’re forgetting IMS and EPC.  We don’t build an IMS/EPC instance for every call or even every customer.  They’re multi-tenant, remember?  Well, think on this point.  Suppose we make our functional relationships more and more dynamic, view services and applications as momentary ships passing in the IT night, created on demand?  Then as our dynamism increases, we reach a point where it’s less difficult to pass work among static instances of processes than to spawn processes for the purpose of doing work.   We’ve made the management workflow larger, the functional workflow smaller, and eventually they cross.  That’s why I think it’s important to think about the fusion of service logic and service management.  They’re inevitably going to fuse in a mass market, because we’ll address dynamism at scale by eliminating “provisioning” completely.
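
As a toy illustration of that crossover, compare the per-use cost of spawning an instance for each piece of work against handing the work to an instance that’s already running.  Every number here is an assumption chosen only to show the shape of the trade-off, not a measurement.

```python
# Toy numbers only: when does per-use provisioning overhead overtake the work
# itself?  Assume each unit of work takes 50 ms, spawning a fresh instance
# costs 2 s, and dispatching work to an already-running instance costs 5 ms.
work_ms, spawn_ms, dispatch_ms = 50, 2000, 5

for uses_per_instance in (1, 10, 100, 1000):
    per_use_spawn  = spawn_ms / uses_per_instance + work_ms
    per_use_static = dispatch_ms + work_ms
    better = "static instances" if per_use_static < per_use_spawn else "spawning"
    print(f"{uses_per_instance:>5} uses per spawned instance: "
          f"spawn={per_use_spawn:7.1f} ms  static={per_use_static} ms  -> {better}")
```

The shorter-lived the instances become, the more the spawning overhead (the management workflow) dominates the work itself (the functional workflow), and that’s exactly where the lines cross.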

NFV is the implementation of a critical step, a step toward function-defined workflows and dynamic associations of processes with activities.  That’s what mobility is, what the thing I’ve called “point-of-activity empowerment” is, and even what the part of the Internet of Things/Everything that isn’t just hype and nonsense is.  It’s important we know that, because it’s important that we judge NFV not just by what it is, but by what it will necessarily become.

Metaswitch Takes Another Big NFV Step

In the hype-ridden NFV space, Metaswitch has always stood out for me as a voice of both reason and experience.  As the founder of Project Clearwater, the open-source IMS project, they’re knowledgeable about cloud-based carrier functionality and open source at the same time.  As a member of the CloudNFV™ group, they are one of the few VNF providers who actually understand NFV in a complete ecosystemic context.  Now they have a new open-source project, Project Calico, that further extends both their credibility and the framework of NFV and large-scale cloud infrastructure.

Before I get into Project Calico I have to explain the issues it addresses.  Our conception of “cloud infrastructure” tends to be the simplistic vision of a glorious cloud universe that, when you look at it closely, is a bunch of servers running a host OS, hypervisor, guest OSs, and virtual functions, all under the control of OpenStack.  That makes NFV and the cloud little more than IaaS at the platform level, and that’s a problem in four dimensions.

Dimension one is something I’ve already blogged about.  If your goal is SaaS or NFV, the whole notion of IaaS is perhaps fatally stupid.  You’re going to set up a VM and load a guest OS for every piddling component of an application or feature of a service.  Many of these will have minimal utilization, so you’re wasting a ton of money, and you’re doing that in an application you think is saving you money.  What you’d really like for SaaS or NFV is something that lets you run a whole lot of very lightweight partitions, containers, into which you can stick something.
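
Here’s a back-of-envelope sketch of the waste; every number in it is an assumption I’ve picked purely for illustration, not a measurement.

```python
# Illustrative-only arithmetic: one guest OS per VNF component versus
# lightweight containers sharing a host kernel.  All figures are assumptions.
components_per_service = 6
services_per_server    = 20
guest_os_overhead_mb   = 512   # assumed RAM burned by each guest OS in a VM
container_overhead_mb  = 16    # assumed per-container runtime overhead

vm_overhead_gb        = components_per_service * services_per_server * guest_os_overhead_mb / 1024
container_overhead_gb = components_per_service * services_per_server * container_overhead_mb / 1024

print(f"VM-per-component overhead:  {vm_overhead_gb:.0f} GB per server")
print(f"Container overhead:         {container_overhead_gb:.1f} GB per server")
```

That overhead is paid whether the component is busy or not, which is the whole problem with minimal-utilization functions.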

The second side of the problem is that OpenStack itself has limitations.  It’s a single-thread implementation that, because of its scalability limits, can only control a domain of servers of some limited size.  How many?  It depends on a bunch of factors, but many operators I’ve talked to say that even a large data center may be too big for a single OpenStack instance.  In our future with a zillion distributed elements, OpenStack won’t scale.  So a carrier cloud hosting NFV would logically have to include multiple OpenStack domains, and services would somehow have to be connected effectively across them.

Another side of the problem is that OpenStack visualizes application/service deployment in terms of what could be called “service-specific subnets”.  You take a bunch of virtual functions, instantiate them on hosts, and link them onto a VLAN that is then gateway-connected to the service users.  The problem with this model is that while it’s fairly easy to spread IP networks across a bunch of data centers, it’s not as easy to spread VLANs that way.  To create a service hosted across multiple data centers, I’d likely need to instantiate a local VLAN per data center and link them at Level 3, which would probably make some of my inter-component links visible in the service address space, and that’s a security problem.

Then there’s management.  I’ve noted in prior blogs that you can’t expose the management interfaces of shared components to the management processes of service components or service users.  How do you avoid that?  If we have some kind of VLAN or subnet for the VNFs to live in, we’ll need to translate management data for access by users outside, and also somehow give those VNFs access to the management data of other components.  That means sticking something inside each VLAN to act as a management agent, and even doing that raises the question of how everything is addressed.

You can see that there’s a networking dimension to all of these problems, and it’s that dimension that Project Calico aims to address.  What Project Calico does most directly is to build a connection model for cloud components or VNFs that works directly at the IP level rather than at Level 2.  This model is based on a combination of BGP and an enhanced use of access control lists, and what it does is create what I’ll call a component network that appears to be as private as a VLAN would be, but is built at Level 3.  What makes it a component network is not that the address space is partitioned or private, but that the ACLs prevent components from talking “outside” or others from talking through the boundary to components that aren’t supposed to be exposed.  This takes the notion of cloud resource pools and hosting points out of the subnet and into the wide world of Internet and VPN—Level 3 of the ISO model.
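
Here’s a conceptual sketch of that idea in Python.  It is not Project Calico’s actual API or code; it just illustrates how isolation can come from what the rules permit rather than from a partitioned Level 2 network.

```python
# Conceptual sketch only (not Project Calico code): a "component network"
# enforced by ACL-style rules at Level 3 rather than by a separate VLAN.
from ipaddress import ip_address

class ComponentNetwork:
    def __init__(self, name):
        self.name = name
        self.members = set()    # IP addresses of the service's components
        self.exposed = set()    # member interfaces reachable from outside

    def add_member(self, ip, exposed=False):
        self.members.add(ip_address(ip))
        if exposed:
            self.exposed.add(ip_address(ip))

    def permits(self, src, dst):
        """Would the ACL allow this flow?"""
        src, dst = ip_address(src), ip_address(dst)
        if src in self.members and dst in self.members:
            return True         # intra-service traffic is open
        if dst in self.exposed:
            return True         # only exposed interfaces face outward
        return False            # everything else is blocked

svc = ComponentNetwork("vnf-service-A")
svc.add_member("10.1.0.10")                   # internal component
svc.add_member("10.1.0.20", exposed=True)     # the user-facing interface
print(svc.permits("10.1.0.10", "10.1.0.20"))  # True: inside the service
print(svc.permits("192.0.2.5", "10.1.0.10"))  # False: internal, not exposed
print(svc.permits("192.0.2.5", "10.1.0.20"))  # True: exposed interface
```

The interesting design point is that the address space itself never has to be carved up per tenant; the rules are the boundary.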

One impact of this is to make networking easier for lightweight containers.  It’s far easier to provide a container app, or even just a thread in a multi-tasking operating system, with an IP address than to provide it with a virtual Ethernet adapter.  Many container environments don’t even offer low-level VLAN support.  You also have less overhead because you’re not creating an overlay network to partition tenants, so there are fewer headers in the stack and fewer resources used in processing them.

Metaswitch is obviously not eliminating OpenStack with Project Calico given that the initial instance at least is an OpenStack plugin, but it can ease the burden of running multiple OpenStack instances.  It’s hard to orchestrate a service across OpenStack instances if the service is based on a VLAN; it’s much easier to do that if the service points are IP-addressable.  With Project Calico, you can arrange OpenStack domains to cover data centers or even collections of racks, and be pretty confident you can organize the connectivity when you cross those boundaries with service orchestration.

The whole issue of inter-component (intra-function or -service) connectivity might also be a lot easier with Project Calico’s help.  If we presume that all of the components of a service are “inside” a BGP area with ACLs controlling access, we can either pass or block outside access to any of the interfaces we build, and do it without worrying about subnet structures and the like.  In effect, each component could look a bit like a web server with access control in front of it.  Inside the area you can connect things one way, and at the same time block access from outside or from other tenants.

Which could be critical for management.  I remain convinced (despite the fact that so far the ETSI ISG hasn’t officially sanctioned the vision) that you have to virtualize management for NFV and for SDN.  That means that any real resource is buffered from management intervention by an i2aex-like repository, and that repository then creates a management view based on the properties not of the resources themselves but of the virtual functionality the resources collectively provide.  Making those management connections requires even more precise management of address access control than intra-function connection does, and Project Calico could provide it.
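
To illustrate what I mean, here’s a hypothetical sketch of that buffering repository; the class and field names are mine, not i2aex’s or any ETSI-defined interface.

```python
# Hypothetical sketch of virtualized management: resource telemetry is written
# into a repository, and management views are derived per virtual function, so
# no user or VNF touches a shared resource's management interface directly.
from collections import defaultdict
from statistics import mean

class ManagementRepository:
    def __init__(self):
        self.metrics = defaultdict(dict)    # resource_id -> {metric: value}
        self.bindings = defaultdict(list)   # function_id -> [resource_id, ...]

    def record(self, resource_id, metric, value):
        self.metrics[resource_id][metric] = value

    def bind(self, function_id, resource_id):
        self.bindings[function_id].append(resource_id)

    def function_view(self, function_id):
        """Derive a per-function view without exposing the resources."""
        loads = [self.metrics[r].get("cpu_load", 0.0)
                 for r in self.bindings[function_id]]
        return {"instances": len(loads),
                "avg_cpu_load": mean(loads) if loads else 0.0}

repo = ManagementRepository()
repo.record("server-17", "cpu_load", 0.62)
repo.record("server-23", "cpu_load", 0.48)
repo.bind("vIMS-cscf", "server-17")
repo.bind("vIMS-cscf", "server-23")
print(repo.function_view("vIMS-cscf"))   # the tenant sees a function, not servers
```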

What Metaswitch has done here, and what it did with Project Clearwater and IMS, is to look at NFV from the top down, visualize the ecosystem, and address specific issues that would impact multiple critical components.  In doing so they’ve raised issues that should have come up in formal NFV activity and didn’t.  You have to wonder whether we can make any official NFV process work top-down and ecosystemically.  I guess we’ll see.  If not, then Metaswitch may have to build all of NFV one open-source project at a time.

NFV PoCs, Top-Down Software, and the OPN

You’ve all probably heard about (if not read) “A Tale of Two Cities”, a story that in part portrays life as a tension between two poles.  Guess what?  We have that in NFV, and it will be interesting to see how it plays out.

Yesterday, I got a copy of the interim report for one of the NFV ISG’s PoCs, driven by Telefonica, Intel, Tieto, Qosmos, Wind River, and HP.   The PoC was called “Virtual Mobile Network with Integrated DPI” and the interim report was done very well—thoroughly documented with lots of statistical information, clear goals and proofs.  The focus, as you might suspect from the name, was the demonstration of virtualized EPC using NFV and with horizontal scaling triggered by deep packet inspection.

Today, we have a story in Light Reading about the initial meeting of the Open Platform for NFV (OPN) group, and a preparatory document that outlined the goals of the body.  The document suggests that the first priority for the new group will be defining the NFV Infrastructure (NFVI) and the “Virtual Infrastructure Manager” or VIM.

I have seen both documents and can’t share either one of them, but I can point out that the juxtaposition of the two is laying out the challenges NFV will have to meet along the road to deployment, even relevance.

The thing that I think shouts out from the PoC, which documents an actual test of something useful and interesting, is that the value of NFV is really a value generated by management.  Suppose we took all of the elements of EPC and simply hosted them on static servers.  We’d have “hosted EPC” but not NFV.  Even hosting the EPC elements on a virtual distributed resource pool would create only “cloud EPC”, not NFV.  What makes something “NFV” is support for service automation and management integration.

In the PoC, changes in network conditions detected by DPI are fed through a management process that can scale out instances of eNodeB to respond to changes in calling patterns and behavior, illustrated through a nice set of example scenarios regarding commuting and a mass event.  The PoC illustrates that you can indeed control performance through horizontal scaling driven by independent network telemetry, and that is a useful step.
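
Stripped to its essentials, the control loop implied here looks something like the sketch below; the thresholds, names, and scenario are illustrative assumptions of mine, not the PoC’s actual implementation.

```python
# Hypothetical control loop: DPI-derived load drives horizontal scaling of a
# virtualized EPC element.  Thresholds and names are illustrative only.
def scaling_decision(instances, dpi_load, max_instances=10,
                     scale_out_at=0.8, scale_in_at=0.3):
    """dpi_load: average utilization (0..1) that DPI telemetry reports."""
    if dpi_load > scale_out_at and instances < max_instances:
        return instances + 1     # spin up another instance
    if dpi_load < scale_in_at and instances > 1:
        return instances - 1     # retire an idle instance
    return instances

# A commuting-style scenario: load ramps toward a peak, then falls away.
instances = 1
for load in (0.2, 0.5, 0.85, 0.9, 0.95, 0.6, 0.25):
    instances = scaling_decision(instances, load)
    print(f"load={load:.2f} -> instances={instances}")
```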

It was not the goal of this particular PoC to frame a specification for NFVI, the stuff this all gets hosted on.  There are some conclusions about the need for data-plane optimization, an issue that has been raised by other PoCs, but I can’t find any indication the authors/sponsors believe that it is essential to frame a spec for NFVI.  The NFVI interfaces with orchestration through the VIM, and it does seem clear that whatever you do at the NFVI level should be abstracted by the VIM, which means that VIMs should be able to present a common vision of resources to the orchestration processes regardless of NFVI specifics.

But the big question about the VIM is what relationship it might have with other non-virtual elements in the resource pool.  An NFVI is only part of service resources—unless we think every single network device is going to instantly be fork-lifted into virtual form.  What would happen in the PoC configuration and scenarios if we had legacy components involved?  We might end up with management black holes, places where we needed to adapt the behavior of network elements that weren’t visible in the NFV world at all.

The point is that the OPN process is reportedly making its early focus the definition of a reference configuration for NFVI and an implementation of the VIM.  The current crop of PoCs provides some insight into both areas, but hardly a complete exploration of requirements.  I think it’s questionable whether a reference architecture or implementation for either NFVI or VIM can be done without addressing the higher-level question of how legacy network elements are integrated.  Can we define “service infrastructure” in its most general sense, and “infrastructure managers” that go beyond the virtual, based on what we know now, on what the only real NFV implementations we can call out (the PoCs) have shown?  I don’t think we can.
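
To show what a more general “infrastructure manager” might look like, here’s a hypothetical sketch; the interface and class names are mine, not anything ETSI or the OPN has defined.

```python
# Hypothetical sketch: an "infrastructure manager" abstraction broad enough to
# cover both virtual (NFVI) and legacy resources.  Nothing here is an
# ETSI-defined interface; the names are illustrative only.
from abc import ABC, abstractmethod

class InfrastructureManager(ABC):
    @abstractmethod
    def deploy(self, element_spec: dict) -> str:
        """Realize a service element; return a handle orchestration can use."""

    @abstractmethod
    def status(self, handle: str) -> dict:
        """Report element state in a form common to all resource types."""

class VirtualIM(InfrastructureManager):
    def deploy(self, element_spec):
        # would ask a VIM (e.g. an OpenStack domain) to host a VNF
        return f"vm:{element_spec['name']}"
    def status(self, handle):
        return {"handle": handle, "type": "virtual", "state": "active"}

class LegacyIM(InfrastructureManager):
    def deploy(self, element_spec):
        # would push configuration to an existing device rather than host one
        return f"device:{element_spec['name']}"
    def status(self, handle):
        return {"handle": handle, "type": "legacy", "state": "active"}

# Orchestration sees one interface regardless of what sits underneath.
for im in (VirtualIM(), LegacyIM()):
    handle = im.deploy({"name": "edge-firewall"})
    print(im.status(handle))
```

The design point is that orchestration never needs to know whether a service element was hosted or merely configured, which is exactly the gap a VIM-only definition leaves open.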

Back in late 2012, in response to the operators’ first NFV Call for Action white paper, I responded with a document that included the following quote:  “Experience in the IPsphere Forum (IPSF, now part of TMF), the Telemanagement Forum (TMF) and CIMI Corporation’s own open-source Java-based service element abstraction project suggests to us that the key to achieving the goals of the paper is a structured top-down function virtualization process.”  Most software architects and developers today would agree that we live in an age where top-down is the accepted software mantra.  Why then are we looking at the bottom of the NFV process first, in a project aimed at implementation?

The key to the value of the PoC I’ve been referencing is the fact that you can take network events and trigger horizontal scaling.  I think that goal clearly shows there is a need to visualize operations and management processes as the response to state/event transitions at the service and network level.  It also shows that while we can define a way to scale horizontally that fits a given application and a given event source, that approach could lead to an explosion of specialized operations software unless we structure the framework in which all this happens so that a common approach solves all the problems, for all the possible mixtures of legacy, SDN, and NFV technology.
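
One way to picture operations as a response to state/event transitions is a simple dispatch table, sketched below with purely hypothetical states, events, and process names; the point is that a single table-driven mechanism can front any mix of legacy, SDN, and NFV resources.

```python
# Hypothetical sketch of state/event-driven operations: lifecycle handling is
# a table lookup, so one mechanism can cover legacy, SDN, and NFV alike.
def scale_out(svc):  print(f"{svc}: adding an instance")
def reroute(svc):    print(f"{svc}: redeploying around a fault")
def ignore(svc):     pass

# (service state, event) -> operations process; entries are illustrative only.
dispatch = {
    ("active", "load-high"):     scale_out,
    ("active", "resource-fail"): reroute,
    ("deploying", "load-high"):  ignore,    # not meaningful in this state
}

def handle(service, state, event):
    process = dispatch.get((state, event), ignore)
    process(service)

handle("vEPC-region-1", "active", "load-high")       # -> scale out
handle("vEPC-region-1", "active", "resource-fail")   # -> reroute
```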

The most disquieting thing I’ve heard about the OPN activity was cited in the Light Reading article (quoting Mitch Wagner):  The document concludes: “A face-to-face inception meeting is being organized to take place June 30th to be hosted by CableLabs in Sunnyvale-CA. This meeting will be by invitation-only for those players indicating their strong interest in Platinum membership.”   Platinum membership?  This makes the OPN process look like a political activity driven by the big campaign contributors.  Yes, I know that things like OpenDaylight have been driven by premium vendor memberships too, but is OpenDaylight our example of how top-down design and development should be done?  I like OpenDaylight, but I think it’s going to need to be put into context to be successful.  I would like to think that the OPN activity, using PoC lessons and working top-down in accord with modern software practices, would create that context.  I’m worried that their process may not lead them to do that.

Cisco, SDN Competition, and the “Home Field Advantage”

Credit Suisse has some interesting data on the data center switching market, and I think it’s particularly interesting when you look at it in light of the overall weakness in IT spending that Gartner previously reported.  It also raises some points about SDN evolution and what can be expected there.

First, the data center is the hub of enterprise capital spending on network equipment.  In the enterprise space, everything else has been slaved, at least in a common-cause sense, to data center evolution.  Credit Suisse points out that data center evolution is now driving network connectivity needs higher; they’re expecting a steady migration up the Ethernet speed ladder.  The picture is fairly congruent with the one Gartner paints in that it pegs major changes in data center network spending to upgrades to faster Ethernet, making a 2015-2018 uptick in spending likely.

At the same time, Credit Suisse is looking at two specific vendors and how this would impact them.  One is (obviously) Cisco, who is the market leader and has the most exposure to both negative and positive trends in opportunity, and the other is Arista, who Credit Suisse argues is a technology leader in the space because of its EOS operating system.  It’s pretty obvious (given that Cisco is rated “underperform” and Arista “outperform”) who is expected to win.  It’s not that simple.

Pretty much everyone in the financial analyst space acknowledges that switching is a commoditizing market.  Network feature differentiation is tough enough in the carrier market where there are at least potential operations differentiators and even a few technical features to work with.  In the data center switch market it’s really about how cheaply you can push bits with gear that has a high MTBF.  It’s also about the value of the drivers of change, relative to the cost and risk.

If you’re an enterprise whose data center traffic is exploding because of big data or IoT or the cloud and virtualization or whatever else you think might be driving traffic growth, your evolution and its costs tie back directly to the drivers and the benefits they generate.  Big data is really about making better business use of analytics, so the extent to which analytics actually improve the business determines the pace of investment.  In my most recent survey, none of the popular drivers of IT growth were given even a 50% chance of creating significant benefits in the next two years.  If an enterprise has a big data project or an IoT project in mind, they’re struggling to get it to pay back its own investment.  Making it drag along a major upgrade in data center networking is definitely not desirable.

What this means is that enterprises will work to contain the impact of their data center changes to reduce costs during the period when benefits are first of all most tenuous and second are being used to justify primary IT changes.  That Gartner says IT spending in the data center is lagging suggests that even servers and platform software are under pressure to hold the line on costs.  An incremental approach to upgrading will always favor the incumbent.  Cisco, for the next three years or so, isn’t likely to lose significant market share to “feature competitors”; their risk would be competition from those competing on price.  Arista would have a hard time creating value around the improved agility of their EOS platform if their gear, added where something has become obsolete, is still surrounded by un-depreciated Cisco switch assets.

That’s also the challenge that SDN faces.  If 20% of data center switching is displaced in 2015 (a fairly good number, likely) then the goal of planners will be to get as much bang for their buck as they can, and to be sure that changes they make don’t force premature obsolescence of the other 80%.  To replace all the switches will require a much bigger benefit.  If feature value is hard to create within a data center network, that bigger benefit has to equate to a truly monumental price reduction.  The cheapest switch is the one you already own, as long as it’s not broken.

I like Credit Suisse but I think they’re wrong in their view of how the data center market could be wrestled away from Cisco.  You have to change the enterprise network model in a systemic way, create a benefit that’s impressive and that both requires and justifies a major upgrade in equipment.  Evolution won’t do because the only credible benefit that justifies switching vendors for a couple switches is that the new option is cheaper.  Arista and SDN have to think revolution, and to do that you have to step out into the total network and address the total problem.

In past blogs I’ve talked about the possibilities created by an explicit connectivity model of enterprise WANs, a model that presumes nobody is authorized to communicate and that then permits only that which is authorized.  Application-specific VPNs, linked with application subnets in the data center, could revolutionize connectivity security, DDoS protection, and more.  There are probably other revolutions out there, but this one will do as an example.  It has a lot of benefits because it has a lot of impact.  It could give SDN or Arista a win over Cisco.
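
As a sketch of what explicit connectivity might mean in practice (the application names and user groups are hypothetical, and this isn’t any vendor’s design), connectivity becomes a whitelist joining user groups to application subnets; anything not listed simply has no path.

```python
# Hypothetical sketch of explicit connectivity: nothing is reachable unless an
# application-specific grant says so.  Names and groups are illustrative only.
GRANTS = {
    "order-entry":  {"sales", "branch-retail"},
    "hr-portal":    {"hr"},
    "build-system": {"engineering"},
}

def may_connect(user_group: str, application: str) -> bool:
    """Default deny: a path exists only if it has been explicitly granted."""
    return user_group in GRANTS.get(application, set())

print(may_connect("sales", "order-entry"))        # True: explicitly authorized
print(may_connect("sales", "hr-portal"))          # False: no grant, no path
print(may_connect("guest-wifi", "build-system"))  # False: default deny
```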

That’s particularly true when you consider that an incumbent is always vulnerable to “benefit spread”.  If a data center switch and a cheap branch switch could change the whole dynamic of security, the combination could displace a lot of current security gear.  Gear that incumbent Cisco sells, which makes it less likely that Cisco would go in this new direction at all.  But if you shadowbox with Cisco on a switch-by-switch basis, you’ll always face the question “Why change for one switch?”

Every one of our technology refresh cycles or capacity upgrade cycles does offer an opportunity, though.  Having 20% of the switches on the line for replacement is more ecosystemic than having one.  If SDN or Arista or any other Cisco competitor is clever enough, they might be able to combine the momentum of the refresh cycle with some ecosystemic benefits and provide a reason to expand that 20% target.  And the more switches on the line for replacement, the smaller the incumbent’s natural home-field advantage.  Nobody is “home” in a new field.