How Can We Get to Modular Infrastructure for the Carrier Cloud?

In a blog last week, I mentioned the notion of an NFV infrastructure concept that I called a “modular infrastructure model”.  It was a response to operators’ comments that they were most concerned about the possibility that the largest dollar investment in their future—carrier cloud—would end up creating a silo.  They know how to avoid network equipment silos and vendor lock-in, but in the cloud?  It breaks, or demands breaking, new ground, and in several important ways.

Cloud infrastructure is a series of layers.  The bottom layer, the physical server resources and associated “real” data center switching, is fairly standardized.  We have different features, yes, but everyone is fairly confident that they could size resource pools based on the mapping of function requirements to the hardware features of servers.  The problems lie in the layers above.

A cloud server has a hypervisor, and a “cloud stack” such as OpenStack.  It has middleware associated with the platform and other middleware requirements associated with the machine image of the functions/features you’re deploying.  There are virtual switches to be parameterized, and probably virtual database features as well.  The platform features are often tuned to specific application requirements, which means that operators might make different choices for different areas, or in different geographies.  Yet differences in the platform can be hard to resolve into efficient operations practices.  I can recall, back in the CloudNFV project for which I served as the chief strategist, that it took about two weeks just to get a runnable configuration for the platform software, onto which we could deploy virtual functions.
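
To make the integration problem concrete, here’s a rough sketch in Python, with entirely hypothetical names, of how the platform layers might be captured as declarative profiles so that differences between regions or vendors show up before deployment rather than during weeks of integration.  It’s an illustration of the idea, not anyone’s actual tooling.

```python
# Hypothetical sketch: describe each carrier-cloud platform as a declarative
# profile so that differences between regions or vendors are visible up front,
# rather than discovered during integration.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PlatformProfile:
    hypervisor: str
    cloud_stack: str
    vswitch: str
    middleware: frozenset = field(default_factory=frozenset)

def platform_diff(a: PlatformProfile, b: PlatformProfile) -> dict:
    """Return the fields on which two platform profiles disagree."""
    diffs = {}
    for name in ("hypervisor", "cloud_stack", "vswitch", "middleware"):
        va, vb = getattr(a, name), getattr(b, name)
        if va != vb:
            diffs[name] = (va, vb)
    return diffs

# Two regions that made slightly different platform choices (invented values).
region_east = PlatformProfile("KVM", "OpenStack", "OVS",
                              frozenset({"DPDK", "ceph-client"}))
region_west = PlatformProfile("KVM", "OpenStack", "OVS-DPDK",
                              frozenset({"ceph-client"}))

print(platform_diff(region_east, region_west))
```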

Operators are concerned that while this delay isn’t the big problem with NFV, it could introduce something that is.  If Vendor X offers a package of platform software for the carrier cloud that’s capable of hosting VNFs, then operators would be inclined to pick it just to avoid the integration issues.  That, some fear, could be the start of vendor lock-in.

It appears that the NFV standards people were aware of this risk, and they followed a fairly well-accepted path toward resolving it.  In the NFV E2E architecture, the infrastructure is abstracted through the mechanism of a Virtual Infrastructure Manager (VIM) that rests on top of the infrastructure and presents it in a standard way to the management and orchestration (MANO) elements.

Problem solved?  Not quite.  From the first, there was a question of how the VIM worked.  Most people involved in the specs seemed to think that there was one VIM, and that this VIM would resolve differences in infrastructure below it.  This approach is similar to the one taken by SDN, and in particular by OpenDaylight, and it mirrors OpenStack’s own networking model, Neutron.  However, anyone who has followed Neutron or ODL knows that even for nothing more than connectivity, it’s not easy to build a universal abstraction like that.  Given that, a vendor who had a super-VIM might use it to lock out competitors by simply dragging out support for them.  Lock-in again.
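
To see why a single super-VIM carries that risk, here’s a simplified sketch (hypothetical names, loosely analogous to the driver or plugin models of Neutron and OpenDaylight): infrastructure without a supported driver simply can’t be used, however capable it is.

```python
# Illustrative only: a "super-VIM" that resolves infrastructure differences
# through per-vendor drivers.  A vendor whose driver lags, or is never added,
# is effectively locked out of the deployment.
class SuperVIM:
    def __init__(self):
        self._drivers = {}          # vendor name -> driver object

    def register_driver(self, vendor: str, driver) -> None:
        self._drivers[vendor] = driver

    def deploy(self, vendor: str, vnf_image: str) -> str:
        driver = self._drivers.get(vendor)
        if driver is None:
            # Unsupported infrastructure can't be used at all.
            raise RuntimeError(f"no driver registered for vendor '{vendor}'")
        return driver.deploy(vnf_image)

class VendorXDriver:
    def deploy(self, vnf_image: str) -> str:
        return f"deployed {vnf_image} on Vendor X infrastructure"

vim = SuperVIM()
vim.register_driver("vendor-x", VendorXDriver())
print(vim.deploy("vendor-x", "firewall-vnf"))
# vim.deploy("vendor-y", "firewall-vnf") would fail: no driver yet.
```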

An easy (at least in theory) solution to this problem is one I noted just a couple of blogs back—you support multiple VIMs.  That way, a vendor could supply a “pod” (my modular infrastructure model) represented by its own VIM.  Now, nobody can use VIMs to lock their own solutions in.

As usual, it turns out to be a bit more complicated.  The big problem is that if you have a dozen different VIMs representing different resource pods, how do you know which one (or ones) to send requests to, and how do you parcel out the requests for large-scale deployment or change among them?  You don’t want to author specific VIM references into service models, because that would make the models “brittle”, subject to revision whenever the infrastructure changed.  It could also make it difficult to do scaling or failover, if you had to reference a VIM that wasn’t in the service to start with.

There are probably a number of ways of dealing with this, but the one I’ve always liked and proposed for both ExperiaSphere and CloudNFV was the notion of a resource-side and service-side model, similar to the TMF’s Customer-Facing and Resource-Facing Services.  With this model, every VIM would assert a standard set of features (“Behaviors” in my terminology), and if you needed to DeployVNF, for example, you could use that feature with any VIM that represented hosting.  VIM selection would then be needed only to accommodate differences in resource types across a service geography, and it could be done “below” the service model.  Every VIM would be responsible for meeting the standard Behaviors of its class, which might mean all Behaviors if it was a self-contained VNF hosting pod.
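
Here’s a minimal sketch of that idea, with names that are mine rather than anything from the specs: the service model asks only for a Behavior of a given class, and a resolver below the model picks a concrete VIM by geography, so no VIM identity ever appears in the model itself.

```python
# Illustrative sketch of the resource-side/service-side split (hypothetical
# names).  A service model requests a Behavior class; VIM selection happens
# "below" the model, keyed by geography.
from abc import ABC, abstractmethod

class HostingBehavior(ABC):
    """Standard Behavior class every hosting-capable VIM must implement."""
    @abstractmethod
    def deploy_vnf(self, vnf_image: str, sla: dict) -> str: ...

class PodAVIM(HostingBehavior):
    geography = "us-east"
    def deploy_vnf(self, vnf_image, sla):
        return f"pod-A deployed {vnf_image} under SLA {sla}"

class PodBVIM(HostingBehavior):
    geography = "eu-west"
    def deploy_vnf(self, vnf_image, sla):
        return f"pod-B deployed {vnf_image} under SLA {sla}"

VIM_CATALOG = [PodAVIM(), PodBVIM()]

def select_vim(behavior_class, geography: str) -> HostingBehavior:
    """Resolve a Behavior request to a concrete VIM, below the service model."""
    for vim in VIM_CATALOG:
        if isinstance(vim, behavior_class) and vim.geography == geography:
            return vim
    raise LookupError(f"no VIM offers {behavior_class.__name__} in {geography}")

# The service model never names pod-A or pod-B; it only asks for hosting.
vim = select_vim(HostingBehavior, "eu-west")
print(vim.deploy_vnf("vFirewall-1.2", {"availability": "99.99%"}))
```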

All this is workable, but not sufficient.  We still have to address the question of management, or lifecycle management to be specific.  Every event that happens during a service lifecycle that impacts resource commitments or SLAs has to be reflected in an associated set of remedial steps, or you don’t have service automation and you can’t improve opex.  These steps could easily become very specific if they are linked to VNF processes, which is what the current ETSI specifications propose by having VNF Management (VNFM) at least partially embedded in the deployed VNFs.  If there is, or has to be, tight coupling between resources and VNFM, then you have resource-specific management and a back door into the world of silos at best and vendor lock-in at worst.
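
A rough sketch of the alternative, using invented event names: lifecycle events map to remediation steps that are keyed to Behaviors and SLAs, not to logic embedded in a particular VNF or tied to particular resources.

```python
# Illustrative only: a generic lifecycle handler.  Remediation is expressed
# against Behaviors and SLAs, so nothing here is VNF-specific or vendor-specific.
REMEDIATION = {
    "HOST_FAILED":  lambda ctx: f"redeploy {ctx['vnf']} via any hosting Behavior",
    "SLA_DEGRADED": lambda ctx: f"scale out {ctx['vnf']} to restore {ctx['metric']}",
    "LINK_DOWN":    lambda ctx: f"reroute connection {ctx['connection']}",
}

def handle_event(event_type: str, context: dict) -> str:
    """Map a lifecycle event to a remedial step, or escalate if none exists."""
    handler = REMEDIATION.get(event_type)
    if handler is None:
        return f"escalate: no automated remedy for {event_type}"
    return handler(context)

print(handle_event("HOST_FAILED", {"vnf": "vIMS-core"}))
print(handle_event("SLA_DEGRADED", {"vnf": "vCDN-edge", "metric": "latency"}))
```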

There are, in theory, ways to provide generalized management tools and interfaces between the resource and service sides of an operator.  I’ve worked through some of them, but I think that in the long pull most will fail to accommodate the diverse needs of future services and the scope of service use.  That means that what will be needed is asynchronous management of services and resources.  Simply put, “Behaviors” are “resource-layer services” that, like all services, offer an SLA.  There is a set of management processes that work to meet that SLA, and those processes are opaque to the service side.  You know the SLA is met, or has failed to be met, or is being remediated, and that’s that.
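
In code terms (again, purely illustrative names), that asynchronous split might look like this: the resource side runs whatever remediation it needs, and the service side sees nothing but the SLA state.

```python
# Sketch of asynchronous service/resource management: the service layer sees
# only the SLA state of a resource-layer Behavior; remediation is opaque.
from enum import Enum

class SLAState(Enum):
    MET = "met"
    REMEDIATING = "remediating"
    FAILED = "failed"

class ResourceBehavior:
    def __init__(self, name: str):
        self.name = name
        self._state = SLAState.MET

    # Resource side: internal, never visible to the service layer.
    def _on_fault(self):
        self._state = SLAState.REMEDIATING
        # ...internal remediation steps the service side never sees...
        self._state = SLAState.MET

    # Service side: the only thing ever exposed.
    @property
    def sla_state(self) -> SLAState:
        return self._state

hosting = ResourceBehavior("hosting-us-east")
print(hosting.name, hosting.sla_state.value)   # the service side checks only this
```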

So what does, or should, a VIM expose in its “Behaviors” API?  All network elements can be represented as a black box of features that have a set of connections.  Each of the features and connections has a range of conditions it can commit to, the boundaries of its SLA.  When we deploy something with a VIM, we should be asking for such a box and connection set, and providing an SLA for each of the elements for which one is selectable.  Infrastructure abstraction, in short, has to be based on a set of classes of behavior to which infrastructure will commit, regardless of exactly how it’s constituted.  That’s vendor independence and silo independence.
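
Here’s one way that kind of black-box request might be modeled, purely as an illustration: features, connections, and the SLA ranges infrastructure either can or cannot commit to.

```python
# Illustrative data model for a "Behaviors"-style request: a black box of
# features and connections, each with SLA bounds, independent of how the
# underlying infrastructure is built.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SLARange:
    metric: str                  # e.g. "latency-ms" or "availability-pct"
    best: float
    worst: float

    def can_commit(self, target: float) -> bool:
        low, high = sorted((self.best, self.worst))
        return low <= target <= high

@dataclass
class Connection:
    name: str
    sla: List[SLARange] = field(default_factory=list)

@dataclass
class BlackBoxRequest:
    features: List[str]              # what the box must do
    connections: List[Connection]    # how it attaches to the outside
    target_sla: Dict[str, float]     # metric -> requested value

# A hosting pod advertising what it can commit to (invented numbers).
offered = {"latency-ms": SLARange("latency-ms", 1.0, 20.0),
           "availability-pct": SLARange("availability-pct", 99.999, 99.9)}

request = BlackBoxRequest(
    features=["vnf-hosting"],
    connections=[Connection("wan-side"), Connection("lan-side")],
    target_sla={"latency-ms": 10.0, "availability-pct": 99.95},
)

accepted = all(offered[m].can_commit(v) for m, v in request.target_sla.items())
print("pod can commit to this request:", accepted)
```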

I’m more convinced every day that the key to efficient carrier cloud is some specific notion of infrastructure abstraction, whether we’re talking about NFV or not.  In fact, it’s the diversity of carrier cloud drivers, and the fact that nothing really dominates the field in the near term, that makes the abstraction notion so critical.  We have to presume that over time the role of carrier cloud will expand and shift as opportunity focus changes for the operators.  If we don’t have a solid way of making the cloud a general resource, we risk wasting the opportunity that early deployment could offer the network operators.  That means risking their role in the cloud, and in future experience-based services.

Vendors have a challenge here too, because the fear of silos and vendor lock-in is already changing buyer behavior.  In NFV, the early leaders in technology terms were all too slow to recognize that NFV wasn’t a matter of filling in tick marks on an RFP, but of making an integrated business case.  As a result, they let the market idle along and failed to gain traction for their own solutions at a time when being able to drive deployment with a business case could have been decisive.  Now we see open-source software, commodity hardware, and technology measures against lock-in and silos taking hold.

It’s difficult to say how much operator concerns over silos and lock-in are impacting server sales.  Dell, a major player, is private and doesn’t report its results.  HPE just reported its quarterly numbers, which were off enough to generate Street concern.  HPE also said “We saw a significantly lower demand from one customer and major tier 1 service provider facing a very competitive environment.”  That is direct evidence that operators are constraining even server purchases, and it could be an indicator that the fear of silos and lock-in is creating a problem for vendors even now.

Ericsson’s wins at Telefonica and Verizon may also be an indicator.  Where there’s no vendor solution to the problems of making a business case or integrating pieces of technology, integrators step in.  There seems to be a trend developing that favors Ericsson in that role, largely because they’re seen as a “fair broker” with little of their own gear in the game.

It’s my view that server vendors are underestimating the impact of operator concerns that either early server deployments won’t add up to an efficient carrier cloud, or will lock them into a single supplier.  It wouldn’t take a lot of effort to create a “modular infrastructure model” for the carrier cloud.  Because its importance lies mostly in its ability to protect operators during a period where no major driver for deployment has emerged, developing a spec for the model doesn’t fall cleanly into NFV or 5G or whatever.  Vendors need to make sure it’s not swept under the rug, or they face a further delay in realizing their sales targets to network operators.

Despite some of the Street comments and some media views on the HPE problem, the cloud is not suppressing server spending.  Every enterprise survey I’ve done in the last five years shows that cloud computing has not had any impact on enterprise server deployment.  If anything, the cloud and web-related businesses are the biggest source of new opportunity.  Even today, about a third of all servers sold are to web/cloud players and network operators.  My model has consistently said that carrier cloud could add a hundred thousand new data centers by 2030.

If the cloud is raining on the server market, more rain is needed.  It is very likely that network-related server applications represent the only significant new market opportunity in the server space, in which case anything that limits its growth will have a serious impact on server vendors.  The time to fix the problem here is short, and there’s also the threat of open hardware models lurking in the wings.  Perhaps there needs to be a new industry “Call for Action” here.