Just How Much of a Problem is VNF Onboarding and Integration?

We had a couple of NFV announcements this week that mention onboarding or integration.  Ericsson won a deal with Verizon that includes their providing services to integrate VNFs, and HPE announced an “onboarding factory” service that has the same basic goal.  The announcements raise two questions.  First, does this move the ball significantly with respect to NFV adoption?  Second, why is NFV, based as it is on standard interfaces, demanding custom integration to onboard VNFs?  Both are good questions.

Operators do believe that there’s a problem with VNF onboarding, and in fact with NFV integration overall.  Most operators say that integration difficulties are much worse than expected, and nearly all of them put difficulties in the “worse to much worse” category.  But does an integration service or factory change things radically enough to accelerate NFV adoption significantly?  There, operators are divided, based of course on just how much VNF onboarding and integration they expect to undertake.

The majority of VNFs today are being considered in virtual CPE (vCPE) service-chaining business service applications, largely targeting branch office locations connected with carrier Ethernet services.  Operators are concerned with onboarding/integration issues because business users each favor one flavor of VNF or another, and offering a broad enough choice of VNFs to satisfy them looks like a pathway to exploding certification costs.

The thing is, many operators don’t even have this kind of prospect, and most operators get far less than 20% of their revenue from business user candidates for vCPE overall.  I’ve talked with some of the early adopters of vCPE, and they tell me that while there’s a lot of interest in having a broad range of available VNFs, the fact is that for any given category of VNF (firewall, etc.) there are probably no more than three candidates with enough support to justify including them in a vCPE function market list.

The “best” applications for NFV, meaning those that would result in the largest dollar value of services and of infrastructure purchasing, are related to multi-tenant stuff like IoT, CDN, and mobility.  All but IoT among this group tend to involve a small number of VNFs that are likely provided by a single source and are unlikely to change or be changed by the service customer.  You don’t pick your own IMS just because you have a mobile phone.  That being the case, it’s unlikely that one of the massive drivers of NFV change would really be stalled out on integration.

The biggest problem operators say they have with the familiar vCPE VNFs isn’t integration, but pricing, or perhaps the pricing model.  Most VNF providers say they want to offer their products on a usage price basis.  Operators don’t like usage prices because they feel they should be able to buy unlimited rights to the VNF at some point.  Some think that as usage increases, unit license costs should fall.  Other operators think that testing the waters with a new VNF should mean low first-tier prices that gradually rise when it’s clear they can make a business case.  In short, nothing would satisfy all the operators except free VNFs, which clearly won’t make VNF vendors happy.

Operators also tell me they’re more concerned about onboarding platform software and server or network equipment than VNFs.  Operators have demanded open network hardware interfaces for ages, as a means of preventing vendor lock-in.  AT&T’s Domain 2.0 model was designed to limit vendor influence by keeping vendors confined to a limited number of product zones.   What operators would like to see is a kind of modular infrastructure model where a vendor contributes a hosting and/or network connection environment that’s front-ended by a Virtual Infrastructure Manager (VIM) and that has the proper management connections to service lifecycle processes.

We don’t have one of these, in large part, because we still don’t have a conclusive model for either VIMs or management.  One fundamental question about VIMs is how many there could be.  If a single VIM is required, then that single VIM has to support all the models of hosting and connectivity needed, which is simply not realistic at this point.  If multiple VIMs are allowed, then you need to be able to model services so that the process of decomposition/orchestration can divide up the service elements among the infrastructure components each VIM represents.  Remember, we don’t have a solid service modeling approach yet.
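To make the multi-VIM case concrete, here is a minimal sketch (in Python, purely illustrative) of what “dividing up the service elements” among VIMs might look like; the model format, element names, and VIM identifiers are invented for the example and don’t follow any NFV specification.

    # Minimal sketch: dividing a decomposed service among multiple VIMs.
    # The model format, element names, and VIM identifiers are illustrative only.
    from collections import defaultdict

    service_model = [
        {"element": "vFirewall",  "vim": "vim-cloud-east"},
        {"element": "vRouter",    "vim": "vim-cloud-east"},
        {"element": "accessPath", "vim": "vim-metro-net"},
    ]

    def divide_by_vim(model):
        """Group service elements by the VIM that represents their infrastructure."""
        per_vim = defaultdict(list)
        for element in model:
            per_vim[element["vim"]].append(element["element"])
        return per_vim

    for vim, elements in divide_by_vim(service_model).items():
        # Each VIM would receive only the elements it is responsible for realizing.
        print(f"{vim}: {elements}")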

The management side is even more complicated.  Today we have the notion of a VNF Manager (VNFM) that has a piece living within each VNF and another that’s shared for the infrastructure as a whole.  The relationship between these pieces and underlying resources isn’t clear, and it’s also not clear how you could provide a direct connection between a piece of a specific service (a VNF) and the control interfaces of shared infrastructure.

This gets to the second question I noted in my opening.  Why is this so much trouble?  Answer: Because we didn’t think it out fully before we got committed to a specific approach.  It’s very hard to go back and redo past thinking (though the NFV ISG seems to be willing to do that now), and it’s also time-consuming.  It’s unfair to vendors to do this kind of about-face as well, and their inertia adds delay to a process that’s not noted for being a fast-mover as it is.  The net result is that we’re not going to fix the fundamental architecture to make integration and onboarding logical and easy, not any time soon.

That may be the most convincing answer to the question of the relevance of integration.  If we could assume that the current largely-vCPE integration and onboarding initiatives were going to lead us to something broadly useful and wonderfully efficient, then these steps could eventually be valuable.  But they still don’t specifically address the big issue of the business case, an issue that demands a better model for the architecture in general, and management in particular.

I understand what vendors and operators are doing and thinking.  They’re taking baby steps because they can’t take giant strides.  But either baby steps or giant strides are more dangerous than useful if they lead to a cliff, and we need to do more in framing the real architecture of virtualization for network functions before we get too committed on the specific mechanisms needed to get to a virtual future.

Cloud Computing Service Success for Operators: Can It Be Done?

Operators have been fascinated by the public cloud opportunity, and new initiatives like that of Orange Business Services seem to promise that this fascination could be gaining some traction in the real world.  But Verizon at the same time is clearly pulling back from its public cloud computing commitments.  What’s really going on with operator public cloud services?

In a prior blog, I noted that operators had initially seen cloud computing services as an almost-natural fit.  These services require a major investment, and they offer a relatively small return, which fits the public-utility model most operators still adhere to.  Publicity on cloud computing suggested an oncoming wave of adoption that could carry everyone to riches, a trillion-dollar windfall.  It didn’t happen, of course, and after the blog I cited, I heard more from operator planners who were eager to offer some insight into their own situations.

All of those who contacted me agreed that the biggest problem they faced with cloud computing services was the unexpected difficulty in selling the services.  Operators are good at stuff that is marketed rather than sold, where publicity stimulates buyers to make contact and thus identify themselves.  They’re also often at least passable at dealing one-on-one with large prospective buyers, like the big enterprises.  They’re not good at pounding the pavement doing prospecting, and that seemed to be what cloud computing was really about.

One insight that operators offered on this point was that their initial target for cloud computing was the large enterprise CIO types, the same people who are instrumental in big telecom service buys.  They found that for the most part enterprise public cloud was driven by line department interest in “shadow IT” and that the CIO was as likely (or more) to oppose the cloud move as to support it.  Certainly they were not usually the central focus in making the deal.  That meant that operators would have to reach out to line departments, and that broke the sales model.

The second problem operators reported was the complexity of the cloud business case.  Operators believed rosy predictions of major savings, but while there might indeed be strong financial reasons to move at least some applications to the cloud, they were difficult to quantify.  Often there had to be a formal study, which of course either the operator had to do or had to convince the prospective buyer to do.  Several operators said they went through several iterations of this, and never came up with numbers the CFO would accept.

The final issue was security and governance.  Operators said that there were always people (often part of the IT organization) who asked specific questions about cloud security and governance, and those questions were very difficult to answer without (you guessed it!) another study.  This combined with the other issues to lengthen the selling cycle to the point where it was difficult to induce salespeople to stay the course.

If you contrast Orange and Verizon, you can see these factors operating.  In both cases, the operators were looking at headquarters sales targets.  Verizon has the largest number of corporate headquarters in its territory of any Tier One, and so it seemed to them they should have the best chance of doing a deal with the right people.  Orange seems to be proving that’s true only to a point; you can present the value proposition to headquarters, but it still has to be related to a compelling problem the buyer already accepts.  Multinationals, the Orange sales target, have a demonstrable problem in providing IT support in all their operating geographies.  The cloud is a better solution than trying to build and staff data centers everywhere in the world.

The question, of course, is whether the opportunity will be worth Orange’s building those data centers.  In effect, their approach is almost like a shared hosting plan; if a bunch of multinationals who need a data center in Country X combine to share one (Orange’s cloud service) the single data center would be a lot more cost-effective than the sum of the companies’ independent ones.  If there are enough takers for Orange’s services, then it works.  Obviously one customer in every data center would end up putting Orange in the “inefficient and unwise” category of data center deployment.  We can’t say at this point how well it will go for them.

It does seem that the Orange model of exploiting headquarters relationships and specific problems/issues is the right path for operators looking to get into public cloud services.  This would have the best chance of working where there were a lot of headquarters locations to sell to, obviously, which means fairly “thick” business density.  However, as I said, Verizon had that and couldn’t make a go of things, so what made their situation different?

Probably in part competition, less the direct-to-the-wallet kind than the hearts-and-minds kind.  US companies know the cloud well from players like Microsoft and Amazon, and they perceive network operators as come-from-behind players who are next-to-amateurs in status.  Probably in part geography; in the EU the operators are held in higher strategic regard, and most of them have faced profit pressure for a longer time, so they’re further along in the cycle of transformation.

The real question is what cloud needs the operators could more broadly fill, and the answer to that may be hard to deal with if you’re an operator.  My model says that there are no such needs, that there is no single opportunity that could pull through carrier cloud computing service success.  The only thing that might do it down the line is IoT, but the situation changes down the line in any case.

As operators manage to deploy carrier cloud for other missions, they’ll achieve economy of scale and a coveted low-latency position at the edge.  Those factors will eventually make them preferred providers, and once they take hold then carrier cloud computing services will quickly gain acceptance.

The only problem with that story is that it’s going to take time.  Between 2019 and 2021 is the hot period according to the model, the time when there’s finally enough cloud infrastructure to make operators truly competitive.  Even that requires that they deploy cloud infrastructure in other short-term service missions, starting even this year.  That may not happen, particularly if 5G standards take as much time to mature as NFV specifications have taken.

This could be a long slog, even for Orange.  My model says their own situation is right on the edge, and execution of both deployment and sales will have to be perfect or it won’t work and they’ll end up doing what Verizon has done.

The Road Ahead for the Networking Industry

Think network hardware is going great?  Look at Cisco’s results, and at Juniper’s decision to push even harder in security (which by the way is also a hot spot for Cisco’s quarter).  Look at the M&A in the vendor space.  Most of all, look at everyone’s loss of market share to Huawei.  USTelecom Daily Lead’s headline was “Network Hardware Woes Crimp Cisco Sales in Q2.”  SDxCentral said “Cisco’s Switching and Routing Revenue Dragged in Q2.”  Switching, routing, and data center were all off for Cisco and total revenue was down (again).  Do we need to set off fireworks to signal something here?

We clearly have a classic case of shrinking budgets for network buyers.  On the service provider side, the problem is that profit-per-bit shrinkage that I’ve talked about for years.  On the enterprise side, the problem is that new projects to improve productivity are half as likely to have a network budget increase component as they were a decade ago.  The Street likes to say this is due to SDN and NFV, but CFOs tell me that neither technology has had any measurable impact on spending on network gear.  Price pressure is the problem; you beat up your vendors for discounts and if that doesn’t work you go to Huawei.

None of this is a surprise, if you read my blog regularly.  Both the service provider and enterprise trends are at least five years old.  What is surprising, perhaps, is that so little has been done about the problem.  I remember telling Cisco strategists almost a decade ago that there was a clear problem with the normally cyclical pattern of productivity-driven IT spending increases.  I guess they didn’t believe me.

We are beyond the point now where some revolution in technology is going to save network spending.  In fact, all our revolutions are aimed at lowering it, and Cisco and its competitors are lucky that none of them are working very well—yet.

Equipment prices, according to my model, will fall another 12% before hitting a level where vendors won’t be willing/able to discount further.  That won’t be enough to stave off cost-per-bit pressure, so we can expect to see “transformation” steps being taken to further cut costs.  This is where vendors have a chance to get it right, or continue getting it wrong.

There is no way that adding security spending to offset reductions in network switch/router spending is going to work.  Yes, you could spend a bit more on security, but the gains we could see there can’t possibly offset that 12% price erosion, nor can they deter what’s coming afterward.  What has to happen is that there is a fundamental change in networking that controls cost.  The question is only what that change can be, and there are only two choices—major efforts to reduce opex, or a major transformation of infrastructure to erode the role of L2/L3 completely.

Overall, operators spend perhaps 18% or 19% of every revenue dollar on capital equipment.  They’ll spend 28% of each dollar on “process opex”, meaning costs directly attributable to service operations and customer acquisition/retention, in 2017.  If we were to see a reduction in capex of that 12%, we’d end up with about a two percent improvement.  Service automation alone could reduce process opex by twice that.  Further, by 2020 we’re going to increase process opex to about 31% of each revenue dollar, an increase larger than any credible price-pressure capex reduction could cover.  By that time, service automation could have lowered process opex to 23% of revenue.  That’s more than saving all the capex budget could do.
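The arithmetic is easier to see laid out.  The percentages below are the modeled figures quoted above, and the little Python sketch just does the per-revenue-dollar bookkeeping; nothing in it is new data.

    # Rough bookkeeping of the cost-per-revenue-dollar figures quoted above;
    # all numbers are the modeled estimates from the text, not measurements.

    capex_share_2017 = 0.185   # ~18-19 cents of each revenue dollar on capital equipment
    opex_share_2017  = 0.28    # ~28 cents on "process opex" in 2017
    opex_share_2020  = 0.31    # projected process opex share by 2020

    # Squeezing equipment prices another 12% trims only capex's slice of revenue:
    capex_saving = capex_share_2017 * 0.12
    print(f"Price pressure saves roughly {capex_saving * 100:.1f} cents per revenue dollar")

    # Meanwhile process opex is growing faster than that saving:
    opex_growth = opex_share_2020 - opex_share_2017
    print(f"Opex growth to cover: about {opex_growth * 100:.1f} cents per revenue dollar")

    # Service automation (per the model) could take process opex to ~23% by 2020:
    automation_saving = opex_share_2020 - 0.23
    print(f"Service automation saves roughly {automation_saving * 100:.1f} cents per revenue dollar")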

SDN and NFV could help too, but the problem is that the inertia of current infrastructure limits the rate at which you could modernize.  The process opex savings achievable with SDN/NFV lag those of service automation without any infrastructure change by a bit over two years.  The first cost of the change would be incurred years in advance of meaningful benefits, which means that SDN/NFV alone cannot solve the problem unless the operators dig a very big cost hole that will take five years or more to fill with benefits.

The infrastructure-transformation alternative would push more spending down to the fiber level, build networks more with virtual-wire or tunnel technology at the infrastructure level, and shift to overlay (SD-WAN-like) technology for the service layer.  This approach, again according to my model, could cut capex by 38%, and in combination with service management automation, it could cut almost 25 cents of cost per dollar of revenue.  The problem is the time it would take to implement it, because operators would have to find a way to hybridize the new model with current infrastructure to avoid having a fork-lift-style write-down of installed equipment.  My model says that SD-WAN technology could facilitate a “soft” migration to a new infrastructure model, so the time needed to achieve considerable benefits could be as little as three years.

So, what can the network equipment vendors do here?  It doesn’t take an accountant to see that the service automation approach would be better for network equipment vendors because it wouldn’t require any real infrastructure change.  However, there are two issues with it.  First, the network equipment vendors have been singularly resistant to pushing this sort of thing, perhaps thinking it would play into the hands of the OSS/BSS types.  Second, it may be too late for the network vendors to jump on the approach, given that operators are already focused on lowering equipment spending by putting pressure on vendors (or switching to Huawei, where possible).

Some of the network equipment strategists see inaction as an affirmative step.  “We don’t need to do anything to drive service automation,” one said to me via email.  “Somebody is going to do it, and when they do it will take capex pressure off.”  Well, I guess that’s how the waffling got started.  Others tell me that they saw service automation emerging from SDN/NFV, which they didn’t want to support for obvious reasons.

The potential pitfall of the inaction approach is that a competitor might be the one who steps up and takes the action instead of you.  Cisco can afford to have Amdocs or perhaps even Oracle or HPE become a leader in service automation, but they can’t let Nokia or (gasp!) Huawei do that.  If a network vendor developed a strong service automation story they could stomp on the competition.

Worse, an IT vendor could stomp on all the network vendors if they developed a story of service automation and our push-down-and-SD-WAN model of infrastructure.  Operators are receptive to this message for the first time, in no small part because of something I’ve mentioned above—they’ve become focused on cutting capex by putting price pressure on vendors.  SD-WAN has tremendous advantages as a vehicle for offering business services, not the least of which is that it’s a whole lot better down-market than MPLS VPNs.  It’s also a good fit to cloud computing services.  A smart IT vendor could roll with this.

If we have any.  The down-trend in network spending has been clear for several years now, and we still find it easier to deny it than to deal with it.  I suspect that’s going to change, and probably by the end of this year, and we’ll see who then steps up to take control over networking as an industry.  The answer could be surprising.

Don’t Ignore the Scalability and Resilience of SDN/NFV Control and Management!

It would be fair for you to wonder whether the notion of intent-based service modeling for transformed telco infrastructure is anything more than a debate on software architecture.  In fact, that might be a very critical question because we’re really not addressing, so far, the execution of the control software associated with virtualization in carrier infrastructure.  We’ve talked about scaling VNFs, even scaling controllers in SDN.  What about scaling the SDN/NFV control, orchestration, and management elements?  Could we bring down a whole network by a classic fault avalanche, or even just by a highly successful marketing campaign?  Does this work under load?

This isn’t an idle question.  If you look at the E2E architecture of the NFV ISG, you see a model that if followed would result in an application with a MANO component, a VIM component, and a VNFM component.  How does work get divided up among those?  Could you have two of a given component sharing the load?  There’s nothing in the material to assure that an implementation is anything but single-threaded, meaning that it processes one request at a time.

I think there are some basic NFV truths and some basic software truths that apply here, or should.  On the NFV side, it makes absolutely no sense to demand that there be scalability under load and dynamic replacement of broken components at the VNF level, and then fail to provide either for the NFV software itself.  At the basic software truth level, we know how the cloud would approach the problem, and so we have a formula that could be applied and has been largely ignored.

In the cloud, it’s widely understood that scalable components have to be stateless and must never retain data within the component from call to call.  Every time a component is invoked, it has to look like it’s a fresh-from-the-library copy, because given scalability and availability management demands, it might just be that.  Microservices are an example of a modern software development trend (linked to the cloud but not dependent on it) that also mandates stateless behavior.

This whole thing came up back about a decade ago, with the work being done in the TMF on the Service Delivery Framework.  Operators expressed concern to me over whether the architecture being considered was practical:  “Tom, I want you to understand that we’re 100% behind implementable standards, but we’re not sure this is one of them,” was the comment from a big EU telco.  With the support of the concerned operators (and the equipment vendor Extreme Networks) I launched a Java project to prove out how you could build scalable service control.

The key to doing that, as I found and as others found in other related areas, is the notion of back-end state control, meaning that all of the variables associated with the way that a component handles a request are stored not in the component (which would make it stateful) but in a database.  That way, any instance of a component can go to the database and get everything it needs to fulfill the request it receives, and even if five different components process five successive stages of activity, the context is preserved.  That means that if you get more requests than you can handle, you simply spin up more copies of the components that do the work.
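Here’s a minimal sketch of the back-end state idea, assuming nothing beyond what’s described above: the handler keeps nothing between calls, so any copy of it can pick up where another left off.  The in-memory dictionary stands in for whatever shared database would actually hold the contract data.

    # Minimal sketch of back-end state control: the handler itself is stateless,
    # so any number of identical copies could process successive requests.
    # The in-memory dict stands in for a shared database keyed by service/contract ID.

    state_store = {}

    def handle_event(service_id, event, store=state_store):
        """Process one lifecycle event; all context comes from, and returns to, the store."""
        context = store.get(service_id, {"state": "orderable", "steps_done": 0})
        # ... do whatever work is appropriate to (context["state"], event) here ...
        context["steps_done"] += 1
        context["last_event"] = event
        store[service_id] = context    # write everything back; nothing stays in the component
        return context

    # Five successive stages, each of which could be handled by a different instance:
    for event in ["order", "deploy", "activate", "scale", "audit"]:
        handle_event("svc-001", event)

    print(state_store["svc-001"])      # the full context survived across all the calls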

You could shoehorn this approach into the strict structure of NFV’s MANO, but it wouldn’t be the right way—the cloud way—of doing it.  The TMF work on NGOSS Contract demonstrated that the data model that should be used for back-end state control is the service contract.  If that contract manages state control, and if all the elements of the service (what the TMF calls “Customer-Facing” and “Resource-Facing” Services, or CFS/RFS) store state variables in it, then a copy of the service contract will provide the correct context to any software element processing any event.  That’s how this should be done.

The ONF vision, as they explained it yesterday, provides state control in their model instances, and so do all my own initiatives in defining model-driven services.  If the “states” start with an “orderable” state and advance through the full service lifecycle, then all of the steps needed to deploy, redeploy, scale, replace, and remove services and service elements can be defined as the processes associated with events in those states.  If all those processes operate on the service contract data, then they can all be fully stateless, scalable, and dynamic.

Functionally, this can still map back to the NFV ISG’s E2E model, but the functions described in the model would be distributed in two ways—first by separating their processes and integrating them with the model state/event tables as appropriate, and second by allowing their execution to be distributed across as many instances as needed to spread the load or replace broken pieces.

There are some specific issues that would have to be addressed in a model-driven, state/event, service lifecycle management implementation like this.  Probably the most pressing is how you’d coordinate the assignment of finite resources.  You can’t have five or six or more parallel processes grabbing for hosting, storage, or network resources at the same time—some things may have to be serialized.  You can have the heavy lifting of making deployment decisions, etc. operating in parallel, though.  And there are ways of managing the collision of requests for resources too.

Every operator facility, whether it’s network or cloud, could be a control domain, and while multiple requests to resources in the same domain would have to be collision-proof, requests to different domains could run in parallel.  That reduces the impact of colliding resource requests.  This is necessary in my distributed approach, but it’s also necessary in today’s monolithic model of NFV implementation.  Imagine how you’d be able to deploy national/international services with a single instance of MANO!
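A rough sketch of the per-domain idea, with invented domain names: resource grabs inside one control domain are serialized behind a lock, while requests to different domains proceed in parallel.

    # Sketch of per-domain serialization: resource grabs inside one control domain are
    # serialized behind a lock, while requests to different domains run in parallel.
    # The domain names and the "assignment" work are placeholders.
    import threading
    from concurrent.futures import ThreadPoolExecutor

    domain_locks = {"cloud-east": threading.Lock(), "metro-west": threading.Lock()}

    def assign_resources(domain, request_id):
        with domain_locks[domain]:     # collisions within a single domain are prevented here
            # ... choose hosts/links within this domain for the request ...
            return f"{request_id} assigned in {domain}"

    requests = [("cloud-east", "req-1"), ("metro-west", "req-2"), ("cloud-east", "req-3")]
    with ThreadPoolExecutor() as pool:
        for result in pool.map(lambda r: assign_resources(*r), requests):
            print(result)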

The final point to make here is that “deployment” is simply a part of the service lifecycle.  If you assume that you deploy things using one set of logic and then sustain them using another, you’re begging for the problems of duplication of effort and very likely inconsistencies in handling.  Everything in a service lifecycle should be handled the same way, be defined by the same model.  That’s true for VNFs and also for the NFV control elements.

This isn’t a new issue, which perhaps is why it’s so frustrating.  In cloud computing today, we’re seeing all kinds of initiatives to create software that scales to workload and that self-heals.  There’s no reason not to apply those principles to SDN and NFV, particularly when parts of NFV (the VNFs) are specifically supposed to have those properties.  There’s still time to work this sort of thing into designs, and that has to be done if we expect massive deployments of SDN/NFV technology.

The “New ONF” Declares a Critical Mission, but Can They Fulfill It?

Yesterday the “New ONF” formed by the union of the old ONF and ON.Labs announced its new mission and its roadmap to achieving it.  I’m a guy who has worked in standards for well over two decades, and the experience has made me perhaps more cynical about standards than I am about most things (which, most of my readers will agree, is pretty darn cynical).  The new ONF actually excites me by stating a goal set and some key points that are spot on.  It also frightens me a little because there’s still one thing that the new group is doing that has been a major cause of failure for all the other initiatives in the service provider transformation space.

The “new ONF” is the union of the Open Network Foundation and ON.Labs, the organization that created the ONOS operating system and CORD, both of which I’ve talked about in the past.  I blogged about the importance of CORD early on (see THIS blog) and again when Comcast jumped into the consortium, HERE, and everyone probably knows that the ONF is the parent of OpenFlow SDN.  The new ONF seems more focused on the ON.Labs elements, from which they hope to create a way to use software-based or software-defined elements to build market-responsive networks and network services.

Networks of old were a collection of boxes joined by very standardized hardware interfaces.  Then, enter virtualization, software definition, the cloud, and all the other good stuff that’s come along in the last decade.  Each of these new initiatives had/have their champions in terms of vendors, buyers, and standardization processes.  Each of these initiatives had a very logical mission, and a logical desire to contain scope to permit timely progress.  Result?  Nothing connects in this wonderful new age.

This is a perhaps-flowery restatement of the opening positioning that the ONF offers for its new concept of the Open Innovation Pipeline.  The goal of the process is the notion of the “Software-Defined Standard”, something that by itself brings tears to the eyes of an old software architecture guy like me.  We’ve gone way too far along the path of supposed software-defined stuff with little apparent concern for software design principles.  The ONF says they want to fix that, which has me excited.

Digging into the details, what the ONF seems to be proposing is the creation of an open ecosystem that starts (at least in many cases) with the ONOS operating system, on which is added the XOS orchestration layer (which is a kind of service middleware).  This is used to build the variety of CORD models (R-CORD, M-CORD, etc.), and it can also be used to build new models.  If this approach were to be followed, it would create a standardized open-source platform that builds from the bottom to the top, and that provides for easy customization and integration.

But it’s at the top of the architectural heap that I find what makes me afraid.  The architectural slide in all of this shows the open structure with a programmable forwarding plane at the bottom, a collection of Global Orchestrators at the top, and the new ONF focus as a box in between.  This vision is of course device-centric, and in the real world you’d be assembling conforming boxes and presumably other boxes, virtual or real, to create networks and services.  I don’t have a problem with the idea that there’s a forwarding plane at the bottom, because even service elements that are outside the service data plane probably have to forward something.  I’m a bit concerned about that Global Orchestrator thing at the top.

I’ve been a part of a lot of standards processes for decades, and it seems like all of them tend to show a diagram that has some important function sitting god-like at the top, but declared safely out of scope.  That’s what the ONF has done with those Global Orchestrators.  The problem with those past bodies and their past diagrams is that all of them failed their critical mission to make the business case, and all of them failed because they didn’t include elements that were critical to their business case in their scope of work.  So the fact that the ONF seems to do this is discouraging.

The ONF is right in saying that there’s an integration problem with the new-generation virtualization-based services.  They are also right in saying that a common implementation platform, on which the elements of those new services are built, would solve that problem.  However, the past says that’s not enough, for two reasons.

First, everything is not built on the ONF’s architecture.  Even if we presume that everything new is built that way, you still have to absorb all the legacy hardware and accommodate the open-source initiatives for other virtualized-element models, none of which are based on the ONF’s elements.  We have learned the bitter truth in NFV in particular; you can’t exclude the thing you are evolving from (legacy devices in particular) in your model of a future service, unless you never want to get there from here.  You could accommodate the legacy and “foreign” stuff in the ONF approach, but the details aren’t there yet.

Second, there’s the issue of the business case.  I can have a wonderful architecture for building standardized car parts, but it won’t serve me a whit if nobody wants to buy a car.  I’ve blogged a lot about the business case behind a new virtual service element—SDN, NFV, or whatever you like.  Most of that business case is going to come from the automation of the full service lifecycle, and most of that lifecycle and the processes that automate it live in that Global Orchestrators element that’s sitting out of scope on top of the ONF target functionality.

All of this could be solved in a minute with the inclusion of a model-based service description of the type I’ve been blogging about.  I presented just this notion to the ONF, in fact, back in about 2014.  A model like that could organize all of the pieces of ONF functionality, and it could also organize how they relate to the rest of the service processes, whether they’re NFV processes, OSS/BSS processes, or cloud computing.  Yes, this capability would be in a functional Global Orchestrator, but there aren’t any of them available and we know that because nobody has successfully made the business case with one, nor have they integrated all the service lifecycle processes.

There is a modeling aspect to the XOS layer, and it’s got all the essential pieces, as I said in my first blog on it (see above).  However, in execution, XOS seems to have changed its notion of “service” from a high-level one to something more like the TMF’s “Resource-Facing Services” or my ExperiaSphere “Behaviors”.  They’re what a network or infrastructure can do, more than a functional assembly that when decomposed ends up with these infrastructure capabilities.  That seems to be what created the Global Orchestrator notion; the lost functionality is pushed up into the out-of-scope part.  That’s what frightens me, because it’s the mistake that so many others have made.

I’m not knocking the new ONF here, because I have high hopes for it.  They, at least, seem to grasp the simple truth that software defined stuff demands a definition of stuff in software terms.  I also think that, at a time when useful standards to support integration in SDN and NFV seem to be going nowhere, the notion of a common platform seems unusually attractive.  Is it the best approach?  No, but it’s a workable one, which says a lot at this point.

There has been a lot of recent re-launching of standards and industry groups and activities, brought about because the original efforts of the bodies generated interest, hype, media extravagance, and not much in the way of deployment or transformation.  The new ONF now joins the group of industry mulligans, and the question is whether it will jump off what’s unquestionably a superior foundation and do the right thing, or provide us with another example of how to miss the obvious.  I’ll offer my unbiased view on that as the details of the initiative develop.

What Will Become of Test and Measurement in a Virtual World?

One of the joke statements of the virtual age is “You can’t send a real tech to fix a virtual problem!”  Underneath the joke is a serious question, which is just what happens to test and measurement in a virtual world?  Virtualization opens two issues—how do you test the virtual processes and flows, and how does virtualization impact T&M’s usual missions?  We’ll look at both today.

Test and measurement (T&M) differs from “management” in that the latter focuses on the ongoing status of things and the reporting of changes in status.  Management is status monitoring and not network monitoring.  T&M, in contrast, aims at supporting the “craft processes” or human activity that’s associated with taking a refined look at something—is it working, and how well—with the presumptive goal of direct remediation.

Many people, including me, remember the days when resolving a network problem involved looking at a protocol trace, and that practice is a good place to start our exploration.  Whether you have real or virtual devices, the data flows are still there and so are the issues of protocol exchanges.  However, a virtual device is fundamentally different from a real one, and the differences have to be accommodated in any realistic model of T&M for the virtual age.

There’s an easy-to-see issue that we can start with.  A real device has a location.  A virtual device has one too, in the sense that it’s hosted somewhere, but the hosting location isn’t the same thing as the location of a box.  A box is where it is; a virtual router instance is where it was convenient to put it.  At the least, you’d have to determine where an instance was being hosted before you could run out and look at it.  But that initial check of location isn’t enough in a virtual world.  Imagine a tech en route to stick a monitor in a virtual router path, only to find that while en route, the virtual router “moved”.  It’s common to have a soft collision between management-driven changes in a network and remediation, but in the traditional world the boxes at least stay put.  T&M in a virtual world has to deal with the risk of movement of the instance while the tech is setting up or during the test.

Simplicity is comforting even when it’s not quite as simple as it looks, but this simple point of “where is it?” isn’t the real problem.  If software automation to improve opex is the goal (which operators say it is) for virtualization, then we’d have to assume that the goal is to move away from “T&M” to “management”, since the former is presumably explicitly a human activity.  That means that in the future, not only would it be more likely that a virtual router got moved, it would be likely that if there were a problem with it the first goal would be to simply replace it—an option that’s fine if you’re talking about a hosted software element but problematic if you’re dealing with a real box.  So, we’re really saying that virtualization first and foremost alters the balance between management and T&M.

When do you send a tech, or at least involve a tech?  The only satisfactory answer in a time when opex reduction is key is “When you’ve exhausted all your other options.”  One operator told me that their approach was something like this (a rough code sketch of the escalation logic follows the list):

  1. If there’s a hard fault or an indication of improper operation, you re-instantiate and reroute the service as needed. It’s like saying that if your word processor is giving you a problem, save and reload it.
  2. If the re-instantiation doesn’t resolve things, you check to see if there was any change to software versions in the virtual device or its platform that, in timing, seem possibly related to the issue. If so, you roll back to the last configuration that worked.
  3. If neither of these steps resolves things, or they don’t apply, then you have to try remediation. The operator says that they’d first try to reroute or redeploy the service around the whole faulty function area and then try to recreate the problem in a lab under controlled conditions.  If that wasn’t possible they’d assume T&M was needed.
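Here is that escalation logic reduced to a sketch; the check-and-repair functions are stubs invented for illustration, and the point is only the ordering of the steps.

    # Rough sketch of the escalation order above; the check-and-repair functions are
    # stubs invented for illustration, and would really query the management system.

    def handle_fault(service):
        if reinstantiate_and_reroute(service):        # step 1: redeploy/reroute first
            return "resolved by re-instantiation"
        if recent_config_change(service):             # step 2: suspect a version/config change
            roll_back_last_change(service)
            return "rolled back to last working configuration"
        if reroute_around_fault_area(service) or recreate_in_lab(service):
            return "remediated without a dispatch"    # step 3: remediation attempts
        return "escalate to T&M"                      # only now does a tech get involved

    # Stubs so the sketch runs end to end:
    def reinstantiate_and_reroute(s): return False
    def recent_config_change(s):      return False
    def roll_back_last_change(s):     pass
    def reroute_around_fault_area(s): return False
    def recreate_in_lab(s):           return False

    print(handle_fault("svc-042"))    # -> "escalate to T&M"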

The same operator says that if we assumed a true virtual network, the goal would be to avoid dispatching a tech in favor of some kind of testing and monitoring from the network operations center (NOC).  The RMON specification from the IETF can be implemented in most real or virtual devices, and there are still a few companies that use hardware or software probes of another kind.  This raises the question of whether you could do T&M in a virtual world using virtual monitoring and test injection, which would eliminate the need to dispatch someone to hook up an analyzer.  A “real” dispatch would be needed only if there were a hardware failure of some sort on site, or a situation where a manual rewiring of the network connections of a device or server was needed.

One advantage of the virtual world is that you could instantiate a monitoring point as software somewhere convenient, and either connect it to a “T” you kept in place at specific locations, or cross-connect by rerouting.  The only issue with this approach is the same one you can run into with remote monitoring today—the delay between the point where the flow is “tapped” and the point where the monitoring is viewed.  However, if you aren’t doing test injection at the monitoring point the issues should be minimal, and if you are, you’d need to install a more sophisticated remote probe that could execute trigger-driven responses locally.

Another aspect of “virtual T&M” is applying T&M to the control APIs and exchanges associated with SDN or NFV.  This has been a topic of interest for many of the T&M vendors, and certainly the failure of a control or management path in SDN or NFV could present a major problem.  Operators, in fact, are somewhat more likely to think they need specialized T&M support for control/management exchanges in SDN and NFV than in the service data path.  That’s because of expected issues with integration among the elements at the control/management protocol level.

Most of the technology and strategy behind virtual T&M is the same whether we’re talking about the data path or the control/management plane.  However, there are profound issues of security and stability associated with any monitoring or (in particular) active intervention in control/management activity.  We would assume that T&M would have to live inside the same security sandbox as things like an SDN controller or NFV MANO would live, to ensure nothing was done to compromise the mass of users and services that could be represented.

Overall, the biggest impact of virtualization trends on T&M is the fact that a big goal for virtualization is service lifecycle automation.  If that’s taken seriously, then more of what T&M does today would migrate into a management function that generated events to drive software processes, not technicians.  In addition, the T&M processes related to device testing are probably far less relevant in an age where the device is virtual and can be reinstantiated on demand.  But virtualization also lets T&M create what is in effect a virtual technician because it lets you push a probe and test generator anywhere it’s needed.  Will the net be positive or negative?  I think that will depend on how vendors respond to the challenge.

Could Modeling Be the Catalyst for OSS/BSS Transformation?

I can vividly recall one of my early telco transformation meetings.  It was just after NFV had launched, but before any real work had been done.  At the meeting, two of the telco experts sitting next to each other expressed their views on OSS/BSS.  One wanted to transform it, retaining as much as possible of the current systems.  The other wanted to start over.  This polarized view of OSS/BSS futures, it turned out, was fairly pervasive among operators and it’s still dominant today.

The notion of transforming OSS/BSS has a long history, going back more than a decade in fact.  The first transformational notion I saw was the TMF’s NGOSS Contract work, something I’ve cited often.  This was an early attempt to reorganize operations processes into services (SOA, at the time) and to use the contract data model to steer service events to the right process.  This, obviously, was the “event-driven OSS/BSS” notion, and also the “service-based” or “component-based” model.

We sort of did services and components, but the event-driven notion has been harder to promote.  There are some OSS/BSS vendors who are now talking about orchestration, meaning the organization of operations work through software automation, but not all orchestration is event-driven (as we know from the NFV space and the relatively mature area of DevOps for software deployment).  Thus, it would be interesting to think about what would happen should OSS/BSS systems be made event-driven.  How would this impact the systems, and how would it impact the whole issue of telco transformation?

We have to go back, as always, to the seminal work on NGOSS Contract to jump off into this topic.  The key notion was that a data model coupled events to processes, which in any realistic implementation means that the OSS/BSS is structured as a state/event system with the model recording state.  If you visualized the service at the retail level as a classic “black box” or abstraction, you could say that it had six states: Orderable, Activating, Active, Terminating, Terminated, and Fault.  An “order” event transitions the service to the Activating state, and a report that the service is properly deployed would transition it to the Active state.  Straightforward, right?  Each state has a key event that represents its “normal” transition driver, and there’s also a logical progression of states.  All except “Fault”, of course, which would presumably be entered on any report of an abnormal condition.

You can already see this is too simplistic to be useful, of course.  If the service at the retail level is an abstract opaque box, it can’t be that at the deployment level in most cases.  Services have access and transport components, features, and different vendor implementations at various places.  So inside our box there has to be a series of little subordinate boxes, each of which represents a path along the way to actually deploying.  Each of these subordinates is connected to the superior in a state/event sense.

When you send an Order event to a retail service, the event has to be propagated to its subordinates so they are all spun up.  Only when all the subordinates have reported being Active can you report the service itself to be Active.  You can see that the state/event process also synchronizes the cooperative tasks that are needed to build a service.  All of this was implicit in the NGOSS Contract work, but not explained in detail in the final documents (GB942).

Operations processes, in this framework, are run in response to events.  When you report an event to a subordinate (or superior) component of a service, the state that component is in and the event itself combine to define the processes to be run.  The way that an OSS/BSS responds to everything related to a service is by interpreting events within the state/event context of the data models for the components.
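A minimal sketch of that steering, using the six states listed earlier; the event names and processes are placeholders, but the mechanism—look up the process from the component’s current state and the event, run it, and roll subordinate status up to the parent—is the one described above.

    # Minimal sketch of NGOSS-Contract-style event steering.  The service model records
    # each element's current state, and a state/event table names the process to run.
    # States follow the six used above; the events and processes are placeholders.

    def start_activation(elem): elem["state"] = "Activating"
    def mark_active(elem):      elem["state"] = "Active"
    def open_fault(elem):       elem["state"] = "Fault"

    state_event_table = {
        ("Orderable",  "order"):    start_activation,
        ("Activating", "deployed"): mark_active,
        ("Active",     "alarm"):    open_fault,
    }

    def dispatch(element, event):
        """Select the operations process from (current state, event) and run it."""
        process = state_event_table.get((element["state"], event))
        if process:
            process(element)

    service = {"name": "vpn-17", "state": "Orderable",
               "subordinates": [{"name": "access", "state": "Orderable"},
                                {"name": "core",   "state": "Orderable"}]}

    dispatch(service, "order")              # the retail service starts activating...
    for sub in service["subordinates"]:     # ...and the order event propagates downward
        dispatch(sub, "order")
        dispatch(sub, "deployed")

    # Only when every subordinate reports Active can the retail service itself go Active.
    if all(s["state"] == "Active" for s in service["subordinates"]):
        dispatch(service, "deployed")

    print(service["state"])                 # -> Active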

This approach contrasts to what could be described as the transactional or workflow approach that has been the model for most business applications, including most OSS/BSS.  In a transactional model, operations tasks are presumed to be activated by something (yes, we could think of it as an event) and once activated the components will then run in a predefined way.  This is why we tend to think of OSS/BSS components like “Order Management” or “Billing”; the structure mirrors normal business software elements.

To make the OSS/BSS operate as an event-driven system, you need to do three things.  First, you need a data model that defines a service and its subordinate elements in a structured way, so that each of the elements can be given a specific state/event table to define how it reacts to events.  Second, you need events for the system to react to, and finally you need to have OSS/BSS processes defined as services or components that can be invoked from the intersection of current state and received event, in any given state/event table.

Most OSS/BSS systems are already modular, and both operators and vendors have told me that there’s little doubt that any of them could be used in a modular-as-service way.  Similarly, there are plenty of business applications that are event-driven, and we have all manner of software tools to code conditions as events and associate them with service models.  What we lack, generally, are the models themselves.  It’s not that we don’t have service modeling, but that the models rarely have state/event tables.  Those would have to be authored as part of service-building.

You can see from this description that the process of modernizing OSS/BSS based on NGOSS-Contract state/event principles is almost identical to the process of defining virtualized function deployments as described by the NFV ISG, or the way that AT&T’s ECOMP proposes to build and manage services.  That has three important consequences.

First, it would clearly be possible to organize both SDN/NFV service lifecycle management and OSS/BSS modernization around the same kind of model, meaning of course that it could be the same model.  Properly done, a move in one space would move you in the other, and since automation of both operations and the lower-level lifecycle management processes is essential for opex efficiency and service agility, the combined move could meet transformation goals.

Second, the model could be defined either at the OSS/BSS level or “below” that, perhaps as independent NFV orchestration.  From wherever it starts, it could then be percolated up or down to cover the other space.  Everyone in the OSS/BSS space, the SDN/NFV space, and the DevOps or orchestration space could play a role here.

Third, this level of model-driven integration of operations processes with service and resource management processes at the lower level isn’t being touted today.  We see services and service modeling connected to OSS/BSS, presumably through basic order interfaces.  If that’s accidental, it seems to suggest that even advanced thinkers in the vendor and operator communities aren’t thinking about full-scope service automation.  If it’s deliberate, then it isolates operations modernization from the service modeling and orchestration trends, which in my view would marginalize OSS/BSS and hand victory to those who wanted to completely replace it rather than modernize it.

That returns us to those two people at the meeting table, the two who had diametrically opposed views of the future of OSS/BSS.  Put in the terms of the modeling issue we’ve been discussing here, the “modernize” view would favor incorporating OSS/BSS state/event handling into the new service automation and modeling activity that seems to be emerging in things like ECOMP.  The “trash it and start over” view says that the differences in the role of OSS/BSS in a virtual world are too profound to be accommodated.

My own view falls between these two perspectives.  There are a lot of traditional linear workflows involved in OSS/BSS today, and many of them (like billing) really don’t fit a state/event model.  However, the old workflow-driven thinking doesn’t match cloud computing trends, distributed services, and virtualization needs.  What seems to be indicated (and which operators tell me vendors like Amdocs and Netcracker are starting to push) is a hybrid approach where service management as an activity is visualized as a state/event core built around a model, and traditional transactional workflow tasks are spawned at appropriate points.  It’s not all-or-nothing, it’s fix-what’s-broken.
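The hybrid could be as simple as hanging transactional tasks off particular state/event transitions.  The sketch below is illustrative only; the billing functions are placeholders, not any real OSS/BSS interface.

    # Sketch of the hybrid pattern: a state/event core drives the service lifecycle and,
    # at the appropriate transitions, spawns conventional transactional work.

    def open_billing_record(service_id):
        print(f"transactional workflow: open billing record for {service_id}")

    def issue_final_bill(service_id):
        print(f"transactional workflow: issue final bill for {service_id}")

    workflow_hooks = {
        ("Activating", "deployed"):  open_billing_record,   # service went live: start billing
        ("Active",     "terminate"): issue_final_bill,      # service ended: close the account
    }

    def on_event(service, event):
        hook = workflow_hooks.get((service["state"], event))
        if hook:
            hook(service["id"])
        # ...the state/event core would also advance service["state"] here...

    on_event({"id": "svc-9", "state": "Activating"}, "deployed")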

Or, perhaps, it’s neither.  The most challenging part of the OSS/BSS modernization task, and of integrating OSS/BSS with broader virtualization-driven service management, is the political barrier created by the organization of most operators.  So far, SDN and NFV have been CTO projects.  OSS/BSS is a CIO domain, and there is usually a fair degree of tension between these two groups.  Even where the CIO organization has a fairly unanimous vision of OSS/BSS evolution (in the operator I opened this blog with, both views on operations evolution were held within the CIO organization) there’s not much drive so far to unite that vision with virtualization at the infrastructure level.

Could standardization help this?  The standards efforts tend to align along these same political divides.  The TMF is the go-to group for CIO OSS/BSS work, and the CTO organizations have been the participants in the formal bodies like the NFV ISG.  Underneath it all is the fact that all these activities rely on consensus, which has been hard to come by lately as vendor participants strive for competitive advantage.  We may need to look to a vendor for the startling insights needed.  Would we have smartphones today without Steve Jobs, if a standards process had to create them?  Collective insight is hard, and we’ve not mastered it.

Could We Unify CORD and ECOMP to Accelerate Infrastructure Transformation?

If you like the idea of somehow creating a union between CORD and ECOMP then the next obvious question is just where that union has to start.  The answer, in my view, isn’t in a place where both architectures contribute something that could be united, but where neither does enough and external unionizing forces are essential.  That’s the notion of modeling, not resources but functions.

In my last blog, I noted that integration depends on the ability to freely substitute different implementations of the same function without changing the service definitions or the management practices.  To make that happen, you need to have some Platonic shapes that define all the functions you intend to use in composing services…or even applications.  Each of these models then represents the “look and feel” of the function as seen from above.  The vendors who want to contribute those functions are responsible for building downward from the abstract model to make sure what they do fits seamlessly.

The goal is to make a function’s “object” into a representation of that function through the service lifecycle.  You manipulate the function at the model level, and the manipulation is coupled downward into whatever kind of implementation happens to be used.  That way, things that have to view or control a “router” don’t have to worry (or even know) whether it’s an instance of software, a physical device, or a whole system of routing features either built by SDN forwarding or by combining devices/software into a “network”.

The TMF really got a lot of this started back in the 2006-2010 timeframe, with two initiatives.  One was the “NGOSS Contract” that proposed that events would be steered to the appropriate lifecycle processes through the intermediary of the model service contract.  That approach was the first to make a contract (which the TMF modeled as a series of connected service elements) into a state/event machine.  The other was the Service Delivery Framework (SDF), which explicitly targeted the lifecycle management of services that consist of multiple functions/features.

To me, the union of these two concepts required the notion that each service element or model element (my “router”) be represented as an object that had properties determined by the class of feature it defined.  That object was then a little “engine” that had state/event properties and that translated standard class-based features (“a router does THIS”) into implementation-specific methods (“by doing THIS”).  A service was a structured assembly of these objects, and each service was processed by a lifecycle management software element that I called a “Service Factory”, a term the TMF briefly adopted.
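In software terms, the “router” object is just an abstract class: the class defines what any router must do, and each implementation decides how.  The sketch below is a toy illustration with invented class and method names, not anyone’s actual model.

    # Toy illustration of the class/implementation split: the abstract "router" defines
    # what any router must expose to the service model; each implementation decides how.
    from abc import ABC, abstractmethod

    class Router(ABC):
        """The function class: the 'a router does THIS' side of the contract."""
        @abstractmethod
        def deploy(self): ...
        @abstractmethod
        def add_route(self, prefix, next_hop): ...

    class PhysicalRouter(Router):
        def deploy(self):                              # a real box: nothing to instantiate
            print("configuring the existing device")
        def add_route(self, prefix, next_hop):
            print(f"CLI/NETCONF: route {prefix} via {next_hop}")

    class HostedRouterInstance(Router):
        def deploy(self):                              # a software instance: host it first
            print("spinning up a router software instance")
        def add_route(self, prefix, next_hop):
            print(f"management API: route {prefix} via {next_hop}")

    def activate(router: Router):
        """Lifecycle code manipulates only the abstraction, never the implementation."""
        router.deploy()
        router.add_route("10.0.0.0/8", "192.0.2.1")

    activate(PhysicalRouter())
    activate(HostedRouterInstance())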

Service lifecycle management, which starts by instantiating a service model onto real infrastructure by making the connections between the “objects” that define the model and a service-instance-specific way of deploying or committing resources, lives above the model.  It never has to worry about implementation because it manipulates only the abstract vision (“router”).  The first step in lifecycle management is responsible for deployment, and it makes the connections between the general object vision of available features (probably in the form of APIs) and the way each object is actually deployed in the referenced service.

When a model is deployed, the abstract “model” has to be changed from a template that describes something to an instance that represents something.  There are two basic approaches to doing this.  One is to actually spawn a set of software objects that will then run to process service lifecycle events.  In this approach, a service is a real software application made up of modules for the features.  The second approach is to use a general software tool that interprets the model as needed, meaning that there is in the instance of a service model a set of references to software, not the software itself.  The references could be real pointers to software processes, or they could be a data model that would be passed to a generic software element.

CORD uses abstractions to represent things like the access network and the service trunking.  There are also arguably standard models for resources.  The former are useful but not sufficient to model a service because they don’t have the functional range needed to support all the service features.  The latter open the question of “standardization” below the service objects, which I’ll get to in a bit.

ECOMP also contributes elements.  It has the notion of a service model, though I’d argue it’s not as specific as the approach I’ve described.  It has the notion of service lifecycle management, again not as detailed.  Much of ECOMP’s detail is in the management and resource portion of the issue, again below the service model I’ve described.

If CORD describes the CO of the future and ECOMP describes the integration of elements, then the thing that would unite them in a logical sense is a complete statement of the service models that relate the processes of ECOMP with the resources of CORD.  To consider that, it’s now time to address the question of what happens underneath a service model.  Here we have three basic options to consider:

  1. We could use the same modeling approach below as we had used for service models, so that the decomposition of a “router” object into a network of “router” objects would use the same tools.
  2. We could use some other standardized modeling approach to describe how an “object of objects” is represented.
  3. We could let anything that works be used, foregoing standardization.

The best approach here, in my view, would depend on how many of the “other standardized modeling” approaches would be fielded in the market.  Below the service model, the mandate is to pick an implementation strategy and then connect it to the service-model’s object-level APIs.  You could see the work of the NFV ISG and MANO living down here, and you could also see modeling options like TOSCA, TMF SID, and YANG, and even more general API or data languages like XML or JSON.  The more options there are, the more difficult it would be to get a complete model from the underside of our highest-level service objects to the resources that will have to be committed.  That’s because it’s likely that vendors would support only a few model choices—their own gladly and everything else with great reluctance.

Clearly the last option leads to chaos in integration.  So does the second option, unless we can define only a very limited set of alternative approaches.  That leaves us with the first option, which is to find a general modeling approach that would work top to bottom.  However, that approach fields about as many different choices as my second one did—and it then demands we pick one before we can go very far in modeling services.  Given all of this, what I’d suggest is that we focus on defining what must be standardized—the structure of those abstract functional objects like “router”.  From there, we’d have to let the market decide by adopting what works best.

It should be easy to unify CORD and ECOMP with service modeling because both require and even partially define it, but neither seems to be firmly entrenched in a specific approach.  It’s also something that the NFV ISG might be ideally positioned to provide, since the objects that need to be defined for the model all fall within the range of functions considered by NFV.  It could also be done through open-source activities (including CORD and ECOMP), and it could be done by vendors.  Perhaps with all these options on the table, at least one could come to fruition.

There’s a lot at stake here.  Obviously, this could make both CORD and ECOMP much more broadly relevant.  It could also re-ignite the relevance of the NFV ISG.  It could help the TMF turn its ZOOM project into something other than a lifetime tenure for its members.  I also think that carrier cloud adoption could be accelerated significantly, perhaps by as much as two years, if something like this were done.  But make no mistake, carrier cloud is going to happen and result in a lot of new money in the IT world.  Once that’s clear (by 2020 certainly) I think there will be a rush to join in.  For some, it will be too late to reap the full benefits.

Some Specific Recommendations on Boosting the Role of NFV in the Carrier Cloud

In the last several blogs I developed the theme of making NFV relevant and explored the relationship among the drivers of “carrier cloud”.  One point that I raised, but without numerical detail, is the contribution that NFV would actually make to carrier cloud.  If you look at the model results, there’s a firm connection with NFV in only about 6% of carrier cloud deployment.  That doesn’t really tell the story, though, because there is a credible way of connecting over 80% of carrier cloud deployment to NFV, meaning making NFV relevant to almost all server deployments by operators.  If that’s the case, then the risk that NFV proponents face today is failing to realize that credible connection.

The challenge in realization comes down to integration, in several forms.  There’s been a lot said about the problems of NFV integration, but most of it has missed the real issues.  If we look at the goal of realizing the incremental 74% link to carrier cloud that’s on the table, and if we start from that top goal, we can get some more useful perspectives on the integration topic, and maybe even some paths to solution.

The next four years of carrier cloud evolution are critical because, as I noted in yesterday’s blog, there’s no paramount architectural driver, or even any single paramount application or service, behind the deployments of that period.  The risk (again, citing yesterday’s blog) is that all the stuff that happens, stuff that will end up deploying almost seven thousand new data centers globally, won’t organize into a single architecture model that can then be leveraged further.  If “carrier cloud” is cohesive architecturally, or if cohesion can somehow be fitted onto whatever happens, then the foundation goes a long way toward easing the rest of the deployment.  This is the first level of integration.

The minimum operator/infrastructure goal for the next four years should be to build a single resource pool based on compatible technology and organized and managed through a common framework.  The resources that make up the carrier cloud must be the foundation of the pool of resources that will build the base for future services, for the later phases of cloud evolution.  That means that:

  1. Operators should presume a cloud host basis for all applications that involve software hosting, whether it’s for features, management, operations support, databases, or whatever. Design everything for the cloud, and insist that everything that’s not cloud-ready today be made so.
  2. There should be a common framework for resource management imposed across the entire carrier cloud pool, from the first, and that framework should then expand as the carrier cloud expands.
  3. Network connectivity with, and to, the carrier cloud resource pool should fit a standard model that is SDN-ready and that is scalable to the full 100-thousand-data-center level that we can expect to achieve globally by 2030.
  4. Deployment of anything that runs on the carrier cloud must be based on an agile DevOps approach that recognizes the notion of abstraction and state/event modeling. It’s less important to define what the model is than to say that the model must be used everywhere and for everything.  Deploy a VNF?  Use the model.  Same with a customer’s cloud application, an element of OSS/BSS, or anything else that’s a software unit.

The next point builds on this one, and relates to the integration of the functionality of a service or application using software automation.  Here I want to draw on my own experience in the TMF SDF project, the CloudNFV initiative, my ExperiaSphere work, and work with both operators and vendors in software automation and modeling.  The point is that deployment and application or service lifecycle management must be based on an explicit multi-layer model of the service/application, which serves as the connection point between the software that manages the lifecycle and the management facilities offered by the functional stuff being deployed.

A real router or a virtual software router or an SDN network that collectively performs like a router are all, functionally, routers.  There should then be a model element called “router” that represents all of these things, and that decomposes into the implementation to be used based on policy.  Further, a “router network” is also a router—a big abstract and geographically distributed one.  If everything that can route is defined by that single router object, then everything that needs routing, needs to manage routing, or needs to connect with routing can connect to that object.  It becomes the responsibility of the software automation processes to accommodate implementation differences.
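
As a purely illustrative sketch (the policy thresholds and type names are mine), decomposition of that single “router” object might look something like this, including the recursive case where a router network is itself a router:

```python
def decompose_router(req: dict) -> dict:
    """Turn the abstract 'router' into a deployment plan, driven by policy."""
    sites = req.get("sites", ["default"])
    if len(sites) > 1:
        # A "router network" is itself a router: one member per site, plus trunking.
        return {"type": "router-network",
                "members": [decompose_router({**req, "sites": [s]}) for s in sites]}
    if req.get("throughput_gbps", 0) > 100:
        return {"type": "physical-device", "site": sites[0]}   # policy: big pipes stay on iron
    return {"type": "hosted-instance", "site": sites[0]}       # default: host it in the cloud

plan = decompose_router({"sites": ["CO-12", "HQ", "branch-7"], "throughput_gbps": 10})
```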

The second level of integration we need starts with this set of functional model abstractions, and then demands that vendors who purport to support the NFV process supply the model software stubs that harmonize their specific implementation to that model’s glorious standard.  The router object has a standard management information base (MIB).  If your implementation doesn’t conform exactly, then you have a stub of code to contribute that harmonizes what you use to that standard MIB.
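
A stub like that can be almost trivially small.  Here’s a hypothetical one; the vendor and standard field names are invented, and a real MIB would of course be far richer.

```python
def vendor_stub(vendor_stats: dict) -> dict:
    """Map a hypothetical vendor's native stats onto the standard router MIB view."""
    return {
        "if_oper_status": "up" if vendor_stats["linkState"] == 1 else "down",
        "packets_in":  vendor_stats["rxPkts"],
        "packets_out": vendor_stats["txPkts"],
    }

# Everything above the model element sees only the standard form.
standard_view = vendor_stub({"linkState": 1, "rxPkts": 1200, "txPkts": 980})
```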

This helps define what the model itself has to be.  First, the model has to be an organizer for those stubs.  The “outside” of the model element (like “router”) is a software piece that exposes the set of APIs that you’ve decided are appropriate to that functional element.  Inside that is the set of stub code pieces that harmonize the exposed API to the actual APIs of whatever is being represented by the model—a real router, a management system, a software element—and that performs the function.  Second, the model has to be able to represent the lifecycle states of that functional element, and the events that have to be responded to, whether those events come from other elements, from “above” at the user level, or from “below” at the resource level.

This also defines what integration testing is.  You have a test jig that attaches to the “interfaces” of the model—the router object.  You run that through a series of steps that represent operation and the lifecycle events that the functional element might be exposed to, and you see whether it does what it’s supposed to do.
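
In software terms, the jig could be as simple as a scripted event driver.  The sketch below reuses the hypothetical state and event names from the earlier Service Factory sketch; a real jig would also check management outputs, not just states.

```python
def run_test_jig(factory, element: dict, script: list) -> list:
    """Drive the model element through a lifecycle script of (event, expected_state) pairs."""
    results = []
    for event, expected in script:
        factory.handle(element, event)
        results.append((event, element["state"], element["state"] == expected))
    return results

# Example, using the ServiceFactory sketched earlier:
# run_test_jig(ServiceFactory(), {"name": "router-1", "state": "ordered"},
#              [("activate", "deploying"), ("deployed", "active"),
#               ("fault", "degraded"), ("repaired", "active")])
```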

Infrastructure is symbiotic almost by definition; elements of deployed services should prepare the way for the introduction of other services.  Agile orchestration and portals mean nothing if you can’t host what you want for the future on what your past services justified.  CORD has worked to define a structure for future central offices, but hasn’t done much to define what gets us to that future.  ECOMP has defined how we bind diverse service elements into a single operational framework, but there are still pieces missing in delivering the agility needed early on, and of course ECOMP adoption isn’t universal.

To me, what this means is that NFV proponents have to forget all their past stuff and get behind the union of ECOMP and CORD.  That’s the only combination that can do what’s needed fast enough to matter in the critical first phase of carrier cloud deployment.  I’d like to see the ISG take that specific mission on, and support it with all their resources.

The Evolution of the Carrier Cloud

The concept of carrier cloud has taken some hits over the last months.  Media coverage of network functions virtualization (NFV) shows a significant negative shift in operator attitudes toward NFV’s progress.  Verizon sold off the cloud business it had purchased, and is now reported to be selling off its cloud computing business as well.  Cisco, which had announced an ambitious Intercloud strategy aimed at anchoring a federation of operator clouds, dropped the fabric notion completely.  Vendors are quietly reassessing just what could be expected from sales of cloud infrastructure to network operators.

Do we have a problem here, and could it be reversed?  There have always been two broad drivers for “carrier cloud”.  One is cloud computing services, and the other is the application of cloud hosting to communications services infrastructure.  The history of carrier cloud has been written by the balance of interest between these two drivers, and so will its future.

A decade ago, when operators were getting serious about transformation and looking for new service vehicles to embody their goals, they believed that public cloud services were going to be the driver of carrier cloud.  By 2012 support for that view had declined sharply.  Verizon did the Terremark deal in 2011, a year after the high-water mark of operator interest in public cloud services.

What soured operators on public cloud services was the lack of any credible revenue opportunity.  Many of the operators I’d surveyed back at the start were presuming that the success of cloud computing was written in the stars.  Actually, it was only written in the media.  The presumption that every application run in data centers would shift to the cloud would have opened a trillion dollars in incremental revenue, which certainly was enough to make everyone sit up and take notice.  The presumption was wrong, for three reasons.

The first reason is that the total addressable market presumption was nonsense.  Cloud computing as a replacement for current IT spending is an economy-of-scale play.  Enterprises, in their own data centers, achieve between 92% and 99% of cloud provider economy of scale.  There are some users whose operations costs are enough to add more benefit to the pie, but for most the efficiency difference won’t cover cloud provider profit goals.  Taking this into consideration, the TAM for business migration of software to the public cloud was never more than 24% of total IT spending.
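
To see why that narrow efficiency gap matters, here’s a back-of-envelope illustration.  The specific numbers are assumptions chosen only to make the arithmetic visible; they’re not survey data.

```python
provider_unit_cost   = 1.00                 # cloud provider's cost to run a workload
enterprise_scale     = 0.95                 # assume the enterprise hits ~95% of provider scale
enterprise_unit_cost = provider_unit_cost / enterprise_scale   # ~1.05: only ~5% less efficient
provider_margin      = 0.20                 # assumed margin the provider needs to hit profit goals

cloud_price = provider_unit_cost * (1 + provider_margin)       # 1.20
print(cloud_price > enterprise_unit_cost)   # True: the cloud price exceeds the in-house cost
```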

The second reason is that even the real TAM for cloud services is largely made up of SMB and SaaS.  SMBs are the class of business for whom IaaS hosting can still offer attractive pricing, because SMBs have relatively poor economies of scale.  Enterprise cloud today is mostly SaaS, because those services are easily adopted by line departments and displace nearly all the support costs as well as server capital costs.  Since operators want to sell IaaS, they can’t easily sell to enterprises, and direct sales to SMBs are inefficient and impractical.

The final reason is that real public cloud opportunity depends on platform service features designed to support cloud-specific development.  These features are just emerging, and development practices to exploit them are fairly rare.  An operator is hardly a natural partner for software development, and so competitors have seized this space.

For all these reasons, operator-offered cloud computing hasn’t set the world afire, and it’s not likely to any time soon.  What’s next?  Well, on the communications services side of carrier cloud drivers, the Great Hope was NFV, but here again the market expectations were unrealistic.  Remember that NFV focuses primarily on Layer 4-7 features for business virtual CPE and on what are likely control-plane missions in content and mobile networks (CDN, IMS, EPC).  The first of these missions doesn’t really create carrier cloud opportunities because it is directed primarily at hosting features on agile CPE.  The remaining missions are perhaps more credible as direct drivers of carrier cloud than as drivers of NFV, and it’s these missions that set the stage for the real carrier cloud opportunity.  Unfortunately for vendors, these missions are all over the place in terms of geography, politics, and technology needs.  A shift from box sales to solution sales might help vendors address this variety, but we all know the trend is in the opposite direction.

Virtualization will build data centers, and at a pace that depends first on the total demand/opportunity associated with each service mission and second on the hostable components of the features of each service.  Our modeling of carrier cloud deployment through 2030 shows a market that develops in four distinct phases.  Keep in mind that my modeling generates an opportunity profile, particularly when it’s applied to a market that really has no specific planning drive behind it yet.  These numbers could be exceeded with insightful work by buyers and/or sellers, and of course we could also fall short.

In Phase One, which is where we are now and which will last through 2020, CDN and advertising services drive the primary growth in carrier cloud.  NFV and cloud computing services will produce less than an eighth of the total data centers deployed.  It’s likely, since there are really no agreed architectures for deploying cloud elements in these applications, that this phase will be a series of ad hoc projects that happen to involve hosting.  At the end of Phase One, we will have deployed only 6% of the carrier cloud opportunity.

Phase Two starts in 2021, ushered in by the transformation in mobile and access infrastructure that’s usually considered to be part of 5G.  This phase lasts through 2023, and during it the transformation of services and infrastructure to accommodate mobile/behavioral services will generate over two-thirds of the carrier cloud data center deployments.  This phase is the topic, in a direct or indirect way, for most of the planning now underway, and so it’s this phase that should be considered the prime target for vendors.  At the end of Phase Two, we will have deployed 36% of the total number of carrier cloud data centers.

This is perhaps the most critical phase in carrier cloud evolution.  Prior to it, a diverse set of missions drives carrier cloud, and there’s a major risk that this will create service-specific silos even in data center deployment.  Phase Two is where we’ll see the first true architecture driver—5G.  Somehow this driver has to sweep up all the goodies that were driven before it, or somehow those goodies have to anticipate 5G needs.  How well that’s managed will likely decide how much gets done from 2021 onward.

The next phase, Phase Three, is short in chronological terms, lasting from 2024 through 2025.  This phase represents the explosion of carrier cloud opportunity driven by the maturation of contextual services for consumers and workers, in large part through harnessing IoT.  Certainly, IoT-related big-data and analytics applications will dominate the growth in carrier cloud, which by the end of the phase will have reached 74% of the opportunity.  In numbers terms, it is Phase Three that will add the largest number of data centers and account for the fastest growth in carrier cloud capex.  It’s worth noting that cloud computing services by operators will see their fastest growth in this period as well, largely because most operators will have secured enough cloud deployment to have compelling economies of scale and low-latency connections between their data centers and their users.

The final phase, Phase Four, begins in 2026 and is characterized by an exploitive application of carrier cloud to all the remaining control-plane and feature-based missions.  Both capital infrastructure and operations practices will have achieved full efficiency at this point, and so the barrier to using carrier cloud for extemporaneous missions will have fallen.  Like the network itself, the carrier cloud will be a resource in service creation.

The most important point to be learned from these phases is that it’s service missions that drive carrier cloud, not SDN or NFV or virtualization technology.  Benefits, when linked to a credible realization path, solve their own problems.  Realizations, lacking driving benefits, don’t.  SDN and NFV will be deployed as a result of carrier cloud deployments, not as drivers to it.  There is an intimate link, but not a causal one.

If all this is true, then supporters of the carrier cloud have to forget the notion that technology changes automatically drive infrastructure changes.  Technology isn’t the driver; we’ve seen that proven in every possible carrier cloud application already.  Disconnecting technology evolution from service evolution only cuts it off from the real drivers of change, and thus from the realization of transformation opportunities.

We are eventually going to get to the carrier cloud, but if the pace of that transformation is as slow as it has been, there will be a lot of vendor pain and suffering along the way, and not just for network equipment vendors.  Open source has been the big beneficiary of the slow pace of NFV, and open white-box servers could be the result of slow-roll carrier cloud.  Only a fast response to opportunity or need justifies early market movement, and creates early vendor profit.  You don’t get a fast response by tossing tech tidbits to the market; you get there by justifying a revolution.