Are You in the Mood for Indigo? AT&T’s New Concept Could Change Your Mind!

When you have an architecture that set the standard for NFV, what do you do for an encore?  AT&T’s answer to that question is “Network 3.0 Indigo”, or just “Indigo” for short.  It’s another of those huge concepts that’s difficult to describe or to understand, and its sheer scope is certain to create healthy skepticism about whether AT&T can meet the goals.  Whatever happens in realization, though, Indigo is profoundly important because it frames operators’ views of the future exceptionally well.

Operators have consistently been telling me that their biggest problem with technology initiatives, from SDN and NFV to 5G, is that they seem to be presented as justified for their own sake.  What operators need is a business goal that can be met, an opportunity that can be addressed; in the complex world of networking, most proposed technologies lack the critical property of scope.  They just don’t do the job by themselves, which is why integration is becoming such an issue.  AT&T advanced NFV with ECOMP by incorporating more into it, and they hope to do even more with Indigo.

Let’s start with a quote from AT&T’s Indigo vision statement: “The network of the future will be more than just another ‘G’, moving from 2G to 3G to 4G and beyond.  It’s about bundling all the network services and capabilities into a constantly evolving and improving platform powered by data. This is about bringing software defined networking and its orchestration capabilities together with big data and an emerging technology called microservices, where small, discrete, reusable capabilities can team up as needed to perform a task. And, yes, it’s about so-called ‘access’ technologies like 5G and our recently-announced Project AirGig. Put all that together, and you have a new way to think about the network.”

Feel better, more educated?  Most people who read the above statement don’t, so don’t feel inadequate.  In simple terms, what Indigo is about is creating agility and efficiency, which you’ll probably recognize as the two paramount (credible) NFV goals.  AT&T is making an important statement here, even if it’s not easy to parse.  The network isn’t going to evolve as a series of disconnected technical shifts, but as a result of serving a clear set of business requirements.  Given that, it makes no sense to keep on talking about “SDN” or “NFV” or “5G” as though they were the only games in town.  There has to be a holistic vision, which is why the quote above ends with the statement that Indigo is “a new way to think about the network.”  It’s about creating something that becomes what’s needed.

Faster access, which is pretty much all anyone thinks about these days when they hear about telecom changes, is rapidly reaching a point where further gains in performance will be difficult to notice.  I’ve said many times that most users could not actually exploit even 25 Mbps; you need multiple people sharing a connection to actually use that much.  AT&T correctly points out that at the point where “more bits equals superior service” becomes blasé, it’s the overall experience that counts.  Indigo is therefore an experience-based network model.

But, you might rightfully ask, what the heck is it technically?  The kind of detailed Indigo information that we all might like isn’t available, but it’s possible to interpret the high-level data AT&T has provided to gather some useful insight into their approach.  As you might expect from the notion of “experience-based” network services, Indigo steps out beyond connections, to an intermediary position that AT&T calls a “Data-Powered Community”.  Inside this new artifact are the usual access network options, and the now-common commitment to SDN, but there’s also identity management, AI, a data platform that in my view will emerge as the framework for AT&T’s IoT model, and the software orchestration and management tools that tie all this together.

From what I can see, the key technology concept in Indigo is the breaking down of monolithic software structures and service structures into microservices, which are then orchestrated (presumably using ECOMP).  Just as ECOMP can deploy an NFV-based service, it could deploy a function-based application.  Want an operations tool?  Compose it from microservices.  Want to sell a cloud service?  Compose it.  A Community in Indigo is an ad-hoc composition of functional and connection elements.
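
To make that composition idea concrete, here’s a minimal sketch of what an ad-hoc “Community” assembled from reusable elements might look like in code.  Everything here—the element and community classes, the names, the partner—is my own illustration, not anything AT&T has published, and a real implementation would hand the deployment step to an orchestrator like ECOMP.

```python
# Hypothetical sketch: composing an Indigo-style "Community" from reusable
# microservice elements. Names and classes are illustrative, not AT&T's APIs.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Element:
    name: str
    kind: str          # "function" (e.g., firewall, analytics) or "connection"
    provider: str      # could be the operator or a federated third party

@dataclass
class Community:
    name: str
    elements: List[Element] = field(default_factory=list)

    def compose(self, element: Element) -> None:
        # Ad-hoc composition: any mix of functional and connection elements.
        self.elements.append(element)

    def deploy(self) -> None:
        # In practice an orchestrator (ECOMP, presumably) would do this part.
        for e in self.elements:
            print(f"Deploying {e.kind} element '{e.name}' from {e.provider}")

iot_pilot = Community("smart-city-pilot")
iot_pilot.compose(Element("identity-mgmt", "function", "AT&T"))
iot_pilot.compose(Element("data-platform", "function", "AT&T"))
iot_pilot.compose(Element("5g-access", "connection", "AT&T"))
iot_pilot.compose(Element("analytics", "function", "partner-x"))
iot_pilot.deploy()
```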

The Communities Indigo defines are the frameworks that house the experiences customers value.  That means that traditional networking ends up merging not only with network-related features like agile bandwidth and connectivity, but also with cloud computing and applications.  I think Indigo is a promise that to AT&T, a virtual function and a cloud application will be two faces of the same coin, and that services will use both of these feature sets to add value for users and revenue for AT&T.

One important feature of Indigo is the ability to support services whose pieces are drawn from a variety of sources.  “Federation” isn’t just a matter of interworking connectivity services, it’s a full-blown trust management process that lets third-party partners create elements of services and publish them for composition.  This doesn’t mean that AT&T won’t offer their own advanced service features, but that they expect to have to augment what they can build by incorporating useful stuff from outside.

If you look at the use cases for Indigo that AT&T has already presented, you don’t see more than a hint of what I’m describing.  There are four such use cases, and most of them are pretty pedestrian.  What’s really needed is a broader and clearer picture of this federation approach, and in particular examples of how it might be integrated with IoT services.  If there’s a giant revenue pie that AT&T needs to bite into, IoT will likely create it.  Given this, and given that AT&T cites IoT trends twice in its lead-in to justifying Indigo, it’s surprising that they don’t offer any IoT-specific or even IoT-related use cases.  In fact, beyond the two justifying mentions, IoT doesn’t appear in the rest of the AT&T technical document on Indigo.

Which, frankly, is my big concern about Indigo.  Yes, all the framing points AT&T makes about the evolution of services and service opportunity are true.  Yes, a framework that envelops both connectivity and the experiences users want to be connected with is where we’re heading.  And, yes, it’s true that IoT services are still off in the future.  However, they are the big focus of opportunity, and Indigo will stand or fall based on whether it supports IoT-related services well.  It’s IoT that offers AT&T and other operators an application so big that most competitors (including OTTs) will be afraid to capitalize it.  They can own IoT, if they really can frame it in Indigo terms.

Indigo’s greatest near-term contribution may well be its impact on ECOMP.  Universal orchestration and software decomposition to microservices would mean a significant enhancement to the ECOMP model of defining services and managing their lifecycle.  A broader goal for orchestration is critical for NFV’s success because the scope needed to deliver the business case is larger than the bite the NFV ISG has taken of the issues.  Indigo is big, which is a risk, but here, bigness could be a precursor to greatness.

How Can We Get to Modular Infrastructure for the Carrier Cloud?

In a blog last week, I mentioned the notion of an NFV infrastructure concept that I called a “modular infrastructure model”.  It was a response to operators’ comments that they were most concerned about the possibility that the largest dollar investment in their future—carrier cloud—would end up creating a silo.  They know how to avoid network equipment silos and vendor lock-in, but in the cloud?  It breaks, or demands breaking, new ground, and in several important ways.

Cloud infrastructure is a series of layers.  The bottom layer, the physical server resources and associated “real” data center switching, is fairly standardized.  We have different features, yes, but everyone is fairly confident that they could size resource pools based on the mapping of function requirements to the hardware features of servers.  The problems lie in the layers above.

A cloud server has a hypervisor, and a “cloud stack” such as OpenStack.  It has middleware associated with the platform and other middleware requirements associated with the machine image of the functions/features you’re deploying.  There are virtual switches to be parameterized, and probably virtual database features as well.  The platform features are often tuned to specific application requirements, which means that operators might make different choices for different areas, or in different geographies.  Yet differences in the platform can be hard to resolve into efficient operations practices.  I can recall, back in the CloudNFV project for which I served as the chief strategist, that it took about two weeks just to get a runnable configuration for the platform software, onto which we could deploy virtual functions.
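
To give a sense of how many knobs live in those upper layers, here’s an illustrative (and entirely hypothetical) pair of platform profiles; none of these version choices is drawn from any operator’s actual configuration, but the shape of the problem—lots of small differences to reconcile—is the point.

```python
# Illustrative only: two hypothetical carrier-cloud platform profiles, showing
# how many choices sit above the bare server and below the VNFs.
platform_a = {
    "hypervisor": "KVM",
    "cloud_stack": "OpenStack Newton",
    "virtual_switch": "OVS 2.6 (DPDK)",
    "middleware": ["python-runtime", "message-bus"],
    "guest_image_deps": ["vnf-agent-1.2"],
}
platform_b = {
    "hypervisor": "KVM",
    "cloud_stack": "OpenStack Mitaka",
    "virtual_switch": "OVS 2.5",
    "middleware": ["message-bus"],
    "guest_image_deps": ["vnf-agent-1.0"],
}

# Any key that differs is a potential integration and operations headache.
diffs = {k for k in platform_a if platform_a[k] != platform_b.get(k)}
print("Differences to reconcile:", sorted(diffs))
```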

Operators are concerned that while this delay isn’t the big problem with NFV, it could introduce something that could become a big problem.  If Vendor X offers a package of platform software for the carrier cloud that’s capable of hosting VNFs, then operators would be inclined to pick that just to avoid the integration issues.  That could be the start of vendor lock-in, some fear.

It appears that the NFV standards people were aware of this risk, and they followed a fairly well accepted path toward resolving it.  In the NFV E2E architecture, the infrastructure is abstracted through the mechanism of a Virtual Infrastructure Manager that rests on top of the infrastructure and presents it in a standard way to the management and orchestration (MANO) elements.

Problem solved?  Not quite.  From the first, there was a question of how the VIM worked.  Most people involved in the specs seemed to think that there was one VIM, and that this VIM would resolve differences in infrastructure below it.  This approach is similar to the one taken by SDN, and in particular by OpenDaylight, and it follows the model of OpenStack’s own networking model, Neutron.  However, anyone who has followed Neutron or ODL knows that even for nothing more than connectivity, it’s not easy to build a universal abstraction like that.  Given that, a vendor who had a super-VIM might use it to lock out competitors simply by dragging their feet on supporting them.  Lock-in again.

An easy (or at least in theory, easy) solution to this problem is one I noted just a couple blogs back—you support multiple VIMs.  That way, a vendor could supply a “pod”, or my modular infrastructure model, represented by its own VIM.  Now, nobody can use VIMs to lock their own solutions in.

As usual, it turns out to be a bit more complicated.  The big problem is that if you have a dozen different VIMs representing different resource pods, how do you know which one (or ones) to send requests to, and how do you parcel out the requests for large-scale deployment or change among them?  You don’t want to author specific VIM references into service models because that would make the models “brittle”, subject to change if any changes in infrastructure were made.  In fact, it could make it difficult to do scaling or failover, if you had to reference a random VIM that wasn’t in the service to start with.

There are probably a number of ways of dealing with this, but the one I’ve always liked and proposed for both ExperiaSphere and CloudNFV was the notion of a resource-side and service-side model, similar to the TMF’s Customer-Facing and Resource-Facing Services.  With this model, every VIM would assert a standard set of features (“Behaviors” in my terminology), and if you needed to DeployVNF, for example, you could use that feature with any VIM that represented hosting.  VIM selection would then be needed only to accommodate differences in resource types across a service geography, and it could be done “below” the service model.  Every VIM would be responsible for meeting the standard Behaviors of its class, which might mean all Behaviors if it was a self-contained VNF hosting pod.
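
Here’s a minimal sketch of how behavior-based VIM selection could work; the class names and the Behaviors themselves are my own illustrations, not anything from the ETSI specs or my ExperiaSphere documents.

```python
# Hypothetical sketch of behavior-based VIM selection; names are illustrative.
from abc import ABC, abstractmethod

class VIM(ABC):
    behaviors: set = set()        # e.g., {"DeployVNF", "ConnectL3"}
    geography: str = ""

    @abstractmethod
    def execute(self, behavior: str, params: dict) -> str:
        ...

class VendorXHostingPod(VIM):
    behaviors = {"DeployVNF", "ScaleVNF"}
    geography = "us-east"
    def execute(self, behavior, params):
        return f"pod-x: {behavior}({params})"

class VendorYNetworkPod(VIM):
    behaviors = {"ConnectL3"}
    geography = "us-east"
    def execute(self, behavior, params):
        return f"pod-y: {behavior}({params})"

def select_vim(vims, behavior, geography):
    # The service side asks only for a Behavior in a geography; it never
    # names a specific VIM, so the service model stays non-brittle.
    for vim in vims:
        if behavior in vim.behaviors and vim.geography == geography:
            return vim
    raise LookupError(f"No VIM offers {behavior} in {geography}")

vims = [VendorXHostingPod(), VendorYNetworkPod()]
chosen = select_vim(vims, "DeployVNF", "us-east")
print(chosen.execute("DeployVNF", {"vnf": "firewall"}))
```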

All this is workable, but not sufficient.  We still have to address the question of management, or lifecycle management to be specific.  Every event that happens during a service lifecycle that impacts resource commitments or SLAs has to be reflected in an associated set of remedial steps, or you don’t have service automation and you can’t improve opex.  These steps could easily become very specific if they are linked to VNF processes—which the current ETSI specifications propose to do by having VNF Management (VNFM) at least partially embedded in the deployed VNFs.  If there is, or has to be, tight coupling between resources and VNFM, then you have resource-specific management and a back door into the world of silos at best and vendor lock-in at worst.

There are, in theory, ways to provide generalized management tools and interfaces between the resource and service sides of an operator.  I’ve worked through some of them, but I think that in the long pull most will fail to accommodate the diverse needs of future services and the scope of service use.  That means that what will be needed is asynchronous management of services and resources.  Simply put, “Behaviors” are “resource-layer services” that like all services offer an SLA.  There is a set of management processes that work to meet that SLA, and those processes are opaque to the service side.  You know the SLA is met, or has failed to be met, or is being remediated, and that’s that.

So what does/should a VIM expose in its “Behaviors” API?  All network elements can be represented as a black box of features that have a set of connections.  Each of the features and connections has a range of conditions it can commit to, the boundaries of its SLA.  When we deploy something with a VIM, we should be asking for such a box and connection set, and providing an SLA for each of the elements for which one is selectable.  Infrastructure abstraction, in short, has to be based on a set of classes of behavior to which infrastructure will commit, regardless of exactly how it’s constituted.  That’s vendor independence, silo-independence.
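
As an illustration of that black-box idea, here’s roughly the shape such a request might take; every field name here is an assumption of mine, not a proposed standard.

```python
# Illustrative request shape only; the field names are assumptions, not a spec.
deployment_request = {
    "behavior": "DeployVNF",
    "features": [
        {"name": "vFirewall", "sla": {"availability": "99.99%", "max_latency_ms": 5}},
    ],
    "connections": [
        {"name": "wan-side", "sla": {"bandwidth_mbps": 100}},
        {"name": "lan-side", "sla": {"bandwidth_mbps": 1000}},
    ],
}

def vim_can_commit(request, offered_elements):
    """The VIM accepts or refuses against the requested SLAs; how it maps the
    request onto servers, vSwitches, or white boxes is opaque to the service side."""
    wanted = [f["name"] for f in request["features"] + request["connections"]]
    return all(name in offered_elements for name in wanted)

print(vim_can_commit(deployment_request,
                     offered_elements={"vFirewall", "wan-side", "lan-side"}))
```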

I’m more convinced every day that the key to efficient carrier cloud is some specific notion of infrastructure abstraction, whether we’re talking about NFV or not.  In fact, it’s the diversity of carrier cloud drivers and the fact that nothing really dominates the field in the near term, that makes the abstraction notion so critical.  We have to presume that over time the role of carrier cloud will expand and shift as opportunity focus changes for the operators.  If we don’t have a solid way of making the cloud a general resource, we risk wasting the opportunity that early deployment could offer the network operators.  That means risking their role in the cloud, and in future experience-based services.

Vendors have a challenge here too, because the fear of silos and vendor lock-in is already changing buyer behavior.  In NFV, the early leaders in technology terms were all too slow to recognize that NFV wasn’t a matter of filling in tick marks on an RFP, but making an integrated business case.  As a result, they let the market idle along and failed to gain traction for their own solutions at a time when being able to drive deployment with a business case could have been decisive.  Now we see open-source software, commodity hardware, and anti-lock-in-and-silo technology measures taking hold.

It’s difficult to say how much operator concerns over silos and lock-in are impacting server sales.  Dell, a major player, is private and doesn’t report their results.  HPE just reported its quarterly numbers, which were off enough to generate Street concern.  HPE also said “We saw a significantly lower demand from one customer and major tier 1 service provider facing a very competitive environment.”  That is direct evidence that operators are constraining even server purchases, and it could be an indicator that the fear of silos and lock-in is creating a problem for vendors even now.

Ericsson’s wins at Telefonica and Verizon may also be an indicator.  Where there’s no vendor solution to the problems of making a business case or integrating pieces of technology, integrators step in.  There seems to be a trend developing that favors Ericsson in that role, largely because they’re seen as a “fair broker” having little of their own gear in the game.

It’s my view that server vendors are underestimating the impact of operator concerns that either early server deployments won’t add up to an efficient carrier cloud, or will lock them into a single supplier.  It wouldn’t take a lot of effort to create a “modular infrastructure model” for the carrier cloud.  Because its importance lies mostly in its ability to protect operators during a period where no major driver for deployment has emerged, developing a spec for the model doesn’t fall cleanly into NFV or 5G or whatever.  Vendors need to make sure it’s not swept under the rug, or they face a further delay in realizing their sales targets to network operators.

Despite some of the Street comments and some media views on the HPE problem, the cloud is not suppressing server spending.  Every enterprise survey I’ve done in the last five years shows that cloud computing has not had any impact on enterprise server deployment.  If anything, the cloud and web-related businesses are the biggest source of new opportunity.  Even today, about a third of all servers sold are to web/cloud players and network operators.  My model has consistently said that carrier cloud could add a hundred thousand new data centers by 2030.

If the cloud is raining on the server market, more rain is needed.  It is very likely that network-related server applications represent the only significant new market opportunity in the server space, in which case anything that limits its growth will have a serious impact on server vendors.  The time to fix the problem here is short, and there’s also the threat of open hardware models lurking in the wings.  Perhaps there needs to be a new industry “Call for Action” here.

Just How Much of a Problem is VNF Onboarding and Integration?

We had a couple of NFV announcements this week that mention onboarding or integration.  Ericsson won a deal with Verizon that includes their providing services to integrate VNFs, and HPE announced an “onboarding factory” service that has the same basic goal.  The announcements raise two questions.  First, does this move the ball significantly with respect to NFV adoption?  Second, why is NFV, based as it is on standard interfaces, demanding custom integration to onboard VNFs?  Both are good questions.

Operators do believe that there’s a problem with VNF onboarding, and in fact with NFV integration overall.  Most operators say that integration difficulties are much worse than expected, and nearly all of them put difficulties in the “worse to much worse” category.  But does an integration service or factory change things radically enough to change the rate of NFV adoption significantly?  There, operators are divided, based of course on just how much VNF onboarding and integration they propose.

The majority of VNFs today are being considered in virtual CPE (vCPE) service-chaining business service applications, largely targeting branch office locations connected with carrier Ethernet services.  Operators are concerned with onboarding/integration issues because they encounter business users who like one flavor of VNF or another, and see offering a broad choice of VNFs as a pathway to exploding costs to certify all the candidates.

The thing is, many operators don’t even have this kind of prospect, and most operators get far less than 20% of their revenue from business user candidates for vCPE overall.  I’ve talked with some of the early adopters of vCPE, and they tell me that while there’s a lot of interest in having a broad range of available VNFs, the fact is that for any given category of VNF (firewall, etc.) there are probably no more than three candidates with enough support to justify including them in a vCPE function market list.

The “best” applications for NFV, meaning those that would result in the largest dollar value of services and of infrastructure purchasing, are related to multi-tenant stuff like IoT, CDN, and mobility.  All but IoT among this group tend to involve a small number of VNFs that are likely provided by a single source and are unlikely to change or be changed by the service customer.  You don’t pick your own IMS just because you have a mobile phone.  That being the case, it’s unlikely that one of the massive drivers of NFV change would really be stalled out on integration.

The biggest problem operators say they have with the familiar vCPE VNFs isn’t integration, but pricing, or perhaps the pricing model.  Most VNF providers say they want to offer their products on a usage price basis.  Operators don’t like usage prices because they feel they should be able to buy unlimited rights to the VNF at some point.  Some think that as usage increases, unit license costs should fall.  Other operators think that testing the waters with a new VNF should mean low first-tier prices that gradually rise when it’s clear they can make a business case.  In short, nothing would satisfy all the operators except free VNFs, which clearly won’t make VNF vendors happy.

Operators also tell me they’re more concerned about onboarding platform software and server or network equipment than VNFs.  Operators have demanded open network hardware interfaces for ages, as a means of preventing vendor lock-in.  AT&T’s Domain 2.0 model was designed to limit vendor influence by keeping vendors confined to a limited number of product zones.   What operators would like to see is a kind of modular infrastructure model where a vendor contributes a hosting and/or network connection environment that’s front-ended by a Virtual Infrastructure Manager (VIM) and that has the proper management connections to service lifecycle processes.

We don’t have one of these, in large part, because we still don’t have a conclusive model for either VIMs or management.  One fundamental question about VIMs is how many there could be.  If a single VIM is required, then that single VIM has to support all the models of hosting and connectivity needed, which is simply not realistic at this point.  If multiple VIMs are allowed, then you need to be able to model services so that the process of decomposition/orchestration can divide up the service elements among the infrastructure components each VIM represents.  Remember, we don’t have a solid service modeling approach yet.

The management side is even more complicated.  Today we have the notion of a VNF Manager (VNFM) that has a piece living within each VNF and another that’s shared for the infrastructure as a whole.  The relationship between these pieces and underlying resources isn’t clear, and it’s also not clear how you could provide a direct connection between a piece of a specific service (a VNF) and the control interfaces of shared infrastructure.

This gets to the second question I noted in my opening.  Why is this so much trouble?  Answer: Because we didn’t think it out fully before we got committed to a specific approach.  It’s very hard to go back and redo past thinking (though the NFV ISG seems to be willing to do that now), and it’s also time-consuming.  It’s unfair to vendors to do this kind of about-face as well, and their inertia adds delay to a process that’s not noted for being a fast-mover as it is.  The net result is that we’re not going to fix the fundamental architecture to make integration and onboarding logical and easy, not any time soon.

That may be the most convincing answer to the question of the relevance of integration.  If we could assume that the current largely-vCPE integration and onboarding initiatives were going to lead us to something broadly useful and wonderfully efficient, then these steps could eventually be valuable.  But they still don’t specifically address the big issue of the business case, an issue that demands a better model for the architecture in general, and management in particular.

I understand what vendors and operators are doing and thinking.  They’re taking baby steps because they can’t take giant strides.  But baby steps and giant strides alike are more dangerous than useful if they lead to a cliff, and we need to do more in framing the real architecture of virtualization for network functions before we get too committed to the specific mechanisms needed to get to a virtual future.

Cloud Computing Service Success for Operators: Can It Be Done?

Operators have been fascinated by the public cloud opportunity, and new initiatives like that of Orange Business Services seem to promise that this fascination could be gaining some traction in the real world.  But Verizon at the same time is clearly pulling back from its public cloud computing commitments.  What’s really going on with operator public cloud services?

In a prior blog, I noted that operators had initially seen cloud computing services as an almost-natural fit.  These services require a major investment, and they offer a relatively small return, which fits the public-utility model most operators still adhere to.  Publicity on cloud computing suggested an oncoming wave of adoption that could carry everyone to riches, a trillion-dollar windfall.  It didn’t happen, of course, and after the blog I cited, I heard more from operator planners who were eager to offer some insight into their own situations.

All of those who contacted me agreed that the biggest problem they faced with cloud computing services was the unexpected difficulty in selling the services.  Operators are good at stuff that is marketed rather than sold, where publicity stimulates buyers to make contact and thus identify themselves.  They’re also often at least passable at dealing one-on-one with large prospective buyers, like the big enterprises.  They’re not good at pounding the pavement doing prospecting, and that seemed to be what cloud computing was really about.

One insight that operators offered on this point was that their initial target for cloud computing was the large enterprise CIO types, the same people who are instrumental in big telecom service buys.  They found that for the most part enterprise public cloud was driven by line department interest in “shadow IT” and that the CIO was as likely (or more) to oppose the cloud move as to support it.  Certainly they were not usually the central focus in making the deal.  That meant that operators would have to reach out to line departments, and that broke the sales model.

The second problem operators reported was the complexity of the cloud business case.  Operators believed rosy predictions of major savings, but while there might indeed be strong financial reasons to move at least some applications to the cloud, they were difficult to quantify.  Often there had to be a formal study, which of course either the operator had to do or had to convince the prospective buyer to do.  Several operators said they went through several iterations of this, and never came up with numbers the CFO would accept.

The final issue was security and governance.  Operators said that there were always people (often part of the IT organization) who asked specific questions about cloud security and governance, and those questions were very difficult to answer without (you guessed it!) another study.  This combined with the other issues to lengthen the selling cycle to the point where it was difficult to induce salespeople to stay the course.

If you contrast Orange and Verizon, you can see these factors operating.  In both cases, the operators were looking at headquarters sales targets.  Verizon serves the largest number of corporate headquarters of any Tier One, and so it seemed to them they should have the best chance of doing a deal with the right people.  Orange seems to be proving that’s true only to a point; you can present the value proposition to headquarters, but it still has to be related to a compelling problem the buyer already accepts.  Multinationals, the Orange sales target, have a demonstrable problem in providing IT support in all their operating geographies.  The cloud is a better solution than trying to build and staff data centers everywhere in the world.

The question, of course, is whether the opportunity will be worth Orange’s building those data centers.  In effect, their approach is almost like a shared hosting plan; if a bunch of multinationals who need a data center in Country X combine to share one (Orange’s cloud service) the single data center would be a lot more cost-effective than the sum of the companies’ independent ones.  If there are enough takers to Orange services, then it works.  Obviously one customer in every data center would end up putting Orange in the “inefficient and unwise” category of data center deployment.  We can’t say at this point how well it will go for them.

It does seem that the Orange model of exploiting headquarters relationships and specific problems/issues is the right path for operators looking to get into public cloud services.  This would have the best chance of working where there were a lot of headquarters locations to sell to, obviously, which means fairly “thick” business density.  However, as I said, Verizon had that and couldn’t make a go of things, so what made their situation different?

Probably in part competition, less the direct-to-the-wallet kind than the hearts-and-minds kind.  US companies know the cloud well from players like Microsoft and Amazon, and they perceive network operators as come-from-behind players who are next-to-amateurs in status.  Probably in part geography; in the EU the operators are held in higher strategic regard, and most of them have faced profit pressure for a longer time, so they’re further along in the cycle of transformation.

The real question is what cloud needs the operators could more broadly fill, and the answer to that may be hard to deal with if you’re an operator.  My model says that there are no such needs, that there is no single opportunity that could pull through carrier cloud computing service success.  The only thing that might do it down the line is IoT, but the situation changes down the line in any case.

As operators manage to deploy carrier cloud for other missions, they’ll achieve economy of scale and a coveted low-latency position at the edge.  Those factors will eventually make them preferred providers, and once they take hold then carrier cloud computing services will quickly gain acceptance.

The only problem with that story is that it’s going to take time.  Between 2019 and 2021 is the hot period according to the model, the time when there’s finally enough cloud infrastructure to make operators truly competitive.  Even that requires that they deploy cloud infrastructure in other short-term service missions, starting even this year.  That may not happen, particularly if 5G standards take as much time to mature as NFV specifications have taken.

This could be a long slog, even for Orange.  My model says their own situation is right on the edge, and execution of both deployment and sales will have to be perfect or it won’t work and they’ll end up doing what Verizon has done.

The Road Ahead for the Networking Industry

Think network hardware is going great?  Look at Cisco’s results, and at Juniper’s decision to push even harder in security (which by the way is also a hot spot for Cisco’s quarter).  Look at the M&A in the vendor space.  Most of all, look at everyone’s loss of market share to Huawei.  USTelecom Daily Lead’s headline was “Network Hardware Woes Crimp Cisco Sales in Q2.”  SDxCentral said “Cisco’s Switching and Routing Revenue Dragged in Q2.”  Switching, routing, and data center were all off for Cisco and total revenue was down (again).  Do we need to set off fireworks to signal something here?

We clearly have a classic case of shrinking budgets for network buyers.  On the service provider side, the problem is that profit-per-bit shrinkage that I’ve talked about for years.  On the enterprise side, the problem is that new projects to improve productivity are half as likely to have a network budget increase component as they were a decade ago.  The Street likes to say this is due to SDN and NFV, but CFOs tell me that neither technology has had any measurable impact on spending on network gear.  Price pressure is the problem; you beat up your vendors for discounts and if that doesn’t work you go to Huawei.

None of this is a surprise, if you read my blog regularly.  Both the service provider and enterprise trends are at least five years old.  What is surprising, perhaps, is that so little has been done about the problem.  I remember telling Cisco strategists almost a decade ago that there was a clear problem with the normally cyclical pattern of productivity-driven IT spending increases.  I guess they didn’t believe me.

We are beyond the point now where some revolution in technology is going to save network spending.  In fact, all our revolutions are aimed at lowering it, and Cisco and its competitors are lucky that none of them are working very well—yet.

Equipment prices, according to my model, will fall another 12% before hitting a level where vendors won’t be willing/able to discount further.  That won’t be enough to stave off cost-per-bit pressure, so we can expect to see “transformation” steps being taken to further cut costs.  This is where vendors have a chance to get it right, or continue getting it wrong.

There is no way that adding security spending to offset reductions in network switch/router spending is going to work.  Yes, you could spend a bit more on security, but the gains we could see there can’t possibly offset that 12% price erosion, nor can they deter what’s coming afterward.  What has to happen is that there is a fundamental change in networking that controls cost.  The question is only what that change can be, and there are only two choices—major efforts to reduce opex, or a major transformation of infrastructure to erode the role of L2/L3 completely.

Overall, operators spend perhaps 18% or 19% of every revenue dollar on capital equipment.  They’ll spend 28% of each dollar on “process opex”, meaning costs directly attributable to service operations and customer acquisition/retention, in 2017.  If we were to see a reduction in capex of that 12%, we’d end up with about a two percent improvement.  Service automation alone could reduce process opex by twice that.  Further, by 2020 we’re going to increase process opex to about 31% of each revenue dollar, an increase larger than the credible capex reduction by price pressure could cover.  By that time, service automation could have lowered process opex to 23% of revenue.  That’s more than saving all the capex budget could do.
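
For those who like to see the arithmetic, here it is worked out per dollar of revenue, using the figures I just quoted; the numbers come from my model, and the code is only there to make the comparison explicit.

```python
# The arithmetic above, worked per dollar of revenue, using the figures quoted.
capex = 0.185                     # roughly 18-19 cents of each revenue dollar
process_opex_2020 = 0.31          # projected process opex share in 2020

capex_saving = capex * 0.12       # the credible 12% price-driven capex cut
print(f"Capex price pressure saves about {capex_saving:.3f} per revenue dollar")

automation_saving_now = 2 * capex_saving   # "twice that" from service automation
print(f"Near-term service automation: about {automation_saving_now:.3f}")

automation_saving_2020 = process_opex_2020 - 0.23   # opex cut from 31 to 23 cents
print(f"Service automation by 2020: {automation_saving_2020:.2f}, "
      f"or {automation_saving_2020 / capex_saving:.1f}x the capex cut")
```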

SDN and NFV could help too, but the problem is that the inertia of current infrastructure limits the rate at which you could modernize.  The process opex savings achieved by SDN/NFV lags that of service automation without any infrastructure change by a bit over two years.  The first cost of the change would be incurred years in advance of meaningful benefits, which means that SDN/NFV alone cannot solve the problem unless the operators dig a very big cost hole that will take five years or more to fill with benefits.

The infrastructure-transformation alternative would push more spending down to the fiber level, build networks more with virtual-wire or tunnel technology at the infrastructure level, and shift to overlay (SD-WAN-like) technology for the service layer.  This approach, again according to my model, could cut capex by 38%, and in combination with service management automation, it could cut almost 25 cents of cost per dollar of revenue.  The problem is the time it would take to implement it, because operators would have to find a way to hybridize the new model with current infrastructure to avoid having a fork-lift-style write-down of installed equipment.  My model says that SD-WAN technology could facilitate a “soft” migration to a new infrastructure model, so the time needed to achieve considerable benefits could be as little as three years.

So, what can the network equipment vendors do here?  It doesn’t take an accountant to see that the service automation approach would be better for network equipment vendors because it wouldn’t require any real infrastructure change.  However, there are two issues with it.  First, the network equipment vendors have been singularly resistant to pushing this sort of thing, perhaps thinking it would play into the hands of the OSS/BSS types.  Second, it may be too late for the network vendors to jump on the approach, given that operators are already focused on lowering equipment spending by putting pressure on vendors (or switching to Huawei, where possible).

Some of the network equipment strategists see inaction as an affirmative step.  “We don’t need to do anything to drive service automation,” one said to me via email.  “Somebody is going to do it, and when they do it will take capex pressure off.”  Well, I guess that’s how the waffling got started.  Others tell me that they saw service automation emerging from SDN/NFV, which they didn’t want to support for obvious reasons.

The potential pitfall of the inaction approach is that a competitor might be the one who steps up and takes the action instead of you.  Cisco can afford to have Amdocs or perhaps even Oracle or HPE become a leader in service automation, but they can’t let Nokia or (gasp!) Huawei do that.  If a network vendor developed a strong service automation story they could stomp on the competition.

Worse, an IT vendor could stomp on all the network vendors if they developed a story combining service automation with the push-down-and-SD-WAN model of infrastructure I described above.  Operators are receptive to this message for the first time, in no small part because of something I’ve mentioned above—they’ve become focused on cutting capex by putting price pressure on vendors.  SD-WAN has tremendous advantages as a vehicle for offering business services, not the least of which is that it’s a whole lot better down-market than MPLS VPNs.  It’s also a good fit for cloud computing services.  A smart IT vendor could roll with this.

If we have any.  The down-trend in network spending has been clear for several years now, and we still find it easier to deny it than to deal with it.  I suspect that’s going to change, and probably by the end of this year, and we’ll see who then steps up to take control over networking as an industry.  The answer could be surprising.

Don’t Ignore the Scalability and Resilience of SDN/NFV Control and Management!

It would be fair for you to wonder whether the notion of intent-based service modeling for transformed telco infrastructure is anything more than a debate on software architecture.  In fact, that might be a very critical question because we’re really not addressing, so far, the execution of the control software associated with virtualization in carrier infrastructure.  We’ve talked about scaling VNFs, even scaling controllers in SDN.  What about scaling the SDN/NFV control, orchestration, and management elements?  Could we bring down a whole network by a classic fault avalanche, or even just by a highly successful marketing campaign?  Does this work under load?

This isn’t an idle question.  If you look at the E2E architecture of the NFV ISG, you see a model that if followed would result in an application with a MANO component, a VIM component, and a VNFM component.  How does work get divided up among those?  Could you have two of a given component sharing the load?  There’s nothing in the material to assure that an implementation is anything but single-threaded, meaning that it processes one request at a time.

I think there are some basic NFV truths and some basic software truths that apply here, or should.  On the NFV side, it makes absolutely no sense to demand that there be scalability under load and dynamic replacement of broken components at the VNF level, and then fail to provide either for the NFV software itself.  At the basic software truth level, we know how the cloud would approach the problem, and so we have a formula that could be applied and has been largely ignored.

In the cloud, it’s widely understood that scalable components have to be stateless and must never retain data within the component from call to call.  Every time a component is invoked, it has to look like it’s a fresh-from-the-library copy, because given scalability and availability management demands, it might just be that.  Microservices are an example of a modern software development trend (linked to the cloud but not dependent on it) that also mandates stateless behavior.

This whole thing came up back about a decade ago, with the work being done in the TMF on the Service Delivery Framework.  Operators expressed concern to me over whether the architecture being considered was practical:  “Tom, I want you to understand that we’re 100% behind implementable standards, but we’re not sure this is one of them,” was the comment from a big EU telco.  With the support of the concerned operators (and the equipment vendor Extreme Networks) I launched a Java project to prove out how you could build scalable service control.

The key to doing that, as I found and as others found in other related areas, is the notion of back-end state control, meaning that all of the variables associated with the way that a component handles a request are stored not in the component (which would make it stateful) but in a database.  That way, any instance of a component can go to the database and get everything it needs to fulfill the request it receives, and even if five different components process five successive stages of activity, the context is preserved.  That means that if you get more requests than you can handle, you simply spin up more copies of the components that do the work.
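
Here’s a minimal sketch of back-end state control; the event names and transitions are invented, and the in-memory dictionary stands in for what would really be a shared database, but the principle—any instance can pick up any request because nothing lives in the component—is the point.

```python
# Minimal sketch of back-end state control. Any instance of the worker can
# process the next request because all context lives in the store, not in the
# process. The in-memory dict stands in for a real shared database.
state_store = {}   # keyed by service (or contract) id

def handle_request(service_id: str, event: str) -> str:
    # Fetch everything we need; the component keeps nothing between calls.
    context = state_store.get(service_id, {"state": "orderable", "steps": []})

    # Do the work for this stage (illustrative transition logic only).
    context["steps"].append(event)
    if event == "deploy":
        context["state"] = "active"
    elif event == "fault":
        context["state"] = "remediating"

    # Write the context back so a *different* instance can handle the next event.
    state_store[service_id] = context
    return context["state"]

print(handle_request("svc-42", "deploy"))   # active
print(handle_request("svc-42", "fault"))    # remediating (any copy could do this)
```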

You could shoehorn this approach into the strict structure of NFV’s MANO, but it wouldn’t be the right way—the cloud way—of doing it.  The TMF work on NGOSS Contract demonstrated that the data model that should be used for back-end state control is the service contract.  If that contract manages state control, and if all the elements of the service (what the TMF calls “Customer-Facing” and “Resource-Facing” Services, or CFS/RFS) store their state variables in it, then a copy of the service contract will provide the correct context to any software element processing any event.  That’s how this should be done.

The ONF vision, as they explained it yesterday, provides state control in their model instances, and so do all my own initiatives in defining model-driven services.  If the “states” start with an “orderable” state and advance through the full service lifecycle, then all of the steps needed to deploy, redeploy, scale, replace, and remove services and service elements can be defined as the processes associated with events in those states.  If all those processes operate on the service contract data, then they can all be fully stateless, scalable, and dynamic.
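
A sketch of what that looks like in practice: a state/event table whose entries are stateless processes operating on the service contract record.  The states, events, and process names here are illustrative, not the TMF’s or the ONF’s actual definitions.

```python
# Sketch of contract-driven state/event handling; states, events, and process
# names are illustrative only.
def deploy(contract):
    contract["state"] = "active"
    return "deployed"

def scale_out(contract):
    contract["instances"] += 1
    return "scaled"

def redeploy(contract):
    contract["state"] = "active"
    return "redeployed"

STATE_EVENT_TABLE = {
    ("orderable", "activate"): deploy,
    ("active", "overload"):    scale_out,
    ("active", "fault"):       redeploy,
}

def dispatch(contract, event):
    # Any stateless worker can run this: all context is in the contract itself.
    process = STATE_EVENT_TABLE[(contract["state"], event)]
    return process(contract)

contract = {"id": "svc-42", "state": "orderable", "instances": 1}
print(dispatch(contract, "activate"))   # deployed
print(dispatch(contract, "overload"))   # scaled
```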

Functionally, this can still map back to the NFV ISG’s E2E model, but the functions described in the model would be distributed in two ways—first by separating their processes and integrating them with the model state/event tables as appropriate, and second by allowing their execution to be distributed across as many instances as needed to spread the load or replace broken pieces.

There are some specific issues that would have to be addressed in a model-driven, state/event, service lifecycle management implementation like this.  Probably the most pressing is how you’d coordinate the assignment of finite resources.  You can’t have five or six or more parallel processes grabbing for hosting, storage, or network resources at the same time—some things may have to be serialized.  You can have the heavy lifting of making deployment decisions, etc. operating in parallel, though.  And there are ways of managing the collision of requests for resources too.

Every operator facility, whether it’s network or cloud, could be a control domain, and while multiple requests to resources in the same domain would have to be collision-proof, you could have multiple domain requests running in parallel.  Thus, you can reduce the impact of the collision of requests.  This is necessary in my distributed approach, but it’s also necessary in today’s monolithic model of NFV implementation.  Imagine how you’d be able to deploy national/international services with a single instance of MANO!
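
Here’s a simple sketch of that per-domain serialization idea—requests within one control domain are serialized by that domain’s lock, while different domains proceed in parallel.  The domains and resources are, of course, invented for illustration.

```python
# Sketch of per-domain serialization: resource commitments within one control
# domain are serialized by that domain's lock, while domains run in parallel.
import threading

domain_locks = {
    "cloud-east": threading.Lock(),
    "cloud-west": threading.Lock(),
}

def assign_resources(domain: str, request: str) -> None:
    with domain_locks[domain]:     # collisions are possible only within a domain
        # ...check capacity and commit the resources here...
        print(f"{domain}: committed {request}")

threads = [
    threading.Thread(target=assign_resources, args=("cloud-east", "10 vCPUs")),
    threading.Thread(target=assign_resources, args=("cloud-west", "10 vCPUs")),
    threading.Thread(target=assign_resources, args=("cloud-east", "20 GB storage")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```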

The final point to make here is that “deployment” is simply a part of the service lifecycle.  If you assume that you deploy things using one set of logic and then sustain them using another, you’re begging for the problems of duplication of effort and very likely inconsistencies in handling.  Everything in a service lifecycle should be handled the same way, be defined by the same model.  That’s true for VNFs and also for the NFV control elements.

This isn’t a new issue, which perhaps is why it’s so frustrating.  In cloud computing today, we’re seeing all kinds of initiatives to create software that scales to workload and that self-heals.  There’s no reason not to apply those principles to SDN and NFV, particularly when parts of NFV (the VNFs) are specifically supposed to have those properties.  There’s still time to work this sort of thing into designs, and that has to be done if we expect massive deployments of SDN/NFV technology.

The “New ONF” Declares a Critical Mission, but Can They Fulfill It?

Yesterday the “New ONF” formed by the union of the old ONF and ON.Labs announced its new mission and its roadmap to achieving it.  I’m a guy who has worked in standards for well over two decades, and the experience has made me perhaps more cynical about standards than I am about most things (which, most of my readers will agree, is pretty darn cynical).  The new ONF actually excites me by stating a goal set and some key points that are spot on.  It also frightens me a little because there’s still one thing that the new group is doing that has been a major cause of failure for all the other initiatives in the service provider transformation space.

The “new ONF” is the union of the Open Networking Foundation and ON.Labs, the organization that created the ONOS operating system and CORD, both of which I’ve talked about in the past.  I blogged about the importance of CORD early on (see THIS blog) and again when Comcast jumped into the consortium, HERE, and everyone probably knows that the ONF is the parent of OpenFlow SDN.  The new ONF seems more focused on the ON.Labs elements, from which they hope to create a way to use software-based or software-defined elements to build market-responsive networks and network services.

Networks of old were a collection of boxes joined by very standardized hardware interfaces.  Then, enter virtualization, software definition, the cloud, and all the other good stuff that’s come along in the last decade.  Each of these new initiatives had/have their champions in terms of vendors, buyers, and standardization processes.  Each of these initiatives had a very logical mission, and a logical desire to contain scope to permit timely progress.  Result?  Nothing connects in this wonderful new age.

This is a perhaps-flowery restatement of the opening positioning that the ONF offers for its new concept of the Open Innovation Pipeline.  The goal of the process is the notion of the “Software-Defined Standard”, something that by itself brings tears to the eyes of an old software architecture guy like me.  We’ve gone way too far along the path of supposed software-defined stuff with little apparent concern for software design principles.  The ONF says they want to fix that, which has me excited.

Digging into the details, what the ONF seems to be proposing is the creation of an open ecosystem that starts (at least in many cases) with the ONOS operating system, to which is added the XOS orchestration layer (which is a kind of service middleware).  This is used to build the variety of CORD models (R-CORD, M-CORD, etc.), and it can also be used to build new models.  If this approach were to be followed, it would create a standardized open-source platform that builds from the bottom to the top, and that provides for easy customization and integration.

But it’s at the top of the architectural heap that I find what makes me afraid.  The architectural slide in all of this shows the open structure with a programmable forwarding plane at the bottom, a collection of Global Orchestrators at the top, and the new ONF focus as a box in between.  This vision is of course device-centric, and in the real world you’d be assembling conforming boxes and presumably other boxes, virtual or real, to create networks and services.  I don’t have a problem with the idea that there’s a forwarding plane at the bottom, because even service elements that are outside the service data plane probably have to forward something.  I’m a bit concerned about that Global Orchestrator thing at the top.

I’ve been a part of a lot of standards processes for decades, and it seems like all of them tend to show a diagram that has some important function sitting god-like at the top, but declared safely out of scope.  That’s what the ONF has done with those Global Orchestrators.  The problem with those past bodies and their past diagrams is that all of them failed their critical mission to make the business case, and all of them failed because they didn’t include elements that were critical to their business case in their scope of work.  So the fact that the ONF seems to do this is discouraging.

The ONF is right in saying that there’s an integration problem with the new-generation virtualization-based services.  They are also right in saying that having a common platform on which the elements of those new services are built would solve that problem, through the simple mechanism of a common implementation platform on which the features were built.  However, the past says that’s not enough, for two reasons.

First, not everything is built on the ONF’s architecture.  Even if we presumed that everything new was built that way, you’d still have to absorb all the legacy hardware and accommodate the open-source initiatives for other virtualized-element models, all of which aren’t based on the ONF’s elements.  We have learned the bitter truth in NFV in particular; you can’t exclude the thing you are evolving from (legacy devices in particular) in your model of a future service, unless you never want to get there from here.  You could accommodate the legacy and “foreign” stuff in the ONF approach, but the details aren’t there yet.

Second, there’s the issue of the business case.  I can have a wonderful architecture for building standardized car parts, but it won’t serve me a whit if nobody wants to buy a car.  I’ve blogged a lot about the business case behind a new virtual service element—SDN, NFV, or whatever you like.  Most of that business case is going to come from the automation of the full service lifecycle, and most of that lifecycle and the processes that automate it live in that Global Orchestrators element that’s sitting out of scope on top of the ONF target functionality.

All of this could be solved in a minute with the inclusion of a model-based service description of the type I’ve been blogging about.  I presented just this notion to the ONF, in fact, back in about 2014.  A model like that could organize all of the pieces of ONF functionality, and it could also organize how they relate to the rest of the service processes, whether they’re NFV processes, OSS/BSS processes, or cloud computing.  Yes, this capability would be in a functional Global Orchestrator, but there aren’t any of them available and we know that because nobody has successfully made the business case with one, nor have they integrated all the service lifecycle processes.

There is a modeling aspect to the XOS layer, and it’s got all the essential pieces, as I said in my first blog on it (see above).  However, in execution, XOS seems to have changed its notion of “service” from a high-level one to something more like the TMF’s “Resource-Facing Services” or my ExperiaSphere “Behaviors”.  They’re what a network or infrastructure can do, more than a functional assembly that when decomposed ends up with these infrastructure capabilities.  That seems to be what created the Global Orchestrator notion; the lost functionality is pushed up into the out-of-scope part.  That’s what frightens me, because it’s the mistake that so many others have made.

I’m not knocking the new ONF here, because I have high hopes for it.  They, at least, seem to grasp the simple truth that software defined stuff demands a definition of stuff in software terms.  I also think that, at a time when useful standards to support integration in SDN and NFV seem to be going nowhere, the notion of a common platform seems unusually attractive.  Is it the best approach?  No, but it’s a workable one, which says a lot at this point.

There have been a lot of recent re-launches of standards bodies, industry groups, and activities, brought about because the original efforts generated interest, hype, and media extravagance, but not much in the way of deployment or transformation.  The new ONF now joins the group of industry mulligans, and the question is whether it will jump off what’s unquestionably a superior foundation and do the right thing, or provide us with another example of how to miss the obvious.  I’ll offer my unbiased view on that as the details of the initiative develop.

What Will Become of Test and Measurement in a Virtual World?

One of the joke statements of the virtual age is “You can’t send a real tech to fix a virtual problem!”  Underneath the joke is a serious question: just what happens to test and measurement in a virtual world?  Virtualization opens two issues—how do you test the virtual processes and flows, and how does virtualization impact T&M’s usual missions?  We’ll look at both today.

Test and measurement (T&M) differs from “management” in that the latter focuses on the ongoing status of things and the reporting of changes in status.  Management is status monitoring and not network monitoring.  T&M, in contrast, aims at supporting the “craft processes” or human activity that’s associated with taking a refined look at something—is it working, and how well—with the presumptive goal of direct remediation.

Many people, including me, remember the days when resolving a network problem involved looking at a protocol trace, and that practice is a good place to start our exploration.  Whether you have real or virtual devices, the data flows are still there and so are the issues of protocol exchanges.  However, a virtual device is fundamentally different from a real one, and the differences have to be accommodated in any realistic model of T&M for the virtual age.

There’s an easy-to-see issue that we can start with.  A real device has a location.  A virtual device has one too, in the sense that it’s hosted somewhere, but the hosting location isn’t the same thing as the location of a box.  A box is where it is; a virtual router instance is where it was convenient to put it.  At the least, you’d have to determine where an instance was being hosted before you could run out and look at it.  But that initial check of location isn’t enough in a virtual world.  Imagine a tech en route to stick a monitor in a virtual router path, only to find that while en route, the virtual router “moved”.  It’s common to have a soft collision between management-driven changes in a network and remediation, but in the traditional world the boxes at least stay put.  T&M in a virtual world has to deal with the risk of movement of the instance while the tech is setting up or during the test.

Simplicity is comforting even when it’s not quite as simple as it looks, but this simple point of “where is it?” isn’t the real problem.  If software automation to improve opex is the goal (which operators say it is) for virtualization, then we’d have to assume that the goal is to move away from “T&M” to “management”, since the former is explicitly a human activity.  That means that in the future, not only would it be more likely that a virtual router got moved, it would be likely that if there were a problem with it the first goal would be to simply replace it—an option that’s fine if you’re talking about a hosted software element but problematic if you’re dealing with a real box.  So, we’re really saying that virtualization first and foremost alters the balance between management and T&M.

When do you send a tech, or at least involve one?  The only satisfactory answer, in a time when opex reduction is key, is "When you've exhausted all your other options."  One operator told me that their approach was something like this (a rough code sketch follows the list):

  1. If there’s a hard fault or an indication of improper operation, you re-instantiate and reroute the service as needed. It’s like saying that if your word processor is giving you a problem, save and reload it.
  2. If the re-instantiation doesn't resolve things, you check to see if there was any change to software versions in the virtual device or its platform that, in timing, seems possibly related to the issue. If so, you roll back to the last configuration that worked.
  3. If neither of these steps resolves things, or neither applies, then you have to try remediation. The operator says they'd first try to reroute or redeploy the service around the whole faulty function area, and then try to recreate the problem in a lab under controlled conditions.  If that wasn't possible, they'd assume T&M was needed.
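Here's a rough Python rendering of that escalation ladder.  Everything called on the "ops" object is a hypothetical stand-in for the operator's orchestration and lab tooling, not any vendor's actual interface.

  # Hypothetical sketch of the escalation sequence described above. The "ops"
  # facade (reinstantiate, service_healthy, config_changed_recently, rollback,
  # reroute_around, reproduce_in_lab) is assumed, not a real API.

  def handle_fault(ops, service, faulty_function):
      # Step 1: re-instantiate and reroute; the virtual equivalent of "save and reload".
      ops.reinstantiate(faulty_function)
      if ops.service_healthy(service):
          return "resolved by re-instantiation"

      # Step 2: if a recent software or platform change lines up with the fault, roll back.
      if ops.config_changed_recently(faulty_function):
          ops.rollback(faulty_function)
          if ops.service_healthy(service):
              return "resolved by rollback"

      # Step 3: route around the faulty area and try to recreate the problem offline.
      ops.reroute_around(service, faulty_function)
      if ops.reproduce_in_lab(faulty_function):
          return "diagnosed in lab"
      return "T&M needed"   # only now does a tech, real or virtual, get involved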

The same operator says that if we assumed a true virtual network, the goal would be to avoid dispatching a tech in favor of some kind of testing and monitoring from the network operations center (NOC).  The RMON specification from the IETF can be implemented in most real or virtual devices, and there are still a few companies that use hardware or software probes of another kind.  This raises the question of whether you could do T&M in a virtual world using virtual monitoring and test injection, which would eliminate the need to dispatch someone to hook up an analyzer.  A “real” dispatch would be needed only if there were a hardware failure of some sort on site, or a situation where a manual rewiring of the network connections of a device or server was needed.

One advantage of the virtual world is that you could instantiate a monitoring point as software somewhere convenient, and either connect it to a "T" you kept in place at specific locations, or cross-connect to it by rerouting.  The main issue with this approach is the same one you can run into with remote monitoring today: the delay introduced between the point where the flow is "tapped" and the point where the monitoring is viewed.  If you aren't doing test injection at the monitoring point, that delay should matter little; if you are, you'd need to install a more sophisticated remote probe, one capable of executing trigger-driven responses locally.
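As a sketch of what that might look like, here's a Python fragment that "pushes a probe" to a convenient hosting point and either splices it into a pre-positioned tap or reroutes the flow through it.  The orchestrator object and every method on it are assumptions made for illustration, not CORD, ECOMP, or any vendor's API.

  # Hypothetical sketch of instantiating a software monitoring point. All of the
  # orchestrator methods here are invented for illustration.

  def attach_virtual_probe(orchestrator, flow_id, probe_image="monitor-probe"):
      site = orchestrator.nearest_site(flow_id)       # keep tap-to-viewer delay low
      probe = orchestrator.deploy(probe_image, site=site)
      if orchestrator.has_tap(flow_id, site):         # a "T" left in place at this location
          orchestrator.connect_tap(flow_id, probe)
      else:                                           # otherwise cross-connect by rerouting
          orchestrator.reroute_through(flow_id, probe)
      return probe                                    # the NOC views the probe's feed remotely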

Another aspect of “virtual T&M” is applying T&M to the control APIs and exchanges associated with SDN or NFV.  This has been a topic of interest for many of the T&M vendors, and certainly the failure of a control or management path in SDN or NFV could present a major problem.  Operators, in fact, are somewhat more likely to think they need specialized T&M support for control/management exchanges in SDN and NFV than in the service data path.  That’s because of expected issues with integration among the elements at the control/management protocol level.

Most of the technology and strategy behind virtual T&M is the same whether we're talking about the data path or the control/management plane.  However, there are profound issues of security and stability associated with any monitoring of, or (in particular) active intervention in, control/management activity.  We would have to assume that T&M lives inside the same security sandbox as things like an SDN controller or NFV MANO, to ensure nothing is done to compromise the mass of users and services those elements could represent.

Overall, the biggest impact of virtualization trends on T&M is the fact that a big goal for virtualization is service lifecycle automation.  If that’s taken seriously, then more of what T&M does today would migrate into a management function that generated events to drive software processes, not technicians.  In addition, the T&M processes related to device testing are probably far less relevant in an age where the device is virtual and can be reinstantiated on demand.  But virtualization also lets T&M create what is in effect a virtual technician because it lets you push a probe and test generator anywhere it’s needed.  Will the net be positive or negative?  I think that will depend on how vendors respond to the challenge.

Could Modeling Be the Catalyst for OSS/BSS Transformation?

I can vividly recall one of my early telco transformation meetings.  It was just after NFV had launched, but before any real work had been done.  At the meeting, two of the telco experts sitting next to each other expressed their views on OSS/BSS.  One wanted to transform it, retaining as much as possible of the current systems.  The other wanted to start over.  This polarization of views on OSS/BSS futures, it turned out, was fairly pervasive among operators, and it persists today.

The notion of transforming OSS/BSS has a long history, going back more than a decade in fact.  The first transformational notion I saw was the TMF’s NGOSS Contract work, something I’ve cited often.  This was an early attempt to reorganize operations processes into services (SOA, at the time) and to use the contract data model to steer service events to the right process.  This, obviously, was the “event-driven OSS/BSS” notion, and also the “service-based” or “component-based” model.

We sort of did services and components, but the event-driven notion has been harder to promote.  There are some OSS/BSS vendors who are now talking about orchestration, meaning the organization of operations work through software automation, but not all orchestration is event-driven (as we know from the NFV space and the relatively mature area of DevOps for software deployment).  Thus, it would be interesting to think about what would happen should OSS/BSS systems be made event-driven.  How would this impact the systems, and how would it impact the whole issue of telco transformation?

We have to go back, as always, to the seminal work on NGOSS Contract to jump off into this topic.  The key notion was that a data model coupled events to processes, which in any realistic implementation means that the OSS/BSS is structured as a state/event system with the model recording state.  If you visualized the service at the retail level as a classic "black box" or abstraction, you could say that it had six states: Orderable, Activating, Active, Terminating, Terminated, and Fault.  An "order" event transitions the service to the Activating state, and a report that the service is properly deployed transitions it to the Active state.  Straightforward, right?  In every state there's a key event that represents its "normal" transition driver, and there's a logical progression through the states.  Every state except "Fault", of course, which would presumably be entered on any report of an abnormal condition.
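A minimal Python sketch of that retail-level state/event table might look like the following; the event names are my own shorthand, not TMF-defined terms.

  # Sketch of the six-state retail model. (state, event) -> next state; any report
  # of an abnormal condition, in any state, drops the service into Fault.

  TRANSITIONS = {
      ("Orderable",   "order"):     "Activating",
      ("Activating",  "deployed"):  "Active",
      ("Active",      "cease"):     "Terminating",
      ("Terminating", "released"):  "Terminated",
  }

  def next_state(state, event):
      if event == "abnormal":
          return "Fault"
      return TRANSITIONS.get((state, event), state)   # unknown events leave the state unchanged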

You can already see this is too simplistic to be useful, of course.  If the service at the retail level is an abstract opaque box, it can't be that at the deployment level in most cases.  Services have access and transport components, features, and different vendor implementations in various places.  So inside our box there has to be a series of little subordinate boxes, each of which represents a path along the way to actually deploying.  Each of these subordinates is connected to its superior in a state/event sense.

When you send an Order event to a retail service, the event has to be propagated to its subordinates so they are all spun up.  Only when all the subordinates have reported being Active can you report the service itself to be Active.  You can see that the state/event process also synchronizes the cooperative tasks that are needed to build a service.  All of this was implicit in the NGOSS Contract work, but not explained in detail in the final documents (GB942).
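Here's a small Python sketch of that propagation and synchronization, assuming (my simplification, not something spelled out in GB942) that each element notifies its superior when it goes Active.

  # Sketch of Order fan-out and Active roll-up across a service hierarchy.

  class ServiceElement:
      def __init__(self, name, subordinates=None):
          self.name = name
          self.state = "Orderable"
          self.superior = None
          self.subordinates = subordinates or []
          for sub in self.subordinates:
              sub.superior = self

      def handle(self, event):
          if event == "order":
              self.state = "Activating"
              for sub in self.subordinates:            # propagate the Order downward
                  sub.handle("order")
          elif event == "deployed":                    # a leaf element's resources came up
              self._go_active()
          elif event == "sub_active":                  # a subordinate reported Active
              if all(s.state == "Active" for s in self.subordinates):
                  self._go_active()

      def _go_active(self):
          self.state = "Active"
          if self.superior:                            # roll status up, which may cascade
              self.superior.handle("sub_active")

  # Usage: a retail service with two subordinate elements.
  access, core = ServiceElement("access"), ServiceElement("core")
  retail = ServiceElement("vpn-retail", subordinates=[access, core])
  retail.handle("order")
  access.handle("deployed"); core.handle("deployed")
  assert retail.state == "Active"                      # only after both subordinates are up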

Operations processes, in this framework, are run in response to events.  When you report an event to a subordinate (or superior) component of a service, the state that component is in and the event itself combine to define the processes to be run.  The way that an OSS/BSS responds to everything related to a service is by interpreting events within the state/event context of the data models for the components.
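In code, the dispatch itself can be almost trivial, because the intelligence is in the model.  This Python sketch uses invented process names; in a real OSS/BSS they'd be the componentized operations services themselves.

  # Sketch of model-driven dispatch: each component's data model carries a
  # state/event table whose cells name the operations processes to run.

  PROCESS_TABLE = {
      ("Orderable",  "order"):    ["validate_order", "allocate_access"],
      ("Activating", "deployed"): ["activate_billing", "notify_customer"],
      ("Active",     "abnormal"): ["open_trouble_ticket", "attempt_redeploy"],
  }

  def dispatch(component_state, event, process_registry):
      for name in PROCESS_TABLE.get((component_state, event), []):
          process_registry[name]()      # each OSS/BSS process is an invokable service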

This approach contrasts with what could be described as the transactional or workflow approach that has been the model for most business applications, including most OSS/BSS.  In a transactional model, operations tasks are presumed to be activated by something (yes, we could think of it as an event), and once activated the components run in a predefined way.  This is why we tend to think of OSS/BSS components like "Order Management" or "Billing"; the structure mirrors normal business software elements.

To make the OSS/BSS operate as an event-driven system, you need to do three things.  First, you need a data model that defines a service and its subordinate elements in a structured way, so that each of the elements can be given a specific state/event table defining how it reacts to events.  Second, you need events for the system to react to.  And third, you need OSS/BSS processes defined as services or components that can be invoked from the intersection of current state and received event in any given state/event table.

Most OSS/BSS systems are already modular, and both operators and vendors have told me there's little doubt that any of them could be used in a modular, as-a-service way.  Similarly, there are plenty of business applications that are event-driven, and we have all manner of software tools to encode conditions as events and associate them with service models.  What we lack, generally, are the models themselves.  It's not that we don't have service modeling, but that the models rarely have state/event tables.  Those would have to be authored as part of service-building.

You can see from this description that the process of modernizing OSS/BSS based on NGOSS-Contract state/event principles is almost identical to the process of defining virtualized function deployments as described by the NFV ISG, or the way that AT&T’s ECOMP proposes to build and manage services.  That has three important consequences.

First, it would clearly be possible to organize both SDN/NFV service lifecycle management and OSS/BSS modernization around the same kind of model, meaning of course that it could be the same model.  Properly done, a move in one space would move you in the other, and since automation of both operations and the lower-level lifecycle management processes is essential for opex efficiency and service agility, the combined move could meet transformation goals.

Second, the model could be defined either at the OSS/BSS level or "below" that, perhaps as independent NFV orchestration.  From wherever it starts, it could then percolate up or down to cover the other space.  Everyone in the OSS/BSS space, the SDN/NFV space, and the DevOps/orchestration space could play a role here.

Third, this level of model-driven integration of operations processes with service and resource management processes at the lower level isn’t being touted today.  We see services and service modeling connected to OSS/BSS, presumably through basic order interfaces.  If that’s accidental, it seems to suggest that even advanced thinkers in the vendor and operator communities aren’t thinking about full-scope service automation.  If it’s deliberate, then it isolates operations modernization from the service modeling and orchestration trends, which in my view would marginalize OSS/BSS and hand victory to those who wanted to completely replace it rather than modernize it.

That returns us to those two people at the meeting table, the two who had diametrically opposed views of the future of OSS/BSS.  Put in the terms of the modeling issue we’ve been discussing here, the “modernize” view would favor incorporating OSS/BSS state/event handling into the new service automation and modeling activity that seems to be emerging in things like ECOMP.  The “trash it and start over” view says that the differences in the role of OSS/BSS in a virtual world are too profound to be accommodated.

My own view falls between these two perspectives.  There are a lot of traditional linear workflows involved in OSS/BSS today, and many of them (like billing) really don’t fit a state/event model.  However, the old workflow-driven thinking doesn’t match cloud computing trends, distributed services, and virtualization needs.  What seems to be indicated (and which operators tell me vendors like Amdocs and Netcracker are starting to push) is a hybrid approach where service management as an activity is visualized as a state/event core built around a model, and traditional transactional workflow tasks are spawned at appropriate points.  It’s not all-or-nothing, it’s fix-what’s-broken.
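A sketch of that hybrid, again with invented names, would put the state/event machinery at the core and hand off to the legacy transactional systems only at specific transitions.

  # Sketch of the hybrid model: the state/event core spawns traditional
  # transactional workflows (billing, order closure) at chosen transitions
  # rather than forcing them into the event model. All names are illustrative.

  TRANSACTIONAL_HOOKS = {
      ("Activating",  "Active"):     ["start_billing_cycle", "close_order"],
      ("Terminating", "Terminated"): ["issue_final_bill", "release_inventory"],
  }

  def on_transition(old_state, new_state, workflow_engine):
      for task in TRANSACTIONAL_HOOKS.get((old_state, new_state), []):
          workflow_engine.submit(task)   # handed to the existing workflow system as-is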

Or, perhaps, it’s neither.  The most challenging problem with the OSS/BSS modernization task and the integration of OSS/BSS with broader virtualization-driven service management, is the political challenge created by the organization of most operators.  So far, SDN and NFV have been CTO projects.  OSS/BSS is a CIO domain, and there is usually a fair degree of tension between these two groups.  Even where the CIO organization has a fairly unanimous vision of OSS/BSS evolution (in the operator I opened this blog with, both views on operations evolution were held within the CIO organization) there’s not much drive so far to unite that vision with virtualization at the infrastructure level.

Could standardization help this?  The standards efforts tend to align along these same political divides.  The TMF is the go-to group for CIO OSS/BSS work, and the CTO organizations have been the participants in the formal bodies like the NFV ISG.  Underneath it all is the fact that all these activities rely on consensus, which has been hard to come by lately as vendor participants strive for competitive advantage.  We may need to look to a vendor for the startling insights needed.  Would we have smartphones today without Steve Jobs, if a standards process had to create them?  Collective insight is hard, and we’ve not mastered it.

Could We Unify CORD and ECOMP to Accelerate Infrastructure Transformation?

If you like the idea of somehow creating a union between CORD and ECOMP, then the next obvious question is just where that union has to start.  The answer, in my view, isn't a place where both architectures contribute something that could be united, but a place where neither does enough and external unifying forces are essential.  That place is the notion of modeling, not of resources but of functions.

In my last blog, I noted that integration depends on the ability to freely substitute different implementations of the same function without changing the service definitions or the management practices.  To make that happen, you need to have some Platonic shapes that define all the functions you intend to use in composing services…or even applications.  Each of these models then represents the “look and feel” of the function as seen from above.  The vendors who want to contribute those functions are responsible for building downward from the abstract model to make sure what they do fits seamlessly.

The goal is to make a function's "object" into a representation of that function throughout the service lifecycle.  You manipulate the function at the model level, and the manipulation is coupled downward into whatever kind of implementation happens to be used.  That way, things that have to view or control a "router" don't have to worry (or even know) whether it's an instance of software, a physical device, or a whole system of routing features built either from SDN forwarding or by combining devices/software into a "network".
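A Python sketch of that idea would define the abstract "router" once and let each implementation build downward from it; the class and method names here are mine, not anything defined by CORD, ECOMP, or the NFV ISG.

  # Sketch of an abstract functional object ("router") with interchangeable
  # implementations. Higher layers see only RouterFunction.

  from abc import ABC, abstractmethod

  class RouterFunction(ABC):
      """The 'look and feel' of a router as seen from above."""
      @abstractmethod
      def deploy(self, site): ...
      @abstractmethod
      def add_route(self, prefix, next_hop): ...

  class PhysicalRouter(RouterFunction):
      def deploy(self, site):
          self.site = site                      # nothing to instantiate; the box is already there
      def add_route(self, prefix, next_hop):
          print(f"push config to box at {self.site}: {prefix} via {next_hop}")

  class HostedRouterInstance(RouterFunction):
      def deploy(self, site):
          self.site = site                      # spin up a software instance wherever convenient
      def add_route(self, prefix, next_hop):
          print(f"API call to instance at {self.site}: {prefix} via {next_hop}")

Anything above the model calls deploy() and add_route() on a RouterFunction and neither knows nor cares which implementation it got, which is exactly the substitution freedom integration requires.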

The TMF really got a lot of this started back in the 2006-2010 timeframe, with two initiatives.  One was NGOSS Contract, which proposed that events be steered to the appropriate lifecycle processes through the intermediary of the contract data model.  That approach was the first to make a contract (which the TMF modeled as a series of connected service elements) into a state/event machine.  The other was the Service Delivery Framework (SDF), which explicitly targeted the lifecycle management of services that consist of multiple functions/features.

To me, the union of these two concepts required the notion that each service element or model element (my “router”) be represented as an object that had properties determined by the class of feature it defined.  That object was then a little “engine” that had state/event properties and that translated standard class-based features (“a router does THIS”) into implementation-specific methods (“by doing THIS”).  A service was a structured assembly of these objects, and each service was processed by a lifecycle management software element that I called a “Service Factory”, a term the TMF briefly adopted.

Service lifecycle management lives above the model.  It starts by instantiating a service model onto real infrastructure, making the connections between the "objects" that define the model and a service-instance-specific way of deploying or committing resources.  It never has to worry about implementation, because it manipulates only the abstract vision ("router").  That first step, deployment, is what binds the general object view of available features (probably exposed as APIs) to the way each object is actually deployed in the service at hand.

When a model is deployed, the abstract “model” has to be changed from a template that describes something to an instance that represents something.  There are two basic approaches to doing this.  One is to actually spawn a set of software objects that will then run to process service lifecycle events.  In this approach, a service is a real software application made up of modules for the features.  The second approach is to use a general software tool that interprets the model as needed, meaning that there is in the instance of a service model a set of references to software, not the software itself.  The references could be real pointers to software processes, or they could be a data model that would be passed to a generic software element.
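Here's a minimal Python sketch of the second approach, where the service instance is data containing references to software and a generic engine interprets it.  The model structure and the "handlers" module are assumptions made for illustration only.

  # Sketch of the "interpreted model" approach: the instance holds references
  # to software, not the software itself. The handlers module is hypothetical.

  import importlib

  SERVICE_INSTANCE = {
      "name": "business-vpn-42",
      "elements": [
          {"function": "router",   "handler": "handlers.router_lifecycle"},
          {"function": "firewall", "handler": "handlers.firewall_lifecycle"},
      ],
  }

  def handle_event(instance, event):
      for element in instance["elements"]:
          module_name, func_name = element["handler"].rsplit(".", 1)
          handler = getattr(importlib.import_module(module_name), func_name)
          handler(element, event)       # generic engine; behavior lives behind the references

The first approach would instead compile the same model into a dedicated assembly of running software objects per service, trading the generality of the engine for a service that is literally its own application.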

CORD uses abstractions to represent things like the access network and the service trunking, and there are also what are arguably standard models for resources.  The former are useful but not sufficient to model a service, because they don't have the functional range needed to support all the service features.  The latter open the question of "standardization" below the service objects, which I'll get to in a bit.

ECOMP also contributes elements.  It has the notion of a service model, though I'd argue it's not as specific as the approach I've described.  It has the notion of service lifecycle management, again not as detailed.  Much of ECOMP's detail lies in the management and resource portion of the problem, again below the service model I've described.

If CORD describes the CO of the future and ECOMP describes the integration of elements, then the thing that would unite them in a logical sense is a complete statement of the service models that relate the processes of ECOMP with the resources of CORD.  To consider that, it’s now time to address the question of what happens underneath a service model.  Here we have three basic options to consider:

  1. We could use the same modeling approach below as we had used for service models, so that the decomposition of a “router” object into a network of “router” objects would use the same tools.
  2. We could use some other standardized modeling approach to describe how an “object of objects” is represented.
  3. We could let anything that works be used, foregoing standardization.

The best approach here, in my view, would depend on how many of the "other standardized modeling" approaches end up being fielded in the market.  Below the service model, the mandate is to pick an implementation strategy and then connect it to the service model's object-level APIs.  You could see the work of the NFV ISG and MANO living down here, and you could also see modeling options like TOSCA, TMF SID, and YANG, and even more general data formats like XML or JSON.  The more options there are, the more difficult it would be to get a complete model from the underside of our highest-level service objects down to the resources that will have to be committed.  That's because it's likely that vendors would support only a few model choices: their own gladly, and everything else with great reluctance.

Clearly the last option leads to chaos in integration.  So does the second option, unless we can define only a very limited set of alternative approaches.  That leaves us with the first option, which is to find a general modeling approach that would work top to bottom.  However, that approach fields about as many different choices as my second one did—and it then demands we pick one before we can go very far in modeling services.  Given all of this, what I’d suggest is that we focus on defining what must be standardized—the structure of those abstract functional objects like “router”.  From there, we’d have to let the market decide by adopting what works best.

It should be easy to unify CORD and ECOMP through service modeling, because both require and even partially define it, yet neither seems firmly entrenched in a specific approach.  It's also something the NFV ISG might be ideally positioned to provide, since the objects that need to be defined for the model all fall within the range of functions NFV considers.  It could also be done through open-source activities (including CORD and ECOMP), and it could be done by vendors.  Perhaps with all these options on the table, at least one will come to fruition.

There's a lot at stake here.  Obviously, this could make both CORD and ECOMP much more broadly relevant.  It could also re-ignite the relevance of the NFV ISG.  It could help the TMF turn its ZOOM project into something other than a lifetime tenure for its members.  I also think that carrier cloud adoption could be accelerated significantly, perhaps by as much as two years, if something like this were done.  But make no mistake, carrier cloud is going to happen, and it will put a lot of new money into the IT world.  Once that's clear (by 2020, certainly) I think there will be a rush to join in.  For some, it will be too late to reap the full benefits.