What Now Gets NFV On Track? Open Source? Standards? Testing?

We are again seeing stories and comments around “what’s wrong with NFV”.  That’s a good thing in that it at least shows awareness that NFV has not met the expectations of most who eagerly supported it four years ago.  It’s a bad thing because most of the suggested ills, and therefore the explicit or implied remedies, are just as wrong.

Before I get into this I want to note something I’ve said before when talking about NFV.  I got involved with the ETSI activity in March of 2013 and I’ve remained active in monitoring (and occasionally commenting on) the work since then.  I have a lot of respect for the people who have been involved with the effort, but I’ve been honest from the first in my disagreement with elements of the process, and therefore with some of the results.  I have to be just as honest with those who read this blog, and so I will be.

The first thing that’s wrong is less with NFV than with expectations.  We cover technology in such a way as to almost guarantee escalation of claims.  If you review the first white papers, or attended the early meetings, you can see that NFV’s intended scope was never revolutionary, and could never have supported the transformational aspirations of most of its supporters.  NFV was, from the first, focused on network appliances that operated above Level 2/3, meaning that it wasn’t intended to replace traditional switching and routing.  Much of the specialized equipment associated with mobile services, higher-layer services, and content delivery was a prime target.  The reason this targeting is important is that these devices collectively amount to only about 17% of capex overall.  NFV in its original conception could never have been a revolution.

The second thing that’s wrong is that NFV’s scope (in no small part because of its appliance focus) didn’t include operations integration.  Nobody should question the basic truth that a virtual function set, hosted on cloud infrastructure in data centers and chained together with service tunnels, is more complicated than an equivalent physical function in a box, yet the E2E diagrams of NFV propose that we manage virtual functions with the same general approach we use for physical ones.  There has been from the first a very explicit dependence of NFV on the operations and management model associated with virtual function lifecycles, but the details were kept out of scope.  Given that “process opex”, the operations costs directly related to service fulfillment, already accounts for 50% more cost than capex, and that unbridled virtual function complexity could make things even worse, that decision is very hard to justify, or overcome.

The third issue with NFV is that it was about identifying standards and not setting them.  On the surface this is completely sensible; the last thing we need is more redundant and potentially contradictory standards processes.  The problem it caused with NFV is that identification of standards demands a clear holistic vision of the entire service process, or you have no mechanism with which to make your selection from the overall standards inventory.  What’s a good candidate standard, other than the best one to achieve the overall business goal?  But what, exactly, is that goal?  How do standards get molded into an ecosystem to achieve it?  If you had to write standards, the scope of what you did and the omissions or failures could be fairly obvious.  If you’re only picking things, it’s harder to know whether the process is on track or not.

So what fixes this?  Not “servers capable of replacing switches and routers”, because a broader role for NFV first tends to exacerbate the other issues I pointed out, and because you don’t really need NFV to deploy static multi-tenant network elements like infrastructure components.  You don’t really even need cloud computing.  “Standards” or “interoperability” or “onboarding” are all reasonable requirements, but we’ve had them all along and have somehow failed to exploit them.  What, then?

First you have to decide what “fixing” means.  If you’re happy with the original goals of the papers, the above-the-network missions in virtual CPE and so forth, then you need to envelop NFV in a management/operations model, which the ETSI ISG declared out of scope.  There’s nothing wrong with the declaration, as long as you recognize that declaring it out of scope doesn’t mean it isn’t critically necessary.  If you do want service and infrastructure revolution, it’s even easier.  Forget NFV except as a technical alternative to physical devices and focus entirely on automating service lifecycle management.  That can’t be done within the scope of the ETSI work—not at this point.

This is where open-source comes in.  In fact, there are two ways that open source could go here.  One is to follow the NFV specifications, in which case it will inherit all of the ills of the current process and perhaps add in some new ones associated with the way that open-source projects work.  The other is to essentially blow a kiss or two at the ETSI specs and proceed to do the right thing regardless of what the specs say.  Both these approaches are represented in the world of NFV today.

The specs as they exist will not describe an NFV that can make a business case.  The specs as they exist today are incomplete in describing how software components could be combined to build NFV-based service lifecycle management, or how NFV software could scale and operate in production networks.  That is my view, true, but I am absolutely certain it is accurate.  This is not to say that the issues couldn’t be resolved, and in many cases resolved easily.  It’s not to say that the ETSI effort was wasted, because the original functional model is valid as far as it goes, and it illustrates what the correct model would be even if it doesn’t describe it explicitly.  What it does say is that these issues have to be resolved, and if open source jumps off into the Great NFV Void and does the work again, they can get it right or they can get it wrong.  If the latter, they can make the same mistakes, or new ones.

The automation of a service lifecycle is a software task, so it has to be done as a software project to be done right.  We did not develop NFV specifications with software projects in mind, and they are not going to be optimal in guiding a project for that reason.  The best channel for progress is open source, because it’s the channel that has the best chance of overcoming the lack of scope and systemic vision that came about (quite accidentally, I think) in the ETSI NFV efforts.  The AT&T ECOMP project, now combined into the ONAP project (with Open-O), offers what I think is the best chance for success because it does have the necessary scope, and also has operator support.

Some people are upset because we have multiple projects that seem to compete.  I’m not, because we need a bit of natural selection here.  If we had full, adequate, systemic specifications for the whole service lifecycle management process we could insist on having a unified and compatible approach.  We don’t have those specs, so we are essentially creating competitive strategies to find the best path forward.  That’s not bad, it’s critically necessary if we’re to go forward at all.

The big problem we have with open-source-dominated NFV isn’t lack of consistency, it’s lack of relevance.  If open-source solves the problems of service lifecycle automation, and if it has the scope to support legacy and cloud, operator and federation, then it will succeed and NFV will succeed along with it.  But NFV was never the solution to service lifecycle automation; it declared most of the issues out of scope.  That means that for NFV, “success” won’t mean dominating transformation, it will simply mean playing its truthfully limited role.

Most network technology will never be function-hosted, but most operator profits will increasingly depend on new higher-layer cloud elements.  Right now, NFV isn’t even needed there.  If I were promoting NFV, and I wanted it to be more dominant, I’d look to the cloud side. There’s plenty of opportunity there for everyone, and the cloud shows us that there’s nothing wrong with open-source, or with multiple parallel projects.  It’s fulfilling the mission that counts, as it should always be.

Is Verizon Behind in the Telco Race?

Verizon certainly raised a ruckus in the industry with their views on consolidation.  The sense of their CEO’s comments was that Verizon was open to a merger that offered them content ownership, and that says a lot about the industry overall.  Here we have a giant telco saying that without content ownership their position is at risk, and there’s some support for that negative view in their most recent quarter.  So why is that, is it true, and what does it mean for us all?

To set the business stage, Verizon had subscriber loss issues in the mobile space—over three hundred thousand according to their quarterly report.   The company lost revenue in wireline, and FiOS video net losses were about 13,000 connections.  While Verizon is seen as having the “best” wireless and FiOS is seen as the best wireline Internet and video, the company faces competitive pressure on all fronts, and it’s increasingly doubtful that buyers will pay for premium service.

The core issue facing Verizon (and other operators) is the explosion in video delivery to mobile devices.  It’s not that this represents a massive shift away from channelized real-time TV viewing, but it does demonstrate a shift in video behavior, two in fact.  First, mobile broadband gives people video access when they have no opportunity to use traditional tethered TV.  Second, viewers are much more into time-shifting than before and if you’re not going to watch what’s on when it’s on, you are open to watching it differently, on a different device.

Mobile video has been a problem for operators because competitive pressure prevents them from usage pricing in a way that would realize much incremental revenue from the shift.  They’re stuck with another reason for revenue per bit to decline, sinking into the realm of dumb, cheap, plumbing.  And, of course, if the road is becoming free, then you have to make money on what’s traveling the road, which is video content.  To make things more complicated, TV advertisers want strong mobile video presence.

As if market trends weren’t challenging enough, rival AT&T has messed things up further for Verizon, in two ways.  First, it’s been offering video from its DIRECTV property to its mobile customers, without having the viewing count against usage, a move Verizon had to follow.  Second, it’s done much better at moving its services and infrastructure toward that elusive telco goal of “transformation.”  Verizon had two natural advantages over AT&T at the start of this decade, and now they’re far less relevant.  We’ll look at Verizon’s lost edge first, then at the two ways AT&T helped them lose it.

What I call “demand density” was the first of Verizon’s natural assets.  The value of network infrastructure, its ability to return on investment, is proportional to the economic value of the homes and businesses that infrastructure passes.  My own modeling showed decades ago that this was very roughly related to the GDP per inhabitable square mile, which I called “demand density”, and by that measure Verizon had nearly a 7x advantage over AT&T.  That’s why Verizon could jump on FTTH for at least a good-sized chunk of its market, and AT&T had to be satisfied with a hokey IPTV-over-DSL approach.

The second of Verizon’s natural assets was on the business side.  The easiest place to make a business sale of telco services is at the corporate HQ.  Verizon had more corporate HQs than AT&T, a lot more in fact.  Their edge has eroded because of a general shift of industries from the northeast to California and Texas, but they still hold a lead.  The problem is that business services are under incredible pricing pressure, and the winner in the race to the bottom will always win in that sort of situation.

The mobile-broadband focus of the market gave AT&T an opportunity to focus its “video” on a combination of satellite TV and their my-content-isn’t-counted-against-usage approach to mobile video.  Mobile services in general bypass the demand density issue because you’re not stringing wire, and the no-usage-charge video model promotes synergy between their satellite TV approach and mobile services.  You can see from Verizon’s price hikes on FiOS TV (and customers squatting in unexpected numbers in their cheapest offerings) that demand density and FiOS aren’t guaranteeing victory any more.

The transformation win AT&T is now posting is even more troublesome.  Process opex, the operations costs directly attributable to network services, accounts for about 28 cents per revenue dollar today across operators of all types, and if left unchecked that will grow to over 33 cents in five years.  Transformation, in theory, could reduce process opex by fifty percent or more.  With a cost advantage that large, AT&T could kill a non-transformed competitor—like Verizon.  AT&T’s ECOMP is becoming a de facto model for operators globally, but one Verizon can’t easily adopt for competitive reasons, and so Verizon is grappling with the question of how to counter ECOMP.

The AT&T proposal to buy Time Warner is the potential nail in the coffin.  Here’s AT&T already pushing hard against Verizon’s market advantages.  Then Comcast jumps in, first to buy NBCU and then to announce their own MVNO service that, no surprise, will deliver Comcast’s own video to mobile users without usage charges.  Then AT&T wants to buy a content company, giving it even more power in the mobile video space and better margins “above the network” to offset declines in profit per bit.  Given that AT&T could transform itself out of an immediate profit-per-bit problem, this is not good news.

Hence, the Verizon CEO’s comments on selling out to Disney or someone like them.  This is more than “Hey, Mom, every kid on the block is buying a content company!”  Verizon probably knows it would be difficult for it to actually do an acquisition of a good one, and harder to get regulatory approval.  Getting acquired by a content company might be easier, though regulatory approvals might still depend on the more permissive view of regulations held by the current administration.

Regulatory policy may hold the big wild card here.  Recall that the original neutrality order promulgated by the FCC under Genachowski didn’t close the door on settlement or paid prioritization.  Those were added by the Wheeler FCC.  Might the current Chairman, Pai, revert to a more ISP-friendly view and open the door to one or both?  That could open the possibility of Verizon obtaining revenue from OTT vendors, and even to new Internet-based services.

The question here is whether, even if regulatory relief were granted, it would be sufficient to redress the negative long-term issues Verizon faces.  Cost advantages held by competitors who are more aggressive with automated service lifecycle management wouldn’t be wiped out; regulations would just open new areas to compete in.  And if you can charge for priority content delivery and your competitors who own content assets elect not to, how do you respond other than by not charging?

Verizon seems to be doubling down on fiber deployment, based on its Corning deal, and that could suggest either a doubling-down on “quality” service or a hope that their cost base could be improved by shifting capex more to transport.  It may also hope that somehow buying OTTs (Yahoo, recently) will give it a leg up in that space.  Perhaps all of this will help, but I really think that the weak spot for Verizon isn’t content ownership, but service management automation.  They need to outdo competitors like AT&T in that space, and I don’t see any clear signs that they’re on the road to doing that.

IBM: Is It a Problem or a Symptom?

We are now confronted with the need to talk about IBM, not for the first time.  The company beat estimates on EPS slightly, with a set of one-time moves that the Street didn’t like much.  They missed yet again on revenue, and their shares took a hit (again) as a result.  Here is a company that has weathered more technology storms than any other, and has pulled victories out of every defeat.  Why is it unable to do that this time?  What should be done now?  I don’t want to reprise past blogs on how they got into their current mess, but focus instead on getting out.

The best place from which to launch a recovery is usually a place where you have your greatest strengths.  In IBM’s case, that greatest strength is strategic influence.  Back in 2013, IBM was a runaway winner in strategic influence on enterprise buyers, scoring between double and triple its rival computer vendors.  Since then, IBM’s influence has fallen by half, but so has the influence of its competitors and so IBM still leads the IT pack in terms of its influence on enterprise buyers.  The question is how to play that card effectively.

I hinted at what’s probably the best answer to this a couple of blogs back.  IBM has the direct account influence needed to launch that elusive next productivity wave that could create an IT investment explosion the like of which we haven’t seen for decades.  The problem, I think, is that IBM doesn’t seem to have any better idea of what might drive that wave than anyone else, or perhaps has no ability to communicate what it knows.

Watson, or AI-linked analytics, isn’t the answer—it’s too indirect.  Knowledge may be power eventually, but worker empowerment gets you to the finish line immediately.  Watson holds out to senior management the promise that somehow getting better information will make them successful.  Empowerment of workers makes you successful, period.  IBM actually took some steps in that direction with its deal with Apple, but if there was a new paradigm in the deal I didn’t see it, nor did the enterprise buyers I’ve talked with since.

If there is a company that has all the tools needed to create contextual point-of-activity empowerment of workers through exploiting mobile broadband, IBM is that company.  Given that they also (still) have the influence, I think it’s clear that this should be IBM’s strategic priority.  Watson is useful only in that context.

The next answer in how to play their assets comes from the financial community’s criticisms of IBM.  According to the Street, IBM is in trouble because of mainframe exposure and cloud impact.  The implication is that cloud computing is eating up IBM’s sales of mainframes.  In point of fact, mainframes have been a sweet spot for IBM and mainframe accounts are the places where IBM retains the most strategic influence.  Further, the cloud has had less than a 4% impact on IT spending overall.  Most cloud revenue comes from web startups and web-front-ends to current applications, so it’s money that was never spent in house.  The cloud, in fact, could be an incremental asset to any vendor at this point, because it could be the largest source of new server deployments.  How does IBM get the benefit of that shift?

Acknowledging it might be helpful.  IBM hasn’t articulated a differentiated vision of the cloud.  If we were to totally fulfill cloud potential across all possible markets, we would raise net IT spending by almost 90%.  Just getting on the right path to achieving full cloud transformation of business could, in the next five years, raise spending by over 30%.  You need to have three things to get that money—position, an ecosystemic story, and a product set that realizes the goals.  IBM has two of the three; what it lacks is the story.  Well, gosh, IBM—how much inertia does a story represent?  How long would it have taken the IBM of the past to sing a pretty song about the future?

That brings us to the final point in playing IBM’s cards effectively: marketing.  If you’ve read my past blogs on IBM, you know that I’ve painted their current problem as arising largely from a lack of marketing.  I know from many of my friends and contacts in IBM that there’s support for that view internally.  OK, then, if you needed marketing in the past and didn’t have it, need it in the present and future and still don’t have it, what do you now need?  Answer: marketing.  I would bet that IBM has people who know exactly what to say to enterprise buyers about that next wave of productivity improvement.  They are just not allowed to say it.

Marketing isn’t about telling a simplistic story to get the attention of a reporter or a click on a URL.  It’s about telling a compelling story.  Compelling enough to force reporters to pay attention, and to induce buyers to read about it even if it’s not packaged in 140 pithy characters or a 500-word article.  Does anyone out there seriously believe that a company that found a strategy that could boost their productivity by thirty or forty percent would refuse to learn it because it was too wordy or complicated?

Selling is all about trajectory management, as my research has shown for thirty years or so.  Media mention sells web site visits, web site visits sell sales calls, sales calls sell products—a trajectory.  IBM could surely package a compelling story to start that flow off, but remember that they have account presence in the largest enterprises, who spend the most money.  They can shortstop the trajectory early on, and build on that success to then generate new stories that engage those IBM isn’t directly influencing.

There are almost certainly people in IBM who see all of this, but somehow they don’t seem to get things moving in the right direction.  Is IBM content to focus on cost-cutting, admitting it will never turn revenue around?  Does IBM think that the opportunities and technologies will somehow come together on their own?  Or does IBM think that the Street rewards the current quarter, not the long haul?  It may be that some people think all of these things, and that the combined disorder is enough to stall meaningful progress.

Progress is clearly needed.  IBM got beat up rather badly in the markets after their quarterly earnings were announced.  Certainly you can’t justify taking a short-term view to please the Street when the result isn’t Street-pleasing.  It’s also hard to keep hoping that somehow natural forces will fix all your problems, when that hasn’t happened up to now, for almost three years.  That leaves accepting failure.  The IBM I’ve known would never do that.  Will they now?  I don’t know.

But there’s another option, even worse.  Is IBM simply the leading edge of a negative IT trend?  Have we, as an industry, broken those cycles of productivity-driven IT investment forever, and we’re now doomed to commoditization?  I think there’s a risk that will happen, and the most important lesson we may be learning from IBM’s problems is what happens to anyone in an industry that’s lost its mojo.

New SLAs and New Management Paradigms for the Software-Defined Era

There is no shortage of things we inherit from legacy telecom.  An increasing number of them are millstones around the neck of transformation, and many of those that are drags are related to management and SLA practices.  Those who hanker for the stringent days of TDM SLAs should consider going back in time, but remember that 50 Mbps or so of access bandwidth used to cost almost ten thousand dollars a month.  For those with more price sensitivity than that, it’s better to consider more modern approaches to management and SLAs, particularly if you’re looking ahead to transformation.

All management boils down to running infrastructure to meet commitments.  When infrastructure and services were based on time-division multiplexing or fixed and dedicated capacity, the management processes were focused on finding capacity to allocate and then ensuring that your allocation met the specified error-and-availability goals.  Packet networking, which came along in the ‘60s, started the shift to a different model, because packet networks were based on statistical multiplexing of traffic.  Data traffic doesn’t have a 100% duty cycle, so you can fit peaks of one flow into valleys in another.  That can multiply capacity considerably, but it breaks the notion of “allocation” of resources because nobody really gets a guarantee.  There’s no resource exclusivity.
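The capacity economics of statistical multiplexing can be sketched with a toy on/off traffic model (all names and numbers here are my own illustration, not drawn from any real network):

```python
import random

def simulate(flows=10, slots=1000, peak=100, duty=0.3, seed=42):
    """Crude on/off traffic model: each flow transmits at `peak`
    with probability `duty` in each time slot (illustrative only)."""
    random.seed(seed)
    series = [[peak if random.random() < duty else 0 for _ in range(slots)]
              for _ in range(flows)]
    # Aggregate load per slot across all flows.
    aggregate = [sum(col) for col in zip(*series)]
    dedicated = flows * peak   # TDM-style: allocate every flow its full peak
    observed = max(aggregate)  # statistical peak the shared trunk actually sees
    return dedicated, observed

dedicated, observed = simulate()
print(dedicated, observed)  # the observed peak never exceeds the dedicated total
```

The gap between the dedicated total and the observed peak is the multiplexing gain; it's also exactly why no individual flow can be given a hard, exclusive guarantee.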

A packet SLA has to reflect this by abandoning what you could call “instantaneous state”, the notion that at every moment there’s a guarantee, in favor of “average conditions”.  Over a period of time (a day, a week, a month), you can expect that effective capacity planning will deliver to a given packet user a dependable level of performance and availability.  At any given moment, it may (and probably will) not.

TDM-style SLAs have to be based on measurement of current conditions because it’s those conditions being guaranteed.  Packet SLAs have to be based on analysis of network conditions and traffic trends versus the capacity plan.  It’s more about analytics than about measurement, strictly speaking, because measurements are simply an input into the analysis aimed at answering the Great Question of Packet SLAs: “Will the service perform, on the average and over the committed period, to the guarantees?”  Remediation isn’t about “fixing the problem” at the SLA level, as much as bringing the trend back into line.
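As a minimal sketch of that distinction (the function name and latency figures are hypothetical), a packet SLA judges the average over the committed period, so a transient spike in one measurement interval is not, by itself, a violation:

```python
def sla_compliant(samples, guarantee_ms):
    """Packet-style SLA check: compare the average over the committed
    period against the guarantee, not any single measurement."""
    return sum(samples) / len(samples) <= guarantee_ms

# One transient spike doesn't break the SLA if the average holds.
latencies = [20, 22, 95, 21, 19, 23]  # ms, illustrative samples
print(sla_compliant(latencies, guarantee_ms=35))  # True: average is ~33 ms
```

A TDM-style check, by contrast, would flag the 95 ms sample the instant it occurred; the packet model instead asks whether the trend versus the capacity plan still supports the committed average.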

Another management issue that has evolved with packet network maturity is that of “stateful versus stateless core” behavior.  Packet protocols have been offered in both “connectionless” and “connection-oriented” modes.  Connection-oriented packet networks, including frame relay and ATM, offer behavior in SLA terms that’s a bit of an intermediary between TDM and IP, which is connectionless.  When a “connection” is made in a packet network, the connection reserves resources along the path.

The problem is that if a core network element breaks, that breaks all the connections associated with it, resulting in a tear-down backward toward the endpoints and restoration of the connections using a different route.  In the core, there could be tens of thousands of such connections.  Connectionless protocols don’t reserve resources that way, and there’s no stateful behavior in the core.  Arguably, one reason why IP has dominated is that a core fault creates a fairly awesome flood of related management issues and breaks a lot of SLAs, some because nodes are overloaded with connection tear-down and setup.

We’ve had packet SLAs for ages; nobody writes TDM SLAs for IP or Ethernet networks.  Yet we seem stuck on TDM notions like “five nines”, a concept that at the packet service level is hardly relevant because it’s hardly likely to be achieved unless you define what those nines apply to rather generously.  We’ve learned from the Internet to write applications that tolerate variable rates of packet loss and a range of latencies.  It was that tolerance that let operators pass on connection-oriented packet protocols so as to avoid the issues of stateful core networks, issues that could have included a flood-of-problems-induced collapse of parts of the network and a worse and more widespread SLA violation.
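The arithmetic behind the “nines” makes the point concrete; a short sketch (availability figures only, with no claim about any particular service):

```python
def downtime_per_year(availability):
    """Allowed downtime in hours per year for a given availability fraction."""
    return (1 - availability) * 365 * 24

# Five nines permits about 5.3 minutes of downtime a year;
# two nines permits about 88 hours.
print(f"five nines: {downtime_per_year(0.99999) * 60:.1f} minutes/year")
print(f"two nines:  {downtime_per_year(0.99):.0f} hours/year")
```

The point of the post stands out in those numbers: businesses that would once have rejected two-nines service now build on it routinely, because their applications tolerate it.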

We now have to think about managing networks evolving to central management (SDN) of traffic, and hosted device instances that replace physical devices.  There, it’s particularly dangerous to apply the older TDM concepts, because service management almost has to be presented to the service buyer in familiar device-centric terms, and many faults and conditions in evolved networks won’t relate to those devices at all.  We need, at this point, to break the remaining bonds between service management and service SLAs on the one hand, and the explicit state of specific resources underneath the services on the other.

In the best of all possible worlds, if you want management to be the easiest and service operations costs the lowest, you’d build infrastructure according to a capacity plan, exercise basic admission control to keep things running within the design parameters, and as long as you were meeting your capacity plan goals, nobody would see an SLA violation at all.  That happy situation would be highly desirable in transformed infrastructure, because it’s far easier than trying to link specific services, specific resources, and specific conditions to form an SLA.  As I pointed out yesterday, though, there are issues.

Adaptive IP networks do what their name suggests, which is adapt.  When you have a network that’s centrally traffic managed like SDN, or you have resources in NFV that have to be deployed and scaled on demand, you have a resource commitment to make.  Where real resources are committed—whether they’re SDN routes or NFV hosting points—you have a consumable that’s getting explicitly consumed.  You can’t have multiple activities grabbing the same thing at the same time or things are going to get ugly.  That forces serialization of the requests for this sort of change in infrastructure state, which creates single points of processing that can become bottlenecks in responding to network conditions or customer requests.  These same points can be single points of failure.
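A minimal sketch of that serialization, with hypothetical names: a single worker thread commits resources from a shared pool, which guarantees no double-booking but is itself the processing bottleneck (and single point of failure) the paragraph describes:

```python
import queue
import threading

class ResourceBroker:
    """Serializes commitments against a consumable resource pool so two
    requests can never grab the same capacity at once.  The lone worker
    thread is deliberate: it models the serialization bottleneck."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.requests = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            amount, reply = self.requests.get()
            if amount <= self.capacity:
                self.capacity -= amount   # commit the consumable
                reply.put(True)
            else:
                reply.put(False)          # insufficient remaining capacity

    def commit(self, amount):
        """Queue a commitment request and block for the serialized answer."""
        reply = queue.Queue()
        self.requests.put((amount, reply))
        return reply.get()

broker = ResourceBroker(capacity=10)
print(broker.commit(6), broker.commit(6))  # True False: no double-booking
```

Every request funnels through one queue, so correctness is easy; but under load that queue is exactly the kind of choke point that a scalable control plane has to partition or federate away.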

In the long run, we need to work out a good strategy for making more of SDN and NFV control processes scalable and resilient.  For now, we can try to narrow the scope of control for a given OpenStack instance or SDN controller, and “federate” them through a modeling structure that can divide up the work to ensure things operate at the pace needed.  As SDN and NFV mature, we’re likely to need to rethink how we build controllers and OpenStack instances so that they are themselves built from components that adhere to cloud principles.

If you had tried to sell a two-nines service to business thirty years ago, you’d have tanked.  Today, almost all large companies rely heavily on services with about that level of quality.  We had a packet revolution.  Now we’re proposing a software-centric revolution, and it’s time we recognized that constraining services to the standards of even the recent past (much less falling back to “five nines”) is no more likely to be a good strategy now than it was at the time of the TDM/packet transition.  This time, the incentive to change may well be to improve operations efficiency, and given that process opex is approaching 30 cents per revenue dollar, that should be incentive enough.

A Transformed Service Infrastructure from Portal to Resources

Transformation, for the network operators, is a long-standing if somewhat vague goal.  It means, to most, getting beyond the straitjacket of revenue dependence on connection services and moving higher on the food chain.  Yet, for all the aspirations, the fact is that operators are still looking more at somehow revitalizing connection services than transforming much of anything.  The reasons for this have been debated and discussed for a long time, including here in my blog, so I don’t want to dig further into them.  Instead I want to look at the technology elements that real transformation would require.

I’ve said in the past that there were two primary pieces to transformation technology—a portal system that exposes service status and ordering directly to customers, and a service lifecycle management system that could automate the fulfillment of not only today’s connection services and their successors, but also those elusive higher-layer services that operators in their hearts know they need.  This two-piece model is valid, but perhaps insufficient to guide vendor/product selection.  I want to dig further.

We do have long-standing validation of the basic two-piece approach.  Jorge Cardoso did a visionary project that combined LinkedUSDL, OpenTOSCA, and SugarCRM to produce a transformed service delivery framework.  It had those two pieces—TOSCA orchestration of service lifecycle management and SugarCRM portal technology, bound by LinkedUSDL.  This was a research project, a proof of concept, and it needs a bit of generalizing before it could become a commercial framework capable of supporting transformation.

While there are two major functional elements in the transformative model we’ve been talking about, each of these elements is made up of distinct pieces.  To really address transformation, we have to make all these pieces fit, and make sure that each performs its own mission as independently as possible, to prevent “brittle” or silo implementations.  That’s possible, but not easy.

The front-end of any portal, we know from decades of experience, should be a web-like process, based on RESTful APIs and designed to deliver information to any kind of device—from a supercomputer data center to a smartphone.  This web front-end hosts what we could call the “retail APIs”, meaning the APIs that support customer, partner, and customer-service processes.  To the greatest extent possible, these should be as general as a web server is, because most of the changes we’re going to see in new service applications will focus on this layer.

Behind the web-process piece of the portal is what we could call the cloud-support layer.  You want the front-end to be totally about managing the user interface, so any editing, validation, and rights brokerage should be pulled into something like a cloud process.  I’m calling this “cloud” to mean that the components here should be designed for scaling and replacement, either by being stateless or by using back-end (database or data model) state control.  This is particularly important for portal functions that are inquiries—management status checks—rather than service orders or updates.  That’s because historically there are more of these status-related requests, because users expect quick responses to them, and finally because there are no long-cycle database updates at the end of the flow to limit how much QoE improvement scalable front-end processes can deliver.
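
As a minimal sketch of this stateless cloud-support idea (all class and field names here are my own invention, not from any product), a status-inquiry worker with every bit of state pushed into a shared back-end store might look like:

```python
# Hypothetical sketch of the "cloud-support layer": stateless workers
# answer status inquiries, with all state held in a shared back-end
# store so any replica can serve any request interchangeably.

class StatusStore:
    """Stand-in for a shared database; names and fields are illustrative."""
    def __init__(self):
        self._status = {}

    def put(self, service_id, status):
        self._status[service_id] = status

    def get(self, service_id):
        return self._status.get(service_id, "unknown")

class StatusWorker:
    """A stateless worker: it holds no per-request state of its own,
    so it can be replicated or replaced freely behind a load balancer."""
    def __init__(self, store):
        self.store = store

    def handle_inquiry(self, service_id):
        return {"service": service_id, "status": self.store.get(service_id)}

store = StatusStore()
store.put("svc-001", "active")
# Two interchangeable replicas produce identical answers.
a, b = StatusWorker(store), StatusWorker(store)
assert a.handle_inquiry("svc-001") == b.handle_inquiry("svc-001")
```

Because no worker owns any state, scaling the status path is purely a matter of adding replicas, which is exactly why status inquiries are the best candidates for front-end QoE gains.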

For the entire portal flow, from web-side to cloud-side, it’s more important to have an agile architecture than to have a specific product.  You should be able to adapt any web-based process to be a portal front-end, and you should be wary of selecting a cloud layer that’s too specific in terms of what it does, because the demands of future services are difficult to predict.  It’s also important to remember that the greatest innovations in terms of creating responsive and resilient clouds—microservices and functional (Lambda) computing—are only now rolling out, and most front-end products won’t include them.  Think architecture here!

The “back end” of the portal process is the linkage into the service lifecycle management system, and how this linkage works will depend of course on how the service lifecycle management process has been automated.  My own recommendation has always been that it be based on service modeling and state/event processing, which means that the linkage with the portal will be made by teeing up a service model (either a template for a new service or an instance representing an old one) and generating an event.  This is a useful model even for obtaining current service status; a “service event” could propagate through the service model and record the state/parameters of each element.

If a service model is properly defined (see my Service Lifecycle Management 101 blog series), then any instance of a process can handle it, which means that the structure is scalable as needed to handle the work.  This is important because it does little good to have a front-end portal process that’s elastic and resilient and then hook it to a single-thread provisioning system.  In effect, the very front end of the service lifecycle management system is inherently cloud-ready, which of course is what should be done.
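
The portal-to-lifecycle linkage described above can be sketched in a few lines.  This is a toy illustration under my own assumptions (the event names, model fields, and queue mechanism are all invented): the portal tees up a service-model instance, posts an event, and any worker that understands the model can pick it up.

```python
import queue

# The portal side tees up a model instance and generates an event;
# because all service context travels with the model rather than the
# process, any lifecycle worker instance can handle any event.

event_bus = queue.Queue()

def portal_order(model_template, customer):
    # Instantiate a service model from a template and post an event.
    instance = dict(model_template, customer=customer, state="ordered")
    event_bus.put(("ORDER", instance))
    return instance

def lifecycle_worker():
    # Any replica of this worker can serve any event on the bus.
    event, model = event_bus.get()
    if event == "ORDER":
        model["state"] = "activating"
    return model

m = portal_order({"service": "vpn-gold"}, "acme")
result = lifecycle_worker()
```

The point of the sketch is the division of labor: the model carries the state, the event triggers the work, and the worker pool scales independently of both.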

As you dive deeper into service lifecycle management, though, you inevitably end up hitting the border of resource-bound processes.  You can have a thousand different service lifecycle flows in the service layer of the model, where the state and parameters of a given service are always recorded in the data model itself.  Deeper in, you hit the point where resources set their own intrinsic state.  I can allocate something only if it’s not already allocated to the maximum, which means that parallel processes that maintain their state in a service model are now constrained by “real” state.

The problem of “serialization” of requests to prevent collisions in allocation of inherently stateful resource elements is greatest where specific resources are being allocated.  As an example, you can have a “cloud process” that commands something be deployed on a resource pool, and that process may well be parallel-ready because the pool is sized so as to prevent collision of requests.  But at some point, a general request to “Host My Stuff” will have to be made specific to a server/VM/container, and at that point you have to serialize.

The only good solution to this problem is to divide the administration of pooled resources so that resource-specific processes like loading an app into a VM are distributed among dozens of agents (running, for example, OpenStack) rather than concentrated in a single agent that supports the pool at large.  That means decomposing resource-layer requests to the level of “administrative zone” first, then within the zone to the specific server.
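
A rough sketch of that zone decomposition follows; the zone and host names are invented, and a real implementation would sit behind something like OpenStack rather than a Python lock, but the shape of the serialization boundary is the point.

```python
import threading

# Zone-partitioned allocation: a generic "host my stuff" request is
# decomposed first to an administrative zone, and only the zone-level
# agent serializes the final server-specific commitment.

class Zone:
    def __init__(self, hosts):
        self.free = set(hosts)
        self._lock = threading.Lock()   # serialization point is per-zone

    def allocate(self):
        with self._lock:                # only same-zone requests contend
            return self.free.pop() if self.free else None

# Hypothetical pool split into two independently serialized zones.
zones = {"east": Zone(["e1", "e2"]), "west": Zone(["w1"])}

def host_my_stuff(preferred_zone):
    # Zone selection can run fully in parallel; serialization happens
    # only at the moment a specific server is committed.
    return zones[preferred_zone].allocate()

server = host_my_stuff("east")
```

Requests aimed at different zones never contend with each other, which is what keeps dozens of distributed agents from collapsing back into one global bottleneck.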

I’ve seen about 20 operator visions on the structure of retail portals that feed service lifecycle management, and I have to honestly say that I’ve not yet seen one that addresses all of these issues.  Most of the operators are saying that they’re only doing early prototypes, and that they’ll evolve to a more sophisticated approach down the line.  Perhaps, but if you base your early prototypes on a model that can’t do the sophisticated stuff, your evolution relies on that model evolving in the right direction.  If it doesn’t, you’re stuck in limited functionality, or you start over.

The thing is, all of this can be done right today.  There are no voids in the inventory of functional capabilities needed to do exactly what I’ve described here.  If you pick the right stuff up front, then your evolution to a full-blown, transformed, system will be seamless.  I argue that it won’t take significantly longer to start off right today, and it darn sure will take a lot longer to turn around a project that can’t be taken to a full level of transformation because the early pieces don’t fit the ultimate mission.

What Will it Take to Drive Tech Transformation for Operators and Enterprises?

We tend to think of transformation as something that network operators, particularly telcos, have to go through.  In point of fact, transformation, meaning technology transformation, is going to happen to everyone, buyers and sellers, operators and enterprises.  That truth leaves two questions—what kind of transformation will happen, and how will the players respond to the challenges and opportunities created.  Most of us realize that tech is far from its “golden age”, that we seem to be focused more on reducing cost than enhancing capabilities.  There’s a reason for that, and a resolution is also possible once we understand what’s behind the unfavorable trend.

Technology is like everything else that acts on society and economies as a force.  It changes things either because it takes on a new mission, or because it gets a lot cheaper and more pervasive.  If we look at information technology from its infancy in the early 1950s to today, we see signs of both forces.  Certainly, IT has gotten cheaper.  When I learned to program, the computer I learned on had 16 kilobytes of RAM, worked at a speed measured in milliseconds, and was the size of about ten filing cabinets.  You can get wearables today with better performance.  The computer I learned on also cost several hundred thousand dollars, and today a good laptop might cost one thousand, and many are half that price.

The plummeting price/performance ratio for computing allowed it to spread.  We used to have data centers in which giant systems lived, tended by operations personnel who fed them records of activity that had been keypunched.  These would be turned into “reports”.  As computing got cheaper and more powerful, we started doing online transaction processing, then distributed computing, personal computers, tablets and smartphones, the Internet.  All of these things can be linked to the exploitation of the lower cost of computing.

If cost were the only factor at play here, we could expect to see IT transforming things in a fairly linear way over time, as reducing cost expanded the things you could do with computing technology, but that’s not the case.  If you examine public data on IT spending, it’s clear that we have waves of IT investment (we’ve had three so far).  These represent the new mission piece of the puzzle.  As IT things got cheaper, we found new ways to use them that boosted productivity for workers, and that justified faster growth in spending—30% to 40% more than the average rates of growth.  When we’d fully capitalized the new paradigm, IT spending fell to perhaps 80% of average levels.

This poses the transformation question for everyone.  What would happen if IT commoditization continued, with price/performance still improving, and nothing came along to generate a new mission?  The value of the IT tool would be static.  Every generation, replacing it would get cheaper, and so spending would decline.  We’d end up with low-cost, no-differentiation products.  Eventually, cost alone wouldn’t be enough to drive new applications.  Just because hammers get cheaper, you don’t drive more nails.  Price will transform IT into a commodity space.  New missions will generate new benefits, new demand for new features, new business cases, and higher spending.  That’s the reality for buyers and sellers, operators and enterprises, even consumers.

I’ve suggested in past blogs that the new mission likely to drive future IT growth would be one of two things—mobile broadband empowerment and “contextual processes” of the Internet of Things (IoT).  I still think that one or both of these will have to anchor the engine of mission growth.  Exploiting them is the pathway out of commoditization, but exploiting them isn’t a matter of inventing some dazzling new tech element.  We already have the hardware and software tools needed to make more than just a good start at both these new missions.  What is it we lack, then?

Call it “vision” or “insight”.  Say that we’re mired in quarter-at-a-time tactical thinking when new missions obviously demand a strategic sense.  Say we’ve lost our capacity to communicate complex things, or lost a way to monetize them in their early phase.  Say any of these things and I think you’d be right.  The last time we had a mission-driven cycle, the application of technology to the new mission was very clear.  Personal computers did stuff that we could immediately see, and they did it in a way that didn’t demand a lot of insight to understand.  The Internet gave us a universal market for consumer data services, and it was clear almost instantly, from the launching of the Web, what could be done with it.

Contrast that with our thinking about mobile broadband or IoT.  How many companies see “mobile empowerment” as giving the user a mobile-readable screen from the same applications that they ran all along?  Is giving someone a different view of the same data going to revolutionize their productivity?  Does a worker with a powerful information appliance in their pocket as they move about doing their job, do it the same way as one chained to a desk?  Why would deskbound applications anticipate the things a mobile worker might want, or do?  And IoT?  We focus only on getting sensors and controllers “on the Internet” and not on what incremental value they’d create once there (or how we’d pay for their deployment, secure them, protect from privacy intrusions, and so forth).

So who fixes this?  Probably, eventually, we’ll blunder into a path that gets us to another wave of mission-driven IT investment.  I’m frankly surprised that hasn’t happened already, given that it’s been fifteen years since we had an IT growth cycle, far longer than we’ve ever waited for one before.  Given that unhappy bit of history, we probably can’t count on spontaneous resolution.

What do we then count on?  Some vendor could jump-start the process.  If we look back at prior tech cycles, we can see that they happened because technology filled a specific niche in business, a niche that vendors themselves saw and exploited.  Talking with both service providers and enterprises has given me some insight into what buyers think would be necessary for this happy niche-filling to happen again.

First, it’s about account presence.  Almost everyone in the buyer space believes that transformational technology solutions couldn’t be propagated except by direct sales contact.  “It’s a matter of trust,” one said.  “If I’m going to put myself on the line for some big shift, I want the vendor who’s promoting it to be there holding my hand.”  The contacts need to be at a high level, too.  Way over three quarters of buyers say that there has to be solid, ongoing CIO-level contact, and just slightly less said that truly transformational technology shifts would require COO and CEO engagement.  All this speaks to a long-standing account presence, which only the largest vendors can hope to have.

The second requirement is a clear solution ecosystem.  Nobody wants to piece together transformation by summing the parts.  A few buyers who have tried, or even done, just that say that if they had to do it again, they’d demand a holistic architecture up front.  It’s not that a vendor would be expected to be the supplier of every piece in the puzzle, just that they be offering it all, and taking responsibility for it.  This means that integrators could be a player in transformation in theory, but buyers also indicated that they were fairly skeptical of pure integrators.  They think that the vendor should have enough product skin in the game to be able to draw profit from their contribution.  Otherwise, buyers think they’re paying all the product retail margins, and integration besides.

The third requirement is an open approach.  Buyers are a bit embarrassed by this requirement; they know that it’s a contradiction to expect a vendor to have everything, sell everything, support everything, and at the same time preserve them from vendor lock-in, but hey, who says you have to be reasonable?  The point is that the approach that transformation takes has to include a framework into which a buyer can integrate the products they already have and such future products as their business needs dictate.  As a practical matter, this means a clear architecture, no proprietary interfaces or flows, and explicit provision for supplemental functionality.

I’m not surprised by what buyers say they want here.  In fact, it’s not really a massive change from what they’ve always wanted.  The challenge of transformation is that this is a big initiative, and the more that’s required the greater the risk and the more stringently buyers hold to their requirements to mitigate those risks.  I think the barrier can be broken, but I think that it’s going to take a vendor faced with incredible profit pressure from current technologies.  We may not be quite at the tipping point in that regard, in which case we can expect a year or two (or three) of doldrum spending till someone sees the elephant.

Google Yet Again Teaches Operators a Lesson

The network of the future might have to evolve from the network of today, but it has to evolve and not take root there.  Google has consistently looked first at the future, in network terms in particular, and only then worried about linking the future back to the present.  Their latest announced concept, Expresso, has been in use for a couple of years now, and I hope that operators and vendors alike will learn some lessons from it.

“Early on, we realized that the network we needed to support our services did not exist and could not be bought.”  That’s a Google quote from their blog on Expresso, and it’s absolutely profound.  It’s first and foremost an indictment of the notion that the goal of evolution in networking is to create incremental, safe-feeling change rather than to face the future.  I want to sell something, and networking as it exists will constrain my ability to do that.  Therefore, I will change networking.  How many network operators could have, and should have, done that?

Expresso is an evolution of a concept that Google introduced to allow its data centers to be linked with a highly efficient SDN core while at the same time maintaining the look of an IP network.  SDN allows for very deterministic traffic management and in its pure form it eliminates the adaptive routing that’s been a fixture of the Internet for ages.  Google’s goal was to make its data-center-connect SDN network (B4) and its data center network SDN (Jupiter) work inside an abstraction, something that looked like an IP core network but wasn’t.  Expresso is what does that.

A company like Google connects with a lot of the Internet; they say they peer with ISPs in over 70 metro areas.  At this “peering surface” Google has to look like an IP network, because the ISPs themselves don’t implement B4; they implement IP/BGP.  What Expresso does is create a kind of abstract giant virtual BGP router around the entire peering surface, with the core SDN network and data center structure inside.  All of Google’s services appear to be peered with all its partners at each of the peering points.  Inside, SDN links user traffic to services without a lot of intermediary router hops that generate latency.  This is what makes it possible for Google to offer low-latency services like its “Hey, Google” voice assistant.

Expresso creates what’s effectively a two-tier Internet.  The “slow lane” of the Internet is the stuff that’s based on IP end-to-end, and it might take a half-dozen hops to get through the BGP area in which a service is offered and be linked to the server that actually provides it.  With Expresso, once you pass through the peering surface you’re in the SDN “fast lane”.

Expresso also does some “process caching” in effect.  A user can be linked to the service point that offers the best performance without changing the IP address the user sees as the service URL.  Think of it as providing services a “process IP address” that Google then maps to the best place to run that process.
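
To make the “process IP address” idea concrete, here is an illustration of the mapping concept only; this is not Google’s actual mechanism, and the service points and latency figures are invented.

```python
# "Process caching" sketch: the user-visible service address stays
# fixed while a mapping layer binds each request to whichever service
# point currently offers the best performance.

SERVICE_POINTS = {          # hypothetical measured latencies, in ms
    "us-east": 12,
    "us-west": 48,
    "eu-west": 90,
}

def resolve(process_address):
    # The "process IP address" is stable; the binding underneath it is
    # re-evaluated on every request against current measurements.
    best = min(SERVICE_POINTS, key=SERVICE_POINTS.get)
    return (process_address, best)

assert resolve("svc.example")[1] == "us-east"
```

If measured latencies shift, the next resolution simply lands somewhere else; nothing the user sees ever changes.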

Traffic management is based on the pure-SDN notion of centralization.  There is no adaptive routing, no “convergence time” for the network to adapt to changing conditions.  A central process makes route decisions based on a broad set of utilization, availability, and QoE metrics, and that same process coordinates traffic on every trunk, every route.  The result is a very high level of utilization combined with very deterministic performance, a combination difficult to achieve in a real IP network except where utilization is held to 50% or so.  In effect, Google doubles the capacity of its trunks.
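
A toy version of that centralized decision shows the contrast with adaptive routing: one process scores every candidate route against utilization, availability, and QoE metrics.  The routes, metrics, and scoring function below are all invented for illustration.

```python
# Centralized route selection sketch: no per-node adaptation and no
# convergence period; a single process picks routes from a global view.

routes = [
    {"path": ["a", "b"],      "util": 0.8, "avail": 0.999, "qoe": 0.90},
    {"path": ["a", "c", "b"], "util": 0.4, "avail": 0.999, "qoe": 0.95},
]

def central_pick(candidates):
    # Lower utilization and higher availability/QoE score better; a
    # real controller would weight these, but the principle is the
    # same: every trunk and route is coordinated from one place.
    return min(candidates, key=lambda r: r["util"] - r["avail"] - r["qoe"])

best = central_pick(routes)
```

Because the controller sees all trunks at once, it can pack them far closer to capacity than an adaptive network that must leave headroom for convergence transients.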

Taken as a whole, Expresso demonstrates how SDN should have been considered, which is also how NFV should have been considered, and how everything should be.  Google was goal-oriented, top-down, in their approach.  That let Google define the way Expresso had to work based on how their B4 data center interconnect (DCI) worked, and how they wanted their services to work.  What they came up with is abstraction, the notion of making a cooperative system of any given technology or set of technologies, look like a single virtual piece of a different technology.  BGP in effect creates an “intent model” of an Internet area, inside which the property that’s visible is the property BGP makes visible.  How the property is fulfilled is nobody’s business but Google’s.

Another interesting aspect of Expresso is that it could be pushed ‘way out toward the network edge.  Google is already metro-peering.  As access networks and operator metro infrastructure changes, it’s easy to see Expresso sitting right inside the access edge grabbing Google traffic and giving it the best possible performance in support of Google’s services.  The further forward Expresso gets, the more useful it is because the more traditional inefficient, higher-latency, IP routing is displaced by SDN.

So suppose that you had an “Expresso Agent” right at the edge itself?  Suppose you could tap off Google traffic and tunnel it right into B4?  One of the lovely properties of an abstraction like the one Expresso creates is that it doesn’t have to be the only face that Google shows to the world.  You could take the same set of Google networks and features and push them through a custom SDN abstraction, one more aware of services and less location-specific.  Could Google then define not only the core of the future, but the service edge of the future?

Perhaps, but promoting any new edge protocol would be a challenge in an age where all anyone knows about is IP.  The big value Expresso reveals is the notion of the abstract proxy.  You don’t have to proxy IP/BGP, you could proxy any convenient protocol with an Expresso-like approach.  Stick Expresso-like technology in an edge router or an SD-WAN box and you can connect to anything you like inside.  You can transform the way the network works while leaving the services intact.

That’s what operators need to be thinking about, particularly for things like 5G.  Why wouldn’t the goals of 5G be best satisfied with an inside/outside Expresso model?  Here we have something carrying an estimated 25% of all the traffic of the Internet, so it’s surely proved itself.  We should have been paying more attention to Google all along, as I’ve said in prior blogs.  We darn sure should be paying attention now.

Service Lifecycle Management 101: Integrating with Management Processes

One of the questions certain to arise from discussions of service lifecycle management is how VNFs are managed.  The pat answer is “similar to the way that the physical network functions (PNFs) the VNFs replace were managed.”  Actually, it’s not so pat a response.  It is very desirable that management practices already in place be altered by NFV, or any new technology, only when the alteration delivers a significant net benefit.  Let’s then start with the management of the PNFs.

PNFs, meaning devices or systems of devices, are represented in the service model by a low-level (meaning close-to-resource) intent model.  That model exposes a set of parameters, some of which might relate to an SLA and others simply to the state of the operation the model represents.  The general rule of model hierarchies is that these parameters are populated from data exposed by the stuff within/below.  In the case of our hypothetical PNF-related intent model, the stuff below is the device or device system and the set of management parameters it offers, presumably in a MIB.

Every intent-model element or object in the service model instance has a parameter set, and each can expose an interface through which that parameter set can be viewed or changed.  This mechanism would allow a service management process to alter the behavior of the PNF that was embedded in the object we’re using to represent it.  Presumably the PNF’s own MIB is still accessible as it would normally be, though, and this raises the risk of collision of activities.
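
A minimal sketch of such a PNF-fronting intent model follows; the MIB variable names and the exposed parameter set are my own inventions, chosen only to show how exposed parameters are derived from the device data below.

```python
# A low-level intent model fronting a PNF: it populates its exposed
# parameter set from the device MIB beneath it, and offers a narrow
# view/change interface to service management above.

class DeviceMIB:
    """Stand-in for the device's real MIB; fields are illustrative."""
    def __init__(self):
        self.vars = {"ifOperStatus": "up", "throughputMbps": 940}

class PnfIntentModel:
    # Mapping from exposed parameter names to underlying MIB variables.
    EXPOSED = {"status": "ifOperStatus", "throughput": "throughputMbps"}

    def __init__(self, mib):
        self._mib = mib

    def view(self):
        # Exposed parameters are derived from what the device offers.
        return {k: self._mib.vars[v] for k, v in self.EXPOSED.items()}

    def change(self, name, value):
        # Only exposed parameters can be changed through the model.
        self._mib.vars[self.EXPOSED[name]] = value

model = PnfIntentModel(DeviceMIB())
```

Everything outside the `EXPOSED` mapping stays invisible, which is exactly the opacity that makes two different devices interchangeable behind the same model.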

One way to prevent PNF management from colliding with service management is to presume that the PNF isn’t “managed” in an active sense by the service management processes.  That would mean that the PNF asserts an SLA and either meets it or has failed.  The PNF management system, running underneath and hidden from service management, does what’s required to keep things working according to the SLA and to restore operation if the PNF does break.

This isn’t a bad approach; you could call it “probabilistic management” because service management doesn’t explicitly restore operation at the level we’re talking about.  Instead, there is a capacity-planned SLA and invisible under-the-model remediation.  For a growing number of services, it’s the most efficient way to assure operation.

If you don’t want to do stuff under the covers, so to speak, then you have to actively mediate management requests to ensure that you don’t have destructive collisions.  The easiest way to do that is to require that the PNF’s EMS/NMS work not with the actual interfaces but through the same intent model as the service management system.  That model would then have to serialize management changes as needed to ensure stable operation.

The serializing could be done in two ways—directly via the intent model, or at the process level.  Process-level serialization means that the intent model asserts a management API (by referencing its process) and that API is a talker on a bus that the real management process listens to.  All the requests to that API are serialized.  The intent-model-level approach says that management requests are events, generated by whatever is trying to manage.  Events have to be queued anyway, because they’re asynchronous, so this is the easier approach.  Event-based management also lets you change how you handle management commands based on the state of the object—you could ignore them if you’re in a state that indicates you’re waiting for something specific.
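
The intent-model-level approach can be sketched in a few lines.  The states and event names below are illustrative, not from any specification; the point is that a queue serializes requests for free, and the object’s state decides whether an event is acted on or ignored.

```python
import queue

# Event-based management serialization: all requesters funnel their
# management requests into one queue, and the element's current state
# determines how (or whether) each event is processed.

class ManagedElement:
    def __init__(self):
        self.state = "active"
        self.events = queue.Queue()     # serialization comes for free

    def submit(self, event):
        self.events.put(event)          # any requester, same funnel

    def run_once(self):
        event = self.events.get()
        if self.state == "active" and event == "reconfigure":
            self.state = "reconfiguring"
        # In any other state the same event is simply ignored, which
        # is how state-sensitive handling falls out of the design.
        return self.state
```

Two management systems can now hammer the same element concurrently without ever producing a destructive collision; their requests simply take their turns.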

All of this is fine, providing that we have an EMS/NMS that’s managing the PNF.  When we translate the PNF to a VNF, what happens?  It’s complicated.

A VNF has two layers of management; the management of the function itself (which should look much like managing the PNF from which the VNF was derived) and the management of the virtualization of the function.  There are some questions with the first layer, and a lot with the second.

Arguably it’s inconvenient in any management framework to have differences in management properties depending on the vendor or device itself.  For automated management in any form, the inconvenience turns into risk because it might not be easy to harmonize the automated practices across the spectrum of devices.  Thus, it would certainly aid the cause of service lifecycle management if we had uniform VNF functional management.  That could be accomplished simply by translating all the different PNF MIBs into a single MIB via an “Adapter” design pattern.
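
A toy version of that Adapter translation might look like the following; the vendor MIB layouts are invented, and a real adapter set would be generated per device family, but the normalization principle is the same.

```python
# "Adapter" design pattern sketch: per-vendor MIB layouts are
# translated into one uniform VNF management MIB so that automated
# lifecycle processes see a single consistent shape.

class VendorAMib:
    def read(self):
        return {"oper_state": 1, "tput_kbps": 900000}

class VendorBMib:
    def read(self):
        return {"linkUp": True, "throughputMbps": 880}

def adapt(mib):
    """Normalize any supported vendor MIB into the uniform shape."""
    raw = mib.read()
    if "oper_state" in raw:             # vendor A's dialect
        return {"up": raw["oper_state"] == 1,
                "throughput_mbps": raw["tput_kbps"] / 1000}
    return {"up": raw["linkUp"],        # vendor B's dialect
            "throughput_mbps": raw["throughputMbps"]}
```

Once every device speaks the uniform MIB, the automation above it never needs a vendor-specific branch, which is the whole argument for harmonizing at this layer.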

For the virtualization side of VNF management we have to think differently, because PNFs aren’t being hosted in clouds and service chaining of functions replaces having them live in a common box.  We cannot expose virtualization parameters and conditions to management systems that don’t know what a host is or why we’re addressing subnets and threading tunnels.

The convenient way to address this all is to think of VNF management as being a set of objects/elements.  The top one is the function part, and the bottom the virtualization part.  It’s my view that the boundary between these (the abstraction) should separate two autonomous management frameworks that are working to a mutual SLA.  So in effect, the function is an intent model and the virtual realization another.  In that second model, we always presume that the management process is working under the covers to sustain the SLA, not exposing its behavior or components to what’s above.  That means that what the NFV ISG calls “MANO” is largely invisible to the higher level of service lifecycle management, just as a YANG model of device control would be invisible—both are inside an intent model.
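
The two-layer stacking can be illustrated as follows.  This is my own schematic, not an ETSI structure; the replica count and SLA figures are invented, and the point is only that the lower layer sustains its SLA under the covers while exposing nothing about how.

```python
# Two-layer VNF management sketch: a function-level intent model
# stacked over a virtualization-level one, each opaque to the other
# and bound only by a mutual SLA.

class VirtualizationLayer:
    """Works under the covers to sustain its SLA; exposes nothing
    about hosts, subnets, or tunnels to the layer above."""
    def __init__(self):
        self._replicas = 2              # internal detail, never exposed

    def sla(self):
        # Only the asserted SLA crosses the boundary upward.
        return {"availability": 0.9999 if self._replicas > 1 else 0.99}

class FunctionLayer:
    def __init__(self, virt):
        self._virt = virt

    def sla_met(self, required_availability):
        # The function layer sees the SLA, never the mechanism.
        return self._virt.sla()["availability"] >= required_availability

vnf = FunctionLayer(VirtualizationLayer())
```

Everything the NFV ISG calls MANO lives inside the lower object here, which is why it stays invisible to service lifecycle management above.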

The whole of the vast, disorderly, often-criticized VNF onboarding process can be viewed as connecting the VNF to this two-level model of a lifecycle element.  You need to define the state/event handling at the top layer, and some mechanism to coordinate the MANO behavior in the virtual part.  You could create a “stub” of those Adapter design patterns in the specialized, VNF-resident, piece of the VNF Manager, to be accessed by a central management process that builds the connection.

You “could” do that, but should you?  I’m concerned that literal adherence to the ETSI model would actually tend to defeat service lifecycle management principles and make software automation and VNF onboarding more difficult.  The only purpose of a “stub” cohabiting with the VNF should be to adapt the management interfaces to a standard structure for the generation of events.  The service model should define the states of the related service elements and how they integrate events with processes.  That way, the service model defines the service, period.  If you have management logic inside a VNF, or if you have a global management process outside the VNF that is shared across VNFs, then you have a traditional transactional structure, one that has a fixed capacity to process things.  That’s kind of anachronistic when one of the goals of NFV is to provide scalable processes that replace non-scalable physical devices.

Functionally, there’s nothing wrong with a model that says that there are a set of boxes inside NFV that connect with abstract interfaces.  Literally, meaning at the software level, that can lead you to implementations that won’t in the long run satisfy market needs.  Automated service lifecycle management is what is needed for NFV to work.  We can get there using proven principles, even proven models, and I’m confident that somebody is going to get it right.  I just wish it would go a bit faster, and exposing the issues is the best way I know to advance progress.

Service Lifecycle Management 101: Modeling Techniques

Picking the best approach to service modeling for lifecycle management is like picking the Final Four; there’s no shortage of strongly held opinions.  This blog is my view, but as you’ll see I’m not going to war to defend my choice.  I’ll lay out the needs, give my views, and let everyone make their own decision.  If you pick something that doesn’t meet the needs I’m describing, then I believe you’ll face negative consequences.  But as a relative once told me a decade or so ago, “Drive your own car, boy!”

I’ve described in prior blogs that there are two distinct layers in a service model—the service layer and the resource layer.  The need to reflect the significant differences in the missions of these layers without creating a brittle structure effectively defines a third layer, the boundary layer where I’ve recommended that the actual process of laying out the abstractions to use should start.  I’m going to start the modeling discussion at the top, though, because that’s where the service really starts.

The service layer describes commercial issues, parameters, elements, and policies at the business level.  These models, in my view, should be structured as intent models, meaning that they create abstract elements or objects whose exposed properties describe what they do, not how.  The beauty of an intent model is that it describes the goal, which means that the mechanisms whereby that goal can be met (which live within the intent model) are invisible and equivalent.

I’ve done a fair amount of intent modeling, in standards groups like IPsphere and the TMF and in my own original ExperiaSphere project (spawned from TMF Service Delivery Framework work, TMF519), the CloudNFV initiative where I served as Chief Architect, and my new ExperiaSphere model that addressed the SDN/NFV standards as they developed.  All of these recommended different approaches, from TMF SID to Java Classes to XML to TOSCA.  My personal preference is TOSCA because I believe it’s the most modern, the most flexible, and the most complete approach.  We live in a cloud world; why not accept that and use a cloud modeling approach?  But what’s important is the stuff that is inside.

An intent model has to describe functionality in abstract.  In network or network/compute terms, that means that it has to define the function the object represents, the connections that it supports, the parameters it needs, and the SLA it asserts.  When intent models are nested as they would be in a service model, they also have to define, but internally, the decomposition policies that determine how objects at the next level are linked to this particular object.  All of this can be done in some responsive way with any of the modeling approaches I’ve mentioned, and probably others as well.

When these models spawn subordinates through those decomposition policies, there has to be a set of relationships defined between the visible attributes of the superior object and those of its subordinates, to ensure that the intrinsic guarantees of the abstract intent model are satisfied.  These relationships operate in both directions: the superior passes a relationship set based on its own exposed attributes to subordinates, and it takes the parameters/SLA exposed by subordinates and derives its own exposed values from them.
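The upward direction of that derivation can be sketched very simply.  The specific rules below are illustrative assumptions, not part of any standard: for subordinates chained in series, availability is usually modeled as the product of subordinate availabilities, and end-to-end latency as the sum.

```python
from math import prod

# Hypothetical derivation rules: a superior object computes the SLA it
# exposes from the SLAs its subordinates expose.  For a serial chain,
# availability multiplies and latency adds.
def derive_sla(subordinate_slas):
    return {
        "availability": prod(s["availability"] for s in subordinate_slas),
        "latency_ms": sum(s["latency_ms"] for s in subordinate_slas),
    }

access = {"availability": 0.999, "latency_ms": 5.0}
core = {"availability": 0.9999, "latency_ms": 20.0}
print(derive_sla([access, core]))
```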

It follows from this that any level of the model can be “managed” provided that there are exposed attributes to view and change, and that there’s something that can do the viewing and changing.  It also follows that if there’s a “lifecycle” for the service, that lifecycle has to be derived from or driven by the lifecycles of the subordinate elements down to the bottom.  That means that every intent model element or object has to have a “state” and a table that defines how events are processed in each state.  Thus, each one has to specify an event interface and a table that contains the processes to be used at every state/event intersection.

Events in this approach are signals between superior and subordinate models.  It’s critical that they be exchanged only across this one specific adjacency, or we’d end up with a high-level object that knew about something inside what’s supposed to be an opaque abstraction.  When an event happens, it’s the event that triggers the model element to do something, meaning that it’s the event that activates the lifecycle progression.  That’s why this whole state/event thing is so important to lifecycle management.
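Here is a minimal sketch of that state/event machinery.  The states, events, and processes are all illustrative assumptions; the structural point is that each (state, event) intersection names a process, and unrecognized combinations are ignored rather than breaking the abstraction.

```python
# A sketch of the per-element state/event table.  States and events are
# invented for illustration; each intersection names a process.
def on_activate(elem): elem["state"] = "Activating"
def on_active(elem): elem["state"] = "Active"
def on_fault(elem): elem["state"] = "Fault"
def ignore(elem): pass

STATE_EVENT_TABLE = {
    ("Ordered", "ACTIVATE"): on_activate,
    ("Activating", "SUBORDINATE_ACTIVE"): on_active,
    ("Active", "SUBORDINATE_FAULT"): on_fault,
}

def handle_event(element, event):
    # Events arrive only across the superior/subordinate adjacency;
    # anything not in the table for the current state is ignored.
    process = STATE_EVENT_TABLE.get((element["state"], event), ignore)
    process(element)

elem = {"name": "L3Connect", "state": "Ordered"}
handle_event(elem, "ACTIVATE")
print(elem["state"])  # Activating
```

Because the table is data rather than logic, any properly written process can pick up the model and play its role—which is the portability point the next paragraph makes.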

A service model “instance”, meaning one representing a specific service contract or agreement, is really a data model.  If you took that model in its complete form and handed it to a process that recognized it, the process could handle any event and play the role of the model overall.  That makes it possible to distribute, replicate, and replace processes as long as they are properly written.  That includes not only the thing that processes the model to handle events, but also the processes referenced in the state/event table.  The model structures all of service lifecycle management.

It’s easy to become totally infatuated with intent modeling, and it is the most critical concept in service lifecycle management, but it’s not the only concept.  Down at the bottom of a tree of hierarchical intent models will necessarily be something that commits resources.  If we presume that we have a low-level intent model that receives an “ACTIVATE” event, that model element has to be able to actually do something.  We could say that the process that’s associated with the ACTIVATE in the “Ordered” state does that, of course, but that kind of passes the buck.  How does it do that?  There are two possibilities.

One is that the process structures an API call to a network or element management system that’s already there, and asks for something like a VLAN.  The presumption is that the management system knows what a VLAN is and can create one on demand.  This is the best approach for legacy services built from legacy infrastructure, because it leverages what’s already in use.
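A sketch of that first option might look like the following.  The NMS endpoint, the payload shape, and the parameter names are entirely hypothetical—a real management system defines its own interface—but the shape of the interaction is what matters: the ACTIVATE process simply asks an existing system for a VLAN.

```python
import json
from urllib import request

# Hypothetical: build a REST request asking an existing NMS for a VLAN.
# The "/services" endpoint and payload fields are invented for
# illustration; substitute whatever interface your NMS actually exposes.
def build_vlan_request(nms_url, vlan_id, ports):
    payload = {"service": "vlan", "vlanId": vlan_id, "ports": ports}
    return request.Request(
        f"{nms_url}/services",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_vlan_request("https://nms.example.com", 101, ["ge-0/0/1", "ge-0/0/2"])
print(req.full_url, req.method)
```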

The second option is that we use something model-driven to do the heavy lifting all the way down to infrastructure.  TOSCA is a cloud computing modeling tool by design, so obviously it could be used to manage hosted things directly.  It can also describe how to do the provisioning of non-cloud things, but unless you’re invoking that EMS/NMS process as before, you’d have to develop your own set of processes to do the setup.

Where YANG comes in, in my view, is at this bottom level.  Rather than having a lot of vendor and technology tools you either inherit and integrate or build, you could use YANG to model the tasks of configuring network devices and generate the necessary (NETCONF) commands to the devices.  In short, you could reference a YANG/NETCONF model in your intent model.  The combination is already used in legacy networks, and since legacy technology will dominate networking for at least four to five more years, that counts for a lot.
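As a small illustration of that bottom layer, the sketch below generates the body of a NETCONF edit-config payload for the standard ietf-interfaces YANG module.  The interface name and description are invented; in practice a NETCONF client library (ncclient is a common choice in Python) would transport this to the device.

```python
from xml.sax.saxutils import escape

# Build the <config> body of a NETCONF edit-config request, following
# the standard ietf-interfaces YANG module.  The interface name and
# description below are illustrative.
def interface_config(name, description):
    return (
        '<config>'
        '<interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">'
        f'<interface><name>{escape(name)}</name>'
        f'<description>{escape(description)}</description></interface>'
        '</interfaces></config>'
    )

print(interface_config("ge-0/0/1", "VPN access port"))
```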

I want to close this by making a point I also made in the opening.  I have a personal preference for TOSCA here, based on my own experiences, but it’s not my style to push recommendations that indulge my personal preferences.  If you can do what’s needed with another model, it works for me.  I do want to point out that at some point it would be very helpful to vendors and operators if models of services and service elements were made interchangeable.  That’s not going to happen if we have a dozen different modeling and orchestration approaches.

The next blog in this series will apply these modeling principles to the VNFs and VNF management, which will require a broader look at how this kind of structured service model supports management overall.


Service Lifecycle Management 101: Principles of Boundary-Layer Modeling

Service modeling has to start somewhere, and both the “normal” bottom-up approach and the software-centric top-down approach have their pluses and minuses.  Starting at the bottom invites creating an implementation-specific approach that misses a lot of issues and benefits.  Starting at the top ignores the reality that operators have an enormous sunk cost in network infrastructure, and a revenue base that depends on “legacy” services.  So why not the middle, which as we saw in the last blog means the boundary layer?

A boundary-layer-driven approach has the advantage of focusing where the capabilities of infrastructure, notably the installed base of equipment, meet the marketing goals as defined by the service-level modeling.  The trick for service planners, or for vendors or operators trying to define an approach that can reap the essential justifying benefits, is a clear methodology.

The most important step in boundary-layer planning for service lifecycle management and modeling is modeling legacy services based on OSI principles.  Yes, good old OSI.  OSI defines protocol layers, but it also defines management layers, and this latter definition is the most helpful.  Services, says the OSI management model, are coerced from the cooperative behavior of systems of devices.  Those systems, which we call “networks”, are of course made of the devices themselves, the “physical network functions” that form the repository of features that NFV is targeting, for example.

Good boundary-layer planning starts with the network layer.  A service or resource architect would want to first define the network behaviors that are created and exploited by current infrastructure.  Most network services are really two-piece processes.  You have the “network” as an extended set of features that form the communications/connection framework that’s being sold, and you have “access”, which is a set of things that get sites connected to that network framework.  That’s a good way to start boundary planning—you catalog all the network frameworks—Internet, VPN, VLAN, whatever—and you catalog all the access pieces.

You can visualize networks as being a connection framework, to which are added perhaps-optional hosted features.  For example, an IP VPN has “router” features that create connectivity.  It also has DNS and DHCP features to manage URL-versus-IP-address assignment and association, and it might have additional elements like security, tunnels, firewalls, etc.  The goal of our network behavior definition is to catalog the primary network services, like IP VPN, and to then list the function/feature components that are available for it.

From the catalog of services and features, we can build the basic models at the boundary layer.  We have “L3Connect” and “L2Connect” for example, to express an IP network or an Ethernet network.  We could also have an “L1Connect” to represent tunnels.  These lowest-level structures are the building-blocks for the important boundary models.

Let’s go back to IP VPN.  We might say that L3Connect is an IP VPN.  We might further classify IP VPN into “IPSubnet”, which is really an L2Connect plus a default gateway router.  We might say that an L1Connect plus an SD-WAN access set is also an IP VPN.  You get the picture, I think.  The goal is to define elements that can be nuclear, or be made up of a combination of other elements.  All of the elements we define in the boundary layer relate to both what a capability looks like as a service and how we realize it through a device or device system.
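That nuclear-versus-composite structure can be sketched as a simple catalog.  The element names below follow the examples in the text; the catalog and the expansion rule are illustrative assumptions.

```python
# A sketch of a boundary-layer catalog.  Elements with no parts are
# nuclear; composites are built from other elements.  Names follow the
# IP VPN examples in the text; the structure itself is illustrative.
BOUNDARY_CATALOG = {
    "L1Connect": [],                               # nuclear
    "L2Connect": [],                               # nuclear
    "L3Connect": [],                               # nuclear
    "IPSubnet": ["L2Connect", "DefaultGateway"],   # composite
    "SDWAN-VPN": ["L1Connect", "SDWANAccess"],     # composite
}

def flatten(element):
    """Expand a boundary element into the nuclear pieces it is built from."""
    parts = BOUNDARY_CATALOG.get(element, [])
    if not parts:
        return [element]
    out = []
    for p in parts:
        out.extend(flatten(p))
    return out

print(flatten("IPSubnet"))
```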

Don’t get caught up in thinking about retail services at this point.  What we want to have is a set of capabilities, and a mechanism to combine those capabilities in ways that we know are reasonable and practical.  We don’t worry about the underlying technology needed to build our L2Connect or whatever, only that the function of a Level 2 connection resource exists and can be created from infrastructure.

The boundary-layer functions we create obviously do have to be sold, and do have to be created somehow, but those responsibilities lie in the resource and service layers, where modeling and lifecycle management defines how those responsibilities are met.  We decompose a boundary model element into resource commitments.  We decompose a retail service into boundary model elements.  That central role of the boundary element is why it’s useful to start your modeling there.

I think it’s pretty self-evident how you can build boundary models for legacy services.  It’s harder to create them when there is no service you can start with, where the goal of the modeling is to expose new capabilities.  Fortunately, we can go back to another structural comment I made in an earlier blog.  All network services can be built as a connection model, combined with in-line elements and hosted elements.  An in-line element is something that information flows through (like a firewall) and a hosted element is something that performs a service that looks something like what a network endpoint might do (a DNS or DHCP server).  A connection model describes the way the ports of the service relate to traffic.  Three connection models are widely recognized: “LINE” or “P2P”, which is point-to-point; “LAN” or “MP”, which is multipoint; and “TREE”, which is broadcast/multicast.  In theory you could build others.
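These three building blocks can be sketched directly.  The service assembly below—a multipoint VPN with a firewall in-line and DNS/DHCP hosted—follows the examples in the text; the function and element names are illustrative.

```python
from enum import Enum

# The three widely recognized connection models described above.
class ConnectionModel(Enum):
    LINE = "point-to-point"
    LAN = "multipoint"
    TREE = "broadcast/multicast"

# A service is a connection model plus in-line elements (traffic flows
# through them) and hosted elements (endpoint-like features).
def describe_service(connection, inline=(), hosted=()):
    return {
        "connection": connection.name,
        "inline": list(inline),
        "hosted": list(hosted),
    }

vpn_service = describe_service(ConnectionModel.LAN,
                               inline=["firewall"],
                               hosted=["dns", "dhcp"])
print(vpn_service)
```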

If we presume that new services would be defined using these three most general models, we could have something that says that a “CloudApplication” is a set of hosted elements that represent the components, and a connection model that represents the network service framework in which the hosted elements are accessible.  Users get to that connection model via another connection model, the LINE or access model, and perhaps some in-line elements that represent things like security.

If new services can be built this way it should be obvious that there are some benefits in using these lower-level model concepts as ways to decompose the basic features like L2Connect.  That’s an MP connection model built at L2, presumably with Ethernet.  If this approach of decomposing to the most primitive features is followed uniformly, then the bottom of the boundary layer is purely a set of function primitives that can be realized by infrastructure in any way that suits the functions.  L3Connect is a connection model of MP realized at the Level 3 or IP level.  You then know that you need to define an MP model, and make the protocol used a parameter of the model.

Even cloud applications, or cloud computing services, can be defined.  We could say that an ApplicationService is a hosted model, connected to either an L2 or L3 Connect service that’s realized as an MP model.  How you host, meaning whether it’s containers or VMs, can be a parameter of the hosting model if it’s necessary to know at the service layer which option is being used.  You could also have a purely “functional” hosting approach that decomposes to VMs or containers or even bare metal.
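Pulling the last two points together, here is a sketch of the parameterization idea: L2Connect and L3Connect are both just the MP model with the protocol as a parameter, and hosting style is a parameter that can stay “functional” and be resolved further down.  All names are illustrative.

```python
# Sketch: protocol and hosting style as parameters of generic models,
# per the text.  L2Connect/L3Connect become an MP model plus a protocol;
# "functional" hosting defers the VM/container/bare-metal choice downward.
def mp_connect(protocol):
    return {"model": "MP", "protocol": protocol}

def hosted(style="functional"):
    return {"model": "HOSTED", "style": style}

application_service = {
    "hosting": hosted("containers"),   # or hosted() to decide later
    "network": mp_connect("IP"),       # i.e., an L3Connect
    "access": {"model": "LINE"},
}
print(application_service)
```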

There is no single way to use the boundary layer, but for any given combination of infrastructure and service goals, there’s probably a best way.  This means it’s worth taking the time to find your own best approach before you get too far along.

In our next piece, we’ll look at the modeling principles for the service, boundary, and resource layers to lay out what’s necessary in each area, and what might be the best way of getting it.