Taking a Longer Look at 5G Infrastructure and Services

It seems possible, based on the results of the MWC show, to speculate a bit on what infrastructure and service considerations are likely to arise out of the 5G specs.  “Speculate” is the key word here; I’ve already noted that the show didn’t address the key realities of 5G, IoT, or much of anything else.  I also want to point out that we don’t have firm specifications here, and in my view we don’t even have convincing indicators that all the key issues are going to be addressed in the specs that do develop.  Thus, we can’t say whether these “considerations” will be considered anywhere outside this blog and the people who respond on LinkedIn or to me directly.

Three things that 5G is supposed to do according to both the operators and what I read as “show consensus” are to support a unified service framework for wireline and wireless, support “network slicing” to separate services and operators who share infrastructure, and allow mobile services to incorporate elements of other connectivity resources, including wireline and satellite.  These three factors seem to frame one vision of the future that’s still not accepted widely—the notion of an explicit overlay/underlay structure for 5G.

Traditional networking is based on two notions: that services are built on layers, each of which abstracts away the details of implementing the layers below, and that within a layer the protocols of the layer define the features of the service.  When you have an IP network, for example, you rely on some Level 2 and Level 1 service, but you don’t “see” those layers directly.  You do “see” the features of the IP network in the features of your service.

Overlay/underlay networking is similar to the layered structure of the venerable OSI model, but it extends it a bit.  We have overlay/underlay networking today in “tunnel networks” that build connectivity based on the use of virtual paths or tunnels supported by a protocol like Ethernet or IP, and we now have formalized overlays built using SDN or SD-WAN technology.  Most overlay/underlay networks, in contrast to typical OSI-layer models, don’t rely on any feature of the layer below other than connectivity.  There are no special protocols or features needed.  Also, overlay/underlay networking has from the first been designed to allow multiple parallel overlays on a single underlay; most OSI-modeled networks have a 1:1 relationship between L2 and L3 protocols.

In a 5G model, the presumption of overlay/underlay services would be that there would be some (probably consistent) specification for an overlay, both in terms of its protocols and features.  This specification would be used to define all of the “service networks” that wireline and wireless services currently offer, and so the overlay/underlay framework would (with one proviso I’ll get to) support any “service network” over any infrastructure.  That satisfies the first of our three points.

The second point is also easily satisfied, because multiple parallel overlay networks are exactly what network slicing would demand.  If we expanded the “services” of the underlay network to include some class-of-service selectivity, the overlays could be customized to the QoS needs of the services they represent in turn.
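To make the overlay/underlay idea concrete, here’s a minimal sketch in Python of the structure I’m describing: two “slice” overlays sharing one underlay whose only exposed services are connectivity and a class-of-service choice.  Every name here is hypothetical; this illustrates the concept, not any actual 5G or vendor API.

```python
# Sketch: two service overlays ("slices") sharing one underlay.  The underlay
# exposes nothing but connectivity plus a class-of-service choice.

from dataclasses import dataclass, field

@dataclass
class Underlay:
    name: str
    cos_classes: set = field(default_factory=lambda: {"best-effort", "low-latency"})

    def request_path(self, src: str, dst: str, cos: str) -> str:
        if cos not in self.cos_classes:
            raise ValueError(f"underlay {self.name} does not offer CoS {cos}")
        return f"{src} -> {dst} via {self.name} [{cos}]"

@dataclass
class Overlay:
    slice_name: str
    cos: str
    underlay: Underlay

    def connect(self, src: str, dst: str) -> str:
        # The overlay sees only the underlay's connectivity, nothing deeper.
        return self.underlay.request_path(src, dst, self.cos)

transport = Underlay("metro-fiber")
slices = [Overlay("consumer-mobile", "best-effort", transport),
          Overlay("industrial-iot", "low-latency", transport)]
for s in slices:
    print(s.slice_name, "=>", s.connect("cell-site-1", "gateway-a"))
```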

In both SD-WAN and SDN overlays, the connectivity of the overlay is managed independent of the underlay; the OSI model tends to slice across layer boundaries or partition the devices to create overlay/underlay connectivity.  In most SD-WAN applications the presumption is that the edge devices (where the user is attached) terminate a mesh of tunnels that create connectivity.  In SDN, there may be a provision for intermediary steering, meaning that an endpoint might terminate some tunnels and continue others.  For proper 5G support, we need to review these options in the light of another element, which is explicit network-to-network interconnect.
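Before getting to interconnect, a quick side note on the tunnel-mesh arithmetic: the little Python sketch below (with invented site names) just enumerates the tunnels an SD-WAN-style overlay needs to fully mesh its edge devices.  The count grows as n(n-1)/2, which is one reason intermediary steering becomes attractive as sites multiply.

```python
# Illustrative tunnel-mesh arithmetic for an SD-WAN-style overlay: every pair
# of edge devices terminates a tunnel, so the count grows as n*(n-1)/2.
from itertools import combinations

edges = ["branch-ny", "branch-chi", "branch-sf", "hq-dc"]   # invented sites
tunnels = list(combinations(edges, 2))

print(f"{len(edges)} edge devices need {len(tunnels)} tunnels for a full mesh:")
for a, b in tunnels:
    print(f"  {a} <-> {b}")
```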

Most protocols have some mechanism for NNI, but these are usually based on creating a connection between those singular top-of-the-stack OSI protocols.  In overlay/underlay networks, an NNI element lives at the overlay level, and simply connects across what might be a uni-protocol (same protocol for the underlay) or a multi-protocol (a different underlay on each side) border.  Alternatively, you could have an underlay gateway that binds the two networks together and harmonizes connectivity and QoS, and this could allow the overlay layer to treat the two as the same network.

The border concept could also describe how an underlay interconnect would be shared by multiple overlays, and that concept could be used to describe how a fiber trunk, satellite link, or other “virtual wire” would be represented in an overlay/underlay structure and how it could be used by multiple services.  On- and off-ramps to links like this are a form of gateway, after all.

The question that’s yet to be addressed here is the role that virtual function hosting might play.  There’s nothing explicitly in 5G discussions to mandate NFV beyond hopefulness.  On the other hand, the existence of an overlay technology could well create the beginning of an NFV justification, or at least a justification for cloud-hosting of these overlay components rather than dedicating devices to that role.  An overlay network should be more agile than the underlay(s) that support it.  That agility could take the form of having nodes appear and disappear at will, based on changes in traffic or connectivity, and also in response to changes in the state of the underlay network.  Virtual nodes fit well into the overlay model, even NFV-hosted virtual nodes.

Beyond that it’s harder to say, not because hosting more features isn’t beneficial but because hosting alone doesn’t justify NFV.  NFV was, from the first, fairly specialized in terms of its mission.  A “virtual network function” is a physical network function disembodied.  There really aren’t that many truly valuable physical network functions beyond nodal behavior.  Yes, you can hypothesize things like virtual firewalls and NATs, but you can get features like that for a few bucks at the local Staples or Office Depot, at least for the broad market.  Moving outside nodal (connectivity-routing) features to find value quickly takes you outside the realm of network functions and into application components.  Is a web server a network function, or a mail server?  Not in my view.

From the perspective of 5G and IoT, though, the requirements for hosting virtual functions or hosting cloud processes are very similar; there is a significant connectivity dimension.  We have done very little work in the NFV space to frame what network model is required to support the kind of function-hosting-and-management role needed.  The work that’s been done in the cloud space has focused on a pure IP-subnet model that’s too simple to address all the issues of multi-tenant functions that have to be securely managed as well.  In fact, the issue of addressing and address management is probably the largest issue to be covered, even in the overlay/underlay model.  If operators and vendors are serious about 5G, then they need to get serious about this issue too.

What Would Edge-Hosting Mean to Infrastructure and Software Design?

If computing in general and carrier cloud in particular are going to become more edge-focused over time, then it’s time to ask just what infrastructure features will be favored by the move.  Even today we see variations in server architecture and the balance of compute and I/O support needed.  It is very likely that there will be even more variations emerging as a host of applications compete to dominate the cloud’s infrastructure needs.  What are the factors, and what will the result be?  I’m going to have to ask you to bear with me, because understanding the very important issues here means going way beyond 140-character Tweets.

It’s always a challenge to try to predict how something that’s not even started will turn out in the long term.  Carrier cloud today is almost pre-infancy; nearly all carrier IT spending is dedicated to traditional OSS/BSS, and what little is really cloud-building or even cloud-ready is so small that it’s not likely representative of broader, later, commitment.  Fortunately, we have some insight we can draw from the IT world, insight that’s particularly relevant given the fact that things like microservices are already a major driver of change in IT, and are of increasing interest in the carrier cloud.  To get to these insights we need to look a bit at the leading edge of cloud software development.

Microservices are in many ways a kind of bridge between traditional componentized applications (including those based on the Service Oriented Architecture of almost two decades ago) and the “bleeding edge” of computing architecture, the functional programming or Lambda function wave.  A Lambda function is a software element that processes an input and produces an output without relying on stored internal state—it has a single function regardless of the context of its use.  What makes this nice is that because nothing is ever saved inside a Lambda function, you can give a piece of work to any copy of the function and get exactly the same result.  I’m going to talk a lot about Lambda functions in this blog, so to save typing I’m going to call them “Lambdas”, with apologies to the people who use the term (without capitalizing) to mean “wavelength”.

In the broader development context, this kind of behavior is known as “stateless” behavior, because there are no “states”, no differences in function outcome depending on the sequence of events or messages being processed.  Stateless behavior is mandatory for Lambdas, and also highly recommended if not mandated for microservices.  Stateless components are great because you can replace them, scale them, or use any convenient copy of them and there’s no impact, no cross-talk.  The problem is that many processes aren’t stateless at all—think of taking money out of the bank if you need an easy example.  What you have left depends on what you’ve put in or taken out before.
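Here’s the distinction in code form, using invented examples rather than any real AWS Lambda or Azure Functions API: the stateless function gives the same answer from any copy, while the bank-account outcome depends on everything that came before.

```python
# A stateless ("Lambda-like") function versus a stateful process.

def to_fahrenheit(celsius: float) -> float:
    # Stateless: any copy, anywhere, returns the same answer for the same input.
    return celsius * 9.0 / 5.0 + 32.0

class BankAccount:
    # Stateful: what a withdrawal returns depends on everything done before.
    def __init__(self, balance: float):
        self.balance = balance

    def withdraw(self, amount: float) -> float:
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount   # stored state changes the next outcome
        return self.balance

print(to_fahrenheit(100.0))      # 212.0 every time, from any copy
account = BankAccount(100.0)
print(account.withdraw(30.0))    # 70.0 this time...
print(account.withdraw(30.0))    # ...40.0 the next: sequence matters
```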

The reason for this little definitional exercise is that both Amazon and Microsoft have promoted Lambda programming as a pathway to event-driven IT, and the same is being proposed for microservices.  In Amazon’s case, they linked it with distributing functions out of the cloud and into an edge element (Greengrass).  Event-driven can mean a lot of things, but it’s an almost-automatic requirement for what are called “control loop” applications, where something is reported and the report triggers a process to handle it.  IoT is clearly a control-loop application, but there are others even today, which is why Amazon and Microsoft have focused on cloud support for Lambda functions.  You can write a little piece of logic to do something and just fire it off into the network somewhere it can meet the events it supports.  You don’t commit machine image resources or anything else.

If IoT and carrier cloud focus on being event-driven, it follows that they would likely become at least Lambda-like: based on stateless microservices that are pushed toward the edge to shorten the control loop, while traditional transactional processes stay deeper in the compute structure.  Applications, then, could be visualized as a cloud of Lambdas floating around, collectively supporting a smaller number of stateful, repository-oriented central applications.  The latter will almost surely look like any large server complex dedicated to online transaction processing (OLTP).  What about the former?

The Lambda vision is one of functional units that have no specific place to live, remember.  It’s a vision of migration of capabilities to assemble them along the natural path of work, at a place that’s consistent with their mission.  If they’re to be used in event-handling, this process of marshaling Lambdas can’t take too long, which means that you’d probably have a special system that’s analyzing Lambda demand and caching them, almost like video is cached today.  You’d probably not want to send a Lambda somewhere as much as either have it ready or load it quickly from a local resource.  Once it’s where it needs to be, it’s simply used when the appropriate event shows up.

This should make it obvious that running a bunch of Lambdas is different from running applications.  You don’t need a lot of disk I/O for most such missions, unless the storage is for non-volatile referential data rather than a dynamic database.  What you really want is powerful compute capabilities, a lot of RAM capacity to hold functions-in-waiting, and probably flash disk storage so you can quickly insert a function that you need, but hadn’t staged for use.  Network I/O would be very valuable too, because restrictions on network capacity would limit your ability to steer events to a convenient Lambda location.

How Lambda and application hosting balance each other, requirements-wise, depends on how far you are from the edge.  At the very edge, the network is more personalized, and so the opportunity to host “general-use Lambdas” is limited.  As you go deeper, the natural convergence of network routes along physical facilities generates places where traffic combines and Lambda missions could reasonably be expected to be shared across multiple users.

This builds a model of “networking” that is very different from what we have now, perhaps more like that of a CDN than like that of the Internet.  We have a request for event-processing, which is an implied request for a Lambda stream.  We wouldn’t direct the request to a fixed point (any more than we direct a video request that way), but would rather assign it to the on-ramp of a pathway along which we had (or could easily have) the right Lambdas assembled.

I noted earlier in this blog that there were similarities between Lambdas and microservices.  My last paragraph shows that there is also at least one difference, at least in popular usage, between Lambdas and microservices.  The general model for microservices is based on extending componentization and facilitating the use of common functions in program design.  A set of services, as independent components, support a set of applications.  Fully exploiting the Lambda concept would mean that there really isn’t a “program” to design at all.  Instead there’s a kind of ongoing formula that’s developed based on the source of an event, its ultimate destination, and perhaps the recent process steps taken by other Lambdas.  This model is the ultimate in event-driven behavior, and thus the ultimate in distributed computing and edge computing.

There’s another difference between microservices and Lambdas, more subtle and perhaps not always accepted by proponents of the technologies.  Both are expected to be “stateless” as I noted, but in microservices it’s acceptable to use “back-end” state control to remove state/context from the microservices themselves.  With Lambdas, this is deprecated because in theory different copies of the same Lambda might try to alter state at the same time.  It would be better for “state” or context to be carried as a token along with the request.
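Here’s a sketch of what “state as a token” might look like, with hypothetical names and no particular framework assumed: context rides along with the request, so any copy of the Lambda can process the next event.

```python
# "State as a token": the Lambda never updates shared back-end state; instead,
# context travels with the request, so any replica can handle the next event.

def handle_event(event: dict, token: dict) -> tuple:
    # Pure function of (event, token-in); returns (response, token-out).
    count = token.get("events_seen", 0) + 1
    new_token = {**token, "events_seen": count, "last_event": event["type"]}
    response = {"ack": event["type"], "sequence": count}
    return response, new_token

token = {}   # initial context rides along with the work
for event in [{"type": "door-open"}, {"type": "door-close"}]:
    response, token = handle_event(event, token)   # any replica could run this step
    print(response, token)
```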

We don’t yet really have a framework to describe it, though.  Here’s an event, pushed out by some unspecified endpoint.  In traditional programming, something is looking for it, or it’s being posted somewhere explicitly.  Maybe it’s publish-and-subscribe.  However, in a pure Lambda model, something Out There is pushing Lambdas out along the path of the event.  What path is that?  How does the Something know what Lambdas are needed or where to put them?

If you applied the concepts of state/event programming to Lambda control, you could say that when an event appears it is associated with some number of state/event tables, tables that represent contexts that need to process that event.  The movement of the event through Lambdas could be represented as the changing of states.  Instead of the traditional notion of an event arriving at a process via a state/event table, we have a process arriving at the event for the same reason.  But it’s still necessary to know what process is supposed to arrive.  Does the process now handling an event use “state” information that’s appended to it and identify the next process down the line?  If so, how does the current process know where the next one has been dispatched, and how does the dispatcher know to anticipate the need?  You can see this needs a lot of new thinking.
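To make that a bit more concrete, here’s one hypothetical shape a state/event table for Lambda control could take; every name is invented, since as I said, no real framework for this exists today.  Each (state, event) pair names the function that should “arrive at” the event, which is exactly the information a dispatcher would need to pre-stage Lambdas along the expected path.

```python
# A hypothetical state/event table for Lambda control: each (state, event)
# pair names the function that should "arrive at" the event.

def start_pump(context):   return "pumping"
def stop_pump(context):    return "idle"
def raise_alarm(context):  return "alarmed"

STATE_EVENT_TABLE = {
    ("idle",    "low-level"):  start_pump,
    ("pumping", "full-level"): stop_pump,
    ("pumping", "leak"):       raise_alarm,
}

def dispatch(state: str, event: str, context: dict) -> str:
    handler = STATE_EVENT_TABLE.get((state, event))
    if handler is None:
        return state              # no context needs this event; ignore it
    return handler(context)       # the handler returns the next state

state = "idle"
for event in ["low-level", "leak"]:
    state = dispatch(state, event, {})
    print(event, "->", state)
```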

IoT will really need this kind of edge-focused, Lambda-migrating thinking.  Even making OSS/BSS “event-driven” could benefit from it.  Right now, as far as I can see, all the good work is being done in the abstract with functional programming, or behind the scenes at web-focused, cloud-hosted startups, which have probably stimulated both Amazon and Microsoft to offer Lambda capabilities in their clouds.  It will be hard to make IoT the first real use case for this—it’s a very big bite—but maybe that’s what has to happen.

A Slightly Early MWC Retrospective

The iconic MWC conference is now pretty much history.  The big announcements have been made, the attendees have largely exhausted themselves (the exhibitors certainly have!), and it’s time to take stock and decide whether anything important was really said and shown.  In terms of point announcements, it’s rare for something huge to come out at an event like MWC—too much crosstalk.  The buzz of the show is another matter; we can pick out some important points by looking across all the announcements and demonstrations to detect shifts and non-shifts.

The most important thing that I take away from MWC is that there is an enormous gap between 5G expectation and the current state of the technology.  The goal of 5G is service- and infrastructure-shaking, and the reality of 5G at the moment struggles to be a major shift in the RAN.  Part of the reason for this gap is the (usual) slow progress of the specifications, but another part is the fact that standards groups have a habit of grabbing the low apples, or focusing on the most visible questions.

5G RAN improvements are important, but operators I talk with have consistently said that their biggest priority was to standardize the metro and access models for wireless and wireline, and to support wireless 5G extensions of fiber networks.  Without these capabilities, many operators said that it would be difficult to justify 5G versus enhanced 4G.  Ironically, the early “5G trials” have all focused on RAN and on modest adjustments to 4G, like supporting 5G frequencies, to “prove out” the technology.  Some operators have been public in their rejection of this approach, but that’s what’s been happening.

One public approach to pre-standard 5G even retains the Evolved Packet Core, which most operators told me was something that they wanted (as a number-one or number-two priority) to eliminate.  Clearly the focus of many 5G proponents is to move the process ahead even if there’s less utility in what’s produced.  That also was a criticism that’s been made in public.

The next point is that we have not yet abandoned our short-sighted and stupid vision of IoT as being all about wireless connections.  There were plenty of examples of this, but two figured particularly in the overall stream of hype.  The first is a broadening of the notion that this is all about RF, which makes IoT all about connections.  The second is the almost hypnotic attraction to the “connected car” as the prototypical IoT application.

I’m almost tired of saying that getting devices connected is the least of our IoT worries, but it is.  The majority of IoT applications will almost certainly use devices that not only aren’t directly on the Internet at all, but don’t even use Internet-related technology for connections.  Home control today relies on technologies that aren’t related to Ethernet, IP, or the Internet.  Only the home controller is an Internet device, and this model of connectivity is likely to dominate for a long time to come.  If we insist that all our sensors and controllers be IP devices that are Internet-connected, we’re building a barrier to adoption that will take unnecessary years to jump.

The connected car is another potential trap.  Most of what a connected car will do is offer WiFi to consumer mobile devices that passengers and drivers (the latter, hopefully, not while moving) are using in the vehicle.  Yes, there are other features, but the value proposition is really more like a moving WiFi hotspot than a real IoT mission.  There’s always pressure to pick something that’s actually happening and then broaden your definition of revolutionary technology to envelop it, justifying your hype.  That’s not helpful when there are real questions and issues that are not addressed by the billboard-technology example, but will have to be addressed for the market to develop.

The first positive point from the show is that both network operators and equipment vendors realize that mobile broadband personalization is the only relevant demand driver.  Wireline broadband for both consumers and businesses is really just a matter of wringing as much profit as possible out of something that’s already marginal at best.  If there is new revenue to be had for operators, that revenue is going to come from the exploitation of mobile broadband in both enterprise and consumer markets.

There’s a sad side even to this happiness, though.  For all that the explosion of interest in MWC demonstrates the victory of mobile broadband, and that many who exhibit and probably even more who attend MWC are there for things not directly related to cellular networks, we’re still missing a lot of the key points that justify the mobile focus.

A mobile device is a direct extension of the user, a kind of technological third leg or second head.  It brings the knowledge and entertainment base of the Internet and the power of cloud computing right into the hands of everybody.  The best way to look at IT evolution since the ‘50s is that each new wave brought processing closer to people.  Mobile broadband fuses the two.

Also in my view a positive was the talk from FCC Chairman Ajit Pai, where he said what shouldn’t really have surprised anyone—that the FCC planned a “lighter touch” under the new administration.  The FCC had already taken steps that indicated it would retreat from the very activist position taken by the body under the previous Chairman (Wheeler), but Pai voted against the neutrality ruling and his comments at MWC suggest he has specific moves in mind.  Reinforcing the “lighter touch” was the comment (referencing neutrality) that “It has become evident that the FCC made a mistake.  Our new approach injected tremendous uncertainty into the broadband market. And uncertainty is the enemy of growth.”

Net neutrality is important, insofar as it protects OTT competitors from operators cutting favorable deals with their own subsidiaries.  The current rules, though, were not enough to prevent AT&T from offering outside-data-plan video to its TV customers.  On the other hand, the extension of the rules that Wheeler promoted has made the relationship between subsidiaries and ISPs confusing to say the least, and it’s probably limited the willingness of operators to pursue initiatives that would have promoted broadband infrastructure investment.

I have to agree with Pai here.  I think that the FCC in the last term overstepped simple neutrality goals and took a stand on the broadband business that favored one party—the OTTs—over the other, to a degree the FCC had never done before.  A dynamic broadband market—the kind that MWC and 5G propose to support—demands a symbiosis and not an artificial financial boundary.  Through almost my whole consulting career I’ve supported the notion of Internet settlement, and I still support it.  I think it’s time to take some careful, guarded, steps toward trying it out.

How Could We Accelerate the Pace of New Edge-Deployed Data Centers?

There should be no question that I am a big fan of edge computing, and so I’m happy that Equinix is pushing storage to the edge (according to one story yesterday) and that Vapor IO supports micro-data-centers at the wireless edge.  I just wish we had more demand focus to explain our interest in the supply.  There are plenty of things happening that might drive a legitimate migration of hosting/storage to the network edge, and I can’t help but feel we’d do a better job with deployment if there were a specific business case behind the talk.

Carrier cloud is the definitive network-distributed IT model, and one of the most significant questions for server vendors who have aspirations there is just how the early carrier cloud data centers will be distributed.  A central model of hosting, even a metro-central one, would tend to delay the deployment of new servers.  An edge-centric hosting model would build a lot of places where servers could be added, and encourage operators to quickly support any missions where edge hosting offered a benefit.  So, in the balance, where are we with this?

Where you host something is a balance between economy of scale, economy of transmission, and propagation issues.  Everyone knows that a pool of servers offers lower unit cost than single servers do, and most also know that the cost per bit of transmission tends to fall to a point, then rise again as you pass the current level of economical fiber transport.  Most everyone knows that traversing a network introduces both latency (delay) and packet loss that grows with the distance involved.  The question is how these things combine to impact specific features or applications.
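As a purely illustrative toy model (the coefficients are invented, not drawn from my market model), here’s how those three factors might combine into a single placement cost index.  The only point is the shape of the tradeoff: compute economics favor depth, latency favors the edge, and traffic volume swings the balance.

```python
# Toy placement-cost index: compute cost falls with pool size (scale economy),
# while transmission cost and a latency penalty rise with distance from the
# user.  The coefficients are invented purely to show the tradeoff.

def cost_index(pool_size: int, km_from_user: float, traffic_gb: float) -> float:
    compute = 100.0 / (pool_size ** 0.5)          # crude unit-cost scale curve
    transmission = 0.02 * km_from_user * traffic_gb
    latency_penalty = 0.05 * km_from_user         # stand-in for QoE loss
    return compute + transmission + latency_penalty

sites = {"edge": (20, 5.0), "metro": (200, 50.0), "regional": (2000, 500.0)}
for name, (pool, km) in sites.items():
    print(f"{name:9s} cost index: {cost_index(pool, km, traffic_gb=10.0):7.1f}")
```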

Latency is an exercise in physics: the speed of light and electrons, plus the delay introduced by queuing and handling in network devices.  The only way to reduce it is to shorten the path, which means pushing stuff to the edge.  Arguably, the only reason to edge-host something is latency (though we’ll explore that point later), and most applications run today aren’t enormously latency-sensitive.  Telemetry and control applications, which involve the handling of an event and the sending of a response, are often critically sensitive to latency in M2M applications.

That means that IoT is probably the obvious reason to think about edge-hosting something.  The example of self-driving cars is trite here, but effective.  You can imagine what would happen if a vehicle was controlled by something half-a-continent away.  You can easily get a half-second control loop, and at highway speed (65 mph is just over 95 feet per second) a half-second means almost fifty feet of travel.

Responses to human queries, particularly voice-driven personal assistants, are also delay sensitive.  I saw a test run a couple years ago that demonstrated that people got frustrated if their queries were delayed more than about two seconds, resulting in their repeating the question and creating a context disconnect with the system.  Since you have to factor in actual “think time” to a response, a short control loop would be helpful here, but you can accommodate longer delay by having your assistant say “Working on that….”

Content delivery in any form is an example of an application where latency per se isn’t a huge issue, but it raises another important point—resource consumption or “economy of transmission”.  If you were to serve (as a decades-old commercial suggested you could) all the movies ever made from a single point, the problem you’d hit is that multiple views of the same movie would quickly explode demands for capacity.  You also expose the stream to extreme variability in network performance and packet loss, which can destroy QoE.  Caching in content delivery networks is a response to both these factors, and CDNs represent the most common example of “edge hosting” we see today.

Let’s look at the reason we have CDNs to explore the broader question of edge-hosting economies versus more centralized hosting.  Most user video viewing hits a relatively contained universe of titles, for a variety of reasons.  The cost of storing these titles in multiple places close to the network edge, thus reducing network resource consumption and the risk of performance issues, is minimal.  What makes it so is the fact that so many content views hit a small universe of content.  If we imagine for a moment that every user watched their own unique movie, you’d see that content caching would quickly become unwieldy.  Unique resources, then, are better candidates for “deep hosting” if all other traffic and scale economies are equal.
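A toy calculation, with invented catalog sizes and viewing splits, shows the effect: cache the 100 most popular of 10,000 titles, and a concentrated audience is served largely from the edge, while a fully unique audience gets essentially nothing from the same cache.

```python
# Invented numbers: cache the top 100 of 10,000 titles and compare hit rates
# for a concentrated audience versus one where every view is unique.
import random

def hit_rate(views, cached_titles):
    return sum(1 for v in views if v in cached_titles) / len(views)

random.seed(42)
cached = set(range(100))                     # the 100 most popular titles

concentrated = [random.randrange(100) if random.random() < 0.8
                else random.randrange(10_000) for _ in range(100_000)]
unique = random.sample(range(10_000_000), 100_000)   # everyone watches something different

print(f"concentrated viewing: {hit_rate(concentrated, cached):.0%} edge cache hits")
print(f"unique viewing:       {hit_rate(unique, cached):.0%} edge cache hits")
```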

That brings us to scale.  I’ve mentioned in many blogs that economies of scale don’t follow an exponential or even linear curve, but an Erlang C curve.  That means that when you get to a certain size data center, further efficiency gains from additional servers are minimal.  For an average collection of applications I modeled for a client, you reached 95% optimality at about 800 servers, and there are conditions under which less than half that would achieve 90% efficiency.  That means that supersized cloud data centers aren’t necessary.  Given that, my models have always said that by around 2023, operators would have reached the point where there was little benefit to augmenting centralized data centers, and would move to edge hosting.  The biggest growth in new data centers occurs in the model between 2030 and 2035, where the number literally doubles.  If I were a vendor, I would want to accelerate that shift to the edge.
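For anyone who wants to check the shape of that curve, here’s the standard Erlang B/C calculation in a few lines of Python.  The traffic assumptions are illustrative rather than the numbers from my model, but the diminishing-returns behavior is the point.

```python
# Erlang B via the standard recurrence (numerically stable even at hundreds of
# servers), then Erlang C for the probability that work has to queue.

def erlang_b(servers: int, offered_load: float) -> float:
    b = 1.0
    for n in range(1, servers + 1):
        b = (offered_load * b) / (n + offered_load * b)
    return b

def erlang_c(servers: int, offered_load: float) -> float:
    # Probability an arriving request finds all servers busy and must wait.
    b = erlang_b(servers, offered_load)
    rho = offered_load / servers
    return b / (1.0 - rho + rho * b)

# Hold utilization at 90% and grow the pool: the big wins come early, and past
# several hundred servers, further scale buys very little.
for n in (50, 100, 200, 400, 800, 1600):
    print(f"{n:5d} servers: P(wait) = {erlang_c(n, 0.9 * n):.4f}")
```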

Centralization of resources is necessary for unique resources.  Edge hosting is necessary where short control loops are essential to application performance.  If you model the processes, you find that up to about 2020, carrier cloud is driven more by caching and video/ad considerations than anything else, and that tends to encourage a migration of processing toward the edge.  From 2020 to about 2023, enhanced mobile service features begin to introduce more data center applications that are naturally central or metro-scoped, and beyond 2023 you have things like IoT that magnify the need for edge hosting again.

Video, then, is seeding the initial data center edge locations for operators.  Metro-focused applications will probably use a mixture of space in these existing edge hosting points and new and more metro-central resources.  The natural explosion in the number of data centers will occur when the newer short-control-loop stuff emerges, perhaps four to five years from now.  It would be hard to advance something like this; the market change is pretty profound.

Presuming this is all true, then current emphasis on caching of data is smart and edge hosting of processing features may be premature.  What could accelerate the need for edge hosting?  This is where something like NFV could be a game-changer, providing a mission for edge-centric hosting before broad-market changes in M2M and IoT emerge and building many more early data centers.  If you were to double the influence of NFV, for example, in the period up to 2020, you would add an additional thousand edge data centers worldwide.

NFV will never drive carrier cloud, but what it could do is to promote edge-placement of many more data centers between now and 2020, shifting the balance of function distribution in the 2020-2023 period toward the edge, simply because the resources are there.  That could accelerate the number of hosting points (and slightly increase the number of servers) through 2025, and it would be a big windfall for vendors.

IT vendors looking at the carrier cloud market should examine the question of how this early NFV success could be motivated by specific benefits, and what specific steps in standardization, operationalization, or whatever else might be essential in supporting that motivation.  There are few applications that could add as much to the data center count, realistically, in the next three years.

Are You in the Mood for Indigo? AT&T’s New Concept Could Change Your Mind!

When you have an architecture that set the standard for NFV, what do you do for an encore?  AT&T’s answer to that question is “Network 3.0 Indigo”, or, for short, just “Indigo”.  It’s another of those huge concepts that’s difficult to describe or to understand, and its sheer scope is certain to create healthy skepticism about whether AT&T can meet the goals.  Whatever happens in realization, though, Indigo is profoundly important because it frames operators’ views of the future exceptionally well.

Operators have consistently been telling me that their biggest problem with technology initiatives, from SDN and NFV to 5G, is that they seem to be presented as justified for their own sake.  What operators need is a business goal that can be met, an opportunity that can be addressed; in the complex world of networking, most proposed technologies lack the critical property of scope.  They just don’t do the job by themselves, which is why integration is becoming such an issue.  AT&T advanced NFV with ECOMP by incorporating more into it, and they hope to do even more with Indigo.

Let’s start with a quote from AT&T’s Indigo vision statement: “The network of the future will be more than just another “G”, moving from 2G to 3G to 4G and beyond.  It’s about bundling all the network services and capabilities into a constantly evolving and improving platform powered by data. This is about bringing software defined networking and its orchestration capabilities together with big data and an emerging technology called microservices, where small, discrete, reusable capabilities can team up as needed to perform a task. And, yes, it’s about so-called ‘access’ technologies like 5G and our recently-announced Project AirGig. Put all that together, and you have a new way to think about the network.”

Feel better, more educated?  Most people who read the above statement don’t, so don’t feel inadequate.  In simple terms, what Indigo is about is creating agility and efficiency, which you’ll probably recognize as the two paramount (credible) NFV goals.  AT&T is making an important statement here, even if it’s not easy to parse.  The network isn’t going to evolve as a series of disconnected technical shifts, but as a result of serving a clear set of business requirements.  Given that, it makes no sense to keep on talking about “SDN” or “NFV” or “5G” as though they were the only games in town.  There has to be a holistic vision, which is why the quote above ends with the statement that Indigo is “a new way to think about the network.”  It’s about creating something that becomes what’s needed.

Faster access, which is pretty much all anyone thinks about these days when they hear about telecom changes, is rapidly reaching a point where further gains in performance will be difficult to notice.  I’ve said many times that most users could not actually exploit even 25 Mbps; you need multiple people sharing a connection to actually use that much.  AT&T correctly points out that at the point where “more bits equals superior service” becomes blasé, it’s the overall experience that counts.  Indigo is therefore an experience-based network model.

But, you might rightfully ask, what the heck is it technically?  The kind of detailed Indigo information that we all might like isn’t available, but it’s possible to interpret the high-level data AT&T has provided to gather some useful insight into their approach.  As you might expect from the notion of “experience-based” network services, Indigo steps out beyond connections, to an intermediary position that AT&T calls a “Data-Powered Community”.  Inside this new artifact are the usual access network options and the now-common commitment to SDN, but there’s also identity management, AI, a data platform that in my view will emerge as the framework for AT&T’s IoT model, and the software orchestration and management tools that tie all this together.

From what I can see, the key technology concept in Indigo is the breaking down of monolithic software structures and service structures into microservices, which are then orchestrated (presumably using ECOMP).  Just as ECOMP can deploy an NFV-based service, it could deploy a function-based application.  Want an operations tool?  Compose it from microservices.  Want to sell a cloud service?  Compose it.  A Community in Indigo is an ad-hoc composition of functional and connection elements.
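For a feel of what “compose it” might mean mechanically, here’s a deliberately simplified sketch.  The catalog, names, and descriptor format are all my invention, not AT&T’s actual Indigo or ECOMP model; the point is just that a service becomes a named selection of catalog features that an orchestrator then deploys and connects.

```python
# A deliberately simplified picture of "compose it": a service is a named set
# of microservice features drawn from a catalog, which an orchestrator would
# then deploy and connect.

CATALOG = {
    "firewall":    {"image": "fw:1.2",  "ports": [443]},
    "nat":         {"image": "nat:2.0", "ports": []},
    "video-cache": {"image": "cdn:3.1", "ports": [80, 443]},
    "analytics":   {"image": "ana:0.9", "ports": [9000]},
}

def compose(service_name, features):
    missing = [f for f in features if f not in CATALOG]
    if missing:
        raise KeyError(f"not in catalog: {missing}")
    return {"service": service_name,
            "components": {f: CATALOG[f] for f in features}}

print(compose("branch-vcpe", ["firewall", "nat"]))
print(compose("community-video", ["video-cache", "analytics"]))
```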

The Communities Indigo defines are the frameworks that house the customer experiences users value.  That means that traditional networking ends up merging more with network-related features like agile bandwidth and connectivity, but also with cloud computing and applications.  I think Indigo is a promise that, to AT&T, a virtual function and a cloud application will be two faces of the same coin, and that services will use both of these feature packages to add value for users and revenue for AT&T.

One important feature of Indigo is the ability to support services whose pieces are drawn from a variety of sources.  “Federation” isn’t just a matter of interworking connectivity services, it’s a full-blown trust management process that lets third-party partners create elements of services and publish them for composition.  This doesn’t mean that AT&T won’t offer their own advanced service features, but that they expect to have to augment what they can build by incorporating useful stuff from outside.

If you look at the use cases for Indigo that AT&T has already presented, you don’t see more than a hint of what I’m describing.  There are four such use cases, and most of them are pretty pedestrian.  What’s really needed is a broader and clearer picture of this federation approach, and in particular examples of how it might be integrated with IoT services.  If there’s a giant revenue pie that AT&T needs to bite into, IoT will likely create it.  Given this, and given that AT&T cites IoT trends twice in its lead-in to justifying Indigo, it’s surprising that they don’t offer any IoT-specific or even IoT-related use cases.  In fact, beyond the two justifying mentions, IoT doesn’t appear in the rest of the AT&T technical document on Indigo.

Which, frankly, is my big concern about Indigo.  Yes, all the framing points AT&T makes about the evolution of services and service opportunity are true.  Yes, a framework that envelopes both connectivity and the experiences users want to be connected with is where we’re heading.  And, yes, it’s true that IoT services are still off in the future.  However, they are the big focus of opportunity and Indigo will stand or fail based on whether it supports IoT-related services well.  It’s IoT that offers AT&T and other operators an application so big that most competitors (including OTTs) will be afraid to capitalize it.  They can own IoT, if they really can frame it in Indigo terms.

Indigo’s greatest near-term contribution may well be its impact on ECOMP.  Universal orchestration and software decomposition to microservices would mean a significant enhancement to the ECOMP model of defining services and managing their lifecycle.  A broader goal for orchestration is critical for NFV’s success because the scope needed to deliver the business case is larger than the bite the NFV ISG has taken of the issues.  Indigo is big, which is a risk, but here, bigness could be a precursor to greatness.

How Can We Get to Modular Infrastructure for the Carrier Cloud?

In a blog last week, I mentioned the notion of an NFV infrastructure concept that I called a “modular infrastructure model”.  It was a response to operators’ comments that they were most concerned about the possibility that the largest dollar investment in their future—carrier cloud—would end up creating a silo.  They know how to avoid network equipment silos and vendor lock-in, but in the cloud?  It breaks, or demands breaking, new ground, and in several important ways.

Cloud infrastructure is a series of layers.  The bottom layer, the physical server resources and associated “real” data center switching, is fairly standardized.  We have different features, yes, but everyone is fairly confident that they could size resource pools based on the mapping of function requirements to the hardware features of servers.  The problems lie in the layers above.

A cloud server has a hypervisor, and a “cloud stack” such as OpenStack.  It has middleware associated with the platform and other middleware requirements associated with the machine image of the functions/features you’re deploying.  There are virtual switches to be parameterized, and probably virtual database features as well.  The platform features are often tuned to specific application requirements, which means that operators might make different choices for different areas, or in different geographies.  Yet differences in the platform can be hard to resolve into efficient operations practices.  I can recall, back in the CloudNFV project for which I served as the chief strategist, that it took about two weeks just to get a runnable configuration for the platform software, onto which we could deploy virtual functions.

Operators are concerned that while this delay isn’t the big problem with NFV, it could introduce something that could become a big problem.  If Vendor X offers a package of platform software for the carrier cloud that’s capable of hosting VNFs, then operators would be inclined to pick that just to avoid the integration issues.  That could be the start of vendor lock-in, some fear.

It appears that the NFV standards people were aware of this risk, and they followed a fairly well accepted path toward resolving it.  In the NFV E2E architecture, the infrastructure is abstracted through the mechanism of a Virtual Infrastructure Manager that rests on top of the infrastructure and presents it in a standard way to the management and orchestration (MANO) elements.

Problem solved?  Not quite.  From the first, there was a question of how the VIM worked.  Most people involved in the specs seemed to think that there was one VIM, and that this VIM would resolve differences in infrastructure below it.  This approach is similar to the one taken by SDN, and in particular by Open Daylight, and it follows the model of OpenStack’s own networking model, Neutron.  However, anyone who has followed Neutron or ODL knows that even for nothing more than connectivity, it’s not easy to build a universal abstraction like that.  Given that, a vendor who had a super-VIM might use it to lock out competitors by simply dragging out supporting them.  Lock-in again.

An easy (or at least in theory, easy) solution to this problem is one I noted just a couple of blogs back—you support multiple VIMs.  That way, a vendor could supply a “pod”, my modular infrastructure model, represented by its own VIM.  Now, nobody can use VIMs to lock their own solutions in.

As usual, it turns out to be a bit more complicated.  The big problem is that if you have a dozen different VIMs representing different resource pods, how do you know which one (or ones) to send requests to, and how do you parcel out the requests for large-scale deployment or change among them?  You don’t want to author specific VIM references into service models because that would make the models “brittle”, subject to change if any changes in infrastructure were made.  In fact, it could make it difficult to do scaling or failover, if you had to reference a random VIM that wasn’t in the service to start with.

There are probably a number of ways of dealing with this, but the one I’ve always liked and proposed for both ExperiaSphere and CloudNFV was the notion of a resource-side and service-side model, similar to the TMF’s Customer-Facing and Resource-Facing Services.  With this model, every VIM would assert a standard set of features (“Behaviors” in my terminology), and if you needed to DeployVNF, for example, you could use that feature with any VIM that represented hosting.  VIM selection would then be needed only to accommodate differences in resource types across a service geography, and it could be done “below” the service model.  Every VIM would be responsible for meeting the standard Behaviors of its class, which might mean all Behaviors if it was a self-contained VNF hosting pod.

All this is workable, but not sufficient.  We still have to address the question of management, or lifecycle management to be specific.  Every event that happens during a service lifecycle that impacts resource commitments or SLAs has to be reflected in an associated set of remedial steps, or you don’t have service automation and you can’t improve opex.  These steps could easily become very specific if they are linked to VNF processes—which the current ETSI specifications propose to do by having VNF Management (VNFM) at least partially embedded in the deployed VNFs.  If there is, or has to be, tight coupling between resources and VNFM, then you have resource-specific management and a back door into the world of silos at best and vendor lock-in at worst.

There are, in theory, ways to provide generalized management tools and interfaces between the resource and service sides of an operator.  I’ve worked through some of them, but I think that in the long pull most will fail to accommodate the diverse needs of future services and the scope of service use.  That means that what will be needed is asynchronous management of services and resources.  Simply put, “Behaviors” are “resource-layer services” that like all services offer an SLA.  There is a set of management processes that work to meet that SLA, and those processes are opaque to the service side.  You know the SLA is met, or has failed to be met, or is being remediated, and that’s that.

So what does/should a VIM expose in its “Behaviors” API?  All network elements can be represented as a black box of features that have a set of connections.  Each of the features and connections has a range of conditions it can commit to, the boundaries of its SLA.  When we deploy something with a VIM, we should be asking for such a box and connection set, and providing an SLA for each of the elements for which one is selectable.  Infrastructure abstraction, in short, has to be based on a set of classes of behavior to which infrastructure will commit, regardless of exactly how it’s constituted.  That’s vendor independence, silo-independence.
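Here’s a skeletal Python illustration of the Behaviors idea.  The class names, SLA fields, and selection logic are my assumptions, not anything from the ETSI specs, but they show the essential property: the service side binds to a feature class and an SLA, never to a vendor.

```python
# Skeleton of the "Behaviors" notion: a VIM asserts a standard feature class
# with an SLA range, and the service side binds to the class, not the vendor.

from abc import ABC, abstractmethod

class HostingBehavior(ABC):
    # The black box: features plus the SLA bounds it can commit to.
    @abstractmethod
    def sla_bounds(self) -> dict: ...
    @abstractmethod
    def deploy_vnf(self, vnf_image: str, sla: dict) -> str: ...

class VendorXPodVim(HostingBehavior):
    def sla_bounds(self) -> dict:
        return {"max_latency_ms": 20, "availability": 0.9999}

    def deploy_vnf(self, vnf_image: str, sla: dict) -> str:
        if sla.get("max_latency_ms", float("inf")) < self.sla_bounds()["max_latency_ms"]:
            raise ValueError("this pod cannot commit to that latency")
        return f"{vnf_image} deployed on VendorX pod"

def deploy_anywhere(vims, vnf_image: str, sla: dict) -> str:
    # Service side: any VIM asserting the hosting Behavior and meeting the SLA will do.
    for vim in vims:
        try:
            return vim.deploy_vnf(vnf_image, sla)
        except ValueError:
            continue
    raise RuntimeError("no VIM can meet the requested SLA")

print(deploy_anywhere([VendorXPodVim()], "firewall:1.2", {"max_latency_ms": 25}))
```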

I’m more convinced every day that the key to efficient carrier cloud is some specific notion of infrastructure abstraction, whether we’re talking about NFV or not.  In fact, it’s the diversity of carrier cloud drivers and the fact that nothing really dominates the field in the near term, that makes the abstraction notion so critical.  We have to presume that over time the role of carrier cloud will expand and shift as opportunity focus changes for the operators.  If we don’t have a solid way of making the cloud a general resource, we risk wasting the opportunity that early deployment could offer the network operators.  That means risking their role in the cloud, and in future experience-based services.

Vendors have a challenge here too, because the fear of silos and vendor lock-in is already changing buyer behavior.  In NFV, the early leaders in technology terms were all too slow to recognize that NFV wasn’t a matter of filling in tick marks on an RFP, but making an integrated business case.  As a result, they let the market idle along and failed to gain traction for their own solutions at a time when being able to drive deployment with a business case could have been decisive.  Now we see open-source software, commodity hardware, and anti-lock-in-and-silo technology measures taking hold.

It’s difficult to say how much operator concerns over silos and lock-in are impacting server sales.  Dell, a major player, is private and doesn’t report their results.  HPE just reported its quarterly numbers, which were off enough to generate Street concern.  HPE also said “We saw a significantly lower demand from one customer and major tier 1 service provider facing a very competitive environment.”  That is direct evidence that operators are constraining even server purchases, and it could be an indicator that the fear of silos and lock-in is creating a problem for vendors even now.

Ericsson’s wins at Telefonica and Verizon may also be an indicator.  Where there’s no vendor solution to the problems of making a business case or integrating pieces of technology, integrators step in.  There seems to be a trend developing that favors Ericsson in that role, largely because they’re seen as a “fair broker” having little of their own gear in the game.

It’s my view that server vendors are underestimating the impact of operator concerns that either early server deployments won’t add up to an efficient carrier cloud, or will lock them into a single supplier.  It wouldn’t take a lot of effort to create a “modular infrastructure model” for the carrier cloud.  Because its importance lies mostly in its ability to protect operators during a period where no major driver for deployment has emerged, developing a spec for the model doesn’t fall cleanly into NFV or 5G or whatever.  Vendors need to make sure it’s not swept under the rug, or they face a further delay in realizing their sales targets to network operators.

Despite some of the Street comments and some media views on the HPE problem, the cloud is not suppressing server spending.  Every enterprise survey I’ve done in the last five years shows that cloud computing has not had any impact on enterprise server deployment.  If anything, the cloud and web-related businesses are the biggest source of new opportunity.  Even today, about a third of all servers sold are to web/cloud players and network operators.  My model has consistently said that carrier cloud could add a hundred thousand new data centers by 2030.

If the cloud is raining on the server market, more rain is needed.  It is very likely that network-related server applications represent the only significant new market opportunity in the server space, in which case anything that limits its growth will have a serious impact on server vendors.  The time to fix the problem here is short, and there’s also the threat of open hardware models lurking in the wings.  Perhaps there needs to be a new industry “Call for Action” here.

Just How Much of a Problem is VNF Onboarding and Integration?

We had a couple of NFV announcements this week that mention onboarding or integration.  Ericsson won a deal with Verizon that includes their providing services to integrate VNFs, and HPE announced an “onboarding factory” service that has the same basic goal.  The announcements raise two questions.  First, does this move the ball significantly with respect to NFV adoption?  Second, why is NFV, based as it is on standard interfaces, demanding custom integration to onboard VNFs?  Both are good questions.

Operators do believe that there’s a problem with VNF onboarding, and in fact with NFV integration overall.  Most operators say that integration difficulties are much worse than expected, and nearly all of them put difficulties in the “worse to much worse” category.  But does an integration service or factory change things radically enough to change the rate of NFV adoption significantly?  There, operators are divided, based of course on just how much VNF onboarding and integration they expect to do.

The majority of VNFs today are being considered in virtual CPE (vCPE) service-chaining business service applications, largely targeting branch office locations connected with carrier Ethernet services.  Operators are concerned with onboarding/integration issues because they encounter business users who like one flavor of VNF or another, and see offering a broad choice of VNFs as a pathway to exploding costs to certify all the candidates.

The thing is, many operators don’t even have this kind of prospect, and most operators get far less than 20% of their revenue from business user candidates for vCPE overall.  I’ve talked with some of the early adopters of vCPE, and they tell me that while there’s a lot of interest in having a broad range of available VNFs, the fact is that for any given category of VNF (firewall, etc.) there are probably no more than three candidates with enough support to justify including them in a vCPE function market list.

The “best” applications for NFV, meaning those that would result in the largest dollar value of services and of infrastructure purchasing, are related to multi-tenant stuff like IoT, CDN, and mobility.  All but IoT among this group tend to involve a small number of VNFs that are likely provided by a single source and are unlikely to change or be changed by the service customer.  You don’t pick your own IMS just because you have a mobile phone.  That being the case, it’s unlikely that one of the massive drivers of NFV change would really be stalled out on integration.

The biggest problem operators say they have with the familiar vCPE VNFs isn’t integration, but pricing, or perhaps the pricing model.  Most VNF providers say they want to offer their products on a usage price basis.  Operators don’t like usage prices because they feel they should be able to buy unlimited rights to the VNF at some point.  Some think that as usage increases, unit license costs should fall.  Other operators think that testing the waters with a new VNF should mean low first-tier prices that gradually rise when it’s clear they can make a business case.  In short, nothing would satisfy all the operators except free VNFs, which clearly won’t make VNF vendors happy.

Operators also tell me they’re more concerned about onboarding platform software and server or network equipment than VNFs.  Operators have demanded open network hardware interfaces for ages, as a means of preventing vendor lock-in.  AT&T’s Domain 2.0 model was designed to limit vendor influence by keeping vendors confined to a limited number of product zones.   What operators would like to see is a kind of modular infrastructure model where a vendor contributes a hosting and/or network connection environment that’s front-ended by a Virtual Infrastructure Manager (VIM) and that has the proper management connections to service lifecycle processes.

We don’t have one of these, in large part, because we still don’t have a conclusive model for either VIMs or management.  One fundamental question about VIMs is how many there could be.  If a single VIM is required, then that single VIM has to support all the models of hosting and connectivity needed, which is simply not realistic at this point.  If multiple VIMs are allowed, then you need to be able to model services so that the process of decomposition/orchestration can divide up the service elements among the infrastructure components each VIM represents.  Remember, we don’t have a solid service modeling approach yet.
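As a miniature of the multi-VIM decomposition problem, the sketch below (all names invented) maps each service element to a VIM through a registry keyed by behavior and region, so the service model itself never hard-codes a VIM reference.

```python
# The multi-VIM decomposition problem in miniature: a registry keyed by
# (behavior, region) maps each service element to a VIM.

VIM_REGISTRY = {
    ("hosting", "us-east"): "vim-vendorA-east",
    ("hosting", "eu-west"): "vim-vendorB-eu",
    ("vpn",     "us-east"): "vim-sdn-controller-1",
}

def decompose(service_elements):
    plan = {}
    for element in service_elements:
        key = (element["behavior"], element["region"])
        vim = VIM_REGISTRY.get(key)
        if vim is None:
            raise LookupError(f"no VIM offers {key}")
        plan[element["name"]] = vim
    return plan

service_model = [
    {"name": "vFirewall", "behavior": "hosting", "region": "us-east"},
    {"name": "siteVPN",   "behavior": "vpn",     "region": "us-east"},
]
print(decompose(service_model))
```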

The management side is even more complicated.  Today we have the notion of a VNF Manager (VNFM) that has a piece living within each VNF and another that’s shared for the infrastructure as a whole.  The relationship between these pieces and underlying resources isn’t clear, and it’s also not clear how you could provide a direct connection between a piece of a specific service (a VNF) and the control interfaces of shared infrastructure.

This gets to the second question I noted in my opening.  Why is this so much trouble?  Answer: Because we didn’t think it out fully before we got committed to a specific approach.  It’s very hard to go back and redo past thinking (though the NFV ISG seems to be willing to do that now), and it’s also time-consuming.  It’s unfair to vendors to do this kind of about-face as well, and their inertia adds delay to a process that’s not noted for being a fast-mover as it is.  The net result is that we’re not going to fix the fundamental architecture to make integration and onboarding logical and easy, not any time soon.

That may be the most convincing answer to the question of the relevance of integration.  If we could assume that the current largely-vCPE integration and onboarding initiatives were going to lead us to something broadly useful and wonderfully efficient, then these steps could eventually be valuable.  But they still don’t specifically address the big issue of the business case, an issue that demands a better model for the architecture in general, and management in particular.

I understand what vendors and operators are doing and thinking.  They’re taking baby steps because they can’t take giant strides.  But either baby steps or giant strides are more dangerous than useful if they lead to a cliff, and we need to do more in framing the real architecture of virtualization for network functions before we get too committed on the specific mechanisms needed to get to a virtual future.

Cloud Computing Service Success for Operators: Can It Be Done?

Operators have been fascinated by the public cloud opportunity, and new initiatives like that of Orange Business Services suggest that the fascination may finally be gaining some traction in the real world.  But Verizon at the same time is clearly pulling back from its public cloud computing commitments.  What’s really going on with operator public cloud services?

In a prior blog, I noted that operators had initially seen cloud computing services as an almost-natural fit.  These services require a major investment, and they offer a relatively small return, which fits the public-utility model most operators still adhere to.  Publicity on cloud computing suggested an oncoming wave of adoption that could carry everyone to riches, a trillion-dollar windfall.  It didn’t happen, of course, and after the blog I cited, I heard more from operator planners who were eager to offer some insight into their own situations.

All of those who contacted me agreed that the biggest problem they faced with cloud computing services was the unexpected difficulty in selling the services.  Operators are good at stuff that is marketed rather than sold, where publicity stimulates buyers to make contact and thus identify themselves.  They’re also often at least passable at dealing one-on-one with large prospective buyers, like the big enterprises.  They’re not good at pounding the pavement doing prospecting, and that seemed to be what cloud computing was really about.

One insight that operators offered on this point was that their initial target for cloud computing was the large-enterprise CIO, the same person who is instrumental in big telecom service buys.  They found that for the most part enterprise public cloud adoption was driven by line-department interest in “shadow IT”, and that the CIO was as likely (or more likely) to oppose the cloud move as to support it.  Certainly the CIO was not usually the central figure in making the deal.  That meant that operators would have to reach out to line departments, and that broke the sales model.

The second problem operators reported was the complexity of the cloud business case.  Operators believed rosy predictions of major savings, but while there might indeed be strong financial reasons to move at least some applications to the cloud, those reasons were difficult to quantify.  Often there had to be a formal study, which the operator either had to do itself or had to convince the prospective buyer to do.  Several operators said they went through several iterations of this, and never came up with numbers the CFO would accept.

The final issue was security and governance.  Operators said that there were always people (often part of the IT organization) who asked specific questions about cloud security and governance, and those questions were very difficult to answer without (you guessed it!) another study.  This combined with the other issues to lengthen the selling cycle to the point where it was difficult to induce salespeople to stay the course.

If you contrast Orange and Verizon, you can see these factors operating.  In both cases, the operators were looking at headquarters sales targets.  Verizon’s territory includes the largest number of corporate headquarters of any Tier One, and so it seemed to them they should have the best chance of doing a deal with the right people.  Orange seems to be proving that’s true only to a point; you can present the value proposition to headquarters, but it still has to be related to a compelling problem the buyer already accepts.  Multinationals, the Orange sales target, have a demonstrable problem in providing IT support in all their operating geographies.  The cloud is a better solution than trying to build and staff data centers everywhere in the world.

The question, of course, is whether the opportunity will be worth Orange’s building those data centers.  In effect, their approach is almost like a shared hosting plan; if a bunch of multinationals who need a data center in Country X combine to share one (Orange’s cloud service), the single data center would be a lot more cost-effective than the sum of the companies’ independent ones.  If there are enough takers for Orange’s services, then it works.  Obviously, one customer in every data center would put Orange in the “inefficient and unwise” category of data center deployment.  We can’t say at this point how well it will go for them.
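
A back-of-envelope sketch, with purely hypothetical numbers, shows the shape of that trade-off:

```java
// Back-of-envelope sketch of shared-data-center economics; every number
// here is invented purely to illustrate the shape of the trade-off.
public class SharedDcSketch {
    public static void main(String[] args) {
        double dcAnnualCost = 10_000_000.0;      // hypothetical cost to build/run one DC
        double valuePerTenant = 1_500_000.0;     // hypothetical revenue per multinational
        for (int tenants = 1; tenants <= 10; tenants++) {
            double costPerTenant = dcAnnualCost / tenants;
            System.out.printf("%2d tenants: cost/tenant $%,.0f -> %s%n",
                    tenants, costPerTenant,
                    costPerTenant < valuePerTenant ? "works" : "inefficient and unwise");
        }
    }
}
```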

It does seem that the Orange model of exploiting headquarters relationships and specific problems/issues is the right path for operators looking to get into public cloud services.  This would have the best chance of working where there were a lot of headquarters locations to sell to, obviously, which means fairly “thick” business density.  However, as I said, Verizon had that and couldn’t make a go of things, so what made their situation different?

Probably in part competition, less the direct-to-the-wallet kind than the hearts-and-minds kind.  US companies know the cloud well from players like Microsoft and Amazon, and they perceive network operators as come-from-behind, next-to-amateur players.  Probably in part geography; in the EU the operators are held in higher strategic regard, and most of them have faced profit pressure for a longer time, so they’re further along in the cycle of transformation.

The real question is what cloud needs the operators could more broadly fill, and the answer to that may be hard to deal with if you’re an operator.  My model says that there are no such needs, that there is no single opportunity that could pull through carrier cloud computing service success.  The only thing that might eventually do it is IoT, but the situation changes down the line in any case.

As operators manage to deploy carrier cloud for other missions, they’ll achieve economy of scale and a coveted low-latency position at the edge.  Those factors will eventually make them preferred providers, and once they take hold then carrier cloud computing services will quickly gain acceptance.

The only problem with that story is that it’s going to take time.  Between 2019 and 2021 is the hot period according to the model, the time when there’s finally enough cloud infrastructure to make operators truly competitive.  Even that requires that they deploy cloud infrastructure in other short-term service missions, starting this year.  That may not happen, particularly if 5G standards take as much time to mature as NFV specifications have taken.

This could be a long slog, even for Orange.  My model says their own situation is right on the edge, and execution of both deployment and sales will have to be perfect or it won’t work and they’ll end up doing what Verizon has done.

The Road Ahead for the Networking Industry

Think network hardware is going great?  Look at Cisco’s results, and at Juniper’s decision to push even harder in security (which by the way is also a hot spot for Cisco’s quarter).  Look at the M&A in the vendor space.  Most of all, look at everyone’s loss of market share to Huawei.  USTelecom Daily Lead’s headline was “Network Hardware Woes Crimp Cisco Sales in Q2.”  SDxCentral said “Cisco’s Switching and Routing Revenue Dragged in Q2.”  Switching, routing, and data center were all off for Cisco and total revenue was down (again).  Do we need to set off fireworks to signal something here?

We clearly have a classic case of shrinking budgets for network buyers.  On the service provider side, the problem is that profit-per-bit shrinkage that I’ve talked about for years.  On the enterprise side, the problem is that new projects to improve productivity are half as likely to have a network budget increase component as they were a decade ago.  The Street likes to say this is due to SDN and NFV, but CFOs tell me that neither technology has had any measurable impact on spending on network gear.  Price pressure is the problem; you beat up your vendors for discounts and if that doesn’t work you go to Huawei.

None of this is a surprise, if you read my blog regularly.  Both the service provider and enterprise trends are at least five years old.  What is surprising, perhaps, is that so little has been done about the problem.  I remember telling Cisco strategists almost a decade ago that there was a clear problem with the normally cyclical pattern of productivity-driven IT spending increases.  I guess they didn’t believe me.

We are beyond the point now where some revolution in technology is going to save network spending.  In fact, all our revolutions are aimed at lowering it, and Cisco and its competitors are lucky that none of them are working very well—yet.

Equipment prices, according to my model, will fall another 12% before hitting a level where vendors won’t be willing or able to discount further.  That won’t be enough to stave off cost-per-bit pressure, so we can expect to see “transformation” steps being taken to further cut costs.  This is where vendors have a chance to get it right, or continue getting it wrong.

There is no way that adding security spending to offset reductions in network switch/router spending is going to work.  Yes, you could spend a bit more on security, but the gains we could see there can’t possibly offset that 12% price erosion, nor can they deter what’s coming afterward.  What has to happen is a fundamental change in networking that controls cost.  The question is only what that change can be, and there are only two choices—major efforts to reduce opex, or a major transformation of infrastructure to erode the role of L2/L3 completely.

Overall, operators spend perhaps 18% or 19% of every revenue dollar on capital equipment.  They’ll spend 28% of each dollar on “process opex”, meaning costs directly attributable to service operations and customer acquisition/retention, in 2017.  If we were to see that 12% reduction in capex, we’d end up with an improvement of only about two cents per revenue dollar.  Service automation alone could reduce process opex by twice that.  Further, by 2020 we’re going to increase process opex to about 31% of each revenue dollar, an increase larger than the credible capex reduction from price pressure could cover.  By that time, service automation could have lowered process opex to 23% of revenue.  That’s far more than squeezing the capex budget could do.
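
For those who like to check the arithmetic, here’s that model math restated as a small program; the percentages are the figures cited above, expressed as cents per revenue dollar:

```java
// The capex-versus-opex arithmetic above, restated as code.  The
// percentages are the blog's model figures, expressed here as cents
// per revenue dollar.
public class CostModelSketch {
    public static void main(String[] args) {
        double capex = 18.5;                    // ~18-19 cents of each revenue dollar
        double processOpex2020 = 31.0;          // projected process opex in 2020
        double automatedOpex2020 = 23.0;        // process opex with service automation

        double capexPriceCut = capex * 0.12;    // the 12% price erosion
        System.out.printf("Capex price pressure saves ~%.1f cents per dollar%n", capexPriceCut);

        double automationSavingsNow = 2 * capexPriceCut;   // "twice that"
        System.out.printf("Service automation saves ~%.1f cents today%n", automationSavingsNow);

        double automationSavings2020 = processOpex2020 - automatedOpex2020;
        System.out.printf("By 2020, automation saves ~%.1f cents per dollar%n", automationSavings2020);
    }
}
```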

SDN and NFV could help too, but the problem is that the inertia of current infrastructure limits the rate at which you could modernize.  The process opex savings achievable through SDN/NFV lag those of service automation without any infrastructure change by a bit over two years.  The first cost of the change would be incurred years in advance of meaningful benefits, which means that SDN/NFV alone cannot solve the problem without operators digging a very big cost hole that will take five years or more to fill with benefits.

The infrastructure-transformation alternative would push more spending down to the fiber level, build networks more with virtual-wire or tunnel technology at the infrastructure level, and shift to overlay (SD-WAN-like) technology for the service layer.  This approach, again according to my model, could cut capex by 38%, and in combination with service management automation, it could cut almost 25 cents of cost per dollar of revenue.  The problem is the time it would take to implement it, because operators would have to find a way to hybridize the new model with current infrastructure to avoid having a fork-lift-style write-down of installed equipment.  My model says that SD-WAN technology could facilitate a “soft” migration to a new infrastructure model, so the time needed to achieve considerable benefits could be as little as three years.

So, what can the network equipment vendors do here?  It doesn’t take an accountant to see that the service automation approach would be better for network equipment vendors because it wouldn’t require any real infrastructure change.  However, there are two issues with it.  First, the network equipment vendors have been singularly resistant to pushing this sort of thing, perhaps thinking it would play into the hands of the OSS/BSS types.  Second, it may be too late for the network vendors to jump on the approach, given that operators are already focused on lowering equipment spending by putting pressure on vendors (or switching to Huawei, where possible).

Some of the network equipment strategists see inaction as an affirmative step.  “We don’t need to do anything to drive service automation,” one said to me via email.  “Somebody is going to do it, and when they do it will take capex pressure off.”  Well, I guess that’s how the waffling got started.  Others tell me that they saw service automation emerging from SDN/NFV, which they didn’t want to support for obvious reasons.

The potential pitfall of the inaction approach is that a competitor might be the one who steps up and takes the action instead of you.  Cisco can afford to have Amdocs or perhaps even Oracle or HPE become a leader in service automation, but they can’t let Nokia or (gasp!) Huawei do that.  If a network vendor developed a strong service automation story they could stomp on the competition.

Worse, an IT vendor could stomp on all the network vendors by developing a story that combines service automation with the push-down-and-SD-WAN model of infrastructure I described above.  Operators are receptive to this message for the first time, in no small part because of something I’ve mentioned above—they’ve become focused on cutting capex by putting price pressure on vendors.  SD-WAN has tremendous advantages as a vehicle for offering business services, not the least of which is that it’s a whole lot better down-market than MPLS VPNs.  It’s also a good fit to cloud computing services.  A smart IT vendor could roll with this.

If we have any.  The down-trend in network spending has been clear for several years now, and we still find it easier to deny it than to deal with it.  I suspect that’s going to change, and probably by the end of this year, and we’ll see who then steps up to take control over networking as an industry.  The answer could be surprising.

Don’t Ignore the Scalability and Resilience of SDN/NFV Control and Management!

It would be fair for you to wonder whether the notion of intent-based service modeling for transformed telco infrastructure is anything more than a debate on software architecture.  In fact, it might be a very critical question, because so far we’re really not addressing the execution of the control software associated with virtualization in carrier infrastructure.  We’ve talked about scaling VNFs, even scaling controllers in SDN.  What about scaling the SDN/NFV control, orchestration, and management elements?  Could we bring down a whole network through a classic fault avalanche, or even just through a highly successful marketing campaign?  Does this work under load?

This isn’t an idle question.  If you look at the E2E architecture of the NFV ISG, you see a model that if followed would result in an application with a MANO component, a VIM component, and a VNFM component.  How does work get divided up among those?  Could you have two of a given component sharing the load?  There’s nothing in the material to assure that an implementation is anything but single-threaded, meaning that it processes one request at a time.

I think there are some basic NFV truths and some basic software truths that apply here, or should.  On the NFV side, it makes absolutely no sense to demand scalability under load and dynamic replacement of broken components at the VNF level, and then fail to provide either for the NFV software itself.  At the basic software level, we know how the cloud would approach the problem, and so we have a formula that could be applied but has been largely ignored.

In the cloud, it’s widely understood that scalable components have to be stateless, never retaining data within the component from call to call.  Every time a component is invoked, it has to look like it’s a fresh-from-the-library copy, because given scalability and availability management demands, it might just be that.  Microservices are an example of a modern software development trend (linked to the cloud but not dependent on it) that also mandates stateless behavior.

This whole thing came up back about a decade ago, with the work being done in the TMF on the Service Delivery Framework.  Operators expressed concern to me over whether the architecture being considered was practical:  “Tom, I want you to understand that we’re 100% behind implementable standards, but we’re not sure this is one of them,” was the comment from a big EU telco.  With the support of the concerned operators (and the equipment vendor Extreme Networks) I launched a Java project to prove out how you could build scalable service control.

The key to doing that, as I found and as others found in other related areas, is the notion of back-end state control, meaning that all of the variables associated with the way that a component handles a request are stored not in the component (which would make it stateful) but in a database.  That way, any instance of a component can go to the database and get everything it needs to fulfill the request it receives, and even if five different components process five successive stages of activity, the context is preserved.  That means that if you get more requests than you can handle, you simply spin up more copies of the components that do the work.
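
Here’s a minimal sketch of that back-end state control idea, assuming nothing more than a shared key-value store; the handler keeps no state of its own, so any copy of it can process any request:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A minimal sketch of back-end state control.  The "database" is an
// in-memory map purely for illustration; in practice it would be a shared
// store, so that ANY instance of the handler could process ANY request.
class StateStore {
    private final Map<String, String> state = new ConcurrentHashMap<>();
    String get(String serviceId, String key) { return state.get(serviceId + ":" + key); }
    void put(String serviceId, String key, String value) { state.put(serviceId + ":" + key, value); }
}

// The handler is stateless: it keeps nothing between calls, so you can spin
// up as many copies as load demands and retire them freely.
class LifecycleHandler {
    private final StateStore store;
    LifecycleHandler(StateStore store) { this.store = store; }

    void handle(String serviceId, String event) {
        String current = store.get(serviceId, "state");        // fetch full context
        if (current == null) current = "orderable";
        String next = transition(current, event);
        store.put(serviceId, "state", next);                   // write context back
        System.out.println(serviceId + ": " + current + " --" + event + "--> " + next);
    }

    // Trivial placeholder; a real implementation would consult the service
    // model's state/event table rather than hard-coding transitions.
    private String transition(String state, String event) {
        if (state.equals("orderable") && event.equals("activate")) return "deploying";
        if (state.equals("deploying") && event.equals("deployed")) return "active";
        return state;
    }
}
```

The transition logic here is deliberately trivial; the point is that the context travels with the service identifier, not with the handler instance.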

You could shoehorn this approach into the strict structure of NFV’s MANO, but it wouldn’t be the right way—the cloud way—of doing it.  The TMF work on NGOSS Contract demonstrated that the data model that should be used for back-end state control is the service contract.  If that contract manages state control, and if all the elements of the service (what the TMF calls “Customer-Facing” and “Resource-Facing” Services, or CFS/RFS) store state variables in it, then a copy of the service contract will provide the correct context to any software element processing any event.  That’s how this should be done.

The ONF vision, as they explained it yesterday, provides state control in their model instances, and so do all my own initiatives in defining model-driven services.  If the “states” start with an “orderable” state and advance through the full service lifecycle, then all of the steps needed to deploy, redeploy, scale, replace, and remove services and service elements can be defined as the processes associated with events in those states.  If all those processes operate on the service contract data, then they can all be fully stateless, scalable, and dynamic.
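
A minimal sketch of that state/event approach might look like the following; the states, events, and processes are illustrative, not drawn from the TMF or ONF work:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Sketch of a state/event table keyed on (state, event), where each cell
// names a stateless process operating on the service contract data.  The
// states and events are illustrative, not drawn from any standard.
class ServiceContract {
    String state = "orderable";
    final Map<String, String> variables = new HashMap<>();    // back-end state lives here
}

class StateEventTable {
    private final Map<String, Consumer<ServiceContract>> table = new HashMap<>();

    void on(String state, String event, Consumer<ServiceContract> process) {
        table.put(state + "/" + event, process);
    }

    void dispatch(ServiceContract contract, String event) {
        Consumer<ServiceContract> process = table.get(contract.state + "/" + event);
        if (process != null) process.accept(contract);        // process mutates the contract
    }
}

class LifecycleDemo {
    public static void main(String[] args) {
        StateEventTable lifecycle = new StateEventTable();
        lifecycle.on("orderable",   "order", c -> c.state = "deploying");
        lifecycle.on("deploying",   "ready", c -> c.state = "active");
        lifecycle.on("active",      "fault", c -> c.state = "redeploying");
        lifecycle.on("redeploying", "ready", c -> c.state = "active");

        ServiceContract svc = new ServiceContract();
        lifecycle.dispatch(svc, "order");
        lifecycle.dispatch(svc, "ready");
        System.out.println("Service is now: " + svc.state);   // prints "active"
    }
}
```

Because every process in the table gets the full contract, any instance of any process can pick up the lifecycle at any point, which is exactly the scalability property the VNFs are supposed to have.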

Functionally, this can still map back to the NFV ISG’s E2E model, but the functions described in the model would be distributed in two ways—first by separating their processes and integrating them with the model state/event tables as appropriate, and second by allowing their execution to be distributed across as many instances as needed to spread the load or replace broken pieces.

There are some specific issues that would have to be addressed in a model-driven, state/event, service lifecycle management implementation like this.  Probably the most pressing is how you’d coordinate the assignment of finite resources.  You can’t have five or six or more parallel processes grabbing for hosting, storage, or network resources at the same time—some things may have to be serialized.  You can have the heavy lifting of making deployment decisions, etc. operating in parallel, though.  And there are ways of managing the collision of requests for resources too.

Every operator facility, whether it’s network or cloud, could be a control domain, and while multiple requests to resources in the same domain would have to be collision-proof, requests to different domains could run in parallel.  Thus, you can reduce the impact of the collision of requests.  This is necessary in my distributed approach, but it’s also necessary in today’s monolithic model of NFV implementation.  Imagine trying to deploy national and international services with a single instance of MANO!
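
Here’s a minimal sketch of that domain-serialization idea, with hypothetical names throughout: give each control domain a single-threaded queue, so requests within a domain are serialized while requests across domains run in parallel.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of per-domain serialization: each control domain gets a
// single-threaded executor, so resource requests WITHIN a domain are
// serialized while requests ACROSS domains run in parallel.
class DomainScheduler {
    private final Map<String, ExecutorService> domains = new ConcurrentHashMap<>();

    void submit(String domain, Runnable resourceRequest) {
        domains.computeIfAbsent(domain, d -> Executors.newSingleThreadExecutor())
               .submit(resourceRequest);    // queued behind earlier requests to this domain
    }

    void shutdown() {
        domains.values().forEach(ExecutorService::shutdown);
    }
}

class SchedulerDemo {
    public static void main(String[] args) {
        DomainScheduler scheduler = new DomainScheduler();
        // The first two requests are serialized; the third, in a different
        // domain, can proceed concurrently with them.
        scheduler.submit("paris-dc",  () -> System.out.println("allocate VM in paris-dc"));
        scheduler.submit("paris-dc",  () -> System.out.println("allocate VLAN in paris-dc"));
        scheduler.submit("dallas-dc", () -> System.out.println("allocate VM in dallas-dc"));
        scheduler.shutdown();
    }
}
```

A single-threaded queue per domain is the simplest possible collision-proofing; a real implementation might prefer optimistic locking or resource reservations, but the parallelism picture is the same.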

The final point to make here is that “deployment” is simply a part of the service lifecycle.  If you assume that you deploy things using one set of logic and then sustain them using another, you’re begging for the problems of duplication of effort and very likely inconsistencies in handling.  Everything in a service lifecycle should be handled the same way, be defined by the same model.  That’s true for VNFs and also for the NFV control elements.

This isn’t a new issue, which perhaps is why it’s so frustrating.  In cloud computing today, we’re seeing all kinds of initiatives to create software that scales to workload and that self-heals.  There’s no reason not to apply those principles to SDN and NFV, particularly when parts of NFV (the VNFs) are specifically supposed to have those properties.  There’s still time to work this sort of thing into designs, and that has to be done if we expect massive deployments of SDN/NFV technology.