What is a Model and Why Do We Need One in Transformation?

After my blog on Cisco’s intent networking initiative yesterday, I got some questions from operator friends on the issue of modeling.  We hear a lot about models in networking—“service models” or “intent models”—but almost always with a qualifier in front.  What’s a “model” and why have one?  I think the best answer is to harken back to the origins of the “model” concept, then look at what those origins teach us about the role of models in network transformation.

At one level, modeling starts with a software concept called “DevOps”.  DevOps is short for “Development/Operations”, and it’s a software design and deployment practice aimed at making sure that when software is developed, there’s collateral effort undertaken to get it deployed the way the developers expected.  Without DevOps you could write great software and have it messed up by not being installed and configured correctly.

From the first, there were two paths toward DevOps, what’s called the “declarative” or “descriptive” path, and what’s called the “prescriptive” path.  With the declarative approach, you define a software model of the desired end-state of your deployment.  With the prescriptive path, you define the specific steps associated with achieving a given end-state.  The first is a model, the second is a script.  I think the descriptive or model vision of DevOps is emerging as the winner, largely because it’s more logical to describe your goal and let software drive processes to achieve it, than to try to figure out every possible condition and write a script for it.
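
To make the distinction concrete, here’s a minimal Python sketch (structure and names are my own illustration, not drawn from any DevOps tool) contrasting a declarative description of a desired end-state with a prescriptive script of steps:

```python
# Declarative ("descriptive") style: state WHAT you want and let software converge on it.
desired_state = {
    "service": "web-frontend",
    "instances": 3,
    "version": "2.1",
    "port": 8080,
}

def reconcile(current_state, desired_state):
    """Compare reality to the goal and derive whatever actions close the gap."""
    actions = []
    if current_state.get("instances", 0) < desired_state["instances"]:
        actions.append("scale-up")
    if current_state.get("version") != desired_state["version"]:
        actions.append("rolling-upgrade")
    return actions

# Prescriptive style: state HOW to get there, step by step, from one assumed starting point.
prescriptive_script = [
    "install package web-frontend-2.1",
    "write config port=8080",
    "start 3 instances",
]

print(reconcile({"instances": 1, "version": "2.0"}, desired_state))  # ['scale-up', 'rolling-upgrade']
```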

Roughly concurrent with DevOps were two telecom-related activities that also promoted models.  One was the TeleManagement Forum’s “NGOSS Contract”, and the other was the IPsphere Forum’s notion of “elements”.  The TMF said that a contract data model could serve as the means of associating service events with service processes, and the IPSF said that a service was made up of modular elements assembled according to a structure, and “orchestrated” to coordinate lifecycle processes.

What’s emerged from all of this is the notion of “models” and “modeling” as the process of describing the relationships among the components of what is logically a multi-component, cooperative system that provides a service.  The idea is that if you can represent all suitable alternative implementation strategies for a given “model”, you can interchange them in the service structure without changing service behavior.  If you have a software process that can perform NGOSS-contract-like parsing of events via the service model represented by a retail contract, you can use that to manage and automate the entire service lifecycle.

I think that most operators accept the idea that future service lifecycle management systems should be based on “models”, but I’m not sure they all recognize the features that such models, derived the way I’ve described, would have to offer.  A model has to be a structure that can represent, as two separate things, the properties of something and the realization of those properties.  It’s a “mister-outside-mister-inside” kind of thing.  The outside view, the properties view, is what we could call an “intent model” because it focuses on what we want done and not on how we do it.  Inside might be some specific implementation, or it might be another nested set of models that eventually decompose into specific implementations.
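
Here’s a rough sketch, in Python, of what that “mister-outside-mister-inside” structure might look like—an element that exposes properties and an SLA on the outside and hides either a concrete implementation or a nested set of models on the inside.  The class and field names are hypothetical, not taken from any standard:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class IntentElement:
    """Outside view: what we want done.  Inside view: how it gets done."""
    name: str
    properties: dict                                   # asserted capabilities
    sla: dict                                          # e.g., {"availability": 0.999}
    implementation: Optional[str] = None               # a concrete realization...
    children: List["IntentElement"] = field(default_factory=list)  # ...or nested models

    def decompose(self) -> List[str]:
        """Walk the nested structure until every leaf resolves to an implementation."""
        if self.implementation:
            return [self.implementation]
        return [impl for child in self.children for impl in child.decompose()]

vpn = IntentElement(
    name="business-vpn",
    properties={"service": "vpn"},
    sla={"availability": 0.999},
    children=[
        IntentElement("access", {"role": "edge"}, {"latency_ms": 10},
                      implementation="legacy-ethernet-access"),
        IntentElement("core", {"role": "transport"}, {"latency_ms": 20},
                      implementation="mpls-core"),
    ],
)
print(vpn.decompose())   # ['legacy-ethernet-access', 'mpls-core']
```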

One of the big mistakes made in modeling is neglecting the requirement for event integration.  Each model element has an intent and a realization, and the realization is the management of the lifecycle of that element.  Thus, every model element has its own events and operating states, and these define the processes that the model requires to handle a given event at a given time.  If you don’t have state/event handling in a very explicit way, then you don’t have a model that can coordinate the lifecycle of what you’re modeling, and you don’t have service automation.
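
A minimal sketch of explicit state/event handling inside a model element might look like the following; the states, events, and process names are invented for illustration:

```python
# Each model element carries its own state/event table: for a given (state, event)
# pair, the table names the lifecycle process to run and the next state to enter.
STATE_EVENT_TABLE = {
    ("ordered",   "deploy-request"): ("deploy_element",   "deploying"),
    ("deploying", "deploy-done"):    ("activate_element", "active"),
    ("active",    "fault"):          ("repair_element",   "repairing"),
    ("repairing", "repair-done"):    ("activate_element", "active"),
}

class ModelElement:
    def __init__(self, name):
        self.name = name
        self.state = "ordered"

    def handle_event(self, event):
        process, next_state = STATE_EVENT_TABLE.get(
            (self.state, event), ("log_unexpected_event", self.state))
        print(f"{self.name}: {self.state} + {event} -> run {process}, enter {next_state}")
        self.state = next_state

element = ModelElement("firewall-vnf")
element.handle_event("deploy-request")
element.handle_event("deploy-done")
element.handle_event("fault")
```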

One of the things I look for when vendors announce something relating to SDN or NFV or cloud computing or transformation is what they do for modeling.  Absent a modeling approach that has the pieces I’ve described, you can’t define a complete service lifecycle in a way that facilitates software automation, so you can’t have accurate deployments and you can’t respond to network or service conditions efficiently.  So, no opex savings.

Models also facilitate integration.  If a service model defines the elements of a service, each through its own model, and defines the service events and operating states, then you can look at the model and tell what’s supposed to happen.  Any two implementations that fit the same intent model description are equivalent.  Integration is implicit.  Absent a model, every possible service condition has to somehow figure out what the current service state is, and what the condition means in that state, and then somehow invoke the right processes.  The service model can even define the APIs that link process elements; with no model, what defines them and ensures all the pieces can connect?

Where something like policy management fits into this is a bit harder to say, because while we know what policies are at a high level (they are rules that govern the handling of conditions), unlike with models it may not be clear how these rules relate to specific lifecycle stages or what specific events the conditions of the policies represent.  It’s my view that policy management is a useful way of describing self-organizing systems, usually ones that depend on a fairly uniform resource set.

Router networks are easily managed using policies.  With NFV-deployed router instances, you have to worry about how each instance gets deployed and how it might be scaled or replaced.  It’s much more difficult to define policies to handle these dependencies, because most policy systems don’t do well at communicating asynchronous status between dependent pieces.  I’m not saying that you can’t write policies this way, but it’s much harder than simply describing a TMF-IPSF-DevOps declarative intent model.

Policies can be used inside intent models, and in fact a very good use for policies is describing the implementation of “intents” that are based on legacy homogeneous networks like Ethernet or IP.  A policy “tree” emerging from an intent model is a fine way of coordinating behavior in these situations.  As a means of synchronizing a dozen or a hundred independent function deployments, it’s not good at all.

This all explains two things.  First, why SDN and NFV haven’t delivered on their promises.  What is the model for SDN or NFV?  We don’t have one, and so we don’t have a consistent framework for integration or service lifecycle management.  Second, why I like OASIS TOSCA (Topology and Orchestration Specification for Cloud Applications).  It’s designed to model and orchestrate exactly the kind of deployment that’s too dynamic and complex to control via policies.  Remember, we generally deploy cloud applications today using some sort of model.

Integration is fine.  API specifications are fine.  Without models, neither of them is more than a goal, because there’s no practical way to systematize, to automate, what you end up with.  We will never make controlled services and service infrastructure a substitute for autonomous and adaptive infrastructure without software automation, and it’s models that can get us there.  So forget everything else in SDN and NFV and go immediately to the model step.  It’s the best way to get everything under control.

What Does Cisco Intend with “Intent Networking?”

Cisco has announced it’s going to support, and perhaps even focus on, “intent-based” networking.  At one level this could be viewed as a vindication of a widely held view that intent-modeling is the essential (and perhaps under-supplied or even missing) ingredient in the progression of virtualization.  At another level, it could be seen as another Cisco marketing strategy.  The truth is that it’s a little of both.

At the heart of today’s issue set is a whole different notion, that of determinism.  The old-day time-division-multiplexed networks were deterministic; they worked in a specific way and provided very specific capacity and SLAs.  As packet networks, and particularly the Internet, evolved, networking tossed out strict determinism in favor of lower cost.  We got “best efforts” networks, which is what dominates today.

So what does this have to do with “intent?”  Well, best efforts is increasingly not good enough in a competitive market, but nobody wants to go back to full determinism to achieve something better—the cost would be excessive.  The alternative is to somehow couple service requirements into packet networks in a way that doesn’t break the bank.  In an intent model, elements of infrastructure are abstracted into a black box that asserts interfaces and an SLA but hides the details.  Intent modeling is therefore a way of looking at how to express how deterministic a network has to be.  It also leaves it to the vendor (and presumably the network-builder) to decide how to fulfill the intent.

Intent modeling is an incredibly important tool in realizing the benefits of virtualization and infrastructure transformation, because it lets operators create abstract building-blocks (intent-based black boxes) that combine to build networks, and that then evolve internally from legacy to modern technology.  A good evolutionary intent model has to be anchored in the present, and support the future.

Cisco’s approach to transformation has always been what cynics would call “cosmetic”.  Instead of focusing on building SDN or building NFV, Cisco has focused on achieving the goals of those technologies using behaviors coerced from current technology.  At one level, this is the same kind of marketing gloss Cisco has been famous for, for decades in fact.  At another it’s reflective of a simple truth, which is that transformational technologies that do the transforming by displacing legacy infrastructure are exceptionally difficult to promote because of cost and risk.

There really isn’t much new in the Cisco intent approach.  Cisco has always been an advocate of “policy-based” networking, meaning a form of determinism where the goals (the “intent”) are translated into a hierarchy of policies that then guide how traffic is handled down below.  This is still their approach, and so you have to wonder why they’d do a major announcement that included the financial industry to do little more than put another face on a concept they’ve had around for almost a decade.

One reason is marketing, of course.  “News”, as I’ve always said, means “novelty”.  If you want coverage in the media rags (or sites, in modern terms) then you have to do something different, novel.  Another reason is counter-predation.  If a competitor is planning on eating its way along a specific food chain to threaten your dominance, you cut them off by eating a critical piece yourself.  Intent modeling is absolutely critical to infrastructure transformation.  If you happen to be a vendor winning in legacy infrastructure, and thus want to stall competitors’ reliance on intent modeling as a path to displacing you, then you eat the concept yourself.

OK, yes, I’m making this sound cynical, and I’d be the first to admit that it is.  That’s not necessarily bad, though.  In one of my favorite media jokes, Spielberg, when asked what had been the best advice he’d received as a director, said “When you talk to the press, lie.”  But to me the true boundary between mindless prevarication and effective marketing is the buyers’ value proposition.  Is Cisco simply doing intent models the only way they are likely to get done?  That, it turns out, is hard to say “No!” to.

We have struggled with virtualization for five years now, and during that period we have done next to nothing to actually seize the high-level benefits.  In effect, we have as an industry focused on what’s inside the black-box intent model even though the whole purpose of intent models is to make that invisible.  Intent modeling as a driving concept for virtualization emerged in a true sense only within the last year.  Cisco, while they didn’t use the term initially, jumped onto that high-level transformation mission immediately.  Their decision to do that clearly muddies the business case for full transformation via SDN and NFV, but if the proponents of SDN and NFV weren’t making (and aren’t making) the business case in any event, what’s the problem?

Cisco has done something useful here, though of course they’ve done it in an opportunistic way.  They have demonstrated the real structure of intent models—you have an SLA (your intent) on top, and you have an implementation that converts intent into network behavior below.  Cisco does it with policies, but you could do the same thing with APIs that pass SLAs, and then have the SLAs converted internally into policies.  Cisco’s model works well for homogeneous infrastructure that has uniform dependence on policy control; the other approach of APIs and SLAs is more universal.  So Cisco could be presenting us with a way to package transformation through revolution (SLAs and APIs) and transformation through coercion (policies) as a single thing—an intent model.
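
Here’s a hedged sketch of those two packaging options—an SLA passed in through an API at the top, realized internally either as a policy tree or as explicit provisioning steps.  Everything in it is illustrative; none of it reflects Cisco’s actual interfaces:

```python
# One intent (an SLA) realized two ways: coerced from existing gear via policies,
# or provisioned explicitly through APIs that pass the SLA downward.
intent = {"service": "branch-vpn", "latency_ms": 30, "availability": 0.999}

def realize_with_policies(sla):
    """'Coercion' path: translate the SLA into policy rules pushed to legacy devices."""
    return [
        {"match": "branch-vpn traffic", "action": "queue=priority"},
        {"match": "branch-vpn traffic", "action": f"path-latency<={sla['latency_ms']}ms"},
    ]

def realize_with_explicit_control(sla):
    """'Revolution' path: translate the SLA into explicit provisioning steps."""
    return [
        ("provision_path", {"max_latency_ms": sla["latency_ms"]}),
        ("deploy_monitoring", {"availability_target": sla["availability"]}),
    ]

print(realize_with_policies(intent))
print(realize_with_explicit_control(intent))
```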

They could also be stimulating the SDN and NFV world to start thinking about the top of the benefit pyramid.  If Cisco can make the business case for “transformation” without transforming infrastructure, bring service control and a degree of determinism to networking without changing equipment, then more radical approaches are going nowhere unless they can make a better business case.

Is Cisco sowing the seeds of its own competition?  More likely, as I suggested above, Cisco is seeing the way that a vulnerability might be developing and working to cut it off.  But one way or the other, Cisco is announcing that the core concept of SDN and NFV isn’t just for SDN and NFV to realize.  Those who don’t want five years of work to be a science project had better start thinking about those high-level benefits that Cisco is now chowing down on.  There are only so many prey animals in the herd, and Cisco is a very hungry predator.

Solving the Problem that Could Derail SDN and NFV

Back in the days of the public switched telephone network, everyone understood what “signaling” was.  We had an explicit signaling network, SS7, that mediated how resources were applied to calls and managed the progression of connections through the hierarchy of switches.  The notion of signaling changed with IP networks, and I’m now hearing from operators that it changes even more when you add in things like SDN and NFV.  I’m also hearing that we’ve perhaps failed to recognize just what those changes could mean.

You could argue that the major difference between IP networks and the old circuit-switched networks is adaptive routing.  In traditional PSTN we had routing tables that expressed where connections were supposed to be routed.  In IP networks, we replaced that with routing tables that were built and maintained by adaptive topology discovery.  Nodes told each other who they could reach and how good the path was, simply stated.

The big advantage of adaptive routing is that it adapts, meaning that issues with a node or a trunk connection can be accommodated because the nodes will discover a better path.  This takes time, to be sure, for what we call “convergence”, meaning a collective understanding of the new topology.  The convergence time is a period of disorder, and the more complicated convergence is, the longer that disorder lasts.

SDN sought to replace adaptive routing with predetermined, centrally managed routes.  The process whereby this determination and management happens is likely not the same as it was for the PSTN, but at least one of the goals was to do a better job of quickly settling on a new topology map that efficiently managed the remaining capacity.  The same SDN central processes could also be used to put the network into one of several operating modes that were designed to accommodate special traffic conditions.  A great idea, right?

Yes, but.  What some operators have found is that SDN has implicitly reinvented the notion of signaling, and because the invention was implicit and not explicit they’re finding that the signaling model for SDN isn’t fully baked.

Some operators and even some vendors say NFV has many of the same issues.  A traffic path and a set of feature hosts are assembled in NFV to replace discrete devices.  The process of assembling these parts and stitching the connections, and the process of scaling and sustaining the operation of all the pieces, happens “inside” what appears at the service level to be a discrete device.  That interior stuff is really signaling too, and like SDN signaling it’s not been a focus of attention.

It’s now becoming a focus, because when you try to build extensive SDN topologies that span more than a single data center, or when you build an NFV service over a practical customer topology, you encounter several key issues.  Most can be attributed to the fact that both SDN and NFV depend on a kind of “out of band” (meaning outside the service data plane) connectivity.  Where does that come from?

SDN’s issue is easily recognized.  Say we have a hundred white-box nodes.  Each of these nodes has to have a connection to the SDN controller to make requests for a route (in the “stimulus” model of SDN) and to receive forwarding table updates (in either model).  What creates that connection?  If the connection is forwarded through other white-boxes, creating what would be called a “signaling network”, then the SDN controller also has to maintain its signaling paths.  But if something breaks along such a path and the path is lost, how does the controller reach the nodes to tell them how the new topology is to be handled?  You can, in theory, define failure modes for nodes, but you then have to ensure that all the impacted nodes know that they’re supposed to transition to such a mode.
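
As a rough illustration of the problem (a sketch only—the node and controller behaviors are assumptions, not taken from OpenFlow or any real controller), consider what each white-box has to do when its signaling path to the controller is lost:

```python
import time

HEARTBEAT_TIMEOUT_S = 3.0   # assumed: how long a node waits before declaring the controller lost

class WhiteBoxNode:
    def __init__(self, name, fallback_forwarding_table):
        self.name = name
        self.forwarding_table = {}
        self.fallback = fallback_forwarding_table   # preconfigured "failure mode" routes
        self.last_heartbeat = time.time()

    def on_controller_message(self, table_update):
        self.last_heartbeat = time.time()
        self.forwarding_table.update(table_update)

    def check_controller_liveness(self):
        # If the signaling path to the controller is gone, the node can't be told what
        # to do; it can only fall back to a locally stored failure mode and hope the
        # controller can eventually reach it again.
        if time.time() - self.last_heartbeat > HEARTBEAT_TIMEOUT_S:
            print(f"{self.name}: controller unreachable, entering failure mode")
            self.forwarding_table = dict(self.fallback)

node = WhiteBoxNode("edge-7", {"0.0.0.0/0": "port-1"})
node.last_heartbeat -= 10          # simulate a lost signaling path
node.check_controller_liveness()   # node falls back to its preconfigured routes
```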

In NFV, the problem is harder to explain in simple terms and it’s also multifaceted.  Suppose you have to scale out, meaning instantiate a new copy of a VNF to absorb additional load.  You have to spin up a new VNF somewhere, which means you need to signal a deployment of a VNF in a data center.  You also have to connect it into the data path, which might mean spinning up another VNF that acts as a load-balancer.  In NFV, if we’re to maintain security of the operations processes, we can’t expose the deployment and connection facilities to the tenant service data paths or they could be hacked.  Where are they then?  Like SS7, they are presumably an independent network.  Who builds it, with what, and what happens if that separate network breaks?  Now you can’t manage what you’ve deployed.
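
A simplified sketch of the scale-out sequence just described, with the management (signaling) network represented as a channel separate from the tenant data path; the steps and names are illustrative assumptions, not the ETSI-defined procedures:

```python
def scale_out(vnf_name, mgmt_network_up, data_center):
    """Every step here travels over the management network, never the tenant data path."""
    steps = [
        f"deploy a new instance of {vnf_name} in {data_center}",
        "deploy a load-balancer in front of the instances if one isn't already present",
        "stitch the new instance into the service data path",
        "add the new instance to service monitoring",
    ]
    for step in steps:
        if not mgmt_network_up:
            raise RuntimeError("management network unreachable: scale-out cannot proceed")
        print("signaling:", step)

scale_out("firewall-vnf", mgmt_network_up=True, data_center="edge-dc-12")
```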

I opened this blog with a comment on SS7 because one EU operator expert explained the problem by saying “We’re finding that we need an SS7 network for virtualization.”  The fact is that we do, and a collateral fact is that we’ve not been thinking about it.  Which means that we are expecting a control process that manages resources and connectivity to operate using the same resources it’s managing, and never lose touch with all the pieces.  If that were practical we’d never have faults to manage.

The signaling issue has a direct bearing on a lot of the SDN and NFV reliability/availability approaches.  You don’t need five-nines devices with virtualization, so it’s said, because you can replace a broken part dynamically.  Yes, if you can contact the control processes that do the replacing, then reconnect everything.  To me, that means that if you want to accept the dynamic replacement approach to availability management, you need to have a high-reliability signaling network to replace the static-high-availability-device approach.

Even the operators who say they’ve seen the early signs of the signaling issue say that they see only hints today because of the limited scope of SDN and NFV deployments.  We are doing trials in very contained service geographies, with very small resource pools, and with limited service objectives.  Even there, we’re running into situations where a network condition cuts the signaling connections and prevents a managed response.  What’s going to happen when we spread out our service deployments?

I think SDN and NFV signaling is a major issue, and I know that some operators are seeing signs of that too.  I also think that there are remedies, but in order to apply them we have to take a much higher-level view of virtualization and think more about how we provide the mediation and coordination needed to manage any distributed system.  Before we get too far into software architectures, testing, and deployment we should be looking into how to solve the signaling problem.

The Drive to Add Feature Value to Networks is Changing How Networks are Built

Every vendor wants to sit astride the critical value propositions, and in networking that’s particularly true.  With capital spending under pressure, it’s crucial to have some strong value propositions you can spout to impress buyers.  The problem has been that “value” really means either cost or revenue, and much of networking is insulated from both these areas by the structure of services.  But the problem is increasingly in the past, because the drive to differentiate is creating innovative solutions.

Traditional networks build upward from physical media through a series of “layers”, each of which sees only the services/features of the layer below.  The user experience is created at the top layer, which for the network (despite the media hype to the contrary) is “Level 3” most times, and Level 2 for the rest.  Operators tell me that there are very few services asserted at other layers.  They also tell me that almost three-quarters of their operations costs are incurred at the service layer, the top.

Virtualization has thrown out the basis of the traditional division of features, because virtualization allows for the creation of a virtual network with virtual layers.  For example, you can use virtualization to build a “virtual wire” from one continent to another, transiting all manner of facilities.  It still looks like a wire to the higher layers, but the traditional mission of Level 3, which is to tack physical media spans together to create routes and paths, really isn’t necessary any more.

Lower-layer network features that replace features normally provided at a higher layer are examples of subtractive features; your new features subtract requirements from higher layers, and by doing that potentially simplify them.  If, for example, we had total resilience at the optical layer, would we be able to eliminate not only error recovery but even dynamic routing at higher layers?

Another thing virtualization could do is support a new model of services.  NFV is an example of what could be called feature addition; you can build services that add features to basic connectivity.  These features could be related to connectivity, or they might be elements of applications, as would be the case with cloud computing.

Finally, you could think of parallelism as an attribute of services.  Today we get most IP services by partitioning IP infrastructure.  Might we instead use virtual wires to create truly independent subnetworks at the lower layers, and then pair these with hosted instances of switches and routers, or any other set of feature-driving elements, above?  Why not?

Virtualization isn’t the only driver here, either.  Every lower layer in the network, and every vendor who has products there, has aspirations to add features and capabilities.  These features generate impacts that fall into these same three categories.  Chip vendors want programmable forwarding, which is another set of features whose value proposition demands they change how “services” would be built.

How do we build them?  Each of the virtualization impact models requires different accommodations, with a final common element.

Subtractive models are the simplest and most complicated at the same time.  They’re simple because a problem removed at a lower layer automatically simplifies the higher layers.  If an error doesn’t occur, then higher-layer steps aren’t required and operations is automatically less complicated and expensive.  They’re complicated because full exploitation of subtractive feature benefits requires that you remove the features from the higher layer.  As an example, SDN could take advantage of subtractive reliability and availability features at the optical layer because it has no intrinsic capability to recover from problems—you’d have to include that capability explicitly.  With proper lower-layer features you simplify SDN software and presumably lower cost.

The additive models are easier to understand because we have examples of them in both the cloud and NFV.  The challenge here is to identify incremental features that are valuable, and then cull them to eliminate those that are not expensive enough to justify hosting in the network.  Business firewalls are expensive, so you can probably host them.  Residential features of the same type are part of a device that might cost forty bucks and includes the essential WiFi feature.  Host that?

The parallel model is really all about benefits and costs.  We build VPNs today using network-resident features (MPLS).  We could build them by partitioning capacity through virtual wires at the optical or optics-plus level, and then add in the L2/L3 features using hosted instances of routing/switching or simply by using SD-WAN features.  Would the cost of this alternative approach be lower in the long run?  Would the security benefits make the service inherently more valuable to buyers?

The common element in all of this is that future services are less likely to be “coerced” from the cooperative behavior of a system of compatible devices and more likely to be deployed as a collection of features.  You start with infrastructure that is inherently more service-independent and make it service-specific by targeting the services you can sell and then deploying the behaviors that make up those services.

This requires a violation of the old OSI assumption of building on lower layers, because with the new model the goal may well be to exploit and thus to expose features of lower layers.  We’re seeing some of the new debate on this point already, because if you have (for example) the ability to expand transport bandwidth at Level 1, do you allow service creation to do that, or do you wait until the sum of higher-layer capacity demands signals a need?

If you decide to let service creation cross OSI layers, you buy into a much more complex approach to service management today, but one that better prepares for those three virtualization feature models down the road.  If we want to see every aspect of networking freed to develop its own optimum cost/benefit relationships, then we have to free every aspect of networking from traditional constraints—constraints applied at a time when we had no novel capabilities to exploit, so we had nothing to lose.

Experiences are the way of the future, service-wise.  We already know that universal connectivity provides a platform for enriching our experiences by supporting easy delivery.  Easy delivery alone isn’t the experience, though.  Eventually, we will have to see all services as composed feature sets married with connectivity.  Eventually, service management will have to be able to model those complex relationships and make them efficient and reliable.  The sooner we look at how that’s done, the better.

This is IMHO the strongest reason to be looking at cloud-intent-model-driven service models (like OASIS TOSCA) rather than models that were derived to control connectivity.  The most connective network you could ever have is simply an invitation for disintermediation if what the buyer wants is the experience that the network can deliver.  Those experiences can always be modeled as functions-plus-connectivity, and that looks like a cloud and not like a network.

Exploring the Operators’ Views on Transformation Drivers and Time-Line

In past blogs, I’ve said that there were three dominant drivers for transformational change in networking.  One is carrier cloud (which of course has its own drivers), one is 5G, and the last is IoT.  Given that the industry is fast-paced, it’s a good time to look at where we stand on each, based on the project information I’ve seen from the network operators.

This is particularly important given that traditionally cited drivers like SDN and NFV are clearly a moving finish line according to surveys.  Light Reading published a summary of a survey that said, in essence, that operators were pushing back their expected realization of virtualization plans.  Yes, because technology change without business drivers is just a science project.  So, what are the real drivers doing?  In some cases, a lot and in others not much.

Carrier cloud is a bit of a catch-all driver, but it is the meaningful consequence of all the carrier interest in transformation.  The net-net of this driver is that it represents a goal of shifting the carrier business model to owning experiences and delivering them on the network, rather than owning just the network and letting others provide the experience.  NFV, operator entry into cloud computing, and content and advertising ventures could all create specific carrier cloud opportunities.

In my view, based on operator inputs, I think carrier cloud as a driver is suffering from a large dose of the vagues.  To say that you want to transform to, or lead in, carrier cloud is like saying you want your profits to go up.  The goal is inarguable but the realization of it is difficult because there’s no specific defined pathway you can expect to follow.  Carrier cloud activity is dominantly linked to lab and market research projects that are still largely in the hoped-for stage.

What might get carrier cloud out of the fuzziness that plagues it now is IoT.  Smart operators have been gradually coming to realize that the media vision of IoT as “billions of new devices on the Internet generating billions of new cellular service bills” (as one cynical operator staff type puts it) is a pipe dream.  That realization killed off the one obvious IoT business model but didn’t suggest anything more rational.  Fortunately, Amazon has come along (joined by Microsoft and Google) in presenting IoT as an application of event-based cloud computing—in the form of functional or “lambda” programming.

Event-driven systems, as I’ve said, pose a dilemma for cloud providers.  On the one hand, events are likely widely distributed, so they might be a very logical source of new cloud demand.  On the other hand, they demand a short control loop for at least the early stages of their processing.  That’s why Amazon introduced its Greengrass on-premises hosting option for the Amazon event-centric Lambda service.  Carriers have convenient real estate to host event processing near the source.

The problem, for both IoT as a driver and for carrier cloud as something IoT could drive, is that operators are still hesitant about getting into process hosting, particularly in the form of a generalized cloud service.  Remember that carriers have been generally unsuccessful in cloud computing ventures.  The Amazon example might seem a clear direction to mimic, but not if it’s seen as an example of cloud computing.  Thus, carriers would have to see IoT as a set of functions aimed at facilitating M2M or something.  They may not see it as a cellular-billing opportunity any longer, but they still don’t have the process perspective.

The “Why?” of that is best answered by relating the question of a Tier One: “What do we deploy?”  Operators have a major problem planning things that don’t have any hard-deployable elements.  This is the sort of thing they’ve traditionally turned to vendors to provide.  If IoT is about event processing and function hosting, then will somebody please sell me the stuff?  If you can point it out, tell me what it takes to run, price it, and so forth, then I can figure out whether I can make a business case.  The notion of formulating an IoT event-and-process strategy from scratch, then going out and assembling the pieces, is pretty well out of the carrier comfort zone.

The good news is that smart staff types in at least the major Tier One operators are accepting event-and-function reality for the first time, and in no small part because of the initiatives of the cloud providers.  They are looking at some packages (including GE Digital’s Predix) and some middleware and hosting options.  I think we may see some movement in this space by late in 2018.

Surprisingly, the carrier cloud impact of IoT planning might come along before the real IoT application.  One reason is that NFV’s “favorite” application, virtual CPE, is logically targeted at business sites, and these are also the targets for event processing.  A “function” in IoT terms isn’t the same as an NFV virtual function, but there are many similarities at the hosting level, and it would be possible to present IoT event-function hosting as an NFV application (were any vendors fully rational in this space, at least).  Vendors and operator planners eager to find a reason to deploy some servers in edge offices might find the IoT process hosting mission an attractive add-on to the vCPE hosting opportunity.

The final driver to be explored is 5G, and it’s the hardest of the three to get any handle on.  On the one hand, mobile operators are at least titularly committed to 5G at an almost unprecedented rate.  On the other hand, hardly any of them think that the specifications are even fully baked, and when you ask them what their 5G drivers are, you often get vague responses.

In fact, there are only two solid drivers for 5G at this point, one technological and one marketing.  On the technological side, many operators think that 5G is an opportunity for them to create fiber tail-connections to make fiber-to-the-node truly useful.  It’s a bonus that these nodal 5G cells could also provide better mobile coverage.  On the marketing side, it’s competition.  “More G’s win,” according to one operator.

If the only value in 5G is “more G’s” then it’s likely any deployment would be designed to have minimal impact on the network overall, simply to avoid introducing new costs.  A 5G tail connection doesn’t necessarily need a lot of the novel 5G features either, so it’s safe to say that 5G is still in a very early deliberation-and-planning stage.  Operators think that’s the case too; most say they can “see the value” of things like network slicing, but they can’t put a number on it.  This reminds me of the state of NFV a couple of years ago.

The bad news here is that none of the transformational drivers in networking show any sign of driving major changes in infrastructure in the near term.  The good news is that there does seem to be recognition that at least IoT and some aspects of 5G could drive value, and my personal view is that vendors or operators who want action before 2020 will need to somehow promote the edge-hosting model of event-driven IoT to get it.

How Do We Define Software-Defined Network Models?

If networks are truly software-defined, what defines the software that defines them?  This is not only the pivotal question in the SDN and NFV space, but perhaps the pivotal question in the evolution of networks.  We knew how to build open, interoperable networks using fixed devices like switches and routers, but it’s increasingly clear that these old methods won’t work in the new age of virtualization and software.  What does?  There are a variety of answers out there, but it may be that none are really complete.

The classic network solution is the formal standards process, which we have seen for both NFV and SDN.  The big question with the standards approach is “What do you standardize?”  SDN focused on standardizing a protocol, OpenFlow, and presumed that by doing that they would achieve openness and interoperability among the things that supported the protocol.  NFV originally said they weren’t going to write standards at all, but simply select from those already available.  That approach isn’t being followed IMHO, and arguably what NFV did do was to standardize a framework, an architecture, that identified “interfaces” it then proposed to standardize.

I don’t think that either SDN or NFV represents a successful application of formal standards to the software-defined world.  You could kick around the reasons why that’s true, but I think the root cause is that software design doesn’t work like formal standards work.  You really need to start software at the top, with your benefits and the pathway you propose to follow in achieving them.  Both SDN and NFV defined their model before they had identified the critical benefits and the critical steps that would be needed to secure them.

A second approach, this one from the software side, is the open-source model.  Open source software is about community development, projects staffed by volunteers who contribute their efforts and aimed at producing a result that’s open for all to use without payment for the software that results.  It’s worked with Linux, and so why not here?

I’m a fan of open source, but it has its limitations, the primary one being that the success of the project depends on the right software architecture, and it’s hard to say where that architecture vision comes from.  In Linux, it came from one man, and his inspiration was a running operating system (UNIX) mired in commercial debates and competing versions.  But for SDN and NFV there’s not just a division but a set of divisions on the open-source side that make things even more complicated.

One obvious division is among competing projects that have the same goal.  For both SDN and NFV we have that already, and even if all the projects are open, they are different and so threaten interoperability by creating competing software models that could be the targets for integration and deployment.  What works with one will probably not work with others, without special integration work at least.

Another division is the end-game-versus-evolutionary-path approach conflict.  We have projects like CORD (Central Office Re-architected as a Datacenter) that define the future end-state of software-driven networking, and others like the various NFV MANO projects that define a stepping stone toward that future.  It’s not clear to many (including me) just what all the MANO projects would generate as a future network model, and it’s not clear what actionable steps toward CORD would look like.

All of this uncertainty is troubling at best, but it’s intolerable if you want operators to commit to a big-budget transformation.  Add to this the fact that the important work being done today is de facto committed to one of these (flawed) approaches, and you can see that we could have a big problem.  It may even be too big to solve for current software-driven initiatives, but at the least we should try to lay out the right approach so future initiatives will have a better shot at success.  If we can then apply the right future answer to current work, retrofit it, we have at least a rational pathway forward.  If we can’t make the retrofit work, then we’ll have to accept that current initiatives are not likely to be fully successful.

Software projects have to start with an architecture, because you have to build software by successively decomposing missions and goals into functions and processes.  The architecture can’t be a “functional” one in the sense of the ETSI End-to-End model because it has to describe the organization of the software.  ETSI ended up doing a high-level software architecture perhaps without intending it, because you can’t interpret a functional model into software any other way.   A software expert would not have designed NFV that way, and that problem cannot be fixed by tighter interface descriptions, etc.  The software design based on the model isn’t optimum, period.  SDN has a similar problem, but the Open Daylight work there, combined with the fact that SDN is a much more contained strategy, has largely redeemed the SDN approach.  Still, the fact that there has to be an “optical” version of the spec demonstrates that the approach was wrong; the right design would have covered any path types without needing extensions.

Standards, in the traditional network sense, aren’t going to generate software architectures.  That requires software architects, who are rarely involved in formal standards processes.  It would certainly be possible to target the creation of a software architecture as a part of a standards process, though, and we should do that for any future software-defined activities.  We didn’t do it with SDN and NFV, though, and it’s exceptionally difficult to retrofit a new architecture onto an existing software project.  That means that open-source software would have to evolve into an optimum direction, based on recognized issues and opportunities.  Which, of course, could take time.

We may have to let nature take its course now with SDN and NFV, but in my view, it’s time to admit that we can’t fit the right model onto the current specification, and that in any case we’re past the point where standards and specifications will help us.  Once we have an implementation model we need to pursue it.  If we have several, we need to let market conditions weed them out.  That means that current SDN and NFV standards shouldn’t drive the bus at all, but rather should undertake specific and limited missions to harmonize the multiplicity of approaches being taken by open source.

Specs and standards guide vendor implementations, and it’s clear that in the case of SDN and NFV we are not going to get implementations that fully address the benefit goals of the operators.  We have to start with things that do, and in my own view there is only one that does, which is AT&T’s ECOMP, now part of the ONAP project in the Linux Foundation, along with OPEN-O.  ECOMP provides the total-orchestration model that the ETSI spec and other MANO implementations lack.  That’s true not only for NFV, but also for SDN.

It’s time for a change here, and the thing we need to change to is the new ONAP platform.  The best role for ETSI here would be to map their stuff to ONAP and facilitate the convergence of MANO alternatives with it.  The best role for the ONF would be to do the same with SDN.  Then, we need to get off the notion that traditional standards can ever successfully drive software virtualization projects.

The Role of As-a-Service in Event Processing, and Its Impact on the Network

We seem to be in an “everything as a service age”, or at least in an age where somebody is asserting that everything might be made available that way.  Everything isn’t a service, though.  Modern applications tend to divide into processes stimulated by a simple event, and processes that introduce context into event-handling.  We have to be able to host both kinds of processes, and host them in the right place, and we also have to consider the connection needs of these processes (and not just of the “users” of the network) when we build the networks of the future.

The purpose of an as-a-service model is to eliminate (or at least reduce) the need for specialized hardware, software, and information by using a pool of resources whose result is delivered on demand.  The more specialized, and presumably expensive, a resource is, the more valuable the “aaS” delivery model could be.  The value could come in the cost of the resource, or because analytic processes needed to create the resource would be expensive to replicate everywhere the results are needed.

You can easily envision an as-a-service model for something like calculating the orbit of something, or the path of an errant asteroid in space, but the average business or consumer doesn’t need those things.  They might need to have air traffic control automated, and there are obvious advantages to having a single central arbiter of airspace, at least for a metro area.  On the other hand, you don’t want that single arbiter to be located where the access delay might be long enough for a jet to get into trouble while a solution to a traffic issue was (so to speak) in flight to it.

Which might happen.  The most obvious issue that impacts the utility of the “aaS” option is the economy the resource pool can offer.  This is obviously related not only to the cost of the resource, but also to how likely it is to be used, by how many users, and in what geography.  It’s also related to the “control loop”, or the allowable time between a request for the resource in service form and the delivery of a result.  I’d argue that the control loop issue is paramount, because if we could magically eliminate any latency between request and response, we could serve a very large area with a single pool, and make the “aaS” model totally compelling.

The limiting factor in control loop length is the speed of light in fiber, which is about 120 thousand miles per second, or 120 miles per millisecond.  If we wanted to ensure a control loop no more than 50 milliseconds long, and if we presumed 20 milliseconds for a lightweight fulfillment process, we’re left with 30 milliseconds for a round trip in fiber, or a one-way distance of about 1,800 miles.  A shorter control loop requirement would obviously shorten the distance our request/response loop could travel.  So would any latency introduced by network handling.  As a practical matter, most IoT experts tell me that process control likely can’t be managed effectively at more than metro distances, because there’s both a short control loop requirement and a lot of handling that happens in the typical access and metro network.
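
The arithmetic in that paragraph is easy to play with; here’s the same budget calculation as a small Python function, with the 20-millisecond fulfillment figure carried over as a parameter rather than a fixed assumption:

```python
FIBER_MILES_PER_MS = 120   # roughly the speed of light in fiber

def max_one_way_miles(control_loop_ms, fulfillment_ms):
    """How far away the serving resource can be for a given total budget."""
    round_trip_ms = control_loop_ms - fulfillment_ms
    return (round_trip_ms / 2) * FIBER_MILES_PER_MS

print(max_one_way_miles(50, 20))   # 1800.0 miles, the example above
print(max_one_way_miles(20, 10))   # 600.0 miles: tighter loops pull hosting toward the edge
```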

Still, once you’ve paid the price for access/metro handling and have your request for a resource/service on fiber, you can haul it a long way for an additional millisecond or two.  Twenty milliseconds could get you to a data center in the middle of the US from almost anywhere else in the country, and back again.  That is, in my view, the determining factor in the as-a-service opportunity.  You can’t do everything as a service with that long a control loop, which means that event-driven processes in the cloud or as a part of a carrier service will have to be staged in part to resources nearer the edge.  But with proper software and network design you can do a lot, and the staging that’s needed for resource hosting is probably the driver behind most network changes over the next decade or so.

One obvious truth is that if electrical handling adds a lot to the delay budget, you want to minimize it.  Old-day networks were an electrical hierarchy to mass up traffic for efficient handling.  If fiber is cheap enough, no such massing up is needed.  If we could mesh hosting points with fiber connections, then we could make more seldom-used (and therefore not widely distributed) features available in service form without blowing our control loop budget.

In a given metro area, it would make sense to mesh as many edge hosting points as possible with low-latency fiber paths (wavelengths on DWDM are fine as long as you can do the mux/demux without a lot of wasted time).  I’d envision a typical as-a-service metro host network as being a redundant fiber path to each edge point from a metro data center in the center, with optical add/drop to get you from any edge point to any other with just a couple milliseconds of add/drop insertion delay.  Now you can put resources to support any as-a-service element pretty much anywhere in the metro area, and everything ties back with a low-latency path to a metro (and therefore nearby) data center for hosting processes that don’t require as short a control loop.  You could carry this forward to state/regional and central data centers too.

All this hosting organization is useless if the software isn’t organized that way, and it’s not enough to use “functional” techniques to make that happen.  If the context of an event-driven system has to be determined by real-time correlation of all relevant conditions, then you end up with everything at the edge, and everything has to have its own master process for coordination.  That doesn’t scale, nor does it take advantage of short-loop-long-loop segregation of processes.  Most good event-driven applications will combine local conditions and analytic intelligence to establish conditions in terms of “operating modes”, which are relevant and discrete contexts that establish their own specific rules for event-handling.  This is fundamental to state/event processing, but it also lets you divide up process responsibility efficiently.

Take the classic controlled-car problem.  You need something in the car itself to respond to short-loop conditions like something appearing in front of you.  You need longer-loop processes to guide you along a determined route to your destination.  You can use a long-loop process to figure out the best path, then send that path to the car along with a set of conditions that would indicate the path is no longer valid.  That’s setting a preferred state and some rules for selecting alternate states.  You can also send alerts to cars if something is detected (a traffic jam caused by an accident, for example) well ahead, and include a new route.  We have this sort of thing in auto GPSs today; they can receive broadcast traffic alerts.  We need an expanded version in any event-driven system so we can divide the tasks that need local hosting from those that can be more efficiently handled deeper in the cloud.
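
A toy sketch of that short-loop/long-loop split, with modes and rules invented purely for illustration:

```python
# Long-loop process (deep in the cloud): computes a route plus the conditions that invalidate it.
def plan_route(origin, destination):
    return {
        "mode": "follow-route",
        "route": [origin, "I-80 East", destination],
        "invalidate_if": ["accident-ahead", "road-closed"],
    }

# Short-loop process (in the car): reacts to immediate events within the current operating mode.
def handle_event(plan, event):
    if event == "obstacle-detected":
        return "brake"                  # always handled locally; no cloud round trip
    if event in plan["invalidate_if"]:
        return "request-new-route"      # escalate to the long-loop planner
    return "continue"

plan = plan_route("Chicago", "Cleveland")
print(handle_event(plan, "obstacle-detected"))   # brake
print(handle_event(plan, "accident-ahead"))      # request-new-route
print(handle_event(plan, "light-rain"))          # continue
```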

We also need to be thinking about securing all this.  An as-a-service framework is subject to hacking as much as a sensor would be, though it’s likely easier to secure it.  There is a unique risk with one, though, and that’s the risk of impersonation.  If you have an event-driven system sensitive to external messages, you have a system that can be doctored by spoofing.  Since event processing is about flows, we need to understand how to secure all the flows to prevent impersonation.

As-a-service is critical for future cloud applications, but particularly so for event-driven systems.  By presenting information requirements derived from analytics and not just triggered by a simple event as “services” we can simplify applications and help divide tasks so that we use more expensive edge resources more efficiently.  To get the most from this model, we’ll need to rethink how we network our edge, and how we build applications to use those services optimally.

Amazon Signals a Major Shift in Software and the Cloud

Amazon is making its Greengrass functional programming cloud-to-premises bridge available to all customers, and Nokia has already announced its support on Nokia Multi-Access Edge Computing (MEC) gear.  This is an important signal to the market in the area of IoT, and also a potentially critical step in deciding whether edge (fog) computing or centralized cloud will drive cloud infrastructure evolution.  It could also have profound impact on chip vendors, server vendors, and software overall.

Greengrass is a concept that extends Amazon’s Lambda service outside the cloud to the premises.  For those who haven’t read my blogs on the concept, the Lambda service applies functional programming principles to support event processing and other tasks.  A “lambda” is a unit of program functionality that runs when needed, offering “serverless” computing in the cloud.  Amazon and Microsoft support the functional/lambda model explicitly, and Google does so through its microservices offering.
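
For readers who haven’t seen the model, here’s a minimal, generic sketch of a lambda-style event handler in Python.  The event/context calling pattern mirrors the common serverless convention, but the handler and the local test harness below are hypothetical, not copied from any provider’s documentation:

```python
# A "lambda" is a stateless unit of logic that runs only when an event arrives.
def temperature_alert_handler(event, context=None):
    """Invoked per event; the author of this code provisions and manages no server."""
    reading = event.get("temperature_c", 0)
    if reading > 80:
        return {"action": "shut-down-line", "reason": f"overheat at {reading}C"}
    return {"action": "none"}

# Hypothetical local harness standing in for the cloud (or edge/Greengrass-style) runtime:
events = [{"temperature_c": 72}, {"temperature_c": 91}]
for e in events:
    print(temperature_alert_handler(e))
```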

The challenge that Amazon faced with Lambda was a poster child for the edge/central cloud issue I’ve blogged about (most recently last week).  The most compelling application for Lambda is event processing, including IoT.  Most event processing is associated with what are called “control loop” applications, meaning that an event triggers a process control reaction.  These applications typically demand a very low latency for the obvious reason that if, for example, you get a signal to kick a defective product off an assembly line, you have a short window to do that before the product moves out of range.  Short control loops are difficult to achieve over hosted public cloud services because the cloud provider’s data center isn’t local to the process being controlled.  Greengrass is a way of moving functions out of the cloud data center and into a server that’s proximate to the process.

The obvious message here is that Amazon knows that event-processing applications will need edge-hosting of functions.  Greengrass solves that problem by moving them out of the public cloud, which is good in that it solves the control-loop-length problem and bad in that it denies Amazon the revenue associated with the running of the functions.  To me, this is a clear signal that Amazon eventually wants to offer “edge hosting” as a cloud service, which means that the cloud event-processing opportunity creates such a need, which means that IoT creates it.

There are few credible IoT applications that aren’t related to some kind of event processing since IoT is all about handling sensor-created events.  Thus, a decisive shift toward IoT as a driver of cloud deployments could shift the focus of those deployments to the edge.  This could change a lot of power balances.

In the cloud provider space, edge hosting is problematic because of real estate.  Cloud providers have traditionally focused on a small number of large data centers, not only for reasons of economy of scale in hosting, but to avoid having to acquire facilities in every metro area.  Amazon may be seeing Greengrass as an opportunity to enter the event fray with an “integrated hybrid cloud” approach, where they could license a cloud service that includes the option for premises hosting.  However, facility-based service providers (telcos, ISPs, cablecos, etc.) would have edge-hosting-ready real-estate to exploit, and that could force the traditional cloud providers to look for their own space.

On the vendor side, edge hosting would be a boon to the chip vendors, particularly vendors who focus not on chips for “servers” but chips designed to run the more compute-intensive functional programming components associated with event processing.  The event-cloud model could look like a widely distributed set of compute nodes, requiring what could be millions of new chips.

At the same time, edge hosting divides the chip opportunity, or perhaps even totally changes it.  Functional programming is highly compute-intensive, to the point where strict adherence to its principles would make it a totally compute-driven process.  General-purpose server chips can still execute functional programs, but it’s likely that you could design a function-specific chip that would do better, and be cheaper.

At the server design level, we could see the possibility of having servers made up of more of these specialized chips, either by having dense multi-chip boards or by having a bunch of “micro-boards” hosting a traditional number of chips per board.  The combination would provide an entry point for a lot of new vendors.

This shift would favor (as I pointed out last week) network equipment vendors as providers for “hosting”.  A network-edge device is a logical place to stick a couple compute boards if you are looking for event processing support.  This wouldn’t eliminate the value of and need for deeper hosting, since even event-driven applications have back-end processes that look more like traditional software than like functional programming, but it would make the back-end the tail to the event-edge dog.

On the software side, event-focused application design that relies on functional programming techniques could shift the notion of cloud applications radically too.  You don’t need a traditional operating system and middleware; functional components are more like embedded control software than like traditional programs.  In fact, I think that most network operating systems used by network equipment vendors would work fine.

That doesn’t mean there aren’t new software issues.  Greengrass itself is essentially a functional-middleware tool, and Microsoft offers functional-supporting middleware too.  There are also special programming languages for functional computing (Haskell, Elm, and F# are the top three by most measures, with F# likely having a momentum edge in commercial use), and we need both a whole new set of middleware tools and a framework in which to apply them to distributed application-functional design.

The issues of functional software architectures for event-handling are complicated, probably too complicated to cover in a blog that’s not intended purely for programmers.  Suffice it to say that functional event programming demands total awareness of workflows and latency issues, and that it’s probably going to be used as a front-end to traditional applications.  Since events are distributed by nature, it’s reasonable to expect that event-driven components of applications would map better to public cloud services than relatively centralized and traditional IT applications.  It’s therefore reasonable to expect that public cloud services would shift toward event-and-functional models.  That’s true even if we assume nothing happens with IoT, and clearly something will.

What we can say about functional software impacts is that almost any business activity that’s stimulated by human or machine activity can be viewed as “event processing”.  A transaction is an event.  The model of software we’ve evolved for mobile-Internet use, which is to front-end traditional IT components with a web-centric set of elements, is the same basic model that functional event software implements.  Given that, it’s very possible that functional logic will evolve to be a preferred tool in any application front-end process, whether machine-driven (as with IoT) or human-based.
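
As a hedged illustration of that front-end/back-end split, the sketch below treats a transaction as an event: a stateless front-end function does the latency-sensitive validation and shaping, then hands the result to a traditional back-end process through a queue.  All of the names and fields here are hypothetical.

```python
# Hypothetical sketch: a functional front end for a transaction "event".
# The edge-hosted part is stateless; the traditional back end is reached
# through a queue, represented here by a simple callable.
from typing import Callable, Optional

def validate_order(event: dict) -> Optional[dict]:
    """Pure validation/shaping step, suitable for edge hosting."""
    if event.get("quantity", 0) <= 0 or "sku" not in event:
        return None  # reject malformed transactions at the edge
    return {"sku": event["sku"], "quantity": event["quantity"],
            "customer": event.get("customer", "anonymous")}

def front_end(event: dict, enqueue: Callable[[dict], None]) -> str:
    """Front-end entry point: validate, then hand off to the back end."""
    order = validate_order(event)
    if order is None:
        return "rejected"
    enqueue(order)  # journaling and fulfillment happen deeper in the data center
    return "accepted"

# Usage: front_end({"sku": "A-100", "quantity": 2}, backlog.append), where
# "backlog" stands in for whatever queue feeds the back-end systems.
```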

That means that Amazon’s Greengrass might be a way for Amazon to establish a role for itself in broader IT applications.  Since Amazon (and Microsoft, and Google) also have mobile front-end tools, this might all combine to create a distinct separation of applications between a public-cloud-front-end component set and a traditional data-center-centric back end.  This, and not the conversion of legacy applications to run in the cloud, would then be the largest source of public cloud opportunity.

A “functional cloud” would also have an impact on networking.  If we assume that event processing is a driver for the future of cloud services, then we have to assume that there is a broad need to control the length of the control loop between event source and event process, meaning network latency.  Edge hosting accomplishes that for functional handling that occurs proximate to the event source, but remember that all business applications end up feeding more traditional, deeper processes like database journaling.  In addition, correlation of events from multiple sources has to draw from all those sources, which means that the correlation has to be sited conveniently to all of them and have low-latency paths.  All of this suggests that functional clouds would have to be connected with a lot of fiber, and that “data center interconnect” would become a subset of “function-host-point interconnect”.
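
To see why correlation siting matters, here is a small, purely illustrative calculation: given hypothetical one-way latencies from each event source to a set of candidate hosting points, pick the point that minimizes the worst-case path.  Real placement decisions would weigh far more than this, but the arithmetic shows the shape of the problem.

```python
# Illustrative only: choose a correlation-hosting point that minimizes the
# worst-case latency from any contributing event source (made-up numbers).
latency_ms = {
    # candidate host -> {event source: one-way latency in milliseconds}
    "edge-A":    {"sensor-1": 2,  "sensor-2": 18, "sensor-3": 25},
    "edge-B":    {"sensor-1": 20, "sensor-2": 3,  "sensor-3": 22},
    "metro-hub": {"sensor-1": 9,  "sensor-2": 8,  "sensor-3": 10},
}

def best_site(latencies: dict) -> str:
    """Return the candidate whose slowest source path is fastest overall."""
    return min(latencies, key=lambda site: max(latencies[site].values()))

print(best_site(latency_ms))  # -> "metro-hub"
```

The metro hub wins here because it is conveniently sited to all three sources, which is exactly the argument for heavily fibered function-host-point interconnect.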

Overall, the notion of a function-and-event-driven cloud could be absolutely critical.  It would change how we view the carrier cloud, because it would let carriers take advantage of their edge real estate to gain market advantage.  It’s been validated by all the major public cloud providers, including OTT giant Google.  Now, Amazon is showing how important edge hosting is.  I think it’s clear that Amazon’s decision alone would carry a lot of predictive weight, and of course it’s only the latest step on an increasingly clear path.  The times, they are a-changing.

Which of the Many NFVs are Important?

Sometimes words trip us up, and that’s particularly true in tech these days.  Say we start with a new term, like NFV.  It has a specific technical meaning, but we have an industry-wide tendency to overhype things in their early stages, and vendors jump onto the concept with offerings and announcements that really aren’t strongly related to the original.  Over time, it gets harder to know what’s actually going on with that original concept.  Over time, the question arises whether the “original concept” is really important, or whether we should accept the dynamic process of the market as the relevant source of the definition.  So it is with NFV.

The specific technical meaning of NFV would be “the implementation of virtual function hosting in conformance with the ETSI ISG recommendations.”  Under this narrow definition, there is really relatively little deployment and frankly IMHO little opportunity, but there are some important forks in the road that are already established and will probably be very important.  In fact, NFV will leave a mark on networking forever through one or more of these forks in the evolutionary pathway, and that would be true even if there were never a single fully ETSI-compliant deployment.

One fork is the “agile CPE” fork.  This emerged from the NFV notion of “virtual CPE”, which initially targeted cloud-hosted instances of virtual functions to replace premises-based appliances.  You could frame a virtual premises device around any set of features that were useful, and change them at will.  Vendors quickly realized two things.  First, you needed to have something on premises just to terminate the service.  Second, there were sound financial reasons to think about the virtual hosting as an on-premises device, especially given that first point.

The result, which I’ll call “aCPE”, is a white-box appliance designed to accept loaded features.  These features may be “VNFs” in the ETSI sense, or they may simply be something that can be loaded easily, following perhaps a cloud or container model.  aCPE using a simple feature-loading concept could easily be a first step in deploying vCPE; you could upgrade to the ETSI spec as it matured.  aCPE-to-vCPE transformation would also prep you for using the cloud instead of that agile device, or using a hybrid of the two.
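
Here is a hedged sketch of that simple feature-loading concept, assuming the aCPE box can run containers and using the Docker SDK for Python purely as a stand-in for whatever loader a real white-box platform would expose.  The feature names, image names, and port mappings are all hypothetical.

```python
# Sketch: loading a feature onto an agile CPE box as a container.
# Assumes Docker is available on the appliance; all names are illustrative.
import docker

FEATURES = {
    # feature name -> (hypothetical image, port mapping)
    "firewall": ("example/acpe-firewall:1.2", {"8080/tcp": 8080}),
    "sdwan":    ("example/acpe-sdwan:2.0",    {"4500/udp": 4500}),
}

def load_feature(name: str) -> str:
    """Pull and start a feature container on the box; return its container ID."""
    image, ports = FEATURES[name]
    client = docker.from_env()
    client.images.pull(image)
    container = client.containers.run(
        image, detach=True, ports=ports, name=f"acpe-{name}",
        restart_policy={"Name": "always"})
    return container.id

def unload_feature(name: str) -> None:
    """Stop and remove a feature, freeing the box for a different feature mix."""
    client = docker.from_env()
    container = client.containers.get(f"acpe-{name}")
    container.stop()
    container.remove()
```

The point of the sketch is that the same loader logic could later target a cloud host instead of the local box, which is the aCPE-to-vCPE (and then cloud) path described above.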

Most of what we call “NFV” today is a form of aCPE.  Since it would be fairly wasteful to demand all of the cloud-hosted elements, including “service chaining”, when all your functions are in the same physical box, most of it doesn’t conform to the ETSI spec.  I suspect that over time it might, provided that a base of fully portable VNFs emerges from the ongoing ETSI activity.

Another form is the “virtual device instance”, which I’ll call vDI.  Virtual functions are features that are presumably somewhat dynamic.  The vDI is a replacement for a network device, not an edge feature, and so it’s much more like a hosted instance of device software.  A “virtual router” is usually a vDI, because it’s taking the place of a physical one and once it’s instantiated it behaves pretty much like the physical equivalent.

Perhaps the most significant attribute of the vDI is that it’s multi-service and multi-tenant.  You don’t spin up a set of Internet routers for every Internet user, you share them.  The same is true of the vDIs that replace the real routers.  There are massive differences between the NFV model of service-and-tenant-specific function instantiation and a multi-tenant vDI model, and you can’t apply service-specific processes to multi-tenant applications unless you really do want to build a separate Internet for everyone.
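
The operational difference is easier to see in a sketch.  Below, with purely hypothetical classes, the NFV-style model creates one function instance per customer service, while the vDI model runs one shared instance that keeps per-tenant state, much the way a real router keeps per-VPN routing tables.

```python
# Illustration of the two models; classes and methods are hypothetical.
from typing import Dict, List

class PerServiceInstance:
    """NFV-style: one function instance deployed per customer service."""
    def __init__(self, customer: str):
        self.customer = customer
        self.routes: Dict[str, str] = {}

def deploy_per_service(customers: List[str]) -> List[PerServiceInstance]:
    # N customers means N deployments and N lifecycles to orchestrate.
    return [PerServiceInstance(c) for c in customers]

class MultiTenantVDI:
    """vDI-style: one shared virtual device with per-tenant tables."""
    def __init__(self):
        self.tenant_routes: Dict[str, Dict[str, str]] = {}

    def add_tenant(self, customer: str) -> None:
        # Adding a tenant is a table entry, not a new deployment.
        self.tenant_routes.setdefault(customer, {})
```

In the per-service model, orchestration work scales with the customer count; in the multi-tenant model there is one lifecycle and shared capacity, which is why NFV’s service-specific processes don’t map cleanly onto it.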

Issues notwithstanding, we’re starting to see some activity in the vDI space, after a number of tentative beginnings.  Brocade’s Vyatta router software (now acquired by AT&T) was an early vDI, to some extent subsumed into the NFV story.  However, vDI is really more like a “cloud router” than a virtual network function.  I believe that most of the IMS/EPC/5G software instances of functionality will really be vDIs because they’ll deploy in a static configuration.  In the 5G space, the NFV ISG seems to be taking up some multi-tenant issues in the context of its 5G network slicing work, but it’s too early to say what that will call for.

The real driver of vDI, and also perhaps a major driver for NFV, depends on reshaping the role of the lower OSI layers.  The original OSI model (the Basic Reference Model for Open Systems Interconnection) was created in an age where networking largely depended on modems, and totally depended on error-prone electrical paths.  Not surprisingly, it built reliable communications on a succession of layers that dealt with their own specific issues (physical-link error control at Layer 2, for example).  In TCP/IP and the Internet, a different approach was taken, one that presumed a lot of lower-level disorder.  Neither really fits a world of fiber and virtual paths.

If we were to create, using agile optics and virtual paths/wires, a resilient and flexible lower-layer architecture, then most of the conditions that we now handle at Levels 2 and 3 would never arise.  We could segment services and tenants below, too, and that combination would allow us to downsize the Level 2/3 functionality needed for a given user service, perhaps even for the Internet.  This would empower the vDI model.  Even a less-radical rethinking of VPN services as a combination of tunnel-based vCPE and network-resident routing instances could do the same, and if any of that happens we could have an opportunity explosion here.  If the applications were dynamic enough, we could even see an evolution from vDI to VNFs, and to NFV.

Another version of NFV that’s emerging is what could be called “multi-layer orchestration”, which I’ll call “MLO” here.  NFV pioneered the notion of orchestrated software automation of a virtual function deployment lifecycle, which was essential if virtual network functions were to be manageable in the same way as physical network functions (PNFs).  However, putting VNFs on the same operational plane as PNFs doesn’t reduce opex, since the overall management and operations processes are (because the VNFs mimic the PNFs in management) the same.  The best you can hope for is to keep opex the same.  To improve opex, you have to automate more of the service lifecycle than just the VNFs.

MLO is an add-on to NFV’s orchestration, an elevation of the principle of NFV MANO to the broader mission of lifecycle orchestration for everything.  A number of operators, notably AT&T with ECOMP, have accepted the idea that higher-layer operations orchestration is necessary.  Some vendors have created an orchestration model that works both for VNFs and PNFs, and others have continued to offer only limited-scope orchestration, relying on MLO features from somewhere else.  Some OSS/BSS vendors have some MLO capability too.
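
As a rough sketch of what orchestration-for-everything implies, the fragment below treats every element of a service, whether realized as a VNF, a PNF, or a nested model, as the same kind of object: an intent with states, events, and child elements, so a single lifecycle engine can drive the whole service.  This is a hypothetical structure for illustration, not ECOMP’s or any vendor’s actual data model.

```python
# Hypothetical sketch of a model-driven MLO element; not any real schema.
# Every element, VNF-hosted or PNF-based, exposes the same state/event
# surface to the lifecycle engine.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class ModelElement:
    name: str
    realization: str                      # e.g. "VNF", "PNF", or "nested-model"
    state: str = "ordered"
    children: List["ModelElement"] = field(default_factory=list)
    # (state, event) -> handler; the handler returns the next state.
    handlers: Dict[Tuple[str, str], Callable[["ModelElement"], str]] = field(default_factory=dict)

    def on_event(self, event: str) -> None:
        handler = self.handlers.get((self.state, event))
        if handler:
            self.state = handler(self)
        for child in self.children:       # propagate lifecycle events downward
            child.on_event(event)

# A VPN whose access leg is a PNF router and whose firewall is a VNF:
vpn = ModelElement("vpn", "nested-model", children=[
    ModelElement("access-router", "PNF"),
    ModelElement("edge-firewall", "VNF"),
])
vpn.handlers[("ordered", "activate")] = lambda element: "active"
vpn.on_event("activate")
```

In a structure like this, MANO-style VNF deployment becomes just one kind of handler inside a broader, uniform lifecycle model, which is the relationship between NFV orchestration and MLO.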

NFV plus MLO can make a business case.  MLO, even without NFV, could also make a business case and in fact deliver a considerably better ROI in the first three or four years.  Add that to the fact that there is no standard set of MLO capabilities and no common mechanism to coordinate between MLO and NFV MANO, and you have what’s obviously fertile ground for confusion and misinformation.  You also have a classic “tail-or-dog” dilemma.

NFV drove our current conception of orchestration and lifecycle management, but it didn’t drive it far enough, which is why we need MLO.  It’s also true that NFV is really a pathway to carrier cloud, not an end in itself, and that carrier cloud is likely to follow the path of public cloud services.  That path leads to event-driven systems, functional programming, serverless computing, and other stuff that’s totally outside the realm of NFV as we know it.  So, does NFV have to evolve under MLO and carrier cloud pressure, or do we need to rethink NFV in those terms?  Is virtual function deployment and lifecycle management simply a subset of MLO?  This may be something that the disorderly market process I opened with ends up deciding.

Perhaps it’s not a bad thing that we end up with an “NFV” that doesn’t relate all that much to the original concept.  Certainly, it would be better to have that than to have something that stuck to its technical roots and didn’t generate any market-moving utility.  I can’t help but think that it would still be better to have a formal process create the outcome we’re looking for, though.  I’m pretty sure it would be quicker, and less expensive.  Maybe we need to think about that for future tech revolutions.

Open source seems to be the driver of a lot of successes, and perhaps of a lot of the good things circulating around the various NFV definitions.  Might we, as an industry, want to think about what kind of formal process is needed to launch and guide open-source initiatives?  Just saying.

Dissecting the Details of the Carrier Cloud Opportunity

The “carrier cloud” should be the real focus of transformation.  For operators, it epitomizes the shift from connectivity technology to hosting technology, and for vendors a change from network equipment to servers and software.  Things like SDN and NFV are not goals; they are important only insofar as they can be linked to a migration to the carrier cloud.

I’ve been working hard at modeling the carrier cloud trend, using a combination of operator data and my own “decision model” software.  I’ve shared some early work in this area with you in prior blogs, but I have more refined data now and more insights on what’s happening, so it’s time to take a deeper look.  What I propose to do is to set the stage in a couple of paragraphs, and then look at carrier cloud evolution over the next couple of decades.

There are six credible drivers of carrier cloud.  NFV, largely in the form of virtual CPE (vCPE), is the one with the longest history and the most overall market interest.  Virtualization of advertising and video features is the second.  The third is the mobile broadband evolution to 5G, and the fourth is network operators’ own cloud computing services.  Driver five is “contextual” services for consumers and workers, and six is the Internet of Things.  These drivers aren’t terribly important in terms of where they take carrier cloud; their overall strength determines market timing, and their relative strength determines what’s most likely to be successful in any period.

We have a notion, set by “cloud computing”, that the carrier cloud is cloud computing owned by carriers.  The truth is more complicated, and fluid.  There is a strong drive toward hosting not in centralized cloud data centers, but in distributed edge points.  Edge cloud, so to speak, doesn’t depend on economies of scale but on the preservation of a short control loop.  If edge beats out central cloud, then we might see Linux boards in devices providing the majority of the hosting horsepower, and that would favor network vendors (who are already at the edge) over server vendors.

So the net is that the “where” of carrier cloud determines the shape of the actual technology, and there are two basic options: edge-centric and centralized.  Were we to find that something like operator cloud computing services was a near-term dominant driver, we could expect to see more centralized deployments, followed by a migration toward the edge.  If any other driver dominates, then the early impetus is for edge hosting, and that will tend to sustain an edge-centric structure over time.  Got it?  Let’s look, then, at “time” in intervals to see what my model says.

The period between now and 2019 is the most formative, in the sense that it sets the architecture of deployment (edge versus center) and offers the best chance for a driver, and a vendor, to dominate.  In this period, we could have expected to see NFV emerge as the dominant driver because it had a major head start, but NFV has failed to generate any significant carrier cloud momentum.  What is actually creating the biggest opportunity is the consumer video space.

Over half of all carrier cloud opportunity through 2019 is video-content related.  A decent piece of the early opportunity in this period is for edge caching of content, associated with improving video QoE while at the same time conserving metro mobile backhaul bandwidth.  Over time, we’ll see customization of video delivery, in the form of shifting from looking for a specific content element to socializing content among friends or simply communities of interest.  The mobile broadband connection to this creates the second-most-powerful driver, the mobile evolution driven by 5G issues; not yet a true 5G convergence but rather the impact of 5G requirements on 4G.  Nothing else really matters in this critical period, including NFV.

Between 2020 and 2022 we see a dramatic shift.  Instead of having a single opportunity driver that represents half of all opportunity for carrier cloud, we have four opportunities that have over 15% contribution, and none have more than 24%.  Cloud services, contextual services, and IoT all roughly double their opportunity contribution.  Video-content and 5G evolution continue to be strong, but in this period 5G opportunity is focused increasingly on actual deployment and video content begins to merge into being a 5G application.

This is the period where the real foundation of carrier cloud is laid, because it’s where process hosting begins to separate from content hosting.  The vCPE activity drops off as a percentage of total opportunity, and at this stage there is still no broad systemic vision to drive NFV forward in other applications.  IoT and contextual services, the most direct drivers of process hosting, begin to gather their strength, and other drivers are nearing or at their high-water mark, excepting video content which is increasingly driven by personalization, socialization, and customization of both ads and QoE.

The next period, 2023 to 2025, is the beginning of the real carrier cloud age; in this phase, process and application hosting clearly dominate carrier cloud opportunity.  Video-content services, now largely focused on socialization and personalization rather than caching, join with contextual services and IoT to reach 20% opportunity contributions, while most other areas begin to lose ground.  But it is also in this phase that NFV concepts finally mature and take hold, and most NFV applications beyond vCPE are fulfilled.  This marks the end of the period where NFV specifications and tools are significant drivers of opportunity.

The largest number of new carrier cloud data centers are added, according to my model, in 2025.  In part, this is because of the multiplicity of opportunity drivers and in part because the economy of scale in edge data centers allows for a reduction in service prices, which drives new applications into the carrier cloud.

The period to 2025 is the final period where my model retains any granular accuracy.  Beyond that, the model suggests that operator cloud services and IoT will emerge as the drivers of 30% or more of data center growth.  In this period, point-of-activity empowerment for workers and IoT control processes that focus attention on the edge are the primary drivers of process hosting.  Operators, with ample real estate at the network edge, will continue to expand their hosting there.  From 2025 onward, it appears that process hosting applications become more generalized, less tied to a specific driver, and that this is the period when we can truly say that carrier cloud is the infrastructure strategy driving operator capex.

Carrier cloud, in a real sense, wins because it throws money at the places where a good return is earned.  That’s hosted experiences and content.  What my model doesn’t address is the question of whether fundamental transformation of the network—the stuff that creates end-to-end connectivity—will happen, and how that might impact things like SDN and NFV as substitutes for network equipment as we know it.  I’m not yet able to model this question because we have no organized approach to such a transformation, and my buyer-decision model needs things buyers can actually decide on.

Broadly, it appears that you could reduce connection network capex by about 20% through the optimum use of SDN and NFV.  It also appears that you could reduce connection network opex by about 30%, beyond what you’d already achieve in process opex through service lifecycle automation.  However, getting to this state would demand a complete rethinking of how we use IP, the building of new layers, and the elimination of a lot of OSI-model-driven sacred cows.  All that could be done only if we assumed we had a reason to displace existing technology before its lifecycle had ended, and that will be hard to find, particularly given that we’re going to ramp up a 5G investment cycle before there’s much chance of promoting a new network model.

5G could transform access networking by creating RF tail connections on fiber runs to neighborhoods, and these same RF points could also support mobile 5G access.  A total convergence of mobile and fixed metro networks would likely follow, and since consumer video is the dominant traffic driver for both, we’d expect to see hosting (in the form of CDN) move outward.  We could see “the Internet” become more of a thin boundary between CDNs and access networks, which would of course open the door for total transformation of IP networks.

What that means in practical terms is that the best shot SDN and NFV have to transform how we build connection networks is the period up to about 2022, when 5G investment peaks.  If nothing happens to how we do metro networking, then transformation of the connection architecture of future networks will come about naturally, because of the edge hosting of content and processes and the radical shift from traditional trunking to data center interconnect as the driver of bulk bandwidth deployment.  That means there is only a very short period in which to put connection-network transformation into the picture as a driver of change, but even were it to be fully validated, it lacks the opportunity value to significantly accelerate carrier cloud deployment.  All it could do is make network vendors more important influences on the process, particularly Cisco, which has both connection and server assets.

If you are not a network equipment vendor but instead a server provider, chip company, or software house, then the carrier cloud bet says forget the network and focus on hosted processes and the software and hardware dynamics associated with that.  For example, given that in the cloud the notion of functional computing (lambdas, microservices, etc.) and dynamic event-driven processes is already developing, these would seem likely to figure prominently in process hosting.  That’s a different cloud software architecture, with different hardware and even chip requirements.

IT players might also want to consider that there is zero chance that process hosting in the carrier cloud wouldn’t impact enterprise IT.  If edge-distributed computing power is good, then either enterprises have to accept cloud providers owning the edge, or they have to think about how to host processes close to the edge themselves (think Amazon’s Greengrass).  The latter would mean rethinking “server farms” as perhaps “server fog”, and that would impact server design (more CPU and less I/O) and also the software, from operating system and middleware to the virtualization/cloud stacks.  And of course, every software house would have to recast its products in process-agile terms.

It’s also fair to say that discussions of SDN and NFV would be more relevant if they addressed the needs of carrier cloud and the future revenues it generates, rather than focusing on current services that have no clear revenue future.  NFV would also benefit from thinking about the way that functional programming, function hosting, and “serverless” computing could and should be exploited.  It makes little sense to look for NFV insight in the compute model that’s passing away, rather than the one that will shape the future.

Vendors have their own futures to look to.  Carrier cloud will create the largest single new application for servers, the largest repository for the hosting of the experiences that drive Internet growth.  It’s hard to see how that critical issue I opened with—central or edge—won’t determine the natural winners in the race to supply the new compute technology, and vendors need to be thinking ahead to what the carrier cloud issue will create in terms of opportunity (and risk) for them.  I think Cisco’s recent announcements of a shift in strategy show that they, at least, know that things are changing.  They (and others) have to prove now that they know where those changes will take the industry.

In all, I think carrier cloud reality proves a broader reality, which is that we tend to focus on easy and familiar pathways to success rather than on plausible pathways.  There is an enormous amount of money at stake here, and enormous power in the industry.  It will be interesting to see which (if any) vendors really step up to take control of the opportunity.