What’s Really Needed to “Simplify” NFV

Intel says it will simplify NFV by creating reference NFVIs.  Is there a need for simplification with NFV, and does Intel’s move actually address it in the optimum way?  It depends on what you think NFV is and what NFVI is, and sadly there’s not full accord on that point.  It also depends on where you think NFV is going—toward more service chaining or toward 5G and IoT.  In the ETSI NFV E2E model, “NFVI”, or NFV Infrastructure, is something that hosts virtual functions.  It lives underneath a whole set of components, and it’s really those components, in particular the Virtual Infrastructure Manager (VIM), that frame the relationship between hardware and NFV.  We can’t start with NFVI; we have to look at things from the top.

If there’s NFV confusion, it doesn’t start with NFVI because logically the infrastructure itself should be invisible to NFV.  Why that’s true is based on the total structure.  It’s difficult to pull a top-down vision from the ETSI material, but in my view the approach is fairly straightforward.  Virtual Network Functions (VNFs) are the hosted analog of Physical Network Functions (PNFs), which are the devices we already use.  The goal of NFV is to deploy these VNFs and connect the result to current management and operations systems and practices.  Referencing a prior blog of mine, there is a hosting layer and a network layer, and the goal of NFV is to elevate a function from the hosting layer into the network layer, as a 1:1 equivalent of a real device that might otherwise be there.

If that is the mission of NFV, then the role of Management and Orchestration (MANO) is to take what might be a device that consists of multiple functions and is defined as a “service chain” and convert that into successive hosting interactions.  With what?  The VIM.  MANO doesn’t do generalized orchestration.  The VNFM’s role is to harmonize VNF management with management of PNFs.  Service lifecycles are out of scope, as is the control of the PNFs that remain.  In this model, a VIM is the cloud/virtualization software stack that deploys an application on infrastructure.  Many in the NFV ISG say that the VIM is OpenStack.

Logically, an operator would want to be able to deploy VNFs on whatever hosting resources it found optimal, including resources from open architectures like the Open Compute Project or Telecom Infrastructure Project.  Logically, they’d want to be able to embrace servers based on different chips (we just saw an announcement of a server partnership between Nvidia and the biggest server players, focusing on AI), and many think operators should be able to use something other than OpenStack as a VIM (see below).  We should be able to have portions of our resource pool focused on accelerated data plane, and portions focused on high compute power.  We should have what’s needed, in short.

Logically, we should be thinking of a VIM as being (in modern terms) an “intent model”, exposing standard SLAs and interfaces to MANO and implementing them via whatever software and hardware resources were desired.  If this is the goal, then it seems like a lot of the NFVI confusion is really a symptom of an inadequate VIM model.  If the goal of a VIM is to totally abstract the implementation of the deployment platform and underlying hardware, then there can’t be “confusion” in the NFVI because it’s invisible.  If there is confusion then it should be resolved by the implementation of the VIM.
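To make that abstraction concrete, here’s a minimal Python sketch of a VIM as an intent model.  The class names, SLA fields, and values are all illustrative assumptions, not anything from the ETSI specifications; the point is only that MANO codes against one interface, and the stack behind it can be swapped without MANO knowing.

```python
from abc import ABC, abstractmethod

class VirtualInfrastructureManager(ABC):
    """Intent-model facade: MANO sees only this interface and the SLA,
    never the stack (OpenStack, VMware, etc.) behind it."""

    @abstractmethod
    def deploy(self, vnf_descriptor: dict) -> str:
        """Deploy a VNF; return an opaque instance handle."""

    @abstractmethod
    def sla(self) -> dict:
        """Expose the SLA terms this domain commits to."""

class OpenStackVIM(VirtualInfrastructureManager):
    def deploy(self, vnf_descriptor):
        # a real implementation would call Nova/Neutron APIs here
        return f"os-{vnf_descriptor['name']}"
    def sla(self):
        return {"availability": 0.999, "dataplane": "standard"}

class VMwareVIM(VirtualInfrastructureManager):
    def deploy(self, vnf_descriptor):
        # a real implementation would call vSphere/vCloud APIs here
        return f"vmw-{vnf_descriptor['name']}"
    def sla(self):
        return {"availability": 0.9995, "dataplane": "accelerated"}
```

MANO would hold a reference typed as `VirtualInfrastructureManager` and never know, or care, which stack fulfills the deployment.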

This happy situation is compromised if you assume that there has to be one and only one VIM.  If the NFVI isn’t homogeneous, or if the operator elects to use virtualization software other than OpenStack, you end up with the problem of having all the options supported inside one VIM, which means that somehow a VIM would have to be vendor-neutral.  What vendor will provide that?  Is it a mandate for an open-source implementation of NFV?  Not yet.

I think that it’s particularly critical to be able to use VMware’s solutions rather than depend exclusively on OpenStack.  VMware is widely used, favored by many operators, and solid competition in the virtualization/cloud-stack space would be very helpful to network operators.  A VMware-modeled VIM would be helpful to NFV overall.

Whatever the motive behind VIM multiplicity, having more than one VIM means having some means of selecting which VIM you use.  If there is indeed diverse NFVI, then you need a mechanism to decide what part of the diverse pool you plan to use.  This isn’t complicated to do in theory if you have the right implementation, but a requirement to do it would have complicated the work of the NFV ISG, and they elected not to address the issue.  I’ve called this a “resource domain” problem, a requirement that an intent model representing hosting/deployment contain a sub-model structure that can then link to the right deployment stack.
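As a sketch of what that sub-model selection might look like, here’s a toy “resource domain” in Python that picks a deployment stack by matching advertised properties against a request.  The VIM names, properties, and exact-match rule are all invented for illustration; a real selector would also weigh cost, location, and load.

```python
# Each sub-model advertises its properties; the parent "resource domain"
# model decomposes a deployment request to the matching stack.
vims = [
    {"name": "openstack-dc1", "stack": "OpenStack", "dataplane": "standard",    "site": "dc1"},
    {"name": "vmware-dc2",    "stack": "VMware",    "dataplane": "accelerated", "site": "dc2"},
]

def select_vim(requirements, pool=vims):
    """Return the first VIM whose advertised properties meet the request."""
    for vim in pool:
        if all(vim.get(key) == value for key, value in requirements.items()):
            return vim
    raise LookupError("no VIM in the resource domain satisfies " + str(requirements))

print(select_vim({"dataplane": "accelerated"})["name"])  # vmware-dc2
```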

The VIM issues are bad enough when you consider the problem of deploying a VNF in the right place and right way, but they’re a lot worse when you consider redeployment.  Suppose I need a server resource with a widget to properly host a given VNF.  It’s certainly a problem if the only way I can make that available is to put it on all servers, because my VIM can’t select a server with it from a diverse pool.  But imagine now that my server breaks and there’s no other widget-equipped server in the same location.  I now have to reconfigure the service to route the connection to a different data center.  This is almost surely going to require configuring not only virtual switch ports for OpenStack, but configuring WAN paths.  Remember, I don’t have that capability because my MANO is focused on deploying the virtual elements and not setting up the WAN.

Intel could address this problem, but not with reference NFVIs.  What they needed to do was to hand it over to Wind River (part of Intel) and ask them to frame an open architecture for a VIM that had that internal, nested-models, capability needed to control diverse infrastructure using diverse virtualization software.  That would be a huge step forward, not only for Intel but for NFV overall.  It would also, of course, tend to open up the NFVI market, which may not be in Intel’s interest.

The need to have a very agile approach to managing virtual infrastructure goes beyond just different implementations of cloud hosting or CPUs.  Nokia has recently announced AirGile, what it calls a “cloud-native” model for NFV hosting that incorporates many of the attributes of functional/lambda/microservice programming that I’ve been talking about.  If you want to be truly agile with stateless elements for 5G and IoT (which is what Nokia is aiming at) then you need to be a lot more efficient in deployment, scaling, and redeployment.  Taking advantage of the AirGile model means having statelessly designed VNFs.  If we’re going to do that, we should rethink some of the use cases for NFV as well as the management model.

Vendors, including Intel and Nokia, clearly have their own visions of how NFV should work.  Add these to the multiplicity of open-source solutions, many with strong operator support and it shows that things aren’t going to converge on a single NFV model any time soon.  That means we have to be able to assess the relative merits of each approach, and the only way to fully understand or assess NFV is to take it from the top, in the most general case, and explore the implications.  I think that the biggest problem NFV had was starting from the bottom.  The second-biggest problem was excessive fixation on a single use case, virtual CPE.  Top-down, all-use-cases, is the way to go.

NFV’s problems can be solved, and in fact there are proposals in various forms and venues to do that.  One candidate is ONAP, and the first of several pieces explaining why can be found HERE.  Certainly ONAP needs to be tested, in particular in terms of how its use of TOSCA can address the modeling needs.  Is it best?  What’s needed is to explore the capabilities of these solutions in that general case I noted, testing them against the variety of service configurations and mixtures of VNF and PNF, and over a range of deployment/redeployment scenarios.  If we do that, we can ensure that all the pieces of NFV fit the mission, and that we simplify the process of onboarding VNFs, infrastructure, and everything else.  The ETSI ISG probably won’t be the forum for that to happen, and the use-case focus that has biased the ISG is also biasing other activities.  We may have to wait for broader NFV applications (like 5G, as Nokia suggests, and IoT) to emerge and force a more general approach to the problem.

Is Fog/Edge Computing Coming Into Its Own?

We seem to be entering the fog again, at least in computing, and in a PR sense.  Just this week we’ve had a half-dozen media stories on fog and edge computing, but none of them really look at the issue fully.  There are, for certain, a lot of drivers for moving compute functionality to the edge, but is this nothing more than cloud computing with metro data centers?  Applications, and the benefits they generate, are the real drivers of technology change.  What specific drivers will control the fog/edge, and what do they demand?

Putting compute resources close to the points of user interaction, data collection, and process control has the benefit of improving response time or shortening the control loop from condition to action.  Almost everyone in the cloud space knows that.  Where we are still seeing a debate, at least in the implicit media sense, is over the specific application model.

Cloud applications for enterprises today tend to use the cloud as an elastic front-end, absorbing the web and mobile interface issues and prepping transactions from many different sources (often in different forms) for funneling into traditional online transaction processing (OLTP) applications.  The problem with this application model as a driver to fog/edge computing is that “response time” for the user will still depend on the back-end OLTP applications.  Sure, you can perhaps paint a form a bit faster with front-end components close to the user, but the dialog the user is looking for isn’t accelerated overall.

The fact that most OLTP applications won’t benefit from fog/edge means that most applications that involve direct human-system interactions won’t directly benefit either.  We have to look at M2M or IoT if we want to create fog/edge justification, but even there, it’s not necessarily easy.

Suppose I have an M2M application that accepts a sensor stimulus, sends a message to a system, and activates a process.  Is that just an OLTP application with the human removed?  The answer would depend on just what process I was activating, and that raises the big question with the vision of fog/edge computing.  For OLTP to benefit from fog/edge, we have to move the application itself to the edge.  Most businesses are years, even decades, from transferring all their core business applications into the cloud.  It may happen, but certainly not in the career lifetimes of people who bet their promotions on fog/edge success in the near term.

We know from industrial automation history that the average process control application isn’t a core business application in the traditional sense.  We already have M2M in factories, and we already depend on local processes to convert sensor data to action.  Is factory automation then the killer app for the fog/edge proponents?  No, because the events and the things being controlled are in a contained space and, in any event, are already being process-connected today.

Home automation doesn’t offer much more.  Home control isn’t a sub-second response problem, and increasingly people are putting more smarts in their homes, in the form of local controllers, to handle things like scheduling actions.  The home market is also very price-sensitive, which means that it would be easier to exploit fog/edge deployment with it than to drive that deployment in the first place.

If you believe in the media model for IoT, then in my view you are saying that fog/edge will never happen.  Self-driving cars, sensing a light turn yellow, come to a stop.  Yes, but do they do that because the car has polled the intersection sensor for light state and received a “Yellow” response, all of which would go through the network?  Every intersection would get hundreds of polls, and somebody would run a DDoS attack on the sensor and nobody would be able to move a car anywhere on the road.  We need to accept that IoT generates a whole new world of event-driven processes, and it’s those processes that would justify the fog/edge deployment.  Those processes and the events they handle also have very specific demands, and if we can’t meet the demands we can’t make fog/edge work.

Some events, generated by closed-loop process control systems and home automation, are much like transactions.  They are significant to a single process source: you emit them, process them, and do something as a result.  Most of the new-model events are not transactions.  They are significant to a wide and variable scope of processes, they may be directly emitted or emitted by a process as the result of another event, and you may do nothing other than pass them along.  We can’t frame event processing in the traditional way, and if you don’t believe that, think of the cloud providers like Amazon, Google, and Microsoft, who developed a new cloud model and even programming languages to accommodate event-driven systems.

Event-driven processes are the key to fog/edge, and to really address event-driven requirements we need to think of two “spaces”, an “event-space” and a “process-space”.  The event-space is a series of collectors and duplicators that distribute events.  It would have to be low-latency and so likely based at least in part on edge computing.  Events move in the event-space.  The process-space is a series of function points, places where you can host simple functions that intercept events (from the event-space) in specific ways.

The event-space isn’t just a pipeline or fabric for events, though.  The “collectors” and “duplicators” are all standard processes that are hosted there for the purpose of generating what would essentially be a series of event trees.  That’s because IoT events are events and not discrete polling responses.  A traffic light changes because it changes, not because all the cars ask for the current state.  In the real world, the sensor would feed a process, probably one of a number designated to be the source of traffic information for a given number of square blocks.  This process would then distribute secondary events.  For example, the process might poll the light to see what state it was in, and then generate “light-is-changing” events.  Those secondary events would also flow into the event-space.

Events that had to feed multiple chains of activity might then be fed into a duplicator, which you could think of as a kind of high-performance publish-and-subscribe or a multicast process.  This new stream of events might intersect with other duplicators or with processes, or both.  The meshing of events with processes is a complex ballet, not only because it will demand a kind of “function routing” that aims an event at the destination process, but because that process might not yet live anywhere in particular, so the process-space will have to instantiate it in what’s effectively the path of an event.  Which, of course, could be influenced by where the process could be put.
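A toy sketch of the two spaces, assuming nothing beyond the description above: a “duplicator” is a subscription list, and a “collector” is a process that consumes a raw sensor event and emits secondary events back into the event-space.  All names and event formats here are hypothetical.

```python
from collections import defaultdict

class EventSpace:
    """Toy event-space: duplicators are subscription lists; processes
    hosted in the process-space register against event topics."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> [process, ...]

    def subscribe(self, topic, process):
        self.subscribers[topic].append(process)

    def emit(self, topic, event):
        # the "duplicator": fan one event out to every interested process
        for process in self.subscribers[topic]:
            process(event)

space = EventSpace()
log = []

# A collector process: consumes a raw sensor event and emits a
# secondary "light-is-changing" event, building the event tree.
def traffic_collector(event):
    space.emit("light-is-changing", {"intersection": event["id"], "to": event["state"]})

space.subscribe("raw-sensor", traffic_collector)
space.subscribe("light-is-changing", lambda e: log.append(e))

space.emit("raw-sensor", {"id": "5th-and-main", "state": "yellow"})
```

The sensor fires once; every interested process sees the secondary event without ever polling the light.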

Event processing is serverless not because it simplifies cloud billing, but because you can’t have persistent processes managing transitory things.  You can’t have traditional middleware, or traditional programming languages, or much of anything else either.  The challenge of the cloud, and of fog/edge computing in particular, is coping with events, and that’s going to be the priority of the IT industry for many years to come.  The good news is that we’re starting to recognize just how different things have to be, and working to support the differences.  The bad news is that there’s still a lot to do.  When it’s done, we may find many applications beyond IoT changed forever.

Considering the Layers of Service Modeling and Automation

A lot of important relationships in networking aren’t obvious, and that is the case with the relationship between management system boundaries, models, and elements of infrastructure.  Those relationships can be critical in service lifecycle management, which in turn is critical to sustaining operator profit-per-bit and driving or supporting “innovations” like SDN or NFV.  In my last blog I talked about the general trend in modeling services, and here I want to talk about the relationship between legacy management concepts and evolving notions of infrastructure and service lifecycle automation.

Networks have always been made up of discrete elements, “trunks” and “nodes”.  Trunks are the physical media that carry traffic, and nodes are the switching/routing points that create end-to-end connectivity.  Even in the day of SDN and NFV, you still end up having trunks and nodes, though the nodes may be hosted or virtual rather than real devices.

In the old days, you used to have to control nodes directly and individually, using a command-line interface (CLI).  Over time, the complexity of that task became clear, and management systems evolved to simplify the process.  Since the purpose of nodes was to create end-to-end connectivity, one of the management innovations was to provide tools to create the kind of cooperative node behavior that was essential in building routes through a network.  In the OSI world, this built a management hierarchy—element management, then network management, and finally service management.

Three layers of management may simplify management tasks by standardizing them, but the layers themselves can be a bit complicated.  To make matters worse, packet networks have almost always been built to offer adaptive behavior, learning topology to find addresses and also learning about conditions so that failures can be accommodated.  Adaptive behavior generally takes place inside areas of the network which in routing are known as autonomous systems (ASs).  So we have elements, networks, and ASs that all have to be factored in.

SDN and NFV add in their own stuff because SDN has a controller that handles route management on the devices under its jurisdiction, and NFV has to deploy and manage the virtual resources that host network functions.  There is, in both SDN and NFV, a world “inside” a virtual box that has to be managed, so that creates another element.

We can’t keep listing all these things, so let’s adopt some terms here that are generally congruent with industry practices.  Let’s say that a group of resources that work cooperatively are a domain.  A network consists of one or more domains, and a domain has external behaviors that end up being something like what our old friends, nodes and trunks, would create.  Domains are ingrained in current practices, and so it’s helpful to retain the concept as long as it doesn’t run at odds with progress.

The structure that we’ve built here is multi-layered by both nature and design.  We have domains that interconnect with each other, and each of these are “networks”.  Underneath the domains we have perhaps another domain layer (the NFV resource pool) that has to be managed to push up virtual functions into the domain layer.  Above this, we have services that coordinate behavior across the domains.

This structure seems to naturally break down into an intent-model hierarchy.  A network might be several domains, which means that a network “model” could be said to decompose into multiple domain models.  Those in turn might decompose into models that define the process of “elevation” that makes nodes created by hosting something on a resource pool visible and functional.

One obvious question here is whether a given intent/domain model models the domain and all its services, or whether it models a specific service on a domain level.  Do we have an “IP VPN” service as well as perhaps a half-dozen others, exposed by an IP domain, or do we have separate domain models for each of the services, so that every domain that offers IP VPNs has a separate intent/domain model for it? There is no reason why you couldn’t do either of these two things, but the way that the higher-layer processes worked would obviously be different, one manipulating the same intent/domain models for all services, and the other manipulating service-specific models.
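The two options can be sketched as data structures; the service names and SLA values are purely illustrative.

```python
# Option 1: one intent model per domain, exposing all its services.
domain_model = {
    "domain": "ip-core",
    "services": {
        "ip-vpn":          {"sla": {"availability": 0.999}},
        "internet-access": {"sla": {"availability": 0.99}},
    },
}

# Option 2: one intent model per (domain, service) pair.
service_models = [
    {"domain": "ip-core", "service": "ip-vpn",          "sla": {"availability": 0.999}},
    {"domain": "ip-core", "service": "internet-access", "sla": {"availability": 0.99}},
]

# The higher-layer process works differently in each case:
sla_v1 = domain_model["services"]["ip-vpn"]["sla"]    # navigate inside one model
sla_v2 = next(m for m in service_models
              if m["service"] == "ip-vpn")["sla"]     # pick the service-specific model
assert sla_v1 == sla_v2  # same intent, different model structure
```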

Generally, network services (those that are made up of real or virtual nodes and trunks and offer connectivity) will manage cooperative behavior via a management API.  An intent/domain model would envelop this API and use it to coerce the properties it needs from the underlying resources.  We can say that network domains are represented by management APIs, and conversely we could say that a domain is created by every set of resources that have their own management API.

Resource domains, meaning hosting domains, are different because there is no inherent functionality presented by one.  You have to deploy some virtual network function onto such a domain, and when you do you essentially create a node.  You also have to connect network functions to each other, and the outside world, so we could say that a hosting domain builds virtual boxes that become nodes.  When this happens, the virtual boxes should be managed at the network level the way that the real boxes would be.

This, I think, is a useful model exercise.  What we’ve done is to say that the “network” is always nodes and trunks, and that the management of the network should look like node/trunk management with real boxes as the nodes.  The hosting domain is responsible for node-building, and that process will always create a virtual box that looks like and is managed like some real box, and also will create ports on that node that link to real network trunks.

This can be loosely aligned with the ETSI model, in that they tend to cede management of the VNFs at the “functional” level to traditional management elements, but it’s less clear what the mapping is between these virtual nodes and the implementation down in the hosting domain.  This is where the principles of intent modeling are helpful; you can assume that the virtual nodes are intent models that expose the management properties of the physical network functions they replace, and that the vertical implementation of the VNFs in the hosting domain harmonizes to these PNF management properties.

The problem with this approach is the agility of virtual functions.  Real nodes are somewhere specific, and so the topology of a network is firmly established by node locations; there’s nowhere trunks can go and still be connected to something.  With virtual functions, any hosting point can support any type of node.  When you decide where to host something you create the equivalent of the PNF in the networking domain.  But you have to decide, and when you do you have to trunk to that point to connect.  This makes VNF-for-PNF static linkage difficult because the trunk connections to a PNF would be in place, period.  For a VNF you have to build them.

It would seem that this argues for at least one other layer of abstraction, at least where you have to mingle VNFs and PNFs.  A better approach is top-down, which is to say that you compose a service, map it to the fixed and hosting topologies, and then push down the VNF pieces to the hosting layer.  This might suggest that there are three broad layers—a service domain, a network domain, and a hosting domain.  At least, in a functional sense.

The purpose of a service domain would be to break out the service into functional pieces, like “Access” and “VPN”.  The network domain might then map those pieces to a service topology that identifies the management domains that have to be involved, and the hosting domain then hosts the VNFs as needed, and makes the connections—the “trunks” and anything internal to the VNF hosting process.

This latter point raises its own issues.  You can’t host functions without having connectivity, not only among the functional elements but also with the management framework itself.  The industry-standard approach for this from the cloud computing side is the subnet model, where an application (in the NFV case, a virtual network function) is deployed in a specific IP subnet, which in most cases is based on the RFC 1918 private IP address space.  That means that the elements can talk with each other but not with the outside world.  To make something visible (to make it into a trunk connection), you’d associate its private address with a public address.  Docker, the most popular container architecture, works explicitly this way, and presumably so do VM architectures like OpenStack.
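A small sketch of the subnet model using Python’s standard-library ipaddress module; the addresses and the single exposed port mapping are illustrative, in the spirit of Docker-style port publishing rather than any specific NFV implementation.

```python
import ipaddress

# VNFs deploy into an RFC 1918 private subnet; they can reach each
# other but are invisible to the outside world.
vnf_subnet = ipaddress.ip_network("10.0.7.0/24")
vnf_addrs = {"firewall": "10.0.7.10", "router": "10.0.7.11"}

# sanity check: every VNF address really lives inside the private subnet
assert all(ipaddress.ip_address(a) in vnf_subnet for a in vnf_addrs.values())

# Only what becomes a "trunk connection" gets a public mapping:
# (public_addr, public_port) -> (private_addr, private_port)
port_map = {("203.0.113.5", 8443): (vnf_addrs["router"], 443)}
```

Everything not in `port_map` stays private, which is exactly the invisibility the subnet model is meant to provide.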

Putting something in a subnet is one thing, but creating a service chain could be something else again.  A service chain is an ordered sequence of functions, which means that the chains are created by tunnels that link the elements.  Traditional container and cloud software doesn’t normally support that kind of relationship.  You could surely set up a tunnel from the individual functions, but in the first place they probably don’t know the order of functions in the chain, and second they probably don’t set up tunnels at all today; they expect to have their stuff connected from the outside.  You could create tunnels at Level 2 with a virtual switch and something like OpenStack, but does that mean that we need to host service chains in a virtual LAN at Level 2?  OpenStack could also be used at Level 3 to create tunnels, of course, providing you had some white-box switches that knew Level 3.
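Whatever layer the tunnels live at, the orchestrator has to derive the ordered links from the chain definition, since the functions themselves don’t know their neighbors.  A minimal sketch, with invented function names:

```python
# A service chain is an ordered function list; the "tunnels" are links
# that an external orchestrator must stitch together, because the
# functions don't know their position in the chain.
chain = ["classifier", "firewall", "nat"]

def tunnels_for(chain, ingress="port-in", egress="port-out"):
    """Return the ordered (from, to) links the orchestrator must build."""
    hops = [ingress] + chain + [egress]
    return list(zip(hops, hops[1:]))

print(tunnels_for(chain))
# [('port-in', 'classifier'), ('classifier', 'firewall'),
#  ('firewall', 'nat'), ('nat', 'port-out')]
```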

In address terms, this isn’t the end of it.  You generally won’t want to make the composed elements of a service visible and accessible except by the management processes.  The user of a service has an address space that the high-level service fits into.  The service provider will have a subnet address space for the virtual functions.  They may also have an address space for the virtual trunks.  Finally, the management structure, particularly for the hosting domain, will need an address space to connect to the lifecycle management processes.  One of the things that’s essential in the management and modeling of services is accounting for all these address spaces.  For example, we have to deploy VNFs into an address space.  We also have to connect the virtual devices that VNFs create using an address space.  Thus, we have to be able to manage these spaces with the modeling steps.

Finally, hosting adds complexity to both address space management and optimization.  Remember that you can host a function wherever you have resources.  What’s the best place, then?  How do you factor in things like the location of other non-portable elements of the service?  Not to mention questions of not creating single points of failure or violating regulatory policies by putting a service in or through a location where there might be governing regulations to worry about.

You can’t push these kinds of decisions down into OpenStack, because the issues are at a higher level.  In the real world, with limitations on the capacity of a single OpenStack domain, you have to at least divide hosting by domains.  You have to connect across data centers, where a single domain won’t have both ends or all the components.  We risk devaluing both SDN and NFV if we don’t think about the bigger picture here.

Of Networks, Management Scope, Modeling, and Automation

Service lifecycle automation is absolutely critical to operator transformation plans, but frankly it’s in a bit of a disorderly state.  Early on, we presumed that services were built by sending an order to a monolithic system that processed the order and deployed the necessary assets.  This sort-of-worked for deployment, but it didn’t handle service lifecycles at all, much less automate them.  A better way is needed.

One trend I see emerging in software lifecycle automation is what I’ll call representational intent.  The concept, as applied to networks, dates back more than a decade to the IPsphere Forum, where “services” were divided into “elements”, and the implementation of elements was based on defining a software agent that represented the element.  Manipulate the agent and you manipulated the element.  The importance of the concept can be seen in part through its history, and in part by hypothesizing its future.

The “representative” notion here is important because services and service elements are created through coerced cooperative behaviors, and while the elements are individually aware of what they’re cooperating to do, the awareness typically stops at an element boundary.  An access network knows access, but it doesn’t know it’s a part of a VPN.  There has to be some mechanism to introduce service-wide awareness into the picture, and if the elements themselves aren’t aware of services then we need to create representations of them that can be made aware.

This all leads to the second term, the “intent”.  An intent model is inherently representational; it abstracts a variety of implementations that are functionally equivalent into a single opaque construct.  Manipulating the intent model, like manipulating any representational model, manipulates whatever is underneath.  Structuring intent models structures the elements the models represent and adds the service-awareness that’s essential for effective lifecycle automation.

The seminal IPsphere was absorbed by the TMF, and whether there was a connection between the absorption and the notion of Service Delivery Frameworks (SDF) and NGOSS Contract or not, these two concepts picked up the torch within the TMF.

SDF was explicitly a representational intent approach.  A service was composed from elements represented by an SDF model, and the model enveloped not only the elements but the management strategy.  An SDF element might be totally self-managed, committing to an SLA and operating within itself to meet it.  That’s what most would say is the current thinking on an intent-modeled domain; it fixes what it can and passes everything else off to the higher level.  Where SDF broke from strict “intent-ness” is that other models of management, where the “intent model” exposed a management interface to be used from the outside rather than self-remediating, were also supported.

There were a lot of concerns about the SDF approach from the operator side.  I recall an email I got from a big Tier One, saying “Tom, you have to understand that we’re 100% behind implementable standards, but we’re not sure this is one.”  I undertook, with operator backing, the original ExperiaSphere project, which was a Java implementation of a representational intent approach.

In this first-generation ExperiaSphere, the intent model representations were created by a software implementation, meaning that the service context was explicitly programmed into a “service factory” that emitted an order template.  That template, when resubmitted to the factory, filled the order and sustained the service.  I presented the results to the TMF team.

Meanwhile the NGOSS (Next-Generation OSS) project was approaching things from a slightly different angle.  SDF and (first-generation) ExperiaSphere both represented elements of a service with parallel-plane software processes.  The NGOSS Contract approach represented them instead as a data model.  Lifecycle processes are of course tickled along their path by events, and what NGOSS contract proposed was to define, in the service contract, the linkages between a state/event representation of the behavior of a service/element and the processes that handle the events.
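The NGOSS Contract idea can be sketched in a few lines: the contract data model carries a state/event table, and lifecycle automation is just looking up the process bound to the current (state, event) pair.  The states, events, and handlers here are invented for illustration, not taken from the TMF specification.

```python
# Handler processes, each advancing the service lifecycle.
def on_activate(ctx): ctx["state"] = "deploying"
def on_deployed(ctx): ctx["state"] = "active"
def on_fault(ctx):    ctx["state"] = "remediating"

# The state/event table the contract would carry: it binds each
# (state, event) pair to the process that handles it.
state_event_table = {
    ("ordered",   "activate"):        on_activate,
    ("deploying", "deploy-complete"): on_deployed,
    ("active",    "fault"):           on_fault,
}

def handle(contract, event):
    """Drive the lifecycle: look up and run the handler bound in the contract."""
    handler = state_event_table.get((contract["state"], event))
    if handler:
        handler(contract)
    return contract["state"]

svc = {"state": "ordered"}
handle(svc, "activate")         # -> "deploying"
handle(svc, "deploy-complete")  # -> "active"
```

The key design point is that the behavior lives in the data model, not in monolithic code: change the table and you change the lifecycle, without touching the handler processes.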

In parallel with the TMF work, the enterprise space was focusing on automating application deployment, and the initial thrust was to have the development team create a set of deployment instructions that could be handed off to operations, hence the term “DevOps”.  DevOps emerged with two broad models—the “imperative” model that enhanced the notion of “scripting” to enter commands into systems management, and the “declarative” model that defined an end-state description that was then used to drive deployment to match that state.  DevOps originally handled only deployment, but it has been enhanced to add the notion of “events”, which can be used to add at least some lifecycle management.  I won’t go into DevOps here beyond saying that the declarative approach is evolving much like intent models, and that any lifecycle manager could invoke DevOps at the bottom to actually control resources.
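The declarative model can be sketched as a reconciliation loop: you state the end-state, and the tool computes whatever actions close the gap.  Real declarative tools are of course far richer; everything below is a toy illustration of the principle only.

```python
# Minimal sketch of declarative DevOps: describe the desired end-state,
# then drive the current state to match it.  Component names are invented.

desired = {"web": 3, "db": 1}   # desired instance counts per component
current = {"web": 1}            # what is actually running right now

def reconcile(desired, current):
    """Return the actions needed to make current match desired."""
    actions = []
    for component, want in desired.items():
        have = current.get(component, 0)
        if want > have:
            actions.append(("start", component, want - have))
        elif want < have:
            actions.append(("stop", component, have - want))
    return actions

# An event (say, an instance crash) simply changes `current`; rerunning
# reconcile() is the event-driven enhancement described above.
```

Contrast this with the imperative model, where the operator would instead maintain an explicit ordered script of start/stop commands and have to edit it for every change in conditions.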

The TMF model also spawned two activities, the CloudNFV initiative in the ETSI NFV ISG, and my second-generation ExperiaSphere project.  Second-generation ExperiaSphere expanded on the NGOSS Contract notion, framing an explicit data-model architecture to define services and manage the service lifecycle.  CloudNFV took a different path because the core logic was provided using software developed by EnterpriseWeb, and that brought the capability of using very dynamic object modeling, not of a deployment but of the service itself.  From the service structure and the dependencies of its elements, a dynamic model is created.

The CloudNFV model, which has been greatly enhanced in service lifecycle support, has been the focus of a number of TMF Catalysts, where it won awards.  The ten-thousand-foot summary of the idea is that service elements are represented by objects, each of which is onboarded and described in terms of its dependencies—interfaces of some sort, for example.  Services are created by collecting the objects in the structure of the service itself, not in the structure of a parallel model.  The objects are their own model, and in parallel they’re a kind of programming language that gets executed when a lifecycle event occurs.
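The “objects are their own model” notion can be sketched very loosely: each service element declares the interfaces it provides and those it depends on, and assembling the objects into a valid structure is assembling the service.  Nothing here reflects EnterpriseWeb’s actual implementation; it is my own illustrative reduction of the concept.

```python
# Hedged sketch: service elements as self-describing objects whose
# collection, wired by dependency matching, IS the service model.

class ServiceObject:
    def __init__(self, name, provides=(), requires=()):
        self.name = name
        self.provides = set(provides)   # interfaces this element exposes
        self.requires = set(requires)   # interfaces it depends on

def build_service(objects):
    """Validate that every required interface is provided by some object."""
    offered = {i for o in objects for i in o.provides}
    missing = {i for o in objects for i in o.requires} - offered
    if missing:
        raise ValueError(f"unresolved dependencies: {missing}")
    return objects   # no parallel model: the collection itself is the service

access   = ServiceObject("access",   provides={"raw-ip"})
firewall = ServiceObject("firewall", provides={"clean-ip"}, requires={"raw-ip"})
service  = build_service([access, firewall])
```

The contrast with the hierarchical-data-model approach is that there is no separate authored structure to keep in sync with reality; a lifecycle event is handled by the objects that make up the service.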

You can see, looking over all of this, that we’re evolving our notion of representing services.  There are four pathways out there.  First, you can have parameters that describe things, processed as a transaction.  This is the traditional OSS/BSS model, and there’s a good deal of that flavor still in SDN and NFV specifications.  Second, you can have a set of intent models that are “authored” into a service through programming or through a GUI.  Many of the current implementations of service lifecycle management fall into this category.  Third, you can have a data modeling architecture that defines a service deployment as a set of hierarchical intent models, and finally you can have a service that’s defined by its natural “real-world” structure, without external modeling or assembly.

With the exception of the first of these approaches, everything is a form of representational intent, which is why I opened with the concept.  The debate in the market, in implicit form only at this point since we don’t even have a broad understanding that there are options available, is how to represent the representational intent.  We can program or author it, we can data-model it, or we can let the natural assembly of service elements define their own model.  Which is best?

The only long-term answer to that question is going to be given by experience, which we’ve had precious little of so far.  To get it, we need to follow a service from a gleam in some marketing type’s eye to the financial realization of the opportunity by widespread deployment and use.  There are some general points that seem valid, though.  The lower-numbered approaches are adequate for services that have few elements and are largely static in their composition, particularly when the element domain management (see below) is self-contained.  The more dynamism and scale you introduce, the more you need to think about moving toward the higher-level models.

Element domain management is important to the service management approach because many service lifecycle systems are really manager-of-manager technologies for the good reason that there are already tools to manage router implementations of VPNs or deployment of virtual functions.  DevOps tools offer at least some level of element domain management for deployed components, providing that they support event-handling (which the major ones do) and that they’re used carefully.  The real role of declarative DevOps as a broad cross-application-and-service strategy is another question mark in this space, but I think it will evolve into one of the four approaches I’ve outlined, most likely the third.  DevOps models not a service or application but a process of deployment, which makes its approach similar to a hierarchical model.

Given the importance of a specific approach to lifecycle automation, and the fact that there are clear impacts of any approach on the kinds of services that could be deployed, more clarity is needed here.  I think that every vendor who purports to have orchestration/automation should be asked to frame their solution in one of the four solution ranges I presented here.  I’m going to do that for any that I blog about, so if you want to brief me on your solution expect to have that question asked.  It’s time we started getting specific (and more useful) in how we look at these products, because understanding their fundamental approach is the only pathway to understanding their fundamental value.

Would Savings from NFV or Lifecycle Automation Fund Innovation?

SDxCentral raised an interesting point in an article on how Nokia thinks operators would use savings created by virtualization and automation.  The point is that operators, having saved on both opex and capex with these strategies, would then spend more on innovation.  I believe that the potential for this shift exists, but I also think there are some barriers that would have to fall to realize it.  The biggest, perhaps, is facing exactly what NFV really is.

One of the problems with the save-here-to-spend-there approach is that, according to the operators’ own CFOs, the savings that have been proposed for virtualization and automation don’t stand up to close examination.  In one of my sweeps of CFO attitude, I found that none had seen a credible demonstration of net savings.  Strategies aimed at capex reduction didn’t consider the fact that the alternative infrastructure almost certainly created additional operations expense.  Strategies aimed at opex reduction didn’t correctly estimate even the current opex costs, much less what could be saved.

Part of this problem is the effect of the media on claims and research, which I’ve irreverently described as a “Bull**** bidding war.”  One vendor says “I can demonstrate a savings of 15%!”  The reporter goes to a competitor and says “Charlie over there says he can save 15%, what can you save?”  Now this competitor knows darn well that either they beat Charlie’s number, or the story is going to be about Charlie.  What do you suppose happens?

The bigger factor, though, is the fact that you cannot even attempt a credible estimate of the cost of a network unless you understand in detail how that network is built.  We say “adopt SDN” or “adopt NFV”, but does that mean you do everything with those technologies?  We know that SDN and NFV will have a limited impact on fiber or access technology, but how limited?  Is the impact in other areas limited too?  We can’t know unless we understand just what areas of the network would really be influenced.

On the opex side, I’ve never seen a use case or report that cited how operations costs are actually distributed, or even what they are.  One common problem is to take the entire “operations” portion of a carrier and assume it’s network equipment.  Hey, guys, they have a bunch of expenses like real estate, vehicles, and so forth that don’t even represent direct network costs.  OAM&P costs run to about 64 cents on every revenue dollar, but most of that doesn’t have any connection with network operations and can’t be addressed by automation.

The good news is that while most of the numbers are just smoke, the reality is that there is considerable opportunity to create savings.  My own estimates put the achievable goal at about 12% of capex and between a third and a half of opex, and the result of that combination would exceed the total network capex budget of some operators.  You could buy a lot of innovation with that.

That raises the second point, though.  What exactly does spending on innovation mean?  Is innovation even monetizable?  If you get a windfall savings of perhaps 15 to 18 cents on every revenue dollar, do you run out into the street and scatter money?  You’d invest in something that offered a good return, and “good return” to a CFO means noticeably above the return of legacy infrastructure and services.  What that “something” might be isn’t easy to determine.

A massive investment in innovation would mean a massive shift in infrastructure architecture, say from spending on boxes that create connection services to servers and software that create experiences.  Historically, operators see this kind of shift as being guided by some standards initiative, aimed at defining the architecture and elements in an open way.  Like, one might say, 5G.

5G is a poster-child for the issues of network innovation.  Intel calls it “the next catalyst.”  We are years along in the effort.  We’ve defined all of the architectural goals.  We are just now starting to see people talking about the business case for pieces like network slicing.  How do you get to this point without knowing what the benefits were going to be?  Innovation has to mean more than “doing something different.”

It’s easy to slip from “benefit” to “feature”.  There are a lot of things that next-gen infrastructure could do, but it’s far from clear that all of them (or even any of them) offer a high-enough ROI to meet CFO requirements.  In the case of 5G, we know that higher customer speeds and cell capacity, FTTN/5G combinations to enhance wireline service delivery, unification of wireline/wireless metro architecture to eliminate separate evolved packet core (EPC), and some aspects of network slicing have at least credible benefits.

Credible, but are they compelling?  Most people would agree that an “innovation transformation” would shift much more focus to hosting and data centers.  My work on carrier cloud, drawing on the input of about 70 operators, shows that all of 5G would drive only about 12% of potential carrier cloud data center deployment.  The biggest factors in carrier cloud deployment are IoT, personalization of advertising and video, and advanced cloud computing services.  We should then look to architectures for each of these.

We actually have them.  The big OTT players like Amazon, Google, Microsoft, Twitter, and Uber have all framed architectures to deal with the kind of thing all of these true carrier cloud drivers will require.  All we need to do is to frame them in the context of carrier cloud, which should actually not be that difficult.  The thing that I think has made it challenging is that it’s software-driven.  In fact, it would be accurate to say that all “innovation” in the network operator space is really about transformation to a software-driven vision of technology.

Software-centric planning is hard for operators, and you don’t have to look further than SDN, NFV, or 5G to see that.  None of these initiatives were done the way software architects would have done them; we fell back on hardware-standards thinking.  The problem with the drivers of the carrier cloud is that there’s no real hardware-centricity to fall back on.  How do you approach these drivers if you don’t have a software vision?

Traditional NFV plays, including the open-source solutions, have a problem of NFV-centricity in general, and in particular a too-literal adherence to the ETSI ISG’s end-to-end model.  Most are incomplete, even for the specific issues of NFV, and can’t drive enough change to really make a business case on their own.  There are players now emerging that are doing better, but the problem we have now is that all “orchestration” or “NFV” or “intent modeling” represents is a claim.  Like, I might say, “innovation”.  Perhaps what we need to do first is catalog the space, look at how software automation of services has evolved and how the solutions have divided themselves.  From there, we can see what, and who, is actually doing something constructive.  I’ll work on that blog for next week.

Exploiting the New Attention NFV is Getting

You might be wondering whether perhaps NFV is getting a second wind.  The fact that Verizon is looking at adopting ONAP, whose key piece is rival AT&T’s ECOMP, is a data point.  Amdocs’ ONAP-based NFV strategy is another.  Certainly there is still interest among many operators in making NFV work, but we still have two important questions to answer.  First, what is NFV going to be, and do?  Second, what does “work” mean here?

NFV does work, in a strict functional sense.  We have virtual CPE (vCPE) deployed in real customer services.  We have some NFV applications in the mobile infrastructure space.  What we don’t have is enough NFV to make any noticeable difference in operator spending or profit.  We don’t have clear differentiation between NFV and cloud computing, and we don’t have a solid reason why that differentiation should even exist.  We won’t get those things till we frame a solid value proposition for “NFV” even if it means that we have to admit that NFV is really only the cloud.

Which it is.  At the heart of NFV’s problems and opportunities is the point that its goal is to host some network features in the cloud.  That by rights should be 99% defining feature hosting as a cloud application and 1% handling whatever special requirements arise that might demand more than public cloud tools would provide.  What are the differences?  These are the things that have to justify incremental NFV effort, or justify cloud effort to expand the current thinking on cloud computing to embrace more of NFV’s specific mission.

The biggest difference between a cloud application and an NFV application is that cloud applications don’t sit in a high-volume data plane.  The cloud hosts business applications and event processing, meaning what would look more like control-plane stuff in data networking terms.  NFV’s primary applications sit on the data plane.  They carry traffic, not process transactions.

Traffic handling is a different breed of application.  You cannot, in a traffic application, say that you can scale under load, because adding a parallel pathway for data to follow invites things like out-of-order arrivals.  Doesn’t TCP reorder?  Sure, but not all traffic is TCP.  You have to think a lot more about security, because traffic between two points can be intercepted and you could introduce something into the flow.  Authenticating traffic on a per-packet basis is simply not practical.

NFV applications probably require different management practices, in part because of the traffic mission we just noted, and in part because there are specific guarantees (SLAs) that have to be met.  Many network services today have fairly stringent SLAs, far more stringent than you’d find in the cloud.  You can’t support hosting network functions successfully if you can’t honor SLAs.

So, we have proved that you do need something—call it “NFV”—to do what the cloud doesn’t do, right?  I think so, but I also think that the great majority of NFV is nothing more than cloud computing, and that the right course would be to start with that and then deal with the small percentage that’s different.  We’ve not done that; much of NFV is really about specifying things that the cloud already takes care of.  Further, at least some of those “NFV things” really should be reflected in cloud improvements overall.  Let’s look at some of the issues, including some that are really cloud enhancements and some that are not, to see what our second-wind NFV would really have to be able to address if it’s real.

Cloud deployment today is hardly extemporaneous.  Even to deploy a single virtual function via cloud technology would take seconds, and an outage on a traffic-handling connection that’s seconds long would likely create a fault that would be kicked up to the application level.  There are emerging cloud applications that have similar needs.  Event processing supposes that the control loop from sensor back to controller is fairly short, probably in the milliseconds and not seconds.  So how do we deploy a serverless event function in the right place to handle the event, given that we can’t deploy an app without spending ten times or more the acceptable time?

Networks are run by events, even if traffic-handling is the product.  Clouds are increasingly aimed at event processing.  What makes “serverless” computing in the cloud revolutionary isn’t the pricing mechanism, it’s the fact that we can run something on demand where needed.  “On demand” doesn’t mean seconds after demand, either.  We need a lot better event-handling to make event-based applications and the hosting of network functions workable.

Then there’s the problem of orchestrating a service.  NFV today has all manner of problems with the task of onboarding VNFs.  We have identified at this point perhaps a hundred discrete types of VNF.  We have anywhere from one to as many as about a hundred implementations for a given type.  None of the implementations have the same control-and-management attributes.  None of the different types of VNF have any common attributes.  Every service is an exercise in software integration.

But what about the evolving cloud?  Today we have applications that stitch components together via static workflows.  The structure is fixed, so we don’t have to worry excessively about replacing one component.  Yet we already have issues with version control in multi-component applications.  Evolve to the event-chain model, where an event is shot into the network to meet with an appropriate process set, and you can see how the chances of those appropriate processes actually being interoperable reduce to zero.  The same problem as with NFV.

Then we have lifecycle management.  Cooperative behavior of service and application elements is essential in both the cloud and NFV, and so we have to be able to remediate if something breaks or overloads.  We have broad principles like “policy management” or “intent modeling” that are touted as the solution, but policy management and intent modeling are, at this point, nothing more than “broad principles”.  What specific things have to be present in an implementation to meet the requirements?

Part of our challenge in this area is back to those pesky events.  Delay a couple of seconds in processing an event, and the process of coordinating a response to a fault in a three-or-four-layer intent model starts climbing toward the length of an average TV commercial.  Nobody likes to wait through one of those, do they?  But I could show you how just that kind of delay would arise even in a policy- or intent-managed service or application.

There is progress being made in NFV.  We have an increased acceptance of the notion that some sort of modeling is mandatory, for example.  We have increased acceptance of the notion that a service model has to somehow guide events to the right processes based on state.  We even have acceptance of an intent-modeled, implementation-agile, approach.  We still need to refine these notions to ensure that they’ll work at scale, handling the number of events that could come along.  We also need to recognize that events aren’t limited to NFV, and that we have cloud applications evolving that will be more demanding than NFV.

My net point here is that NFV is, and always was, a cloud application.  The solutions to NFV problems are ultimately solutions to broader cloud problems.  That’s how we need to be thinking, or we risk having a lot of problems down the line.

Comcast is Signaling a Sea Change in the SD-WAN Space

Comcast has started to push in earnest at business services with SD-WAN, and they’re far from the only player in the space.  In fact, one question now being raised is whether the future of SD-WAN will be tied more to service providers than to CPE products bought directly by enterprises, or to managed service providers.  That question also extends to the broader area of vCPE, which then ties in with NFV.  Service-provider SD-WAN is also a means of linking SDN services to the user, and even of linking enterprise management systems with WAN services.

There are a lot of ways of offering business services, and the one that’s dominated for decades is the “virtual private network” (VPN) at Level 3 or the “virtual LAN” or VLAN at Level 2.  Both these service types have been deployed largely by adding features to native routers and switches (respectively) that allow network segmentation.  These “device-plus” features provide low overhead, but they also impact the native behavior of the protocol layer they work at, and that can create cost, compatibility, and management issues.

SD-WAN is an overlay technology, meaning that it’s created on top of L2/L3 (usually the latter) network services.  The nodes of the service provider’s networks see SD-WAN as traffic, just like all other traffic, and that’s true even where SD-WAN is overlaid on VPN/VLAN services.  Many SD-WAN services extend traditional VPN/VLAN services by spreading a new network layer on top of both VPN/VLAN and Internet services.
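The overlay principle reduces to encapsulation: the SD-WAN edge wraps a site’s packet in an outer header addressed across whatever underlay (Internet, MPLS VPN) is available, so the underlay nodes see only ordinary traffic.  The toy sketch below is my own illustration; real SD-WAN encapsulations, tunnel protocols, and edge names are considerably more involved.

```python
# Toy sketch of an SD-WAN overlay: wrap, carry, unwrap.  All names invented.

def encapsulate(inner_packet, overlay_src, overlay_dst, underlay="internet"):
    """Wrap a site's packet for transport over an underlay service."""
    return {
        "outer": {"src": overlay_src, "dst": overlay_dst, "underlay": underlay},
        "payload": inner_packet,   # the original L3 packet, untouched
    }

def decapsulate(frame):
    """The remote SD-WAN edge strips the outer header."""
    return frame["payload"]

pkt = {"src": "10.1.1.5", "dst": "10.2.2.9", "data": "app traffic"}
frame = encapsulate(pkt, "edge-nyc", "edge-sfo")
```

Because the underlay only ever sees the outer header, the same service-level network can ride on an MPLS VPN at one site and plain Internet at another, which is exactly the uniformity argument made below.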

Service providers like the telcos have had mixed views of SD-WAN from the first.  Yes, it could offer an opportunity to create business services at any scale, to leverage Internet availability and pricing, and to unify connectivity between large sites and small sites, even portable sites.  The problem is that SD-WAN services can be deployed by MSPs and the users themselves, over telco Internet services, and so cannibalize to at least a degree the traditional “virtual-private” LAN and network/WAN business.  Comcast isn’t an incumbent in VPN/VLAN services so they have no reason to hold back.  In fact, they could in theory offer SD-WANs that span the globe by riding on competitive Internet services.

Once you have a bunch of telcos who face SD-WAN cannibalization from competitors like Comcast, from MSPs, and even from enterprises rolling their own VPNs, you pose the question of whether it’s better, if you’re going to lose VPN/VLAN business, to lose it to your own SD-WAN or to someone else’s.  Obviously your own is better, at least once it’s clear that the market is aware of the SD-WAN alternative.  That could mean that all the network operators will get into the SD-WAN space for competitive reasons alone.

If network operators decide, as Comcast has, to compete in the SD-WAN space, it makes little sense for them to squabble about in the dirt on pricing alone.  They would want to differentiate, and one good way to do that (again, a way Comcast has used) is by linking their SD-WAN service to underlying network features, which most often will mean QoS control, but also likely includes management capability.  That promotes a cooperative model of SD-WAN to replace the overlay model.  To understand how that works, you’d have to look at the SD-WAN service from the outside.

A service like SD-WAN has the natural capacity to abstract, meaning that it separates service-level behavior from the resource commitments that actually provide connectivity.  An SD-WAN service looks like an IP VPN, without any of the stuff like MPLS that makes VPNs complicated, and regardless of whether IP/MPLS or Internet (or any other) transport is used.  You can provide service-level management features, you can do traffic prioritization and application acceleration, and it’s all part of the same “service”, and it’s the same whatever site you happen to be talking about.  This uniformity is a lot more valuable than you might think at a time when businesses spend on the average about 2.7 times as much on network support as they do on network equipment.

The general trend in SD-WAN has been to add on features like application acceleration and prioritization, and those additions beg a connection to network services that would offer variable QoS.  An SD-WAN service with that traffic-expediting combination is a natural partner to operator features.   The management benefits of SD-WAN can also be tied to management of the underlying WAN services, which is a benefit both in user-managed and managed service provider applications.

SD-WAN prioritization features are also a camel’s nose for NFV’s virtual CPE (vCPE) model.  A unified service vision at the management level means it’s easier to integrate other features without adding undue complexity, and so it encourages buyers to think in modular feature terms, playing into the vCPE marketing proposition.  If operators could promote an SD-WAN model that relied on elastic cloud-hosted features for vCPE rather than a general-purpose premises box as is the rule today, they could end up with a service model that neither MSPs nor direct buyers of SD-WAN could easily replicate.  Since linking their SD-WAN service to network prioritization features is also something that third parties can’t do easily, this can create a truly unique offering.  Differentiation at last!

Of course, everyone jumps on differentiation, so all this adds up to the possibility, or probability, that SD-WAN will be increasingly dominated by network operators who exploit network features under the covers to differentiate themselves.  That’s been clear for some time, and it’s why the players in the crowded SD-WAN startup market are trying so hard to elevate themselves out of the pack.  There will be perhaps four or five that will be bought, and four or five times that number exist already.

There is little or no growth opportunity for business VPNs that require carrier Ethernet access and MPLS.  Big sites of big companies are about it, and in any business total addressable market (TAM) is everything.  Add that truth to the two differentiating paths of SD-WAN for network operators (linkage to network services including SDN and linkage to NFV hosting of features) and you have the story that will dominate the future of SD-WAN.  Which means that every SD-WAN startup had better understand how to tell that story or they’ll have no exit.

In the second half of 2018 we’ll probably start to see the signs of this in the SD-WAN space, with fire-sale M&A followed by outright “lost-funding” exits.  There are way too many players in the space to sustain when the market is going to focus on selling to network operators, and startups have only a limited opportunity to prepare for that kind of SD-WAN business.  There’s only one hope for them to avoid this musical-chairs game, and it’s government.

No, not government market, though that does present an opportunity.  Regulators, if they were to allow for settlement and paid prioritization on the Internet, would create an SD-WAN underlayment that anyone could exploit.  That would keep SD-WAN an open opportunity and prevent the constriction in opportunity to network operators that will drive consolidation.  The question is whether it could happen fast enough.  Even in the US, where regulatory changes have been in the wind since January, it will almost surely take more than six months to get something new in place.  Elsewhere it could be even longer, and operators like Comcast aren’t waiting.  If the big operators get control of SD-WAN before regulatory changes gel, it will be too late for most of the SD-WAN players.  So, if you are one, you might want to start prepping for an operator-dominated future right now, or you may run out of runway.

Some Further Thoughts on Service Lifecycle Automation

Everyone wants service lifecycle automation, which some describe as a “closed-loop” of event-to-action triggering, versus an open loop where humans have to link conditions to action.  At one level, the desire for lifecycle automation is based on the combined problem of reducing opex and improving service agility.  At another level, it’s based on the exploding complexity of networks and services, complexity that would overwhelm manual processes.  Whatever its basis, it’s hardly new in concept, but it may have to be new in implementation.

Every network management system deployed in the last fifty years has had at least some capability to trigger actions based on events.  Often these actions were in the form of a script, a list of commands that resemble the imperative form of DevOps.  Two problems plagued the early systems from the start, one being the fact that events could be generated in a huge flood that overwhelmed the management system, and the other being that the best response to an event usually required considerable knowledge of network conditions, making the framing of a simple “action” script very difficult.
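Both historical problems are easy to demonstrate in a sketch: with one script naively fired per event and no correlation, a single upstream failure seen by hundreds of elements produces hundreds of identical remediation runs.  The trigger table and script names below are purely illustrative.

```python
# Sketch of the old event-triggers-a-script model and the flood problem:
# every event fires its own action, with no awareness of network conditions.

from collections import Counter

triggers = {"link_down": "reroute.sh", "cpu_high": "rebalance.sh"}

def handle(events):
    """Naively fire one script run per event, with no correlation."""
    runs = Counter()
    for ev in events:
        script = triggers.get(ev, "page_operator")
        runs[script] += 1
    return runs

# A single trunk failure reported by 500 downstream routers produces
# 500 identical reroute runs, which is the flood the text describes.
flood = ["link_down"] * 500
```

The second problem, that the right action depends on broader network state, is visible too: the table can only map an event name to a fixed script, not to what the network actually needs at that moment.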

One mechanism proposed to address the problems of implementing closed-loop systems is that of adaptive behavior.  IP networks were designed to dynamically learn about topology, for example, and so to route around problems without specific operations center action.  Adaptive behavior works well for major issues like broken boxes or cable-seeking backhoes, but not so well for subtle issues of traffic engineering for QoS or efficient use of resources.  Much of the SDN movement has been grounded in the desire to gain explicit control of routes and traffic.

Adaptive behavior is logically a subset of autonomous or self-organizing networks.  Network architecture evolution, including the SDN and NFV initiatives, has given rise to two other approaches.  One is policy-based networking where policies defined centrally and then distributed to various points in the network enforce the goals of the network owner.  The other is intent-modeled service structures, which divide a service into a series of domains, each represented by a model that defines the features it presents to the outside and the SLA it’s prepared to offer.  There are similarities and differences in these approaches, and the jury isn’t out yet on what might be best overall.

Policy-based networks presume that there are places in the network where a policy on routing can be applied, and that by coordinating the policies enforced at those places it’s possible to enforce a network-wide policy set.  Changes in policy have to be propagated downward to the enforcement points as needed, and each enforcement point is largely focused on its own local conditions and its own local set of possible actions.  It’s up to higher-level enforcement points to see a bigger picture.

Policy enforcement is at the bottom of policy distribution, and one of the major questions the approach has to address is how you balance the need for “deep manipulation” of infrastructure to bring about change, with the fact that the deeper you go the narrower your scope has to be.  Everybody balances these factors differently, and so there is really no standard approach to policy-managed infrastructure; it depends on the equipment and the vendor, not to mention the mission/service.

Intent-modeled services say that both infrastructure and services created over it can be divided into domains that represent a set of cooperative elements doing something (the “intent”).  These elements, because they represent their capabilities and the SLA they can offer, have the potential to self-manage according to the model behavior.  “Am I working?”  “Yes, if I’m meeting my SLA!”  “If I’m not, take unspecified internal action to meet it.”  I say “unspecified” here because in this kind of system, the remediation procedures, like the implementation, are hidden inside a black box.  If the problem isn’t fixed internally, a fault occurs that breaks the SLA and creates a problem in the higher-level model that incorporates the first model.  There the local remediation continues.
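The fix-locally-or-escalate behavior can be reduced to a short sketch: each model remediates the faults its black box can handle and passes everything else to its parent.  The domain names and fault types are invented for illustration.

```python
# Hedged sketch of hierarchical intent-model remediation: local fix first,
# escalate on SLA break.  The internals of each domain stay a black box.

class IntentModel:
    def __init__(self, name, can_fix=(), parent=None):
        self.name = name
        self.can_fix = set(can_fix)   # faults this domain remediates locally
        self.parent = parent

    def report(self, fault):
        """Remediate locally if possible, else break SLA and escalate."""
        if fault in self.can_fix:
            return f"{self.name}: fixed {fault} internally"
        if self.parent:
            return self.parent.report(fault)   # escalate up the hierarchy
        return f"{self.name}: unresolved {fault}, operator alerted"

vpn    = IntentModel("vpn-service", can_fix={"site_reroute"})
access = IntentModel("access-domain", can_fix={"port_failure"}, parent=vpn)
```

Note that the higher level never sees faults the lower level absorbs; that is precisely the property that makes domain sizing (discussed below) so critical.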

You can see that there’s a loose structural correspondence between these two approaches.  Both require a kind of hierarchy—policies in one case and intent models in another.  Both presume that “local” problem resolution is tried first, and if it fails the problem is kicked to a successively higher level (of policy, or of intent model).  In both cases, therefore, the success of the approach will likely depend on how effectively this hierarchy of remediation is implemented.  You want any given policy or model domain to encompass the full range of things that could be locally manipulated to fix something, or you end up kicking too many problems upstairs.  But if you have a local domain that’s too big, it has too much to handle and ends up looking like one of those old-fashioned monolithic management systems.

I’m personally not fond of a total-policy-based approach.  Policy management may be very difficult to manipulate on a per-application, per-user, per-service basis.  Most solutions simply don’t have the granularity, and those that do present very complex policy authoring processes to treat complicated service mixes.  There is also, according to operators, a problem when you try to apply policy control to heterogeneous infrastructure, and in particular to hosted elements of the sort NFV mandates.  Finally, most policy systems don’t have explicit events and triggers from level to level, which makes it harder to coordinate the passing of a locally recognized problem to a higher-level structure.

With intent-based systems, it’s all in the implementation, both at the level of the modeling language/approach and the way that it’s applied to a specific service/infrastructure combination.  There’s an art to getting things right, and if that art isn’t applied you end up with something that won’t work.  It’s also critical that an intent system define a kind of “class” structure for the modeling, so that five different implementations of a function appear as differences inside a given intent model, not as five different models.  There’s no formalism to ensure this happens today.

You can combine the two approaches, and in fact an intent-model system could envelop a policy system, or a policy system could drive an intent-modeled system.  This combination seems more likely to succeed where infrastructure is made up of a number of different technologies, vendors, and administrative domains.  Combining the approaches is often facilitated by the fact that inside an intent model there almost has to be an implicit or explicit policy.

We’re still some distance from having a totally accepted strategy here.  Variability in application and implementation of either approach will dilute effectiveness, forcing operators to change higher-level management definitions and practices because the lower-level stuff doesn’t work the same way across all vendors and technology choices.  I mentioned in an earlier blog that the first thing that should have been done in NFV in defining VNFs was to create a software-development-like class-and-inheritance structure; “VNF” as a superclass is subclassed into “Subnetwork-VNF” and “Chain-VNF”, and the latter perhaps into “Firewall”, “Accelerator”, and so forth.  This would maximize the chances of logical and consistent structuring of intent models, and thus of interoperability.
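The class names in that hierarchy come straight from the text; rendered as code (with illustrative, invented method stubs), the idea looks like this:

```python
# Sketch of the class-and-inheritance structure proposed for VNFs.
# The class names follow the text; the method stubs are invented
# placeholders for whatever common contract a superclass would define.

class VNF:
    """Superclass: the deploy/manage contract every VNF must honor."""
    def deploy(self): ...
    def management_interface(self): ...

class SubnetworkVNF(VNF):     # VNFs that implement a whole subnetwork
    pass

class ChainVNF(VNF):          # VNFs meant to sit in a service chain
    pass

class Firewall(ChainVNF):     # five vendors' firewalls would all subclass
    pass                      # this -- different inside, identical outside

class Accelerator(ChainVNF):
    pass

# An orchestrator can then treat any Firewall interchangeably, because
# interoperability is guaranteed at the class level, not per vendor.
is_chainable = issubclass(Firewall, ChainVNF)
```

This is exactly the "five implementations appear as differences inside one model, not five models" point: vendors vary the internals, the class contract stays fixed.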

The biggest question for the moment is whether all the orderly stuff that needs to be done will come out of something like NFV or SDN, where intent modeling is almost explicit but where applications are limited, or from broader service lifecycle automation, where there’s a lot of applications to work with but no explicit initiatives.  If we’re going to get service lifecycle automation, it will have to come from somewhere.

What’s the Real Relationship Between 5G and Edge Computing?

According to AT&T, 5G will promote low-latency edge computing.  Is this another of the 5G exaggerations we’ve seen for the last couple of years?  Perhaps there is a relationship that’s not direct and obvious.  We’ll see.  This is a two-part issue, with the first part being whether low latency really matters that much, and the second being whether edge computing and 5G could reduce it.

Latency in computing is the length of the closed-feedback control loop that characterizes almost every application.  In transaction processing, we call it “response time”, and IBM for decades promoted the notion that “sub-second” response time was critical for worker productivity improvement.  For things like IoT, where we may have a link from sensor to controller in an M2M application, low latency could mean a heck of a lot, but perhaps not quite as much as we’d think.  I’ll stick with the self-drive application for clarity here.

It’s easy to seem to justify low latency with stuff like self-driving cars.  Everyone can visualize the issue where the light changes to red and the car keeps going for another 50 feet or so before it stops, which is hardly the way to make intersections safe.  However, anyone who builds a self-drive car that depends on the response of an external system to an immediate event is crazy.  IoT and events have a hierarchy in processing, and the purpose of that hierarchy is to deal with latency issues.

The rational way to handle self-drive events is to classify them according to the needed response.  Something appearing in front of the vehicle (a high closing speed) or a traffic light changing are examples of short-control-loop applications.  These should be handled entirely on-vehicle, so edge computing and 5G play no part at all.  In fact, we could address these events with no network connection or cloud resources at all, which is good, because otherwise we’d kill a lot of drivers and pedestrians with every cloud outage.

The longer-loop events arise more from collective behaviors, such as the rate at which vehicles move again when a light changes.  This influences the traffic following the light and whether it would be safe to pull out or not.  It’s not unreasonable to suggest that a high-level “traffic vector” could be constructed from a set of sensors and then communicated to vehicles along a route.  You wouldn’t make a decision to turn at a stop sign based on that alone, but what it might do is set what I’ll call “sensitivity”.  If traffic vector data shows there’s a lot of stuff moving, then the sensitivity of motion-sensing associated with entering the road would be correspondingly high.  For this, you need to get the sensor data in, digested, and distributed within a couple seconds.

This is where edge computing comes in.  We have sensors that provide the traffic data, and we have two options for getting at it.  The first is to let every vehicle tickle the sensors for status and interpret the result, and that’s totally impractical: a sensor that a vehicle could access directly would be swamped by requests unless it had the processing power of a high-end server, and somebody would attack it via DDoS, after which nobody would get a response at all.  The second, better, option is to have an edge process collect sensor data in real time and develop those traffic vectors for distribution.  This reduces sensor load (one controller accesses the sensor) and improves security.  If we host the control process near the edge, the control loop length is reasonable.  Thus, edge computing.
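A toy version of that edge-hosted digesting step might look like the following.  The function name, the reading format, and the "sensitivity" thresholds are all invented for illustration; the point is only that one controller polls the field and many vehicles consume the digest.

```python
# Sketch of the edge control process: one collector polls the sensor field
# and publishes a digested "traffic vector", so vehicles never touch the
# sensors directly.  Thresholds and field names are illustrative only.

from statistics import mean

def traffic_vector(sensor_readings):
    """Digest raw per-sensor speed reports (mph) into one distributable summary."""
    speeds = [r["speed"] for r in sensor_readings]
    return {
        "mean_speed": round(mean(speeds), 1),
        "moving": sum(1 for s in speeds if s > 3),   # sensors seeing motion
        "sensitivity": "high" if mean(speeds) > 20 else "normal",
    }

# One edge controller reads the field; subscribers get the digest, so no
# sensor ever sees more than a single client -- less load, smaller attack surface.
readings = [{"speed": 28}, {"speed": 31}, {"speed": 35}]
vector = traffic_vector(readings)
```

The `sensitivity` field is the "set sensitivity" idea from the text: a vehicle entering the road tunes its own motion-sensing up when the vector says a lot of stuff is moving.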

The connection between this and 5G is IMHO a lot more problematic.  Classical wisdom (you all know how wise I think that is!) says that you need 5G for IoT.  How likely that is to be true depends on just where you think the sensors will be relative to other technology elements, like stoplights.  If you can wire a sensor to a subnet that the control process can access, you reduce cost and improve security.  If you can’t, there are other approaches that could offer lower wireless cost.  I think operators and vendors have fallen in love with the notion that IoT is a divine mandate, and that if you link it with 5G cellular service you get a windfall in monthly charges and buy a boatload of new gear.  Well, you can decide that one for yourself.

However, 5G might play a role, less for its mobile connection than for the last-mile FTTN application so many operators are interested in.  If you presume that the country is populated with fiber nodes and 5G cells to extend access to homes and offices, then linking in sensors is a reasonable add-on mission.  In short, it’s reasonable to assume that IoT and short-loop applications could exploit 5G (particularly in FTTN applications) but not likely reasonable to expect them to drive 5G.

In my view, this raises a very important question about 5G, which is the relationship between the FTTN/5G combo for home and business services, and other applications, including mobile.  The nodes here are under operator control, and are in effect small cells serving a neighborhood.  They could also support local-government applications like traffic telemetry, and could even be made available for things like meter reading.  These related missions pose a risk for operators because the natural response of a telco exec would be to try to push these applications into higher-cost 5G mobile services.

The possibility that these neighborhood 5G nodes could serve as small-cell sites for mobile services could also be a revolution.  First, imagine that 5G from the node could support devices in the neighborhood in the same way as home WiFi does.  No fees, high data rate, coverage anywhere in the neighborhood without the security risks of letting friends into your WiFi network.  Second, imagine that these cells could be used, at a fee, to support others in the neighborhood too.  It has to be cheaper to support small cells this way than to run fiber to new antenna locations.

There’s a lot of stuff that could be done to help both the IoT and small-cell initiatives along.  For IoT what we need more than anything is a model of an IoT environment.  For example, we could start with the notion of a sensorfield, which is one or more sensors with common control.  We could then define a controlprocess that controls a sensorfield and is responsible for distributing sensor data (real-time or near-term-historical) to a series of functionprocesses that do things like create our traffic vectors.  These could then feed a publishprocess that provided publish-and-subscribe capabilities, manual or automatic, to things like our self-drive vehicles.
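The sensorfield/controlprocess/functionprocess/publishprocess pipeline proposed above can be wired together as a toy sketch.  The four names come from the text; everything else (constructors, the lambda sensors, the averaging function process) is purely illustrative.

```python
# The sensorfield -> controlprocess -> functionprocess -> publishprocess
# pipeline from the proposed IoT model, as a minimal pub/sub sketch.

class SensorField:
    """One or more sensors with common control."""
    def __init__(self, sensors):
        self.sensors = sensors
    def read(self):
        return [s() for s in self.sensors]

class ControlProcess:
    """Controls a sensorfield; distributes its data to function processes."""
    def __init__(self, field, function_processes):
        self.field = field
        self.functions = function_processes
    def cycle(self):
        data = self.field.read()
        return [fn(data) for fn in self.functions]

class PublishProcess:
    """Publish-and-subscribe front end for consumers like self-drive vehicles."""
    def __init__(self):
        self.subscribers = []
    def subscribe(self, callback):
        self.subscribers.append(callback)
    def publish(self, item):
        for cb in self.subscribers:
            cb(item)

# Wire it up: two fixed speed sensors, one averaging function process.
field = SensorField([lambda: 30, lambda: 34])
control = ControlProcess(field, [lambda data: sum(data) / len(data)])
bus = PublishProcess()
received = []
bus.subscribe(received.append)
for digest in control.cycle():
    bus.publish(digest)
```

The design point is the separation of roles: sensors only answer their control process, function processes only see digested data, and consumers only see what the publish process offers.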

I think too much attention is being paid to IoT sensor linkage, a problem which has been solved for literally billions of sensors already.  Yes, there are things that could make sensor attachment better, such as the FTTN/5G marriage I noted above.  The problem isn’t there, though, it’s with the fact that we have no practical notion of what to do with the data.  Edge computing will be driven not by the potential it has, but by real, monetized, applications that justify deployment.

Can the Drivers of Carrier Cloud Converge on a Common Cloud Vision?

One of the issues that should be driving the fall operator planning cycle, carrier cloud, isn’t making a really strong appearance so far.  Those of you who’ve read my blog on what is likely to be a big planning focus no doubt saw that carrier cloud wasn’t on the list.  Many would find this a surprise, considering that by 2030 we’ll likely add over one hundred thousand data centers to support it, most at the edge.  I was surprised too, enough to ask for a bit more information.  Here’s what’s going on.

The key point holding back carrier cloud is the lack of a clear, achievable, driving application.  Operators have become very antsy about the field-of-dreams approach to new services.  Before they build out infrastructure like carrier cloud, they want to understand exactly what they can expect to drive ROI, and get at least a good idea of what they’ll earn on their investment.  There are six drivers of carrier cloud, as I’ve noted before, and while operators generally understand what they are, they’re not yet able to size up the opportunity for any of them.

Two of the six drivers for carrier cloud were on the hot-button list for the fall.  One was NFV and the other was 5G, but these account for only 5% and 16% of carrier cloud incentive, respectively, through 2020.  The majority, more than three-quarters, falls to applications not on the advance planning radar.  The biggest driver available in that timeframe is the virtualization of video and advertising features for ad and video personalization and delivery.  It accounts for about half the total opportunity for the near term.  Why is this driver not being considered, much less prioritized, in the fall cycle?  Several reasons, and none easy to justify or fix.

First, operators have been slow to get into advertising.  Most who have moved have done so by purchasing somebody who had an ad platform (Verizon with AOL and Yahoo, for example).  As a result, there’s been less focus on just how the ad space should be handled in infrastructure (meaning carrier cloud) terms.  Operators who have tried to build their own approach here (mostly outside the US) have found it very difficult to get the right people onboard and to drive their needs through a connection-services-biased management team.

The second factor is that operators have tended to see video delivery in terms of using caching to conserve network capacity.  These systems have been driven from the network side rather than from the opportunity side, and they’ve ignored issues like socialization and personalization.  Operators see the latter as being linked more to video portals, which (so far) they’ve either pushed into the web-and-advertising space just noted here, covered in their IPTV OTT service plans, or have not been particularly hot about at all.

What these points add up to is a dispersal of responsibility for key aspects of our demand driver, and a lack of a cohesive set of requirements and opportunities linked to infrastructure behavior.  In short, people aren’t working together on the issues and can’t align what’s needed with a specific plan to provide it.

This explains, at least at a high level, why the main carrier cloud driver isn’t in the picture for the fall cycle.  What about the two that are?  Could they take up the slack?

The SDN-and-NFV-drives-change approach, as I’ve already noted in the referenced blog, hasn’t really delivered much in the way of commitment from senior management.  The biggest problem is that neither technology has been linked to an opportunity with credible scope in both services and infrastructure.  SDN today is primarily either a data center evolution toward white-box devices, or a policy management enhancement to legacy switches and (mostly) routers.  NFV today is primarily virtual CPE hosted not in the cloud but on a small edge box at the service demarcation point.  It’s hard to see how these could change the world.

What SDN and NFV prove is the difficulty in balancing risk and reward for technology shifts.  For these technologies to bring about massive change in the bottom line, they have to create massive impact, which means touching a lot of services and infrastructure.  That creates massive risk, which leads operators to dip their toes into the space with very limited trials and tests.  Those, because they are very limited, don’t prove out the opportunity or technology at scale.  By the time we get a convincing model of SDN and NFV that has the scope to do something, carrier cloud deployment will have been carried forward by something else, and SDN/NFV will just ride along.

5G is even more complicated.  Here we have the classic “There’s a lot you can do with a widget” problem; that may be true but it doesn’t address the question of how likely it is you’d want to do one of those things, or how profitable it would be to do it.  In many ways, 5G is just a separate infrastructure justification problem on top of carrier cloud.  We have to figure out just how it’s going to be useful, then deploy it, and only then see what parts of its utility bear on a justification for carrier cloud.

Nobody doubts that 5G spectrum and the RAN are useful, but it’s far from clear that either would have any impact on carrier cloud.  In fact, it’s far from clear what positive benefits would come from the additional 5G elements, including network slicing.  Remember, there’s a difference between “utility” (I can find something to do with it) and “justification” (I gain enough to proactively adopt it).

An Ixia survey cited by SDxCentral says that carrier cloud plans are rooted in NFV.  Actually the data cited doesn’t seem to show that to me.  The cited drivers for 5G are quite vague, as disconnected from explicit benefits as any I’ve heard.  “Flexible and scalable network” is the top one.  What does that mean and how much return does it generate?  My point is that the survey doesn’t show clear drivers, only abstract stuff that’s hard to quantify.

That’s consistent with what operators have told me.  In fact, 5G planning focus is more on trying to nail down actual, quantifiable, actionable benefits than on driving anything in particular forward.  Don’t get me wrong; everything in 5G has potential value, and there are circumstances in which each element would make a critical contribution to operators.  How much of that potential is real is very difficult to say, which is what’s giving operators so much grief.

What exactly is all that rooting of 5G in NFV?  The article quotes an Ixia executive as follows: “You are going to deploy the 5G core in an NFV-type model. There’s no doubt it will all be virtualized.”  Well, gosh, everything virtualized isn’t necessarily NFV.  So, is this an attempt by a vendor (who has an NFV strategy) to get into the 5G story?  You decide.

It all circles back to the notion of that field of dreams.  In order for the drivers of carrier cloud to operate effectively, they all have to drive things to the same place.  We need to have a common cloud model for the network of the future, because we’re not going to build one for every possible demand source.  The difficulties for the moment lie less in making the drivers real in an ROI sense than in making the drivers credible enough to motivate planners to look for a single cloud solution that can address them all.  That’s what’s missing today, and sadly I don’t see the fall planning cycle providing it.