Why a Model-Driven NFV Architecture is Critical (and How Ciena’s Looks)

In the fall of 2013 I had a meeting with five Tier One operators to discuss what was needed to make NFV work.  At the meeting, one of the key figures in the NFV ISG and a representative of a big operator made two comments I think are critical and have stood the test of time.  The first, referencing the fact that capex savings for NFV were by then expected to be at most 20%, was “If I want a 20% reduction in capex I’ll just beat Huawei up on price!”  That meant that savings for NFV had to come from opex, and to demonstrate opex savings, my contact said, “You have to present an NFV architecture in the context of a service lifecycle.”

A service lifecycle, to an operator, has four distinct phases:

  1. Architecting, which is the process of defining a service as a structure made up of feature components that eventually decompose downward into resource commitments.
  2. Ordering, meaning the connection between an orderable service model and a customer, made through a self-service portal or a customer service rep.
  3. Deployment, the conversion of a service model that’s been ordered into a set of resource commitments that fulfill the service description and SLA.
  4. Management, the support of the service in its operational state, responding to conditions that would break the SLA or create inefficiency risks, and to customer-initiated changes or termination.

I took this to heart in my ExperiaSphere project, which divides the tutorial slides into these same four phases, and I’ve used these phases to assess the various vendor “NFV solutions”.  Most of the solutions were incomplete, as you can probably see just from the names of the phases.  Even where there were complete solutions, the specifics available from a vendor in online or distributed documentation were rarely enough to allow me to present the solution in detail.

Ciena is one of the six vendors I believe have a complete service-lifecycle story.  Their just-announced Blue Planet DevOps Toolkit also provides the requisite detail in the analyst presentation I’m referencing here, and so I want to talk about it to illustrate why that first Architect phase is the key to the whole NFV story.

The Architect phase of a service lifecycle is really a series of iterations that are intended to build a decomposable model of a service that can then be offered as a retail product.  It’s always been my view that the model should be hierarchical and should describe a service as a feature composition, each feature of which is eventually linked to some resource management/deployment task.  This corresponds to the “declarative” model of DevOps, for those familiar with the software development world.

There are two pieces to the Architect phase, one to model the services and the other to control the resources.  This corresponds to Ciena’s notion of a Service Template and Resource Adapters, with the latter being roughly equivalent to the ETSI NFV ISG’s Virtual Infrastructure Managers.  Ciena uses the OASIS TOSCA (Topology and Orchestration Specification for Cloud Applications) language, which I’ve said for several years now is the most logical way to describe what are effectively cloud-deployed services.  The real intent of the Ciena Blue Planet DevOps Toolkit is to build these Templates and RAs and to then provide a framework in which they can be maintained, almost like the elements of an application, in a lifecycle management process that’s somewhat independent from the lifecycle management of the deployed services.

The Template and RA separation corresponds to what I’ve called “service domain” and “resource domain” activities in the Architect phase of a service.  The service domain is building services from feature elements that have been previously defined.  These can be augmented as features evolve, and revised as needed for market reasons, and it’s this add-and-revise process that’s analogous to software evolution for a running application.  The RAs associate in most cases with management systems or resource controllers that can configure and deploy feature elements and change resource state.  Service Templates, at some point, reference RAs.
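
To make the hierarchy concrete, here’s a minimal Python sketch of the idea (my own class and field names, not Ciena’s TOSCA syntax or the Toolkit’s API): feature elements decompose downward until a leaf element references a Resource Adapter that actually touches infrastructure.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ResourceAdapter:
    """Illustrative stand-in for an RA: knows how to drive a real controller or EMS."""
    name: str

    def deploy(self, params: dict) -> None:
        # In a real system this would call a VIM, SDN controller, or EMS API.
        print(f"RA {self.name} deploying with {params}")

@dataclass
class ServiceElement:
    """One node in a hierarchical service template."""
    name: str
    children: List["ServiceElement"] = field(default_factory=list)
    adapter: Optional[ResourceAdapter] = None  # only leaf elements bind to an RA

    def decompose(self, params: dict) -> None:
        """Walk the model downward until every feature resolves to a resource commitment."""
        if self.adapter:
            self.adapter.deploy(params)
        for child in self.children:
            child.decompose(params)

# A "BusinessVPN" service composed of features, each eventually tied to an RA
vpn = ServiceElement("BusinessVPN", children=[
    ServiceElement("EdgeFirewall", adapter=ResourceAdapter("vFW-VIM")),
    ServiceElement("SiteConnectivity", adapter=ResourceAdapter("MPLS-EMS")),
])
vpn.decompose({"sites": 3, "sla": "gold"})
```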

Service providers can use the Toolkit to build all this themselves, conforming to their own requirements and their own network infrastructure.  Vendors or third parties can also build them and submit them for use, and it would be logical IMHO to assume that many vendors would eventually realize that building an RA or Infrastructure Manager is essential if their stuff is to be used to host virtual functions, connect them, etc.

There is nothing as important to NFV’s success as the notion of service and resource models, yet little or nothing is said in the NFV specifications about how these things would be created and used.  The Ciena approach uses TOSCA to describe service characteristics and parameters (the NFV ISG is now looking at a liaison with OASIS/TOSCA to describe how the parameters they’re suggesting would be stored in a model).  These fall into three categories—parameter values used to guide decomposition of model elements, parameters used to describe service conditions upward toward the user, and parameters that are sent directly or indirectly to RAs for deployment and management.
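
As a rough illustration of those three categories, assuming nothing about how TOSCA or the Toolkit actually names them, a model element’s parameters might be grouped like this:

```python
# A minimal sketch (my own field names, not the ISG's or Ciena's) of the three
# parameter categories a model element might carry.
element_parameters = {
    # 1. Values that guide how this element decomposes into lower-level elements
    "decomposition": {"deployment_zone": "us-east", "redundancy": "active-standby"},
    # 2. Values reported upward to describe service conditions to the user/portal
    "service_view": {"sla_class": "gold", "availability_target": "99.99%"},
    # 3. Values passed directly or indirectly to Resource Adapters for deployment/management
    "resource_directives": {"vcpu": 4, "memory_gb": 8, "image": "vfw-2.1"},
}
```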

Ciena’s focus on model-building here is critical because it would facilitate just what their preso says, which is vendor independence and facile integration of resources.  The only thing needed to make this story complete is a way of authoring the lifecycle processes too.

Lifecycle processes could be defined in this kind of model, for each model element.  That’s critical in making the models reusable, since each model element is a kind of self-contained and self-described service unit that knows how to deploy and sustain itself.  Since ordering a service is logically the first lifecycle stage, the entire operations process can be defined and composed this way, at every level.

The implicit Ciena approach is that a service template, or model, has built into each element the rules associated with that element’s life, both as an order template and, when instantiated, as a piece of a service.  No matter where you put this element, geographically or in a service context, its model will determine how it is sustained, and nothing else is needed.  Give a model to a partner or subsidiary, or to a customer, and if they have the right RAs they can deploy it.  That will be true if you can define lifecycle processes within the service templates, and if you can’t, then there’s a hole that will invalidate at least some of the benefits of the Ciena approach.  That means the question of whether the Toolkit supports lifecycle definitions, and how that works, is critical.

Most of you know that I don’t accept verbal inputs from vendors on important points because they’re not public disclosures.  I’d invite Ciena to respond by commenting on LinkedIn to my question here.  Alternatively, I’d like to see a document that describes the approach to lifecycle definition, without an NDA so I can reference it.

The good news is that I have been assured by two TOSCA experts that there would be no problem defining lifecycle processes in state/event form within a TOSCA template.  It’s only a matter of providing the mechanism to build it through something like the Toolkit, then steering events through it.  I’d love to see someone describe this in public, and if your company has it, I’d like to hear from you.
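
To show what I mean by state/event form, here’s a minimal Python sketch of a lifecycle table for a single model element; the states, events, and process names are my own illustration, not anything drawn from TOSCA or Ciena.

```python
# Each (state, event) pair maps to a next state and the operations process that
# handles the event. Everything here is illustrative, not a standard.
LIFECYCLE = {
    ("ordered",   "activate"):      ("deploying", "run_deployment"),
    ("deploying", "deploy_ok"):     ("active",    "start_monitoring"),
    ("deploying", "deploy_fail"):   ("failed",    "notify_operations"),
    ("active",    "sla_violation"): ("degraded",  "attempt_repair"),
    ("degraded",  "repair_ok"):     ("active",    "clear_alarm"),
    ("active",    "terminate"):     ("retired",   "release_resources"),
}

def handle_event(state: str, event: str) -> str:
    """Look up the next state and the operations process to steer the event to."""
    next_state, process = LIFECYCLE[(state, event)]
    print(f"{state} + {event} -> {next_state}, invoking {process}")
    return next_state

state = handle_event("ordered", "activate")   # -> deploying
state = handle_event(state, "deploy_ok")      # -> active
```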

I’ve also realized with this announcement that I need a way of getting information on purported NFV implementations in depth and with confidence that the vendor really has what they say.  That’s the kind of information I want to base my assessments on, and pass on to those who read this blog.  I’m going to be working out an approach, and I’ll publish it as an open document for vendors when I’m done.

Where Are We in SDN/NFV Evolution?

Some examples of operator architectures for SDN and NFV networks have been out for a couple of weeks or more.  We’re now starting to hear more about vendor changes and even a bit more about the conditions in the marketplace.  I’ve certainly been hearing plenty, and so I want to provide a status report.

It has been clear for some time that there is no compelling movement to totally transform infrastructure using SDN and NFV in combination.  Optimally, SDN and NFV could represent almost half of operator capex by 2022 but that figure is now unlikely to be reached because it would require a holistic commitment to infrastructure change.  The progress toward next-gen infrastructure is now most likely to be service-specific, and the primary area of opportunity lies in mobile networking today and IoT down the line.

What separates mobile/IoT opportunity from other services (enterprise vCPE, for example) is scope.  Mobile/IoT is explicitly cloud-hosted so anything you build becomes reusable and leverageable, and there’s enough of either to build up a fairly significant repertoire of data centers.  This lowers both cost and risk for add-on services, providing that the architecture under which you deploy is suitable for extension.

One dimension of suitability is the breadth of the business case.  The problem with service-specific business cases is their specificity.  You could easily frame an opportunity so specialized in benefits that it would justify deployment where other opportunities would not.  Operators say that even mobile/IoT service opportunities are hampered by a lack of vertical integration of the technology elements involved in SDN and NFV.  Operators today believe that neither SDN nor NFV has a mature management model, which makes it nearly impossible to assess operations costs.  Since opex represents more of each revenue dollar than capex, the lack of clarity there inhibits anything but very specialized and localized investment.

The good news is that one early problem with both SDN and NFV, which is the dependence of success on coexistence with legacy elements, is now seen as solved.  Most vendors now promise such integration; Ciena has just announced a kind of next-gen-network-SDK that facilitates model-building for both resources and services, a key step to solving the legacy problem.  In theory, modeling also supplies the first step toward operational integration, but the rest of the steps here are taking longer, probably because of the continued debate over the best management approach.

We are making progress here, too.  At the beginning of this year there were no credible models for a complete future-network architecture, and we now have public models from both AT&T and Verizon and some vendors (Netcracker, most recently) have published their own top-to-bottom vision.  While even many within the operators’ organizations don’t think all the details are ironed out yet, there’s general acceptance of eventual success.  The median year for availability of a complete solution from credible sources is 2018 and a very few operators think that by this time next year there will be at least one proven model that addresses the full range of issues.

Leveraging vCPE opportunity may be critical for NFV because nearly every large operator has at least some of it, and most will see it either as a lead-off opportunity or a quick follow-on.  The challenge is that credible vCPE deployments so far are more likely based on premises hosting of functions in generalized appliances than on resource pools.  The applications are justified by a belief that agile function replacement and augmentation presents a revenue opportunity, a view that is somewhat credible in the MSP space but not as clearly so in the broad market.  And MSPs don’t usually deploy much infrastructure; they rely on wholesale/overlay relationships.

Many operators believe that consumer vCPE could be the escape from the premises-hosting trap, but the challenge has been that the cost of operationalizing a large population of cloud-fulfilled consumer edge functions isn’t known, and the savings available are limited by the low cost of devices and the need to have something on premises to terminate the service and offer WiFi connectivity to users.

SDN at present is primarily a data center play, which means that it would be likely to deploy primarily where larger-scale NFV deploys—mobile infrastructure.  According to operators, the big barrier to SDN deployment has been the lack of federated controllers to extend scope and the ability to efficiently support a mixture of SDN and legacy.  As noted above, these problems are now considered to be solved at the product level, and so it’s likely that trials will develop quickly and move to deployments.

Getting SDN out of the data center is the real challenge, according to operators.  There is a hope that SDN could build up from the fiber layer to become a kind of universal underlay network, and some operators point to the Open Optical Packet Transport project within the Telecom Infra Project (which both ADVA and Juniper just joined) as a possible precursor to an agile electro-optical underlayment.  Since that project is purely optical at this point, such an evolution would take time.

ADVA and Ciena both have strong orchestration offerings, acquired from Overture and Cyan, respectively, and at least some operators hope they’ll present something in the form of an integrated optical/SDN-tunnel picture.  Brocade might also play in this to exploit their Vyatta routing instances to build virtual L3 networks.  Nobody is moving very fast here, say operators, and so it probably can’t happen until 2018.

Vendors are facing their own challenges with SDN and NFV, as both comments I’ve received from their sales forces and a recent article in the New IP show.  The primary problem, say vendors, is that the buyer wants a broad business case that most product offerings can’t deliver because of scope issues.  What’s needed is an integration project to tie deployment to operations efficiently, and while there are products that could build this vertical value chain (and all of it could be offered by half-a-dozen vendors), the sales cycle for something like this is very long, and vendors have been pushing “simplified” value propositions, getting them back to those service-specific deployments that cannot win on a large scale.

IoT could be an enormous boost for both SDN and NFV if the concept is recognized for what it is, which is a cloud-analytics and big-data application first and foremost.  IoT could be an even larger driver for edge data center deployments, and so could revolutionize both SDN and NFV—the former for DCI and the latter for more feature hosting than any other possible service.  HPE has made a few announcements that could be considered as positioning IoT this way, but they’ve not taken that position aggressively and the industry still sees this as a wireless problem.

Who’s ahead?  It’s hard to say, but operators think the vendors most likely to “win” at large-scale SDN/NFV modernization are Huawei and Nokia, with Ericsson seen as an integrator winner.  Both Huawei and Nokia are seen as having a strong position in mobile infrastructure, which is where the first “big” success of both SDN and NFV is expected, and both have credible SDN/NFV stories that link to management and operations.  Brocade, who just demonstrated a hosted-EPC deployment in concert with Telefonica, and Metaswitch, who has had a cloud-IMS option for some time, are also seen as credible players in integrator-driven projects.

The integrator angle may prove interesting.  Ericsson has been the classic telco integrator, though both Nokia and Huawei have expanded their professional-services credibility.  Ericsson also has a strong OSS/BSS position, but they’re not seen by operators as leading the OSS/BSS evolutionary charge.  Operators are also wondering whether the Cisco relationship will end up tainting Ericsson as an objective integrator, but on the other hand it gives Ericsson a bit more skin in the game in terms of upside if a major SDN/NFV deployment is actually sold.

All of the IT vendors (Dell, HPE, IBM, Intel/Wind River, Microsoft, Oracle, and Red Hat) are increasingly seen as being hosting or platform players rather than as drivers of a complete (and benefit-driven) deployment.  That’s not an unreasonable role, but it lacks the ability to drive the decision process because that depends so heavily on new service revenues or operational efficiencies.  HPE and Oracle do have broad orchestration and potentially operations integration capability, but these capabilities are (according to operators and to their own salespeople) not being presented much in sales opportunities for fear of lengthening the selling cycle.

Vendors with fairly limited offerings can expect to benefit from increased support for from-the-top orchestration, both from vendors like Amdocs and Netcracker and in operator-developed architecture models.  However, the OSS/BSS-framed orchestration options have to be presented to the CIO, who has not been the driver for SDN/NFV testing and trials.  In fact, most operator CIOs say they are only now becoming engaged and that proving the operations benefits will take time.

Operators and vendors still see more issues than solutions, but the fact is that solutions are emerging.  I think that had a respectable top-down view of SDN/NFV been taken from the first, we’d have some real progress at this point.  That view is being taken now, but it seems likely it will rely more on professional services for custom integration than on any formal standard or even open-source project.  It’s going to take time, and every delay will lose some opportunity.  That’s why it’s important to confront reality right now.

Is There a Value in a “Software-Defined Internet”?

How personal should a network be?  The vast majority of things I could find on the Internet, I never want to see.  The vast majority of people who could reach me, or whom I could reach, are people I never want to talk with.  Enterprises tell me that the great majority of the possible user-to-application or worker-to-data relationships their networks make possible are barred for security/compliance reasons.  Spam is defeating the utility of email for many, and search advertising is making finding useful stuff almost impossible.  Are we doing the right thing here?  Is there an alternative?

How flexible should a network be?  We surely have applications today that are fine with best-efforts services.  We surely have applications that demand some fairly rigorous SLA.  Can we build an efficient infrastructure to satisfy both these goals?  Is the extremely low cost of Internet bandwidth creating a kind of destructive competition for better-grade services, and preventing them from developing?

I’ve been looking over enterprise responses to questions about the Internet, email and messaging, and virtual and private networks, and it’s interesting to see what the decision-makers think.  It’s also interesting that they respond differently to issues and questions depending on whether they are talking as representatives of their business or as consumers.  The differences themselves may tell us a lot about the future.

As consumers, decision-makers are concerned about loss of privacy and what they see as the distortion created by ad sponsorship.  Every decision-maker thinks that too much of their personal information has been captured.  Do not track doesn’t work, they say.  Most cite examples that I can identify with; I do a search on a camera that I happen to own and next time I go to a news website I’ll see an ad for the camera, do-not-track notwithstanding.

As decision-makers, their big problem is bias in information.  While nearly everyone agrees that there is more information on the Internet than they’d ever have had access to before, most also believe that the information can be trusted less.  Back in 1991 when I started surveying what influenced buyers of technology, there were at least two technical publications that were in virtually everyone’s top five.  There are none there today.  People believe that online opinions, even consumer reviews, are bought and paid for.

Of course, the same people who worry as consumers about privacy are eager to exploit online advertising on behalf of their own companies, and most defend paying in some way or another for editorial mentions or analyst opinions.  They also say today that it’s smarter to spend to promote what you have than to pay to figure out, then do, the right thing in the market.

It’s obvious that you could make a sociological thesis out of this topic, but that’s probably not helpful to technologists who read my blog.  Two tech questions suggest themselves: have consumerism and the Internet contaminated our whole model of communication and information dissemination to the point where it has to be fixed, and what might a fix look like?

Skype has an interesting approach to communication that might offer a starting point.  While you can set up your Skype account to permit calls or messages from anyone, you can also say that you’ll accept contacts only from someone in your contact list.  That forces others to request to join your list as a condition of communication.  LinkedIn lets people you’ve connected with send you messages but limits what others can send you.  Explicit communication, based on what is in effect an invitation or closed user group, has been around for a long time.

One fair question to ask is whether systems like this should be used for email, or at least be made available.  Yes, you can block email except from a safe senders list, but how does somebody get added to that if they can’t contact you?  It’s obviously possible to do better at controlling email access, and were that done it’s possible that email would be less of a risk and an intrusion than it is now.

On the network side, there are both subtle and obvious questions.  In the latter category we have the question of whether virtual networks should be composable on a personal level.  Could I, for example, build a virtual-Internet-of-one that contains only the sites I admit?  Could I then, based on a search, find other sites and elect to admit them?
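
As a toy illustration of that admit-or-ignore model (purely hypothetical, not a real product or protocol), the logic is nothing more than an explicit allowlist the user grows over time:

```python
# A "virtual Internet of one": an explicit allowlist the user expands by electing
# to admit sites found through search. Site names are invented for illustration.
admitted_sites = {"news.example.com", "bank.example.com"}

def can_reach(site: str) -> bool:
    """Only sites the user has admitted are reachable in this personal overlay."""
    return site in admitted_sites

def admit(site: str) -> None:
    """Admit a newly discovered site into the personal overlay."""
    admitted_sites.add(site)

print(can_reach("camera-ads.example.net"))       # False: not part of my Internet
admit("camera-reviews.example.org")
print(can_reach("camera-reviews.example.org"))   # True after I elect to admit it
```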

The subtle question, which also relates to virtual networking, is whether the fact that the Internet is a low-cost and ubiquitous underlayment for virtual-network services is effectively limiting the virtual-network space by creating what amounts to a polarized option set.  You can pay little and get an Internet overlay, or you can pay a whole lot to get a true private network.  In the former case you get best-efforts services, you still have DDoS issues, etc.  In the latter you can have a real SLA and more security.  Wouldn’t it be nice to have a more graduated set of options, opening more-than-best-efforts to a larger community?

There’s obviously no technical barrier to offering SLAs on the Internet, since we can do SLAs on private IP.  The problem is one of public policy, which relates to my opening question of whether our consumer vision for the Internet is impacting our overall vision of networking.  Settlement and “paid prioritization” are seen as being anti-consumer, but they’re mandatory if the Internet as a ubiquitous data dialtone is going to be meaningful.

Operators tell me that the biggest problem in profit compression is the Internet.  Internet bandwidth is low-margin to begin with, and it’s also broadly available as the foundation for virtual network services and SD-WAN.  This means that it becomes more difficult to develop an independent QoS-capable network against the magnetic pull of the Internet’s low costs.  It’s also difficult to personalize the Internet, because to many that would smack of censorship even if the users themselves implemented the subsetting.  If we presume that the technical pathway to a true IP dialtone lies in the expansion of Internet infrastructure to be IP-dialtone infrastructure, the barriers are probably insurmountable.

Should we be allowed to “subset” the Internet both in terms of virtual subnetworking and in terms of QoS?  Should the fabric of the Internet support any valid business mission, and the application of that fabric then conform to public-policy goals?  The only way to make everything work is to make the Internet into a virtual network too, into a “software-defined Internet” or SDI.

An SD-WAN is an overlay network with virtual endpoints set as needed.  SDI would be the same thing, and the underlayment could then be either a global IP network or the MEF’s Third Network…or any combination of underlayment that offers you the scope and QoS that you want.  Since the Internet is defined by who’s on it and how it’s addressed rather than by the technology used, this would let it continue to conform to consumer-driven regulatory policy and even offer only best-efforts services.  But this approach would also let you personalize your view of the Internet, and other virtual-network services for business could coexist on the underlayment.
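
A minimal sketch of that binding decision, with invented underlay names and SLA classes, might look like this: each overlay endpoint picks the cheapest underlayment that meets the SLA it needs.

```python
# Rank the SLA classes so "meets or exceeds" is a simple comparison.
SLA_RANK = {"best-efforts": 0, "assured": 1, "premium": 2}

# Illustrative underlayment options an SDI overlay could bind to.
UNDERLAYS = [
    {"name": "internet",     "sla": "best-efforts", "cost": 1},
    {"name": "carrier-eth",  "sla": "assured",      "cost": 5},
    {"name": "private-wave", "sla": "premium",      "cost": 20},
]

def bind_endpoint(endpoint: str, required_sla: str) -> dict:
    """Pick the cheapest underlayment that meets or exceeds the requested SLA."""
    candidates = [u for u in UNDERLAYS if SLA_RANK[u["sla"]] >= SLA_RANK[required_sla]]
    choice = min(candidates, key=lambda u: u["cost"])
    print(f"{endpoint}: overlay bound to {choice['name']} ({choice['sla']})")
    return choice

bind_endpoint("consumer-portal", "best-efforts")   # rides the Internet
bind_endpoint("branch-erp",      "assured")        # gets carrier Ethernet
```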

There has actually been a project to address this vision, started by Huawei almost a decade ago and codified by the IEEE in p1903, Next-Generation Service Overlay Network or NGSON.  The architecture for NGSON is described here, and the project is still active, though I’ve not seen much publicity on the concept.  What NGSON seeks to do, technically, is to create an overlay that can bind applications, underlayment features, and user/provider policies into a single element that can then serve as an exchange point for all of these components and stakeholders.

NGSON joins the MEF’s Third Network as a kind of generalized overlay model, and there are a half-dozen IETF proposals that introduce virtualization concepts to bind an overlay and IP underlayment, obviously to the benefit of both the IP router vendors and those with large investments in routers.  I think that in theory any of these could be used to build an SDI, but the path to market adoption would be difficult.

Regulatory policy on consumer networks has shifted to a more consumeristic bias over the last five years in both the US and Europe.  The current picture would appear to put operators in a difficult position were they to adopt an overlay/underlay model that explicitly allowed for parallelism of consumer services and “the Internet”.  That’s certainly true in the US, for example.  In addition, a transformation to an SDI presents some major issues in terms of sunk costs and evolution.

I think it’s clear that the Internet isn’t going to serve all of our network needs, and that the Internet as currently structured forces unfavorable privacy trade-offs and also limits service quality.  However, transforming it directly would demand a major shift in policy, something that’s not likely to gain support in a polarized political environment.  What might have to happen is for SDN to transform networks from the bottom, and implementation of an overlay model could then evolve within that transformation.

Making “Digital Transformation” Real

Brocade did an interesting paper on the topic of digital transformation, something you’ll recall was also a fixture of the Netcracker launch of its Agile Virtualization Platform.  Reading it, it occurred to me that the concept of “digital transformation” is widely accepted but not usefully defined.  That was one of my conclusions on the Netcracker report, and it’s clearly demonstrated in the Brocade survey.  It also occurred to me that some statistical market research I did several years ago might point to a definition, and perhaps even a path forward to achieving the goal.

The Brocade paper documents a global survey that, to me, shows IT organizations grappling with the critical question of relevance.  They’re facing budget constraints, “shadow IT” and other things that demonstrate that line departments and senior management want something from IT, something that IT can’t deliver.  Is that something “digital transformation?”  Perhaps it is, if we define the concept appropriately.

Facts and industry history might be a good place to start our quest.  What I found in my research is that the rate of change in IT spending relative to changes in GDP over the last 50 years, when graphed, forms a sine wave.  If you then lay in key IT innovations along the timeline, you see that the peaks of the wave (when IT spending grows faster than GDP) came when a paradigm shift in IT opened new benefit sources for exploitation.  The curve then rose for four to eight years as businesses absorbed the benefits of the new paradigm, dipped down into cost consolidation when all the gains had been realized, then picked back up again in the next cycle.  There were three major IT cycles prior to about the year 2000.  There have been none since.

I thought when I uncovered this relationship that we were now awaiting the next cycle, a cycle that would reignite benefit-driven IT.  I realize now that’s not the answer.  What we’re looking for is an agile IT model that doesn’t demand big technical paradigm shifts at all.  Every year we get more invested in technology, and cyclical transformations that involve new computing models (minicomputers, PCs) are simply not practical.  What’s needed from “digital transformation” is a restructuring of IT to allow it to couple continuously with business operations through all the twists and turns.  A four-to-eight-year realization period?  Forget it.  Digital transformation has to facilitate almost instant adaptation, and then re-adaptation.  Nothing singular and simple is going to generate it.  It’s a whole new IT and networking model.

That starts at the top.  If you want “digital transformation” you have to transform the relationship between applications and the business, which starts by redefining how you conceptualize the role of applications in supporting workers.  In the past, we’ve used software to create a framework for production, and built human processes to fit the software model.  That’s why the advent of minicomputers, personal computers, and online services was transformative—they let us build the IT model “closer” to the worker.

The line departments think this “framework” has become a cage, though.  Because software processes are highly integrated with each other and increasingly touch the worker only through a GUI, they aren’t particularly responsive to changes or opportunities that demand a more fundamental shift.  We can see this in the way businesses struggle to support mobile work.  Mobile-worker productivity support is effectively contextual, personalized IT.  That eliminates an application-driven flow in favor of linking IT support to worker-recognized events.

It’s not happening today.  In my most recent exchange with enterprises, I found that almost three out of four said their mobile strategy was to make their applications work on a mobile device.  It should have been focused on defining the right information relationship to support a mobile worker.  But how?  By making applications completely event-driven.

Making something event-driven means taking the workflow structure out of applications, and instead looking at any activity as a series of events or “happenings” that suggest their own process support tasks at the application level.  This, you’ll undoubtedly recognize, is an enterprise version of the event-driven OSS/BSS trend in the operator space.  The realization of digital transformation must be to first create an event-driven software structure, then create an agile platform to host it, and finally to create a set of network services that properly connect it.

Microservices, meaning small functional atoms that use web-like (“RESTful”) interfaces to obtain the data they operate on, are generally considered to be the key to making OSS/BSS event-driven, and that’s also true for enterprise software.  However, it’s not a simple matter to change software to create microservices.  Many software components are “stateful,” meaning that they store data across multiple interactions, and that forces them to be used in a flow-driven rather than event-driven way.  Still, it’s almost certain that microservices would be the first step in supporting a realistic model for digital transformation.
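
Here’s a minimal sketch of the distinction, with invented event and field names: the handler keeps no data of its own between events, so any instance can process any event, which is what makes it usable in an event-driven rather than flow-driven way.

```python
# Externalized state; in practice this would be a database or cache, not a dict.
order_store = {}

def handle_order_event(event: dict) -> dict:
    """Stateless microservice-style handler: all context comes from the event and the store."""
    order = order_store.get(event["order_id"], {"status": "new"})
    if event["type"] == "activate":
        order["status"] = "active"
    elif event["type"] == "terminate":
        order["status"] = "closed"
    order_store[event["order_id"]] = order
    return order

handle_order_event({"order_id": "A42", "type": "activate"})
print(order_store["A42"])   # {'status': 'active'}
```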

Underneath this first step there are a few things that could be done at the developer level.  One is embodied, in Java, as the Open Services Gateway initiative, or OSGi.  OSGi lets components be developed and deployed either as locally connected pieces of something or as distributed elements.  If you were to develop all software components using OSGi principles (in whatever language you like), you could host stuff where it makes sense, meaning either as network-connected or locally bound processes.  This is critical because you can’t build efficient software by putting every little function on the network as a separate microservice, but what can be made network-hosted and what for efficiency reasons has to be local will vary depending on the specifics of the application.

Another essential transformation in software development is functional programming.  Functional programming basically means that software elements (“functions” in development terms) take all their arguments from the thing that invoked them and deliver all their results back the same way.  Everything is “passed” explicitly, which means you don’t collect stuff internally and can’t accidentally become stateful.  Microsoft and Oracle (with C# and Java 8, respectively) support this today.  Functional programs could easily be shifted from internally connected to network-connected, and they’re also much more easily adapted to changing business conditions.
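
The text cites C# and Java 8; just to keep all the sketches in one language, here’s the same idea in Python: everything is passed in explicitly, everything comes back explicitly, and nothing accumulates inside the function, so the same code could run locally or behind a network API.

```python
from typing import List

def apply_discount(order_total: float, discount_rate: float) -> float:
    """Pure function: same inputs always give the same output, no side effects."""
    return round(order_total * (1.0 - discount_rate), 2)

def total_after_discounts(line_items: List[float], discount_rate: float) -> float:
    """Composed from pure pieces, so it holds no state between calls."""
    return apply_discount(sum(line_items), discount_rate)

print(total_after_discounts([100.0, 49.99], 0.10))   # 134.99
```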

Hosting this framework is what justifies containers versus virtual machines.  The more you componentize and microservicize something, the more inefficient it is to host a dedicated copy of the OS and middleware for every software instance you deploy.  In fact, in this framework it’s really important to understand the transit delay budget for a given worker’s support and then distribute the components to ensure that you can meet it.  Data center design and connection is paramount, and data center interconnect is a close second.  Networking has to create and preserve a distributed and yet responsive virtual platform to run our agile microservices.
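
A back-of-the-envelope sketch of that delay-budget check, with purely illustrative numbers, shows why component placement and DCI latency matter:

```python
# Assumed end-to-end responsiveness target for the worker's transaction (illustrative).
DELAY_BUDGET_MS = 100

# Per-hop and per-component delays along the request path (illustrative numbers).
path = [
    ("device-to-edge-dc", 8),       # access network round trip
    ("auth-microservice", 5),
    ("context-microservice", 12),
    ("edge-to-regional-dci", 18),   # data center interconnect hop
    ("pricing-microservice", 9),
]

total = sum(ms for _, ms in path)
print(f"total {total} ms vs budget {DELAY_BUDGET_MS} ms -> "
      f"{'OK' if total <= DELAY_BUDGET_MS else 'move components closer'}")
```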

All of these technical shifts are recognized today.  Most developers know about OSGi and functional programming.  Everyone knows about containers and microservices.  What seems to be lacking is an architecture, or even a philosophy, that connects them all.  Could it be that the concept of digital transformation is shrouded in definitional shadows at the high level, and obscured by technical specialization below?  The pieces of digital transformation are all laid out before us, but “us” is organizationally divided and unaware of the high-level goal.  That’s why we can’t even define what “digital transformation” means.

Knowledge is power, they say, but only if it’s a precursor to an informed decision.  Surveys can tell us there isn’t a consensus on digital transformation today, but since no consensus exists, we can’t survey our way to one.  I think event-driven, agile IT that supports rather than defines work is what digital transformation has to be.

Presuming my definition here is the right one, it’s a compelling goal, but it has to be justified for each company by a business case and not just by consensus, and that business case has to be decomposed into specific steps that can be approved and adopted.  Surveys show commitment and confusion, but confusion has to be resolved if investment is to come.  Now that we know the buyers want something, we have to define and then take the appropriate steps.  If Netcracker or Brocade (or Cisco, Dell, IBM, HPE, Microsoft, Oracle, Red Hat…) can create that chain from concept to benefits to technology steps, then they’ve started the march along the path to digital transformation, to their benefit and ultimately to the benefit of the industry.

Does Google’s New Personal/Home Assistant Change the OTT Game?

Google opened a lot of interesting topics with its developer conference this week, and I think there’s a common theme here that aligns with other industry moves and foretells something even more important.  We are moving closer to the concept of the digital assistant as our window on the world, and that could open a big pathway to a much greater use of cloud computing.  In fact, it could become the largest cloud application.

The mobile apps phenomenon was built, perhaps primarily, on the basic truth that a mobile user does things differently than a desktop user.  Not only are they restricted in terms of interaction real estate, they are more tactical in their information needs.  They want answers and not information, and while that goal may have been formulated in mobility, it’s maturing even in the home.  Amazon’s Echo and Alexa seem the inspiration for Google’s assistant and Home products.

Home control based on buttons and dials is self-limiting simply because there’s a limit to the number of these direct controls a user can keep track of.  Voice control is a much more intuitive way of handling something—tell your digital assistant to turn off a light or ask it what conditions are outside.  The number of things you can control may expand, but your pathway to exercising control is always that singular assistant.

Google seems to have drawn inspiration from another source—Facebook—for the second big move, which is the Allo app.  Allo is a chat-driven agent, similar to Facebook’s enhanced Messenger.  What seems important in Allo’s positioning is the enhanced notion of context, which you can see in the quote on Google’s blog, “Because the assistant understands your world….”

Context is critical in exploiting the interest in wearable technology, the support for which is also being increased according to Google.  Wearables offer a closer-to-life interface than a phone or tablet, which means that they naturally expect a more contextual level of support.  The notion of suggested responses to texts, for example, demonstrates contextual value and at the same time makes a watch accessory more useful.

Google’s emphasis on these points isn’t new, either to Google or to the market, but I think it makes it clear that a contextual personal agent war is going to happen, involving Amazon, Apple, Google, and Microsoft.  That’s going to accentuate both context and personal agency, and that’s what could be a game-changer, in terms of network and cloud infrastructure and even in terms of the OTT revenue model.

Logically speaking, both contextual input collection and personal agent analytics would be more effective if they were hosted locally to the user.  The most demanding contextual analysis is surely based on geography, both of the specific user and the user’s social and retail frame of reference.  IoT is an example of a contextual resource, and if you’re going to analyze the conditions in New York it makes sense to do that in New York because otherwise you’re hauling telemetry a very long way.  Similarly, if you have a New York user asking questions they’re probably relating to that user’s home, work, or immediate environment.

All of this argues for a wider distribution of cloud resources, and I think this is magnified by any context-and-agent wars among vendors.  Google probably has greater geographic scope than the others, so wouldn’t they want to play on that advantage?  And if there’s a wider distribution of cloud resources, then there are more resources local to any user and any mission, which could encourage competition among cloud users for “local” facilities whose propagation delays are smaller.

The hosting of agent and contextual processes is clearly a cloud opportunity, but it also has implications in networking.  If we assumed that every user went to an agent approach, then search-related and even casual web access might all migrate there, which would mean that non-content traffic could be short-circuited into a cloud agent rather than sent to a user/device.  While content traffic is the great majority of traffic overall and the largest source of traffic growth, most content traffic really doesn’t transit the Internet, but lives inside content delivery networks.  Something that frames web flows differently might have a very large impact on how “the Internet” really gets connected.

One of the major challenges that this happy transformative trend has to face is that it potentially undermines the whole OTT model.  An agent, at least a good one, doesn’t deliver search results; a verbal request obviously demands a terse verbal response.  That means that the market model shifts away from search, which means that Google in particular has to be thinking about how to become a leader in delivering “contextual advertising”.  This is yet another incentive for context and agency to expand greatly in importance.

I’ve said before that neither contextual communications nor personal agency can work well if you assume all the data is collected and correlated inside a mobile device.  What has to happen is a combination of two things.  One, a personal agent has to evolve to become a personal cloud process that’s doing the analyzing and responding, through either a mobile device or home-control portal.  Second, contextual information has to be collected and subject to ad hoc analytics to generate useful insights into the meaning of a request, based on social, environmental, and retail conditions.  These notions can be expanded to provide ad support.

Contextual services of any sort could be presented through an API for use, and access could be either paid or based on ad sponsorship.  Environmental and geographic context readily adapt to this approach; you can imagine merchants promoting their sales as part of an application that guides someone to a location or along a path, particularly if they’re walking.  Even the personal agent processes could be sponsored or paid on a subscription basis.
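
A hypothetical sketch of such an API (none of these names, endpoints, or fields are real) might return a contextual view and, under ad sponsorship, fold in merchant promotions:

```python
# Purely hypothetical contextual-service call; every field name is invented.
def get_context(user_id: str, location: str, sponsored: bool) -> dict:
    """Return a contextual view; sponsored access injects merchant promotions."""
    context = {
        "user": user_id,
        "location": location,
        "environment": {"weather": "rain", "transit_delay_min": 12},
    }
    if sponsored:
        # Ad-sponsored access: nearby merchants pay to appear in the response.
        context["promotions"] = [{"merchant": "CoffeeCo", "offer": "10% off"}]
    return context

print(get_context("user-123", "5th-and-Main", sponsored=True))
```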

Google may be at a disadvantage versus the other players (Amazon, Apple, and Microsoft) who might address personal agents and contextual programming, simply because the shift we’re discussing would directly impact their primary revenue stream.  The other players have a more retail-direct model that relies not on sponsorship but on user payment for the services, directly or through purchase of a device.  However, because these trends are largely promoted by the increased dependence on mobile devices, it’s hard to see how Google could simply refuse to go along.

Facebook might be the determinant here.  Context is social as well as geographic, and Facebook has a direct link to the social context of most mobile users.  They know who’s connected to whom, what’s happening, and increasingly they can tell who’s communicating directly.  Adding geographic context to this is probably easier than adding social context to geography, at least in the simple sense of locating the user(s) in a social relationship.  IoT would be a difficult addition for any player.  Could Facebook launch an IoT approach?  It seems outlandish, but who would have thought they’d have launched Open Compute?  And Facebook’s core application isn’t subject to disintermediation like search is.

Not all face-offs end up in active conflict, so this contextual-agency thing may or may not mature to the point where it could really impact the market.  If it does, though, it’s going to bring about an almost-tectonic shift in the business model of OTT and a truly tectonic shift in the organization and connection of online information.

Can Second-Tier Vendors Win in a DCI-Centric Model of Infrastructure Evolution?

Juniper had a Wall Street event earlier this week and analysts used terms like “constructive” and “realistic” to describe what the company said.  The central focus in a technical sense was SDN and the cloud, not separately as much as in combination.  Juniper’s estimates for growth through 2019 were slightly ahead of Street consensus, so the question is whether the characterizations of Street analysts are justified, not only for Juniper but for other wannabe network vendors still chasing the market leaders.

Juniper isn’t alone in saying that the cloud, and the SDN changes likely to accompany it, are going to drive more revenue.  The view of most network vendors is that sure, “old” model spending is under pressure, but “new” model spending will make up for it.  There are of course variations in what constitutes the old and new, but I think it’s fair to say that most vendors think the cloud and SDN will be a growth engine.  How much of an engine it will be, IMHO, depends on how effectively vendors address the drivers that would have to underpin the change.

Let’s start with the optimistic view.  If my model is correct, carrier cloud (driven by SDN and NFV) would add over 100,000 new data centers globally.  All of these data centers would be highly connected via fiber, and obviously they’d be distributed in areas of heavy population and traffic.  If we were to see these data centers as deployed purely for cloud computing, that in itself would generate a decent opportunity for data center interconnect (DCI).

If we presumed these data centers were driven more by SDN and NFV it could get even better.  For example, if all wireline broadband were based on cloud-deployed vCPE, then all wireline access traffic would be aggregated into a data center, which means that it would be logical to assume that almost everything in wireline metro aggregation would become a DCI application.  And given that mobile infrastructure, meaning IMS and EPC, would also be SDN/NFV-based, the same would be true on the mobile side.  All of that would combine to create the granddaddy of all DCI opportunities, to the point where most other transport fiber missions except access would be unimportant.

If I were a vendor like Juniper with a commitment to SDN and the cloud, this is the opportunity I’d salivate over, and shortly after having cleaned myself up, I’d be looking at ways to promote this outcome in infrastructure/market evolution terms.  It’s here that problems arise, for Juniper and anyone else who wants to see a DCI-centric future.

The media loves the cloud, but the fact is that even Amazon’s cloud growth hasn’t been able to get cloud computing much above the level of statistical significance in terms of total global IT spending.  We still have a lot of headroom for growth, but if we assume that enterprises’ own estimates of cloud penetration are accurate, we would probably not see cloud computing generating even a fifth of that hundred thousand data centers.  Most significantly, cloud computing doesn’t drive the edge-focused deployment of data centers that SDN and NFV do, and thus doesn’t compel the same level of interconnection.  You have fewer bigger data centers instead.

There is nothing a network equipment vendor can do to promote “traditional” enterprise cloud computing either.  This arises from the transfer of applications that fit the cloud profile out from the data center, and how a network vendor could influence that is unclear to say the least.  For network vendors, in fact, the best way to promote cloud computing growth would be to get behind a cloud-centric (versus mobile-connection-centric) vision of IoT.  Network vendors don’t seem psychologically capable of doing that, so I think we’d have to put encouraging cloud computing as the driver for our DCI explosion off the table.

SDN as a driver, then?  Surely SDN and the cloud seem to go together, but the connection isn’t causal, as they say, in both directions.  If you have cloud you will absolutely have SDN, but you can’t promote cloud computing just by deploying SDN unless you use SDN to frame a more agile virtual-network model.

This is a place where Juniper could do something.  Nokia’s Nuage SDN architecture is in my view the best in the industry as a unified SDN-connection-layer model, but Juniper’s Contrail could be the second-best.  Juniper even has controller federation capability to allow for domain interconnection.  The problem for both vendors seems to be that SDN used this way would transform networks away from traditional switching/routing, and so it could hasten the demise of legacy network revenues.  Would SDN make it up?  Perhaps it would change market leaders, but it’s hard to say why operators would adopt SDN on a large scale as a replacement for traditional L2/L3 if it were more expensive.

Which gets us to NFV.  NFV as a means of creating agile mobile infrastructure is the most credible of the evolutionary-NFV applications.  The challenge is whether a vendor who isn’t a mobile-infrastructure player can drive the deployment, especially given that Ericsson, Huawei, and Nokia all have NFV stories to tell.  Obviously, any major mobile-infrastructure NFV deployment could create an explosion in the number of cloud data centers and drive DCI, but fortune would likely favor the vendors who were actually driving the deployment.

The big thing about NFV data centers is the potential they’d be widely distributed and that they’d be natural focus points for service traffic.  That, as I said up front, is what would make them revolutionary in DCI terms.  The obvious question is whether the mobile-infrastructure players who could drive the change would benefit enough from it—data centers would house servers after all, and DCI replacement of traditional metro infrastructure would impact most of the big vendors by cutting switching/routing spending even faster (and further).

Ericsson and Cisco would seem to have an edge here because they have a server and data center strategy that would give them an opportunity to gain revenue from a shift to hosted, DCI-centric, metro infrastructure.  Ericsson has also been a strong player in professional services, and Cisco’s quarterly call this week showed they had a significant gain in professional services and that they are stressing data center (UCS and the cloud) infrastructure in their profit planning.  In fact, Cisco is making a point of saying they are shifting to a software and subscription revenue model even for security.

Conceptually, smaller players in an industry should have first-mover advantages, but in networking in general (and with Juniper in particular) the smaller players have been at least as resistant to change as the giants.  Juniper actually launched a software-centric strategy at a time when Cisco was in denial with respect to just about every network change.  They recognized transformation and the cloud at least two years earlier than the industry at large, and they had some product features (like separation of control and data plane) that could have given them an edge.  They just didn’t have the market mass or insight to make good on their own thought leadership.

That’s what will make the DCI opportunity difficult for any second-tier vendor.  The drivers of the opportunity are massive market shifts, shifts that will take positioning skill, product planning, and just plain guts to address.  Especially now, because the giants in the space have awoken to the same opportunity.

Netcracker’s AVP: Is This the Right Approach to SDN and NFV?

I had an opportunity this week to look over some material from Netcracker on their notion of a “digital service provider”, part of the documentation that relates to their Agile Virtualization Platform concept.  I also reviewed what was available on the technology and architecture of AVP.  I find the technology fascinating and the research and even terminology a little confusing.

Netcracker is an OSS/BSS player, of course, and as such they represent an interesting perspective on the transformation game.  My surveys say that the OSS/BSS vendors are obviously more engaged with the CIO, but they are also better-connected with the CFO and CEO and less connected with the COO and CTO.  That makes them almost the opposite of the network equipment vendors, and that alone means that their vision could be helpful in understanding what’s going on.  It’s also a chance to compare their views with what I’ve learned in my own contacts with operators, so let’s start at the top.

What exactly is a “Digital Service Provider” or, to use the TMF term, a “digital transformation”?  Obviously not just a provider of digital services, because “digital” in a strict sense is all there is these days.  I think what both concepts are getting at is that operators need to be able to create a wider variety of services more efficiently and more quickly, which means that software and IT have to play a larger role—perhaps even the dominant role.  So the notion of AVP is to facilitate that.

What drives operators to want a digital transformation, says the material, is almost completely reactive.  Customers demand it, revenue gains depend on it, competition is doing it…these are indications of a market driven by outside forces rather than one trying to get ahead of the curve.  It’s not that operators are being dragged kicking and screaming into the transformation, perhaps, but they are surely not romping off on their own accord.

The barriers to achieving the transformation are equally interesting, or at least one point is.  Operators told Netcracker that technical factors like operations and integration were the most important inhibitor only about a quarter of the time.  Factors like staffing, skills, and culture were far more important in the survey, and perhaps most interesting of all was the fact that only about 15% of operators seemed to be groping for solutions—the rest said they either had transformed or were well on their way.

I have to confess I have a bit of a problem with these points, for two reasons.  First, it would seem the survey shows that AVP is too late and doesn’t address the main issue set, which is skills and culture and not technology.  Second, it’s hard to see how Netcracker or anyone else would have much of a shot at solving market problems if 85% of the buyers don’t need a new approach.

My own surveys have yielded different responses.  The overwhelming majority of operators tell me that their driver for change is profit compression for connection-oriented services.  Only a small percentage (and almost all of them Tier Ones or MSPs) have an approach lined up, and an even smaller percentage says they’ve made substantial progress implementing one.  Thus, my own data seems to make Netcracker’s case for opportunity more strongly.

Interestingly, a different Netcracker document, the AVP brochure, frames it differently.  There the big problem is network resources and configuration, with staff and culture second and third, and cost and operations processes and systems trailing.  This brochure also lays out three reasons for the “slow process” (recall that the other document says only 15% are lagging).  These are commercialization uncertainty, operational complexity, and organizational misalignment.  The last of these corresponds to the staff/culture point, and I’d say that the other two are different perspectives on the “resources and configuration”, “cost”, and “operations processes and systems” factors.  I don’t think the inconsistencies here are fatal issues, but they do create a bit of confusion.

My surveys say that operators are generally committed to a two-prong approach.  In the near term, they believe that they have to make operations processes more efficient and agile, and they believe this has to be done by introducing a lot of software-driven automation.  In the longer term, they believe that they need to find revenue beyond connection-based services.

AVP is interesting from a technology perspective, perhaps even compelling.  Netcracker says it’s made up of composable microservices, and that sounds like the approach that I think is essential to making OSS/BSS “event-driven”.  Unfortunately, there aren’t enough details provided in any of the material for me to assess the approach or speculate on how complete it might be.  For the record, I requested a complete slide deck from them and I’ve not received one.

AVP is a Netcracker roadmap that has many of the characteristics (and probably all of the goals) that operators’ own architectures (AT&T and Verizon’s recent announcements for example) embody.  Their chart seems to show four primary elements—a professional services and knowledge-exchange practice, enhanced event-driven operations processes, a cloud deployment framework that would host both operations/management software and service elements, and the more-or-less expected SDN/NFV operations processes.  Netcracker does have its own E2E orchestration product, but the details on the modeling it uses and how it links to the rest of the operations/management framework aren’t in the online material.

If operators’ visions of a next-gen architecture are valid (and after all, the operators should be the benchmark for validity) then the Netcracker model is likewise valid, but it does have some challenges when it’s presented by a vendor and without specific reference to support for operator models.  My surveys say that the big problems are the state of SDN/NFV and the political gap that’s inherent in the model itself.

Remember who OSS/BSS vendors call on?  The CIO is surely a critical player in the network of the future, and might even be the critical player in both the operations-efficiency and revenue-agility goals.  However, they aren’t the ones that have been pushing SDN and NFV—that’s been primarily the CTO gang.  Operators are generally of the view that if there is any such thing as a “digital transformation” of infrastructure, SDN and NFV are the roots of it.  Interestingly they are also of the view that the standards for SDN and NFV don’t cover the space needed to make a business case—meaning fulfill either the cost-control or revenue goal I’ve already cited.  So we have CIOs who have the potential to be the play-makers in both benefits, the OSS/BSS vendors (including Netcracker) who could engage them…and then across a gap the CTOs who are driving infrastructure change.

Properly framed, the Netcracker model could link the layers of humans and technology that have to be linked to produce a unified vision of the network of the future.  Properly framed, it could even harmonize SDN and NFV management from within, and then with operations management.  It’s easier for me to see this being done from the top, from the OSS/BSS side, than from the bottom.  But it’s not going to happen by itself.  Vendors, operators, and even bodies like the TMF, at whose event Netcracker made its presentation, need to take the process a little more seriously.  Absent a unified, credible approach from benefits to networks, operator budgets will just continue their slow decline under profit pressure.

I think OSS/BSS vendors have a great opportunity.  My research and modeling shows that an operations-centric evolution of network services could produce most of the gains in efficiency and agility that have been claimed for SDN and NFV.  Without, of course, the fork-lift change in infrastructure.  That story should be very appealing to operators and of course to the OSS/BSS types, but what seems to be happening is a kind of stratification in messaging and in project management.  Operations vendors sing to CIOs, network equipment vendors to CTOs, and nobody coordinates.  Maybe they all need to take an orchestration lesson themselves.

Service Assurance in the Network of the Future

One of the persistent questions with both SDN and NFV is how the service management or lifecycle management processes would work.  Any time that a network service requires cooperative behavior among functional elements, the presumption is that all the elements have to be functioning.  Even with standard services, meaning services over legacy networks, that can be a challenge.  It’s even more complicated with SDN and NFV.

Today’s networks are multi-tenant in nature, meaning that they share transmission/connection facilities to at least some degree.  Further, today’s networks are based on protocols that discover state and topology through adaptive exchanges, so routing is dynamic and it’s often not possible to know just where a particular user’s flows are going.  In most cases these days, the functional state of the network is determined by the adaptive processes—users “see” in some way the results of the status/topology exchanges and can determine if a connection has been lost.  Or they simply don’t see connectivity.

QoS is particularly fuzzy.  Unless you have a mechanism for measuring it end-to-end, there’s little chance that you can determine exactly what’s happening with respect to delay or packet loss.  Most operator guarantees of QoS are based on performance management through traffic engineering, and on capacity planning.  You design a network to offer users a given QoS, and you assume that if nothing is reporting a fault the users are getting it.

It’s tempting to look at this process as being incredibly disorderly, particularly when you contrast it with TDM services, which, because they dedicated resources to the user, could define state and QoS with great precision at any point.  However, it’s not fair to SDN or NFV to expect that they will do better than the current state of management, particularly if users expect lower prices down the line, and operators lower opex.

The basic challenge SDN poses in at least replicating current management knowledge is that, by design, adaptive exchanges don’t determine routes, and in fact don’t happen at all.  If that’s the case, then there is no way of knowing the state of the devices unless the central controller or some other central management element knows it.  Which, of course, means that the devices have to provide that state.  An SDN controller has to know network topology and has to know the state of the nodes and trunks under its control.  If that’s true, then the controller can construct the same knowledge of overall network conditions that the network acquired through adaptive exchanges, and you could replicate management data and SLAs.
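To make that concrete, here’s a minimal Python sketch of the idea (the structures and names are my own illustration, not any real controller’s API): devices report their state to a central view, and the controller derives route health from topology plus that reported state.

```python
# A minimal sketch, assuming hypothetical structures: an SDN controller that
# rebuilds "adaptive-exchange"-style network knowledge from device-reported state.
from dataclasses import dataclass, field

@dataclass
class Trunk:
    a: str           # node at one end
    b: str           # node at the other end
    up: bool = True

@dataclass
class ControllerView:
    nodes: dict = field(default_factory=dict)    # node name -> "up"/"down"
    trunks: list = field(default_factory=list)   # known trunks

    def report_node(self, name, state):
        self.nodes[name] = state                 # devices push their state here

    def report_trunk(self, trunk):
        self.trunks.append(trunk)

    def path_is_up(self, path):
        """A route is 'up' only if every node and trunk along it is up."""
        for hop in zip(path, path[1:]):
            nodes_ok = all(self.nodes.get(n) == "up" for n in hop)
            trunk_ok = any(t.up and {t.a, t.b} == set(hop) for t in self.trunks)
            if not (nodes_ok and trunk_ok):
                return False
        return True

view = ControllerView()
view.report_node("A", "up"); view.report_node("B", "up"); view.report_node("C", "down")
view.report_trunk(Trunk("A", "B")); view.report_trunk(Trunk("B", "C"))
print(view.path_is_up(["A", "B"]))        # True
print(view.path_is_up(["A", "B", "C"]))   # False: node C is down
```

With that kind of state in hand, the controller can publish the same sort of management data an adaptive network would have yielded.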

NFV creates a different set of problems.  With NFV the service depends in part on functions hosted on resource pools, and these are expected to offer at least some level of “automatic recovery” from faults, whether that happens by instantiating a replacement copy, moving something, reconnecting something, or scaling something under load.  This self-repair means that a fault might exist at the atomic function level but you don’t want to recover from it at the service level till whatever’s happening internally has been completed.

The self-remediation model of NFV has, in the NFV ISG and IMHO, led to a presumption that lifecycle management is the responsibility of the individual virtual network functions.  The functions contain a local instance of a VNF management process and this would presumably act as a bridge between the state of resources and their management and the state of the VNFs.  The problem of course is that the service consists of stuff other than that single VNF, and the state of the service still has to be composited.
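Here’s a small sketch of that compositing problem, with purely hypothetical names: each VNF manager hides its local self-repair behind a “recovering” state, and the service state has to be derived from all of the pieces together.

```python
# A sketch under stated assumptions (names are illustrative, not from the ISG spec):
# per-VNF managers handle local self-repair; service state is a composite.
class VnfManager:
    def __init__(self, name):
        self.name = name
        self.state = "ok"            # "ok" | "recovering" | "failed"

    def on_resource_fault(self):
        # Local self-repair first (re-instantiate, move, scale); only report
        # "failed" if that recovery cannot complete.
        self.state = "recovering"

    def on_recovery_done(self, success):
        self.state = "ok" if success else "failed"

def service_state(vnf_managers, other_elements_ok=True):
    """The service is more than any one VNF: composite all components."""
    if not other_elements_ok or any(m.state == "failed" for m in vnf_managers):
        return "faulted"             # now service-level recovery is warranted
    if any(m.state == "recovering" for m in vnf_managers):
        return "degraded"            # hold off; local repair is still running
    return "ok"

fw, nat = VnfManager("firewall"), VnfManager("nat")
fw.on_resource_fault()
print(service_state([fw, nat]))      # "degraded" while the firewall self-heals
```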

The operators’ architectures for NFV and SDN deployment, now emerging in some detail, illustrate that operators are presuming that there is in the network (or at least in every domain) a centralized service assurance function.  This function collects management information from the real stuff, and also provides a means of correlating the data with service state and generating (in some way) the notifications of faults to the service processes.  It seems that this approach is going to dominate real SDN and NFV deployment, but the exact structure and features of service assurance aren’t fully described yet.

What seems to have emerged is that service assurance is a combination of three functional elements: aggregation of resource status, service correlation, and event generation.  In the first of these, management data is collected from the things that directly generate it, and in some cases at least the data is stored or cached.  An analytics process operates on this data to drive what are essentially two parallel processes—resource management and service management.  The resource management process is aimed at remedying problems with physical elements like devices, servers, and trunks.  The service management process is designed to address SLA faults, and so it could just as easily replace a resource in a service as require that it be fixed—in fact, that would be the normal course.
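A toy Python sketch of those three elements (the names are illustrative, not drawn from any operator architecture) might look like this:

```python
# Aggregation of resource status, service correlation, and event generation,
# reduced to their bare bones for illustration.
from collections import defaultdict

resource_status = {}              # aggregation: resource -> "up"/"down"
service_map = defaultdict(list)   # correlation: resource -> services that use it

def aggregate(resource, status):
    resource_status[resource] = status

def bind(service, resources):
    for r in resources:
        service_map[r].append(service)

def generate_events(resource):
    """One resource fault fans out to resource management and to every impacted service."""
    events = []
    if resource_status.get(resource) == "down":
        events.append(("resource-fault", resource))        # -> resource management
        for svc in service_map[resource]:
            events.append(("sla-risk", svc, resource))     # -> service management
    return events

bind("vpn-42", ["server-7", "trunk-3"])
bind("vpn-99", ["trunk-3"])
aggregate("trunk-3", "down")
print(generate_events("trunk-3"))
# [('resource-fault', 'trunk-3'), ('sla-risk', 'vpn-42', 'trunk-3'), ('sla-risk', 'vpn-99', 'trunk-3')]
```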

Service management in both SDN and NFV is analogous to end-to-end adaptive recovery as found in legacy networks.  You are going to “fix” a problem by reconfiguration of the service topology and not by actually repairing something.  If something is broken, that becomes a job for the resource management processes.

Resource management doesn’t appear to present any unusual challenges.  You have device state for everything, and so if something breaks you can fix it.  It’s service management that poses a problem because you have to know what to reconfigure and how to reconfigure it.

The easiest way to determine whether a service has faulted is to presume that something internal to the service is detecting the fault, or that the service users are reporting it.  Again, this may seem primitive, but it’s not really a major shift from what happens now.  If this approach is taken, then the only requirement is that there be a problem-analysis process to establish not what specifically has happened but what can be done to remedy the fault by reconfiguration.  The alternative is to assume that the service assurance function can identify the thing that’s broken and the services that are impacted.

Both these options seem to end up in the same place.  We need to have some way of knowing when a virtual function or SDN route has failed.  We need to have a recovery process that’s aimed at the replacement of that which has broken (and perhaps a dispatch task to send a tech to fix a real problem).  We need a notification process that gives the user a signal of conditions comparable to what they’d get in a legacy network service.  That frames the reality of service assurance.

I think that the failing of both SDN and NFV management to date lies in this requirement set.  How, if internal network behavior is not determined by adaptive exchange, does the service user find out about reachability and state?  If SDN replaces a switch/router network, who generates the management data that each device would normally exchange?  In NFV how do we reflect a virtual function failure when the user may not be directly connected to the function, but somewhere earlier/later in the service chain?

The big question, though, is one of service configuration and reconfiguration.  We cannot assume that every failed white box or server hosting a VNF can be recovered locally.  What happens when we have to change the configuration of the service enough that the elements outside the failing domain have to be changed to reflect the reconfiguration?  If we move a VNF to another data center, don’t we have to reconnect the WAN paths?  Same with SDN domains.  This is why the issue of recovery is more than one of event generation or standardization.  You have to be able to interpret faults, yes, but you also have to be able to direct the event to a point where knowledge of the service topology exists, so that automated processes can reconnect everything.  Where is that point?

In the service model, or it’s not anywhere.  Lifecycle management is really a form of DevOps, and in particular of the declarative model where the end-state of a service is maintained and compared with the current state.  This is why we need to focus quickly on how a service is modeled end-to-end and integrate that model with service assurance, for both legacy and “new” technologies.
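For illustration only, a declarative reconciliation loop of the kind I’m describing could be as simple as this (all names hypothetical):

```python
# The service model holds the intended end-state; lifecycle management compares
# it with observed state and issues whatever actions close the gap.
def reconcile(desired, observed):
    actions = []
    for element, target in desired.items():
        actual = observed.get(element, "missing")
        if actual != target:
            actions.append(f"redeploy/reconfigure {element}: {actual} -> {target}")
    return actions

desired_model = {"vFirewall": "deployed", "wan-path-A": "connected"}
observed_state = {"vFirewall": "failed", "wan-path-A": "connected"}
print(reconcile(desired_model, observed_state))
# ['redeploy/reconfigure vFirewall: failed -> deployed']
```

The point is that the model, not the individual event handlers, is where knowledge of the end-to-end service topology lives.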

Overlay/Underlay Networking and the Future of Services

Overlay networks have been a topic for this blog fairly often recently, but given that more operators (including, recently, Comcast) have come out in favor of them, I think it’s time to look at how overlay technology might impact network investment overall.  After all, if overlay networking becomes mainstream, something of that magnitude would have to impact what these networks get overlaid onto.

Overlay networks are virtual networks built by adding what’s essentially another connection layer on top of prevailing L2/L3 technology.  Unlike traditional “virtual networks” the overlay networks are invisible to the lower layers; devices down there treat them as traffic.  That could radically simplify the creation of virtual networks by eliminating the need to manage the connectivity in a “real” network device, but there are other impacts that could be even more important.  To understand them we should start at the top.

There are two basic models of overlay network—the nodal model and the mesh model.  In the nodal model, the overlay includes interior elements that perform the same functions that network nodes normally perform in real networks—switching/routing.  In the mesh model, there are no interior nodes to act as concentrators/distributors of traffic.  Instead each edge element is connected to all the others via some sort of tunnel or lower-level service.

The determinant of the “best” model will in most cases simply be the number of endpoints.  Both endpoints and nodes have “routing tables”, and as is the case with traditional routing, the tables don’t have to include every distinct endpoint address, but rather only the portion of an address needed to make a forwarding decision.  However, if the endpoints are meshed, then each endpoint has to make a forwarding decision for every other endpoint, which means the endpoint routing tables get large and expensive to process.

Interior node points can simplify the routing tables, particularly since the address space used in an overlay network need not in any way relate to the underlying network address space.  A geographic/hierarchical addressing scheme could be used to divide a network into areas, each of which might have a collecting/distributing node.  Placing a node at a particular location can also force traffic along a chosen path, which would be helpful for traffic management.
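A quick back-of-the-envelope calculation, under my own simplifying assumptions, shows why that matters as endpoint counts grow:

```python
# Full mesh: every endpoint holds an entry per remote peer.  A two-level
# geographic/hierarchical scheme needs only local peers plus one prefix per area.
def mesh_table_size(endpoints):
    return endpoints - 1

def hierarchical_table_size(endpoints, areas):
    per_area = endpoints // areas
    return (per_area - 1) + (areas - 1)

n = 10_000
print(mesh_table_size(n))                      # 9999 entries in every endpoint
print(hierarchical_table_size(n, areas=100))   # 99 + 99 = 198 entries
```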

The notion of an overlay-based virtual network service clearly empowers endpoints, and if the optimization of nodal locations is based on sophisticated traffic and geography factors, it would also favor virtual-node deployments in the network interior.  Thus, overlay networks could directly promote (or be promoted by) NFV.  One of the two “revolutionary elements” of future networking is therefore a player here.

So is the other.  If tunnels are the goal, then SDN is a logical way to fulfill that goal.  The advantage SDN offers is that the forwarding chain created through OpenFlow by central command could pass wherever it’s best assigned, and each flow supported by such a chain is truly a ship in the night relative to others in terms of addressability.  If central management can provide proper traffic planning and thus QoS, then all the SDN flows are pretty darn independent.

The big question for SDN has always been domain federation.  We know that SDN controllers work, but we can be pretty sure that a single enormous controller could never hope to control a global network.  Instead we have to be able to meld SDN domains, to provide a means for those forwarded flows to cross a domain boundary without being elevated and reconstituted.  If that capability existed, it would make SDN a better platform for overlay networks than even Ethernet with all its enhancements.
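To illustrate what federation would have to mean in practice, here’s a sketch (hypothetical data structures, not OpenFlow syntax) of one end-to-end flow expressed as per-domain forwarding chains stitched at gateway ports, so the overlay never has to be terminated and reconstituted:

```python
# One logical flow, programmed as a segment per SDN domain; each domain's
# controller installs only its own piece of the chain.
end_to_end_flow = {"id": "tenant7-flow12", "src": "siteA", "dst": "siteB"}

per_domain_chains = [
    {"domain": "metro-east", "in": "siteA-port", "out": "gw-1-east"},
    {"domain": "core",       "in": "gw-1-west",  "out": "gw-2-north"},
    {"domain": "metro-west", "in": "gw-2-south", "out": "siteB-port"},
]

def stitch(flow, chains):
    return [
        {"flow": flow["id"], "domain": c["domain"], "match": c["in"], "forward": c["out"]}
        for c in chains
    ]

for rule in stitch(end_to_end_flow, per_domain_chains):
    print(rule)
```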

The nature of the overlay process and the nature of the underlayment combine to create a whole series of potential service models.  SD-WAN, for example, is an edge-steered tunnel process that often provides multiple parallel connection options for some or even all of the service points.  Virtual switching (vSwitch) provides what’s normally an Ethernet-like overlay on top of an Ethernet underlayment, but still separates the connection plane from the transport process, which is why it’s a good multi-tenant approach for the cloud.  It’s fair to say that there is neither a need to standardize on a single overlay protocol or architecture, nor even a value to doing so.  If service-specific overlay competition arises and enriches the market, so much the better.

Where there obviously is a need for some logic and order is in the underlayment.  Here, we can define some basic truths that would have a major impact on the efficiency of traffic management and operations.

The first point is that the more overlays you have the more important it is to control traffic and availability below the overlay.  You don’t want to recover from a million service faults when one common trunk/tunnel has failed.  This is why the notion of virtual wires is so important, though I want to stress that any of the three major connection models (LINE, LAN, TREE) would be fine as a tunnel model.  The point is that you want all possible management directed here.  This is where agile optics, SDN pipes, and so forth, would live, and where augmentation of current network infrastructure to be more overlay-efficient could be very helpful.

The second point, which I hinted at above, is that you need to define domain gateways that can carry the overlays among domains without forcing you to terminate and reestablish the overlays, meaning host a bunch of nodes at the boundary.  Ideally, the same overlay connection models should be valid for all the interconnected domains so a single process could define all the underlayment pathways.  As I noted earlier, this means domain federation has to be provided no matter what technology you use for the underlayment.

The third point is that the underlay network has to expose QoS or class of service capabilities as options to the overlay.  You can’t create QoS or manage traffic in an overlay, so you have to be able to communicate between the overlay and underlay with respect to the SLA you need, and you have to then enforce it below.

The final point is universality and evolution.  The overlay/underlay relationship should never depend on technology/implementation of either layer.  The old OSI model was right; the layers have to see each other only as a set of exposed services.  In modern terms, that means that both layers are intent models with regard to the other, and the overlay is an intent model to its user.  The evolution point means that it’s important to map network capabilities in the overlay to legacy underlayment implementations, because otherwise you probably won’t get the scope of implementation you need.
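As a purely illustrative sketch of that intent-model relationship (none of these names come from a real product or standard), the underlay would expose only connection models and SLA classes and hide everything else:

```python
# The overlay requests a LINE/LAN/TREE connection with an SLA class; whether the
# underlay delivers it with agile optics, SDN pipes, or legacy MPLS is invisible.
class UnderlayIntent:
    MODELS = {"LINE", "LAN", "TREE"}
    CLASSES = {
        "best-effort": {},
        "low-latency": {"max_delay_ms": 20, "max_loss": 0.001},
    }

    def request(self, model, endpoints, sla_class):
        if model not in self.MODELS or sla_class not in self.CLASSES:
            raise ValueError("unsupported connection model or SLA class")
        return {"model": model, "endpoints": endpoints,
                "sla": self.CLASSES[sla_class], "handle": "tunnel-001"}

underlay = UnderlayIntent()
print(underlay.request("LINE", ["gw-east", "gw-west"], "low-latency"))
```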

You might wonder at this point why, if overlay networking is so powerful a concept, operators haven’t fallen over themselves to implement it.  One reason, I think, is that the concept of overlay networks is explicitly an OTT concept.  It establishes the notion of a network service in a different way, a way that could admit new competitors.  If this is the primary reason, though, it may be losing steam, because SD-WAN technology is already creating OTT competition without any formal overlay/underlay structures.  The fact that anyone can do an overlay means nobody can really suppress the concept.  If it’s good and powerful, it will catch on.

Can We Apply the Lessons of NFV to the Emerging IoT Opportunity?

I blogged yesterday about the OPNFV project for Event Streams and the need to take a broad view of event-driven software as a precursor to exploring the best way to standardize event coding and exchange.  It occurred to me that we’re facing the same sort of problem with IoT, focusing on things that would matter more if we had a broader conception of the top-down requirements of the space.  Let me use the same method to examine IoT as I used for the Event Streams announcement—examples.

Let’s suppose that I have a city that’s been equipped with those nice new IoT sensors, directly on the Internet using some sort of cellular or microcellular technology.  It’s 4:40 PM and I left work early to get a jump on traffic.  So did a half-million others.  I decide that I’m going to use my nice IoT app to find me a path to home that’s off the beaten path, so to speak.  I activate my app, and what happens?

What I’m hoping for, remember, is a route to my destination that’s not already crowded with others, or will shortly become crowded.  That information, the IoT advocates would say, is exactly what IoT can provide me.  But how, exactly?  If the sensors count cars, I could assume that car counts would be a measure of traffic, but a car counter would count not cars but cars passing it.  If the traffic is at a standstill, how many cars are passing?  Zero, so I have a bad route choice.

However, it may not be that bad, because I may never see the data in the first place.  Remember, I have a half-million drivers sharing the road with me, and most of them probably want to get home early too.  So what are they doing?  Answer: hitting their IoT app to find a route.  If that app is querying those sensors, then I’ve got a half-million apps vying for access to a device that might be the size of a paperback book.  We have websites taken down by DDoS attacks of that size or smaller, and those sites are supported by big pipes and powerful servers.  My little sensor is going to weather the storm?  Not likely.

But even if I got through, would I understand the data?  I could presume that the sensors would be based on a basic HTTP exchange like the one that would fetch a web page.  Certainly I could get something like an XML or JSON payload delivered that way, but what’s the format?  Does the car sensor give me the number of cars in the last second, or minute, or hour, or what?  Interpreting the data starts, after all, with understanding what data is actually being presented.
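To make the interpretation problem concrete, here are two hypothetical JSON payloads a car-counting sensor might return.  Both are perfectly valid, and they mean very different things unless the semantics are pinned down somewhere:

```python
# Invented payloads, for illustration only: without agreed semantics, a zero
# count could mean an empty street or total gridlock with nothing moving.
import json

payload_a = json.loads('{"sensor": "cam-114", "cars": 42}')   # 42 since when?
payload_b = json.loads('{"sensor": "cam-114", "cars_per_min": 0, "window_s": 60}')

def naive_congestion_guess(payload):
    count = payload.get("cars", payload.get("cars_per_min", 0))
    return "looks clear?" if count == 0 else "traffic present"

print(naive_congestion_guess(payload_a))   # "traffic present"
print(naive_congestion_guess(payload_b))   # "looks clear?" -- possibly dead wrong
```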

But suppose somehow all this worked out.  I’ve made the first turn on my route, and so have my half-million road-sharing companions.  Every second, conditions change.  How do I know about the changes?  Does my app have to keep running the same decision process over and over, or does the sensor somehow signal me?  If the latter is the case, how would a specific sensor know 1) who I was, 2) what I wanted and 3) what I had to know about to get what I wanted?

OK, you say, this is stupid.  What I’d really do is go to my “iIoT service” on my iPhone, where Apple would cheerfully absorb all this sensor data and give me answers without all these risks.  Well, OK, but that raises the question of why a city-full of those IoT sensors got deployed when they’re nothing but a resource for Apple to exploit.  Did Apple pay for them?  Ask Tim Cook that on the next shareholder call.  If Apple is just accessing them on “the Internet” because after all this is IoT, then are Apple and others expecting to pay for the access?  If not, why did the sensors ever get deployed?  If so, how does that cheap little sensor know who Apple is versus some shameless exploiter of their data?

Maybe, you speculate, we solve some of our problems with the device that started them, our phone.  Instead of counting cars, we sense the phones that are nearby.  Now we know the difference between an empty street and gridlock.  Great.  But now we have thousands of low-lifes tracking women and children.  Prevent that with access controls and policies, you say?  OK, but remember this is a cheap little sensor that you’ve already had to give the horsepower of a superserver to.  Now we have to analyze policies and detect bad intent?

Or how about this.  A bunch of Black Hats says, gee we could have fun by deploying a couple hundred “sensors” of our own, giving false data, and getting a bad traffic situation to become gridlocked enough that even emergency fire and rescue can’t get through.  Or we’re a gang of jewel thieves with an IoT getaway strategy.  How do these false-flag sensors get detected?

Sometimes insight comes in small steps.  For example, the Event Stream project talks about Agents that get events and Collectors that store them in a database.  This kind of structure is logical to keep primary event generators from being swamped by all the processes that need to know the state of resources.  Isn’t it logical to assume that this same sort of decoupling would be done in IoT?  The project seeks to harmonize the structure of event records; isn’t it logical to assume that sensor outputs would similarly have to be harmonized?  Resource information in NFV and sensor data in IoT both require what are essentially highly variable and disorderly sources to be loosely coupled with highly variable and disorderly process sets that interpret the stuff.  The issues raised by each would then be comparable.

Once we presume that we need to have common coding for event analysis and some sort of database buffering to decouple the sensors in IoT from the processes, we can resolve most of these other questions because we don’t have a sensor network problem anymore, we have a database problem, and we know how to address all the concerns raised above if we presume that context.  But just as Event Streams have to trigger an awareness of the need for contextual event processing, so the existence of a database where sensor data is collected and from which it’s distributed begs the question of what apps do and how public policy goals are maintained.
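Here’s a small sketch of that decoupling applied to IoT (names invented for illustration): sensors publish once to a collector, the collector stores harmonized records, and the half-million route apps query the database, never the sensors.

```python
# The collector is the only thing that ever touches the sensor; apps read
# recent, harmonized records from the store.
import time

event_store = []   # stands in for a real database

def collector_ingest(sensor_id, reading):
    """Harmonize and buffer one sensor report; the sensor is now out of the loop."""
    event_store.append({"sensor": sensor_id, "reading": reading, "ts": time.time()})

def app_query(sensor_id, max_age_s=60):
    """Apps query the store, so a half-million requests never hit the device."""
    cutoff = time.time() - max_age_s
    return [e for e in event_store if e["sensor"] == sensor_id and e["ts"] >= cutoff]

collector_ingest("cam-114", {"cars_per_min": 0})
print(app_query("cam-114"))
```

Access control, policy enforcement, and false-sensor detection then become database and platform problems, which we already know how to handle.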

We’re not there yet with IoT.  Even IT vendors who make IoT announcements are still focusing on low-power protocols and transmitters and not worrying about any of the real issues.  And these are the vendors who already sell the databases and analytics products and clouds of servers, who have the technology to present a realistic model.

Way back in 2013, in CloudNFV, I outlined the set of issues that NFV would have to address, and everyone who was involved in that process knows how hard I tried to convince both vendors and operators that key issues were being ignored.  It’s now 2016, and we’re only just starting to address them.  Could we have a complete NFV implementation to deploy today if we’d accepted those issues in 2013 when they were first raised?  My point isn’t to aggrandize my own judgment; plenty of others said much the same thing.  Many are saying it now about IoT.  Will we insist on following the same myopic path, overlooking the same kind of issues, for that technology?  If so, we’re throwing out a lot of opportunity.