Market Evolution: The “Stages” of SD-WAN

You probably know all about the stages of grief, or the stages of project post-mortem examinations.  How about the stages of SD-WAN?  The fact is that SD-WAN is undergoing a series of parallel revolutions, and the way they interact will shape both the opportunity and the competitive landscape, at both the product and service levels.

SD-WAN is an emerging but still very minor piece of the network services market.  Most of the market estimates place total revenues well below $200 million annually, and it’s divided up among a large number of players.  Recent market data reported here lists 13 vendors and an “others” category that seems to have about that same number in it.  Given that my own model says the SD-WAN opportunity could reach $40 billion by 2022, a two-hundredfold increase, it’s no surprise that there are plenty of interested startups and even significant early M&A (Cisco and VMware both bought SD-WAN firms).

The question, of course, is what’s driving the growth.  My view is that the drivers in the SD-WAN space have been somewhat diverse from the start, and that a significant new set of feature requirements is being introduced now.  It’s these new requirements that will generate the incremental opportunity for SD-WAN, not simply the growth of opportunity based on the factors that dominate the market today.  The best way to explore this is to consider the “phases of SD-WAN.”

The first phase of SD-WAN emerged as a requirement set at about the same time as NFV did, which was 2012.  The problem that drove the consideration was similar in a sense to the problem that was emerging at the same time in cloud data centers—tenant networking.  Applications could be considered “tenants” in the cloud, and the need for uniform application access by workers located anywhere added quickly to the first-phase requirements.  The original model was called “hybrid SDN” or “SDN-WA” to distinguish between the data-center-centric SDN model and this newer one.

There were thus two interdependent requirements here.  One was the desire to create a Nicira-like independent overlay network that offered greater control of applications and users, and the other was the desire to create an overlay that, precisely because it was an overlay, could form a new service layer spanning different network connection options.  Many sites didn’t have Ethernet access or MPLS VPNs, but they all had Internet access.

Some players in the market, notably Nicira (bought by VMware and made into NSX) and Nuage (bought by Nokia) evolved their capabilities more in line with the notion of “hybrid SDN” and “SDN-WA”.  These provided tight integration between their network solution and data center hosting, and in some cases even vertical integration between the service overlay and the underlying services.  Others saw an opportunity for a simple premises device (or process) that exploited the specific need for a hybrid between MPLS VPNs and the Internet.  This second group makes up what we call “SD-WAN” today, and its emergence marks the mainstream of the first phase of SD-WAN.

Even as this early SD-WAN vision was maturing, another issue set emerged.  Public cloud computing, particularly when it’s used for failover, scaling, and multi-cloud, generates a need to put an application component onto a VPN.  Obviously that can’t be done (economically, or at all) in the usual way—putting an MPLS VPN demarcation in the cloud.  On the other hand, a software SD-WAN agent could easily be bound into the cloud component, making the connection over the Internet and making each cloud component appear on the corporate VPN.

Having a cloud-ready client has quickly become a requirement for SD-WAN vendors, and it’s this development that characterizes the second phase of SD-WAN maturity.

Before we move on to the next phase, let’s take a moment to explore the market side of the picture.  SDN has generally been sold to cloud providers and network operators, and that’s been true for the SDN-centric products from the first.  SD-WAN was first sold to MSPs and enterprises, with the balance shifting toward the latter over time.  This situation prevailed through the first phase of SD-WAN, largely because network operators saw SD-WAN as cannibalizing their MPLS VPN business.

With the advent of our second phase, network operators started to get into the act, for three reasons.  First, many realized that losing their VPN customers to one of their other services was preferable to losing them to a competitor.  Second, they realized that SD-WAN didn’t often actually cannibalize MPLS business, it simply extended corporate connectivity.  Third, a growing minority saw that SD-WAN could actually help them create a stable model of VPN in what would likely be a changing set of infrastructure options.  Operators are the fastest growing channel for SD-WAN sales today, and will likely represent the majority of new sales by the end of the year.

The operator interest in SD-WAN has had the perhaps-undesirable effect of nailing part of the SD-WAN phase evolution, the feature part, to the ground.  Most of the operator-driven SD-WAN is focused purely on extending connectivity.  The other features of SD-WAN have been generally ignored, and that could be an issue because of the onrush of phase three of SD-WAN.

Phase two, you’ll recall, was characterized by the goal of establishing cloud-friendly VPN connectivity.  Perhaps the most important thing about this new requirement is that it’s only the first step in the real mission of SD-WAN, which is providing a bridge between the churning world of mobile workers and the continuously variable relationship between applications, components, and hosting points that virtualization has created.  This is what I’ve been calling logical networking, because it focuses on connectivity as the virtual world demands it, rather than on the fixed-network-service-access-point model of IP overall.

Phase three is the critical stage of SD-WAN evolution for two reasons.  First, it defines the set of virtualization-linked feature requirements that will drive broad adoption of SD-WAN.  We could see, if this phase progresses, SD-WAN becoming a near-universal approach to building business networks, and even as the foundation of things like 5G and CDNs.  It could become the fabric of the carrier cloud.  Second, it divides the two sales channels for SD-WAN services, and if that division persists, SD-WAN might take much longer to adopt.

Recall that SD-WAN is now being offered by network operators and that this is the fastest-growing channel.  It’s also offered directly to enterprises, and through MSPs and other virtual operator types.  Right now, I’m seeing interest in the third SD-WAN phase confined to the latter market channel.  If that remains true, then as the operators gain ascendancy in number of new SD-WAN customers, they cement the second-phase model into place and make it harder for SD-WAN to achieve its full potential.

In the long term, I don’t think operators will remain stuck in phase two of SD-WAN.  The reason is competition among operators, which is a sure thing (and according to some operators and enterprise buyers, already happening).  Feature differentiation in connection services is difficult, particularly if you have to rely on a best-efforts network (the Internet) for some of your connectivity.  A feature race necessarily turns into a climb up the feature layers toward more logical-modeled services.

In the middle term, MSPs and other virtual-network providers could be expected to use logical-modeled features to differentiate themselves from stodgy operator positioning of pure connectivity-linked SD-WAN services.  If these SD-WAN providers start to climb the feature ladder to logical networking, they force the operators to do that, and thus contribute to eventual operator competition on higher-level SD-WAN features.

In the short term, it’s going to be up to vendors who provide the SD-WAN service components to promote the third phase.  We’re already seeing a lot of competition in the SD-WAN space, even though everyone knows that the market will quickly settle on a number of winners you could count on the fingers of one hand.  Today, you could count the number of vendors who have a clear logical-network positioning on a lot fewer fingers—one, in fact (128 Technology, who I’ve noted in past SD-WAN blogs).

Working against vendor-driven progress to phase three of SD-WAN is the fact that the vendors are of course interested in short-term sales traction.  With network operators the fastest-growing opportunity, and with them showing little interest so far in phase three features, there is a chance that everyone will chase the low-hanging sales apples and leave the best fruit unpicked—at least for a while.

The final force to advance SD-WAN may come from that original group of “SDN-WA” players, meaning ones like VMware/NSX and Nokia/Nuage.  Both these companies have an SDN orientation but a broad view of extending data center networking to the virtual world and providing branch/remote networking as well.  Both also have SD-WAN offerings, but neither of these currently supports my third phase of SD-WAN, logical networking.  If that changes, and I think it’s most likely to change with VMware’s positioning of its Virtual Cloud Network, it puts perhaps the most powerful vendor in the space convincingly behind the third phase of SD-WAN.

That doesn’t mean that VMware will sweep the board here, though.  First, they have not yet bought into the third-phase value proposition; VeloCloud doesn’t offer full logical networking.  Second, their focus with their Virtual Cloud Network is positioning for service providers, and as I’ve noted that space is still mired in simple connection-driven SD-WAN.  There’s still plenty of time for other SD-WAN (and SDN) players to embrace the third phase and gain a leading position.  Time, but not much time, in my view.  I think this space is going to reach maturity within a year and a half, and those who don’t have a seat at the value table will be lost.

Why ETSI’s New Zero-Touch Automation is in Trouble

It seems like zero-touch automation (ZTA) is everyone’s goal but also that it’s mired in the usual standards-group molasses.  The recent announcement that the first use case would be 5G network slicing, and the bias of the process toward earlier work on NFV, seem to combine to set the wrong goal and the wrong approach.  Are we launching a bold, new, essential initiative or just making the same mistakes again under a different name?

Problem number one for ZTA is the timing of the need.  NFV and SDN promised significant improvements in overall operator capex, but both have been going on for five years now and no significant improvements have materialized.  Operators have pressed vendors on pricing, gone more to price leaders, and are now exploring open models of legacy devices (P4 switch/routers) instead of deploying virtual functions in the cloud or using OpenFlow forwarding.  The earnings-call comments from both operators and vendors show pressure on capital spending, and the only near-term way to relieve that pressure is to improve opex.  That’s what ZTA is supposed to do.

The target use case of 5G network slicing doesn’t seem to fit the timing of the need, or even to be the right target.  We cannot reduce opex on something not yet deployed.  Why not focus instead on doing something with the current network infrastructure, the current services?  The profit-per-bit squeeze operators are under today doesn’t involve 5G network slicing, or even 5G overall.  There is currently no way of knowing just how widely network slicing will even deploy, or when deployment would start.

Having a proof-of-concept test to support a use case, at this point, seems premature to me as well.  What “concept” do we have to prove?  I really like PoCs, but there has to be an architecture to test against use cases if we’re to gain anything from them.  Since the ZTA group hasn’t defined that architecture yet, there’s a risk of having PoCs simply run through trial implementations.  That happened with NFV too.

The NFV linkage poses problem number two.  NFV isn’t responsible for any significant amount of either capex or opex today.  The problem that ZTA has to solve isn’t NFV automation, it’s service lifecycle automation regardless of what technology resides in network infrastructure.  Today, and for the foreseeable future, that will be legacy technology.  So why not focus ZTA on automating the operation of today’s services on today’s infrastructure?

Another NFV-related problem is what I’ll call an “application vision” of the requisite software components to be used for management and orchestration.  Functionally, the NFV model defines the logical elements, but the implementations of the model have tended to follow the functional alignment, making NFV MANO and VNFM look like monolithic applications, or applications with a few modular components.  We have a better model in the notion of an intent-data-model-driven approach.

Problem number three is the inertia of standards processes overall.  John Cupit, a guy whose skills and insights I respect, commented on LinkedIn on one of my recent blog posts, saying:

I think that too much standards work is focused on justifying the existence of the standards bodies.  I also think that Time to Market concerns of the participating vendors hijack the work that is performed.  The MEF work and the ETSI NFV standard are poster children in terms of representing technical efforts which ignored how the Carrier industry had to totally transform itself to remain viable in providing connectivity and cloud-based services…after four or five years of work, we have a technical approach which is still below the line from a financial analysis standpoint and which does not have an answer for long-term service management or ZTA.

Amen!  If we want to find a reason carrier transformation lags behind OTT innovation, we could start with the tendency of operators to launch a five-year standards initiative when OTTs would simply put a couple of architects to work, launch development based on the result, and release something in six months.  The white paper first released on ZTA looks a lot like the one released for NFV in 2012, and you may recall that the whole ZTA issue was raised early in the NFV work (spring of 2013) and declared out of scope.  News flash: You can’t do zero-touch automation if anything that has to be operated is out of scope.

If we are ever going to get anywhere with ZTA we have to start with the presumption that we are building a universal model and focus on how the model gets to be universal.  We already have a prototype concept with “intent modeling” and the notion of hierarchical decomposition of a service data model.

Intent modeling says that any service can be subdivided into functional elements that have explicit missions, represented by interfaces, parameters, and SLAs.  Within an intent-model element, the means whereby the mission/intent is realized is opaque.  Internal logic (whatever it is) works to manage behavior against the element’s own SLA.  If the element can’t meet that, then it asserts a status that reflects the failure.

That’s where hierarchy comes in.  An intent-modeled element is a piece of a service hierarchy, so it is “created” by the decomposition of something above.  A “VPN” functional component could be realized in a number of ways, like “MPLS-VPN” or “2547-VPN” or “Overlay-VPN”.  If one of those second-level components is selected and faults, it’s up to the VPN functional component to decide what happens.  Perhaps you re-provision the same sub-component, perhaps you launch another one, or perhaps you report a failure yourself, to your own superior element.
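
To make the idea concrete, here’s a minimal Python sketch of that decomposition-and-escalation flow.  All of the class and element names are hypothetical illustrations of the concept, not any particular standard’s API: a functional element picks one realization, tries local remediation when that realization fails, and asserts a fault upward only when it can’t meet its own SLA.

# Illustrative sketch of hierarchical intent-model decomposition (names are hypothetical).
class IntentElement:
    def __init__(self, name, sla, candidates=None, parent=None):
        self.name = name                    # functional mission, e.g. "VPN"
        self.sla = sla                      # the SLA this element commits to upward
        self.candidates = candidates or []  # possible lower-level realizations
        self.parent = parent
        self.active = None

    def decompose(self):
        # Select one realization; the chooser doesn't see inside it.
        self.active = self.candidates[0]
        self.active.parent = self
        self.active.decompose()

    def handle_fault(self, child):
        # First try to remediate locally: pick another candidate realization.
        alternatives = [c for c in self.candidates if c is not child]
        if alternatives:
            self.active = alternatives[0]
            self.active.parent = self
            self.active.decompose()
        elif self.parent:
            # Can't meet our own SLA; assert a status/fault to the superior element.
            self.parent.handle_fault(self)
        else:
            print(f"{self.name}: service-level fault, no remediation available")

class LeafRealization(IntentElement):
    def decompose(self):
        pass  # opaque internals: whatever actually implements the intent

# "VPN" decomposes into MPLS-VPN or Overlay-VPN, per the example in the text.
vpn = IntentElement("VPN", sla={"availability": 0.999},
                    candidates=[LeafRealization("MPLS-VPN", {}),
                                LeafRealization("Overlay-VPN", {})])
vpn.decompose()
vpn.handle_fault(vpn.active)   # a failed realization triggers local re-selection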

In a structure like this, there is no application component called “MANO” or “VNFM”, only a set of functions that add up to those original functional blocks.  It goes back to the old TMF model of the NGOSS contract, where events (conditions in a service lifecycle) are steered to processes (the logic that combines to create the features of things like MANO and VNFM) via a contract data model.  It’s the model structure and the specific modeling of the elements that matter most.
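
Here’s a tiny Python illustration of that NGOSS-contract pattern, with hypothetical states, events, and process names.  The point isn’t the specific table contents; it’s that the data model, not a monolithic application, decides which process gets each lifecycle event.

# Minimal sketch (hypothetical names): a service element's data model carries a
# state/event table that steers events to processes, per the NGOSS-contract idea.
def deploy(ctx):   print("run deployment orchestration for", ctx["element"])
def heal(ctx):     print("run remediation for", ctx["element"])
def escalate(ctx): print("report fault to superior element of", ctx["element"])

service_element = {
    "element": "Access-VPN",
    "state": "ordered",
    # (state, event) -> (process, next_state); this lives in the model, not in code.
    "state_event_table": {
        ("ordered",  "activate"): (deploy,   "active"),
        ("active",   "fault"):    (heal,     "degraded"),
        ("degraded", "fault"):    (escalate, "failed"),
    },
}

def handle_event(model, event):
    process, next_state = model["state_event_table"][(model["state"], event)]
    process(model)            # the "process" is whatever implements MANO/VNFM-like logic
    model["state"] = next_state

handle_event(service_element, "activate")
handle_event(service_element, "fault")
handle_event(service_element, "fault")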

Models are also the right path to NFV interoperability.  If there is a “standard model” for a given function, whether it’s a virtual one (VNF) or part of the infrastructure (NFVI), then anything that fits into that model is equivalent in all respects.  Without a standard model for something, the implementation of features and functions can vary in terms of their intent, interfaces, SLAs, etc.  That makes interoperability at the service composition level nearly impossible to achieve.

There are some examples of intent-modeled networking already out there.  Sebastian Grabski has an insightful blog post on what he calls “declarative networking” that’s based on model abstractions that could easily be expressed as a form of intent modeling.  It shows how modeling can create multi-vendor, interoperable, implementations of the same thing, or logical feature, based on some Cloudify tools.  That, in turn, shows (again) that the really useful work in this space isn’t being done in network-biased bodies at all, but in cloud projects.  The cloud seems more willing to accept the basic need for declarative modeling than the network community.

Modeling isn’t just drawing pretty pictures.  What are the functional elements of a service?  What are the common properties of all intent-modeled elements?  How do we define events that have to link model elements to coordinate systemic reactions?  How do events get sent to service processes?  This is the stuff that ZTA needs to define, before use cases, before tests and trials, because without these things there’s no way to create a ZTA approach that covers the whole of the service, extends to all possible infrastructure, and incorporates all current network and service management tasks.  Where is it?

ZTA should have started with service modeling.  The early PoC efforts should have focused on defining the prototype service models for the largest possible variety of services.  On ensuring that everything about a service, from top to bottom, legacy to modern, self-healing to manual processes, was representable in the model.  On ensuring that every class of feature had a “class model” that would be implemented by everything that purported to support that feature.  When that’s been done, you can start doing implementation trials on the modeling, because you know it meets all the goals.

There is, in my view, an obvious problem with the way we try to approach technology standards.  If you try something, based on a given methodology, and it fails, there are many reasons why that might have happened.  If you use the same methodology a second and third time, and they all fail in the same way, you have to suspect that the methodology is somehow keeping you from the right answer.  Standards, and the “standards” approach to innovation in infrastructure, may be the core problem with transformation.  ZTA is in trouble for that reason.

Why We Should Be Augmenting Augmented Reality

Everyone has heard of augmented reality, and it’s often mentioned as a driver for everything from 5G to new phone technology.  There’s no question that it could create a whole new relationship between us (as workers and consumers) and our technology, but the space is a lot more complicated than just putting on some kind of special goggles.  In fact, augmented reality ends up linking almost every hot technology topic (including AI), and to see why we need to examine the three requirements for useful augmented reality.

To start off, we should reflect on the Wikipedia definition for augmented reality: “Augmented reality (AR) is an interactive experience of a real-world environment whose elements are “augmented” by computer-generated perceptual information, sometimes across multiple sensory modalities, including visual, auditory, haptic, somatosensory, and olfactory. The overlaid sensory information can be constructive (i.e. additive to the natural environment) or destructive (i.e. masking of the natural environment) and is seamlessly interwoven with the physical world such that it is perceived as an immersive aspect of the real environment.  In this way, augmented reality alters one’s ongoing perception of a real-world environment, whereas virtual reality completely replaces the user’s real-world environment with a simulated one.”

The missions for augmented reality are broad.  We already see some use in improving visual perception for those with limitations in that area.  There is broad interest in having augmented reality play a role in self-drive vehicles, and even in everyday shopping, sightseeing, and most of all, working.  A number of companies are experimenting with the use of augmented reality as a means of offering worker or even customer support for specialized tasks.

We see a lot of virtual reality today in gaming, and there are some examples of augmented reality too, but they’re more limited in numbers, users, and scope.  The reason is that augmented reality is a lot harder to do, and the three “R’s” that follow set the requirements framework.

The first requirement is that it has to be responsive.  Reality is real-time, and so any augmentation of reality has to be as well.  That means that the artificial part of the visual field has to track the real part.  If the system employs a display for both real-world surroundings and augmentation, then that system has to track the real world without perceptible delays.

Responsiveness starts with the ability to present the “real-world” visual field in real time.  That could be automatic if the augmented reality device lets a user “see through” the display, but most of today’s products create the entire augmented reality view from a camera and image insertion.  There is a potential for delay in both models, but more in the latter because of the challenge of redisplaying the real-world image to follow a user as they move their head.  Remember that the system would have to offer a very wide field of view, perhaps not as wide as the human eye but surely well over 120 degrees, or there’s a risk of literal “tunnel vision” that could be dangerous for the user (and those around the user).

The responsiveness problem is exacerbated by the need to visually link augmentation with the reality piece.  A label on a building can’t float in space as the user’s head turns, then race to catch up with the object it’s associated with.  However, figuring out what is in the field of view and where exactly it is located within it is far from trivial.  In pure virtual reality the entire view is constructed so there’s no need to synchronize with the real world.
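
A quick back-of-envelope calculation shows why this synchronization problem is so unforgiving.  The head-rotation speed and latency figures below are my own illustrative assumptions, but the arithmetic makes the point: even a few tens of milliseconds of end-to-end delay visibly detaches a label from the thing it’s supposed to be attached to.

# Back-of-envelope: how far does an overlay label drift off its anchor while the
# system catches up? Assumed numbers (head speed, latencies) are illustrative only.
head_rotation_deg_per_s = 120.0     # a brisk but ordinary head turn

for latency_ms in (10, 20, 50, 100):
    drift_deg = head_rotation_deg_per_s * (latency_ms / 1000.0)
    print(f"{latency_ms:3d} ms end-to-end latency -> label lags its object by {drift_deg:.1f} degrees")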

The second requirement is that it has to be relevant.  This is nearly as important a criterion as responsiveness, because augmented reality seems to imply positive or valuable additions.  The visual sense is our most powerful, the one that (for those without sight impairments) establishes our place in the world.  Imagine the effect of a bunch of image clutter that has little or nothing to do with what we see and what we’re trying to do.

What makes this the most difficult of the criteria to meet is its total subjectivity.  Relevance is (no pun intended) in the eye of the beholder.  The key to achieving it is placing the augmented reality into context, meaning an understanding not only of the current real-world visual field, but also the mission of the user/wearer.  At a minimum, a good system would have to be able to accept explicit missions or mindsets as a guide to what to display.

If you’re shopping, you look for stores.  If you’re sightseeing, you look for landmarks, and if you’re hungry you look for restaurants.  To clutter a view with all of this at one time would be to either render the real world invisible behind the mask of created imagery, or to render the imagery virtually useless because of the size limits the sheer mass of information would impose.  This kind of mission-awareness is the first and most critical step to contextualizing augmented reality for relevance, but it’s not the end of the story.

Obviously, looking for restaurants isn’t helpful if I can’t find any, which means that I’d also need to understand the nature of the businesses that sit within my visual field.  This understanding could come about only through accurate geolocation of my position and a knowledge of the locations of other points of relevance, like restaurants, stores, and landmarks.  You could get that from a geo-database like the one Google provides, or you could get it from an “information field”.
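
As a sketch of what mission-plus-geolocation filtering might look like, here’s a simple Python fragment.  The points of interest, coordinates, and field-of-view math are all hypothetical and deliberately simplified (a flat-plane approximation); the point is that relevance is a joint function of declared mission, position, and what’s actually in the visual field.

# Illustrative relevance filter (all data hypothetical): keep only points of interest
# that match the user's declared mission and fall inside the current field of view.
import math

def bearing_deg(user, poi):
    # Planar approximation is fine at street scale for a sketch.
    dx, dy = poi["x"] - user["x"], poi["y"] - user["y"]
    return math.degrees(math.atan2(dx, dy)) % 360.0

def in_view(user, poi, fov_deg=120.0):
    diff = abs((bearing_deg(user, poi) - user["heading"] + 180.0) % 360.0 - 180.0)
    return diff <= fov_deg / 2.0

user = {"x": 0.0, "y": 0.0, "heading": 90.0, "mission": "dining"}   # facing "east"
pois = [
    {"name": "Corner Bistro", "category": "dining",   "x": 40.0,  "y":  5.0},
    {"name": "City Museum",   "category": "sightsee", "x": 35.0,  "y": 10.0},
    {"name": "Shoe Outlet",   "category": "shopping", "x": -30.0, "y":  0.0},
]

relevant = [p for p in pois
            if p["category"] == user["mission"] and in_view(user, p)]
print([p["name"] for p in relevant])   # only the restaurant, ahead of the user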

Those of you who have read my own IoT views know that “information fields” are contextualized repositories of information that intercept a user’s own “information field” as the user moves around or changes goals.  Unlike geo-databases that would require someone to host them (and would likely be available only as a service), information fields would be asserted by companies or individuals, and would likely represent a hosted event-driven process.  An augmented reality user would assert a field, and so would anything that wanted explicit representation in user augmented reality displays.

The problem with my model is that you’d need a standard framework for the interaction of information fields with events and with each other.  Frankly, I think that neither IoT nor augmented reality can be fully successful without one, but it’s hard for me to see how a body could take up something like that and get the job done in time to influence the market.  Might a vendor, a cloud provider, offer an open strategy?  They could, and both Amazon and Google would have a lot to gain in the process.  Of the two, I think Amazon is the more likely mover, since Google already has (to support its Maps service) the geo-database that would likely be the foundation of the alternative to information fields.  But more on this later.

Requirement three may seem almost contradictory; an augmented reality system has to be restrictive.  It has to get out of the way under special circumstances, where the augmentation might put the user at risk in the real world.  That it’s critical that a user walking down a New York sidewalk not end up falling into an open subway grate or walking into traffic is obvious.  That a commercially exploitive model of augmented reality might realistically be expected to pepper the visual field with ads is also obvious.

I’ve talked with some augmented reality researchers who say that experience shows that the difference between “clutter” and “augmentation” is subtle.  This is one place where AI could come into play, because it would be very valuable for an augmented reality system to learn from user management of the density of the augmented part of the display, and enforce similar densities under similar conditions.

In this area, and in others as well, a big part of the benefit of augmented reality depends on AI.  The problem is that too many things are going on in the real world to allow the user to frame missions to filter and label them.  You need at the least machine learning, and better yet you need a mechanism to predict what will be valuable.  Cull through recent web searches, texts, emails, and so forth, and you could (as an AI agent process) take a good guess at what a user might be doing at any moment, and by knowing that, provide better contextual support and augmented reality relevance.

Let me go back now to information fields.  This sounds like a totally off-the-wall concept, but it’s really the model of IoT that fits the Internet the best.  An information field can be analogous to a website, a place where information is stored.  The user, via another information field, can be visualized as a browser doing web-surfing.  Not by explicit clicking, but by moving, looking, living.  Every movement and action, glance and request, surfs the “information field web”.  Much of this approach could be implemented via a combination of event processing and the current Internet web-server model.
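
To show what I mean, here’s a minimal sketch of that field-intersection idea in Python.  The structures and handlers are hypothetical; the essential behavior is that movement events are intercepted by whatever asserted fields they fall within, with no explicit “click” required.

# Sketch of the "information field" idea (hypothetical structures): asserted fields are
# regions with event handlers; a user's movement event is intercepted by the fields
# it falls inside, much as a browser fetches the sites it visits.
import math

class InformationField:
    def __init__(self, owner, center, radius_m, handler):
        self.owner, self.center, self.radius_m, self.handler = owner, center, radius_m, handler

    def intersects(self, position):
        return math.dist(self.center, position) <= self.radius_m

def restaurant_handler(event):
    print("push menu/wait-time info to", event["user"])

def transit_handler(event):
    print("push next-train info to", event["user"])

fields = [
    InformationField("Corner Bistro", (10.0, 0.0), 50.0, restaurant_handler),
    InformationField("Metro Station", (500.0, 500.0), 80.0, transit_handler),
]

def on_user_move(event):
    # Every movement "surfs" the information-field web.
    for field in fields:
        if field.intersects(event["position"]):
            field.handler(event)

on_user_move({"user": "alice", "position": (20.0, 5.0)})   # inside the bistro's field only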

When the World Wide Web came along, it was a technology-driven revolution in online thinking.  That was possible because a self-contained development opened a lot of doors.  Our challenge today is that most of those easy wins have already been won.  Today’s new technology, including augmented reality, is part of a complex implicit ecosystem whose individual parts have to be pulled through somehow to create the glorious whole.  Should we be thinking in those terms?  Surely.

Do We Finally Have an Open and Realistic IoT Model?

Most of you know I think IoT is overhyped, with the popular vision being a whole bunch of new sensors put on the Internet for people to exploit, misuse, or hack, depending on their predisposition.  The real IoT opportunity has to face two realities.  First, most sensors will never be on the Internet directly; they’ll have a gateway that offers them limited exposure.  Second, the IoT opportunity for providers will be in offering digested contextual information derived from sensor gateways.  You don’t hear much of this stuff, and so I was happy to talk with the Eclipse Foundation’s IoT people and hear a more realistic vision.

It’s always interesting to see how realists view a market, in contrast to how the market is portrayed.  I explained my own vision of the IoT space, the notion of contextual “information fields” projected from event-driven services.  Even that was a bit too futuristic for the Eclipse team; they were focused more on “industrial” and “facility” IoT.  The good news is that their model of an open IoT ecosystem is perfectly compatible with their own industrial vision, my information/contextual vision, and even the pie-in-the-sky world-of-open-sensors vision.

In a nutshell, the Eclipse Foundation defines three broad software stacks, with the goal of creating three open and symbiotic ecosystems.  One is focused on limited, contained, devices like the pie-in-the-sky sensors you hear about all the time.  The second is focused on the critical gateways that connect realistic, locally specialized, sensors to broader applications, and also provides local processing for closed-loop event handling.  The third is focused on cloud-hosted applications that exercise control and provide analytics and other event digestion and distribution.  My information fields are created by the last of these three.
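
To illustrate the division of labor, here’s a generic Python sketch of the gateway role as I understand it; it is not the Eclipse projects’ actual APIs, just the pattern: handle time-critical events in a local closed loop, and pass only digested summaries up to the cloud application tier.

# Generic sketch of the gateway stack's role (not the actual Eclipse project APIs):
# handle time-critical events locally, and forward only digested summaries upstream.
import statistics

local_actuator_log = []
upstream_queue = []

def on_sensor_event(reading):
    # Local closed-loop handling: react immediately without a cloud round trip.
    if reading["temp_c"] > 85.0:
        local_actuator_log.append(("shut_valve", reading["sensor"]))
    return reading["temp_c"]

def digest_and_forward(temps):
    # The cloud tier gets analytics-ready digests, not raw sensor chatter.
    upstream_queue.append({
        "count": len(temps),
        "mean_temp_c": round(statistics.mean(temps), 1),
        "max_temp_c": max(temps),
    })

readings = [{"sensor": f"s{i}", "temp_c": t} for i, t in enumerate([71.2, 90.4, 68.9, 73.0])]
temps = [on_sensor_event(r) for r in readings]
digest_and_forward(temps)

print(local_actuator_log)   # one local closed-loop action
print(upstream_queue)       # one compact digest instead of four raw events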

The overriding mission of the Eclipse Foundation in IoT is to create an ecosystem within and among these three component stacks.  They want the individual stacks to be open within the domain they represent, and they want them to present APIs that can link them together in whatever deployment context is mandated by previous services or commitments, and then with whatever new stuff is needed to address the evolution of the opportunity.

The slide deck I reviewed shows the device stack hosted on an embedded OS (OS/RTOS for “Real Time OS” in the slide), which I think is almost a given.  The second of the stacks for the gateway is similarly labeled, but here I think the OS might be a version of Linux.  The final stack for applications is a PaaS platform, meaning they hope to define middleware tools that would create a standard IoT-centric execution environment.  That, in fact, is really a goal with all the stacks.

Each of the stacks defines a set of tools/protocols/APIs that provide the critical points of exchange and integration.  By defining standards here, Eclipse creates the basis for openness and extension, which is more critical for IoT than perhaps anything we talk about these days.  Information without application is just noise, and we need all the innovation we can get to turn IoT data into useful stuff.

The middle, gateway, stack is defined in two broad examples, one for industrial applications and one for residential.  The structure of the stack is the same, but there are refinements in the interface to accommodate the different price points and functional requirements for these two market spaces.  I think that other versions of the gateway may come out as things like driverless cars and even retail IoT come along.

Up where I think the important stuff is, which is that third stack in the presentation, Eclipse sees a central role for a tool called Ditto, which supports what’s called a “digital twin”.  Digital twins are representational agent processes that can be manipulated and read so as to simulate direct access to a device.  I think this is a great approach since it “objectifies” the IoT elements, but I hope they intend that the modeling/twinning is hierarchical, so that you can create successive virtual devices or elements that are relationships among a set of lower-level things.
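
Here’s a conceptual Python sketch of what I mean by hierarchical twinning.  It’s not the Ditto API, and the device names are made up; it simply shows how a composite twin could be read exactly like a device while its state is really a relationship over lower-level twins.

# Conceptual sketch of hierarchical "twinning" (not the Eclipse Ditto API): a composite
# twin is a relationship over lower-level twins, and can be read as if it were a device.
class DeviceTwin:
    def __init__(self, name, state=None):
        self.name, self.state = name, state or {}

    def read(self):
        return {self.name: self.state}

class CompositeTwin(DeviceTwin):
    def __init__(self, name, children):
        super().__init__(name)
        self.children = children

    def read(self):
        # The composite's state is a synthesis of its children's states.
        child_states = {c.name: c.read()[c.name] for c in self.children}
        healthy = all(s.get("status") == "ok" for s in child_states.values())
        return {self.name: {"status": "ok" if healthy else "degraded",
                            "members": child_states}}

pump = DeviceTwin("pump-7", {"status": "ok", "rpm": 1450})
valve = DeviceTwin("valve-3", {"status": "ok", "open_pct": 60})
cooling_loop = CompositeTwin("cooling-loop-A", [pump, valve])

print(cooling_loop.read())   # one virtual element built from lower-level things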

The big barrier to the Eclipse Foundation model is incumbency.  We already have installed industrial and home systems, and in the residential market there is little tolerance for complexity and perhaps less for paying for integration.  The key would be to get manufacturers to embrace an open model.  That may be easier in Europe, where there are fewer installed proprietary residential systems, or in the industrial space, where users are likely to pressure vendors for openness.

In the long term, history may favor the Eclipse Foundation.  Up to the 1980s, we had a bunch of proprietary computer operating systems because most enterprises did their own software.  As packaged software exploded, it became clear that vendors who didn’t have a large installed base couldn’t interest the software providers.  This created a move first to UNIX and later to Linux.  Could that happen here too?  I think it could, providing there’s a real value for “community IoT”.

Which brings me to the promise I made to summarize my own views on the subject.  There is no question that an IoT community, a collection of symbiotic applications serving many different missions valuable or even critical to consumers and businesses, demands an open model.  The mistake the IoT community made was in presuming that the model would be created by directly opening the sensors and controllers.  I don’t think there is any way to satisfy mandates for safety and privacy in such a scenario.  What is possible is what could be called “controlled, digested, exposure”.

The future of IoT lies in the creation of “information fields” derived from the collection, correlation, and analysis of sensor data.  These information fields would be created by and projected from the kind of stuff that the Eclipse Foundation is defining in its second and third software stacks—gateways and application platforms.  Applications, which could be run in or on behalf of mobile users, connected cars, autonomous vehicles, and of course cities, governments, retailers, enterprises, and so forth, would intercept some of these fields and utilize them.

For what?  Mostly for contextual applications.  We want our devices to serve us.  The biggest barrier to their doing that is understanding what we need/want, and the biggest barrier to that is understanding our context.  I’ve said in a number of blogs that I was told that the most-asked question of the early Siri was “What’s that?” as if Siri could know.  But Siri could know with information-field-centric IoT.  We don’t have to make agent processes actually see to make them understand what we’d likely see from a given point.  We could use information fields to find places and people, to link goals with directions, to guide remote workers, to control self-driving cars.  The overall architecture is what’s needed here, not a bunch of proposed one-off solutions that couldn’t justify the enormous pre-deployment of assets needed to create critical information mass.

For all the good intentions, and good technology, in the Eclipse Foundation work, they still face the same risks as the MEF did (and as I discussed in my blog yesterday).  An ecosystem created from the bottom up may encourage exploitation but it doesn’t guarantee it.  Since value to the market is created where the money changes hands, at the user level, the value of the Eclipse Foundation ecosystem will have to be built with and on top of their work.  They’re still a very small force in the market overall, and we’ll have to wait to see if they can build up a momentum in the IoT space.  I’d like to see that, because an open and realistic approach is always critical, but especially so in IoT.

The MEF 3.0 Framework: Good at the Wrong Level?

The MEF has been doing a kind of reboot, transitioning to something other than pure carrier Ethernet.  One effort, the “Third Network,” apparently confused a lot of people (what, they wondered, were the first two), and so the current efforts are titled more conventionally, as MEF 3.0.  It’s more explicit about its vision, and in particular more explicit about support for SD-WAN.  Is there enough substance here, or is this just another industry body seeking a sustaining vision?  To find out, we’ll need to look at the problems the MEF is trying to solve.

The original notion of the Third Network was based on the presumption that a service network could be built over a combination of networks, some (obviously) based on Ethernet.  Rather than assuming everything had to be over IP, the MEF vision presumed any combination of technologies could be used.  It also supported pan-provider federation.  Finally, it introduced the notion of Lifecycle Service Orchestration (LSO), which has become a specific centerpiece of the MEF 3.0 vision.  The Third Network was, as I’ve noted, implicitly linked to an end-to-end service model, but it didn’t really describe the model in a full sense.  That’s an issue that I think is still a problem with MEF 3.0, though (as we’ll see below) it’s less of one than it was before.

That defines the MEF side; what about the high-level vision?  Services are what buyers purchase, so that’s what frames the goal of MEF 3.0.  The idea is to create a service network that can ride on arbitrary operator networks.  That service network will almost surely involve CPE, and it may also involve gateway elements that join the operator networks to create universal connectivity.  Service networks that are independent of underlying transport infrastructure really have to be based on an overlay technology.

Overlay networks build connectivity on top of other connectivity, using a higher-layer header representing the higher-layer service.  User traffic is injected into the service network, usually via CPE but perhaps also through virtual CPE (vCPE) and cloud-hosted software elements.  Based on the service-network addressing of the packet, it’s assigned to both an exit interface (a network trunk connection) and an address on that network.  If there are multiple networks utilized (SD-WAN often supports both MPLS VPNs and the Internet to reach all the users’ sites), then at least some sites have to have connectivity to all the transport networks used, or there has to be a network gateway available to pass between the available networks.  We can call this the edge-connected transport or the gateway transport options.

The mapping between service network and transport network(s) can take a number of different forms.  One is that the service network simply maintains its own addresses in an independent and perhaps proprietary way, and maps the addresses of all the service-level packets to the appropriate transport network destination.  The other is that there is a formal tunneling protocol used, something that defines a virtual LAN, virtual or pseudowires, or another form of connectivity.  The service-level data is injected into the tunnels based on address associations, which take it to the right destination.
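
A simplified sketch of the first of those mapping approaches might look like the Python fragment below.  The addresses and endpoints are hypothetical; the point is that the service-level destination selects both the transport network and the transport-level destination, and the service header simply rides inside whatever the transport requires.

# Illustrative overlay edge mapping (all addresses hypothetical): the service-level
# destination decides which transport network carries the packet and what the
# transport-level destination is, per the mapping options described above.
service_forwarding_table = {
    # service address -> (transport network, transport endpoint / tunnel)
    "10.200.1.0/24": ("mpls-vpn", "pe-router-nyc"),
    "10.200.2.0/24": ("internet", "203.0.113.45"),   # tunnel endpoint at a branch
    "10.200.9.0/24": ("internet", "gw-to-partner"),  # reached via a network gateway
}

def forward(service_dest_prefix, payload):
    transport, endpoint = service_forwarding_table[service_dest_prefix]
    # Encapsulate: the service header rides inside the transport header.
    encapsulated = {
        "transport": transport,
        "transport_dest": endpoint,
        "service_header": {"dest": service_dest_prefix},
        "payload": payload,
    }
    print(f"sending over {transport} to {endpoint}: {encapsulated['service_header']}")
    return encapsulated

forward("10.200.2.0/24", b"user traffic")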

I think a perhaps-more-scholarly-sounding presentation of these points would have benefitted the MEF from day one of the Third Network, and would still benefit them today.  This illustrates the important things that MEF 3.0 has to contend with.  First and foremost, it has to either define or identify the overall tunnel model(s) to be used.  Is there only one, would any serve, etc.?  That links to the second thing, which is the functionality of the gateway.  You could in theory have different tunneling on either side of a gateway.  Is that supported?  Does the over-arching service protocol have to be a tunnel that then gets translated, or is it something that lives in a tunnel, and thus is the business of the end elements?

Whatever the answers to these points might be, there are three distinct layers to be considered.  At the bottom, there are the actual operator networks, which realistically are probably either Ethernet or IP.  In the middle are optional tunnel networks that provide segmented connectivity and transport across those networks, and at the top there’s the service overlay network that actually frames what the user sees as the connectivity service, and potentially provides other service-related features.

The purpose of MEF 3.0 is to define a set of APIs (LSO) that allow operators to deploy and manage the physical-network underlayment for a service network overlay, and to provide some (mostly future) limited management of the overlay framework.  Control has to be exercised in three dimensions.  First, you need to control the service relationship to the tunnel network, the interface and the SLA that is provided.  This is probably a CPE/CLE function.  Second, you have to control the tunnel network’s relationship to the transport underlayment(s), and third, the NNI between the different underlay networks.  Both of the latter are gateway functions.

You can contrast this with today’s basic SD-WAN or overlay SDN model, which I believe is the de facto standard for service overlay networking.  The presumption is that you have an IP underlayment and that whatever connectivity is needed between/across the various types of IP or network operator administrative domains is provided by one or more of the endpoints.  These have a connection to several networks/domains, and so can bridge traffic between them.  If there is any deployment or management needed for the underlayment, it’s independent of the SD-WAN/SDN service.  That’s a simple model of service that can be used by the operator, a third-party player like an MSP, or even directly by the user.

This, finally, brings us to an assessment of the MEF 3.0 model.   That model is clearly not particularly simple.  It can provide a level of service management and SLAs that isn’t available to traditional SD-WAN, and a means of harmonizing connectivity across multiple operators with a seamless service.  Are the benefits enough to justify the complexity?  In the short term, I’d have to say they are not.  In the longer term?  Perhaps.

The problem I see with the MEF approach is that it presumes a goal of pan-provider services using homogeneous infrastructure, and operators aren’t generally telling me that’s their goal.  Some operators see SD-WAN simply as a way to extend their VPNs to places where they don’t provide MPLS or where customers can’t justify the cost—small, remote, locations.  Other operators see SD-WAN as a means of providing service-layer isolation within their own footprint.  In Europe, there is some interest in pan-provider networking, but it seems to me that most of it is satisfied by MPLS interconnect.

The differentiating features of SD-WAN, again in my personal view, have to be directed at the future, virtualized-infrastructure-and-cloud model.  Pan-provider connectivity is more credible if you assume that cloud providers are part of the mix, but are cloud providers going to federate their network connectivity in a more traditional sense, offering connectivity in general and not just connectivity within and to/from their cloud hosting?  If they don’t, then the cloud and virtualization drive a need for formalized logical networking, which the MEF 3.0 specs don’t target at all.

Services aren’t sold to customers based on APIs they never see, but based on features that improve the value of the network to its users.  I continue to believe that logical networking is the feature set that will drive overlay networks, and the success of any implementation of overlay networking will depend on the services created, not on the way the infrastructure is tweaked.

I think all of this relates to the need for the MEF to have defined a high-level service model, including the primary missions for the service.  They could then dissect the mission into features and map the features to their pan-provider architecture.  SD-WAN is at least mentioned in MEF 3.0, and so you could presume it is a service mission, but the service-level features of SD-WAN seem to be considered out of scope, and without that it’s not clear whether the model of SD-WAN that the MEF will define will actually fit with the market’s requirements.

Why not?  Because like most of the “networking bodies”, including formal standards groups, industry groups, and even open-source groups, the MEF is focused on the “how” more than the “what”.  You cannot develop a useful standard for doing something you’ve not ensured is useful.  The features of overlay services including SD-WAN that will define utility to the buyer are higher up on the technology stack, and the MEF continues to look too low.  Yes, they define how operators could field an overlay solution, but not how the operator solution could be differentiable and valuable.

MEF 3.0 is very close to being useful, but it’s not there yet.  The problem is not that the work they’ve done doesn’t have technical value, but that there’s a significant risk it won’t have market relevance.  Given that network operator interest in SD-WAN and the general overlay service model is exploding, and given that we already have two players (VMware with NSX/VeloCloud and Nokia with Nuage) that offer a carrier-targeted overlay service architecture, the MEF needs to step up and get their story optimized, before it’s overtaken by market events.  There’s still time, because neither of the carrier-overlay-network leaders has a complete logical networking story, and that’s the feature set that will pull through the SD-WAN model that will define the future.

The Two Pieces to a Transformed Operator

I offered operators’ criticisms of transformation views, and a few of you wanted to know what my own views of transformation were.  Do I have something that operators wouldn’t criticize?  Probably not, but I do have a view, based on absorbing their views and running my market model.  It’s actually not all that complicated either.

At the high level, transformation to me has two pieces.  The first is to create and sustain a minimally profitable business model for connection services.  Operators need to provide these services, and they can’t lose money on them.  The second piece is carrier cloud.  Whatever forms the growth engine for operators in the future, it won’t be connection services but hosted experiences.  Carrier cloud is what hosts them.  We’ll take these in order.

A minimally profitable connection service business model depends on two things.  First, you have to streamline operations through the introduction of automation to the service lifecycle.  This is hardly a revolutionary idea; CIMI published its first service management automation paper in 2002, and most of what was in it is still recognizable today.  Plenty of others did the same, but we’ve only now started to come to terms with reality.  Second, you have to rely much more on open hardware and software technology for the capital infrastructure you deploy.  With profit per bit in the toilet, you can’t pay 40% gross margins to a vendor.

Service lifecycle automation has to start with the customer experience, for the simple reason that customer-related displaceable opex represents 57% of opex overall.  Most operators realize now that service lifecycle automation, like web-enabled applications, has a front- and back-end process model.  The back-end stuff is where most of the current attention is focused because the perception is that by controlling what operations people do, you control cost.  The reality is that in order for service lifecycle automation to work, you have to start with a customer self-service and support portal.

The problem with ZTA as an isolated back-end activity is that it’s difficult to align it with the notion of services, which are after all what are being sold and consumed here.  A service isn’t a collection of technology, it’s a collection of experiences.  The customer does the experiencing, of course, and so the most critical thing we need is a vision of how a customer handles a service—from ordering to problem management.  Only a front-end portal-driven vision can offer that.

The scope of ZTA, and the means by which it’s applied, relate directly to the services from the customer side.  In particular they relate to the service-level agreement or SLA.  To give a simple example, look at two extremes, one where there is no real SLA at all—best-efforts services—and another where the SLA is very stringent in terms of the number and duration of failures.

In the best-efforts case, what the customer needs is confidence that the operator knows there’s a problem, has identified it at a high level, and has offered an estimated time to repair.  The customer doesn’t need to know that this or that trunk failed, what they really need to know is that the failure is either already being worked on (because someone else reported it or because it was identified through management tools) and that it should be fixed by a given time.

Last winter when my area was plagued by heavy, wet, snow, we had a number of protracted power outages.  My utility company offered a map (which of course I had to access via a cell site) showing where the reported outages were, and offering information on how many customers were impacted and what the current state of allocation of crews was.  If I reported an outage, I could be told either that I was part of another specific outage already reported, or my report would launch a new outage.  Yes, like anyone without power and visualizing all the food in my freezer thawing while all the humans in my unheated home froze, I was frustrated.  Not as much as I’d have been had I been unable to determine whether anyone knew I had a problem, though.

The status-and-resolution-centric view of a service is appropriate where users don’t have specific guarantees.  It enables operators to manage the infrastructure as a shared resource, which is how utility companies work.  The limited SLA means that service lifecycle automation really isn’t much of a requirement, and that providing it wouldn’t necessarily do much for opex, as long as resource-and-capacity management tools were employed from planning into network operation.

With a stringent SLA, there are two differences that have to be addressed.  First, the service user has very specific contractual guarantees, which means that the data needed to support assertions that there is or isn’t a problem has to be provided to the customer, at least on request to resolve disputes.  Second, the fact that multiple services that share resources do so with different guarantees means that it’s important to manage SLAs at the service level and to remediate and escalate issues at that same level.  You can’t rely on resource management as much, or perhaps at all.  Thus, you need low-to-zero-touch service lifecycle management.

Even in this second case, the specifics of the SLA will have a lot to do with the level of service management automation required to address those customer-centric operations costs.  If you look back to the days of time-division multiplexing (TDM) services based on T1/E1, T3/E3, and SONET/SDH trunks, customers expected to get not only service-level data but facility-level data.  Trunks had things like “severely errored seconds” and “error-free seconds”, and remediation was expected at the trunk level.  Whenever we provide a “service” whose connectivity components are visible and individually guaranteed, we need to provide management visibility at least to the level of identifying facility-specific faults.  If the customer’s service includes some grooming/routing of trunks, we’d need to provide access to the tools that could do that.

Since we seem to be moving away from services that are made up of discrete, visible, elements in favor of virtual connectivity services, might we dispense with all this hype?  No, because of virtualization.  A corollary to our principles of matching service automation to service visibility is that stuff that’s under the virtualization abstraction can never be made visible.  A customer doesn’t know that a virtual device, hosted in a VM or container and run on a cluster of servers, is what’s actually providing their edge firewall.  They know they have a firewall, and they expect to be able to see the status of that virtual device as though it were real.

The strongest and most enduring argument for service lifecycle automation, including the elusive zero-touch automation, is virtualization.  Users cannot manage structurally diverse resource pools associated with virtualization; it’s not possible for them to even know what is being done down there.  Even customer service people manage functions in abstract, because functions build services.  The translation of function to infrastructure, both at an individual function level and at a service-systemic level, has to be handled invisibly, and if that handling isn’t done efficiently then the inevitable complexity introduced by virtualization (a function on a VM on a server connected by pathways to other functions is more complex than a couple functions in an appliance) will kill the business case for virtualization.

This point is both the conclusion to the “make connection services profitable” track, and the launching point for the “carrier cloud” discussion.  Everything in carrier cloud, all of what it is made up of and what it’s expected to host, is a form of virtualization.  If a user is getting a TV show via wireline FTTN, wireless 4G LTE, 5G/FTTN, or 5G mobile, they are buying the experience and they’ll need to know when, if it’s not working, it will be fixed.  If anything, carrier cloud is going to carry virtualization to many deeper levels, virtualizing not devices, but devices within deployment models within service models, and so forth.  That risks further decoupling the experience being sold from the details of management.

“Carrier cloud” is a broad term that’s usually taken to include two areas—service features or functions that are hosted rather than embedded in appliances, and the resource pool needed for that hosting.  Like network infrastructure, carrier cloud is a kind of capability-in-waiting, assigned not to a specific task or service but available as an ad hoc resource to everything.  Like all such capabilities, the trick isn’t making those assignments from an established pool, but in getting the pool established in the first place.  The resources of carrier cloud are an enormous “first cost” that has to be managed, minimized.

We have a naïve assumption in the market that operators would simply capitalize carrier cloud and wait for opportunities to pour in, the “Field of Dreams” approach, named after a popular movie.  “Build it, and they will come” was the tagline of the film, and that might work for regulated monopolies, but not for public companies who have to mind their profit and loss statements.  Getting carrier cloud going requires both an assessment of potential revenue and potential cost.

I’ve blogged before on the carrier cloud demand drivers; NFV, virtualization of video and advertising features, IMS/EPC/5G, network operator cloud services, contextual services, and IoT.  All of these have timelines and extents of credible influence, and the summary impact of the group would create the largest potential opportunity driver between 2019 and 2023.  However, opportunity benefits have to be offset by opportunity costs to derive an ROI estimate, and that’s what’s needed to drive carrier cloud.

Hosting economies of scale are predictable using long-established mathematical formulas (Erlang).  James Martin wrote a series on this decades ago, so any operator could determine the efficiency of a given resource pool at the hardware/capital level.  They can’t determine the profitability of carrier cloud services because they can’t establish the operations costs, and therefore can’t derive the selling price from their target rate of return, or the market penetration and total addressable market (TAM) from that price.  If all of the complexity of multi-level virtualization is exposed to “normal” operations practices, none of our carrier cloud services are likely to happen.
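
For those who want to see the Erlang point in practice, here’s the standard Erlang B recursion in Python with assumed offered loads (not operator data).  At roughly comparable blocking levels, a bigger pool runs at much higher utilization, which is exactly why the capital side of carrier cloud is the predictable part.

# Erlang B blocking probability via the standard recursion; an illustration of why
# larger resource pools are more efficient at a similar grade of service. The offered
# loads below are assumed numbers, not operator data.
def erlang_b(offered_load_erlangs, servers):
    b = 1.0
    for n in range(1, servers + 1):
        b = (offered_load_erlangs * b) / (n + offered_load_erlangs * b)
    return b

for servers, load in ((10, 5.0), (100, 85.0), (1000, 960.0)):
    blocking = erlang_b(load, servers)
    utilization = load * (1 - blocking) / servers
    print(f"{servers:5d}-unit pool at {load:6.1f} erlangs offered: "
          f"blocking {blocking:.3f}, utilization {utilization:.0%}")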

Whether we’re talking about automation to support efficient virtualization or to support efficient customer support, it’s likely that automation and intent modeling are linked.  At the lowest level of modeling, when an abstract element contains real resources, the management goal would be to satisfy the SLA internally through automated responses to conditions, then report a fault to the higher layer if resolution wasn’t possible.  The same is true if a model contains a hierarchy of subordinate models; there may be broader-scope resolutions possible across resources—replacing something that’s failing with something in a different area or even using a different technology.  That’s a cross-model problem at one level, but a unifying higher-level model would have to recognize both the fault and the possible pathway to remediation.

My view is that intent modeling hierarchies, and intent-based processing of lifecycle management steps through state/event analysis, are central to service lifecycle automation.  An intent hierarchy is critical in supporting an end-customer view (that should be based on the status variables exported by the highest levels of the model) and at the same time providing operations professionals with deeper information (by parsing downward through the structure).  It’s the technical piece that links the two transformation pathways.  If you model services based on intent modeling, you can then improve operations efficiency by automating customer support, and you can also ensure that carrier-cloud-hosted services or service elements aren’t marginalized by exploding operations costs.
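
A small Python sketch of that dual use of the model hierarchy, with an entirely hypothetical service structure: the customer portal sees only the status variables the top-level element exports, while operations can parse downward through the same structure for the root cause.

# Sketch (hypothetical model structure) of the dual use of an intent hierarchy: the
# customer portal reads only top-level exported status, operations drills downward.
service_model = {
    "name": "Business-VPN-1234",
    "exported_status": {"state": "degraded", "estimated_repair": "14:30 local"},
    "children": [
        {"name": "Access-Overlay", "exported_status": {"state": "ok"}, "children": []},
        {"name": "Core-Transport", "exported_status": {"state": "degraded"},
         "children": [
             {"name": "vRouter-cluster-East",
              "exported_status": {"state": "failed", "detail": "host pool saturation"},
              "children": []},
         ]},
    ],
}

def customer_view(model):
    # Only what the highest-level element chooses to export.
    return {model["name"]: model["exported_status"]}

def operations_view(model, depth=0):
    # Drill downward through the hierarchy for root-cause detail.
    print("  " * depth + f"{model['name']}: {model['exported_status']}")
    for child in model["children"]:
        operations_view(child, depth + 1)

print(customer_view(service_model))
operations_view(service_model)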

We are still, as an industry, curiously quiet about model-driven service lifecycle automation, despite the fact that seminal work on the topic (largely by the TMF) is more than a decade old.  If we spent some time, even limited time, framing a good hierarchical intent-driven service model structure, we could leverage it to cover all the bases of transformation.  It might not be a sufficient condition for transformation (other things are also needed), but I think it’s clearly a necessary condition.