Why Logical Networking is Our Real Future

What is the most important development in networking, not just technology but in the broadest sense?  What’s going to change, shape the future?  It’s the emergence of “logical networking” as what’s essentially a new OSI layer.  That single force is going to reshape everything, decide the fate of the new technologies, and define the role of network vendors, standards bodies, open-source projects, and how we think of applications.  In particular, it’s going to redefine SD-WAN, and perhaps SDN too.

As I said in a prior blog, a logical network is a network that connects logical entities, not physical network service access points (NSAPs).  In networking today, we typically build a network of subnetworks, with each subnetwork defining either a user community or an application hosting domain.  Subnets live inside facilities, and they’re connected with a WAN which in theory might be a true private network (links and routers) or a virtual private network (virtual links combined with either real or virtual routers).  The obvious problem is that what the network is supposed to connect are users and application components, which exist in real networks only as addresses.  If something moves, its addressing has to be altered explicitly.

A logical network is one where users and application components know each other by who/what they are, in effect by their logical name.  All connections are made through logical names, and the mapping between logical names and underlying NSAPs is automatic and implicit, so it doesn’t complicate the life of the user.  Logical networks, because they know who and what things are, can also apply policies directly based on identity, rather than having to work through how identities (which matter at the policy level) map to NSAPs (which are all the network knows about).
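To make that concrete, here’s a minimal sketch in Java of what identity-based connection could look like.  Every name, address, and policy rule in it is hypothetical; the point is simply that connections are requested by identity, policy is checked against identity, and the name-to-NSAP mapping never surfaces to the user.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// A minimal sketch of logical networking: names map to NSAPs implicitly,
// and policy is applied to identities, not addresses.  All names, NSAPs,
// and policies here are hypothetical.
public class LogicalNetworkSketch {
    // Logical name -> current NSAP; re-registered if the endpoint moves
    private final Map<String, String> nameToNsap = new HashMap<>(Map.of(
        "user:alice", "10.1.4.22:443",
        "app:order-entry", "10.9.0.8:8443"));

    // Policy expressed against identities, never against addresses
    private final Set<String> allowedPairs = Set.of("user:alice->app:order-entry");

    public String connect(String fromName, String toName) {
        if (!allowedPairs.contains(fromName + "->" + toName)) {
            throw new SecurityException("Policy denies " + fromName + " -> " + toName);
        }
        // The NSAP is resolved implicitly; the user never sees it
        return nameToNsap.get(toName);
    }

    public static void main(String[] args) {
        System.out.println(new LogicalNetworkSketch().connect("user:alice", "app:order-entry"));
    }
}
```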

By definition, a logical network should be independent of the physical network and connectivity underlayment (or underlayments) used.  That means that it represents a service-layer overlay on traditional network technology.  Two technologies have emerged that can provide this, software-defined networking (SDN) and software-defined WAN (SD-WAN).  The embodiment of logical networking is almost certainly going to end up being one of these, and the force of logical networking will reshape both of them.

In application/user terms, the great majority of SDN use today is in the data center, to create application subnetworks or tenant networks.  Virtualization demands agile deployment without compromising component, application, and user connectivity, and the easiest way to do that is to frame applications in a unique subnetwork that lets internal connectivity be open and implicit, but requires explicit exposure for external connections.  It’s easy to do this with SDN in any form, but the original SDN subnet-control technology (Nicira, now VMware NSX) was an overlay technology.

SD-WAN is also a form of overlay technology.  The original SD-WAN mission was primarily to extend VPNs to sites that were too small to justify high-speed MPLS/BGP connectivity, so it combined VPNs with the Internet for those small, sparse sites.  Some SD-WAN providers have also focused on connecting multiple MPLS VPNs (managed services) or on dealing with sites in areas where even Internet broadband was likely not all that “broad” and thus needed low overhead.  Most now offer hosted (meaning software instances) versions of their technology, which lets it be deployed in the public cloud and facilitates the integration of cloud components.

Both these missions have some credibility, but they’re also fairly technologically basic, which means feature differentiation is complicated if you stay with the original goals.  Not surprisingly, vendors have sought better differentiators.  In the SDN space, Nokia/Nuage is focusing on an SDN model that admits users and applications into overlay subnetworks.  Juniper likes multi-cloud.  For SD-WAN, there’s a trend emerging to be “entity-aware”, meaning having the SD-WAN understand applications, services, and users.  That’s the beginning of the logical-network future that I’m predicting, the trend that changes everything.  What’s driving it?  Market maturity, competitiveness (greed, if you like), and the great overarching issue of virtualization.

There is nothing technically magical about combining an Internet overlay VPN and an MPLS VPN, or doing overlay application subnetworks, or linking two or more MPLS networks together.  This is particularly true when you consider that the network operators themselves can apply SDN/SD-WAN technology to do just that.  Some (like Verizon) have already started.  If SDN and SD-WAN are purely technical transport overlays, then the logical provider for them is the data center network or software vendor (for SDN) or the service provider (for SD-WAN).  Once these guys get serious, which will happen when they see the money (and the money is already starting to appear), they define the market and make transport-level connectivity features table stakes.

It is totally illogical to think that network virtualization, network-as-a-service, would develop an overlay network whose goal was to do exactly what the underlay network did.  IP connectivity is already here, perhaps not perfectly but surely good enough for most users.  To get a larger total addressable market (TAM) service providers and vendors need to look at what network virtualization and NaaS would really demand above basic connectivity.  That’s the logical networking space.

In a report I published earlier this year, I outlined five things that a product would need in this new logical networking space.  They were session awareness, explicit logical name support, application/service awareness, an as-a-service or software-instance deployment option, and self-federation with respect to policies, namespaces, etc.  A few vendors in the space support all of these now, most support some, and a few have nothing at all.  The “market”, meaning the media/analyst space, hasn’t yet awakened to all of these points as elements in “logical networking”, and some of them aren’t recognized as a part of any particular tech trend.  That’s going to change.

There’s no question that virtualization is the prime driver of change from a technical perspective, with mobility second.  Anything that unties the traditional relationship between a network-connected entity and a network address is either going to force that tradition to change or limit the range of things that virtualization and mobility can accomplish.  Where either of these forces is acting, the push for logical networking becomes stronger.  Where both are acting, that push becomes relentless.  Nothing is going to stop this shift in emphasis.  It may stop some players, though.

None of the current reported top-ten in market share in this publicly available list (which I don’t agree with; none of the three vendors I see most often competing for current opportunities are even on it) is particularly strong in the logical network space, or even well-positioned to exploit it.  The list also proves that the space itself isn’t well-defined.  If it’s correct as a picture of the space in 2017, it combines with my recent experience to show that a major shift is already underway, and that the “incumbents” may be too complacent to be credible even in today’s leading-edge opportunities.

Let’s suppose that this logical-network thing keeps developing as I’m predicting.  The impact would be to move user/application issues up into the new logical-NaaS layer, which would be created almost entirely at the network edge.  Transport features would become increasingly invisible, and the need to sustain current Ethernet/IP service models would erode because the models would be invisible.  SDN adoption in the WAN would be facilitated.  Edge-centric NFV (virtual CPE or vCPE) would get subducted into these new SD-WAN edges.  Security would likewise be absorbed.  In short, these new devices or software instances would become the logical network service access point (L-NSAP), if you like new acronyms.  Every current user, current application, future user, future application, even new services and things like mobility, would likely be changed eventually to fit this new model.  Revolution, without any major new infrastructure investments.

Then there’s management.  Service management is a key requirement for any business service, and SD-WAN could provide a universal edge point where the service as the user sees it joins the service as the provider offers it.  From there you can see into the transport network choices, inward (to a degree at least) into the user network, and into all of the service features, bindings/routings, and policies.  It’s the perfect management agent.  There are already a few SD-WAN players who specialize in management and SLAs.  Everyone will be there eventually.

Eventually, this space is probably going to go open-source.  The opportunity is too large and the number of SD-WAN endpoints too high for operators to accept commercial solutions.  Remember, NFV got started in order to reduce capex from proprietary appliances.  Another big force is the evolution of P4-based standardized platforms like DANOS and Stratum, which I think will be adding support for non-forwarding-related processes co-hosted in the P4 boxes.  That means that vendors will need to stake out positions here, and make exits, very quickly.  It’s not going to be a problem this year, next, or even in 2020, but by 2022 I expect to see the market dominated by open devices—and that means SD-WAN, vCPE, uCPE, and everything else at the network edge.

Four years is a long time, but open-source doesn’t move quickly, particularly in the network space, and most of all in the operator space.  There’s plenty of time for somebody to get smart, get positioned, define the way that SD-WAN will work, sell out for big bucks, and shape the market.  Who will do it, or will anyone?  That’s the question.  Inertia is a powerful force, startups are crippled by their VCs’ lack of vision and “laser focus”, and a lot of this is going to be the kind of promotion and education combination that nobody likes to face.  Lots of startups won’t face it, perhaps even all of them.  They’ll hope for a small, easy market that keeps them safe.

SDN vendors are in the same position.  They have a secure data center niche, and while there’s no reason they couldn’t adopt all the logical network features I’ve noted (some already have, at least with respect to some key features), the market and vendors seem content for the moment.  What might make a difference for SDN is that some of the key SDN players have been acquired by major firms and have deeper pockets and an interest in the long-term.  They might want to use logical networking to broaden the total addressable market and improve margins.  Since applications and virtualization are the primary driver, and since they’re data center elements, SDN could still emerge as a critical factor in logical networking.  It could also continue to fall short of its promise.

Let me be clear.  Nothing in the SDN or SD-WAN space will be safe from convulsions because of the logical network shift.  In the SD-WAN space, nothing the “leading” players have relied on will matter, beyond a simple tick on an RFP checklist, even a year from now.  The differentiators will change, the requirements will change, and the leaders will change.  Long before the critical 2022 point, we’ll know whether anyone goes for the brass ring, but the change will come no matter what choice is made.

Interfaces, APIs, Events, Boxes: The Right Way to Do Network Software

Want to know the biggest technical problem in networking today?  It’s the fact that we can’t seem to stop thinking “interfaces” when we need to be thinking “APIs”.  Every day we see examples of this, and every day the industry seems to ignore the significance.  We need to face it now, if we’re going to make orchestration, automation, and even IoT work right.

API stands for Application Programming Interface, but the term “interface” by itself is usually applied to a hardware interface, something like RS-232, or a wireless standard like 802.11ac.  By convention, “interfaces” are implemented in hardware, and “APIs” in software.  That’s a huge difference, because if you churn out a million boxes that have to be updated to a new “interface”, you might be looking at replacing them all.  If a million boxes need an API changed, it’s almost always just a software update.

This update facility means that you have to spend a lot more time future-proofing interfaces.  APIs are not only more naturally agile in response to change, they’re usually based on message-passing and can have more direct flexibility built into them.  Further, API standards are fairly easy to transmute into other API standards; it’s called an “Adapter Design Pattern” and it makes one API look like another.  The approach works as long as you have the same basic data elements (even if they’re in different forms, like one that accepts “speed” in Mbps and another in Gbps) in both.
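Here’s a minimal Java sketch of the adapter idea, using hypothetical provisioning interfaces, one taking speed in Mbps and the other in Gbps.  The adapter converts the form of the data without changing its substance, which is exactly why the technique only works when both APIs carry the same basic data elements.

```java
// Hypothetical APIs: one expects Mbps, the other Gbps
interface MbpsProvisioner {
    void provision(String circuitId, double speedMbps);
}

interface GbpsProvisioner {
    void provision(String circuitId, double speedGbps);
}

// The Adapter: wraps a GbpsProvisioner so code written against
// MbpsProvisioner can use it unchanged
class GbpsToMbpsAdapter implements MbpsProvisioner {
    private final GbpsProvisioner target;

    GbpsToMbpsAdapter(GbpsProvisioner target) { this.target = target; }

    @Override
    public void provision(String circuitId, double speedMbps) {
        target.provision(circuitId, speedMbps / 1000.0); // same data, different form
    }
}

public class AdapterDemo {
    public static void main(String[] args) {
        GbpsProvisioner core = (id, gbps) -> System.out.println(id + " at " + gbps + " Gbps");
        MbpsProvisioner legacyView = new GbpsToMbpsAdapter(core);
        legacyView.provision("ckt-1", 500.0); // prints "ckt-1 at 0.5 Gbps"
    }
}
```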

All this may sound like academic/standards mumbo jumbo, but it’s actually so important that the success of standards could depend on it.  The reason is that if networking standards bodies apply the same processes to standards involving APIs that they’ve traditionally used for interfaces, they’ll get bogged down in details that have little or no consequence, and likely collide with other activities working in the same space from the software side.

Interface bias in standards groups tends to generate software models that look like connected boxes.  These models focus on traffic rather than events, and that undermines the exploitation of specialized software features like concurrency, scalability and resiliency.  If you look at the model of NFV that was produced in 2013 by the ISG, you’ll see this approach in action.  It’s fine as a representation of functional relationships, but it’s not suitable for driving software design.  A software person would have taken a different tack, starting with the “interfaces” of the end-to-end model and thinking of them as APIs instead.

Software processes that have to manage asynchronous activity have to be event-driven, because asynchronicity can only be communicated in terms of what’s happening and what state the elements are in.  Ask yourself what happens when, as a “box process” is handling a specific condition during deployment or scaling, another condition (related or unrelated) arises.  You’d like to be able to pass the new event off to another instance of the process, and coordinate any cases where two consecutive events impact the same service.  How do you show that in a box-process model?  You don’t.

A software developer with event-processing experience can see a lot of different ways to accomplish the same functional task, but in a software-optimized way.  In the early ExperiaSphere project, well before the NFV ISG launched, I demonstrated the notion of a “service factory” that could be instantiated any number of times, and where any instance could be passed a copy of the service model and the event and handle it correctly.  The “interfaces” here, really APIs, were event-passing interfaces that communicated only the event data.  The Service Factory pulled the service information from a database and did its thing.
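Here’s a minimal Java sketch of that factory notion, with hypothetical states, events, and a stand-in repository.  The property that matters is that the handler is stateless; any instance can process any event for any service, because the state rides in the shared service model rather than in the process.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A minimal sketch of a "service factory": stateless handlers driven by
// a state/event table, with service state held in a shared repository.
// The states, events, and repository are hypothetical.
public class ServiceFactory {
    enum State { ORDERED, DEPLOYING, ACTIVE, FAULT }
    enum Event { DEPLOY, DEPLOY_DONE, FAILURE, REPAIRED }

    // Stand-in for the shared service-model database
    static final Map<String, State> repository = new ConcurrentHashMap<>();

    // Any factory instance can run this: pull the model, transition, store it back
    static void handle(String serviceId, Event event) {
        State current = repository.getOrDefault(serviceId, State.ORDERED);
        State next = switch (current) {                     // the state/event table
            case ORDERED   -> (event == Event.DEPLOY) ? State.DEPLOYING : current;
            case DEPLOYING -> (event == Event.DEPLOY_DONE) ? State.ACTIVE
                            : (event == Event.FAILURE) ? State.FAULT : current;
            case ACTIVE    -> (event == Event.FAILURE) ? State.FAULT : current;
            case FAULT     -> (event == Event.REPAIRED) ? State.ACTIVE : current;
        };
        repository.put(serviceId, next);
        System.out.println(serviceId + ": " + current + " + " + event + " -> " + next);
    }

    public static void main(String[] args) {
        handle("svc-001", Event.DEPLOY);      // could run in one instance...
        handle("svc-001", Event.DEPLOY_DONE); // ...and this in a different one
    }
}
```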

Software architects defining APIs usually focus on two separate steps.  First, they decide what the higher-level interface will be.  In the software world, there are examples like REST (the web HTTP model), JSON (the JavaScript-derived data model), SOA and RPC, and a whole series of specialized APIs for the cloud.  Nearly all these APIs are designed to deliver payloads, meaning formatted data packages, which is the second thing the architect will define.  That means that software APIs are a combination of a simple package-exchange framework and some package format/content detail.

Does this mean that network standards people should be thinking of defining those package format/content rules?  Not necessarily.  APIs follow what some have jokingly called the “two-consenting-adults” rule; the two processes that are linked by an API have to agree on the package, but every possible process combination doesn’t have to use the same one.  The best example of an application of this rule is in “intent modeling”.

An intent model is an abstraction of a complex process that defines its behavior, inputs and outputs, and service-level agreement.  Does every intent model have the same API and package?  It shouldn’t, because every intent model doesn’t have the same “intent” or do the same thing.  A software geek would probably say that if you’re defining intent model APIs, you’d first create a broad classification of intent models; “IP subnet”, “IP VPN”, “cloud-host”, and so forth.  There might be a hierarchy, so that the “IP subnet” and “IP VPN” would be subclasses of “IP Network”.  Software is developed this way now; in Java you can define a Class that “implements” something, “extends” something, and so forth.
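In Java terms, the classification might look something like the sketch below.  The class names and the SLA shape are hypothetical, but the structure (a broad class, subclasses that inherit and extend it) is the point.

```java
// A minimal sketch of intent-model classification as a Java hierarchy.
// Names and method shapes are hypothetical.
interface IntentModel {
    String name();
    boolean meetsSla();   // every intent model self-reports against its SLA
}

// Broad class: all IP networks share addressing-related intent
interface IpNetwork extends IntentModel {
    String addressRange();
}

// One subclass of IP Network...
class IpSubnet implements IpNetwork {
    public String name() { return "IP-Subnet"; }
    public String addressRange() { return "10.0.1.0/24"; }
    public boolean meetsSla() { return true; } // stub: would query the implementation
}

// ...and another, inheriting the base intent but adding its own
class IpVpn implements IpNetwork {
    public String name() { return "IP-VPN"; }
    public String addressRange() { return "172.16.0.0/16"; }
    public boolean meetsSla() { return true; }
}

public class IntentModelSketch {
    public static void main(String[] args) {
        IpNetwork subnet = new IpSubnet();
        System.out.println(subnet.name() + " on " + subnet.addressRange()
                + ", SLA met: " + subnet.meetsSla());
    }
}
```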

Interface people tend to focus instead on the notion of a universal data/information model, which demands that they explore all the possible things a software function might do and all the data it might need to do it.  Obviously, that takes a lot of time and produces something that’s potentially enormous.  That “something” is also very brittle, meaning that any change in technology or service could give rise to new data needs, which would then have to extend the old model.  Since that old model is “universal” it raises the risk that previous implementations might then have to be updated.  With the “right” approach, new data is only significant in the specific packages where it might be found; everything else stays the same.

People might argue that this loosey-goosey approach would make integration difficult or impossible, but that’s not true.  Suppose some body, like the TMF or ETSI, defined a set of intent classes to represent the elements of network services, meaning the features.  Suppose then that they defined the basic input/output requirements for each, which would not be difficult.  They could then say that anyone who implemented an Intent-Class IP-Subnet had to be able to run properly with the specified class data packages, and that they could offer extensions only within that basic operation.  Now integration is a choice for operators—use the basic class, or exploit extensions where their benefits justify it.  If enough people decided to use an extension, it might then be made part of the basic class, or more likely used to create a derivative class.

Network services, or features of network services, aren’t the only place where APIs and interfaces will end up biting us if we’re not careful.  IoT is at great risk, and in fact so is event processing in general.  Here it’s important to remember two things; simplicity is important to contain costs and processing effort, and software can adapt to change more easily than hardware.

The big problem with IoT is that we’ve yet to face the truth with it overall.  We are not going to be putting hundreds of millions of sensors on the Internet, what we’re going to be doing is exposing vast amounts of sensor data on the Internet.  That is best done through APIs, and of course it should be APIs that deliver event information to functional/lambda elements or microservices.  Again, what’s needed here are “API classes” to allow multiple types of sensors to be combined with event processing and analytics to deliver secure, compliant, useful information.
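A hypothetical sketch of what such an API class might look like is below: sensor events, typed by sensor class, delivered to lambda-style subscribers.  The event fields and names are my own illustration, not any standard.

```java
import java.util.List;
import java.util.function.Consumer;

// A minimal sketch of an IoT "API class": sensor data is exposed as typed
// events delivered to functional/lambda subscribers, not as raw devices
// on the Internet.  Field names and values are hypothetical.
public class SensorApiSketch {
    record SensorEvent(String sensorClass, String sensorId, double value, long timestamp) {}

    // The API delivers each event to the registered lambda elements
    static void deliver(SensorEvent event, List<Consumer<SensorEvent>> subscribers) {
        subscribers.forEach(s -> s.accept(event));
    }

    public static void main(String[] args) {
        // A lambda element: trivial threshold analytics on one sensor class
        Consumer<SensorEvent> tempAlarm = e -> {
            if (e.sensorClass().equals("temperature") && e.value() > 40.0) {
                System.out.println("Alert from " + e.sensorId() + ": " + e.value());
            }
        };
        deliver(new SensorEvent("temperature", "t-17", 42.5, System.currentTimeMillis()),
                List.of(tempAlarm));
    }
}
```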

We’re never going to make NFV integration effective without a model of services and infrastructure that’s hierarchical and supports inheritance and extension, like programming languages (Java for example) do.  We’re never going to make it work at scale without state/event logic, and newer things like IoT are going to be even more dependent on those two requirements than NFV is.  Nothing that’s done today in NFV, orchestration, IoT, automation, or whatever is going to have any long-term value without those two concepts, and it’s time we accept that and fit them in.

The Telco Cloud: Is It Ever Coming?

There were a lot of articles yesterday in Light Reading on various aspects of the telco cloud, perhaps leading up to a theme for their coming event.  We had the “strategic imperative” piece, some more technical pieces, and the view that the cloud was essential for hosting NFV.  Operators are in fact “wanting what they’ve got” (to quote one article, referencing operator envy of the OTT cloud positions), but the real question is how they’d get there.

It’s fairly easy for Amazon, Facebook, Google, and Microsoft to deploy cloud infrastructure.  Three of the four sell cloud services, and the other (Facebook) is a social-media-content player.  The telcos, in the main, do not sell much in the way of cloud services and they have no social-media content presence to speak of.  The point is that the over-the-top business model is great if you’re an OTT, but if you have to build and sustain networks, network infrastructure is likely to dominate your technology planning.

For operators to get to the cloud, there are two possible pathways.  First, they could use cloud technology to host network features.  That’s the NFV story, but it could be broadened to include things that haven’t been a historical focus for NFV, like the hosting of router instances.  Second, they could host something above the traditional network, meaning have an OTT component to their business.  Which of these could get them to a credible cloud position?  The answer would depend on a combination of “first cost” and return on investment (ROI).

First cost is a term that’s lived a long time in the operator world.  If you were to graph the cash flow associated with a new service that required an infrastructure investment, you’d see a curve that went downward like the dipping piece of a sine wave, then returned to cross the zero line and eventually head upward to plateau at the cash flow level that reflected full exploitation of the market opportunity.  That first downward dip is “first cost”, and it reflects the fact that to make a service credible you have to deploy a certain amount of infrastructure before you have any real sales to create a return.  Too much first cost means a higher risk.  Suppose it doesn’t turn around?
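To make the arithmetic concrete, here’s a trivial sketch with purely made-up cash-flow numbers.  First cost is the deepest point of the cumulative cash-flow dip, the capital at risk before the service turns around.

```java
// A minimal sketch of "first cost" as the deepest dip in cumulative
// cash flow.  The yearly numbers are purely illustrative.
public class FirstCost {
    public static void main(String[] args) {
        double[] yearlyCashFlow = {-50, -30, -10, 20, 60, 80}; // $M, hypothetical
        double cumulative = 0, worstDip = 0;
        for (double cf : yearlyCashFlow) {
            cumulative += cf;
            worstDip = Math.min(worstDip, cumulative); // track the deepest point
        }
        System.out.println("First cost: " + (-worstDip) + " $M"); // 90 $M here
    }
}
```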

ROI is the more familiar measure of credible investment.  Most companies set a target ROI for their capital projects, and CIO approval of most projects will depend on establishing a credible strategy to earn that ROI fairly quickly.  The more revenue you can generate, and the faster you can generate it, the better the ROI.  Similarly, the lower the “I” or investment part, the better the ROI will look at constant revenue.

Any operator will judge our two pathways to the cloud based on these financial factors, and if the approach can’t measure up, it’s unlikely it will get CIO or CEO buy-in.  So the question for carrier cloud isn’t how we can get there (any way that generates the proper ROI) but rather which of the pathways has credible destinations that could generate results—and quickly.

There are six possible drivers that break down into the pathways that I’ve noted.  The first is virtual devices, which is the NFV route.  Second is virtualized ad and video delivery, a combination of caching and ad targeting.  Third is 5G transformation, the potentially massive shift in mobile technology accompanying 5G deployment.  Number four is network operator public cloud services, implying operators would get into the cloud business.  Five is contextual services, the creation and hosting of personalization features that arise with the increases in social-media impact, and six is the Internet of Things (IoT).  The first three of these are at least partially aimed at our network-technology-hosting pathway, and the last three at the OTT above-traditional-network pathway.

My modeling of the opportunity, based on techniques evolved from operator surveys that started in 1989, has assigned probable opportunity sizes and exploitation rates for all six of our drivers, which means we can assign them to the two pathways as well.  How does it look?  Let’s see.

According to the way operators have historically made service/infrastructure decisions, the first question we’d need to ask is whether any of our drivers created less than 10% of the available opportunity in a given year.  Operators don’t always jump on the driver with the greatest opportunity, but they rarely dip down to something that has less than a 10% rate of exploiting opportunity.  In 2018 and 2019, that logic says that neither NFV nor contextual/social service features have a chance of driving carrier cloud.  These drivers could exploit cloud deployments that something else justified, but they’d not be able to go it alone.

What has the best chance of driving carrier cloud in the 2018/2019 timeframe is advertising and video delivery, which controls about 55% of the opportunity in those years.  Second is the 5G modernization opportunity.  What makes these good drivers is the combination of the fact that there are already demonstrable opportunities, they require little in the way of proactive service marketing, and they have significant growth potential over five years.

But, you might ask, isn’t it true that both these drivers are NFV-dependent?  No, it’s not.  They require hosting things, but not hosting of specific in-line network features.  Even IMS/EPC features that are cloud-hosted would look more like cloud applications than virtual functions, and the orchestration and management complexity of NFV couldn’t really be justified for them.

The biggest shift in demand focus comes in 2020.  That year we see an uptick in 5G interest, combined with a significant jump in IoT-related opportunity.  However, the largest opportunity contributor will still be advertising and video delivery, which will almost double its service revenue potential year over year.  5G and IoT have significant potential, but the problem of first cost and effective positioning and marketing is likely to make them under-perform in the real world.  I expect the realization of opportunity to continue to favor ad/video.

IoT and 5G start to combine with ad/video and contextual services in 2022, according to my latest model iteration, or they should.  The challenge for the industry, and forecasters of the industry, is that the convergence presumes operators would recognize a need to shift to a middle ground between connection services that dominate their revenue today, and OTT services that their rivals seem to have locked up.  Network feature hosting creates what are essentially “feature services” from which traditional network services can be composed.  Feature services could also be the basis of OTT services, and so they’re where operators might gain traction.

Let’s go back to first-cost and ROI.  Operators, as former utilities, typically approve projects whose first cost and ROI would be unfavorable to the OTTs.  That’s what keeps the latter out of the network service business.  Things like IoT, ad/video personalization and optimization, and contextual services are all examples of feature elements that could be composed into both traditional and OTT services, but that could require considerable infrastructure deployment.  That makes them less attractive to OTTs, and operators could step in and provide the features rather than trying to compete with the OTTs for the higher-level services—many of which the OTTs already dominate.

IoT is the best example of this.  Instead of focusing on IoT as an opportunity to sell 5G service to sensors, operators should be thinking about how to develop sensor data into features.  Operators, with a lower ROI target than the OTTs and more experience with massive pre-positioning deployments, are in a better position to develop useful intelligence from sensors and deliver useful insights based on correlations of mobile user behavior patterns.  The opportunities are here now, they won’t last forever, and whether operators take them up or don’t will be the thing that determines whether they compete with OTTs in the cloud, or just continue to press their noses to the cloud-candy-store window.

It’s Time to Think About How to Harmonize Independent Service/Network ZTA

Is zero-touch automation synchronized swimming or just playing in the pool?  That’s a perhaps-less-than-elegant way of asking whether ZTA requires a unified orchestration model, or can be achieved while retaining the separation of network and service management.  This is important because we don’t have unified standards or unified telco management across those two spaces, and if we do need a single model, we may struggle longer to get to our goal.

Historically, network operators have had three technical divisions—operations, science and technology, and OSS/BSS.  These are ruled by the COO, CTO, and CIO, respectively, and the three organizations are polarized enough that operators have tried forming executive-level teams to address what should be common strategies.  It’s not been a rousing success.

At the technical level, we have had a similar separation in infrastructure and spending, but this time we have two major focus points—the network operations activities that center on the network operations center (NOC) and network management systems, and the service operations activities hosted within the OSS/BSS framework.  Both of these areas have evolved as networks shifted from a provisioned TDM model to a capacity-planned packet model, and as services shifted from incident billing to period billing.  They’ve not gotten closer together, and in many of my discussions with operators I’ve actually seen more tension between the various technical models and the organizations that sustain them.

SDN and NFV threw a new complexity into the mix, largely because both technologies presume a more complicated technical structure aimed at reducing costs.  Complexity, of course, is never cheaper; the opposite is true.  Orchestration, in some form, is the reaction to this added complexity and the associated risk of rising opex.  You can see this most clearly with NFV, whose Management and Orchestration and VNF Manager (MANO and VNFM) sought to address the new and more complicated model of creating functions by stitching together separately hosted components.

SDN and NFV management have focused, I think for convenience rather than as a strategy, on making virtual stuff look and behave, at the management level, like the physical element it replaces.  A VNF-based virtual CPE (vCPE) deployment “looks” like an appliance, which means that management practices outside it could theoretically stay the same, both in terms of human interaction and management systems.  Orchestration, then, focuses on making management inside the box automatic.  A virtual device, in modern terms, is then an intent model that’s self-managed or at least zero-touch-managed against its SLA.

The industry has waffled on the point, but now it’s pretty broadly accepted that intent models are deployed and sustained using a data-model-driven process, increasingly based on OASIS’ TOSCA standard.  A model-driven approach lets a common software structure handle diverse network elements and events, which is important if you want to avoid rewriting management processes every time you deploy or sell something new.  The need for agility in the management of stuff contained within an intent model naturally raises the question of why you’d stop with ZTA when you’d replicated your old management model.  Why not try for something better?  That’s the question that gets us to the problem of synchronized swimming versus pool anarchy.

Think of the question as being one of scope.  If I can model infrastructure and automate network management practices, and if network management is related to service management (which it clearly is) then why not model both at the same time and create ZTA across all processes?  A given state/event combination could just as easily activate an OSS/BSS process as an NMS process, at least in theory.

But look at the flip side, meaning look at things in terms of event flows.  The great majority of things that happen at the network infrastructure management level have no reason to be visible at the service management level.  Network management remedies network issues, and if remediation is successful there is no service event at all.  Similarly, if a customer is late paying a bill, that may have to trigger a service-level reminder, but I’ve never run into an operator who proposed to prep the network for a possible no-pay service deactivation.  You’d wait till you made the commercial decision, then send a deactivate event to the network.

The point is that services and infrastructure are loosely coupled.  Not everything that happens in one of the areas propagates, even indirectly, into the other.  That means that it is not necessary to unify the two areas in a common orchestration/ZTA model, from an automation perspective at least.  There are other benefits, however.

One of the main reasons why operators say they’d like a common approach to service/network infrastructure ZTA is that it’s inconvenient to learn and maintain two different toolkits, and to synchronize their behaviors when something changes either in the service framework or the network.  This view is widely held among operators, but not widely held within them; organizational separation has tended to make a unified model look difficult to establish and sustain.

Another benefit of a common model is that most operators think they need to transform their OSS/BSS systems to be more event-driven, and if you’re going to do state/event processing in network ZTA, it would be nice to simply extend that into the OSS/BSS.  The same problem is evident here, though.  Operators fear that trying to unite services and networks in ZTA will raise internal barriers.

My work with ExperiaSphere suggests that there are perhaps a half-dozen events that cross the boundary between services and networks.  Thus, with today’s services, it’s really difficult to frame a specific need to apply common ZTA logic across both the service and network domains.  It’s going to be a matter of finding the best approach, which is one that works “locally”, can be integrated where there is interdependence, and can be sold internally without protracted political battles.

Let’s turn the question around, now.  The biggest reason to keep ZTA separated is that it’s developing that way anyway.  There are a number of credible initiatives for ZTA in the OSS/BSS space, and where there isn’t an easy path to ZTA in OSS/BSS, getting there may be very difficult.  Difficult enough, in fact, that it could slow up ZTA in the network space.

The biggest barrier to independent ZTA for networks and services may be effective management of the event connection between the two.  A more robust notion of intent modeling could help here, since a service and a network, in management terms, could be visualized as high-level intent models.  In that architecture, passing events between the models is an essential piece of coordinating behavior, and so we might get a lot of value from simply defining a “service” and a “service-infrastructure” as high-level models with their own event sets.  Vendors or operators would be free to do what they liked inside, as long as they harmonized to the specific standard at the border.
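To illustrate how little would have to cross that border, here’s a minimal sketch with hypothetical event names.  The real event sets would have to be standardized, but each domain stays a black box behind its own interface.

```java
// A minimal sketch of the service/network ZTA border: a small, standard
// set of events crosses between domains, and each domain is a black box
// internally.  Event names are hypothetical.
public class ZtaBorder {
    enum ServiceToNetwork { ACTIVATE, CHANGE_ORDER, DEACTIVATE }
    enum NetworkToService { ACTIVATED, SLA_VIOLATION, UNRECOVERABLE_FAULT }

    interface NetworkDomain { void onServiceEvent(String serviceId, ServiceToNetwork e); }
    interface ServiceDomain { void onNetworkEvent(String serviceId, NetworkToService e); }

    public static void main(String[] args) {
        ServiceDomain oss = (id, e) -> System.out.println("OSS/BSS sees " + e + " for " + id);
        NetworkDomain net = (id, e) -> System.out.println("Network acts on " + e + " for " + id);

        // The commercial decision comes first; only then does the network hear about it
        net.onServiceEvent("svc-9", ServiceToNetwork.DEACTIVATE);
        // Successful remediation never crosses; only an SLA-visible failure does
        oss.onNetworkEvent("svc-9", NetworkToService.SLA_VIOLATION);
    }
}
```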

I still think it would be better to adopt a single model here.  I think that too much vendor differentiation or even too many open-source solutions will only dilute efforts and make integration/harmonization more difficult.  Sadly, I think the opportunity to achieve that has passed, so we should probably focus now on how to get the second-best approach to work optimally—and quickly.

Where Might the Streaming-Video Dynamic Take Us?

The Netflix results suggest that OTT video is a big factor in broadband’s future.  If that’s true, then what does it mean for the industry as a whole?  Linear TV was at one point the credible profit engine for wireline broadband.  What happens to that sector if the engine breaks down?  What happens to vendors?  How do the players reduce their risks of major damage?  Difficult questions, uneasy answers, and a lot of uncertainty seem certain.

The industry’s problem is the classic one of a feedback loop.  Online advertising has hit TV advertising, and in particular advertising associated with smartphones and mobile sites.  This reduces the budget for shows, which limits the quality of material and the audience it can pull in.  Combine that with the fact that mobile video tends to favor on-demand viewing over real-time streaming, and you have a formula for negative feedback.  It’s obvious, particularly with AT&T’s renewed commitment to DirecTV Now and Netflix’s quarter, that this is impacting at-home TV viewing too.  Time-shift viewing invites on-demand alternatives, and that’s where we’re headed.

AT&T is, in my view, going to be the determinant in this story, because the company seems focused both on mobile services and on its streaming offering.  If AT&T drives DirecTV Now as aggressively as it seems to intend, including offering the cheapest skinny bundle anyone has at $15, then it’s hard to see how Verizon could avoid a major commitment to streaming.  But what exactly would that shift mean for the industry?

Verizon has the highest demand density of any US operator; about seven times that of AT&T because of its smaller footprint and larger population density.  As a result, Verizon could expect to earn more for each mile of fiber deployed, because it would pass more opportunity dollars.  So does Verizon decide to make what’s been a TV war into a broadband Internet war?

Demand density is a key to the whole issue here in the US, as it is in the rest of the world.  Roughly speaking, “demand density” is the opportunity for network services, measured per square mile of area.  If you have a lot of economic power concentrated in a small area, you have high demand density.  Of course, the converse is also true.  For video, though, the complicating factor is that if there’s a lot of population concentration in cities, you end up with over-the-air video potential.  Cable TV is hot in the US because most rural/suburban people can’t get good reception with an antenna.  The additional channels are a value for some, but the main over-the-air stations represent the live viewing market.

Live viewing has been the defense against streaming until recently, when broadband Internet speeds and metro/caching infrastructure made “live” on-demand practical.  This is what frames the choice for Verizon.  If you have a streaming service, you’re untied from specific delivery infrastructure and specific service geographies.  Anywhere with broadband will do.  Why not just pump up your Internet, field a streaming service, and take AT&T on?

The problem with that, even for Verizon with its higher demand density, is cable.  AT&T showed that you can’t really field a competitive linear TV offering using DSL, so they acquired DirecTV.  Verizon’s high demand density means they could deploy fiber more broadly, but they’ve said from the first that FiOS had its limits in terms of target geographies, and they’ve nearly stopped expanding it.  CATV cable has a much higher intrinsic speed potential than DSL, so jumping into a “how-high-can-you-go” game with broadband speed could force Verizon back into FiOS expansion even where it’s not profitable enough.

There are two possible solutions to the dilemma both Verizon and AT&T face.  The first is to write off wireline plant modernization and focus entirely on mobile customers, and the second is to adopt the 5G hybridization with fiber-to-the-node (FTTN) to potentially cut the cost of multi-megabit broadband to the home by more than 70%.  That would allow either operator to cut prices and still improve wireline profits.

Fundamental to the first approach is the notion that not only is mobile where the money is, mobile will inevitably kill wireline broadband and mobile TV will inevitably kill linear TV.  If those points aren’t true, then focusing entirely on mobile will kill off wireline revenue with nothing left to compensate for it.  Some operators argue privately that’s OK, because wireline profit is doomed anyway and so it’s better to leap into the stream now than wait till a flood sweeps you away.

Mobile is already highly competitive, and would a shift to a mobile-and-streaming strategy not simply make it more competitive?  That would inevitably erode mobile-service profit margins, which could well kill mobile’s advantage in profitability, leaving operators with nothing if they’ve already ceded wireline.  With wireline gone and mobile marginal, telcos could well end up fatally disadvantaged competitively.

Guess who the winner would then be?  The big problem with a mobile-only approach is cable.  The cable companies have a lower pass cost per customer for both Internet broadband and linear TV, and that means they can stay in the business even if the telcos decide to bail.  If the telcos did, cable could use its wireline/CATV base against their mobile-only strategy.  CATV standards are improving cable broadband all the time, too, which means that even if the industry went to full streaming of video, cable providers could be in a good position.

Cable could mess up mobile, too.  Comcast and Charter are pairing up to create a new operations platform for MVNO services.  That could be used to support an expanded set of mobile partnerships (beyond Verizon, the current one), which would weaken Verizon’s position should they try for a mobile-only strategy.  It could also be used to prepare for network slicing under 5G, though it’s far from clear that’s going to advance quickly enough to matter in the current competitive dynamic.

And then there’s content.  If everything else is going south, content is the one haven that’s secure whether you have linear TV or streaming.  The merger between Comcast and NBCU is in all practical senses a mirror of the proposed merger between AT&T and TW, but the former has already gone through and the latter is held up.  If we believe that linear TV is going south, then we have to believe that some content relationship is essential for any ISP that’s relied on video in the past.  For AT&T and regulators, the question is whether DirecTV itself, as a satellite TV property, is enough.

Streaming video is the big consumer of mobile and wireline bandwidth, so operators would have to confront streaming through increased access bandwidth and total capacity.  However, all of this would be highly metro-focused since video tends to be cached in metro enclaves and access networks connect users to that metro video pool.  The operators then see a two-part investment—more metro video pool capacity (cache and transport) and more radio network capacity.  Where mobile competition is the focus (Europe, for example) that translates into direct pressure to upgrade cell sites to 5G, though it’s not clear whether 5G Core would be similarly boosted.  Where linear TV is being replaced by streaming for wireline services, the focus would be limited to 5G/FTTH hybrids.

Vendors will find the focus important, for several reasons.  First, any erosion in full 5G support would tend to work against the 5G infrastructure players like Ericsson, Huawei, and Nokia.  That’s because open 5G RAN is a goal for operators globally, and if you pull 5G NR out of the 5G picture to promote it independently, you’re left with 5G-over-4G non-stand-alone (NSA), which isn’t the kind of huge integrated pie that the mobile specialists can easily dominate.  Second, metro networks are better candidates for SDN than for massive routers, and they also lend themselves to the hosted-router-software white-box solutions (as AT&T’s white-box-in-cell-sites decision shows).

You can see the many dimensions of uncertainty here, I’m sure.  Geography matters, as well as competitive dynamic for the operators, the pace of 5G adoption, the pace of streaming adoption in replacing linear TV, demand density, and perhaps most of all the rate of adoption of open-box, open-OS technology in devices…it’s a mess.  I can’t model all the variables at this point, but what seems to be true is that three factors dominate.  One, which I’ve blogged about recently, is the open-box trend, the second is 5G adoption, and the third is the displacement of linear TV.  Operators have to budget capital expansion, and if they’re under profit pressure from streaming, they are much more likely to push hard for cheap, open solutions.  They are also, in the US at least, more likely to adopt less-than-5G, non-stand-alone solutions.  That ends up putting pressure on vendors across the board.

What Can We Really Expect from “Automation”?

We’ve had the SDN buzz and the NFV buzz and now we seem to be getting the automation buzz.  It’s not quite hype yet, because everyone is trying to come to terms with just what “automation” is, but we’re surely headed there.  Will it become reality?  That depends on the benefits it can produce.  I mentioned in a previous blog that Huawei’s comment about cutting network operations staff by 90% wouldn’t be the revolution it seems, and that’s true with automation as well.  The value of automation in cutting costs depends on its scope, but also whether the “cost” has much of a displaceable human component.

Over the last 30 years, operators have cut their headcounts significantly, dropping the notion of dialing the operator, going to self-install for many services, and using automated attendant systems and offshore workers to reduce support headcount.  Telecommunications employment peaked in 2001, and since then has declined every year.  Today, it’s at about 75% of long-term historical levels, a result of a combination of consolidation, automation, and outsourcing.  This makes technical operations tasks a major contributor to direct employee headcount, but it complicates the way that operations expenses (opex) can be reduced.

Let’s start with network operations, which is what many vendors think of as “opex”.  In fact, opex in a financial sense is just about everything that’s not capex, in terms of cost.  You have to first dig out of “opex” the stuff that’s really related to service and network processes, which I’ve called “process opex”.  This will account for about 29 cents of every revenue dollar this year.  It would be very difficult to assign technology or automation any role in managing things outside process opex.  What can be done there depends on what part of the cost picture you’re talking about.

As I pointed out in that past blog, directly attributable network operations costs represent only about four cents of each revenue dollar today, which is about 40% of their 2001 contribution.  Interestingly, since 2014 the directly attributable data center operations costs have run just a bit higher, but they are growing more slowly.  Between 2014 and 2018, data center ops costs have gone up about 0.4 cents (less than half a cent).  In the same period, network operations costs have gone up by almost a full cent (though that growth has plateaued since 2016).

The point here is that we have a cost growth risk in net ops, a risk that’s largely associated with the increased complexity of modern IP networks and the growth in consumer broadband.  Services like SDN and NFV that could increase the complexity of networks while reducing capital costs, could end up creating more cost than benefit if they reignite the growth of net ops costs.  That means that the goal of automation at the network operations level is primarily to sustain current costs as service and network complexity increase.

The next cost area we need to look at is customer technical support.  This cost has gone from about 2 cents per revenue dollar 20 years ago, to about 5 cents today, despite the fact that automated self-help systems, offshoring, and self-installs have become widespread in that period.  What operators are finding is that they have reached the limit of self-help and efficiency improvements in support response; they now need to reduce incidents.

Reducing support incidents means reducing service problems that impact the user.  Automation can help with that, but only if it can produce a remedy within the “visibility time limit”, the interval between the user noticing something wrong and taking an action that commits the operator to a support cost.  Operators know that in many cases, the best way to reduce incidents is redundancy, resiliency, and over-capacity applied almost at the resource level.  This is a bit more like “autonomous” or “self-healing” than “automated” or “AI/analytics”.

The biggest process opex item is the most problematic: customer acquisition and retention.  This accounts for about 12 cents of every revenue dollar today, and it’s doubled since 2000.  What caused this?  Competition.  What’s the answer to competition?  Differentiation.  The problem we have now, according to operators, is that we’re limited in our ability to respond to service problems in a way that ensures the customer doesn’t seek another provider, and limited in the ways we can make a service stand out enough to be a market leader.

The best differentiation is a differentiated service, something that is a lot better than the competitors can offer.  Next best is a cheaper service that’s still profitable.  Automation to allow for service agility and at the same time stable opex would be very helpful in this space, but while it’s probably a necessary condition for service-based differentiation, it’s not a sufficient condition.  Operators are not used to marketing services, or much of anything.  Today they use pricing, smartphones (that the makers market), and bundled content to market mobile services, and they market consumer broadband on maximum speed.  Absent a strategy to exploit service features at the marketing level, automation isn’t going to do much.

Across all of this is the scope-of-impact problem.  In order for automation to improve opex overall, it has to automate the service processes fully, which means that it has to cross the traditional border between OSS/BSS and the CIO organization, and NMS/NOC and the COO.  This border has been difficult for operators to negotiate, particularly because new technology tends to come out of the CTO group, and because the CTO has greater influence on the COO/operations side than the data center/CIO side.  Some operators have established high-level teams to try to build a transformation strategy that spans this border, with mixed success.

Automation, optimally applied, could save about 5 cents of every revenue dollar in 2018, and that improves about a cent per year for the next three years.  However, almost half of that savings would have to come from customer acquisition/retention costs, meaning that it is held hostage both to the integration of service and network operations automation, and to the ability of network operators to design and market useful service features.

The real difference between “hype” and whatever you’d like to call “not-hype” is whether the stories can actually lead to a business case, which is the only way anything is going to be transformed by technology.  Until we look at just what operators really spend, and at how “automation” can influence that spending, and at what else would have to be done to exploit automation, we’re not going to make a lot of progress to a fully transformed and revolutionary future.

However…the ROI on service lifecycle automation is in itself enough to justify the relatively low investment needed to exploit it.  Automation may not transform the practice of networking, but it can still make a difference.  The low apple here, that five cents in savings that might be achieved, might be high enough to get things started.

SDN, SD-WAN, and Application Networking

In my last blog, I mentioned that we might be on the verge of “application networking”.  Some of my readers who know me sent emails asking if I could define exactly what I mean by that.  I can do that, but perhaps better still I can offer the definition that enterprises have evolved during the time I’ve surveyed them.  I can also offer my view on where application networking is likely going to go.

To users, “application networking” is the process of creating and sustaining connection among users and application components.  That, perhaps, isn’t totally helpful because in theory you could do that with a good shipping service.  When you dig down a bit, the users first add the notion that the connections have to support the quality of experience required for the user/workers, offer the availability they need, and support the range of traffic that application/user loads create.

Functionality plus service-level agreement, right?  That’s a fair start toward defining application networking as an intent model for connectivity.  The big question, though, is the nature of the inputs and outputs, meaning how the users and application components connect and how traffic is routed among them.  We have inherited our current model of connecting/routing from the Internet and even earlier service models, but is that what the users really want?

Not according to the surveys I’ve done.  Almost from the start (1989 for those interested in the specific history of the view) users have favored the notion of “logical networking”.  A logical network is one where users and application components (resources, if you like) are known by logical names.  The connections made are therefore logical connections, and things like URLs and IP addresses are things that should never rear their ugly heads above the users’ horizon.

The notion of URLs (or URIs, to adopt the current terminology) has been around a long time and has been used as a means of linking a logical name to a physical network address.  The problem of our age is that the physical network address of something is really the address of the “network service access point” or NSAP where that “something” is currently attached.  That was fine when users were chained to desks and applications chained to mainframes.  Today, it’s not so great.  Virtualization means that resources move dynamically, and mobile devices and mobile workers mean users do too.

The decoding of URLs, which comes through the Domain Name System (DNS), is a potential pathway to logical networking.  If every network user and network resource had a logical name registered in a DNS, updated whenever the user or resource moved from one NSAP to another, you’d have a kind of proto-logical network model.  The problem would be that the updating of the DNS could take time and introduce an interruption in communication.  This is less an issue for message/transaction activity than for real-time streams, but it’s still a factor.  So is DNS security.

If the specific implementation of DNS isn’t the best answer, then the general strategy of a registration-and-reference database that translates between logical name and NSAP probably is.  Mobility management in today’s cellular networks is a form of this; a phone is associated with an IP address, but the traffic to the phone is piped to a tunnel that moves as the phone does.  The process can be handled to minimize the impact on the conversation, and surely moving tunnels is more time-consuming than other modern options for traffic management.
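A minimal sketch of the registration-and-reference idea is below, with hypothetical names.  The essential behavior is that a move is just a re-registration, and connections resolve the logical name at connect time rather than binding to an NSAP forever.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A minimal sketch of a registration-and-reference database: endpoints
// register their current NSAP under a logical name, re-register when they
// move, and peers resolve at connect time.  Names are hypothetical.
public class LogicalDirectory {
    private final Map<String, String> registry = new ConcurrentHashMap<>();

    public void register(String logicalName, String nsap) {
        registry.put(logicalName, nsap); // also called on every move
    }

    public String resolve(String logicalName) {
        String nsap = registry.get(logicalName);
        if (nsap == null) throw new IllegalStateException(logicalName + " not registered");
        return nsap;
    }

    public static void main(String[] args) {
        LogicalDirectory dir = new LogicalDirectory();
        dir.register("app:inventory", "192.0.2.10:8080");
        System.out.println(dir.resolve("app:inventory"));   // binding before the move
        dir.register("app:inventory", "198.51.100.7:8080"); // component redeployed
        System.out.println(dir.resolve("app:inventory"));   // same name, new NSAP
    }
}
```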

Users also think that load-balancing should be an attribute of an application network.  This issue is a lot more complicated than it seems, because virtualization can move a resource a considerable distance, both in geographic and NSAP terms, and the optimum position of a load-balancer depends on where the scaled resources are hosted.  In today’s enterprise services, this complication is exacerbated by the fact that wherever that optimum location is, it’s likely inside a VPN whose structure is opaque and where user processes can’t be directly hosted.  What you really need to have in a VPN world is a distributed load-balancer that operates at the NSAPs near the user, so that it can pick a resource instance based on any number of relevant factors, including the traffic and QoS impact of the selection.

I worked out a distributed load-balancer at one point in my network integration career.  It issued a “bid” to resources and picked the one that responded with the best performance/load factor.  That approach worked for the kind of application I was dealing with at the time, which had a small number of requests separated by considerable think time.  There are many other approaches that are more generally useful, and keep in mind that complex scheduling doesn’t work well in distributed load-balancing; you can use either round-robin or random-number approaches.  You do have to ensure that the user connection points are kept up-to-date on the available resource points to balance among.
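For illustration, here’s a stripped-down sketch of the bid mechanism.  In a real system the bids would arrive over the network from the resource instances, and computing an honest performance/load factor would be the hard part; the values here are hypothetical.

```java
import java.util.Comparator;
import java.util.List;

// A minimal sketch of bid-based selection: each candidate resource
// reports a performance/load factor and the edge picks the best one.
// Instance names and load factors are hypothetical.
public class BidLoadBalancer {
    record Bid(String instance, double loadFactor) {} // lower is better

    static String pick(List<Bid> bids) {
        return bids.stream()
                .min(Comparator.comparingDouble(Bid::loadFactor))
                .map(Bid::instance)
                .orElseThrow(() -> new IllegalStateException("no resources bid"));
    }

    public static void main(String[] args) {
        System.out.println(pick(List.of(
                new Bid("host-a", 0.72),
                new Bid("host-b", 0.31),    // lightest load wins
                new Bid("host-c", 0.55)))); // prints "host-b"
    }
}
```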

All of this is fairly easily accomplished with at least the extended versions of SD-WAN, and it may also be available through broader data-center-and-user-office forms of SDN.  The two technologies started from different places (SDN in data center networking, SD-WAN in branch networking), but both seem to be broadening their scope to envelop a little of the other’s turf.

Where SD-WAN has the advantage, in my view, is that it’s always been considered a service-overlay solution to enterprise connectivity.  That means it’s fundamentally a logical-networking mechanism.  If logical networking is a fundamental enterprise requirement, as my surveys show it is, then SD-WAN has already accepted that requirement, at least in some implementations.  SDN is still largely focused on “networking”, meaning on connecting overlay-IP-addressed entities.

We have work today in groups like the IETF on location-independent addressing, meaning some means of separating the logical “who” from the network “where”, but most of these are still hung up on IP addresses of some type.  The SD-WAN community could take it to where it needs to be, at least with some of the “entities” being connected.  Do they have the answer to full resource-side logical networking, including load-balancing?  Not yet, and I think that’s going to be a major differentiating point very quickly, particularly as MSPs and network operators vie for differentiation in the evolving service market.

We should have realized, from the dawn of virtualization and the emergence of Nicira and multi-tenant data centers, that this sort of thing was going to be important.  A smart approach to logical or application networking could have been devised when we had plenty of time.  Now, the market is going to demand a solution long before any realistic open standards process could be expected to do more than decide on its own name and governance policies.  There are now three camps in play to create ad hoc strategies: SD-WAN vendors, MSPs, and network operators.  Whoever wins the race might gain an enormous competitive advantage, and set the tone of networking for the foreseeable future.

Could SD-WAN Trends End Up Solving Container Problems?

There were a couple of interesting pieces in the trade news yesterday, and the combination of the two may be more interesting yet.  One piece was from SDxCentral, commenting on the reported difficulties associated with container deployments; the other was from Light Reading, on the addition of zero-touch provisioning to SD-WAN.  Combined, these articles point to some trends that are further evidence of a major shift in networking.

Containers are the most important development in application deployment in recent times.  A container deployment is a kind of packaged union of the deployable, operations-centric vision of an application with the developed components themselves.  Think of it as a plug-and-play wrapper around what would otherwise be a complex and variable set of technical stuff.  That containers can make things a lot easier is a given, but it probably shouldn’t be a surprise that container adoption across the full range of deployment environments is proving more difficult than expected.

Docker is the container baseline, and yet it deploys apps in what most users would consider a trivial environment.  Swarm, the Docker extension for clustering, is a little better, but most enterprises would still find it way short of the mark.  Kubernetes, an orchestration tool that acts as a kind of over-Docker layer (also over other container systems, by the way), is a more enterprise-grade tool, but even Kubernetes has its issues, as SDxCentral’s article shows.

There are two separate things going on, container-wise.  First is the obvious fact that in containers, as in other tech things, we started at the bottom and we’re only slowly making our way up.  I think it’s clear that the problems SDxCentral points out are problems of scaling, problems that arise when the little part you start with doesn’t quite fit the glorious whole you end up with.  Like SDN and NFV, containers have growing pains that a top-down approach could have solved.

The second thing that’s going on is also the thing that ties our two articles together.  In SDN and NFV, the problem is that we have enclaves of SDN/NFV technology that scale only within limited “control domains”, and there’s no accepted, proven way to connect the domains.  In addition, the domain-specific automation doesn’t cover the full range of operations tasks, which means the operations benefits are limited.

Kubernetes does a good job of orchestrating larger resource clusters, but it lacks two critical things.  The first is a sense of business integration, integration that ties applications together because that tie is essential to the way the business operates.  Related applications often share components and are nearly always involved in integrated workflows.  A microservice could be a piece of a hundred different apps, so you can’t treat it as a part of the container deployment of any one of them.  It might scale across multiple hosts, some part of the clusters used for each of those apps, or perhaps across a cluster of its own.  The second is a sense of network integration.  You can add SDN plugins to Kubernetes, but the whole question of unified, enterprise-wide address management is then a kind of add-on rather than a full partner to container orchestration.

Container systems are about application resources and components, and in our current world of onrushing virtualization and microservices and functional computing, application boundaries are being lost and resources are all abstractions.  We don’t have application resources or components, we just have resources and components.  That second missing piece, a unified enterprise networking model, is a hard technical barrier to full realization of container benefits.

The first missing piece is a benefit risk.  The operational complexity of transformed network technologies means that if we did nothing but transform with them, we’d almost surely increase operations costs more than we could hope to reduce capex.  That’s because both SDN and NFV leave too much on the table with respect to scope, and that’s exactly what containers in particular, and DevOps tools in general, are doing too.  So if zero-touch automation (ZTA) solves the SDN/NFV problem, and if it could contribute to networking unity, why couldn’t it solve those problems for containers as well?

SD-WAN is a potential virtual CPE (vCPE) NFV application, one that can also obviously be deployed on a general-purpose server or perhaps even a P4 switch.  As such, it could naturally connect with any NFV ZTA processes.  SD-WAN is also a natural companion to SDN because SDN services could be enveloped in an SD-WAN wrapper for some or all of a network, and nobody would see that it wasn’t a legacy network service everywhere.

SD-WAN should be able to solve the container networking problem.  Like most SDN implementations, SD-WAN is an overlay technology.  It’s hosted or appliance-based, so you can put it in a container host, in the cloud, on bare metal, and in branch offices.  It can impose a logical, unified addressing scheme just as easily as SDN can, and perhaps more easily because of its higher-layer features.  If you added management hooks, you could integrate SD-WAN automation with ZTA overall, and with any SDN or NFV automation that was in place or planned.

What 128 Technology has announced is a deal with Lanner and Arrow Electronics for a complete appliance solution with zero-touch provisioning, though not full lifecycle management.  The announcement doesn’t include integration with SDN or NFV ZTA, and there are no ETSI ZTA specs yet to guide that, but most SD-WAN products have elements of self-management or “autonomous operations”, and the more higher-layer functionality a product has, the more it’s capable of being, in effect, intent-driven.  If it can be made to report on the SLA compliance of the lower-layer transport on which it’s overlaid, that’s a good start in the ZTA direction.
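As a hedged sketch of that last point (my own illustration, not 128 Technology’s implementation): an overlay edge could probe the underlay it rides on and report compliance against an SLA target, which is the kind of intent-level status a ZTA process could consume.

```python
import statistics

# Hypothetical SLA targets for the underlying transport (illustrative numbers).
SLA = {"latency_ms": 50.0, "loss_pct": 1.0}

def sla_report(latency_samples_ms, sent, lost):
    """Summarize underlay measurements as a compliant/violated status,
    the kind of intent-level signal a ZTA system could act on."""
    avg_latency = statistics.mean(latency_samples_ms)
    loss_pct = 100.0 * lost / sent
    ok = avg_latency <= SLA["latency_ms"] and loss_pct <= SLA["loss_pct"]
    return {"avg_latency_ms": round(avg_latency, 1), "loss_pct": loss_pct,
            "status": "compliant" if ok else "violated"}

print(sla_report([22.0, 31.5, 48.0], sent=1000, lost=4))
```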

Why stop there?  If a container is a step toward creating deployable units, why not add networking and higher-layer behaviors and create something for containers that’s a lot more plug-and-play?  My favorite container strategy, Apache Mesos with DC/OS, takes at least some steps toward facilitating that, and more steps could be taken by assuming that resource virtualization and application/service virtualization had to be orchestrated interdependently, while retaining independent policy controls over each.

Containers could be “packaged” too, with a combination of servers, operating systems, middleware, and networking tools.  The package approach actually makes ZTA easier because it reduces the number of variables.  You may recall that HPE, which had a key integration deal with Telefonica, lost the deal suddenly over a difference in approach.  My contacts told me (as I’d blogged at the time) that the difference was that HPE wanted to ease into the complexities of NFV integration by creating packages.

SD-WAN is a pathway to virtualizing networks, which virtualizes the binding element between applications/services and resources.  Some early SDN products (Nicira, now NSX under VMware, for example) had a broader network harmonization mission, and Nokia/Nuage supports that mission too.  SDN, though, may be a harder road to take, given that all the things that users want in “application networking” are already offered in at least some SD-WAN products.  Complete the migration from zero-touch provisioning to zero-touch automation and integrate the result with containers, and you get a potential solution to container scaling and integration—at any scale.

Does the Fact that Operators Know SDN and NFV are at Risk Mean They’ll Fix Them?

Are we finally realizing that NFV is going to fail without change?  I’ve been saying that for years, but an article in Light Reading last week suggested that operators in a conference were increasingly skeptical about both SDN and NFV.  Skepticism about the present is at least a credible driver to changes in the future.  Not everyone agrees about operator attitudes, of course; one vendor rep who also attended the session commented that he’d taken away a completely different picture.  Beauty, SDN, and NFV all appear to be in the eye of the beholder.

Let’s start by admitting that catchy headlines are a fixture in the media, and vendor denials of problems in a market they’re committed to are equally common.  The truth is that neither SDN nor NFV has met expectations, though both have had some successes.  To my mind, that says that neither technology can be considered a success overall; much of the SDN story is hype (so is much of NFV’s).  Is NFV a “faux pas” as the article suggests?  The term means a “blundering error” or “gaffe”, and I think the first definition is close to the truth.  We had a chance, with NFV, to do something revolutionary, and we messed it up unnecessarily, so that might fairly be called a blundering error.

Would operator skepticism be enough to change things for either or both technologies?  To start off, not all operators are skeptical, because not all operators saw SDN or NFV as the source of revolutionary improvements in their profit per bit.  Most operators have been targeting NFV at virtual CPE (vCPE) for business service missions, and for anyone except an operator who does little other than carrier Ethernet, that mission probably wouldn’t impact costs all that much.  For SDN, operators have never been convinced it could replace routing, and most would say their interest today is limited to SDN for data center switching.

It also depends on whether skepticism is channeled along positive lines.  I don’t mean that we should simply blow kisses at SDN and NFV because they have good intentions; people have told me I should be doing that, and I reject the notion that success can come from self-deception.  We should accept the failures, but look to the reasons for the failures only as a pathway to solving problems.  I’ve blogged about some of the basic issues and pathways to solutions for both SDN and NFV, so here I’m going to focus on whether attention to the “problems” of SDN and NFV could be enough to drive a solution.

SDN’s problem is lack of demonstrated scalability.  It works great in the data center.  It works great at the core of a transport network, married with optical technology and surrounded by BGP emulators (this is Google’s configuration).  Whether SDN could work at scale in private networks (VPNs) depends on the flavor of SDN we’re talking about.  Overlay SDN (Nokia/Nuage is the best example) can surely do that.  OpenFlow SDN needs controller federation to scale into the WAN.

NFV’s problem is also scalability, in a sense at least.  We know NFV can work for virtual CPE hosted in an agile premises box.  We don’t really know much about how it would work beyond that limited mission, and the mission of vCPE is way too narrow to justify the attention that’s been paid to NFV.  To get beyond that mission, NFV would have to address total service lifecycle orchestration, which it will never do.  We may address it outside NFV (as the NFV ISG intended from the first), but we’re not there yet.

The technical evolution of both SDN and NFV is blundering forward, but will they get to where they need to be?  The biggest problem for both, in my view, is that neither really had a specific destination.  They were technologies that asserted properties, not benefits.  A benefit is something that can be quantified, meaning measured against cost to establish a return on investment.  We heard a lot about the properties of both SDN and NFV, but not much about benefits.  Yes, I’ve seen studies sponsored by vendors; in my view, none of them would stand close financial assessment.  We’d have to do better.

There are only two kinds of savings that a network technology could assert.  One is capex or “capital expense”, meaning the cost of the equipment and software, the stuff that’s usually depreciated over a fixed cycle.  The other is opex or “operations expense”, meaning the cost of operations.  This is normally expensed in the current period.  Any technology exposes operators to both, and so any technology really has to reduce the net of the two.  Ideally, that would mean reducing both at the same time.

The baseline against which both SDN and NFV have to be measured is the current network, which consists of multiple layers of technology.  SDN and NFV really address Levels 2 and 3 (and, in the case of NFV, some things like firewalls that live above Level 3).  We have to look at what the technologies might do to both capex and opex, focusing on L2/L3/L3+, or we can simply look at how SDN and NFV would fare against other developments.

To me, the big problem for both SDN and NFV today is the open P4-modeled devices, the things that AT&T/Linux Foundation DANOS and ONF’s Stratum define.  If you built commodity switch boxes using merchant silicon and combined them with open-source routing/switching software, you’d erase as much capital cost as either SDN or NFV would.  And you’d do it at no incremental technology risk.  SDN and NFV were both estimated to save about 25% on capex.  Way back in 2013, operators told me that a 25% reduction in capex wouldn’t justify the NFV risk; “We can get that beating Huawei up on price” was their quote.  Open boxes will kill proprietary appliances more surely than SDN or NFV could.  That leaves opex.

Neither SDN nor NFV really gets into operations automation, what’s today called “zero-touch automation” (ZTA) of the service lifecycle.  Neither, then, can claim to reduce opex, and no matter how operators view SDN or NFV, they’re not going to push either technology into ZTA.  That’s the realm of what’s called “orchestration” today, and most operators think that the ONAP project offers the best hope for open-source ZTA.

Perhaps, but how much would that save?  Light Reading contributed another article that bears on the solution to SDN’s and NFV’s problems.  Huawei, the 900-pound gorilla of networking, says that automation could eliminate 90% of network operations jobs.  Most of those who’ve read my blog over time know that I’ve been saying that opex reduction is a better target for modernization than capex reduction.  In fact, capex reduction will be swamped by increased complexity-related costs if something isn’t done with zero-touch automation.  Is Huawei promising the solution?

It depends on what a “network operations job” is.  This year, operators will spend about 30 cents of every revenue dollar on “process opex”, meaning the operations costs directly associated with the network.  If Huawei was going to cut those costs by 90%, they’d save 27 cents, which is more than the operators’ total capital budgets, and it would be a revolution.  But they aren’t going to do that.  They’re talking about “network operations costs”, meaning the resources used to control and maintain the network.  That cost, this year, will run about 4.4 cents of every revenue dollar, so Huawei’s claim would mean a savings of about 4 cents.  Capex savings of 25% would reduce costs by more than that.

The point here is that opex is just as much “bulls**t” as capex unless you address the whole of process opex, which goes way beyond the network operations costs alone.  OSS/BSS has to be redone for true ZTA.  You have to refine your competitive positioning on services, the way you attract and retain customers, sell, market, advertise.  True ZTA, according to my own model, could save about 8 cents from a 2020 total of 31 cents per revenue dollar in process opex.  Guess what that is?  It’s about 25%, but 8 cents is a third of total capex, and if you add a capex reduction of 25% to it, you end up saving 13 cents of each revenue dollar, which is more than enough to make operators’ profit per bit numbers turn around for a decade or more.
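To make the arithmetic explicit, here’s a minimal worked sketch using only the figures quoted above (the implied capex contribution in the last line is my inference from the 13-cent total):

```python
# All figures in cents per revenue dollar, as quoted in the text.
netops = 4.4                      # network operations cost this year
huawei_saving = 0.90 * netops     # -> about 4 cents from the 90% claim
zta_saving = 8.0                  # true ZTA saving vs. 31 cents of process opex
zta_fraction = zta_saving / 31.0  # -> about 26%, the "about 25%" figure
capex_saving = 13.0 - zta_saving  # -> 5 cents implied by the 13-cent total
print(round(huawei_saving, 1), round(100 * zta_fraction), capex_saving)
# prints: 4.0 26 5.0
```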

The real lesson here is that 25%.  It seems like everything we look at can only impact costs, in its area of focus, by about 25%.  That means that to get to a reasonable number overall, we need a broad area of focus; we need to impact a lot of things.  The narrow focus that everyone thought we needed to move standards and specs forward quickly has resulted in a scope too small for benefits to grow to a meaningful level.  We’ve undershot the necessary benefit scope in the past, and no amount of recognition of past problems will ensure we don’t do it again.  Where the ZTA and open-box stuff differs from SDN and NFV is that it can be applied incrementally and doesn’t require a fork-lift upgrade.  It can survive and grow even when short-term mindsets prevent grabbing hold of the whole problem.

I think what operators are starting to see isn’t that we need something different in the way of network technology, or a better way of justifying the new technologies like SDN and NFV.  We could make an enormous change simply by adopting open-box technology and ZTA, and I think realization of that truth is dawning.  Not fast enough to reap the savings we could have achieved by 2020, though.  The lesson of the last half-decade is that not facing the truth immediately is very costly in the long run.

What the Heck is an Event-Driven App?

What does an event-driven app look like?  That might seem to be a silly question, but as we move toward at least some level of realization of the Internet of Things (IoT) and as serverless cloud computing services aimed at event processing proliferate, it’s a question we need to answer.  Most developers, consultants, analysts, and media types know “applications” primarily from either the business-transaction side or the Internet worldwide web side.  Events are similar to both in some ways, and very different in others.

Transaction processing and web access are similar in the way they use resources.  Both are usually supported by a semi-fixed set of resources that host application/component instances.  I’m calling these “semi-fixed” because there is a specific amount of pre-positioned capacity, which might or might not be scalable with changes in workload.  The decision to scale is explicit, in that applications/components are designed to scale and scaling is normally invoked either by the applications themselves through work scheduling, or through a separate manager that recognizes load changes, as in the sketch below.
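A minimal sketch of that explicit model (thresholds and names are illustrative, not from any particular product): a separate manager inspects per-instance load and decides whether to add or remove pre-positioned instances.

```python
def scaling_decision(instances, load_per_instance,
                     scale_up_at=0.8, scale_down_at=0.3,
                     min_instances=1, max_instances=10):
    """Explicit scaling: nothing changes unless this manager (or the
    application itself, via work scheduling) decides it should."""
    if load_per_instance > scale_up_at and instances < max_instances:
        return instances + 1
    if load_per_instance < scale_down_at and instances > min_instances:
        return instances - 1
    return instances

print(scaling_decision(3, 0.9))   # -> 4: scale out under heavy load
print(scaling_decision(3, 0.1))   # -> 2: scale in when load falls
```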

Event-driven systems are forking on this particular attribute.  On the one hand, many are written to utilize the same kind of pre-positioned assets as web/transactional apps.  Containers are a particularly strong way to host this style of event processing because they add relatively little overhead, compared with VMs, which require their own copy of the operating system even for event processes that might be a couple dozen lines of code.  On the other hand, events in the cloud have been associated with serverless “functional”, “lambda”, or “microservice” programming, where a copy of an event process is loaded only when the event it’s designed to process enters the system.

With cloud-hosted event processes, the decision to scale is implicit, because it’s presumed that a new event will spawn a new event processor element.  The cloud approach to event processing is much more dynamic, scalable, and resilient than a pre-positioned component model.  Under current pricing terms, though, it can be more expensive if the volume of events is high, because a high volume would justify dedicated, persistent resources instead.  The cloud model also brings to the fore one of the most critical issues with event-handling, which is context.

Web interactions are stateless; every HTTP event is processed for what it is and where it’s directed.  Transactional applications that involve multiple messages (query-update is an example) are typically handled by stateful processing that can keep track of where you are in the dialog.  Event processing context is harder because it can involve the timing of a given type of event, the relationship between different event types over time, and where you are in a larger process.  Most of today’s event-processing systems are based either on state/event structures (if I get an “Activate” event in the “Ready” state, then assign resources) or on complex event processing (CEP) software.

Amazon’s Step Functions service creates a state machine, which means that developers define specific “states” that a process can be in.  When an event occurs, it is processed according to the logic associated with the current state, as defined in the state machine.  This lets developers build in context by defining how conditions progress a system through specific phases (states).  Part of the processing of an event in a state can set the “current state”, so the next event will be processed according to those rules.

State/event programming is familiar to people who build protocol handlers, since virtually all of them work on state/event principles.  In the original ExperiaSphere project, which built state machines (called “Experiams”), the implementation defined specific states and events, and the presumption was that every event had to be handled (even if the handling was to set an error termination) in every state.  This illustrates the challenges of developing state/event logic in any form; you have to understand the relationship between states and events and define the progressions that each event in each state would trigger.  Most people use a diagram consisting of ovals (representing states) and arrows (representing events that generate a transition to another state) to keep track of the progressions.

One of the most challenging elements of any event-driven application is the notion of time.  There are two separate time-related aspects to event handling.  The first is the chronology aspect; events are significant in time, meaning that their precise sequence and interval is normally important.  That means that most systems have time-stamp facilities to carry event timing information to processes.  Transactional systems may also provide timing information, but it doesn’t need to be precise or synchronized in most cases; event-driven systems may need to synchronize all event sources to a common clock.  The second is the duration aspect.  States aren’t black holes; a good state/event progression will ensure that the system can’t simply stall, waiting forever for something to happen.  This means that a timeout event is common: an event that signals that something expected has not arrived and no more waiting will be allowed.
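Here’s a compact sketch of the state/event pattern these paragraphs describe (my own illustration in the spirit of the approach, not actual Experiam or Step Functions code): every state/event pair maps to an explicit handler, anything unlisted falls through to an error, and a timeout arrives as just another event.

```python
def assign_resources(ctx):
    ctx["state"] = "Active"

def release_resources(ctx):
    ctx["state"] = "Ready"

def fail(ctx):
    ctx["state"] = "Error"

# Every (state, event) pair has an explicit handler; unlisted pairs are errors.
TRANSITIONS = {
    ("Ready", "Activate"): assign_resources,
    ("Active", "Deactivate"): release_resources,
    ("Active", "Timeout"): fail,   # timeouts are delivered as ordinary events
}

def handle(ctx, event):
    # Process the event according to the logic of the current state.
    TRANSITIONS.get((ctx["state"], event), fail)(ctx)
    return ctx["state"]

ctx = {"state": "Ready"}
print(handle(ctx, "Activate"))   # -> Active
print(handle(ctx, "Timeout"))    # -> Error: the system refuses to stall
```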

No matter where you decide to run an event-driven app, this issue of states, events, and timing will be right there with you.  It’s the fundamental difference between event-driven programs and other kinds of applications, which makes it the hardest thing for new event-programming teams to handle.  The way that state/event progression is addressed may vary between hosted container or component implementations and “lambda” or functional-cloud implementations, but the principles are exactly the same.

CEP is a model that can be related to state/event implementations for event-driven apps.  With CEP, there’s a kind of correlation front-end that accepts policies or definitions of the things that constitute relevant “raw” event sequences, and from them generates process triggers.  In theory, a CEP front-end could eliminate the need for explicit state/event programming, but users of CEP say that most of their implementations of event-driven apps use CEP only to summarize raw events, to cut down on the complexity of later state/event processes.
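A hedged sketch of that front-end role (illustrative only): a correlator watches raw events inside a time window and emits one summarized trigger, so the downstream state/event logic sees a single event instead of dozens.

```python
from collections import deque

class Correlator:
    """Toy CEP front-end: if `threshold` raw events arrive within `window`
    seconds, emit a single summarized trigger downstream."""

    def __init__(self, threshold=5, window=10.0):
        self.threshold = threshold
        self.window = window
        self.recent = deque()   # timestamps of recent raw events

    def raw_event(self, timestamp):
        self.recent.append(timestamp)
        # Drop anything that has aged out of the correlation window.
        while self.recent and timestamp - self.recent[0] > self.window:
            self.recent.popleft()
        if len(self.recent) >= self.threshold:
            self.recent.clear()
            return "SUMMARY_TRIGGER"   # one trigger instead of many raw events
        return None

c = Correlator(threshold=3, window=5.0)
for t in (0.0, 1.0, 2.0):
    trigger = c.raw_event(t)
print(trigger)   # -> SUMMARY_TRIGGER on the third raw event
```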

This opens the last of the issues of event-driven apps, which is event distribution.  Realistic event-driven systems tend to consist of four classes of components: event generators, which actually source the primary events in a system (sensors, for example); event distributors, which take a single primary event and distribute it to multiple parties based on some publish/subscribe or bus process; event processors, which actually receive and “handle” events; and, in some cases, process controllers, which event processors invoke to take real-world action.  All of this presumes some mechanism for event distribution, for which the event distributor processes are the key.

Distributors would normally have a fixed association with an event generator or a series of related generators, which means that it’s likely they have a dedicated connection of some sort.  You don’t want to have event generators, which are the devices likely to be most numerous in an event-driven system, handling a lot of processing or supporting a lot of direct connections—it raises costs and security concerns.  The distributors would define both a mechanism for knowing about the event processors that wanted a given event, and the connectivity framework that supported the distribution itself.
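A minimal sketch of a distributor (my own illustration): it keeps a subscription table per event type and fans one primary event out to every registered processor, which is exactly the path where added latency hurts, as the next paragraph explains.

```python
from collections import defaultdict

class EventDistributor:
    """Toy publish/subscribe distributor: one primary event in, N copies out."""

    def __init__(self):
        self.subscribers = defaultdict(list)   # event type -> processor list

    def subscribe(self, event_type, processor):
        # Event processors register for the event types they want.
        self.subscribers[event_type].append(processor)

    def distribute(self, event_type, payload):
        # The fan-out path must stay cheap; distributor latency adds
        # directly to the length of the control loop.
        for processor in self.subscribers[event_type]:
            processor(payload)

d = EventDistributor()
d.subscribe("temp_high", lambda e: print("alarm:", e))
d.subscribe("temp_high", lambda e: print("log:", e))
d.distribute("temp_high", {"sensor": "s1", "value": 104})
```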

The overhead associated with an event distributor is important, and for two reasons.  First, many instances of the distributor are likely to be deployed, so you don’t want the process to be expensive to run.  Second, the distributor is in the primary path of events, and any latency introduced by the distributor will add to the “length of the control loop”, which is a way of describing the time interval between the generation of an event and the completion of the event processing.  It’s fair to say that the success of event-driven apps is linked explicitly to the effectiveness of event distributors.

This is why I believe that the focus of IoT on connecting event generators via cellular wireless is so illogical.  Not only should we be looking primarily at local low-cost connectivity for sensors (which is already the practice in the real world), we should be looking not at the service of connecting sensors but at the service of distributing events.  A network operator or cloud provider who had the right combination of edge computing and efficient event distribution could be the true, big, winner in the IoT race.

So why aren’t we seeing examples of this?  A big part of the reason, I think, is that in the financial markets of today, only the current quarter counts.  Things like 5G sensor connection are easy for the Street to understand, and easy to see as near-term revenue generators.  Something like an edge-computing-and-event-distribution deployment looks like an enormous source of “first cost”, the dip in cash flow that accompanies a service that needs major infrastructure deployment before it generates any compensating revenue.  Another part is that the network operators, the most likely long-term players in an event-distribution market, aren’t software types; they don’t see the need, or the architectural steps involved in fulfilling it.

I’m of the view that without some explicit exploitation of the event distribution opportunity, there will be no meaningful IoT beyond the simple orderly growth of the same sort of private applications of event processing and process control that we’ve seen for decades.  With a good strategy in place, IoT will meet many (perhaps, optimistically, even most) of the high expectations the market has set for it.  And somebody, or a small group of companies, will make some big bucks.