The Relationship Between 5G and Edge Fiber

According to a recent report from Deloitte, “deep fiber” is critical to support the evolution to 5G.  There’s truth in this view—I think it’s clear that fiber is critical to 5G success.  The questions are whether “deep fiber” is the kind of fiber that’s critical, whether fiber is a sufficient guarantee of 5G, and what might get both fiber and 5G to where they need to be.

The first of the questions is two-part: is the term correct, and is fiber critical?  The first part is easy to answer based on the report itself.  Deloitte departs from normal practice by saying that “deep” means “toward the edge” when normally it’s taken to mean “toward the core”.  If we substitute some terms, we could paraphrase the report by saying that “the United States is not as well prepared to take full advantage of the potential [of 5G], lacking needed fiber infrastructure close to the end customers, at the network’s edge.”  The second part, I think, is also easy.

Anything that proposes to increase access bandwidth significantly is going to have to rely on fiber closer to the network edge.  One of the elements of 5G that operators tell me is most interesting to them is the notion that 5G could be used in wireline deployments as a kind of tail connection to fiber-to-the-node systems.  In that role, it would replace copper/DSL.  Whatever happens to 5G overall, I am hearing that this element is going to move forward.

With a credible high-performance tail connection, FTTN deployment becomes a lot more sensible, and that would of course drive additional fiber deployment.  However, fiber to the prem (FTTP, usually in passive-optical network or PON form) is arguably the logical strategy for deployment of consumer and business broadband to any area where CATV cable is not already in place.  Even in some CATV-equipped areas, FTTH/PON might be required for competitive reasons (or not, as we’ll see).  Thus, edge fiber doesn’t depend on 5G as a driver, though unquestionably it would benefit.

However, fiber at the edge is a necessary condition for 5G.  Is it a sufficient condition, meaning that nothing else matters?  Probably yes in the limited sense of 5G as a tail circuit for FTTN, but not for all the rest.  In fact, it’s not clear whether 5G is really the driver here or just radio tails to FTTN.  It’s the fact that operators associate that mission with 5G that makes it a 5G mission, not the technical requirements.  That’s why the rest of 5G isn’t pulled through by the FTTN tail mission, and why we still need broader 5G drivers for the rest.

If all this is true (which I think it is), then it’s really the need to deploy more edge bandwidth—mobile and wireline—that’s the driver for more fiber.  Is that the only driver?  I don’t think so.  At the same time as we see an increased need for edge bandwidth, we also see a growing need for the deployers of that bandwidth to monetize what they’re doing.  That’s where carrier cloud, edge computing, and process interconnection come along—all topics of recent blogs.

Access deployment is dominated by consumer broadband.  Consumer broadband is dominated by asymmetrical bandwidth needs—more bandwidth downstream toward the user than upstream in the other direction.  Process interconnection tends to be symmetrical in terms of its requirements, and because latency in a process connection impacts QoE broadly, it’s more important to avoid it.  I think process interconnection will be a more significant force in fiber edge deployment than consumer broadband, and certainly both will be more than enough to drive a lot of new fiber at the network’s edge or very near to it—in the process edge.

The main point of the Deloitte report is one of the main report headings: “The current wireline industry construct does not incent enough fiber deployment.”  There, they have a point.  I’ve blogged a lot about the declining profit-per-bit problem, which means there’s a problem with return on infrastructure investment.  Even if you don’t believe my arguments, it’s hard to argue with the fact that network equipment vendors (except price-leader Huawei) have been facing difficult quarters revenue-wise.

Could opex savings from infrastructure modernization help?  The report notes that operations expenses are typically five to six times capex; this aligns fairly well with my surveys that show that on average operators are spending about 18 cents of every revenue dollar on capex, another 18 cents is returned as profits, and the remainder goes to operations and administration.  They suggest that modernization of legacy TDM networks, which are expensive to operate, has a lot to do with that.  My surveys don’t bear this out; only about 30 cents of every revenue dollar are associated with “process opex”, meaning the cost of network operations and network-related administrative costs, and only about four and a half cents are pure network operations.  A TDM-to-packet transformation would therefore not impact much of the total OAM&P costs at all.

I’m a believer in reducing opex, and if you looked at the total process opex pie (30 cents per revenue dollar) and could reduce it by about half (which operators say is credible) you’d almost equal the total elimination of capex in terms of bottom-line impact.  The problem is that most of the savings come from service-level automation, not from improving network technology.  As a fiber driver, I don’t think modernizing out of TDM cuts the mustard.
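
To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python using the rounded figures cited above (roughly 30 cents of process opex and roughly 18 cents of capex per revenue dollar); the numbers are illustrative survey averages, not precise accounting.

```python
# Back-of-the-envelope comparison, per dollar of operator revenue.
# Figures are the rounded survey numbers cited in the text.
process_opex = 0.30   # process opex: network operations plus network-related admin
capex = 0.18          # capital spending

# Operators say a roughly 50% reduction in process opex is credible
# through service-level automation.
opex_savings = process_opex * 0.5   # = 0.15 per revenue dollar

print(f"Opex savings from automation: {opex_savings:.2f} per revenue dollar")
print(f"Eliminating capex entirely:   {capex:.2f} per revenue dollar")
# 0.15 versus 0.18: the automation savings nearly match wiping out capex.
```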

Regulatory policy may hold the answer, according to the report, and I agree at the high level but disagree on what policies might help.  The report talks about fairly esoteric measures like encouraging cross-service deployments.  In a related section, it proposes improving monetization by encouraging OTT partnerships or even joint ownership of OTTs.  I think the answer is a combination of those points.  If you want operators to deploy more fiber, you make it profitable to do so.  If you want to make it profitable, you wring out operations costs through service-layer lifecycle automation, and you eliminate barriers to Internet settlement for QoS and traffic handling.

I think there’s a lot of good stuff in the report, but I also think it misses a major truth.  Any large-scale change in network infrastructure is going to require large-scale benefits to justify it.  That’s true of edge fiber and it’s just as true of 5G.  We are forever past the phase of networking where a technology change can be seen as self-justifying simply because it’s the “next generation”.

Fiber at/near the edge is, I think, a given because there are plenty of things that are driving it, and those things in turn have clear benefits.  5G is still proving itself on a broad scale, and it’s likely that its fusion with FTTN is going to be essential in making it a success.

How ONAP Could Transform Networking–or Not

A story that starts with the statement that a technology is “entering a new phase” is always interesting.  It’s not always compelling, or even true, but it’s at least interesting.  In the case of Network Functions Virtualization (NFV) the story is interesting, and it’s even true.  It remains to be seen whether it’s compelling.

NFV was launched as a specification group, a form of an international standards process.  Within ETSI the NFV ISG has done some good stuff, but it started its life as a many-faceted balancing act and that’s limited its utility.  Think operators versus vendors.  Think “standards” versus “software”.  Think “schedule” versus “scope”.  I’ve blogged on these points in the past, so there’s no need to repeat the points now.  What matters is the “new phase”.

Which is open-source software.  NFV had launched a number of open-source initiatives from the ISG work, but what has generated the new phase is the merger of one of these (the Open-O initiative) with AT&T’s ECOMP.  ECOMP mingles the AT&T initiatives toward an open and vendor-independent infrastructure with some insights into SDN, NFV, and even OSS/BSS.  The result is a software platform that is designed to do most of the service lifecycle management automation that we have needed from the first and were not getting through the “normal” market processes.

ECOMP, it’s clear now, is intended to be not only what the acronym suggests (“Enhanced Control, Orchestration, Management & Policy”) but more what the title of the new merged (under the Linux Foundation) initiative suggests, an Open Network Automation Platform.  I like this name because it seizes on the real goals of openness and automation.

I also like AT&T’s focusing of its venture arm on building new stuff on top of ONAP, and AT&T’s confidence in the resulting ecosystem.  In an article by Carol Wilson in Light Reading, Igal Elbaz, VP of Ecosystem and Innovation for AT&T Services says, “We believe [ONAP] is going to be the network operating system for the majority of the network operators out there.  If you build anything on top of our network from a services perspective, obviously you want to build on top of ONAP. But many operators are adopting all of a sudden this solution so you can create a ubiquitous solution that can touch a large number of customers and end users around the world.”

It’s that last notion that catapults NFV into its new age.  Some network operators, through support for open-source initiatives, have defined the glue that holds future network infrastructure and future services together.  Some of this involves virtual functions; more probably will involve some form of software-defined networking.  All of it could totally change the dynamic of both SDN and NFV, by creating an open model for the future network.  If ONAP can create it, of course.

The comment by AT&T’s Elbaz raises the most obvious question, which is that of general adoption of ONAP by network operators.  There is certainly widespread interest in ONAP; of the 54 operators I know to have active transformation projects underway, ONAP is considered a “candidate” for use by 25 of them.  That’s not Elbaz’s majority of operators, but it’s a darn good start.  I think we can assume that ONAP can reach the more-than-half goal, and likely surpass it.  It might well garner two-thirds to three-quarters of operators, in my view.

A related question is vendor support.  Obviously if a majority of network operators adopted ONAP, vendors would fall into line even if they were in tears as they stood there, which many might well be.  However, the only alternative to supporting ONAP would be rolling your own total service automation solution, which vendors have obviously not been lining up to do since NFV came along.  Would they change their tune now, with a competing open-source solution developed by and accepted by operators?  I don’t think so, and so I think that once ONAP really gets where it needs to be, it kills off not only other vendor options but any competing open strategies as well.

Where does ONAP need to get to, though?  I think the best answer to that is “where Google already is with their Google Cloud Platform”.  The good news for the ONAP folks is that Google has been totally open about GCP details, and has open-sourced much or even most of the key pieces.  The bad news is that GCP is a very different take on the network of the future, a take that first and foremost is not what launched ECOMP and ONAP, or even what launched NFV.  It may be very challenging to fit ONAP into a GCP model now.  Particularly given that GCP’s founding principle is that networks are just the plumbing that can mess up the good stuff.

Google’s network, as I’ve noted before, was designed to connect processes that are in turn composed to create experiences/applications.  Operators today are struggling to make business sense of pushing bits between endpoints in an age of declining revenue per bit.  Google never saw that as a business.  In fact, Google’s approach to “virtual private clouds” is to pull more and more cloud traffic onto Google’s network, to take even the traffic that’s today associated with inter-cloud connectivity off the Internet or an external VPN.  You could make a strong case for the notion that Google views public networking as nothing more than the access on-ramp that gets you to the closest Google peering point.

Google’s relationship with the Internet is something like this: everything Google does rides on Google’s network and is delivered to a peering point near the user.  GCP carries this model to cloud computing services.  Google also takes a lot of time managing the address spaces of cloud services and its own features.  User data planes are independent SDN networks, each having its own private (RFC 1918) address space.  Processes can also be associated with public IP addresses if they have to be exposed for interconnection.

Nothing of this sort is visible in the ECOMP/ONAP material, but it’s also true that nothing in the material would preclude following the model.  The big question is whether the bias of the ECOMP/ONAP model or architecture has framed the software in an inefficient way.  Google has planned everything around process hosting.  If process hosting is the way of the future, then NFV has to be done that way too.

The SDN and NFV initiatives started off as traditional standards-like processes.  It’s understandable that these kinds of activities would then not reflect the insight that software architects would bring—and did bring to Google, IMHO.  Now, with ONAP, we have another pathway to SDN and NFV, a path that takes us through software rather than through standards.  That doesn’t guarantee that we’ll see a successful implementation, but it does raise the prospects considerably.

We also have to look ahead to 5G, which as a traditional standards process has made the same sort of bottom-up mistakes that were made by those processes in the past.  We have a lot of statements about the symbiosis between 5G and SDN and NFV.  I’ve read through the work on 5G so far and while I can see how SDN or NFV might be used, I don’t see clear service opportunities or infrastructure efficiency benefits that are linked to any of those applications.  The big question might turn out to be whether AT&T or the others involved in ONAP can create a credible link between their work and credible 5G drivers.  Same with IoT.

A software revolution in networking is clearly indicated.  Nothing we know about future network services or infrastructure evolution harkens back to device connections and bit-pushing for differentiation.  We may be on the path for that software revolution—finally.  That would be good news if true, and it would be a disaster for the industry if it’s not.

The Economics Shaping Edge Computing

If event-handling and process hosting are the way of the near future, then (as I suggested last week) we would likely shift a lot of hosting and software development off traditional server platforms.  There are technical considerations here, of course (and I noted the key ones last week), but the primary issue is financial.

Event processing depends on two dimensions of event characteristics.  One is how often the event happens, and the second is how “valuable” in a commercial sense the event is.  Obviously both these characteristics are graded and not polar, but it’s helpful to think of the extremes, which in this case would mean event frequencies ranging from infrequent to very often, and event values ranging from negligible to significant.  The four combinations present great opportunities on one extreme, and nothing but cost and risk on the other.

Let’s start by looking at something we can dismiss quickly.  Low-value events that rarely happen probably don’t justify any event processing at all.  “An airplane flew across the face of the moon, relative to an observer in Iceland” might be one.  If you want to get IoT funding for that one, good luck!

Now let’s search for some relevant insight, starting with the most extreme on the negative side, very frequent events of little value.  A good example would be “cars passing an intersection”.  OK, you probably can think of things that could be done with this information, but think about a place like New York City and the number of cars that pass each of its intersections in the course of a single busy hour, and you have a formula for minimal ROI.

Where we have this situation, the goal is to prevent the “event” from ever getting out of local processing in discrete event form.  This would mean two things: that you want to use a cheap local technology to collect the events, and that you want to “summarize” the events in some way to reduce their frequency.  Cars per minute?  In a busy intersection that would probably reduce the events by several orders of magnitude.

Logically, the way to do this would be to have “agents” associated with groups of event sources.  The agents would use some cheap technology to collect events, and then a second cheap technology to summarize them in some way.  The agents would generate summary events (“Car Count for Intersection 5th and 42nd, 10:40 AM”).  If we needed only summaries based on time, you could do something like this with a custom chip, at a very low cost.
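
As a rough illustration of that aggregation pattern, here is a minimal Python sketch of a summarizing agent.  The class name, event fields, and one-minute window are hypothetical, chosen only to show how raw events collapse into periodic summary events; a real deployment might implement this in firmware or a custom chip as noted above.

```python
import time

class CarCounter:
    """Hypothetical edge agent: counts raw 'car passed' events and emits
    one summary event per interval instead of forwarding every event."""

    def __init__(self, intersection, interval_seconds=60, publish=print):
        self.intersection = intersection
        self.interval = interval_seconds
        self.publish = publish            # where summary events go (e.g., a queue)
        self.count = 0
        self.window_start = time.time()

    def on_car_detected(self):
        """Called for every raw sensor event; cheap, local, no network hop."""
        self.count += 1
        if time.time() - self.window_start >= self.interval:
            self.flush()

    def flush(self):
        """Emit a single summary event and reset the counting window."""
        self.publish({
            "type": "car_count_summary",
            "intersection": self.intersection,
            "count": self.count,
            "window_start": self.window_start,
        })
        self.count = 0
        self.window_start = time.time()

# A busy intersection might generate thousands of raw events per hour;
# one summary per minute cuts the event volume by orders of magnitude.
agent = CarCounter("5th and 42nd")
for _ in range(250):
    agent.on_car_detected()
agent.flush()
```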

Something else would consume these summary events, of course, and since there are fewer such events and they’d not likely require very short response times, you could place these other processes deeper in the network.  In addition, it’s likely that the summary processes would be doing some kind of analytics, changing the nature of the process from strictly event-handling to something more “transactional”.  Keep that in mind!

At the other extreme?  That’s easy too—think of system failure or critical condition alerts.  There’s water in a tunnel, there’s smoke in a warehouse.  These happen rarely (hopefully) but when they do they could represent human lives and millions of dollars.  Not only that, each of these events (largely because of their value) could logically be seen as having a wide scope of interest and potential to trigger a lot of other processes.  Fire in a warehouse?  You might have to dispatch engines, but also activate an emergency plan for traffic lights, extending far outside the area of the fire or the path of the engines, to ensure emergency vehicles associated with other incidents could still move.

This event type might well create what could be called a “tree” or “cascade”, the opposite of the aggregation that happened in our car-and-intersection example.    We’d want to provide a multicast mechanism, or publish-and-subscribe, to distribute this kind of event.  Each of the secondary recipients (after our primary processing) would then handle the event in a way appropriate to their interests.
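
A minimal sketch of that cascade, assuming a simple in-process publish-and-subscribe broker; the topic name and the handlers are hypothetical stand-ins for the dispatch, traffic-light, and other processes described above.

```python
from collections import defaultdict

class EventBroker:
    """Toy publish-and-subscribe broker: one high-value event fans out
    to every process that registered an interest in it."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)   # each recipient handles the event its own way

broker = EventBroker()
broker.subscribe("warehouse.fire", lambda e: print("Dispatch engines to", e["site"]))
broker.subscribe("warehouse.fire", lambda e: print("Activate emergency light plan near", e["site"]))
broker.subscribe("warehouse.fire", lambda e: print("Log incident and notify insurer for", e["site"]))

broker.publish("warehouse.fire", {"site": "Warehouse 12", "severity": "critical"})
```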

The examples of dispatching engines or implementing an emergency light plan show that these events might well need fast response times, but there are other cascaded processes that could be more tolerant of delay.  I think it’s likely that many processes will require edge hosting, while others could tolerate deeper hosting.  All, however, are likely to be local in scope, since an emergency condition can’t be handled easily at a distance.  This high-value, limited-frequency stuff is thus a primary edge driver.

Now the in-between.  High-value events that happen more frequently would likely be traditional commercial transactions.  Think ATMs, mobile purchases, bank tellers, store register systems, automatic toll collection, and so forth.  Another class of high-value events would be contextual events associated with a social or location service, including and perhaps especially advertising.

Commercial transactions, huh?  That’s a theme common to the high-value-low-frequency stuff we covered earlier.  I think we can safely say that event-handling in general will have two distinct pieces—a front-end part where we’re either summarizing or cascading events and applying limited response processing to highly time-sensitive conditions, and a longer-cycle transactional back-end.

The cloud computing players like Amazon, Google, and Microsoft all see this combination-structure for event software, and all in fact show it clearly in some applications.  The front-end parts of the processes are “serverless” meaning that they’re designed to be spawned at need where needed, rather than assigned to a specific place.  That requirement, and the requirement that they be spawned quickly to be responsive, means that you have to avoid complicated new connection structures, access to distant resources, etc.
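
Here is a minimal sketch of that two-piece structure, assuming a stateless front-end handler and a queue feeding a slower transactional back-end.  The function names are illustrative and not tied to any specific cloud provider's serverless API.

```python
import queue

back_end_queue = queue.Queue()   # stands in for a durable queue to the transactional tier

def front_end_handler(event):
    """Stateless, 'serverless'-style function: spawned per event, does only
    fast triage, and avoids new connections and distant resources."""
    if event.get("severity") == "critical":
        return {"action": "alert", "target": event["site"]}   # immediate response
    back_end_queue.put(event)                                 # defer everything else
    return {"action": "queued"}

def back_end_worker():
    """Longer-cycle transactional processing, hosted deeper in the network."""
    while not back_end_queue.empty():
        event = back_end_queue.get()
        print("Recording and analyzing:", event)

print(front_end_handler({"severity": "critical", "site": "Tunnel 3"}))
print(front_end_handler({"severity": "info", "site": "Intersection 12"}))
back_end_worker()
```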

All of this shows that the heavy lifting in justifying edge computing would have to be done by the relatively valuable and infrequent events, and secondarily by perhaps expanding the traditional notion of “commercial” transactions to include social/location/advertising services.  It’s much harder to deploy and connect edge resources based on aggregations of low-value high-frequency stuff because you’d probably have to find a lot of applications and users to justify the deployment, and it’s hard to see how that happens without anything deployed.

It also shows that the cloud computing players like the three I already named have an advantage over the telcos and cablecos, even though the latter two have more edge-located real estate to leverage.  The software structure of event-handling is well-known to those cloud guys.  Google, at least, has explicitly evolved its own network to be efficient at process linking.  Amazon, of course, is addressing its own limited edge asset position through Greengrass, which lets users host event processes outside of (forward of) the cloud entirely.  Knowing what to do, not just where to do it, could be the deciding factor in the cloud.

Operators, for their part, have a tendency to see software the way they see devices.  If you look at SDN or (in particular) NFV design, you see the kind of monolithic boxes lined with APIs that are the exact opposite of the design for optimum event-handling systems.  Arguably a big reason for this is that operators tend to think of software elements that replace boxes—virtual functions replace physical functions.  But if the future is things like IoT, then it’s about events, not data flows, and that’s where the operators need to do some thinking outside the box.

Where Market Drivers Might Take Fog/Edge Architecture

If fog or edge computing is the future, then what kind of future is it?  We have a tendency to think of distributed compute power or cloud computing as being a form of traditional computing, based on traditional hardware and software.  Is that the case?  If not, what model or models might really emerge?

The “edge of the future” will likely depend on the balance of a couple of forces.  One is the early driver force, the thing that will likely put the first mass of edge computing nodes into place.  The other is the shape of the long-term opportunity, meaning the application mix that will represent the future.  I’ve looked at the drivers of carrier cloud, and blogged about some aspects of them and their combined impact on deployment, and I propose to use some of the data to frame these forces and look at how the likely carrier cloud evolution really does shape the edge.

There are six drivers to carrier cloud according to my surveys of operators.  We have NFV/vCPE, personalization of advertising and video, 5G and mobile, network operator cloud services, contextual services, and IoT.  One of the six—network operator cloud services—could actually be expected to mandate conventional server technology.  Another, NFV/vCPE, doesn’t seem to favor any single platform option, and the remainder have specific architecture goals that current servers might not fit optimally.

NFV/vCPE is about hosting what are largely security-and-VPN-related elements of traditional network appliances, and the software would presumably come from the providers of those kinds of devices.  An appliance or device uses what’s often called “embedded-control” software, and it usually runs on a hardware platform that doesn’t have many of the features of general-purpose computing.  In fact, the CPU chips usually put into servers would be overkill.

OK, but the problem is that diversity.  No appliance vendor will port to the architecture of a competitor, so the multiplicity of players could well foster a movement to pick a neutral platform.  Since hosting network functions on general-purpose computers was already being done, the logical platform would be the standard x86 server architecture, and perhaps a Linux distro.

Personalized advertising and video is a harder driver to assess in platform terms, in no small part because cable companies, telcos in the streaming business, and OTT streaming providers have their own “set-top-box-like” platform, and ISPs have CDNs.  Currently, both STB and CDN applications tend to run on servers or at least on something very server-like.  If the personalization and socialization of video (which is the functional driver of change in this space) doesn’t change the technical requirements significantly, then we could expect this driver to promote traditional server architectures too.

It might change, of course.  Personalization can be visualized as a refinement of current video selection practices, practices that are outside the network content delivery process.  Socialization, on the other hand, relies not on individual selection but on collective, even collaborative, behavior.  That shifts the focus toward event-driven processing, since the “social elements” in video socialization (the people) are asynchronous with regard to each other until they collect to make a decision on viewing or make a recommendation to support such a decision.  Current video content delivery is a very web-server-like process, but social event handling is a new problem to solve.

Advertising and ad delivery is similar, and the technology is also similar.  A URL representing a place for an ad will invoke a process that links the page containing it with a specific content element, based on a bunch of information known about the user.  This isn’t far from delivering a video cache point address based on a bunch of information about the player and the network.  Refining the process part of this might do little to change the technology requirements from web server to something else, but again there’s the issue of socialization.  How could we believe that the current process of targeting users won’t evolve to the process of targeting symbiotic communities of users?  If friends are an important part of marketing, then could marketing to friends justify collective delivery of ads?  Could we even promote “ad ecosystems” that would let one product type benefit from group acceptance of another?

Event-driven social processing in advertising and content would reinforce the fact that both the contextualization and IoT trends are explicitly event-centric in their requirements.  In fact, you could argue that all of these drivers are linked in terms of their dependence on events, and also in likely optimum implementation strategies.  Sensible markets might therefore consider them a single driver, a single trend that acts to shift the edge further from traditional compute models than anything else does.

Contextualization and IoT are related in other ways.  “Contextualization” to me means the introduction of the perceptive framework that surrounds us all into the way that we manage service delivery.  We are inherently contextual beings—we use context to interpret things, including questions others ask us.  If context is our subjective perceptive framework, then IoT could well be the objective resource we use to detect it.  Where we are, what we see and hear, and even a bit of what we feel can be obtained via IoT, and contextualization and IoT are obviously event-driven.

What we’re left with is 5G and mobile, and that’s not a surprise because it’s perhaps the most different of all the drivers.  5G is not the same kind of demand driver the others are; it’s a kind of “belief driver.”  If network operators believe that mobile service evolution is their primary capex driver for the future, and if 5G articulates the direction they expect mobile service evolution to take, then there will be momentum generated even if operator beliefs about 5G are incorrect…and some are.

5G is a curious blend of the old and the new.  The notion that major advances in network services have to be supported by massive initiatives that “standardize” the future is definitely an old concept, and at some levels I think it’s been clearly disproved by events.  On the other hand, 5G is a vehicle for the introduction of technical responses to currently visible trends.  In many ways, it’s a carrier for the transmission of ideas into realization that’s capable of moving ahead of tangible benefits.  If 5G “promotes” something that another of our drivers might later actually justify, then 5G could make things happen faster and at a larger scale.

The sum of all these parts seems to lack conviction at this point.  I see the future of the edge as depending on the pace at which event-driven processing is adopted, which is the sum of the personalization and contextualization applications.  If we see rapid adoption, then I think we’ll see edge computing take on a separate hardware identity, less likely to be dependent on the x86 model.  If not, then lack of a convincing direction will probably take deployment down a line of least resistance, which would be the current server platform architecture.

My carrier cloud model says that advertising and video personalization will drive the carrier cloud through 2020, and that 5G will then come along.  From an edge-platform-architecture perspective, neither of the early drivers seems to create a specific architecture for the edge, which would seem to promote the default.  That’s great for Intel and also likely great for players like Dell and HPE, but it’s a threat to other chip vendors and those who would like to see the edge be a truly different kind of computing.

I think it could still be.  Carrier cloud may not be committed to an event-driven edge yet, but it seems like the public cloud providers are.  Amazon, Google, and Microsoft have all launched edge computing models, and though all are still based on traditional servers, Amazon’s Greengrass shows that event edges could be simpler, different.  These three may again have to show the carriers where the action is.

Is the New Network Vendor Business Model Really Going to be Subscriptions/Services?

Wall Street has been framing Cisco’s recent technology announcements as less technology than business.  The thesis is that Cisco sees the future as being revenues from services and a subscription model, rather than from hardware.  This is a response to industry efforts to create commodity hardware platforms and to the growing tendency of buyers to keep their gear longer, slowing revenue from refresh cycles.

I’ve blogged before about Cisco’s “intent modeling” initiative, based largely on policies.  I don’t intend to reprise the technical issues here, but rather to look at the principal question the Street is raising, which is whether a service-and-subscription model can lift Cisco or any other networking company out of the current doldrums.

We have to start with what got us into the current mess.  Commodity hardware, of course.  Operators and enterprises continue to struggle to justify more network spending when there are few or no new benefits to justify it.  That struggle manifests in putting discount pressure on vendors, switching to price leaders (Huawei, notably) and delaying hardware upgrades.

There is no question that software subscription and services are a viable response to commodity hardware, providing you can meet two requirements.  First, your new subscription software can’t accelerate the replacement of the equipment you’re selling.  That would accept profit and company shrinkage as inevitable.  Second, the revenue gains from success in subscription software and services have to be significant; enough to make a difference in the bottom line.

The response to the first challenge is obvious; you target somewhere other than the data plane.  White boxes, hosted switch/router instances, SDN, and even much of NFV are valuable in large degree because they reduce capital spending on connectivity and network features.  You could avoid that by presuming that the current data plane isn’t dramatically changed.

Which brings us to the second requirement.  The challenge for a company that wants to make money on subscriptions and services is that it’s hard to see how either builds a ginormous revenue opportunity.  Is policy control of network behavior as valuable, or more valuable, than networks themselves?  If not, then even if equipment vendors were willing to risk their hardware business, any significant drop in hardware sales would likely swamp the opportunity in services and subscriptions.

What has proved tricky in framing a subscription-and-services model is creating truly credible and independent value above the data plane.  It’s ironic to me that the formula for achieving that was introduced with SDN and NFV, and it wasn’t developed (by the standards people or vendors) as specs evolved.

Orchestration, as NFV pioneered the concept, is the creation of service features through model-driven deployment.  While NFV presumed that the features were to be hosted replacements for traditional network devices, most in the NFV ISG quickly realized that it would be nearly impossible to attain any credible operations agility or efficiency if you couldn’t control legacy elements as well.

If you took network and service features as they are implemented today, and enveloped them in an “intent model” that abstracted their functional capabilities and interfaces from the implementation part, you could apply those models to services today, with the infrastructure we already have deployed.  Service lifecycle management and its automation depends on effective modeling, and if you have that the service agility and operations efficiency benefits are largely achieved without further infrastructure change.
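
As a rough sketch of that enveloping idea, assume an abstract intent interface that exposes only functional capabilities, with interchangeable realizations behind it; the class names and methods here are hypothetical.

```python
from abc import ABC, abstractmethod

class VPNIntent(ABC):
    """Outside view: what the service element does, not how it is built."""

    @abstractmethod
    def deploy(self, endpoints):
        ...

    @abstractmethod
    def assure(self):
        """Report whether the stated intent is still being met."""
        ...

class LegacyMPLSVPN(VPNIntent):
    """Realization over today's device-driven network (e.g., via an existing EMS/NMS)."""
    def deploy(self, endpoints):
        print("Provisioning MPLS VPN across", endpoints)
    def assure(self):
        return "OK"

class HostedVPNFunction(VPNIntent):
    """Realization as hosted virtual functions on servers."""
    def deploy(self, endpoints):
        print("Spinning up virtual routers for", endpoints)
    def assure(self):
        return "OK"

def build_service(vpn: VPNIntent, endpoints):
    # The lifecycle automation above this line is identical for both realizations.
    vpn.deploy(endpoints)
    print("Assurance:", vpn.assure())

build_service(LegacyMPLSVPN(), ["NYC", "LA"])
build_service(HostedVPNFunction(), ["NYC", "LA"])
```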

This being the case, we can say that service lifecycle automation is that above-the-data-plane layer where opportunities for subscription and services arise.  We can say that the application of the management and orchestration principles of NFV in particular, to current device-driven networks, could generate enough benefit to create a subscription software and services business.

How much benefit?  Even in 2019 we could generate almost 7 cents per revenue dollar of process opex reduction, which is a 25% savings.  By 2025 the number would rise to 13 cents per revenue dollar, a savings of almost 50%, and by 2030 the opex savings associated with service lifecycle automation could nearly match total capex.  And, of course, this doesn’t consider the benefits in service agility that lifecycle automation could bring.

This total-service-lifecycle model is separate from the data plane, and it generates significant incremental benefits.  The ROI on it, in fact, is at least forty times that of infrastructure transformation, and infrastructure transformation isn’t what network equipment vendors are really looking for.  If ever there was a (seeming) match made in heaven, this would appear to be it.  Which is why it’s so surprising that everyone dropped the ball on it…so far.  Are we seeing both Wall Street attitude and vendor response combining to suddenly accept this old paradigm?  Is it too late?

To the question of whether a shift to accepting the “old” paradigm of opex-driven intent modeling of legacy elements is coming too late, the answer is both yes and no.  Certainly, the benefits I’ve cited remain to be tapped, and so the potential for a vendor is still there.  However, AT&T took a giant step with ECOMP, framing orchestration in the very kind of broad and universal context that’s required to reap those benefits.  Now, in the Linux Foundation’s merger of ECOMP and Open-O into ONAP, we have an activity that’s driving an open realization of the architecture.

This is where stuff gets complicated.  Nobody ever accused any consensus body of lightning response.  If ECOMP recognized all the issues and framed all the benefit-to-feature connections correctly, it would still have to work its way through the inevitable political process of securing support in the Foundation for the implementation.  It has not recognized all the issues or framed all the benefit/feature links, and the consensus process may mean it will take even longer to do so.  That could mean that a lively and aggressive vendor could jump in and simply do the right thing immediately.

Which would then give the open-source communities in general a nice target to shoot at.  Competition firms up the old spec, as most vendors know.  There would be a window in which an aggressive vendor could reap enormous rewards from having a total-service-lifecycle story, but it would be only a window.  While true persistent vendor dominance has always been rare, it has become especially so as buyers recognize the lock-in goal and work against it.  Add a “free” software solution hanging over the picture and you can see why vendor management might take a whiff at the opportunity to get in on the total-service-lifecycle story.

This strategic confusion could be at the heart of Cisco’s own positioning shift.  They have to tell the Street something, or they could end up getting negative outlooks from analyst firms.  By focusing on their own business model change without getting too detailed about just where all the future revenue is going to come from, they can dabble in fairly tactical stuff and (they hope) save a seat for themselves at the total-service-lifecycle table.  That the tactical approach is also less likely to hasten the commoditization of network devices is an added benefit.

The problem with this lovely approach is that there are two classes of vendors who don’t have the same constraints as network equipment giants like Cisco do.  The first are the server vendors, who have an opportunity to cash in on any transformation from appliance/device hosting of features to server hosting.  The second are the wireless and optical players, who have a potentially enormous play in the future network because you can’t push a virtual bit.

If either of these classes of vendors takes its own aggressive shot at the subscription-and-services space through total service lifecycle automation, it could get a major jump on Cisco and other network vendors who hang back.  So far that hasn’t been happening because all vendors seem reluctant to undertake the task of positioning a broad strategic story with buyers.  But…it could.

So we have two paths that could lead to a dead end in the subscription and services revenue transformation story for network vendors.  One is the path where open-source solutions advance enough to erode or eliminate the opportunity, and the other is the path where server/fiber competitors use their immunity from displacement risk to accelerate transformation and cut hardware revenues further and faster than subscription and services can restore them.

It appears that vendor business model issues are going to kill off the “stay the course” option for networking, whether or not we accept that some transformational technology will also be at work.  That may be the most important point.  If the Street is right, then hardware (other than optical hardware) isn’t the business to be in any longer, and for networking that is an enormous shift.  It will be interesting to see how quickly the changes come, and who they favor.

Some General Thoughts on Service Modeling

Everyone tells us that service composition based on software tools is critical.  Operators say that “agility” in service creation would help them address opportunities faster and better, and vendors think it would promote their products in a market that’s increasingly price-competitive.  Perhaps it’s surprising, then, that there doesn’t seem to be a unified position on just what “service composition” means or what’s needed to provide it.

Network services are created from cooperative relationships among network devices.  Historically this has been simplified by the fact that the “service” and the cooperative relationships were created at the device level, meaning that the system of devices we’d call a “network” had its own rules for cooperative behavior, an offshoot of which was one or more retail services.  Services in this framework are somewhat coerced rather than composed, because the behavior of the network is really the natural underpinning of service behavior.

Even where operators didn’t have continuity of control, as would be the case for multi-provider services, it was routine to define network interconnect points with policy-managed traffic exchange.  As long as the service protocol was the same, NNIs largely eliminated the composition problem.  I say “largely” because multi-provider services still required the coordination of VPN services (for example) and access connections in various regions.  The fact that most early service composition thinking came from OSS/BSS providers and CIOs is likely due to this; the technical provisioning was less an issue than order coordination.

Virtualization messed things up because it created two levels of elasticity.  First, the relationship between features and network resources changed from being static (purpose-built devices) to dynamic (hosted).  Second, the feature composition of services also became elastic.  We can see both these levels of dynamism in virtualization today—you have “functions” that have to be assembled into services, but for those services to work, each function has to be hosted somewhere.  Virtualization, then, mandates explicit service composition.  Most probably agree with that; the issue is in what kind and how much.

At the “light touch” end of the spectrum of possibilities is the administration-centric view.  This is an evolution from the OSS/BSS TMF approach, one that focuses on the idea of assembling functionally complete elements but leaving the implementation of each up to the operator/administrator that owns them.  You can visualize the implementation as being a combination of commanding networks to do something, much as they would do it today, and instantiating software features on servers.

The opposite pole is the “complete composition” approach.  Here, the assumption is that composition actually builds a functioning service by building every feature, hosting it as necessary, and making internal and to-the-user connections in a fairly explicit way.  Services still require features/functions, but the processes of including them and hosting them on something are blurred into different faces of the same coin.

There are a lot of differences between these polar approaches, but they tend to be rather subtle and many depend on just how the composition process (and the underlying resources and overarching services) are structured.  Let me give a single example to illustrate the issues, and I’m sure you can work out others on your own.

Where services are modeled at a low level of granularity—the light-touch approach—it’s very difficult to reflect conditions between parts of the model.  That’s because the modeling process abstracts the features/behaviors at the detail level, making it impossible to see what’s happening inside.  As the granularity of the model increases, in the complete composition approach, you have the chance to reflect conditions across the elements of the model.  That changes optimization and management.

In a light-touch scenario, the presumption is that the feature selection process divides the service by domain, meaning geography or administration.  A “service” like VPN, for example, consists of a core VPN service capability coupled with access elements, one for each endpoint.  If something breaks, the presumption is that either the function itself can self-heal (the VPN core can repair around a node or trunk fault) or there’s no healing possible without manual intervention (you can’t fix the cable-cutting backhoe problem without sending a tech to splice things).  Further, the presumption is that the elements can be managed independently.  Doing something to optimize one won’t impact the others.

The complete composition scenario relaxes these presumptions.  If the entire service is built from some software-readable recipe, then the decisions made anywhere can be reflected into the decisions to be made in other places.  You can optimize both hosting and connectivity among hosting points together, because your composition recipe defines both connecting and hosting, and does it across the service overall.

Even within the two basic approaches to modeling, you still have the question of what the model looks like.  Modeling that is functionally generalized, meaning that all “access” has a common set of model properties and implementations of “access” all meet them, is key if you want to support interoperability and integration.  Intent modeling, the current favored approach, is an example of functionally generalized modeling.  Where modeling isn’t functionally generalized, you have the risk of exposing something implementation-specific in the model, which would then mean that the model structure is “brittle”.  That term means that it’s easy to break the model by making a small change at a low-level point, in implementation.  Brittle stuff means a lot of work and a lot of errors down the line, because a small technical change can mean that a lot of models and their associated services don’t work.

Even having a strong intent structure doesn’t ensure you have interoperability and integration support.  Intent modeling, if it’s to be done right, means that we have to first define our functions to support functional generalization.  We need a kind of model class library that says that “access” is a high-level model that can represent “roaming-enabled access” or “tethered access”, for example, or that “function-hosting” can mean “VM” or “container” subclasses.  With this kind of approach, we can apply intent modeling without having our non-brittle goals stymied by inconsistent definitions and integration.  If Vendor A defines a “my-access-strategy” model, for example, what’s the likelihood that Vendor B will adopt the same approach?
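
Here is a minimal sketch of the kind of model class library described above; the class names follow the examples in the text (“access” and its subclasses, “function-hosting” and its VM/container subclasses), and the structure is illustrative only.

```python
class ModelElement:
    """Base for all intent-modeled elements: common properties and lifecycle hooks."""
    def properties(self) -> dict:
        raise NotImplementedError

class Access(ModelElement):
    """Functionally generalized 'access': every implementation must meet these properties."""
    def properties(self):
        return {"bandwidth_mbps": None, "sla_class": None}

class RoamingEnabledAccess(Access):
    def properties(self):
        return {**super().properties(), "roaming": True}

class TetheredAccess(Access):
    def properties(self):
        return {**super().properties(), "roaming": False}

class FunctionHosting(ModelElement):
    """Generalized 'function-hosting'; VM and container hosting are subclasses."""

class VMHosting(FunctionHosting):
    pass

class ContainerHosting(FunctionHosting):
    pass

# Vendor A's and Vendor B's access products both plug in as subclasses of Access,
# so a service model that asks for "Access" can use either without becoming brittle.
```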

A final point in modeling is just what you’re expecting the model to drive.  Do you want to model a “service lifecycle” for the purposes of automating it?  If so, then you are accepting the notion that service conditions can arise and can be detected at the service level.  While we surely have examples of that, the broad trend in the industry since the replacement of TDM services by packet services has been to manage resources at the resource level and remediate based on resource reconfiguration and substitution.  The presumption is that resources are inherently multi-tenant, so individual per-service remediation isn’t practical.

All of this circles back to that original light-versus-full-composition division.  We tend to see network connection services modeled at a light-touch level because we build connection services largely by coercing naturally cooperative device communities to do something.  The question we have yet to answer in our modeling debates is whether the cloud-like virtualization additions we now contemplate will follow that same model—whether a pool of servers, hosting a pool of functions, will augment or replace a pool of devices.

I think the question of “intent” is critical in modeling for more reasons than the classic one.  What do we intend service lifecycle management to be, to mean?  If systems of hosting elements, functions, and connections are to assemble self-managing service elements that are then composed into services, we have one level of problem.  If we want to have greater granularity of composition and control to allow for per-service lifecycle management, we’ll have to model differently.  Now’s the time to think about this, because we are clearly coming to a point where our lack of a modeling strategy will either limit virtualization’s impact, or force operators and vendors to latch onto something—which might or might not fit the long-term needs of the industry.

Taking a Deeper Look at the Evolution of SD-WAN

There has been a lot of recent discussion about SD-WAN technology and its potential.  Not surprisingly, most of it has been marred by our industry tendency to over-generalize, to seize on a term that describes a host of options and presume that all the options are really the same.  SD-WAN is really important, but not all its options have the same mission or potential.

The common theme of SD-WAN is the use of edge devices to establish what are effectively VPN services.  The earliest examples of SD-WAN focused on using Internet connectivity to supplement traditional (MPLS) VPNs in places where either there were no MPLS options or where MPLS VPN pricing was prohibitive.  These SD-WANs usually supported multiple connectivity options (MPLS and Internet), and often also allowed their users to use multiple ISPs to improve performance and reliability.

This particular SD-WAN mission is clearly tied to arbitraging the price difference and SLA differences between MPLS VPNs and the Internet.  That differential depends on a bunch of factors, perhaps the largest being the presumption that there will never be Internet QoS because there will never be paid prioritization on the Internet.  That’s a regulatory policy issue, and in the US at least it’s likely that the mood of regulators is now shifting the other way.

You can state the SD-WAN mission another way, though.  You could say that the goal of SD-WAN is to present a uniform IP-connective service over a variety of lower-level connectivity options.  The MEF people told me over a year ago that SD-WAN could be a big part of a successful “Third Network” deployment because it could support consistent services as operators shifted from one underlayment (MPLS, Ethernet, whatever) to another.

It’s this second mission that I think will really shape SD-WAN over time.  It doesn’t have radically different technology requirements, but it would weigh the requirements differently and would also be marketed and deployed based on different drivers.  Today, the primary driver of the service is the control of VPN profit per bit.  For the future, operators see it as a way of making network technology evolution more seamless.

Up to now, SD-WAN has been either a managed service play or an option for connectivity deployed by users themselves.  There’s been, recently, just a hint of the broader second mission, driven, according to operators I’ve talked with, by the need to cope with continued price pressure on VPN services.  One operator told me that in the last seven years, the capital cost of VPNs has declined by about 18% and the opex has almost doubled.  A big part of this is related to the need to go down-market to sell new customers, given that current ones expect their services to get cheaper with every renewal of the contract.

This same profit-per-bit pressure is behind the drive to virtualize things, to build services a different way.  SDN technology is a good way to create services that do essentially nothing but forwarding, for example.  An SDN switch doesn’t really know anything about topology or even about IP; it has a forwarding table that it matches things against, and it handles each packet accordingly.  Is this an IP network?  No, and Google among others has demonstrated that if you want to use SDN in place of IP you have to add in some things that an IP user would “see” that SDN won’t provide.  You can do that if you have a piece of CPE that creates the VPN service edge, which means if you have SD-WAN.

If you have a piece of “new-age” SD-WAN CPE, you could say that it divides itself into two pieces.  One is the user-side functionality, which is responsible for creating a network interface that looks like the kind of VPN or VLAN or whatever service the user expects to see.  The other is the network-side functionality, which is primarily responsible for framing the connectivity of the service in the terms of the actual network capability.  If your SD-WAN uses MPLS, this is where the MPLS link has to be made.  If it uses some kind of secure tunnel over the Internet, it’s supported here.  If it expects SDN connectivity, or optical virtual wires, it’s connected here.

It seems pretty likely to me that future services will tend to be constructed using SD-WAN technology like this, because future services will likely evolve from the use of different connection services at the network level.  It also seems to me that there are many different SD-WAN technologies today that either don’t fit this approach at all, or fit it with significant limitations.  Those may or may not be useful, depending on the way that connectivity evolves.

If the FCC in the US were to sweep away all restrictions on inter-ISP settlement and paid prioritization of traffic, we’d end up seeing an Internet that had QoS.  That would quickly become the baseline for providing VPN services because it would be significantly less costly.  In this scenario, we would see SD-WAN VPNs evolve to exploit IP features that are naturally visible at the edge, and you would want your boxes to have some form of box-to-box high-level (Level 4) signaling to mediate the services at the user level.  If, on the other hand, we never see any QoS on the Internet, then it’s likely IMHO that VPNs will split off from IP to exploit SDN, NFV, agile optics, and so forth.

Logically, what we should want from SD-WAN is a kind of modular structure to handle both situations.  You have a plugin for the user connection, and this is the feature set that defines the retail service.  You have another for the network side, matched to the specific technology you’re exploiting there, and you mix and match them as needed.  This would be easy if you had software plugins, features/functions that could be installed in an agile premises device.
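
A minimal sketch of that modular structure, assuming a user-side plugin and a network-side plugin that can be mixed and matched in the CPE; the plugin names are hypothetical.

```python
class UserSidePlugin:
    """Presents the retail service interface the user expects (IP VPN, VLAN, and so on)."""
    def present_interface(self):
        raise NotImplementedError

class NetworkSidePlugin:
    """Maps the service onto whatever connectivity actually exists underneath."""
    def connect(self):
        raise NotImplementedError

class IPVPNInterface(UserSidePlugin):
    def present_interface(self):
        return "IP VPN edge: standard routing presented toward the user"

class InternetTunnelUnderlay(NetworkSidePlugin):
    def connect(self):
        return "Encrypted tunnel over best-efforts Internet"

class SDNUnderlay(NetworkSidePlugin):
    def connect(self):
        return "Forwarding paths provisioned via an SDN controller"

class SDWANEdge:
    """The CPE: same user-side view regardless of how the network side evolves."""
    def __init__(self, user_side: UserSidePlugin, network_side: NetworkSidePlugin):
        self.user_side = user_side
        self.network_side = network_side
    def activate(self):
        print(self.user_side.present_interface())
        print(self.network_side.connect())

# Swap the underlay without changing what the user sees:
SDWANEdge(IPVPNInterface(), InternetTunnelUnderlay()).activate()
SDWANEdge(IPVPNInterface(), SDNUnderlay()).activate()
```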

This may be the future mission of vCPE, and the most powerful stimulus for NFV-like deployments.  What users want and need on the premises, first and foremost, is a service interface to plug into.  If there is only one option for service, then the value of agility is limited.  If we’re in a serious state of network technology evolution, then agility is everything, and the SD-WAN model may be the best, even only, way to meet future goals.

What is a Model and Why Do We Need One in Transformation?

After my blog on Cisco’s intent networking initiative yesterday, I got some questions from operator friends on the issue of modeling.  We hear a lot about it in networking—“service models” or “intent models”, but typically with a prequalifier.  What’s a “model” and why have one?  I think the best answer to that is to harken back to what I think are the origins of the “model” concept, then look at what those origins teach us about the role of models in network transformation.

At one level, modeling starts with a software concept called “DevOps”.  DevOps is short for “Development/Operations”, and it’s a software design and deployment practice aimed at making sure that when software is developed, there’s collateral effort undertaken to get it deployed the way the developers expected.  Without DevOps you could write great software and have it messed up by not being installed and configured correctly.

From the first, there were two paths toward DevOps, what’s called the “declarative” or “descriptive” path, and what’s called the “prescriptive” path.  With the declarative approach, you define a software model of the desired end-state of your deployment.  With the prescriptive path, you define the specific steps associated with achieving a given end-state.  The first is a model, the second is a script.  I think the descriptive or model vision of DevOps is emerging as the winner, largely because it’s more logical to describe your goal and let software drive processes to achieve it, than to try to figure out every possible condition and write a script for it.
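
To make the distinction concrete, here is a hedged sketch in Python rather than any particular DevOps tool's syntax: a declarative desired end-state that a reconciliation process works toward, versus a prescriptive script that spells out the steps.  The fields and functions are illustrative assumptions.

```python
# Declarative/descriptive: state the end-state; software figures out how to get there.
desired_state = {
    "service": "vpn-edge",
    "instances": 3,
    "image": "vpn-edge:2.1",
    "connected_to": ["core-vpn"],
}

def reconcile(current_state, desired_state):
    """A model-driven tool compares actual versus desired and closes the gap."""
    if current_state.get("instances", 0) < desired_state["instances"]:
        print("Scaling up to", desired_state["instances"], "instances")
    # ...other differences would be detected and corrected the same way...

# Prescriptive: spell out every step yourself, for every condition you anticipated.
def deploy_vpn_edge():
    print("Copy image vpn-edge:2.1 to host")
    print("Start 3 instances")
    print("Attach each instance to core-vpn")

reconcile({"instances": 1}, desired_state)
deploy_vpn_edge()
```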

Roughly concurrent with DevOps were two telecom-related activities that also promoted models.  One was the Telemanagement Forum’s “NGOSS Contract”, and the other the IPsphere Forum’s notion of “elements”.  The TMF said that a contract data model could serve as the means of associating service events and service processes, and the IPSF said that a service was made up of modular elements assembled according to a structure, and “orchestrated” to coordinate lifecycle processes.

What’s emerged from all of this is the notion of “models” and “modeling” as the process of describing the relationships among the components of what is logically a multi-component, cooperative system that provides a service.  The idea is that if you can represent all suitable alternative implementation strategies for a given “model”, you can interchange them in the service structure without changing service behavior.  If you have a software process that can perform NGOSS-contract-like parsing of events via the service model represented by a retail contract, you can use that to manage and automate the entire service lifecycle.

I think that most operators accept the idea that future service lifecycle management systems should be based on “models”, but I’m not sure they all recognize the features such models would require, given the lineage I’ve just described.  A model has to be a structure that can represent, as two separate things, the properties of something and the realization of those properties.  It’s a “mister-outside-mister-inside” kind of thing.  The outside view, the properties view, is what we could call an “intent model” because it focuses on what we want done and not on how we do it.  Inside might be some specific implementation, or it might be another nested set of models that eventually decompose into specific implementations.
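
Here is a minimal sketch of that outside/inside structure, assuming nothing beyond what is described above: the “outside” is a set of properties, the “inside” is whatever realization satisfies them, and any realization honoring the same properties is interchangeable.  The class and function names are mine, purely for illustration.

    class IntentModel:
        def __init__(self, name, properties, realization):
            self.name = name
            self.properties = properties        # what we want done (the "outside")
            self.realization = realization      # how it gets done (the "inside")

        def deploy(self):
            # Any realization that honors the same properties is interchangeable.
            return self.realization(self.properties)

    def mpls_vpn(props):
        return f"MPLS VPN delivering {props['bandwidth']} at {props['latency']}"

    def sdn_overlay(props):
        return f"SDN overlay delivering {props['bandwidth']} at {props['latency']}"

    vpn_intent = {"bandwidth": "50Mbps", "latency": "30ms"}

    # Two different "insides" behind the same "outside":
    print(IntentModel("corp-vpn", vpn_intent, mpls_vpn).deploy())
    print(IntentModel("corp-vpn", vpn_intent, sdn_overlay).deploy())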

One of the big mistakes made in modeling is overlooking the requirement for event integration.  Each model element has an intent and a realization, and the realization is the management of the lifecycle of that element.  Thus, every model element has its own events and operating states, and these define the processes that the model requires to handle a given event at a given time.  If you don’t have state/event handling in a very explicit way, then you don’t have a model that can coordinate the lifecycle of what you’re modeling, and you don’t have service automation.
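
A minimal sketch of what explicit state/event handling for one model element might look like; the states, events, and process names here are hypothetical placeholders, not drawn from any specification.  The point is simply that the pair (state, event) explicitly selects the process to run.

    LIFECYCLE = {
        ("ordered",   "activate"):  ("deploying", "start_deployment"),
        ("deploying", "deployed"):  ("active",    "notify_parent_active"),
        ("deploying", "fault"):     ("failed",    "run_remediation"),
        ("active",    "fault"):     ("degraded",  "run_remediation"),
        ("degraded",  "repaired"):  ("active",    "notify_parent_active"),
    }

    class ModelElement:
        def __init__(self, name):
            self.name = name
            self.state = "ordered"

        def handle(self, event):
            key = (self.state, event)
            if key not in LIFECYCLE:
                return f"{self.name}: event '{event}' ignored in state '{self.state}'"
            next_state, process = LIFECYCLE[key]
            self.state = next_state
            return f"{self.name}: ran {process}, now '{next_state}'"

    elem = ModelElement("access-leg")
    print(elem.handle("activate"))
    print(elem.handle("fault"))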

One of the things I look for when vendors announce something relating to SDN or NFV or cloud computing or transformation is what they do for modeling.  Absent a modeling approach that has the pieces I’ve described, you can’t define a complete service lifecycle in a way that facilitates software automation, so you can’t have accurate deployments and you can’t respond to network or service conditions efficiently.  So, no opex savings.

Models also facilitate integration.  If a service model defines the elements of a service, each through its own model, and defines the service events and operating states, then you can look at the model and tell what’s supposed to happen.  Any two implementations that fit the same intent model description are equivalent.  Integration is implicit.  Absent a model, every possible service condition has to somehow figure out what the current service state is, and what the condition means in that state, and then somehow invoke the right processes.  The service model can even define the APIs that link process elements; with no model, what defines them and ensures all the pieces can connect?

Where something like policy management fits into this is a bit harder to say.  While we know what policies are at a high level (rules that govern the handling of conditions), it may not be clear, as it is with models, how those rules relate to specific lifecycle stages or which events their conditions actually represent.  It’s my view that policy management is a useful way of describing self-organizing systems, usually ones that have a fairly uniform resource set on which they depend.
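
For contrast, here is a hypothetical flavor of condition/action policies.  Note that there is no lifecycle state in it at all; a rule fires whenever its condition matches, which is exactly why policies suit uniform, self-organizing resource sets better than multi-step deployments.  The thresholds and actions are invented.

    policies = [
        {"condition": lambda m: m["link_utilization"] > 0.8,
         "action": "reroute lower-priority traffic"},
        {"condition": lambda m: m["packet_loss"] > 0.01,
         "action": "raise congestion alarm"},
    ]

    def evaluate(measurements):
        """Return the actions of every policy whose condition matches."""
        return [p["action"] for p in policies if p["condition"](measurements)]

    print(evaluate({"link_utilization": 0.9, "packet_loss": 0.001}))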

Router networks are easily managed using policies.  With NFV-deployed router instances, you have to worry about how each instance gets deployed and how it might be scaled or replaced.  It’s much more difficult to define policies to handle these dependencies, because most policy systems don’t do well at communicating asynchronous status between dependent pieces.  I’m not saying that you can’t write policies this way, but it’s much harder than simply describing a TMF-IPSF-DevOps declarative intent model.

Policies can be used inside intent models, and in fact a very good use for policies is describing the implementation of “intents” that are based on legacy homogeneous networks like Ethernet or IP.  A policy “tree” emerging from an intent model is a fine way of coordinating behavior in these situations.  As a means of synchronizing a dozen or a hundred independent function deployments, it’s not good at all.

This all explains two things.  First, why SDN and NFV haven’t delivered on their promises.  What is the model for SDN or NFV?  We don’t have one, and so we don’t have a consistent framework for integration or service lifecycle management.  Second, why I like OASIS TOSCA (Topology and Orchestration Specification for Cloud Applications): it’s built to model exactly the kind of deployment that’s too dynamic and complex to control via policies.  Remember, we generally deploy cloud applications today using some sort of model.

Integration is fine.  API specifications are fine.  Without models, neither is more than a goal, because there’s no practical way to systematize, to automate, what you end up with.  We will never make controlled services and service infrastructure substitute for autonomous and adaptive infrastructure without software automation, and it’s models that can get us there.  So forget everything else in SDN and NFV and go immediately to the model step.  It’s the best way to get everything under control.

What Does Cisco Intend with “Intent Networking?”

Cisco has announced it’s going to support, and perhaps even focus on, “intent-based” networking.  At one level this could be viewed as a vindication of a widely held view that intent-modeling is the essential (and perhaps under-supplied or even missing) ingredient in the progression of virtualization.  At another level, it could be seen as another Cisco marketing strategy.  The truth is that it’s a little of both.

At the heart of today’s issue set is a whole different notion, that of determinism.  The time-division-multiplexed networks of old were deterministic; they worked in a specific way and provided very specific capacity and SLAs.  As packet networks, and particularly the Internet, evolved, networking tossed out strict determinism in favor of lower cost.  We had “best efforts” networks, which is what dominates today.

So what does this have to do with “intent?”  Well, best efforts is increasingly not good enough in a competitive market, but nobody wants to go back to full determinism to achieve something better—the cost would be excessive.  The alternative is to somehow couple service requirements into packet networks in a way that doesn’t break the bank.  In an intent model, elements of infrastructure are abstracted into a black box that asserts interfaces and an SLA but hides the details.  Intent modeling is therefore a way of expressing just how deterministic a network has to be.  It also leaves it to the vendor (and presumably the network-builder) to decide how to fulfill the intent.

Intent modeling is an incredibly important tool in realizing the benefits of virtualization and infrastructure transformation, because it lets operators create abstract building-blocks (intent-based black boxes) that combine to build networks, and that then evolve internally from legacy to modern technology.  A good evolutionary intent model has to be anchored in the present, and support the future.

Cisco’s approach to transformation has always been what cynics would call “cosmetic”.  Instead of focusing on building SDN or building NFV, Cisco has focused on achieving the goals of those technologies using behaviors coerced from current technology.  At one level, this is the same kind of marketing gloss Cisco has been famous for, for decades in fact.  At another it’s reflective of a simple truth, which is that transformational technologies that do the transforming by displacing legacy infrastructure are exceptionally difficult to promote because of cost and risk.

There really isn’t much new in the Cisco intent approach.  Cisco has always been an advocate of “policy-based” networking, meaning a form of determinism where the goal (the “intent”) is translated into a hierarchy of policies that then guide how traffic is handled down below.  This is still their approach, and so you have to wonder why they’d do a major announcement that included the financial industry to do little more than put another face on a concept they’ve had around for almost a decade.

One reason is marketing, of course.  “News”, as I’ve always said, means “novelty”.  If you want coverage in the media rags (or sites, in modern terms) then you have to do something different, novel.  Another reason is counter-predation.  If a competitor is planning on eating its way along a specific food chain to threaten your dominance, you cut them off by eating a critical piece yourself.  Intent modeling is absolutely critical to infrastructure transformation.  If you happen to be a vendor winning in legacy infrastructure, and thus want to stall competitors’ reliance on intent modeling as a path to displacing you, then you eat the concept yourself.

OK, yes, I’m making this sound cynical, and it is.  That’s not necessarily bad, though, and I’d be the first to admit that.  In one of my favorite media jokes, Spielberg, when asked what had been the best advice he’d received as a director, said “When you talk to the press, lie.”  But to me the true boundary between mindless prevarication and effective marketing is the buyers’ value proposition.  Is Cisco simply doing intent models the only way they are likely to get done?  That, it turns out, is hard to say “No!” to.

We have struggled with virtualization for five years now, and during that period we have done next to nothing to actually seize the high-level benefits.  In effect, we have as an industry focused on what’s inside the black-box intent model even though the whole purpose of intent models is to make that invisible.  Intent modeling as a driving concept for virtualization emerged in a true sense only within the last year.  Cisco, while they didn’t use the term initially, jumped onto that high-level transformation mission immediately.  Their decision to do that clearly muddies the business case for full transformation via SDN and NFV, but if the proponents of SDN and NFV weren’t making (and aren’t making) the business case in any event, what’s the problem?

Cisco has done something useful here, though of course they’ve done it in an opportunistic way.  They have demonstrated the real structure of intent models—you have an SLA (your intent) on top, and you have an implementation that converts intent into network behavior below.  Cisco does it with policies, but you could do the same thing with APIs that passed SLAs, and then have the SLAs converted internally into policies.  Cisco’s model works well for homogeneous infrastructure that has uniform dependence on policy control; the other approach of APIs and SLAs is more universal.  So Cisco could be presenting us with a way to package transformation through revolution (SLAs and APIs) and transformation through coercion (policies) as a single thing—an intent model.
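
A sketch of how those two faces might be packaged together, assuming only what is said above: an SLA arrives over an API, and it is translated internally into the policies that coerce the desired behavior from existing gear.  The thresholds and policy strings are invented for illustration.

    def sla_to_policies(sla):
        """Translate an externally supplied SLA into internal policy settings."""
        policies = []
        if sla.get("latency_ms", 1000) <= 20:
            policies.append("mark traffic for expedited forwarding on low-latency paths")
        if sla.get("availability", 0) >= 0.9999:
            policies.append("pre-provision a diverse backup path")
        if sla.get("throughput_mbps"):
            policies.append(f"police the flow at {sla['throughput_mbps']} Mbps")
        return policies

    # The API caller only expresses intent; policy is an internal detail.
    print(sla_to_policies({"latency_ms": 15, "availability": 0.9999,
                           "throughput_mbps": 100}))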

They could also be stimulating the SDN and NFV world to start thinking about the top of the benefit pyramid.  If Cisco can make the business case for “transformation” without transforming infrastructure, bring service control and a degree of determinism to networking without changing equipment, then more radical approaches are going nowhere unless they can make a better business case.

Is Cisco sowing the seeds of its own competition?  More likely, as I suggested above, Cisco is seeing the way that a vulnerability might be developing and working to cut it off.  But one way or the other, Cisco is announcing that the core concept of SDN and NFV isn’t just for SDN and NFV to realize.  Those who don’t want five years of work to be a science project had better start thinking about those high-level benefits that Cisco is now chowing down on.  There are only so many prey animals in the herd, and Cisco is a very hungry predator.

Solving the Problem that Could Derail SDN and NFV

Back in the days of the public switched telephone network, everyone understood what “signaling” was.  We had an explicit signaling network, SS7, that mediated how resources were applied to calls and managed the progression of connections through the hierarchy of switches.  The notion of signaling changed with IP networks, and I’m now hearing from operators that it changes even more when you add in things like SDN and NFV.  I’m also hearing that we’ve perhaps failed to recognize just what those changes could mean.

You could argue that the major difference between IP networks and the old circuit-switched networks is adaptive routing.  In traditional PSTN we had routing tables that expressed where connections were supposed to be routed.  In IP networks, we replaced that with routing tables that were built and maintained by adaptive topology discovery.  Nodes told each other who they could reach and how good the path was, simply stated.
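
As a toy illustration of what adaptive routing converges on, here is a shortest-path calculation over a small made-up topology; real IP routing protocols differ in mechanism, but the end result each node is working toward is a routing table like this one.

    import heapq

    links = {                 # cost between directly connected nodes (invented)
        "A": {"B": 1, "C": 4},
        "B": {"A": 1, "C": 1, "D": 5},
        "C": {"A": 4, "B": 1, "D": 1},
        "D": {"B": 5, "C": 1},
    }

    def routes_from(source):
        """Best-path costs from source, the state adaptive routing converges to."""
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            cost, node = heapq.heappop(heap)
            if cost > dist.get(node, float("inf")):
                continue
            for neighbor, link_cost in links[node].items():
                new_cost = cost + link_cost
                if new_cost < dist.get(neighbor, float("inf")):
                    dist[neighbor] = new_cost
                    heapq.heappush(heap, (new_cost, neighbor))
        return dist

    print(routes_from("A"))   # {'A': 0, 'B': 1, 'C': 2, 'D': 3}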

The big advantage of adaptive routing is that it adapts, meaning that issues with a node or a trunk connection can be accommodated because the nodes will discover a better path.  This takes time, to be sure, for what we call “convergence”, meaning a collective understanding of the new topology.  The convergence time is a period of disorder, and the more complicated convergence is, the longer that disorder lasts.

SDN sought to replace adaptive routing with predetermined, centrally managed routes.  The process whereby this determination and management happens is likely not the same as it was for the PSTN, but at least one of the goals was to do a better job of quickly settling on a new topology map that efficiently managed the remaining capacity.  The same SDN central processes could also be used to put the network into one of several operating modes that were designed to accommodate special traffic conditions.  A great idea, right?
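
Here is a hypothetical sketch of the centrally managed alternative: a controller holds precomputed forwarding tables for a couple of operating modes and pushes them to the nodes, instead of letting the nodes discover routes on their own.  The node names, prefixes, and mode names are invented.

    PRECOMPUTED_MODES = {
        "normal":        {"node1": {"10.0.0.0/8": "port2"}, "node2": {"10.0.0.0/8": "port1"}},
        "trunk3-failed": {"node1": {"10.0.0.0/8": "port3"}, "node2": {"10.0.0.0/8": "port1"}},
    }

    class Controller:
        def __init__(self, modes):
            self.modes = modes
            self.nodes = {}                       # node name -> forwarding table

        def apply_mode(self, mode):
            for node, table in self.modes[mode].items():
                self.nodes[node] = table          # in a real network, a southbound protocol push
            return f"applied mode '{mode}' to {len(self.modes[mode])} nodes"

    ctl = Controller(PRECOMPUTED_MODES)
    print(ctl.apply_mode("normal"))
    print(ctl.apply_mode("trunk3-failed"))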

Yes, but.  What some operators have found is that SDN has implicitly reinvented the notion of signaling, and because the invention was implicit and not explicit they’re finding that the signaling model for SDN isn’t fully baked.

Some operators and even some vendors say NFV has many of the same issues.  A traffic path and a set of feature hosts are assembled in NFV to replace discrete devices.  The process of assembling these parts and stitching the connections, and the process of scaling and sustaining the operation of all the pieces, happens “inside” what appears at the service level to be a discrete device.  That interior stuff is really signaling too, and like SDN signaling it’s not been a focus of attention.

It’s now becoming a focus, because when you try to build extensive SDN topologies that span more than a single data center, or when you build an NFV service over a practical customer topology, you encounter several key issues.  Most can be attributed to the fact that both SDN and NFV depend on a kind of “out of band” (meaning outside the service data plane) connectivity.  Where does that come from?

SDN’s issue is easily recognized.  Say we have a hundred white-box nodes.  Each of these nodes has to have a connection to the SDN controller to make requests for a route (in the “stimulus” model of SDN) and to receive forwarding table updates (in either model).  What creates that connection?  If the connection is forwarded through other white-boxes, creating what would be called a “signaling network”, then the SDN controller also has to maintain its signaling paths.  But if something breaks along such a path and the path is lost, how does the controller reach the nodes to tell them how the new topology is to be handled?  You can, in theory, define failure modes for nodes, but you then have to ensure that all the impacted nodes know that they’re supposed to transition to such a mode.
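
A toy model of that failure, with all names invented: a node that loses its path to the controller can no longer receive updates and has to fall back to whatever failure mode was preloaded into it.

    class WhiteBoxNode:
        def __init__(self, name):
            self.name = name
            self.controller_reachable = True
            self.table = {}
            self.failure_mode_table = {"default": "flood-to-uplink"}  # preloaded locally

        def receive_update(self, table):
            if not self.controller_reachable:
                return f"{self.name}: update lost, controller unreachable"
            self.table = table
            return f"{self.name}: table updated"

        def on_controller_loss(self):
            self.controller_reachable = False
            self.table = self.failure_mode_table
            return f"{self.name}: fell back to local failure mode"

    node = WhiteBoxNode("wb-17")
    print(node.receive_update({"10.1.0.0/16": "port1"}))
    print(node.on_controller_loss())
    print(node.receive_update({"10.1.0.0/16": "port2"}))   # this update never takes effect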

In NFV, the problem is harder to explain in simple terms and it’s also multifaceted.  Suppose you have to scale out, meaning instantiate a new copy of a VNF to absorb additional load.  You have to spin up a new VNF somewhere, which means you need to signal a deployment of a VNF in a data center.  You also have to connect it into the data path, which might mean spinning up another VNF that acts as a load-balancer.  In NFV, if we’re to maintain security of the operations processes, we can’t expose the deployment and connection facilities to the tenant service data paths or they could be hacked.  Where are they then?  Like SS7, they are presumably an independent network.  Who builds it, with what, and what happens if that separate network breaks?  Now you can’t manage what you’ve deployed.
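
A hypothetical sketch of that scale-out sequence, assuming the signaling rides on a management network that is deliberately separate from the tenant data path.  If that network is unreachable, none of the steps can happen.

    def scale_out(vnf_name, mgmt_net_up):
        """Return the scale-out steps, or a failure if the signaling network is down."""
        if not mgmt_net_up:
            return ["FAILED: management/signaling network unreachable"]
        return [
            f"instantiate new copy of {vnf_name} in a data center",
            "spin up (or reconfigure) a load-balancer VNF",
            f"splice the new {vnf_name} instance into the data path",
        ]

    for step in scale_out("firewall-vnf", mgmt_net_up=True):
        print(step)
    print(scale_out("firewall-vnf", mgmt_net_up=False))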

I opened this blog with a comment on SS7 because one EU operator expert explained the problem by saying “We’re finding that we need an SS7 network for virtualization.”  The fact is that we do, and a collateral fact is that we’ve not been thinking about it.  Which means that we are expecting a control process that manages resources and connectivity to operate using the same resources it’s managing, and never lose touch with all the pieces.  If that were practical we’d never have faults to manage.

The signaling issue has a direct bearing on a lot of the SDN and NFV reliability/availability approaches.  You don’t need five-nines devices with virtualization, so it’s said, because you can replace a broken part dynamically.  Yes, if you can contact the control processes that do the replacing, then reconnect everything.  To me, that means that if you want to accept the dynamic replacement approach to availability management, you need to have a high-reliability signaling network to replace the static-high-availability-device approach.

Even the operators who say they’ve seen the early signs of the signaling issue say that they see only hints today because of the limited scope of SDN and NFV deployments.  We are doing trials in very contained service geographies, with very small resource pools, and with limited service objectives.  Even there, we’re running into situations where a network condition cuts the signaling connections and prevents a managed response.  What’s going to happen when we spread out our service deployments?

I think SDN and NFV signaling is a major issue, and I know that some operators are seeing signs of that too.  I also think that there are remedies, but in order to apply them we have to take a much higher-level view of virtualization and think more about how we provide the mediation and coordination needed to manage any distributed system.  Before we get too far into software architectures, testing, and deployment we should be looking into how to solve the signaling problem.