The Service Implications of an Open Network Model

What would an open networking model really mean?  I don’t mean from the perspective of the finances of operators or vendors (that’s obvious at the high level, and complicated enough to warrant a special blog at a lower level).  We need to think about what would happen to networks and services if operators had open platforms and greater control over the technology behind innovation.  That’s what’s likely to impact users, and also to shape the broader competitive environment to the point where we can then think about that side of the question.

If we had truly open devices, things like the white-box nodes that AT&T’s dNOS and ONF Stratum envision, they would offer three symbiotic benefits to operators.  First, this sort of technology would cut the capital cost of network devices by about half, according to my model, and in some cases even more.  Second, the software-driven approach would offer a more agile response to service opportunities that required some form of network technology change.  Finally, the operations model would be lower-touch and thus improve margins on services overall.  The future is likely the sum of these forces.

Cheap boxes are important because they facilitate the transformation of infrastructure.  If a new device is a bit cheaper than an old one, then the differential won’t cover much in the way of write-down cost.  You’d need to depreciate the old gear to zero.  On the other hand, a massive benefit in capex could justify faster displacement, which means that you’d see the other fruits of an open infrastructure quicker.  Think of this as a “facilitating force”.

The pie-in-the-sky value proposition for openness lies in the second point.  Operators have been complaining for literally decades that vendors stifle innovation and discourage change because they fear loss of near-term revenues.  Even a fat 2019 won’t satisfy companies when the Street wants results in the second quarter of 2018, so nobody trades much of the present for a rosy future.  Thus, new service models tend to be framed in old-infrastructure terms, and that obviously limits how revolutionary the services could be.

The third point is the bridge point, the thing that creates the near-term momentum that the first point needs in order to reach the value proposition of the second.  It’s nice to have “agility” if you know you want to dance or swim or climb, but without a specific mission to do any of these things, it’s hard to put a specific dollar value on the feature, or even prepare for a specific application of it.  Operations costs, on the other hand, are a right-now problem whose solution yields right-now benefits.

I think that the first and last of our value propositions for openness would be enough to drive operators in that direction, but the appeal of the future revenue opportunity story is undeniable.  You cannot postulate cost reduction below zero, but with new revenue the sky’s the limit.  The only problem is that we’ve spent decades postulating new revenues from new services and have always ended up with either nothing useful, or with a service that’s useful only to the extent that it’s cheaper, earning operators less rather than more.  “Suppose you had a service that connected on demand,” people say.  Well, you’ve just reinvented dial-up telephony.

New services are likely to be overlays on more traditional transport models, though eventually new models like optical/SDN tunneling will emerge.  The problem with these new models leading the charge is that they don’t deliver anything really different except, perhaps, cost.  We already know users will buy cheaper stuff, but also that sellers aren’t interested in modernization based on losing revenues.  The overlay models may or may not use classic protocol-overlay tunneling, but they’d be logical overlays in that they could represent special feature sets built on top of connectivity.

SD-WAN-like services are a good example of a future opportunity.  With the open framework of networking, you could host SD-WAN functionality as-is in edge devices, but you could also conceptualize a new model where interior features and elements work with edge elements to create the service.  An example of this would be by-name routing, where users and application components register their presence by name (with proper authentication) and are then connected that way, without the need for an “address” in a conventional sense.
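
To make the by-name idea concrete, here’s a minimal sketch in Python of what a name registry at the heart of such a service might look like.  Everything in it (the class, the method names, the attachment-point strings) is hypothetical rather than any vendor’s API; the point is simply that connections can be requested by authenticated name, with addresses or attachment points kept as an internal detail.

```python
# Hypothetical sketch of "by-name routing": names, methods, and attachment
# points are invented for illustration, not drawn from any real SD-WAN product.

class NameRegistry:
    """Maps authenticated logical names to their current attachment points."""

    def __init__(self):
        self._directory = {}   # name -> attachment point (edge node, port, etc.)

    def register(self, name, credentials, attachment_point):
        # A real system would verify credentials against an identity service;
        # here we just require that something was presented.
        if not credentials:
            raise PermissionError("registration requires authentication")
        self._directory[name] = attachment_point

    def resolve(self, name):
        # The service connects "what" (the name), not "where" (an address);
        # the attachment point is looked up only at connect time.
        return self._directory.get(name)


registry = NameRegistry()
registry.register("erp.orders", credentials="token-123", attachment_point="edge-7:vport-2")
registry.register("branch-42.user.jdoe", credentials="token-456", attachment_point="edge-3:vport-9")

# An edge element asking to connect two named endpoints:
src = registry.resolve("branch-42.user.jdoe")
dst = registry.resolve("erp.orders")
print(f"set up path {src} -> {dst}")
```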

You could also do distributed publish-and-subscribe services that way, and of course there are literally hundreds of potential event-and-IoT-related service opportunities.  In fact, the notion of a connection service that’s built starting with what gets connected instead of what does the connecting would be a refreshing and probably compelling broad model.

It’s also very easy to imagine the new service layer being aimed at what could be called “logical distribution”.  Hosted features and hosted content are two sides of the same (hosted, obviously) coin.  If pathways and cache points are simply missions for whatever devices happen to be in a given place, then you can compose content delivery in a completely different way, with better results.

Or consider 5G and some of its goals.  Think about logical connectivity and its relationship to roaming, to jumping between WiFi and cellular, or cellular and satellite.  The fact is that 5G introduces a lot of complexity in the name of simplicity.  You don’t need a lot of the layering if all the networks have the same agile framework, and if there’s a logical-connection overlay.  One layer fits all, and in the broadest sense what we’re talking about is an open model to build that one layer.

There are an enormous number of possible ways that higher-layer services could evolve in an open model, including of course the way that leads to nothing evolving at all.  I’m doubtful that will happen, because the SDN community, the SD-WAN community, and the IoT community will all be thinking about what’s happening here.  Vendors, including Amazon, will be pondering the fact that a framework where content and network features are co-equal is also a framework where hosted functions and network features are co-equal.  If transport is the only secure service for network operators, if everything else is up for grabs from every quarter, somebody will get greedy.

Are We Seeing a Buyer-Side Revolution in Networking?

Revolutions have to unite before they can disrupt.  A bunch of isolated dissidents can’t really hope to do much except create a higher local noise level, but a united crowd can move a society—or in our case, an industry.  Telecom is undergoing a revolution right now, created by the irresistible force of falling operator profit per bit meeting the immovable object of vendor desire to preserve the status quo.  The “streets” the operators are taking their movement to are the industry forums and open-source projects.

AT&T says that it’s going to deploy about 60 thousand white-box switches based on its dNOS open-source operating system, as part of its 5G deployment.  Operator board members have redirected the work of the ONF to accommodate operators’ need for radical changes in how networks are built, changes that would impact legacy vendors.  Operator interest in open-source network projects has exploded, and they now seem to be driving these initiatives rather than sitting around passively waiting for vendors to introduce the Next Big (proprietary, expensive) Thing.

This is hardly a new problem, a new kind of creative tension.  Vendors have for decades resisted changes that put their own business models at risk, and it’s unrealistic to assume they’d do otherwise.  However, there’s a time when, as water is washing around your waist, you have to figure that the finger-in-the-dike approach is running out of gas.  Operators provided vendors with specific warnings of their profit-per-bit problem starting in 2012, and they provided an estimate of when they needed to have a solution.  This year is the end of that estimate period.

The right approach for vendors would have been to address operations efficiency immediately, through aggressive support of full-scope service lifecycle automation (now popularly called “zero-touch automation”).  Vendors didn’t do that, and in fact still haven’t put their heart into the efforts, largely because they saw a broad strategy for ZTA reducing their own competitive differentiation.  So operators picked up that ball, in the form of the ONAP/ECOMP initiative now gaining traction at a rather incredible pace.  Emboldened, operators are leveraging what they’ve learned, and vendors continue to hunker down and protect a diminishing patch of influence.

The trend toward an operator-driven universal service lifecycle automation approach is incredibly risky for vendors, not so much because of that loss of possible operations differentiation that vendors feared, but because of the experience operators are gaining in cooperating on and controlling open-source projects.  Open-source has always been the big risk for vendors, a way for operators to create commodity technology.  Would AT&T, whose ECOMP launched the current ONAP stuff, have started its open-source work by defining a switch OS, having no real experience in open-source?  I don’t think so; they needed to start with something they understood better, which was operations automation.

Open-source initiatives are big news for operators trying to control their own cost destiny.  They translate directly to product, unlike standards groups that take forever to get somewhere, which then may turn out to be the wrong place.  The open-source groups, like the Linux Foundation, have seen the light and are taking steps to be operator-friendly.  The Linux Foundation’s LF Networking Fund (LNF) is a sandbox for operator-centric network platform elements, including ONAP, and it’s increasingly taking on the role of selecting stuff that’s important to operators and framing it in a broad ecosystemic context.

What’s also happening is that operators are expanding their focus from operations to devices.  White-box OSs like dNOS offer operators not only an opportunity to adopt SDN, but also the opportunity to adopt an open form of traditional switching and routing.  Best of all, the operators can unite this with ONAP/ECOMP to operationalize the new environment efficiently.  That gives operators both capex reduction and opex improvement.

Another interesting initiative is the one involving ONF and ONAP to integrate optical management and agile optical technology into the overall service management automation/orchestration processes.  This suggests that AT&T, a driver in the initiative, intends to integrate optical management into zero-touch lifecycle automation.  That, of course, also means that open optical gear is likely to come along later, though margins on optical-layer technology are lower and so that layer of the network isn’t a primary target for aggressive cost-cutting.

It’s fair to ask what all of this could mean for the “establishment” technologies in transformation, like SDN and NFV.  On the surface, SDN could be advanced by white-box, but if dNOS follows the path that the ONF established, it could just as easily be used to host legacy switch and router software, and perhaps is more likely to do that because the cost improvement for operators would be realized faster.  On the surface, NFV might also advance because linking NFV better to ONAP could resolve the benefit shortage that’s hampering NFV adoption.  Dig deeper, and it also threatens NFV’s existence.

Let’s start with SDN.  Operators have often argued that we don’t need a new forwarding paradigm or to substitute central explicit routing for adaptive routing.  They’ve voted that point with their wallets, continuing to deploy classic routers and switches.  I think that agile devices, because they can easily transform from legacy adaptive behavior to centrally controlled behavior, offer an easier transformation path, but that doesn’t mean anyone will take it.  Look for some improvement in SDN adoption rates, but not a revolution.

The NFV picture seems more complicated, because it’s hard to say what is really driving NFV at this point.  Fortunately, the possible drivers seem to be converging on the same outcome.  If you track current adoption, then the NFV driver is virtual CPE.  Clearly a dNOS-type solution and a white box could do the same, with no need for unproven approaches.  If you believe that 5G will drive NFV, then AT&T seems to be announcing they’ll depend on dNOS and white boxes instead.  That these would be cell-site routers means that the majority of the gear associated with 5G might be dNOS-based, not NFV-hosted.

The end-game for operators is simple: drive cost out of networking even if it means destabilizing current equipment vendors.  After all, they reason, current vendors have stuck it to us for decades.  It might well be too late for vendors to respond to this in any constructive way, even assuming they were willing to bite the bullet they’ve refused to chew on so far—reduce their profits by supporting open technology.

Some of the changes I’ve noted here have other implications.  One good example is the whole issue of “operator leadership”.  The ONF has changed its web site, with the opening line being “The ONF is an Operator Led Consortium” whose goal is “Transforming Networks into Agile Platforms for Service Delivery”.  I remember that a decade ago, the IPsphere Forum was absorbed by the TMF because many of the EU operators had been told by their legal departments that the operator-control aspect of the organizations created a risk of a collusion charge by regulators.  Apparently that risk has passed, and if that’s true then the current operator steps are only the beginning.

This isn’t the only legal/regulatory issue that the current trend is exposing.  Huawei is a major figure in many of the open and open-source ventures just announced, and ZTE is also involved in some.  In the US, both companies are under pressure because of their Chinese origins.  The FCC doesn’t want to allow any broadband subsidies to be spent on products from China, Best Buy is dropping Huawei products, and the Administration is looking at ways to stop Chinese investment in strategic tech companies.  Might this taint open-source participation by Chinese companies, or sponsorship of open federations by those same companies?  Remember that US operators are already effectively barred from deploying Chinese network technology.

Perhaps the most fundamental question to be addressed in achieving this happy, open future is the usual “who pays?” question.  Vendors have long been the drivers behind all technology initiatives, including open-source.  What will happen if vendors lose so much margin that they can’t justify spending on personnel to do what’s essentially community development?  We already see some vendors ceding their stuff to open-source.  Suppose everyone does that, and then nobody commits bodies to sustaining it?

We’re at the start of something big, but we’ve been here before with other initiatives.  I’d like to believe that open-source was going to burst onto the scene and change the market, but we still have issues that could derail things, including the three I just mentioned.  There’s still time for vendors and operators alike to tune and refine their processes and avoid anything that could slow things down.  Can buyers and sellers cooperate, though?  That would be a real revolution.

Blockchain: What It Is and Where It’s Taking Us

Blockchain is another of the modern topics that seems to have a life of its own.  The problem is that about half the people who think of blockchain don’t really know what it is, and the other half think it’s all about Bitcoin and cryptocurrency.  The application doesn’t define the technology, or at least it shouldn’t.  Blockchain technology has broad applications, but it’s hard to see them without really knowing what it is and does.  Sadly, typical tutorials focus on applications and not on the technology potential.

The basic mechanics are simple.  A transaction against something (the “target” or “subject” of the blockchain) is a “block”.  When blocks are created, they are chained to the prior blocks, hence “blockchain”.  The blockchain is a journal of transactions from the inception of the target to the present.  OK, but that’s probably not really helpful in understanding one.
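
Here’s a minimal sketch, in Python, of the block-and-chain mechanics just described: each block carries the hash of its predecessor, so altering any past record breaks the chain from that point forward.  The structure and field names are illustrative only, not any particular blockchain implementation.

```python
import hashlib
import json
import time

def block_hash(block):
    """Hash of a block's contents, including the hash of the prior block."""
    body = {k: block[k] for k in ("timestamp", "transaction", "prev_hash")}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def make_block(transaction, prev_hash):
    """A block records one transaction plus the hash of the prior block."""
    block = {"timestamp": time.time(), "transaction": transaction, "prev_hash": prev_hash}
    block["hash"] = block_hash(block)
    return block

def chain_is_valid(chain):
    """Every block must link to its predecessor and match its own hash."""
    if chain[0]["hash"] != block_hash(chain[0]):
        return False
    for prev, curr in zip(chain, chain[1:]):
        if curr["prev_hash"] != prev["hash"] or curr["hash"] != block_hash(curr):
            return False
    return True

# A tiny chain: a "genesis" block, then two transactions against the account.
chain = [make_block({"open_account": "storekeeper/customer"}, "0" * 64)]
chain.append(make_block({"buy": "flour", "amount": 4.00}, chain[-1]["hash"]))
chain.append(make_block({"pay": 4.00}, chain[-1]["hash"]))

print(chain_is_valid(chain))              # True
chain[1]["transaction"]["amount"] = 40.00  # try to cook the books...
print(chain_is_valid(chain))              # False: the altered block no longer matches its hash
```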

I think the easiest and best way to start a tutorial on blockchain is to look at a rather old-fashioned financial example, but in a different way.  Suppose you have an account at a store, and you go there periodically to buy things.  The storekeeper puts the items you’ve bought “on account”, and you go in from time to time to pay something on account.  It’s easy to see this as being about the account ledger or transactions, but what it’s really about is trust.  You trust the storekeeper to keep the books, so the truth is that it doesn’t matter much what the format is, or what you decide to buy or pay.

Blockchain is, at its core, a trust community.  There are “members” of the community who have a set of cooperative activities centered around some kind of records.  If your storekeeper let anyone go in and add items to your account, you’d be stuck paying for things you didn’t get.  You trust the storekeeper.  The storekeeper trusts you to pay for what you’ve purchased.  The account is the centerpoint in the trusting process, where the parties come together.

Two-party trust is pretty easy to address, but as the number of parties increases, so do both the issues that could arise and create a dispute, and the risk that parties involved might create those issues.  The traditional mechanism for trust extension in business is a combination of parallel record-keeping together with an exchange of information among parties to ensure the records are in sync.  If there’s a dispute, then a third party has to step in and decide what the truth is.

A variation on this approach for mass merchandizing is the concept of Electronic Data Interchange (EDI).  With EDI, it’s common to use a third party as a transaction intermediary between parties, on the theory that if a neutral player can authenticate the transactions, then it will always be possible to resolve the question of the authentic state of the records.  There may have to be a legal framework to enforce a decision on authenticity, and many EDI users executed agreements to establish that framework in advance.

You could think of blockchain as a kind of super-EDI.  Instead of relying on a third party to mediate transactions, what blockchain does is create a community to jointly mediate the records themselves.  The concept is based on three essential principles, and without them all, it doesn’t work.

The most fundamental of the principles is that of community.  A blockchain has specific “actors” or entities that are part of the community.  These are typically opaque to each other, but they all see all transactions (“blocks”), and the participants in a transaction must all be community members who agree to it.  Further, all community members have a copy of the blockchain.

The second principle is that of authority.  The blockchain is secured with strong encryption to prevent anyone from altering past blocks/transaction records.  Because everyone involved in a transaction had to agree to it, and because nobody can alter records to cook the books, the blockchain is an authoritative record of whatever information it’s designed to keep.

Principle number three is past-state auditability.  The original information in the chain is agreed to by all, and every transaction that changes that information is stamped and signed off on.  You can go to a blockchain and determine, at any point in time, what the state of the records was.  The current state is thus self-proved and self-audited.

If we stay with our example of the shopkeeper and the account, what blockchain does is provide a mechanism whereby the authenticity of the books is assured by the fact that when a transaction is entered, the account book the originator is referencing is supplied.  This means that all the members of the community can test the book (in blockchain form) against their own copy, and more than half must declare the book authentic in order for the transaction to be recorded.
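
The majority test can be sketched just as simply.  In this illustration (again hypothetical, with a plain list standing in for a real chain), a presented chain is accepted only when more than half of the community’s own copies agree with it.

```python
def community_accepts(presented_chain, community_copies):
    """A presented chain is declared authentic only if a strict majority
    of the members' own copies match it."""
    votes = sum(copy == presented_chain for copy in community_copies)
    return votes > len(community_copies) / 2

ledger = ["genesis", "buy flour", "pay 4.00"]     # stand-in for a real blockchain
members = [list(ledger) for _ in range(4)]        # four members hold matching copies
members.append(["genesis", "buy flour"])          # one member is out of sync

print(community_accepts(ledger, members))                   # True: 4 of 5 agree
print(community_accepts(ledger + ["buy caviar"], members))  # False: nobody else has that block
```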

In theory, you can do almost anything with blockchain technology, since the individual transactions don’t even have to be related to each other or anything else.  An example of this seemingly disorderly application type is a journal of events; it’s transactions that matter, not the net result of them.  Despite the broad potential, the largest number of blockchain applications involve something that has state, meaning a set of current conditions that are meaningful to the community.  Our account book had state, and so do nearly all financial transactions.

The broad theoretical impact can be narrowed by practical considerations, the most significant of which is the potential delay associated with achieving consensus on a transaction.  Could you do a blockchain to represent the state of a server (as some have already proposed)?  Yes, but if the community is distributed far enough, then real-time information may be far less than real-time because of the delay in getting approval to post the transaction as a block.

Bitcoin and other cryptocurrencies are good examples of blockchain because they do have state, and because the “transaction” against a Bitcoin is likely a simple change in ownership.  That means that human-level delays in reaching validity consensus probably don’t hurt much.

Bitcoin also illustrates a couple of fundamental issues with blockchain, one being the validity of the authority principle and the other being the issue of representation.  If either of these things is in question, then a blockchain is a complex validation of something invalid.

Blockchain authority is created in part by its presumed immutability; a blockchain has hash totals and encryption, and any time a block is presented it’s compared with the “back-chain” values of other blocks and thus authenticated.  You’d have to both break the encryption and corrupt the community in order to falsify a chain.  But a given blockchain, to a new user, is still a little bit of a trust question.  Suppose somebody gives you a cryptodollar.  You can think that the encryption and community validation protects you by guaranteeing its authenticity, but might the blockchain represent a fraudulent community, not the one you think it does?  Some protection is built into some applications, but it’s still possible to be duped.

The other issue is representation.  What makes a cryptodollar worth a dollar?  Answer:  Someone with a real dollar created the chain to represent the dollar, and “backed” the chain with that amount.  That’s called “provenance”; the backer vouches for the representation, and since that isn’t a transaction/block the representation may be doubtful.  With backing, a cryptodollar is like a paper dollar.  I remember that the US dollar used to say something like “This certifies that there is, on deposit in the Treasury of the United States, one dollar in silver, payable to the bearer on demand.”  If you can exchange a cryptodollar for a real one at the point of its origination, it’s clearly a proper representation.  If you can buy something with it that costs a dollar, it’s an accepted representation, but not necessarily real.  Think the classic pyramid scheme.

The final issue with blockchain is that it’s a chain.  As transaction volumes mount, so does the length of the chain and the resources it takes to transport and process transactions.  Something like a ledger typically won’t have enough transactions to contaminate the value proposition, but in some applications like the recording of events, the length of the chain could be a barrier to adoption.

Despite this, blockchain has applications beyond the obvious.  In the original ExperiaSphere project, I introduced a concept called “SocioPATH™”, a “community” of nodes that were responsible for “voting” on asserted identities for membership, which then established connection and access rights.  You could do something like that with blockchain.  Google’s Wave technology, never carried forward, used a blockchain-like mechanism to establish a secure collaborative context that couldn’t be altered retrospectively and that was protected against intrusion.

What these initiatives show is that many of the blockchain ideas aren’t new, and that there are many elements in blockchain that could be used independently without the total context of blockchain, to develop new approaches to classic IT problems.  It may well be that these nubbins of blockchain will prove more broadly useful than the entire strategy.

Did Oracle Rain on the Cloud?

The media headline was something like “Oracle rains on the cloud.”  Catchy, but is it true?  Oracle turned in some worse-than-expected results despite a fairly aggressive stance on cloud computing.  Is there an Oracle execution problem, a cloud computing problem, or just a company missing its numbers?  We’ll have to dig a bit to see.

First, Oracle beat on EPS and barely missed on revenue, but their position came from a combination of weaker-than-expected cloud and stronger-than-expected legacy.  They had the misfortune to be announcing during a period when tech stocks were under pressure because of tariffs and Facebook privacy fears.  Their cloud revenues were up significantly, if below expectations, but this is the third quarter in a row they’ve come in under those expectations.  That’s why the question is whether Oracle is misjudging cloud potential, missing opportunities, or whether the cloud itself is perhaps a bit of a risk.

To add some color to this, Dell’s CTO has been making a number of statements that aim to discredit the popular notion that all IT is shifting to the cloud, so enterprise server sales are doomed.  Obviously, Dell has its own reasons to want to discredit a mass shift to the cloud, but it’s interesting that these comments are coming just as Oracle is missing on revenue (again).

The reality of the cloud situation, and Oracle’s place in it, is complicated.  The biggest problem we have with cloud computing is the stubborn belief, in defiance of actual industry data, that legacy applications are “moving to the cloud.”  There is some limited server consolidation shifting going on, but in the main the business-critical stuff isn’t moving and won’t be moving any time soon.  Dell’s CTO is absolutely right in saying that enterprises are adopting virtualization and abstraction and that these will deliver most of the benefits of the cloud.  He’s also right when he suggests that enterprises can have hyperscale server farms efficient enough to beat the public cloud on pricing.  All this shows why Oracle actually outperformed with legacy stuff.

The truth about the cloud is that most of its potential lies not in “moving stuff to it” but in “writing stuff for it.”  The big cloud providers know this well, and it’s why they’re focusing so much on web services that provide new development models.  All three cloud giants (Amazon, Google, and Microsoft) offer several dozen such packaged middleware-like toolkits, and all of these are designed not to facilitate moving something but rather to facilitate developing something.

My model says that about 20% of public cloud opportunity comes from “moving” to the cloud.  About 30% comes from using cloud tools to build or rebuild mobile and web front-end processes to legacy applications.  Another 20% comes from rebuilding other deeper applications around cloud capability, and the remainder from totally new cloud-specific development.  That means that 80% of the cloud opportunity is about development, and that of course is why the cloud providers are pushing development-level web services.

Besides aligning with the real-world opportunity, web services also earn cloud providers better margins than IaaS hosting.  The additional capabilities they offer displace costs beyond servers, which makes it possible to sell cloud services more easily by boosting the business case.  Those who have read my blog over time know I’ve always said that IaaS was a false opportunity.

Oracle seems to know this at one level, yet not at another.  On the one hand, the transcript of their earnings call is replete with references to “autonomous” services of one sort or another, and Oracle’s long-term commitment to them.  Oracle uses the term to reference an implementation of PaaS or web-service elements that are self-provisioning, self-sustaining, self-managing.  Think of an autonomous service as an intent-model implementation of a feature set.  Conceptually this isn’t a huge shift from the web service offering of others, but it’s a positioning shift and a potential advantage.  On the other hand, Mark Hurd has this to say: “Let me talk to you a little bit about our ecosystems. Our app ecosystem year-to-date is up 12% and we continue to grow faster than the market. Less than 15% of our apps customers have started to move their core apps to the cloud. Between customers that have partially moved and those not started yet, we have an enormous opportunity in front of us.”  This sure sounds like the old “Love is just around the corner” theme to me.

Mark, those core apps are not going, and strategies aimed at exploiting that presumed migration inevitably end up failing.  Ellison and Hurd need to get their thinking aligned here, and it’s Larry and not Mark who’s right.  However, the problem with the truth here is that it generates a complicated cloud value proposition, particularly for Wall Street, which wants if not instant gratification, at least next-quarter gratification.

The truth about the cloud is that it really is different, it really does change everything.  Why then would we think we could simply move something to it?  This is now, and has always been in the long-term sense, a development-driven migration.  That takes time, which is why it’s not easy to accept.  However, success in that long-term cloud depends on doing the right thing, and it’s hard to see how that happens without accepting the truth.

Even Ellison may not have the vision down pat.  The Oracle concept of “autonomous” bends quite a bit toward a SaaS model, whatever Oracle calls it.  Oracle does have strong support for middleware tools (Java is Oracle’s, after all), but they also have a tendency to play to the crowd by positioning pieces of their approach as “new” even if that disconnects the elements from a common theme, and from each other.  That’s not the best way to create a competitor to the popular cloud web service suites, all of which are presented as an inventory from which developers can draw.

Where does this leave Oracle?  Their understanding of the market challenges may be strong, but they seem to be limiting their ability to move forward by refusing to accept where “forward” really is.  “Play to the Street” is fine as long as you realize you still have to sell to the market.  Oracle needs better positioning, and a stronger set of developer tools for its cloud, to encourage the real opportunity in the cloud.

Where does it leave the cloud?  Nowhere it hasn’t been all along.  Anyone who thought, as Hurd said, that all the legacy stuff is moving relentlessly and automatically toward the cloud is dreaming.  Drawing cloud disappointment from Oracle’s results is easy if you don’t mind overlooking your own ridiculous expectations.  The cloud will move forward, as all technologies do, based on exploiting its benefits.  Eventually even the Street will see that.

Thinking Beyond SDN, Beyond P4

In the last week, I’ve done a couple of blogs mentioning P4.  It should be clear that I’m a fan of the concept, but I want to point out that P4 is (like many other tech developments) at risk of being overhyped.  Does it, as SDxCentral says, take “software-defined networking (SDN) to the next level”?  Not exactly, but P4 is an important piece of the future of SDN, and also a signal for what should happen in areas like IoT.

Network devices are packet processors.  They examine a message header formatted in some protocol standard, apply forwarding rules, and dispatch the message along its way.  In the most general examples, “along its way” could mean into an output queue or into a process that handles the message internally.  The output queue could be part of a data flow end-to-end, or an on-ramp to a central control element like an SDN Controller.

The general practice for building network devices has been to use fixed logic and software control in combination to implement the packet processing logic.  Appliances, like switches or routers, have increasingly relied on hardware enhancements like ASICs to accelerate their logic.  Software switches and routers, lacking custom hardware since they’re designed for commercial off-the-shelf (COTS) servers, do everything in software.  But appliances also use software, and there is considerable interest in using some off-the-shelf silicon in network adapters and custom boards in COTS servers.

In addition to this desire to exploit the best of software and hardware, there’s also increased interest in being more “flow aware” in devices, meaning looking deeper into the header structure to recognize new forms of tunneling or encapsulation.  A pure software instance of a packet processor could be updated to handle new flows, but it might be less efficient as the combination of new headers and flows increases.  Hardware appliances would be harder to upgrade, of course.

What P4 does is provide a high-level language to describe flow-based processing on a per-device basis as a combination of high-level logical tasks.  These tasks are then mapped to actual, arbitrary, hardware and software combinations.  Vendors of software packet handling and custom hardware devices could build “plugins” that would translate P4 functions into the form their own products require, so a properly written P4 program, supported by properly written “P4 compilers” by vendors, would be portable.
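
To illustrate the portability idea without writing actual P4, here’s a Python sketch: one logical match-action description, and two invented “backends” standing in for vendor-supplied P4 compilers, one producing a software data plane and one pretending to target merchant silicon.  The table entries, class names, and crude prefix matching are simplifications of my own, not real P4 or any vendor’s toolchain.

```python
# Illustrative only: a single logical match-action pipeline handed to two
# different "backends".  Entries, actions, and backend names are invented.

PIPELINE = [
    # (field to match, required prefix of that field's value, action)
    ("ipv4.dst", "10.",      "forward:port3"),
    ("ipv4.dst", "192.168.", "forward:port7"),
    ("*",        "",         "drop"),
]

class SoftwareBackend:
    """Turns the logical pipeline into a plain Python lookup, a stand-in
    for a software data plane running on a COTS server."""
    def compile(self, pipeline):
        def process(packet):
            # Prefix matching here is a crude stand-in for longest-prefix match.
            for field, prefix, action in pipeline:
                if field == "*" or packet.get(field, "").startswith(prefix):
                    return action
            return "drop"
        return process

class AsicBackend:
    """Pretends to emit a hardware table image, a stand-in for a vendor's
    compiler targeting merchant silicon."""
    def compile(self, pipeline):
        return f"TCAM image with {len(pipeline)} match-action entries"

forward = SoftwareBackend().compile(PIPELINE)
print(forward({"ipv4.dst": "10.1.2.3"}))   # forward:port3
print(AsicBackend().compile(PIPELINE))     # TCAM image with 3 match-action entries
```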

Improperly-written stuff clearly won’t be portable (or even perhaps workable), but that level of functionality is easy to determine.  The broader problem is what could be called the “P4-in-a-cooperative-system” problem.

Networks are systems of functional units that cooperate to support a common mission.  What, exactly, is that mission?  P4 can define how a unit cooperates within the mission, but not the mission itself.  A P4 programmer could author a router/switch program or a white-box handler for OpenFlow, but a bunch of disconnected P4 programmers could all come up with wonderful whiz-bang switch functionality that, when joined with the results of others’ work in a network, would do absolutely nothing.  The cooperative behavior is not defined; only the unit behavior is.

This isn’t an issue where P4 is used to define a “standard” implementation like IP routing, Ethernet switching, or OpenFlow forwarding, but it does have implications where you’d try to define a new forwarding/flow paradigm, add in a new encapsulation strategy, etc.  There, you’d need to lay out the cooperative behavior expected among the devices, meaning either any exchanges between devices or any exchanges with central management processes.

Coordinating inter-element behavior is something that the P4.org community has dabbled in.  The In-Band Network Telemetry draft describes how state/status information can be communicated for the purpose of supporting network analytics.  What’s not there so far is a general notion of defining cooperative behavior models that build a community of elements and not just an element alone.  There’s no reason why P4 could not provide this sort of modeling of community practices; the tools to coordinate between devices could be added.  The body just doesn’t do that now, so we need to recognize that it has to be done somewhere, and that other bodies may have to step up to do it.

Software-defined networking is more than software-controlled per-device forwarding, so P4 is not going to take SDN to the next level.  What it does is remove a barrier to implementing new levels at the element or node level, if those new levels involve a different model of node packet processing.  We still need to define the overall systemic relationships, and to me that’s what any given level of SDN is really about.

You could argue that P4 makes this systemic-level definition of SDN harder, not easier.  In the past, design of a network device would usually start with the specifications of the data, signaling, and management exchanges among devices, and from that build per-device specifications.  With P4 it’s possible to do the opposite: define per-device behavior from which systemic behavior could theoretically then be synthesized.  That’s the classic bottom-up approach, and it poses risks across the tech and networking space.

That doesn’t mean that P4 is a bad idea, of course.  If you want to define new network paradigms it’s sure nice to think that it’s possible to implement, at the node level, what you’ve defined.  It’s even nicer to know that as new hardware technologies emerge, they can be incorporated in devices and used by the node packet processing logic.  The systemic whole can’t be generated without nodal parts behaving as expected.

There are two truths that emerge from the reality of P4, IMHO.  One is the obvious one—we need to have somebody thinking about systemic-level network behavior and what it could be made to look like if constraints on quickly defining nodal behavior and exploiting modern hardware were erased.  The second one is that packet handling isn’t the only place where something like P4 would be helpful.

Event processing and network packet processing are similar in many ways.  If you believe that the trends in cloud provider interest in event-driven apps are real, then you have to believe that some form of P4-ish event processing language would be very helpful.  Event processing also illustrates the need to have a systemic and local vision of “handling” because events typically have to be correlated or contextualized to be useful.

A further extension to the event story comes in image processing.  We have zillions of security cameras, wildlife cameras, process monitors, and so forth.  Might it be useful to have some form of image processing to facilitate automated recognition of certain conditions?  For sure; we just need to think about images differently.

Every video stream is a stream of images, and it could be the source of a stream of events.  If you could process video images in at least near-real-time, you could extract things from the stream that were significant and trigger processes with the resulting events.  We already do some of this in the work on self-driving vehicles, and the applications are as wide-ranging as improved ad targeting and intelligence.

In all these areas, we need to take another lesson from P4 and start looking from the top instead of focusing exclusively on the bottom.  The lack of an organized framework can lead to so much variability (in the name of flexibility) that compatibility in implementations and ease of use both suffer, to the point where the whole value proposition could be jeopardized.

Why We Need to Pay More Attention to “Events”

One of the big issues in zero-touch automation is event generation.  Since management of services and infrastructure is all about responding to events, it’s pretty logical that getting events to respond to is critical, fundamental.  It’s not so much that we don’t know how to generate events, as that we don’t necessarily know how to generate “good” ones.  Then there’s the fact that the nature of events is related to the flexible, fuzzy, relationship between virtual service elements and infrastructure.

Virtualization overall is a matter of creating and exploiting multi-tenant infrastructure, which isn’t far from how networking has worked since packet switching replaced circuit-switching.  With virtualization, you have services based on virtual resources that map to real ones in some possibly complex way, and this process is what creates the event confusion.

Suppose a “real” trunk connection fails.  That connection might be a part of tens, hundreds, thousands of virtual resource relationships, either as a direct transport conduit or as a pathway over which “interior” virtual connections pass.  In either case, the failure of the trunk is clearly an “event” that has to be handled, and this raises the first of many event questions.

You can’t fix virtual resources, at least not conventionally, and so at one level the right approach to this failure is to fix the real trunk.  In fact, that probably has to be done in any case.  The problem is that the SLAs of some or all of the services impacted might require remediation faster than trunk repair could be expected to complete.  In traditional adaptive networks, in fact, we’d reroute traffic around the failure.  That means that the trunk failure event has to be somehow reflected into the virtual world.

If we assume that services were created from components that were modeled via intent model principles, and if we further assume that these services supported an SLA, and further assume that the intent model elements of a service could each generate an event if the SLA it offered was not met, then we could assume that all service events could be created through intent modeling.  Obviously, few of these assumptions can actually be made in the practical world, because intent modeling is far from a universally-accepted principle, and even where it is accepted there are a wide variety of implementation practices.

A standard intent/event strategy would be extremely valuable in zero-touch automation. In fact, if intent modeling contributed nothing to the process other than a uniform event generation approach, it would be a highly useful step to take. There are a number of bodies, including ETSI and the TMF, that could perhaps be expected to generate a set of event model relationships, but so far none of them appear committed to doing that. Given that, we have to look deeper and find more universally applicable but intrinsically complicated solutions.
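
As an illustration of what a uniform intent/event convention might look like, here’s a small Python sketch.  The SLA fields and the event format are my own invention, not anything ETSI or the TMF has defined; the point is simply that each intent-modeled element polices its own SLA and emits a standard-format event only when it can’t meet it.

```python
import time

def check_intent_element(element_name, sla, measured):
    """Each intent-modeled element polices its own SLA and emits a uniform
    event only when it can no longer meet it (illustrative format only)."""
    violations = {metric: measured[metric] for metric in sla
                  if metric in measured and measured[metric] > sla[metric]}
    if not violations:
        return None   # SLA met: the element stays opaque, no event escapes it
    return {
        "timestamp": time.time(),
        "element": element_name,
        "event": "SLA_VIOLATION",
        "details": violations,
    }

event = check_intent_element(
    "vpn-core-segment-12",
    sla={"latency_ms": 40, "loss_pct": 0.1},
    measured={"latency_ms": 65, "loss_pct": 0.02},
)
print(event)   # latency is breached, so an event is raised to the next model layer up
```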

For services where an explicit commitment between a resource and a virtual element is made, the correlation of a resource event to events associated with virtual elements is fairly straightforward.  When you bind a virtual resource to a real one, you need to create a record of that fact.  Think of the record as being simply a pairing:  Virtual>Real.  When our trunk fails, a database table of these pairings will give you all the associations between the trunk resource and virtual subordinates.
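
Here’s a sketch of that pairing in Python, with made-up trunk and service names.  The point is just that if the Virtual>Real bindings are recorded when they’re made, a single resource failure can be fanned out into per-service events with a simple lookup.

```python
from collections import defaultdict

# Bindings recorded when each virtual element was mapped to a real resource.
bindings = [
    ("vpn-A/segment-3", "trunk-17"),
    ("vpn-B/segment-1", "trunk-17"),
    ("vpn-B/segment-2", "trunk-22"),
    ("cdn-X/path-9",    "trunk-17"),
]

# Index the pairings by real resource so a failure can be fanned out quickly.
by_resource = defaultdict(list)
for virtual, real in bindings:
    by_resource[real].append(virtual)

def resource_failure(resource):
    """Translate one real-world fault into per-service virtual events."""
    return [{"event": "VIRTUAL_ELEMENT_DOWN", "element": v, "cause": resource}
            for v in by_resource.get(resource, [])]

for evt in resource_failure("trunk-17"):
    print(evt)   # three services learn their virtual element has failed
```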

This point illustrates some very important truths about events.  Most important, it shows that if you want to have service-level responses to specific resource conditions, then you need to define a virtual-to-resource binding.  That means that building services with specific SLA remediation expectations means either accepting SLAs from underlying facilities and foregoing explicit remediation, or building the services on virtual elements that can be explicitly bound.

This latter requirement obviously demands an example.  Say you have a VPN service that maps to a resource-layer multipoint service made up of a thousand devices and thousands of trunks.  You can’t expect to remediate a trunk failure in that situation because your mapping isn’t to the right level of resource.  If instead you build your VPN from virtualized elements that are specifically mapped to hosting points, tunnels, etc., then you can design your zero-touch system to be able to report a failure of the bound resources to the service level for remediation.

The second point the binding pairing shows is that for multi-tenant services it quickly becomes impractical to bind tenant service SLAs to resource state in order to generate events.  Multi-tenancy is inherently resource sharing based on some rules, and you can’t bypass the rules and the multi-tenant-shared resources by binding across them.  Multi-tenant services abstract the services above the multi-tenant service-level virtual resources, and in turn map those virtual resources to the real ones.

Obviously, things would be easier if we could presume that “virtual” failures were detectable as events.  Take the example of a virtual wire that transits a half-dozen trunks and nodes.  The higher-level response to a fault in this wire would be the same whatever real resource failed—reroute.  Monitoring some virtual resources is straightforward; if the resource is a virtual device or software-hosted feature instance, it can expose a management interface that could itself be (directly or indirectly) an event source.  If the resource is a virtual wire, the options are more complicated.

There have been two general approaches to making virtual wires event sources.  One is what could be called the “admission point” approach.  If you can monitor entry/egress to a virtual wire, you can probably inject test packets or tag normal ones, and by doing that get a measure of the health of the path.  The other is to actually peek at the traffic along the way.
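
A sketch of the admission-point approach might look like the following, with an invented probe function standing in for real test traffic over the virtual wire: inject a handful of probes at ingress and judge path health from what emerges at egress.  The budgets and thresholds are arbitrary.

```python
import random

def probe_virtual_wire(send_probe, probes=10, loss_budget=0.1, latency_budget_ms=50):
    """Admission-point health check: inject test packets at ingress and see
    whether they arrive at egress inside the loss/latency budget."""
    delivered, latencies = 0, []
    for _ in range(probes):
        result = send_probe()          # returns latency in ms, or None if lost
        if result is not None:
            delivered += 1
            latencies.append(result)
    loss = 1 - delivered / probes
    avg_latency = sum(latencies) / len(latencies) if latencies else float("inf")
    healthy = loss <= loss_budget and avg_latency <= latency_budget_ms
    return {"loss": loss, "avg_latency_ms": avg_latency, "healthy": healthy}

# Stand-in for a real probe over the tunnel: random jitter, occasional loss.
def fake_probe():
    return None if random.random() < 0.02 else random.uniform(20, 35)

print(probe_virtual_wire(fake_probe))
```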

Until recently, the latter approach has been generally rejected for its impact on performance of the link.  However, there have been dramatic improvements in merchant silicon to facilitate packet processing, even where “deep header” inspection is required.  The P4 language I mentioned in a previous blog is designed to facilitate the description of “flow programs” that could perform monitoring/tagging functions as well as determine forwarding paths.

P4 and the associated silicon, likely used in embedded appliance or network adapter missions, could provide a better way of doing admission-point tagging and analysis, and a fairly practical way of doing deep-header inspection too.  It’s well within the realm of possibility that P4-driven silicon could provide monitoring of the virtual pathways that services of all kinds are likely to use.  If the silicon were powerful enough, it would be possible to make an interface or P4 switch flow-aware down to the specific virtual pipe level, and it’s certainly possible to be able to gather statistics and conditions on “class-of-service” trunks that were aggregates through which more detailed flows passed along well-traveled routes.

Having virtual-flow analytics would simplify the problem of generating relevant service events considerably.  The same technology might also be able to offer better responses to problems, if the virtual flows could be quickly diverted to “standby” routes or failure mode states, for example.  However, doing this, even with P4, requires coordination.  Switch programming still programs only one switch at a time, and so you need to coordinate behavior across switches to create a flow.  You also need to be able to relate per-switch programs to flow behavior when you’re analyzing virtual pipes.  Thus, there’s still work to do.

There’s also some confusion to address.  Many believe that “intent modeling” would eliminate the concern about events overall, or would eliminate at least the need to couple hardware conditions to service events.  That’s not really true.  Intent models can enforce an SLA within themselves, but that enforcement is almost certainly going to require the same kinds of management as we’ve talked about here.  The model makes the event exchange opaque from the outside, but it doesn’t eliminate it.

The net here is that events are way too important to be left to almost accidental or peripheral discussions, which is where they are now.  The same is true with “state” meaning the current condition of a cooperative system.  If you define states and events, you define the process context you’re expecting, and we can’t do zero-touch automation without that kind of definition.  We really need to press the bodies claiming to be discussing zero-touch automation to frame their event expectations early, and we then need to examine them closely to be sure they match our conception of the relationship between resources and services.

SDN and SD-WAN: Converging to Create What?

If software defines everything, is everything converging?  Obviously not, but some software-defined things probably are converging, and there’s no better example than that offered by SDN and SD-WAN.  The big questions raised by that convergence are what emerges in the way of a combined model, and what group of vendors/supporters wins in the mash-up.  Will the future be SDN-ish, driven more by SDN features and vendors, or will it belong to SD-WAN?

Let’s first dispense with the “probably” qualification in my opening.  SDN and SD-WAN are converging, but that’s not admitted by all the players.  Software-defined networking (SDN) emerged in the form of the Nicira overlay network technology aimed at creating effective multi-tenant cloud networks.  That first mission of SDN was to separate a physical network into multiple virtual networks.  SD-WAN is also a form of overlay technology, but one designed originally to combine multiple physical networks into one logical network.  Separating and combining are opposite processes, but they’re both a form of mapping between the physical and virtual.

The separation between SDN and SD-WAN really came about as the formal standardization of SDN emerged in the ONF.  SDN became not an overlay technology as the original Nicira model proposed, but rather a substitution for a traditional adaptive forwarding technology that was based on explicit forwarding tables maintained through the agency of a centralized controller.  The OpenFlow evolution hasn’t been enormously successful, as my blog yesterday on the new ONF initiative (Stratum) concludes.

The problem with this evolution of SDN, from a pragmatic marketing perspective, is the kind of transformation required in network infrastructure to adopt it.  Switching to explicit white-box forwarding on a large scale is simply not practical, unless there’s some massive range of service benefits that could justify it.  Even the AT&T dNOS and ONF Stratum initiatives, and P4-based network devices, don’t change the fact that SDN is still about the forwarding infrastructure itself.

SD-WAN, in contrast, is about connecting and connection management.  The nice thing about SD-WAN and the original overlay model is that, being overlays, they don’t require transformation of the underlying infrastructure.  Instead they map services to a new layer on top of that infrastructure, and there they’re free to adopt whatever connectivity forwarding and connection management rules they find convenient.  SD-WAN fits a “mister outside” model, a model that puts it at the point of connection with the user, and that’s the ultimate high ground.

Buyers report that SD-WAN is cheaper even than “commercial” overlay SDN, and it’s also reportedly a lot more operationally friendly.  That combination has made SD-WAN a favorite of managed service providers who position it as a kind of multinational VPN without MPLS interconnect problems.  If the SD-WAN product has prioritization features, it can even offer a better managed service model where the underlying service is homogeneous, or it consists of one VPN and the Internet.  But most of all, SD-WAN is about WAN, which is where SDN needs to be to succeed broadly.

It’s pretty clear that SDN is going to maintain a datacenter-centric mission unless something radical happens to justify SDN being extended into the WAN. It’s pretty clear that SD-WAN is already in the WAN, and because of that any attempt by SDN to penetrate the space is necessarily going to cause a functional and marketing collision with the other technology.

It might be possible today to see an example of that collision in the Nuage SDN product from Nokia, which is an SDN product that includes SD-WAN capability. Nuage is arguably the most functionally useful of the SDN implementations. It doesn’t conform to the strict OpenFlow white-box-centric new forwarding paradigm of the ONF.  Instead it goes back to the original roots of SDN with Nicira-like technology. Perhaps that regression is why Nuage is able to provide something which is, in a functional sense, almost a bridge between SDN and SD-WAN.  But how much of the SDN side is baggage now, if the SD-WAN space is the end of the rainbow?  What other useful or even critical SD-WAN stuff could we find in that space, stuff SDN has yet to address?

I blogged about what that stuff might be, over the last ten days.  The clear truth is that if both SDN and SD-WAN are about virtualization, then it’s time to start thinking about virtualization in the broadest sense.  That’s the only way we can hope to understand where the two converging technologies are converging.  Where is virtualization in networking services heading?  Ultimately, toward creating the truest form of “network as a service” (NaaS), a form where what’s connected is connected by policy.

The basic notion of NaaS is to provide policy-managed connectivity over arbitrary infrastructure, in such a way as to separate services that need to be separate and make inclusive those that need to be.  To make NaaS work, you need to have physical connectivity across your community of users, though it doesn’t have to be provided to every user in the same way.  The NaaS overlay would harmonize any differences in how lower-level services worked, and create communities that could draw from any mix of underlay technology.

SDN today has focused on providing alternative underlay networking, not NaaS, though as I’ve said the earliest models of “SDN” were overlays and offered NaaS services.  So do some of the current options for SDN, in fact.  But it’s SD-WAN that most clearly aligns with the original NaaS concept, and so if the NaaS model is the future, it’s a future that SD-WAN would be best positioned to control.  Is NaaS the model?

The mission of most policy-managed connection models, and even of access control technology at the application level, is to create virtual subnetworks.  Presume universal open connectivity and then bar the stuff that you don’t like.  This model poses pretty significant challenges even today because of the difficulty in ensuring that the prohibitions on connectivity are established and maintained in the face of dynamic worker movement.  Start moving resources around and it can only get worse.

In a dynamic cloud framework, you host stuff where you need to, based on a combination of hosting policies and application needs, tempered by availability of the resources.  From the first this created issues with addressing resources, because of the classic what/where dichotomy of IP networks.  An address gets traffic to the addressed entity, which means that the address represents the entity at a logical level and the hosting point at a real level.  Virtualization’s elasticity messes up this picture.  Amazon, with Elastic IP Addresses, provided a mechanism for tying a logical resource (the elastic address) to the physical resource (the private cloud address).

The problem with this basic level of virtualization support is that it doesn’t establish the community entitled to communicate, only the means of connection.  Fundamental to NaaS is the notion of “services” or “service subnets” that represent a collection of things that can connect among themselves.  That could be provided by SD-WAN, but the way it gets provided is as important as the facilities.  A pure policy-and-firewall model only establishes an SD-WAN as the place where you have to maintain connectivity rules, not a way to make the rules easier to apply and maintain.

To make service/subnets work, you have to start with logically named entities and deal with addresses as nothing more than a convention needed to map “virtuality” to reality.  You then have to provide a way to associate the names into communities, and only then map to a way of addressing them over the set of physical facilities that exist.  This capability could be offered by SD-WANs (and is offered in some cases) but it’s not the basis for most current products, and isn’t even provided in useful form in all of them.  If you’d like to read about the SD-WAN offering we think best fits our future-needs model, we’ve produced a free report.
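
Here’s a sketch of that names-first model, with invented communities and addresses.  Connectivity decisions are made entirely on logical names and community membership, and the address mapping is applied only at the very end, as plumbing; none of the names or structures here come from any actual product.

```python
# Names first, addresses last: a hypothetical model of "service subnets" as
# communities of named entities (illustrative data, not any vendor's design).

communities = {
    "payroll":     {"hr-app", "payroll-db", "branch-12.clerk"},
    "engineering": {"git-server", "build-farm", "branch-7.dev"},
}

addresses = {   # mapping maintained by the overlay, invisible to users
    "hr-app": "10.1.4.20",
    "payroll-db": "10.1.4.21",
    "branch-12.clerk": "172.16.8.3",
    "git-server": "10.2.0.5",
}

def may_connect(a, b):
    """Two names may connect only if they share at least one community."""
    return any(a in members and b in members for members in communities.values())

def connect(a, b):
    if not may_connect(a, b):
        raise PermissionError(f"{a} and {b} share no service community")
    return addresses[a], addresses[b]   # address mapping applied only at the end

print(connect("branch-12.clerk", "payroll-db"))    # allowed within "payroll"
# connect("branch-12.clerk", "git-server") would raise: different communities
```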

We have done a great disservice to the SD-WAN space, to the SDN space, and to networking overall, by conflating solutions that don’t have much in common except an acronym.  SD-WAN is perhaps the worst of the lot, because there are so many different ways of seeing the technology, and so few that are actually leading us to where we need to be.  What will enable SD-WAN, and SD-WAN vendors, to win the race with SDN is an awareness of the real, long-term, value proposition of NaaS.  Some of that awareness is emerging in the market even now.

SDN may undergo a transformation as new forwarding-control technologies and architectures emerge.  I blogged about AT&T dNOS and ONF Stratum earlier this week, and they could certainly change the game and help get SDN out of the data center.  But all the way to the edge?  It’s hard to see how these SDN advances do anything more than harden SDN into transport flow management missions, and the money will always be where the user connects with what the user is paying for—connectivity.  This is SD-WAN’s game to lose.

How Two Initiatives Could Change the Face of Networking

Remember the AT&T open-source white-box switch software?  AT&T announced the dNOS white-box operating system just three months ago and open-sourced it a month later.  Now there’s a competing venture, from the ONF, called “Stratum”.  What are the two approaches doing, and what impact might they have on the SDN market?  Are they even that different?  We’ll take a look at all that here.

SDN has always been a rather fuzzy space.  Back at the beginning five years ago, I said there were really three different models of SDN—a “software-controls-legacy-router” model favored by vendors like Cisco, an overlay model supported first by Nicira (later acquired by VMware), and an explicit centrally controlled forwarding model defined by the Open Networking Foundation (ONF).  This is still true today, and some operators have been frustrated by the fact that the first two models have dominated the market, in no small part because (the operators feel) vendors haven’t supported the “best” and most cost-transforming model, the ONF approach.  White-box switches, simple generic hardware devices to replace proprietary switches and routers, were the response, and there were two problems with that, one practical/market and one technical.

The practical problem is the need to displace current router technology with a new centrally controlled forwarding paradigm.  That kind of transformation makes operators really antsy, and the benefits of central control can’t be realized (even if you accept them) unless you pretty much fork-lift out the traditional technology and replace it.  That kind of massive sunk-cost risk makes operators even antsier.

The technical problem with white-box switches is that hardware alone isn’t going to create a software-defined network solution.  Switch software is needed, and that was what AT&T was saying in their dNOS announcement.  dNOS stands for “Disaggregated Network Operating System”, and AT&T thinks it will create a platform that, when combined with merchant silicon advances, ensures that network devices will be open and fully exploit new capabilities.  It’s about “innovation” for AT&T, according to their paper, and the fact that it would break the proprietary lock that vendors have on large transport IP devices probably doesn’t hurt either.

dNOS devices (routers, in AT&T’s terminology) have a distinct separation between control and data planes, well-defined APIs between the two, and vertical APIs to link to central management.  It’s a three-layer structure: the top layer (Applications) implements routing protocols and management elements, the middle layer (Shared Infrastructure and Data) manages the database functions and chassis resources, and the bottom layer (Forwarding and Hardware Abstractions) provides models of the functionality of the hardware, which can then be mapped to the specific devices supported.  The model is based in part on the P4 language, which lets you describe data-plane behavior exactly, rather than influencing generic data planes in very limited ways.  We’ll come back to P4 below.
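
To make the layering a little more tangible, here’s a rough Python sketch of where those API boundaries sit.  This is my own illustration, not AT&T’s code, and every class and method name is hypothetical.

```python
# Rough illustration (not AT&T's code) of the three dNOS layers and the APIs
# between them.  All names are hypothetical.
from abc import ABC, abstractmethod

class ForwardingAbstraction(ABC):
    """Bottom layer: a model of forwarding capability, mapped by a
    target-specific driver onto the actual merchant silicon."""
    @abstractmethod
    def install_rule(self, rule: dict) -> None:
        ...

class HypotheticalSiliconDriver(ForwardingAbstraction):
    def install_rule(self, rule: dict) -> None:
        print("programming chip with", rule)   # stand-in for the real SDK call

class SharedInfrastructure:
    """Middle layer: databases and chassis resources shared by the layer above."""
    def __init__(self, forwarding: ForwardingAbstraction):
        self.forwarding = forwarding
        self.route_db = {}

class RoutingApplication:
    """Top layer: routing protocols and management, using only the
    shared-infrastructure APIs rather than touching hardware directly."""
    def __init__(self, infra: SharedInfrastructure):
        self.infra = infra

    def on_route_learned(self, prefix: str, next_hop: str) -> None:
        self.infra.route_db[prefix] = next_hop
        self.infra.forwarding.install_rule({"match": prefix, "out": next_hop})

app = RoutingApplication(SharedInfrastructure(HypotheticalSiliconDriver()))
app.on_route_learned("10.0.0.0/8", "192.0.2.1")
```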

In theory, the dNOS model can be applied to devices at any level, but AT&T’s proposal is clearly aimed at the “core router” and MPLS handling.  The white paper doesn’t make a big thing of OpenFlow, for example, and instead focuses on what for all practical purposes seems to be a router instance hosted on a specialized but still commercial-silicon-based chassis rather than on a general-purpose computer.  Because the dNOS model focuses on traditional routing behavior, it’s easy for AT&T to integrate with existing networks.

The fact that AT&T doesn’t mention OpenFlow might be an indicator of why the ONF is now looking independently at something new.  “Stratum” is the result, a project that the ONF says “Delivers on the ‘Software-Defined’ vision of SDN.”  Stratum, like dNOS, expects that data-plane devices will become fully agile and programmable (via P4, in both cases), and that capability means it’s not necessary, or even useful, to classify network devices based on their “protocol”.  Protocol behavior is a P4 program, and so you need to rethink the relationship between devices and device control to accommodate the additional flexibility.

Stratum is a kind of P4 abstraction layer that sits on top of the various silicon implementations and understands P4 commands and how to implement them.  This is going to require a plugin process, I presume, to match the Stratum layer to the specifics of the hardware.  Stratum also provides a management and operations framework in which the P4 interpreter runs, and it exposes interfaces northbound via these features to connect to higher-level network applications.  These interfaces are different from the old ONF model, not because changing them was the goal but because you need a new API model to manage a P4-based forwarding abstraction.

The ONF documents point out that while OpenFlow was designed to control forwarding behavior, it wasn’t created to define it.  You could write a P4 program that implements OpenFlow, one that implements Ethernet switching or IP routing (including MPLS), or one that implements any arbitrary forwarding behavior you like, controlled by any combination of in-band adaptive exchanges (like IP) or centrally managed processes (like OpenFlow SDN).  Thus, in a sense, P4 opens a wider range of behaviors that includes everything we have now in both the legacy IP/Ethernet and SDN worlds.  In fact, you could write an overlay SDN program or an SD-WAN program in P4 and control either or both of those behaviors in a generic device.
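
A toy Python stand-in can show the point, with the obvious caveat that real P4 runs against hardware pipelines rather than interpreted tables: the forwarding engine below is completely generic, and “what kind of device it is” comes entirely from the table program loaded into it.  The fields, ports, and tables are purely illustrative.

```python
# Toy illustration: a generic match-action engine whose personality comes
# entirely from the "program" loaded into it, which is (loosely) what P4
# does for real data planes.  Fields and ports are illustrative.
class MatchActionEngine:
    def __init__(self):
        self.table = []                     # ordered (match_fn, action) pairs

    def program(self, entries) -> None:
        """Load a forwarding behavior into the generic engine."""
        self.table = list(entries)

    def forward(self, packet: dict) -> str:
        for match, action in self.table:
            if match(packet):
                return action
        return "drop"

engine = MatchActionEngine()

# Program it to behave like an Ethernet switch...
engine.program([(lambda p: p.get("dst_mac") == "aa:bb:cc:dd:ee:ff", "port1")])
print(engine.forward({"dst_mac": "aa:bb:cc:dd:ee:ff"}))   # -> port1

# ...then reprogram the very same engine to behave like an IP router.
engine.program([(lambda p: p.get("dst_ip", "").startswith("10.1."), "port2")])
print(engine.forward({"dst_ip": "10.1.4.17"}))            # -> port2
```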

There are some new APIs in Stratum, but software APIs are way different from protocols like OpenFlow or hardware interfaces.  It’s fairly easy to map between software APIs, and so it wouldn’t be rocket science to adapt many network software stacks to Stratum.  The ONF offers examples of this kind of mapping, using popular projects like ONAP and popular initiatives like 5G.  But as I said before, APIs aren’t the story; it’s really P4 that makes both these initiatives different.  The ONF just calls this out more directly.

The ONF says that this eliminates the notion of “Black-“ or “White-Box” switches, because a generic switch is a P4 machine that’s as much of either the adaptive past or the explicit future as you like.  While P4 is a per-device language, it can be used to define “logical” pipelines that could correspond to tunnels or virtual routes, simply by coordinating the behavior of the P4 switching elements across multiple devices.   P4 is do-it-yourself forwarding, broadening the range of what a “router” could do to the very limits of silicon technology.  Stratum codifies the transformation that P4 creates.

How does Stratum relate to dNOS?  You can get a part of the answer by looking at the Google partnership in Stratum development.  Google is providing the foundation code for Stratum, code it uses internally to host its Espresso network operating system.  The ONF provides a set of diagrams that depict a variety of network OSs sitting on Stratum, some aimed at more SDN-ish missions and some at traditional switching/routing.  In truth, a P4 Stratum base means that the devices themselves are non-mission-specific, and the way that P4 defines data plane behavior (which is controlled by the NOS above) is where the mission comes in.

That makes Stratum, in effect, the bottom abstraction layer of dNOS, and dNOS, in my view, one of the NOS options that Stratum could support.  The two projects, then, are more complementary than contradictory or competitive.  That might make them even more important, because it suggests a couple of very critical changes are coming in the network equipment space.  These could give a new decoding to the “RINO” acronym—“Router in Name Only”.

The first and most important of these changes is that the combination of P4 and a generic appliance architecture, something like that of the Facebook-spawned Open Compute stuff, would open the network equipment market to pretty much anyone.  Clearly, the network operators in both projects would like to be able to make equipment a commodity play.  Clearly, their participation is a signal they’re prepared to push that goal aggressively.  Clearly, they will eventually succeed.

The second point is that there will still have to be some over-arching framework for the deployment of P4-compatible boxes in a network.  One box’s forwarding rules don’t, by themselves, get packets delivered end to end.  Box-level behavior has to be coordinated across all the boxes and their interfaces, either through a set of policies fixed in each box by a P4 program, or centrally through procedures similarly defined in P4.  Since these same box-to-box procedures need to be coordinated with those of existing equipment during a transition, that gives current vendors an edge—if they play it smart.

Which brings up the third point, historic incumbent stupidity driven by a desire to maintain current revenue flows at all costs.  This P4-driven approach is going to eventually destroy traditional network equipment models.  If the current router vendors can accept that and offer respectable transformation strategies and software products, they will retain a position (yes, a diminished one) with their customers.  If they don’t do that, then they’ll spawn a whole legion of new competitors who will do the right thing, and they’ll lose all their positions.

Point four involves Google.  Remember that Google has already deployed a new-model SDN forwarding in its network, and adapted to router protocols like BGP at the boundary point.  Google is contributing code here, and perhaps also the mindset of creating a true SDN core with a thin router veneer.  That would allow operators to transform to pure SDN much faster.

The final point is that the network vendors who sell to those operators will now all need to find new positions, to supplement their loss of traditional device revenues.  One logical place to do that is below, in the optical layer.  A second is in the building of generic boxes backed by a known vendor and carrier-quality fabrication and components.  A third is the cloud, and the final one is the radio-access network (RAN, and for 5G the New Radio or NR).  Cisco is obviously already jumping at the first and third of these spaces.  Other network incumbents will have to move quickly to cement their own choices.

This isn’t going to be an easy transition for network vendors, but it’s pretty obvious that with the exploding interest of operators in this new “agile-box” technology, combined with the fact that there are real specifications (P4) and hardware advances behind it, something is going to come of it.  The good choices for vendors are likely to be used up pretty quickly, so my advice to the network equipment vendors is simple—don’t delay.

Some Specific Early Experience in Zero-Touch Automation

In a couple of past blogs on lifecycle management, I mentioned my “older” ExperiaSphere project.  The project was one of the earliest tests of zero-touch automation, launched with operator support.  There is still some documentation on the ExperiaSphere website, but some of you have asked me to explain the original project, hopefully relating it to the current state of zero-touch automation.  I think the discussion might raise some of the important issues we face in deploying hosted features and orchestrating for zero-touch automation, so here goes.

ExperiaSphere in its original form came out of a request by five Tier One operators who were working with me on the TMF Service Delivery Framework (SDF) project.  Their concern was that the notion of SDF was fine at a conceptual level but not necessarily one that could be implemented.  To quote: “Tom, we are 100% behind implementable standards, but we’re not sure this is one.”  They wanted me to approach the problem as a software architect, and stick as close to the SDF approach as possible to validate (or refute) it.  I demonstrated that a valid Java-based implementation could be developed, and made several presentations to the SDF group with the project results.

The concept of SDF was to deliver services through “representational modeling”.  Some abstract component, presumably a piece of software with specific interfaces, would “represent” a service feature and (again presumably) participate as a proxy for that feature in lifecycle processes.  It would then execute the necessary steps on the “real” feature and infrastructure elements through their respective APIs.

The model I developed was based on two high-level concepts—the “Experiam” and the “Service Factory”.  An Experiam was a Java object, built from a base class that defined the broad architecture of a representational model.  The object was customized to provide the “representation” part, and it contained all the state/event logic needed to recognize and progress through service lifecycle phases.  A Service Factory was a Java application, written using Experiams, that when first instantiated wrote out a service template defining everything it needed to order and deploy the service.  When a copy of this template was filled in and sent to the same factory (or another instance of that factory), the service would be deployed according to the template.

The concept of an Experiam was necessary (in my view at the time) to represent a deployable equivalent of a TMF NGOSS Contract, which is a data model.  Each element in the service would be represented by both a data-model element and an Experiam that defined that element and was responsible for executing its lifecycle processes.  No Experiam had to be a generalized tool; it only had to do what it was written to do.  Instead of making service creation a process of model definition, ExperiaSphere in this form made it a software development process that created the data model as a byproduct.
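
The original Experiams were Java objects, but a compact Python sketch can show the shape of the idea: a base class that carries the state/event machinery, subclassed to “represent” one specific feature.  The states, events, and the VPN example here are hypothetical.

```python
# Sketch of the Experiam idea (the real implementation was Java): a base
# class supplies the state/event plumbing, and a subclass "represents" a
# specific feature.  States, events, and the VPN feature are hypothetical.
class Experiam:
    def __init__(self, name: str):
        self.name = name
        self.state = "ORDERED"

    def handle(self, event: str) -> None:
        # Route the event to a handler named for the current state/event pair.
        handler = getattr(self, f"on_{self.state.lower()}_{event.lower()}", None)
        if handler:
            handler()

class VPNFeature(Experiam):
    def on_ordered_activate(self):
        self.deploy()                 # drive the real feature through its API
        self.state = "ACTIVE"

    def on_active_fault(self):
        self.state = "RECOVERING"
        self.redeploy()

    def deploy(self): ...
    def redeploy(self): ...

# A Service Factory would assemble Experiams like this one, emit the service
# template they imply, and later drive their lifecycles from that template.
vpn = VPNFeature("hq-vpn")
vpn.handle("ACTIVATE")
print(vpn.state)                      # -> ACTIVE
```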

Experiams could represent three broad classes of things.  First, they could represent coordinative processes designed only to organize the lifecycle processes of a hierarchy of pieces, each in turn represented by Experiams.  Second, they could represent service control or management elements that were part of the service’s actual implementation.  I did a demo with Extreme Networks to show content delivery prioritization based on web video selection, for example.  Finally, they could represent interfaces into Service/Network/Element Management Systems that controlled traditional network behavior.

Service Factories were the instrument of execution and scalability.  A service template lived in a repository, and when an event occurred, the template was extracted and dispatched to any convenient instance of the Service Factory, including a new one.  This instance could process the event because it had everything associated with every Experiam’s data model in the template.  The Service Factory concept was loosely adopted by the TMF SDF group.

An Experiam could represent another service template, making the concept self-federating.  A “proxy Experiam” in the main-line model would bind to a second service template that created another model.  That second template could be located anywhere, and the binding could be either tight (the interfaces specifically defined in the models) or indirect at order time (“Behavior binding”).  In the latter case, the low-level Service Factory advertised the “Behaviors” it would support in a directory, and the high-level Factory would then search that directory for a Behavior with the right characteristics.  When it picked one, the binding would be finalized.
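
A few lines of Python can sketch the Behavior-binding step, on the assumption (mine, not the original specification’s) that the directory is simply a searchable list of advertisements; every name and URL is a placeholder.

```python
# Sketch of "Behavior binding": low-level Service Factories advertise the
# Behaviors they support in a directory, and a high-level factory binds to
# one at order time.  The directory format, names, and URL are placeholders.
directory = []

def advertise(factory_url, behavior, **characteristics):
    directory.append({"factory": factory_url, "behavior": behavior,
                      "characteristics": characteristics})

def bind(behavior, **required):
    """Return the endpoint of the first factory whose advertisement matches
    the requested Behavior and characteristics."""
    for entry in directory:
        if entry["behavior"] == behavior and all(
                entry["characteristics"].get(k) == v for k, v in required.items()):
            return entry["factory"]
    return None

advertise("https://factory.example.net/orders", "content-delivery", region="EU")
endpoint = bind("content-delivery", region="EU")   # binding finalized here
print(endpoint)
```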

The big concern I had about this approach was that it relied on software development for service creation.  The operators at the time weren’t too worried about that, because they said that services weren’t really ad hoc.  Today I think they might take a different view.  I also wonder whether there isn’t some fairly contained number of basic service models, in which case a general software toolkit that could interpret data models, or a state/event table within such a data model, might be a better approach.

I think that the best attribute of this original ExperiaSphere model is explicit scalability.  Every instance of a given Service Factory can build from or sustain the service order template that factory generates.  You can spin up as many Factories as you need, wherever you need them.  Scalability under load is essential for an NFV/zero-touch solution because network events that stem from a common major failure could generate an avalanche of problems that would swamp a simple single-threaded, serial software implementation.

ExperiaSphere provided an “encapsulated” model of state/event processing.  Each Experiam was itself a finite-state machine that represented the component set it was associated with, and was responsible for initiating its lifecycle processes, whether they were processes related to implementation of an element of the service, or processes that simply organized the combined state of lower-level elements.  You could either embed the necessary implementation processes at each state/event interface, or call them externally.  The latter approach would converge on a model of implementation where a general “Experiam” used the data model state/event table to create specific processes.  I didn’t implement that for the SDF test, though.

I mention this point because I think it’s the other critical question, beyond the scalability point.  Unless you want to build software from scratch, or adapt software you already have, to support specific state/event lifecycle progression, you need to have an agent that does that.  The agent can either be specialized to the mission (you write the logic into “Experiams”) or generalized, based on interpreting a data model.  The first ExperiaSphere project was based on the first approach, and the subsequent project design (I didn’t have the time to do another implementation) for what’s now “ExperiaSphere” was based on the second.

The final element in the ExperiaSphere project was event distribution.  Lifecycle automation has to be event-driven, and having a state/event process doesn’t help if you can’t recognize events.  My work showed there were two kinds of events to be handled.  The first are generated by service lifecycle processes themselves, directed to either subordinate or superior elements to signal a change in state.  The second are generated “outside”, either by infrastructure management or by higher-level service processes, like customer orders or changes.  All internal events are easily handled; you know who you’re directing them to, so you simply dispatch them to a compatible Service Factory.  External events have to be associated with one or more services, which is easy for higher-level events because they’d have a service order to direct something to.  For infrastructure management events, I used an “infrastructure MIB” to hold all management data; MIB proxies queried real MIBs to populate this database.  A daemon process that ran whenever a status change occurred in the infrastructure MIB then “published” the relevant changes, which services subscribed to.
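
Here’s a minimal publish/subscribe sketch of that mechanism, with the usual caveat that the real implementation carried far more context; every resource name and callback below is hypothetical.

```python
# Minimal sketch of the infrastructure-MIB pattern: proxies write status into
# a shared store, a "daemon" step publishes changes, and subscribed services
# receive them as events.  Names and resources are hypothetical.
from collections import defaultdict

infrastructure_mib = {}                    # resource id -> last known status
subscriptions = defaultdict(list)          # resource id -> service callbacks

def subscribe(resource_id, service_callback):
    subscriptions[resource_id].append(service_callback)

def update_mib(resource_id, status):
    """Called by MIB proxies after polling the real device MIBs."""
    if infrastructure_mib.get(resource_id) != status:
        infrastructure_mib[resource_id] = status
        publish(resource_id, status)       # only changes get published

def publish(resource_id, status):
    for callback in subscriptions[resource_id]:
        callback({"resource": resource_id, "status": status})

subscribe("trunk-17", lambda ev: print("service A reacts to", ev))
update_mib("trunk-17", "UP")               # first status: published
update_mib("trunk-17", "DOWN")             # change: published again
```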

This maps out a reality for NFV, which is that if you want to minimize integration, you need to first create as standard a set of states and events as possible, so there’s no variability across elements in how software links to network conditions.  Then you have to define “standard” processes with standardized interfaces for each state/event intersection, and use a software process (like the Adapter design pattern) to map your current software components to those standards.  You can simplify this, as I did with ExperiaSphere, by letting every process have access to the service data model and by defining the states and events in a “process template”.
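
A short sketch of that idea, under my own assumptions about the state/event catalog and the adapter interface (none of this is specified by the ISG or the TMF):

```python
# Sketch: a standard state/event catalog, a process template mapping each
# state/event pair to a standard process, and an adapter wrapping existing
# software so it fits the standard interface.  All names are assumptions.
STATES = ["ORDERED", "ACTIVE", "FAULT"]        # the agreed "standard" catalog
EVENTS = ["ACTIVATE", "ALARM", "RESTORED"]

def legacy_provision(device, params):          # pre-existing component with
    print(f"provisioning {device} with {params}")   # its own calling convention

class ProvisionAdapter:
    """Adapter: expose the standard (service model, event) interface and
    delegate to the legacy component."""
    def run(self, service_model, event):
        legacy_provision(service_model["device"], service_model["params"])

process_template = {                           # state/event -> standard process
    ("ORDERED", "ACTIVATE"): ProvisionAdapter(),
}

def dispatch(service_model, event):
    process = process_template.get((service_model["state"], event["type"]))
    if process:
        process.run(service_model, event)

dispatch({"state": "ORDERED", "device": "cpe-9", "params": {"vlan": 100}},
         {"type": "ACTIVATE"})
```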

This is a constructive step that the NFV ISG could undertake now.  So could the ETSI ZTA group.  So could the TMF.  It would be great if the three could somehow cooperate to do that, because if that were done it would take the largest step we can take now toward facilitating integration in NFV deployments.

Clusters, Service Models, and Carrier Cloud

If we want to apply cluster techniques to carrier cloud services, we need to first catalog just what kind of services we’re talking about.  Otherwise we can’t assess what the feature-hosting mission of carrier cloud technology would be, and without that, we can’t assign infrastructure requirements.  You’d think that all this would have been done long ago, but as you’ll see, the primary focus of function hosting so far has been a single service with limited scope.

Referring back to (and summarizing) my last couple of blogs, a “cluster” is a homogeneous set of resources that acts as a single unit in virtualization and is scheduled/managed by an organizing entity.  Some clusters are “cloud-like” because they support large numbers of tenants as discrete services, and some are more like grid computing, where applications draw on elastic resources, perhaps for multi-tenant missions.

“Services” can be defined in a lot of ways, most of which would deal with things like their features.  Features of services determine the functions that are exploited by the service, which for hosted services means the functions that are deployed.  A software function is what it’s written to be, and like an application, what it does doesn’t necessarily have much impact on how it’s hosted.  A more relevant thing to look at for hosting or zero-touch automation is the service model.

NFV has focused on a single-tenant provisioned service, and more specifically on edge features that are typically offered for business VPN services within custom appliances.  Virtual CPE (vCPE) is the most common NFV application referenced.  Single-tenant provisioned services are “ordered”, “modified” and “sustained” for their lifespan, as specific services, meaning they are paid for discretely and have a service-level agreement (SLA) or contract associated with them.

The most common “service model” in terms of total number of users and total traffic is the multi-tenant shared service, where a service provides mutual connectivity to a large population of users, each of which has their own communications mission.  The Internet is obviously such a service, and so is the old public switched telephone network (PSTN) and our mobile cellular services.

A third model of service that has emerged is the foundation or platform service, which is a multi-tenant service that is used not directly but rather as a feature within another service.  A good example of a platform service is the IP Multimedia Subsystem and Evolved Packet Core (IMS and EPC) of mobile networks.  Every customer “uses” IMS and EPC, but an instance of these two isn’t set up for each call.

From this, you can see that a “service” is something you have to set up and sustain.  Most of our service use today is based on a multi-tenant or foundation model where the service is used in support of many tenants.  Services have users/tenants, SLAs, lifecycles to manage, etc.  The way that tenants map to services is really a function, a piece of service logic, and so it has only an indirect impact on the way we need to think about service lifecycle management, including deployment of hosted functions.

Let’s look at the Internet in this light.  The Internet is made up of three basic foundation services.  One is IP networking, one is the Domain Name System (DNS), and one is address assignment (DHCP).  To “get on” the Internet, you get an address (DHCP), open your browser, and click a URL, which is translated via DNS to an IP address that is used by the IP network service.  The Internet hosts other “web services” that are accessible once this has been done.
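
In miniature, and assuming the host already has its DHCP-assigned address, the application’s share of that sequence looks something like this (the hostname is purely illustrative, and the snippet obviously needs live network access):

```python
# The "get on the Internet" sequence in miniature: DNS resolves the name,
# then the IP network service carries the connection.  The hostname is
# illustrative and the snippet needs live network access to run.
import socket

hostname = "www.example.com"
ip = socket.gethostbyname(hostname)                 # DNS: name -> IP address
with socket.create_connection((ip, 80), timeout=5) as s:   # IP network service
    s.sendall(b"HEAD / HTTP/1.1\r\nHost: " + hostname.encode()
              + b"\r\nConnection: close\r\n\r\n")
    print(s.recv(200))                              # first bytes of the reply
```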

These three service types frame out the cluster/hosting implications fairly nicely, because they speak to an important point about all services, which is the nature of the binding between the logical/functional dimension and the physical/resource dimension.  Think of a cluster or resource pool as a literal pool from which stuff is withdrawn.  Some service models pull resources out and take them up toward, or into, the service layer, and others leave them below in the pool.

The service types frame the management model, by framing the requirements.  The paramount question in service management, and thus in zero-touch automation, is the relationship between the resources used by the service and the service-level agreement (SLA).  It’s this relationship that we have to map to cluster behavior.

Single-tenant services like vCPE are provisioned and managed per user, which means that resource behavior is tightly bound to each service, making it important to remediate problems at the resource level in conformance to specific service SLAs.  A resource assigned to such a service is actually assigned, and while there may be a sharing model for it (virtual machines sharing a server), the thing that’s assigned is committed.  Put another way, tenant awareness extends downward into the resource pool.

The fact that these services use committed (if often virtual) resources means that per-service remediation is likely the basis for sustaining operations within the SLA.  If something breaks, you have to either replace it 1:1, or redeploy some or all of the service to restore correct operation.  It is possible that some autonomous behavior at the resource level might act to help, as would be the case with adaptive routed networks, but this behavior is actually a multi-tenant service behavior, as we’ll see.

Because there are obviously going to be a lot of customers of a single-tenant service, each with their own service instance, and because of the explicit service/resource association, these services require considerable orchestration, and they’re subject to the problem of cascading faults.  A broken trunk connection might impact a thousand services, and while fault correlation might be expected to address that problem at one level, a substitute trunk could well not provide the SLA needed by some or all of the services impacted.  At the least, there might be a better solution for some than simple rerouting.

The cluster implications for this model should be obvious.  You can’t do simple 1:1 resource substitution, you have to rely on more complex service lifecycle management.  A cluster tool would need to be cloud-like to start with since hosted components are single-tenant (like cloud applications) and you’d need service automation throughout.

Multi-tenant services are capacity-planned, meaning that the resource pool is sized to support the performance management guidelines set for the service, and there may also be admission control exercised to prevent the number of tenants drawing on the service from overloading the resource pool.  The resource pool could also adapt to changes in overall tenant or traffic load and replace “broken” components by redeploying.

In these multi-tenant capacity-planned services, the SLA is probabilistic rather than deterministic.  You set a probability level for the risk of failure and build to keep things operating below that risk point.  As long as you can do that, you’re “successful” in meeting your SLA, even if once in a while some users get unlucky.  The SLA itself can be tight or loose (the extreme of the latter being best-efforts), but whatever it is, it’s the average that counts.
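
A small worked example (with made-up numbers) shows what that looks like in practice: you capacity-plan the pool so the probability of more tenants being active than the pool can carry stays below the agreed risk level.

```python
# Worked illustration with made-up numbers: 1,000 tenants, each active 10%
# of the time, capacity for 130 concurrent tenants.  The SLA question is the
# probability that demand exceeds capacity.
from math import comb

tenants, p_active, capacity = 1000, 0.10, 130

def p_overload(n, p, cap):
    """P(more than cap of n tenants are active at once), simple binomial model."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(cap + 1, n + 1))

risk = p_overload(tenants, p_active, capacity)
print(f"overload probability ~ {risk:.4%}")   # comfortably under a 1% risk target
```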

Because you’re playing the law of averages, reconfiguring to meet a capacity plan is a very different thing from single-tenant reconfiguration.  The service layer doesn’t have any explicit resource commitments, even to virtual resources.  What it has is a capacity-planned SLA, and so you could easily define resource pool failure modes and simply reconfigure into one of them when you have a problem.  This means that cluster software could easily handle most of the problem management and service lifecycle tasks.

The in-between or foundation services are the final consideration.  Like multi-tenant services in general, foundation services will generally have a capacity-planned SLA, but often there will be more specific stipulations of things like response time, because the “host” service that references the platform service likely has some broad timing constraints.  An example is easily found in mobile services, where calls tend to be abandoned if the user doesn’t get some signal of progress fairly promptly.

Could a foundation service slip across the boundary into requiring a more explicit SLA?  I think it could, which would create a multi-tenant class that had more specific virtual-resource commitments and more explicit orchestration of scaling and redeployment.  There’s no reason why, in this case, we couldn’t regard the “user” of the platform service as being the host service to which it’s integrated, rather than the users of that host service.  Everyone with a phone isn’t a user of IMS, they’re a user of the system that includes it.

The future of services is already visible in the Internet—a collection of foundation services bound into a larger service model, supporting many users concurrently.  That mission is certainly not the focus of NFV today, and it’s not even a perfect match to current thinking on cloud computing.  It may be something that tips more to the traditional grid missions, but that’s a question we probably can’t address until we get more enlightened thinking on the role of foundation services in the future of networking.