Here’s How Cloudification of Existing OSS/BSS/NMS Could Work

Can a monolithic application be cloudified?  That’s one of the critical questions for network operators, because so much of their operations/management software inventory is monolithic.  To make matters even more complicated (and, arguably, worse), most of the implementations of management tools for things like NFV and even zero-touch automation are in large part monolithic.  Are these elements now insurmountable barriers to transformation and modernization?  Maybe not.

There are two overall models for software that handles or is dependent on external conditions.  One is the “polled” model, where the software accesses an external status resource when it needs to know about the system it represents.  Many of the old Simple Network Management Protocol (SNMP) tools would poll the management information base (MIB) for device information.  The other model is the “event” model, where the external system generates a message or signal, traditionally called an “event”.  This event is then processed by the management system.

Most operations/management software today relies on the event model, which is why these systems are called “event-driven”.  We’ve had a decades-long set of initiatives aimed at making OSS/BSS systems “event-driven”, and so if everything (or nearly everything) today has graduated to events, why are we talking about this issue in a discussion of cloud-native or cloud-friendly operations?  Because there are three distinct sub-models of the event-driven approach, and the differences among them matter.

If we have software that’s “single-threaded”, meaning that it processes one item of work at a time (as many transaction processing systems do), it can be made to support events by adding an event queue in front of the software.  When an event happens, it’s pushed onto the queue, which in software terms is called a “FIFO” for “first-in-first-out”.  The queue preserves arrival order, so when the software “pops” the queue because it has capacity to do more work, it gets the next event in order of arrival.
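
As a minimal sketch (the queue, event fields, and function names here are hypothetical, not drawn from any particular OSS/BSS product), the front-end might look something like this in Python:

```python
from collections import deque

# FIFO event queue: events are appended on arrival and popped in arrival order.
event_queue = deque()

def on_event(source_id, event_type, payload):
    """Called when the external system signals an event; just records it."""
    event_queue.append((source_id, event_type, payload))

def process_next():
    """Called by the single-threaded worker when it has capacity for more work."""
    if not event_queue:
        return None
    source_id, event_type, payload = event_queue.popleft()   # "pop" the oldest event
    # ...hand the event to the actual processing logic here...
    return source_id, event_type
```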

This is an OK approach for what we could call “limited and self-limiting” events.  That means that there’s a very small chance that the event source will generate another event before the first one is processed fully.  The problem of related events arises when an event is popped for processing and a later event in the queue, relating to the same source, indicates a change in condition.  We now have to either risk having a processing task undertaken with out-of-date status, or check the status of the event source in real time (either by polling it or by forward-scanning the event queue) before we process.  And, of course, where does this end?

Where software has multiple threads, meaning that it can do several things at once, you can have each thread pop the event queue, which brings some concurrency to the picture.  However, it also introduces an issue with respect to the relationship among events.  If Instance One of the software pops an event associated with System A, and Instance Three pops another System A event a moment later, we have two different software processes potentially colliding in the handling of related events.

With multi-threaded event-handling systems, we still have the same problem of multiple events from the same source being queued and popped for processing, but we have the additional problem that different processes could all be working on the same system, driven by the events they happened to pop.
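
A common way to visualize the collision problem, and one mitigation for it, is per-source serialization: any thread can pop any event, but events from the same source are handled one at a time.  This is a hypothetical sketch (handle() is a placeholder), not a recommendation of any specific product approach:

```python
import threading
from collections import deque

event_queue = deque()
queue_lock = threading.Lock()
source_locks = {}                      # one lock per event source, e.g. "System A"
source_locks_guard = threading.Lock()

def lock_for(source_id):
    """Return the lock that serializes all events from a single source."""
    with source_locks_guard:
        return source_locks.setdefault(source_id, threading.Lock())

def handle(source_id, event):
    pass                               # placeholder for the real event processing

def worker():
    """Each thread runs this loop; threads can work concurrently on different sources."""
    while True:
        with queue_lock:
            if not event_queue:
                return
            source_id, event = event_queue.popleft()
        # Two threads that popped events for the same source will now take turns,
        # so Instance One and Instance Three can't collide on System A.
        with lock_for(source_id):
            handle(source_id, event)
```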

Both these problems have additional complications arising from implementation.  One problem is that multi-threaded but monolithic software runs in one place (because it is monolithic), and the sources of the events and the places where process actions might have to be applied could be quite a distance from the monolithic process.  Another is that even multi-threading an application doesn’t make it suitable for a wide range of loads; there are limits on the number of threads you can run in a given place.  A third sub-model of event processing, the one I’ll describe next, addresses all of these issues, at least to a degree.

Suppose that our event-driven system is fully described by a functional model that illustrates how features, events, and resources relate to each other.  Suppose that the current state of the event-driven system, all the rules and policies associated with it, are all held in this model.  That model is then effectively a blueprint to describe goal behavior, and a single central record of current state for all the subsystems within our overall modeled system, right?

If that’s true, then any software process instance designed to work on the model, meaning to apply an event to the model, would be able to process that model correctly, and thus process events against it.  We could spin up an instance of the process set (which, in a presentation I did for the TMF’s SDF group about a decade ago, I described as a “service factory”) and it would be able to process the event by accessing the model.
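
A minimal sketch of that idea, with all names and structures invented for illustration: the worker instance owns no state of its own; it loads the model, applies the event, and writes the model back, so any number of identical instances can be spun up on demand.

```python
# Stand-in for the shared model repository (in practice, a database or model service).
MODEL_STORE = {}

def load_model(service_id):
    return MODEL_STORE.setdefault(service_id, {"state": "idle", "policies": {}})

def save_model(service_id, model):
    MODEL_STORE[service_id] = model

def service_factory_instance(service_id, event):
    """A stateless worker: everything it needs comes from the model, not from memory."""
    model = load_model(service_id)            # current state, rules, and policies
    model["state"] = apply_event(model, event)
    save_model(service_id, model)             # the model stays the single record of state

def apply_event(model, event):
    # Placeholder for the model-driven logic (e.g., a state/event table as sketched below).
    return model["state"]
```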

This process would allow a “master event source” to scale a service factory to handle events.  The events could be associated with different services/systems, or a different part of the same one.  We’d still need a “locking” mechanism that would allow for the queuing of events, but only if the events were associated with the same subsystem, and a process for that subsystem was already running.

The critical piece of a service model is a state/event table.  Each element of the model would have such a table, and the table would identify the operating states (set by event processing) and the events that could be received.  The combination, by implementation tradition, would yield a process ID and a “next state” to indicate the state transition.  This is how monolithic stuff can be coupled to our third kind of event processing.
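
A hedged sketch of what such a table could look like; the states, events, and process names are invented for illustration, but the lookup pattern (current state plus event yields process and next state) is the classic one:

```python
# (current_state, event) -> (process_id, next_state)
STATE_EVENT_TABLE = {
    ("ordered",    "activate"):  ("deploy_process", "activating"),
    ("activating", "deployed"):  ("test_process",   "active"),
    ("active",     "fault"):     ("repair_process", "repairing"),
    ("repairing",  "repaired"):  ("test_process",   "active"),
}

def dispatch(element, event):
    """Look up the process to run and record the state transition in the model element."""
    process_id, next_state = STATE_EVENT_TABLE[(element["state"], event)]
    element["state"] = next_state
    return process_id            # the service factory runs or queues this process

element = {"name": "access-leg-1", "state": "active"}
print(dispatch(element, "fault"))    # -> "repair_process"; element is now "repairing"
```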

Each of the process elements, threads, components, or whatever, in a monolithic application could now be fed by the service factory according to the state/event table in the model.  The service factory process then acts as the queuing mechanism, resolves collisions, handles scaling up front, and so forth.  You can still have event queues for the monolithic component, but if the service model and factory are properly designed, the conflicts and collisions are weeded out before something is queued.

If the processes identified in the state/event table are designed to work from the data passed to them and to pass results back, they could likely be scaled in an almost-cloud-like way.  If not, the approach would at least define the specific goals of an operations software remake—turn the components into autonomous functions.  That alone would be an advance from today, where we have little specific guidance on what to look for in things like OSS/BSS modernization.

There’s nothing new about this approach.  As I noted above, I presented it to the TMF over a decade ago, and it’s also a fairly classic application of state/event and modeling technology.  All of this means that this could have been done at least a decade ago.  That doesn’t mean it couldn’t (and shouldn’t) be done now, but it does justify a bit of skepticism about the willingness of operators and vendors to take the necessary steps.

Either group could break the logjam.  AT&T still wields considerable influence on ONAP, and they could drive that initiative to do more.  I stopped taking ONAP briefings until (as I told them) they adopt a service-model-centric approach.  AT&T could accelerate that shift.  Any major vendor in the networking space could easily assemble the pieces of this approach, buy a couple of companies to make it real, and push the market in that direction.  Progress is possible, meaning that we could take a giant step toward cloud-native technology in carrier cloud infrastructure more easily and less expensively than most operators think possible.

It’s also worthwhile to note that this same approach could be applied to enterprise software to integrate cloud-ready elements.  The cloud providers could be the drivers of this, or it could be vendors like VMware, whose aspirations and capabilities in the cloud space are obvious to all.  VMware, of course, is also casting covetous eyes at the telecom space, so perhaps they could take a lead in both areas.  Competitor Red Hat, under the IBM umbrella, could also do something in this space.

I do think that broader redesign of operations software for the cloud would be valuable, but the big benefits, the majority of opex savings and process improvements, could be realized with this much smaller step.  I hope somebody finally decides to take it.

Is There a Deeper Meaning To Cisco’s Exablaze Deal?

It may be that Cisco still has more to show us in their network transformation.  Their Acacia announcement, followed by Silicon One, was IMHO a big step toward a new model of networking.  It just could be that their Exablaze deal was another.

Exablaze is a company specializing in low-latency networking, meaning building networks that have a very small transit delay in packet-handling.  The company is best-known for its applications in the financial industry, both as a part of a high-frequency trading platform and in specialized applications relating to the mass distribution of financial data.

It’s not unreasonable to think that Cisco might have acquired Exablaze simply for its position in the financial industry, or to mark territory on an area where arch-rival Arista is strong.  However, Cisco already sells to that market, and while their own products aren’t as specialized to financial-industry requirements, it’s a bit unusual for Cisco to buy something that would compete with its own products.  There are a couple of quotes in the Cisco blog (a short one; most of the substance is in the areas I quote below) announcing the deal that could be indicative of a broader role, perhaps even a deeper one.

The first quote is “Network capacity and speed are today’s bread and butter for data intensive, highly transactional-based businesses requiring optimized bandwidth performance in industries and market segments like high-frequency trading (HFT), financial services, high-performance computing, and emerging AI/ML clusters.”  HFT, obviously, is a current opportunity, along with financial services in general.  High-performance computing (HPC) has been around a while, and so it’s fair to wonder whether Cisco’s mention of the space is an indicator that it’s going to grow.  Obviously, that would be the case with “AI/ML clusters”.

You can almost draw a curve from HFT to AI/ML clusters, growing as you move across the technologies to reflect the heady expectations for each.  In every case, we have current needs, but the question is whether any of them, even HFT, would be enough to make Cisco do an M&A at this critical point.  Is Cisco trying to buy some revenue?  Remember that Wall Street saw the Acacia deal as simply a way of getting better profits on the optical interfaces Cisco was already using.  I think the Cisco Silicon One announcement proved that was a shortsighted view.  I think it is here too.

The second quote is “Integrating Exablaze’s innovative products and technology into the Cisco portfolio will give our customers the latest field programmable gate array (FPGA) technology providing them with the flexibility and programmability they require.”  To me, this says that the FPGA technology, at least as much as Exablaze’s current products, is the justification for the deal.  That, of course, raises the question of what Cisco might be intending to do with the stuff, and I want to look at some possibilities here, as a part of a general “what-will-they-do-with-Exablaze” exploration.

One obvious Exablaze target is to add low-latency smarts to network interface cards in the Cisco product line.  Remember that Amazon’s Nitro model of cloud hosting proposed offloading a lot of I/O and network functionality to the interfaces.  Exablaze cards offer exceptionally low latency, which is critical for cloud applications, as we’ll get to in a minute.  Cisco could add this kind of capability to its own servers, make it available (as it already is, from Exablaze) to other server vendors, and get a bit of street cred in the cloud server space.  This mission would be logical, given that interface cards don’t directly compete with any of Cisco’s own products.

It’s also worth noting that routers and switches typically use interface cards too.  Might Cisco be looking at enhancing its own products with custom interfaces, again for the purpose of lowering latency in handling?  Such a move could give Cisco an edge, and if you combine a low-latency network card with the low-latency Silicon One chip, you might have a whole new class of network device.

The second, and related, target area could be edge computing.  Remember that the only real reason to site hosting near the edge is to reduce latency; otherwise you’d be better off creating deeper data centers with larger resource pools.  It seems likely to me that if we do uncover commercially viable edge computing applications, the “edge” will actually be a couple of layers of hosting, likely including something on premises and moving deeper as the value of collective pools increases.  This kind of structure could benefit from edge devices with low-latency interfaces, and also from edge switching based on low-latency technology.

The edge, of course, is really just a piece of the cloud, and there we find what might be the most interesting of possible Cisco missions for Exablaze.  If we examine a cloud-native application or service, we’d find it consists of a bunch of distributed, scalable, resilient microservices, deployed via Kubernetes as containers and linked with a service mesh technology like Istio.  The implementation of Istio is typical; there is a “sidecar” process that represents the connectivity management piece, and the real microservices are clients of that sidecar set.  This is a lot like the separation of control and data plane that I talked about in yesterday’s blog.

Istio and other service meshes add fairly significant latency because of the extra packet handling they introduce.  Could Cisco use Exablaze technology to create 1) an accelerated sidecar process, 2) an accelerated hosting platform for microservices, and/or 3) an accelerated network mesh that would deliver microservice workflows with much lower latency?  That would be the first true architecture to support service mesh and cloud-native, and it could be very big indeed.
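
To see why per-hop handling matters, here’s a back-of-the-envelope latency model.  Every number in it is an illustrative assumption, not a measurement of Istio or of any Exablaze hardware; the point is only that proxy overhead multiplies by the number of hops in a workflow.

```python
HOPS = 6                     # microservice-to-microservice hops in one workflow
SERVICE_TIME_MS = 2.0        # processing time per microservice (assumed)
NETWORK_MS = 0.5             # raw network latency per hop (assumed)
SOFTWARE_SIDECAR_MS = 1.5    # added sidecar/proxy handling per hop (assumed)
ACCEL_SIDECAR_MS = 0.2       # hypothetical hardware-assisted sidecar per hop

def workflow_latency(per_hop_proxy_ms):
    return HOPS * (SERVICE_TIME_MS + NETWORK_MS + per_hop_proxy_ms)

print(workflow_latency(SOFTWARE_SIDECAR_MS))   # 24.0 ms with software sidecars
print(workflow_latency(ACCEL_SIDECAR_MS))      # 16.2 ms with accelerated sidecars
```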

A cloud data center is today very similar to a kind of upsized corporate data center.  If we truly transition to cloud-native, the models might well diverge, and then be conceptually reunited at the software level via things like Kubernetes and service mesh.  That would fit my thinking that the cloud isn’t an alternative hosting mechanism, but an alternative application model that happens to preference a whole new set of hosting options, including the edge.

Data centers are the drivers of enterprise networking policy, and also (obviously) the engine of growth in carrier cloud applications.  Clearly, Cisco knows that, but it’s not clear how deep their understanding is.  Does Cisco see Exablaze as a nice little rounding-out of a product strategy for a key vertical, or do they see it as the start of a broader transformation of the application model?  A lot will depend on the answer to that one.

In Pursuit of the Cloud-Native Router Instance

Perhaps the biggest technical barrier to cloud-native in telecom is reconciling microservice behavior with network service behavior.  If service QoS is limited by microservice and cloud-native implementation, then cloud-native suitability for telecom is in doubt.  If service setup, change, and restoration are limited by the way microservices are orchestrated, then cloud-native suitability for telecom is in doubt.  But do those limitations exist?  It depends.  We have to look at what cloud-native software imposes in the way of conditions, then see if telecom services can tolerate it.

The general vision of cloud-native is that functionality of any sort is created by stitching together a series of small components called “microservices” that are designed so that they can be run anywhere, scale anywhere, and still be assembled into a workflow.  For this to work, it’s essential that they’re “stateless”, which means that microservices can’t store anything within themselves.  If they did, that internal information would be lost if the microservice failed (in which case its replacement wouldn’t function the same way) or was scaled (in which case the new iteration wouldn’t have the information, and wouldn’t function the same way).

Statelessness is insidious.  A lot of IP networking is seen as being stateless because it’s “connectionless”; there are no hard-allocated resources associated with a relationship between two IP endpoints.  That’s not the same as stateless, though.  A routing table, for example, is stored in a router.  If another instance of that router is spun up, either to replace the original or for load scalability, the new one would have to rebuild the routing table to function.  Not only that, if the new router were located somewhere other than the exact same place, on the same trunks, as the old one, then routes could change, in which case other routing tables would also need to be updated.

In scaling IP, if we were to develop parallel paths to share a load, the normal practices of adaptive routing would indeed use them, but only after the routes had “converged” (the other routing tables had accommodated the new choice), and only if the routing protocols could properly detect the congested state of the first device and shift things.

The key point here is that a relationship between two IP addresses (which, in OSI terms we could call a “session”) is stateful as far as the users are concerned.  They, in most cases, will actually employ a TCP process, which is stateful for congestion control.  We can switch around the route of the session within the network, but we can’t make a substitution of one instance for another invisible, nor can we make scaling invisible, in impact-on-session terms.

You may wonder what this adds up to, so let’s get to it.  If we assume that we implemented routers as monolithic components, the result would be something not based on microservices and not rightfully cloud-native.  NFV, as defined, creates virtual devices which are monolithic components.  These can be spun up and scaled or whatever, as real physical routers are, and with the same effect on services.

But what about “cloud-native?”  If we were to construct a “cloud-native router” we’d want to decompose the router function into microservices.  This process poses a number of questions, some perhaps minor and some perhaps critical.  The thing we have to keep in mind is that the microservices are presumed to be independently hosted, so they have to be connected by a network.  There’s latency associated with that, so we’d not want to divide router functions into microservices in a way that forces us, when we deploy, to put them all in the same place for performance reasons.

The first question, and probably the most fundamental, is how we divide router functionality.  In traditional routers, all the functionality is controlled from the data plane, meaning that router management, topology exchange, and other control features are all riding along the same paths as the data.  If we were to follow the monolithic-router model, we would necessarily have a single microservice that had to represent a port/trunk process to converge the data and signaling.  Do we then have a microservice for every port and trunk?  Where do we do the routing part?

The SDN movement, at least in its purist form, might have suggested a solution.  If we were to separate the control and management plane from the data plane, which you may recall Amazon suggests in its Nitro model for security/stability purposes, the control-and-management piece could well be a good candidate for microservices.  This would suggest to me that the “ideal” model for cloud-native implementation of heavyweight forwarding devices like routers would be a white-box switch at the data plane, separating out control and management functions for implementation through microservices.

The notion of the separate but centralized controller in SDN is a bit counter to the scalable-resilient-distributed model of microservices, but separating the control and management planes doesn’t mean you necessarily need a central controller.

A simple model of a separate control/management plane would consist of white-box switches (or software instances) that are paired with an agent process that provides control and management features.  The network connecting these agents would be an independent “VPN”, and control and management traffic would be separated from the inline IP flow at the access level, and thereafter flow on this separate network.

If you like the notion of adaptive discovery, you could continue to exchange topology and status information about the network within this management/control network, and feed the topology changes to the associated switches/instances as forwarding table updates.  If you like a more centralized approach, you could feed a resilient process set with a network-wide database from the same management/control network, and have each switch obtain policy-filtered topology information (again as forwarding table updates) from that central point.
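
A minimal sketch of that pairing, with hypothetical names and structures: the agent lives in the separate control/management network and pushes only forwarding-table updates down to its switch or software instance.

```python
class WhiteBoxSwitch:
    """Data-plane element: it only forwards; it holds no control-plane logic."""
    def __init__(self, name):
        self.name = name
        self.forwarding_table = {}            # destination prefix -> output port

    def update_forwarding(self, prefix, port):
        self.forwarding_table[prefix] = port

class ControlAgent:
    """Paired agent living on the separate control/management 'VPN'."""
    def __init__(self, switch):
        self.switch = switch

    def on_topology_update(self, prefix, best_port):
        # Route computation and policy filtering would happen here (or centrally);
        # only the resulting forwarding entry reaches the data plane.
        self.switch.update_forwarding(prefix, best_port)

switch = WhiteBoxSwitch("edge-1")
agent = ControlAgent(switch)
agent.on_topology_update("10.1.0.0/16", "port-3")
print(switch.forwarding_table)                # {'10.1.0.0/16': 'port-3'}
```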

You could also take another approach altogether.  Suppose that every “edge” element in our future network operated as a node in what used to be called “Next-Hop Resolution Protocol” (NHRP).  The concept is similar to the way OpenFlow works when there’s no forwarding entry for a given packet, except that this time we could query a central point and receive a source-route vector that would steer the packet (and future flows to that destination) to the desired off-ramp.

If a network error occurred, each “node” in the network would receive a “poison pill”, and if it received a source-route vector containing the failed element, it would request another route, and fix the source route vector accordingly.  If we wanted to scale a component, adding it to the inventory of used-in-routing elements would make it available for any new routes.  We could also selectively “poison” some existing routes to force a reconfiguration, which could then use our newly scaled element.
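
Here’s a hedged sketch of that edge-node behavior, with invented node and destination names: a cache miss triggers a query to the central point, and a poison pill purges any cached route that used the failed element so it gets re-requested.

```python
CENTRAL_ROUTES = {                      # stand-in for the central resolution point
    "dest-A": ["node-1", "node-4", "node-7"],
    "dest-B": ["node-1", "node-5", "node-9"],
}

class EdgeNode:
    def __init__(self):
        self.route_cache = {}           # destination -> cached source-route vector

    def route_for(self, destination):
        if destination not in self.route_cache:
            # No cached route: query the central point (the OpenFlow-like "miss" case).
            self.route_cache[destination] = list(CENTRAL_ROUTES[destination])
        return self.route_cache[destination]

    def poison_pill(self, failed_element):
        # Drop every cached route that traverses the failed element; the next flow
        # to those destinations will request a fresh (repaired) source route.
        self.route_cache = {dest: vec for dest, vec in self.route_cache.items()
                            if failed_element not in vec}

edge = EdgeNode()
print(edge.route_for("dest-A"))         # ['node-1', 'node-4', 'node-7']
edge.poison_pill("node-4")              # dest-A's route is purged and will be re-requested
print(edge.route_for("dest-B"))         # unaffected routes stay cached
```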

I’m not suggesting this is the best, or only, way of making cloud-native implementations of telecom services work, but it illustrates something that seems to address the key points.  What are they?  First, it’s not going to be easy to make a data-plane element into a microservice; you probably can’t afford the overhead of making it get its routing table from an external source every time it receives a new packet.  That means separation of the data plane from control and management (a good idea in any case) is required.  Second, you can easily implement control and management plane activity, including updating the routing tables, in a variety of ways, using microservices.  That’s really the software heavy lifting in today’s routers anyway.

This all suggests two things: that perhaps SDN had the right idea in control-plane separation, but didn’t pursue all the options for the separate implementation of that control plane, and that perhaps SDN’s model would be easier to translate to a cloud-native implementation than the model of monolithic routers.  Something to think about?

Fixing AT&T’s Problems: Possible?

Every network operator worldwide faces profit challenges.  Some face challenges more acute than others, of course, and we need to watch those operators to get an advance look at what may be coming for the industry overall.  AT&T is such an operator.  They have a challenging demand density, they made some mistakes in their early planning, and they face a hostile/activist shareholder.  No wonder what they’re thinking is important, but what is it?

Light Reading covered the latest from AT&T here, but it was announced by AT&T in its October earnings call.  What it boils down to is that AT&T is accepting that it needs to radically reduce costs because it can’t radically increase revenues in the current climate.  It’s worthwhile to look quickly at how AT&T got to where it is, and then at the direction it’s taking.  Maybe we can even answer the critical question, which is “Will this work?”

The critical determinant in the intrinsic profitability of a given operator’s service geography is what I’ve called demand density, which is a measure of the potential revenue to be earned from service activity within a square mile of territory.  I’ve collected demand density numbers for all the major countries, and in the US for all the states.  Thus, I’ve had the data for the old Regional Bell Operating Companies, and for companies like AT&T that have been built by combining some of those old RBOCs.  AT&T has relatively low demand density—a fraction of competitor Verizon’s, and that’s its big problem.

Demand density relates mostly to what would be called “pass cost”, the cost of providing service to a customer’s area, to the point where if the customer ordered something from you, you could connect them to your access infrastructure.  Low demand density means that the cost of passing, say, a thousand customers is too high for the customers’ associated revenue to deliver acceptable returns on investment.
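
A quick worked example of the arithmetic (every number here is a hypothetical illustration, not AT&T’s or anyone else’s actual figure) shows why pass cost dominates the return calculation:

```python
homes_passed = 1000
pass_cost_per_home = 900.0          # cost to pass a home before anyone orders service
take_rate = 0.35                    # share of passed homes that actually buy
revenue_per_customer_year = 960.0   # annual service revenue per customer

capex = homes_passed * pass_cost_per_home
annual_revenue = homes_passed * take_rate * revenue_per_customer_year
print(f"capex ${capex:,.0f}, annual revenue ${annual_revenue:,.0f}")
print(f"revenue as a share of pass capex: {annual_revenue / capex:.1%} per year")
# Lower demand density spreads the same plant over fewer customers, raising
# pass cost per home and shrinking this ratio toward an unacceptable return.
```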

You need to raise revenue or lower pass cost to counter low demand density, and AT&T tried to pioneer in the video business to do that.  It was one of the few operators to attempt DSL-delivered U-verse video, and it bought a satellite company to launch its DirecTV service.  Satellite video and broadband have always been a good solution in thin-demand areas because they need no local infrastructure, but DSL video hasn’t proved successful overall because of limitations in the outside plant.

In a sense, cost, in the form of pass cost, is actually a barrier to new revenue, at least new broadband Internet and TV revenue.  The problem is that conditioning old infrastructure for new service is expensive, particularly when some market players (cable companies notably) have infrastructure better suited to video and broadband delivery.  Even though AT&T may have customers within its territory who have revenue potential, their distribution doesn’t lend itself to a high return.  If a new service requires new access infrastructure investment, that new investment isn’t likely to pay off for a low-demand-density operator like AT&T.

The financial industry uses something called “return on capital employed” or ROCE as a measure of how effectively a company uses its investments in infrastructure.  I’ve been calculating demand density for operators for over a decade, and in that time, I’ve found that demand density is highly correlated with an operator’s ROCE, meaning that having a low demand density tends to create a proportionally low ROCE, and that discourages access infrastructure investment.  A lot of AT&T’s problems are baked into its territory.

That pretty much leaves broader-targeted cost-cutting, which traditionally means cutting opex.  A typical network operator spends about 19 cents of every revenue dollar on capital projects and about 33 cents on what could be called “process opex”, meaning the operations costs associated with service/network lifecycle.  AT&T has attempted to address both of these through technology, the former with an increased reliance on open-model networks (white boxes and open-source service features to run on them), and the latter through a series of point-strategies and also via its broader ONAP open-source lifecycle management initiative.  Neither of these have delivered as AT&T has hoped.

Capital cost reduction through open-model networking is limited in its impact by the useful-life expectations of the technology already in place.  If you presume that your assets have a five-year useful life, then you can only impact 20% of those assets in a given year, and you’ll save only the cost difference between legacy/proprietary devices and open-model solutions.  AT&T can get more from relying on open-model technology for new deployments like 5G, which is why its DANOS open-source-and-white-box 5G access strategy is important to them.  But it only reduces future costs, not current ones, so it’s also limited.
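
The limitation is easy to quantify with a rough sketch; the open-model discount below is a hypothetical assumption, while the 19-cent capex share and five-year life come from the discussion above:

```python
capex_share_of_revenue = 0.19     # ~19 cents of each revenue dollar goes to capex
useful_life_years = 5             # so roughly 20% of assets refresh each year
open_model_discount = 0.40        # assumed cost advantage of open-model replacements

annual_refresh_fraction = 1 / useful_life_years
savings_per_revenue_dollar = (capex_share_of_revenue *
                              annual_refresh_fraction * open_model_discount)
print(f"{savings_per_revenue_dollar * 100:.2f} cents saved per revenue dollar per year")
# About 1.5 cents per year: real money at telco scale, but slow to accumulate.
```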

The ONAP initiative of AT&T had, in my view, great promise.  In a nutshell, ONAP was designed by AT&T as an open-source solution to lifecycle management.  When it was conceived, it could have framed a vision of operations automation that could have saved AT&T as much as 30% of its process opex.  The problem was that, IMHO, ONAP was stuck in OSS/BSS-think.  It imposes a monolithic model of management, has no solid framework to accommodate virtualization, and supports other initiatives (like SDN and NFV) by “plug-ins” or APIs rather than actual integration.  It also had, like virtually all projects and network operator standards initiatives, an interminable development cycle, so as a result AT&T (and other operators looking for a lifecycle automation strategy) have nibbled at the problem with one-offs.  That’s resulted in approximately a 15% reduction in process opex, which clearly wasn’t enough.

What does this leave for AT&T, strategy-wise?  Their open-model network strategy will yield benefits down the line, perhaps enough to satisfy shareholders.  Their ONAP opex reduction strategy is unlikely to help them much, in my view, unless they redo it somehow, and I don’t think they understand what’s wrong with it, much less how to fix it.  They have two options, both of which have been suggested.  One is to recognize that wireline is perhaps the biggest problem for them, and so “rationalizing” that strategy could help.  The other is to simply put a squeeze on labor costs and benefits.

There are few telcos I’ve worked with that weren’t bloated in head-count terms.  That’s a kind of public-utility legacy.  However, there are also few telcos I know that are adequately staffed in the areas on which their future depends.  That’s a result of the same legacy, a past where flexibility and market-responsiveness didn’t mean much.  Build it and they will come.  In any event, the risk here is that any attempt to reduce staff through layoffs will be complicated, in part because of the union labor percentage in the company and in part because as soon as layoffs are announced, all the workers you really need will, because of their greater mobility in the job market, likely put out resumes.  You lose those you can’t afford to lose.

Selling off wireline customers and assets to others, as Verizon and other telcos have done, is a smarter play.  You shed revenue opportunity you know you can’t realize and shed labor costs at the same time, which is a decent trade.  However, cable companies like Comcast are under some of the same pressures as AT&T, and they’re hardly shedding their cable business.  Comcast has also made a go of buying media/content companies, a move that, in AT&T’s case, set off the activist-investor pressure in the first place.

What I think AT&T has to do at this point is to first identify those people within their organization who are capable of OTT-and-cloud-think.  Wherever these people are, they need to be harnessed rather than threatened.  This is going to be problematic if they follow their usual “who’s valuable” approach.  If the effort becomes “bell-heads-save-each-other”, they’re in trouble.  It may involve using outside resources to assess the people objectively, at least to get the initiative started.  It’s not as big a job as it might seem to AT&T; all you need is a small cadre of really good technology-thinkers; they’d recognize others almost on sight.

Second, AT&T should focus on 5G/FTTN hybrid technology to deploy in the areas where demand density justifies it and where video and high-speed broadband are viable offerings.  There’s great promise in using 5G millimeter-wave technology for last-mile connectivity, and if you could drop the access cost, you could harness opportunity at an acceptable rate of return.  They also need to use this technology for business services, meaning don’t just think of it as a residential strategy but also as a branch office and SMB strategy.  If they do this, they can keep the customers who could be targets for additional revenue-generating services and sell off the wireline customers and infrastructure.  This would also accelerate the adoption of the open-model 5G deployment, which accelerates their savings on capex.

The third thing they need to do is to clean up their ONAP act.  ONAP as a whole is a mess, but it’s a mess that could be saved by creating a kind of cloud-native nucleus that would then pull together the monolithic tasks within ONAP, and also serve to integrate external initiatives.  I’ll blog more on this point next week.

The final thing that AT&T needs to do is recognize that a cost management focus promises little more than vanishing to a point.  Revenue growth is the only survival strategy that works in the long term, and like all operators, AT&T has been stuck in neutral on that issue.  No, new connection services won’t work.  No, video opportunity is, at this point, compromised by the growth of streaming services and the inevitability of content providers going it alone.  Can AT&T make its own content successful?  Sure, but can it carry the rest of the company?  There are a lot of service opportunities waiting for AT&T and other operators above the connection layer, and an insightful set of cloud-thinkers can surely help them find at least a few of them.

It takes a long time to change direction in an inertia-centric player like AT&T, but the corollary is that it also takes a long time for it to die.  That’s true with most telcos.  As much as their near-term pressure has created alarm, particularly on the Street, there’s still time for them to not only save themselves, but to prosper.

How Real is Cisco’s New Silicon/Optics Story?

If open networking is a challenge to vendors, what’s the solution?  One possible pathway is to try to guide your buyers to better benefits, thus justifying higher costs.  That’s not an easy sell, and it’s particularly problematic for Cisco, whose sales-driven approach to networking is legendary.  That leaves the other pathway: steal someone else’s money.  By that, I mean build products that consolidate layers of functionality into a single box.  That, I think, is where the whole Cisco Silicon One and Optics thing just announced is heading.

When Cisco bought Acacia, I did a blog suggesting that Cisco aimed at expanding its optical capability in large part to subsume the optical layer into routers.  That would put the optical budget on the table in router sales, reducing the price pressure on Cisco in itself.  It would also raise the technical stakes for open-network routers by adding in stuff that white boxes aren’t likely to provide.  The idea has a lot of merit for Cisco, and for others in the router space.  The reverse, subducting routing into the optical layer, has corresponding value for vendors like Ciena.

We used to offer all services based on time-division multiplexing (TDM), and so we really had single-layer networks using things like SONET.  When IP came along, it became an overlay, using optical transport.  As IP came to dominate, the obvious question was whether you built “transport” based on packet-enhancing the optical layer, or by optical-izing the router layer.

To support the former, optical vendors like Ciena have pushed packet optics and reconfigurable optical add-drop multiplexers (ROADMs) as the foundation of modern IP networks.  At one time, in the world of TDM and SONET, the notion of a separate optical layer was a given, since there were multiple service technologies (TDM and IP, ATM, whatever) that shared optical transport.  Now, that’s less true, and so optical players have worked to carve out an IP mission.  Ciena today is touting its support for 5G, for example.

An IP mission for Ciena means less money for Cisco, or put more opportunistically, less Ciena mission can equal more Cisco mission.  My speculation with respect to the Acacia buy was that Cisco was planning to make a direct assault on the optical layer, and such a move would necessarily enrich Acacia as a separate company.  It would also put Cisco’s plans at risk should Acacia be bought by a competitor.

As logical as it would be for Cisco to do an Acacia deal to keep the riches in-house, there’s still the matter of generating the riches in the first place, which is where I think Silicon One and the announcement come in.  What Cisco is working on is a model of more affordable or practical IP/optical transport that is more IP than optics.  The Cisco model is described by Cisco here and, in more detail, here.  The early target, I think, is the convergence of multiple IP services (business, telephony and video, residential and Internet) onto a common infrastructure.  The later target is to ensure that infrastructure is mostly routers, and mostly Cisco.

One of the most serious threats to a goal of converging transport multiplexing and routing in a single layer is the problem of different services’ needs and value to the operators.  The Internet generates most of the bits and yet its margins stink.  Services with better margins don’t generate the same level of traffic, and unruly Internet traffic could compromise QoS on those better-paying services.  In the future, if we do indeed end up with 5G slicing or IoT or something else that has a different profit-per-bit potential, this challenge gets harder to address, which is where I think Cisco’s “early target” comes in.

If you listen to Cisco’s short video, the most visible theme is what I (in what I’m sure Cisco sees as an irreverent lapse) call the “devices suck” paradigm.  Devices suck bits automatically, so the fact that more of them are getting connected means more bits will be sucked through the network.  It doesn’t matter whether operators make money on it or not (well, OK, they can try to minimize their losses as long as it doesn’t hurt Cisco sales); operators have to pay for those sucking devices’ appetites.  We get a lot of the what’s-sucking detail in the video, but it leads up to the point that we now have the ability to supply enough bits that all the device sucking can’t stagger the Internet.  That’s what the announcement is all about.  We can converge all the different kinds of IP traffic because we now have a combination of silicon and optics that lets us do it at the IP level.

I characterize this cynically because I think it is cynical, but that doesn’t mean there’s no thread of value in the theme.  So, let me rephrase to say the same (sort of) thing without the usual Cisco oversell.

Operators are forced, today, to create parallel service/transport infrastructures because they lack a truly efficient way of partitioning and sharing IP capacity within a single infrastructure.  That raises their costs, not only for capital equipment but for operations as well.  Cisco’s union of silicon and optics supports orderly, efficient partitioning and sharing, and so can credibly be expected to lower that cost.  Same conclusion, but without Cisco’s usual excitement-building story.

There’s more to the story than photons and electrons, though.  As my sanitized recap says, you need to enhance operations too.  Here I find a bit of a glitch in Cisco’s numbers.  The material says “According to IDC, global telecom operating expenses were 78.7 percent of 2018 revenue.”  They make this statement in the context of “operations include prioritizing traffic and engineering traffic routes to ensure quality of service for distinct customer experiences.”  The problem is that the percentage they give is the portion of each revenue dollar that’s associated with all OAM&P, which is everything except capex and return to the investors.  The description they give is for what I’ve called “process opex”, which currently amounts to (liberally interpreted) about 30 cents of every revenue dollar.  Still, we do need to improve process opex, and in particular we need to ensure that adding new services or changing technology doesn’t drive up opex faster than it reduces capex.

Cisco’s view is that it’s not efficient to try to provide operations assurance by pushing the SLA spaghetti up two hills, the IP layer and the packet-optics-and-ROADM layer.  Better to unify the technologies so you can harmonize and improve the efficiency of the operations.  True, but the portion of operations costs that could (again, liberally) be considered to be directly related to network operations and SLA assurance is only about 5 cents per revenue dollar.

The basic problem with process opex is the operators’ inability to properly integrate the software elements associated with both service and infrastructure lifecycle management.  This is something that both the ETSI zero-touch automation activity and the Linux Foundation’s (originally AT&T’s) ECOMP were aimed at providing, and have failed to provide.  Cisco’s not providing it either, not even in the new stuff they’re talking about.

It may sound at this point like I’m debunking Cisco’s whole story and I’m not.  I’m always critical of Cisco’s way of announcing things because I think it’s short on substance and long on razzle-dazzle, but fundamentally, the network needs to be simplified, and Cisco’s approach may not (won’t, in my view) “revolutionize” the Internet, but it will improve IP-network economics when operators need improvement desperately.  Reducing the layer count is a logical approach, and to do that you can either subsume more optical-layer tradition into the router layer, or the other way around.  Cisco, obviously, prefers the former option, and their preference is important not only because this razzle-dazzle stuff does work, but also because the optical vendors like Ciena don’t do either razzle-dazzle or substance in their counterpoints.

The future is not determined by the best path forward, but the best path taken.  Operators are, and have been, almost slavishly devoted to vendor positioning to set their planning priorities.  Cisco is positioning.  Ciena is not positioning, nor are any other vendors in either of the two spaces.  Judged by that standard, Cisco has a winner here.

Should Amazon’s Nitro Open a Discussion about Virtualization?

Is virtualization already obsolete?  Or, at least, is it heading in that direction?  Amazon would like everyone to believe it is, as this SDxCentral article shows, and they may be right.  What Amazon and others would do about it could be a benefit to both cloud and data center.  Amazon’s own answer is its “Nitro” model, and since its approach to the problem of virtualization is already published, it’s a good place to start in understanding how virtualization principles could change radically.

Virtualization is one of an evolving set of strategies to share hardware infrastructure among users and applications.  “Multi-tasking” and “multi-programming” have been fixtures of computing infrastructure since the introduction of true mainframe computers over 50 years ago.  The difference between early multi-use strategies and today’s virtualization is primarily one of isolation and transparency.  The goal of virtualization of servers is to keep the virtual users isolated from each other, so that a given user/application “sees” what appears to be a dedicated system and not a shared resource at all.

Amazon, arguably the company with the most experience in server virtualization, recognized that there were two factors that interfered with achieving the isolation/transparency goal.  The first was complexity and overhead in the hypervisor that mediated CPU and memory resources, and the second was the sharing of the I/O and network interfaces on the servers.  What Amazon hopes to do is fix those two problems.

The concept of a “thin hypervisor” is easily understood.  Hypervisors manage access to shared system resources, creating the “virtual machines” that applications then use.  If the management and sharing mediation takes a lot of resources, there’s less to actually divide among the virtual machines.  In theory, this is an issue with all VM applications, but the problem is most acute for public cloud providers, whose business model depends on achieving reasonable profits at as low a price point as possible.  Amazon’s “Nitro Hypervisor” focus here could remedy a problem we should have addressed better and more completely before.

The I/O and network interface problem is more subtle.  One issue, of course, is that if all the VMs on a server share the same interfaces, then the capacity limits of those interfaces can cause queuing of applications for access, and that means VMs are not truly isolated.  This problem is greatest for lower-speed interfaces, so the general shift to faster I/O and network connections mitigates it to some degree.

The second issue is more insidious.  Most I/O interfaces are “dumb”, which means that a lot of the I/O and network activity is passed off to the server CPU.  Access to a disk file or the support of a network connection can cause an “interrupt” or CPU event for every record/packet.  In most cases, this event-handling will suck more CPU time than an inefficient hypervisor would.  Worse yet, it means that applications that are heavy users of interrupt-driven interfaces will suck resources from the CPU that can then starve co-loaded applications.  The goals of isolation and transparency are compromised.

The solution is smart, or smarter, interfaces.  All forms of I/O, including network I/O, have an associated “protocol”, a standard for formatting and exchanging information.  In addition, all applications that use these I/O forms will have a standard mechanism to access the media, implemented via an API.  There’s been a long-term trend toward making I/O and network interfaces smart, so they can implement the low-level protocols without CPU involvement.  Even PCs and phones use this today.  What Amazon wants to do is to implement the high-level application-and-API mechanisms on the interface directly.  Smart interfaces then become smarter, and applications that use interfaces heavily create far less CPU load, and have less impact on shared resources in virtual environments.

One big benefit of the smarter Nitro interface cards is that you can add I/O and network connections to a server to upgrade the server’s throughput, without adding a proportional load on the server CPU to handle the data exchanges.  It seems to me that, in the limiting case, you could create what Amazon (and the article) call “near-bare-metal” performance by giving applications a dedicated interface card (or cards) rather than giving them dedicated servers.

Isolation also has to address the issue of intrusion—hacking or malicious applications.  Amazon offloads the virtualization and security functions onto a chip, the “Nitro Security Chip”.  Dedicated resources, not part of the server’s shared-resource partitioning, are used for these functions, which means that they’re not easily attacked from within the virtual server environment; they’re not part of it.

What does this all mean for the cloud, and cloud-centric movements like NFV?  One clear point is that a “commercial off-the-shelf server” (COTS) is really not an ideal platform for the cloud.  A cloud server is probably going to be different from a traditional server, and the difference is going to multiply over time as the missions of cloud virtualization and data center applications separate.  That’s what hybrid cloud is already about, we should note.

Initially, I’d expect to see the differences focus on things like smarter interface cards, and perhaps more server backplane space to insert them.  As this trend evolves, it necessarily specializes servers themselves based on the interface complement that’s being supported.  Amazon’s Nitro page illustrates this point by showing no fewer than nine different “configurations” of EC2 instances.  That, in turn, requires that we enrich our capability to schedule application components or features onto “proper” resources.

I think this point is often missed in discussions of feature hosting and NFV.  There are clearly a lot of different “optimum” server configurations in the cloud, and it’s unrealistic to assume that support for all possible applications/features on a single platform would be an economically optimal (or even viable) approach.  As we subdivide the resource pool, we create a need for more complex rules for assigning components to hosting points.  NFV recognized this, but it’s chosen to establish its own rules rather than try to work with established scheduling software.
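
A toy example of what those richer assignment rules look like in practice; the node names, capability flags, and requirements are all invented for illustration:

```python
nodes = {
    "edge-pool-1": {"smart_nic": True,  "gpu": False, "cores_free": 8},
    "core-pool-7": {"smart_nic": False, "gpu": True,  "cores_free": 64},
}

def candidates(requirements):
    """Return the hosting points whose capabilities satisfy a component's requirements."""
    return [name for name, caps in nodes.items()
            if all(caps[k] >= v if k == "cores_free" else caps.get(k) == v
                   for k, v in requirements.items())]

# A data-plane feature that needs an offload-capable interface card:
print(candidates({"smart_nic": True, "cores_free": 4}))   # ['edge-pool-1']
# An ML component that needs a GPU-equipped configuration:
print(candidates({"gpu": True, "cores_free": 16}))        # ['core-pool-7']
```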

Kubernetes has developed a rich set of features designed to control scheduling, and I think it’s richer than what’s available for VMs.  Containers, though, are less “isolated and transparent” than VMs, and so container-hosting versions of platforms like Linux will probably evolve to include many of the “Nitro” features Amazon is defining.  Hosting platforms, like servers, will evolve to have very distinct flavors for virtualization versus application hosting.

Perhaps the most radical impact of the Nitro model is that it suggests that it’s imperative we separate management- and control-plane resources from application/service resources to improve both security and stability.  IP networking is historically linked to in-band signaling, and that makes it harder to prevent attacks on fundamental network features, including topology and route management.  Evolving concepts like SDN (OpenFlow) perpetuate a presumption of in-band signaling that makes them vulnerable to a configuration change that cuts off nodes from central control.  How we should isolate and protect the control and management planes of evolving services is an important question, and one that will take some work to answer.

Are Telcos Making a Mistake with Cloud Provider Edge Deals?

Cloud providers like Amazon, Google, and Microsoft have been tuning their strategies to conform to the evolution of the hybrid cloud model.  Telcos have been trying to tune or transform their business model to wean themselves away from dependence on bill-for-bits services, all of which are commoditizing.  Is it a surprise that the two groups might find some common cause?

Light Reading did an entertaining piece on this, casting the OTT players and public cloud giants as “vampires” being let into the walled village, at the peril of all.  For those weaned on the concept of “disintermediation”, it’s an attractive picture.  The question is whether edge computing, the focus of the deals the article talks about, is a source of blood food, or perhaps a source of killing illumination.

It’s common to talk about “cloud computing” and “edge computing” as being the next great profit opportunity, when in fact we have no direct evidence that’s the case.  Both the cloud and the edge have a variety of possible missions, the cloud’s being more firmly rooted in a set of hybrid applications that have an established business case.  But every mission isn’t necessarily a profit opportunity, and some vampire opportunities are traps.

Cloud computing, on the surface, would seem to be a great place for a telco to want to be.  Its value proposition relies to a great degree on economies of scale, high reliability, and the ability to tolerate a fairly low rate of return, all of which characterize the old telco model.  However, one does not transform oneself by readopting the old models.  Most of the telcos I’ve talked with don’t see themselves jumping into public cloud services.  To quote one: “I’m already in a low-margin business.  Why would I want to get into another?”

Why, if the business is low-margin, would Amazon and others want to be in it?  The answer lies in the real business model of cloud providers.  It’s not about “hosting”, it’s about services.  According to a recent Wall Street report, Amazon now offers about 175 different web services associated with its public cloud, an increase of 1.75 times over only two years.  When Amazon looks at a cloud customer, they see a prospect for many (or all) of those 175 services, not an “edge computing” customer.  In fact, in service terms, most cloud users wouldn’t know whether their stuff was running at “the edge” or not; they’d only know the service parameters Amazon would guarantee.

For the telco, then, the question is whether letting Amazon host stuff in telco edge facilities is really exposing them to a competitive risk.  Not, obviously, if they either don’t want to compete, or can’t.  I’ve already noted that telcos in general think public cloud is a low-margin service, which means that they’re really talking about hosting or IaaS.  The “services” that the public cloud providers increasingly focus on aren’t in the telco wheelhouse, first because they’re software and not infrastructure, and second because there’s a next-to-zero chance telcos would be able to build up an inventory of 175 services in two years, during which Amazon would presumably have added another 130 or so.  They’d have to be a totally different kind of company to play that game.

But—and it’s a big “but”—this doesn’t mean that telcos aren’t letting a vampire into their fold, just that the vampire isn’t a threat for what it takes (blood, in this analogy, or those profitable services in the real world) but for what it prevents the telco from doing.  It’s the space the vampire is taking up, not the teeth and appetite, that telcos should fear.  Even vampires you don’t need to fear for the obvious reasons still take up valuable space if you let them in.

Telcos need “services” too, for the same reason the cloud providers themselves do.  Featureless virtual computing or connectivity is, precisely because it is featureless, exceptionally difficult to differentiate.  It commoditizes.  Telcos are now trying to address what cloud providers addressed in the dawn of IaaS: the margins of the basic service stink.  Thus, you need something other than the basic service.  For the telcos, the answer is not to try to compete with Amazon or Microsoft or Google, but to do things that those cloud providers aren’t doing.  They need their own services.

The operators’ problem with the “own services” paradigm is that they instinctively fall back to “new connection services” that have exactly the same commodity-bandwidth problem they’re trying to escape.  I can’t tell you how many have said, wistfully, that “new services are hard!”  Yeah, they are, and they’ll get harder every year, because the very players you’re now letting into your data centers will be looking at the very services that operators/telcos could naturally expect to lead.

Personalization/contextualization is the largest incremental opportunity for hosted (meaning cloud) services in the market, an opportunity that alone could justify over 30,000 edge data centers worldwide.  It’s tightly linked with personal communications, location services, and IoT, and the investment needed in infrastructure to create a realistic and compelling market base would be formidable—just what a giant-infrastructure player like a telco could make.  However, you could creep into the space from the application side, and the public cloud providers are already doing that.

Most public cloud tools have one feature that telcos wouldn’t even want to try to emulate—they’re developer-centric.  In effect, they’re middleware.  Telco “services” should be information services and insight services, derived from all the stuff that telcos “know” about users.  In past blogs, I remarked that by understanding the patterns of movement of users among mobile cells, you could infer a lot about traffic, congestion, and even popularity of events or stores.  The point is that there’s a lot of information a telco has that could be made into a service, and since this information is anonymous (it doesn’t matter who’s stuck in traffic, only how many, for travel time and congestion analysis), it wouldn’t compromise privacy.
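
As a hedged sketch of the kind of anonymous, aggregate insight I mean (the cell names, counts, and threshold are invented), counting cell-to-cell handoffs per time window is enough to flag a likely traffic jam without identifying anyone:

```python
from collections import Counter

# (from_cell, to_cell) handoff events observed in one five-minute window
handoffs = [("cell-12", "cell-13")] * 480 + [("cell-13", "cell-14")] * 90

flows = Counter(handoffs)
BASELINE_PER_WINDOW = 150          # typical handoff count for these cell pairs

for (src, dst), count in flows.items():
    if count > 2 * BASELINE_PER_WINDOW:
        print(f"Likely congestion on the corridor {src} -> {dst}: {count} handoffs")
# Only aggregate counts are used; no individual user is identified or tracked.
```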

Much of the potential personalization/contextualization information has applications that require quick delivery.  Traffic avoidance is inherently a “systemic” activity, unlike collision avoidance, which belongs on-vehicle.  However, traffic avoidance does require quick responses to change, and so latency is an issue.  In addition, it’s inherently a local problem, so local processing is likely the optimum way to solve it.  Thus, it’s a nice application for edge computing.

If telcos really don’t want to get into anything that isn’t a connection service, consider what they could do with personalization/contextualization services in advertising for their own TV services.  Ad targeting based on context is, today, largely limited to having recent searches or perhaps emails trigger certain ads.  There’s far more information available that telcos could exploit.

I’d love to see the telcos frame a personalization/contextualization architecture (see my blogs on “information fields” for my thoughts), but if that’s too much of a reach, they might be able to get their arms around the information services I’ve discussed here.  But to return to my original point, the OTT vampire is a risk not for stealing edge real estate as much as stealing the services that justify edge real estate.  If telcos sit back and participate in the service revolution by proxy, they’re disintermediating themselves, just as they did by staying in the Internet access business when OTT opportunities were flourishing.

Why We Need a Tech Ecosystem to Save 5G, IoT, AI/ML, and Edge

Years ago, I was involved in early work on ISDN (remember that!).  Since CIMI Corporation had signed the Cooperative R&D Agreement (CRDA) with the National Institute of Standards and Technology (NIST), I had a chance to review many of the early use cases.  I remember in particular a day when a vendor technologist burst into the room, excited, and said “We just discovered a new application for ISDN.  We call it ‘file transfer’!”

Well, gosh, that’s been done, but even that misses the point.  A new technology has to have a business justification for its deployment.  That justification isn’t just a matter of thinking up something that could be done with it, or even thinking up something to which our new technology adds value.  What we need is a justification that meets the ROI requirements of all the stakeholders that would have to participate in the deployment.

The justification problem has plagued tech in general, but network and service provider network technology in particular.  The reason is that the value proposition for these spaces is complicated, meaning that there are a lot of issues that not only decide who the stakeholders might be, but also decide how each of them might make a business case.  I can prove that just about any new network technology would be great for operators if we assumed that 1) there was no sunk cost being displaced, 2) that the vendors who supplied the technology would charge less than break-even for it, and 3) that all the people involved in buying, installing, and maintaining the new technology were instantly and miraculously endowed with the necessary skills.  In most cases, none of those assumptions would be valid.

Right now, where we’re seeing this issue bite is in areas like 5G, edge computing, and artificial intelligence.  These are great for generating exciting stories, but so far, they’ve not been particularly good at making that critical business case.  One might wonder whether a big part of the problem is that nobody is really trying to make it.

Vendor salespeople tell me that exciting new technologies are very valuable to them, because they “open a dialog”.  You can’t push for a sale if you can’t get an appointment, and often a new technology offers a chance to sit down with a prospect.  Often, that discussion can lead to an early deal for a more conventional product.  This makes the vendors themselves a party to the exaggerated views we read, and it also means that the very people who will have to push for technology advance may have a vested interest in holding back.  With everyone focused on the next quarter, after all, what good is something that might take two or three years to develop?  This view then trickles down to impact the technologies themselves.

Let’s start with 5G.  When the technology was first discussed, everyone was agog about the enormous speeds it would offer, and the changes in our lives that mobile broadband at these new speeds would surely bring.  We still have a lot of stories about things like doing robotic surgery over 5G, and there are still people who eagerly await 5G service and phones, despite the fact that it’s very likely that the applications of today wouldn’t benefit a whit from 5G speeds, even if they were 10x what 4G brings.

Which, probably, they won’t be.  The 5G radio spec would enable very high speeds, but would the operators necessarily upgrade all their network infrastructure to carry those extra bits per second?  This article in Light Reading from Thursday of last week asks whether a major 5G operator has “moved the goalposts” in 5G performance.  Probably they did, because as I’ve often told my clients, PR is a “bulls**t bidding war.”  The story that generates the most excitement gets the best placement.  If Joe down the road says their 5G is twice as fast as LTE, then I’ll see that 2x and raise you 2x, and so we’re off to the races.

The interesting thing is that we’re going to get 5G.  Almost surely not the 5G that stories have featured and vendors have dreamed of, but we’re going to get it.  It’s the logical upgrade path for LTE networks, after all, and as mobile service expands, it will expand based on the most efficient technology available.  We might, eventually, ease into new mobile/wireless applications that demand higher speeds, even though we probably will never see that 10x improvement.  In the near term, though, we’ve started a hype wave that we now have to admit is falling short, because reality is less interesting than the early publicity.

The real problem with 5G, the one we need to be working on, is that willingness to pay for incremental bits of performance is far less than the cost of providing those bits.  Would you pay ten times as much for 5G as for 4G?  Surveys show that users want 5G with no price premium, which means that it would have to be provided at zero incremental cost, that operator profits would fall when they offered it, or that some other revenue source (presumably from something 5G could enable) would pick up the slack.

IoT was what the operators, and many vendors, hoped would provide that last boost, but IoT is another technology that’s been misunderstood or misrepresented.  IoT can help applications and people coexist in a virtual world, by feeding real-world context, habits, interests, and goals into a common system.  Absent that common system, expansion of sensor/controller technology beyond its current levels is hard to justify.  Where’s the work on providing for that expansion?

It’s now AI’s turn, and like 5G, AI is something we’ll certainly see more of, but that will have a hard time living up to the hype it’s generated.  The public, even many in the tech space, think of AI as something like one of Isaac Asimov’s robots or the “Hal” in “2001: A Space Odyssey”.  I guess I don’t need to point out that we’re almost a generation past 2001 at this point, and Hal is still proving elusive, but that doesn’t mean that AI won’t be important.

There are many applications of AI or machine learning (ML), and at least some of them are practical and useful in the near term.  One area where I think there’s particular hope is in replacing or augmenting fixed policies in networking and computing.  One problem with “policy-based” management, for example, is coming up with all the policies you need.  AI/ML could certainly help with that, both by learning from human responses to conditions and by inferring both problems and responses based on similar past activities.
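
As a hedged illustration of the policy-augmentation point, here’s a minimal sketch in which a threshold learned from past “normal” readings replaces a hand-written one.  The utilization figures, the three-sigma rule, and the guardrail against loosening the fixed policy are all assumptions chosen for the example, not a recommendation.

```python
# Minimal sketch of replacing a fixed policy threshold with a learned one.
# The data and the 3-sigma rule are illustrative assumptions, not a product.
import statistics

FIXED_UTILIZATION_LIMIT = 0.80   # classic hand-written policy

def learned_limit(history, sigmas=3.0):
    """Infer an alerting threshold from past 'normal' link utilization."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return mean + sigmas * stdev

def should_alert(current, history):
    # Use the learned threshold, but never loosen past the fixed policy.
    return current > min(FIXED_UTILIZATION_LIMIT, learned_limit(history))

if __name__ == "__main__":
    normal_hours = [0.41, 0.44, 0.39, 0.47, 0.43, 0.45, 0.42]
    print(should_alert(0.62, normal_hours))  # True: unusual for this link
    print(should_alert(0.48, normal_hours))  # False: within learned norm
```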

AI/ML could be a huge benefit, combined with a realistic vision of IoT.  Interpreting the variables input from the real world, and combining them with things like past behavior, current activity and communications, stated goals, and other factors, is a job for something more intuitive than rigid policies can provide.  Could we frame an IoT mission combined with an AI/ML mission?  Sure, but where’s that happening?

The challenge AI/ML faces is mostly one of expectations.  Remember Y2K?  There were legitimate problems with some programs that had failed to account for a rollover of dates beyond two digits, but visions of elevators plunging, planes dropping out of the skies?  How many companies failed to address the real issues because they were busily tossing mattresses on the floors of their elevators?  AI/ML isn’t going to threaten us with the Takeover of the Machine.  It’s not going to put you out of work, or likely anyone you know.  It could do some good, if we forget the silly stuff and focus on the basic value of inference and learning.

Which leaves us with edge computing.  Is there a mission for it?  Most assuredly, things like autonomous vehicles, computer gaming, and augmented reality could be applications of edge computing, but it’s not as simple as people would suggest.

Start with autonomous vehicles, everyone’s favorite…except mine.  The only truly sensible model for an autonomous vehicle is the model where all the critical systems like avoiding collisions are resident in the vehicle and reliant on local sensors.  The technology to do this is already in use and not particularly expensive.  Why would we want to off-load it to anything else, even at the edge?  We’d risk mass problems in the event of a network failure, and we’d really gain nothing in return.  Could edge computing be an information-resource partner to on-vehicle driving technology?  Sure, if we had it, and had the model for information distribution that would reside at the edge, neither of which we have.

Then take gaming.  Sure, latency matters in multiplayer games, but remember that the latency is a function of the “electrical distance” between you and the gaming processor that keeps track of players’ positions and actions.  Unless we think we’re playing only against local people, that processor isn’t in “the edge”.  Could games be designed for optimal use of edge computing?  Probably, but nobody is going to do that until edge computing is widely deployed; the move would limit the market for the game.
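
A quick back-of-envelope, using assumed distances and counting only fiber propagation delay (roughly 200 km per millisecond), shows why “electrical distance” dominates: a game server hosted across the country swamps anything an edge site could save.

```python
# Back-of-envelope "electrical distance": propagation delay alone, ignoring
# queuing and processing.  The ~200 km/ms figure assumes fiber (about 2/3 of
# the speed of light); the distances are examples, not measurements.

FIBER_KM_PER_MS = 200.0   # ~200,000 km/s -> about 200 km per millisecond

def round_trip_ms(km_one_way: float) -> float:
    return 2 * km_one_way / FIBER_KM_PER_MS

for label, km in [("edge site, 30 km", 30),
                  ("regional cloud, 800 km", 800),
                  ("cross-country game server, 4000 km", 4000)]:
    print(f"{label}: ~{round_trip_ms(km):.1f} ms round trip, propagation only")
```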

AR is probably the best candidate for a real low-latency edge mission.  I blogged about this in the past, and I continue to think that AR and contextual applications that would build on it are perhaps the most revolutionary thing we could actually end up seeing in the near term.  But again, while the pieces of the AR-and-contextual puzzle are easily identified, nobody seems interested in doing the dog work of building the system and promoting the approach enough to make a credible ecosystem out of it.  Simple applications that get a bit of ink are good enough.

AR is in many ways the endpoint of a realistic vision for all these technologies.  If we are to live in a virtual, contextualized, world, we need a way of getting the state of that alternate reality into our own brains.  Human vision is by far the richest of our senses, the only one that could input enough information fast enough to render our virtual world useful.  But does this mean we rush out and make AR work, and thus drive 5G and IoT and AI/ML?  That’s the same mistake we’ve been making.

You don’t achieve the benefits of transcontinental flight with an airplane door; you need not only the whole airplane but airports, traffic control, fuel, flight crew…you get the picture.  We’ve done ourselves a disservice by dividing revolutions into component parts, parts that by themselves can’t make a business case, and then hyping them up based on the presumption that by providing one part, we secure the whole.

The open-source community may be the only answer to this.  Why?  Because in today’s world all problems, all challenges, all opportunities are in the end destined to be addressed by software.  A good team of software architects, working for six months or so, could frame the optimum future for all these technologies.  A current body like the Linux Foundation, or a new group, could drive this effort, and launch what might not be the “Hal” or “I, Robot” or “George Jetson” future of our dreams, but a future that’s truly interesting, even compelling.  If you’re starting, or involved in, an open-source activity to address the tech-ecosystem challenge we face, let me know.

Hybrid Cloud and Wall Street

Wall Street is always important, but not always insightful.  Because the view of the Street on a company or strategy determines share price, which usually has a decisive impact on management policy, we have to follow it.  But the Street has its own agenda, a desire for a revolution that can be played, so we have to take their views on technology itself with a grain of salt.  Credit Suisse did a recent conference on TMT (telecommunications, media, and technology) that brought all this home.

The most important thing that I take away from the conference, based on attendee comments and material, is that the Street still doesn’t understand the cloud.  Most, I think, don’t even know what it means at a technical level.  Nowhere is this more obvious than in the way the Street covers “hybrid cloud”.

Suppose you build an extension on your home, maybe a new deck or a dormer and extra room.  Are you building a house?  Obviously not; the house is being extended.  If you have a data center and add on a cloud front-end, are you building a hybrid cloud?  No, you’re extending the data center strategy by adding a public cloud strategy.  The point of the analogy is that when you build onto a foundation strategy, it’s the thing being added that dominates the planning, or should.  You don’t need to add much to the house to support the deck, nor do you have to add a lot to the data center to support hybrid cloud.

One Street takeaway from the CS conference was that hardware vendors were seeing “choppy” demand but “the secular focus remains squarely on the push for hybrid cloud.”  The push is real, but the impact on hardware is highly speculative.

Nearly all the enterprise cloud use today is focused on providing a web/mobile front-end to legacy applications.  Those applications expose a set of APIs (application program interfaces) that are normally used for online transaction processing, and the cloud then feeds the APIs with work, essentially replacing dedicated terminals (remember those?) or local PCs.  The user experience is largely set by the cloud front-end.  This is “hybrid cloud”, meaning that applications are a hybrid of public cloud and data center processes.  The former, the cloud front-end, is the new piece, though.  Little or nothing is done to the data center.
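
To illustrate the front-end/back-end split in code, here’s a minimal sketch of a cloud-hosted handler feeding an unchanged legacy API.  The endpoint, fields, and formatting are hypothetical; the only point is that all the new work sits on the cloud side of the boundary.

```python
# Illustrative shape of today's "hybrid cloud": a cloud-hosted front-end
# reformats web/mobile requests and feeds an unchanged data-center API.
# The function names and fields are hypothetical.

def legacy_get_balance(account_id: str) -> dict:
    """Stand-in for the data-center OLTP API; in reality this would be an
    HTTPS or queue call into the existing back-end, which is not modified."""
    return {"account": account_id, "balance_cents": 123456}

def mobile_balance_handler(request: dict) -> dict:
    """Cloud front-end: validates, calls the legacy API, and shapes the
    response for a mobile experience.  All the 'new' work lives here."""
    account_id = request["account_id"]
    legacy = legacy_get_balance(account_id)
    return {"account": legacy["account"],
            "balance": f"${legacy['balance_cents'] / 100:,.2f}"}

if __name__ == "__main__":
    print(mobile_balance_handler({"account_id": "A-100"}))
    # {'account': 'A-100', 'balance': '$1,234.56'}
```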

As enterprises evolve this front-end/back-end cloud-and-data-center model, what usually happens is that work on the user experience opens an opportunity to create a different information relationship between the user and applications.  This different relationship is created by exposing new APIs, new processing features or information resources.  I’ve talked with a lot of enterprises about how this is done, and none of them indicated it was a task that required rethinking data center hardware.

Another related cloud takeaway actually comes from somebody who should know better, IBM.  IBM said that “Chapter One” of the cloud was where we were now, where 20% of “the work” had moved to the cloud.  They said this was largely “customer-facing applications”, which is a vague characterization of the cloud-front-end model that’s really what’s out there, but then they said that “Chapter Two” was moving the other 80%.

There is a next-to-zero chance that 100% of enterprise applications are moving out of the data center.  There is only a small chance that even half of them will; my model says that the fully mature cloud model will end up with about 42% of enterprise work in the cloud and the rest in the data center.  Most of that 42% isn’t “moved”, it’s newly developed for the cloud model.  Later on, IBM introduced another set of numbers: 60% of workloads migrate and 40% don’t.  More realistic, but my model reverses the percentages, and I think there’s very, very little chance that IBM’s figures are right.

One reason for the difference in cloud view might be the way IBM and the Street view containers and Kubernetes.  IBM thinks the biggest drag on adoption of the cloud is the difficulty in gaining comfort in containerization.  That raises my next point of Street confusion; the Street thinks all cloud applications have to be containerized, and that all containerization is aimed at the cloud.

Containers are portable application components, in effect.  They are an essential element in a good virtualization strategy, and containers and Kubernetes orchestration of container deployment were really first adopted by enterprises in a pure data-center hosting context.  You can tell that because public cloud adoption of Kubernetes (Google invented Kubernetes, so we’ll exclude their use) lagged enterprise data center use of the concept.  Not only that, recent advances to Kubernetes have been aimed at making it work in a hybrid cloud, which wouldn’t have been a necessary add-on had Kubernetes been designed for the cloud to start with.

Containers are a great strategy for any data center, and a beyond-great strategy for any company with multiple data centers.  Container adoption has been proceeding in the data center from the first, and it’s almost certain that we’d have containers and Kubernetes sweeping the software/platform space were there no cloud computing at all.  However, that doesn’t mean that there’s no container action going on in the hybrid cloud space.

What’s really going on here is what you might call the “gravitational pull” of the cloud.  As we add APIs to data center applications to enrich the user experience, we create “cloud pull” on the application components that present those APIs.  In more technical terms, software running in the data center, as it’s more tightly bound to the cloud’s scalable and elastic processes, tends to become more elastic.  Non-scalable APIs can’t effectively connect to a highly scalable front-end.

Making these components at the front-end/back-end boundary scalable means applying more cloud principles to their design, and also being able to move some instances into the cloud if local resources dry up.  This “cloudbursting” is where you end up needing to orchestrate across the boundary between cloud and data center, or among clouds.
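
A toy sketch of the cloudburst decision follows; the capacity number and the keep-local-then-spill rule are assumptions chosen only to show the placement logic that cross-boundary orchestration has to carry out.

```python
# A toy "cloudburst" placement rule: keep instances in the data center while
# capacity lasts, spill overflow into the public cloud, pull back as load
# drops.  The capacity figure and demands are assumptions for illustration.

DC_CAPACITY = 8   # instances the data center can host for this component

def place_instances(demand: int) -> dict:
    """Split a scaling demand between the data center and the cloud."""
    in_dc = min(demand, DC_CAPACITY)
    in_cloud = max(0, demand - DC_CAPACITY)
    return {"data_center": in_dc, "cloud": in_cloud}

if __name__ == "__main__":
    for demand in (3, 8, 13, 6):
        print(demand, "->", place_instances(demand))
    # 13 -> {'data_center': 8, 'cloud': 5}: the burst the text describes
```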

In effect, we have a middle ground between cloud and data center where elements can float between the two, scaling up into the cloud or back into the data center depending on load and performance issues.  That middle ground will get larger, constrained eventually by the fact that cloud economics favor applications with highly variable workloads, and core business applications (once the front-ends have been moved to the cloud) lack that characteristic.  This constrained migration is entirely driven by software issues, of course.

And that is what I think is the big miss for the Street, as the conference illustrates.  There’s a focus on the cloud and the future in hardware terms, not recognizing that what’s driving change in IT and networking alike is software.  The vendor with the best hardware may win server deals, but they’re not going to drive cloud migration, and in fact their sales are slaved to the hosting and application strategies that software will set, and is setting.

Lifecycle Automation, Granularity, and Scope

One of the moving parts in lifecycle automation is scope, by which I mean the extent to which an action impacts the network or a service.  It should be fundamental to our thinking about lifecycles and lifecycle automation, but it’s almost never discussed.  We’re going to discuss it today.

Imagine yourself driving down a long, lonely, road.  You round a bend and a large tree has fallen across the road.  You get out and inspect the tree, and you realize you have a choice to make.  You have a simple saw, and it will take quite a while to remove enough branches to get past.  You could turn back and look for another route, but you don’t know how far out of your way that will take you, or whether that other route might also be blocked.  That’s the “scope decision”; is removing the obvious barrier a better choice than finding a non-blocked route?

A service fails because a critical network element has failed.  Should we try to replace that element alone, or as small a number of elements as we can, to restore service?  Or should we just tear the service down and re-provision it, knowing that redeployment will route around all failures and not just the one we happened to detect?  That’s another “scope decision”, and it might be of critical importance in lifecycle automation.

The complexity of a remediation process is directly proportional to how surgical the remedy has to be.  In order to restore a service using the smallest possible number of changes or new elements, I have to understand the way the service was deployed, understand where the broken element fits, and then be able to “back out” the smallest number of my decisions and remake them.  This requires a service map, a model that describes the way the service was first created.  That model could tell me what I need to do in order to make minimal changes to restore service.

A service model divides a service into discrete elements in a hierarchical way.  We have, perhaps, something called “VPN-CORE”, and that particular element has failed because something within it failed.  It might have a dozen sub-components, chosen from a larger list when the service was deployed.  So, do we fix the sub-component that broke?  If so, we’ll have to rebuild the service connections within and to/from VPN-CORE.  We can do that, but it’s the service model that makes that kind of lifecycle automation possible.
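
Here’s a minimal sketch of that hierarchy and the scope decision it enables.  The element names and the “more than one concurrent fault means redeploy” rule are purely illustrative assumptions; the point is that the model tells the parent element what has to be rebuilt or reconnected in either case.

```python
# Sketch of a hierarchical service model ("VPN-CORE" -> devices) and the
# scope decision it enables: remediate the broken piece locally, or have the
# parent redeploy its whole subtree.  Names and rules are assumptions.

class ServiceElement:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def redeploy(self, depth=0):
        print("  " * depth + f"re-provision {self.name}")
        for child in self.children:
            child.redeploy(depth + 1)

    def handle_faults(self, failed_names):
        """Parent-level scope decision for faults reported from below."""
        if len(failed_names) <= 1:
            print(f"{self.name}: fix {failed_names[0]} in place, "
                  f"then reconnect it to the other sub-components")
        else:
            print(f"{self.name}: too much is broken; redeploying the subtree")
            self.redeploy()

vpn_core = ServiceElement("VPN-CORE", [ServiceElement("core-rtr-1"),
                                       ServiceElement("core-rtr-2"),
                                       ServiceElement("core-trunk-A")])

vpn_core.handle_faults(["core-rtr-2"])                   # local repair
vpn_core.handle_faults(["core-rtr-1", "core-trunk-A"])   # start over
```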

The other option, of course, is to start over.  Suppose we simply tear down the service completely and redeploy?  Suppose the services were built that way, built so that they didn’t leave committed resources hanging around that had to be removed?  Suppose everything was probabilistic and capacity-planned?  Then we’d just restore any “capacity withdrawal” we’d made when we decommissioned the failed service, and withdraw again when we recommissioned it.
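
A minimal sketch of that capacity-planned alternative, with an assumed pool size and service demand: the service never pins specific resources, it only withdraws from and restores to a shared capacity pool, so tear-down-and-redeploy becomes simple bookkeeping.

```python
# Sketch of the capacity-planned "start over" alternative: services don't pin
# specific resources; they just withdraw from and restore to a capacity pool.
# The pool size and service demand are illustrative assumptions.

class CapacityPool:
    def __init__(self, gbps: float):
        self.available = gbps

    def withdraw(self, gbps: float) -> bool:
        if gbps > self.available:
            return False           # admission control, not resource tracking
        self.available -= gbps
        return True

    def restore(self, gbps: float) -> None:
        self.available += gbps

pool = CapacityPool(gbps=100.0)
pool.withdraw(10.0)        # commission the service
pool.restore(10.0)         # fault: decommission, give capacity back
pool.withdraw(10.0)        # recommission wherever capacity exists
print(pool.available)      # 90.0
```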

The question we’re asking with our scope discussion is whether the overall effort associated with limited-scope remediation of a failure is worth the cost and complexity.  Or, alternatively, how granular should we assume deployment and redeployment should be, to optimize the complexity-versus-benefit trade-off?  That question requires that we look at the benefits of highly granular remediation and redeployment.

The most obvious benefit would be limiting service disruption.  Depending on the service we’re talking about, a limited-scope remediation of a fault could leave the rest of the service up and running, while a total redeployment would create a universal fault.  In our VPN example, it’s probable that if we redeployed an entire VPN-CORE element, we’d break all the VPN connections for the period of the redeployment.  I don’t think this would be acceptable, so we can presume that a complete redeployment isn’t in the cards.

This is logical in the framework of our long-dark-road analogy.  If getting around the tree could be avoided only by going all the way back to the starting point of our journey, we’d probably not like that option much.  In service terms, we probably could not accept a remedy approach that took down more customers than the original fault.

Another benefit to granular remediation is that it avoids churning resource commitments.  That churn is, at the least, wasteful of both steps and capacity, since we’d have a period in which the resources weren’t available for use.  We also have a risk, if we churn resources, that we can’t get the right resources back because something else grabbed them between our release and our request for redeployment.  If we tried to limit that lost-my-place risk by holding the resources in reserve, we’d increase the risk of wasted capacity, because we couldn’t be sure the same resources would even be selected in a new deployment.

The biggest problem with granular remediation is the complexity it introduces.  Going back to our roadblock analogy, if you focus entirely on getting around a local obstacle, you can easily end up cutting a lot of trees when another route could have avoided all the falls.  However, if you think only in terms of local remedies, how do you know what the global conditions are, and when another, broader option is better?  Hierarchical decomposition of service models avoids this problem.

If “VPN” is decomposed into a half-dozen elements, then the failure of any one can either be handled locally, or you can break down everything and recompose.  The former can be more complicated because every time you break down a piece, you likely have to reconnect the other pieces anyway, and you may lose sight of the fact that enough is broken that full recomposition is needed.  Should we then say that all faults are reported upward to the higher “VPN” element, which can either command the lower-level (broken) piece to fix itself, or simply start over?

It’s clear to me that the overall problem with granular remediation is that it requires a very strong service modeling strategy.  I don’t think that should be a problem, because I think such a strategy is important for other reasons, like its ability to support multiple approaches to the same service feature during a period of technology migration.  It also seems important to have a granular service model available to guide service setup.  Despite the value, though, the lack of effective granular modeling seems to be stalling the lifecycle automation efforts of vendors and operators alike.  ONAP is a good example.

I don’t like the “start-over” choice, in part because it’s not completely clear to me that it avoids service modeling.  In order for it to work, we need a completely homogeneous infrastructure or we need fully autonomous components that bind themselves together (adaptive, in other words) to do things or fix things.  I’m not sure this can be done in an era where we’re talking about migrating from proprietary boxes to hosted software feature instances.  But if we don’t figure out, and accept, and promote, a broad service modeling strategy, it may be our only option going forward.

You can bend infrastructure and service-level agreements to eliminate the need for granular remediation and service modeling.  You can probably bend your deployments of modernized hosted-feature technology for the same purpose.  You can possibly still handle networks and services that cross administrative boundaries and accommodate persistent differences in technology from one network to another.  All these qualifiers make me nervous.  We have TOSCA-based service models today; why not focus on the modeling-and-granular-remediation approach, and define how it works?  It would greatly advance our progress toward lifecycle automation.