What’s the Real Thing You Should Look For at MWC?

With MWC just around the corner and a flood of 5G stuff inevitable, this is the time to ask two questions.  First, what is really happening in the 5G space?  Second, when can we expect to see a complete 5G story deploy?  Those aren’t easy questions to answer in MWC discussions, particularly when the emphasis at trade shows isn’t likely to be “reality”.

The biggest barrier to a realistic view of 5G is the lack of a concrete definition of what 5G is.  There are really two broad pieces to 5G.  One is the 5G New Radio stuff, and the other is the changes to mobile infrastructure that accompany, but don’t necessarily have to accompany, the 5G NR stuff.  We can roughly map this to what’s being called 5G Core.  Originally the two were considered to be joined at the hip, but the 3GPP adopted a new work plan in 2017 called “NSA” for “Non-Stand Alone”.  This quirky name means 5G NR without 5G core, and that’s the key point to addressing our questions.

Everyone agrees that mobile services need more capacity and more bandwidth per user.  That’s particularly true if operators want to use a combination of fiber-to-the-node (FTTN) and cellular radio to create what look like wireline broadband services.  All of the 5G changes relating to the cellular radio network itself are part of NR.  From NR, you have three possible pathways forward.

Pathway one is to meld millimeter-wave NR with FTTN to support what’s essentially a “local 5G” network that would serve not mobile users but wireline broadband users.  This model of 5G doesn’t need mobility management (fiber nodes aren’t migrating, and fixed “wireline” terminations aren’t either) or any of the handset roaming and registration features.  This form of 5G is already being tested, and there will surely be real if limited deployments even in 2018.

Pathway two is 5G NSA (non-standalone, remember?).  This route says that you upgrade the RAN, where probably 95% of the 5G benefits come from anyway, and continue to use the mobility management and handset registration technologies of 4G, meaning IMS and EPC.  This changes out handsets to use 5G frequencies and radio formats, but leaves the rest of mobile infrastructure pretty much intact.  Since most users won’t toss handsets just to migrate to 5G and since 4G LTE compatibility is a requirement for 5G anyway, this gives you most of what 5G promises at the lowest possible impact.  Some 5G NSA deployments are certain for 2019, and possible even in late 2018.

Pathway three is true 5G, the combination of 5G NR and Core.  This is the 5G of dreams for vendors, the vision that includes stuff like network slicing, NFV integration, and all sorts of other bells and whistles that someone thinks would be exciting.  This is also the 5G that may never come about at all, and that’s the nub of the issue with 5G discussions.  Do we talk about the “standard” of 5G that always includes both NR and Core, or do we talk about NSA, which includes only NR and which is almost certain to deploy?

In my view, the 5G standard is a combination of an attempt to address a plausible pathway for wireless evolution and the typical overthinking and overpromotion of our networking marketplace.  If you assume that we’re going to see billions of IoT devices connected via the cellular network, that we’re going to have hundreds of different, independent, virtual cellular networks supporting hundreds of different applications, and that we’re going to demand free roaming between WiFi, satellite, wireline, and wireless calls, then perhaps you need full 5G.  Can we make those assumptions?  Not now, but standards have to prepare for the future, so it’s not unreasonable to talk about 5G in full standards form as an evolutionary path, one that has to be justified by tangible market opportunity.  Otherwise it’s pie in the sky.

5G NSA is proof of that.  The vendors involved were supporters of the NSA approach, and surely they would have been more reserved had it been likely that holding out for full 5G was an option.  The vendors almost certainly realized that if operators were presented with a 5G all-or-nothing choice, they’d have selected the “nothing” option, or forced an NSA-like approach down the line.  Sure, vendors would love a mobile infrastructure revolution, but not if it’s a low-probability outcome in a high-stakes game where real radio-network dollars are on the table.

What this means for 5G is that NSA is the real path, and that all the Core stuff is fluffery that will have to be justified by real opportunities.  Here, the facts are IMHO quite clear; there won’t be any of those real opportunities in the near term.  If there’s a 5G core evolution, it’s probably not coming until after 2022, and even then, we’d have to see some decisive progress on the justification front, not on the standards front.

There are two realistic drivers of a broader 5G deployment: rapid IoT adoption that depends on cellular-linked IoT elements, and a shift to streaming video rather than linear TV.  The first of the two is the more glamorous and the less likely, so we’ll look at it first.

Almost all the growth in “IoT” has been through the addition of WiFi-linked sensors in the home.  This has no impact whatsoever on cellular needs or opportunity, and it creates zero additional 5G justification.  What you’d need for 5G stimulus is a bunch of new sensors that are directly connected to the cellular network, and while there are various 5G radio suggestions for this, and there are missions that could credibly consume that configuration, the business case behind them hasn’t been acceptable up to now.

Apart from the “soft” question of public policy and personal privacy that open sensors raise, there’s the simple point of ROI.  Remember, anything that’s directly networked, rather than behind the implicit firewall of home-style NAT, would have to be secured.  The stuff would have to be powered, maintained, etc.  Private companies installing IoT sensors would have to wonder how they’d pay back on the cost.  Would cities be willing to fund initiatives to make all traffic lights smart?  It would depend on how much of a case could be made for a result that would improve traffic conditions and the driving experience.  It would also depend on how confident politicians were that the results would actually be delivered, because if they weren’t, the next election wouldn’t be pretty.

The video story is complicated, but plausible.  We already know that streaming on-demand video demands effective content delivery networking, places to cache popular material to avoid over-consuming network resources on long delivery paths.  What about live TV?  Imagine a bunch of mobile devices, and also wireline-via-FTTN/5G stuff, streaming live TV.  Would you want to pull each stream from a program source, even one within a metro area?  Would you want to cache the material locally, use multicasting, or what?

How much have you read about live TV streaming as a 5G driver?  I know I’m not exactly bombarded with material, I don’t hear operators clamoring for it, I don’t see vendors pushing solutions.  But if you want to see 5G in anything other than the NSA version, that’s what you should be looking for in Barcelona.  Don’t hold your breath, though.  As I’m sure most of you know, relevance to real issues is not a trade show strong point.

What’s Happening with SD-WAN and How To Win In It

The SD-WAN space is a unique combination of risk and opportunity.  It’s clearly a risk to the traditional VPN and VLAN service models, the operator services that are based on the models, and the vendors whose equipment is used to deliver the services.  It’s an opportunity for startups to rake in some money, for enterprises to save some money, and for operators to create a more infrastructure-agile and future-proof service model.  The question is what side of the risk/reward picture will win out.

Right now, the market is in a state of flux.  Operators are dabbling in SD-WAN, and every startup knows that if network operators themselves become the dominant conduit for SD-WAN products, the winning vendors will look very different from the winners should enterprises dominate.  But “dabbling” is closer to “dominating” alphabetically than in the definitional world.  Not only are operators not truly committed to SD-WAN, they’re really not committed to why they should be committed to it.

SD-WAN is an infrastructure-independent way of delivering virtual network services, because it’s a form of overlay technology.  Vendors differ in how they position their stuff, whether they use a standard encapsulation approach, a proprietary approach, or don’t strictly encapsulate but still tunnel.  In the end, every technology that uses a network service to ride on is an overlay.  All overlays have the property of being independent of who’s doing the overlaying, so to speak.  For there to be differentiation, you have to create some functional bond between the overlay and underlay, a bond that the network operator can create because they own the network, but that others could not.
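To make the overlay/underlay distinction concrete, here’s a minimal sketch in Python—purely illustrative, not any vendor’s implementation—of why an overlay is inherently underlay-agnostic: the underlay only ever sees an opaque payload plus a virtual-network tag.

```python
from dataclasses import dataclass

@dataclass
class OverlayPacket:
    vnet_id: str    # identifies the customer's virtual network, not the transport
    src_site: str   # logical SD-WAN endpoints, independent of underlay addresses
    dst_site: str
    payload: bytes  # the user's original packet, carried opaquely

def encapsulate(vnet_id: str, src_site: str, dst_site: str, inner: bytes) -> OverlayPacket:
    """Wrap a user packet in an overlay header; the underlay never inspects it."""
    return OverlayPacket(vnet_id, src_site, dst_site, inner)

def send_over_underlay(pkt: OverlayPacket, underlay_send) -> None:
    # The underlay (MPLS, Internet, LTE...) is just a byte-delivery function here,
    # which is why any transport that can carry bytes can carry the overlay.
    underlay_send(pkt.dst_site, pkt.vnet_id.encode() + pkt.payload)

# A stand-in "underlay" that only sees opaque bytes and a destination
send_over_underlay(
    encapsulate("acme-vpn", "branch-7", "hq", b"user IP packet"),
    underlay_send=lambda dest, data: print(f"deliver {len(data)} bytes toward {dest}"),
)
```

The point of the sketch is that nothing in the overlay header depends on what the underlay is, which is exactly why differentiation has to come from somewhere else.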

Operators aren’t committed to offering any such functional bonding.  I contacted two dozen of them over the last six weeks, and what I found was “interest” in the idea of having a symbiotic relationship between an SD-WAN and the underlying network, and “exploration” of benefits in both a service-technical sense and a business/financial sense.  It almost appears that operator interest in SD-WAN offerings is more predatory than exploitive.  Perhaps by offering SD-WAN they can lance the boil of MSP or enterprise deployments, and in the process weed out a lot of SD-WAN vendors whose initiatives might end up generating some market heat.

It’s really hard to say whether this strategy would work, and almost as hard to say whether operators could establish a meaningful symbiotic strategy if they wanted to.  Some of my friends in the space tell me that operators fall into two camps—those who have to defend territory and those who can benefit significantly from breaking down territory boundaries.  It’s the latter group who have been most interested in SD-WAN, and obviously running an SD-WAN outside your own territory limits how much you can hope for from symbiotic offerings.  The other guy isn’t going to let you tweak his underlay.

What extraterritorial SD-WAN does is let operators create a seamless VPN connectivity map for buyers whose geography is far broader than the operator’s own footprint, or even the range the operator can cover through federation deals with partner carriers.  However, some operators say they’d really rather somebody else did this extension, preferably the enterprises themselves and, if necessary, the MSPs.  The problem they cite is the difficulty of sustaining high connection quality (availability and QoS) with Internet overlays.

Amid all this confusion, it’s not surprising that SD-WAN vendors are themselves a bit at sea.  That’s bad, because it’s clear that there’s going to be a shakeout this year, and absent a clear vision of what the market will value, the risk of being a shake-ee is too high for many.  What might work?

To me, the clear answer is SD-WAN support for composable IP networks.  Market-leading container software Docker imposes one presumptive network model, and market-leading orchestration tool Kubernetes imposes another totally different one.  Microservices and component sharing fit differently into each of these, and so do things like cloudbursting or even in-house scaling.  Public cloud providers have their own addressing rules, and then there’s serverless and event processing.  It’s very easy for an enterprise to get so mired in the simple question of making everything connect like it’s supposed to that they don’t have time for anything else.

One thing in that category that seems a sure winner is a superior system for access control, to apply connectivity rules that govern what IP addresses can connect to what other ones.  Forwarding rules are essential to SD-WAN anyway, and having a system that lets you easily control connection policies makes an SD-WAN strategy functionally superior to most VPNs, where doing the same thing with routers is far from easy.
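Here’s a hedged illustration of what that kind of connection-policy control amounts to; the prefixes and rules are invented, and real SD-WAN products express this through their own policy languages, but the principle is the same: connection permissions come out of the same rule base that drives forwarding.

```python
import ipaddress

# Invented connection policies: (source prefix, destination prefix, allow?)
POLICIES = [
    ("10.1.0.0/16", "10.9.9.0/24", False),  # branch users may not reach the payment zone
    ("10.1.0.0/16", "10.2.0.0/16", True),   # branch-to-data-center traffic is allowed
]

def connection_allowed(src_ip: str, dst_ip: str, default: bool = False) -> bool:
    """First matching rule wins; anything unmatched gets the default treatment."""
    src, dst = ipaddress.ip_address(src_ip), ipaddress.ip_address(dst_ip)
    for src_net, dst_net, allow in POLICIES:
        if src in ipaddress.ip_network(src_net) and dst in ipaddress.ip_network(dst_net):
            return allow
    return default

print(connection_allowed("10.1.4.20", "10.9.9.5"))   # False: blocked by policy
print(connection_allowed("10.1.4.20", "10.2.0.10"))  # True: explicitly allowed
```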

Related to this is address mapping/remapping.  SD-WAN is likely to deploy as a VPN, connecting various virtual hosts created with VM or container technology, and also pulling in a public cloud or two or three.  Each of these domains has specific rules or practices for addressing, and getting all them to harmonize on a single plan is valuable in itself, and essential if you’re also going to control connectivity as I’ve suggested in my first point.
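A minimal sketch of the harmonization idea, with made-up domains and prefixes: each addressing domain gets mapped into its own slice of a single plan that the VPN, and the connectivity rules above, can work against.

```python
import ipaddress

# Hypothetical per-domain address plans an SD-WAN edge might have to reconcile
DOMAIN_PREFIXES = {
    "on-prem-vms":  "10.0.0.0/16",
    "containers":   "172.20.0.0/16",  # e.g., a cluster's pod address range
    "public-cloud": "10.50.0.0/16",   # a VPC with its own plan
}

def remap(domain: str, local_ip: str) -> str:
    """Map a domain-local address into that domain's slice of 100.64.0.0/10."""
    slice_index = sorted(DOMAIN_PREFIXES).index(domain)
    base = int(ipaddress.ip_address("100.64.0.0")) + slice_index * 2**16
    host_bits = int(ipaddress.ip_address(local_ip)) & 0xFFFF  # keep the low 16 bits
    return str(ipaddress.ip_address(base + host_bits))

print(remap("containers", "172.20.3.7"))   # 100.64.3.7
print(remap("public-cloud", "10.50.3.7"))  # 100.66.3.7 (same host, different slice)
```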

The management framework, including the GUI and network mapping features, would be critical for both these capabilities.  Even more critical is a foundational notion, the notion that the challenge of the future (posed by virtualization) is to create connection/address elasticity that corresponds to the resource and component elasticity that modern cloud and application practices give us.  We are building IP networks today based on the same principles that were developed before there was a cloud, or even an Internet, in any real sense.

There are, to be sure, plenty of initiatives in the IETF to modernize IP, but most of them are actually unnecessary and even inappropriate in the near term, because they’d require client software transformation to adopt.  What an SD-WAN box could do, by sitting at the network-to-user boundary, is make the network appear to the user to be what it needs to be, while allowing the transport network to be what it must be.

Nobody in the SD-WAN space is in the right place on this, so far.  That means that even if there’s a market shake-out coming, there’s still a chance to grab on and hold on to the critical market-shaping ideas.

The Relationship Between Service Modeling and Management Strategies

Service modeling is important for zero-touch automation, as I said in an earlier blog.  Service modeling, in terms of just how the model is constructed, is also important for operations, service, and network management.  In fact, it sets up a very important management boundary point that could have a lot to do with how we evolve to software-centric networking in the future.

You could argue that the defining principle of the modern software-driven age is virtualization.  The relevant definition is “not physically existing as such but made by software to appear to do so.”  Jumping off from this, software-defined elements are things that appear to exist because software defines a black-box or boundary that looks like something, often something that already exists in a convenient physical form.  A “virtual machine” looks like a real one, and likewise a “virtual router”.

Virtualization creates a very explicit boundary, outside of which it’s what software appears to be that matters, and inside of which is the challenge of ensuring that the software that really is looks like what’s being virtualized.  From the outside, true virtualization would have to expose the same properties in all the functional planes, meaning data plane, control plane, and management plane.  A virtual device is a failure if it’s not managed like the real device it’s modeled on.  Inside, the real resources that are used to deliver the correct virtual behavior at the boundary have to be managed, because whatever is outside cannot see those resources, by definition.

One way to exploit the nature of virtualization and its impact on management is to define infrastructure so that the properties of virtual devices truly map to those of the real thing, then substitute the former for the latter.  We already do that in data centers that rely on virtual machines or containers; the resource management properties are the same as (or similar enough to) the real thing as to permit management practices to continue across the transition.  However, we’ve also created a kind of phantom world inside our virtual devices, a world that can’t be managed by the outside processes at all.

The general solution to this dilemma is the “intent model” approach, which says that a virtual element is responsible for self-management of what’s inside, and presentation of an explicit SLA and management properties to what’s outside.  An older but still valuable subset of this is to manage real resources independently as a pool of resources, on the theory that if you capacity-plan correctly and if your resource pool is operating according to your plan, there can be no violations of SLAs at the virtual element level.
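As a rough illustration (the names and properties are invented), an intent-modeled element might look like this from the outside: an SLA and a status, with the resources that deliver them hidden inside the black box.

```python
from dataclasses import dataclass, field

@dataclass
class IntentElement:
    """An intent-modeled element: the outside sees only an SLA and a status;
    the resources that actually deliver the behavior stay hidden inside."""
    name: str
    sla: dict                                       # e.g. availability, latency targets
    _resources: list = field(default_factory=list)  # implementation detail, not exposed

    def status(self) -> dict:
        # The exposed management view: the SLA and whether it's being met.
        return {"element": self.name, "sla": self.sla, "in_violation": self._self_check()}

    def _self_check(self) -> bool:
        # Self-management lives inside the black box; a real element would try to
        # remediate (rehost, reroute, scale) before ever admitting a violation.
        return False

vpn_core = IntentElement("VPN-Core", {"availability": 0.9999, "latency_ms": 30},
                         ["router-a", "router-b"])
print(vpn_core.status())
```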

The difference between the broad intent-model solution and the resource management solution arises when you consider services or applications that are made up of a bunch of nested layers of intent model.  The lowest layer of modeling is surely the place where actual resources are mapped to intent, but at higher layers, you could expect to see a model decompose into another set of models.  That means that if there are management properties that the high-level model has to support, it has to do that by mapping between the high-level SLA and management interface, and the collection of lower-level SLAs and interfaces.

From a management perspective, then, a complex service or application model actually has three different management layers.  At the top is the layer that manages virtual elements using “real-element practices”.  At the bottom is the resource management layer that manages according to a capacity plan and is largely unaware of anything above, and in the middle is a variable layer that manages the aggregate elements/models that map not to resources but to other elements/models.

The management layering here is important because it illustrates that many of our modern network/service automation strategies have missing elements.  The simple model of two layers, the top of which is based on the “real” device management already in place and the bottom on generalized resource management, won’t work if you have a service hierarchy more than two levels deep.

One solution to that is to make the virtual device bigger, meaning to envelop more resource-directed functions in high-level models.  A VPN that is created by one huge virtual router represents this approach.  The problem is that this creates very brittle models; any change in infrastructure has to be reflected directly in the models that service architects work with.  It’s like writing monolithic software instead of using componentization or microservices—bad practice.  My work on both CloudNFV and ExperiaSphere has demonstrated to me that two-layer service structures are almost certain not to be workable, so that middle layer has to be addressed.

There are two basic ways to approach the management of middle-level elements.  One is to presume that all of the model layers are “virtual devices,” some of which simply don’t correspond to any current real device.  That approach means you’d define management elements to operate on the middle-layer objects, likely based on device management principles.  The other is to adopt what I’ll call compositional management, meaning adopting the TMF NGOSS Contract approach of a data model that mediates events, steering them to the correct (management) processes.

IMHO, the first approach is a literal interpretation of the ETSI NFV VNF Manager model.  In effect, you have traditional EMS processes that are explicitly linked with each of the virtualized components, and that work in harmony with a more global component that presumably offers an ecosystemic view.  This works only as long as a model element always decomposes into resources, or at least into virtualized functions, rather than into other model elements.  Thus, it seems to me to impose a no-layers approach to virtual services, or at minimum to leave the middle layers unaddressed.

You could extend the management model of the ISG to non-resource-decomposed elements, but in order to do that you’d need to define some explicit management process that gets explicitly deployed, and that then serves as a kind of “MIB synthesizer” that collects lower-level model element management views, and that decomposes its own management functions down into those lower layers.  This can be done, but it seems to me to have both scalability problems and the problem of needing some very careful standardization, or elements might well become non-portable not because their functionality wasn’t portable but because their management wasn’t.

The second approach is what I’ve been advocating.  A data model that defines event/process relationships can be processed by any “model-handler” because its functionality is defined in the data model itself.  You can define the lifecycle processes in state/event progression terms, and map them to specific events.  No matter what the level of the model element, the functionality needed to process the events through the model is identical.  The operations processes invoked could be common where possible, specialized when needed, and as fully scalable as demand requires.
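A minimal sketch of what I mean, with invented states, events, and processes: because the (state, event)-to-process mapping lives in the data model, one generic handler can drive any element at any layer.

```python
# Invented processes; in a real system these would be operations components.
def deploy(elem):   print(f"deploying {elem['name']}")
def escalate(elem): print(f"escalating fault on {elem['name']}")
def ignore(elem):   pass

PROCESSES = {"deploy": deploy, "escalate": escalate, "ignore": ignore}

# A model element carries its own (state, event) -> (process, next state) table.
vpn_access = {
    "name": "VPN-Access",
    "state": "ordered",
    "state_events": {
        ("ordered", "activate"): ("deploy", "active"),
        ("active", "fault"):     ("escalate", "degraded"),
        ("active", "activate"):  ("ignore", "active"),
    },
}

def handle(elem: dict, event: str) -> None:
    """Generic model-handler: all element-specific behavior comes from the table."""
    process_name, next_state = elem["state_events"][(elem["state"], event)]
    PROCESSES[process_name](elem)
    elem["state"] = next_state

handle(vpn_access, "activate")   # -> deploying VPN-Access
handle(vpn_access, "fault")      # -> escalating fault on VPN-Access
```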

You probably can’t model broad resource pools this way, but you don’t need to.  At the critical bottom layer, a small number of intent models with traditional resource management tied to policy-based capacity planning can provide the SLA assurance you need.  This approach could also be made to work for services that didn’t really have independent SLAs, either implicit or explicit.  For all the rest, including most of what IoT and 5G will require, we need the three layers of management and most of all a way to handle that critical new virtualization-and-modeling piece in the middle.  Without the middle, after all, the two ends add up to nothing.

A Hands-On Perspective on the Future of TV Viewing

Streaming video that includes live TV is the biggest threat to the traditional linear model of television.  Depending on just what the operator does with it, that makes it the biggest threat to the cable TV and telco TV franchises, or the biggest opportunity.  I’ve blogged about the overall issues of streaming live TV already, but I’ve recently had a chance to look at the space closely, so I want to share my hands-on perspective.

Beauty, and live TV streaming, are functionally in the eye of the beholder.  Streaming fans tend to come from two different camps.  One represents TV viewers who either are not regularly viewing at home, or are tending to consume time-shifted material more than live material.  The other represents more-or-less traditional TV viewers who want to “cut the cord” and eliminate the charges and contracts associated with traditional linear TV.  The camp you’re in is critical when you look at the various live TV streaming options.

Hulu’s live offering is in beta still, but IMHO it illustrates a concept targeted to the first of my two groups.  The experience is nothing like watching linear TV.  There’s no channel guide, and getting to network live programming is awkward to say the least, for those who “watch TV”.  On-demand or time-shifted viewing is typically managed by show and not by time, and so things like channel guides and the notion of “on now” are fairly meaningless.  For this group, it may make perfect sense to go to a “show” tile to find both VoD and live programming, and to go to a “network” tile for the “on-now” stuff.

What you want to watch is another dimension.  Hulu, Google, and Sling TV all have limited channel repertoires, and I think that’s also something appropriate to my first on-demand-centric group.  If you want to watch something on now, then having more channels to choose from gives you a greater chance of finding something you like.  If you’re just looking for something interesting, then it’s the totality of shows and movies, more than the number of live channels, that matters.  Amazon Prime is a perfect example; you look for material based on your areas of interest, and it’s on-demand.

DirecTV Now and Playstation Vue are more aligned to my second group of viewers.  Both have linear-TV-like channel guides that let you keep your “on-now” perspective of viewing.  Both also have a larger inventory of channels (and more channel bundles at different price levels).  This means that people accustomed to linear viewing with cable or telco TV have a better chance of getting a package that includes the channels they are used to watching.  Both include local stations in major market areas, so you don’t miss local news and weather or local sporting events with local announcers.

Playstation Vue and DirecTV Now illustrate another difference between the two groups of viewers.  My second, linear-committed viewer group is most likely to watch shows in groups, while the first viewer group is most likely to watch solo.  This is reflected in the fact that Playstation Vue supports more concurrent sessions than DirecTV Now, and of course the link between Playstation Vue and Playstation as a game system reinforces that solo model.

Another important point is DVR.  Virtually all linear systems allow for home recording of TV shows, but because streaming TV doesn’t require a specific “cable box” in which to site DVR functionality, the capability isn’t automatic in streaming services.  However, everyone but DirecTV Now offers cloud DVR (some at additional cost), and DirecTV Now is supposed to get the feature, along with a GUI upgrade, sometime in the spring.

For some, DVR isn’t necessarily a key point.  Most of the streaming live TV services include archives of past shows, usually available quickly and for a month or so.  Many networks let you sign onto their own streaming apps (on Roku, Google Chromecast, Amazon Fire TV, etc.) using TV Everywhere credentials, after which you can access shows within three or four days max of when they aired.  How long material is available will vary, though.  There’s also a PC package called “PlayOn” (also available in a cloud version) that will record not the live shows but the archived shows, which means you won’t lose them after 30 days or so, the usual limit.  With some effort, you can make live streaming TV work almost as well as linear TV, and with lower cost.

The qualifiers I keep using, like “almost”, are important because the second group of cord-cutters includes the generation less likely to be comfortable diddling with technology or dealing with changes.  The GUI is one such change.  Linear TV typically has a fairly architected user interface, with a custom remote that makes it easier to do what you need/want.  With streaming live TV, providers have to accommodate a range of different streaming devices, each with its own controls and perhaps a different mobile app.  You can get things like voice control of TV that might be impossible with linear, but many people find the change in interface a bit jarring.

The setup for streaming can also be intimidating to non-technical people.  You need some sort of device to stream to.  There are add-on devices from Amazon, Apple, Google, and Roku (for example), and there are “smart TVs” that can directly support streaming, plus you may be able to cast to your TV from your phone or computer, as well as stream directly.  All these work a bit differently, and not all the streaming channels are supported on a given device.  You need to check what you can get before you make a deal, and if possible, you should view your choice in a friend’s home before you buy.

Then there’s the quality of your home network.  You probably need to think seriously about 50Mbps or better Internet service to stream live TV reliably to a typical home.  Some people tell me they can make 25Mbps work, but it depends on how many independent streams you have and whether you’re streaming 4K or a lower resolution.  That means that many DSL connections are going to be limited in terms of streaming support, which is likely why AT&T elected to buy DirecTV and start supporting a satellite delivery alternative to U-Verse TV.

In the home, the “standard WiFi” may or may not work.  Modern WiFi uses two frequency bands, 2.4GHz and 5GHz, and there are two broad classes of 802.11 standards in the field, the older “single-letter” ones like 802.11g and 802.11n, and the newer “two-letter” 802.11ac.  The oldest of those, 802.11g, tops out at 54 Mbps, and of course the capacity is shared, so you might find your other WiFi use congests things.  802.11n and especially 802.11ac can run many times faster, particularly on the 5GHz band.

Streamers may also face issues with coverage.  A large home can be covered completely with a single well-placed, state-of-the-art WiFi router, but many of the standard WiFi routers that come with broadband Internet service will not cover anywhere near that, and you’d need to have a “mesh” system or repeaters.  The problem with WiFi mesh/repeater technology is that most of it is based on a WiFi relay principle, where WiFi both connects the repeater to distant parts of the home, and also connects the repeater to the master WiFi router.  That means that the repeater has to handle the same traffic twice.  Multi-band WiFi is probably critical for repeater use.

You can probably see the challenge here already.  If there’s no computer/Internet literacy in a household, it’s going to be a long slog to get streaming to work unless you’re lucky and it works out of the box because you got the right gear from your ISP or you have a limited area where you need access.  And if you’ve followed the trend of using networked thermostats, security devices, etc. you may have even more problems, because changing your WiFi will in most cases disconnect all that stuff, and you’ll have to reconnect it.

Where does this leave us?  If there are two different viewing groups with two different sets of needs, then logically we’d really have two different markets to consider.  So far, streaming has penetrated fastest where it offers the most incremental value and the least negative impact on expectations.  If it continues in that channel of the market, it would likely eventually change TV forever, but not in the near term.  If it manages to penetrate the second market group, it transforms TV at the pace of that penetration.  It’s the TV-oriented group, then, that matters most.

Sling TV owes its leading market share to its appeal to my first mobile-driven, on-demand, group.  DirecTV Now has the second-largest market share, and I think the largest potential to impact the second group.  AT&T would have suffered a net loss of video customers were it not for DirecTV Now, and they mentioned the service and plans to upgrade it on their last earnings call.  They have the money to push the service, and for AT&T it offers the attractive capability to ride on competitors’ Internet access and so draw TV revenue from outside the AT&T territory.

That last point is probably the factor that will determine the future of streaming TV and thus the ongoing fate of linear TV.  The success of AT&T with DirecTV Now would certainly encourage competitors to launch their own offerings.  Amazon, who reportedly decided not to get into the streaming live TV business, might reconsider its position.  Google, remember, already has an offering, and they’re a rival of Amazon in virtually every online business.  Comcast and Charter both have basic wireless streaming offerings, though neither is yet ready to displace traditional linear TV in its own service area.  There are lots of opportunities to step up the streaming competitive game, and boost streaming live TV overall.

Streaming isn’t going to make TV free, and in fact if it totally displaces linear and induces providers to offer different bundling strategies, it might end up killing some smaller cable networks who only get license fees as part of a larger package.  It will make TV different, and shift the focus of wireline providers from linear video to Internet.  Since Internet services are lower-margin, the shift could put them all in a profit vice unless they do something—like focus on streaming, and what might be beyond it in terms of OTT video services.  That’s a shift I think is coming sooner than I’d expected.

The Multiple Dimensions of Service Data Models

Service modeling has been a recurring topic for my blogs, in part because of its critical importance in what I’ve called “service lifecycle automation” and the industry now calls “zero-touch automation”.  Another reason it’s recurring is that I keep getting requests from people to cover another angle of the topic.  The most recent one relates to the notion of a “service topology”.   What the heck is that, many are wondering.  Good question.

A “service” is a cooperative relationship between functional elements to accomplish a systemic goal that the service itself defines.  The “functional elements” could be legacy devices like switches, routers, optical add-drop multiplexers, and so forth.  In the modern software-defined age, it could also be a bunch of virtual functions, and obviously it could be a combination of the two.

A service model, then, could be seen as a description of how the functions cooperate, a description of their relationship.  In a “network” of devices, a service model might be a nodal topology map, perhaps overlaid with information on how each node is induced to behave cooperatively.  This is our traditional way of thinking about services, for obvious reasons.  It assumes that the service is a product of the network, which is an entity with specific structure and capabilities.  In our software-defined era, things get more complicated, and at several levels.

Yes, you could visualize a future service as a collection of virtual devices that mapped 1:1 to the same traditional boxes in the same physical topology.  That could be called the “cloud-hosted node” model.  NFV proposes a different approach, where virtual functions that are derived from physical functions (devices) are instantiated per-customer, per service.  Thus, a service isn’t always coerced from communal nodes; it can be created ad hoc from service-specific hosted elements.

The second level is the stuff that we’re hosting on.  In software-defined services of any sort, we have a pool of resources that are committed as needed, rather than dedicated and specialized boxes.  Since the pool of resources is shared, we can’t let the individual services just grab at it and do what they want.  We need some notion of virtualization of resources, where everything thinks it has dedicated service elements of some sort, but all actually pull stuff from the pool.  The pool is a “black box” with no structural properties visible.

Down inside the resource-pool black box, we still have physical topologies, because we have servers, switches, data centers, trunks, and other stuff that create the “resource fabric” that makes up the pool.  However, we now have a series of virtual topologies, representing the relationships of the software elements that make up the services.

While you might be tempted to think of a virtual topology as the same kind of trunk-and-node structure we’re familiar with from switch/router networks, just translated to “virtual switch/router” and “virtual trunks”, that’s not logical.  Why would someone want to define, at the service virtual level, a specific virtual-node structure?  A VPN might really be a collection of cooperating routers, but to the user it would look like one giant virtual router.  That’s what the virtual-service topology would also look like.

A virtual-service, or let’s face it, any service, topology is really a functional topology.  It describes the functions that are needed, not how those functions are delivered.  When we have to pass through the boundary between services and the resources, we need to map to the real world, but we don’t want to pull real-world mapping specifics into the virtual service layer.  Thus, the functional view of a VPN would be “access-plus-VPN”.

This lets us shine some light on issues that are raised periodically on service modeling.  Most would agree that there are two contenders today in the model space—TOSCA from OASIS and YANG from the IETF.  In terms of genesis, YANG is a data modeling language that was developed to describe how to configure a physical nodal topology of devices.  TOSCA is a declarative language to describe (as the acronym “Topology and Orchestration Specification for Cloud Applications” suggests) the deployment of components in the cloud.  If you take the root missions as the current definitions for each language, then they both belong to the process of decomposing functions into resource cooperation.  The more “functional” or software-defined you are, the more TOSCA makes sense.  If you see nothing but physical devices or hosted equivalents thereof, you can use YANG.

What about the service layer?  There, you’ll recall, I suggested that we really had a function topology—“access” and “VPN” in the example.  You could say that “VPN-Service” as a model element decomposes into “VPN-Access” and “VPN-Core”.  You could say that “VPN-Access” has two interfaces—customer-side and network-side, the latter of which connects to one of a variable number of “customer-side” interfaces on “VPN-Core”.  We could visualize this as some kind of pseudo-molecule, a fat atom being the core and a bunch of little atoms stuck to the surface and representing the access elements.
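Just to make the picture concrete, here’s that functional topology expressed as simple data—illustrative only, not TOSCA or YANG or any real modeling language; the sites and port names are invented.

```python
# A functional topology: what's needed, not how it's delivered.
VPN_SERVICE = {
    "element": "VPN-Service",
    "children": [
        {"element": "VPN-Core", "interfaces": ["cust-1", "cust-2"]},
        {"element": "VPN-Access", "site": "Chicago",
         "customer_side": "user-port", "network_side": "VPN-Core/cust-1"},
        {"element": "VPN-Access", "site": "Dallas",
         "customer_side": "user-port", "network_side": "VPN-Core/cust-2"},
    ],
}

def describe(node: dict, depth: int = 0) -> None:
    label = node["element"] + (f" ({node['site']})" if "site" in node else "")
    print("  " * depth + label)
    for child in node.get("children", []):
        describe(child, depth + 1)

describe(VPN_SERVICE)   # prints the service, its core, and the two access "atoms"
```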

It’s perfectly possible to describe the service structure as a series of intent model objects in a hierarchy, and use something like XML to do that.  The initial CloudNFV project did that, in fact.  However, it’s also possible to describe that structure in both TOSCA and YANG too.  I favor TOSCA because it is designed for the deployment of hosted elements, not for the configuration of semi-fixed elements (but again, you can probably use either).

But let’s leave modeling religious arguments and return to our hierarchy (however we model it).  A high-level element like “VPN-Access” or “VPN-Core” could be decomposed in a number of ways depending on the specific nature of the service, things like where the users are located and what the infrastructure in that area looks like.  Thus, we could expect to see some possible decomposition of each of our high-level elements based on broad things like administrative domain/geography.  When we know that a specific user access point in a specific location needs “VPN-Access” we would expect to see the model decompose not into other models, but into some configuration or deployment language that can actually commit/coerce resources.

The virtualization layer of our structure creates a kind of boundary point between services and resources, and it’s convenient to assume that we have “service architects” that build above this point and “resource architects” that handle what happens below.  If we build services from collected resource behaviors, it makes sense to presume that resource architects create a set of behaviors that are then collected into model elements at the boundary point, and are visible and composable pieces of the service.  There may be a model hierarchy above the boundary point, as illustrated here, and there might also be a hierarchy used to organize the resource behaviors according to the kind of infrastructure involved.

In this example, we could say that “VPN-Access” is such a boundary behavior element.  Every administrative domain that provides access to users would be expected to provide such an element, and to decompose it as needed to create the necessary connection for the full set of users it offers such a service to.  As this decomposition and commitment progresses, the functional/hierarchical model of the service creates a physical topology that connects information flows.  VPN-Access trunk goes into VPN-Core port.  This connecting has to be an attribute of the deployment (or redeployment) of the service.

What this says is that there is a service topology at two levels, one at the functional/descriptive level, and one at the resource/connection level.  The first creates the second, and mediates the ongoing lifecycle management processes.

The boundary between these two layers is what’s really important, and yet we’ve had almost no industry focus on that point.  Operators need to be able to build services from some specific toolkit, even in a virtual world.  Operators need to be able to divide the “operations” tasks between the service side (OSS/BSS) and the network side (NMS, NOC).  Anyway, in any virtual world, it’s where concept meets reality that’s critical, and that is at this boundary point.

What might live here?  Let’s take a specific example: “Firewall”.  I could sell a firewall feature as an adjunct to VPN-Access, but depending on how it was implemented in a given location, the abstract notion of Firewall might decompose to a VNF, a physical device, or some sort of network service.  Above, it’s all the same, but where the function meets the resource, the deployment rules would vary according to what’s available.
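A toy sketch of that boundary decision, with invented locations and capabilities: above the boundary the function is always “Firewall”; below it, the decomposition picks whatever the location can actually provide.

```python
# Invented locations and their capabilities for realizing a "Firewall" function
CAPABILITIES = {
    "metro-a":   ["vnf-hosting", "physical-appliance"],
    "rural-b":   ["physical-appliance"],
    "partner-c": ["network-service"],
}

# Preference order for decomposing the abstract function into something real
PREFERENCE = ["vnf-hosting", "physical-appliance", "network-service"]

def decompose_firewall(location: str) -> str:
    """Above the boundary it's always 'Firewall'; below, pick what's available."""
    for option in PREFERENCE:
        if option in CAPABILITIES.get(location, []):
            return f"Firewall -> {option} at {location}"
    raise ValueError(f"no way to realize Firewall at {location}")

for loc in CAPABILITIES:
    print(decompose_firewall(loc))
```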

We could probably gain a lot by forgetting the modeling-language issues for the moment, and focusing on what happens at the virtual service/resource boundary.  Each language could then present its credentials relative to that point, and we could see how it frames what we do now, and what we expect from the network of the future.

Can the TMF Retake the Lead in Zero-Touch Automation?

On the ever-popular topic of zero-touch automation, Light Reading has an interesting piece that opens what might well be the big question of the whole topic—“What about the TMF?”  The TMF is the baseline reference for OSS/BSS, a body that at least the CIO portions of operator organizations tend to trust.  The TMF has also launched a series of projects aimed in some way at the problem of zero-touch automation.  As the article suggests, there are two in particular that are currently aimed at ZTA, the Open Digital Architecture (ODA) and the Zero Touch Orchestration, Operations, and Management (ZOOM) initiatives.  Could the TMF resolve everything here?  We need to look at the assets and liabilities.

Engagement is critical with anything in the zero-touch area, and the TMF has both plusses and minuses in that category.  Among CIO organizations the TMF is generally well-regarded, and it has long-standing links to the OSS/BSS vendors too.  The body runs regular working group and plenary meetings, the latter being mega-events that are as much trade shows as technical meetings (not unlike MWC).

Outside the CIO area, things are a bit different.  Among the rest of the C-suite executives, the TMF is held in lower regard than bodies like ETSI, and there are some in the staff of CTO, CMO, and COO execs that think the whole OSS/BSS program should be scrapped and redone, under different industry bodies.

This division is critical at this particular moment because the zero-touch initiative really demands a solution that crosses the boundary between OSS/BSS and the COO/CTO missions of network technology transformation and operations transformation at the NMS level.  The barriers to having just one C-suite exec take over the whole ZTA space are profound, in operator political terms.

Technology is also critical to ZTA, of course, and there the TMF has made the greatest single contribution of any body. The so-called “NGOSS Contract” (where NGOSS means Next-Gen OSS) established the first recognition of event-driven operations and the use of a model to steer events to processes.  I think it would be appropriate to say that if this model had been developed and accepted fully and logically, we’d have the right solution to ZTA already.

Why didn’t that happen?  That’s a very difficult question for anyone to answer, but one to which many people have contributed an answer.  What I hear from others is that 1) vendors obstructed the vision, 2) operators were reluctant to adopt something that transformative, 3) TMF people never articulated the concept correctly, and 4) the TMF never really saw the connection between NGOSS Contract and the later ODA and ZOOM work.

I was a member of the TMF at the time when the NGOSS Contract work came to be, though not a part of it.  I was also a member when ZOOM kicked off, but engagement in TMF activity (like engagement in any standards activity) is both time-consuming and frustrating, and I couldn’t commit the time and energy as the work formalized.  I was never involved in ODA in any form.  I mention this because my own view, drawn from my exposure and the comments I’ve gotten from operators, suggests that there is some truth in all four of the reasons why the TMF NGOSS Contract stuff didn’t go where it might have.

The thing I think is critical here is that there is no reason why the TMF couldn’t now do what it didn’t do a decade ago when NGOSS Contract was hot.  At least, no new reason, because those same four issues are still pushing just as hard as they did a decade ago when all this started.  Not only that, the same four issues will have to be addressed by anyone who thinks they can make ZTA work.

The simple truth is that zero-touch automation can be made to work easily if a service model (like the TMF SID) is used to describe the elements of a service as functional models that contain the way that “events” within the scope of each element are directed to processes based on a state/event model of some sort.  That’s what is needed, and to the extent that you can describe ODA or ZOOM that way, they can be made to resolve the ZTA problem.

I believe firmly that there are probably a hundred people in the TMF (or affiliated as one of their levels of professional partners) who could do exactly what is needed.  I believe most of them were involved through the last decade, too, and somehow they didn’t do what they could have.  That’s another “Why?” that probably has lots of opinions associated with it.  The short answer is that the projects like ODA and ZOOM weren’t launched to specifically build on NGOSS Contract.  There were TMF Catalysts (demos, usually involving multiple vendors and given at one of the big TMF events) that showed many of the critical steps.  What was lacking was a nice blueprint that showed everything in context.

This is a failure of articulation, again.  For the TMF, articulation is hard because the body has created a kind of micro-culture with its own language and rules, and they don’t make much effort to communicate outside themselves.  Remember that most of the C-suite is outside the TMF, so that failure to speak a broadly understood industry language is a major problem.  But even when I was involved in the early ZOOM work, I didn’t see much attention being paid to NGOSS Contract, even when I brought it up as part of the CloudNFV work.

The Light Reading article raises a good point, or question.  “But is there not,” it asks, “a risk that yet another initiative sows confusion and leads to fragmentation, rather than standardization?”  Darn straight there is, but there may be another bigger issue.  Framing a simple but not fully understood solution like NGOSS Contract inside something else risks losing touch with what works.  Every new layer of project or acronym may lead to new claims of relevancy, but they move further from that basic truth that could have worked a decade ago.

The TMF has issues, in my view, and I’ve never shied away from that point.  It also has the most important core asset of all, the core understanding of the relationship between events, data models of services, and processes.  They’ve under-used, or perhaps even ignored, that asset for a decade, but they could still make truly explosive progress if they just did what should come naturally, and exploit their strengths.

To do that, they have to explicitly go back to NGOSS Contract, and they have to accept the fundamental notion that a service model (pick your modeling language) is a hierarchy or topology of element models, each of which obey the NGOSS-Contract mandate to mediate the event/process relationship.  Forget all the TMF-private nomenclature for a moment and just do this in industryspeak.  Once that’s done, they need to frame their ODA work first in that model, again using common language.  Only when they can do that should they dive into TMF-oriented specifics.  That makes sure that what they do can be assessed, and appreciated, by the market overall.

Can they do this?  Bodies like the TMF tend to fall into their own private subcultures, as I’ve already noted.  That’s what contributes to the problem the Light Reading piece cites, the tendency for new stuff to build another floor on the tower of babel instead of simplifying the problem or presenting solutions.

What Operator Projects in Operations Automation Show Us

In my last blog, I talked about the need to plan strategically and implement tactically.  The logic behind that recommendation is that it’s all too easy to take short-term measures you think will lead somewhere, and then go nowhere beyond the short term.  Looking at operators’ actions in 2017 and their plans in 2018, I see that risk already biting some critical projects.

Based on my research and modeling, I’ve developed a picture of the trends in what I call “process opex”, meaning the non-capital costs associated with network and service operations.  In 2017, they ran about 24 cents on each revenue dollar across all network operators, which is more than the total capital budget of operators.  SDN, NFV, and zero-touch automation all promise to reduce process opex, and in fact the savings that could have been achieved in 2016 and 2017 amounts to about six cents per revenue dollar.  What actually happened, according to operators themselves, was a savings of just a bit more than two cents per revenue dollar.

I was curious about the problem here, and what I learned when I asked operators for the reason for the savings shortfall was pretty simple and also sadly expected.  According to operators, they did not start their projects with the target of complete zero-touch automation.  Instead, they attacked individual subsystems of their network and operations process.  They also didn’t consider any broad automation mission in the way they did their per-subsystem attacks.  Many even said that they didn’t group multiple subsystems and try to find a solution across the group.

Only about five percent of operators seem to understand there’s a risk associated with this approach, and none of them seem to have quantified it or even fully determined what its scope could be.  I had a couple of operators I’d worked closely with, and I raised with them the touchy question of whether this might end up locking them into a path that didn’t go where they wanted.  Only one had a comment, and it was “I do understand that might happen, but we have no options for a broad strategy being presented to us at this point, and we need to do something now.”

What made this particular comment troubling was that this operator was also assessing ONAP/ECOMP, and so it could be argued that they did have a possible broader pathway to automation already under review.  Pushing my luck, I asked why ONAP wasn’t seen as a possible basis for a long-term automation approach, and the response was “Oh, they’re not there yet and they don’t really know when that might be possible.”

The urgency in addressing automation, for the purpose of reducing operations costs and improving both accuracy and agility, is understandable.  Operators have been under pressure to improve profit per bit for over five years, and it’s only been in the last two that they’ve really done much.  What they’ve done was to attack areas where they had a clear problem and a clear path forward, and as my specific contact noted, they weren’t being offered anything broad and systemic at the time their initiatives kicked off.

They still aren’t being offered anything, of course.  If you look at the things operators have already attacked (you can find a list of some in a Light Reading piece on this topic), you find mostly stuff in the OSS/BSS area.  You also see areas that are undergoing changes because of the changing nature of services, customer support, and competition.

I think the fact that the current opex-reducing automation initiatives really got started late in 2015 or early 2016 is also a factor.  There was no zero-touch automation group in ETSI then.  Most NFV work being done at that point didn’t really address the broader questions of automating services that consisted of virtual and physical functions either.  The notions of service events and event-driven models for service and network management were still in the future, and AT&T’s ECOMP wasn’t even announced until March of 2016.  It’s also true that ECOMP as it was defined, and largely as it still is, was targeted to deploy “under” OSS/BSS rather than as a tool to orchestrate OSS/BSS processes.  Once operators got started with OSS/BSS-linked automation projects, they were out of scope to the ECOMP evaluation people.  They were, in most cases, even run by a different group—the CIO for OSS/BSS and the CTO for ECOMP.

Operators, as I’ve said, do understand that they’re likely to lose something with their current approach (but see no alternative).  How much they might lose is something they’ve not assessed and can’t communicate, but if I tweak my model I can make a guess.

Today process opex is running about 24 cents per revenue dollar across all operators.  The incremental growth in process opex through 2020 is projected to be four cents per revenue dollar, and automation can relieve only two cents of that, making 2020 process opex 26 cents per revenue dollar.  It could have been 23 cents if optimum automation practices had been adopted and phased in beginning in 2016.  The model says that achieving optimality now would lower the 2020 number to 25 cents, which is still two cents more than optimum.  To put things into perspective, cutting capex by ten percent would save that same two cents.
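For those who like to see the arithmetic, here’s how the numbers in my model fit together.  All figures are approximations in cents of process opex per revenue dollar, and the capex figure is only implied by the ten-percent comparison.

```python
# All figures are cents of process opex per revenue dollar, from my model (rounded)
opex_2017 = 24
growth_through_2020 = 4            # projected incremental growth in process opex
relief_current_path = 2            # what today's piecemeal automation relieves

opex_2020_current = opex_2017 + growth_through_2020 - relief_current_path  # 26
opex_2020_optimal_2016 = 23        # if full zero-touch had phased in from 2016
opex_2020_optimal_now = 25         # if an optimal approach started today

gap_vs_optimum = opex_2020_optimal_now - opex_2020_optimal_2016            # 2 cents
capex_cents = 20                   # implied: ten percent of capex ~= that 2 cents
print(opex_2020_current, gap_vs_optimum, round(0.10 * capex_cents))        # 26 2 2
```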

Losses are cumulative, and so are shortfalls in savings.  We could have hoped to save a cumulative total of twenty-six cents on each revenue dollar with zero-touch automation from 2016 through 2020.  What operators now think they will save with their incremental approach is less than half that amount.  There is a LOT of savings left on the table, enough to fund more than half a year’s capex budget.  That’s the price of not having a grand strategy to adopt.

Where things could get really complicated is when you take this beyond 2020.  IoT, 5G, increased reliance on Internet-only customers because of streaming competition, and interest in expanding the use of SDN and NFV could combine to add another three or four cents to process opex by 2025.  For some possible services, the lack of automated tools in management and operations could threaten the business case, or kill it.

I got a LinkedIn message that might give us some insight into why we didn’t have that grand strategy.  Norman Dorn sent me a link to a new-gen functional architecture for telecommunications systems that was developed by Bellcore before the iPhone came out.  The insight the list of goals demonstrates is dazzling, but Apple came along with a commercial product that swept the market without dealing with most of the issues, particularly insofar as they related to the relationship between network technology and appliance technology.  Comprehensive strategy will always support optimality, but it may not support practicality or competitive advantage.

There has to be a middle ground for operators in areas like zero-touch automation, 5G, IoT, and even SDN/NFV.  We could still find it for all of these things, but to do that we may have to abandon the traditional standards-driven approach and try to mold open-source projects into the incubators of future technology.  The only thing in that direction that’s worked so far is the “ECOMP model” where the buyer does the early work and when critical mass has been developed, the result is open-sourced.  That puts too many functional and strategic eggs in that first buyer’s basket.  Maybe it’s time to work out a way for technical collaboration of network operators of all sorts, without vendor interference and regulatory barriers.

If We’re in the Software Defined Age, How Do We Define the Software?

We are not through the first earnings season of the year, the first after the new tax law passed, but we are far enough into it to be able to see the outlines of technology trends in 2018.  Things could be a lot worse, the summary might go, but they could also still get worse.  In all, I think network and IT execs are feeling less pressure but not yet more optimism.

Juniper’s stock slid significantly pre-market after its earnings report, based on weak guidance for the first quarter of this year.  Ericsson also reported another loss, and announced job cuts and what some have told me is an “agonizing” re-strategizing program.  Nokia’s numbers were better, but mostly compared with past years, and they still showed margin pressure.  On the provider side, Verizon and AT&T have reported a loss of wireline customers, with both companies saying that some online property (Yahoo, etc., for Verizon and DirecTV Now for AT&T) helped them offset the decline.

In the enterprise space, nine of every ten CIOs tell me that their primary mission in budgeting is not to further improve productivity or deliver new applications, but to do more with less.  Technology improvements that can drive down costs, like virtualization, containers, and orchestration, are good; other things need not apply.  In 30 years, the balance of capital spending between sustaining current infrastructure and advancing new projects has never tipped as heavily toward the former goal.

In the world of startups, I’m hearing from VCs that there’s going to be a shake-out in 2018.  The particular focus of the pressure is the “product” startups, those who have hardware or software products, and in particular those in the networking space.  VCs say that they think there are too many such startups out there, and that the market has already started to select among the players.  In short, the good exits are running out, so it’s time to start holding down costs on those who aren’t already paying off for you.

Something fundamental is happening here, obviously, and it’s something the industry at large would surely prefer to avoid.  So how can we do that?

Every new strategy has to contend with what could be called the “low apple syndrome”.  It’s not only human nature to do the easy stuff first, it’s also good business because that probably represents the best ROI.  The challenge that the syndrome creates is that the overall ROI of a new strategy is a combination of the low and high apples, and if you pluck the low stuff then the residual ROI on the rest can drop precipitously.  The only solution to that problem is to ensure that the solution/approach to those low apples can be applied across the whole tree.  We have to learn to architect for the long term and exploit tactically, in short.
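
A toy illustration of the syndrome, with invented numbers: the blended ROI of a new strategy looks healthy, but once the low apples are plucked the residual ROI collapses.

```python
def roi(cost, benefit):
    return (benefit - cost) / cost

# Hypothetical numbers purely for illustration.
low_cost, low_benefit = 10.0, 30.0     # easy projects: 200% ROI on their own
high_cost, high_benefit = 40.0, 50.0   # hard projects: 25% ROI on their own

# Blended ROI of the whole tree looks respectable...
print(roi(low_cost + high_cost, low_benefit + high_benefit))  # 0.6, or 60%
# ...but once the low apples are gone, the residual ROI is much worse.
print(roi(high_cost, high_benefit))                           # 0.25, or 25%
```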

There are two forces that limit our ability to do that.  One is that vendors in the space where our new strategies and technologies are targeted want to sustain their revenue streams and incumbent positions.  They tend to push a slight modification of things, something that doesn’t rock the boat excessively.  However, they position that as being a total revolution, and that combination discourages real revolutionary thinking at the architecture level.

The other force is the force of publicity.  We used to have subscription publications in IT, but nearly everything today is ad-sponsored, and sponsor interests are likely to prevail there.  Many market reports, some say most, are paid for by vendors and thus likely to favor vendor interests.  Even where that’s not directly true, one producer of market forecast reports once said to me that “There’s no market for a report that shows there’s no market for something.  Report buyers want to justify a decision to build a product or enter a market.”  I used to get RFPs from firms looking to outsource analyst reports, and they would start with something like “Develop a report validating the hundred-billion-dollar market for xyz.”  Guess what that report will end up showing.  How do you get real information to buyers under these conditions?

OK, if we have two forces that limit us, we need two counterbalancing forces.  The essential one is a buyer-driven model in specification, standardization, and open source.  I’ve seen vendor/buyer tension in every single standards activity I’ve been involved in, and the tension is often impossible to resolve effectively because of regulatory constraints.  Many geographies, for example, don’t allow network operators to “collude” by working jointly on things, and valuable initiatives have been hampered or actually shut down because of that.

The operators may now have learned a way of getting past this issue.  AT&T’s ECOMP development was done as an internal project, and then released to open-source and combined with OPEN-O orchestration of NFV elements to create ONAP.  The fact that so much work went into ECOMP under AT&T means that even though open-source activity would likely face the same regulatory constraints as standards bodies, vendors have a harder time dominating the body because much of the work is already done.  AT&T is now following that same approach with a white-box switch OS, and that’s a good thing.

The second solution path is establishing software-centric thinking.  Everything that’s happening in tech these days is centered around software, and yet tech processes at the standards and project level are still “standards-centric”, looking at things the way they’d have been looked at thirty years ago.  Only one initiative I’m aware of took a genuinely software-centric view: the IPsphere Forum, or IPSF, as it was first conceived a decade ago.  This body introduced what we now call “intent models”, visualizing services as a collection of cooperative but independent elements, and it even proposed the notion of orchestration.  However, it fell victim to operator concerns about anti-trust regulation, since operators were driving the initiative.

Clearly, it’s this second path that’s going to be hard to follow.  There’s a lot of software skill out there, but not a lot of strong software architects, and the architecture of any new technology is the most critical thing.  If you start a software project with a framework that presumes monolithic, integrated components linked by interfaces, which is what a traditional box solution would look like, that’s what you end up with.

The NFV ISG is a good example of the problem.  The body has originated a lot of really critical stuff, and it was the second source (after the IPSF) of the application of “orchestration” to telecom.  However, it described the operation of NFV as the interplay of functional blocks, something easy to visualize but potentially risky in implementation.  Instead of framing NFV as an event-driven process, it framed it as a set of static elements linked by interfaces—boxes, in short.  Now the body is working to fit this model to the growing recognition of the value of, even need for, event-driven thinking, and it’s not easy.

I think that blogs are the answer to the problem of communicating relevant points, whether you’re a buyer or seller.  However, a blog that mouths the same idle junk that goes into press releases isn’t going to accomplish anything at all.  You need to blog about relevant market issues, and introduce either your solution or a proposed approach in the context of those issues.  You also need to blog often enough to make people want to come back and see what you’re saying.  Daily is best, but at least twice per week is the minimum.

A technical pathway that offers some hope of breaking the logjam on software-centric thinking is the open-source community.  I think ONAP has potential, but there’s another initiative that might have even more.  Mesosphere’s DC/OS combines Apache Mesos with the Marathon orchestrator, all tied into a model of deploying functional elements on highly distributed resource pools.  Marathon has an event bus, which might make it the most critical piece of the puzzle for software-defined futures.  Could it be that Red Hat, who recently acquired CoreOS for its container capabilities, might extend their thinking into event handling, or that a competitor might jump in, pick up the whole Mesosphere stack, and run with it?  That could bring a lot of the software foundation for a rational event-driven automation future into play.
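
For the curious, here’s a minimal sketch of what tapping Marathon’s event bus might look like, assuming the /v2/events Server-Sent-Events endpoint (check the version you’re running) and a hypothetical Marathon address; it’s an illustration of the idea, not production code.

```python
import json
import requests  # third-party HTTP client

MARATHON_URL = "http://marathon.example.com:8080"  # hypothetical address

def watch_events():
    resp = requests.get(f"{MARATHON_URL}/v2/events",
                        headers={"Accept": "text/event-stream"},
                        stream=True)
    for line in resp.iter_lines(decode_unicode=True):
        # Server-Sent Events arrive as "event: <type>" and "data: <json>" lines.
        if line and line.startswith("data:"):
            handle(json.loads(line[len("data:"):].strip()))

def handle(event):
    # Real logic would dispatch on the event type; this just shows the tap point.
    print(event.get("eventType"), event.get("timestamp"))

if __name__ == "__main__":
    watch_events()
```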

Don’t be lulled into thinking that open-source fixes everything automatically.  Software-centric thinking has to be top-down thinking, even though it’s true that not everyone designs software that way.  That’s a challenge for both open-source and standards groups, because they often want to fit an evolutionary model, one that ties early work to the stuff that’s already deployed and understood.  It shouldn’t be impossible to think through the “right” or “best” approach to a problem in light of future needs and trends while at the same time keeping that future from disconnecting from present realities.  “Shouldn’t” apparently isn’t the same as “doesn’t”, though.  In fairness, we launched a lot of our current initiatives before the real issues were fully explored.  We have time, in both IoT and zero-touch automation, to get things right, but it’s too soon to know whether the initiatives in either area will strike the right balance between an optimum future and preservation of the present.

The critical truth here is that we live in an age defined by software, but we still don’t know how to define the software.  Our progress isn’t inhibited by lack of innovation as much as by lack of articulation.  There are many places where all the right skills and knowledge exist at the technical level.  You can see Amazon, Microsoft, and Google all providing the level of platform innovation needed for an event-driven future, a better foundation for things like SDN and NFV than the formal processes have created.  All of this is published and available at the technical level, but it’s not framed at the management level in a way suited to influencing future planning in either the enterprise or service provider spaces.  We have to make it possible for architectures to seem interesting.

Complexity discourages adoption, but complexity in a solution is a direct result of the need to address complex problems.  It’s easy to say that “AI” will fix everything by making our software understand us, rather than asking us to understand the software.  The real solution is, you guessed it, more complicated.  We have to organize our thinking, assemble our assets, and promote ecosystemic solutions, because what we’re looking for is a set of technology changes that will revolutionize a very big part of our lives and our work.

How Events Evolve Us to “Fabric Computing”

If you read this blog regularly you know I believe the future of IT lies in event processing.  In my last blog, I explained what was happening and how the future of cloud computing, edge computing, and IT overall is tied to events.  Event-driven software is the next logical progression in an IT evolution I’ve followed for half a century, and it’s also the only possible response to the drive to reduce human intervention in business processes.  If this is all true, then we need to think about how event-driven software would change hosting and networking.

I’ve said in previous blogs that one way to view the evolution of IT was to mark the progression of its use from something entirely retrospective—capturing what had already happened—to something intimately bound with the tasks of workers or users.  Mobile empowerment, which I’ve characterized as “point-of-activity” empowerment, seems to take the ultimate step of putting information processing in the hands of the worker as they perform their jobs.  Event processing takes things to the next level.

In point-of-activity empowerment, it is possible that a worker could use location services or near-field communications to know when something being sought is nearby.  This could be done in the form of a “where is it?” request, but it could also be something pushed to a worker as they moved around.  The latter is a rudimentary form of event processing, because of its asynchronicity.  Events, then, can get the attention of a worker.

It’s not a significant stretch to say that events can get the attention of a process.  There’s no significant stretch, then, to presume the process could respond to events without human intermediation.  This is actually the only rational framework for any form of true zero-touch automation.  However, it’s more complicated than simply kicking off some vast control process when an event occurs, and that complexity is what drives IT and network evolution in the face of an event-driven future.

Shallow or primary events, the stuff generated by sensors, are typically simple signals of conditions that lack the refined contextual detail needed to actually make something useful happen.  A door sensor, after all, knows only that a door is open or closed, not whether it should be.  To establish the contextual detail needed for true event analysis, you generally need two things—state and correlation.  The state is the broad context of the event-driven system itself.  The alarm is set (state), therefore an open door is an alarm condition.  Correlation is the relationship between events in time.  The outside door opened, and now an interior door has opened.  Therefore, someone is moving through the space.
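
Here’s a minimal sketch of that alarm example in Python, with state and time-window correlation kept as two separate checks; the sensor names and the thirty-second window are purely illustrative.

```python
import time

class AlarmSystem:
    def __init__(self, correlation_window=30.0):
        self.armed = False               # the "state" of the system
        self.recent = []                 # recent events kept for correlation
        self.correlation_window = correlation_window

    def arm(self):
        self.armed = True

    def on_event(self, sensor, condition, when=None):
        when = when if when is not None else time.time()
        # Drop events that fall outside the correlation window.
        self.recent = [(s, c, t) for (s, c, t) in self.recent
                       if when - t <= self.correlation_window]
        self.recent.append((sensor, condition, when))

        # State decides whether an open door matters at all.
        if self.armed and condition == "open":
            print(f"ALARM: {sensor} opened while armed")

        # Correlation decides whether a sequence of events implies movement.
        opened = {s for (s, c, _) in self.recent if c == "open"}
        if {"outside_door", "interior_door"} <= opened:
            print("Correlated: someone is moving through the space")

system = AlarmSystem()
system.arm()
system.on_event("outside_door", "open")
system.on_event("interior_door", "open")
```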

I’ve used a simple example of state and correlation here, but in the real world both are likely to be much more complicated.  There’s already a software discipline called “complex event processing” that reflects just how many different events might have to be correlated to do something useful.  We also see complex state notions, particularly in the area of things like service lifecycle automation.  A service with a dozen components is actually a dozen state machines, each driven by events, and each potentially generating events to influence the behavior of other machines.

Another complicating factor is that both state and correlation are, so to speak, in the eye of the beholder.  An event is, in a complete processing sense, a complex topological map that links the primary or shallow event to a series of chains of processing.  What those chains involve will depend on the goal of the user.  A traffic light change in Manhattan, for example, may be relevant to someone nearby, but less so to someone in Brooklyn and not at all to someone in LA.  A major traffic jam at the same point might have relevance to our Brooklyn user if they’re headed to Manhattan, or even to LA people who might be expecting someone who lives in Manhattan to make a flight to visit them.  The point is that the things that matter will depend on who they matter to, and the range of events and nature of processes have that same dependency.

When you look at the notion of what I will call “public IoT”, where sensor-driven processes are available as event sources to a large number of user applications, there is clearly an additional dimension: the distribution of events at scale.  Everyone can’t be reading the value of a single simple sensor or you’d have the equivalent of a denial-of-service attack.  In addition, primary events (as I’ve said) need interpretation, and it makes little sense to have thousands of users do the same interpretation and correlation.  It’s more sensible to have a process do the heavy lifting and dispense the digested data as an event.  Thus, there’s also an explicit need for secondary events, events generated by the correlation and interpretation of primary events.
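
A sketch of that digestion step, with hypothetical traffic-sensor readings: one process correlates the primary events and publishes a single secondary event that any number of subscribers can receive.

```python
class TrafficDigester:
    """Turns raw speed readings (primary events) into a 'traffic jam' secondary event."""
    def __init__(self, publish, jam_threshold=5):
        self.publish = publish            # callback into the event distribution fabric
        self.slow_readings = 0
        self.jam_threshold = jam_threshold

    def on_primary(self, speed_mph):
        # Correlate raw readings over time instead of forwarding each one.
        self.slow_readings = self.slow_readings + 1 if speed_mph < 10 else 0
        if self.slow_readings >= self.jam_threshold:
            self.publish({"type": "traffic_jam", "location": "midtown"})
            self.slow_readings = 0

# A stand-in for the distribution fabric: many subscribers, one sensor read.
subscribers = [lambda e: print("secondary event delivered:", e)]
digester = TrafficDigester(lambda event: [deliver(event) for deliver in subscribers])
for reading in [8, 6, 7, 5, 4]:
    digester.on_primary(reading)
```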

If we could look at an event-driven system from above, with some kind of special vision that let us see events flying like little glowing balls, what we’d likely see in most event-driven systems is something like nuclear fission.  A primary neutron (a “shallow event”) is generated from outside, and it collides with a process near the edge to generate secondary neutrons, which in turn generate even more when they collide with processes.  These are the “deep events”, and it’s our ability to turn shallow events from cheap sensors into deep events that can be secured and policy managed that determines whether we could make event-driven systems match goals and public policies at the same time.

What happens in a reactor if we push a bunch of control rods into the core?  Neutrons don’t hit their targets, and we get a slow decay into the steady state of “nothing happening”.  In an event system, the equivalent requirement is a unified process-and-connection fabric in which events can collide with processes with a minimum of delay and loss.

To make event-driven systems work, you have to be able to do primary processing of the shallow events near the edge, because otherwise the control loop needed to generate feedback in response to events can get way too long.  That suggests that you have a primary process that’s hosted at the edge, which is what drives the notion of edge computing.  Either enterprises have to offer local-edge hosting of event processes in a way that coordinates with the handling of deeper events, or cloud providers have to push their hosting closer to the user point of event generation.

A complicating factor here is that we could visualize the real world as a continuing flood of primary, shallow events.  Presumably various processes would do correlation and analysis, and the distribution of secondary “deeper” events would then trigger other processes.  Where does this happen?  The trite response is “where it’s important”, which means anywhere at all.  Cisco’s fog term might have more of a marketing genesis than a practical one, but it’s a good definition for the processing conditions we’re describing.  Little islands of hosting, widely distributed and highly interconnective, seem the best model for an event-driven system.  Since event processing is so linked with human behavior, we must assume that all this islands-of-hosting stuff would shift about as interests and needs changed.  It’s really about building a compute-and-network fabric that lets you run stuff where it’s needed, no matter where that happens to be, and change it in a heartbeat.

Some in the industry may have grasped this years ago.  I recall asking a Tier One exec where he thought his company would site cloud data centers.  With a smile, he said “Anywhere we have real estate!”  If the future of event processing is the “fog”, then the people with the best shot at controlling it are those with a lot of real estate to exploit.  Obviously, network operators could install stuff in their central offices and even in remote vault locations.  Could someone like Amazon stick server farms in Whole Foods locations?  Could happen.

Real estate, meaning hosting locations, is a big piece of the “who wins?” puzzle.  Anyone can run out and rent space, but somebody who has real estate in place and can exploit it at little or no marginal cost is clearly going to be able to move faster and further.  If that real estate is already networked, so much the better.  If fogs and edges mean a move out of the single central data center, it’s a move more easily made by companies that have facilities ready to move to.

That’s because our fog takes more than hosting.  You would need your locations to be “highly interconnective”, meaning supported by high-capacity, low-latency communications.  In most cases, that would mean fiber optics.  So our hypothetical Amazon exploitation of Whole Foods hosting would also require a lot of interconnection of the facilities.  Not to mention, of course, an event-driven middleware suite.  Amazon is obviously working on the middleware, and the question is whether they have plans to supply the needed connectivity as well; they’re perhaps the furthest along of anyone in defining an overall architecture.

My attempts to model all of this show some things that are new and some that are surprisingly traditional.  The big issue in deciding the nature of the future compute/network fabric is the demand density of the geography, roughly equivalent to the GDP per square mile.  Where demand density is high, the future fabric would spread fairly evenly over the whole geography, creating a truly seamless virtual hosting framework that’s literally everywhere.  Where it’s low, you would have primary event processing distributed thinly, then an “event backhaul” to a small number of processing points.  There’s not enough revenue potential for something denser.
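
A purely illustrative sketch of that decision; the demand-density threshold and the sample figures are hypothetical, not outputs of my model.

```python
def fabric_model(gdp_billions, square_miles, density_threshold=0.5):
    # density_threshold is a made-up illustrative value, not a model output.
    density = gdp_billions / square_miles
    if density >= density_threshold:
        return "uniform fabric: hosting spread fairly evenly across the geography"
    return "thin edge hosting plus event backhaul to a few processing points"

print(fabric_model(gdp_billions=800.0, square_miles=300.0))     # dense metro area
print(fabric_model(gdp_billions=200.0, square_miles=50_000.0))  # low-density region
```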

This is all going to happen, in my view, but it’s also going to take time.  The model says that by 2030, we’ll see significant distribution of hosting toward the edge, generating perhaps a hundred thousand incremental data centers.  In the decade after that, the number could double, but it’s very difficult to look that far out.  And the software/middleware needed for this?  That’s anyone’s guess at this point.  The esoteric issues of event-friendly architecture aren’t being discussed much, and even less often in language that the public can absorb.  Expect that to change, though.  The trend, in the long term to be sure, seems unstoppable.

Clouds, Edges, Fog, and Deep and Shallow Events

What is an “edge” and what is a “cloud”?  Those questions might have seemed surpassingly dumb even a year ago, but today we’re seeing growing confusion about the future of computing, and the reason comes down to this pair of questions.  As usual, there’s a fair measure of hype involved in the debate, but there are also important, substantive issues to address.  In fact, the details that underpin these questions might decide the balance of power in future computing.

The popular definition of “cloud computing” is utility computing services that offer virtual hosting on a public or private data center.  In practice, though not as a real requirement, cloud data centers for public providers tend to be in a small number of central locations.  To put this in an easier vernacular, cloud computing resources are virtual, a kind of hosting-capability black box.  You don’t know, or care, where they are unless your cloud service includes specific regional/location affinities.

“Edge” computing is actually a less concrete concept, because it raises the obvious question “Edge of what?”  Again, we have to fall back on popular usage and say that edge computing is computing at the edge of the network, proximate to the point of user connection.  Cisco uses the term “fog computing” as roughly synonymous with “edge computing”, but in my view, there could be a difference.  I think the “fog” notion implies that edge resources are both distributed and logically collectivized, creating a kind of edge-hosted cloud.  In contrast, some “edge computing” might rely on discrete resources deployed at the edge, not collected into any sort of pool.  Whether you like either term or my interpretation, I’m going to use this definition set here!

The question of whether a given “cloud” should be a “fog” is probably the most relevant of the questions that could arise out of our definitions.  Resource pools are more economical than fixed resources because you share them, just as packet trunks are more economical than TDM trunks.  The concept of resource-sharing is based on the principle of resource equivalence.  If you have a bunch of hosts, any of which will do, one can be selected to optimize operations and capital costs.  Greater efficiency, in other words, and efficiency grows with the size of the pool.  If your resources are different in a way that relates to how they might be used, then you might not have the full complement of resources available to suit a given need.  That means a smaller pool, and lower efficiency.
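
A small simulation, using made-up random demand, shows why the shared pool wins: capacity sized to cover demand 99% of the time is much smaller for one pool than for sixteen dedicated ones.

```python
import random

random.seed(1)
TRIALS, GROUPS = 10_000, 16

def p99(samples):
    # Capacity that covers demand in 99% of the observed intervals.
    return sorted(samples)[int(0.99 * len(samples))]

per_group = [[] for _ in range(GROUPS)]
pooled = []
for _ in range(TRIALS):
    demands = [random.randint(0, 10) for _ in range(GROUPS)]
    for i, d in enumerate(demands):
        per_group[i].append(d)
    pooled.append(sum(demands))

# Each dedicated pool must cover its own 99th-percentile demand; the shared pool
# only covers the 99th percentile of the sum, which is far smaller than the sum
# of the individual percentiles.
print("capacity needed, partitioned:", sum(p99(g) for g in per_group))
print("capacity needed, pooled:     ", p99(pooled))
```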

This is why we’ve tended to see cloud providers build a small number of massive data centers.  That breeds lower cost (and greater profit) but it also creates a problem with transit time.  An event or transaction that enters the network at a random point might be transported thousands of miles to reach processing resources.  That distance might include multiple nodes and trunks, introducing handling delays.  In the end, we could be talking tenths of a second in delay.

That might not sound like much, but if we presume that we have a 0.2 second round-trip delay and we’re controlling something like a turnstile, a vehicle at 60 mph would travel about 18 feet in that amount of time.  The point is that for applications like process control, a transit delay can be a true killer, and if we could have moved our processing resource to the network edge, we could significantly reduce that delay.
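
The arithmetic behind that example, for anyone who wants to check it:

```python
round_trip_seconds = 0.2
speed_mph = 60
feet_per_second = speed_mph * 5280 / 3600    # 88 feet per second at 60 mph
print(round_trip_seconds * feet_per_second)  # 17.6 feet, roughly the 18 cited
```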

The Internet of Things, to the extent and at the pace that it becomes relevant, is the logical on-ramp for applications sensitive to transit delay.  That means that centralized cloud data centers become problematic with higher rates of IoT adoption, and that the trend would likely encourage the migrating of hosting of IoT processes to the edge.  That’s the trend that actually divides the broader term “edge” computing from what I think is Cisco’s “fog” value proposition.

Cloud providers like Amazon support IoT event processing already, through their functional/serverless or “Lambda” offering.  Obviously, Amazon isn’t about to push out thousands of data centers to the network edge in a field-of-dreams hope of attracting IoT opportunities.  Instead, what they’ve proposed to do is allow their event processing to migrate out of the network and onto customer premises, using customer-provided resources and Amazon software, called “Greengrass”.  Edge computing, provided by the customer, is used to extend the cloud, but in a specific, facility-linked, way.  You can’t think of this kind of edge hosting as a general “fog” because the elements that are distributed are tasked with serving only the general location where they’re found.  You could just as easily say they were remote servers.  If you remember the notion of “server consolidation” in the cloud, think of Greengrass as “server un-consolidation”.  The cloud comes to you, and lives within you.
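
For a sense of what “the cloud comes to you” looks like in code, here’s a minimal sketch of a Lambda-style function of the kind Greengrass can run locally; the payload fields and threshold are hypothetical, and only the handler(event, context) shape follows the documented Lambda convention.

```python
import json

def handler(event, context):
    # A shallow event from a local sensor, handled at the edge without a round
    # trip to the central cloud region. Field names and threshold are hypothetical.
    reading = event.get("temperature_c")
    if reading is not None and reading > 80:
        return {"action": "shutdown_valve", "reason": f"overheat at {reading}C"}
    return {"action": "none"}

# Local test harness; in Greengrass, the runtime invokes handler() directly.
if __name__ == "__main__":
    print(json.dumps(handler({"temperature_c": 85}, None)))
```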

IoT promotes edge computing, but it doesn’t limit the source of the compute power or whether or not it is collected into the “fog/cloud”.  We need to have at least some event processing collected in proximity to the event source, and from there the “inward” requirements won’t be much different from those of traditional transaction processing.  Enterprises could do on-prem event processing or cede it to the cloud provider.

The “some event processing” point may be critical here.  Event processing isn’t a simple, single-point task; it’s a series of interlocking processes, primary and secondary events, correlations, and distributions.  Eventually it could feed back to traditional transaction processing, and even reshape much of that, and the IT elements that support it, along the way.  We have “shallow” or primary events close to the surface or edge, and “deep” events that could reside in the traditional, more centralized cloud, or in the data center.  Wherever the events are, they’re part of that interlocking process set, and so somebody who wins in one place might well, lacking competition, win in them all.

The cloud providers clearly know this, and they’ve taken the lead in defining event-driven compute models.  Nearly fifty enterprises have told me that all three major cloud providers are further along in their thinking on event processing than their software vendors are, and nearly all say the cloud providers are further along than the enterprises themselves.  However, the former assertion isn’t really true.  Vendors are equipped to offer the enterprise just as enlightened an event architecture as the cloud providers could; they’re just not doing the job, either because they fear lengthening their sales cycle or because their sales organizations don’t understand event processing.

This is the source of the cloud/edge/fog battles we’ll see fought this year.  If public cloud providers can offer a strong architecture for “deep events” then they can use the Greengrass strategy, or in the case of Microsoft simply support Windows Server local hosting, to make the edge an extension of the cloud and support “shallow events”.  They can then extend their own hosting closer to the edge over time, confident they have grabbed control of event processing.

If they don’t, then eventually the right architecture for both deep and shallow events will be created by software professionals building middleware tools, and this could be used to build a data-center-driven event processing model that wouldn’t use the cloud any more than current transactional or web/mobile applications would.  The event cloud could be cut off at the knees.

So far, things are on the side of the cloud providers.  As I noted earlier, enterprises aren’t hearing much about event processing from their software vendors, only from their cloud resources.  As we evolve toward event-driven computing, with or without the aid of IoT, we can expect it to become the singular focus of technical modernization of IT infrastructure.  That’s a very important win…for somebody.