The Multiple Dimensions of Service Data Models

Service modeling has been a recurring topic for my blogs, in part because of its critical importance in what I’ve called “service lifecycle automation” and the industry now calls “zero-touch automation”.  Another reason it’s recurring is that I keep getting requests from people to cover another angle of the topic.  The most recent one relates to the notion of a “service topology”.   What the heck is that, many are wondering.  Good question.

A “service” is a cooperative relationship between functional elements to accomplish a systemic goal that the service itself defines.  The “functional elements” could be legacy devices like switches, routers, optical add-drop multiplexers, and so forth.  In the modern software-defined age, it could also be a bunch of virtual functions, and obviously it could be a combination of the two.

A service model, then, could be seen as a description of how the functions cooperate, a description of their relationship.  In a “network” of devices, a service model might be a nodal topology map, perhaps overlaid with information on how each node is induced to behave cooperatively.  This is our traditional way of thinking about services, for obvious reasons.  It assumes that the service is a product of the network, which is an entity with specific structure and capabilities.  In our software-defined era, things get more complicated, and at several levels.

The first level is how functions map to devices.  Yes, you could visualize a future service as a collection of virtual devices that map 1:1 to the same traditional boxes in the same physical topology.  That could be called the “cloud-hosted node” model.  NFV proposes a different approach, where virtual functions derived from physical functions (devices) are instantiated per customer, per service.  Thus, a service isn’t always coerced from communal nodes; it can be created ad hoc from service-specific hosted elements.

The second level is the stuff that we’re hosting on.  In software-defined services of any sort, we have a pool of resources that are committed as needed, rather than dedicated and specialized boxes.  Since the pool of resources is shared, we can’t let the individual services just grab at it and do what they want.  We need some notion of virtualization of resources, where every service thinks it has dedicated elements of some sort but actually draws them from the pool.  The pool is a “black box” with no structural properties visible.

Down inside the resource-pool black box, we still have physical topologies, because we have servers, switches, data centers, trunks, and other stuff that create the “resource fabric” that makes up the pool.  However, we now have a series of virtual topologies, representing the relationships of the software elements that make up the services.

While you might be tempted to think of a virtual topology as the same kind of trunk-and-node structure we’re familiar with from switch/router networks, just translated to “virtual switch/router” and “virtual trunks”, that’s not logical.  Why would someone want to define, at the service virtual level, a specific virtual-node structure?  A VPN might really be a collection of cooperating routers, but to the user it would look like one giant virtual router.  That’s what the virtual-service topology would also look like.

A virtual-service, or let’s face it, any service, topology is really a functional topology.  It describes the functions that are needed, not how those functions are delivered.  When we have to pass through the boundary between services and the resources, we need to map to the real world, but we don’t want to pull real-world mapping specifics into the virtual service layer.  Thus, the functional view of a VPN would be “access-plus-VPN”.

This lets us shine some light on issues that are raised periodically on service modeling.  Most would agree that there are two contenders today in the model space—TOSCA from OASIS and YANG from the IETF.  In terms of genesis, YANG is a data modeling language that was developed to describe how to configure a physical nodal topology of devices.  TOSCA is a declarative language to describe (as the acronym “Topology and Orchestration Specification for Cloud Applications” suggests) the deployment of components in the cloud.  If you take the root missions as the current definitions for each language, then they both belong to the process of decomposing functions into resource cooperation.  The more “functional” or software-defined you are, the more TOSCA makes sense.  If you see nothing but physical devices or hosted equivalents thereof, you can use YANG.

What about the service layer?  There, you’ll recall, I suggested that we really had a function topology—“access” and “VPN” in the example.  You could say that “VPN-Service” as a model element decomposes into “VPN-Access” and “VPN-Core”.  You could say that “VPN-Access” has two interfaces—customer-side and network-side, the latter of which connects to one of a variable number of “customer-side” interfaces on “VPN-Core”.  We could visualize this as some kind of pseudo-molecule, a fat atom being the core and a bunch of little atoms stuck to the surface and representing the access elements.
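
To make that pseudo-molecule a bit more concrete, here’s a minimal sketch in Python—purely my own illustration, with invented class and field names that don’t come from TOSCA, YANG, or any standard—of a functional topology expressed as a hierarchy of elements with named interfaces:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FunctionalElement:
    """A node in the functional (not physical) service topology."""
    name: str
    interfaces: List[str] = field(default_factory=list)
    children: List["FunctionalElement"] = field(default_factory=list)

def build_vpn_service(access_sites: List[str]) -> FunctionalElement:
    """VPN-Service decomposes into one VPN-Core plus one VPN-Access per site."""
    core = FunctionalElement("VPN-Core", interfaces=["customer-side"])
    service = FunctionalElement("VPN-Service", children=[core])
    for site in access_sites:
        service.children.append(FunctionalElement(
            f"VPN-Access:{site}",
            interfaces=["customer-side", "network-side"],  # network-side binds to VPN-Core
        ))
    return service

# The "fat atom" (core) with little access atoms stuck to its surface:
model = build_vpn_service(["NYC", "LON", "TOK"])
print([child.name for child in model.children])
```

Note that nothing in this structure says anything about routers, trunks, or hosting; it describes functions and their relationships, which is the whole point of a functional topology.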

It’s perfectly possible to describe the service structure as a series of intent model objects in a hierarchy, and use something like XML to do that.  The initial CloudNFV project did that, in fact.  However, it’s also possible to describe that structure in either TOSCA or YANG.  I favor TOSCA because it is designed for the deployment of hosted elements, not for the configuration of semi-fixed elements (but again, you can probably use either).

But let’s leave modeling religious arguments and return to our hierarchy (however we model it).  A high-level element like “VPN-Access” or “VPN-Core” could be decomposed in a number of ways depending on the specific nature of the service, things like where the users are located and what the infrastructure in that area looks like.  Thus, we could expect to see some possible decomposition of each of our high-level elements based on broad things like administrative domain/geography.  When we know that a specific user access point in a specific location needs “VPN-Access” we would expect to see the model decompose not into other models, but into some configuration or deployment language that can actually commit/coerce resources.
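
Here’s a rough sketch of what that decomposition step might look like—hypothetical Python rather than any real modeling language, with a domain table and recipe names I’ve invented purely for illustration:

```python
# Hypothetical decomposition policy: a high-level element picks its next step
# based on the administrative domain/geography of the access point being served.
DECOMPOSITIONS = {
    # (element, domain) -> either a lower-level model or a deployment recipe
    ("VPN-Access", "us-east"): {"model": "VPN-Access-MPLS"},
    ("VPN-Access", "eu-west"): {"model": "VPN-Access-vCPE"},
    ("VPN-Access-MPLS", "us-east"): {"recipe": "configure_pe_router"},  # leaf: coerce a device
    ("VPN-Access-vCPE", "eu-west"): {"recipe": "deploy_vcpe.yaml"},     # leaf: commit hosted resources
}

def decompose(element: str, domain: str) -> str:
    """Walk the model downward until we hit something that actually commits resources."""
    step = DECOMPOSITIONS.get((element, domain))
    if step is None:
        raise ValueError(f"No decomposition for {element} in {domain}")
    if "recipe" in step:
        return step["recipe"]                # hand off to deployment/configuration tooling
    return decompose(step["model"], domain)  # otherwise keep decomposing into lower models

print(decompose("VPN-Access", "eu-west"))    # -> deploy_vcpe.yaml
```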

The virtualization layer of our structure creates a kind of boundary point between services and resources, and it’s convenient to assume that we have “service architects” that build above this point and “resource architects” that handle what happens below.  If we build services from collected resource behaviors, it makes sense to presume that resource architects create a set of behaviors that are then collected into model elements at the boundary point, and are visible and composable pieces of the service.  There may be a model hierarchy above the boundary point, as illustrated here, and there might also be a hierarchy used to organize the resource behaviors according to the kind of infrastructure involved.

In this example, we could say that “VPN-Access” is such a boundary behavior element.  Every administrative domain that provides access to users would be expected to provide such an element, and to decompose it as needed to create the necessary connections for the full set of users it offers such a service to.  As this decomposition and commitment progresses, the functional/hierarchical model of the service creates a physical topology that connects information flows.  A VPN-Access trunk goes into a VPN-Core port.  This binding has to be an attribute of the deployment (or redeployment) of the service.

What this says is that there is a service topology at two levels, one at the functional/descriptive level, and one at the resource/connection level.  The first creates the second, and mediates the ongoing lifecycle management processes.

The boundary between these two layers is what’s really important, and yet we’ve had almost no industry focus on that point.  Operators need to be able to build services from some specific toolkit, even in a virtual world.  Operators need to be able to divide the “operations” tasks between the service side (OSS/BSS) and the network side (NMS, NOC).  In any virtual world, it’s where concept meets reality that’s critical, and that is this boundary point.

What might live here?  Let’s take a specific example: “Firewall”.  I could sell a firewall feature as an adjunct to VPN-Access, but depending on how it was implemented in a given location, the abstract notion of Firewall might decompose to a VNF, a physical device, or some sort of network service.  Above, it’s all the same, but where the function meets the resource, the deployment rules would vary according to what’s available.
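
A tiny sketch of that idea, with invented site names and capability flags, just to show where the variation actually lives:

```python
# Illustrative only: one abstract "Firewall" function, three possible realizations.
# Which one is chosen depends entirely on what the local resource domain offers.
SITE_CAPABILITIES = {
    "site-A": {"nfv_pool": True},           # server capacity available -> host a VNF
    "site-B": {"appliance": "fw-edge-01"},  # a physical firewall is already in place
    "site-C": {},                           # neither -> buy it as a network service
}

def resolve_firewall(site: str) -> dict:
    caps = SITE_CAPABILITIES.get(site, {})
    if caps.get("nfv_pool"):
        return {"type": "vnf", "image": "firewall-vnf:latest"}
    if "appliance" in caps:
        return {"type": "device", "target": caps["appliance"]}
    return {"type": "network-service", "provider": "partner-firewall-service"}

for site in SITE_CAPABILITIES:
    print(site, resolve_firewall(site))   # same abstraction above, different realization below
```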

We could probably gain a lot by forgetting the modeling-language issues for the moment, and focusing on what happens at the virtual service/resource boundary.  Each language could then present its credentials relative to that point, and we could see how it frames what we do now, and what we expect from the network of the future.

Can the TMF Retake the Lead in Zero-Touch Automation?

On the ever-popular topic of zero-touch automation, Light Reading has an interesting piece that opens what might well be the big question of the whole topic—“What about the TMF?”  The TMF is the baseline reference for OSS/BSS, a body that at least the CIO portions of operator organizations tend to trust.  The TMF has also launched a series of projects aimed in some way at the problem of zero-touch automation.  As the article suggests, there are two in particular that are currently aimed at ZTA, the Open Digital Architecture (ODA) and the Zero Touch Orchestration, Operations, and Management (ZOOM) initiatives.  Could the TMF resolve everything here?  We need to look at the assets and liabilities.

Engagement is critical with anything in the zero-touch area, and the TMF has both plusses and minuses in that category.  Among CIO organizations the TMF is generally well-regarded, and it has long-standing links to the OSS/BSS vendors too.  The body runs regular working group and plenary meetings, the latter being mega-events that are as much trade shows as technical meetings (not unlike MWC).

Outside the CIO area, things are a bit different.  Among the rest of the C-suite executives, the TMF is held in lower regard than bodies like ETSI, and there are some in the staff of CTO, CMO, and COO execs that think the whole OSS/BSS program should be scrapped and redone, under different industry bodies.

This division is critical at this particular moment because the zero-touch initiative really demands a solution that crosses the boundary between OSS/BSS and the COO/CTO missions of network technology transformation and operations transformation at the NMS level.  The barriers to having just one C-suite exec take over the whole ZTA space are profound, in operator political terms.

Technology is also critical to ZTA, of course, and there the TMF has made the greatest single contribution of any body.  The so-called “NGOSS Contract” (where NGOSS means New Generation Operations Systems and Software) established the first recognition of event-driven operations and the use of a model to steer events to processes.  I think it would be appropriate to say that if this model had been developed and accepted fully and logically, we’d have the right solution to ZTA already.

Why didn’t that happen?  That’s a very difficult question for anyone to answer, but one to which many people have contributed an answer.  What I hear from others is that 1) vendors obstructed the vision, 2) operators were reluctant to adopt something that transformative, 3) TMF people never articulated the concept correctly, and 4) the TMF never really saw the connection between NGOSS Contract and the later ODA and ZOOM work.

I was a member of the TMF at the time when the NGOSS Contract work came to be, though not a part of it.  I was also a member when ZOOM kicked off, but engagement in TMF activity (like engagement in any standards activity) is both time-consuming and frustrating, and I couldn’t commit the time and energy as the work formalized.  I was never involved in ODA in any form.  I mention this because my own view, drawn from my exposure and the comments I’ve gotten from operators, suggests that there is some truth in all four of the reasons why the TMF NGOSS Contract stuff didn’t go where it might have.

The thing I think is critical here is that there is no reason why the TMF couldn’t now do what it didn’t do a decade ago when NGOSS Contract was hot.  At least, no new reason, because those same four issues are still pushing just as hard as they did a decade ago when all this started.  Not only that, the same four issues will have to be addressed by anyone who thinks they can make ZTA work.

The simple truth is that zero-touch automation can be made to work easily if a service model (like the TMF SID) is used to describe the elements of a service as functional models that contain the way that “events” within the scope of each element are directed to processes based on a state/event model of some sort.  That’s what is needed, and to the extent that you can describe ODA or ZOOM that way, they can be made to resolve the ZTA problem.
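
As a rough illustration of that principle—my own minimal sketch, not the TMF SID or any actual NGOSS Contract artifact—each model element would carry a state/event table that names the process to run and the next state to enter:

```python
# Each service element is a little state machine; the model data, not the code,
# decides which operations process handles a given event in a given state.
STATE_EVENT_TABLE = {
    ("ordering",  "activate-request"): ("deploy_element",   "deploying"),
    ("deploying", "deploy-complete"):  ("start_billing",    "active"),
    ("active",    "fault"):            ("run_remediation",  "degraded"),
    ("degraded",  "fault-cleared"):    ("confirm_recovery", "active"),
}

class ServiceElement:
    def __init__(self, name: str):
        self.name = name
        self.state = "ordering"

    def handle(self, event: str) -> None:
        process, next_state = STATE_EVENT_TABLE.get(
            (self.state, event), ("log_and_ignore", self.state)
        )
        print(f"{self.name}: {self.state} + {event} -> run {process}, enter {next_state}")
        self.state = next_state

vpn_core = ServiceElement("VPN-Core")
vpn_core.handle("activate-request")
vpn_core.handle("deploy-complete")
vpn_core.handle("fault")
```

The key is that the table lives in the service model, so changing lifecycle behavior means changing data, not rewriting operations software.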

I believe firmly that there are probably a hundred people in the TMF (or affiliated as one of their levels of professional partners) who could do exactly what is needed.  I believe most of them were involved through the last decade, too, and somehow they didn’t do what they could have.  That’s another “Why?” that probably has lots of opinions associated with it.  The short answer is that projects like ODA and ZOOM weren’t launched specifically to build on NGOSS Contract.  There were TMF Catalysts (demos, usually involving multiple vendors and given at one of the big TMF events) that showed many of the critical steps.  What was lacking was a nice blueprint that showed everything in context.

This is a failure of articulation, again.  For the TMF, articulation is hard because the body has created a kind of micro-culture with its own language and rules, and they don’t make much effort to communicate outside themselves.  Remember that most of the C-suite is outside the TMF, so that failure to speak a broadly understood industry language is a major problem.  But even when I was involved in the early ZOOM work, I didn’t see much attention being paid to NGOSS Contract, even when I brought it up as part of the CloudNFV work.

The Light Reading article raises a good point, or question.  “But is there not,” it asks, “a risk that yet another initiative sows confusion and leads to fragmentation, rather than standardization?”  Darn straight there is, but there may be an even bigger issue.  Framing a simple but not fully understood solution like NGOSS Contract inside something else risks losing touch with what works.  Every new layer of project or acronym may generate new claims of relevancy, but each one moves further from the basic truth that could have worked a decade ago.

The TMF has issues, in my view, and I’ve never shied away from that point.  It also has the most important core asset of all, the core understanding of the relationship between events, data models of services, and processes.  They’ve under-used, or perhaps even ignored, that asset for a decade, but they could still make truly explosive progress if they just did what should come naturally, and exploit their strengths.

To do that, they have to explicitly go back to NGOSS Contract, and they have to accept the fundamental notion that a service model (pick your modeling language) is a hierarchy or topology of element models, each of which obeys the NGOSS-Contract mandate to mediate the event/process relationship.  Forget all the TMF-private nomenclature for a moment and just do this in industry-speak.  Once that’s done, they need to frame their ODA work first in that model, again using common language.  Only when they can do that should they dive into TMF-oriented specifics.  That makes sure that what they do can be assessed, and appreciated, by the market overall.

Can they do this?  Bodies like the TMF tend to fall into their own private subcultures, as I’ve already noted.  That’s what contributes to the problem the Light Reading piece cites, the tendency for new stuff to build another floor on the Tower of Babel instead of simplifying the problem or presenting solutions.

What Operator Projects in Operations Automation Show Us

In my last blog, I talked about the need to plan strategically and implement tactically.  The logic behind that recommendation is that it’s all too easy to take short-term measures you think will lead somewhere, and then go nowhere beyond the short term.  Looking at operators’ actions in 2017 and their plans in 2018, I see that risk already biting some critical projects.

Based on my research and modeling, I’ve developed a picture of the trends in what I call “process opex”, meaning the non-capital costs associated with network and service operations.  In 2017, they ran about 24 cents on each revenue dollar across all network operators, which is more than the total capital budget of operators.  SDN, NFV, and zero-touch automation all promise to reduce process opex, and in fact the savings that could have been achieved in 2016 and 2017 amounted to about six cents per revenue dollar.  What actually happened, according to operators themselves, was a savings of just a bit more than two cents per revenue dollar.

I was curious about the problem here, and what I learned when I asked operators for the reason for the savings shortfall was pretty simple and also sadly expected.  According to operators, they did not start their projects with the target of complete zero-touch automation.  Instead, they attacked individual subsystems of their network and operations processes.  They also didn’t consider any broad automation mission in the way they did their per-subsystem attacks.  Many even said that they didn’t group multiple subsystems and try to find a solution across the group.

Only about five percent of operators seem to understand there’s a risk associated with this approach, and none of them seem to have quantified it or even fully determined what its scope could be.  A couple of them were operators I’d worked closely with, and I raised with them the touchy question of whether this might end up locking them into a path that didn’t go where they wanted.  Only one had a comment, and it was “I do understand that might happen, but we have no options for a broad strategy being presented to us at this point, and we need to do something now.”

What made this particular comment troubling was that this operator was also assessing ONAP/ECOMP, and so it could be argued that they did have a possible broader pathway to automation already under review.  Pushing my luck, I asked why ONAP wasn’t seen as a possible basis for a long-term automation approach, and the response was “Oh, they’re not there yet and they don’t really know when that might be possible.”

The urgency in addressing automation, for the purpose of reducing operations costs and improving both accuracy and agility, is understandable.  Operators have been under pressure to improve profit per bit for over five years, and it’s only been in the last two that they’ve really done much.  What they’ve done was to attack areas where they had a clear problem and a clear path forward, and as my specific contact noted, they weren’t being offered anything broad and systemic at the time their initiatives kicked off.

They still aren’t being offered anything, of course.  If you look at the things operators have already attacked (you can find a list of some in a Light Reading piece on this topic), you find mostly stuff in the OSS/BSS area.  You also see areas that are undergoing changes because of the changing nature of services, customer support, and competition.

I think the fact that the current opex-reducing automation initiatives really got started late in 2015 or early 2016 is also a factor.  There was no zero-touch automation group in ETSI then.  Most NFV work being done at that point didn’t really address the broader questions of automating services that consisted of virtual and physical functions either.  The notions of service events and event-driven models for service and network management were still in the future, and AT&T’s ECOMP wasn’t even really announced until March of 2016.  It’s also true that ECOMP as it was defined, and largely as it still is, was targeted to deploy “under” OSS/BSS rather than as a tool to orchestrate OSS/BSS processes.  Once operators got started with OSS/BSS-linked automation projects, they were out of scope to the ECOMP evaluation people.  They were, in most cases, even run by a different group—the CIO for OSS/BSS and the CTO for ECOMP.

Operators, as I’ve said, do understand that they’re likely to lose something with their current approach (but see no alternative).  How much they might lose is something they’ve not assessed and can’t communicate, but if I tweak my model I can make a guess.

Today process opex is running about 24 cents per revenue dollar across all operators.  The incremental growth in process opex through 2020 is projected to be four cents per revenue dollar, and automation can relieve only two cents of that, making 2020 process opex 26 cents per revenue dollar.  It could have been 23 cents if optimum automation practices had been adopted and phased in beginning in 2016.  The model says that achieving optimality now would lower the 2020 number to 25 cents, which is still two cents more than optimum.  To put things into perspective, cutting capex by ten percent would save that same two cents.
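
Laying that arithmetic out explicitly (cents per revenue dollar, using only the figures quoted above):

```python
# All figures in cents per revenue dollar, taken from the model results quoted above.
opex_2017 = 24                    # process opex today
growth_to_2020 = 4                # incremental growth absent better automation
relief_current_path = 2           # what today's piecemeal projects are expected to relieve

opex_2020_current = opex_2017 + growth_to_2020 - relief_current_path   # 26
opex_2020_optimal_from_now = 25   # if optimum practices were adopted starting today
opex_2020_optimal_from_2016 = 23  # if optimum practices had been phased in from 2016

print("2020 process opex, current path:", opex_2020_current)
print("penalty for starting optimum practices now vs. 2016:",
      opex_2020_optimal_from_now - opex_2020_optimal_from_2016)   # 2 cents
```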

Losses are cumulative, and so are shortfalls in savings.  We could have hoped to save an accumulated total of twenty-six cents on each revenue dollar with zero-touch automation from 2016 through 2020.  What operators now think they will save with their incremental approach is less than half that amount.  There is a LOT of savings left on the table, enough to fund more than half-a-year’s capex budget.  That’s the price of not having a grand strategy to adopt.

Where things could get really complicated is when you take this beyond 2020.  IoT, 5G, increased reliance on Internet-only customers because of streaming competition, and interest in expanding the use of SDN and NFV could combine to add another three or four cents to process opex by 2025.  For some possible services, the lack of automated tools in management and operations could threaten the business case, or kill it.

I got a LinkedIn message that might give us some insight into why we didn’t have that grand strategy.  Norman Dorn sent me a link to a new-gen functional architecture for telecommunications systems that was developed by Bellcore before the iPhone came out.  The insight the list of goals demonstrates is dazzling, but Apple came along with a commercial product that swept the market without dealing with most of the issues, particularly insofar as they related to the relationship between network technology and appliance technology.  Comprehensive strategy will always support optimality, but it may not support practicality or competitive advantage.

There has to be a middle ground for operators in areas like zero-touch automation, 5G, IoT, and even SDN/NFV.  We could still find it for all of these things, but to do that we may have to abandon the traditional standards-driven approach and try to mold open-source projects into the incubators of future technology.  The only thing in that direction that’s worked so far is the “ECOMP model” where the buyer does the early work and when critical mass has been developed, the result is open-sourced.  That puts too many functional and strategic eggs in that first buyer’s basket.  Maybe it’s time to work out a way for technical collaboration of network operators of all sorts, without vendor interference and regulatory barriers.

If We’re in the Software Defined Age, How Do We Define the Software?

We are not through the first earnings season of the year, the first after the new tax law passed, but we are far enough into it to be able to see the outlines of technology trends in 2018.  Things could be a lot worse, the summary might go, but they could also still get worse.  In all, I think network and IT execs are feeling less pressure but not yet more optimism.

Juniper’s stock slid significantly pre-market after its earnings report, based on weak guidance for the first quarter of this year.  Ericsson also reported another loss, and announced job cuts and what some have told me is an “agonizing” re-strategizing program.  Nokia’s numbers were better, but mostly compared with past years, and they still showed margin pressure.  On the provider side, Verizon and AT&T have reported a loss of wireline customers, with both companies saying that some online property (Yahoo, etc., for Verizon and DirecTV Now for AT&T) helped them offset the decline.

In the enterprise space, nine of every ten CIOs tell me that their primary mission in budgeting is not to further improve productivity or deliver new applications, but to do more with less.  Technology improvements like virtualization and containers and orchestration that can drive down costs are good, and other things need not apply.  The balance of capital spending between sustaining current infrastructure and advancing new projects has never in 30 years tipped as heavily toward the former goal.

In the world of startups, I’m hearing from VCs that there’s going to be a shake-out in 2018.  The particular focus of the pressure is the “product” startups, those who have hardware or software products, and in particular those in the networking space.  VCs say that they think there are too many such startups out there, and that the market has already started to select among the players.  In short, the good exits are running out, so it’s time to start holding down costs on those who aren’t already paying off for you.

Something fundamental is happening here, obviously, and it’s a something that the industry at large would surely prefer to avoid.  So how can we do that?

Every new strategy has to contend with what could be called the “low apple syndrome”.  It’s not only human nature to do the easy stuff first, it’s also good business because that probably represents the best ROI.  The challenge that the syndrome creates is that the overall ROI of a new strategy is a combination of the low and high apples, and if you pluck the low stuff then the residual ROI on the rest can drop precipitously.  The only solution to that problem is to ensure that the solution/approach to those low apples can be applied across the whole tree.  We have to learn to architect for the long term and exploit tactically, in short.

There are two forces that limit our ability to do that.  One is that vendors in the space where our new strategies and technologies are targeted want to sustain their revenue streams and incumbent positions.  They tend to push a slight modification of things, something that doesn’t rock the boat excessively.  However, they position that as being a total revolution, and that combination discourages real revolutionary thinking at the architecture level.

The other force is the force of publicity.  We used to have subscription publications in IT, but nearly everything today is ad-sponsored.  Sponsor interests are likely to prevail there.  Many market reports, some say most, are paid for by vendors and thus likely to favor vendor interests.  Even where that’s not directly true, one producer of market forecast reports once said to me that “There’s no market for a report that shows there’s no market for something.  Report buyers want to justify a decision to build a product or enter a market.”  I used to get RFPs from firms looking to outsource analyst reports, and the RFP would start with something like “Develop a report validating the hundred-billion-dollar market for xyz.”   Guess what the market report will end up showing.  How do you get real information to buyers under these conditions?

OK, if we have two forces that limit us, we need two counterbalancing forces.  The essential one is a buyer-driven model in specification, standardization, and open source.  I’ve seen vendor/buyer tension in every single standards activity I’ve been involved in, and the tension is often impossible to resolve effectively because of regulatory constraints.  Many geographies, for example, don’t allow network operators to “collude” by working jointly on things, and valuable initiatives have been hampered or actually shut down because of that.

The operators may now have learned a way of getting past this issue.  AT&T’s ECOMP development was done as an internal project, and then released to open-source and combined with OPEN-O orchestration of NFV elements to create ONAP.  The fact that so much work went into ECOMP under AT&T means that even though open-source activity would likely face the same regulatory constraints as standards bodies, vendors have a harder time dominating the body because much of the work is already done.  AT&T is now following that same approach with a white-box switch OS, and that’s a good thing.

The second solution path is establishing software-centric thinking.  Everything that’s happening in tech these days is centered around software, and yet tech processes at the standards and projects level are still “standards-centric”, looking at things the way they’d have been looked at thirty years ago.  Only one initiative I’m aware of took a genuinely software-centric view—the IPsphere Forum or IPSF, which got started a decade ago.  This body introduced what we now call “intent models”, visualizing services as a collection of cooperative but independent elements, and even proposed the notion of orchestration.  However, it foundered on operator concerns about anti-trust regulation, since operators were driving the initiative.

Clearly, it’s this second path that’s going to be hard to follow.  There’s a lot of software skill out there, but not a lot of strong software architects, and the architecture of any new technology is the most critical thing.  If you start a software project with a framework that presumes monolithic, integrated components linked by interfaces—which is what a traditional box solution would look like—that’s what you end up with.

The NFV ISG is a good example of the problem.  The body has originated a lot of really critical stuff, and it was the second source (after the IPSF) of the application of “orchestration” to telecom.  However, it described the operation of NFV as the interplay of functional blocks, something easy to visualize but potentially risky in implementation.  Instead of framing NFV as an event-driven process, it framed it as a set of static elements linked by interfaces—boxes, in short.  Now the body is working to fit this model to the growing recognition of the value of, even need for, event-driven thinking, and it’s not easy.
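
To see the difference in miniature—this is hypothetical Python, not anything drawn from the ISG specs, and the VNFM/VIM names are borrowed only as labels for the functional blocks—the “box” framing is a chain of blocking interface calls between fixed components, while the event framing posts events that any interested process can pick up:

```python
import queue

# "Box" framing: fixed components wired together by direct interface calls.
def vim_allocate(vnf: str) -> str:
    return f"{vnf} deployed"

def vnfm_deploy(vnf: str) -> str:       # the caller must know whom to call, and waits
    return vim_allocate(vnf)

print(vnfm_deploy("firewall-vnf"))

# Event framing: components react to events on a bus; nothing blocks, and new
# handlers can be added without rewiring the components that emit the events.
bus = queue.Queue()
handlers = {
    "deploy-request": [lambda e: bus.put({"type": "deployed", "vnf": e["vnf"]})],
    "deployed":       [lambda e: print("lifecycle manager sees", e)],
}

bus.put({"type": "deploy-request", "vnf": "firewall-vnf"})
while not bus.empty():
    event = bus.get()
    for handler in handlers.get(event["type"], []):
        handler(event)
```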

I think that blogs are the answer to the problem of communicating relevant points, whether you’re a buyer or seller.  However, a blog that mouths the same idle junk that goes into press releases isn’t going to accomplish anything at all.  You need to blog about relevant market issues, and introduce either your solution or a proposed approach in the context of those issues.  You also need to blog often enough to make people want to come back and see what you’re saying.  Daily is best, but at least twice per week is the minimum.

A technical pathway that offers some hope of breaking the logjam on software-centric thinking is the open-source community.  I think ONAP has potential, but there’s another initiative that might have even more.  Mesosphere’s DC/OS platform combines Apache Mesos with the Marathon orchestrator, and all of this is tied into a model of deployment of functional elements on highly distributed resource pools.  Marathon has an event bus, which might make it the most critical piece of the puzzle for software-defined futures.  Could it be that Red Hat, who recently acquired CoreOS for its container capabilities, might extend their thinking into event handling, or that a competitor might jump in and pick up the whole DC/OS stack and run with it?  That could bring a lot of the software foundation for a rational event-driven automation future into play.
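
As a hint of what building on that could look like, here’s a hedged sketch of a consumer of Marathon’s event bus.  I’m assuming the server-sent-events stream Marathon exposes (the /v2/events endpoint), so check the API documentation for whatever version you actually run; the host name is obviously hypothetical.

```python
import json
import requests  # third-party HTTP library (pip install requests)

MARATHON = "http://marathon.example.com:8080"   # hypothetical Marathon host

def watch_events() -> None:
    """Stream Marathon events and hand them to whatever lifecycle logic cares."""
    resp = requests.get(f"{MARATHON}/v2/events",
                        headers={"Accept": "text/event-stream"}, stream=True)
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            event = json.loads(line[len("data:"):].strip())
            # In a zero-touch design, this is where events would be steered into
            # a state/event model rather than handled by ad hoc scripts.
            print("event:", event.get("eventType"))

if __name__ == "__main__":
    watch_events()
```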

Don’t be lulled into thinking that open-source fixes everything automatically.  Software-centric thinking has to be top-down thinking, even though it’s true that not everyone designs software that way.  That’s a challenge for both open-source and standards groups, because they often want to fit an evolutionary model, which ties early work to the stuff that’s already deployed and understood.  It shouldn’t be impossible to think about the “right” or “best” approach to a problem in light of future needs and trends while at the same time keeping that future from disconnecting from present realities.  “Shouldn’t” apparently isn’t the same as “doesn’t”, though.  In fairness, we launched a lot of our current initiatives before the real issues were fully explored.  We have time, in both IoT and zero-touch automation, to get things right, but it’s too soon to know whether the initiatives in either area will manage to strike the right balance between the optimum future and preservation of the present.

The critical truth here is that we live in an age defined by software, but we still don’t know how to define the software.  Our progress isn’t inhibited by lack of innovation as much as by lack of articulation.  There are many places where all the right skills and knowledge exist at the technical level.  You can see Amazon, Microsoft, and Google all providing the level of platform innovation needed for an event-driven future, a better foundation for things like SDN and NFV than the formal processes have created.  All of this is published and available at the technical level, but it’s not framed at the management level in a way suited to influencing future planning in either the enterprise or service provider spaces.  We have to make it possible for architectures to seem interesting.

Complex solutions are hard to adopt widely, but complexity in a solution is a direct result of the need to address complex problems.  It’s easy to say that “AI” will fix everything by making our software understand us, rather than asking us to understand the software.  The real solution is, you guessed it, more complicated.  We have to organize our thinking, assemble our assets, and promote ecosystemic solutions, because what we’re looking for is technology changes that revolutionize a very big part of our lives and our work.

How Events Evolve Us to “Fabric Computing”

If you read this blog regularly you know I believe the future of IT lies in event processing.  In my last blog, I explained what was happening and how the future of cloud computing, edge computing, and IT overall is tied to events.  Event-driven software is the next logical progression in an IT evolution I’ve followed for half a century, and it’s also the only possible response to the drive to reduce human intervention in business processes.  If this is all true, then we need to think about how event-driven software would change hosting and networking.

I’ve said in previous blogs that one way to view the evolution of IT was to mark the progression of its use from something entirely retrospective—capturing what had already happened—to something intimately bound with the tasks of workers or users.  Mobile empowerment, which I’ve characterized as “point-of-activity” empowerment, seems to take the ultimate step of putting information processing in the hands of the worker as they perform their jobs.  Event processing takes things to the next level.

In point-of-activity empowerment, it is possible that a worker could use location services or near-field communications to know when something being sought is nearby.  This could be done in the form of a “where is it?” request, but it could also be something pushed to a worker as they moved around.  The latter is a rudimentary form of event processing, because of its asynchronicity.  Events, then, can get the attention of a worker.

It’s not a significant stretch to say that events can get the attention of a process.  There’s no significant stretch, then, to presume the process could respond to events without human intermediation.  This is actually the only rational framework for any form of true zero-touch automation.  However, it’s more complicated than simply kicking off some vast control process when an event occurs, and that complexity is what drives IT and network evolution in the face of an event-driven future.

Shallow or primary events, the stuff generated by sensors, are typically simple signals of conditions that lack the refined contextual detail needed to actually make something useful happen.  A door sensor, after all, knows only that a door is open or closed, not whether it should be.  To establish the contextual detail needed for true event analysis, you generally need two things—state and correlation.  The state is the broad context of the event-driven system itself.  The alarm is set (state), therefore an open door is an alarm condition.  Correlation is the relationship between events in time.  The outside door opened, and now an interior door has opened.  Therefore, someone is moving through the space.
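
A toy version of that door-sensor logic (illustrative Python; the event names and the 30-second correlation window are invented) shows how little of the decision actually lives in the raw event itself:

```python
from collections import deque
import time

armed = True               # state: the broad context of the event-driven system
recent = deque(maxlen=10)  # correlation window: recent events with timestamps

def on_event(name: str, now: float) -> None:
    recent.append((name, now))
    # State: an open outer door only matters if the alarm is armed.
    if name == "outer-door-open" and armed:
        print("alarm condition: perimeter breached")
    # Correlation: outer door then inner door within 30 seconds implies movement inside.
    if name == "inner-door-open":
        for prior, t in recent:
            if prior == "outer-door-open" and now - t < 30:
                print("correlated: someone is moving through the space")

on_event("outer-door-open", time.time())
on_event("inner-door-open", time.time() + 5)
```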

I’ve used a simple example of state and correlation here, but in the real world both are likely to be much more complicated.  There’s already a software discipline called “complex event processing” that reflects just how many different events might have to be correlated to do something useful.  We also see complex state notions, particularly in the area of things like service lifecycle automation.  A service with a dozen components is actually a dozen state machines, each driven by events, and each potentially generating events to influence the behavior of other machines.

Another complicating factor is that both state and correlation are, so to speak, in the eye of the beholder.  An event, in a complete processing sense, sits at the root of a complex topological map that links the primary or shallow event to a series of processing chains.  What those chains involve will depend on the goal of the user.  A traffic light change in Manhattan, for example, may be relevant to someone nearby, but less so to someone in Brooklyn and not at all to someone in LA.  A major traffic jam at the same point might have relevance to our Brooklyn user if they’re headed to Manhattan, or even to LA people who might be expecting someone who lives in Manhattan to make a flight to visit them.  The point is that the things that matter will depend on who they matter to, and the range of events and nature of processes have that same dependency.

When you look at the notion of what I will call “public IoT”, where there are sensor-driven processes that are available as event sources to a large number of user applications, there is clearly an additional dimension of scale in the distribution of events.  Not everyone can read a simple sensor directly, or you’d have the equivalent of a denial-of-service attack on it.  In addition, primary events (as I’ve said) need interpretation, and it makes little sense to have thousands of users do the same interpretation and correlation.  It’s more sensible to have a process do the heavy lifting and dispense the digested data as an event.  Thus, there’s also an explicit need for secondary events, events generated by the correlation and interpretation of primary events.
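
A sketch of that “digest once, distribute many” pattern, kept deliberately trivial—in practice the fan-out would sit behind a real pub/sub service rather than an in-process list:

```python
# Primary events come from the sensor once; a single correlation process digests
# them and publishes a secondary event to however many subscribers care.
subscribers = []   # callbacks interested in the *secondary* event

def subscribe(callback) -> None:
    subscribers.append(callback)

def on_primary_traffic_reading(vehicles_per_minute: int) -> None:
    # One process interprets the raw reading...
    if vehicles_per_minute < 5:
        secondary = {"type": "traffic-jam", "location": "midtown"}
        # ...and fans out the digested result, so thousands of users never
        # touch the sensor directly.
        for callback in subscribers:
            callback(secondary)

subscribe(lambda e: print("Brooklyn commuter app got:", e))
subscribe(lambda e: print("airport pickup planner got:", e))
on_primary_traffic_reading(3)
```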

If we could look at an event-driven system from above, with some kind of special vision that let us see events flying like little glowing balls, what we’d likely see in most event-driven systems is something like nuclear fission.  A primary neutron (a “shallow event”) is generated from outside, and it collides with a process near the edge to generate secondary neutrons, which in turn generate even more when they collide with processes.  These are the “deep events”, and it’s our ability to turn shallow events from cheap sensors into deep events that can be secured and policy managed that determines whether we could make event-driven systems match goals and public policies at the same time.

What happens in a reactor if we have a bunch of control rods stuck into the core?  Neutrons don’t hit their targets, and so we have a slow decay into the steady state of “nothing happening”.  In an event system, we need a unified process and connection fabric in which events can collide with processes with a minimum of delay and loss.

To make event-driven systems work, you have to be able to do primary processing of the shallow events near the edge, because otherwise the control loop needed to generate feedback in response to events can get way too long.  That suggests that you have a primary process that’s hosted at the edge, which is what drives the notion of edge computing.  Either enterprises have to offer local-edge hosting of event processes in a way that coordinates with the handling of deeper events, or cloud providers have to push their hosting closer to the user point of event generation.

A complicating factor here is that we could visualize the real world as a continuing flood of primary, shallow events.  Presumably various processes would do correlation and analysis, and the distribution of secondary “deeper” events would then trigger other processes.  Where does this happen?  The trite response is “where it’s important”, which means anywhere at all.  Cisco’s “fog computing” term might have more of a marketing genesis than a practical one, but it’s a good definition for the processing conditions we’re describing.  Little islands of hosting, widely distributed and highly interconnective, seem the best model for an event-driven system.  Since event processing is so linked with human behavior, we must assume that all this islands-of-hosting stuff would shift about as interests and needs changed.  It’s really about building a compute-and-network fabric that lets you run stuff where it’s needed, no matter where that happens to be, and change it in a heartbeat.

Some in the industry may have grasped this years ago.  I recall asking a Tier One exec where he thought his company would site cloud data centers.  With a smile, he said “Anywhere we have real estate!”  If the future of event processing is the “fog”, then the people with the best shot at controlling it are those with a lot of real estate to exploit.  Obviously, network operators could install stuff in their central offices and even in remote vault locations.  Could someone like Amazon stick server farms in Whole Foods locations?  Could happen.

Real estate, hosting locations, are a big piece of the “who wins?” puzzle.  Anyone can run out and rent space, but somebody who has real estate in place and can exploit it at little or no marginal cost is clearly going to be able to move faster and further.  If that real estate is already networked, so much the better.  If fogs and edges mean a move out of the single central data center, it’s a move more easily made by companies who have facilities ready to move to.

That’s because our fog takes more than hosting.  You would need your locations to be “highly interconnective”, meaning supported by high-capacity, low-latency communications.  In most cases, that would mean fiber optics.  So, our hypothetical Amazon exploitation of Whole Foods hosting would also require a lot of interconnection of the facilities.  Not to mention, of course, an event-driven middleware suite.  Amazon is obviously working on that middleware, and may well have plans to supply the needed connectivity; they’re perhaps the furthest along of anyone in defining an overall architecture.

My attempts to model all of this show some things that are new and some that are surprisingly traditional.  The big issue in deciding the nature of the future compute/network fabric is the demand density of the geography, roughly equivalent to the GDP per square mile.  Where demand density is high, the future fabric would spread fairly evenly over the whole geography, creating a truly seamless virtual hosting framework that’s literally everywhere.  Where it’s low, you would have primary event processing distributed thinly, then an “event backhaul” to a small number of processing points.  There’s not enough revenue potential for something denser.

This is all going to happen, in my view, but it’s also going to take time.  The model says that by 2030, we’ll see significant distribution of hosting toward the edge, generating perhaps a hundred thousand incremental data centers.  In a decade, that number could double, but it’s very difficult to look that far out.  And the software/middleware needed for this?  That’s anyone’s guess at this point.  The esoteric issues of event-friendly architecture aren’t being discussed much, and even less often in language that the public can absorb.  Expect that to change, though.  The trend, in the long term to be sure, seems unstoppable.