How Events Evolve Us to “Fabric Computing”

If you read this blog regularly you know I believe the future of IT lies in event processing.  In my last blog, I explained what was happening and how the future of cloud computing, edge computing, and IT overall is tied to events.  Event-driven software is the next logical progression in an IT evolution I’ve followed for half a century, and it’s also the only possible response to the drive to reduce human intervention in business processes.  If this is all true, then we need to think about how event-driven software would change hosting and networking.

I’ve said in previous blogs that one way to view the evolution of IT was to mark the progression of its use from something entirely retrospective—capturing what had already happened—to something intimately bound with the tasks of workers or users.  Mobile empowerment, which I’ve characterized as “point-of-activity” empowerment, seems to take the ultimate step of putting information processing in the hands of the worker as they perform their jobs.  Event processing takes things to the next level.

In point-of-activity empowerment, it is possible that a worker could use location services or near-field communications to know when something being sought is nearby.  This could be done in the form of a “where is it?” request, but it could also be something pushed to a worker as they moved around.  The latter is a rudimentary form of event processing, because of its asynchronicity.  Events, then, can get the attention of a worker.

It’s not a significant stretch to say that events can get the attention of a process.  There’s no significant stretch, then, to presume the process could respond to events without human intermediation.  This is actually the only rational framework for any form of true zero-touch automation.  However, it’s more complicated than simply kicking off some vast control process when an event occurs, and that complexity is what drives IT and network evolution in the face of an event-driven future.

Shallow or primary events, the stuff generated by sensors, are typically simple signals of conditions that lack the refined contextual detail needed to actually make something useful happen.  A door sensor, after all, knows only that a door is open or closed, not whether it should be.  To establish the contextual detail needed for true event analysis, you generally need two things—state and correlation.  The state is the broad context of the event-driven system itself.  The alarm is set (state), therefore an open door is an alarm condition.  Correlation is the relationship between events in time.  The outside door opened, and now an interior door has opened.  Therefore, someone is moving through the space.
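To make the state-and-correlation distinction concrete, here's a minimal sketch in Python of the door-sensor example; the class, state, and event names are purely illustrative.

```python
# State: the broad context (is the alarm armed?). Correlation: the
# relationship between events in time (outer door, then inner door).
from collections import deque

class AlarmContext:
    """Holds system state and a short window of recent events for correlation."""
    def __init__(self, armed: bool, window: int = 5):
        self.armed = armed                  # state: is the alarm set?
        self.recent = deque(maxlen=window)  # correlation: events in time order

    def interpret(self, event: str) -> str:
        self.recent.append(event)
        # State check: an open door only matters if the alarm is armed.
        if event == "door_open" and not self.armed:
            return "ignore"
        # Correlation check: outer door then inner door implies movement.
        if list(self.recent)[-2:] == ["outer_door_open", "inner_door_open"]:
            return "intrusion: someone is moving through the space"
        if self.armed and event in ("door_open", "outer_door_open", "inner_door_open"):
            return "alarm: unexpected door opening"
        return "ignore"

ctx = AlarmContext(armed=True)
print(ctx.interpret("outer_door_open"))  # alarm: unexpected door opening
print(ctx.interpret("inner_door_open"))  # intrusion: someone is moving through the space
```

The same primary event ("a door opened") produces three different interpretations depending on state and history, which is the whole point.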

I’ve used a simple example of state and correlation here, but in the real world both are likely to be much more complicated.  There’s already a software discipline called “complex event processing” that reflects just how many different events might have to be correlated to do something useful.  We also see complex state notions, particularly in the area of things like service lifecycle automation.  A service with a dozen components is actually a dozen state machines, each driven by events, and each potentially generating events to influence the behavior of other machines.
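The dozen-state-machines point can be sketched in a few lines. This toy model (all names invented for illustration, not drawn from any real orchestration product) shows one component's event-driven transition generating events that in turn drive the other components' state machines.

```python
# A toy rendering of "a service is a dozen state machines": each component
# reacts to events and may emit events that influence the others.

class Component:
    def __init__(self, name, bus):
        self.name, self.state, self.bus = name, "active", bus

    def on_event(self, event):
        if event == "trunk_failed" and self.state == "active":
            self.state = "degraded"
            # One machine's transition generates an event for the others.
            self.bus.publish(f"{self.name}_degraded", exclude=self.name)
        elif event.endswith("_degraded") and self.state == "active":
            self.state = "rerouting"

class Bus:
    def __init__(self):
        self.components = {}
    def register(self, comp):
        self.components[comp.name] = comp
    def publish(self, event, exclude=None):
        for name, comp in self.components.items():
            if name != exclude:
                comp.on_event(event)

bus = Bus()
comps = [Component(f"c{i}", bus) for i in range(3)]
for c in comps:
    bus.register(c)
comps[0].on_event("trunk_failed")
print([c.state for c in comps])  # ['degraded', 'rerouting', 'rerouting']
```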

Another complicating factor is that both state and correlation are, so to speak, in the eye of the beholder.  An event is, in a complete processing sense, a complex topological map that links the primary or shallow event to a series of chains of processing.  What those chains involve will depend on the goal of the user.  A traffic light change in Manhattan, for example, may be relevant to someone nearby, but less so to someone in Brooklyn and not at all to someone in LA.  A major traffic jam at the same point might have relevance to our Brooklyn user if they’re headed to Manhattan, or even to LA people who might be expecting someone who lives in Manhattan to make a flight to visit them.  The point is that the things that matter will depend on who they matter to, and the range of events and nature of processes have that same dependency.

When you look at the notion of what I will call “public IoT”, where there are sensor-driven processes that are available to use as event sources to a large number of user applications, there is clearly an additional dimension of scale and distribution of events at scale.  Everyone can’t be reading the value of a simple sensor or you’d have the equivalent of a denial-of-service attack.  In addition, primary events (as I’ve said) need interpretation, and it makes little sense to have thousands of users do the same interpretation and correlation.  More sensible to have a process do the heavy lifting, and dispense the digested data as an event.  Thus, there’s also the explicit need for secondary events, events generated by the correlation and interpretation of primary events.
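Here's a hedged sketch of that digestion idea: one process absorbs the flood of primary sensor events and fans out a single interpreted secondary event to its subscribers, so every consumer isn't polling the sensor directly. The thresholding logic is invented purely for illustration.

```python
# Primary events: raw, per-sensor, high volume. Secondary events:
# digested, low volume, and a sensible unit for security and policy.

class Digester:
    def __init__(self, threshold=3):
        self.count = 0
        self.threshold = threshold
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def on_primary(self, reading):
        # Correlate consecutive motion readings rather than forwarding each one.
        self.count = self.count + 1 if reading == "motion" else 0
        if self.count == self.threshold:
            # Publish one secondary event to all subscribers.
            for cb in self.subscribers:
                cb("sustained_motion_detected")

received = []
d = Digester()
d.subscribe(received.append)
for r in ["motion", "none", "motion", "motion", "motion"]:
    d.on_primary(r)
print(received)  # ['sustained_motion_detected']
```

Five primary readings produce a single secondary event; scale that to thousands of users and the argument for digestion makes itself.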

If we could look at an event-driven system from above, with some kind of special vision that let us see events flying like little glowing balls, what we’d likely see in most event-driven systems is something like nuclear fission.  A primary neutron (a “shallow event”) is generated from outside, and it collides with a process near the edge to generate secondary neutrons, which in turn generate even more when they collide with processes.  These are the “deep events”, and it’s our ability to turn shallow events from cheap sensors into deep events that can be secured and policy managed that determines whether we could make event-driven systems match goals and public policies at the same time.

What happens in a reactor if a bunch of control rods are pushed into the core?  Neutrons are absorbed before they hit their targets, and so we have a slow decay into the steady state of “nothing happening”.  In an event system, we need a unified process-and-connection fabric in which events can collide with processes with a minimum of experienced delay and loss.

To make event-driven systems work, you have to be able to do primary processing of the shallow events near the edge, because otherwise the control loop needed to generate feedback in response to events can get way too long.  That suggests that you have a primary process that’s hosted at the edge, which is what drives the notion of edge computing.  Either enterprises have to offer local-edge hosting of event processes in a way that coordinates with the handling of deeper events, or cloud providers have to push their hosting closer to the user point of event generation.

A complicating factor here is that we could visualize the real world as a continuing flood of primary, shallow events.  Presumably, various processes doing correlation and analysis, and then distributing secondary “deeper” events, would create triggers to other processes.  Where does this happen?  The trite response is “where it’s important”, which means anywhere at all.  Cisco’s fog term might have more of a marketing genesis than a practical one, but it’s a good definition for the processing conditions we’re describing.  Little islands of hosting, widely distributed and highly interconnected, seem the best model for an event-driven system.  Since event processing is so linked with human behavior, we must assume that all this islands-of-hosting stuff would shift about as interests and needs changed.  It’s really about building a compute-and-network fabric that lets you run stuff where it’s needed, no matter where that happens to be, and change it in a heartbeat.

Some in the industry may have grasped this years ago.  I recall asking a Tier One exec where he thought his company would site cloud data centers.  With a smile, he said “Anywhere we have real estate!”  If the future of event processing is the “fog”, then the people with the best shot at controlling it are those with a lot of real estate to exploit.  Obviously, network operators could install stuff in their central offices and even in remote vault locations.  Could someone like Amazon stick server farms in Whole Food locations?  Could happen.

Real estate, meaning hosting locations, is a big piece of the “who wins?” puzzle.  Anyone can run out and rent space, but somebody who has real estate in place and can exploit it at little or no marginal cost is clearly going to be able to move faster and further.  If that real estate is already networked, so much the better.  If fogs and edges mean a move out of the single central data center, it’s a move more easily made by companies that have facilities ready to move to.

That’s because our fog takes more than hosting.  You would need your locations to be “highly interconnective”, meaning supported by high-capacity, low-latency communications.  In most cases, that would mean fiber optics.  So, our hypothetical Amazon exploitation of Whole Foods hosting would also require a lot of interconnection among the facilities.  Not to mention, of course, an event-driven middleware suite.  Amazon is obviously working on that, and may also have plans to supply the needed connectivity; they’re perhaps the furthest along of anyone in defining an overall architecture.

My attempts to model all of this show some things that are new and some that are surprisingly traditional.  The big issue in deciding the nature of the future compute/network fabric is the demand density of the geography, roughly equivalent to the GDP per square mile.  Where demand density is high, the future fabric would spread fairly evenly over the whole geography, creating a truly seamless virtual hosting framework that’s literally everywhere.  Where it’s low, you would have primary event processing distributed thinly, then an “event backhaul” to a small number of processing points.  There’s not enough revenue potential for something denser.

This is all going to happen, in my view, but it’s also going to take time.  The model says that by 2030, we’ll see significant distribution of hosting toward the edge, generating perhaps a hundred thousand incremental data centers.  In a decade, that number could double, but it’s very difficult to look that far out.  And the software/middleware needed for this?  That’s anyone’s guess at this point.  The esoteric issues of event-friendly architecture aren’t being discussed much, and even less often in language that the public can absorb.  Expect that to change, though.  The trend, in the long term to be sure, seems unstoppable.

Clouds, Edges, Fog, and Deep and Shallow Events

What is an “edge” and what is a “cloud”?  Those questions might have seemed surpassingly dumb even a year ago, but today we’re seeing growing confusion on the topic of future computing and the reason comes down to this pair of questions.  As usual, there’s a fair measure of hype involved in the debate, but there are also important, substantive, issues to address.  In fact, the details that underpin this question might decide the balance of power in future computing.

The popular definition of “cloud computing” is utility computing services that offer virtual hosting on a public or private data center.  In practice, though not as a real requirement, cloud data centers for public providers tend to be in a small number of central locations.  To put this in an easier vernacular, cloud computing resources are virtual, a kind of hosting-capability black box.  You don’t know, or care, where they are unless your cloud service includes specific regional/location affinities.

“Edge” computing is actually a less concrete concept, because it raises the obvious question “Edge of what?”  Again, we have to fall back on popular usage and say that edge computing is computing at the edge of the network, proximate to the point of user connection.  Cisco uses the term “fog computing” to be roughly synonymous with “edge computing”, but in my view, there could be a difference.  I think that the “fog” notion implies that edge resources are both distributed and logically collectivized, creating a kind of edge-hosted cloud.  In contrast, I think that some “edge computing” might rely on discrete resources deployed at the edge, not collected into any sort of pool.  Whether you like either term or my interpretation, I’m going to use this definition set here!

The question of whether a given “cloud” should be a “fog” is probably the most relevant of the questions that could arise out of our definitions.  Resource pools are more economical than fixed resources because you share them, just as packet trunks are more economical than TDM trunks.  The concept of resource-sharing is based on the principle of resource equivalence.  If you have a bunch of hosts, any of which will do, one can be selected to optimize operations and capital costs.  Greater efficiency, in other words, and efficiency grows with the size of the pool.  If your resources are different in a way that relates to how they might be used, then you might not have the full complement of resources available to suit a given need.  That means a smaller pool, and lower efficiency.
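The efficiency-grows-with-the-pool point is the classic trunking argument, and the Erlang B blocking formula makes it concrete: at the same offered load per server, one big shared pool blocks far less often than the same capacity split into separate small pools.

```python
# Erlang B via the standard recurrence: B(E, 0) = 1,
# B(E, m) = E*B(E, m-1) / (m + E*B(E, m-1)).

def erlang_b(offered_load: float, servers: int) -> float:
    """Blocking probability for `offered_load` erlangs offered to `servers` servers."""
    b = 1.0
    for m in range(1, servers + 1):
        b = (offered_load * b) / (m + offered_load * b)
    return b

# 10 erlangs on one shared pool of 12 servers, vs. the same total load
# split into two independent pools of 6 servers carrying 5 erlangs each.
big_pool = erlang_b(10, 12)
small_pool = erlang_b(5, 6)
print(f"big shared pool blocking:  {big_pool:.3f}")   # ~0.120
print(f"small split pool blocking: {small_pool:.3f}") # ~0.192
assert big_pool < small_pool  # the larger shared pool is more efficient
```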

This is why we’ve tended to see cloud providers build a small number of massive data centers.  That breeds lower cost (and greater profit) but it also creates a problem with transit time.  An event or transaction that enters the network at a random point might be transported thousands of miles to reach processing resources.  That distance might include multiple nodes and trunks, introducing handling delays.  In the end, we could be talking tenths of a second in delay.

That might not sound like much, but if we presume that we have a 0.2 second round-trip delay and we’re controlling something like a turnstile, a vehicle at 60 mph would travel about 18 feet in that amount of time.  The point is that for applications like process control, a transit delay can be a true killer, and if we could have moved our processing resource to the network edge, we could significantly reduce that delay.
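The arithmetic there is easy to verify:

```python
# Distance traveled during a round-trip control delay, at highway speed.

def distance_during_delay(mph: float, delay_s: float) -> float:
    feet_per_second = mph * 5280 / 3600   # 60 mph = 88 ft/s
    return feet_per_second * delay_s

print(round(distance_during_delay(60, 0.2), 1))  # 17.6 feet, roughly 18
```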

The Internet of Things, to the extent and at the pace that it becomes relevant, is the logical on-ramp for applications sensitive to transit delay.  That means that centralized cloud data centers become problematic with higher rates of IoT adoption, and that the trend would likely encourage the migrating of hosting of IoT processes to the edge.  That’s the trend that actually divides the broader term “edge” computing from what I think is Cisco’s “fog” value proposition.

Cloud providers like Amazon support IoT event processing already, through their functional/serverless or “Lambda” offering.  Obviously, Amazon isn’t about to push out thousands of data centers to the network edge in a field-of-dreams hope of attracting IoT opportunities.  Instead, what they’ve proposed to do is allow their event processing to migrate out of the network and onto customer premises, using customer-provided resources and Amazon software, called “Greengrass”.  Edge computing, provided by the customer, is used to extend the cloud, but in a specific, facility-linked, way.  You can’t think of this kind of edge hosting as a general “fog” because the elements that are distributed are tasked with serving only the general location where they’re found.  You could just as easily say they were remote servers.  If you remember the notion of “server consolidation” in the cloud, think of Greengrass as “server un-consolidation”.  The cloud comes to you, and lives within you.

IoT promotes edge computing, but it doesn’t limit the source of the compute power or whether or not it is collected into the “fog/cloud”.  We need to have at least some event processing collected in proximity to the event source, and from there the “inward” requirements won’t be much different from those of traditional transaction processing.  Enterprises could do on-prem event processing or cede it to the cloud provider.

The “some event processing” point may be critical here.  Event processing isn’t a simple, single-point task; it’s a series of interlocking processes, primary and secondary events, correlations, and distributions.  Eventually it could feed back to traditional transaction processing, and even reshape much of that, and the IT elements that support it, along the way.  We have “shallow” events, primary events, close to the surface or edge, and “deep” events that could reside in the traditional more centralized cloud, or the data center.  Wherever we have the events, they’re a part of that interlocking process set, and so somebody who wins in one place might well, lacking competition, win in them all.

The cloud providers clearly know this, and they’ve taken the lead in defining event-driven compute models.  Nearly fifty enterprises have told me that all three major cloud providers are further along in their thinking on event processing than their software vendors are, and nearly all say the cloud providers are further along than the enterprises themselves.  However, the former assertion isn’t true.  Vendors are prepared to offer the enterprise just as enlightened an event architecture as the cloud providers could.  They’re just not doing the job, either because they fear lengthening their sales cycle or because the sales organization doesn’t understand event processing.

This is the source of the cloud/edge/fog battles we’ll see fought this year.  If public cloud providers can offer a strong architecture for “deep events” then they can use the Greengrass strategy, or in the case of Microsoft simply support Windows Server local hosting, to make the edge an extension of the cloud and support “shallow events”.  They can then extend their own hosting closer to the edge over time, confident they have grabbed control of event processing.

If they don’t, then eventually the right architecture for both deep and shallow events will be created by software professionals building middleware tools, and this could be used to build a data-center-driven event processing model that wouldn’t use the cloud any more than current transactional or web/mobile applications would.  The event cloud could be cut off at the knees.

So far, things are on the side of the cloud providers.  As I noted earlier, enterprises aren’t hearing much about event processing from their software vendors, only from their cloud resources.  As we evolve toward event-driven computing, with or without the aid of IoT, we can expect it to become the singular focus of technical modernization of IT infrastructure.  That’s a very important win…for somebody.

So, Will Nationalized 5G Save Us?

5G may be the darling of the networking media, but it has profound technical and economic issues.  The standards aren’t yet done, there are questions about how some of the proposed features would be implemented, and there’s the overriding question of whether there will be sufficient return on investment for operators.  Those complications are daunting, and now we have a story that the Administration is looking at building a “nationalized” 5G network.  The Administration denies the story, but suppose it’s true.  How bad can this get?  Plenty.

The apparent issue behind the story is the fear of some in Washington that Chinese equipment vendors like Huawei (AT&T and Verizon have also said they won’t sell Huawei phones) might build in network vulnerabilities that would threaten 5G network services and security.  They imagine, in particular, a kind of Chinese wave of IoT takeovers, where infrastructure is held hostage because everything is “on the Internet”.  The question is first whether there’s a real risk here, and second whether having a nationalized 5G infrastructure is the way to address whatever risk there is.

We obviously have networks today that are vulnerable to hacking.  The presumption of proponents of the nationalized approach is that 5G is different, for two reasons.  First, it’s a massive rebuild that would offer more opportunities for hostile powers to introduce back-door vulnerabilities because there’s more new stuff being done.  Second, the 5G networks will support what one media story called “legions of data-hungry IoT devices”, which would let intruders diddle with our thermostats, TVs, power plants, and other critical stuff.  Let’s start with these two points and try to assess the real situation.

It is true that 5G would involve a lot of new gear, and also generate a greater dependency on software-defined features and network elements.  These could in fact be equipped with back doors that could allow someone access to control and management elements, putting the system overall at risk.  But remember that it’s not risk per se that we have to assess, it’s incremental risk.  It does no good to protect yourself from Thing A when you’ve let Thing B pass, and it presents a greater risk.

China already makes a lot of network hardware, and most other countries buy it without much concern over planted back-door vulnerabilities.  The US has not been a major buyer of China’s telecom gear up to now, and the government pressure to ensure that’s the case could just as easily be applied to 5G.  Remember the AT&T and Verizon decision to stop selling phones from Huawei?  You don’t need to nationalize 5G to apply pressure on network operators to avoid vendors the government believes pose a risk; we didn’t need it with 3G or 4G.

Then there’s the fact that we already know that we have to secure access to the control and management planes of devices and software.  It usually takes two things to create an exploit.  One is the back-door or vulnerability, and the other is access to it.  What network operator would not secure their critical elements?

OK, let’s move to the next point, which is the incredible risk posed by all those data-hungry IoT devices.  Look around your home.  How hungry is your smart thermostat or your security sensor?  Not only is it probably not very hungry at all, it really doesn’t need the higher data rates of 5G.  Further, your thermostats and sensors aren’t “on the Internet”, they’re on your home WiFi, so you don’t need 5G to connect them.  The point is that it’s far from clear that we’re going to have a legion of new IoT devices on the Internet.  If we do, we already know that there are going to be massive security and DDoS issues with them, whoever provides the 5G network.  In fact, the vulnerabilities of these devices at the IP level likely dwarf the risks at the cellular technology level.

What about the need for addressing those billions of new gadgets?  Isn’t that a 5G problem?  No, for two reasons.  First, as I said, you probably don’t use public IP addressing for the great majority of IoT devices.  Your home is networked using WiFi in almost all cases, remember, and the network is based on a private IP address from one of three address spaces defined by RFC 1918.  One space is a Class A address space with over 16 million available addresses, one a Class B space with over a million, and the last a Class C space (the one used by most homes) that offers about 65,000 addresses.  Every home could have its own address space, with over 16 million devices on it, without going to IPv6.  Second, you could use IPv6 with 4G if you needed to.
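Those address-space sizes are easy to confirm with Python's standard ipaddress module:

```python
# The RFC 1918 private address spaces cited in the text, computed directly.
import ipaddress

spaces = {
    "10.0.0.0/8":     ipaddress.ip_network("10.0.0.0/8"),      # "Class A" private space
    "172.16.0.0/12":  ipaddress.ip_network("172.16.0.0/12"),   # "Class B" private space
    "192.168.0.0/16": ipaddress.ip_network("192.168.0.0/16"),  # "Class C" private space
}
for name, net in spaces.items():
    print(f"{name}: {net.num_addresses:,} addresses")
# 10.0.0.0/8: 16,777,216 addresses
# 172.16.0.0/12: 1,048,576 addresses
# 192.168.0.0/16: 65,536 addresses
```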

There is a value to 5G of course, most obviously in competitive marketing of wireless and also in improving capacity and performance even for traditional mobile users.  It’s also valuable in millimeter wave form to extend fiber-to-the-node, replacing copper loop.  Are these enough to drive 5G deployment alone?  Sure, eventually, but not like some onrush of those data-hungry IoT devices (which, as I’ve said, don’t really exist).  Without some dramatic driver, 5G is more an evolution than a revolution, and there won’t be a new refresh of the entire wireless world.  That means no more risk of back-door intrusion being introduced than we’ve already been facing.

There’s another point we have to consider, which is whether it’s feasible to create nationalized 5G infrastructure, forgetting the Chinese connection for the moment.  Go back perhaps three decades and you’d find telecom networking largely consisting of regulated monopolies and government-owned facilities, usually under what was known in Europe as “Postal, Telegraph, and Telephone” or PTT bodies.  We moved to privatize all of that in the ‘90s.  Was it wise?  Perhaps or perhaps not, but it was done.  Since then at least one country, Australia, has attempted (for broadband Internet) a return to something like the nationalized 5G concept.  NBN (the National Broadband Network) hasn’t been a rousing success for Australia, as a search of the concept will show.

If the US Government were to build out an NBN-like utility 5G network and then lease capacity to operators, it would jump-start 5G and benefit equipment vendors (presumably other than the Chinese).  But think about this point a moment, and you’ll see the flaw.  The government would have to do it, not just propose it.  We don’t have a recent history of vast Congressional successes.  It might take years for something like this to get approved.  What happens in the meantime?  Nothing.

That’s right, absolutely, literally, nothing.  No current operator would invest in 5G thinking the Government was going to step in and compete or take their efforts and investments over.  If you think 5G is moving too slowly now, wait and see how a nationalized 5G plan would impact deployment.  Glaciers would look like space ships by comparison, and by the way, isn’t space now getting privatized?  Maybe there’s a lesson here.

Software Architecture and Implementation in Zero-Touch Automation

I know that I’ve saved the hardest of the zero-touch issues for last.  The technical or software architecture almost has to be considered at the end, because you can’t talk about it without framing scope and functionality.  Those who read my last blog know that I touched on technical issues with the comment that a concurrent, model-driven, scalable process implementation of zero-touch was critical.  Here I’ll expand on that point, and raise other structural questions.

I want you to imagine for a moment the network and customer base of a large Tier One.  They have millions of customers, tens of millions of services, and tens of thousands of different offices and vaults where equipment is installed.  A typical service is a cooperative relationship that could easily require thousands of different devices and trunks.  And that’s without software virtualization and function hosting!

Every day, every hour or minute, something is happening in such an environment.  Worse yet, there are many critical and common elements in the infrastructure, things that when broken will impact many services and many customers.  Anyone who’s ever run a network operations center or even visited a NOC regularly will recognize the problems of a fault avalanche.  One thing triggers thousands of alarms, which of course means it triggers thousands of events.  It’s then up to the operator to remedy things.

Or it was.  Now it’s up to zero-touch automation, and our problems can be traced from the origin of that critical common fault to its ultimate remediation.

Suppose we have a big trunk failure.  It impacts thousands of services, but which ones are impacted?  The first question in any kind of automated response to lifecycle conditions is how you associate a condition to a lifecycle.  Some say you don’t.  The traditional approach to facility problems is to forget about service-level remediation and focus on “facility remediation.”  If Trunk A broke, either fix it or replace it, and if you do that you can put service Humpty back together again.  Stated without the nursery rhyme association, this approach says that you focus on resource management and forget services.

That doesn’t always work, of course.  Fixing Trunk A might require hours or even days, presuming some “cable-seeking backhoe” got into the mix.  Replacing it might or might not be possible, depending on the capacity Trunk A had, the residual capacity elsewhere, and the ability of that residual to meet the SLAs for the impacted services.  However, a replacement for Trunk A, if Trunk A was explicitly committed, is going to require that it be “uncommitted” and that whatever replaces it be connected instead.  The replacement might be a simple parallel trunk or, more likely, a more devious path through new nodal points, transiting new facilities.  That’s what we have to be able to automate.

There are five steps to zero-touch effectiveness.  One, awareness of the fault itself, which might come through direct management reports or through analytics.  Two, the correlation of the fault with impacted services so that per-service policies can be applied.  Three, the interpretation of the fault in the current service context, so appropriate responses can be mediated.  Four, the execution of the appropriate processes, and five, the analysis of the response of the remediation and, if necessary, the report of an unresolved problem to some higher-level process.
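As a sketch, the five steps might hang together like this; every function here is a trivial stand-in for what would be a substantial subsystem in real zero-touch software, and all the names are invented for illustration.

```python
# Steps: 1 awareness, 2 correlation, 3 interpretation, 4 execution, 5 analysis.

def zero_touch_pipeline(fault, service_map, contexts, handlers, escalate):
    # 1. Awareness: the fault report arrives (direct report or analytics).
    # 2. Correlation: find the services the faulted resource supports.
    impacted = service_map.get(fault["resource"], [])
    results = {}
    for svc in impacted:
        # 3. Interpretation: read the fault in this service's current context.
        severity = "critical" if contexts[svc]["sla"] == "strict" else "minor"
        # 4. Execution: run the appropriate remediation process.
        ok = handlers[severity](svc, fault)
        # 5. Analysis: report unresolved problems to a higher-level process.
        results[svc] = "remediated" if ok else escalate(svc, fault)
    return results

service_map = {"trunk_a": ["svc1", "svc2"]}
contexts = {"svc1": {"sla": "strict"}, "svc2": {"sla": "relaxed"}}
handlers = {"critical": lambda s, f: False, "minor": lambda s, f: True}
escalate = lambda s, f: "escalated"
print(zero_touch_pipeline({"resource": "trunk_a"}, service_map, contexts, handlers, escalate))
# {'svc1': 'escalated', 'svc2': 'remediated'}
```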

We’re assuming we have a fault report, so let’s start with our second point.  We need to know what services were impacted, and there are two ways that could happen.  We could learn about the trunk failure and have a correlation scheme whereby we know what services were assigned to it.  Or, we could wait for each service to recognize the fault in some way.  In either case, we have to prepare the mechanism.  Do we have fault correlation?  Do we have service-level awareness of the state of lower-level resources?  That’s going to depend on our architecture.

But the worst is yet to come.  We have to address our fourth and fifth points, the execution of appropriate service processes in response to conditions/events, and the “kicking upstairs” of problems that can’t be handled within the black box of an intent model.  We also have to do this at scale.  Remember that, by one means or another, we learned about a fault in a thousand services.  All of these services will now have to be reprocessed to reflect the loss of that trunk, and that has both a functional and technical dimension.

Functionally, we need to be able to remediate a trunk loss at a level of service assessment where alternatives are defined and the original selection is made.  That means that whatever process selected the trunk now has to respond to its loss.  If it can’t be done at that level, we have to kick the issue upstairs, so to speak.  That requires maintaining an understanding of how the service composition decisions were made, which means that either we had to have followed a specific model to build the service, or recorded what we actually did.  We then have to be sure that we don’t make the same bad trunk decision again, either with the specific service component that reported the fault or with other components that might have relied on that same trunk.  Remember that the fault could have impacted multiple components of the same service set, and that it might be propagating up a tree of models in several places at once.

The technical side of this is that if there is only one place where we can address service faults, one string of logic, then we have to expect that hundreds, thousands, of events will end up being queued up for handling.  Worse, some of those events may well impact the same service sets, which means that we could in theory handle something by taking an action that a yet-to-be-processed event would invalidate.

What you need here is the notion of spawning a handling process for each event, when the event occurs.  Yes, you may have to apply some mechanism to accommodate the fact that hosting resources for these processes may be limited, but within those constraints you’re better off launching something per service.  That means that the intent model hierarchy for that service has to be viewed as a distributed-state database that somehow gives every process the information it needs to run, remediate, or pass on its own fault as an event.
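A minimal sketch of per-event handler spawning under a resource cap, using a bounded thread pool as a stand-in for whatever hosting constraint applies (all names invented):

```python
# 1000 services are hit by the same trunk fault; each gets its own handler,
# but the pool size caps how many run at once.
from concurrent.futures import ThreadPoolExecutor

def handle_event(service_id, event):
    # Each service's handler would consult the shared (distributed-state)
    # model; here it just records its remediation decision.
    return (service_id, f"remediated:{event}")

events = [(f"svc{i}", "trunk_a_down") for i in range(1000)]

# max_workers models the limited hosting resources the text mentions.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda args: handle_event(*args), events))

print(len(results))   # 1000 events handled concurrently, not queued serially
print(results[0])     # ('svc0', 'remediated:trunk_a_down')
```

The alternative, one serialized handler, would leave 999 events waiting behind the first, which is exactly the queuing problem described above.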

Dynamic spawning of a process from a model-mediated state/event intersection is critical in developing agile, scalable, zero-touch software.  It’s also a significant technology/architecture challenge.  In order for an event to spawn a process, there has to be an association between the two, similar to that which players like Amazon, Microsoft, and Google offer in the triggers for their functional computing services.  The presumption, then, would be that each model element had a “base process” that was kicked off, and that in turn would activate other processes based on state and event.  However, for that to happen the event would have to be associated with the element, not just the service or the fault.
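To make the event-to-process association concrete, here’s a minimal sketch in the spirit of those functional-computing triggers.  Everything in it—the table name, the states, the event names—is an illustrative assumption, not any real platform’s API: the point is simply that a (state, event) intersection selects a handler, which is spawned on demand rather than running continuously.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical trigger table: a model element's (state, event) intersection
# selects the handler process to spawn.  Names are invented for illustration.
TRIGGERS = {
    ("active", "trunk-fault"): lambda elem: f"{elem}: remediating trunk fault",
    ("deploying", "deploy-ok"): lambda elem: f"{elem}: now active",
}

pool = ThreadPoolExecutor(max_workers=4)

def on_event(element, state, event):
    """Spawn a handler for this element's state/event pair.  Returning None
    models an unhandled condition that must be 'kicked upstairs'."""
    handler = TRIGGERS.get((state, event))
    if handler is None:
        return None
    return pool.submit(handler, element)   # per-event process, spawned on demand

future = on_event("access-leg-1", "active", "trunk-fault")
print(future.result())
```

Because the handler is bound to the element’s state/event pair rather than to the service as a whole, the same table can drive any number of concurrent handler instances.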

If we go back to our notion of Trunk A faults, a Trunk A event might activate a “Trunk A distributor process” that knew what services depended on Trunk A.  That process might then kick off a service-specific process that would have to associate the fault with the correct element.  Might it pass it down the chain?  Implementations of various types are possible.  It would also be possible to have the Trunk A distributor process “know” the specific service model element that was dependent on the trunk, which would make the intermediate service-model process mediation of the event unnecessary.
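A “Trunk A distributor” of the kind just described could be sketched as below.  The dependency map, the element names, and the delivery callback are all assumptions made for illustration; the essential idea is just the fan-out from one resource fault to every dependent service-model element.

```python
# Hypothetical dependency map: which service-model elements rely on each trunk.
DEPENDENTS = {
    "trunk-a": ["svc-001/core-path", "svc-002/core-path", "svc-007/backup-path"],
}

def distribute_fault(trunk, deliver):
    """Fan a trunk fault out to each dependent service-model element.
    `deliver` stands in for whatever event transport the implementation uses."""
    for element in DEPENDENTS.get(trunk, []):
        deliver(element, {"event": "trunk-fault", "trunk": trunk})

# Collect the fan-out instead of actually dispatching, to show the effect:
delivered = []
distribute_fault("trunk-a", lambda elem, evt: delivered.append((elem, evt["event"])))
print(delivered)
```

Whether the distributor targets a per-service process or the specific dependent model element is exactly the implementation choice the paragraph above describes; only the `deliver` target changes.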

So far, we’ve been considering the “primary events” of conditions from outside the service itself.  That would include resource/service faults detected in a variety of ways, but also “administrative events” generated by operations personnel or even by customers making service changes.  There are also “internal events” to consider.  When a single element in a complex intent-modeled service structure wants to report a condition that it has generated, perhaps as the result of processing some other event, it has to be able to generate an event to wake up another higher-level (or lower-level) process.

Primary events would typically be generated either at the top of the model hierarchy (administrative events) or at the bottom (resource conditions).  The rest of the events would be generated internally, and represent the passing of requirements down the chain to subordinates, or of issues up the chain to superiors.  Overall, this complex set of event flows is characteristic of asynchronous processes, and you either have to handle them that way or you have to end up creating some serialized single-process solution that could leave stuff waiting for a response for a very long time.  Doing the former means having dynamic processes associated not with zero-touch automation overall, but to each service.

I experimented with something like this in my original ExperiaSphere project, launched in response to some operator concerns about how the TMF’s Service Delivery Framework concept could be implemented.  My thought was that there was a “service factory” that spun out models for ordering, and any such factory could handle all of the events for any of the models it was associated with.  The service order data carried all the state and structure information needed.  In my CloudNFV project, EnterpriseWeb used their broader architecture to provide a kind of event fabric that also conveyed stored state and structure to a process that could be spun up as needed.

Ultimately, it should be obvious that zero-touch automation is an event-driven process and that it will therefore require event-driven processing.  Stateless elements with stored/distributed state coordination can be scaled easily and replaced easily.  They can also be made to address the complexities of concurrency, the need to support events as they happen even when they happen in floods.  Functional computing, of the kind Amazon, Microsoft, and Google have already introduced, is going to be a key piece of the implementation of zero-touch automation, or it’s not going to work at scale.

So, this frames the industry dilemma.  There is no question in my mind that only ONAP has the credibility and scope to be able to touch all the essential pieces of service lifecycle automation.  Absent that scope/scale, you can’t generate enough benefits to drive deployment.  But like the ETSI ISG, the ONAP people have presented a picture of their architecture that could be interpreted as being what I’ve called a linear process and not a model-driven state/event process supported by stateless elements.  Is that what the software is?  Remember that the ETSI picture was initially described as a functional model, which means that it didn’t have to be implemented exactly as shown.  What broke that down was a combination of literal interpretation of the diagram by some, and the fact that when you describe specific interfaces between components you implicitly define those components.  Remember black boxes?

I hope to have some further discussions with the ONAP people on the way their software is really built.  Once we know how it was done, we’d know whether it can be made stateless and event-driven to the extent needed to make it scalable.  While everything in zero-touch automation could be done with a state/event-model-driven stateless-process combination, it doesn’t all have to be.  You could, for example, “eventify” the critical virtualization piece and have that handle most events.  The interface to the less-used processes, like those of OSS/BSS, could remain traditional event-queue based.

This is the last of my series on zero-touch automation issues and architectures.  What does all this prove?  In my view, it proves that we have to start thinking about networks in a totally different way, for several reasons.  First, zero-touch software is a prime example of the new model of event-driven systems, and that’s not how standards people and even many software people think about it.  Second, this is a complicated space, and finding even one solution, much less a competitive basket of them, is going to be difficult.  It cries out for open-source.  Finally, we are never going to get useful coverage of this topic if everyone in the media/analyst space relies on vendor press releases and “sponsored” papers.  We need to figure out a way of exploring the issues in an open forum without either limiting the details the dialog can raise, or contaminating the process with special interests.

Operators, making that happen is up to you.

Why the Functional Model of a Zero-Touch Solution is Important

Scope of impact is critical for the success of zero-touch automation, whether we’re talking about the general case of managing application/service lifecycles or the specific case of supporting network operator transformation.  There’s a lot of stuff to be done in deploying and sustaining something useful, and the more pieces there are involved the more expensive and risky the lifecycle processes are.  If you grab a low apple, you may not get more than a tiny bite of the pie.

While scope of impact is important, it’s not the only thing that matters.  The second zero-touch automation issue of the trio of issues I promised to address is the functional model used by the software solution.  This is the set of issues or factors that determine just what a zero-touch solution can really do, because it determines what it really knows.

Zero-touch automation, as applied to network service lifecycle automation, involves two things.  First, the knowledge of the actual lifecycle, as a stepwise progression from nothing to something that’s running and being used and paid for.  Second, a knowledge of changes in conditions (events) that impact in some way the status of the service.  In these terms, zero-touch automation is the execution of service processes in response to events, filtered by the combination of the current lifecycle state and the “goal state”.

The popular vision of zero-touch implementation is a kind of Isaac Asimov-ish “I, Robot” or Star Wars R2D2 concept, where the implementation is endowed with human intelligence and understands the goals and the things happening.  Perhaps we’ll get to that, but right now we can’t ask R2 to do something to run our networks, so we have to look at more pragmatic approaches.

The software DevOps world has given us two basic ways of thinking of automating software-related tasks.  One, the prescriptive model, says that you do specific things in response to events.  This is what a human operator in an operations center might do—handle things that are happening.  The other is the descriptive model, which says that there is a goal state, a current state, and a perhaps-implicit event, and the combination indicates what has to be done to get from where you are to where you want to be.  In networking, another concept seems to have taken pride of place in current thinking—intent modeling.

In intent model systems, a functional element is a black box, something that is known by its properties from the outside rather than by what’s inside.  The properties describe what the functional element is supposed to do/offer (its “intent”).  In common practice, intent models are nested to reflect the structure of a service.  You might have a “service” model at the top, below which are “high-level-function” models, each of which are successively decomposed into lower-level stuff until you reach the point where a functional element is actually implemented/deployed.
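The nesting described above can be sketched as a simple data structure.  The element names and the two-level decomposition below are invented for illustration; the point is that each black box exposes only its name and intent, holds its subordinates, and that deployment happens at the leaves.

```python
from dataclasses import dataclass, field

@dataclass
class IntentModel:
    """A black-box element: known only by its externally visible intent."""
    name: str
    intent: str
    children: list = field(default_factory=list)

    def leaves(self):
        """The bottom elements, where implementation/deployment actually happens."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

# A hypothetical service decomposed into high-level functions and then leaves:
service = IntentModel("vpn-service", "site-to-site VPN", [
    IntentModel("access", "connect customer sites", [
        IntentModel("access-leg-east", "attach east site"),
        IntentModel("access-leg-west", "attach west site"),
    ]),
    IntentModel("core", "transport between access points"),
])
print([m.name for m in service.leaves()])
```

Nothing outside an element can see below it; a different decomposition of “access” would leave the service-level view completely unchanged, which is the whole value of the black box.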

There are two properties of intent models that are important to zero-touch automation.  The first property is that each intent model is responsible for fulfilling its intent or notifying the superior element of a failure to comply.  What’s inside is invisible, so nothing outside is able to manage it.  The second property is that all implementations of a given intent model are by definition equal and compatible.  If you can’t see inside the black box you can’t tell what’s different or the same.  That means that all implementations of a given intent model relate to the rest of the service the same way, and are managed the same way.

The nice thing about intent models is that you can stick either prescriptive or descriptive behavior inside them, as long as they do behave as promised or report their flaws.  Arguably the approach of intent modeling is a better high-level way of looking at services or application lifecycles because of that.  It also means that whatever DevOps tools might be available, they can be exploited in the implementation of intent models.  Hey, it’s a black box.

The missing link here is that while what goes on inside an intent model is invisible, what gets outside one cannot be.  Remember that a given model has to either do its job or report its failure.  The “to whom?” question is answerable fairly easily—to whatever model contains it, or to the service/application process that spawned the model tree if it’s the top model in such a tree.  The question of what the report is, and does, is more complicated.

We obviously can’t say “X broke; too bad!” and let it go.  The event that “X broke” is then up to the superior object to X to handle.  Since the conditions of all its subordinate objects are asynchronous with respect to each other (and everything else in the hierarchy), the most convenient way to address the event interpretation is via the current-state-and-event model.  When a service is ordered, it might move from the “Orderable” to the “Deploying” state.  The superior service intent model element might then send an event to its subordinates, “Ordering” them as well, and moving them to “Deploying”.  Eventually, if all goes well, the bottom objects would reach the “Active” state, and that would pass up the line, with each superior object reporting “Active” when all its subordinates are active.

If a fault occurs somewhere, the intent model that went bad would report “Fault” to its superior and enter a “Fault” state.  The superior object then has the option of redeploying the failed element, rethinking its whole decomposition into subordinates based on the failure (redeploying them all), or reporting “Fault” up the line.  Remediation would involve a similar cascade of events.
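This lifecycle cascade can be sketched in a few lines.  The sketch deliberately collapses the asynchronous event exchanges into direct calls for brevity, and the states and element names are assumptions, not any real orchestrator’s vocabulary; what it shows is orders cascading down and “Active” reports rolling up.

```python
class Element:
    """A toy intent-model element with the Orderable/Deploying/Active states."""
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)
        self.state = "Orderable"

    def handle(self, event):
        if event == "Order":
            self.state = "Deploying"
            for child in self.children:        # orders cascade downward
                child.handle("Order")
            # A leaf "deploys" immediately here; a superior goes Active only
            # when every subordinate has reported Active (vacuously true for leaves).
            if all(c.state == "Active" for c in self.children):
                self.state = "Active"          # "Active" reports roll upward
        elif event == "Fault":
            self.state = "Fault"               # would be reported up the line

svc = Element("service", [
    Element("hlf-1", [Element("leaf-a"), Element("leaf-b")]),
    Element("hlf-2"),
])
svc.handle("Order")
print(svc.state)
```

In a real implementation each `handle` call would be an event delivered to an independently spawned process, which is exactly why the asynchronous version needs the state variable at every level.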

The nice thing about this approach is that you can visualize a model hierarchy as a set of data elements, and you can host the state/event processes anywhere you have resources and can read the appropriate data elements.  There doesn’t need to be a process if there’s nothing happening at a given intent-model level; you spin one up as needed, and as many as you need.  Everything is elastic and scalable.

All of this is lovely, and in my view the right way of doing things.  There’s also a linear model of the process, which says that there is no per-intent-model state/event process at all, but rather an event is simply handed to a software tool, which then figures out what the event means by exploring the service structure in some way.  There’s no concurrency in this approach, no scalability, no real resiliency.  And yet this is how zero-touch or service lifecycle process management is often visualized, and how the NFV ISG’s E2E model describes it: a set of software components connected by interfaces, not a set of objects controlling state/event process management.

If, as I’ve said, ONAP/ECOMP is the only path forward toward achieving zero-touch automation simply because the scope issue is so complicated that nothing else will be able to catch up, then does ONAP support the right model?  As far as I can tell, the current software is much more linear than state/event in orientation.  It wouldn’t have to stay that way if the software was designed as a series of functional components, and I might be wrong in my assessment (I will try to check with the ONAP people to find out), but in any event the documentation doesn’t describe the models in detail, indicate the state/event relationships, etc.  More documentation, and perhaps new work, will be needed.

It should be considered essential to do both, for two reasons.  First, there is a risk that a large number of events would swamp a linear model of software.  Without a high degree of scalability, zero-touch automation is a trap waiting for a major incident to spring.  Second, the credibility of ONAP could be threatened by even near-term issues associated with limited linear processes, and that would put the only credible path to zero-touch automation at risk.  Sometimes you have to fix things, not try to start over, and this is such a time.

How Do We Ensure that Zero-Touch Automation Actually Touches Enough?

The more you do, the more it helps…and costs.  That’s the challenge of scope for zero-touch automation in a nutshell.  The challenge is especially acute when you consider that the term itself could be applied in many ways.  What exactly are we zero-touch automating?  It’s that question, the question of scope, that determines the impact that a solution can have, the disruption it will cause, and the cost it might generate.

The scope issue has to start with something no matter how vague its boundaries might become, and there are two dimensions in which scope variation happens with applications and services.  The first dimension is what could be called the business versus technology dimension, and the second the resource relationship dimension.

Both services and applications do something in a functional sense, and run on something in a technical sense.  The value, both intrinsic and commercial, depends on the functional side and the operational requirements for the resources fall on the technical side.  With network services (the main focus of my discussions here), the business side for operators is represented by the operations support and business support systems (OSS/BSS), and the technical side by network management systems (NMS) and network operations personnel and processes (the “network operations center” or NOC).

The business/functional versus operational/technology dimension also exists in applications, where developers or software selection professionals working with line departments decide what to use, and operations center personnel decide how to run it.  The origin of the concept of “DevOps” or Developer/Operations tools lies in this relationship.

The resource relationship dimension has focused on how stuff gets managed.  Services and applications have historically been managed, in lifecycle terms, in one of two specific ways.  The first is the explicit-linked resource model, where we have resources that are specifically linked to services or applications, and the state of those things is thus dependent on the state of the specific resources assigned.  Old TDM services worked this way.  The second is the shared resource model, which says that resources are assigned from a pool, shared among services or applications.  Packet networks share capacity and resources, so that’s the traditional network model.

A third model has recently emerged with virtualization and the cloud.  The pooled resource model differs from the shared resource model in that resources needed for services or applications are allocated from a pool, one that includes virtualized resources hosted on real servers.  In effect, it creates a hybrid between the two earlier approaches.  There’s real stuff underneath, and virtual stuff above it.

This real/virtual boundary is what’s actually at the heart of the zero-touch scope debate.  A “virtual” resource looks real from above and looks like an application from below.  When SDN and NFV came along, they created explicit virtualization but not explicit virtualization management practices.  The biggest problem that the notion of zero-touch automation faces is the fact that we have this new virtual layer that divides the first of our dimensions as well as the second.  How do you manage virtual resources, and how do they relate to the services/applications and resources that they separate?

The reason this is important to the scope issue is that since virtualization has to be somehow accommodated, there’s been a tendency to focus on managing it rather than managing everything.  Some zero-touch proponents are really exclusively interested in managing virtualization, some are interested in managing virtualization and services, and some virtualization and resources.  The problem is that if you want to manage, in zero-touch terms, the glorious whole, you end up having to address a lot of stuff that’s already been addressed, but not in the virtual-world environment in which we currently live.  This is where the ONAP/ECOMP initiative comes in.

It’s always been my view that AT&T got to its ECOMP platform because its Domain 2.0 architecture needed a software operationalization framework to live in, given that different vendors made up different pieces/zones of the same infrastructure.  That mission makes ECOMP a fairly complete management tool, and if ECOMP can deliver on zero-touch automation, it can fully address the dimensions of business, function, services, applications, virtualization, and resources.  In my view, zero-touch automation is meaningless if it only limits what you have to touch a bit.  It has to limit it a lot.

There are two corollaries to this point.  First, ECOMP needs to explicitly address both zero-touch automation and all the pieces and dimensions I’ve listed.  If it doesn’t, then it’s just another partial solution that may not get complete backing from operators.  Second, if you think you have a better strategy than ECOMP, you’d better have full coverage of all those areas or you are offering a partial solution.  That means that anyone who wants to sell into network operator infrastructure or management has to make it their mission to fit into ECOMP, which means ONAP today.

What about the ETSI zero-touch activity?  Well, the only thing that separates them from becoming another science project in the standards space is a decision to specifically target their strategies to the enhancement of ONAP/ECOMP.  Might the reverse work?  In theory, but the problem is that it makes no sense to link a currently deployable framework for a solution to a long-term standards process and wait for happy convergence.  Make what can be done now the reference, and let the long-term process define an optimal evolution—in conjunction with the ONAP people.

ONAP/ECOMP offers a good blueprint for full-scope management, but it still needs work in explicitly defining management practices related to virtualization.  ONAP/ECOMP effectively manages virtual resources and admits real resources into the same management process set, making all resources virtual.  Since OSS/BSS systems have typically treated services as the product of “virtual god boxes” that provided all the necessary functionality, this marries neatly with OSS/BSS processes.  However, it doesn’t orchestrate them.  That’s not necessarily a horrible problem given that OSS/BSS intervention into operations events can be limited to situations where billing and customer care are impacted, but it could bite the process later on.

The ONAP marriage is new, and still evolving, of course.  ONAP is hosted by the Linux Foundation, which has just established a combined administrative structure (see HERE) to manage their collective networking projects.  I’m hoping that this reflects a longer, broader, vision of networking; it certainly encompasses a lot more of the important stuff, including OpenDaylight, that bears on the issue of that critical virtual layer.  ODL codifies the notion that everything can be made to look virtual, and that formalization in the ONAP structure might induce the group to be more explicit about the strategy I’ve outlined, which at the moment can only be inferred and not proved.

One thing that addressing the virtualization model would accomplish is closing the loop on the issue of the explicitly linked versus shared resource model, in management terms.  Shared resources are generally managed independently of services, and that is also true with resource pools.  The presumption is that higher-layer elements that are built on real resources will respond to a failure of those resources by redeploying.

Where the challenge in this model is most obvious is in the way that we commission new resources, or coerce resource-centric behaviors into a form where they can be consumed by services.  We’re used to thinking about modularity and intent modeling in services, but the same concepts have a lot of value on the resource side.  A concept like “component-host”, for example, is a property of something that can (obviously) host components.  That might be a VM, a container, a real server, an edge blade in a router, or a piece of agile vCPE.  However, it might be packaged as a piece of equipment, a piece of software, a rack of equipment, an entire data center, or (in the case of M&A) a community of data centers.  The boundary layer concept is important because it gives us not only a way of making an infrastructure-neutral way of mapping service components to resources, but also because it offers a way of making new resources available based on predefined capabilities.  One vendor I know to be working in this area is Apstra, and there may be others.
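The “component-host” boundary property could be sketched as a single facade that very different packages of gear all advertise.  The class name, the property set, and the example resources below are hypothetical; the point is that a service component asks only about the abstract capability, never about what’s underneath.

```python
class ComponentHost:
    """The boundary-layer abstraction: anything that can host components,
    whether it's a VM, a container node, a server, or a router edge blade."""
    def __init__(self, name, cores, memory_gb):
        self.name, self.cores, self.memory_gb = name, cores, memory_gb

    def can_host(self, component):
        """Mapping test uses only the advertised properties, not the packaging."""
        return (component["cores"] <= self.cores
                and component["memory_gb"] <= self.memory_gb)

# Hypothetical resources commissioned through the same facade:
hosts = [ComponentHost("vm-17", cores=4, memory_gb=16),
         ComponentHost("edge-blade-2", cores=2, memory_gb=4)]

firewall_vnf = {"cores": 2, "memory_gb": 8}
print([h.name for h in hosts if h.can_host(firewall_vnf)])
```

Commissioning a new rack, or even a whole acquired data center, then means registering more `ComponentHost` entries with predefined capabilities rather than teaching the service layer about new hardware.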

The breadth of impact that’s potential with ONAP/ECOMP is perhaps the most critical thing about it, because of the potential opex savings and agility benefits that full zero-touch service lifecycle automation would offer.  A complete solution would offer potential savings of about 7 cents per revenue dollar even in 2018, across all categories of network operators.  In agility terms, it could lower the setup time for a new service down to about ten minutes, from about two weeks for largely manual processes (excluding new wiring), three days for today’s state-of-the-art automation, or four hours for what’s considered full automation today.  These benefits are critical for operator transformation, so much so that without them there’s doubt that much transformation can be achieved.  But, as we’ll see, there are issues beyond scope of impact that we have to address, and this blog series will cover them as we go along.

Dissecting the Challenges of Zero-Touch Automation

There’s no question that “zero-touch automation” is emerging as the new industry buzzword (or words), and it has the advantage or disadvantage of being more complicated than most of our past ones.  Sure, at the high level at least, you can express the idea simply.  It’s about full automation of the service or application lifecycle, from conception to deployment and use.  The problem is that this broad scope implicates just about everything an operations organization does, and the mechanisms for automating all that stuff will require some fundamental changes in how we think about operations software.

ZTA (as I’ll call it to reduce the typing burden) has three specific issues.  First and foremost is scope of impact.  A lot of different people and activities are involved in lifecycle processes, and most of that stuff has already been supplied with tools to support the current more manual processes.  We need to be able to “automate” all of this, or we risk creating costs and reducing agility rather than improving either of those.  Second is the functional model.  It’s easy to say something like “AI will be used to make all decisions rather than having humans do the heavy lifting”, but we don’t have AI that can just be trained to do the right thing.  How does the automation process know what to do?  Finally, we have software architecture itself.  If software is driving the tasks of ZTA then what kind of software is it?  I’m going to open each of these topics with this blog, and follow up with a blog on each.

“Automation” strategies for management of applications and network services have typically fallen far short of their goals in terms of reducing costs and improving agility.  The reasons for the shortfall almost always start with too limited a scope of impact.  Say that three organizations are involved in a given lifecycle process.  You grab what seems to be the low apple of the three and automate it, but the other two are left alone.  That means you have to make your automation fit a process flow that’s dominantly manual, and if you then move on to the other tasks you’ve not had the advantage of thinking of the flow of activity as a whole.  Thus, the sum of your three automation projects probably won’t be optimal.

The biggest fault-line in scope that ZTA faces is the divide between network management and service management, which is embodied in the separation of network management software and systems and operations support or OSS/BSS.  Time and time again we have worked to simplify the two independently, and have created something that is, overall, more complicated and expensive.  A service or application lifecycle is an operations sequence that touches different job functions in different ways, and somehow that entire sequence has to be automated, not pieces of it.

This doesn’t mean that we have to unify all management in an organizational sense, or that our automation solutions don’t recognize a difference between lifecycle processes in the NMS and OSS/BSS spaces.  It does mean that rather than having those two activities independent or even interdependent, we have them co-dependent on our organizing operational sequence, the concept that guides the ZTA approach, which is our functional architecture.

Lifecycle software isn’t a new concept.  What we expect of ZTA models in the future is today delivered, in part, by what’s called “DevOps” software.  DevOps is short for “Developer/Operations” and it describes the way that people who build software or software-driven processes communicate the deployment and operational needs of their processes to those who are responsible for running them.  There are two broad approaches to this—a “prescriptive” and a “descriptive” approach.  One defines the specific steps to be taken to do something and the other defines the goal state of things, and then uses that to build steps to achieve that state.

We seem to have accepted that the latter approach is best for ZTA.  “Intent modeling” of a service or application breaks something complex down into simple atoms, often using a hierarchical decomposition of the top level to work down to the bottom.  Each element in the model is responsible for delivering a specific functionality, and in response to any changes in the conditions that impact that responsibility, each is expected to take action internally or report failure up the line.

In network management, this has been seen for ages as a “state-event” problem.  For each element there are a number of possible states, one of which is the target state of the moment.  The goal of the software is to respond to “events” in such a way as to achieve an orderly transition to the goal state.  In OSS/BSS, work done by the TMF a decade ago launched the notion that a contract contained elements that linked service events to operations processes (“NGOSS Contract”, which morphed into their GB942 spec).  To make either approach work, you have to be able to componentize your processes so that they can be triggered in an individual way by a state/event combination.  The state/event progressions determine process execution, not some abstract workflow implicit in the applications themselves.

If we have a bunch of abstract events collecting around a functional intent model hierarchy that defines a service or application, it follows that the events are asynchronous in their appearance and that some processes might be referenced in multiple places.  That makes it important to have a software architecture that lets you define component processes and scale them as needed, and that drives us toward what’s usually called a microservices approach.

Microservices are little functional pieces that expect to run on demand, and that don’t themselves try to force “context” or stepwise operation on the stuff that’s invoking them.  If you want Process A to run on demand, and if you want to scale it on demand, it’s essential that the process doesn’t store something that would alter its behavior the next time it’s run.  Otherwise secondary events have to be sent to the same process instance that got the primary events, and the way the process works will vary in a way that the state/event model can’t anticipate.
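The statelessness rule can be shown in a few lines.  In this sketch (all names invented), the handler is a pure function of the stored state and the incoming event; because it keeps nothing between invocations, any instance—or any scaled copy—can take the next event for the same service.

```python
# Stand-in for the external/distributed state store (in practice, the
# service-model database, not a Python dict).
STORE = {}

def handle_event(store, service_id, event):
    """Stateless handler: (stored state, event) -> action, new state persisted
    back to the store.  No instance-local state survives the call."""
    state = store.get(service_id, "Orderable")
    transitions = {
        ("Orderable", "Order"): ("Deploying", "deploy"),
        ("Deploying", "deploy-ok"): ("Active", "notify-superior"),
    }
    new_state, action = transitions.get((state, event), (state, "ignore"))
    store[service_id] = new_state      # state lives outside the process
    return action

# Two invocations that could just as easily run in two different instances:
print(handle_event(STORE, "svc-42", "Order"))
print(handle_event(STORE, "svc-42", "deploy-ok"))
```

If the handler instead cached `state` in a local variable between calls, the second event would have to find the same instance—exactly the failure mode the paragraph above warns about.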

Like everything, the notion of microservices is easily trivialized.  Just saying you have the approach, of course, does nothing.  Even building software from microservices may not do anything either.  Scalable architecture helps overall scalability only if the stuff that’s subject to variable loads is the same stuff you’ve made scalable, and that you can scale to accommodate the loads expected or possible.

Consider an operator or cloud provider with a million services, meaning a million service models.  Normal issues could be expected to impact thousands of these in a day even if the rate of events is a tenth of one percent.  A single large-scale failure could hit ten percent of the total base, and that would swamp most systems.  The fact is that every service model has to be considered an independent process set.  That’s much more like the event-driven thinking of functional computing now evolving.

The ZTA of the future should really be thought of as a set of models, referencing processes, and dynamically linked to hosting resources when something happens.  Each service has the potential, at any point in the lifecycle, of being a process/resource consumer.  That means that in many cases we’ll have to be able to spin things up almost on a per-service basis.  Instead we tend to think of ZTA as a software engine that inputs model events and churns out answers.  That’s not going to work.

One reason all of this is important now is that Verizon has joined the ONAP project that AT&T’s ECOMP operations model anchors.  That makes it almost certain that ONAP will dominate open-source ZTA, which dominates operators’ plans.  If ZTA is going to be implemented, it will either be via the open-source ONAP project or through some integrated commercial tool set.  That means that we have to be able to fit ONAP to these three points, and that we have to use them to judge any competing solution suites or components thereof.

In past technology innovation adventures, including both SDN and NFV, we let ourselves get blinded by buzzwords, by anything that claimed a connection.  Automation has to be different because we’re running out of time to transform things.  In my next blog, I’m going to talk about the scope issue, and that will explain why I think that things like ONAP are so critical.  From there, we’ll move to the rest of the issues, ending with that most-difficult-of-all software architecture problem.  I hope you’ll find it all interesting!

A Structure for Abstraction and Virtualization in the Telco Cloud

It is becoming clear that the future of networking, the cloud, and IT in general lies in abstraction.  We have an increasing number of choices in network technology, equipment vendors, servers, operating systems (and distros), middleware…you get the picture.  We have open-source software and open hardware initiatives, and of course open standards.  With this multiplicity of options comes more buyer choice and power, but multiplicity has its downsides.  It’s hard to prevent vendor desires for differentiation from diluting choice, and differences in implementation mean difficulty creating efficient and agile operations.

Abstraction is the accepted way of addressing this.  “Virtualization” is a term often used to describe the process of creating an abstraction that can be mapped to a number of different options.  A virtual machine is mapped to a real server, a virtual network to real infrastructure.  Abstraction plus mapping equals virtualization, in other words.
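The "abstraction plus mapping equals virtualization" formula can be sketched in a few lines. The class and function names here are my own invention, purely illustrative: an abstraction consumers see, a set of real options, and a mapping that binds one to the other.

```python
class VirtualMachine:
    """The abstraction: what consumers see and ask for."""
    def __init__(self, name, vcpus):
        self.name, self.vcpus = name, vcpus


class Server:
    """A real resource the abstraction can be mapped onto."""
    def __init__(self, host, free_cpus):
        self.host, self.free_cpus = host, free_cpus


def map_vm(vm, servers):
    """The mapping: bind the abstraction to any suitable real option."""
    for s in servers:
        if s.free_cpus >= vm.vcpus:
            s.free_cpus -= vm.vcpus
            return (vm.name, s.host)
    raise RuntimeError("no capacity for " + vm.name)


# A 4-vCPU VM can't fit on h1, so the mapping picks h2; the consumer
# never sees which host was chosen -- that's the point of the abstraction.
pool = [Server("h1", 2), Server("h2", 8)]
placement = map_vm(VirtualMachine("vm-a", 4), pool)
```

The same pattern applies to a virtual network mapped to real infrastructure: only the mapping logic changes, not the consumer-facing abstraction.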

The challenge we have isn’t acceptance of the notion of abstraction/virtualization, but the growing number of things that need to be virtualized and the even-faster-growing number of ways of looking at it.  Complex virtualization really means a modeling system to express the relationships of parts to the whole.  In my ExperiaSphere work on service lifecycle automation, I proposed that we model a service in two layers, “service” and “resource”, and I think we are starting to see some sense of structure in virtualization overall.

The best way to look at anything these days is through cloud-colored glasses, and the cloud offers us some useful insights into that broader virtualization vision.  “Infrastructure” in the cloud has two basic features, the ability to host application components or service features, and the ability to connect elements of applications and services to create a delivered experience.  We could visualize these two things as being the “services” offered by, or the “features” of, infrastructure.

If you decompose infrastructure, you end up with systems of devices, and here we see variations in how the abstraction/virtualization stuff might work.  On the network side, the standard structure is that a network is made up of a cooperative community of devices/elements, and that networks are committed to create connection services.  Thus, devices>networks>connection-services in progression.  On the hosting or computing side, you really have a combination of network devices and servers that collectively frame a data center hardware system, and that hosts a set of platform software tools that combine to create the hosting.

There are already a couple of complicating factors entering the picture.  First, “devices” at the network and hosting levels can be virtualized themselves.  A “router” might be a software feature hosted in a virtual machine assigned to a pool of servers.  Second, the virtual machine hosting (or container hosting) might be based on a pool of resources that don’t align with data center boundaries, so the virtual division of resources would differ from the physical division.  Container pods or clusters or swarms are examples; they might cross data center boundaries.

What we end up with is a slightly more complicated set of layers, which I offer HERE as a graphic to make things easier to follow.  I’ve also noted the parts of the structure covered by MANO and ONAP, and by the Apache Mesos and DC/OS combination that I think bears consideration by the ONAP people.

At the bottom of the structure, we have a device layer that hosts real, nuclear, hardware elements.  On top of this is a virtual-infrastructure layer, and this layer is responsible for mapping between the real device elements available and any necessary or useful abstraction thereof.  One such abstraction might be geographical/facility-oriented, meaning data centers or interconnect farms.  Another might be resource-pool oriented, meaning that the layer creates an abstract pool from which higher layers can draw resources.

One easy illustration of this layer and what it abstracts is the decision by an operator or cloud provider to add a data center.  That data center has a collection of real devices in it, and the process of adding the data center would involve some “real” and “virtual” changes.  On the real side, we’d have to connect that data center network into the WAN that connects the other data centers.  On the virtual side, we would need to make the resources of that data center available to the abstractions that are hosted by the virtual-infrastructure layer, such as cloud resource pools.  The “mapping processes” for this layer might contain policies that would automatically augment some of the virtual-infrastructure abstractions (the resource pools, for example) with resources from the new data center.

Above the virtual-infrastructure layer is the layer that commits virtual resources, which I’ll call the “virtual resource” layer.  This layer would add whatever platform software (OS and middleware, hypervisor, etc.) and parameterization needed to transform a resource pool into a “virtual element”, a virtual component of an application or service, a virtual device, or something else that has explicit functionality.  Virtual elements are the building-blocks for services, which are made up of feature components hosted in virtual elements or coerced behavior of devices or device systems.
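One way to make the three layers just described concrete is to sketch them as data transformations. The labels follow the blog's structure but are my own, not any standard API: real devices at the bottom, a pooled abstraction that hides data center boundaries in the middle, and a platform-equipped virtual element on top.

```python
# Device layer: real, nuclear hardware elements, grouped by facility.
device_layer = {
    "dc-east": ["server-1", "server-2", "router-1"],
    "dc-west": ["server-3", "router-2"],
}


def virtual_infrastructure(devices):
    """Abstract real devices into a pooled view that can cross data
    center boundaries (as container clusters or swarms might)."""
    pool = [d for site in devices.values() for d in site
            if d.startswith("server")]
    return {"compute-pool": pool}


def virtual_resource(pools, platform):
    """Commit pooled resources plus platform software (OS, middleware,
    hypervisor) into a virtual element with explicit functionality."""
    host = pools["compute-pool"][0]
    return {"element": "virtual-router", "host": host, "platform": platform}


pools = virtual_infrastructure(device_layer)
element = virtual_resource(pools, platform="linux+hypervisor")
```

Adding a data center, in this sketch, is just a new key in `device_layer`; policies in the virtual-infrastructure layer decide whether its servers flow into the pool automatically.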

If we accept this model as at least one possible layered framework for abstraction, we can also map some current projects to the layers.  ONAP and NFV MANO operate at the very top, converting virtual resources into functional components, represented in MANO by Virtual Infrastructure Managers and Virtual Network Functions.  ONAP operates higher as well, in service lifecycle management processes.

Below the ONAP/MANO activities are the layers that my ExperiaSphere stuff calls the “resource-layer models”.  In my view, the best current framework for this set of features is found in the DC/OS project, which is based on Apache Mesos.  There are things that I think are needed at this level that Mesos and DC/OS don’t provide, but I think they could be added on without too much hassle.

Let’s go back now to DC/OS and Mesos.  Mesos is an Apache cluster management tool, and DC/OS adds in features that abstract a resource cloud to look like a single computer, which is certainly a big step toward my bottom-layer requirements.  It’s also something that I think the telcos should have been looking at (so is Marathon, a mass-scale orchestration tool).  But even if you don’t think that the combination is a critical piece of virtualization and telco cloud, it demonstrates that the cloud community has been thinking of this problem for a long time.

Where I think DC/OS and Mesos could use some help is in defining non-server elements, resource commissioning, and data center assignment and onboarding.  The lower layer of my model, the Device Layer, is a physical pool of stuff.  It would be essential to be able to represent network resources in this layer, and it would be highly desirable to support the reality that you onboard entire data centers or racks and not just individual servers or boxes.  Finally, the management processes to sustain resources should be defined here, and from here should be coupled upward to be associated with higher-layer elements.

I think this is a topic that needs to be explored, by the ONAP people, the NFV ISG, and perhaps the Open Compute Project, as well as Apache.  We need to have a vertically integrated model of virtualization, not a bunch of disconnected approaches, or we’ll not be able to create a uniform cloud hosting environment that’s elastic and composable at all levels.  And we shouldn’t settle for less.

The Cloud and the Battle for “Everyware”

Even in an industry, a world, committed to hype, reality always wins in the end.  Cloud computing is an example of this tenet, and what’s interesting is less the validity of the central point than the way that cloud reality is reshaping the industry.  Most interesting of all is the relationship between the cloud and open-source.

When public cloud computing first came along, I did my usual industry surveys and modeling, and what emerged from the process was a few key points.  First, no more than 24% of current applications could be justifiably ported to the cloud.  Second, over 80% of the actual opportunity for public cloud services would come from developing cloud applications that had never run elsewhere.  Finally, public cloud would never displace enterprise data centers to any significant degree.

What we are seeing in cloud computing today is a reflection of these points.  Cloud-specific applications dominate, and hybrid cloud dominates, even now.  Increased competition among cloud providers, and the constant need for differentiation, has generated a whole cloud industry of “web services” that present hosted feature add-ons to basic cloud services.  This is one of the reasons why we’re seeing cloud-specific applications.  Now the same forces are acting in the hybrid cloud area.

Hybrid clouds are a symbiotic relationship between enterprise data centers and public cloud services.  Given that, it’s obvious that somebody with a foot in both spaces would have an advantage in defining the critical connecting features, and that has benefitted Microsoft significantly.  In my surveys, Microsoft’s cloud has outperformed the competition, even though non-enterprise applications have pushed Amazon into the overall lead in public cloud services.  Amazon and Google know this, and both companies have been struggling to create a credible outpost for their own cloud services within the data center.

The obvious way to create the hybrid link to your cloud service is to offer a premises-hosted element that appears to be a part of your cloud.  Amazon has done this with Greengrass.  Google is working with Cisco to develop an open hybrid strategy, and is said to be especially under pressure to make something happen, hybrid-wise, because of Google’s third-place position in the public cloud market.  Amazon is now working its own Linux distro, Amazon Linux 2, into the game, and some say that Google is hoping Kubernetes, the popular container orchestrator that Google developed initially, will provide it with hybrid creds.  Unfortunately for Google, everyone supports Kubernetes, including Amazon and Microsoft.

While the competitive dynamic in the public cloud space, and hybrid cloud impact on that dynamic, get a lot of buzz, the biggest and longest-lasting impact of the hybrid cloud may be on “platform software”, meaning the operating system and middleware elements used by applications.  Amazon and Salesforce have made no secret of their interest in moving off Oracle DBMS software to an open platform, something that would lower their costs.  If public cloud platforms gravitate totally to open source, and if public cloud operators continue to add web services to build cloud-specific functionality that has to hybridize with the data center, isn’t it then very likely that the public cloud platform software will become the de facto platform for the hybrid cloud, and thus for IT overall?

What we’re talking about here is “cloudware”, a new kind of platform software that’s designed to be distributable across all hosting resources, offering a consistent development framework that virtualizes everything an application uses.  Hybrid cloud is a powerful cloudware driver, but working against this happy universality is the fact that cloud providers don’t want to have complete portability of applications.  If they don’t have their own unique features, then they can only differentiate on price, which creates the race to the bottom nobody particularly wants to be a part of.

It’s already clear that cloudware is going to be almost exclusively open-sourced.  Look at Linux, at Docker, at Kubernetes, at OpenStack, and you see that the advances in the cloud are already tied back to open source.  A big part of the reason is that it’s very difficult for cloud providers to invent their own stuff from the ground up.  Amazon’s Linux 2 and the almost-universal acceptance of Kubernetes for container cloud demonstrate that.  Open-source platform software is already the rule, and cloudware is likely to make it almost universal.

The biggest question of all is whether “cloudware” will end up becoming “everyware”.  Open-source tools are available from many sources, including giants like Red Hat.  Is it possible that cloudware would challenge these incumbents, and if so what could tip the balance?  It’s interesting and complicated.

At the level of broad architecture, what’s needed is fairly clear.  To start with, you need something that can virtualize hosting, modeled perhaps on Apache Mesos and DC/OS.  That defines a kind of resource layer, harmonizing the underlying infrastructure.  On top of that you’d need a platform-as-a-service framework that included operating system (Linux, no doubt) and middleware.  It’s in the middleware that the issue of cloudware/everyware hits.

Everyone sees mobility, or functional computing, or database, or whatever, in their own unique and competitively biased way.  To create a true everyware, you need to harmonize that middleware, which means that you need an abstraction layer for it just as we have for hardware or hosting.  For example, event-driven functional computing could be virtualized, and then each implementation mapped to the virtual model.
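That middleware abstraction layer can be sketched with a simple adapter pattern. The provider names are real, but these classes and their methods are invented for illustration; they are not real SDK calls, just a shape for how one virtual model of functional computing could map onto per-provider implementations.

```python
class FunctionBackend:
    """Abstract model of event-driven functional computing."""
    def invoke(self, name, event):
        raise NotImplementedError


class LambdaBackend(FunctionBackend):
    def invoke(self, name, event):
        return f"aws:{name}:{event}"     # stand-in for a real AWS call


class CloudFunctionsBackend(FunctionBackend):
    def invoke(self, name, event):
        return f"gcp:{name}:{event}"     # stand-in for a real GCP call


class VirtualFunctions:
    """The 'everyware' view: applications code to this layer,
    not to any one provider's differentiated web services."""
    def __init__(self, backend):
        self.backend = backend

    def call(self, name, event):
        return self.backend.invoke(name, event)


# The same application line works against either mapping.
app = VirtualFunctions(LambdaBackend())
result = app.call("resize", "img-1")
```

The tension in the text is visible even here: the abstraction only covers what all the back ends share, which is exactly the portability the providers' differentiation-driven features work against.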

If we are evolving toward a fairly universal hybrid platform, then either that platform has to evolve from current enterprise open-source stuff like Red Hat, or emerge from the cloud.  Proponents from either camp have an opportunity to frame the universal “everyware” of the future, but they also face specific challenges to their moves to do that.

For cloud providers, the problem is lack of unity.  The cloud is not the only place applications run; it’s not even the dominant place.  Not only that, differentiation- and profit-driven moves to enhance web services available to cloud applications create not one vision of cloudware, but a vision for every provider.  Enterprises who think in terms of hybrid cloud confront the issue of premises data center hosting; those who think in terms of multicloud confront the diversity of implementations for basic cloud service features.

The premises players have their own special challenge, which is that the cloud changes everything, at least with respect to application architectures and developer strategies.  It’s hard to see how you could build an event-driven app in the data center unless you wanted to host stuff all over the world where your events originated.  That means that the premises players have to cede the critical future development trends to the cloud providers.

The battle to control “everyware” may be the defining issue in 2018, because it will not only establish market leadership (and maybe even survival) in both the cloud and platform software spaces, but will influence the pace of cloud adoption and application modernization.  This is the cloud’s defining issue for the coming year, and it will also play a major role in defining how we evolve to carrier cloud and hosted services.  Keep an eye on it; I know I’ll be watching!

How NFV Can Save Itself in 2018

Network Functions Virtualization (NFV) has generated a lot of buzz, but it became pretty clear last year that the bloom was off the rose in terms of coverage and operator commitment.  Does this mean that NFV was a bad idea?  Is all the work that was done irrelevant, or about to become so?  Are vendor and operator hopes for NFV about to be dashed for good?

NFV wasn’t a bad idea, and still isn’t, but the fulfillment of its potential is in doubt.  NFV is at a crossroads this year, because the industry is moving in a broader direction and the work of the ISG is getting more and more detailed and narrow.  The downward direction collides more and more with established cloud elements, so it’s redundant.  The downward direction has also opened a gap between the business case and the top-level NFV definitions, and stuff like ONAP is now filling that gap and controlling deployment.

I’ve noted in many past blogs that the goal of efficient, agile, service lifecycle management can be achieved without transforming infrastructure at all, whether with SDN or NFV.  If we get far enough in service automation, we’ll achieve infrastructure independence, and that lets us stay the course with switches and routers (yes, probably increasingly white-box but still essentially legacy technology).  To succeed in this kind of world, NFV has to find its place, narrower than it could have been but not as narrow as it will end up being if nothing is done.

The first step for NFV is to hitch its wagon to the ONAP star.  The biggest mistake the ETSI NFV ISG made was limiting its scope to what was little more than how to deploy a cloud component that happened to be a piece of service functionality.  A new technology for network-building can never be justified by making it equivalent to the old ones.  It has to be better, and in fact a lot better.  The fact is that service lifecycle automation should have been the goal all along, but NFV’s scope couldn’t address it.  ONAP has a much broader scope, and while (as its own key technologists say) it’s a platform and not a product, the platform has the potential to cover all the essential pieces of service lifecycle automation.

NFV would fit into ONAP as a “controller” element, which means that NFV’s Management and Orchestration (MANO) and VNF Manager (VNFM) functions would be active on virtual-function hosting environments.  The rest of the service could be expected to be handled by some other controller, such as one handling SDN or even something interfacing with legacy NMS products.  Thus, ONAP picks up a big part of what NFV doesn’t handle with respect to lifecycle automation.  Even though it doesn’t do it all, ONAP at least relieves the NFV ISG of the requirements of working on a broader area.
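The controller split described above can be sketched structurally. This is not actual ONAP code; the function and domain names are my own, chosen to mirror the paragraph: a lifecycle manager that dispatches per-domain work, with NFV's MANO/VNFM handling only the virtual-function hosting domain.

```python
def nfv_controller(task):
    """NFV's slice of the job: MANO/VNFM on hosted virtual functions."""
    return f"MANO/VNFM deploys {task}"


def sdn_controller(task):
    return f"SDN controller provisions {task}"


def legacy_nms_controller(task):
    return f"legacy NMS configures {task}"


CONTROLLERS = {
    "vnf-hosting": nfv_controller,
    "sdn": sdn_controller,
    "legacy": legacy_nms_controller,
}


def lifecycle_dispatch(domain, task):
    """ONAP-style lifecycle manager: route each piece of a service to
    the controller responsible for that domain."""
    return CONTROLLERS[domain](task)


msg = lifecycle_dispatch("vnf-hosting", "firewall-vnf")
```

The point of the structure is that NFV no longer has to own the whole lifecycle: anything outside `vnf-hosting` is some other controller's problem.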

The only objections to this step may come from vendors who want to push their own approaches, or from some operators who have alternative open-platform aspirations.  My advice to both groups is to get over it!  There can be only one big thrust forward at this point, and it’s ONAP or nothing.

The second step for NFV is probably going to get a lot of push-back from the NFV ISG.  That step is to forget a new orchestration and management architecture and focus on adapting cloud technology to the NFV mission.  A “virtual network function” is a cloud component, period.  To the greatest extent possible, deploying and sustaining them should be managed as any other cloud component would be.  To get to that point, we have to divide up the process of “deployment” into two elements, add a third for “sustaining”, and then fit NFV to each.

The first element is the actual hosting piece, which today is dominated by OpenStack for VMs or Docker for containers.  I’ve not seen convincing evidence that the same two elements wouldn’t work for basic NFV deployment.

The second element is orchestration, which in the cloud is typically addressed through DevOps products (Chef, Puppet, Heat, Ansible) and with containers through Kubernetes or Marathon.  Orchestration is about how to deploy systems of components, and so more work may be needed here to accommodate the policy-based automation of deployment of VNFs based on factors (like regulations) that don’t particularly impact the cloud at this point.  These factors should be input into cloud orchestration development, because many of them are likely to eventually matter to applications as much as to services.
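A minimal sketch of the policy-driven placement this paragraph calls for, with invented field names: regulatory constraints (say, data-residency rules) filter the candidate sites before ordinary capacity-based selection runs.

```python
sites = [
    {"name": "dc-eu-1", "region": "EU", "free_cpus": 16},
    {"name": "dc-us-1", "region": "US", "free_cpus": 64},
]


def place_vnf(vnf, sites):
    """Apply policy filters first, then pick the surviving candidate
    with the most spare capacity."""
    legal = [s for s in sites if s["region"] in vnf["allowed_regions"]]
    if not legal:
        raise RuntimeError("no site satisfies policy for " + vnf["name"])
    return max(legal, key=lambda s: s["free_cpus"])["name"]


# A VNF whose traffic must stay in the EU lands in dc-eu-1 even though
# dc-us-1 has far more capacity -- policy trumps the capacity heuristic.
choice = place_vnf({"name": "epc", "allowed_regions": {"EU"}}, sites)
```

Feeding constraints like these into mainstream cloud orchestration, rather than building an NFV-only scheduler, is the point of the paragraph: applications will eventually need the same knobs.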

The final element is the management (VNFM) piece.  Cloud application management isn’t as organized a space as DevOps or cloud stacks, and while we have this modern notion of intent-modeled services, we don’t really have a specific paradigm for “intent model management”.  The NFV community could make a contribution here, but I think the work is more appropriately part of the scope of ONAP.  Thus, the NFV people should be promoting that vision within ONAP.

The next element on my to-do-for-NFV list is to think outside the virtual CPE.  NFV quickly got obsessed with the vCPE application, service chaining, and other things related to that concept.  This has, in my view, created a huge disconnect between NFV work and the things NFV will, in the long term, have to support to be successful.

The biggest problem with vCPE is that it doesn’t present a credible benefit beyond business services.  You always need a box at the point of service termination, particularly for consumer broadband where WiFi hubs combine with broadband demarcations in virtually every case.  Thus, it’s difficult to say what you actually save through virtualization.  In most current vCPE business applications, you end up with a premises box that hosts functions, not cloud hosting.  That’s even more specialized as a business case, and it doesn’t drive carrier cloud deployment critical for the rest of NFV.

Service chaining is another boondoggle.  If you have five functions to run, there is actually little benefit in having the five separately hosted and linked in a chain.  You now are dependent on five different hosting points and all the connections between them, or you get a service interruption.  Why not create a single image containing all five features?  If any of the five break, you lose the connection anyway.  Operations and hosting costs are lower for the five-combined strategy than the service-chain strategy.  Give it up, people!
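The chain-versus-single-image argument can be put in rough numbers. The availability figure below is illustrative, not measured; the structural point survives any reasonable choice.

```python
# Assume each separately hosted function (its hosting point plus the
# links to it) is up 99.9% of the time -- an illustrative figure.
per_element_availability = 0.999

# Five separately hosted functions in a chain: all five must be up.
chain = per_element_availability ** 5

# The same five functions packaged in one image: one hosting point.
single_image = per_element_availability

print(round(chain, 5))          # ~0.99501 -- roughly five times the downtime
print(round(single_image, 5))   # 0.999
```

Since a failure of any one function breaks the service either way, the chain multiplies the failure exposure without buying resilience, which is the blog's case for the combined image.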

The beyond-vCPE issue is that of persistence and tenancy.  Many, perhaps even most, credible NFV applications are really multi-tenant elements that are installed once and sustained for a macro period.  Even most single-tenant NFV services are static for the life of the contract, and so in all cases they are really more like cloud applications than like dynamic service chains.  We need to have an exploration of how static and multi-tenant services are deployed and managed, because the focus has been elsewhere.

We have actually seen some successful examples of multi-tenant service elements in NFV already; Metaswitch’s implementation of IMS comes to mind.  The thing that sets these apart from “typical” NFV is that you have a piece of service, an interface, that has to be visible in multiple services at the same time.  There has to be some protection against contamination or malware for such services, but there also has to be coordination in managing the shared elements, lest one service end up working against the interests of others.

Nothing on this list would be impossible to do, many wouldn’t even be difficult, and all are IMHO totally essential.  It’s not that a failure to address these points would cause NFV to fail as a concept, but that it could make NFV specifications irrelevant.  That would be a shame because a lot of good thinking and work has gone into the initiative to date.  The key now is to direct both past work and future efforts in a direction where results that move the ball for the industry as a whole, not for NFV as an atomic activity, can be obtained.  That’s going to be a bitter pill for some, but it’s essential.