How Transformational are Cisco’s “Five Critical Technologies”?

A Cisco executive cites five network technologies as critical for enterprises in 2019, according to a Cisco blog and a Network World article.  In fact, these five will be game-changers according to Cisco’s VP of engineering, so “2019 is going to be a transformative year in enterprise networking.”  What are these technologies, and is Cisco playing this straight or engaging in its usual marketing hype game?  Cisco says that the five are “Wi-Fi 6 (802.11ax), 5G, ‘digitized spaces’, SD-WAN and Artificial Intelligence/Machine Learning (AI/ML) for network management.”  Do enterprises agree, and do I?  We’ll look at each of the five, and a hidden point as well.

WiFi 6 is a fairly significant upgrade to WiFi to provide increased device density within a WiFi zone, and to provide for deterministic packet scheduling to improve some aspects of application performance.  The big question isn’t whether it’s useful, but whether it’s transformative, and on that front, I think it misses the mark.

The biggest advantages of WiFi 6 are related to device density; you can handle more users within a single WiFi “zone”.  That has benefits, to be sure, but we’ve long had the ability to multiply user capacity by increasing the number of WiFi hubs.  In addition, WiFi 6 will require upgrades to devices for it to be useful, and it will take years for these upgrades to roll out.

Vendors (including Cisco) probably love the notion of WiFi 6 because it is a driver of upgrades, but WiFi is so ubiquitous that making any fundamental changes to its specifications will take way too long for it to qualify as a game-changer…unless you’re playing a very slow game.

5G doesn’t need much of an introduction or explanation.  Like WiFi 6, it’s an upgrade to current specifications; in this case for cellular broadband.  Like WiFi 6, it offers both improved performance per user and improved density of users, and like WiFi 6 it requires a new set of devices (smartphones, primarily) to fully realize its benefits.

There are 5G opportunities, though, and they are transformational at least in potential.  They’re just not what Cisco thinks they are.  It’s going to take a long time for 5G to “transform” mobile broadband, if indeed it ever does.  We don’t know how much increased broadband bandwidth operators will be prepared to offer under current unlimited plans.  We don’t know how quickly new handsets can be socialized.  All we know is that 5G/FTTN hybrids based on millimeter-wave technology have the potential to transform home video delivery and even branch networking.

Network World called Cisco’s “digitized spaces” “geolocation”, which I think trivializes the concept, but Cisco’s example was almost totally focused on location data.  The concept that I think is important is what I’ve called “contextualization”, meaning giving applications a fuller understanding of what users/workers are actually doing at the moment, as a means of improving the relevancy of information or the interpretation of requests.

Context makes ads more valuable, applications more useful, information more relevant.  It’s true that location is a part of context, even a big part, but it’s far from being sufficient to reap the rewards that contextualization could provide.  Given the extraordinary value of context (“digital spaces”, or the creation of a virtual “goal-world” in which users can be said to be living) it’s surprising to me that Cisco doesn’t see it in its entirety.  More so, perhaps, since as a server/cloud infrastructure player, Cisco could be expected to play in the broadest definition of context, which isn’t true for other network vendors.

Another concept that doesn’t need much explanation is SD-WAN, and I think it has the greatest potential to change the network game for enterprises in 2019.  SD-WAN is an overlay network concept first created to support cloud hosting and later adapted (as the name suggests) to the WAN.  The evolution of these overlay virtual networks means that they intersect all the hot trends in the market—you need them for the cloud, to access cloud applications, to unify worker application and information access.  They’ve got pride of place, big-time.

The problem Cisco has with SD-WAN is that the narrow focus of the market today, which is simple small-branch connectivity, threatens their VPN/router market and doesn’t move the ball much in terms of improving the overall network business/benefit case.  For Cisco, that makes the SD-WAN a defensive strategy—lose some sales to yourself rather than to someone else.

SD-WAN’s importance is either a fallacy or it stems from the network-as-a-service (NaaS) mission I’ve blogged about so often.  Cisco hasn’t supported that broader mission, so either its citing of SD-WAN as a game-changer means there’s change coming in Cisco’s SD-WAN strategy, or Cisco is just washing the market with hype.

AI and machine learning for network management is the final point, and perhaps the prediction that rivals SD-WAN for the most potential as a game-changer for enterprise networking.  Operations complexity and cost are not only holding back new technology innovations, they’re also eating up more money than the capital cost of equipment.  I blogged yesterday about the integration of AI and simulation into network operations automation, and that role could be decisive.

Cisco’s definition of AI in network management seems to focus on a mixture of analytics and complex event processing.  Yes, it’s helpful to be able to look for patterns of events to determine whether something is going wrong before things get too far and something actually fails.  Yes, that’s something AI might be able to do.  However, artificial intelligence is the current UFO of technology; it’s not landing in your yard and presenting itself for inspection, so you can assign to it any characteristics you find exciting or helpful in your marketing.  Machine learning has to learn stuff.  It would be a lot smarter to have the pump primed by something like simulation than to just wait until “I, Robot” finally gets smart enough to be useful.

There’s a hidden sixth item in Cisco’s litany, which is IoT.  A recent Light Reading article noted that cable TV is looking to growth in the number of households for a longer market ramp.  That’s logical given that you don’t typically have multiple cable TV subscriptions per household; it reflects the reality of “total addressable market” or TAM.  The same dynamic applies to the Internet and to wireless broadband.  If you want Internet growth, you need growth in the things that use it.  IoT promises billions of new Internet devices, right?

Well, you all know what I think about that.  What IoT will do, and is already doing, is add some new applications to the Internet and to smartphones.  Many of those new apps will potentially enhance the notion of digitized space or context, which reinforces my point that it’s context that really matters here.

Cisco is right about one thing, which is that smartphones and computers have already become extensions of people, and are on the way to defining their own parallel universe, a universe we all live in along with our lives in the real world.  In some ways (augmented or artificial reality, for example) we might explicitly inhabit that digital world, but in all ways we’ll draw on the digital universe to guide our behavior in the real world.  That’s truly transformational, and since all of Cisco’s five trends augment that digital universe, we can say that Cisco is right.  On the other hand, Cisco is dissecting the digital universe in their vision, separating the pieces that should be and must be mutually supporting to achieve anything.  In that respect, Cisco is wrong.  Is it lack of vision, or is it simply a sales-driven company trying to break down something big and complex so that the pieces can be sold more easily?  You can call that one for yourself.

Simulation, AI, and Testing in the NaaS of the Future

Virtual networking or Network-as-a-service (NaaS) makes connectivity easier, but it complicates automation of lifecycle processes.  The problem is that when “services” are created on top of connectivity rather than through the same devices that provide connectivity, you lose some insight into service behavior.  You also lose pretty much all of the traditional ways of doing management.  There’s talk about “monitoring”, “analytics”, “simulation”, and “artificial intelligence” to fill the gap, but little detail about how that might work.

The basic principle of automated service lifecycle management is that there are a series of generated events that represent changes in conditions.  These events are then analyzed so that a cause and correction can be determined, and then a remediation process is launched to restore normal (or at least optimal under current conditions) behavior.  This is the process into which we have to fit any new tools we hope to adopt.

The tools?  These also need some definition.  Monitoring is the real-time examination of network behavior to identify deviations from the desired operating state.  Such deviations are then coded as events, things that represent alerts that have to be handled in some way.  Analytics is the broader process of historical trend analysis or correlation analysis, aimed at deepening the insight you gain from traditional monitoring, usually to detect things earlier than monitoring would.  Simulation is the process of modeling the behavior of the network under a specific set of conditions, and it can be used either to help predict how current monitoring/analysis conditions will play out (giving you yet more advanced notice) or predict how effective a given remedial process will be.  Artificial intelligence is similarly a tool to interpret our “events” to either predict how a problem will unfold, or what reaction to the problem is most likely to be successful.  Got it?

We can apply our tools to our lifecycle model now, starting with the “event” generation.  Let’s say that the network has a goal state, which is implicit in nearly all operations processes.  There’s a network plan, perhaps a capacity plan, that is based on assumptions about traffic and connectivity.  That plan implicitly defines a network behavior set that’s based on the plan, and that’s the goal state.  If the current network behavior deviates, then it would be expected that the plan would be invalid, and some remediation would be required.  Monitoring and analytics are the traditional and leading-edge (respectively) ways of detecting a deviation.
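To make that concrete, here’s a minimal Python sketch of goal-state deviation detection.  The capacity plan, the link names, and the thresholds are all hypothetical; the point is simply that events get generated only when observed behavior violates the plan’s assumptions.

```python
# Minimal sketch of goal-state deviation detection.  The capacity plan is
# expressed as per-link utilization bounds; links and numbers are invented.

from dataclasses import dataclass

@dataclass
class LinkPlan:
    name: str
    max_utilization: float   # fraction of capacity the plan assumes

@dataclass
class DeviationEvent:
    link: str
    observed: float
    planned: float

def detect_deviations(plan, observed_utilization):
    """Compare monitored utilization against the planned (goal) state and
    generate events only where the plan's assumptions are violated."""
    events = []
    for link in plan:
        observed = observed_utilization.get(link.name, 0.0)
        if observed > link.max_utilization:
            events.append(DeviationEvent(link.name, observed, link.max_utilization))
    return events

if __name__ == "__main__":
    capacity_plan = [LinkPlan("core-1", 0.70), LinkPlan("edge-5", 0.60)]
    monitored = {"core-1": 0.82, "edge-5": 0.41}   # what monitoring reports
    for event in detect_deviations(capacity_plan, monitored):
        print(f"Deviation on {event.link}: {event.observed:.0%} vs planned {event.planned:.0%}")
```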

Simulation could be a powerful tool at this point, because when the network’s presumptions on connectivity and traffic are known, those presumptions could be used to set up a simulation of normal network behavior.  What simulation brings to the table is the opportunity to model detailed conditions in the network as traffic and connectivity demands change.  That provides specific ways of generating early warnings of conditions that are truly significant, versus just things that won’t make much of a difference or will be handled through adaptive network behavior.

AI can also be injected at this point.  If we have a good simulation of “normal” behavior and we also have a simulation of how the current set of conditions is likely to impact the network, we can then use AI to establish a preemptive (or, if we’re late in recognizing things, a reactive) response.  That response could then be fed into the simulator (if there’s time) to see if it produces a satisfactory outcome.  We could also use AI to define a set of possible responses to events, then simulate to see which works best.  That could then be fed back into AI (machine learning) to help rule out actions that shouldn’t be considered.
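A rough sketch of that propose/simulate/learn loop might look like the following.  The candidate actions, the scoring function, and the ruled-out set are stand-ins for a real machine-learning policy and a real network simulator; I’m only illustrating the feedback structure.

```python
# Sketch of the propose/simulate/learn loop.  Actions, scores, and the
# acceptability threshold are hard-coded stand-ins for illustration.

def propose_responses(condition, ruled_out):
    """AI side: propose responses, skipping anything already ruled out."""
    candidates = ["reroute_traffic", "add_capacity", "throttle_low_priority"]
    return [c for c in candidates if c not in ruled_out]

def simulate(condition, action):
    """Simulator side: predict a congestion score (lower is better) for
    applying 'action' to 'condition'."""
    return {"reroute_traffic": 0.3, "add_capacity": 0.2,
            "throttle_low_priority": 0.7}[action]

def choose_and_learn(condition, ruled_out, acceptable=0.5):
    best_action, best_score = None, float("inf")
    for action in propose_responses(condition, ruled_out):
        score = simulate(condition, action)
        if score > acceptable:
            ruled_out.add(action)        # feedback: don't consider this again
        elif score < best_score:
            best_action, best_score = action, score
    return best_action

ruled_out = set()
print(choose_and_learn("congestion on core-1", ruled_out))   # add_capacity
print(ruled_out)                                             # {'throttle_low_priority'}
```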

In the response area, it’s AI that comes to the fore.  Machine learning is often represented as a graph, where we have steps, conditions, and processes linked in a specific way.  Think of this as an augmentation of normal state/event behavior.  Because all service lifecycle processes should (nay, I would say “must!”) be based on state/event handling, augmenting that to provide for more advanced analysis of the state/event relationship would be expected to improve lifecycle automation.
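Here’s what I mean by state/event handling in its simplest form; the states, events, and handler processes are invented for illustration, and a graph-based or ML-augmented version would elaborate on the same structure rather than replace it.

```python
# Minimal state/event table of the kind lifecycle automation is built on.
# States, events, and handlers are illustrative only.

def start_redeploy(element):   return ("redeploying", f"redeploying {element}")
def clear_alarm(element):      return ("active", f"{element} back to normal")
def ignore(element):           return (None, "no action")

STATE_EVENT_TABLE = {
    ("active", "fault"):         start_redeploy,
    ("redeploying", "restored"): clear_alarm,
    ("redeploying", "fault"):    ignore,     # already being handled
}

def handle(state, event, element):
    handler = STATE_EVENT_TABLE.get((state, event), ignore)
    new_state, note = handler(element)
    return (new_state or state), note

state = "active"
for event in ["fault", "fault", "restored"]:
    state, note = handle(state, event, "Access-Element-12")
    print(state, "-", note)
```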

That doesn’t mean simulation isn’t helpful here, either to supplement AI by providing insight into the way that proposed responses will impact services, or by offering a path to predict network behavior if conditions fall outside what AI has been prepared to handle.  Most useful “process automation AI” is likely a form of machine learning, and simulation can help with that by constraining possible solutions and by offering something that’s perhaps almost judgment-like to solve problems for which specific rules haven’t been defined.

It should be clear here that the combination of AI and simulation serves in lifecycle automation almost as a substitute for human judgment or “intuition”.  The more complex the network is, the more complex the services are, the more difficult it is to set up specific plans to counter unexpected conditions—there are too many of them and the results are too complex.  It should also be clear that neither simulation nor AI is the primary tool of lifecycle automation.  We still need good overall service modeling (intent modeling) and we still need state/event or possibly graph-based interpretation of conditions.

All of this, we should note, is likely to be managed best at the facility level rather than at the virtual network level.  NaaS is a wonderful thing for operations, but less so as the target of remediation.  Real faults can’t be corrected in the virtual world; the best you could do is to rebuild the virtual structure to accommodate real conditions.  That could create the classic fault avalanche, and it would be easier and less process-intensive in any event to remediate what can really be fixed—there’s less of it and the actions are more likely to be decisive.

There’s an impact on testing here, too.  The industry has been struggling with the question of whether you should do “connection network” testing even in IP networks, where finding a traffic flow means examining routing tables to see where it’s supposed to go.  In NaaS-modeled networks, we have a clean separation of connectivity and the associated underlayment, but even that underlayment may be based on IP and adaptive behavior.

One possibility for testing is the “deep dive” model; you go to the bottom layer that’s essential for everything else and test the facilities there.  This is difficult to separate from monitoring, though.  Another possibility is to maintain connection-level testing, but that’s going to be much more difficult unless you have a standard mechanism for setting up an overlay network.  Today in SD-WAN, for example, we have dozens of different overlay encapsulation approaches; can we expect test vendors to adapt to them all?

The best approach is probably to provide what could be called a “testing API”, which probably doesn’t need to be anything more complicated than a way to establish a specific pathway then examine how traffic flows through it.  Since this pathway would look like a standard connection to the NaaS layer, you’d be able to get it in a standardized way even if the underlying encapsulation of the overlay network differed from vendor to vendor.
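As a sketch of what such a testing API might look like (the class and method names are mine, not any test vendor’s), the NaaS layer would let you set up a standard test pathway and get back flow statistics without ever exposing the underlying encapsulation:

```python
# Hypothetical "testing API": set up a test pathway and report on traffic
# through it, regardless of which overlay encapsulation is underneath.
# All measurements here are canned stand-ins.

class TestPathway:
    def __init__(self, endpoint_a, endpoint_b):
        self.endpoint_a, self.endpoint_b = endpoint_a, endpoint_b
        self.samples = []

    def inject(self, packets):
        """Send test traffic; a real implementation would hand this to the
        overlay's own encapsulation, which the caller never sees."""
        delivered = max(packets - 1, 0)          # pretend one packet was lost
        self.samples.append({"sent": packets, "delivered": delivered,
                             "latency_ms": 12.0})

    def report(self):
        sent = sum(s["sent"] for s in self.samples)
        delivered = sum(s["delivered"] for s in self.samples)
        latency = sum(s["latency_ms"] for s in self.samples) / len(self.samples)
        return {"loss": 1 - delivered / sent, "avg_latency_ms": latency}

path = TestPathway("branch-12", "dc-gateway")
path.inject(1000)
print(path.report())
```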

There’s still our reality of fixing virtual problems in the real world to contend with.  Do we really want to do “testing” anywhere except at the connection edge of a NaaS or at the physical facilities level?  What do we expect to learn there?  That’s something testing vendors will need to think about, particularly as we expand our goals for automation of service lifecycle management.  There’s little point to that if we then expect networking personnel to dive into bitstreams.

Again, simulation and AI can help.  We need to think more in terms of what abnormal conditions would look like and how they should be remediated, and less in terms of geeks with data line analyzers digging into protocols.  The rational goals for network automation, like the implementations of network modeling, should focus on intent-modeled elements that self-remedy within their scope.  If we set that goal, then AI and simulation are going to prove to be very powerful tools for us in the coming years.

Taking a Model-Side View of Ops Automation

Earlier this week I blogged about virtualization and infrastructure abstraction.  I left the modeling aspect of that to the end because of my intentional focus on the abstraction side.  Now I’d like to look at the same problem of virtualization from the modeling side, to see where we end up and whether conclusions from that direction support the conclusions you get when you start with the notion of an “abstraction layer”.

Operations automation, whether we’re talking about applications or service features, is usually based on one of two models that were popularized in the “DevOps” (Development/Operations) automation activity that started about a decade ago.  One model, the oldest, is an extension of the “shell scripts” or “batch files” routinely used on computer systems to execute a repetitively used series of commands.  This is now called the “prescriptive” or “imperative” model; it tells the system what to do explicitly.  The other model, the “declarative” model, describes the goal state of a system and then takes automatic (and invisible) steps to achieve it.
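The difference is easy to see in a small, hypothetical example.  The imperative version lists the steps explicitly; the declarative version states a goal and lets an (invisible) reconciler figure out what to do.  The resource names and commands here are made up for illustration.

```python
# Imperative: tell the system what to do, step by step.
imperative_steps = [
    "provision server-1",
    "install billing app on server-1",
    "start billing app on server-1",
]

# Declarative: state the goal; the tooling works out the steps.
declared_goal = {"app": "billing", "replicas": 3}

def reconcile(goal, current_replicas):
    """The invisible steps that drive the system toward the declared goal."""
    delta = goal["replicas"] - current_replicas
    if delta > 0:
        return [f"deploy {goal['app']} instance" for _ in range(delta)]
    if delta < 0:
        return [f"remove {goal['app']} instance" for _ in range(-delta)]
    return []

print(reconcile(declared_goal, current_replicas=1))   # two more instances needed
```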

What we call “intent modeling” today is an expansion of or variation to the declarative approach.  With an intent model, an application or service is divided into functional elements, each of which is represented by a “black box” that has externally visible interfaces and properties, but whose interior processes are opaque.  These elements don’t represent the functionality of the pieces in the data plane, but rather the lifecycle processes associated with them, which we could call the management or control plane.  If you tell an intent-modeled element to “deploy”, for example, the interior processes (that, you recall, are invisible) do what’s needed to accomplish that goal or “intent”.  Any element that meets the intent is equivalent to any other such element from the outside, so differences in configuration and infrastructure can be accommodated without impacting the model at the high level.
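In code terms, an intent-modeled element might look like the following sketch; the names are illustrative and the interior realization is deliberately a stub, because from the outside it’s supposed to be opaque.

```python
# Sketch of an intent-modeled element as a "black box": the outside sees only
# intents like deploy(); how the intent is met is hidden inside.  This is an
# illustration of the concept, not any standard's API.

class IntentElement:
    def __init__(self, name):
        self.name = name
        self.state = "defined"

    def deploy(self):
        """Externally visible intent; the interior realization is opaque."""
        self._realize()
        self.state = "deployed"

    def _realize(self):
        # Opaque interior process: could be VMs, containers, or white boxes.
        # Any implementation that meets the intent is equivalent from outside.
        pass

firewall = IntentElement("Firewall")
firewall.deploy()
print(firewall.name, firewall.state)
```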

This raises a very important point in modeling, which is that the model represents an operations process set associated with lifecycle automation.  There are process steps that are defined by the model, and others that are embedded within model elements, opaque to the outside.  Model elements in a good operations model could be hierarchical, meaning that inside a given element could be references to other elements.  This hierarchical approach is how a general function (“Access-Element”) might be decoded to a specific kind/technology/vendor element.  In most cases, the bottom of any hierarchical “inverted tree” would be an element that decoded into an implementation rather than another model hierarchy.
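A hypothetical sketch of that hierarchical decomposition, using made-up catalog entries, shows how a general “Access-Element” could decode either into further model elements or, at the bottom of the tree, into an implementation recipe:

```python
# Hypothetical decomposition catalog: an interior element decodes into further
# model elements; a leaf decodes into an implementation recipe.

MODEL_CATALOG = {
    "Access-Element":  ["Access-Ethernet", "Access-5G"],   # decomposes further
    "Access-Ethernet": "deploy_ethernet_access",           # bottom: implementation
    "Access-5G":       "deploy_5g_access",
}

def decompose(element):
    entry = MODEL_CATALOG[element]
    if isinstance(entry, list):
        # Interior node: a selection policy (cost, region, availability)
        # would pick among the children; we just take the first here.
        return decompose(entry[0])
    return entry                       # leaf: an implementation recipe

print(decompose("Access-Element"))     # -> deploy_ethernet_access
```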

The qualifier “in most cases” here arises from the fact that there’s an intrinsic limit to how low a model hierarchy can go.  A management system API, which is probably the logical bottom layer of a model, could represent anything from the specific control of an individual device to a broad control of features of a device community.  Remember the old OSI management model of element/network/service management layers?  If the API you’re manipulating in the end is a service-level API, then you can’t expect to model devices—they’re inside the bottom-level black box.

This means that there are really two levels of “automation” going on.  One layer is represented by the model itself, which is under the control of the organization who writes/develops the model.  The other is represented by the opaque processes inside the bottom model element, which is under the control of whoever wrote the implementation of those opaque processes.  In our example, that would be the implementation of the service management API.  Whatever isn’t in the bottom layer has to be in the top.

Another implication of this point is that if different products or vendors offer different levels of abstraction in their management system APIs, there might be a different modeling requirement for each.  If Vendor A offers only element management access, then services have to be created by manipulating individual elements, which means that the model would have to represent that.  If Vendor B offers full service management, then the elements wouldn’t even be visible in that vendor’s management system, and could not be modeled at the higher level.

This same issue occurs if you use a model above a software tool that offers a form of resource abstraction.  The Apache Mesos technology that’s the preferred enterprise solution for very large-scale virtualization and container deployments has a basic single-abstraction API for an arbitrarily large resource pool, and all the deployment and connection decisions are made within the Mesos layer.  Obviously you don’t then describe how those decisions should be made within your higher model layer.  If you don’t use something like Mesos, though, you need to have the modeling take over those deployment/connection tasks.

The cloud community, who I believe are driving the bus on virtualization and lifecycle automation, have already come down on the resource abstraction side of this debate, which to me means that for service lifecycle automation the network operator community should be doing the same thing.  That would align their work with the real development in the space.  To do otherwise would be to expect that network/service lifecycle automation and application/cloud automation will follow different paths with different tools.

My dispute with the NFV ISG, with ONAP, with the ETSI ZTA stuff, revolves around this point.  When we try to describe little details about how we’d pick the place to put some virtual function, we are discarding the experience of the cloud community and deliberately departing from their development path.  Can the networking community then totally replicate the cloud’s work, in a different way, and sustain that decision in an ever-changing market?  Can they do that when a hosted application component and a hosted virtual function are the same thing every way except in nomenclature?

That so many in the operator world are now infatuated with “cloud-native” functions only makes the separation of network functions and cloud applications sillier.  How cloud-native are you if you decide to build your transformation on developments that are in little or no way related to what’s going on in the cloud you’re trying to be native to?  It makes you wonder whether the term is anything more than fluff, an attempt to give new life to old and irrelevant initiatives.

We are, with developments like containers, Kubernetes, Mesos, Istio, and other cloud-related tools, reaching the point where the cloud has defined the ecosystem within which we should be considering future network operator transformation.  Surely everything above the connection layer of services is more like the cloud than like the network.  Much of what’s inside the connection layer (the hosted elements) are also cloud-like, and the way that connection/network devices relate to each other and to hosted components is totally cloud-like as well.  Why the heck are we not using these tools?  What ETSI should be doing now is not spawning new groups like ZTA, or continuing old groups that are only getting further off-track daily as the NFV ISG is.  They should be aligning their goals to the cloud, period.

One critical issue here is the tension between modeling and resource abstraction; in fact, abstraction in general.  We could accomplish a lot even in the short term if we could convince vendors in the networking space to think about “abstraction APIs” and how these APIs could integrate with modeling tools like TOSCA.  This kind of thing would not only benefit the network transformation efforts of operators, but also the cloud itself.  Resource abstraction is increasingly a part of the evolution of cloud tools, after all.

Along the way, they can raise some valid issues for the cloud community.  There is already cloud work being done in data-plane connectivity for hosted elements; it should be expanded to address the needs of virtualization of network data-plane elements like routers and switches.  There are constraints to hosting location or to “proximity” of components to each other or to common failure sources that have been pointed out in networks and are only now becoming visible in cloud services.  This can be a give-and-take, but only if everyone is part of the same initiatives.  Let’s get it together.

What’s Behind the Smartphone Woes?

We all know by now that both Apple and Samsung, giants in the smartphone space, have issued sales warnings.  What we don’t all know, and what in fact none of us really knows, is what this means to networking both at the service and the infrastructure level.  Smartphones are arguably the most significant single devices in tech history, so surely a seismic change in the market would mean something.  What?

How many of you remember Novell?  The company was a giant in networking in the ‘80s, and its NetWare technology was for a time the gold standard in workgroup networks.  By the end of the decade, Novell was clearly in trouble, an also-ran in the network space.  The reason was simple; feature starvation.  A workgroup network feature set is 90% or more about file and printer sharing.  Once you have that, willingness to pay for more is very limited.  Competition for the enabling features quickly kills profits on new sales, and the number of workgroups you can sell to is finite, which means new sales are harder to find in any case.

That’s where we are in smartphones.  While there are many new things we do with smartphones, those new things are almost entirely supported by the apps and not the phones.  What has kept the market afloat up to now?  Three things.

Thing number one is new smartphone markets, outside places like the US and EU.  Rapid growth in emerging economies, including China, opened new smartphone opportunities and that fed enough new users into the market to fend off the forces of commoditization.  Now, more and more of those new markets are penetrated to the point of sales tapering off.  Apple cited the softness in the China market as the reason for its shortfall, but it’s only a factor and not the full explanation.

The second thing that’s supported the market is curb appeal for new models.  Remember the long lines at Apple stores to get the latest iPhones?  Many of those in line wanted just that—the “latest iPhone”.  Not anything in particular about it, other than it was the latest.  Apple more than anyone else has appealed to the “yuppie market”, the upscale buyer who wants the latest and greatest of everything.  Today, we take smartphones more for granted, and the cachet of new models has less impact.

The third thing is application horsepower requirements.  Yes, it’s the apps that drive new uses for smartphones, but you still need the phone to run those apps acceptably.  In the period between 2014 and 2017, my research suggested that the non-call, non-text activities of smartphones doubled every year.  At the end of last year (2018), that usage was up only about 20% versus the prior year.

There are still new markets, but fewer than before.  Cachet for the latest and greatest phones is still a factor, but less than it has been.  Without the new-phone-every-year plans from wireless carriers, in fact, my model says that the average user would keep their phone for 28 months.  Most won’t run out of phone horsepower in that period because of new app demands.  That’s why we’re seeing issues with smartphone sales.

It also explains why we’re hearing so much about things like 5G or augmented reality.  The former would obsolete old phones at the radio level, encouraging or even forcing upgrades and new sales.  The latter would usher in a new set of apps that would likely demand more horsepower from phones.  The question is whether either or both of these “new things” could come about, and more specifically whether the smartphone revenue warnings could change the dynamic to promote them.

The most obvious and direct way to jump-start a refresh in the smartphone space would be for smartphone vendors to offer new 5G-ready models at no increase in price.  As these phones become available, operators who offer them would have a competitive incentive to offer 5G services, and other smartphone vendors and operators would be under competitive pressure to match this.  In time, if users demonstrated interest in the 5G models, we’d have the start of a refresh in the smartphone space.  The question would be how long this might take, since it presumes early smartphone buyers would value 5G capability even without networks to connect with.

Another approach that gets around this lack of service/smartphone synchrony would be to aggressively promote 5G both at the service level and the handset level.  What could happen is that the handset vendors would agree to sharp discounts for 5G handsets that operators deploy in a renewal of the new-phone-every-year pattern.  The operators would, in return, agree to deploy 5G (the Non-Stand-Alone or NSA version that’s really a radio-network upgrade only) in key market areas where the phone program was available.  That would get both ends of the radio side of smartphone 5G in place at the same time.  It spreads the risk to both operators and smartphone vendors, which is fairer at one level, but which also complicates the dynamic because both parties would have to believe the other wouldn’t blink.  Otherwise they could be stuck with a massive sunk upgrade.

The obvious question that this all raises relates to the other of our two possible drivers of change, the augmented reality or artificial reality stuff.  5G offers the potential for higher connection speeds, and of course that potential may be enough to activate our “curb appeal” driver, but actual benefits would be better.  It’s been difficult to cite specific needs for mobile broadband speeds greater than 4G LTE can provide, but the AR application is usually considered the most credible.  Even so, it has its problems.

The biggest problem is that virtually all smartphone apps are “socialized” more than marketed, meaning that users of the application promote it to their friends and co-workers.  That means that getting something started requires a fairly diverse initial population of users to “seed” the market, and that’s inhibited by the fact that AR apps require AR gear, which means an ecosystem has to be put into place before you even get to the question of broadband speed.  Then you have to ask whether the AR-receptive subset of the 5G population is large enough to build anything on, and whether requiring that AR and 5G both be available would limit the critical socialization effect.  If you decide you need fewer inhibitors, then you tie to 4G, and that means less pull-through of new smartphones.

The net of all of this is that it’s possible that concerted efforts by operators and smartphone vendors, driven by growing concern that the smartphone space is topping out, could drive 5G adoption forward faster.  It’s also possible that such a move would fail because it requires too much harmony across the market.  If anyone blinks, the deal sours for all.  In the long run, of course, you can’t expect a major wireless upgrade every time you want to drive a phone refresh.  We have to accept the facts here, which is that all tech markets tend to lose feature differentiation over time, and vendors lose pricing power and growth potential with the shift.

It All Comes Down to Resources

I blogged yesterday that virtualization was the key to operations automation, which in turn is the key to the future, the thing we needed to look for in 2019.  The central vision in cloud and virtualization is that of a resource pool.  With resource pools, the trick is to make them as inclusive in terms of resources as you can, while limiting the integration needed to bind resources to the virtualization’s hosting abstraction—VM, container, etc.  We’ve had a lot of advances in this space, but the problem is that they’re divided by technology and terminology even as they seem to be aligning in terms of mission.

SDxCentral did a nice piece at the end of the year on “composable infrastructure”, so that’s one term.  We’ve also had the Apache Mesos approach, “resource abstraction”.  The NFV ISG has a “virtual infrastructure manager”, Apstra has “intent-based” infrastructure, Juniper’s HTBASE has Jute, its “multi-cloud control and data fabric”…you get the picture.  Lacking a single clear term for what’s going on, we have a quest for differentiation that’s leading to a lot of confusion.

The biggest problem virtualization solves is that of resource diversity.  When a resource pool has diverse technologies in it, the software needed to commit, recommit, and connect resources has to reflect those differences somehow.  Otherwise, every time you change resource structure you have to change software, something that’s usually called a “brittle” approach because it breaks easily.

Let’s forget for the moment what people call their stuff and work with some neutral definitions.  There are two basic ways to deal with a diverse resource pool.  One is harmonization by plugin, meaning that the deployment function has a general feature set per resource type, adapted to specific resources with a plugin.  A lot of software works this way, including OpenStack.  The other is abstraction layer, meaning that there is a “layer” of functionality that presents resources in a consistent way to the higher (like orchestration) functions, and maps resources to that abstraction.
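Here’s a neutral, hypothetical sketch of the two approaches side by side.  The plugin classes and the abstraction layer are invented for illustration; real systems like OpenStack drivers or Mesos frameworks are obviously far richer.

```python
# Approach 1: harmonization by plugin -- the process knows the resource type
# and picks a plugin to adapt each step.  Plugin names are invented.
class VMwarePlugin:
    def deploy(self, workload): return f"vSphere-style deploy of {workload}"

class KVMPlugin:
    def deploy(self, workload): return f"libvirt-style deploy of {workload}"

PLUGINS = {"vmware": VMwarePlugin(), "kvm": KVMPlugin()}

def deploy_with_plugin(workload, resource_type):
    return PLUGINS[resource_type].deploy(workload)

# Approach 2: harmonization by abstraction -- the process asks the layer for
# "a host"; placement, resource type, and connection are under the covers.
class AbstractionLayer:
    def commit_host(self, workload):
        chosen = self._choose_resource(workload)    # hidden placement decision
        return f"{workload} hosted on {chosen} (connectivity set up implicitly)"

    def _choose_resource(self, workload):
        return "pool-node-17"                       # stand-in for real selection

print(deploy_with_plugin("billing-app", "kvm"))
print(AbstractionLayer().commit_host("billing-app"))
```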

The critical point often hidden in the discussion of these approaches is the relationship between the resources and the operations processes that act on the resources.  We have stuff we want to do, like deploy things or connect them, and these processes have to operate somehow on resources.  The two options for dealing with resource pools do that mapping differently.

Harmonization by plugin is the older approach to the problem of resource variability.  If you have a vision of the operations processes you want to apply to resources, then designing tools around those processes makes sense.  So does adapting the end result of each process (the “steps” or “actions”) to each specific resource control API.  The downside of this approach is that operations processes might vary depending on the specifics of the resource, and that’s even more likely when you cross over between different types of resources.  The cloud is not the same as the data center, and scaling within either isn’t the same as scaling across the boundary.

Where this tends to cut the most is where you look at virtualization in the broadest sense, when a “resource” is a distributed resource that has to be connected through networking, for example.  You want compute resources for a distributed application.  When you select a resource, you have to deploy it and then connect your workflow to/through it, so networking becomes a resource implicit to hosting.  In a harmonization by plugin approach, you can harmonize the deployment, but you still have to connect.  The single decision to host doesn’t give any context to the rest of the component locations, so the process controlling the plugins has to make the decisions on connectivity.

Another issue with the process-to-plugin approach is the likelihood that there are different levels of abstraction inherent in the APIs to the resources involved.  If you think of the traditional management hierarchy—element, network, service—you see that as you move up the ladder of management, you have a greater inherent abstraction of the resources.  Whatever level of abstraction you have at the resource control point has to be reflected in the process you design—if you have to manipulate individual devices then the process has to be device aware.  If you happen to have a service abstraction, the process that invokes it needs no knowledge of the devices.  When abstraction of resources is variable, it’s hard to keep that from complicating the process-level design.

Harmonization by abstraction is the opposite.  You commit a resource (the abstraction layer’s model) and everything that’s associated with that resource commitment is under the covers.  Where that resource is, how it’s connected, is also under the covers, which means that it’s possible for the abstraction-layer model to hide the implicit network connectivity needed.  The process at the hosting level doesn’t need to be aware of those details, and so virtualization can extend across a pool of resources that create a variety of complex connection issues without impacting the application that asks for a place to host something.

However…that capability introduces another layer (which some vendors position against in their marketing).  It also means that if you want to make connections disconnected from hosting decisions, you need a way of harmonizing that task with whatever implicit connections are being set up in your abstraction layer.  That, to me, means that it’s very difficult to do hosting or server abstraction without abstracting other interrelated resources like networking, databases, and so forth.

The database stuff is a superset of what I once called “Infrastructure Services”.  These are services available to applications using virtualization, in the same way that AWS or Azure “web services” are available to cloud applications.  They look a lot like a generalized set of application components, available to and shared by all.  These services, because they’re presumed to be generalized resources, have to be distributable to wherever you happen to put application components.  They’d need to be abstracted too, at least to the point where they could be discoverable.

Nobody has thought much about infrastructure service or database abstraction, beyond Hadoop (which has kind of boomed and busted in terms of popularity) or the web-service stuff inherent in the cloud.  The big problem with service abstraction is that since services are created by software, they’re highly variable.  Some users have suggested mechanisms for database abstraction to me that seem workable at least on a limited basis, but none of these would be suitable for the abstraction of a broader set of services.

Cloud-specific, virtualization-specific applications (“cloud-native” in the modern terminology) depend on having everything abstracted and virtualized, because anything that is not constrains the agility that virtualization can provide in other areas.  Nail one element of an application’s resources in place, and however flexible you make the rest, the whole structure is nailed to the ground.

The lesson of that truth is clear; network, database, feature, server, and any other form of abstraction has to be addressed, meaning resource abstraction in the most general sense is the only thing that can work in the long term.  That’s one reason why network abstraction, meaning network-as-a-service or NaaS, is so critical.  It’s also why we should focus our virtualization and cloud attention on the strategies that offer resource abstraction in complete form.

The primary benefit of generalized resource abstraction is that it allows a resource pool, however widely it’s distributed, to be represented as a single huge, expandable computer and at the same time a vast connection resource.  The services of the latter are harnessed to enable the former; all the distributed elements of hosting are effectively centralized through connectivity.  By doing this once, in an abstraction layer, all of the application and service lifecycle processes are immunized against the complexity and variability of “the cloud”.  You deploy something, and deploy it the same way wherever it goes.  The cloud is no more complicated than a single server, because virtually speaking, that’s what it is.

We do have a number of initiatives aiming for this “total virtualization” approach, but nothing is quite there in terms of features or even just positioning.  The latter needs to be done even more than the former, because without a strong explanation of where this is heading and a reason why users would need to go there, features won’t matter.

Users themselves are partly to blame for this state of affairs.  We’ve become accustomed to very tactical thinking in IT and networking because we’ve shifted the focus more and more toward simply doing the same for less money.  Large-scale architectural changes tend to defer benefits to the long term while focusing costs in the present.  Vendors hate this too because they delay getting their quarterly quota fodder.  Finally, obviously, it’s a lot harder to explain massive change than little tweaks, and that means that the political coverage the media usually provides for senior management is limited.  Where risk of decisions rises, decisiveness declines.

The good news, of course, is that very avalanche of partial but increasingly capable solutions we’re getting out of the cloud side of the industry.  We’re creeping up on a real strategy here, and while it will take longer than it could have taken had we realized from the first where we were heading, I think it’s obvious now that we’re going to get to the right place, and that’s important news for the coming year.

2019: It’s About Facing Reality

How different is January first from December thirty-first?  Not much in most ways, and of course it’s after January first now, but it’s still true that businesses think in terms of quarters and years, and so it’s fair to look ahead at our annual transition point to see what important things are coming down the pipe.  That’s particularly true this year, because we’re reaching critical mass in the most important technology transition of all.

Let me start by referring to an interesting Red Hat survey blog on businesses’ IT priorities.  What I find interesting here is less the absolute numbers, or even the way that priorities line up for this year and last, but rather the pattern of focus year over year.  I think you could read the chart as saying that we started to face reality in 2018.  Look at the 2017 priorities; cloud and security.  These are the tech equivalent of mom’s apple pie; they’re the kind of response you’d expect someone to offer off the cuff.  Look at the top priority for 2018—operations automation.  That’s reality.

Well, it’s at least a realistic goal.  The challenge for operations automation, which I’ve been calling lifecycle management, is that it’s easy to love it and hard to do it.  Translating the goal into realistic technical steps is critical to fulfilling it, and we’ve so far not been able to do that.  I think we’re starting to see, even if it’s still through some fog, the shape of the concept that’s key to our achieving IT goals…virtualization.

There’s implicit evidence that nearly everyone agrees virtualization is the key to the future, tech-wise.  The problem hasn’t been accepting the primacy of virtualization as the issue of our time, but realizing it in a way that harvests all the reasons for putting it on top in the first place.  Virtualization of both applications and resources provides the mechanism for generalizing tools, which is the only path that can lead to a good place in terms of lifecycle management for either applications or services.

That means having the right abstraction and the right implementation of how it’s realized.  The abstraction part of virtualization creates what appears to a user as a “real” something, and the realization implements the abstraction to make it usable.  A “virtual machine” or VM has the properties of a real server, and we host VMs on a real machine via the intermediary tool of a hypervisor.  Anything that implements the abstraction fully is equivalent to everything else that does so, but the “fully” part is important.

The nice thing about virtual machines is that we know what the abstraction is—a machine.  That makes realization fairly straightforward too, because there’s a single accepted model of features and capabilities that an implementation of a virtual machine has to meet.  Even with VMs, though, there are some vague aspects to the abstraction that could bite users.  For example, are VMs resilient?  If so, there should be some specific parameters relating to how fast a replacement is offered and whether what’s run on the old is somehow moved to the new.  Users of VMs today know that implementations vary in those important areas.

The abstraction that NFV is based on, the “virtual function”, is a good example of excessive fuzz.  The best we can say for definition is that a VNF is the virtual form of a physical network function, meaning a device or appliance.  The obvious problem is that there are tens of thousands of different PNFs.  Obviously an abstract VNF representing a “router” is different from one representing a “firewall”, so what’s really needed with NFV (and in general with virtualization) is the notion of a class hierarchy that defines detailed things as subsets of something general.

We might, for example, have a “Class VNF”.  We might define that Class to include “Control-Plane-Functions”, “Data-Plane-Functions”, and “Management-Plane Functions”.  We might then subdivide “DPFs” into functional groups, one of which would be “routers” and another “firewalls”.  The advantage of this is that the properties that routers and firewalls share could be defined in a higher class in the hierarchy, and so would be shared automatically and standardized.  Further, everything that purported to be a “Firewall” would be expected to present the same features.
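In programming terms, the idea is ordinary class inheritance.  This sketch uses invented attributes rather than anything the NFV ISG has defined, but it shows how properties defined at a higher class are shared automatically, and how anything claiming to be a “Firewall” presents the same features.

```python
# Shared properties live in the higher classes; the attributes and methods
# here are illustrative, not an NFV ISG definition.

class VNF:
    planes = ("control", "data", "management")

class DataPlaneFunction(VNF):
    def forward(self, packet):             # contract shared by every DPF
        raise NotImplementedError

class Router(DataPlaneFunction):
    def forward(self, packet):
        return f"routed {packet} by longest-prefix match"

class Firewall(DataPlaneFunction):
    def forward(self, packet):
        return f"passed {packet} after rule check"

# Any Router or Firewall is interchangeable wherever a DPF is expected.
for vnf in (Router(), Firewall()):
    print(type(vnf).__name__, "->", vnf.forward("pkt-1"))
```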

The primary challenge in realizing an abstraction is making sure your implementation matches the features of the “class hierarchy” the abstraction defines.  That’s not always easy, not always done, but it’s at least broadly understood.  The challenge in realizing virtualization comes in how we visualize the entire implementation and map it to hardware.  If there is no systemic vision of “infrastructure” then implementations of abstractions tend to become specialized and brittle (they break when you bend them, meaning make changes).

In modern virtualization abstractions, the goal is to utilize a collection of resources that are suited to fulfill parts of our implementation—a “resource pool”.  That resource pool is essentially a distributed platform for realization, and to make it useful it’s important that everything needed to utilize the distributed elements is standardized so everything uses it consistently.

What’s happened in 2018 in cloud computing is that we’ve taken some specific steps to do what’s necessary.  We’ve created a standard framework into which applications deploy, which we can call “containers”.  We’ve settled (pretty much) on a way of deploying application components into containers in an orderly and consistent way, which we call “orchestration” and for which we’re increasingly focused on Kubernetes.

What we realized in 2018 was that this basic container/Kubernetes model wasn’t enough.  We need to be able to generalize deployment not only across different container hosts, but also between data centers, in a hybrid cloud, and in multi-cloud.  There are a variety of tools that have emerged to fulfill this extension mission, just as there’ve been a variety of container orchestration tools.  We settled on one of the latter in 2018—Kubernetes.  We should expect to settle on a cloud-extension tool in 2019.  Stackpoint?  Rancher?  We’ll see.

There’s less disorder in the next extension to the basic 2018 model.  When you distribute resources you have to distribute work, which is a lot more than simple load balancing.  Google, who launched Kubernetes, has started to integrate it with Istio, a service mesh that governs how work flows among distributed components, and that shows a lot of promise in framing the way that pools of resources can share work.

The missing ingredient in this is virtual networking.  The Internet and IP have become the standard mechanism for connectivity.  IP is highly integrated into most software, which means that we have to support its basic features.  The Internet is the perfect fabric for building OTT applications, which is why we’ve built them using it, but for a given application its general, universal connectivity is a problem.  What we’ve known for a decade is that we need “network-as-a-service” (NaaS), meaning the ability to spin up an IP network extemporaneously but have it behave as though it were a true private network.

We have “virtual network” technology in the data center, and Kubernetes in fact has a plug-in capability to provide for it.  However, this is generally focused on what could be called “subnet connectivity”, the subnet being where application components live in a private address space.  We also have, with SD-WAN, technology to virtualize networking in the wide area, meaning within a corporate VPN.  There are some commercial products that offer these two in combination, and you could of course assemble your own solution by picking a product from both spaces.  What’s lacking is the architectural vision of NaaS.
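A hypothetical sketch of what that missing architectural vision might look like: a single NaaS service that admits members and, under the covers, attaches each one to either the data-center virtual network or the SD-WAN overlay.  Both “backends” here are trivial stand-ins, of course, not any vendor’s product.

```python
# One NaaS service spanning data center and WAN; both backends are stand-ins.

class DataCenterVirtualNetwork:
    def connect(self, member): return f"{member} attached to subnet overlay"

class SDWANOverlay:
    def connect(self, member): return f"{member} attached to WAN overlay"

class NaaS:
    """Presents one private connective community; picks the underlay per member."""
    def __init__(self):
        self.dc, self.wan = DataCenterVirtualNetwork(), SDWANOverlay()
        self.members = []

    def add_member(self, member, location):
        backend = self.dc if location == "datacenter" else self.wan
        self.members.append(member)
        return backend.connect(member)

service = NaaS()
print(service.add_member("app-pod-3", "datacenter"))
print(service.add_member("branch-42", "branch"))
```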

Should a network service spin up to connect the elements of an application or service, in the data center, in the cloud, in the real world?  How can it be made to handle things like web servers that are inherently part of multiple user services or applications?  These are questions that illustrate the essential mission trade-off of NaaS.  The more usefully specific it is, the more difficult it is to harmonize NaaS with the behavior of current web, network, and information technologies.  I can see, and surely many can see, how you could address trade-offs like the one I’ve shown here, but there’s no consensus and very little awareness of the problem.

That wasn’t always the case.  Earlier network technologies recognized the notion of a closed user group, a subset of the total possible connective community that, for a time at least, communicated only within itself.  These are fairly easy to set up in connection-oriented protocols like frame relay or ATM because you can control connection by controlling the signaling.  In datagram or connectionless IP networks, the presumption of open connectivity is present at the network level and you have to interdict relationships you don’t like.  Since source address validation isn’t provided in IP, that can be problematic.

While classic SDN (OpenFlow) isn’t an overlay technology, the first data center virtual networking technology (Nicira, later bought by VMware to become NSX) is.  Since SD-WAN is also an overlay technology, it’s fair to wonder (as I did in an earlier blog when VMware bought VeloCloud and created “NSX SD-WAN”) whether somebody might unite the two with not only a product (which VMware and Nokia/Nuage both have) but also with effective positioning.  In point of fact, any SD-WAN vendor could provide the necessary link, and if classic network principles for containers (meaning Kubernetes) are adopted you can really “gateway” between the data center and WAN without excessive complexity…if you know why you want to do it.

Why haven’t we fitted all the pieces together on this already?  We’ve had the capability to do the things necessary for four or five years now, and nobody has promoted it effectively.  That may be because everyone is stuck in local thinking—keep your focus on what you expected to do.  It may be because vendors fear that a broad NaaS mission would be too complicated to sell.  Finally, it may be because the SD-WAN space is the logical place for the solution to fit, and vendors there are uniformly poor at marketing and positioning.  Without a compelling story the media can pick up on, market-driven changes in networking and virtualization are unlikely.

Competitive-driven changes, then?  The Oracle deal for Talari is, as many have pointed out, the first acquisition of an SD-WAN vendor by a cloud provider.  Of course, Oracle’s cloud isn’t exactly setting the market on fire, but that might be the motivation Oracle needs.  A strong offering in SD-WAN, combined with strong features in containers and orchestration, could give Oracle something to sing about—at least for as long as it took Amazon, Google, and Microsoft to move.  As I noted in a prior blog, the cloud providers may end up being the source of virtual networking advances from now on, as well as the source of full “cloud-native” capability.

That may be the missing link here.  Virtualization has progressed recently because there was a concept—containers—that offered both tactical and strategic benefits.  The realization of the former is leading us closer to the latter, influenced in large part by the cloud.  That realization is driven at the technical level, meaning it doesn’t really require market socialization to happen.  I’m leaning toward the view that it’s going to be some sort of NaaS extension to the container/virtualization challenge that ends up creating the full spectrum of virtualization features we need, including NaaS.  That might come from a cloud provider or from a premises IT player looking for hybrid cloud.  Wherever it comes from, I think we’re going to see it emerging in 2019.

That brings us to the last step, the one that closes the loop on the Red Hat blog entry I opened with.  To automate the relationship between abstract applications/services and real resources, you need modeling.  A model, meaning a data-driven description of the relationships of abstract elements and the recipes for realizing them on various resource alternatives, is the key to automated lifecycle management.

Models are at the top of the whole process, so starting there might have been nice, but the reality is that you can’t do service or application lifecycle automation without the framework of virtualization fully developed.  Since that framework is logically compartmentalized, we really have to expect the modeling to retrospectively absorb what’s being standardized below.  It’s also true that the whole model thing is only important insofar as we have some standard pieces included.  The details will matter eventually, because without model portability among users it’s difficult to see how we could ever fully harmonize lifecycle management.

We need models to do two things.  First, to represent abstractions as either the decomposition of a primary model into subordinate ones, or into steps to realize the model on infrastructure.  Second, to steer events to processes based on the state of the target model element(s) in the structure.  The need for “model specificity” lies in the fact that neither of these can be done using a generalized tool; we would always need to decompose the specific model structure we used.
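To illustrate those two roles with a deliberately simplified, hypothetical model (none of this is TOSCA or any standard’s syntax), consider the following sketch:

```python
# Role 1: decomposition of a primary model into subordinates or realization
# steps.  Role 2: steering events to processes based on element state.
# Structures and names are illustrative only.

SERVICE_MODEL = {
    "vpn-service": {"children": ["access", "core"]},
    "access":      {"recipe": "deploy_access"},
    "core":        {"recipe": "deploy_core"},
}

EVENT_MAP = {   # (state, event) -> process to run
    ("ordered", "activate"): "decompose_and_deploy",
    ("active", "fault"):     "remediate",
}

def decompose(name):
    node = SERVICE_MODEL[name]
    if "children" in node:
        return [step for child in node["children"] for step in decompose(child)]
    return [node["recipe"]]

def steer(state, event):
    return EVENT_MAP.get((state, event), "log_and_ignore")

print(decompose("vpn-service"))     # ['deploy_access', 'deploy_core']
print(steer("active", "fault"))     # 'remediate'
```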

All this ties together because true infrastructure virtualization, which has to virtualize everything we’d call “infrastructure”, can define virtual networking within itself, exposing only the address sockets that will connect to corporate VPNs or the Internet.  SD-WAN can virtualize the former, and the latter is something we’ll have to think about in 2019.  Should we presume that Internet relationships are transient closed-user-group relationships?  There would be benefits, and of course scalability costs.  Security already costs, so perhaps a trade-off point can be found.

Maybe December of 2018 wasn’t as different from January 2018 as many had hoped—including me.  Maybe that will be true in 2019 as well, but I don’t think so.  Happy New Year!