Composed versus Abstracted Resources: Why it’s Critical

The blog I did yesterday on Cisco’s approach to edge or universal distributed computing got me some email questions and raised some LinkedIn comments.  These combine to suggest that spending a little time on the issue of resource abstraction or “virtualization” versus infrastructure composition might be useful.  As always, it seems, we’re hampered by a lack of consistency in definition.

The “composable infrastructure” or “infrastructure as code” movement seeks to allow what is much like a real server to be assembled from memory, CPU, storage, etc.  This presupposes you have very tight coupling among the elements that you’re going to compose with, so that the result of your composition is an efficient almost-real machine.  “Abstract infrastructure”, “resource abstraction”, or “virtualization” aim at a different target.  The goal there is to define a model, an abstraction, a virtual something, that has consistent properties but will be mapped to a variety of actual hardware platforms.

It’s my view that composable infrastructure has specialized value while resource abstraction has general value.  Cloud computing, containers, virtual machines, and virtual networks all prove my point.  The reason is that if you define the right resource abstraction and map it properly, you can build applications so their components exploit the properties of the system you’ve created.  There is then uniformity in the requirements at the “composable” or real-server level (which is why you can map an abstraction to a general pool of resources), so you don’t need to compose anything.

I’m not dismissing the value of composable infrastructure here, but I am suggesting that it’s not as broadly critical a movement as the resource abstraction stuff.  You could use composable infrastructure inside a resource pool, of course, but if the goal is resource and operations efficiency, it’s probably better to have resources that are more uniform in their capability so that you can optimize utilization.  Differences in resources, which composable infrastructure enables and even encourages, work against the resource equivalence that is the foundation principle of efficient resource pools.

There also seem to be two different approaches to resource abstraction.  OpenStack epitomizes the first and oldest approach, which is based on abstracting through process generalization.  The lifecycle tasks associated with deploying an application or service feature on infrastructure are well-known.  If each task is provided with an API, and if the logic for that task is then connected to a “plug-in” that specializes it to a given resource or class of resources, then invoking that task with the proper plug-in will control the resource(s) you’ve assigned.  The newer approach is based on a virtual model and layer.  With this approach, you define a virtual resource class, like a server with memory and storage, and you define an API or APIs to do stuff with it.  Your processes then act directly on the model APIs; there are no visible plug-ins or adaptations.  Under the model you have a new layer of logic that invisibly maps the virtual model to real resources.

The difference between these approaches can be understood by thinking for a moment about a resource pool that consists of servers in very different locations, perhaps some in the cloud, some at the edge, and some in the data center.  When you want to assign a resource to a given application/feature, the tasks associated with deciding what resource to assign and making the assignment are likely at least somewhat (and often radically) different depending on where the resource is.  Network connectivity in particular may be different.  In process-generalized virtualization, this may mean that your processes have to know about those differences, which means lifecycle automation has to test for the conditions that could change how you deploy.  In virtual-model resource abstraction, you make all these decisions within the new layer, so your processes are simple and truly resource-independent.
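
To make the contrast concrete, here’s a minimal sketch in Python.  Every class and function name is invented for illustration rather than drawn from OpenStack or any real platform; the point is simply that in the first style the lifecycle process picks the plug-in, so it has to know where the resource lives, while in the second the mapping layer makes that decision invisibly.

```python
# Hypothetical sketch of the two abstraction styles; nothing here is a real API.

# Style 1: process generalization -- the lifecycle process selects the plug-in.
class DeployTask:
    """One lifecycle task, specialized to a resource class by a plug-in."""
    def __init__(self, plugin):
        self.plugin = plugin                    # edge, cloud, or data-center driver

    def run(self, workload):
        # The caller had to know which plug-in applies, so location and
        # connectivity differences leak upward into lifecycle automation.
        return self.plugin.deploy(workload)

# Style 2: virtual model and layer -- the mapping layer hides those decisions.
class VirtualServer:
    """The abstraction the processes see: consistent properties, one API."""
    def __init__(self, cpu, memory_gb, storage_gb):
        self.cpu, self.memory_gb, self.storage_gb = cpu, memory_gb, storage_gb

class MappingLayer:
    """The invisible layer that binds the virtual model to real resources."""
    def __init__(self, drivers):
        self.drivers = drivers                  # {"edge": ..., "cloud": ..., "dc": ...}

    def realize(self, vserver, workload):
        # Location-specific logic lives here, not in the lifecycle processes.
        target = "edge" if workload.get("latency_ms", 100) < 10 else "cloud"
        return self.drivers[target].deploy(workload)
```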

You can also explore the differences between these two abstraction approaches by looking at the current cloud infrastructure platform software space.  There are two proven-at-scale approaches to “hosting”.  The first is the Apache Mesos and DC/OS combination (Mesosphere provides a commercially supported version of the combination, along with the Marathon orchestrator), which is very explicitly a virtual-model-and-layer tool.  The second is what we could call the “Kubernetes ecosystem”, the container-based strategy evolving piece by piece, and it’s what has to go into that ecosystem that’s interesting.

Kubernetes is an orchestrator.  You have a plug-in point for a virtual-network tool as well.  You need to have an additional component to provide distributed load-balancing and workflow control—Google-sponsored Istio is a prime example.  You need a strategy to distribute the control plane of Kubernetes, like Stackpoint.  Maybe you use a combination tool like Juke from HTBASE, which Juniper is now acquiring, or Rancher.  You can probably see the point here; all this ecosystemic stuff has to be added on, there are multiple options for most of the areas, and your overall process of lifecycle automation is likely impacted by each selection you make.  If you simply push everything into a lower layer and make it invisible, your operational life is much easier.

This is why I said, in yesterday’s blog, that Cisco’s “stuff-Anywhere” expansion was a good idea as far as it went, but didn’t include a critical piece.  Any resource abstraction strategy should promote vendor independence, efficient use of resources, and hybrid and multi-cloud operation.  It also has to support efficient and error-free operations practices, and if you don’t stick the variables in an abstraction-independent lower layer they rise up to bite your operations staff in the you-know-where.  That bites your business case in the same place.

You might wonder why we don’t see vendors touting ecosystemic resource abstraction if it’s so clear this is the long-term winning strategy.  The answer is simple; it’s easier to get somebody to climb to the next ledge than to confront them with the challenge of getting all the way to the summit.  Vendors aren’t in this for the long term, they’re in it for the next quarter’s earnings.  They push the stuff that can be sold quickly, which tends to mean that buyers are moved to a short-term goal, held there till they start to see the limitations of where they are, then presented with the next step.

There’s a downside to the model-and-layer approach to resource abstraction; two, in fact.  The first is that the underlying layer is more complex and has to be kept up to date with respect to the best tools and practices.  The second is that you are effectively creating an intent model, which means you have to supply the new layer with parametric data about the stuff you’re deploying, parametric data about your resources, and policies that relate the two.  This stuff is easier for some organizations to develop, but for those used to hands-on operation, it’s harder to relate to than process steps would be.

This same argument has been used to justify prescriptive (script-like) versus declarative (model-like) DevOps, and I think it’s clear that model-based is winning.  It’s very easy to recast any parameter set to a different format as long as all the necessary data is represented in some way.  It’s easy to get parametric values incorporated in a repository to support what the development community is calling “GitOps” and “single source of truth”.  In short, I think it’s easy to say that the industry is really moving toward the resource abstraction model.

Which is where the media, including people like me who blog regularly, should step in.  Vendors may not want to do pie-in-the-sky positioning, but if enough people tell the market which way is up, we can at least hope to get our ledge-climbers headed in the right direction.

You Can’t Dismiss the Latest Cisco “Data Center Anywhere” Tale

You could argue that SDN evolved to SD-WAN, then perhaps to SDN plus SD-WAN.  Now Cisco wants to evolve SD-stuff to another level, which they call “Data Center Anywhere”.  In concept, this combines aspects of software-defined networking and edge computing to permit not only uniform access to central resources but also access to distributed resources.  As always with Cisco, we have to look at the idea through the combined lens of marketing and positioning and of technology and strategy.  And yes, I know, this is another Cisco-centric blog, but with the Cisco Barcelona event fresh in our minds we have to look at their stuff to see where they’re leading customers.

The classic notion of corporate IT is the company data center located in some major site, a bunch of super-empowered workers local to that data center, and a group of decentralized (branch-office and remote) workers located elsewhere.  SDN technology targeted data center networking, especially in the era of the cloud, and SD-WAN technology targeted the decentralized worker community, particularly those in sites too thinly connected to justify traditional VPNs.

A parallel concept has emerged recently on the IT resources side.  Edge computing presumes not a distribution of workers but a distribution of IT assets.  So, in truth, does cloud computing, particularly hybrid and multi-cloud (which IMHO are dimensions of the same trend).  The notion of “data center anywhere” is one founded on the idea that information processing and storage is usually more effective if it’s done near to a major point of consumption, if indeed consumption of information is concentrated in some small number of major places.

If you look at this from the top, you find that there are three issues to be addressed in “everywhering” the data center.  One is connectivity; you have to be able to connect people not just to a central resource but to a distributed set of resources.  Two is bandwidth; if information is stored and processed at the edge, then you need more bandwidth there to connect users not in that location to the information.  Three is hosting; you need to be able to economically deploy and manage distributed resources when the industry has been thinking for a decade or more about “server consolidation” to eliminate those edge resources.

Cisco’s approach is based on HyperFlex 4.0, which is an extension of Cisco’s data center model to the edge.  Cisco links this with HyperFlex Anywhere (branch platform) and ACI Anywhere (networks).  You could argue that this combination addresses all of the issues I’ve noted in the last paragraph.

While a lot of the story is about HyperFlex, the heart of Cisco’s approach, from a platform perspective, is ACI Anywhere because it’s the platform framework that lets Cisco build not just an edge-hosting model but a distributed resource model that includes the cloud.  I like the approach, but there are still a couple of soft points.  One is the focus on IoT, and the other is the lack of specific support for infrastructure abstraction.

Enterprises tell me that there is indeed a desire to distribute IT assets more, but it’s not as much the IoT-centric one that Cisco touts.  Less than 8% of enterprises, according to my contacts, are planning data centers around future IoT applications.  The real issue in the near term is a combination of resiliency and information specialization.  In both financial and healthcare, the two verticals I do the most enterprise work in, there’s a natural distributability of information arising from the “point of service” concept.  There’s also a strong desire to avoid having a single problem, ranging from power to natural disaster, take out all your IT.

I think that despite Cisco’s positioning hype, they’re actually aligning their HyperFlex approach to the current market drivers, and relying on sales initiative to introduce the bandwidth and connectivity solutions that may also be needed.  Thus, I think they have a viable model that supports their goals…except perhaps as the framework for tracking cloud evolution.

The real question for Cisco, its competitors, and the buyers in the space, is how this all fits into resource abstraction.  A highly distributed edge-computing framework demands an application orchestration and lifecycle automation model that optimizes it.  Add in the reality of hybrid cloud and you have a clear need for some infrastructure abstraction or composable infrastructure story.  Cisco provides that, but it does so based on common orchestration and tools, which exposes infrastructure complexity at the operational level.  That’s dangerous for a number of reasons.

First, applications should be designed for the hosting environment, which means that if you expect them to be distributed through a data-center-plus-edge-computing and hybrid cloud environment, you should design them that way and operationalize them within that assumption.  If the data center is treated independently, not only of the abstraction model but also the cloud, you could design yourself into a corner.

Second, the lack of an abstract resource approach to hosting puts an optimum, or even a strong, business case at risk.  There are too many things that have to be added on to the Cisco framework to make it resource-independent, and too many variations in what might be selected.  Some pieces might even be missed, and that could erode the business case.

Third, how the Cisco approach navigates the inevitable challenge of transitional states is unclear.  If you have abstract infrastructure, transitioning from a single data center to a far-flung distributed and hybrid cloud framework would be invisible, taking place inside an intent model and under the covers.  If it’s not invisible, it has to be strategized and accommodated, and each step creates risks to the evolution.

I don’t think this adds up to a miss for Cisco, though.  Right now, infrastructure abstraction is an outsider in terms of enterprise IT planning, and Cisco is well known for avoiding a market-leading position.  Why take a risk and develop markets that competitors share when you can stomp on them if the market establishes itself?  As always, though, the fast-follower approach means that a determined (and strategically influential) player could derail Cisco’s plans.

Among the network equipment rivals, there’s nobody likely to step up.  Juniper has often been seen as the most direct rival to Cisco, and for a time was out ahead of Cisco in terms of recognizing and exploiting market trends.  They made a smart acquisition in HTBASE, one they might have hoped would gain some traction in the virtual-infrastructure space, but they’ve failed so far to establish even a minimal mission for it that would threaten Cisco.  Not only that, their quarter was viewed as “bad” by the Street, which puts pressure on Juniper to deliver short-term gains when longer-term insight is really what’s needed.

VMware and Red Hat/IBM are the obvious candidates to advance the strategy of infrastructure independence.  The advantage they have is a platform-based versus server-based approach to hosting.  If you focus on hardware sales, it’s hard to talk much about hardware abstraction or you end up letting everyone’s products in on your strategy.  That’s probably why Cisco doesn’t push the concept, but there’s no barrier to these other two competitors doing so.

A less obvious but perhaps more dangerous set of competitors are the cloud giants, meaning Amazon, Google, and Microsoft.  The cloud is by its very nature an abstract hosting point.  The cloud has also pioneered in creating abstract models of hosting that cross over into the data center, given that hybrid cloud is the enterprise strategy of choice.

The thing is, even if Cisco just sings pretty everything-everywhere songs, they’ll get a lot of prospects singing along with them.  They’ll get media coverage, set the mindset of senior management, and create at least some form of competitive response.  In short, they’ll help to bring about the thing they’re talking about, even if their own solution is imperfect.  A marketing trial balloon sounds cynical, but it’s a smart strategy.  “Run it up the flagpole and see if anyone salutes” used to be a marketing cliché, but that doesn’t mean it’s not a wise move, and Cisco at the least is making it.

The Ups and Downs of Cisco’s “Self-Publishing” Network

Light Reading had an interesting article that featured a Cisco presentation on “The Self-Publishing Network”.  The points Cisco’s quoted on are interesting, and while I don’t fully agree with them, I think they reflect some important changes in the way we visualize networks, network services, and service lifecycle automation.

The basic premise is that network operators need to think of their networks in three layers—resources, orchestration/automation, and operations/business support (OSS/BSS).  I’ve said much the same thing in my own blogs, with the same layers in fact, so I don’t disagree with that premise.  We used to have a two-layer structure where OSS/BSS linked (via manual processes) directly to the resource layer, but we’ve moved into an age of APIs and models, and that introduces the orchestration and automation layer in between the two original layers.

The article and Cisco also make a good point about what I’ll call “resource modeling”, the use of things like YANG and NETCONF to control network resources and provide a vendor-independent approach to coercing service-cooperative behavior from switches and routers.  In my view, though, this really creates a kind of “sublayer” structure within the middle orchestration/automation layer.  As I noted yesterday, I postulated in my ExperiaSphere work that network services were made up of a service layer and a resource layer, each having its own orchestration/modeling, joined by having the service layer bind to “Behaviors” of the network asserted by the resource layer.

All of this works fine as long as we’re talking about networks built up from switching and routing devices.  The challenge comes when you add in hosted features, either augmenting/replacing traditional switching and routing or living above them, perhaps all the way up to the OTT layers where experiences rather than connections are the service targets.  When you get to hosted stuff, you run into the problem I’ve noted before (including yesterday), which is that management of the service at the functional level has separated from managing the resources that make the services up.

The article quotes the Cisco spokesperson as advocating the abandonment of things like SNMP and CORBA as “exotic”, in favor of network-centric stuff like YANG.  Even for connectivity services like IP or Ethernet, this doesn’t recognize the fact that a set of software-generated and server-hosted features have to be orchestrated at the resource level, and should be orchestrated more like the cloud works than like network devices work.

The Cisco model, stated in what I’d call “cloud terms”, would be something like this.  At the top, you have a commercial service offering that includes a mixture of functions, some related to traditional connectivity and some to non-connection features.  The commercial offering would be realized by a service-level model, from which the deployment of the service would be controlled (and automated).  The bottom of the service-level model branches would be bound across to resource behaviors, some of which would be native network device behaviors (for which YANG is fine) and some of which would be software-hosted feature behaviors for which something cloud-like such as TOSCA would be more sensible.  These resource behaviors would control the actual infrastructure.
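
A hedged sketch of that layering, in Python, might look like the following.  The “Behavior” term follows my ExperiaSphere usage, and the handler functions and service-model contents are purely illustrative; they’re not Cisco’s interfaces or anyone else’s real APIs.

```python
# Illustrative only: a service-level model bound to resource-layer behaviors.
class Behavior:
    """A resource-layer capability published upward to the service layer."""
    def __init__(self, name, handler):
        self.name = name
        self.handler = handler                  # knows how to realize the behavior

    def realize(self, params):
        return self.handler(params)

def yang_device_behavior(params):
    # Native network-device behavior: push a YANG-modeled config (e.g. via NETCONF).
    return f"netconf-edit-config({params})"

def tosca_hosted_behavior(params):
    # Software-hosted feature behavior: hand a TOSCA-like template to a cloud
    # orchestrator rather than to a device controller.
    return f"deploy-cloud-template({params})"

# The service model's leaf elements bind to behaviors without caring which
# kind of handler sits underneath them.
service_model = {
    "vpn-connectivity": Behavior("ip-vpn", yang_device_behavior),
    "hosted-firewall":  Behavior("fw-feature", tosca_hosted_behavior),
}

for element, behavior in service_model.items():
    print(element, "->", behavior.realize({"element": element}))
```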

There is absolutely no way you could contend that YANG was a viable model for deploying applications in the cloud.  Why then should we be even thinking about it as a means of deploying features (which, in software terms, are equivalent to application components) in the cloud?  As always, there are two possible explanations for this.  First, Cisco is being Cisco-the-marketing-giant, and since it effectively owns YANG (having bought Tail-f, who was the primary developer/promoter of it), is simply trying to own that middle orchestration/automation layer.  Second, Cisco has IP blinders on.

If the first explanation is true, Cisco has a problem because the operators are really looking for strategies that support carrier-cloud-based services in the long run.  While I think NFV has gone way off track, there are already many in the NFV community who think “cloud-native” is the way to go.  NFV even now is based mostly on TOSCA-related modeling.  5G, which promotes hosted features, would surely drive operators more in the TOSCA/cloud direction as it deploys.  You can’t own the orchestration/automation layer by promoting a modeling approach that’s already been rejected.  Still, Cisco has a history of pushing its own approach in defiance of logic and market commitment for as long as it can, then pretending it had the other (right) approach all along.

If the second explanation is true, then Cisco is stuck in IP-neutral.  They think of “services” as being nothing more than IP connectivity.  Operators are doomed to build only dumb networks, using of course Cisco devices.  This would be, IMHO, a worse problem for Cisco because it would risk isolating Cisco from where operators know they need to be going: toward higher-level, higher-revenue services as well as toward the implementation of agile virtual elements of mobile and content services.  It’s bad enough to get the modeling for these new opportunities wrong (which the first explanation would suggest), but to get the mission wrong altogether would be a big problem.

I also think Cisco is wrong in proposing that OSS/BSS systems be modernized in orchestration terms, unless you want to make service orchestration, as a sublayer, a part of the OSS/BSS process, which flies in the face of the way that service orchestration has to tie in specific features and functions.  In any event, I think the clean approach is to assume that the top of the resource layer exposes the abstractions that are consumed to fulfill functional requirements, which makes the service orchestration and modeling process a consumer of abstract resources.  The OSS/BSS should then be a consumer of abstract services, neither knowing nor caring about how they’re made up, only how they’re offered to customers.

Ironically, this is exactly where Cisco seems to have been heading with the self-publishing notion.  Any layer publishes abstractions that are consumed as the input to the layer above.  You build “services” from resource “behaviors” (to use my ExperiaSphere term).  You build customer relationships by selling them services.  When you add anything as the “output” of any layer, you can then exploit it up the line and make money from it.   You can “publish” the capabilities of any layer to the layer above, and since there’s that inter-layer exchange the result looks a lot like the old OSI model where each service layer uses the features of the one below.

Publishing is interesting, but you can’t publish what you don’t have, or what you can’t present in an organized way.  Intent modeling, as Cisco suggests, is a key piece of the notion because it lets services or service features or service resources be represented by their capabilities not their implementations.  Cisco has a lot of good points in its self-publishing approach, but if it wants it to be more than marketing eye candy, it needs to align it more with clouds and less with networks.  Without that shift, this is more about a self-aggrandizing network than a self-publishing one.

The Management Side of Resource Abstraction

One interesting question a client of mine raised recently is the impact of infrastructure abstraction on management tools and practices.  What goes on inside a black box, meaning an abstract intent model of a feature or application component, is opaque.  The more that you put inside, the more opacity you generate.  How do you manage something like this?

Abstraction, as a piece of virtualization, allows a “client-side” user of something to see a consistent interface and set of properties, while that view is mapped to a variety of implementations on the “resource side”.  The client-side view can (and, in many or even most cases, should) include an abstract management interface as well as the normal “data-plane” interface(s).  Management, then, can be an explicit property of abstraction of infrastructure.

That doesn’t mean that management is unaffected.  To start with, when you have an abstraction based on an intent model, you are managing the abstraction.  That means that your management practices are now disconnected from the real devices/resources and connected instead to the abstract behaviors that you’ve generated.  That changes the way that real resource management works too.

You can visualize an intent-modeled service or application as an organization chart.  You have a top-level block that represents the service/application overall.  That’s divided into major component blocks, and each of those into minor blocks.  You continue this reversed tree downward until each branch reaches an actual implementation, meaning that the intent model at that level encloses actual resource control and not another layer of modeling.

From the perspective of a service user, this can be a very nice thing.  Services/applications are made up of things that are generally related to functionality, which is how a user would tend to see them.  If we assume that each level of our structure consists of objects that are designed to meet their own SLA through self-healing or to report a fault up the line, this hierarchy serves as the basis for automatic remediation at one level and of fault reporting (when remediation isn’t possible) that identifies the stuff that’s broken.  Users know about a functional breach when the included functions can’t self-heal, and they know what that breach means because they recognize the nature of the model that failed.
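
Here’s a minimal sketch of that self-heal-or-escalate behavior, with invented names and a deliberately trivial remediation test; it’s meant only to show how a fault either stays inside a model element or surfaces as a functional breach reported up the line.

```python
# Hypothetical intent-model node in the "organization chart" hierarchy.
class IntentNode:
    def __init__(self, name, remediate=None):
        self.name = name
        self.remediate = remediate              # returns True if it self-healed

    def handle_fault(self, fault):
        # Try local remediation first (redeploy, rescale, reroute, etc.).
        if self.remediate and self.remediate(fault):
            return f"{self.name}: healed inside the black box"
        # Otherwise report up the line; the parent sees only a functional
        # breach of this element, not the resources underneath it.
        return f"{self.name}: SLA breach escalated ({fault})"

edge_hosting = IntentNode("edge-hosting", remediate=lambda f: f == "instance-crash")
print(edge_hosting.handle_fault("instance-crash"))   # stays invisible to the user
print(edge_hosting.handle_fault("site-power-loss"))  # surfaces as a functional fault
```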

The challenge is that, as I’ve noted before, you can’t send a real technician to fix a virtual problem.  Users of applications or services with no remediation role may love the abstraction-driven management model, but it doesn’t always help those who have to manage real infrastructure resources.  That means that you have to be able to relate the abstraction-driven model to the real world, or you have to assume that the two are sort-of-parallel universes.

The parallel-universe theory is the simplest.  Resources themselves can be assigned SLAs to meet, SLAs that are then input into the capacity plans that commit SLAs to the bottom level of the abstraction model.  As long as the resources are meeting their SLA, and if your capacity plans are correct, you can assume that the higher-level abstractions are meeting theirs.  Thus, the resource managers worry about sustaining resources and not applications or services.

This isn’t much different from the way that IP services work today.  While MPLS adds traffic management capability and enhances the kind of SLAs you can write, the model of management is to plan the services and manage the infrastructure.  Thus, there’s no reason this can’t work.  In fact, it’s well-suited to the analytics-driven vision that many have of network management.  You capacity-plan your infrastructure, keep service adds within your capacity plan, and the plan ensures that there’s enough network capacity available to handle things.  If something breaks, you remedy at the level the break occurs, meaning you fix the real infrastructure without regard for the services side.
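
In code terms, the parallel-universe approach boils down to two independent checks, sketched below with invented metric names and numbers: resources against their own SLAs, and committed service load against the capacity plan.  If both hold, the service-side abstractions are presumed to be meeting theirs.

```python
# Illustrative parallel-universe checks; metric names and values are invented.
def resources_within_sla(metrics, resource_slas):
    return all(metrics[r] <= limit for r, limit in resource_slas.items())

def capacity_plan_holds(committed_service_load, planned_capacity, headroom=0.8):
    # Keep service adds inside the plan: committed load stays under planned
    # capacity scaled by an engineering headroom factor.
    return committed_service_load <= planned_capacity * headroom

metrics = {"edge-pool-latency-ms": 12, "dc-pool-utilization": 0.61}
slas    = {"edge-pool-latency-ms": 20, "dc-pool-utilization": 0.75}

if resources_within_sla(metrics, slas) and capacity_plan_holds(4200, 6000):
    print("resource SLAs met; higher-level abstractions presumed to meet theirs")
```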

What about service management, then?  The answer is that service management becomes more a process of SLA violation redress and refunding.  If something breaks, you can tell what functional element in the service “organization chart” broke, which at least tells the customer where the fault is.  You presume your infrastructure-level remediation process will fix it, and you do whatever the SLA requires to address customer ire.

The more complicated approach is to provide a management linkage across the “organization chart” model hierarchy.  The TMF offers this sort of thing with its MTOSI (Multi-Technology Operations System Interface), which lets a management system parse through the structure of manageable elements in another system.  What it means is that if an element in a model reports an SLA violation, it’s because what’s inside it has failed and can’t be automatically remediated.  The logical response is to find out what the failure was, which means tracing down from where you are (where the fault was reported as being beyond remedy) to where the fault occurred, then digging into the “why”.

One technical problem with this second approach is the potential for overloading the management APIs of lower-level elements when something breaks that has multiple service-level impacts—the classic “cascaded fault” problem.  I’ve proposed (in my ExperiaSphere work and elsewhere) that management information be extracted from individual APIs using an agent process that stores the data in a database, to be extracted by queries that present the data in whatever format is best.  This would ensure that the rate of access to the real APIs was controlled, and the database processes could be made as multi-threaded as needed to fulfill the design level of management queries.
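
A rough sketch of that derived-operations pattern follows.  The repository and agent shown are hypothetical; a real implementation would schedule the polling and shape the query views far more carefully, but the structure shows how the real APIs see a controlled access rate regardless of how many service-level consumers are asking.

```python
import time

# Hypothetical repository: the only writer is the polling agent, so the real
# management APIs see a controlled access rate no matter how many consumers ask.
class ManagementRepository:
    def __init__(self):
        self.records = {}                       # element-id -> latest status

    def update(self, element_id, status):
        self.records[element_id] = {"status": status, "ts": time.time()}

    def query(self, element_ids):
        # Service-level consumers get their view from here, in whatever format
        # suits them, without touching the devices or hosted features directly.
        return {e: self.records.get(e) for e in element_ids}

def polling_agent(repo, elements, read_api):
    """Single, rate-controlled reader of the real management APIs."""
    for element in elements:
        repo.update(element, read_api(element))
```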

A more serious problem with this second approach exists at the process level.  Service-driven remediation of infrastructure problems can result in spinning many wheels, particularly if a fault creates widespread service impact.  You don’t want customer service reps or (worse) individual customer portals to be launching any infrastructure-level initiatives, or you’ll risk having thousands of cooks spoiling the soup, as they say.

I think it’s logical to assume, again as I’ve done in ExperiaSphere, that service-level management visibility would be blocked at a certain point while parsing the “organization chart”, letting CSRs and customers know enough to feel confident that the problem had been identified, but not so far as to make individual resources visible at the service management level.

Overall, I think that virtualization and resource abstraction tend to pull traditional management apart, and the more abstract things are the further apart the two pieces get.  Both applications and services need to manage quality of experience, and that means understanding what that is (at the high level) and dissecting issues to isolate the functional pieces responsible.  However, remediation has to occur where real resources are visible.

You need service or application management because networks are valuable only insofar as their services are valuable.  The “organization chart” model can provide it.  You also need resource management to address the actual state of infrastructure.  It’s possible to do resource management based on the same kind of modeling hierarchy, providing that you have a barrier to visibility between services and resources at the appropriate point (in ExperiaSphere, that was handled by separating the service and resource layers).  It’s also possible to manage resources against an independent SLA using tools that suit the resource process.  That conserves current tools and practices, and it may be the path of least resistance toward the future.

How “New” is the “New IBM?”

For years now, IBM has been struggling to arrest revenue and profit problems, and they got it at least somewhat right in their last quarter.  Perhaps the positive response to their quarterly report this week was almost a “relief rally”, but the stock was up on their beating estimates.  The question isn’t so much whether this is a one-time move, but rather what tide has risen to lift IBM’s boats.  If it’s a true tide and not a rogue wave, IBM might be heading for a bigger position in the market.

The upside IBM saw was, according to some IBM users I’ve chatted with, largely due to the hybrid cloud trend that has been developing since 2018.  After almost a decade of shooting at the wrong duck, the cloud industry finally figured out what cloud computing really meant for enterprises.  That’s spawned a shift in IT thinking, a shift toward an application model more suited to hybrid deployment, and that’s gotten many IBM customers excited.

The bad news is that IBM didn’t create the hybrid awareness.  IBM’s strategic influence in the market has steadily declined, in no small part because they’ve not articulated a strong cloud position and have no broad base to do so.  You can see this in the earnings call transcript, where their first point about customer engagement was about analytics.  IBM understands what it’s been touting for half a generation.  The cloud, even hybrid cloud, is alternately a direction IBM is moving in and a necessity in leveraging some of those old notions that IBM thinks should be important but that customers have been yawning about.

The good news is that IBM’s acquisition of Red Hat could be giving it the best platform in the industry for leveraging where hybrid cloud is going.  Red Hat is the premier provider of open-source software, and that in itself offers IBM the kind of market breadth that it used to have in its heyday and lost when it sold off so much of its hardware business.  Not only that, enterprises tell me that their ideal platform toolkit for building and running hybrid cloud applications would be built on open-source components offered by a vendor, bundled with support, and pre-integrated for ease of adoption.  That’s exactly what Red Hat has been about from the first, and now IBM has them.

It’s probably clear to everyone at this point that I think the question for IBM, the factor that will determine whether they’re on a roll or rolling over, is how these two news points interplay with each other.  When I blogged about IBM’s acquisition of Red Hat, I noted that my concern was that IBM just wasn’t strategic enough to play them correctly.  That concern can be expressed in terms of my “bad-news” point; IBM is still thinking in twenty-year-old market terms, is still seeing open-source and hybrid cloud as implementation details and not strategic shifts.  That’s what we need to discuss.

Let me open the discussion with a long but highly relevant quote from their earnings call: “Let me pause here to remind you of the value we see from the combination of IBM and Red Hat, which is all about accelerating hybrid cloud adoption. The client response to the announcement has been overwhelmingly positive. They understand the power of this acquisition, and the combination of IBM and Red Hat capabilities, in helping them move beyond their initial cloud work to really shifting their business applications to the cloud. They are concerned about the secure portability of data and workloads across cloud environments, about consistency in management and security protocols across clouds, and in avoiding vendor lock-in. They understand how the combination of IBM and Red Hat will help them address these issues. We see the strong bookings Red Hat recently reported as further evidence of clients’ confidence in the value. Remember, the quarter ended a month after the transaction was announced.”

To me, the best thing about this quote is that it shows IBM is aware that the migration to hybrid cloud demands sustaining the same kind of security, governance, and reliability/availability that enterprises have demanded from the first.  For IBM customers, customers who know all about these constraints, IBM is a credible source of insight and products.  That’s proven by the overwhelmingly positive response to the Red Hat deal that IBM is citing.

The worst thing in the quote is the assertion that the Red Hat deal is “helping them move beyond their initial cloud work to really shifting their business applications to the cloud.”  Earth to IBM: hybrid cloud is not about shifting, it’s about refactoring to a dualistic hosting environment.  The earnings call isn’t a perfect place to start strategic marketing and positioning (most buyers don’t listen to it), but it’s a place that demonstrates the bias of senior management.  If senior IBM management is still talking about “shifting work to the cloud” when hybrid environments are really about building cloud front-ends for existing applications that will stay in the data center, then IBM has a problem.

The other, obvious, problem is that if the good news is that Red Hat gives IBM a broad base, then it would be better news if we saw IBM recognizing that.  Where is a statement that Red Hat opens a whole new market for IBM, a market much larger than its current customer base?  I couldn’t find anything in the earnings call transcript that committed IBM to exploiting Red Hat’s almost-universal market story instead of IBM’s very limited my-customers story.

Red Hat has the high ground in hybrid cloud.  They have a position that’s extremely good even now, and with a little strategic insight could be made even better.  IBM had a good cloud quarter, and with a little strategic insight, it could have been (and could continue to be) better.  If IBM lets Red Hat alone, it will reap some benefit from Red Hat’s story and probably turn gradually positive over the longer term.  If IBM tries to pull Red Hat and hybrid cloud into a pedestrian infrastructure evolution story, it will kill Red Hat and its own momentum, and that will be very bad for IBM and the market.

If IBM leans on Red Hat thinking to re-frame its entire view of how IT is evolving, it would be very positive for IBM and Red Hat, and also good for the market.  I remember well the days when IBM was the insight leader in IT.  Not only was IBM stronger then, but IBM’s competitors and the market overall were stronger, because IBM had a unique ability to articulate a vision of the future that you could buy into whether you bought IBM products or not.  That’s the kind of thinking IBM needs to keep those good quarters coming, and growing.

The prospect of IBM coming of age, cloud-wise and market-breadth-wise, is enough to strike fear into competitors’ hearts, not only vendors but cloud providers.  It’s not so much that IBM’s own cloud presents a risk, but rather that a realistic hybrid cloud vision could have a profound impact on the market dynamic.  Microsoft is the hybrid cloud favorite because it has better cloud-to-premises symbiosis in its tools.  An open solution from Red Hat and IBM could level the hybrid cloud playing field.

On the vendor side, though, there’s plenty of risk to watch.  HPE and Dell both have massive server businesses to protect, and if there’s an open, hardware-independent solution to hybrid cloud then neither of the two has as much potential for competitive differentiation.  The Red Hat and IBM coalition, properly driven, could commoditize servers completely—which doesn’t hurt a vendor who’s exited that market, of course.

It may be Cisco who has the most to gain, and lose, here.  Does Cisco, who’s never been a software kingpin, overtly or covertly take on the role of platform for IBM’s new hybrid model?  That rides IBM’s coattails, which is tempting but dangerous if IBM boots its strategic exploitation of Red Hat.  Does Cisco go it alone, with “digitized spaces?”  We’ll see.

Two More Cloud-Things Needed for Cloud-Native Network Features

If we really want cloud-native network service features, and optimum use of them, we should probably look at two factors beyond those that I laid out to make NFV itself more cloud-native.  One is the issue of “configuration” and “parameter” control, and the other is the ever-complicated issue of address space.  Again, the cloud world offers us an approach, but again it’s different from what’s been evolving in the service provider world.

When you deploy an application, especially on a virtual resource, you’ll generally need to address two setup issues.  First, the deployment itself is likely to require setup of the hosting resources, including the application’s own hosting points and the hosting of external but coupled resources like databases.  Second, you’ll usually need to parameterize the application (and its components) to the framework of operation you’re expecting.

The issue of address space can be considered related.  Everything you deploy in a hosted environment has to be deployed in a network environment at the same time, meaning that you have to assign IP addresses to things and provide for the expected connectivity among all the elements of the applications, and with selected components and the outside world.  How this addressing is done impacts not only the application and its users, but also ongoing management.

In the application software and cloud world, we seem to be converging on the container model of deployment, based at the lowest level on the Linux container capabilities, at a higher level on things like Docker or rkt, and at the highest level by orchestrators like Kubernetes or Marathon.  Container systems combine the setup/configuration and addressing frameworks by defining a standard model for deployment.  It’s less flexible than virtual machines, but for most users and applications the minimal loss of flexibility is offset by the operations value of a single model for hosting.

One challenge with our configuration/parameter control is version control.  It’s common for both deployment configuration and component parameters to evolve as software is changed to fix problems or add features.  Software evolution always requires synchrony of versions through all the components, and so version control in software goes all the way from development to deployment.  The most popular mechanism for ensuring that software version control is maintained is the repository, and Git is probably the best-known tool for that.

Recently, application deployment has recognized the dependency of configuration/parameter sets on software versions, and so we’ve evolved the idea of keeping everything related to an application in a common repository—called a “single source of truth”.  This creates an envelope around traditional “DevOps” practices and produces what’s being called “GitOps”.

The reason this is important is that GitOps should be the framework for configuration/parameter control for things like NFV and SDN, in part because to do otherwise would break the connection with the cloud and current application development, and in part because inventing a whole different approach wastes effort and resources.  You can store repository information in a common language like YAML, and then transform from that single source to feed both parametric data to applications and configuration data to orchestrators.  You can also describe your goal-state for each element, and then check your current state against it to provide operational validation.
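
Here’s a minimal sketch of that single-source-of-truth flow, assuming a YAML document in the repository and the PyYAML library.  The document structure and field names are invented; the point is only that application parameters, orchestrator configuration, and goal-state validation all derive from the same stored description.

```python
import yaml  # PyYAML, assumed available

# Invented repository document: one description feeds everything downstream.
SOURCE_OF_TRUTH = """
feature: virtual-firewall
version: 2.3.1
deployment:
  replicas: 2
  host_class: edge
parameters:
  max_sessions: 50000
  log_level: warn
"""

truth = yaml.safe_load(SOURCE_OF_TRUTH)

def to_orchestrator_config(doc):
    # Transform the shared document into what the orchestrator consumes.
    return {"image": f"{doc['feature']}:{doc['version']}",
            "replicas": doc["deployment"]["replicas"]}

def to_feature_parameters(doc):
    # Transform the same document into what the running feature consumes.
    return doc["parameters"]

def validate(observed_state, doc):
    # Goal-state check: compare what's actually running to the source of truth.
    return observed_state == to_orchestrator_config(doc)
```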

If we were to use the GitOps approach for feature hosting, it would mean that instead of trying to frame out specific VNF descriptors, we’d lay out a structure of “class-specific” information and rely on an adapting element to derive what a VNF needs from this generalized data.  That could be based on the same “derived operations” concept I explained in my earlier blog.  The goal would then be to create a good general structure for describing parametric data, since we already have examples from the cloud in how to code configuration data via GitOps.

The challenge of address spaces is a bit more complicated, in part because even container systems today take a different slant on it.  Containers might deploy within individual private address spaces or deploy within one massive (but still private) space, and exposed interfaces are then translated to public (VPN or Internet) spaces.  You can see that how addresses are assigned is important, and that the assignment process has to be adapted to conditions as applications scale and are redeployed if something fails.  Network virtualization standards have generally paid little attention to address spaces, but the cloud has devoted considerable effort to defining how they work under all conditions.
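
As a simple illustration of the addressing pattern, consider the sketch below: components draw addresses from a private space, and only the interfaces you choose to expose are mapped into a public (VPN or Internet) space.  The address values and mapping structure are invented for the example.

```python
import ipaddress
from itertools import islice

PRIVATE_SPACE = ipaddress.ip_network("10.240.0.0/24")   # invented private range

def assign_private(component_index):
    # Each component instance gets a private address; scaling or redeployment
    # simply draws another address from the same space.
    return next(islice(PRIVATE_SPACE.hosts(), component_index, None))

def expose(private_addr, private_port, public_addr, public_port):
    # Only explicitly exposed interfaces get translated into the public space.
    return {"from": (str(private_addr), private_port),
            "to": (public_addr, public_port)}

web_front_end = assign_private(0)
print(expose(web_front_end, 8080, "203.0.113.10", 443))
```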

The NFV concept of “service chaining” is at least an illustration of why we have neglected address space discussions.  The notion of a service chain is that you build a linear pipeline of VNFs that perform a function normally hosted in CPE.  That’s fine as a mission if all you want to do is linear chains and vCPE, but it neglects a couple of major truths that we need to address.

Truth one is that IP, which is what the cloud and NFV hosting are based on, is a connectionless protocol.  If you establish a service or application that consists of multiple components/features, then IP would support any sort of connection relationship you’d like among them.  The notion that only a specific linear tunnel set is available is limiting, and in fact one of the best and earliest examples of VNFs that we have comes from Metaswitch and its open-source IMS implementation (Project Clearwater at the time, now Clearwater Core).  Clearwater lets the component VNFs communicate as they should, as normal IP elements.

Why did we get to the “chain” concept?  Part of it was just tunnel vision on vCPE, but a bigger part I think is the fact that the standards people thought too literally about VNFs.  Physical devices were connected by wires, so virtual functions should be connected by virtual wires.  Yes, you can build them (tunnels of some sort) in IP networks, but would you want to do that when natural IP communications works directly?

Tunnel-focus elevates network thinking to the tunnel level, and since you can in theory tunnel over almost anything, that means that it’s easy to forget what’s below, including address spaces.  If you want to deploy applications, though, you need to understand what’s addressed and how, and how addresses either create or limit open connectivity.  That’s a lesson that needs to be learned.

I’d argue that these two points are actually easier to address in a rethinking of NFV than the points I’d made in my earlier blog, but it wouldn’t do a lot of good to address them without having dealt with those more complex issues.  If you are really going to be cloud-native, you need to have framed your notion of NFV in the cloud from Day One.  Getting to that from where we are today is possible, but as these two blog points show, the cloud advances in its own relentless flow and the longer NFV diverges, the greater the gulf that has to be bridged.  We’re already at the point where it’s going to be difficult; shortly it’s not going to be possible.

That might not be the worst outcome.  Little that’s done in NFV today is really optimum, even if we can diddle things to make it at least workable.  Cloud architects doing NFV would have done it differently (as my own work demonstrated), and it’s still possible to meet all the goals of the 2012 Call for Action using cloud tools alone.  I think we’re going to see some try that in 2019.

Adapting NFV to Cloud-Native

Carrier cloud is IMHO the foundation of any rational network operator virtualization strategy.  It would make zero sense for operators to build out hosting infrastructure for specific applications or service missions.  This is the industry, after all, that has decried the notion of service-specific silos from the very first.  Capital and operations efficiency alike depend on a single resource pool.

I’ve noted in previous blogs that my biggest problem with the NFV community is its current and seemingly increasing divergence from the most relevant trends in cloud computing.  There is nothing that NFV has or needs that isn’t also needed in the cloud overall, and in fact nearly everything NFV has or needs is already provided in the cloud.  If we keep on our current track, we’ll build an NFV model that fails to take advantage of cloud development (in deployment and lifecycle management) and risks a separation between carrier cloud infrastructure and “NFV Infrastructure” or NFVi.  That would raise the capital and operations cost of NFV, and risk the whole concept.  We all know that would be bad, but can anything be done at this point?

I think there are two possible pathways to fixing the NFV situation.  The first is to limit NFV to deployment of VNFs on universal CPE (uCPE), and make a decision to separate uCPE from carrier cloud infrastructure.  By keeping NFV out of cloud hosting (in favor of per-site premises hosting), we could eliminate the risk of creating an NFVi silo cloud.  The second is to find some way of laying NFV on top of carrier cloud without separating the processes of NFV from cloud processes.  That requires making NFV overall into a “carrier cloud application”, not making VNFs into such an application.  All of this must also somehow address the “cloud-native” thrust we now see from operators, both in NFV specifically and in carrier cloud overall.  Let’s look at each approach to see what would be required.

Starting, obviously, with limiting VNFs to uCPE deployment.  That sounds drastic, but I’ve never liked the way that the NFV ISG glommed onto “service chaining” and “virtual CPE” as the prime focus within the cloud.  The basic problem is that any service termination needs some CPE, so you can’t pull everything into the cloud.  Consumer and small business/branch terminations need a WiFi hub, and the commercial products to support that include the basic firewall and termination-service features.  These cost between about $50 and $300, mostly depending on the WiFi features, so it seems clear that the vCPE mission is valid only for larger business-site terminations.  Supporting those terminations inside the cloud using service chaining, custom NFVi, and an NFV-specific deployment and lifecycle automation process seems unlikely to be justified.  uCPE is a better solution.

As I pointed out in a blog last week, though, vCPE/uCPE is very likely not a mission that really needs NFV.  You don’t need to service-chain within a device (presuming that service chaining overall is even useful, which I doubt).  Some order in how you load and manage uCPE would be helpful, but a simple spec for the platform could resolve that.  In any event, it is a possible mission for NFV, and if operators believed in it, then at least some of the work of the ISG could be redeemed.

The second option is a lot more complicated.  Rather than doing what many had expected, which was to simply identify cloud strategies that NFV could leverage, the ISG framed a very specific model for management, deployment, configuration and parameterization, and even infrastructure.  That model didn’t align well even with the cloud framework of 2013, when the model evolved, and it’s not tracking further developments in the cloud.  Thus, the question is how easily we could retrofit NFV, not to past or even current cloud, but to the broad track of cloud evolution.

Let’s start with what I think would have to be the critical accommodation.  Right now, NFV expects to interact with its resource pool (NFV Infrastructure or NFVi) through the mediation of the Virtual Infrastructure Manager or VIM.  In order for this to track cloud evolution, we have to assume that the VIM offers a single virtual-host abstraction that’s mapped to the resource pool by a composable-infrastructure layer.  Think something like (and preferably based on) Apache Mesos with DC/OS, but the implementation specifics matter less than the architectural model.  Everything gets hosted on the virtual-host abstraction.  The mapping happens below, which means that a VNF has a “descriptor” that corresponds to the application description in cloud hosting.  There are also resource-level descriptors/policies that guide the mapping.

What I think would likely be required here is a kind of adapter function, one that presented a uniform hosting abstraction via the VIM API, and accepted the VNF Descriptor (VNFD) as a parameter.  This would then be mapped through into the parameters and policies required by the composable infrastructure API below.  That’s helpful because it ensures that we don’t, at a point where the cloud itself has perhaps agreed on infrastructure abstraction but not on the means, lock in an implementation at this level.
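
A hedged sketch of that adapter idea is shown below.  The descriptor fields and the lower-layer API are invented; they’re not the ETSI VNFD schema or any specific composable-infrastructure product, just an illustration of translating a descriptor into generic hosting terms that the layer below can map however it likes.

```python
# Hypothetical adapter: VIM-facing on top, infrastructure abstraction below.
class HostingAdapter:
    def __init__(self, infra_api):
        self.infra_api = infra_api              # the abstraction layer underneath

    def deploy_vnf(self, vnfd):
        # Translate descriptor fields into generic hosting terms...
        request = {
            "image":    vnfd["sw_image"],
            "vcpus":    vnfd["compute"]["vcpus"],
            "memory":   vnfd["compute"]["memory_gb"],
            "policies": vnfd.get("placement_policies", []),
        }
        # ...so the layer below can map them onto real resources any way it likes.
        return self.infra_api.launch(request)
```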

This isn’t too destructive to the ISG work, but even here there’s the fact that the ISG is tossing in a lot of parameters to control the service lifecycle process that are not found in the cloud at all.  Some may be helpful, and these should be reviewed for inclusion in cloud implementations.  We need, for example, the ability to identify hosting points that depend on the same power resources (so we don’t back one up with the other) and to define places where certain processing and storage can’t take place for regulatory reasons.  Many of the rest are just diddling with hosting to the point where they could compromise resource efficiency.

The next step is more difficult.  There is no reason whatsoever to stay with the NFV ISG concept of MANO, though orchestration is clearly important.  The reason for the departure is that the cloud is offering a number of much more flexible, powerful, and broadly supported orchestration options.  If NFV wants to get itself aligned with the cloud, they have to give up the notion of NFV-specific orchestration.

From MANO-less NFV, we move to the real knot of the problem, which is management.  Deployment automation is fairly straightforward, which may be what led the NFV ISG to try to define it independent of the cloud.  If you add in “lifecycle automation”, you make things a lot more complicated because you introduce a very broad set of events into the picture.

There are two levels of management issue with virtualization and carrier cloud.  One level is the “simple” task of reflecting the status of virtual network functions in the same way that you’d reflect physical device status.  The NFV ISG dealt with that, in a sense at least, with its concept of the VNF Manager (VNFM).  The second level is the real problem: how do you deal with management requirements that exist for the virtual function but not for the physical device?  An example is a service-chained set of VNFs replacing a simple piece of access CPE.  There are hosts and connections in the former that are all hidden inside a box in the latter, so traditional management won’t handle them.

Once you accept that you need a broader vision of management to handle what happens inside virtual elements, it makes no sense to assume that you’d stick with the old operations tools and practices to handle the outside.  In order for that to work, the NFV community has to extend its view of service modeling (which has already embraced, sort of, the TOSCA cloud-centric approach) to include the TMF’s NGOSS-Contract vision of data models steering events to processes.  That requires a major shift in the way that VNFM happens today, because there is no longer a monolithic operations process handling things, but rather a series of processes coupled individually to events via TOSCA.

This doesn’t mean that VNFM goes away, or even that the notion of using traditional management elements (the classic element-management hierarchy) goes away.  You can feed a management system with data obtained from a modeled service.  I proposed using a management repository with query-derived management views (“derived operations”) in several operator meetings, well before the NFV ISG started.  You can retain a management-agent approach where you want to, but remember that management systems that are supposed to be driving lifecycle automation have to be automatic, so the state/event NGOSS-Contract approach is absolutely critical.
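
The sketch below shows the state/event idea in miniature.  The states, events, and processes are invented, but the essential point is that the model element’s own table steers each event to a small process and a next state, rather than a monolithic management application deciding what to do.

```python
# Hypothetical NGOSS-Contract-style element: the data model steers events.
class ModelElement:
    def __init__(self, name, state, state_event_table):
        self.name = name
        self.state = state
        self.table = state_event_table          # {(state, event): (process, next_state)}

    def handle(self, event, context):
        process, next_state = self.table[(self.state, event)]
        self.state = next_state
        return process(self, context)           # process sees only this element's data

def deploy_process(element, ctx): return f"deploying {element.name}"
def heal_process(element, ctx):   return f"redeploying {element.name}"

table = {
    ("ordered", "activate"): (deploy_process, "active"),
    ("active",  "fault"):    (heal_process,   "healing"),
}

firewall = ModelElement("hosted-firewall", "ordered", table)
print(firewall.handle("activate", {}))
print(firewall.handle("fault", {}))
```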

Derived operations, the extraction of information from a database through a query interface, is also a reasonable way to deliver service model data to a process that’s the target of an event.  It’s also possible to simply offer the entire model, but that creates efficiency and security issues.  You could offer an object only the model data that represented the object the event was associated with, which is more secure and efficient, but perhaps begs the question of where this data is stored.  The question of model-data distribution is one we can explore in a future blog.

Can we expect the standards community, and the NFV ISG in particular, to adopt this approach?  I doubt it.  Technically, none of these things presents a major challenge.  We have cloud tools that do the right thing already, as I’ve pointed out.  What’s more likely to be a problem is the inertia of the standards processes involved—the NFV ISG and the ETSI ZTA (Zero-Touch Automation) activity.  It’s hard to admit that you’ve spent five years (in the case of the ISG) doing something and now have to discard or change most of it.  But the alternative is to spend even longer doing something that will surely be overtaken by events.

Looking at Enterprise IT in 2019

Enterprises have their own plans for IT and networking for 2019, and since I’ve done a blog on the service provider side, I also want to do one for enterprises.  My own enterprise survey process isn’t as comprehensive as for the network operators, so I’m also commenting on predictions made by the financial industry to provide broader input to this blog.

Among the enterprises I’ve interacted with in the last quarter, the most interesting point is that there’s a considerable amount of agreement across all the verticals.  Enterprises are looking to make more and better use of the cloud in 2019, and in fact that’s their highest IT priority.  A close second, and related to the first, is the goal of improving application lifecycle automation, meaning the automation of the deployment, redeployment, and scaling of applications.  Third is the desire to improve hosting efficiency within the data center, and fourth is to incorporate AI or machine learning (enterprises are fuzzy on what the distinction is, or even whether there is one) in applications and operations.

At the core of this combination is the fact that enterprises are only now understanding the real mission of the cloud.  It’s not to replace traditional data centers in hosting mission-critical apps, or in fact to replace data centers for hosting many non-mission-critical apps.  It’s for hosting application components that are more tied to data presentation to users/workers than to transaction processing.  Presentation is only a piece of the overall application, and so the cloud is only a piece of hosting strategy—what we call “hybrid cloud” is the order of the day.

Wall Street research picks up much of this, but in a more technology-specific form.  They see container technology and data center efficiency (hyperscale, networking) as key elements, for example.  Overall, though, Street research validates the notion that hybrid cloud is the model, to the point where it eclipses public cloud in most analysts’ view.  The hybrid shift favors Microsoft (who has always had a better private cloud story).  They also see a shift in application development practices and data center hosting models to support hybridization, and a shift in operations behavior to support efficient management of hybrid hosting.

The hybrid cloud mission is thus perhaps the most significant single driver/factor in IT spending in 2019.  It’s going to reshape how applications are developed, encouraging containerization, stateless behavior to enable easy scalability, and an expanded view of orchestration.  In the networking area, it’s the primary driver of change for data center switching, the primary motivator for virtual networking in the data center, and a major driver for SD-WAN and SD-WAN feature evolution.

Behind the hybrid cloud, in a technical sense, is a broader and at the same time more thoughtful application of virtualization.  The problem with specific hosting technology and specific application requirements is the specificity; it makes it difficult to create broad pools of resources and fit an increasingly complicated set of operations tools to the varied environment you’ve created.  Virtualization presumes that you’ll have abstractions of applications hosted on abstractions of resources, connected through abstract network services.  If all the tools operate on and through the abstractions, then a single straightforward toolkit fits practically everything, and works the same way for everything as well.

Enterprises are moving to this new virtualization approach, but my research suggests that they’re doing their moving largely by accident rather than through deliberate planning.  Most enterprises don’t really see a holistic goal or approach, in fact.  They’re addressing the issues of the hybrid cloud as they encounter them.  It’s fortunate for enterprises that the open-source movement has largely unified the technical goals and is developing toward them.  Otherwise we might well be creating a bunch of silos instead of breaking them down.

The open-source frameworks (particularly for containers and orchestration, meaning Kubernetes and Apache Mesos/Marathon) have also provided glue to hook in public cloud and even proprietary software tools that would otherwise have tended to be too tactical to be helpful in hybrid cloud.  Various SDN concepts, both open and proprietary (and including SD-WAN), are providing a strong network glue to bind distributed components.

On the hardware side, most enterprises agree that it’s important to unify their hosting platforms, converging on a compatible set of server CPU options and also a fairly unified OS and middleware mixture.  Containers are helpful in that they frame a portable hosting slot model.  Network hardware doesn’t need unification as much as a common overlay model (SDN and SD-WAN, in the form of a highly flexible virtual network and network-as-a-service) and high-speed connectivity, both within each data center and among data centers.

The hardware side of things needs to be matched to the software a bit better.  The hybrid cloud solution to hardware resource mapping to a virtual abstraction is still evolving.  One thread is a “control-plane” or “cluster” extension that lets orchestration tools (Kubernetes) map data center and cloud resources as different clusters in the same pool.  A set of “data-plane” approaches seeks to unify the resource pool through a universal abstraction and network connectivity (Apache Mesos).  Data center vendors are not touting any particular affinity with either approach.
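As a rough illustration of the “cluster” thread (this isn’t any vendor’s actual API, just a sketch of the placement decision), you could imagine the orchestration layer treating data center and cloud resources as named clusters in one logical pool and choosing among them:

```python
# Hypothetical sketch: data center and public cloud resources treated as
# clusters in one logical pool; placement picks the cheapest cluster with
# enough free capacity. Not any vendor's actual API.

CLUSTERS = [
    {"name": "dc-east",    "free_vcpus": 120, "cost_per_vcpu": 1.0},
    {"name": "cloud-west", "free_vcpus": 800, "cost_per_vcpu": 2.5},
]

def place(workload_vcpus):
    candidates = [c for c in CLUSTERS if c["free_vcpus"] >= workload_vcpus]
    if not candidates:
        raise RuntimeError("no cluster has capacity")
    return min(candidates, key=lambda c: c["cost_per_vcpu"])["name"]

print(place(64))    # -> "dc-east": cheaper and has room
print(place(300))   # -> "cloud-west": the only cluster with capacity
```

The “data-plane” alternative would instead aim to make the clusters look like a single undifferentiated pool, so a placement decision like the one above disappears into the abstraction itself.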

That brings up the opportunities and risks.  Red Hat/IBM could be a major beneficiary of the hybrid cloud shift for a bunch of very obvious reasons.  Red Hat’s OpenShift is perhaps the best-known “productized” container application platform; it contains Kubernetes.  OpenShift could be a vehicle for IBM to take a really big position in the emerging hybrid cloud space.  VMware, perhaps a bit preoccupied by the drama of Dell’s possible reverse merger (now perhaps laid to rest), is advancing its basic network and hosting tools, but without the clear framework of OpenShift.  HPE similarly has all the right pieces but lacks ecosystemic marketing.

How about networking?  VMware is, for the network community, the most interesting of the possible hybrid cloud play-makers.  Their NSX virtual-network tool, augmented, though a bit feebly, by the VeloCloud acquisition, could be seen as a universal virtual-network solution if it’s enhanced further.  Nokia/Nuage already has a comparable product, but they’re far less a player in the enterprise space than the service provider space.  Juniper, through its acquisition of HTBASE, might have enough critical virtual-network assets to play in hybrid cloud, but they seem stubbornly committed to “multi-cloud” when that term is almost always used to refer to users with more than one public cloud provider, not to those with a public-cloud-and-data-center strategy.

This is a good point to raise the question of open-source technology for enterprises.  My research says that enterprises strongly prefer open-source platform software to proprietary software.  However, most enterprises want to get that platform software from a source that bundles support.  They don’t insist on “purist” open-source at all, and in fact many don’t know anything about the topic or even the different classes of open-source licensing.  Open-source gets good ink, it’s “free” (except for the support that they’re expecting to pay for), and most of all it’s not vendor-specific.  In short, open-source is insurance against vendor lock-in.  That view is expanding among enterprises; the number who think open-source is the best protection against lock-in has doubled in just the last five years.

On the network side, it’s different.  Enterprises are not as committed to “open” network technology like switches or routers as the operators and cloud providers are.  That’s important not only because it means enterprise network equipment sales are less likely to be eroded by open platforms, but also because the virtual-network technology that’s essential to the cloud is not automatically expected to be open-source.

The software framework of 2019, then, is set by open-source.  Applications and some specialized tools can still be proprietary as long as they integrate with or operate on the open-source platform overall.  That means that, from the bottom up, we’re setting expectations for software and IT that’s not going to provide vendors with automatic lock-in once the first deal is done.  That may bring about massive changes in buying practices in 2019, and it’s certain to impact 2020 and beyond.

What Operators Think We Can Expect of NFV in 2019

I’ve been having some interesting exchanges with operations planners at a couple dozen network operators who have done some NFV deployments.  My goal was to see what the issues are likely to be in 2019, and the results were in some cases unexpected.  Perhaps the biggest surprise was the disconnect between what’s getting reported and what these specialists are seeing.

The area where we hear the most about NFV today is “virtual CPE” or vCPE.  The basic notion of vCPE hasn’t changed enormously since 2013, when the seminal end-to-end model of NFV was first developed.  CPE is essentially a linear progression of features, including some for security and some for encapsulation.  This sequence was called a “service chain”, and it was presumed that the service chain would be built either in the cloud, by linking hosted virtual network functions (VNFs), or in a single “universal” CPE (uCPE) device on premises.
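The service-chain idea is simple enough to sketch in a few lines of Python (the feature names here are purely illustrative): a vCPE instance is an ordered list of feature functions that traffic traverses in sequence.

```python
# Illustrative sketch of a vCPE service chain: an ordered list of feature
# functions that each packet (here, a dict) passes through in sequence.

def firewall(packet):
    packet["inspected"] = True
    return packet

def vpn_encapsulation(packet):
    packet["encapsulated"] = True
    return packet

SERVICE_CHAIN = [firewall, vpn_encapsulation]

def traverse(packet, chain=SERVICE_CHAIN):
    for vnf in chain:
        packet = vnf(packet)
    return packet

print(traverse({"payload": "user-data"}))
```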

Service chaining and vCPE have, according to the operators, been the source of most of the VNF onboarding problems: problems related to converting an arbitrary set of network features into VNFs that can be deployed and managed using the framework described by the NFV ISG.  Here are the points operators have made about all of this.

First, operators say that the notion of assembling an inventory of VNFs to create a virtual device is essentially flawed.  The presumption is that this approach would let you respond to buyer needs for different feature sets, creating a more agile service.  The problem is that the operators with the most experience say that a maximum of five different configurations of features support over 97% of all buyer requirements, and that buyers will typically stay with their first configuration, even when changes to that configuration are supported.

What this adds up to is that it’s easier and cheaper for operators to simply package a couple of variant configurations of vCPE software and load one as a unit based on buyer needs.  That means less operational complexity—there is only one “composite VNF” to be hosted wherever you decide to host it.  If somebody does want to change their vCPE configuration, you just load a different composite image.
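A hedged sketch of what the operators describe, with an invented catalog: rather than composing arbitrary VNF inventories, you keep a handful of packaged composite images and load whichever one matches the buyer’s feature set.

```python
# Hypothetical catalog of packaged "composite VNF" images; the buyer's
# requested feature set is matched to one of a few fixed configurations.

COMPOSITE_IMAGES = {
    frozenset({"firewall"}):                         "vcpe-basic.img",
    frozenset({"firewall", "vpn"}):                  "vcpe-secure.img",
    frozenset({"firewall", "vpn", "wan-optimizer"}): "vcpe-premium.img",
}

def select_image(requested_features):
    image = COMPOSITE_IMAGES.get(frozenset(requested_features))
    if image is None:
        raise ValueError("no packaged configuration covers this feature set")
    return image

print(select_image({"firewall", "vpn"}))   # -> "vcpe-secure.img"
```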

Then there’s the question of how you load things onto uCPE to create vCPE.  If you want to follow the NFV rules for deployment and management, you’d need to treat a premises device and a cloud-hosted alternative in the same way, which means that you’d need uCPE that looks like an extension of the carrier cloud.  That raises a whole new set of operations/management questions, because the device is on the user premises.  Operators told me that they’d prefer a simpler approach to VNF loading, even if it means they have to use a different process to load up their uCPE than they use to cloud-host VNFs.

In the cloud, operators say that service chaining isn’t helpful if you presume that most vCPE is really going to live on the premises.  Yes, some operators admit they’ve offered cloud-hosted vCPE features, but they tell me that they’re not selling well and are also not profitable enough to justify a major commitment to the approach.

If you go beyond vCPE, things get really hazy.  According to operators, the non-vCPE applications of NFV and cloud-hosted VNFs are getting pulled strongly in a “cloud-native” direction.  To most of my operations contacts, “cloud-native” means deployed as it would be in cloud computing, discarding the NFV tools and specifications.  They point out that for applications like mobile infrastructure and video caching, the software features are very much like the components of cloud applications, meaning that they’re multi-tenant by nature, whereas NFV has focused on tenant-specific services.  Three-quarters of the operators who had any VNFs deployed in such missions admit that they’re not using mandated NFV specifications to guide their deployment and management.  Most of those say they are moving even further toward the cloud picture.

The biggest reason for this, according to my contacts, is that “NFV was designed to build legacy services from hosted components and we need non-legacy services for revenue growth” (to quote one operator).  The services operators think will fill their revenue coffers in the future are more like OTT services, which means they are more likely to need infrastructure resembling that of public cloud providers than the infrastructure NFV was designed to utilize.

All this comes amidst a general concern that NFV, as a cloud application, is diverging from the cloud.  From the first, VNF onboarding has proved much more complicated than expected because the framework for deploying and managing VNFs has continued to evolve away from the way that applications are deployed in the cloud.  Unless we want to view the software that’s a candidate for VNF creation as the product of a “VNF industry”, we have to assume that software is part of mainstream development, which means it’s cloud-ready rather than NFV-ready.  The biggest problem with onboarding, say the operators, is what they’re expected to onboard to.

None of the two-dozen operators I’ve exchanged views with told me they were dropping NFV, but nearly all of them said that NFV’s success (even in a limited way) depended on reconnecting it with the cloud.  Operators are not going to adopt one architecture to deploy and manage legacy service virtualization and another to build the OTT service elements that are the primary driver for “carrier cloud” in the first place.

What about things like ONAP and the various open-source NFV platforms?  Operators say these are diverging from cloud-accommodating implementations of lifecycle management, which means that they would magnify the problems of cloud-readiness rather than solve them.  I’m also seeing increased skepticism that these platforms are developing fast enough to support profitable service deployments, and the longer it takes for that to happen, the harder it will be for legacy-service NFV to resist the pressure of the more cloud-centric and OTT-like applications of carrier cloud.

My models have never suggested any realistic NFV contribution to carrier cloud beyond 2019, and I don’t think any of the operators I’ve talked with believe that ISG-modeled NFV is going to expand much in 2019.  If that’s the case, then this may be the year when operators have to start thinking of “cloud-native” in a different, broader, and more serious way.

Resource Equivalence and Composable Infrastructure

The notion of composable infrastructure or infrastructure abstraction is one of my favorites.  I think it’s probably the most important piece of our overall virtualization puzzle, in fact, but it’s also something that could present serious problems.  That’s particularly true given how fuzzy the concept is to most users.

At the end of last year, enterprises had less than 20% literacy on composable/abstract infrastructure at the CIO level, by my model measurements.  That means that only one in five could state the value proposition and basic technology elements correctly.  This sort of thing obviously contributes to the classic “washing” problem where vendors slap a composable coat of paint on just about anything.  It also makes it difficult for prospective users to assess products, and most important, assess their support of critical features.

The basic notion of abstraction in infrastructure is that you build a virtual hosting framework that applications and operations tools see as a single server or cluster.  There’s then a layer of software that maps that abstraction to a pool of resources that should include servers, storage, database, network connectivity, and so forth.  The complexity associated with using this diverse set of resources, each of which could be implemented in multiple places by multiple vendors, is hidden in that new layer of software.  If that layer works, then composable infrastructure works.
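To picture that layering, here’s a minimal Python sketch, with everything invented for illustration: applications see one logical “virtual host”, and a mapping layer underneath decides which real resources in the pool back it.

```python
# Hypothetical sketch of an infrastructure abstraction layer. Applications
# call the VirtualHost; the mapping layer underneath binds the request to
# real servers from the pool.

class ResourcePool:
    def __init__(self, servers):
        self.servers = servers            # list of dicts describing real hosts

    def allocate(self, vcpus):
        for server in self.servers:
            if server["free_vcpus"] >= vcpus:
                server["free_vcpus"] -= vcpus
                return server["name"]
        raise RuntimeError("pool exhausted")

class VirtualHost:
    """What applications and operations tools see: a single logical server."""
    def __init__(self, pool):
        self._pool = pool                 # the hidden mapping layer

    def run(self, app_name, vcpus):
        real_host = self._pool.allocate(vcpus)
        return f"{app_name} running on {real_host} (hidden from the app)"

pool = ResourcePool([{"name": "blade-7", "free_vcpus": 32}])
vhost = VirtualHost(pool)
print(vhost.run("billing-app", 8))
```

The application never sees “blade-7”; if the mapping layer works, the composition is invisible, which is the whole point and, as the next paragraph argues, the whole risk.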

And if that layer doesn’t work, you’re dead.  There’s a real risk that it might not, because the complexity associated with a diverse resource pool doesn’t disappear in an abstract or composable infrastructure model, it just disappears from view.  It’s still there, in that new layer.  The implementation of the abstraction layer could mess things up, but even if it doesn’t, the layer hides resource issues that could undermine or defeat the goals of composability, issues that users might remediate if they realized what was going on.

The problems go back to what I’ve been calling “resource equivalence”.  A resource pool is a resource pool only if the resources in it can be freely assigned from the group and deliver comparable performance and cost points (capital and operations).  If that’s not true, then there will be some assignments that are at first perhaps a bit better or worse than others, and that eventually end up being either required or improper.  That fragments the resource pool, so you have several pools instead of one.

When we abstract resources for composable infrastructure, we take responsibility for making all the resources in the pool equally available, not equally desirable.  A data center in Alaska and one in Cape Town may be adapted to support the same hosting processes and run the same applications, but if the user of the applications is much closer to South Africa than to the Bering Sea, the performance of the applications will almost surely vary considerably depending on where you host things.

The easy answer to this sort of problem would be to say that you can create composable infrastructure only where there’s full resource equivalence in the pool.  One problem with that is that all applications aren’t equally susceptible to non-equivalence.  Another is that all resource pools that aren’t equivalent aren’t non-equivalent in the same way.  A third is that most modern applications are componentized, and each component may have its own resource sensitivities.

A better answer is to offer the abstraction layer a combination of a measurement of the non-equivalence of the various distributed resources in the pool, and a requirement for specific resource behavior for each application/component.  That could be used to help the abstraction layer map between the virtual representation of a hosting resource set and the resource pool realization.  It’s not a perfect solution, because it effectively segments the resource pool and raises the risk that applications might not find any suitable resources.  That risk could be addressed by maintaining an inventory of the resources available and allowing a management system to query the abstraction layer to learn how much capacity is available for a given application’s resource behavior requirement set.
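Here’s a hedged sketch of that idea (the properties, thresholds, and names are all invented): each pool member carries measured properties, each component states its requirements, and the layer can also answer the capacity query I just described.

```python
# Illustrative only: pool members carry measured properties (latency to the
# user, regulatory region); components state requirements; the abstraction
# layer filters the pool and can report remaining capacity per requirement set.

POOL = [
    {"name": "edge-alaska", "latency_ms": 180, "region": "US", "free_slots": 40},
    {"name": "dc-capetown", "latency_ms": 20,  "region": "ZA", "free_slots": 12},
]

def eligible(resource, requirement):
    return (resource["latency_ms"] <= requirement["max_latency_ms"]
            and resource["region"] in requirement["allowed_regions"])

def select(requirement):
    candidates = [r for r in POOL if eligible(r, requirement) and r["free_slots"] > 0]
    return min(candidates, key=lambda r: r["latency_ms"]) if candidates else None

def capacity(requirement):
    """Answer a management query: how many slots satisfy this requirement set?"""
    return sum(r["free_slots"] for r in POOL if eligible(r, requirement))

need = {"max_latency_ms": 50, "allowed_regions": {"ZA"}}
print(select(need)["name"])   # -> "dc-capetown"
print(capacity(need))         # -> 12
```

Note that the same mechanism can absorb the location-driven constraints discussed below, since a regulatory region or power-zone attribute is just another measured property.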

The current answer, if we can call it that, is to try to reduce the impact of things that would make resources non-equivalent.  The most significant of these is the network linkage that binds resources to the collective pool.  If we had infinite capacity and zero delay in network paths, we could create uniform resource pools distributed everywhere.  Data center interconnect (DCI) is the strongest element in any plan for a resource pool, and thus for any implementation of composable infrastructure.

The current answer doesn’t fully address the problem, at least for composable infrastructure to be used in public cloud or telco cloud applications.  There are significant compliance or security issues associated with the location of resources.  I’ve been involved in a number of operator-driven initiatives that involve resource virtualization, and all of them had requirements for controlling hosting locations based on regulatory requirements (don’t put certain data here because it would give government access to it or break local laws), availability (don’t put a backup resource in an area that has common power with the primary), or security (don’t put a resource here for applications that require highly secure connections).

We have no shortage of indicators that many in the industry are aware of this problem.  All of the bodies charged with cloud hosting or feature hosting have at least nibbled on the issue.  The challenge is to converge on something.  With a dozen possible abstraction layers offering a dozen possible solutions to the resource equivalence problem, we’d have nothing likely to induce operators to actually implement something.  With a different solution set for the cloud and for network operators (which we tend to have today), we’re crippling the solutions overall by limiting the problem set that each solution is targeted to, which limits how far any of them can develop.

You might fairly wonder whether supporting this sort of fine-grained resource selection makes our abstraction layer so complicated that it’s impractical.  No, it doesn’t, but that’s because it’s the requirements that are complicated, not the implementation.  In my view, the alternative to making the abstraction layer handle this kind of precise resource selection is letting that task roll upward into the service model, which invites all sorts of errors in modeling the services.  Better to have an abstraction layer that hides the complexity but not the visibility you need.