Exploring the Justification for a “New IP”

Well, we have another new-IP story emerging, one that includes both some of our old familiar themes and some new-ish ones.  The questions are first, whether any of the suggestions make sense, and second, whether they could be implemented if they did.  I think there may even be a better way to achieve the same goals.

A Light Reading story lays out the issue with perhaps a tolerable touch of assumption.  Future 5G applications like “remote surgery, haptic suits, holographic calls and self-driving robots” might be impossible in an IP world because of the inherent latency of packet handling in IP networks.  A new IP, or something decidedly non-IP, could reduce that latency and get those robots moving (perhaps to do surgery in haptic suits).

There is no question that there are applications whose latency requirements would be difficult or impossible to meet over the Internet.  Answering the two questions posed at the opening of this blog means weighing the credibility of those applications, the technical credibility of proposed solutions, and the likelihood that the solutions would be adopted by operators, which is a question of return on investment.

I don’t think many people would rule out any of the possible low-latency applications that the article lays out.  That’s not the same as saying there’s pent-up demand for them, or that their value would justify a major fork-lift to infrastructure.  To a degree, I think the proponents of new-IP models are guilty of circular justification.  We need 5G.  5G needs a differentiator.  Low latency is a differentiator.  Therefore, we need to rebuild the Internet to offer low latency so we can justify 5G.  The obvious problem with that sequence is that creating the “driver” for 5G to justify investing in it demands a much larger driver, to justify a much broader infrastructure change.

There is no clear justification for broad changes to IP to lower latency.  The Internet today demonstrates that IP is fine for what we’re doing with it.  Sure, it could be better (we’ll get to that below), but to say that it’s not suitable because it won’t support self-driving cars or factory automation begs the question of how we got both of them with the old IP.

Factory automation doesn’t require us to haul sensor data a thousand miles or more.  What factories run with sensors in one city and controllers in another?  That would be a silly approach.  Industrial IoT may not always be line-of-sight, but it’s darn sure unlikely to be intercontinental.  And self-driving cars?  Why would you put vehicle automation anywhere but in the vehicle?  Sure, the vehicle needs information on routes and traffic conditions, but those aren’t the things with a latency problem.  A pedestrian steps off the sidewalk without looking, or a vehicle runs the light; onboard sensors are how those things are detected.

I’m not saying that low latency is stupid and unnecessary, only that it’s not broadly justified if it costs much to achieve.  The article says that latencies of one or two milliseconds could disrupt remote surgery or self-driving.  We’ve already seen the latter doesn’t really need low latency.  How about remote surgery?  It is possible that we might want a specialist a thousand miles away to perform an operation, isn’t it?

Yes, but.  The “but” is that just having one surgeon doing that doesn’t mean we have to rebuild the Internet.  Do we think that all surgery would be remote?  Remember, “remote” would mean somewhere far enough to need the Internet to make the connection.  It would require a lot of surgery, and a lot of surgeons, to justify a major Internet-wide shift.  Remember, we’re not going to spend a trillion dollars for something that might happen, eventually, on a small scale.  There’d be a better way (again, see below).

OK, let’s move past the fact that none of the missions cited as a justification for remaking the Internet and IP have real broad-based demand at the scale needed.  What about the simple question of whether there would be a better way of doing IP?  The article quotes John Grant of the European Telecommunications Standards Institute (ETSI).  “All these extra things you want to do to make IP usable in a mobile system have involved putting extra headers on the packet, and that means you have to do more processing and send more bits over the interface.”

Wait!  Now we’re doing mobile remote surgery?  But let’s put this aside too.  Extra headers, more packet processing per header, more bits on the interface.  Well, yes, that’s true.  The average Internet packet is about 570 bytes, of which about 22 bytes is the IP header.  Control processes like IoT, including likely remote surgery, would require smaller packets because incremental changes in things have to be reported quickly.  We could say that an IoT packet might run only about 40 bytes, over half of which would be header.

Let’s do some math.  The speed of bits in copper is about 100,000 miles per second, and in fiber optics roughly 120,000 miles per second, so we’d have a propagation delay of about 8.3 x 10^-6 seconds per mile, or 8.3 x 10^-3 milliseconds per mile.  So, if our surgeon is more than 240 miles from the patient, the maximum of our “1 or 2 milliseconds” cited as the tolerable delay for remote surgery is already shot.

You can’t speed up light, and it’s hard to see how remote surgery limited to a maximum of 240 miles is a game-changer.  Also note that this doesn’t include the fact that whatever the surgeon does has to be reflected back along the same path as the event that caused it.  Propagation alone limits a one-millisecond event/response delay to about 60 miles, so I think the robotic surgery story is too limited an application, given the physical limits of data transmission.

Handling, meaning the serialization delay of packets at the data rate plus the handling delay of the devices, queuing, and so forth, is the other factor in total delay.  Where it might add up is in the network, where a packet has to be read, switched, and written multiple times.  Industry savvy says that a decent core router would introduce about 52 microseconds per hop for a typical packet, absent congestion.  The delay on ingress to the Internet and on egress to the destination would be longer because the speeds would be lower.  The point is that, absent congestion and propagation delay between nodes, you could do about 20 hops in a millisecond.
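To make the arithmetic concrete, here’s a minimal Python sketch using the rounded figures above (fiber propagation of roughly 120,000 miles per second and about 52 microseconds per uncongested hop, both assumptions from this discussion rather than measurements):

```python
# Back-of-the-envelope latency arithmetic using the figures cited above.
# All numbers are rounded assumptions, not measurements.

FIBER_MILES_PER_SEC = 120_000                # approximate speed of light in fiber
MS_PER_MILE = 1_000 / FIBER_MILES_PER_SEC    # ~0.0083 ms of propagation per mile
HOP_DELAY_MS = 0.052                         # ~52 microseconds per uncongested core hop

def max_one_way_miles(budget_ms: float) -> float:
    """Farthest one-way distance whose propagation delay fits the budget."""
    return budget_ms / MS_PER_MILE

def max_round_trip_miles(budget_ms: float) -> float:
    """Farthest separation allowing an event and its response within the budget."""
    return budget_ms / (2 * MS_PER_MILE)

def hops_per_ms() -> int:
    """How many router hops fit in one millisecond, ignoring congestion and propagation."""
    return int(1.0 / HOP_DELAY_MS)

print(f"2 ms one-way budget    -> {max_one_way_miles(2):.0f} miles")      # ~240 miles
print(f"1 ms round-trip budget -> {max_round_trip_miles(1):.0f} miles")   # ~60 miles
print(f"Hops in 1 ms of handling delay -> {hops_per_ms()}")               # ~19-20 hops
```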

Would changing the headers or forwarding strategy improve this?  I’ve not seen much to suggest that a big improvement is possible.  Most big routers have chips for table lookup, and while you might reduce the time required to switch a packet a bit by making the tables smaller, it’s not clear how easy that would be without rethinking what’s in them, which would mean restructuring how destinations are addressed.  Even if you did that, what would the benefit be, given the other factors in switching delay?

Those other factors are queuing and congestion.  Some of the measures people are proposing for reducing latency are designed to manage bandwidth better.  Have we learned nothing in the last decade?  Capacity these days is almost always cheaper than managing capacity, particularly when you consider that the latter approach has tended to increase complexity and opex.

My view on this is the same it’s always been.  The network of the future doesn’t need a new IP.  It might benefit from things like segment routing, but the biggest gains would come simply by creating an incredibly rich optical network that provided direct optical paths between all the major areas.  If we presumed a hop from source to metro on-ramp, to destination off-ramp, to destination, we would have 4 transit hops.  If everything was oversupplied with capacity, we’d have little chance for congestion, and that would mean 8 round-trip hops for 0.4 milliseconds, plus round-trip fiber transit delays—say for 500 miles each way, 8.3 milliseconds.

This pair of numbers tells the whole story, I think.  The great majority of latency, even in an optimum new-IP structure, comes from the source we can’t change—the speed of light in fiber.  We’re proposing to tweak something that accounts for (in my example) 4.8% of the latency.  And to do that, we invent a new IP?  I think that we could address legitimate low-latency requirements by augmenting capacity, reducing hops with more direct optical pathways, and even separating traffic that does require low latency from “normal” Internet traffic.  No new IP required.
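Here’s the same kind of back-of-the-envelope sketch for the optical-rich scenario just described, splitting the round trip into switching and fiber components (4 transit hops each way at about 52 microseconds, 500 fiber miles each way, all assumed figures from the example above):

```python
# Round-trip latency breakdown for the "rich optical core" scenario above.
# Figures are the example's assumptions: 4 transit hops each way, 500 miles each way.

MS_PER_FIBER_MILE = 1_000 / 120_000     # ~0.0083 ms per mile in fiber
HOP_DELAY_MS = 0.052                    # per uncongested hop

hops_each_way = 4
miles_each_way = 500

switching_ms = 2 * hops_each_way * HOP_DELAY_MS        # ~0.4 ms round trip
fiber_ms = 2 * miles_each_way * MS_PER_FIBER_MILE      # ~8.3 ms round trip
total_ms = switching_ms + fiber_ms

print(f"Switching delay:    {switching_ms:.2f} ms")
print(f"Fiber propagation:  {fiber_ms:.2f} ms")
print(f"Switching share of total: {100 * switching_ms / total_ms:.1f}%")   # under 5%
```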

This doesn’t address the other issue mixed into the narrative, mobile networks and 5G.  It is true that mobility poses major issues, because users move between cells and so their own IP address isn’t sufficient to route them correctly.  It’s also true that the solution to this problem, which is tunneling user traffic to the right cell based on mobile registration, adds header overhead and processing.  But….

…is that necessarily a bad thing?  It is if you want to assume that low-latency, edge computing, and the like are all justifications in themselves, not things that need to be justified.  We’re not there yet.

…is there an approach on the table to fix it, without changing IP overall?  Well, back in the heyday of SDN, I proposed that an SDN “route”, established as it was based on centrally imposed rules, could include a topology I called the “whip”.  It was held at one end and swirled around at the other.  If we established an SDN route for a given mobile device, from the packet gateway as the “held” end of the whip, we could centrally move the tip around from cell to cell.  I submit that this would work, that it would simplify mobility (the mobility management system only has to tell the central SDN controller when something moves), and that it would radically simplify the EPC or 5G core handling.
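To illustrate the idea (and only the idea), here’s a minimal sketch of the “whip” model; every class and method name is hypothetical, and a real controller would of course push actual forwarding rules rather than just updating a path list:

```python
# Hypothetical sketch of the "whip" forwarding model described above.
# Names (WhipRoute, SdnController, etc.) are illustrative, not a real SDN API.

class WhipRoute:
    """A route anchored at the packet gateway, with a movable 'tip' at the serving cell."""
    def __init__(self, device_id: str, gateway: str, initial_cell: str):
        self.device_id = device_id
        self.anchor = gateway          # the "held" end of the whip
        self.tip = initial_cell        # the end that swirls from cell to cell
        self.path = [gateway, initial_cell]

    def repoint(self, new_cell: str) -> None:
        """Swing the tip to a new cell; the anchor (and the device's address) never changes."""
        self.tip = new_cell
        self.path = [self.anchor, new_cell]

class SdnController:
    """Central controller: the mobility system only has to report moves to it."""
    def __init__(self):
        self.routes: dict[str, WhipRoute] = {}

    def attach(self, device_id: str, gateway: str, cell: str) -> None:
        self.routes[device_id] = WhipRoute(device_id, gateway, cell)

    def on_handover(self, device_id: str, new_cell: str) -> None:
        # A real controller would recompute and push forwarding rules here;
        # the sketch just re-points the tip.
        route = self.routes[device_id]
        route.repoint(new_cell)
        print(f"{device_id}: tip moved to {new_cell}, path {route.path}")

controller = SdnController()
controller.attach("phone-123", gateway="pgw-east", cell="cell-42")
controller.on_handover("phone-123", "cell-43")   # user moves; the address stays the same
```

The point the sketch tries to capture is that the device’s address never changes; only the tip of the centrally held route moves.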

Might a wiser, identity-based rather than location-based, system of routing have done better for us had it been adopted before Internet usage exploded?  Sure, the usual 20-20 hindsight applies.  I do not think that major revisions to the Internet are justified just to solve the mobility problem, which is fairly local in scope.  Until we can teleport people around, we’re not going to rearrange the relationship between cell sites and smartphones across continents in a flash.  It’s worthwhile to consider the “whip” model of forwarding path via SDN, or a similar approach, to modernize mobility support.  No need for a new IP here either.

It’s going to be very difficult to create a “new IP” at this point.  One obvious reason is that the Internet’s clients and services dominate our society, and the total investment in client-side technology alone would drive a deep stake into the ground at the site of “current IP”.  The second reason is that the “needs” being cited to justify pulling out that stake are themselves in need of justifications.  We might have done things differently had we known where all this was heading.  We might even have done it better, but we didn’t know, didn’t do, and so now the incremental rather than revolutionary approach is the right answer.  We can tweak IP here and there, but it will remain what it was, and is, for a long time to come.

Could Our Collaborative Future Lie in a Game?

Ever see a couple at dinner, each on their phone and ignoring the other?  It’s easy to dismiss this behavior, but in fact it’s a sign of the times.  Social behavior is an arbitrary set of standards that govern interactions, and if you open a new medium for that interaction, you can enable a new social behavior.  Facebook recently launched an app (“Tuned”) for couples, designed to let them share memories, music, and moods, which surely demonstrates this point.  Online life is “in”.

Given that, it’s interesting to speculate on whether things like this are a precursor to a future that virtualizes us, meaning one that places us inside an online community that not only eliminates familiar concepts like physical meeting, but also creates virtual equivalents, or even concepts that step beyond what we’ve developed for a physical-centric world.  Today’s youth, and in fact people into their 20s, evolved their critical social behaviors in an online world.  Could we all learn to accept a virtual world, at least for business collaboration, and if so, how might it happen?  Here’s a possibility: it’s all a game.

Over the last four or five decades, there have been countless studies conducted on the value of business videoconferencing.  The ones that actually tried to get answers (as opposed to trying to generate sales) generally found that there was little or no value to video in pairwise business relationships.  As the number of people involved grows, the value increases quickly, to the point where at around five people in a conference, the absence of video stalls most progress.  Yet video-call applications like Facetime are among the most popular.  Socially, at least, we like to see each other, and it may be that past research on videoconferencing was simply imperfect, examining only one dimension of a multidimensional problem.

Sociologists probably love this kind of stuff, but it could be important for us all, and not just because a virus creates a flood of working from home.  Bringing a lot of people together to do something is expensive, and it tends to foster concentration of population in cities, lengthening commutes as people try to flee the crowds, and increasing pollution.  Not to mention the cost of real estate.  If we could all work virtually, wouldn’t it create better lives?  But can we do that; are we wired to be more social, in the old physical sense, than we think?

Here’s a fundamental truth.  Our visual sense is the most powerful of all our senses.  We can assimilate more information faster by “looking” than through any other means.  Visual terms dot our conversations, even where the meaning isn’t directly visual.  “I see” means “I understand”.  We establish connections with people by seeing them.  A visually impaired friend once told me that the biggest thing they had to get over after losing sight was the feeling of loneliness.  It’s not that they were really alone, or even thought they were, but that the lack of that visual connection made them feel that way.  “Out of sight, out of mind” in a different sense.

What does this mean for remote work?  I’ve blogged about the technology elements that could improve work-from-home productivity, and about project management techniques to promote efficient WFH, but would these strategies hit the wall at some point if everyone stayed home all the time?  Could we extend the workplace into the virtual dimension more efficiently, increasing the time period when workers could tolerate it and be productive?  Better video, bigger monitors?  I don’t think that would do it, because the virtual experience video creates is still a flat and insipid version of reality.  So perhaps we have to dodge reality to solve the problem, which is where gaming comes in.

A couple of my gamer friends tell me that the ultimate answer to establishing a virtual workplace that’s truly productive and immersive is MMPG, which stands for “massive multi-player games”.  Everyone gets a character/avatar, and everyone inhabits a virtual reality complete with scenery.  The more real the interactions in that virtual reality are, the more likely it is that it feels real.  Think “Second Life”, but with a wrinkle that within that virtual world, there’s virtual work.

Twenty years ago, about the time “Second Life” came out, we were just seeing the potential for creating virtual worlds.  It was clear even then that it’s not just a matter of creating a “virtual world”, external to ourselves.  We had to be a part of that world.  You had to have virtual-reality goggles that shifted their virtual view as you turned your head and moved, and you had to “see” your hands and feet, be able to pick up something and put it somewhere, or pat a friend on the back.  Immersion in an alternate reality is critical for making that alternate real.  Augmented or virtual reality has to embrace us, and all that we expect to interact with, including people and facilities.

AR/VR can obviously allow us to create any kind of virtual reality.  We can use the tools to create fantasy or science fiction, but you could also use it to mirror the real world, while at the same time fixing real-world issues that might interfere with your enjoyment or with worker productivity.  I don’t know whether there’s any research that shows that a virtual conference of avatars sitting on top of Everest would be as productive as a conference in a virtual conference room, but I suspect that it wouldn’t be.  We need to craft the virtual environment for the task, which is work and not climbing.  What would the ultimate virtual workplace look like?  It might be fun to find out.

We also have to craft the people, or at least the avatars that represent them.  There’s nothing wrong with having your virtual alter ego aligned with your self-image and not reality, if you don’t take it too far.  I’m an average-height guy, a bit gray, and certainly I don’t appear physically threatening.  Might I like to be six-foot-two, three hundred pounds of muscle?  That’s not happening in real life, but an avatar that represents me in a virtual world could certainly have those attributes.  I could look like I want to look, no matter whether I’d bothered to comb my hair or shave that morning.  My surroundings could look like the abode of a king, even if I hadn’t cleaned my home office.  You can’t go crazy here and develop a couple heads or the body of a horse, but a little ego-preening wouldn’t hurt.

It might actually help.  Half those I webconference with these days won’t enable video, so whatever value video has to establish personal connection is lost.  In old studies on virtual work, people would hang coats on the cameras because they didn’t want to be virtually viewed.  Would they be prepared to connect via an avatar, particularly an avatar that was a kind of perfect self?  If that avatar could reflect their movements and expressions, could others in the conference make an avatar connection as easily as a personal connection?  If that avatar is a virtual person that effectively represents the real person behind it, isn’t that enough?

This might be just the sort of thing we need to consider for work-from-home.  Might COVID have presented a hard, business-centric, justification for artificial reality?  Could we build, as some stories have suggested, an enormous computer game that becomes reality?  No more offices, no desks or cubicles, no videoconferences.  Well, maybe virtual forms of those things.  It’s easier to move avatars than humans, to decorate virtual conference rooms, after all.  Could we become so accustomed to this sort of thing that we could adapt business practices to optimize its unique capabilities?

This sort of thing could have a major impact on another area that the virus has stressed—education.  Reports show that over 40% of public-school students haven’t attended any online classes at all.  Of those who have, the majority aren’t happy with remote learning, with some saying that they get perhaps a half hour to an hour of useful instruction in a (virtual) school day.  If we had virtual classrooms and avatars (with proper controls to ensure nobody picked a disruptive avatar or did anything disruptive), could we make a virtual school as effective as a real one, or perhaps even more “real?”

We actually have the tools to do everything needed to create a truly virtual world, and work within it, even from the office.  There are multiple open-source MMPG frameworks, and it wouldn’t be a major challenge to integrate traditional collaboration and web conferencing into one or several of them.  Similarly, it wouldn’t be difficult to manage more complex collaborative relationships (the mashups and panels I’ve blogged about in the past).  The tools are there.

What we’d need is a model, a framework, that would establish and police a virtual workplace.  No weapons in the conference room, all programmer avatars must be fully dressed, rules that mirror those of the real world.  We’d need policies to enforce the rules too, and mechanisms to punish those who won’t conform.  Most of all, we’d need extraordinary security and identity management.  All six-foot-plus superhero avatars look alike, so anyone adopting one could look like any of the others who did the same.  Avatar impersonation could be a huge risk, as big or bigger than digitally manipulated video.  All this is going to take money and time.

There’s no question that an experiment in virtual work wouldn’t be cheap; everyone would need a suitable VR/AR headset, a suitable PC, the software, and a means of reading movement and expression.  Is that more expensive than equipping a real office, though?  It seems to me that our recent experience is telling us that we need to think outside the box here.  Maybe “think inside the game” is the answer.

Cisco Has (Re)discovered Intent Modeling, and It Means It!

We need a new network architecture.  That’s a view that Cisco has expressed, relating to the coronavirus impact, but it’s actually been true for over a decade.  Not only that, most network operators and many CIOs have agreed with that position, according to my own research.  The problem isn’t recognition of the need as much as recognition of what the architecture would look like.  Cisco’s vision isn’t new, but it seems pretty clear that they’re newly committed to it, and that could be very significant for both Cisco and the market.

Scott Harrell, head of Cisco’s Intent-Based Networking unit, is quoted in what’s certainly a strong lead-in to this topic: “This is a huge innovation cycle we have to go through,” Harrell explained. “But just doing it from a hardware point of view won’t be enough.”  Let me phrase the point slightly differently: our focus has been on building network devices, and it needs to be on building network services.  Even Harrell’s turn of phrase is a revolution for a device vendor, and that’s why I’m seeing this as a new and significant shift in Cisco positioning.

Devices have been central to networking from the first, so having a device-centric bias is certainly understandable.  The problem is that if you evolve network services by evolving devices, you tend to be constrained by the historic way that devices relate to and create services.  If you evolve a super-modern conceptualization of a router, and you stick the new device in place of an older router, how truly different is the result going to be?  Can a single-device tail, or even a small set of new devices, wag a very large dog of installed base?

I also think Harrell is right in his view of the role of intent modeling.  It’s a “huge shift in how you build and operate networks….”, and while he’s talking specifically about enterprise networks, it’s just as true for service provider networks.  An intent model is a function in a black box, an abstract feature that you can compose into a complex multi-feature service, and implement in any way that meets the “intent”.  If today’s networks are networks of devices, then future networks should be a network of intent-modeled functions.

That doesn’t foreclose having devices inside those functional black boxes, of course, nor does it commit Cisco to filling them with something other than devices.  What it does do is tell the world that open-model networking is coming, in some or several forms, and that vendors like Cisco need to differentiate themselves not at the box level, but at some higher level—like those functional black boxes.

Things get a little murky at this point, and as usual it’s difficult to say whether Cisco’s Harrell isn’t explaining things well, whether Cisco/Harrell don’t believe intent models are what I just explained them to be, or whether the publication didn’t grasp what Harrell was saying.  Whatever the case, the article now brings in things like 5G, WiFi6, and AI, and it’s hard to make a clear connection between any of them and what intent-model networks would look like.  Let’s try to fill in the gaps.

One principle that Harrell articulates is that enterprise networks are transforming to a wireless-first world.  The argument here, which I think is valid, is that even without coronavirus pressure, we were already seeing a need to decouple workers from a specific workplace.  You used to go to work, and now work comes to you.  Part of the drive here is obviously work from home (WFH), and part is the longer-term drive to empower mobile workers.  Whatever the root cause, it’s clear that as you try to create a virtual workplace, you have to focus on a network of identities not a network of termination points.  Who/what something is matters a lot more than where it is, which in a mobile or WFH sense could be anywhere.

I think the wireless-centric vision is a driver for the new architecture rather than an intent-modeled element, but you could express the notion of an identity network in an intent model.  The “feature” of the modeled element is the ability to connect named entities whose names reflect “what” and not “where”.  That’s actually at least somewhat true of mobile networks, so it’s hardly a revolutionary concept, and in fact the IETF has wrestled with the dichotomy of what/where for decades.  WiFi 6 and 5G are certainly ways of achieving real connectivity to identity-named elements.

The link with AI is, I think, a good example of a benefit of intent modeling.  If we look at the trends in software design and deployment, we see that the emerging standard way of automating the operations lifecycle of a network or IT service is through state modeling.  A cooperative system like a network has a “goal state” that represents optimum operation.  It also has a “current state” that may not be the goal state, if something has happened that has to be dealt with.  Operations automation in any form can be viewed as the set of tools and functions that translate the cooperative system from an abnormal current-state condition to the goal-state condition.

This is easier to do with intent modeling, for two reasons.  First, the intent model can express, as its external properties, the specifics of the goal state.  That expression, the interfaces to the black box, can be seen as a contract made with the higher level, and also as a specification that the contents of the black box have to meet.  Second, the concept of the intent-modeled functional black box lets each modeled element work within its implementation to meet the external specifications.  That breaks down the task of remediation into manageable (no pun intended) pieces.
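As a rough illustration of the state-modeling idea (every name here is hypothetical, not a Cisco or standards construct), an intent-modeled element boils down to a declared goal state, an observed current state, and a reconciliation step that tries to close the gap inside the black box:

```python
# Illustrative only: an intent-modeled element as goal state plus reconciliation.
from dataclasses import dataclass

@dataclass
class GoalState:
    max_latency_ms: float
    min_availability: float

@dataclass
class CurrentState:
    latency_ms: float
    availability: float

def meets_goal(current: CurrentState, goal: GoalState) -> bool:
    """The externally visible contract: does the element satisfy its intent?"""
    return (current.latency_ms <= goal.max_latency_ms
            and current.availability >= goal.min_availability)

def reconcile(current: CurrentState, goal: GoalState) -> bool:
    """Translate an abnormal current state toward the goal state.
    Returns True if the element now meets its contract, False if it cannot."""
    if meets_goal(current, goal):
        return True
    # Placeholder remediation: a real implementation might reroute, scale,
    # or apply AI/ML-selected actions inside the black box.
    current.latency_ms = min(current.latency_ms, goal.max_latency_ms)
    return meets_goal(current, goal)

goal = GoalState(max_latency_ms=20.0, min_availability=0.9999)
state = CurrentState(latency_ms=35.0, availability=0.9999)
print("Contract met after reconciliation:", reconcile(state, goal))
```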

Suppose we have an intent modeled function called “VPN-Core”.  This would define (directly or implicitly) the SLA of the service.  If that SLA isn’t met, then the fault would have to be reported upward to the retail service level, which would then either try to remediate through recomposition, or recognize an SLA violation.  However, VPN-Core would first try to fix its own problems internally.  If it included all the trunks and devices and software of an entire VPN core, that could be a mighty task.  In reality, though, VPN-Core would likely decompose into more primitive intent models, like “MPLS-VPN-Core” or “SD-WAN-VPN-Core”, and these would then decompose into administrative or vendor subnetworks.  Only the bottom layer would decompose into devices/software instances.

In this approach, a fault (which of course has to occur somewhere at the bottom) would first be managed by the low-level model, which could use AI/ML or whatever else seems best.  It would try to fix the problem, and if normal operation was then restored, everything is sweetness and light.  If a fix wasn’t possible, it would report an SLA fault to the superior modeled element.  There, AI/ML could again be applied to recompose that superior element’s inside-the-box structure to rebuild a working feature that meets its SLA.  As before, success would stop the process and failure would escalate up the structure toward the commercial/retail top.
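The remediate-or-escalate pattern is easy to sketch as well; again the class names are hypothetical and the “remediation” is a stub, since the point is the flow of responsibility rather than any real orchestration API:

```python
# Hypothetical sketch of hierarchical intent models with remediate-then-escalate.

class IntentModel:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.sla_ok = True          # stand-in for a real SLA measurement

    def remediate(self):
        """Try to restore this element's SLA inside its own black box
        (rerouting, recomposition, AI/ML-chosen actions). Stubbed here."""
        return self.sla_ok

    def handle_fault(self, child):
        """A child reported an SLA fault it couldn't fix; recompose at this level."""
        print(f"{self.name}: recomposing around fault in {child.name}")
        if not self.remediate():
            print(f"{self.name}: SLA violation escalates toward the retail level")

    def report_fault(self, parent):
        if self.remediate():
            print(f"{self.name}: fixed locally, no escalation")
        elif parent is not None:
            parent.handle_fault(self)

# VPN-Core decomposes into a more primitive intent model, per the example above.
mpls_core = IntentModel("MPLS-VPN-Core")
vpn_core = IntentModel("VPN-Core", children=[mpls_core])

mpls_core.sla_ok = False            # simulate a fault the lower model can't fix
mpls_core.report_fault(vpn_core)    # escalates; VPN-Core tries to recompose
```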

This, I think, is what Harrell is getting at when he says “It’s about going from managing a box at a time to managing the entire fabric at a time. So instead of just managing one box, I’m managing the entire network.”  Intent modeling lets us decompose a service into features, and decompose a service SLA into feature SLAs.  Those features and their SLAs are then further decomposed until you get to a primitive behavior set, and if there are multiple possible ways of creating those features, you have multiple behavior sets.  All this is managed at every level, via AI/ML or whatever works.

If all this is correct, then the tie-in with coronavirus is also clear.  An intent-modeled network is easy to equip with automated operations lifecycle management; in fact, it’s implicit.  An intent-modeled network is easier to recompose under load—again, it’s implicit.  An intent-modeled network is easily adapted to new implementations because any implementation of a specific intent-modeled feature is conformant to the service usage requirements.

And Cisco could be the ideal candidate for leadership here.  They’re committed to change, clearly, and for the best possible reason—that the status quo isn’t going to be profitable, and may not even be survivable.  They have a mighty marketing fist to batter opposition aside, and sweet talk to build consensus among wary CIO and CTO types.  Best of all, they actually grasp what’s needed.  As is often the case these days, they do need to refine their messaging a bit, creating the same kind of intent-within-intent that the whole intent modeling thing proposes to do, but at the positioning level.

We need a revolution, applied as evolution.  For that, you need a cheerleader, but one with a careful eye on the bottom line, who can make revolution palatable by evolving to it.  Cisco could be the one, and of the network equipment vendors, I think it’s likely the only player who could say that.

Can IBM Defend Its Hybrid Cloud Positioning?

Hybrid cloud is the cloud, so it’s good that IBM recognizes that.  It’s also true that IBM and Red Hat combine to create a highly credible hybrid cloud solution.  What’s still an open question is whether IBM or Red Hat really recognize what’s going on in hybrid cloud, what’s needed to win there, and who might be the big threat to their achieving their hybrid goals.

In a post to IBM employees, the new IBM CEO (Arvind Krishna) mentioned the strategic importance of hybrid cloud, centered on this paragraph:

IBM has already built enduring platforms in mainframe, services, and middleware. All three continue to serve our clients. I believe now is the time to build a fourth platform in hybrid cloud. An essential, ubiquitous hybrid cloud platform our clients will rely on to do their most critical work in this century. A platform that can last even longer than the others.

Marketing material has always been at best only loosely related to reality, if not orthogonal to it, and that’s true with this critical statement.  Hybrid cloud as the “Fourth Platform” (caps are mine) is a decent marketing gimmick.  That it builds on IBM’s past successes in mainframe, services, and middleware is another plus.  You can even say that making “hybrid cloud” an explicit platform is congruent with my own comments that we need to have a hybrid cloud architecture.

Here’s another quote: “There’s a unique window of opportunity for IBM and Red Hat to establish Linux, containers and Kubernetes as the new standard. We can make Red Hat OpenShift the default choice for hybrid cloud in the same way that Red Hat Enterprise Linux is the default choice for the operating system.”  The second sentence here is clearly true; an architecture for hybrid cloud could be framed credibly through Red Hat’s OpenShift framework.  It’s the first sentence that gives me a bit of concern.

I’d be the last person to denigrate the role of containers and Kubernetes in hybrid cloud.  They are a necessary condition, so symbolic logic would say.  They’re not a sufficient condition, though.  Hybrid cloud has to combine cloud-native with traditional IT.  If you reference my blog yesterday, there are four legs to the cloud-native stool, and in this analogy, containers and Kubernetes are the seat.  Deployment rests on the seat, but without the four legs, everything goes down.  That could include IBM’s plans.

If you look through IBM’s website blogs, you find a pretty decent articulation of the main IT principles and priorities that could relate to hybrid cloud.  The problem is that you’re left to assemble the end-game of hybrid cloud on your own, or to accept at face value the notion that containers and Kubernetes define everything you need.  It’s back to the old joke of the elephant behind the screen, with people tasked to reach in and touch a piece, from which they must identify the whole.  No, hybrid cloud is not a tree, a mountain, or a snake, it’s an elephant.

OpenShift is important, to IBM/Red Hat and to buyers, largely because it’s more than containers and Kubernetes.  Everyone has that, at this point, or they’re spinning their wheels.  What makes OpenShift powerful is that it is, in fact, a qualified “cloud-native” and “hybrid cloud” platform.  The big question is whether, lacking any specific IBM/Red Hat articulation of why that’s true, the value of OpenShift can be recognized by buyers and exploited by IBM.

If lack of insight in presenting the entire elephant versus its parts isn’t handicap enough, there’s the “cognitive” tie-in.  IBM’s CEOs, and in fact IBM marketing overall, are obsessed with AI and Watson in all its guises.  Jim Whitehurst, the new IBM President, is head of “Cloud and Cognitive Software”.  Why the two should be linked isn’t clear, at least if you demand functional reasons rather than something organizational (like “both are new and we didn’t know where else to put them”).  Are we conditioning cloud success on AI integration?  The other way around?  It seems a needless tie-in, and since organizations tend to message according to their own mission, isn’t this linkage an invitation to fuse the two concepts in positioning even if that’s not IBM’s intention at all?

Here’s the thing.  Software architecture has to match the mission of the software at the business level, with the resources expected to be committed to hosting that software.  The first task in that matching process is defining the properties of hosting options that will dictate the optimum selection of resources, and the management of the selection process.  The cloud’s properties are well-known, and to me the most important thing that’s happened in cloud computing is the 2019 burst of insight among buyers that “hybrid cloud” meant “cloud-front-end-to-data-center”.  A hybrid cloud architecture has to embrace that insight.  OpenShift technology does, but OpenShift positioning?  Not so much.

There are two risks to undershooting market reality.  One is that you’ll delay your own revenue growth because you won’t address opportunities as quickly as you could.  That could demoralize salespeople and even induce strategy changes that could hurt the technology while trying to fix the positioning.  The other, which is clearly worse, is that somebody else gets it right.  In the hybrid cloud area, there’s no shortage of possible candidates for the role of hybrid-cloud Pied Piper.

Among the direct IT competitors, VMware looms large…maybe.  In raw technology terms, meaning summing up the pieces of the cloud-native puzzle that VMware could bring to bear, they’re equal to OpenShift.  Their problem is that their positioning is even worse.  What is VMware’s over-arching cloud-native story?  It’s a book without a title.  Part of the problem is the relationship between VMware and Dell, which means VMware can position its wares for general server use only at the risk of competing by proxy with a major shareholder.  A bigger part is Pivotal.

In 2012, Pivotal was spun out of EMC, which at the time owned VMware.  Dell bought EMC (and thus a controlling stake in VMware) in a deal announced in 2015 and closed in 2016, with VMware continuing to trade publicly as a separate company and Dell issuing a tracking stock tied to it.  Pivotal became a container and cloud software player, and its container tools were adopted by VMware as Pivotal Container Service.  Pivotal was also the main commercial supporter of Cloud Foundry, an open-source cloud-native platform.  In 2019, VMware said they were buying Pivotal back.  This sounds a lot like a very convoluted family tree, and where all this stuff leaves VMware in terms of cloud-native positioning is hard to say.

Pivotal had its own orchestrator, competing with Kubernetes, but by 2019 it was switching to the mainstream Kubernetes as its centerpiece.  That essentially leaves the value of Pivotal as being the Cloud Foundry connection, but while Cloud Foundry is open-source (at the Foundation level, anyway), it now faces the up-and-coming Knative, a Kubernetes-centric open-source cloud-native and serverless approach.  Does VMware intend to pursue Cloud Foundry instead?  If so, they’re risking getting on the wrong side of market momentum with Knative, as they did with Kubernetes.  If not, then why did they buy Pivotal?

Perhaps the biggest risk to IBM’s goal of hybrid cloud dominance comes from the cloud providers.  All three of the major providers have hybrid cloud aspirations, and since IBM clearly expects to have hybrid cloud successes pull through its own IBM Cloud offerings, a cloud-provider encroachment into IBM’s hybrid turf would hurt IBM on two fronts.  Who might do the hurting?  We have to look at both the technical and marketing dimensions to decide.

Technically speaking, Google is a formidable potential risk to IBM’s hybrid plans.  Google is the originator of Kubernetes, and they’ve used their own private search data centers and networks to prove out a lot of cloud-native concepts.  They have what might well be the only practical experience on the planet in the application of cloud-native on a mega-scale basis, and the only experience in tight integration of next-gen networking concepts like SDN.

On the marketing side, Google’s big problem is that they don’t have an account presence like IBM does.  In fact, they’re not particularly known as a provider of enterprise software.  Their vision of cloud-native is powerful, but hybrid cloud meshes cloud-native and the data center, and Google doesn’t have traditional data center street cred.

Amazon has the same marketing problem as Google, despite the fact that they’ve tried to align themselves with data center mavens.  They have a VMware relationship, in fact, which raises what IBM’s CEO must think is the ultimate ugly competitive risk—Amazon buys VMware.  I’m doubtful that’s in the cards, though.  The big point, though, is that Amazon is winning “the cloud”, and if they accept that hybrid-cloud credentials are important, they might be devaluing their own approach, which is to offer superior facilitating tools and services on the cloud side.

That’s the technical side of Amazon’s story.  They might well believe that enterprises know the relationship between front- and back-end elements in a hybrid cloud is a simple queue or bus.  Consider it a given, and consider the creation of specialized front-end features the differentiating factor.  They could be right, because what truly differentiates the cloud is much easier to build in as a service than to add on as a software tool from the outside.

Microsoft is our last cloud provider to consider, and they’ve been an IBM rival for a long time.  In terms of strategic influence on accounts, Microsoft has a slightly lower score than IBM but they touch a larger audience.  You could argue that IBM plus Red Hat roughly matches the reach of Microsoft’s influence, but of course IBM and Red Hat are still singing slightly different songs even if they’re seated in the same choir.  Microsoft has one very strong and singular voice.

Technologically, Microsoft has likely done more than any other public cloud provider in framing a hybrid cloud story, from the very first.  They have a specific product designed to integrate cloud and data center elements (Service Fabric), and it works for stateful and stateless components.  They support containers and Kubernetes, but Service Fabric may be a competitor with Istio for service mesh, which is OK as long as Microsoft pushes Service Fabric ahead more quickly.

It’s the legacy of Service Fabric that poses the threat; it’s been around a long time, and so it doesn’t get much incremental coverage.  It’s perhaps a product ahead of its time, one where early adopters loved it, but the mainstream mission of a service mesh didn’t come along until after the media lost interest in Service Fabric.

That makes this a three-way battle of articulation, IMHO.  IBM/Red Hat, VMware, and all the public cloud providers have the pieces they need to articulate a hybrid cloud architecture, though they’d each take a different slant at it.  Right now, enterprises don’t know whether they need one, and if they do which one is best.  A technology war in this space might take years to resolve because real products are required.  A positioning war could be resolved overnight; as I’ve said before “Bulls**t has no inertia”.  The question, then, is whether the contenders themselves have inertia, and I think we’ll know by the end of the summer.

Evaluating “Cloud-Native” Claims

It may be that the only thing worse than a trite “prescription” for something is a complete lack of definition for that same “something”.  We may be in that situation with “cloud-native”.  Worse, because we don’t really have a broadly accepted definition of what cloud-native is, we may be facing major challenges in achieving it.

The definitional part, I think, may be easier than we think.  “Cloud-native” to me is an application model that works optimally in public cloud services, but is either difficult or impossible to achieve in a data center.  The trick in getting anything from this definition is to confront what’s truly different about public cloud.

This can’t be something subtle, like economics. Even if it were true that public cloud was universally cheaper than data centers (which it is not), the universality of that truth would mean you couldn’t accept “more economical” as the fundamental property of “cloud-native”.  After all, “native” means “emerging uniquely from”.  What are the “native” properties of the cloud?  I think we could define four of them.

Number one is an elastic relationship between an application component and the resources that host it.  Elasticity here means both “scalable”, in the sense that the instances of a component can scale (infinitely, cost permitting) based on load, and “resilient”, meaning that a component can be replaced seamlessly if it’s broken.

This is the most profound of the defining points of cloud-native, because it reflects the focus of the whole idea.  People think that containers equals cloud-native, but the fact is that you can run anything in a container, and run it the same way it could run in a data center.  There’s nothing uniquely “cloud” about containers, nor should there be.  A container is a mechanism to simplify component deployment by constraining what specifics that component can call upon from the hosting environment.  A virtual machine is a general hosting concept; it looks like a server.  A container looks like a specific kind of server, deployed in a specific network access framework.

I think you can make a strong argument that a cloud-native application should be based on containers, not because no other basis is possible, but because other hosting frameworks are (relatively) impractical.  By presuming a specific but not particularly constraining framework within which componentized applications run, container-based systems enhance application lifecycle management.

In order for a component to be freely scalable (additional instances can be created in response to load) and resilient (the component can be replaced without impact if it fails), it cannot store information within itself.  Otherwise, the execution of multiple instances or new instances would be different because that stored data wouldn’t be there.  This is sometimes called “stateless” behavior, or “functional” behavior.

It’s difficult to build a transaction processing application entirely with stateless components; cashing a check or buying a product is inherently stateful.  Hybrid cloud applications reflect this by creating an elastic front-end component set in the cloud, feeding transactions to a transactional back-end usually run in the data center.  That architecture is critical for cloud-native behavior.
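A minimal sketch of that split, with all names hypothetical: a stateless front-end handler that validates a request and hands it off via a queue (standing in for whatever bus or messaging service a real deployment would use), so that state lives only in the back end:

```python
# Illustrative hybrid-cloud split: stateless cloud front end, stateful back end.
import json
import queue

transaction_queue: "queue.Queue[str]" = queue.Queue()    # stands in for a real queue/bus

def front_end_handler(request: dict) -> dict:
    """Stateless: any instance, new or old, produces the same result for the same input.
    Nothing is stored here, so instances can be scaled or replaced freely."""
    if "account" not in request or "amount" not in request:
        return {"status": "rejected", "reason": "missing fields"}
    transaction_queue.put(json.dumps(request))            # hand off to the stateful side
    return {"status": "accepted"}

def back_end_worker() -> None:
    """Stateful back end (typically in the data center): posts transactions to a ledger."""
    ledger = []                                           # persistent state lives here
    while not transaction_queue.empty():
        ledger.append(json.loads(transaction_queue.get()))
    print(f"Posted {len(ledger)} transaction(s)")

print(front_end_handler({"account": "12345", "amount": 250.00}))
back_end_worker()
```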

The next thing you need is a mechanism for virtualizing the cloud-resident components.  Containers, as I’ve said, aren’t about that.  A “virtual component” looks like a single functional element no matter whether it’s been running an hour or has just spun up, no matter whether there’s only one of them or a thousand.  To backstop the ability to scale and replace components because they’re stateless, we need to have a way of binding a request to a specific component instance.  That means finding the darn thing, since it might have been spun up anywhere, and also finding an instance if it scaled, via a load-balancing mechanism.

You can do discovery and load-balancing already, and in fact many of the “Layer 3 switches” of the recent past were server load-balancing tools.  The thing about cloud-native is that you don’t have servers in a virtual sense, so you don’t have a data center to sit load-balancing switches in.  You need distributed load balancing and discovery, and that means you need service mesh technology.

A service mesh is both a concept and a tool.  Conceptually, a service mesh is a kind of fabric into which you deploy stateless components (microservices), and which provides those components the necessary discovery and load-balancing capability.  As a tool, service meshes today are based on what are called “sidecars”, which are small software elements that act as “adapters”.  The sidecar binds with (connects to) the microservice on one end, and with the fabric and related software tools that establish the service mesh on the other.  Sidecars mean that every microservice doesn’t need to be written to include service-mesh logic.

Istio is the most popular of the service mesh tools today, with linkerd in second place.  However, unlike container orchestration, which has decisively centered on Kubernetes, there’s still a chance that something better will come along in the service mesh space.  Right now, service meshes tend to introduce considerable latency.
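To show what the mesh (through its sidecars) takes off the microservice’s hands, here’s a toy discovery-and-load-balancing sketch; the registry and round-robin policy are simplified stand-ins for what Istio or linkerd actually provide:

```python
# Toy illustration of what a service mesh provides: discovery plus load balancing.
# Real meshes (Istio, linkerd) do this in sidecars; the names here are hypothetical.
import itertools

class Registry:
    """Tracks where instances of each microservice are currently running."""
    def __init__(self):
        self._instances: dict[str, list[str]] = {}
        self._cursors: dict[str, itertools.cycle] = {}

    def register(self, service: str, endpoint: str) -> None:
        """Add a newly spun-up instance and refresh the round-robin cursor."""
        self._instances.setdefault(service, []).append(endpoint)
        self._cursors[service] = itertools.cycle(self._instances[service])

    def resolve(self, service: str) -> str:
        """Discovery plus round-robin load balancing across live instances."""
        return next(self._cursors[service])

registry = Registry()
registry.register("checkout", "10.0.0.5:8080")   # scaled-out instances could be anywhere
registry.register("checkout", "10.0.1.9:8080")

for _ in range(4):
    print("route request to", registry.resolve("checkout"))
```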

The third thing that cloud-native implementations need is serverless or “functional” capability.  In programming, a “function” is an expression whose outputs are a function of the inputs and nothing else.  That’s a subset of what we’ve called “stateless” behavior above.  The functional property is important for serverless cloud, because in that cloud model, a component/microservice is loaded only when it’s invoked, so there’s no precommitted resource set in place to manage.

Serverless isn’t for everything, of course.  The more often a microservice is invoked, the greater the chance that it would be cheaper just to instantiate the component permanently and pay for the resources full-time.  But agility and resiliency are the fundamental properties of the cloud, and serverless extends these properties to applications that are so bursty in use that they wouldn’t be economical in the cloud at all without a different payment model.
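In code terms, the “functional” requirement just means a handler like the hypothetical one below: its output depends only on the event it’s handed, so the platform can load it on demand and discard it afterward without losing anything:

```python
# A "functional" (stateless, serverless-style) handler: outputs depend only on inputs.
# The event shape and handler signature are illustrative, not a specific provider's API.

def handle_reading(event: dict) -> dict:
    """Invoked only when an event arrives; no resources or state are held between calls."""
    threshold = event.get("threshold", 100)
    value = event["value"]
    return {
        "sensor": event["sensor"],
        "alarm": value > threshold,     # the same inputs always yield the same output
    }

print(handle_reading({"sensor": "pump-7", "value": 130, "threshold": 100}))
print(handle_reading({"sensor": "pump-7", "value": 42}))
```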

Serverless raises the last of our cloud-native needs, but only because it came to light with the invention of serverless cloud.  Actually, it’s needed in all forms of true stateless cloud-native computing.  We’ll call it sequencing.

An “application” is usually considered to be an ordered set of actions along a “workflow”.   The question is, how does Microservice A know what’s supposed to run next?  Workflows stitch components into applications, but what defines them?  We’ve had some sequencing strategies for decades, but cloud-native needs a complete set.

Most applications are sequenced by internal logic, meaning that there’s some master component that “calls” the others at the time and in the order needed.  That’s sometimes called a “service broker” and sometimes referred to as a “storefront design pattern”.  Some applications are sequenced by “source routing”, meaning that a list of tasks in a particular order is attached to a unit of work, and at each step the current item on the list is stripped off so the next item is handled next.  In some systems, each unit of work has a specific ID, and a state/event table associates processes with the state and “event” ID.
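The last of those options is the easiest to show compactly.  Here’s a hypothetical state/event sketch: each unit of work carries a state, and the table maps the combination of state and event to the process that runs next and the state that results:

```python
# Hypothetical state/event sequencing: (current state, event) -> (process, next state).

def validate(order): print("validating", order["id"])
def charge(order):   print("charging", order["id"])
def ship(order):     print("shipping", order["id"])

STATE_EVENT_TABLE = {
    ("new",       "received"): (validate, "validated"),
    ("validated", "approved"): (charge,   "charged"),
    ("charged",   "packed"):   (ship,     "shipped"),
}

def dispatch(order: dict, event: str) -> None:
    """Look up the process for this unit of work's state and the event, then advance."""
    process, next_state = STATE_EVENT_TABLE[(order["state"], event)]
    process(order)
    order["state"] = next_state

order = {"id": "A-100", "state": "new"}
for event in ("received", "approved", "packed"):
    dispatch(order, event)
print("final state:", order["state"])
```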

There are tools available to support all of these options, and there’s really no “best” single strategy for sequencing, so it’s unlikely that we’ll see one supreme strategy emerge.  What’s critical is that all strategies be possible.

That closes an important loop.  Cloud-native is an application architecture, a model of development and deployment that maximizes the benefits of cloud computing.  Like all application architectures, it has to be there in its entirety, not in pieces that vendors find convenient.  That means that when you read that somebody has a “cloud-native” approach, you need to understand how their approach fits into all these cloud-native requirements.  Most don’t, and that means they’re blowing smoke.

Why WFH Needs a Different Project Management Model

We now have a good reason to ask, in earnest, a question that’s been asked for theoretical reasons before.  “Could what we call ‘office work’ be conducted in a purely remote or work-from-home (WFH) context?”  We might even be approaching the point where we ask whether many of the jobs that we see as requiring workers-in-a-workplace could be done in a different way.  I’ve diddled with modeling, and found some interesting points.

Office work is a mixture of task-specific behaviors, and it appears that we can safely group the behaviors into two models.  One of the behaviors is the work-with-this model, where a worker interacts with their assignment and its associated data.  The other is the work-with-people model, where the worker collaborates with other workers to achieve a collective result.  We can extend this to the overall workforce by adding a third model, the work-on-this model, where the task involves direct physical interaction with something, a product, for example.  That involves robotics, and it’s a topic for a later time.

It’s pretty obvious that work-with-this tasks could be conducted anywhere that the task-related data could be made available, in a secure and governance-conformant way.  When I blogged about the broad impact of COVID on tech HERE, I discussed ways of providing tech support for the enhanced data delivery that the work-with-this model would require.  There are other issues to address for that model, which I’ll get to below.

The work-with-people model requires that we support not an exchange of data, but rather facilitate a manner of human interaction.  The interaction could be pairwise or group, casual and tactical or formal and scheduled, built around a context or carrying its own in-conversation context.  I talked a bit about the collaborative channel in the blog referenced above, but again there are things about the model beyond the technical facilitation that need to be addressed.

The biggest issue we have in remote/WFH work is that we have work practices that have evolved with the expectation that the work, the people, and the interactions take place within a facility, meaning that the “office” becomes a place where we collect all the elements of “work”, so our work practices presume that we have in fact collected them.  My suggestion has been that to make remote and WFH stuff work, we’d need to virtualize the office, meaning create a framework within which we can work via technological coupling rather than through physical colocation.

The work practice issue is the one I want to talk about here.  We could spend an enormous amount of time, money, effort, creating a perfect virtual analog of a physical workspace.  Decades ago, when collaboration of this sort was called “computer-supported cooperative work” or CSCW, Bellcore did an ambitious project that tried to do just that, including the notion that because people could pass by a cubicle and glance in to see if the person was busy, they’d support a “glance” as a momentary activation of a webcam to allow someone to virtually check the status of somebody.  It failed, in no small part because people didn’t believe the camera was activated only for that purpose, so they covered it.  The point is that sometimes you have to jigger the problem in order to solve it.

What would an ideal work practice for remote/WFH look like?  Let’s start by presuming that it wouldn’t have developed the same way as office work practices would, and then look at optimal goals, working from them to remote practices.

Most work will be a combination of our work-with-this and work-with-people, meaning that workers will have individual tasks that they perform largely on their own, interspersed with collaborative relationships.  It’s not difficult to see that if this is the case, then the optimum approach would be to manage the workflow so that periods of collaboration were concentrated into specific episodes, leaving the worker independently pursuing their own tasks in between.  This would reduce the risk that collaborative needs that couldn’t be easily scheduled would hold up the progress of workers.  WFH or remote work, then, is substantially optimized through project management.  To do that, we have to look at the nature of task-related interactions.

“Let me know if you have questions” is a popular ending to an assignment, but I submit it’s an example of a “colocation” presumption in work practices.  Rather than provide full and detailed information, we take a stab and tell the other person to come back with questions.  That might even be smart, if we could assume that in most cases the other person wouldn’t have questions.  The problem is that it’s not compatible with WFH/remote work, because “letting me know” now involves a more complicated scheduling process.

Not an impossible one, though.  The mistake I think Bellcore made was thinking that the casual interactions had to be supported for the practice not the goal.  We could probably address a lot of the “let me know” interactions through an email.  We could address them within the context of some project management software that provided for collaborative messaging. Casual “let-me-know” collaboration, then, can be supported remotely without too much effort, as long as we make sure that we keep the solution casual too.

“Let’s get together” is an example of what might be considered extemporaneous meetings.  Something has come up (possibly one of those “let-me-knows”) that suggests that multiple people need to be level-set on something, or need to collaborate in dealing with something.  These extemporaneous meetings may need to reference something, create something, or simply communicate something.  That means that we need to accommodate the distribution and sharing of collateral information, which might mean application access, sending a document, collaborating on a document, etc.

This kind of collaboration marks a sort of boundary.  It’s likely that casual collaboration is well-supported via a generalized set of tools.  It can be integrated with project management tools already in use, if those tools support the workers involved in remote work and also support casual messaging, but it’s probably not useful to acquire and socialize new tools for this purpose.  The get-together collaboration might make a tool worthwhile.

A collaborative tool is most likely needed/justified where the team working on a project is fairly homogeneous and sharing the project’s work over a fairly long period.  This permits the tool to be specialized to the work, which makes it more likely to be accepted by the team members, and to augment productivity.  You could probably say that get-together collaboration either justifies a specialized tool for collaboration, or it should be considered a target for general WFH/remote work augmentation.

The question here is the uniformity of the group, and as an example let’s take software development teams.  These teams have a common mission, their work is a classic mix of on-your-own and reviews, and all the workers in the group are likely to have a similar frame of reference and similar data/application frameworks that drive both their individual work and their collaboration.  There are many software project management and code review tools, and even organizations that don’t use them might find them helpful if the team members were suddenly transported to remote/home locations.

“Set up a meeting” is the last example, and this is obviously a formalized collaboration.  These meetings are scheduled, must-attend, and significant in their role as part of the overall project workflow.  Just as with the “get-together” meetings, these may be important enough to justify special tools to create and support them.  Since nearly all companies will use this approach for cross-organizational coordination, it may make sense to look to more general project management tools.  Again, this is most likely to succeed if the tools are already in use, but if there are enough of these formal cross-organizational collaboration activities to support, the broader use may justify a new software tool to provide that support.

Overall management of WFH/remote workers engaged in common project activity seems to work best if the project is organized as a series of regular meetings (no more often than weekly and no less often than biweekly, according to companies I’ve chatted with on the topic).  The greatest success comes when these meetings are scheduled on Mondays, so the meeting level-sets the week’s activity.  Best practice also seems to be to divide the meetings into three segments: a review of tasks scheduled for completion, status reviews of longer in-progress tasks, and problem review.

The most important thing about task-oriented project management as a WFH/remote work tool is to be sure the tool integrates collateral in a useful way, which is where the mashup-and-panel approach I cited in the blog referenced above comes in.  Project management for efficient WFH and remote work, WFH in particular, needs to be flexible enough to schedule things that used to be extemporaneous, to organize and optimize workflows to minimize the need for interaction, and to integrate with things that can present and organize collateral.  If we have all of that, we can make WFH very efficient indeed, enough to justify looking at it as an alternative to “the office.”

Are Vendors Responding to the “Lost” Carrier Cloud?

Large-scale data center deployment by operators depends on having large-scale drivers.  I’ve pointed out in past blogs (one earlier this week) that public cloud providers saw the lack of a sound carrier cloud strategy as an opportunity to address those drivers, and thus induce operators to outsource their carrier-cloud missions.  5G is an obvious target area, as my earlier blog said.

If carrier cloud is a hundred thousand data centers’ worth of business, it’s clear that vendors would like to see operators building their own clouds.  To make that happen, there would have to be a credible model for deployment, one that didn’t create the threat of vendor lock-in.  HPE may have decided to take the lead in generating one, because it’s announced its Open Distributed Infrastructure Management Platform, an alliance between HPE, the Linux Foundation, AMI, Apstra, IBM’s Red Hat, Intel, Tech Mahindra and World Wide Technology.

ODIMP (why do vendors pick such long product names?) isn’t actually a 5G-specific tool; it’s a carrier cloud deployment and management framework intended to address the biggest potential risk of 5G Core, which is that the complexity of a hosted-function service framework would overwhelm traditional operations.  Nokia, as I said in yesterday’s blog, has taken its own swipe at both the 5G space and service automation overall with its AVA platform.  Is ODIMP a better strategy?  Is it a real solution to the 5G Core problem?  Let’s dig a bit and see.

The platform is based on DMTF Redfish, a set of specifications that define open, RESTful APIs for managing carrier cloud and other “converged, hybrid” infrastructure.  Redfish schemas are nice because they’re rather like intent models; an element represents a resource, a collection of resources, a service, and so forth.  While the first release focuses on servers, the goal is to cover the whole of the “software-defined data center” (SDDC) concept.

Having an entire data center, or in fact a collection of data centers, abstracted into a set of schema/elements is a nice touch, something that would benefit any application or service that depended on hosting features on pooled resources, particularly if the pools were made up of edge and more centralized data centers.  This model lets an operator build up a data center from Redfish-compatible gear, then define its elements and structure, or define a structure and backfill it with gear.  Since everything conforms to the Redfish APIs, the applications that manipulate the SDDC are vendor-independent, so lock-in isn’t a worry.
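To make the Redfish abstraction concrete, here’s a minimal sketch of what walking that resource model looks like.  The /redfish/v1/ paths and the Members/@odata.id structure follow the published DMTF schema, but the host, credentials, and printed properties are placeholders, not anything from HPE’s ODIM implementation.

# Minimal sketch of the Redfish pattern described above: walk the service
# root to the Systems collection and read a few properties from each member.
# The /redfish/v1/ paths follow the published DMTF schema; the host and
# credentials here are placeholders.
import requests

BASE = "https://bmc.example.net"          # hypothetical managed endpoint
AUTH = ("operator", "password")           # placeholder credentials

def get(path):
    r = requests.get(BASE + path, auth=AUTH, verify=False, timeout=10)
    r.raise_for_status()
    return r.json()

root = get("/redfish/v1/")                     # service root
systems = get(root["Systems"]["@odata.id"])    # collection of server resources

for member in systems["Members"]:
    system = get(member["@odata.id"])          # one abstracted "element"
    print(system.get("Id"), system.get("PowerState"), system.get("Model"))

The point of the pattern is that the loop looks the same no matter whose hardware sits behind the endpoint, which is exactly the lock-in protection the platform is claiming.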

The Resource Aggregator is perhaps the nicest feature of the platform.  This is what does the modeling work, and modeling is the underpinning of software-centric zero-touch service lifecycle automation.  It’s also the foundation of the TMF’s NGOSS Contract work, seminal in my view with regard to data-driven service management (as opposed to AI management).  The ODIMP Resource Aggregator is not, as some stories have stated, a tool specifically for enterprises, meaning non-service-provider companies; it’s HPE’s implementation (supported and augmented) of ODIMP.  The models produced, as I’ve noted, structure and generalize by abstracting infrastructure, which is surely the right approach.
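For readers who haven’t followed my earlier intent-model discussions, here’s a purely illustrative sketch of the idea: an element declares what it’s supposed to deliver and hides how it delivers it.  None of the class or field names come from ODIM; they’re hypothetical.

# Purely illustrative intent-model sketch: the element is "opaque", callers
# only ask whether its declared goal is being met.  All names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class IntentElement:
    name: str
    goal: dict                                     # what the element should deliver
    observed: dict = field(default_factory=dict)   # what telemetry reports

    def compliant(self) -> bool:
        return all(self.observed.get(k) == v for k, v in self.goal.items())

edge_pool = IntentElement(
    name="edge-pool-07",
    goal={"servers_available": 24, "redfish_conformant": True},
)
edge_pool.observed = {"servers_available": 24, "redfish_conformant": True}
print(edge_pool.compliant())   # True: remediation only triggers on a mismatch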

There’s a lot of good stuff here, but it’s important to note that the whole announcement is about a management framework for the data center infrastructure associated with hosted virtual functions of some sort, including those used in 5G.  It deals with the complexity of 5G and other “carrier-cloud” services by standardizing the SDDC framework that hosts stuff, but it doesn’t provide either the “stuff” that’s being hosted, or the specific applications that do the deployment and management.  Think of it as middleware, or think of it as analogous to Nokia’s AVA platform, whose “solutions” then include 5G.

Well, sort of.  It’s fair to say that there are overlaps between AVA and ODIMP, but what the latter does is sort-of-implicit in the former.  There’s base-line management intelligence built into AVA, and that is not a part of ODIMP.  For the two to be equivalent, you’d need to lay on a service management application that could do the zero-touch stuff you wanted, and work with the Redfish-schema framework.  To be equivalent to the AVA 5G support, you’d need to add specific solution logic for 5G.  HPE has such stuff, of course, but that’s not what’s being announced.

I think, as I suggested at the start of this blog, that moves like this are a response to the growing risk that vendors like HPE see from carrier-cloud outsourcing to the public cloud.  Carrier cloud is an enormous investment (a hundred thousand data centers, remember, if all drivers are realized).  Furthermore, if the early justifications for carrier cloud are even a tiny bit sketchy, this isn’t the time (in global economic terms) to be taking a risk.  The more a data center can be positioned as software-defined and vendor-neutral, and the more compatible higher-layer service and 5G Core software are with that data center, the more palatable the build-out choice seems versus the “rent” choice for operators.

HPE has management tools; back in the early NFV days they had one of the few kits that recognized the idea of management by abstraction.  What I’d like to see HPE do is to frame ODIMP and DCC as a “hosting layer”, frame cloud tools as middleware, and frame their management tools as the ZTA layer—all in one document.  Right now, their stuff (like that of most vendors, frankly) suffers from dilution through microsegmentation.  If you break down even the most wonderful thing into small enough pieces, everything you look at seems to do nothing useful.

Some operators tell me that this problem arises from the vendors’ engagement model.  The CTO organizations, focused as they are on standards and initiatives like ODIMP, tend the trees and not the forest.  Most I’ve been involved with have shunned the notion of taking a top-down approach or addressing a systemic problem, in favor of making local progress.  That’s fine if everyone knows how to convert a series of turns into a route, but it’s a prescription for meandering if they don’t.

The biggest benefit I see from this is that it could unite the SDDC initiatives from the cloud with the hodgepodge of carrier-cloud-related initiatives.  We do need to think about creating infrastructure based on a strong abstraction-and-modeling approach, even if we use AI above it, or we risk too much difficulty adapting generalized software to the specifics of where we host it.  I’d still like to see either HPE or the ODIM people in the open-source project expand their presentation to give more overall service and infrastructure context.  Sometimes the devil isn’t in the details; it is the details.  Missions matter.

Could a 5G Automation Strategy Become ZTA?

Zero-touch automation, in the form of some sort of over-arching service lifecycle management system, was the holy grail for operators just a few years ago.  What changed that was the combination of a lack of insight into how to proceed and the pressing need to reduce opex.  The combination resulted in point-of-problem strategies that tapped off a lot of the easy automation fixes.  Once those easy targets were consumed, the accessible benefits of broader ZTA shrank, and practical initiatives stalled.  Now, we just might see some ZTA creep in from below.

5G deployment may be an example of another of those point-of-problem drivers of service automation, or rather a formalization of one.  For some time now, we’ve been seeing operators shifting toward an open-device model, made up of a combination of open-source software and white-box hardware.  This approach admits the possibility that some open-source network software might be hosted in a resource pool, including even public clouds.  That opens a question that’s been around since 2013: do hosted functions create more network complexity and cost than device networks?

Substituting one or more virtual functions for a physical device has admitted capex benefits, if you’re not paying through the nose for the functions and your hosting capabilities are efficient.  It also has admitted opex problems.  A half-dozen operators have told me that opex for cloud-hosted network functions was running roughly 30% higher than for traditional appliances.  That was enough to erode or even erase the capex benefits.
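To see how quickly that premium bites, here’s a back-of-the-envelope calculation.  The 30% opex figure is the one operators cited; every other number is invented purely for illustration.

# Hypothetical numbers only, to show how a 30% opex premium can swallow a
# capex saving; the 30% figure is the one operators cited, everything else
# is invented for illustration.
appliance_capex, appliance_opex = 100.0, 40.0         # arbitrary cost units per box
hosted_capex = appliance_capex * 0.75                 # assume a 25% capex saving
hosted_opex = appliance_opex * 1.30                   # the reported 30% opex premium

appliance_tco = appliance_capex + 5 * appliance_opex  # simple five-year view
hosted_tco = hosted_capex + 5 * hosted_opex
print(appliance_tco, hosted_tco)                      # 300.0 vs. 335.0: saving erased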

A super-ZTA model for top-down service lifecycle automation could have addressed this problem, if it had been developed.  A generalized approach to operationalizing hosting of functions, adopted from the cloud’s exploding repertoire of tools, could also solve it.  What could also solve it is a point-of-problem strategy aimed specifically at 5G.  The question is whether that’s the right approach.  Point-of-problem strategies have helped operators reduce opex already; adding one in 5G could do the same, but it could also build another specialized solution to what should be recognized as a generalized problem.

I’ve blogged many times (you can find the posts via the search tool on my blog page) on the topic of ZTA and the right way of approaching it.  The problem with the “right” ZTA at the strategic level is that it takes time to develop a general strategy, and deploying it might then require changes to operations tools and practices.  At the tactical level, the way network operators are organized and socialized tends to focus solutions on things that have budgets.  5G deployment has a budget, specific people responsible, a P&L, and so forth.  Broad-based ZTA has little specific budget support, according to operators.

The real question, though, may be whether specialized point-of-problem 5G automation could set an example for the rest of the network space, even provide some elements applicable beyond the 5G target.  Properly done, 5G service automation might be able to rise up and support other missions, but this would require that the 5G strategy be based on generalizable principles and tools.

Nokia just announced what it calls “AVA 5G Cognitive Operations.”  5GCO (I’m using this shorthand because it takes too long to type the real name!) is part of the overall Nokia AVA Telco Ecosystem, which means that it’s based on a broader strategy and not one-offed for 5G.  The Ecosystem is a public-cloud (Azure currently, with others possibly in the wings) AI framework that analyzes operations data to predict problems and guide solutions.  All of this is based on Nokia’s AVA concept, so it’s best we look at AVA overall.

AVA stands for “Automation, Virtualized, and Analytics”, and the idea has been around since 2016.  Since then, Nokia has introduced more and more AI, along with a collaborative framework, additional information-gathering (including crowdsourcing), and “solutions” that apply all of this to specific problem sets.  5GCO is one such solution.  The basic Nokia AVA approach is to use broad data collection and AI/analytics to drive automated processes that alter network behavior.  This reduces or eliminates the need for specific service modeling, goal-states, and other event-interpretation and steering processes, or so it’s hoped.  By analyzing data broadly, Nokia AVA can determine whether anything is going wrong and invoke steps to correct it…again, so it’s hoped.

One reason for this “hoped” qualifier is that this whole process could be better explained.  Even operators who have adopted some of the Nokia AVA solutions aren’t able to articulate the basic platform concepts particularly well.  It appears to them that the AVA platform is specialized into solutions by predefining some rules and policies that then guide the AI stuff.  The approach appears to classify “events” into standard categories, which can then be associated with action policies that include kicking off automated processes.
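Nokia hasn’t published the internals, so here’s nothing more than a sketch of the general pattern the operators describe: classify an incoming event into a standard category, then apply the action policy bound to that category.  Every name in it is hypothetical, and a rule stub stands in for whatever AI/analytics actually does the classification.

# Sketch of the classify-then-apply-policy pattern described above.
# A rule stub stands in for the AI/analytics classification step; all
# names, thresholds, and actions are hypothetical.
def classify(event: dict) -> str:
    if event.get("packet_loss", 0.0) > 0.02:
        return "degraded_transport"
    if event.get("cpu_util", 0.0) > 0.9:
        return "host_saturation"
    return "normal"

ACTION_POLICIES = {
    "degraded_transport": lambda e: print("reroute traffic around", e["source"]),
    "host_saturation":    lambda e: print("scale out functions on", e["source"]),
    "normal":             lambda e: None,
}

event = {"source": "edge-cluster-3", "cpu_util": 0.94}
category = classify(event)                  # standard category for the event
ACTION_POLICIES[category](event)            # kicks off the automated process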

There’s broad capability within AVA to define new events and classifications, so it’s possible to incorporate almost anything you can know about as an event to be analyzed.  There’s also a collaborative framework to help tune solutions to fit specific combinations of network equipment, services, and available information.  If this extensibility aspect of AVA is true (meaning that I’ve interpreted the material correctly), you could not only build 5G automation from it, but also build service automation for other services, both new and legacy.

There’s a lot of merit to this approach, I think.  Networks already generate reams of direct event data and telemetry, and certainly an analysis of this information could be used either “symptomatically” to deal with conditions, or holistically to establish goal states, operating states, the current state (relative to the last two) and overall root causes and remedies.  If you compare this to the IT world, the AI process would be creating the states (needed for descriptive DevOps) from retrospective analytics, and then using AI to interpret events and trigger state transformations, which is what descriptive DevOps tools are moving to do.

Compared to service modeling, the approach has the advantage of doing its own retrospective analysis to establish a “model” and define states (normal and otherwise), rather than requiring that the model be created explicitly.  Service models and state/event tools are “anticipatory”, and of course the limit of any such process is the ability to anticipate.  The AVA approach can at least minimize this by having analytics do the heavy lifting.
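As a generic illustration of letting retrospective analytics define the states rather than modeling them up front, here’s a sketch that clusters historical telemetry into operating states and then maps a live sample onto one of them.  This is not Nokia’s method, just the general technique; the telemetry values are synthetic.

# Cluster historical telemetry into operating "states", then classify a
# live sample against them.  Generic technique only; synthetic data.
import numpy as np
from sklearn.cluster import KMeans

# columns: utilization, latency_ms, error_rate (synthetic history)
history = np.array([
    [0.30, 12, 0.001], [0.35, 14, 0.001], [0.32, 13, 0.002],   # quiet
    [0.70, 25, 0.004], [0.75, 28, 0.005], [0.72, 27, 0.004],   # busy
    [0.95, 80, 0.050], [0.97, 95, 0.070], [0.96, 90, 0.060],   # distressed
])

states = KMeans(n_clusters=3, n_init=10, random_state=0).fit(history)

live_sample = np.array([[0.93, 85, 0.055]])
print("current state:", int(states.predict(live_sample)[0]))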

AI analysis of events is a kind of soft state/event approach.  You don’t have explicit events driving you to explicit next-state conditions and invoking corresponding processes, and as I said, that helps by eliminating the need to develop specific service models that, for each element, represent state/event/process relationships explicitly.  It’s possible, using the intent modeling techniques I’ve described in past blogs and my ExperiaSphere projects, to retrofit the models to an existing network, but it’s still a considerable amount of work.

All AI systems have a potential risk that’s directly proportional to just how much they really replace “real” or human intelligence.  Anyone who, like me, has worked for decades with state/event systems knows that they have one significant benefit: you can look at the state/event model and where you are in it, and know what’s supposed to be happening.  If it didn’t happen, or if something was improperly modeled in state/event terms, you can find the problem quickly and address it.  If the whole state/event analysis is soft, known only within the AI processes, then it’s essential that the steps the AI takes to make a decision, and the steps taken to implement it, be explicitly visible and auditable.  Otherwise, your AI assistant turns into “Hal”.
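To make the auditability point concrete, here’s a minimal sketch of the kind of explicit state/event table I’m describing, with a logged trail for every transition.  The states, events, and process names are hypothetical; the point is that you can read the table and know exactly what’s supposed to happen.

# Explicit state/event table: readable up front, and every transition is
# logged so decisions stay auditable.  All names are hypothetical.
STATE_EVENT_TABLE = {
    ("active",    "fault_reported"): ("repairing", "dispatch_remediation"),
    ("repairing", "fault_cleared"):  ("active",    "confirm_service"),
    ("repairing", "repair_timeout"): ("failed",    "escalate_to_ops"),
}

audit_log = []

def handle(state: str, event: str) -> str:
    next_state, process = STATE_EVENT_TABLE[(state, event)]
    audit_log.append((state, event, process, next_state))   # the audit trail
    return next_state

s = handle("active", "fault_reported")   # -> "repairing", would run dispatch_remediation
s = handle(s, "fault_cleared")           # -> back to "active"
print(audit_log)

A soft, AI-driven equivalent would need to emit the same kind of trail if it’s going to stay out of “Hal” territory.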

The AVA approach to 5G is smart, in that it builds on a general strategy.  The AVA approach, as a general strategy, is smart because it can be applied to existing networks and services with minimal effort and probably no retrofitting of technology elements.  I just need to know how smart the AI is, and there’s not enough material available for me to say.

I’m sure Nokia would have been willing to give me a briefing on this, but as I’ve noted in the past, I can’t blog on information that’s presented to me specifically, and not reflected in public material that all can reference.  So I’ll ask Nokia to give this blog a read and send along any specific links to website material that I could use to dig deeper into the process.  If I get that, or if I get specific responses to the details of the solution from operators, I’ll revisit this.

https://venturebeat.com/2020/03/31/nokias-ava-5g-cognitive-operations-offers-carriers-ai-as-a-service/

https://www.sdxcentral.com/articles/news/nokia-blends-ai-cloud-for-5g-automation/2020/03/

https://www.nokia.com/about-us/news/releases/2016/02/08/nokia-ava-the-new-cloud-based-cognitive-platform-for-fast-flawless-service-delivery-to-operators-mwc16/

https://telecoms.com/465252/nokia-launches-ava-cloud-service-delivery-platform/