Are Operators Still Looking at Virtual Routers?

What is the value of hosted routing instances?  That question has been asked for a full decade now, and operators’ views have shifted back and forth.  It’s also true that when operator technology planners are asked, they often give a knee-jerk answer and then walk it back on reflection.  Many are now coming to believe that open-model devices, specifically white-box routers running open-source routing software, will obviate most of the need for hosting router instances in software.  Operators have wrestled for years with the idea of building their networks more cheaply, and with different approaches.  What are they thinking now, and might something actually get done?

If we go back about a decade, we find that almost half of Tier One operators conducted at least one experiment with router software hosted on servers.  Initially, the goal was to overcome what operators felt was increasingly predatory pricing from the big network vendors.  The business driver was the first of several studies showing that revenue per bit was dropping faster than cost per bit, which meant that future investments might have negative ROI.  Cost-cutting was in order.  At least a couple of Tier Ones went so far as to license a significant number of instances in the early 2010s.  None really carried the trial forward.

Part of the issue here was one of application.  Nobody believed they could replace larger core routers with software and servers; the interest was in using software routers as edge routers, both for general consumer Internet and for business VPNs.  In the former mission, operators quickly realized that specialized devices were more reliable and scalable, so it was the corporate mission that was interesting.  The problem in that mission was largely one of availability/reliability.  Servers broke more often than proprietary routers did, and the higher failure rate compromised customer experience and account control.

This early experience is likely why the Network Functions Virtualization initiative, launched by an operator white paper in 2012, focused on network appliances other than switches and routers.  The NFV initiatives fell short of their goals, partly because those goals evolved over time, shaped in large part by vendor support.  One non-vendor issue, though, was that limiting virtual-feature hosting to what were essentially control or specialized feature roles didn’t address enough cost.  Over time, NFV tended to create a model for universal CPE rather than for hosted software features at the network level.

SDN, which came along (in its ONF OpenFlow version) just about the same time as NFV, seemed initially to be more of a competitor for router devices than an application of feature hosting.  However, there are some operators who now wonder whether SDN might actually have a role in another attempt to create router networks without proprietary devices or white boxes.

SDN builds connectivity by deploying “flow switches” whose forwarding tables are controlled from a central software element.  You can visualize an SDN network as a router network whose control plane has been separated and centralized, which is exactly what the originators of the concept intended.  The thing is, that central control has an implication in the router-instance debate.
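
To make the division of labor concrete, here’s a minimal Python sketch of the model, using invented class and method names rather than any real controller’s (for example, OpenFlow) API.  The switch holds nothing but a match/action table, and all of that table comes from the controller.

    # Illustrative sketch only: the switch has no local routing logic;
    # the controller owns the topology and pushes forwarding tables down.

    class FlowSwitch:
        def __init__(self, switch_id):
            self.switch_id = switch_id
            self.flow_table = []                 # list of (match_fn, action) pairs

        def install_flows(self, entries):
            """Accept a complete forwarding table from the controller."""
            self.flow_table = list(entries)

        def forward(self, packet):
            """Match against installed entries; nothing is computed locally."""
            for match_fn, action in self.flow_table:
                if match_fn(packet):
                    return action(packet)
            return None                          # no match: drop, or punt upward


    class CentralController:
        def __init__(self):
            self.switches = {}

        def register(self, switch):
            self.switches[switch.switch_id] = switch

        def push_tables(self, tables):
            """tables: {switch_id: [(match_fn, action), ...]} computed centrally."""
            for switch_id, entries in tables.items():
                self.switches[switch_id].install_flows(entries)

The point of the sketch is that forwarding state originates at the controller; the switch never computes a route for itself, which is what sets up the instance-replacement argument below.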

Today, we don’t think of router instances as being 1:1 coupled to a server, but as an application hosted on a resource pool.  The resource pool concept offers a general benefit of easy replacement and scaling, and a specific benefit of being able to reconfigure a network broadly in response to any changes in condition.  The latter comes about because the pool is presumed to be highly distributed, so you can spin up a router anywhere you might need one, as opposed to sending out a tech to install it.

The problem with router instances in this mission is that the new routers would have to learn the routes and connectivity of the network, which is often called “converging” on a topology.  It takes time, and because it’s a process that involves a bunch of control-packet exchanges that percolate through the network, it can create issues network-wide until everything settles down.  This is where SDN might make a difference.

A flow-switch instance, receiving its forwarding tables from the central controller, would immediately fit into the network topology as it was intended to fit.  There’s no convergence to worry about.  Even if the central controller had a more radical change in topology in mind than simply replacing an instance that failed or scaling something, the controller could game out the new topology and replace the forwarding tables where they changed, in an orderly way.  There might be local disruptions as a topology was reconfigured, but they could be managed through simulation of the state changes at the controller level.
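
Here’s a hedged sketch (my own illustration, with invented function names) of what “gaming out” a change could look like at the controller: compute the target table for every switch from the controller’s own topology map, then push only the deltas in a planned order.  No switch computes anything for itself, so there is no distributed convergence to wait for.

    import heapq

    def compute_forwarding_table(topology, switch_id):
        """Dijkstra over the controller's topology map: returns {dest: next_hop}.
        topology is {switch: {neighbor: link_cost}}."""
        dist = {switch_id: 0}
        table = {}
        heap = [(0, switch_id, None)]
        while heap:
            cost, node, first_hop = heapq.heappop(heap)
            if cost > dist.get(node, float("inf")):
                continue                          # stale entry
            if first_hop is not None:
                table[node] = first_hop
            for neighbor, link_cost in topology.get(node, {}).items():
                new_cost = cost + link_cost
                if new_cost < dist.get(neighbor, float("inf")):
                    dist[neighbor] = new_cost
                    heapq.heappush(heap, (new_cost, neighbor,
                                          neighbor if first_hop is None else first_hop))
        return table

    def plan_reconfiguration(topology, current_tables):
        """Game out the target tables for every switch and return only the
        deltas, so the controller can push changes in an orderly sequence."""
        changes = {}
        for switch_id in topology:
            target = compute_forwarding_table(topology, switch_id)
            if target != current_tables.get(switch_id):
                changes[switch_id] = target
        return changes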

This raises the potential for agile topology at the network level.  Instead of thinking of networks as being static topologies, think of them as being a set of devices that kind of migrate into the path of traffic and arrange themselves for optimum performance and economy.  It’s a whole new way of thinking about networks, and one that’s being considered by the literati among tech planners in many operator organizations.

The challenge with this is that same old issue about the performance of hosted router instances and server network throughput limitations.  Fortunately, we may be seeing technology changes that could address this.  One possibility is a “white-box resource pool” and another is the use of GPUs to provide flow switching.

If white boxes were commoditized enough, it would be feasible to put a pool of them in at all locations where network transit trunks terminated.  If we were to use agile-optics grooming, we could push lambdas (DWDM wavelengths) or even electrical-layer virtual trunks to the whole pool, and create service- or even customer-specific networks in parallel, all controlled by that One Central Controller or by a hierarchy of it and lesser gods.  These white boxes could be equipped with flow-switching chips, each with a P4 driver to make the device compatible with open-source software.

It might be possible to stick this kind of chip into a server too, but it might also be possible to use GPUs to do the switching decision-making.  OpenFlow switching doesn’t have to be as complex as routing, and there are many missions for GPUs in resource pools that would justify adding them in.  This seems to be the approach currently favored by that operator literati.

The One Central Controller (or the hierarchy) obviously needs to be strong and resilient for this to work.  OpenFlow and the ONF have been slow (and many say, ineffective) in addressing the performance of the central controllers and the way a hierarchy of controllers (like that used for DNS) could be created and secured.  Even slower, and with fewer participants, has been the development of an integrated vision of agile optics and flow switching, which is also needed for optimum value to this approach.

But, presuming we were to adopt this approach, where would that leave the issue of hosting overall?  Bet you’re not surprised that it’s complicated.

One clear truth is that adopting a flow-switch-instance approach would accelerate the overall shift to open-model networking.  At the same time, it could promote differentiation in the new model.  Because NFV hasn’t succeeded in impacting the hosted-router issue (for reasons I’ve blogged on many times before), and because it’s focused more on per-customer services like VPNs, we have no well-respected architectural models, much less standards, for this sort of network.  It’s ripe for vendor exploitation, in short.

A counterforce to this, just as clear, is that moving from a router/device network to an agile cloud-based flow-switch-instance network supported by an agile optics underlay is a long slog.  There’s a ton of sunk cost in current networks, and a perception of a ton of risk in the migration.  The latter is especially true when you consider that it’s tough to make the flow-switch-instance approach work on a small scale; the value is at full scale.

It may be, then, that the future of router or flow-switch instances will not pull through broader open-model networking at all, but that it will be pulled through by it.  Operators are already looking at white-box-and-open-source for edge devices.  The boundary between edge and core is soft, and easily crossed by technology changes.  First the access, then the metro, then the core?  Maybe.  It would appear that replacing routers with router instances, or with flow-switch instances, is the final step in network evolution.  That would give router vendors more time to come up with a strategy to address that issue.

It doesn’t stop the outside-in open-model commoditization, of course.  Will the vendors, presented with a band-aid while free-falling, fix the cut they got going out the door of the plane and ignore the coming impact?  We’ll see.

How Virtual Networking Could Create Service Value-Add

I blogged yesterday about vendor differentiation in an open-model network, and closed with the point that a higher virtual layer might offer the only place where “connectivity networking” could find the features and opportunities vendors needed.  SD-WAN, I pointed out, was an example of an exploitation of that higher virtual layer, but one that’s not been developed by the SD-WAN community at large.  SD-WAN, which is really just an application of generalized virtual-network principles, has gotten mired in VPN-extension mode, when it could actually be a platform for building new, revenue-generating, user-empowering connection services.

Most of you who follow my SD-WAN blogs know I’ve long been a fan of one vendor, 128 Technology, who I’ve believed has best grasped the full potential of SD-WAN.  The company has been pretty quiet up to now, but they’ve just made their first big announcement of features and capabilities.  Given the facts that 1) SD-WAN isn’t much of a news item in general, and 2) 128 Technology is essentially doing an update, coverage has been limited.  This is a bigger story than the coverage would suggest, so I want to talk about them in the context of what “the full potential” means.  And for the record, they have nothing to do with this blog, have never seen it or even been made aware of it, didn’t pay for it, and didn’t influence a word of its content.  As always, this blog is my own writing, my own view.

To set the stage, it’s always been my view that SD-WAN was a mission or application of virtual-network technology to the specific problem of small-branch connection.  We have a couple dozen vendors who have introduced SD-WAN products to support that mission, and most of them do little or nothing else.  To me, the SD-WAN market raised an important question, which was whether we needed to think about virtual networking not as a bunch of little niche value propositions, but as a new way of defining connectivity.

IP has become a universal framework for communications of all sorts.  It’s based on proven technology, there are a lot of vendors who provide devices, and the client side of IP is supported in almost every computer, phone, and tablet, and even many doorbells and thermostats.  One of the reasons is that as a network protocol, IP is “connectionless”.  You don’t make IP calls, really, you simply send packets.  Any sense of a persistent relationship like a “call”, or the general concept of a “session” between parties, is created by the endpoints themselves.  IP, being connectionless and unaware of these relationships, just carries them as disconnected packets.

The combination of proliferation and connectionlessness (to coin an awkward term) creates some challenges, though.  One is that there are billions or trillions of possible connections, most of which would (if they happened) turn security-minded people and organizations white with fear.  We spend a lot of money on an IP network that can connect anything to anything else, and then spend more to stop most of the connections from happening.

We can’t change IP, but if we’re going to create a virtual network that can ride on IP (or anything else, for that matter), why not give that network properties that we can’t give IP because of the enormous investment sunk in devices on and within IP networks?  That virtual network could be applied to the SD-WAN mission, but also to other missions that demand a lot of explicit control that IP alone doesn’t provide.  128 Technology did that, which is why I took a position on their board of advisors, something I almost never do.  Unfortunately, it’s been hard to pull out that value proposition from their material, until this announcement.

The new announcement (the start of Release 4.x) is the result of working to apply the generalized virtual-network model of 128 Technology to a variety of customer accounts.  Some of these are public, but in many cases the user organization mandated confidentiality.  In either case, the experience with these customers refined both the product and its positioning, and that’s what got us to the point where I’m comfortable blogging about how it all works.

What 128 Technology does starts with session awareness.  The 128T (I’ll use that shorthand to save my fingers!) nodes can identify those persistent relationships that to IP are just a string of connectionless packets.  When a session starts, they can introduce a “tag”, a small parcel of information that identifies it, and at the same time they can classify the endpoints—who they are if they’re users (“tenants” in 128T terms) and what they are if they’re applications (“services”).  Session awareness introduces the notion of a higher layer to vanilla IP, because “sessions” are higher-layer relationships.

Session awareness yields connectivity control, as a starting point.  Based on the classifications and the policies that users set through central control, the nodes can assign a priority, designate a route, and even block the connection completely.  In fact, by default, all connections are blocked unless a policy authorizes them—zero-trust.  All this is independent of the IP-level connectivity, which of course is promiscuous.
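
To show what session awareness plus default-deny could look like in practice, here’s a toy Python sketch.  It’s my own illustration, not 128 Technology’s implementation or API; the tenants, services, and classification rules are all invented.

    # Flows are grouped into sessions by their 5-tuple, classified into a
    # tenant and a service, and forwarded only if a policy authorizes the pair.
    from collections import namedtuple

    Session = namedtuple("Session", "tenant service priority route")

    # Policies: (tenant, service) -> handling.  Anything not listed is blocked.
    POLICIES = {
        ("branch-users", "erp-app"):  {"priority": "high",   "route": "mpls"},
        ("guest-wifi",   "internet"): {"priority": "normal", "route": "broadband"},
    }

    active_sessions = {}   # 5-tuple -> Session

    def classify(packet):
        """Map source to a tenant and destination to a service (invented rules)."""
        tenant = "branch-users" if packet["src"].startswith("10.1.") else "guest-wifi"
        service = "erp-app" if packet["dst_port"] == 8443 else "internet"
        return tenant, service

    def handle_packet(packet):
        key = (packet["src"], packet["dst"], packet["proto"],
               packet["src_port"], packet["dst_port"])
        if key not in active_sessions:              # first packet of a session
            tenant, service = classify(packet)
            policy = POLICIES.get((tenant, service))
            if policy is None:
                return "blocked"                    # zero-trust: default deny
            active_sessions[key] = Session(tenant, service,
                                           policy["priority"], policy["route"])
        session = active_sessions[key]
        return f"forward via {session.route} at {session.priority} priority"

The important property is that handling keys off the session classification, not off per-packet tunnel encapsulation.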

The control system for all of this has some cute features to make defining connection policies easier.  It works on a hierarchy, meaning that you could define sites or roles, and workers/users within them.  Policies for connection could be set by individual, but also by role or site.  The same holds true with “services”.
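
A sketch of how that hierarchical policy definition might resolve, with invented names and a most-specific-wins rule that is my assumption rather than documented 128T behavior:

    # A policy can be attached at the site, role, or individual level;
    # the most specific attachment wins, and nothing matching means "block".
    SITE_POLICY = {"boston":            {"erp-app": "allow"}}
    ROLE_POLICY = {("boston", "sales"): {"erp-app": "allow", "crm": "allow"}}
    USER_POLICY = {("boston", "sales", "jsmith"): {"crm": "block"}}

    def resolve(site, role, user, service):
        """Walk from most specific to least specific, return the first match."""
        for scope in (USER_POLICY.get((site, role, user), {}),
                      ROLE_POLICY.get((site, role), {}),
                      SITE_POLICY.get(site, {})):
            if service in scope:
                return scope[service]
        return "block"          # zero-trust default when nothing matches

    # resolve("boston", "sales", "jsmith", "crm") -> "block"
    # resolve("boston", "sales", "alee",   "crm") -> "allow"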

Even Internet traffic-handling is improved.  Many, or even most, SD-WANs or VPNs will route all traffic onto the virtual network, which means all Internet traffic is carried back to the off-ramp and then pushed out onto the Internet.  Because 128T remote nodes see all traffic and sessions, they recognize Internet traffic and don’t tag it for virtual-network handling.  In fact, they can recognize classes of applications through a partnership with Alexa, so selective connectivity management can be applied even to websites and web services.

All of this is highly efficient.  Because sessions carry these little tags, and the tags determine the handling, 128T is unique in that it doesn’t create an explicit overlay network, a kind of IP-on-top-of-IP-via-a-tunnel model.  Their nodes are enhanced, session-aware routers that can replace traditional edge routers and can be cloud-hosted.  The 128T “Session-Smart” (their trademark term) network isn’t an overlay, adds virtually no traffic overhead, and most important, doesn’t require tunnel terminations that limit the number of users that can be supported by a given device at the on- and off-ramp.

This approach is extremely simple from a network-building perspective.  You’re not building a network on top of IP as much as coercing IP to behave as you’d want it to.  You’re not running two levels of network, you’re running one.  In some of those early customers, this simplicity meant getting literally thousands of sites up and running in a matter of days, not months.  Opex is lower and reliability is higher, because instead of having two layers of network to fail, you have only one.

The combination of network-building simplicity and the no-tunnel-overlay approach also contributes to significant scalability.  Even the minimum-sized 128T node can support hundreds of user sessions, and the larger devices intended for central installation can support thousands.  This means that even retail operations with thousands of locations can be networked, and brought up and sustained quickly and easily.

And yes, you can still do SD-WAN.  At the point of session recognition, addresses can be remapped, which is why the product can provide SD-WAN services that “look” traditional, mapping branch offices with Internet connections or even home workers, to the same address space as a company’s MPLS VPN.  It’s also why a 128T virtual network is inherently multi-cloud; all components in all clouds can be made addressable in the same address space, so the whole multi-cloud context looks like one virtual data center on one network.  You can also, with the newest release, use a 128T node to provide Ethernet-over-IP.
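
A simplified illustration of the address-remapping idea, with addressing, ranges, and function names that are entirely my own invention, shows why one address space can span branches, home workers, and multiple clouds:

    # Endpoints keep their local addresses, but at session setup they are
    # mapped into a corporate VPN address space so everything appears to
    # live on one virtual network.
    import ipaddress

    VPN_SPACE = ipaddress.ip_network("10.200.0.0/16")
    _next_host = VPN_SPACE.hosts()
    _mappings = {}      # (site, local_addr) -> VPN-space address

    def vpn_address(site, local_addr):
        """Assign (once) and return a stable VPN-space address for an endpoint."""
        key = (site, local_addr)
        if key not in _mappings:
            _mappings[key] = next(_next_host)
        return _mappings[key]

    # A cloud workload and a branch PC end up in the same address space:
    # vpn_address("aws-vpc-1", "172.31.4.10")  -> 10.200.0.1
    # vpn_address("branch-22", "192.168.1.40") -> 10.200.0.2

The remapping happens at the point of session recognition, so there is still no per-packet tunnel encapsulation involved.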

When the work-from-home impetus of the pandemic came along, 128T themselves gave mini-node versions of their appliances to key workers, which made them into little portable branch offices.  They also worked with the WireGuard open-source VPN client to provide a means of connecting remote workers back to their local branch offices as though they were on the office LAN.  That offers economical remote-work connectivity and preserves policy control of traffic at the branch office level.

This is the difference between virtual networking and the limited SD-WAN mission, and it’s an example of how a virtual-network layer, by taking over connection control, could create differentiation above the traditional network.  It’s also an example of how network connection services, built correctly, could include features like security that we’re currently paying (a lot) to glue on.  But perhaps that’s why the big network vendors haven’t pushed the concept; they already get a lot of that security money.

I’m glad that somebody is illustrating the value of true virtual networking.  Perhaps it will catch on, and perhaps it will finally make networking a full partner to server virtualization and cloud computing.

How Do Network Vendors Differentiate in an Open-Model World?

How do you compete in an open-device future?  It seems obvious that we’re not only headed in that direction, in some market areas we’re already on the doorstep.  Vendors are not going to lie down with a rose on their chest, they’ll try to fight, but how will they do that?  There are a number of potentially fruitful avenues they could take, and some may try to take them all.

The challenge of open devices is the dualism of software and hardware.  Both software and hardware have to be interchangeable, which means that the marriage point for the two has to be maintained in a standard way.  Fortunately for vendors, there’s a solution.

One of the most obvious, and most promising, strategies for differentiating future products in an open-model world is through custom silicon.  We see this in the PC market already, with graphics processing.  There are various standard APIs for computers already (DirectX, OpenCL, and OpenGL for example) that permit specialized hardware implementation for open functions.  We just got such a standard for network devices in P4, a flow-programming language that uses a plugin to match the specific details of the silicon used.  As long as a vendor provides a P4 driver for their chip, open P4 software will run on it.

In order for this approach to work, the vendor would have to do custom silicon on their own, or others would be able to reproduce their hardware model and it wouldn’t be differentiating any longer.  That’s a challenge for at least some vendors, but it could well spur significant advances in flow-handling silicon, both by network vendors and by chip players who want to take advantage of P4 for special missions.

Chip-level differentiation by itself has challenges.  Features are what normally differentiate things, meaning useful features.  Chips might make things a bit cheaper or faster, but the scope of applications in which these subtleties are meaningful is narrow.  It would be better to have something that changed the basic utility of a service.

Which introduces our second possibility for vendors; a “service architecture”.  A network is a cooperative community of functional elements.  The elements can commoditize if the community architecture can be differentiated.  Since features tend to be created above the flow or traffic level, this kind of differentiation would fit the goal of creating high-level benefits.

We have, for the cloud at least, some examples of service architectures in the collections of mission-targeted web services offered by cloud providers.  In technical terms, this approach means defining “services” made available through APIs that could then be coupled in some way to the open-model software.  In order not to have your new service ecosystem look like a renewal of proprietary networking, though, you’d have to ensure that the open-model coupling was itself open, meaning you couldn’t build specific device linkages to the service level to pull through your whole product line.

This could be a heavy lift, though, for a couple of reasons.  First, service providers at least are very wary of proprietary models.  Think of a 5G that didn’t conform to international standards and you’d get my point.  On the enterprise side, there’s a greater willingness to trade “open standards” for a specific cost (capital or operations) advantage, or for unique benefits.

The evolution of the “Kubernetes ecosystem” offers an example of an approach.  Google’s Anthos just won a DoD deal for multi-cloud Kubernetes orchestration.  Anthos isn’t a proprietary tool, at least not directly, but because Google effectively branded it as part of their multicloud, it’s associated strongly with Google, and others have so far been inclined to compete rather than adopt.

The biggest problem, though, is the almost-universal reluctance of operators to contemplate services above the network layer.  Vendors themselves aren’t exactly eager to shift their focus to higher-layer services; it opens them up to competition from the computer and platform software side, the cloud provider side…you get the picture.  Add to that the fact that their buyers are willing to believe any silly story that’s presented (like everyone buying 5G to support their IoT sensors) as a way of dodging the need to face an unknown technical world.

That doesn’t mean that the services option is a bad one, but it probably means it’s not a quick one.  If operators are straining to sustain capital spending, they’re even less likely to jump into a new area.  That’s even more likely to be a problem when that new area would require a considerable up-front capital investment in infrastructure (the dreaded “first cost”).  If operators let the old earth take a couple of whirls, as the song goes, then it might well whirl into 2022 before attitudes shift toward services, and vendors are going to have a problem waiting that long.

That leaves our third option, which is another sort of “ecosystem” story.  If you offer network equipment that’s linked into a network-level ecosystem, then your devices tend to pull through, particularly in the enterprise space.  Enterprises say they want open technology choices, but they still tend to buy from their favored and traditional vendor unless that vendor messes up badly.

Cisco is the proof in this pudding, of course.  Their last quarter was decent, especially considering the onrushing pandemic that contaminated the end of it.  Their salvation was the enterprise market, and that market was sustained for Cisco by skillful framing of a pan-network story, things like ACI, and a strong program to certify technologists, making their careers somewhat dependent on Cisco’s continuing dominance of the enterprise.

The problem in this area is that, for most vendors, you can’t simply declare an ecosystem.  Cisco had the best account control of all the network vendors before the current problems came along, and so they could exploit their assets immediately.  Other vendors would have to wrap ecosystem claims around something novel.  Juniper is already trying that with Mist/AI.

Management, or operations automation, is certainly a logical basis for a network-level ecosystem.  If you want to collectivize network-layer behavior and so sustain your focus while building differentiation, the next level up is a great choice.  The challenge here is that for enterprises, zero-touch or automated lifecycle management is a harder sell.  The majority of enterprise-owned network technology goes in the data center, where you can oversupply with capacity at little or no ongoing cost.

Hard or not, this seems like it’s going to be the strategy of choice for network vendors in the near term.  But just because it’s inevitable doesn’t mean it’s going to work.  Except for Cisco, the idea of a network-layer ecosystem is hard to promote in the enterprise because it’s inherently a defense of your base, and Cisco is the market leader.  In the service provider space, the problem of open-model pressure on a capital spending plan that’s already in decline for return-on-infrastructure reasons isn’t easily solved.

Bits aren’t differentiable; they’re ones or zeros.  Inevitably, bit-pushing is going to commoditize, and so as long as either enterprises or service providers focus on bit-pushing as their bottom line in networking, the network vendors face ongoing challenges.  Like it or not, broader network virtualization issues may be the only answer.

What’s “above” the network, if not the “virtual network?”  There’s still an open zone up there, a place where new security, governance, and management features can be added.  SD-WAN is only now starting to move out of the specialized mission that launched it, into a new virtual network space.  That may be where everyone needs to go, but even that’s a challenge, because SD-WAN’s old mission competes with MPLS VPNs that drive service provider spending.  The next decade is going to require a careful navigation of risks and opportunities for network equipment vendors, and some aren’t going to stay out of the deep brown.

Remote Work Lessons from WFH Experiences

There is no question that videoconferencing tools have saved companies from a pandemic lockdown crisis.  There’s little question that familiarity with the tools will promote their use in the future, even when there’s no pandemic to drive change.  They may indeed replace voice calls for many business interactions, even those involving only two parties.  We’ve had some time now to absorb videoconferencing and web conferencing into business operations.  What have we learned?

Five organizations I’ve been chatting with (virtually, of course!) have been working on how to get the most out of remote collaboration.  They’re not unified in their thinking yet, but there is an emerging set of common points leading to their view that “we have to work differently, not just work remotely.”

The critical starting point everyone agreed on is that companies are too culturally dependent on face-to-face connections in getting things done, and there’s little reason for it.  All five of the companies said they’d believed that their people would find working via video a significant drag on productivity.  In-house discussions were thought to be marginally acceptable via video, but contact with customers, suppliers, and prospects that had previously been handled F2F?  Expect the worst.  But the worst didn’t happen; they found that not only were they able to get things done via video collaboration, there were actual advantages in using it.

For these companies, video collaboration saved people about 38% of their time versus F2F.  They were able to “get a meeting,” on average, two days sooner when they were trying to reach a customer, and cut the time to respond to a customer request for a meeting from a day and a half to three hours.  It took a bit of time (two weeks or so, on average) to get used to the new way of working, but once the learning curve passed, the results were a major improvement in productivity.

Another point that emerged is that providing people with material to review and work with during collaboration not only further improved efficiency/productivity (by about ten percent), it also improved the subjective view of the results of the virtual meeting.  In most cases, people used a whiteboard to do extemporaneous drawings or worked from a slide deck.  The companies found that if the material was prepared and sent in advance, there was a slight increase in time committed per meeting to review the material, but the meetings were shorter.  The net was an improvement in productivity.

What created the improved subjective result was partly that the meeting came to the point faster, making it “feel” more productive, and partly that the material seemed to be better absorbed.  The optimum approach seemed to be to schedule a video meeting with a short period tacked on in advance for material review.  If the material was sent at the beginning of that review period, the imminence of the meeting induced an immediate review, and the information was better retained for the meeting itself.

Most of the companies noted that there was an evolution to the way they prepared the material, and that shift was likely a major reason why the meetings seemed to produce better results.  Because the material was reviewed unsupervised, those who prepared the material were forced to be more educational in their approach, and that meant the material was grasped quicker.  The meeting then tended to focus on questions, and on confirming key points, which again gave it an “action feel” that improved the subjective view of the outcome.

Another interesting point that came out of my discussions was that the companies found that their virtual meetings got smaller over time, as people became familiar with the process and introduced their material for review before the meeting started.  First, having more people meant much more time to accommodate everyone’s schedules, and since the virtual meetings were normally quicker to initiate than F2F, that extra time was a noticeable delay and drag on the process.  Meetings shrank from an average of 12 invitees to an average of 7, without any reported loss of effectiveness.  In addition, another one or two invitees would, having reviewed the material, determine they didn’t need to attend.  Since a virtual meeting doesn’t interrupt things as much as F2F would, people didn’t resent last-minute drop-outs.

One company took all this information and developed a “meeting track” approach, using project management and scheduling software.  They’d tried in the past to create a project trajectory that included the necessary meetings, but found that with F2F meetings there were too many cases where people couldn’t make the meetings or they ran long.  With the new virtual approach and previewed material, they were able to keep to schedule for most activities, and the meeting-track approach further improved both their control of task execution and their overall efficiency.

A fair question on the findings is whether work-from-home created or exacerbated the classic problem of “theft of time”.  Many companies worry about home working because they’re convinced that people wouldn’t work as hard, and all five companies said that there were issues with that, for some workers and some managers/supervisors.  They also said that having a meeting-track approach where work activities were explicitly scheduled, including video collaboration, helped reduce the problem significantly.  Most said they didn’t believe that where the approach was used, there was a need for any sort of monitoring of the kind usually seen as “spying”.

There’s a qualifier for all of this, though.  The subjects of the detailed reviews these five companies did were “knowledge workers”, largely staff experts and managerial/executive types.  There were a few administrative people also reviewed, those who worked regularly with other high-level people.  Four of the five companies also had a broader collection of office workers at home, and they all subjected them to only minimal study.  Why?  Because this group’s WFH activity didn’t involve video.

One of the interesting truths here is that the majority of WFH workers do their work without a need for any significant collaboration.  They can process invoices, take calls from customers or suppliers, or run their regular applications without being video-empowered.  The four companies all felt that “more could be done” to support the non-collaborative model of remote work, but they’d not really done it.  We may have learned a lot about video collaboration as a replacement for F2F meetings, but we still have a long way to go learning about remote work.

Why Would Google Want To Be Your Babysitter?

Google wants to be your babysitter?  That’s a pretty interesting tag line, and one that Protocol has offered us.  According to their piece, parent Alphabet has filed a patent in the space.  “Not that many of us are leaving home much, but in that distant future when the world returns to normal, Google wants to be in charge of looking after your kids.”  While I doubt that Google engineers are swarming over babysitting robot prototypes, I do think that they’re exploring what could be done with what I’ve been calling “contextual services”.  I’ve talked about them mostly in the context of the wide-area network, as part of carrier cloud, but they could have applications within a workplace or home.

The basic theory of contextual services is that in order for a form of artificial intelligence to work appropriately in a real-world scenario, it needs to know what the real world is.  Actions are judged in context, meaning what’s happening around you in that real world.  Turning on a light if it’s dark is smart, but it wastes money if it’s already bright daylight.  A robot cleaner should stop short of hitting a wall, but it has to know where the wall is, where it is, and what “short” means, in order to do that.

We already use machines to monitor us.  In most hospitals, patients are hooked up to monitors that constantly check critical health parameters and signal if something is wrong.  There are some situations where a response could be automatic; a pacemaker implant responds to EKG conditions, it doesn’t signal a nurse to run out and find the patient.  So, it’s not outlandish to think that we could employ some sort of contextual toolkit to watch our families, pets, etc.

Not outlandish, but maybe outlandishly complicated.  I’ve fiddled with notions of contextual services for about a decade now, and with some robotic concepts for over 30 years.  In the process, I’ve uncovered some issues, and some principles too.  They suggest Google might be on to something.

My instinctive approach to something like an automated home was a central control intelligence.  A supergadget sits somewhere, wires and wireless connecting it to all manner of sensors and controlled devices/systems (vacuums, heaters, etc.).  This works pretty well for what I’ll call “condition-reactive” systems like heaters, because you can set simple policies and apply them easily, since the conditions are simple and fairly regular.  It’s cold: turn up the heat.  It doesn’t work as well for what I’ll call “interactive” systems, meaning things that have to relate to humans, pets, or other things that are not really under central control at all.

In order for an automated home to handle interactive systems, it has to be able to understand where all those uncontrolled things are, and from that and a timeline, infer things like their motion.  It also has to have a set of policies that define abnormal relationships between the uncontrolled elements and either other elements or controlled/controllable things.  No, mister robot vacuum, don’t sweep up Junior, and Junior, don’t keep following the vacuum or jumping in front of it.

The more complicated the controllable things are, the more useful an automated thing can be.  If all you can do is turn lights on and off, or set the heat up and down, you’re not getting much automation.  If you can unlock doors, change room lighting based on occupancy, warn people about risky or unauthorized movement, and even perhaps step in by closing a gate, then you’ve got something a lot more valuable.  The problem is that you also have a lot of complex stuff for your central supergadget to handle.

My view has evolved to the point where I tend to think of automated facilities in terms of hierarchies of autonomous systems.  A home robot, for example, might have a controller responsible for moving its wheels and controlling their speed, and another to sense surroundings via optical interpretation, motion/proximity, etc.  A third system might then apply policies to what the sense-surroundings system reported, and send commands to the motion-control system.  This approach leads to easier coding of the individual elements, and guarantees that no task starves the others for execution time.  If one system can’t respond to a condition, it reports it up the line.  I hit something and there shouldn’t be anything there to hit!  What do I do now?
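
Here’s a bare-bones Python sketch of that hierarchy-of-autonomous-systems pattern, with controller names and conditions I’ve invented for illustration: each level handles what it has policies for and escalates the rest to its parent.

    class Controller:
        def __init__(self, name, parent=None):
            self.name = name
            self.parent = parent
            self.handlers = {}          # condition -> handler function

        def on(self, condition, handler):
            self.handlers[condition] = handler

        def report(self, condition, detail=None):
            handler = self.handlers.get(condition)
            if handler:
                return handler(detail)
            if self.parent:             # no local policy: kick it upstairs
                return self.parent.report(condition, detail)
            return f"{self.name}: unhandled condition '{condition}', alert a human"

    # Wiring up a robot vacuum's internal controllers under a home supervisor:
    home = Controller("home-supervisor")
    vacuum = Controller("vacuum-policy", parent=home)
    motion = Controller("motion-control", parent=vacuum)

    motion.on("obstacle", lambda d: "stop, back up, re-route")
    vacuum.on("persistent-obstacle", lambda d: "pause cleaning, retry in 10 minutes")
    home.on("power-fail", lambda d: "switch occupied rooms to emergency power")

    # motion.report("obstacle")            -> handled locally
    # motion.report("persistent-obstacle") -> escalates to the vacuum controller
    # vacuum.report("power-fail")          -> escalates to the home supervisor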

We can apply this to automating a home.  A robot vacuum has an internal control system that’s responsible for having it cover a space without getting trapped, hitting things, or missing spots.  That system does its thing, blind to the rest of the home, until it encounters something that it has no policies to handle.  Say, Junior keeps jumping in front of it.  Our robot vacuum might see this as an obstacle that doesn’t stay in place, and so can’t be avoided through normal means.  When this happens, it kicks the problem upstairs.

There are also situations where “upstairs” kicks a condition downward.  Suppose there’s a power failure.  Yes, the process of going into power-fail mode would normally involve many predictable steps; emergency lighting is an example.  If we assume our supergadget has UPS, we could have it signal a power-fail condition downstream to other power-back-up control points.  If there is a human or pet in a room, set emergency power on there.  That rule could be applied by a lower-level controller.  For our robot vac, it might use battery backup to move into a corner so somebody doesn’t trip over it, and of course we’d likely alert a human agent about the new condition.

This can be visualized at a high level as a kind of finite-state problem.  There is a normal state to the home, in terms of lighting conditions, power, temperature of the rooms and refrigerators and other stuff, detection of smoke or odors, etc.  Everything in every room has a place, which can be static (furniture) or dynamic (Junior, Spot, or the robot vacuum).  Dynamic stuff has a kind of position-space, where “green” positions are good, “yellow” positions are warnings, and “red” means intervention is necessary.  The same goes for the state of things; body temperature, heart and breathing rate, and even time since a last meal or drink could be mapped into the three-zone condition set.
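
A toy version of that three-zone mapping, with quantities and thresholds I’ve made up purely for illustration, might look like this:

    ZONES = ("green", "yellow", "red")

    # quantity -> (yellow_threshold, red_threshold), measured as deviation from normal
    THRESHOLDS = {
        "room_temp_deviation_c":    (3, 8),
        "heart_rate_deviation_bpm": (20, 40),
        "minutes_since_last_meal":  (300, 600),
    }

    def zone(quantity, deviation):
        yellow, red = THRESHOLDS[quantity]
        if deviation >= red:
            return "red"
        if deviation >= yellow:
            return "yellow"
        return "green"

    def overall_state(readings):
        """readings: {quantity: deviation}.  Red anywhere means intervene."""
        worst = max((ZONES.index(zone(q, v)) for q, v in readings.items()), default=0)
        return ZONES[worst]

    # overall_state({"room_temp_deviation_c": 2,
    #                "heart_rate_deviation_bpm": 25}) -> "yellow"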

It’s interesting to me that contextualization needed for home automation and Google’s hypothetical babysitting goal, matches so closely with the state-mapping needs of lifecycle automation of operations processes.  It shouldn’t be a surprise, because the underlying approaches are actually very similar.  What makes it interesting, therefore, is less its novelty than the opportunity that might be represented.  A good state/event-model system could be used to babysit, control warehouses and movement of goods, automate operations processes, drive vehicles, and so forth.  There is, in short, an opportunity for an underlying platform.

Could we use TOSCA or SDL to describe an arbitrary state/event system?  At least at some level, I think we could.  The questions with both are the way in which a hierarchy we represent in our model can be related to a map of that real world we have to keep in mind, and the way that events are represented and associated with processes.  I’d love to see some useful work done in these areas, and I think the reward for something insightful could be very substantial.

Can Microsoft Make the Most of Metaswitch?

A tale of two announcements; that’s a nice literary play on my blog topic.  Cisco announced its earnings, and its results are always a useful barometer of telco spending trends.  In addition, Microsoft announced the acquisition of Metaswitch, and that seems to me to decide the question of how its recent Azure Edge Zone stuff should be assessed.  What connects the two could be pretty interesting.

Let’s start with Cisco.  They said on the call that the quarter was running ahead of guidance before the pandemic and lockdown hit, that they still beat their revenue and earnings expectations, and that they guided better than expected.  This shows that networking is still broadly strong and that it likely was impacted less by the pandemic than some other tech sectors.

The IP giant reported a decline in switching and routing revenues in the service provider sector in the quarter.  If you consider that in the context of all the pandemic comments Cisco made on their earnings call, you’d likely jump to the conclusion that spending was suppressed by the virus.  The problem is that service providers wouldn’t be expected to react to a broad market force like a pandemic so quickly, since during Q1 it wasn’t known how long it would last, and since a lockdown would seem to increase the need for network services.

At the same time, Cisco’s bright spot was its as-a-service software model, which gained strength in the quarter.  Cisco cites the pandemic as a driver for its WebEx stuff, of course, but it’s also reasonable to say that in chaotic financial periods, companies would tend to move to expense-based acquisition of stuff rather than outright purchase.  Cisco was also positive about their traction in the cloud space, meaning their support for multi-cloud via SD-WAN.

To me, this adds up to an enterprise segment that’s not prepared to curtail networking investment, either in terms of products or services.  Unless there’s a major long-term depression, I think that what Cisco’s numbers and guidance suggest is that enterprises may slow-roll some network projects, and may try to expense rather than capitalize enhancements, but they’re not going to hold back their networking significantly.

Not so for the service providers.  They were already putting downward pressure on capital spending before the pandemic, and few have suggested they’d be increasing capex in response.  Cisco’s guidance doesn’t suggest to me that they believe the sector will be exploding in opportunities at any point in 2020.  The problem of profit per bit, which I’ve harped on for half a decade, is biting into their return on infrastructure.

The presumption normally made (including by me) is that service providers would move to higher-layer services, the “carrier cloud” model.  But carrier cloud has always been a kind of threading-the-capital-and-first-cost-needle proposition.  You needed to have some early applications for carrier cloud as part of legacy service topologies, in order to fund early deployment to a scale large enough to be competitive for public cloud and other consumeristic services.  These early applications have not emerged; the NFV initiatives that operators hoped would be the first application of carrier cloud have been minimal and have only driven “universal CPE” rather than cloud-hosted functions.

Where does this leave operators?  The second phase of the carrier cloud driver evolution was supposed to be 5G.  Despite all the market hype, there is no credible set of new services emerging from 5G that would drive incremental revenue.  Thus, there is no ROI out of 5G infrastructure other than that which could be associated with higher cell densities, which is more related to lower cost than to higher revenue.  This puts the 5G driver in jeopardy, which would virtually guarantee that carrier cloud overall is in jeopardy.  Which leads us to Microsoft.

Microsoft positioned its Azure Edge Zones as targeting 5G and telco applications specifically, in contrast to IBM/Red Hat, whose edge story is generalized.  Why not, in an emerging space, use a shotgun rather than a rifle in going after opportunity?  Answer: You think there’s one absolutely prime target.  Why?  Because you believe that the financial situation service providers find themselves in forecloses big carrier cloud investment as part of 5G.  Thus, they will want to outsource it.  Thus, Microsoft wants to supply it.

The challenge for Microsoft is that it’s not enough to have a cloud platform to host 5G components, you need those components to host.  To paraphrase a song from “Camelot”, “Where in the world is there in the world a 5G virtual function set so extraordinaire?”  Pause with great anticipation…Metaswitch.

There is nobody in the market who thought longer and harder about creating cloud-compatible telecom software than Metaswitch.  Their “Project Clearwater” IMS application was the one I’d picked to be the foundation of the first use case submitted and approved for NFV, because it was clear that it would be IMS and not virtual CPE that could drive carrier cloud.  They have carrier successes already, and they’re well-known in the carrier space.  Of the 77 operators I’ve talked with in the last year, Metaswitch was known to every one, visibility that matches that of the major network vendors.

Metaswitch augments Microsoft’s 5G mobile story, already supported by its Affirmed acquisition, completed earlier this year.  I think Microsoft could have gotten Affirmed’s stuff to where they wanted to be, but I also think Metaswitch was closer.  If operators are serious about 5G Core, then they could move quickly, and so accelerating Microsoft’s capabilities is smart.  Not only that, Metaswitch running wild and free in the market could stimulate others to consider getting into the virtual 5G space.  Both Cisco and Juniper could use such a story, and a few insiders have told me both considered a Metaswitch deal themselves.

You have to wonder whether they should have.  The problem with being a “platform player” is that nobody needs you until they have something to put on the platform.  The network vendors who have 5G are able to deliver the solution and the platform.  For everyone who doesn’t have a solution, Metaswitch would have been a great addition.

That doesn’t necessarily mean that Microsoft’s edge approach, depending on carrier cloud and 5G outsourcing, is the best (or even the right) approach.  IMHO, they’re betting that operators are going to push for full 5G deployment, meaning 5G Core as well as the New Radio.  Up to now, the emphasis has been on the radio network and 5G Non-Stand-Alone (NSA) hybrid of 5G NR and 4G IMS/EPC.  However, it’s obvious that isn’t going to create much (if any) new revenue for operators.  The increased operator interest in 5G Core, and in edge computing and outsourcing, may be based on a hope that broader-technology-scope 5G implementation will boost revenues.

By itself, it will not.  Microsoft is now in a position where it will need to address the more complicated higher-level services that were drivers to carrier cloud, services like the IoT “information fields” I’ve blogged about, or personalization and contextualization.  I can’t say whether Microsoft knows they need this additional layer, or that they know how to get it.  I hope they do, because traditional network vendors have been nesting like hens in a henhouse for a decade, on this same issue set.  We need progress, not more hens.

What’s Holding Back Open-Model Networking?

Openness is good, right?  We need open-source software.  We need open APIs, open standards, and open-model networking.  OK, let’s agree to agree on those points.  The question is how exactly does openness come about, and particularly in the network space.  A collateral question is whether there are degrees of success, and if there is, the final question is how we can optimize things.  Openness isn’t new, so we can look to the past for some lessons.

Perhaps the greatest success in open technology is Linux.  It’s become the de facto server operating system, and some (a decreasing number, IMHO) think it could even take over the desktop.  How did Linux establish the concept of open-source operating systems, and win out?  In an important sense, it didn’t, and the details behind that point are a good place to start our consideration of the Optimum Open.

Minicomputers were one of the transformational changes in the computing market.  Starting in the late 1960s and moving through the early 1980s, the “mini” was a more populist form of computing.  IBM had launched the true mainframes in the mid-60s with the System 360, and as more companies got interested in more computing missions, the minicomputer was the obvious solution.  Companies like Digital Equipment Corporation, Data General, CDC, Perkin-Elmer, and of course the mainstream IBM, HP, and Honeywell jumped in.

Every one of the mini vendors had their own software, their own operating system and middleware.  Within 10 years, it was becoming obvious that the breadth of computer use created by the lower price points for minicomputers couldn’t be realized fully except through packaged software.  Smaller companies could no more afford to build all their own applications than they could afford a mainframe.  But there was a problem; the balkanization of operating systems and tools meant that software firms faced major costs building their wares for every mini option.  By the early ‘80s, it was clear that most minicomputer vendors didn’t have an installed base large enough to attract third-party software.

There was, at this point, an alternative in UNIX.  UNIX was a Bell Labs project, an operating system whose source was licensed at little or no cost to universities, research firms, the government, and so forth.  Mini vendors started to offer UNIX in parallel with their own operating systems, and gradually put all their efforts into UNIX.  But UNIX had a problem too; the early spread had created multiple warring UNIX versions, and many small variations on the two main themes.  Standards aimed at the APIs (POSIX) came along to reduce the impact of the divergence, but it was still there.

Linux, which came along in 1991, was an implementation of the UNIX/POSIX APIs in open-source form, with one open-source group behind it.  It was modern, lightweight, and license terms seemed to ensure there’d be no lock-in or takeovers possible.  It won.

The moral of this, for our Optimum Open, is that most open initiatives start with some credible issue that drives vendors to buy in.  That’s important, because today’s open-model network has a mixture of driving forces, but arguably they tend to come from the buyer rather than from the vendor side.

Standards groups attempt to create open network elements by specifying functionality and interfaces, and while these efforts have been fairly successful in creating competition, they failed to create commoditization.  The problem is that a network element is a software-based functional unit that runs on hardware.  As long as network software could be sustained in proprietary mode, the hardware side didn’t matter much, and open routing software is as much a slog as open operating systems.  Even today, we’re only beginning to understand what an open switch/router software package should look like, and what its relationship with hardware should be.  Standards, in addition, have proven to be too high-inertia to keep pace with rapid market developments.  That’s why there’s been so much pressure to come up with things like open-source router software and white-box devices.

Why hasn’t open-model networking pushed for creative software for the same reason that open-model computing did?  The answer is that in the computing world, the purpose of the open operating system was to build a third-party ecosystem that would justify hardware sales for the key players.  In open-model networking that falls apart for three reasons.

The first reason is that open-model networking is trying to commoditize both hardware and software.  Who promotes the open software model that pulls through hardware nobody makes much money on?  Since the software is free and the hardware commoditized, there’s not a lot of dollars on the table for vendors, so only buyers have an incentive to drive the initiatives.  Buyers of network devices don’t typically have big software development staffs to commit to open-source projects.  Most such projects are advanced by…vendors.

The second reason is that there is no third-party software industry driving adoption of network devices.  Open-source router software plus commodity white box equals no money to be made on any piece of the ecosystem, if there’s no software add-ons to sell.

The final reason is the vendor certification programs.  You can be a Cisco or Juniper or Nokia or Ericsson certified specialist in using the devices.  That certification makes employees more valuable, as long as the stuff they’re certified on is valuable.  That encourages certified people to think in terms of the familiar vendors/products.

I don’t think that the kind of openness we’ve seen in software for decades will come about in networking unless one of two things happen.  The first is that the service providers take a bigger role in developing open-model networking.  Most of the progress so far has come from cloud providers and social network players (Google and Facebook, respectively).  The second is that the computer hardware vendors get more aggressive.

I’ve worked with the service providers for much of my career, both within standards groups and independently.  The great majority of them want open networking, but only a small minority is prepared to spend money on staffing up open-source or even internal network projects.  Of that small group, perhaps a quarter or less has taken even a fruitful, much less optimum, approach.  One comment I got last year seems relevant; “We’re ready to build networks from open devices, but we’re not ready to build open devices.”  That puts the onus back on the vendors.

Even computer/server vendors have limited incentives to promote open-model networking, given that the goal of the buyer is for the stuff to be as close to free as possible.  The server vendors could play in a hosted-virtual-function game, but NFV has failed to create a broadly useful model for that.  The bigger hope may be the “embedded system hardware” devices, things like Raspberry Pi or the appliances on which SD-WAN is often hosted.  But the best hope would be the chip vendors, and they have their own issues.

You make money selling what you make, which for chip vendors like Intel, AMD, NVIDIA, Qualcomm, Broadcom, and the rest is chips.  Since these aren’t the kind of chips you eat, people have to consume them within devices that are purchased, so the things that promote those devices are the things chip vendors really love.  A network these days has way more devices connected to it than living within it, which means that the stuff chip vendors focus on today (smartphones, PCs, tablets) are likely to be the things they focus on in the future.

This doesn’t mean that they wouldn’t take an excursion into a new space if there was enough money on the table.  But how much money is there in open-model networking that is aimed at near-zero cost to the buyer?  Not only that, the devices and even the network model of an open-model network are only hazily understood today.  Does Intel (for example) rush out to educate the market so competitors can jump in and try to displace them without bearing the educational costs?

We’re back to buyer-side support here.  There’s only one group that has a true incentive to push open-model networking forward, and that’s the network operators.  The big cloud providers have in fact advanced data center switching in an open direction, but their incentives to go further are limited.  Telcos are another story.

And here, the AT&T change in leadership may be the big factor.  Stephenson was a supporter of AT&T’s efforts in open-model networking, which while they weren’t perfect, were likely the best in the market.  Will Stankey offer strong support, even better leadership?  He says he will continue Stephenson’s vision, but whether he does, and how he does it, may decide the future of open-model networking.

Using Models to Mediate Lifecycle Behavior in Networks

Lifecycle automation is all about handling events with automated processes.  Event interpretation has to be contextual, which means that event handling needs to recognize specific contexts or states, so that each event is interpreted correctly.

There are advantages of state/event tables over policies in controlling event-handling in lifecycle automation, at least in my view.  There are also some disadvantages, and there are some special issues that relate to state/event systems that consist of multiple interrelated elements.  I did extensive testing of state/event behavior over the last decade, in my ExperiaSphere project, and I want to both explain what I learned and point out a possible improvement in the classic state/event-table model.

One of the problems that quickly emerges in any lifecycle automation project is the sheer number of devices/elements that make up a network or hosting system.  If you consider a “service” or an “application” as being the product of ten thousand individual elements, each of which has its own operating states and events to consider, you quickly reach a number of possible interactions that explodes beyond any practical ability to handle them.  You have to think instead in terms of structures.

A typical service or application has a discrete number of functional pieces.  For services, you have a collection of access networks that feed a core, and for applications you have front- and back-end elements, database and compute, etc.  The first lesson I learned is that you have to model a service using a “function hierarchy”.  Staying with networks now (the same lessons apply to applications, but ExperiaSphere was a network project), you could model a service as “access” and “core”, and then further divide each based on geography/administration.  If I had 30 different cities involved, I’d have 30 “access” subdivisions, and if I had three operators cooperating to create my core, I’d have three subdivisions there.  Each of these subdivisions should be visualized as a black-box or intent-model element.
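To make the hierarchy concrete, here’s a minimal sketch in Python; the element names and structure are illustrative assumptions of mine, not an artifact of the ExperiaSphere work.

```python
# A minimal sketch of a "function hierarchy" service model, with hypothetical
# element names; a real model would carry far more detail per element.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ModelElement:
    """A black-box/intent-model element in the service hierarchy."""
    name: str
    children: List["ModelElement"] = field(default_factory=list)

# "master service" -> "access"/"core" -> per-city and per-operator subdivisions
service = ModelElement("master-service", [
    ModelElement("access", [ModelElement(f"access-city-{n}") for n in range(1, 31)]),
    ModelElement("core", [ModelElement(f"core-operator-{n}") for n in range(1, 4)]),
])

def count_elements(e: ModelElement) -> int:
    return 1 + sum(count_elements(c) for c in e.children)

print(count_elements(service))  # 36: one master, two high-level, 33 subdivisions
```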

The purpose of this hierarchy is to reduce the number of discrete elements that lifecycle automation has to model and handle.  Each of the model elements in my example represents a cohesive collection of devices serving a unified purpose.  Further, because of the way that traditional networks and network protocols work, each of these model elements also represents a collection of devices that perform as a collective, most having adaptive behavior that’s self-healing.

In order for that to happen, the service has to be structured by a coupled model, each element/node of which represents a “composition/decomposition” engine.  At the lowest level, the elements encapsulate real infrastructure via the exposed management APIs, and harmonize this with the “intent” of the elements that model them.  At higher levels, the purpose of the engine is to mediate the flow of commands and events through the structure, so the elements behave collectively, as they must.

A model of a service like the one I described would contain one “master service” element, two elements at the high level, and 33 at the next (30 under “access” and 3 under “core”).  When the model was presented from a service order and “activated”, the ordering process decomposed the two main elements into the proper lower-level elements based on service topology and carrier relationships.  My testing showed that each of the elements (there are now 36 instantiated) should have its own state/event structure.  In the work I did, it was possible to impose a standard state structure on all elements and presume a common set of events.  Any condition recognized “inside” an element had to be communicated by mapping it to one of the standard events.
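A minimal sketch of what a standard state set and event set might look like; the names are hypothetical stand-ins, not the states and events the project actually used.

```python
# Hypothetical standard states and events imposed on every model element.
from enum import Enum, auto

class State(Enum):
    ORDERED = auto()
    ACTIVATING = auto()
    OPERATING = auto()
    DEGRADED = auto()
    OFFLINE = auto()
    DECOMMISSIONED = auto()

class Event(Enum):
    ACTIVATE = auto()
    ACTIVE = auto()          # child reports it is meeting its SLA
    SLA_VIOLATION = auto()   # child reports it cannot meet its SLA
    DECOMMISSION = auto()
    PROCEDURE_ERROR = auto() # invalid state/event intersection

# Any condition detected "inside" an element must be mapped to one of these
# standard events before it can be reported; a device alarm inside an
# "access-city" element, for example, would surface only as SLA_VIOLATION.
```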

In this structure, events could flow only to adjacent elements.  A lower-level element was always instantiated by the action of deploying the higher-level element, its “parent”, and parent and child elements “know” of each other and can communicate.  No such knowledge or communication is possible beyond adjacent elements, because for the structure to be scalable and manageable, you can’t presume to know what’s inside any element; it’s a black box or intent model.

The presumption in this structure is that each parent element instantiates a series of child elements, with the child elements presenting an implicit or explicit service-level agreement.  The child is responsible for meeting the SLA, and if that cannot be done, for generating an event “upstream” to the parent.  The parent can then do any number of things, including accepting a lower level of service, or replacing the child element by reprovisioning.  That reprovisioning might be done via the same infrastructure, or it might seek a parallel option.  It might even involve reprovisioning other elements to build the higher-level service represented by the parent in a different way.

An important point here is that when the child element changes state from “working” to “not-working” (or whatever you’d call them), it’s responsible for generating an event.  If the parent wants to decommission the child at this point, it would issue an event to the child calling for that.  If the parent cannot jigger things to work according to the SLA that it, in turn, provided to the highest level (the singular object representing the service as a whole), then it must report a change of state via an event, and its own parent must then take whatever action is available.
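Here’s a minimal sketch of that adjacency-limited escalation; the names and logic are my illustrative assumptions, not the project’s code.

```python
# Sketch: events flow only between adjacent (parent/child) elements.
# A child that cannot meet its SLA reports upward; the parent either
# reprovisions the child or escalates to its own parent.

class Element:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.state = "working"
        if parent:
            parent.children.append(self)

    def handle_event(self, event, source):
        if event == "sla-violation" and source in self.children:
            # First choice: try to reprovision the failed child locally.
            if self.reprovision(source):
                return
            # Otherwise this element can no longer meet its own SLA.
            self.state = "not-working"
            if self.parent:
                self.parent.handle_event("sla-violation", source=self)

    def reprovision(self, child) -> bool:
        # Placeholder: redeploy on the same or parallel infrastructure.
        return False

svc = Element("master-service")
access = Element("access", parent=svc)
city12 = Element("access-city-12", parent=access)
access.handle_event("sla-violation", source=city12)  # escalates to master-service
```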

I presumed in my testing that all the variables associated with a given element’s operation were stored in the service model.  I also presumed that the variables associated with a given model element would be delivered to a process instance activated as the result of the element having received an event.  This is consistent with the NGOSS Contract proposal of the TMF, which I acknowledged as the source of the concept.  Thus, any instance of that software process could respond to the event, including one just scaled up or created.
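A minimal sketch of that data-coupling, assuming a hypothetical model repository and handler lookup; the point is that the handler gets everything it needs from the model, so any process instance can field the event.

```python
# Sketch of NGOSS-Contract-style data coupling: all context lives in the
# model, so any process instance (new, scaled, or existing) can handle an
# event.  The repository and handler names here are hypothetical.

model_repository = {
    "access-city-12": {"state": "operating", "sla": {"latency_ms": 40}, "site": "city-12"},
}

def dispatch(element_id: str, event: str) -> None:
    variables = model_repository[element_id]           # pull full element context
    handler = select_process(variables["state"], event)
    new_state = handler(variables, event)               # handler itself is stateless
    model_repository[element_id]["state"] = new_state   # persist the result

def select_process(state: str, event: str):
    # In the real approach this lookup is the state/event table described below.
    def default_handler(variables, event):
        return variables["state"]
    return default_handler
```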

My tests all used a state/event table within each model element’s variable space.  The table represented all possible process states and all possible events, and where a given state/event intersection was invalid or should not occur, the system response was to enter the “offline” state and report a “procedure error” event to the higher level.
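A minimal sketch of such a table, using the hypothetical states and events from earlier; any intersection not explicitly defined is treated as a procedure error.

```python
# Sketch of a state/event table with the "invalid intersection" rule:
# anything not explicitly defined sends the element offline and reports
# a procedure error upstream.  States, events, and processes are illustrative.

def activate(element_vars): return "activating"
def confirm_active(element_vars): return "operating"
def repair_or_escalate(element_vars): return "degraded"
def decommission(element_vars): return "decommissioned"

STATE_EVENT_TABLE = {
    ("ordered",    "activate"):      activate,
    ("activating", "active"):        confirm_active,
    ("operating",  "sla-violation"): repair_or_escalate,
    ("operating",  "decommission"):  decommission,
    ("degraded",   "active"):        confirm_active,
    ("degraded",   "decommission"):  decommission,
}

def handle(element: dict, event: str) -> None:
    process = STATE_EVENT_TABLE.get((element["state"], event))
    if process is None:
        element["state"] = "offline"
        report_upstream(element, "procedure-error")  # event to the parent element
        return
    element["state"] = process(element)

def report_upstream(element: dict, event: str) -> None:
    pass  # placeholder: deliver the event to the parent element
```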

State/event tables like this were familiar to me from decades of work on protocol handlers, where they’re used routinely.  Perhaps that familiarity makes them easy for me to read, but state/event relationships can also be conveyed in graph form.  A system with two states could be represented as two ovals, labeled with the state (“working” and “not-working”, for example).  Events are depicted as arrows from one state to another, labeled with the event.  If necessary, process names can be inserted along the arrows as boxes.  There are plenty of languages that could be used to describe this, and those descriptions could then be decomposed into an implementation.

In many cases, looking back on my tests, a modeled service had two or three “progressions” that represented transitions from the “normal” state and back, through perhaps several intermediate states.  This kind of service structure could be drawn out as a kind of flow chart or directed graph.  If that were done, then each “node” in the graph would represent a state and the event arrows out of it would represent the processes and progressions.  In that case, if the variables for the model elements included the “node-position” instead of the “state”, the graph itself would describe event handling and state progression.  For some, this approach is more readable and better expresses the primary flow of events associated with a service lifecycle.
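As a sketch of the “node-position” idea, the same information can be expressed as a directed graph, where each node is a state and each edge carries the event and the process that handles it; names here are illustrative.

```python
# Sketch: the state/event table re-expressed as a directed graph.
# Each edge is (event, process_name, next_state); "node-position" in this
# graph plays the role the "state" variable played in the table form.

LIFECYCLE_GRAPH = {
    "ordered":    [("activate", "activate_element", "activating")],
    "activating": [("active", "confirm_active", "operating")],
    "operating":  [("sla-violation", "repair_or_escalate", "degraded"),
                   ("decommission", "decommission_element", "decommissioned")],
    "degraded":   [("active", "confirm_active", "operating"),
                   ("decommission", "decommission_element", "decommissioned")],
}

def next_position(position: str, event: str):
    for evt, process, nxt in LIFECYCLE_GRAPH.get(position, []):
        if evt == event:
            return process, nxt
    return "procedure_error", "offline"

print(next_position("operating", "sla-violation"))  # ('repair_or_escalate', 'degraded')
```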

Another point the testing showed is that you don’t want to get too cute with the number of states and events you define.  Less is almost always better.  Operators tended to create states and events designed to reflect everything, when what they needed was to reflect everything relevant to process activation.  If two events are handled the same way, make them one.  If two states handle all their events the same way, they should be one state.  Reducing the number of state/event intersections makes processing easier and also makes it easier to understand what’s happening.

A small number of states and events also contributes to the ability to construct a “flow graph” of multiple elements at once.  An operations process would generate a fairly limited number of normal state/event flows across the entire spectrum of elements, which would make it easier to inspect the flows and identify any illogical sequences.  For example, there are only two logical “end-states” for a service.  One is the normal operating state, and the second is the “disabled/decommissioned” state.  All possible state/event flows should end up in one or the other of these.  If some combination of states and events sticks you somewhere else, then there’s something wrong.
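That end-state check is easy to automate against the graph form.  Here’s a minimal sketch, reusing the same illustrative graph shape as the earlier sketch, that flags any state from which neither end-state can be reached.

```python
# Sketch: verify that every state in a lifecycle graph can reach one of the
# two legitimate end-states.  Edges are (event, process, next_state).

LIFECYCLE_GRAPH = {
    "ordered":    [("activate", "activate_element", "activating")],
    "activating": [("active", "confirm_active", "operating")],
    "operating":  [("sla-violation", "repair_or_escalate", "degraded"),
                   ("decommission", "decommission_element", "decommissioned")],
    "degraded":   [("active", "confirm_active", "operating"),
                   ("decommission", "decommission_element", "decommissioned")],
}
END_STATES = {"operating", "decommissioned"}

def reachable(start: str, graph: dict) -> set:
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(nxt for _evt, _proc, nxt in graph.get(node, []))
    return seen

# Any state from which neither end-state is reachable indicates a broken flow.
stuck = [s for s in LIFECYCLE_GRAPH if not (reachable(s, LIFECYCLE_GRAPH) & END_STATES)]
print(stuck)  # [] means all flows end in "operating" or "decommissioned"
```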

Operators seem to be able to understand, and like, the state/event approach when it’s explained to them, but most of them don’t see it much in vendor presentations.  Policies have been more broadly used, and I think that’s a shame, because I don’t think we have the same mature thinking on how to do “stateless policy handling” of lifecycle events.  Perhaps more exposure to the TOSCA policy approach will offer some examples of implementations we can review, or perhaps people will take the TMF approach more seriously (including the TMF itself!).  We’re coming to a critical point in lifecycle automation; much more spending on one-off opex improvement strategies will pick all the low apples, to the point where funding a better and more general approach will be difficult.

What Might Apple Have in Mind for a Cloud?

Why is Apple seemingly behind in the cloud?  This isn’t a new question (I’ve pointed out their lagging position for a decade), but it’s perhaps a serious issue now that it’s clear that the cloud is playing a growing role in IT.  Not only that, the cloud is impacting application development.  One could argue that evading the cloud will become more and more difficult for Apple, so what might it do?

Protocol did an article asking why Apple seems to be hiring cloud types, especially people with container/Kubernetes skills.  Obviously, iPads, iPhones, watches and consumer electronics, and even the Mac are hardly likely to be hosting containers.  The moves offer a couple of possibilities, so let’s explore them to see if any make sense.

The one we can dismiss pretty quickly is that Apple intends to get into the server or hosting-platform business.  These spaces are mature, they’re based almost entirely on open-source software, and it’s hard to see how Apple could plot a successful late-entrant strategy.  Or why they’d want to, given the competition and the fact that the whole space is already under incredible open-model pressure.  Apple is hardly an open-model company.

This argument can also be applied to another possibility, which is that Apple might want to focus more on open-source software for its own tools.  The problem with that is that you can’t charge for it, and that again collides with Apple’s strategy.  There are open-source products for Mac already, and it’s difficult to believe that Apple thinks that lack of them is somehow hurting Mac sales.  Forget this one too.

That leaves some sort of cloud strategy.  Going back to the point about Apple’s desire to build and rely on a closed ecosystem for profits, public cloud services would seem almost as unlikely a target as building servers or selling platform software like a Kubernetes ecosystem.  In addition, jumping into public cloud would be an enormous gamble from a business perspective because of the first-cost challenge.  The cloud competitors had the first-mover advantage, and they’re highly built out.  Apple could hardly start small, and to match the scope of competitors would require a mind-boggling (and Street-frightening) investment.

Of course, we can’t assume they’re starting a programmers’ relief effort for aging Kubernetes experts either.  There has to be some cloud connection, and what I think is left is that Apple is going to deploy cloud infrastructure for its own service mission.  The question would then be “What service?”

It’s possible that this is nothing more than a desire to enhance their App Store model.  Many Apple insiders say that the current App Store is behind the times in terms of architecture, and could certainly use an upgrade.  Apple is expected to become more dependent on recurring sales to current customers as the refresh rate on popular things like smartphones slows.  However, some of the key hires recently seem to come more from a broad cloud-platform background.  That, to me, suggests that Apple is planning to build their own cloud infrastructure using generalized tools, which means that they have a generalized mission for it.  The goal isn’t a single service (though they’d likely start with one specific offering), but services.

Perhaps the obvious candidate for the starting point of an Apple cloud is streaming video, which of course Apple already offers in the form of Apple TV.  Amazon and Google already have their own streaming services, and during the pandemic both companies got a lot of street cred simply by being associated with what customers (yes, sometimes in desperation) tuned into.  But again, those key early hires didn’t seem associated with streaming, but with more generalized cloud.  Streaming video differentiation would have to be based on content, not on delivery technology.  Also, Microsoft is perhaps the arch-rival to Apple among the cloud providers, if you look at the broad business targets of Microsoft and Apple and not at “the cloud” in particular.  Would Apple take a swipe at cloud competition with a strategy its arch-rival doesn’t have?  Probably not, but there are also persistent rumors that Microsoft has been very quietly exploring the idea of streaming services.

I don’t think video is the starting point Apple would pick, though.  Their activity seems too container-focused, so I’m inclined to think that what Apple has in mind is a new device or device set that’s linked to an Apple-cloud-hosted set of services.  I also think they’d link this new stuff back to their current products, particularly the iPhones, to create the familiar Apple pull-through-your-friends symbiosis.  What might this new stuff be?

Perhaps IoT.  Apple’s HomeKit (launched back in 2014) was clearly a counterpunch to Amazon and Google initiatives, and in late 2019 Apple joined the ZigBee alliance (along with Amazon and Google).  Microsoft has a big Azure IoT initiative, and of course Amazon and Google do likewise.  So far, Apple has seemed to target the “intelligence” of IoT more than the devices.  Amazon and Google have some IoT devices, too.  A set of home IoT devices for Apple could make a lot of sense, and more sense if it were linked with a set of subscription services that related to the devices and their role in Apple’s overall ecosystem (both HomeKit and Apple overall).

Apple Watch is already an IoT device, providing health monitoring, exercise monitoring, etc.  Since Amazon and Google already have doorbell/camera stuff, it seems likely Apple would consider the same stuff as table stakes.  Might they also then build ZigBee devices of their own, and focus for differentiation on the Apple ecosystem in the cloud?  Might they even work to sell that ecosystem as a service to non-Apple users (perhaps a bit crippled, of course!)?

There’s a lot that cloud intelligence could do for IoT.  Security is a good example, and so is control of lighting, TV, music, appliances, and so forth.  The more stuff you can sense and control, the more logical it is to provide for policies and policy sets that interpret sensor data and other “readable” stuff like time, temperature, etc. and then activate something.  The longer the range of your IoT network, the more stuff can be in it, and the broader the set of things you can do with it.
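As a purely hypothetical sketch (nothing here is an Apple or HomeKit API), a cloud-hosted policy of that kind might look like this: interpret sensor readings plus “readable” context like time of day, then activate something.

```python
# Hypothetical sketch of a cloud-hosted IoT policy; device names, sensor
# fields, and actions are all illustrative assumptions.
from datetime import datetime

def evening_arrival_policy(sensors: dict, now: datetime) -> list:
    actions = []
    if sensors.get("front_door") == "unlocked" and 17 <= now.hour <= 23:
        actions.append(("lights", "porch", "on"))
        if sensors.get("indoor_temp_f", 72) < 66:
            actions.append(("thermostat", "living_room", "heat_to_70"))
    return actions

print(evening_arrival_policy({"front_door": "unlocked", "indoor_temp_f": 64},
                             datetime(2020, 7, 15, 19, 30)))
```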

Amazon is clearly heading in this direction with its Sidewalk initiative, announced last year as a kind of super-IoT protocol operating in the 900MHz band.  The Sidewalk initiative seems aimed at bringing the “Ring neighborhood” concept into a more meaningful place by making it possible to share some data with others nearby.  Add this to Bluetooth in phones and you can see how a community could be built, a community greater than the sum of its members.  Does that sound like something Apple might like?

These communities live in the cloud, not on the sidewalk (nod to Amazon for the naming, though!).  Apple also loves AI and would like to leverage it, and this would be a perfect place to marry the cloud, AI, and IoT in one initiative.  If it’s a priority for Amazon (and, according to rumor, Google as well), then so much the better.  A little competitive scrum is just what Apple would like to see, to build buyer awareness.  They can then rely on their brand and its loyal base to do the rest.

This might actually help 5G IoT supporters.  That concept has been mired in hypotheticals from day one, and in addition faces the classic chicken-and-egg issue of who builds a service without devices to use it and who builds devices with no service.  If Sidewalk extends home IoT, then could 5G IoT extend Sidewalk and similar initiatives?  It’s easier to extend a workable concept to a wider area than to start wide from scratch.

I think the most likely direction for Apple, if it indeed has cloud aspirations, is the IoT cloud.  It’s hard to see what else they could do that’s consistent with their long-term, successful, marketing strategies.

Is There an Application Design Dimension to Cloud Optimization?

The cloud seems to be a clear winner in a pandemic world, but it’s often true that winning, placing, and showing (as they say at the racetrack) are separated by a whisker.  The cloud trends, during and after the pandemic, are far from simple.  I blogged earlier about the need for a new way of providing the data connection between cloud and data center.  It also seems that some changes to how we think about applications may be needed.

The reason “the cloud” is an overall winner in the current market is simple.  When you’re not comfortable about the future, you try to substitute expenses for capital purchases.  An average IT buy currently involves a 39-month capital commitment.  Building out your data center is a capital IT buy, and cloud computing offers an expense-based alternative, one that doesn’t create a risk of stranded capital if your assessment of the recovery from the pandemic doesn’t happen to be correct.  Cloud computing is an expense, and if cloud providers continue to be forgiving with respect to the length of contract terms, it’s an expense with an early exit.

I’ve noted in past blogs that it’s not clear whether the transition to cloud-think is going to be persistent.  Enterprises benefit, in the net, from cloud front-end handling of mobile/web traffic, but they lose as much as 23% in the cost of hosting the deeper “back-end” applications and components, relative to self-hosting, and they also have those nagging security and compliance concerns.  There’s also a pretty significant variation in the impact of the cloud on opex, and all of these inevitably factor into the way that the cloud-versus-data-center decision will be made in the future.

Applications that don’t exhibit a lot of variability in terms of load are often cheaper to host in-house than in the cloud.  Corporate data center economies of scale are often just a percent or two short of those of cloud providers, and the cloud providers’ profit margins more than wipe out that small difference.  What corporations can’t easily do is address application hosting where needs are highly variable, as is often the case with web and mobile front-end components.
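To illustrate the arithmetic with made-up but plausible numbers (these are not survey results, just a sketch of the logic):

```python
# Sketch of the stable-load vs. variable-load hosting comparison, using
# purely illustrative numbers rather than survey data.

inhouse_unit_cost = 1.00      # cost per unit of steady capacity, self-hosted
cloud_efficiency_edge = 0.02  # cloud economies of scale: ~2% cheaper base cost
cloud_margin = 0.30           # provider margin added back onto that base

cloud_unit_cost = inhouse_unit_cost * (1 - cloud_efficiency_edge) * (1 + cloud_margin)

# Stable workload: you pay for the same capacity either way, so unit cost decides.
print(round(cloud_unit_cost, 3))   # ~1.274: in-house wins for flat load

# Variable workload: in-house must buy peak capacity; the cloud charges for use.
peak, average = 10.0, 3.0
inhouse_variable = peak * inhouse_unit_cost
cloud_variable = average * cloud_unit_cost
print(inhouse_variable > cloud_variable)  # True: the cloud wins when load swings
```

The point of the sketch is only that a small efficiency edge plus a healthy provider margin leaves stable workloads cheaper in-house, while pay-for-use wins when the peak-to-average ratio is high.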

Then there’s opex.  The closer the cloud mimics the data center, the more the opex of the cloud matches that of the data center.  Operationalizing applications running on VMs isn’t all that much less expensive than operationalizing server-hosted applications, according to the enterprises I’ve talked to.  Thus, it’s containers and serverless that really make the cloud a potential savings in opex terms, and containers and serverless rely for their economics on an efficient orchestration model, which is why Kubernetes is the market darling-of-the-moment.  Even Kubernetes doesn’t address full-scope application lifecycle automation, but the broader Kubernetes ecosystem does.

A full-blown Kubernetes ecosystem is a mixture of elements that most enterprises are still uncomfortable mixing.  It’s interesting to note that even basic Kubernetes literacy decreases as you move from development to operations to IT management to executives.  For the ecosystem as a whole, literacy gradients are particularly sharp.  Only about 15% of operations personnel are fully literate in the Kubernetes tools needed to fully automate a hybrid-and-multi-cloud deployment, and less than 3% of CIOs have the skills. For executives overall, the literacy is statistically insignificant.

Different cloud-hosting models differ in how much the literacy issue impacts them.  SaaS is the easiest way to transform capex into expenses, but most companies find it difficult to fit their current business practices to fixed SaaS models of applications.  PaaS, meaning platform-based hosting, reduces opex issues relative to self-hosting in data centers, largely because it lifts the burden of keeping middleware and operating system tools synchronized.  IaaS, which is a VM-in-a-cloud model, will (not surprisingly) offer little benefit versus VMs in the data center from an opex perspective.  Since cloud provider profit margins raise the cost of IaaS relative to in-house VMs, it’s harder to create a migration benefit for IaaS.

The cloud doesn’t need a “migration benefit” for front-end mobile and web elements of applications, of course, which is why these have driven enterprise cloud spending for the last 18 months.  Web front-end enhancements, based on cloud-hosting of front-end components, will be more than enough to keep cloud services growing, but if the cloud is to claw a bigger piece of the overall IT pie, it will have to move beyond that.

The key to that move is a better vertical-specialized SaaS model.  SaaS has done well in “horizontal” niches like CRM, but it’s not been a big factor in the core business applications that represent the majority of in-house IT costs.  That raises two interesting questions.

First, could a SaaS model, if we came at it differently, help broaden cloud usage?  In a past blog, I noted that having a better approach to database-sharing between cloud and data center would help, but could super-SaaS augment this approach?

Super-SaaS could be based on either or both of two approaches.  One would be to try to decompose a mission-critical application set into a series of “horizontal verticals”.  CRM works as SaaS because the business mission is very similar across many verticals.  Retail PoS is another area that has major similarities across many verticals, and there are some applications for at least pieces of that already.  The problem is that once you get away from the point of sale into the rest of the business, there are linkages to other applications that are more specialized.

The second option would be to actually build more full-vertical applications, but in a cloud-friendly form.  The challenge with this, historically, has been the cost and compliance implications.  Company data stored in the cloud gives executives the heebie-jeebies, as I’ve noted before.  But suppose that we had a transaction-driven query engine and database on the premises, with the more cost-effective data interface I talked about in that prior blog?

Second, could low-code/no-code be a path to moving some traditional data center applications to the cloud?  Citizen developers like a cloud platform, many SaaS providers (including Salesforce) have low-code capability, and there are already successful cloud/low-code applications in play, as pure cloud front-ends to data center apps.  Could this be expanded?

I think these two might well go together.  Low-code today is dominantly about mobile and web development for application front-ends.  Both low- and no-code are based at least in part on drag-and-drop application assembly, using predefined logic or data elements.  It wouldn’t be rocket science to extend both approaches so they could customize those “horizontal verticals”, or so they could add customizing hooks and tweaks onto full vertical applications.

The point is that if we really want to make the cloud a better response, we not only have to rethink our data access model, we also have to rethink how we build applications.  I know of a dozen or so enterprises who are considering points like these, and I’m hopeful that either one or more of them will cobble a solution together, or that one of their vendors will see the light.