The Evolution of Public Cloud Services and Applications

Some recent stories on cloud provider growth and total revenue show Amazon well out in front of everyone else, Microsoft in a comfortable second place with faster-than-average growth, and Google and IBM essentially tied for third place.  I’ve noted that a part of the total cloud dynamic is the market segments each provider is addressing.  Amazon has a very strong position with web-based startups, Microsoft has hybrid cloud strength, and IBM and Google are both struggling to find a niche to focus on.  Perhaps everyone, on both the provider and the consumer side, is doing the same.

I’m a programmer, software architect, and director of software development by background.  In my view, the challenges of the cloud begin with the foundation of business IT, which is “transaction processing”.  In the early days of IT, nearly everything done by computer was “batch processing” meaning that records of commercial activity were captured and entered into what was essentially a repository of business activity.  The actual activity was offline.

When “online transaction processing” (OLTP) came along, what happened was that the traditional “TP” stuff was augmented with an “OL” part, meaning that applications were extended with the components needed to allow direct human interaction with transaction processing to bypass the batch process.  Personally, I think a better term would have been “real-time transaction processing”, partly because the goal was to connect workers to work in real time, and partly because “online” means “on the Internet” these days.

The split between “OL” and “TP” is really critical for the cloud, because the cloud is really good at web-like, web-related stuff and much less suited to linear processes that run from start to finish without human intervention.  The reality of enterprise use of the cloud today is that most of it relates to creating a cloud-hosted front-end to traditional IT applications, meaning it’s an implementation of that “OL” part.

Web server activity tends to be stateless, meaning that if you send an HTTP request for a web page, you could obtain it from any member of a load-balanced set of resources assigned to the URL.  Stateless stuff can thus be scaled on demand, replaced on demand.  In addition, when you make a web request it’s fairly easy to decode it and send the details from the baseline server to a deeper element depending on what’s being asked.  The delay, in the context of human request/response expectations, isn’t critical.  Not only that, you could in theory connect the user to a kind of “storefront” that assembled the “order” by calling a number of backend services and collecting the results.
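
To make that concrete, here’s a minimal sketch, in Python, of what that kind of stateless storefront might look like.  The backend names, URLs, and functions are purely illustrative assumptions on my part, not anyone’s actual architecture.

```python
# Minimal sketch of a stateless "storefront" front-end.  Each request carries
# everything needed, so any replica behind the load balancer can serve it.
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical backend services; a real deployment would discover these.
BACKENDS = {
    "catalog":   "http://catalog.internal/items",
    "pricing":   "http://pricing.internal/quote",
    "inventory": "http://inventory.internal/stock",
}

def call_backend(name, url, params):
    """Fetch one backend's contribution to the assembled response."""
    full_url = url + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(full_url, timeout=2) as resp:
        return name, json.load(resp)

def assemble_order(params):
    """Fan out to the backends in parallel and merge their results."""
    with ThreadPoolExecutor(max_workers=len(BACKENDS)) as pool:
        futures = [pool.submit(call_backend, n, u, params) for n, u in BACKENDS.items()]
        return dict(f.result() for f in futures)
```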

The “OL” piece, in technical terms, is a good candidate for container hosting and microservices.  Containers are also a good strategy for enterprise server virtualization.  If we assume the two trends converge, then you can see why Kubernetes and various service-mesh and federation extensions to it are suddenly very popular.  Microsoft’s hybrid cloud primacy is less due to its credentials in Kubernetes (which Google developed, after all) than to the fact that it has a foot in both cloud and premises and was more aware of the bridging of the two, both technically and politically.

Event processing and “functional” or “lambda” computing are a further recognition by cloud providers that the real-time space is their natural home.  For all the cloud providers, event-driven applications represent a second front, meaning a class of application that may still ultimately feed “TP” and thus be “OL” in my little sequence, but isn’t tied to web stuff and may have looser ties even to “TP” overall.  More “OL”, then.  Amazon and Microsoft saw (and perhaps still see) functional event processing as a separate “serverless-class” application, but I think both companies are eventually going to merge their serverless and container stories; Amazon IMHO is already showing signs of that.

Google from the first was pushing microservices, and that’s what hurt them.  Microservices are a development/application model, not a business mission.  Of all the cloud providers, they’ve been the slowest to embrace the hybrid cloud model as the dominant approach to the enterprise, the “OL” and “TP” segmentation I’ve talked about here.  Are they perhaps the most insightful about “microservices” as a development and deployment model?  Perhaps, but enterprises don’t tell me of any helpful conversations with Google cloud people on the “OL” versus “TP” or event front.  Still, they could succeed if they worked at it.

IBM is in a different kind of bind after its Red Hat acquisition.  What’s good for IBM cloud may not be best for IBM.  Red Hat and OpenShift are more and more the container/Kubernetes turnkey platform of choice.  An optimum vision there would make the hybrid-cloud link totally generic, meaning OpenShift would work to deploy an “OL”/“TP” pair for an enterprise, and even develop specifically for that pairing and the linkage.  That open approach would make OpenShift a great partner for any cloud provider, not just IBM.

IBM and Google are tied in the cloud, essentially.  IBM’s best option to gain there would be to create a specific bridge between OpenShift and IBM’s cloud, but aimed where?  The “OL” part of OLTP is already highly committed to Microsoft, and Amazon is gaining ground there.  Does IBM meet those two head-on?  Do they look instead at an “OL” that’s not so “TP”-connected, like event processing?  Or do they look elsewhere?  Does Google try to leverage its Kubernetes position more in the “OL” and “TP” front-end-model race, or does it try to go after events too?  Or do one or both look elsewhere?

Carrier cloud is the big imponderable here.  If network operators were to fully realize the potential of the cloud in transformation, they’d generate the largest number of new cloud data centers in the market, and become collectively the largest owner of cloud infrastructure.  However, operators have six specific driver missions: vCPE and NFV, streaming advertising and video service features, IMS/EPC/5G, personalized and contextual services, IoT, and network operator cloud computing services.  The last of these would mean facing the same mission realities as the other public cloud providers, but all the others are more specialized and some wouldn’t look much like containers and microservices at all.  Whatever missions drive carrier cloud, or if none drive it convincingly, the result will be market-changing.

Google’s decision to name a head of a telecom group and focus on that market is likely an indicator that Google at least recognizes that 1) carriers are approaching carrier cloud in a mission-specific way and 2) that means each mission could end up being outsourced to a cloud provider…like Google of course.  If Google could pick up a couple carrier missions, they could gain tremendously in cloud revenue and at the same time hone their skills and tools for another set of cloud applications, a set that might eventually fit into enterprise cloud usage as well.

IBM is a player that operators would love to have in the game.  Unlike Google, IBM isn’t seen as a threat to any operator service future that’s being seriously considered.  They’re strong in IT and software, where operators are almost pathetically weak.  They’re highly credible among C-level executives.  They’re hungry.  But they’re probably on the horns of the dilemma I mentioned earlier in this blog: do they want the IBM cloud to gain, or IBM overall?  I think the latter will be the answer.

A decade ago, when I was involved in the IPsphere Forum, the operators in the body were more interested in working with Google than any other player in the industry.  I’d bet that’s true today too.

Fitting the Cloud Industry’s Cloud-Native Vision to Networks

I had a really interesting talk with a senior operator technology planner about a term we hear all the time, “cloud-native”.  She suggested that we substitute another term for it, “cloud-naïve” because she believed that there was “zero realism” in how the cloud in general and cloud-native in particular were being considered.  If that’s true, we could be in for a long slog in carrier cloud, so let’s look both at what the truth is and what seems to be happening.

The cloud industry, meaning the development community that both builds cloud development and deployment tools and creates cloud applications, doesn’t have a single consistent definition of the term.  The most general one is that cloud-native means building and running applications to exploit the special characteristics of cloud computing.  The Cloud Native Computing Foundation (CNCF) says “Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.”

This isn’t much of a prescription for cloud-nativeness, so to speak.  The reason, perhaps, that we don’t have such a nice recipe is that it’s not a simple issue.  There are many applications that shouldn’t be done in the cloud at all, and many for which containers, service meshes, and microservices are a truly awful idea.  There are even multiple definitions of all three of these things, multiple paths through which they could be implemented.  Thus, it’s not surprising that operators are vexed by the whole topic.  If we’re going to take a stab at making things clear, we have to do what any application developer has to do in considering the cloud—work through the functionality of the application to see where the cloud could be applied, and how.

A network service typically has three layers of functionality.  The primary layer is the data plane, where information moves through the network from source to destination.  Above it is the control plane where in-service signaling takes place, and above that is the management plane where service operations is focused.  Obviously these three layers are related, but from an implementation perspective they are so different we could almost consider them to be different applications.  That’s important when we consider cloud-native behavior.

The data plane is all about packet forwarding, which in my view means that the design of features there and the way they’re committed to resources has to be highly performance-efficient and reliable.  So much so that one could argue that as you get deeper into the network and traffic levels rise, you might want to fulfill data plane requirements outside the general cloud resource pool.  That doesn’t mean you can’t edge-host them or that, where traffic permits, you can’t use cloud hosting.  It does mean that specialized missions don’t fit with general resource pools.

The control plane is a kind of transitional thing.  If you look at the IP protocol, you find that packets are divided into those that pass end to end and those that are actioned within the network.  The former packets are data-plane elements, and the latter are IMHO best thought of as representing events.

Events are where we see an evolution of software thinking, in part at least because of the cloud.  In classic software, you’d have a monolithic application running somewhere, with an “event queue” to hold events as they are generated and from which the application pulls work when it’s done with the last thing it was doing.  More recently, we had a queue that was serviced by a number of “threads” to allow for parallel operation.  Today, we tend to see event processing in state/event terms.
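
In code terms, the older model looks something like the toy Python sketch below; the event names and the handle function are hypothetical stand-ins for the monolithic application logic.

```python
# Toy sketch of the classic model: one shared event queue, serviced by a
# small pool of worker threads for parallel operation.
import queue
import threading

events = queue.Queue()

def handle(event):
    print("processing", event)        # stand-in for the monolithic handler

def worker():
    while True:
        event = events.get()          # block until the next event arrives
        handle(event)                 # one event at a time per thread
        events.task_done()

for _ in range(4):                    # the "number of threads" variant
    threading.Thread(target=worker, daemon=True).start()

events.put("link-down")               # events queue up; workers drain them
events.join()
```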

State/event processing is based on the notion that a functional system is made up of a number of “operating states”, one of which is the baseline “running” state, with the others representing transitions to and from the running state based on conditions signaled by events.  The states represent contexts in which events are interpreted.  Human conversation works this way; if you hear someone say “Can you…” you assume that the next phrase will be a request for you to do something.  “Did you…” would be interpreted as asking whether you did.  The response, obviously, has to be interpreted in context, and state provides that.

The other benefit of state/event processing is that it “atomizes” the application.  Each state/event intersection (state/event processing is usually visualized as a table with rows representing events and columns representing states, or vice versa) consists of a “run-process” and “next-state”.  The processes are then specific to the state/event intersection.  If you presume that the service data model is available (it’s probably what contains the state/event table) then that data model has everything the process needs, and you can kick off a process anywhere, in any quantity, with access to that model alone.
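
A minimal sketch of that table-driven dispatch, with states, events, and processes invented purely for illustration, might look like this:

```python
# Sketch of state/event dispatch: each (state, event) intersection holds a
# run-process and a next-state, and processes see only the model they're handed.

def start_deploy(model):   print("deploying", model["name"])
def record_active(model):  print("now active", model["name"])
def tear_down(model):      print("removing", model["name"])

STATE_EVENT_TABLE = {
    # (current state, event): (run-process, next-state)
    ("ordered",   "Order"): (start_deploy,  "deploying"),
    ("deploying", "Ready"): (record_active, "running"),
    ("running",   "Fault"): (tear_down,     "ordered"),
}

def dispatch(model, event):
    """Find the intersection, run its process, and record the next state."""
    process, next_state = STATE_EVENT_TABLE[(model["state"], event)]
    process(model)
    model["state"] = next_state

service = {"name": "vpn-101", "state": "ordered"}   # the service data model
dispatch(service, "Order")                           # -> state becomes "deploying"
```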

An example of a control-plane process is the topology and route exchanges that happen in IP.  The associated packets, coming from a number of port sources, drive what’s in effect a state/event engine with a data model that’s the routing table.  This model might exist in each router/node in a traditional IP network, or in a centralized server in an SDN implementation.

The point here is that control plane behavior relating to packets that are actioned rather than passed, if it’s managed using state/event processing, lends itself to containerized, microservice-based, service-mesh-connected handling.  Cloud-native, in other words.

The management layer is another place where we have a tradition of monolithic, serial processes.  OSS/BSS systems have generally followed that model for ages, in fact, and most still do despite the protestations of vendors.  There’s a difference between having an event-driven management model and having a management system that processes events.

An event-driven management process is what I’ve described in my ExperiaSphere work.  A new service (or in theory any new application or combination of functions and features) is represented by a model that describes how the elements relate and cooperate.  Each element in this model has an associated dataset and state/event table.  The initial form of the model, the “orderable” form, is taken by an order entry or deployment system and instantiated as a “service” (or “application”).  The model is then “sent” an “Order” event, and things progress from there.

As this process shows, the entire service operations cycle can be assembled using a model structure to represent the functional elements of the service and a set of events to step things through the lifecycle.  There’s no need for monolithic processes, because every element of operations is defined as the product of a state and an event.

In cloud-native terms, this is where microservices and scalability come in.  The entire state/event system is a grid into which simple processes are inserted.  Each of these processes knows only what it’s passed, meaning the piece of the service data model that includes the state/event table that contains the process reference.  The process, when activated, can be standing by in a convenient place already, spun up because it’s newly needed, scaled because it’s already in use, and so forth.  Each process is stateless, meaning that any instance of it can serve as well as any other, since nothing is used by the process other than what’s passed to it.
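
Here’s a tiny example of what “stateless” means in practice: the process receives the model fragment, works on it, and hands it back, so any instance gives the same answer.  The process name and fields are hypothetical.

```python
# Sketch of a stateless lifecycle process: everything it needs arrives in the
# call, nothing is remembered locally, so instances are interchangeable.
import json

def scale_out(fragment_json):
    """Hypothetical process referenced from a state/event table entry."""
    fragment = json.loads(fragment_json)            # the model piece passed in
    fragment["instances"] = fragment.get("instances", 1) + 1
    fragment["state"] = "scaling"
    return json.dumps(fragment, sort_keys=True)     # updated model handed back

piece = json.dumps({"element": "web-front-end", "instances": 2, "state": "running"})
# Two "instances" of the process give identical results for identical input.
assert scale_out(piece) == scale_out(piece)
```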

What this proves to me is that since scalability and resiliency demand that you be able to spin up a process when needed, and that the new process has to serve in the same way as the original, you need some form of stateless behavior.  That, in turn, supports the assertion that microservices are a reasonable requirement for cloud-native control and data plane activity.

Packet forwarding requires a packet to forward, and a process that gets a packet and then dies can’t expect a replacement to forward it; the storage of the packet in transit is stateful behavior.  Further, spinning up multiple packet-handlers to respond to an increase in traffic within a given flow could create out-of-order arrivals, which not all protocols will tolerate.  So, again, I think the forwarding piece of the story is a specialized mission.

Applying this to modern NFV and zero-touch automation implementations, it seems to me that creating virtual network functions (VNFs) as an analog to a physical network function or appliance means that there is no multi-planar functional separation.  The physical device is not cloud-native, and transporting features from that device into the cloud intact won’t create the separation of layers I’ve described, so it’s going to be difficult to make it cloud-native.

The implementation of NFV in a management/deployment sense, and the overall zero-touch operations automation process, could be made cloud-native, provided you had state/event-based logic that managed context and allowed stateless processes.  The logic of operations wouldn’t be written into an application whose components you could diagram, though.  There would be none of the blocks you find in NFV or ONAP diagrams.  Instead, you’d have systems of processes built as microservices and integrated via a model structure.

There is an interesting and not only unanswered but unasked question here, which is what an ideal cloud-hosted data-plane element would look like.  It’s not as simple as it sounds, particularly when at some point (or points) in a data path, you’d have to separate control and management traffic.  This question is part of the broader question of what a hosted network element should look like, how its planes should connect, and how it should be hosted.  It’s hard to take cloud native network transformation discussions seriously when we’re a long way from that answer.

Can We Fit Testing into Zero-Touch Automation?

Test equipment was a mainstay in networking when I first got into it 40 years ago or so, but the role of testing in modern networking has been increasingly hard to define.  Part of that is that a lot of “testing” was never about networks at all, and part is because what a network is has been shifting over time.  There is a role for testing today, but it’s very different from the role during its glory days, and not everyone is going to like it.

I’ve used test equipment extensively for years, and when I reviewed the specific projects I’d used it on, I was surprised to find that 88% of them focused on debugging end-devices or software, 8% on a mixture of those and network issues, and only 4% on purely network issues.  In one memorable case, I used test equipment to analyze the exchanges between IBM mainframes and financial terminals so I could implement IBM’s SNA on a non-IBM minicomputer.  In another I used the equipment to help NASA emulate an obsolete terminal on a PC.  Even in the “pure network issues” situations, what I was doing was looking for the impact of network faults on the protocols being used.

Test equipment, meaning either devices or software probes inserted into protocol streams, has always been more about the protocols than the stream, so to speak.  More than anything else, according to enterprises, the decline in dependence on testing relates to the perceived limited value of protocol analysis.  Protocols are the language of exchange in data systems, and since the data systems themselves increasingly record any unusual conditions, the value of looking at protocols is declining.  Meanwhile, the knowledge required to interpret what you see when you look at a protocol stream on a path is increasing.

Traditional “data line monitoring” testing is particularly challenging given that somebody has to go to a location and stick a device into the connection.  Soft probes, including things like RMON, have grown in favor because a “network” today is a series of nodes and trunks, and testing the data interfaces with a device would mean running between nodes to look at the situation overall.  While you’re in transit, everything could be changing.

Other modern network advances haven’t helped either.  IP networking, and virtual networks, make even soft probes a challenge.  IP traffic by nature is connectionless, which means that data involved in a “session” where persistent information is exchanged may be routed along a variety of paths depending on conditions.  Virtual networking, including all forms of tunneling and overlay encapsulation, can make it difficult to trace where something is supposed to be going, or even to identify exactly what messages in a given place are actually part of what you’re trying to watch.

Back in the early days of NFV, I did a presentation on how test probes could be incorporated into an NFV deployment.  This was an advanced aspect of my work with the NFV ISG on a PoC, and when I dropped out of the activity (which was generating no revenue and taking much of my time) nothing further came of it.  The presentation, now removed, pointed out that in the real world, you needed to integrate testing with management and modeling or you had no way of establishing context and might find it difficult or impossible to even find what you wanted to probe.

The virtual world, particularly where services are composed from hosted functions, physical elements, and trunks, lends itself to probe insertion through deep packet inspection.  My approach was to support the use of service models to define “tap points” and “probes” as a part of a service, to be inserted and activated when needed.  Since I’ve always believed that both “services” and “resources” could be managed, and therefore could be probed and tested, this capability was aimed both at the service layer in the model and the resource layer.  The model provided the correlation between the testing probe and the service and resource management context, which as I’ve said is critical in today’s networking.  Since the DPI probe connected to a specified process (a “management visualizer”), the probe could either drive a GUI for display or drive an AI element to generate events that would then trigger other process actions, as part of service lifecycle management.
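
To give a flavor of what I mean, here’s a hypothetical fragment of a service model with a tap point and probe defined alongside the functional elements; the field names are illustrative, not drawn from any particular modeling standard.

```python
# Illustrative service-model fragment with a modeled-in tap point and probe.
# The probe's output is delivered to a named process (a "management
# visualizer"), which could drive a GUI or an AI element that raises events
# into service lifecycle management.

service_model = {
    "service": "business-vpn",
    "elements": [
        {"name": "access-vnf",  "type": "hosted-function"},
        {"name": "core-tunnel", "type": "trunk"},
    ],
    "tap_points": [
        {
            "name": "access-egress-tap",
            "attach_to": "access-vnf",
            "probe": {"type": "dpi", "filter": "control-and-management"},
            "deliver_to": "management-visualizer",
            "activate": "on-demand",
        }
    ],
}
```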

I also suggested that test data generators be considered part of testing functions, supported by their own modeled-in processes at appropriate points.  The combination of TDGs and a set of probes could define a “testing service” modeled separately and referenced in the service model of the actual service.  That would allow for automatic data generation and testing, which I think is also a requirement for modern testing.

I still believe, five years after that presentation was prepared, that the only way for testing to work in a modern world is to contextualize it with service lifecycle automation and the associated models.  The problem we see in testing today, the reason for the decline in its perceived value, is that it doesn’t work that way, doesn’t include the model-driven mechanisms for introducing context, coupling it to management, and automating the testing process fully.

I also think that testing has to be seen as part of the service automation process I’ve described in earlier blogs.  Testing, AI, and simulation all play a role in commissioning network resources and validating service composition and status.  The craze today is automating everything, and so it shouldn’t be surprising that what’s inherently a manual testing process is hard to fit in.

That’s the point that test equipment advocates need to address first.  Any technology has to fit into the thrust of business process changes or it runs against the buyer’s committed path.  First, we don’t want decision support; we want to be told what to do.  Then, we don’t want to be told what to do; we want it done automatically.  Test equipment used to be protocol visualization, which proved ineffective because too few people could do it.  Then it became protocol analysis, decoding the meaning of sequences.  Now it’s event automation: you identify one of those sequences and respond without human interpretation or intervention.

You can’t do this without context, which is why expert people (who are pretty good at contextualizing out of their own knowledge) were always needed for testing.  Contextualization of testing using models, simulation, and AI provides an automated substitute for that human expertise.  Yes, there will always be some number of things an automated process has to punt to a higher (meaning human) level, but a proper system would learn from that (the machine-learning piece of AI) and do the right thing the next time.

In an implementation sense, this means a kind of layered vision of testing, one that applies AI and human interaction to change the basic model that sets the context overall.  It doesn’t eliminate the older elements of probes and protocols, but it gradually shifts the burden from people to processes.  That’s what the market needs, and what successful testing strategies will eventually provide.

Two Lessons from “Earnings Pairings”

What have we learned so far from earnings season?  It’s always hard to interpret these quarterly events, particularly since they’re at least as much a hype-Wall-Street exercise as a report on the company’s activities.  Still, you can often interpret some interesting things, particularly if you look at multiple players in the same space.

AT&T and Verizon both reported earnings that were a bit of a disappointment to the Street.  Both companies are facing an inevitable plateau in the number of broadband mobile customers, and with it a plateau in their total addressable market (TAM).  What both need is simple: more revenue.  It’s clear to them now that there are no short-term cost management initiatives that they can depend on to improve profits.  In part, that’s because they’ve totally failed in their support of service lifecycle automation technologies, and that’s partly due to the fact that they recognized the need too late.  Two points seem to stand out in their earnings calls: 5G and video.

Both Verizon and AT&T need mobile broadband 5G to succeed quickly and massively.  Both also need the Street to believe it will do so even faster and on a larger scale than is likely.  Verizon used the term over 40 times in its earnings call, and AT&T 26 times.  The difference in the rate the term was used reflects, I think, a difference in how the two operators see 5G opportunity.

Verizon is looking at 5G, at least in part, as a solution to both thin-density home broadband and IoT device networking.  Yes, they want to be competitive in 5G services, but they don’t seem to be thinking that 5G smartphones will create a revenue boom for them.  It’s important that they don’t lose market share in the 5G race, so they’ll push it to stay even in the next market generation.  They are still hoping to see IoT devices connect via cellular service on a large scale; they mention IoT three times in their call and discuss the opportunity at some length.

AT&T seems to be seeing 5G as an opportunity to gain market share, and they’re thinking about it almost exclusively in the context of smartphones.  They never mention IoT on their call, and they dismiss 5G mm-wave hybrids with FTTN as a near-term market opportunity; perhaps three or more years out, according to Stephenson.  Their video position is clearly very defensive at this point; they suffered a massive customer loss on their combination of satellite and streaming video services.

The biggest factor separating the two operators is demand density, the revenue opportunity per square mile.  Verizon’s territory is dense and rich, and so it can earn a higher return on broadband delivery.  That means that Verizon can look at the 5G/FTTN combination and expect to profit from it, where for AT&T the equation is a lot more complicated.  That’s why Stephenson is dismissive of the technology in the near term.  However, AT&T needs to settle on a strategy, having purchased Time Warner.

A second factor is that while neither operator has really managed to gain a lot from virtualization and lifecycle automation, AT&T is still committed to it.  Part of that is because AT&T’s ECOMP is the foundation of the current open-source ONAP project that represents the great (and last, and perhaps futile) hope of operators for a service automation framework.  That, in turn, is likely motivated by the fact that AT&T has deeper financial issues than Verizon.

If 5G is the future for operators, vendors aren’t seeing green yet.  Nokia and Ericsson have both expressed pessimism about the pace of 5G adoption.  Apple says it doesn’t expect 5G handsets till 2020, and there’s a growing sense that 5G (like most everything in tech, let’s face it) has been overhyped.  Operators like AT&T and Verizon are obviously trying to become less dependent on vendors, but without a realistic sense of where they need to go and how they need to get there, it’s not going to be easy for any of them.

Another interesting earnings-call pairing is that of Amazon and Microsoft.  It would be silly for me to compare their retail stuff, so let’s focus instead on their public cloud services.  They’re numbers one and two in the market hit parade for the cloud today, and their fortunes increasingly depend on an important consideration in cloud computing: just who you think is, and will be, buying it.

The brightest spot in Microsoft’s quarter was the cloud, with revenues there up 76%.  What’s most interesting about Microsoft’s growth is that it’s largely due to success with enterprise hybrid cloud customers, rather than social-media or content startups.  The enterprise space is potentially a trillion-dollar market, so it obviously has a lot of upside to be reaped.  Microsoft’s competitors realize that, of course, and they’re targeting the hybrid space aggressively (if still not really effectively).

Amazon’s cloud was also its strongest segment; it contributed most of the company’s operating income.  Sales of cloud services were up 45% and generally consistent with Amazon’s recent cloud growth numbers.  Amazon is still the runaway winner in the cloud services space, a favorite of startups that can drive some big-revenue deals.  However, these big-time Amazon customers are ad-driven, and the total global ad revenue available for all market segments, including the Internet, is far short of that trillion-dollar enterprise TAM.

Amazon, of course, has been working to expand its hybrid credentials, particularly in welding a closer relationship with VMware, which has its own challenges in the hybrid cloud age.  This kind of deal is specialized, of course, but so is Microsoft’s own hybrid position; it links a Microsoft cloud to Microsoft services and premises software.  The big question for Amazon in the hybrid market is whether VMware will/can push hybrid deals from the premises side.  Microsoft, owning both sides, doesn’t have to worry about that.

Microsoft does have to worry about other developments, developments that, while not favoring competitors directly, at least level the playing field a bit.  If hybrid-cloud architecture is the way of the future for applications and platforms, then premises software overall will have to integrate with it.  We already have, in the Kubernetes orchestration ecosystem, a lively community of open-source and commercial players building a hybrid and multi-cloud framework.  That would, as it matures, remove some of the special sauce from the Microsoft story.

The IBM deal for Red Hat may be the defining point in the premises-side-hybrid-cloud story.  To make the acquisition work, IBM has to move Red Hat out ahead of the defining enterprise trends, among which hybrid-cloud surely ranks at or near the top.  Forget IBM’s own cloud aspirations; they only gild the move a bit from IBM’s own revenue and pride perspective.  Red Hat has all the right tools, including OpenShift, which is moving quickly to frame the complete hybrid and multi-cloud technology picture.

A strong and open premises-side framework for hybrid cloud is not an automatic win for Amazon, of course, particularly since Google is trying very hard to escape from the third spot in public cloud.  In the near term in particular, Amazon is going to lag Microsoft in growth because it can’t draw on as large a TAM.  That means Amazon will have to work harder on new web service features, harder on hybrid integration from the cloud side onto the premises.  They’re taking hybrid seriously now, but it’s probably going to take a couple quarters for them to erase Microsoft’s early advantage.

What’s the lesson learned from these two earnings-call pairings?  To me, the big lesson is that the facts in both the markets represented by the pairs have been clear for years.  There was always more to gain from operations automation than from capex savings for network operators, and yet we wasted years focusing on capex with NFV, and we’re still not on track to get opex automation right.  There was never any enterprise cloud market other than hybrid cloud, and it was clear five years ago that the biggest pie in the cloud space was the cloud-native stuff, things that weren’t “moving to the cloud” but rather were waiting for the cloud to arrive.  It has arrived, and now people are getting smart, but they could have been richer now had they been smarter earlier.