Distributed Service Elements and Risk Management

In our search for edge computing justifications, are we pushing the complexity of cloud and Internet hosting too far? Some recent developments suggest that we may be opening a new set of dependencies for applications, creating an availability challenge that cries out for an architectural solution to complex component relationships. It’s a solution we don’t seem to have, one most may not even care about, but one we need badly.

Web pages have gotten more interactive and more complicated over time, and that has created a number of challenges, the primary one being performance. The more visual items there are to be served, the more time it takes to assemble them all and deliver them for viewing. Almost a decade ago, startups were already emerging to build intermediate caching technology, assembling a page at the server end (server-side rendering, or SSR) or somewhere in the middle, and delivering the completely assembled page to the client. This process was called “pre-rendering.” Interactivity, of course, made that more difficult to do.
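
As a rough illustration of pre-rendering, here’s a minimal TypeScript/Node sketch: the page is assembled once, and the cached HTML is served to every client. The renderPage() helper and its content are hypothetical stand-ins for a real rendering pipeline.

```typescript
// Minimal pre-rendering sketch: assemble the full page once, then serve
// the cached HTML to every client instead of rendering per request.
import { createServer } from "node:http";

// Hypothetical stand-in for a real template/component rendering pipeline.
function renderPage(): string {
  return `<html><body><h1>Catalog</h1><p>Rendered at ${new Date().toISOString()}</p></body></html>`;
}

const cachedHtml = renderPage(); // "pre-rendered" once, ahead of any request

createServer((_req, res) => {
  res.writeHead(200, { "Content-Type": "text/html" });
  res.end(cachedHtml); // every client gets the fully assembled page
}).listen(3000);
```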

One of the impacts of the web-page-complexity issue was the creation of CDNs, and the use of CDN storage for more than video. Caching routine image elements, for example, could prevent the delays associated with pulling them from a company website. Your web page could then be viewed as two sections: a variable part that the company served from its data center because it contained variable data, and a set of static elements, images included, cached closer to the user. This approach contributed to the impact of Fastly’s outage; anything that needed a cached element was now going to fail.
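
A hedged sketch of that two-section split, expressed as the HTTP Cache-Control headers an origin might set: static elements are marked cacheable so a CDN can hold them at the edge, while the variable part stays uncacheable. The paths and lifetimes here are assumptions, not anyone’s actual configuration.

```typescript
// Illustrative origin server: static image elements get a long CDN cache
// lifetime; the variable portion of the page is marked uncacheable.
import { createServer } from "node:http";

createServer((req, res) => {
  if (req.url?.startsWith("/static/")) {
    // A CDN edge can serve this for a day without contacting the origin.
    res.writeHead(200, {
      "Content-Type": "image/png",
      "Cache-Control": "public, max-age=86400",
    });
    res.end(); // image bytes would be streamed here
  } else {
    // Variable data: always served fresh from the company's data center.
    res.writeHead(200, {
      "Content-Type": "text/html",
      "Cache-Control": "no-store",
    });
    res.end("<html><body>current account data here</body></html>");
  }
}).listen(8080);
```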

There’s still plenty of impetus for improving the visual experience of web pages, though. Back in 2016, we saw the development of Jamstack, a model of web interaction that replaces the old web server, app server, and databases with a CDN and microservices. It’s a successor to the pre-rendering approach that combines its static-piece support, drawn from CDNs, with APIs, JavaScript, and microservices that could, for example, be hosted in edge computing. One article on Jamstack talks about the possibilities of “a great unbundling or decoupling” that next-generation web development could bring, and that statement is both exciting and problematic.
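
A minimal sketch of the Jamstack pattern as described: the page itself is a static file a CDN can serve, and dynamic data arrives through client-side JavaScript calling a microservice API. The /api/prices endpoint and the element ID are hypothetical.

```typescript
// Runs in the browser from a CDN-served static page; all dynamic content
// comes from a microservice API. The endpoint name is hypothetical.
async function hydratePrices(): Promise<void> {
  const res = await fetch("/api/prices"); // microservice behind an API
  if (!res.ok) throw new Error(`price API failed: ${res.status}`);
  const prices: Record<string, number> = await res.json();
  const target = document.getElementById("prices");
  if (target) target.textContent = JSON.stringify(prices);
}
hydratePrices().catch(console.error);
```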

Yes, it’s exciting that things like Jamstack could enhance the Internet experience. Jamstack and edge computing could turn web pages into distributed applications that could in theory span a range of resources and a range of providers. There is absolutely no question that this sort of thing can enhance web sites and improve QoE for their users, and it already has. It also both demonstrates and multiplies risks.

We tend to think of distributed models of experiences as being resilient and scalable, and that’s true in some cases. However, if an experience must draw from, let’s say, three distinct resource sets to be complete, then a failure of any of them causes it to fail, and that’s where the “problematic” side comes in. If we say that each of these resource sets has the same reliability R (the probability of being up and running), then the chances of all three being up is R³. Since R is always less than one (let’s make it 0.95 for this example), cubing it gives a smaller number, in this case about 0.857. Were we to add three more composing resources, for six in all, the chances of them all working would be 0.95⁶, or about 0.735, less than 3 out of 4.
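
The arithmetic is easy to verify; here’s a minimal sketch (TypeScript, purely illustrative) of the all-of-these case:

```typescript
// "All of these" dependency: the experience works only if every resource
// set works, so overall availability is R^n for n serial dependencies.
const seriesAvailability = (r: number, n: number): number => Math.pow(r, n);

console.log(seriesAvailability(0.95, 3).toFixed(3)); // 0.857 (three dependencies)
console.log(seriesAvailability(0.95, 6).toFixed(3)); // 0.735 (six, under 3 out of 4)
```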

There’s a lesson here in distributed applications and distributed web processing alike. If you create something that depends on the availability of multiple things, then you have to ensure that each of those things has a proportionally higher level of availability, or you risk creating something that has an unacceptably high risk of failing completely.
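
To put a number on “proportionally higher”: inverting the same formula shows what each component must individually achieve to hit an overall target. A sketch, using the same illustrative figures:

```typescript
// If n serial components must jointly deliver a target availability T,
// each one must individually achieve T^(1/n).
const requiredPerComponent = (target: number, n: number): number =>
  Math.pow(target, 1 / n);

console.log(requiredPerComponent(0.95, 3).toFixed(4)); // 0.9830 per component
console.log(requiredPerComponent(0.95, 6).toFixed(4)); // 0.9915 per component
```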

Creating higher availability means shifting our dependent relationships from “all of these” to “any of these.” If we take the same three resource sets, but say that any of the three can fulfill the experience on its own, then for the experience to fail, we’d need all three of them to fail. If our original probability of working was 0.95, then our chance of failing was 0.05. The chance of all three failing is that number cubed, 0.000125, or 1.25 in ten thousand; pretty good odds. This is why cloud computing has higher availability: you can fail over to another hosting point.
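
And the any-of-these (redundant) case, in the same sketch form:

```typescript
// "Any of these" redundancy: the experience fails only if every replica
// fails, so overall availability is 1 - (1 - R)^n.
const parallelAvailability = (r: number, n: number): number =>
  1 - Math.pow(1 - r, n);

console.log(parallelAvailability(0.95, 3)); // 0.999875 (failure odds 1.25 in 10,000)
```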

Complex applications, and complex web pages, can build dependencies and complexities quickly and without any fanfare. In fact, if we think of complex structures that are composed from other complex structures, we might not even see the risks that we’re building in by including some seemingly simple component or element. When we partner with another provider of features, we inherit their risks. How often do you suppose we actually know what that failure risk is?

We’ve seen symptoms of this problem already in past reports of outages. Most of our recent problems in network services have come from something seemingly unrelated to the core logic of the services: a small software bug introduced in some backwater element that then spread to have a broader impact. If we can’t deal with this sort of thing better, we risk a steady decline in service and application availability.

There’s another problem that’s harder to quantify but perhaps scarier to contemplate. Having a widely distributed and complex application means having a larger attack surface. Every API, and every hosting point and its administration, poses a risk. This risk accumulates as the numbers grow, and as before, when we partner to draw something from another party, we inherit the security problems there. We probably know even less about the security risk than about the availability risk.

What this means for the future, particularly the future of the cloud, edge computing, and applications based on either of the two, is that we need to be very aware of the availability and security metrics associated with all the things we’re going to compose together to create an experience.

Composing experiences or applications from multiple components is a fundamental part of exploiting the features and benefits of the cloud. Reducing coding by sharing components across applications or services is a fundamental part of a modern response to managing development costs and reducing the time it takes to create or change applications. We can’t lose these benefits, so we have to think harder about risk management. At the application level, we have some experience with this in the web services offered by cloud providers. At the service level, including the service of the Internet, we’re still a bit in the wild west.

At the application level, there’s some awareness of the issues of component sharing, but my limited input from enterprises suggests that this awareness tends to be confined to what we might call “data center development”, meaning the core applications and not the front-end pieces that are increasingly cloud-hosted. In the cloud, I hear about project failures created in part by excessive componentization and by a failure to assess the availability, performance, and security impacts of components. Cloud provider web services sometimes pose a problem, but usually it’s because the developers haven’t taken the impact of using them into account.

At the service level, we face three issues. First, service providers are less likely to have a cadre of experienced cloud developers on hand, and so are more likely to create problems through lack of experience. Second, telco standards tend to reflect a box-network bias, meaning that services are created through the cooperation of “functional monoliths” rather than elastic, agile, individual features. Finally, a lot of decisions about how to create service experiences are made by people focused on the experience rather than on the framework that builds it, which makes risk assessment more difficult.

A good edge computing strategy should address these risks, particularly if we believe (as most do) that 5G service hosting will be the early driver of edge deployment. Recognition of that truth seems limited; a recent article in Protocol on Jamstack doesn’t even mention the availability or security risk potential. We should look closely at all the emerging edge options, measure how they could introduce availability and security risks, and then work out how to communicate and mitigate those risks, so we don’t end up with wider-scope, higher-impact problems down the road.