The Real Lessons from Fastly

The recent Fastly problem generated a raft of stories featuring pundits claiming this or that lesson. It was interesting to me that none of the stories addressed what I think is the key point in the whole mess, which is that the Internet isn’t what we think it is. A corollary is that the cloud isn’t what we think it is, either, so maybe we need to look at what the two really are.

There’s a lot in a name, often a lot of misconceptions. We talk about “the Internet” as though we’re talking about a single network, and the Internet isn’t that. It’s actually a kind of multidimensional federation. The connectivity is created by the interconnection of independent Internet Service Providers (ISPs) and what we perceive as the service of the Internet is created through multiple technologies.

Internet connections are made through ISPs, and these ISPs have “peering agreements” that define how they interconnect. ISPs, or at least the major ones, also connect through national exchange points that provide connectivity where no private peering agreements exist. Each ISP has its own peering pathway, and so some aspects of your Internet experience vary depending on the ISP or ISPs you use.

The primary service of the Internet, from an objective technology perspective, is connectivity. The primary service of the Internet, from the perspective of its users, is experience delivery. In fact, if we judge Internet use by traffic, the primary service is video streaming, and how video streaming actually works is very different from how people think it works.

When you access an Internet element, a provider of an experience, you do it via a URL, which is a symbolic reference to something that, to be used, has to be translated to an Internet address. That translation is normally done using the Domain Name System (DNS). If URLs were always translated to the web servers of the provider of a web page or a video, the traffic associated with the experience would have to transit all the ISPs and peering points between user and experience. The result would be significant latency and congestion potential, and the core of the Internet would be jammed with video traffic.

ISPs follow what’s called a “bill and keep” practice; everyone charges their own users and keeps all the money. ISPs whose users generated a lot of demand for video would load down the ISPs that hosted the video sources, and the traffic wouldn’t be paid for. Some ISPs might refuse to peer with video-greedy ISPs. To address both the delay/congestion issue and the billing issue, Internet content that’s widely used and generates a lot of traffic can be moved to a Content Delivery Network, or CDN.

A CDN is a community of caches, which are servers and storage where content can be staged. When CDN content is requested, instead of decoding the URL to the address of the actual owner of the content, the URL is decoded to the address of the best-positioned cache point for the content, and that cache point delivers it. In most cases, the cache points are directly connected to the “access ISPs” that provide consumer Internet service, and so the majority of the Internet is bypassed completely. Content providers pay for this service, and it improves the quality of experience of their users.
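To make the decoding step concrete, here is a minimal sketch in Python of a resolver that returns a cache point's address when one is attached to the user's access ISP, and the content owner's address otherwise. Everything in it is an assumption for illustration: the hostname, the addresses, the ISP names, and the "same access ISP" selection rule are hypothetical, and real CDNs use far more elaborate ways of picking the best-positioned cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachePoint:
    address: str          # hypothetical address of the cache point
    access_isp: str       # access ISP the cache is directly connected to
    healthy: bool = True  # whether the cache can currently serve content


# Hypothetical origin (content owner) address for each hostname.
ORIGINS = {"video.example.com": "203.0.113.10"}

# Hypothetical cache points staged for each hostname.
CACHES = {
    "video.example.com": [
        CachePoint("198.51.100.5", "isp-a"),
        CachePoint("198.51.100.9", "isp-b"),
    ],
}


def resolve(hostname: str, user_isp: str) -> str:
    """Return the address a user should fetch the content from."""
    for cache in CACHES.get(hostname, []):
        # Prefer a working cache attached to the user's own access ISP,
        # so the traffic never crosses the rest of the Internet.
        if cache.healthy and cache.access_isp == user_isp:
            return cache.address
    # No suitable cache: fall back to the content owner's own servers.
    return ORIGINS[hostname]


if __name__ == "__main__":
    print(resolve("video.example.com", "isp-a"))  # served from a cache
    print(resolve("video.example.com", "isp-c"))  # served from the origin
```

In this sketch the user on "isp-a" is pointed at a cache inside their own access network, while the user on "isp-c", which has no cache, is sent to the origin and the traffic has to cross whatever ISPs and peering points lie in between.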

Fastly isn’t an ISP; it’s a CDN provider. Its problem didn’t take down the Internet, as has been widely stated, or even a portion of it. What it took down was the content that Fastly was caching, including the specific websites that were noted in various articles. Do CDN providers’ failures put their customers at risk? Sure, obviously, just as any provider of hosting will put those who use the hosting capability at risk should they fail…just like the cloud.

CDN applications and cloud applications have a lot of similarity, and the relationship between cloud providers and the Internet is the same as the relationship between the Internet and CDN providers, like Fastly. If Amazon’s cloud goes down, the applications that are hosted in it are lost to the applications’ owners and users.

The core of the problem with Fastly is that the CDN service it provides is part of the Internet experience even though it’s not strictly part of the Internet. That opens up an interesting and problematic point, which is that if public cloud providers absorb some of the features of another service, as Fastly has absorbed some features of the Internet experience, then a cloud failure would create a failure of the service it supported. For example, if cloud providers hosted 5G O-RAN and the cloud failed, then 5G O-RAN would fail for the customers of the network operator who used the cloud.

Most network operators are obsessed with reliability, the proverbial five-nines mindset. The cloud, the Internet, and the CDN communities are much less so, and part of the reason is that network services have been traditionally subject to regulation, where the other services have not. Another part is that service experiences created by a community of providers are usually subject to failure if any of those providers fail. The reliability of a “series connection”, one that depends on all elements functioning, can never be higher than the reliability of its least reliable element, and it’s lower whenever more than one element can fail.
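The series-connection point can be put into numbers with a small Python sketch. It assumes the elements fail independently, in which case the availability of the whole chain is the product of the individual availabilities. The three availability figures below are made up for illustration and do not describe any real provider.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60


def series_availability(availabilities):
    """Availability of a chain that works only if every element works.

    Assumes the elements fail independently of one another.
    """
    return math.prod(availabilities)


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical chain: access network, CDN, cloud-hosted application.
    chain = [0.9999, 0.999, 0.9995]
    combined = series_availability(chain)

    print(f"weakest element: {min(chain):.5f}")
    print(f"whole chain:     {combined:.5f}")
    print(f"chain downtime:  {downtime_minutes_per_year(combined):.0f} min/year")
    print(f"five nines:      {downtime_minutes_per_year(0.99999):.2f} min/year")
```

Under these assumed figures, the chain ends up less available than its weakest element, and its expected downtime is measured in hours per year rather than the roughly five minutes per year that five nines allows.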

Do network operators have it right, thinking that cloud and Internet technology isn’t the answer as a foundation for telecommunications services? That depends on what you’re willing to trade for those five nines, and whether an NFV-and-operator-centric vision of hosting would offer any better level of reliability. Frankly, I don’t think it would.

The real lesson of Fastly is that we need to face the fact that the Internet and cloud computing create much more complex foundations for the experiences we depend on for work and life. Complicated things break more easily, and so we can expect that one price of our Internet and cloud riches is a growing risk that what we need is going to break.

Facing doesn’t mean accepting, though. One truth about cloud computing is that it’s been working to deliver better availability even through a higher level of application or service complexity. What we call “cloud-native” design is a path to that, but the reality of cloud-native is that almost everything that claims to be so really isn’t at all. There’s way more hype than reality. My biggest concern about things like cloud hosting of 5G and O-RAN, and about edge computing, is that we’ll accept the marketectural view of cloud-native rather than the architectural view. If we do, then we’re going to see more of these Fastly-like incidents.