On November 18th Cloudflare had an incident that impacted thousands of customers, including customers using our service. Our proxy service is hosted on AWS in an ultra-high-availability architecture and was not affected (nor was it affected by the recent AWS outage). We also designed our system to be resilient against centralized failures and to limit customer impact if those centralized systems do go down.
The incident lasted approximately 5h 34m from first impact to full resolution (we saw recovery begin as early as 3 hours into the incident). You can see our incident timeline here.
We would like to discuss some interesting observations during the outage, as well as highlight some finer points about our architecture that limited the impact on our customers during this outage.
Internal Observations + How We Limited Customer Impact
Since our proxy and internal processing pipeline are hosted in AWS, there was no impact to any of our critical operations. However, our dashboard is hosted on Cloudflare, so it was affected, as were many websites across the internet. Because we use Cloudflare for hosting (Cloudflare Workers, Pages, etc.) and not just proxying, we were not able to simply re-route DNS to work around the outage. On top of that, many upstream services rely on Cloudflare in one way or another: when you need cheap asset distribution through a CDN, you naturally end up looking at Cloudflare.
Upstream Outages
From our vantage point we were able to observe the outage through upstream servers. Our proxy saw a high number of 5XX errors from servers impacted by the Cloudflare outage. We also received alerts about this, and the timing of the increase in errors matches up almost exactly with the start of the Cloudflare outage at 11:48 UTC.

Since our proxy sits behind AWS load balancers and we return the same HTTP response as the upstream script sources, we collect all of these metrics when outages like this happen. This is a benefit of having traffic routed through our system: we observe outages like this immediately and can notify our customers of the impact.
How we kept serving scripts during the outage
We cache requests to identical scripts where the caching policy (Cache-Control) allows, so in this case scripts that were hosted on Cloudflare were still accessible and would remain accessible until the cache was invalidated. This is a benefit of using the cside proxy.
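As an illustration of that policy check, a minimal version might look like the sketch below. This is a hypothetical example, not our production code; the function name and the set of directives checked are ours, chosen to show the idea that the upstream's Cache-Control header decides cacheability.

```rust
// Hypothetical sketch: decide whether a script response may be stored in a
// shared cache, honoring the upstream Cache-Control policy.
fn is_cacheable(cache_control: Option<&str>) -> bool {
    match cache_control {
        Some(value) => {
            let v = value.to_ascii_lowercase();
            // Respect directives that forbid a shared cache from storing
            // the response.
            !v.contains("no-store") && !v.contains("private")
        }
        // No policy header: be conservative and skip the cache.
        None => false,
    }
}

fn main() {
    assert!(is_cacheable(Some("public, max-age=3600")));
    assert!(!is_cacheable(Some("no-store")));
    assert!(!is_cacheable(None));
    println!("cache policy checks passed");
}
```

A real implementation would also parse directives like `max-age` and `s-maxage` to bound how long a cached copy stays valid, which is what eventually invalidates entries like the ones that kept serving during the outage.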
Here is a screenshot of our internal Grafana dashboard showing our script metrics during the outage time period.

During the outage: we maintained a 70.8% cache hit rate, which means many scripts were still being served during the outage that would otherwise have been inaccessible.
Regular baseline: this percentage is close to normal for us. For example, on November 17th the average cache hit rate was 74%, meaning we were still serving cached scripts at roughly our usual rate.
The total number of requests did go down, however.
cside is designed to handle widespread outages
These sorts of widespread outages are unavoidable due to the centralized nature of cloud providers, but we do our best to limit their impact by having multi-region deployments of our proxy and a “fail open” architecture, meaning requests will still go through even if everything else goes down.
It’s also important to point out that our edge services are designed to operate in an “isolated” mode if our centralized pipeline goes down. This means that even if we are unable to communicate with that system, our proxy will still be operational and can still receive and return requests for scripts. So by design, a centralized system going down cannot completely take down all of our edge nodes.
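The decision an edge node makes in isolated mode can be sketched roughly as follows. This is a simplified, hypothetical illustration (the type and function names are ours, and the real edge logic is more involved): the point is that the centralized pipeline being down changes *how* a script is served, never *whether* it is served.

```rust
// Hypothetical sketch of fail-open behavior at an edge node.
enum PipelineStatus {
    Up,
    Down, // centralized pipeline unreachable: enter isolated mode
}

fn serve_script(status: PipelineStatus, cached_copy: Option<&str>) -> String {
    match (status, cached_copy) {
        // Normal path: the centralized pipeline is available.
        (PipelineStatus::Up, _) => "served via pipeline".to_string(),
        // Isolated mode: serve the cached script rather than failing.
        (PipelineStatus::Down, Some(body)) => {
            format!("served from cache ({} bytes)", body.len())
        }
        // Fail open: pass the request straight through to the upstream.
        (PipelineStatus::Down, None) => "proxied directly to upstream".to_string(),
    }
}

fn main() {
    println!("{}", serve_script(PipelineStatus::Down, Some("console.log('hi')")));
    println!("{}", serve_script(PipelineStatus::Down, None));
}
```

Note that no branch returns an error: every combination of pipeline state and cache state still produces a response, which is what “fail open” means in practice.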
You can read a breakdown of how our architecture prevents sites from going down here.
The Cloudflare blog post here goes into a lot of detail as well, which is worth reading through.
Sidenote about error handling:
- The cause of Cloudflare’s outage happened to be related to a particular failure mode of Rust programs using `.unwrap()` calls, which is what caused the 500 errors we saw. We do not use this function at all in our proxy codebase, which is also written in Rust.
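To illustrate the failure mode in general terms (this is not Cloudflare’s actual code, and the function below is invented for the example): calling `.unwrap()` on a `Result` that turns out to be an `Err` aborts the thread with a panic, whereas handling the error explicitly lets a service degrade gracefully.

```rust
// Illustrative only: how .unwrap() converts a recoverable error into a panic.
fn load_feature_file(entries: usize, limit: usize) -> Result<usize, String> {
    if entries > limit {
        Err(format!("feature file too large: {} > {}", entries, limit))
    } else {
        Ok(entries)
    }
}

fn main() {
    // This would panic the whole request handler when the Result is an Err,
    // the pattern behind blanket 500 responses:
    // let n = load_feature_file(300, 200).unwrap();

    // Handling the Err instead keeps the service running:
    match load_feature_file(300, 200) {
        Ok(n) => println!("loaded {} entries", n),
        Err(e) => eprintln!("skipping bad feature file: {}", e),
    }
}
```

Banning `.unwrap()` on fallible paths (e.g. via lints) forces every error case through an explicit branch like the `match` above, which is why we avoid it in our proxy.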
cside is a team of seasoned distributed web engineers: core contributors to browsers like Servo, ex-Cloudflare engineers, and early open source contributors to Tailwind and Bootstrap. We care about the web, and we treat our infrastructure and architecture as a piece of art. We applaud companies like Cloudflare for sharing deep details about incidents; we have learned from write-ups like theirs throughout our careers to prevent such failures wherever possible.