Canva Outage: What Went Wrong and What Was Learned
In November 2024, Canva experienced a critical system outage that made canva.com unavailable. The engineering team recently published a comprehensive post-mortem of the incident, with Brendan Humphreys, Canva’s CTO, providing a detailed account of the causes and the lessons learned.
The Outage Timeline
The outage began at 9:08 AM UTC on November 12, 2024, and lasted until approximately 10:00 AM UTC. During this time, canva.com was unavailable to users. Humphreys explained that the API Gateway cluster failed due to several factors:
Our API Gateway cluster failing due to multiple factors, including a software deployment of Canva’s editor, a locking issue, and network issues in Cloudflare, our CDN provider.
The Role of Canva’s Editor
Canva’s editor is a single-page application that is deployed multiple times a day, with client devices fetching new assets through Cloudflare using a tiered caching system. During this deployment, a routing issue within the CDN disrupted traffic between two regions and stalled the asset fetches, so pending requests accumulated instead of completing. When the assets finally became available on the CDN, all of the waiting clients began downloading them simultaneously, releasing a sudden surge of more than 270,000 pending requests at the same time.
Normally, an increase in errors would cause our canary system to abort a deployment. However, in this case, no errors were recorded because requests didn’t complete. As a result, over 270,000 user requests for the JavaScript file waited on the same cache stream.
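The post does not include code, but the “waiting on the same cache stream” behaviour is essentially request coalescing: concurrent requests for the same uncached object share a single origin fetch. The sketch below (hypothetical names, single-process Python, not Cloudflare’s implementation) shows how one slow fetch can leave every caller blocked and then release them all at once.

```python
import threading
import time

# Hypothetical single-process sketch of request coalescing: concurrent
# requests for the same uncached object share one origin fetch, so a slow
# fetch leaves every caller waiting on the same in-flight response.
class CoalescingCache:
    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._lock = threading.Lock()
        self._cache = {}       # key -> cached value
        self._in_flight = {}   # key -> threading.Event for the ongoing fetch

    def get(self, key):
        with self._lock:
            if key in self._cache:
                return self._cache[key]
            event = self._in_flight.get(key)
            if event is None:
                # First caller starts the origin fetch; everyone else waits on it.
                event = threading.Event()
                self._in_flight[key] = event
                leader = True
            else:
                leader = False
        if leader:
            value = self._fetch(key)          # slow if the CDN route is degraded
            with self._lock:
                self._cache[key] = value
                del self._in_flight[key]
            event.set()                       # releases every waiting caller at once
            return value
        event.wait()                          # callers pile up here while the fetch stalls
        with self._lock:
            return self._cache[key]

def slow_origin(key):
    time.sleep(5)                             # simulate a degraded CDN route
    return f"contents of {key}"

cache = CoalescingCache(slow_origin)
threads = [threading.Thread(target=cache.get, args=("editor.js",)) for _ in range(1000)]
for t in threads:
    t.start()
for t in threads:
    t.join()                                  # all 1,000 callers complete at roughly the same instant
```

The release is the dangerous moment: when the stalled fetch finally completes, every waiting client is unblocked simultaneously, which is exactly the synchronized download the post-mortem describes.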
The Impact on the API Gateway
Once the file was delivered, the editor’s new object panel loaded simultaneously across all the waiting devices, resulting in over 1.5 million requests per second to the API Gateway, a surge approximately three times the typical peak load. This overwhelming wave caused the load balancer to transform into an “overload balancer,” turning healthy nodes into unhealthy ones.
This is a classic example of a positive feedback loop: the more tasks that became unhealthy, the more traffic the remaining healthy nodes received, and the more likely those nodes were to become unhealthy as well.
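The figures below are illustrative rather than Canva’s, but the loop is easy to see in a small simulation: with a fixed offered load, every node that is marked unhealthy and removed from rotation increases the share of traffic landing on the survivors.

```python
# Illustrative simulation (made-up numbers, not Canva's): a fixed offered load
# is spread over whatever nodes remain healthy. Each node that tips over its
# capacity and is removed concentrates more traffic on the survivors.
TOTAL_RPS = 1_500_000          # incoming requests per second
NODE_CAPACITY = 12_000         # requests per second one node can absorb
healthy_nodes = 100

step = 0
while healthy_nodes > 0:
    per_node_rps = TOTAL_RPS / healthy_nodes
    print(f"step {step}: {healthy_nodes} healthy nodes, {per_node_rps:,.0f} rps each")
    if per_node_rps <= NODE_CAPACITY:
        print("load is below per-node capacity; the fleet stabilises")
        break
    # Overloaded nodes fail health checks and are taken out of rotation,
    # pushing the same traffic onto fewer nodes: the feedback loop.
    healthy_nodes -= max(1, healthy_nodes // 10)
    step += 1
else:
    print("no healthy nodes remain: the cluster has collapsed")
```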
Resilience and Failure
As autoscaling failed to keep pace, API Gateway tasks began failing due to memory exhaustion, ultimately leading to a complete collapse. Lorin Hochstein, a staff software engineer at Airbnb and author of the Surfing Complexity blog, described the outage as a tale of saturation and resilience. Hochstein highlighted the critical role of incident responders in adapting system behavior:
It was up to the incident responders to adapt the behavior of the system, to change the way it functioned in order to get it back to a healthy state. (…) This is a classic example of resilience, of acting to reconfigure the behavior of your system when it enters a state that it wasn’t originally designed to handle.
Containment Measures
To address the issue, Canva’s team attempted to manually increase capacity while simultaneously reducing the load on the nodes, achieving mixed results. The situation was finally mitigated when traffic was entirely blocked at the CDN layer. Humphreys detailed:
At 9:29 AM UTC, we added a temporary Cloudflare firewall rule to block all traffic at the CDN. This prevented any traffic reaching the API Gateway, allowing new tasks to start up without being overwhelmed with incoming requests. We later redirected canva.com to our status page to make it clear to users that we were experiencing an incident.
The Canva engineers gradually ramped up traffic, fully restoring it in approximately 20 minutes.
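The post-mortem does not describe the exact mechanism used to re-admit traffic. One common way to implement such a staged ramp, sketched here with hypothetical names, is percentage-based admission keyed on a stable request attribute, so each client receives a consistent decision as the percentage rises.

```python
import hashlib

# Hypothetical sketch of a staged traffic ramp: admit a growing percentage of
# requests, keyed on a stable request attribute so a given client gets a
# consistent allow/deny decision throughout the ramp.
RAMP_STEPS = [1, 5, 25, 50, 100]   # percentage of traffic admitted at each step

def admitted(client_id: str, percentage: int) -> bool:
    # Hash the client id into a stable bucket in the range 0-99.
    bucket = int(hashlib.sha256(client_id.encode()).hexdigest(), 16) % 100
    return bucket < percentage

def handle_request(client_id: str, percentage: int) -> str:
    if not admitted(client_id, percentage):
        return "503: temporarily rejected while the service recovers"
    return "forwarded to the API Gateway"

# During recovery an operator (or a runbook script) walks up the steps,
# checking gateway health before each increase.
for percentage in RAMP_STEPS:
    print(f"admitting {percentage}% of traffic:", handle_request("client-42", percentage))
```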
Lessons Learned
To minimize the likelihood of similar incidents in the future, the team focused on improvements to its incident response process, including runbooks for blocking and restoring traffic, as well as on increasing the resilience of the API Gateway.
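The post does not spell out which resilience changes were made to the API Gateway. One widely used measure for this failure mode is load shedding: rejecting excess requests quickly instead of queuing them until memory runs out. A minimal sketch, with hypothetical names and limits:

```python
import threading

# Minimal load-shedding sketch (hypothetical, not Canva's implementation):
# cap the number of requests handled concurrently and reject the rest fast,
# so a traffic spike degrades service instead of exhausting memory.
MAX_CONCURRENT_REQUESTS = 500
_slots = threading.BoundedSemaphore(MAX_CONCURRENT_REQUESTS)

def handle_with_shedding(process_request, request):
    # Non-blocking acquire: if all slots are busy, shed the request immediately.
    if not _slots.acquire(blocking=False):
        return {"status": 503, "body": "server overloaded, retry later"}
    try:
        return {"status": 200, "body": process_request(request)}
    finally:
        _slots.release()

# Example usage with a trivial handler.
print(handle_with_shedding(lambda req: f"handled {req}", "GET /design/123"))
```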
Expert Insights
John Nagle, a commentator in a popular Hacker News thread, drew an analogy to electric utilities:
This problem is similar to what electric utilities call “load takeup”. After a power outage, when power is turned back on, there are many loads that draw more power at startup. (…) Bringing up a power grid is thus done by sections, not all at once.
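The same sectional approach maps directly onto traffic restoration. As a hypothetical illustration with made-up numbers, re-admission can proceed one segment of clients at a time, checking projected load against capacity before enabling the next segment:

```python
# Hypothetical sketch of "bringing load up by sections": re-enable one group of
# clients at a time and confirm there is headroom before adding the next group.
SECTIONS = ["region-a", "region-b", "region-c", "region-d"]
CAPACITY_RPS = 600_000
EXPECTED_STARTUP_RPS = {          # made-up startup surge per section
    "region-a": 180_000,
    "region-b": 220_000,
    "region-c": 150_000,
    "region-d": 200_000,
}

current_rps = 0
for section in SECTIONS:
    projected = current_rps + EXPECTED_STARTUP_RPS[section]
    if projected > CAPACITY_RPS:
        # Hold this section until the earlier sections' startup surge subsides.
        print(f"holding {section}: projected {projected:,} rps exceeds capacity")
        continue
    current_rps = projected
    print(f"re-enabled {section}: now {current_rps:,} rps")
```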
The Future of System Resilience
Humphreys summarized the incident, stating:
The full picture took some time to assemble, in coordination with our very capable and helpful partners at Cloudflare (…) a riveting tale involving lost packets, cache dynamics, traffic spikes, thread contention, and task headroom.
The incident serves as a stark reminder of the complexities involved in maintaining high availability in large-scale web applications and the importance of robust incident response processes.