This Tuesday, a “blackout” at Cloudflare brought down multiple online platforms, including the social network X, OpenAI’s ChatGPT, games such as League of Legends, and various national and international social media sites. Having resolved the problem, the company has now published a detailed explanation of what caused it.
As Matthew Prince, CEO of Cloudflare, highlights, “the problem was not caused, directly or indirectly, by a cyberattack or malicious activity of any kind”.
In a post on the American technology company’s official blog, the executive, who describes the disruption as the “worst blackout since 2019”, explains that it was caused by a change to one of the company’s database systems.
The change ended up generating several entries in a file used by the company’s bot management system. The file, which doubled in size, was sent to all machines that make up the Cloudflare network.
Matthew Prince explains that the software these machines use to manage network traffic reads the file in question to keep the bot management system up to date. However, because the software has a limit on the size of the files it reads, the unexpectedly large file caused it to fail, with the effects first appearing at 11:20 UTC.
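To illustrate the kind of failure described above, here is a minimal sketch of a loader that enforces a size limit on a configuration file. It is not Cloudflare’s code: the names `parse_feature_file` and `BotConfig`, the limit of 200 entries, and the one-entry-per-line format are all assumptions made for the example. The sketch shows one defensive option, in which an oversized file is rejected and the last valid version stays in use.

```python
from dataclasses import dataclass, field

# Hypothetical limit: the article does not state the real value.
MAX_ENTRIES = 200


class FeatureFileTooLarge(Exception):
    """Raised when a candidate file exceeds the configured entry limit."""


def parse_feature_file(text: str, max_entries: int = MAX_ENTRIES) -> list[str]:
    """Parse one entry per non-empty line, enforcing the size limit."""
    entries = [line.strip() for line in text.splitlines() if line.strip()]
    if len(entries) > max_entries:
        raise FeatureFileTooLarge(
            f"{len(entries)} entries exceeds limit of {max_entries}"
        )
    return entries


@dataclass
class BotConfig:
    """Holds the entries from the last file that parsed successfully."""

    entries: list[str] = field(default_factory=list)

    def update(self, text: str) -> bool:
        """Try to load a new file; keep the previous entries on failure."""
        try:
            self.entries = parse_feature_file(text)
            return True
        except FeatureFileTooLarge:
            # Reject the oversized file instead of failing outright.
            return False


config = BotConfig()
normal_file = "\n".join(f"feature_{i}" for i in range(150))
doubled_file = normal_file + "\n" + normal_file  # the file doubles in size

print(config.update(normal_file))   # accepted
print(config.update(doubled_file))  # rejected, previous entries kept
```

In this sketch the oversized file is simply refused, so traffic handling would continue with the older data while an alert is raised.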
According to the executive, the company initially suspected that the observed “symptoms” had been caused by a large-scale DDoS attack. Later, it correctly identified the underlying fault and replaced the oversized file with a previous version.
After a fix was implemented, essential traffic began flowing again at 2:30 pm UTC, with teams working to mitigate the consequences of the failure. “At 5:06 pm, all Cloudflare systems were operating normally”, indicates the company’s CEO, who regrets the impact caused not only to its customers, but also to the Internet in general.
“Given Cloudflare’s importance in the Internet ecosystem, any disruption to our systems is unacceptable”, says Matthew Prince. The executive adds that the company’s engineering teams have already started working to make its systems more robust.
