Cloudflare Outage Takes Down Major Sites Including ChatGPT; Database Query Flaw Blamed, Not AI or a Cyberattack
Cloudflare has explained the widespread outage that disrupted services across the internet on Tuesday, including major platforms such as X and ChatGPT and the outage tracker Downdetector. The incident lasted several hours, affected a significant portion of the web, and resembled recent outages at Microsoft Azure and Amazon Web Services. Cloudflare, which handles around 20% of global web traffic, is designed to distribute load and protect websites from traffic spikes and DDoS attacks. This time, however, a flaw in its own internal systems led to a cascading failure.
The root cause was not a cyberattack, malicious activity, or a DNS problem, as initially suspected. Instead, Cloudflare traced the issue to a change in the permissions system of a database that powers its bot management tools. The permissions change altered the behavior of a ClickHouse query that generates a frequently updated configuration file used by the machine learning model behind Bot Management. The query began producing a large number of duplicate "feature" rows, rapidly inflating the size of the file. Once the file grew beyond preset memory limits, it overwhelmed the core proxy system that processes traffic for Cloudflare’s customers, particularly those relying on the bot management module.
The failure also produced false positives in bot detection. Websites using Cloudflare’s rules to block certain bots began flagging legitimate user traffic as malicious, effectively cutting off real users. Customers who did not use the bot score in their configurations remained online, which shows how narrowly the failure was tied to that module.
Cloudflare’s bot controls are designed to combat web crawlers, especially those that scrape data for training generative AI models. The company recently introduced "AI Labyrinth," a defense that uses AI-generated content to slow down and confuse bots that ignore "no crawl" directives. Tuesday’s outage, however, was unrelated to this AI-powered feature. In a detailed post, Cloudflare CEO Matthew Prince emphasized that the issue was purely technical, stemming from a flawed database query, and not a security breach or external attack. The company has since restored services and says it is working to prevent similar incidents by improving monitoring and validation of configuration updates.
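To make the failure mode concrete, the sketch below shows one way a generated feature file could be checked before it is distributed: reject duplicate rows, enforce a size limit, and fall back to the last known-good version when a check fails. It is an illustration only, not Cloudflare’s code. The names FeatureRow, validate_feature_rows, and publish_or_keep_previous, and the limit of 200, are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical limit, chosen for illustration only.
MAX_FEATURES = 200


@dataclass(frozen=True)
class FeatureRow:
    """One row of the generated feature file (hypothetical shape)."""
    name: str
    dtype: str


class ConfigValidationError(ValueError):
    """Raised when a generated feature file fails pre-publish checks."""


def validate_feature_rows(rows, max_features=MAX_FEATURES):
    """Return the rows as a list if they pass all checks, else raise."""
    seen = set()
    duplicates = set()
    for row in rows:
        if row.name in seen:
            duplicates.add(row.name)
        seen.add(row.name)
    if duplicates:
        raise ConfigValidationError(
            f"duplicate feature rows: {sorted(duplicates)}"
        )
    if len(rows) > max_features:
        raise ConfigValidationError(
            f"{len(rows)} features exceeds limit of {max_features}"
        )
    return list(rows)


def publish_or_keep_previous(new_rows, previous_rows, max_features=MAX_FEATURES):
    """Use the new file if it validates; otherwise keep the previous one."""
    try:
        return validate_feature_rows(new_rows, max_features)
    except ConfigValidationError:
        # A bad file never reaches consumers; the old one stays in place.
        return list(previous_rows)
```

In this sketch a file containing duplicated rows is rejected at generation time and the earlier file stays in service, so the consumer never sees an oversized input. Whether a real pipeline should fall back silently or raise an alert as well is a design choice the sketch leaves open.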
