Cloudflare has announced a new resilience plan called Fail Small, following several global outages in quick succession. The incidents were not caused by external attacks but by errors within its own infrastructure and processes.
Cloudflare acknowledges that configuration changes rolled out globally in one go had too broad an impact, allowing relatively minor errors to escalate into large-scale outages.
These outages come at a time when pressure on internet infrastructure is increasing. The Cloudflare Radar Year in Review 2025 shows that global internet traffic grew by about 20 percent last year. That growth is no longer driven by end users and streaming services alone, but increasingly by automated traffic: bots and AI-related crawlers generate consistently high volumes and unpredictable peaks, which structurally increases the load on networks.
Against this backdrop, the recent disruptions were particularly damaging. According to Computing, the incidents in November and December had different immediate causes but shared the same underlying factor: a configuration change rolled out globally shortly before each outage. The publication notes that this exposed a structural gap between the way Cloudflare manages software updates and the way configuration changes have been implemented to date.
The outages made it clear that Cloudflare’s network was not sufficiently equipped to keep errors local. Instead of limited disruptions, large parts of the platform were affected, impacting customers and end users worldwide. Precisely because Cloudflare is deeply intertwined with DNS, content distribution, and security services, an internal error had an immediate impact on large parts of the internet.
Limiting the impact after failure
With the Fail Small plan, Cloudflare wants to address this vulnerability structurally. The starting point is that failure is inevitable, so systems must be designed to keep its impact limited. Changes must be rolled out in a controlled, phased manner so that errors are detected early and can be reversed automatically before they spread across the entire network. Cloudflare explicitly positions this as a review of design choices and operational processes, not as a one-off technical intervention.
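To make the idea concrete, the sketch below shows in broad strokes how a staged rollout with automatic rollback can work. The stage names, error threshold, and helper functions are illustrative assumptions; Cloudflare has not published implementation details at this level.

```python
# Illustrative sketch of a staged configuration rollout with automatic rollback.
# Stage names, thresholds, and helper functions are hypothetical and do not
# reflect Cloudflare's actual tooling.

import time

STAGES = ["canary", "single-region", "multi-region", "global"]
ERROR_BUDGET = 0.001  # maximum tolerated error rate per stage (assumption)


def error_rate(stage: str) -> float:
    """Placeholder for real telemetry: fraction of failing requests in a stage."""
    return 0.0


def apply_config(stage: str, config: dict) -> None:
    """Placeholder for pushing a configuration change to one stage."""
    print(f"applying config to {stage}: {config}")


def rollback(config: dict) -> None:
    """Placeholder for reverting the change everywhere it was applied."""
    print(f"rolling back {config}")


def staged_rollout(config: dict, soak_seconds: int = 300) -> bool:
    """Widen a change stage by stage; abort and roll back on regressions."""
    for stage in STAGES:
        apply_config(stage, config)
        time.sleep(soak_seconds)  # let the change 'soak' before widening it
        if error_rate(stage) > ERROR_BUDGET:
            rollback(config)      # fail small: impact stays within this stage
            return False
    return True


if __name__ == "__main__":
    staged_rollout({"rule": "example-change"}, soak_seconds=1)
```

The essential point of such a design is that a faulty change is caught while it only affects a small slice of traffic, instead of reaching every data center at once.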
Computing also reports that Cloudflare is reviewing its internal emergency procedures. During the recent incidents, security measures and interdependencies between systems slowed recovery because employees did not have immediate access to the necessary tools. These so-called break-glass procedures are now being modified so that security does not become a barrier during an outage.
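The underlying principle is that emergency access should be pre-provisioned, time-limited, and audited, rather than depending on systems that may themselves be degraded. The sketch below is a hypothetical illustration of that idea, not a description of Cloudflare's actual procedures.

```python
# Hypothetical illustration of a time-limited, audited break-glass grant;
# names and structure are assumptions, not Cloudflare's emergency tooling.

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class BreakGlassGrant:
    operator: str
    reason: str
    issued_at: datetime
    ttl: timedelta = timedelta(hours=1)  # access expires automatically

    def is_valid(self) -> bool:
        return datetime.now(timezone.utc) < self.issued_at + self.ttl


def audit(event: str) -> None:
    """Placeholder for an append-only audit log that stays available during an outage."""
    print(f"[audit] {event}")


def use_break_glass(grant: BreakGlassGrant) -> bool:
    """Allow emergency access only while the time-limited grant is valid."""
    if not grant.is_valid():
        audit(f"DENIED expired grant for {grant.operator}")
        return False
    audit(f"GRANTED emergency access to {grant.operator}: {grant.reason}")
    return True


if __name__ == "__main__":
    grant = BreakGlassGrant("on-call-engineer", "global outage recovery",
                            datetime.now(timezone.utc))
    use_break_glass(grant)
```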
The combination of rapidly growing internet traffic, increasing automation, and complex infrastructure amplifies the consequences of internal errors at large providers. Figures from Cloudflare Radar show that dependence on these types of platforms is continuing to increase, while tolerance for outages is decreasing. In that light, the Fail Small initiative takes on a broader meaning than just incident recovery.
With this approach, Cloudflare implicitly acknowledges that scale and speed alone are not enough to guarantee reliability. As networks grow and traffic becomes more complex, managing change becomes at least as important as delivering capacity. That makes this announcement relevant to anyone who depends on large-scale cloud and internet infrastructure.