
Alex Holeniev
In February 2026, an event made us all rethink just how dependent modern infrastructure is on a handful of providers. A single major Cloudflare outage disrupted a wide range of global online services. Corporate tools, consumer platforms, food delivery apps, betting sites, even other cloud providers – none of them were immune.
Alex Holeniev has seen that dependency from both sides. A senior engineering leader with over 15 years of experience building planet-scale distributed systems, he has run infrastructure teams at Google, served as Managing Director of IT at Sberbank during a $30M modernization across 36 data centers, and is today recognized as an expert in high-load distributed systems.
In this interview we discussed why today's cloud still fails at global scale, what the Cloudflare outage reveals about high-load systems, and what comes next.
Alex, to begin this interview: could you explain what made the recent Cloudflare outage so impactful at a global scale, from a distributed systems perspective?
Well, the February outage was, I'd say, a textbook example of a "control plane" failure, and the type of failure matters. An internal bug in an addressing API led to the unintended withdrawal of BGP (Border Gateway Protocol) prefixes. In plain terms, Cloudflare effectively told the rest of the internet, "we're not here anymore." From a distributed systems perspective, that created a "routing black hole."
And why did the effect spread globally rather than stay local?
Because Cloudflare acts as a critical gateway for a huge portion of the web, the disappearance of those routes triggered "BGP path hunting" globally. ISPs were forced into constant, massive recalculations of internet paths. That's what produced the global latency spikes everyone felt, and the strange situation where services were technically online but logically unreachable. The servers were fine. The map to them wasn't.
You know, from my experience working with planet-scale distributed systems, the pattern that always strikes me is this: at this scale, you don't fail because the hardware breaks. You fail because the map of the system breaks. Cloudflare is the most visible recent example, but the underlying mechanic is the same one I've watched play out for years.
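The "servers were fine, the map wasn't" point can be sketched with a toy routing table. This is a minimal illustration, not a model of real BGP: the `RoutingTable` class, the `edge-pop-1` next hop, and the documentation prefix `203.0.113.0/24` are all assumptions made for the example.

```python
import ipaddress


class RoutingTable:
    """Toy routing table: prefix -> next hop, with longest-prefix-match lookup."""

    def __init__(self):
        self.routes = {}

    def announce(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: traffic has nowhere to go
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("203.0.113.0/24", "edge-pop-1")  # hypothetical prefix and next hop
server_is_healthy = True  # the server itself never changes state

before = table.lookup("203.0.113.10")
table.withdraw("203.0.113.0/24")  # the unintended withdrawal
after = table.lookup("203.0.113.10")
```

Nothing about the server changes between the two lookups. Only the route disappears, which is enough to make a healthy service unreachable.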
Why do modern distributed systems tend to exhibit cascading failures and "emergent fragility" as they scale, even with redundancy and failover in place?
In massive systems, redundancy can quietly turn against you. Failures at this scale are rarely about a single hardware node dying. They're about correlated errors. As we saw in the Cloudflare event, the issue wasn't a "retry storm," it was a configuration "poison pill." When a flawed configuration or a logic bug is pushed to a redundant system, it doesn't "fail over." It replicates the failure across every node at the same time.
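A rough sketch of why a replicated bad configuration defeats redundancy, and how a staged rollout limits it. The health check `is_reachable`, the node names, and the config shape are hypothetical, and the canary approach is a common mitigation named here for illustration, not a description of what any provider actually runs.

```python
def is_reachable(config):
    # Hypothetical health check: a node that advertises no prefixes is unreachable.
    return len(config["prefixes"]) > 0


def push_to_all(fleet, config):
    """Replicate the config to every node at once, with no validation."""
    for node in fleet:
        fleet[node] = config


def staged_rollout(fleet, config, canaries):
    """Apply to canary nodes first; roll back and stop if they become unhealthy."""
    previous = {node: fleet[node] for node in canaries}
    for node in canaries:
        fleet[node] = config
    if not all(is_reachable(fleet[node]) for node in canaries):
        for node in canaries:
            fleet[node] = previous[node]  # restore the last known-good config
        return False
    for node in fleet:
        fleet[node] = config
    return True


good = {"prefixes": ["203.0.113.0/24"]}
poison = {"prefixes": []}  # the flawed config: withdraws everything

fleet_a = {f"node-{i}": good for i in range(5)}
push_to_all(fleet_a, poison)
reachable_after_push = sum(is_reachable(c) for c in fleet_a.values())

fleet_b = {f"node-{i}": good for i in range(5)}
rolled_out = staged_rollout(fleet_b, poison, canaries=["node-0"])
reachable_after_staged = sum(is_reachable(c) for c in fleet_b.values())
```

With five redundant nodes, the direct push leaves none reachable, while the staged rollout stops at the canary and leaves all five serving.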
So the redundancy itself becomes the vector?
Pretty much. This is what people mean by "emergent fragility." The complexity of these interactions grows faster than our ability to map them, which lets a single logical error bypass every physical redundancy layer we've built in. I saw versions of this many times during my consulting years at Accenture, leading the IT merger of two banks and the carve-out of a large FMCG company in Europe. When you combine or separate systems of that size, you discover very quickly that "redundant" and "independent" are not the same word. Two systems can look fully duplicated on paper and still share a hidden upstream, a shared library, a shared config service. The moment that one thing fails, both copies go down together.
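The gap between "redundant" and "independent" can be checked mechanically if a dependency graph is available. The sketch below assumes such a graph exists as a plain dictionary; the service names are invented for the example.

```python
def transitive_deps(service, graph):
    """Collect everything a service depends on, directly or indirectly."""
    seen = set()
    stack = [service]
    while stack:
        current = stack.pop()
        for dep in graph.get(current, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def shared_dependencies(a, b, graph):
    """Dependencies that two supposedly independent copies have in common."""
    return transitive_deps(a, graph) & transitive_deps(b, graph)


# Hypothetical graph: two copies with separate databases but a common library.
graph = {
    "payments-primary": ["db-east", "auth-lib"],
    "payments-replica": ["db-west", "auth-lib"],
    "auth-lib": ["config-service"],
}

shared = shared_dependencies("payments-primary", "payments-replica", graph)
```

Here the two copies look duplicated at the database level, yet both ultimately depend on `auth-lib` and `config-service`, so a failure in either takes both down.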
Has today's internet become structurally dependent on a small number of large infrastructure providers?
Yes. The internet has structurally centralized on top of a decentralized protocol. That's the paradox of the modern web. The routing layer is still spread out. The application layer, the CDN layer and the DNS layer have all come together under a few big providers.
But isn't that exactly the problem? A handful of companies as a single point of failure for the global economy?
It's a real risk, I won't pretend otherwise. But it's worth saying that the major providers, Google, Cloudflare, AWS, take this seriously, and you can see it in how they invest. Having led teams at Google, I've seen the scale of investment in defense in depth, specialized isolation zones, and SRE (Site Reliability Engineering) standards. These companies aren't just selling services. They're effectively maintaining the plumbing of the modern internet, and the automated guardrails they build are what stop most local issues from becoming global catastrophes.
I'd add that I've sat on both sides of this. At Sberbank, we ran IT for a bank with 70 million users and over a billion payments a day, and at that scale you are deeply dependent on your infrastructure providers. At Google, I was on the other side of the table, running parts of the global file copying and replication services that other companies depend on. From both perspectives, the honest answer is the same. The dependency is real, the concentration is real, and the trade-off we've collectively made is that consolidation has bought us a level of reliability and engineering rigor that a more fragmented internet would not have. Whether that trade is sustainable long-term is a separate question.
Are there fundamental limitations to classical distributed systems models like CAP, consensus, and replication at global scale?
CAP is a theorem, not a law, and in practice you work around its limits with architectural choices, mostly localization. In my work at Google, instead of forcing global consensus for every transaction, we used internal Bigtable implementations where we pinned instances to specific regions. By keeping data management close to the user and minimizing cross-cluster dependencies, we kept the "blast radius" contained. That way, when a specific cluster runs into trouble, the failure stays where it started, instead of spreading regionally or globally.
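A minimal sketch of region pinning as a way to contain blast radius. It does not use or imitate the Bigtable API; `RegionalStore`, the region names, and the user-to-region mapping are assumptions made purely for illustration.

```python
class RegionUnavailable(Exception):
    pass


class RegionalStore:
    """Toy key-value store pinned to a single region."""

    def __init__(self, region):
        self.region = region
        self.available = True
        self.rows = {}

    def write(self, key, value):
        if not self.available:
            raise RegionUnavailable(self.region)
        self.rows[key] = value

    def read(self, key):
        if not self.available:
            raise RegionUnavailable(self.region)
        return self.rows[key]


# Hypothetical regions and users; each user's data lives only in a home region.
stores = {"europe": RegionalStore("europe"), "americas": RegionalStore("americas")}
home_region = {"alice": "europe", "bob": "americas"}


def store_for(user):
    return stores[home_region[user]]


def affected_users():
    """Users whose home region is currently unavailable (the blast radius)."""
    return sorted(u for u, r in home_region.items() if not stores[r].available)


store_for("alice").write("alice", "profile-a")
store_for("bob").write("bob", "profile-b")

stores["europe"].available = False  # one cluster runs into trouble
blast_radius = affected_users()
bob_profile = store_for("bob").read("bob")  # unaffected: no cross-region dependency
```

Because no request for `bob` ever touches the `europe` store, the outage stays with the users pinned there.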
Would you still call CAP the right mental model, or is the industry quietly moving past it?
A bit of both. CAP is still useful as a thinking tool, especially for junior engineers learning where the hard trade-offs live. But after 15 years of working with databases, distributed storage, and replication, in Java and C++, I'd say the practical limits aren't really CAP itself. They're things CAP doesn't talk about: latency, partial failures, clock skew, the cost of coordination. Going back to my time at NetCracker, where I led product performance and scalability, including horizontal scaling and database optimization, the same lesson keeps repeating. The theoretical model tells you what's impossible. It doesn't tell you what's expensive, and at global scale, "expensive" is usually what kills you. So the field hasn't moved past CAP. It's moved around it.
From your experience, which architectural principles in large-scale storage systems (like those at Google) are most critical for preventing cascading failures?
In the Google Warsaw Infra Storage team, our two primary defenses were transparent migration and replication. Transparent migration let us move exabytes of data between hardware sets or locations without the application layer ever knowing the shift happened. It is a way to move systems that are at risk before they cause any problems. Think about it like this: you are using a service while its operators move your data from one place to another, and you just keep using it. You see nothing. No downtime, no error, no reconnect. That's the goal. We deployed transparent migration into production and piloted replication as part of a new Google file system. The hard part isn't the move itself. It's making sure the application above it has no idea anything is happening.
Replication is the second piece, and we treated it as a deliberate trade-off. We chose to spend more on storage so that a failure would have limited impact. We made copies of our data and spread them across areas, so that if one area has a big problem the others keep working. That way no data gets lost and we don't have any downtime. It costs a lot, but not doing it would have cost us even more in the long run.
Why do systems designed to be fault-tolerant still fail in production? Which aspects of real-world behavior are hardest to capture in theory?
Theory tends to assume every code path is tested, but real-world "black swan" events almost always live in the gaps. This is why test coverage matters so much. Any logic that isn't regularly exercised in a test environment becomes a potential liability the moment it hits production.
Real-world behavior is also plagued by what I'd call "dark dependencies." These are connections between systems that aren't documented anywhere, and they only become visible during a failure. They're the wiring nobody drew on the diagram. These logical corruptions and untested edge cases are far more dangerous than any simple hardware failure, because they don't trigger your monitoring, they don't show up in your dashboards, and they don't fit the failure modes your runbooks were written for.
If today's cloud and distributed systems model is approaching its limits, what architectural paradigm might replace it?
We're moving toward what I'd call autonomous, intent-based infrastructure. The current model leans too heavily on static, human-managed configurations like BGP tables, and those are inherently prone to human error. The next paradigm will be systems where engineers define high-level "intents," things like required latency or isolation level, and the infrastructure uses AI-driven orchestration to reconfigure itself on the fly. This shifts the focus from "managing servers" to "managing policies," where the system automatically isolates itself or re-routes traffic based on real-time health signals before a human even detects a bottleneck.
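The shift from "managing servers" to "managing policies" can be sketched as a reconciliation function that takes a declared intent plus live health signals and decides where traffic may go. This is only an illustration under stated assumptions: a simple threshold rule stands in for the AI-driven orchestration described above, and `Intent`, `Backend`, and the location names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Intent:
    """What the engineer declares: an outcome, not a server list."""
    max_latency_ms: float


@dataclass
class Backend:
    name: str
    latency_ms: float  # latest observed health signal
    healthy: bool = True


def reconcile(intent, backends):
    """Return the backends that may receive traffic under the current intent."""
    return [
        b.name
        for b in backends
        if b.healthy and b.latency_ms <= intent.max_latency_ms
    ]


intent = Intent(max_latency_ms=50)
backends = [Backend("warsaw", 20), Backend("frankfurt", 35)]

initial = reconcile(intent, backends)

backends[0].latency_ms = 400  # a real-time signal degrades
after_spike = reconcile(intent, backends)
```

Nobody edits a routing table here. The declared intent stays the same and the serving set changes as the observed signals change.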
Have you ever seen a system behave in a way that fundamentally contradicted its underlying model or architectural assumptions?
Yes. When I was running IT at Sberbank, we took on a simplification project. We ended up shutting down 1,480 applications across 36 data centers, which saved the bank about $30 million a year. On paper, shutting down these so-called "legacy" apps looked straightforward. In practice, those shutdowns frequently triggered outages in modern systems we believed were completely independent. These "phantom dependencies" showed us that the documented architecture and the actual communication paths weren't even close. The system had quietly evolved over decades, and no diagram had kept up. It was a stark reminder that in complex systems, the "as-built" reality often contradicts the "as-designed" model.
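One way to surface such phantom dependencies before a shutdown is to compare the documented callers of an application with the callers actually observed in traffic. The sketch below assumes both are available as (caller, callee) pairs; the application names and data are invented for the example.

```python
def undocumented_callers(app, documented, observed):
    """Callers seen in real traffic that the architecture diagram does not list."""
    documented_callers = {caller for caller, callee in documented if callee == app}
    observed_callers = {caller for caller, callee in observed if callee == app}
    return observed_callers - documented_callers


# Hypothetical data: what the diagram says versus what the network logs show.
documented = [("branch-portal", "legacy-ledger")]
observed = [
    ("branch-portal", "legacy-ledger"),
    ("mobile-api", "legacy-ledger"),  # a modern system nobody expected
    ("mobile-api", "auth"),
]

phantoms = undocumented_callers("legacy-ledger", documented, observed)
safe_to_shut_down = not phantoms
```

An empty result does not prove a shutdown is safe, since it only covers the traffic that was observed, but a non-empty one is a clear reason to stop and investigate.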
Which projects or decisions have most shaped how you think about reliability and scalability in large-scale systems?
My experience leading the Google Warsaw Infra Storage team was the most formative. We operated under a clear philosophy. Reliability has to be tested proactively, through what we call reliability drills. We didn't just design for success. We ran specific drills to make sure we could roll back quickly and safely in any scenario. The goal was twofold. First, our systems had to survive outages in the services they depended on, without crashing. Second, we had to make sure we ourselves weren't a critical dependency for anyone else. That's what I mean by "decoupled reliability." A failure in one part of the stack shouldn't automatically become a failure for the rest of the ecosystem. And that's what matters. The systems change, the languages change, the scale changes, but the discipline of asking "how does this fail" before "how does this work" is what separates infrastructure that lasts from infrastructure that doesn't.
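The first goal, surviving an outage in a dependency without crashing, can be sketched as a read path that falls back to the last known value. This is a minimal illustration and not the team's actual mechanism; `DependencyDown`, `read_with_fallback`, and the fetch functions are hypothetical names.

```python
class DependencyDown(Exception):
    pass


def read_with_fallback(key, fetch, cache, default=None):
    """Try the dependency; on failure, serve the last known value or a default."""
    try:
        value = fetch(key)
    except DependencyDown:
        return cache.get(key, default)
    cache[key] = value  # remember the last good answer
    return value


def healthy_fetch(key):
    return f"fresh-{key}"


def broken_fetch(key):
    # Simulates the drill: the upstream service is taken away on purpose.
    raise DependencyDown(key)


cache = {}
first = read_with_fallback("profile", healthy_fetch, cache)
during_outage = read_with_fallback("profile", broken_fetch, cache)
never_seen = read_with_fallback("settings", broken_fetch, cache, default="defaults")
```

Swapping `healthy_fetch` for `broken_fetch` is a small-scale version of a drill: it checks that the caller degrades to stale or default data instead of propagating the failure.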
