
I’ve spent much of the past year on an important project, potentially life-saving for many. But if the site went down for a few hours or even a day, no one would die immediately. Maybe some revenue loss, but nothing catastrophic. What about yours? Is it mission or safety critical? Or doing so much business that half a day of downtime is millions in losses, not to mention the customer service fallout?
Major digital outages keep happening. The Internet was designed to survive nuclear war, yet we’ve layered on centralization that increases failure risk. And it seems to be getting worse as new choke points pile up.
The Gray Rhino in the Server Room
These outages… Internet today, crypto tomorrow… aren’t low-probability Black Swans. They’re Gray Rhinos. For my fellow CPOs, CTOs, Heads of Product, and CEOs, here are some things to consider.
If you lost access to half your products during the Nov 18, 2025 Cloudflare incident, you weren’t alone. Same for the October 2025 AWS US-EAST-1 meltdown, or the July 2024 CrowdStrike update that paralyzed companies worldwide.
These aren’t rare, unpredictable Black Swans anymore. They’re Gray Rhinos: obvious, high-impact risks leaders still treat as someone else’s problem. (Economic negative externalities.) Foreseeable? Maybe that’s debatable, but we’ve long understood “known risks” vs. “known unknowns.” These recent outages were predictable in everything except timing and duration.
- Nov 18 2025 (today as I write this) – Cloudflare: 2–4 hours, X/Twitter, Discord, OpenAI/ChatGPT, Shopify stores, thousands of SaaS logins broken.
- Oct 20–21 2025 – AWS (US-EAST-1): ~15 hours, Snapchat, Roblox, Coinbase, McDonald’s ordering, half of DeFi front-ends down.
- Jul 19 2024 – CrowdStrike: 12–72 hours, 8.5 million Windows devices, global flights grounded, hospitals offline, billions in insured losses. (Reuters covered some of the losses.)
Three events, 18 months, billions in combined damage. That’s the new cadence. And I’d argue that’s not bad luck; that’s a pattern. (Of course, there are many more.)
Five Choke Points
There are more, but here are some of the big ones.
- AWS (~29–31% of all cloud compute) (Statista)
- Cloudflare (~50% of all public-facing HTTPS traffic at peak, though that’s an older estimate. More conservatively, roughly 16% of global traffic flows through it and about 20% of websites use it.)
- Microsoft Azure + Microsoft 365/Active Directory (Azure at roughly 20% cloud market share, per Global Cloud Market Share Q3 2025)
- Google Cloud (see Global Cloud Market Share Q3 2025). Note that Google OAuth works independently, but it handles billions of logins for millions of third-party sites/apps. Personal opinion: crypto identity continues to struggle because it’s so hard to use and there’s not much perceived consumer benefit yet. If we get a great big failure here, that will be the next little darling investment sector to pop.
- CrowdStrike Falcon (runs on a large share of Fortune-1000 endpoints, or at least we can infer as much)
One misconfiguration in any of these can create immediate, material P&L and reputational risk. Think customers have no alternatives? They might think otherwise. Or at least start looking.
Same Pattern Is Repeating in Crypto
We told ourselves blockchain would fix centralization. Instead we’ve recreated it:
- 70–80% of Ethereum RPC traffic still goes through Infura or Alchemy (both hosted primarily on… AWS).
- Most “decentralized” front-ends are Cloudflare-only.
- L2 sequencers and many bridges are single-region or single-cloud. (An L2 sequencer is a centralized (or progressively decentralizing) component in Ethereum Layer-2 rollups that collects, orders, and batches user transactions off-chain before submitting them to the main Ethereum blockchain for final settlement.)
- Oracle networks (Chainlink included) have concentrated node operators.
- Almost every major CEX is a systemic risk.
So when AWS blinked in October 2025, Uniswap’s front-end, MetaMask’s default RPC, and many NFT marketplaces went dark. That’s not decentralization. It’s centralized risk in a blockchain hoodie. (Full disclosure: I have a couple of blockchain hoodies.)
I’m not trying to bust chops on crypto here. I think blockchain may help fix some of this. But as crypto rails go mainstream, teams should learn from existing infra failures. Many “decentralized” components still rely on the same centralized network and transport layers.
Concentration Risk
Concentration risk is logical: the best providers get huge, and new entrants can’t compete. It’s a global pattern worth its own essay, but it clearly contributes to these failures.
Mitigation Playbook for… Product?
Is this Product’s job? Usually no. Technical mitigation falls to Engineering/IT. But if outages hit your P&L or reputation, it becomes your problem. And pushing too hard can strain the CTO relationship. Still: you need to engage.
I’ve been burned here. I flagged concerns about a key implementation but let it proceed because delaying seemed riskier. It failed, cost real money, and I caught heat twice: once for raising concerns, then for being right. I should have pushed earlier for clearer mitigation and failure-mode planning. Painful lesson learned.
I’m Not a Doctor, But…
This is another of those “I’m not a doctor, but I play one on the Internet” things. I’m not a systems architect, but I’ve co-owned enough logical application architecture diagrams with dev teams to offer high-level questions Product should ask. You can’t own every line item, but you can ensure someone does. There’s no way my list is anywhere near exhaustive; it’s just some examples. It might not be your job. And I know you want to get back to talking to customers, doing research, and thinking grand thoughts about strategy, or diving into Figma with your design team. Sorry, I meant vibing AI prototypes now. But it is potentially your problem. So just be aware. Consider these, though in most cases full-blown backups will be cost prohibitive, so you may be left with other failover choices. The point is to make sure the issues are at least explicitly considered.
- CDN: Do you need more than one? You should know a CDN is a Content Delivery Network: a globally distributed network of edge servers that delivers content to users as quickly and reliably as possible. Maybe you need to combine several, even though this may cost significantly more in both contracts and delivery complexity: Cloudflare + Fastly/Akamai + AWS CloudFront. Cost: guesstimating maybe +10% on edge bills. Payoff: less risk of downtime during CDN outages. (A minimal fallback sketch follows this list.)
- CDN “origin-shield bypass”: Check into this. An origin shield is an optional extra layer that many CDNs offer. Instead of every edge server on the planet hammering your origin directly on a cache miss, edge servers forward the request to one (or a few) designated “shield” nodes first. The shield then talks to your origin. This dramatically reduces load on your origin servers. Though it might not actually help, depending on exactly where things are breaking.
- Main Cloud Providers: Maybe require at least two cloud providers for anything customer-facing. Even if 90% lives on AWS, make sure the other 10% (DNS, critical APIs, failover queues) is on Google Cloud Platform (GCP) or Azure or wherever, and tested every so often. Testing means actually testing. Which is scary. But that’s why you do it at 3:00 AM in your primary market.
- More DNS!: You know DNS is how your website actually gets found, right? Run your own authoritative DNS on at least two independent anycast networks (e.g., Cloudflare + Route 53 + one smaller player).
- Crypto RPCs: Your wallet or dApp never talks directly to the blockchain. It talks to a Remote Procedure Call (RPC) endpoint: a remote server that does the heavy lifting of reading the chain, submitting transactions, and returning data. Consider RPC diversity in your wallets/dApps. Default to a fallback stack (Infura → Alchemy → QuickNode → self-hosted); a sketch follows this list. Most users never notice, but your availability can be meaningfully better. Fun fact! The dirty secret of “decentralized” apps in 2025: ~70–80% of all user traffic still goes through just three centralized RPC providers. When one of them hiccups (or when AWS US-EAST-1 hiccups), huge portions of the ecosystem go blind or stop accepting transactions at the same time. So much for decentralization.
- Chaos-engineer these scenarios in production: This means intentionally injecting realistic failures into your production systems (on purpose, in a controlled way) to verify that things behave as expected when they break. (Yes, actually pull the plug on Cloudflare during business hours. You may be amazed how many single points of failure you still have.) A toy failure-injection wrapper follows this list.
- Put outage risk on your planning board: Quantify it in dollars, not just uptime. “Expected annual loss from centralized infra failure: $X million.” This can help focus some folks when you ask for budget for these mitigations. (A back-of-the-envelope example follows this list.)
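To make the multi-CDN point concrete, here’s a minimal client-side fallback sketch in TypeScript. The hostnames are hypothetical placeholders, and in practice you’d usually handle this at the DNS or load-balancer layer rather than in application code; the point is simply that “more than one CDN” has to show up somewhere in code or config, not just in a contract.

```typescript
// Minimal multi-CDN fallback sketch. Hostnames are hypothetical placeholders;
// real setups usually do this at the DNS or load-balancer layer.
const CDN_HOSTS = [
  "https://assets.primary-cdn.example",  // e.g., Cloudflare-fronted
  "https://assets.backup-cdn.example",   // e.g., Fastly/CloudFront-fronted
];

async function fetchAsset(path: string, timeoutMs = 3000): Promise<Response> {
  let lastError: unknown;
  for (const host of CDN_HOSTS) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const res = await fetch(`${host}${path}`, { signal: controller.signal });
      if (res.ok) return res;            // first healthy CDN wins
      lastError = new Error(`HTTP ${res.status} from ${host}`);
    } catch (err) {
      lastError = err;                   // timeout, DNS failure, TLS error...
    } finally {
      clearTimeout(timer);
    }
  }
  throw new Error(`All CDNs failed for ${path}: ${lastError}`);
}
```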
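For the RPC-diversity item, here’s a minimal fallback-stack sketch using plain fetch and the standard Ethereum JSON-RPC request format. The endpoint URLs are placeholders, not real provider URLs; libraries such as ethers.js ship fallback providers that do this more robustly, but the shape of the logic is the same.

```typescript
// Minimal RPC fallback sketch. Endpoint URLs below are placeholders; substitute
// your real Infura/Alchemy/QuickNode/self-hosted endpoints.
const RPC_ENDPOINTS = [
  "https://mainnet.infura.example/v3/YOUR_KEY",      // placeholder
  "https://eth-mainnet.alchemy.example/v2/YOUR_KEY", // placeholder
  "https://your-node.internal:8545",                 // self-hosted last resort
];

// Standard Ethereum JSON-RPC call, tried against each endpoint in order.
async function rpcCall(method: string, params: unknown[] = []): Promise<unknown> {
  let lastError: unknown;
  for (const url of RPC_ENDPOINTS) {
    try {
      const res = await fetch(url, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ jsonrpc: "2.0", id: 1, method, params }),
      });
      if (!res.ok) throw new Error(`HTTP ${res.status} from ${url}`);
      const body = await res.json();
      if (body.error) throw new Error(`RPC error from ${url}: ${body.error.message}`);
      return body.result;
    } catch (err) {
      lastError = err; // fall through to the next provider
    }
  }
  throw new Error(`All RPC endpoints failed: ${lastError}`);
}

// Usage: const blockNumber = await rpcCall("eth_blockNumber");
```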
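For the chaos-engineering item, here’s a toy failure-injection wrapper. The environment flag and the regex target are made-up illustrations; real game days use proper fault-injection tooling and blast-radius controls. Even something this crude, though, tends to surface dependencies you forgot you had.

```typescript
// Toy failure-injection sketch. Flag names and the wrapped-fetch approach are
// illustrative only; real chaos tooling is far more disciplined than this.
type ChaosConfig = {
  enabled: boolean;
  failRate: number;          // fraction of matching requests to fail, 0..1
  targetHostPattern: RegExp; // which dependency to "break"
};

const chaos: ChaosConfig = {
  enabled: process.env.CHAOS_ENABLED === "true",
  failRate: 0.1,
  targetHostPattern: /primary-cdn|cloudflare/i, // hypothetical target
};

async function chaoticFetch(url: string, init?: RequestInit): Promise<Response> {
  if (
    chaos.enabled &&
    chaos.targetHostPattern.test(url) &&
    Math.random() < chaos.failRate
  ) {
    // Simulate the dependency being down instead of calling it.
    throw new Error(`Chaos injection: simulated outage for ${url}`);
  }
  return fetch(url, init);
}
```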
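And for putting outage risk on the planning board, a back-of-the-envelope expected-loss calculation. Every number below is a placeholder; plug in your own incident history and revenue-per-hour figures.

```typescript
// Back-of-the-envelope expected annual loss. All figures are placeholders.
type OutageScenario = {
  name: string;
  annualProbability: number; // chance of at least one such incident per year
  expectedHoursDown: number; // typical duration if it happens
  costPerHour: number;       // revenue + SLA credits + support burden, in $
};

const scenarios: OutageScenario[] = [
  { name: "CDN/edge outage",      annualProbability: 0.5,  expectedHoursDown: 3,  costPerHour: 50_000 },
  { name: "Cloud region outage",  annualProbability: 0.3,  expectedHoursDown: 12, costPerHour: 50_000 },
  { name: "Endpoint-agent event", annualProbability: 0.05, expectedHoursDown: 24, costPerHour: 50_000 },
];

// Expected annual loss = sum over scenarios of p * hours * cost per hour.
const expectedAnnualLoss = scenarios.reduce(
  (sum, s) => sum + s.annualProbability * s.expectedHoursDown * s.costPerHour,
  0,
);

console.log(`Expected annual loss: $${expectedAnnualLoss.toLocaleString()}`);
// With these placeholder numbers: 75,000 + 180,000 + 60,000 = $315,000 per year.
```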
Hopefully, you’re getting the idea that this lousy little list I put together is nothing. It’s just a handful of examples. A real CTO or systems architect might look at it and say, “Cute! Nice start, kid!” Your job isn’t mastering the real playbook; it’s ensuring one exists and lives with the right owner.
Punchline
The internet was engineered to survive nuclear war. We’re breaking it with vendor lock-in and capex avoidance.
Gray Rhinos do not require magic prescience to spot. They require leadership willingness to spend political and actual capital. They require intellectual honesty.
The next big outage is not a question of if; it is a question of whose users, revenue, and reputation will be collateral damage. Start treating resilience as a P&L line item, not an engineering nice-to-have. Your shareholders and your customers will thank you when the inevitable happens. Or actually, not really. They won’t thank you, because no one generally thanks you when nothing happens. Of course, if something bad does happen, a whole variety of folks might want to have a little chat.

Does anyone on your team own the “Gray Rhino Mitigation” OKR this quarter? If nobody does, have a good time out there in that wind, because you’re riding without a helmet. Which is maybe nice sometimes. The problem is, if you’ve been paying attention, some of the roads around here seem to be getting worse, not better. I’m an optimist, so I think things will get better over time. But in the short to medium term, I’m not so sure, mostly because interdependencies seem to be rising fast, and that means more hidden cascade effects. We’ll get it worked out. Just, unfortunately, probably the hard way.
