A Reality Check on Cloud Reliability

The Cloud isn’t as infallible as we’d like to think. Test your disaster recovery (DR) and business continuity planning (BCP).
Tags: security, availability, risk, opinion

On 29 October 2025, Azure Front Door had a bit of a nightmare. What started as an "inadvertent tenant configuration change" (Microsoft's way of saying "oops, we pushed a dodgy config") turned into an 8+ hour outage that affected organisations worldwide. The incident exposed a rather uncomfortable truth: even the big cloud providers can have spectacularly bad days.

For full details of the incident, see Microsoft's preliminary Post Incident Review.

Microsoft's "Just Turn It Off" Advice

When things went pear-shaped, Microsoft's suggestion was essentially "bypass the CDN entirely." Now, if you're thinking that sounds a bit risky, you'd be right, because this potentially means:

  • No WAF or DDoS protection: If you relied on Azure Front Door for security, bypassing it means you're suddenly naked on the internet. Not ideal.
  • Exposed infrastructure: All those servers you thought were safely tucked away? They're now potentially accessible to anyone who fancies a look. And if you've locked them down properly, they may refuse direct traffic entirely (see the sketch after this list).
  • No caching, full chaos: Without the CDN doing its job, your origin servers have to handle all the traffic directly. Expect them to melt faster than an ice cream in the Sahara.
  • SSL certificate headaches: Direct connections bypass SSL termination, which can create all sorts of fun certificate problems, especially if you've used certificate pinning (though not only then).
  • Hot reconfiguration of routing: For organisations using private endpoints (you know, for security), bypassing the CDN simply isn't an option. There's literally no public route to the servers.
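
To make the exposure point concrete, here's a minimal sketch (assuming a Python/Flask origin) of the kind of lockdown sensible deployments already have in place: the origin only accepts requests carrying the X-Azure-FDID header that Front Door adds, so "just bypass the CDN" either leaves you exposed or simply doesn't work. The Front Door ID value and route are hypothetical placeholders.

```python
# Minimal sketch of an origin that only accepts traffic routed through
# Azure Front Door, by checking the X-Azure-FDID header Front Door adds.
# Assumes a Flask origin; EXPECTED_FD_ID is a hypothetical placeholder
# for your own Front Door profile ID.
from flask import Flask, abort, request

app = Flask(__name__)
EXPECTED_FD_ID = "00000000-0000-0000-0000-000000000000"  # hypothetical placeholder

@app.before_request
def require_front_door():
    # Direct requests (i.e. anything that bypassed the CDN) won't carry the
    # expected header, so the "just bypass Front Door" workaround is refused.
    if request.headers.get("X-Azure-FDID") != EXPECTED_FD_ID:
        abort(403)

@app.route("/")
def index():
    return "Hello from the origin"
```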

The SLA Fantasy

Azure Front Door comes with a 99.99% SLA, which allows for less than an hour of downtime per year (roughly 52 minutes). This single incident lasted over 10 hours for some organisations: more than a decade's worth of "acceptable" downtime burned through in one go.
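
To put numbers on that, here's a quick back-of-the-envelope sketch (assuming a 365.25-day year and the 10-hour figure above):

```python
# Back-of-the-envelope SLA arithmetic: what a 99.99% availability target
# actually allows, and how far a single long incident overshoots it.
# Assumptions: a 365.25-day year and a 10-hour incident.
HOURS_PER_YEAR = 365.25 * 24  # ~8,766 hours

def allowed_downtime_hours(sla_percent: float) -> float:
    """Annual downtime budget implied by an availability SLA."""
    return HOURS_PER_YEAR * (1 - sla_percent / 100)

budget = allowed_downtime_hours(99.99)  # ~0.88 hours, i.e. ~52.6 minutes
incident_hours = 10

print(f"Annual budget: {budget * 60:.1f} minutes")
print(f"A {incident_hours}-hour incident burns ~{incident_hours / budget:.1f} years of budget")
```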

This raises an awkward question: what exactly are these SLAs worth? When a single incident can blow past your annual allowance by 10x, perhaps it's time to stop treating vendor SLAs as a substitute for proper disaster recovery and business continuity planning.

👉 Perhaps it's time to stop treating vendor SLAs as a substitute for proper disaster recovery and business continuity planning.

What We Can Learn (Before It Happens to You)

This incident is a masterclass in why you can't just trust the cloud to look after itself:

  • SLAs aren't guarantees: They're numbers on paper. Your Business Continuity Policy should assume that even "highly available" services can fail catastrophically.
  • Design for failure: Assume everything will break eventually—because it will. Even services from Microsoft, AWS, or Google. Your architecture should handle it when (not if) it happens.
  • Defence in depth matters: The Minimum Viable Secure Product framework exists for a reason. Keep backups in different locations and ensure your recovery procedures don't depend on the thing that's currently broken.
  • Map your dependencies: Following guidance on mitigating single points of failure, understand what breaks when your CDN goes down. If the answer is "everything," you have a problem.
  • Test your disaster recovery plan: Don't wait for a real disaster to discover your plan is more "disaster" than "recovery". As covered in business continuity best practices, run annual exercises that include restoring critical systems (a minimal probe sketch follows this list).
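
As a starting point for that kind of exercise, here's a minimal sketch of a synthetic probe that checks both the CDN-fronted endpoint and the failover path you'd actually switch to; the hostnames and health-check path are hypothetical placeholders.

```python
# Minimal sketch of a synthetic DR probe: check both the CDN-fronted
# endpoint and the failover path you would actually switch to.
# The hostnames and the /health path are hypothetical placeholders.
import urllib.request

ENDPOINTS = {
    "front_door": "https://www.example.com/health",
    "failover": "https://origin-direct.example.com/health",
}

def probe(name: str, url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except Exception as exc:
        print(f"{name}: FAILED ({exc})")
        return False
    ok = 200 <= status < 300
    print(f"{name}: {'OK' if ok else 'unexpected status ' + str(status)}")
    return ok

if __name__ == "__main__":
    results = {name: probe(name, url) for name, url in ENDPOINTS.items()}
    # A failover path that doesn't work while the primary is healthy means
    # the DR plan is untested theory; fail the check loudly.
    if not results["failover"]:
        raise SystemExit(1)
```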

Olivier Reuland