5 November 2025
Join one of our upcoming 'Azure Incident Retrospective' livestreams discussing this incident (to hear from our engineering leaders, and to get any questions answered by our experts) or watch a recording of the livestream (available later, on YouTube): https://aka.ms/air/2LGD-9VG
What happened?
Between approximately 16:53 UTC on 5 November 2025 and 02:25 UTC on 6 November 2025, a subset of customers in the West Europe region experienced service disruptions or degraded performance across multiple services, including Virtual Machines (VM), Azure Database for PostgreSQL Flexible Server, MySQL Flexible Server, Azure Kubernetes Service, Storage, Service Bus, and Virtual Machine Scale Sets, among others. Customers using Azure Databricks in West Europe observed degraded performance when launching or scaling all-purpose and job compute workloads, which impacted Unity Catalog and Databricks SQL operations.
What went wrong and why?
The incident was triggered by a voltage sag in the utility grid, which caused cooling units to shut down and temperatures to rise above normal thresholds in a datacenter located in Physical Availability Zone 01 in the West Europe region. As temperatures rose, multiple storage scale units automatically powered down to prevent hardware damage. Under normal circumstances, cooling systems are designed and commissioned to automatically restart after power events and are built with sufficient redundancy to maintain safe operating conditions even when failures occur. While the cooling unit auto-restart capability had been regularly tested and exercised in this datacenter, the specific size and duration of the utility sag exposed a previously unknown hardware issue that prevented the cooling units from restarting. This resulted in temperatures exceeding safe operational limits for infrastructure.
The extended recovery time was influenced by several factors. Recovery of storage scale units was prolonged because some storage servers entered a degraded state, necessitating sequential validation before they could be brought back online. In addition, dependent services such as compute and networking could not resume until storage integrity checks were complete. Finally, additional thermal audits were performed before workloads were reactivated to ensure that no residual risk remained.
How did we respond?
Automated monitoring detected the temperature anomaly and triggered an incident response. Facilities teams restored safe operating conditions by performing a hard reset on the affected units and confirmed that temperatures were trending back to safe levels. Service recovery efforts began once the facility had reached safe operating temperatures.
Following this, storage systems initiated comprehensive data consistency checks when nodes in a storage scale unit restarted after the unplanned shutdown. These checks may include reconstructing replicas that have fallen behind. The recovery process prioritizes data integrity and durability over availability, so customer traffic was blocked until all integrity validations were completed, which contributed to the extended recovery time. In parallel, engineers prioritized recovery of compute hosts to stop further VM failures and allow dependent services to resume.
Timeline of events:
- 16:53 UTC on 5 November 2025 – Customer impact began as elevated temperatures caused multiple storage scale units to shut down.
- 16:55 UTC on 5 November 2025 – Datacenter monitoring detected temperature breaches and triggered thermal alerts.
- 17:20 UTC on 5 November 2025 – Power sag and cooling unit failure to restart were identified as contributing factors. Cooling recovery began as engineers manually restarted cooling units.
- 17:40 UTC on 5 November 2025 – Mitigation workstream started as cooling restoration progressed.
- 17:48 UTC on 5 November 2025 – All cooling units were restored, and temperatures began to fall as cooling normalized.
- 17:50 UTC on 5 November 2025 – Rack-level thermal monitoring returned to safe operational thresholds. Engineering teams initiated storage recovery.
- 18:30 UTC on 5 November 2025 – Sequential storage scale unit validation began. Each scale unit underwent integrity checks to ensure no data corruption before bringing storage services back into rotation.
- 20:00 UTC on 5 November 2025 – Gradual restoration of storage scale units. Recovery was staged to avoid overloading power and cooling systems. Dependent compute nodes remained offline during this time.
- 23:30 UTC on 5 November 2025 – Service dependency checks and orchestration. Networking and compute services were progressively re-enabled after storage scale units passed health checks.
- 02:25 UTC on 6 November 2025 – All services were brought back online, and customer impact was mitigated.
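The elapsed time between milestones can be worked out directly from the timestamps above. The following is a minimal sketch rather than part of the original review: the helper names (`utc`, `minutes_between`, `elapsed_from_start`) and the choice of milestones are illustrative assumptions, and only the timestamps themselves come from the timeline.

```python
from datetime import datetime, timezone


def utc(day, hhmm):
    """Build a UTC timestamp in November 2025 from a day number and 'HH:MM'."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2025, 11, day, hour, minute, tzinfo=timezone.utc)


# A subset of the timeline entries above; the labels are shortened for illustration.
MILESTONES = [
    ("impact began", utc(5, "16:53")),
    ("all cooling units restored", utc(5, "17:48")),
    ("storage validation began", utc(5, "18:30")),
    ("impact mitigated", utc(6, "02:25")),
]


def minutes_between(start, end):
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


def elapsed_from_start(milestones):
    """Map each milestone label to minutes elapsed since the first milestone."""
    start = milestones[0][1]
    return {label: minutes_between(start, ts) for label, ts in milestones}


if __name__ == "__main__":
    for label, minutes in elapsed_from_start(MILESTONES).items():
        print(f"{label}: +{minutes // 60}h{minutes % 60:02d}m")
```

On these timestamps, cooling was restored 55 minutes after impact began, and the full incident window spans 9 hours 32 minutes, most of which falls in the storage validation and staged restoration phases.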
How are we making incidents like this less likely or less impactful?
We are implementing several improvements to strengthen resilience and accelerate recovery:
- We are conducting internal and manufacturer-led forensic investigations of specific device hardware and firmware to develop solutions to bolster component tolerance to power sags and voltage fluctuations. (Estimated completion: December 2025)
- We are performing a gap analysis of the control circuit to ensure that all relays delivering critical functions of the units are powered through Uninterruptible Power Supply (UPS) and stored energy sources. (Estimated completion: January 2026)
- We are validating our maintenance procedures and adding prechecks for Heating, Ventilation, and Air Conditioning (HVAC) subsystems. (Estimated completion: January 2026)
- We are optimizing recovery steps for compute and storage systems to reduce overall restoration time. (Estimated completion: January 2026)
- We are improving automated correlation between environmental signals and service health to trigger earlier and more targeted escalations. (Estimated completion: January 2026)
- We are updating our cooling unit commissioning playbooks to expand the range of power sag and swell scenarios covered. (Estimated completion: February 2026)
- In the longer term, we are implementing a new approach to accelerate service restoration after major incidents. Historically, recovery prioritized full data integrity checks before bringing services online, which extended downtime because it was a serial process. Going forward, integrity checks will run in the background during VM and disk mount operations, allowing availability to return much sooner without compromising data safety. This change is a direct outcome of this incident and represents a significant improvement in recovery speed. (Estimated completion: April 2026)
How can customers make incidents like this less impactful?
- Consider using Availability Zones (AZs) to run your services across physically separate locations within an Azure region. To help services be more resilient to datacenter-level failures like this one, each AZ provides independent power, networking, and cooling. Many Azure services support zonal, zone-redundant, and/or always-available configurations: https://docs.microsoft.com/azure/availability-zones/az-overview
- Plan for regional redundancy by using active/active or failover designs across paired regions.
- Maintain regular backups and test failover procedures to ensure recovery readiness.
- Increase resilience by using retry logic and backoff strategies to prevent overload during recovery.
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
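As an illustration of the retry and backoff recommendation above, the following is a minimal client-side sketch in Python of exponential backoff with full jitter. It is not taken from any Azure SDK: `TransientError`, `backoff_delays`, and `call_with_retry` are hypothetical names, and the default attempt count, base delay, and cap are assumptions to tune per workload. Many client libraries already ship their own retry policies, which are usually preferable where available.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def backoff_delays(max_attempts, base=1.0, cap=60.0, rng=random.random):
    """Yield one delay per retry, drawn from [0, min(cap, base * 2**n)).

    Randomizing the delay ("full jitter") spreads out clients so they do not
    all retry at the same instant while a service is still recovering.
    """
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retry(operation, max_attempts=5, base=1.0, cap=60.0,
                    sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on TransientError up to max_attempts in total."""
    delays = backoff_delays(max_attempts, base, cap, rng)
    while True:
        try:
            return operation()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                # Retry budget exhausted: surface the last failure to the caller.
                raise
            sleep(delay)
```

The cap bounds the longest wait, and the attempt limit ensures that a caller eventually fails rather than retrying indefinitely against a dependency that is still being restored.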
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/2LGD-9VG