26 September 2025
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/BT6W-FX0
What happened?
Between 23:54 UTC on 26 September 2025 and 21:59 UTC on 27 September 2025, a platform issue impacted customers in Switzerland North, who may have experienced service availability issues for resources hosted in the region. Many customers were mitigated by 04:00 UTC on 27 September 2025, while those affected by residual impact that remained after the initial mitigation steps experienced full mitigation as late as 21:59 UTC on 27 September 2025.
Impacted services included Azure API Management, Azure App Service, Azure Application Gateway, Azure Application Insights, Azure Cache for Redis, Azure Cosmos DB, Azure Data Explorer, Azure Database for PostgreSQL, Azure Databricks, Azure Firewall, Azure Kubernetes Service, Azure Storage, Azure Synapse Analytics, Azure Backup, Azure Batch, Azure Data Factory, Azure Log Analytics, Azure SQL Database, Azure SQL Managed Instance, Azure Site Recovery, Azure Stream Analytics, Azure Virtual Machine Scale Sets (VMSS), Azure Virtual Machines (VMs), and Microsoft Sentinel. Other services reliant on these may have also been impacted.
What went wrong and why?
A planned configuration change was made to the certificates used to authorize communication within the software load balancer infrastructure that routes traffic for services in the Switzerland North region. One of the new certificates contained a malformed value that was not identified during the validation process. In addition, this configuration change went through an expedited deployment path, which unexpectedly applied it across multiple availability zones in the region without evaluating system health safeguards.
This type of planned change is intended to have no impact but, due to the malformed certificate, the load balancer could not complete authorization of connectivity between nodes and resources. This caused significant impact to storage resources in the region. When a VM loses connectivity to its underlying virtual disk for an extended period, the platform detects the issue and safely shuts down the VM until the connection is re-established, to prevent data corruption. The impact was magnified by the many services that depend on these storage and compute resources.
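As an illustration only, the sketch below shows the kind of pre-deployment check that can reject a malformed certificate value before it is rolled out. It assumes PEM-encoded certificates and uses the open-source 'cryptography' Python package; the file name and checks are hypothetical and do not represent the validation tooling used by the Azure platform.

```python
# Illustrative pre-deployment certificate validation (hypothetical tooling,
# not the actual Azure load balancer deployment pipeline).
# Requires the 'cryptography' package, version 42 or later.
from datetime import datetime, timezone

from cryptography import x509
from cryptography.hazmat.primitives import hashes


def validate_certificate(pem_bytes: bytes) -> None:
    """Raise ValueError if the certificate is malformed or unusable."""
    try:
        cert = x509.load_pem_x509_certificate(pem_bytes)
    except ValueError as exc:
        # A malformed value (bad encoding, truncated field, etc.) fails here,
        # before the change is ever pushed to production.
        raise ValueError(f"Certificate failed to parse: {exc}") from exc

    now = datetime.now(timezone.utc)
    if now < cert.not_valid_before_utc or now > cert.not_valid_after_utc:
        raise ValueError("Certificate is not within its validity window")

    if not cert.subject.rfc4514_string():
        raise ValueError("Certificate has an empty subject")

    thumbprint = cert.fingerprint(hashes.SHA256()).hex()
    print(f"OK: {cert.subject.rfc4514_string()} (SHA-256 {thumbprint[:16]}...)")


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    with open("new-lb-auth-cert.pem", "rb") as handle:
        validate_certificate(handle.read())
```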
We corrected the certificate error and deployed the fix to re-establish load balancer connectivity within the region. Most affected resources recovered automatically once connectivity was restored. However, a subset of resources (such as SQL Managed Instance VMs, VMs using Trusted Launch configurations, some Service Fabric Managed Cluster based services, and services that depend on them) did not resume operation automatically and remained in a stopped or unhealthy state.
- For SQL Managed Instance VMs, the VMs remained unhealthy after load balancer connectivity was restored. This was due to a race condition during VM startup, in which service-specific operations customized for this type of VM did not complete in the expected sequence. Each impacted VM required its services to be restarted.
- For VMs using Trusted Launch configurations, which store guest state in remote blobs, a unique failure mode was experienced. The incident generated a shutdown signal of the same type used when customers intentionally reboot their VMs. The platform misinterpreted these events as customer-initiated and did not trigger automated recovery. As a result, these VMs remained in a stopped state and required explicit manual intervention to restart.
- For Service Fabric Managed Cluster based services, VM provisioning depends on a custom extension that must synchronously reach a storage account. For a subset of resources, this step timed out due to transient failures (a generic retry pattern that tolerates such failures is sketched below). Once the VMs were properly restarted, these resources recovered.
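As a hedged sketch of the failure mode described in the last bullet, the snippet below wraps a provisioning step in bounded, jittered exponential backoff, so that a transient failure reaching a storage endpoint does not cause the whole operation to time out. The function names and endpoint are hypothetical and do not represent the actual Service Fabric extension code.

```python
# Illustrative retry-with-backoff wrapper for a provisioning step that must
# reach a storage endpoint (hypothetical names; not the actual extension code).
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(step: Callable[[], T],
                      attempts: int = 5,
                      base_delay: float = 1.0,
                      max_delay: float = 30.0) -> T:
    """Run `step`, retrying transient failures with jittered exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except (TimeoutError, ConnectionError) as exc:
            if attempt == attempts:
                raise  # retries exhausted; surface the failure to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay *= random.uniform(0.5, 1.5)  # jitter to avoid synchronized retries
            print(f"Transient failure ({exc}); retrying in {delay:.1f}s "
                  f"(attempt {attempt}/{attempts})")
            time.sleep(delay)


def sync_guest_state_from_storage() -> str:
    """Hypothetical provisioning step that reads state from a storage account."""
    # Real code would perform an authenticated request against the storage endpoint.
    raise TimeoutError("storage endpoint did not respond in time")


if __name__ == "__main__":
    try:
        call_with_backoff(sync_guest_state_from_storage)
    except TimeoutError:
        print("Provisioning step failed after all retries")
```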
How did we respond?
- 23:54 UTC on 26 September 2025 – Customer impact began, triggered by the configuration change.
- 00:08 UTC on 27 September 2025 – The issue was detected via our automated monitoring.
- 00:12 UTC on 27 September 2025 – Our Azure Storage and Networking engineering teams commenced their investigation.
- 02:33 UTC on 27 September 2025 – We performed steps to revert the configuration change for the Switzerland North region.
- 03:40 UTC on 27 September 2025 – The certificate was successfully reverted for the region.
- 04:00 UTC on 27 September 2025 – Validation of recovery was confirmed for the majority of impacted services, but a subset of resources remained unhealthy due to residual impact. We performed an extended health evaluation to confirm which remaining resources required recovery, and investigated the steps needed to recover from the residual impact.
- 06:19 UTC on 27 September 2025 – We continued working to recover the remaining residual impact, which included developing and validating new scripts to start resources that did not recover automatically.
- 11:24 UTC on 27 September 2025 – We performed the checks needed to validate safe recovery actions for the different types of affected resources.
- 14:15 UTC on 27 September 2025 – Recovery operations were performed for the residual impact, including executing mitigation scripts and performing steps to safely reboot and recover the remaining resources.
- 16:15 UTC on 27 September 2025 – Partial recovery of the residual impact was achieved. Additional operations to recover the remaining impact included properly restarting resources that were still in a stopped state.
- 21:59 UTC on 27 September 2025 – Residual impact recovery activities and validation completed, confirming full recovery of all impacted services and customers.
How are we making incidents like this less likely or less impactful?
- We have implemented additional auditing in our deployment systems, designed to de-risk issues like this one. (Completed)
- We have extensively evaluated our deployment pipelines and removed expedited pipelines that could result in this class of issue. (Completed)
- We have improved our deployment safety measures for such changes, with additional automated rollback capabilities at earlier stages. (Completed)
- We are improving our monitoring signals to better understand the health posture of different resources and to expedite recovery. (Estimated completion: November 2025)
- In the longer term, we are investing in improvements to our recovery processes for residual impact after events like this. In particular, we will harden the custom extension provisioning processes and improve the resiliency of resource-specific startup operations, to avoid requiring manual intervention.
How can customers make incidents like this less impactful?
- For mission-critical workloads, customers should consider a multi-region geodiversity strategy to avoid impact from incidents like this one, which affected a single region: https://learn.microsoft.com/azure/well-architected/design-guides/regions-availability-zones#deployment-approach-4-multi-region-deployments and https://learn.microsoft.com/azure/well-architected/design-guides/disaster-recovery
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/BT6W-FX0