
September 2024

27 September 2024

Join our upcoming 'Azure Incident Retrospective' livestream about this incident, or watch on demand:

What happened?

Between 14:47 and 16:25 EDT on 27 September 2024, a platform issue impacted the Azure Resource Manager (ARM) service in our Azure US Government regions. Impacted customers may have experienced failures when attempting control plane operations – including create, update, and delete operations – on resources in the Azure Government cloud, across the US Gov Arizona, US Gov Texas, and/or US Gov Virginia regions.

The incident also impacted downstream services dependent on ARM including Azure App Service, Azure Application Insights, Azure Automation, Azure Backup, Azure Data Factory, Azure Databricks, Azure Event Grid, Azure Kubernetes Service, Azure Log Search Alerts, Azure Maps, Azure Monitor, Azure NetApp Files, Azure portal, Azure Privileged Identity Management, Azure Red Hat OpenShift, Azure Search, Azure Site Recovery, Azure Storage, Azure Synapse, Azure Traffic Manager, Azure Virtual Desktop, and Microsoft Purview.

What went wrong and why?

This issue was initially flagged by our monitors detecting success rate failures. Upon investigating, we discovered that a backend Cosmos DB account had been misconfigured in a way that blocked legitimate access from Azure Resource Manager (ARM), preventing ARM from serving requests. Although Cosmos DB accounts can be replicated to multiple regions, their configuration settings are global – so when this account was misconfigured, the change applied immediately to all replicas, impacting ARM in every region. Once this was understood, the misconfiguration was reverted, which fully mitigated all customer impact.

The misconfigured account is used to track Azure Feature Exposure Control state. Managing feature state is handled by a resource provider, for which using Entra authentication is appropriate. However, feature state is also a critical part of processing requests into ARM itself, since call behavior may change depending on which feature flags a subscription is assigned.

The Cosmos DB account managing this state had been incorrectly attributed to the resource provider platform, rather than to core ARM processing. Since the resource provider platform had completed its migration to Entra authorization for Cosmos DB accounts, the service was disabling local authentication on all of the accounts it owns, as part of our Secure Future Initiative – for more details, see:

Because the account in question, used to track the exposure control state of Azure features, was misattributed to the resource provider platform, it was included in that update. ARM itself, however, deliberately avoids using Entra authentication for its Cosmos DB access, because Entra depends on ARM and using it would create a circular dependency – so disabling local authentication on this account blocked ARM's access.
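
For context on the mechanism involved: on Azure Cosmos DB accounts, key-based ("local") authentication is governed by the account-level disableLocalAuth property. The sketch below – using hypothetical subscription and account names, and not the internal tooling involved in this incident – shows how that setting can be inspected through the ARM REST API.

```python
# Illustrative sketch: read the disableLocalAuth setting of a Cosmos DB account
# through the ARM REST API. All resource names below are hypothetical.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION = "00000000-0000-0000-0000-000000000000"   # hypothetical
RESOURCE_GROUP = "example-rg"                           # hypothetical
ACCOUNT = "example-cosmos-account"                      # hypothetical
ARM = "https://management.azure.com"
API_VERSION = "2023-04-15"  # assumed GA Cosmos DB management API version

def arm_headers() -> dict:
    # DefaultAzureCredential resolves managed identity, Azure CLI login, etc.
    token = DefaultAzureCredential().get_token(f"{ARM}/.default")
    return {"Authorization": f"Bearer {token.token}"}

def local_auth_disabled() -> bool:
    url = (
        f"{ARM}/subscriptions/{SUBSCRIPTION}/resourceGroups/{RESOURCE_GROUP}"
        f"/providers/Microsoft.DocumentDB/databaseAccounts/{ACCOUNT}"
    )
    resp = requests.get(url, headers=arm_headers(), params={"api-version": API_VERSION})
    resp.raise_for_status()
    # True means key-based (local) authentication is rejected and only
    # Entra ID tokens are accepted by the account.
    return bool(resp.json()["properties"].get("disableLocalAuth", False))

if __name__ == "__main__":
    print("disableLocalAuth:", local_auth_disabled())
```

A corresponding PATCH of properties.disableLocalAuth is the kind of account-level change that, as this incident shows, needs to be gated carefully for accounts on critical request paths.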

How did we respond?

  • 27 September 2024 @ 14:47 EDT - Customer impact began, triggered by a misconfiguration applied to the underlying Cosmos DB.
  • 27 September 2024 @ 14:55 EDT - Monitoring detected success rate failures, on-call teams engaged to investigate.
  • 27 September 2024 @ 15:45 EDT - Investigations confirmed the cause as the Cosmos DB misconfiguration.
  • 27 September 2024 @ 16:25 EDT - Customer impact mitigated, by reverting the misconfiguration.

How are we making incidents like this less likely or less impactful?

  • We are moving the impacted Cosmos DB account to an isolated subscription, to de-risk this failure mode. (Estimated completion: October 2024)
  • Furthermore, we will apply management locks to prevent edits to this specific account, given its criticality – a minimal sketch of such a lock follows this list. (Estimated completion: October 2024)
  • In the longer term, our Cosmos DB team are updating processes to roll out account configuration changes in a staggered fashion, to de-risk impact from changes like this one. (Estimated completion: January 2025)
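
As a minimal sketch of the management-lock repair item above – with hypothetical resource names, and using the public Microsoft.Authorization/locks API rather than any internal tooling – a ReadOnly lock can be applied to a critical account like this:

```python
# Illustrative sketch: place a ReadOnly management lock on a critical resource
# via the Microsoft.Authorization/locks ARM API. Names are hypothetical.
import requests
from azure.identity import DefaultAzureCredential

RESOURCE_ID = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/example-rg"
    "/providers/Microsoft.DocumentDB/databaseAccounts/example-cosmos-account"
)
LOCK_NAME = "deny-config-edits"   # hypothetical lock name
API_VERSION = "2016-09-01"        # management lock API version

def create_readonly_lock() -> dict:
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default")
    url = (
        f"https://management.azure.com{RESOURCE_ID}"
        f"/providers/Microsoft.Authorization/locks/{LOCK_NAME}"
    )
    body = {
        "properties": {
            # ReadOnly blocks updates and deletes; CanNotDelete blocks deletes only.
            "level": "ReadOnly",
            "notes": "Critical dependency - remove the lock before any config change.",
        }
    }
    resp = requests.put(
        url,
        headers={"Authorization": f"Bearer {token.token}"},
        params={"api-version": API_VERSION},
        json=body,
    )
    resp.raise_for_status()
    return resp.json()
```

Note that a ReadOnly lock also blocks some management-plane POST actions (such as listing keys) on certain resource types, so the exact lock level used for this repair item is an assumption here.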

How can customers make incidents like this less impactful?

There was nothing that customers could have done to prevent or minimize the impact from this specific ARM incident.

The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact:

Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more:
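
As a minimal sketch of that guidance: a Service Health alert is an activity log alert scoped to the ServiceHealth event category, wired to an existing action group (email, SMS, webhook, and so on). All names below are hypothetical, and the action group is assumed to already exist.

```python
# Illustrative sketch: create an Azure Service Health alert via the ARM REST API.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION = "00000000-0000-0000-0000-000000000000"   # hypothetical
RESOURCE_GROUP = "alerts-rg"                            # hypothetical
ALERT_NAME = "service-health-alert"                     # hypothetical
ACTION_GROUP_ID = (                                     # hypothetical, must already exist
    f"/subscriptions/{SUBSCRIPTION}/resourceGroups/{RESOURCE_GROUP}"
    "/providers/microsoft.insights/actionGroups/oncall-notifications"
)
API_VERSION = "2020-10-01"  # assumed current activity log alert API version

def create_service_health_alert() -> dict:
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default")
    url = (
        f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
        f"/resourceGroups/{RESOURCE_GROUP}"
        f"/providers/microsoft.insights/activityLogAlerts/{ALERT_NAME}"
    )
    body = {
        "location": "Global",
        "properties": {
            "enabled": True,
            "scopes": [f"/subscriptions/{SUBSCRIPTION}"],
            # Fire on any ServiceHealth event: incidents, planned maintenance, advisories.
            "condition": {"allOf": [{"field": "category", "equals": "ServiceHealth"}]},
            "actions": {"actionGroups": [{"actionGroupId": ACTION_GROUP_ID}]},
        },
    }
    resp = requests.put(
        url,
        headers={"Authorization": f"Bearer {token.token}"},
        params={"api-version": API_VERSION},
        json=body,
    )
    resp.raise_for_status()
    return resp.json()
```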

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

16 September 2024

Watch our 'Azure Incident Retrospective' video about this incident: 

What happened?

Between 18:46 UTC and 20:36 UTC on 16 September 2024, a platform issue impacted a subset of customers using Azure Virtual Desktop in several US regions – specifically, those whose Azure Virtual Desktop configuration and metadata are stored in the US geography, which we host in the East US 2 region. Impacted customers may have experienced failures to access their list of available resources, make new connections, or perform management actions on their Azure Virtual Desktop resources.

This event impacted customers in the following US regions: Central US, East US, East US 2, North Central US, South Central US, West Central US, West US, West US 2, and West US 3. End users with connections established via any of these US regions may also have been impacted. This connection routing is determined dynamically, to route end users to their nearest Azure Virtual Desktop gateway. More information on this process can be found at:

What went wrong and why?

Azure Virtual Desktop (AVD) relies on multiple technologies including SQL Database to store configuration data. Each AVD region has a database with multiple secondary copies. We replicate data from our primary copy to these secondary copies using transaction logs, which ensures that all changes are logged and applied to the replicas. Collectively, these replicas house configuration data that is needed when connecting, managing, and troubleshooting AVD.

We determined that this incident was caused by a degradation in the redo process on one of the secondary copies serving the US geography, which caused failures to access service resources. As the workload picked up on Monday, the redo subsystem on the readable secondary began stalling while processing specific logical log records (related to transaction lifetime). We also observed a spike in read latency when fetching the next set of log records to be redone. We have identified an improvement and a bug fix in this area, which are being prioritized, along with critical detection and mitigation alerts.
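
The redo telemetry described above is internal to the platform, but customers running their own geo-replicated Azure SQL databases can watch for the analogous symptom – growing replication lag – with the sys.dm_geo_replication_link_status DMV. A minimal sketch, assuming pyodbc, placeholder credentials, and hypothetical server and database names:

```python
# Illustrative sketch: report geo-replication lag for an Azure SQL database,
# queried from the primary. Connection details below are placeholders.
import pyodbc

CONN_STR = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:example-primary.database.windows.net,1433;"
    "Database=example-db;"
    "UID=example_admin;PWD=<password>;"   # placeholder credentials
    "Encrypt=yes;"
)

QUERY = """
SELECT partner_server,
       partner_database,
       replication_state_desc,
       replication_lag_sec,
       last_replication
FROM sys.dm_geo_replication_link_status;
"""

def check_geo_replication_lag(threshold_sec: int = 300) -> None:
    with pyodbc.connect(CONN_STR) as conn:
        for row in conn.cursor().execute(QUERY):
            lag = row.replication_lag_sec
            flag = "  <-- above threshold" if lag and lag > threshold_sec else ""
            print(
                f"{row.partner_server}/{row.partner_database}: "
                f"{row.replication_state_desc}, lag={lag}s{flag}"
            )

if __name__ == "__main__":
    check_geo_replication_lag()
```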

How did we respond?

At 18:46 UTC on 16 September 2024, our internal telemetry system raised alarms that the redo log on one of our read replicas was several hours behind. Our engineering team immediately started investigating this anomaly. To expedite mitigation, we manually failed over the database to its secondary region.

  • 18:46 UTC on 16 September 2024 – Customer impact began. Service monitoring detected that the diagnostics service was behind when processing events.
  • 18:46 UTC on 16 September 2024 – We determined that a degradation in the US geo database had caused failures to access service resources.
  • 18:57 UTC on 16 September 2024 – Automation attempted to perform a ‘friendly’ geo failover, to move the database out of the region.
  • 19:02 UTC on 16 September 2024 – The friendly geo failover operation failed since it could not synchronize data between the geo secondary and the unavailable geo primary.
  • 19:59 UTC on 16 September 2024 – To mitigate the issue fully, our engineering team manually started a forced geo-failover of the underlying database to the Central US region.
  • 20:00 UTC on 16 September 2024 – Geo-failover operation completed.
  • 20:36 UTC on 16 September 2024 – Customer impact mitigated. Engineering telemetry confirmed that connectivity, application discovery, and management issues with the service had been resolved.
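
For context, the ‘friendly’ versus ‘forced’ geo-failover distinction in the timeline above maps, for customers who use active geo-replication on their own Azure SQL databases, to two T-SQL commands run against the master database of the secondary server. A minimal sketch with hypothetical names – this is not AVD's internal tooling:

```python
# Illustrative sketch: planned ("friendly") vs forced geo-failover of an Azure
# SQL database using active geo-replication. Run against the SECONDARY server.
import pyodbc

SECONDARY_MASTER_CONN = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:example-secondary.database.windows.net,1433;"
    "Database=master;"
    "UID=example_admin;PWD=<password>;"   # placeholder credentials
    "Encrypt=yes;"
)

def geo_failover(database: str, forced: bool = False) -> None:
    # FAILOVER waits for full synchronization (no data loss); the forced variant
    # does not wait, and is the last resort when the primary is unreachable.
    command = (
        f"ALTER DATABASE [{database}] FORCE_FAILOVER_ALLOW_DATA_LOSS;"
        if forced
        else f"ALTER DATABASE [{database}] FAILOVER;"
    )
    conn = pyodbc.connect(SECONDARY_MASTER_CONN, autocommit=True)
    try:
        conn.execute(command)
    finally:
        conn.close()

# Example: geo_failover("example-db", forced=True)
```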

How are we making incidents like this less likely or less impactful?

  • First and foremost, we will improve our detection of high replication latencies to secondary replicas, to identify problems like this one more quickly. (Estimated completion: October 2024)
  • We will improve our troubleshooting guidelines by incorporating mechanisms to perform mitigations without impact, as needed, to bring customers back online more quickly. (Estimated completion: October 2024)
  • Our AVD and SQL teams will partner together on updating the SQL Persisted Version Store (PVS) growth troubleshooting guideline, by incorporating this new scenario. (Estimated completion: October 2024)
  • To address this particular contention issue in the Redo pipeline, our SQL team has identified a potential new feature (deploying in October 2024) as well as a repair item (which will be completed by December 2024) to de-risk the likelihood of redo lag.
  • We will incorporate improved read log generation throttling of non-transactional log records like PVS inserts, which will give us another potential mitigation to help secondary replicas catch up. (Estimated completion: October 2024)
  • Finally, we will work to update our database architecture so that telemetry and control (aka metadata) are stored in separate databases. (Estimated completion: October 2024)

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

5 September 2024

Watch our 'Azure Incident Retrospective' video about this incident:

What happened?

Between 08:00 and 21:30 CST on 5 September 2024, customers in the Azure China regions experienced delays in long-running control plane operations, including create, update, and delete operations on resources. Although these delays affected all Azure China regions, the most significant impact was in China North 3. These delays resulted in various issues for customers and services utilizing Azure Resource Manager (ARM), including timeouts and operation failures.

  • Azure Databricks – Customers may have encountered errors or failures while submitting job requests.
  • Azure Data Factory – Customers may have experienced Internal Server Errors while running Data Flow Activity or acquiring Data Flow Debug Sessions.
  • Azure Database for MySQL – Customers making create/update/delete operations to Azure Database for MySQL would not have seen requests completing as expected.
  • Azure Event Hubs - Customers attempting to perform read or write operations may have seen slow response times.
  • Azure Firewall - Customers making changes or updating their policies in Azure Firewall may have experienced delays in the updates being completed.
  • Azure Kubernetes Service - Customers making Cluster Management operations (such as scaling out, updates, creating/deleting clusters) would not have seen requests completing as expected.
  • Azure Service Bus - Customers attempting to perform read or write operations may have seen slow response times.
  • Microsoft Purview - Customers making create/update/delete operations to Microsoft Purview accounts/resources would not have seen requests completing as expected.
  • Other services leveraging Azure Resource Manager - Customers may have experienced service management operation failures if using services inside of Resource Groups in China North 3.

What went wrong and why?

Azure Resource Manager is the control plane Azure customers use to perform CRUD (create, read, update, delete) operations on their resources. For long-running operations, such as creating a virtual machine, there is background processing of the operation that needs to be tracked. During a certificate rotation, the processes tracking these background operations were repeatedly crashing. While background jobs were still being processed, these crashes caused significant delays and an increasing backlog of jobs. As background jobs may contain sensitive data, their state is encrypted at the application level. When the certificate used to encrypt the jobs is rotated, any in-flight jobs are re-encrypted with the new certificate. The certificate rotation in the Azure China regions exposed a latent bug in this update process, which caused the processes to begin crashing.
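
To illustrate the rotation concept only – this is not ARM's actual implementation – the sketch below uses the 'cryptography' package's Fernet primitives as a stand-in for application-level encryption of job state. It shows how in-flight items are re-encrypted under a new key while older ciphertext remains readable, and why the re-encryption step needs to be defensive so that one bad record cannot crash the workers processing the backlog.

```python
# Illustrative sketch: key rotation with re-encryption of persisted job state.
from cryptography.fernet import Fernet, InvalidToken, MultiFernet

# Existing key material and a job encrypted under it.
old_key = Fernet.generate_key()
job_ciphertext = Fernet(old_key).encrypt(b'{"operation": "createVM", "state": "running"}')

# Rotation: introduce a new primary key, keeping the old one for decryption.
new_key = Fernet.generate_key()
ring = MultiFernet([Fernet(new_key), Fernet(old_key)])

def reencrypt(ciphertext: bytes) -> bytes | None:
    """Re-encrypt one in-flight job under the new primary key."""
    try:
        return ring.rotate(ciphertext)
    except InvalidToken:
        # Quarantine undecryptable records instead of crashing the worker,
        # so a single bad record cannot stall the whole job backlog.
        return None

rotated = reencrypt(job_ciphertext)
assert rotated is not None
assert ring.decrypt(rotated) == b'{"operation": "createVM", "state": "running"}'
```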

How did we respond?

  • 08:00 CST on 5 September 2024 – Customer impact began.
  • 11:00 CST on 5 September 2024 – We attempted to increase the worker instance count to mitigate the incident, but were not successful.
  • 13:05 CST on 5 September 2024 – We continued to investigate to find contributing factors leading to impact.
  • 16:02 CST on 5 September 2024 – We identified a correlation between specific release versions and increased error processing latency, and determined a safe rollback version.
  • 16:09 CST on 5 September 2024 – We began rolling back to the previous known good build for one component in China East 3, to validate our mitigation approach.
  • 16:55 CST on 5 September 2024 – We started to see indications that the mitigation was working as intended.
  • 17:36 CST on 5 September 2024 – We confirmed mitigation for China East 3 and China North 3, and began rolling back to the safe version in other regions using a safe deployment process which took several hours.
  • 21:30 CST on 5 September 2024 – The rollback was completed and, after further monitoring, we were confident that service functionality had been fully restored, and customer impact mitigated. 

How are we making incidents like this less likely or less impactful?

  • We have fixed the latent code bug related to certificate rotation. (Completed)
  • We are improving our monitoring and telemetry related to process crashing in the Azure China regions. (Completed)
  • Finally, we are moving the background job encryption process to a newer technology that supports fully automated key rotation, to help minimize the potential for impact in the future. (Estimated completion: October 2024)

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: 

August 2024

5 August 2024

Watch our 'Azure Incident Retrospective' video about this incident:

What happened?

Between 18:22 and 19:49 UTC on 5 August 2024, some customers experienced intermittent connection errors, timeouts, or increased latency while connecting to Microsoft services that leverage Azure Front Door (AFD), because of an issue that impacted multiple geographies. Impacted services included Azure DevOps (ADO) and Azure Virtual Desktop (AVD), as well as a subset of LinkedIn and Microsoft 365 services. This incident specifically impacted Microsoft’s first-party services, and our customers who utilized these Microsoft services that rely on AFD. For clarity, this incident did not impact Azure customers’ services that use AFD directly.

What went wrong and why?

Azure Front Door (AFD) is Microsoft's scalable platform for web acceleration, global load balancing, and content delivery, operating in over 200 locations worldwide and serving more than 20 million requests per second. This incident was triggered by an internal service team making a routine configuration change to their AFD profile, to modify an internal-only AFD feature. This change included an erroneous configuration file that resulted in high memory allocation rates on a subset of the AFD servers – specifically, servers that deliver traffic for an internal Microsoft customer profile.

While the rollout of the erroneous configuration file did adhere to our Safe Deployment Practices (SDP), this process relies on validation checks from the service to which the change is being applied (in this case, AFD) before proceeding to additional regions. In this situation there were no observable failures, so the configuration change was allowed to proceed. This represents a gap in how AFD validates such changes, and it prevented a timely auto-rollback of the offending configuration – expanding the blast radius of the impact to more regions.

The error in the configuration file caused resource exhaustion on AFD’s frontend servers, resulting in an initial moderate impact to AFD’s availability and performance. This initial availability impact triggered a latent bug in the internal service’s client application, which caused an aggressive request storm – a 25X increase in the volume of expensive requests – that significantly worsened AFD’s availability and performance.
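
To illustrate the client-side failure mode only – this is a generic sketch, not the internal client involved in the incident – a retry budget combined with exponential backoff and jitter is the kind of safeguard that keeps a degraded dependency from being hit with a request storm:

```python
# Illustrative sketch: cap retries with a budget plus exponential backoff and
# jitter, so client retries cannot amplify load on an already-degraded service.
import random
import time

class RetryBudget:
    """Allow at most `ratio` retries per original request."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def can_retry(self) -> bool:
        return self.retries < max(1, int(self.requests * self.ratio))

def call_with_budget(send, budget: RetryBudget, max_attempts: int = 3):
    """`send` is any zero-argument callable that raises on failure."""
    budget.requests += 1
    delay = 0.5
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except Exception:
            if attempt == max_attempts or not budget.can_retry():
                raise                                          # stop amplifying load
            budget.retries += 1
            time.sleep(delay * random.uniform(0.5, 1.5))       # backoff with jitter
            delay *= 2
```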

How did we respond? 

During our investigation, we identified the offending configuration from the internal service that triggered this event, and we then initiated a rollback of the configuration change to fully restore all impacted Microsoft services.

  • 18:22 UTC – Customer impact began, triggered by the configuration change.
  • 18:25 UTC – Internal monitoring detected the issue and alerted our teams to investigate.
  • 18:27 UTC – First-party services began failing away from AFD as a temporary mitigation.
  • 18:53 UTC – We identified the configuration change that caused the impact and started preparing to roll it back.
  • 19:23 UTC – We began rolling back the change.
  • 19:25 UTC – Rollback of the change completed, restoring AFD request serving capacity.
  • 19:49 UTC – Customer impact fully mitigated.
  • 20:30 UTC – First-party services began routing traffic back to AFD after mitigation was declared.

How are we making incidents like this less likely or less impactful?

  • As an immediate repair item, we are fixing the bug in the internal service team’s client application that caused the aggressive request storm. This development work has completed and the fix is pending rollout. (Estimated completion: August 2024)
  • We are making improvements to validate the complexity and resource overhead requirements for all internal customer configuration updates. (Estimated completion: September 2024)
  • We are implementing enhancements to the AFD data plane to automatically disable performance-intensive operations for specific configurations. (Estimated completion: October 2024)
  • We are enhancing our protection against similar request storms, by dynamically identifying resource-intensive requests so that they can be throttled to prevent further impact (a generic throttling sketch follows this list). (Estimated completion: December 2024)
  • In the longer term, we are augmenting our safe deployment processes for configuration delivery to incorporate more signals for detecting AFD service health degradations caused by erroneous configurations, and to roll back automatically. (Estimated completion: March 2025)
  • Finally, as an additional protection measure to guard against validation gaps, we are investing in a capability to automatically fall back to a last known good configuration state, in the event of global anomalies detected in AFD’s service health – for example, in response to a multi-region availability drop. (Estimated completion: March 2025)
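
As a generic sketch of the throttling idea referenced above – not AFD's data plane – a token bucket can cap the rate at which requests classified as resource-intensive are accepted:

```python
# Illustrative sketch: token-bucket throttling for expensive request classes.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

expensive_requests = TokenBucket(rate_per_sec=100, burst=200)  # hypothetical limits

def handle(request, is_expensive: bool) -> int:
    if is_expensive and not expensive_requests.allow():
        return 429      # Too Many Requests: reject rather than exhaust resources
    # ... process the request ...
    return 200
```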

How can customers make incidents like this less impactful?

  • Applications that use exponential backoff in their retry strategy may have seen success, since an immediate retry during an interval of high packet loss would likely also have encountered high packet loss, whereas a retry conducted during a period of lower loss would likely have succeeded. A minimal backoff sketch follows this list; for more details on retry patterns, refer to:
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: 
  • Finally, ensure that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These alerts can trigger emails, SMS, push notifications, webhooks, and more: 
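
As a minimal sketch of the retry guidance above, assuming the 'requests' library with urllib3's Retry helper (urllib3 1.26 or later for the allowed_methods argument) and a hypothetical endpoint:

```python
# Illustrative sketch: an HTTP session that retries transient failures with
# exponential backoff, rather than failing on the first dropped connection.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session() -> requests.Session:
    retry = Retry(
        total=5,                  # overall cap on retries
        connect=5,                # retry connection failures (timeouts, resets)
        read=2,
        backoff_factor=0.5,       # sleeps roughly 0.5s, 1s, 2s, 4s between attempts
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=frozenset({"GET", "HEAD"}),   # retry idempotent calls only
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

# Example (hypothetical endpoint):
# make_session().get("https://example-endpoint.azurefd.net/", timeout=10)
```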

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: 

July 2024

30 July 2024

Watch our 'Azure Incident Retrospective' video about this incident: 

What happened?

Between 11:45 and 13:58 UTC on 30 July 2024, a subset of customers experienced intermittent connection errors, timeouts, or latency spikes while connecting to Microsoft services that leverage Azure Front Door (AFD) and Azure Content Delivery Network (CDN). From 13:58 to 19:43 UTC, a smaller set of customers continued to observe a low rate of connection timeouts. Beyond AFD and CDN, downstream services that rely on these were also impacted – including the Azure portal, and a subset of Microsoft 365 and Microsoft Purview services.

After a routine mitigation of a Distributed Denial-of-Service (DDoS) attack, a network misconfiguration caused congestion and packet loss for AFD frontends. For context, we experience an average of 1,700 DDoS attacks per day – these are mitigated automatically by our DDoS protection mechanisms. For this incident, the DDoS attack was merely a trigger event. Customers can learn more about how we manage these events here:

What went wrong and why?

Azure Front Door (AFD) is Microsoft's scalable platform for web acceleration, global load balancing, and content delivery, operating in nearly 200 locations worldwide – including datacenters within Azure regions, and edge sites. AFD and Azure CDN are built with platform defenses against network and application layer Distributed Denial-of-Service (DDoS) attacks. In addition, these services rely on the Azure network DDoS protection service for attacks at the network layer. You can read more about these protection mechanisms at:

Between 10:15 and 10:45 UTC, a volumetric distributed TCP SYN flood DDoS attack occurred at multiple Azure Front Door and CDN sites. This attack was automatically mitigated by the Azure Network DDoS protection service and had minimal customer impact. During this time period, our automated DDoS mitigations sent SYN authentication challenges, as is typical in the industry for mitigating DDoS attacks. As a result, a small subset of customers without retry logic in their application(s) may have experienced connection failures.

At around 11:45 UTC, as the Network DDoS protection service was disengaging and resuming default traffic routing to the Azure Front Door service, the network routes could not be updated within one specific site in Europe. This happened because of network control plane failures at that site, caused by a local power outage. Consequently, traffic inside Europe continued to be forwarded to AFD through our DDoS protection services, instead of returning directly to AFD. These control plane failures were not caused by or related to the initial DDoS trigger event – and in isolation, would not have caused any impact.

However, an unrelated latent network configuration issue caused traffic from outside Europe to be routed to the DDoS protection system within Europe. This led to localized congestion, which caused customers to experience high latency and connectivity failures across multiple regions. The vast majority of the impact was mitigated by 13:58 UTC, when we resolved the routing issue. As was the case during the initial period, a small subset of customers without proper retry logic in their application(s) may have experienced isolated connection failures until 19:43 UTC.

How did we respond?

  • 11:45 UTC on 30 July 2024 – Impact started
  • 11:47 UTC on 30 July 2024 – Our Azure Portal team detected initial service degradation and began to investigate.
  • 12:10 UTC on 30 July 2024 – Our network monitoring correlated this Portal incident to an underlying network issue at one specific site in Europe, and our networking engineers engaged to support the investigation.
  • 12:55 UTC on 30 July 2024 – We confirmed localized congestion, so engineers began executing our standard playbook to alleviate congestion – including rerouting traffic.
  • 13:13 UTC on 30 July 2024 – We published communications stating that we were investigating reports of issues connecting to Microsoft services globally, and that customers may experience timeouts connecting to Azure services.
  • 13:58 UTC on 30 July 2024 – The changes to reroute traffic successfully mitigated most of the impact by this time, after which the only remaining impact was isolated connection failures.
  • 16:15 UTC on 30 July 2024 – While investigating the isolated connection failures, we identified a device within Europe that was not properly obeying commands from the network control plane, and continued to attract traffic after being instructed to stop.
  • 16:58 UTC on 30 July 2024 – We instructed the network control plane to reissue its commands, but the problematic device was not accessible, as described above.
  • 17:50 UTC on 30 July 2024 – We started the safe removal of the device from the network and began scanning the network for other potential issues.
  • 19:32 UTC on 30 July 2024 – We completed the safe removal of the device from the network.
  • 19:43 UTC on 30 July 2024 – Customer impact mitigated, as we confirmed availability returned to pre-incident levels.

How are we making incidents like this less likely or less impactful?

  • We have already added the missing configuration on network devices, to ensure a DDoS mitigation issue in one geography cannot spread to another. (Completed)
  • We are enhancing our existing validation and monitoring in the Azure network, to detect invalid configurations. (Estimated completion: November 2024)
  • We are improving our monitoring to detect cases where our DDoS protection service is unreachable from the control plane but is still serving traffic. (Estimated completion: November 2024)

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: