This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details.
What happened?
Between 09:03 UTC and 18:50 UTC on 10 September 2025, an issue with the Allocator service, which interacts with our control plane, led to degraded service management operations for resources in two Availability Zones (physical AZ02 and AZ03) in the East US 2 region.
During this impact window, customers may have received error notifications and/or been unable to start, stop, scale, or deploy resources – resulting in failed allocations and service disruptions. The primary services affected were Virtual Machines (VMs) and Virtual Machine Scale Sets (VMSS) but, due to the impact on these underlying compute resources, services dependent on them were also affected – including but not limited to Azure Backup, Azure Batch, Azure Databricks, Azure Data Factory, Azure Kubernetes Service, Azure Synapse Analytics, and Microsoft Dev Box.
Allocation issues were also observed in AZ01 during the incident, but these were primarily due to existing capacity constraints in that zone. While those constraints existed prior to this event and remain ongoing, AZ01 was not impacted by the service management and Allocator service issues affecting AZ02 and AZ03. A subset of customers may continue to experience residual allocation issues due to these ongoing capacity constraints in the region.
What went wrong and why?
Azure Resource Manager (ARM) provides a management layer that enables customers to create, update and delete resources such as VMs. As part of this service management system, ARM depends on several VM management services in the Azure control plane, which in turn depend on an 'Allocator' service in each Availability Zone. The Allocator service, like all Azure control plane services, is designed to be resilient to transient issues through a combination of automated quality of service mechanisms like throttling, restarts, and failovers to healthy systems.
A change in the Allocator behavior was recently deployed into production. This change was designed to improve the way the Allocator service responds to heavy traffic, by throttling requests through a different pattern. This update had been applied to the majority of Azure regions and was working as intended without any warnings being observed.
This incident was initially triggered by a performance issue on a subset of machines running the Allocator service, which activated the new throttling behavior. However, demand in East US 2, a high-throughput region, caused VM provisioning service requests to retry more aggressively than the control plane could manage. This led to a surge in activity that made it harder for the system to recover smoothly.
Customer deployments were redirected from AZ03 to AZ02. However, this led to overload and throttling in AZ02, which compounded the issue as retry mechanisms further strained capacity, causing deployment success rates to drop sharply in both zones.
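As general context (and not a description of Azure's internal implementation), the failure mode described above is a classic retry storm: when a throttled dependency is retried aggressively and in lockstep by many callers, the retries themselves become the dominant load. The sketch below illustrates the usual client-side countermeasure, capped exponential backoff with jitter that honors any server-provided throttling hint; the `Throttled` exception and the `send_request` callable are hypothetical placeholders.

```python
import random
import time

class Throttled(Exception):
    """Raised by the (hypothetical) send_request callable on a throttled response (e.g. HTTP 429/503)."""
    def __init__(self, retry_after=None):
        super().__init__("request was throttled")
        self.retry_after = retry_after  # seconds suggested by the server, if any

def call_with_backoff(send_request, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Retry a throttled call with capped exponential backoff plus full jitter.

    Spreading retries out, rather than retrying immediately and simultaneously,
    is what keeps the retries from becoming a surge of additional load.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send_request()
        except Throttled as exc:
            if attempt == max_attempts:
                raise  # give up and surface the failure instead of retrying forever
            # Prefer the server's hint; otherwise back off exponentially with jitter.
            backoff = min(max_delay, base_delay * (2 ** (attempt - 1)))
            delay = exc.retry_after if exc.retry_after is not None else random.uniform(0, backoff)
            time.sleep(delay)
```

A retry budget or circuit breaker layered on top of this pattern further limits how much extra load a fleet of clients can generate while a dependency is degraded.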
To break the cycle, we halted new VM deployments in AZ03, allowing it to recover. Once stability was confirmed, deployments resumed in AZ03 and were simultaneously paused in AZ02. Engineers then applied further mitigations to address the backlogs of retries, including aggressive throttling and other service-specific actions.
Our initial communications strategy relied heavily on automated notifications to impacted customers, triggered by monitoring signals for each impacted service. While this system informed many customers quickly, it currently supports only a subset of services and scenarios. As a result, some impacted customers and services were not notified automatically and had to rely on manual communications sent later in the incident. As the scope of impact widened, we recognized that our communications were not reaching all affected customers. To address this, we sent broad targeted messaging to all customers running VMs in East US 2 – and ultimately published to the (public) Azure Status page, to improve visibility and reduce confusion. Finally, we acknowledge that early updates lacked detail around our investigation and recovery efforts.
How did we respond?
- 09:03 UTC on 10 September 2025 – Customer impact began in AZ03.
- 09:13 UTC on 10 September 2025 – Automated monitoring systems observed a spike in allocation failure rates for Virtual Machines, triggering an alert and prompting teams to investigate.
- 09:23 UTC on 10 September 2025 – Initial targeted communications sent to customers via Azure Service Health. Automated messaging gradually expanded as impact increased, and additional service alerts were triggered.
- 09:30 UTC on 10 September 2025 – Began manual mitigation efforts, which included restarting critical service components to restore functionality, rerouting workloads away from affected infrastructure, and initiating multiple recovery cycles for the impacted backend service.
- 10:02 UTC on 10 September 2025 – Excessive load accumulated in AZ02, as customer deployments were automatically redirected from AZ03 to AZ02 on top of normal customer load in AZ02. Platform services in AZ02 began to throttle.
- 11:20 UTC on 10 September 2025 – Platform-initiated retries (triggered by throttling) caused excessive load in AZ02, leading to increasing failure rates.
- 12:00 UTC on 10 September 2025 – Platform engineers stopped new VM deployments in AZ03.
- 12:40 UTC on 10 September 2025 – The reduction in load enabled AZ03 to recover. Customer success rates for operations on VMs already deployed into AZ03 returned to normal levels.
- 13:30 UTC on 10 September 2025 – After verifying that platform-initiated retries had sufficiently drained, platform engineers re-enabled new VM deployments to AZ03. Platform engineers then stopped all new VM deployments in AZ02.
- 13:30 UTC on 10 September 2025 – Engineers continued to apply additional mitigations to AZ02 platform services with backlogs of retry requests. These included applying more aggressive throttling, draining existing backlogs, restarting services, and applying additional load management strategies.
- 15:05 UTC on 10 September 2025 – Customer success rates for operations on VMs already deployed into AZ02 started increasing.
- 16:36 UTC on 10 September 2025 – Broad targeted messaging sent to customers with Virtual Machines in East US 2.
- 16:50 UTC on 10 September 2025 – Started gradually re-enabling traffic in AZ02.
- 17:17 UTC on 10 September 2025 – First update on the public Azure Status page.
- 18:50 UTC on 10 September 2025 – After a period of monitoring to validate the health of services, we were confident that the control plane service was restored, and no further impact was observed to downstream services for this issue.
How are we making incidents like this less likely or less impactful?
- We have turned off the recently deployed Allocator behavior, reverting to the previous allocation throttling logic. (Completed)
- We have already adjusted our service throttling configuration settings for platform services in the AZs of East US 2, as well as other high-load zones/regions. (Completed)
- We are improving our end-to-end load management practices, to take a more holistic view of the impacts of any throttling changes. (Estimated completion: TBD)
- We are tuning and adding throttling levers in control plane services, to help prevent similar surges in intra-service call volume. (Estimated completion: TBD)
- We are reviewing our communication processes and tooling to improve timeliness and transparency in future incidents. (Estimated completion: TBD)
How can customers make incidents like this less impactful?
- Note that the 'logical' Availability Zones used by each customer subscription may correspond to different physical Availability Zones. Customers can use the Locations API to understand this mapping and confirm which resources run in the two physical AZs referenced above (see the sketch after this list): https://learn.microsoft.com/rest/api/resources/subscriptions/list-locations
- For mission-critical workloads, customers should consider a multi-region geodiversity strategy to avoid impact from incidents like this one, which affected a single region: https://learn.microsoft.com/training/modules/design-a-geographically-distributed-application/ and https://learn.microsoft.com/azure/architecture/patterns/geodes
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
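As an example of the logical-to-physical zone mapping mentioned in the first bullet above, the following is a minimal sketch of calling the Locations API from Python and reading the `availabilityZoneMappings` field it returns. It assumes the `azure-identity` and `requests` packages, a subscription ID of your own, and an identity permitted to read the subscription; the api-version shown is an assumption, so check the linked documentation for the current one.

```python
# A minimal sketch, assuming azure-identity and requests are installed and the
# signed-in identity can read the subscription. Verify field names and api-version
# against the Locations API documentation linked above before relying on this.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION_ID = "<your-subscription-id>"  # placeholder
API_VERSION = "2022-12-01"                  # assumed; confirm in the linked docs

def print_zone_mappings(region_name="eastus2"):
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    resp = requests.get(
        f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}/locations",
        params={"api-version": API_VERSION},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    for location in resp.json().get("value", []):
        if location.get("name") != region_name:
            continue
        # Map this subscription's logical zones (1/2/3) to the physical zones
        # referenced in this PIR (e.g. the physical AZ02 and AZ03 of East US 2).
        for mapping in location.get("availabilityZoneMappings", []) or []:
            print(mapping.get("logicalZone"), "->", mapping.get("physicalZone"))

if __name__ == "__main__":
    print_zone_mappings()
```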