27 September 2025

This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details.

What happened?

Between 23:54 UTC on 26 September 2025 and 21:59 UTC on 27 September 2025, a platform issue caused service availability issues for resources hosted in the Switzerland North region. Many customers experienced mitigation by 04:00 UTC on 27 September 2025, while those affected by residual impact remaining after the initial mitigation steps experienced full mitigation as late as 21:59 UTC on 27 September 2025.

Impacted services included API Management, App Service, Application Gateway, Application Insights, Azure Cache for Redis, Azure Cosmos DB, Azure Data Explorer, Azure Database for PostgreSQL, Azure Databricks, Azure Firewall, Azure Kubernetes Service, Azure Storage, Azure Synapse Analytics, Backup, Batch, Data Factory, Log Analytics, Microsoft Sentinel, SQL Database, SQL Managed Instance, Site Recovery, Stream Analytics, Virtual Machine Scale Sets (VMSS), and Virtual Machines (VMs). Other services reliant on these may have also been impacted. 

What went wrong and why?

A planned change updated the set of certificates accepted for Software Load Balancer node connection authorization. An error in the change, which introduced a malformed prefix into one of the certificates to be deployed in the region, was not identified during the validation process. A region-specific configuration carried that certificate through a deployment pipeline that was not covered by our standard deployment health checks. The change was intended to have no impact on customer workloads but, because of the error, it caused a loss of connectivity between nodes and resources inside the region.
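As an illustration only, the sketch below shows the kind of structural pre-deployment check that can catch a malformed certificate prefix before a configuration is queued for rollout. It is not Azure's actual validation tooling; the function names and the cert_bundle shape are assumptions made for this example.

```python
import base64

# Hypothetical pre-deployment check (not Azure's actual tooling): verify that
# every certificate entry in a configuration bundle is a structurally valid
# PEM blob before it is queued for rollout. A malformed prefix, such as stray
# characters before the BEGIN marker, is rejected here instead of reaching
# production.

PEM_HEADER = "-----BEGIN CERTIFICATE-----"
PEM_FOOTER = "-----END CERTIFICATE-----"

def validate_cert_entry(pem_text: str) -> None:
    body = pem_text.strip()
    if not body.startswith(PEM_HEADER):
        raise ValueError("certificate has a malformed prefix before the PEM header")
    if not body.endswith(PEM_FOOTER):
        raise ValueError("certificate is missing the PEM footer")
    # The payload between header and footer must be valid base64.
    payload = "".join(body[len(PEM_HEADER):-len(PEM_FOOTER)].split())
    try:
        base64.b64decode(payload, validate=True)
    except Exception as exc:
        raise ValueError(f"certificate payload is not valid base64: {exc}") from exc

def validate_bundle(cert_bundle: dict[str, str]) -> None:
    # cert_bundle maps a region-specific configuration key to its PEM text.
    for name, pem_text in cert_bundle.items():
        try:
            validate_cert_entry(pem_text)
        except ValueError as exc:
            raise ValueError(f"{name}: {exc}") from exc
```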

How did we respond?

The error was corrected and the fix was deployed to re-establish network communication within the region. Most resources were restored automatically once communication was re-established. However, some resources and services that depend on VMs with specific extensions entered a failed state and required additional manual intervention to be brought back online.
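For customers who want to detect this kind of residual impact on their own resources, the following sketch uses the Azure SDK for Python (azure-identity and azure-mgmt-compute) to list VMs whose provisioning state is reported as Failed, so they can be reviewed and, if appropriate, redeployed. This is an illustrative example, not the remediation Microsoft performed; the subscription ID placeholder and the reliance on vm.provisioning_state are assumptions.

```python
# Illustrative sketch for customers: enumerate VMs in a subscription and flag
# any whose provisioning state is "Failed", so they can be investigated or
# redeployed. Assumes azure-identity and azure-mgmt-compute are installed and
# that DefaultAzureCredential can authenticate; the subscription ID below is a
# placeholder.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "<your-subscription-id>"

def find_failed_vms() -> list[str]:
    client = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
    failed = []
    for vm in client.virtual_machines.list_all():
        # provisioning_state reflects the outcome of the last control-plane
        # operation on the VM; "Failed" VMs may need a redeploy or support case.
        if (vm.provisioning_state or "").lower() == "failed":
            failed.append(vm.id)
    return failed

if __name__ == "__main__":
    for vm_id in find_failed_vms():
        print("VM in failed provisioning state:", vm_id)
```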

  • 23:54 UTC on 26 September 2025 – Customer impact began, triggered by the change described above.
  • 00:08 UTC on 27 September 2025 – The issue was detected via our automated monitoring.
  • 00:12 UTC on 27 September 2025 – Investigation commenced by our Azure Storage and Networking engineering teams.
  • 02:33 UTC on 27 September 2025 – We performed steps to revert the configuration change.
  • 03:40 UTC on 27 September 2025 – The configuration was successfully reverted.
  • 04:00 UTC on 27 September 2025 – Validation of recovery was confirmed for the majority of impacted services, but a subset of resources were still unhealthy due to residual impact.
  • 16:15 UTC on 27 September 2025 – Recovery operations were identified and performed to address the residual impact to VMs with specific extensions that remained in a failed state. This included applying mitigation scripts and safely rebooting the remaining subset of resources to recover them.
  • 21:59 UTC on 27 September 2025 – Residual impact recovery activities and validation completed, confirming full recovery of all impacted services and customers.

How are we making incidents like this less likely or less impactful?

  • We are putting in place additional validation procedures for configuration updates of this type, before they are queued for deployment. (Estimated completion: TBD)
  • We will improve our deployment systems to provide more safety measures for updates of this kind. (Estimated completion: TBD)
  • We will investigate ways to improve the recovery time for resources affected by such issues. (Estimated completion: TBD)

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: 

10 September 2025

Join one of our upcoming 'Azure Incident Retrospective' livestreams discussing this incident (to hear from our engineering leaders, and to get any questions answered by our experts) or watch a recording of the livestream (available later, on YouTube):

What happened?

Between 09:03 UTC and 18:50 UTC on 10 September 2025, an issue with the Allocator service in the Azure Compute control plane led to degraded service management operations for resources in two Availability Zones (physical AZ02 and AZ03) in the East US 2 region. While the Allocator issue was mitigated by 18:50 UTC, a subset of downstream services experienced delayed recovery due to backlogged operations and retry mechanisms, with full recovery observed by 19:30 UTC.

During this impact window, customers may have received error notifications and/or been unable to start, stop, scale, or deploy resources. The primary services affected were Virtual Machines (VMs) and Virtual Machine Scale Sets (VMSS), but services dependent on them were also affected – including but not limited to:

  • Azure Backup: Between 13:32 UTC and 19:30 UTC on 10 September 2025, customers may have experienced failures with Virtual Machine backup operations. 
  • Azure Batch: Between 09:50 UTC and 18:50 UTC on 10 September 2025, customers may have experienced pool operations getting stuck. Primarily, pool grow/shrink and deletion operations may have become stuck. The majority of impact occurred between 15:06 and 18:50 UTC, but backlogged operations would have caught up by 19:30 UTC. 
  • Azure Databricks: Between 10:15 UTC and 18:07 UTC on 10 September 2025, customers may have experienced delays and/or failures with job run operations and/or Databricks SQL queries. 
  • Azure Data Factory: Between 10:32 UTC and 18:08 UTC on 10 September 2025, customers may have experienced failures in running dataflow jobs due to cluster creation issues. 
  • Azure Kubernetes Service: Between 09:05 UTC and 19:30 UTC on 10 September 2025, customers may have experienced operation failures, including but not limited to: creating, upgrading, scaling up, or deleting managed clusters or node pools; creating or deleting pods that require scaling up nodes; and attaching or detaching disks to VMs. 
  • Azure Synapse Analytics: Between 10:28 UTC and 18:21 UTC on 10 September 2025, customers may have experienced spark job execution failures while executing spark activities within notebooks, data pipelines, or SDKs. 
  • Microsoft Dev Box: Between 12:30 UTC and 17:15 UTC on 10 September 2025, customers may have experienced operation failures, including manual start via Dev Portal and connecting directly via the Windows App to a Dev Box in a shutdown or hibernated state.

While AZ01 was not affected by the Allocator service issue that impacted AZ02 and AZ03, allocation failures in AZ01 were observed as a result of existing constraints on AMD v4/v5 and Intel v5 SKUs.

What went wrong and why?

Azure Resource Manager (ARM) provides a management layer that enables customers to create, update, and delete resources such as VMs. As part of this service management system, ARM depends on several VM management services in the Azure control plane, which in turn rely on an Allocator service deployed per Availability Zone.

A recent change to Allocator behavior, designed to improve how the service handles heavy traffic, had been progressively rolled out over the prior two months, expanding to broader regions in early September. This update introduced a new throttling pattern that had performed effectively elsewhere. However, a performance issue affecting a subset of Allocator machines triggered the new throttling logic which, because of the high throughput demand in the East US 2 region, caused VM provisioning requests to retry more aggressively than the control plane could handle.
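To make the retry amplification concrete, here is a minimal sketch, with made-up numbers and function names that are not Azure internals, contrasting a fixed retry interval with exponential backoff and jitter. Under throttling, the fixed interval keeps re-offering the same load every second, while backoff with jitter spreads retries out and gives the service room to recover.

```python
import random

# Illustrative comparison only (not Azure's actual retry policy): a fixed,
# short retry interval re-offers the full load every second even while the
# service is throttling, whereas exponential backoff with "full jitter"
# spreads retries out and lets the service catch up.

def fixed_retry_delay(attempt: int) -> float:
    return 1.0  # every caller retries after one second, regardless of pressure

def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    # Delay grows exponentially with each attempt; the random component
    # prevents callers from retrying in synchronized waves.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, fixed_retry_delay(attempt), round(backoff_with_jitter(attempt), 2))
```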

Customer deployments failing in AZ03 were automatically redirected to AZ02, which already had significant load. This redirection, combined with normal traffic, led to overload and throttling in AZ02. Unlike AZ03, which recovered quickly once new VM deployments were halted, AZ02 faced a more complex and persistent challenge.

By the time we paused new VM deployments in AZ02, the zone had already accumulated a significant backlog—estimated to be many times larger than normal operational volume. This backlog wasn’t just a result of repeated retries of the same operation. Instead, the platform was attempting a wide variety of retry paths to find successful outcomes for customers. These included retries across multiple internal infrastructure segments within AZ02, triggered by disk and VM co-location logic, and retries with different allocation parameters. Each retry was an attempt to find a viable configuration, not a simple repetition of the same request.
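The fan-out described above can be illustrated with a small hypothetical example: if each failed deployment is retried across several internal segments and several allocation parameter sets, a single customer request turns into many distinct queued attempts. The segment and parameter names below are invented for illustration.

```python
from itertools import product

# Hypothetical illustration of retry fan-out: rather than repeating the same
# call, the platform retries a failed deployment across alternative internal
# segments and allocation parameters, so one customer request can become many
# distinct queued attempts. The names and counts below are invented.

SEGMENTS = ["segment-a", "segment-b", "segment-c"]
ALLOCATION_PARAMS = ["colocate-with-disk", "relaxed-colocation", "spread"]

def candidate_attempts(request_id: str) -> list[tuple[str, str, str]]:
    # Each (segment, parameter) combination becomes a separate queued attempt.
    return [(request_id, segment, params)
            for segment, params in product(SEGMENTS, ALLOCATION_PARAMS)]

if __name__ == "__main__":
    attempts = candidate_attempts("vm-deploy-001")
    print(f"1 customer request -> {len(attempts)} queued allocation attempts")
```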

This behavior reflects a platform optimized for reliability—one that persistently attempts alternative solutions rather than failing fast. While this approach often helps customers succeed in constrained environments, in this case it led to a large, persistent queue of operations that overwhelmed control plane services in AZ02. The zone’s larger size and higher throughput compared to AZ03 further compounded the issue, making recovery significantly slower.

Importantly, the majority of the incident duration was spent in the recovery phase for AZ02. Once new deployments were halted and retries suppressed, the platform still had to drain the backlog of queued operations. This was not a simple in-memory retry loop—it was a persistent queue of work that had accumulated across multiple services and layers. Recovery was further slowed by service-level throttling configurations that, while protective, limited how quickly the backlog could be processed. These throttling values were tuned for normal operations and not optimized for large-scale recovery scenarios.
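A toy calculation, using entirely hypothetical numbers, shows why throttling values tuned for normal operations stretch out recovery: the time to drain a backlog is roughly the backlog size divided by the net processing rate the throttle allows.

```python
# Toy model with hypothetical numbers: the time to drain a backlog is roughly
# the backlog size divided by the net rate the throttle allows. Limits tuned
# for steady-state traffic therefore stretch recovery when the backlog is
# many times normal volume.

def drain_hours(backlog_items: int, processed_per_second: float,
                arrival_per_second: float = 0.0) -> float:
    net_rate = processed_per_second - arrival_per_second
    if net_rate <= 0:
        return float("inf")  # never drains if work arrives as fast as it is processed
    return backlog_items / net_rate / 3600

if __name__ == "__main__":
    backlog = 5_000_000  # hypothetical accumulated operations
    print("steady-state throttle (200 ops/s):", round(drain_hours(backlog, 200), 1), "hours")
    print("recovery throttle (1,000 ops/s):", round(drain_hours(backlog, 1000), 1), "hours")
```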

We acknowledge that the extended recovery time in AZ02 was driven by the scale and complexity of the accumulated work, the opportunistic retry behavior of the platform, and the limitations of current throttling and queue-draining mechanisms. Addressing these factors is a key focus of our repair actions—both to prevent excessive backlog accumulation in future incidents, and to accelerate recovery when it does occur.

During this incident, our initial communications strategy relied heavily on automated notifications to impacted customers, triggered by monitoring signals for each impacted service. While this system informed many customers quickly, it currently supports only a subset of services and scenarios. As a result, some impacted customers and services were not notified automatically, relying on manual communications provided later during the incident. As the scope of impact widened, we recognized that our communications were not reaching all affected customers. To address this, we sent broad targeted messaging to all customers running VMs in East US 2 – and ultimately published to the (public) Azure Status page, to improve visibility and reduce confusion. Finally, we acknowledge that early updates lacked detail around our investigation and recovery efforts. We should have shared more context sooner, and we have corresponding communication-related repair actions outlined below to improve timeliness and clarity in future incidents.

How did we respond?

  • 09:03 UTC on 10 September 2025 – Customer impact began in AZ03. 
  • 09:13 UTC on 10 September 2025 – Monitoring systems observed a spike in allocation failure rates for Virtual Machines, triggering an alert and prompting teams to investigate. 
  • 09:23 UTC on 10 September 2025 – Initial targeted communications sent to customers via Azure Service Health. Automated messaging gradually expanded as impact increased, and additional service alerts were triggered. 
  • 09:30 UTC on 10 September 2025 – We began manual mitigation efforts in AZ03 by restarting critical service components to restore functionality, rerouting workloads from affected infrastructure, and initiated multiple recovery cycles for the impacted backend service. 
  • 10:02 UTC on 10 September 2025 – Platform services in AZ02 began to throttle requests due to excessive load, as customer deployments were automatically redirected from AZ03 on top of normal customer load in AZ02. 
  • 11:20 UTC on 10 September 2025 – Platform-initiated retries caused excessive load in AZ02, leading to increased failure rates. 
  • 12:00 UTC on 10 September 2025 – We stopped new VM deployments being allocated in AZ03.  
  • 12:40 UTC on 10 September 2025 – The reduction in load enabled AZ03 to recover. Customer success rates for operations on VMs that were already deployed into AZ03 returned to normal levels.  
  • 13:30 UTC on 10 September 2025 – After verifying that platform-initiated retries had sufficiently drained, we re-enabled new VM deployments to AZ03. We then stopped all new VM deployments in AZ02. 
  • 13:30 UTC on 10 September 2025 – We continued to apply additional mitigations to AZ02 platform services with backlogs of retry requests. These included applying more aggressive throttling, draining existing backlogs, restarting services, and applying additional load management strategies. 
  • 15:05 UTC on 10 September 2025 – Customer success rates for operations on VMs already deployed into AZ02 started increasing. 
  • 16:36 UTC on 10 September 2025 – Broad targeted messaging sent to all customers with Virtual Machines in East US 2. 
  • 16:50 UTC on 10 September 2025 – We started gradually re-enabling traffic in AZ02. 
  • 17:17 UTC on 10 September 2025 – The first update was posted on the public Azure Status page. 
  • 18:50 UTC on 10 September 2025 – After a period of monitoring to validate the health of services, we were confident that the control plane service was restored. 
  • 19:30 UTC on 10 September 2025 – Remaining downstream services reported recovery, following backlog drainage and retry stabilization. Customer impact confirmed mitigated. 

How are we making incidents like this less likely or less impactful? 

  • We have turned off the recently deployed Allocator behavior, reverting to the previous allocation throttling logic. (Completed) 
  • We have already adjusted our service throttling configuration settings for platform services in the Availability Zones of East US 2, as well as other high-load zones/regions. (Completed) 
  • We are tuning and creating additional throttling levers in control plane services, to help prevent similar increases in intra-service call volume from occurring. (Estimated completion: October 2025) 
  • We are creating a set of internal tools to support the draining of increased backlog for faster service recovery from incident scenarios that require this. (Estimated completion: October 2025) 
  • We are addressing identified gaps in our load testing infrastructure, to help detect these types of issues earlier – before rolling out production changes. (Estimated completion: October 2025) 
  • We are working to improve our communication processes and tools, to help eliminate delays in notifying impacted customers and to supply actionable information in the future. (Estimated completion: December 2025) 
  • In the longer term, we are improving our end-to-end load management practices, to take a more holistic view of the impacts of any throttling, prioritization, and bursts in demand. (Estimated completion: June 2026) 

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: