9

Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/air/8GCS-858

What happened?

Between 23:20 UTC on 9 March and 19:32 UTC on 10 March 2026, a platform issue resulted in impact to the Azure OpenAI Service. Impacted customers experienced HTTP 400 and HTTP 429 error responses, specifically for the GPT-5.2 model. All other GPT models were unaffected during this time.

This incident impacted customer resources and queries in the following regions: Australia East, Central US, East US 2, Korea Central, Norway East, Sweden Central, and UK South.

What went wrong and why?

The Azure OpenAI Service processes customer requests through model engines deployed across multiple Azure regions and supported by traffic routing systems. Depending on the selected deployment model (Global, Data Zone, or Regional), note that requests may be routed across multiple Azure regions within defined geographic boundaries to support availability and resilience, while customer data remains stored at rest in the selected Azure region. Learn more at: https://azure.microsoft.com/blog/enterprise-trust-in-azure-openai-service-strengthened-with-data-zones.

A recent update to the Azure OpenAI GPT 5.2 model introduced a configuration change that was not compatible with the version of the model engine code running in production. As part of this update, certain feature settings were enabled to improve service efficiency and resilience – however, the deployed engine version did not yet support those settings. As a result, when customer requests were routed to engines with this mismatch, the service was unable to process these requests correctly.

Generally speaking, any updates to the service are rolled out in line with our Safe Deployment Practices (SDP) which deploys to different regions gradually, in stages. During this update, the earlier stages of the rollout did not include sufficient backend model instances for this issue to surface before the update progressed to additional regions. As a result, the rollout had completed its deployment across our fleet before we were able to determine customer impact.

During mitigation of the primary issue, we identified a secondary issue that affected service recovery. Azure OpenAI relies on internal telemetry to understand real-time service capacity across regions and to route traffic accordingly. At the time recovery actions were underway, an unrelated issue in this internal telemetry system led to incomplete capacity information being leveraged. As a result, traffic routes were temporarily being determined using incomplete data, which led to a disproportionate amount of traffic being directed to a limited set of available regions. This created additional resource pressure in those regions and resulted in continued intermittent request failures (HTTP 429 errors) for some customers, even as the rollback of the configuration issue was progressing and other regions were actually available to receive requests. Once the routing updates were successfully completed and full capacity information was restored across regions, traffic distribution normalized and service recovery progressed as expected.

During the incident, we also identified a monitoring gap related to anomalous HTTP 400 error patterns. While HTTP 400 responses do occur during normal service usage, our monitoring was not configured for service‑side anomalies, only for client-side errors – which are typically caused by incorrect parameters in user requests. This miss in monitoring during the initial stages of the incident delayed detection and response.

How did we respond?

23:20 UTC on 09 March 2026 – Customer impact began, triggered by the recent service update.
00:19 UTC on 10 March 2026 – We detected the issue via service monitoring. This prompted us to begin our investigation, engage with other teams to troubleshoot, and start developing a hot fix.
03:18 UTC on 10 March 2026 – We determined that a rollback could mitigate more quickly than hotfix, so started the rollback for the impacted model.
10:55 UTC on 10 March 2026 – Traffic routes were determined to be using incomplete data, due to the aforementioned dependency issue.
12:40 UTC on 10 March 2026 – We identified and investigated resource constraints.
18:00 UTC on 10 March 2026 – Rollback actions completed across all affected regions.
19:30 UTC on 10 March 2026 – Full capacity information was restored across regions, traffic distribution normalized.
19:32 UTC on 10 March 2026 – Once monitoring confirmed stable recovery, we determined the service was fully restored and all customer impact had been mitigated.

How are we making incidents like this less likely or less impactful?

We have already conducted additional engineer training on our operating procedures – including which scenarios can be quickly rolled back – to reduce the time to mitigate similar issues. (Completed)
We have incorporated storing a ‘cached’ known good version of the traffic routing details, as an additional layer of resilience in case the dependent service is not able to serve the latest information on regional capacity availability. (Completed)
We have improved our monitoring surrounding HTTP 400 errors, by establishing thresholds of errors on the service side. (Completed)
To expand that monitoring further, we are improving our anomaly detection surrounding HTTP 4xx errors – to alert on anomalous error rates that may not meet our usual thresholds. (Estimated completion: April 2026)
We are incorporating additional signals to our deployment systems, to reduce potential impact by stopping problematic rollouts automatically. (Estimated completion: May 2026)
Finally, we are improving our safe deployment practices by ensuring that early stages have sufficient backend model instances to catch issues like this earlier. (Estimated completion: June 2026)

How can customers make incidents like this less impactful?

Consider reviewing our guidance and best practices related to Business Continuity and Disaster Recovery (BCDR) scenarios for Azure OpenAI: https://learn.microsoft.com/azure/ai-services/openai/how-to/business-continuity-disaster-recovery
Consider reviewing and implementing our best practices surrounding retry patterns, especially with exponential backoff, to improve workload resiliency during intermittent issues: https://learn.microsoft.com/azure/architecture/patterns/retry
More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/8GCS-858

7

Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/air/_SVS-5_G

What happened?

Between 07:58 UTC on 07 February 2026 and 04:24 UTC on 08 February 2026, impacted customers experienced intermittent service unavailability, timeouts and/or higher than normal latency for services in the West US region. The event began following a power interruption affecting one of the datacenters within the region, after which impact manifested as infrastructure availability loss and service disruptions across multiple dependent workloads in the region.

As power stabilization progressed, recovery proceeded in phases – however, a subset of storage and compute infrastructure components did not return to a healthy state immediately after power restoration, which slowed recovery for dependent components, and contributed to ongoing symptoms such as delayed telemetry and resource recovery.

Impacted services included:

Azure AI Search: Customers using Azure AI Search in West US may have experienced query failures, indexing delays, or temporarily unavailable search endpoints.
Azure App Service: Between 07:58 UTC and approximately 09:45 UTC on 07 February 2026, customers using Azure App Service in West US may have experienced HTTP 500 errors, request timeouts, or failed deployments.
Azure Backup: During the impact window, backup jobs in West US may have experienced delays or temporary failures due to storage unavailability.
Azure Cache for Redis: Customers may have observed connectivity failures, increased latency, or transient cache unavailability in West US.
Azure Container Registry: Customers may have experienced delays or failures when pulling or pushing container images in West US.
Azure Cosmos DB: Customers may have experienced a degradation in service availability and/or request latency. Some requests may have resulted in server errors or timeouts.
Azure Data Factory: Customers may have observed pipeline execution failures or delays in West US. Affected integration runtimes recovered following restoration of underlying networking and storage services.
Azure Database for MySQL – Flexible Server: Between 09:40 UTC and 20:50 UTC on 07 February 2026, customers may have experienced connection failures or degraded database availability in West US.
Azure Database for PostgreSQL – Flexible Server: Customers may have experienced database connectivity interruptions or degraded performance in West US.
Azure Databricks: Customers may have experienced workspace unavailability, cluster startup failures, or job execution interruptions in West US.
Azure DevOps: Customers may have experienced delays in log ingestion and analytics reporting in West US.
Azure Event Hubs: Between 07:58 UTC and 19:05 UTC on 7 February 2026, customers may have experienced failures or high latency when sending and processing events. Control plane operations like creating/updating/deleting Event Hubs or Consumer Groups were also impacted.
Azure IoT Hub: Customers may have observed delayed message processing or temporary endpoint unavailability in West US.
Azure Kubernetes Service (AKS): Between 07:58 UTC and 21:46 UTC on 07 February 2026, customers using AKS in West US may have experienced control plane instability, node health reporting delays, or pod scheduling failures.
Azure Key Vault: Customers may have experienced intermittent authentication or secret retrieval failures in West US.
Azure Monitor: Customers may have observed delayed metrics, log ingestion latency, or temporary alerting gaps in West US.
Azure Monitor (Application Insights): Between 07:58 UTC on 07 February 2026 and 04:24 UTC on 08 February 2026, customers may have experienced delayed telemetry ingestion and incomplete monitoring data in West US.
Azure Service Bus: Between 07:58 UTC and 19:05 UTC on 7 February 2026, customers may have experienced transient messaging failures, delayed message processing, or control plane disruptions impacting queues and topics in West US.
Azure SQL Database: Customers may have experienced SQL error 40613 (database unavailable) or connectivity interruptions in West US. Service mitigation included database failover and host recovery.
Azure Storage: Customers may have experienced read/write failures, increased latency, or temporary storage account unavailability in West US. Storage services were restored in phases as power and network infrastructure stabilized.
Azure Stream Analytics: Customers may be experiencing service availability disruptions, including delays or failures in data ingestion, processing, or output delivery.
Azure Virtual Machines: Customers may have experienced VM start failures, allocation errors, or temporary unavailability of running instances in West US.
Azure Virtual Machines (Confidential VM types): Customers may have experienced VM deployment or startup failures in West US, due to dependent storage disruption.
Microsoft Defender for Cloud Apps: Customers may have observed delayed alerts or telemetry ingestion in West US during the recovery period.

What went wrong and why?

Under normal operating conditions, Azure datacenter infrastructure is designed to tolerate utility power disturbances through redundant electrical systems and automated failover to on-site generators. During this incident, although utility power was still active, an electrical failure in an onsite transformer resulted in the loss of utility power to the datacenter. Although generators started as designed, a cascading failure within a control system prevented the automated transfer of load from utility power to generator power. As a result, Uninterruptible Power Supply (UPS) batteries carried the load for several minutes – until they were fully depleted, leading to customer impact from this power loss as early as 07:58 UTC.

Once power loss occurred, recovery was not uniform across all dependent infrastructure. Our datacenter operations team restored power to IT racks by leveraging our onsite generators, with 90% of IT racks powered by 09:31 UTC on 07 February. While infrastructure generally came back online as expected after power restoration, subsets of networking, storage and compute infrastructure required further electrical control system troubleshooting, before power could be restored. We were fully restored and running generator-backed power by 11:29 UTC. Power restoration was sequenced to avoid destabilizing power that had already been brought back online, and due to the inconsistent state of the control system.

Following power restoration at the datacenter, recovery of customer workloads progressed in stages based on dependencies. Network and Storage recovery were critical requirements, as compute hosts must be able to access persistent storage to complete boot, validate system state, and reattach customer data before workloads can resume. Network devices and services recovered as expected once power was restored.

The power loss affected six storage scale units within the datacenter, of which four recovered as expected once power was restored. Two storage scale units experienced prolonged recovery, which became a primary factor contributing to extended impact as these contained a large subset of storage nodes that failed to complete the boot process. During the boot process, critical artifacts need to be downloaded, but these were delayed or timed out – resulting in these nodes entering a state for which there was not an automated recovery action. Manual operations were attempted, and eventually succeeded in recovering the necessary nodes to bring these storage scale units back online. Due to the dependencies that many compute and platform services have on these storage scale units, there was a delay in overall service restoration. Storage recovery completed by 19:05 UTC.

Compute recovery began once power and storage dependencies were partially restored. Automated recovery systems were available and functioning, but the scale of the event resulted in a surge of required recovery activities across the datacenter. This placed sustained pressure on regional shared infrastructure components responsible for coordinating health checks, power state validation, and repair actions. During the initial recovery window, this elevated load reduced the speed at which individual compute nodes could be recovered. While many nodes returned to service as expected, some nodes required repeated recovery attempts, due to insufficient prioritization across concurrent fault categories during the surge of recovery activities. We introduced optimizations to accelerate recovery, including deprioritizing non-critical recovery activities. As pressure on these shared systems decreased, compute recovery progressed more predictably and remaining nodes were gradually restored. Compute recovery completed by 23:30 UTC.

How did we respond?

07:54 UTC on 07 February 2026 – Initial electrical failure in an onsite transformer, although various UPS batteries carried the load for several minutes.
07:58 UTC on 07 February 2026 – Customers began experiencing unavailability and delayed monitoring/log data.
08:07 UTC on 07 February 2026 – Datacenter and engineering teams engaged and initiated coordinated investigation across power, network, storage, and compute recovery workstreams.
09:31 UTC on 07 February 2026 – 90% of facility power restored on generators.
11:26 UTC on 07 February 2026 – 93% of facility power restored on generators.
11:29 UTC on 07 February 2026 – 100% of facility power restored on generators with continued recovery of hosted services.
12:15 UTC on 07 February 2026 – Storage recovery started and dependent services started seeing recovery.
15:00 UTC on 07 February 2026 – Targeted remediation actions continued for remaining unhealthy services.
15:25 UTC on 07 February 2026 – Applied configuration changes to optimize processing of recovery events, to assist in expediting the recovery of compute services.
19:05 UTC on 07 February 2026 – Storage recovery was complete.
19:24 UTC on 07 February 2026 – Compute recovery system stabilized and fully operational, continued gradually restoring remaining nodes.
23:30 UTC on 07 February 2026 – Long tail of Compute nodes impact mitigated.
04:24 UTC on 08 February 2026 – All affected services confirmed mitigated.
03:42 UTC on 09 February 2026 – Power was transitioned back to utility power, restoring the datacenter to normal operations.

How are we making incidents like this less likely or less impactful?

The control system and dependent components have been inspected, reconfigured, and validated to support transfer-to-generator events more reliably. (Completed)
We have reevaluated how best to prioritize critical infrastructure earlier in the power restoration sequence, to ensure a smoother recovery across dependent services. (Completed)
We are working to enhance the recovery system through prioritization and throttling, to improve stability during surge scenarios like this one. (Estimated completion: July 2026)
Finally, we are continually refining how our infrastructure and services come back online after datacenter power loss events, to make these more robust to different failure scenarios.

How can customers make incidents like this less impactful?

Since Availability Zones are not available in the West US region, customers should consider a multi-region geodiversity strategy for mission-critical workloads: https://learn.microsoft.com/azure/well-architected/design-guides/regions-availability-zones#deployment-approach-4-multi-region-deployments and https://learn.microsoft.com/azure/well-architected/design-guides/disaster-recovery
Consider leveraging data redundancy options such as Read-Access Geo-Redundant Storage (RA-GRS) to ensure data accessibility during regional incidents: https://learn.microsoft.com/azure/storage/common/storage-redundancy
More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/_SVS-5_G

2

Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/air/FNJ8-VQZ

What happened?

Between 18:03 UTC on 02 February 2026 and approximately 00:30 UTC on 03 February 2026, a platform issue caused some customers to experience degraded performance and control plane failures, for multiple Azure services across multiple regions. Impacted services included:

Azure Virtual Machines (VMs) – Customers may have experienced failures when deploying or scaling virtual machines, including errors during provisioning and lifecycle operations.
Azure Virtual Machine Scale Sets (VMSS) – Customers may have experienced failures when scaling instances or applying configuration changes.
Azure Kubernetes Service (AKS) – Customers may have experienced failures in node provisioning and extension installation.
Azure DevOps (ADO) and GitHub Actions – Customers may have experienced pipeline failures when tasks required VM extensions or related packages. Actions jobs queued and timed out while waiting to acquire a hosted runner. Other GitHub features that leverage this compute infrastructure were similarly impacted, including Copilot Coding Agent, Copilot Code Review, CodeQL, Dependabot, GitHub Enterprise Importer, and Pages. All regions and runner types were impacted.
Other dependent services – Customers may have experienced degraded performance or failures in operations that required downloading extension packages from Microsoft-managed storage accounts, including Azure Arc enabled servers, and Azure Database for PostgreSQL.

As part of our recovery, between 00:08 UTC and 06:05 UTC on 03 February 2026, customers using the 'Managed identities for Azure resources' service (formerly Managed Service Identity) may have experienced failures when attempting to create, update, or delete Azure resources, or acquire Managed Identity tokens – in the East US and West US regions. Originally communicated on incident tracking ID _M5B-9RZ, this impacted the creation and management of Azure resources with assigned managed identities – including but not limited to Azure AI Video Indexer, Azure Chaos Studio, Azure Container Apps, Azure Database for PostgreSQL Flexible Servers, Azure Databricks, Azure Kubernetes Service, Azure Stream Analytics, Azure Synapse Analytics, and Microsoft Copilot Studio. Newly created resources of these types may have also failed to authenticate using managed identity.

What went wrong, and why?

On 02 February 2026, a remediation workflow executed based on a resource policy, intended to disable anonymous access on Microsoft-managed storage accounts. The purpose of this policy is to reduce unintended exposure by ensuring that anonymous access is not enabled unless explicitly required.

Due to a data synchronization problem in the targeting logic, the policy was incorrectly applied to a subset of storage accounts that are intentionally configured to allow anonymous read access for platform functionality. These storage accounts form part of the VM extension package storage layer. VM extensions are small applications used during provisioning and lifecycle operations to configure, secure, and manage VMs. Extension artifacts are retrieved directly from storage accounts, during early-stage provisioning and scaling workflows.

When anonymous access was disabled across these storage accounts, VMs and dependent services were unable to retrieve required extension artifacts. This resulted in control plane failures and degraded performance across affected services.

Although the policy process followed safe deployment practices and was initially rolled out region by region, the health assessment defect caused the incorrect targeting state to propagate broadly across public cloud regions in a short time window. This accelerated the impact beyond the intended blast radius.

Once identified, the policy job was immediately disabled to prevent further changes. Rollback procedures to restore anonymous access were validated in a test region and then executed region by region across affected storage accounts. As access was restored, control plane operations gradually recovered and service management queues began to clear.

As storage access was restored, a secondary issue emerged. The re-enablement of the VM extension package storage layer allowed multiple previously blocked workflows to resume concurrently, across infrastructure orchestration systems and dependent services.

The resulting surge in resource management operations generated a significant and sudden increase in requests to the backend service supporting Managed Identities for Azure resources. As a result, starting at 00:08 UTC on 03 February 2026, customers in East US and West US experienced failures when attempting to create, update, or delete Azure resources that use managed identities, as well as when acquiring Managed Identity tokens.

Automatic retry behaviors in upstream components, designed to provide resiliency under transient failure conditions, amplified traffic against the managed identity backend. The cumulative effect exceeded service limits in East US. Resilient routing mechanisms subsequently directed retry traffic to West US, resulting in similar saturation.

Although the managed identity infrastructure was scaled out, additional capacity alone did not immediately mitigate impact – because retry amplification continued to generate load at a rate that exceeded recovery capacity. Active load shedding and controlled reintroduction of traffic were required to stabilize the service and allow backlog processing while maintaining normal operations, with full recovery by 06:05 UTC.

How did we respond?

18:03 UTC on 02 February 2026 – Customer impact began, triggered by the periodic remediation workflow starting.
18:29 UTC on 02 February 2026 – Internal service monitoring detected a subset of regions having an increasing number of control plane failures.
19:46 UTC on 02 February 2026 – Correlated issues across multiple regions.
19:55 UTC on 02 February 2026 – Service monitoring detected failure rates exceeding failure limit thresholds.
20:10 UTC on 02 February 2026 – We began collaborating to devise a mitigation solution and investigate the underlying factors.
21:15 UTC on 02 February 2026 – We applied a primary proposed mitigation and validated that it was successful on a test instance.
21:18 UTC on 02 February 2026 – We identified and disabled the remediation workflow and stopped any ongoing activity, so it would not impact additional storage accounts.
21:50 UTC on 02 February 2026 – Began broader mitigation to impacted storage accounts. Customers saw improvements as work progressed.
00:07 UTC on 03 February 2026 – Storage accounts for a few high volume VM extensions that utilize managed identity finished their re-enablement in East US.
00:08 UTC on 03 February 2026 – Unexpectedly, customer impact increased first in East US and cascaded to West US with retries, as a critical managed identity service degraded under recovery load.
00:14 UTC on 03 February 2026 – Automated alerting identified availability impact to Managed Identity services in East US. Engineers quickly recognized that the service was overloaded and began to scale out.
00:30 UTC on 03 February 2026 – All extension hosting storage accounts had been reenabled, mitigating this impact in all regions other than East US and West US.
00:50 UTC on 03 February 2026 – Initial scale outs of managed identity service infrastructure scale out completed, but the new resources were still unable to handle the traffic volume due to the increasing backlog of retried requests.
02:00 UTC on 03 February 2026 – A second, larger set of managed identity service scale outs completed. Once again, the capacity was unable to handle the volume of backlogs and retries.
02:15 UTC on 03 February 2026 – Reviewed additional data and monitored downstream services to ensure that all mitigations were in place for all impacted storage accounts.
03:55 UTC on 03 February 2026 – To recover infrastructure capacity for the managed identity service, we began rolling out a change to remove all traffic so that the infrastructure could be repaired without load.
04:25 UTC on 03 February 2026 – After infrastructure nodes recovered, we began gradually ramping traffic to them, allowing backlogged identity operations to begin to process safely.
06:05 UTC on 03 February 2026 – Backlogged operations completed, and services returned to normal operating levels. We concluded our monitoring and confirmed that all customer impact had been mitigated.

How are we making incidents like this less likely or less impactful?

We are strengthening traffic controls and throttling so that unexpected load spikes or retry storms are contained early, and cannot overwhelm the managed identity service or impact other customers. (Estimated completion: February 2026)
We are working to fix the data synchronization problem, by addressing detection gaps and code defects with validation coverage, to protect from similar scenarios. (Estimated completion: March 2026)
We will exercise and optimize the rollback of the remediation workflow, including automation where appropriate, to mitigate issues more quickly. (Estimated completion: April 2026)
We are increasing capacity headroom and tightening scaling safeguards, to ensure sufficient resources remain available during recovery events and sudden demand. (Estimated completion: April 2026)
We will improve orchestration of the remediation workflow to be more in line with our Safe Deployment Practices, integrated with service health indicators, to reduce multi-region failure risk. (Estimated completion: May 2026)
We are improving retry behavior and internal efficiency, so the service degrades more gracefully under stress. (Estimated completion: June 2026)
We are improving regional isolation and failover behavior, to prevent issues in one region from cascading into paired regions during failures or recovery. (Estimated completion: July 2026)

How can customers make incidents like this less impactful?

There was nothing that customers could have done to avoid or minimize impact from this specific service incident.
Note that the impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/FNJ8-VQZ