February 2026

7

This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details. 

What happened?

Between 07:58 UTC on 07 February 2026 and 04:24 UTC on 08 February 2026, customers using multiple Azure services in the West US region may have experienced intermittent service unavailability, timeouts and/or higher than normal latency. The event began following an unexpected power interruption affecting a single West US datacenter, after which internal monitoring reported widespread device unreachability and downstream service degradation.

Initial impact manifested as broad infrastructure reachability loss and service disruption across multiple dependent workloads in West US. As power stabilization progressed, recovery proceeded in phases – however, a subset of storage and compute infrastructure components did not return to a healthy state immediately, which slowed recovery for dependent components and contributed to ongoing symptoms such as delayed telemetry and slow workload recovery.

Impacted services included Azure App Service, Azure Application Insights, Azure Backup, Azure Cache for Redis, Azure Cognitive Search, Azure Confidential Computing, Azure Container Registry, Azure Data Factory, Azure Database for MySQL – Flexible Server, Azure Databricks, Azure IoT Hub, Azure Kubernetes Service, Azure Monitor, Azure Service Bus, Azure Site Recovery, Azure Storage, Azure Stream Analytics, Azure Virtual Machines, and Microsoft Defender for Cloud Apps. Impacted customers may have observed degraded performance, reduced availability, or delayed telemetry as services continued to recover.

What went wrong and why?

Under normal operating conditions, Azure datacenter infrastructure is designed to tolerate utility power disturbances through redundant electrical systems and automated failover to on-site generators. During this incident, although utility power was still active, an electrical failure in an onsite transformer resulted in the loss of utility power to the datacenter. Although generators started as designed, a cascading failure within a control system prevented the automated transfer of load from utility power to generator power. As a result, Uninterruptible Power Supply (UPS) batteries carried the load for approximately 6 minutes – until they were fully depleted, leading to a complete loss of power at 07:58 UTC.

Once power loss occurred, recovery was not uniform across all dependent infrastructure. Our datacenter operations team restored power to IT racks by leveraging our onsite generators, with 90% of IT racks powered by 09:31 UTC. While infrastructure generally came back online as expected after power restoration, subsets of networking, storage and compute infrastructure required further electrical control system troubleshooting before power could be restored. We were fully restored and running on generator-backed power by 11:29 UTC. Power restoration was intentionally paced due to the inconsistent state of the control system, and the need to avoid destabilizing power that had already been brought back online.

In parallel, telemetry and monitoring pipelines were impacted during the power event, reducing visibility into certain health signals. We relied on alternate validation mechanisms to inform our ongoing restoration efforts. This telemetry impact also contributed to delays in monitoring signals and log ingestion during the recovery and verification phases.

Following power restoration at the datacenter, recovery of customer workloads progressed in stages based on dependencies. Storage recovery was the first critical requirement, as compute hosts must be able to access persistent storage to complete boot, validate system state, and reattach customer data before workloads can resume.

One contributing factor was the behavior of Top-of-Rack (ToR) network devices, which are the switches that provide connectivity between servers in a rack and the broader datacenter network. After the power event, a subset of these devices did not return to service as expected, limiting network reachability for affected nodes. This restricted both access to storage during boot, and the ability to run recovery and repair actions. These networking issues also caused congestion at the rack management layer, which is responsible for monitoring hardware health and coordinating recovery operations within each rack. Elevated health-check activity during recovery further constrained this layer, slowing progress on nodes already in a degraded state.

The power loss affected six storage clusters within the datacenter. Four clusters recovered as expected once power was restored. However, two storage clusters experienced prolonged recovery, which became a primary driver of extended service impact. These two clusters contained a large subset of storage nodes that failed to complete the boot process. The factors contributing to nodes not booting are currently under investigation and will be part of the final PIR. Due to the dependencies that many compute and platform services have on these storage clusters, there was a delay in overall service restoration. After troubleshooting, storage recovery completed by 19:05 UTC.

Compute recovery began once power and storage dependencies were partially restored. Automated recovery systems were available and functioning, but the scale of the event resulted in a surge of recovery activity across the datacenter. This placed sustained pressure on shared infrastructure components responsible for coordinating health checks, power state validation, and repair actions. During the initial recovery window, this elevated load reduced the speed at which individual compute nodes could be stabilized. While many nodes returned to service as expected, some nodes required repeated recovery attempts, and a subset remained in an unhealthy state due to failed storage devices. These behaviors extended the long tail of recovery and are under active investigation by the respective engineering teams. As recovery activity stabilized and pressure on shared systems decreased, compute recovery progressed more predictably and remaining nodes were gradually restored. Compute recovery completed by 23:30 UTC.
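
As an illustration of the kind of concurrency limiting referenced in the repair items below ("limiting the number of simultaneous recovery actions"), the following is a minimal sketch – with hypothetical names and thresholds, not Azure's actual recovery tooling – of a recovery orchestrator that caps concurrent node repairs so that a datacenter-wide surge cannot overwhelm shared coordination systems:

```python
# Minimal sketch only – hypothetical names and limits, not Azure's recovery tooling.
import asyncio
import random

MAX_CONCURRENT_REPAIRS = 50  # hypothetical cap, sized to shared-infrastructure headroom


async def repair_node(node_id: str, limiter: asyncio.Semaphore) -> None:
    """Run health checks and repair actions for one node, within the global cap."""
    async with limiter:
        # Placeholder for real work: power-state validation, health checks, reboots.
        await asyncio.sleep(random.uniform(0.1, 0.5))


async def recover_datacenter(node_ids: list[str]) -> None:
    """Repair all nodes, never exceeding MAX_CONCURRENT_REPAIRS at once."""
    limiter = asyncio.Semaphore(MAX_CONCURRENT_REPAIRS)
    await asyncio.gather(*(repair_node(n, limiter) for n in node_ids))


if __name__ == "__main__":
    asyncio.run(recover_datacenter([f"node-{i:04d}" for i in range(500)]))
```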

By 03:42 UTC on 08 February, the faulty onsite transformer had been repaired, and the datacenter was successfully transitioned back to utility power, without any customer interruption. The control system has since been reconfigured to ensure that future utility-to-generator transfer events can complete successfully, maintaining continuous generator-backed operation if required.

How did we respond?

  • 07:52 UTC on 07 February 2026 – Initial electrical failure in an onsite transformer. No customer impact at this stage, as the UPS batteries carried the load for approximately 6 minutes.
  • 07:58 UTC on 07 February 2026 – Initial customer impact began, once UPS batteries had depleted. Customers began experiencing unavailability and delayed monitoring/log data.
  • 08:07 UTC on 07 February 2026 – Datacenter and engineering teams engaged and initiated coordinated investigation across power, network, storage, and compute recovery workstreams.
  • 09:31 UTC on 07 February 2026 – 90% of facility power restored on backup generators, initiating recovery.
  • 11:26 UTC on 07 February 2026 – 93% of facility power restored on backup generators.
  • 11:29 UTC on 07 February 2026 – 100% of facility power restored on backup generators with continued recovery of hosted services.
  • 12:15 UTC on 07 February 2026 – Storage recovery started, and dependent services began to recover.
  • 15:00 UTC on 07 February 2026 – Targeted remediation actions continued for remaining unhealthy services.
  • 19:05 UTC on 07 February 2026 – Storage recovery was complete.
  • 23:30 UTC on 07 February 2026 – Compute impact mitigated.
  • 03:42 UTC on 08 February 2026 – Power transitioned back to utility power.
  • 04:24 UTC on 08 February 2026 – All affected services restored.

How are we making incidents like this less likely or less impactful?

  • The control system has been reconfigured to ensure that any future transfer-to-generator event will be successful in keeping the facility powered on generators. (Completed) 
  • We are investigating the factors that contributed to the delayed recovery of Top-of-Rack (ToR) network devices, to determine how to make their recovery more robust. (Estimated completion: TBD)
  • We are investigating potential repair work related to storage devices that did not come back online after power restoration. (Estimated completion: TBD)
  • We are exploring ways to harden our recovery processes, including potentially limiting the number of simultaneous recovery actions during surge scenarios like this one. (Estimated completion: TBD) 
  • We are improving isolation between recovery workflows to prevent constrained dependencies from slowing unrelated repair activities, ultimately to recover more quickly. (Estimated completion: TBD) 
  • We are reevaluating how best to prioritize critical infrastructure components earlier in the recovery sequence, to ensure a smoother recovery across dependent services. (Estimated completion: TBD)

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

2

Join either of our upcoming 'Azure Incident Retrospective' livestreams discussing this incident (to hear from our engineering leaders, and to get any questions answered by our experts) or watch a recording of the livestream (available the following week, on YouTube):


What happened? 

Between 18:03 UTC on 02 February 2026 and approximately 00:30 UTC on 03 February 2026, a platform issue caused some customers to experience degraded performance and control plane failures, for multiple Azure services across multiple regions. Impacted services included: 

  • Azure Virtual Machines (VMs) – Customers may have experienced failures when deploying or scaling virtual machines, including errors during provisioning and lifecycle operations. 
  • Azure Virtual Machine Scale Sets (VMSS) – Customers may have experienced failures when scaling instances or applying configuration changes. 
  • Azure Kubernetes Service (AKS) – Customers may have experienced failures in node provisioning and extension installation. 
  • Azure DevOps (ADO) and GitHub Actions – Customers may have experienced pipeline failures when tasks required VM extensions or related packages. Actions jobs queued and timed out while waiting to acquire a hosted runner. Other GitHub features that leverage this compute infrastructure were similarly impacted, including Copilot Coding Agent, Copilot Code Review, CodeQL, Dependabot, GitHub Enterprise Importer, and Pages. All regions and runner types were impacted.
  • Other dependent services – Customers may have experienced degraded performance or failures in operations that required downloading extension packages from Microsoft-managed storage accounts, including Azure Arc enabled servers, and Azure Database for PostgreSQL.  

As part of our recovery, between 00:08 UTC and 06:05 UTC on 03 February 2026, customers using the 'Managed identities for Azure resources' service (formerly Managed Service Identity) may have experienced failures when attempting to create, update, or delete Azure resources, or acquire Managed Identity tokens – in the East US and West US regions. Originally communicated on incident tracking ID _M5B-9RZ, this impacted the creation and management of Azure resources with assigned managed identities – including but not limited to Azure AI Video Indexer, Azure Chaos Studio, Azure Container Apps, Azure Database for PostgreSQL Flexible Servers, Azure Databricks, Azure Kubernetes Service, Azure Stream Analytics, Azure Synapse Analytics, and Microsoft Copilot Studio. Newly created resources of these types may have also failed to authenticate using managed identity. 

What went wrong, and why? 

On 02 February 2026, a remediation workflow executed based on a resource policy, intended to disable anonymous access on Microsoft-managed storage accounts. The purpose of this policy is to reduce unintended exposure by ensuring that anonymous access is not enabled unless explicitly required.

Due to a data synchronization problem in the targeting logic, the policy was incorrectly applied to a subset of storage accounts that are intentionally configured to allow anonymous read access for platform functionality. These storage accounts form part of the VM extension package storage layer. VM extensions are small applications used during provisioning and lifecycle operations to configure, secure, and manage VMs. Extension artifacts are retrieved directly from storage accounts, during early-stage provisioning and scaling workflows.
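
To make the targeting failure mode concrete, here is a minimal sketch – hypothetical field and helper names, not the actual policy engine – of the kind of exemption-aware check that should gate such a remediation workflow before it disables anonymous access on an account:

```python
# Minimal sketch only – hypothetical data model, not the real remediation workflow.
from dataclasses import dataclass


@dataclass
class StorageAccount:
    name: str
    allow_anonymous_access: bool
    # Hypothetical flag identifying accounts that intentionally serve public
    # artifacts (e.g. VM extension packages) and must be excluded.
    platform_exempt: bool


def accounts_to_remediate(accounts: list[StorageAccount]) -> list[StorageAccount]:
    """Select only accounts that allow anonymous access AND are not exempt."""
    return [
        acct for acct in accounts
        if acct.allow_anonymous_access and not acct.platform_exempt
    ]


def remediate(acct: StorageAccount) -> None:
    # Placeholder for the real change, e.g. setting allowBlobPublicAccess = false.
    acct.allow_anonymous_access = False
    print(f"disabled anonymous access on {acct.name}")


if __name__ == "__main__":
    fleet = [
        StorageAccount("vm-extension-artifacts", True, platform_exempt=True),
        StorageAccount("team-scratch-data", True, platform_exempt=False),
    ]
    for acct in accounts_to_remediate(fleet):
        remediate(acct)  # only 'team-scratch-data' is touched
```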

When anonymous access was disabled across these storage accounts, VMs and dependent services were unable to retrieve required extension artifacts. This resulted in control plane failures and degraded performance across affected services. 

Although the policy process followed safe deployment practices and was initially rolled out region by region, the health assessment defect caused the incorrect targeting state to propagate broadly across public cloud regions in a short time window. This accelerated the impact beyond the intended blast radius.
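
For context on the safe deployment behavior described above, here is a minimal sketch – hypothetical ring names and bake times, not Azure's actual Safe Deployment Practices tooling – of a health-gated, ring-by-ring rollout, the control that a defective health assessment effectively bypasses:

```python
# Minimal sketch only – hypothetical rings and bake times, not Azure's SDP tooling.
import time

ROLLOUT_RINGS = ["canary", "pilot", "region-group-1", "region-group-2", "broad"]
BAKE_TIME_S = 1  # shortened for illustration; real bake times are hours or days


def apply_change(ring: str) -> None:
    """Placeholder for pushing the policy change to one ring."""
    print(f"applied change to {ring}")


def ring_is_healthy(ring: str) -> bool:
    """Placeholder health assessment (error rates, availability, customer signals)."""
    return True  # a defect that always reports healthy defeats the whole gate


def roll_out() -> None:
    for ring in ROLLOUT_RINGS:
        apply_change(ring)
        time.sleep(BAKE_TIME_S)  # let health signals accumulate before expanding
        if not ring_is_healthy(ring):
            print(f"halting rollout: {ring} unhealthy; initiating rollback")
            return
    print("rollout completed across all rings")


if __name__ == "__main__":
    roll_out()
```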

Once identified, the policy job was immediately disabled to prevent further changes. Rollback procedures to restore anonymous access were validated in a test region and then executed region by region across affected storage accounts. As access was restored, control plane operations gradually recovered and service management queues began to clear.

As storage access was restored, a secondary issue emerged. The re-enablement of the VM extension package storage layer allowed multiple previously blocked workflows to resume concurrently, across infrastructure orchestration systems and dependent services. 

The resulting surge in resource management operations generated a significant and sudden increase in requests to the backend service supporting Managed Identities for Azure resources. As a result, starting at 00:08 UTC on 03 February 2026, customers in East US and West US experienced failures when attempting to create, update, or delete Azure resources that use managed identities, as well as when acquiring Managed Identity tokens.

Automatic retry behaviors in upstream components, designed to provide resiliency under transient failure conditions, amplified traffic against the managed identity backend. The cumulative effect exceeded service limits in East US. Resilient routing mechanisms subsequently directed retry traffic to West US, resulting in similar saturation.

Although the managed identity infrastructure was scaled out, additional capacity alone did not immediately mitigate impact – because retry amplification continued to generate load at a rate that exceeded recovery capacity. Active load shedding and controlled reintroduction of traffic were required to stabilize the service and allow backlog processing while maintaining normal operations, with full recovery by 06:05 UTC.
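
To illustrate the retry-amplification dynamic described above, here is a minimal sketch – hypothetical thresholds, not the actual upstream components – of client-side retries with exponential backoff, jitter, and a retry budget, so that callers back off and eventually shed load rather than saturating a recovering backend:

```python
# Minimal sketch only – hypothetical thresholds, not the services in this incident.
import random
import time


class TransientError(Exception):
    """Raised by the wrapped operation for retryable failures (placeholder type)."""


class RetryBudgetExhausted(Exception):
    """Raised when the client sheds load instead of retrying further."""


MAX_ATTEMPTS = 5
BASE_DELAY_S = 0.5


def call_with_backoff(operation, budget: dict):
    """Invoke operation(), retrying transient failures with capped, jittered backoff."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return operation()
        except TransientError:
            if attempt == MAX_ATTEMPTS - 1:
                raise  # out of attempts; surface the failure
            if budget["remaining"] <= 0:
                # Shed load instead of hammering an already saturated backend.
                raise RetryBudgetExhausted("retry budget exhausted; failing fast")
            budget["remaining"] -= 1
            # Full jitter: sleep a random interval up to the exponential cap.
            time.sleep(random.uniform(0, BASE_DELAY_S * (2 ** attempt)))


if __name__ == "__main__":
    budget = {"remaining": 100}  # hypothetical per-process retry budget
    flaky_calls = iter([TransientError(), TransientError(), "token"])

    def get_token():
        result = next(flaky_calls)
        if isinstance(result, Exception):
            raise result
        return result

    print(call_with_backoff(get_token, budget))  # prints "token" after two retries
```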

How did we respond? 

  • 18:03 UTC on 02 February 2026 – Customer impact began, triggered by the periodic remediation workflow starting.
  • 18:29 UTC on 02 February 2026 – Internal service monitoring detected a subset of regions having an increasing number of control plane failures. 
  • 19:46 UTC on 02 February 2026 – Correlated issues across multiple regions.
  • 19:55 UTC on 02 February 2026 – Service monitoring detected failure rates exceeding failure limit thresholds. 
  • 20:10 UTC on 02 February 2026 – We began collaborating to devise a mitigation solution and investigate the underlying factors. 
  • 21:15 UTC on 02 February 2026 – We applied a primary proposed mitigation and validated that it was successful on a test instance.
  • 21:18 UTC on 02 February 2026 – We identified and disabled the remediation workflow and stopped any ongoing activity, so it would not impact additional storage accounts.
  • 21:50 UTC on 02 February 2026 – Began broader mitigation to impacted storage accounts. Customers saw improvements as work progressed. 
  • 00:07 UTC on 03 February 2026 – Storage accounts for a few high volume VM extensions that utilize managed identity finished their re-enablement in East US. 
  • 00:08 UTC on 03 February 2026 – Unexpectedly, customer impact increased first in East US and cascaded to West US with retries, as a critical managed identity service degraded under recovery load. 
  • 00:14 UTC on 03 February 2026 – Automated alerting identified availability impact to Managed Identity services in East US. Engineers quickly recognized that the service was overloaded and began to scale out. 
  • 00:30 UTC on 03 February 2026 – All extension hosting storage accounts had been reenabled, mitigating this impact in all regions other than East US and West US. 
  • 00:50 UTC on 03 February 2026 – Initial scale-out of managed identity service infrastructure completed, but the new resources were still unable to handle the traffic volume due to the increasing backlog of retried requests.
  • 02:00 UTC on 03 February 2026 – A second, larger set of managed identity service scale outs completed. Once again, the capacity was unable to handle the volume of backlogs and retries. 
  • 02:15 UTC on 03 February 2026 – Reviewed additional data and monitored downstream services to ensure that all mitigations were in place for all impacted storage accounts. 
  • 03:55 UTC on 03 February 2026 – To recover infrastructure capacity for the managed identity service, we began rolling out a change to remove all traffic so that the infrastructure could be repaired without load. 
  • 04:25 UTC on 03 February 2026 – After infrastructure nodes recovered, we began gradually ramping traffic to them, allowing backlogged identity operations to begin to process safely. 
  • 06:05 UTC on 03 February 2026 – Backlogged operations completed, and services returned to normal operating levels. We concluded our monitoring and confirmed that all customer impact had been mitigated.

How are we making incidents like this less likely or less impactful? 

  • We are strengthening traffic controls and throttling so that unexpected load spikes or retry storms are contained early, and cannot overwhelm the managed identity service or impact other customers. (Estimated completion: February 2026)
  • We are working to fix the data synchronization problem, by addressing detection gaps and code defects with validation coverage, to protect from similar scenarios. (Estimated completion: March 2026) 
  • We will exercise and optimize the rollback of the remediation workflow, including automation where appropriate, to mitigate issues more quickly. (Estimated completion: April 2026)
  • We are increasing capacity headroom and tightening scaling safeguards, to ensure sufficient resources remain available during recovery events and sudden demand. (Estimated completion: April 2026)
  • We will improve orchestration of the remediation workflow to be more in line with our Safe Deployment Practices, integrated with service health indicators, to reduce multi-region failure risk. (Estimated completion: May 2026) 
  • We are improving retry behavior and internal efficiency, so the service degrades more gracefully under stress. (Estimated completion: June 2026) 
  • We are improving regional isolation and failover behavior, to prevent issues in one region from cascading into paired regions during failures or recovery. (Estimated completion: July 2026)

How can customers make incidents like this less impactful?

  • There was nothing that customers could have done to avoid or minimize impact from this specific service incident. 
  • Note that the impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact:
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more:

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

December 2025

22

Watch our 'Azure Incident Retrospective' video about this incident:

What happened?

Between 08:05 UTC and 18:30 UTC on 22 December 2025, the Microsoft Entra Privileged Identity Management (PIM) service experienced issues that impacted role activations for a subset of customers. Some requests returned server errors (5xx) and, for a portion of customers, some requests returned unexpected client errors (4xx). This issue manifested as API failures, elevated latency, and various activation errors.

Impacted actions could have included:

  • Reading eligible or active role assignments
  • Eligible or Active Role assignment management
  • PIM Policy and PIM Alerts management
  • Activation of Eligible or Active Role assignments
  • Deactivation of role assignments initiated by the customer (deactivations triggered by the expiration of a previous activation were not impacted)
  • Approval of role assignment activation or extension

Customer-initiated PIM operations from various Microsoft management portals, the mobile app, and API calls were likely impacted. During the incident period, operations that were retried may have succeeded.

What went wrong and why?

Under normal conditions, Privileged Identity Management processes requests—such as role activations—by coordinating activity across its front-end APIs, traffic routing layer, and the backend databases that store role activation information. These components work together to route requests to healthy endpoints, complete required checks, and process activations without delay. To support this flow, the system maintains a pool of active connections for responsiveness and automatically retries brief interruptions to keep error rates low, helping ensure customers experience reliable access when reading, managing, or activating role assignments.

As part of ongoing work to improve how the PIM service stores and manages data, configuration changes which manage the backend databases were deployed incrementally using safe deployment practices. The rollout progressed successfully through multiple rings, with no detected errors and all monitored signals remaining within healthy thresholds.

When the deployment reached a ring operating under higher workload, the additional per-request processing increased demand on the underlying infrastructure that hosts the API which manages connections to the backend databases. Although the service includes throttling mechanisms designed to protect against spikes in API traffic, this scenario led to elevated CPU utilization without exceeding request-count thresholds, so throttling did not engage. Over time, the sustained load caused available database connections to be exhausted, and the service became unable to process new requests efficiently. This resulted in delays, timeouts, and errors for customers attempting to view or activate privileged roles.
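
To illustrate the throttling gap described above, here is a minimal sketch – hypothetical thresholds, not the PIM service's actual throttling logic – of an admission check that considers CPU utilization in addition to request rate:

```python
# Minimal sketch only – hypothetical thresholds, not PIM's throttling implementation.
import os

REQUESTS_PER_SECOND_LIMIT = 1000  # hypothetical request-rate threshold
CPU_UTILIZATION_LIMIT = 0.85      # hypothetical CPU threshold


def cpu_utilization() -> float:
    """Rough 1-minute load average normalized by core count (POSIX only)."""
    return os.getloadavg()[0] / (os.cpu_count() or 1)


def admit_request(current_rps: float) -> bool:
    """Admit new work only while both signals are within healthy limits."""
    if current_rps > REQUESTS_PER_SECOND_LIMIT:
        return False  # classic request-count throttle
    if cpu_utilization() > CPU_UTILIZATION_LIMIT:
        return False  # CPU-aware guard, of the kind the incident shows request
                      # counts alone cannot provide
    return True


if __name__ == "__main__":
    print("admit:", admit_request(current_rps=250.0))
```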

How did we respond?

  • 21:36 UTC on 15 December 2025 – The configuration change deployment was initiated.
  • 22:08 UTC on 19 December 2025 – The configuration change was progressed to the ring with the heavy workload.
  • 08:05 UTC on 22 December 2025 – Initial customer impact began.
  • 08:26 UTC on 22 December 2025 – Automated alerts received for intermittent, low volume of errors prompting us to start investigating.
  • 10:30 UTC on 22 December 2025 – We attempted isolated restarts on impacted database instances in an effort to mitigate low-level impact.
  • 13:03 UTC on 22 December 2025 – Automated monitoring alerted us to elevated error rates and a full incident response was initiated.
  • 13:22 UTC on 22 December 2025 – We identified that calls to the database were intermittently timing out. Traffic volume appeared to be normal with no significant surge detected; however, we observed a spike in CPU utilization.
  • 13:54 UTC on 22 December 2025 – Mitigation efforts began, including beginning to scale out the impacted environments.
  • 15:05 UTC on 22 December 2025 – Scale-out efforts were observed to decrease error rates but did not completely eliminate failures. Further instance restarts provided temporary relief.
  • 15:25 UTC on 22 December 2025 – Scaling efforts continued. We engaged our database engineering team to help investigate.
  • 16:37 UTC on 22 December 2025 – While we had not yet correlated the deployment to this incident, we initiated a rollback of the configuration change.
  • 17:20 UTC on 22 December 2025 – Scale-out efforts completed.
  • 17:45 UTC on 22 December 2025 – Service availability telemetry was showing improvements. Some customers began to report recovery.
  • 18:30 UTC on 22 December 2025 – Customer impact confirmed as mitigated, after rollback of configuration change had completed and error rates had returned to normal levels.

How are we making incidents like this less likely or less impactful?

  • We have rolled back the problematic configuration change across all regions. (Completed)
  • For outages that manifest later as a result of configuration updates, we are developing a mechanism to help engineers correlate these signals more quickly. (Estimated completion: January 2026)
  • We are working to ensure this configuration change will not inadvertently introduce excessive load before we redeploy it. (Estimated completion: January 2026)
  • We are working on updating our auto-scale configuration to be more responsive to changes in CPU usage. (Estimated completion: January 2026)
  • We are enabling monitoring and runbooks for available database connections to respond to emerging issues sooner. (Estimated completion: February 2026)

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

8

What happened?

Between 11:04 and 14:13 EST on 08 December 2025, customers using any of the Azure Government regions may have experienced failures when attempting to perform service management operations through Azure Resource Manager (ARM). This included operations attempted through the Azure Portal, Azure REST APIs, Azure PowerShell, and Azure CLI.

Affected services included but were not limited to: Azure App Service, Azure Backup, Azure Communication Services, Azure Data Factory, Azure Databricks, Azure Functions, Azure Kubernetes Service, Azure Maps, Azure Migrate, Azure NetApp Files, Azure OpenAI Service, Azure Policy (including Machine Configuration), Azure Resource Manager, Azure Search, Azure Service Bus, Azure Site Recovery, Azure Storage, Azure Virtual Desktop, Microsoft Fabric, and Microsoft Power Platform (including AI Builder and Power Automate).

What went wrong and why?

Azure Resource Manager (ARM) is the gateway for management operations for Azure services. ARM performs authorization for these operations based on authorization policies, stored in Cosmos DB accounts that are replicated to all regions. On 08 December 2025, an inadvertent automated key rotation resulted in ARM failing to fetch the authorization policies needed to evaluate access. As a result, ARM was temporarily unable to communicate with underlying storage resources, causing failures in service-to-service communication and affecting resource management workflows across multiple Azure services. This issue surfaced as authentication failures and 500 Internal Server Error responses to customers across all clients. Because the content of the Cosmos DB accounts for authorization policies is replicated globally, all regions within the Azure Government cloud were affected.

Microsoft services use an internal system to manage keys and secrets, which also makes it easy to perform regular needed maintenance activities, such as rotating secrets. Protecting identities and secrets is a key pillar in our Secure Future Initiative to reduce risk, enhance operational maturity, and proactively prepare for emerging threats to identity infrastructure – by prioritizing secure authentication and robust key management. In this case, our ARM service was using a key in ‘manual mode’, which means that any key rotations would need to be manually coordinated, so that traffic could be moved to use a different key before the key could be regenerated. The Cosmos DB accounts that ARM uses for accessing authorization policies were intentionally onboarded to the Microsoft internal service which governs the account key lifecycle, but were unintentionally configured with the option to automatically rotate the keys enabled. This automated rotation should have been disabled as part of the onboarding process, until such time as it was ready to be fully automated.
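
To illustrate why a key in ‘manual mode’ requires coordinated rotation, here is a minimal sketch – a hypothetical in-memory model, not an Azure SDK – of the safe ordering: repoint consumers to the secondary key, verify the primary key no longer shows usage, and only then regenerate it:

```python
# Minimal sketch only – hypothetical in-memory model and helpers, not an Azure SDK.
import secrets

# Hypothetical model of a Cosmos DB account's keys and the consumers using them.
account = {
    "keys": {"primary": "key-A", "secondary": "key-B"},
    "consumers": {"arm-frontend-1": "primary", "arm-frontend-2": "primary"},
}


def switch_consumers_to(key_kind: str) -> None:
    """Repoint every consumer's configuration to the given key."""
    for name in account["consumers"]:
        account["consumers"][name] = key_kind


def usage_count(key_kind: str) -> int:
    """Usage signal: how many consumers still authenticate with this key."""
    return sum(1 for k in account["consumers"].values() if k == key_kind)


def regenerate(key_kind: str) -> None:
    """Regenerate (invalidate and replace) the given key."""
    account["keys"][key_kind] = secrets.token_hex(16)


def rotate_primary_key() -> None:
    switch_consumers_to("secondary")
    if usage_count("primary") > 0:
        # Change-safety control: never regenerate a key that still shows usage –
        # the kind of guard described in the repair items below.
        raise RuntimeError("primary key still in use; aborting rotation")
    regenerate("primary")
    print("primary key rotated safely; consumers unaffected")


if __name__ == "__main__":
    rotate_primary_key()
```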

Earlier on the same day of this incident, a similar key rotation issue affected services in the Azure in China sovereign cloud. Both the Azure Government and Azure in China sovereign clouds had their (separate but equivalent) keys created on the same day back in February 2025, starting completely independent timers – so each was inadvertently rotated on its respective timer, approximately three hours apart. As such, the key used by ARM for the Azure Government regions was automatically rotated before the same key rotation issue affecting the Azure in China regions was fully mitigated. Although potential impact to other sovereign clouds was discussed as part of the initial investigation, we did not have a sufficient understanding of the inadvertent key rotation to be able to prevent impact in the second sovereign cloud, Azure Government.

How did we respond?

  • 11:04 EST on 08 December 2025 – Customer impact began.
  • 11:07 EST on 08 December 2025 – Engineering was engaged to investigate based on automated alerts.
  • 11:38 EST on 08 December 2025 – We began applying a fix for the impacted authentication components.
  • 13:58 EST on 08 December 2025 – We began to restart ARM instances, to speed up the mitigation process.
  • 14:13 EST on 08 December 2025 – All customer impact confirmed as mitigated.

How are we making incidents like this less likely or less impactful?

  • First and foremost, our ARM team has conducted an audit to ensure that there are no other manual keys that are misconfigured to be auto-rotated, across all clouds. (Completed)
  • Our internal secret management system has paused automated key rotations for managed keys, until signals on key usage are made available – see the Cosmos DB change safety repair item below. (Completed)
  • We will complete the migration to auto-rotated Cosmos DB account keys for ARM authentication accounts, across all clouds. (Estimated completion: February 2026)
  • Our Cosmos DB team will introduce change safety controls that block regenerating keys that have usage, by emitting a relevant usage signal. (Estimated completion: Public Preview by April 2026, General Availability to follow by August 2026)

How can customers make incidents like this less impactful?

  • There was nothing that customers could have done to avoid or minimize impact from this specific service incident.
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: 
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: 
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: 

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

8

What happened?

Between 16:50 CST on 08 December 2025 and 02:00 CST on 09 December 2025 (China Standard Time) customers using any of the Azure in China regions may have experienced failures when attempting to perform service management operations through Azure Resource Manager (ARM). This included operations attempted through the Azure Portal, Azure REST APIs, Azure PowerShell, and Azure CLI.

Affected services included but were not limited to: Azure AI Search, Azure API Management, Azure App Service, Azure Application Insights, Azure Arc, Azure Automation, Azure Backup, Azure Data Factory, Azure Databricks, Azure Database for PostgreSQL Flexible Server, Azure Kubernetes Service, Azure Logic Apps, Azure Managed HSM, Azure Marketplace, Azure Monitor, Azure Policy (including Machine Configuration), Azure Portal, Azure Resource Manager, Azure Site Recovery, Azure Stack HCI, Azure Stream Analytics, Azure Synapse Analytics, and Microsoft Sentinel.

What went wrong and why?

Azure Resource Manager (ARM) is the gateway for management operations for Azure services. ARM performs authorization for these operations based on authorization policies, stored in Cosmos DB accounts that are replicated to all regions. On 08 December 2025, an inadvertent automated key rotation resulted in ARM failing to fetch the authorization policies needed to evaluate access. As a result, ARM was temporarily unable to communicate with underlying storage resources, causing failures in service-to-service communication and affecting resource management workflows across multiple Azure services. This issue surfaced as authentication failures and 500 Internal Server Error responses to customers across all clients. Because the content of the Cosmos DB accounts for authorization policies is replicated globally, all regions within the Azure in China cloud were affected.

Azure services use an internal system to manage keys and secrets, which also makes it easy to perform regular needed maintenance activities, such as rotating secrets. Protecting identities and secrets is a key pillar in our Secure Future Initiative to reduce risk, enhance operational maturity, and proactively prepare for emerging threats to identity infrastructure – by prioritizing secure authentication and robust key management. In this case, our ARM service was using a key in ‘manual mode’, which means that any key rotations would need to be manually coordinated, so that traffic could be moved to use a different key before the key could be regenerated. The Cosmos DB accounts that ARM uses for accessing authorization policies were intentionally onboarded to the internal service which governs the account key lifecycle, but were unintentionally configured with the option to automatically rotate the keys enabled. This automated rotation should have been disabled as part of the onboarding process, until such time as it was ready to be fully automated.

How did we respond?

  • 16:50 CST on 08 December 2025 – Customer impact began.
  • 16:59 CST on 08 December 2025 – Engineering was engaged to investigate based on automated alerts.
  • 18:37 CST on 08 December 2025 – We identified the underlying cause as the incorrect key rotation.
  • 19:16 CST on 08 December 2025 – We identified mitigation steps and began applying a fix for the impacted authentication components. This was tested and validated before being applied.
  • 22:00 CST on 08 December 2025 – We began to restart ARM instances, to speed up the mitigation process.
  • 23:53 CST on 08 December 2025 – Many services had recovered but residual impact remained for some services.
  • 02:00 CST on 09 December 2025 – All customer impact confirmed as mitigated.

How are we making incidents like this less likely or less impactful?

  • First and foremost, our ARM team has conducted an audit to ensure that there are no other manual keys that are misconfigured to be auto-rotated, across all clouds. (Completed)
  • Our internal secret management system has paused automated key rotations for managed keys, until signals on key usage are made available – see the Cosmos DB change safety repair item below. (Completed)
  • We will complete the migration to auto-rotated Cosmos DB account keys for ARM authentication accounts, across all clouds. (Estimated completion: February 2026)
  • Our Cosmos DB team will introduce change safety controls that block regenerating keys that have usage, by emitting a relevant usage signal. (Estimated completion: Public Preview by April 2026, General Availability to follow by August 2026)

How can customers make incidents like this less impactful?

  • There was nothing that customers could have done to avoid or minimize impact from this specific service incident. 
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: 
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: 
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: 

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: