8 December 2025

What happened?

Between 11:04 and 14:13 EST on 08 December 2025, customers using any of the Azure Government regions may have experienced failures when attempting to perform service management operations through Azure Resource Manager (ARM). This included operations attempted through the Azure Portal, Azure REST APIs, Azure PowerShell, and Azure CLI.

Affected services included but were not limited to: Azure App Service, Azure Backup, Azure Communication Services, Azure Data Factory, Azure Databricks, Azure Functions, Azure Kubernetes Service, Azure Maps, Azure Migrate, Azure NetApp Files, Azure OpenAI Service, Azure Policy (including Machine Configuration), Azure Resource Manager, Azure Search, Azure Service Bus, Azure Site Recovery, Azure Storage, Azure Virtual Desktop, Microsoft Fabric, and Microsoft Power Platform (including AI Builder and Power Automate).

What went wrong and why?

Azure Resource Manager (ARM) is the gateway for management operations for Azure services. ARM performs authorization for these operations based on authorization policies, which are stored in Cosmos DB accounts replicated to all regions. On 08 December 2025, an inadvertent automated key rotation caused ARM to fail when fetching the authorization policies needed to evaluate access. As a result, ARM was temporarily unable to communicate with its underlying storage resources, causing failures in service-to-service communication and affecting resource management workflows across multiple Azure services. To customers, this issue surfaced as authentication failures and HTTP 500 (Internal Server Error) responses across all clients. Because the content of the Cosmos DB accounts holding authorization policies is replicated globally, all regions within the Azure Government cloud were affected.
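
To illustrate the failure mode, below is a minimal Python sketch of a key-based Cosmos DB client along the lines described above. The account URL, database, and container names are placeholder assumptions, not ARM's actual implementation: once the account key is regenerated out from under a running service, every request signed with the cached key is rejected until the service obtains the new key.

```python
from azure.cosmos import CosmosClient, exceptions

# Hypothetical policy store: a globally replicated Cosmos DB account that a
# gateway service (like ARM) reads authorization policies from.
ACCOUNT_URL = "https://policy-store.documents.azure.com:443/"  # placeholder

def fetch_policy(cached_key: str, tenant_id: str):
    client = CosmosClient(ACCOUNT_URL, credential=cached_key)
    container = client.get_database_client("authz").get_container_client("policies")
    try:
        # Point read of the authorization policy for one tenant.
        return container.read_item(item=tenant_id, partition_key=tenant_id)
    except exceptions.CosmosHttpResponseError as err:
        if err.status_code == 401:
            # The account key was rotated out of band: every request signed
            # with the cached key now fails, in every replicated region at
            # once. Propagated upstream, this is what callers saw as
            # authentication failures and HTTP 500s during the incident.
            raise RuntimeError("cached Cosmos DB account key is no longer valid") from err
        raise
```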

Microsoft services use an internal system to manage keys and secrets, which also simplifies regular maintenance activities such as rotating secrets. Protecting identities and secrets is a key pillar of our Secure Future Initiative to reduce risk, enhance operational maturity, and proactively prepare for emerging threats to identity infrastructure – by prioritizing secure authentication and robust key management. In this case, our ARM service was using a key in ‘manual mode’, which means that any key rotation would need to be manually coordinated, so that traffic could be moved to a different key before the original key was regenerated. The Cosmos DB accounts that ARM uses to access authorization policies were intentionally onboarded to the Microsoft internal service that governs the account key lifecycle, but were unintentionally configured with automatic key rotation enabled. This automated rotation should have been disabled as part of the onboarding process, until such time as it was ready to be fully automated.
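
Cosmos DB accounts expose primary and secondary keys precisely so that a ‘manual mode’ rotation can be choreographed safely: move traffic to the alternate key first, then regenerate the old one. Below is a minimal sketch of that choreography using the azure-mgmt-cosmosdb management SDK; the resource names are placeholders, and switch_consumers_to is a hypothetical helper standing in for whatever redeployment step actually moves traffic.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.cosmosdb import CosmosDBManagementClient

SUBSCRIPTION_ID = "<subscription-id>"   # placeholders
RESOURCE_GROUP = "<resource-group>"
ACCOUNT_NAME = "<cosmos-account>"

def switch_consumers_to(key: str) -> None:
    """Hypothetical helper: reconfigure every consumer to sign with `key`."""
    ...

mgmt = CosmosDBManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Step 1: move all traffic to the secondary key before touching the primary.
keys = mgmt.database_accounts.list_keys(RESOURCE_GROUP, ACCOUNT_NAME)
switch_consumers_to(keys.secondary_master_key)

# Step 2: only now is it safe to regenerate the primary key. An automated
# rotation that skips step 1 (which is what happened in this incident)
# invalidates the key that live traffic is still being signed with.
mgmt.database_accounts.begin_regenerate_key(
    RESOURCE_GROUP, ACCOUNT_NAME, {"key_kind": "primary"}
).result()
```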

Earlier on the same day as this incident, a similar key rotation issue affected services in the Azure in China sovereign cloud. Both the Azure Government and Azure in China sovereign clouds had their (separate but equivalent) keys created on the same day back in February 2025, starting completely independent rotation timers – so each key was inadvertently rotated on its own timer, approximately three hours apart. As a result, the key used by ARM for the Azure Government regions was automatically rotated before the same key rotation issue affecting the Azure in China regions was fully mitigated. Although potential impact to other sovereign clouds was discussed as part of the initial investigation, we did not yet understand the inadvertent key rotation well enough to prevent impact in the second sovereign cloud, Azure Government.

How did we respond?

  • 11:04 EST on 08 December 2025 – Customer impact began.
  • 11:07 EST on 08 December 2025 – Engineering was engaged to investigate based on automated alerts.
  • 11:38 EST on 08 December 2025 – We began applying a fix for the impacted authentication components.
  • 13:58 EST on 08 December 2025 – We began to restart ARM instances, to speed up the mitigation process.
  • 14:13 EST on 08 December 2025 – All customer impact confirmed as mitigated.

How are we making incidents like this less likely or less impactful?

  • First and foremost, our ARM team has conducted an audit to ensure that no other manual keys are misconfigured to be auto-rotated, across all clouds. (Completed)
  • Our internal secret management system has paused automated rotations for managed keys, until key usage signals become available – see the Cosmos DB change safety repair item below. (Completed)
  • We will complete the migration to auto-rotated Cosmos DB account keys for ARM authentication accounts, across all clouds. (Estimated completion: February 2026)
  • Our Cosmos DB team will introduce change safety controls that block regenerating keys that are still in use, by emitting a relevant usage signal – see the sketch below. (Estimated completion: Public Preview by April 2026, General Availability to follow by August 2026)
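
One way to picture that change safety control: a pre-check that refuses to regenerate a key while a usage signal shows it is still authenticating traffic. This is an illustrative Python sketch of the idea under assumed interfaces (key_last_used and regenerate_key are hypothetical), not Cosmos DB's actual design.

```python
from datetime import datetime, timedelta, timezone

class KeyStillInUseError(Exception):
    """Raised when a regenerate request targets a key with recent usage."""

def regenerate_key_safely(account, key_kind: str,
                          quiet_period: timedelta = timedelta(hours=24)):
    # Hypothetical usage signal: the timestamp of the last request that was
    # successfully authenticated with this key, as the planned control would emit.
    last_used = account.key_last_used(key_kind)
    if last_used and datetime.now(timezone.utc) - last_used < quiet_period:
        raise KeyStillInUseError(
            f"{key_kind} key was last used {last_used:%Y-%m-%d %H:%M}Z; "
            "move traffic off this key before regenerating it"
        )
    # Only keys with no recent usage may be regenerated.
    return account.regenerate_key(key_kind)
```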

How can customers make incidents like this less impactful?

  • There was nothing that customers could have done to avoid or minimize impact from this specific service incident.
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: 
  • The impact times above represent the full incident duration, so they are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: 
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: 
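
As a starting point, the sketch below creates a subscription-scoped Service Health alert with the azure-mgmt-monitor SDK. It is one possible approach, with placeholder resource names, and it assumes the action group (which defines the email/SMS/webhook targets) already exists.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

SUBSCRIPTION_ID = "<subscription-id>"   # placeholders
RESOURCE_GROUP = "<resource-group>"
ACTION_GROUP_ID = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}"
    "/providers/microsoft.insights/actionGroups/<action-group>"
)

monitor = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Service Health alerts are activity log alerts scoped to the subscription
# with category == ServiceHealth; the action group fans out notifications.
monitor.activity_log_alerts.create_or_update(
    RESOURCE_GROUP,
    "service-health-alert",
    {
        "location": "Global",
        "scopes": [f"/subscriptions/{SUBSCRIPTION_ID}"],
        "condition": {
            "all_of": [{"field": "category", "equals": "ServiceHealth"}]
        },
        "actions": {"action_groups": [{"action_group_id": ACTION_GROUP_ID}]},
        "enabled": True,
        "description": "Notify the right people about Azure service issues.",
    },
)
```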

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: