September 2024
27
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/HSKF-FB0
What happened?
Between 14:47 and 16:25 EDT on 27 September 2024, a platform issue impacted the Azure Resource Manager (ARM) service in our Azure US Government regions. Impacted customers may have experienced issues attempting control plane operations including create, update and delete operations on resources in the Azure Gov cloud, including the US Gov Arizona, US Gov Texas, and/or US Gov Virginia regions.
The incident also impacted downstream services dependent on ARM including Azure App Service, Azure Application Insights, Azure Automation, Azure Backup, Azure Data Factory, Azure Databricks, Azure Event Grid, Azure Kubernetes Service, Azure Log Search Alerts, Azure Maps, Azure Monitor, Azure NetApp Files, Azure portal, Azure Privileged Identity Management, Azure Red Hat OpenShift, Azure Search, Azure Site Recovery, Azure Storage, Azure Synapse, Azure Traffic Manager, Azure Virtual Desktop, and Microsoft Purview.
What went wrong and why?
This issue was initially flagged by our monitors detecting success rate failures. Upon investigating, we discovered that a backend Cosmos DB was misconfigured to block legitimate access from Azure Resource Manager (ARM). This prevented ARM from serving any requests, as the account is global. While Cosmos DB accounts can be replicated to multiple regions, configuration settings are global. When this account was misconfigured, the change immediately applied to all replicas impacting ARM in all regions. Once understood, the misconfiguration was reverted, which fully mitigated all customer impact.
The misconfigured account was for tracking Azure Feature Exposure Control state. Managing feature state is a resource provider where using Entra authentication is appropriate. However, it's also a critical part of processing requests into ARM – as call behavior may change, depending on what feature flags a subscription is assigned.
The Cosmos DB account managing this state had been incorrectly attributed to the resource provider platform, rather than core ARM processing. Since the resource provider platform had completed its migration to Entra authorization for Cosmos DB accounts, the service was disabling local authentication on all of the accounts it owns, as part of our Secure Futures Initiative – for more details, see: https://www.microsoft.com/trust-center/security/secure-future-initiative.
Since the account in question, used to track the exposure control state of Azure features, was misattributed to be part of the resource provider platform, it was included in the update of those accounts. Since Entra depends on ARM, ARM avoids using Entra authentication to Cosmos DB – to prevent circular dependencies.
How did we respond?
- 27 September 2024 @ 14:47 EDT - Customer impact began, triggered by a misconfiguration applied to the underlying Cosmos DB.
- 27 September 2024 @ 14:55 EDT - Monitoring detected success rate failures, on-call teams engaged to investigate.
- 27 September 2024 @ 15:45 EDT - Investigations confirmed the cause as the Cosmos DB misconfiguration.
- 27 September 2024 @ 16:25 EDT - Customer impact mitigated, by reverting the misconfiguration.
How are we making incidents like this less likely or less impactful?
- We are moving the impacted Cosmos DB account to an isolated subscription, to de-risk this failure mode. (Estimated completion: October 2024)
- Furthermore, we will apply management locks to prevent edits to this specific account, given its criticality. (Estimated completion: October 2024)
- In the longer term, our Cosmos DB team are updating processes to roll out account configuration changes in a staggered fashion, to de-risk impact from changes like this one. (Estimated completion: January 2025)
How can customers make incidents like this less impactful?
- There was nothing that customers could have done to prevent or minimize the impact from this specific ARM incident.
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/HSKF-FB0