November 2024
13
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/PSM0-BQ8
What happened?
Between 00:50 UTC and 12:30 UTC on 13 November 2024, a subset of Azure Blob Storage and Azure Data Lake Storage accounts experienced connectivity errors. The issue manifested as loss of access to Blob and Data Lake storage endpoints of the affected storage accounts, and subsequent unavailability of downstream services that depended on these storage accounts. Since many of the impacted storage accounts were used by other Azure services and major software vendor solutions, the customer impact was widespread. Although the affected storage accounts were inaccessible, the data stored in them was not impacted during this incident. Impacted downstream services included:
- Azure Storage: Impacted customers may have experienced name (DNS) resolution failures when interacting with impacted storage accounts in Australia East, Australia Southeast, Brazil South, Brazil Southeast, Canada Central, Canada East, Central India, Central US, East Asia, East US, East US 2, East US 2 EUAP, France Central, Germany West Central, Japan East, Japan West, Korea Central, North Central US, North Europe, Norway East, South Africa North, South Central US, South India, Southeast Asia, Sweden Central, Switzerland North, UAE North, UK South, UK West, West Central US, West Europe, West US, West US 2, West US 3.
- Azure Container Registry: Impacted customers using the East US region may have experienced intermittent 5xx errors while trying to pull images from the registry.
- Azure Databricks: Impacted customers may have experienced failures with launching clusters and serverless compute resources in Australia East, Canada Central, Canada East, Central US, East US, East US 2, Japan East, South Central US, UAE North, West US, and/or West US 2.
- Azure Log Analytics: Impacted customers using the West Europe, Southeast Asia, and/or Korea Central regions may have experienced delays and/or stale data when viewing Microsoft Graph activity logs.
What went wrong and why?
Azure Traffic Manager manages and routes blob and data lake storage API requests. The incident was caused by an unintentional deletion of Traffic Manager profiles for the impacted storage accounts. These Traffic Manager profiles were originally part of an Azure subscription pool which belonged to the Azure Storage service. This original service was split into two separate services, one of which would eventually be deprecated. Ownership of the subscriptions containing the Traffic Manager profiles for storage accounts should have been assigned to the service that was continuing operation, but this reassignment was missed. As a result, the decommissioning process inadvertently deleted the Traffic Manager profiles under the subscription, leading to loss of access to the affected storage accounts. To learn more about Azure Traffic Manager profiles, see: https://learn.microsoft.com/azure/traffic-manager/traffic-manager-manage-profiles.
How did we respond?
After receiving customer reports of issues, our team immediately engaged to investigate. Once we understood what had triggered the problem, our team started to restore the Traffic Manager profiles of the affected storage accounts. The recovery took an extended period of time, since the Traffic Manager profiles had to be reconstructed carefully to avoid further customer impact. We started multiple workstreams in parallel, both to drive manual recovery and to create automated steps to speed up recovery. Recovery was carried out in phases, with the majority of affected accounts restored by 06:24 UTC, and the last set of storage accounts recovered and fully operational by 12:30 UTC. Timeline of key events:
- 13 November 2024 @ 00:50 UTC – First customer impact, triggered by the deletion of a Traffic Manager profile.
- 13 November 2024 @ 01:24 UTC – First customer report of issues, on-call engineering team began to investigate.
- 13 November 2024 @ 01:40 UTC – We identified that the issues were triggered by the deletion of Traffic Manager profiles.
- 13 November 2024 @ 02:16 UTC – Impacted Traffic Manager profiles identified, and recovery planning started.
- 13 November 2024 @ 02:25 UTC – Recovery workstreams started.
- 13 November 2024 @ 03:51 UTC – First batch of storage accounts recovered and validated.
- 13 November 2024 @ 06:00 UTC – Automation to perform regional recovery in place.
- 13 November 2024 @ 06:24 UTC – Majority of recovery completed; most impacted accounts were accessible by this time.
- 13 November 2024 @ 12:30 UTC – Recovery and validation 100% complete, incident mitigated.
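To make the intervals in the timeline above easier to compare, the following minimal sketch computes the minutes elapsed since first impact for each event. The timestamps are copied from the timeline; the event labels and the function name `minutes_since_start` are illustrative choices, not part of the original review.

```python
from datetime import datetime, timezone

# Timestamps copied from the timeline above (all on 13 November 2024, UTC).
# The short labels are illustrative summaries of each timeline entry.
EVENTS = [
    ("first customer impact", "00:50"),
    ("first customer report", "01:24"),
    ("trigger identified", "01:40"),
    ("impacted profiles identified", "02:16"),
    ("recovery workstreams started", "02:25"),
    ("first batch recovered", "03:51"),
    ("regional recovery automation in place", "06:00"),
    ("majority recovered", "06:24"),
    ("fully mitigated", "12:30"),
]


def minutes_since_start(events):
    """Return {label: whole minutes elapsed since the first event}."""
    def parse(hhmm):
        stamp = datetime.strptime(f"2024-11-13 {hhmm}", "%Y-%m-%d %H:%M")
        return stamp.replace(tzinfo=timezone.utc)

    start = parse(events[0][1])
    return {
        label: int((parse(hhmm) - start).total_seconds() // 60)
        for label, hhmm in events
    }


ELAPSED = minutes_since_start(EVENTS)

if __name__ == "__main__":
    for label, minutes in ELAPSED.items():
        print(f"{minutes // 60:02d}h{minutes % 60:02d}m  {label}")
```

With these timestamps, the first customer report arrives 34 minutes after impact began, the majority of accounts are restored after 5 hours 34 minutes, and full mitigation takes 11 hours 40 minutes.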
How are we making incidents like this less likely or less impactful?
- We completed an audit of all the production service artifacts used by the Azure Storage resource provider. (Completed)
- We created a new highly restrictive deployment approval process, as an additional measure to prevent unintended mutations like deletions. (Completed)
- We are improving the process used to clean up production service artifacts, with built-in safety to prevent impact. (Estimated completion: December 2024)
- We are enhancing our monitoring of outside-in storage traffic, making it more sensitive to smaller impacts and validating connectivity and reachability for all endpoints of storage accounts. (Some services completed; all services will complete in December 2024)
- We are expanding and completing the process of securing platform resources with resource locks, as an additional safety automation to prevent deletes. (Estimated completion: January 2025)
- We will accelerate recovery times by refining restore points and optimizing the recovery process for production service artifacts. (Estimated completion: January 2025)
- We will reduce the blast radius with service architectural improvements that improve resiliency against the unavailability of Traffic Manager and other upstream dependencies. (Estimated completion: March 2025)
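As a purely illustrative sketch of what "built-in safety" in a cleanup process could look like, the example below plans a decommission and refuses to delete any artifact that carries a delete lock or is still serving traffic. Everything here is hypothetical: the `Artifact` fields, the `plan_cleanup` function, and the two checks are assumptions made for illustration and do not describe the actual Azure Storage tooling.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    """Hypothetical record describing one production service artifact."""
    name: str
    owner: str                # service recorded as owning the artifact
    locked: bool = False      # True if a delete lock is in place
    recent_requests: int = 0  # traffic observed in a recent window


def plan_cleanup(artifacts, retiring_service):
    """Split the retiring service's artifacts into (safe_to_delete, blocked).

    Each blocked entry is a (name, reason) pair so a human can review it.
    """
    safe, blocked = [], []
    for artifact in artifacts:
        if artifact.owner != retiring_service:
            continue  # out of scope for this decommission
        if artifact.locked:
            blocked.append((artifact.name, "delete lock present"))
        elif artifact.recent_requests > 0:
            blocked.append((artifact.name, "still serving traffic"))
        else:
            safe.append(artifact.name)
    return safe, blocked


if __name__ == "__main__":
    inventory = [
        Artifact("tm-profile-a", "legacy", recent_requests=120),
        Artifact("tm-profile-b", "legacy", locked=True),
        Artifact("tm-profile-c", "legacy"),
    ]
    print(plan_cleanup(inventory, "legacy"))
```

The design point is that a recorded owner alone does not authorize deletion: an artifact wrongly attributed to the retiring service would still be held back if it is locked or in active use.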
How can customers make incidents like this less impactful?
- Consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
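Because this incident surfaced as name (DNS) resolution failures, one simple form of granular monitoring is to check whether your own storage endpoints still resolve. The sketch below is a minimal example, assuming the default public endpoint suffixes `blob.core.windows.net` and `dfs.core.windows.net`; the function name `check_endpoints` and the account name `examplestorageacct` are placeholders, and accounts using custom domains or private endpoints would need different hostnames.

```python
import socket

# Assumed default public endpoint suffixes for Blob Storage and Data Lake Storage.
ENDPOINT_SUFFIXES = ("blob.core.windows.net", "dfs.core.windows.net")


def check_endpoints(account, resolver=socket.getaddrinfo):
    """Return {hostname: bool} indicating whether each endpoint resolves in DNS.

    The resolver is injectable so the check can be exercised without a network.
    """
    results = {}
    for suffix in ENDPOINT_SUFFIXES:
        host = f"{account}.{suffix}"
        try:
            resolver(host, 443)
            results[host] = True
        except socket.gaierror:
            # Raised by getaddrinfo when the name cannot be resolved.
            results[host] = False
    return results


if __name__ == "__main__":
    # "examplestorageacct" is a placeholder; substitute your own account name.
    for host, ok in check_endpoints("examplestorageacct").items():
        print(f"{host}: {'resolves' if ok else 'DNS resolution failed'}")
```

A check like this only confirms name resolution, not that requests succeed, so it complements rather than replaces the monitoring guidance linked above.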
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/PSM0-BQ8
September 2024
27
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/HSKF-FB0
What happened?
Between 14:47 and 16:25 EDT on 27 September 2024, a platform issue impacted the Azure Resource Manager (ARM) service in our Azure US Government regions. Impacted customers may have experienced issues attempting control plane operations including create, update and delete operations on resources in the Azure Gov cloud, including the US Gov Arizona, US Gov Texas, and/or US Gov Virginia regions.
The incident also impacted downstream services dependent on ARM including Azure App Service, Azure Application Insights, Azure Automation, Azure Backup, Azure Data Factory, Azure Databricks, Azure Event Grid, Azure Kubernetes Service, Azure Log Search Alerts, Azure Maps, Azure Monitor, Azure NetApp Files, Azure portal, Azure Privileged Identity Management, Azure Red Hat OpenShift, Azure Search, Azure Site Recovery, Azure Storage, Azure Synapse, Azure Traffic Manager, Azure Virtual Desktop, and Microsoft Purview.
What went wrong and why?
This issue was initially flagged by our monitors detecting success rate failures. Upon investigating, we discovered that a backend Cosmos DB was misconfigured to block legitimate access from Azure Resource Manager (ARM). This prevented ARM from serving any requests, as the account is global. While Cosmos DB accounts can be replicated to multiple regions, configuration settings are global. When this account was misconfigured, the change immediately applied to all replicas impacting ARM in all regions. Once understood, the misconfiguration was reverted, which fully mitigated all customer impact.
The misconfigured account was used to track Azure Feature Exposure Control state. The service that manages feature state is a resource provider, for which Entra authentication is appropriate. However, feature state is also a critical part of processing requests into ARM – as call behavior may change, depending on which feature flags a subscription is assigned.
The Cosmos DB account managing this state had been incorrectly attributed to the resource provider platform, rather than core ARM processing. Since the resource provider platform had completed its migration to Entra authorization for Cosmos DB accounts, the service was disabling local authentication on all of the accounts it owns, as part of our Secure Future Initiative – for more details, see: https://www.microsoft.com/trust-center/security/secure-future-initiative.
Since the account in question, used to track the exposure control state of Azure features, was misattributed to the resource provider platform, it was included in the update of those accounts. However, because Entra depends on ARM, ARM avoids using Entra authentication to Cosmos DB, to prevent circular dependencies, and relies on local authentication instead. Disabling local authentication on this account therefore blocked ARM's access to it.
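The circular dependency described above can be illustrated with a small dependency graph. This is a simplified sketch: the node names and the two graphs are assumptions that model only the relationships stated in this review (ARM reads feature state from Cosmos DB, and Entra depends on ARM), and `find_cycle` is a generic depth-first search, not a tool used by the ARM team.

```python
def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if there is none.

    graph maps each component to the list of components it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {node: WHITE for node in graph}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for dep in graph.get(node, ()):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in list(graph):
        if state[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# ARM authenticates to the feature-state account with local authentication.
LOCAL_AUTH = {
    "ARM": ["feature-state Cosmos DB"],
    "feature-state Cosmos DB": [],
    "Entra": ["ARM"],
}

# Hypothetical alternative: the same account requires Entra authentication.
ENTRA_AUTH = {
    "ARM": ["feature-state Cosmos DB"],
    "feature-state Cosmos DB": ["Entra"],
    "Entra": ["ARM"],
}

if __name__ == "__main__":
    print(find_cycle(LOCAL_AUTH))
    print(find_cycle(ENTRA_AUTH))
```

In this model the local-authentication graph has no cycle, while requiring Entra authentication creates the loop ARM, feature-state Cosmos DB, Entra, ARM, which is the dependency that ARM is designed to avoid.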
How did we respond?
- 27 September 2024 @ 14:47 EDT - Customer impact began, triggered by a misconfiguration applied to the underlying Cosmos DB.
- 27 September 2024 @ 14:55 EDT - Monitoring detected success rate failures, on-call teams engaged to investigate.
- 27 September 2024 @ 15:45 EDT - Investigations confirmed the cause as the Cosmos DB misconfiguration.
- 27 September 2024 @ 16:25 EDT - Customer impact mitigated, by reverting the misconfiguration.
How are we making incidents like this less likely or less impactful?
- We are moving the impacted Cosmos DB account to an isolated subscription, to de-risk this failure mode. (Estimated completion: October 2024)
- Furthermore, we will apply management locks to prevent edits to this specific account, given its criticality. (Estimated completion: October 2024)
- In the longer term, our Cosmos DB team are updating processes to roll out account configuration changes in a staggered fashion, to de-risk impact from changes like this one. (Estimated completion: January 2025)
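The staggered rollout mentioned in the last item above can be sketched as a loop that applies a change to one stage at a time and reverts as soon as a health check fails. This is a generic illustration under stated assumptions: the function `staged_rollout`, its callback parameters, and the stage names are hypothetical and do not describe how the Cosmos DB team implements its process.

```python
def staged_rollout(stages, apply_change, revert_change, is_healthy):
    """Apply a change stage by stage, reverting all applied stages on failure.

    Returns (succeeded, stages_touched). Later stages are never touched
    once a health check fails, which limits the impact of a bad change.
    """
    applied = []
    for stage in stages:
        apply_change(stage)
        applied.append(stage)
        if not is_healthy(stage):
            # Undo in reverse order so the most recent change is reverted first.
            for done in reversed(applied):
                revert_change(done)
            return False, applied
    return True, applied


if __name__ == "__main__":
    config = {}
    ok, touched = staged_rollout(
        ["canary", "batch-1", "batch-2"],
        apply_change=lambda stage: config.__setitem__(stage, "new"),
        revert_change=lambda stage: config.__setitem__(stage, "old"),
        is_healthy=lambda stage: False,  # simulate a change that breaks callers
    )
    print(ok, touched, config)
```

In the simulated run, the health check fails on the first stage, so the change is reverted there and the remaining stages are never modified.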
How can customers make incidents like this less impactful?
- There was nothing that customers could have done to prevent or minimize the impact from this specific ARM incident.
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/HSKF-FB0