18 July 2024
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/1K80-N_8
What happened?
Between 21:40 UTC on 18 July and 22:00 UTC on 19 July 2024, customers may have experienced issues with multiple Azure services in the Central US region, due to an Azure Storage availability event that was resolved by 02:55 UTC on 19 July 2024. This issue affected Virtual Machine (VM) availability, which caused downstream impact to multiple Azure services, including service availability and connectivity issues, and service management failures. Storage scale units hosting Premium v2 and Ultra Disk offerings were not affected.
Services affected by this event included, but were not limited to: Active Directory B2C, App Configuration, App Service, Application Insights, Azure Databricks, Azure DevOps, Azure Resource Manager (ARM), Cache for Redis, Chaos Studio, Cognitive Services, Communication Services, Container Registry, Cosmos DB, Data Factory, Database for MariaDB, Database for MySQL-Flexible Server, Database for PostgreSQL-Flexible Server, Entra ID (Azure AD), Event Grid, Event Hubs, IoT Hub, Load Testing, Log Analytics, Microsoft Defender, Microsoft Sentinel, NetApp Files, Service Bus, SignalR Service, SQL Database, SQL Managed Instance, Stream Analytics, Red Hat OpenShift, and Virtual Machines.
Microsoft cloud services across Microsoft 365, Dynamics 365 and Microsoft Entra were affected as they had dependencies on Azure services impacted during this event.
What went wrong and why?
Virtual Machines with persistent disks utilize disks backed by Azure Storage. As part of security defense-in-depth measures, Storage scale units only accept disk read and write requests from ranges of network addresses that are known to belong to the physical hosts on which Azure VMs run. As VM hosts are added and removed, this set of addresses changes, and the updated information is published to all Storage scale units in the region as an ‘allow list’. In large regions, these updates typically happen at least once per day.
On 18 July 2024, due to routine changes to the VM host fleet, an update to the allow list was being generated for publication to Storage scale units. The source information for the list is read from a set of infrastructure file servers and is structured as one file per datacenter. Due to recent changes in the network configuration of those file servers, some became inaccessible from the server which was generating the allow list. The workflow which generates the list did not detect the ‘missing’ source files, and published an allow list with incomplete VM Host address range information to all Storage scale units in the region. This caused Storage servers to reject all VM disk requests from VM hosts for which the information was missing.
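The failure mode described above (source files silently missing from the merged output) can be illustrated with a minimal sketch of a generation step that halts when any expected per-datacenter file is unreadable. This is not the actual Storage workflow: the file layout (`<datacenter>.txt`, one address range per line), the function name `build_allow_list`, and the `IncompleteSourceError` type are all assumptions made for illustration.

```python
from pathlib import Path


class IncompleteSourceError(RuntimeError):
    """Raised when one or more per-datacenter source files cannot be read."""


def build_allow_list(source_dir: Path, expected_datacenters: list[str]) -> set[str]:
    """Merge per-datacenter address-range files into a single allow list.

    Hypothetical layout: one text file per datacenter, named <dc>.txt,
    containing one address range per line.
    """
    missing: list[str] = []
    ranges: set[str] = set()
    for dc in expected_datacenters:
        path = source_dir / f"{dc}.txt"
        try:
            lines = path.read_text().splitlines()
        except OSError:
            # Record the gap instead of silently skipping the datacenter.
            missing.append(dc)
            continue
        ranges.update(line.strip() for line in lines if line.strip())
    if missing:
        # Halt: publishing a partial list would block hosts in these datacenters.
        raise IncompleteSourceError(f"missing source files for: {sorted(missing)}")
    return ranges
```

The key design choice in this sketch is that the list of expected datacenters is an explicit input, so an unreachable file server produces an error rather than a shorter list.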
The allow list updates are applied to Storage scale units in batches but deploy through a region over a relatively short time window, generally within an hour. This deployment workflow did not check for drops in VM availability, so it continued deploying through the region without following Safe Deployment Practices (SDP) such as Availability Zone sequencing, leading to widespread regional impact.
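A health-gated, batch-by-batch rollout of the kind this paragraph says was missing might look like the following sketch. The names `staged_rollout`, `apply`, and `vm_availability`, and the 99.9% threshold, are hypothetical; the real deployment system's signals and thresholds are not described in this review.

```python
from typing import Callable, Iterable


def staged_rollout(
    batches: Iterable[list[str]],
    apply: Callable[[str], None],
    vm_availability: Callable[[], float],
    min_availability: float = 0.999,  # assumed threshold, for illustration only
) -> tuple[list[str], bool]:
    """Apply an update one batch at a time, stopping if VM availability drops.

    Returns (units_updated, completed). `completed` is False when the
    rollout was halted by the health check.
    """
    updated: list[str] = []
    for batch in batches:
        for unit in batch:
            apply(unit)
            updated.append(unit)
        # Health gate: check the availability signal before the next batch.
        if vm_availability() < min_availability:
            return updated, False
    return updated, True
```

If each batch corresponds to one Availability Zone, a bad update halted after the first batch is confined to that zone rather than spreading through the region.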
Azure SQL Database and Managed Instance:
Due to the storage availability failures, the VMs across various control and data plane clusters failed. As a result, the clusters became unhealthy, resulting in failed service management operations as well as connectivity failures for Azure SQL DB and Azure SQL Managed Instance customers in the region. Within an hour of the incident, we initiated failovers for databases with automatic failover policies. As part of the failover, the geo-secondary is promoted to be the new primary, and the failover group (FOG) endpoint is updated to point to the new primary. This means that applications that connect through the FOG endpoint (as recommended) would be automatically directed to the new region. While this generally happened automatically, fewer than 0.5% of the failed-over databases/instances had issues completing the failover, because failover workflows were terminated or throttled due to high demand on the service manager component. These databases had to be converted to geo-secondary through manual intervention. During this period, applications that did not reroute their connections or use FOG endpoints would have continued writing to the old primary for a prolonged period.
After storage recovery in Central US, 98% of databases recovered and resumed normal operations. However, about 2% of databases had prolonged unavailability as they required additional mitigation to ensure gateways redirected traffic to the primary node. This was caused by the metadata information on the gateway nodes being out of sync with the actual placement of the database replicas.
Azure Cosmos DB:
Users experienced failed service management operations and connectivity failures because both the control plane and data plane rely on Azure Virtual Machine Scale Sets (VMSS) that use Azure Storage for operating system disks, which were inaccessible. The region-wide success rate of requests to Cosmos DB in Central US dropped to 82% at its lowest point, with about 50% of the VMs running the Cosmos DB service in the region being down. The impact spanned multiple availability zones, and the infrastructure went down progressively over the course of 68 minutes. Impact on the individual Cosmos DB accounts varied depending on the customer database account regional configurations and consistency settings as noted below:
- Customer database accounts configured with multi-region writes (i.e. active-active) were not impacted by the incident, and maintained availability for reads and writes by automatically directing traffic to other regions.
- Customer database accounts configured with multiple read regions and a single write region outside of Central US, using session or lower consistency, were not impacted by the incident and maintained availability for reads and writes by directing traffic to other regions. When strong or bounded staleness consistency levels are configured, write requests can be throttled to maintain the configured consistency guarantees, impacting availability until the Central US region is taken offline for the database account or recovers, which unblocks writes. This behavior is expected.
- Customer database accounts configured with multiple read regions, with a single write region in the Central US region (i.e. active-passive) maintained read availability but write availability was impacted until accounts were failed over to the other region.
- Customer database accounts configured with single region (multi-zonal or single zone) in the Central US region were impacted if at least one partition resided on impacted nodes.
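The four cases above can be summarized as a small decision function. This is only a restatement of the list in code form, not Cosmos DB logic: the `AccountConfig` type, its field names, and the consistency strings are hypothetical, and the function assumes the account has a replica in the failed region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccountConfig:
    """Hypothetical description of a database account's regional setup."""
    regions: tuple[str, ...]        # all regions the account is replicated to
    write_regions: tuple[str, ...]  # regions that accept writes
    consistency: str                # e.g. "strong", "bounded_staleness", "session"


def expected_impact(cfg: AccountConfig, failed_region: str = "Central US") -> str:
    """Map an account configuration to the impact described in the list above."""
    if failed_region not in cfg.regions:
        raise ValueError("account has no replica in the failed region")
    if len(cfg.regions) == 1:
        return "impacted if any partition was on affected nodes"
    if len(cfg.write_regions) > 1:
        return "not impacted"  # multi-region writes (active-active)
    if cfg.write_regions == (failed_region,):
        return "reads available; writes impacted until failover"
    if cfg.consistency in ("strong", "bounded_staleness"):
        return "writes may be throttled until the region is offlined or recovers"
    return "not impacted"  # write region elsewhere, session or lower consistency
```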
Additionally, some customers observed errors impacting application availability even if the database accounts were available to serve traffic in other regions. Initial investigations of these reports point to client-side timeouts due to connectivity issues observed during the incident. SDKs’ ability to automatically retry read requests in another region, upon request timeout, depends on the timeout configuration. For more details, please refer to https://learn.microsoft.com/azure/cosmos-db/nosql/conceptual-resilient-sdk-applications
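To illustrate why the timeout configuration matters for cross-region retries, here is a minimal, SDK-independent sketch. It does not use or represent the Cosmos DB SDK: `read_with_regional_fallback` and the `read(region, timeout_s)` callable are hypothetical, and the callable is assumed to raise `TimeoutError` or `ConnectionError` when a region does not respond in time.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def read_with_regional_fallback(
    read: Callable[[str, float], T],
    regions: Sequence[str],
    timeout_s: float,
) -> tuple[str, T]:
    """Try a read in each region in preference order.

    Moves to the next region when the current one times out or the
    connection fails. Returns (region_that_answered, result).
    """
    errors: dict[str, Exception] = {}
    for region in regions:
        try:
            return region, read(region, timeout_s)
        except (TimeoutError, ConnectionError) as exc:
            # Remember why this region failed, then try the next one.
            errors[region] = exc
    raise RuntimeError(f"read failed in all regions: {list(errors)}")
```

In this sketch the worst-case delay before a healthy region answers is roughly `timeout_s` multiplied by the number of unhealthy regions tried first, so a long per-request timeout can make an application appear unavailable even though another region could have served the read.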
Azure DevOps:
The Azure DevOps service experienced impact during this event, where multiple micro-services were impacted. A subset of Azure DevOps customers experienced impact in regions outside of Central US due to some of their data or metadata residing in Central US scale units. Azure DevOps does not offer regional affinity, which means that customers are not tied to a single region within a geography. A deep-dive report specific to the Azure DevOps impact will be published to their dedicated status page, see: https://status.dev.azure.com/_history
Privileged Identity Management (PIM):
Privileged Identity Management (PIM) experienced degradations due to the unavailability of upstream services such as Azure SQL and Cosmos DB, as well as capacity issues. PIM is deployed in multiple regions including Central US, so a subset of customers whose PIM requests are served in the Central US region were impacted. Failover to another healthy region succeeded for SQL and compute, but it took longer for Cosmos DB failover (see Cosmos DB response section for details). The issue was resolved once the failover was completed.
Azure Resource Manager (ARM):
In the United States there was an impact to ARM due to unavailability of dependent services such as Cosmos DB. ARM has a hub model for storage of global state, like subscription metadata. The Central US, West US 3, West Central US, and Mexico Central regions had a backend state in Cosmos DB in Central US. Calls into ARM going to those regions were impacted until the Central US Cosmos DB replicas were marked as offline. ARM's use of Azure Front Door (AFD) for traffic shaping meant that callers in the United States would have seen intermittent failures if calls were routed to a degraded region. As the regions were partially degraded, health checks did not take them offline. Calls eventually succeeded on retries as they were routed to different regions. Any Central US dependency would have failed throughout the primary incident's lifetime. During the incident, this caused a wider perceived impact for ARM across multiple regions due to customers in other regions homing resources in Central US.
Azure NetApp:
The Azure NetApp Files (ANF) service was impacted by this event, causing new volume creation attempts to fail in all NetApp regions. To create new volumes, the ANF Resource Provider (RP) relies on virtual network utilization data, provided by the Azure Dedicated RP (DRP) platform, to decide on the placement of volumes. The Storage issue impacted the data and control plane of several Platform-as-a-Service (PaaS) services used by DRP. This event globally affected ANF because the DRP utilization data’s primary location is in Central US, which could not efficiently fail over writes or redirect reads to replicas in other healthy regions. To recover the ANF control plane, the DRP engineering group worked with utilization data engineers to perform administrative failovers to healthy regions. However, by the time the failover attempt could be made, the Storage service had recovered in the region, and the ANF service recovered on its own by 04:16 UTC on 19 July 2024.
How did we respond?
As the allow list update was being published in the Central US region, our service monitoring began to detect VM availability dropping, and our engineering teams were engaged. Due to the widespread impact, and the primary symptom initially appearing to be a drop in VM disk traffic to Storage scale units, it took time to rule out other possible causes and identify the incomplete storage allow list as the trigger of these issues.
Once correlated, we halted the allow list update workflow worldwide, and our engineering team updated configurations on all Storage scale units in the Central US region to restore availability, which was completed at 02:55 UTC on 19 July. Due to the scale of failures, downstream services took additional time to recover following this mitigation of the underlying Storage issue.
Azure SQL Database and Managed Instance:
Within one minute of SQL unavailability, SQL monitoring detected unhealthy nodes and login failures in the region. Investigation and mitigation workstreams were established, and customers were advised to consider putting into action their disaster recovery (DR) strategies. While we recommend customers manage their failovers, for 0.01% of databases Microsoft initiated the failovers as authorized by customers.
80% of the SQL databases became available within two hours of storage recovery, and 98% were available over the next three hours. Less than 2% required additional mitigations to achieve availability. We restarted gateway nodes to refresh the caches and ensure connections were being routed to the right nodes, and we forced completion of failovers that had not completed.
Azure Cosmos DB:
To mitigate impacted multi-region active-passive accounts with their write region in Central US, we initiated failover of the control plane right after impact started. Control plane failover completed at 22:48 UTC on 18 July 2024, 34 minutes after impact was detected, at which point we initiated service-managed failover of customer accounts. On average, failover of individual accounts took approximately 15 minutes. 95% of failovers were completed without additional mitigations, completing at 02:29 UTC on 19 July 2024, 4 hours 15 minutes after impact was detected. The remaining 5% of database accounts required additional mitigations to complete failovers. We cancelled “graceful switch region” operations triggered by customers via the Azure portal, where these were preventing the service-managed failover triggered by Microsoft from completing. We also forced completion of failovers that had not completed, for database accounts that had a long-running control operation holding a lock on the service metadata, by removing the lock.
As storage recovery was initiated and backend nodes started to come online at various times, Cosmos DB declared impacted partitions as unhealthy. A second workstream, in parallel with failovers, focused on repairs of impacted partitions. For customer database accounts that stayed in the Central US region (single region accounts), availability was restored to >99.9% by 09:41 UTC on 19 July 2024, with impact to all databases mitigated by approximately 19:30 UTC.
As availability of impacted customer accounts was being restored, a third workstream focused on repairs of the backend nodes required prior to initiating failback for multi-region accounts. Failback for database accounts that had been failed over by Microsoft started at 08:51 UTC and continued as repairs progressed. During failback, we brought the Central US region online as a read region for the database accounts; customers could then switch the write region to Central US if and when desired.
A subset of customer database accounts encountered issues during failback that delayed their return to Central US. These issues were addressed but required a redo of the failback by Microsoft. Firstly, a subset of MongoDB API database accounts accessed by certain versions of MongoDB drivers experienced intermittent connectivity issues during failback, which required us to redo the failback in coordination with customers. Secondly, a subset of database accounts with private endpoints after failback to Central US experienced issues connecting to Central US, requiring us to redo the failback.
Detailed timeline of events:
- 21:40 UTC on 18 July 2024 – Customer impact began.
- 22:06 UTC on 18 July 2024 – Service monitoring detected drop in VM availability.
- 22:09 UTC on 18 July 2024 – Initial targeted messaging sent to a subset of customers via Service Health (Azure Portal) as services began to become unhealthy.
- 22:09 UTC on 18 July 2024 – Customer impact for Cosmos DB began.
- 22:13 UTC on 18 July 2024 – Customer impact for Azure SQL DB began.
- 22:14 UTC on 18 July 2024 – Monitoring detected availability drop for Cosmos DB, SQL DB and SQL DB Managed Instance.
- 22:14 UTC on 18 July 2024 – Cosmos DB control plane failover was initiated.
- 22:30 UTC on 18 July 2024 – SQL DB impact was correlated to the Storage incident under investigation.
- 22:45 UTC on 18 July 2024 – Deployment of the incomplete allow list completed, and VM availability in the region reached the lowest level experienced during the incident.
- 22:48 UTC on 18 July 2024 – Cosmos DB control plane failover out of the Central US region completed; service-managed failover was initiated for impacted active-passive multi-region customer databases.
- 22:56 UTC on 18 July 2024 – Initial public Status Page banner posted, investigating alerts in the Central US region.
- 23:27 UTC on 18 July 2024 – All deployments in the Central US region were paused.
- 23:27 UTC on 18 July 2024 – Initial broad notifications sent via Service Health for known services impacted at the time.
- 23:35 UTC on 18 July 2024 – All compute buildout deployments paused for all regions.
- 00:15 UTC on 19 July 2024 – Azure SQL DB and SQL Managed Instance Geo-failover completed for databases with failover group policy set to Microsoft Managed.
- 00:45 UTC on 19 July 2024 – Partial storage ‘allow list’ confirmed as the underlying cause.
- 00:50 UTC on 19 July 2024 – Control plane availability improving on Azure Resource Manager (ARM).
- 01:10 UTC on 19 July 2024 – Azure Storage began updating storage scale unit configurations to restore availability.
- 01:30 UTC on 19 July 2024 – Customers and downstream services began seeing signs of recovery.
- 02:29 UTC on 19 July 2024 – 95% Cosmos DB account failovers completed.
- 02:30 UTC on 19 July 2024 – Azure SQL DB and SQL Managed Instance databases started recovering as the mitigation process for the underlying Storage incident was progressing.
- 02:51 UTC on 19 July 2024 – 99% of all impacted compute resources had recovered.
- 02:55 UTC on 19 July 2024 – Updated configuration completed on all Storage scale units in the Central US region, restoring availability of all Azure Storage scale units. Downstream service recovery and restoration of isolated customer-reported issues continued.
- 05:57 UTC on 19 July 2024 – Cosmos DB availability (% requests succeeded) in the region sustained recovery to >99%.
- 08:00 UTC on 19 July 2024 – 98% of Azure SQL databases had been recovered.
- 08:15 UTC on 19 July 2024 – SQL DB team identified additional issues with gateway nodes as well as failovers that had not completed.
- 08:51 UTC on 19 July 2024 – Cosmos DB started to failback database accounts that were failed over by Microsoft, as Central US infrastructure repair progress allowed.
- 09:15 UTC on 19 July 2024 – SQL DB team started applying mitigations for the impacted databases.
- 09:41 UTC on 19 July 2024 – Cosmos DB availability (% requests succeeded) in the Central US region sustained recovery to >99.9%.
- 15:00 UTC on 19 July 2024 – SQL DB team forced completion of the incomplete failovers.
- 18:00 UTC on 19 July 2024 – SQL DB team completed all gateway node restarts.
- 19:30 UTC on 19 July 2024 – Cosmos DB mitigated all databases.
- 20:00 UTC on 19 July 2024 – SQL DB team completed additional verifications to ensure all impacted databases in the region were in the expected states.
- 22:00 UTC on 19 July 2024 – SQL DB and SQL Managed Instance issue was mitigated, and all databases were verified as recovered.
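The intervals quoted in this review can be checked directly against the timeline. The short sketch below recomputes a few of them from the timestamps listed above; the helper `utc` and the variable names are illustrative only.

```python
from datetime import datetime, timedelta


def utc(day: int, hhmm: str) -> datetime:
    """Build a July 2024 timestamp (UTC, naive) from a day and 'HH:MM'."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2024, 7, day, hour, minute)


impact_start = utc(18, "21:40")
vm_drop_detected = utc(18, "22:06")
cosmos_detected = utc(18, "22:14")
cosmos_control_plane_done = utc(18, "22:48")
cosmos_95_percent_done = utc(19, "02:29")
storage_restored = utc(19, "02:55")
all_mitigated = utc(19, "22:00")

intervals = {
    # 26 minutes from impact start to the VM availability alert
    "impact start to VM detection": vm_drop_detected - impact_start,
    # 34 minutes, as quoted in the Cosmos DB response
    "Cosmos DB detection to control plane failover": cosmos_control_plane_done - cosmos_detected,
    # 4 hours 15 minutes, as quoted in the Cosmos DB response
    "Cosmos DB detection to 95% failovers": cosmos_95_percent_done - cosmos_detected,
    # 5 hours 15 minutes of Storage impact
    "impact start to Storage restored": storage_restored - impact_start,
    # 24 hours 20 minutes for the full incident window
    "impact start to full mitigation": all_mitigated - impact_start,
}

for label, delta in intervals.items():
    print(f"{label}: {delta}")
```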
How are we making incidents like this less likely or less impactful?
- Storage: Fix the allow list generation workflow, to detect incomplete source information and halt. (Completed)
- Storage: Add alerting for requests to storage being rejected by ‘allow list’ checks. (Completed)
- Storage: Change ‘allow list’ deployment flow to serialize by Availability Zones and storage types, and increase deployment period to 24 hours. (Completed)
- Storage: Add additional VM health checks and auto-stop in the allow list deployment workflow. (Estimated completion: July 2024)
- SQL: Reevaluate the policy to initiate Microsoft managed failover of SQL failover groups. Reiterate recommendation for customers to manage their failovers. (Estimated completion: August 2024)
- Cosmos DB: Improve the failback workflow, which affected a subset of MongoDB API customers by causing certain versions of MongoDB drivers to fail to connect to all regions. (Estimated completion: August 2024)
- Cosmos DB: Improve the fail-back workflow for database accounts with private endpoints experiencing connectivity issues to Central US after failback, enabling successful failback without requiring Microsoft to redo the process. (Estimated completion: August 2024)
- Storage: Storage data-plane firewall evaluation will detect invalid allow list deployments, and continue to use last-known-good state. (Estimated completion: September 2024)
- Azure NetApp Files: Improve the logic of several monitors to ensure timely detection and appropriate classification of impacting events. (Estimated completion: September 2024)
- Azure NetApp Files: Additional monitoring of several service metrics to help detect similar issues and correlate events more quickly. (Estimated completion: September 2024)
- SQL: Improve Service Fabric cluster location change notification mechanism’s reliability under load. (Estimated completion: in phases starting October 2024)
- SQL: Improve robustness of geo-failover workflows, to address completion issues. (Estimated completion: in phases starting October 2024)
- Cosmos DB: Eliminate issues that caused delay for the 5% of failovers. (Estimated completion: November 2024)
- Azure DevOps: Working to ensure that all customer metadata is migrated to the appropriate geography, to help limit multi-geography impact. (Estimated completion: January 2025)
- Azure NetApp Files: Decouple regional read/writes, to help reduce the blast radius to single region for this class of issue. (Estimated completion: January 2025)
- Azure NetApp Files: Evaluate the use of caching to reduce reliance on utilization data persisted in stores, to help harden service resilience for similar scenarios. (Estimated completion: January 2025)
- Cosmos DB: Adding automatic per-partition failover for multi-region active-passive accounts, to expedite incident mitigation by automatically handling affected partitions. (Estimated completion: March 2025)
- SQL and Cosmos DB: Azure Virtual Machines is working on the Resilient Ephemeral OS disk improvement, which improves VM resilience to Storage incidents. (Estimated completion: May 2025)
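The Storage repair item on keeping a last-known-good state when an invalid allow list arrives can be sketched as follows. This review does not say how an invalid list will be detected, so the rule used here (reject a candidate that fails to parse, is empty, or drops more than a set fraction of the currently active ranges) is purely an assumption, as are the class and method names.

```python
import ipaddress


class AllowListFirewall:
    """Illustrative firewall that keeps its last-known-good allow list."""

    def __init__(self, initial: list[str], max_shrink_fraction: float = 0.1):
        self._active = frozenset(ipaddress.ip_network(r) for r in initial)
        # Assumed validity rule: at most this fraction of ranges may disappear.
        self._max_shrink = max_shrink_fraction

    def try_update(self, candidate: list[str]) -> bool:
        """Adopt `candidate` if it looks valid; otherwise keep the current list."""
        try:
            parsed = frozenset(ipaddress.ip_network(r) for r in candidate)
        except ValueError:
            return False  # malformed entry: keep last-known-good
        removed = self._active - parsed
        if not parsed or len(removed) > self._max_shrink * len(self._active):
            return False  # suspiciously large removal: keep last-known-good
        self._active = parsed
        return True

    def allows(self, host_ip: str) -> bool:
        """Return True if the host address falls within any active range."""
        addr = ipaddress.ip_address(host_ip)
        return any(addr in net for net in self._active)
```

With a rule like this, a list missing an entire datacenter's ranges would be rejected and disk requests from those hosts would continue to be accepted under the previous list.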
How can customers make incidents like this less impactful?
- Consider implementing a Disaster Recovery strategy for Azure SQL Database: https://learn.microsoft.com/azure/azure-sql/database/disaster-recovery-guidance?view
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/1K80-N_8