September 2024
27
This is our Preliminary PIR that we endeavor to publish within 3 days of incident mitigation to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details/learnings.
What happened?
Between 14:47 and 16:25 EDT on 27 September 2024, a platform issue impacted the Azure Resource Manager (ARM) service in our Azure US Government regions. Impacted customers may have experienced issues attempting control plane operations – including create, update, and delete operations – on resources in the Azure Government cloud, in the US Gov Arizona, US Gov Texas, and/or US Gov Virginia regions. The incident also impacted downstream services dependent on ARM including Azure App Service, Azure Application Insights, Azure Automation, Azure Backup, Azure Data Factory, Azure Databricks, Azure Event Grid, Azure Kubernetes Service, Azure Log Search Alerts, Azure Maps, Azure Monitor, Azure NetApp Files, Azure portal, Azure Privileged Identity Management, Azure Red Hat OpenShift, Azure Search, Azure Site Recovery, Azure Storage, Azure Synapse, Azure Traffic Manager, Azure Virtual Desktop, and Microsoft Purview.
What went wrong and why?
This issue was initially flagged by our monitors detecting success rate failures. Upon investigating, we discovered that a backend Cosmos DB account had been misconfigured to block legitimate access from Azure Resource Manager (ARM). Because this account is global, the misconfiguration prevented ARM from serving any requests. Once this was understood, the misconfiguration was reverted, which fully mitigated all customer impact. While Cosmos DB accounts can be replicated to multiple regions, configuration settings are global – so when this account was misconfigured, the change immediately applied to all replicas, impacting ARM in all regions.
How did we respond?
- 27 September 2024 @ 14:47 EDT - Customer impact began, triggered by a misconfiguration applied to the underlying Cosmos DB.
- 27 September 2024 @ 14:55 EDT - Monitoring detected success rate failures, on-call teams engaged to investigate.
- 27 September 2024 @ 15:45 EDT - Investigations confirmed the cause as the Cosmos DB misconfiguration.
- 27 September 2024 @ 16:25 EDT - Customer impact mitigated, by reverting the misconfiguration.
How are we making incidents like this less likely or less impactful?
- We are moving the impacted Cosmos DB account to an isolated subscription, to de-risk this failure mode. (Estimated completion: October 2024)
- Furthermore, we will apply management locks to prevent edits to this specific account, given its criticality – see the sketch after this list. (Estimated completion: October 2024)
- This is our Preliminary PIR that we endeavor to publish within 3 days of incident mitigation to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details/learnings.
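For illustration, a management lock of this kind can be applied to any Azure resource. The following is a minimal Azure CLI sketch, using hypothetical resource group and Cosmos DB account names; it is not the exact change being made internally.

```bash
# Apply a ReadOnly management lock to a Cosmos DB account, preventing
# configuration edits (and deletion) until the lock is removed.
# "contoso-rg" and "contoso-cosmos" are placeholder names.
az lock create \
  --name protect-cosmos-config \
  --lock-type ReadOnly \
  --resource-group contoso-rg \
  --resource contoso-cosmos \
  --resource-type Microsoft.DocumentDB/databaseAccounts

# List locks on the account to confirm.
az lock list \
  --resource-group contoso-rg \
  --resource contoso-cosmos \
  --resource-type Microsoft.DocumentDB/databaseAccounts \
  --output table
```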
How can customers make incidents like this less impactful?
- There was nothing that customers could have done to prevent or minimize the impact from this specific ARM incident.
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
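As a minimal sketch of that last recommendation, the Azure CLI below creates an action group and an activity log alert for Service Health events. Resource names and the email address are placeholders, and production setups typically scope the condition to specific services or regions.

```bash
# Placeholder names throughout; the alert is scoped to the whole subscription.
SUB_ID=$(az account show --query id -o tsv)

# 1. An action group that emails the on-call distribution list.
az monitor action-group create \
  --name oncall-ag \
  --resource-group contoso-rg \
  --short-name oncall \
  --action email oncall-email oncall@contoso.com

# 2. An activity log alert that fires on Service Health events.
az monitor activity-log alert create \
  --name service-health-alert \
  --resource-group contoso-rg \
  --scope "/subscriptions/$SUB_ID" \
  --condition category=ServiceHealth

# 3. Attach the action group to the alert.
az monitor activity-log alert action-group add \
  --name service-health-alert \
  --resource-group contoso-rg \
  --action-group "/subscriptions/$SUB_ID/resourceGroups/contoso-rg/providers/microsoft.insights/actionGroups/oncall-ag"
```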
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/HSKF-FB0
16
Join one of our upcoming 'Azure Incident Retrospective' livestreams about this incident, or watch on demand: https://aka.ms/AIR/1LG8-1X0
What happened?
Between 18:46 UTC and 20:36 UTC on 16 September 2024, a platform issue resulted in an impact to a subset of customers using Azure Virtual Desktop in several US regions – specifically, those who have their Azure Virtual Desktop configuration and metadata stored in the US Geography, which we host in the East US 2 region. Impacted customers may have experienced failures to access their list of available resources, make new connections, or perform management actions on their Azure Virtual Desktop resources.
This event impacted customers in the following US regions: Central US, East US, East US 2, North Central US, South Central US, West Central US, West US, West US 2, and West US 3. End users with connections established via any of these US regions may also have been impacted. This connection routing is determined dynamically, to route end users to their nearest Azure Virtual Desktop gateway. More information on this process can be found at: https://learn.microsoft.com/azure/virtual-desktop/service-architecture-resilience#user-connections.
What went wrong and why?
Azure Virtual Desktop (AVD) relies on multiple technologies including SQL Database to store configuration data. Each AVD region has a database with multiple secondary copies. We replicate data from our primary copy to these secondary copies using transaction logs, which ensures that all changes are logged and applied to the replicas. Collectively, these replicas house configuration data that is needed when connecting, managing, and troubleshooting AVD.
We determined that this incident was caused by a degradation in the redo process on one of the secondary copies servicing the US geography, which caused failures to access service resources. As the workload picked up on Monday, the redo subsystem on the readable secondary started stalling while processing specific logical log records (around transaction lifetime). Furthermore, we noticed a spike in read latency when fetching the next set of log records needing to be redone. We have identified an improvement and a bug fix in this space, which are being prioritized, along with critical detection and mitigation alerts.
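While the replica involved here is internal to the AVD service, customers running their own geo-replicated Azure SQL databases can check the health of their replication links in a similar spirit. A minimal Azure CLI sketch, with placeholder server and database names:

```bash
# Inspect geo-replication links for a database, showing the partner
# server/region and the current replication state.
# "contoso-rg", "contoso-sql", and "appdb" are placeholder names.
az sql db replica list-links \
  --resource-group contoso-rg \
  --server contoso-sql \
  --name appdb \
  --output table
```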
How did we respond?
At 18:46 UTC on 16 September 2024, our internal telemetry system raised alarms about the redo log on one of our read replicas being several hours behind. Our engineering team immediately started investigating this anomaly. To expedite mitigation, we manually failed over the database to its secondary region.
- 18:46 UTC on 16 September 2024 – Customer impact began. Service monitoring detected that the diagnostics service was behind when processing events.
- 18:46 UTC on 16 September 2024 – We determined that a degradation in the US geo database had caused failures to access service resources.
- 18:57 UTC on 16 September 2024 – Automation attempted to perform a ‘friendly’ geo failover, to move the database out of the region.
- 19:02 UTC on 16 September 2024 – The friendly geo failover operation failed since it could not synchronize data between the geo secondary and the unavailable geo primary.
- 19:59 UTC on 16 September 2024 – To mitigate the issue fully, our engineering team manually started a forced geo-failover of the underlying database to the Central US region (a customer-facing analogue is sketched after this timeline).
- 20:00 UTC on 16 September 2024 – Geo-failover operation completed.
- 20:36 UTC on 16 September 2024 – Customer impact mitigated. Engineering telemetry confirmed that connectivity, application discovery, and management issues had been resolved.
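For reference, the customer-facing analogue of the forced geo-failover mentioned above is a forced failover of a SQL failover group, which promotes the geo-secondary without waiting for synchronization (and can therefore lose the most recent transactions). A minimal sketch, assuming a hypothetical failover group and placeholder names:

```bash
# Force the failover group's secondary server to become primary.
# --allow-data-loss skips synchronization with the (unavailable) primary,
# so any un-replicated transactions are lost.
# "contoso-rg", "contoso-sql-secondary", and "contoso-fog" are placeholders.
az sql failover-group set-primary \
  --resource-group contoso-rg \
  --server contoso-sql-secondary \
  --name contoso-fog \
  --allow-data-loss
```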
How are we making incidents like this less likely or less impactful?
- First and foremost, we will improve our detection of high replication latencies to secondary replicas, to identify problems like this one more quickly. (Estimated completion: October 2024)
- We will improve our troubleshooting guidelines by incorporating mechanisms to perform mitigations without impact, as needed, to bring customers back online more quickly. (Estimated completion: October 2024)
- Our AVD and SQL teams will partner together on updating the SQL Persisted Version Store (PVS) growth troubleshooting guideline, by incorporating this new scenario. (Estimated completion: October 2024)
- To address this particular contention issue in the Redo pipeline, our SQL team has identified a potential new feature (deploying in October 2024) as well as a repair item (which will be completed by December 2024) to de-risk the likelihood of redo lag.
- We will incorporate improved read log generation throttling of non-transactional log records like PVS inserts, which will add another potential mitigation ability to help secondary replicas catch up. (Estimated completion: October 2024)
- Finally, we will work to update our database architecture so that telemetry and control (aka metadata) are stored in separate databases. (Estimated completion: October 2024)
How can customers make incidents like this less impactful?
- For more information on Azure Virtual Desktop host pool object location and data stored there, see https://learn.microsoft.com/azure/virtual-desktop/data-locations#customer-input.
- For more information on Persisted Version Store (PVS), see https://learn.microsoft.com/azure/azure-sql/accelerated-database-recovery
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/1LG8-1X0
5
Join one of our upcoming 'Azure Incident Retrospective' livestreams about this incident, or watch on demand: https://aka.ms/AIR/HVZN-VB0
What happened?
Between 08:00 and 21:30 CST on 5 September 2024, customers in the Azure China regions experienced delays in long-running control plane operations, including create, update, and delete operations on resources. Although these delays affected all Azure China regions, the most significant impact was in China North 3. These delays resulted in various issues for customers and services utilizing Azure Resource Manager (ARM), including timeouts and operation failures.
- Azure Databricks – Customers may have encountered errors or failures while submitting job requests.
- Azure Data Factory – Customers may have experienced Internal Server Errors while running Data Flow Activity or acquiring Data Flow Debug Sessions.
- Azure Database for MySQL – Customers making create/update/delete operations to Azure Database for MySQL would not have seen requests completing as expected.
- Azure Event Hubs - Customers attempting to perform read or write operations may have seen slow response times.
- Azure Firewall - Customers making changes or updating their policies in Azure Firewall may have experienced delays in the updates being completed.
- Azure Kubernetes Service - Customers making Cluster Management operations (such as scaling out, updates, creating/deleting clusters) would not have seen requests completing as expected.
- Azure Service Bus - Customers attempting to perform read or write operations may have seen slow response times.
- Microsoft Purview - Customers making create/update/delete operations to Microsoft Purview accounts/resources would not have seen requests completing as expected.
- Other services leveraging Azure Resource Manager - Customers may have experienced service management operation failures if using services inside of Resource Groups in China North 3.
What went wrong and why?
Azure Resource Manager is the control plane Azure customers use to perform CRUD (create, read, update, delete) operations on their resources. For long-running operations, such as creating a virtual machine, there is background processing of the operation that needs to be tracked. During a certificate rotation, the processes tracking these background operations were repeatedly crashing. While background jobs were still being processed, these crashes caused significant delays and an increasing backlog of jobs. As background jobs may contain sensitive data, their state is encrypted at the application level. When the certificate used to encrypt the jobs is rotated, any in-flight jobs are re-encrypted with the new certificate. The certificate rotation in the Azure China regions exposed a latent bug in this update process, which caused the tracking processes to begin crashing.
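As context for how these background jobs surface to customers, long-running ARM operations return quickly and are then tracked asynchronously until they complete. A minimal Azure CLI sketch of that pattern, using placeholder names, starts an operation without waiting and then polls for its completion – during this incident, the polling stage is where operations would have appeared to hang:

```bash
# Start a long-running control plane operation without blocking.
# "contoso-rg" and "demo-vm" are placeholder names.
az vm create \
  --resource-group contoso-rg \
  --name demo-vm \
  --image Ubuntu2204 \
  --no-wait

# Poll the tracked operation until ARM reports the resource as created.
az vm wait --resource-group contoso-rg --name demo-vm --created
```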
How did we respond?
- 08:00 CST on 5 September 2024 – Customer impact began.
- 11:00 CST on 5 September 2024 – We attempted to increase the worker instance count to mitigate the incident, but were not successful.
- 13:05 CST on 5 September 2024 – We continued to investigate to find contributing factors leading to impact.
- 16:02 CST on 5 September 2024 – We identified a correlation between specific release versions and increased error processing latency, and determined a safe rollback version.
- 16:09 CST on 5 September 2024 – We began rolling back to the previous known good build for one component in China East 3, to validate our mitigation approach.
- 16:55 CST on 5 September 2024 – We started to see indications that the mitigation was working as intended.
- 17:36 CST on 5 September 2024 – We confirmed mitigation for China East 3 and China North 3, and began rolling back to the safe version in other regions using a safe deployment process which took several hours.
- 21:30 CST on 5 September 2024 – The rollback was completed and, after further monitoring, we were confident that service functionality had been fully restored, and customer impact mitigated.
How are we making incidents like this less likely or less impactful?
- We have fixed the latent code bug related to certificate rotation. (Completed)
- We are improving our monitoring and telemetry related to process crashing in the Azure China regions. (Completed)
- Finally, we are moving the background job encryption process to a newer technology that supports fully automated key rotation, to help minimize the potential for impact in the future. (Estimated completion: October 2024)
How can customers make incidents like this less impactful?
- For mission-critical workloads, customers should consider a multi-region geodiversity strategy to avoid impact from incidents like this one that predominantly impacted a single region: https://learn.microsoft.com/training/modules/design-a-geographically-distributed-application/ and https://learn.microsoft.com/azure/architecture/patterns/geodes
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://docs.azure.cn/service-health/alerts-activity-log-service-notifications-portal
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/HVZN-VB0
August 2024
5
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/0N_5-PQ0
What happened?
Between 18:22 and 19:49 UTC on 5 August 2024, some customers experienced intermittent connection errors, timeouts, or increased latency while connecting to Microsoft services that leverage Azure Front Door (AFD), because of an issue that impacted multiple geographies. Impacted services included Azure DevOps (ADO) and Azure Virtual Desktop (AVD), as well as a subset of LinkedIn and Microsoft 365 services. This incident specifically impacted Microsoft’s first-party services, and our customers who utilized these Microsoft services that rely on AFD. For clarity, this incident did not impact Azure customers’ services that use AFD directly.
What went wrong and why?
Azure Front Door (AFD) is Microsoft's scalable platform for web acceleration, global load balancing, and content delivery, operating in over 200 locations worldwide, serving 20+ million requests per second. This incident was triggered by an internal service team doing a routine configuration change to their AFD profile for modifying an internal-only AFD feature. This change included an erroneous configuration file that resulted in high memory allocation rates on a subset of the AFD servers – specifically, servers that deliver traffic for an internal Microsoft customer profile.
While the rollout of the erroneous configuration file did adhere to our Safe Deployment Practices (SDP), this process requires proper validation checks from the service to which the change is being applied (in this case, AFD) before proceeding to additional regions. In this situation, there were no observable failures, so the configuration change was allowed to proceed. This represents a gap in how AFD validates such changes, and it prevented timely auto-rollback of the offending configuration – thus expanding the blast radius of impact to more regions.
The error in the configuration file caused resource exhaustion on AFD’s frontend servers that resulted in an initial moderate impact to AFD’s availability and performance. This initial impact on availability triggered a latent bug in the internal service’s client application, which caused an aggressive request storm. This resulted in a 25X increase in traffic volume of expensive requests that significantly impacted AFD’s availability and performance.
How did we respond?
During our investigation, we identified the offending configuration from the internal service that triggered this event, and we then initiated a rollback of the configuration change to fully restore all impacted Microsoft services.
- 18:22 UTC – Customer impact began, triggered by the configuration change.
- 18:25 UTC – Internal monitoring detected the issue and alerted our teams to investigate.
- 18:27 UTC – Initial first-party services began failing away from AFD as a temporary mitigation.
- 18:53 UTC – We identified the configuration change that caused the impact and started preparing to roll it back.
- 19:23 UTC – We began rolling back the change.
- 19:25 UTC – Rollback of the change completed, restoring AFD request serving capacity.
- 19:49 UTC – Customer impact fully mitigated.
- 20:30 UTC – First-party services began routing traffic back to AFD after mitigation was declared.
How are we making incidents like this less likely or less impactful?
- As an immediate repair item, we are fixing the bug that caused the aggressive request storm by the internal service team’s client application. This development work has completed and is pending rollout. (Estimated completion: August 2024)
- We are making improvements to validate the complexity and resource overhead requirements for all internal customer configuration updates. (Estimated completion: September 2024)
- We are implementing enhancements to the AFD data plane to automatically disable performance-intensive operations for specific configurations. (Estimated completion: October 2024)
- We are enhancing our protection against similar request storms, by determining dynamically any resource intensive requests so that they can be throttled to prevent further impact. (Estimated completion: December 2024).
- In the longer term, we are augmenting our safe deployment processes surrounding configuration delivery to incorporate more signals to determine AFD service health degradations caused by erroneous configurations and automatically rollback. (Estimated completion: March 2025)
- Finally, as an additional protection measure to guard against validation gaps, we are investing in a capability to automatically fall back to a last known good configuration state, in the event of global anomalies detected in AFD’s service health – for example, in response to a multi-region availability drop. (Estimated completion: March 2025)
How can customers make incidents like this less impactful?
- Applications that use exponential backoff in their retry strategy were more likely to succeed – an immediate retry during an interval of high packet loss would likely also have encountered high packet loss, whereas a retry conducted after backing off, during a period of lower loss, would likely have succeeded. For more details on retry patterns, refer to https://learn.microsoft.com/azure/architecture/patterns/retry – a minimal sketch follows this list.
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/azure/architecture/framework/resiliency
- Finally, ensure that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These alerts can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
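A minimal sketch of such a retry loop in shell, with exponential backoff and jitter. The endpoint URL, attempt count, and delays are placeholders to adapt to your application or SDK configuration:

```bash
#!/usr/bin/env bash
# Retry an HTTPS request with exponential backoff and jitter.
url="https://app.contoso.com/api/health"    # placeholder endpoint
max_attempts=5
delay=1

for attempt in $(seq 1 "$max_attempts"); do
  if curl --fail --silent --show-error --max-time 10 "$url"; then
    exit 0                                  # request succeeded
  fi
  echo "attempt $attempt failed; retrying in ${delay}s" >&2
  sleep $((delay + RANDOM % 3))             # backoff plus small jitter
  delay=$((delay * 2))                      # exponential backoff
done

echo "request failed after $max_attempts attempts" >&2
exit 1
```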
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/0N_5-PQ0
July 2024
30
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/KTY1-HW8
What happened?
Between 11:45 and 13:58 UTC on 30 July 2024, a subset of customers experienced intermittent connection errors, timeouts, or latency spikes while connecting to Microsoft services that leverage Azure Front Door (AFD) and Azure Content Delivery Network (CDN). From 13:58 to 19:43 UTC, a smaller set of customers continued to observe a low rate of connection timeouts. Beyond AFD and CDN, downstream services that rely on these were also impacted – including the Azure portal, and a subset of Microsoft 365 and Microsoft Purview services.
After a routine mitigation of a Distributed Denial-of-Service (DDoS) attack, a network misconfiguration caused congestion and packet loss for AFD frontends. For context, we experience an average of 1,700 DDoS attacks per day – these are mitigated automatically by our DDoS protection mechanisms. Customers can learn more about how we manage these events here: https://azure.microsoft.com/blog/unwrapping-the-2023-holiday-season-a-deep-dive-into-azures-ddos-attack-landscape. For this incident, the DDoS attack was merely a trigger event.
What went wrong and why?
Azure Front Door (AFD) is Microsoft's scalable platform for web acceleration, global load balancing, and content delivery, operating in nearly 200 locations worldwide – including datacenters within Azure regions, and edge sites. AFD and Azure CDN are built with platform defenses against network and application layer Distributed Denial-of-Service (DDoS) attacks. In addition to this, these services rely on the Azure network DDoS protection service, for the attacks at the network layer. You can read more about the protection mechanisms at https://learn.microsoft.com/azure/ddos-protection/ddos-protection-overview and https://learn.microsoft.com/azure/frontdoor/front-door-ddos.
Between 10:15 and 10:45 UTC, a volumetric distributed TCP SYN flood DDoS attack occurred at multiple Azure Front Door and CDN sites. This attack was automatically mitigated by the Azure Network DDoS protection service and had minimal customer impact. During this time period, our automated DDoS mitigations sent SYN authentication challenges, as is typical in the industry for mitigating DDoS attacks. As a result, a small subset of customers without retry logic in their application(s) may have experienced connection failures.
At around 11:45 UTC, as the Network DDoS protection service was disengaging and resuming default traffic routing to the Azure Front Door service, the network routes could not be updated within one specific site in Europe. This happened because of network control plane failures at that specific site, due to a local power outage. Consequently, traffic inside Europe continued to be forwarded to AFD through our DDoS protection services, instead of returning directly to AFD. These control plane failures were not caused by, or related to, the initial DDoS trigger event – and in isolation, would not have caused any impact.
However, an unrelated latent network configuration issue caused traffic from outside Europe to be routed to the DDoS protection system within Europe. This led to localized congestion, which caused customers to experience high latency and connectivity failures across multiple regions. The vast majority of the impact was mitigated by 13:58 UTC, when we resolved the routing issue. As was the case during the initial period, a small subset of customers without proper retry logic in their application(s) may have experienced isolated connection failures until 19:43 UTC.
How did we respond?
- 11:45 UTC on 30 July 2024 – Impact started
- 11:47 UTC on 30 July 2024 – Our Azure Portal team detected initial service degradation and began to investigate.
- 12:10 UTC on 30 July 2024 – Our network monitoring correlated this Portal incident to an underlying network issue at one specific site in Europe, and our networking engineers engaged to support the investigation.
- 12:55 UTC on 30 July 2024 – We confirmed localized congestion, so engineers began executing our standard playbook to alleviate congestion – including rerouting traffic.
- 13:13 UTC on 30 July 2024 – Communications were published stating that we were investigating reports of issues connecting to Microsoft services globally, and that customers may experience timeouts connecting to Azure services.
- 13:58 UTC on 30 July 2024 – The changes to reroute traffic successfully mitigated most of the impact by this time, after which the only remaining impact was isolated connection failures.
- 16:15 UTC on 30 July 2024 – While investigating isolated connection failures, we identified a device within Europe that was not properly obeying commands from the Network control plane and was attracting traffic after it had been told to stop attracting traffic.
- 16:58 UTC on 30 July 2024 – We ordered the network control plane to reissue its commands, but the problematic device was not accessible as described above.
- 17:50 UTC on 30 July 2024 – We started the safe removal of the device from the network and began scanning the network for other potential issues.
- 19:32 UTC on 30 July 2024 – We completed the safe removal of the device from the network.
- 19:43 UTC on 30 July 2024 – Customer impact mitigated, as we confirmed availability returned to pre-incident levels.
How are we making incidents like this less likely or less impactful?
- We have already added the missing configuration on network devices, to ensure a DDoS mitigation issue in one geography cannot spread to another. (Completed)
- We are enhancing our existing validation and monitoring in the Azure network, to detect invalid configurations. (Estimated completion: November 2024)
- We are improving our monitoring where our DDoS protection service is unreachable from the control plane, but is still serving traffic. (Estimated completion: November 2024)
How can customers make incidents like this less impactful?
- For customers of Azure Front Door/Azure CDN products, implementing retry logic in your client-side applications can help handle temporary failures when connecting to a service or network resource during mitigations of network layer DDoS attacks. For more information, refer to our recommended error-handling design patterns: https://learn.microsoft.com/azure/well-architected/resiliency/app-design-error-handling#implement-retry-logic
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/azure/architecture/framework/resiliency
- Finally, ensure that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These alerts can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/KTY1-HW8
19
Virtual Machines customers may have experienced unresponsiveness and startup failures on Windows machines using the CrowdStrike Falcon agent, affecting both on-premises environments and various cloud platforms. This issue was caused by a CrowdStrike antivirus software update, which resulted in Windows OS virtual machines experiencing a bug check (BSOD) and entering a continuous restart loop. Customers can access the guidance, which will remain available on our main blog site: https://aka.ms/CSfalcon-VMRecoveryOptions
CrowdStrike’s technical PIR for their outage can be found here: https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/
18
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/1K80-N_8
What happened?
Between 21:40 UTC on 18 July and 22:00 UTC on 19 July 2024, customers may have experienced issues with multiple Azure services in the Central US region, due to an Azure Storage availability event that was resolved by 02:55 UTC on 19 July 2024. This issue affected Virtual Machine (VM) availability, which caused downstream impact to multiple Azure services, including service availability and connectivity issues, and service management failures. Storage scale units hosting Premium v2 and Ultra Disk offerings were not affected.
Services affected by this event included but were not limited to - Active Directory B2C, App Configuration, App Service, Application Insights, Azure Databricks, Azure DevOps, Azure Resource Manager (ARM), Cache for Redis, Chaos Studio, Cognitive Services, Communication Services, Container Registry, Cosmos DB, Data Factory, Database for MariaDB, Database for MySQL-Flexible Server, Database for PostgreSQL-Flexible Server, Entra ID (Azure AD), Event Grid, Event Hubs, IoT Hub, Load Testing, Log Analytics, Microsoft Defender, Microsoft Sentinel, NetApp Files, Service Bus, SignalR Service, SQL Database, SQL Managed Instance, Stream Analytics, Red Hat OpenShift, and Virtual Machines.
Microsoft cloud services across Microsoft 365, Dynamics 365 and Microsoft Entra were affected as they had dependencies on Azure services impacted during this event.
What went wrong and why?
Virtual Machines with persistent disks utilize disks backed by Azure Storage. As part of security defense-in-depth measures, Storage scale units only accept disk read and write requests from ranges of network addresses that are known to belong to the physical hosts on which Azure VMs run. As VM hosts are added and removed, this set of addresses changes, and the updated information is published to all Storage scale units in the region as an ‘allow list’. In large regions, these updates typically happen at least once per day.
On 18 July 2024, due to routine changes to the VM host fleet, an update to the allow list was being generated for publication to Storage scale units. The source information for the list is read from a set of infrastructure file servers and is structured as one file per datacenter. Due to recent changes in the network configuration of those file servers, some became inaccessible from the server which was generating the allow list. The workflow which generates the list did not detect the ‘missing’ source files, and published an allow list with incomplete VM Host address range information to all Storage scale units in the region. This caused Storage servers to reject all VM disk requests from VM hosts for which the information was missing.
The allow list updates are applied to Storage scale units in batches, but deploy through a region over a relatively short time window, generally within an hour. This deployment workflow did not check for drops in VM availability, so it continued deploying through the region without following Safe Deployment Practices (SDP) such as Availability Zone sequencing, leading to widespread regional impact.
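The allow list described above is internal to the platform, but the same pattern of address-based admission control is exposed to customers through storage account network rules. A minimal Azure CLI sketch with placeholder names, purely to illustrate the concept:

```bash
# Deny traffic by default, then explicitly allow a known address range.
# "contoso-rg" and "contosostorage" are placeholder names; 203.0.113.0/24
# is a documentation-only address range.
az storage account update \
  --resource-group contoso-rg \
  --name contosostorage \
  --default-action Deny

az storage account network-rule add \
  --resource-group contoso-rg \
  --account-name contosostorage \
  --ip-address 203.0.113.0/24
```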
Azure SQL Database and Managed Instance:
Due to the storage availability failures, the VMs across various control and data plane clusters failed. As a result, the clusters became unhealthy, resulting in failed service management operations as well as connectivity failures for Azure SQL DB and Azure SQL Managed Instance customers in the region. Within an hour of the incident, we initiated failovers for databases with automatic failover policies. As part of the failover, the geo-secondary is elevated to be the new primary, and the failover group (FOG) endpoint is updated to point to the new primary. This means that applications that connect through the FOG endpoint (as recommended) would be automatically directed to the new region. While this generally happened automatically, less than 0.5% of the failed-over databases/instances had issues completing the failover and had to be converted to geo-secondary through manual intervention. During this period, applications that did not reroute their connections or use the FOG endpoints would have continued to direct writes to the old primary for a prolonged period. The cause of the issue was failover workflows getting terminated or throttled due to high demand on the service manager component.
After storage recovery in Central US, 98% of databases recovered and resumed normal operations. However, about 2% of databases had prolonged unavailability as they required additional mitigation to ensure gateways redirected traffic to the primary node. This was caused by the metadata information on the gateway nodes being out of sync with the actual placement of the database replicas.
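For reference, the automatic failover policies and FOG endpoints described above are configured on a failover group. A minimal Azure CLI sketch with placeholder names; applications then connect via the group's listener endpoint so they follow the primary after a failover:

```bash
# Create a failover group with an automatic (service-managed) failover
# policy and a one-hour grace period before forced failover.
# All names are placeholders.
az sql failover-group create \
  --resource-group contoso-rg \
  --server contoso-sql-primary \
  --partner-server contoso-sql-secondary \
  --name contoso-fog \
  --failover-policy Automatic \
  --grace-period 1 \
  --add-db appdb

# Applications should connect to contoso-fog.database.windows.net so that
# connections automatically follow the current primary after a failover.
```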
Azure Cosmos DB:
Users experienced failed service management operations and connectivity failures because both the control plane and data plane rely on Azure Virtual Machine Scale Sets (VMSS) that use Azure Storage for operating system disks, which were inaccessible. The region-wide success rate of requests to Cosmos DB in Central US dropped to 82% at its lowest point, with about 50% of the VMs running the Cosmos DB service in the region being down. The impact spanned multiple availability zones, and the infrastructure went down progressively over the course of 68 minutes. Impact on the individual Cosmos DB accounts varied depending on the customer database account regional configurations and consistency settings as noted below:
- Customer database accounts configured with multi-region writes (i.e. active-active) were not impacted by the incident, and maintained availability for reads and writes by automatically directing traffic to other regions.
- Customer database accounts configured with multiple read regions and a single write region outside of Central US, using session or lower consistency, were not impacted by the incident and maintained availability for reads and writes by directing traffic to other regions. When strong or bounded staleness consistency levels are configured, write requests can be throttled to maintain the configured consistency guarantees, impacting write availability until the Central US region is either taken offline for the database account or recovers, which unblocks writes. This behavior is expected.
- Customer database accounts configured with multiple read regions, with a single write region in the Central US region (i.e. active-passive) maintained read availability but write availability was impacted until accounts were failed over to the other region.
- Customer database accounts configured with single region (multi-zonal or single zone) in the Central US region were impacted if at least one partition resided on impacted nodes.
Additionally, some customers observed errors impacting application availability even if the database accounts were available to serve traffic in other regions. Initial investigations of these reports point to client-side timeouts due to connectivity issues observed during the incident. SDKs’ ability to automatically retry read requests in another region, upon request timeout, depends on the timeout configuration. For more details, please refer to https://learn.microsoft.com/azure/cosmos-db/nosql/conceptual-resilient-sdk-applications
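As a reference for the configurations described above, multi-region writes and service-managed failover are account-level settings. A minimal Azure CLI sketch with placeholder account, resource group, and region values:

```bash
# Add a secondary read region and enable service-managed failover.
# "contoso-rg" and "contoso-cosmos" are placeholder names.
az cosmosdb update \
  --resource-group contoso-rg \
  --name contoso-cosmos \
  --locations regionName=centralus failoverPriority=0 isZoneRedundant=false \
  --locations regionName=eastus2 failoverPriority=1 isZoneRedundant=false \
  --enable-automatic-failover true

# Alternatively, enable multi-region writes (active-active) so that every
# configured region can accept writes.
az cosmosdb update \
  --resource-group contoso-rg \
  --name contoso-cosmos \
  --enable-multiple-write-locations true
```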
Azure DevOps:
The Azure DevOps service experienced impact during this event, where multiple micro-services were impacted. A subset of Azure DevOps customers experienced impact in regions outside of Central US due to some of their data or metadata residing in Central US scale units. Azure DevOps does not offer regional affinity, which means that customers are not tied to a single region within a geography. A deep-dive report specific to the Azure DevOps impact will be published to their dedicated status page, see: https://status.dev.azure.com/_history
Privileged Identity Management (PIM):
Privileged Identity Management (PIM) experienced degradations due to the unavailability of upstream services such as Azure SQL and Cosmos DB, as well as capacity issues. PIM is deployed in multiple regions including Central US, so a subset of customers whose PIM requests are served in the Central US region were impacted. Failover to another healthy region succeeded for SQL and compute, but it took longer for Cosmos DB failover (see Cosmos DB response section for details). The issue was resolved once the failover was completed.
Azure Resource Manager (ARM):
In the United States there was an impact to ARM due to unavailability of dependent services such as Cosmos DB. ARM has a hub model for storage of global state, like subscription metadata. The Central US, West US 3, West Central US, and Mexico Central regions had a backend state in Cosmos DB in Central US. Calls into ARM going to those regions were impacted until the Central US Cosmos DB replicas were marked as offline. ARM's use of Azure Front Door (AFD) for traffic shaping meant that callers in the United States would have seen intermittent failures if calls were routed to a degraded region. As the regions were partially degraded, health checks did not take them offline. Calls eventually succeeded on retries as they were routed to different regions. Any Central US dependency would have failed throughout the primary incident's lifetime. During the incident, this caused a wider perceived impact for ARM across multiple regions due to customers in other regions homing resources in Central US.
Azure NetApp:
The Azure NetApp Files (ANF) service was impacted by this event, causing new volume creation attempts to fail in all ANF regions. The ANF Resource Provider (RP) relies on virtual network data (utilization data), provided by the Azure Dedicated RP (DRP) platform, to decide on the placement of new volumes. The Storage issue impacted the data and control planes of several Platform-as-a-Service (PaaS) services used by DRP. This event affected ANF globally because the primary location of the DRP utilization data is in Central US, which could not efficiently fail over writes or redirect reads to replicas in other healthy regions. To recover, the DRP engineering group worked with the utilization data engineers to perform administrative failovers to healthy regions to recover the ANF control plane. However, by the time the failover attempt could be made, the Storage service had recovered in the region and the ANF service recovered on its own by 04:16 UTC on 19 July 2024.
How did we respond?
As the allow list update was being published in the Central US region, our service monitoring began to detect VM availability dropping, and our engineering teams were engaged. Due to the widespread impact, and the primary symptom initially appearing to be a drop in VM disk traffic to Storage scale units, it took time to rule out other possible causes and identify the incomplete storage allow list as the trigger of these issues.
Once correlated, we halted the allow list update workflow worldwide, and our engineering team updated configurations on all Storage scale units in the Central US region to restore availability, which was completed at 02:55 UTC on 19 July. Due to the scale of failures, downstream services took additional time to recover following this mitigation of the underlying Storage issue.
Azure SQL Database and Managed Instance:
Within one minute of SQL unavailability, SQL monitoring detected unhealthy nodes and login failures in the region. Investigation and mitigation workstreams were established, and customers were advised to consider putting into action their disaster recovery (DR) strategies. While we recommend customers manage their failovers, for 0.01% of databases Microsoft initiated the failovers as authorized by customers.
80% of the SQL databases became available within two hours of storage recovery, and 98% were available over the next three hours. Less than 2% required additional mitigations to achieve availability. We restarted gateway nodes to refresh the caches and ensure connections were being routed to the right nodes, and we forced completion of failovers that had not completed.
Azure Cosmos DB:
To mitigate impacted multi-region active-passive accounts with their write region in Central US, we initiated failover of the control plane right after impact started, and completed failover of customer accounts at 22:48 UTC on 18 July 2024, 34 minutes after impact was detected. On average, failover of individual accounts took approximately 15 minutes. 95% of failovers were completed without additional mitigations, completing at 02:29 UTC on 19 July 2024, 4 hours 15 minutes after impact was detected. The remaining 5% of database accounts required additional mitigations to complete failovers. We cancelled “graceful switch region” operations triggered by customers via the Azure portal, where these were preventing the service-managed failover triggered by Microsoft from completing. We also force-completed failovers that did not complete, for database accounts that had a long-running control operation holding a lock on the service metadata, by removing the lock.
As storage recovery initiated and backend nodes started to come online at various times, Cosmos DB declared impacted partitions as unhealthy. A second workstream, in parallel with failovers, focused on repairs of impacted partitions. For customer database accounts that stayed in the Central US region (single-region accounts), availability was restored to >99.9% by 09:41 UTC on 19 July 2024, with impact to all databases mitigated by approximately 19:30 UTC.
As availability of impacted customer accounts was being restored, a third workstream focused on repairs of the backend nodes required prior to initiating failback for multi-region accounts. Failback for database accounts that had previously been failed over by Microsoft started at 08:51 UTC and continued as repairs progressed. During failback, we brought the Central US region online as a read region for the database accounts, after which customers could switch their write region back to Central US if and when desired.
A subset of customer database accounts encountered issues during failback that delayed their return to Central US. These issues were addressed but required a redo of the failback by Microsoft. Firstly, a subset of MongoDB API database accounts accessed by certain versions of MongoDB drivers experienced intermittent connectivity issues during failback, which required us to redo the failback in coordination with customers. Secondly, a subset of database accounts with private endpoints after failback to Central US experienced issues connecting to Central US, requiring us to redo the failback.
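For reference, the failovers and failbacks described above correspond to changing an account's failover priorities, which customers with service-managed failover configured can also do themselves. A minimal Azure CLI sketch with placeholder names – promoting the secondary region makes it the write region, and reversing the priorities later performs the failback:

```bash
# Promote East US 2 to be the write region (failover away from Central US).
# "contoso-rg" and "contoso-cosmos" are placeholder names.
az cosmosdb failover-priority-change \
  --resource-group contoso-rg \
  --name contoso-cosmos \
  --failover-policies eastus2=0 centralus=1

# Later, fail back by restoring Central US as the write region.
az cosmosdb failover-priority-change \
  --resource-group contoso-rg \
  --name contoso-cosmos \
  --failover-policies centralus=0 eastus2=1
```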
Detailed timeline of events:
- 21:40 UTC on 18 July 2024 – Customer impact began.
- 22:06 UTC on 18 July 2024 – Service monitoring detected drop in VM availability.
- 22:09 UTC on 18 July 2024 – Initial targeted messaging sent to a subset of customers via Service Health (Azure Portal) as services began to become unhealthy.
- 22:09 UTC on 18 July 2024 – Customer impact for Cosmos DB began.
- 22:13 UTC on 18 July 2024 – Customer impact for Azure SQL DB began.
- 22:14 UTC on 18 July 2024 – Monitoring detected availability drop for Cosmos DB, SQL DB and SQL DB Managed Instance.
- 22:14 UTC on 18 July 2024 – Cosmos DB control plane failover was initiated.
- 22:30 UTC on 18 July 2024 – SQL DB impact was correlated to the Storage incident under investigation.
- 22:45 UTC on 18 July 2024 – Deployment of the incomplete allow list completed, and VM availability in the region reached its lowest level during the incident.
- 22:48 UTC on 18 July 2024 – Cosmos DB control plane failover out of the Central US region completed, initiated service managed failover for impacted active-passive multi-region customer databases.
- 22:56 UTC on 18 July 2024 – Initial public Status Page banner posted, investigating alerts in the Central US region.
- 23:27 UTC on 18 July 2024 – All deployments in the Central US region were paused.
- 23:27 UTC on 18 July 2024 – Initial broad notifications sent via Service Health for known services impacted at the time.
- 23:35 UTC on 18 July 2024 – All compute buildout deployments paused for all regions.
- 00:15 UTC on 19 July 2024 – Azure SQL DB and SQL Managed Instance Geo-failover completed for databases with failover group policy set to Microsoft Managed.
- 00:45 UTC on 19 July 2024 – Partial storage ‘allow list’ confirmed as the underlying cause.
- 00:50 UTC on 19 July 2024 – Control plane availability improving on Azure Resource Manager (ARM).
- 01:10 UTC on 19 July 2024 – Azure Storage began updating storage scale unit configurations to restore availability.
- 01:30 UTC on 19 July 2024 – Customers and downstream services began seeing signs of recovery.
- 02:29 UTC on 19 July 2024 – 95% Cosmos DB account failovers completed.
- 02:30 UTC on 19 July 2024 – Azure SQL DB and SQL Managed Instance databases started recovering as the mitigation process for the underlying Storage incident was progressing.
- 02:51 UTC on 19 July 2024 – 99% of all impacted compute resources had recovered.
- 02:55 UTC on 19 July 2024 – Updated configuration completed on all Storage scale units in the Central US region, restoring availability of all Azure Storage scale units. Downstream service recovery and restoration of isolated customer reported issues continue.
- 05:57 UTC on 19 July 2024 – Cosmos DB availability (% requests succeeded) in the region sustained recovery to >99%.
- 08:00 UTC on 19 July 2024 – 98% of Azure SQL databases had been recovered.
- 08:15 UTC on 19 July 2024 – SQL DB team identified additional issues with gateway nodes as well as failovers that had not completed.
- 08:51 UTC on 19 July 2024 – Cosmos DB started to failback database accounts that were failed over by Microsoft, as Central US infrastructure repair progress allowed.
- 09:15 UTC on 19 July 2024 – SQL DB team started applying mitigations for the impacted databases.
- 09:41 UTC on 19 July 2024 – Cosmos DB availability (% requests succeeded) in the Central US region sustained recovery to >99.9%.
- 15:00 UTC on 19 July 2024 – SQL DB team forced completion of the incomplete failovers.
- 18:00 UTC on 19 July 2024 – SQL DB team completed all gateway node restarts.
- 19:30 UTC on 19 July 2024 – Cosmos DB mitigated all databases.
- 20:00 UTC on 19 July 2024 – SQL DB team completed additional verifications to ensure all impacted databases in the region were in the expected states.
- 22:00 UTC on 19 July 2024 – SQL DB and SQL Managed Instance issue was mitigated, and all databases were verified as recovered.
How are we making incidents like this less likely or less impactful?
- Storage: Fix the allow list generation workflow, to detect incomplete source information and halt. (Completed)
- Storage: Add alerting for requests to storage being rejected by ‘allow list’ checks. (Completed)
- Storage: Change ‘allow list’ deployment flow to serialize by Availability Zones and storage types, and increase deployment period to 24 hours. (Completed)
- Storage: Add additional VM health checks and auto-stop in the allow list deployment workflow. (Estimated completion: July 2024)
- SQL: Reevaluate the policy to initiate Microsoft managed failover of SQL failover groups. Reiterate recommendation for customers to manage their failovers. (Estimated completion: August 2024)
- Cosmos DB: Improve the failback workflow that affected a subset of MongoDB API customers, where certain versions of MongoDB drivers failed to connect to all regions. (Estimated completion: August 2024)
- Cosmos DB: Improve the fail-back workflow for database accounts with private endpoints experiencing connectivity issues to Central US after failback, enabling successful failback without requiring Microsoft to redo the process. (Estimated completion: August 2024)
- Storage: Storage data-plane firewall evaluation will detect invalid allow list deployments, and continue to use last-known-good state. (Estimated completion: September 2024)
- Azure NetApp Files: Improve the logic of several monitors to ensure timely detection and appropriate classification of impacting events. (Estimated completion: September 2024)
- Azure NetApp Files: Additional monitoring of several service metrics to help detect similar issues and correlate events more quickly. (Estimated completion: September 2024)
- SQL: Improve Service Fabric cluster location change notification mechanism’s reliability under load. (Estimated completion: in phases starting October 2024)
- SQL: Improve robustness of geo-failover workflows, to address completion issues. (Estimated completion: in phases starting October 2024)
- Cosmos DB: Eliminate issues that caused delay for the 5% of failovers. (Estimated completion: November 2024)
- Azure DevOps: Working to ensure that all customer metadata is migrated to the appropriate geography, to help limit multi-geography impact. (Estimated completion: January 2025)
- Azure NetApp Files: Decouple regional read/writes, to help reduce the blast radius to single region for this class of issue. (Estimated completion: January 2025)
- Azure NetApp Files: Evaluate the use of caching to reduce reliance on utilization data persisted in stores, to help harden service resilience for similar scenarios. (Estimated completion: January 2025)
- Cosmos DB: Adding automatic per-partition failover for multi-region active-passive accounts, to expedite incident mitigation by automatically handling affected partitions. (Estimated completion: March 2025)
- SQL and Cosmos DB: Azure Virtual Machines is working on the Resilient Ephemeral OS disk improvement, which improves VM resilience to Storage incidents. (Estimated completion: May 2025)
How can customers make incidents like this less impactful?
- Consider implementing a Disaster Recovery strategy for Azure SQL Database: https://learn.microsoft.com/azure/azure-sql/database/disaster-recovery-guidance?view
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/1K80-N_8
13
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/4L44-3F0
What happened?
Between 00:00 UTC and 23:35 UTC on 13 July 2024, an issue during a cleanup operation caused a subset of critical resources to be deleted, which impacted the Azure OpenAI (AOAI) service across multiple regions. A subset of customers experienced errors when calling the Azure OpenAI endpoints and may have experienced issues, such as 5xx errors when accessing their Azure OpenAI service resources, in 14 of the 28 regions that offer AOAI – specifically Australia East, Brazil South, Canada Central, Canada East, East US 2, France Central, Korea Central, North Central US, Norway East, Poland Central, South Africa North, South India, Sweden Central and UK South.
Between 01:00 UTC on 13 July 2024 and 16:54 UTC on 15 July 2024, new Standard (Pay-As-You-Go) fine-tuned model deployments were unavailable in a subset of regions, and deployment deletions may not have been effective. This impacted four regions, specifically East US 2, North Central US, Sweden Central, and Switzerland West. However, clients could utilize inference endpoints for their currently deployed models within these regions.
There was no data loss to Retrieval-augmented generation (RAG) or fine-tuning models due to this event.
What went wrong and why?
Our Azure OpenAI service leverages an internal automation system to create and manage Azure resources that are needed by the service. These resources include Azure GPU Virtual Machines (VMs) that host the OpenAI large language models (LLMs) and Azure Machine Learning workspaces that host the managed online endpoints responsible for serving backend inferencing requests. The automation service is also responsible for orchestrating the deployment and managing the lifecycle of fine-tuned models.
While the automation service itself is a regional service, it uses a global configuration file in a code repository to define the state of the Azure resources needed for Azure OpenAI in all regions. As the service grew its footprint at a highly accelerated rate, multiple components started to leverage the same resource structure to enable new scenarios. Over time, some Azure resource groups grew to contain two different types of resources – those managed by the automation, and those not known to the automation. This discrepancy resulted in inconsistencies between which resources were provisioned in the AOAI production subscription and which resources the automation’s configuration file identified as existing in the subscription. In one specific instance of this inconsistency, the configuration file in the repository indicated that 14 resource groups contained only resources no longer needed by the service, when in fact they contained sub-resources – including the GPU compute VMs and model endpoints – that the service still needed. It’s worth calling out that these model endpoints support both Pay-As-You-Go (PayGo) and Provisioned Throughput Unit (PTU) offers.
As part of the ongoing effort to clean up unused resources to overcome subscription limits and to reduce security vulnerability surface areas, a change was made to the global configuration file – to remove the 14 resource groups that were deemed unused. This change took effect soon after it was committed, and all resources in those resource groups were deleted in a short span of time, resulting in the cascade of failures that caused the incident. Because the configuration change was delivered as a content update at a global scope, there existed no safe deployment process to gate the effect of the change in region-specific order, resulting in a multi-region incident. This is the main gap in the change management process that we addressed as an immediate repair item.
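For illustration only (and not the internal mechanism used here), accidental deletion of critical resource groups can be guarded against with a CanNotDelete management lock, which automation must explicitly remove before deletes can proceed. A minimal Azure CLI sketch with a placeholder resource group name:

```bash
# Block delete operations on a critical resource group; read and update
# operations remain allowed. "aoai-prod-eastus2" is a placeholder name.
az group lock create \
  --resource-group aoai-prod-eastus2 \
  --name do-not-delete \
  --lock-type CanNotDelete
```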
The incident was detected quickly through our service’s internal monitoring. Once the deletion trigger event was understood, the recovery process was kicked off immediately to rebuild the service in parallel across the 14 impacted regions. Service availability gradually recovered for PayGo and PTU offers across different model types. Deployment of fine-tuned models was the last scenario to be re-enabled, because additional changes needed to be made and validated to ensure service availability.
How did we respond?
- 00:00 UTC on 13 July 2024 – Initial customer impact started, as cleanup operations began across regions, gradually increasing as more regions became unhealthy.
- 00:05 UTC on 13 July 2024 – Service monitoring detected failure rates going above defined thresholds, alerting our engineers who began an investigation.
- 00:30 UTC on 13 July 2024 – We determined there was a multi-region issue unfolding by correlating alerts, service metrics, and telemetry dashboards. It took 30-40 minutes for the delete to complete, so each region (and even each endpoint in each region) went offline at different times.
- 00:45 UTC on 13 July 2024 – We identified that backend endpoints were being deleted, but did not yet know how or why. Our on-call engineers were investigating two hypotheses.
- 00:52 UTC on 13 July 2024 – We classified the event as critical, engaging incident response personnel to further investigate, coordinate resources, and drive customer workstreams.
- 01:00 UTC on 13 July 2024 – We identified the automated cleanup operation deleting resources. With the problem known, we began mitigation efforts, beginning with stopping the automation that was executing the cleanup.
- 01:40 UTC on 13 July 2024 – The cleanup job stopped on its own and did not delete everything. However, we continued to work to ensure further deletion was prevented.
- 02:00 UTC on 13 July 2024 – We added resource locks and stopped delete requests at both the ARM and Automation levels, to prevent further deletion (a sketch of applying such a resource lock appears after the notes below).
- 02:00 UTC on 13 July 2024 – Initial recovery efforts began to recreate the deleted resources. Note regarding recovery time: AOAI models are very large and can only be transferred via secure channels. Within a model pool (there are separate model pools per region, per model), each deployment is serial, so model copying time is a significant factor. Model pools themselves are rebuilt in parallel.
- 02:22 UTC on 13 July 2024 – First communication posted to the Azure Status page. Communications were delayed due to initial difficulties scoping affected customers, and impact analysis gaps in our communications tooling.
- 03:35 UTC on 13 July 2024 – Customer impact scope determined, and first targeted communications sent to customers via Service Health in the Azure portal.
- 04:15 UTC on 13 July 2024 – GPT-3.5-Turbo started recovering in North Central US.
- 07:10 UTC on 13 July 2024 – GPT-4o recovered in East US 2.
- 07:10 UTC on 13 July 2024 – Majority of regions and models began recovering.
- 08:35 UTC on 13 July 2024 – GPT-4 in majority of regions recovered and serving traffic.
- 10:20 UTC on 13 July 2024 – Majority of models and regions recovered, error rates dropped to normal levels. GPT-4 recovered in all regions except Canada Central, Sweden Central, North Central US, UK South, Central US, and Australia East.
- 15:40 UTC on 13 July 2024 – GPT-4 in North Central US recovered.
- 17:35 UTC on 13 July 2024 – Recovered in all regions except Sweden Central.
- 19:20 UTC on 13 July 2024 – DALL-E restored in all regions.
- 19:30 UTC on 13 July 2024 – GPT-4 recovered in UK South.
- 20:20 UTC on 13 July 2024 – GPT-4o recovered in Sweden Central.
- 23:35 UTC on 13 July 2024 – All base models recovered, and service restoration completed across all affected regions.
- 14:00 UTC on 15 July 2024 – Fine-tuned model recovery began across various regions.
- 16:54 UTC on 15 July 2024 – All fine-tuned model deployments restored across affected regions.
Notes regarding the model-specific recoveries in the timeline above:
- Restoration of all models except GPT-4o, DALL-E, and fine-tuning – All model pools in all regions were set to recover in parallel via automation, so the order in which each region reached its restoration point was not predetermined. For most models and regions, recovery occurred incrementally over time. Some model types did not recover automatically via automation and required manual intervention – specifically GPT-4o, DALL-E, and fine-tuning, covered in the bullets below.
- Restoration of GPT-4o – Because of its multimodal nature, GPT-4o depends on Redis and Private Link. Automation was able to bring back the Redis and Private Link resources, but due to complex interactions between them, manual intervention was required to reconfigure these components, working with the Azure Redis and Azure Networking teams.
- Restoration of DALL-E – Restoring DALL-E required manual intervention due to the managed identity used between the DALL-E frontend and the DALL-E backend. After restoration, this managed identity needed to be added to an allowlist.
- Restoration of the fine-tuning control plane – At the beginning of the incident, on 13 July, we turned off our automation to stop any possibility of further deletion. This same automation governs the creation of new fine-tuning deployments. We kept the automation off over the weekend, out of caution, to prevent it from deleting resources again. On 15 July, once everything was recovered and stable, we turned the automation back on, which restored fine-tuning deployment capability.
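Regarding the resource-lock step noted in the timeline above (02:00 UTC on 13 July), the following is a minimal sketch of applying a CanNotDelete management lock to a resource group with the Azure SDK for Python; the subscription ID, resource group, and lock name are placeholders rather than the actual production values.

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource.locks import ManagementLockClient
from azure.mgmt.resource.locks.models import ManagementLockObject

# Placeholders - not the actual production subscription or resource group.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "rg-aoai-model-pool-example"

# Authenticate with whatever identity is available to the operator or pipeline.
credential = DefaultAzureCredential()
lock_client = ManagementLockClient(credential, SUBSCRIPTION_ID)

# A CanNotDelete lock blocks delete operations on the resource group and the
# resources inside it until the lock itself is removed, even for callers that
# otherwise have write permissions.
lock_client.management_locks.create_or_update_at_resource_group_level(
    resource_group_name=RESOURCE_GROUP,
    lock_name="protect-model-pool",
    parameters=ManagementLockObject(
        level="CanNotDelete",
        notes="Temporary lock applied during incident mitigation.",
    ),
)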
How are we making incidents like this less likely or less impactful?
- We have changed the configuration policy to be regional, in accordance with our Safe Deployment Practices (SDP), for all subsequent configuration updates. This will help ensure potentially unhealthy changes are limited to a single region. (Completed)
- We have removed incorrect metadata tags on the resources that are not to be managed by the automation, and updated the automation configuration to exclude those resources, thus preventing automation from inadvertently deleting critical resources. (Completed)
- We have tightened the regional makeup of workloads to fewer regions, to further prevent widespread issues in the event of similar unintentional deletion or comparable scenarios. (Completed)
- We have enhanced our incident response engagement automation, to ensure more efficient involvement of our incident response personnel. (Completed)
- We are investigating the implementation of additional change management controls, to ensure all changes to production go through a gated safe deployment process. (Estimated completion: August 2024)
- Our communications tooling will be updated with relevant AOAI data, to ensure faster notification for similar scenarios where there are difficulties scoping affected customers. (Estimated completion: September 2024)
- We are investigating additional test coverage and verification procedures to de-risk future configuration updates. (Estimated completion: September 2024)
- We will develop additional active monitoring for different model families, to assess system health without depending on customer traffic. (Estimated completion: September 2024)
- In the longer term, we will increase the parallelization of model pool buildouts, to reduce the time it takes for recovery. (Estimated completion: December 2024)
How can customers make incidents like this less impactful?
- Generally, customers can have endpoints in multiple regions for failover if the primary region goes down (see the sketch after this list). For guidance on Business Continuity and Disaster Recovery (BCDR) scenarios, review https://learn.microsoft.com/azure/ai-services/openai/how-to/business-continuity-disaster-recovery
- The global offer is supported for several of the SKUs; consider evaluating and onboarding to it, depending on your requirements: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/deployment-types
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
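To illustrate the multi-region failover guidance in the first two bullets above, here is a minimal sketch using the openai Python package against two Azure OpenAI resources in different regions; the endpoints, keys, deployment names, and API version are placeholders to adapt to your own resources.

from openai import AzureOpenAI, APIConnectionError, APIStatusError

# Placeholder resources - one deployment of the same model in each region.
REGIONS = [
    {"endpoint": "https://my-aoai-eastus2.openai.azure.com", "api_key": "<key-1>", "deployment": "gpt-4o"},
    {"endpoint": "https://my-aoai-swedencentral.openai.azure.com", "api_key": "<key-2>", "deployment": "gpt-4o"},
]
API_VERSION = "2024-06-01"  # example API version


def chat_with_failover(messages: list[dict]) -> str:
    """Try the primary region first; fall back to the secondary on failure."""
    last_error: Exception | None = None
    for region in REGIONS:
        client = AzureOpenAI(
            azure_endpoint=region["endpoint"],
            api_key=region["api_key"],
            api_version=API_VERSION,
        )
        try:
            response = client.chat.completions.create(
                model=region["deployment"],  # the deployment name in that resource
                messages=messages,
            )
            return response.choices[0].message.content or ""
        except (APIConnectionError, APIStatusError) as exc:
            last_error = exc  # record the failure and try the next region
    raise RuntimeError("All configured regions failed") from last_error


print(chat_with_failover([{"role": "user", "content": "Hello"}]))

In production you would typically add retries with backoff and health-based routing on top of this, but even a simple ordered fallback means an outage in the primary region does not take the application down.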
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/4L44-3F0
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/FTVR-L9Z
What happened?
A platform issue caused some of our 'Managed Identities for Azure resources' (formerly Managed Service Identity or MSI) customers to experience failures when requesting managed identity tokens or performing management operations for managed identities in the Australia East region, between 00:55 UTC and 06:28 UTC on 12 July 2024. All identities in this region were subject to impact.
For managed identities for Azure Virtual Machines (VMs) and Virtual Machine Scale Sets (VMSS) in this region, around 3% of token requests failed during this time. Established virtual machines with steady token activity in the days and hours prior to the incident were generally not impacted. For management operations, around 50% of calls via Azure Resource Manager (ARM) or the Azure Portal to perform operations such as creating, deleting, assigning, or updating identities failed during this time.
The creation of any Azure resource to which a system-assigned managed identity was assigned would also have failed, since creating such a resource requires creating a managed identity. Identity management and managed identity token issuance for other Azure resource types - including Azure Service Fabric, Azure Stack HCI, Azure Databricks, and Azure Kubernetes Service - were impacted to varying degrees, depending on the specific scenario and resource. Generally, across all Azure compute resource types, management operations were heavily impacted. For token requests, existing resources continued to function during the incident, while new or newly scaled resources saw failures.
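For context on what a 'token request' looks like from a customer workload, the following is a minimal sketch of acquiring a managed identity token on an Azure VM with the azure-identity Python package; the token scope shown (Azure Resource Manager) is only an example, and your workload's scope depends on the downstream service it calls.

from azure.identity import ManagedIdentityCredential

# On an Azure VM or VMSS instance, this credential calls the local Instance
# Metadata Service (IMDS) to obtain a token for the VM's managed identity.
credential = ManagedIdentityCredential()

# Request a token for Azure Resource Manager as an example scope.
token = credential.get_token("https://management.azure.com/.default")
print(f"Token expires at (epoch seconds): {token.expires_on}")

Reusing a single credential object lets the library cache tokens until close to expiry, which is consistent with the observation above that established resources with steady token activity were generally less affected than new or newly scaled resources.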
What went wrong and why?
At Microsoft, we are working to remove manual key management and rotation from the operation of our internal services. For most of our infrastructure and for our customers, we recommend migrating authentication between Azure resources to use Managed identities for Azure resources. For the Managed identities for Azure resources infrastructure itself, though, we cannot use Managed identities for Azure resources, due to a risk of circular dependency. Therefore, Managed identities for Azure resources uses key-based access to authenticate to its downstream Azure dependencies that store identity metadata.
To improve resilience, we have been moving Managed identities for Azure resources infrastructure to an internal system that provides automatic key rotation and management. The process for rolling out this change was to deploy a service change to support automatic keys, and then roll over our keys into the new system in each region one-by-one, following our safe deployment process. This process was tested and successfully applied in several prior regions.
In the Australia East and Australia Southeast regions, however, the required deployment for key automation did not complete successfully - because these regions had previously been pinned to an older build of the service, for unrelated reasons. Our internal deployment tooling reported a successful deployment and did not clearly show that these regions were still pinned to the older version. Believing the deployment to have completed, a service engineer initiated the switch to move to automatic keys for the Australia East storage infrastructure, immediately causing the key to roll over. Because neither Australia East nor its failover pair Australia Southeast were running the correct service version, they continued to try to use the old key when accessing storage for Australia East identities, which failed.
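Several of the repair items below focus on verifying deployed versions before performing key operations. As a purely hypothetical sketch of such a pre-flight check (the functions, build numbers, and thresholds are illustrative placeholders, not the actual Managed Identity tooling), a rollover guard could look like this:

# Minimum build version that supports automatically managed keys (hypothetical).
MIN_SUPPORTED_BUILD = (2024, 7, 1)


def get_running_builds(region: str) -> list[tuple[int, int, int]]:
    """Placeholder: query each service instance in the region for its running build."""
    return [(2024, 7, 3), (2024, 7, 3)]


def safe_to_roll_keys(region: str, failover_region: str) -> bool:
    """Only proceed if every instance in both regions runs a supporting build."""
    for r in (region, failover_region):
        builds = get_running_builds(r)
        if not builds or any(b < MIN_SUPPORTED_BUILD for b in builds):
            print(f"Blocking rollover: {r} has instances below {MIN_SUPPORTED_BUILD}")
            return False
    return True


if safe_to_roll_keys("australiaeast", "australiasoutheast"):
    print("Proceed with switching to automatically managed keys")
else:
    print("Rollover blocked pending deployment verification")

The point of the check is that the decision to roll keys is based on what each instance reports it is actually running, rather than on what the deployment tooling reported as successful.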
How did we respond?
This incident was detected within three minutes by internal monitoring of both the Managed Identity service, and other dependent services in Azure. Our engineers immediately knew that the cause was related to the key rollover operation that had just been performed. Since managed identities are built across Availability Zones and failover pairs, our standard mitigation playbook is to perform a failover to a different zone or region. However, in this case, all zones in Australia East were impacted - and requests to the failover pair, Australia Southeast, were also failing.
We pursued several investigation and mitigation threads in parallel. Our engineers initially believed that the issue was that the key rollover had somehow failed or applied incorrectly, and we attempted several mitigation steps to force the new key to be picked up or force the downstream resources to accept the new key. After these were not effective, we discovered that the issue was not with the new key, but rather that our instances were still attempting to use the old key - and that this key had been invalidated by the rollover. We launched more parallel workstreams to understand why this would be the case and how to mitigate, since we still believed that the deployment had succeeded. Once we realized that these regions were in fact pinned to an older build of the service, we authored and deployed a configuration change to force this build to use the managed key, which resolved the issue. Full mitigation occurred at 06:28 UTC on 12 July 2024.
How are we making incidents like this less likely or less impactful?
- We temporarily froze all changes to the Managed identities for Azure resources. (Completed)
- We ensured that all Azure regions with Managed Identity that could be subject to automatically managed keys are running a service version and configuration that supports those keys. (Completed)
- We are working to improve our deployment process to ensure that pinned build versions are detected as part of a deployment and not considered successful. (Estimated completion: July 2024)
- Prior to rolling out automatic key management again, we will improve our service telemetry to enable us to quickly see whether a manually or automatically managed key is used, as well as the identifier of the key. (Estimated completion: July 2024)
- Prior to rolling out automatic key management again, we will update our internal process to verify that all service instances are running the correct version and using the expected key. (Estimated completion: July 2024)
- For the future move to automatic key management, we will schedule the changes to occur in off-peak business hours in each Azure region. (Estimated completion: July 2024)
- Finally, we are working to improve the central documentation for automatic key management to ensure the above best practices are followed by all Azure services using this capability. (Estimated completion: July 2024)
How can customers make incidents like this less impactful?
- In Azure, our safe deployment practices consider both availability zones and Azure regions. We encourage all customers to partition their capacity across regions and zones, and to build and exercise a disaster recovery strategy to defend against issues that may occur in a single region. During this incident, customers who were able to fail out of Australia East would have been able to mitigate their impact. To learn more about a multi-region geodiversity strategy for mission-critical workloads, see https://learn.microsoft.com/training/modules/design-a-geographically-distributed-application and https://learn.microsoft.com/azure/architecture/patterns/geodes. Finally, this page provides more information on disaster recovery strategies: https://learn.microsoft.com/azure/well-architected/reliability/disaster-recovery
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/FTVR-L9Z