September 2024
27
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/HSKF-FB0
What happened?
Between 14:47 and 16:25 EDT on 27 September 2024, a platform issue impacted the Azure Resource Manager (ARM) service in our Azure US Government regions. Impacted customers may have experienced issues attempting control plane operations including create, update and delete operations on resources in the Azure Gov cloud, including the US Gov Arizona, US Gov Texas, and/or US Gov Virginia regions.
The incident also impacted downstream services dependent on ARM including Azure App Service, Azure Application Insights, Azure Automation, Azure Backup, Azure Data Factory, Azure Databricks, Azure Event Grid, Azure Kubernetes Service, Azure Log Search Alerts, Azure Maps, Azure Monitor, Azure NetApp Files, Azure portal, Azure Privileged Identity Management, Azure Red Hat OpenShift, Azure Search, Azure Site Recovery, Azure Storage, Azure Synapse, Azure Traffic Manager, Azure Virtual Desktop, and Microsoft Purview.
What went wrong and why?
This issue was initially flagged by our monitors detecting success rate failures. Upon investigating, we discovered that a backend Cosmos DB account was misconfigured to block legitimate access from Azure Resource Manager (ARM). While Cosmos DB accounts can be replicated to multiple regions, their configuration settings are global. When this account was misconfigured, the change therefore applied immediately to all replicas, preventing ARM from serving any requests in all regions. Once understood, the misconfiguration was reverted, which fully mitigated all customer impact.
The misconfigured account was for tracking Azure Feature Exposure Control state. Managing feature state is a resource provider where using Entra authentication is appropriate. However, it's also a critical part of processing requests into ARM – as call behavior may change, depending on what feature flags a subscription is assigned.
The Cosmos DB account managing this state had been incorrectly attributed to the resource provider platform, rather than core ARM processing. Since the resource provider platform had completed its migration to Entra authorization for Cosmos DB accounts, the service was disabling local authentication on all of the accounts it owns, as part of our Secure Future Initiative – for more details, see: https://www.microsoft.com/trust-center/security/secure-future-initiative.
Since the account in question, used to track the exposure control state of Azure features, was misattributed to be part of the resource provider platform, it was included in the update of those accounts. Since Entra depends on ARM, ARM avoids using Entra authentication to Cosmos DB – to prevent circular dependencies.
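The circular dependency described above can be illustrated with a small dependency-graph check. This is a minimal sketch, not ARM tooling: the node names and edges are illustrative assumptions taken from the description above, and `find_cycle` is a hypothetical helper.

```python
from typing import Dict, List, Optional


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the current path, finished
    colour = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        colour[node] = GREY
        path.append(node)
        for dep in graph.get(node, []):
            state = colour.get(dep, WHITE)
            if state == GREY:             # dep is already on the current path: cycle
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in list(graph):
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Assumed edges: "X": ["Y"] means X depends on Y.
WITH_LOCAL_AUTH = {
    "ARM": ["FeatureStateCosmosDB"],
    "FeatureStateCosmosDB": [],           # local authentication, no Entra dependency
    "Entra": ["ARM"],
}
WITH_ENTRA_AUTH = {
    "ARM": ["FeatureStateCosmosDB"],
    "FeatureStateCosmosDB": ["Entra"],    # Entra-only authentication
    "Entra": ["ARM"],
}
```

Under these assumed edges, `find_cycle(WITH_LOCAL_AUTH)` finds nothing, while `find_cycle(WITH_ENTRA_AUTH)` reports the loop from ARM through the Cosmos DB account and Entra back to ARM.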
How did we respond?
- 27 September 2024 @ 14:47 EDT - Customer impact began, triggered by a misconfiguration applied to the underlying Cosmos DB.
- 27 September 2024 @ 14:55 EDT - Monitoring detected success rate failures, on-call teams engaged to investigate.
- 27 September 2024 @ 15:45 EDT - Investigations confirmed the cause as the Cosmos DB misconfiguration.
- 27 September 2024 @ 16:25 EDT - Customer impact mitigated, by reverting the misconfiguration.
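The time spent in each phase can be read off the timeline above with a few lines of arithmetic. The sketch below uses only the four timestamps listed; the phase labels and helper names are our own.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps (EDT) copied from the timeline above.
TIMELINE = [
    ("impact began", "2024-09-27 14:47"),
    ("detected", "2024-09-27 14:55"),
    ("cause confirmed", "2024-09-27 15:45"),
    ("mitigated", "2024-09-27 16:25"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two timestamps in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def phase_durations(events):
    """Minutes between each consecutive pair of timeline events."""
    return {
        f"{a_name} -> {b_name}": minutes_between(a_time, b_time)
        for (a_name, a_time), (b_name, b_time) in zip(events, events[1:])
    }


durations = phase_durations(TIMELINE)
total = minutes_between(TIMELINE[0][1], TIMELINE[-1][1])
```

This gives 8 minutes to detect, 50 minutes to confirm the cause, and 40 minutes to revert, for a total of 98 minutes of impact.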
How are we making incidents like this less likely or less impactful?
- We are moving the impacted Cosmos DB account to an isolated subscription, to de-risk this failure mode. (Estimated completion: October 2024)
- Furthermore, we will apply management locks to prevent edits to this specific account, given its criticality. (Estimated completion: October 2024)
- In the longer term, our Cosmos DB team are updating processes to roll out account configuration changes in a staggered fashion, to de-risk impact from changes like this one. (Estimated completion: January 2025)
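The staggered rollout mentioned in the last repair item follows a general pattern that can be sketched as below. This is an illustration of the pattern only, not the Cosmos DB team's design: `staggered_rollout`, its callbacks, and the stage names used with it are all hypothetical.

```python
from typing import Callable, List, Sequence


def staggered_rollout(
    stages: Sequence[Sequence[str]],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    revert_change: Callable[[str], None],
) -> List[str]:
    """Apply a change one stage at a time, checking health after each stage.

    If any target in a stage is unhealthy, every target changed so far is
    reverted (newest first) and the rollout stops. Returns the targets that
    still carry the change when the rollout ends.
    """
    applied: List[str] = []
    for stage in stages:
        for target in stage:
            apply_change(target)
            applied.append(target)
        if not all(is_healthy(target) for target in stage):
            for target in reversed(applied):
                revert_change(target)
            return []
    return applied
```

With stages such as `[["canary"], ["region-a", "region-b"]]`, a change that breaks the canary is reverted before it reaches any other target, which is the property a globally applied setting cannot offer.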
How can customers make incidents like this less impactful?
There was nothing that customers could have done to prevent or minimize the impact from this specific ARM incident.
The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/HSKF-FB0
16
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/1LG8-1X0
What happened?
Between 18:46 UTC and 20:36 UTC on 16 September 2024, a platform issue resulted in an impact to a subset of customers using Azure Virtual Desktop in several US regions – specifically, those who have their Azure Virtual Desktop configuration and metadata stored in the US Geography, which we host in the East US 2 region. Impacted customers may have experienced failures to access their list of available resources, make new connections, or perform management actions on their Azure Virtual Desktop resources.
This event impacted customers in the following US regions: Central US, East US, East US 2, North Central US, South Central US, West Central US, West US, West US 2, and West US 3. End users with connections established via any of these US regions may also have been impacted. This connection routing is determined dynamically, to route end users to their nearest Azure Virtual Desktop gateway. More information on this process can be found at: https://learn.microsoft.com/azure/virtual-desktop/service-architecture-resilience#user-connections.
What went wrong and why?
Azure Virtual Desktop (AVD) relies on multiple technologies including SQL Database to store configuration data. Each AVD region has a database with multiple secondary copies. We replicate data from our primary copy to these secondary copies using transaction logs, which ensures that all changes are logged and applied to the replicas. Collectively, these replicas house configuration data that is needed when connecting, managing, and troubleshooting AVD.
We determined that this incident was caused by a degradation in the redo process on one of the secondary copies servicing the US geography, which caused failures to access service resources. As the workload picked up on Monday, the redo subsystem on the readable secondary started stalling while processing specific logical log records (around transaction lifetime). Furthermore, we noticed a spike in read latency when fetching the next set of log records to be redone. We have identified an improvement and a bug fix in this space, which are being prioritized, along with critical detection and mitigation alerts.
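Redo lag of the kind described above is the gap between the newest log record on the primary and the newest record the secondary has redone. The following is a minimal sketch of how such lag could be turned into an alert. The telemetry shape, the 300-second threshold, and the three-sample rule are assumptions for illustration, not the AVD or SQL teams' monitoring.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ReplicaSample:
    """One observation of a secondary replica (hypothetical telemetry shape)."""
    primary_commit_ts: float   # seconds; commit time of the newest log record on the primary
    secondary_redo_ts: float   # seconds; commit time of the newest record redone on the secondary

    @property
    def redo_lag_seconds(self) -> float:
        return max(0.0, self.primary_commit_ts - self.secondary_redo_ts)


def should_alert(
    samples: Iterable[ReplicaSample],
    threshold_s: float = 300.0,
    consecutive: int = 3,
) -> bool:
    """Alert once the lag exceeds threshold_s for `consecutive` samples in a row."""
    streak = 0
    for sample in samples:
        streak = streak + 1 if sample.redo_lag_seconds > threshold_s else 0
        if streak >= consecutive:
            return True
    return False
```

Requiring several consecutive breaches is one way to avoid paging on a brief spike while still catching a redo process that has stalled and keeps falling further behind.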
How did we respond?
At 18:46 UTC on 16 September 2024, our internal telemetry system raised alarms about the redo log on one of our read replicas being several hours behind. Our engineering team immediately started investigating this anomaly. To expedite mitigation, we manually failed over the database to its secondary region.
- 18:46 UTC on 16 September 2024 – Customer impact began. Service monitoring detected that the diagnostics service was behind when processing events.
- 18:46 UTC on 16 September 2024 – We determined that a degradation in the US geo database had caused failures to access service resources.
- 18:57 UTC on 16 September 2024 – Automation attempted to perform a ‘friendly’ geo failover, to move the database out of the region.
- 19:02 UTC on 16 September 2024 – The friendly geo failover operation failed since it could not synchronize data between the geo secondary and the unavailable geo primary.
- 19:59 UTC on 16 September 2024 – To mitigate the issue fully, our engineering team manually started a forced geo-failover of the underlying database to the Central US region.
- 20:00 UTC on 16 September 2024 – Geo-failover operation completed.
- 20:36 UTC on 16 September 2024 – Customer impact mitigated. Engineering telemetry confirmed that the connectivity, application discovery, and management issues with the service had been resolved.
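The failover sequence in the timeline above, a friendly attempt followed by a forced one, can be sketched as a simple fallback. All names here are hypothetical, and the approval flag is an assumption that reflects the forced failover being started manually by engineers. It is not a description of the automation actually used.

```python
from typing import Callable


class FailoverBlocked(RuntimeError):
    """Raised when the friendly failover fails and a forced one is not approved."""


def fail_over(
    friendly_failover: Callable[[], bool],
    forced_failover: Callable[[], None],
    forced_approved: bool,
) -> str:
    """Try a friendly failover first; fall back to a forced one only if approved.

    Returns "friendly" or "forced" to record which path completed.
    """
    if friendly_failover():        # True means primary and secondary synchronized
        return "friendly"
    if not forced_approved:
        raise FailoverBlocked("friendly failover failed; forced failover needs approval")
    forced_failover()              # proceeds without synchronizing with the old primary
    return "forced"
```

Keeping the forced path behind an explicit approval reflects the trade-off in the timeline: the friendly path needs the two copies to synchronize, which was not possible while the primary was unavailable.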
How are we making incidents like this less likely or less impactful?
- First and foremost, we will improve our detection of high replication latencies to secondary replicas, to identify problems like this one more quickly. (Estimated completion: October 2024)
- We will improve our troubleshooting guidelines by incorporating mechanisms to perform mitigations without impact, as needed, to bring customers back online more quickly. (Estimated completion: October 2024)
- Our AVD and SQL teams will partner together on updating the SQL Persisted Version Store (PVS) growth troubleshooting guideline, by incorporating this new scenario. (Estimated completion: October 2024)
- To address this particular contention issue in the Redo pipeline, our SQL team has identified a potential new feature (deploying in October 2024) as well as a repair item (which will be completed by December 2024) to de-risk the likelihood of redo lag.
- We will incorporate improved read log generation throttling of non-transactional log records like PVS inserts, which will add another potential mitigation ability to help secondary replicas catch up. (Estimated completion: October 2024)
- Finally, we will work to update our database architecture so that telemetry and control (aka metadata) are stored in separate databases. (Estimated completion: October 2024)
How can customers make incidents like this less impactful?
- For more information on Azure Virtual Desktop host pool object location and data stored there, see https://learn.microsoft.com/azure/virtual-desktop/data-locations#customer-input.
- For more information on Persisted Version Store (PVS), see https://learn.microsoft.com/azure/azure-sql/accelerated-database-recovery
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/1LG8-1X0
5
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/HVZN-VB0
What happened?
Between 08:00 and 21:30 CST on 5 September 2024, customers in the Azure China regions experienced delays in long-running control plane operations including create, update and delete operations on resources. Although these delays affected all regions, the most significant impact was in China North 3. These delays resulted in various issues for customers and services utilizing Azure Resource Manager (ARM), including timeouts and operation failures.
- Azure Databricks – Customers may have encountered errors or failures while submitting job requests.
- Azure Data Factory – Customers may have experienced Internal Server Errors while running Data Flow Activity or acquiring Data Flow Debug Sessions.
- Azure Database for MySQL – Customers making create/update/delete operations to Azure Database for MySQL would not have seen requests completing as expected.
- Azure Event Hubs – Customers attempting to perform read or write operations may have seen slow response times.
- Azure Firewall – Customers making changes or updating their policies in Azure Firewall may have experienced delays in the updates being completed.
- Azure Kubernetes Service – Customers making Cluster Management operations (such as scaling out, updates, creating/deleting clusters) would not have seen requests completing as expected.
- Azure Service Bus – Customers attempting to perform read or write operations may have seen slow response times.
- Microsoft Purview – Customers making create/update/delete operations to Microsoft Purview accounts/resources would not have seen requests completing as expected.
- Other services leveraging Azure Resource Manager – Customers may have experienced service management operation failures if using services inside of Resource Groups in China North 3.
What went wrong and why?
Azure Resource Manager is the control plane Azure customers use to perform CRUD (create, read, update, delete) operations on their resources. For long-running operations such as creating a virtual machine, the operation is processed in the background and that processing needs to be tracked. As background jobs may contain sensitive data, their state is encrypted at the application level. When the certificate used to encrypt the jobs is rotated, any in-flight jobs are re-encrypted with the new metadata. The certificate rotation in the Azure China regions exposed a latent bug in this update process, which caused the processes tracking these background operations to crash repeatedly. While background jobs were still being processed, these crashes caused significant delays and an increasing backlog of jobs.
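The re-encryption step described above can be sketched in general terms as follows. The nature of the latent bug is not detailed here, so this is not a reconstruction of it or of the fix. It only illustrates re-encrypting in-flight job state under a new key, with one common defensive choice: a failure on a single job is recorded rather than allowed to crash the worker. `EncryptedJob`, `reencrypt_in_flight`, and the `decrypt`/`encrypt` callables are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class EncryptedJob:
    job_id: str
    key_id: str          # identifies which certificate/key encrypted the state
    ciphertext: bytes


def reencrypt_in_flight(
    jobs: List[EncryptedJob],
    new_key_id: str,
    decrypt: Callable[[str, bytes], bytes],
    encrypt: Callable[[str, bytes], bytes],
) -> Tuple[List[EncryptedJob], Dict[str, str]]:
    """Re-encrypt jobs under new_key_id; return (migrated jobs, failures by job_id)."""
    migrated: List[EncryptedJob] = []
    failures: Dict[str, str] = {}
    for job in jobs:
        if job.key_id == new_key_id:
            migrated.append(job)          # already on the new key; nothing to do
            continue
        try:
            plaintext = decrypt(job.key_id, job.ciphertext)
            migrated.append(
                EncryptedJob(job.job_id, new_key_id, encrypt(new_key_id, plaintext))
            )
        except Exception as exc:          # isolate the failure to this one job
            failures[job.job_id] = repr(exc)
    return migrated, failures
```

The caller supplies `decrypt` and `encrypt`, so the sketch stays independent of any particular cryptography library. Jobs listed in the returned failures can be retried or inspected without blocking the rest of the backlog.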
How did we respond?
- 08:00 CST on 5 September 2024 – Customer impact began.
- 11:00 CST on 5 September 2024 – We attempted to increase the worker instance count to mitigate the incident, but were not successful.
- 13:05 CST on 5 September 2024 – We continued to investigate to find contributing factors leading to impact.
- 16:02 CST on 5 September 2024 – We identified a correlation between specific release versions and increased error processing latency, and determined a safe rollback version.
- 16:09 CST on 5 September 2024 – We began rolling back to the previous known good build for one component in China East 3, to validate our mitigation approach.
- 16:55 CST on 5 September 2024 – We started to see indications that the mitigation was working as intended.
- 17:36 CST on 5 September 2024 – We confirmed mitigation for China East 3 and China North 3, and began rolling back to the safe version in other regions using a safe deployment process, which took several hours.
- 21:30 CST on 5 September 2024 – The rollback was completed and, after further monitoring, we were confident that service functionality had been fully restored, and customer impact mitigated.
How are we making incidents like this less likely or less impactful?
- We have fixed the latent code bug related to certificate rotation. (Completed)
- We are improving our monitoring and telemetry related to process crashing in the Azure China regions. (Completed)
- Finally, we are moving the background job encryption process to a newer technology that supports fully automated key rotation, to help minimize the potential for impact in the future. (Estimated completion: October 2024)
How can customers make incidents like this less impactful?
- For mission-critical workloads, customers should consider a multi-region geodiversity strategy to avoid impact from incidents like this one that predominantly impacted a single region: https://learn.microsoft.com/training/modules/design-a-geographically-distributed-application and https://learn.microsoft.com/azure/architecture/patterns/geodes
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://docs.azure.cn/service-health/alerts-activity-log-service-notifications-portal
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/HVZN-VB0