
September 2024

27 September 2024

Join our upcoming 'Azure Incident Retrospective' livestream about this incident, or watch on demand:

What happened?

Between 14:47 and 16:25 EDT on 27 September 2024, a platform issue impacted the Azure Resource Manager (ARM) service in our Azure US Government regions. Impacted customers may have experienced failures when attempting control plane operations – including create, update, and delete operations – on resources in the Azure Government cloud, across the US Gov Arizona, US Gov Texas, and/or US Gov Virginia regions.

The incident also impacted downstream services dependent on ARM including Azure App Service, Azure Application Insights, Azure Automation, Azure Backup, Azure Data Factory, Azure Databricks, Azure Event Grid, Azure Kubernetes Service, Azure Log Search Alerts, Azure Maps, Azure Monitor, Azure NetApp Files, Azure portal, Azure Privileged Identity Management, Azure Red Hat OpenShift, Azure Search, Azure Site Recovery, Azure Storage, Azure Synapse, Azure Traffic Manager, Azure Virtual Desktop, and Microsoft Purview.

What went wrong and why?

This issue was initially flagged by our monitors detecting success rate failures. Upon investigating, we discovered that a backend Cosmos DB account had been misconfigured in a way that blocked legitimate access from Azure Resource Manager (ARM), preventing ARM from serving requests. Although Cosmos DB accounts can be replicated to multiple regions, their configuration settings are global – so when this account was misconfigured, the change applied immediately to all replicas, impacting ARM in every region. Once this was understood, the misconfiguration was reverted, which fully mitigated all customer impact.

The misconfigured account is used to track Azure Feature Exposure Control state. Managing feature state is handled by a resource provider, for which using Entra authentication is appropriate. However, feature state is also a critical part of processing requests into ARM itself, since call behavior may change depending on which feature flags a subscription is assigned.

The Cosmos DB account managing this state had been incorrectly attributed to the resource provider platform, rather than to core ARM processing. Since the resource provider platform had completed its migration to Entra authorization for Cosmos DB accounts, the service was disabling local authentication on all of the accounts it owns, as part of our Secure Future Initiative – for more details, see:

Because the account in question, used to track the exposure control state of Azure features, was misattributed to the resource provider platform, it was included in that update. ARM itself, however, deliberately avoids using Entra authentication for its Cosmos DB access, because Entra depends on ARM and using it would create a circular dependency – so disabling local authentication on this account blocked ARM's access.
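
For context on the mechanism involved: on Azure Cosmos DB accounts, key-based ("local") authentication is governed by the account-level disableLocalAuth property. The sketch below – using hypothetical subscription and account names, and not the internal tooling involved in this incident – shows how that setting can be inspected through the ARM REST API.

```python
# Illustrative sketch: read the disableLocalAuth setting of a Cosmos DB account
# through the ARM REST API. All resource names below are hypothetical.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION = "00000000-0000-0000-0000-000000000000"   # hypothetical
RESOURCE_GROUP = "example-rg"                           # hypothetical
ACCOUNT = "example-cosmos-account"                      # hypothetical
ARM = "https://management.azure.com"
API_VERSION = "2023-04-15"  # assumed GA Cosmos DB management API version

def arm_headers() -> dict:
    # DefaultAzureCredential resolves managed identity, Azure CLI login, etc.
    token = DefaultAzureCredential().get_token(f"{ARM}/.default")
    return {"Authorization": f"Bearer {token.token}"}

def local_auth_disabled() -> bool:
    url = (
        f"{ARM}/subscriptions/{SUBSCRIPTION}/resourceGroups/{RESOURCE_GROUP}"
        f"/providers/Microsoft.DocumentDB/databaseAccounts/{ACCOUNT}"
    )
    resp = requests.get(url, headers=arm_headers(), params={"api-version": API_VERSION})
    resp.raise_for_status()
    # True means key-based (local) authentication is rejected and only
    # Entra ID tokens are accepted by the account.
    return bool(resp.json()["properties"].get("disableLocalAuth", False))

if __name__ == "__main__":
    print("disableLocalAuth:", local_auth_disabled())
```

A corresponding PATCH of properties.disableLocalAuth is the kind of account-level change that, as this incident shows, needs to be gated carefully for accounts on critical request paths.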

How did we respond?

  • 27 September 2024 @ 14:47 EDT - Customer impact began, triggered by a misconfiguration applied to the underlying Cosmos DB.
  • 27 September 2024 @ 14:55 EDT - Monitoring detected success rate failures, on-call teams engaged to investigate.
  • 27 September 2024 @ 15:45 EDT - Investigations confirmed the cause as the Cosmos DB misconfiguration.
  • 27 September 2024 @ 16:25 EDT - Customer impact mitigated, by reverting the misconfiguration.

How are we making incidents like this less likely or less impactful?

  • We are moving the impacted Cosmos DB account to an isolated subscription, to de-risk this failure mode. (Estimated completion: October 2024)
  • Furthermore, we will apply management locks to prevent edits to this specific account, given its criticality – a minimal sketch of such a lock follows this list. (Estimated completion: October 2024)
  • In the longer term, our Cosmos DB team are updating processes to roll out account configuration changes in a staggered fashion, to de-risk impact from changes like this one. (Estimated completion: January 2025)
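
As a minimal sketch of the management-lock repair item above – with hypothetical resource names, and using the public Microsoft.Authorization/locks API rather than any internal tooling – a ReadOnly lock can be applied to a critical account like this:

```python
# Illustrative sketch: place a ReadOnly management lock on a critical resource
# via the Microsoft.Authorization/locks ARM API. Names are hypothetical.
import requests
from azure.identity import DefaultAzureCredential

RESOURCE_ID = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/example-rg"
    "/providers/Microsoft.DocumentDB/databaseAccounts/example-cosmos-account"
)
LOCK_NAME = "deny-config-edits"   # hypothetical lock name
API_VERSION = "2016-09-01"        # management lock API version

def create_readonly_lock() -> dict:
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default")
    url = (
        f"https://management.azure.com{RESOURCE_ID}"
        f"/providers/Microsoft.Authorization/locks/{LOCK_NAME}"
    )
    body = {
        "properties": {
            # ReadOnly blocks updates and deletes; CanNotDelete blocks deletes only.
            "level": "ReadOnly",
            "notes": "Critical dependency - remove the lock before any config change.",
        }
    }
    resp = requests.put(
        url,
        headers={"Authorization": f"Bearer {token.token}"},
        params={"api-version": API_VERSION},
        json=body,
    )
    resp.raise_for_status()
    return resp.json()
```

Note that a ReadOnly lock also blocks some management-plane POST actions (such as listing keys) on certain resource types, so the exact lock level used for this repair item is an assumption here.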

How can customers make incidents like this less impactful?

There was nothing that customers could have done to prevent or minimize the impact from this specific ARM incident.

The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact:

Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more:
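
As a minimal sketch of that guidance: a Service Health alert is an activity log alert scoped to the ServiceHealth event category, wired to an existing action group (email, SMS, webhook, and so on). All names below are hypothetical, and the action group is assumed to already exist.

```python
# Illustrative sketch: create an Azure Service Health alert via the ARM REST API.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION = "00000000-0000-0000-0000-000000000000"   # hypothetical
RESOURCE_GROUP = "alerts-rg"                            # hypothetical
ALERT_NAME = "service-health-alert"                     # hypothetical
ACTION_GROUP_ID = (                                     # hypothetical, must already exist
    f"/subscriptions/{SUBSCRIPTION}/resourceGroups/{RESOURCE_GROUP}"
    "/providers/microsoft.insights/actionGroups/oncall-notifications"
)
API_VERSION = "2020-10-01"  # assumed current activity log alert API version

def create_service_health_alert() -> dict:
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default")
    url = (
        f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
        f"/resourceGroups/{RESOURCE_GROUP}"
        f"/providers/microsoft.insights/activityLogAlerts/{ALERT_NAME}"
    )
    body = {
        "location": "Global",
        "properties": {
            "enabled": True,
            "scopes": [f"/subscriptions/{SUBSCRIPTION}"],
            # Fire on any ServiceHealth event: incidents, planned maintenance, advisories.
            "condition": {"allOf": [{"field": "category", "equals": "ServiceHealth"}]},
            "actions": {"actionGroups": [{"actionGroupId": ACTION_GROUP_ID}]},
        },
    }
    resp = requests.put(
        url,
        headers={"Authorization": f"Bearer {token.token}"},
        params={"api-version": API_VERSION},
        json=body,
    )
    resp.raise_for_status()
    return resp.json()
```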

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

16 September 2024

Watch our 'Azure Incident Retrospective' video about this incident: 

What happened?

Between 18:46 UTC and 20:36 UTC on 16 September 2024, a platform issue impacted a subset of customers using Azure Virtual Desktop in several US regions – specifically, those whose Azure Virtual Desktop configuration and metadata are stored in the US geography, which we host in the East US 2 region. Impacted customers may have experienced failures to access their list of available resources, make new connections, or perform management actions on their Azure Virtual Desktop resources.

This event impacted customers in the following US regions: Central US, East US, East US 2, North Central US, South Central US, West Central US, West US, West US 2, and West US 3. End users with connections established via any of these US regions may also have been impacted. This connection routing is determined dynamically, to route end users to their nearest Azure Virtual Desktop gateway. More information on this process can be found at:

What went wrong and why?

Azure Virtual Desktop (AVD) relies on multiple technologies including SQL Database to store configuration data. Each AVD region has a database with multiple secondary copies. We replicate data from our primary copy to these secondary copies using transaction logs, which ensures that all changes are logged and applied to the replicas. Collectively, these replicas house configuration data that is needed when connecting, managing, and troubleshooting AVD.

We determined that this incident was caused by a degradation in the redo process on one of the secondary copies serving the US geography, which caused failures to access service resources. As the workload picked up on Monday, the redo subsystem on the readable secondary began stalling while processing specific logical log records (related to transaction lifetime). We also observed a spike in read latency when fetching the next set of log records to be redone. We have identified an improvement and a bug fix in this area, which are being prioritized, along with critical detection and mitigation alerts.
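
The redo telemetry described above is internal to the platform, but customers running their own geo-replicated Azure SQL databases can watch for the analogous symptom – growing replication lag – with the sys.dm_geo_replication_link_status DMV. A minimal sketch, assuming pyodbc, placeholder credentials, and hypothetical server and database names:

```python
# Illustrative sketch: report geo-replication lag for an Azure SQL database,
# queried from the primary. Connection details below are placeholders.
import pyodbc

CONN_STR = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:example-primary.database.windows.net,1433;"
    "Database=example-db;"
    "UID=example_admin;PWD=<password>;"   # placeholder credentials
    "Encrypt=yes;"
)

QUERY = """
SELECT partner_server,
       partner_database,
       replication_state_desc,
       replication_lag_sec,
       last_replication
FROM sys.dm_geo_replication_link_status;
"""

def check_geo_replication_lag(threshold_sec: int = 300) -> None:
    with pyodbc.connect(CONN_STR) as conn:
        for row in conn.cursor().execute(QUERY):
            lag = row.replication_lag_sec
            flag = "  <-- above threshold" if lag and lag > threshold_sec else ""
            print(
                f"{row.partner_server}/{row.partner_database}: "
                f"{row.replication_state_desc}, lag={lag}s{flag}"
            )

if __name__ == "__main__":
    check_geo_replication_lag()
```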

How did we respond?

At 18:46 UTC on 16 September 2024, our internal telemetry system raised alarms that the redo log on one of our read replicas was several hours behind. Our engineering team immediately started investigating this anomaly. To expedite mitigation, we manually failed over the database to its secondary region.

  • 18:46 UTC on 16 September 2024 – Customer impact began. Service monitoring detected that the diagnostics service was behind when processing events.
  • 18:46 UTC on 16 September 2024 – We determined that a degradation in the US geo database had caused failures to access service resources.
  • 18:57 UTC on 16 September 2024 – Automation attempted to perform a ‘friendly’ geo failover, to move the database out of the region.
  • 19:02 UTC on 16 September 2024 – The friendly geo failover operation failed since it could not synchronize data between the geo secondary and the unavailable geo primary.
  • 19:59 UTC on 16 September 2024 – To mitigate the issue fully, our engineering team manually started a forced geo-failover of the underlying database to the Central US region.
  • 20:00 UTC on 16 September 2024 – Geo-failover operation completed.
  • 20:36 UTC on 16 September 2024 – Customer impact mitigated. Engineering telemetry confirmed that connectivity, application discovery, and management issues with the service had been resolved.
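
For context, the ‘friendly’ versus ‘forced’ geo-failover distinction in the timeline above maps, for customers who use active geo-replication on their own Azure SQL databases, to two T-SQL commands run against the master database of the secondary server. A minimal sketch with hypothetical names – this is not AVD's internal tooling:

```python
# Illustrative sketch: planned ("friendly") vs forced geo-failover of an Azure
# SQL database using active geo-replication. Run against the SECONDARY server.
import pyodbc

SECONDARY_MASTER_CONN = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:example-secondary.database.windows.net,1433;"
    "Database=master;"
    "UID=example_admin;PWD=<password>;"   # placeholder credentials
    "Encrypt=yes;"
)

def geo_failover(database: str, forced: bool = False) -> None:
    # FAILOVER waits for full synchronization (no data loss); the forced variant
    # does not wait, and is the last resort when the primary is unreachable.
    command = (
        f"ALTER DATABASE [{database}] FORCE_FAILOVER_ALLOW_DATA_LOSS;"
        if forced
        else f"ALTER DATABASE [{database}] FAILOVER;"
    )
    conn = pyodbc.connect(SECONDARY_MASTER_CONN, autocommit=True)
    try:
        conn.execute(command)
    finally:
        conn.close()

# Example: geo_failover("example-db", forced=True)
```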

How are we making incidents like this less likely or less impactful?

  • First and foremost, we will improve our detection of high replication latencies to secondary replicas, to identify problems like this one more quickly. (Estimated completion: October 2024)
  • We will improve our troubleshooting guidelines by incorporating mechanisms to perform mitigations without impact, as needed, to bring customers back online more quickly. (Estimated completion: October 2024)
  • Our AVD and SQL teams will partner together on updating the SQL Persisted Version Store (PVS) growth troubleshooting guideline, by incorporating this new scenario. (Estimated completion: October 2024)
  • To address this particular contention issue in the Redo pipeline, our SQL team has identified a potential new feature (deploying in October 2024) as well as a repair item (which will be completed by December 2024) to de-risk the likelihood of redo lag.
  • We will incorporate improved read log generation throttling of non-transactional log records like PVS inserts, which will give us another potential mitigation to help secondary replicas catch up. (Estimated completion: October 2024)
  • Finally, we will work to update our database architecture so that telemetry and control (aka metadata) are stored in separate databases. (Estimated completion: October 2024)

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

5 September 2024

Watch our 'Azure Incident Retrospective' video about this incident:

What happened?

Between 08:00 and 21:30 CST on 5 September 2024, customers in the Azure China regions experienced delays in long-running control plane operations, including create, update, and delete operations on resources. Although these delays affected all Azure China regions, the most significant impact was in China North 3. These delays resulted in various issues for customers and services utilizing Azure Resource Manager (ARM), including timeouts and operation failures.

  • Azure Databricks – Customers may have encountered errors or failures while submitting job requests.
  • Azure Data Factory – Customers may have experienced Internal Server Errors while running Data Flow Activity or acquiring Data Flow Debug Sessions.
  • Azure Database for MySQL – Customers making create/update/delete operations to Azure Database for MySQL would not have seen requests completing as expected.
  • Azure Event Hubs - Customers attempting to perform read or write operations may have seen slow response times.
  • Azure Firewall - Customers making changes or updating their policies in Azure Firewall may have experienced delays in the updates being completed.
  • Azure Kubernetes Service - Customers making Cluster Management operations (such as scaling out, updates, creating/deleting clusters) would not have seen requests completing as expected.
  • Azure Service Bus - Customers attempting to perform read or write operations may have seen slow response times.
  • Microsoft Purview - Customers making create/update/delete operations to Microsoft Purview accounts/resources would not have seen requests completing as expected.
  • Other services leveraging Azure Resource Manager - Customers may have experienced service management operation failures if using services inside of Resource Groups in China North 3.

What went wrong and why?

Azure Resource Manager is the control plane Azure customers use to perform CRUD (create, read, update, delete) operations on their resources. For long-running operations, such as creating a virtual machine, there is background processing of the operation that needs to be tracked. During a certificate rotation, the processes tracking these background operations were repeatedly crashing. While background jobs were still being processed, these crashes caused significant delays and an increasing backlog of jobs. As background jobs may contain sensitive data, their state is encrypted at the application level. When the certificate used to encrypt the jobs is rotated, any in-flight jobs are re-encrypted with the new certificate. The certificate rotation in the Azure China regions exposed a latent bug in this update process, which caused the processes to begin crashing.
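
To illustrate the rotation concept only – this is not ARM's actual implementation – the sketch below uses the 'cryptography' package's Fernet primitives as a stand-in for application-level encryption of job state. It shows how in-flight items are re-encrypted under a new key while older ciphertext remains readable, and why the re-encryption step needs to be defensive so that one bad record cannot crash the workers processing the backlog.

```python
# Illustrative sketch: key rotation with re-encryption of persisted job state.
from cryptography.fernet import Fernet, InvalidToken, MultiFernet

# Existing key material and a job encrypted under it.
old_key = Fernet.generate_key()
job_ciphertext = Fernet(old_key).encrypt(b'{"operation": "createVM", "state": "running"}')

# Rotation: introduce a new primary key, keeping the old one for decryption.
new_key = Fernet.generate_key()
ring = MultiFernet([Fernet(new_key), Fernet(old_key)])

def reencrypt(ciphertext: bytes) -> bytes | None:
    """Re-encrypt one in-flight job under the new primary key."""
    try:
        return ring.rotate(ciphertext)
    except InvalidToken:
        # Quarantine undecryptable records instead of crashing the worker,
        # so a single bad record cannot stall the whole job backlog.
        return None

rotated = reencrypt(job_ciphertext)
assert rotated is not None
assert ring.decrypt(rotated) == b'{"operation": "createVM", "state": "running"}'
```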

How did we respond?

  • 08:00 CST on 5 September 2024 – Customer impact began.
  • 11:00 CST on 5 September 2024 – We attempted to increase the worker instance count to mitigate the incident, but were not successful.
  • 13:05 CST on 5 September 2024 – We continued to investigate to find contributing factors leading to impact.
  • 16:02 CST on 5 September 2024 – We identified a correlation between specific release versions and increased error processing latency, and determined a safe rollback version.
  • 16:09 CST on 5 September 2024 – We began rolling back to the previous known good build for one component in China East 3, to validate our mitigation approach.
  • 16:55 CST on 5 September 2024 – We started to see indications that the mitigation was working as intended.
  • 17:36 CST on 5 September 2024 – We confirmed mitigation for China East 3 and China North 3, and began rolling back to the safe version in other regions using a safe deployment process which took several hours.
  • 21:30 CST on 5 September 2024 – The rollback was completed and, after further monitoring, we were confident that service functionality had been fully restored, and customer impact mitigated. 

How are we making incidents like this less likely or less impactful?

  • We have fixed the latent code bug related to certificate rotation. (Completed)
  • We are improving our monitoring and telemetry related to process crashing in the Azure China regions. (Completed)
  • Finally, we are moving the background job encryption process to a newer technology that supports fully automated key rotation, to help minimize the potential for impact in the future. (Estimated completion: October 2024)

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: 

August 2024

5 August 2024

Watch our 'Azure Incident Retrospective' video about this incident:

What happened?

Between 18:22 and 19:49 UTC on 5 August 2024, some customers experienced intermittent connection errors, timeouts, or increased latency while connecting to Microsoft services that leverage Azure Front Door (AFD), because of an issue that impacted multiple geographies. Impacted services included Azure DevOps (ADO) and Azure Virtual Desktop (AVD), as well as a subset of LinkedIn and Microsoft 365 services. This incident specifically impacted Microsoft’s first-party services, and our customers who utilized these Microsoft services that rely on AFD. For clarity, this incident did not impact Azure customers’ services that use AFD directly.

What went wrong and why?

Azure Front Door (AFD) is Microsoft's scalable platform for web acceleration, global load balancing, and content delivery, operating in over 200 locations worldwide and serving more than 20 million requests per second. This incident was triggered by an internal service team making a routine configuration change to their AFD profile, to modify an internal-only AFD feature. This change included an erroneous configuration file that resulted in high memory allocation rates on a subset of the AFD servers – specifically, servers that deliver traffic for an internal Microsoft customer profile.

While the rollout of the erroneous configuration file did adhere to our Safe Deployment Practices (SDP), this process relies on validation checks from the service to which the change is being applied (in this case, AFD) before proceeding to additional regions. In this situation there were no observable failures, so the configuration change was allowed to proceed. This represents a gap in how AFD validates such changes, and it prevented a timely auto-rollback of the offending configuration – expanding the blast radius of the impact to more regions.

The error in the configuration file caused resource exhaustion on AFD’s frontend servers, resulting in an initial moderate impact to AFD’s availability and performance. This initial availability impact triggered a latent bug in the internal service’s client application, which caused an aggressive request storm – a 25X increase in the volume of expensive requests – that significantly worsened AFD’s availability and performance.
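
To illustrate the client-side failure mode only – this is a generic sketch, not the internal client involved in the incident – a retry budget combined with exponential backoff and jitter is the kind of safeguard that keeps a degraded dependency from being hit with a request storm:

```python
# Illustrative sketch: cap retries with a budget plus exponential backoff and
# jitter, so client retries cannot amplify load on an already-degraded service.
import random
import time

class RetryBudget:
    """Allow at most `ratio` retries per original request."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def can_retry(self) -> bool:
        return self.retries < max(1, int(self.requests * self.ratio))

def call_with_budget(send, budget: RetryBudget, max_attempts: int = 3):
    """`send` is any zero-argument callable that raises on failure."""
    budget.requests += 1
    delay = 0.5
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except Exception:
            if attempt == max_attempts or not budget.can_retry():
                raise                                          # stop amplifying load
            budget.retries += 1
            time.sleep(delay * random.uniform(0.5, 1.5))       # backoff with jitter
            delay *= 2
```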

How did we respond? 

During our investigation, we identified the offending configuration from the internal service that triggered this event, and we then initiated a rollback of the configuration change to fully restore all impacted Microsoft services.

  • 18:22 UTC – Customer impact began, triggered by the configuration change.
  • 18:25 UTC – Internal monitoring detected the issue and alerted our teams to investigate.
  • 18:27 UTC – First-party services began failing away from AFD as a temporary mitigation.
  • 18:53 UTC – We identified the configuration change that caused the impact and started preparing to roll it back.
  • 19:23 UTC – We began rolling back the change.
  • 19:25 UTC – Rollback of the change completed, restoring AFD request serving capacity.
  • 19:49 UTC – Customer impact fully mitigated.
  • 20:30 UTC – First-party services began routing traffic back to AFD after mitigation was declared.

How are we making incidents like this less likely or less impactful?

  • As an immediate repair item, we are fixing the bug in the internal service team’s client application that caused the aggressive request storm. This development work has completed and the fix is pending rollout. (Estimated completion: August 2024)
  • We are making improvements to validate the complexity and resource overhead requirements for all internal customer configuration updates. (Estimated completion: September 2024)
  • We are implementing enhancements to the AFD data plane to automatically disable performance-intensive operations for specific configurations. (Estimated completion: October 2024)
  • We are enhancing our protection against similar request storms, by dynamically identifying resource-intensive requests so that they can be throttled to prevent further impact (a generic throttling sketch follows this list). (Estimated completion: December 2024)
  • In the longer term, we are augmenting our safe deployment processes for configuration delivery to incorporate more signals for detecting AFD service health degradations caused by erroneous configurations, and to roll back automatically. (Estimated completion: March 2025)
  • Finally, as an additional protection measure to guard against validation gaps, we are investing in a capability to automatically fall back to a last known good configuration state, in the event of global anomalies detected in AFD’s service health – for example, in response to a multi-region availability drop. (Estimated completion: March 2025)
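
As a generic sketch of the throttling idea referenced above – not AFD's data plane – a token bucket can cap the rate at which requests classified as resource-intensive are accepted:

```python
# Illustrative sketch: token-bucket throttling for expensive request classes.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

expensive_requests = TokenBucket(rate_per_sec=100, burst=200)  # hypothetical limits

def handle(request, is_expensive: bool) -> int:
    if is_expensive and not expensive_requests.allow():
        return 429      # Too Many Requests: reject rather than exhaust resources
    # ... process the request ...
    return 200
```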

How can customers make incidents like this less impactful?

  • Applications that use exponential backoff in their retry strategy may have seen success, since an immediate retry during an interval of high packet loss would likely also have encountered high packet loss, whereas a retry conducted during a period of lower loss would likely have succeeded. A minimal backoff sketch follows this list; for more details on retry patterns, refer to:
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: 
  • Finally, ensure that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These alerts can trigger emails, SMS, push notifications, webhooks, and more: 
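
As a minimal sketch of the retry guidance above, assuming the 'requests' library with urllib3's Retry helper (urllib3 1.26 or later for the allowed_methods argument) and a hypothetical endpoint:

```python
# Illustrative sketch: an HTTP session that retries transient failures with
# exponential backoff, rather than failing on the first dropped connection.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session() -> requests.Session:
    retry = Retry(
        total=5,                  # overall cap on retries
        connect=5,                # retry connection failures (timeouts, resets)
        read=2,
        backoff_factor=0.5,       # sleeps roughly 0.5s, 1s, 2s, 4s between attempts
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=frozenset({"GET", "HEAD"}),   # retry idempotent calls only
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

# Example (hypothetical endpoint):
# make_session().get("https://example-endpoint.azurefd.net/", timeout=10)
```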

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: 

July 2024

30 July 2024

Watch our 'Azure Incident Retrospective' video about this incident: 

What happened?

Between 11:45 and 13:58 UTC on 30 July 2024, a subset of customers experienced intermittent connection errors, timeouts, or latency spikes while connecting to Microsoft services that leverage Azure Front Door (AFD) and Azure Content Delivery Network (CDN). From 13:58 to 19:43 UTC, a smaller set of customers continued to observe a low rate of connection timeouts. Beyond AFD and CDN, downstream services that rely on these were also impacted – including the Azure portal, and a subset of Microsoft 365 and Microsoft Purview services.

After a routine mitigation of a Distributed Denial-of-Service (DDoS) attack, a network misconfiguration caused congestion and packet loss for AFD frontends. For context, we experience an average of 1,700 DDoS attacks per day – these are mitigated automatically by our DDoS protection mechanisms. For this incident, the DDoS attack was merely a trigger event. Customers can learn more about how we manage these events here:

What went wrong and why?

Azure Front Door (AFD) is Microsoft's scalable platform for web acceleration, global load balancing, and content delivery, operating in nearly 200 locations worldwide – including datacenters within Azure regions, and edge sites. AFD and Azure CDN are built with platform defenses against network and application layer Distributed Denial-of-Service (DDoS) attacks. In addition, these services rely on the Azure network DDoS protection service for attacks at the network layer. You can read more about these protection mechanisms at:

Between 10:15 and 10:45 UTC, a volumetric distributed TCP SYN flood DDoS attack occurred at multiple Azure Front Door and CDN sites. This attack was automatically mitigated by the Azure Network DDoS protection service and had minimal customer impact. During this time period, our automated DDoS mitigations sent SYN authentication challenges, as is typical in the industry for mitigating DDoS attacks. As a result, a small subset of customers without retry logic in their application(s) may have experienced connection failures.

At around 11:45 UTC, as the Network DDoS protection service was disengaging and resuming default traffic routing to the Azure Front Door service, the network routes could not be updated within one specific site in Europe. This happened because of network control plane failures at that site, caused by a local power outage. Consequently, traffic inside Europe continued to be forwarded to AFD through our DDoS protection services, instead of returning directly to AFD. These control plane failures were not caused by or related to the initial DDoS trigger event – and in isolation, would not have caused any impact.

However, an unrelated latent network configuration issue caused traffic from outside Europe to be routed to the DDoS protection system within Europe. This led to localized congestion, which caused customers to experience high latency and connectivity failures across multiple regions. The vast majority of the impact was mitigated by 13:58 UTC, when we resolved the routing issue. As was the case during the initial period, a small subset of customers without proper retry logic in their application(s) may have experienced isolated connection failures until 19:43 UTC.

How did we respond?

  • 11:45 UTC on 30 July 2024 – Impact started
  • 11:47 UTC on 30 July 2024 – Our Azure Portal team detected initial service degradation and began to investigate.
  • 12:10 UTC on 30 July 2024 – Our network monitoring correlated this Portal incident to an underlying network issue at one specific site in Europe, and our networking engineers engaged to support the investigation.
  • 12:55 UTC on 30 July 2024 – We confirmed localized congestion, so engineers began executing our standard playbook to alleviate congestion – including rerouting traffic.
  • 13:13 UTC on 30 July 2024 – We published communications stating that we were investigating reports of issues connecting to Microsoft services globally, and that customers may experience timeouts connecting to Azure services.
  • 13:58 UTC on 30 July 2024 – The changes to reroute traffic successfully mitigated most of the impact by this time, after which the only remaining impact was isolated connection failures.
  • 16:15 UTC on 30 July 2024 – While investigating the isolated connection failures, we identified a device within Europe that was not properly obeying commands from the network control plane, and continued to attract traffic after being instructed to stop.
  • 16:58 UTC on 30 July 2024 – We instructed the network control plane to reissue its commands, but the problematic device was not accessible, as described above.
  • 17:50 UTC on 30 July 2024 – We started the safe removal of the device from the network and began scanning the network for other potential issues.
  • 19:32 UTC on 30 July 2024 – We completed the safe removal of the device from the network.
  • 19:43 UTC on 30 July 2024 – Customer impact mitigated, as we confirmed availability returned to pre-incident levels.

How are we making incidents like this less likely or less impactful?

  • We have already added the missing configuration on network devices, to ensure a DDoS mitigation issue in one geography cannot spread to another. (Completed)
  • We are enhancing our existing validation and monitoring in the Azure network, to detect invalid configurations. (Estimated completion: November 2024)
  • We are improving our monitoring to detect cases where our DDoS protection service is unreachable from the control plane but is still serving traffic. (Estimated completion: November 2024)

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: