5 August 2024
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/0N_5-PQ0
What happened?
Between 18:22 and 19:49 UTC on 5 August 2024, some customers experienced intermittent connection errors, timeouts, or increased latency while connecting to Microsoft services that leverage Azure Front Door (AFD), because of an issue that impacted multiple geographies. Impacted services included Azure DevOps (ADO) and Azure Virtual Desktop (AVD), as well as a subset of LinkedIn and Microsoft 365 services. This incident specifically impacted Microsoft’s first-party services, and customers who used those Microsoft services. For clarity, this incident did not impact Azure customers’ own services that use AFD directly.
What went wrong and why?
Azure Front Door (AFD) is Microsoft's scalable platform for web acceleration, global load balancing, and content delivery, operating in over 200 locations worldwide and serving 20+ million requests per second. This incident was triggered by an internal service team making a routine configuration change to their AFD profile, to modify an internal-only AFD feature. This change included an erroneous configuration file that resulted in high memory allocation rates on a subset of AFD servers – specifically, servers that deliver traffic for an internal Microsoft customer profile.
While the rollout of the erroneous configuration file adhered to our Safe Deployment Practices (SDP), this process relies on validation checks from the service to which the change is being applied (in this case, AFD) before proceeding to additional regions. In this situation there were no observable failures, so the configuration change was allowed to proceed. This represents a gap in how AFD validates changes, and it prevented timely auto-rollback of the offending configuration – thus expanding the blast radius of impact to more regions.
The error in the configuration file caused resource exhaustion on AFD’s frontend servers, resulting in an initial moderate impact to AFD’s availability and performance. This initial availability impact triggered a latent bug in the internal service’s client application, which caused an aggressive request storm: a 25x increase in the volume of expensive requests, which significantly degraded AFD’s availability and performance further.
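As a minimal illustration of the health-gated rollout discussed here (region names, thresholds, and function names are hypothetical; this is not AFD's actual deployment code), a staged rollout can advance region by region only while a health signal stays below a threshold, and otherwise roll back to the last known good configuration:

```python
# Hypothetical sketch of a health-gated staged rollout with auto-rollback.
# Region names, thresholds, and health signals are illustrative only.

REGIONS = ["region-1", "region-2", "region-3"]
ERROR_RATE_THRESHOLD = 0.01  # roll back if more than 1% of requests fail

def staged_rollout(config, last_known_good, error_rate):
    """Deploy `config` region by region; roll back everything on degradation.

    `error_rate(region)` stands in for a richer health signal. The gap in
    this incident was that "no observable failures" alone allowed a bad
    change to keep progressing to more regions.
    """
    deployed = []
    for region in REGIONS:
        print(f"deploying {config} to {region}")
        deployed.append(region)
        if error_rate(region) > ERROR_RATE_THRESHOLD:
            # Auto-rollback the full blast radius before halting the rollout.
            for r in reversed(deployed):
                print(f"rolling back {r} to {last_known_good}")
            return False
    return True

# A healthy rollout proceeds through all regions:
ok = staged_rollout("config-v2", "config-v1", error_rate=lambda r: 0.0)
# A degraded region halts the rollout and reverts:
bad = staged_rollout("config-v2", "config-v1", error_rate=lambda r: 0.25)
```

The key design point is that the health signal, not merely the absence of deployment errors, gates each stage.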
How did we respond?
During our investigation, we identified the offending configuration from the internal service that triggered this event, and we then initiated a rollback of the configuration change to fully restore all impacted Microsoft services.
- 18:22 UTC – Customer impact began, triggered by the configuration change.
- 18:25 UTC – Internal monitoring detected the issue and alerted our teams to investigate.
- 18:27 UTC – First-party services began failing over away from AFD as a temporary mitigation.
- 18:53 UTC – We identified the configuration change that caused the impact and started preparing to roll it back.
- 19:23 UTC – We began rolling back the change.
- 19:25 UTC – Rollback of the change completed, restoring AFD request serving capacity.
- 19:49 UTC – Customer impact fully mitigated.
- 20:30 UTC – First-party services began routing traffic back to AFD after mitigation was declared.
How are we making incidents like this less likely or less impactful?
- As an immediate repair item, we are fixing the bug in the internal service team’s client application that caused the aggressive request storm. This development work has been completed and is pending rollout. (Estimated completion: August 2024)
- We are making improvements to validate the complexity and resource overhead requirements for all internal customer configuration updates. (Estimated completion: September 2024)
- We are implementing enhancements to the AFD data plane to automatically disable performance-intensive operations for specific configurations. (Estimated completion: October 2024)
- We are enhancing our protection against similar request storms by dynamically identifying resource-intensive requests, so that they can be throttled to prevent further impact. (Estimated completion: December 2024)
- In the longer term, we are augmenting our safe deployment processes for configuration delivery to incorporate more signals that detect AFD service health degradations caused by erroneous configurations, and to roll back automatically. (Estimated completion: March 2025)
- Finally, as an additional protection measure to guard against validation gaps, we are investing in a capability to automatically fall back to a last known good configuration state, in the event of global anomalies detected in AFD’s service health – for example, in response to a multi-region availability drop. (Estimated completion: March 2025)
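One way to reason about the request-storm protection described above is a cost-aware token bucket: expensive requests drain the budget faster than cheap ones, so a surge of them is throttled before it exhausts server resources. The sketch below uses hypothetical names and parameters; it is not AFD's implementation.

```python
import time

class CostAwareTokenBucket:
    """Throttle requests by estimated cost, not just by count.

    Hypothetical sketch: expensive requests consume more tokens, so a
    storm of them is rejected before resources are exhausted, while
    cheap traffic continues to flow.
    """

    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self, cost):
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # request throttled

# Usage sketch: cheap requests pass; a very expensive one is throttled.
bucket = CostAwareTokenBucket(capacity=100, refill_per_sec=10)
cheap_allowed = sum(bucket.allow(cost=1) for _ in range(50))
expensive_allowed = bucket.allow(cost=200)
```

In practice, the "cost" of a request would be estimated dynamically from its resource profile, matching the repair item's goal of identifying resource-intensive requests at runtime.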
How can customers make incidents like this less impactful?
- Applications that use exponential backoff in their retry strategy may have seen success: an immediate retry during intervals of high packet loss was likely to fail again, whereas a retry conducted during periods of lower loss would likely have succeeded. For more details on retry patterns, refer to https://learn.microsoft.com/azure/architecture/patterns/retry
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/azure/architecture/framework/resiliency
- Finally, ensure that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These alerts can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
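The retry guidance in the first bullet can be sketched as follows. This is a minimal illustration using a hypothetical `flaky_call` operation, not a specific Azure SDK API: retries back off exponentially, with jitter, so that later attempts land in periods of lower loss instead of hammering a degraded endpoint.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=8.0):
    """Retry a transient-failure-prone operation with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the failure
            # Exponential backoff (base, 2x base, 4x base, ...) capped at
            # max_delay, with full jitter to avoid synchronized retries.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

# Usage sketch: a simulated endpoint that fails twice, then succeeds.
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient connection error")
    return "ok"

result = retry_with_backoff(flaky_call, base_delay=0.01)
```

The jitter matters as much as the backoff: without it, many clients retrying in lockstep can themselves produce a request storm like the one in this incident.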
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/0N_5-PQ0