30 July 2024
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/KTY1-HW8
What happened?
Between 11:45 and 13:58 UTC on 30 July 2024, a subset of customers experienced intermittent connection errors, timeouts, or latency spikes while connecting to Microsoft services that leverage Azure Front Door (AFD) and Azure Content Delivery Network (CDN). From 13:58 to 19:43 UTC, a smaller set of customers continued to observe a low rate of connection timeouts. Beyond AFD and CDN, downstream services that rely on these were also impacted – including the Azure portal, and a subset of Microsoft 365 and Microsoft Purview services.
After a routine mitigation of a Distributed Denial-of-Service (DDoS) attack, a network misconfiguration caused congestion and packet loss for AFD frontends. For context, we experience an average of 1,700 DDoS attacks per day – these are mitigated automatically by our DDoS protection mechanisms. Customers can learn more about how we manage these events here: https://azure.microsoft.com/blog/unwrapping-the-2023-holiday-season-a-deep-dive-into-azures-ddos-attack-landscape. For this incident, the DDoS attack was merely a trigger event.
What went wrong and why?
Azure Front Door (AFD) is Microsoft's scalable platform for web acceleration, global load balancing, and content delivery, operating in nearly 200 locations worldwide – including datacenters within Azure regions, and edge sites. AFD and Azure CDN are built with platform defenses against network and application layer Distributed Denial-of-Service (DDoS) attacks. In addition, these services rely on the Azure network DDoS protection service for attacks at the network layer. You can read more about these protection mechanisms at https://learn.microsoft.com/azure/ddos-protection/ddos-protection-overview and https://learn.microsoft.com/azure/frontdoor/front-door-ddos.
Between 10:15 and 10:45 UTC, a volumetric distributed TCP SYN flood DDoS attack occurred at multiple Azure Front Door and CDN sites. This attack was automatically mitigated by the Azure Network DDoS protection service and had minimal customer impact. During this time period, our automated DDoS mitigations sent SYN authentication challenges, as is typical in the industry for mitigating DDoS attacks. As a result, a small subset of customers without retry logic in their application(s) may have experienced connection failures.
At around 11:45 UTC, as the Network DDoS protection service was disengaging and resuming default traffic routing to the Azure Front Door service, the network routes could not be updated within one specific site in Europe. This happened because of network control plane failures affecting that specific site, caused by a local power outage. Consequently, traffic inside Europe continued to be forwarded to AFD through our DDoS protection services, instead of returning directly to AFD. These control plane failures were not caused by, or related to, the initial DDoS trigger event – and in isolation, would not have caused any impact.
However, an unrelated latent network configuration issue caused traffic from outside Europe to be routed to the DDoS protection system within Europe. This led to localized congestion, which caused customers to experience high latency and connectivity failures across multiple regions. The vast majority of the impact was mitigated by 13:58 UTC, when we resolved the routing issue. As was the case during the initial period, a small subset of customers without proper retry logic in their application(s) may have experienced isolated connection failures until 19:43 UTC.
How did we respond?
- 11:45 UTC on 30 July 2024 – Impact started
- 11:47 UTC on 30 July 2024 – Our Azure Portal team detected initial service degradation and began to investigate.
- 12:10 UTC on 30 July 2024 – Our network monitoring correlated this Portal incident to an underlying network issue at one specific site in Europe, and our networking engineers engaged to support the investigation.
- 12:55 UTC on 30 July 2024 – We confirmed localized congestion, so engineers began executing our standard playbook to alleviate it – including rerouting traffic.
- 13:13 UTC on 30 July 2024 – Communications were published stating that we were investigating reports of issues connecting to Microsoft services globally, and that customers may experience timeouts connecting to Azure services.
- 13:58 UTC on 30 July 2024 – The changes to reroute traffic successfully mitigated most of the impact by this time, after which the only remaining impact was isolated connection failures.
- 16:15 UTC on 30 July 2024 – While investigating isolated connection failures, we identified a device within Europe that was not properly obeying commands from the network control plane and was continuing to attract traffic after it had been instructed to stop.
- 16:58 UTC on 30 July 2024 – We ordered the network control plane to reissue its commands, but the problematic device was not accessible as described above.
- 17:50 UTC on 30 July 2024 – We started the safe removal of the device from the network and began scanning the network for other potential issues.
- 19:32 UTC on 30 July 2024 – We completed the safe removal of the device from the network.
- 19:43 UTC on 30 July 2024 – Customer impact mitigated, as we confirmed availability returned to pre-incident levels.
How are we making incidents like this less likely or less impactful?
- We have already added the missing configuration on network devices, to ensure a DDoS mitigation issue in one geography cannot spread to another. (Completed)
- We are enhancing our existing validation and monitoring in the Azure network to detect invalid configurations. (Estimated completion: November 2024)
- We are improving our monitoring to detect cases where our DDoS protection service is unreachable from the control plane but is still serving traffic. (Estimated completion: November 2024)
How can customers make incidents like this less impactful?
- For customers of Azure Front Door/Azure CDN products, implementing retry logic in your client-side applications can help handle temporary failures when connecting to a service or network resource during mitigations of network layer DDoS attacks – see the illustrative sketch after this list. For more information, refer to our recommended error-handling design patterns: https://learn.microsoft.com/azure/well-architected/resiliency/app-design-error-handling#implement-retry-logic
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/azure/architecture/framework/resiliency
- Finally, ensure that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These alerts can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
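To illustrate the retry guidance above, here is a minimal sketch of client-side retry logic with exponential backoff and jitter, written in Python. It assumes the widely used requests library and a hypothetical endpoint served through Azure Front Door; the endpoint name, the set of retryable status codes, and the retry parameters are illustrative assumptions, not prescriptive values.

    import random
    import time

    import requests  # third-party HTTP client, assumed for this sketch

    # Hypothetical endpoint served through Azure Front Door (illustrative only).
    ENDPOINT = "https://contoso-frontend.azurefd.net/api/health"

    # Transient failures worth retrying: timeouts, throttling, and server errors.
    RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}

    def get_with_retries(url, max_attempts=5, base_delay=1.0, timeout=10):
        """Issue a GET request, retrying transient failures with exponential backoff and jitter."""
        for attempt in range(1, max_attempts + 1):
            try:
                response = requests.get(url, timeout=timeout)
                if response.status_code not in RETRYABLE_STATUS:
                    return response  # success, or a non-retryable error for the caller to handle
            except (requests.ConnectionError, requests.Timeout):
                pass  # connection resets and timeouts are treated as retryable
            if attempt == max_attempts:
                raise RuntimeError(f"Request failed after {max_attempts} attempts: {url}")
            # Exponential backoff with jitter avoids synchronized retry storms.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5))

    response = get_with_retries(ENDPOINT)
    print(response.status_code)

Many Azure SDK clients and HTTP libraries provide an equivalent policy out of the box; the key design choice is to retry only transient errors and to back off between attempts rather than retrying immediately.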
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/KTY1-HW8