October 2025
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/QNBQ-5W8
What happened?
Between 07:50 UTC and 16:00 UTC on 09 October 2025, Microsoft services and Azure customers leveraging Azure Front Door (AFD) and Azure Content Delivery Network (CDN) may have experienced increased latency and/or timeouts – primarily across Africa and Europe, as well as Asia Pacific and the Middle East. This impacted the availability of the Azure Portal as well as other management portals across Microsoft.
Peak failure rates for AFD reached approximately 17% in Africa, 6% in Europe, and 2.7% in Asia Pacific and the Middle East. Availability was restored by 12:50 UTC, though some customers continued to experience elevated latency. Latency returned to baseline levels by 16:00 UTC, at which point the incident was mitigated.
What do we know so far?
AFD routes traffic using globally distributed edge sites and supports Microsoft services including the management portals. The AFD control plane generates system metadata that the data plane consumes for customer-initiated ‘create’, ‘update’, or ‘delete’ operations on AFD or CDN profiles. One of the trigger conditions for this incident was a software defect in the latest version of the AFD control plane, which had been rolled out six weeks prior to the incident in line with our safe deployment practices.
Newly created customer tenant profiles were being onboarded to the newer control plane version. Our service monitoring detected elevated data plane crashes due to a previously unknown bug, triggered by erroneous metadata generated by a particular sequence of profile update operations. Our automated protection layer intercepted this in the early update stages and prevented the metadata from propagating any further to the data plane, thereby averting customer impact at that time. In addition, because the newer control plane was running in tandem with the previous version of the control plane, we stopped the new control plane from taking any requests.
On 09 October 2025, we initiated a cleanup of the affected tenant configuration containing the erroneous metadata. Since the automated protection system was blocking the impacted customer tenant profile updates in the initial stage, we temporarily bypassed it to allow the cleanup of the tenant configuration to proceed. With the protection system bypassed, the erroneous metadata was inadvertently able to propagate to later stages and triggered the bug in the data plane, crashing the data plane service. This disrupted a significant number of edge sites across Europe and Africa; approximately 26% of AFD data plane infrastructure resources in these regions were impacted.
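To illustrate the failure mode, the sketch below models a staged configuration pipeline with a validation gate between the control plane and the data plane. It is a hypothetical, simplified illustration only: the function and field names (for example, `is_valid`, `propagate_to_data_plane`, the `routes` field) are not AFD internals, and it exists solely to show why bypassing an early-stage protection layer can let known-bad metadata reach the data plane.

```python
# Hypothetical illustration only - stage and function names are not AFD internals.
# Models a staged configuration pipeline in which a validation gate sits between
# the control plane and the data plane, and shows how a bypass flag lets
# known-bad metadata propagate.

from dataclasses import dataclass


@dataclass
class ProfileMetadata:
    tenant_id: str
    payload: dict


def is_valid(metadata: ProfileMetadata) -> bool:
    # Stand-in for the automated protection layer: reject metadata that the
    # data plane is known to mishandle (here, a missing required field).
    return "routes" in metadata.payload


def propagate_to_data_plane(metadata: ProfileMetadata) -> None:
    # Stand-in for the data plane consuming the metadata; erroneous metadata
    # triggers the crash described above.
    if "routes" not in metadata.payload:
        raise RuntimeError("data plane crash: malformed metadata")
    print(f"applied profile for tenant {metadata.tenant_id}")


def apply_update(metadata: ProfileMetadata, bypass_protection: bool = False) -> None:
    if not bypass_protection and not is_valid(metadata):
        print(f"blocked bad metadata for tenant {metadata.tenant_id} at an early stage")
        return
    propagate_to_data_plane(metadata)


# Normal path: the gate blocks the bad update before it reaches the data plane.
apply_update(ProfileMetadata("contoso", {}))

# Bypassed path: the same bad update now propagates and triggers the failure.
try:
    apply_update(ProfileMetadata("contoso", {}), bypass_protection=True)
except RuntimeError as err:
    print(f"incident: {err}")
```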
As part of AFD's mechanisms to manage traffic, load was automatically distributed to nearby edge sites (including in Asia Pacific and the Middle East). Additionally, as regional business-hours traffic started ramping up, it added to the overall traffic load. The increased volume of traffic on the remaining healthy edge sites resulted in high resource utilization, which exceeded operational thresholds. This triggered an additional layer of protection, which began distributing traffic to a broader set of edge sites globally to reduce further impact. Recovery required a combination of automated restarts, manual intervention where automated restarts were taking too long, and traffic failover operations for the impacted management portals. Full mitigation was achieved once edge site infrastructure resources stabilized and latency returned to normal.
Additionally, initial customer notifications were delayed, primarily due to challenges in determining the scope of impact while attempting to target communications to those affected. We have automated communications to notify customers of incidents quickly; unfortunately, this capability was not yet supported for this incident scenario.
How did we respond?
- 07:30 UTC on 09 October 2025 – The cleanup operation was initiated.
- 07:50 UTC on 09 October 2025 – Initial customer impact began, and increased over the next 90 minutes.
- 08:13 UTC on 09 October 2025 – Our telemetry detected resource availability loss across multiple AFD edge sites. We began investigating as impact continued to grow.
- 09:04 UTC on 09 October 2025 – We identified that the crashes were caused by the data plane bug described above.
- 09:08 UTC on 09 October 2025 – Automated restarts began for our AFD infrastructure resources, and manual intervention began for resources that did not recover automatically.
- 09:15 UTC on 09 October 2025 – Customer impact reached its peak.
- 10:01 UTC on 09 October 2025 – Communications were published to the Azure Status page.
- 10:45 UTC on 09 October 2025 – Targeted customer communications were sent via Azure Service Health.
- 11:59 UTC on 09 October 2025 – Management portals, including the Azure Portal, performed failover operations (including using scripts to update load balancing configuration, to split traffic between multiple routes), helping restore their service availability.
- 12:50 UTC on 09 October 2025 – Availability for AFD fully recovered; however, a subset of customers may still have experienced elevated latency.
- 16:00 UTC on 09 October 2025 – After continuous monitoring confirmed that latency had returned to baseline, we declared the incident mitigated.
How are we making incidents like this less likely or less impactful?
- We have hardened our standard operating procedures, to ensure that the configuration protection system is not bypassed for any operation. (Completed)
- We have fixed the control plane defect which generated the erroneous tenant metadata that led to the data plane resource crashes. (Completed)
- We have fixed the bug in the data plane. (Completed)
- We will expand the automated customer alerts sent via Azure Service Health, to include similar classes of service degradation. (Estimated completion: November 2025)
- We are making improvements to the Azure Portal's failover away from AFD, to make it more robust and automated. (Estimated completion: December 2025)
- We are building additional runtime configuration validation pipelines that test configurations against a replica of the real-time data plane, as a pre-validation step prior to applying them broadly. (Estimated completion: March 2026)
- We are improving data plane resource instance recovery time, following any impact to the data plane. (Estimated completion: March 2026)
How can customers make incidents like this less impactful?
- Consider implementing failover strategies with Azure Traffic Manager, to fail over from Azure Front Door to your origins (see the failover sketch after this list): https://learn.microsoft.com/azure/architecture/guide/networking/global-web-applications/overview
- Consider reviewing our best practices for Azure Front Door architecture: https://learn.microsoft.com/azure/well-architected/service-guides/azure-front-door
- Consider implementing retry patterns with exponential backoff, to improve workload resiliency (see the retry sketch after this list): https://learn.microsoft.com/azure/architecture/patterns/retry
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
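To complement the Traffic Manager guidance above, here is a minimal conceptual sketch of priority-based failover between a Front Door endpoint and an origin. It only emulates, in application code, what Traffic Manager's health probes and priority routing do at the DNS layer; the endpoint URLs (`contoso.azurefd.net`, `origin.contoso.com`) and health paths are placeholders, not values from this incident.

```python
# Conceptual failover sketch: pick the highest-priority healthy endpoint, as
# Traffic Manager priority routing would. Endpoint URLs are placeholders.
import urllib.error
import urllib.request

# Ordered by priority: Azure Front Door first, the origin (or a secondary path) second.
ENDPOINTS = [
    "https://contoso.azurefd.net/health",  # placeholder AFD endpoint
    "https://origin.contoso.com/health",   # placeholder origin endpoint
]


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers its health probe."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, TimeoutError):
        return False


def select_endpoint() -> str:
    """Return the highest-priority endpoint that is currently healthy."""
    for url in ENDPOINTS:
        if probe(url):
            return url
    raise RuntimeError("no healthy endpoint available")
```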
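For the retry guidance above, here is a minimal sketch of retries with exponential backoff and jitter, using only the Python standard library. The URL in the usage comment is a placeholder, and the number of attempts, base delay, and which errors are treated as retryable should be tuned to your workload.

```python
# Minimal sketch of the retry pattern with exponential backoff and jitter.
import random
import time
import urllib.error
import urllib.request


def get_with_retries(url: str, max_attempts: int = 5, base_delay: float = 0.5) -> bytes:
    """Fetch a URL, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Client errors (4xx) are not transient - do not retry them.
            if err.code < 500 or attempt == max_attempts:
                raise
        except (urllib.error.URLError, TimeoutError):
            # Network errors and timeouts are treated as transient.
            if attempt == max_attempts:
                raise
        # Exponential backoff with jitter (0.5s, 1s, 2s, ... plus noise), so
        # many clients do not retry in lockstep against a recovering service.
        delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.25)
        time.sleep(delay)


# Example usage (placeholder endpoint):
# body = get_with_retries("https://contoso.azurefd.net/api/health")
```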
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: http://aka.ms/AzPIR/QNBQ-5W8