
29 October 2025

Watch our 'Azure Incident Retrospective' video about this incident: 

What happened?

Between 15:41 UTC on 29 October and 00:05 UTC on 30 October 2025, customers and Microsoft services leveraging Azure Front Door (AFD) and Azure Content Delivery Network (CDN) experienced connection timeout errors and Domain Name System (DNS) resolution issues. From 18:30 UTC on 29 October 2025, as the system recovered gradually, some customers started to see availability improve – albeit with increased latency – until the system fully stabilized by 00:05 UTC on 30 October 2025.

Affected Azure services included, but were not limited to: Azure Active Directory B2C, Azure AI Video Indexer, Azure App Service, Azure Communication Services, Azure Databricks, Azure Healthcare APIs, Azure Maps, Azure Marketplace, Azure Media Services, Azure Portal, Azure Sphere Security Service, Azure SQL Database, and Azure Static Web Apps.

Other Microsoft services were also impacted, including Microsoft 365 (see: MO1181369), the Microsoft Communication Registry website, Microsoft Copilot for Security, Microsoft Defender (External Attack Surface Management), Microsoft Dragon Copilot, Microsoft Dynamics 365 and Power Platform (see: MX1181378), Microsoft Entra ID (Mobility Management Policy Service, Identity & Access Management, and User Management), Microsoft Purview, Microsoft Sentinel (Threat Intelligence), Visual Studio App Center, and customers’ ability to open support cases (both in the Azure Portal and by phone).

What went wrong and why?

Azure Front Door (AFD) and Azure Content Delivery Network (CDN) route traffic using globally distributed edge sites, supporting customers as well as Microsoft services including various management portals. The AFD control plane generates customer configuration metadata that the data plane consumes for all customer-initiated operations on the AFD platform, including cache purges and Web Application Firewall (WAF) changes. Since customer applications hosted on AFD and CDN can be accessed by their end users from anywhere in the world, these changes are deployed globally across all AFD edge sites to provide a consistent user experience.
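
As a rough illustration of this control plane / data plane split, the sketch below (in Python, with entirely hypothetical type and function names, not the actual AFD implementation) shows the shape of the contract: the control plane compiles a customer change into metadata, and every edge site's data plane loads the same artifact.

```python
from dataclasses import dataclass
import json

# Hypothetical sketch of the control plane / data plane contract described above.
# None of these types or functions are the real AFD implementation.

@dataclass
class CustomerConfig:
    """A customer-initiated change (routes, WAF rules, a purge request, etc.)."""
    customer_id: str
    settings: dict

@dataclass
class ConfigMetadata:
    """The serialized artifact the control plane produces and every edge site consumes."""
    customer_id: str
    version: int
    payload: bytes

def compile_metadata(config: CustomerConfig, version: int) -> ConfigMetadata:
    """Control plane: turn a customer change into deployable metadata."""
    return ConfigMetadata(config.customer_id, version, json.dumps(config.settings).encode())

class EdgeSite:
    """Data plane: each globally distributed edge site loads the same metadata."""
    def __init__(self, name: str):
        self.name = name
        self.loaded: dict[str, ConfigMetadata] = {}

    def apply(self, metadata: ConfigMetadata) -> None:
        # In the real system this feeds routing, WAF evaluation, purge handling, etc.
        self.loaded[metadata.customer_id] = metadata

def deploy_globally(metadata: ConfigMetadata, edge_sites: list[EdgeSite]) -> None:
    """Every edge site receives the same metadata, for a consistent end-user experience."""
    for site in edge_sites:
        site.apply(metadata)
```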

A specific sequence of customer configuration changes, performed across two different control plane build versions, resulted in incompatible customer configuration metadata being generated. These customer configuration changes themselves were valid and non-malicious – however they produced metadata that, when deployed to edge site servers, exposed a latent bug in the data plane. This incompatibility triggered a crash during asynchronous processing within the data plane service. This defect escaped detection due to a gap in our pre-production validation, since not all features are validated across different control plane build versions.

Azure Front Door employs multiple deployment stages and a configuration protection system to ensure safe propagation of customer configurations. This system validates configurations at each deployment stage and advances only after receiving positive health signals from the data plane. Once deployments are rolled out successfully, the configuration propagation system also updates a ‘Last Known Good’ (LKG) snapshot (a periodic snapshot of healthy customer configurations) so that deployments can be automatically rolled back in case of any issues. The configuration protection system waits for approximately a minute between each stage, completing on average within 5-10 minutes globally.
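
A minimal sketch of that staged protection flow, assuming hypothetical stage names and timings (illustrative only, not AFD's actual code):

```python
import time

# Illustrative sketch only (not the actual AFD code) of the staged configuration
# protection flow described above: deploy stage by stage, wait briefly, check data
# plane health, and only refresh the 'Last Known Good' (LKG) snapshot once every
# stage has reported healthy.

STAGES = ["pre-production", "canary", "region-group-1", "global"]   # hypothetical stage names
STAGE_WAIT_SECONDS = 60   # "approximately a minute between each stage"

def deploy_to_stage(metadata: dict, stage: str) -> None:
    """Placeholder: push the metadata to the edge sites in this stage."""
    print(f"deploying to {stage}")

def data_plane_healthy(stage: str) -> bool:
    """Placeholder: the positive health signal the data plane reports back."""
    return True

def propagate(metadata: dict, lkg_snapshot: dict) -> dict:
    """Return whichever configuration ends up as the new LKG snapshot."""
    for stage in STAGES:
        deploy_to_stage(metadata, stage)
        time.sleep(STAGE_WAIT_SECONDS)                # bake time before evaluating health
        if not data_plane_healthy(stage):
            deploy_to_stage(lkg_snapshot, stage)      # automatic rollback path
            return lkg_snapshot
    # Every stage looked healthy, so the LKG snapshot itself is refreshed.
    # During this incident, that is the step which captured the incompatible metadata.
    return metadata
```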

During this incident, the incompatible customer configuration change was made at 15:35 UTC, and was applied to the data plane in a pre-production stage at 15:36 UTC. Our configuration propagation monitoring continued to receive healthy signals; although the problematic metadata was present, it had not yet caused any issues. Because the data plane crash surfaced asynchronously, after approximately five minutes, the configuration passed through the protection safeguards and propagated to later stages. This configuration (with the incompatible metadata) completed propagation to a majority of edge sites by 15:39 UTC. Since the incompatible customer configuration metadata had been deployed successfully to the majority of the fleet with positive health signals, the LKG was also updated with this configuration.
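
The crux is timing: because the heavy processing of the new metadata happened asynchronously, the health gate sampled a healthy signal before the defect could surface. A minimal sketch of that gap, with hypothetical names and deliberately shortened timings (the real delay was roughly five minutes):

```python
import threading
import time

# Minimal, hypothetical sketch of the timing gap described above: the edge server
# accepts new metadata and immediately looks healthy, but the expensive processing
# runs later on a background task. If that task hits a latent bug, the failure
# appears only after the health gate has already passed the configuration onward.

ASYNC_DELAY_SECONDS = 5   # shortened for the sketch; roughly five minutes in the incident

class EdgeServer:
    def __init__(self) -> None:
        self.healthy = True

    def accept(self, metadata: dict) -> None:
        """Synchronous path: quick acceptance only, then report success."""
        threading.Timer(ASYNC_DELAY_SECONDS, self._process_async, args=(metadata,)).start()

    def _process_async(self, metadata: dict) -> None:
        """Asynchronous path: where the latent defect actually fires."""
        if metadata.get("schema") == "incompatible":   # stand-in for the incompatible metadata
            self.healthy = False                       # in the real system, the process crashed

server = EdgeServer()
server.accept({"schema": "incompatible"})
print(server.healthy)              # True: the protection system sees a healthy data plane...
time.sleep(ASYNC_DELAY_SECONDS + 1)
print(server.healthy)              # ...and the crash only surfaces after propagation continued
```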

The data plane impact began in phases, starting with our pre-production environment at 15:41 UTC, and replicated across all edge sites globally by 15:45 UTC. As the data plane impact started, the configuration protection system detected it and stopped all new and in-flight customer configuration changes from being propagated at 15:43 UTC. The incompatible customer configuration was processed by edge servers, causing crashes across our various edge sites. This also impacted AFD’s internal DNS service, hosted on the edge sites of Azure Front Door, resulting in intermittent DNS resolution errors for a subset of AFD customer requests. This sequence of events was the trigger for the global impact on the AFD platform.

This AFD incident on 29 October was not directly related to the previous AFD incident on 9 October. Both incidents were broadly related to configuration propagation risk (inherent to a global Content Delivery Network, in which route/WAF/origin changes must be quickly deployed worldwide), but while the failure modes were similar, the underlying defects were different. Azure Front Door’s configuration protection system is designed to validate configurations and proceed only after receiving positive health signals from the data plane. During the AFD incident on 9 October (Tracking ID: QNBQ-5W8), that protection system worked as intended but was later bypassed by our engineering team during a manual cleanup operation. During this AFD incident on 29 October (Tracking ID: YKYN-BWZ), the incompatible customer configuration metadata progressed through the protection system before the delayed asynchronous processing task resulted in the crash. Some of the learnings and repair items from the earlier incident are applicable to this incident as well, and are included in the list of repairs below.

How did we respond?

The issue started at 15:41 UTC and was detected by our monitoring at 15:48 UTC, prompting our investigation. Earlier, at 15:43 UTC, the configuration protection system had activated in response to widespread data plane issues, automatically blocking all new and in-flight configuration changes from being deployed worldwide.

Since the latest ‘last known good’ (LKG) version had been updated with the incompatible metadata, we chose not to revert to it. To ensure system stability, we decided not to roll back to prior versions of the LKG either. Instead, we opted to edit the latest LKG by removing the problematic customer configurations manually. We also opted to block all customer configuration changes from propagating to the data plane at 17:30 UTC so that, as we mitigated, we would not reintroduce the issue. At 17:40 UTC we began deploying the updated LKG configuration across the global fleet. Recovery required reloading all customer configurations at every edge site and rebalancing traffic gradually, to avoid overload conditions as edge sites returned to service. This deliberate, phased recovery was necessary to stabilize the system while restoring scale and ensuring no recurrence of the issue.
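
The gradual traffic rebalancing described above is essentially a capacity-aware ramp: load is returned to recovered edge sites in bounded steps rather than all at once. A rough, hypothetical sketch of that idea (real AFD traffic management is far more sophisticated, and the site names and numbers below are illustrative only):

```python
# Rough, hypothetical sketch of the phased traffic rebalancing described above.
# It only illustrates the idea of returning load in bounded steps so that
# recovering edge sites are not immediately overwhelmed.

def rebalance(healthy_sites: list[str], total_rps: float, ramp_steps: int = 10):
    """Yield per-site traffic targets, increasing the share in small increments."""
    if not healthy_sites:
        return
    for step in range(1, ramp_steps + 1):
        fraction = step / ramp_steps              # 10%, 20%, ... of normal load
        per_site = (total_rps * fraction) / len(healthy_sites)
        yield {site: per_site for site in healthy_sites}
        # In practice you would wait for health and latency signals between steps
        # before shifting more traffic back.

for targets in rebalance(["ams", "iad", "sin"], total_rps=300_000):
    print(targets)
```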

Many downstream services that use AFD were able to fail over to prevent further customer impact, including the Microsoft Entra and Intune portals, and Azure Active Directory B2C. In a more complex example, the Azure Portal leveraged its standard recovery process to successfully transition away from AFD during the incident. Users of the Portal would have seen limited impact during this failover process and then been able to use the Portal without issue. Unfortunately, some services within the Portal did not have an established fallback strategy, so parts of the Portal experience continued to see failures even after Portal recovery (for example, Marketplace).

Timeline of major incident milestones:

  • 15:35 UTC on 29 October 2025 – Incompatible customer configuration metadata first introduced, as described above.
  • 15:41 UTC on 29 October 2025 – Customer impact began, triggered by the resulting crashes.
  • 15:43 UTC on 29 October 2025 – Configuration protection system activated in response to issues.
  • 15:48 UTC on 29 October 2025 – Investigation commenced following monitoring alerts being triggered.
  • 16:15 UTC on 29 October 2025 – Investigation focus shifted to examining AFD configuration changes.
  • 16:18 UTC on 29 October 2025 – Initial communication posted to our public status page.
  • 16:20 UTC on 29 October 2025 – Targeted communications sent to impacted customers via Azure Service Health.
  • 17:10 UTC on 29 October 2025 – Began manually updating the ‘last known good’ (LKG) configuration to remove the problematic configurations.
  • 17:26 UTC on 29 October 2025 – Azure Portal failed away from Azure Front Door.
  • 17:30 UTC on 29 October 2025 – Blocked all customer configuration propagation to the data plane, in preparation for deploying the new configuration.
  • 17:40 UTC on 29 October 2025 – Initiated the deployment of our updated ‘last known good’ configuration.
  • 17:50 UTC on 29 October 2025 – Last known good configuration available to all edge sites, which began gradually reloading the LKG configuration.
  • 18:30 UTC on 29 October 2025 – AFD DNS servers recovered, allowing us to rebalance traffic manually to a small number of healthy edge sites. Customers began seeing improvements in availability.
  • 20:20 UTC on 29 October 2025 – As a sufficient number of edge sites had recovered, we switched to automatic traffic management, and customers continued to see availability improve.
  • 00:05 UTC on 30 October 2025 – AFD impact confirmed mitigated for customers, as availability and latency had returned to pre-incident levels.

Post mitigation, we temporarily blocked all AFD customer configuration changes at the Azure Resource Manager (ARM) level to ensure the safety of the data plane. We also implemented additional safeguards including (i) fixing the control plane and data plane defects, (ii) removing asynchronous processing from the data plane, (iii) introducing an additional ‘pre-canary’ stage to test customer configurations, (iv) extending the bake time during each stage of the configuration propagation, and (v) improving data plane recovery time from approximately 4.5 hours to approximately one hour. We began draining the customer configuration queue from 2 November 2025. Once these safeguards were fully implemented, this restriction was removed on 5 November 2025.
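
Conceptually, the pipeline change amounts to adding a stage at the front and lengthening the bake window at every stage. The stage names and durations below are hypothetical, purely to illustrate why propagation now takes longer:

```python
# Hypothetical illustration only: the actual AFD stage names and bake times are not
# published. The point is the shape of the change: a new 'pre-canary' stage at the front
# of the pipeline, plus longer bake time at every stage before a configuration advances.

PIPELINE_BEFORE = [            # (stage name, bake time in seconds)
    ("pre-production", 60),
    ("canary",         60),
    ("global",         60),
]

PIPELINE_AFTER = [
    ("pre-canary",     300),   # new stage: customer configurations are exercised first
    ("pre-production", 300),
    ("canary",         600),
    ("global",         600),
]

def total_propagation_minutes(pipeline) -> float:
    """Lower bound on propagation time implied by the bake windows alone."""
    return sum(bake for _, bake in pipeline) / 60

print(total_propagation_minutes(PIPELINE_BEFORE))   # short, as before the incident
print(total_propagation_minutes(PIPELINE_AFTER))    # noticeably longer, hence the temporary delays
```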

The introduction of new stages in the configuration propagation pipeline, coupled with additional ‘bake time’ between stages, has resulted in an increase in configuration propagation time for all operations on the AFD platform, including create, update, and delete operations, WAF changes, and cache purges. We continue to work on platform enhancements to ensure a robust configuration delivery pipeline and further reduce the propagation time. For more details on these temporary propagation delays, refer to .

How are we making incidents like this less likely or less impactful?

To prevent issues like this, and improve deployment safety...

  • We have fixed both the original control plane incompatibility and the data plane bug described above. (Completed)
  • We are now enforcing complete synchronous processing of each customer configuration before advancing to production stages. (Completed)
  • We have implemented additional stages in our phased configuration rollout, including extended bake time to help detect configuration-related issues. (Completed)
  • We are decoupling configuration processing on data plane servers from active traffic-serving instances, moving it to isolated worker processes, thereby removing the risk of a configuration defect impacting data plane traffic serving (illustrated in the sketch after this list). (Estimated completion: January 2026)
  • Once our pre-validation of this configuration pipeline is in place, we will work towards reducing the propagation time from 45 minutes to approximately 15 minutes. (Estimated completion: January 2026)
  • We are enhancing our testing and validation framework to ensure backwards compatibility with configurations generated across previous versions of the control plane build. (Estimated completion: February 2026)
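
As a sketch of the worker-process isolation item above (hypothetical code, not the actual AFD data plane): configuration metadata is parsed in a child process, so even a crash in that parser leaves the traffic-serving process running on its previously loaded configuration.

```python
import multiprocessing as mp

# Hypothetical sketch of the 'isolated worker process' repair item above: parse new
# configuration metadata in a child process, so that a crash there cannot take down
# the process that is actively serving traffic.

def parse_config(payload: bytes, results) -> None:
    """Runs in a separate process; a defect here only kills this worker."""
    if payload == b"incompatible":
        raise RuntimeError("latent parsing bug")      # simulated configuration defect
    results.put({"parsed": payload.decode()})

def load_config_safely(payload: bytes):
    """Called from the traffic-serving process; survives worker crashes."""
    results = mp.Queue()
    worker = mp.Process(target=parse_config, args=(payload, results))
    worker.start()
    worker.join(timeout=30)
    if worker.is_alive():
        worker.terminate()                            # hung worker: give up on this update
        return None
    if worker.exitcode != 0 or results.empty():
        return None                                   # keep serving the previously loaded config
    return results.get()

if __name__ == "__main__":
    print(load_config_safely(b"incompatible"))        # None: defect isolated, serving continues
    print(load_config_safely(b"routes-v2"))           # {'parsed': 'routes-v2'}
```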

To reduce the blast radius of potential future issues...

  • We have migrated critical first-party infrastructure (including the Azure Portal, Azure Communication Services, Marketplace, the Linux Software Repository for Microsoft Products, and support ticket creation) to an active-active solution with the ability to fail away from AFD. (Completed)
  • In the longer term, we are enhancing our customer configuration and traffic isolation by using ‘micro cell’ segmentation of the AFD data plane, to ensure that a single customer’s traffic or configuration issue cannot impact other customers (see the sketch after this list). (Estimated completion: June 2026)
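
The ‘micro cell’ item above follows the general pattern of cell-based isolation. The sketch below is entirely hypothetical (the cell count and mapping are illustrative only): each customer is pinned to one small cell of data plane capacity, so a defective configuration can only affect the customers sharing that cell.

```python
import hashlib

# Entirely hypothetical sketch of cell-based isolation: pin each customer to one small
# 'micro cell' of data plane capacity, so that a defective configuration can only affect
# the customers sharing that cell rather than the whole global fleet.

NUM_CELLS = 64   # hypothetical cell count

def cell_for_customer(customer_id: str) -> int:
    """Deterministically map a customer to a single micro cell."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_CELLS

def blast_radius(bad_customer: str, all_customers: list[str]) -> list[str]:
    """Only customers in the same cell share fate with a bad configuration."""
    bad_cell = cell_for_customer(bad_customer)
    return [c for c in all_customers if cell_for_customer(c) == bad_cell]

print(cell_for_customer("contoso"))   # a stable index in range(NUM_CELLS)
print(blast_radius("contoso", ["contoso", "fabrikam", "adventureworks"]))
```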

To be able to recover more quickly from issues...

  • We have made changes to accelerate our data plane recovery time by leveraging local customer configuration caching more effectively, restoring customer configurations within one hour (see the sketch after this list). (Completed)
  • We are making investments to reduce data plane recovery time further, to restore customer configurations within approximately 10 minutes. (Estimated completion: March 2026)
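
The recovery-time improvements above rely on edge sites reusing a locally cached copy of customer configurations instead of repopulating everything from a central source on restart. A simplified, hypothetical sketch of that idea (paths and function names are illustrative):

```python
import json
import os

# Simplified, hypothetical sketch of the local configuration cache idea above: an edge
# site persists the configurations it last applied, so that on restart it can reload
# from the local copy quickly instead of repopulating everything from a central store.

CACHE_PATH = "/var/cache/edge/customer_configs.json"   # hypothetical location

def save_cache(configs: dict) -> None:
    """Persist the currently applied customer configurations after each successful load."""
    os.makedirs(os.path.dirname(CACHE_PATH), exist_ok=True)
    with open(CACHE_PATH, "w") as f:
        json.dump(configs, f)

def load_on_restart(fetch_from_central) -> dict:
    """Prefer the local cache for fast recovery; fall back to a full central fetch."""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            return json.load(f)        # fast path: minutes rather than hours
    return fetch_from_central()        # slow path: full repopulation from the central store
```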

To improve our communications and support...

  • We have addressed delays in delivering alerts via Azure Service Health to impacted customers, by making immediate improvements to increase resource thresholds. (Completed)
  • We will expand the automated customer alerts sent via Azure Service Health, to include similar classes of service degradation – to notify impacted customers more quickly. (Estimated completion: November 2025)
  • Finally, we will resolve the technical and staffing challenges that prevented Premier and Unified customers from being able to create a support request, by ensuring that we can fail over to our backup systems more quickly. (Estimated completion: November 2025)

How can customers make incidents like this less impactful? 

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: