Azure status history

This page contains Post Incident Reviews (PIRs) of previous service issues, each retained for 5 years. From November 20, 2019, this included PIRs for all issues about which we communicated publicly. From June 1, 2022, this includes PIRs for broad issues as described in our documentation.

December 2025

22

This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details.

What happened?

Between 08:05 UTC and 18:30 UTC on 22 December 2025, the Microsoft Entra Privileged Identity Management (PIM) service experienced issues that impacted role activations for a subset of customers. Some requests returned server errors (5xx) and, for a portion of customers, some requests returned unexpected client errors (4xx). This issue manifested as API failures, elevated latency, and various activation errors.

Impacted actions could have included:

Reading eligible or active role assignments
Role assignment management
PIM Policy and PIM Alerts management
Activation of role assignments
Deactivation of role assignments initiated by customer (deactivations triggered by the expiration of previous activation were not impacted)
Approval of role assignment activation or extension

Customer initiated operations from various Microsoft management portals, mobile app, and API calls were likely impacted. During this incident period, operations that were retried may have succeeded.

What went wrong and why?

Under normal conditions, Privileged Identity Management processes requests (including role activations) by coordinating a series of requests between the service’s front‑end APIs, the routing layer that directs traffic to healthy endpoints, and the database that stores role and activation information. When everything is healthy, these components work together so that requests are quickly routed, checks are completed, and role activations go through without delays. The system keeps a small set of active connections ready so it can respond quickly, and it automatically retries brief interruptions to keep error rates low. This end‑to‑end flow helps ensure customers experience fast, reliable access when reading, managing, or activating role assignments.

In the days leading up to the incident, a change was introduced that inadvertently increased the overall load on the service and contributed to the conditions that led to customer impact. A configuration update increased how much additional work the service does for every request, which shifted how traffic flowed through the system – and created higher pressure on the database itself, setting the stage for the failure pattern observed during the incident.

This update placed more load on the service’s underlying systems than they were designed to handle. Because each customer action now required the service to perform additional work, demand on the database grew over time beyond what it could sustain. Once database connections were exhausted, the service struggled to process new requests, resulting in timeouts, delays, and errors when customers tried to view or activate privileged roles.
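As a rough illustration of how extra work per request can exhaust database connections even when traffic volume is unchanged, the sketch below applies Little's law (connections in use = database calls per second × time each call holds a connection). The function name, pool size, and all figures are hypothetical and are not taken from the PIM service.

```python
import math


def connections_needed(requests_per_sec: int,
                       db_calls_per_request: int,
                       ms_per_db_call: int) -> int:
    """Average number of database connections in use at steady state.

    Little's law: items in the system = arrival rate x time in the system.
    Integer milliseconds are used to keep the arithmetic exact.
    """
    busy_ms_per_sec = requests_per_sec * db_calls_per_request * ms_per_db_call
    return math.ceil(busy_ms_per_sec / 1000)


# Hypothetical figures, chosen only to illustrate the effect.
POOL_SIZE = 100
before = connections_needed(requests_per_sec=400, db_calls_per_request=2, ms_per_db_call=50)
after = connections_needed(requests_per_sec=400, db_calls_per_request=6, ms_per_db_call=50)

print(f"before change: {before} of {POOL_SIZE} connections")
print(f"after change:  {after} of {POOL_SIZE} connections")
```

With these assumed numbers the request rate never changes, yet tripling the database calls per request moves steady-state demand from well inside the pool to beyond it.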

How did we respond?

08:05 UTC – Initial customer impact began.
08:26 UTC – Alerts received for intermittent, low volume of errors.
10:30 UTC – We attempted isolated restarts on impacted instances, to help mitigate low-level impact.
13:03 UTC – Automated monitoring alerted us to elevated error rates, and full incident response was initiated.
13:22 UTC – We identified that calls to the database were intermittently timing out. Traffic volume appeared to be normal with no significant surge detected.
13:54 UTC – Mitigation efforts began, including scaling out the impacted environment.
15:05 UTC – Scale-out efforts decreased error rates but did not completely eliminate failures. Further instance restarts provided temporary relief.
15:25 UTC – Scaling efforts continued. We engaged our database engineering team to help investigate.
16:37 UTC – Rollback of the problematic configuration change was initiated.
17:20 UTC – Scale-out efforts completed.
17:45 UTC – Service availability telemetry was showing improvements. Some customers began to report recovery.
18:30 UTC – Customer impact confirmed as mitigated, after rollback of configuration change had completed and error rates had returned to normal levels.

How are we making incidents like this less likely or less impactful?

We have rolled back the problematic configuration change across all regions. (Completed)
We are working to ensure that future configuration changes will not inadvertently introduce excessive load, by reviewing health signals to help identify issues earlier in the deployment process. (Estimated completion: January 2026)
We are enriching our monitoring to provide more detail to on-call engineers surrounding low-volume errors like those detected early in the incident, to be able to mitigate issues more quickly. (Estimated completion: February 2026)

How can customers make incidents like this less impactful?

Consider retrying critical operations directly through the API, for example, when requesting just in time access: https://learn.microsoft.com/entra/id-governance/privileged-identity-management/pim-apis#iteration-3-current--pim-for-microsoft-entra-roles-groups-in-microsoft-graph-api-and-for-azure-resources-in-azure-resource-manager-api
More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/FV31-PQG

8

What happened?

Between 11:04 and 14:13 EST on 08 December 2025, customers using any of the Azure Government regions may have experienced failures when attempting to perform service management operations through Azure Resource Manager (ARM). This included operations attempted through the Azure Portal, Azure REST APIs, Azure PowerShell, and Azure CLI.

Affected services included but were not limited to: Azure App Service, Azure Backup, Azure Communication Services, Azure Data Factory, Azure Databricks, Azure Functions, Azure Kubernetes Service, Azure Maps, Azure Migrate, Azure NetApp Files, Azure OpenAI Service, Azure Policy (including Machine Configuration), Azure Resource Manager, Azure Search, Azure Service Bus, Azure Site Recovery, Azure Storage, Azure Virtual Desktop, Microsoft Fabric, and Microsoft Power Platform (including AI Builder and Power Automate).

What went wrong and why?

Azure Resource Manager (ARM) is the gateway for management operations for Azure services. ARM does authorization for these operations based on authorization policies, stored in Cosmos DB accounts that are replicated to all regions. On 08 December 2025, an inadvertent automated key rotation resulted in ARM failures to fetch authorization policies that are needed to evaluate access. As a result, ARM was temporarily unable to communicate with underlying storage resources, causing failures in service-to-service communication and affecting resource management workflows across multiple Azure services. This issue surfaced as authentication failures and 500 Internal Server errors to customers across all clients. Because the content of the Cosmos DB accounts for authorization policies is replicated globally, all regions within the Azure Government cloud were affected.

Microsoft services use an internal system to manage keys and secrets, which also makes it easy to perform regularly needed maintenance activities, such as rotating secrets. Protecting identities and secrets is a key pillar in our Secure Future Initiative to reduce risk, enhance operational maturity, and proactively prepare for emerging threats to identity infrastructure – by prioritizing secure authentication and robust key management. In this case, our ARM service was using a key in a ‘manual mode’, which means that any key rotations would need to be manually coordinated, so that traffic could be moved to use a different key before the key could be regenerated. The Cosmos DB accounts that ARM uses for accessing authorization policies were intentionally onboarded to the Microsoft internal service which governs the account key lifecycle, but were unintentionally configured with the option to automatically rotate the keys enabled. This automated rotation should have been disabled as part of the onboarding process, until such time as it was ready to be fully automated.
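The mismatch described above (a key consumed in manual mode but registered for automatic rotation) is the kind of condition an inventory audit can flag. The following is a minimal sketch under assumed data: the `KeyRecord` fields, mode names, and key names are hypothetical and do not reflect the internal secret management system.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class KeyRecord:
    """Hypothetical inventory entry for one account key."""
    name: str
    consumer_mode: str  # "manual": rotations must be coordinated by hand
    auto_rotate: bool   # True if the secret manager rotates on a timer


def find_misconfigured(keys: Iterable[KeyRecord]) -> List[str]:
    """Return names of keys used in manual mode that are set to auto-rotate."""
    return [k.name for k in keys
            if k.consumer_mode == "manual" and k.auto_rotate]


# Hypothetical inventory, for illustration only.
inventory = [
    KeyRecord("policy-store-a", consumer_mode="manual", auto_rotate=True),
    KeyRecord("policy-store-b", consumer_mode="manual", auto_rotate=False),
    KeyRecord("telemetry-store", consumer_mode="automated", auto_rotate=True),
]

print(find_misconfigured(inventory))
```

Only the first entry is reported, since it is the only key whose consumer cannot follow a rotation that the timer would nevertheless perform.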

Earlier on the same day as this incident, a similar key rotation issue affected services in the Azure in China sovereign cloud. Both the Azure Government and Azure in China sovereign clouds had their (separate but equivalent) keys created on the same day, starting completely independent timers, back in February 2025 – so each key was inadvertently rotated on its own timer, approximately three hours apart. As such, the key used by ARM for the Azure Government regions was automatically rotated before the same key rotation issue affecting the Azure in China regions was fully mitigated. Although potential impact to other sovereign clouds was discussed as part of the initial investigation, we did not have a sufficient understanding of the inadvertent key rotation to be able to prevent impact in the second sovereign cloud, Azure Government.

How did we respond?

11:04 EST on 08 December 2025 – Customer impact began.
11:07 EST on 08 December 2025 – Engineering was engaged to investigate based on automated alerts.
11:38 EST on 08 December 2025 – We began applying a fix for the impacted authentication components.
13:58 EST on 08 December 2025 – We began to restart ARM instances, to speed up the mitigation process.
14:13 EST on 08 December 2025 – All customer impact confirmed as mitigated.

How are we making incidents like this less likely or less impactful?

First and foremost, our ARM team have conducted an audit to ensure that there are no other manual keys that are misconfigured to be auto-rotated, across all clouds. (Completed)
Our internal secret management system has paused automated key rotations for managed keys, until usage signals are made available on key usage – see the Cosmos DB change safety repair item below. (Completed)
We will complete the migration to auto-rotated Cosmos DB account keys for ARM authentication accounts, across all clouds. (Estimated completion: February 2026)
Our Cosmos DB team will introduce change safety controls that block regenerating keys that have usage, by emitting a relevant usage signal. (Estimated completion: Public Preview by April 2026, General Availability to follow by August 2026)
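The last repair item above describes blocking key regeneration while a key still has usage. A minimal sketch of that idea follows, assuming a usage signal in the form of a last-used timestamp. The function, exception, and 24-hour quiet period are hypothetical and are not the Cosmos DB API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


class KeyInUseError(Exception):
    """Raised when a key still shows recent usage."""


def check_regeneration_allowed(last_used: Optional[datetime],
                               now: datetime,
                               quiet_period: timedelta = timedelta(hours=24)) -> None:
    """Refuse regeneration if the key was used within the quiet period.

    last_used is the hypothetical usage signal; None means no recorded usage.
    """
    if last_used is not None and now - last_used < quiet_period:
        raise KeyInUseError(
            f"key last used {now - last_used} ago; move traffic to another key first"
        )


now = datetime(2025, 12, 8, 16, 0, tzinfo=timezone.utc)
check_regeneration_allowed(None, now)                     # never used: allowed
check_regeneration_allowed(now - timedelta(days=3), now)  # idle for days: allowed
try:
    check_regeneration_allowed(now - timedelta(minutes=5), now)
except KeyInUseError as err:
    print("blocked:", err)
```

The design choice here is to fail closed: a caller must first move traffic away from the key and let the quiet period elapse before the regeneration request is accepted.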

How can customers make incidents like this less impactful?

There was nothing that customers could have done to avoid or minimize impact from this specific service incident.
More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/ML7_-DWG

8

What happened?

Between 16:50 CST on 08 December 2025 and 02:00 CST on 09 December 2025 (China Standard Time) customers using any of the Azure in China regions may have experienced failures when attempting to perform service management operations through Azure Resource Manager (ARM). This included operations attempted through the Azure Portal, Azure REST APIs, Azure PowerShell, and Azure CLI.

Affected services included but were not limited to: Azure AI Search, Azure API Management, Azure App Service, Azure Application Insights, Azure Arc, Azure Automation, Azure Backup, Azure Data Factory, Azure Databricks, Azure Database for PostgreSQL Flexible Server, Azure Kubernetes Service, Azure Logic Apps, Azure Managed HSM, Azure Marketplace, Azure Monitor, Azure Policy (including Machine Configuration), Azure Portal, Azure Resource Manager, Azure Site Recovery, Azure Stack HCI, Azure Stream Analytics, Azure Synapse Analytics, and Microsoft Sentinel.

What went wrong and why?

Azure services use an internal system to manage keys and secrets, which also makes it easy to perform regularly needed maintenance activities, such as rotating secrets. Protecting identities and secrets is a key pillar in our Secure Future Initiative to reduce risk, enhance operational maturity, and proactively prepare for emerging threats to identity infrastructure – by prioritizing secure authentication and robust key management. In this case, our ARM service was using a key in a ‘manual mode’, which means that any key rotations would need to be manually coordinated, so that traffic could be moved to use a different key before the key could be regenerated. The Cosmos DB accounts that ARM uses for accessing authorization policies were intentionally onboarded to the internal service which governs the account key lifecycle, but were unintentionally configured with the option to automatically rotate the keys enabled. This automated rotation should have been disabled as part of the onboarding process, until such time as it was ready to be fully automated.

How did we respond?

16:50 CST on 08 December 2025 – Customer impact began.
16:59 CST on 08 December 2025 – Engineering was engaged to investigate based on automated alerts.
18:37 CST on 08 December 2025 – We identified the underlying cause as the incorrect key rotation.
19:16 CST on 08 December 2025 – We identified mitigation steps and began applying a fix for the impacted authentication components. This was tested and validated before being applied.
22:00 CST on 08 December 2025 – We began to restart ARM instances, to speed up the mitigation process.
23:53 CST on 08 December 2025 – Many services had recovered but residual impact remained for some services.
02:00 CST on 09 December 2025 – All customer impact confirmed as mitigated.

How are we making incidents like this less likely or less impactful?

First and foremost, our ARM team have conducted an audit to ensure that there are no other manual keys that are misconfigured to be auto-rotated, across all clouds. (Completed)
Our internal secret management system has paused automated key rotations for managed keys, until usage signals are made available on key usage – see the Cosmos DB change safety repair item below. (Completed)
We will complete the migration to auto-rotated Cosmos DB account keys for ARM authentication accounts, across all clouds. (Estimated completion: February 2026)
Our Cosmos DB team will introduce change safety controls that block regenerating keys that have usage, by emitting a relevant usage signal. (Estimated completion: Public Preview by April 2026, General Availability to follow by August 2026)

How can customers make incidents like this less impactful?

There was nothing that customers could have done to avoid or minimize impact from this specific service incident.
More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/JSNV-FBZ

November 2025

5

Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/air/2LGD-9VG

What happened?

Between approximately 16:53 UTC on 5 November 2025 and 02:25 UTC on 6 November 2025, a subset of customers in the West Europe region experienced service disruptions or degraded performance across multiple services, including Virtual Machines (VM), Azure Database for PostgreSQL Flexible Server, MySQL Flexible Server, Azure Kubernetes Service, Storage, Service Bus, and Virtual Machine Scale Sets, among others. Customers using Azure Databricks in West Europe observed degraded performance when launching or scaling all-purpose and job compute workloads, which impacted Unity Catalog and Databricks SQL operations.

What went wrong and why?

The incident was triggered by a voltage sag in the utility grid, which caused cooling units to shut down and temperatures to rise above normal thresholds in a datacenter located in Physical Availability Zone 01 in the West Europe region. As temperatures rose, multiple storage scale units automatically powered down to prevent hardware damage. Under normal circumstances, cooling systems are designed and commissioned to automatically restart after power events and are built with sufficient redundancy to maintain safe operating conditions even when failures occur. While cooling unit auto-restart capability had been regularly tested and exercised in this datacenter, the specific size and duration of the utility sag exposed a previously unknown hardware issue that prevented the cooling units from restarting. This resulted in temperatures exceeding safe operational limits for infrastructure.

The extended recovery time was influenced by several factors. Recovery of storage scale units was prolonged because some storage servers entered a degraded state, necessitating sequential validation before they could be brought back online. In addition, dependent services such as compute and networking could not resume until storage integrity checks were complete. Finally, additional thermal audits were performed before workloads were reactivated to ensure that no residual risk remained.

How did we respond?

Automated monitoring detected the temperature anomaly and triggered an incident response. Facilities teams restored safe operating conditions by performing a hard reset on the affected units and confirmed that temperatures were trending back to safe levels. Service recovery efforts began once the facility had reached safe operating temperatures.

Following this, storage systems initiated comprehensive data consistency checks when nodes in a storage scale unit restarted after the unplanned shutdown. These checks may include reconstructing replicas that have fallen behind. The recovery process prioritizes data integrity and durability over availability, so customer traffic was blocked until all integrity validations were completed, which contributed to the extended recovery time. In parallel, engineers prioritized recovery of compute hosts to stop further VM failures and allow dependent services to resume.

Timeline of events:

16:53 UTC on 5 November 2025 – Customer impact began as elevated temperatures caused multiple storage scale units to shut down.
16:55 UTC on 5 November 2025 – Datacenter monitoring detected temperature breaches and triggered thermal alerts.
17:20 UTC on 5 November 2025 – Power sag and cooling unit failure to restart identified as contributing factors. Cooling recovery began as engineers manually restarted cooling units.
17:40 UTC on 5 November 2025 – Mitigation workstream started as cooling restoration progressed.
17:48 UTC on 5 November 2025 – All cooling units restored. Temperatures began to fall as cooling normalized.
17:50 UTC on 5 November 2025 – Rack-level thermal monitoring returned to safe operational thresholds. Engineering teams initiated storage recovery.
18:30 UTC on 5 November 2025 – Sequential storage scale unit validation began. Each scale unit underwent integrity checks to ensure no data corruption before bringing storage services back into rotation.
20:00 UTC on 5 November 2025 – Gradual restoration of storage scale units. Recovery was staged to avoid overloading power and cooling systems. Dependent compute nodes remained offline during this time.
23:30 UTC on 5 November 2025 – Service dependency checks and orchestration. Networking and compute services were progressively re-enabled after storage scale units passed health checks.
02:25 UTC on 6 November 2025 – All services were brought back online, and customer impact was mitigated.

How are we making incidents like this less likely or less impactful?

We are implementing several improvements to strengthen resilience and accelerate recovery:

We are conducting internal and manufacturer-led forensic investigations of specific device hardware and firmware to develop solutions to bolster component tolerance to power sags and voltage fluctuations. (Estimated completion: December 2025)
We are performing a gap analysis of the control circuit to ensure all relays delivering critical functions of the units are powered through Uninterruptible Power Supply (UPS) and stored energy sources. (Estimated completion: January 2026)
We are validating our maintenance procedures and adding additional prechecks for Heating, Ventilation, and Air Conditioning (HVAC) subsystems. (Estimated completion: January 2026)
We are optimizing recovery steps for compute and storage systems to reduce overall restoration time. (Estimated completion: January 2026)
We are improving automated correlation between environmental signals and service health to trigger earlier and more targeted escalations. (Estimated completion: January 2026)
We are updating our cooling unit commissioning playbooks to expand the range of power sag and swell scenarios covered. (Estimated completion: February 2026)
In the longer term, we are implementing a new approach to accelerate service restoration after major incidents. Historically, recovery prioritized full data integrity checks before bringing services online, which extended downtime because it was a serial process. Going forward, integrity checks will run in the background during VM and disk mount operations, allowing availability to return much sooner without compromising data safety. This change is a direct outcome of this incident and represents a significant improvement in recovery speed. (Estimated completion: April 2026)

How can customers make incidents like this less impactful?

Consider using Availability Zones (AZs) to run your services across physically separate locations within an Azure region. To help services be more resilient to datacenter-level failures like this one, each AZ provides independent power, networking, and cooling. Many Azure services support zonal, zone-redundant, and/or always-available configurations: https://docs.microsoft.com/azure/availability-zones/az-overview
Plan for regional redundancy by using active/active or failover designs across paired regions.
Maintain regular backups and test failover procedures to ensure recovery readiness.
Increase resilience by using retry logic and backoff strategies to prevent overload during recovery.
More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
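The retry and backoff recommendation in the list above can be sketched as follows. This is a minimal illustration of exponential backoff with full jitter, not an Azure SDK feature: `TransientError`, the function name, and the default delays are hypothetical, and real clients should retry only on errors their API documents as transient.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (for example, timeouts or 5xx)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or max_attempts is reached.

    The wait before each retry is a random fraction of an exponentially
    growing cap, which spreads retries out so that many clients do not
    hit a recovering service at the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
```

The `sleep` and `rng` parameters exist only so the function can be exercised without real waiting, for example `retry_with_backoff(call_api, sleep=lambda s: None)` in a unit test, where `call_api` stands in for the caller's own operation.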

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/2LGD-9VG

October 2025

29

Read Azure Front Door's blog series on making progress on key resilience repairs: https://aka.ms/AzureFrontDoor/Resiliency-Part1

Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/air/YKYN-BWZ

What happened?

Between 15:41 UTC on 29 October and 00:05 UTC on 30 October 2025, customers and Microsoft services leveraging Azure Front Door (AFD) and Azure Content Delivery Network (CDN) experienced connection timeout errors and Domain Name System (DNS) resolution issues. From 18:30 UTC on 29 October 2025, as the system recovered gradually, some customers started to see availability improve – albeit with increased latency – until the system fully stabilized by 00:05 UTC on 30 October 2025.

Affected Azure services included, but were not limited to: Azure Active Directory B2C, Azure AI Video Indexer, Azure App Service, Azure Communication Services, Azure Databricks, Azure Healthcare APIs, Azure Maps, Azure Marketplace, Azure Media Services, Azure Portal, Azure Sphere Security Service, Azure SQL Database, and Azure Static Web Apps.

Other Microsoft services were also impacted, including Microsoft 365 (see: MO1181369), the Microsoft Communication Registry website, Microsoft Copilot for Security, Microsoft Defender (External Attack Surface Management), Microsoft Dragon Copilot, Microsoft Dynamics 365 and Power Platform (see: MX1181378), Microsoft Entra ID (Mobility Management Policy Service, Identity & Access Management, and User Management), Microsoft Purview, Microsoft Sentinel (Threat Intelligence), Visual Studio App Center, and customers’ ability to open support cases (both in the Azure Portal and by phone).

What went wrong and why?

Azure Front Door (AFD) and Azure Content Delivery Network (CDN) route traffic using globally distributed edge sites supporting customers as well as Microsoft services including various management portals. The AFD control plane generates customer configuration metadata that the data plane consumes for all customer-initiated operations including purge and Web Application Firewall (WAF) on the AFD platform. Since customer applications hosted on AFD and CDN can be accessed by their end users from anywhere in the world, these changes are deployed globally across all its edge sites to provide a consistent user experience.

A specific sequence of customer configuration changes, performed across two different control plane build versions, resulted in incompatible customer configuration metadata being generated. These customer configuration changes themselves were valid and non-malicious – however they produced metadata that, when deployed to edge site servers, exposed a latent bug in the data plane. This incompatibility triggered a crash during asynchronous processing within the data plane service. This defect escaped detection due to a gap in our pre-production validation, since not all features are validated across different control plane build versions.

Azure Front Door employs multiple deployment stages and a configuration protection system to ensure safe propagation of customer configurations. This system validates configurations at each deployment stage and advances only after receiving positive health signals from the data plane. Once deployments are rolled out successfully, the configuration propagation system also updates a ‘Last Known Good’ (LKG) snapshot (a periodic snapshot of healthy customer configurations) so that deployments can be automatically rolled back in case of any issues. The configuration protection system waits for approximately a minute between each stage, completing globally within 5-10 minutes on average.

During this incident, the incompatible customer configuration change was made at 15:35 UTC, and was applied to the data plane in a pre-production stage at 15:36 UTC. Our configuration propagation monitoring continued to receive healthy signals – although the problematic metadata was present, it had not yet caused any issues. Because the data plane crash surfaced asynchronously, after approximately five minutes, the configuration passed through the protection safeguards and propagated to later stages. This configuration (with the incompatible metadata) completed propagation to a majority of edge sites by 15:39 UTC. Since the incompatible customer configuration metadata was deployed successfully to the majority of the fleet with positive health signals, the LKG was also updated with this configuration.
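The timing gap described above, where a health gate that waits about a minute per stage cannot catch a failure that surfaces after about five minutes, can be shown with a small model. This is a simplified sketch, not the AFD protection system: the function is hypothetical, the stage count is an assumed figure, and a failure that surfaces exactly as a wait ends is treated as arriving too late to stop that advance.

```python
def stages_reached_before_failure(num_stages: int,
                                  wait_seconds: int,
                                  failure_delay_seconds: int) -> int:
    """How many stages hold a configuration when its failure first surfaces.

    The configuration is applied to stage 1 at t=0 and advances one stage
    after every wait period in which health signals stay positive. The
    failure first becomes visible at t=failure_delay_seconds.
    """
    completed_waits = failure_delay_seconds // wait_seconds
    return min(num_stages, 1 + completed_waits)


NUM_STAGES = 5  # hypothetical; the source does not state the number of stages

# About a minute per stage, failure after about five minutes: every stage is reached.
print(stages_reached_before_failure(NUM_STAGES, wait_seconds=60, failure_delay_seconds=300))

# A wait longer than the failure delay contains the failure in the first stage.
print(stages_reached_before_failure(NUM_STAGES, wait_seconds=600, failure_delay_seconds=300))
```

In this model the gate only helps when the wait per stage exceeds the time a defect takes to show itself, which is why delayed, asynchronous failures are hard to catch with short waits.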

The data plane impact began in phases starting with our preproduction environment at 15:41 UTC, and replicated across all edge sites globally by 15:45 UTC. As the data plane impact started, the configuration protection system detected this and stopped all new and inflight customer configuration changes from being propagated at 15:43 UTC. The incompatible customer configuration was processed by edge servers, causing crashes across our various edge sites. This also impacted AFD’s internal DNS service, hosted on the edge sites of Azure Front Door, resulting in intermittent DNS resolution errors for a subset of AFD customer requests. This sequence of events was the trigger for the global impact on the AFD platform.

This AFD incident on 29 October was not directly related to the previous AFD incident, from 9 October. Both incidents were broadly related to configuration propagation risk (inherent to a global Content Delivery Network, in which route/WAF/origin changes must be quickly deployed worldwide) but while the failure mode was similar, the underlying defects were different. Azure Front Door’s configuration protection system is designed to validate configurations and proceed only after receiving positive health signals from the data plane. During the AFD incident on 9 October (Tracking ID: QNBQ-5W8) that protection system worked as intended, but was later bypassed by our engineering team during a manual cleanup operation. During this AFD incident on 29 October (Tracking ID: YKYN-BWZ) the incompatible customer configuration metadata progressed through the protection system, before the delayed asynchronous processing task resulted in the crash. Some of the learnings and repair items from the earlier incident are applicable to this incident as well, and are included in the list of repairs below.

How did we respond?

The issue started at 15:41 UTC and was detected by monitoring at 15:48 UTC, prompting our investigation. By 15:43 UTC the configuration protection system activated in response to widespread data plane issues, and automatically blocked all new and in-flight configuration changes from being deployed worldwide.

Since the latest ‘last known good’ (LKG) version had been updated with the conflicting metadata, we chose not to revert to it. To ensure system stability, we decided not to roll back to prior versions of the LKG either. Instead, we opted to edit the latest LKG, by removing the problematic customer configurations manually. We also opted to block all customer configuration changes from propagating to the data plane at 17:30 UTC so that, as we mitigated, we would not reintroduce this issue. At 17:40 UTC we began deploying the updated LKG configuration across the global fleet. Recovery required reloading all customer configurations at every edge site and rebalancing traffic gradually, to avoid overload conditions as edge sites returned to service. This deliberate, phased recovery was necessary to stabilize the system while restoring scale and ensuring no recurrence of the issue.

Many downstream services that use AFD were able to fail over to prevent further customer impact, including Microsoft Entra and Intune portals, and Azure Active Directory B2C. In a more complex example, Azure Portal leveraged its standard recovery process to successfully transition away from AFD during the incident. Users of the Portal would have seen limited impact during this failover process and then been able to use the Portal without issue. Unfortunately, some services within the Portal did not have an established fallback strategy, so parts of the Portal experience (for example, Marketplace) continued to fail even after Portal recovery.

Timeline of major incident milestones:

15:35 UTC on 29 October 2025 – Corrupt metadata first introduced, as described above.
15:41 UTC on 29 October 2025 – Customer impact began, triggered by the resulting crashes.
15:43 UTC on 29 October 2025 – Configuration protection system activated in response to issues.
15:48 UTC on 29 October 2025 – Investigation commenced following monitoring alerts being triggered.
16:15 UTC on 29 October 2025 – Focus of the investigation became examining AFD configuration changes.
16:18 UTC on 29 October 2025 – Initial communication posted to our public status page.
16:20 UTC on 29 October 2025 – Targeted communications to impacted customers sent to Azure Service Health.
17:10 UTC on 29 October 2025 – Began updating the ‘last known good’ (LKG) configuration to remove problematic configurations manually.
17:26 UTC on 29 October 2025 – Azure Portal failed away from Azure Front Door.
17:30 UTC on 29 October 2025 – Blocked all customer configuration propagation to the data plane, in preparation for deploying the new configuration.
17:40 UTC on 29 October 2025 – Initiated the deployment of our updated ‘last known good’ configuration.
17:50 UTC on 29 October 2025 – Last known good configuration available to all Edge sites, which began gradually reloading LKG configuration.
18:30 UTC on 29 October 2025 – AFD DNS servers recovered, allowing us to rebalance traffic manually to a small number of healthy Edge sites. Customers began seeing improvements in availability.
20:20 UTC on 29 October 2025 – As a sufficient number of Edge sites had recovered, we switched to automatic traffic management – customers continued to see availability improve.
00:05 UTC on 30 October 2025 – AFD impact confirmed mitigated for customers, as availability and latency had returned to pre-incident levels.

Post mitigation, we temporarily blocked all AFD customer configuration changes at the Azure Resource Manager (ARM) level to ensure the safety of the data plane. We also implemented additional safeguards including (i) fixing the control plane and data plane defects, (ii) removing asynchronous processing from the data plane, (iii) introducing an additional ‘pre-canary’ stage to test customer configuration, (iv) extending the bake time during each stage of the configuration propagation, and (v) improving the data plane recovery time from approximately 4.5 hours to approximately one hour. We began draining the customer configuration queue from 2 November 2025. Once these safeguards were fully implemented, this restriction was removed on 5 November 2025.

The introduction of new stages in the configuration propagation pipeline was coupled with additional ‘bake time’ between stages – which has resulted in an increase in configuration propagation time for all operations, including create, update, delete, WAF operations on the AFD platform, and cache purges. We continue to work on platform enhancements to ensure a robust configuration delivery pipeline and further reduce the propagation time. For more details on these temporary propagation delays, refer to http://aka.ms/AFD_FAQ.

How are we making incidents like this less likely or less impactful?

To prevent issues like this, and improve deployment safety...

We have fixed both the original control plane incompatibility, and the data plane bug described above. (Completed)
We are now enforcing complete synchronous processing of each customer configuration, before advancing to production stages. (Completed)
We have implemented additional stages in our phased configuration rollout, including extended bake time to help detect configuration related issues. (Completed)
We are decoupling configuration processing in data plane servers from active traffic-serving instances to isolated worker process instances, thereby removing the risk of any configuration defect impacting data plane processing. (Estimated completion: January 2026)
Once our pre-validation of this configuration pipeline is in place, we will work towards reducing the propagation time from 45 minutes to approximately 15 minutes. (Estimated completion: January 2026)
We are enhancing our testing and validation framework to ensure backwards compatibility with configurations generated across previous versions of the control plane build. (Estimated completion: February 2026)

To reduce the blast radius of potential future issues...

We have migrated critical first-party infrastructure (including Azure Portal, Azure Communication Services, Marketplace, Linux Software Repository for Microsoft Products, Support ticket creation) into an active-active solution with fail away. (Completed)
In the longer term, we are enhancing our customer configuration and traffic isolation, to ensure that a single customer’s traffic or configuration issue cannot impact any other customers – utilizing ‘micro cell’ segmentation of the AFD data plane. (Estimated completion: June 2026)

To be able to recover more quickly from issues...

We have made changes to accelerate our data plane recovery time, by leveraging the local customer configuration caching more effectively – to restore customer configurations within one hour. (Completed)
We are making investments to reduce data plane recovery time further, to restore customer configurations within approximately 10 minutes. (Estimated completion: March 2026)

To improve our communications and support...

We have addressed delays in delivering alerts via Azure Service Health to impacted customers, by making immediate improvements to increase resource thresholds. (Completed)
We will expand the automated customer alerts sent via Azure Service Health, to include similar classes of service degradation – to notify impacted customers more quickly. (Estimated completion: November 2025)
Finally, we will resolve the technical and staffing challenges that prevented Premier and Unified customers from being able to create a support request, by ensuring that we can fail over to our backup systems more quickly. (Estimated completion: November 2025)

How can customers make incidents like this less impactful?

Understand best practices for building a resilient global HTTP ingress layer, to maintain availability during regional or network disruptions: https://learn.microsoft.com/azure/architecture/guide/networking/global-web-applications/mission-critical-global-http-ingress
As an alternative for when management portals are temporarily inaccessible, customers can consider using programmatic methods to manage resources – including the REST API (https://learn.microsoft.com/rest/api/azure) and PowerShell (https://learn.microsoft.com/powershell/azure)
More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/YKYN-BWZ

9

Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/air/QKNQ-PB8

What happened?

Between 19:43 UTC and 23:59 UTC on 09 October 2025, customers may have experienced availability issues and failures when loading content for the Azure Portal as well as other management portals across Microsoft. Over the course of the incident, approximately 45% of customers using the management portals experienced some form of impact – with failure rates reaching their peak at approximately 20:54 UTC, after which time customers generally experienced availability improving when attempting to load content. Note that service management via programmatic methods (such as PowerShell or the REST API) and the availability of resources were not impacted.

What went wrong and why?

Various Microsoft management portals utilize backend services to host content for different portal extensions. Azure Front Door (AFD) and Content Delivery Network (CDN) services are used to accelerate delivery of application content. An earlier incident occurred (Tracking ID: QNBQ-5W8) impacting the availability of management portals primarily from Africa and Europe, with lesser impact to Asia Pacific and the Middle East. We invoked our Business Continuity Disaster Recovery (BCDR) protocol and updated our network filters to allow traffic to bypass AFD so that customer traffic could reach backend services hosting content.

Automation scripts were used to update the traffic load balancing configuration to split traffic across multiple routes. However, the scripts inadvertently removed a configuration value, because they used an API version that predated the introduction of that value. The AFD endpoint then began failing its health checks, because the monitor of the AFD route falsely interpreted the endpoint as unhealthy. As a result, the endpoint was no longer routing traffic. However, since traffic was routed through alternative paths, there was no impact to customers, and we were not aware that the configuration value had been removed. These automation scripts had previously been used successfully in other scenarios.
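One way a read-modify-write script can silently drop a setting is by round-tripping a resource through an older API schema that does not know the field. The sketch below illustrates that general failure mode; the schema contents, version labels, and field names (such as `newer_setting`) are hypothetical and do not represent the actual AFD configuration or API.

```python
# Hypothetical schemas: the older API version predates "newer_setting".
SCHEMAS = {
    "v1": {"name", "weight"},
    "v2": {"name", "weight", "newer_setting"},
}


def get_resource(stored: dict, api_version: str) -> dict:
    """Return only the fields that the requested API version knows about."""
    return {k: v for k, v in stored.items() if k in SCHEMAS[api_version]}


def put_resource(body: dict, api_version: str) -> dict:
    """Full replace (PUT semantics): fields absent from the body are removed."""
    return {k: v for k, v in body.items() if k in SCHEMAS[api_version]}


def update_weight(stored: dict, weight: int, api_version: str) -> dict:
    """Read-modify-write, as an automation script might do."""
    body = get_resource(stored, api_version)
    body["weight"] = weight
    return put_resource(body, api_version)


def dropped_fields(before: dict, after: dict) -> set:
    """Post-change validation: report fields that disappeared."""
    return set(before) - set(after)


stored = {"name": "route-a", "weight": 100, "newer_setting": "enabled"}
after_old = update_weight(stored, 50, "v1")  # loses "newer_setting"
after_new = update_weight(stored, 50, "v2")  # preserves it
```

Comparing the resource before and after the write, as `dropped_fields` does here, is one way to catch this kind of removal even when traffic is flowing through alternative paths and no customer-facing symptom appears.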

After recovery of the previous incident, we began taking steps to resume normal traffic through AFD by using these automation scripts. However, due to the previously mentioned configuration value, traffic was not served correctly. We updated network filters to allow customer traffic only through AFD and, as this propagated, the backend hosting services were no longer reachable via AFD. This resulted in the impact to customers starting at 19:43 UTC.

How did we respond?

11:59 UTC on 09 October 2025 – We performed operations to allow management portal traffic to bypass AFD.
19:39 UTC on 09 October 2025 – After verifying recovery of the previous incident, we took steps to migrate management portal traffic back completely through AFD.
19:43 UTC on 09 October 2025 – Management portal availability issues occurred and impact grew.
19:44 UTC on 09 October 2025 – Initial automated alerts were triggered.
19:50 UTC on 09 October 2025 – Engineers were engaged and began investigating.
20:30 UTC on 09 October 2025 – We confirmed the issues were related to the traffic migration and continued to investigate contributing factors.
20:54 UTC on 09 October 2025 – We recovered one hosting service domain to correctly resume traffic through AFD and purged its caches to speed up recovery. Gradual recovery happened over the next 60 minutes, though other residual impact continued.
21:30 UTC on 09 October 2025 – Azure Portal availability was mitigated, however other residual impact remained for other management portals. We shifted our focus to investigating these other residual failures and mitigating this impact.
23:59 UTC on 09 October 2025 – We recovered an additional hosting service domain to correctly resume traffic through AFD. After reviewing our configuration and monitoring telemetry, we confirmed mitigation.

How are we making incidents like this less likely or less impactful?

We have audited our architecture for the traffic routing of our management portals. (Completed)
We have updated our AFD migration scripts to use the latest API versions, to ensure that future runs will not remove the problematic configuration value. (Completed)
We are updating our standard operating procedures to add additional validation steps, to help ensure confirmation that traffic is being routed as expected. (Estimated completion: October 2025)
We are hardening our client-side logic so that browsers can fall back from AFD and hosting service incidents. (Estimated completion: November 2025)
We are making improvements to our failover systems from AFD, to be more robust and automated, so no manual intervention is required. (Estimated completion: December 2025)
We will be conducting additional simulation drills on our internal environments, to ensure that traffic shifts exactly as expected. (Estimated completion: January 2026)
In the longer term, we will be updating the architecture of our infrastructure to support regional rollout of such network traffic changes, to reduce the potential impact radius. (Estimated completion: March 2026)

How can customers make incidents like this less impactful?

As an alternative for when management portals are inaccessible, customers can consider using programmatic methods to manage resources – including the REST API (https://learn.microsoft.com/rest/api/azure) and PowerShell (https://learn.microsoft.com/powershell/azure)
During service incidents in which Azure Portal content is unavailable, customers can try using our preview Azure Portal endpoint URL: https://preview.portal.azure.com
The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
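As one illustration of the programmatic route mentioned above, the sketch below builds a request to list resource groups through the Azure Resource Manager REST API using only the Python standard library. The subscription ID and bearer token are placeholders, and the `api-version` value is an assumption that should be checked against the current REST reference. Obtaining a token is outside the scope of this sketch.

```python
import json
import urllib.request

ARM_ENDPOINT = "https://management.azure.com"


def build_list_resource_groups_request(
    subscription_id: str,
    bearer_token: str,
    api_version: str = "2021-04-01",  # assumed; verify against the REST reference
) -> urllib.request.Request:
    """Build (but do not send) a GET request that lists resource groups."""
    url = (
        f"{ARM_ENDPOINT}/subscriptions/{subscription_id}"
        f"/resourcegroups?api-version={api_version}"
    )
    return urllib.request.Request(
        url,
        method="GET",
        headers={"Authorization": f"Bearer {bearer_token}"},
    )


def list_resource_group_names(request: urllib.request.Request) -> list:
    """Send the request and return resource group names from the response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return [group["name"] for group in payload.get("value", [])]
```

Keeping a small, tested script like this available ahead of time means that routine checks do not depend on portal availability.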

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/QKNQ-PB8

9

Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/QNBQ-5W8

What happened?

Between 07:50 UTC and 16:00 UTC on 09 October 2025, Microsoft services and Azure customers leveraging Azure Front Door (AFD) and Azure Content Delivery Network (CDN) may have experienced increased latency and/or timeouts – primarily across Africa and Europe, as well as Asia Pacific and the Middle East. This impacted the availability of the Azure Portal as well as other management portals across Microsoft.

Peak failure rates for AFD reached approximately 17% in Africa, 6% in Europe, and 2.7% in Asia Pacific and the Middle East. Availability was restored by 12:50 UTC, though some customers continued to experience elevated latency. Latency returned to baseline levels by 16:00 UTC, at which point the incident was mitigated.

What went wrong and why?

AFD routes traffic using globally distributed edge sites and supports Microsoft services including the management portals. The AFD control plane generates system metadata that the data plane consumes for customer-initiated ‘create’, ‘update’, or ‘delete’ operations on AFD or CDN profiles. One of the trigger conditions for this incident was a software defect in the latest version of the AFD control plane which had been rolled out six weeks prior to the incident, in line with our safe deployment practices.

Newly created customer tenant profiles were being onboarded to the newer control plane version. Our service monitoring detected elevated data plane crashes due to a previously unknown bug – triggered by erroneous metadata, generated by a particular sequence of profile update operations. Our automated protection layer intercepted this in early update stages and prevented this metadata from propagating any further to the data plane, thereby averting any customer impact at that time. In addition, as the newer control plane was running in tandem with the previous version of the control plane, we disabled the new control plane from taking any requests.

On 09 October 2025, we initiated a cleanup of the affected tenant configuration with the erroneous metadata. Since the automated protection system was blocking the impacted customer tenant profile updates in the initial stage, we temporarily bypassed it to allow the cleanup of the tenant configuration to proceed. By bypassing the protection system, the erroneous metadata was inadvertently able to propagate to later stages – and triggered the bug in the data plane that crashed the data plane service. This resulted in a disruption to a significant number of edge sites across Europe and Africa: approximately 26% of AFD data plane infrastructure resources in these regions were impacted.

As part of AFD mechanisms to manage traffic, load was automatically distributed to nearby edge sites (including in Asia Pacific and the Middle East). Additionally, as regional business hours traffic started ramping up, it added to the overall traffic load. The increased volume of traffic on the remaining healthy edge sites resulted in high resource utilization, which exceeded operational thresholds. This triggered an additional layer of protection which started distributing traffic to a broader set of edge sites globally, to reduce further impact. Recovery required a combination of automated restarts, manual intervention where automated restarts were taking too long, and traffic failover operations for impacted management portals. Full mitigation was achieved once edge site infrastructure resources stabilized and latency returned to normal.
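A rough calculation shows why losing a share of edge capacity pushes the remaining sites toward their thresholds. It assumes load was spread evenly and that all displaced traffic stayed within the same pool of sites, neither of which the report states. The 26% figure comes from the report, while the 60% baseline utilization is purely illustrative.

```python
def load_multiplier(fraction_lost: float) -> float:
    """Factor by which per-site load grows when a fraction of capacity is lost
    and the same total traffic is spread evenly over what remains."""
    if not 0 <= fraction_lost < 1:
        raise ValueError("fraction_lost must be in [0, 1)")
    return 1.0 / (1.0 - fraction_lost)


def utilization_after_loss(baseline_utilization: float, fraction_lost: float) -> float:
    """Per-site utilization after redistribution (above 1.0 means overload)."""
    return baseline_utilization * load_multiplier(fraction_lost)


multiplier = load_multiplier(0.26)  # 26% of regional resources impacted
print(f"{multiplier:.2f}x load per healthy site")

# Illustrative only: a site that was 60% utilized before the incident.
print(f"{utilization_after_loss(0.60, 0.26):.0%} utilization after redistribution")
```

Under these assumptions each healthy site carries roughly 1.35 times its previous load, before any growth from business-hours traffic is added.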

Additionally, initial customer notifications were delayed primarily due to challenges determining impact, while attempting to target communications to those impacted. We have automated communications to notify customers of incidents quickly, but unfortunately this capability was not yet supported in this incident scenario.

How did we respond?

07:30 UTC on 09 October 2025 – The cleanup operation was initiated.
07:50 UTC on 09 October 2025 – Initial customer impact began, and increased over the next 90 minutes.
08:13 UTC on 09 October 2025 – Our telemetry detected resource availability loss across multiple AFD edge sites. We began investigating as impact continued to grow.
09:04 UTC on 09 October 2025 – We identified that the crashes were due to the previously identified data plane bug.
09:08 UTC on 09 October 2025 – Automated restarts began for our AFD infrastructure resources, and manual intervention began for resources that did not recover automatically.
09:15 UTC on 09 October 2025 – Customer impact reached its peak.
10:01 UTC on 09 October 2025 – Communications were published to the Azure Status page.
10:45 UTC on 09 October 2025 – Targeted customer communications were sent to Azure Service Health.
11:59 UTC on 09 October 2025 – Management portals, like the Azure Portal, performed failover operations (including using scripts to update the load balancing configuration, to split traffic between multiple routes), helping restore their service availability.
12:50 UTC on 09 October 2025 – Availability for AFD fully recovered, however a subset of customers may still have been experiencing elevated latency.
16:00 UTC on 09 October 2025 – After continuous monitoring of latency improvement, we declared the incident as mitigated after confirming recovery.

How are we making incidents like this less likely or less impactful?

We have hardened our standard operating procedures, to ensure that the configuration protection system is not bypassed for any operation. (Completed)
We have fixed the control plane defect which generated the erroneous tenant metadata that led to the data plane resource crashes. (Completed)
We have fixed the bug in the data plane. (Completed)
We will expand the automated customer alerts sent via Azure Service Health, to include similar classes of service degradation. (Estimated completion: November 2025)
We are making improvements to our Azure Portal failover systems from AFD, to be more robust and automated. (Estimated completion: December 2025)
We are building additional runtime configuration validation pipelines against a replica of real-time data plane, as a pre-validation step prior to applying them broadly. (Estimated completion: March 2026)
We are improving data plane resource instance recovery time, following any impact to the data plane. (Estimated completion: March 2026)

How can customers make incidents like this less impactful?

Consider implementing failover strategies with Azure Traffic Manager, to fail over from Azure Front Door to your origins: https://learn.microsoft.com/azure/architecture/guide/networking/global-web-applications/overview
Consider reviewing our best practices for Azure Front Door architecture: https://learn.microsoft.com/azure/well-architected/service-guides/azure-front-door
Consider implementing retry patterns with exponential backoff, to improve workload resiliency: https://learn.microsoft.com/azure/architecture/patterns/retry
More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
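The retry recommendation above can be sketched in a few lines. This is a generic illustration of exponential backoff with full jitter, not code from the linked guidance. The function name, parameters, and defaults are assumptions to adapt to your workload.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient failures with capped, jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries so clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

Retry only operations that are safe to repeat, and keep the total retry budget bounded so that clients do not add load to a service that is already degraded.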

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: http://aka.ms/AzPIR/QNBQ-5W8