9 October 2025
Join one of our upcoming 'Azure Incident Retrospective' livestreams discussing this incident (to hear from our engineering leaders, and to get any questions answered by our experts) or watch a recording of the livestream (available later, on YouTube): https://aka.ms/air/QKNQ-PB8
What happened?
Between 19:43 UTC and 23:59 UTC on 09 October 2025, customers may have experienced availability issues and failures when loading content for the Azure Portal, as well as for other management portals across Microsoft. Over the course of the incident, approximately 45% of customers using the management portals experienced some form of impact. Failure rates peaked at approximately 20:54 UTC, after which availability generally improved for customers attempting to load content. Note that service management via programmatic methods (such as PowerShell or the REST API) and the availability of resources were not impacted.
What went wrong and why?
Various Microsoft management portals rely on backend services to host content for their portal extensions, and use Azure Front Door (AFD) and Content Delivery Network (CDN) services to accelerate delivery of that content. An earlier incident (Tracking ID: QNBQ-5W8) had impacted the availability of management portals, primarily from Africa and Europe, with lesser impact to Asia Pacific and the Middle East. We invoked our Business Continuity and Disaster Recovery (BCDR) protocol and updated our network filters to allow traffic to bypass AFD, so that customer traffic could reach the backend services hosting portal content directly.
Automation scripts were used to update the traffic load balancing configuration, to split traffic across multiple routes. However, the scripts inadvertently removed a configuration value, because they used an API version that predated the introduction of that value. With this value missing, the health checks for the AFD route began failing, so the route was falsely marked as unhealthy and AFD stopped routing traffic through it. Since traffic was still flowing through alternative paths, there was no impact to customers, and we did not detect that the configuration value had been removed. These automation scripts had previously been used successfully in other scenarios.
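To illustrate this failure mode, the following is a minimal sketch, in TypeScript against an ARM-style REST endpoint, of how a read-modify-write using an outdated api-version can silently drop a property introduced in a later version. The resource path, property names, and api-version values shown are hypothetical illustrations, not the actual scripts involved in this incident.

```typescript
// Sketch: how a read-modify-write using an outdated api-version can silently
// remove a property introduced in a later API version. The resource path,
// property names, and api-version values are hypothetical illustrations.

const ARM = "https://management.azure.com";
const ROUTE =
  "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Cdn" +
  "/profiles/<profile>/afdEndpoints/<endpoint>/routes/<route>";

async function updateTrafficSplit(token: string): Promise<void> {
  // GET with an older api-version: the response body only contains the
  // properties that existed in that version of the API contract.
  const oldApiVersion = "2020-09-01"; // hypothetical: predates the new value
  const getRes = await fetch(`${ARM}${ROUTE}?api-version=${oldApiVersion}`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  const route = await getRes.json();

  // The intended change: adjust how traffic is split across routes.
  route.properties.trafficWeight = 50; // hypothetical property

  // PUT is a full replacement of the resource. Because the body came from an
  // old-version GET, any property introduced after that version is absent,
  // and the service treats its absence as removal.
  await fetch(`${ARM}${ROUTE}?api-version=${oldApiVersion}`, {
    method: "PUT",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(route),
  });
}
```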
After recovery from the previous incident, we began taking steps to resume normal traffic flow through AFD, using these same automation scripts. However, due to the missing configuration value described above, AFD did not serve this traffic correctly. We updated network filters to allow customer traffic only through AFD and, as this change propagated, the backend hosting services became unreachable because the AFD route was not serving traffic. This resulted in the customer impact starting at 19:43 UTC.
How did we respond?
- 11:59 UTC on 09 October 2025 – We performed operations to allow management portal traffic to bypass AFD.
- 19:39 UTC on 09 October 2025 – After verifying recovery of the previous incident, we took steps to migrate management portal traffic back completely through AFD.
- 19:43 UTC on 09 October 2025 – Management portal availability issues occurred and impact grew.
- 19:44 UTC on 09 October 2025 – Initial automated alerts were triggered.
- 19:50 UTC on 09 October 2025 – Engineers were engaged and began investigating.
- 20:30 UTC on 09 October 2025 – We confirmed the issues were related to the traffic migration and continued to investigate contributing factors.
- 20:54 UTC on 09 October 2025 – We recovered one hosting service domain to correctly resume traffic through AFD, and purged its caches to speed up recovery (see the purge sketch after this timeline). Gradual recovery occurred over the next 60 minutes, although residual impact continued.
- 21:30 UTC on 09 October 2025 – Impact to Azure Portal availability was mitigated, however residual impact remained for other management portals. We shifted our focus to investigating and mitigating these residual failures.
- 23:59 UTC on 09 October 2025 – We recovered an additional hosting service domain to correctly resume traffic through AFD. After reviewing our configuration and monitoring telemetry, we confirmed mitigation.
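As a companion to the 20:54 UTC step above, here is a minimal sketch of purging an Azure Front Door (Standard/Premium) endpoint cache via the ARM REST API, so that clients re-fetch fresh content from the origin. All names in the path are placeholders, and the api-version shown is an assumption to verify against current Microsoft.Cdn documentation.

```typescript
// Sketch: purging cached content from an Azure Front Door (Standard/Premium)
// endpoint via the ARM REST API. All names in the path are placeholders;
// verify the api-version against current Microsoft.Cdn documentation.

const ARM = "https://management.azure.com";
const AFD_ENDPOINT =
  "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Cdn" +
  "/profiles/<profile>/afdEndpoints/<endpoint>";

async function purgeAfdCache(token: string): Promise<void> {
  const res = await fetch(`${ARM}${AFD_ENDPOINT}/purge?api-version=2023-05-01`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    // "/*" purges everything served by the endpoint; narrower paths can be
    // used to purge only the affected content.
    body: JSON.stringify({ contentPaths: ["/*"] }),
  });

  // Purge is a long-running operation: HTTP 202 means it was accepted, and
  // completion can be polled via the URL in the Azure-AsyncOperation header.
  console.log(res.status, res.headers.get("azure-asyncoperation"));
}
```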
How are we making incidents like this less likely or less impactful?
- We have audited the traffic routing architecture of our management portals. (Completed)
- We have updated our AFD migration scripts to use the latest API versions, to ensure that future runs will not remove the problematic configuration value. (Completed)
- We are updating our standard operating procedures to add further validation steps, to help confirm that traffic is being routed as expected (see the first sketch after this list). (Estimated completion: October 2025)
- We are hardening our client-side logic so that browsers can fall back to alternative endpoints during AFD and hosting service incidents (see the second sketch after this list). (Estimated completion: November 2025)
- We are improving our systems for failing over away from AFD, to be more robust and automated, so that no manual intervention is required. (Estimated completion: December 2025)
- We will be conducting additional simulation drills on our internal environments, to ensure that traffic shifts exactly as expected. (Estimated completion: January 2026)
- In the longer term, we will be updating the architecture of our infrastructure to support regional rollout of such network traffic changes, to reduce the potential impact radius. (Estimated completion: March 2026)
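As referenced in the validation item above, here is a minimal sketch of a post-change routing check. It assumes that Azure Front Door stamps responses with an X-Azure-Ref header, and treats the presence of that header as a signal that the response was served via AFD; this is an illustration, not our actual validation tooling.

```typescript
// Sketch: a post-change validation step confirming that traffic is actually
// being served via AFD. Assumption: Azure Front Door adds an X-Azure-Ref
// header to responses it serves, so its presence indicates the AFD path.

async function validateRouting(url: string): Promise<void> {
  const res = await fetch(url, { method: "HEAD" });

  if (!res.ok) {
    throw new Error(`Endpoint unhealthy: HTTP ${res.status} from ${url}`);
  }
  if (res.headers.get("x-azure-ref") === null) {
    throw new Error(`Response from ${url} was not served via AFD`);
  }
  console.log(`OK: ${url} is healthy and routed via AFD (HTTP ${res.status})`);
}

// Example: run as a gate before declaring a traffic migration complete.
validateRouting("https://portal.azure.com/").catch((err) => console.error(err));
```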
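And for the client-side hardening item, a minimal sketch of browser fallback logic: if content fails to load from the primary (AFD-fronted) endpoint, the client retries a secondary origin. Both hostnames are hypothetical placeholders.

```typescript
// Sketch: browser-side fallback for loading portal extension content. If the
// primary (AFD-fronted) endpoint fails or times out, the client retries the
// next endpoint. Both hostnames are hypothetical placeholders.

const CONTENT_ENDPOINTS = [
  "https://afd.hosting.example.net",    // primary: accelerated via AFD
  "https://origin.hosting.example.net", // fallback: direct to hosting origin
];

async function loadExtensionContent(path: string): Promise<string> {
  let lastError: unknown;
  for (const base of CONTENT_ENDPOINTS) {
    try {
      // A timeout keeps one hung endpoint from blocking the fallback path.
      const res = await fetch(`${base}${path}`, {
        signal: AbortSignal.timeout(5000),
      });
      if (res.ok) return await res.text();
      lastError = new Error(`HTTP ${res.status} from ${base}`);
    } catch (err) {
      lastError = err; // network error or timeout: try the next endpoint
    }
  }
  throw lastError;
}
```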
How can customers make incidents like this less impactful?
- As an alternative for when management portals are inaccessible, customers can consider using programmatic methods to manage resources – including the REST API (https://learn.microsoft.com/rest/api/azure) and PowerShell (https://learn.microsoft.com/powershell/azure). For an SDK-based example, see the first sketch after this list.
- During service incidents in which Azure Portal content is unavailable, customers can try using our preview Azure Portal endpoint URL: https://preview.portal.azure.com
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts (see the second sketch after this list). These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
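For the programmatic management option above, a minimal sketch using the Azure SDK for JavaScript/TypeScript (the @azure/identity and @azure/arm-resources packages); the subscription ID is a placeholder.

```typescript
// Sketch: managing Azure resources programmatically when the portals are
// unavailable, using the Azure SDK for JavaScript/TypeScript. Requires the
// @azure/identity and @azure/arm-resources packages; the subscription ID is
// a placeholder.

import { DefaultAzureCredential } from "@azure/identity";
import { ResourceManagementClient } from "@azure/arm-resources";

async function listResourceGroups(): Promise<void> {
  // Picks up credentials from the environment, Azure CLI, or managed identity.
  const credential = new DefaultAzureCredential();
  const client = new ResourceManagementClient(credential, "<subscription-id>");

  // Equivalent to Get-AzResourceGroup in PowerShell, or
  // GET /subscriptions/{id}/resourcegroups via the REST API.
  for await (const group of client.resourceGroups.list()) {
    console.log(`${group.name} (${group.location})`);
  }
}

listResourceGroups().catch(console.error);
```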
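And for Service Health alerts, a minimal sketch of creating one programmatically. A Service Health alert is an activity log alert whose condition matches the ServiceHealth category; this uses the @azure/arm-monitor package, with all names and IDs as placeholders, and the exact property shapes should be verified against current SDK documentation.

```typescript
// Sketch: creating a Service Health alert programmatically. A Service Health
// alert is an activity log alert whose condition matches the ServiceHealth
// category. Uses the @azure/identity and @azure/arm-monitor packages; all
// names and IDs are placeholders, and property shapes should be verified
// against current SDK documentation.

import { DefaultAzureCredential } from "@azure/identity";
import { MonitorClient } from "@azure/arm-monitor";

const subscriptionId = "<subscription-id>";

async function createServiceHealthAlert(): Promise<void> {
  const client = new MonitorClient(new DefaultAzureCredential(), subscriptionId);

  await client.activityLogAlerts.createOrUpdate(
    "<resource-group>",
    "service-health-alert",
    {
      location: "Global", // activity log alerts are global resources
      scopes: [`/subscriptions/${subscriptionId}`],
      condition: {
        // Fire on any Service Health event in this subscription.
        allOf: [{ field: "category", equals: "ServiceHealth" }],
      },
      actions: {
        // The action group defines who is notified (email, SMS, push, webhook).
        actionGroups: [
          {
            actionGroupId:
              `/subscriptions/${subscriptionId}/resourceGroups/<resource-group>` +
              "/providers/microsoft.insights/actionGroups/<action-group>",
          },
        ],
      },
      enabled: true,
    },
  );
}

createServiceHealthAlert().catch(console.error);
```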
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/QKNQ-PB8