25 February 2025
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/TMS9-J_8
What happened?
Between 16:42 UTC on 25 February 2025 and 01:15 UTC on 26 February 2025, a platform issue affected Microsoft Entra ID, which prevented customers from authenticating via Microsoft Entra ID using the following features:
- Seamless Single Sign-On (Seamless SSO) https://learn.microsoft.com/entra/identity/hybrid/connect/how-to-connect-sso
- Microsoft Entra Connect Sync https://learn.microsoft.com/entra/identity/hybrid/connect/whatis-azure-ad-connect
The error was caused by DNS resolution failures when trying to access select Microsoft Entra endpoints related to hybrid identity scenarios. Users were not able to seamlessly sign in to apps via Entra ID and were asked to interactively authenticate. For a subset of clients, authentication was completely blocked. The incident was mitigated for 94% of impacted customers by 18:35 UTC on 25 February 2025, and full mitigation occurred for the remainder of impacted customers at 01:15 UTC on 26 February 2025.
What went wrong and why?
As part of an ongoing maintenance effort, a change was made which inadvertently removed an intermediate DNS record and the associated Traffic Manager used in the scenarios mentioned above. The maintenance effort aimed to clean up interim routing topology introduced during Microsoft Entra's adoption of IPv6 in prior years. The DNS record and Traffic Manager were intermediate infrastructure components in the resolution path for autologon.microsoftazuread-sso.com, a domain name used by Microsoft Entra ID's Seamless SSO feature and by the Entra Connect Sync feature. No other authentication flows were affected. The Traffic Manager was removed because configuration incorrectly indicated that its DNS record was not in use.
Similar changes leverage a drift detection system to validate that production DNS zone provisioning reflects the desired configuration. Unfortunately, due to a defect in this safety check, the configuration file was not flagged as out-of-sync. As a result, the DNS record was not correctly identified as in-use, so its removal proceeded. The change was reviewed in an internal change tracking system and correctly identified as high risk. However, this second safety check failed because the potential impact was not accurately identified, leading to the change being mistakenly approved to proceed.
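The kind of in-use check described above can be illustrated with a minimal Python sketch. This is not Microsoft's drift detection system: the zone layout, the function names `find_dependents` and `safe_to_remove`, and the sample records are all hypothetical and assumed only for illustration. The idea is simply to refuse the removal of a record while any other record still resolves through it.

```python
from typing import Dict, List, Tuple

# Hypothetical zone model: owner name -> (record type, target).
Zone = Dict[str, Tuple[str, str]]


def _normalize(name: str) -> str:
    """Compare DNS names case-insensitively and without a trailing dot."""
    return name.rstrip(".").lower()


def find_dependents(zone: Zone, name: str) -> List[str]:
    """Return the owners of CNAME records whose target is `name`."""
    wanted = _normalize(name)
    return sorted(
        owner
        for owner, (rtype, target) in zone.items()
        if rtype == "CNAME" and _normalize(target) == wanted
    )


def safe_to_remove(zone: Zone, name: str) -> bool:
    """A record is only considered removable if nothing points at it."""
    return not find_dependents(zone, name)


# Illustrative data only (reserved .test names and documentation addresses).
zone: Zone = {
    "login.example.test": ("CNAME", "intermediate.example.test"),
    "intermediate.example.test": ("CNAME", "edge.example.test"),
    "edge.example.test": ("A", "192.0.2.10"),
    "unused.example.test": ("A", "192.0.2.99"),
}

if __name__ == "__main__":
    for record in zone:
        print(record, "removable:", safe_to_remove(zone, record))
```

A check like this only sees references inside the zone data it is given. It would not, on its own, detect that a name is still queried by clients, so it would need to be combined with traffic or query data.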
The issue was partially mitigated at 18:35 UTC for 94% of affected tenants by restoring configuration of the hostname via a CNAME record. The original DNS record was an A record, which resolves directly to an address and does not require clients to resolve a further hostname. Certain clients (https://aka.ms/CNAMEs) using Kerberos authentication depend on the hostname being an A record; with the CNAME record in place, they could not form the Service Principal Name (SPN) correctly, causing 403 Forbidden errors or server timeouts. At 01:05 UTC on 26 February, the configuration was reverted to its exact previous state and the impact was fully mitigated at 01:15 UTC on 26 February.
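The difference between the two record types can be sketched with a simplified Python model. It assumes that a client builds the SPN from the canonical name at the end of the CNAME chain rather than from the hostname originally requested, which is one way a CNAME can change the SPN. The record data, the target name `sso-backend.example.test`, and the function names are hypothetical; actual client behavior varies.

```python
from typing import Dict, Tuple

# Hypothetical record store: owner name -> (record type, target).
Records = Dict[str, Tuple[str, str]]

HOST = "autologon.microsoftazuread-sso.com"


def canonical_name(records: Records, host: str, max_hops: int = 8) -> str:
    """Follow CNAME records from `host` until a non-CNAME record is found."""
    name = host
    for _ in range(max_hops):
        rtype, target = records[name]
        if rtype != "CNAME":
            return name
        name = target
    raise RuntimeError("CNAME chain too long")


def spn_for(records: Records, host: str, service: str = "HTTP") -> str:
    """Build an SPN from the canonical name (assumed client behavior)."""
    return f"{service}/{canonical_name(records, host)}"


# Original state: the hostname is an A record, so the SPN uses the hostname.
with_a_record: Records = {HOST: ("A", "192.0.2.10")}

# Interim state: the hostname is a CNAME to another (hypothetical) name.
with_cname: Records = {
    HOST: ("CNAME", "sso-backend.example.test"),
    "sso-backend.example.test": ("A", "192.0.2.10"),
}

if __name__ == "__main__":
    print(spn_for(with_a_record, HOST))
    print(spn_for(with_cname, HOST))
```

In this model the two configurations resolve to the same address but yield different SPNs, so a Kerberos ticket request made with the second SPN would not match the service the client intended to reach.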
Additionally, the initial response to the incident and customer notifications were delayed by a misconfiguration of auto-engagement systems and monitors. A feature of our incident management tooling used to request assistance from teams was partially affected. The team was required to leverage backup mechanisms to establish a shared incident bridge and merge investigation workstreams.
How did we respond?
- 16:42 UTC on 25 February 2025 – Relevant DNS records were inadvertently removed. Gradual onset of impact as the 5-minute DNS Time to Live (TTL) expired.
- 17:18 UTC on 25 February 2025 – Investigation started based on internal DNS reachability monitor failures.
- 17:40 UTC on 25 February 2025 – We identified and isolated the change that introduced the failure.
- 18:35 UTC on 25 February 2025 – Approximately 94% of the customer impact had been mitigated, as the DNS configuration related to this authentication scenario had been partially restored. This change allowed autologon.microsoftazuread-sso.com to resolve again.
- 19:16 UTC on 25 February 2025 – First notification posted to Azure Status page banner.
- 22:35 UTC on 25 February 2025 – Through customer reports, we identified that a subset of affected tenants were still experiencing issues, manifesting as 403 Forbidden errors or timeouts.
- 01:05 UTC on 26 February 2025 – The configuration for the affected hostname was rolled back to its last known good state using an A record.
- 01:15 UTC on 26 February 2025 – Traffic fully reverted to regular patterns.
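The intervals implied by the timeline above can be computed directly from the listed timestamps. The following Python sketch does only that arithmetic; the event labels and the helper names `utc` and `minutes_since_onset` are hypothetical and chosen for illustration.

```python
from datetime import datetime, timezone


def utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' string as a UTC timestamp."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


# Timestamps copied from the timeline above; the labels are illustrative.
TIMELINE = {
    "records_removed": utc("2025-02-25 16:42"),
    "investigation_started": utc("2025-02-25 17:18"),
    "change_identified": utc("2025-02-25 17:40"),
    "partial_mitigation": utc("2025-02-25 18:35"),
    "first_status_post": utc("2025-02-25 19:16"),
    "residual_impact_identified": utc("2025-02-25 22:35"),
    "rollback_to_a_record": utc("2025-02-26 01:05"),
    "full_mitigation": utc("2025-02-26 01:15"),
}


def minutes_since_onset(event: str) -> int:
    """Whole minutes between the record removal and the given event."""
    delta = TIMELINE[event] - TIMELINE["records_removed"]
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    for event in TIMELINE:
        print(f"{event}: +{minutes_since_onset(event)} min")
```

With these timestamps, the investigation started 36 minutes after the records were removed, partial mitigation came at 113 minutes, the first Azure Status post at 154 minutes, and full mitigation at 513 minutes (8 hours 33 minutes).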
How are we making incidents like this less likely or less impactful?
- We addressed the misconfiguration affecting our incident response systems by updating our auto-engagement processes to ensure timely notifications and updates are provided throughout the incident lifecycle. (Completed)
- We have reviewed all internal monitoring related to outside-in reachability for authentication scenarios, to ensure these trigger escalations at the appropriate severity. (Completed)
- We prioritized the resolution of the defect in the configuration drift detection system, to de-risk scenarios like this one. (Completed)
- We will conduct a technical investigation on the feasibility of introducing non-global (e.g. regionally scoped) endpoints for Seamless SSO authentication scenarios. (Estimated completion: April 2025)
- An internal investigation is taking place to audit change management practices and re-verify any changes with a similar risk profile over the last 24 months. A change freeze for DNS and traffic management is in place until this effort is completed. (Estimated completion: March 2025)
- Furthermore, we will be upgrading the drift detection system to include additional exhaustive checks of all endpoints used in authentication scenarios. (Estimated completion: April 2025)
- Currently, Azure Service Health alerts can only be configured at the Azure subscription level. To enable us to better communicate with our customers, we plan to enable tenant-level Service Health notifications, so that certain communications sent to Tenant Administrators can trigger Azure Service Health alerts. (Estimated completion: October 2025)
How can customers make incidents like this less impactful?
- Use Managed Identities for Azure Resources in lieu of user accounts employed as service accounts, where possible. Where the use of Managed Identities is not possible, adopt an application-only access pattern, rather than relying on service accounts: https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview and https://learn.microsoft.com/entra/identity-platform/app-only-access-primer.
- Reevaluate whether Seamless SSO is still required. For customers using Windows 10, Windows Server 2016, or later versions, single sign-on via Windows Primary Refresh Tokens provides a more reliable and secure alternative to Seamless SSO.
- As a general best practice, use the latest authentication libraries such as MSAL, which include safe-by-default implementations of end-to-end authentication flows that account for a range of failure scenarios.
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/azure/architecture/framework/resiliency
- The impact times above represent the full incident duration, so they are not specific to any individual customer. Actual impact to service availability varied between customers and resources. For guidance on implementing monitoring to understand granular impact, see: https://aka.ms/AzPIR/Monitoring
- Finally, ensure that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These alerts can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts. To receive Entra ID email notifications specifically, see: https://learn.microsoft.com/entra/identity/monitoring-health/howto-configure-health-alert-emails
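As one possible complement to the monitoring guidance above, customers who depend on a specific endpoint could run a simple DNS resolution probe from their own network. The following Python sketch is an illustration only, not an official Microsoft tool. The function names and the alert text are hypothetical, and the `resolver` parameter exists only so the probe can be exercised without network access.

```python
import socket
from typing import Callable, List


def resolve_addresses(
    host: str,
    port: int = 443,
    resolver: Callable = socket.getaddrinfo,
) -> List[str]:
    """Return the sorted, de-duplicated addresses for `host`, or [] on failure."""
    try:
        infos = resolver(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Name resolution failed (for example, the record no longer exists).
        return []
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def check_endpoint(host: str, resolver: Callable = socket.getaddrinfo) -> str:
    """Produce a one-line status suitable for a log or alerting hook."""
    addresses = resolve_addresses(host, resolver=resolver)
    if not addresses:
        return f"ALERT: {host} did not resolve"
    return f"OK: {host} -> {', '.join(addresses)}"


if __name__ == "__main__":
    # Hostname taken from this report; requires network access when run.
    print(check_endpoint("autologon.microsoftazuread-sso.com"))
```

A probe like this only detects outright resolution failures. It would not have detected the later phase of this incident, in which the hostname resolved again but through a CNAME record, so it is best treated as one signal among several.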
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/TMS9-J_8