18 March 2025
This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details.
Join one of our upcoming 'Azure Incident Retrospective' livestreams discussing this incident, and get any questions answered by our experts: https://aka.ms/AIR/Z_SZ-NV8
What happened?
Between 13:37 and 16:52 UTC on 18 March 2025, and again between 23:20 UTC on 18 March and 00:30 UTC on 19 March, a combination of a fiber cut and a tooling failure within our East US region impacted a subset of Azure customers with services in that region. Customers may have experienced intermittent connectivity loss and increased network latency when sending traffic to/from/within Availability Zone 3 (AZ3) of the East US region.
What do we know so far?
At 13:37 UTC, a drilling operation near one of our network paths accidentally struck fiber used by Microsoft, causing an unplanned disruption to datacenter connectivity within physical Availability Zone 3 (AZ3) only. When fiber cuts impact capacity, our systems are designed to shift traffic automatically to other diverse paths. In this instance, two concurrent failures compounded the cut – before the cut, a large hub router was down for maintenance (in the process of being repaired); and after the cut, a linecard failed on another router.
This combination of multiple concurrent failures impacted a small portion of our diverse capacity within AZ3, leading to the potential for retransmits or intermittent connectivity for some customers. At 13:55 UTC, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity – customers would have started to see their services recover at this time. The restoration of traffic was fully completed by 16:52 UTC, and the issue was noted as mitigated.
At approximately 21:00 UTC, our fiber provider commenced recovery work on the cut fiber. During this capacity recovery process, at 23:20 UTC, a tooling failure caused our systems to add devices back into the production network before safe levels of capacity had been restored on the impacted fiber path – as individual fibers were repaired and brought back into service, our tooling incorrectly began returning devices to the network. Because traffic was restarted without safe levels of capacity, congestion resulted and customers were impacted when sending traffic to/from/within AZ3 of the East US region. The impact was mitigated at 00:30 UTC on 19 March, after we manually isolated the capacity affected by this tooling failure.
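For illustration only, the following minimal Python sketch shows the kind of 'safe capacity' gate such tooling is expected to enforce before returning devices to service – all names and the threshold here are hypothetical, and this is not Microsoft's internal tooling:

    # Hypothetical illustration of a "safe capacity" gate for returning devices to service.
    # Names and the 80% threshold are invented for this sketch; this is not internal tooling.
    def restored_fraction(fiber_links: dict) -> float:
        """Fraction of fiber links on a path that are currently healthy."""
        return sum(fiber_links.values()) / len(fiber_links)

    def can_return_device_to_service(fiber_links: dict, safe_threshold: float = 0.8) -> bool:
        """Only allow a device back into rotation once enough capacity has been restored."""
        return restored_fraction(fiber_links) >= safe_threshold

    # Example: only 2 of 5 fibers repaired so far, so the device stays isolated.
    links = {"fiber-1": True, "fiber-2": True, "fiber-3": False, "fiber-4": False, "fiber-5": False}
    print(can_return_device_to_service(links))  # False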
At 01:52 UTC on 19 March, the underlying fiber cut was fully recovered. We completed the test and restoration of all capacity to pre-incident levels by 06:50 UTC on 19 March.
How did we respond?
- 13:37 UTC on 18 March 2025 – Customer impact began, triggered by a fiber cut causing network congestion which led to customers experiencing packet drops or intermittent connectivity. Our monitoring systems identified the impact immediately, so our on-call engineers engaged to investigate.
- 13:45 UTC on 18 March 2025 – Our fiber provider was notified of the fiber cut and prepared for dispatch.
- 13:55 UTC on 18 March 2025 – Mitigation efforts began, identifying the impacted datacenters and redirecting traffic to healthier routes.
- 15:07 UTC on 18 March 2025 – All customers using the East US region were notified about connectivity issues.
- 16:52 UTC on 18 March 2025 – Mitigation efforts were successfully completed. All devices affected by the fiber cut were isolated, and all customer traffic was using healthy paths without experiencing congestion.
- 23:20 UTC on 18 March 2025 – Customer impact began, due to a tooling failure during the capacity repair process of the initial fiber cut.
- 00:30 UTC on 19 March 2025 – This impact was mitigated after isolating the capacity that was incorrectly added by the tooling failure as part of the recovery process. Customers and services would have experienced full mitigation.
- 01:52 UTC on 19 March 2025 – The underlying fiber cut was fully restored. We continued to monitor our capacity during the recovery process.
- 06:50 UTC on 19 March 2025 – Fiber restoration efforts were completed. Incident was confirmed as mitigated.
How are we making incidents like this less likely or less impactful?
- We are fixing the tooling failure that caused devices being restored to take traffic before they were ready. (Estimated completion: TBD)
- We are increasing the bandwidth within the East US region as part of a planned technology refresh, to de-risk the impact of multiple concurrent failures. (Estimated completion: May 2025)
- This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details.
How can customers make incidents like this less impactful?
- Consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts (see the sketch after this list). These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
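As a minimal sketch of that alert configuration (assuming the standard Microsoft.Insights activityLogAlerts ARM resource; the subscription, resource group, alert name, and action group values below are placeholders), a Service Health alert can be created with a management-plane call along these lines:

    # Sketch: create an activity log alert scoped to Service Health events, via the ARM REST API.
    # Subscription, resource group, alert name and action group are placeholders.
    import requests
    from azure.identity import DefaultAzureCredential

    subscription = "<subscription-id>"
    resource_group = "<resource-group>"
    action_group_id = (f"/subscriptions/{subscription}/resourceGroups/{resource_group}"
                       "/providers/microsoft.insights/actionGroups/<action-group>")

    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    url = (f"https://management.azure.com/subscriptions/{subscription}"
           f"/resourceGroups/{resource_group}/providers/Microsoft.Insights"
           "/activityLogAlerts/service-health-alert?api-version=2020-10-01")

    body = {
        "location": "Global",
        "properties": {
            "scopes": [f"/subscriptions/{subscription}"],
            "condition": {"allOf": [{"field": "category", "equals": "ServiceHealth"}]},
            "actions": {"actionGroups": [{"actionGroupId": action_group_id}]},
            "enabled": True,
        },
    }
    requests.put(url, json=body, headers={"Authorization": f"Bearer {token}"}).raise_for_status()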
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/Z_SZ-NV8
3 March 2025
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/ZT0X-9B8
What happened?
Between 16:22 and 18:37 UTC on 3 March 2025, a subset of customers experienced issues when connecting to Microsoft services through our Toronto edge point-of-presence (PoP) location. This impacted customers connecting to Microsoft 365 and/or Azure services - including but not limited to Virtual Machines, Storage, App Service, SQL Database, Azure Kubernetes Service and Azure Portal access. These resources remained available for customers when connections were completed through other network paths.
What went wrong and why?
Our global network connects the datacenters across Azure regions with our large network of edge locations. Our edge PoP location in Toronto has paired network devices for redundancy. At 10:58 UTC one of the devices experienced hardware failures – we have since determined this to be due to a memory issue on the line cards for the device – and our automation successfully removed the device from the rotation.
While we were performing an initial investigation into the cause of the device failure, at 16:22 UTC the redundant device also experienced a hardware failure, due to a similar memory issue on its line cards – the line cards restarted but did not initialize properly after startup, so they could not forward traffic. As the paired device had already been taken out of rotation earlier that day, automation was not able to perform its normal mitigating actions, and human intervention was required. As such, this redundant device remained in rotation and network packets were dropped.
During our investigation and evaluation of mitigation options, we considered failing traffic out of this PoP altogether; however, this would have been particularly impactful to some customer configurations – for example, those with single-homed ExpressRoute connections – so we opted to keep the location in service. Since the hardware failure was limited to part of the device, our network engineers isolated the faulty line card, which restored connectivity at 18:37 UTC.
After further investigation of the hardware failures, we determined that the memory increases were triggered by an increase in the number of network flows through the device, which reached a scaling limit for these line cards. These occurrences of memory exhaustion coincided with a recent, unrelated datacenter network change for specific devices, which began on 27 February 2025. That change increased the routing scale, and has since been paused. We monitor overall memory on the devices; however, memory on the line cards was not monitored specifically, which delayed our overall response and mitigation processes. As such, the overall scope of impact was not evident from our initial investigation, which delayed broad customer communications.
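As a purely illustrative sketch of why component-level monitoring matters here (the metric values and threshold are invented, not our monitoring stack), a device-level average can look healthy while a single line card is near exhaustion:

    # Hypothetical illustration: alert on per-line-card memory, not just the device total.
    # The metric values and 85% threshold are invented for this sketch.
    def linecard_memory_alerts(linecards: dict, threshold: float = 0.85) -> list:
        """Return the line cards whose memory utilization (0..1) exceeds the threshold."""
        return [card for card, used in linecards.items() if used >= threshold]

    device = {"lc-1": 0.55, "lc-2": 0.60, "lc-3": 0.97}
    print(sum(device.values()) / len(device))  # ~0.71 - the device-level view looks fine
    print(linecard_memory_alerts(device))      # ['lc-3'] - the card approaching exhaustion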
How did we respond?
- 16:22 UTC on 3 March 2025 – Customer impact began, as the redundant device began experiencing the hardware failure, while the paired primary device had already been removed from rotation.
- 16:32 UTC on 3 March 2025 – Our automation attempted to bring the paired primary device back into rotation and, due to system load balancing after route convergence, the impact was marginally reduced.
- 16:43 UTC on 3 March 2025 – First automated alert was raised of potential issues; however, the scope of impact was not evident from the alert or initial investigations.
- 17:19 UTC on 3 March 2025 – Network engineers investigating the issue took steps to remove congestion from the network peer, which reduced impact.
- 17:27 UTC on 3 March 2025 – Additional customer reports received, following delays in submitting support requests via the Azure Portal.
- 17:46 UTC on 3 March 2025 – Additional engineering teams were engaged to investigate and mitigate.
- 17:59 UTC on 3 March 2025 – We began steps to manually remove the redundant device from rotation.
- 18:37 UTC on 3 March 2025 – We isolated the faulty line card from the device. Network traffic was shifted to healthy routes, connectivity was restored, and we monitored the service health to ensure customers had been mitigated.
How are we making incidents like this less likely or less impactful?
- We have scanned and analyzed for the memory issue across our network. (Completed)
- We have updated the configuration on the device, to handle more connections. (Completed)
- We have added monitoring for memory utilization of this hardware subcomponent, to detect issues like this more quickly. (Completed)
- We are improving our notification service for similar issues, to notify impacted customers as soon as our internal monitoring detects the issue. (Estimated completion: March 2025)
- In the longer term, we are improving our network design to handle multiple device failures in our edge PoP locations. (Estimated completion for Toronto edge: December 2025)
How can customers make incidents like this less impactful?
- For ExpressRoute high availability, we recommend operating both primary and secondary connections of ExpressRoute circuits in active-active mode: https://learn.microsoft.com/azure/expressroute/designing-for-high-availability-with-expressroute
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/ZT0X-9B8
25 February 2025
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/TMS9-J_8
What happened?
Between 16:42 UTC on 25 February 2025 and 01:15 UTC on 26 February 2025, a platform issue affected Microsoft Entra ID, which prevented customers from authenticating via Microsoft Entra ID using the following features:
- Seamless Single Sign-On (Seamless SSO) https://learn.microsoft.com/entra/identity/hybrid/connect/how-to-connect-sso
- Microsoft Entra Connect Sync https://learn.microsoft.com/entra/identity/hybrid/connect/whatis-azure-ad-connect
The error was caused by DNS resolution failures when trying to access select Microsoft Entra endpoints related to hybrid identity scenarios. Users were not able to seamlessly sign in to apps via Entra ID and were asked to interactively authenticate. For a subset of clients, authentication was completely blocked. The incident was mitigated for 94% of impacted customers by 18:35 UTC on 25 February 2025, and full mitigation occurred for the remainder of impacted customers at 01:15 UTC on 26 February 2025.
What went wrong and why?
As part of an ongoing maintenance effort, a change was made which inadvertently removed an intermediate DNS record and an associated Traffic Manager profile used in the scenarios mentioned above. The maintenance effort aimed to clean up interim routing topology introduced during Microsoft Entra's adoption of IPv6 in prior years. The DNS record and Traffic Manager profile were intermediate infrastructure components in the resolution path for autologon.microsoftazuread-sso.com, a domain name used by Microsoft Entra ID's Seamless SSO feature and by the Entra Connect Sync feature. No other authentication flows were affected. The Traffic Manager profile was removed because configuration incorrectly indicated that its DNS record was not in use.
Similar changes leverage a drift detection system to validate that production DNS zone provisioning reflects the desired configuration. Unfortunately, due to a defect in this safety check, the configuration file was not flagged as out-of-sync. As a result, the DNS record was not correctly identified as in-use, so its removal proceeded. The change was reviewed in an internal change tracking system and correctly identified as high risk. However, this second safety check failed because the potential impact was not accurately identified, leading to the change being mistakenly approved to proceed.
The issue was partially mitigated at 18:35 UTC for 94% of affected tenants by restoring the configuration of the hostname via a CNAME record. The original DNS record was an A record, which does not require clients to recursively resolve a hostname. Certain clients (https://aka.ms/CNAMEs) using Kerberos authentication are sensitive to the hostname resolving via a CNAME rather than an A record, and could not form the Service Principal Name (SPN) correctly, causing 403 Forbidden errors or server timeouts. At 01:05 UTC on 26 February, the configuration was reverted to its exact previous state, and the impact was fully mitigated at 01:15 UTC on 26 February.
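For readers who want to verify how this hostname resolves from their own environment, a small diagnostic along the following lines (a sketch using the third-party dnspython package, not an official Microsoft tool) distinguishes a direct A record from a CNAME chain:

    # Diagnostic sketch using dnspython (pip install dnspython); not an official Microsoft tool.
    import dns.resolver

    def describe_resolution(hostname: str) -> None:
        try:
            cname = dns.resolver.resolve(hostname, "CNAME")
            print(f"{hostname} is a CNAME -> {cname[0].target}")
        except dns.resolver.NoAnswer:
            print(f"{hostname} has no CNAME record (it resolves directly via an A record)")
        for record in dns.resolver.resolve(hostname, "A"):
            print(f"  A {record.address}")

    describe_resolution("autologon.microsoftazuread-sso.com")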
Additionally, initial response to the incident and customer notifications were delayed by misconfiguration of auto-engagement systems and monitors. A feature of our incident management tooling used to request assistance from teams was partially affected. The team was required to leverage backup mechanisms to establish a shared incident bridge and merge investigation workstreams.
How did we respond?
- 16:42 UTC on 25 February 2025 – Relevant DNS records were inadvertently removed. Gradual onset of impact as the 5-minute DNS Time to Live (TTL) expired.
- 17:18 UTC on 25 February 2025 – Investigation started based on internal DNS reachability monitor failures.
- 17:40 UTC on 25 February 2025 – We identified and isolated the change that introduced the failure.
- 18:35 UTC on 25 February 2025 – Approximately 94% of the customer impact had been mitigated, as the DNS configuration related to this authentication scenario had been partially restored. This change allowed autologon.microsoftazuread-sso.com to resolve again.
- 19:16 UTC on 25 February 2025 – First notification posted to Azure Status page banner.
- 22:35 UTC on 25 February 2025 – Through customer reports, we identified that a subset of affected tenants were still experiencing issues, manifesting as 403 Forbidden errors or timeouts.
- 01:05 UTC on 26 February 2025 – The configuration for the affected hostname was rolled back to last known good state using an A record.
- 01:15 UTC on 26 February 2025 – Traffic fully reverted to regular patterns.
How are we making incidents like this less likely or less impactful?
- We addressed the misconfiguration affecting our incident response systems by updating our auto-engagement processes to ensure timely notifications and updates are provided throughout the incident lifecycle. (Completed)
- We have reviewed all internal monitoring related to outside-in reachability for authentication scenarios, to ensure these trigger escalations at the appropriate severity. (Completed)
- We are prioritizing the resolution of the defect in the configuration drift detection system, to de-risk scenarios like this one. (Completed)
- We will conduct a technical investigation on the feasibility of introducing non-global (e.g. regionally scoped) endpoints for Seamless SSO authentication scenarios. (Estimated completion: April 2025)
- An internal investigation is taking place to audit change management practices and re-verify any changes with a similar risk profile over the last 24 months. A change freeze for DNS and traffic management is in place until this effort is completed. (Estimated completion: March 2025)
- Furthermore, we will be upgrading the drift detection system to include additional exhaustive checks of all endpoints used in authentication scenarios. (Estimated completion: April 2025)
- Currently, Azure Service Health alerts can only be configured at the Azure subscription level. To enable us to better communicate with our customers, we plan to enable tenant-level Service Health notifications, so that certain communications sent to Tenant Administrators can trigger Azure Service Health alerts. (Estimated completion: October 2025)
How can customers make incidents like this less impactful?
- Use Managed Identities for Azure Resources in lieu of user accounts employed as service accounts, where possible (see the sketch after this list). Where the use of Managed Identities is not possible, adopt an application-only access pattern, rather than relying on service accounts: https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview and https://learn.microsoft.com/entra/identity-platform/app-only-access-primer.
- Reevaluate whether Seamless SSO is still required. For customers using Windows 10, Windows Server 2016, or later versions, single sign-on via Windows Primary Refresh Tokens provides a more reliable and secure alternative to Seamless SSO.
- As a general best practice, leverage latest authentication libraries such as MSAL, which include safe-by-default implementations of end-to-end authentication flows that account for a range of failure scenarios.
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/azure/architecture/framework/resiliency
- The impact times above represent the full incident duration, so they are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, ensure that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These alerts can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts and specifically here to receive Entra ID email notifications: https://learn.microsoft.com/entra/identity/monitoring-health/howto-configure-health-alert-emails
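As a brief sketch of the managed-identity pattern recommended above (the storage account URL is a placeholder, and DefaultAzureCredential is one of several suitable credential types), application code can avoid passwords on service accounts entirely:

    # Sketch: authenticate to an Azure service with a managed identity instead of a service account.
    # The storage account URL is a placeholder.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    credential = DefaultAzureCredential()  # picks up the managed identity when running in Azure
    client = BlobServiceClient(account_url="https://<storage-account>.blob.core.windows.net",
                               credential=credential)
    for container in client.list_containers():
        print(container.name)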
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/TMS9-J_8
8 January 2025
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/PLP3-1W8
What happened?
Between 22:31 UTC on 8 January and 00:44 UTC on 11 January 2025, a networking configuration change in a single availability zone (physical zone AZ01) in East US 2 resulted in customers experiencing connectivity issues, prolonged timeouts, connection drops, and resource allocation failures across multiple Azure services. Customers leveraging Private Link with Network Security Groups (NSG) to communicate with Azure services may have also been impacted. Services that were configured to be zonally redundant and leveraging VNet integration may have experienced impact across multiple zones.
The 'logical' availability zones used by each customer subscription may correspond to different physical availability zones. Customers can use the Locations API to understand this mapping, and to confirm which of their resources run in this physical AZ: https://learn.microsoft.com/en-us/rest/api/resources/subscriptions/list-locations?HTTP#availabilityzonemappings
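As a sketch of that lookup (using the REST endpoint referenced above; the api-version and field names follow the linked documentation, and the subscription ID is a placeholder):

    # Sketch: list logical-to-physical availability zone mappings for a subscription,
    # using the Subscriptions - List Locations REST API referenced above.
    import requests
    from azure.identity import DefaultAzureCredential

    subscription = "<subscription-id>"
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    url = f"https://management.azure.com/subscriptions/{subscription}/locations?api-version=2022-12-01"

    for location in requests.get(url, headers={"Authorization": f"Bearer {token}"}).json()["value"]:
        for zone in location.get("availabilityZoneMappings") or []:
            print(location["name"], zone["logicalZone"], "->", zone["physicalZone"])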
Impacted services included:
- Azure Databricks: between 22:55 UTC on 08 January and 15:00 UTC on 10 January, customers may have experienced issues launching clusters and serverless compute resources, leading to failures in dependent jobs or workflows.
- Azure OpenAI: between 04:29 UTC on 09 January and 15:00 UTC on 10 January, customers may have experienced stuck Azure OpenAI training jobs, and failures in Finetune model deployments and other jobs.
- Azure SQL Managed Instance, Azure Database for PostgreSQL flexible servers, Virtual Machines, Virtual Machine Scale Sets: between 22:31 UTC on 08 January and 22:19 UTC on 10 January, customers may have experienced service management operation failures.
- Azure App Service, Azure Logic Apps, Azure Function apps: between 22:31 UTC on 8 January and 20:05 UTC on 10 January, customers may have received intermittent HTTP 500-level response codes, timeouts or high latency when accessing App Service (Web, Mobile, and API Apps), App Service (Linux), or Function deployments hosted in East US 2.
- Azure Container Apps: between 22:31 UTC on 08 January and 21:01 UTC on 10 January, customers may have experienced intermittent failures when creating revisions or container apps/jobs in a Consumption workload profile, and when scaling out a Dedicated workload profile in East US 2.
- Azure SQL Database: between 01:29 UTC on 09 January and 22:25 UTC on 10 January, customers may have experienced issues accessing services, and errors/timeouts for new database connections. Existing connections remained available to accept new requests; however, re-established connections may have failed.
- Azure Data Factory and Azure Synapse Analytics: between 22:31 UTC on 08 January and 00:44 UTC on 11 January, customers may have encountered errors when attempting to create clusters, job failures, timeouts during Managed VNet activities, and failures in activity runs for the ExecuteDataFlow activity type.
Other impacted services included API Management, Azure DevOps, Azure NetApp Files, Azure RedHat OpenShift, and Azure Stream Analytics.
What went wrong and why?
The Azure PubSub service is a key component of the networking control plane, acting as an intermediary between resource providers and networking agents on Azure hosts. Resource providers, such as the Network Resource Provider, publish customer configurations during Virtual Machine or networking create, update, or delete operations. Networking agents (subscribers) on the hosts retrieve these configurations to program the host's networking stack. Additionally, the service functions as a cache, ensuring efficient retrieval of configurations during VM reboots or restarts. This capability is essential for deployments, resource allocation, and traffic management in Azure Virtual Network (VNet) environments.
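To make that publish/subscribe-with-cache shape concrete, here is a deliberately simplified sketch – not Microsoft's implementation, with class and method names invented for illustration:

    # Deliberately simplified illustration of a publish/subscribe configuration cache.
    # Not Microsoft's implementation; all names are invented for this sketch.
    class ConfigPubSub:
        def __init__(self):
            self._cache = {}        # resource id -> latest network configuration
            self._subscribers = []  # callbacks registered by host networking agents

        def publish(self, resource_id, config):
            """Resource providers push customer network configuration here."""
            self._cache[resource_id] = config
            for notify in self._subscribers:
                notify(resource_id, config)

        def subscribe(self, callback):
            """Host networking agents register to receive configuration updates."""
            self._subscribers.append(callback)

        def get(self, resource_id):
            """Cached lookup, e.g. when a VM reboots and its host re-programs networking."""
            return self._cache.get(resource_id)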
Around 22:31 UTC on 08 January, during an operation to enable a feature, a manual recycling of internal PubSub replicas on certain metadata partitions was required. This recycling operation should have been completed in series, to ensure that data quorum for each partition was maintained. In this case, the recycling was executed across the replicas in parallel, causing the affected partitions to lose quorum. This resulted in the PubSub service losing the indexing data stored on these partitions.
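The quorum arithmetic behind the 'series, not parallel' requirement can be shown with a toy example (the replica count is illustrative; this PIR does not state the actual number of replicas per partition):

    # Toy illustration of why replicas must be recycled in series rather than in parallel.
    # The replica count is illustrative only.
    replicas = 5
    quorum = replicas // 2 + 1        # majority needed for the partition to keep its data available

    available_in_series = replicas - 1    # at most one replica down at a time
    available_in_parallel = replicas - 3  # several replicas restarting together

    print(available_in_series >= quorum)    # True  - quorum maintained
    print(available_in_parallel >= quorum)  # False - quorum lost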
Losing the indexing data on three partitions resulted in ~60% of network configurations not being delivered to agents from the control plane. This impacted all service management operations for resources with index data on the affected partitions, including network resource provisioning for virtual machines in East US 2 AZ01. Networking agents would have failed to receive updates on VM routing information for new and existing VMs that underwent any automated recovery. This manifested in a loss of connectivity affecting VMs that customers use, in addition to the VMs that underpin our PaaS service offerings.
Multi-tenant PaaS services that used VNet integration had a dependency on these partitions for networking provisioning state. After the index data was lost, this dependency on data that no longer existed caused the impact to these PaaS services to extend beyond the impacted zone.
How did we respond?
The issue was detected by monitoring on the PubSub service at 22:44 UTC, with impacted services raising alerts shortly thereafter. The PubSub team identified the trigger as the loss of indexing data on three partitions, caused by the recycling operation.
At 23:10 UTC, we started evaluating different options for data recovery. Several workstreams were evaluated including:
- Rehydrating the networking configurations to the partitions from the network resource provider.
- Restoring from backup to recover the networking configurations.
- Recovering data from the unaffected partitions.
Due to limitations in the feasibility of the first two options, we started to recover data by rebuilding the mappings from the backend data store. Scanning the data store to extract source mappings is a time-intensive process. It involves not only scanning the data, but also making sure that the mappings to the existing resources are still valid. To achieve this, we had to develop validation and mapping transformation scripts. We also throttled the mappings extraction to avoid impacting the backend data store. This scanning process took about 5-6 hours to complete. Once we obtained these mappings, we then had to restore the impacted index partitions.
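As a generic illustration of the throttled extraction described above (batch size, pause, and the validation callback are invented here, not the actual recovery tooling):

    # Generic sketch of a throttled scan that extracts and validates source mappings.
    # Batch size, pause and the validation callback are invented for illustration.
    import time

    def throttled_scan(records, extract, is_valid, batch_size=100, pause_seconds=0.5):
        """Extract mappings in small batches, pausing between batches to limit load
        on the backend data store, and keep only mappings that still validate."""
        mappings = []
        for start in range(0, len(records), batch_size):
            batch = records[start:start + batch_size]
            mappings.extend(m for m in (extract(r) for r in batch) if is_valid(m))
            time.sleep(pause_seconds)  # throttle to protect the backend store
        return mappings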
As we initiated the build phase, there were unforeseen challenges that we had to work around. Typically, each build takes 3-4 hours. We spent some time looking for ways to expedite this process, as our plan involved applying multiple builds.
Independently of the builds taking longer than expected, we also observed a high number of failed requests due to the outage itself. As a result, our systems were generating a lot of errors, which contributed to high CPU usage. This exacerbated the slowness of the restore process. One of the ways that we overcame this was to deploy patches to block requests to the impacted partitions, which mitigated the high CPU usage.
Unfortunately, at 03:36 UTC on 10 January 2025, we discovered that the first mapping source files we generated were unusable, as they referenced an incorrect endpoint. Once this was understood, we had to repeat the mapping file generation and restore process. By 19:18 UTC on 10 January, all three partitions were fully recovered. Additional workstreams that were executed while this main workstream was underway included:
- An immediate remediation step was to direct the Azure control plane service running on all host machines in this region to PubSub endpoints located in other, non-impacted availability zones. This restored the host agents' ability to look up the correct routing information, mitigating VM-to-VM connectivity issues.
- We initiated a workstream to route all new allocation requests to other availability zones in East US 2. However, as mentioned, services that were using VNet integration, and therefore had a fixed dependency on the impacted zone, would have continued to see failures.
- To remediate impact to customers leveraging Private Link with NSGs to communicate with Azure services, we redirected the Private Link mapping to avoid the dependency on the impacted zone.
By 19:18 UTC on 10 January, all three impacted partitions were fully recovered. Following this, we started working with impacted services to validate their health and remediate any data gaps that we discovered. At 00:30 UTC on 11 January, we initiated a phased approach to gradually enable AZ01 to accept new allocation requests. By 00:44 UTC on 11 January, all services confirmed mitigation. We reintroduced AZ01 into rotation fully at 04:30 UTC, at which point we declared full mitigation. The impact window for Azure services varied between services.
Below is a timeline of events:
- 22:31 UTC on 08 January 2025 – Customer impact began following the recycling of the PubSub partitions.
- 22:44 UTC on 08 January 2025 – The issue was identified as an incorrect networking configuration.
- 23:10 UTC on 08 January 2025 – We started to evaluate options for recovering the index data to the PubSub service.
- 23:43 UTC on 08 January 2025 – Targeted customer communications from the earliest reported impacted services started going out via Azure Service Health.
- 00:35 UTC on 09 January 2025 – We began scanning the backend data store.
- 02:37 UTC on 09 January 2025 – The Azure Status page was updated with details of a wider networking impact to the region.
- 03:05 UTC on 09 January 2025 – A workstream to route all new allocation requests to other availability zones in East US 2 was initiated.
- 03:45 UTC on 09 January 2025 – The workstream to restore the host agents' ability to look up the correct routing information was completed.
- 04:30 UTC on 09 January 2025 – Development of scripts to restore the impacted partitions started.
- 05:00 UTC on 09 January 2025 – The workstream to route all new allocation requests to other availability zones in East US 2 was completed.
- 07:45 UTC on 09 January 2025 – Validation and testing of scripts for restoring mapping data to the impacted partitions was completed.
- 08:51 UTC on 09 January 2025 – Completed scanning the backend data store and generating source mapping files.
- 09:14 UTC on 09 January 2025 – The build process for the scripts to restore mappings was completed.
- 11:30 UTC on 09 January 2025 – We observed high CPU usage and slow restoration speed.
- 11:58 UTC on 09 January 2025 – The workstream to mitigate impacted services with Private Link and NSG dependencies was initiated.
- 13:12 UTC on 09 January 2025 – We updated one of the internal PubSub services to stop sending requests that were contributing to the high CPU usage.
- 16:10 UTC on 09 January 2025 – We completed another update to reduce request volumes to the impacted partitions. This helped to improve the restore speed further.
- 18:17 UTC on 09 January 2025 – The workstream to redirect the Private Link mapping to avoid the dependency on the impacted zone was completed.
- 19:15 UTC on 09 January 2025 – We created another patch to speed up the restore process. The new patch restored the mappings inside the partition directly, rather than calling APIs to add mappings.
- 23:40 UTC on 09 January 2025 – This patching was completed with the new optimization step, and the time to rebuild one partition was reduced even further. Our overall restore time was reduced by >75%.
- 02:10 UTC on 10 January 2025 – Restoration of the first two partitions was completed.
- 03:00 UTC on 10 January 2025 – Restoration of the third partition was completed.
- 03:36 UTC on 10 January 2025 – During our post-restoration validation steps, we determined that the source mapping files generated were incorrect. As a result, we started generating all new source mapping files.
- 05:00 UTC on 10 January 2025 – We validated the source mapping data for the first partition and continued extracting the rest of the mappings. Meanwhile, we had to clean the previously restored partitions.
- 08:30 UTC on 10 January 2025 – The corrected source mapping files were ready, and we started the partition restore process.
- 11:04 UTC on 10 January 2025 – The first partition was brought back online, while we troubleshot and mitigated data anomalies.
- 19:18 UTC on 10 January 2025 – All three of the impacted partitions were fully recovered and validated. At this stage, most customers and services would have seen full mitigation.
- 00:30 UTC on 11 January 2025 – We initiated a phased approach to gradually enable the impacted zone for new allocations.
- 00:44 UTC on 11 January 2025 – All services confirmed mitigation.
How are we making incidents like this less likely or less impactful?
- We are working on reducing the recovery time for any data loss scenarios by improving the PubSub data recovery mechanism. This will bring our RTO to under 2 hours (Estimated completion: February 2025).
- While PubSub production changes already go through our Safe Deployment Practices (SDP), we have added a step with additional layers of scrutiny. These will include, but are not limited to, verifying rollback plans, ensuring error budgets compliance, and performance validations (Completed).
- All manual configuration changes will be moved to an automation pipeline which will require elevated privileges. (Estimated completion: February 2025).
- Significant impact to the PubSub service can occur if certain operations are performed incorrectly. We will block these operations from being executed, except by designated experts (Estimated completion: January 2025).
- We are re-examining zonal services to ensure that all downstream dependencies are zone resilient. (Estimated completion: October 2025)
- During the incident, we recommended that customers invoke a failover. This was an inappropriate communication. We are correcting our processes to ensure that our communications inform customers about the state of the impacted services, zones, and regions – so that this, coupled with any other information that we share about the incident (including ETAs), can be taken into consideration as customers evaluate their own business continuity decisions (Estimated completion: January 2025)
How can customers make incidents like this less impactful?
- Consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/azure/architecture/framework/resiliency
- Consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/PLP3-1W8