Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/Z_SZ-NV8
What happened?
Between 13:37 and 16:52 UTC on 18 March 2025, and again between 23:20 UTC on 18 March and 00:30 UTC on 19 March 2025, a combination of a third-party fiber cut and an internal tooling failure impacted a subset of Azure customers with services in our East US region.
During the first impact window, immediately after the fiber cut, customers may have experienced intermittent connectivity loss for traffic entering or leaving AZ03 – to/from other zones, or to/from the public internet. During this time, the traffic loss rate peaked at 0.02% for short periods of time. Traffic within AZ03, as well as traffic to/from/within AZ01 and AZ02, was not impacted.
During the second impact window, triggered by the tooling issue, customers may have experienced intermittent connectivity loss – primarily when sending inter-zone traffic that included AZ03. During this time, the traffic loss rate peaked at 0.55% for short periods of time. Traffic entering or leaving the East US region was not impacted, but there was some minimal impact to inter-zone traffic from both of the other Availability Zones, AZ01 and AZ02.
Note that the 'logical' availability zones used by each customer subscription may correspond to different physical availability zones. Customers can use the Locations API to understand this mapping and to confirm which resources run in this physical AZ – see: https://learn.microsoft.com/rest/api/resources/subscriptions/list-locations?HTTP#availabilityzonemappings.
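As an illustration of that mapping lookup, the following is a minimal sketch (not an official sample) that calls the Subscriptions - List Locations API with Python, assuming the azure-identity and requests packages are installed, the caller has Reader access, and that the api-version shown returns the availabilityZoneMappings field – verify details against the linked API reference.

```python
# Minimal sketch (not an official sample): list logical-to-physical availability zone
# mappings for a subscription via the ARM "Subscriptions - List Locations" API.
# Assumption: api-version 2022-12-01 (or later) returns availabilityZoneMappings.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION_ID = "<your-subscription-id>"  # placeholder

def print_zone_mappings(subscription_id: str) -> None:
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    url = f"https://management.azure.com/subscriptions/{subscription_id}/locations"
    resp = requests.get(
        url,
        params={"api-version": "2022-12-01"},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    for location in resp.json().get("value", []):
        for mapping in location.get("availabilityZoneMappings") or []:
            # e.g. "eastus: logical zone 2 -> physical zone eastus-az3"
            print(f"{location['name']}: logical zone {mapping['logicalZone']} "
                  f"-> physical zone {mapping['physicalZone']}")

if __name__ == "__main__":
    print_zone_mappings(SUBSCRIPTION_ID)
```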
What went wrong and why?
At 13:37 UTC on 18 March 2025, a drilling operation near one of our network paths accidentally struck fiber used by Microsoft, causing an unplanned disruption to datacenter connectivity within AZ03. When fiber cuts impact our networking capacity, our systems are designed to redistribute traffic automatically to other paths. In this instance, two failures coincided – before the cut, a datacenter router in AZ03 was already down for maintenance and in the process of being repaired. This combination of concurrent failures impacted a small portion of our diverse capacity within AZ03, creating the potential for intermittent connectivity issues for some customers. At 13:55 UTC on 18 March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity – customers would have started to see their services recover at this time.
Additionally, at 14:16 UTC – after the fiber cut, and after the failed isolation described below – a linecard failed on another router, further reducing overall capacity to AZ03. However, as traffic had already been re-routed, this further reduction in capacity did not cause any additional customer impact.
During the initial mitigation efforts outlined above, our auto-mitigation tool encountered a lock contention problem that blocked commands on the impacted devices, so it failed to isolate all capacity connected to those devices. Because the un-isolated capacity was already out of service from the fiber cut and was not carrying production traffic, our systems did not immediately flag this failed isolation state.
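To illustrate the class of fix described in the repair items below, here is a minimal, hypothetical sketch (not Microsoft's internal tooling – the device names and helper functions are invented placeholders) of an isolation routine that treats a lock-acquisition timeout or a failed state verification as an explicit failed-isolation condition, rather than assuming success.

```python
# Illustrative sketch only: a hypothetical isolation routine that verifies device state
# after issuing commands and surfaces any failure explicitly, instead of assuming success
# when a lock cannot be acquired.
import threading

def apply_isolation(device: str) -> None:
    """Placeholder for issuing isolation commands to a device (hypothetical)."""
    print(f"isolating {device}")

def read_isolation_state(device: str) -> bool:
    """Placeholder for reading back the device's isolation state (hypothetical)."""
    return True

def isolate_all(devices: list[str], locks: dict[str, threading.Lock], timeout_s: float = 30.0) -> None:
    failed = []
    for device in devices:
        lock = locks[device]
        if not lock.acquire(timeout=timeout_s):
            failed.append(device)        # lock contention: record it, do not assume isolated
            continue
        try:
            apply_isolation(device)
            if not read_isolation_state(device):
                failed.append(device)    # verify, rather than trusting the command path
        finally:
            lock.release()
    if failed:
        # Flag the failed-isolation state so recovery automation will not
        # return this capacity to service before it is production ready.
        raise RuntimeError(f"isolation incomplete: {failed}")

if __name__ == "__main__":
    devices = ["dcr-az03-01", "dcr-az03-02"]  # hypothetical device names
    isolate_all(devices, {d: threading.Lock() for d in devices})
```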
At approximately 21:00 UTC, our fiber provider commenced recovery work on the damaged fiber. Because not all of the impacted capacity had been isolated, at 23:20 UTC, as individual fibers were repaired, our recovery systems began re-sending traffic to the devices connected to the un-isolated capacity – bringing them back into service without safe levels of capacity. This caused traffic congestion that impacted customers as described above.
The traffic congestion within AZ03, due to the tooling failure, triggered an unplanned failure mode on a regional hub router that connects multiple datacenters. By design, our network devices attempt to contain congestive packet loss to capacity that is already impacted. Due to this failure mode, containment failed on a subset of routers – so congestion spread to neighboring capacity on the same regional hub router, beyond AZ03. This containment failure impacted a small subset of traffic from the regional hub router to AZ01 and AZ02.
At this stage, all originally-impacted capacity from the third-party fiber cut was manually isolated from the network – mitigating all customer impact by 00:30 UTC on 19 March. At 01:52 UTC on 19 March the underlying fiber cut was fully repaired, and we completed testing and restoring all capacity to pre-incident levels by 06:50 UTC on 19 March.
How did we respond?
- 13:37 UTC on 18 March 2025 – Customer impact began, triggered by a fiber cut causing network congestion which led to customers experiencing packet drops or intermittent connectivity. Our monitoring systems identified the impact immediately, so our on-call engineers engaged to investigate.
- 13:45 UTC on 18 March 2025 – Our fiber provider was notified of the fiber cut and prepared for dispatch.
- 13:55 UTC on 18 March 2025 – Mitigation efforts began, with engineers identifying the impacted datacenters and redirecting traffic to healthier routes.
- 15:07 UTC on 18 March 2025 – All customers using the East US region were notified about connectivity issues, even if their services were not directly impacted.
- 16:52 UTC on 18 March 2025 – Mitigation efforts were successfully completed. All devices affected by the fiber cut were isolated, and all customer traffic was using healthy paths without experiencing congestion.
- 23:20 UTC on 18 March 2025 – Customer impact recommenced, due to a tooling failure during the capacity repair process of the initial fiber cut.
- 00:30 UTC on 19 March 2025 – This impact was mitigated after isolating the capacity that had been incorrectly returned to service by the tooling failure during the recovery process. Customers and services would have experienced full mitigation.
- 01:52 UTC on 19 March 2025 – The damaged fiber was fully repaired. We continued to monitor our capacity during the recovery process.
- 06:50 UTC on 19 March 2025 – Capacity testing and restoration efforts were completed. The incident was confirmed as mitigated.
How are we making incidents like this less likely or less impactful?
- We are fixing the tooling failure that allowed devices to be returned to service and take traffic before they were production ready. (Estimated completion: May 2025)
- We are expediting a capacity upgrade within the most impacted datacenter, ahead of a planned technology refresh for all datacenters within this region - to de-risk the impact of multiple concurrent failures. (Estimated completion: July 2025)
- In the longer term, we are working to limit the scope of impact further – specifically, to prevent the failure of a device from spreading across availability zones. (Estimated completion: February 2026)
How can customers make incidents like this less impactful?
- For mission-critical workloads, customers should consider a multi-region geodiversity strategy to avoid impact from incidents like this one that predominantly impacted a single region: https://learn.microsoft.com/training/modules/design-a-geographically-distributed-application and https://learn.microsoft.com/azure/architecture/patterns/geodes
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
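As a hedged illustration of the Service Health alerts bullet above, the sketch below creates an Activity Log alert scoped to Service Health events by calling Azure Resource Manager directly from Python. The subscription, resource group, and action group IDs are placeholders, and the request body reflects the Microsoft.Insights/activityLogAlerts schema as we understand it – verify the field names and api-version against the Service Health alerts documentation before relying on it.

```python
# Minimal sketch (placeholders throughout): create an Activity Log alert scoped to
# Service Health events, wired to an existing action group, via the ARM REST API.
# Assumption: api-version 2020-10-01 and the field names below match the current
# Microsoft.Insights/activityLogAlerts schema - verify against the official docs.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION_ID = "<your-subscription-id>"    # placeholder
RESOURCE_GROUP = "<your-resource-group>"      # placeholder
ACTION_GROUP_ID = (                           # placeholder: an existing action group
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}"
    "/providers/microsoft.insights/actionGroups/<your-action-group>"
)

def create_service_health_alert(name: str = "service-health-alert") -> None:
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    url = (
        f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
        f"/resourceGroups/{RESOURCE_GROUP}"
        f"/providers/Microsoft.Insights/activityLogAlerts/{name}"
    )
    body = {
        "location": "Global",
        "properties": {
            "enabled": True,
            "scopes": [f"/subscriptions/{SUBSCRIPTION_ID}"],
            # Fire on any Service Health event in the subscription.
            "condition": {"allOf": [{"field": "category", "equals": "ServiceHealth"}]},
            "actions": {"actionGroups": [{"actionGroupId": ACTION_GROUP_ID}]},
        },
    }
    resp = requests.put(
        url,
        params={"api-version": "2020-10-01"},
        headers={"Authorization": f"Bearer {token}"},
        json=body,
        timeout=30,
    )
    resp.raise_for_status()
    print(f"created or updated alert: {resp.json().get('id')}")

if __name__ == "__main__":
    create_service_health_alert()
```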
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/Z_SZ-NV8
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/ZT0X-9B8
What happened?
Between 16:22 and 18:37 UTC on 3 March 2025, a subset of customers experienced issues when connecting to Microsoft services through our Toronto edge point-of-presence (PoP) location. This impacted customers connecting to Microsoft 365 and/or Azure services - including but not limited to Virtual Machines, Storage, App Service, SQL Database, Azure Kubernetes Service and Azure Portal access. These resources remained available for customers when connections were completed through other network paths.
What went wrong and why?
Our global network connects the datacenters across Azure regions with our large network of edge locations. Our edge PoP location in Toronto has paired network devices for redundancy. At 10:58 UTC, one of the devices experienced a hardware failure – we have since determined this to be due to a memory issue on the device's line cards – and our automation successfully removed the device from rotation.
While we were performing an initial investigation into the cause of that device failure, at 16:22 UTC the redundant device also experienced a hardware failure due to a similar memory issue on its line cards – the line cards restarted but did not initialize properly after startup, so they could not forward traffic. As the paired device had already been taken out of rotation earlier that day, automation was not able to perform its normal mitigating actions, and human intervention was required. As a result, this redundant device remained in rotation, leading to network packets being dropped.
During our investigation and evaluation of mitigation options, we considered failing traffic out of this PoP altogether; however, this would have been particularly impactful to some customer configurations – for example, those with single-homed ExpressRoute connections – so we opted to keep the location in service. Since the hardware failure was limited to part of the device, our network engineers isolated the faulty line card, which restored connectivity at 18:37 UTC.
After further investigation of the hardware failures, we determined that the memory increases were triggered by an increase in the number of network flows through the device, which reached a scaling limit for these line cards. These occurrences of memory exhaustion coincided with a recent, unrelated datacenter network change for specific devices, which began on 27 February 2025 and increased the routing scale; that change has since been paused. We monitor overall memory on the devices; however, memory on the line cards was not monitored specifically, which delayed our overall response and mitigation processes. As such, the overall scope of impact was not evident from our initial investigation, which delayed broad customer communications.
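As a general illustration of the monitoring gap described above (the device names, slots, and thresholds below are hypothetical, not Microsoft's internal tooling), the following sketch checks memory utilization per line card rather than only at the device level, so exhaustion on a single card is surfaced even when aggregate device memory looks healthy.

```python
# Illustrative sketch only (hypothetical names and thresholds): alert on per-line-card
# memory utilization instead of relying solely on device-level memory metrics.
from dataclasses import dataclass

@dataclass
class LineCardSample:
    device: str
    slot: int
    memory_used_pct: float

def check_linecard_memory(samples: list[LineCardSample],
                          warn_pct: float = 80.0,
                          crit_pct: float = 90.0) -> list[str]:
    """Return alert strings for any line card above the warning/critical thresholds."""
    alerts = []
    for s in samples:
        if s.memory_used_pct >= crit_pct:
            alerts.append(f"CRITICAL {s.device} slot {s.slot}: {s.memory_used_pct:.0f}% memory used")
        elif s.memory_used_pct >= warn_pct:
            alerts.append(f"WARNING {s.device} slot {s.slot}: {s.memory_used_pct:.0f}% memory used")
    return alerts

if __name__ == "__main__":
    samples = [
        LineCardSample("edge-router-01", slot=1, memory_used_pct=62.0),
        LineCardSample("edge-router-01", slot=2, memory_used_pct=93.5),  # approaching a flow-scale limit
    ]
    for alert in check_linecard_memory(samples):
        print(alert)
```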
How did we respond?
- 16:22 UTC on 3 March 2025 – Customer impact began, as the redundant device began experiencing the hardware failure, while the paired primary device had already been removed from rotation.
- 16:32 UTC on 3 March 2025 – Our automation attempted to bring the paired primary device back into rotation and, due to system load balancing after route convergence, the impact was marginally reduced.
- 16:43 UTC on 3 March 2025 – First automated alert was raised about potential issues; however, the scope of impact was not evident from the alert or initial investigations.
- 17:19 UTC on 3 March 2025 – Network engineers investigating the issue took steps to remove congestion from the network peer, which reduced impact.
- 17:27 UTC on 3 March 2025 – Additional customer reports were received, after some customers experienced delays submitting support requests via the Azure Portal.
- 17:46 UTC on 3 March 2025 – Additional engineering teams were engaged to investigate and mitigate.
- 17:59 UTC on 3 March 2025 – We began steps to manually remove the redundant device from rotation.
- 18:37 UTC on 3 March 2025 – We isolated the faulty line card from the device. Network traffic was shifted to healthy routes, connectivity was restored, and we monitored the service health to ensure customers had been mitigated.
How are we making incidents like this less likely or less impactful?
- We have scanned and analyzed for the memory issue across our network. (Completed)
- We have updated the configuration on the device, to handle more connections. (Completed)
- We have added monitoring for this hardware subcomponent memory utilization, to detect issues like this more quickly. (Completed)
- We are improving our notification service for similar issues, to notify impacted customers as soon as our internal monitoring detects the issue. (Estimated completion: March 2025)
- In the longer term, we are improving our network design to handle multiple device failures in our edge PoP locations. (Estimated completion for Toronto edge: December 2025)
How can customers make incidents like this less impactful?
- For ExpressRoute high availability, we recommend operating both primary and secondary connections of ExpressRoute circuits in active-active mode: https://learn.microsoft.com/azure/expressroute/designing-for-high-availability-with-expressroute
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/ZT0X-9B8