14 March 2024

Join one of our upcoming 'Azure Incident Retrospective' livestreams about this incident:

What happened?

Between 10:33 UTC on 14 March 2024 and 11:00 UTC on 15 March 2024, customers using Azure services in the South Africa North and/or South Africa West regions may have experienced network connectivity failures, including extended periods of increased latency or packet drops when accessing resources. This incident was part of a broader continental issue, impacting telecom services to multiple countries in Africa.

The incident resulted from multiple concurrent fiber cable cuts on the west coast of Africa (specifically the WACS, MainOne, SAT3, and ACE cables), in addition to earlier ongoing cable cuts on the east coast of Africa (including the EIG and Seacom cables). These cables are part of the submarine cable systems that connect Africa’s internet to the rest of the world and serve Microsoft’s cloud network for our Azure regions in South Africa. In addition to the cable cuts, we later experienced a separate failure that reduced capacity on our backup path, leading to congestion that impacted services.

Some customers may have experienced degraded performance including extended timeouts and/or service failures across multiple Microsoft services – while some customers may have been unaffected. Customer impact varied depending on the service(s), region(s), and configuration(s). Impacted downstream services included Azure API Management, Azure Application Insights, Azure Cognitive Services, Azure Communication Services, Azure Cosmos DB, Azure Databricks, Azure Event Grid, Azure Front Door, Azure Key Vault, Azure Monitor, Azure NetApp Files, Azure Policy, Azure Resource Manager, Azure Site Recovery, Azure SQL DB, Azure Virtual Desktop, Managed identities for Azure resources, Microsoft Entra Domain Services, Microsoft Entra Global Secure Access, Microsoft Entra ID, and Microsoft Graph. For service specific impact details, refer to the ‘Health history’ section of Azure Service Health within the Azure portal.

What went wrong, and why?

The Microsoft network is designed to withstand multiple concurrent failures of our Wide Area Network (WAN) capacity. Specifically, our regions in South Africa are connected via multiple diverse physical paths – both subsea and terrestrially within South Africa – and the network is designed to keep operating on a single remaining physical path. In this case, our South Africa regions are served by four physically diverse subsea cable systems, and the designed failure mode is that three of the four can fail with no impact to our customers.

Following news of geopolitical risks in the Red Sea, we ran internal simulations and capacity planning analysis. On 5 February, we initiated capacity additions to our African network. On 24 February, multiple cable cuts in the Red Sea impacted our east coast network capacity to Africa. This east coast capacity was unavailable; however, there was no customer impact because of the built-in redundancy.

Before our capacity additions from February had come online, on 14 March we experienced multiple additional concurrent fiber cable cuts, this time on the west coast of Africa – which further reduced the total network capacity for our Azure regions in South Africa. These cable cuts were due to a subsea seismic event (likely an earthquake and/or mudslide) which impacted multiple subsea systems – one of which is used by Microsoft. Additionally, after the west coast cable cuts had occurred, we experienced a line card optic failure on a Microsoft router inside the region that further reduced network headroom. Microsoft experiences hundreds of line card optic failures every day across the 500k+ devices that operate our network – such an event would normally have been invisible to our customers. However, the combination of concurrent cable cuts and this line card failure removed the necessary headroom on the failover path, which led to the congestion experienced.
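To make the failure-mode arithmetic above concrete, the sketch below models a set of physically diverse paths and checks whether the surviving capacity still covers peak demand after a given combination of failures. All path names and capacity figures are hypothetical, chosen only to illustrate the headroom concept, and do not reflect Microsoft's actual topology or numbers.

```python
# Hypothetical illustration of WAN headroom under concurrent failures.
# Path names and capacity figures are invented for this sketch only.

PATHS_GBPS = {
    "east_coast_a": 400,   # e.g. lost in the Red Sea cable cuts
    "east_coast_b": 400,
    "west_coast_a": 400,   # e.g. lost in the west coast seismic event
    "west_coast_b": 400,   # failover path, later degraded by a line card failure
}

PEAK_DEMAND_GBPS = 350


def surviving_capacity(failed_paths, degraded=None):
    """Total capacity left after full path failures and partial degradations."""
    degraded = degraded or {}
    total = 0.0
    for name, capacity in PATHS_GBPS.items():
        if name in failed_paths:
            continue
        total += capacity * degraded.get(name, 1.0)
    return total


def check(label, failed, degraded=None):
    remaining = surviving_capacity(failed, degraded)
    status = "OK" if remaining >= PEAK_DEMAND_GBPS else "CONGESTED"
    print(f"{label}: {remaining:.0f} Gbps vs {PEAK_DEMAND_GBPS} Gbps peak -> {status}")


# Design intent: any three of the four paths can fail with no customer impact.
check("three paths down", failed={"east_coast_a", "east_coast_b", "west_coast_a"})

# Incident shape: three paths effectively down *and* the remaining failover
# path loses headroom to a line card optic failure, tipping it into congestion.
check("three paths down + degraded failover path",
      failed={"east_coast_a", "east_coast_b", "west_coast_a"},
      degraded={"west_coast_b": 0.8})
```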

This combination of events affected Azure services including Compute, Storage, Networking, Databases, and App Services – as well as Microsoft 365 services. While many customers leverage local instances of their services within the South Africa regions, some services rely on API calls made to regions outside of South Africa. The reduced bandwidth to/from the South Africa regions impacted these specific API calls, and therefore impacted service availability and/or performance.
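As an illustration of how a dependency on cross-region API calls can be made more tolerant of a congested link, here is a minimal Python sketch that wraps such a call with a timeout and falls back to locally cached data. The function and cache names are hypothetical and not part of any Azure SDK.

```python
import time
from typing import Callable, Optional

# Hypothetical sketch (not an Azure SDK API): wrap a cross-region control-plane
# call with a short timeout and fall back to locally cached data, so a slow or
# congested inter-region link degrades the caller gracefully instead of failing it.

_cache: dict[str, tuple[float, object]] = {}
CACHE_TTL_SECONDS = 300          # how long slightly stale data is acceptable


def call_with_fallback(key: str, remote_call: Callable[[float], object],
                       timeout_seconds: float = 2.0) -> Optional[object]:
    try:
        # remote_call is expected to honour the timeout it is given,
        # e.g. an HTTP client call with an explicit timeout parameter.
        result = remote_call(timeout_seconds)
        _cache[key] = (time.monotonic(), result)
        return result
    except Exception:
        cached = _cache.get(key)
        if cached and time.monotonic() - cached[0] <= CACHE_TTL_SECONDS:
            return cached[1]     # serve slightly stale local data during the congestion
        return None              # caller decides whether to fail open or closed
```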

How did we respond?

The timeline that follows includes network availability figures, which represent the breadth of impact to our network capacity but may not represent the impact experienced by any specific customer or service.

  • 3 February 2024 – News articles surfaced geopolitical risk to Red Sea subsea cable infrastructure.
  • 5 February 2024 – Based on our internal simulations, we began the process of requesting capacity augments to Microsoft’s west coast Africa network.
  • 24 February 2024 – Multiple cable cuts in the Red Sea impacted east coast capacity (EIG and Seacom cables), no impact to customers/services.
  • 4 March 2024 – Local fiber providers began work on approved capacity augments.
  • 14 March 2024 @ 10:02 UTC – Multiple cable cuts impacted west coast capacity (WACS, MainOne, and SAT3).
  • 14 March 2024 @ 10:33 UTC – Customer impact began as the reduced capacity caused networking latency and packet drops; our on-call engineers began investigating. Network availability dropped as low as 77%.
  • 14 March 2024 @ 11:55 UTC – Azure Front Door failed out of the region, to reduce inter-region traffic.
  • 14 March 2024 @ 12:00 UTC – Individual cloud service teams began reconfigurations to optimize network traffic to reduce congestion.
  • 14 March 2024 @ 15:44 UTC – After the combination of our mitigation efforts and the end of the business day in Africa, network traffic volume reduced – network availability rose above 97%.
  • 14 March 2024 @ 16:25 UTC – We continued implementing traffic engineering measures to throttle traffic and reduce congestion – network availability rose above 99%.
  • 15 March 2024 @ 06:00 UTC – As network traffic volumes increased, availability degraded, and customers began experiencing congestive packet loss – network availability dropped to 96%.
  • 15 March 2024 @ 11:00 UTC – We shifted capacity from Microsoft's Edge in Lagos to increase headroom for South Africa, and the last packet drops were observed on our WAN. While this effectively mitigated customer impact, we continued to monitor until additional capacity supported more headroom.
  • 17 March 2024 @ 21:00 UTC – First tranche of emergency capacity came online.
  • 18 March 2024 @ 02:00 UTC – Second tranche of emergency capacity came online, Azure Front Door brought back into our South Africa regions, incident declared mitigated.

How are we making incidents like this less likely or less impactful?

  • We have added Wide Area Network (WAN) capacity to the region, in the form of a new physically diverse cable system with triple the capacity of pre-incident levels (Completed).
  • We are reviewing our capacity augmentation processes to help accelerate urgent capacity additions when needed (Estimated completion: April 2024).
  • We continue to work with our fiber providers to restore WAN paths after the cable cuts on the west coast of Africa (Estimated completion: April 2024) and on the east coast of Africa (Estimated completion: May 2024).
  • We are evaluating adding a fifth WAN path between South Africa and the United Arab Emirates, to build even more resiliency to the rest of the world (Estimated completion: June 2024).
  • We are increasingly shifting services to run locally from within our South Africa regions, to reduce dependencies on international regions where possible, including Exchange Online Protection (Estimated completion: June 2024).
  • In the longer term, we are investing in WAN Gateways in Nigeria to improve our fault isolation and routing capabilities. (Estimated completion: December 2024)
  • Finally, we are working to build out and activate Microsoft-owned fiber capacity to these regions, to reduce dependencies on local fiber providers. This includes investments in our own capacity on the new submarine cables going to Africa (specifically the Equiano, 2Africa East and West) which will exponentially increase capacity to serve our regions in South Africa. Importantly, this capacity will also be controlled by Microsoft – giving us more operational flexibility to add/change/move capacity in our WAN, versus relying on third-party telecom operators. These WAN fiber investments on new cable systems will land on the west coast of Africa (Estimated completion: December 2024) as well as on the east coast of Africa (Estimated completion: December 2025).

How can our customers and partners make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

21 January 2024

Watch our 'Azure Incident Retrospective' video about this incident:

What happened?

Between 01:30 and 08:58 UTC on 21 January 2024, customers attempting to leverage Azure Resource Manager (ARM) may have experienced issues when performing resource management operations. This impacted ARM calls that were made via Azure CLI, Azure PowerShell and the Azure portal. While the impact was predominantly experienced in Central US, East US, South Central US, West Central US, and West Europe, impact may have been experienced to a lesser degree in other regions due to the global nature of ARM. 

This incident also impacted downstream Azure services which depend upon ARM for their internal resource management operations – including Analysis Services, Azure Container Registry, API Management, App Service, Backup, Bastion, CDN, Center for SAP solutions, Chaos Studio, Data Factory, Database for MySQL flexible servers, Database for PostgreSQL, Databricks, Device Update for IoT Hub, Event Hubs, Front Door, Key Vault, Log Analytics, Migrate, Relay, Service Bus, SQL Database, Storage, Synapse Analytics, and Virtual Machines.

In several cases, data plane impact on downstream Azure services was the result of dependencies on ARM for retrieval of Role Based Access Control (RBAC) data. For example, services including Storage, Key Vault, Event Hub, and Service Bus rely on ARM to download RBAC authorization policies. During this incident, these services were unable to retrieve updated RBAC information and, once the cached data expired, these services failed, rejecting incoming requests in the absence of up-to-date access policies. In addition, several internal offerings depend on ARM to support on-demand capacity and configuration changes, leading to degradation and failure when ARM was unable to process their requests.
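The fail-closed pattern described above can be sketched in a few lines: a service caches authorization policy with a TTL and, once the cache expires and cannot be refreshed, rejects requests rather than guessing. This is a generic illustration under assumed names, not the actual implementation used by Storage, Key Vault, Event Hub, or Service Bus.

```python
import time

# Generic illustration of the fail-closed authorization pattern described above.
# Names are hypothetical; this is not the implementation of any Azure service.

class PolicyCache:
    def __init__(self, fetch_policy, ttl_seconds=3600):
        self._fetch_policy = fetch_policy   # e.g. a control-plane call to download RBAC data
        self._ttl = ttl_seconds
        self._policy = None                 # mapping of principal -> set of allowed actions
        self._fetched_at = 0.0

    def _refresh_if_needed(self):
        expired = time.monotonic() - self._fetched_at > self._ttl
        if self._policy is None or expired:
            try:
                self._policy = self._fetch_policy()
                self._fetched_at = time.monotonic()
            except Exception:
                if expired:
                    self._policy = None     # refresh failed *and* cached data is too old

    def authorize(self, principal, action):
        self._refresh_if_needed()
        if self._policy is None:
            # Fail closed: without up-to-date policy, reject rather than guess.
            raise PermissionError("authorization policy unavailable")
        return action in self._policy.get(principal, set())
```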

What went wrong and why?

In June 2020, ARM deployed a private preview integration with Entra Continuous Access Evaluation. This feature supports continuous access evaluation for ARM, and was only enabled for a small set of tenants and private preview customers. Unbeknownst to us, this preview feature of the ARM CAE implementation contained a latent code defect that caused issues when authentication to Entra failed. The defect would cause ARM nodes to fail on startup whenever ARM could not authenticate to an Entra tenant enrolled in the preview.

On 21 January 2024, an internal maintenance process made a configuration change to an internal tenant which was enrolled in this preview. This triggered the latent code defect and caused ARM nodes, which are designed to restart periodically, to fail repeatedly upon startup. ARM nodes restart periodically by design, to account for automated recovery from transient changes in the underlying platform, and to protect against accidental resource exhaustion such as memory leaks.
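A hedged sketch of the failure shape described in the two paragraphs above: if a node's startup sequence treats a failure against any single preview tenant as fatal, the routine restart cycle turns one tenant's configuration problem into a steady loss of serving capacity. The function names are illustrative only; this is not ARM's actual startup code.

```python
# Illustrative only - not ARM's actual startup code. Shows how treating one
# tenant's auth failure as fatal during startup, combined with routine periodic
# restarts, steadily drains serving capacity across a fleet.

class StartupError(Exception):
    pass


def start_node_fragile(preview_tenants, authenticate):
    for tenant in preview_tenants:
        # Latent defect pattern: any single failed tenant aborts the whole node,
        # so every periodic restart after the bad configuration change fails.
        if not authenticate(tenant):
            raise StartupError(f"auth failed for tenant {tenant}")
    return "serving"


def start_node_resilient(preview_tenants, authenticate):
    # Safer pattern (compare the repair item below about proceeding with node
    # restart when a tenant-specific call fails): log and skip the failing
    # tenant, and keep the node in the serving pool.
    deferred = [t for t in preview_tenants if not authenticate(t)]
    if deferred:
        print(f"warning: deferring tenants {deferred}, node still serving")
    return "serving"
```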

Due to these ongoing node restarts and failed startups, ARM began experiencing a gradual loss in capacity to serve requests. Eventually this overwhelmed the remaining ARM nodes, creating a self-reinforcing feedback loop (increased load resulted in increased timeouts, leading to increased retries and a corresponding further increase in load) that caused a rapid drop in availability. Over time, this impact was experienced in additional regions – predominantly affecting East US, South Central US, Central US, West Central US, and West Europe.
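The feedback loop can be made concrete with a toy simulation: when most failed requests are retried immediately, offered load grows with the failure rate, which in turn grows with load. The numbers below are arbitrary and only illustrate the dynamic, not actual ARM traffic.

```python
# Toy simulation of the retry feedback loop described above. The numbers are
# arbitrary; the point is the shape of the curve, not the exact values.

def simulate(label, capacity_rps, base_demand_rps, retries_per_failure, steps=6):
    offered = base_demand_rps
    for step in range(steps):
        served = min(offered, capacity_rps)
        failed = offered - served
        # Failed requests come back as retries on top of the new demand.
        offered = base_demand_rps + failed * retries_per_failure
        print(f"{label} step {step}: offered={offered:.0f} rps, failed={failed:.0f} rps")


# Aggressive retries: each failure generates ~1.5 retries in the next window,
# so offered load compounds and availability collapses.
simulate("aggressive", capacity_rps=1000, base_demand_rps=1100, retries_per_failure=1.5)

# Backoff with jitter (modelled crudely as only a fraction of failures retrying
# within the window) lets offered load converge instead of spiralling.
simulate("backoff", capacity_rps=1000, base_demand_rps=1100, retries_per_failure=0.2)
```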

How did we respond?

At 01:59 UTC, our monitoring detected a decrease in availability, and we began an investigation. Automated communications to a subset of impacted customers began shortly thereafter and, as impact to additional regions became better understood, we decided to communicate publicly via the Azure Status page. By 04:25 UTC we had correlated the preview feature to the ongoing impact. We mitigated by making a configuration change to disable the feature. The mitigation began to roll out at 04:51 UTC, and ARM recovered in all regions except West Europe by 05:30 UTC.

The recovery in West Europe was slowed by a retry storm from failed ARM calls, which increased traffic in West Europe by over 20x and caused CPU spikes on our ARM instances. Because most of this traffic originated from trusted internal systems, by default we allowed it to bypass the throughput restrictions that would normally have throttled such traffic. We increased throttling of these requests in West Europe, which eventually relieved the CPU pressure and enabled ARM to recover in the region by 08:58 UTC, at which point the underlying ARM incident was fully mitigated.
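One way to avoid the gap described above, where trusted internal traffic bypassed the throttles, is to rate-limit every caller class and simply give internal callers a larger budget rather than an exemption. The sketch below is a generic token-bucket limiter with hypothetical budgets, not the actual ARM throttling implementation.

```python
import time

# Generic token-bucket limiter: every caller class is bounded, with internal
# callers given a larger budget rather than an exemption. Budgets are
# hypothetical; this is not ARM's actual throttling implementation.

class TokenBucket:
    def __init__(self, rate_per_second: float, burst: float):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = burst
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


LIMITS = {
    "internal": TokenBucket(rate_per_second=500, burst=1000),  # trusted, but still bounded
    "external": TokenBucket(rate_per_second=100, burst=200),
}


def admit(caller_class: str) -> bool:
    """Admit a request only if its caller class still has budget."""
    return LIMITS[caller_class].allow()
```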

The vast majority of downstream Azure services recovered shortly thereafter. Specific to Key Vault, we identified a latent bug which resulted in application crashes when latency to ARM from the Key Vault data plane was persistently high. This extended the impact for Vaults in East US and West Europe, beyond the vaults that opted into Azure RBAC.

  • 20 January 2024 @ 21:00 UTC – An internal maintenance process made a configuration change to an internal tenant enrolled in the CAE private preview.
  • 20 January 2024 @ 21:16 UTC – The first ARM roles started experiencing startup failures, with no customer impact as ARM still had sufficient capacity to serve requests.
  • 21 January 2024 @ 01:30 UTC – Initial customer impact due to continued capacity loss in several large ARM regions.
  • 21 January 2024 @ 01:59 UTC – Monitoring detected additional failures in the ARM service, and on-call engineers began immediate investigation.
  • 21 January 2024 @ 02:23 UTC – Automated communications to impacted customers started.
  • 21 January 2024 @ 03:04 UTC – Additional ARM impact was detected in East US and West Europe.
  • 21 January 2024 @ 03:24 UTC – Due to additional impact identified in other regions, we raised the severity of the incident, and engaged additional teams to assist in troubleshooting.
  • 21 January 2024 @ 03:30 UTC – Additional ARM impact was detected in South Central US.
  • 21 January 2024 @ 03:57 UTC – We posted broad communications via the Azure Status page.
  • 21 January 2024 @ 04:25 UTC – The causes of impact were understood, and a mitigation strategy was developed.
  • 21 January 2024 @ 04:51 UTC – We began the rollout of this configuration change to disable the preview feature. 
  • 21 January 2024 @ 05:30 UTC – ARM recovered in all regions except West Europe.
  • 21 January 2024 @ 08:58 UTC – ARM recovered in West Europe, mitigating the vast majority of customer impact, except for specific services that took more time to recover.
  • 21 January 2024 @ 09:28 UTC – Key Vault recovered instances in West Europe by adding new scale sets to replace the VMs that had crashed due to the code bug.

How are we making incidents like this less likely or less impactful?

  • Our ARM team have already disabled the preview feature through a configuration update. (Completed)
  • We have offboarded all tenants from the CAE private preview, as a precaution. (Completed)
  • Our Entra team improved the rollout of that type of per-tenant configuration change to wait for multiple input signals, including from canary regions. (Completed)
  • Our Key Vault team has fixed the code that resulted in applications crashing when they were unable to refresh their RBAC caches. (Completed)
  • We are gradually rolling out a change to proceed with node restart when a tenant-specific call fails. (Estimated completion: February 2024)
  • Our ARM team will audit dependencies in role startup logic to de-risk scenarios like this one. (Estimated completion: February 2024)
  • Our ARM team will leverage Azure Front Door to dynamically distribute traffic for protection against retry storm or similar events. (Estimated completion: February 2024)
  • We are improving monitoring signals on role crashes for reduced time spent on identifying the cause(s), and for earlier detection of availability impact. (Estimated completion: February 2024)
  • Our Key Vault, Service Bus and Event Hub teams will migrate to a more robust implementation of the Azure RBAC system that no longer relies on ARM and is regionally isolated with standardized implementation. (Estimated completion: February 2024)
  • Our Container Registry team are building a solution to detect and auto-fix stale network connections, to recover more quickly from incidents like this one. (Estimated completion: February 2024)
  • Finally, our Key Vault team are adding better fault injection tests and detection logic for RBAC downstream dependencies. (Estimated completion: March 2024).

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: