16 September 2023

What happened?

Between 07:24 and 19:00 UTC on 16 September 2023, a subset of customers using Virtual Machines (VMs) in the East US region experienced connectivity issues. This incident was triggered when a number of scale units within one of the datacenters in one of the Availability Zones lost power and, as a result, the nodes in these scale units rebooted. While the majority rebooted successfully, a subset of these nodes failed to come back online automatically. This issue caused downstream impact to services that were dependent on these VMs - including SQL Databases, Service Bus and Event Hubs. Impact varied by service and configuration:

  • Virtual Machines were offline during this time. While recovery began at approximately 16:30 UTC, full mitigation was declared at 19:00 UTC. 
  • While the vast majority of zone-redundant Azure SQL Databases were not impacted, some customers using proxy mode connections may have experienced impact, due to one connectivity gateway not being configured with zone resilience.
  • SQL Databases with ‘auto-failover groups’ enabled were failed out of the region, incurring approximately eight hours of downtime prior to the failover completing.
  • Customers with ‘active geo-replication’ enabled were able to initiate a manual failover to an alternative region to restore availability.
  • The majority of SQL Databases were recovered no later than 19:00 UTC. Customers would have seen gradual recovery over time during mitigation efforts.
  • Finally, non-zonal deployments of Service Bus and Event Hubs would have experienced a degradation. Zonal deployments of Service Bus and Event Hubs were unaffected.

 What went wrong and why?

It is not uncommon for datacenters to experience an intermittent loss of power, and one of the ways we protect against this is by leveraging Uninterruptible Power Supplies (UPS). The role of the UPS is to provide stable power to infrastructure during short periods of power fluctuations, so that infrastructure does not fault or go offline. Although we have redundant UPS systems in place for added resilience, this incident was initially triggered by a rectifier failure on the primary UPS.

The UPS was connected to three Static Transfer Switches (STS) – which are designed to transfer power loads between independent and redundant power sources, without interruption. The STS is designed to remain on the primary source whenever possible, and to transfer back to it when stable power is available again. When the UPS rectifier failed, the STS successfully transferred to the redundant UPS – but then the primary UPS recovered temporarily, albeit in a degraded state. In this degraded state, the primary UPS is unable to provide stable power for the full load. So, after a 5-second retransfer delay, when the STS transferred from the redundant UPS back to the primary UPS, the primary UPS failed completely.

While the STS should then have transferred power back to the redundant UPS, the STS has logic designed to stagger these power transfers when there are multiple transfers (to and from the primary and redundant UPS) happening in a short period of time. This logic prevented the STS from transferring back to the redundant power source after the primary UPS failed completely, which ultimately caused a power loss to a subset of the scale units within the datacenter – at 07:24 UTC, for 1.9 seconds. This scenario of repeated load transfers to and from a degraded UPS over a short period of time was not accounted for in the design. After 1.9 seconds, the load moved to the redundant source automatically for a final time. Our onsite datacenter team validated that stable power was feeding all racks immediately after the event, and verified that all devices were powered on.
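
To make this failure mode more concrete, here is a minimal sketch of the kind of stagger rule described above – the class, thresholds, and window are illustrative assumptions, not the vendor's actual firmware logic:

```python
from dataclasses import dataclass, field

@dataclass
class StaticTransferSwitch:
    """Simplified STS model: prefers the primary source, retransfers to it after
    a delay, and staggers further transfers if several occur in a short window."""
    retransfer_delay_s: float = 5.0       # delay before returning to the primary source
    max_transfers_in_window: int = 2      # assumed stagger threshold
    window_s: float = 30.0                # assumed stagger window
    active_source: str = "primary"
    transfer_times: list[float] = field(default_factory=list)

    def request_transfer(self, now_s: float, target: str) -> bool:
        recent = [t for t in self.transfer_times if now_s - t <= self.window_s]
        if len(recent) >= self.max_transfers_in_window:
            # Stagger logic: too many recent transfers, so hold the current source.
            # In this incident, that meant staying on a primary UPS that had failed.
            return False
        self.transfer_times.append(now_s)
        self.active_source = target
        return True

sts = StaticTransferSwitch()
print(sts.request_transfer(0.0, "redundant"))  # rectifier fails -> transfer succeeds
print(sts.request_transfer(5.0, "primary"))    # retransfer after the 5 s delay -> succeeds
print(sts.request_transfer(6.0, "redundant"))  # primary fails completely -> held (False)
```

In this simplified model, the third transfer request within the window is held, mirroring how the real STS remained on the failed primary source until the final automatic transfer.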

Following the restoration of power, our SQL monitoring immediately observed customer impact, and automatic communications were sent to customers within 12 minutes. SQL telemetry also provided our first indication that some nodes were stuck during the boot up process. When compute nodes come online, they first check the network connectivity, then make multiple attempts to communicate with the preboot execution environment (PXE) server, to ensure that the correct network routing protocols can be applied. If the host cannot find a PXE server, it is designed to retry indefinitely until one becomes available so it can complete the boot process.
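
As a rough illustration of the intended boot behavior described above (not the actual BIOS or firmware code – the helper names are hypothetical), a node keeps retrying until a PXE server responds:

```python
import time

def try_contact_pxe_server() -> bool:
    """Stand-in for the real firmware call that attempts to reach a PXE server."""
    return False  # placeholder: no PXE server reachable yet

def complete_network_boot(retry_interval_s: float = 5.0) -> None:
    """Intended behavior: keep retrying until a PXE server responds, then finish
    booting. The BIOS bug described below effectively stopped retrying, which
    left the affected nodes stuck."""
    while not try_contact_pxe_server():
        time.sleep(retry_interval_s)
    # ...apply network routing configuration and continue the boot process
```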

A previously discovered bug that applied to some of our BIOS software led to several hosts not retrying to connect to a PXE server, and remaining in a stuck state. Although this was a known issue, the initial symptoms led us to believe that there was a potential issue with the network and/or our PXE servers – troubleshooting these symptoms led to significant delays in correlating to the known BIOS issue. While multiple teams were engaged to help troubleshoot these issues, our attempts at force rebooting multiple nodes were not successful. As such, a significant amount of time was spent exploring additional mitigation options. Unbeknownst to our on-call engineering team, these bulk reboot attempts were blocked by an internal approval process, which has been implemented as a safety measure to restrict the number of nodes that are allowed to be force rebooted at one time. Once we understood all of the factors inhibiting mitigation, at around 16:30 UTC we proceeded to reboot the relevant nodes within the safety thresholds, which mitigated the BIOS issue successfully.
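
The safety constraint the team ran into can be pictured with a small sketch – the batch limit and function name below are assumptions for illustration, not our internal tooling:

```python
def plan_force_reboots(node_ids: list[str], batch_limit: int = 10) -> list[list[str]]:
    """Hypothetical guardrail: force-reboots are only executed in batches of at most
    `batch_limit` nodes, with anything larger held for explicit approval. Splitting
    the work into compliant batches is roughly what allowed mitigation to proceed
    once the constraint was understood."""
    return [node_ids[i:i + batch_limit] for i in range(0, len(node_ids), batch_limit)]

batches = plan_force_reboots([f"node-{n:03d}" for n in range(37)])
print(len(batches), "batches of at most 10 nodes each")  # 4 batches
```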

One of the mechanisms our platform employs when VMs enter an unhealthy state is ‘service healing’, in which the platform automatically redeploys or migrates them to healthy nodes. One of the prerequisites for initiating service healing is that a high percentage of nodes be healthy – this ensures that, during a major incident, our self-healing systems do not exacerbate the situation. Once we had recovered past the safe threshold, the service healing mechanism initiated for the remainder of the nodes.
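
Conceptually, the prerequisite works like the following check – the 90% threshold is an illustrative assumption, not the platform's actual value:

```python
def service_healing_allowed(healthy_nodes: int, total_nodes: int,
                            min_healthy_fraction: float = 0.9) -> bool:
    """Hypothetical prerequisite check: only start automated redeployment
    ('service healing') when most of the fleet is healthy, so that self-healing
    does not pile additional churn onto an already-degraded cluster.
    The 0.9 threshold is illustrative."""
    return total_nodes > 0 and healthy_nodes / total_nodes >= min_healthy_fraction

print(service_healing_allowed(healthy_nodes=700, total_nodes=1000))  # False: below threshold
print(service_healing_allowed(healthy_nodes=950, total_nodes=1000))  # True: healing may proceed
```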

Throughout this incident, we did not have adequate alerting in place, and could not determine which specific VMs were impacted, because our assessment tooling relies on a heartbeat emitted from the compute nodes, which were stuck during the boot-up process. Unfortunately, the time taken to understand the nature of this incident meant that communications were delayed. For customers using Service Bus and Event Hubs, this delay was multiple hours. For customers using Virtual Machines, it was multiple days. As such, we are investigating several communications-related repairs, including why automated communications were not able to inform customers with impacted VMs in near real time, as expected.
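
To illustrate why heartbeat-based assessment struggles in this scenario, here is a hypothetical sketch – the function, thresholds, and data shape are assumptions, not our actual tooling:

```python
from datetime import datetime, timedelta, timezone

def impacted_vms(last_heartbeat: dict[str, datetime | None],
                 stale_after: timedelta = timedelta(minutes=5)) -> list[str]:
    """Illustrative heartbeat-based assessment: flag VMs whose heartbeat is stale
    or has never been seen. Tooling that only considers nodes with a recent but
    degraded heartbeat can miss hosts that were stuck in boot and never emitted
    one at all - the gap described above."""
    now = datetime.now(timezone.utc)
    return [vm for vm, beat in last_heartbeat.items()
            if beat is None or now - beat > stale_after]

print(impacted_vms({
    "vm-a": datetime.now(timezone.utc),                       # healthy
    "vm-b": datetime.now(timezone.utc) - timedelta(hours=2),  # stale heartbeat
    "vm-c": None,                                             # stuck in boot, never reported
}))  # ['vm-b', 'vm-c']
```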

How did we respond? 

  • 16 September 2023 @ 07:23 UTC - Loss of power to the three STSs.
  • 16 September 2023 @ 07:24 UTC - All three downstream STSs fully re-energized.
  • 16 September 2023 @ 07:33 UTC - Initial customer impact to SQL DB detected via monitoring.
  • 16 September 2023 @ 07:34 UTC - Communications sent to Azure Service Health for SQL DB customers.
  • 16 September 2023 @ 11:40 UTC - The relevant compute deployment team engaged to assist in rebooting nodes.
  • 16 September 2023 @ 12:13 UTC - The infrastructure firmware team was engaged to troubleshoot the BIOS issues.
  • 16 September 2023 @ 13:38 UTC - Attempted to forcefully reboot multiple compute nodes, without success.
  • 16 September 2023 @ 15:30 UTC - SQL Databases with ‘auto-failover groups’ were successfully failed over.
  • 16 September 2023 @ 15:37 UTC - Communications sent to Azure Service Health for Service Bus and Event Hub customers.
  • 16 September 2023 @ 16:30 UTC - Safety thresholds blocking reboot attempts understood, successful batch rebooting begins.
  • 16 September 2023 @ 16:37 UTC - Communications published to Azure Status page, in lieu of more accurate impact assessment.
  • 16 September 2023 @ 19:00 UTC - All compute and SQL nodes successfully mitigated.
  • 16 September 2023 @ 22:07 UTC - Once mitigation was validated, communications sent to Azure Service Health for SQL, Service Bus, and Event Hub customers.
  • 19 September 2023 @ 04:10 UTC – Once VM impact was determined, communications sent to Azure Service Health for VM customers. 

How are we making incidents like this less likely or less impactful?

  • First and foremost, we have replaced the failed rectifier inside the UPS. (Completed)
  • We are working with the manufacturer to perform a UPS rectifier failure analysis. (Estimated completion: October 2023)
  • We are reviewing the status of STS automated transfer logic across all of our datacenters. (Estimated completion: October 2023)
  • We are working to modify the STS logic to correct the transfer delay issue. (Estimated completion: December 2023)
  • We have been deploying the fix for the BIOS issue since January 2023 – we are expediting its rollout. (Estimated completion: June 2024)
  • We are improving our detection of stuck nodes for incidents of this class. (Estimated completion: October 2023)
  • We are improving our automated mitigation of stuck nodes for incidents of this class. (Estimated completion: March 2024)
  • We are improving the resiliency of our automated communication system for incidents of this class. (Estimated completion: October 2023)
  • For the issue surrounding Multi-AZ Azure SQL Databases using a proxy mode connection, the fix was already underway before this incident and has since been deployed. (Completed)

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: 

30 August 2023

Watch our 'Azure Incident Retrospective' video about this incident:

What happened?

Starting at approximately 10:30 UTC on 30 August 2023, customers may have experienced issues accessing or using Azure, Microsoft 365 and Power Platform services. This incident was triggered by a utility power sag at 08:41 UTC on 30 August 2023, which impacted one of the three Availability Zones of the Australia East region. This power sag tripped a subset of the cooling system chiller units offline and, while we worked to restore cooling, temperatures in the datacenter increased to levels above operational thresholds. We powered down a small subset of selected compute and storage scale units, both to lower temperatures and to prevent damage to hardware. Although the vast majority of services recovered by 22:40 UTC on 30 August 2023, full mitigation was not achieved until 20:00 UTC on 3 September 2023 – as some services experienced prolonged impact, predominantly as a result of dependencies on recovering subsets of Storage, SQL Database, and/or Cosmos DB services.

Multiple Azure services were impacted by this incident – including Azure Active Directory (AAD), Azure Active Directory B2C, Azure Active Directory Conditional Access, Azure Active Directory Connect Health, Azure Active Directory MyApps, Azure Activity Logs & Alerts, Azure API Management, Azure App Service, Azure Application Insights, Azure Arc enabled Kubernetes, Azure API for FHIR, Azure Backup, Azure Batch, Azure Chaos Studio, Azure Container Apps, Azure Container Registry, Azure Cosmos DB, Azure Databricks, Azure Data Explorer, Azure Data Factory, Azure Database for MySQL flexible servers, Azure Database for PostgreSQL flexible servers, Azure Digital Twins, Azure Device Update for IoT Hub, Azure Event Hubs, Azure ExpressRoute, Azure Health Data Services, Azure HDInsight, Azure IoT Central, Azure IoT Hub, Azure Kubernetes Service (AKS), Azure Logic Apps, Azure Log Analytics, Azure Log Search Alerts, Azure NetApp Files, Azure Notification Hubs, Azure Redis Cache, Azure Relay, Azure Resource Manager (ARM), Azure Role Based Access Control (RBAC), Azure Search, Azure Service Bus, Azure Service Fabric, Azure SQL Database, Azure Storage, Azure Stream Analytics, Azure Virtual Machines, Microsoft Purview, and Microsoft Sentinel.

What went wrong and why?

Starting at approximately 08:41 UTC on 30 August 2023, a utility voltage sag was caused by a lightning strike on electrical infrastructure approximately 18 miles from the impacted Availability Zone of the Australia East region. The voltage sag caused cooling system chillers for multiple datacenters to shut down. While some chillers automatically restarted, 13 failed to restart and required manual intervention. To do so, the onsite team accessed the datacenter rooftop facilities, where the chillers are located, and proceeded to sequentially restart chillers moving from one datacenter to the next. By the time the team reached the final five chillers requiring a manual restart, the water inside the pump system for these chillers (chilled water loop) had reached temperatures that were too high to allow them to be restarted. In this scenario, the restart is inhibited by a self-protection mechanism that acts to prevent damage to the chiller that would occur by processing water at the elevated temperatures. The five chillers that could not be restarted supported cooling for the two adjacent data halls which were impacted in this incident.
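
The self-protection behavior described above amounts to a simple eligibility check; the temperature threshold below is an assumed value for illustration only:

```python
def chiller_restart_permitted(chilled_water_temp_c: float,
                              max_safe_restart_temp_c: float = 25.0) -> bool:
    """Illustrative self-protection rule (the threshold is an assumed value):
    a chiller refuses to restart while the chilled water loop is above a safe
    temperature, because processing overheated water could damage the unit.
    This is the condition that blocked the final five chillers from restarting."""
    return chilled_water_temp_c <= max_safe_restart_temp_c

print(chiller_restart_permitted(21.0))  # True: restart allowed
print(chiller_restart_permitted(32.0))  # False: restart inhibited until the loop cools
```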

The two impacted data halls require at least four chillers to be operational. Before the voltage sag, our cooling capacity for these halls consisted of seven chillers, with five chillers in operation and two chillers in standby. At 10:30 UTC some networking, compute, and storage infrastructure began to shut down automatically as data hall temperatures increased, impacting service availability. At 11:34 UTC, as temperatures continued to increase, our onsite datacenter team began a remote shutdown of remaining networking, compute, and storage infrastructure in the impacted data halls to protect data durability and infrastructure health, and to address the thermal runaway. This shutdown allowed the chilled water loop to return to a safe temperature, which allowed us to restart the chillers. This shutdown of infrastructure resulted in a further reduction of service availability for this Availability Zone. 

The chillers were successfully brought back online at 12:12 UTC, and data hall temperatures returned to operational thresholds by 13:30 UTC. Power was then restored to the affected infrastructure and a phased process to bring the infrastructure back online commenced. All power to infrastructure was restored by 15:10 UTC. Once all networking and storage infrastructure had power restored, dependent compute scale units were then also returned to operation. As the underlying compute and storage scale units came online, dependent Azure services started to recover, but some services experienced issues coming back online.

From a storage perspective, seven storage scale units were impacted – five standard storage scale units, and two premium storage scale units. Availability impact to affected storage accounts began at 10:30 UTC as hardware shut down in response to elevated data hall temperatures. This was most impactful to storage accounts configured with the default locally redundant storage (LRS), which is not resilient to a zonal failure. Accounts configured as zonally redundant (ZRS) remained 100% available, and accounts configured as geographically redundant (GRS) were eligible for customer-managed account failover. After power restoration, storage nodes started coming back online from 15:25 UTC. Four scale units required engineer intervention to check and reset some fault detection logic – the combination of investigation and management tooling performance problems delayed restoration of availability for these scale units. We also identified that some automation was incorrectly marking some already recovered nodes as unhealthy, which slowed storage recovery efforts. By 20:00 UTC, 99% of storage accounts had recovered. Restoring availability for the remaining <1% of storage accounts took more time due to hardware troubleshooting and replacement required on a small number of storage nodes in a single scale unit. Even identifying problematic hardware in the storage nodes took an extended period of time, as the nodes were offline and therefore not able to provide diagnostics. By 01:00 UTC on 31 August, availability was restored for all except a handful of storage accounts, with complete availability restored by 07:00 UTC on 1 September.
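
For context on the redundancy options mentioned above, the following sketch summarizes, in simplified form, how the common Azure Storage redundancy SKUs behaved in this single-zone event; the SKU names are real, but the one-line summaries are our simplification rather than a full statement of each option's guarantees:

```python
# Simplified summary of how common Azure Storage redundancy options behaved in
# this single-zone event. The SKU names are real; the summaries are a
# simplification, not a complete description of each option's guarantees.
ZONE_FAILURE_BEHAVIOR = {
    "Standard_LRS": "single zone only - unavailable while that zone's hardware is down",
    "Standard_ZRS": "replicated across zones - remained available in this incident",
    "Standard_GRS": "geo-replicated - eligible for customer-managed account failover",
}

def expected_zone_failure_behavior(sku: str) -> str:
    return ZONE_FAILURE_BEHAVIOR.get(sku, "unknown SKU - check the redundancy documentation")

print(expected_zone_failure_behavior("Standard_ZRS"))
```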

From a SQL Database perspective, database capacity in a region is divided into tenant rings. The Australia East region includes hundreds of rings, and each ring consists of a group of VMs (10-200) hosting a set of databases. Rings are managed by Azure Service Fabric to provide availability in cases of VM, network or storage failures. When infrastructure was powered down, customers using zone-redundant Azure SQL Database did not experience any downtime, except for a small subset of customers using proxy mode connections, due to one connectivity gateway not being configured with zone resilience. The fix for this issue was already being rolled out, but had not yet been deployed to the Australia East region. As infrastructure was powered back on, all tenant rings except one came back online and databases became available to customers as expected. However, one ring remained impacted even after Azure Compute became available. In this ring, 20 nodes did not come back online as expected, so databases on these nodes continued to experience unavailability. As a result of Service Fabric attempting to move databases to healthy nodes, other databases on this ring experienced intermittent availability issues as a side-effect of the overall replica density and unhealthy nodes. The recovery involved first moving all the databases from unhealthy nodes to healthy nodes. All remote storage (general purpose) databases were successfully recovered by this move, but databases using local storage (business critical) only recovered as their underlying nodes recovered. All databases on unhealthy nodes were recovered by 11:00 UTC on 31 August. Since the health and capacity of this final ring did not completely recover, we decided to move all databases out of the ring, which extended the overall recovery time but did not negatively impact customer availability. During this extended recovery, most customers were not experiencing any issues, but it was important to move all databases out of this unhealthy ring to prevent any potential impact. The operation of moving all databases out of this ring was completed at 20:00 UTC on 3 September. During this incident, customers who had ‘active geo-replication’ set up were able to fail over manually to restore availability. For customers who have ‘auto-failover groups’ enabled, we did not execute automatic failover – our automatic failover policy was not initiated for the region, due to an incorrect initial assessment of the impact severity to SQL Database.

From a Cosmos DB perspective, zone-resilient accounts and those with multi-region writes remained operational during the incident, transparently serving requests from a different zone or region, respectively. However, accounts not configured for AZ or multi-region writes experienced full or partial loss of availability, due to the infrastructure that was powered down. Multi-region accounts with single region write that were eligible for failover (those with Service Managed Failover enabled) were failed over to their alternate regions to restore availability. These failovers were initiated at 12:07 UTC, 33 minutes after the decision to power down scale units. The reason for this delay was to identify and fail over the Cosmos DB control plane system resources – in retrospect this delay was unnecessary, as the Cosmos DB control plane was already fully zone-resilient. 95% of database accounts were failed over within 35 minutes, by 12:42 UTC, and all eligible accounts were failed over by 16:13 UTC on 30 August. Accounts that were not eligible for failover had service restored to partitions only once the dependent storage and compute were restored. 

From an Azure Kubernetes Service (AKS) perspective, the service experienced a loss of compute for the AKS control plane for Australia East, as well as data access loss to SQL Database. The AKS control plane underlay is deployed across multiple availability zones. AKS uses Azure SQL Database for its operation queue, which is used for Create/Read/Update/Delete (CRUD) activities. Although scheduled to be converted, the SQL Database in Australia East was not yet configured for zone resilience, leaving it unavailable during the incident period. In addition, AKS services in the Australia Southeast region depended on this same database, causing an AKS incident for CRUD activities in that region as well. Existing customer workloads running on AKS clusters in either region should not have been impacted by the downtime, as long as they did not need to access the AKS resource provider for scaling or other CRUD activities. As the SQL Database recovered, service was restored without any other mitigation required. 

From an Azure Resource Manager (ARM) perspective, the impact on customers was the result of degradation in Cosmos DB. This degradation impacted ARM between 10:45 UTC and 12:25 UTC and resulted in ARM platform availability for the Australia East region dropping from ~99.999% to (at its lowest) 88%, with a 62% success rate for write operations. For data consistency reasons, write operations are required to be sent to the associated Cosmos DB regional replica for a given resource group. The migration to our next-generation zonally redundant storage architecture is still ongoing, so this region is not yet leveraging fully zonally redundant storage for ARM. This meant that for the duration of the incident, customers worldwide trying to manage resources whose resource groups were homed in Australia East saw increased error rates (and this manifested in a small impact to global platform availability until 15:00 UTC).

How did we respond?

  • 30 August 2023 @ 08:41 UTC – Voltage sag occurred on utility power line
  • 30 August 2023 @ 08:43 UTC – 13 chillers failed to restart automatically
  • 30 August 2023 @ 08:51 UTC – Remote resets on chillers commenced
  • 30 August 2023 @ 09:09 UTC – Team arrived at first group of chillers for manual restarts
  • 30 August 2023 @ 09:18 UTC – Team arrived at second group of chillers for manual restarts
  • 30 August 2023 @ 09:42 UTC – Team arrived at third group of chillers for manual restarts
  • 30 August 2023 @ 09:45 UTC – Team arrived at the final group of chillers which could not be restarted 
  • 30 August 2023 @ 10:30 UTC – Initial impact from automated infrastructure shutdown
  • 30 August 2023 @ 10:47 UTC – Cosmos DB Initial impact detected via monitoring
  • 30 August 2023 @ 10:48 UTC – First automated communications sent to Azure Service Health
  • 30 August 2023 @ 11:30 UTC – Initial communications posted to public Azure Status page 
  • 30 August 2023 @ 11:34 UTC – Decision made to shut down impacted infrastructure
  • 30 August 2023 @ 11:36 UTC – All subscriptions in Australia East sent portal communications
  • 30 August 2023 @ 12:07 UTC – Failover initiated for eligible Cosmos DB accounts 
  • 30 August 2023 @ 12:12 UTC – Five chillers manually restarted
  • 30 August 2023 @ 13:30 UTC – Data hall temperature normalized
  • 30 August 2023 @ 14:10 UTC – Safety walkthrough completed for both data halls
  • 30 August 2023 @ 14:25 UTC – Decision made to start powering up hardware in the two affected data halls 
  • 30 August 2023 @ 15:10 UTC – Power restored to all hardware 
  • 30 August 2023 @ 15:25 UTC – Storage, networking, and compute infrastructure started coming back online after power restoration
  • 30 August 2023 @ 15:30 UTC – Identified three specific storage scale units still experiencing fault codes
  • 30 August 2023 @ 16:00 UTC – 35% of VMs recovered / Began manual recovery efforts for three remaining storage scale units 
  • 30 August 2023 @ 16:13 UTC – Account failover completed for all Cosmos DB accounts
  • 30 August 2023 @ 17:00 UTC – All but one SQL Database tenant ring had recovered
  • 30 August 2023 @ 19:20 UTC – 90% of VMs recovered
  • 30 August 2023 @ 19:29 UTC – Successfully recovered all premium storage scale units
  • 30 August 2023 @ 20:00 UTC – 99% of storage accounts were back online
  • 30 August 2023 @ 22:35 UTC – Standard storage scale units were recovered, except for one scale unit
  • 30 August 2023 @ 22:40 UTC – 99% of VMs recovered
  • 31 August 2023 @ 04:04 UTC – Restoration of Cosmos DB accounts to Australia East initiated
  • 31 August 2023 @ 04:43 UTC – Final Cosmos DB cluster recovered, restoring all traffic for accounts that were not failed over to alternate regions
  • 31 August 2023 @ 05:00 UTC – 100% of VMs recovered
  • 1 September 2023 @ 06:40 UTC – Successfully recovered all standard storage scale units 
  • 3 September 2023 @ 20:00 UTC – Final SQL Database tenant ring evacuated and all customer databases online

How are we making incidents like this less likely or less impactful?

  • Incremental load increases over time in the Availability Zone resulted in a chiller operating configuration that was susceptible to a timing defect in the Chiller Management System. We have de-risked this restart failure due to voltage fluctuations by implementing a change to the control timing logic on the Chiller Management System (Completed).
  • Emergency Operation Procedure for manual restarts of simultaneous chiller failures has changed from an ‘adjacent’ sequence to a data hall ‘load based’ sequence to ensure all impacted data halls have partial cooling, to slow thermal runaway while full cooling is being restored (Completed).
  • Following this incident, as a temporary mitigation, we increased technician staffing levels at the datacenter to be prepared to execute manual chiller restart procedures until the change to the Chiller Management System that prevents restart failures was in place. Based on our incident analysis, the staffing levels at the time would have been sufficient to prevent impact if a ‘load based’ chiller restart sequence had been followed, which we have since implemented (Completed). 
  • Datacenter staffing levels published in the Preliminary PIR only accounted for “critical environment” staff onsite. This did not characterize our total datacenter staffing levels accurately. To alleviate this misconception, we made a change to the preliminary public PIR posted on the Status History page. 
  • Our Storage team has identified several optimizations in our large scale recovery process which will help to reduce time to mitigate. This includes augmenting data provided in our incidents to enable quicker decision making, and updates to our troubleshooting guides (TSGs) that enable faster execution (Estimated completion: December 2023).
  • Our Azure Service Fabric team is working to improve reliability of SQL Database tenant ring recovery. (Estimated completion: December 2023).
  • Our SQL Database team is reviewing our ‘auto-failover group’ trigger criteria, to ensure that failovers can happen within the expected timeframe. (Estimated completion: October 2023).
  • Our SQL Database team is upgrading internal tooling to enable mass migration of databases. (Estimated completion: December 2023).
  • Our Cosmos DB team is working to optimize Service Managed Failover for single region write accounts to reduce time to mitigate (Estimated completion: November 2023).
  • Our AKS team is immediately converting all operation queue databases in SQL Database to be zone redundant (Estimated completion: September 2023).
  • Our AKS team is also replacing all cross-region SQL Database queue usage with Service Bus queues that are zone redundant (Estimated completion: September 2023).
  • Our ARM team will complete its storage layer migration to the next generation, zonally redundant architecture (Estimated completion: December 2023).
  • Our incident management team is exploring ways to harden our readiness, process, and playbook surrounding power down scenarios (Estimated completion: October 2023)

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

6 July 2023

Watch our 'Azure Incident Retrospective' video about this incident:

What happened? 

Between 23:15 UTC on 6 July 2023 and 09:00 UTC on 7 July 2023, a subset of data for Azure Monitor Log Analytics and Microsoft Sentinel failed to ingest. Additionally, platform logs gathered via Diagnostic Settings failed to route some data to customer destinations such as Log Analytics, Storage, Event Hub and Marketplace. These failures were caused by the deployment of a service within Microsoft that contained a bug, which generated a much higher than expected call volume and overwhelmed the telemetry management control plane. Customers in all regions experienced impact.

Security Operations Center (SOC) functionality in Sentinel may have been impacted. Queries against impacted tables within the date range listed above, inclusive of the log data that we failed to ingest, might have returned partial or empty results. This includes analytics (detections), hunting queries, workbooks with custom queries, and notebooks. In cases where Event or Security Event tables were impacted, investigations of a correlated incident may have shown partial or empty results. 

What went wrong and why? 

A code deployment for the Azure Container Apps service was started on 3 July 2023 via the normal Safe Deployment Practices (SDP), first rolling out to Azure canary and staging regions. This version contained a misconfiguration that blocked the service from starting normally. Due to the misconfiguration, the service bootstrap code threw an exception, and was automatically restarted. This caused the bootstrap service to be stuck in a loop where it was being restarted every 5 to 10 seconds. Each time the bootstrap service was restarted, it provided configuration information to the telemetry agents also installed on the service hosts. Each time the configuration information was sent to the telemetry agents, they interpreted this as a configuration change, and therefore they also automatically exited their current process and restarted. Three separate instances of the telemetry agent, per application host, were now also restarting every 5 to 10 seconds.

Upon each startup of the telemetry agent, the agent immediately contacted the telemetry control plane to download the latest version of the telemetry configuration. Normally this is an action that would take place one time every several days, as this configuration would be cached on the agent. However, as the deployment of the Container Apps service progressed, several hundred hosts now had their telemetry agents requesting startup configuration information from the telemetry control plane every 5-10 seconds. The Container Apps team detected the fault in their deployment on 6 July 2023, stopped the original deployment before it was released to any production regions, and started a new deployment of their service in the canary and staging regions to correct the misconfiguration.
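
Some back-of-the-envelope arithmetic shows the scale of the resulting load; the host count, restart interval, and "normal" refresh interval below are rough assumptions based on the description above, not measured figures:

```python
hosts = 500                  # "several hundred hosts" - assumed round number
agents_per_host = 3          # three telemetry agent instances per application host
restart_interval_s = 7.5     # midpoint of the 5-10 second restart loop
normal_interval_s = 3 * 24 * 3600   # config normally refreshed roughly every few days

storm_rps = hosts * agents_per_host / restart_interval_s
steady_rps = hosts * agents_per_host / normal_interval_s

print(f"~{storm_rps:.0f} config downloads/sec during the crash loop")   # ~200/sec
print(f"~{steady_rps:.4f} config downloads/sec in steady state")        # ~0.006/sec
print(f"amplification: roughly {storm_rps / steady_rps:,.0f}x")         # tens of thousands of times
```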

However, the aggregate rate of requests from the services that received the build with the misconfiguration exhausted capacity on the telemetry control plane. The telemetry control plane is a global service, used by services running in all public regions of Azure. As capacity on the control plane was saturated, other services involved in ingestion of telemetry, such as the ingestion front doors and the pipeline services that route data between services internally, began to fail as their operations against the telemetry control plane were either rejected or timed out. The telemetry control plane being a single point of failure is a known risk, and investment has been underway in Azure Monitor to design this risk out of the system.

How did we respond?

The impact on the telemetry control plane grew slowly, and the resulting problems were not detected until 12:30 UTC on 6 July 2023. When the issues were detected, the source of the additional load against the telemetry control plane was not known, but the team suspected additional load had been created against the control plane and took these actions:

  • 6 July 2023 @ 14:53 UTC – Internal incident bridge created.
  • 6 July 2023 @ 15:56 UTC – ~500 instances of garbage collector service were removed, to reduce load on telemetry control plane
  • 6 July 2023 @ 16:09 UTC – First batch of Node Diagnostics servers were removed, to reduce load on telemetry control plane. This process of removing this type of server continued over the next 10 hours.
  • 6 July 2023 @ 20:20 UTC – Source of anomalously high traffic was identified, and the responsible team was paged to assist.
  • 6 July 2023 @ 23:00 UTC – IP address blocks deployed, to prevent anomalous traffic from hitting telemetry control plane.
  • 6 July 2023 @ 23:15 UTC – External customer impact started, as cached data started to expire.
  • 7 July 2023 @ 01:30 UTC – Three additional clusters were added to telemetry control plane to handle additional load, and we began restarting existing clusters to clear backlogged connections.
  • 7 July 2023 @ 02:19 UTC – Initial customer notification was posted to the Azure Status page, to acknowledge the incident was being investigated while we worked to identify which specific subscriptions were impacted.
  • 7 July 2023 @ 02:45 UTC – An additional three clusters were added to telemetry control plane.
  • 7 July 2023 @ 06:15 UTC – Targeted customer notifications sent via Azure Service Health to customers with impacted subscriptions (sent on Tracking ID XMGF-5Z0).
  • 7 July 2023 @ 09:00 UTC – Incident declared mitigated, as call error rate and call latency against control plane APIs stabilized at typical levels.

How are we making incidents like this less likely or less impactful? 

We know customer trust is earned and must be maintained, not just by saying the right thing but by doing the right thing. Data retention is a fundamental responsibility of the Microsoft cloud, including every engineer working on every cloud service. We have learned from this incident and are committed to the following improvements:

  • We have ensured that our telemetry control plane services are now running with additional capacity (Completed)
  • We have created additional alerting on certain metrics that indicate critical, unusual failure patterns in API calls (Completed)
  • We will be adding new positive caching and negative caching to the control plane, to reduce load on backing store (Estimated completion: September 2023)
  • We are putting in place additional throttling and circuit breaker patterns to our core telemetry control plane APIs – see the illustrative sketch after this list (Estimated completion: September 2023)
  • In the longer term, we are creating isolation between internal-facing and external-facing services using the telemetry control plane (Estimated completion: December 2023) 
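
As an illustration of the throttling and circuit breaker item above, here is a minimal, generic circuit breaker sketch – the class, thresholds, and cool-down are ours, not the telemetry control plane's actual implementation:

```python
import time

class CircuitBreaker:
    """Minimal, generic circuit breaker: after several consecutive failures,
    reject calls immediately for a cool-down period instead of letting every
    caller keep piling onto an already-saturated backend."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_failures = 0
        self.open_until = 0.0

    def call(self, operation):
        now = time.monotonic()
        if now < self.open_until:
            raise RuntimeError("circuit open: shedding load instead of calling the backend")
        try:
            result = operation()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.open_until = now + self.cooldown_s   # trip the breaker
            raise
        self.consecutive_failures = 0   # any success closes the breaker again
        return result
```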

How can customers make incidents like this less impactful?

Note that the Azure Monitor Agent (AMA) provides more advanced collection and ingestion resilience capabilities (such as caching, buffering and retries) than the legacy Microsoft Monitoring Agent (MMA). Customers who have not yet completed their migration from MMA to AMA would benefit from accelerating and completing the migration, for the use cases they require that are supported in AMA. For more details:

Customers using Microsoft Sentinel can consider the following compensating steps:

  • Identify high priority assets, log sources that cover those assets, and detection or hunting logic normally applied to those assets and logs.
  • If possible, run one-time queries at the source for the indicated date range where ingestion was impacted – based on the prioritized assets, logs, and logic. It may also be useful to run the same queries for the week prior to the incident, compare, and look for differences (a sketch of such a comparison follows this list).
  • Alternatively, to the extent the collection architecture allows for it, re-ingest data from the source into Sentinel, and run those one-time queries in Sentinel.
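
As a sketch of the comparison suggested above, the following example assumes the azure-monitor-query and azure-identity Python packages, a placeholder workspace ID, and an example query; error handling and partial-result handling are omitted for brevity:

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())
workspace_id = "<your-log-analytics-workspace-id>"                      # placeholder
query = "SecurityEvent | summarize count() by bin(TimeGenerated, 1h)"   # example query

impacted = (datetime(2023, 7, 6, 23, 15, tzinfo=timezone.utc),
            datetime(2023, 7, 7, 9, 0, tzinfo=timezone.utc))
week_prior = (impacted[0] - timedelta(days=7), impacted[1] - timedelta(days=7))

for label, span in [("impacted window", impacted), ("same window, week prior", week_prior)]:
    # Re-run the query over each window and compare result volumes.
    response = client.query_workspace(workspace_id, query, timespan=span)
    rows = sum(len(table.rows) for table in response.tables)
    print(f"{label}: {rows} result rows")
```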

Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more:

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

5 July 2023

Watch our 'Azure Incident Retrospective' video about this incident:

What happened?

Between approximately 07:22 UTC and 16:00 UTC on 5 July 2023, Azure customers using the West Europe region may have experienced packet drops, timeouts, and/or increased latency. This impact resulted from a fiber cut caused by severe weather conditions in the Netherlands. The West Europe region has multiple datacenters and is designed with four independent fiber paths for the traffic that flows between datacenters. In this incident, one of the four major paths was cut, which resulted in congestive packet loss when traffic on the remaining links exceeded their capacity.

Downstream Azure services dependent on this intra-region network connectivity were also impacted – including Azure App Services, Azure Application Insights, Azure Data Explorer, Azure Database for MySQL, Azure Databricks (which also experienced impact in North Europe, as a result of a control plane dependency), Azure Digital Twins, Azure HDInsight, Azure Kubernetes Service, Azure Log Analytics, Azure Monitor, Azure NetApp Files, Azure Resource Graph, Azure Site Recovery, Azure Service Bus, Azure SQL DB, Azure Storage, and Azure Virtual Machines – as well as subsets of Microsoft 365, Microsoft Power Platform, and Microsoft Sentinel services.

What went wrong and why?

Due to a fiber cut caused by severe weather conditions in the Netherlands, 25% of network links between two campuses of West Europe datacenters became unavailable. These links were already running at higher utilization than our design target, and there was a capacity augment project in progress to address this. Due to a previous incident related to this capacity augment on 16 June 2023 (Tracking ID VLB8-1Z0), the augment work was proceeding with extreme caution and was still in progress when the fiber cut occurred.

As a result of the fiber cut and the higher utilization, congestion increased to a point where intermittent packet drops occurred in many of the intra-region paths. This primarily impacted network traffic between Availability Zones within the West Europe region, not traffic to and from the region itself. As a result of this interruption, Azure services that rely on internal communications with other services within the region may have experienced degraded performance, manifesting in the issues described above.
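
The effect of losing one of four paths can be approximated with simple arithmetic; the starting utilization below is an assumed figure for illustration, not a measured value:

```python
paths_total = 4
paths_lost = 1
assumed_utilization = 0.70   # illustrative pre-incident utilization per path

# Traffic from the cut path spreads across the surviving paths.
post_cut_utilization = assumed_utilization * paths_total / (paths_total - paths_lost)
print(f"per-path utilization after the cut: {post_cut_utilization:.0%}")   # ~93%
# Any link pushed past 100% of its capacity shows up as congestive packet loss.
```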

How did we respond?

Network alerting services indicated a fiber cut at 07:22 UTC and a congestion alert triggered at 07:46 UTC. Our networking on-call engineers engaged and began to investigate. Two parallel workstreams were spun up to mitigate impact:

The first workstream focused on reducing traffic in the region and balancing it across the remaining links. This balancing activity requires a detailed before-and-after traffic simulation to ensure safety, and these simulations were initiated as a first step. At 10:00 UTC we initiated throttling and migration of internal service traffic away from the region. We also started work on rebalancing traffic away from congested links. As a result of these activities, by 14:52 UTC packet drops were reduced significantly, by 15:30 UTC many internal and external services saw signs of recovery, and by 16:00 UTC packet drops had returned to pre-incident levels. We continued to work on reducing high link utilization and by 16:21 UTC the rebalancing activity was completed.

The second workstream focused on repairing the impacted links, in partnership with our dark fiber provider in the Netherlands. These cable repairs took longer than expected since access to the impacted area was hindered by the weather and hazardous working conditions. Partial restoration was confirmed by 19:30 UTC, and full restoration was confirmed by 20:50 UTC. While this restored the network capacity between datacenters, we continued to monitor our network infrastructure and capacity before declaring the incident mitigated at 22:45 UTC.

How are we making incidents like this less likely or less impactful?

  • We have repaired the impacted networking links, in partnership with our dark fiber provider in the Netherlands. (Completed)
  • Within 24 hours of the incident being mitigated we brought additional capacity online, on the impacted network path. (Completed)
  • Within a week of the incident, we were 90% complete with the capacity augments that will double capacity in our West Europe region and bring utilization within our design targets. (Estimated completion: July 2023)
  • As committed in a previous Post Incident Review (PIR), we are working towards auto-declaring regional incidents to ensure customers get notified more quickly (Estimated completion: August 2023).

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: