8 January 2025
What happened?
Between 22:31 UTC on 8 January and 00:44 UTC on 11 January 2025, a networking configuration change in a single availability zone (physical zone Az01) in East US 2 resulted in customers experiencing connectivity issues, prolonged timeouts, connection drops, and resource allocation failures across multiple Azure services. Customers leveraging Private Link with Network Security Groups (NSG) to communicate with Azure services may have also been impacted. Services that were configured to be zonally redundant and leveraging VNet integration may have experienced impact across multiple zones.
The 'logical' availability zones used by each customer subscription may correspond to different physical availability zones. Customers can use the Locations API to understand this mapping, to confirm which resources run in this physical AZ - https://learn.microsoft.com/en-us/rest/api/resources/subscriptions/list-locations?HTTP#availabilityzonemappings .
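For reference, here is a minimal sketch (not an official sample) of how the Locations API above can be queried to resolve a subscription's logical-to-physical zone mapping. The subscription ID and bearer token are placeholders, the api-version shown is an assumption, and the availabilityZoneMappings field names follow the linked reference.

```python
# Minimal sketch: map logical availability zones to physical zones for East US 2.
# The subscription ID, token, and api-version below are placeholders/assumptions.
import requests

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
TOKEN = "<bearer-token>"                # e.g. obtained via `az account get-access-token`

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}/locations"
    "?api-version=2022-12-01"           # assumed api-version; check the linked docs
)
response = requests.get(url, headers={"Authorization": f"Bearer {TOKEN}"})
response.raise_for_status()

for location in response.json().get("value", []):
    if location.get("name") != "eastus2":
        continue
    # availabilityZoneMappings pairs each logical zone ("1", "2", "3") used by
    # this subscription with its physical zone (e.g. "eastus2-az1").
    for mapping in location.get("availabilityZoneMappings", []):
        print(mapping.get("logicalZone"), "->", mapping.get("physicalZone"))
```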
Impacted services included:
- Azure Databricks: between 22:55 UTC on 08 January and 15:00 UTC on 10 January, customers may have experienced issues launching clusters and serverless compute resources, leading to failures in dependent jobs or workflows.
- Azure OpenAI: between 04:29 UTC on 09 January and 15:00 UTC on 10 January, customers may have experienced stuck Azure OpenAI training jobs, and failures in Finetune model deployments and other jobs.
- Azure SQL Managed Instance, Azure Database for PostgreSQL flexible servers, Virtual Machines, Virtual Machine Scale Sets: between 22:31 UTC on 08 January and 22:19 UTC on 10 January, customers may have experienced service management operation failures.
- Azure App Service, Azure Logic Apps, Azure Function apps: between 22:31 UTC on 8 January and 20:05 UTC on 10 January, customers may have received intermittent HTTP 500-level response codes, timeouts or high latency when accessing App Service (Web, Mobile, and API Apps), App Service (Linux), or Function deployments hosted in East US 2.
- Azure Container Apps: between 22:31 UTC on 08 January and 21:01 UTC on 10 January, customers may have experienced intermittent failures when creating revisions, container app/jobs in a consumption workload profile, and scaling out a Dedicated workload profile in East US 2.
- Azure SQL Database: between 01:29 UTC on 09 January and 22:25 UTC on 10 January, customers may have experienced issues accessing services, and errors/timeouts for new database connections. Existing connections remained available to accept new requests, however re-established connections may have failed.
- Azure Data Factory and Azure Synapse Analytics: between 22:31 UTC on 08 January and 00:44 UTC on 11 January, customers may have encountered errors when attempting to create clusters, job failures, timeouts during Managed VNet activities, and failures in activity runs for the ExecuteDataFlow activity type.
Other impacted services included API Management, Azure DevOps, Azure NetApp Files, Azure RedHat OpenShift, and Azure Stream Analytics.
What went wrong and why?
The Azure PubSub service is a key component of the networking control plane, acting as an intermediary between resource providers and networking agents on Azure hosts. Resource providers, such as the Network Resource Provider, publish customer configurations during Virtual Machine or networking create, update, or delete operations. Networking agents (subscribers) on the hosts retrieve these configurations to program the host's networking stack. Additionally, the service functions as a cache, ensuring efficient retrieval of configurations during VM reboots or restarts. This capability is essential for deployments, resource allocation, and traffic management in Azure Virtual Network (VNet) environments.
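As an illustration only, the following sketch shows the publish/subscribe-with-cache pattern described above; all class, partition, and resource names are hypothetical and do not reflect the internal PubSub implementation.

```python
# Conceptual sketch of a configuration pub/sub with a per-partition cache:
# a resource provider publishes a network configuration, and host networking
# agents retrieve it (e.g. on VM reboot) to program the local networking stack.
from collections import defaultdict

class ConfigPubSub:
    def __init__(self, partition_count: int = 3):
        self.partition_count = partition_count
        # Per-partition cache acting as the "index" of configurations.
        self.partitions = defaultdict(dict)

    def _partition_for(self, resource_id: str) -> int:
        return hash(resource_id) % self.partition_count

    def publish(self, resource_id: str, config: dict) -> None:
        """Called by a resource provider on create/update/delete."""
        self.partitions[self._partition_for(resource_id)][resource_id] = config

    def get(self, resource_id: str):
        """Called by a host networking agent, e.g. during VM reboot/restart."""
        return self.partitions[self._partition_for(resource_id)].get(resource_id)

pubsub = ConfigPubSub()
pubsub.publish("vm-123/nic-0", {"routes": ["10.0.0.0/16"], "nsg": "web-nsg"})
print(pubsub.get("vm-123/nic-0"))  # the agent would program the host networking stack
```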
Around 22:31 UTC on 08 January, during an operation to enable a feature, a manual recycling of internal PubSub replicas on certain metadata partitions was required. This recycling operation should have been completed in 'series' to ensure that data quorum for the partition is maintained. In this case, recycling was executed across the replicas in 'parallel', causing the partition to lose quorum. This caused the PubSub service to lose the indexing data stored on these partitions.
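A conceptual sketch of why serial recycling matters follows, assuming a simple majority-quorum model; the replica names and quorum rule are illustrative, not the actual PubSub internals.

```python
# Conceptual sketch: recycle replicas one at a time, and refuse to proceed if
# taking the next replica offline would drop the partition below a strict
# majority of healthy replicas. Recycling all replicas in parallel would take
# the healthy count below quorum at once, which is the failure mode described.
import time

def has_quorum(replicas: dict) -> bool:
    healthy = sum(replicas.values())
    return healthy > len(replicas) // 2

def recycle_in_series(replicas: dict, recycle) -> None:
    for name in list(replicas):
        replicas[name] = False          # replica goes offline while it recycles
        if not has_quorum(replicas):
            replicas[name] = True
            raise RuntimeError(f"Refusing to recycle {name}: quorum would be lost")
        recycle(name)
        replicas[name] = True           # replica rejoins before the next one is touched

replicas = {"replica-a": True, "replica-b": True, "replica-c": True}
recycle_in_series(replicas, recycle=lambda name: time.sleep(0.1))
```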
Losing the indexing data on three partitions resulted in ~60% of network configurations not being delivered to agents from the control plane. This impacted all service management operations for resources with index data on the affected partitions, including the network resource provisioning for virtual machines in East US 2, Az01. Networking agents would have failed to receive updates on VM routing information for new and existing VMs that underwent any automated recovery. This manifested in a loss of connectivity affecting VMs that customers use, in addition to the VMs that underpin our PaaS service offerings.
Multi-tenant PaaS services that used VNet integration had a dependency on these partitions for the networking provisioning state. After the index data was lost, this dependency on data that no longer existed caused the impact to these PaaS services to extend beyond the impacted zone.
How did we respond?
The issue was detected by monitoring on the PubSub service at 22:44 UTC, with impacted services raising alerts shortly thereafter. The PubSub team identified the trigger as the loss of indexing data on three partitions, caused by the recycling.
At 23:10 UTC, we started evaluating different options for data recovery. Several workstreams were evaluated including:
- Rehydrating the networking configurations to the partitions from the network resource provider.
- Restoring from backup to recover the networking configurations.
- Recovering data from the unaffected partitions.
Due to feasibility limitations with the first two options, we started to recover data by rebuilding the mappings from the backend data store. Scanning the data store to extract source mappings is a time-intensive process. It involves not only scanning the data, but also making sure that the mappings to the existing resources are still valid. To achieve this, we had to develop validation and mapping transformation scripts. We also throttled the mappings extraction to avoid impacting the backend data store. This scanning process took about 5-6 hours to complete. Once we obtained these mappings, we then had to restore the impacted index partitions.
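The sketch below illustrates, under stated assumptions, the general shape of such a throttled scan-and-validate pass; `read_batch` and `resource_exists` are hypothetical stand-ins for the internal data-store and validation calls.

```python
# Conceptual sketch of a throttled scan: read mapping records in batches,
# keep only mappings whose target resource still exists, and pause between
# batches to limit load on the backend data store.
import time
from typing import Callable, Iterable, Iterator

def extract_valid_mappings(
    read_batch: Callable[[int, int], Iterable[dict]],   # (offset, size) -> records
    resource_exists: Callable[[str], bool],
    batch_size: int = 500,
    pause_seconds: float = 0.5,
) -> Iterator[dict]:
    offset = 0
    while True:
        batch = list(read_batch(offset, batch_size))
        if not batch:
            return
        for record in batch:
            # Keep only mappings whose target resource is still present.
            if resource_exists(record["resource_id"]):
                yield record
        offset += batch_size
        time.sleep(pause_seconds)   # throttle to protect the backend data store
```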
As we initiated the build phase, there were unforeseen challenges that we had to work around. Typically, each build takes 3-4 hours. We spent some time looking for ways to expedite this process, as our plan involved applying multiple builds.
Separately from the builds taking longer than expected, we also observed a high number of failed requests due to the outage itself. As a result, our systems were generating a lot of errors, which contributed to high CPU usage. This exacerbated the slowness in the restore process. One of the ways that we overcame this was to deploy patches to block requests to the impacted partitions, which mitigated the high CPU usage.
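Conceptually, the patch behaves like the request filter sketched below; the partition identifiers and handler are illustrative only.

```python
# Conceptual sketch: requests destined for partitions that are still being
# restored are rejected up front, instead of repeatedly failing deeper in the
# system and burning CPU on error handling and retries.
IMPACTED_PARTITIONS = {"partition-07", "partition-12", "partition-31"}

class PartitionUnavailable(Exception):
    pass

def handle_request(partition_id: str, request: dict, backend) -> dict:
    if partition_id in IMPACTED_PARTITIONS:
        # Fail fast with a cheap, explicit error rather than letting the
        # request fan out against a partition whose index data is missing.
        raise PartitionUnavailable(f"{partition_id} is being restored")
    return backend.process(partition_id, request)
```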
Unfortunately, at 03:36 UTC on 10 January 2025, we discovered that the first mapping source files we had generated were insufficient, as they referenced an incorrect endpoint. Once this was understood, we had to repeat the mapping file generation and restore process. Additional workstreams that were executed while this main workstream was underway included:
- An immediate remediation step was to direct the Azure control plane service running on all host machines in this region to PubSub endpoints located in other, non-impacted availability zones. This restored the host agents' ability to look up the correct routing information, mitigating VM-to-VM connectivity issues.
- We initiated a workstream to route all new allocation requests to other availability zones in East US 2. However, as mentioned, services using VNet integration, which had a fixed dependency on the impacted zone, would have continued to see failures.
- To remediate impact to customers leveraging Private Link with NSGs to communicate with Azure services, we redirected the Private Link mapping to avoid the dependency on the impacted zone.
By 19:18 UTC on 10 January, all three impacted partitions were fully recovered. Following this, we started working with impacted services to validate their health and remediate any data gaps that we discovered. At 00:30 UTC on 11 January, we initiated a phased approach to gradually enable Az01 to accept new allocation requests. By 00:44 UTC on 11 January, all services confirmed mitigation. We reintroduced Az01 into rotation fully at 04:30 UTC, at which point we declared full mitigation. The impact window varied between Azure services.
Below is a timeline of events:
- 22:31 UTC on 08 January 2025: Customer impact begins following the recycling of the PubSub partitions.
- 22:44 UTC on 08 January 2025: The issue is identified as an incorrect networking configuration.
- 23:10 UTC on 08 January 2025: We start to evaluate options for the index data recovery to the PubSub service.
- 23:43 UTC on 08 January 2025: Targeted customer communications from the earliest reported impacted services started going out via Azure Service Health.
- 00:35 UTC on 09 January 2025: We begin scanning the backend data store.
- 02:37 UTC on 09 January 2025: Azure Status page updated with the details of a wider networking impact to the region.
- 03:05 UTC on 09 January 2025: A workstream to route all new allocation requests to other availability zones in East US 2 is initiated.
- 03:45 UTC on 09 January 2025: The workstream to restore the host agents' ability to look up the correct routing information was completed.
- 04:30 UTC on 09 January 2025: Developing scripts to restore the impacted partitions started.
- 05:00 UTC on 09 January 2025: The workstream to route all new allocation requests to other availability zones in East US 2 was completed.
- 07:45 UTC on 09 January 2025: Validation and testing of scripts for restoring mapping data to the impacted partitions is completed.
- 08:51 UTC on 09 January 2025: Completed scanning the backend data store and generating the source mapping files.
- 09:14 UTC on 09 January 2025: Build process completed for the scripts to restore mappings.
- 11:30 UTC on 09 January 2025: We observed high CPU and slow restoration speed.
- 11:58 UTC on 09 January 2025: The workstream to mitigate impacted services with Private link and NSG dependencies is initiated.
- 13:12 UTC on 09 January 2025: We updated one of the internal PubSub services to stop sending requests that were contributing to the high CPU usage.
- 16:10 UTC on 09 January 2025: We completed another update to reduce request volumes to impacted partitions. This helped to improve the restore speed further.
- 18:17 UTC on 09 January 2025: The workstream to redirect Private Link mapping to avoid the dependency on the impacted zone is completed.
- 19:15 UTC on 09 January 2025: We created another patch to speed up the restore process. The new patch restores the mappings inside the partition rather than calling APIs to add mappings.
- 23:40 UTC on 09 January 2025: This patching completed with the new optimization step, and the time for rebuilding one partition was reduced even further. Our overall restore time was reduced by >75%.
- 02:10 UTC on 10 January 2025: Restoration of the first two partitions completed.
- 03:00 UTC on 10 January 2025: Restoration of the third partition completed.
- 03:36 UTC on 10 January 2025: Following the restoration, during our validation steps, we determined that the source mapping files generated were incorrect. As a result, we started generating all new source mapping files.
- 05:00 UTC on 10 January 2025: We validated the source mapping data for the first partition and continued extracting the rest of the mappings. Meanwhile, we had to clean the previously restored partitions.
- 08:30 UTC on 10 January 2025: The correct source mapping file is ready, and we started the partition restore process.
- 11:04 UTC on 10 January 2025: The first partition is back online while troubleshooting and mitigating data anomalies.
- 19:18 UTC on 10 January 2025: All three of the impacted partitions are fully recovered and validated. At this stage, most customers and services would have seen full mitigation.
- 00:30 UTC on 11 January 2025: We initiated a phased approach to gradually enable the impacted zone for new allocations.
- 00:44 UTC on 11 January 2025: All services confirmed mitigation.
How are we making incidents like this less likely or less impactful?
- We are working on reducing the recovery time for any data loss scenarios by improving the PubSub data recovery mechanism. This will bring our RTO to under 2 hours (Estimated completion: February 2025).
- While PubSub production changes already go through our Safe Deployment Practices (SDP), we have added a step with additional layers of scrutiny. These will include, but are not limited to, verifying rollback plans, ensuring error budget compliance, and performance validation (Completed).
- All manual configuration changes will be moved to an automation pipeline which will require elevated privileges. (Estimated completion: February 2025).
- Significant impact to the PubSub service can occur if certain operations are performed incorrectly. We will block these operations from being executed, except by designated experts (Estimated completion: January 2025).
- We are re-examining zonal services to ensure that all downstream dependencies are zone resilient. (Estimated completion: October 2025)
- During the incident, we recommended that customers invoke a failover. This was an inappropriate communication. We are making corrections to our processes to ensure that our communications inform customers about the state of the impacted services, zones and regions. This, coupled with any other information that we have about the incident, including ETAs, should be taken into consideration as customers evaluate their own business continuity decisions (Estimated completion: January 2025)
How can customers make incidents like this less impactful?
- Consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/azure/architecture/framework/resiliency
- Consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/PLP3-1W8
26 December 2024
What happened?
Between 18:40 UTC on 26 December and 19:30 UTC on 27 December 2024, multiple Azure services were impacted by a power event that occurred in one datacenter, within one Availability Zone (physical zone AZ03), in the South Central US region. Within the impacted datacenter, our automated power systems managed the event as expected, without interruption for two of the three data halls. However, one data hall did not successfully transition to an alternate power supply. This failure led to a loss of compute, network, and storage infrastructure in this data hall.
Customer workloads configured for multi-zone resiliency would have seen no impact, or only brief impact, as automated mitigations occurred. Only customer workloads without multi-zone resiliency, and with dependencies on the impacted infrastructure, became degraded or unavailable. Impacted downstream services included:
• Azure Alerts Management – between 18:40 UTC 26 December 2024 and 05:00 UTC on 27 December 2024, impacted customers may have experienced high latency in alerts notifications and persistence.
• Azure App Service – between 18:40 UTC on 26 December and 12:00 UTC on 27 December, impacted customers may have received intermittent HTTP 500-level response codes, or experienced timeouts or high latency, when accessing App Service (Web, Mobile, and API Apps), App Service (Linux), or Function deployments hosted in the South Central US region.
• Azure Application Gateway – between 18:40 UTC on 26 December and 06:58 UTC on 27 December, impacted customers may have experienced data plane disruptions when trying to access their backend applications using Application Gateway hosted in the South Central US region.
• Azure Backup – between 20:40 UTC on 26 December and 02:21 UTC on 27 December, impacted customers may have experienced failures in backup operations for Azure File shares in the South Central US region.
• Azure Cache for Redis – between 18:45 and 21:35 UTC on 26 December, impacted customers may have lost cache availability and/or been unable to connect to cache resources hosted in the South Central US region.
• Azure Cosmos DB – between 18:47 UTC on 26 December and 03:59 UTC on 28 December, impacted customers may have experienced a degradation in service availability and/or request latency. Some requests may have resulted in server errors or timeouts.
• Azure Database for PostgreSQL – between 18:48 UTC on 26 December and 13:12 UTC on 27 December, impacted customers may have experienced connectivity failures and timeouts when executing operations, as well as unavailability of resources hosted in the South Central US region.
• Azure Database Migration Service – between 19:01 UTC on 26 December and 12:17 UTC on 27 December, impacted customers may have experienced timeout errors when attempting to create a new migration service, or when using an existing migration service, in the South Central US region.
• Azure Event Hubs | Azure Service Bus – Customers with Standard SKU, Premium SKU namespaces or AZ-enabled dedicated Event Hubs clusters experienced an availability drop for approximately five minutes, at the time when the incident started – this issue was mitigated automatically once namespace resources were reallocated to other availability zones. However, a subset of customers using Event Hubs Dedicated non-AZ clusters experienced an availability issue for an extended period of time when trying to access their Event Hubs namespaces in the region. The affected Event Hubs dedicated clusters recovered once the underlying failing VMs in their clusters were brought back online, the last of which were restored by 05:52 UTC on 27 December.
• Azure Firewall – between 18:44 UTC on 26 December and 11:30 UTC on 27 December, impacted customers with an Azure Firewall deployed with multi-zone resilience may have seen partial throughput degradation and no availability loss. Customers with an Azure Firewall not utilizing multi-zone resiliency may have had resources dependent on the impacted Availability Zone (physical zone AZ03) which could have resulted in performance degradation or availability impact. Customers attempting control plane operations (for example, making changes to Firewall policies/rules) may have experienced failures during this incident.
• Azure Logic Apps – between 18:47 UTC on 26 December and 03:10 UTC on 27 December, impacted customers may have encountered delays in run executions and failing data or control plane calls.
• Azure SQL Database – between 20:12 UTC on 26 December and 18:22 UTC on 27 December, impacted customers may have experienced issues accessing services. New connections to databases in the South Central US region may have resulted in an error or timeout. Existing connections remained available to accept new requests, however if those connections were terminated then re-established, they may have failed.
• Azure Storage – between 18:45 UTC on 26 December and 08:50 UTC on 27 December, impacted customers may have experienced timeouts and failures when accessing storage resources hosted in the South Central US region. This affected both Standard and Premium tiers of Blobs, Files and Managed Disks.
• Azure Synapse Analytics – between 18:53 UTC on 26 December and 13:52 UTC on 27 December, impacted customers may have experienced spark job execution failures, and/or errors when attempting to create clusters, in the South Central US, East US 2, and/or Brazil South regions.
• Azure Virtual Machines – between 18:41 UTC on 26 December and 22:26 UTC on 27 December, impacted customers may have experienced connection failures when trying to access some Virtual Machines hosted in the South Central US region. These Virtual Machines may have also restarted unexpectedly.
• Azure Virtual Machine Scale Sets – between 19:04 UTC on 26 December and 11:18 UTC on 27 December, impacted customers may have experienced error notifications when performing service management operations - such as create, delete, update, scaling, start, stop - for resources hosted in the South Central US region.
• This incident also impacted a subset of Microsoft 365 services – further details are provided in the Microsoft 365 Admin Center, under incident ID MO966473.
What went wrong and why?
This incident was initially triggered by a utility power loss, itself caused by a localized ground fault – in which a high voltage underground line failed. After a phase to ground short developed in the buried feeder cables, the breaker feeding the datacenter tripped – leading to a loss of utility power, at 18:40 UTC.
By design, the power distribution systems transferred the load to diesel backup generators, with UPS batteries carrying the load during the transition; this was successful for two of the three affected data halls. During the transition to generator power, the third data hall experienced UPS battery faults, which caused the load to drop during the transition.
In any power-related event, our first priority is to ensure the safety of our staff and infrastructure before any power restoration work can begin. Following our assessment, we were able to safely begin restoration at 20:13 UTC. IT power loads were manually re-energized on backup diesel generator power, by performing a bypass on the failed UPS devices. We began seeing infrastructure services returning by 20:35 UTC, with power fully restored by 20:56 UTC. As power and infrastructure recovered, the next validation steps were to ensure that Azure Networking and Azure Storage services were recovering as expected. By 21:00 UTC, almost all storage and network infrastructure services were confirmed as fully operational. A single storage scale unit remained significantly degraded, due to hardware that required deeper inspection and ultimately, replacement.
As storage scale units recovered, 85% of the impacted Virtual Machines (VMs) recovered by 21:40 UTC as their Virtual Hard Disks (VHDs) became available. The next 13% of VMs recovered between 06:00 and 06:30 UTC, as the final storage scale unit became available. Despite all the storage issues being resolved, <2% of VMs impacted by this event remained unhealthy. These issues are detailed below and explain why impacted downstream services with dependencies on these VMs experienced long-tail recoveries. The incident was declared as mitigated at 19:30 UTC on 27 December 2024.
Azure Storage:
For Zone Redundant Storage (ZRS) accounts, there was no availability impact – as data was served from replicas in other Availability Zones during this incident.
The power loss event impacted six Storage scale units. After power restoration, scale units hosting Standard SSD Managed Disks, Premium SSD Managed Disks, Premium Blobs, and Premium Files, fully recovered automatically in around 30 minutes. For most of the HDD-based Standard Storage LRS/GRS scale units, the storage services took approximately one hour to recover.
Unfortunately, within one Standard Storage scale unit, multiple network switches were non-functional following the power event, causing a significant portion of the data in that scale unit to be inaccessible because all replicas were unreachable. This caused significant impact to VMs and dependent services that were using Standard HDD managed disks and LRS blob or file storage accounts hosted on this scale unit. Mitigation required replacement networking equipment to be sourced from spares and installed by datacenter technicians. Network engineers then configured and validated these devices, before bringing them online. Additional actions were taken to recover storage nodes under the replaced switches. For the majority of accounts, availability was restored by 06:10 UTC on 27 December 2024 (overall availability at 99.5%), with repairs required on a handful of servers to restore 100% availability by 08:50 UTC on 27 December 2024.
Azure Compute / Virtual Machines:
For customers using VM/compute workloads that leveraged multi-zone resiliency (such as VMSS flex across availability zones), there was no availability impact.
For incidents like this, Azure has an automated recovery suite called 'Defibrillator' that starts automatically after datacenter power has been restored, to recover the VMs and the Host machines they run on. It orchestrates the power-on of all affected Host machines, monitors the boot-up and bootstrap sequences, and ensures that the VMs are up and running. While it is running, Azure's automated steady-state health detection and remediation systems suspend their activities, to avoid disrupting the disaster recovery process.
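A conceptual sketch of this kind of recovery orchestration is shown below, assuming hypothetical host and remediation interfaces; it is not the actual 'Defibrillator' tooling.

```python
# Conceptual sketch: suspend steady-state remediation, power on affected hosts,
# poll until each host bootstraps and its VMs report running, then re-enable
# normal remediation. Hosts still pending at the deadline need manual work.
import time

def recover_datacenter(hosts, remediation, poll_interval=30, timeout=3600):
    remediation.suspend()            # avoid fighting the recovery workflow
    try:
        for host in hosts:
            host.power_on()
        deadline = time.monotonic() + timeout
        pending = set(hosts)
        while pending and time.monotonic() < deadline:
            for host in list(pending):
                if host.booted() and all(vm.running() for vm in host.vms()):
                    pending.discard(host)
            time.sleep(poll_interval)
        return pending               # hosts requiring manual intervention
    finally:
        remediation.resume()         # re-enable steady-state detection/remediation
```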
At approximately 22:00 UTC on 26 December 2024, some compute scale units were found not to be recovering at the expected rate. The final 2% of VMs mentioned above experienced an extended recovery; we observed three separate events that contributed to this.
- The first scenario was due to initialization without a connection to a network device. Because the network devices were not fully configured before the Host machines were powered on, a race condition was triggered during the Host bootstrap process. This issue is specific to a certain hardware configuration within localized compute scale units, and necessitated temporarily disabling some validation checks during the bootstrap process.
- The second scenario delaying recovery was that some machines failed to boot into the Host OS, due to a newly discovered bootloader bug impacting a small subset of host hardware with higher levels of offline memory pages. When the hardware reports repeated corrected memory errors to the Host OS, the Host offlines certain memory ranges to prevent repeated use of that memory. In the small subset of host hardware where a large range of offline memory had accumulated, this new Host OS bug resulted in failures to bootstrap the Host OS. This category was mitigated by clearing and/or ignoring the offline memory list, allowing the Host OS to make forward progress where it could, and then rebuilding its offline memory list once the full OS was running.
- The third scenario that prevented compute recovery in some cases involved the control plane devices that are inline to execute power operations on the Host machines. Datacenter technicians were required to reseat that infrastructure manually.
By 10:50 UTC on 27 December, >99.8% of the impacted VMs had recovered, with our team re-enabling Azure’s automated detection and remediation mechanisms. Some targeted remediation efforts were required for a remaining small percentage of VMs, requiring manual intervention to bring these back online.
Azure Cosmos DB:
For Azure Cosmos DB accounts configured with availability zones, there was no impact, and the account maintained availability for reads and writes.
Impact on other Cosmos DB accounts varied depending on each database account's regional configuration and consistency settings (a client configuration sketch follows this list):
- Database accounts configured with availability zones were not impacted by the incident, and maintained availability for reads and writes.
- Database accounts with multiple read regions and a single write region outside South Central US maintained availability for reads and writes if configured with session or lower consistency. Accounts using strong or bounded staleness consistency may have experienced write throttling to preserve consistency guarantees until the South Central US region was either taken offline or recovered. This behavior is by design.
- Active-passive database accounts with multiple read regions and a single write region in South Central US maintained read availability, but write availability was impacted until the South Central US region was taken offline or recovered.
- Single-region database accounts in South Central US without Availability Zone configuration were impacted if any partition resided on the affected instances.
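As an illustrative example of the multi-region read behavior described above, the sketch below configures a client with session consistency and preferred read regions, assuming the azure-cosmos Python SDK (v4); the account URL, key, database, and container names are placeholders.

```python
# Minimal sketch (assuming the azure-cosmos Python SDK v4): a multi-region
# account read with session consistency stays available by falling back to the
# next preferred region if South Central US is unavailable.
from azure.cosmos import CosmosClient

ACCOUNT_URL = "https://<your-account>.documents.azure.com:443/"  # placeholder
ACCOUNT_KEY = "<your-key>"                                        # placeholder

client = CosmosClient(
    ACCOUNT_URL,
    credential=ACCOUNT_KEY,
    # Reads prefer South Central US but can fall back to the paired read region.
    preferred_locations=["South Central US", "East US 2"],
    consistency_level="Session",
)

container = client.get_database_client("appdb").get_container_client("orders")
# With session (or weaker) consistency and multiple read regions, this read is
# served from the next preferred region when the primary read region is offline.
item = container.read_item(item="order-1", partition_key="customer-42")
```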
Azure SQL Database:
For Azure SQL Databases configured with zone redundancy, there was no impact.
A subset of customers in this region experienced unavailability and slow/stuck control plane operations, such as updating the service level objective, for databases that are not configured as zone redundant. Customers with active geo-replication configuration were asked to consider failing out of the region at approximately 22:31 UTC.
Impact duration varied. Most databases recovered after Azure Storage recovered. Some databases took an extended time to recover due to the aforementioned long recovery time of some underlying VMs.
Azure Application Gateway:
Application Gateway experienced issues with data path, control plane, and auto-scale operations, leading to service disruptions. Impact on Application Gateways varied depending on customer configuration:
- Customers who deployed Application Gateways with zone redundancy may have experienced latency issues and overall degraded performance.
- Customers who deployed Application Gateways to a single zone or did not specify zone info during deployment may have experienced data path loss if their deployments had instances in the affected zone.
- Gateways with instances deployed in affected zone may have experienced failures or delays in configuration updates.
- Gateways with instances deployed in affected zone may have experienced failures or delays in auto scale operations.
Azure Firewall:
For Azure Firewalls deployed to all Availability Zones of the region, customers would not have experienced any data path impact.
However, customers with an Azure Firewall deployed only to the impacted Availability Zone (physical zone AZ03), may have experienced some performance degradation – affecting the ability to scale out. Finally, customers attempting control plane operations (for example, making changes to Firewall policies/rules) may have experienced failures during this incident. Both of these impacts were experienced between 18:40 UTC on 26 December and 07:22 UTC on 27 December 2024.
Azure Synapse:
Some users of Azure Synapse Analytics faced Spark job execution failures in South Central US, Brazil South, and East US 2. This impacted less than 1% of Synapse calls in those regions. Your logs may include one or more of the following errors that could be a result of this issue: "CLUSTER_CREATION_TIMED_OUT", "FAILED_CLUSTER_CREATION", "CLUSTER_FAILED_AFTER_RUNNING". During this period, Azure Synapse could not provision on-demand compute due to a failure to retrieve the Management Group ancestry used for RBAC evaluations. The underlying storage for the South Central US instance of this ancestry data, on which the South Central US, Brazil South, and East US 2 regions depend, was impacted by this incident. The data is replicated globally, and regional failover attempts were made, but these did not succeed due to a gateway error. The issue was resolved across all regions once the South Central US region was recovered.
How did we respond?
- 18:40 UTC on 26 December 2024 – Initial power event occurred which led to power loss in the affected data hall.
- 18:45 UTC on 26 December 2024 – Technicians from datacenter operations team engaged.
- 18:46 UTC on 26 December 2024 – Portal Communications started being sent to impacted subscriptions.
- 19:02 UTC on 26 December 2024 – Datacenter incident call began to support triaging and troubleshooting issues.
- 19:08 UTC on 26 December 2024 – Azure engineering teams joined a central incident call, to triage and troubleshoot Azure service impact.
- 20:13 UTC on 26 December 2024 – Power restoration assessed safe and began.
- 20:35 UTC on 26 December 2024 – Compute, Network, and Storage infrastructure began to recover.
- 20:54 UTC on 26 December 2024 – Communications published to our public status page.
- 20:56 UTC on 26 December 2024 – Power had been restored. Infrastructure recovery continued.
- 21:40 UTC on 26 December 2024 – 85% of the VMs impacted by underlying VHD availability recovered.
- 06:30 UTC on 27 December 2024 – Additional 13% of VMs impacted by VHD availability recovered.
- 08:30 UTC on 27 December 2024 – Ongoing mitigation of additionally impacted services.
- 13:00 UTC on 27 December 2024 – Mitigation to most affected services confirmed.
- 19:30 UTC on 27 December 2024 – Incident mitigation confirmed and declared.
How are we making incidents like this less likely or less impactful?
- The datacenter returned to utility power after ensuring battery health for the UPS transition from generator power (Completed).
- We are reviewing the nature of the UPS battery failures in line with our global battery standards and maintenance procedures, to identify improvements to de-risk this class of issue across the fleet. (Estimated completion: February 2025)
- Repairs to the failed utility line, which remains offline, are in progress. (Estimated completion: February 2025)
- The mitigation to bypass various checks during the bootstrap process has been applied to all impacted machines, and is being evaluated and executed for other hardware configurations where needed. (Estimated completion: March 2025)
How can customers make incidents like this less impactful?
- Consider using Availability Zones (AZs) to run your services across physically separate locations within an Azure region. To help services be more resilient to datacenter-level failures like this one, each AZ provides independent power, networking, and cooling. Many Azure services support zonal, zone-redundant, and/or always-available configurations: https://docs.microsoft.com/azure/availability-zones/az-overview
- Note that the 'logical' Availability Zones used by each customer subscription may correspond to different physical Availability Zones - customers can use the Locations API to understand this mapping, to confirm which resources run in this physical AZ - https://learn.microsoft.com/rest/api/resources/subscriptions/list-locations
- Customers using Azure Virtual Machines should consider reviewing our reliability guidance, including Availability Zone support and disaster recovery. See: https://learn.microsoft.com/azure/reliability/reliability-virtual-machines#availability-zone-support
- Customers using Azure Cosmos DB should consider reviewing our high availability and reliability guidance for adopting Availability Zone architecture: https://learn.microsoft.com/azure/reliability/reliability-cosmos-db-nosql
- Customers using Azure SQL DB should consider reviewing our high availability and resiliency guidance surrounding relevant redundancy: https://learn.microsoft.com/azure/azure-sql/database/high-availability-sla-local-zone-redundancy?view=azuresql&tabs=azure-powershell
- Customers currently using Application Gateway v1 should consider migrating to Application Gateway v2, see: https://learn.microsoft.com/azure/application-gateway/migrate-v1-v2
- Customers using Application Gateway v2 should consider ensuring that two or more Azure availability zones are selected during provisioning, for better resiliency. See: https://learn.microsoft.com/azure/application-gateway/overview-v2
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/azure/architecture/framework/resiliency
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/CNF4-N_0
13 November 2024
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/PSM0-BQ8
What happened?
Between 00:50 UTC and 12:30 UTC on 13 November 2024, a subset of Azure Blob Storage and Azure Data Lake Storage accounts experienced connectivity errors. The issue manifested as loss of access to Blob and Data Lake storage endpoints of the affected storage accounts, and subsequent unavailability of downstream services that depended on these storage accounts. Since many of the impacted storage accounts were used by other Azure services and major software vendor solutions, the customer impact was widespread. Although unavailable to access, the data stored in these storage accounts was not impacted during this incident. Impacted downstream services included:
- Azure Storage: Impacted customers may have experienced name (DNS) resolution failures when interacting with impacted storage accounts in Australia East, Australia Southeast, Brazil South, Brazil Southeast, Canada Central, Canada East, Central India, Central US, East Asia, East US, East US 2, East US 2 EUAP, France Central, Germany West Central, Japan East, Japan West, Korea Central, North Central US, North Europe, Norway East, South Africa North, South Central US, South India, Southeast Asia, Sweden Central, Switzerland North, UAE North, UK South, UK West, West Central US, West Europe, West US, West US 2, West US 3.
- Azure Container Registry: Impacted customers using the East US region may have experienced intermittent 5xx errors while trying to pull images from the registry.
- Azure Databricks: Impacted customers may have experienced failures with launching clusters and serverless compute resources in Australia East, Canada Central, Canada East, Central US, East US, East US 2, Japan East, South Central US, UAE North, West US, and/or West US 2.
- Azure Log Analytics: Impacted customers using the West Europe, Southeast Asia, and/or Korea Central regions may have experienced delays and/or stale data when viewing Microsoft Graph activity logs.
What went wrong and why?
Azure Traffic Manager manages and routes blob and data lake storage API requests. The incident was caused by an unintentional deletion of the Traffic Manager profiles for the impacted storage accounts. These Traffic Manager profiles were originally part of an Azure subscription pool which belonged to the Azure Storage service. This original service was split into two separate services, one of which would eventually be deprecated. Ownership of the subscriptions containing the Traffic Manager profiles for storage accounts should have been assigned to the service that was continuing operation, but this reassignment was missed. As a result, the decommissioning process inadvertently deleted the Traffic Manager profiles under those subscriptions, leading to loss of access to the affected storage accounts. To learn more about Azure Traffic Manager profiles, see: https://learn.microsoft.com/azure/traffic-manager/traffic-manager-manage-profiles.
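Conceptually, the kind of guard that can prevent this class of error looks like the sketch below; `list_artifacts`, `is_still_referenced`, and `delete` are hypothetical stand-ins for internal inventory, dependency-check, and cleanup calls.

```python
# Conceptual sketch: before a decommissioning workflow deletes artifacts under
# a subscription, verify that nothing is still referenced by live endpoints
# (for example, Traffic Manager profiles still serving storage accounts).
def safe_decommission(subscription_id, list_artifacts, is_still_referenced, delete):
    blocked = []
    for artifact in list_artifacts(subscription_id):
        if is_still_referenced(artifact):
            # Ownership may have moved to the surviving service; never delete
            # an artifact that live traffic still depends on.
            blocked.append(artifact)
            continue
        delete(artifact)
    if blocked:
        raise RuntimeError(
            f"{len(blocked)} artifacts still referenced; transfer ownership first"
        )
```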
How did we respond?
After receiving customer reports of issues, our team immediately engaged to investigate. Once we understood what had triggered the problem, our team started to restore the Traffic Manager profiles of the affected storage accounts. Recovery took an extended period of time, since it required care in reconstructing the Traffic Manager profiles while avoiding further customer impact. We started multiple workstreams in parallel, to drive manual recovery and to create automated steps to speed up recovery. Recovery was carried out in phases, with the majority of affected accounts restored by 06:24 UTC - and the last set of storage accounts recovered and fully operational by 12:30 UTC. Timeline of key events:
- 13 November 2024 @ 00:50 UTC – First customer impact, triggered by the deletion of a Traffic Manager profile.
- 13 November 2024 @ 01:24 UTC – First customer report of issues, on-call engineering team began to investigate.
- 13 November 2024 @ 01:40 UTC – We identified that the issues were triggered by the deletion of Traffic Manager profiles.
- 13 November 2024 @ 02:16 UTC – Impacted Traffic Manager profiles identified, and recovery planning started.
- 13 November 2024 @ 02:25 UTC – Recovery workstreams started.
- 13 November 2024 @ 03:51 UTC – First batch of storage accounts recovered and validated.
- 13 November 2024 @ 06:00 UTC – Automation to perform regional recovery in place.
- 13 November 2024 @ 06:24 UTC – Majority of recovery completed; most impacted accounts were accessible by this time.
- 13 November 2024 @ 12:30 UTC – Recovery and validation 100% complete, incident mitigated.
How are we making incidents like this less likely or less impactful?
- We completed an audit of all the production service artifacts used by the Azure Storage resource provider. (Completed)
- We created a new highly restrictive deployment approval process, as an additional measure to prevent unintended mutations like deletions. (Completed)
- We are improving the process used to clean-up production service artifacts with built-in safety to prevent impact. (Estimated completion: December 2024)
- We are enhancing our monitoring of outside-in storage traffic, making it more sensitive to smaller impacts and validating connectivity and reachability for all endpoints of storage accounts. (Some services completed; all services will complete in December 2024)
- We are expanding and completing the process of securing platform resources with resource locks, as an additional safety automation to prevent deletes. (Estimated completion: January 2025)
- We will accelerate recovery times by refining restore points and optimizing the recovery process for production service artifacts. (Estimated completion: January 2025)
- We will reduce the blast radius through service architectural improvements that increase resiliency against the unavailability of Traffic Manager and other upstream dependencies. (Estimated completion: March 2025)
How can customers make incidents like this less impactful?
- Consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/PSM0-BQ8