26 December 2024

This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details.

What happened?

Between 18:44 UTC on 26 December and 19:30 UTC on 27 December 2024, multiple Azure services were impacted by a power event that occurred in one datacenter, within one Availability Zone (physical zone AZ03), in the South Central US region. Within the impacted datacenter, our automated power systems managed the event as expected, without interruption for two of the three data halls. However, one data hall did not successfully transition to an alternate power supply. This failure led to a loss of compute, network, and storage infrastructure in this data hall.

Customer workloads configured for multi-zone resiliency would have seen no impact, or only brief impact as automated mitigations occurred. Only those customer workloads without multi-zone resiliency, and with dependencies on the impacted infrastructure, became unavailable. This included impact to the following services: Application Gateway, Azure App Services, Azure Cosmos DB, Azure Data Factory, Azure Database for PostgreSQL, Azure Event Hubs, Azure Firewall, Azure IoT Hub, Azure Log Analytics, Azure Logic Apps, Azure Service Bus, Azure SQL Database, Azure Storage, Azure Synapse Analytics, and Azure Virtual Machines. This incident also impacted a subset of Microsoft 365 services – further details are provided in the Microsoft 365 Admin Center, under incident ID MO966473. 

What do we know so far?

While we continue to investigate the power issue, we are confident in our running state and can operate without interruption. This incident was caused by a loss of utility power at 18:44 UTC, triggered by a localized ground fault in which a high-voltage underground line failed. During the transition to generator power, the impacted data hall experienced issues due to UPS battery faults, which caused the load to drop during the transition. At this stage we do not have more detailed information to share regarding the fault on the underground line, or the nature of the battery faults, but we expect to be able to share details in the Final PIR, within two weeks.

In any power-related event, our first priority is to ensure the safety of our staff and infrastructure before any power restoration work can begin. Following our assessment, we were able to safely begin restoration at 20:13 UTC. We began powering up our infrastructure by 20:35 UTC, with power restored by 20:56 UTC. With power restored, the next validation steps were to ensure that Azure Networking and Azure Storage were recovering as expected. By 21:00 UTC, almost all storage and network infrastructure was confirmed as fully operational. A single storage scale unit remained significantly degraded, due to hardware that required deeper inspection and then replacement.

As storage scale units recovered, 85% of the impacted Virtual Machines (VMs) recovered by 21:40 UTC as their Virtual Hard Disks (VHDs) became available. The next 13% of VMs recovered between 06:00 and 06:30 UTC, as the final storage scale unit became available. Even after all storage issues were resolved, less than 2% of the VMs impacted by this event remained unhealthy. These issues are detailed below, and explain why impacted downstream services with dependencies on these VMs experienced long-tail recoveries. The incident was declared as mitigated at 19:30 UTC on 27 December 2024.

Azure Storage:

For Zone Redundant Storage (ZRS) accounts, there was no availability impact – as data was served from replicas in other Availability Zones during this incident. 
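As an illustration of the zone-redundant configuration referenced above, here is a minimal sketch (not part of the original PIR) of creating a ZRS storage account with the Azure SDK for Python; the subscription ID, resource group, account name, and region are placeholder values.

```python
# Hypothetical sketch: create a storage account using zone-redundant storage (ZRS),
# so data is replicated synchronously across Availability Zones in the region.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

subscription_id = "<subscription-id>"                      # placeholder
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

poller = client.storage_accounts.begin_create(
    resource_group_name="my-resource-group",               # placeholder
    account_name="myzrsstorageacct",                       # placeholder, must be globally unique
    parameters=StorageAccountCreateParameters(
        location="southcentralus",
        kind="StorageV2",
        sku=Sku(name="Standard_ZRS"),                      # zone-redundant storage
    ),
)
account = poller.result()
print(account.name, account.sku.name)
```

Because ZRS keeps synchronous replicas in multiple Availability Zones, requests to ZRS accounts could continue to be served from the unaffected zones during this incident.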

The power loss event impacted six Storage scale units. After power restoration, scale units hosting Standard SSD Managed Disks, Premium SSD Managed Disks, Premium Blobs, and Premium Files, fully recovered automatically in less than 30 minutes. For most of the HDD-based Standard Storage LRS/GRS scale units, the storage services took approximately one hour to recover.

Unfortunately, within one Standard Storage scale unit, several network devices were non-functional following the power event, causing a significant portion of the data in that scale unit to be inaccessible. This caused significant impact to VMs and dependent services that were using Standard HDD managed disks and LRS blob or file storage accounts hosted on this scale unit. Mitigation required networking equipment to be sourced from spares in the region, brought on-site and installed by datacenter technicians. Network engineers then configured and validated these devices, before bringing them online. Additional actions were taken to recover storage nodes under the replaced switches. Availability was restored for the vast majority of accounts in this scale unit by 06:10 UTC, with 100% availability restored by 06:30 UTC, on 27 December 2024. 

Azure Compute / Virtual Machines:

For customers using VM/compute workloads that leveraged multi-zone resiliency (such as VMSS flex across availability zones), there was no availability impact. 
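As a rough way to check for zone pinning, the sketch below (an illustration under assumptions, not part of the original PIR) lists the VMs in a subscription together with their declared availability zones using the Azure SDK for Python; an empty zones list means the VM was deployed without specifying a zone.

```python
# Hypothetical audit sketch: list VMs and their availability zone assignments.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

subscription_id = "<subscription-id>"                      # placeholder
compute = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

for vm in compute.virtual_machines.list_all():
    zones = vm.zones or []                                 # None/empty means no zone was specified
    print(f"{vm.name} ({vm.location}): zones={zones if zones else 'regional, no zone pinning'}")
```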

For incidents like this, Azure has an automated recovery suite called ‘Defibrillator’ that starts automatically once datacenter power has been restored, to recover affected VMs and the Host machines they run on. It orchestrates powering on all affected Host machines, monitors the boot-up and bootstrap sequences, and ensures that the VMs are up and running. While this is running, Azure’s automated steady-state health detection and remediation systems suspend their activities, to avoid disrupting the disaster recovery process. 

At approximately 22:00 UTC on 26 December 2024, some compute scale units were found to be recovering more slowly than expected. The final 2% of VMs, mentioned above, experienced an extended recovery; we observed three separate issues that contributed to this. 

  • The first scenario was due to initialization without a connection to a network device. Because the network devices were not fully configured before the Host machines were powered on, a race condition was triggered during the Host bootstrap process. This issue is specific to a certain hardware configuration within localized compute scale units, and necessitated temporarily disabling some validation checks during the bootstrap process.
  • The second scenario delaying recovery was that some machines failed to boot into the Host OS, due to a newly discovered bootloader bug impacting a small subset of host hardware with higher levels of offlined memory pages. When the hardware reports repeated corrected memory errors to the Host OS, the Host offlines the affected memory ranges to prevent their repeated use. On the small subset of host hardware where a large range of offline memory had accumulated, this new Host OS bug resulted in a failure to bootstrap the Host OS. This category was mitigated by clearing and/or ignoring the offline memory list, allowing the Host OS to make forward progress where it could and then rebuild its offline memory list once the full OS was running. 
  • The third scenario that prevented compute recovery in some cases involved the control plane devices that sit inline to execute power operations on the Host machines. Datacenter technicians were required to reseat that infrastructure manually. 

By 10:50 UTC on 27 December, >99.8% of the impacted VMs had recovered, and our team re-enabled Azure’s automated detection and remediation mechanisms. Targeted remediation efforts, including manual intervention, were required to bring a remaining small percentage of VMs back online.

Azure Cosmos DB

For Azure Cosmos DB accounts configured with availability zones, there was no impact, and those accounts maintained availability for reads and writes.

Impact on other Cosmos DB accounts varied depending on the customer database account regional configurations and consistency settings: 

  • Database accounts configured with availability zones were not impacted by the incident, and maintained availability for reads and writes (see the sketch after this list).
  • Database accounts with multiple read regions and a single write region outside South Central US maintained availability for reads and writes if configured with session or lower consistency. Accounts using strong or bounded staleness consistency may have experienced write throttling to preserve consistency guarantees until the South Central US region was either taken offline or recovered. This behavior is by design. 
  • Active-passive database accounts with multiple read regions and a single write region in South Central US maintained read availability, but write availability was impacted until the South Central US region was taken offline or recovered.
  • Single-region database accounts in South Central US without Availability Zone configuration were impacted if any partition resided on the affected instances.
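For reference, the following is a rough sketch (assuming the azure-mgmt-cosmosdb package and its Location.is_zone_redundant setting; not part of the original PIR) of creating a Cosmos DB account with availability zones enabled in its region; all names are placeholders.

```python
# Hypothetical sketch: create a Cosmos DB account with availability zone support
# enabled for its region, so replicas are spread across zones.
from azure.identity import DefaultAzureCredential
from azure.mgmt.cosmosdb import CosmosDBManagementClient
from azure.mgmt.cosmosdb.models import DatabaseAccountCreateUpdateParameters, Location

client = CosmosDBManagementClient(DefaultAzureCredential(), "<subscription-id>")

params = DatabaseAccountCreateUpdateParameters(
    location="southcentralus",
    locations=[
        Location(
            location_name="southcentralus",
            failover_priority=0,
            is_zone_redundant=True,        # spread replicas across Availability Zones
        )
    ],
)

poller = client.database_accounts.begin_create_or_update(
    "my-resource-group",                   # placeholder
    "my-cosmos-account",                   # placeholder
    params,
)
print(poller.result().document_endpoint)
```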

Azure SQL Database:  

For Azure SQL Databases configured with zone redundancy, there was no impact. 

A subset of customers in this region experienced unavailability and slow or stuck control plane operations (such as updating the service level objective) for databases that are not configured as zone redundant. Customers with an active geo-replication configuration were asked to consider failing out of the region at approximately 22:31 UTC.

Impact duration varied. Most databases recovered after Azure Storage recovered. Some databases took an extended time to recover due to the aforementioned long recovery time of some underlying VMs.
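As a quick audit of the zone-redundancy setting mentioned above, a sketch like the following (assuming the azure-mgmt-sql package; the subscription, server, and resource group names are placeholders) lists which databases on a logical server are configured as zone redundant.

```python
# Hypothetical audit sketch: report the zone_redundant flag for each database on a server.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

sql = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")

for db in sql.databases.list_by_server("my-resource-group", "my-sql-server"):
    print(f"{db.name}: zone_redundant={db.zone_redundant}")
```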

Azure Application Gateway:

For instances configured with zone redundancy, customers may have experienced degraded throughput. 

However, customers who were deployed to a single zone or were using a regional deployment model with no zones specified, may have experienced complete data path loss if their deployments happened to have instances/disks in the impacted zone.

Azure Firewall:

For Azure Firewalls deployed to all Availability Zones of the region, customers would not have experienced any data path impact.

However, customers with an Azure Firewall deployed only to the impacted Availability Zone (physical zone AZ03), may have experienced some performance degradation – affecting the ability to scale out. Finally, customers attempting control plane operations (for example, making changes to Firewall policies/rules) may have experienced failures during this incident. Both of these impacts were experienced between 18:44 UTC on 26 December and 07:22 UTC on 27 December 2024.
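Both Application Gateway and Azure Firewall expose their zone configuration on the resource itself; the following is a rough audit sketch (assuming the azure-mgmt-network package; the subscription ID is a placeholder) that lists which instances declare availability zones, where an empty zones list indicates a regional deployment with no zones specified.

```python
# Hypothetical audit sketch: list Application Gateways and Azure Firewalls with
# their declared availability zones.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

network = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

for agw in network.application_gateways.list_all():
    print(f"Application Gateway {agw.name} ({agw.location}): zones={agw.zones or 'none'}")

for fw in network.azure_firewalls.list_all():
    print(f"Azure Firewall {fw.name} ({fw.location}): zones={fw.zones or 'none'}")
```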

Azure Service Bus and Azure Event Hubs:

Customers with Standard SKU, Premium SKU namespaces or AZ-enabled Dedicated Event Hubs clusters experienced an availability drop for approximately five minutes, at the time when the incident started – this issue was mitigated automatically once namespace resources were reallocated to other availability zones.

However, a subset of customers using Event Hubs Dedicated non-AZ clusters experienced an availability issue for an extended period of time when trying to access their Event Hubs namespaces in the region. The affected Event Hubs dedicated clusters recovered once the underlying failing VMs in their clusters were brought back online, the last of which were restored by 05:52 UTC on 27 December.
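To see which namespaces fall into the tiers described above, a rough inventory sketch like the following could be used (assuming the azure-mgmt-eventhub and azure-mgmt-servicebus packages; the subscription ID is a placeholder). Dedicated Event Hubs clusters are separate resources and are not covered by this listing.

```python
# Hypothetical inventory sketch: list Event Hubs and Service Bus namespaces with their SKU tiers.
from azure.identity import DefaultAzureCredential
from azure.mgmt.eventhub import EventHubManagementClient
from azure.mgmt.servicebus import ServiceBusManagementClient

subscription_id = "<subscription-id>"                      # placeholder
credential = DefaultAzureCredential()

eventhub = EventHubManagementClient(credential, subscription_id)
servicebus = ServiceBusManagementClient(credential, subscription_id)

print("Event Hubs namespaces:")
for ns in eventhub.namespaces.list():
    print(f"  {ns.name} ({ns.location}): sku={ns.sku.name}")

print("Service Bus namespaces:")
for ns in servicebus.namespaces.list():
    print(f"  {ns.name} ({ns.location}): sku={ns.sku.name}")
```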

How did we respond?

  • 18:44 UTC on 26 December 2024 – Initial power event occurred.
  • 18:45 UTC on 26 December 2024 – Technicians from the datacenter operations team engaged.
  • 19:02 UTC on 26 December 2024 – Datacenter incident call began to support triaging and troubleshooting issues.
  • 19:08 UTC on 26 December 2024 – Azure engineering teams joined a central incident call, to triage and troubleshoot Azure service impact.
  • 20:13 UTC on 26 December 2024 – Power restoration assessed safe, and began.
  • 20:35 UTC on 26 December 2024 – Compute, Network, and Storage infrastructure began to recover.
  • 20:56 UTC on 26 December 2024 – Power had been restored. Infrastructure recovery continued.
  • 21:40 UTC on 26 December 2024 – 85% of the VMs impacted by underlying VHD availability recovered.
  • 06:30 UTC on 27 December 2024 – Additional 13% of VMs impacted by VHD availability recovered.  
  • 08:30 UTC on 27 December 2024 – Ongoing mitigation of additionally impacted services.
  • 13:00 UTC on 27 December 2024 – Mitigation to most affected services confirmed.
  • 19:30 UTC on 27 December 2024 – Incident mitigation confirmed and declared. 

How are we making incidents like this less likely or less impactful?   

  • We have measured and evaluated all UPS batteries that support this data hall, and are in the process of identifying and replacing any that require it.
  • We are reviewing the nature of the UPS battery failures in line with our battery standards and maintenance procedures, to identify improvements to de-risk this scenario.
  • We are in the process of commissioning an additional utility power source to this datacenter, to add further redundancy.
  • The mitigation to bypass various checks during the bootstrap process has been applied to all impacted machines, and is being evaluated and applied to other hardware configurations where needed.
  • This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details.

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

13 November 2024

Watch our 'Azure Incident Retrospective' video about this incident:

What happened?

Between 00:50 UTC and 12:30 UTC on 13 November 2024, a subset of Azure Blob Storage and Azure Data Lake Storage accounts experienced connectivity errors. The issue manifested as loss of access to Blob and Data Lake storage endpoints of the affected storage accounts, and subsequent unavailability of downstream services that depended on these storage accounts. Since many of the impacted storage accounts were used by other Azure services and major software vendor solutions, the customer impact was widespread. Although unavailable to access, the data stored in these storage accounts was not impacted during this incident. Impacted downstream services included:

  • Azure Storage: Impacted customers may have experienced name (DNS) resolution failures when interacting with impacted storage accounts in Australia East, Australia Southeast, Brazil South, Brazil Southeast, Canada Central, Canada East, Central India, Central US, East Asia, East US, East US 2, East US 2 EUAP, France Central, Germany West Central, Japan East, Japan West, Korea Central, North Central US, North Europe, Norway East, South Africa North, South Central US, South India, Southeast Asia, Sweden Central, Switzerland North, UAE North, UK South, UK West, West Central US, West Europe, West US, West US 2, West US 3.
  • Azure Container Registry: Impacted customers using the East US region may have experienced intermittent 5xx errors while trying to pull images from the registry.
  • Azure Databricks: Impacted customers may have experienced failures with launching clusters and serverless compute resources in Australia East, Canada Central, Canada East, Central US, East US, East US 2, Japan East, South Central US, UAE North, West US, and/or West US 2.
  • Azure Log Analytics: Impacted customers using the West Europe, Southeast Asia, and/or Korea Central regions may have experienced delays and/or stale data when viewing Microsoft Graph activity logs.

What went wrong and why?

Azure Traffic Manager manages and routes blob and data lake storage API requests. The incident was caused by an unintentional deletion of the Traffic Manager profiles for the impacted storage accounts. These Traffic Manager profiles were originally part of an Azure subscription pool belonging to the Azure Storage service. That original service was later split into two separate services, one of which would eventually be deprecated. Ownership of the subscriptions containing the Traffic Manager profiles for storage accounts should have been transferred to the service that continued operating, but this step was missed. As a result, the decommissioning process inadvertently deleted the Traffic Manager profiles under those subscriptions, leading to loss of access to the affected storage accounts. To learn more about Azure Traffic Manager profiles, see: .

How did we respond?

After receiving customer reports of issues, our team immediately engaged to investigate. Once we understood what had triggered the problem, our team started to restore the Traffic Manager profiles of the affected storage accounts. Recovery took an extended period of time, since it required care in reconstructing the Traffic Manager profiles while avoiding further customer impact. We started multiple workstreams in parallel, to drive manual recovery and to create automated steps to speed up recovery. Recovery was carried out in phases, with the majority of affected accounts restored by 06:24 UTC, and the last set of storage accounts recovered and fully operational by 12:30 UTC. Timeline of key events:

  • 13 November 2024 @ 00:50 UTC – First customer impact, triggered by the deletion of a Traffic Manager profile.
  • 13 November 2024 @ 01:24 UTC – First customer report of issues, on-call engineering team began to investigate.
  • 13 November 2024 @ 01:40 UTC – We identified that the issues were triggered by the deletion of Traffic Manager profiles.
  • 13 November 2024 @ 02:16 UTC – Impacted Traffic Manager profiles identified, and recovery planning started.
  • 13 November 2024 @ 02:25 UTC – Recovery workstreams started.
  • 13 November 2024 @ 03:51 UTC – First batch of storage accounts recovered and validated.
  • 13 November 2024 @ 06:00 UTC – Automation to perform regional recovery in place.
  • 13 November 2024 @ 06:24 UTC – Majority of recovery completed; most impacted accounts were accessible by this time.
  • 13 November 2024 @ 12:30 UTC – Recovery and validation 100% complete, incident mitigated. 

How are we making incidents like this less likely or less impactful?

  • We completed an audit of all the production service artifacts used by the Azure Storage resource provider. (Completed)
  • We created a new highly restrictive deployment approval process, as an additional measure to prevent unintended mutations like deletions. (Completed)
  • We are improving the process used to clean up production service artifacts, with built-in safety to prevent impact. (Estimated completion: December 2024)
  • We are enhancing our monitoring of outside-in storage traffic, making it more sensitive to smaller impacts and able to validate connectivity and reachability for all endpoints of storage accounts. (Some services completed; all services will complete in December 2024)
  • We are expanding and completing the process of securing platform resources with resource locks, as an additional safety automation to prevent deletes (see the sketch after this list). (Estimated completion: January 2025)
  • We will accelerate recovery times by refining restore points and optimizing the recovery process for production service artifacts. (Estimated completion: January 2025)
  • We will reduce the blast radius through service architectural improvements that increase resiliency against unavailability of Traffic Manager and other upstream dependencies. (Estimated completion: March 2025)
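As an illustration of the resource-lock item above, here is a minimal sketch (assuming the azure-mgmt-resource package; all names are placeholders, and this is not the tooling Microsoft used) of placing a CanNotDelete management lock on a resource group that holds shared artifacts, so that accidental deletions like the one described in this incident are blocked.

```python
# Hypothetical sketch: apply a CanNotDelete lock at resource group scope.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ManagementLockClient
from azure.mgmt.resource.locks.models import ManagementLockObject

locks = ManagementLockClient(DefaultAzureCredential(), "<subscription-id>")

lock = locks.management_locks.create_or_update_at_resource_group_level(
    resource_group_name="shared-platform-rg",              # placeholder
    lock_name="do-not-delete",
    parameters=ManagementLockObject(
        level="CanNotDelete",
        notes="Protects shared artifacts such as Traffic Manager profiles from accidental deletion.",
    ),
)
print(lock.name, lock.level)
```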

How can customers make incidents like this less impactful?

  • Consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more:
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact:

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: