Note: During this incident, in addition to our standard communications via Azure Service Health and its alerts, we also communicated via the public Azure Status page. This temporary overcommunication was to ensure that all affected customers received information, described as 'scenario 3' in our documentation: https://aka.ms/StatusPageCriteria. Our Post-Incident Review (PIR) will be communicated to affected customers through Azure Service Health, then this entry on the Status History page will be removed.
Summary of Impact: Between 02:10 UTC and 05:05 UTC on 29 November 2023, you were identified as a customer using Azure Kubernetes Service who may have experienced failures of create/read/update/delete (CRUD) operations, including creating new clusters. Kubernetes related operations via the Kubernetes API were not affected.
Preliminary Root Cause: We determined that a recent deployment resulted in the above-mentioned impact.
Mitigation: We rolled back the deployment to the last good build which mitigated this issue.
Next steps: Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers.
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/2LZ0-3DG
Between 07:24 and 19:00 UTC on 16 September 2023, a subset of customers using Virtual Machines (VMs) in the East US region experienced connectivity issues. This incident was triggered when a number of scale units within one of the datacenters in one of the Availability Zones lost power and, as a result, the nodes in these scale units rebooted. While the majority rebooted successfully, a subset of these nodes failed to come back online automatically. This issue caused downstream impact to services that were dependent on these VMs - including SQL Databases, Service Bus and Event Hubs. Impact varied by service and configuration:
- Virtual Machines were offline during this time. while recovery began at approximately 16:30 UTC, full mitigation was declared at 19:00 UTC.
- While the vast majority of zone-redundant Azure SQL Databases leveraging were not impacted, some customers using proxy mode connection may have experienced impact, due to one connectivity gateway not being configured with zone-resilience.
- SQL Databases with ‘auto-failover groups’ enabled were failed out of the region, incurring approximately eight hours of downtime prior to the failover completing.
- SQL Databases with ‘active geo-replication’ were able to self-initiate a failover to an alternative region manually to restore availability.
- The majority of SQL Databases were recovered no later than 19:00 UTC. Customers would have seen gradual recovery over time during mitigation efforts.
- Finally, non-zonal deployments of Service Bus and Event Hubs would have experienced a degradation. Zonal deployments of Service Bus and Event Hubs were unaffected.
What went wrong and why?
It is not uncommon for datacenters to experience an intermittent loss of power, and one of the ways we protect against this is by leveraging Uninterruptible Power Supplies (UPS). The role of the UPS is to provide stable power to infrastructure during short periods of power fluctuations, so that infrastructure does not fault or go offline. Although we have redundant UPS systems in place for added resilience, this incident was initially triggered by a UPS rectifier failure on a Primary UPS.
The UPS was connected to three Static Transfer Switches (STS) – which are designed to transfer power loads between independent and redundant power sources, without interruption. The STS is designed to remain on the primary source whenever possible, and to transfer back to it when stable power is available again. When the UPS rectifier failed, the STS successfully transferred to the redundant UPS – but then the primary UPS recovered temporarily, albeit in a degraded state. In this degraded state, the primary UPS is unable to provide stable power for the full load. So, after a 5-second retransfer delay, when the STS transferred from the redundant UPS back to the primary UPS, the primary UPS failed completely.
While the STS should then have transferred power back to the redundant UPS, the STS has logic designed to stagger these power transfers when there are multiple transmissions (to and from primary and redundant UPS) happening in a short period of time. This logic prevented the STS from transferring back to the redundant power, after the primary UPS failed completely, which ultimately caused a power loss to a subset of the scale units within the datacenter – at 07:24 UTC, for 1.9 seconds. This scenario of load transfers, to and from degraded UPS, over a short period of time, was not accounted for in the design. After 1.9 seconds, the load moved to the redundant source automatically for a final time. Our onsite datacenter team validated that stable power was feeding all racks immediately after the event, and verified that all devices were powered on.
Following the restoration of power, our SQL monitoring immediately observed customer impact, and automatic communications were sent to customers within 12 minutes. SQL telemetry also provided our first indication that some nodes were stuck during the boot up process. When compute nodes come online, they first check the network connectivity, then make multiple attempts to communicate with the preboot execution environment (PXE) server, to ensure that the correct network routing protocols can be applied. If the host cannot find a PXE server, it is designed to retry indefinitely until one becomes available so it can complete the boot process.
A previously discovered bug that applied to some of our BIOS software led to several hosts not retrying to connect to a PXE server, and remaining in a stuck state. Although this was a known issue, the initial symptoms led us to believe that there was a potential issue with the network and/or our PXE servers – troubleshooting these symptoms led to significant delays in correlating to the known BIOS issue. While multiple teams were engaged to help troubleshoot these issues, our attempts at force rebooting multiple nodes were not successful. As such, a significant amount of time was spent exploring additional mitigation options. Unbeknownst to our on call engineering team, these bulk reboot attempts were blocked by an internal approval process, which has been implemented as a safety measure to restrict the number of nodes that are allowed to be forced rebooted at one time. Once we understood all of the factors inhibiting mitigation, at around 16:30 UTC we proceeded to reboot the relevant nodes within the safety thresholds, which mitigated the BIOS issue successfully.
One of the mechanisms our platform deploys when VMs enter an unhealthy state is ‘service healing’ in which our platform automatically redeploys or migrates it to a healthy node. One of the prerequisites to initiate service healing requires a high percentage of nodes to be healthy – to ensure that, during a major incident, our self-healing systems do not exacerbate the situation. Once we had recovered past the safe threshold, the service healing mechanism initiated for the remainder of the nodes.
Throughout this incident, we did not have adequate alerting in place, and could not determine which specific VMs were impacted, because our assessment tooling relies on a heartbeat emitted from the compute nodes, which were stuck during the boot up process. Unfortunately, the time taken to understand the nature of this incident meant that communications were delayed. For customers using Service Bus and Event Hubs, this was multiple hours. For customers using Virtual Machines, this was multiple days. As such, we are investigating several communications related repairs, including why automated communications were not able to inform customers with impacted VMs in near real time, as expected.
How did we respond?
- 16 September 2023 @ 07:23 UTC - Loss of power to the three STSs.
- 16 September 2023 @ 07:24 UTC - All three downstream STSs fully re-energized.
- 16 September 2023 @ 07:33 UTC - Initial customer impact to SQL DB detected via monitoring.
- 16 September 2023 @ 07:34 UTC - Communications sent to Azure Service Health for SQL DB customers.
- 16 September 2023 @ 11:40 UTC - The relevant compute deployment team engaged to assist in rebooting nodes.
- 16 September 2023 @ 12:13 UTC - The infrastructure firmware team was engaged to troubleshoot the BIOS issues.
- 16 September 2023 @ 13:38 UTC - Multiple compute nodes attempted to be forcefully rebooted with no success.
- 16 September 2023 @ 15:30 UTC - SQL Databases with ‘auto-failover groups’ were successfully failed over.
- 16 September 2023 @ 15:37 UTC - Communications sent to Azure Service Health for Service Bus and Event Hub customers.
- 16 September 2023 @ 16:30 UTC - Safety thresholds blocking reboot attempts understood, successful batch rebooting begins.
- 16 September 2023 @ 16:37 UTC - Communications published to Azure Status page, in lieu of more accurate impact assessment.
- 16 September 2023 @ 19:00 UTC - All compute and SQL nodes successfully mitigated.
- 16 September 2023 @ 22:07 UTC - Once mitigation was validated, communications sent to Azure Service Health for SQL, Service Bus, and Event Hub customers.
- 19 September 2023 @ 04:10 UTC – Once VM impact was determined, communications sent to Azure Service Health for VM customers.
How are we making incidents like this less likely or less impactful?
- First and foremost, we have replaced the failed rectifier inside the UPS. (Completed)
- We are working with the manufacturer to perform a UPS rectifier failure analysis. (Estimated completion: October 2023)
- We are reviewing the status of STS automated transfer logic across all of our datacenters. (Estimated completion: October 2023)
- We are working to modify the STS logic to correct the transfer delay issue. (Estimated completion: December 2023)
- We have been deploying the fix for the BIOS issue as of January 2023 – we are expediting rollout. (Estimated completion: June 2024)
- We are improving our detection of stuck nodes for incidents of this class. (Estimated completion: October 2023)
- We are improving our automated mitigation of stuck nodes for incidents of this class. (Estimated completion: March 2024)
- We are improving the resiliency of our automated communication system for incidents of this class. (Estimated completion: October 2023)
- We are reviewing the status of STS automated transfer in all our sites. (Estimated completion: October 2023)
- For the issue surrounding Multi-AZ Azure SQL Databases using a proxy mode connection, the fix was already underway before this incident and has since been deployed. (Completed)
How can customers make incidents like this less impactful?
- Consider using Availability Zones (AZs) to run your services across physically separate locations within an Azure region. To help services be more resilient to datacenter-level failures like this one, each AZ provides independent power, networking, and cooling. Many Azure services support zonal, zone-redundant, and/or always-available configurations: https://docs.microsoft.com/azure/availability-zones/az-overview
- For mission-critical workloads, customers should consider a multi-region geodiversity strategy to avoid impact from incidents like this one that impacted a single region: https://learn.microsoft.com/training/modules/design-a-geographically-distributed-application/ and https://learn.microsoft.com/azure/architecture/patterns/geodes
- We encourage customers to review and follow our guidance and best practices around Azure SQL Database disaster recovery – practice disaster drills to ensure that your application can handle the cross-region failover gracefully: https://learn.microsoft.com/azure/azure-sql/database/disaster-recovery-guidance
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/azure/architecture/framework/resiliency
- Consider ensuring that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/2LZ0-3DG