16 September 2023
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/2LZ0-3DG
What happened?
Between 07:24 and 19:00 UTC on 16 September 2023, a subset of customers using Virtual Machines (VMs) in the East US region experienced connectivity issues. This incident was triggered when a number of scale units within one of the datacenters in one of the Availability Zones lost power and, as a result, the nodes in these scale units rebooted. While the majority rebooted successfully, a subset of these nodes failed to come back online automatically. This issue caused downstream impact to services that were dependent on these VMs - including SQL Databases, Service Bus and Event Hubs. Impact varied by service and configuration:
- Virtual Machines were offline during this time. While recovery began at approximately 16:30 UTC, full mitigation was declared at 19:00 UTC.
- While the vast majority of zone-redundant Azure SQL Databases were not impacted, some customers using proxy mode connections may have experienced impact, due to one connectivity gateway not being configured with zone resilience.
- SQL Databases with ‘auto-failover groups’ enabled were failed over out of the region, incurring approximately eight hours of downtime before the failover completed.
- SQL Databases with ‘active geo-replication’ could be failed over manually by customers to an alternative region to restore availability.
- The majority of SQL Databases were recovered no later than 19:00 UTC. Customers would have seen gradual recovery over time during mitigation efforts.
- Finally, non-zonal deployments of Service Bus and Event Hubs would have experienced a degradation. Zonal deployments of Service Bus and Event Hubs were unaffected.
What went wrong and why?
It is not uncommon for datacenters to experience an intermittent loss of power, and one of the ways we protect against this is by leveraging Uninterruptible Power Supplies (UPS). The role of the UPS is to provide stable power to infrastructure during short periods of power fluctuations, so that infrastructure does not fault or go offline. Although we have redundant UPS systems in place for added resilience, this incident was initially triggered by a UPS rectifier failure on a Primary UPS.
The UPS was connected to three Static Transfer Switches (STS) – which are designed to transfer power loads between independent and redundant power sources, without interruption. The STS is designed to remain on the primary source whenever possible, and to transfer back to it when stable power is available again. When the UPS rectifier failed, the STS successfully transferred to the redundant UPS – but then the primary UPS recovered temporarily, albeit in a degraded state. In this degraded state, the primary UPS is unable to provide stable power for the full load. So, after a 5-second retransfer delay, when the STS transferred from the redundant UPS back to the primary UPS, the primary UPS failed completely.
While the STS should then have transferred power back to the redundant UPS, the STS has logic designed to stagger these power transfers when there are multiple transfers (to and from the primary and redundant UPS) happening in a short period of time. This logic prevented the STS from transferring back to the redundant power source after the primary UPS failed completely, which ultimately caused a power loss to a subset of the scale units within the datacenter – at 07:24 UTC, for 1.9 seconds. This scenario of repeated load transfers to and from a degraded UPS over a short period of time was not accounted for in the design. After 1.9 seconds, the load moved to the redundant source automatically for a final time. Our onsite datacenter team validated that stable power was feeding all racks immediately after the event, and verified that all devices were powered on.
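To make the failure mode above concrete, here is a minimal sketch (invented class, parameters, and thresholds; not the vendor's actual firmware) of a transfer switch that prefers the primary source, waits a retransfer delay before returning to it, and staggers transfers when too many occur in a short window:

```python
# Hypothetical parameters for illustration only; real STS firmware values differ.
RETRANSFER_DELAY_S = 5        # delay before moving back to the preferred (primary) source
STAGGER_WINDOW_S = 30         # look-back window for recent transfers
MAX_TRANSFERS_IN_WINDOW = 2   # stagger guard: block further transfers beyond this

class StaticTransferSwitch:
    """Toy model of the transfer behaviour described above, not vendor firmware."""

    def __init__(self):
        self.active_source = "primary"
        self.transfer_times = []  # timestamps (seconds) of recent transfers

    def _stagger_blocked(self, now):
        recent = [t for t in self.transfer_times if now - t < STAGGER_WINDOW_S]
        return len(recent) >= MAX_TRANSFERS_IN_WINDOW

    def _transfer(self, target, now):
        if self._stagger_blocked(now):
            # The guard that normally prevents oscillation also blocks a genuinely
            # needed transfer, which is the gap observed in this incident.
            return False
        self.active_source = target
        self.transfer_times.append(now)
        return True

    def on_primary_failed(self, now):
        return self._transfer("redundant", now)

    def on_primary_restored(self, now):
        # The STS prefers the primary source, but only transfers back after the delay.
        return self._transfer("primary", now + RETRANSFER_DELAY_S)

# Replaying the incident sequence: fail -> restore (degraded) -> fail again.
sts = StaticTransferSwitch()
print(sts.on_primary_failed(0))      # True:  load moved to the redundant UPS
print(sts.on_primary_restored(10))   # True:  load moved back to the degraded primary
print(sts.on_primary_failed(16))     # False: stagger guard blocks the needed transfer
```

Replaying the fail / restore / fail sequence shows how a guard intended to prevent oscillation can briefly leave the load on a failed source.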
Following the restoration of power, our SQL monitoring immediately observed customer impact, and automatic communications were sent to customers within 12 minutes. SQL telemetry also provided our first indication that some nodes were stuck during the boot up process. When compute nodes come online, they first check the network connectivity, then make multiple attempts to communicate with the preboot execution environment (PXE) server, to ensure that the correct network routing protocols can be applied. If the host cannot find a PXE server, it is designed to retry indefinitely until one becomes available so it can complete the boot process.
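As a rough illustration of the boot behaviour described here (a hedged sketch only; real PXE discovery uses DHCP/TFTP exchanges rather than a TCP probe, and the port below is a placeholder):

```python
import socket
import time

def pxe_server_reachable(host: str, port: int = 4011, timeout: float = 2.0) -> bool:
    """Placeholder reachability probe; real PXE discovery is a DHCP/TFTP exchange."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def boot_with_pxe(pxe_host: str, retry_interval_s: float = 5.0) -> None:
    # The host is designed to retry indefinitely until a PXE server responds.
    # The BIOS bug described below effectively broke out of this loop,
    # leaving nodes stuck instead of retrying.
    while not pxe_server_reachable(pxe_host):
        time.sleep(retry_interval_s)
    # ...continue boot: fetch the network boot program, apply routing config, etc.
```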
A previously discovered bug in some of our BIOS software led to several hosts not retrying to connect to a PXE server, and remaining in a stuck state. Although this was a known issue, the initial symptoms led us to believe that there was a potential issue with the network and/or our PXE servers – troubleshooting these symptoms led to significant delays in correlating to the known BIOS issue. While multiple teams were engaged to help troubleshoot, our attempts at force rebooting multiple nodes were not successful. As such, a significant amount of time was spent exploring additional mitigation options. Unbeknownst to our on-call engineering team, these bulk reboot attempts were blocked by an internal approval process, which has been implemented as a safety measure to restrict the number of nodes that can be force rebooted at one time. Once we understood all of the factors inhibiting mitigation, at around 16:30 UTC we proceeded to reboot the relevant nodes within the safety thresholds, which mitigated the BIOS issue successfully.
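The approval gate that blocked the bulk reboots can be thought of along these lines. This is a hypothetical sketch with invented names and limits, not the internal tooling itself:

```python
# Illustrative only: names and limits are invented for this sketch.
MAX_FORCE_REBOOTS_PER_REQUEST = 20   # hypothetical safety threshold

class ApprovalRequired(Exception):
    """Raised when an operation exceeds the automatic safety threshold."""

def reboot(node: str) -> None:
    print(f"rebooting {node}")  # placeholder for the actual reboot call

def force_reboot(nodes: list[str]) -> None:
    if len(nodes) > MAX_FORCE_REBOOTS_PER_REQUEST:
        # In the incident, this path was hit without the on-call team realising it,
        # so the bulk reboot attempts appeared to simply fail.
        raise ApprovalRequired(
            f"{len(nodes)} nodes requested; limit is {MAX_FORCE_REBOOTS_PER_REQUEST}. "
            "Manual approval is required for larger batches."
        )
    for node in nodes:
        reboot(node)

# Mitigation at ~16:30 UTC amounted to issuing reboots in batches within the threshold:
stuck_nodes = [f"node-{i}" for i in range(100)]
for i in range(0, len(stuck_nodes), MAX_FORCE_REBOOTS_PER_REQUEST):
    force_reboot(stuck_nodes[i:i + MAX_FORCE_REBOOTS_PER_REQUEST])
```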
One of the mechanisms our platform uses when a VM enters an unhealthy state is ‘service healing’, in which the platform automatically redeploys or migrates it to a healthy node. One of the prerequisites to initiate service healing is a high percentage of healthy nodes – to ensure that, during a major incident, our self-healing systems do not exacerbate the situation. Once we had recovered past the safe threshold, the service healing mechanism initiated for the remainder of the nodes.
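A minimal sketch of this prerequisite (the threshold value is invented for illustration; the platform's actual value is not published here):

```python
# Hypothetical threshold; the real platform value is not published.
SERVICE_HEALING_MIN_HEALTHY_FRACTION = 0.9

def can_service_heal(healthy_nodes: int, total_nodes: int) -> bool:
    """Only redeploy/migrate VMs automatically if most of the scale unit is healthy."""
    return healthy_nodes / total_nodes >= SERVICE_HEALING_MIN_HEALTHY_FRACTION

# During the incident, too many nodes were stuck for healing to start;
# once the batch reboots pushed health past the threshold, healing took over.
print(can_service_heal(healthy_nodes=60, total_nodes=100))   # False: healing deferred
print(can_service_heal(healthy_nodes=95, total_nodes=100))   # True: healing proceeds
```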
Throughout this incident, we did not have adequate alerting in place and could not determine which specific VMs were impacted, because our assessment tooling relies on a heartbeat emitted from the compute nodes, which were stuck during the boot-up process. Unfortunately, the time taken to understand the nature of this incident meant that communications were delayed. For customers using Service Bus and Event Hubs, this delay was multiple hours. For customers using Virtual Machines, it was multiple days. As such, we are investigating several communications-related repair items, including why automated communications were not able to inform customers with impacted VMs in near real time, as expected.
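The detection gap can be summarised with a small sketch, using invented field names, of an impact assessment that keys entirely off node heartbeats:

```python
# Simplified, invented structure: each heartbeat reports the node's health and its hosted VMs.
heartbeats = {
    "node-01": {"healthy": False, "vms": ["vm-a", "vm-b"]},
    "node-02": {"healthy": True,  "vms": ["vm-c"]},
    # node-03 is stuck during boot and never reports, so its VMs are invisible here.
}

def vms_flagged_as_impacted(reports: dict) -> list[str]:
    """Heartbeat-driven assessment: it can only flag VMs on nodes that actually report."""
    impacted = []
    for node, report in reports.items():
        if not report["healthy"]:
            impacted.extend(report["vms"])
    return impacted

# vm-a and vm-b are flagged, but anything on node-03 is silently missed,
# which is why per-customer VM communications were delayed.
print(vms_flagged_as_impacted(heartbeats))
```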
How did we respond?
- 16 September 2023 @ 07:23 UTC - Loss of power to the three STSs.
- 16 September 2023 @ 07:24 UTC - All three downstream STSs fully re-energized.
- 16 September 2023 @ 07:33 UTC - Initial customer impact to SQL DB detected via monitoring.
- 16 September 2023 @ 07:34 UTC - Communications sent to Azure Service Health for SQL DB customers.
- 16 September 2023 @ 11:40 UTC - The relevant compute deployment team engaged to assist in rebooting nodes.
- 16 September 2023 @ 12:13 UTC - The infrastructure firmware team was engaged to troubleshoot the BIOS issues.
- 16 September 2023 @ 13:38 UTC - Force reboots attempted on multiple compute nodes, without success.
- 16 September 2023 @ 15:30 UTC - SQL Databases with ‘auto-failover groups’ were successfully failed over.
- 16 September 2023 @ 15:37 UTC - Communications sent to Azure Service Health for Service Bus and Event Hub customers.
- 16 September 2023 @ 16:30 UTC - Safety thresholds blocking reboot attempts understood, successful batch rebooting began.
- 16 September 2023 @ 16:37 UTC - Communications published to Azure Status page, in lieu of more accurate impact assessment.
- 16 September 2023 @ 19:00 UTC - All compute and SQL nodes successfully mitigated.
- 16 September 2023 @ 22:07 UTC - Once mitigation was validated, communications sent to Azure Service Health for SQL, Service Bus, and Event Hub customers.
- 19 September 2023 @ 04:10 UTC – Once VM impact was determined, communications sent to Azure Service Health for VM customers.
How are we making incidents like this less likely or less impactful?
- First and foremost, we have replaced the failed rectifier inside the UPS. (Completed)
- We are working with the manufacturer to perform a UPS rectifier failure analysis. (Estimated completion: October 2023)
- We are reviewing the status of STS automated transfer logic across all of our datacenters. (Estimated completion: October 2023)
- We are working to modify the STS logic to correct the transfer delay issue. (Estimated completion: December 2023)
- We have been deploying the fix for the BIOS issue since January 2023 – we are expediting rollout. (Estimated completion: June 2024)
- We are improving our detection of stuck nodes for incidents of this class. (Estimated completion: October 2023)
- We are improving our automated mitigation of stuck nodes for incidents of this class. (Estimated completion: March 2024)
- We are improving the resiliency of our automated communication system for incidents of this class. (Estimated completion: October 2023)
- For the issue surrounding Multi-AZ Azure SQL Databases using a proxy mode connection, the fix was already underway before this incident and has since been deployed. (Completed)
How can customers make incidents like this less impactful?
- Consider using Availability Zones (AZs) to run your services across physically separate locations within an Azure region. To help services be more resilient to datacenter-level failures like this one, each AZ provides independent power, networking, and cooling. Many Azure services support zonal, zone-redundant, and/or always-available configurations: https://docs.microsoft.com/azure/availability-zones/az-overview
- For mission-critical workloads, customers should consider a multi-region geodiversity strategy to avoid impact from incidents like this one that impacted a single region: https://learn.microsoft.com/training/modules/design-a-geographically-distributed-application/ and https://learn.microsoft.com/azure/architecture/patterns/geodes
- We encourage customers to review and follow our guidance and best practices around Azure SQL Database disaster recovery – practice disaster drills to ensure that your application can handle the cross-region failover gracefully: https://learn.microsoft.com/azure/azure-sql/database/disaster-recovery-guidance
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/azure/architecture/framework/resiliency
- Consider ensuring that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/2LZ0-3DG
30 August 2023
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/VVTQ-J98
What happened?
Starting at approximately 10:30 UTC on 30 August 2023, customers may have experienced issues accessing or using Azure, Microsoft 365 and Power Platform services. This incident was triggered by a utility power sag at 08:41 UTC on 30 August 2023, which impacted one of the three Availability Zones of the Australia East region. This power sag tripped a subset of the cooling system chiller units offline and, while we worked to restore cooling, temperatures in the datacenter increased to levels above operational thresholds. We powered down a small subset of selected compute and storage scale units, both to lower temperatures and to prevent damage to hardware. Although the vast majority of services recovered by 22:40 UTC on 30 August 2023, full mitigation was not declared until 20:00 UTC on 3 September 2023, as some services experienced prolonged impact, predominantly as a result of dependencies on recovering subsets of Storage, SQL Database, and/or Cosmos DB services.
Multiple Azure services were impacted by this incident – including Azure Active Directory (AAD), Azure Active Directory B2C, Azure Active Directory Conditional Access, Azure Active Directory Connect Health, Azure Active Directory MyApps, Azure Activity Logs & Alerts, Azure API Management, Azure App Service, Azure Application Insights, Azure Arc enabled Kubernetes, Azure API for FHIR, Azure Backup, Azure Batch, Azure Chaos Studio, Azure Container Apps, Azure Container Registry, Azure Cosmos DB, Azure Databricks, Azure Data Explorer, Azure Data Factory, Azure Database for MySQL flexible servers, Azure Database for PostgreSQL flexible servers, Azure Digital Twins, Azure Device Update for IoT Hub, Azure Event Hubs, Azure ExpressRoute, Azure Health Data Services, Azure HDInsight, Azure IoT Central, Azure IoT Hub, Azure Kubernetes Service (AKS), Azure Logic Apps, Azure Log Analytics, Azure Log Search Alerts, Azure NetApp Files, Azure Notification Hubs, Azure Redis Cache, Azure Relay, Azure Resource Manager (ARM), Azure Role Based Access Control (RBAC), Azure Search, Azure Service Bus, Azure Service Fabric, Azure SQL Database, Azure Storage, Azure Stream Analytics, Azure Virtual Machines, Microsoft Purview, and Microsoft Sentinel.
What went wrong and why?
Starting at approximately 08:41 UTC on 30 August 2023, a utility voltage sag was caused by a lightning strike on electrical infrastructure approximately 18 miles from the impacted Availability Zone of the Australia East region. The voltage sag caused cooling system chillers for multiple datacenters to shut down. While some chillers automatically restarted, 13 failed to restart and required manual intervention. To do so, the onsite team accessed the datacenter rooftop facilities, where the chillers are located, and proceeded to sequentially restart chillers moving from one datacenter to the next. By the time the team reached the final five chillers requiring a manual restart, the water inside the pump system for these chillers (chilled water loop) had reached temperatures that were too high to allow them to be restarted. In this scenario, the restart is inhibited by a self-protection mechanism that acts to prevent damage to the chiller that would occur by processing water at the elevated temperatures. The five chillers that could not be restarted supported cooling for the two adjacent data halls which were impacted in this incident.
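The self-protection behaviour can be sketched as follows (the temperature limit is invented for illustration; actual chiller protection setpoints are vendor-specific):

```python
# Illustrative threshold only; actual chiller protection limits are vendor-specific.
MAX_SAFE_LOOP_TEMP_C = 25.0

def try_restart_chiller(chilled_water_loop_temp_c: float) -> bool:
    """Refuse to restart if processing the loop water would damage the chiller."""
    if chilled_water_loop_temp_c > MAX_SAFE_LOOP_TEMP_C:
        # This is the state the final five chillers were in by the time
        # technicians reached them: the chilled water loop was already too hot.
        return False
    # ...otherwise issue the actual restart command...
    return True
```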
The two impacted data halls require at least four chillers to be operational. Before the voltage sag, our cooling capacity for these halls consisted of seven chillers, with five chillers in operation and two chillers in standby. At 10:30 UTC, some networking, compute, and storage infrastructure began to shut down automatically as data hall temperatures increased, impacting service availability. At 11:34 UTC, as temperatures continued to increase, our onsite datacenter team began a remote shutdown of the remaining networking, compute, and storage infrastructure in the impacted data halls to protect data durability and infrastructure health, and to address the thermal runaway. This shutdown allowed the chilled water loop to return to a safe temperature, which allowed us to restart the chillers. This shutdown of infrastructure resulted in a further reduction of service availability for this Availability Zone.
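A compressed sketch of the shutdown decisions described here, with invented thresholds standing in for the real operational limits:

```python
# Invented thresholds for illustration; real operational limits vary by hardware class.
AUTO_SHUTDOWN_TEMP_C = 40.0      # hardware self-protects above this
OPERATIONAL_MAX_TEMP_C = 32.0    # normal operating ceiling
MIN_CHILLERS_REQUIRED = 4        # the two impacted data halls needed at least four

def shutdown_action(hall_temp_c: float, chillers_online: int) -> str:
    if hall_temp_c >= AUTO_SHUTDOWN_TEMP_C:
        return "automatic hardware shutdown"            # what began at 10:30 UTC
    if hall_temp_c > OPERATIONAL_MAX_TEMP_C and chillers_online < MIN_CHILLERS_REQUIRED:
        # Temperatures rising with insufficient cooling: shed load deliberately
        # to protect data durability and let the chilled water loop cool down.
        return "manual remote shutdown"                  # the 11:34 UTC decision
    return "keep running"
```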
The chillers were successfully brought back online at 12:12 UTC, and data hall temperatures returned to operational thresholds by 13:30 UTC. Power was then restored to the affected infrastructure and a phased process to bring the infrastructure back online commenced. All power to infrastructure was restored by 15:10 UTC. Once all networking and storage infrastructure had power restored, dependent compute scale units were then also returned to operation. As the underlying compute and storage scale units came online, dependent Azure services started to recover, but some services experienced issues coming back online.
From a storage perspective, seven storage scale units were impacted – five standard storage scale units, and two premium storage scale units. Availability impact to affected storage accounts began at 10:30 UTC as hardware shut down in response to elevated data hall temperatures. This was most impactful to storage accounts configured with the default local redundant storage (LRS), which is not resilient to a zonal failure. Accounts configured as zonally redundant (ZRS) remained 100% available, and accounts configured as geographically redundant (GRS) were eligible for customer-managed account failover. After power restoration, storage nodes started coming back online from 15:25 UTC. Four scale units required engineer intervention to check and reset some fault detection logic – the combination of investigation and management tooling performance problems delayed restoration of availability for these scale units. We also identified that some automation was incorrectly marking some already recovered nodes as unhealthy, which slowed storage recovery efforts. By 20:00 UTC, 99% of storage accounts had recovered. Restoring availability for the remaining <1% of storage accounts took more time due to hardware troubleshooting and replacement required on a small number of storage nodes in a single scale unit. Even identifying problematic hardware in the storage nodes took an extended period of time, as the nodes were offline and therefore not able to provide diagnostics. By 01:00 UTC on 31 August, availability was restored for all except a handful of storage accounts, with complete availability restored by 07:00 UTC on 1 September.
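The redundancy behaviour called out above reduces to a small lookup. This is a simplified summary of what was observed in this incident; see the storage redundancy guidance linked later in this PIR for the authoritative matrix:

```python
# Simplified summary of behaviour during a zone-level failure like this incident.
ZONE_FAILURE_BEHAVIOUR = {
    "LRS":    "unavailable until the zone recovers",       # default, not zone resilient
    "ZRS":    "remains available from other zones",
    "GZRS":   "remains available from other zones",
    "GRS":    "eligible for customer-managed account failover to the secondary region",
    "RA-GRS": "read access from secondary; eligible for customer-managed failover",
}

def expected_behaviour(redundancy_sku: str) -> str:
    return ZONE_FAILURE_BEHAVIOUR.get(redundancy_sku, "unknown redundancy option")

print(expected_behaviour("LRS"))
```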
From a SQL Database perspective, database capacity in a region is divided into tenant rings. The Australia East region includes hundreds of rings, and each ring consists of a group of VMs (10-200) hosting a set of databases. Rings are managed by Azure Service Fabric to provide availability in cases of VM, network or storage failures. When infrastructure was powered down, customers using zone-redundant Azure SQL Database did not experience any downtime, except for a small subset of customers using proxy mode connections, due to one connectivity gateway not being configured with zone resilience. The fix for this issue was already being rolled out, but had not yet been deployed to the Australia East region. As infrastructure was powered back on, all tenant rings except one came back online and databases became available to customers as expected. However, one ring remained impacted even after Azure Compute became available. In this ring, 20 nodes did not come back online as expected, so databases on these nodes continued to experience unavailability. As a result of Service Fabric attempting to move databases to healthy nodes, other databases on this ring experienced intermittent availability issues as a side effect of the overall replica density and unhealthy nodes. The recovery involved first moving all the databases from unhealthy nodes to healthy nodes. All remote storage (general purpose) databases were successfully recovered by this move, but databases using local storage (business critical) only recovered as their underlying nodes recovered. All databases on unhealthy nodes were recovered by 11:00 UTC on 31 August. Since the health and capacity of this final ring did not completely recover, we decided to move all databases out of the ring, which extended the overall recovery time but did not negatively impact customer availability. During this extended recovery, most customers were not experiencing any issues, but it was important to move all databases out of this unhealthy ring to prevent any potential impact. The operation of moving all databases out of this ring was completed at 20:00 UTC on 3 September. During this incident, customers who had ‘active geo-replication’ set up were able to fail over manually to restore availability. For customers who had ‘auto-failover groups’ enabled, we did not execute an automatic failover – our automatic failover policy was not initiated for the region, due to an incorrect initial assessment of the impact severity to SQL Database.
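As a rough sketch of the recovery sequencing described above (invented data model; actual placement is handled by Azure Service Fabric):

```python
# Invented data model for illustration; real placement is handled by Azure Service Fabric.
from dataclasses import dataclass

@dataclass
class Database:
    name: str
    tier: str   # "general_purpose" (remote storage) or "business_critical" (local storage)
    node: str

def plan_recovery(databases: list[Database], unhealthy_nodes: set[str],
                  healthy_nodes: list[str]):
    moves, wait_for_node = [], []
    for db in databases:
        if db.node not in unhealthy_nodes:
            continue
        if db.tier == "general_purpose":
            # Remote storage: the data is not tied to the node, so move it now.
            moves.append((db.name, healthy_nodes[0]))
        else:
            # Local storage: availability returned only as the underlying node recovered.
            wait_for_node.append(db.name)
    return moves, wait_for_node
```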
From a Cosmos DB perspective, zone-resilient accounts and those with multi-region writes remained operational during the incident, transparently serving requests from a different zone or region, respectively. However, accounts not configured for AZ or multi-region writes experienced full or partial loss of availability, due to the infrastructure that was powered down. Multi-region accounts with a single write region that were eligible for failover (those with Service Managed Failover enabled) were failed over to their alternate regions to restore availability. These failovers were initiated at 12:07 UTC, 33 minutes after the decision to power down scale units. The reason for this delay was to identify and fail over the Cosmos DB control plane system resources – in retrospect this delay was unnecessary, as the Cosmos DB control plane was already fully zone-resilient. 95% of database accounts were failed over within 35 minutes, by 12:42 UTC, and all eligible accounts were failed over by 16:13 UTC on 30 August. Accounts that were not eligible for failover had service restored to partitions only once the dependent storage and compute were restored.
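The eligibility rules described above reduce to something like this sketch (simplified, with invented parameter names):

```python
# Simplified decision logic based on the behaviour described above; names are invented.
def cosmos_account_outcome(zone_redundant: bool, multi_region_write: bool,
                           service_managed_failover: bool) -> str:
    if zone_redundant or multi_region_write:
        return "remained available (served from another zone or write region)"
    if service_managed_failover:
        return "failed over to an alternate region by the service"
    return "waited for the underlying compute and storage to be restored"

print(cosmos_account_outcome(zone_redundant=False, multi_region_write=False,
                             service_managed_failover=True))
```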
From an Azure Kubernetes Service (AKS) perspective, the service experienced a loss of compute for the AKS control plane for Australia East, as well as loss of access to its SQL Database. The AKS control plane underlay is deployed across multiple availability zones. AKS uses Azure SQL Database for its operation queue, which is used for Create/Read/Update/Delete (CRUD) activities. Although it was scheduled to be converted, this SQL Database in Australia East was not yet configured to be zone resilient, leaving it unavailable during the incident period. In addition, AKS services in the Australia Southeast region depended on this same database, causing an AKS incident for CRUD activities in that region as well. Existing customer workloads running on AKS clusters in either region should not have been impacted by the downtime, as long as they did not need to access the AKS resource provider for scaling or other CRUD activities. As the SQL Database recovered, service was restored without any other mitigation required.
From an Azure Resource Manager (ARM) perspective, the impact on customers was the result of the degradation in Cosmos DB. This degradation impacted ARM between 10:45 UTC and 12:25 UTC and resulted in ARM platform availability for the Australia East region dropping from ~99.999% to (at its lowest) 88%, with a 62% success rate for write operations. For data consistency reasons, write operations must be sent to the Cosmos DB regional replica associated with a given resource group. The migration to our next-generation zonally redundant storage architecture is still ongoing and had not yet been completed for this region, so ARM was not yet leveraging fully zonally redundant storage there. This meant that, for the duration of the incident, customers worldwide trying to manage resources whose resource groups were homed in Australia East saw increased error rates (and this manifested in a small impact to global platform availability until 15:00 UTC).
How did we respond?
- 30 August 2023 @ 08:41 UTC – Voltage sag occurred on utility power line
- 30 August 2023 @ 08:43 UTC – 13 chillers failed to restart automatically
- 30 August 2023 @ 08:51 UTC – Remote resets on chillers commenced
- 30 August 2023 @ 09:09 UTC – Team arrived at first group of chillers for manual restarts
- 30 August 2023 @ 09:18 UTC – Team arrived at second group of chillers for manual restarts
- 30 August 2023 @ 09:42 UTC – Team arrived at third group of chillers for manual restarts
- 30 August 2023 @ 09:45 UTC – Team arrived at the final group of chillers which could not be restarted
- 30 August 2023 @ 10:30 UTC – Initial impact from automated infrastructure shutdown
- 30 August 2023 @ 10:47 UTC – Cosmos DB Initial impact detected via monitoring
- 30 August 2023 @ 10:48 UTC – First automated communications sent to Azure Service Health
- 30 August 2023 @ 11:30 UTC – Initial communications posted to public Azure Status page
- 30 August 2023 @ 11:34 UTC – Decision made to shut down impacted infrastructure
- 30 August 2023 @ 11:36 UTC – All subscriptions in Australia East sent portal communications
- 30 August 2023 @ 12:07 UTC – Failover initiated for eligible Cosmos DB accounts
- 30 August 2023 @ 12:12 UTC – Five chillers manually restarted
- 30 August 2023 @ 13:30 UTC – Data hall temperature normalized
- 30 August 2023 @ 14:10 UTC – Safety walkthrough completed for both data halls
- 30 August 2023 @ 14:25 UTC – Decision made to start powering up hardware in the two affected data halls
- 30 August 2023 @ 15:10 UTC – Power restored to all hardware
- 30 August 2023 @ 15:25 UTC – Storage, networking, and compute infrastructure started coming back online after power restoration
- 30 August 2023 @ 15:30 UTC – Identified three specific storage scale units still experiencing fault codes
- 30 August 2023 @ 16:00 UTC – 35% of VMs recovered / Began manual recovery efforts for three remaining storage scale units
- 30 August 2023 @ 16:13 UTC – Account failover completed for all Cosmos DB accounts
- 30 August 2023 @ 17:00 UTC – All but one SQL Database tenant ring had recovered
- 30 August 2023 @ 19:20 UTC – 90% of VMs recovered
- 30 August 2023 @ 19:29 UTC – Successfully recovered all premium storage scale units
- 30 August 2023 @ 20:00 UTC – 99% of storage accounts were back online
- 30 August 2023 @ 22:35 UTC – Standard storage scale units were recovered, except for one scale unit
- 30 August 2023 @ 22:40 UTC – 99% of VMs recovered
- 31 August 2023 @ 04:04 UTC – Restoration of Cosmos DB accounts to Australia East initiated
- 31 August 2023 @ 04:43 UTC – Final Cosmos DB cluster recovered, restoring all traffic for accounts that were not failed over to alternate regions
- 31 August 2023 @ 05:00 UTC – 100% of VMs recovered
- 1 September 2023 @ 06:40 UTC – Successfully recovered all standard storage scale units
- 3 September 2023 @ 20:00 UTC – Final SQL Database tenant ring evacuated and all customer databases online
How are we making incidents like this less likely or less impactful?
- Incremental load increases over time in the Availability Zone resulted in a chiller operating configuration that was susceptible to a timing defect in the Chiller Management System. We have de-risked this restart failure under voltage fluctuations by implementing a change to the control timing logic of the Chiller Management System (Completed).
- The Emergency Operation Procedure for manual restarts after simultaneous chiller failures has changed from an ‘adjacent’ sequence to a data hall ‘load based’ sequence, so that all impacted data halls receive partial cooling to slow thermal runaway while full cooling is being restored (Completed). A simplified sketch of this load-based sequencing follows this list.
- Following this incident, as a temporary mitigation prior to the Chiller Management System change above, we increased technician staffing levels at the datacenter so that manual chiller restart procedures could be executed quickly enough to prevent restart failures. Based on our incident analysis, the staffing levels at the time would have been sufficient to prevent impact if a ‘load based' chiller restart sequence had been followed, which we have since implemented (Completed).
- Datacenter staffing levels published in the Preliminary PIR only accounted for “critical environment” staff onsite. This did not characterize our total datacenter staffing levels accurately. To alleviate this misconception, we made a change to the preliminary public PIR posted on the Status History page.
- Our Storage team has identified several optimizations in our large scale recovery process which will help to reduce time to mitigate. This includes augmenting data provided in our incidents to enable quicker decision making, and updates to our troubleshooting guides (TSGs) that enable faster execution (Estimated completion: December 2023).
- Our Azure Service Fabric team is working to improve reliability of SQL Database tenant ring recovery. (Estimated completion: December 2023).
- Our SQL Database team is reviewing our ‘auto-failover group’ trigger criteria, to ensure that failovers can happen within the expected timeframe. (Estimated completion: October 2023).
- Our SQL Database team is upgrading internal tooling to enable mass migration of databases. (Estimated completion: December 2023).
- Our Cosmos DB team is working to optimize Service Managed Failover for single region write accounts to reduce time to mitigate (Estimated completion: November 2023).
- Our AKS team is immediately converting all operation queue databases in Azure SQL Database to be zone redundant (Estimated completion: September 2023).
- Our AKS team is also replacing all cross-region SQL Database queue usage with Service Bus queues that are zone redundant (Estimated completion: September 2023).
- Our ARM team will complete its storage layer migration to the next generation, zonally redundant architecture (Estimated completion: December 2023).
- Our incident management team is exploring ways to harden our readiness, process, and playbook surrounding power down scenarios (Estimated completion: October 2023)
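For illustration, the ‘load based’ restart sequencing referenced earlier in this list can be sketched as follows (invented data structures; this is not the actual Emergency Operation Procedure):

```python
# Invented structures for illustration; the real Emergency Operation Procedure is more detailed.
from collections import defaultdict

def load_based_restart_order(chillers: list[dict]) -> list[str]:
    """chillers: [{'id': ..., 'data_hall': ..., 'hall_load_kw': ...}, ...]"""
    by_hall = defaultdict(list)
    for c in chillers:
        by_hall[c["data_hall"]].append(c)

    # Visit halls in descending load order, taking one chiller per hall per pass,
    # so every impacted hall regains partial cooling before any hall gets full cooling.
    halls = sorted(by_hall, key=lambda h: by_hall[h][0]["hall_load_kw"], reverse=True)
    order = []
    while any(by_hall[h] for h in halls):
        for hall in halls:
            if by_hall[hall]:
                order.append(by_hall[hall].pop(0)["id"])
    return order
```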
How can customers make incidents like this less impactful?
- Consider using Availability Zones (AZs) to run your services across physically separate locations within an Azure region. To help services be more resilient to datacenter-level failures like this one, each AZ provides independent power, networking, and cooling. Many Azure services support zonal, zone-redundant, and/or always-available configurations: https://docs.microsoft.com/azure/availability-zones/az-overview
- For mission-critical workloads, customers should consider a multi-region geodiversity strategy to avoid impact from incidents like this one that impacted a single region: https://learn.microsoft.com/training/modules/design-a-geographically-distributed-application/ and https://learn.microsoft.com/azure/architecture/patterns/geodes
- Consider which are the right storage redundancy options for your critical applications. Zone redundant storage (ZRS/GZRS) remains available throughout a zone localized failure, like in this incident. Geo-redundant storage (GRS) enables account level failover in case the primary region endpoint becomes unavailable: https://learn.microsoft.com/azure/storage/common/storage-redundancy
- Consider the relevant guidance for recovering your SQL Database databases during disaster recovery scenarios: https://learn.microsoft.com/azure/azure-sql/database/disaster-recovery-guidance
- Consider the relevant guidance for achieving high availability with Azure Cosmos DB https://learn.microsoft.com/azure/cosmos-db/high-availability
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/azure/architecture/framework/resiliency
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/VVTQ-J98