
March 2023

6

What happened?

Between 03:50 UTC and 17:55 UTC on 6 March 2023, a subset of customers using Azure Storage may have experienced greater than expected throttling when performing requests against Storage resources located in the West Europe region. Azure services dependent on Azure Storage may also have experienced intermittent failures and degraded performance due to this issue. These included Azure Automation, Azure Arc enabled Kubernetes, Azure Bastion, Azure Batch, Azure Container Apps, Azure Data Factory (ADF), Azure ExpressRoute / ExpressRoute Gateways, Azure HDInsight, Azure Key Vault (AKV), Azure Logic Apps, Azure Monitor, and Azure Synapse Analytics.

What went wrong and why?

Azure Storage employs a throttling mechanism that ensures storage account usage remains within the published storage account limits. This throttling mechanism is also used to protect a storage service scale unit from exceeding the overall resources available to that scale unit. If a scale unit reaches such limits, it employs a safety protection mechanism to throttle the accounts deemed to be contributing to the overload, while also balancing load across other scale units.
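To illustrate the two layers of throttling described above, here is a minimal, hypothetical Python sketch; it is not the actual Azure Storage implementation, and the limits, class name, and account name are invented for illustration. A request is admitted only if both the per-account budget and the scale-unit-wide protection budget have headroom.

```python
# Minimal illustrative sketch (not the actual Azure Storage implementation).
# Limits and names are hypothetical; the point is that a request must clear
# both the published per-account limit and the scale-unit protection limit.

class ThrottlingGate:
    def __init__(self, account_limit_rps: int, scale_unit_limit_rps: int):
        self.account_limit_rps = account_limit_rps        # published per-account limit
        self.scale_unit_limit_rps = scale_unit_limit_rps  # overall scale-unit protection limit
        self.account_usage = {}                           # account -> requests in current window
        self.scale_unit_usage = 0

    def admit(self, account: str) -> bool:
        used = self.account_usage.get(account, 0)
        if used >= self.account_limit_rps:
            return False  # account exceeded its published limit -> throttled
        if self.scale_unit_usage >= self.scale_unit_limit_rps:
            return False  # scale unit near its overall resource limit -> protective throttling
        self.account_usage[account] = used + 1
        self.scale_unit_usage += 1
        return True

gate = ThrottlingGate(account_limit_rps=20_000, scale_unit_limit_rps=500_000)
print(gate.admit("contoso-storage-account"))  # True while both budgets have headroom
```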

Configuration settings control how the throttling services monitor usage and apply throttling. A new configuration was rolled out to improve the throttling algorithm. Although this configuration change followed our usual Safe Deployment Practices, the issue was found mid-way through the deployment. The change had adverse effects due to the specific load and characteristics of a small set of scale units, where the mechanism unexpectedly throttled some storage accounts in an attempt to bring the scale unit back to a healthy state.

How did we respond?

Automated monitoring alerts were triggered, and engineers were immediately engaged to assess the issues reported by the service-specific alerts. This section provides a timeline of how we responded. Batch engineers started investigating these alerts at 04:08 UTC and found storage request failures and an impact on service availability. At 04:40 UTC, Storage engineers were engaged to begin investigating the cause. While Storage was investigating with Batch, automated monitoring alerts were triggered for ADF at 05:30 UTC. The ADF investigation showed an issue with the underlying Batch accounts, and Batch confirmed to ADF that they were impacted by storage failures and were working with Storage engineers to mitigate. At this time, Storage engineers diagnosed that one scale unit was operating above normal parameters, and identified that Batch storage accounts were being throttled due to tenant limits. Engineers began load balancing traffic across different scale units. By 06:34 UTC, Storage engineers had started migrating Batch accounts in an effort to mitigate the ongoing issues.

At 07:15 UTC, automation detected an issue with AKV requests. The AKV investigation showed an issue with the underlying storage accounts. Around 09:10 UTC, engineers performed a failover that mitigated the issue for existing AKV read operations; however, create, update, and delete operations, as well as reads for new requests, were still impacted. Around 10:00 UTC, as the scope of the issue expanded to additional scale units, Storage engineers correlated the downstream service impact with the configuration rollout. By 10:15 UTC, Storage engineers began reverting the configuration change on select impacted scale units. The Batch storage account migration finished around 11:22 UTC, and the Batch service became healthy at 11:35 UTC. ADF began to recover after the Batch mitigation was completed and was fully recovered around 12:51 UTC, once the backlog of accumulated tasks had been processed.

By 16:34 UTC, impacted resources and services were mitigated. Shortly thereafter, engineers scheduled a rollback of the configuration change on all scale units (including those that were not impacted) and declared the incident mitigated.

How are we making incidents like this less likely or less impactful?

  • We are tuning our monitoring to anticipate and quickly detect when a storage scale unit might engage its protection algorithm and apply throttling to storage accounts. These monitors will proactively alert engineers to take necessary action (Estimated completion: March 2023).
  • Moving forward, we will roll out throttling improvements in tracking mode first, to assess their impact before enabling them, since throttling improvements may react differently to different workload types and load on a particular scale unit; a sketch of this tracking-mode approach follows this list (Estimated completion: March 2023).
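As a rough illustration of the tracking-mode rollout mentioned in the second item above, the sketch below evaluates a hypothetical new throttling rule but only logs its decisions until the mode is switched to enforce. The rule name, threshold, and configuration shape are assumptions, not the actual Azure Storage configuration.

```python
# Hedged sketch of a "tracking mode" rollout for a new throttling rule.
# The rule name, threshold, and config shape are hypothetical.

import logging

logging.basicConfig(level=logging.INFO)

NEW_RULE_CONFIG = {
    "rule": "scale_unit_protection_v2",
    "mode": "track",          # "track" = log only, "enforce" = actually throttle
    "cpu_threshold_pct": 85,  # hypothetical trigger condition
}

def apply_rule(scale_unit_cpu_pct: float, account: str) -> bool:
    """Return True if the request should actually be throttled."""
    would_throttle = scale_unit_cpu_pct >= NEW_RULE_CONFIG["cpu_threshold_pct"]
    if would_throttle and NEW_RULE_CONFIG["mode"] == "track":
        logging.info("TRACK ONLY: rule %s would throttle account %s",
                     NEW_RULE_CONFIG["rule"], account)
        return False          # tracking mode never impacts customer traffic
    return would_throttle     # enforce mode applies the decision

print(apply_rule(92.0, "contoso-storage-account"))  # False in tracking mode; decision is logged
```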

The above improvement areas will help us prevent and detect storage-related issues across first-party services that rely on Azure storage accounts, such as Batch and Data Factory.

  • Our AKV team is working on improvements to the current distribution of storage accounts across multiple scale units, and is updating its storage implementation to ensure that read and write availability can be decoupled when such incidents happen (Estimated completion: May 2023).
  • We continue to expand our AIOps detection system to provide better downstream impact detection and correlation - to notify customers more quickly, and to identify/mitigate impact more quickly (Ongoing).

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

February 2023

7

What happened?  

Between 20:19 UTC on 7 February 2023 and 04:30 UTC on 9 February 2023, a subset of customers with workloads hosted in the Southeast Asia and East Asia regions experienced difficulties accessing and managing resources deployed in these regions.

One Availability Zone (AZ) in Southeast Asia experienced a cooling failure. Infrastructure in that zone was shut down to protect data and infrastructure, which led to failures in accessing resources and services hosted in that zone. This zone failure also resulted in two further unexpected failures: first, regional degradation for some services; and second, services designed to support failover to other regions or zones did not work reliably.

The services which experienced degradation in the region included Azure Kubernetes Service, Databricks, API Management, Application Insights, Backup, Cognitive Services, Container Apps, Container Registry, Container Service, Cosmos DB,  NetApp Files, Network Watcher, Notification Hubs, Purview, Redis Cache, Search, Service Bus, SignalR Service, Site Recovery, SQL Data Warehouse, SQL Database, Storage, Synapse Analytics, Universal Print, Update Management Center, Virtual Machines, Virtual WAN, Web PubSub – as well as the Azure portal itself, and subsets of the Azure Resource Manager (ARM) control plane services.

BCDR services designed to support regional or zonal failover that did not work as expected included Azure Site Recovery, Azure Backup, and Storage accounts leveraging Geo Redundant Storage (GRS).

 

What went wrong and why? 

In this PIR, we will break down what went wrong into three parts. The first part focuses on the failures that led to the loss of an availability zone. The second focuses on the unexpected impact to the ARM control plane and BCDR services. The third covers the extended recovery of the SQL Database service.

 

Chiller failures in a single Availability Zone 

At 15:17 UTC on 7 February 2023, a utility voltage dip occurred on the power grid, affecting a single Availability Zone in the Southeast Asia region. Power management systems handled the voltage dip as designed; however, a subset of the chiller units that provide cooling to the datacenter tripped and shut down. Emergency Operational Procedures (EOPs) were performed as documented for the impacted chillers but were not successful. Cooling capacity in the facility was reduced for a prolonged time and, despite efforts to stabilize temperatures by shutting down non-critical infrastructure, temperatures continued to rise in the impacted datacenter. At 20:19 UTC on 7 February 2023, infrastructure thermal warnings from components in the datacenter triggered a shutdown of critical compute, network, and storage infrastructure to protect data durability and infrastructure health. This resulted in a loss of resource and service availability in the impacted zone in Southeast Asia.

Maximum cooling capacity for this facility is provided by eight chillers from two different suppliers. Chillers 1 to 5, from supplier A, provide 47% of the cooling capacity; chillers 6, 7, and 8, from supplier B, make up the remaining 53%. Chiller 5 was offline for planned maintenance. After the voltage dip, chiller 4 continued to run and chillers 1, 2, and 3 were started, but they could not provide the cooling necessary to stabilize the temperatures, which had already increased during the restart period. These four units were operating as expected, but chillers 6, 7, and 8 could not be started. This was not the expected behavior of these chillers, which should automatically restart when tripped by a power dip. The chillers also failed a manual restart as part of the EOP execution.
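As rough arithmetic only, and assuming cooling capacity is split evenly among each supplier's units (an assumption the PIR does not state), the available capacity during the event can be estimated as follows:

```python
# Rough estimate; assumes capacity is evenly split within each supplier's units.

supplier_a_share = 0.47   # chillers 1-5 (supplier A)
supplier_b_share = 0.53   # chillers 6-8 (supplier B)

per_chiller_a = supplier_a_share / 5   # ~9.4% each
per_chiller_b = supplier_b_share / 3   # ~17.7% each

# Chiller 5 was in planned maintenance; chillers 6, 7, and 8 failed to restart,
# leaving only chillers 1-4 (supplier A) running.
available = 4 * per_chiller_a
print(f"Approximate cooling capacity available: {available:.0%}")  # ~38% of maximum
```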

This fault condition is currently being investigated by the OEM chiller vendor. Preliminary investigations point to the compressor control card, which ceased to respond after the power dip, preventing the chillers from restarting successfully. The chiller HMI alarm log showed a "phase sequence alarm". To clear this alarm, the compressor control card had to be hard reset, requiring a shutdown that leaves the unit powered off for at least five minutes to drain the internal capacitors before powering it back on. The current EOP did not include these steps; this is addressed later in the PIR.

Technicians from supplier B were dispatched onsite to assist. By the time the chiller compressor control card was reset with their assistance, the chilled water loop temperatures exceeded the 28 degrees Celsius threshold, locking out the restart function to protect the chillers from damage. This condition is referred to as a thermal lockout. To reduce the chilled water loop temperature, augmented cooling had to be brought online and thermal loads had to be reduced by shutting down infrastructure, as previously stated. This successfully reduced the chilled water loop temperature below the thermal lockout threshold and enabled the restart function for chillers 6, 7, and 8. Once the chillers were restarted, temperatures progressed toward expected levels and, with all units recovered, temperatures were back within our standard operational thresholds by 14:00 UTC on 8 February 2023.

With confidence that temperatures were stable, we then began to restore power to the affected infrastructure, starting a phased process to first bring storage scale units back online. Once the storage infrastructure was verified healthy and online, compute scale units were powered up. As the compute scale units became healthy, virtual machines and other dependent Azure services recovered. Most customers should have seen platform service recovery by 16:30 UTC on 8 February; however, a small number of services took an extended time to fully recover, completing by 04:30 UTC on 9 February 2023.


Unexpected impact to ARM and BCDR services

While infrastructure in the impacted AZ was powered down, services deployed and running in a zone-resilient configuration were largely unaffected by the shutdown. However, multiple services (listed above) may have experienced some form of degradation due to dependencies on ARM, which did experience unexpected impact from the loss of a single AZ.

In addition, three services that are purpose-built for business continuity and disaster recovery (BCDR) planning and execution, in both zonal and regional incidents, also experienced issues, and will be discussed in detail:

  • Azure Site Recovery (ASR) - Helps ensure business continuity by keeping business apps and workloads running during outages. Site Recovery replicates workloads running on physical and virtual machines (VMs) from a primary site to a secondary location. When an incident occurs at a customer’s primary site, a customer can fail over to a secondary location. After the primary location is running again, customers can fail back.
  • Azure Backup – Allows customers to make backups that keep data safe and recoverable.
  • Azure Storage – Storage accounts configured for Geo Redundant Storage (GRS) to provide resiliency for data replication.

Impact to the ARM service.

As the infrastructure of the impacted zone was shut down, the ARM control plane service (which is responsible for processing all service management operations) in Southeast Asia lost access to a portion of its metadata store hosted in a Cosmos DB instance that was in the impacted zone. This instance was expected to be zone resilient but, due to a configuration error local to the Southeast Asia region, it was not. Inability to access this metadata resulted in failures and contention across the ARM control plane in that region. To mitigate impact, the ARM endpoints in Southeast Asia were manually removed from their global service Traffic Manager profile. This was completed at 04:10 UTC on 8 February 2023. Shifting the ARM traffic away from Southeast Asia increased traffic to nearby APAC regions, in particular East Asia, which quickly exceeded provisioned capacity. This exacerbated the impact to include East Asia, where customers would have experienced issues managing their resources via ARM. The ARM service in East Asia was scaled up, and by 09:30 UTC on 8 February the service was fully recovered.
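The sketch below is a simplified model of why shifting ARM traffic away from Southeast Asia overloaded East Asia: removing one region's endpoint redistributes its demand to nearby regions, one of which then exceeds its provisioned capacity. Only the region names come from the PIR; the request rates, capacities, and even split are hypothetical.

```python
# Conceptual model of the traffic shift; numbers and the even split are hypothetical.

capacity_rps = {"Southeast Asia": 10_000, "East Asia": 6_000, "Japan East": 12_000}
demand_rps   = {"Southeast Asia": 8_000, "East Asia": 4_000, "Japan East": 3_000}

def remove_endpoint(removed_region: str) -> None:
    """Redirect a removed region's demand evenly to the remaining regions."""
    remaining = [r for r in capacity_rps if r != removed_region]
    share = demand_rps[removed_region] / len(remaining)
    for r in remaining:
        demand_rps[r] += share
    demand_rps[removed_region] = 0

remove_endpoint("Southeast Asia")   # ARM endpoints pulled from the global profile
for region in capacity_rps:
    overloaded = demand_rps[region] > capacity_rps[region]
    print(f"{region}: {demand_rps[region]:.0f} rps of {capacity_rps[region]} rps provisioned"
          + ("  <-- exceeds provisioned capacity; scale-up required" if overloaded else ""))
```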

Impact to BCDR services

A subset of Azure Site Recovery customers started experiencing failures with zonal failovers out of the Southeast Asia region and regional failovers to the East Asia region. The initial failures were caused by a recent update to a workflow that leveraged a Service Bus instance as its communication backbone; this instance was not zonally resilient and was located in the impacted zone. Failures related to the workflow's dependency on Service Bus were completely mitigated by 04:45 UTC on 8 February 2023 by disabling the newly added update in the Southeast Asia region. Unfortunately, further failures continued for a few customers due to the ongoing ARM control plane issues described above. At 09:15 UTC on 8 February 2023, once the ARM control plane issues were mitigated, all subsequent ASR failovers worked reliably.

The ARM impact caused failures and delays in restoring applications and data using Azure Backup cross region restore to East Asia. A portion of Azure Backup customers in the Southeast Asia region had enabled the cross region restore (CRR) capability, which allows them to restore their applications and data in the East Asia region during regional outages. At 09:00 UTC on 8 February, we started proactively enabling the CRR capability for all Azure Backup customers with GRS Recovery Services Vaults (RSV), so that customers who had not enabled this capability could also recover their applications and data in the East Asia region. While observing customer activity, we detected a configuration issue in one of the microservices specific to the Southeast Asia region, which caused CRR calls to fail for a subset of customers. This issue was mitigated through a configuration update, and cross region application and data restores were fully functional by 15:10 UTC on 8 February 2023.

Most customers using Azure Storage with GRS were able to execute customer-controlled failovers without any issues, but a subset encountered unexpected delays or failures. There were two underlying causes. The first was an integrity check that the service performs prior to a failover; in some instances, this check was too aggressive and blocked the failover from completing. The second was that storage accounts using hierarchical namespaces could not complete a failover.
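A simplified sketch of the two failure modes described above follows. The check logic, field names, and thresholds are hypothetical; the real pre-failover integrity check is internal to the service.

```python
# Hypothetical sketch of a customer-initiated GRS failover gated by pre-checks.

def integrity_check(account: dict) -> bool:
    # Overly strict illustration: any replication lag at all blocks the failover,
    # even when the secondary copy would be safe to promote.
    return account["replication_lag_seconds"] == 0

def can_failover(account: dict):
    if account["hierarchical_namespace"]:
        return False, "accounts using hierarchical namespaces could not complete a failover"
    if not integrity_check(account):
        return False, "pre-failover integrity check blocked the failover"
    return True, "failover allowed"

account = {"hierarchical_namespace": False, "replication_lag_seconds": 45}
print(can_failover(account))  # (False, 'pre-failover integrity check blocked the failover')
```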

Extended Recovery for SQL Database

The impact to Azure SQL Database customers throughout this incident would have fallen into the following categories.

  • Customers using Multi-AZ Azure SQL were not impacted, except those using proxy mode connections, due to one connectivity capacity unit not being configured for zone resilience.
  • Customers using active geo-replication with automatic failover were failed out of the region and failed back when recovery was complete. These customers had approximately 1.5 hours of downtime before the automatic failover completed.
  • Some customers who had active geo-replication self-initiated a failover, and some performed a geo-restore to other regions.
  • The rest of the impacted SQL Database customers were impacted from the point of power-down and recovered as compute infrastructure was brought back online and Tenant Rings (TRs) reformed automatically. Azure SQL Database capacity in a region is divided into Tenant Rings; in Southeast Asia there are hundreds of rings. Each ring consists of a group of VMs (10-200) hosting a set of databases, and rings are managed by Azure Service Fabric (SF) to provide availability during VM, network, or storage failures. One Tenant Ring required manual intervention to reform properly: when power was restored to the compute infrastructure hosting its VMs, the ring did not reform, so none of the databases on that ring could be made available. The infrastructure hosting these VMs did not recover automatically due to a BIOS bug. After the compute infrastructure was successfully started, a second issue, in heap memory management in SF, also required manual intervention to reform the TR and make customer databases available. The final TR became available at 04:30 UTC on 9 February 2023 (a simplified sketch of the ring-reform idea follows this list).
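As a generic illustration of the ring-reform idea referenced in the last item above (this is not Service Fabric's actual reconfiguration protocol, and the ring size and quorum rule are assumptions), a ring can only serve its databases again once enough of its nodes are back online:

```python
# Generic quorum sketch; not Service Fabric's actual reconfiguration protocol.

def ring_can_reform(nodes_online: int, ring_size: int) -> bool:
    return nodes_online > ring_size // 2   # simple majority quorum (assumed rule)

ring_size = 80                              # a ring holds 10-200 VMs per the PIR
print(ring_can_reform(30, ring_size))       # False: databases on the ring stay unavailable
print(ring_can_reform(55, ring_size))       # True: ring reforms, databases come back online
```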

 

How are we making incidents like this less likely or less impactful?

For the facility incident we are continuing our internal review to evaluate improvements and optimizations to the ‘AZ down’ playbook. This will focus on communication effectiveness, proactive resiliency improvements, and increasing the fidelity of our periodic AZ down testing (Estimated completion: February 2023).

More specifically:

  • We're engaged with the OEM vendor to fully understand the investigation results and any additional follow-ups needed to prevent, shorten the duration of, and reliably recover from these types of events (Estimated completion: February 2023).
  • We're implementing updates to existing procedures to include manual (cold) restarting of the chillers, as recommended by the OEM vendor (Estimated completion: February 2023).
  • We're conducting additional training from the OEM for our datacenter operations personnel, so that they are familiar with the updated procedures (Estimated completion: February 2023).

 

For our platform behavior during zone wide or regional outages, these are the customer experiences we are focusing on. Each of these will have multiple workstreams associated with them.

  1. Ensuring zone resiliency is adopted by our services. This includes reviewing all designs and implementations in addition to conducting regular drills for verification.
  2. Ensuring the containment of single zone failures. A large part of this will be validating that our global services are immune from single zone or even regional issues.
  3. Ensuring that BCDR critical services work reliably. Verifying that all BCDR services only have dependencies on zonally resilient services and can operate at scale.
  4. Minimizing the time to restore service functionality from the point of platform mitigation.

 This section will be updated within 3 weeks with a list of the repair items and dates for estimated completion.

How can we make our incident communications more useful?

You can rate this Post Incident Review (PIR) and provide any feedback using our quick 3-question survey 

January 2023

31

What happened?

Between 05:55 UTC on 31 January 2023 and 00:58 UTC on 1 February 2023, a subset of customers using Virtual Machines (VMs) in the East US 2 region may have received error notifications when performing service management operations – such as create, delete, update, scale, start, stop – for resources hosted in this region.

The impact was limited to a subset of resources in one of the region's three Availability Zones (physical AZ-02); there was no impact to Virtual Machines in the other two zones. This incident did not cause availability issues for any VMs already provisioned and running in any of the three Availability Zones (AZs); impact was limited to the service management operations described above.

Downstream Azure services with dependencies in this AZ were also impacted, including Azure Backup, Azure Batch, Azure Data Factory, Azure Database for MySQL, Azure Database for PostgreSQL, Azure Databricks, and Azure Kubernetes Service.

What went wrong and why?

The East US 2 region is designed and implemented with three Availability Zones. Each of these AZs is further sliced internally into partitions, to ensure that a failure in one partition does not affect processing in other partitions. Every VM deployment request is processed by one of the partitions in the AZ, and a gateway service is responsible for routing traffic to the right partition. During this incident, one of the partitions associated with physical AZ-02 experienced data access issues at 05:55 UTC on 31 January 2023, because the underlying data store had exhausted an internal resource limit.

While the investigation of the first partition was in progress, at around 11:43 UTC a second partition within the same AZ experienced a similar issue and became unavailable, leading to further impact. Even though this partition was unavailable, the cache service still held the partition's data, so new deployments in the cluster continued to succeed. At 12:04 UTC, the cache service restarted and could not retrieve the data, as the target partition was down. Due to a resource creation policy configuration issue, the gateway service required all partitions to be available in order to create a new resource; once the cache service no longer had the data, this policy blocked all new VM creations in the AZ. This resulted in additional failures and slowness during the impact window, causing failures for downstream services.
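The sketch below illustrates, in simplified form, how the combination of the cache and the misconfigured resource creation policy described above turned partition outages into a zone-wide block on new VM creations. Partition names, the cache shape, and the policy wording are hypothetical.

```python
# Illustrative model only; partition names and policy details are hypothetical.

PARTITIONS = {"p1": "down", "p2": "down", "p3": "up", "p4": "up"}
CACHE = {}   # partition -> cached metadata; emptied when the cache service restarts

def can_create_vm() -> bool:
    # Misconfigured resource-creation policy: every partition must be reachable,
    # either directly or via data held in the gateway cache.
    for partition, state in PARTITIONS.items():
        if state != "up" and partition not in CACHE:
            return False
    return True

CACHE["p1"] = {"metadata": "cached"}   # before 12:04 UTC: cache masks the partition outages
CACHE["p2"] = {"metadata": "cached"}
print(can_create_vm())   # True - new deployments still succeed

CACHE.clear()            # 12:04 UTC: cache service restarts and cannot repopulate
print(can_create_vm())   # False - all new VM creations in the AZ are blocked
```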

How did we respond?

Automated monitoring alerts were triggered, and engineers were immediately engaged to assess the issue. As the issue was being investigated, attempts to auto-recover the partition did not succeed. By 16:45 UTC, engineers had determined the mitigation steps and started implementing them on the first partition.

At 17:49 UTC, the first partition was successfully progressing through recovery, and engineers decided to implement the recovery steps on the second partition. During this time, the number of incoming requests for VM operations continued to grow. To avoid further impact and failures on additional partitions, Availability Zone AZ-02 was taken out of service for new VM creation requests at 19:15 UTC. This ensured that all new VM creation requests would succeed, as they were automatically redirected to the other two Availability Zones.

By 23:45 UTC on 31 January 2023, both partitions had completed a successful recovery. Engineers continued to monitor the system and, when no further failures were recorded beyond 00:48 UTC on 1 February 2023, the incident was declared mitigated and Availability Zone AZ-02 was fully opened for new VM creations.

How are we making incidents like this less likely or less impactful?

  • As an immediate measure, we scanned all partitions across the three Availability Zones to successfully confirm that no other partitions were at risk of the same issue (Completed).
  • We have improved our telemetry to increase incident severity automatically if the failure rate increases beyond a set threshold (Completed).
  • We have also added automated preventive measures to avoid data store resource exhaustion issues from happening again (Completed).
  • We are working towards removing the cross-partition dependencies for VM creation (Estimated completion: April 2023).
  • We are adding a new capability to redirect deployments to other AZs based on known faults from a specific AZ (Estimated completion: May 2023).

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: 

25

What happened?

Between 07:08 UTC and 12:43 UTC on 25 January 2023, customers experienced issues with network connectivity, manifesting as long network latency and/or timeouts when attempting to connect to resources hosted in Azure regions, as well as other Microsoft services including Microsoft 365 and Power Platform. This incident also impacted Azure Government cloud services that were dependent on Azure public cloud. While most regions and services had recovered by 09:05 UTC, intermittent packet loss issues caused some customers to continue seeing connectivity issues due to two routers not being able to recover automatically. All issues were fully mitigated by 12:43 UTC.

What went wrong and why?

At 07:08 UTC, a network engineer was performing an operational task to add network capacity to the global Wide Area Network (WAN) in Madrid. The task included steps to modify the IP address of each new router and to integrate the routers into the IGP (Interior Gateway Protocol, used for connecting all the routers within Microsoft's WAN) and BGP (Border Gateway Protocol, used for distributing Internet routing information into Microsoft's WAN) routing domains.

Microsoft's standard operating procedure (SOP) for this type of operation follows a four-step process: [1] testing in our Open Network Emulator (ONE) environment for change validation; [2] testing in the lab environment; [3] a Safe-Fly Review documenting steps 1 and 2, as well as roll-out and roll-back plans; and [4] Safe Deployment, which allows access to only one device at a time to limit impact. In this instance, the SOP was changed prior to the scheduled event to address issues experienced in previous executions of the SOP. Critically, our process was not followed: the change was not re-tested and did not include proper post-checks per steps 1-4 above. This unqualified change led to a chain of events that culminated in the widespread impact of this incident. The change added a command to purge the IGP database; however, the command operates differently depending on router manufacturer. Routers from two of our manufacturers limit execution to the local router, while those from a third manufacturer execute the command across all IGP-joined routers, ordering them all to recompute their IGP topology databases. Microsoft has a real-time Authentication, Authorization, and Accounting (AAA) system that must approve each command run on each router, including a list of blocked commands that have global impact. However, the command's different, global, default behavior on the router platform being changed had not been discovered during the high-impact command evaluation for this router model and, therefore, had not been added to the block list.
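The sketch below illustrates the gap described above: a per-command block list built on the assumption that a command's scope is local, while on one vendor's platform the same command acts on every router in the IGP domain. The vendor labels and command strings are hypothetical, not actual router syntax.

```python
# Hypothetical illustration; vendor labels and command strings are invented.

BLOCKED_COMMANDS = {"clear ip routing all"}   # commands already known to have global impact

COMMAND_SCOPE = {
    ("vendor-a", "purge igp database"): "local",
    ("vendor-b", "purge igp database"): "local",
    ("vendor-c", "purge igp database"): "global",   # recomputes IGP on ALL joined routers
}

def aaa_approve(vendor: str, command: str) -> bool:
    """Approve a command for execution on a single router."""
    if command in BLOCKED_COMMANDS:
        return False
    # Gap: approval does not account for the command's default scope being
    # global on vendor-c, so it is approved despite its network-wide effect.
    return True

print(aaa_approve("vendor-c", "purge igp database"))      # True (approved)
print(COMMAND_SCOPE[("vendor-c", "purge igp database")])  # 'global'
```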

Azure Networking implements a defense-in-depth approach to maintenance operations which allows access to only one device at a time to ensure that any change has limited impact. In this instance, even though the engineer only had access to a single router, it was still connected to the rest of the Microsoft WAN via the IGP protocol. Therefore, the change resulted in two cascading events. First, routers within the Microsoft global network started recomputing IP connectivity throughout the entire internal network. Second, because of the first event, BGP routers started to readvertise and validate prefixes that we receive from the Internet. Due to the scale of the network, it took approximately 1 hour and 40 minutes for the network to restore connectivity to every prefix.

Issues in the WAN were detected by monitoring, and alerts to the on-call engineers were generated within five minutes of the command being run. However, the engineer making the changes was not informed, due to the unqualified changes to the SOP. As a result, the same operation was performed again on the second Madrid router 33 minutes after the first change, creating two waves of connectivity issues throughout the network and impacting Microsoft customers.

This event caused widespread routing instability affecting Microsoft customers and their traffic flows: to and from the Internet, inter-region traffic, cross-premises traffic via ExpressRoute or VPN/vWAN, and US Gov Cloud services using commercial/public cloud services. During the time it took for routing to automatically converge, the customer impact changed dynamically as the network completed its convergence. Some customers experienced intermittent connectivity, some saw connections time out, and others experienced long latency or, in some cases, even a complete loss of connectivity.

How did we respond?

Our monitoring detected DNS and WAN issues starting at 07:11 UTC, and we began investigating by reviewing all recent changes. By 08:20 UTC, as automatic recovery was progressing, we identified the problematic command that triggered the issue. Networking telemetry shows that nearly all network devices had recovered by 09:05 UTC, by which point most regions and services had recovered. The final networking equipment recovered by 09:25 UTC.

After routing in the WAN fully converged and recovered, there was still above-normal packet loss in localized parts of the network. During this event, our automated systems for maintaining the health of the WAN were paused, including the systems for identifying and removing unhealthy devices and the traffic engineering system for optimizing the flow of data across the network. Because these systems were paused, some paths in the network were not fully optimized and therefore experienced increased packet loss from 09:25 UTC until the systems were manually restarted, restoring the WAN to optimal operating conditions. The recovery was ultimately completed at 12:43 UTC, which explains why customers in different geographies experienced different recovery times; the slowest recoveries were for traffic traversing our regions in India and parts of North America.

How are we making incidents like this less likely and less impactful?

Two main factors contributed to the incident:

  1. A change was made to a standard operating procedure that was not properly revalidated, leaving the procedure with an error and without proper pre- and post-checks.
  2. A standard command that behaves differently on different router models was issued outside of the standard procedure, causing all WAN routers in the IGP domain to recompute reachability.

As such, our repair items include the following:

  • Audit and block similar commands that can have widespread impact across all three vendors for all WAN router roles (Estimated completion: February 2023).
  • Publish real-time visibility of approved automated changes, approved break-glass changes, and unqualified device activity, so that on-call engineers can see who is making which changes on network devices (Estimated completion: February 2023).
  • Continue process improvement by implementing regular, mandatory operational training and attestation that all SOPs are followed (Estimated completion: February 2023).
  • Audit all SOPs still pending qualification; these will be immediately prioritized for a Change Advisory Board (CAB) review within 30 days, including engineer feedback on the viability and usability of each SOP (Estimated completion: April 2023).

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

23

What happened?

Between 15:39 UTC and 19:38 UTC on 23 January 2023, a subset of customers in the South Central US region may have experienced increased latency and/or intermittent connectivity issues while accessing services hosted in the region. Downstream services that were impacted by this intermittent networking issue included Azure App Services, Azure Cosmos DB, Azure IoT Hub, and Azure SQL DB.

What went wrong and why?

The Regional Network Gateway (RNG) in the South Central US region serves network traffic between the Availability Zones (which comprise the datacenters in South Central US), as well as network traffic in and out of the region. During this incident, a single router in the RNG experienced a hardware fault, causing a fraction of network packets to be dropped. Customers may have experienced intermittent connectivity errors and/or error notifications when accessing resources hosted in this region.

How did we respond?

The Azure network design includes extensive redundancy, so that when a router fails, only a small fraction of the traffic through the network is impacted, and automated mitigation systems can restore full functionality by removing any failed routers from service. In this incident, our automated systems were delayed in detecting the hardware failure on the router, as there was no previously known signature for this specific hardware fault. Our synthetic probe-based monitoring fired an alert at 17:58 UTC, which helped narrow down the potential causal location to the RNG. After further investigation we were able to pinpoint the offending router, but it took additional time to isolate because the failure made the router inaccessible via its management interfaces. The unhealthy router was isolated and removed from service at 19:38 UTC, which mitigated the incident.
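A simplified sketch of signature-based auto-mitigation follows: a router is automatically removed from service only when its telemetry matches a known failure signature, so a fault with no prior signature has to wait for probe-based alerting and manual isolation, as happened here. The signature and symptom names are hypothetical.

```python
# Simplified sketch; signature and symptom names are hypothetical.

KNOWN_FAILURE_SIGNATURES = {"linecard_parity_error", "bgp_session_flap_storm"}

def auto_mitigate(router_telemetry: dict) -> str:
    matched = KNOWN_FAILURE_SIGNATURES & set(router_telemetry["symptoms"])
    if matched:
        return f"auto-removed from service (matched: {sorted(matched)})"
    return "no known signature matched; rely on synthetic probes and manual isolation"

print(auto_mitigate({"symptoms": ["intermittent_packet_drops", "mgmt_interface_unreachable"]}))
```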

How are we making incidents like this less likely or less impactful?

  • We are implementing improvements to our standard operating procedures for this class of incident to help mitigate similar issues more quickly (Estimated completion: February 2023).
  • We are implementing additional automated mitigation mechanisms to help identify and isolate such unhealthy routers more quickly in the future (Estimated completion: May 2023).
  • We are still investigating the cause of the hardware fault and are collaborating weekly with the hardware vendor to diagnose this unknown OS/hardware failure and obtain deterministic repair actions.

How can we make our incident communications more useful? 

You can rate this PIR and provide any feedback using our quick 3-question survey: 

18

What happened?

Between 09:44 and 13:10 UTC on 18 January 2023, a subset of customers using Storage services in West Europe may have experienced higher than expected latency, timeouts or HTTP 500 errors when accessing data stored on Storage accounts hosted in this region. Other Azure services with dependencies on this specific storage infrastructure may also have experienced impact – including Azure Application Insights, Azure Automation, Azure Container Registry, Azure Database for MySQL, Azure Database for PostgreSQL, Azure Red Hat OpenShift, Azure Search, Azure SQL Database, and Azure Virtual Machines (VMs).

What went wrong and why?

We determined that an issue occurred during planned power maintenance. While all server racks have redundant dual power feeds, one feed was powered down for maintenance and a failure in the redundant feed caused a shutdown of the affected racks. This unexpected event was caused by a failure in the electrical systems feeding the affected racks: two Power Distribution Unit (PDU) breakers feeding the impacted racks tripped. The breakers had a lower-than-designed trip value of 160A, versus the 380A to which they should have been set. Our investigation determined that this lower value was a remnant of previous heat load tests and should have been realigned to the site design after testing completed. The misconfiguration led to an overload event on the breakers once power had been removed from the secondary feeds for maintenance. This caused an incident for a subset of storage and networking infrastructure in one datacenter of one Availability Zone in West Europe, impacting storage tenants and causing some network devices to reboot.
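The repair items below include pre-checking breaker trip values before maintenance; a minimal sketch of such a check is shown here. The 160A and 380A values come from the PIR, but the breaker names and readings are hypothetical.

```python
# Minimal pre-maintenance check sketch; breaker names and readings are hypothetical.

SITE_DESIGN_TRIP_AMPS = 380

pdu_breakers = {"PDU-1A": 380, "PDU-2B": 160, "PDU-3C": 160}   # configured trip values

misconfigured = {name: amps for name, amps in pdu_breakers.items()
                 if amps != SITE_DESIGN_TRIP_AMPS}

if misconfigured:
    print(f"Do not proceed with feed maintenance; realign breakers first: {misconfigured}")
else:
    print("All breaker trip values match site design; maintenance may proceed.")
```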

How did we respond?

The issue was detected by the datacenter operations team performing the maintenance at the time. We immediately initiated the maintenance rollback procedure and restored power to the affected racks. Concurrently, we escalated the incident and engaged other Azure service stakeholders to initiate and validate service recovery. Most impacted resources recovered automatically following the power event, through automated recovery processes.

The storage team identified two storage scale units that did not come back online automatically: nodes were not booting properly, as network connectivity was still unavailable. Networking teams were engaged to investigate and identified a Border Gateway Protocol (BGP) issue. BGP is the standard routing protocol used to exchange routing and reachability information between networks. Since BGP functionality did not recover automatically, 3 of the 20 impacted top-of-rack (ToR) networking switches stayed unavailable, and networking engineers restored the BGP sessions manually. One storage scale unit was fully recovered by 10:00 UTC, and the other by 13:10 UTC.

How are we making incidents like this less likely or less impactful?

  • We have reviewed (and corrected where necessary) all PDU breakers within the facility to align to site design (Completed).
  • We are ensuring that the Operating Procedures at all datacenter sites are updated to pre-check breaker trip values prior to all maintenance in the future (Estimated completion: February 2023).
  • We are conducting a full review of our commissioning procedures around heat load testing, to ensure that systems are aligned to site design, after any heat load tests (Estimated completion: February 2023).
  • In the longer term, we are exploring ways to improve our networking hardware automation, to differentiate between hardware failure and power failure scenarios, to ensure a more seamless recovery during this class of incident (Estimated completion: September 2023).

How can customers make incidents like this less impactful?

  • Consider using Availability Zones (AZs) to run your services across physically separate locations within an Azure region. To help services be more resilient to datacenter-level failures like this one, each AZ provides independent power, networking, and cooling. Many Azure services support zonal, zone-redundant, and/or always-available configurations:
  • Consider the right Storage redundancy options for your critical applications. Zone-redundant storage (ZRS) remains available throughout a zone-localized failure, like the one in this incident. Geo-redundant storage (GRS) enables account-level failover in case the primary region endpoint becomes unavailable (a simplified comparison follows this list):
  • Consider using Azure Chaos Studio to recreate the symptoms of this incident as part of a chaos experiment, to validate the resilience of your Azure applications. Our library of faults includes VM shutdown, network block, and AKS faults that can help to recreate some of the connection difficulties experienced during this incident – for example, by targeting all resources within a single Availability Zone:
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review:
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more:
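As a quick, simplified reference for the redundancy guidance above (this is a high-level summary, not an exhaustive treatment of Azure Storage redundancy options):

```python
# Simplified summary of which redundancy options keep data available through
# which failure scope; consult the Azure Storage redundancy documentation for details.

REDUNDANCY_SURVIVES = {
    "LRS":  {"rack/node failure"},
    "ZRS":  {"rack/node failure", "single availability zone outage"},
    "GRS":  {"rack/node failure", "regional outage (after account failover)"},
    "GZRS": {"rack/node failure", "single availability zone outage",
             "regional outage (after account failover)"},
}

def survives(option: str, failure: str) -> bool:
    return failure in REDUNDANCY_SURVIVES.get(option, set())

print(survives("ZRS", "single availability zone outage"))   # True - as in this incident
print(survives("LRS", "single availability zone outage"))   # False
```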

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: