March 2023
23
The following is our preliminary Post Incident Report (PIR) for this incident, sharing what we know so far. We will follow up within 14 days with a final PIR.
What happened?
Between 02:20 UTC and 07:30 UTC on 23 Mar 2023 you may have experienced issues using Azure Resource Manager (ARM) in West Europe when performing resource management operations. This would have impacted users of Azure Portal, Azure CLI, Azure PowerShell, as well as Azure services which depend upon ARM for their internal resource management operations.
The primary source of impact was limited to ARM API calls being processed in our West Europe region. This caused up to 50% of customer requests to this region to fail (approximately 3% of global requests at the time). This principally affected customers and workloads in geographic proximity to our West Europe region, while customers geographically located elsewhere would not have been impacted (with limited exceptions for VPN users and those on managed corporate networks).
Additionally, Azure services that leverage the ARM API as part of their own internal workflows, and customers of these services, may have experienced issues managing Azure resources located in West Europe as a result.
What went wrong and why?
This outage was the result of a positive feedback loop leading to saturation on the ARM Web API tier. This was caused by high-volume, short-held lock contention on the request serving path, which triggered a significant increase in spin-waits against these locks, driving up CPU load and preventing threads from picking up asynchronous background work.
As a result of this, latency for long running asynchronous operations (such as outgoing database and web requests) increased, leading to timeouts. These timeouts caused both internal and external clients to retry requests, further increasing load and contention on these locks and eventually causing our Web API tier to saturate its available CPU capacity.
Several factors contributed to amplifying this feedback loop; however, the ultimate trigger was the recent introduction of a cache used to reduce the time spent parsing complex feature flag definitions in hot loops.
This change was intended to reduce the performance impact of using feature flags on the request serving path and had been previously load tested and validated in our internal testing and canary environments - demonstrating a significant reduction in performance impact in these scenarios.
This change was rolled out following our standard Safe Deployment Process and was progressively deployed to increasingly larger regions over the course of 4 days prior to being deployed to West Europe. Over this time period, it was not exposed to the problematic call pattern, and none of these regions exhibited anomalous performance characteristics.
When this change was deployed to our West Europe region, it was subjected to a call pattern unique to a specific internal service which exercised this cache path more heavily than the broad-spectrum workloads we had tested in our internal and canary environments.
Approximately 24 hours after it was deployed to West Europe, a spike in traffic from this internal service that executed a daily cache refresh was able to induce enough lock contention to start this positive feedback loop across a significant portion of the ARM Web API instances in the region. This cascaded as the service in question retried failed requests and over the course of 20 minutes, the region progressed from a healthy to heavily saturated state.
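To make the pattern described above more concrete, the following sketch is purely illustrative (hypothetical names, not ARM's actual code): a parse cache guarded by a single lock on the request serving path. Under a call pattern that hits this path heavily, callers serialize behind the lock, and time spent waiting surfaces as increased CPU load and latency rather than useful work.

    import threading

    # Illustrative only: a parse cache guarded by one lock on the request hot path.
    _cache = {}
    _cache_lock = threading.Lock()

    def parse_flag_definition(raw_definition):
        # Stand-in for the costly parsing work the cache was meant to avoid.
        return raw_definition.strip().lower()

    def get_flag(name, raw_definition):
        # Every caller takes the same lock, so a heavy call pattern serializes
        # requests here and turns waiting time into CPU load and latency.
        with _cache_lock:
            value = _cache.get(name)
            if value is None:
                value = parse_flag_definition(raw_definition)
                _cache[name] = value
            return value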
These recent contributors combined with several other factors to trigger and exacerbate the impact. These include, but are not limited to, the following:
- A legacy API implementation, whose responses varied infrequently, made heavy use of costly data transforms on each request without caching.
- The introduction of a new feature flag which influenced the data transforms applied to this legacy API, as well as several others, in support of ongoing improvements to Azure's regional expansion workflows.
- Internal retry logic that can increase load during a performance degradation scenario (while also significantly improving customer-experienced reliability in other scenarios).
- External clients that implement retry logic which can further increase load during a saturation scenario (a minimal backoff-with-jitter sketch follows this list).
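Retry behaviour on either side of an API can dampen or amplify this kind of saturation. As a hedged, generic illustration (not Azure's internal retry policy), the sketch below shows capped exponential backoff with full jitter, which spreads retries out rather than synchronizing them into waves of load; the TransientError type and parameter values are hypothetical.

    import random
    import time

    class TransientError(Exception):
        """Stand-in for a retryable failure such as a timeout or HTTP 5xx response."""

    def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
        """Retry a callable with capped exponential backoff and full jitter (illustrative)."""
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except TransientError:
                if attempt == max_attempts:
                    raise
                # Full jitter: a random delay up to the capped exponential bound,
                # so many callers do not retry in synchronized waves.
                bound = min(max_delay, base_delay * (2 ** (attempt - 1)))
                time.sleep(random.uniform(0, bound))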
How did we respond?
At 02:41 UTC we were alerted to a drop in regional availability for Azure Resource Manager in West Europe. This was 21 minutes after the first measurable deviation from nominal performance and 6 minutes after the first significant drop in service availability for the region.
Initial investigations identified an increase in failures for calls to downstream storage dependencies, and an increase in both CPU usage and number of errors served from the reverse proxy layer in front of our Web API services.
We were aware of the most recent deployment to this region, which had happened about 24 hours earlier; however, this region (and up to 20 others) had been operating nominally with this new code. The impact appeared to correspond to a significant increase in CPU load without any visible increase in traffic to the region – this atypical behaviour, paired with the fact that all dependencies in the region appeared to be healthy, obscured the cause of the load increase.
Following our standard response guidance for saturation incidents, we scaled up our Web API tier in West Europe to reduce the impact to customers; however, these new instances were immediately saturated with incoming requests, having little impact on the region's degraded availability. We also determined that shifting traffic away from this region was unlikely to improve the situation and was more likely to cause a multi-region outage, since the problem appeared to be contained within the ARM service and tied to a workload in West Europe.
As a result of the significant increase in CPU load, our internal performance profiling systems automatically began sampling call paths in the service. By 05:15 UTC, we had used this profiling information to attribute the cause to a spin lock in our feature flagging system's cache which was being triggered by a specific API. Using this information, we were able to identify the internal service responsible for the majority of calls to this API and confirmed that a spike in requests from this service was the trigger for the load increase.
We engaged team members from this internal service and disabled the workload, while simultaneously blocking the client to reduce the generated load. These changes took effect at 06:30 UTC, which in turn reduced the incoming load, and we started to observe improvements to regional availability in West Europe. Availability continued to improve as our phased configuration rollout within the region progressed, improving the health of the platform until, by 07:30 UTC, it had returned to nominal availability at >99.999%. At 07:54 UTC we confirmed mitigation and transitioned into the investigation and repair phase.
Once mitigation was confirmed, we set about hardening the system against any potential recurrence of the issue, starting with a rollback of the latest code release in West Europe to the previous release.
Simultaneously, we set to work reverting the code changes primarily responsible for triggering this failure mode, and developing patches for the other contributing factors. These included adjusting the caching system used for feature flags to remove the potential for lock contention, and adding caching to the legacy API responsible for generating this load, to significantly reduce the potential for positive feedback loops on these request paths and others like them.
How are we making incidents like this less likely or less impactful?
- Roll back ARM releases globally which contain code relating to this performance regression. (Completed)
- Remove the recently introduced feature flag performance cache and hot path feature flags (and features) which depend upon it. (Completed)
- Implement caching on the legacy API responsible for triggering this lock contention. (Completed)
- Implement and load test a feature flag performance cache which operates without the use of locking; a general sketch of the lock-free read pattern follows this list. (Estimated completion: April 2023)
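As a general sketch of the lock-free read pattern referenced in the last repair item (illustrative only, and not the actual ARM implementation), a read-mostly cache can avoid locks on its hot path by treating the cache as an immutable snapshot that writers replace atomically:

    import threading
    from types import MappingProxyType

    class SnapshotCache:
        """Read-mostly cache with lock-free reads (illustrative sketch only)."""

        def __init__(self):
            self._snapshot = MappingProxyType({})
            self._write_lock = threading.Lock()  # serializes writers only

        def get(self, key, default=None):
            # Readers dereference the current immutable snapshot; no lock is taken.
            return self._snapshot.get(key, default)

        def put(self, key, value):
            # Writers copy, update, and atomically swap in a new snapshot.
            with self._write_lock:
                updated = dict(self._snapshot)
                updated[key] = value
                self._snapshot = MappingProxyType(updated)

In CPython the reference swap is atomic; other runtimes would use an atomic reference or a read-copy-update style primitive to the same effect.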
Next Steps:
We will follow up within 14 days with a final PIR.
6
What happened?
Between 03:50 UTC and 17:55 UTC on 6 March 2023, a subset of customers using Azure Storage may have experienced greater than expected throttling when performing requests against Storage resources located in the West Europe region. Azure services dependent on Azure Storage may also have experienced intermittent failures and degraded performance due to this issue. These included Azure Automation, Azure Arc enabled Kubernetes, Azure Bastion, Azure Batch, Azure Container Apps, Azure Data Factory (ADF), Azure ExpressRoute \ ExpressRoute Gateways, Azure HDInsight, Azure Key Vault (AKV), Azure Logic Apps, Azure Monitor, and Azure Synapse Analytics.
What went wrong and why?
Azure Storage employs a throttling mechanism that ensures storage account usage remains within the published storage account limits (for more details, refer to https://learn.microsoft.com/azure/storage/common/scalability-targets-standard-account). This throttling mechanism is also used to protect a storage service scale unit from exceeding the overall resources available to the scale unit. In the event that a scale unit reaches such limits, the scale unit employs a safety protection mechanism to throttle accounts that are deemed to contribute to the overload, while also balancing load across other scale units.
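As a hedged illustration of this general mechanism (not the actual Azure Storage implementation, and with hypothetical limits), per-account throttling is commonly modelled as a token bucket: each account accrues request tokens at its published rate, and requests that find the bucket empty receive a throttling response.

    import time

    class TokenBucket:
        """Per-account request throttle (illustrative sketch, hypothetical limits)."""

        def __init__(self, rate_per_second, burst):
            self.rate = rate_per_second          # sustained requests per second allowed
            self.capacity = burst                # allows short bursts above the sustained rate
            self.tokens = burst
            self.last_refill = time.monotonic()

        def allow(self):
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False                         # caller returns a throttling response

    # Example: a hypothetical account limit of 2,000 requests/second with a burst allowance.
    account_limiter = TokenBucket(rate_per_second=2000, burst=4000)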
Configuration settings are utilized to control how the throttling services monitor usage and apply throttling. A new configuration was rolled out to improve the throttling algorithm. Although this configuration change followed the usual Safe Deployment Practices (for more details, refer to: https://learn.microsoft.com/en-us/devops/operate/safe-deployment-practices), the issue was found mid-way through the deployment. The change had adverse effects on a small set of scale units, due to their specific load and characteristics, where the mechanism unexpectedly throttled some storage accounts in an attempt to bring the scale unit back to a healthy state.
How did we respond?
Automated monitoring alerts were triggered, and engineers were immediately engaged to assess the issues that the service-specific alerts reported. This section provides a timeline of how we responded. Batch started investigating these specific alerts at 04:08 UTC; this investigation showed that there were storage request failures and service availability impact. At 04:40 UTC, Storage engineers were engaged to begin investigating the cause. While Storage was engaged and investigating with Batch, at 05:30 UTC automated monitoring alerts were triggered for ADF. The ADF investigation showed there was an issue with underlying Batch accounts, and Batch confirmed with ADF that they were impacted by storage failures and were working with Storage engineers to mitigate. At this time, Storage engineers diagnosed that one scale unit was operating above normal parameters, including identifying that Batch storage accounts were being throttled due to tenant limits; the action engineers took was to load balance traffic across different scale units. By 06:34 UTC, Storage engineers had started migrating Batch storage accounts in an effort to mitigate the ongoing issues.
At 07:15 UTC, automation detected an issue with AKV requests. The AKV investigation showed there was an issue with the underlying storage accounts. Around 09:10 UTC, engineers performed a failover that mitigated the issue for all existing AKV read operations; however, create, read (for new requests), update and delete operations for AKV were still impacted. Around 10:00 UTC, Storage engineers correlated the occurrences of the downstream impacted services with the configuration rollout, because the scope of the issue had expanded to additional scale units. By 10:15 UTC, Storage engineers began reverting the configuration change on select impacted scale units. The Batch storage account migration finished around 11:22 UTC, after which the Batch service became healthy at 11:35 UTC. ADF began to recover after the Batch mitigation was completed, and was fully recovered around 12:51 UTC once the accumulated task backlog had been consumed.
By 16:34 UTC, impacted resources and services had recovered. Shortly thereafter, engineers scheduled a rollback of the configuration change on all scale units (even those that were not impacted) and declared the incident mitigated.
How are we making incidents like this less likely or less impactful?
- We are tuning our monitoring to anticipate and quickly detect when a storage scale unit might engage its scale-unit protection mechanism and apply throttling to storage accounts. These monitors will proactively alert engineers to take necessary actions (Estimated completion: March 2023).
- Moving forward, we will roll out throttling improvements in tracking mode first (sketched after this list), to assess their impact before they are enabled, since throttling improvements may react differently to different workload types and load on a particular scale unit (Estimated completion: March 2023).
The above improvement areas will help to prevent/detect storage-related issues across first-party services that are reliant on Azure storage accounts - for example, Batch and Data Factory.
- Our AKV team is working on improvements to the current distribution of storage accounts across multiple scale units, and updating its storage implementation to ensure that read and write availability can be decoupled when such incidents happen (Estimated completion: May 2023).
- We continue to expand our AIOps detection system to provide better downstream impact detection and correlation - to notify customers more quickly, and to identify/mitigate impact more quickly (Ongoing).
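The tracking-mode repair item above can be illustrated with a minimal sketch (hypothetical, not the Azure Storage code): the candidate throttling policy is evaluated alongside the current one, but its decision is only logged for comparison until it is explicitly enabled for enforcement.

    import logging

    logger = logging.getLogger("throttling")

    def should_throttle(request, current_policy, candidate_policy, tracking_mode=True):
        """Evaluate a candidate throttling policy in tracking mode (illustrative only)."""
        enforced = current_policy(request)
        candidate = candidate_policy(request)
        if candidate != enforced:
            # In tracking mode, disagreements are logged for offline comparison
            # instead of changing customer-visible behaviour.
            logger.info("candidate throttling decision differs: enforced=%s candidate=%s",
                        enforced, candidate)
        return enforced if tracking_mode else candidate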
How can customers make incidents like this less impactful?
- To get the best performance from Azure Storage, including when throttled, consider following the Performance and Scalability Checklist for Blob Storage: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-performance-checklist
- Consider which Storage redundancy options are right for your critical applications. Geo-redundant storage (GRS) enables account-level failover in case the primary region endpoint becomes unavailable, like in this incident. https://docs.microsoft.com/azure/storage/common/storage-redundancy
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://learn.microsoft.com/en-us/azure/architecture/framework/resiliency/
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/R_36-P80
February 2023
7
What happened?
Between 20:19 UTC on 7 February 2023 and 04:30 UTC on 9 February 2023, a subset of customers with workloads hosted in the Southeast Asia and East Asia regions experienced difficulties accessing and managing resources deployed in these regions.
One Availability Zone (AZ) in Southeast Asia experienced a cooling failure. Infrastructure in that zone was shut down to protect data and infrastructure, which led to failures in accessing resources and services hosted in that zone. However, this zonal failure also resulted in two further unexpected failures: first, regional degradation for some services; and second, services designed to support failover to other regions or zones did not work reliably.
The services which experienced degradation in the region included App Services, Azure Kubernetes Service, Databricks, API Management, Application Insights, Backup, Cognitive Services, Container Apps, Container Registry, Container Service, Cosmos DB, NetApp Files, Network Watcher, Notification Hubs, Purview, Redis Cache, Search, Service Bus, SignalR Service, Site Recovery, SQL Data Warehouse, SQL Database, Storage, Synapse Analytics, Universal Print, Update Management Center, Virtual Machines, Virtual WAN, Web PubSub – as well as the Azure portal itself, and subsets of the Azure Resource Manager (ARM) control plane services.
BCDR services designed to support regional or zonal failover that did not work as expected included Azure Site Recovery, Azure Backup, and Storage accounts leveraging Geo Redundant Storage (GRS).
What went wrong and why?
In this PIR we will break down what went wrong into three parts. The first part will focus on the failures that led to the loss of an availability zone. The second part will focus on the unexpected impact to the ARM control plane and BCDR services. Lastly, we will focus on the extended recovery for the SQL Database service.
Chiller failures in a single Availability Zone
At 15:17 UTC on 7 February 2023, a utility voltage dip event occurred on the power grid, affecting a single Availability Zone in the Southeast Asia region. Power management systems managed the voltage dip as designed; however, a subset of chiller units that provide cooling to the datacenter tripped and shut down. Emergency Operational Procedures (EOP) were performed as documented for impacted chillers but were not successful. Cooling capacity was reduced in the facility for a prolonged time, and despite efforts to stabilize by shutting down non-critical infrastructure, temperatures continued to rise in the impacted datacenter. At 20:19 UTC on 7 February 2023, infrastructure thermal warnings from components in the datacenter directed a shutdown of critical compute, network, and storage infrastructure to protect data durability and infrastructure health. This resulted in loss of resource and service availability in the impacted zone in Southeast Asia.
Maximum cooling capacity for this facility incorporates eight chillers from two different suppliers. Chillers 1 through 5, from supplier A, provide 47% of the cooling capacity; chillers 6, 7, and 8, from supplier B, make up the remaining 53%. Chiller 5 was offline for planned maintenance. After the voltage dip, chiller 4 continued to run and chillers 1, 2, and 3 were started, but they could not provide the cooling necessary to stabilize the temperatures, which had already increased during the restart period. These four units were operating as expected, but chillers 6, 7, and 8 could not be started. This was not the expected behavior of these chillers, which should automatically restart when tripped by a power dip. The chillers also failed a manual restart as part of EOP execution.
This fault condition is currently being investigated by the OEM chiller vendor. Preliminary investigations point to the compressor control card, which ceased to respond after the power dip, inhibiting the chillers from restarting successfully. From the chiller HMI alarm log, a “phase sequence alarm” was detected. To clear this alarm, the compressor control card had to be hard reset, which required shutting the unit down and leaving it powered off for at least 5 minutes to drain the internal capacitors before powering it on. The current EOP did not list these steps; this is addressed later in the PIR.
Technicians from supplier B were dispatched onsite to assist. By the time the chiller compressor control card was reset with their assistance, the chilled water loop temperatures had exceeded the threshold of 28 degrees Celsius, causing a lockout of the restart function to protect the chillers from damage. This condition is referred to as a thermal lockout. To reduce the chilled water loop temperature, augmented cooling had to be brought online and thermal loads had to be reduced by shutting down infrastructure, as previously stated. This successfully reduced the chilled water loop temperature below the thermal lockout threshold and enabled the restart function for chillers 6, 7, and 8. Once the chillers were restarted, temperatures progressed toward expected levels and, with all units recovered, by 14:00 UTC on 8 February 2023 temperatures were back within our standard operational thresholds.
With confidence that temperatures were stable, we then began to restore power to the affected infrastructure, starting a phased process to first bring storage scale units back online. Once storage infrastructure was verified healthy and online, compute scale units were powered up. As the compute scale units became healthy, virtual machines and other dependent Azure services recovered. Most customers should have seen platform service recovery by 16:30 UTC on 8 February; however, a small number of services took an extended time to fully recover, completing by 04:30 UTC on 9 February 2023.
Unexpected impact to ARM and BCDR services
While infrastructure in the impacted AZ was powered down, services deployed and running in a zone-resilient configuration were largely unimpacted by the shutdown of power in the zone. However, multiple services (listed above) may have experienced some form of degradation. This was due to dependencies on ARM, which did experience unexpected impact from the loss of a single AZ.
In addition, three services that are purpose-built for business continuity and disaster recovery (BCDR) planning and execution, in both zonal and regional incidents, also experienced issues, and will be discussed in detail:
- Azure Site Recovery (ASR) – Helps ensure business continuity by keeping business apps and workloads running during outages. Site Recovery replicates workloads running on physical and virtual machines (VMs) from a primary site to a secondary location. When an incident occurs at a customer’s primary site, a customer can fail over to a secondary location. After the primary location is running again, customers can fail back.
- Azure Backup – Allows customers to make backups that keep data safe and recoverable.
- Azure Storage – Storage accounts configured for Geo Redundant Storage (GRS) to provide resiliency for data replication.
Impact to the ARM service:
As the infrastructure of the impacted zone was shut down, the ARM control plane service (which is responsible for processing all service management operations) in Southeast Asia lost access to a portion of its metadata store, hosted in a Cosmos DB instance that was in the impacted zone. This instance was expected to be zone resilient but, due to a configuration error local to the Southeast Asia region, it was not. The inability to access this metadata resulted in failures and contention across the ARM control plane in that region. To mitigate impact, the ARM endpoints in Southeast Asia were manually removed from their global service Traffic Manager profile. This was completed at 04:10 UTC on 8 February 2023. Shifting the ARM traffic away from Southeast Asia increased traffic to nearby APAC regions, in particular East Asia, which quickly exceeded provisioned capacity. This exacerbated the impact to include East Asia, where customers would have experienced issues managing their resources via ARM. The ARM service in East Asia was scaled up, and by 09:30 UTC on 8 February the service was fully recovered.
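A configuration error of this kind is the sort of gap that a periodic resilience audit can catch. The sketch below is purely illustrative (a hypothetical inventory format, not an Azure-internal tool): it walks an inventory of critical dependencies and flags any that are not deployed zone-redundantly in regions that support Availability Zones.

    from dataclasses import dataclass

    @dataclass
    class Dependency:
        service: str
        region: str
        region_has_zones: bool
        zone_redundant: bool

    def audit_zone_redundancy(dependencies):
        """Return critical dependencies that should be zone redundant but are not."""
        return [d for d in dependencies if d.region_has_zones and not d.zone_redundant]

    # Hypothetical inventory entries; the first would be flagged by the audit.
    inventory = [
        Dependency("ARM metadata store", "Southeast Asia", True, False),
        Dependency("ARM metadata store", "East Asia", True, True),
    ]

    for gap in audit_zone_redundancy(inventory):
        print(f"Resilience gap: {gap.service} in {gap.region} is not zone redundant")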
Impact to BCDR services
A subset of Azure Site Recovery customers started experiencing failures with respect to zonal failovers from the Southeast Asia region and regional failovers to the East Asia region.
Initial failures were caused by a recent update to a workflow that leveraged an instance of Service Bus as the communication backbone; this instance was not zonally resilient and was located in the impacted zone. Failures related to workflow dependencies on Service Bus were completely mitigated by 04:45 UTC on 8 February 2023, by disabling the newly added update in the Southeast Asia region.
Unfortunately, further failures continued for a few customers due to the above ARM control plane issues that were ongoing. At 09:15 UTC on 8 February 2023, once the ARM control plane issues were mitigated, all subsequent ASR failovers worked reliably.
The ARM impact caused failures and delays in restoring applications and data using Azure Backup cross region restore to East Asia. A portion of Azure Backup customers in the Southeast Asia region had enabled the cross region restore (CRR) capability, which allows them to restore their applications and data in the East Asia region during regional outages. At 09:00 UTC on 8 February, we started proactively enabling the CRR capability for all Azure Backup customers with GRS Recovery Services Vaults (RSV). This was a proactive mitigation action to enable customers who had not enabled this capability to recover their applications and data in the East Asia region. Observing customer activity, we detected a configuration issue in one of the microservices, specific to the Southeast Asia region, which caused CRR calls to fail for a subset of customers. This issue was mitigated for this subset through a configuration update, and the cross region application and data restores were fully functioning by 15:10 UTC on 8 February 2023.
Most customers using Azure Storage with GRS were able to execute customer controlled failovers without any issues. A subset of customers encountered unexpected delays or failures. There were two underlying causes for this. The first was an integrity check that the service performs prior to a failover. In some instances, this check was too aggressive and blocked the failover from completing. The second cause was that storage accounts using hierarchical namespaces could not complete a failover.
Extended Recovery for SQL Database
The impact to Azure SQL Database customers throughout this incident would have fallen into the following categories.
Customers using Multi-AZ Azure SQL were not impacted, except those using proxy mode connections, due to one connectivity capacity unit not being configured with zone resilience.
Customers using active geo-replication with automatic failover were failed out of the region and failed back when recovery was complete. These customers had ~1.5 hours of downtime prior to automatic failover completing.
Self-help geo-failover or geo-restore: Some customers who had active geo-replication self-initiated failover, and some performed geo-restore to other regions.
The rest of the impacted SQL Database customers were impacted from the point of power-down and recovered as compute infrastructure was brought back online and Tenant Rings (TR) reformed automatically. One Tenant Ring of capacity for SQL Database required manual intervention to properly reform. Azure SQL Database capacity in a region is divided into Tenant Rings; in Southeast Asia there are hundreds of rings. Each ring consists of a group of VMs (10-200) hosting a set of databases. Rings are managed by Azure Service Fabric (SF) to provide availability when we have VM, network or storage failures. For one of the TRs, when power was restored to the compute infrastructure hosting the supporting group of VMs, the ring did not reform and thus was unable to make any of the databases on that ring available. The infrastructure hosting these VMs did not recover automatically due to a BIOS bug. After the compute infrastructure was successfully started, a second issue, in heap memory management in SF, also required manual intervention to reform the TR and make customer databases available. The final TR became available at 04:30 UTC on 9 February 2023.
How are we making incidents like this less likely or less impactful?
Firstly, we want to provide a status update on the review of facilities readiness in response to the chiller issues:
- Our datacenter team engaged with the OEM vendor to fully understand the results and any additional follow-ups to prevent, shorten the duration of, and reliably recover after these types of events. (Completed)
- We are implementing updates to existing procedures to include manual (cold) restarting of the chillers, as recommended by the OEM vendor. (Completed)
- We are also arranging additional training from the OEM for our datacenter operations personnel, to familiarize them with the updated procedures. (Completed)
- When it comes to executing BCDR plans in the event of single-AZ impact, we discussed in the earlier PIR that the impact to ARM in multiple regions inhibited customers' ability to reliably self-mitigate. We will complete a validation of ARM service configuration in the Southeast Asia region to prevent this class of impact from repeating. (Completed)
- We will also validate that all instances of the Azure Resource Manager service are automatically region and/or zone redundant. (Completed)
- We will ensure that there is sufficient capacity for ARM in all regions to manage additional traffic caused by zonal and/or regional failures. (Completed)
- We will be conducting a thorough review of all Foundational services for gaps in resiliency models. This work will be ongoing; however, the first round will be completed for ARM, Software Load Balancer, SQL DB, Cosmos DB, and Redis Cache by April 2023. You can learn more about our Foundational services here (https://learn.microsoft.com/en-us/azure/reliability/availability-service-by-category).
- We will be conducting a dependency assessment for all zonal services, with an automated daily walk of the dependency graph, documenting points of failure and closing any gaps. (Ongoing)
BCDR critical services will take the following remediations:
- We will verify that the configuration issue with Azure Backup cross region restores is corrected in the impacted region and across all other regions. (Completed)
- We will verify that Azure Site Recovery has reverted to a zonally resilient workflow model. (Completed)
- We will be adjusting safety check thresholds for customer-initiated Storage failover for GRS accounts, to ensure failovers aren't unnecessarily blocked. (Completed)
- Beyond VMSS, VMs, Service Fabric, and Azure Kubernetes Service, Chaos Studio will be expanding its scenarios so that customers and Azure services can run further service simulations where it is possible to inject faults that emulate zonal failures. (Estimated completion: December 2023)
- We will ensure that our documentation represents the SQL behavior that honors setting changes made by customers when it comes to the SQL services failing back. More information can be found here (https://learn.microsoft.com/en-us/azure/azure-sql/database/auto-failover-group-sql-db). (Completed)
- Our communications aim to provide insights into mitigation workstreams for incidents. We understand that, for some scenarios, this information is critical for customers to make decisions about invoking BCDR plans. We will strive to provide a recoverability expectation and timeline to assist customers in making these decisions for all incidents moving forward. (Completed)
We acknowledge that rapid correlation of impacted or potentially impacted resources in customers' subscriptions was difficult to translate into customer impact and action plans. Work is planned to be completed by December 2023 that addresses improvements in this ability. The article below provides current steps for customers to understand AZ logical mappings across multiple subscriptions (https://learn.microsoft.com/rest/api/resources/subscriptions/check-zone-peers).
If customers have questions about Compliance, Regulations & Standards visit the Service Trust Portal (https://servicetrust.microsoft.com).
How can we make our incident communications more useful?
You can rate this Post Incident Review (PIR) and provide any feedback using our quick 3-question survey https://aka.ms/AzPIR/VN11-JD8
January 2023
31
What happened?
Between 05:55 UTC on 31 January 2023 and 00:58 UTC on 1 February 2023, a subset of customers using Virtual Machines (VMs) in the East US 2 region may have received error notifications when performing service management operations – such as create, delete, update, scale, start, stop – for resources hosted in this region.
The impact was limited to a subset of resources in one of the region’s three Availability Zones (physical AZ-02), and there was no impact to Virtual Machines in the other two zones. This incident did not cause availability issues for any VMs already provisioned and running, across any of the three Availability Zones (AZs) – impact was limited to the service management operations described above.
Downstream Azure services with dependencies in this AZ were also impacted, including Azure Backup, Azure Batch, Azure Data Factory, Azure Database for MySQL, Azure Database for PostgreSQL, Azure Databricks, and Azure Kubernetes Service.
What went wrong and why?
The East US 2 region is designed and implemented with three Availability Zones. Each of these AZs are further internally sliced into partitions, to ensure that a failure in one partition does not affect the processing in other partitions. Every VM deployment request is processed by one of the partitions in the AZ, and a gateway service is responsible for routing traffic to the right partition. During this incident, one of the partitions associated with physical AZ-02 experienced data access issues at 05:55 UTC on 31 January 2023, because the underlying data store had exhausted an internal resource limit.
While the investigation on the first partition was in progress, at around 11:43 UTC a second partition within the same AZ experienced a similar issue and became unavailable, leading to further impact. Even though this partition became unavailable, the cache service retained the partition data, so new deployments in the cluster were still succeeding. At 12:04 UTC, the cache service restarted and could not retrieve the data, as the target partition was down. Due to a resource creation policy configuration issue, the gateway service required data from all partitions in order to create a new resource. When the cache service did not have the data, this policy resulted in blocking all new VM creations in the AZ. This resulted in additional failures and slowness during the impact window, causing failures for downstream services.
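As a hedged illustration of this failure mode (hypothetical names, not the actual gateway code): the gateway keeps a cached copy of each partition's data, and a creation policy that requires data from every partition turns a single unavailable partition into a zone-wide block once the cache can no longer be refreshed.

    class PartitionUnavailable(Exception):
        """Raised when a partition cannot serve its routing data."""

    class Gateway:
        """Illustrative sketch of the gateway/partition interaction (hypothetical names)."""

        def __init__(self, partitions):
            self.partitions = partitions   # partition name -> callable returning its data
            self.cache = {}                # last successfully fetched data per partition

        def refresh_cache(self):
            for name, fetch in self.partitions.items():
                try:
                    self.cache[name] = fetch()
                except PartitionUnavailable:
                    # After a restart the cache starts empty, so this entry stays missing.
                    self.cache.pop(name, None)

        def create_vm(self, request):
            # Policy as described in the incident: creation requires data from all
            # partitions, so one missing cache entry blocks every new deployment.
            if any(name not in self.cache for name in self.partitions):
                raise RuntimeError("VM creation blocked: partition data unavailable")
            return f"created {request}"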
How did we respond?
Automated monitoring alerts were triggered, and engineers were immediately engaged to assess the issue. As the issue was being investigated, attempts to auto-recover the partition failed to succeed. By 16:45 UTC, engineers had determined and started implementing mitigation steps on the first partition.
At 17:49 UTC, the first partition was successfully progressing with recovery, and engineers decided to implement the recovery steps on the second partition. During this time, the number of incoming requests for VM operations continued to grow. To avoid further impact and failures on additional partitions, at 19:15 UTC Availability Zone AZ-02 was taken out of service for new VM creation requests. This ensured that all new VM creation requests would succeed, as they were automatically redirected to the other two Availability Zones.
By 23:45 UTC on 31 January 2023, both partitions had completed a successful recovery. Engineers continued to monitor the system and, when no further failures were recorded beyond 00:48 UTC on 1 February 2023, the incident was declared mitigated and Availability Zone AZ-02 was fully opened for new VM creations.
How are we making incidents like this less likely or less impactful?
- As an immediate measure, we scanned all partitions across the three Availability Zones to successfully confirm that no other partitions were at risk of the same issue (Completed).
- We have improved our telemetry to increase incident severity automatically if the failure rate increases beyond a set threshold (Completed).
- We have also added automated preventive measures to avoid data store resource exhaustion issues from happening again (Completed).
- We are working towards removing the cross-partition dependencies for VM creation (Estimated completion: April 2023).
- We are adding a new capability to redirect deployments to other AZs based on known faults from a specific AZ (Estimated completion: May 2023).
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/BS81-390
25
What happened?
Between 07:08 UTC and 12:43 UTC on 25 January 2023, customers experienced issues with network connectivity, manifesting as long network latency and/or timeouts when attempting to connect to resources hosted in Azure regions, as well as other Microsoft services including Microsoft 365 and Power Platform. This incident also impacted Azure Government cloud services that were dependent on Azure public cloud. While most regions and services had recovered by 09:05 UTC, intermittent packet loss issues caused some customers to continue seeing connectivity issues due to two routers not being able to recover automatically. All issues were fully mitigated by 12:43 UTC.
What went wrong and why?
At 07:08 UTC a network engineer was performing an operational task to add network capacity to the global Wide Area Network (WAN) in Madrid. The task included steps to modify the IP address for each new router, and integration into the IGP (Interior Gateway Protocol, a protocol used for connecting all the routers within Microsoft’s WAN) and BGP (Border Gateway Protocol, a protocol used for distributing Internet routing information into Microsoft’s WAN) routing domains.
Microsoft’s standard operating procedure (SOP) for this type of operation follows a 4-step process that involves: [1] testing in our Open Network Emulator (ONE) environment for change validation; [2] testing in the lab environment; [3] a Safe-Fly Review documenting steps 1 and 2, as well as roll-out and roll-back plans; and [4] Safe-Deployment, which allows access to only one device at a time, to limit impact.

In this instance, the SOP was changed prior to the scheduled event, to address issues experienced in previous executions of the SOP. Critically, our process was not followed, as the change was not re-tested and did not include proper post-checks per steps 1-4 above. This unqualified change led to a chain of events which culminated in the widespread impact of this incident. This change added a command to purge the IGP database – however, the command operates differently based on router manufacturer. Routers from two of our manufacturers limit execution to the local router, while those from a third manufacturer execute across all IGP-joined routers, ordering them all to recompute their IGP topology databases. While Microsoft has a real-time Authentication, Authorization, and Accounting (AAA) system that must approve each command run on each router, including a list of blocked commands that have global impact, the command’s different (global) default action on the router platform being changed was not discovered during the high-impact commands evaluation for this router model and, therefore, had not been added to the block list.
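The gap described above (a command whose scope differs by router platform and which was missing from the block list for one platform) can be made concrete with a small, purely illustrative sketch of a pre-execution check; the vendor names and command string are hypothetical.

    # Hypothetical vendor names and command string; not the real AAA system.
    BLOCKED_COMMANDS = {
        "vendor_a": {"purge igp database"},
        "vendor_b": {"purge igp database"},
        "vendor_c": set(),  # the gap: on this platform the same command acts network-wide
    }

    def is_command_allowed(vendor, command):
        """Return True if the command may run on a router from this vendor."""
        return command.strip().lower() not in BLOCKED_COMMANDS.get(vendor, set())

    assert not is_command_allowed("vendor_a", "purge igp database")  # blocked as intended
    assert is_command_allowed("vendor_c", "purge igp database")      # slips through the gap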
Azure Networking implements a defense-in-depth approach to maintenance operations which allows access to only one device at a time to ensure that any change has limited impact. In this instance, even though the engineer only had access to a single router, it was still connected to the rest of the Microsoft WAN via the IGP protocol. Therefore, the change resulted in two cascading events. First, routers within the Microsoft global network started recomputing IP connectivity throughout the entire internal network. Second, because of the first event, BGP routers started to readvertise and validate prefixes that we receive from the Internet. Due to the scale of the network, it took approximately 1 hour and 40 minutes for the network to restore connectivity to every prefix.
Issues in the WAN were detected by monitoring, and alerts to the on-call engineers were generated within 5 minutes of the command being run. However, the engineer making the changes was not informed, due to the unqualified changes to the SOP. Because of this, the same operation was performed again on the second Madrid router 33 minutes after the first change, creating two waves of connectivity issues throughout the network and impacting Microsoft customers.
This event caused widespread routing instability affecting Microsoft customers and their traffic flows: to/from the Internet, Inter-Region traffic, Cross-premises traffic via ExpressRoute or VPN/vWAN and US Gov Cloud services using commercial/public cloud services. During the time it took for routing to automatically converge, customer impact dynamically changed as the network completed its convergence. Some customers experienced intermittent connectivity, some saw connections timeout, and others experienced long latency or in some cases even a complete loss of connectivity.
How did we respond?
Our monitoring detected DNS and WAN issues starting at 07:11 UTC. We began investigating by reviewing all recent changes. By 08:20 UTC, as the automatic recovery was happening, we identified the problematic command that triggered the issue. Networking telemetry shows that nearly all network devices had recovered by 09:05 UTC, by which point most regions and services had recovered. Final networking equipment recovered by 09:25 UTC.
After routing in the WAN fully converged and recovered, there was still above normal packet loss in localized parts of the network. During this event, our automated systems for maintaining the health of the WAN were paused, including the systems for identifying and removing unhealthy devices and the traffic engineering system for optimizing the flow of data across the network. Due to the pause in these systems, some paths in the network were not fully optimized and, therefore, experienced increased packet loss from 09:25 UTC until those systems were manually restarted, restoring the WAN to optimal operating conditions. The recovery was ultimately completed at 12:43 UTC and explains why customers in different geographies experienced different recovery times. The long poles were traffic traversing our regions in India and parts of North America.
How are we making incidents like this less likely and less impactful?
Two main factors contributed to the incident:
- A change was made to a standard operating procedure that was not properly revalidated, leaving the procedure containing an error and without proper pre- and post-checks.
- A standard command that has different behaviors on different router models was issued outside of the standard procedure, causing all WAN routers in the IGP domain to recompute reachability.
As such, our repair items include the following:
- Audit and block similar commands that can have widespread impact across all three vendors for all WAN router roles (Estimated completion: February 2023).
- Publish real-time visibility of approved automated and approved break-glass activity, as well as unqualified device activity, to enable on-call engineers to see who is making what changes on network devices (Estimated completion: February 2023).
- Continue process improvement by implementing regular, ongoing mandatory operational training and attestation that all SOPs are followed (Estimated completion: February 2023).
- An audit of all SOPs still pending qualification will immediately be prioritized for a Change Advisory Board (CAB) review within 30 days, including engineer feedback on the viability and usability of each SOP (Estimated completion: April 2023).
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/VSG1-B90
23
What happened?
Between 15:39 UTC and 19:38 UTC on 23 January 2023, a subset of customers in the South Central US region may have experienced increased latency and/or intermittent connectivity issues while accessing services hosted in the region. Downstream services that were impacted by this intermittent networking issue included Azure App Services, Azure Cosmos DB, Azure IoT Hub, and Azure SQL DB.
What went wrong and why?
The Regional Network Gateway (RNG) in the South Central US region serves network traffic between Availability Zones, which includes datacenters in South Central US and network traffic in and out of the region. During this incident, a single router in the RNG experienced a hardware fault, causing a fraction of network packets to be dropped. Customers may have experienced intermittent connectivity errors and/or error notifications when accessing resources hosted in this region.
How did we respond?
The Azure network design includes extensive redundancy such that, when a router fails, only a small fraction of the traffic through the network is impacted - and automated mitigation systems can restore full functionality by removing any failed routers from service. In this incident, there was a delay in our automated systems detecting the hardware failure on the router, as there was no previously known signature for this specific hardware fault. Our synthetic probe-based monitoring fired an alert at 17:58 UTC, which helped to narrow down the potential causal location to the RNG. After further investigation we were able to pinpoint the offending router, but it took additional time to isolate because the failure made the router inaccessible via its management interfaces. The unhealthy router was isolated and removed from service at 19:38 UTC, which mitigated the incident.
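As a generic, hedged illustration of probe-based detection (not the actual Azure monitoring system, and with hypothetical thresholds): synthetic probes are sent across each path, and an alert fires once the measured loss rate over a sliding window exceeds a threshold, even when the device itself still reports healthy.

    from collections import deque

    class LossDetector:
        """Sliding-window packet-loss detector for synthetic probes (illustrative)."""

        def __init__(self, window=1000, alert_threshold=0.01):
            self.results = deque(maxlen=window)   # True = probe answered, False = dropped
            self.alert_threshold = alert_threshold

        def record(self, probe_succeeded):
            self.results.append(bool(probe_succeeded))

        def loss_rate(self):
            if not self.results:
                return 0.0
            return 1.0 - (sum(self.results) / len(self.results))

        def should_alert(self):
            # Only alert once the window is full, to avoid firing on a handful of samples.
            return len(self.results) == self.results.maxlen and \
                   self.loss_rate() > self.alert_threshold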
How are we making incidents like this less likely or less impactful?
- We are implementing improvements to our standard operating procedures for this class of incident to help mitigate similar issues more quickly (Estimated completion: February 2023).
- We are implementing additional automated mitigation mechanisms to help identify and isolate such unhealthy routers more quickly in the future (Estimated completion: May 2023).
- We are still investigating the cause of the hardware fault and are collaborating weekly with the hardware vendor to diagnose this unknown OS/hardware failure and obtain deterministic repair actions.
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/7NBR-T98
18
What happened?
Between 09:44 and 13:10 UTC on 18 January 2023, a subset of customers using Storage services in West Europe may have experienced higher than expected latency, timeouts or HTTP 500 errors when accessing data stored on Storage accounts hosted in this region. Other Azure services with dependencies on this specific storage infrastructure may also have experienced impact – including Azure Application Insights, Azure Automation, Azure Container Registry, Azure Database for MySQL, Azure Database for PostgreSQL, Azure Red Hat OpenShift, Azure Search, Azure SQL Database, and Azure Virtual Machines (VMs).
What went wrong and why?
We determined that an issue occurred during planned power maintenance. While all server racks have redundant dual feeds, one feed was powered down for maintenance, and a failure in the redundant feed caused a shutdown of the affected racks. This unexpected event was caused by a failure in the electrical systems feeding the affected racks: two Power Distribution Unit (PDU) breakers feeding the impacted racks tripped. The breakers had lower-than-designed trip values of 160A, versus the 380A to which they should have been set. Our investigation determined that this lower value was a remnant from previous heat load tests, which should have been aligned back to the site design after testing had completed. This led to an overload event on the breakers once power had been removed from the secondary feeds for maintenance, causing an incident for a subset of storage and networking infrastructure in one datacenter of one Availability Zone in West Europe. This impacted storage tenants, as well as network devices which may have rebooted.
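One of the repair items later in this PIR is to pre-check breaker trip values before maintenance. As a purely illustrative sketch (hypothetical data model, not a real datacenter management integration), such a check reduces to comparing each breaker's configured trip value against the site design value.

    # Hypothetical breaker data; the 380A design value and 160A remnant are taken from this PIR.
    SITE_DESIGN_TRIP_AMPS = 380

    def check_breakers(configured_trip_amps):
        """Return breakers whose configured trip value deviates from the site design."""
        return [(name, amps) for name, amps in configured_trip_amps.items()
                if amps != SITE_DESIGN_TRIP_AMPS]

    configured = {"PDU-1 breaker": 160, "PDU-2 breaker": 160, "PDU-3 breaker": 380}
    for name, amps in check_breakers(configured):
        print(f"{name}: trip value {amps}A does not match site design {SITE_DESIGN_TRIP_AMPS}A")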
How did we respond?
The issue was detected by the datacenter operations team performing the maintenance at the time. We immediately initiated the maintenance rollback procedure and restored power to the affected racks. Concurrently, we escalated the incident and engaged other Azure service stakeholders to initiate/validate service recovery. Most impacted resources recovered automatically following the power event, through automated recovery processes.
The storage team identified two storage scale units that did not come back online automatically – nodes were not booting properly, as network connectivity was still unavailable. Networking teams were engaged to investigate, and identified a Border Gateway Protocol (BGP) issue. BGP is the standard routing protocol used to exchange routing and reachability information between networks. Since BGP functionality did not recover automatically, 3 of the 20 impacted top-of-rack (ToR) networking switches stayed unavailable. Networking engineers restored the BGP session manually. One storage scale unit was fully recovered by 10:00 UTC, the other storage scale unit was fully recovered by 13:10 UTC.
How are we making incidents like this less likely or less impactful?
- We have reviewed (and corrected where necessary) all PDU breakers within the facility to align to site design (Completed).
- We are ensuring that the Operating Procedures at all datacenter sites are updated to pre-check breaker trip values prior to all maintenance in the future (Estimated completion: February 2023).
- We are conducting a full review of our commissioning procedures around heat load testing, to ensure that systems are aligned to site design, after any heat load tests (Estimated completion: February 2023).
- In the longer term, we are exploring ways to improve our networking hardware automation, to differentiate between hardware failure and power failure scenarios, to ensure a more seamless recovery during this class of incident (Estimated completion: September 2023).
How can customers make incidents like this less impactful?
- Consider using Availability Zones (AZs) to run your services across physically separate locations within an Azure region. To help services be more resilient to datacenter-level failures like this one, each AZ provides independent power, networking, and cooling. Many Azure services support zonal, zone-redundant, and/or always-available configurations: https://docs.microsoft.com/azure/availability-zones/az-overview
- Consider which Storage redundancy options are right for your critical applications. Zone redundant storage (ZRS) remains available throughout a zone-localized failure, like in this incident. Geo-redundant storage (GRS) enables account-level failover in case the primary region endpoint becomes unavailable: https://docs.microsoft.com/azure/storage/common/storage-redundancy
- Consider using Azure Chaos Studio to recreate the symptoms of this incident as part of a chaos experiment, to validate the resilience of your Azure applications. Our library of faults includes VM shutdown, network block, and AKS faults that can help to recreate some of the connection difficulties experienced during this incident – for example, by targeting all resources within a single Availability Zone: https://docs.microsoft.com/azure/chaos-studio
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/azure/architecture/framework/resiliency
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/6S_Q-JT8