23 March 2023
What happened?
Between 02:20 UTC and 07:30 UTC on 23 March 2023 you may have experienced issues using Azure Resource Manager (ARM) when performing resource management operations in the West Europe region. This impacted users of Azure CLI, Azure PowerShell, the Azure portal, as well as Azure services which depend upon ARM for their internal resource management operations.
The primary source of impact was limited to ARM API calls being processed in our West Europe region. This caused up to 50% of customer requests to this region to fail (approximately 3% of global requests at the time). This principally affected customers and workloads in geographic proximity to our West Europe region, while customers geographically located elsewhere would not have been impacted – with limited exceptions for VPN users and those on managed corporate networks. Additionally, Azure services that leverage the ARM API as part of their own internal workflows, and customers of these services, may have experienced issues managing Azure resources located in West Europe as a result.
What went wrong and why?
This incident was the result of a positive feedback loop leading to saturation on the ARM web API tier. It was caused by high-volume, short-held lock contention on the request serving path, which triggered a significant increase in spin-waits against these locks, driving up CPU load and preventing threads from picking up asynchronous background work. As a result, latency for long-running asynchronous operations (such as outgoing database and web requests) increased, leading to timeouts. These timeouts caused both internal and external clients to retry requests, further increasing load and contention on these locks, eventually causing our Web API tier to saturate its available CPU capacity.
Several factors contributed to amplifying this feedback loop; however, the ultimate trigger was the recent introduction of a cache used to reduce the time spent parsing complex feature flag definitions in hot loops. This change was intended to reduce the performance impact of using feature flags on the request serving path, and had previously been load tested and validated in our internal testing and canary environments, demonstrating a significant reduction in performance impact in those scenarios.
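The PIR does not publish the cache implementation, but the failure pattern it describes can be illustrated with a minimal sketch: a parse cache guarded by a spin lock on the request serving path, where waiters burn CPU instead of sleeping. The Go sketch below is purely illustrative; the spinLock and flagCache types and the flag name are hypothetical and are not taken from ARM.

```go
// Illustrative sketch only (not ARM code): a feature flag parse cache
// guarded by a spin lock on the request serving path. All names here are
// hypothetical.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// spinLock busy-waits rather than parking the caller, so every waiting
// goroutine keeps a CPU core busy while the lock is held elsewhere.
type spinLock struct{ state int32 }

func (l *spinLock) Lock() {
	for !atomic.CompareAndSwapInt32(&l.state, 0, 1) {
		runtime.Gosched() // spin-wait: yields, but stays runnable and burns CPU
	}
}

func (l *spinLock) Unlock() { atomic.StoreInt32(&l.state, 0) }

type parsedFlag struct{ enabled bool }

// flagCache caches parsed feature flag definitions so hot loops avoid
// re-parsing them; the single lock makes every lookup a contention point.
type flagCache struct {
	mu      spinLock
	entries map[string]parsedFlag
}

func (c *flagCache) Get(name string) parsedFlag {
	c.mu.Lock()
	f, ok := c.entries[name]
	if !ok {
		f = parsedFlag{enabled: len(name)%2 == 0} // stand-in for real parsing work
		c.entries[name] = f
	}
	c.mu.Unlock()
	return f
}

func main() {
	cache := &flagCache{entries: map[string]parsedFlag{}}
	var wg sync.WaitGroup
	// Many concurrent "requests" hitting the same cached flag: each hold is
	// short, but at high volume the spin-waits alone drive CPU load up.
	for i := 0; i < 64; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				cache.Get("regional-expansion-flag")
			}
		}()
	}
	wg.Wait()
	fmt.Println("done")
}
```

Taking the parse work out of the critical section, or removing the read-path lock entirely (one possible approach is sketched under the repair items below), is the usual way to break this kind of amplification.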
This change was rolled out following our standard safe deployment practices, progressively deployed to increasingly larger regions over the course of four days prior to being deployed to West Europe. Over this period, it was not exposed to the problematic call pattern, and none of these regions exhibited anomalous performance characteristics. When this change was deployed to our West Europe region, it was subjected to a call pattern unique to a specific internal service which exercised this cache path more heavily than the broad-spectrum workloads we had tested in our internal and canary environments.
Approximately 24 hours after the change was deployed to West Europe, a spike in traffic from this internal service, which executed a daily cache refresh, induced enough lock contention to start this positive feedback loop across a significant portion of the ARM web API instances in the region. The impact cascaded as the service in question retried failed requests and, over the course of 20 minutes, the region progressed from a healthy to a heavily saturated state.
These recent contributors combined with several other factors to trigger and exacerbate the impact, including:
- A legacy API implementation – whose responses changed infrequently, yet which applied costly data transforms on every request without caching the results.
- The introduction of a new feature flag – which influenced the data transforms applied to this legacy API, as well as several others, in support of ongoing improvements to regional expansion workflows.
- Internal retry logic – which can increase load during a performance degradation scenario (although it also significantly improves customer-experienced reliability in other scenarios).
- External clients which implement retry logic that can increase load during a saturation scenario (a minimal backoff-and-retry sketch follows this list).
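To make the retry factor concrete, the sketch below shows the kind of capped, jittered exponential backoff that reduces retry amplification during saturation. It is a minimal, hypothetical client-side example (the retryWithBackoff helper and its constants are ours, not an Azure SDK API); real clients would typically also honor Retry-After headers and distinguish retryable from non-retryable errors.

```go
// Illustrative sketch only: capped exponential backoff with jitter for
// client retries. Immediate retries multiply load on a saturated service;
// jittered backoff spreads them out. retryWithBackoff is a hypothetical
// helper, not an Azure SDK API.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff retries op up to maxAttempts times, doubling the delay
// after each failure and adding random jitter so retries do not arrive in
// synchronized waves.
func retryWithBackoff(op func() error, maxAttempts int) error {
	const (
		baseDelay = 200 * time.Millisecond
		maxDelay  = 10 * time.Second
	)
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		delay := baseDelay << uint(attempt) // exponential growth
		if delay > maxDelay {
			delay = maxDelay
		}
		jitter := time.Duration(rand.Int63n(int64(delay)))
		time.Sleep(delay/2 + jitter/2) // "equal jitter": half fixed, half random
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := retryWithBackoff(func() error {
		calls++
		if calls < 3 {
			return errors.New("503: service saturated")
		}
		return nil
	}, 5)
	fmt.Println("calls:", calls, "err:", err)
}
```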
How did we respond?
At 02:41 UTC we were alerted to a drop in regional availability for Azure Resource Manager in West Europe. This was 21 minutes after the first measurable deviation from nominal performance, and 6 minutes after the first significant drop in service availability for the region. Diagnosing the cause took an extended amount of time because the issue was obscured by several factors. First, we identified an increase in failures for calls to downstream storage dependencies, and an increase in both CPU usage and the number of errors served from the reverse proxy layer in front of our Web API services. The impact appeared to correspond to a significant increase in CPU load without any visible increase in traffic to the region. This was not typical, and all dependencies in the region appeared to be healthy.
We were aware that the most recent deployment to this region had happened about 24 hours earlier; however, this region (and up to 20 others) had been operating nominally with this new code. Following our standard response guidance for saturation incidents, we scaled up our Web API tier in West Europe to reduce the impact to customers; however, these new instances were immediately saturated with incoming requests and had little impact on the region's degraded availability. We also determined that shifting traffic away from this region was unlikely to improve the situation and had the potential to cause a multi-region incident, since the problem appeared to be contained within the ARM service and tied to a workload in West Europe.
As a result of the significant increase in CPU load, our internal performance profiling systems automatically began sampling call paths in the service. At 05:15 UTC, this profiling information allowed us to attribute the cause to a spin lock in our feature flagging system's cache, which was being triggered by a specific API. Using this information, we identified the internal service responsible for the majority of calls to this API and confirmed that a spike in requests from this service was the trigger for the load increase. We engaged team members from this internal service and disabled the workload, while simultaneously blocking the client to reduce the generated load. By 06:30 UTC these changes had taken effect, reducing the incoming load, and we started to observe improvements to regional availability in West Europe.
Availability continued to improve as our phased configuration rollout within the region progressed, restoring the health of the platform. By 07:30 UTC, the region had returned to nominal availability of >99.999%. At 07:54 UTC we confirmed mitigation and transitioned into the investigation and repair phase. Once mitigation was confirmed, we set about hardening the system against any potential recurrence, starting with a rollback of the latest code release in West Europe to the previous release. Simultaneously, we set to work reverting the code changes primarily responsible for triggering this failure mode and developing patches for the other contributing factors – including adjusting the caching system used for feature flags to remove the potential for lock contention, and adding caching to the legacy API responsible for generating this load, to significantly reduce the potential for positive feedback loops on these request paths and others like them.
How are we making incidents like this less likely or less impactful?
- We have rolled back, globally, the ARM release that contained the code relating to this performance regression. (Completed)
- We have removed the recently introduced feature flag performance cache, along with the hot-path feature flags and features that depended upon it. (Completed)
- We have implemented caching on the legacy API responsible for triggering this lock contention. (Completed)
- We are working to implement and load test a feature flag performance cache which operates without the use of locking; a sketch of one possible lock-free approach follows this list. (Estimated completion: April 2023)
- We are exploring ways to improve ARM’s rate limiting behavior to better avoid positive feedback loops. (Estimated completion: December 2023)
- We are implementing additional automated Service Health messaging, to communicate more quickly with customers when an issue on front-end nodes occurs. (Estimated completion: December 2023)
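For the lock-free feature flag cache listed above, one common approach is a copy-on-write snapshot published through an atomic pointer, so the read path never blocks. The sketch below assumes that approach; the lockFreeFlagCache type and its methods are hypothetical and do not describe ARM's eventual implementation.

```go
// Illustrative sketch only: one way to serve feature flags without locking
// on the read path, using an immutable snapshot map swapped in atomically
// (copy-on-write). lockFreeFlagCache is hypothetical.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type parsedFlag struct{ enabled bool }

type lockFreeFlagCache struct {
	snapshot atomic.Value // holds a map[string]parsedFlag, replaced wholesale
	writeMu  sync.Mutex   // serializes writers only; readers never block
}

func newLockFreeFlagCache() *lockFreeFlagCache {
	c := &lockFreeFlagCache{}
	c.snapshot.Store(map[string]parsedFlag{})
	return c
}

// Get takes no lock: it reads whatever immutable snapshot is current.
func (c *lockFreeFlagCache) Get(name string) (parsedFlag, bool) {
	m := c.snapshot.Load().(map[string]parsedFlag)
	f, ok := m[name]
	return f, ok
}

// Put copies the current snapshot, adds the entry, and publishes the new map
// atomically. Writes (cache refreshes) are rare, so the copy cost is acceptable.
func (c *lockFreeFlagCache) Put(name string, f parsedFlag) {
	c.writeMu.Lock()
	defer c.writeMu.Unlock()
	old := c.snapshot.Load().(map[string]parsedFlag)
	next := make(map[string]parsedFlag, len(old)+1)
	for k, v := range old {
		next[k] = v
	}
	next[name] = f
	c.snapshot.Store(next)
}

func main() {
	cache := newLockFreeFlagCache()
	cache.Put("regional-expansion-flag", parsedFlag{enabled: true})
	f, ok := cache.Get("regional-expansion-flag")
	fmt.Println(f.enabled, ok)
}
```

The trade-off is that writers pay a full map copy on each update, which is acceptable only when reads vastly outnumber writes, as is typical for feature flag lookups.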
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://www.aka.ms/AzPIR/RNQ2-NC8
6 March 2023
What happened?
Between 03:50 UTC and 17:55 UTC on 6 March 2023, a subset of customers using Azure Storage may have experienced greater than expected throttling when performing requests against Storage resources located in the West Europe region. Azure services dependent on Azure Storage may also have experienced intermittent failures and degraded performance due to this issue. These included Azure Automation, Azure Arc enabled Kubernetes, Azure Bastion, Azure Batch, Azure Container Apps, Azure Data Factory (ADF), Azure ExpressRoute \ ExpressRoute Gateways, Azure HDInsight, Azure Key Vault (AKV), Azure Logic Apps, Azure Monitor, and Azure Synapse Analytics.
What went wrong and why?
Azure Storage employs a throttling mechanism that ensures storage account usage remains within the published storage account limits (for more details, refer to https://learn.microsoft.com/azure/storage/common/scalability-targets-standard-account). This throttling mechanism is also used to protect a storage service scale unit from exceeding the overall resources available to the scale unit. In the event that a scale unit reaches such limits, the scale unit employs a safety protection mechanism to throttle accounts that are deemed to contribute to the overload, while also balancing load across other scale units.
Configuration settings are used to control how the throttling services monitor usage and apply throttling. A new configuration had been rolled out to improve the throttling algorithm. While this configuration change followed the usual Safe Deployment Practices (for more details, refer to: https://learn.microsoft.com/en-us/devops/operate/safe-deployment-practices), this issue was found mid-way through the deployment. The change had adverse effects under the specific load and workload characteristics of a small set of scale units, where the mechanism unexpectedly throttled some storage accounts in an attempt to bring the scale unit back to a healthy state.
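The PIR does not describe the throttling algorithm itself, so as a loose illustration only, the sketch below shows how a per-account limit and a scale-unit level safety limit might both be enforced with token buckets. Every name and number in it (tokenBucket, the rates, the burst sizes) is hypothetical and is not Azure Storage's actual mechanism.

```go
// Loose illustration only: enforcing a per-account limit and a scale-unit
// safety limit with token buckets. All names, rates, and burst sizes are
// hypothetical.
package main

import (
	"fmt"
	"time"
)

// tokenBucket allows roughly `rate` operations per second with a bounded burst.
type tokenBucket struct {
	tokens   float64
	capacity float64
	rate     float64 // refill rate in tokens per second
	last     time.Time
}

func newTokenBucket(rate, burst float64) *tokenBucket {
	return &tokenBucket{tokens: burst, capacity: burst, rate: rate, last: time.Now()}
}

// allow refills the bucket for the elapsed time and spends one token if one
// is available; false means the request should be throttled (e.g. HTTP 503).
func (b *tokenBucket) allow(now time.Time) bool {
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	b.last = now
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	perAccount := newTokenBucket(20000, 2000)  // hypothetical account limit
	scaleUnit := newTokenBucket(500000, 50000) // hypothetical scale-unit safety limit

	now := time.Now()
	throttled := 0
	for i := 0; i < 3000; i++ { // a burst well above the per-account burst budget
		if !(perAccount.allow(now) && scaleUnit.allow(now)) {
			throttled++
		}
	}
	fmt.Println("requests throttled in burst:", throttled)
}
```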
How did we respond?
Automated monitoring alerts were triggered, and engineers were immediately engaged to assess the issues that the service-specific alerts reported. This section provides a timeline of how we responded. The Batch team started investigating these specific alerts at 04:08 UTC. This investigation showed that there were storage request failures and service availability impact. At 04:40 UTC, Storage engineers were engaged to begin investigating the cause. While Storage was engaged and investigating with Batch, at 05:30 UTC automated monitoring alerts were triggered for ADF. The ADF investigation showed there was an issue with underlying Batch accounts. Batch confirmed with ADF that they were impacted by storage failures and were working with Storage engineers to mitigate. At this time, Storage engineers diagnosed that one scale unit was operating above normal parameters and identified that Batch storage accounts were being throttled due to tenant limits; in response, engineers load-balanced traffic across different scale units. By 06:34 UTC, Storage engineers had started migrating Batch storage accounts in an effort to mitigate the ongoing issues.
At 07:15 UTC, automation detected an issue with AKV requests. The AKV investigation showed there was an issue with the underlying storage accounts. Around 09:10 UTC, engineers performed a failover that mitigated the issue for all existing AKV read operations; however, create, read (for new requests), update, and delete operations for AKV were still impacted. Around 10:00 UTC, as the scope of the issue expanded to additional scale units, Storage engineers correlated the downstream service impact with the configuration rollout. By 10:15 UTC, Storage engineers had begun reverting the configuration change on select impacted scale units. The Batch storage account migration finished around 11:22 UTC, and the Batch service became healthy at 11:35 UTC. ADF began to recover once the Batch mitigation was completed, and was fully recovered around 12:51 UTC after the backlog of accumulated tasks had been processed.
By 16:34 UTC, the impacted resources and services had been mitigated. Shortly thereafter, engineers scheduled a rollback of the configuration change on all scale units (including those that were not impacted) and declared the incident mitigated.
How are we making incidents like this less likely or less impactful?
- We are tuning our monitoring to anticipate and quickly detect when a storage scale unit might engage its scale-unit protection algorithm and apply throttling to storage accounts. These monitors will proactively alert engineers to take the necessary actions (Estimated completion: March 2023).
- Moving forward, we will roll out throttling improvements in tracking mode first, to assess their impact before they are enabled, since throttling improvements may react differently to different workload types and load levels on a particular scale unit; a tracking-mode sketch follows this list (Estimated completion: March 2023).
The above improvement areas will help to prevent/detect storage-related issues across first-party services that are reliant on Azure storage accounts - for example, Batch and Data Factory.
- Our AKV team is working on improvements to the current distribution of storage accounts across multiple scale units, and on updating the storage implementation to ensure that read and write availability can be decoupled when such incidents happen (Estimated completion: May 2023).
- We continue to expand our AIOps detection system to provide better downstream impact detection and correlation - to notify customers more quickly, and to identify/mitigate impact more quickly (Ongoing).
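As a rough illustration of the tracking-mode rollout mentioned above, the sketch below evaluates a proposed throttling decision alongside the current one but only logs disagreements, while enforcing the existing behavior. The currentDecision and proposedDecision functions and their thresholds are invented for this example and do not reflect Azure Storage's real algorithms.

```go
// Illustrative sketch only: rolling out a throttling change in "tracking
// mode", where the proposed decision is computed and logged but the current
// behavior is what gets enforced. All names and thresholds are invented.
package main

import (
	"fmt"
	"log"
)

type request struct {
	account string
	opsPerS float64 // observed request rate for the account
}

// Stand-ins for the existing and proposed throttling algorithms.
func currentDecision(r request) bool  { return r.opsPerS > 20000 }
func proposedDecision(r request) bool { return r.opsPerS > 15000 }

// shouldThrottle always enforces the current algorithm; in tracking mode it
// additionally records where the proposed algorithm would have disagreed.
func shouldThrottle(r request, trackingMode bool) bool {
	enforced := currentDecision(r)
	if trackingMode {
		if proposed := proposedDecision(r); proposed != enforced {
			log.Printf("tracking: decisions differ for %s (proposed=%v, enforced=%v)",
				r.account, proposed, enforced)
		}
	}
	return enforced
}

func main() {
	reqs := []request{{"acct-a", 12000}, {"acct-b", 18000}, {"acct-c", 25000}}
	for _, r := range reqs {
		fmt.Println(r.account, "throttled:", shouldThrottle(r, true))
	}
}
```

Running the new logic in this shadow fashion lets engineers compare its decisions against real production traffic before any customer-facing behavior changes.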
How can customers make incidents like this less impactful?
- To get the best performance from Azure Storage, including when throttled, consider following the Performance and Scalability Checklist for Blob Storage: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-performance-checklist
- Consider which Storage redundancy options are right for your critical applications. Geo-redundant storage (GRS) enables account-level failover in case the primary region endpoint becomes unavailable, as in this incident: https://docs.microsoft.com/azure/storage/common/storage-redundancy
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://learn.microsoft.com/en-us/azure/architecture/framework/resiliency/
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/R_36-P80