12 July 2024
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/FTVR-L9Z
What happened?
A platform issue caused some of our 'Managed Identities for Azure resources' (formerly Managed Service Identity or MSI) customers to experience failures when requesting managed identity tokens or performing management operations for managed identities in the Australia East region, between 00:55 UTC and 06:28 UTC on 12 July 2024. All identities in this region were subject to impact.
For managed identities for Azure Virtual Machines (VMs) and Virtual Machine Scale Sets (VMSS) in this region, around 3% of token requests failed during this time. Established virtual machines with steady token activity in the days and hours prior to the incident were generally not impacted. For management operations, around 50% of calls via Azure Resource Manager (ARM) or the Azure Portal to perform operations such as creating, deleting, assigning, or updating identities failed during this time.
Creating any Azure resource with a system-assigned managed identity would also have failed, since provisioning such a resource requires creating a managed identity. Identity management and managed identity token issuance for other Azure resource types - including Azure Service Fabric, Azure Stack HCI, Azure Databricks, and Azure Kubernetes Service - were impacted to varying degrees, depending on the specific scenario and resource. Generally, across all Azure compute resource types, management operations were heavily impacted. For token requests, existing resources continued to function during the incident, while new or newly scaled resources saw failures.
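For context on the token requests described above, the sketch below shows what a managed identity token request from inside a VM typically looks like, using the documented Azure Instance Metadata Service (IMDS) endpoint. This is an illustration only - production code would normally use an Azure SDK credential type such as ManagedIdentityCredential from the azure-identity Python library, which wraps this call and caches tokens, and the resource URI shown is just an example.

    import requests

    # Documented IMDS endpoint that Azure VMs use to obtain managed identity tokens.
    IMDS_TOKEN_URL = "http://169.254.169.254/metadata/identity/oauth2/token"

    def get_managed_identity_token(resource: str = "https://management.azure.com/") -> str:
        """Request an access token for this VM's managed identity from IMDS."""
        response = requests.get(
            IMDS_TOKEN_URL,
            params={"api-version": "2018-02-01", "resource": resource},
            headers={"Metadata": "true"},  # IMDS rejects requests without this header
            timeout=10,
        )
        response.raise_for_status()  # data-plane calls like this were the ones failing for ~3% of VM requests
        return response.json()["access_token"]

Note that token caching in the SDK credentials can reduce the impact of transient issuance failures for workloads with steady token activity.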
What went wrong and why?
At Microsoft, we are working to remove manual key management and rotation from the operation of our internal services. For most of our infrastructure and for our customers, we recommend migrating authentication between Azure resources to use Managed identities for Azure resources. For the Managed identities for Azure resources infrastructure itself, though, we cannot use Managed identities for Azure resources, due to a risk of circular dependency. Therefore, Managed identities for Azure resources uses key-based access to authenticate to its downstream Azure dependencies that store identity metadata.
To improve resilience, we have been moving Managed identities for Azure resources infrastructure to an internal system that provides automatic key rotation and management. The process for rolling out this change was to deploy a service change to support automatic keys, and then roll over our keys into the new system in each region one-by-one, following our safe deployment process. This process was tested and successfully applied in several prior regions.
In the Australia East and Australia Southeast regions, however, the required deployment for key automation did not complete successfully - because these regions had previously been pinned to an older build of the service, for unrelated reasons. Our internal deployment tooling reported a successful deployment and did not clearly show that these regions were still pinned to the older version. Believing the deployment to have completed, a service engineer initiated the switch to automatic keys for the Australia East storage infrastructure, immediately causing the key to roll over. Because neither Australia East nor its failover pair Australia Southeast was running the correct service version, both continued to try to use the old key when accessing storage for Australia East identities, which failed.
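As a generic illustration of this failure mode (not Microsoft's internal implementation - all names below are hypothetical), the sketch shows a client that caches a storage access key but re-reads it from its key-management source when the key is rejected. The instances pinned to the older build had no equivalent path to pick up the automatically managed key, so they kept presenting the old, now-invalidated key until a configuration change forced them onto the managed key.

    # Hypothetical, generic sketch - not Microsoft's internal implementation.

    class AuthError(Exception):
        """The downstream store rejected the presented key."""

    class KeyManager:
        """Stands in for an automatic key management system."""
        def __init__(self, key: str):
            self._current = key

        def current_key(self) -> str:
            return self._current

        def roll_over(self, new_key: str) -> None:
            self._current = new_key  # the old key is invalid from this point on

    class Store:
        """Stands in for a downstream storage dependency that validates keys."""
        def __init__(self, key_manager: KeyManager):
            self._km = key_manager
            self._data = {"identity/metadata": b"..."}

        def read(self, path: str, key: str) -> bytes:
            if key != self._km.current_key():
                raise AuthError("invalid key")
            return self._data[path]

    class RefreshingClient:
        """Caches the key, but refreshes it from the key manager on an auth failure."""
        def __init__(self, store: Store, key_manager: KeyManager):
            self._store, self._km = store, key_manager
            self._key = key_manager.current_key()

        def read(self, path: str) -> bytes:
            try:
                return self._store.read(path, self._key)
            except AuthError:
                self._key = self._km.current_key()  # pick up the rolled-over key
                return self._store.read(path, self._key)

    km = KeyManager("old-key")
    client = RefreshingClient(Store(km), km)
    km.roll_over("new-key")                             # the rollover invalidates the old key
    assert client.read("identity/metadata") == b"..."   # refresh-and-retry succeeds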
How did we respond?
This incident was detected within three minutes by internal monitoring of both the Managed Identity service and other dependent services in Azure. Our engineers immediately knew that the cause was related to the key rollover operation that had just been performed. Since managed identities are built across Availability Zones and failover pairs, our standard mitigation playbook is to perform a failover to a different zone or region. However, in this case, all zones in Australia East were impacted - and requests to the failover pair, Australia Southeast, were also failing.
We pursued several investigation and mitigation threads in parallel. Our engineers initially believed that the issue was that the key rollover had somehow failed or been applied incorrectly, and we attempted several mitigation steps to force the new key to be picked up or to force the downstream resources to accept the new key. When these proved ineffective, we discovered that the issue was not with the new key, but rather that our instances were still attempting to use the old key - and that this key had been invalidated by the rollover. We launched more parallel workstreams to understand why this would be the case and how to mitigate, since we still believed that the deployment had succeeded. Once we realized that these regions were in fact pinned to an older build of the service, we authored and deployed a configuration change to force this build to use the managed key, which resolved the issue. Full mitigation occurred at 06:28 UTC on 12 July 2024.
How are we making incidents like this less likely or less impactful?
- We temporarily froze all changes to the Managed identities for Azure resources service. (Completed)
- We ensured that all Azure regions with Managed Identity that could be subject to automatically managed keys are running a service version and configuration that supports those keys. (Completed)
- We are working to improve our deployment process to ensure that pinned build versions are detected during a deployment, and that a deployment to a pinned region is not reported as successful. (Estimated completion: July 2024)
- Prior to rolling out automatic key management again, we will improve our service telemetry to enable us to quickly see whether a manually or automatically managed key is used, as well as the identifier of the key. (Estimated completion: July 2024)
- Prior to rolling out automatic key management again, we will update our internal process to verify that all service instances are running the correct version and using the expected key. (Estimated completion: July 2024)
- For the future move to automatic key management, we will schedule the changes to occur in off-peak business hours in each Azure region. (Estimated completion: July 2024)
- Finally, we are working to improve the central documentation for automatic key management to ensure the above best practices are followed by all Azure services using this capability. (Estimated completion: July 2024)
How can customers make incidents like this less impactful?
- In Azure, our safe deployment practices consider both availability zones and Azure regions. We encourage all customers to partition their capacity across regions and zones, and to build and exercise a disaster recovery strategy to defend against issues that may occur in a single region. During this incident, customers who were able to fail out of Australia East would have been able to mitigate their impact (an illustrative client-side failover sketch follows this list). To learn more about a multi-region geodiversity strategy for mission-critical workloads, see https://learn.microsoft.com/training/modules/design-a-geographically-distributed-application and https://learn.microsoft.com/azure/architecture/patterns/geodes. Finally, this page provides more information on disaster recovery strategies: https://learn.microsoft.com/azure/well-architected/reliability/disaster-recovery
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
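As referenced above, the following is a minimal, hypothetical sketch of client-side regional failover: the application prefers its primary regional deployment and falls back to a secondary region when the primary is unhealthy. The endpoint URLs are placeholders, and a production design would also involve data replication, traffic management (for example, Azure Front Door or Traffic Manager), and regularly exercised failover runbooks.

    import requests

    # Placeholder endpoints for a workload deployed to two regions - replace with your own.
    REGIONAL_ENDPOINTS = [
        "https://myapp-australiaeast.example.com",    # primary region
        "https://myapp-secondaryregion.example.com",  # secondary (failover) region
    ]

    def call_with_regional_failover(path: str, timeout: float = 5.0) -> requests.Response:
        """Try each regional deployment in order and return the first healthy response."""
        last_error: Exception | None = None
        for endpoint in REGIONAL_ENDPOINTS:
            try:
                response = requests.get(f"{endpoint}{path}", timeout=timeout)
                if response.status_code < 500:
                    return response
                last_error = RuntimeError(f"{endpoint} returned {response.status_code}")
            except requests.RequestException as exc:
                last_error = exc  # network failure or timeout; try the next region
        raise RuntimeError("all regional endpoints failed") from last_error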
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/FTVR-L9Z