February 2026
This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within 14 days), we will publish a Final PIR with additional details.
What happened?
Between 19:46 UTC on 02 February 2026 and approximately 00:30 UTC on 03 February 2026, a platform issue resulted in degraded performance and failures for multiple Azure services in multiple regions. Impacted services included:
- Azure Virtual Machines (VMs) – Customers may have experienced failures when deploying or scaling virtual machines, including errors during provisioning and lifecycle operations.
- Azure Virtual Machine Scale Sets (VMSS) – Customers may have experienced failures when scaling instances or applying configuration changes.
- Azure Kubernetes Service (AKS) – Customers may have experienced failures in node provisioning and extension installation.
- Azure DevOps (ADO) and GitHub Actions – Customers may have experienced pipeline failures when tasks required VM extensions or related packages. Actions jobs queued and timed out while waiting to acquire a hosted runner. Other GitHub features that leverage this compute infrastructure were similarly impacted, including Copilot Coding Agent, Copilot Code Review, CodeQL, Dependabot, GitHub Enterprise Importer, and Pages. All regions and runner types were impacted.
- Other dependent services – Customers may have experienced degraded performance or failures in operations that required downloading extension packages from Microsoft-managed storage accounts, including Azure Arc-enabled servers and Azure Database for PostgreSQL.
In a separate but related issue, between 00:08 UTC and 06:05 UTC on 03 February 2026, a platform issue with the 'Managed identities for Azure resources' service (formerly Managed Service Identity) impacted customers trying to create, update, or delete Azure resources, or to acquire managed identity tokens – specifically in the East US and West US regions. Originally communicated under incident tracking ID _M5B-9RZ, this impacted the creation and management of Azure resources with assigned managed identities – including but not limited to Azure AI Video Indexer, Azure Chaos Studio, Azure Container Apps, Azure Database for PostgreSQL Flexible Servers, Azure Databricks, Azure Kubernetes Service, Azure Stream Analytics, Azure Synapse Analytics, and Microsoft Copilot Studio. Newly created resources of these types may have also failed to authenticate using managed identity.
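For context, workloads on these resources acquire tokens through the platform's managed identity endpoint. The following is a minimal sketch, using the public azure-identity Python SDK rather than any internal service path, of the kind of token acquisition that could fail or time out during this window:

```python
from azure.identity import ManagedIdentityCredential

# Runs on an Azure resource with a managed identity assigned. During the
# incident window, requests like this in East US and West US could fail
# or time out while the managed identity service was degraded.
credential = ManagedIdentityCredential()
token = credential.get_token("https://management.azure.com/.default")
print(token.expires_on)  # epoch seconds at which the token expires
```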
What went wrong, and why?
On 02 February 2026, a recurring job performing storage account access remediation, intended to disable anonymous access, was unintentionally applied to a subset of Microsoft-managed storage accounts that require this access for normal operation. A data synchronization problem in the job's targeting logic caused the access remediation to be applied to these accounts, many of which host VM extension packages. Extensions are small applications that provide post-deployment configuration and automation on Azure VMs – covering VM configuration, monitoring, security, and utility scenarios. Publishers wrap an application as an extension to simplify its installation. As a result, VMs that rely on these extensions during resource provisioning or scaling were unable to access the required artifacts, leading to the service impact described above.
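For illustration, the sketch below shows what disabling anonymous access looks like through the public Azure management SDK for Python. The function, its parameters, and the exclusion check are hypothetical stand-ins for the internal job – the point being that if the exclusion data is stale, accounts that legitimately require anonymous access get remediated anyway:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountUpdateParameters

def remediate_anonymous_access(subscription_id, accounts, exclusions):
    """Disable anonymous (public) blob access on each storage account,
    skipping accounts known to require it.

    `accounts` is a list of (resource_group, account_name) pairs and
    `exclusions` is a set of account names; both are hypothetical inputs.
    In this incident, the internal equivalent of `exclusions` was built
    from stale synchronized data, so accounts hosting VM extension
    packages were not skipped.
    """
    client = StorageManagementClient(DefaultAzureCredential(), subscription_id)
    for resource_group, name in accounts:
        if name in exclusions:
            continue  # legitimately requires anonymous access
        client.storage_accounts.update(
            resource_group,
            name,
            StorageAccountUpdateParameters(allow_blob_public_access=False),
        )
```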
Although the remediation process was initially deployed on a region-by-region basis in line with our safe deployment practices, the data synchronization problem resulted in rapid, unintended remediation across all regions of the public cloud within a short period.
Rollback to re-enable access to these storage accounts was validated in a test region and, in parallel, the job was disabled to prevent any additional remediation activity. Re-enabling access was then executed and validated on the impacted storage accounts on a per-region basis and, as regions were completed, health signals gradually improved.
As recovery progressed through impacted regions, the backlogged queue of service management operations was generally cleared without additional impact. However, in two regions, East US and West US, a sudden, large traffic spike overwhelmed a platform service critical to managed identities. The impact started in East US, but automatic routing measures retried failed requests against West US too aggressively, making it the second impacted region.
Once the failures began, many upstream infrastructure components retried aggressively, continuing to exceed managed identity backend service limits. While we were able to scale out the managed identity infrastructure, the new resources quickly became overwhelmed as well. Because scaling out did not mitigate the impact on its own, our teams worked to shed load before it reached the backend service, which allowed the service to recover fully. Gradually reintroducing load then allowed both East US and West US to work through queued resources while handling normal workload.
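The retry amplification described above is a general failure pattern. As a generic sketch (not our internal client code), capping attempts and adding jittered exponential backoff is what prevents well-intentioned retries from re-overwhelming a recovering backend:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a retriable failure, e.g. HTTP 429 or 503."""

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry with a hard attempt cap, exponential backoff, and full jitter.

    Immediate, uncapped retries from many callers multiply load on a
    degraded backend; capping attempts and randomizing delays lets a
    recovering service drain its backlog instead of being re-overwhelmed.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the failure
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter
```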
How did we respond?
- 19:46 UTC on 02 February 2026 – Customers began experiencing issues in multiple regions while attempting to complete service management operations.
- 19:55 UTC on 02 February 2026 – Service monitoring detected failure rates exceeding alerting thresholds.
- 20:10 UTC on 02 February 2026 – Engineering teams began collaborating to devise a mitigation and investigate the underlying cause.
- 21:15 UTC on 02 February 2026 – We applied the proposed mitigation to a test instance and validated that it was successful. In parallel, teams worked to identify and disable the remediation job, so it would not affect additional storage accounts.
- 21:18 UTC on 02 February 2026 – We identified and disabled the remediation job and stopped any ongoing activity.
- 21:50 UTC on 02 February 2026 – We began applying the broader mitigation to impacted storage accounts. Customers saw continuing improvement as this work progressed.
- 00:07 UTC on 03 February 2026 – Access re-enablement completed in East US for the storage accounts hosting several high-volume VM extensions that use managed identity.
- 00:08 UTC on 03 February 2026 – Unexpectedly, customer impact increased, first in East US and then cascading to West US through retries, as a critical managed identity service degraded under recovery load.
- 00:14 UTC on 03 February 2026 – Automated alerting identified availability impact to Managed Identity services in East US. Engineers quickly recognized that the service was overloaded and began to scale out.
- 00:30 UTC on 03 February 2026 – All extension-hosting storage accounts had been re-enabled, mitigating this impact in all regions other than East US and West US.
- 00:50 UTC on 03 February 2026 – The first set of managed identity service infrastructure scale out completed, but the new resources were still unable to handle the traffic volume due to the increasing backlog of retried requests.
- 02:00 UTC on 03 February 2026 – A second, larger managed identity service infrastructure scale-out completed. Once again, the added capacity was unable to handle the volume of backlogs and retries.
- 02:15 UTC on 03 February 2026 – We reviewed additional data and monitored downstream services to ensure that all mitigations were in place for all impacted storage accounts.
- 03:55 UTC on 03 February 2026 – To recover infrastructure capacity, we began rolling out a change to remove all traffic from the service so that the infrastructure could be repaired without load.
- 04:25 UTC on 03 February 2026 – After infrastructure nodes recovered, we began gradually ramping traffic to the infrastructure, allowing backlogged identity operations to begin to process safely.
- 06:05 UTC on 03 February 2026 – Backlogged operations completed, and services returned to normal operating levels. We concluded our monitoring and confirmed that all customer impact had been mitigated.
How are we making incidents like this less likely or less impactful?
- We are fixing the data synchronization problem and related code defects, and adding validation coverage to protect against similar scenarios. (Estimated completion: February 2026)
- We are strengthening traffic controls and throttling so that unexpected load spikes or retry storms are contained early and cannot overwhelm the service or impact other customers – see the admission-control sketch after this list. (Estimated completion: February 2026)
- We are increasing capacity headroom and tightening scaling safeguards to ensure sufficient resources remain available during recovery events and sudden demand. (Estimated completion: April 2026)
- We will exercise and optimize rollback of remediation jobs, reducing impact and time to mitigate. (Estimated completion: April 2026)
- We are improving retry behavior and internal efficiency, so the service degrades gracefully under stress and issues are detected and mitigated earlier. (Estimated completion: June 2026)
- We are improving regional isolation and failover behavior to prevent issues in one region from cascading into paired regions during failures or recovery. (Estimated completion: July 2026)
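As referenced in the throttling item above, the following is a generic, illustrative sketch of token-bucket admission control – not our internal implementation. Rejecting excess traffic at the front door, rather than queuing it against a degraded backend, is what keeps a retry storm from compounding:

```python
import threading
import time

class TokenBucket:
    """Token-bucket admission control: traffic beyond the sustained rate
    (plus a bounded burst) is rejected early rather than queued against
    the backend service."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill tokens for the elapsed interval, capped at burst size.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

# Usage: shed excess requests early instead of letting them queue.
bucket = TokenBucket(rate_per_sec=1000, burst=2000)
if not bucket.try_acquire():
    pass  # e.g., return HTTP 429 with a Retry-After hint
```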
How can customers make incidents like this less impactful?
- There was nothing that customers could have done to avoid or minimize impact from this specific service incident.
- Note that the impact times above represent the full incident duration, and are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact, see https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/FNJ8-VQZ