July 2024

18

What happened?

Between 21:56 UTC on 18 July 2024 and 12:15 UTC on 19 July 2024, customers may have experienced issues with multiple Azure services in the Central US region, including failures of service management operations and degraded connectivity or availability of services. A storage incident impacted the availability of Virtual Machines, which may also have restarted unexpectedly. Services with dependencies on the impacted virtual machines and storage resources would have experienced impact.


What do we know so far?

We determined that a backend cluster management workflow deployed a configuration change that blocked backend access between a subset of Azure Storage clusters and compute resources in the Central US region. This resulted in the compute resources automatically restarting when connectivity was lost to virtual disks hosted on the impacted storage resources.
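
The exact workflow details have not yet been published at this stage of the report, but the failure mode described above, an access-control change that severed connectivity between storage clusters and the compute resources that depend on them, is the kind of change that can be gated on a pre-deployment dependency check. The sketch below is purely illustrative; the `AccessRule` type, `validate_config` function, and cluster names are our own, not Azure's internal tooling.

```python
# Hypothetical illustration, not the actual Azure workflow: before applying an
# access-control configuration to a storage cluster, verify that every compute
# source that depends on it would still be allowed by the new rules.
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRule:
    source: str      # e.g. a compute cluster or address range (illustrative)
    allowed: bool


def validate_config(new_rules: list[AccessRule], required_sources: set[str]) -> list[str]:
    """Return the required compute sources that the proposed config would block."""
    allowed = {rule.source for rule in new_rules if rule.allowed}
    return sorted(required_sources - allowed)


if __name__ == "__main__":
    proposed = [AccessRule("compute-cluster-a", True), AccessRule("compute-cluster-b", False)]
    required = {"compute-cluster-a", "compute-cluster-b"}
    blocked = validate_config(proposed, required)
    if blocked:
        # Refuse to deploy before any storage cluster loses connectivity to compute.
        raise SystemExit(f"Refusing to deploy: configuration would block {blocked}")
```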


How did we respond?

  • 21:56 UTC on 18 July 2024 – Customer impact began
  • 22:13 UTC on 18 July 2024 – Storage team started investigating
  • 22:41 UTC on 18 July 2024 – Additional teams engaged to assist investigations
  • 23:27 UTC on 18 July 2024 – All deployments in Central US stopped
  • 23:35 UTC on 18 July 2024 – All deployments paused for all regions
  • 00:45 UTC on 19 July 2024 – A configuration change was confirmed as the underlying cause
  • 01:10 UTC on 19 July 2024 – Mitigation started
  • 01:30 UTC on 19 July 2024 – Customers started seeing signs of recovery
  • 02:51 UTC on 19 July 2024 – 99% of all impacted compute resources recovered
  • 03:23 UTC on 19 July 2024 – All Azure Storage clusters confirmed recovery
  • 03:41 UTC on 19 July 2024 – Mitigation confirmed for compute resources
  • Between 03:41 and 12:15 UTC on 19 July 2024 – Services which were impacted by this outage recovered progressively and engineers from the respective teams intervened where further manual recovery was needed. Following an extended monitoring period, we determined that impacted services had returned to their expected availability levels.


What happens next? 

  • Our team will be completing an internal retrospective to understand the incident in more detail. We will publish a Preliminary Post Incident Review (PIR) within approximately 72 hours, to share more details on what happened and how we responded. After our internal retrospective is completed, generally within 14 days, we will publish a Final Post Incident Review with any additional details and learnings.
  • To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more.
  • For more information on Post Incident Reviews, refer to our Post Incident Review documentation.
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – see our guidance on implementing monitoring to understand granular impact.
  • Finally, consider our broader guidance on preparing for cloud incidents.

13

Note: During this incident, in addition to our standard communications via Azure Service Health and its alerts, we also communicated via the public Azure Status page. This temporary overcommunication, described as 'scenario 3' in our documentation, was to ensure that all affected customers received information. Once our Post Incident Review (PIR) has been communicated to affected customers through Azure Service Health, generally within 14 days, this entry on the Status History page will be removed.

What happened?

Between 00:00 UTC and 20:20 UTC on 13 July 2024, a platform issue resulted in an impact to the Azure OpenAI (AOAI) service across multiple regions. A subset of customers experienced errors when calling Azure OpenAI endpoints, such as 5xx errors when accessing their Azure OpenAI service resources, in the following regions: Australia East, Brazil South, Canada Central, Canada East, East US, East US 2, France Central, Korea Central, North Central US, Norway East, Poland Central, South Africa North, South India, Sweden Central, UK South, and West Europe.

Between 01:48 UTC on 13 July 2024 and 16:54 UTC on 15 July 2024, new fine-tuned model deployments were unavailable in a subset of regions, and deployment deletions may not have been effective. Impacted regions: East US 2, North Central US, Sweden Central, and Switzerland West.

There was no data loss for Retrieval-Augmented Generation (RAG) or fine-tuned models due to this event.

What do we know so far?

The Azure OpenAI service has an automation system that is implemented regionally but uses a global configuration to manage the lifecycle of certain backend resources. A change was made to update this configuration to delete unused resources in an AOAI internal subscription. This subscription had a quota on the number of storage accounts, and the unused accounts were intended to be cleaned up to relieve that quota pressure. However, the targeted resource group also contained other resources, such as backend Managed Instance Resource endpoints used for the deployment, management, and use of OpenAI models. Additionally, the resource group and these critical backend resources were not correctly tagged to exclude them from the specified cleanup operations. When the automation ran, these critical backend resources were unintentionally deleted, causing AOAI online endpoints to go offline and become unable to serve customer requests.
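
As a rough sketch of the guard rails involved (the tag name, resource types, and function names below are hypothetical, not the actual AOAI automation), a cleanup pass of this kind can both honour an exclusion tag and refuse to run when protected resource types appear in its scope:

```python
# Hypothetical sketch of a tag-guarded cleanup pass; the tag name, resource types,
# and identifiers are illustrative, not the actual AOAI automation.
from dataclasses import dataclass, field

EXCLUDE_TAG = "do-not-clean"          # assumption: a tag the automation honours
PROTECTED_TYPES = {"onlineEndpoint"}  # assumption: backend types that must never be auto-deleted


@dataclass
class Resource:
    name: str
    resource_type: str
    tags: dict[str, str] = field(default_factory=dict)


def plan_cleanup(resources: list[Resource]) -> list[Resource]:
    """Return only resources that are safe to delete; abort if the scope looks wrong."""
    in_scope = [r for r in resources if EXCLUDE_TAG not in r.tags]
    protected = [r for r in in_scope if r.resource_type in PROTECTED_TYPES]
    if protected:
        # The resource group contains more than disposable storage accounts:
        # stop rather than silently widening the blast radius.
        raise RuntimeError(f"Protected resources in cleanup scope: {[r.name for r in protected]}")
    return in_scope


if __name__ == "__main__":
    resource_group = [
        Resource("unused-storage-001", "storageAccount"),
        Resource("gpt4-endpoint-eastus2", "onlineEndpoint"),  # untagged, as in the incident
    ]
    try:
        print(plan_cleanup(resource_group))
    except RuntimeError as err:
        print(f"Cleanup aborted: {err}")  # the endpoint is still protected by its type
```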

How did we respond?

  • 00:00 UTC on 13 July 2024 – Cleanup operations began across regions. Initial customer impact started, gradually increasing as more regions became unhealthy.
  • 00:05 UTC on 13 July 2024 – Service monitoring detected failure rates going above defined thresholds, alerting engineering to investigate.
  • 00:30 UTC on 13 July 2024 – We determined there was a multi-region issue unfolding by correlating alerts, service metrics, and telemetry dashboards. The deletions took 30-40 minutes to complete, so endpoints and regions did not all go offline at the same time.
  • 00:45 UTC on 13 July 2024 – We identified that backend endpoints were being deleted, but did not yet know how or why; two possibilities were being investigated.
  • 00:52 UTC on 13 July 2024 – We classified the event as critical, engaging incident response personnel to further investigate, coordinate resources, and drive customer workstreams.
  • 01:00 UTC on 13 July 2024 – We identified the automated cleanup operation deleting resources. With the problem known, we began mitigation efforts, beginning with attempting to stop the cleanup automation.
  • 01:40 UTC on 13 July 2024 – The cleanup job stopped on its own and did not delete everything. However, we continued to work to ensure further deletion was prevented.
  • 02:00 UTC on 13 July 2024 – We added resource locks and stopped delete requests at both the ARM and Automation levels to prevent further deletion.
  • 02:00 UTC on 13 July 2024 - Initial recovery efforts began, recreating the deleted resources. Note regarding recovery time: AOAI models are very large and can only be transferred via secure channels. Within a model pool (there are separate model pools per region, per model), each deployment is restored serially, so model copying time is a significant factor; the model pools themselves are rebuilt in parallel (see the sketch after this timeline).
  • 02:22 UTC on 13 July 2024 – First communication posted to the Azure Status page. Communications were delayed due to initial difficulties scoping affected customers, and impact analysis gaps in our communications tooling.
  • 03:35 UTC on 13 July 2024 – Customer impact scope determined, and first targeted communications sent to customers via Service Health in the Azure portal.
  • ~04:33 UTC on 13 July 2024 – Service partially restored in North Central US and East US 2.
  • ~08:27 UTC on 13 July 2024 – Service partially restored in all regions. A subset of models and deployment types, varying across affected regions, were in recovery.
  • ~10:19 UTC on 13 July 2024 – Majority of services recovered.
  • ~14:39 UTC on 13 July 2024 – GPT-4 recovered in all regions except Canada Central, Sweden Central, North Central US, UK South, Central US, and Australia East. 
  • ~14:39 UTC on 13 July 2024 - GPT-4o recovered in all regions except South India and Sweden Central. 
  • ~14:39 UTC on 13 July 2024 - DALL-E recovered in all regions except Australia East.
  • ~17:29 UTC on 13 July 2024 – GPT-4 in North Central US recovered.
  • ~17:55 UTC on 13 July 2024 – GPT-4 in Central US recovered.
  • ~19:39 UTC on 13 July 2024 – GPT-4 in Canada Central recovered.
  • ~19:39 UTC on 13 July 2024 - DALL-E restored in all regions.
  • ~19:51 UTC on 13 July 2024 – GPT-4 in UK South recovered.
  • 20:20 UTC on 13 July 2024 – All base models recovered, and service restoration completed across all affected regions.
  • 14:00 UTC on 15 July 2024 - Fine-tuned model recovery progressed across various regions.
  • 16:54 UTC on 15 July 2024 - All fine-tuning model deployments restored across affected regions.
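
To make the recovery-time note in the 02:00 UTC entry concrete, the sketch below mirrors the described shape of the rebuild: pools are restored in parallel while deployments within each pool are copied one at a time. Pool and deployment names are illustrative only, not actual AOAI identifiers.

```python
# Minimal sketch of the recovery shape described in the 02:00 UTC entry above:
# model pools are rebuilt in parallel, while deployments inside a pool are copied
# one at a time because each copy moves a very large model over a secure channel.
from concurrent.futures import ThreadPoolExecutor
import time


def copy_model(pool: str, deployment: str) -> None:
    time.sleep(0.1)   # stand-in for the slow, serial model transfer
    print(f"[{pool}] restored {deployment}")


def rebuild_pool(pool: str, deployments: list[str]) -> None:
    for deployment in deployments:      # serial within a pool
        copy_model(pool, deployment)


pools = {
    "eastus2-gpt-4": ["deployment-1", "deployment-2"],
    "northcentralus-gpt-4o": ["deployment-1"],
}

with ThreadPoolExecutor() as executor:  # pools rebuilt in parallel
    futures = [executor.submit(rebuild_pool, pool, deployments)
               for pool, deployments in pools.items()]
    for future in futures:
        future.result()                 # surface any copy failures
```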

How are we making incidents like this less likely or less impactful?

  • We have changed the configuration policy to be regional for all subsequent configuration updates, in accordance with our Safe Deployment Practices (SDP). This will help ensure potentially unhealthy changes are limited to a single region. (Completed)
  • We removed incorrect metadata tags on the resources that are not to be managed by the automation and updated the automation configuration to exclude those resources, thus preventing automation from inadvertently deleting critical resources. (Completed)
  • We have tightened the regional makeup of workloads to fewer regions, to further prevent widespread issues in the event of similar unintentional deletion or comparable scenarios. (Completed)
  • We have enhanced our incident response engagement automation to ensure more efficient involvement of our incident response personnel. (Completed)
  • Our communications tooling will be updated with relevant AOAI data to ensure faster notification for similar scenarios where there are difficulties scoping affected customers. (Estimated completion: TBD)
  • We are investigating additional test coverage and verification procedures for configuration updates. (Estimated completion: TBD)
  • We will increase the parallelization of model pool buildouts to reduce the time it takes for recovery. (Estimated completion: TBD)
  • We will develop additional active monitoring for different model families to assess system health without depending on customer traffic. (Estimated completion: TBD)

12

This is our "Preliminary" PIR that we endeavor to publish within 3 days of incident mitigation, to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a "Final" PIR with additional details/learnings. 

What happened?

You were impacted by a platform issue that caused some of our Managed Identities for Azure resources (formerly MSI) customers to experience failures when requesting managed identity tokens or performing management operations for managed identities in the Australia East region, between 00:55 UTC and 06:28 UTC on 12 July 2024. All identities in this region were subject to impact. That’s why we’re providing you with this Post Incident Review (PIR) – to summarize what went wrong, how we responded, and the steps Microsoft is taking to learn from this and improve.

For management operations, around 50% of calls via Azure Resource Manager (ARM) or the Azure Portal to perform operations such as creating, deleting, assigning, or updating identities failed during this time. For token requests, around 3% of requests for Azure Virtual Machines (VMs) and Virtual Machine Scale Sets (VMSS) in this region failed during this time. Because Managed Identity for Azure resources is built on several layers of resilience for established resources, most failures were for newly provisioned, moved, or scaled up VMs. Identity management and managed identity token issuance for other Azure resource types - including Azure Service Fabric, Azure Stack HCI, Azure Databricks, and Azure Virtual Desktop - were impacted to varying degrees depending on the specific scenario and resource. 
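
The resilience for established resources comes largely from client-side caching and retry of managed identity tokens. As a minimal sketch only, assuming the documented Azure Instance Metadata Service (IMDS) token endpoint (in practice, the azure-identity library's ManagedIdentityCredential handles this for you), a caller might combine backoff with a cached fallback:

```python
# Minimal sketch only: fetching a managed identity token from the Azure Instance
# Metadata Service (IMDS) with retry, backoff, and a cached fallback. In practice,
# use the azure-identity library's ManagedIdentityCredential, which does this for you.
import time
from typing import Optional

import requests

IMDS_URL = "http://169.254.169.254/metadata/identity/oauth2/token"
_cached: Optional[dict] = None   # most recent successful token response


def get_token(resource: str = "https://management.azure.com/", attempts: int = 5) -> dict:
    """Request a token from IMDS, retrying on failure and falling back to a cached token."""
    global _cached
    params = {"api-version": "2018-02-01", "resource": resource}
    for attempt in range(attempts):
        try:
            response = requests.get(IMDS_URL, params=params,
                                    headers={"Metadata": "true"}, timeout=5)
            response.raise_for_status()
            _cached = response.json()
            return _cached
        except requests.RequestException:
            time.sleep(2 ** attempt)     # exponential backoff: 1s, 2s, 4s, ...
    # Established workloads often keep working on a previously issued, still-valid token.
    if _cached and int(_cached["expires_on"]) > time.time():
        return _cached
    raise RuntimeError("Unable to obtain a managed identity token from IMDS")
```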

What went wrong and why?

At Microsoft, we are working to remove manual key management and rotation from the operation of our internal services. For most of our infrastructure and for our customers, we recommend migrating authentication between Azure resources to use Managed identities for Azure resources. For the Managed identities for Azure resources infrastructure itself, though, we cannot use Managed identities for Azure resources, due to a risk of circular dependency. Therefore, Managed identities for Azure resources uses key-based access to authenticate to its downstream Azure dependencies that store identity metadata. 

To improve resilience, we have been moving Managed identities for Azure resources infrastructure to an internal system that provides automatic key rotation and management. The process for rolling out this change was to deploy a service change to support automatic keys, and then roll over our keys into the new system in each region one-by-one, following our safe deployment process. This process was tested and successfully applied in several prior regions. 

In the Australia East and Australia Southeast regions, however, the required deployment for key automation did not complete successfully - because these regions had previously been pinned to an older build of the service, for unrelated reasons. Our internal deployment tooling reported a successful deployment, and did not clearly show that these regions were still pinned to the older version. Believing the deployment to have completed, a service engineer initiated the switch to move to automatic keys for the Australia East storage infrastructure, immediately causing the key to roll over. Because neither Australia East nor its failover pair Australia Southeast were running the correct service version, they continued to try to use the old key, which failed.   
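
The repair items below include verifying service versions before key operations. As a hypothetical pre-flight check (the names and version threshold are ours, not the internal tooling's), the switch to automatic keys could be gated on confirming that every instance in the target region and its failover pair is running a build that supports them, rather than trusting the deployment report alone:

```python
# Hypothetical pre-flight check; names and the version threshold are illustrative.
from dataclasses import dataclass

MIN_BUILD_FOR_AUTO_KEYS = (2024, 7, 1)   # assumed minimum build supporting automatic keys


@dataclass
class Instance:
    name: str
    region: str
    build: tuple[int, int, int]


def safe_to_roll_keys(instances: list[Instance], regions: set[str]) -> bool:
    """Return True only if every instance in the target regions supports automatic keys."""
    stale = [i for i in instances
             if i.region in regions and i.build < MIN_BUILD_FOR_AUTO_KEYS]
    for instance in stale:
        print(f"BLOCKED: {instance.name} in {instance.region} is pinned to build {instance.build}")
    return not stale


fleet = [
    Instance("msi-auseast-01", "australiaeast", (2024, 5, 3)),       # pinned to an older build
    Instance("msi-ausse-01", "australiasoutheast", (2024, 5, 3)),    # failover pair, also pinned
]

# Check the region being rolled over *and* its failover pair before touching any key.
if not safe_to_roll_keys(fleet, {"australiaeast", "australiasoutheast"}):
    raise SystemExit("Key rollover aborted: not all instances support automatic keys")
```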

How did we respond?

This incident was quickly detected by internal monitoring of both the Managed Identity service, and other dependent services in Azure. Our engineers immediately knew that the cause was related to the key rollover operation that had just been performed. Since managed identities are built across availability zones and failover pairs, our standard mitigation playbook is to perform a failover to a different zone or region. However, in this case, all zones in Australia East were impacted - and requests to the failover pair, Australia Southeast, were also failing. 

We pursued several investigation and mitigation threads in parallel. Our engineers initially believed that the issue was that the key rollover had somehow failed or applied incorrectly, and we attempted several mitigation steps to force the new key to be picked up or force the downstream resources to accept the new key. After these were not effective, we discovered that the issue was not with the new key, but rather that our instances were still attempting to use the old key - and that this key had been invalidated by the rollover. We launched more parallel workstreams to understand why this would be the case and how to mitigate, since we still believed that the deployment had succeeded. Once we realized that these regions were in fact pinned to an older build of the service, we authored and deployed a configuration change to force this build to use the managed key, which resolved the issue. Full mitigation occurred at 06:28 UTC on 12 July 2024. 

How are we making incidents like this less likely or less impactful?

  • We temporarily froze all changes to the Managed identities for Azure resources infrastructure. (Completed)
  • We ensured that all Azure regions with Managed Identity that could be subject to automatically managed keys are running a service version and configuration that supports those keys. (Completed)
  • We are working to improve our deployment process to ensure that pinned build versions are detected as part of a deployment and not considered successful. (Estimated completion: July 2024) 
  • Prior to rolling out automatic key management again, we are working to improve our service telemetry to enable us to quickly see whether a manually or automatically managed key is used, as well as the identifier of the key. (Estimated completion: July 2024)
  • Prior to rolling out automatic key management again, we will update our internal process to verify that all service instances are running the correct version and using the expected key. (Estimated completion: July 2024)
  • For the future move to automatic key management, we will schedule the changes to occur in off-peak business hours in each Azure region. (Estimated completion: July 2024)
  • Finally, we are working to improve the central documentation for automatic key management to ensure the above best practices are followed by all Azure services using this capability. (Estimated completion: July 2024) 

How can customers make incidents like this less impactful?

  • Consider ensuring that the right people in your organization will be notified about any future service issues, by configuring Azure Service Health alerts – these can trigger emails, SMS, webhooks, push notifications and more.

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey.

April 2024

22

Watch our 'Azure Incident Retrospective' video about this incident.

What happened?

Between 03:45 CST and 06:00 CST on 23 April 2024, a configuration change performed through a domain registrar resulted in a service disruption to two system domains (chinacloudapp.cn and chinacloudsites.cn) that are critical to our cloud operations in China. This caused the following impact in the Azure China regions:

  • Customers may have had issues connecting to multiple Azure services including Cosmos DB, Azure Virtual Desktop, Azure Databricks, Backup, Site Recovery, Azure IoT Hub, Service Bus, Logic Apps, Data Factory, Azure Kubernetes Services, Azure Policy, Azure AI Speech, Azure Machine Learning, API Management, Azure Container Registry, and Azure Data Explorer.
  • Customers may have had issues viewing or managing resources via the Azure Portal (portal.azure.cn) or via APIs.
  • Azure Monitor offerings, including Log Analytics and Microsoft Sentinel, in the China East 2 and China East 3 regions experienced intermittent data latency and failures to query and retrieve data, which could have resulted in failures of alert activation and/or failures of create, update, retrieve, or delete operations in Log Analytics.

What went wrong and why?

To comply with a regulatory requirement of the Chinese government, we conducted an internal audit to ensure that all our domains had the appropriate ownership and were documented properly. During this process, ownership of two critical system domains for Azure in China was misattributed and, as a result, they were flagged as potential candidates for decommissioning.

The next step of the decommissioning process is a period of monitoring active traffic on a flagged domain before continuing with its decommission. However, the management tool that provides DNS zone and hosting information was not scoped to include zones hosted within Azure China, which caused our system to report that the zone file did not exist. It is common for end-of-life domains that are not in use to not have a zone file, and, as such, the non-existent zone file notice did not raise any alerts with the operator. The workflow then proceeded to the next stage, where the nameservers of these two domains were updated to a set of inactive servers, which is a final check to identify any hidden users or services dependent on the domain.

As DNS caches across the Internet gradually timed out, DNS resolvers made requests to refresh the information for the two domains and received responses containing the inactive nameservers, resulting in failures to resolve FQDNs in those domains. Our health signals detected this degradation in our Azure China Cloud and alerted our engineers. Once we understood the issue, the change was reverted in a matter of minutes. However, the mitigation time was prolonged due to the caching applied by DNS resolvers.

The issue impacted only specific Microsoft-owned domains, and it did not affect the Azure DNS platform availability or DNS services serving any other zone hosted on Azure.
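
One of the repair items below is to validate nameserver information by resolving the zone directly over the Internet. As a minimal sketch of such a check, using the dnspython library (our choice of tool, not necessarily what Microsoft's tooling uses), a decommission workflow could confirm whether a flagged zone is still being served publicly before proceeding:

```python
# Sketch of an external validation: resolve the zone's NS records directly over the
# public Internet before treating a domain as unused. Requires dnspython
# (pip install dnspython); everything except the two affected domains is illustrative.
import dns.resolver


def live_nameservers(domain: str) -> set[str]:
    """Return the nameservers that public DNS currently serves for this domain."""
    try:
        answer = dns.resolver.resolve(domain, "NS")
        return {rr.target.to_text().rstrip(".").lower() for rr in answer}
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return set()


for zone in ("chinacloudapp.cn", "chinacloudsites.cn"):
    servers = live_nameservers(zone)
    if servers:
        # The zone is live on the public Internet: do not proceed with decommissioning.
        print(f"{zone}: ACTIVE, served by {sorted(servers)}")
    else:
        print(f"{zone}: no NS records found; decommission review can continue")
```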

How did we respond?

  • 03:45 CST on 23 April 2024 – Nameserver configuration was updated. Due to existing DNS TTLs (Time To Live), the impact was not immediate.
  • 04:37 CST on 23 April 2024 – Our internal monitors alerted us to degradation in the service and created an incident.
  • 04:39 CST on 23 April 2024 – The incident was acknowledged by our engineering team.
  • 04:57 CST on 23 April 2024 – We determined that the resolution failures coincided with a change in nameservers for the chinacloudapp.cn and chinacloudsites.cn domains.
  • 04:59 CST on 23 April 2024 – We reverted to the previously known-good nameservers.
  • 05:13 CST on 23 April 2024 - The reversion was completed, at which point services began to recover.
  • 06:00 CST on 23 April 2024 - Full recovery was declared after verifying that traffic for the services and affected DNS zones had recovered back to pre-incident levels.

How are we making incidents like this less likely or less impactful?

  • We have suspended any further runs of this domain lifecycle process until updates to the management tool for Azure in China are completed.
  • We will update our validation process for domain lifecycle management to ensure all cloud region signals are incorporated. (Estimated completion: June 2024)
  • We will implement additional validations to obtain nameserver information by directly resolving the zone over the Internet. (Estimated completion: July 2024)

How can our customers and partners make incidents like this less impactful?

  • As this issue impacted two domains used in operating the management plane of the Azure China Cloud and naming Azure services offered in the Azure China Cloud, users of the China Cloud did not have many opportunities to design their services to be resilient to this type of outage.
  • More generally, consider ensuring that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These can trigger emails, SMS, webhooks, push notifications (via the Azure mobile app) and more.

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey.