July 2024

13

Impact Statement: Starting at approximately 00:01 UTC on 13 July 2024, a subset of customers across multiple regions using the Azure OpenAI service began experiencing errors when calling the Azure OpenAI endpoints and may have issues accessing their resources for the duration of this impact. 

Current Status: We believe that some dependent backend components became unavailable during a routine cleanup operation, which led to the aforementioned issues. This was initially believed to be transient because it affected only a small percentage of customers; however, the impact soon spread to multiple regions. We have stopped the cleanup operation to avoid further impact while we continue our recovery efforts.

These efforts are ongoing across all impacted regions, which are seeing partial service restoration. Customers in these regions may start noticing improvements. 

Retries may be successful as the mitigation progresses across the different affected regions. 
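
For clients that want to act on the retry guidance above, the following is a minimal sketch of retrying with exponential backoff and jitter. It is not taken from the incident report: the function name `call_with_backoff`, its parameters, and the default delays are all hypothetical, and which errors count as transient is an assumption that depends on the client library in use.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      retryable=(Exception,), sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or max_attempts is exhausted.

    Hypothetical helper. Before retry n (counting from 0) it waits a random
    time in [0, min(max_delay, base_delay * 2**n)), so that many clients
    recovering at once do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)
```

In practice `retryable` would be narrowed to the transient error types of the client in use, so that permanent failures such as invalid credentials are not retried.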

Initially, this issue was communicated via the public Azure Status page, as the full impact had not yet been adequately determined. Now that the root cause has been determined and the impact is no longer increasing, all further updates will be communicated via our standard communications to impacted customers in Azure Service Health.

12

What happened?

Between 00:55 UTC and 06:28 UTC on 12 Jul 2024, you were identified among a subset of customers using Managed Service Identity (MSI) for Azure resources who may have experienced failures when requesting tokens for managed identities associated with Virtual Machines or Virtual Machine Scale Sets, Windows Virtual Desktop, Azure Databricks, and any other Azure service that relies on MSI.

What do we know so far?

We identified that a configuration change introduced in a recent deployment caused this issue. We rolled back the change and restarted on the last known good build.

How did we respond?

  • 00:55 UTC on 12 July 2024 – Customer impact began.
  • 01:10 UTC on 12 July 2024 – Service monitoring detected decreasing availability on some storage scale units in the region.
  • 01:14 UTC on 12 July 2024 – Our team engaged and started the investigation.
  • 02:58 UTC on 12 July 2024 – A recent configuration change was identified and we started a deployment to roll back the change.
  • 06:05 UTC on 12 July 2024 – We completed the rollback in one Availability Zone (AZ), verified that telemetry for that AZ looked healthy, and began rolling back the other AZs. We also failed over the other availability zones where we were seeing signs of impact.
  • 06:23 UTC on 12 July 2024 – Service started to recover, and customers should have started seeing recovery at this point. We continued to apply recovery operations and monitor recovery.
  • 06:28 UTC on 12 July 2024 – Rollback completed, and service showed full recovery from the platform side. Customers who are not fully mitigated may benefit from recycling their service.

What happens next?

April 2024

22

Watch our 'Azure Incident Retrospective' video about this incident:

What happened?

Between 03:45 CST and 06:00 CST on 23 Apr 2024, a configuration change performed through a domain registrar resulted in a service disruption to two system domains (chinacloudapp.cn and chinacloudsites.cn) that are critical to our cloud operations in China. This caused the following impact in the Azure China regions:

Customers may have had issues connecting to multiple Azure services including Cosmos DB, Azure Virtual Desktop, Azure Databricks, Backup, Site Recovery, Azure IoT Hub, Service Bus, Logic Apps, Data Factory, Azure Kubernetes Services, Azure Policy, Azure AI Speech, Azure Machine Learning, API Management, Azure Container Registry, and Azure Data Explorer.

Customers may have had issues viewing or managing resources via the Azure Portal (portal.azure.cn) or via APIs.

Azure Monitor offerings, Log Analytics and Microsoft Sentinel in the China East 2 and China East 3 regions experienced intermittent data latency and failures to query and retrieve data, which could have resulted in alerts failing to activate and/or failures of create, update, retrieve or delete operations in Log Analytics.

What went wrong and why?

To comply with a regulatory requirement of the Chinese government, we conducted an internal audit to ensure that all our domains had appropriate ownership and were properly documented. During this process, ownership of two critical system domains for Azure in China was misattributed and, as a result, the domains were flagged as potential candidates for decommissioning.

The next step of the decommissioning process is a period of monitoring active traffic on a flagged domain before continuing with its decommissioning. However, the management tool that provides DNS zone and hosting information was not scoped to include zones hosted within Azure China, which caused our system to report that the zone file did not exist. It is common for end-of-life domains that are not in use to have no zone file and, as such, the non-existent zone file notice did not raise any alerts with the operator. The workflow then proceeded to the next stage, where the nameservers of these two domains were updated to a set of inactive servers, which is a final check to identify any hidden users or services dependent on the domain.
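
The gap described above is that "the tool could not see this zone" and "this zone does not exist" produced the same report. The following is a minimal sketch of one way to keep those cases apart. It is not the actual tooling: `ZoneStatus`, `lookup_zone`, `next_step`, and the scope names are all hypothetical and exist only to illustrate the idea.

```python
from enum import Enum


class ZoneStatus(Enum):
    PRESENT = "present"            # the tool searched the right place and found a zone
    ABSENT = "absent"              # the tool searched the right place and found none
    OUT_OF_SCOPE = "out_of_scope"  # the tool cannot see where this zone would live


def lookup_zone(domain, tool_scopes, zones_by_scope, domain_scope):
    """Return a ZoneStatus rather than a bare 'zone file not found'.

    tool_scopes: set of hosting environments the tool can inspect (assumed).
    zones_by_scope: dict mapping each environment to the zones hosted there.
    domain_scope: callable returning the environment a domain belongs to.
    """
    scope = domain_scope(domain)
    if scope not in tool_scopes:
        return ZoneStatus.OUT_OF_SCOPE
    if domain in zones_by_scope.get(scope, set()):
        return ZoneStatus.PRESENT
    return ZoneStatus.ABSENT


def next_step(status):
    # Only a confirmed absence is treated as evidence of an unused domain.
    if status is ZoneStatus.PRESENT:
        return "monitor_traffic"
    if status is ZoneStatus.ABSENT:
        return "proceed"
    return "halt_for_review"
```

With a tool scoped only to a hypothetical "global" environment, a domain hosted in a "china" environment yields `OUT_OF_SCOPE` and the workflow halts for review instead of proceeding.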

As DNS caches across the Internet gradually timed out, DNS resolvers made requests to refresh the information for the two domains and received responses containing the inactive nameservers, resulting in failures to resolve FQDNs in those domains. Our health signals detected this degradation in our Azure China Cloud and alerted our engineers. Once we understood the issue, the change was reverted in a matter of minutes. However, the mitigation time was prolonged due to the caching applied by DNS resolvers.
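
The delayed onset and the slow recovery both follow from resolver caching. The toy model below is a sketch only: the class `CachingResolver`, the domain `zone.example`, the one-hour TTL, and the timestamps are hypothetical values chosen for illustration, since the report does not state the actual TTLs.

```python
class CachingResolver:
    """Toy resolver that caches each answer for a fixed TTL (in seconds)."""

    def __init__(self, authority, ttl_seconds):
        self.authority = authority  # dict: domain -> answer currently published
        self.ttl = ttl_seconds
        self._cache = {}            # domain -> (answer, expires_at)

    def resolve(self, domain, now):
        cached = self._cache.get(domain)
        if cached is not None and now < cached[1]:
            return cached[0]        # still fresh: the authority is not consulted
        answer = self.authority[domain]
        self._cache[domain] = (answer, now + self.ttl)
        return answer


authority = {"zone.example": "active"}
resolver = CachingResolver(authority, ttl_seconds=3600)

resolver.resolve("zone.example", now=0)                # caches "active" until t=3600
authority["zone.example"] = "inactive"                 # bad change published at t=600
during_ttl = resolver.resolve("zone.example", now=1800)    # still the cached answer
after_expiry = resolver.resolve("zone.example", now=3600)  # refresh picks up bad answer
authority["zone.example"] = "active"                   # change reverted at t=4000
after_revert = resolver.resolve("zone.example", now=5000)  # bad answer is still cached
recovered = resolver.resolve("zone.example", now=7200)     # cache expired, good again
```

In this model the resolver keeps serving the good answer for a while after the bad change, and keeps serving the bad answer for a while after the revert, which mirrors the sequence described above.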

The issue impacted only specific Microsoft-owned domains, and it did not affect the Azure DNS platform availability or DNS services serving any other zone hosted on Azure.

How did we respond?

  • 03:45 CST on 23 April 2024 – Nameserver configuration was updated. Due to previous DNS TTLs (Time To Live), the impact was not immediate.
  • 04:37 CST on 23 April 2024 – Our internal monitors alerted us to degradation in the service and created an incident.
  • 04:39 CST on 23 April 2024 – Incident was acknowledged by our engineering team.
  • 04:57 CST on 23 April 2024 – We determined that the resolution failures coincided with a change in name servers for the chinacloudapp.cn and chinacloudsites.cn domains.
  • 04:59 CST on 23 April 2024 – We began reverting to the previous known good name servers.
  • 05:13 CST on 23 April 2024 – The reversion was completed, at which point services began to recover.
  • 06:00 CST on 23 April 2024 – Full recovery was declared after verifying that traffic for the services and affected DNS zones had returned to pre-incident levels.

How are we making incidents like this less likely or less impactful?

  • We have suspended any further runs of this domain lifecycle process until updates to the management tool for Azure in China are completed.
  • We will update our validation process for domain lifecycle management to ensure that signals from all cloud regions are incorporated (Estimated completion: June 2024).
  • We will implement additional validations that obtain nameserver information by directly resolving the zone over the internet (Estimated completion: July 2024).
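
As a rough illustration of the last repair item, the sketch below compares what a management tool reports with what a live nameserver lookup returns. It is an assumption about how such a check could look, not the validation being implemented: `cross_check_nameservers`, `safe_to_treat_as_unused`, and the injected `resolve_ns` callable are hypothetical. In practice `resolve_ns` might wrap a DNS library such as dnspython; here it is injected so the logic can be exercised without network access.

```python
def _normalize(names):
    # Compare hostnames case-insensitively and without the trailing dot.
    return {name.rstrip(".").lower() for name in names}


def cross_check_nameservers(domain, tool_reported, resolve_ns):
    """Compare the tool's view of a domain's nameservers with a live lookup.

    tool_reported: nameservers the management tool knows about, or None.
    resolve_ns: callable returning the nameservers seen on the internet.
    Returns (consistent, live), where live is the normalized live set.
    """
    live = _normalize(resolve_ns(domain))
    reported = _normalize(tool_reported or [])
    return live == reported, live


def safe_to_treat_as_unused(domain, tool_reported, resolve_ns):
    consistent, live = cross_check_nameservers(domain, tool_reported, resolve_ns)
    # If the tool sees no zone but the domain is delegated on the live
    # internet, the two sources disagree and the domain is not "unused".
    return consistent and not live
```

A domain for which the tool reports nothing but the live lookup returns nameservers would fail this check and would need manual review.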

How can our customers and partners make incidents like this less impactful?

  • As this issue impacted two domains used in operating the management plane of the Azure China Cloud and naming Azure services offered in the Azure China Cloud, users of the China Cloud did not have many opportunities to design their services to be resilient to this type of outage.
  • More generally, customers should consider ensuring that the right people in their organization will be notified about any future service issues by configuring Azure Service Health alerts. These can trigger emails, SMS, webhooks, push notifications (via the Azure Mobile app) and more.

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

19

What happened?

Between 04:26 UTC and 07:30 UTC on 19 April 2024, a platform issue in the West US region resulted in impact for the following services:

  • Azure Database for MariaDB: Connectivity issues, all connections may have failed for a subset of customers. Retries would have been unsuccessful.
  • Azure Database for MySQL - Single Server: Connectivity issues, all connections may have failed for a subset of customers. Retries would have been unsuccessful.
  • Azure Databricks: Due to Azure Databricks dependencies on the impacted database service, customers in West US, West US 2 and/or South Central US may have experienced failures and timeouts with workspace login and authentication requests. Cluster CRUD requests and jobs relying on cluster start/resize/termination may not have executed, jobs submitted through APIs/Schedulers may not have executed, UI and Databricks SQL queries may have timed out, users may have experienced failures launching Databricks SQL Serverless Warehouses and may have been unable to access UC APIs, and customers may have observed errors citing “Authentication is temporarily unavailable” or “TEMPORARILY_UNAVAILABLE”.

What went wrong and why?

All connections to the Azure Database for MySQL - Single Server and Azure Database for MariaDB services are established through a gateway responsible for routing incoming connections to their respective servers. This gateway service is hosted by a group of stateless compute nodes sitting behind an IP address. As part of ongoing service maintenance, the compute hardware hosting the gateway service is periodically refreshed to ensure we provide the most secure and performant experience. Before new incoming connection requests are moved to a new gateway ring, customers whose servers connect through older gateway rings are notified via email and in the Azure portal to update their outbound rules to allow the new gateway IP address.

When the gateway hardware is refreshed, a new ring of gateway compute nodes is built out first. This new ring serves the traffic for all newly created Azure Database for MySQL servers, and it has a different IP address from older gateway rings in the same region, to differentiate the traffic. Once the new ring is fully functional, existing server connection traffic is routed to the new gateway by updating the DNS, and the older gateway hardware serving existing servers is planned for decommissioning. After the DNS update, all new connections from clients are automatically routed to the new gateway ring.

On 19 April 2024, as part of the planned maintenance event, the move of incoming traffic for existing servers to the new gateway ring was scheduled in the West US region. At 04:26 UTC, the DNS update was made for a batch of Azure Database for MySQL and MariaDB servers, to route new incoming connections for those servers to a new gateway ring. Due to a procedural error, the new connections were erroneously routed to a newly built gateway ring whose configuration was incomplete. This new gateway ring was not yet ready to accept new connections. As a result, login requests failed at the gateway, and the gateway rejected all new incoming connections to the Azure Database for MySQL and Azure Database for MariaDB servers. We initially failed to detect the failures caused by this change because the incomplete gateway ring did not yet have telemetry and monitoring configured. The issue therefore went undetected by service telemetry until we received reports from internal teams about unreachable databases.
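
The failure mode above is a traffic switch to a ring that was neither fully configured nor monitored. A pre-switch readiness gate could look roughly like the sketch below. This is an illustration under assumptions, not the actual gateway tooling: `RingState`, its fields, `readiness_failures`, `switch_dns`, and the injected `update_dns` callable are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RingState:
    name: str
    config_complete: bool      # configuration rollout has finished
    telemetry_enabled: bool    # monitoring would notice failures on this ring
    synthetic_login_ok: bool   # a test login through the ring succeeded


def readiness_failures(ring):
    """Return the reasons, if any, that the ring is not ready for traffic."""
    checks = [
        (ring.config_complete, "configuration incomplete"),
        (ring.telemetry_enabled, "telemetry/monitoring not configured"),
        (ring.synthetic_login_ok, "synthetic login probe failed"),
    ]
    return [reason for ok, reason in checks if not ok]


def switch_dns(ring, update_dns):
    """Point DNS at the ring only if every readiness check passes."""
    failures = readiness_failures(ring)
    if failures:
        raise RuntimeError(
            f"refusing to route traffic to {ring.name}: " + "; ".join(failures)
        )
    update_dns(ring.name)
```

Treating missing telemetry as a blocking failure is a deliberate choice in this sketch: a ring that cannot report its own errors cannot be verified after the switch.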

How did we respond?

  • 19 April 2024 @ 04:46 UTC - Our MySQL team received a report from Azure Databricks that multiple databases were unreachable.
  • 19 April 2024 @ 05:14 UTC - Our MySQL on-call engineers determined that the databases were healthy but that no connections were being made.
  • 19 April 2024 @ 05:44 UTC - Our networking team was engaged to assist with the investigation.
  • 19 April 2024 @ 06:46 UTC - The issue was correlated to the attempt to move traffic to the new gateway ring, and mitigation steps to roll back the change were initiated.
  • 19 April 2024 @ 07:30 UTC - The incident was mitigated when the change was fully rolled back, and all the connections were moved back to the previous functional gateway ring after validation. 

How are we making incidents like this less likely or less impactful?

  • MySQL engineering will improve operational practices by adding checks, testing and sign-off procedures before switching the production gateway rings. (Estimated completion: May 2024)
  • Azure Databricks engineering will introduce alerts to detect multiple databases being unavailable within a short span of time. (Estimated completion: May 2024)
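
An alert of the kind described in the second item can be approximated with a sliding time window over failure events. The sketch below is one possible shape for such a detector, not the alert being built: the class name `MultiDatabaseOutageDetector`, the threshold of three databases, and the five-minute window are hypothetical values, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class MultiDatabaseOutageDetector:
    """Flags when several distinct databases fail within a short window."""

    def __init__(self, threshold=3, window_seconds=300):
        self.threshold = threshold       # distinct databases needed to alert
        self.window = window_seconds     # how far back failures are counted
        self._events = deque()           # (timestamp, database_id), oldest first

    def record_failure(self, database_id, timestamp):
        """Record one failure and return True if the alert condition holds."""
        self._events.append((timestamp, database_id))
        cutoff = timestamp - self.window
        # Drop failures that have aged out of the window.
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        distinct = {db for _, db in self._events}
        return len(distinct) >= self.threshold
```

Counting distinct databases rather than raw failures keeps a single noisy database from triggering the alert on its own.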

How can customers make incidents like this less impactful?

  • Azure Databricks customers should consider reviewing our best practices for disaster recovery for Azure Databricks.
  • As Azure Database for MySQL – Single Server and Azure Database for MariaDB are on the retirement path, we recommend that customers upgrade to Azure Database for MySQL – Flexible Server. The Flexible Server architecture has no shared gateways and uses a single VM per tenant. All planned maintenance in Flexible Server is scheduled in the maintenance window defined by the end user for that server, which architecturally avoids such incidents.
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review.
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more.