19 April 2024
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/QS3K-BPZ
What happened?
Between 04:26 UTC and 07:30 UTC on 19 April 2024, a platform issue in the West US region resulted in impact for the following services:
- Azure Database for MariaDB: Connectivity issues, all connections may have failed for a subset of customers. Retries would have been unsuccessful.
- Azure Database for MySQL - Single Server: Connectivity issues, all connections may have failed for a subset of customers. Retries would have been unsuccessful.
- Azure Databricks: Because Azure Databricks depends on the impacted database service, customers in West US, West US 2 and/or South Central US may have experienced failures and timeouts with workspace login and authentication requests. Cluster CRUD requests, jobs relying on cluster start/resize/termination, and jobs submitted through APIs/Schedulers may not have executed. UI and Databricks SQL queries may have timed out. Users may have experienced failures launching Databricks SQL Serverless Warehouses and may have been unable to access UC APIs. Customers may have observed errors citing “Authentication is temporarily unavailable” or “TEMPORARILY_UNAVAILABLE”.
What went wrong and why?
All connections to Azure Database for MySQL - Single Server and Azure Database for MariaDB are established through a gateway responsible for routing incoming connections to their respective servers. This gateway service is hosted by a group of stateless compute nodes sitting behind an IP address. As part of ongoing service maintenance, the compute hardware hosting the gateway service is periodically refreshed to ensure we provide the most secure and performant experience. Before new incoming connection requests are moved to a new gateway ring, customers whose servers connect through older gateway rings are notified via email and in the Azure portal to update their outbound rules to allow the new gateway IP address.
When the gateway hardware is refreshed, a new ring of gateway compute nodes is built out first. This new ring serves the traffic for all newly created Azure Database for MySQL servers, and it has a different IP address from the older gateway rings in the same region, to differentiate the traffic. Once the new ring is fully functional, existing server connection traffic is routed to the new gateway by updating DNS, and the older gateway hardware serving existing servers is planned for decommissioning. After the DNS update, all new connections from clients are automatically routed to the new gateway ring.
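Because the ring switch is made through DNS, a customer can check ahead of time whether the address their server hostname currently resolves to is covered by their outbound rules. The following is a minimal sketch of that check, not an official tool. The function names are hypothetical, and the allow-list entries would come from the customer's own firewall configuration.

```python
import ipaddress
import socket


def resolve_ipv4(hostname, port=3306):
    """Return the set of IPv4 addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def uncovered_addresses(resolved, allowed_networks):
    """Return the resolved addresses that no allow-list entry covers.

    allowed_networks holds CIDR ranges or single IPs taken from the caller's
    own outbound rules (an assumption of this sketch).
    """
    networks = [ipaddress.ip_network(n, strict=False) for n in allowed_networks]
    return {
        addr
        for addr in resolved
        if not any(ipaddress.ip_address(addr) in net for net in networks)
    }
```

Calling `uncovered_addresses(resolve_ipv4(server_hostname), allow_list)` with your own server hostname and allow-list returns an empty set when every resolved gateway address is permitted. A non-empty result suggests that the outbound rules need updating before traffic moves to the new ring.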
On 19 April 2024, as part of the planned maintenance event, the move of existing server incoming traffic to the new gateway ring was scheduled in the West US region. At 04:26 UTC, the DNS update was made for a batch of Azure Database for MySQL and MariaDB servers, to route their new incoming connections to a new gateway ring. Due to a procedural error, the new connections were erroneously routed to a newly built gateway ring whose configuration was incomplete. This new gateway ring was not yet ready to accept new connections. As a result, login requests failed at the gateway, and the gateway rejected all new incoming connections to the Azure Database for MySQL and Azure Database for MariaDB servers. We initially failed to detect the failures caused by this change because the incomplete gateway ring did not yet have telemetry and monitoring configured. The issue therefore went undetected by service telemetry until we received reports from internal teams about unreachable databases.
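The failure mode described above, a DNS update pointing at a ring that was neither fully configured nor monitored, is the kind of condition a pre-cutover gate can catch. The sketch below is purely illustrative and does not represent the actual Azure tooling. `RingStatus` and `cutover_blockers` are hypothetical names, and how each readiness fact is collected is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class RingStatus:
    """Hypothetical readiness facts about a gateway ring."""
    configuration_complete: bool
    telemetry_configured: bool
    accepts_logins: bool


def cutover_blockers(status):
    """Return the reasons a DNS update should not proceed; empty means go."""
    blockers = []
    if not status.configuration_complete:
        blockers.append("ring configuration incomplete")
    if not status.telemetry_configured:
        blockers.append("telemetry and monitoring not configured")
    if not status.accepts_logins:
        blockers.append("test login through the ring failed")
    return blockers
```

Treating missing telemetry as a blocker in its own right reflects the point made above: a ring that cannot report failures also cannot reveal a bad cutover.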
How did we respond?
- 19 April 2024 @ 04:46 UTC - Our MySQL team received a report from Azure Databricks that multiple databases were unreachable.
- 19 April 2024 @ 05:14 UTC - Our MySQL on-call engineers determined that the databases were healthy but that no connections were being made.
- 19 April 2024 @ 05:44 UTC - Our networking team was engaged to assist with the investigation.
- 19 April 2024 @ 06:46 UTC - The issue was correlated to the attempt to move traffic to the new gateway ring, and mitigation steps to roll back the change were initiated.
- 19 April 2024 @ 07:30 UTC - The incident was mitigated when the change was fully rolled back, and all the connections were moved back to the previous functional gateway ring after validation.
How are we making incidents like this less likely or less impactful?
- MySQL engineering will improve operational practices by adding checks, testing, and sign-off procedures before switching the production gateway rings. (Estimated completion: May 2024)
- Azure Databricks engineering will introduce alerts to detect multiple databases being unavailable within a short span of time. (Estimated completion: May 2024)
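An alert on "multiple databases being unavailable within a short span of time" can be pictured as a sliding-window count of distinct failing databases. The following is a minimal sketch under stated assumptions, not the Azure Databricks implementation. The class name, the threshold of 3, and the 300-second window are all hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class UnavailabilityAlert:
    """Fire when enough distinct databases fail within a time window."""

    def __init__(self, threshold=3, window_seconds=300):
        self.threshold = threshold            # hypothetical default
        self.window_seconds = window_seconds  # hypothetical default
        self._events = deque()                # (timestamp, database_id), oldest first

    def record_failure(self, database_id, timestamp):
        """Record one failure; return True if the alert should fire."""
        self._events.append((timestamp, database_id))
        # Drop events that fell out of the window ending at this timestamp.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        # Count distinct databases so one flapping server cannot trigger the alert.
        distinct = {db for _, db in self._events}
        return len(distinct) >= self.threshold
```

Counting distinct databases rather than raw failures is a deliberate choice in this sketch: it separates a shared-dependency failure, such as a gateway rejecting all connections, from a single unhealthy server.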
How can customers make incidents like this less impactful?
- Azure Databricks customers should consider reviewing our best practices surrounding disaster recovery for Azure Databricks, see: https://learn.microsoft.com/azure/databricks/administration-guide/disaster-recovery
- As Azure Database for MySQL – Single Server and Azure Database for MariaDB are on the retirement path, we recommend that customers upgrade to Azure Database for MySQL – Flexible Server. The Flexible Server architecture has no shared gateways and uses a single VM per tenant. All planned maintenance in Flexible Server is scheduled in the maintenance window defined by the end user for that server, which architecturally avoids such incidents. For details, see: https://learn.microsoft.com/azure/mysql/flexible-server/overview
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/QS3K-BPZ