April 2024
22
Watch our 'Azure Incident Retrospective' video about this incident: https://aka.ms/AIR/V_BH-7DZ
What happened?
Between 03:45 CST and 06:00 CST on 23 Apr 2024, a configuration change performed through a domain registrar resulted in a service disruption to two system domains (chinacloudapp.cn and chinacloudsites.cn) that are critical to our cloud operations in China. This caused the following impact in the Azure China regions:
- Customers may have had issues connecting to multiple Azure services including Cosmos DB, Azure Virtual Desktop, Azure Databricks, Backup, Site Recovery, Azure IoT Hub, Service Bus, Logic Apps, Data Factory, Azure Kubernetes Services, Azure Policy, Azure AI Speech, Azure Machine Learning, API Management, Azure Container Registry, and Azure Data Explorer.
- Customers may have had issues viewing or managing resources via the Azure Portal (portal.azure.cn) or via APIs.
- Azure Monitor offerings, Log Analytics, and Microsoft Sentinel in the China East 2 and China East 3 regions experienced intermittent data latency and failures to query and retrieve data, which could have resulted in alerts failing to activate, and/or failures of create, update, retrieve, or delete operations in Log Analytics.
What went wrong and why?
To comply with a regulatory requirement of the Chinese government, we conducted an internal audit to ensure that all our domains had appropriate ownership and were properly documented. During this process, ownership of two critical system domains for Azure in China was misattributed and, as a result, the domains were flagged as potential candidates for decommissioning.
The next step of the decommissioning process is a period of monitoring active traffic on a flagged domain before continuing with its decommissioning. However, the management tool that provides DNS zone and hosting information was not scoped to include zones hosted within Azure China, which caused our system to report that the zone file did not exist. It is common for end-of-life domains that are no longer in use to have no zone file, so the non-existent zone file notice did not raise any alerts with the operator. The workflow then proceeded to the next stage, where the nameservers of these two domains were updated to a set of inactive servers, which is a final check to identify any hidden users or services dependent on the domain.
As DNS caches across the Internet gradually timed out, DNS resolvers made requests to refresh the information for the two domains and received responses containing the inactive nameservers, resulting in failures to resolve FQDNs in those domains. Our health signals detected this degradation in our Azure China Cloud and alerted our engineers. Once we understood the issue, the change was reverted in a matter of minutes. However, the mitigation time was prolonged due to the caching applied by DNS resolvers.
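To illustrate why recovery lagged behind the revert, here is a minimal sketch of a single caching resolver. It is a toy model, not a description of any specific resolver: the function name `failing_window`, the TTL values, and the assumptions that the resolver refreshes the delegation the moment it expires and caches the inactive delegation for the same TTL are all illustrative. Times are minutes relative to the 03:45 CST change, so 88 corresponds to the 05:13 CST completion of the revert.

```python
from typing import Optional, Tuple


def failing_window(
    last_good_refresh: int, ttl: int, change_at: int, revert_at: int
) -> Optional[Tuple[int, int]]:
    """Return (start, end) minutes during which this resolver holds the
    inactive nameservers, or None if it never picks them up.

    Assumptions (hypothetical, for illustration only):
    - the resolver refreshes exactly when its cached entry expires;
    - the inactive delegation is cached for the same TTL as the good one.
    """
    if ttl <= 0:
        raise ValueError("ttl must be positive")

    # Refreshes before the change still return the good nameservers.
    refresh = last_good_refresh
    while refresh < change_at:
        refresh += ttl

    # First refresh lands after the revert: the bad data was never cached.
    if refresh >= revert_at:
        return None

    # The resolver cached the inactive nameservers at `refresh` and keeps
    # re-fetching them until a refresh falls at or after the revert.
    start = refresh
    while refresh < revert_at:
        refresh += ttl
    return (start, refresh)


if __name__ == "__main__":
    # (last good refresh, TTL) pairs are made-up example values.
    for last, ttl in [(-20, 60), (-50, 60), (-10, 100)]:
        print(last, ttl, failing_window(last, ttl, change_at=0, revert_at=88))
```

Under these assumed values, a resolver that last refreshed 20 minutes before the change with a 60-minute TTL would hold the inactive nameservers from minute 40 to minute 100, which is 12 minutes past the revert. Different resolvers start and stop failing at different times, which is consistent with the gradual onset and gradual recovery described above.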
The issue impacted only specific Microsoft-owned domains, and it did not affect the Azure DNS platform availability or DNS services serving any other zone hosted on Azure.
How did we respond?
- 03:45 CST on 23 April 2024 - Nameserver configuration was updated. Due to existing DNS TTLs (Time To Live), the impact was not immediate.
- 04:37 CST on 23 April 2024 - Our internal monitors alerted us to degradation in the service and created an incident.
- 04:39 CST on 23 April 2024 - Incident was acknowledged by our engineering team.
- 04:57 CST on 23 April 2024 - We determined that the resolution failures coincided with a change in name servers for the chinacloudapp.cn and chinacloudsites.cn domains.
- 04:59 CST on 23 April 2024 - We started reverting to the previous known-good name servers.
- 05:13 CST on 23 April 2024 - The reversion was completed, at which point services began to recover.
- 06:00 CST on 23 April 2024 - Full recovery was declared after verifying that traffic for the services and affected DNS zones had recovered back to pre-incident levels.
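The intervals between these milestones can be derived directly from the timestamps. The short sketch below does that arithmetic; the event labels and helper names are our own shorthand, and only the times come from the timeline above.

```python
from datetime import datetime

# Times (CST, 23 April 2024) taken from the timeline; labels are shorthand.
TIMELINE = [
    ("nameservers updated", "03:45"),
    ("monitors alerted", "04:37"),
    ("incident acknowledged", "04:39"),
    ("cause identified", "04:57"),
    ("revert started", "04:59"),
    ("revert completed", "05:13"),
    ("full recovery declared", "06:00"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def intervals(timeline):
    """Pair each event with the next one and compute the gap in minutes."""
    return [
        (f"{a} -> {b}", minutes_between(ta, tb))
        for (a, ta), (b, tb) in zip(timeline, timeline[1:])
    ]


if __name__ == "__main__":
    for label, minutes in intervals(TIMELINE):
        print(f"{label}: {minutes} min")
    total = minutes_between(TIMELINE[0][1], TIMELINE[-1][1])
    print(f"total: {total} min")
```

This gives 52 minutes from the change to the first alert, 14 minutes to complete the revert once it started, and 135 minutes from the change to declared recovery.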
How are we making incidents like this less likely or less impactful?
- We have suspended any further runs of this domain lifecycle process until updates to the management tool for Azure in China are completed.
- We will update our validation process for domain lifecycle management to ensure all cloud region signals are incorporated (Estimated completion: June 2024).
- We will implement additional validations that obtain nameserver information by directly resolving the zone over the internet (Estimated completion: July 2024).
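As a rough illustration of the kind of validation these repair items describe, the sketch below combines the management tool's answer with an independent live nameserver lookup and treats missing evidence as inconclusive rather than as proof that a domain is unused. This is not our actual tooling: `Verdict`, `assess_domain`, `tool_zone_found`, and the injected `resolve_ns` callable are hypothetical names, and the resolver is assumed to raise `LookupError` when the lookup itself fails.

```python
from enum import Enum
from typing import Callable, Optional, Sequence


class Verdict(Enum):
    IN_USE = "in_use"
    LIKELY_UNUSED = "likely_unused"
    INCONCLUSIVE = "inconclusive"


def assess_domain(
    domain: str,
    tool_zone_found: Optional[bool],
    resolve_ns: Callable[[str], Sequence[str]],
) -> Verdict:
    """Classify a domain before any decommissioning step.

    tool_zone_found: True/False as reported by the management tool, or
        None when the tool is not scoped to the cloud that may host the zone.
    resolve_ns: hypothetical callable returning the nameservers seen when
        resolving the zone over the internet; assumed to raise LookupError
        if the lookup could not be performed.
    """
    try:
        live_ns = list(resolve_ns(domain))
    except LookupError:
        live_ns = None  # no independent evidence either way

    # Any positive signal means the domain must not be decommissioned.
    if tool_zone_found or live_ns:
        return Verdict.IN_USE
    # A missing signal is not the same as a negative one.
    if tool_zone_found is None or live_ns is None:
        return Verdict.INCONCLUSIVE
    return Verdict.LIKELY_UNUSED


if __name__ == "__main__":
    # Fake resolver standing in for a real DNS query (illustrative data).
    def fake_resolver(name: str) -> Sequence[str]:
        return ["ns1.example.net."] if name == "chinacloudapp.cn" else []

    print(assess_domain("chinacloudapp.cn", False, fake_resolver))
    print(assess_domain("retired.example", False, fake_resolver))
```

In this sketch, a domain whose zone the tool cannot see but which still resolves to live nameservers is classified as in use, which is the case that went undetected in this incident. In practice `resolve_ns` could wrap a DNS library query for NS records; that wiring is omitted here.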
How can our customers and partners make incidents like this less impactful?
- As this issue impacted two domains used in operating the management plane of the Azure China Cloud and naming Azure services offered in the Azure China Cloud, users of the China Cloud did not have many opportunities to design their services to be resilient to this type of outage.
- More generally, consider ensuring that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts: https://aka.ms/ash-alerts . These can trigger emails, SMS, webhooks, push notifications (via the Azure Mobile app https://aka.ms/AzureMobileApp) and more.
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/V_BH-7DZ