2 November 2022

What happened?

Between 00:42 UTC on 2 November and 05:55 UTC on 3 November 2022, customers experienced failures when attempting to provision new resources in a subset of Azure services including Application Gateway, Bastion, Container Apps, Database Services (MySQL - Flexible Server, PostgreSQL - Flexible Server, and others), ExpressRoute, HDInsight, OpenAI, SQL Managed Instance, Stream Analytics, VMware Solution, and VPN Gateway. Provisioning new resources in these services requires creating new certificates. The certificate Registration Authority (RA) that processes new certificate requests experienced a service degradation, which prevented the provisioning of new resources in this subset of Azure services. We’re providing this Post Incident Review (PIR) to summarize what went wrong, how we responded, and the steps Microsoft is taking to learn and improve. Communications for this incident were provided under Tracking IDs YTGZ-1Z8 (Azure Public), 7LHZ-1S0 (Azure Government), and YTHP-180 (Azure China).

What went wrong, and why?

From 23:56 UTC on 1 November until 00:52 UTC on 2 November, an internal Certificate Authority (CA) experienced a brief service degradation. At the same time, the Registration Authority (RA) that sends requests to the CA received a burst of certificate renewal requests, resulting in requests queueing at the RA. Once the CA recovered, the request queue started processing. Due to a latent performance bug in the RA, inadvertently introduced in the past month as part of a feature enhancement, the rate of new incoming requests was greater than the rate at which requests could be processed from the queue. This triggered automatic throttling of incoming requests, but the RA was unable to recover fully because throttled requests generated additional retry traffic. The latent performance bug was not caught during deployment or through health monitoring, due to a test gap: this specific set of conditions was not exercised.
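To illustrate this failure mode, the following is a minimal, hypothetical Python sketch (illustrative numbers and logic only, not the RA's actual implementation): a backlog accumulates while the upstream CA is degraded, a performance regression drops the processing rate below the arrival rate, and throttled callers retry, keeping the offered load above what the degraded service can absorb.

# Minimal simulation of the feedback loop described above. All numbers and the
# queue/throttle model are hypothetical, for illustration only.

def simulate(minutes: int, backlog: float):
    ARRIVALS_PER_MIN = 90.0    # new certificate requests per minute (illustrative)
    PROCESSED_PER_MIN = 60.0   # reduced below the arrival rate by the latent bug
    THROTTLE_LIMIT = 200.0     # queue depth beyond which new requests are rejected
    RETRY_FRACTION = 0.8       # share of throttled callers that retry a minute later

    retries = 0.0
    history = []
    for _ in range(minutes):
        offered = ARRIVALS_PER_MIN + retries
        accepted = min(offered, max(THROTTLE_LIMIT - backlog, 0.0))
        throttled = offered - accepted
        backlog = max(backlog + accepted - PROCESSED_PER_MIN, 0.0)
        retries = RETRY_FRACTION * throttled
        history.append((round(backlog), round(throttled)))
    return history

# Backlog built up while the CA was degraded; retries keep the offered load
# above the degraded processing rate, so throttling (and provisioning failures)
# persists until the regression is rolled back and traffic is blocked to let
# the queue drain - the mitigation steps described below.
print(simulate(minutes=12, backlog=600.0))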

How did we respond? 

The issue was detected via our internal monitoring, and the relevant engineering team was engaged within one minute of the alert firing. The service was unable to self-heal, so we took the following steps to mitigate the incident. First, we rolled back the change containing the performance bug. Second, we blocked requests to the RA to enable the queue to drain. Finally, once the queue was at a manageable length, we slowly re-enabled traffic and monitored service recovery until all traffic was restored and the RA had returned to a healthy state.

How are we making incidents like this less likely or less impactful?

  • We have increased the processing capacity of the RA backend component by 3x. (Completed)
  • We have introduced more granular request throttling, to help smooth similar spikes in traffic. (Completed)
  • We have extended our end-to-end service monitoring to include upstream request sources, to reduce time to detection. (Completed)
  • We are exploring more fine-grained throttling as an additional isolation layer across the backend of the RA (Estimated completion: December 2022).
  • We are deploying additional dedicated RA capacity for the Azure China and Azure Government cloud environments (Estimated completion: March 2023). 

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

26 October 2022

What happened? 

Between 00:25 UTC and 06:00 UTC on 26 October 2022, a subset of customers using Azure Cosmos DB in the East US region may have experienced issues connecting to the service. Connections to Azure Cosmos DB accounts in this region may have resulted in an error or timeout. Downstream Azure services that rely on Azure Cosmos DB also experienced impact during this window - including Azure Application Insights, Azure Automation, Azure Container Registry, Azure Digital Twins, Azure Policy, Azure Rights Management, Azure Red Hat OpenShift, and Azure Spatial Anchors. 

What went wrong, and why?

A change to the front-end gateway of Azure Cosmos DB, to include additional diagnostic information, was introduced on the affected cluster earlier in the month. The change had had no effect on the system since its introduction. On 26 October, a configuration change was applied to the Azure Load Balancer. This change resulted in intermittent network connectivity issues, from which the system can normally recover. However, the diagnostic change resulted in higher-than-expected time spent in the kernel, which caused spikes of high CPU utilization across the cluster. This, in turn, increased exceptions and further increased the time spent on kernel locking, resulting in timeouts and increased latency for incoming requests, ultimately leading to the customer impact described above.

How did we respond?

Our monitors alerted us to the impact on this cluster. We worked with our customers and partners to trigger mitigation steps while investigating the factors contributing to this issue. To mitigate the incident, accounts were offloaded from the impacted cluster to other clusters in the same region. To safely offload this quantity of accounts, we systematically moved each database account to an alternative healthy cluster. The resulting lower load improved the state of the impacted cluster and enabled recovery. All customer impact was confirmed mitigated by 06:00 UTC.

How are we making incidents like this less likely or less impactful?

  • We have paused the configuration changes for the Azure Load Balancer. (Completed)
  • We have fixed the original regression in the diagnostic stack. (Completed)
  • We are improving our load balancing automation, to speed up recovery in similar circumstances. (Estimated completion: November 2022).

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

7 September 2022

What happened? 

Between 16:10 and 19:55 UTC on 7 September 2022, a subset of customers using Azure Front Door (AFD) experienced intermittent availability drops, connection timeouts, and increased latency. At its peak, this impacted approximately 25% of traffic traversing the AFD service, and an average of 10% during the impact window. Some customers may have seen higher failure rates if their traffic was concentrated in the edges or regions with higher impact. This could also have impacted customers’ ability to access other Azure services that leverage AFD, in particular the Azure management portal and Azure Content Delivery Network (CDN).

What went wrong and why? 

The AFD platform automatically balances traffic across our global network of edge sites. When there is a failure in any of our edge sites, or an edge site becomes overloaded, traffic is automatically moved to healthy edge sites in other regions where we have fallback capacity. Because of this design, customers and end users typically don’t experience issues from localized or regional impact. In addition, every node has built-in protections against unusual traffic spikes for each domain hosted on AFD.
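As a simplified illustration of the per-node, per-domain protection described above (hypothetical limits and structure, not AFD's actual implementation), a token-bucket limiter caps how quickly any single domain can consume a node's capacity, so a spike on one domain does not starve the others:

import time
from collections import defaultdict

# Simplified per-domain token bucket; limits are hypothetical.
class DomainRateLimiter:
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec                  # sustained requests/second per domain
        self.burst = burst                        # short spike a single domain may absorb
        self.tokens = defaultdict(lambda: burst)  # per-domain token balance
        self.last = defaultdict(time.monotonic)   # last refill time per domain

    def allow(self, domain: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[domain]
        self.last[domain] = now
        # Refill this domain's bucket, capped at the burst size.
        self.tokens[domain] = min(self.burst, self.tokens[domain] + elapsed * self.rate)
        if self.tokens[domain] >= 1.0:
            self.tokens[domain] -= 1.0
            return True      # serve the request
        return False         # shed or deprioritize this domain's request

limiter = DomainRateLimiter(rate_per_sec=500, burst=2000)
if not limiter.allow("contoso.example"):
    pass  # only this domain's excess traffic is rejected; other domains are unaffected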

Between 15:15 and 16:44 UTC, we observed three unusual traffic spikes for one of the domains hosted on AFD.

  • The first two spikes for this domain occurred at 15:15 UTC and 16:00 UTC on 7 September 2022 and were fully mitigated by the AFD platform. A third spike, which occurred between 16:10 and 16:44 UTC, caused a subset of the environments managing this traffic to go offline.
  • The first two spikes were successfully absorbed by the platform protection mechanisms. However, the mitigations initiated during the third spike did not fully absorb the unexpected increase, due to the nature of the traffic pattern (which differed from the first two spikes). At this stage in our investigation, we believe that all three traffic spikes were malicious HTTPS flood attacks (Layer 7 volumetric DDoS attacks).
  • The malicious traffic spikes did not originate from a single region; they came from all around the world. A combination of malicious traffic (the third spike), a large ramp-up of legitimate traffic for other customers, and degraded customer origins overwhelmed the resources of a few environments, taking them offline and resulting in a 25% drop in overall availability during the third traffic spike.
  • By design, these environments automatically recover and resume taking traffic once healthy. During this incident, users and our systems retried the failed requests, resulting in a larger build-up of requests. This build-up did not allow a subset of the environments time to recover fully, resulting in a subsequent 8% availability drop for more than 3.5 hours following the traffic spike.
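The request build-up described in the last bullet is a common pattern: when clients retry failures immediately, a recovering service sees more load than it did before the failure. A generic client-side mitigation (a hypothetical sketch, not AFD or Azure SDK code) is to retry with exponential backoff and jitter, which spreads retries out over time:

import random
import time

# Generic retry helper with exponential backoff and full jitter (hypothetical,
# illustrating the client-side behavior that avoids retry storms).
def call_with_backoff(send_request, max_attempts: int = 6,
                      base_delay: float = 0.5, max_delay: float = 30.0):
    for attempt in range(max_attempts):
        try:
            return send_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap, so
            # many clients do not all retry at the same instant.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))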

How did we respond? 

We have automatic protection mechanisms for such events, which mitigate around 2,000 DDoS attacks per day; the most we have mitigated in a single day is 4,296. (More information can be found here: ). In addition, the AFD platform has built-in DDoS protection mechanisms on each node, at both the system and application layers, which provide further mitigation in such cases. In this instance, these mechanisms significantly helped to absorb the first two spikes without any customer impact.

During the third spike, the platform protection mechanisms were partially effective, mitigating around 40% of the traffic. This significantly helped to limit global impact. Over a longer duration, 8.5% of the overall AFD service, concentrated in certain regions, remained impacted by this issue. Some customers may have seen higher failure rates if their traffic was concentrated predominantly in the North America, Europe, or APAC regions.

As our telemetry alerted us to the impact on availability, we manually intervened. First, we took manual action to further block the attack traffic. In addition, we expedited the AFD load balancing process, which enabled the auto-recovery systems to work as designed by ensuring the most efficient load distribution in regions with a large build-up of traffic. Once the environments recovered, we began to gradually bring AFD instances back online to resume normal traffic management. We were 100% recovered globally by 19:55 UTC.

How are we making incidents like this less likely or less impactful?

Although the AFD platform has built-in resiliency and capacity, we must continuously strive to improve through lessons learned like these. We had several previously planned repair items in flight, either partially deployed or staged for deployment. We believe that these repair items would have mitigated the third malicious traffic spike had they been in place before 7 September. We are now expediting these repair items, which were scheduled for later this year, and they should be completed in the next few weeks. These include:

  • Effectively tuning the protection mechanisms in the AFD nodes to mitigate the impact of this class of traffic patterns in the future. (Estimated completion: September 2022)
  • Addressing issues identified in the current platform environment recovery process. This will reduce the time to recover each environment and will prevent environments from becoming overloaded. (Estimated completion: September 2022)
  • Tooling to trigger ‘per customer’ failover until we have fully automated the traffic shifting mechanisms. (Completed)
  • Improvements to the dynamic rate limiting algorithm to ensure fairness to legitimate traffic. (Estimated completion: October 2022)
  • Improving the existing proactive automatic communication process to notify customers more expeditiously. (Estimated completion: October 2022)

How can we make our incident communications more useful?

Microsoft is piloting this “PIR” template as a potential replacement for our “RCA” (Root Cause Analysis) template.

You can rate this PIR and provide any feedback using our quick 3-question survey:

7 September 2022

What happened?

Between 09:50 UTC and 17:21 UTC on 7 September 2022, a subset of customers using Azure Cosmos DB in North Europe may have experienced issues accessing services. Connections to Cosmos DB accounts in this region may have resulted in an error or timeout.

Downstream Azure services that rely on Cosmos DB also experienced impact during this window - including Azure Communication Services, Azure Data Factory, Azure Digital Twins, Azure Event Grid, Azure IoT Hub, Azure Red Hat OpenShift, Azure Remote Rendering, Azure Resource Mover, Azure Rights Management, Azure Spatial Anchors, Azure Synapse, and Microsoft Purview.

What went wrong and why?

Cosmos DB load balances workloads across its infrastructure, within frontend and backend clusters. Our frontend load balancing procedure had a regression that did not factor in the reduction in available cluster capacity caused by ongoing maintenance. This surfaced during a platform maintenance event in one of the frontend clusters in the North Europe region, causing the availability issues described above.
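The regression can be illustrated with a simplified sketch (hypothetical data model, not the Cosmos DB implementation): a balancing decision based on a cluster's raw capacity overloads the cluster when part of that capacity is temporarily out for maintenance, so available capacity must be computed first.

from dataclasses import dataclass

# Illustration of why load balancing must use *available* rather than raw
# capacity; the fields and numbers are hypothetical.
@dataclass
class Node:
    capacity_rps: float          # requests per second the node can serve
    in_maintenance: bool = False

def safe_target_load(nodes: list[Node], utilization_target: float = 0.7) -> float:
    """Load the cluster can safely take while some nodes are under maintenance."""
    available = sum(n.capacity_rps for n in nodes if not n.in_maintenance)
    return utilization_target * available

nodes = [Node(1000.0) for _ in range(10)]
for n in nodes[:3]:
    n.in_maintenance = True     # three nodes temporarily out for platform maintenance

# Planning against raw capacity would target 7,000 requests/second, but the
# seven remaining nodes can safely take only 4,900.
print(safe_target_load(nodes))  # 4900.0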

How did we respond?

Our monitors alerted us to the impact on this cluster. We ran two workstreams in parallel: one focused on identifying the cause of the issues themselves, while the other focused on mitigating the customer impact. To mitigate, we load balanced off the impacted cluster by moving customer accounts to healthy clusters within the region.

Given the volume of accounts we had to migrate, it took us time to safely load balance accounts – we had to analyze the state of each account individually, then systematically move each to an alternative healthy cluster in North Europe. This load balancing operation allowed the cluster to recover to a healthy operating state.
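As a simplified sketch of that mitigation loop (hypothetical data model, not the actual Cosmos DB tooling), accounts are drained from the hot cluster one at a time, each one placed on the healthy cluster with the most headroom, until the hot cluster's load falls back under its target:

# Drain accounts off an overloaded cluster; structure and numbers are hypothetical.
def rebalance(hot_cluster, healthy_clusters, target_load):
    moves = []
    # Move the heaviest accounts first so fewer migrations are needed.
    for account in sorted(hot_cluster["accounts"], key=lambda a: a["load"], reverse=True):
        if hot_cluster["load"] <= target_load:
            break
        # Pick the healthy cluster with the most spare capacity.
        destination = max(healthy_clusters, key=lambda c: c["capacity"] - c["load"])
        if destination["load"] + account["load"] > destination["capacity"]:
            continue  # would overload the destination; leave this account in place
        hot_cluster["load"] -= account["load"]
        destination["load"] += account["load"]
        moves.append((account["name"], destination["name"]))
    return moves

hot = {"name": "cluster-01", "load": 95.0,
       "accounts": [{"name": f"acct-{i}", "load": 5.0} for i in range(19)]}
others = [{"name": "cluster-02", "load": 40.0, "capacity": 100.0},
          {"name": "cluster-03", "load": 55.0, "capacity": 100.0}]
print(rebalance(hot, others, target_load=70.0))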

Although we have the ability to mark a Cosmos DB region as offline (which would trigger automatic failover activities for customers using multiple regions), we decided not to do so during this incident, as the majority of the clusters (and therefore customers) in the region were unimpacted.

How are we making incidents like this less likely or less impactful?

Already completed:

  • Fixed the regression in our load balancer procedure, to safely factor in capacity fluctuations during maintenance.

In progress:

  • Improving our monitoring and alerting to detect these issues earlier and apply pre-emptive actions. (Estimated completion: October 2022)
  • Improving our processes to reduce the impact time with a more structured manual load balancing sequence during incidents. (Estimated completion: October 2022)

How can customers make incidents like this less impactful?

Consider configuring your accounts to be globally distributed – enabling multi-region for your critical accounts would allow for a customer-initiated failover during regional service incidents like this one. For more details, refer to:
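The sketch below, using the azure-cosmos Python SDK, assumes the account has already been replicated to a second region; the endpoint, key, and region names are placeholders. With preferred regions configured, the client can direct requests to the next region if the first becomes unavailable.

from azure.cosmos import CosmosClient

# Client for a geo-replicated account (placeholders for endpoint/key/regions).
client = CosmosClient(
    url="https://<your-account>.documents.azure.com:443/",
    credential="<your-account-key>",
    preferred_locations=["North Europe", "West Europe"],
)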

More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review:

Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more:

How can we make our incident communications more useful?

We are piloting this "PIR" format as a potential replacement for our "RCA" (Root Cause Analysis) format.

You can rate this PIR and provide any feedback using our quick 3-question survey: