
July 2022

29

This is our "Preliminary" PIR that we endeavor to publish within 3 days of incident mitigation, to share our findings so far.

After our internal retrospective completes (generally 14 days from mitigation) we will publish a "Final" PIR with additional details/learnings.

What happened?

Starting as early as 08:00 UTC and continuing until 13:20 UTC on 29 July 2022, customers may have experienced connectivity issues such as network drops, latency, and/or degradation when attempting to access or manage Azure resources in multiple regions. Most of the impact was determined to have occurred in the following regions – Japan East, Canada Central, Brazil South, East Asia, East US, South Central US, East US 2, West US, West Europe, France Central, North Central US, South Africa North, Korea Central, and Southeast Asia.

What went wrong, and why?

Starting at 08:00 UTC on 29 July, the Azure WAN network began to experience a significant spike of traffic, upwards of 60 Tbps. While the event was detected immediately and automated remediation was triggered, the size of the spike impacted the ability of automatic mitigations to handle the volume. This impacted both intra-region and cross-region traffic over various network paths which included ExpressRoute.
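For illustration only, one common building block behind this kind of automated traffic mitigation is rate limiting. The minimal token-bucket sketch below is not Azure's implementation – the class, rates, and capacity are invented – but it shows how a sustained rate can be enforced while still absorbing bounded bursts.

```python
import time

class TokenBucket:
    """Illustrative token-bucket throttle: admit traffic at a sustained
    rate while absorbing short bursts up to the bucket capacity."""

    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec      # sustained admission rate
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill tokens according to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # over budget: shed, queue, or deprioritize the excess

# Example: admit roughly 10 units per second, with bursts of up to 50.
bucket = TokenBucket(rate_per_sec=10, capacity=50)
admitted = bucket.allow(cost=5)
```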

How did we respond?

We have developed several detection and mitigation algorithms that were triggered, and alerted, automatically at around 08:00 UTC when the traffic spike started. However, the spike reached a volume of traffic not seen previously, so while the mitigation mechanisms were successfully triggered, the substantial volume caused the mitigations to take longer than usual to return traffic levels to normal. Traffic returned to normal at 13:30 UTC without additional remediations being applied, at which point the impact to customers was mitigated for this event.

How are we making incidents like this less likely or less impactful?

We are implementing service repairs as a result of this incident, including but not limited to:

  • Changes to network configurations to better handle traffic spikes. [In Progress]
  • Applying updated throttling to protect against network traffic spikes. [In Progress]

How can we make our incident communications more useful?

We are piloting this "PIR" format as a potential replacement for our "RCA" (Root Cause Analysis) format.

You can rate this PIR and provide any feedback using our quick 3-question survey:


21

What happened?

Between 03:47 UTC and 13:30 UTC on 21 Jul 2022, customers using SQL Database and SQL Data Warehouse in West Europe may have experienced issues accessing services. During this time, new connections to databases in this region may have resulted in errors or timeouts. Existing connections would have remained available to accept new requests, however if those connections were terminated and then re-established, they may have failed.

New connections to the region and related management operations began failing at 03:47 UTC; partial recovery began at 06:12 UTC, and full mitigation was reached at 13:30 UTC. Although we initially did not declare mitigation until 18:45 UTC, a thorough impact analysis confirms that failure rates had returned to pre-incident levels earlier. No failures that occurred after 13:30 UTC were a direct result of this incident.

During this impact window, several downstream Azure services that were dependent on the SQL Database service in the region were also impacted - including App Services, Automation, Backup, Data Factory V2, and Digital Twins.

Customers that had configured active geo-replication and failover groups would have been able to recover by performing a forced-failover to the configured geo-replica - more information can be found here

What went wrong, and why?

For context, connections to the Azure SQL Database service are received and routed by regional gateway clusters. Each region has multiple gateway clusters for redundancy - traffic is distributed evenly between the clusters under normal operations, and automatically rerouted in case one of the clusters becomes unhealthy. Each gateway cluster has a persisted cache of metadata about each database in the system, which is used for connection routing. These caches allow gateway nodes to scale out without contending on a single source of metadata. There are multiple caches per gateway cluster, and each node can fetch data from whichever cache is available. The West Europe region has two gateway clusters, and each of these clusters has two persisted metadata caches.
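As a simplified illustration of that redundancy, a gateway node reading routing metadata can fall back across the available persisted caches. The sketch below is hypothetical – the class and method names are not the actual gateway code – but it shows why a single cache failure is tolerated while losing all caches blocks connection routing.

```python
class MetadataCacheUnavailable(Exception):
    pass

def read_routing_metadata(database_id: str, caches: list) -> dict:
    """Try each persisted metadata cache in turn; any healthy cache can
    serve the routing entry, so a single cache failure is tolerated."""
    last_error = None
    for cache in caches:
        try:
            return cache.get(database_id)   # hypothetical cache client call
        except MetadataCacheUnavailable as err:
            last_error = err                # this cache is down; try the next one
    # All caches unavailable: new connections cannot be routed, which is
    # what happened region-wide in this incident.
    raise MetadataCacheUnavailable(f"no metadata cache reachable: {last_error}")
```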

An operator error led to an incorrect action being performed in close sequence on all four persisted metadata caches. The action resulted in a configuration change that made the caches unavailable to the regional gateway processes. This resulted in all regional gateway processes in West Europe becoming unable to access connection routing metadata, leading to the regional incident from 03:47 UTC. New connections would have failed as the gateways were not able to read routing metadata, but connections that were already established would have continued to work. Management operations on server and database resources would also have been impacted, as some workflows also rely on connection routing.

A secondary impact of the issue was that our internal telemetry service in the West Europe region became overloaded with queries. This caused the telemetry ingestion to fall behind by a few hours and telemetry queries were also timing out. The telemetry issues contributed to delays in automatically notifying impacted customer subscriptions via Azure Service Health.

Because some customers were receiving automatic notifications of impact within 15 minutes, we assumed that the notification pipeline was working as designed. Only later in the event did we understand that communications were not reaching all impacted subscriptions. As a result, we broadened our communications to every customer in the region and published an update on the Azure status page.

Additionally, automatic failover for customers who had set up failover groups with an auto-failover configuration was impacted by the telemetry issues (manual failover was not impacted).

How did we respond?

This regional incident was detected by our availability monitors, and we were on the investigation bridge within 13 minutes of customer impact. We traced the issue to the action that had been performed erroneously and determined a way to reverse it. Another option would have been to rebuild entirely new caches, but it was determined that a rebuild would take much longer than fixing the caches in-place, so we proceeded to formulate a method to revive the caches in-place.

On applying this initial mitigation, the caches came back up, which resulted in a partial recovery of the incident at 06:18 UTC. While success rates improved significantly at this point (~60%), the recovery was considered 'partial' for two reasons. Firstly, a timing issue in applying the mitigation caused gateways in one of the two clusters to cache incorrect connection strings for the metadata caches. Secondly, the metadata caches were not receiving updates for changes that had happened while the caches were unavailable.

The first issue was mitigated by restarting all the gateway nodes in the cluster, which needed to be done at a measured pace to avoid overloading the recovering metadata caches. As the restarts progressed, we saw success rates continue to improve, steadily reaching 97% at around 07:58 UTC, once all restarts had completed. At this point, connections to any database that had not undergone changes (for example, service tier updates) during the incident would have been successful.

The last step was to determine which persistent cache entries were stale (missed updates) and refresh them to a consistent state. We developed and executed a script to refresh cache entries, with the initial refreshes being done manually while the script was being developed. The success rate recovered to 99.9% for the region at 11:10 UTC. We then proceeded to identify and mitigate any residual issues, and also started the process to confirm recovery with customers and downstream impacted Azure services.
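Purely as an illustration of that refresh step, a pass over the persisted caches might look roughly like the sketch below. The function names, batch size, and pacing are assumptions, not the actual script.

```python
import time

def refresh_stale_entries(cache, source_of_truth, batch_size=100, pause_secs=1.0):
    """Compare persisted cache entries against the authoritative metadata
    store and rewrite any that missed updates, pacing the work so the
    recovering caches are not overloaded."""
    stale = [
        key for key, entry in cache.entries()           # hypothetical iteration API
        if entry.version < source_of_truth.version(key)
    ]
    for i in range(0, len(stale), batch_size):
        for key in stale[i:i + batch_size]:
            cache.put(key, source_of_truth.get(key))    # refresh to a consistent state
        time.sleep(pause_secs)                          # measured pace
    return len(stale)
```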

Based on login success rate telemetry, the incident mitigation time was determined to be 13:30 UTC. Mitigation communications were sent out to all impacted customers at 19:16 UTC, after a thorough validation that no residual impact remained.

How are we making incidents like this less likely or less impactful?

We are implementing a number of service repairs as a result of this incident, including but not limited to:

Completed:

  • Programmatically blocking any further executions of the action that led to the metadata caches becoming unavailable.

In progress:

  • Implementing stronger guardrails on impactful operations to prevent human errors like the one that triggered this incident.
  • Implementing in-memory caching of connection routing metadata in each gateway process, to further increase resiliency and scalability.
  • Implementing throttling on telemetry readers to prevent ingestion from falling behind.
  • Removing dependency of automatic-failover on telemetry system.
  • Investigating other service resiliency repairs as determined by our internal retrospective of this incident, which is ongoing.

How can our customers and partners make incidents like this less impactful?

Customers who had configured active geo-replication and failover groups would have been able to recover by performing a forced-failover to the configured geo-replica.
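As an illustration of that recovery path, a forced failover of a failover group can be initiated programmatically. The sketch below assumes the azure-identity and azure-mgmt-sql Python packages (operation names vary slightly between SDK versions) and uses placeholder resource names.

```python
# pip install azure-identity azure-mgmt-sql
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

SUBSCRIPTION_ID = "<subscription-id>"         # placeholders
RESOURCE_GROUP = "<resource-group>"
SECONDARY_SERVER = "<secondary-server-name>"  # server hosting the geo-replica
FAILOVER_GROUP = "<failover-group-name>"

client = SqlManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Forced (potentially data-losing) failover to the secondary, intended for
# cases where the primary region is unavailable; prefer a planned failover
# whenever the primary is healthy.
poller = client.failover_groups.begin_force_failover_allow_data_loss(
    RESOURCE_GROUP, SECONDARY_SERVER, FAILOVER_GROUP
)
poller.result()  # wait for the failover group to report its new primary
```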

More guidance for recovery in regional failure scenarios is available at:

How can we make our incident communications more useful?

We are piloting this "PIR" format as a potential replacement for our "RCA" (Root Cause Analysis) format.

You can rate this PIR and provide any feedback using our quick 3-question survey:

June 2022

29

What happened?

Across 28 and 29 June, 2022, two independent Azure networking service incidents occurred simultaneously. While very few customer subscriptions were impacted by both, the overlapping nature of the incident timelines meant that we previously delivered a Preliminary Post Incident Review (PIR) that addressed both issues. After thorough investigations, we are publishing a Final PIR for each incident independently. Below is the Final PIR for Incident YKDK-TT8, summarizing our EMEA-impacting Wide Area Network (WAN) issue – we have separately published a Final PIR for Incident YVTL-RS0, summarizing our multi-region Software Load Balancer (SLB) issue.

Between 02:40 UTC and 20:14 UTC on 29 June 2022, a subset of customers experienced intermittent network failures and packet loss due to an issue with a router in the Microsoft Wide Area Network (WAN) in the London, UK area. We have determined that one of several WAN routers handling traffic transiting this area experienced a partial hardware failure, which caused it to improperly route 0.4% of the packets it normally handles. This caused intermittent periods of packet loss for a subset of customers whose traffic went through that router, which included a subset of the traffic being served to and from four regions – UK West, UK South, Europe West and Europe North.

Some customers using ExpressRoute and/or VPN Gateway were impacted, as their traffic traversed this router. Affected customers would have experienced lower transmission throughput, long connection delays and, in the most extreme cases, connection failures.

What went wrong, and why?

Upon investigating this issue, we identified a fault in part of our WAN infrastructure located in London, UK. One specific router in London intermittently started dropping network packets on one line card. This impacted 0.4% of the packets traversing it and caused intermittent network failures for a subset of customer traffic flowing to/from the UK West, UK South, Europe West, and Europe North regions.

Because end users experienced this issue intermittently, it was characterized as a ‘gray’ failure, so alerting was not immediately able to identify that there was a problem. On this single faulty router, only four of its 140 physical ports had failed, which caused a section of the router to malfunction intermittently without producing an actionable error message.

For Azure VPN Gateway customers, on-premises traffic passes through the WAN routers on its path to the Azure VPN Gateway, so intermittent packet drops on the core router disrupted connectivity for Azure VPN scenarios.

How did we respond?

When first investigating the impact of this issue, we initially invested significant effort attempting to identify software-based causes, because failures were being observed across 17 regions. Router issues typically only impact a small number of regions, and we identified a number of concurrent software changes that had been deployed recently.

As the investigation continued, engineers were able to separate the impact from an unrelated Software Load Balancer (SLB) incident, and at approximately 19:15 UTC the remaining impacted traffic was triangulated to WAN routers located near London – with related failures only impacting a small subset of traffic across four regions, not 17.

By 20:00 UTC, engineers were able to identify the partially malfunctioning router. Within the next 15 minutes, the engineers removed the unhealthy router from the network – at which point redundant routers took over the traffic, and fully resolved the issue.

As mentioned above, this specific partial hardware failure did not produce any specific message in the router that could be tracked or monitored, and the overall low intensity and intermittence of the issue (<0.4% of packets dropped, because of 4 faulty ports out of 140) caused our traffic-quality alerting to miss it, so the investigation had to resort to manual inspection of the router’s behavior and second-degree signals.
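For illustration, detecting this kind of low-intensity, long-duration ('gray') loss amounts to evaluating a small loss ratio over a long window rather than alerting on large spikes. The thresholds and names in the sketch below are invented, not production alerting logic.

```python
from collections import deque

class GrayLossDetector:
    """Flag a path when a small packet-loss ratio persists for a long time,
    rather than waiting for a large spike. Thresholds are illustrative."""

    def __init__(self, window_minutes=60, loss_threshold=0.002):
        self.samples = deque(maxlen=window_minutes)  # one (sent, lost) sample per minute
        self.loss_threshold = loss_threshold         # e.g. 0.2% sustained loss

    def observe(self, packets_sent: int, packets_lost: int) -> bool:
        self.samples.append((packets_sent, packets_lost))
        sent = sum(s for s, _ in self.samples)
        lost = sum(l for _, l in self.samples)
        if len(self.samples) < self.samples.maxlen or sent == 0:
            return False                             # not enough history yet
        return (lost / sent) >= self.loss_threshold  # sustained low-grade loss
```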

How are we making outages like this less likely or less impactful?

Already completed:

  • We enhanced our WAN traffic-quality alerting systems to more accurately identify these low-intensity but long-duration issues, to improve triangulation in the future.
  • We have worked with our hardware vendor to improve their error messages when partial failures like this happen, and are incorporating this new signal into our existing automatic alerts and automatic mitigations.

Work in progress:

  • We are replacing the faulty components on the failing London router. This will be brought back into rotation slowly and safely, as we have sufficient redundant capacity with other routers in the network.
  • We are investing in improved VPN Gateway end-to-end diagnostics to help identify and mitigate this class of issue more quickly.

In the longer term:

  • We are planning enhancements to ExpressRoute traffic-quality monitoring, to better identify impact to individual customers during these kinds of scenarios.

How can we make our incident communications more useful?

We are piloting this "PIR" template as a potential replacement for our "RCA" (Root Cause Analysis) template.

You can rate this PIR and provide any feedback using our quick 3-question survey:


28

What happened?

Across 28 and 29 June, 2022, two independent Azure networking service incidents occurred simultaneously. While very few customer subscriptions were impacted by both, the overlapping nature of the incident timelines meant that we previously delivered a Preliminary Post Incident Review (PIR) that addressed both issues. After thorough investigations, we are publishing a Final PIR for each incident independently. Below is the Final PIR for Incident YVTL-RS0, summarizing our multi-region Software Load Balancer (SLB) issue – we have separately published a Final PIR for Incident YKDK-TT8, summarizing our EMEA-impacting Wide Area Network (WAN) issue.

Between 05:26 UTC on 28 June 2022 and 04:00 UTC on 1 July 2022, a subset of customers experienced intermittent network failures and packet loss while trying to connect to resources behind a Software Load Balancer (SLB). This incident impacted customers whose Load Balancer frontend IPs (VIPs) were in either of two specific scenarios:

(1) Azure Firewall customers with more than 15 Destination IP addresses (DIPs) behind an Azure Load Balancer, combined with a requirement for both upstream and downstream packets of a single flow to always land on the same DIP. This scenario impacted services in 14 regions.

(2) A service endpoint or private link endpoint connection to a service in the region with more than 15 DIPs – for example, SQL and/or Storage. Impacted customers experienced intermittent connectivity failures and packet loss while trying to connect to their resources. This scenario impacted services in three regions.

What went wrong, and why?

The Software Load Balancer (SLB) is our software-based routing and load balancing service for all network traffic within our datacenters and clusters. At the highest level, it is a set of servers called multiplexers or MUXs that form an SLB scale unit. Each scale unit hosts many VIPs (Virtual IPs) that serve traffic for multiple DIPs (Destination IPs, the IPs of the resources behind the VIP). The MUXs direct traffic to the DIPs that are in service behind the VIP, whose health is monitored dynamically using periodic probes.

Upon investigating this issue, we identified a change in our latest code deployment which caused some MUX instances to store the backend DIPs in a different order. The change only affected load balancers with more than 15 DIPs in the backend pool. The order of the DIPs is important, as it is tied directly to the process used to select the backend server that receives a particular connection. The ordering change impacts traffic that passes through more than one MUX. There are two scenarios where this can occur – the first is when traffic passes through a firewall on both the incoming and return paths; the second is when a particular connection moves from one MUX to another MUX, due to a change in routing path through the physical network.
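To see why the ordering matters, consider a deliberately simplified selection scheme that hashes the flow and indexes into the ordered DIP list (the real MUX selection logic is more involved). Two MUXs that receive packets for the same flow but hold the DIP list in different orders will pick different backends:

```python
import hashlib

def pick_dip(flow_tuple, dips):
    """Simplified backend selection: hash the flow and index into the
    ordered DIP list. This only illustrates why the list order matters."""
    digest = hashlib.sha256(repr(flow_tuple).encode()).hexdigest()
    return dips[int(digest, 16) % len(dips)]

flow = ("203.0.113.7", 51321, "10.0.0.4", 443, "tcp")
dips_mux_a = [f"10.1.0.{i}" for i in range(1, 17)]   # 16 DIPs in the original order
dips_mux_b = list(reversed(dips_mux_a))              # same DIPs, different order

# With consistent ordering both MUXs agree on the backend; with different
# ordering the same flow maps to different backends, breaking flow affinity.
print(pick_dip(flow, dips_mux_a) == pick_dip(flow, dips_mux_b))  # typically False
```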

The first scenario impacted Azure Firewall customers, but only those with more than 15 firewall instances behind a Load Balancer. That’s because Azure Firewall is a stateful firewall that leverages Azure Load Balancer behind the scenes, scaling horizontally to handle traffic load. Since flows are pinned to instances for firewall processing, the load balancer issue meant that packets for a given flow could be received at different instances and be dropped, impacting customer traffic. The number of instances behind the Firewall’s internal Load Balancer is determined by a customer’s traffic pattern. Not every Firewall’s internal Load Balancer has this many instances, which led to delays in correlating customer-reported issues with the behavior of the platform.

The second scenario occurred due to a change in the flow identifier for an IPv6 packet routing header used for Private Link and service endpoint traffic. Traffic that used this technology would have seen drops as the connection moved from one MUX to another MUX. The new MUX would forward the connection to a different server in the backend pool, which would then reject the request.
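As a simplified illustration of the second scenario (the actual routing-header and ECMP details differ), upstream devices typically spread traffic across MUXs by hashing header fields that include the IPv6 flow identifier; if that identifier changes mid-connection, the hash changes and later packets land on a different MUX:

```python
def ecmp_next_hop(src, dst, flow_id, muxes):
    """Simplified ECMP choice: hash header fields (including the flow
    identifier) to pick a MUX. Illustrative only."""
    return muxes[hash((src, dst, flow_id)) % len(muxes)]

muxes = ["mux-1", "mux-2", "mux-3", "mux-4"]
before = ecmp_next_hop("2001:db8::10", "2001:db8::53", 0x12345, muxes)
after = ecmp_next_hop("2001:db8::10", "2001:db8::53", 0x54321, muxes)

# A changed flow identifier can move the connection to a different MUX,
# which then forwards it to a different backend server that rejects it.
print(before, after)
```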

When we deploy updates to our cloud, we use a system called Safe Deployment Practices (SDP). SDP is how we manage change automation so that all code and configuration updates go through well-defined stages, to catch regressions and bugs before they reach customers or – if they do make it past the early stages – impact the smallest possible number of customers. Although telemetry showed a small increase in errors (and a number of customer support requests had been raised), the intermittent nature of this failure meant that our health checks did not correlate these signals to the recent change. This meant that the code had propagated to 14 of our 60+ regions before we correlated customer impact to this specific deployment.
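For illustration only, the correlation that was missed here comes down to comparing error rates before and after each rollout stage and tying any sustained increase back to the change. The names and thresholds below are invented for the sketch.

```python
def flag_suspect_deployments(deployments, error_series, min_relative_increase=0.5):
    """For each rollout stage, compare the error rate after the stage against
    the baseline before it; small intermittent increases are easy to miss
    unless they are explicitly tied back to the change. Illustrative only.

    deployments: list of {"change": str, "time": float}
    error_series: list of (timestamp, error_rate) samples, one per minute
    """
    suspects = []
    for dep in deployments:
        before = [e for t, e in error_series if t < dep["time"]][-60:]   # prior hour
        after = [e for t, e in error_series if t >= dep["time"]][:60]    # next hour
        if not before or not after:
            continue
        baseline = sum(before) / len(before)
        observed = sum(after) / len(after)
        if baseline > 0 and (observed - baseline) / baseline >= min_relative_increase:
            suspects.append(dep["change"])
    return suspects
```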

How did we respond?

We engaged multiple investigation and mitigation workstreams in parallel. While we worked to understand the issue, we initially used targeted operations to mitigate VIPs manually for specific customers that were identified as impacted.

Simultaneously, we developed a hotfix for the bug that could be applied without causing further impact to customer traffic. After the hotfix was tested, it was rolled out using a phased approach as per our Safe Deployment Practices.

Additionally, we created automation to scan all VIPs every five minutes and mitigate any VIP found in an inconsistent state, minimizing the impact to customers while the hotfix rolled out. This script continued to run until our sustainable mitigation was completely rolled out.
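Illustratively, a reconciliation loop of this kind polls for inconsistent state on a fixed interval and repairs whatever it finds. The helper functions in the sketch below are hypothetical, not the actual automation.

```python
import time

def reconcile_vips(list_vips, is_consistent, remediate, interval_secs=300):
    """Every five minutes, scan all VIPs and remediate any whose MUXs
    disagree on backend state. The helper callables are injected so the
    loop itself stays generic; all names are hypothetical."""
    while True:
        for vip in list_vips():
            if not is_consistent(vip):
                remediate(vip)      # e.g. force the MUXs to reload DIP state
        time.sleep(interval_secs)
```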

How are we making outages like this less likely or less impactful?

Already completed:

  • We have restored consistent ordering for our backend instances, returning to the pre-incident state. The order of these is tied directly to the process used to select the backend server that receives a particular connection.

Work in progress:

  • To address SLB scenario #1 above, we are investing in additional monitoring for Azure Firewall to detect cases when upstream and downstream packets on a single flow do not reach the same DIP.
  • To address SLB scenario #2 above, we are fixing the IPv6 flow identifier change so that it remains consistent for the lifetime of the connection and, as a result, all packets will land on a single MUX.
  • For Azure Firewall, we are investing in improved packet capture support specific to firewalls (which will help to identify and mitigate issues like this more quickly in the future), as well as enhanced logging for TCP flows (to help isolate the issue more effectively, by logging invalid packets, RST, FIN and FIN-ACK packets).

In the longer term:

How can we make our incident communications more useful?

We are piloting this "PIR" template as a potential replacement for our "RCA" (Root Cause Analysis) template.

You can rate this PIR and provide any feedback using our quick 3-question survey:


7

What happened?

Between 02:41 and 14:30 UTC on 07 Jun 2022, a subset of customers experienced difficulties connecting to resources hosted in one particular Availability Zone (AZ) of the East US 2 region. This issue impacted a subset of storage and compute resources within one of the region’s three Availability Zones. As a result, Azure services with dependencies on resources in this zone also experienced impact.

Since the vast majority of impacted services already support Availability Zones, customers using always-available and/or zone-redundant services would have observed that this zone-specific incident did not affect the availability of their data and services. Five services (Application Insights, Log Analytics, Managed Identity Service, Media Services, and NetApp Files) experienced regional impact as a result of this zonal issue; these five services are already working towards enabling AZ support. Finally, while App Service instances configured to be zone-redundant would have stayed available from the other AZs, control plane issues were observed regionally that may have prevented customers from performing service management operations during the impact window.

What went wrong, and why?

Microsoft experienced an unplanned power oscillation in one of our datacenters within one of our Availability Zones in the East US 2 region. Components of our redundant power system created unexpected electrical transients, which resulted in the Air Handling Units (AHUs) detecting a potential fault, and therefore shutting themselves down pending a manual reset.

The electrical transients were introduced by anomalous component behavior within Uninterruptible Power Supply (UPS) modules, and cascaded throughout the datacenter electrical distribution system including electrical power supply to the mechanical cooling plant. As a result of the AHU self-protective shutdown, cooling to the datacenter was interrupted. Although the electrical transients did not impact our compute, networking, or storage infrastructure – which did not lose power – the mechanical cooling plant shutdown led to an escalating thermal environment, which induced protective shutdown of a subset of this IT infrastructure prior to the restoration of cooling.

Detailed analysis resulted in an adjustment to the UPS gain settings, preventing any further oscillations. These oscillations are the power equivalent of having a microphone too close to an amplifier – just as setting the volume too high can trigger a self-sustained sound oscillation, power oscillations can occur when the gain of the UPS is too high. The normal process of adding load to the UPS units results in an increase in gain and, in this case, the gain went high enough to cause the oscillations to occur. Adjusting the control gain setting lower in the UPS returns the units to stable operation for all load values, preventing disruptions to any other infrastructure such as the AHUs.
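The amplifier analogy can be made concrete with a toy feedback model (purely illustrative, not the UPS control law): a disturbance fed back with loop gain below 1 dies out, while a gain above 1 grows into a sustained oscillation.

```python
def simulate_feedback(gain, steps=10, disturbance=1.0):
    """Toy feedback loop: each step the disturbance is fed back scaled by
    the loop gain (the sign flip models the corrective action). Not the
    actual UPS control law, just the stability intuition."""
    x, history = disturbance, []
    for _ in range(steps):
        x = -gain * x            # feedback with the given loop gain
        history.append(round(x, 3))
    return history

print(simulate_feedback(gain=0.8))  # |gain| < 1: the oscillation dies out
print(simulate_feedback(gain=1.2))  # |gain| > 1: the oscillation grows
```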

Subsets of equipment including network, storage, and compute infrastructure were automatically shut down, both to prevent damage to hardware and to protect data durability under abnormal temperatures. As a result, Azure resources and services with dependencies on these underlying resources experienced availability issues during the impact window. A significant factor of downstream service impact was that our storage infrastructure was amongst the hardware most affected by these automated power and thermal shutdowns. Eight storage scale units were significantly impacted – due to thermal shutdowns directly and/or loss of networking connectivity, itself due to thermal shutdowns of corresponding networking equipment. These scale units hosted Standard Storage including LRS/GRS redundant storage accounts, which in turn affected Virtual Machines (VMs) using Standard HDD disks backed by this storage, as well as other services and customers directly consuming blob/file and other storage APIs.

The platform continuously monitors input/output transactions from the VMs to their corresponding storage. So even if the scale unit running a VM’s underlying compute was operational, when transactions did not complete successfully within 120 seconds (inclusive of retries), connectivity to the virtual disk was considered lost and a temporary VM shutdown was initiated. Any workloads running on these impacted VMs, including first-party Azure services and third-party customer services, would have been impacted as their underlying hosts were either shut down by thermal triggers, or had their storage/networking impacted by the same.
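As a simplified illustration of that behavior (the 120-second timeout is from the description above; everything else is invented), a watchdog of this kind might be sketched as:

```python
import time

DISK_TIMEOUT_SECS = 120  # transactions must succeed within 120s, retries included

def storage_watchdog(vm, now=time.monotonic):
    """If no storage transaction for the VM has completed successfully within
    the timeout, treat virtual-disk connectivity as lost and trigger a
    temporary VM shutdown. The vm object and its attributes are hypothetical."""
    if now() - vm.last_successful_io >= DISK_TIMEOUT_SECS:
        vm.pause(reason="virtual disk connectivity lost")   # temporary shutdown
        return True
    return False
```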

How did we respond?

As soon as the AHUs shut themselves down as a result of the power disturbance, alerts notified our onsite datacenter operators. We deployed a team to investigate, who confirmed that the cooling units had shut themselves down pending manual intervention. Following our Standard Operating Procedure (SOP), the team attempted to perform manual resets on the AHUs, but these were not successful. Upon further investigation the onsite team identified that, due to the nature of this disturbance, recovering safely would require resetting the AHUs while running on backup power sources, to prevent the power oscillation pattern on the utility line from triggering a fault. This meant that two primary steps were required to recover – firstly, the impacted datacenter was manually transferred from utility power to backup power sources, our onsite generators. By doing this, we changed the characteristics of the power lineup to prevent the oscillations from recurring. Secondly, the AHUs were then manually reset, which restored cooling to the datacenter.

Once temperatures returned to normal levels, some hardware including network switches needed to be manually power cycled to be brought back online. The network hardware and components serve different compute and storage resources for the scale units in this datacenter, including host instances for other applications and services. Onsite engineers then manually reviewed the status of various infrastructure components, to ensure that everything was working as intended. 

Following the restoration of most storage network connectivity, recovery activities included diagnosing and remediating any host nodes that had entered an unhealthy state due to loss of network, and triaging any other hardware failures to ensure that all storage infrastructure could be brought back online. Even after all storage nodes returned to a healthy state, two storage scale units still exhibited slightly lower API availability compared to before this incident. It was determined that this was caused by a limited number of storage software roles being in an unhealthy state – those roles were restarted, which restored full API availability for those scale units.

Since our compute continuously monitors for storage access, the compute VMs automatically started coming back up as storage and networking recovered. This worked as expected in all cases except on one scale unit, where the physical machines had been shut down and did not recover automatically. Because the VMs were originally down due to storage/networking issues, this was only detected once storage recovered, and we manually recycled the nodes to bring them back online. Upon investigation, an issue with the cluster power management unit was found to have prevented automatic recovery.

Two specific Azure services (ExpressRoute and Spatial Anchors) performed manual mitigations to fail customers over to use the other two Availability Zones within the region. Thus, while some impacted services recovered even earlier, full mitigation of this incident was declared at 14:30 UTC.

After cooling was restored and infrastructure was brought back online, our onsite teams opted to leave the datacenter running on backup power sources during additional investigations and testing, both focused on the UPS gain setting. In consultation with our critical environment hardware suppliers, we ran comprehensive testing to confirm the relevant gain settings based on the amount of load across the system. After these settings were deployed, we have since returned the datacenter back to our normal utility power feed.

How are we making incidents like this less likely or less impactful?

Already completed:

  • Updates to the gain setting, described above, have been deployed, and the impacted datacenter has returned to utility power. We are confident that this has mitigated the risk of the power oscillation issue that was triggered.
  • Furthermore, our critical environment team has assessed systemic risk across all our datacenters globally, to ensure that none are at risk of the same situation. Of our 200+ Azure datacenters across 60+ regions, we identified only one other datacenter (beyond the impacted datacenter in East US 2) that had a similar power draw that could have potentially triggered a similar oscillation – this risk has since been mitigated with a similar configuration change.

Work in progress:

  • We have identified opportunities to improve our tooling and processes to flag anomalies more quickly, and are in the process of fine-tuning our alerting to inform onsite datacenter operators more comprehensively.
  • We are investigating why a subset of networking switches took longer than expected to recover. Although these were manually mitigated during the incident, we are exploring ways to optimize this recovery to ensure that customer workloads are brought online more quickly.
  • Similarly, we continue to diagnose a small subset of storage and compute nodes that remained in unhealthy states after restoration of networking, to streamline their recovery. This includes addressing a driver-related issue that prevented compute nodes in one scale unit from recovering automatically.
  • We are addressing some specific monitoring gaps including for compute nodes that have not been powered back on, specifically for scenarios in which they had been automatically shut down.

In the longer term:

  • We are developing a plan for fault injection testing of relevant critical environment systems, in partnership with our industry partners, to be even more proactive in identifying and remediating potential risks.
  • We are exploring improved supplier diversity in the critical environment space, to minimize potential single points of failure within our hardware lineup.
  • We are investing in improved engineering tooling and processes that will accelerate the identification and remediation of unhealthy node states during incidents of this scale.
  • We have several workstreams in motion that will further improve storage node start-up times; learnings from this incident have validated the need to prioritize these optimizations.
  • Finally, we continue to invest in expanding how many Azure services support Availability Zones, so that customers can opt for automatic replication and/or architect their own resiliency across services:

How can our customers and partners make incidents like this less impactful?

  • Consider using Availability Zones (AZs) to run your services across physically separate locations within an Azure region. To help services be more resilient to datacenter-level failures like this one, each AZ provides independent power, networking, and cooling. Many Azure services support zonal, zone-redundant, and/or always-available configurations:
  • Consider which Storage redundancy options are right for your critical applications. Zone-redundant storage (ZRS) remains available throughout a zone-localized failure like the one in this incident, while geo-redundant storage (GRS) enables account-level failover in case the primary region endpoint becomes unavailable (see the illustrative sketch after this list):
  • Consider using Azure Chaos Studio to recreate the symptoms of this incident as part of a chaos experiment, to validate the resilience of your Azure applications. Our library of faults includes VM shutdown, network block, and AKS faults that can help to recreate some of the connection difficulties experienced during this outage – for example, by targeting all resources within a single Availability Zone:
  • More generally, consider evaluating the reliability of each of your critical Azure applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review:
  • Finally, ensure that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, web-hooks, and more:
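As an illustration of the ZRS option mentioned in the list above, a zone-redundant storage account can be created with the Azure SDK for Python. The sketch assumes the azure-identity and azure-mgmt-storage packages (parameter shapes vary by SDK version) and uses placeholder names.

```python
# pip install azure-identity azure-mgmt-storage
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

SUBSCRIPTION_ID = "<subscription-id>"   # placeholders
RESOURCE_GROUP = "<resource-group>"
ACCOUNT_NAME = "<storageaccountname>"

client = StorageManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Zone-redundant storage (ZRS) keeps synchronous copies across Availability
# Zones, so a single-zone failure like this one does not take the account offline.
poller = client.storage_accounts.begin_create(
    RESOURCE_GROUP,
    ACCOUNT_NAME,
    {
        "location": "eastus2",            # example region
        "kind": "StorageV2",
        "sku": {"name": "Standard_ZRS"},  # zone-redundant SKU
    },
)
account = poller.result()
```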

How can we make our incident communications more useful?

We are piloting this “PIR” format as a potential replacement for our “RCA” (Root Cause Analysis) format.

You can rate this PIR and provide any feedback using our quick 3-question survey:

May 2022

31

Summary of impact: Between 21:35 UTC on 31 May and 09:54 UTC on 01 Jun 2022, you were identified as a customer who may have experienced significant delays in the availability of logging data for resources such as sign in and audit logs, for Azure Active Directory and related Azure services. This impact may also have resulted in missed or misfired alerts, and issues accessing tools such as Microsoft Graph, the Azure portal, Azure Application Insights, Azure Log Analytics, and/or PowerShell. During this time, Azure Resource Manager (ARM) dependent services may also have experienced CRUD (create, read, update, and delete) service operation failures and/or issues communicating with other Azure services.

Root Cause: As part of continuous service improvements, a recent change to the underlying infrastructure that was intended to optimize performance inadvertently caused the observed impact. Although the issue was detected by our telemetry, the service health degradation only appeared in the final phase of the rollout. This underlying infrastructure has both regional and global endpoints, to facilitate the varying needs of dependent services; when global endpoints become unhealthy, they have a global impact on services that depend on them. In this incident, the impact manifested as the change reached the final phase of the rollout, when it was deployed to high-traffic regions and global endpoints. The change had been rolling out over the previous week with no issues observed.
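Purely as an illustration of the rollout-phasing repair listed under Next Steps (the phase names and health check below are invented), a rollout that reaches high-traffic and global endpoints last, with a health gate between phases, looks roughly like:

```python
def phased_rollout(change, phases, apply_change, is_healthy):
    """Apply a change phase by phase, reaching high-traffic and global
    endpoints last, and stop at the first sign of health degradation.
    Phase names and helper callables are invented for illustration."""
    for phase in phases:                 # e.g. canary -> low-traffic regions
        apply_change(change, phase)      #      -> high-traffic regions -> global endpoints
        if not is_healthy(phase):
            return f"halted at {phase}: rolling back or hotfixing"
    return "rollout complete"
```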

Mitigation: Initially, we attempted to roll back the impactful change; however, due to internal service dependencies, this approach was unsuccessful. As a result, we deployed a hotfix to the affected underlying infrastructure, which mitigated all customer impact.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Disabling the impactful change globally, across all of the affected underlying infrastructure [Completed]
  • Updating our standard operating procedures (SOP) to include specific mitigation guidance for similar scenarios to avoid rollback and help to reduce time to mitigation [Will be completed by 6/8/2022]
  • Adjusting our rollout plan for similar changes to the underlying infrastructure, to add rollout phases that will help reduce impact to customers in similar scenarios [Will be completed by 6/8/2022]
  • Migrating the underlying infrastructure to a new architecture that will eliminate internal service dependencies which prevented us from rolling back the change. This will also improve the speed of our repair actions, to help significantly shorten time to mitigation during similar impact scenarios [Rollout will begin in July 2022]

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: