31 January 2023

Summary of Impact: Between 07:01 UTC on 31 Jan 2023 and 01:00 UTC on 01 Feb 2023, customers using Virtual Machines in East US 2 may have received error notifications when performing service management operations - such as create, delete, update, scale, start, and stop - for resources hosted in this region. The impact was limited to a single availability zone.


Preliminary Root Cause: We determined that due to a configuration issue, two partitions of the backend service became unhealthy in a single availability zone. Due to dependencies between services, this manifested as failures when performing service management operations for a subset of Virtual Machines within the availability zone.


Mitigation: We applied a manual configuration change to mitigate the issue.


Next steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts; video tutorials and how-to documentation are available.

25 January 2023

This is our Preliminary PIR that we endeavor to publish within 3 days of incident mitigation, to share what we know so far.

After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details/learnings.

What happened?

Between 07:05 UTC and 12:43 UTC on 25 January 2023, customers experienced issues with network connectivity, manifesting as high network latency and/or timeouts when attempting to connect to resources hosted in Azure regions, as well as other Microsoft services including Microsoft 365 and Power Platform. While most regions and services had recovered by 09:00 UTC, intermittent packet loss issues were fully mitigated by 12:43 UTC. This incident also impacted Azure Government cloud services that were dependent on Azure public cloud.

What went wrong and why?

We determined that a change made to the Microsoft Wide Area Network (WAN) impacted connectivity between clients on the internet and Azure, connectivity across regions, as well as cross-premises connectivity via ExpressRoute. As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables. During this re-computation process, the routers were unable to correctly forward packets traversing them. The command that caused the issue has different behaviors on different network devices, and it had not been vetted using our full qualification process on the router on which it was executed.

How did we respond?

Our monitoring initially detected DNS and WAN-related issues at 07:12 UTC. We began investigating by reviewing all recent changes. By 08:10 UTC, the network started to recover automatically. By 08:20 UTC, as the automatic recovery was happening, we identified the problematic command that triggered the issues. Networking telemetry shows that nearly all network devices had recovered by 09:00 UTC, by which point the vast majority of regions and services had recovered. The final networking equipment recovered by 09:35 UTC.

Due to the WAN impact, our automated systems for maintaining the health of the WAN were paused, including the systems for identifying and removing unhealthy devices, and the traffic engineering system for optimizing the flow of data across the network. Due to the pause in these systems, some paths in the network experienced increased packet loss from 09:35 UTC until those systems were manually restarted, restoring the WAN to optimal operating conditions. This recovery was completed at 12:43 UTC.

How are we making incidents like this less likely or less impactful?

  • We have blocked highly impactful commands from being executed on the devices (Completed). A conceptual sketch of this kind of command safeguard follows this list.
  • We will require all command execution on the devices to follow safe change guidelines (Estimated completion: February 2023)
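
To illustrate the kind of safeguard described in the first item above, here is a minimal, hypothetical sketch of a pre-execution command gate for network devices. The blocked patterns, device model names, and qualification lists are assumptions made purely for illustration; they are not Microsoft's actual tooling or command set.

```python
# Hypothetical sketch of a pre-execution command gate for network devices.
# The blocked patterns, device model names, and qualification registry below
# are illustrative assumptions, not Microsoft's actual implementation.
import re

# Commands known to trigger wide-scale recomputation on some platforms.
BLOCKED_PATTERNS = [
    r"^clear\s+bgp\s+\*",       # example: clearing all BGP sessions
    r"^request\s+.*\breset\b",  # example: platform-wide resets
]

# Commands that have passed the full qualification process, per device model.
QUALIFIED_COMMANDS = {
    "router-model-a": {"show interfaces", "set interface address"},
    "router-model-b": {"show interfaces"},
}

def is_command_allowed(device_model: str, command: str) -> bool:
    """Return True only if the command is neither blocked nor unqualified
    for this device model."""
    for pattern in BLOCKED_PATTERNS:
        if re.match(pattern, command.strip(), flags=re.IGNORECASE):
            return False  # highly impactful command: always rejected
    # Require an exact match against the qualified-command list for the model.
    return command.strip().lower() in QUALIFIED_COMMANDS.get(device_model, set())

# Example: the same command may be qualified on one platform and not on another.
print(is_command_allowed("router-model-a", "set interface address"))  # True
print(is_command_allowed("router-model-b", "set interface address"))  # False
print(is_command_allowed("router-model-a", "clear bgp *"))            # False
```

The point mirrored from this PIR is that the same command can behave differently on different network devices, so qualification needs to be tracked per device model rather than assumed to be global.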

This is our Preliminary PIR that we endeavor to publish within 3 days of incident mitigation, to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details/learnings.

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey.

23 January 2023

Summary of Impact: Between approximately 16:27 UTC and 19:38 UTC on 23 Jan 2023, a subset of customers in the South Central US region may have experienced increased latency and/or intermittent connectivity issues for some services in the region. Downstream impact to other services in the region also occurred.

Preliminary Root Cause: We identified an unhealthy network device in the regional network gateway in the South Central US region. This device would have impacted traffic between Availability Zones and data centers in South Central US and traffic into and out of the South Central US region.

Mitigation: The unhealthy network device was removed from service so that traffic was served through other healthy paths.

You can stay informed about Azure service issues, maintenance events, or advisories by creating custom service health alerts (video tutorials and how-to documentation are available), and you will be notified via your preferred communication channel(s).
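
As a hedged illustration of what creating such an alert can look like programmatically, the sketch below defines an activity log alert rule scoped to Service Health events and submits it through the Azure Resource Manager REST API. The subscription, resource group, alert name, action group ID, and API version shown are placeholders or assumptions; consult the Service Health documentation for the authoritative schema.

```python
# Illustrative sketch: create an activity log alert for Service Health events
# via the Azure Resource Manager REST API. Resource names, the action group ID,
# and the api-version are placeholders / assumptions for this example.
import requests
from azure.identity import DefaultAzureCredential

subscription_id = "<subscription-id>"   # placeholder
resource_group = "<resource-group>"     # placeholder
alert_name = "service-health-alert"     # placeholder
action_group_id = (                     # placeholder action group resource ID
    f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
    "/providers/microsoft.insights/actionGroups/<action-group-name>"
)

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}/providers/Microsoft.Insights"
    f"/activityLogAlerts/{alert_name}?api-version=2020-10-01"  # version may differ
)

body = {
    "location": "Global",
    "properties": {
        "enabled": True,
        "scopes": [f"/subscriptions/{subscription_id}"],
        "condition": {
            # Fire on any Service Health event (incidents, maintenance, advisories).
            "allOf": [{"field": "category", "equals": "ServiceHealth"}]
        },
        "actions": {"actionGroups": [{"actionGroupId": action_group_id}]},
        "description": "Notify on Azure Service Health events for this subscription",
    },
}

resp = requests.put(url, json=body, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
print(resp.json()["name"])
```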


18 January 2023

This is our Preliminary PIR that we endeavor to publish within 3 days of incident mitigation, to share what we know so far.

After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details/learnings.

What happened?

Between 09:44 and 13:10 UTC on 18 January 2023, a subset of customers using Storage services in West Europe may have experienced higher than expected latency, timeouts or HTTP 500 errors when accessing data stored on Storage accounts hosted in this region. Other Azure services with dependencies on this specific storage infrastructure may also have experienced impact – including Azure Application Insights, Azure Automation, Azure Container Registry, Azure Database for MySQL, Azure Database for PostgreSQL, Azure Red Hat OpenShift, Azure Search, and Azure Virtual Machines (VMs).

What went wrong and why?

We determined that an issue occurred during planned power maintenance, causing an incident for a subset of storage and networking infrastructure in one datacenter of one Availability Zone in West Europe. This impacted storage tenants and network devices, some of which may have rebooted. This unexpected event was caused by a failure in the electrical systems feeding the affected racks. While all server racks have redundant dual feeds, one feed was powered down for maintenance, and a failure in the redundant feed caused a complete shutdown of the affected racks. We continue to investigate the nature of this redundant feed failure, to prevent incident reoccurrence.

How did we respond?

The issue was detected by the datacenter operation team performing the maintenance at the time. We immediately initiated the maintenance rollback procedure, and restored power to the affected racks. Concurrently, we escalated the incident and engaged other Azure service stakeholders to initiate/validate service recovery. Most impacted resources automatically recovered following the power event, through automated recovery processes. The storage team identified two storage scale units that did not come back online automatically – nodes were not booting properly, as network connectivity was still unavailable. Networking teams were engaged to investigate, and identified a Border Gateway Protocol (BGP) issue. BGP is the standard routing protocol used to exchange routing and reachability information between networks. Since BGP functionality did not recover automatically, 3 of the 20 impacted top-of-rack (ToR) networking switches stayed unavailable. Networking engineers restored the BGP session manually. One storage scale unit was fully recovered by 10:00 UTC, the other storage scale unit was fully recovered by 13:10 UTC.

How are we making incidents like this less likely or less impactful?

Microsoft has an extensive internal retrospective process after incidents, including deep-dive reviews into any issues caused by power distribution systems. All learnings and action items are captured in an incident management system to ensure that all are tracked and closed in a timely manner. Since this incident was triggered by a power event, the datacenter forensic team will evaluate these results and issue service bulletins or process changes globally to prevent incident reoccurrence, as and when required. Our internal retrospective is ongoing. This is our Preliminary PIR that we endeavor to publish within 3 days of incident mitigation, to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details/learnings. This will include a summary of our learnings from the review process, including any relevant next steps that will make incidents like this less likely, or at least less impactful.

How can customers make incidents like this less impactful?

  • Consider using Availability Zones (AZs) to run your services across physically separate locations within an Azure region. To help services be more resilient to datacenter-level failures like this one, each AZ provides independent power, networking, and cooling. Many Azure services support zonal, zone-redundant, and/or always-available configurations.
  • Consider which Storage redundancy options are right for your critical applications. Zone-redundant storage (ZRS) remains available throughout a zone-localized failure like the one in this incident. Geo-redundant storage (GRS) enables account-level failover in case the primary region endpoint becomes unavailable. A minimal creation sketch follows this list.
  • Consider using Azure Chaos Studio to recreate the symptoms of this incident as part of a chaos experiment, to validate the resilience of your Azure applications. Our library of faults includes VM shutdown, network block, and AKS faults that can help to recreate some of the connection difficulties experienced during this incident – for example, by targeting all resources within a single Availability Zone.
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review.
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more.
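
To make the storage redundancy suggestion above concrete, the sketch below creates a zone-redundant (ZRS) StorageV2 account with the track-2 azure-mgmt-storage SDK. The resource group, account name, and region are placeholders, and the exact SDK surface may vary between package versions.

```python
# Minimal sketch (placeholders throughout): create a zone-redundant (ZRS)
# StorageV2 account so data in the account survives the loss of a single
# availability zone. Assumes AZURE_SUBSCRIPTION_ID is set and that
# DefaultAzureCredential can authenticate.
import os
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

subscription_id = os.environ["AZURE_SUBSCRIPTION_ID"]
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

poller = client.storage_accounts.begin_create(
    "my-resource-group",   # placeholder resource group
    "mystoragezrs001",     # placeholder; must be globally unique, 3-24 lowercase alphanumerics
    {
        "location": "westeurope",
        "kind": "StorageV2",
        "sku": {"name": "Standard_ZRS"},  # zone-redundant; Standard_GZRS adds geo redundancy
    },
)
account = poller.result()
print(account.name, account.sku.name, account.provisioning_state)
```

Geo-redundant options (Standard_GRS or Standard_GZRS) additionally replicate data to the paired region and support account-level failover if the primary region endpoint becomes unavailable, as noted in the list above.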

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey.