14 March 2024

Join one of our upcoming 'Azure Incident Retrospective' livestreams about this incident:

What happened?

Between 10:33 UTC on 14 March 2024 and 11:00 UTC on 15 March 2024, customers using Azure services in the South Africa North and/or South Africa West regions may have experienced network connectivity failures, including extended periods of increased latency or packet drops when accessing resources. This incident was part of a broader continental issue, impacting telecom services to multiple countries in Africa.

The incident resulted from multiple concurrent fiber cable cuts on the west coast of Africa (specifically the WACS, MainOne, SAT3, and ACE cables), in addition to earlier ongoing cable cuts on the east coast of Africa (including the EIG and Seacom cables). These cables are part of the submarine cable systems that connect Africa’s internet to the rest of the world and serve Microsoft’s cloud network for our Azure regions in South Africa. In addition to the cable cuts, we later experienced a separate failure that reduced capacity on our backup path, leading to congestion that impacted services.

Some customers may have experienced degraded performance including extended timeouts and/or service failures across multiple Microsoft services – while some customers may have been unaffected. Customer impact varied depending on the service(s), region(s), and configuration(s). Impacted downstream services included Azure API Management, Azure Application Insights, Azure Cognitive Services, Azure Communication Services, Azure Cosmos DB, Azure Databricks, Azure Event Grid, Azure Front Door, Azure Key Vault, Azure Monitor, Azure NetApp Files, Azure Policy, Azure Resource Manager, Azure Site Recovery, Azure SQL DB, Azure Virtual Desktop, Managed identities for Azure resources, Microsoft Entra Domain Services, Microsoft Entra Global Secure Access, Microsoft Entra ID, and Microsoft Graph. For service specific impact details, refer to the ‘Health history’ section of Azure Service Health within the Azure portal.

What went wrong, and why?

The Microsoft network is designed to withstand multiple concurrent failures of our Wide Area Network (WAN) capacity. Specifically, our regions in South Africa are connected via multiple diverse physical paths – both subsea and terrestrially within South Africa – and the network is designed to keep operating on a single remaining physical path. In this case, our South Africa regions are served by four physically diverse subsea cable systems, and the designed failure mode is that three of the four can fail with no impact to our customers.

Following news of geopolitical risks in the Red Sea, we ran internal simulations and capacity planning analysis. On 5 February, we initiated capacity additions to our African network. On 24 February, multiple cable cuts in the Red Sea impacted our east coast network capacity to Africa. This east coast capacity was unavailable; however, there was no customer impact because of the built-in redundancy.

Before our capacity additions from February had come online, on 14 March we experienced multiple additional concurrent fiber cable cuts, this time on the west coast of Africa – which further reduced the total network capacity for our Azure regions in South Africa. These cable cuts were due to a subsea seismic event (likely an earthquake and/or mudslide) which impacted multiple subsea systems – one of which is used by Microsoft. Additionally, after the west coast cable cuts had occurred, we experienced a line card optic failure on a Microsoft router inside the region that further reduced network headroom. Microsoft experiences hundreds of line card optic failures every day across the 500k+ devices that operate our network – such an event would normally have been invisible to our customers. However, the combination of concurrent cable cuts and this line card failure removed the necessary headroom on the failover path, which led to the congestion experienced.
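To make the failure-mode arithmetic above concrete, the sketch below models a set of physically diverse paths and checks whether the surviving capacity still covers peak demand after a given combination of failures. All path names and capacity figures are hypothetical, chosen only to illustrate the headroom concept, and do not reflect Microsoft's actual topology or numbers.

```python
# Hypothetical illustration of WAN headroom under concurrent failures.
# Path names and capacity figures are invented for this sketch only.

PATHS_GBPS = {
    "east_coast_a": 400,   # e.g. lost in the Red Sea cable cuts
    "east_coast_b": 400,
    "west_coast_a": 400,   # e.g. lost in the west coast seismic event
    "west_coast_b": 400,   # failover path, later degraded by a line card failure
}

PEAK_DEMAND_GBPS = 350


def surviving_capacity(failed_paths, degraded=None):
    """Total capacity left after full path failures and partial degradations."""
    degraded = degraded or {}
    total = 0.0
    for name, capacity in PATHS_GBPS.items():
        if name in failed_paths:
            continue
        total += capacity * degraded.get(name, 1.0)
    return total


def check(label, failed, degraded=None):
    remaining = surviving_capacity(failed, degraded)
    status = "OK" if remaining >= PEAK_DEMAND_GBPS else "CONGESTED"
    print(f"{label}: {remaining:.0f} Gbps vs {PEAK_DEMAND_GBPS} Gbps peak -> {status}")


# Design intent: any three of the four paths can fail with no customer impact.
check("three paths down", failed={"east_coast_a", "east_coast_b", "west_coast_a"})

# Incident shape: three paths effectively down *and* the remaining failover
# path loses headroom to a line card optic failure, tipping it into congestion.
check("three paths down + degraded failover path",
      failed={"east_coast_a", "east_coast_b", "west_coast_a"},
      degraded={"west_coast_b": 0.8})
```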

This combination of events affected Azure services including Compute, Storage, Networking, Databases, and App Services – as well as Microsoft 365 services. While many customers leverage local instances of their services within the South Africa regions, some services rely on API calls made to regions outside of South Africa. The reduced bandwidth to/from the South Africa regions impacted these specific API calls, and therefore impacted service availability and/or performance.
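As an illustration of how a dependency on cross-region API calls can be made more tolerant of a congested link, here is a minimal Python sketch that wraps such a call with a timeout and falls back to locally cached data. The function and cache names are hypothetical and not part of any Azure SDK.

```python
import time
from typing import Callable, Optional

# Hypothetical sketch (not an Azure SDK API): wrap a cross-region control-plane
# call with a short timeout and fall back to locally cached data, so a slow or
# congested inter-region link degrades the caller gracefully instead of failing it.

_cache: dict[str, tuple[float, object]] = {}
CACHE_TTL_SECONDS = 300          # how long slightly stale data is acceptable


def call_with_fallback(key: str, remote_call: Callable[[float], object],
                       timeout_seconds: float = 2.0) -> Optional[object]:
    try:
        # remote_call is expected to honour the timeout it is given,
        # e.g. an HTTP client call with an explicit timeout parameter.
        result = remote_call(timeout_seconds)
        _cache[key] = (time.monotonic(), result)
        return result
    except Exception:
        cached = _cache.get(key)
        if cached and time.monotonic() - cached[0] <= CACHE_TTL_SECONDS:
            return cached[1]     # serve slightly stale local data during the congestion
        return None              # caller decides whether to fail open or closed
```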

How did we respond?

The timeline that follows includes network availability figures, which represent the breadth of impact to our network capacity but may not represent the impact experienced by any specific customer or service.

  • 3 February 2024 – News articles surfaced geopolitical risk to Red Sea subsea cable infrastructure.
  • 5 February 2024 – Based on our internal simulations, we began the process of requesting capacity augments to Microsoft’s west coast Africa network.
  • 24 February 2024 – Multiple cable cuts in the Red Sea impacted east coast capacity (EIG and Seacom cables), no impact to customers/services.
  • 4 March 2024 – Local fiber providers began work on approved capacity augments.
  • 14 March 2024 @ 10:02 UTC – Multiple cable cuts impacted west coast capacity (WACS, MainOne, and SAT3).
  • 14 March 2024 @ 10:33 UTC – Customer impact began as the reduced capacity caused networking latency and packet drops; our on-call engineers began investigating. Network availability dropped as low as 77%.
  • 14 March 2024 @ 11:55 UTC – Azure Front Door failed out of the region, to reduce inter-region traffic.
  • 14 March 2024 @ 12:00 UTC – Individual cloud service teams began reconfigurations to optimize network traffic to reduce congestion.
  • 14 March 2024 @ 15:44 UTC – After the combination of our mitigation efforts and the end of the business day in Africa, network traffic volume reduced – network availability rose above 97%.
  • 14 March 2024 @ 16:25 UTC – We continued implementing traffic engineering measures to throttle traffic and reduce congestion – network availability rose above 99%.
  • 15 March 2024 @ 06:00 UTC – As network traffic volumes increased, availability degraded, and customers began experiencing congestive packet loss – network availability dropped to 96%.
  • 15 March 2024 @ 11:00 UTC – We shifted capacity from Microsoft's Edge in Lagos to increase headroom for South Africa, and the last packet drops were observed on our WAN. While this effectively mitigated customer impact, we continued to monitor until additional capacity supported more headroom.
  • 17 March 2024 @ 21:00 UTC – First tranche of emergency capacity came online.
  • 18 March 2024 @ 02:00 UTC – Second tranche of emergency capacity came online, Azure Front Door brought back into our South Africa regions, incident declared mitigated.

How are we making incidents like this less likely or less impactful?

  • We have added Wide Area Network (WAN) capacity to the region, in the form of a new physically diverse cable system with triple the capacity of pre-incident levels (Completed).
  • We are reviewing our capacity augmentation processes to help accelerate urgent capacity additions when needed (Estimated completion: April 2024).
  • We continue to work with our fiber providers to restore WAN paths after the cable cuts on the west coast of Africa (Estimated completion: April 2024) and on the east coast of Africa (Estimated completion: May 2024).
  • We are evaluating adding a fifth WAN path between South Africa and the United Arab Emirates, to build even more resiliency to the rest of the world (Estimated completion: June 2024).
  • We are increasingly shifting services to run locally from within our South Africa regions, to reduce dependencies on international regions where possible, including Exchange Online Protection (Estimated completion: June 2024).
  • In the longer term, we are investing in WAN Gateways in Nigeria to improve our fault isolation and routing capabilities. (Estimated completion: December 2024)
  • Finally, we are working to build out and activate Microsoft-owned fiber capacity to these regions, to reduce dependencies on local fiber providers. This includes investments in our own capacity on the new submarine cables going to Africa (specifically the Equiano, 2Africa East and West) which will exponentially increase capacity to serve our regions in South Africa. Importantly, this capacity will also be controlled by Microsoft – giving us more operational flexibility to add/change/move capacity in our WAN, versus relying on third-party telecom operators. These WAN fiber investments on new cable systems will land on the west coast of Africa (Estimated completion: December 2024) as well as on the east coast of Africa (Estimated completion: December 2025).

How can our customers and partners make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

21 January 2024

Watch our 'Azure Incident Retrospective' video about this incident:

What happened?

Between 01:30 and 08:58 UTC on 21 January 2024, customers attempting to leverage Azure Resource Manager (ARM) may have experienced issues when performing resource management operations. This impacted ARM calls that were made via Azure CLI, Azure PowerShell and the Azure portal. While the impact was predominantly experienced in Central US, East US, South Central US, West Central US, and West Europe, impact may have been experienced to a lesser degree in other regions due to the global nature of ARM. 

This incident also impacted downstream Azure services which depend upon ARM for their internal resource management operations – including Analysis Services, Azure Container Registry, API Management, App Service, Backup, Bastion, CDN, Center for SAP solutions, Chaos Studio, Data Factory, Database for MySQL flexible servers, Database for PostgreSQL, Databricks, Device Update for IoT Hub, Event Hubs, Front Door, Key Vault, Log Analytics, Migrate, Relay, Service Bus, SQL Database, Storage, Synapse Analytics, and Virtual Machines.

In several cases, data plane impact on downstream Azure services was the result of dependencies on ARM for retrieval of Role Based Access Control (RBAC) data. For example, services including Storage, Key Vault, Event Hub, and Service Bus rely on ARM to download RBAC authorization policies. During this incident, these services were unable to retrieve updated RBAC information and, once the cached data expired, these services failed, rejecting incoming requests in the absence of up-to-date access policies. In addition, several internal offerings depend on ARM to support on-demand capacity and configuration changes, leading to degradation and failure when ARM was unable to process their requests.
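The fail-closed pattern described above can be sketched in a few lines: a service caches authorization policy with a TTL and, once the cache expires and cannot be refreshed, rejects requests rather than guessing. This is a generic illustration under assumed names, not the actual implementation used by Storage, Key Vault, Event Hub, or Service Bus.

```python
import time

# Generic illustration of the fail-closed authorization pattern described above.
# Names are hypothetical; this is not the implementation of any Azure service.

class PolicyCache:
    def __init__(self, fetch_policy, ttl_seconds=3600):
        self._fetch_policy = fetch_policy   # e.g. a control-plane call to download RBAC data
        self._ttl = ttl_seconds
        self._policy = None                 # mapping of principal -> set of allowed actions
        self._fetched_at = 0.0

    def _refresh_if_needed(self):
        expired = time.monotonic() - self._fetched_at > self._ttl
        if self._policy is None or expired:
            try:
                self._policy = self._fetch_policy()
                self._fetched_at = time.monotonic()
            except Exception:
                if expired:
                    self._policy = None     # refresh failed *and* cached data is too old

    def authorize(self, principal, action):
        self._refresh_if_needed()
        if self._policy is None:
            # Fail closed: without up-to-date policy, reject rather than guess.
            raise PermissionError("authorization policy unavailable")
        return action in self._policy.get(principal, set())
```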

What went wrong and why?

In June 2020, ARM deployed a private preview integration with Entra Continuous Access Evaluation. This feature supports continuous access evaluation for ARM, and was only enabled for a small set of tenants and private preview customers. Unbeknownst to us, this preview feature of the ARM CAE implementation contained a latent code defect that caused issues when authentication to Entra failed. The defect would cause ARM nodes to fail on startup whenever ARM could not authenticate to an Entra tenant enrolled in the preview.

On 21 January 2024, an internal maintenance process made a configuration change to an internal tenant which was enrolled in this preview. This triggered the latent code defect and caused ARM nodes, which are designed to restart periodically, to fail repeatedly upon startup. ARM nodes restart periodically by design, to account for automated recovery from transient changes in the underlying platform, and to protect against accidental resource exhaustion such as memory leaks.
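A hedged sketch of the failure shape described in the two paragraphs above: if a node's startup sequence treats a failure against any single preview tenant as fatal, the routine restart cycle turns one tenant's configuration problem into a steady loss of serving capacity. The function names are illustrative only; this is not ARM's actual startup code.

```python
# Illustrative only - not ARM's actual startup code. Shows how treating one
# tenant's auth failure as fatal during startup, combined with routine periodic
# restarts, steadily drains serving capacity across a fleet.

class StartupError(Exception):
    pass


def start_node_fragile(preview_tenants, authenticate):
    for tenant in preview_tenants:
        # Latent defect pattern: any single failed tenant aborts the whole node,
        # so every periodic restart after the bad configuration change fails.
        if not authenticate(tenant):
            raise StartupError(f"auth failed for tenant {tenant}")
    return "serving"


def start_node_resilient(preview_tenants, authenticate):
    # Safer pattern (compare the repair item below about proceeding with node
    # restart when a tenant-specific call fails): log and skip the failing
    # tenant, and keep the node in the serving pool.
    deferred = [t for t in preview_tenants if not authenticate(t)]
    if deferred:
        print(f"warning: deferring tenants {deferred}, node still serving")
    return "serving"
```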

Due to these ongoing node restarts and failed startups, ARM began experiencing a gradual loss in capacity to serve requests. Eventually this overwhelmed the remaining ARM nodes, creating a self-reinforcing feedback loop (increased load resulted in increased timeouts, leading to increased retries and a corresponding further increase in load) that caused a rapid drop in availability. Over time, this impact was experienced in additional regions – predominantly affecting East US, South Central US, Central US, West Central US, and West Europe.
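The feedback loop can be made concrete with a toy simulation: when most failed requests are retried immediately, offered load grows with the failure rate, which in turn grows with load. The numbers below are arbitrary and only illustrate the dynamic, not actual ARM traffic.

```python
# Toy simulation of the retry feedback loop described above. The numbers are
# arbitrary; the point is the shape of the curve, not the exact values.

def simulate(label, capacity_rps, base_demand_rps, retries_per_failure, steps=6):
    offered = base_demand_rps
    for step in range(steps):
        served = min(offered, capacity_rps)
        failed = offered - served
        # Failed requests come back as retries on top of the new demand.
        offered = base_demand_rps + failed * retries_per_failure
        print(f"{label} step {step}: offered={offered:.0f} rps, failed={failed:.0f} rps")


# Aggressive retries: each failure generates ~1.5 retries in the next window,
# so offered load compounds and availability collapses.
simulate("aggressive", capacity_rps=1000, base_demand_rps=1100, retries_per_failure=1.5)

# Backoff with jitter (modelled crudely as only a fraction of failures retrying
# within the window) lets offered load converge instead of spiralling.
simulate("backoff", capacity_rps=1000, base_demand_rps=1100, retries_per_failure=0.2)
```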

How did we respond?

At 01:59 UTC, our monitoring detected a decrease in availability, and we began an investigation. Automated communications to a subset of impacted customers began shortly thereafter and, as impact to additional regions became better understood, we decided to communicate publicly via the Azure Status page. By 04:25 UTC we had correlated the preview feature to the ongoing impact. We mitigated by making a configuration change to disable the feature. The mitigation began to roll out at 04:51 UTC, and ARM recovered in all regions except West Europe by 05:30 UTC.

The recovery in West Europe was slowed by a retry storm from failed ARM calls, which increased traffic in West Europe by over 20x and caused CPU spikes on our ARM instances. Because most of this traffic originated from trusted internal systems, by default we allowed it to bypass the throughput restrictions that would normally have throttled such traffic. We increased throttling of these requests in West Europe, which eventually relieved the CPU pressure and enabled ARM to recover in the region by 08:58 UTC, at which point the underlying ARM incident was fully mitigated.
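One way to avoid the gap described above, where trusted internal traffic bypassed the throttles, is to rate-limit every caller class and simply give internal callers a larger budget rather than an exemption. The sketch below is a generic token-bucket limiter with hypothetical budgets, not the actual ARM throttling implementation.

```python
import time

# Generic token-bucket limiter: every caller class is bounded, with internal
# callers given a larger budget rather than an exemption. Budgets are
# hypothetical; this is not ARM's actual throttling implementation.

class TokenBucket:
    def __init__(self, rate_per_second: float, burst: float):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = burst
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


LIMITS = {
    "internal": TokenBucket(rate_per_second=500, burst=1000),  # trusted, but still bounded
    "external": TokenBucket(rate_per_second=100, burst=200),
}


def admit(caller_class: str) -> bool:
    """Admit a request only if its caller class still has budget."""
    return LIMITS[caller_class].allow()
```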

The vast majority of downstream Azure services recovered shortly thereafter. Specific to Key Vault, we identified a latent bug which resulted in application crashes when latency to ARM from the Key Vault data plane was persistently high. This extended the impact for Vaults in East US and West Europe, beyond the vaults that opted into Azure RBAC.

  • 20 January 2024 @ 21:00 UTC – An internal maintenance process made a configuration change to an internal tenant enrolled in the CAE private preview.
  • 20 January 2024 @ 21:16 UTC – The first ARM roles started experiencing startup failures, with no customer impact as ARM still had sufficient capacity to serve requests.
  • 21 January 2024 @ 01:30 UTC – Initial customer impact due to continued capacity loss in several large ARM regions.
  • 21 January 2024 @ 01:59 UTC – Monitoring detected additional failures in the ARM service, and on-call engineers began immediate investigation.
  • 21 January 2024 @ 02:23 UTC – Automated communications to impacted customers started.
  • 21 January 2024 @ 03:04 UTC – Additional ARM impact was detected in East US and West Europe.
  • 21 January 2024 @ 03:24 UTC – Due to additional impact identified in other regions, we raised the severity of the incident, and engaged additional teams to assist in troubleshooting.
  • 21 January 2024 @ 03:30 UTC – Additional ARM impact was detected in South Central US.
  • 21 January 2024 @ 03:57 UTC – We posted broad communications via the Azure Status page.
  • 21 January 2024 @ 04:25 UTC – The causes of impact were understood, and a mitigation strategy was developed.
  • 21 January 2024 @ 04:51 UTC – We began the rollout of this configuration change to disable the preview feature. 
  • 21 January 2024 @ 05:30 UTC – ARM recovered in all regions except West Europe.
  • 21 January 2024 @ 08:58 UTC – ARM recovered in West Europe, mitigating the vast majority of customer impact, except for specific services that took more time to recover.
  • 21 January 2024 @ 09:28 UTC – Key Vault recovered instances in West Europe by adding new scale sets to replace the VMs that had crashed due to the code bug.

How are we making incidents like this less likely or less impactful?

  • Our ARM team have already disabled the preview feature through a configuration update. (Completed)
  • We have offboarded all tenants from the CAE private preview, as a precaution. (Completed)
  • Our Entra team improved the rollout of that type of per-tenant configuration change to wait for multiple input signals, including from canary regions. (Completed)
  • Our Key Vault team has fixed the code that resulted in applications crashing when they were unable to refresh their RBAC caches. (Completed)
  • We are gradually rolling out a change to proceed with node restart when a tenant-specific call fails. (Estimated completion: February 2024)
  • Our ARM team will audit dependencies in role startup logic to de-risk scenarios like this one. (Estimated completion: February 2024)
  • Our ARM team will leverage Azure Front Door to dynamically distribute traffic for protection against retry storm or similar events. (Estimated completion: February 2024)
  • We are improving monitoring signals on role crashes for reduced time spent on identifying the cause(s), and for earlier detection of availability impact. (Estimated completion: February 2024)
  • Our Key Vault, Service Bus and Event Hub teams will migrate to a more robust implementation of the Azure RBAC system that no longer relies on ARM and is regionally isolated with standardized implementation. (Estimated completion: February 2024)
  • Our Container Registry team are building a solution to detect and auto-fix stale network connections, to recover more quickly from incidents like this one. (Estimated completion: February 2024)
  • Finally, our Key Vault team are adding better fault injection tests and detection logic for RBAC downstream dependencies. (Estimated completion: March 2024).

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: