
Amazon Web Services has issued a full apology and technical explanation after a 15-hour outage in its Northern Virginia cloud region took thousands of major platforms offline, exposing the internet’s heavy dependence on a handful of US cloud providers.
What Happened?
The incident began late on Sunday 19 October, when engineers at Amazon’s US-East-1 region in Northern Virginia detected connection failures across multiple services. Starting at 11:48 pm Pacific time, critical systems linked to Amazon DynamoDB, a database service used by many of the world’s largest apps, began to fail.
The root cause, according to Amazon’s post-event summary, was a “latent race condition”, a rare timing bug in the automated Domain Name System (DNS) management system that maintains the internal “address book” of AWS services. In this case, the automation mistakenly deleted the DNS record for DynamoDB’s regional endpoint, leaving client applications unable to resolve the endpoint’s name to IP addresses.
This caused immediate DNS failures for any service trying to connect to DynamoDB, including other AWS components such as EC2 virtual machines, Redshift databases, and Lambda compute functions. The DNS record was finally restored manually at around 2:25 am, but many dependent systems took much longer to recover, with some continuing to fail well into the afternoon of 20 October.
Where The Internet Felt It
The outage’s ripple effects were global, with more than 1,000 platforms and services experiencing disruption, including Snapchat, Reddit, Roblox, Fortnite, Lloyds Bank, Halifax, and Venmo. UK financial services were hit particularly hard, with some Lloyds customers reporting payment delays and app errors until mid-afternoon.
Other sectors also suffered unexpected consequences. For example, Eight Sleep, which manufactures smart mattresses that use cloud connections to control temperature and positioning, confirmed that some of its products overheated or became stuck in a raised position during the outage. The company said it would work to “outage-proof” its devices following the incident.
For millions of consumers, the event briefly made parts of the digital world disappear. Websites, apps, and connected devices remained online but could not “see” each other due to the DNS fault, illustrating just how central Amazon’s infrastructure has become to everyday online activity.
How A Single Glitch Spread So Widely
At its core, the failure was a simple but catastrophic DNS issue. DNS is the internet’s naming system, translating human-readable names into machine-readable IP addresses. When AWS’s automation produced an empty DNS record for DynamoDB’s endpoint, every application depending on it lost its bearings.
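To picture what that looks like from an application’s point of view, here is a minimal Python sketch (using a hypothetical hostname, not a real AWS endpoint) of a DNS lookup failing before any connection is even attempted:

```python
# Minimal sketch of what a client sees when a service's DNS record disappears:
# name resolution fails before any network connection is made.
import socket

# Hypothetical endpoint name used purely for illustration.
ENDPOINT = "dynamodb.example-region.example.com"

try:
    # Ask DNS for the IP addresses behind the name.
    addresses = socket.getaddrinfo(ENDPOINT, 443)
    print(f"Resolved {ENDPOINT} to {len(addresses)} address(es)")
except socket.gaierror as exc:
    # With the record deleted, resolution fails here and the application never
    # reaches the (still healthy) servers that sit behind the name.
    print(f"DNS lookup failed for {ENDPOINT}: {exc}")
```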
AWS engineers later confirmed that two redundant DNS update systems, known internally as “Enactors”, attempted to apply configuration plans simultaneously. One was significantly delayed and applied an older plan over a newer one; an automated clean-up process then deleted that older plan as stale, removing every IP address for the endpoint in the process. The automation could not self-repair, leaving manual intervention as the only option.
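As a loose illustration of that class of bug (and of the kind of freshness check Amazon says it is adding, described later), the following Python sketch shows how two uncoordinated workers applying “plans” without a version check let a delayed, stale plan overwrite a newer one. It is not AWS’s actual Planner/Enactor code:

```python
# Illustrative sketch only, not AWS's implementation: two independent workers
# apply DNS "plans" to a shared record. Without a freshness check, a delayed
# worker can clobber a newer plan with a stale one; with the check, the stale
# plan is rejected.

current = {"version": 0, "ips": []}  # the live DNS record, heavily simplified


def apply_plan(plan: dict, enforce_version_check: bool = False) -> None:
    """Apply a plan to the live record, optionally rejecting stale plans."""
    global current
    if enforce_version_check and plan["version"] <= current["version"]:
        print(f"Rejected stale plan v{plan['version']}")
        return
    current = plan
    print(f"Applied plan v{plan['version']} with {len(plan['ips'])} IP(s)")


newer_plan = {"version": 2, "ips": ["10.0.0.5", "10.0.0.6"]}
older_plan = {"version": 1, "ips": ["10.0.0.1"]}

apply_plan(newer_plan)        # normal update
apply_plan(older_plan)        # a delayed worker overwrites it with a stale plan;
                              # a clean-up job that deletes "old" plans would now
                              # remove the only record left

apply_plan(newer_plan, True)  # with the version check the newer plan wins...
apply_plan(older_plan, True)  # ...and the stale one is rejected
```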
As a result, internal systems that depended on DynamoDB also stalled. Amazon EC2, the platform used to launch virtual servers, could not start new instances. Network Load Balancer (NLB), which distributes traffic between servers, suffered cascading health-check failures as it tried to route connections to resources that were technically online but unreachable.
Why Recovery Took Most Of The Day
While the DNS issue itself was resolved within hours, the automated systems that depend on it did not immediately catch up. For example, EC2’s control software reportedly entered a “congestive collapse” as it attempted to re-establish millions of internal leases with physical servers. Restarting this process safely took several hours.
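“Congestive collapse” describes a familiar failure mode: when huge numbers of clients retry in lockstep against a recovering system, the retries themselves keep it overloaded. A common general mitigation, shown below purely as an illustration rather than as the specific fix AWS applied, is exponential backoff with jitter, which spreads retries out over time:

```python
# General illustration of exponential backoff with jitter, a common way to stop
# synchronised retries from overwhelming a system that is trying to recover.
# This is not AWS's internal recovery code.
import random
import time


def call_with_backoff(operation, max_attempts: int = 6, base_delay: float = 0.5):
    """Retry `operation`, spreading retries out so clients do not stampede together."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Cap the exponential delay, then pick a random point within it ("full jitter").
            delay = random.uniform(0, min(30.0, base_delay * 2 ** attempt))
            time.sleep(delay)

# Example (hypothetical client): call_with_backoff(lambda: flaky_client.connect())
```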
At the same time, delayed network configurations created a backlog in AWS’s Network Manager, causing newly launched instances to remain disconnected. To make things worse, load balancers then misinterpreted these delays as failures and pulled healthy capacity from service, worsening connection errors for some customers.
By early afternoon on 20 October, Amazon said all EC2 and NLB operations were back to normal, though the ripple effects continued to be felt across smaller services for some time.
Amazon’s Explanation And Apology
Following the outage (and the backlash), Amazon published a detailed 7,000-word technical report outlining the chain of events. The company admitted that automation had failed to detect and correct the DNS deletion and said manual recovery was required to restore service.
“We apologise for the impact this event caused our customers,” Amazon wrote. “We know how critical our services are to our customers, their applications and end users, and their businesses. We know this event impacted many customers in significant ways.”
The company confirmed it has disabled the affected DNS automation worldwide until a permanent fix is in place. AWS engineers are now adding new safeguards to prevent outdated plans from being applied, and additional limits to ensure health checks cannot remove too much capacity during regional failovers.
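The second safeguard, capping how much capacity health checks are allowed to remove at once, can be pictured roughly as follows. This is an illustrative sketch with made-up values, not AWS’s implementation:

```python
# Illustrative sketch of a capacity floor for health checks, not AWS code:
# failing targets are only removed from service while enough of the fleet
# remains; beyond that point, suspect targets are kept in rotation.

def targets_to_remove(statuses: dict, min_healthy_fraction: float = 0.7) -> list:
    """Return the failing targets that can be removed without breaching the floor."""
    total = len(statuses)
    failing = [name for name, healthy in statuses.items() if not healthy]
    removable = []
    for name in failing:
        remaining = total - len(removable) - 1
        if remaining / total < min_healthy_fraction:
            break  # removing more would drop below the floor; keep the rest in service
        removable.append(name)
    return removable


fleet = {"a": True, "b": False, "c": False, "d": False, "e": True}
print(targets_to_remove(fleet))  # -> ['b']: only one removal fits under the 70% floor
```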
Reactions And Tech Commentary
Industry experts have generally described the incident as a textbook case of automation failure, pointing to how a rare timing error in AWS’s DNS management system exposed wider systemic dependencies. Many engineers have noted that the issue reinforces the importance of resilience and of designing systems to tolerate faults in automated processes.
The outage is a clear reminder of a long-standing saying in IT circles, i.e., “It’s always DNS.” Although such faults are not unusual, the sheer scale of AWS’s infrastructure meant that a single configuration error was able to cause global disruption.
An Argument For Diversifying Cloud Setups?
Experts have also warned that the outage shows why businesses should diversify their cloud setups. For example, those running all workloads within a single AWS region found themselves completely offline. Organisations using multiple regions, or backup capacity in other cloud providers, were, however, able to switch over and maintain operations.
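In practice, that kind of diversification often starts with something as simple as client-side failover between regional endpoints. The sketch below, using hypothetical endpoint URLs rather than real service addresses, tries a primary region first and falls back to a secondary one if it cannot be reached:

```python
# A minimal sketch of client-side regional failover. The endpoint URLs are
# hypothetical placeholders, not real service addresses.
import urllib.error
import urllib.request

ENDPOINTS = [
    "https://api.eu-west-2.example.com/health",     # primary region (hypothetical)
    "https://api.eu-central-1.example.com/health",  # secondary region (hypothetical)
]


def fetch_from_first_available(urls, timeout: float = 3.0) -> bytes:
    """Return the response body from the first regional endpoint that answers."""
    last_error = None
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc  # this region is unreachable; try the next one
    raise RuntimeError("All regions unreachable") from last_error
```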
The Broader Implications
AWS remains the market leader in global cloud infrastructure, accounting for roughly 30 per cent of worldwide spending (according to Synergy Research). Its nearest competitors, Microsoft Azure and Google Cloud, hold around 25 per cent and 11 per cent respectively. However, this latest disruption has reignited debate about overreliance on a single provider.
Large-scale customers are now likely to review their resilience strategies in the wake of the incident. Financial institutions, healthcare providers, and government departments using AWS may now face renewed scrutiny over whether they have realistic fallback options if US-East-1 (Amazon’s largest and oldest data region) goes down again.
For Amazon, the incident is a reminder that its strength as the backbone of the internet can also be its greatest vulnerability, and that every outage draws widespread attention because of its systemic impact. The company’s rapid publication of a detailed (and very long) postmortem is in line with its usual transparency practices, but it is unlikely to stop competitors from using the episode to argue for multi-cloud adoption.
How Users Were Affected
For individuals and smaller businesses, the experience of the outage was that websites and apps stopped working. Some services displayed error messages while others simply timed out. With AWS hosting backend systems for thousands of platforms, many users had no idea that Amazon was the root cause.
Gaming companies like Roblox and Epic Games were among the first to confirm the disruption, reporting that login and matchmaking services were unavailable for several hours. Social media feeds froze for many users, while banking and payments apps experienced intermittent outages throughout the morning.
Even Amazon’s own services, such as Alexa and Ring, saw degraded performance during the height of the incident, highlighting the circular dependencies within its own ecosystem.
What Critics Are Saying
Criticism has centred on the scale of AWS’s dominance and the concentration of critical systems in one region. The US-East-1 region handles enormous traffic, both for North America and internationally, because it hosts many AWS “control plane” functions that manage authentication and routing across the network.
Analysts have warned for years that this architecture creates a “single point of systemic risk”, a problem that cannot be easily fixed without major structural changes. Calls for greater geographic and provider diversity in cloud services are now growing louder, particularly from European regulators seeking more independence from US infrastructure. Analysts have essentially said the incident showed how organisations that rely on a single AWS region are (perhaps obviously) more vulnerable to disruption. Experts in cloud resilience have added that the struggles of customers with no secondary region or provider to keep services running reinforce long-standing advice to build in redundancy and avoid single points of failure.
What Now?
AWS says it is reviewing all automation across its regions to identify similar vulnerabilities. It says the DNS Enactor and Planner systems will remain disabled until the race condition bug is eliminated and additional safeguards verified. It also says engineers are enhancing testing for EC2 recovery workflows to ensure large fleets can re-establish leases more predictably after regional incidents.
For business users, the event is likely to prompt at least a discussion about the wider adoption of multi-region resilience testing and disaster recovery planning. The broader question is whether the global internet can continue to rely so heavily on a few cloud giants without developing greater local redundancy.
Amazon’s response has been technically thorough and contrite, but the 20 October outage has again exposed the fragility of the infrastructure that underpins much of modern digital life.
What Does This Mean For Your Business?
For Amazon, the scale of this disruption highlights both its dominance and its exposure. When so much of the world’s digital infrastructure runs on AWS, even a small internal fault can have far-reaching consequences. That puts continual pressure on the company to prove not only that it can recover quickly but also that it can prevent similar incidents altogether. Investors, partners, and enterprise customers will expect to see evidence of lasting improvements rather than temporary workarounds.
For UK businesses, this incident offers a practical reminder about risk, resilience, and dependency. Many British firms now rely on US cloud platforms for critical operations, from financial transactions to logistics and customer service. The lesson is, therefore, that resilience cannot be outsourced entirely. Businesses must understand where their data and services actually live, review which regions and providers they depend on, and ensure that key functions can continue if one part of the cloud goes dark.
Regulators and policymakers are also likely to have taken note of what happened and its effects. The outage is likely to reinforce long-running discussions in the UK and Europe about digital sovereignty and the risks of relying on infrastructure controlled by a handful of American companies. While creating a truly independent alternative would be expensive and complex, the case for diversified, regionally distributed systems is now stronger than ever.
Competitors, meanwhile, now have an opportunity to frame this as a kind of turning point. Microsoft, Google, and European providers such as OVH and Stackit will likely use the event to promote multi-cloud architectures and region-level redundancy. However, each faces the same challenge at scale, i.e., automation that makes systems efficient can also make them fragile when unexpected conditions arise.
Ultimately, the outage serves as a stark illustration of how deeply interconnected the modern internet has become. Every business that builds on these platforms shares some part of the same risk. The real question for Amazon and its customers alike is not whether such failures can be avoided completely, but how quickly and transparently they can recover when the inevitable happens.



