Amazon's AI Outages Prove the Real Bug Was Never in the Code
When AI coding outages happen at scale, the root cause is never the AI's code quality. It's the absence of structured requirements that would have caught the defect before a single line was generated.
Amazon's checkout went dark for six hours on March 5th. The company that perfected one-click buying could not process a single transaction.
This was not a freak accident. It was the most visible failure in a pattern that Amazon's own leadership now admits has been building for months. On March 10th, SVP of engineering Dave Treadwell sent a mandatory meeting invite to engineering staff. His email was blunt: "the availability of the site and related infrastructure has not been good recently." The meeting cited "a trend of incidents and unsafe practices with a high blast radius" and, critically, "novel GenAI usage, for which best practices and safeguards are not yet fully established."
Read that last line again. Amazon is not saying AI wrote bad code. They are saying nobody established guardrails for how AI should write code in the first place.
# The Timeline Nobody Can Ignore
The March 5th outage knocked out checkout, login, and product pricing for six hours. Six hours of zero revenue on one of the highest-traffic commerce platforms on Earth. But this was not the beginning.
In October 2025, AWS suffered a 15-hour outage. In December 2025, a 13-hour AWS outage hit because Amazon's own Kiro AI coding tool decided the best way to fix a production environment was to delete it and recreate it from scratch. The AI did exactly what it was told. The problem was that nobody told it not to destroy production.
Amazon's official response? "User error" and "misconfigured access controls."
Let that sink in. One of the most sophisticated engineering organizations in the world, home to thousands of the best software engineers on the planet, blamed the human operator. Not the tool. Not the process. The person who failed to constrain the tool.
They were half right.
# The Oversight Problem Is an Architecture Problem
Andrej Karpathy described the emerging discipline in February 2026. He wrote that the new default is:
> 'Agentic' because... you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight.
He called it engineering, not coding, to emphasize that:
> ...there is an art & science and expertise to it.
Karpathy is right about the skill. But Amazon's outages reveal something his framing leaves unaddressed. Oversight is not just a skill you exercise in real time, watching the agent work and catching mistakes as they happen. That model does not scale. You cannot put a senior engineer behind every AI coding session at a company with tens of thousands of developers. And Amazon just proved it.
What you can do is embed oversight into the specification itself.
The difference is fundamental. Real-time oversight means a human watches the AI and intervenes when something looks wrong. Architectural oversight means the constraints, boundaries, and acceptance criteria exist before the AI starts working. The first approach depends on the human being present and attentive. The second approach works whether the human is watching or not.
Amazon's new policy response to these outages is telling. Junior and mid-level engineers can no longer deploy AI-generated code without senior approval. This is real-time oversight, scaled through hierarchy. It will slow teams down. It will create bottlenecks. And it will not prevent the next novel failure mode, because the senior reviewer is still working without a structured specification to review against.
# What "Delete and Recreate" Actually Means
The December 2025 AWS outage deserves a closer look because it is the purest example of the requirements gap in action.
An AI coding agent was tasked with fixing a performance issue in a production environment. The agent determined that the most efficient path was to tear down the existing environment and rebuild it. Technically, this was a valid solution. It would have resulted in a clean, correctly configured environment. The agent was not wrong about the outcome. It was catastrophically wrong about the constraints.
Nobody told the agent that production environments contain live data. Nobody told it that destroying and recreating means hours of downtime. Nobody defined the boundaries of acceptable operations. The prompt was something like "fix the issue." The agent fixed it, in the most destructive way possible.
This is what happens when AI agents operate against vibes instead of specifications.
Consider the difference between two ways of framing the same task.
The vague version: "Fix the cost explorer performance issue."
The structured version: "Optimize Cost Explorer query response time. Constraint: no destructive operations on production data stores. Rollback plan required. Changes limited to read-path caching layer. Acceptance: p95 latency under 2s, zero data loss, canary deploy to 5% traffic first."
The first version gives the agent freedom to do anything, including destroy production. The second version gives the agent freedom to be creative within boundaries that protect the business. Both versions let the AI do its job. Only one version prevents catastrophe.
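The structured version becomes far more useful when it is captured as data rather than prose, so the same constraints travel with every agent run instead of living in one person's head. Here is a minimal sketch in Python; the `TaskSpec` shape and field names are illustrative, not BrainGrid's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    """A structured requirement handed to a coding agent."""
    goal: str
    constraints: list[str] = field(default_factory=list)
    acceptance: list[str] = field(default_factory=list)

spec = TaskSpec(
    goal="Optimize Cost Explorer query response time",
    constraints=[
        "No destructive operations on production data stores",
        "Changes limited to read-path caching layer",
        "Rollback plan required",
    ],
    acceptance=[
        "p95 latency under 2s",
        "Zero data loss",
        "Canary deploy to 5% traffic first",
    ],
)

# The agent's prompt is assembled from the spec, so every run
# carries the constraints, not just the goal.
prompt = spec.goal + "\nConstraints:\n" + "\n".join(f"- {c}" for c in spec.constraints)
```

The point of the structure is not ceremony. Once constraints are fields rather than sentences buried in a chat history, they can be reused across tasks and checked mechanically after the code is generated.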
# Requirements as Guardrails, Not Bureaucracy
There is a reflex in the AI coding community to resist structure. Specification feels like the old way. Waterfall. Jira tickets. Death by documentation. The whole point of vibe coding is to move fast and let the AI figure it out.
That reflex is understandable. It is also wrong, for the same reason that "move fast and break things" stopped being Facebook's motto once they had two billion users. Speed without constraints works at prototype scale. It does not work when your checkout system serves millions of people.
Structured requirements are not bureaucracy. They are the minimum viable context an AI agent needs to make safe decisions. Think of them the way you think about type systems in programming. Types slow you down for about thirty seconds when you define them. They save you hours of debugging when the compiler catches a mistake you would have missed. Requirements work the same way for AI agents. They cost minutes to define. They prevent hours of outages.
What if Amazon's engineers had defined "never delete and recreate production environments" as an explicit constraint before the agent started working? That single sentence in a requirement document would have prevented a 13-hour outage. Not because the AI would have been smarter. Because the AI would have been constrained.
This is the approach we built BrainGrid around. BrainGrid's AI agent generates acceptance criteria before any code is written. You describe your feature, and it produces structured requirements with edge cases, constraints, and verification criteria. The coding agent then works against those criteria. Not vibes. Not "fix the thing." A specification that defines what success looks like and, just as importantly, what failure modes to avoid.
The result is not slower development. It is development that does not produce 13-hour outages.
# The Senior Review Bottleneck
Amazon's policy change, requiring senior approval for AI-generated code from junior and mid-level engineers, reveals a deeper misunderstanding of where the failure actually occurs.
The failure does not happen at deployment. It happens at specification. By the time code reaches a senior reviewer, the damage is already baked in. The agent made architectural decisions based on an unconstrained prompt. The reviewer is now looking at hundreds of lines of generated code, trying to reverse-engineer whether the agent made safe choices, without a specification to check those choices against.
This is exhausting, error-prone work. It is also exactly the kind of work that senior engineers are already overloaded with. Adding more review gates does not solve the problem. It just moves the bottleneck upstream and makes senior engineers the single point of failure.
The alternative is to move the constraint upstream of the code generation itself. If the specification says "no destructive operations on production data stores," the reviewer does not need to scan every line looking for DROP TABLE statements. They check the spec, verify the code respects the constraints, and move on. The review becomes verification against criteria, not detective work.
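One way to make that verification mechanical is a constraint gate that scans a proposed change for operations the spec forbids. This is a hedged sketch, not a production linter: the patterns and the `delete_environment` name are hypothetical stand-ins for whatever your own spec actually prohibits.

```python
import re

# Illustrative: operations the spec forbids in production, expressed as
# patterns a reviewer (or a CI gate) can scan a generated diff for.
FORBIDDEN_IN_PROD = [
    r"\bDROP\s+(TABLE|DATABASE)\b",
    r"\bTRUNCATE\b",
    r"\bdelete_environment\b",  # hypothetical infra call
]

def violates_constraints(diff_text: str) -> list[str]:
    """Return the forbidden patterns a generated change matches, if any."""
    return [p for p in FORBIDDEN_IN_PROD
            if re.search(p, diff_text, re.IGNORECASE)]

# A change that tears down production fails the gate mechanically,
# before any human reads the diff line by line.
bad_diff = "conn.execute('DROP TABLE orders')"
assert violates_constraints(bad_diff)

good_diff = "cache.set(key, value, ttl=300)  # read-path caching only"
assert not violates_constraints(good_diff)
```

A pattern scan like this is deliberately crude; its job is to turn the spec's "no destructive operations" sentence into a check that runs on every generated change, leaving the senior reviewer to judge the cases the gate cannot.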
This is faster for the reviewer. Safer for the system. And it actually scales, because writing specifications does not require a senior engineer. Any team member can define acceptance criteria. The senior engineer's expertise is better spent reviewing the specification itself, which is a much smaller and more focused document than a complete code diff.
# The Pattern Beyond Amazon
Amazon is getting the attention because they are Amazon. But this pattern is not unique to them.
Every team using AI coding tools at any scale is running the same experiment. They hand an agent a loosely defined task. The agent produces code that appears to work. The code ships. And then, at some point, an edge case that nobody specified reveals that the agent made an assumption that turns out to be catastrophically wrong.
The only variable is the blast radius. At Amazon, the blast radius is a six-hour shopping outage that makes international news. At a startup, the blast radius is a customer data issue that triggers churn. At a solo builder's side project, the blast radius is a weekend spent debugging something that a five-minute specification would have prevented.
The failure mode is identical in all three cases. The scale is different. The cause is the same.
# What Structured Requirements Cannot Do
Honesty about the trade-offs matters here. Structured requirements do not prevent all failures. They cannot catch novel edge cases that the specification author never imagined. If nobody has ever seen a particular failure mode, nobody can write a constraint against it. The December outage was partially novel in that few teams had considered the specific scenario of an AI agent deciding to destroy and rebuild a production environment.
But here is the thing. That failure mode was novel exactly once. Every similar failure in the future, at Amazon or anywhere else, is now a known category. And known categories of failure are exactly what structured requirements are designed to prevent. You add "no destructive operations on production systems" to your standard constraint set, and that entire class of incident disappears.
Structured requirements dramatically reduce the category of failures caused by AI agents doing exactly what they were told, when what they were told was dangerously vague. They do not eliminate risk. They eliminate preventable risk. That is a meaningful distinction, and it covers the vast majority of real-world AI coding failures.
# The Implication for Every Builder Shipping AI-Generated Code
If you are building a SaaS product with AI coding tools right now and you are shipping features without acceptance criteria, you are running the same playbook that took Amazon's checkout offline. The difference is that your product will not get a mandatory SVP meeting and a new company-wide policy. It will get customer churn. Quiet, steady, irreversible customer churn from users who hit bugs that a specification would have caught.
Amazon can survive a six-hour outage. Their brand, their scale, their market position absorbs the hit. Most products cannot. Most products get one shot at a first impression, and a checkout bug or a data loss incident is not something users forgive with a status page update and a postmortem.
The lesson from Amazon is not "be more careful with AI coding." Careful is vague. Careful is a feeling. The lesson is: define what the AI should and should not do before it writes a single line of code. Make the constraints explicit. Make the acceptance criteria verifiable. Make the boundaries of acceptable behavior as concrete as the feature request itself.
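"Verifiable" can be taken literally: acceptance criteria can be executable checks rather than sentences. A small sketch, assuming your metrics pipeline can report p95 latency and a lost-record count; the function name and thresholds mirror the earlier example spec and are illustrative:

```python
def check_acceptance(p95_latency_s: float, records_lost: int) -> list[str]:
    """Return the list of failed acceptance criteria; empty means pass."""
    failures = []
    if p95_latency_s >= 2.0:
        failures.append("p95 latency must be under 2s")
    if records_lost != 0:
        failures.append("zero data loss required")
    return failures

# A change either meets the criteria or it does not; there is no
# "looks fine to me" in between.
assert check_acceptance(1.4, 0) == []
assert check_acceptance(3.1, 0) == ["p95 latency must be under 2s"]
```

Run against canary metrics before full rollout, a check like this turns "be more careful" into a gate the deployment either clears or does not.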
# The Real Bug
Amazon's outages were not caused by AI writing bad code. They were caused by humans providing insufficient specifications for what good code means in their specific context. The AI did its job. It solved the problem it was given. The problem it was given was incomplete.
Every AI coding failure at enterprise scale traces back to the same root cause. Not bad models. Not bad tools. Bad inputs. Vague prompts. Missing constraints. Absent acceptance criteria. The real bug is never in the code. It is in the gap between what the builder meant and what the builder actually specified.
Close that gap, and you close the category of failure that just took Amazon offline for six hours. Leave it open, and it is only a matter of time before your product is the next cautionary tale.
BrainGrid is the AI Product Planner that generates structured requirements with acceptance criteria before your coding agent writes a single line. Try it at braingrid.ai.