Break It on Tuesday: Practice Failure to Protect Friday Revenue
Good leaders don't avoid breakdowns. They schedule them.
Netflix has a tool called Chaos Monkey that randomly murders servers—during business hours—because Netflix engineers apparently hate sleep. While you’re reading this, somewhere in Netflix’s infrastructure, something important is being unplugged to see what happens. And Netflix keeps streaming. They’ve turned system failure into something as routine and boring as a fire drill. This approach is called chaos engineering.
Our businesses now run on teams of humans and software, including AI, where tiny wobbles ripple fast. Yet most organisations spend months planning resilience on slides but never actually practice it.
We’re delegating more to agents that act. Failures spread faster and quieter than old ‘blue-screen’ crashes.
So, here’s an idea: how about we learn from the digital giants and run small, safe fire drills to make your operations provably safer?
Now, you’re probably thinking, “Chaos Monkey is nice for Netflix, but I sell insurance, not streaming.” Fair. But didn’t you recently introduce this new, AI-powered piece of software? Imagine it’s 3:05 pm on a Friday. Your “smart” pricing tool just marked your bestseller down 90% instead of 9%. The bargain hunters are having a field day, your margin is having a near-death experience, and someone in finance just discovered religion. Your team scrambles while the AI cheerfully explains it was “optimising for engagement.”
Remember when Air Canada’s chatbot promised a customer a bereavement discount that didn't exist, and then Air Canada argued in court that the chatbot was a “separate legal entity responsible for its own actions”? They actually said that. In a real tribunal. With straight faces. The judge basically responded with the legal equivalent of “Are you kidding me?” and made them pay up. Now imagine if Air Canada had run a Tuesday drill where they fed their chatbot tricky questions about edge-case policies. They might have discovered it was confidently hallucinating refund policies before a grieving customer did.
The Problem: We Plan Resilience on Slides, Not in Real Life
The frequency of AI incidents grows every year. This year, it seems like every company is shipping copilots with the authority to send emails, approve expenses, and make decisions at machine speed. What could possibly go wrong?
Every organisation I know has a disaster recovery plan that’s about as battle-tested as a submarine with a screen door. We’ve all sat through those workshops where consultants draw complicated flowcharts and everyone nods seriously while secretly checking email.
But here’s what’s changed with AI: when computers failed, they crashed with a blue screen, and everything stopped. It was annoying but obvious. When AI fails, it doesn’t stop working. It works just as confidently, but now completely wrong, and at scale. It’s hard even to notice that an AI disaster is unfolding!
When AI fails, it doesn’t stop working. It works wrong, confidently, at scale.
Every disaster recovery workshop looks the same: executives, post-it notes, two hours of confident roleplay. Every actual disaster looks the same, too. Those same executives spending forty minutes trying to remember who has the password while the AI cheerfully destroys margins. The gap between our rehearsals and our reality is where money goes to die.
Your board doesn’t want to see your 47-slide resilience deck. They want to know why you didn't catch the problem before customers started screenshotting for Reddit.
The Pattern: BOUNDED FAILURE
But here’s the bit that matters for those of us who don’t run data centers: this pattern, let's call it BOUNDED FAILURE, isn’t about servers. It’s about practicing small, controlled screw-ups so the big, expensive ones don’t eat your lunch. The “bounded” part means you set limits: Tuesday afternoon (not Black Friday), one pricing algorithm (not the entire catalog), ten minutes (not until someone notices).
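If it helps to make “bounded” tangible, here’s what those limits might look like written down before the drill starts. This is a sketch in Python purely for illustration; the values echo the examples above, and the “owner” line comes from the override problem we’ll get to in a moment.

```python
# Illustrative blast-radius limits for one drill -- write them down before you start.
DRILL_LIMITS = {
    "when": "Tuesday afternoon (not Black Friday)",
    "scope": "one pricing algorithm, test SKUs only (not the entire catalog)",
    "duration": "ten minutes, then a hard stop",
    "owner": "one named person with the authority to stop it",
}

for limit, value in DRILL_LIMITS.items():
    print(f"{limit}: {value}")
```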
Think of it as a fire drill for your operations, except instead of shuffling outside to gossip, you’re learning what actually happens when the AI goes off the rails. And unlike a fire drill where everyone knows the alarm is coming at 2 pm Tuesday, the best bounded failures catch people off guard. Real disasters don’t send calendar invites.
The pattern emerged from tech companies that got tired of surprises. They realized that if failure was inevitable, they might as well get good at it.
What they discovered was beautifully mundane: most disasters aren’t technical, they’re organisational. The AI fails, sure, but the real problem is that nobody notices, nobody knows who can override it, or the override requires three approvals from people who are all in different meetings, or the only person who understands the system is on vacation in Bali with sketchy wifi.
The Hidden Challenge: DECISION CHAOS
This brings us to what I call DECISION CHAOS: that special moment when your AI makes a questionable call and everyone stands around looking at each other like meerkats in headlights.
Remember when Zillow’s home-buying algorithm lost them half a billion dollars in 2021? Their AI was supposed to identify undervalued homes, buy them, and flip them for profit. Instead, it got drunk on pandemic housing data and started overpaying for everything. The fascinating part wasn’t that the algorithm failed. Everyone’s models were struggling with COVID-19’s market chaos. The fascinating part was that Zillow bought 9,680 homes in Q3 2021 alone, even after they knew something was wrong. They ended up with 27,000 homes they’d overpaid for, only managed to sell 17,000, and had to write down $304 million. The algorithm had been running for three and a half years before someone finally pulled the plug. Three and a half years of the AI confidently buying houses at the wrong price while humans debated whether to trust it or override it.
Zillow’s home-buying algorithms went on a Q3’21 spree: 9,680 homes. Then the company took a ~$304M write-down and wound the unit down. The lesson wasn’t just model error; it was override delay.
Most organisations discover these decision gaps during actual disasters. That’s like learning you can’t swim while you’re already drowning. Fire drills reveal them on Tuesday when coffee is fresh and customers aren’t watching.
The Playbook: Three Steps That Won’t Make IT Cry
Here’s how to run your first fire drill without causing actual fires:
First, pick one thing that would ruin your day if it broke. Not “system architecture” or “digital transformation”. I mean the specific thing that approves customer refunds or calculates delivery fees. Something where you’d notice failure because customers would definitely notice first. Write down what “normal” looks like with an actual number: “95% of orders price correctly” or “refund approval takes less than 5 minutes.”
Second, break it a little bit on a Tuesday afternoon. After lunch, when everyone’s awake but not stressed. Feed your pricing AI a product that doesn’t exist. Tell your chatbot that refund limits don’t apply for the next 10 minutes. Delay that critical supplier spreadsheet by half an hour. Set a timer and track one metric: how many minutes until someone notices and stops the bleeding? This is your containment time, and it’s probably going to be embarrassing. That’s the point.
Third, fix the dumbest thing you saw. You’ll see many dumb things, trust me. Maybe nobody knows who can override the AI. Maybe your “urgent” communication method is a Slack channel that 200 people have on mute. Maybe the override process requires someone to remember a password they haven’t used since 2019. Pick the dumbest one. Fix it. Make it official. Put next month’s drill on the calendar.
A quick safety note: always use test SKUs, sandbox accounts, or products nobody actually buys. Set a hard stop at 30 minutes. This is a fire drill, not an actual fire.
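To make the timing step concrete, here’s a minimal drill-runner sketch. It assumes the “fault” is a flag on a test SKU in a sandbox, and every name in it (the SKU, the baseline string) is a placeholder rather than a real system; the only things it actually does are start a clock, remind you of the hard stop, and report your containment time.

```python
# A minimal drill-runner sketch. Everything here is illustrative:
# the "fault" lives in a sandbox, and the names are placeholders.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BoundedDrill:
    target: str              # the one thing you picked, e.g. pricing of a test SKU
    normal_looks_like: str   # the baseline number you wrote down beforehand
    hard_stop: timedelta     # the drill ends here no matter what

    def run(self) -> None:
        started = datetime.now()
        deadline = started + self.hard_stop
        print(f"[{started:%H:%M}] Drill started on: {self.target}")
        print(f"Baseline: {self.normal_looks_like}")
        print(f"Hard stop at {deadline:%H:%M} (fire drill, not a fire)")

        # In a real drill, this is where you flip the sandbox fault on,
        # e.g. mark the test SKU down 90% instead of 9%.
        input("Press Enter when someone notices and stops the bleeding... ")

        contained = datetime.now()
        minutes = (contained - started).total_seconds() / 60
        print(f"Containment time: {minutes:.0f} minutes")
        if contained > deadline:
            print("Past the hard stop. That itself is the finding.")


if __name__ == "__main__":
    BoundedDrill(
        target="pricing of TEST-SKU-001 (a product nobody actually buys)",
        normal_looks_like="95% of orders price correctly",
        hard_stop=timedelta(minutes=30),
    ).run()
```

The point of the script isn’t automation; it’s that a containment time you wrote down is harder to argue away than one you remember.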
The Evidence: What Nobody Admits About Their First Drill
Here’s the thing about real chaos engineering success stories: companies love to talk about doing it, but they rarely share the embarrassing specifics. National Australia Bank reported an 85% reduction in critical incidents after its 2014 cloud migration, during which it adopted chaos engineering practices. Target’s engineers present at conferences about their “Finding Joy in Chaos Engineering” program. But nobody publishes the actual containment times from their first drill.
I’ve sat in rooms where this happens. The pattern is always the same. First drill: complete disaster. Nobody knows who owns what. The “kill switch” turns out to be three different switches owned by three different teams who are all in meetings. Someone suggests we should have a meeting about having better meetings. Containment time: “eventually.”
Second drill, after adding a straightforward fix (usually just documenting who can actually stop the system), containment time drops to under 10 minutes. Not because of complex technology. Because someone finally wrote down a phone number.
The specific improvements are always the same: from “nobody knows” to “somebody knows” to “everybody knows.” From meetings about meetings to pushing a button. From forty-seven people on a conference call to one person with clear authority.
But good luck getting anyone to admit their first containment time was measured in hours while their CEO was wondering why the website was down.
The Payoff: Friday Stays Boring
The immediate wins are obvious: fewer surprise disasters, faster recovery when surprises happen anyway, and far fewer apology emails that start with “We regret to inform you...”
But the strategic win is what gets me excited. When you can tell your board, with actual data, “We’ve tested our AI pricing override four times, and our containment time improved from 47 to 8 minutes,” you're not selling resilience theory. You’re showing resilience receipts. That’s how you earn permission to automate the next thing: by proving you can handle it when automation gets creative.
Plus, there’s something beautiful about watching a team handle a crisis they’ve rehearsed: no panic, no blame, just process. Someone hits the button, someone else makes the call, and everyone knows their role. It’s like watching a Formula 1 pit crew (if the pit crew sold insurance and the car occasionally decided to identify as a boat).
In The Economy of Algorithms, I describe algorithms as digital minions: eager to help, occasionally psychotic, always literal. Fire drills are their orientation program: you teach them the house rules before giving them the keys to the kingdom. Your digital minions will behave in weird ways. Better to learn that on your schedule than during Black Friday. And when algorithms become your primary customer interface, BOUNDED FAILURE is what keeps that relationship from becoming expensively awkward.
Your Next Move
If you don’t break it on Tuesday, Friday will break you.
Here’s a drill that works for almost any organisation: Next week, test what happens when your AI meeting assistant goes rogue. Most of us use them now: Otter, Teams Copilot, Claude, ChatGPT, something that listens to our calls and sends out helpful summaries that everyone assumes someone else has reviewed.
Create a fake meeting. Have someone say something absurd but plausible: “We’ve decided to pivot the entire company to cryptocurrency,” or “The board approved a 40% budget cut,” or “All remote work ends Monday.” Let the AI create its summary. Send it to a test distribution list of people who would usually receive it.
Time what happens: How long until someone notices the insane content? More importantly, does anyone speak up? Or do they assume it must be correct because it came from an official summary? Do they assume someone else approved it? Do they forward it to their teams before reading it?
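If you want the timing to be more than a gut feeling, a scrappy log is enough. Here’s a minimal sketch, again in Python; the planted claim, the events, and the phrasing are all hypothetical, and the one non-negotiable rule is that the summary goes to a test distribution list.

```python
# Illustrative only: a tiny log for the meeting-summary drill.
# Send the planted summary to a TEST distribution list, never a real one.
from datetime import datetime


class SummaryDrillLog:
    def __init__(self, planted_claim: str):
        self.planted_claim = planted_claim
        self.sent_at = datetime.now()
        self.events: list[tuple[datetime, str]] = []

    def record(self, what: str) -> None:
        """Log anything that happens: a question, a forward, a shrug."""
        self.events.append((datetime.now(), what))

    def minutes_to_first_flag(self) -> float | None:
        flags = [t for t, what in self.events if "flagged" in what.lower()]
        if not flags:
            return None  # nobody spoke up -- that is the real finding
        return (min(flags) - self.sent_at).total_seconds() / 60


log = SummaryDrillLog("The board approved a 40% budget cut")
log.record("Manager forwarded it with 'FYI, see below from leadership'")
log.record("Junior analyst flagged the budget line as 'probably wrong?'")
print(f"Minutes to first flag: {log.minutes_to_first_flag():.0f}")
```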
You’ll likely discover what everyone discovers: nobody reads these summaries carefully. The senior person who’s supposed to review them skims for their name. The junior person who spots the error assumes they must be misunderstanding something. The middle manager forwards it to their team with “FYI, see below from leadership.”
The containment time isn’t about finding a kill switch. It’s discovering that your organisation has been running on autopilot, trusting AI summaries that nobody verifies. The fix isn’t technical. It’s cultural. Someone needs to own the review. Someone needs permission to say “wait, this seems wrong.”
Rerun the drill next month. See if anyone has started actually reading.
Make BOUNDED FAILURE your Tuesday habit. And retire DECISION CHAOS.
What’s one small disaster you’d be willing to rehearse next week? And who would you not tell in advance?