Simon Willison Runs Four Agents at Once, Burns Out by 11 AM
Source: Lenny's Podcast | Published: 2026-04-02T12:31:15Z
The Django co-creator and 25-year coding veteran finds that wrangling multiple coding agents taxes every ounce of his professional experience — not because he's writing code, but because the cognitive load of tracking them simultaneously pushes human limits.
Simon Willison has been writing code for 25 years and co-created Django — the web framework behind Instagram, Pinterest, and Spotify. He coined the term "prompt injection," popularized "AI slop," and has over 100 open-source projects. Over the past five months, he completely overhauled his workflow, then documented every discovery on his blog.
November 2025: An Invisible Line Was Crossed
Throughout 2025, Anthropic and OpenAI poured nearly all their training resources into code generation. Reasoning models — which spread from OpenAI's o1 in late 2024 to every major provider — turned out to be exceptionally effective for code, capable of tracing bug root causes and following logic chains. With both labs going all-in on coding, GPT 5.1 and Claude Opus 4.5, released in November, crossed a threshold: coding agents went from producing code that "mostly sort of worked" to code that "almost always does what you tell it to."
That gap sounds small. In practice, it was a phase change. Engineers who started tinkering with these tools over the holidays all hit the same epiphany: if you describe what you want clearly enough, it actually builds it. By January and February, droves of engineers woke up to a new reality — you could produce ten thousand lines of code in a day, and most of it worked.
Vibe Coding and Agentic Engineering Are Two Different Things
When Andrej Karpathy originally defined vibe coding, he meant: never look at the code, let AI build things by feel, keep what works, tweak what doesn't. Willison believes that definition should stay in its lane — vibe coding means "I haven't read how the code works, this isn't production-grade, but it's a decent prototype."
When a professional engineer uses a coding agent to write production code, do code review, and run tests, that deserves a different name. Willison chose "agentic engineering," because the core is the agent's closed-loop capability: it doesn't just spit out code — it writes, debugs, tests, and iterates. Using that loop well demands deep software engineering experience, and that won't get easier anytime soon.
For non-programmers, vibe coding unlocks a genuinely new capability: anyone can make a computer do things for them. But the boundaries are clear — if your vibe-coded project is only used by you and a bug only hurts you, go wild. The moment someone else depends on your code, you need to step back and think about risk. And knowing what constitutes responsible use is itself an expert-level skill.
Lights-Out Factory: Can You Ship Production Software Without Reading Code?
A security company called StrongDM started an experiment in August 2025: no writing code, and no reading code either. Their product is enterprise access management — someone joins the company and needs Jira, Slack, and Okta permissions provisioned. This is security-sensitive software; by all conventional wisdom, you shouldn't build it this way. But StrongDM has years of security product experience and knows exactly where the risks lie.
Their approach was to build an army of agent testers: hundreds of simulated employees in simulated Slack channels firing off messages around the clock — "Hey, can I get Jira access?" These simulated users burn through roughly ten thousand dollars a day in token costs, but in return the software gets continuously, intensively tested.
The more interesting part is that the Slack channels themselves are fake. Real Slack has rate limits — you can't run tens of thousands of simulated users simultaneously. So StrongDM had coding agents build mock versions of Slack, Jira, and Okta based on their public API docs and open-source client libraries — a tiny Go binary, zero cost to run, complete with a UI. They vibe-coded a fake Slack frontend just to monitor test status.
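StrongDM's mocks were tiny Go binaries built from public API docs. As a loose sketch of the same pattern — not their code, and with the endpoints and request shapes simplified (real Slack's `conversations.history`, for instance, is a GET with query parameters) — a mock Slack can be little more than an in-memory message store behind two HTTP routes:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory message store keyed by channel, standing in for real Slack state.
CHANNELS: dict[str, list[dict]] = {}

class MockSlackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        if self.path == "/api/chat.postMessage":
            msg = {"channel": payload["channel"], "text": payload["text"]}
            CHANNELS.setdefault(payload["channel"], []).append(msg)
            body = {"ok": True, "message": msg}
        elif self.path == "/api/conversations.history":
            body = {"ok": True, "messages": CHANNELS.get(payload["channel"], [])}
        else:
            body = {"ok": False, "error": "unknown_method"}
        data = json.dumps(body).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):  # silence per-request logging
        pass

def start_mock_slack(port: int = 0) -> HTTPServer:
    """Start the mock on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), MockSlackHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A fleet of simulated employees can then hammer this at whatever rate the test harness wants — no rate limits, and zero API cost for the infrastructure itself.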
The takeaway isn't what tools they used. It's the shift in mindset: if you're not reading the code, how do you know the software is good? The answer is to make verification absolutely bulletproof.
Once Code Gets Cheap, the Bottleneck Shifts to Everything Else
The old process: write a spec, hand it to engineering, and if you're lucky, get an implementation three weeks later. Now that step might take three hours. So what comes next?
Willison's approach: for any feature he wants to design, he runs three different prototypes through agents simultaneously, because prototypes are now virtually free. Then he compares, tests, and picks. Prototyping was a core competitive advantage across his 25-year career. Now anyone can do it. But knowing when to prototype and how to use prototypes to explore direction still requires experience.
As for whether AI can replace product managers in deciding "what to build" — Willison thinks AI is excellent at the first two-thirds of brainstorming, rapidly exhausting every obvious idea. The interesting stuff emerges when you push it to generate 20 more: the tail end of the list starts surfacing ideas that aren't great on their own but point you toward something compelling. You can even force cross-domain mashups — "give me marketing ideas for my SaaS product, inspired by marine biology." Most of the output is garbage, but occasionally there's a seed worth planting.
That said, AI still can't replace real humans for validating ideas. Having ChatGPT pretend to be a user clicking through your prototype is nowhere near as valuable as watching a real person share their screen on Zoom.
Mid-Level Engineers Might Be in the Most Awkward Position
About a month ago, Thoughtworks convened a group of engineering VPs for a closed-door discussion. They reached an interesting conclusion: AI is great for senior engineers — it amplifies their skills. It's also great for juniors — it dramatically shortens ramp-up time. Cloudflare and Shopify both reported hiring thousands of interns in 2025 because what used to take a month to get productive now takes a week.
The most awkward spot is the middle. Mid-level engineers don't have the deep experience that seniors can amplify, and the onboarding boost that juniors get from AI is something they've already aged out of. Willison sees this as the most important open question right now — not junior vs. senior, but the position of mid-level engineers.
Running Four Agents at Once and Being Exhausted by 11 AM
I can spin up four agents at once, each solving a different problem. By 11 AM I'm spent.
Willison emphasizes that using coding agents well draws on every bit of his 25 years of experience. Not because he's writing code by hand, but because of the cognitive load: tracking multiple agents' progress simultaneously, judging how much prompting each problem needs, holding multiple contexts in his head — human cognition has hard limits, and they're easy to blow past.
He's heard about people staying up until 2 AM because "the agent can do work for me," or waking at 4 AM to kick off new tasks. He hopes this is just novelty-driven and temporary — after all, agents have only been truly useful for four or five months. But the pull has a gambling-like, addictive quality that worries him.

The paradox: AI was supposed to make us more productive and more relaxed, but the people using AI most intensely are working harder than ever. Willison's own New Year's resolution was counterintuitive — every previous year he told himself to focus and do less. For 2026, his goal is to take on more work and be more ambitious.
Hoard the Things You Know How to Build
Willison has a GitHub repo called `simonw/tools` with 193 small HTML/JavaScript tools, each a validated idea. Another repo, `simonw/research`, contains AI-driven research projects — Claude Code downloads new software, tries it out, writes a report, runs benchmarks, and outputs a markdown file.
The key distinction: these aren't deep-research-style web search reports. They're results from a coding agent that actually wrote code, ran it, and generated charts. A repo full of unverified reports has little value, but once an agent has written and executed code, it becomes a reusable starting point.
In practice, he tells Claude Code: "Go look at the WebAssembly and Rust projects in my research repo, then use that experience to tackle this new task." The coding agent can search the entire repo for relevant context and assemble it into a new solution.
Red-Green TDD: A Few Words That Transform Agent Output Quality
The core loop of test-driven development: write a test, watch it fail (red), write the implementation, watch it pass (green). Willison tried this for two years as a human programmer, found it too slow and tedious, and gave up. But coding agents don't get bored.
Adding "use red/green TDD" to your prompt — just a few words — makes the agent write a test first, confirm it fails, then write the implementation and confirm it passes. This dramatically reduces missed tests and unnecessary code. And because the agent recognizes the term, you don't need a paragraph explaining the process.
Willison's standards for code quality shifted accordingly. Before, a hundred lines of code with a thousand lines of tests would have been over-testing. Now he doesn't care — updating a thousand lines of tests is the agent's problem, and those tests prevent the agent from breaking existing features when adding new ones. People who skip tests think they're saving time. In reality, technical debt will slow them down before long.
A Good Template Beats a Page of Documentation
Willison starts every new project with a minimal template: a test file that tests 1+1=2, formatted in his preferred style, with a few lines of scaffolding. Coding agents are exceptionally good at continuing patterns they find in existing code — if there's one test in the repo, the agent writes more tests. If there's a consistent indentation style, the agent follows it.
Some people write lengthy CLAUDE.md files describing their workflow preferences. Willison doesn't. He uses a skeleton project to imply preferences, and the agent picks them up. These templates are all on his GitHub — one for Python libraries, one for dataset plugins, one for CLI tools.
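The seed test really can be that trivial — its job is to be a pattern the agent continues, not to provide coverage. A Python version of the kind of file Willison describes might be nothing more than:

```python
# tests/test_basic.py — the entire starting test suite for a new project.
# Its purpose is stylistic, not functional: one passing test in the repo
# nudges a coding agent to keep adding tests in the same shape and location.

def test_basic_arithmetic():
    assert 1 + 1 == 2
```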
The Lethal Trifecta: The Security Dead End of Personal AI Assistants
Willison coined "prompt injection" in 2022, but the name turned out to be misleading — it invites comparison to SQL injection, which has a well-known fix, while prompt injection still has no reliable one. So he coined a second term: "lethal trifecta." You can't guess what it means from the name alone — you have to look it up, which is exactly the point.
The three sides of the lethal trifecta: the agent can access private information (like your email); an attacker can inject instructions into the system (like sending you an email); the agent can exfiltrate data back to the attacker (like forwarding that email). When all three conditions exist simultaneously, it's a disaster. The defense is to eliminate one side, and the easiest one to cut is usually the data exfiltration channel.
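What "cutting one side" looks like in practice is mundane: the tool layer, not the model, enforces the restriction. A minimal sketch — the function names and allowlist here are invented for illustration, not from the episode:

```python
# Illustration of removing the exfiltration leg of the lethal trifecta:
# the email-sending tool exposed to the agent refuses recipients outside
# an allowlist, no matter what injected instructions told the model to do.

ALLOWED_RECIPIENT_DOMAINS = {"mycompany.com"}  # assumption: your own org only

class ExfiltrationBlocked(Exception):
    pass

def send_email(to_address: str, body: str) -> str:
    """Tool handed to the agent; fails closed for non-allowlisted domains."""
    domain = to_address.rsplit("@", 1)[-1].lower()
    if domain not in ALLOWED_RECIPIENT_DOMAINS:
        # The check lives outside the model, so a successful prompt
        # injection still cannot route private data to an attacker.
        raise ExfiltrationBlocked(f"refusing to send to {domain}")
    return f"sent to {to_address}"
```

The agent can still be tricked into *trying* to exfiltrate; the point is that a deterministic guard outside the model makes the attempt inert.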
You can get detection accuracy to 97%. In my book, that's a failing grade. It means 3 out of every 100 attacks steal all your information.
The AI Version of the Challenger Disaster Hasn't Happened Yet — But It Will
Willison borrows a concept called "normalization of deviance": before the Challenger disaster, many people knew the O-ring seals were unreliable, but every launch went fine, so institutional confidence kept growing. AI is going through the same thing — we're using these systems in increasingly unsafe ways, and there hasn't been a headline-grabbing prompt injection incident yet, so risk perception keeps dulling.
He predicts a "Challenger-level" AI security incident will eventually happen. But he also admits he's made this prediction every six months for the past three years, and it hasn't come true yet.
OpenClaw: The Desire for Personal Assistants Overrides Everything
OpenClaw's first line of code was written on November 25th. Three and a half months later, its white-label hosted service appeared in a Super Bowl ad. This is almost exactly the kind of product Willison opposes most — a personal digital assistant with access to all your email and the ability to act on your behalf. From a security standpoint, it's catastrophic.
But OpenClaw proved something: the demand for personal digital assistants is so enormous that people will ignore security risks and even wrestle with API keys and token configurations to get one. Anthropic and OpenAI could have built this product first, but they didn't know how to make it safe, so they held back. An independent third party had no such hesitation, shipped it, and happened to ride the wave of post-November model capability gains.
Willison bought a Mac Mini to run OpenClaw, but only runs it in a Docker container without access to his personal email. A friend said OpenClaw is basically a digital pet, and the Mac Mini is the fish tank you bought for it.
GPT 5.4 and Claude Opus 4.6 Are Nearly Neck and Neck
Willison currently uses Claude Code as his primary tool, especially its web version — you can operate it from your phone, the code runs on Anthropic's servers, and there's no risk of accidentally deleting local files. That sandboxing is what makes it reasonable to enable YOLO mode (Anthropic's flag is literally named "dangerously skip permissions"; OpenAI just calls it YOLO), where the agent stops asking permission at every step — so you can spin up four agents and go make a cup of tea.
GPT 5.4, released by OpenAI three weeks ago, has nearly caught up to Claude Opus 4.6 — possibly better in some areas, and cheaper. The two keep trading the lead, and the next Gemini model could reshuffle things again. Willison keeps switching back and forth but always returns to Claude Code, and the reason is taste — he has very specific preferences for code style, and Claude Code's output happens to match them.
Today, roughly 95% of the code I produce isn't typed by me. I write a huge amount of code on my phone. Walking the dog along the beach, I can still get work done.
Pelican on a Bicycle: A Joke That Became the Most Honest Benchmark
About a year and a half ago, Willison got fed up with numerical benchmarks like "scored 72% on TerminalBench" and decided to create his own: have a text model generate an SVG of a pelican riding a bicycle. This isn't testing image models — it's testing a text model's spatial reasoning and code generation ability.
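The judging is purely visual — a human looks at the drawing — but a first automated gate for harnessing the benchmark yourself is just checking that the model's output parses as SVG at all. This check is my addition, not part of Willison's setup:

```python
# Sanity gate for pelican-benchmark outputs: reject text that isn't even
# well-formed SVG before a human judges the actual drawing.

import xml.etree.ElementTree as ET

def looks_like_svg(text: str) -> bool:
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    # Namespaced tags parse as "{http://www.w3.org/2000/svg}svg".
    return root.tag.endswith("svg")

sample = '<svg xmlns="http://www.w3.org/2000/svg"><circle cx="5" cy="5" r="3"/></svg>'
```

Everything past this gate — does the pelican look like a pelican, is it actually on the bicycle — remains a human call, which is rather the point of the benchmark.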
The uncanny part: models that draw a good pelican also perform better at everything else. Nobody can explain why. Every major AI lab noticed the meme and started showing off pelicans when launching new models. Gemini 3.1's launch video even featured an animated pelican riding a bicycle.
Willison had a backup plan ready — he kept alternatives like "ocelot riding a motorcycle" in his back pocket, so if labs specifically optimized for pelicans, he could test with other animals to catch cheating. As it turned out, Gemini 3.1 nailed every animal-vehicle combination, completely closing that loophole. Willison didn't mind: "All I ask of life is a well-drawn pelican riding a bicycle. If I can trick every AI lab in the world into gaming a benchmark to make that happen, then I've gotten exactly what I wanted."
New Zealand is home to a rare parrot called the kākāpō — only 236 left in the world, flightless, nocturnal. They only breed in years when rimu trees produce a massive fruit crop. The last time was 2022 — four years without a single chick. In 2026, the rimu trees finally fruited, and over 100 chicks hatched, shattering the previous record of 85 set in 2019. There are cameras where you can watch them incubate. Willison says this is the best news he's brought.