AI Found Five Years of Bugs in Six Weeks — But Humans Still Must Vet Every Fix

Source: 20VC with Harry Stebbings | Published: 2026-06-22T13:58:51Z

Palo Alto Networks ran its Mythos AI against its own codebase for six weeks, surfacing vulnerabilities that would have taken five to six years to find manually. The catch: a 30% false-positive rate means every patch still needs human review and sandbox validation before it ships.

Within six weeks of running Mythos against its own codebase, Palo Alto Networks uncovered security vulnerabilities that would have taken five to six years to find the old way. Then came another six weeks of patching them one by one, with each fix requiring human evaluation, test environment validation, and sandbox confirmation before going to production. AI finds problems faster than humans can imagine. Fixing them still requires human judgment.
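To put the headline numbers in proportion, here is a back-of-the-envelope calculation of the implied discovery speedup. It is a rough sketch that assumes 52 calendar weeks per year and covers only the discovery phase, not the second six weeks of patching; the function name is illustrative.

```python
WEEKS_PER_YEAR = 52  # assumption: calendar weeks, not working weeks


def speedup(manual_years: float, ai_weeks: float) -> float:
    """Ratio of manual discovery time to AI-assisted discovery time."""
    return manual_years * WEEKS_PER_YEAR / ai_weeks


# The article's range: five to six years of manual work versus six weeks.
for years in (5, 6):
    print(f"{years} years vs 6 weeks: ~{speedup(years, 6):.0f}x faster")
```

Under these assumptions the discovery phase ran roughly 43 to 52 times faster than the manual estimate.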

This is the first concrete example Nikesh Arora shares in the conversation, and it neatly captures his entire read on where AI stands today: the capability is there. We just haven't figured out how to use it.

Frontier Models Have Two Problems. Everyone's Only Talking About One.

Nikesh frames the core tension facing frontier models as "breadth vs. depth."

On the breadth end, the consumer side is already good enough. He uses himself as an example: he had Gemini generate an investment memo for a company he was researching. Four minutes, mostly accurate, needed a few edits. In the past, that would have required a team of bankers and analysts and several days. Consumers have a high tolerance for hallucinations — there's always a human sitting between the output and reality, absorbing the false positives.

The depth end is a different story. His example is Waymo: fully replacing a human driver with AI required tens of billions of dollars in training on a massive number of edge cases, using data that isn't on the public internet. You can't stuff the latest Anthropic model into a Mercedes and ask it to drive you home.

"Frontier models want consumer attention because it generates post-training data and builds consumer brand recognition. But the real enterprise revenue comes from use cases that require deep context."

There's no easy resolution to this tension. Frontier model companies are chasing both things simultaneously, but the two demand completely different data, error tolerances, and business models.

Token Prices Should Fall to a Tenth of What They Are Today

Nikesh gives a clear directional call on token pricing: long-term, tokens should cost a tenth of what they do today.

The logic isn't complicated. Compute is severely supply-constrained right now, with prices two to four times what they were two years ago. Consumer use cases consume more than half of all compute, but the services are nearly all free to end users — ChatGPT, Claude, Gemini charge ordinary users nothing. That's a loss-making business. The remaining compute pressure falls on enterprise and coding use cases, which explains why token prices remain stubbornly high.

He expects meaningful price declines within three to five years. Free access on the consumer side will eventually tighten, because once frontier model companies have accumulated enough post-training data, their incentive to keep subsidizing every user will weaken. Meanwhile, compute efficiency continues to improve.

As for what share of developer salaries will eventually go toward token spend, he says: "There's no way to give you a number right now, because pricing will change dramatically over the next three to five years."

G&A Headcount Gets Cut in Half. Technical Roles Actually Grow.

His forecast is specific: over the next three years, most companies will reduce headcount in G&A (general and administrative) functions such as marketing, finance, and HR by roughly half. Palo Alto currently has around 600 people in marketing. "It won't be 600 anymore."

The logic: the training data for marketing work naturally lives in the public domain. What you publish is itself training data. Frontier models can already tell you which copy doesn't match your brand voice, which content contradicts itself. If AI can bring the output quality of an average employee close to that of the best employee, you need fewer people.

But he's equally emphatic that technical roles need to grow, not shrink. His team keeps asking him for more AI-capable people, because there's a backlog of projects: "I have a great project to transform marketing. I have a great project to transform HR. I need more people who know how to prompt frontier models, build harnesses, and integrate proprietary data."

On the broader claim that AI will reduce overall employment, he disagrees. What disappears are process management roles. What grows are technical roles that understand and wield AI — and sales roles, because if your product is genuinely good, you need more people to cover the market.

The Real Difference Between SaaS and AI Applications: Having a Point of View

He uses a simple distinction to explain why he thinks SaaS will be disrupted: SaaS software has no opinions. AI applications do.

SaaS logic: you define the input, it produces the output, nothing more. AI application logic: you hand it copy and it says, "This isn't good enough. It doesn't match your brand voice. Here's how I'd change it."

That distinction has deep implications for workflow. His example is recruiting: suppose AI scans all the resumes and tells you which 20 candidates to call, then suggests ten different interview questions for you because it knows three of your colleagues have already asked certain ones. That requires handing over 80% of the decision-making. "We're not doing that yet. What we're doing now is: scan the invoice, extract the data, say look, 20% faster."

SaaS vendors' stock prices are already reflecting this judgment. He thinks the market signal is clear: the migration from opinion-free software to opinionated AI applications is real, but the timeline is uncertain.

He Runs Two "AIO Meetings" Every Week

Palo Alto Networks has 21,000 employees. Nikesh drives the transformation in a very concrete way.

Twice a week, he convenes 14 to 20 of the company's most senior technical leaders in meetings called AIO — "like Old MacDonald Had a Farm," he says. Each session, he goes around the room and asks: What have you done on AI in the last three days? Do you have agents in your product roadmap? How much engineering capacity have you freed up to work on new things?

The logic: if leadership isn't aligned, everything else is noise. When he discovers that a product manager's 6-to-12-month roadmap contains zero agent work, he doesn't write a memo. He asks directly: "Everyone is talking about this. You have nothing in your roadmap. Why?"

He doesn't believe hiring a Chief AI Officer to own this effort actually works. He saw the same movie play out in 2004 when he was running Google Europe: CEOs hired an "internet wizard" to lead their internet transformation, then stepped back themselves. Those wizards couldn't move the needle, because nobody in the organization actually gave them resources or attention.

In Cybersecurity, AI Is an Accelerant — But the Path Is Specific

Return to the vulnerability scanning from the start. He says that after Palo Alto got access to Mythos, they found it uncovered vulnerabilities far faster than anyone expected. But remediation can't be handed to AI entirely: AI will go "fix" 30% of things that aren't actually broken, and nobody knows what that would do to a production environment. So the actual path is: AI finds the problem, Claude Code helps draft the patch, a human reviews it, the test environment validates it, a sandbox confirms there are no side effects, and only then does it go to production.
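The gated path described above can be sketched as a simple pipeline in which a patch reaches production only after passing every check in order. This is a toy illustration, not Palo Alto's actual tooling: the class, the gate functions, and the ten sample findings (three of them false positives, matching the 30% figure) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Finding:
    finding_id: str
    is_real: bool  # toy ground truth; in practice only review reveals this
    gates_passed: List[str] = field(default_factory=list)


def human_review(f: Finding) -> bool:
    # A reviewer rejects "fixes" for things that are not actually broken.
    return f.is_real


def validate_in_test_env(f: Finding) -> bool:
    return True  # placeholder: would apply the patch in a test environment


def confirm_in_sandbox(f: Finding) -> bool:
    return True  # placeholder: would check for side effects in a sandbox


# Gates in the order the article lists them.
GATES: List[Tuple[str, Callable[[Finding], bool]]] = [
    ("human_review", human_review),
    ("test_env_validation", validate_in_test_env),
    ("sandbox_confirmation", confirm_in_sandbox),
]


def ship_to_production(f: Finding) -> bool:
    """Return True only if the finding clears every gate, in order."""
    for name, gate in GATES:
        if not gate(f):
            return False  # stop at the first failed gate
        f.gates_passed.append(name)
    return True


# Ten hypothetical findings; the first three are false positives (30%).
findings = [Finding(f"F{i}", is_real=(i >= 3)) for i in range(10)]
shipped = [f.finding_id for f in findings if ship_to_production(f)]
print(f"{len(shipped)} of {len(findings)} patches shipped")
```

In this sketch the false positives are stopped at human review, so none of them reaches the later gates or production.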

The industry-wide implication: the same model, pointed at any company's infrastructure, will find vulnerabilities. Customers are starting to realize how exposed they actually are, and the resulting urgency to patch is a tailwind for cybersecurity companies.

Palo Alto has 150 million sensors standing guard at the edge of customer infrastructure. He believes what AI can do now is inject more intelligence into those guards — improving detection and interception efficiency. The guards themselves won't be replaced by a Claude or OpenAI endpoint agent, because that kind of agent doesn't exist yet.

Memory and Context Are the Real Moat

On the question of where value accrues between frontier models and the application layer, he identifies what he sees as the most critical variable: memory and context.

On the consumer side, models have started recording what you've asked, what your preferences are. On the enterprise side, that hasn't happened yet — but he believes it's the next battleground. Whoever accumulates enough user context builds stickiness, and that builds a moat.

Frontier model companies already understand this and are actively building memory capabilities into their models. This creates a structural problem: if you're deeply integrated with one model's memory layer, switching to another means redesigning your entire application architecture. Choosing a model binds you far more tightly than most people realize.

The orchestration layer was supposed to solve this, but he notes that orchestration players are far outgunned — in both funding and capability — by the frontier model companies themselves. In that asymmetric landscape, the question of who stores context remains genuinely open.

Three Postures Toward Transformation. Two Are Wrong.

He uses self-driving cars to describe the three strategies companies take toward AI.

The Waymo approach: start from zero, no driver, every edge case is training data, the goal is full autonomy. The corporate equivalent is what Brian Armstrong and Jack Dorsey are doing — cut 30-40% of staff, rebuild with AI-native people.

The Tesla approach: staged automation. Get highways working first, then tackle complex conditions incrementally. This is what Palo Alto Networks is doing: hire only through hackathons, use natural attrition (roughly 2% per month) to gradually refresh the talent base, and achieve a 20-25% workforce turnover over three years.

The legacy automaker approach: add some AI features to existing products, slap a label on it, say "we're doing AI too."

He's explicit: Tesla's approach is acceptable. The legacy automaker approach is not. The difference is that Tesla's approach delivers real AI capability improvements at every stage. The legacy approach is just AI washing.

"Miss Three Opportunities and You're Out"

When asked what mainstream Silicon Valley gets wrong, he says his concern is the combination of FOMO and euphoria causing bad judgment — "if I don't invest in this interesting thing right now, I'll get left behind."

He uses Anthropic as an illustration of how the windows are shrinking: you had 20 years to invest in SpaceX; with Anthropic you had three. People who missed Anthropic's early rounds are now paying the price in other ways, and that makes every new project feel like it could be the next Anthropic, so decisions get made before there's enough information.

His own read on SaaS valuations: the market pressure is real, but the AI application layer hasn't matured enough to replace the entire SaaS ecosystem, so the current pricing chaos reflects genuine uncertainty, not just a preview of the endgame.

"You miss one opportunity, you can survive. Miss two, you start to hurt. Miss three, you might be out."

Many SaaS vendors are at their second miss right now: not yet eliminated, but already under pressure.