Protein Model Trained on 2.8 Billion Sequences Rivals AlphaFold Without Ever Seeing a Structure

Source: Y Combinator | Published: 2026-06-12T14:00:20Z

ESM Cambrian uses masked prediction training on 2.8 billion protein sequences—no 3D structure data at all—yet outscores AlphaFold 3 on antibody design 50 to 47.

A protein language model trained on 2.8 billion sequences is now closing in on AlphaFold's structure prediction capabilities—without ever seeing a single protein structure. That's the latest evidence from the ESM Cambrian model released last week by the Biohub team, and it was the most information-dense of five talks at a recent YC AI Club meetup. The event spanned topics from protein folding to formal verification, LLM self-play to writing code with an RTS gaming mindset.

The "Bitter Lesson" of Proteins: Scale Beats Hand-Crafted Features

Yasa Baig, a second-year PhD student jointly at Stanford and Biohub, presented a paper with a straightforward core question: does Richard Sutton's famous "bitter lesson"—that general methods plus scale ultimately beat carefully designed expert systems—hold true in protein biology?

The ESM model family follows almost the same logic as language models: a protein is essentially a string over an alphabet of 20 amino acids, billions of years of evolution have generated the training corpus, and the model trains with BERT-style masked prediction, hiding some residues and learning to recover them from context. The model was never told anything about protein 3D structure, yet researchers found that as training compute increased, its ability to predict long-range protein contacts followed the same smooth log-linear scaling curve seen in language models.
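To make the masked-prediction setup concrete, here is a minimal sketch of how a training example might be built from one sequence. The 15% mask rate, the `<mask>` token, and the example sequence are assumptions for illustration; the talk did not describe ESM Cambrian's actual masking recipe.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "<mask>"  # hypothetical mask token

def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Hide a random subset of residues and return (inputs, targets).

    targets maps each masked position to the residue the model must recover.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    tokens = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK
    return tokens, targets

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # hypothetical example sequence
tokens, targets = mask_sequence(seq)
```

The model only ever sees `tokens` and is scored on how well it predicts `targets`; nothing in this loop mentions 3D structure.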

The key turning point was data. The previous-generation ESM2 hit a ceiling at around 50 million training samples, with diminishing returns from adding more parameters. ESM Cambrian pushed training data to 2.8 billion sequences—mainly by pulling in metagenomic data: protein sequences from never-cultured microorganisms found in soil, oceans, and the human gut. Once the data bottleneck broke, the scaling curve started climbing again.

"Evolution has generated four billion years of training data… we've currently sampled less than 1% of known protein sequence diversity."

This offers a fascinating counterpoint to the "data wall" debate in AI: language models face anxiety over exhausting human text, while protein research has a virtually unlimited supply of training data.

No AlphaFold-Style Feature Engineering: Single-Sequence Input Approaches the Nobel Prize–Winning Model

AlphaFold's Nobel Prize–winning capability relies on carefully engineered input: multiple sequence alignment (MSA). It needs to find hundreds of evolutionary relatives of the target protein, stack and align them to extract covariation patterns, then use that information to predict structure. This is textbook domain engineering—but it's also slow.
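The idea of a covariation pattern can be illustrated with a classical statistic: mutual information between two alignment columns. This is a toy sketch, not AlphaFold's method (AlphaFold feeds the alignment into a neural network), and the four-sequence alignment below is invented for illustration.

```python
from collections import Counter
from math import log2

def column(msa, i):
    """Return column i of the alignment as a list of residues."""
    return [row[i] for row in msa]

def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    ci, cj = Counter(column(msa, i)), Counter(column(msa, j))
    cij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), count in cij.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((ci[a] / n) * (cj[b] / n)))
    return mi

# Hypothetical toy alignment: columns 1 and 3 change together (K with D, E with R),
# while column 2 varies independently of both.
msa = [
    "AKGDL",
    "AKSDL",
    "AEGRL",
    "AESRL",
]

coupled = mutual_information(msa, 1, 3)      # columns that co-vary
independent = mutual_information(msa, 1, 2)  # columns that do not
```

Columns that mutate together often sit close in the folded protein, which is why the signal is useful. It also shows the weakness: with only a handful of relatives in the alignment, these counts become too sparse to trust.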

ESM Cambrian ditches MSA entirely, using only the model's representation of a single protein sequence as input to a structure prediction module. On general protein complex prediction, this approach is within 3 percentage points of AlphaFold 3. And on antibody design—the scenario drug developers care about most—ESM's single-sequence method actually wins: 50 vs. 47.

This makes biological sense: antibodies have enormous sequence diversity but relatively few sampled evolutionary relatives, which is precisely where MSA loses its edge. Hand-crafted alignment features only help when enough related sequences exist to align. In antibody design, where drug developers need them most, that advantage vanishes.

The structure prediction module also features an interesting design: a recurrent architecture that scales test-time compute by increasing the number of inference loops, analogous to test-time compute scaling in LLMs. More loops yield better results.

"Interpretability" for Protein Models: Biological Concepts Emerge Spontaneously from Masked Prediction

The research team borrowed sparse autoencoder tools from the mechanistic interpretability community to analyze ESM Cambrian's internal representations. They found that the model's latent space spontaneously decomposed into hierarchical features corresponding to real biological concepts: individual amino acids at the bottom, then structural motifs, protein domains, functional sites, and finally entire protein functional roles.
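For readers unfamiliar with sparse autoencoders, the following is a minimal sketch of the common formulation from the interpretability literature: an overcomplete ReLU encoder, a linear decoder, and a loss that combines reconstruction error with an L1 sparsity penalty. The dimensions, initialization, and penalty weight are assumptions, and the talk did not specify which variant the team used. Only the forward pass and objective are shown, with no training loop.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

class SparseAutoencoder:
    """Overcomplete autoencoder: d_model inputs, d_latent > d_model features."""

    def __init__(self, d_model, d_latent, seed=0):
        rng = random.Random(seed)
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_latent)]
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(d_latent)] for _ in range(d_model)]
        self.b_enc = [0.0] * d_latent
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """Map an activation vector to non-negative feature activations."""
        centered = [xi - b for xi, b in zip(x, self.b_dec)]
        pre = matvec(self.w_enc, centered)
        return relu([p + b for p, b in zip(pre, self.b_enc)])

    def decode(self, z):
        """Reconstruct the activation vector from feature activations."""
        return [r + b for r, b in zip(matvec(self.w_dec, z), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        z = self.encode(x)
        x_hat = self.decode(z)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        sparsity = sum(z)  # L1 norm, since activations are non-negative
        return mse + l1_coeff * sparsity

sae = SparseAutoencoder(d_model=4, d_latent=16)
x = [0.5, -1.0, 0.25, 2.0]  # stand-in for one model activation vector
features = sae.encode(x)
```

After training, each latent dimension ideally fires for one recognizable concept, which is what allows a feature to be labeled as a motif, a domain, or a functional site.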

A concrete example: the model learned to identify the "nucleophilic elbow," a catalytic motif that evolved independently across multiple unrelated proteins. The same feature activates reliably across four structurally divergent, evolutionarily distant proteins. This isn't simple sequence memorization; it's something closer to deep pattern recognition.

The team ultimately used the model to fold and analyze roughly 7 billion proteins, building the largest protein structure atlas to date—bigger than AlphaFold's database. In representation space, proteins naturally cluster by family, forming a kind of "Google Maps for proteins." All of this is a byproduct of pretraining.

The Self-Play Dilemma for LLMs: Generated Problems Get Harder but Are Completely Useless

Luke Bailey, a second-year PhD student in the labs of Tatsu Hashimoto and Tengyu Ma, focused on a key question: can we let LLMs generate their own training problems and train on them, the way AlphaZero played chess against itself, to break past the ceiling of human data?

The dominant paradigm for LLM post-training is large-scale reinforcement learning: collect massive amounts of coding and math tasks, let the model do rollouts in an environment, and update based on reward signals. Cursor's Composer 2 technical report showed a clean curve: downstream performance improves smoothly as RL tasks and compute scale up. But those tasks require human curation, and if you want models to surpass humans, human-generated problems will eventually run out.

The basic self-play setup: a single model plays two roles—a "conjecturer" generates new problems, a "solver" tries to solve them. The conjecturer is rewarded for generating problems the solver finds difficult. In principle, this should drive continuous mutual improvement.

In practice, self-play performed identically to vanilla RL—no improvement whatsoever. The conjecturer did keep generating harder problems—they were just worthless. Luke showed a Lean proof problem generated by the conjecturer late in training: a hideously complex, utterly inelegant formal statement. The model found a shortcut: manufacturing artificially complex garbage problems is the easiest way to stump the solver. It's like being asked to write a problem with a 50% solve rate—the laziest approach is a three-page high school calculus computation where the solver inevitably makes a careless mistake somewhere.

Self-Guided Self-Play: Constraining Generation Quality with a "Judge"

Luke's paper proposes SGS (Self-Guided Self-Play) to fix this, with two key changes.

First, the conjecturer no longer generates problems from scratch. Instead, it creates variants based on real problems the model currently can't solve. This anchors the synthetic data distribution to one that's known to be valuable.

Second, a third role is introduced—a "guide"—where the model judges whether generated problems are genuinely related to the target problem and not excessively complex. The conjecturer's final reward becomes "problem is hard enough" multiplied by "guide approves."
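The difference between the two reward schemes can be sketched in a few lines. The multiplication of a difficulty term by a guide-approval term comes from the talk; the specific difficulty rule (solve rate above zero and at most 50%) and all function names are assumptions made for illustration.

```python
def solve_rate(attempt_results):
    """Fraction of solver attempts (booleans) that succeeded."""
    return sum(attempt_results) / len(attempt_results)

def hard_enough(rate, max_rate=0.5):
    """1.0 if the solver succeeds sometimes but not usually (assumed rule)."""
    return 1.0 if 0.0 < rate <= max_rate else 0.0

def vanilla_reward(attempt_results):
    """Vanilla self-play: the conjecturer is paid for difficulty alone."""
    return hard_enough(solve_rate(attempt_results))

def sgs_reward(attempt_results, guide_approves):
    """SGS-style reward: difficulty multiplied by the guide's verdict."""
    approval = 1.0 if guide_approves else 0.0
    return hard_enough(solve_rate(attempt_results)) * approval

# A convoluted junk problem: hard for the solver, rejected by the guide.
junk_attempts = [True, False, False, False]
junk_vanilla = vanilla_reward(junk_attempts)
junk_sgs = sgs_reward(junk_attempts, guide_approves=False)
```

Under the vanilla scheme the junk problem earns full reward, which is exactly the shortcut described above. Under the multiplied reward it earns nothing unless the guide also signs off.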

On 3,000 Lean formal math problems, both standard RL and vanilla self-play plateau at a solve rate of around 60%. SGS, using a 7-billion-parameter model with 8x compute, matched the pass@4 performance of a 670-billion-parameter model. The improvement is significant, but Luke acknowledged the problem is far from solved: the curve still doesn't reach 100%, and self-play's potential remains largely unrealized.
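For context on the metric: pass@k is the probability that at least one of k sampled attempts solves a problem. A widely used unbiased estimator, popularized by code-generation evaluations, computes it from n samples of which c are correct. Whether the SGS paper uses this exact estimator was not stated in the talk, so treat the sketch below as background rather than the paper's evaluation code.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Equals 1 minus the probability that k samples drawn without
    replacement from the n are all incorrect.
    """
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any draw of k contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# One correct proof out of ten samples, evaluated at k = 4.
estimate = pass_at_k(n=10, c=1, k=4)
```

The benchmark score is then the average of this quantity over all problems.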

The Latency Problem in Voice AI: Can You Retrieve While Listening?

Arnab, from Giga—one of YC's fastest-growing companies—presented a Meta paper on RAG latency in voice AI scenarios.

The core tension is simple: in text, RAG can wait for the user to finish typing before retrieving, but in voice conversations, a 10-second pause before responding is unacceptable. Streaming RAG starts retrieval while the user is still speaking—audio is chunked, each chunk triggers a RAG pipeline on arrival, then a decision is made about whether the partial query is sufficient.

The paper proposes two approaches: one triggers at fixed intervals, running RAG on every audio chunk and comparing intermediate results against final results to decide whether to use them early; the other fine-tunes a model to actively judge "does this chunk contain critical new information," avoiding wasted computation on every chunk.
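The first, fixed-interval approach can be sketched as follows. The word-overlap retriever, the documents, and the chunking are all invented stand-ins; the sketch only shows the control flow of retrieving on every partial transcript and checking when the partial results already agree with the final ones.

```python
def retrieve(query, docs, k=1):
    """Toy retriever: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: (-len(q & set(d.lower().split())), d))
    return ranked[:k]

def streaming_retrieve(chunks, docs, k=1):
    """Run retrieval after every chunk of the transcript.

    Returns the final results and the index of the earliest chunk whose
    results already matched what the full query retrieves.
    """
    partial, history = "", []
    for chunk in chunks:
        partial = (partial + " " + chunk).strip()
        history.append(retrieve(partial, docs, k))
    final = history[-1]
    earliest = next(i for i, result in enumerate(history) if result == final)
    return final, earliest

# Hypothetical knowledge base and a transcript arriving in three chunks.
docs = [
    "refund policy for cancelled flights",
    "baggage allowance for international flights",
    "opening hours of the airport lounge",
]
chunks = ["what is the", "refund policy for", "cancelled flights please"]

final, earliest = streaming_retrieve(chunks, docs)
```

Here the right document is already on top after the second chunk, so the system could begin preparing its answer one chunk before the user stops speaking. The cost is a retrieval call per chunk, which is what the second, learned-trigger approach tries to avoid.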

On RAG benchmarks, the streaming approach reduced latency by 0.5 seconds on synthetic datasets and about 1.5 seconds on real human speech, with accuracy on par with waiting for the full query. Arnab emphasized that he cares more about the research direction around this class of problems than this specific paper's solution—"figuring out when to retrieve while still listening" is a problem worth exploring on its own.

Lean Isn't Just a Theorem Prover—It's Infrastructure for Verifiable Intelligence

Robert George, a third-year PhD student at Caltech, argued that Lean, the formal verification language, is moving beyond math competitions toward much broader applications.

Progress in formal proof over the past two years has been explosive: DeepMind and OpenAI successively claimed AI solutions to long-standing Erdős problems, AxiomProver solved all 12 Putnam competition problems, and last week DeepMind published cross-domain formal verification results. The progress curve on the MiniF2F benchmark is nearly exponential.

But Robert's real focus was Lean's potential for code verification. He introduced his project TorchLean, the first unified framework for natively implementing neural networks in Lean. You write tensor operations in a PyTorch-like style, everything compiles to a shared intermediate representation, and you can then prove properties of the code: for instance, that Flash Attention is equivalent to standard attention at the specification level, or that attention mechanisms are permutation-invariant without positional encoding.
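The permutation property can be checked numerically in plain Python, which is of course far weaker than a Lean proof: a test covers one input, a proof covers all of them. Strictly speaking, self-attention without positional encoding is permutation-equivariant (reordering the input tokens reorders the outputs the same way), and it becomes invariant once the outputs are pooled. The sketch assumes identity projections (Q = K = V = input), which is a simplification and not TorchLean code.

```python
from math import exp, sqrt

def softmax(row):
    m = max(row)
    exps = [exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head self-attention with Q = K = V = x and no positional encoding."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

def permute(rows, perm):
    return [rows[i] for i in perm]

def mean_pool(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

x = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]  # three tokens, two dimensions
perm = [2, 0, 1]

attend_then_permute = permute(self_attention(x), perm)
permute_then_attend = self_attention(permute(x, perm))
```

The two results agree up to floating-point rounding, and the mean-pooled output is the same for both orderings.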

He even used TorchLean to reproduce Thinking Machines Lab's finding that "LLM inference is nondeterministic even at temperature zero"—verifying all the way from floating-point arithmetic down to near GPU-kernel level that tiny floating-point errors can indeed flip the final argmax output.
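The underlying mechanism is easy to reproduce on a CPU: floating-point addition is not associative, so summing the same terms in a different order can give a different result. The numbers below are deliberately extreme to make the effect obvious, and the two-logit setup is a toy assumption, not the TorchLean proof or a real GPU kernel.

```python
def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def sum_in_order(terms):
    """Accumulate left to right, as a sequential reduction would."""
    total = 0.0
    for t in terms:
        total += t
    return total

# The same three terms, summed in two different orders.
terms_order_a = [1e16, 1.0, -1e16]  # the 1.0 is rounded away when added to 1e16
terms_order_b = [1e16, -1e16, 1.0]  # the large terms cancel first, so the 1.0 survives
other_logit = 0.5

logits_a = [sum_in_order(terms_order_a), other_logit]
logits_b = [sum_in_order(terms_order_b), other_logit]

token_a = argmax(logits_a)
token_b = argmax(logits_b)
```

With order A the first logit comes out as 0.0 and the second token wins; with order B it comes out as 1.0 and the first token wins. On real hardware the discrepancies are far smaller, but when two logits are nearly tied, a change in reduction order is enough to flip a greedy decode.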

"LLMs can write code, but can they prove the code is correct?"

Robert believes the industry should shift from "coding at scale" to "verifiable coding," where AI-generated code ships with formal proofs.

Coding with an RTS Mindset: A Founder's Extreme Practice

Lukens Orthwein ran WeChat Ads growth from 2012 to 2015 and now runs his own AI consumer entertainment company, Channel AI. His talk was the least academic of the evening—but possibly the most immediately useful for the engineers in the room: treat agentic programming like a real-time strategy game.

His core argument: programming used to be like chess—linear, single-threaded, deliberate. Programming with agents is more like StarCraft—you're managing economy, production, and multiple fronts simultaneously. Perfecting any single dimension isn't enough; what matters is parallelism and attention allocation.

In practice, he uses Git worktrees to maintain a large number of parallel working copies on a single machine, each running an independent Claude agent. An orchestrator agent distributes tasks to worker agents, and each worker is instructed to push as far as possible without stopping to ask a human, even at the risk of mistakes, because human time is far more expensive than tokens. Every agent's goal is to get all the way to a PR.
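A setup like this might be scripted roughly as follows. The sketch plans one `git worktree add` command per task, each on its own new branch in a sibling directory. The directory naming, the `agent/` branch prefix, and the `main` base branch are assumptions, and launching the agents themselves is left out because the talk did not describe that part.

```python
import subprocess
from pathlib import Path

def worktree_commands(repo_dir, task_names, base="main"):
    """Build one `git worktree add` command per task (nothing is executed)."""
    root = Path(repo_dir)
    commands = []
    for name in task_names:
        # Each task gets a sibling directory and its own new branch.
        path = root.parent / f"{root.name}-{name}"
        commands.append([
            "git", "-C", str(root), "worktree", "add",
            "-b", f"agent/{name}", str(path), base,
        ])
    return commands

def create_worktrees(commands):
    """Run the planned commands; raises CalledProcessError if git fails."""
    for cmd in commands:
        subprocess.run(cmd, check=True)

# Hypothetical repository path and task names.
planned = worktree_commands("/tmp/app", ["fix-login", "add-search"])
```

Because every worktree has its own directory and branch, agents can edit, build, and commit in parallel without touching each other's files, and each branch can become its own PR.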

Tokens Are Your Minerals, APM Is Your Efficiency Metric

Lukens' team built an internal APM (Actions Per Minute) tracker, but instead of mouse clicks it measures agent tool calls—how many are happening per minute, per hour, per day. The logic is the same as in RTS games: high APM doesn't guarantee a win, but nobody wins without high APM.
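A tracker of this kind could be as simple as a sliding-window counter. The following is a sketch under assumed details (a 60-second window, timestamps in seconds, events recorded in time order), not the team's internal tool.

```python
from collections import deque

class ApmTracker:
    """Counts agent tool calls inside a sliding time window."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.events = deque()  # timestamps of tool calls, oldest first

    def record(self, timestamp):
        """Register one tool call at the given time (in seconds)."""
        self.events.append(timestamp)

    def apm(self, now):
        """Tool calls per minute over the window ending at `now`."""
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events) * 60.0 / self.window

tracker = ApmTracker()
for t in [0.0, 10.0, 30.0, 65.0]:
    tracker.record(t)

current = tracker.apm(now=70.0)  # only the calls at 30s and 65s are still in the window
```

Summing the same counter over longer windows gives the hourly and daily figures mentioned above.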

He also mapped each agent's tmux session to different Warcraft and StarCraft units, using corresponding sound effects as status alerts. Different colors and sounds represent different task types—you know which agent needs attention without reading anything. This isn't a joke: when managing 20 parallel agents, visual and audio cues are genuinely effective attention management tools.

Other principles: always run in permission-skip mode (sandbox anything unsafe, but never let manual confirmations slow things down); agent specs don't need to be perfect—let them adapt on the fly and fix mistakes after; actively maintain a structured knowledge base so future agents can grab context quickly without re-reading code.

His numbers: after adopting this approach, the team's PRs per person per month increased 3.5x, and after a full rollout last month, grew another 60%.

Can Human Data Train Superhuman Intelligence? The Host's Central Challenge

Host Francois opened with a philosophical question that threaded through the entire event: if the complete solution space is F, and human-known solutions are a subset H, can a model trained on H—no matter how much test-time compute or recursive self-improvement you throw at it—ever sample from F \ H?

His position was clear: probably not. He cited the comparison between AlphaGo and AlphaZero—the former trained on human games, the latter self-played from scratch. AlphaZero, unconstrained by the biases of human reasoning paths, ultimately reached a higher level. He believes LLMs need an AlphaZero-like path to achieve true superhuman intelligence—which is precisely the direction Luke Bailey's self-play research is trying to explore.

He also posed what he considers the two remaining core questions in AI: intelligence per sample (what's the optimal learning strategy for each new data point?) and intelligence per watt (can smaller models be more efficient?). His experiments found that ICL performance doesn't monotonically improve with more samples—it fluctuates, sometimes degrades, and eventually hits the hard wall of context length. Human learning, by contrast, is monotonically increasing: Magnus Carlsen gets better the more chess he plays, always running the same algorithm. This implies there exists some learning process we haven't yet found that is far more sample-efficient than current methods.