6x the Price for 2.5x the Speed: Only One Variable Matters

Source: Dwarkesh Patel | Published: 2026-04-29T17:20:27Z

The ratio of FLOPs to memory bandwidth holds steady around 300 across GPU generations — multiply by model sparsity and you get the optimal batch size. This formula is scale-invariant; only sparsity matters.

Claude Code Fast, Cursor Fast Mode, Codex Fast Mode — pay 6x the price for 2.5x the speed. The mechanics behind this can be explained with two diagrams and a few equations, and they hinge on a single variable: batch size.

The Cost Floor of Inference: You Can't Discount Forever

Running large model inference on a GPU cluster is constrained by two factors: the time to load weights from memory into the chip (memory bandwidth-bound) and the time to perform matrix multiplications (compute-bound). At small batch sizes, the fixed cost of weight loading can't be amortized, making per-token cost extremely high. As batch size grows, weight loading gets shared across more users and per-token cost drops sharply — until it hits the hard floor set by compute time.
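
To make the curve concrete, here's a minimal cost sketch. Every number is an illustrative assumption rather than a measurement: roughly 1TB of weights, 8TB/s of HBM bandwidth, 2 PFLOP/s of matmul throughput, and about 100 billion active parameters per token.

```python
# Per-token cost model: each batch pays a fixed weight-loading time plus
# compute time proportional to batch size; dividing by batch size gives
# the per-token cost curve described above.

WEIGHT_BYTES = 1e12     # assumed: ~1T parameters at 1 byte each
HBM_BANDWIDTH = 8e12    # assumed: 8 TB/s aggregate bandwidth
FLOPS = 2e15            # assumed: 2 PFLOP/s of matmul throughput
FLOPS_PER_TOKEN = 2e11  # assumed: 2 FLOPs per active param, ~100B active

def per_token_time(batch_size: int) -> float:
    load = WEIGHT_BYTES / HBM_BANDWIDTH              # shared by the batch
    compute = batch_size * FLOPS_PER_TOKEN / FLOPS   # grows with the batch
    return (load + compute) / batch_size

for b in (1, 32, 256, 2048, 16384):
    print(f"batch {b:>5}: {per_token_time(b) * 1e3:8.3f} ms/token")
# The curve falls steeply, then flattens at the compute floor of
# FLOPS_PER_TOKEN / FLOPS = 0.1 ms/token.
```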

This is why "slow mode" has almost no room for price cuts: KV cache and compute are unique to each user and can't be shared across requests. The only savings come from the weight-loading portion, which at large batch sizes is already amortized down to nearly zero.

300x Sparsity: A Surprisingly Accurate Rule of Thumb

Setting weight-loading time equal to compute time yields the optimal batch size. The derivation is remarkably clean: the ratio of FLOPs to memory bandwidth has held steady at roughly 300 across multiple GPU generations. Multiply that by the model's sparsity ratio (total parameters divided by active parameters), and you get the optimal batch size.

For DeepSeek V3 — 256 experts with 32 active, a sparsity ratio of 8 — the optimal batch is about 2,400. Real deployments add a 2–3x safety margin on top. The formula is independent of model scale and depends only on sparsity — a rather counterintuitive result.
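
A sketch of the derivation, assuming bf16 weights (2 bytes per parameter) and 2 FLOPs per active parameter per token, so the factors of two cancel; the hardware numbers are H100-class and illustrative:

```python
# Optimal batch = batch size at which weight-load time equals compute time.

def optimal_batch(flops, bandwidth, total_params, active_params):
    load_time = 2 * total_params / bandwidth       # seconds, bf16 weights
    compute_per_token = 2 * active_params / flops  # seconds of matmul
    return load_time / compute_per_token

ratio = 1e15 / 3.35e12  # H100-class: ~1 PFLOP/s over ~3.35 TB/s -> ~300
sparsity = 8            # DeepSeek V3 per the article (256 experts, 32 active)
print(f"rule of thumb: {ratio * sparsity:.0f}")           # ~2,400
print(f"from the definition: "
      f"{optimal_batch(1e15, 3.35e12, 8e11, 1e11):.0f}")  # same number
```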

A Train Departs Every 20 Milliseconds

Inference scheduling on a GPU cluster works like a train running on a fixed timetable: a batch ships roughly every 20 milliseconds, whether it's full or not. That 20ms isn't arbitrary — it's exactly the time needed to read all of HBM once (capacity divided by bandwidth). On the Rubin generation, 288GB divided by 20TB/s comes out to about 15ms.

In the worst case, a request just misses the train. Wait for the next one plus processing time, and the latency ceiling is around 40ms. That's the physical limit of "fast mode": you can pay to jump onto a smaller batch, but you can't beat a single weight-loading pass.
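
Both numbers fall straight out of the capacity-over-bandwidth rule:

```python
# Timetable period = time for one full pass over HBM.
capacity, bandwidth = 288e9, 20e12               # Rubin figures from the article
interval = capacity / bandwidth
print(f"one HBM pass: {interval * 1e3:.1f} ms")  # ~14.4 ms, i.e. "about 15"

# Worst case: just miss a departure, wait a full interval, then ride one.
# At the article's round 20 ms period this gives the ~40 ms ceiling.
print(f"latency ceiling: ~{2 * interval * 1e3:.0f} ms")
```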

Sparsity Is a Free Lunch — Until It Isn't

A sparse model with 64 experts and 370 million active parameters matches the quality of a 1.3-billion-parameter dense model. Sounds great — 4x compute savings. But the trade-off is a 64x blowup in total parameters.

"From the analysis we've done, it's pure upside. Keep increasing sparsity until you run out of users to fill the batches."

From an inference economics perspective, higher sparsity means longer weight-loading time, but also larger batches that spread loading costs more thinly. The real bottleneck shifts to memory capacity — you need more memory to house all those experts that aren't active but still must be resident.
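
A sketch of that shift, reusing the batch-size rule from above with assumed figures (100 billion active parameters, bf16 weights): per-token compute stays flat while resident memory and required batch size climb with sparsity.

```python
ACTIVE = 1e11   # assumed: 100B active parameters
RATIO = 300     # FLOPs-to-bandwidth ratio, per the article

for sparsity in (1, 4, 16, 64):
    total = ACTIVE * sparsity
    batch = RATIO * sparsity       # optimal batch from the rule of thumb
    weights_gb = 2 * total / 1e9   # bf16 resident weights
    print(f"sparsity {sparsity:>2}: batch {batch:>6.0f}, "
          f"resident weights {weights_gb:>6.0f} GB")
# Compute per token never changes; what changes is how many users you
# need to fill the batch and how much memory must hold idle experts.
```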

Full Connectivity Inside a Rack, Not Across Racks

The communication pattern in MoE layers is all-to-all: any GPU might need to send tokens to any other GPU. Nvidia's Blackwell racks achieve this internally via NVSwitch — any two of the 72 GPUs are just two hops apart. But the scale-out network across racks is roughly 8x slower.

This means the maximum size of an expert layer is effectively limited by how many GPUs fit in a single rack. From Hopper's 8 to Blackwell's 72 to Rubin's 500+, the scale-up domain keeps expanding. The barriers driving this progression are surprisingly physical: not chip design, but cable density, bend radii, rack weight limits, and cooling.

Pipeline Parallelism: Free in Inference, Unnecessary in Practice

Slicing a model by layers across different racks (pipeline parallelism) neither increases nor decreases inference latency — batches flow through racks sequentially and total time stays the same. Its only benefit is reducing per-rack memory requirements. But a Blackwell rack already has tens of terabytes of memory, and a trillion-parameter model only needs about 1TB. Memory isn't the bottleneck.
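
A toy model of the invariance, with an assumed 96-layer model at 0.2ms per layer:

```python
# Splitting layers across S stages divides per-stage time by S, but the
# batch must traverse S stages, so end-to-end latency is unchanged.

total_layers, time_per_layer = 96, 0.2e-3  # assumed values

for stages in (1, 4, 8):
    per_stage = (total_layers // stages) * time_per_layer
    print(f"{stages} stage(s): {stages * per_stage * 1e3:.1f} ms end to end")
# Prints 19.2 ms for every configuration.
```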

Training is a different story. Pipeline parallelism creates "pipeline bubbles" — earlier racks sit idle while waiting for later racks to finish backpropagation. Filling those bubbles requires splitting batches into micro-batches, which in turn limits how efficiently weight loading gets amortized.

KV Cache: The Bottleneck Pipeline Parallelism Can't Fix

Pipeline parallelism partitions weights perfectly, but it can't do anything about KV cache. Adding more pipeline stages means fewer layers per stage, but proportionally more in-flight sequences — the two effects cancel exactly, leaving per-GPU KV cache usage unchanged.
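
The cancellation in one loop, with assumed figures (96 layers, 2KB of KV per token per layer, 8 sequences per batch, one batch resident per stage):

```python
layers, kv_per_token_layer = 96, 2048  # assumed: 2 KB of KV per token, per layer
batch = 8                              # assumed sequences per batch

for stages in (1, 4, 8):
    layers_per_stage = layers // stages
    in_flight = batch * stages         # one batch resident per stage
    per_gpu = layers_per_stage * in_flight * kv_per_token_layer
    print(f"{stages} stage(s): {per_gpu / 2**20:.1f} MiB of KV per context token")
# Per-GPU KV is identical at every stage count.
```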

The same goes for the batch dimension: KV cache can't be shared across users or sharded across pipeline stages. It's the most stubborn memory consumer in any inference system.

Why Models Haven't Gotten Much Bigger Since GPT-4

GPT-4 reportedly exceeded one trillion parameters when it launched in 2023. In the nearly three years since, mainstream model sizes haven't grown dramatically. The constraint isn't training capability — pipeline parallelism solved the weight-storage problem long ago. The real bottleneck is memory bandwidth within the scale-up domain.

In the Hopper era, 8 GPUs shared 640GB of memory. It took Blackwell to expand the scale-up domain to the 10–20TB range — enough for a 5-trillion-parameter model plus sufficient KV cache for a real user base. Weight-loading latency equals total parameters divided by aggregate bandwidth across all GPUs in the scale-up domain. Going from 8 to 72 GPUs compressed that latency by nearly 10x. Google's TPUs had large scale-up domains early on, which may partly explain Gemini's lead during a certain period.
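
The arithmetic behind that compression, holding per-GPU bandwidth fixed (an assumed 8TB/s) to isolate the effect of domain size:

```python
# Weight-load latency = weight bytes / aggregate bandwidth of the domain.

def load_ms(weight_bytes, gpus, bw_per_gpu):
    return weight_bytes / (gpus * bw_per_gpu) * 1e3

WEIGHTS = 1e12  # assumed: ~1T parameters at 1 byte each
print(f" 8 GPUs: {load_ms(WEIGHTS, 8, 8e12):.1f} ms")   # 15.6 ms
print(f"72 GPUs: {load_ms(WEIGHTS, 72, 8e12):.1f} ms")  # 1.7 ms, 9x faster
```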

How API Pricing Leaks Infrastructure Details

Gemini 3.1 charges 50% more beyond 200K tokens. Plug that inflection point into a roofline analysis and you can back out that each token's KV cache is roughly 2KB — a perfect match for Character AI's published architecture (128-dim heads, 8 KV heads, cross-layer sharing).
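
The cited architecture makes the 2KB figure easy to check, assuming a 1-byte (int8) KV cache; with cross-layer sharing, only one layer's worth of KV counts:

```python
head_dim, kv_heads, bytes_per_value = 128, 8, 1  # int8 is an assumption
kv_per_token = 2 * head_dim * kv_heads * bytes_per_value  # K plus V
print(f"{kv_per_token} bytes per token")  # 2048, i.e. ~2 KB
```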

Output tokens cost about 5x more than input tokens. The reason is straightforward: during decode, every step reloads the full weights to produce just one new token per sequence, which is heavily bandwidth-bound; during prefill, thousands of prompt tokens share a single load, shifting the bottleneck to compute. The 5x price gap tells you that frontier models are deeply memory bandwidth-bound during decode.
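
One way to read the 5x, under a deliberately simplified model that ignores KV-cache reads (which push decode even further toward bandwidth): prefill runs at roughly the compute floor, while decode pays a share of the weight load on top, which implies decode batches run well below the optimal size.

```python
# decode cost/token  = load/B + c, with load = B_opt * c by definition
# prefill cost/token ~ c (the load is amortized over the whole prompt)
# so the decode markup over prefill is 1 + B_opt / B.

def decode_markup(batch, batch_opt):
    return 1 + batch_opt / batch

print(decode_markup(batch=600, batch_opt=2400))  # 5.0: batches ~4x below optimal
```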

Training, RL, and Inference Compute Should Split Evenly in Thirds

Assuming total cost is optimal when spending is equal across stages (a heuristic that generally holds in power-law regimes), you can derive an interesting conclusion: pretraining tokens, RL tokens, and inference tokens should all be in the same order of magnitude.

Working backwards: assume a frontier model serves 50 million tokens per second for two months — that's roughly 200 trillion inference tokens. If active parameters are around 100 billion, Chinchilla-optimal training requires only 2 trillion tokens. Actual pretraining volume is about 100x Chinchilla-optimal — and that degree of "over-training" is driven entirely by inference economics.

"Even if you're off by 50%, being able to derive these order-of-magnitude numbers from first principles is meaningful in itself."

The Memory Wall at One Million Tokens of Context

From GPT-4's 8K to today's 100–200K, context length has nearly plateaued over the past couple of years. The quadratic term in compute cost (attention) has such a gentle slope that it doesn't become significant until a million tokens. What's actually blocking context expansion is memory bandwidth: each additional token of KV cache means a few more KB to read from HBM, and there's no fast path to increasing HBM bandwidth.

Sparse attention helps — DeepSeek's approach reduces memory bandwidth requirements from linear to square-root scaling. But push sparsity too hard and quality degrades. If Dario is right that "you don't need continual learning — in-context learning is enough," then models will need contexts on the order of 100 million tokens, and the road there is nowhere in sight.
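
To see why bandwidth is the wall, count the KV bytes a sequence must re-read for every decoded token, taking the article's ~2KB per token and using square-root scaling as a stand-in for the sparse-attention behavior it describes:

```python
import math

KV_BYTES = 2048  # ~2 KB of KV per token, per the article

for context in (100_000, 1_000_000, 100_000_000):
    dense = context * KV_BYTES                # dense attention: linear
    sparse = math.isqrt(context) * KV_BYTES   # sparse: ~sqrt scaling
    print(f"{context:>11,} tokens: dense {dense / 1e9:6.2f} GB, "
          f"sparse {sparse / 1e6:5.2f} MB per decoded token")
```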

KV Cache Storage Economics: From HBM to Hard Drives

API providers' cache pricing reveals their storage tiering strategy. Using "capacity divided by bandwidth" as each tier's characteristic time: HBM clocks in at about 20ms, DDR at a few seconds, flash at around one minute, and hard drives at about one hour.
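
The numbers below are illustrative per-node figures chosen to land near the article's quoted times, not vendor spec sheets:

```python
# Characteristic time = capacity / bandwidth for each storage tier.
tiers = {
    "HBM":   (288e9, 20e12),   # 288 GB at 20 TB/s   -> ~14 ms
    "DDR":   (2e12, 0.5e12),   # 2 TB at 500 GB/s    -> ~4 s
    "Flash": (30e12, 0.5e12),  # 30 TB NVMe array    -> ~60 s
    "HDD":   (20e12, 5e9),     # 20 TB spindle array -> ~1 hour
}
for name, (capacity, bandwidth) in tiers.items():
    print(f"{name:>5}: {capacity / bandwidth:8.1f} s")
```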

Mapping a certain API's two cache pricing tiers (5 minutes and 1 hour) to their matching characteristic times points to flash and hard drives, respectively. Hard drives — technology from decades ago — still play a role in cutting-edge AI inference. They're slow enough that loading full capacity takes an hour, but cheap enough to store KV caches for users who "might come back in a bit."

The Convergent Evolution of Neural Networks and Cryptography

Cryptographic protocols and neural networks share a striking architectural similarity: both need information to mix thoroughly across all inputs. The difference is directionality — cryptography turns structured information into indistinguishable randomness, while neural networks extract high-order structure from seemingly random data.

This symmetry is most visible in adversarial settings. An adversarial attack on an image classifier — finding a tiny perturbation that completely changes the output — is precisely the normal operating mode of a well-designed cipher. The Feistel structure from cryptography (building reversible functions from irreversible ones) was transplanted directly into the 2017 RevNets paper, which made each layer's input recoverable from its output rather than stored, nearly eliminating activation storage during training.
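
A minimal sketch of that coupling (the functions here are arbitrary stand-ins): F and G need not be invertible, yet the block inverts exactly, which is what lets a RevNet recompute each layer's input instead of storing it.

```python
def forward(x1, x2, F, G):
    y1 = x1 + F(x2)   # update one half using the other
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2, F, G):
    x2 = y2 - G(y1)   # undo the updates in reverse order
    x1 = y1 - F(x2)
    return x1, x2

F = lambda v: v * v       # non-invertible on its own
G = lambda v: 3 * v + 1
y = forward(2.0, 5.0, F, G)
print(inverse(*y, F, G))  # (2.0, 5.0), recovered exactly
```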

"With KV cache, you spend memory to save compute. With RevNets, you spend compute to save memory. At current hardware ratios, spending memory to save compute is usually the better deal."

The Physical Limits of Hardware Design Are More 'Physical' Than You'd Think

Racks went from Hopper's 8 GPUs to Blackwell's 72, driven not by new semiconductor processes but by a product decision: switching from trays to racks as the deployment unit. Going from 72 to Rubin's 500+ involves genuine physical engineering challenges — more complex cable routing, higher connector density, and more extreme cooling solutions.

Cable bend radii, rack weight, structural steel strength, power delivery capacity — these sound like last century's engineering problems, yet they concretely determine how large the scale-up domain can be, which in turn determines how fast models can run and how far context can stretch. At the frontier of hyperscale computing, the ultimate limits are steel and copper.
