OpenAI's Math Team Recaps How a 42-Year-Old Conjecture Fell in Just 12 Hours with ChatGPT

Source: OpenAI | Published: 2026-04-28T17:01:01Z

Mathematician Ernest Ryu spent 40+ hours failing to prove Nesterov's accelerated gradient method diverges — then cracked the 42-year-old open problem in a 12-hour ChatGPT session.


Math used to be large language models' most embarrassing weakness. Four years ago, Google released a math model called Minerva. Sébastien Bubeck, then a researcher at Microsoft, nearly fell out of his chair watching the demo — because the model could calculate a line passing through a few points on a plane given their coordinates. Looking back today, it's almost laughable. But it shows just how low the bar was.


From Botching a Camping Bill to Solving a 42-Year-Old Open Problem

Before joining OpenAI, applied mathematician Ernest Ryu had been testing ChatGPT's everyday math abilities. Three people go camping, each buys a dozen items, and they need to split the bill evenly — in 2023, 2024, and even early 2025, the model couldn't get it right. Scheduling a Zoom call across time zones? Same story.
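The camping problem is ordinary arithmetic: sum what each person paid, divide by the number of campers, and settle the differences. A minimal sketch (the names and amounts are made up for illustration):

```python
def settle(payments):
    """payments: {name: amount paid}.
    Returns {name: amount still owed (+) or due back (-)}."""
    share = sum(payments.values()) / len(payments)
    return {name: round(share - paid, 2) for name, paid in payments.items()}

# Hypothetical totals for three campers' purchases
balances = settle({"Alice": 90.00, "Bob": 60.00, "Carol": 30.00})
# Even share is 60: Alice is owed 30, Bob is even, Carol owes 30
```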

Then "it just changed." Ernest put it more precisely: unless you're trying to invent entirely new mathematics, ChatGPT can now handle every math problem you'll ever encounter — differential equations, differential geometry, the advanced math used across STEM fields, all of it. You still need to verify the output, and the model will make mistakes, but for 99% of people's math needs, it's already good enough.

Ernest ran a far more hardcore experiment on his own. He dug up a classic open problem in optimization theory: does Nesterov's accelerated gradient method diverge in the worst case? The problem had been open for 42 years. Every night from 8 PM — after putting his son to bed — until midnight, he had four hours of focused work. Three nights, twelve hours total, spent going back and forth with ChatGPT. Not firing off a single prompt and getting an answer, but playing the role of verifier — correcting the model's mistakes, steering the conversation toward directions he thought were promising. In the end, the proof was complete, and it held up. Without AI, he'd already spent over 40 hours trying and failed. Conservative estimate: doing it alone would have taken a month.
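For context, the method at the center of the conjecture is only a few lines of iteration. Below is a minimal sketch of Nesterov's accelerated gradient method with the standard momentum schedule, run on a toy quadratic; the objective, step size, and iteration count are illustrative and have nothing to do with the worst-case construction Ryu studied:

```python
import numpy as np

def nesterov_agm(grad, x0, step, iters):
    """Nesterov's accelerated gradient method:
       x_{k+1} = y_k - step * grad(y_k)
       y_{k+1} = x_{k+1} + k/(k+3) * (x_{k+1} - x_k)
    (k/(k+3) is the usual (k-1)/(k+2) schedule with k counted from 0).
    """
    x_prev = y = np.asarray(x0, dtype=float)
    for k in range(iters):
        x = y - step * grad(y)
        y = x + (k / (k + 3)) * (x - x_prev)
        x_prev = x
    return x_prev

# Toy example: minimize f(x) = ||x||^2 / 2, whose gradient is x
x_star = nesterov_agm(grad=lambda x: x, x0=[5.0, -3.0], step=0.5, iters=100)
# converges toward the minimizer at the origin
```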

He chose to post the result on Twitter instead of writing a paper, "because that's more fun." It became one of the earliest cases of AI solving a genuinely open math problem.

Math Is the Perfect Yardstick for Reasoning

Why does OpenAI care so much about math? Sébastien offered a deeply pragmatic reason: math is the cleanest benchmark. Problems are precise, unambiguous, and everyone agrees on what's being asked. Answers are verifiable — there's no gray area between right and wrong. Over the past four years, math has been the best lens for tracking progress in model reasoning.

But the deeper reason lies in what math demands of reasoning quality. Solving a math problem can require sustained thinking for days or even weeks, and you must maintain logical consistency throughout — if any single link in the chain breaks, the entire argument collapses, no matter how correct everything after it is. This is precisely the capability reasoning models need most: self-correction after errors, and coherence across extremely long chains of thought.

Sébastien's analogy is straightforward: why do humans study math? Not because everyone needs to become a mathematician, but because math builds rigorous logical thinking. The same applies to AI — reasoning ability gained through math transfers to every other domain.

The Erdős Problems: From Literature Search to Original Proofs

Paul Erdős was one of the most prolific mathematicians of the last century, publishing roughly 1,500 papers. He had no permanent home, wandering between universities worldwide, collaborating and posing problems wherever he went. The math world even has the concept of an "Erdős number" — how many degrees of collaboration separate you from Erdős. Sébastien's Erdős number is 2; Ernest's is 3. "That basically tells you our respective ages," Sébastien joked.
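An Erdős number is simply the shortest-path distance to Erdős in the coauthorship graph, which a breadth-first search computes directly. A minimal sketch over a made-up collaboration graph:

```python
from collections import deque

def erdos_number(coauthors, person, root="Erdős"):
    """Shortest coauthorship distance from `person` to `root` via BFS;
    returns None if no chain of collaborations connects them."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        a = queue.popleft()
        if a == person:
            return dist[a]
        for b in coauthors.get(a, ()):
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return dist.get(person)

# Hypothetical graph: A coauthored with Erdős, B with A, C with B
graph = {
    "Erdős": ["A"], "A": ["Erdős", "B"],
    "B": ["A", "C"], "C": ["B"],
}
# erdos_number(graph, "A") == 1, erdos_number(graph, "C") == 3
```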

Mathematician Thomas Bloom maintains a website tracking all unsolved Erdős problems — about a thousand of them. Once GPT became capable of research-level math, this problem set naturally became a testing ground.

The first breakthrough was "deep literature search." GPT scanned thousands of papers and found the answer to a particular Erdős problem buried in a completely unrelated field of mathematics — the paper's author had no idea they'd solved an Erdős problem, having used entirely different terminology. GPT made the cross-domain connection.

After team member Mark Sellke systematically verified the results, Sébastien tweeted that the model had found solutions to 10 Erdős problems. The tweet sparked controversy — people assumed these were all brand-new original solutions, when most were actually rediscoveries of results already in the literature. Google DeepMind's Demis Hassabis even weighed in publicly.

But the ending was unexpected: just a few months later, they had more than 10 completely original proofs of Erdős problems — results that existed nowhere in any prior literature — publishable in top combinatorics journals. Some came from ChatGPT, some from internal models. The leap from "finding existing answers in the literature" to "producing original mathematics" took only a few months.

"AGI Time": From Seconds to Weeks on the Reasoning Arc

Sébastien proposed a framework for measuring progress: AGI time. AI can simulate human thinking, but for how long? Two years ago, models could roughly simulate a high schooler spending a few minutes on a problem. Now, they can simulate a researcher thinking for hours or even days.

The trajectory over the past four years has been remarkably consistent: seconds → minutes → hours → days. Next comes weeks, then months. Sébastien acknowledged this remains an open research problem — "I don't think anyone on Earth knows exactly how to do it" — but the trend itself shows no signs of slowing.

Ernest explained the bottleneck from a different angle. He observed that mathematicians around him use AI by working within ChatGPT's context window, roughly equivalent to 50 pages of mathematical writing. But many important proofs far exceed 50 pages, and the thinking behind even a 10–30 page paper is orders of magnitude longer than the final product. He believes the approach Codex uses for large codebases — continuously accepting instructions, periodically compressing context, maintaining coherence across ultra-long sessions — will carry over to mathematical research.
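The compress-and-continue pattern Ernest describes can be sketched abstractly. Everything below is hypothetical: `summarize` is a stand-in for whatever compression the real system performs, and the limits are arbitrary. The sketch only shows the shape of the idea: when the transcript nears the window limit, replace the oldest turns with a summary and keep going.

```python
# Hypothetical sketch of periodic context compression for long sessions.
def summarize(turns):
    # Placeholder: a real system would ask a model for a faithful summary.
    return "[summary of %d earlier turns]" % len(turns)

def run_session(instructions, window_limit=8, keep_recent=3):
    """Process instructions one at a time, compressing old turns whenever
    the context exceeds the window limit, so the session never overflows."""
    context = []
    for turn in instructions:
        context.append(turn)
        if len(context) > window_limit:
            # Fold everything but the most recent turns into one summary entry.
            old, recent = context[:-keep_recent], context[-keep_recent:]
            context = [summarize(old)] + recent
    return context

final = run_session([f"step {i}" for i in range(20)])
# context stays bounded: one rolling summary plus the most recent turns
```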

The Automated Researcher: From "Professor Guiding a Student" to Autonomous Exploration

The current mode of AI-driven math research is essentially a "professor-student interaction": humans set the direction, the model attempts solutions, humans verify and correct, and the cycle repeats. The upside is massive time compression — Ernest's 12 hours versus a potential month — but it has a ceiling.

To achieve true breakthroughs — solving long-standing mathematical conjectures, or completing biology research requiring repeated interaction with wet labs — models need to work autonomously for much longer stretches. This is the vision behind the "automated researcher": instead of waiting for human feedback after each conversational turn, a model or model cluster pushes forward on its own over timescales of weeks or even months.

Sébastien highlighted an easily overlooked capability: models aren't just getting good at answering questions — they're getting good at asking them. Internal agents at OpenAI can already find errors in papers and propose corrections. They can also pose research questions compelling enough that human mathematicians look at them and say, "Maybe I should write a paper about this."

Math Will Become a Far More Connected Endeavor

Ernest's take on the future of mathematics is specific. Research math today is extremely siloed — you write a paper knowing that maybe five people on Earth will care about it. After publication, it enters the archives and sits untouched for 20 years.

But AI has read it. If some future problem has a useful connection to that paper, AI will dig it up. Ernest says he now feels more confident about his published work — even if nobody uses it today, as long as it's useful someday, it won't stay buried. Conversely, he can now tap into results from mathematical fields he never studied, discovering connections to his own research through AI.

Verification will speed up too. Currently, an important 300-page proof can take years from publication to full community verification, and sometimes fatally flawed proofs are accepted long before anyone catches the error. AI isn't perfect, but it's far more patient than humans — it can deliver preliminary verification within a week of publication, giving subsequent research a more reliable foundation to build on.

Domain Expertise Matters More Than Ever

Sébastien issued a serious warning: the most dangerous outcome is humans handing the reins to AI and then skipping the hard foundational training. He pointed to something already happening — non-mathematicians attempting to prove theorems with AI tools, producing dozens of pages of "proofs" that turn out to be completely wrong.

"We can extract these results from ChatGPT because we have years of training and deep understanding of the field." He hasn't seen thousands of non-experts suddenly proving new theorems. Quite the opposite — domain expertise is the prerequisite for using these tools, not something the tools replace.

Ernest raised the same concern about programming. He doesn't have a computer science background, but he took classes, wrote his own code, and wrestled with debuggers. In today's university curricula, those things may no longer be required. He thinks that's dangerous.

Sébastien's reaction to the claim "we no longer need scientists" was visceral: "That's terrible. Please don't say that. We need scientists more than ever. Those scientists will be more productive, more powerful, producing better results — but we need them to be genuinely strong in their own fields."

The Best Way to Start Learning Math: Just Talk to ChatGPT

Ernest's advice is simple and direct: if you're curious about math, go talk to ChatGPT. Even at the research level, when he needs to learn a new concept, his first instinct is to go to Wikipedia — but 30 seconds later he gives up. Too dense. He asks ChatGPT instead, follows up with a few questions, and gets an explanation tailored to the exact gaps in his knowledge.

He suggests telling the model your math background — what books you've read, what courses you've taken — then asking it to pose an open problem at your level. Solve it, continue the conversation, propose variations, solve those. "Even if you're still alone in a room, it no longer feels like a lonely process. Math is fundamentally a social activity."

At a debate Sébastien attended a year and a half ago, 80% of mathematicians in the audience believed LLMs could never help solve major open problems. By the end of the debate, it was 50-50. Eight months later, models started doing research-level math. A few months after that, original proofs of Erdős problems began appearing in batches. Sébastien put it bluntly: "In hindsight, that 80% was obviously dead wrong."
