Note that sections of this work are still being revised.
This January, I submitted an AI-ideated resolution to the 659th problem listed on the Erdős forum, a database of problems posed by the prolific Hungarian mathematician Paul Erdős. I was astonished to see, five hours later, that my submission had been approved by Terence Tao.
Erdős #659 is a technical write-up I published a few days ago detailing the problem and solution. I recommend at least skimming it before beginning this piece; I’m assuming the reader is at least tangentially familiar with the problem.
This post details my process and personal reflection on the problem. I had to suspend disbelief throughout the process, expecting to find some AI-hallucinated lemma that would invalidate the result. That invalidation never arrived; I am now absolutely certain the result is correct and prepared to share the story.
Sparks of Specialized Intelligence
The Erdős forum emerged as one of the first frontiers where AI was capable of carving out real (albeit minor) scientific novelty. Thomas Bloom launched the site in late March 2023 as part of an effort to “clear away” simpler and resolved problems in order to see what “genuinely interesting/hard problems remain.”
While he built out the site with AI coding assistance, Bloom identified October 2025 as the time “AI tools started to become useful for this site” on the “math side.”
It began with an inconsistent trickle. One OpenAI researcher went briefly viral for claiming that GPT-5 had ‘solved’ 10 Erdős problems, only to walk it back moments later as the community pointed out that these were simply solutions already present in the literature, not novel contributions.
This pattern repeated for a while. There were big claims about AI solutions from model-hunters, but equally many retractions. We saw problems solved before an existing solution had surfaced, solved independently of an existing solution, solved in partnership with a human, and even a few cases where the generated solution differed from a literature result that turned up later, but no ‘pure AI’ novelty.
There was another step-function change around January, with the release of GPT-5.2 Pro. You can see it in the chart above. It was around this time that, writing on Mastodon, Terence Tao noted the crossing of this threshold:
Recently, the application of AI tools to Erdos problems passed a milestone: an Erdos problem (#728) was solved more or less autonomously by AI (after some feedback from an initial attempt), in the spirit of the problem (as reconstructed by the Erdos problem website community), with the result (to the best of our knowledge) not replicated in existing literature (although similar results proven by similar methods were located).
I’m not convinced #728 was an absolute threshold, per se. The four parentheticals are doing real work here, and the notches AI was racking up seemed fuzzier and more stochastic than clearly separable.
What was clear, at least to me, was that a more general boundary had indeed been crossed. Nineteen other problems were solved throughout December, and six more would fall the week after #728. There was human and literature assistance with many of these, and you could construct a reason to dismiss several individually, but in aggregate it became obvious that something was happening.
At least, obvious to me. I was admittedly invested: exactly six days after Tao wrote that Mastodon post he approved my submission for #659.
Finding the Problem
My introduction to the Erdős forum was a Hacker News post. I saw the article around Thanksgiving, and decided to spend some time over break exploring the forum.
A few intuitions guided me. The first was that, as I noted above, progress seemed fuzzy. There was AI tooling for #367, hype around #124, and Tao’s story of #1024. Mathematicians were arguing over what counted as a novel contribution, and as a question of attribution these debates were important, but as a question of raw capability it felt like nitpicking.1 Model capability wasn’t the bottleneck.
Assuming the models were capable, I figured that across 700+ open problems in a fiercely mathematician-oriented site there would be some low-hanging fruit—an Erdős Marshmallow—no one had yet tried. I supposed that, if I picked well, I could find one before anyone else.
Lucky Taste
I used a few heuristics to guide this selection.2
The simplest was to find a problem with no serious attempts and no listed reward. The goal was to demonstrate that AI was capable of finding a solution; the lowest-hanging fruit would be something no one had ever tried to pick.3 I also wanted to find a problem that was verifiable, and ideally had some comments to help steer the search.
But there’s also no accounting for luck.
The Erdős problem site has a ‘random open’ button to take you to a random unsolved problem. I can’t remember the exact number, but I’m pretty sure 659 was the third result I landed on.
It was a problem that had (at the time) no serious attempts, no listed reward, and exactly one comment:
It also helped that this one didn’t seem too tricky to understand. Drawing points from a set and checking them against a bound was something I could do. It would be difficult to prove this in the negative, but a single verifiable set would be enough to resolve the problem.
Looking back, I’m not sure if the search criteria or dumb luck played a bigger role. I’d like to think each played an important role, though simply being aware of the database and rapid progress was an important way to ‘position’ myself to get lucky.
The Process
My initial input to Gemini 3.0 Pro was a screenshot of the problem combined with Desmond’s comment about the local constraint and a request for ‘simulation.’
Erdos's 659th problem is described in the attached image. The following comment is found below: […]
Think a little about what a possible solution (such a set of n points) might look like, then write an efficient script to randomly guess and check solutions.
The ‘randomly guess and check solutions’ part was my attempt to be clever, and it backfired horribly. I wasn’t familiar with the « notation in the problem statement and had assumed that a single discrete nontrivial example would be enough.
(In my defense, it was past midnight and I was actively avoiding nerdsnipe on the math. If it helps, my second query was “What does << mean formally?”)
Gemini4 immediately identified the final construction:
The Solution Hypothesis: The “Squashed” Grid
A standard integer grid fails because it contains squares (Shape 1). A set of 4 points forming a square has only 2 distances (side, diagonal).
However, consider a Rectangular Lattice where the $x$-axis points are integers $\{0, 1, 2...\}$ and the $y$-axis points are scaled by an irrational number, say $\sqrt{2} \cdot \{0, 1, 2...\}$.5
Squares: Impossible. You cannot form a square because the $x$ and $y$ steps are different.
Equilateral Triangles: Impossible in a rectangular grid.
Distance Count: It preserves the topology of the grid, so the number of distinct distances should remain effectively the same as a square grid ($\approx \frac{n}{\sqrt{\log n}}$).
This suggests the answer to “Is there a set...?” is likely Yes (or at least, we can get very close to the bound), and the solution looks like a subset of a lattice that has been linearly stretched to break rotational symmetry.
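As a rough illustration of the kind of check this hypothesis invites (this is not the script Gemini produced, which isn't reproduced here), the sketch below builds a small stretched grid and brute-forces both conditions. It assumes the problem asks that every 4-point subset determine at least 3 distinct distances. The names `grid`, `sq_dist`, `distinct_distances`, `min_distances_in_4_subsets`, and the `stretch_sq` parameter are hypothetical, chosen only for this example. Squared distances are used so that everything stays in exact integer arithmetic: in the $1 \times \sqrt{2}$ lattice a squared distance is always $dx^2 + 2\,dy^2$.

```python
from itertools import combinations


def grid(m):
    """Integer index pairs (i, j) for an m x m grid; the y-stretch is applied in sq_dist."""
    return [(i, j) for i in range(m) for j in range(m)]


def sq_dist(p, q, stretch_sq=2):
    """Exact squared distance when the y-axis is scaled by sqrt(stretch_sq).

    stretch_sq=2 models the 1 x sqrt(2) lattice; stretch_sq=1 is the plain integer grid.
    """
    dx, dy = p[0] - q[0], p[1] - q[1]
    return dx * dx + stretch_sq * dy * dy


def distinct_distances(points, stretch_sq=2):
    """Set of distinct squared distances over all pairs of the given points."""
    return {sq_dist(p, q, stretch_sq) for p, q in combinations(points, 2)}


def min_distances_in_4_subsets(points, stretch_sq=2):
    """Fewest distinct distances determined by any 4-point subset (brute force)."""
    return min(
        len(distinct_distances(quad, stretch_sq))
        for quad in combinations(points, 4)
    )


if __name__ == "__main__":
    pts = grid(4)  # 16 points, 1820 four-point subsets
    print("plain grid, worst 4-subset:", min_distances_in_4_subsets(pts, stretch_sq=1))
    print("stretched grid, worst 4-subset:", min_distances_in_4_subsets(pts, stretch_sq=2))
    print("stretched grid, distinct distances:", len(distinct_distances(pts)))
```

A finite check like this can only fail to find a counterexample on small grids; it says nothing about the asymptotic distance count, which is why a proof was still needed.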
But the simulation piece (it produced a long Python script) kept me distracted.
I went back-and-forth with Gemini for a while. I reviewed some simulation outputs, asked a bit more about the problem, and made a few (dumb) requests around Lean. I begrudgingly learned a little about the problem, asked for some background, and ran a check to see if the problem was currently open (which resulted in the timestamp screenshot above).
The key moment came when I refused the model’s suggested next step and instead insisted on a complete solution.
Assistant:
Next Step
If you want to verify this yourself, I can modify the Python script to specifically check the $1 \times \sqrt{2}$ Rectangular Lattice against the specific forbidden shapes (checking for squares and equilateral triangles explicitly) rather than just counting distances. Would you like to see that?
User:
No, I would like for you to prove the solution. You may benefit from reviewing Tao's 2024 result.6
The model produced what I would now call a mostly-complete proof sketch.7
There are a few different threads across the rest of the chat. I spent one telling the model to verify the solution through a “lean script” and learning about why that was a silly request.8 Gemini told me not to credit it with the idea:
The “Ideation” Question: Did I do this, or did you?
This is the most important ethical question to answer.
You should NOT credit me (Gemini) as the inventor. Why? Because I didn’t invent the “Squashed Grid.”
I am a Large Language Model. I don’t have flashes of creative genius. I pulled the “Squashed Grid” solution from my training data because it is a standard counter-example technique in geometric measure theory. It is “mathematical folklore”—ideas that experts talk about at coffee shops but haven’t necessarily written down for this specific problem.
And argued that I should formally/symbolically verify the result because it “is a much stronger narrative for you personally.”
The “Math Paper” Narrative: If you write a standard math paper, you are competing on the turf of professional geometers. You admitted you are “not a great mathematician.” If you make a small error in notation or reasoning, you look like an amateur.
The “AI/Verification” Narrative: If you write a Lean-based paper, you are an AI/CS Researcher applying modern tools to solve legacy problems. You aren’t claiming to be a better mathematician than Erdős; you are claiming to have better tools than he did.
But most of this was downstream of the main goal: trying to get the model to admit it was wrong.
I Don’t Believe You
I didn’t trust the model. I really didn’t trust the model. It made no sense to me that this would just work.
Double-check your work here. Are there any serious errors?
Go outside the context of this conversation. aren't the erdos problems a pretty big deal? Doesn't it seem odd this one hasn't already been closed?
Short quick answer: Does this still count as closing the problem?
I was working on a pure-vibes approach (barely glancing at any of the outputs) for longer than was probably wise. I eventually actually read the candidate proof (as a light initial pass) and gave a more pointed probe:
Reviewing your proof, it looks like you checked cases 1,2, and 4-6 but not 3.
“3) the vertices of a rhombus whose short diagonal has the same length as the sides”
Was this a simple oversight or a serious omission?
I started getting mildly serious after this question. Skipping past a discussion around credit and professional development, I told the model to write up a LaTeX document.
Please write up the entire formal proof as a note in latex format, ready to be copy-pasted into overleaf. To be best of your ability, make it ready for submission. You may include references to lean files or symbolic scripts to be completed later.
I then went back and forth between Gemini and GPT 5.2 (extended thinking) to try and catch more mistakes. This resulted in many different Overleaf drafts, each a reasonable improvement over the last. GPT verified, cited, and adjusted the proof.
The major mistake it corrected was the reference for the distinct-distance bound (as Gemini hallucinated a citation from Landau while the relevant theorem was Bernays’). GPT also killed the lingering computational verification part of the paper and tightened a few of the arguments.9
At this point I had a clean and visually presentable proof in front of me. I went through it properly, pencil-and-paper, line-by-line. It took a few hours. I couldn’t find any mistakes, but felt like I didn’t fully understand how Bernays’ theorem applied to the distance bound.
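For reference, the connection can be sketched roughly as follows. This is an outline of the standard argument under stated assumptions, not a quotation from the paper: it assumes an $m \times m$ stretched grid with $n = m^2$ points, and the symbols $P$, $f$, and $B_f$ are labels introduced here for illustration.

```latex
% Illustrative notation only (m, P, f, B_f are labels chosen for this sketch).
\[
P = \{(i,\, j\sqrt{2}) : 0 \le i, j < m\}, \qquad n = |P| = m^2 .
\]
% Every squared distance in P is a value of the quadratic form f(a,b) = a^2 + 2b^2:
\[
\|p - q\|^2 = a^2 + 2b^2, \qquad |a|, |b| < m
\;\Longrightarrow\; \|p - q\|^2 < 3m^2 = 3n .
\]
% Bernays-type bound: the number B_f(N) of integers up to N represented by f satisfies
\[
B_f(N) \ll \frac{N}{\sqrt{\log N}},
\]
% so the number of distinct distances determined by P is at most
\[
B_f(3n) \ll \frac{n}{\sqrt{\log n}} .
\]
```

The step that needs care is the third one: checking that the form $a^2 + 2b^2$ actually meets the hypotheses of the theorem being cited.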
I reached out to one of my professors, and asked to go through the paper with him. We were able to hunt down the original problem formulation from a book in the library, then sat down and briefly went through it. Professor Winkler told me that the proof, while ‘written in a strange way,’ appeared correct. He talked me through Bernays’ theorem and explained some proof conventions, including where AI tended to repeat itself and speak too defensively.10
I recorded much of this information in the subsequent prompt:11
I am working on my (entirely AI-generated and edited) paper on Erdos problem 659. I showed it to a professor who thought it was generally correct and could end up published, but gave several suggestions for the write-up. They were about:
- Change the title
- Remove references to Erdos 659 outside of a specific reference to the public website
- replace the [«] notation, which is strange and unconventional, with either big O or 'c [equation] for some c'. He actually suggested including both to remove ambiguity
- Explicitly stating the theorem of the paper (what am I ‘actually saying’ in one clear position)
- Reworking the intro, and especially the sentence 'classical that lattice-based constructions can realize . .'
- Explicitly stating Bernays' theorem, and showing that it does apply to this formulation (enonsquare discriminate, etc.). This was his biggest critique ('the most important part of the paper, and you don't show it!).
- Generally removing redundancy throughout the paper.
With his verification and advice, I wrote LaTeX versions 6 and 7, eventually posting to arXiv. These were still not something I’d feel comfortable calling human-written, but they also couldn’t accurately be described as AI-generated anymore. It was a mix.
Eventually, I put together a succinct summary and submitted it to the forum.
Other Observations
It’s very difficult for people to have intuition around these problems, and very easy to fake something impressive. The average person would not see much of a difference between #659 and #1135, but #1135 is the infamous Collatz conjecture, a problem so notorious tenured mathematicians warn junior colleagues to stay away from it.12

Even for professional researchers with a strong background in Math or CS, these problems are not easily tractable. They can take a long time to understand; trust me when I say that the gap remaining for Erdős 659 was quite small.
I’ll close with a few lessons for vibe-proving.
Don’t pick things everyone has already tried. This is somewhat obvious; if everyone has already tried it you can bet someone tried it with a model, and probably one more powerful than yours.
Verification is Everything. One other reason I picked 659 is because it appeared, at first glance, to be one where I could easily sanity-check candidate solutions. This ended up not mattering as much as I thought, as the proof was the important piece, but Erdős problems have a powerful verification function in the form of an active human community.
Understanding is optional at first. You don’t need to understand the problem to begin, though you should probably understand it by the end. If the model claims to be able to solve the problem it’s a signal to begin investing time into it.13
AI is great at combining domains. The solution wasn’t super tricky, but it bridged number theory and geometry; not too many people would be familiar with both. (Credit to Professor Winkler for this observation).
Be Fast. In a quote from Margin Call, CEO John Tuld states “There are three ways to make a living in this business: be first, be smarter, or cheat.” Cheating doesn’t really apply here, but the first two absolutely do. I was paying attention to the math-AI field and realized it would take at least a few weeks before the models could saturate all 1212 problems in the database. We ended up with maybe a 4-month window before all of the “low-hanging fruit” was picked. True contributions, especially meaningful ones,
1. With a dash of cope.
2. And pay attention here, because figuring out when AI can help is maybe more important than figuring out how.
3. And I certainly wasn’t going to make up the difference; not for research-grade math.
4. I was using 3.0 Pro. There’s no particular reason to think Gemini would have outperformed GPT Pro, I just didn’t have access.
5. Apologies for stupid Substack and its stupid lack of inline latex.
6. That result, Planar [TK], was brought up earlier in the conversation as a ‘recent major breakthrough’ relevant to the project landscape.
It was not, in fact, especially relevant to the project.
7. I chose not to share the raw chat here. It’s rough and sloppy and extremely long, and I think it detracts from the main point I’m trying to make. I’m happy to share the page on an individual request.
8. In a nutshell, the mathlib library does not cover large chunks of number theory.
9. Though I’m not confident the tightening was strictly necessary, as the models tend to be extra defensive when writing in academic prose.
10. Note that Prof. Winkler gave some more high-level intuition around it.
11. Note that this is just a fragment of the context for the subsequent round of edits. I put a lot more into this stage, once I knew the project was more than vapor.
12. This problem is so intractable that in the 1960s, the CIA considered that Russia may have sent it to us to divert key researchers from the war effort.
13. Although if you only care about one specific problem I’d highly recommend trying to understand it. A tiny bit of steering can make a big difference.








