Benchmarks in Leipzig
Posted by root-parent 3 days ago
Comments
Comment by christianstump 3 days ago
Think of it as: a PhD student studying exactly this area of mathematics would need days to weeks to understand and solve the question.
Nonetheless, these are questions about existing research, though much closer to a question given to a second-year PhD student than to an exam question.
Comment by _flag 3 days ago
- Question 093 is a word problem of the kind that I would imagine is commonly given to high school students. Maybe it is slightly more difficult, but it doesn't appear to have any mathematical relevance and nobody would ever give it to a second-year PhD student.
- Question 096 is something I would expect a computer to do easily by brute force, and has essentially no mathematical content other than doing a calculation. (Under what circumstance does one care about taking base 10 digits and interpreting them in base 11?). Again, nobody would ever assign this to a math PhD student, and I expect that any undergrad who knows how to code can give you this answer.
- Question 016 is the kind of combinatorial problem that one could expect to brute force with a computer (and some decently-written code) even before AI. Again nobody would give it to a 2nd year PhD student because it is too random and of no academic interest.
- There are questions like 026 and 014, about computing Hilbert series. Computing Hilbert series is a standard computer algebra task that nobody would want to do by hand before generative AI, and certainly not now.
Similar comments apply to many others. There are plenty of random-looking computational questions of exactly the type that one expects not only that computers can solve, but that computers should be used to solve, because nobody would ever do them by hand. None of them are research-level --- certainly not anything that would be considered publishable (before generative AI or after) --- despite the subtitle of the paper saying "research-level". And if you give them to a 2nd year PhD student I would imagine you would just be wasting their time.
I also don't like your phrasing "much harder than any exam question in any exam". If I ask you to multiply two 1000 digit numbers, the question is "much harder" than any question that will ever appear on any exam. Everyone understands the computer will do it instantly, and it doesn't demonstrate anything relevant. There is a clear regime in which one expects AI-type methods to perform better (combinatorial, calculation-based questions which can be answered using standard methods), and other regimes where one expects worse performance (e.g., proofs of statements that use abstract concepts). Why is there nothing here of the second type?
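To make the "hard by hand, instant by machine" point concrete, here is a minimal Python sketch. It is not taken from the benchmark: the function names are hypothetical, and the base-11 helper only illustrates the generic operation mentioned above (reading base 10 digits as base 11), not the actual content of Question 096.

```python
import random


def reinterpret_base10_digits_in_base11(n: int) -> int:
    """Read the decimal digits of n as if they were base-11 digits."""
    return int(str(n), 11)


def random_n_digit_number(digits: int, rng: random.Random) -> int:
    """Return a random integer with exactly `digits` decimal digits."""
    low = 10 ** (digits - 1)
    high = 10 ** digits - 1
    return rng.randint(low, high)


rng = random.Random(0)  # fixed seed so the run is repeatable
a = random_n_digit_number(1000, rng)
b = random_n_digit_number(1000, rng)

# Python integers have arbitrary precision, so this product is exact.
product = a * b

print(len(str(product)))                         # number of digits in the product
print(reinterpret_base10_digits_in_base11(123))  # 1*121 + 2*11 + 3
```

Both operations finish essentially instantly, which is the sense in which "much harder than any exam question" says little about conceptual difficulty.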
Comment by christianstump 3 days ago
1. Why do you compare it to multiplying two 1000 digit numbers and not to factorizing a 4096-bit number into its two prime factors, when not knowing any details?
2. The questions are of a theoretical nature, even if a little calculation is involved. This does not mean that the problems are not solvable using a computer program, but it means that they are not solvable with reasonable effort with a computer program.
3. And we do not ask for proofs because other projects already do that (IMProofBench, please have a look) and because grading LLM answers would require a human to understand the provided proof -- and this is not what I or we or actually most researchers are interested in doing.
Comment by tchalla 3 days ago
Comment by pelasaco 2 days ago
Which sounds arrogant and IMO doesn't belong on Hacker News. IMO it's OK not to like some questions. But it's OK to have such questions in a non-research forum like this one.
Comment by _flag 3 days ago
The objection is to the phrasing "much harder". One should distinguish between something that is difficult for reasons stemming from a lack of computational power and something that is difficult for reasons stemming from a lack of relevant abstractions or the ability to grapple with them. If the reason that a particular problem is "hard" for a PhD student is that they have to do a long calculation, but not because of a lack of conceptual understanding, then it doesn't say much about the capabilities of generative AI if the computer solves it.
Hence the example: multiplying two large numbers is hard for the former reason, not the latter. Your example of factoring a 4096-bit semiprime is hard for both reasons (because the brute force method is too slow).
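The brute-force method referred to here can be sketched as plain trial division. This is an illustrative sketch with a hypothetical function name, not anything from the paper. The loop runs up to roughly sqrt(n) times, so for a 4096-bit semiprime with two factors of about 2048 bits each it would need on the order of 2^2048 iterations, whereas multiplying the two factors back together is immediate.

```python
def trial_division(n: int) -> tuple[int, int]:
    """Return (p, q) with p the smallest factor of n greater than 1 and p * q == n.

    For a prime n this returns (n, 1).
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    p = 2
    # Any composite n has a factor no larger than sqrt(n).
    while p * p <= n:
        if n % p == 0:
            return p, n // p
        p += 1
    return n, 1


# A small semiprime factors immediately; the cost grows like sqrt(n).
print(trial_division(101 * 103))
```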
Comment by christianstump 3 days ago
I trust the judgement of the respected researchers submitting the questions. I personally know them, and they publish research under their full names (which are fully disclosed in the paper). And you also should trust them.
Please consider disclosing your name and your field of expertise, pick a question in your own research area and explain to me why this question is not research-level. And, best of all, solve it yourself to clarify why it was too easy.
Comment by _flag 3 days ago
By [1, Theorem 4.1], the Neron-Severi rank of the perfectoid cover is the same as the Neron-Severi rank of the reduction. For a product E x E' of elliptic curves, it is well known that NS(E x E') = NS(E) + NS(E') + Hom(E,E'); see [2, Prop. 2.3]. Since E = E' here and E is supersingular, this number is 1 + 1 + 4 = 6.
Is it research level? It of course takes a graduate student a long time to understand, say, what a perfectoid space is. But the statement follows immediately from quoting the literature, as long as one knows what to quote.
1. https://arxiv.org/pdf/2105.05230
2. https://arxiv.org/pdf/1402.2233
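Written out at the level of ranks, the count above reads as follows. This is only a restatement of the argument in the comment, assuming endomorphisms are taken over an algebraically closed field, so that the endomorphism ring of a supersingular elliptic curve is an order in a quaternion algebra and has rank 4 over the integers.

```latex
\begin{align*}
\operatorname{rank} \mathrm{NS}(E \times E')
  &= \operatorname{rank} \mathrm{NS}(E)
   + \operatorname{rank} \mathrm{NS}(E')
   + \operatorname{rank}_{\mathbb{Z}} \operatorname{Hom}(E, E') \\
  &= 1 + 1 + \operatorname{rank}_{\mathbb{Z}} \operatorname{End}(E)
   && (E = E') \\
  &= 1 + 1 + 4 = 6
   && (E \text{ supersingular}).
\end{align*}
```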
Comment by christianstump 3 days ago
Comment by _flag 3 days ago
> But we give no indication that the questions are hard because of computational tasks and we give many indications that the problems are of theoretical nature and hard for theoretical reasons.
and Question 034 seemed to be one of the few that did not have a computational component, and so would presumably be hard for theoretical reasons. I already indicated above some problems that I feel to be of a non-theoretical nature.
Comment by spacebacon 3 days ago
Comment by christianstump 3 days ago
Comment by jll29 3 days ago
Comment by christianstump 3 days ago
For humans, these two correlate quite strongly. So we ask PhD students to work on the former to prepare and to become better at the latter.
For LLMs, it remains unclear if there is any correlation.
Comment by jona-f 3 days ago
Comment by christianstump 3 days ago
Comment by jona-f 3 days ago
On that note, the tests are very US-centric. Only one Chinese model and you unfairly nerfed it by limiting its context window, when the compressed context is deepseek v4's main innovation and even with full context it is much cheaper to run than all the others.
Comment by christianstump 3 days ago
Comment by jona-f 3 days ago
Comment by sajithdilshan 3 days ago
Comment by christianstump 3 days ago
Comment by zerobees 3 days ago
This is an interesting result, but as I understand it, it's not about solving frontier challenges (which LLMs can evidently do too, but that's not what's tested here). It's closer to "can a mathematician (blindly) write exercises you can't cheat on using an LLM". "Blindly" in the sense that they can't adjust the problem ahead of time until they get a model to fail.
The conclusion in the paper is: "The concept of writing exercise-style benchmark questions based on publicly accessible research has reached its limits when it comes to the best-performing available models."
Comment by christianstump 3 days ago
This should not be interpreted as "AI can solve mathematics": the ability to solve exercise-style questions based on existing research is vastly different from the creation of new mathematics.
But it is still impressive and not what we expected -- I rather expected that we would end up with 20-40 questions no current publicly available model can solve.
Comment by lightningspirit 3 days ago
Now, reasoning in the sense of making truly original discoveries, as Einstein did with the field equations, is a different story for current LLMs.
Comment by spuz 3 days ago
If you look at Table 3 you can see the difference in performance between, for example, GPT 5.5 and Opus 4.7 across the 20 x 100 runs:
- GPT 5.5: 1389/2000 questions answered, of which 1043 were correct (75%)
- Opus: 1306/2000 questions answered, of which 294 were correct (22%)
So while you can claim that Opus solved 40% of the problems it still had a failure rate of 78%. That means if you chose this model to answer your homework question, there is a good chance you would fail.
Perhaps a more useful benchmark for future models is measuring how many of these types of questions they can answer in one shot. I.e. how confident can you be when using them for real world tasks.
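The per-answer accuracy quoted from Table 3 above can be recomputed directly. The sketch below uses only the counts given in this comment; the function name is hypothetical. Unrounded, the two rates come out at about 75.1% and 22.5%.

```python
def accuracy_among_answered(correct: int, answered: int) -> float:
    """Fraction of answered questions that were answered correctly."""
    if answered <= 0:
        raise ValueError("answered must be positive")
    return correct / answered


# (answered, correct) out of 2000 attempts, as quoted from Table 3.
runs = {
    "GPT 5.5": (1389, 1043),
    "Opus": (1306, 294),
}

for model, (answered, correct) in runs.items():
    acc = accuracy_among_answered(correct, answered)
    print(f"{model}: {acc:.1%} correct, {1 - acc:.1%} wrong among answered")
```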
Comment by christianstump 3 days ago
1. These questions cannot and must not be compared to homework questions. These are different leagues and possibly even different sports.
2. The "more useful benchmark" that you suggest is already present in the data as we ran every model exactly once in Stage 1.
Comment by spuz 3 days ago
Comment by tomtomatoide 3 days ago
Look for final exams of advanced courses in CS or math. It will clarify how close (or, plainly, how much harder) the questions from the study are, and so how impressive the capabilities of these models are...
Comment by puttycat 3 days ago
Comment by davidmpaz 3 days ago
This is the most "Hitchhiker's Guide to the Galaxy" thing I have read so far :)
I am even expecting some of the questions to be answered with 42!!
Comment by sinuhe69 2 days ago
Comment by christianstump 2 days ago
Comment by danielgall500 3 days ago
Comment by qsort 3 days ago
https://math.sciencebench.ai/benchmarks
I take the "2 unsolved" claim to mean "not solved by any model in any configuration in any stage with any number of attempts"; the "benchmark results" are much lower. To be clear: it's extremely impressive. I still remember being in utter disbelief when models started solving AIME problems, and this is obviously several levels above that.
It's also interesting that OpenAI models perform that much better on math and math-adjacent stuff. I assume this comes down to differences in post-training?
Comment by tux3 3 days ago
GPT has 5 effort settings and they picked the highest (xhigh). Claude has 5 and they picked the middle one to avoid having to retry when it timed out. Gemini has medium or high effort and they picked medium.
Comment by christianstump 3 days ago
Comment by root-parent 3 days ago
Comment by rabidvermin 3 days ago
... that are therefore liable to be in the training data?
Comment by fc417fc802 3 days ago
Comment by christianstump 3 days ago
Comment by fc417fc802 3 days ago
Comment by christianstump 3 days ago
This is incorrect. What is correct is the following: When understanding the existing literature on a question in the dataset, one can derive the answer without creating new mathematics research.
So the difference is "searching the literature" vs "understanding the literature" that made me believe it. But if you didn't that's even better!
Comment by fc417fc802 3 days ago
I observed that the two things are quite different in terms of model capabilities. That's relevant when considering how to interpret the results of the benchmark. We need to differentiate between (at minimum) reproducing an (approximately) verbatim answer from the training set, assembling disparate items from the training set into an answer piecewise, and performing novel logical inference using items from the training set.
I further speculated about the intent of the authors but you seem to be saying that my guess was wrong. In response I will observe that for any problem that's known to be solved it's likely to be quite difficult if not impossible to confidently determine that the model performed a de novo derivation as opposed to finding pieces of the answer in various places.
Of course there's absolutely nothing wrong with the latter! It's just important to be aware of the possibility when drawing conclusions about model capabilities.
Comment by tossandthrow 3 days ago
The goal was not to define unsolved problems.
But as such, the problems are also not previously published problems.
This seems quite reasonable IMHO.
Comment by criemen 3 days ago
> Stage W2 The five project-active models, see Table 2, attempted the question. Their answers were compared to the original answer by an LLM judge. If at most three models answered correctly, the contributor could proceed.
So "trivially contained in the training data" is excluded, as then all models could/should easily come up with the solution.
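The Stage W2 rule quoted above amounts to a simple threshold check. The sketch below is an illustration only: the function name and the list-of-booleans input are assumptions, and the LLM-judge step that produces those booleans is not modelled.

```python
def contributor_may_proceed(model_correct: list[bool], max_correct: int = 3) -> bool:
    """Return True if at most `max_correct` models answered correctly.

    `model_correct` holds one judge verdict per project-active model
    (five models in the quoted Stage W2 description).
    """
    return sum(model_correct) <= max_correct


# Three of five models correct: the question passes the filter.
print(contributor_may_proceed([True, True, True, False, False]))
# Four of five models correct: the question is considered too easy.
print(contributor_may_proceed([True, True, True, True, False]))
```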
Comment by andy99 3 days ago
A simple example, as a non-mathematician: I’d expect a well trained LLM to be able to solve any integral that can be solved with integration by parts. I would be much more interested to see it solve one with no known solution using some novel technique.
Obviously this doesn’t really lend itself to making a benchmark, but if something is solvable by a known technique, and the LLM has had some kind of RL training on using that technique, seeing a solution isn’t too surprising.
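For reference, this is the kind of known-technique computation meant here: a standard textbook instance of integration by parts, with u = x and dv = e^x dx (my choice of example, not one from the benchmark).

```latex
\begin{align*}
\int x e^{x} \, dx
  &= x e^{x} - \int e^{x} \, dx
   && (u = x,\ dv = e^{x}\,dx,\ du = dx,\ v = e^{x}) \\
  &= (x - 1)\, e^{x} + C .
\end{align*}
```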
Comment by openclawclub 3 days ago
Comment by danielovichdk 3 days ago
The new Berlin