Benchmarks in Leipzig

Posted by root-parent 3 days ago

Comments

Comment by christianstump 3 days ago

I am the leader of the study and the author of the benchmark paper. Let me add: the problems are much harder than any exam question in any exam.

Think of it as: a PhD student studying exactly this area of mathematics would need days to weeks to understand and solve the question.

Nonetheless, these are questions about existing research, albeit much closer to a question given to a second-year PhD student than to an exam question.

Comment by _flag 3 days ago

I don't like that you've called these problems "research-level", or your description that they are something you might give to a second-year PhD student. Some examples:

- Question 093 is a word problem of the kind that I would imagine is commonly given to high school students. Maybe it is slightly more difficult, but it doesn't appear to have any mathematical relevance and nobody would ever give it to a second-year PhD student.

- Question 096 is something I would expect a computer to do easily by brute force, and has essentially no mathematical content other than doing a calculation. (Under what circumstance does one care about taking base 10 digits and interpreting them in base 11?). Again, nobody would ever assign this to a math PhD student, and I expect that any undergrad who knows how to code can give you this answer.

- Question 016 is the kind of combinatorial problem that one could expect to brute force with a computer (and some decently-written code) even before AI. Again nobody would give it to a 2nd year PhD student because it is too random and of no academic interest.

- There are questions like 026 and 014, about computing Hilbert series. Computing Hilbert series is a standard computer algebra task that nobody would want to do by hand before generative AI, and certainly not now.
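To make the Question 096 remark concrete: reading a decimal digit string as a base-11 numeral is a one-line operation. The sketch below only illustrates that reinterpretation and is not the actual benchmark question; the function name and the sample value are made up.

```python
def reinterpret_base10_as_base11(n: int) -> int:
    """Read the decimal digits of n as a base-11 numeral (illustrative helper)."""
    if n < 0:
        raise ValueError("expected a non-negative integer")
    # Decimal digits 0-9 are all valid base-11 digits, so int() accepts them.
    return int(str(n), 11)


# 123 read in base 11 is 1*121 + 2*11 + 3 = 146.
print(reinterpret_base10_as_base11(123))
```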

Similar comments apply to many others. There are plenty of random-looking computational questions of exactly the type that one expects not only that computers can solve, but should be used to solve, because nobody would ever do it by hand. None of them are research-level --- certainly not anything that would be considered publishable (before generative AI or after) --- despite the subtitle of the paper saying "research-level". And if you give them to a 2nd year PhD student I would imagine you would just be wasting their time.

I also don't like your phrasing "much harder than any exam question in any exam". If I ask you to multiply two 1000 digit numbers, the question is "much harder" than any question that will ever appear on any exam. Everyone understands the computer will do it instantly, and it doesn't demonstrate anything relevant. There is a clear regime in which one expects AI-type methods to perform better (combinatorial, calculation-based questions which can be answered using standard methods), and other regimes where one expects worse performance (e.g., proofs of statements that use abstract concepts). Why is there nothing here of the second type?

Comment by christianstump 3 days ago

I cannot keep answering everyone's comments of the type "Why did you consider / not consider?" or "Here are much better ideas". I promise you that we have thought quite a bit about the setup and have discussed it with many math researchers.

1. Why do you compare it to multiplying two 1000 digit numbers and not to factorizing a 4096-bit number into its 2 prime factors, when not knowing any details?

2. The questions are of theoretical nature, even if a little calculation is involved. This does not mean that the problems are not solvable using a computer program, but it means that they are not solvable with reasonable effort with a computer program.

3. And we do not ask for proofs because other projects already do that (IMProofBench, please have a look) and because we cannot grade LLM answers: a human would need to understand the provided proof -- and this is not what I or we or actually most researchers are interested in doing.

Comment by tchalla 3 days ago

Haha, the classic “Why didn’t you do X?” comments always appear. I think a lot of people underestimate how deeply quality researchers think about such setups. My genuine standard reply to those folks is - do the research with your setup and publish it.

Comment by pelasaco 2 days ago

> - do the research with your setup and publish it.

Which sounds arrogant and IMO doesn't belong on Hacker News. IMO it's OK to not like some questions. But it's OK to have such questions in a non-research forum like this one.

Comment by _flag 3 days ago

> 1. Why do you compare it to multiplying two 1000 digit numbers and not to factorizing a 4096-bit number into its 2 prime factors, when not knowing any details?

The objection is to the phrasing "much harder". One should distinguish between something that is difficult for reasons stemming from a lack of computational power and something that is difficult for reasons stemming from a lack of relevant abstractions or the ability to grapple with them. If the reason that a particular problem is "hard" for a PhD student is that they have to do a long calculation, but not because of a lack of conceptual understanding, then it doesn't say much about the capabilities of generative AI if the computer solves it.

Hence the example: multiplying two large numbers is hard for the former reason, not the latter. Your example of factoring a 4096-bit semiprime is hard for both reasons (because the brute force method is too slow).
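A rough sketch of the two kinds of difficulty, under the assumption that plain Python integers and naive trial division are fair stand-ins: multiplying two 1000-digit numbers is tedious by hand but a single operation for a machine, while brute-force factoring needs on the order of the square root of n steps and is hopeless at 4096 bits. The function name and the sample numbers are illustrative.

```python
import random


def trial_division(n: int) -> tuple[int, int]:
    """Return (p, q) with p the smallest factor of n greater than 1 and p * q == n."""
    if n < 2:
        raise ValueError("expected an integer >= 2")
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p, n // p
        p += 1
    return n, 1  # no factor up to sqrt(n), so n is prime


# Long but conceptually trivial: the product of two random 1000-digit numbers.
rng = random.Random(0)
a = rng.randrange(10**999, 10**1000)
b = rng.randrange(10**999, 10**1000)
product = a * b
print(len(str(product)))  # 1999 or 2000 digits

# Brute force works for a tiny semiprime, but the loop count grows like sqrt(n).
print(trial_division(101 * 103))
```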

Comment by christianstump 3 days ago

Well, you are correct that one should distinguish the two. But we give no indication that the questions are hard because of computational tasks and we give many indications that the problems are of theoretical nature and hard for theoretical reasons. There is not a single question where a PhD student would need to do a long calculation.

I trust the judgement of respected researchers submitting the questions, I personally know them, and they publish research under their full names (and whose names are fully disclosed in the paper). And you also should trust them.

Please consider disclosing your name and your field of expertise, pick a question in your own research area and explain to me why this question is not research-level. And, best of all, solve it yourself to clarify why it was too easy.

Comment by _flag 3 days ago

I'll solve 034.

By [1, Theorem 4.1], the Neron-Severi rank of the perfectoid cover is the same as the Neron-Severi rank of the reduction. For a product E x E' of elliptic curves, it is well known that NS(E x E') = NS(E) + NS(E') + Hom(E,E'); see [2, Prop. 2.3]. Since E = E' here and E is supersingular, this number is 1 + 1 + 4 = 6.
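Spelled out as a rank count, taking the two cited statements at face value and assuming E is a supersingular elliptic curve over an algebraically closed field of characteristic p (so that End(E) is an order in a quaternion algebra and has rank 4 over Z):

```latex
\begin{align*}
\rho(E \times E) &= \operatorname{rank} \mathrm{NS}(E) + \operatorname{rank} \mathrm{NS}(E) + \operatorname{rank}_{\mathbb{Z}} \operatorname{Hom}(E, E) \\
                 &= 1 + 1 + \operatorname{rank}_{\mathbb{Z}} \operatorname{End}(E) \\
                 &= 1 + 1 + 4 = 6.
\end{align*}
```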

Is it research level? It of course takes a graduate student a long time to understand, say, what a perfectoid space is. But the statement follows immediately from quoting the literature, as long as one knows what to quote.

[1] https://arxiv.org/pdf/2105.05230
[2] https://arxiv.org/pdf/1402.2233

Comment by christianstump 3 days ago

You see yourself that your own solution is purely of theoretical nature and not at all what you wrote before, right? (And no, I am not commenting on your answer.)

Comment by _flag 3 days ago

Indeed. But I chose the problem in response to your comment:

> But we give no indication that the questions are hard because of computational tasks and we give many indications that the problems are of theoretical nature and hard for theoretical reasons.

and Question 034 seemed to be one of the few that did not have a computational component, and so would presumably be hard for theoretical reasons. I already indicated above some problems that I feel to be of a non-theoretical nature.

Comment by spacebacon 3 days ago

On problems this close to active research, seeing the model’s internal reasoning at the points of highest effort is more valuable than pass/fail outcomes alone, which is what SRT-Introspect makes possible on frozen models.

https://github.com/space-bacon/SRT

Comment by christianstump 3 days ago

But it still remains far away from mathematics research. Solving any of the problems would not result in a new research paper.

Comment by jll29 3 days ago

Can anyone comment on the "distance to publication-worthiness" of the typical question from this set?

Comment by christianstump 3 days ago

Infinitely far away. These questions are about "have you understood and can you apply existing research", not about "create new research".

For humans, these two correlate quite strongly. So we ask PhD students to work on the former to prepare and to become better at the latter.

For LLMs, it remains unclear if there is any correlation.

Comment by jona-f 3 days ago

Was this event sponsored by Surge AI? Why didn't you run the prompts yourself?

Comment by christianstump 3 days ago

No, they only provided large-scale model runs for us (this is explained in the acknowledgements). These runs would have been too expensive to perform myself, so I am happy they offered to provide them.

Comment by jona-f 3 days ago

Thanks for answering this random internet guy's question. It's a bit sad that a German math prof doesn't have sufficient funds to run a few prompts. I would have paid for them for this amount of advertising. I don't like that you gave them to a Silicon Valley company.

On that note, the tests are very US-centric. Only one Chinese model, and you unfairly nerfed it by limiting its context window, when the compressed context is deepseek v4's main innovation and even with full context it is much cheaper to run than all the others.

Comment by christianstump 3 days ago

Please indicate which other models you would like to see included. (And I agree that the context window limitations were not reasonable to have.) Finally: running these few prompts would have cost $10-20k if I had run them myself via the API. (And the company didn't ask to contribute; I asked whether they would be willing to do so, just saying.)

Comment by jona-f 3 days ago

Kimi K2.6 and mimo 2.5 pro are ahead of deepseek v4 in other benchmarks. Anyhow, great work, the benchmark seems to show great separation, so should be very useful to improve the math capabilities of the next generation of ai. I'm more interested in the prompt engineering/orchestration and technical details (what I can do without millions), but I get that you are mathematicians, so your focus is obviously on the math. Sorry for the nagging.

Comment by sajithdilshan 3 days ago

What would have been more interesting is if LLMs were tested with questions where the direct solutions are not publicly available (so not in the training data). In that case I wonder how much hallucination would happen, or whether the models would try to connect the dots with what’s publicly available and come up with a direct solution.

Comment by christianstump 3 days ago

I don't understand why you expect that an answer known to the researcher but which has never been published should be in the training data. You possibly misunderstand what these problems look like -- we made them all publicly available on the website, so please have a look: https://math.sciencebench.ai/benchmarks/benchmarks-in-leipzi...

Comment by zerobees 3 days ago

I know that people with strong feelings one way or the other will comment here, but note that this is specifically about problems with known answers that can be inferred from existing literature (e.g., training data).

This is an interesting result, but as I understand it, it's not about solving frontier challenges (which LLMs can evidently do too, but that's not what's tested here). It's closer to "can a mathematician (blindly) write exercises you can't cheat on using an LLM". "Blindly" in the sense that they can't adjust the problem ahead of time until they get a model to fail.

The conclusion in the paper is: "The concept of writing exercise-style benchmark questions based on publicly accessible research has reached its limits when it comes to the best-performing available models."

Comment by christianstump 3 days ago

Let me also add: there is zero chance of the problems being included in the training data. The results are quite impressive: leading experts struggled to write questions with well-defined unique answers on existing research that the models were not able to solve.

This should not be interpreted as AI can solve mathematics: the ability to solve exercise-style questions based on existing research is vastly different from the creation of new mathematics.

But it is still impressive and not what we expected -- I rather expected that we would end up with 20-40 questions no current publicly available model can solve.

Comment by lightningspirit 3 days ago

I think most of the value LLMs provide comes from connecting the dots between unsolved questions and patterns or structures that have already been demonstrated, which accelerates research.

Now, reasoning in the sense of making truly original discoveries, as Einstein did with the field equations, is a different story for current LLMs.

Comment by spuz 3 days ago

As well as measuring how many questions each model was able to answer correctly, I think it's equally important to measure how many questions each model answered incorrectly. After all, if you consider using them as a tool, you will need to have confidence that any answer they give is correct.

If you look at Table 3 you can see the difference in performance between, for example, GPT 5.5 and Opus 4.7 across the 20 x 100 runs:

- GPT 5.5: 1389/2000 questions answered, of which 1043 were correct (75%)

- Opus: 1306/2000 questions answered, of which 294 were correct (22%)

So while you can claim that Opus solved 40% of the problems it still had a failure rate of 78%. That means if you chose this model to answer your homework question, there is a good chance you would fail.
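The percentages above can be rechecked from the quoted counts. This is a small sketch that treats the numbers as reported in this comment (not re-derived from Table 3); the names are illustrative. Note that 294/1306 is about 22.5%, so the 22% and 78% figures are rounded.

```python
# Counts as quoted above: (answered, correct) out of 2000 attempts (20 runs x 100 questions).
TOTAL_ATTEMPTS = 2000
counts = {
    "GPT 5.5": (1389, 1043),
    "Opus": (1306, 294),
}


def summarize(answered: int, correct: int, total: int = TOTAL_ATTEMPTS) -> dict[str, float]:
    """Return answer rate, precision among answered, and error rate among answered."""
    precision = correct / answered
    return {
        "answer_rate": answered / total,
        "precision": precision,
        "error_rate": 1.0 - precision,
    }


for model, (answered, correct) in counts.items():
    s = summarize(answered, correct)
    print(f"{model}: answered {s['answer_rate']:.1%}, "
          f"correct when answered {s['precision']:.1%}, "
          f"wrong when answered {s['error_rate']:.1%}")
```

Separating "answered" from "correct when answered" keeps the two questions apart: how often a model commits to an answer, and how far that answer can be trusted.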

Perhaps a more useful benchmark for future models is measuring how many of these types of questions they can answer in one shot. I.e. how confident can you be when using them for real world tasks.

Comment by christianstump 3 days ago

You are 100% correct with your assessment of the situation. But I do not agree with either of your conclusions:

1. These questions cannot and must not be compared as being similar to homework questions. These are different leagues and possibly even different sports.

2. The "more useful benchmark" that you suggest is already present in the data as we ran every model exactly once in Stage 1.

Comment by spuz 3 days ago

Ah you are right. I think I started reading the results of Stage 2 thinking it was Stage 1.

Comment by tomtomatoide 3 days ago

For some reason, perhaps some sort of Freudian self-defense mechanism, we tend to downplay how impressive it is to solve never-before-seen problems that require a deep understanding of the concepts at play.

Look at final exams of advanced courses in CS or math. It will clarify how close to them (or plainly harder) the questions from the study are, and so how impressive the capabilities of these models are...

Comment by puttycat 3 days ago

Hopefully they password-protect the datasets:

https://arxiv.org/abs/2305.10160

Comment by davidmpaz 3 days ago

With all due respect to the interesting research results this paper offers....

This is the most "Hitchhiker's Guide to the Galaxy" thing I have read so far :)

I am even expecting some of the questions to be answered with 42!!

Comment by sinuhe69 2 days ago

Did any AI lab sponsor the study? If yes, I believe a note should be included in the paper, just like commercial sponsorship in any other field.

Comment by christianstump 2 days ago

I am very happy to provide any information I failed to include -- but exactly what you ask for is written in the paper's acknowledgements (no sponsorship, but some model runs).

Comment by danielgall500 3 days ago

I almost did a double take when I saw Leipzig mentioned here. Cool work and greetings from InfAI. :)

Comment by qsort 3 days ago

These are the results from the website they link in the paper:

https://math.sciencebench.ai/benchmarks

I take the "2 unsolved" claim to mean "not solved by any model in any configuration in any stage with any number of attempts"; the "benchmark results" are much lower. To be clear: it's extremely impressive. I still remember being in utter disbelief when models started solving AIME problems, and this is obviously several levels above that.

It's also interesting that OpenAI models perform that much better on math and math-adjacent stuff. I assume this comes down to differences in post-training?

Comment by tux3 3 days ago

If you're trying to compare what the models are good at, it's important to note that the different models did not run with the same settings. In one case they also retried with GPT until it answered all the problems but did not retry with the other models.

GPT has 5 effort settings and they picked the highest (xhigh). Claude has 5 and they picked the middle one to avoid having to retry when it timed out. Gemini has medium or high effort and they picked medium.

Comment by christianstump 3 days ago

The difference between GPT and Gemini concerning the "retry until..." can almost be ignored. I did rerun GPT a few times, but still way below what Gemini was not able to answer at all.

Comment by root-parent 3 days ago

"...Between April 1 and May 15, 2026, a group of 49 mathematicians compiled a dataset of research-level mathematics questions with known answers... We present the resulting collection of 100 questions....We evaluated these questions in three stages: a single attempt by five state-of-the-art LLMs....we concluded Stage 3 with only 2 unsolved questions. This demonstrates that the mathematical reasoning capabilities of LLMs are becoming impressive..."

Comment by rabidvermin 3 days ago

mathematics questions with known answers...

... that are therefore liable to be in the training data?

Comment by fc417fc802 3 days ago

I had the same thought, because even if the exact solution doesn't appear there's a notable difference between performing a literature search versus solving something de novo. But I think perhaps this benchmark wasn't meant to exclude the former and that the point may have been to test the ability of the model to accurately interpret and synthesize relevant output for research level mathematical problems at all.

Comment by christianstump 3 days ago

I think you are underestimating the complexity of such problems. A PhD student in the exact field of research would need days to weeks to understand what the problem means and how to solve it. This is far beyond "throwing standard techniques" at a problem. (But, I keep emphasizing this, it is also far away from solving research mathematics.)

Comment by fc417fc802 3 days ago

What did I say that led you to believe I was underestimating the complexity? I don't believe I commented on it at all.

Comment by christianstump 3 days ago

When you write "there's a notable difference between performing a literature search versus solving something de novo", you suggest that the questions we provided can be solved doing a literature search.

This is incorrect. What is correct is the following: When understanding the existing literature on a question in the dataset, one can derive the answer without creating new mathematics research.

So it was the difference between "searching the literature" and "understanding the literature" that made me believe it. But if you didn't mean that, even better!

Comment by fc417fc802 3 days ago

I did not suggest that, no. I stress that claiming a possibility is not the same as claiming a fact.

I observed that the two things are quite different in terms of model capabilities. That's relevant when considering how to interpret the results of the benchmark. We need to differentiate between (at minimum) reproducing an (approximately) verbatim answer from the training set, assembling disparate items from the training set into an answer piecewise, and performing novel logical inference using items from the training set.

I further speculated about the intent of the authors but you seem to be saying that my guess was wrong. In response I will observe that for any problem that's known to be solved it's likely to be quite difficult if not impossible to confidently determine that the model performed a de novo derivation as opposed to finding pieces of the answer in various places.

Of course there's absolutely nothing wrong with the latter! It's just important to be aware of the possibility when drawing conclusions about model capabilities.

Comment by tossandthrow 3 days ago

I can recommend reading section 2 of the paper.

The goal was not to define unsolved problems.

But as such, the problems are also not previously published problems.

This seems quite reasonable IMHO.

Comment by criemen 3 days ago

Partially, 2.2 Submission workflow W2 deals with this:

> Stage W2 The five project-active models, see Table 2, attempted the question. Their answers were compared to the original answer by an LLM judge. If at most three models answered correctly, the contributor could proceed.

So "trivially contained in the training data" is excluded, as then all models could/should easily come up with the solution.
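The W2 rule quoted above boils down to a simple threshold. A minimal sketch, assuming the LLM judge's verdicts are already available as one boolean per model; the function name, the constant, and the model labels are hypothetical and not taken from the paper:

```python
MAX_CORRECT_TO_PROCEED = 3  # "at most three models answered correctly"


def contributor_may_proceed(judged_correct: dict[str, bool]) -> bool:
    """Apply the Stage W2 gate to per-model verdicts from the judge."""
    n_correct = sum(judged_correct.values())
    return n_correct <= MAX_CORRECT_TO_PROCEED


# Hypothetical verdicts for the five project-active models.
verdicts = {"model_a": True, "model_b": True, "model_c": False,
            "model_d": False, "model_e": True}
print(contributor_may_proceed(verdicts))  # True: three of five were correct
```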

Comment by andy99 3 days ago

“In the training data” isn’t really relevant for a modern LLM. The better question would be whether they are solvable using known techniques that have been fine-tuned in.

A simple example, as a non-mathematician: I’d expect a well trained LLM to be able to solve any integral that can be solved with integration by parts. I would be much more interested to see it solve one with no known solution using some novel technique.

Obviously this doesn’t really lend itself to making a benchmark, but if something is solvable by a known technique, and the LLM has had some kind of RL training re using that technique, seeing a solution isn’t too surprising.

Comment by openclawclub 3 days ago

[flagged]

Comment by danielovichdk 3 days ago

Hypezig

The new Berlin

Comment by Towaway69 3 days ago

As long as it's not conscious, we're safe.

Comment by esafak 3 days ago

No, you're not. It can take jobs and be plenty menacing in embodied form in its present state.