AIME eval script does not score correctly some answers
I am running some evals and I am noticing that, from time to time, the model produces a correct answer but the parsing logic in the AIME eval script fails to extract it. Here are a few examples:
Of course, most of the time the answers are correctly bracketed, but every now and then this happens. The occurrence rate is not negligible, and I think it can actually lead to underestimation of the model's actual AIME score.
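For what it's worth, a more forgiving fallback extraction along the lines below would likely catch these cases. This is only a minimal sketch: it assumes the script pulls the answer out of `\boxed{...}` with a regex and that AIME answers are integers in 0-999; `extract_aime_answer` is a hypothetical helper, not the actual function in the eval script.

```python
import re
from typing import Optional


def extract_aime_answer(response: str) -> Optional[str]:
    """Hypothetical best-effort extraction of an AIME answer (an integer 0-999)."""
    # Preferred case: the model wrapped the answer in \boxed{...}.
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if boxed:
        candidate = boxed[-1].strip()
    else:
        # Fallback for missing or broken bracketing: take the last standalone
        # 1-3 digit number in the response, which is usually the final answer.
        numbers = re.findall(r"\b\d{1,3}\b", response)
        candidate = numbers[-1] if numbers else None
    # AIME answers are integers in [0, 999]; reject anything else.
    if candidate is not None and re.fullmatch(r"\d{1,3}", candidate):
        return candidate
    return None


# Example: catches an answer that a strict \boxed{} regex would miss.
print(extract_aime_answer("... so the final answer is 204."))  # -> "204"
```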
I am using gpt-oss-120b
at high
reasoning with llama.cpp Metal backend.
- Is anyone else observing this?
- Do the reference vLLM outputs have such incorrectly skipped answers?