
\"image.png\"

\n

\"image.png\"

\n

\"image.png\"

\n

\"image.png\"

\n

Of course, most of the times the answers are correctly bracketed, but every now and then this happens. The occurrence rate is not negligible that I think it can actually lead to underestimation of the actual AIME score for the model.

\n

I am using gpt-oss-120b at high reasoning with llama.cpp Metal backend.

\n
    \n
  • Anyone else observing this?
  • \n
  • Does the reference vLLM outputs have such incorrectly skipped answers?
  • \n
\n","updatedAt":"2025-08-28T12:58:12.125Z","author":{"_id":"63148d3b996c52bf0142cdbe","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63148d3b996c52bf0142cdbe/ec7pRNrQQy70d-11FiACq.jpeg","fullname":"Georgi Gerganov","name":"ggerganov","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":1621}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7599694132804871},"editors":["ggerganov"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/63148d3b996c52bf0142cdbe/ec7pRNrQQy70d-11FiACq.jpeg"],"reactions":[{"reaction":"👀","users":["imweijh","upsatwal"],"count":2}],"isReport":false}}],"pinned":false,"locked":false,"collection":"discussions","isPullRequest":false,"isReport":false},"repo":{"name":"openai/gpt-oss-120b","type":"model"},"activeTab":"discussion","discussionRole":0,"watched":false,"muted":false,"repoDiscussionsLocked":false}">

AIME eval script does not score correctly some answers

#132
by ggerganov - opened
\"image.png\"

\n

\"image.png\"

\n

\"image.png\"

\n

\"image.png\"

\n

Of course, most of the times the answers are correctly bracketed, but every now and then this happens. The occurrence rate is not negligible that I think it can actually lead to underestimation of the actual AIME score for the model.

\n

I am using gpt-oss-120b at high reasoning with llama.cpp Metal backend.

\n
    \n
  • Anyone else observing this?
  • \n
  • Does the reference vLLM outputs have such incorrectly skipped answers?
  • \n
\n","updatedAt":"2025-08-28T12:58:12.125Z","author":{"_id":"63148d3b996c52bf0142cdbe","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63148d3b996c52bf0142cdbe/ec7pRNrQQy70d-11FiACq.jpeg","fullname":"Georgi Gerganov","name":"ggerganov","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":1621}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7599694132804871},"editors":["ggerganov"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/63148d3b996c52bf0142cdbe/ec7pRNrQQy70d-11FiACq.jpeg"],"reactions":[{"reaction":"👀","users":["imweijh","upsatwal"],"count":2}],"isReport":false}}],"pinned":false,"locked":false,"collection":"discussions","isPullRequest":false,"isReport":false},"primaryEmailConfirmed":false,"repo":{"name":"openai/gpt-oss-120b","type":"model"},"discussionRole":0,"acceptLanguages":["*"],"hideComments":true,"repoDiscussionsLocked":false,"isDiscussionAuthor":false}">

I am running some evals and I am noticing that from time to time the model produces a correct answer, but the logic in the AIME eval script does not parse it correctly. Here are a few examples:

[Four screenshots: transcripts where the model's final answer is correct, but the eval script fails to extract it]

Of course, most of the time the answers are correctly bracketed, but every now and then this happens. The occurrence rate is high enough that I think it can actually lead to an underestimation of the model's actual AIME score.
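For illustration, here is a minimal sketch of the kind of fallback extraction that would catch these misses. It assumes the script currently looks only for a `\boxed{...}` wrapper; the function name and the last-integer fallback are hypothetical, not the reference eval's logic:

```python
import re

def extract_answer(reply: str) -> str | None:
    """Extract an AIME answer (an integer 0-999) from a model reply.

    Hypothetical sketch, not the reference eval's logic: prefer the
    conventional \\boxed{...} wrapper, and if the model skipped the
    brackets, fall back to the last standalone 1-3 digit integer.
    """
    boxed = re.findall(r"\\boxed\{(\d{1,3})\}", reply)
    if boxed:
        return boxed[-1]
    # Heuristic fallback for unbracketed replies; it can pick up a
    # stray number, so it would need checking against reference outputs.
    numbers = re.findall(r"\b\d{1,3}\b", reply)
    return numbers[-1] if numbers else None

print(extract_answer(r"... so the answer is \boxed{204}."))  # 204
print(extract_answer("Thus the final answer is 204."))       # 204
```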

I am using gpt-oss-120b at high reasoning effort with the llama.cpp Metal backend.

  • Anyone else observing this?
  • Do the reference vLLM outputs have such incorrectly skipped answers?

