arxiv:2502.06703

Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

Published on Feb 10 · Submitted by AK on Feb 11 · #1 Paper of the day · 153 upvotes

Authors: Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, Bowen Zhou

AI-generated summary

Test-Time Scaling improves Large Language Model performance by optimizing inference computation based on policy models, Process Reward Models, and problem difficulty.

Abstract

Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and challenging AIME24 tasks, we have the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, while with higher inference efficiency. These findings show the significance of adapting TTS strategies to the specific characteristics of each task and model and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.
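As a rough illustration of what "using additional computation during the inference phase" can look like, the sketch below shows Best-of-N sampling, one common TTS method: draw several candidate answers from a policy model and keep the one a reward model scores highest. The abstract does not spell out which TTS methods the paper uses, so treat this as a generic example rather than the authors' procedure. `sample_answer`, `score_answer`, `toy_policy`, and `toy_reward` are hypothetical stand-ins, not real model APIs, and the scorer here rates whole answers rather than individual reasoning steps as a PRM would.

```python
import random
from typing import Callable, List


def best_of_n(
    question: str,
    sample_answer: Callable[[str, random.Random], str],  # hypothetical policy model
    score_answer: Callable[[str, str], float],           # hypothetical reward model
    n: int,
    seed: int = 0,
) -> str:
    """Sample n candidate answers and return the one the scorer rates highest."""
    rng = random.Random(seed)
    candidates: List[str] = [sample_answer(question, rng) for _ in range(n)]
    return max(candidates, key=lambda answer: score_answer(question, answer))


# Toy stand-ins, purely illustrative: a "policy" that guesses an integer and a
# "reward model" that prefers guesses close to the true sum.
def toy_policy(question: str, rng: random.Random) -> str:
    return str(rng.randint(0, 20))


def toy_reward(question: str, answer: str) -> float:
    return -abs(int(answer) - 12)  # pretend the scorer knows 7 + 5 = 12


if __name__ == "__main__":
    # A larger n spends more inference compute on the same question.
    for n in (1, 4, 16, 64):
        print(n, best_of_n("What is 7 + 5?", toy_policy, toy_reward, n))
```

Because the seed is fixed, the candidates drawn for a small `n` are a prefix of those drawn for a larger `n`, so the selected answer's score can only stay the same or improve as `n` grows. With real models the gain depends on how reliable the scorer is, which is consistent with the abstract's point that the best strategy depends on the policy model and the PRM.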

Project page: https://ryanliu112.github.io/compute-optimal-tts · GitHub: https://github.com/RyanLiu112/compute-optimal-tts (271 stars)

Community

akhaliq

Paper submitter Feb 11

https://ryanliu112.github.io/compute-optimal-tts/

pauljones0

Feb 11

TL;DR: This paper explores how to make weak models stronger using TTS approaches. It finds that the "compute-optimal" strategy depends on the TTS method, the PRM, and the policy model. There is no single right answer for every setup.

The paper also suggests further work on "weak-to-strong" supervision instead of "strong-to-weak" supervision (what we currently do, where we have a 405B model train a 70B or 7B model), as doing so would allow for more efficient and autonomous improvements in AI...
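The "no right answer for every setup" point can be sketched as a per-setup lookup: measure each TTS method for each combination of policy model, PRM, and problem difficulty, then pick the best method separately for each combination. This is only a minimal sketch of that idea. Every name and number below is an invented placeholder, not a result from the paper, and `compute_optimal_method` is a hypothetical helper.

```python
from typing import Dict, Tuple

# (policy model, PRM, difficulty) -> {TTS method: accuracy}.
# All names and numbers are invented placeholders, NOT results from the paper.
Setup = Tuple[str, str, str]
measurements: Dict[Setup, Dict[str, float]] = {
    ("small-policy", "prm-a", "easy"): {"method-1": 0.80, "method-2": 0.75},
    ("small-policy", "prm-a", "hard"): {"method-1": 0.20, "method-2": 0.35},
    ("large-policy", "prm-b", "hard"): {"method-1": 0.50, "method-2": 0.45},
}


def compute_optimal_method(setup: Setup) -> str:
    """Return the method with the highest measured accuracy for this setup."""
    scores = measurements[setup]
    return max(scores, key=lambda method: scores[method])


for setup in measurements:
    print(setup, "->", compute_optimal_method(setup))
```

In this made-up table the preferred method changes with both difficulty and the policy/PRM pair, which is the kind of dependence the comment describes.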