akhaliq: https://github.com/NovaSky-AI/SkyThought
DachengLi: Thank you very much @akhaliq for featuring our work! The code is now available in the repo!
librarian-bot: This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning](https://huggingface.co/papers/2502.07154) (2025)
* [MetaSC: Test-Time Safety Specification Optimization for Language Models](https://huggingface.co/papers/2502.07985) (2025)
* [Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models](https://huggingface.co/papers/2502.06039) (2025)
* [Dynamic Scaling of Unit Tests for Code Reward Modeling](https://huggingface.co/papers/2501.01054) (2025)
* [Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling](https://huggingface.co/papers/2502.06703) (2025)
* [MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation](https://huggingface.co/papers/2502.12468) (2025)
* [FairCode: Evaluating Social Bias of LLMs in Code Generation](https://huggingface.co/papers/2501.05396) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space.

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
S*: Test Time Scaling for Code Generation

Authors: Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica

Published: 2025-02-20
Project page: https://novasky-ai.github.io/posts/S*/
Code: https://github.com/NovaSky-AI/SkyThought
AI-generated summary: A hybrid test-time scaling framework improves code generation coverage and accuracy across various models and domains.
Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of generated code. S* extends the existing parallel scaling paradigm with sequential scaling to push performance boundaries. It further leverages a novel selection mechanism that adaptively generates distinguishing inputs for pairwise comparison, combined with execution-grounded information, to robustly identify correct solutions. We evaluate across 12 Large Language Models and Large Reasoning Models and show: (1) S* consistently improves performance across model families and sizes, enabling a 3B model to outperform GPT-4o-mini; (2) S* enables non-reasoning models to surpass reasoning models: GPT-4o-mini with S* outperforms o1-preview by 3.7% on LiveCodeBench; (3) S* further boosts state-of-the-art reasoning models: DeepSeek-R1-Distill-Qwen-32B with S* achieves 85.7% on LiveCodeBench, approaching o1 (high) at 88.5%. Code is available at https://github.com/NovaSky-AI/SkyThought.
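The selection idea in the abstract can be sketched in code. This is a minimal, hypothetical simplification, not the paper's implementation: in S* itself an LLM adaptively proposes distinguishing inputs and judges candidates using their actual execution outputs, whereas in this sketch the inputs come from a fixed pool and a stand-in `oracle` function plays the judge's role.

```python
# Sketch of execution-grounded pairwise selection among generated candidates.
# Hypothetical simplification: `input_pool` and `oracle` stand in for the
# LLM-generated distinguishing inputs and LLM judge described in the paper.

def run(program, x):
    """Execute a candidate solution (a Python callable here) on input x."""
    try:
        return program(x)
    except Exception:
        return None  # crashing on an input counts against a candidate

def distinguishing_input(prog_a, prog_b, input_pool):
    """Find an input on which the two candidates disagree, if any."""
    for x in input_pool:
        if run(prog_a, x) != run(prog_b, x):
            return x
    return None

def pairwise_select(candidates, input_pool, oracle):
    """Tournament-style selection: when two candidates disagree on some
    input, keep the one whose executed output matches the judge (`oracle`)."""
    best = candidates[0]
    for challenger in candidates[1:]:
        x = distinguishing_input(best, challenger, input_pool)
        if x is None:
            continue  # behaviorally identical on this pool; keep incumbent
        if run(challenger, x) == oracle(x) and run(best, x) != oracle(x):
            best = challenger  # challenger is right where the incumbent is wrong
    return best

# Toy usage: three "generated" solutions for absolute value.
cands = [lambda x: x, lambda x: max(x, -x), lambda x: -x]
winner = pairwise_select(cands, input_pool=[-2, 0, 3], oracle=abs)
print(winner(-5))  # the surviving candidate behaves like abs on this pool
```

Comparing candidates only on inputs where they actually disagree is what makes the selection execution-grounded: agreement on an input carries no signal, so the search effort concentrates on behaviorally distinguishing cases.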