arxiv:2505.24098

HardTests: Synthesizing High-Quality Test Cases for LLM Coding

Published on May 30 · Submitted by Kexun Zhang on Jun 2
Authors: Zhongmou He, Yee Man Choi, Kexun Zhang, Jiabao Ji, Junting Zhou, Dejia Xu, Ivan Bercovich, Aidan Zhang, Lei Li
Abstract

HARDTESTGEN creates a large, high-quality competitive programming dataset to enhance the precision and recall of verifiers in evaluating LLM-generated code.

AI-generated summary

Verifiers play a crucial role in large language model (LLM) reasoning, needed by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to get for difficult coding problems, because a well-disguised wrong solution may only be detected by carefully human-written edge cases that are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate a comprehensive competitive programming dataset HARDTESTS with 47k problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests demonstrate precision that is 11.3 percentage points higher and recall that is 17.5 percentage points higher when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 points. HARDTESTS also proves to be more effective for model training, measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at https://leililab.github.io/HardTests/.

Community

Paper submitter
edited Jun 3

See examples and results at: https://leililab.github.io/HardTests/

RLVR is not just about RL; it's more about VR!

Particularly for LLM coding, good verifiers (tests) are hard to get!

In our latest work, we ask 3 questions: How good are current tests? How do we get better tests? How much does test quality matter?

Current tests are BAD. Some are too weak to break inefficient programs. Others lack special judge functions for checking program outputs and mark right programs as wrong. Combined, they create LOTS of false positives/negatives. So what do we do?
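
To make the second failure mode concrete, here is a hypothetical illustration (the problem, data, and function names are made up for this post, not taken from the paper): when a problem accepts many valid answers, exact string matching against one stored answer rejects correct programs, while a special judge re-checks the required property instead.

```python
# Hypothetical example: "print any two indices i < j with a[i] + a[j] == target".
# Several index pairs can be correct, so comparing against a single stored answer
# produces false negatives; a special judge must re-validate the property itself.

def exact_match_verdict(program_output: str, expected_output: str) -> bool:
    # Fragile: only the one stored answer passes.
    return program_output.strip() == expected_output.strip()

def special_judge_verdict(program_output: str, a: list[int], target: int) -> bool:
    # Robust: re-check the property the problem actually asks for.
    try:
        i, j = map(int, program_output.split())
    except ValueError:
        return False
    return 0 <= i < j < len(a) and a[i] + a[j] == target

a, target = [2, 3, 1, 4], 5
print(exact_match_verdict("2 3", "0 1"))        # False: a valid answer is rejected (false negative)
print(special_judge_verdict("2 3", a, target))  # True: the same answer is accepted
```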

We propose HardTestGen, an LLM-based test synthesis pipeline that produces much better tests than the ones people commonly use, such as TACO. With it, we curate a problem set of 47k competition problems with good tests. But why should you care?
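
As a rough mental model only (this is NOT the actual HardTestGen pipeline; `llm_complete`, the prompt, the file layout, and the timeouts are all hypothetical placeholders), LLM-based test synthesis generally means asking an LLM for an input generator that stresses edge cases and maximum constraints, then running a trusted oracle solution on those inputs to obtain expected outputs:

```python
# Rough sketch of the general idea only; not the HardTestGen pipeline itself.
import subprocess

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("call your LLM provider here")  # hypothetical stand-in

def run_program(source_path: str, stdin_text: str, timeout_s: float = 10.0) -> str:
    proc = subprocess.run(["python", source_path], input=stdin_text,
                          capture_output=True, text=True, timeout=timeout_s)
    return proc.stdout

def synthesize_tests(problem_statement: str, oracle_path: str, n_cases: int = 20) -> list[dict]:
    # 1) Ask an LLM for an input generator that targets edge cases and large limits.
    generator_code = llm_complete(
        "Write a Python script that reads a random seed from sys.argv[1] and prints one "
        "valid input for this problem, biased toward edge cases and maximum constraints:\n\n"
        + problem_statement
    )
    with open("generator.py", "w") as f:
        f.write(generator_code)
    # 2) Run a trusted oracle (reference solution) on each generated input.
    tests = []
    for seed in range(n_cases):
        gen = subprocess.run(["python", "generator.py", str(seed)],
                             capture_output=True, text=True, timeout=30)
        tests.append({"input": gen.stdout, "output": run_program(oracle_path, gen.stdout)})
    return tests
```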

We run post-training experiments in three scenarios (teacher-distillation, self-distillation, and RL) to study when good tests matter. It turns out they don't matter for teacher-distillation, but they matter a great deal for self-distillation and RL.
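
Why test quality bites in the last two settings: both self-distillation and RL use the tests as the verifier, so every false positive/negative flows straight into the training signal. A minimal sketch, assuming a binary pass/fail reward and a hypothetical sandboxed `run_solution` executor:

```python
# Sketch of how tests enter post-training; `run_solution` is a hypothetical
# sandboxed executor, and the reward shaping here is illustrative, not the paper's.

def run_solution(solution_code: str, test_input: str) -> str:
    raise NotImplementedError("execute the candidate program in a sandbox")

def verifier_reward(solution_code: str, tests: list[dict]) -> float:
    # Binary pass/fail reward, as typically used for RL with verifiable rewards.
    for t in tests:
        if run_solution(solution_code, t["input"]).strip() != t["output"].strip():
            return 0.0
    return 1.0

def self_distill_filter(samples: list[str], tests: list[dict]) -> list[str]:
    # Keep only self-generated solutions the tests accept: false positives let wrong
    # programs into the fine-tuning set, false negatives discard good ones.
    return [s for s in samples if verifier_reward(s, tests) == 1.0]
```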

Our problem set is now available at https://huggingface.co/datasets/sigcp/hardtests_problems, with the synthesis code and synthetic tests coming soon.
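
To poke at the released problems, something like the following should work with the `datasets` library; the splits and columns are not documented in this post, so the snippet just inspects whatever schema the Hub returns rather than assuming field names:

```python
from datasets import load_dataset

# Load the released problem set from the Hub and inspect its structure.
ds = load_dataset("sigcp/hardtests_problems")
print(ds)                           # available splits, column names, row counts
first_split = next(iter(ds.values()))
print(first_split[0].keys())        # fields of one example; check before relying on names
```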

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs (2025) https://huggingface.co/papers/2504.14655
- OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs (2025) https://huggingface.co/papers/2504.04030
- SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs (2025) https://huggingface.co/papers/2504.14757
- rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset (2025) https://huggingface.co/papers/2505.21297
- Insights from Verification: Training a Verilog Generation LLM with Reinforcement Learning with Testbench Feedback (2025) https://huggingface.co/papers/2504.15804
- VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models (2025) https://huggingface.co/papers/2505.15801
- Iterative Self-training for Code Generation via Reinforced Re-ranking (2025) https://huggingface.co/papers/2504.09643

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

