Comment from Patrick Kon (patkon):
https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:

* [MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research](https://huggingface.co/papers/2505.19955) (2025)
* [AI Idea Bench 2025: AI Research Idea Generation Benchmark](https://huggingface.co/papers/2504.14191) (2025)
* [The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search](https://huggingface.co/papers/2504.08066) (2025)
* [AI-Researcher: Autonomous Scientific Innovation](https://huggingface.co/papers/2505.18705) (2025)
* [ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines](https://huggingface.co/papers/2504.04808) (2025)
* [MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation](https://huggingface.co/papers/2505.17123) (2025)
* [Generative to Agentic AI: Survey, Conceptualization, and Challenges](https://huggingface.co/papers/2504.18875) (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face, check out [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space.

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
\n","updatedAt":"2025-06-03T01:37:27.252Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7177594304084778},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2505.24785","authors":[{"_id":"683dfe8b4acfa22520c6a9ed","user":{"_id":"64b7111e17681d64b19cf95e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64b7111e17681d64b19cf95e/VHPfCUl1nBS3OMMVi96CR.jpeg","isPro":true,"fullname":"Patrick Kon","user":"patkon","type":"user"},"name":"Patrick Tser Jern Kon","status":"claimed_verified","statusLastChangedAt":"2025-06-03T08:47:31.338Z","hidden":false},{"_id":"683dfe8b4acfa22520c6a9ee","name":"Jiachen Liu","hidden":false},{"_id":"683dfe8b4acfa22520c6a9ef","name":"Xinyi Zhu","hidden":false},{"_id":"683dfe8b4acfa22520c6a9f0","name":"Qiuyi Ding","hidden":false},{"_id":"683dfe8b4acfa22520c6a9f1","name":"Jingjia Peng","hidden":false},{"_id":"683dfe8b4acfa22520c6a9f2","name":"Jiarong Xing","hidden":false},{"_id":"683dfe8b4acfa22520c6a9f3","name":"Yibo Huang","hidden":false},{"_id":"683dfe8b4acfa22520c6a9f4","name":"Yiming Qiu","hidden":false},{"_id":"683dfe8b4acfa22520c6a9f5","name":"Jayanth Srinivasa","hidden":false},{"_id":"683dfe8b4acfa22520c6a9f6","name":"Myungjin Lee","hidden":false},{"_id":"683dfe8b4acfa22520c6a9f7","name":"Mosharaf Chowdhury","hidden":false},{"_id":"683dfe8b4acfa22520c6a9f8","name":"Matei Zaharia","hidden":false},{"_id":"683dfe8b4acfa22520c6a9f9","name":"Ang Chen","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/648fc22019e7511674b31f12/NQsfxORQkdLT7SJ0OwTDp.png"],"publishedAt":"2025-05-30T16:46:29.000Z","submittedOnDailyAt":"2025-06-02T18:14:37.764Z","title":"EXP-Bench: Can AI Conduct AI Research Experiments?","submittedOnDailyBy":{"_id":"648fc22019e7511674b31f12","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/648fc22019e7511674b31f12/9kRR00GMFYcuj6zR0BVfx.jpeg","isPro":false,"fullname":"Amber","user":"AmberLJC","type":"user"},"summary":"Automating AI research holds immense potential for accelerating scientific\nprogress, yet current AI agents struggle with the complexities of rigorous,\nend-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed\nto systematically evaluate AI agents on complete research experiments sourced\nfrom influential AI publications. Given a research question and incomplete\nstarter code, EXP-Bench challenges AI agents to formulate hypotheses, design\nand implement experimental procedures, execute them, and analyze results. To\nenable the creation of such intricate and authentic tasks with high-fidelity,\nwe design a semi-autonomous pipeline to extract and structure crucial\nexperimental details from these research papers and their associated\nopen-source code. With the pipeline, EXP-Bench curated 461 AI research tasks\nfrom 51 top-tier AI research papers. 
Evaluations of leading LLM-based agents,\nsuch as OpenHands and IterativeAgent on EXP-Bench demonstrate partial\ncapabilities: while scores on individual experimental aspects such as design or\nimplementation correctness occasionally reach 20-35%, the success rate for\ncomplete, executable experiments was a mere 0.5%. By identifying these\nbottlenecks and providing realistic step-by-step experiment procedures,\nEXP-Bench serves as a vital tool for future AI agents to improve their ability\nto conduct AI research experiments. EXP-Bench is open-sourced at\nhttps://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.","upvotes":23,"discussionId":"683dfe8d4acfa22520c6aa4f","githubRepo":"https://github.com/Just-Curieous/Curie","ai_summary":"EXP-Bench evaluates AI agents' end-to-end research experiment capabilities through curated tasks from top AI papers, highlighting current limitations.","ai_keywords":["LLM-based agents","OpenHands","IterativeAgent","end-to-end experimentation"],"githubStars":288},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"648fc22019e7511674b31f12","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/648fc22019e7511674b31f12/9kRR00GMFYcuj6zR0BVfx.jpeg","isPro":false,"fullname":"Amber","user":"AmberLJC","type":"user"},{"_id":"64b7111e17681d64b19cf95e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64b7111e17681d64b19cf95e/VHPfCUl1nBS3OMMVi96CR.jpeg","isPro":true,"fullname":"Patrick Kon","user":"patkon","type":"user"},{"_id":"683e01abe0e07b92e10a1e7a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/BbDjIzKiB3WZdFL_Fqb55.png","isPro":false,"fullname":"Yiming Qiu","user":"RichCoro","type":"user"},{"_id":"66ff8aa43a31c499dc48fdd6","avatarUrl":"/avatars/060dc90fb13991bd013ce8173f12ae3e.svg","isPro":false,"fullname":"Jiarong Xing","user":"JerryPotter","type":"user"},{"_id":"67bfd6c2d067f2e27ec5b552","avatarUrl":"/avatars/9ba2ce4f1cbaaccb3a1d5d2b94ab34b7.svg","isPro":false,"fullname":"Bob Huang","user":"yiboh","type":"user"},{"_id":"648b3ee8be27fd317dda7827","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/648b3ee8be27fd317dda7827/D97r84viy7Me8M153X_B2.jpeg","isPro":false,"fullname":"Jae-Won Chung","user":"jaywonchung","type":"user"},{"_id":"66feed126e2703dc201e4dd8","avatarUrl":"/avatars/2bd7a7434322235b4100ff0092d68955.svg","isPro":false,"fullname":"Ruofan Wu","user":"ruofanwu","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"648b5308f6fffa8bba0a5c20","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/l_elUepfMlbNsEuUbKQ9z.jpeg","isPro":false,"fullname":"Mosharaf Chowdhury","user":"mosharaf","type":"user"},{"_id":"66ccc366f780ffef2570ef3b","avatarUrl":"/avatars/71ac30cfe79ec59cd59313b2e33bdbfd.svg","isPro":false,"fullname":"Myungjin Lee","user":"openmlee","type":"user"},{"_id":"67c0b56660f914064b7ad37b","avatarUrl":"/avatars/8785ce0f4064f12efaf0332bc0752853.svg","isPro":false,"fullname":"Ang Chen","user":"angchen","type":"user"},{"_id":"67bf6a68b141348461cb3a01","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/NuDRBl35E6I-oKMut2mut.png","isPro":false,"fullname":"Jayanth Srinivasa","user":"tronicsfan","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary

EXP-Bench evaluates AI agents' end-to-end research experiment capabilities through curated tasks from top AI papers, highlighting current limitations.

Abstract
Automating AI research holds immense potential for accelerating scientific
progress, yet current AI agents struggle with the complexities of rigorous,
end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed
to systematically evaluate AI agents on complete research experiments sourced
from influential AI publications. Given a research question and incomplete
starter code, EXP-Bench challenges AI agents to formulate hypotheses, design
and implement experimental procedures, execute them, and analyze results. To
enable the creation of such intricate and authentic tasks with high fidelity,
we design a semi-autonomous pipeline to extract and structure crucial
experimental details from these research papers and their associated
open-source code. With this pipeline, EXP-Bench curated 461 AI research tasks
from 51 top-tier AI research papers. Evaluations of leading LLM-based agents,
such as OpenHands and IterativeAgent, on EXP-Bench demonstrate partial
capabilities: while scores on individual experimental aspects such as design or
implementation correctness occasionally reach 20-35%, the success rate for
complete, executable experiments is a mere 0.5%. By identifying these
bottlenecks and providing realistic step-by-step experiment procedures,
EXP-Bench serves as a vital tool for future AI agents to improve their ability
to conduct AI research experiments. EXP-Bench is open-sourced at
https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.
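
To make the task setup concrete, below is a minimal sketch of how a benchmark task of this kind might be represented and scored per aspect. The field names (`research_question`, `reference_design`, etc.) and the `score_attempt` helper are illustrative assumptions, not EXP-Bench's actual schema or metrics; see the linked repository for the real format.

```python
# Illustrative sketch only: the field names and scoring logic below are
# assumptions for exposition, not EXP-Bench's actual schema or metrics.
from dataclasses import dataclass


@dataclass
class ExperimentTask:
    paper_id: str              # source publication the task was mined from
    research_question: str     # question the agent must investigate
    starter_code_url: str      # incomplete starter code handed to the agent
    reference_design: list     # ground-truth experimental steps
    reference_conclusion: str  # expected takeaway from the results


def score_attempt(task, proposed_design, ran_to_completion, conclusion):
    """Score one agent attempt along separate aspects. Complete success is
    all-or-nothing, which is how aspect scores in the 20-35% range can
    coexist with a 0.5% complete-success rate."""
    overlap = len(set(proposed_design) & set(task.reference_design))
    design = overlap / max(len(task.reference_design), 1)
    execution = 1.0 if ran_to_completion else 0.0
    analysis = 1.0 if conclusion.strip() == task.reference_conclusion else 0.0
    complete = float(design == 1.0 and execution == 1.0 and analysis == 1.0)
    return {"design": design, "execution": execution,
            "analysis": analysis, "complete": complete}


# Hypothetical usage with made-up task content:
task = ExperimentTask(
    paper_id="2505.24785",
    research_question="Does the proposed method beat the baseline?",
    starter_code_url="https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench",
    reference_design=["prepare dataset", "train baseline", "run ablation"],
    reference_conclusion="the method outperforms the baseline",
)
print(score_attempt(task, ["train baseline"], False, "inconclusive"))
# -> {'design': 0.33..., 'execution': 0.0, 'analysis': 0.0, 'complete': 0.0}
```

The all-or-nothing `complete` flag mirrors the gap the paper reports: an agent can get partial credit on design or implementation while still failing to deliver a fully executable, correctly concluded experiment.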