Papers
arxiv:2505.24785

EXP-Bench: Can AI Conduct AI Research Experiments?

Published on May 30, 2025 · Submitted by Amber on Jun 2, 2025
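Authors: Patrick Tser Jern Kon, Jiachen Liu, Xinyi Zhu, Qiuyi Ding, Jingjia Peng, Jiarong Xing, Yibo Huang, Yiming Qiu, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, Matei Zaharia, Ang Chen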

Abstract

AI-generated summary: EXP-Bench evaluates AI agents' end-to-end research experiment capabilities through curated tasks from top AI papers, highlighting current limitations.

Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline to extract and structure crucial experimental details from these research papers and their associated open-source code. With the pipeline, EXP-Bench curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgent, on EXP-Bench demonstrate partial capabilities: while scores on individual experimental aspects such as design or implementation correctness occasionally reach 20-35%, the success rate for complete, executable experiments was a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for future AI agents to improve their ability to conduct AI research experiments. EXP-Bench is open-sourced at https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.
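The gap the abstract reports between per-aspect scores (20-35%) and complete-experiment success (0.5%) is what one would expect when a task counts as complete only if every aspect is correct at once. The sketch below illustrates that aggregation on toy data. The record type `TaskResult`, its four field names, and the toy values are assumptions made for illustration; they are not EXP-Bench's actual schema, grading procedure, or results.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical per-task grading record (field names are illustrative)."""
    design_ok: bool
    implementation_ok: bool
    execution_ok: bool
    conclusion_ok: bool


def aspect_rates(results):
    """Fraction of tasks that pass each aspect, scored independently."""
    if not results:
        raise ValueError("need at least one task result")
    n = len(results)
    return {
        f.name.removesuffix("_ok"): sum(getattr(r, f.name) for r in results) / n
        for f in fields(TaskResult)
    }


def complete_success_rate(results):
    """Fraction of tasks where every aspect passes at the same time."""
    if not results:
        raise ValueError("need at least one task result")
    passed = sum(
        all(getattr(r, f.name) for f in fields(TaskResult)) for r in results
    )
    return passed / len(results)


# Toy data only: four made-up tasks, not drawn from the benchmark.
toy_results = [
    TaskResult(True, True, False, False),
    TaskResult(True, False, True, False),
    TaskResult(False, True, True, True),
    TaskResult(True, True, True, True),
]

if __name__ == "__main__":
    print(aspect_rates(toy_results))
    print(complete_success_rate(toy_results))
```

In this toy set each aspect passes on half or more of the tasks, yet only one task in four passes all of them, so the conjunctive rate sits well below every individual rate.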

Community

Paper submitter
edited Jun 2

Can AI Agents Conduct AI Research Experiments?

Paper author
edited Jun 2

https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench

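This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research (2025): https://huggingface.co/papers/2505.19955
* AI Idea Bench 2025: AI Research Idea Generation Benchmark (2025): https://huggingface.co/papers/2504.14191
* The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search (2025): https://huggingface.co/papers/2504.08066
* AI-Researcher: Autonomous Scientific Innovation (2025): https://huggingface.co/papers/2505.18705
* ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines (2025): https://huggingface.co/papers/2504.04808
* MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation (2025): https://huggingface.co/papers/2505.17123
* Generative to Agentic AI: Survey, Conceptualization, and Challenges (2025): https://huggingface.co/papers/2504.18875

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can ask Librarian Bot for paper recommendations directly by tagging it in a comment: @librarian-bot recommend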
