Papers
arxiv:2506.11763

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Published on Jun 13, 2025
· Submitted by Mingxuan Du on Jun 17
#2 Paper of the day
Authors: Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, Zhendong Mao

Abstract

AI-generated summary: DeepResearch Bench offers a benchmark framework to evaluate the capabilities of Deep Research Agents in terms of research quality and information retrieval accuracy across multiple fields.

Deep Research Agents (DRAs) are a prominent category of LLM-based agents. By autonomously orchestrating multistep web exploration, targeted retrieval, and higher-order synthesis, they transform vast amounts of online information into analyst-grade, citation-rich reports, compressing hours of manual desk research into minutes. However, a comprehensive benchmark for systematically evaluating the capabilities of these agents remains absent. To bridge this gap, we present DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks, each meticulously crafted by domain experts across 22 distinct fields. Evaluating DRAs is inherently complex and labor-intensive. We therefore propose two novel methodologies that achieve strong alignment with human judgment. The first is a reference-based method with adaptive criteria that assesses the quality of generated research reports. The second evaluates a DRA's information retrieval and collection capabilities by assessing its effective citation count and overall citation accuracy. We have open-sourced DeepResearch Bench and key components of these frameworks at https://github.com/Ayanami0730/deep_research_bench to accelerate the development of practical LLM-based agents.
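The abstract names two retrieval metrics, effective citation count and overall citation accuracy, without defining them. The sketch below shows one plausible reading, not the authors' implementation: accuracy is taken as the share of unique (statement, URL) pairs judged supported, and effective citations as the average number of supported pairs per report. The `CitedStatement` schema, the `supported` verdict field, and the `citation_metrics` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitedStatement:
    """One (statement, source URL) pair taken from a report (hypothetical schema)."""
    statement: str
    url: str
    supported: bool  # verdict from some judge; how it is produced is out of scope


def citation_metrics(reports: list[list[CitedStatement]]) -> tuple[float, float]:
    """Return (average effective citations per report, overall citation accuracy).

    Assumed definitions, not taken from the paper:
      - effective citations = supported unique pairs, averaged over reports
      - citation accuracy   = supported unique pairs / all unique pairs
    """
    total = 0
    supported = 0
    for pairs in reports:
        # Deduplicate so a repeated (statement, url) pair is counted once.
        unique = {(p.statement, p.url): p.supported for p in pairs}
        total += len(unique)
        supported += sum(unique.values())
    effective = supported / len(reports) if reports else 0.0
    accuracy = supported / total if total else 0.0
    return effective, accuracy


if __name__ == "__main__":
    demo = [
        [
            CitedStatement("Claim A", "https://example.org/a", True),
            CitedStatement("Claim A", "https://example.org/a", True),  # duplicate
            CitedStatement("Claim B", "https://example.org/b", False),
        ],
        [CitedStatement("Claim C", "https://example.org/c", True)],
    ]
    print(citation_metrics(demo))
```

Reporting the two numbers separately keeps a system that cites rarely but correctly distinguishable from one that cites heavily with many unsupported claims.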

