ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks

Published on Aug 14, 2025 · arXiv: 2508.15804
Authors: Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, Kai Jia
Code and data: https://github.com/ByteDance-BandAI/ReportBench

Comment from librarian-bot:

This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry](https://huggingface.co/papers/2507.16280) (2025)
* [Benchmarking Computer Science Survey Generation](https://huggingface.co/papers/2508.15658) (2025)
* [SurveyGen: Quality-Aware Scientific Survey Generation with Large Language Models](https://huggingface.co/papers/2508.17647) (2025)
* [Characterizing Deep Research: A Benchmark and Formal Definition](https://huggingface.co/papers/2508.04183) (2025)
* [Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper](https://huggingface.co/papers/2508.14273) (2025)
* [BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent](https://huggingface.co/papers/2508.06600) (2025)
* [DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery](https://huggingface.co/papers/2508.06960) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
AI-generated summary

ReportBench evaluates the content quality of research reports generated by large language models, focusing on cited literature quality and statement faithfulness, demonstrating that commercial Deep Research agents produce more comprehensive and reliable reports than standalone LLMs.

Abstract
The advent of Deep Research agents has substantially reduced the time
required for conducting extensive research tasks. However, these tasks
inherently demand rigorous standards of factual accuracy and comprehensiveness,
necessitating thorough evaluation before widespread adoption. In this paper, we
propose ReportBench, a systematic benchmark designed to evaluate the content
quality of research reports generated by large language models (LLMs). Our
evaluation focuses on two critical dimensions: (1) the quality and relevance of
cited literature, and (2) the faithfulness and veracity of the statements
within the generated reports. ReportBench leverages high-quality published
survey papers available on arXiv as gold-standard references, from which we
apply reverse prompt engineering to derive domain-specific prompts and
establish a comprehensive evaluation corpus. Furthermore, we develop an
agent-based automated framework within ReportBench that systematically analyzes
generated reports by extracting citations and statements, checking the
faithfulness of cited content against original sources, and validating
non-cited claims using web-based resources. Empirical evaluations demonstrate
that commercial Deep Research agents such as those developed by OpenAI and
Google consistently generate more comprehensive and reliable reports than
standalone LLMs augmented with search or browsing tools. However, there remains
substantial room for improvement in terms of the breadth and depth of research
coverage, as well as factual consistency. The complete code and data will be
released at the following link: https://github.com/ByteDance-BandAI/ReportBench
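
As a concrete illustration of the benchmark-construction step the abstract describes, here is a minimal sketch of the reverse-prompt-engineering idea: given a gold-standard survey's title and abstract, build a meta-prompt that asks an LLM to reconstruct the research prompt a user might have issued. The meta-prompt wording and the names `build_reverse_prompt` and `call_llm` are hypothetical placeholders, not the authors' released implementation.

```python
# Hypothetical sketch of the "reverse prompt engineering" step: derive a
# domain-specific research prompt from a published survey's metadata.
# META_PROMPT's wording and call_llm() are assumptions, not ReportBench code.

META_PROMPT = """You are given the title and abstract of a published survey paper.
Write the single research prompt that a user would have given a Deep Research
agent so that the agent produces a report matching this survey's topic and scope.
Output only the prompt, nothing else.

Title: {title}

Abstract: {abstract}
"""


def build_reverse_prompt(title: str, abstract: str) -> str:
    """Fill the meta-prompt with one survey's metadata."""
    return META_PROMPT.format(title=title, abstract=abstract)


def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call; plug in any LLM client here."""
    raise NotImplementedError


if __name__ == "__main__":
    # Demo: print the meta-prompt that would be sent to the LLM.
    print(build_reverse_prompt(
        title="A Survey of Deep Research Agents",
        abstract="We review agents that browse, read, and synthesize literature.",
    ))
```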
We introduce ReportBench, the first systematic benchmark for evaluating research reports generated by Deep Research agents. By leveraging expert-authored survey papers from arXiv as gold standards, ReportBench assesses both the quality of cited literature and the factual accuracy of statements. It provides an automated pipeline with citation-based and web-based verification, and we open-source all datasets, prompts, and evaluation scripts to support reproducibility and community progress.
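
The citation-based and web-based verification mentioned above might be organized roughly as follows: split the report into statements, check each statement that cites a source against that source's text, and route uncited claims to web validation. This is a simplified sketch under assumed data shapes (sentence-level statements, inline arXiv links, pre-fetched source texts); `judge_faithfulness` is a trivial stand-in for an LLM judge, and none of these names come from the released pipeline.

```python
# Simplified sketch of an agent-based report-verification loop in the spirit
# of ReportBench: extract citations and statements, check cited statements
# against their sources, and flag uncited claims for web-based validation.
# All helpers and data shapes here are assumptions, not the released code.
import re
from dataclasses import dataclass, field

# Matches inline arXiv links and captures the paper ID, e.g. 2508.15804.
ARXIV_URL = re.compile(r"https?://arxiv\.org/abs/(\d{4}\.\d{4,5})")


@dataclass
class Statement:
    text: str
    cited_ids: list = field(default_factory=list)  # arXiv IDs cited inline


def split_statements(report: str) -> list:
    """Naive sentence split; each sentence keeps the arXiv IDs it cites."""
    statements = []
    for sentence in re.split(r"(?<=[.!?])\s+", report.strip()):
        statements.append(Statement(text=sentence,
                                    cited_ids=ARXIV_URL.findall(sentence)))
    return statements


def judge_faithfulness(statement: Statement, source_text: str) -> bool:
    """Trivial stand-in for an LLM judge that would compare the statement
    against the cited paper's full text."""
    return statement.text.lower()[:40] in source_text.lower()


def evaluate_report(report: str, sources: dict) -> dict:
    """sources maps arXiv ID -> pre-fetched text of the cited paper."""
    faithful, unfaithful, uncited = 0, 0, []
    for st in split_statements(report):
        if st.cited_ids:
            texts = [sources.get(i, "") for i in st.cited_ids]
            if any(judge_faithfulness(st, t) for t in texts):
                faithful += 1
            else:
                unfaithful += 1
        else:
            uncited.append(st.text)  # would be validated with web search
    return {"faithful": faithful, "unfaithful": unfaithful,
            "needs_web_check": len(uncited)}


if __name__ == "__main__":
    demo = ("Deep Research agents reduce research time "
            "(https://arxiv.org/abs/2508.15804). They are always correct.")
    print(evaluate_report(demo, {"2508.15804":
                                 "deep research agents reduce research time"}))
```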
Hi @liminghao1630 - Nice paper, thanks for sharing! Btw, you can claim the paper with your HF account by clicking your name on the page. Feel free to let me know if you have any questions!