arxiv:2508.15804

ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks

Published on Aug 14
Submitted by Minghao Li on Aug 27
Authors: Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, Kai Jia
Abstract

ReportBench evaluates the content quality of research reports generated by large language models, focusing on cited literature quality and statement faithfulness, demonstrating that commercial Deep Research agents produce more comprehensive and reliable reports than standalone LLMs.

AI-generated summary

The advent of Deep Research agents has substantially reduced the time required for conducting extensive research tasks. However, these tasks inherently demand rigorous standards of factual accuracy and comprehensiveness, necessitating thorough evaluation before widespread adoption. In this paper, we propose ReportBench, a systematic benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs). Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports. ReportBench leverages high-quality published survey papers available on arXiv as gold-standard references, from which we apply reverse prompt engineering to derive domain-specific prompts and establish a comprehensive evaluation corpus. Furthermore, we develop an agent-based automated framework within ReportBench that systematically analyzes generated reports by extracting citations and statements, checking the faithfulness of cited content against original sources, and validating non-cited claims using web-based resources. Empirical evaluations demonstrate that commercial Deep Research agents such as those developed by OpenAI and Google consistently generate more comprehensive and reliable reports than standalone LLMs augmented with search or browsing tools. However, there remains substantial room for improvement in terms of the breadth and depth of research coverage, as well as factual consistency. The complete code and data will be released at the following link: https://github.com/ByteDance-BandAI/ReportBench
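The citation-quality dimension described above (extract citations from a generated report, then compare them against the reference list of a gold-standard survey) can be sketched roughly as follows. The citation format, function names, and the precision/recall scoring are illustrative assumptions for this sketch, not the paper's actual implementation:

```python
import re


def extract_citations(report_text):
    """Extract arXiv IDs cited in a report.

    Assumes citations appear as plain arxiv.org/abs/ URLs; the real
    pipeline would need to handle many citation styles.
    """
    return set(re.findall(r"arxiv\.org/abs/(\d{4}\.\d{4,5})", report_text))


def citation_quality(report_text, gold_references):
    """Score cited-literature quality as overlap with a gold-standard
    survey's reference list (precision/recall over arXiv IDs)."""
    cited = extract_citations(report_text)
    gold = set(gold_references)
    if not cited or not gold:
        return {"precision": 0.0, "recall": 0.0}
    hits = cited & gold
    return {
        "precision": len(hits) / len(cited),
        "recall": len(hits) / len(gold),
    }
```

A report citing one gold reference and one off-list paper would score 0.5 on both metrics under this toy scheme; the paper's agent-based framework additionally checks the faithfulness of each cited statement against the source, which this sketch omits.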

Community

Paper author Paper submitter

We introduce ReportBench, the first systematic benchmark for evaluating research reports generated by Deep Research agents. By leveraging expert-authored survey papers from arXiv as gold standards, ReportBench assesses both the quality of cited literature and the factual accuracy of statements. It provides an automated pipeline with citation-based and web-based verification, and we open-source all datasets, prompts, and evaluation scripts to support reproducibility and community progress.


Hi @liminghao1630 - Nice paper, thanks for sharing! By the way, you can claim the paper with your HF account by clicking your name on the page. Feel free to let me know if you have any questions!

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry (https://huggingface.co/papers/2507.16280) (2025)
* Benchmarking Computer Science Survey Generation (https://huggingface.co/papers/2508.15658) (2025)
* SurveyGen: Quality-Aware Scientific Survey Generation with Large Language Models (https://huggingface.co/papers/2508.17647) (2025)
* Characterizing Deep Research: A Benchmark and Formal Definition (https://huggingface.co/papers/2508.04183) (2025)
* Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper (https://huggingface.co/papers/2508.14273) (2025)
* BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent (https://huggingface.co/papers/2508.06600) (2025)
* DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery (https://huggingface.co/papers/2508.06960) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0


Datasets citing this paper 1

Spaces citing this paper 0


Collections including this paper 7
