arxiv:2508.06600

BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

Published on Aug 8
· Submitted by Xueguang Ma on Aug 12
Authors: Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin

Project page: https://texttron.github.io/BrowseComp-Plus/
GitHub: https://github.com/texttron/BrowseComp-Plus

Abstract

AI-generated summary: BrowseComp-Plus, a curated benchmark, enables controlled evaluation of deep research agents and retrieval methods, providing insights into their performance and effectiveness.

Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success in improving the effectiveness of handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks like BrowseComp rely on black-box live web search APIs and have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; (2) transparency: lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, the current evaluations may compare a complete deep research system at a given time, but they do not foster well-controlled experiments that provide insights into the capability of the underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp that employs a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further enhances its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.
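Because the corpus is fixed, the abstract's point about isolating retriever contributions can be illustrated with a small, self-contained sketch: score a retriever on its own by checking whether each query's human-verified supporting documents appear in its top-k results. The sketch below is illustrative only; the field names (doc_id, text, positive_doc_ids) are hypothetical placeholders rather than the benchmark's actual schema, and it uses the rank_bm25 package as a stand-in for the BM25 retriever named in the abstract.

# Illustrative sketch only: isolating retriever quality on a fixed corpus.
# Field names (doc_id, text, positive_doc_ids) are hypothetical placeholders,
# not BrowseComp-Plus's actual schema.
from rank_bm25 import BM25Okapi

corpus = [
    {"doc_id": "d1", "text": "a curated web document about topic A"},
    {"doc_id": "d2", "text": "a curated web document about topic B"},
]
queries = [
    {"query": "topic B", "positive_doc_ids": {"d2"}},  # human-verified supporting docs
]

# Index the fixed corpus once; every system under comparison sees the same documents.
tokenized_corpus = [doc["text"].lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

k = 10
hits = 0
for q in queries:
    scores = bm25.get_scores(q["query"].lower().split())
    top_k = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]
    retrieved_ids = {corpus[i]["doc_id"] for i in top_k}
    # Recall@k against the verified positives reflects the retriever alone,
    # independent of the downstream deep-research LLM.
    if q["positive_doc_ids"] & retrieved_ids:
        hits += 1

print(f"queries with a supporting doc in the top-{k}: {hits}/{len(queries)}")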

Community

Paper author · Paper submitter

A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent.

BrowseComp-Plus is a new Deep-Research evaluation benchmark built on top of BrowseComp. It features:

  • a fixed, carefully curated corpus of web documents
  • human-verified positive documents
  • web-mined hard negatives

BrowseComp-Plus allows fair comparison across different LLM search agents and assesses the impact of different retrievers on deep research.
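For readers who want to try the benchmark directly, a minimal loading sketch follows. The Hub repository id and split below are assumptions (the page only notes that two datasets cite this paper); check the project page or the GitHub README for the actual dataset locations and fields before relying on them.

# Minimal loading sketch -- the repository id and split are assumptions,
# not confirmed names; see https://github.com/texttron/BrowseComp-Plus.
from datasets import load_dataset

queries = load_dataset("texttron/BrowseComp-Plus", split="test")

print(queries[0].keys())  # inspect the real fields before assuming a schema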


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2508.06600 in a model README.md to link it from this page.

Datasets citing this paper 2

Spaces citing this paper 1

Collections including this paper 3
