BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
Authors: Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin
Published: 2025-08-08
Project page: https://texttron.github.io/BrowseComp-Plus/
Code: https://github.com/texttron/BrowseComp-Plus
BrowseComp-Plus, a curated benchmark, enables controlled evaluation of deep research agents and retrieval methods, providing insights into their performance and effectiveness.
AI-generated summary
Deep-Research agents, which integrate large language models (LLMs) with
search tools, have shown success in improving the effectiveness of handling
complex queries that require iterative search planning and reasoning over
search results. Evaluations on current benchmarks like BrowseComp, which rely on
black-box live web search APIs, have notable limitations in (1) fairness:
dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep
research methods; (2) transparency: lack of control over the document corpus
makes it difficult to isolate retriever contributions. In other words, the
current evaluations may compare a complete deep research system at a given
time, but they do not foster well-controlled experiments to provide insights
into the capability of underlying deep research LLMs. To address these
challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp,
employing a fixed, carefully curated corpus. Each query in BrowseComp-Plus
includes human-verified supporting documents and mined challenging negatives,
enabling controlled experimentation. The benchmark is shown to be effective in
distinguishing the performance of deep research systems. For instance, the
open-source model Search-R1, when paired with the BM25 retriever, achieves
3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with
the Qwen3-Embedding-8B retriever further enhances its accuracy to 70.1% with
fewer search calls. This benchmark allows comprehensive evaluation and
disentangled analysis of deep research agents and retrieval methods, fostering
insights into retrieval effectiveness, citation accuracy, and context
engineering in Deep-Research systems.
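
To make the idea of a controlled comparison concrete, here is a minimal sketch of how a fixed corpus with labeled supporting documents lets two retrievers be scored under identical conditions. Everything in it is an illustrative assumption rather than the benchmark's actual harness or data format: the toy `CORPUS`, the `QUERIES` record layout, the function names, and the simple term-overlap baseline are all hypothetical, and the BM25 scorer is a compact textbook variant, not the implementation used in the paper.

```python
import math
from collections import Counter

# Toy stand-in for a fixed corpus (hypothetical documents and ids).
CORPUS = {
    "d1": "the bridge opened in 1932 and spans the harbour",
    "d2": "the harbour ferry timetable changed last year",
    "d3": "a recipe for sourdough bread with rye flour",
}

# Hypothetical query record: query text plus ids of its supporting documents.
QUERIES = [
    {"text": "when did the harbour bridge open", "positives": {"d1"}},
]


def tokenize(text):
    """Lowercase whitespace tokenization (deliberately simplistic)."""
    return text.lower().split()


def bm25_retriever(query, corpus, k=1, k1=1.2, b=0.75):
    """Rank documents with a compact BM25 scorer and return the top-k ids."""
    docs = {doc_id: tokenize(text) for doc_id, text in corpus.items()}
    n = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    # Sort by descending score, breaking ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


def overlap_retriever(query, corpus, k=1):
    """Baseline: rank by the number of distinct query terms a document contains."""
    query_terms = set(tokenize(query))
    scores = {d: len(query_terms & set(tokenize(t))) for d, t in corpus.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


def recall_at_k(retriever, queries, corpus, k):
    """Mean fraction of each query's supporting documents found in the top k."""
    total = 0.0
    for q in queries:
        retrieved = set(retriever(q["text"], corpus, k=k))
        total += len(retrieved & q["positives"]) / len(q["positives"])
    return total / len(queries)


if __name__ == "__main__":
    for name, retriever in [("bm25", bm25_retriever), ("overlap", overlap_retriever)]:
        print(name, recall_at_k(retriever, QUERIES, CORPUS, k=1))
```

Because the corpus and the relevance labels are held fixed, any difference in `recall_at_k` between the two calls is attributable to the retriever alone, which is the kind of isolation that a live web search API cannot provide.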