arxiv:2503.07906

Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning

Published on Mar 10, 2025
· Submitted by QinghaoYe on Mar 21, 2025
Authors:
Qinghao Ye, Xianhan Zeng, Fu Li, Chunyuan Li, Haoqi Fan
Abstract

DeCapBench, a new benchmark with DCScore, improves detailed image captioning evaluation and performance by reducing hallucinations and enhancing comprehensiveness.

AI-generated summary

Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed image captioning remains underexplored due to outdated evaluation metrics and coarse annotations. In this paper, we introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units, termed primitive information units, and assessing them individually. Our evaluation shows that DCScore aligns more closely with human judgment than other rule-based or model-based metrics. Concurrently, DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models. Additionally, we present an automatic fine-grained feedback collection method, FeedQuill, for preference optimization based on our advanced metric, showing robust generalization capabilities across auto-generated preference data. Extensive experiments on multiple VLMs demonstrate that our method not only significantly reduces hallucinations but also enhances performance across various benchmarks, achieving superior detail captioning performance while surpassing GPT-4o.
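To make the metric idea above concrete, here is a minimal, hypothetical Python sketch of the scoring scheme the abstract describes: a caption is broken into primitive information units, each unit is checked individually against the image (in practice presumably by a judge model), and the results are aggregated into a precision-style term that penalizes hallucinated units and a recall-style term that rewards fine-grained coverage. The function names (`dcscore_sketch`, `verify`, `match`), the stand-in judges, and the F1-style aggregation are illustrative assumptions, not the authors' implementation of DCScore.

```python
# Hypothetical sketch of the unit-based scoring idea; not the authors' code.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Unit:
    text: str        # one primitive information unit, e.g. "the cat is black"
    verified: bool   # whether the (assumed) judge confirms it against the image


def dcscore_sketch(
    caption_units: List[str],
    reference_units: List[str],
    verify: Callable[[str], bool],
    match: Callable[[str, List[str]], bool],
) -> dict:
    """Toy precision/recall aggregation over primitive information units."""
    # Precision-like term: fraction of generated units supported by the image
    # (penalizes hallucination).
    judged = [Unit(u, verify(u)) for u in caption_units]
    precision = sum(u.verified for u in judged) / max(len(judged), 1)

    # Recall-like term: fraction of reference units covered by the caption
    # (rewards fine-grained comprehensiveness).
    recall = sum(match(r, caption_units) for r in reference_units) / max(
        len(reference_units), 1
    )

    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return {"precision": precision, "recall": recall, "f1": f1}


# Example usage with trivial stand-in judges (real use would call a VLM/LLM).
caps = ["a black cat sits on a red sofa", "the cat wears a collar"]
refs = ["a black cat", "a red sofa", "a window in the background"]
print(dcscore_sketch(
    caps, refs,
    verify=lambda u: True,                          # stand-in image verifier
    match=lambda r, cs: any(r in c for c in cs),    # stand-in unit matcher
))
```

Under this reading, a feedback-collection method like FeedQuill could use such unit-level verdicts to build preference pairs for optimization, though the paper itself should be consulted for the actual procedure.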


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2503.07906 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2503.07906 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2503.07906 in a Space README.md to link it from this page.

Collections including this paper 1
