\n","updatedAt":"2024-06-09T00:40:10.933Z","author":{"_id":"6186ddf6a7717cb375090c01","avatarUrl":"/avatars/716b6a7d1094c8036b2a8a7b9063e8aa.svg","fullname":"Julien BLANCHON","name":"blanchon","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":142}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.5198525190353394},"editors":["blanchon"],"editorAvatarUrls":["/avatars/716b6a7d1094c8036b2a8a7b9063e8aa.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2404.19752","authors":[{"_id":"6631b6fd351231c42803242e","user":{"_id":"6520e493b80dc49ba0f1e262","avatarUrl":"/avatars/af21c1ee154b2c9a0d56e69a07508ccb.svg","isPro":false,"fullname":"Yunhao Ge","user":"yunhaog","type":"user"},"name":"Yunhao Ge","status":"claimed_verified","statusLastChangedAt":"2024-05-20T07:39:04.409Z","hidden":false},{"_id":"6631b6fd351231c42803242f","user":{"_id":"63363cb8a8048332fdc268cd","avatarUrl":"/avatars/a9bf6ce3394a8b76eade1a7517b37fa1.svg","isPro":false,"fullname":"xiaohui zeng","user":"xiaohui2022","type":"user"},"name":"Xiaohui Zeng","status":"claimed_verified","statusLastChangedAt":"2024-11-16T19:47:26.135Z","hidden":false},{"_id":"6631b6fd351231c428032430","name":"Jacob Samuel Huffman","hidden":false},{"_id":"6631b6fd351231c428032431","user":{"_id":"63b738acbd2d1535227daa4c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63b738acbd2d1535227daa4c/dbPQFvHwC-Cf-ssMGYUo6.jpeg","isPro":false,"fullname":"Tsung-Yi Lin","user":"tsungyi","type":"user"},"name":"Tsung-Yi Lin","status":"admin_assigned","statusLastChangedAt":"2024-05-02T08:21:48.636Z","hidden":false},{"_id":"6631b6fd351231c428032432","user":{"_id":"62f049afdf4b93aad5c7f2d6","avatarUrl":"/avatars/e272e58ad996733d7098e50248e5b57e.svg","isPro":false,"fullname":"Ming-Yu Liu","user":"mingyuliutw","type":"user"},"name":"Ming-Yu Liu","status":"admin_assigned","statusLastChangedAt":"2024-05-02T08:21:54.003Z","hidden":false},{"_id":"6631b6fd351231c428032433","user":{"_id":"649f05367b57fab3a5b27c8b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/649f05367b57fab3a5b27c8b/UDJB4yqF2NmaRwCyTOfcl.jpeg","isPro":false,"fullname":"Yin Cui","user":"richardaecn","type":"user"},"name":"Yin Cui","status":"admin_assigned","statusLastChangedAt":"2024-05-02T08:21:59.307Z","hidden":false}],"publishedAt":"2024-04-30T17:55:27.000Z","submittedOnDailyAt":"2024-05-01T01:59:03.075Z","title":"Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"Existing automatic captioning methods for visual content face challenges such\nas lack of detail, content hallucination, and poor instruction following. In\nthis work, we propose VisualFactChecker (VFC), a flexible training-free\npipeline that generates high-fidelity and detailed captions for both 2D images\nand 3D objects. VFC consists of three steps: 1) proposal, where image-to-text\ncaptioning models propose multiple initial captions; 2) verification, where a\nlarge language model (LLM) utilizes tools such as object detection and VQA\nmodels to fact-check proposed captions; 3) captioning, where an LLM generates\nthe final caption by summarizing caption proposals and the fact check\nverification results. 
In this step, VFC can flexibly generate captions in\nvarious styles following complex instructions. We conduct comprehensive\ncaptioning evaluations using four metrics: 1) CLIP-Score for image-text\nsimilarity; 2) CLIP-Image-Score for measuring the image-image similarity\nbetween the original and the reconstructed image generated by a text-to-image\nmodel using the caption. 3) human study on Amazon Mechanical Turk; 4) GPT-4V\nfor fine-grained evaluation. Evaluation results show that VFC outperforms\nstate-of-the-art open-sourced captioning methods for 2D images on the COCO\ndataset and 3D assets on the Objaverse dataset. Our study demonstrates that by\ncombining open-source models into a pipeline, we can attain captioning\ncapability comparable to proprietary models such as GPT-4V, despite being over\n10x smaller in model size.","upvotes":24,"discussionId":"6631b6ff351231c428032479","ai_summary":"VisualFactChecker improves caption accuracy and detail through a three-step pipeline combining image-to-text models, fact-checking with large language models, and final summarization, outperforming existing methods on multiple datasets.","ai_keywords":["image-to-text captioning models","large language model","object detection","VQA models","image-text similarity","CLIP-Score","CLIP-Image-Score","text-to-image model","human study","Amazon Mechanical Turk","GPT-4V","COCO dataset","Objaverse dataset"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"6362ddb7d3be91534c30bfd6","avatarUrl":"/avatars/dac76ebd3b8a08099497ec0b0524bc7c.svg","isPro":false,"fullname":"Art Atk","user":"ArtAtk","type":"user"},{"_id":"635cada2c017767a629db012","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1667018139063-noauth.jpeg","isPro":false,"fullname":"Ojasvi Singh Yadav","user":"ojasvisingh786","type":"user"},{"_id":"6538119803519fddb4a17e10","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6538119803519fddb4a17e10/ffJMkdx-rM7VvLTCM6ri_.jpeg","isPro":false,"fullname":"samusenps","user":"samusenps","type":"user"},{"_id":"655ac762cb17ec19ef82719b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/655ac762cb17ec19ef82719b/1kDncYrGLYS_2SR8cNdAL.png","isPro":false,"fullname":"Welcome to matlok","user":"matlok","type":"user"},{"_id":"63c5d43ae2804cb2407e4d43","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1673909278097-noauth.png","isPro":false,"fullname":"xziayro","user":"xziayro","type":"user"},{"_id":"6555125a4f361968f0e3aad7","avatarUrl":"/avatars/e7692d82804338f21ecdc6e731f5c5ea.svg","isPro":false,"fullname":"marinaretikof","user":"marinaretik","type":"user"},{"_id":"6248edbd483bb50494c8ac62","avatarUrl":"/avatars/58e8e4f303b644a2c35fd3a5e1fe54dc.svg","isPro":false,"fullname":"sbywqlq","user":"sbywqlq","type":"user"},{"_id":"6303d6647b50dd9d0a35d322","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1667928526189-6303d6647b50dd9d0a35d322.jpeg","isPro":false,"fullname":"Alain 
Bellon","user":"NeuroDynamics","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"6580b7f7cc33cac19340c089","avatarUrl":"/avatars/b17f9f6de1707c919e1f7bac1915e173.svg","isPro":false,"fullname":"Samuel Heron Steinmetz","user":"ssteinmetz22","type":"user"},{"_id":"65139afdf60393414a10badb","avatarUrl":"/avatars/c0f60da8af86377e42356751a891213f.svg","isPro":false,"fullname":"Vlad Boyko","user":"VladBoyko","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary

VisualFactChecker improves caption accuracy and detail through a three-step pipeline combining image-to-text models, fact-checking with large language models, and final summarization, outperforming existing methods on multiple datasets.

Abstract
Existing automatic captioning methods for visual content face challenges such
as lack of detail, content hallucination, and poor instruction following. In
this work, we propose VisualFactChecker (VFC), a flexible training-free
pipeline that generates high-fidelity and detailed captions for both 2D images
and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text
captioning models propose multiple initial captions; 2) verification, where a
large language model (LLM) utilizes tools such as object detection and VQA
models to fact-check proposed captions; 3) captioning, where an LLM generates
the final caption by summarizing the caption proposals and the fact-checking
results. In this step, VFC can flexibly generate captions in
various styles following complex instructions. We conduct comprehensive
captioning evaluations using four metrics: 1) CLIP-Score for image-text
similarity; 2) CLIP-Image-Score for measuring the image-image similarity
between the original image and the image reconstructed from the caption by a
text-to-image model; 3) a human study on Amazon Mechanical Turk; 4) GPT-4V
for fine-grained evaluation. The results show that VFC outperforms
state-of-the-art open-source captioning methods for 2D images on the COCO
dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by
combining open-source models into a pipeline, we can attain captioning
capability comparable to proprietary models such as GPT-4V, despite the
pipeline being over 10x smaller in total model size.
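
The paper page does not include reference code, so the following is only a minimal sketch of the three-step pipeline (proposal, verification, captioning) assembled from off-the-shelf Hugging Face pipelines. The specific checkpoints, the detection confidence threshold, and the prompt wording are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of the VFC three-step pipeline: proposal -> verification -> captioning.
# Checkpoints, threshold, and prompts below are assumptions, not the paper's exact setup.
from transformers import pipeline

# Step 1 (proposal): two open captioners propose initial captions.
captioner_a = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")
captioner_b = pipeline("image-to-text", model="microsoft/git-large-coco")

# Tools used during verification: object detection and VQA.
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

# Step 3 (captioning): an instruction-tuned LLM summarizes everything into the final caption.
llm = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")


def vfc_caption(image, instruction="Write one detailed, factual caption."):
    # 1) Proposal: collect candidate captions from the captioning models.
    proposals = [captioner_a(image)[0]["generated_text"],
                 captioner_b(image)[0]["generated_text"]]

    # 2) Verification: ground the proposals with detected objects and simple VQA checks.
    detected = sorted({d["label"] for d in detector(image) if d["score"] > 0.7})
    checks = [(obj, vqa(image=image, question=f"Is there a {obj} in the image?")[0]["answer"])
              for obj in detected]

    # 3) Captioning: the LLM merges proposals and fact-check evidence, following the instruction.
    prompt = (f"{instruction}\nCandidate captions: {proposals}\n"
              f"Detected objects: {detected}\nVQA checks: {checks}\nFinal caption:")
    return llm(prompt, max_new_tokens=256, return_full_text=False)[0]["generated_text"]
```

In the paper the verification step is LLM-driven (the LLM decides which questions to ask the detection and VQA tools); the loop above hard-codes one existence check per detected object purely to keep the sketch short.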
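The two automatic metrics can likewise be approximated with CLIP embeddings: CLIP-Score as image-text cosine similarity, and CLIP-Image-Score as the similarity between the original image and an image re-generated from the caption by a text-to-image model. The checkpoints below (openai/clip-vit-large-patch14, stabilityai/stable-diffusion-2-1) and the plain cosine formulation are assumptions; the paper's exact scaling may differ.

```python
# Sketch of the two automatic metrics: CLIP-Score (image-text similarity) and
# CLIP-Image-Score (original image vs. image reconstructed from the caption).
# Checkpoints and the plain cosine formulation are assumptions.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
# Any text-to-image model works for the reconstruction step; SD 2.1 is just a placeholder.
t2i = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")


@torch.no_grad()
def clip_score(image, caption):
    """Cosine similarity between CLIP image and text embeddings (text is truncated to 77 tokens)."""
    inputs = processor(text=[caption], images=[image], return_tensors="pt",
                       padding=True, truncation=True)
    img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()


@torch.no_grad()
def clip_image_score(image, caption):
    """Cosine similarity between the original image and a caption-conditioned reconstruction."""
    recon = t2i(caption).images[0]  # text-to-image reconstruction from the caption
    inputs = processor(images=[image, recon], return_tensors="pt")
    emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(emb[0:1], emb[1:2]).item()
```

Note that CLIP's 77-token text limit means very detailed captions are truncated by `clip_score`, which is one reason the paper pairs it with CLIP-Image-Score, the human study, and GPT-4V evaluation.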