ztg-cv commented:

@librarian-bot recommend

librarian-bot replied:
This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation](https://huggingface.co/papers/2506.10395) (2025)
* [Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models](https://huggingface.co/papers/2506.02557) (2025)
* [UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings](https://huggingface.co/papers/2505.11815) (2025)
* [Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better](https://huggingface.co/papers/2506.09040) (2025)
* [Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM](https://huggingface.co/papers/2505.17726) (2025)
* [Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration](https://huggingface.co/papers/2505.21472) (2025)
* [FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens](https://huggingface.co/papers/2506.03096) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
\n","updatedAt":"2025-07-09T02:31:37.732Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6992214322090149},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false,"parentCommentId":"686dd292b98f1fb67a342179"}}]}],"primaryEmailConfirmed":false,"paper":{"id":"2503.19900","authors":[{"_id":"684ae218dbd21a9cc27b1018","name":"Hao Yu","hidden":false},{"_id":"684ae218dbd21a9cc27b1019","name":"Zhuokai Zhao","hidden":false},{"_id":"684ae218dbd21a9cc27b101a","name":"Shen Yan","hidden":false},{"_id":"684ae218dbd21a9cc27b101b","name":"Lukasz Korycki","hidden":false},{"_id":"684ae218dbd21a9cc27b101c","name":"Jianyu Wang","hidden":false},{"_id":"684ae218dbd21a9cc27b101d","name":"Baosheng He","hidden":false},{"_id":"684ae218dbd21a9cc27b101e","name":"Jiayi Liu","hidden":false},{"_id":"684ae218dbd21a9cc27b101f","name":"Lizhu Zhang","hidden":false},{"_id":"684ae218dbd21a9cc27b1020","name":"Xiangjun Fan","hidden":false},{"_id":"684ae218dbd21a9cc27b1021","name":"Hanchao Yu","hidden":false}],"publishedAt":"2025-03-25T17:57:17.000Z","title":"CAFe: Unifying Representation and Generation with\n Contrastive-Autoregressive Finetuning","summary":"The rapid advancement of large vision-language models (LVLMs) has driven\nsignificant progress in multimodal tasks, enabling models to interpret, reason,\nand generate outputs across both visual and textual domains. While excelling in\ngenerative tasks, existing LVLMs often face limitations in tasks requiring\nhigh-fidelity representation learning, such as generating image or text\nembeddings for retrieval. Recent work has proposed finetuning LVLMs for\nrepresentational learning, but the fine-tuned model often loses its generative\ncapabilities due to the representational learning training paradigm. To address\nthis trade-off, we introduce CAFe, a contrastive-autoregressive fine-tuning\nframework that enhances LVLMs for both representation and generative tasks. By\nintegrating a contrastive objective with autoregressive language modeling, our\napproach unifies these traditionally separate tasks, achieving state-of-the-art\nresults in both multimodal retrieval and multimodal generative benchmarks,\nincluding object hallucination (OH) mitigation. 
CAFe establishes a novel framework that synergizes embedding and generative functionalities in a single model, setting a foundation for future multimodal models that excel in both retrieval precision and coherent output generation.
AI summary: A contrastive-autoregressive fine-tuning framework, CAFe, enhances large vision-language models for both multimodal retrieval and generation, improving precision and coherence.
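The abstract's central mechanism, training one model under a contrastive embedding objective and the standard autoregressive next-token objective at the same time, can be illustrated with a short sketch. This is a minimal illustration only, assuming an InfoNCE-style symmetric contrastive term over pooled image/text embeddings and a weighted sum of the two losses; the function name `contrastive_autoregressive_loss` and the weight `lambda_c` are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_autoregressive_loss(image_emb, text_emb, logits, labels,
                                    temperature=0.07, lambda_c=1.0):
    """Hypothetical sketch of a joint contrastive + autoregressive objective.

    image_emb: (B, D) pooled image embeddings from the LVLM
    text_emb:  (B, D) pooled text embeddings (matched pairs on the diagonal)
    logits:    (B, T, V) next-token logits from the language-model head
    labels:    (B, T) target token ids, with -100 marking ignored positions
    """
    # Contrastive (InfoNCE-style) term: matched image/text pairs attract,
    # in-batch mismatches repel, symmetrized over both directions.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = image_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_con = (F.cross_entropy(sim, targets) +
                F.cross_entropy(sim.t(), targets)) / 2

    # Autoregressive term: standard next-token cross-entropy.
    loss_ar = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                              labels.reshape(-1), ignore_index=-100)

    # Weighted sum unifies representation learning and generation.
    return loss_ar + lambda_c * loss_con
```

In a real fine-tuning loop, the pooled embeddings would come from the LVLM's hidden states (for example, the last token's representation), and `lambda_c` would be tuned to balance retrieval quality against generation quality; both choices here are assumptions rather than details from the paper.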