\n","updatedAt":"2025-09-18T18:06:40.060Z","author":{"_id":"67ff950b9e4824de182acf83","avatarUrl":"/avatars/d8bae10afdd7457a08ec9554f79c7429.svg","fullname":"Elman Ghazaei","name":"ElmanGhazaei","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6598848104476929},"editors":["ElmanGhazaei"],"editorAvatarUrls":["/avatars/d8bae10afdd7457a08ec9554f79c7429.svg"],"reactions":[{"reaction":"❤️","users":["lyan62"],"count":1}],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2509.11986","authors":[{"_id":"68c8d7dd733e345e52ac1edd","user":{"_id":"619b506f70d03780cbec5806","avatarUrl":"/avatars/ea2b0b8f0a3eb16d53ef40da9981c397.svg","isPro":false,"fullname":"wenyan li","user":"lyan62","type":"user"},"name":"Wenyan Li","status":"claimed_verified","statusLastChangedAt":"2025-09-16T09:41:46.176Z","hidden":false},{"_id":"68c8d7dd733e345e52ac1ede","user":{"_id":"63250bb8d206fe7b2d2f1b8b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1668053024087-63250bb8d206fe7b2d2f1b8b.jpeg","isPro":false,"fullname":"Raphael Tang","user":"tetrisd","type":"user"},"name":"Raphael Tang","status":"claimed_verified","statusLastChangedAt":"2025-09-22T02:17:51.437Z","hidden":false},{"_id":"68c8d7dd733e345e52ac1edf","name":"Chengzu Li","hidden":false},{"_id":"68c8d7dd733e345e52ac1ee0","name":"Caiqi Zhang","hidden":false},{"_id":"68c8d7dd733e345e52ac1ee1","name":"Ivan Vulić","hidden":false},{"_id":"68c8d7dd733e345e52ac1ee2","name":"Anders Søgaard","hidden":false}],"publishedAt":"2025-09-15T14:38:06.000Z","submittedOnDailyAt":"2025-09-16T01:52:58.573Z","title":"Lost in Embeddings: Information Loss in Vision-Language Models","submittedOnDailyBy":{"_id":"619b506f70d03780cbec5806","avatarUrl":"/avatars/ea2b0b8f0a3eb16d53ef40da9981c397.svg","isPro":false,"fullname":"wenyan li","user":"lyan62","type":"user"},"summary":"Vision--language models (VLMs) often process visual inputs through a\npretrained vision encoder, followed by a projection into the language model's\nembedding space via a connector component. While crucial for modality fusion,\nthe potential information loss induced by this projection step and its direct\nimpact on model capabilities remain understudied. We introduce two\ncomplementary approaches to examine and quantify this loss by analyzing the\nlatent representation space. First, we evaluate semantic information\npreservation by analyzing changes in k-nearest neighbor relationships between\nimage representations, before and after projection. Second, we directly measure\ninformation loss by reconstructing visual embeddings from the projected\nrepresentation, localizing loss at an image patch level. Experiments reveal\nthat connectors substantially distort the local geometry of visual\nrepresentations, with k-nearest neighbors diverging by 40--60\\%\npost-projection, correlating with degradation in retrieval performance. 
The\npatch-level embedding reconstruction provides interpretable insights for model\nbehavior on visually grounded question-answering tasks, finding that areas of\nhigh information loss reliably predict instances where models struggle.","upvotes":25,"discussionId":"68c8d7dd733e345e52ac1ee3","githubRepo":"https://github.com/lyan62/vlm-info-loss","ai_summary":"Two approaches are introduced to analyze and quantify information loss in vision-language models during the projection of visual inputs into the language model's embedding space, revealing significant distortions and their impact on model performance.","ai_keywords":["vision--language models","pretrained vision encoder","connector component","modality fusion","latent representation space","semantic information preservation","k-nearest neighbor relationships","visual embeddings","patch-level embedding reconstruction","visually grounded question-answering tasks"],"githubStars":14},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6407e5294edf9f5c4fd32228","avatarUrl":"/avatars/8e2d55460e9fe9c426eb552baf4b2cb0.svg","isPro":false,"fullname":"Stoney Kang","user":"sikang99","type":"user"},{"_id":"650e9b0288cdfe73a8575923","avatarUrl":"/avatars/0fc7fcd0776f63ea5f50a310e7def2f5.svg","isPro":false,"fullname":"Chengzu Li","user":"chengzu","type":"user"},{"_id":"644662145004f2cb3af08b27","avatarUrl":"/avatars/5f2af24c7410a5db46374d0b84fb479d.svg","isPro":false,"fullname":"Avishai Elmakies","user":"avishai-elmakies","type":"user"},{"_id":"650871397e0d56c27141e6e2","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/1Rr_skag3jTgBTA9XmGy3.jpeg","isPro":false,"fullname":"Sayambhu Sen","user":"Testerpce","type":"user"},{"_id":"624ac233c04d55ec0f42b11e","avatarUrl":"/avatars/58a9abce945e71a65abc8a54085de6d7.svg","isPro":false,"fullname":"oh sehun","user":"sehun","type":"user"},{"_id":"65e34107ad42606f6f977195","avatarUrl":"/avatars/f9149411166dfe298730eb27013f36f7.svg","isPro":false,"fullname":"pavan kumar avn","user":"pk3388","type":"user"},{"_id":"6378fcde667ed02bb915cdc3","avatarUrl":"/avatars/69a7abaee92f4e53bf722f8a0833b2b1.svg","isPro":false,"fullname":"Vaibhav Singh","user":"veb-101","type":"user"},{"_id":"6438f6e6e1acfc375c68f330","avatarUrl":"/avatars/08e9df173b84b8147e0c42d7915b5075.svg","isPro":false,"fullname":"Guillermo FIgueroa","user":"mdmev","type":"user"},{"_id":"64d4615cf8082bf19b916492","avatarUrl":"/avatars/8e1b59565ec5e4b31090cf1b911781b9.svg","isPro":false,"fullname":"wongyukim","user":"wongyukim","type":"user"},{"_id":"633cb69eccce04161f870676","avatarUrl":"/avatars/05c01bfa9d1e15ca351ef23ea0d8e7a1.svg","isPro":false,"fullname":"costa shafr","user":"costico","type":"user"},{"_id":"65c743aabbc318a59ece0871","avatarUrl":"/avatars/b70e85cb76da5403bf7f84d6a660f317.svg","isPro":false,"fullname":"Akash Manna","user":"Zemansky","type":"user"},{"_id":"6679ffdc13c37a0fe47ee21f","avatarUrl":"/avatars/667b29a4781a37a2cd66a0ab1f15bcb2.svg","isPro":false,"fullname":"Nitin Yadav","user":"ntnydv","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">Lost in Embeddings: Information Loss in Vision-Language Models
Abstract
Two approaches are introduced to analyze and quantify information loss in vision-language models during the projection of visual inputs into the language model's embedding space, revealing significant distortions and their impact on model performance.
Vision-language models (VLMs) often process visual inputs through a pretrained vision encoder, followed by a projection into the language model's embedding space via a connector component. While crucial for modality fusion, the potential information loss induced by this projection step and its direct impact on model capabilities remain understudied. We introduce two complementary approaches to examine and quantify this loss by analyzing the latent representation space. First, we evaluate semantic information preservation by analyzing changes in k-nearest neighbor relationships between image representations before and after projection. Second, we directly measure information loss by reconstructing visual embeddings from the projected representation, localizing loss at an image patch level. Experiments reveal that connectors substantially distort the local geometry of visual representations, with k-nearest neighbors diverging by 40-60% post-projection, correlating with degradation in retrieval performance. The patch-level embedding reconstruction provides interpretable insights for model behavior on visually grounded question-answering tasks, finding that areas of high information loss reliably predict instances where models struggle.
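To make the two analyses concrete, here is a minimal PyTorch sketch of how the k-nearest-neighbor overlap and the patch-level reconstruction probe could be computed. The function names (`knn_overlap`, `patch_reconstruction_error`), the probe architecture, and the random tensors standing in for encoder/connector outputs are illustrative assumptions, not code from the paper's repository (https://github.com/lyan62/vlm-info-loss).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def knn_overlap(pre: torch.Tensor, post: torch.Tensor, k: int = 10) -> float:
    """Fraction of k-nearest neighbors shared before vs. after projection.

    pre:  (N, D_v) image embeddings from the vision encoder.
    post: (N, D_t) the same images after the connector projection.
    """
    def knn_ids(x: torch.Tensor) -> torch.Tensor:
        x = F.normalize(x, dim=-1)
        sim = x @ x.T
        sim.fill_diagonal_(-float("inf"))   # exclude self-matches
        return sim.topk(k, dim=-1).indices  # (N, k) neighbor indices

    ids_pre, ids_post = knn_ids(pre), knn_ids(post)
    shared = [
        len(set(a.tolist()) & set(b.tolist())) / k
        for a, b in zip(ids_pre, ids_post)
    ]
    return sum(shared) / len(shared)


def patch_reconstruction_error(vision_feats: torch.Tensor,
                               projected_feats: torch.Tensor,
                               steps: int = 200) -> torch.Tensor:
    """Fit a small probe mapping projected patch embeddings back to the
    original vision-encoder patch embeddings; per-patch MSE localizes the
    information the connector failed to preserve."""
    probe = nn.Sequential(
        nn.Linear(projected_feats.shape[-1], 1024),
        nn.GELU(),
        nn.Linear(1024, vision_feats.shape[-1]),
    )
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(probe(projected_feats), vision_feats)
        loss.backward()
        opt.step()
    with torch.no_grad():
        per_patch = (probe(projected_feats) - vision_feats).pow(2).mean(dim=-1)
    return per_patch  # (N_patches,) reconstruction-error map


# Toy usage with random tensors standing in for real encoder/connector outputs.
pre = torch.randn(256, 1024)    # hypothetical vision-encoder embeddings
post = torch.randn(256, 4096)   # hypothetical post-connector embeddings
print(f"mean k-NN overlap: {knn_overlap(pre, post):.2f}")
```

Under this reading, a low k-NN overlap corresponds to the 40-60% neighbor divergence reported in the abstract, and patches with high reconstruction error would flag the regions where the connector discards visual detail.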
Community
EMNLP 2025 Findings paper on visual information loss in VLMs
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Dynamic Embedding of Hierarchical Visual Features for Efficient Vision-Language Fine-Tuning (2025)
- Visual Representation Alignment for Multimodal Large Language Models (2025)
- Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping (2025)
- BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models (2025)
- Optimizing Vision-Language Consistency via Cross-Layer Regional Attention Alignment (2025)
- BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion (2025)
- FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Nice paper 😃