CoLLaVO: Crayon Large Language and Vision mOdel
Abstract
The study proposes CoLLaVO, a Vision Language Model enhanced with crayon prompt tuning and Dual QLoRA to improve object-level image understanding and zero-shot performance on Vision Language tasks.
The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on Vision Language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with crayon prompt as a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap on numerous zero-shot VL benchmarks.
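To make the Dual QLoRA idea more concrete, the sketch below shows how two QLoRA adapters might be attached to a single 4-bit-quantized backbone and trained in separate phases, so that one adapter's knowledge is not overwritten by the other. This is only an illustrative sketch using the Hugging Face transformers/peft APIs; the backbone name, target modules, adapter names, and hyperparameters are placeholders, not the authors' actual configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization of the frozen backbone (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Placeholder backbone; the actual CoLLaVO base model may differ.
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_cfg = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # placeholder target modules
    task_type="CAUSAL_LM",
)

# Adapter 1: tuned on object-level (crayon prompt) instructions.
model = get_peft_model(base, lora_cfg, adapter_name="crayon")
# Adapter 2: tuned on general visual instructions, sharing the same backbone.
model.add_adapter("visual", lora_cfg)

# Phase 1: only the "crayon" adapter is active and updated.
model.set_adapter("crayon")
# ... run object-level instruction tuning here ...

# Phase 2: switch to the "visual" adapter; the "crayon" adapter's weights stay
# untouched, which is the intuition behind preserving object-level understanding.
model.set_adapter("visual")
# ... run visual instruction tuning here ...
```

Keeping the first adapter frozen while the second one trains is, at a high level, how this kind of two-adapter setup can guard against forgetting during visual instruction tuning.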
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding (2024)
- GroundingGPT: Language Enhanced Multi-modal Grounding Model (2024)
- Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey (2023)
- Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study (2024)
- Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend
You can access the code of CoLLaVO-7B at https://github.com/ByungKwanLee/CoLLaVO
@BK-Lee would you like to host the model and the demo on Hugging Face?
Yes! I am preparing the code first and will then upload the hosted model to a Hugging Face Space. We are also preparing a follow-up large language and vision model for even stronger performance, so we plan to upload them simultaneously. Thanks for your interest!
The CoLLaVO-7B model has been released at https://huggingface.co/BK-Lee/CoLLaVO-7B!
@BK-Lee great initiative with the model card 🤩 looking forward to the demo!
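For readers who want to try the released checkpoint mentioned above, a minimal download sketch using the standard huggingface_hub API could look like the following; see the GitHub repository linked earlier for the authors' actual loading and inference code.

```python
from huggingface_hub import snapshot_download

# Fetch the released CoLLaVO-7B weights from the Hugging Face Hub.
local_dir = snapshot_download(repo_id="BK-Lee/CoLLaVO-7B")
print(f"Checkpoint downloaded to: {local_dir}")
```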
Models citing this paper 1
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper