\n","updatedAt":"2024-06-09T04:01:17.773Z","author":{"_id":"6186ddf6a7717cb375090c01","avatarUrl":"/avatars/716b6a7d1094c8036b2a8a7b9063e8aa.svg","fullname":"Julien BLANCHON","name":"blanchon","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":142}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.5024954676628113},"editors":["blanchon"],"editorAvatarUrls":["/avatars/716b6a7d1094c8036b2a8a7b9063e8aa.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2403.14520","authors":[{"_id":"65fd0a404d36be78e694bd0d","user":{"_id":"646dbbc8075bbcc48ddcecbf","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/646dbbc8075bbcc48ddcecbf/V52Em-78O5F3QxRbRwG5O.jpeg","isPro":false,"fullname":"Han Zhao","user":"han1997","type":"user"},"name":"Han Zhao","status":"claimed_verified","statusLastChangedAt":"2024-03-22T13:24:59.524Z","hidden":false},{"_id":"65fd0a404d36be78e694bd0e","name":"Min Zhang","hidden":false},{"_id":"65fd0a404d36be78e694bd0f","name":"Wei Zhao","hidden":false},{"_id":"65fd0a404d36be78e694bd10","name":"Pengxiang Ding","hidden":false},{"_id":"65fd0a404d36be78e694bd11","user":{"_id":"65fd82762bf2cd20ddaa193f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/yBYbWp_mT7UusYdkqtAvw.png","isPro":false,"fullname":"Siteng Huang","user":"huangsiteng","type":"user"},"name":"Siteng Huang","status":"claimed_verified","statusLastChangedAt":"2024-03-22T13:25:01.255Z","hidden":false},{"_id":"65fd0a404d36be78e694bd12","name":"Donglin Wang","hidden":false}],"publishedAt":"2024-03-21T16:17:57.000Z","submittedOnDailyAt":"2024-03-22T03:04:09.151Z","title":"Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient\n Inference","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"In recent years, the application of multimodal large language models (MLLM)\nin various fields has achieved remarkable success. However, as the foundation\nmodel for many downstream tasks, current MLLMs are composed of the well-known\nTransformer network, which has a less efficient quadratic computation\ncomplexity. To improve the efficiency of such basic models, we propose Cobra, a\nlinear computational complexity MLLM. Specifically, Cobra integrates the\nefficient Mamba language model into the visual modality. Moreover, we explore\nand study various modal fusion schemes to create an effective multi-modal\nMamba. Extensive experiments demonstrate that (1) Cobra achieves extremely\ncompetitive performance with current computationally efficient state-of-the-art\nmethods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and has faster\nspeed due to Cobra's linear sequential modeling. (2) Interestingly, the results\nof closed-set challenging prediction benchmarks show that Cobra performs well\nin overcoming visual illusions and spatial relationship judgments. (3) Notably,\nCobra even achieves comparable performance to LLaVA with about 43% of the\nnumber of parameters. We will make all codes of Cobra open-source and hope that\nthe proposed method can facilitate future research on complexity problems in\nMLLM. 
Our project page is available at: https://sites.google.com/view/cobravlm.","upvotes":35,"discussionId":"65fd0a414d36be78e694bd2a","ai_summary":"Cobra, a linear-complexity multimodal large language model, integrates the Mamba language model with visual modality, achieving competitive performance and faster inference speed compared to existing state-of-the-art models.","ai_keywords":["multimodal large language models","MLLMs","Transformer network","linear computational complexity","Mamba language model","modal fusion schemes","closed-set prediction benchmarks","visual illusions","spatial relationship judgments","parameter-efficient fine-tuning"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6538119803519fddb4a17e10","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6538119803519fddb4a17e10/ffJMkdx-rM7VvLTCM6ri_.jpeg","isPro":false,"fullname":"samusenps","user":"samusenps","type":"user"},{"_id":"655ac762cb17ec19ef82719b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/655ac762cb17ec19ef82719b/1kDncYrGLYS_2SR8cNdAL.png","isPro":false,"fullname":"Welcome to matlok","user":"matlok","type":"user"},{"_id":"63869d1e81fe8c678a3a9422","avatarUrl":"/avatars/3bb8728057fa2ba0e24f5ceb1600068d.svg","isPro":true,"fullname":"Zach Mustafa","user":"Zmu","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"657152eb12f162153b50ec9d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/657152eb12f162153b50ec9d/qnldHP35PclV0pDz_05q8.jpeg","isPro":false,"fullname":"Byung-Kwan Lee","user":"BK-Lee","type":"user"},{"_id":"635f9fd1ae7144a6674c839b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1667211208219-noauth.jpeg","isPro":false,"fullname":"Marcus Gawronsky","user":"marcusinthesky","type":"user"},{"_id":"65fd82762bf2cd20ddaa193f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/yBYbWp_mT7UusYdkqtAvw.png","isPro":false,"fullname":"Siteng Huang","user":"huangsiteng","type":"user"},{"_id":"646dbbc8075bbcc48ddcecbf","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/646dbbc8075bbcc48ddcecbf/V52Em-78O5F3QxRbRwG5O.jpeg","isPro":false,"fullname":"Han Zhao","user":"han1997","type":"user"},{"_id":"6065a9cbe43e52694178ed78","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6065a9cbe43e52694178ed78/TwnrOfTacrwtLnBC8cS8e.jpeg","isPro":false,"fullname":"Emanuele Vivoli","user":"emanuelevivoli","type":"user"},{"_id":"65646b22ac9d3c2bd7b14788","avatarUrl":"/avatars/0bf19dcfa568a694361fb3a63b999997.svg","isPro":false,"fullname":"Juhwan Choi","user":"c-juhwan","type":"user"},{"_id":"620c35eece371f5bad535d6e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1669407156872-620c35eece371f5bad535d6e.jpeg","isPro":true,"fullname":"Andrew Pouliot","user":"darknoon","type":"user"},{"_id":"63c1b96770b05b9663757e08","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1673640235935-noauth.png","isPro":false,"fullname":"Brent Moreno","user":"Aideations","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":3}">
AI-generated summary
Cobra, a linear-complexity multimodal large language model, integrates the Mamba language model with the visual modality, achieving competitive performance and faster inference than existing state-of-the-art models.
In recent years, the application of multimodal large language models (MLLMs) in various fields has achieved remarkable success. However, as the foundation model for many downstream tasks, current MLLMs are built on the well-known Transformer network, whose attention has quadratic computational complexity and is therefore less efficient. To improve the efficiency of such foundation models, we propose Cobra, an MLLM with linear computational complexity. Specifically, Cobra integrates the efficient Mamba language model with the visual modality. Moreover, we explore and study various modal fusion schemes to create an effective multi-modal Mamba. Extensive experiments demonstrate that (1) Cobra achieves highly competitive performance against current computationally efficient state-of-the-art methods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and runs faster thanks to its linear sequence modeling. (2) Interestingly, results on challenging closed-set prediction benchmarks show that Cobra performs well at overcoming visual illusions and judging spatial relationships. (3) Notably, Cobra even achieves performance comparable to LLaVA with about 43% of the parameters. We will open-source all code for Cobra and hope that the proposed method can facilitate future research on complexity problems in MLLMs. Our project page is available at: https://sites.google.com/view/cobravlm.
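To make the fusion idea concrete, below is a minimal sketch of the pattern the abstract describes: vision-encoder features are projected into the language model's embedding space, prepended to the text token embeddings, and the combined sequence is processed by a linear-time (Mamba-style) backbone. All module names, dimensions, and the stub backbone are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class MambaBackboneStub(nn.Module):
    """Stand-in for a stack of Mamba (selective SSM) blocks.

    A real model would use linear-time SSM blocks (e.g. from the
    mamba-ssm package); a per-token MLP keeps this sketch
    self-contained and runnable.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, d_model)
        return self.mlp(x)


class CobraLikeVLM(nn.Module):
    """Hypothetical multimodal Mamba wiring: project, concatenate, model."""

    def __init__(self, vision_dim: int = 1024, d_model: int = 2048, vocab_size: int = 32000):
        super().__init__()
        # Maps vision-encoder patch features into the LM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.backbone = MambaBackboneStub(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, vision_feats: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # vision_feats: (B, N_patches, vision_dim); input_ids: (B, L_text)
        img_tokens = self.projector(vision_feats)          # (B, N_patches, d_model)
        txt_tokens = self.tok_emb(input_ids)               # (B, L_text, d_model)
        seq = torch.cat([img_tokens, txt_tokens], dim=1)   # prepend image tokens to text
        hidden = self.backbone(seq)                        # linear-time sequence modeling
        return self.lm_head(hidden)                        # (B, N_patches + L_text, vocab_size)


# Tiny smoke test with random inputs.
model = CobraLikeVLM()
logits = model(torch.randn(2, 16, 1024), torch.randint(0, 32000, (2, 8)))
print(logits.shape)  # torch.Size([2, 24, 32000])
```

Because a Mamba-style backbone scans the sequence in linear time, growing the number of prepended image tokens increases cost linearly rather than quadratically as it would under Transformer self-attention, which is the efficiency argument the abstract makes.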