https://github.com/NVlabs/MambaVision


Hi @ahatamiz, thanks for claiming the paper! Can't wait for the model release 🔥


Thank you @AdinaY. We are working on the HF release.


Hi @AdinaY, MambaVision models are now integrated into the Hugging Face library:
https://huggingface.co/collections/nvidia/mambavision-66943871a6b36c9e78b327d3

This is an automated message from the Librarian Bot (https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model (2024), https://huggingface.co/papers/2405.14174
* ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention (2024), https://huggingface.co/papers/2405.18425
* iiANET: Inception Inspired Attention Hybrid Network for efficient Long-Range Dependency (2024), https://huggingface.co/papers/2407.07603
* RepNeXt: A Fast Multi-Scale CNN using Structural Reparameterization (2024), https://huggingface.co/papers/2406.16004
* Mamba YOLO: SSMs-Based YOLO For Object Detection (2024), https://huggingface.co/papers/2406.05835

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend



arxiv:2407.08083

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

Published on Jul 10, 2024

Submitted by AK on Jul 12, 2024


Authors:

Ali Hatamizadeh, Jan Kautz

Abstract

AI-generated summary: MambaVision, a hybrid Mamba-Transformer backbone, enhances visual feature modeling with self-attention blocks, achieving state-of-the-art performance in image classification and downstream tasks.

We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. In addition, we conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results demonstrate that equipping the Mamba architecture with several self-attention blocks at the final layers greatly improves the modeling capacity to capture long-range spatial dependencies. Based on our findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For image classification on the ImageNet-1K dataset, MambaVision model variants achieve new state-of-the-art (SOTA) performance in terms of Top-1 accuracy and image throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones. Code: https://github.com/NVlabs/MambaVision.
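The layout idea in the abstract (Mamba-style blocks followed by a few self-attention blocks at the end, arranged in a hierarchy of stages) can be illustrated with a small pure-Python sketch. It only computes which block type sits at which depth and at which feature-map resolution; it does not implement Mamba or attention. The function names, the stem stride of 4, the halving of resolution per stage, and the example depths are all assumptions made for illustration and are not taken from the paper or the MambaVision repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    resolution: int    # side length of the (square) feature map in this stage
    blocks: List[str]  # block type of each layer, in execution order


def hybrid_layout(depth: int, num_attention: int) -> List[str]:
    """Return block types for one stage, with self-attention blocks placed last."""
    if not 0 <= num_attention <= depth:
        raise ValueError("num_attention must be between 0 and depth")
    return ["mamba"] * (depth - num_attention) + ["attention"] * num_attention


def build_stages(image_size: int, depths: List[int],
                 attention_per_stage: List[int]) -> List[Stage]:
    """Build a hierarchical layout where each stage halves the resolution."""
    if len(depths) != len(attention_per_stage):
        raise ValueError("depths and attention_per_stage must have equal length")
    resolution = image_size // 4  # assumption: the stem downsamples by 4
    stages = []
    for depth, num_attention in zip(depths, attention_per_stage):
        stages.append(Stage(resolution, hybrid_layout(depth, num_attention)))
        resolution //= 2  # assumption: 2x downsampling between stages
    return stages


# Hypothetical configuration, chosen only to make the printout readable.
stages = build_stages(224, depths=[2, 4, 8, 4], attention_per_stage=[0, 0, 4, 2])
for index, stage in enumerate(stages, start=1):
    print(f"stage {index}: {stage.resolution}x{stage.resolution} {stage.blocks}")
```

Keeping the layout separate from the block implementations makes it easy to vary how many attention blocks close each stage, which is the kind of comparison the ablation study in the abstract refers to.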