MambaVision, a hybrid Mamba-Transformer backbone, enhances visual feature modeling with self-attention blocks, achieving state-of-the-art performance in image classification and downstream tasks.
AI-generated summary
We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision,
which is specifically tailored for vision applications. Our core contribution
includes redesigning the Mamba formulation to enhance its capability for
efficient modeling of visual features. In addition, we conduct a comprehensive
ablation study on the feasibility of integrating Vision Transformers (ViT) with
Mamba. Our results demonstrate that equipping the Mamba architecture with
several self-attention blocks at the final layers greatly improves the modeling
capacity to capture long-range spatial dependencies. Based on our findings, we
introduce a family of MambaVision models with a hierarchical architecture to
meet various design criteria. For image classification on the ImageNet-1K dataset,
MambaVision model variants achieve new state-of-the-art (SOTA) performance in
terms of Top-1 accuracy and image throughput. In downstream tasks such as
object detection, instance segmentation, and semantic segmentation on the MS COCO
and ADE20K datasets, MambaVision outperforms comparably sized backbones.
Code:
https://github.com/NVlabs/MambaVision.
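
The layout idea in the abstract (Mamba blocks with several self-attention blocks at the final layers) can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name `stage_layout`, its parameters, and the block labels `"mamba"` and `"attention"` are hypothetical, and the abstract does not state how many attention blocks each model uses, so the numbers in the usage line are placeholders.

```python
from typing import List


def stage_layout(depth: int, num_attention: int) -> List[str]:
    """Return the block types for one stage, with attention blocks placed last.

    Hypothetical helper: `depth` is the total number of blocks in the stage and
    `num_attention` is how many of the final blocks use self-attention.
    """
    if depth < 0 or not 0 <= num_attention <= depth:
        raise ValueError("need depth >= 0 and 0 <= num_attention <= depth")
    # Mamba-style mixer blocks come first; self-attention fills the final slots.
    return ["mamba"] * (depth - num_attention) + ["attention"] * num_attention


# Placeholder values chosen for illustration only.
layout = stage_layout(depth=8, num_attention=4)
print(layout)
```

Keeping the layout as plain data makes it easy to compare alternative placements (for example, attention first or interleaved) of the kind an ablation study would examine, before committing to real layer implementations.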