arxiv:2509.16197

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Published on Sep 19
· Submitted by taesiri on Sep 22
#2 Paper of the day
Authors:
Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen
Abstract

Manzano is a unified multimodal LLM framework that integrates image and text processing using a hybrid tokenizer and diffusion decoder, achieving state-of-the-art performance in both understanding and generating visual content.

AI-generated summary

Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
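Since the abstract only sketches the architecture, here is a minimal PyTorch illustration of the hybrid-tokenizer idea it describes: one shared vision encoder feeding two lightweight adapters, with a continuous path for understanding and a quantized path for generation. The module sizes, the patchify encoder, and the nearest-neighbor codebook lookup are all illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of a hybrid vision tokenizer, assuming a VQ-style codebook
# for the discrete branch. All dimensions and layer choices are made up
# for illustration; the paper's actual backbone and quantizer may differ.
import torch
import torch.nn as nn


class HybridVisionTokenizer(nn.Module):
    def __init__(self, dim: int = 256, codebook_size: int = 1024):
        super().__init__()
        # Shared vision encoder (a simple patchify stand-in).
        self.encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Continuous adapter: embeddings for image-to-text understanding.
        self.cont_adapter = nn.Linear(dim, dim)
        # Discrete adapter: projection ahead of a codebook that yields
        # discrete image tokens for text-to-image generation.
        self.disc_adapter = nn.Linear(dim, dim)
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, images: torch.Tensor):
        # (B, 3, H, W) -> (B, dim, H/16, W/16) -> (B, N, dim)
        feats = self.encoder(images).flatten(2).transpose(1, 2)
        continuous = self.cont_adapter(feats)  # understanding path
        z = self.disc_adapter(feats)           # generation path
        # Nearest codebook entry per patch -> discrete token ids.
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        token_ids = dists.argmin(dim=-1)       # (B, N)
        return continuous, token_ids


tok = HybridVisionTokenizer()
images = torch.randn(2, 3, 64, 64)  # toy batch
cont, ids = tok(images)
print(cont.shape, ids.shape)  # torch.Size([2, 16, 256]) torch.Size([2, 16])
```

In the full model described by the abstract, an autoregressive LLM predicts these discrete tokens alongside text, and an auxiliary diffusion decoder translates them into pixels; both stages are omitted from this sketch.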

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Please cite Skywork UniPic as previous work.

Models citing this paper (0)

No model linking this paper

Cite arxiv.org/abs/2509.16197 in a model README.md to link it from this page.

Datasets citing this paper (0)

No dataset linking this paper

Cite arxiv.org/abs/2509.16197 in a dataset README.md to link it from this page.

Spaces citing this paper (0)

No Space linking this paper

Cite arxiv.org/abs/2509.16197 in a Space README.md to link it from this page.

Collections including this paper (12)
