Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
Paper: arXiv 2505.02471 (published May 5, 2025)
Authors: Biao Gong, Cheng Zou, Dandan Zheng, Hu Yu, Jingdong Chen, Jianxin Sun, Junbo Zhao, Jun Zhou, Kaixiang Ji, Lixiang Ru, Libin Wang, Qingpei Guo, Rui Liu, Weilong Chai, Xinyu Xiao, Ziyuan Huang
Project page / code: https://github.com/inclusionAI/Ming/tree/main/Ming-unify
AI-generated summary
Ming-Lite-Uni, an open-source multimodal framework, integrates vision and language using a unified visual generator and a native multimodal autoregressive model, and demonstrates strong performance in text-to-image generation and instruction-based image editing.
Abstract
We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a
newly designed unified visual generator and a native multimodal autoregressive
model tailored for unifying vision and language. Specifically, this project
provides an open-source implementation of the integrated MetaQueries and
M2-omni framework, while introducing novel multi-scale learnable tokens and a
multi-scale representation alignment strategy. By leveraging a fixed MLLM and a
learnable diffusion model, Ming-Lite-Uni enables native multimodal AR models to
perform both text-to-image generation and instruction-based image editing
tasks, expanding their capabilities beyond pure visual understanding. Our
experimental results demonstrate the strong performance of Ming-Lite-Uni and
illustrate the impressively fluid nature of its interactive process. All code and
model weights are open-sourced to foster further exploration within the
community. Notably, this work aligns with concurrent multimodal AI milestones,
such as ChatGPT-4o's native image generation released on March 25, 2025,
underscoring the broader significance of unified models like Ming-Lite-Uni on
the path toward AGI. Ming-Lite-Uni is currently in an alpha stage and will soon
be further refined.
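
To make the described design more concrete, below is a minimal, hypothetical PyTorch sketch of the architecture the abstract outlines: a fixed (frozen) MLLM supplies conditioning context, multi-scale learnable tokens query that context, and a learnable diffusion model consumes the resulting conditioning for text-to-image generation. All class names, shapes, and interfaces here are illustrative assumptions, not the actual Ming-Lite-Uni implementation (see the repository linked above for that).

```python
# Hypothetical sketch (PyTorch) of the pipeline described in the abstract:
# a frozen MLLM provides conditioning, multi-scale learnable query tokens
# read from it, and a trainable diffusion decoder generates the image.
# All names, dimensions, and interfaces are illustrative assumptions.
import torch
import torch.nn as nn


class MultiScaleQueryTokens(nn.Module):
    """Learnable query tokens at several spatial scales (e.g. 4x4, 8x8, 16x16)."""

    def __init__(self, scales=(4, 8, 16), dim=1024):
        super().__init__()
        self.queries = nn.ParameterList(
            [nn.Parameter(torch.randn(s * s, dim) * 0.02) for s in scales]
        )

    def forward(self, batch_size):
        # Concatenate all scales into one query sequence per sample.
        q = torch.cat(list(self.queries), dim=0)          # (sum(s*s), dim)
        return q.unsqueeze(0).expand(batch_size, -1, -1)  # (B, N, dim)


class UnifiedGenerator(nn.Module):
    """Fixed MLLM -> multi-scale learnable queries -> learnable diffusion decoder."""

    def __init__(self, mllm, diffusion, dim=1024):
        super().__init__()
        self.mllm = mllm.eval()                 # understanding backbone, kept frozen
        for p in self.mllm.parameters():
            p.requires_grad_(False)
        self.queries = MultiScaleQueryTokens(dim=dim)
        self.connector = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.diffusion = diffusion              # trainable image generator

    def forward(self, text_tokens, noisy_latents, timesteps):
        with torch.no_grad():
            ctx = self.mllm(text_tokens)        # (B, T, dim) hidden states
        q = self.queries(text_tokens.shape[0])
        cond = self.connector(q, ctx)           # queries cross-attend to MLLM context
        # The diffusion model predicts noise conditioned on the query tokens.
        return self.diffusion(noisy_latents, timesteps, cond)
```

Instruction-based editing fits the same interface: the source image and the edit instruction pass through the MLLM, so the conditioning tokens carry both, and only the diffusion side (plus the small connector and queries) requires gradient updates.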
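
The multi-scale representation alignment strategy mentioned in the abstract can be pictured as an auxiliary loss that pulls intermediate generator features toward the frozen MLLM's features at matching scales. The sketch below is again a hedged assumption about the general idea, not the paper's exact formulation; `proj`, the feature lists, and their shapes are hypothetical.

```python
import torch.nn.functional as F


def alignment_loss(mllm_feats, gen_feats, proj):
    """Auxiliary multi-scale alignment: one cosine term per scale.

    mllm_feats, gen_feats: lists of (B, N_s, dim) tensors, one entry per scale,
    assumed to be spatially matched. `proj` is a small trainable head mapping
    generator features into the MLLM feature space.
    """
    total = 0.0
    for f_m, f_g in zip(mllm_feats, gen_feats):
        target = f_m.detach()  # the frozen MLLM side provides the target
        total = total + (1.0 - F.cosine_similarity(proj(f_g), target, dim=-1)).mean()
    return total / len(mllm_feats)
```

In training, such a term would simply be added to the standard diffusion objective with a small weight, leaving the MLLM untouched.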