\n","updatedAt":"2024-10-23T01:32:59.951Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7366045117378235},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2410.13861","authors":[{"_id":"671715637892af397b9a93f3","user":{"_id":"65b8724123d948d884b379b1","avatarUrl":"/avatars/ce189d1d8d688c17912f9b869035b2d0.svg","isPro":false,"fullname":"Rongyao Fang","user":"LucasFang","type":"user"},"name":"Rongyao Fang","status":"claimed_verified","statusLastChangedAt":"2024-10-22T08:00:53.368Z","hidden":false},{"_id":"671715637892af397b9a93f4","user":{"_id":"64a2b496e2e19de17db7de65","avatarUrl":"/avatars/241448ca487833d6cc5d57bb1fdb6ee5.svg","isPro":false,"fullname":"Duan Chengqi","user":"gogoduan","type":"user"},"name":"Chengqi Duan","status":"claimed_verified","statusLastChangedAt":"2024-10-22T08:00:48.851Z","hidden":false},{"_id":"671715637892af397b9a93f5","name":"Kun Wang","hidden":false},{"_id":"671715637892af397b9a93f6","user":{"_id":"64b9033777ae61bcc80aa4f3","avatarUrl":"/avatars/408c335395c79f3df69fd9bf70abc312.svg","isPro":false,"fullname":"Hao Li","user":"cpsxhao","type":"user"},"name":"Hao Li","status":"claimed_verified","statusLastChangedAt":"2024-12-16T09:42:55.990Z","hidden":false},{"_id":"671715637892af397b9a93f7","name":"Hao Tian","hidden":false},{"_id":"671715637892af397b9a93f8","user":{"_id":"666d4a0fe70e5838d95aebee","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/6dkjoFA_sOjCkjvcvozZ5.jpeg","isPro":false,"fullname":"zengxingyu","user":"zengxingyu","type":"user"},"name":"Xingyu Zeng","status":"admin_assigned","statusLastChangedAt":"2024-10-22T08:32:52.126Z","hidden":false},{"_id":"671715637892af397b9a93f9","name":"Rui Zhao","hidden":false},{"_id":"671715637892af397b9a93fa","user":{"_id":"64686f7172d9180d4ac8b4e4","avatarUrl":"/avatars/db67dd6c4b2b41054ddcce5a18ade6f8.svg","isPro":false,"fullname":"Jifeng Dai","user":"daijifeng","type":"user"},"name":"Jifeng Dai","status":"admin_assigned","statusLastChangedAt":"2024-10-22T08:33:01.448Z","hidden":false},{"_id":"671715637892af397b9a93fb","user":{"_id":"65c04e9c27a5fdca81abcbd9","avatarUrl":"/avatars/12a155683c824fa23da4a9e2bed4f64e.svg","isPro":false,"fullname":"Hongsheng LI","user":"hsli-cuhk","type":"user"},"name":"Hongsheng Li","status":"admin_assigned","statusLastChangedAt":"2024-10-22T08:32:41.509Z","hidden":false},{"_id":"671715637892af397b9a93fc","user":{"_id":"65d5ec74cd05bc1eaa125040","avatarUrl":"/avatars/2de1b1539a86452c2c89570eeb02f5ab.svg","isPro":false,"fullname":"Xihui Liu","user":"XihuiLiu","type":"user"},"name":"Xihui Liu","status":"claimed_verified","statusLastChangedAt":"2024-10-22T08:00:51.171Z","hidden":false}],"publishedAt":"2024-10-17T17:59:57.000Z","submittedOnDailyAt":"2024-10-22T01:33:52.453Z","title":"PUMA: Empowering Unified MLLM with Multi-granular Visual Generation","submittedOnDailyBy":{"_id":"65b8724123d948d884b379b1","avatarUrl":"/avatars/ce189d1d8d688c17912f9b869035b2d0.svg","isPro":false,"fullname":"Rongyao Fang","user":"LucasFang","type":"user"},"summary":"Recent 
advancements in multimodal foundation models have yielded significant\nprogress in vision-language understanding. Initial attempts have also explored\nthe potential of multimodal large language models (MLLMs) for visual content\ngeneration. However, existing works have insufficiently addressed the varying\ngranularity demands of different image generation tasks within a unified MLLM\nparadigm - from the diversity required in text-to-image generation to the\nprecise controllability needed in image manipulation. In this work, we propose\nPUMA, emPowering Unified MLLM with Multi-grAnular visual generation. PUMA\nunifies multi-granular visual features as both inputs and outputs of MLLMs,\nelegantly addressing the different granularity requirements of various image\ngeneration tasks within a unified MLLM framework. Following multimodal\npretraining and task-specific instruction tuning, PUMA demonstrates proficiency\nin a wide range of multimodal tasks. This work represents a significant step\ntowards a truly unified MLLM capable of adapting to the granularity demands of\nvarious visual tasks. The code and model will be released in\nhttps://github.com/rongyaofang/PUMA.","upvotes":56,"discussionId":"671715667892af397b9a94ab","ai_summary":"PUMA, a unified multimodal large language model, addresses varying granularity demands in visual content generation tasks through the integration of multi-granular visual features.","ai_keywords":["multimodal foundation models","MLLMs","text-to-image generation","image manipulation","multi-granular visual generation","multimodal pretraining","task-specific instruction tuning"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"65b8724123d948d884b379b1","avatarUrl":"/avatars/ce189d1d8d688c17912f9b869035b2d0.svg","isPro":false,"fullname":"Rongyao Fang","user":"LucasFang","type":"user"},{"_id":"65d5ec74cd05bc1eaa125040","avatarUrl":"/avatars/2de1b1539a86452c2c89570eeb02f5ab.svg","isPro":false,"fullname":"Xihui Liu","user":"XihuiLiu","type":"user"},{"_id":"6427e08288215cee63b1c44d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6427e08288215cee63b1c44d/rzaG978FF-ywzicWNl_xl.jpeg","isPro":false,"fullname":"yao teng","user":"tytyt","type":"user"},{"_id":"63ea23b9dedfeebe54d02bdf","avatarUrl":"/avatars/4d9f9a546aa8c63e277161ea700075c4.svg","isPro":false,"fullname":"Yuqing Wang","user":"Epiphqny","type":"user"},{"_id":"64a2b496e2e19de17db7de65","avatarUrl":"/avatars/241448ca487833d6cc5d57bb1fdb6ee5.svg","isPro":false,"fullname":"Duan Chengqi","user":"gogoduan","type":"user"},{"_id":"60d045c4778bafd0fbcfa3f5","avatarUrl":"/avatars/0cc0c2739c1934430ea09df7e9668c80.svg","isPro":false,"fullname":"Yi Chen","user":"ChenYi99","type":"user"},{"_id":"64b4eecf2fc8324fcb63b404","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64b4eecf2fc8324fcb63b404/zGYqYVB4-o-GBMybJ8CDA.png","isPro":false,"fullname":"Yunhan Yang","user":"yhyang-myron","type":"user"},{"_id":"6349214f8146350b3a4c5cdf","avatarUrl":"/avatars/cfd24caac9a87efb528d0f4c375932bc.svg","isPro":false,"fullname":"Dongzhi Jiang","user":"CaraJ","type":"user"},{"_id":"668125557b50b433cda2a211","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/668125557b50b433cda2a211/j3z3wT5Rv9IyUKtbzQpnc.png","isPro":false,"fullname":"Tianwei 
Xiong","user":"YuuTennYi","type":"user"},{"_id":"641af5fcf902cc42730b47e2","avatarUrl":"/avatars/73ac99dec226f0e814a16d2f1dbfbce8.svg","isPro":false,"fullname":"Xiaoyu Shi","user":"btwbtm","type":"user"},{"_id":"64478970e6161a1f32e24786","avatarUrl":"/avatars/670eb312a14e540fee9f24d740275774.svg","isPro":false,"fullname":"EvanLin","user":"evanlin","type":"user"},{"_id":"64ba096e760936217a3ad2e2","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64ba096e760936217a3ad2e2/aNQK83Jg5PsBkY0UDg-RA.jpeg","isPro":false,"fullname":"Linzheng Chai","user":"Challenging666","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary
PUMA, a unified multimodal large language model, addresses varying granularity demands in visual content generation tasks through the integration of multi-granular visual features.

Abstract
Recent advancements in multimodal foundation models have yielded significant progress in vision-language understanding. Initial attempts have also explored the potential of multimodal large language models (MLLMs) for visual content generation. However, existing works have insufficiently addressed the varying granularity demands of different image generation tasks within a unified MLLM paradigm, from the diversity required in text-to-image generation to the precise controllability needed in image manipulation. In this work, we propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation. PUMA unifies multi-granular visual features as both inputs and outputs of MLLMs, elegantly addressing the different granularity requirements of various image generation tasks within a unified MLLM framework. Following multimodal pretraining and task-specific instruction tuning, PUMA demonstrates proficiency in a wide range of multimodal tasks. This work represents a significant step towards a truly unified MLLM capable of adapting to the granularity demands of various visual tasks. The code and model will be released at https://github.com/rongyaofang/PUMA.
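To make the central idea concrete, here is a minimal sketch, assuming a PyTorch-style setup, of how multi-granular visual features could serve as both inputs and outputs of an MLLM: an image encoder's feature map is pooled to several token counts (coarse to fine), each level is projected into the LLM embedding space, and granularity-specific heads map LLM hidden states back to visual features. All module names, dimensions, and token counts below are illustrative assumptions, not the released PUMA implementation.

```python
# Conceptual sketch (not the authors' code): multi-granular visual features
# as both inputs and outputs of an MLLM.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularVisualAdapter(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, token_counts=(1, 4, 16, 64, 256)):
        super().__init__()
        self.token_counts = token_counts  # coarse (few tokens) -> fine (many tokens)
        # project each granularity's features into the LLM embedding space
        self.in_proj = nn.ModuleList([nn.Linear(vis_dim, llm_dim) for _ in token_counts])
        # map LLM hidden states back to visual features for each granularity
        self.out_proj = nn.ModuleList([nn.Linear(llm_dim, vis_dim) for _ in token_counts])

    def encode(self, feat_map):
        """feat_map: (B, vis_dim, H, W) grid from a frozen image encoder."""
        tokens = []
        for proj, n in zip(self.in_proj, self.token_counts):
            side = int(n ** 0.5)
            pooled = F.adaptive_avg_pool2d(feat_map, side)      # (B, C, side, side)
            pooled = pooled.flatten(2).transpose(1, 2)          # (B, n, C)
            tokens.append(proj(pooled))                         # (B, n, llm_dim)
        return torch.cat(tokens, dim=1)  # multi-granular input sequence for the LLM

    def decode(self, llm_hidden_per_level):
        """Map per-granularity LLM hidden states back to visual feature space."""
        return [proj(h) for proj, h in zip(self.out_proj, llm_hidden_per_level)]

adapter = MultiGranularVisualAdapter()
feats = torch.randn(2, 1024, 24, 24)        # dummy encoder output
visual_tokens = adapter.encode(feats)       # (2, 1 + 4 + 16 + 64 + 256, 4096)
print(visual_tokens.shape)
```

In this sketch, the coarse levels carry mostly semantic content while the fine levels preserve spatial detail, which is what lets a single token sequence serve tasks with very different granularity needs.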
PUMA introduces a unified multimodal large language model framework designed to integrate multi-granular visual generation and understanding. Our model excels in a variety of visual tasks, including diverse text-to-image generation, precise image editing, conditional image generation, and visual understanding. It strikes a balance between generation diversity and controllability, making it a versatile tool across these tasks.
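As a rough illustration of how the diversity/controllability trade-off maps onto granularity, the hypothetical helper below routes diverse text-to-image generation to coarse, semantic-level features and precise editing to fine, detail-preserving features. The function and task names are placeholders, not the released PUMA API; only the coarse-versus-fine intuition comes from the paper.

```python
# Hypothetical routing sketch: which granularity level a task would rely on most.
GRANULARITY_TOKENS = (1, 4, 16, 64, 256)  # coarse (semantic) -> fine (detailed)

def pick_granularity(task: str) -> int:
    """Return the index of the granularity level assumed to dominate for a task."""
    if task == "text-to-image":                      # diversity: coarse, semantic features
        return 0
    if task in ("image-editing", "conditional-generation"):
        return len(GRANULARITY_TOKENS) - 1           # controllability: fine features
    raise ValueError(f"unknown task: {task}")

print(GRANULARITY_TOKENS[pick_granularity("text-to-image")])   # 1 token
print(GRANULARITY_TOKENS[pick_granularity("image-editing")])   # 256 tokens
```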