\n","updatedAt":"2024-06-09T05:04:23.998Z","author":{"_id":"6186ddf6a7717cb375090c01","avatarUrl":"/avatars/716b6a7d1094c8036b2a8a7b9063e8aa.svg","fullname":"Julien BLANCHON","name":"blanchon","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":143}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.4780833423137665},"editors":["blanchon"],"editorAvatarUrls":["/avatars/716b6a7d1094c8036b2a8a7b9063e8aa.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2401.11708","authors":[{"_id":"65af1fe0101482afcc5e0c19","user":{"_id":"64fde4e252e82dd432b74ce9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64fde4e252e82dd432b74ce9/-CQZbBP7FsPPyawYrsi4z.jpeg","isPro":false,"fullname":"Ling Yang","user":"Lingaaaaaaa","type":"user"},"name":"Ling Yang","status":"claimed_verified","statusLastChangedAt":"2024-01-23T13:49:03.172Z","hidden":false},{"_id":"65af1fe0101482afcc5e0c1a","name":"Zhaochen Yu","hidden":false},{"_id":"65af1fe0101482afcc5e0c1b","name":"Chenlin Meng","hidden":false},{"_id":"65af1fe0101482afcc5e0c1c","user":{"_id":"64c0e950aa57599de1c75dad","avatarUrl":"/avatars/374d53317cbccc30fae70e5152ca13e0.svg","isPro":false,"fullname":"Minkai Xu","user":"mkxu","type":"user"},"name":"Minkai Xu","status":"admin_assigned","statusLastChangedAt":"2024-01-23T09:48:26.326Z","hidden":false},{"_id":"65af1fe0101482afcc5e0c1d","user":{"_id":"62f6e244329d4d014d1f4ac5","avatarUrl":"/avatars/5a8b2bb063c2ebc340504b22530f6811.svg","isPro":false,"fullname":"Stefano Ermon","user":"ermonste","type":"user"},"name":"Stefano Ermon","status":"admin_assigned","statusLastChangedAt":"2024-01-23T09:48:14.407Z","hidden":false},{"_id":"65af1fe0101482afcc5e0c1e","name":"Bin Cui","hidden":false}],"publishedAt":"2024-01-22T06:16:29.000Z","submittedOnDailyAt":"2024-01-23T00:49:39.837Z","title":"Mastering Text-to-Image Diffusion: Recaptioning, Planning, and\n Generating with Multimodal LLMs","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"Diffusion models have exhibit exceptional performance in text-to-image\ngeneration and editing. However, existing methods often face challenges when\nhandling complex text prompts that involve multiple objects with multiple\nattributes and relationships. In this paper, we propose a brand new\ntraining-free text-to-image generation/editing framework, namely Recaption,\nPlan and Generate (RPG), harnessing the powerful chain-of-thought reasoning\nability of multimodal LLMs to enhance the compositionality of text-to-image\ndiffusion models. Our approach employs the MLLM as a global planner to\ndecompose the process of generating complex images into multiple simpler\ngeneration tasks within subregions. We propose complementary regional diffusion\nto enable region-wise compositional generation. Furthermore, we integrate\ntext-guided image generation and editing within the proposed RPG in a\nclosed-loop fashion, thereby enhancing generalization ability. Extensive\nexperiments demonstrate our RPG outperforms state-of-the-art text-to-image\ndiffusion models, including DALL-E 3 and SDXL, particularly in multi-category\nobject composition and text-image semantic alignment. 
Notably, our RPG\nframework exhibits wide compatibility with various MLLM architectures (e.g.,\nMiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available\nat: https://github.com/YangLing0818/RPG-DiffusionMaster","upvotes":30,"discussionId":"65af1fe4101482afcc5e0d0e","ai_summary":"A text-to-image generation/editing framework, RPG, uses multimodal LLMs for chain-of-thought reasoning to enhance the compositionality and performance of diffusion models, particularly in handling complex text prompts.","ai_keywords":["diffusion models","text-to-image generation","editing","chain-of-thought reasoning","multimodal LLMs","complementary regional diffusion","closed-loop fashion","DALL-E 3","SDXL","MiniGPT-4","ControlNet"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"635cada2c017767a629db012","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1667018139063-noauth.jpeg","isPro":false,"fullname":"Ojasvi Singh Yadav","user":"ojasvisingh786","type":"user"},{"_id":"6303aa19a362e7e8b51a8994","avatarUrl":"/avatars/68b1c81d3def9fe0f9d30d975e39efa0.svg","isPro":false,"fullname":"Peter Ding","user":"PeterDing","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"63efbb1efc92a63ac81126d0","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1676655314726-noauth.jpeg","isPro":true,"fullname":"Yongsen Mao","user":"ysmao","type":"user"},{"_id":"6538119803519fddb4a17e10","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6538119803519fddb4a17e10/ffJMkdx-rM7VvLTCM6ri_.jpeg","isPro":false,"fullname":"samusenps","user":"samusenps","type":"user"},{"_id":"6529f7f8703b3743c2322c2e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6529f7f8703b3743c2322c2e/Q4hXmWCqLKDCOYV0ZEvUy.png","isPro":false,"fullname":"Ccino","user":"Hyperccino","type":"user"},{"_id":"65a8ab6a17d869bb7481183d","avatarUrl":"/avatars/660da72d105a9e66b827fee8e72dc3de.svg","isPro":false,"fullname":"Prasanna Rao ","user":"prasannax1","type":"user"},{"_id":"641b26ed1911d3be6743e8d0","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/641b26ed1911d3be6743e8d0/oybjQjcEgiMBC-qGKCQVR.png","isPro":false,"fullname":"余昭辰","user":"BitStarWalkin","type":"user"},{"_id":"6311bca0ae8896941da24e66","avatarUrl":"/avatars/48de64894fc3c9397e26e4d6da3ff537.svg","isPro":false,"fullname":"Fynn Kröger","user":"fynnkroeger","type":"user"},{"_id":"63470b9f3ea42ee2cb4f3279","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/Xv8-IxM4GYM91IUOkRnCG.png","isPro":false,"fullname":"NG","user":"SirRa1zel","type":"user"},{"_id":"64fde4e252e82dd432b74ce9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64fde4e252e82dd432b74ce9/-CQZbBP7FsPPyawYrsi4z.jpeg","isPro":false,"fullname":"Ling Yang","user":"Lingaaaaaaa","type":"user"},{"_id":"6317016f7b0ee0136e5f567d","avatarUrl":"/avatars/fe848d448318fdc17d42967b01101fea.svg","isPro":false,"fullname":"Jan Metzen","user":"jmetzen","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary
A text-to-image generation/editing framework, RPG, uses multimodal LLMs for chain-of-thought reasoning to enhance the compositionality and performance of diffusion models, particularly in handling complex text prompts.
Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often face challenges when handling complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating a complex image into multiple simpler generation tasks within subregions.
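As a rough illustration of this planning step (a minimal sketch, not the authors' implementation: the chat client `mllm_chat`, the prompt template, and the JSON region schema are all assumptions made here for clarity):

```python
# Minimal sketch of the recaption-and-plan step. `mllm_chat` is a
# hypothetical stand-in for any chat-style MLLM client.
import json

PLAN_INSTRUCTION = (
    "Recaption the user's prompt, then split it into subregion plans. "
    'Return JSON: a list of {"region": [x0, y0, x1, y1] in relative '
    'coordinates, "subprompt": an enriched description}.'
)

def plan_regions(prompt: str, mllm_chat) -> list[dict]:
    """Ask the MLLM to recaption a complex prompt and assign each
    entity a subregion of the image with its own enriched subprompt."""
    reply = mllm_chat(system=PLAN_INSTRUCTION, user=prompt)
    return json.loads(reply)

# Example of the kind of plan the MLLM is expected to return
# (stubbed with a lambda so the sketch runs end to end):
fake_reply = json.dumps([
    {"region": [0.0, 0.0, 0.5, 1.0],
     "subprompt": "a green-haired girl in an orange dress, left side"},
    {"region": [0.5, 0.0, 1.0, 1.0],
     "subprompt": "a red-haired girl in a blue dress, right side"},
])
plan = plan_regions("two girls with different hair and dresses",
                    lambda system, user: fake_reply)
print(plan[0]["subprompt"])
```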
We propose complementary regional diffusion to enable region-wise compositional generation.
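The toy sketch below shows one way such region-wise composition could operate on latents, assuming binary region masks and a fixed blend weight; it is an illustrative stand-in, not the paper's exact complementary-diffusion algorithm:

```python
# Toy region-wise latent composition: each subregion gets its own
# denoising prediction conditioned on its subprompt, and the per-region
# outputs are blended with a base (global-prompt) prediction for
# coherence. The 0.7 blend weight is an assumption, not from the paper.
import torch

def compose_regional_latents(base, regional, masks, region_weight=0.7):
    """Merge a base latent prediction with per-region predictions.

    base:     (C, H, W) prediction conditioned on the full prompt
    regional: list of (C, H, W) predictions, one per subprompt
    masks:    list of (H, W) binary masks marking each subregion
    """
    out = base.clone()
    for pred, mask in zip(regional, masks):
        m = mask.unsqueeze(0).bool()  # broadcast over channels
        # Inside a region, favor its dedicated prediction over the base.
        out = torch.where(m, region_weight * pred
                          + (1 - region_weight) * base, out)
    return out

# Tiny smoke test on random "latents" with a left/right split.
C, H, W = 4, 64, 64
base = torch.randn(C, H, W)
left = torch.zeros(H, W); left[:, : W // 2] = 1
merged = compose_regional_latents(
    base,
    [torch.randn(C, H, W), torch.randn(C, H, W)],
    [left, 1 - left],
)
print(merged.shape)  # torch.Size([4, 64, 64])
```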
Furthermore, we integrate text-guided image generation and editing within the proposed RPG in a closed-loop fashion, thereby enhancing generalization ability.
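A schematic of such a closed loop, in which the MLLM critiques each result and the plan is revised until the image is judged faithful (every callable here is a hypothetical stand-in, not an API from the RPG codebase):

```python
# Illustrative closed loop: generate from a plan, let the MLLM verify
# the result against the prompt, and replan/regenerate on failure.
def closed_loop(prompt, plan_fn, generate_fn, critique_fn, max_rounds=3):
    plan = plan_fn(prompt)
    image = generate_fn(plan)
    for _ in range(max_rounds):
        feedback = critique_fn(prompt, image)  # MLLM as verifier
        if feedback == "ok":
            break
        plan = plan_fn(prompt + " | fix: " + feedback)  # recaption/replan
        image = generate_fn(plan)
    return image

# Dummy stand-ins so the sketch runs end to end.
result = closed_loop(
    "a cat to the left of a dog",
    plan_fn=lambda p: [p],
    generate_fn=lambda plan: f"<image for {plan}>",
    critique_fn=lambda p, img: "ok",
)
print(result)
```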
Extensive experiments demonstrate that our RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, our RPG framework exhibits wide compatibility with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available at: https://github.com/YangLing0818/RPG-DiffusionMaster