arxiv:2401.11708

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

Published on Jan 22, 2024
· Submitted by AK on Jan 23, 2024
Authors: Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, Bin Cui
Abstract

A text-to-image generation/editing framework, RPG, uses multimodal LLMs for chain-of-thought reasoning to enhance the compositionality and performance of diffusion models, particularly in handling complex text prompts.

AI-generated summary

Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often face challenges when handling complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a novel training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating complex images into multiple simpler generation tasks within subregions. We propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within the proposed RPG in a closed-loop fashion, thereby enhancing generalization ability. Extensive experiments demonstrate that our RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, our RPG framework exhibits wide compatibility with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available at: https://github.com/YangLing0818/RPG-DiffusionMaster
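The abstract describes a three-stage control flow: an MLLM recaptions the complex prompt and plans a decomposition into subregions, then complementary regional diffusion generates each subregion against its own simplified subprompt. The following is a minimal sketch of that control flow only; every function and field name here is an illustrative assumption, not the authors' actual API, and the planner/diffusion stages are stand-in stubs rather than real model calls.

```python
# Hedged sketch of the RPG (Recaption, Plan, Generate) control flow.
# All names are hypothetical; a real implementation would call a
# multimodal LLM for planning and a diffusion model for generation.
from dataclasses import dataclass


@dataclass
class Subregion:
    """One spatial region of the canvas with its own simplified subprompt."""
    box: tuple        # (x0, y0, x1, y1) in normalized [0, 1] coordinates
    subprompt: str


def recaption_and_plan(prompt: str) -> list:
    """Stand-in for the MLLM planner: recaption the complex prompt and
    decompose it into per-region generation tasks. The real planner uses
    chain-of-thought prompting of a multimodal LLM; this toy heuristic
    just splits on ' and ' and assigns side-by-side vertical strips."""
    parts = [p.strip() for p in prompt.split(" and ")]
    width = 1.0 / len(parts)
    return [
        Subregion(box=(i * width, 0.0, (i + 1) * width, 1.0), subprompt=p)
        for i, p in enumerate(parts)
    ]


def regional_diffusion(regions: list) -> dict:
    """Stand-in for complementary regional diffusion: each subregion is
    denoised against its own subprompt and the latents are composed.
    Here we only record which subprompt governs which region."""
    return {r.box: r.subprompt for r in regions}


def rpg_generate(prompt: str) -> dict:
    regions = recaption_and_plan(prompt)   # Recaption + Plan (MLLM)
    return regional_diffusion(regions)     # Generate (regional diffusion)
```

For example, `rpg_generate("a red fox and a blue bird")` maps the left half of the canvas to "a red fox" and the right half to "a blue bird", mirroring how the paper's planner assigns simpler subprompts to complementary subregions.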

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space

Mastering Text-to-Image Diffusion: The RPG Framework Unveiled!

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix


Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2401.11708 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2401.11708 in a Space README.md to link it from this page.

Collections including this paper 14
