Papers
arxiv:2509.09666

Can Understanding and Generation Truly Benefit Together -- or Just Coexist?

Published on Sep 11
· Submitted by linbin on Sep 12
Authors:
Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan

Abstract

UAE, a novel framework, uses reinforcement learning to unify the image-to-text and text-to-image processes, enhancing mutual understanding and generation fidelity.

AI-generated summary

In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce a coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantics and complex spatial relationships. We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) a cold-start phase to gently initialize both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of unified multimodal models (UMMs). A surprising "aha moment" arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.
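
To make the framing above concrete, here is a minimal, self-contained sketch of the auto-encoder loop with a GRPO-style group-relative reward, where reconstruction fidelity is the unified objective. All function names and the toy embeddings (caption_image, reconstruct_image, embed) are hypothetical placeholders for illustration, not the UAE implementation or its models; in the paper the encoder is an I2T model, the decoder a T2I generator, and fidelity would be scored on real images.

```python
# Conceptual sketch only: placeholder I2T/T2I functions stand in for real models.
import random
import statistics

def caption_image(image, seed):
    """Placeholder I2T encoder: one sampled caption per seed."""
    return f"caption variant {seed} describing {image['name']}"

def reconstruct_image(caption):
    """Placeholder T2I decoder: returns a toy 'embedding' of the generated image."""
    rng = random.Random(hash(caption) % (2**32))
    return [rng.random() for _ in range(8)]

def embed(image):
    """Placeholder image embedder used to score reconstruction fidelity."""
    rng = random.Random(hash(image["name"]) % (2**32))
    return [rng.random() for _ in range(8)]

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def group_relative_advantages(image, num_samples=4):
    """GRPO-style step: sample several captions, reward each by how well the
    decoder reconstructs the original image, then normalize within the group."""
    original = embed(image)
    rewards = []
    for seed in range(num_samples):
        caption = caption_image(image, seed)
        recon = reconstruct_image(caption)
        rewards.append(cosine(original, recon))  # reconstruction fidelity as reward
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]  # advantages for the policy update

if __name__ == "__main__":
    print(group_relative_advantages({"name": "example.jpg"}))
```

In stage (2) of Unified-GRPO, such advantages would update the encoder (captions that yield better reconstructions are reinforced), while in stage (3) the decoder is refined to reconstruct faithfully from the captions it receives.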

Community

Paper submitter

🔥🔥🔥Understanding ↔ Generation can boost each other — not just coexist! Framed as an autoencoder (I2T=encoder, T2I=decoder) and trained with Unified-GRPO (RL).
🧠🧠🧠Result: encoder writes richer captions, decoder reconstructs with striking fidelity.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Thanks!


Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2509.09666 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2509.09666 in a Space README.md to link it from this page.

Collections including this paper 2
