An audio overview for learning on the go: https://youtu.be/0HmtJTO3ZXI
\n\n","updatedAt":"2025-05-21T18:27:35.772Z","author":{"_id":"6813ee19c9b224a738fea856","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/g1uPHIKEgWe1ftHGHbo_U.png","fullname":"YJ","name":"yjh415","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.28167250752449036},"editors":["yjh415"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/g1uPHIKEgWe1ftHGHbo_U.png"],"reactions":[{"reaction":"👍","users":["bearcat"],"count":1}],"isReport":false}},{"id":"682e807b50671dc8267ed686","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264},"createdAt":"2025-05-22T01:40:11.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset](https://huggingface.co/papers/2505.09568) (2025)\n* [UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation](https://huggingface.co/papers/2505.10483) (2025)\n* [Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction](https://huggingface.co/papers/2505.02471) (2025)\n* [UniGen: Enhanced Training&Test-Time Strategies for Unified Multimodal Understanding and Generation](https://huggingface.co/papers/2505.14682) (2025)\n* [UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding](https://huggingface.co/papers/2504.04423) (2025)\n* [Transfer between Modalities with MetaQueries](https://huggingface.co/papers/2504.06256) (2025)\n* [Preliminary Explorations with GPT-4o(mni) Native Image Generation](https://huggingface.co/papers/2505.05501) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\nThe following papers were recommended by the Semantic Scholar API
\n- \n
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset (2025) \n
- UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation (2025) \n
- Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction (2025) \n
- UniGen: Enhanced Training&Test-Time Strategies for Unified Multimodal Understanding and Generation (2025) \n
- UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding (2025) \n
- Transfer between Modalities with MetaQueries (2025) \n
- Preliminary Explorations with GPT-4o(mni) Native Image Generation (2025) \n
Please give a thumbs up to this comment if you found it helpful!
\nIf you want recommendations for any Paper on Hugging Face checkout this Space
\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend
Here is the AI breakdown of this paper on arXiv explained: https://arxivexplained.com/papers/emerging-properties-in-unified-multimodal-pretraining
\n","updatedAt":"2025-06-09T17:40:19.070Z","author":{"_id":"65d9fc2a0e6ad24551d87a1e","avatarUrl":"/avatars/3aedb9522cc3cd08349d654f523fd792.svg","fullname":"Grant Singleton","name":"grantsing","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":1}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.8306657671928406},"editors":["grantsing"],"editorAvatarUrls":["/avatars/3aedb9522cc3cd08349d654f523fd792.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2505.14683","authors":[{"_id":"682d2fd84540abccd3b835e8","name":"Chaorui Deng","hidden":false},{"_id":"682d2fd84540abccd3b835e9","name":"Deyao Zhu","hidden":false},{"_id":"682d2fd84540abccd3b835ea","user":{"_id":"61fb81006374891646732f37","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1643872995181-61fb81006374891646732f37.jpeg","isPro":false,"fullname":"Kunchang Li","user":"Andy1621","type":"user"},"name":"Kunchang Li","status":"claimed_verified","statusLastChangedAt":"2025-05-21T08:41:06.469Z","hidden":false},{"_id":"682d2fd84540abccd3b835eb","user":{"_id":"652e9c5774d1b0d7ff73d091","avatarUrl":"/avatars/a6d2098b3dde4a8b7488a193f0ecb776.svg","isPro":true,"fullname":"Chenhui Gou","user":"gouc","type":"user"},"name":"Chenhui Gou","status":"claimed_verified","statusLastChangedAt":"2025-05-21T08:41:08.903Z","hidden":false},{"_id":"682d2fd84540abccd3b835ec","name":"Feng Li","hidden":false},{"_id":"682d2fd84540abccd3b835ed","name":"Zeyu Wang","hidden":false},{"_id":"682d2fd84540abccd3b835ee","name":"Shu Zhong","hidden":false},{"_id":"682d2fd84540abccd3b835ef","user":{"_id":"5df833bdda6d0311fd3d5403","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/5df833bdda6d0311fd3d5403/62OtGJEQXdOuhV9yCd4HS.png","isPro":false,"fullname":"Weihao Yu","user":"whyu","type":"user"},"name":"Weihao Yu","status":"admin_assigned","statusLastChangedAt":"2025-05-21T09:31:55.569Z","hidden":false},{"_id":"682d2fd84540abccd3b835f0","user":{"_id":"64b6b81142134e053233c3c0","avatarUrl":"/avatars/5c7455d99a7a2648f77a531c9a71eb98.svg","isPro":false,"fullname":"Xiaonan Nie","user":"codecaution","type":"user"},"name":"Xiaonan Nie","status":"admin_assigned","statusLastChangedAt":"2025-05-21T10:06:14.057Z","hidden":false},{"_id":"682d2fd84540abccd3b835f1","user":{"_id":"617fe76105423df678cef199","avatarUrl":"/avatars/64c94a4d743edab18ecb4bb7c550f049.svg","isPro":false,"fullname":"Song","user":"Ziang","type":"user"},"name":"Ziang Song","status":"admin_assigned","statusLastChangedAt":"2025-05-21T10:06:07.780Z","hidden":false},{"_id":"682d2fd84540abccd3b835f2","name":"Guang Shi","hidden":false},{"_id":"682d2fd84540abccd3b835f3","name":"Haoqi Fan","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/61fb81006374891646732f37/HQOfWqrOf9B97hWczL489.png"],"publishedAt":"2025-05-20T17:59:30.000Z","submittedOnDailyAt":"2025-05-21T00:38:53.960Z","title":"Emerging Properties in Unified Multimodal Pretraining","submittedOnDailyBy":{"_id":"61fb81006374891646732f37","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1643872995181-61fb81006374891646732f37.jpeg","isPro":false,"fullname":"Kunchang Li","user":"Andy1621","type":"user"},"summary":"Unifying multimodal understanding and generation has shown impressive\ncapabilities in cutting-edge proprietary systems. In this work, we introduce\nBAGEL, an open0source foundational model that natively supports multimodal\nunderstanding and generation. 
BAGEL is a unified, decoder0only model pretrained\non trillions of tokens curated from large0scale interleaved text, image, video,\nand web data. When scaled with such diverse multimodal interleaved data, BAGEL\nexhibits emerging capabilities in complex multimodal reasoning. As a result, it\nsignificantly outperforms open-source unified models in both multimodal\ngeneration and understanding across standard benchmarks, while exhibiting\nadvanced multimodal reasoning abilities such as free-form image manipulation,\nfuture frame prediction, 3D manipulation, and world navigation. In the hope of\nfacilitating further opportunities for multimodal research, we share the key\nfindings, pretraining details, data creation protocal, and release our code and\ncheckpoints to the community. The project page is at https://bagel-ai.org/","upvotes":133,"discussionId":"682d2fdc4540abccd3b836ee","ai_summary":"BAGEL, an open-source foundational model trained on diverse multimodal data, significantly outperforms existing models in both generation and understanding tasks.","ai_keywords":["multimodal understanding","multimodal generation","foundational model","decoder-only model","trillions of tokens","large-scale interleaved data","complex multimodal reasoning","free-form image manipulation","future frame prediction","3D manipulation","world navigation"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6289e1e6c65096f8c63be40e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1653203427026-noauth.png","isPro":false,"fullname":"LazyPig","user":"SakuraD","type":"user"},{"_id":"6289e290edfa7a816db76774","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1653203591668-noauth.png","isPro":false,"fullname":"Jack","user":"Jack9585","type":"user"},{"_id":"646f028385ccfb39f62c4d8d","avatarUrl":"/avatars/86adf071d2e8b80d3d61ae17ce923465.svg","isPro":false,"fullname":"bearcat","user":"bearcat","type":"user"},{"_id":"656d41258a37acfa3f1f284a","avatarUrl":"/avatars/520e72488441bd3eb35f152fbb6a9ba8.svg","isPro":false,"fullname":"feng li","user":"fenly","type":"user"},{"_id":"640d68938aee167ccda391da","avatarUrl":"/avatars/8e3fbf6ca10fe4e9b04a3a84d4e3e255.svg","isPro":false,"fullname":"Yunhao Fang","user":"Seerkfang","type":"user"},{"_id":"5df833bdda6d0311fd3d5403","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/5df833bdda6d0311fd3d5403/62OtGJEQXdOuhV9yCd4HS.png","isPro":false,"fullname":"Weihao Yu","user":"whyu","type":"user"},{"_id":"61fb81006374891646732f37","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1643872995181-61fb81006374891646732f37.jpeg","isPro":false,"fullname":"Kunchang Li","user":"Andy1621","type":"user"},{"_id":"64b6b81142134e053233c3c0","avatarUrl":"/avatars/5c7455d99a7a2648f77a531c9a71eb98.svg","isPro":false,"fullname":"Xiaonan Nie","user":"codecaution","type":"user"},{"_id":"6397e914e8533c98cf64a641","avatarUrl":"/avatars/efb93d3dc9f42236501e4a705a64a83c.svg","isPro":false,"fullname":"Kane Chen","user":"KaneC","type":"user"},{"_id":"6622c710b0e5c5e3de8311c1","avatarUrl":"/avatars/a824c150040731679bbd77762ca9d4eb.svg","isPro":false,"fullname":"Zun 
Wang","user":"ZunWang","type":"user"},{"_id":"62aafa49f29ff279b51f0182","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62aafa49f29ff279b51f0182/rQx8QFQGOY2qIhqJ8zSRj.jpeg","isPro":false,"fullname":"yinanhe","user":"ynhe","type":"user"},{"_id":"62f8a6c677b722f1865fa727","avatarUrl":"/avatars/64563b5d4d2cdc72449e483edede70d4.svg","isPro":false,"fullname":"Tsu Tikgiau","user":"tsutikgiau","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":1}">Abstract
BAGEL, an open-source foundational model trained on diverse multimodal data, significantly outperforms existing models in both generation and understanding tasks.
Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/
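For readers who want to try the released checkpoints, here is a minimal sketch of fetching them from the Hugging Face Hub. The repository ID below is an assumption; confirm the exact ID on the project page or the authors' Hugging Face organization before running.

```python
# Minimal sketch: download the released BAGEL checkpoint from the Hugging Face Hub.
# NOTE: the repo_id is an assumed placeholder -- verify it against https://bagel-ai.org/
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ByteDance-Seed/BAGEL-7B-MoT",  # assumed checkpoint repository
    local_dir="./BAGEL-checkpoint",
)
print(f"Checkpoint downloaded to {local_dir}")
```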
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset (2025)
- UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation (2025)
- Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction (2025)
- UniGen: Enhanced Training&Test-Time Strategies for Unified Multimodal Understanding and Generation (2025)
- UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding (2025)
- Transfer between Modalities with MetaQueries (2025)
- Preliminary Explorations with GPT-4o(mni) Native Image Generation (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Here is the AI breakdown of this paper on arXiv explained: https://arxivexplained.com/papers/emerging-properties-in-unified-multimodal-pretraining