arxiv:2505.14683

Emerging Properties in Unified Multimodal Pretraining

Published on May 20, 2025 · Submitted by Kunchang Li on May 21, 2025
#1 Paper of the day
Authors: Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan
Abstract

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/

AI-generated summary

BAGEL, an open-source foundational model trained on diverse multimodal data, significantly outperforms existing models in both generation and understanding tasks.

Community

Kunchang Li (Paper author · Paper submitter)

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/

An audio overview for learning on the go: https://youtu.be/0HmtJTO3ZXI


This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset](https://huggingface.co/papers/2505.09568) (2025)
* [UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation](https://huggingface.co/papers/2505.10483) (2025)
* [Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction](https://huggingface.co/papers/2505.02471) (2025)
* [UniGen: Enhanced Training&Test-Time Strategies for Unified Multimodal Understanding and Generation](https://huggingface.co/papers/2505.14682) (2025)
* [UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding](https://huggingface.co/papers/2504.04423) (2025)
* [Transfer between Modalities with MetaQueries](https://huggingface.co/papers/2504.06256) (2025)
* [Preliminary Explorations with GPT-4o(mni) Native Image Generation](https://huggingface.co/papers/2505.05501) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Here is the AI breakdown of this paper on arXiv Explained: https://arxivexplained.com/papers/emerging-properties-in-unified-multimodal-pretraining


Models citing this paper: 5
Datasets citing this paper: 1
Spaces citing this paper: 8
Collections including this paper: 22
