jiuhai: https://github.com/JiuhaiChen/BLIP3o

yjh415: an audio overview for learning on the go: https://youtu.be/z5dMx-Azpxs
\n","updatedAt":"2025-05-15T20:48:10.191Z","author":{"_id":"6813ee19c9b224a738fea856","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/g1uPHIKEgWe1ftHGHbo_U.png","fullname":"YJ","name":"yjh415","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.3325701951980591},"editors":["yjh415"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/g1uPHIKEgWe1ftHGHbo_U.png"],"reactions":[],"isReport":false}},{"id":"68269650f032d814753f7377","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264},"createdAt":"2025-05-16T01:35:12.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [An Empirical Study of GPT-4o Image Generation Capabilities](https://huggingface.co/papers/2504.05979) (2025)\n* [Harmonizing Visual Representations for Unified Multimodal Understanding and Generation](https://huggingface.co/papers/2503.21979) (2025)\n* [Preliminary Explorations with GPT-4o(mni) Native Image Generation](https://huggingface.co/papers/2505.05501) (2025)\n* [Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction](https://huggingface.co/papers/2505.02471) (2025)\n* [Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing](https://huggingface.co/papers/2504.21356) (2025)\n* [Transfer between Modalities with MetaQueries](https://huggingface.co/papers/2504.06256) (2025)\n* [Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation](https://huggingface.co/papers/2505.05472) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!
\n
If you want recommendations for any Paper on Hugging Face checkout this Space
\n
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend
\n","updatedAt":"2025-05-16T01:35:12.530Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7348071336746216},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2505.09568","authors":[{"_id":"68254419181d43c25d829239","user":{"_id":"6393847e3e30234ae798b7be","avatarUrl":"/avatars/daeb8c37dff4432d837a69b87c196521.svg","isPro":true,"fullname":"JiuhaiChen","user":"jiuhai","type":"user"},"name":"Jiuhai Chen","status":"admin_assigned","statusLastChangedAt":"2025-05-15T13:48:28.916Z","hidden":false},{"_id":"68254419181d43c25d82923a","user":{"_id":"64b6c686cf5117d7962d8f62","avatarUrl":"/avatars/96ed7a9602aa4c21b3a3d89608e76dc8.svg","isPro":false,"fullname":"Zhiyang Xu","user":"Zhiyang03","type":"user"},"name":"Zhiyang Xu","status":"admin_assigned","statusLastChangedAt":"2025-05-15T13:48:51.984Z","hidden":false},{"_id":"68254419181d43c25d82923b","user":{"_id":"63172831c92fd6fee3181f50","avatarUrl":"/avatars/0f57068a138cb181e9451bfc1ed3d1c0.svg","isPro":true,"fullname":"Xichen Pan","user":"xcpan","type":"user"},"name":"Xichen Pan","status":"claimed_verified","statusLastChangedAt":"2025-05-15T10:31:43.038Z","hidden":false},{"_id":"68254419181d43c25d82923c","user":{"_id":"62b1474bdcbad6848a91a54e","avatarUrl":"/avatars/d7308899b46232cad4a48a0e876449a8.svg","isPro":false,"fullname":"Yushi Hu","user":"yushihu","type":"user"},"name":"Yushi Hu","status":"admin_assigned","statusLastChangedAt":"2025-05-15T13:49:05.178Z","hidden":false},{"_id":"68254419181d43c25d82923d","name":"Can Qin","hidden":false},{"_id":"68254419181d43c25d82923e","user":{"_id":"6381ca7d65dc156aba0b933d","avatarUrl":"/avatars/84dfdca8e1cd6fbf50d6fb2a6f1b488d.svg","isPro":false,"fullname":"Tom Goldstein","user":"tomgoldstein","type":"user"},"name":"Tom Goldstein","status":"admin_assigned","statusLastChangedAt":"2025-05-15T13:49:13.762Z","hidden":false},{"_id":"68254419181d43c25d82923f","name":"Lifu Huang","hidden":false},{"_id":"68254419181d43c25d829240","user":{"_id":"647f5af5b0e96764589f3b2a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/VJ4cDyjp5M3V5WmI5gPIU.jpeg","isPro":false,"fullname":"Tianyi Zhou","user":"zhoutianyi","type":"user"},"name":"Tianyi Zhou","status":"claimed_verified","statusLastChangedAt":"2025-05-15T10:32:05.507Z","hidden":false},{"_id":"68254419181d43c25d829241","user":{"_id":"6596422646624a86ff3b3bda","avatarUrl":"/avatars/216e12b77e45ac5f1fa20932f5745411.svg","isPro":false,"fullname":"Saining Xie","user":"sainx","type":"user"},"name":"Saining Xie","status":"admin_assigned","statusLastChangedAt":"2025-05-15T13:49:28.644Z","hidden":false},{"_id":"68254419181d43c25d829242","user":{"_id":"67d5674bbc03ef961e733ddd","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/3EUXNd-mKvsXFDlr1FETh.png","isPro":false,"fullname":"Silvio Savarese","user":"SilvioSav8","type":"user"},"name":"Silvio 
Savarese","status":"admin_assigned","statusLastChangedAt":"2025-05-15T13:49:36.341Z","hidden":false},{"_id":"68254419181d43c25d829243","user":{"_id":"63dd73e7422ca8d7f7e3698c","avatarUrl":"/avatars/7b0f8419f6941230b81dbbbb4f273edf.svg","isPro":false,"fullname":"Le Xue","user":"SFXX","type":"user"},"name":"Le Xue","status":"admin_assigned","statusLastChangedAt":"2025-05-15T13:49:58.945Z","hidden":false},{"_id":"68254419181d43c25d829244","user":{"_id":"649dbcc4e0fff1ed099dc80a","avatarUrl":"/avatars/c87c273ca628dbcddccbf1ee19b2ce33.svg","isPro":false,"fullname":"Caiming Xiong","user":"cxiong","type":"user"},"name":"Caiming Xiong","status":"admin_assigned","statusLastChangedAt":"2025-05-15T13:49:42.774Z","hidden":false},{"_id":"68254419181d43c25d829245","user":{"_id":"6465c4c863e7e09dd02e3e1b","avatarUrl":"/avatars/200b029184d2616f98296a2c212f0785.svg","isPro":false,"fullname":"Ran Xu","user":"xurantju","type":"user"},"name":"Ran Xu","status":"claimed_verified","statusLastChangedAt":"2025-05-15T10:31:39.465Z","hidden":false}],"publishedAt":"2025-05-14T17:11:07.000Z","submittedOnDailyAt":"2025-05-15T00:07:05.564Z","title":"BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture,\n Training and Dataset","submittedOnDailyBy":{"_id":"6393847e3e30234ae798b7be","avatarUrl":"/avatars/daeb8c37dff4432d837a69b87c196521.svg","isPro":true,"fullname":"JiuhaiChen","user":"jiuhai","type":"user"},"summary":"Unifying image understanding and generation has gained growing attention in\nrecent research on multimodal models. Although design choices for image\nunderstanding have been extensively studied, the optimal model architecture and\ntraining recipe for a unified framework with image generation remain\nunderexplored. Motivated by the strong potential of autoregressive and\ndiffusion models for high-quality generation and scalability, we conduct a\ncomprehensive study of their use in unified multimodal settings, with emphasis\non image representations, modeling objectives, and training strategies.\nGrounded in these investigations, we introduce a novel approach that employs a\ndiffusion transformer to generate semantically rich CLIP image features, in\ncontrast to conventional VAE-based representations. This design yields both\nhigher training efficiency and improved generative quality. Furthermore, we\ndemonstrate that a sequential pretraining strategy for unified models-first\ntraining on image understanding and subsequently on image generation-offers\npractical advantages by preserving image understanding capability while\ndeveloping strong image generation ability. Finally, we carefully curate a\nhigh-quality instruction-tuning dataset BLIP3o-60k for image generation by\nprompting GPT-4o with a diverse set of captions covering various scenes,\nobjects, human gestures, and more. Building on our innovative model design,\ntraining recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art\nunified multimodal models. 
BLIP3-o achieves superior performance across most of\nthe popular benchmarks spanning both image understanding and generation tasks.\nTo facilitate future research, we fully open-source our models, including code,\nmodel weights, training scripts, and pretraining and instruction tuning\ndatasets.","upvotes":97,"discussionId":"6825441a181d43c25d82927a","githubRepo":"https://github.com/JiuhaiChen/BLIP3o","ai_summary":"A diffusion transformer is used in a unified multimodal model framework to improve image generation while maintaining image understanding capabilities.","ai_keywords":["diffusion transformer","CLIP image features","VAE-based representations","sequential pretraining","image understanding","image generation","BLIP3-o","instruction-tuning dataset"],"githubStars":1497},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"5ead1b914e876668a0c37772","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/5ead1b914e876668a0c37772/ftW3bs6hy2Q_J63_OUKKW.png","isPro":false,"fullname":"PenutChen","user":"penut85420","type":"user"},{"_id":"6323f399462470712720c155","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6323f399462470712720c155/SWsMNa7vETUSrOt9Qf-oe.png","isPro":false,"fullname":"Yinxu Pan","user":"cppowboy","type":"user"},{"_id":"6765319aee01ca9cf236b8fd","avatarUrl":"/avatars/924943447ab585768f402481380e21b4.svg","isPro":false,"fullname":"bianbian","user":"yibian2014","type":"user"},{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"66a9b3533d417b0baa9220a6","avatarUrl":"/avatars/adc372bd24df1d3bf43258833411e8af.svg","isPro":false,"fullname":"Luozheng Qin","user":"Fr0zencr4nE","type":"user"},{"_id":"6465c4c863e7e09dd02e3e1b","avatarUrl":"/avatars/200b029184d2616f98296a2c212f0785.svg","isPro":false,"fullname":"Ran Xu","user":"xurantju","type":"user"},{"_id":"65bb837dbfb878f46c77de4c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65bb837dbfb878f46c77de4c/PKyQ_-wTNH1Hyv5HxhWdX.jpeg","isPro":true,"fullname":"Prithiv Sakthi","user":"prithivMLmods","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"647f5af5b0e96764589f3b2a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/VJ4cDyjp5M3V5WmI5gPIU.jpeg","isPro":false,"fullname":"Tianyi Zhou","user":"zhoutianyi","type":"user"},{"_id":"63b2a92e18e5cf2cdd333492","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63b2a92e18e5cf2cdd333492/GxnngJG0u7d0jYTEFOrfe.png","isPro":false,"fullname":"Jaehyun Jun","user":"btjhjeon","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"638f308fc4444c6ca870b60a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/638f308fc4444c6ca870b60a/Q11NK-8-JbiilJ-vk2LAR.png","isPro":true,"fullname":"Linoy Tsaban","user":"linoyts","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":1}">
AI-generated summary

A diffusion transformer is used in a unified multimodal model framework to improve image generation while maintaining image understanding capabilities.

Abstract
Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models (first training on image understanding, then on image generation) offers practical advantages, preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate BLIP3o-60k, a high-quality instruction-tuning dataset for image generation, by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and the pretraining and instruction-tuning datasets.
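
To make the core design in the abstract concrete, below is a minimal PyTorch sketch of the general idea: a diffusion transformer that denoises CLIP image-feature tokens conditioned on hidden states from an autoregressive multimodal LM, trained with a simple flow-matching objective. This is an illustrative sketch only; the class and variable names (`CLIPFeatureDiT`, `llm_cond`), feature dimensions, and the specific loss formulation are assumptions for exposition, not the actual BLIP3-o implementation from the repository linked above.

```python
# Illustrative sketch (not the BLIP3-o code): a diffusion transformer that
# predicts CLIP image features, conditioned on hidden states from an
# autoregressive multimodal LM, with a linear-interpolation flow-matching loss.
import torch
import torch.nn as nn


class CLIPFeatureDiT(nn.Module):
    def __init__(self, clip_dim=1024, cond_dim=4096, width=1024, depth=4, heads=16):
        super().__init__()
        self.in_proj = nn.Linear(clip_dim, width)          # embed noisy CLIP tokens
        self.cond_proj = nn.Linear(cond_dim, width)        # embed LM conditioning tokens
        self.time_mlp = nn.Sequential(nn.Linear(1, width), nn.SiLU(), nn.Linear(width, width))
        layer = nn.TransformerDecoderLayer(d_model=width, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=depth)
        self.out_proj = nn.Linear(width, clip_dim)

    def forward(self, noisy_feats, t, llm_cond):
        # noisy_feats: (B, N, clip_dim) noisy CLIP feature tokens
        # t:           (B,) diffusion / flow time in [0, 1]
        # llm_cond:    (B, M, cond_dim) hidden states from the autoregressive LM
        x = self.in_proj(noisy_feats) + self.time_mlp(t[:, None, None].float())
        return self.out_proj(self.blocks(x, self.cond_proj(llm_cond)))


def flow_matching_loss(model, clip_feats, llm_cond):
    # Sample a random time, interpolate between noise and clean CLIP features,
    # and regress the velocity (clean - noise) that points toward the target.
    b = clip_feats.size(0)
    t = torch.rand(b, device=clip_feats.device)
    noise = torch.randn_like(clip_feats)
    x_t = (1 - t)[:, None, None] * noise + t[:, None, None] * clip_feats
    pred = model(x_t, t, llm_cond)
    return ((pred - (clip_feats - noise)) ** 2).mean()


if __name__ == "__main__":
    model = CLIPFeatureDiT()
    clip_feats = torch.randn(2, 64, 1024)   # target CLIP image features (hypothetical shape)
    llm_cond = torch.randn(2, 32, 4096)     # conditioning tokens from the LM (hypothetical shape)
    print(flow_matching_loss(model, clip_feats, llm_cond).item())
```

The sketch stops at feature prediction: in a full system a separate image decoder (not shown) would map the generated CLIP features back to pixels. Predicting semantic CLIP features rather than VAE latents is the design the abstract credits with higher training efficiency and improved generative quality.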