
Code: https://github.com/haon-chen/mmE5
Model: https://huggingface.co/intfloat/mmE5-mllama-11b-instruct
Dataset: https://huggingface.co/datasets/intfloat/mmE5-synthetic
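For a quick look at the released synthetic data, something like the following should work with the datasets library; this is a minimal sketch, and the split name and default config are assumptions, so consult the dataset card if the call requires a specific subset.

```python
# Sketch: stream a sample from the released synthetic training data.
# Assumes a default config and a "train" split exist; the Hub repo may
# instead require selecting a named subset.
from datasets import load_dataset

ds = load_dataset("intfloat/mmE5-synthetic", split="train", streaming=True)
print(next(iter(ds)))  # fields depend on the release (e.g. task type, texts, image refs)
```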

\n","updatedAt":"2025-02-14T04:32:15.447Z","author":{"_id":"66add675c7a575aa0e03d5f3","avatarUrl":"/avatars/b72b18130664c1de197c1f8df371aa70.svg","fullname":"Haonan Chen","name":"Haon-Chen","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":5}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.5342205166816711},"editors":["Haon-Chen"],"editorAvatarUrls":["/avatars/b72b18130664c1de197c1f8df371aa70.svg"],"reactions":[],"isReport":false}},{"id":"67afef27b023991df153c57e","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264},"createdAt":"2025-02-15T01:34:31.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [MLLM4PUE: Toward Universal Embeddings in Computational Pathology through Multimodal LLMs](https://huggingface.co/papers/2502.07221) (2025)\n* [GME: Improving Universal Multimodal Retrieval by Multimodal LLMs](https://huggingface.co/papers/2412.16855) (2024)\n* [SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning](https://huggingface.co/papers/2501.03675) (2025)\n* [MINIMA: Modality Invariant Image Matching](https://huggingface.co/papers/2412.19412) (2024)\n* [Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks](https://huggingface.co/papers/2501.02527) (2025)\n* [Asymmetric Reinforcing against Multi-modal Representation Bias](https://huggingface.co/papers/2501.01240) (2025)\n* [Multimodal Classification and Out-of-distribution Detection for Multimodal Intent Understanding](https://huggingface.co/papers/2412.12453) (2024)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2025-02-15T01:34:31.250Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7309861779212952},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2502.08468","authors":[{"_id":"67ad5f3fcad644864b4366ca","user":{"_id":"66add675c7a575aa0e03d5f3","avatarUrl":"/avatars/b72b18130664c1de197c1f8df371aa70.svg","isPro":false,"fullname":"Haonan Chen","user":"Haon-Chen","type":"user"},"name":"Haonan Chen","status":"claimed_verified","statusLastChangedAt":"2025-02-13T08:21:55.329Z","hidden":false},{"_id":"67ad5f3fcad644864b4366cb","name":"Liang Wang","hidden":false},{"_id":"67ad5f3fcad644864b4366cc","name":"Nan Yang","hidden":false},{"_id":"67ad5f3fcad644864b4366cd","user":{"_id":"625e62452a7279d3c77b5c38","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/625e62452a7279d3c77b5c38/zJINew6U4_Gup4WTobb-0.jpeg","isPro":false,"fullname":"Yutao Zhu","user":"yutaozhu94","type":"user"},"name":"Yutao Zhu","status":"admin_assigned","statusLastChangedAt":"2025-02-14T12:53:09.223Z","hidden":false},{"_id":"67ad5f3fcad644864b4366ce","user":{"_id":"6639d5c106b25a7ea6f18391","avatarUrl":"/avatars/788e339472999a9159f77f857817d618.svg","isPro":false,"fullname":"Ziliang Zhao","user":"ZillionZhao","type":"user"},"name":"Ziliang Zhao","status":"admin_assigned","statusLastChangedAt":"2025-02-14T12:53:01.987Z","hidden":false},{"_id":"67ad5f3fcad644864b4366cf","user":{"_id":"6368c512fbfe97c16a40baba","avatarUrl":"/avatars/1c23bc7c0b6d9225699ce27647623d7a.svg","isPro":false,"fullname":"Furu Wei","user":"thegenerality","type":"user"},"name":"Furu Wei","status":"admin_assigned","statusLastChangedAt":"2025-02-14T12:52:40.042Z","hidden":false},{"_id":"67ad5f3fcad644864b4366d0","user":{"_id":"66f0bf59e9d50ec57febf751","avatarUrl":"/avatars/be97941e60064e5dd806c6fe9db3c537.svg","isPro":false,"fullname":"Zhicheng Dou","user":"douzc","type":"user"},"name":"Zhicheng Dou","status":"admin_assigned","statusLastChangedAt":"2025-02-14T12:52:33.880Z","hidden":false}],"publishedAt":"2025-02-12T15:03:33.000Z","submittedOnDailyAt":"2025-02-14T02:02:15.420Z","title":"mmE5: Improving Multimodal Multilingual Embeddings via High-quality\n Synthetic Data","submittedOnDailyBy":{"_id":"66add675c7a575aa0e03d5f3","avatarUrl":"/avatars/b72b18130664c1de197c1f8df371aa70.svg","isPro":false,"fullname":"Haonan Chen","user":"Haon-Chen","type":"user"},"summary":"Multimodal embedding models have gained significant attention for their\nability to map data from different modalities, such as text and images, into a\nunified representation space. However, the limited labeled multimodal data\noften hinders embedding performance. Recent approaches have leveraged data\nsynthesis to address this problem, yet the quality of synthetic data remains a\ncritical bottleneck. In this work, we identify three criteria for high-quality\nsynthetic multimodal data. First, broad scope ensures that the generated data\ncovers diverse tasks and modalities, making it applicable to various downstream\nscenarios. 
Second, robust cross-modal alignment makes different modalities\nsemantically consistent. Third, high fidelity ensures that the synthetic data\nmaintains realistic details to enhance its reliability. Guided by these\nprinciples, we synthesize datasets that: (1) cover a wide range of tasks,\nmodality combinations, and languages, (2) are generated via a deep thinking\nprocess within a single pass of a multimodal large language model, and (3)\nincorporate real-world images with accurate and relevant texts, ensuring\nfidelity through self-evaluation and refinement. Leveraging these high-quality\nsynthetic and labeled datasets, we train a multimodal multilingual E5 model\nmmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art\nperformance on the MMEB Benchmark and superior multilingual performance on the\nXTD benchmark. Our codes, datasets and models are released in\nhttps://github.com/haon-chen/mmE5.","upvotes":15,"discussionId":"67ad5f3fcad644864b4366f5","ai_summary":"A deep learning approach synthesizes high-quality multimodal data, enhancing mmE5 model performance on benchmark tasks in multilingual and cross-modal embeddings.","ai_keywords":["multimodal embedding models","labeled multimodal data","data synthesis","deep thinking process","multimodal large language model","MMEB Benchmark","XTD benchmark"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"66f612b934b8ac9ffa44f084","avatarUrl":"/avatars/6836c122e19c66c90f1673f28b30d7f0.svg","isPro":false,"fullname":"Tang","user":"tommysally","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"64747f7e33192631bacd8831","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64747f7e33192631bacd8831/dstkZJ4sHJSeqLesV5cOC.jpeg","isPro":false,"fullname":"Taufiq Dwi Purnomo","user":"taufiqdp","type":"user"},{"_id":"635f9fd1ae7144a6674c839b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1667211208219-noauth.jpeg","isPro":false,"fullname":"Marcus Gawronsky","user":"marcusinthesky","type":"user"},{"_id":"638324f862badff43269e588","avatarUrl":"/avatars/907a39a9b44fc8b7f3fad35858b01fb7.svg","isPro":false,"fullname":"Asaf Yehudai","user":"Asaf-Yehudai","type":"user"},{"_id":"63b2a92e18e5cf2cdd333492","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63b2a92e18e5cf2cdd333492/GxnngJG0u7d0jYTEFOrfe.png","isPro":false,"fullname":"Jaehyun Jun","user":"btjhjeon","type":"user"},{"_id":"64d4615cf8082bf19b916492","avatarUrl":"/avatars/8e1b59565ec5e4b31090cf1b911781b9.svg","isPro":false,"fullname":"wongyukim","user":"wongyukim","type":"user"},{"_id":"65447c8aa3ab16953ba2cacc","avatarUrl":"/avatars/b287275a29959b277e17008a0303c822.svg","isPro":false,"fullname":"a11en0","user":"a11en000","type":"user"},{"_id":"6762b881f47f60b73c78548e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6762b881f47f60b73c78548e/di47VnUv9yUZGpYXAp5NC.jpeg","isPro":false,"fullname":"Yash 
Thube","user":"thubZ9","type":"user"},{"_id":"67b0a4aa8191c180b9421ba7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/67b0a4aa8191c180b9421ba7/xg3KQqy5nlqbaoIjxxRhJ.png","isPro":false,"fullname":"Taro","user":"KDoyoon","type":"user"},{"_id":"625e62452a7279d3c77b5c38","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/625e62452a7279d3c77b5c38/zJINew6U4_Gup4WTobb-0.jpeg","isPro":false,"fullname":"Yutao Zhu","user":"yutaozhu94","type":"user"},{"_id":"6710ac3fb4ee4920580a5f0e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6710ac3fb4ee4920580a5f0e/OhQQFlZmkmLQpMYqKCGP6.jpeg","isPro":false,"fullname":"Chenghao Zhang","user":"SnowNation","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
arXiv: 2502.08468

mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data

Published on Feb 12 · Submitted by Haonan Chen on Feb 14
Authors:
Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, Zhicheng Dou

Abstract

AI-generated summary

mmE5 is a multimodal multilingual embedding model trained on high-quality synthetic data generated by a multimodal large language model; it achieves state-of-the-art performance on the MMEB benchmark and superior multilingual performance on the XTD benchmark.

Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, the limited labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment makes different modalities semantically consistent. Third, high fidelity ensures that the synthetic data maintains realistic details to enhance its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages, (2) are generated via a deep thinking process within a single pass of a multimodal large language model, and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. Leveraging these high-quality synthetic and labeled datasets, we train a multimodal multilingual E5 model mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB Benchmark and superior multilingual performance on the XTD benchmark. Our code, datasets, and models are released at https://github.com/haon-chen/mmE5.
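Because the model maps text and images into a unified representation space, retrieval reduces to cosine similarity between embeddings. Below is a minimal, hedged sketch of how the released checkpoint might be queried with transformers; the Mllama classes, the <|image|> prompt token, and last-token pooling are assumptions based on common practice for Llama-3.2-Vision-style checkpoints, not the official interface. See the GitHub repo and model card for the supported usage.

```python
# Sketch: embed a text query and an image document, then compare them.
# Assumes the checkpoint loads with transformers' Mllama classes and that
# embeddings are pooled from the final token of the last hidden state.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "intfloat/mmE5-mllama-11b-instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def embed(text, image=None):
    """Encode text (optionally paired with an image) into a normalized vector."""
    prompt = ("<|image|>" + text) if image is not None else text
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True, return_dict=True)
    vec = out.hidden_states[-1][:, -1, :]   # last-token pooling (assumption)
    return torch.nn.functional.normalize(vec, dim=-1)

query = embed("Find an image of a cat sitting on a sofa.")
doc = embed("A photo.", image=Image.open("cat.jpg"))  # hypothetical example image
print((query @ doc.T).item())  # cosine similarity in the shared embedding space
```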

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* MLLM4PUE: Toward Universal Embeddings in Computational Pathology through Multimodal LLMs (2025), https://huggingface.co/papers/2502.07221
* GME: Improving Universal Multimodal Retrieval by Multimodal LLMs (2024), https://huggingface.co/papers/2412.16855
* SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning (2025), https://huggingface.co/papers/2501.03675
* MINIMA: Modality Invariant Image Matching (2024), https://huggingface.co/papers/2412.19412
* Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks (2025), https://huggingface.co/papers/2501.02527
* Asymmetric Reinforcing against Multi-modal Representation Bias (2025), https://huggingface.co/papers/2501.01240
* Multimodal Classification and Out-of-distribution Detection for Multimodal Intent Understanding (2024), https://huggingface.co/papers/2412.12453

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper: 1

Datasets citing this paper: 3

Spaces citing this paper: 1

Collections including this paper: 6
