arXiv:2409.13407

Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model

Published on Sep 20, 2024
Authors: Li Zhou, Xu Yuan, Zenghui Sun, Zikun Zhou, Jingsong Lan

Abstract

A Multi-Granularity Large Multimodal Model (MGLMM) enables adjustable granularity segmentation and captioning, demonstrating superior performance across multiple downstream tasks with a newly established benchmark.

AI-generated summary

Large Multimodal Models (LMMs) have achieved significant progress by extending large language models. Building on this progress, the latest developments in LMMs demonstrate the ability to generate dense pixel-wise segmentation through the integration of segmentation models. Despite these innovations, the textual responses and segmentation masks of existing works remain at the instance level, showing limited ability to perform fine-grained understanding and segmentation even when provided with detailed textual cues. To overcome this limitation, we introduce the Multi-Granularity Large Multimodal Model (MGLMM), which can seamlessly adjust the granularity of Segmentation and Captioning (SegCap) following user instructions, from panoptic SegCap to fine-grained SegCap. We name this new task Multi-Granularity Segmentation and Captioning (MGSC). Observing the lack of a benchmark for model training and evaluation on the MGSC task, we establish a benchmark with masks and captions aligned at multiple granularities using our customized automated annotation pipeline. This benchmark comprises 10K images and more than 30K image-question pairs. We will release our dataset, along with the implementation of our automated dataset annotation pipeline, for further research. In addition, we propose a novel unified SegCap data format that unifies heterogeneous segmentation datasets; it effectively facilitates learning to associate object concepts with visual features during multi-task training. Extensive experiments demonstrate that our MGLMM excels at tackling more than eight downstream tasks and achieves state-of-the-art performance in MGSC, grounded conversation generation (GCG), image captioning, referring segmentation, multiple and empty segmentation, and reasoning segmentation. The strong performance and versatility of MGLMM underscore its potential impact on advancing multimodal research.
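The unified SegCap data format itself is not specified on this page. As a rough illustration only, a record that pairs one instruction with granularity-aligned masks and captions might be organized like the sketch below; all class and field names here are hypothetical, not the authors' released schema.

```python
# Hypothetical sketch of a unified SegCap training record.
# Field names are illustrative assumptions, not the MGLMM authors' schema.
from dataclasses import dataclass, field


@dataclass
class SegCapRegion:
    caption: str   # textual description of one segmented region
    mask_rle: str  # segmentation mask, e.g. COCO run-length encoding


@dataclass
class SegCapSample:
    image_path: str
    instruction: str  # user instruction that sets the task
    granularity: str  # e.g. "panoptic" or "fine-grained"
    regions: list[SegCapRegion] = field(default_factory=list)


sample = SegCapSample(
    image_path="images/000123.jpg",
    instruction="Segment the image at the panoptic level and caption every region.",
    granularity="panoptic",
    regions=[
        SegCapRegion(caption="a brown dog running", mask_rle="<rle>"),
        SegCapRegion(caption="a grassy field", mask_rle="<rle>"),
    ],
)
```

Whatever the concrete schema, keeping masks and captions together in one instruction-conditioned record is what allows heterogeneous segmentation datasets to be converted into a single training format.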

Community

@librarian-bot recommend


This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos](https://huggingface.co/papers/2411.04923) (2024)
* [SegLLM: Multi-round Reasoning Segmentation](https://huggingface.co/papers/2410.18923) (2024)
* [Text4Seg: Reimagining Image Segmentation as Text Generation](https://huggingface.co/papers/2410.09855) (2024)
* [PUMA: Empowering Unified MLLM with Multi-granular Visual Generation](https://huggingface.co/papers/2410.13861) (2024)
* [Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels](https://huggingface.co/papers/2409.19846) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
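For readers who prefer to query recommendations directly, here is a minimal sketch using the public Semantic Scholar Recommendations API that the bot mentions. The endpoint path and response fields reflect my understanding of the public API and should be verified against its current documentation.

```python
# Sketch: fetch papers similar to this one via the Semantic Scholar
# Recommendations API. Endpoint shape is an assumption based on the
# public API docs; verify before relying on it.
import requests

ARXIV_ID = "2409.13407"  # this paper
url = (
    "https://api.semanticscholar.org/recommendations/v1/"
    f"papers/forpaper/ArXiv:{ARXIV_ID}"
)

resp = requests.get(
    url,
    params={"fields": "title,externalIds", "limit": 5},
    timeout=30,
)
resp.raise_for_status()

for paper in resp.json().get("recommendedPapers", []):
    arxiv = (paper.get("externalIds") or {}).get("ArXiv", "n/a")
    print(f"{paper['title']}  (arXiv:{arxiv})")
```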


Models citing this paper (0)

No model links to this paper yet.

Cite arxiv.org/abs/2409.13407 in a model README.md to link it from this page.
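For example, adding a line such as `[MGLMM](https://arxiv.org/abs/2409.13407)`, or the bare URL, to a model's README.md is enough for Hugging Face to detect the arXiv reference and list the repository on this page; the same mechanism applies to the Datasets and Spaces sections below.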

Datasets citing this paper (0)

No dataset links to this paper yet.

Cite arxiv.org/abs/2409.13407 in a dataset README.md to link it from this page.

Spaces citing this paper (0)

No Space links to this paper yet.

Cite arxiv.org/abs/2409.13407 in a Space README.md to link it from this page.

Collections including this paper (0)

No Collection includes this paper yet.

Add this paper to a collection to link it from this page.