\n","updatedAt":"2024-01-22T14:14:49.724Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7513877749443054},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2401.09340","authors":[{"_id":"65a8ab10f45ee5e5b56d5ca5","user":{"_id":"6304b389bad6ce7fc02691d5","avatarUrl":"/avatars/a762ca59624ce409650165f36b973488.svg","isPro":false,"fullname":"Baoxiong Jia","user":"BuzzBeater","type":"user"},"name":"Baoxiong Jia","status":"admin_assigned","statusLastChangedAt":"2024-01-18T08:41:38.451Z","hidden":false},{"_id":"65a8ab10f45ee5e5b56d5ca6","user":{"_id":"64c396def6fe448b1ad553d6","avatarUrl":"/avatars/b2ce4739f42dc00ee974fff7ee1cb301.svg","isPro":false,"fullname":"Yixin Chen","user":"YixinChen","type":"user"},"name":"Yixin Chen","status":"admin_assigned","statusLastChangedAt":"2024-01-18T08:42:00.150Z","hidden":false},{"_id":"65a8ab10f45ee5e5b56d5ca7","user":{"_id":"65a8cbd4669921943c3c0963","avatarUrl":"/avatars/c85e1253b8e3ea71eef677d5cd356a55.svg","isPro":false,"fullname":"Yu","user":"Huangyueeeee","type":"user"},"name":"Huangyue Yu","status":"admin_assigned","statusLastChangedAt":"2024-01-18T08:43:13.830Z","hidden":false},{"_id":"65a8ab10f45ee5e5b56d5ca8","user":{"_id":"65a8cc0516e8e332e7515d7d","avatarUrl":"/avatars/4a84d28730776962242db6129cf0c4c1.svg","isPro":false,"fullname":"Yan Wang","user":"Yannnnnnnn","type":"user"},"name":"Yan Wang","status":"claimed_verified","statusLastChangedAt":"2024-01-18T08:19:33.225Z","hidden":false},{"_id":"65a8ab10f45ee5e5b56d5ca9","user":{"_id":"64620ca8a489ecb6b6b5b4e4","avatarUrl":"/avatars/9bf8060cb7bc59c0369ddf14c50aa480.svg","isPro":false,"fullname":"Xuesong Niu","user":"nxsEdson","type":"user"},"name":"Xuesong Niu","status":"admin_assigned","statusLastChangedAt":"2024-01-18T08:43:19.971Z","hidden":false},{"_id":"65a8ab10f45ee5e5b56d5caa","user":{"_id":"634379a81bdd3dfa55dcbe82","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1665366425318-noauth.png","isPro":false,"fullname":"Tengyu Liu","user":"oxrider","type":"user"},"name":"Tengyu Liu","status":"admin_assigned","statusLastChangedAt":"2024-01-18T08:43:27.043Z","hidden":false},{"_id":"65a8ab10f45ee5e5b56d5cab","user":{"_id":"6455f8c2c8f569b995d603c9","avatarUrl":"/avatars/3cb2cad3ab123887c270234bb6a4ca43.svg","isPro":false,"fullname":"Qing Li","user":"li-qing","type":"user"},"name":"Qing Li","status":"claimed_verified","statusLastChangedAt":"2024-08-03T08:26:27.371Z","hidden":false},{"_id":"65a8ab10f45ee5e5b56d5cac","user":{"_id":"63c7a33121bd95f80ed74652","avatarUrl":"/avatars/7dd59afea785a2bff0ec2b757abd474e.svg","isPro":false,"fullname":"Siyuan Huang","user":"thuhsy","type":"user"},"name":"Siyuan Huang","status":"admin_assigned","statusLastChangedAt":"2024-01-18T08:44:13.905Z","hidden":false}],"publishedAt":"2024-01-17T17:04:35.000Z","submittedOnDailyAt":"2024-01-18T03:34:59.759Z","title":"SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene\n 
Understanding","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"3D vision-language grounding, which focuses on aligning language with the 3D\nphysical environment, stands as a cornerstone in the development of embodied\nagents. In comparison to recent advancements in the 2D domain, grounding\nlanguage in 3D scenes faces several significant challenges: (i) the inherent\ncomplexity of 3D scenes due to the diverse object configurations, their rich\nattributes, and intricate relationships; (ii) the scarcity of paired 3D\nvision-language data to support grounded learning; and (iii) the absence of a\nunified learning framework to distill knowledge from grounded 3D data. In this\nwork, we aim to address these three major challenges in 3D vision-language by\nexamining the potential of systematically upscaling 3D vision-language learning\nin indoor environments. We introduce the first million-scale 3D vision-language\ndataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising\n2.5M vision-language pairs derived from both human annotations and our scalable\nscene-graph-based generation approach. We demonstrate that this scaling allows\nfor a unified pre-training framework, Grounded Pre-training for Scenes (GPS),\nfor 3D vision-language learning. Through extensive experiments, we showcase the\neffectiveness of GPS by achieving state-of-the-art performance on all existing\n3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is\nunveiled through zero-shot transfer experiments in the challenging 3D\nvision-language tasks. Project website: https://scene-verse.github.io .","upvotes":21,"discussionId":"65a8ab14f45ee5e5b56d5dde","ai_summary":"A large-scale 3D vision-language dataset and pre-training framework achieve state-of-the-art results in 3D visual grounding tasks.","ai_keywords":["3D vision-language grounding","grounded learning","3D scenes","object configurations","rich attributes","intricate relationships","paired 3D vision-language data","unified learning framework","SceneVerse","vision-language pairs","scene-graph-based generation","Grounded Pre-training for Scenes","GPS","3D visual grounding benchmarks","zero-shot transfer"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"63c7a33121bd95f80ed74652","avatarUrl":"/avatars/7dd59afea785a2bff0ec2b757abd474e.svg","isPro":false,"fullname":"Siyuan Huang","user":"thuhsy","type":"user"},{"_id":"65a8cbd4669921943c3c0963","avatarUrl":"/avatars/c85e1253b8e3ea71eef677d5cd356a55.svg","isPro":false,"fullname":"Yu","user":"Huangyueeeee","type":"user"},{"_id":"6455f8c2c8f569b995d603c9","avatarUrl":"/avatars/3cb2cad3ab123887c270234bb6a4ca43.svg","isPro":false,"fullname":"Qing Li","user":"li-qing","type":"user"},{"_id":"64c1fd85e82e55936c0ae202","avatarUrl":"/avatars/eff16777a636eb2dfa192b68bb20ce0e.svg","isPro":false,"fullname":"Peiyuan Zhi","user":"zhipy","type":"user"},{"_id":"63a30e84528eba15b13f7355","avatarUrl":"/avatars/79e26058c60fe0bc9f34f372b2e69ead.svg","isPro":false,"fullname":"Zan Wang","user":"SuperZan","type":"user"},{"_id":"65a8cc0516e8e332e7515d7d","avatarUrl":"/avatars/4a84d28730776962242db6129cf0c4c1.svg","isPro":false,"fullname":"Yan 
Wang","user":"Yannnnnnnn","type":"user"},{"_id":"636de85cc4a7a729c164d2b5","avatarUrl":"/avatars/3e281e547e1697e1c06805e7e63f3918.svg","isPro":false,"fullname":"Yu Liu","user":"YuLiu","type":"user"},{"_id":"65a8cfb15e49cc9fdc7dcd3e","avatarUrl":"/avatars/ae4616f5b457da5b2a636215ef3d59ed.svg","isPro":false,"fullname":"XiongkunLinghu","user":"EricLHK","type":"user"},{"_id":"6201fc5d91d53938a6432fbf","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6201fc5d91d53938a6432fbf/VLs8ZYaZrop4KBpZn53fH.jpeg","isPro":false,"fullname":"Runpei Dong","user":"RunpeiDong","type":"user"},{"_id":"6304b389bad6ce7fc02691d5","avatarUrl":"/avatars/a762ca59624ce409650165f36b973488.svg","isPro":false,"fullname":"Baoxiong Jia","user":"BuzzBeater","type":"user"},{"_id":"63b5d2a90d5913eee486a945","avatarUrl":"/avatars/1e1ea28c93e88ac8a342a3e1e334ebe0.svg","isPro":false,"fullname":"Nan Qiao","user":"Nannnn","type":"user"},{"_id":"6032802e1f993496bc14d9e3","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6032802e1f993496bc14d9e3/w6hr-DEQot4VVkoyRIBiy.png","isPro":false,"fullname":"Omar Sanseviero","user":"osanseviero","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
Abstract
A large-scale 3D vision-language dataset and pre-training framework achieve state-of-the-art results in 3D visual grounding tasks.
3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in the challenging 3D vision-language tasks. Project website: https://scene-verse.github.io .
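To make the scene-graph-based generation idea concrete, below is a minimal, hedged sketch of how object-relation triples from a 3D scene graph could be templated into vision-language pairs. This is an illustration of the general technique only, not the SceneVerse pipeline; all class names, fields, and the function below are hypothetical.

```python
# Illustrative sketch (not the SceneVerse implementation): turn scene-graph
# triples (object, relation, object) into simple templated descriptions,
# yielding one text description per relation in the graph.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SceneObject:
    obj_id: int
    category: str           # e.g. "chair"
    attributes: List[str]   # e.g. ["wooden", "brown"]


@dataclass
class Relation:
    subject_id: int
    predicate: str          # e.g. "next to", "on top of"
    object_id: int


def generate_descriptions(objects: List[SceneObject],
                          relations: List[Relation]) -> List[str]:
    """Convert scene-graph triples into templated referring expressions."""
    by_id: Dict[int, SceneObject] = {o.obj_id: o for o in objects}
    texts = []
    for rel in relations:
        subj, obj = by_id[rel.subject_id], by_id[rel.object_id]
        subj_desc = " ".join(subj.attributes + [subj.category])
        obj_desc = " ".join(obj.attributes + [obj.category])
        texts.append(f"The {subj_desc} is {rel.predicate} the {obj_desc}.")
    return texts


# Toy example:
objects = [SceneObject(0, "chair", ["wooden"]), SceneObject(1, "table", ["round"])]
relations = [Relation(0, "next to", 1)]
print(generate_descriptions(objects, relations))
# ['The wooden chair is next to the round table.']
```

Each generated description would then be paired with the corresponding 3D scene (or object region) to form a vision-language pair; the paper additionally uses human annotations as a second source of pairs.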
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding (2024)
- Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers (2023)
- LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding (2023)
- EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI (2023)
- M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper