\n","updatedAt":"2024-01-22T14:14:49.724Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7513877749443054},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2401.09340","authors":[{"_id":"65a8ab10f45ee5e5b56d5ca5","user":{"_id":"6304b389bad6ce7fc02691d5","avatarUrl":"/avatars/a762ca59624ce409650165f36b973488.svg","isPro":false,"fullname":"Baoxiong Jia","user":"BuzzBeater","type":"user"},"name":"Baoxiong Jia","status":"admin_assigned","statusLastChangedAt":"2024-01-18T08:41:38.451Z","hidden":false},{"_id":"65a8ab10f45ee5e5b56d5ca6","user":{"_id":"64c396def6fe448b1ad553d6","avatarUrl":"/avatars/b2ce4739f42dc00ee974fff7ee1cb301.svg","isPro":false,"fullname":"Yixin Chen","user":"YixinChen","type":"user"},"name":"Yixin Chen","status":"admin_assigned","statusLastChangedAt":"2024-01-18T08:42:00.150Z","hidden":false},{"_id":"65a8ab10f45ee5e5b56d5ca7","user":{"_id":"65a8cbd4669921943c3c0963","avatarUrl":"/avatars/c85e1253b8e3ea71eef677d5cd356a55.svg","isPro":false,"fullname":"Yu","user":"Huangyueeeee","type":"user"},"name":"Huangyue Yu","status":"admin_assigned","statusLastChangedAt":"2024-01-18T08:43:13.830Z","hidden":false},{"_id":"65a8ab10f45ee5e5b56d5ca8","user":{"_id":"65a8cc0516e8e332e7515d7d","avatarUrl":"/avatars/4a84d28730776962242db6129cf0c4c1.svg","isPro":false,"fullname":"Yan Wang","user":"Yannnnnnnn","type":"user"},"name":"Yan Wang","status":"claimed_verified","statusLastChangedAt":"2024-01-18T08:19:33.225Z","hidden":false},{"_id":"65a8ab10f45ee5e5b56d5ca9","user":{"_id":"64620ca8a489ecb6b6b5b4e4","avatarUrl":"/avatars/9bf8060cb7bc59c0369ddf14c50aa480.svg","isPro":false,"fullname":"Xuesong Niu","user":"nxsEdson","type":"user"},"name":"Xuesong Niu","status":"admin_assigned","statusLastChangedAt":"2024-01-18T08:43:19.971Z","hidden":false},{"_id":"65a8ab10f45ee5e5b56d5caa","user":{"_id":"634379a81bdd3dfa55dcbe82","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1665366425318-noauth.png","isPro":false,"fullname":"Tengyu Liu","user":"oxrider","type":"user"},"name":"Tengyu Liu","status":"admin_assigned","statusLastChangedAt":"2024-01-18T08:43:27.043Z","hidden":false},{"_id":"65a8ab10f45ee5e5b56d5cab","user":{"_id":"6455f8c2c8f569b995d603c9","avatarUrl":"/avatars/3cb2cad3ab123887c270234bb6a4ca43.svg","isPro":false,"fullname":"Qing Li","user":"li-qing","type":"user"},"name":"Qing Li","status":"claimed_verified","statusLastChangedAt":"2024-08-03T08:26:27.371Z","hidden":false},{"_id":"65a8ab10f45ee5e5b56d5cac","user":{"_id":"63c7a33121bd95f80ed74652","avatarUrl":"/avatars/7dd59afea785a2bff0ec2b757abd474e.svg","isPro":false,"fullname":"Siyuan Huang","user":"thuhsy","type":"user"},"name":"Siyuan Huang","status":"admin_assigned","statusLastChangedAt":"2024-01-18T08:44:13.905Z","hidden":false}],"publishedAt":"2024-01-17T17:04:35.000Z","submittedOnDailyAt":"2024-01-18T03:34:59.759Z","title":"SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene\n 
Understanding","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"3D vision-language grounding, which focuses on aligning language with the 3D\nphysical environment, stands as a cornerstone in the development of embodied\nagents. In comparison to recent advancements in the 2D domain, grounding\nlanguage in 3D scenes faces several significant challenges: (i) the inherent\ncomplexity of 3D scenes due to the diverse object configurations, their rich\nattributes, and intricate relationships; (ii) the scarcity of paired 3D\nvision-language data to support grounded learning; and (iii) the absence of a\nunified learning framework to distill knowledge from grounded 3D data. In this\nwork, we aim to address these three major challenges in 3D vision-language by\nexamining the potential of systematically upscaling 3D vision-language learning\nin indoor environments. We introduce the first million-scale 3D vision-language\ndataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising\n2.5M vision-language pairs derived from both human annotations and our scalable\nscene-graph-based generation approach. We demonstrate that this scaling allows\nfor a unified pre-training framework, Grounded Pre-training for Scenes (GPS),\nfor 3D vision-language learning. Through extensive experiments, we showcase the\neffectiveness of GPS by achieving state-of-the-art performance on all existing\n3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is\nunveiled through zero-shot transfer experiments in the challenging 3D\nvision-language tasks. Project website: https://scene-verse.github.io .","upvotes":21,"discussionId":"65a8ab14f45ee5e5b56d5dde","ai_summary":"A large-scale 3D vision-language dataset and pre-training framework achieve state-of-the-art results in 3D visual grounding tasks.","ai_keywords":["3D vision-language grounding","grounded learning","3D scenes","object configurations","rich attributes","intricate relationships","paired 3D vision-language data","unified learning framework","SceneVerse","vision-language pairs","scene-graph-based generation","Grounded Pre-training for Scenes","GPS","3D visual grounding benchmarks","zero-shot transfer"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"63c7a33121bd95f80ed74652","avatarUrl":"/avatars/7dd59afea785a2bff0ec2b757abd474e.svg","isPro":false,"fullname":"Siyuan Huang","user":"thuhsy","type":"user"},{"_id":"65a8cbd4669921943c3c0963","avatarUrl":"/avatars/c85e1253b8e3ea71eef677d5cd356a55.svg","isPro":false,"fullname":"Yu","user":"Huangyueeeee","type":"user"},{"_id":"6455f8c2c8f569b995d603c9","avatarUrl":"/avatars/3cb2cad3ab123887c270234bb6a4ca43.svg","isPro":false,"fullname":"Qing Li","user":"li-qing","type":"user"},{"_id":"64c1fd85e82e55936c0ae202","avatarUrl":"/avatars/eff16777a636eb2dfa192b68bb20ce0e.svg","isPro":false,"fullname":"Peiyuan Zhi","user":"zhipy","type":"user"},{"_id":"63a30e84528eba15b13f7355","avatarUrl":"/avatars/79e26058c60fe0bc9f34f372b2e69ead.svg","isPro":false,"fullname":"Zan Wang","user":"SuperZan","type":"user"},{"_id":"65a8cc0516e8e332e7515d7d","avatarUrl":"/avatars/4a84d28730776962242db6129cf0c4c1.svg","isPro":false,"fullname":"Yan 
Wang","user":"Yannnnnnnn","type":"user"},{"_id":"636de85cc4a7a729c164d2b5","avatarUrl":"/avatars/3e281e547e1697e1c06805e7e63f3918.svg","isPro":false,"fullname":"Yu Liu","user":"YuLiu","type":"user"},{"_id":"65a8cfb15e49cc9fdc7dcd3e","avatarUrl":"/avatars/ae4616f5b457da5b2a636215ef3d59ed.svg","isPro":false,"fullname":"XiongkunLinghu","user":"EricLHK","type":"user"},{"_id":"6201fc5d91d53938a6432fbf","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6201fc5d91d53938a6432fbf/VLs8ZYaZrop4KBpZn53fH.jpeg","isPro":false,"fullname":"Runpei Dong","user":"RunpeiDong","type":"user"},{"_id":"6304b389bad6ce7fc02691d5","avatarUrl":"/avatars/a762ca59624ce409650165f36b973488.svg","isPro":false,"fullname":"Baoxiong Jia","user":"BuzzBeater","type":"user"},{"_id":"63b5d2a90d5913eee486a945","avatarUrl":"/avatars/1e1ea28c93e88ac8a342a3e1e334ebe0.svg","isPro":false,"fullname":"Nan Qiao","user":"Nannnn","type":"user"},{"_id":"6032802e1f993496bc14d9e3","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6032802e1f993496bc14d9e3/w6hr-DEQot4VVkoyRIBiy.png","isPro":false,"fullname":"Omar Sanseviero","user":"osanseviero","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
Papers
arxiv:2401.09340

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Published on Jan 17, 2024
· Submitted by AK on Jan 18, 2024

Abstract

AI-generated summary

A large-scale 3D vision-language dataset and pre-training framework achieve state-of-the-art results in 3D visual grounding tasks.

3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in the challenging 3D vision-language tasks. Project website: https://scene-verse.github.io .
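The scene-graph-based generation approach is only named in the abstract, not detailed on this page. As a rough illustration of the general idea — templating grounded captions from object nodes and spatial relations in a scene graph — the Python sketch below is purely illustrative; the class names, fields, and templates are assumptions and do not reflect the actual SceneVerse pipeline.

```python
# Hypothetical sketch: deriving vision-language pairs from a 3D scene graph.
# All names and templates below are illustrative assumptions, not the
# SceneVerse data-generation code.
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class ObjectNode:
    obj_id: int
    category: str            # e.g. "chair"
    attributes: List[str]    # e.g. ["wooden", "brown"]


@dataclass
class Relation:
    subject: int             # obj_id of the object being described
    predicate: str           # e.g. "next to", "on top of"
    target: int              # obj_id of the reference object


def describe(node: ObjectNode) -> str:
    """Turn one object node into a short noun phrase."""
    attrs = " ".join(node.attributes)
    return f"the {attrs} {node.category}".replace("  ", " ").strip()


def generate_pairs(
    objects: List[ObjectNode], relations: List[Relation]
) -> Iterator[Tuple[int, str]]:
    """Yield (object_id, caption) pairs by filling simple relation templates.

    Each caption grounds one object via a spatial relation to another object,
    which is the kind of referring expression a scene graph supports at scale.
    """
    by_id = {o.obj_id: o for o in objects}
    for rel in relations:
        subj, tgt = by_id[rel.subject], by_id[rel.target]
        caption = f"{describe(subj)} {rel.predicate} {describe(tgt)}"
        yield rel.subject, caption


# Toy two-object "scene":
objects = [
    ObjectNode(0, "chair", ["wooden"]),
    ObjectNode(1, "table", ["round", "white"]),
]
relations = [Relation(subject=0, predicate="next to", target=1)]

for obj_id, caption in generate_pairs(objects, relations):
    print(obj_id, "->", caption)
    # 0 -> the wooden chair next to the round white table
```

In practice, templated captions like these would typically be rewritten or filtered (for example with a language model) before being used as training pairs; this page does not describe those steps.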

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2401.09340 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2401.09340 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2401.09340 in a Space README.md to link it from this page.

Collections including this paper 3
