\n","updatedAt":"2024-10-29T01:34:21.746Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7908233404159546},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2410.17856","authors":[{"_id":"6719e7d84dfd79aa9f3f6c69","user":{"_id":"6578459d62d3ac1817ed79fe","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6578459d62d3ac1817ed79fe/AXDJuwLUoEOb4Fj3U0Xxo.jpeg","isPro":false,"fullname":"Shaofei Cai","user":"phython96","type":"user"},"name":"Shaofei Cai","status":"claimed_verified","statusLastChangedAt":"2024-10-28T11:25:08.953Z","hidden":false},{"_id":"6719e7d84dfd79aa9f3f6c6a","user":{"_id":"642e8c99c1b0f8e4e76bcaab","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/642e8c99c1b0f8e4e76bcaab/BOs9r0P9KyT9pEba9v0H4.png","isPro":false,"fullname":"Zihao Wang","user":"zhwang4ai","type":"user"},"name":"Zihao Wang","status":"claimed_verified","statusLastChangedAt":"2024-12-17T08:04:45.236Z","hidden":false},{"_id":"6719e7d84dfd79aa9f3f6c6b","user":{"_id":"64c232e577655fcf3ff06082","avatarUrl":"/avatars/5fc3ae34273c96f761a1eb9d336c64b9.svg","isPro":false,"fullname":"Kewei Lian","user":"kevinLian","type":"user"},"name":"Kewei Lian","status":"claimed_verified","statusLastChangedAt":"2024-10-28T11:24:52.514Z","hidden":false},{"_id":"6719e7d84dfd79aa9f3f6c6c","user":{"_id":"648f0c64ddc0620d54e199e9","avatarUrl":"/avatars/ff3fd6e11d451b03f82183e53fc48613.svg","isPro":false,"fullname":"ZhancunMu","user":"Zhancun","type":"user"},"name":"Zhancun Mu","status":"admin_assigned","statusLastChangedAt":"2024-10-28T15:17:23.469Z","hidden":false},{"_id":"6719e7d84dfd79aa9f3f6c6d","user":{"_id":"60dd0e36a15ddd7d2006d2e9","avatarUrl":"/avatars/8bd98177a79efbf295be8f6457683297.svg","isPro":false,"fullname":"Xiaojian Ma","user":"jeasinema","type":"user"},"name":"Xiaojian Ma","status":"admin_assigned","statusLastChangedAt":"2024-10-28T15:17:12.954Z","hidden":false},{"_id":"6719e7d84dfd79aa9f3f6c6e","user":{"_id":"66757955b3882fd587d5f363","avatarUrl":"/avatars/41fdfa19057ad3ea5ece29b94f163218.svg","isPro":false,"fullname":"Anji Liu","user":"anjiliu","type":"user"},"name":"Anji Liu","status":"claimed_verified","statusLastChangedAt":"2024-10-28T15:16:10.809Z","hidden":false},{"_id":"6719e7d84dfd79aa9f3f6c6f","user":{"_id":"64683a5776bb704aa14588b7","avatarUrl":"/avatars/e532756f52c5b95981470ace41a10556.svg","isPro":false,"fullname":"Yitao Liang","user":"YitaoLiang","type":"user"},"name":"Yitao Liang","status":"claimed_verified","statusLastChangedAt":"2025-04-09T07:38:25.072Z","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/6578459d62d3ac1817ed79fe/qwPSKAEEaYYo-R3TmnmUr.mp4"],"publishedAt":"2024-10-23T13:26:59.000Z","submittedOnDailyAt":"2024-10-28T02:54:44.530Z","title":"ROCKET-1: Master Open-World Interaction with Visual-Temporal Context\n 
Prompting","submittedOnDailyBy":{"_id":"6578459d62d3ac1817ed79fe","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6578459d62d3ac1817ed79fe/AXDJuwLUoEOb4Fj3U0Xxo.jpeg","isPro":false,"fullname":"Shaofei Cai","user":"phython96","type":"user"},"summary":"Vision-language models (VLMs) have excelled in multimodal tasks, but adapting\nthem to embodied decision-making in open-world environments presents\nchallenges. A key issue is the difficulty in smoothly connecting individual\nentities in low-level observations with abstract concepts required for\nplanning. A common approach to address this problem is through the use of\nhierarchical agents, where VLMs serve as high-level reasoners that break down\ntasks into executable sub-tasks, typically specified using language and\nimagined observations. However, language often fails to effectively convey\nspatial information, while generating future images with sufficient accuracy\nremains challenging. To address these limitations, we propose visual-temporal\ncontext prompting, a novel communication protocol between VLMs and policy\nmodels. This protocol leverages object segmentation from both past and present\nobservations to guide policy-environment interactions. Using this approach, we\ntrain ROCKET-1, a low-level policy that predicts actions based on concatenated\nvisual observations and segmentation masks, with real-time object tracking\nprovided by SAM-2. Our method unlocks the full potential of VLMs\nvisual-language reasoning abilities, enabling them to solve complex creative\ntasks, especially those heavily reliant on spatial understanding. Experiments\nin Minecraft demonstrate that our approach allows agents to accomplish\npreviously unattainable tasks, highlighting the effectiveness of\nvisual-temporal context prompting in embodied decision-making. 
Codes and demos\nwill be available on the project page: https://craftjarvis.github.io/ROCKET-1.","upvotes":51,"discussionId":"6719e7db4dfd79aa9f3f6d18","projectPage":"https://craftjarvis.github.io/ROCKET-1/","githubRepo":"https://github.com/CraftJarvis/ROCKET-1","ai_summary":"Visual-temporal context prompting integrates object segmentation and visual observations to enhance VLMs in embodied decision-making, enabling complex spatial tasks.","ai_keywords":["vision-language models","hierarchical agents","visual-temporal context prompting","object segmentation","policy models","real-time object tracking","SAM-2","ROCKET-1","Minecraft"],"githubStars":45},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"64d98ef7a4839890b25eb78b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64d98ef7a4839890b25eb78b/215-CSVLl81z6CAq0ECWU.jpeg","isPro":true,"fullname":"Fangyuan Yu","user":"Ksgk-fy","type":"user"},{"_id":"6578459d62d3ac1817ed79fe","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6578459d62d3ac1817ed79fe/AXDJuwLUoEOb4Fj3U0Xxo.jpeg","isPro":false,"fullname":"Shaofei Cai","user":"phython96","type":"user"},{"_id":"642e8c99c1b0f8e4e76bcaab","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/642e8c99c1b0f8e4e76bcaab/BOs9r0P9KyT9pEba9v0H4.png","isPro":false,"fullname":"Zihao Wang","user":"zhwang4ai","type":"user"},{"_id":"6433c6ad7b8247480106c189","avatarUrl":"/avatars/42d84a7bbcfb9f5cc60bb460e6375f14.svg","isPro":false,"fullname":"Aurora","user":"xiaojuan0920","type":"user"},{"_id":"648f0c64ddc0620d54e199e9","avatarUrl":"/avatars/ff3fd6e11d451b03f82183e53fc48613.svg","isPro":false,"fullname":"ZhancunMu","user":"Zhancun","type":"user"},{"_id":"661fc68885e64617eddcb6a7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/661fc68885e64617eddcb6a7/my987TTCRNmDy-oQuai6r.jpeg","isPro":false,"fullname":"Limuyao","user":"limuyu011","type":"user"},{"_id":"6342796a0875f2c99cfd313b","avatarUrl":"/avatars/98575092404c4197b20c929a6499a015.svg","isPro":false,"fullname":"Yuseung \"Phillip\" Lee","user":"phillipinseoul","type":"user"},{"_id":"64683a5776bb704aa14588b7","avatarUrl":"/avatars/e532756f52c5b95981470ace41a10556.svg","isPro":false,"fullname":"Yitao Liang","user":"YitaoLiang","type":"user"},{"_id":"65331f72b3852ed1ce9c5c06","avatarUrl":"/avatars/b2704f0820ca1f2a561742c978ce75e4.svg","isPro":false,"fullname":"bigainlco","user":"bigainlco","type":"user"},{"_id":"671f1c0a77035878c53ec5c6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/GsTX3fV3yrRSKF-sRWpQ7.png","isPro":false,"fullname":"Yi Hu","user":"huyi2002","type":"user"},{"_id":"64c232e577655fcf3ff06082","avatarUrl":"/avatars/5fc3ae34273c96f761a1eb9d336c64b9.svg","isPro":false,"fullname":"Kewei Lian","user":"kevinLian","type":"user"},{"_id":"5f32b2367e583543386214d9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1635314457124-5f32b2367e583543386214d9.jpeg","isPro":false,"fullname":"Sergei Averkiev","user":"averoo","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":1}">
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context
Prompting
Published on Oct 23, 2024
#1 Paper of the day
Abstract
Visual-temporal context prompting integrates object segmentation and visual observations to enhance VLMs in embodied decision-making, enabling complex spatial tasks.
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges. A key issue is the difficulty of smoothly connecting individual entities in low-level observations with the abstract concepts required for planning. A common approach to this problem is to use hierarchical agents, where VLMs serve as high-level reasoners that break tasks down into executable sub-tasks, typically specified using language and imagined observations. However, language often fails to effectively convey spatial information, while generating future images with sufficient accuracy remains challenging. To address these limitations, we propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from both past and present observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, with real-time object tracking provided by SAM-2. Our method unlocks the full potential of VLMs' visual-language reasoning abilities, enabling them to solve complex creative tasks, especially those that rely heavily on spatial understanding. Experiments in Minecraft demonstrate that our approach allows agents to accomplish previously unattainable tasks, highlighting the effectiveness of visual-temporal context prompting in embodied decision-making. Code and demos will be available on the project page: https://craftjarvis.github.io/ROCKET-1.
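
The abstract describes the policy's input interface: the current visual observation concatenated channel-wise with an object segmentation mask that highlights the prompted entity, with the mask tracked at inference time by SAM-2. The snippet below is a minimal PyTorch sketch of that mask-conditioned input format, not the authors' ROCKET-1 implementation; the network layout, action-space size, and all names are assumptions made purely for illustration.

```python
# Illustrative sketch (not the ROCKET-1 architecture): a policy that takes an
# RGB observation concatenated channel-wise with a binary segmentation mask of
# the prompted object and outputs per-step action logits.
import torch
import torch.nn as nn

class MaskConditionedPolicy(nn.Module):
    def __init__(self, num_actions: int = 25):  # action-space size is assumed
        super().__init__()
        # 3 RGB channels + 1 mask channel -> 4 input channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_actions)

    def forward(self, rgb: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # rgb:  (B, 3, H, W) in [0, 1]
        # mask: (B, 1, H, W) binary segmentation of the prompted object,
        #       e.g. produced by an external tracker such as SAM-2.
        x = torch.cat([rgb, mask], dim=1)   # channel-wise concatenation
        return self.head(self.encoder(x))   # action logits

policy = MaskConditionedPolicy()
obs = torch.rand(1, 3, 128, 128)                      # dummy observation
mask = (torch.rand(1, 1, 128, 128) > 0.5).float()     # dummy mask
logits = policy(obs, mask)                            # shape (1, 25)
action = logits.argmax(dim=-1)                        # greedy action, for illustration
```

In the full system described in the abstract, the mask for each frame would come from SAM-2 tracking an object selected by the high-level VLM; the random tensors above exist only to exercise the input shapes.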