Comment from librarian-bot (automated):

This is an automated message from the Librarian Bot (https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model (https://huggingface.co/papers/2504.07615) (2025)
* Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 (https://huggingface.co/papers/2503.24376) (2025)
* SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models (https://huggingface.co/papers/2504.11468) (2025)
* NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation (https://huggingface.co/papers/2504.13055) (2025)
* OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning (https://huggingface.co/papers/2505.08617) (2025)
* Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme (https://huggingface.co/papers/2504.02587) (2025)
* VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning (https://huggingface.co/papers/2505.12081) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

One RL to See Them All: Visual Triple Unified Reinforcement Learning

Authors: Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, Junjie Yan

Published on May 23, 2025. Code: https://github.com/MiniMax-AI/One-RL-to-See-Them-All
AI-generated summary

A unified reinforcement learning system, V-Triune, combines visual reasoning and perception tasks in vision-language models through a single training pipeline, achieving significant improvements across various tasks.

Abstract
Reinforcement learning (RL) has significantly advanced the reasoning
capabilities of vision-language models (VLMs). However, the use of RL beyond
reasoning tasks remains largely unexplored, especially for perception-intensive
tasks like object detection and grounding. We propose V-Triune, a Visual Triple
Unified Reinforcement Learning system that enables VLMs to jointly learn visual
reasoning and perception tasks within a single training pipeline. V-Triune
comprises triple complementary components: Sample-Level Data Formatting (to
unify diverse task inputs), Verifier-Level Reward Computation (to deliver
custom rewards via specialized verifiers), and Source-Level Metric Monitoring
(to diagnose problems at the data-source level). We further introduce a novel
Dynamic IoU reward, which provides adaptive, progressive, and definite feedback
for perception tasks handled by V-Triune. Our approach is instantiated within
an off-the-shelf RL training framework using open-source 7B and 32B backbone
models. The resulting model, dubbed Orsta (One RL to See Them All),
demonstrates consistent improvements across both reasoning and perception
tasks. This broad capability is significantly shaped by its training on a
diverse dataset, constructed around four representative visual reasoning tasks
(Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding,
Detection, Counting, and OCR). As a result, Orsta achieves substantial gains
on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1
across its various 7B and 32B model variants, with performance benefits
extending to a wide range of downstream tasks. These results highlight the
effectiveness and scalability of our unified RL approach for VLMs. The V-Triune
system, along with the Orsta models, is publicly available at
https://github.com/MiniMax-AI.
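
The abstract describes the Dynamic IoU reward only at a high level (adaptive, progressive, and definite), so the following is a minimal Python sketch of one plausible reading: a rule-based detection reward whose IoU threshold tightens on a fixed schedule as training progresses, giving easy positive signal early and a strict criterion late. The box format, function names, and the 0.5/0.75/0.95 schedule are illustrative assumptions, not the paper's published recipe.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); an assumed format

def iou(a: Box, b: Box) -> float:
    """Standard intersection-over-union for two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def dynamic_iou_reward(pred: Box, gt: Box, step: int, total_steps: int) -> float:
    """Hypothetical progressive reward: the IoU threshold a prediction must
    clear rises over training. The three-stage schedule below is an
    illustrative assumption, not the schedule used in V-Triune."""
    progress = step / max(1, total_steps)
    if progress < 0.33:
        threshold = 0.5    # lenient early phase
    elif progress < 0.66:
        threshold = 0.75   # intermediate phase
    else:
        threshold = 0.95   # strict late phase
    return 1.0 if iou(pred, gt) >= threshold else 0.0
```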
V-Triune is a visual unified reinforcement learning (RL) system that enables vision-language models (VLMs) to jointly learn reasoning and perception tasks. It integrates three key components—sample-level data formatting, verifier-level reward computation, and source-level metric monitoring—and introduces a novel Dynamic IoU reward for adaptive perception feedback. Built on open-source 7B and 32B models, the resulting system, Orsta, achieves significant performance gains (up to +14.1) across diverse tasks in MEGA-Bench Core, demonstrating the scalability and effectiveness of RL beyond reasoning.
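
To make the routing idea behind Sample-Level Data Formatting and Verifier-Level Reward Computation concrete, here is a hedged sketch of the dispatch pattern they imply: each formatted sample carries a task tag, and a registry maps that tag to a specialized verifier that scores the rollout. Every name here (VERIFIERS, verify_math, verify_ocr, the sample dict keys) is a hypothetical illustration, not V-Triune's actual interface.

```python
from typing import Callable, Dict

def verify_math(response: str, answer: str) -> float:
    # Exact-match check as a stand-in for a real math-answer verifier.
    return 1.0 if response.strip() == answer.strip() else 0.0

def verify_ocr(response: str, answer: str) -> float:
    # Character-overlap ratio as a stand-in for an edit-distance verifier.
    common = sum(min(response.count(c), answer.count(c)) for c in set(answer))
    return common / max(1, len(answer))

# Hypothetical registry: task tag -> specialized verifier.
VERIFIERS: Dict[str, Callable[[str, str], float]] = {
    "math": verify_math,
    "ocr": verify_ocr,
}

def compute_reward(sample: dict, response: str) -> float:
    """Route a rollout to the verifier registered for its task tag.
    Assumes sample-level formatting has supplied "task" and "answer" keys."""
    verifier = VERIFIERS[sample["task"]]
    return verifier(response, sample["answer"])

# Usage example:
sample = {"task": "math", "answer": "42"}
print(compute_reward(sample, "42"))  # -> 1.0
```

One appeal of this pattern is that new tasks plug in by registering a verifier, without touching the training loop; the source-level metric monitoring described above would then aggregate these per-task rewards by data source.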