https://github.com/GAIR-NLP/MAYE
This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1](https://huggingface.co/papers/2503.24376) (2025)
* [OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement](https://huggingface.co/papers/2503.17352) (2025)
* [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749) (2025)
* [Understanding R1-Zero-Like Training: A Critical Perspective](https://huggingface.co/papers/2503.20783) (2025)
* [R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization](https://huggingface.co/papers/2503.10615) (2025)
* [OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning](https://huggingface.co/papers/2503.16081) (2025)
* [LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL](https://huggingface.co/papers/2503.07536) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
AI-generated summary
A transparent reinforcement learning framework and a standardized evaluation scheme are introduced for vision-language models, showing RL's superiority in generalization over supervised fine-tuning.
Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics.

This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.
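The abstract describes the framework only at a high level (a "minimal yet functional four-step pipeline") and does not name the steps. As a rough illustration of what such a training loop could look like, here is a minimal Python sketch with stubbed components; the stage names, the `Sample` dataclass, and the helpers (`load_batch`, `collect_responses`, `compute_rewards`, `policy_update`) are hypothetical and are not the framework's actual API. In a real run, response collection would query the VLM being trained and the update step would apply an RL objective; here they are stubs so the four-stage structure stays visible.

```python
# Hypothetical sketch (not the MAYE implementation): the high-level shape of an
# RL-for-VLM training loop, with stubbed components standing in for the model,
# the rollout engine, and the optimizer.
import random
from dataclasses import dataclass


@dataclass
class Sample:
    image: str      # path or identifier of the input image
    question: str   # visual-reasoning prompt
    answer: str     # ground-truth answer used by a rule-based reward


def load_batch(dataset: list[Sample], batch_size: int) -> list[Sample]:
    """Step 1 (illustrative): sample a batch of (image, question, answer) triples."""
    return random.sample(dataset, k=min(batch_size, len(dataset)))


def collect_responses(batch: list[Sample]) -> list[str]:
    """Step 2 (illustrative): query the current policy; stubbed responses here."""
    return [f"<think>...</think> answer for {s.question}" for s in batch]


def compute_rewards(batch: list[Sample], responses: list[str]) -> list[float]:
    """Step 3 (illustrative): rule-based correctness reward per response."""
    return [1.0 if s.answer in r else 0.0 for s, r in zip(batch, responses)]


def policy_update(responses: list[str], rewards: list[float]) -> float:
    """Step 4 (illustrative): placeholder for the actual RL gradient step."""
    # A real implementation would backpropagate an advantage-weighted loss here.
    return sum(rewards) / max(len(rewards), 1)


if __name__ == "__main__":
    dataset = [Sample("img_0.png", "How many cubes are stacked?", "3")]
    for step in range(2):
        batch = load_batch(dataset, batch_size=1)
        responses = collect_responses(batch)
        rewards = compute_rewards(batch, responses)
        mean_reward = policy_update(responses, rewards)
        print(f"step {step}: mean reward = {mean_reward:.2f}")
```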
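The evaluation scheme is said to track training dynamics and reflective behaviors, and one reported finding is that reflection correlates with output length. The snippet below is only a hypothetical illustration of how one might count reflection markers in responses and pair them with response length; the marker list and the metric itself are assumptions for this sketch, not the paper's actual evaluation protocol.

```python
# Illustrative only: a simple keyword-based proxy for "reflective behavior"
# paired with response length, the kind of relationship the evaluation scheme
# is meant to surface. The marker list is an assumption, not the paper's metric.
REFLECTION_MARKERS = ("wait", "re-check", "recheck", "let me verify", "on second thought")


def reflection_stats(responses: list[str]) -> list[tuple[int, int]]:
    """Return (response_length_in_words, reflection_marker_count) per response."""
    stats = []
    for text in responses:
        lower = text.lower()
        markers = sum(lower.count(m) for m in REFLECTION_MARKERS)
        stats.append((len(text.split()), markers))
    return stats


if __name__ == "__main__":
    outputs = [
        "The answer is 3.",
        "First I count the cubes... wait, let me verify the hidden one. "
        "On second thought, there are 4.",
    ]
    for length, markers in reflection_stats(outputs):
        print(f"length={length} words, reflection markers={markers}")
```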