arxiv:2504.02587

Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

Authors: Yan Ma, Steffi Chern, Xuyang Shen, Yiran Zhong, Pengfei Liu

Published on Apr 3, 2025
· Submitted by Yan Ma on Apr 4, 2025

Abstract

AI-generated summary: A transparent reinforcement learning framework and standardized evaluation are introduced for vision-language models, showing RL's superiority in generalization over supervised fine-tuning.

Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.

Community

Paper author Paper submitter

This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.

Paper author Paper submitter
edited Apr 4

Code is public and available at: https://github.com/GAIR-NLP/MAYE
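For readers who want a feel for the shape of a minimal four-step RL loop of the kind described above before opening the repository, here is a deliberately toy sketch. It is not the MAYE implementation: the dataset, the scalar "policy", and every function name (load_dataset, collect_response, compute_reward, update_policy) are illustrative stand-ins chosen only to show the control flow of loading data, collecting responses, scoring them with a rule-based reward, and updating the policy. The real pipeline in the repository trains an actual VLM.

```python
# Toy sketch only -- NOT the MAYE implementation. All names below are hypothetical
# placeholders that mimic a minimal four-step RL loop so the control flow is
# runnable end to end with nothing but the standard library.
import random

def load_dataset():
    # Step 1: data flow. A real setup would pair an image with a question;
    # here a toy arithmetic "question" and its reference answer stand in for both.
    return [("2 + 3", "5"), ("4 + 4", "8"), ("7 - 2", "5")]

def collect_response(policy, question):
    # Step 2: response collection. A real pipeline samples a reasoning trace from
    # the VLM; here the "policy" is just a probability of answering correctly.
    answer = eval(question) if random.random() < policy["p_correct"] else None
    return "unsure" if answer is None else str(answer)

def compute_reward(response, reference):
    # Step 3: reward construction, rule-based (exact match against the reference).
    return 1.0 if response == reference else 0.0

def update_policy(policy, rewards, lr=0.05):
    # Step 4: policy update. A real implementation would take a policy-gradient
    # step; here the scalar "policy" simply drifts toward higher average reward.
    avg = sum(rewards) / len(rewards)
    policy["p_correct"] = min(1.0, policy["p_correct"] + lr * avg)

policy = {"p_correct": 0.3}
for step in range(20):
    batch = load_dataset()
    responses = [collect_response(policy, q) for q, _ in batch]
    rewards = [compute_reward(r, a) for r, (_, a) in zip(responses, batch)]
    update_policy(policy, rewards)
    print(f"step {step:02d}  mean reward {sum(rewards) / len(rewards):.2f}")
```

In a real run, the quantities worth logging at each step are the ones the paper's evaluation scheme standardizes, such as reward and response length over training.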

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 (2025): https://huggingface.co/papers/2503.24376
* OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement (2025): https://huggingface.co/papers/2503.17352
* Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models (2025): https://huggingface.co/papers/2503.06749
* Understanding R1-Zero-Like Training: A Critical Perspective (2025): https://huggingface.co/papers/2503.20783
* R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization (2025): https://huggingface.co/papers/2503.10615
* OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning (2025): https://huggingface.co/papers/2503.16081
* LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL (2025): https://huggingface.co/papers/2503.07536

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2504.02587 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2504.02587 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2504.02587 in a Space README.md to link it from this page.

Collections including this paper 12
