\n","updatedAt":"2025-09-21T00:36:28.303Z","author":{"_id":"647ffddeb82adfa7cc1a10d9","avatarUrl":"/avatars/26aa168d6b2068298ebb16584aa52b6c.svg","fullname":"zhu","name":"xuekai","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":5}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.8722737431526184},"editors":["xuekai"],"editorAvatarUrls":["/avatars/26aa168d6b2068298ebb16584aa52b6c.svg"],"reactions":[{"reaction":"👍","users":["Dinghuai","daixuancheng","WpythonW","Charlie-LChen"],"count":4}],"isReport":false,"parentCommentId":"68cf14c0081d4f29a37106ec"}},{"id":"68cf492b99a9d37caa74b832","author":{"_id":"647ffddeb82adfa7cc1a10d9","avatarUrl":"/avatars/26aa168d6b2068298ebb16584aa52b6c.svg","fullname":"zhu","name":"xuekai","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":5},"createdAt":"2025-09-21T00:39:07.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"There's a thread discussing this implementation here on X: https://x.com/zhu_xuekai/with_replies. \nGot advice to use the prompt's last token to save compute - we'll try this in our scale-up experiments.\n\nAnd welcome for further discussion!","html":"There's a thread discussing this implementation here on X: https://x.com/zhu_xuekai/with_replies.
Got advice to use the prompt's last token to save compute - we'll try this in our scale-up experiments.
And welcome for further discussion!
\n","updatedAt":"2025-09-21T00:39:07.737Z","author":{"_id":"647ffddeb82adfa7cc1a10d9","avatarUrl":"/avatars/26aa168d6b2068298ebb16584aa52b6c.svg","fullname":"zhu","name":"xuekai","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":5}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.8201732039451599},"editors":["xuekai"],"editorAvatarUrls":["/avatars/26aa168d6b2068298ebb16584aa52b6c.svg"],"reactions":[],"isReport":false,"parentCommentId":"68cf14c0081d4f29a37106ec"}},{"id":"68cfbf2e4d69871830a75678","author":{"_id":"649dfb2935032bf979e2a820","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/fi7mmlWi7IH50GH2sY6q2.jpeg","fullname":"Andrew","name":"WpythonW","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":2},"createdAt":"2025-09-21T09:02:38.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Thanks for the clarification on Z_φ(x) implementation!","html":"Thanks for the clarification on Z_φ(x) implementation!
\n","updatedAt":"2025-09-21T09:02:38.087Z","author":{"_id":"649dfb2935032bf979e2a820","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/fi7mmlWi7IH50GH2sY6q2.jpeg","fullname":"Andrew","name":"WpythonW","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":2}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7130376100540161},"editors":["WpythonW"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/fi7mmlWi7IH50GH2sY6q2.jpeg"],"reactions":[],"isReport":false,"parentCommentId":"68cf14c0081d4f29a37106ec"}}]}],"primaryEmailConfirmed":false,"paper":{"id":"2509.15207","authors":[{"_id":"68ccb7983df9ac65e93dc626","user":{"_id":"647ffddeb82adfa7cc1a10d9","avatarUrl":"/avatars/26aa168d6b2068298ebb16584aa52b6c.svg","isPro":false,"fullname":"zhu","user":"xuekai","type":"user"},"name":"Xuekai Zhu","status":"claimed_verified","statusLastChangedAt":"2025-09-19T06:48:45.314Z","hidden":false},{"_id":"68ccb7983df9ac65e93dc627","user":{"_id":"649e6761f9134a06ed1e0cea","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/649e6761f9134a06ed1e0cea/XNeKceE8xSwI0xWwWUwwJ.jpeg","isPro":false,"fullname":"Daixuan Cheng","user":"daixuancheng","type":"user"},"name":"Daixuan Cheng","status":"claimed_verified","statusLastChangedAt":"2025-09-19T06:48:48.340Z","hidden":false},{"_id":"68ccb7983df9ac65e93dc628","user":{"_id":"64384e18d221ff12edae4c75","avatarUrl":"/avatars/3a5d2400af0f26091c233d63984df412.svg","isPro":false,"fullname":"Dinghuai Zhang","user":"Dinghuai","type":"user"},"name":"Dinghuai Zhang","status":"admin_assigned","statusLastChangedAt":"2025-09-19T13:11:24.546Z","hidden":false},{"_id":"68ccb7983df9ac65e93dc629","name":"Hengli Li","hidden":false},{"_id":"68ccb7983df9ac65e93dc62a","name":"Kaiyan Zhang","hidden":false},{"_id":"68ccb7983df9ac65e93dc62b","name":"Che Jiang","hidden":false},{"_id":"68ccb7983df9ac65e93dc62c","name":"Youbang Sun","hidden":false},{"_id":"68ccb7983df9ac65e93dc62d","name":"Ermo Hua","hidden":false},{"_id":"68ccb7983df9ac65e93dc62e","user":{"_id":"622474f38dc6b0b64f5e903d","avatarUrl":"/avatars/d6b60a014277a8ec7d564163c5f644aa.svg","isPro":false,"fullname":"Yuxin Zuo","user":"yuxinzuo","type":"user"},"name":"Yuxin Zuo","status":"admin_assigned","statusLastChangedAt":"2025-09-19T13:11:54.405Z","hidden":true},{"_id":"68ccb7983df9ac65e93dc62f","user":{"_id":"663f07d029be04778ba97871","avatarUrl":"/avatars/fb7c9d4a2c537d918a3267e7cbc03f04.svg","isPro":false,"fullname":"Xingtai Lv","user":"XingtaiHF","type":"user"},"name":"Xingtai Lv","status":"admin_assigned","statusLastChangedAt":"2025-09-19T13:11:46.119Z","hidden":false},{"_id":"68ccb7983df9ac65e93dc630","user":{"_id":"6663e5b10c54dffcd2a921ca","avatarUrl":"/avatars/2a4589fef05306ccde06728c752e5601.svg","isPro":false,"fullname":"Qizheng Zhang","user":"qizhengz","type":"user"},"name":"Qizheng Zhang","status":"admin_assigned","statusLastChangedAt":"2025-09-19T13:12:02.816Z","hidden":false},{"_id":"68ccb7983df9ac65e93dc631","user":{"_id":"67c27673e5911f17dbdded18","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/hp1VNuQe9Lf8QH0-km5dQ.png","isPro":false,"fullname":"LChen","user":"Charlie-LChen","type":"user"},"name":"Lin Chen","status":"claimed_verified","statusLastChangedAt":"2025-09-21T13:13:01.825Z","hidden":false},{"_id":"68ccb7983df9ac65e93dc632","name":"Fanghao Shao","hidden":false},{"_id":"68ccb7983df9ac65e93dc633","name":"Bo 
Xue","hidden":false},{"_id":"68ccb7983df9ac65e93dc634","name":"Yunchong Song","hidden":false},{"_id":"68ccb7983df9ac65e93dc635","user":{"_id":"65574c70d16b524f1d2345c1","avatarUrl":"/avatars/8edbe9ebea85c570bc19b63bf3a727d9.svg","isPro":false,"fullname":"Zhenjie Yang","user":"jayyoung0802","type":"user"},"name":"Zhenjie Yang","status":"claimed_verified","statusLastChangedAt":"2025-09-19T06:48:37.248Z","hidden":false},{"_id":"68ccb7983df9ac65e93dc636","user":{"_id":"650eba9555dc1e841746f132","avatarUrl":"/avatars/af6f5ee78f161d25ec0afc45d2def8eb.svg","isPro":false,"fullname":"Ganqu Cui","user":"ganqu","type":"user"},"name":"Ganqu Cui","status":"admin_assigned","statusLastChangedAt":"2025-09-19T13:12:20.955Z","hidden":false},{"_id":"68ccb7983df9ac65e93dc637","name":"Ning Ding","hidden":false},{"_id":"68ccb7983df9ac65e93dc638","user":{"_id":"641904caf9d6f1d772ec7af7","avatarUrl":"/avatars/4a63eac71eb30f70b1a0e9d4708f26c1.svg","isPro":false,"fullname":"Jianfeng Gao","user":"wyngjf","type":"user"},"name":"Jianfeng Gao","status":"admin_assigned","statusLastChangedAt":"2025-09-19T13:12:29.442Z","hidden":false},{"_id":"68ccb7983df9ac65e93dc639","name":"Xiaodong Liu","hidden":false},{"_id":"68ccb7983df9ac65e93dc63a","name":"Bowen Zhou","hidden":false},{"_id":"68ccb7983df9ac65e93dc63b","name":"Hongyuan Mei","hidden":false},{"_id":"68ccb7983df9ac65e93dc63c","user":{"_id":"68d21f8ca090df93e72c09a0","avatarUrl":"/avatars/3aac21cea09bf30230bab4d52dfc0582.svg","isPro":false,"fullname":"Zhouhan Lin","user":"ZhouhanLin","type":"user"},"name":"Zhouhan Lin","status":"claimed_verified","statusLastChangedAt":"2025-09-23T10:07:24.087Z","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/649e6761f9134a06ed1e0cea/wtc3vGNJQlj3FdOsBXwp3.png"],"publishedAt":"2025-09-18T17:56:36.000Z","submittedOnDailyAt":"2025-09-19T00:24:38.079Z","title":"FlowRL: Matching Reward Distributions for LLM Reasoning","submittedOnDailyBy":{"_id":"649e6761f9134a06ed1e0cea","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/649e6761f9134a06ed1e0cea/XNeKceE8xSwI0xWwWUwwJ.jpeg","isPro":false,"fullname":"Daixuan Cheng","user":"daixuancheng","type":"user"},"summary":"We propose FlowRL: matching the full reward distribution via flow balancing\ninstead of maximizing rewards in large language model (LLM) reinforcement\nlearning (RL). Recent advanced reasoning models adopt reward-maximizing methods\n(\\eg, PPO and GRPO), which tend to over-optimize dominant reward signals while\nneglecting less frequent but valid reasoning paths, thus reducing diversity. In\ncontrast, we transform scalar rewards into a normalized target distribution\nusing a learnable partition function, and then minimize the reverse KL\ndivergence between the policy and the target distribution. We implement this\nidea as a flow-balanced optimization method that promotes diverse exploration\nand generalizable reasoning trajectories. We conduct experiments on math and\ncode reasoning tasks: FlowRL achieves a significant average improvement of\n10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs\nconsistently better on code reasoning tasks. 
These results highlight reward\ndistribution-matching as a key step toward efficient exploration and diverse\nreasoning in LLM reinforcement learning.","upvotes":101,"discussionId":"68ccb7983df9ac65e93dc63d","githubRepo":"https://github.com/Xuekai-Zhu/FlowRL","ai_summary":"FlowRL enhances LLM reinforcement learning by matching the full reward distribution through flow balancing, improving diversity and performance over reward-maximizing methods.","ai_keywords":["FlowRL","reward distribution","flow balancing","reinforcement learning","reward-maximizing methods","PPO","GRPO","normalized target distribution","learnable partition function","reverse KL divergence","diverse exploration","generalizable reasoning trajectories","math reasoning","code reasoning"],"githubStars":71},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"649e6761f9134a06ed1e0cea","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/649e6761f9134a06ed1e0cea/XNeKceE8xSwI0xWwWUwwJ.jpeg","isPro":false,"fullname":"Daixuan Cheng","user":"daixuancheng","type":"user"},{"_id":"647ffddeb82adfa7cc1a10d9","avatarUrl":"/avatars/26aa168d6b2068298ebb16584aa52b6c.svg","isPro":false,"fullname":"zhu","user":"xuekai","type":"user"},{"_id":"66711d2ee12fa6cc5f5dfc89","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66711d2ee12fa6cc5f5dfc89/uOzD5ztCzmexXZF24UVxh.png","isPro":false,"fullname":"instruction-pretrain","user":"instruction-pretrain","type":"user"},{"_id":"650801ced5578ef7e20b33d4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/650801ced5578ef7e20b33d4/oLptSnKMecbu62EgglmO6.png","isPro":false,"fullname":"AdaptLLM","user":"AdaptLLM","type":"user"},{"_id":"65574c70d16b524f1d2345c1","avatarUrl":"/avatars/8edbe9ebea85c570bc19b63bf3a727d9.svg","isPro":false,"fullname":"Zhenjie Yang","user":"jayyoung0802","type":"user"},{"_id":"622474f38dc6b0b64f5e903d","avatarUrl":"/avatars/d6b60a014277a8ec7d564163c5f644aa.svg","isPro":false,"fullname":"Yuxin Zuo","user":"yuxinzuo","type":"user"},{"_id":"662f638ba9891e43cc4c5125","avatarUrl":"/avatars/77c22de5511f9b85d98ec75fb0b5e9be.svg","isPro":true,"fullname":"Li Haozhan","user":"Haozhan72","type":"user"},{"_id":"663f07d029be04778ba97871","avatarUrl":"/avatars/fb7c9d4a2c537d918a3267e7cbc03f04.svg","isPro":false,"fullname":"Xingtai Lv","user":"XingtaiHF","type":"user"},{"_id":"68085b92f4cd66b8c7e3f757","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/68085b92f4cd66b8c7e3f757/GhqcS6e1a2SQJpj8PX5aQ.jpeg","isPro":false,"fullname":"Yinda (Frédéric) Xu","user":"MARMOTatZJU","type":"user"},{"_id":"6458e8ce4b7baff9a84aa0da","avatarUrl":"/avatars/c450f4885e68d28c22fd87f9efdfedec.svg","isPro":false,"fullname":"kaikai zhao","user":"LifeIsSoSolong","type":"user"},{"_id":"679ce8c048ebd7903d76a832","avatarUrl":"/avatars/5f3fecaacfee6e2d5a72dd19fe87055a.svg","isPro":false,"fullname":"Youbang Sun","user":"Youbang","type":"user"},{"_id":"68b013324640ce38c97de573","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/At5XpAwgszfYXaXRtJIJ6.png","isPro":false,"fullname":"Yifan Liu","user":"PaulGoodman0700","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":2}">Abstract
FlowRL enhances LLM reinforcement learning by matching the full reward distribution through flow balancing, improving diversity and performance over reward-maximizing methods.
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
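For concreteness, the scheme described above can be written out as follows; here β is an assumed reward-scaling temperature, and the paper's exact regularization and scaling details may differ.

```latex
\[
  % Scalar rewards become a normalized target distribution through the
  % learnable partition function Z_\phi(x); \beta is an assumed temperature.
  p_\phi(y \mid x) = \frac{\exp\big(\beta\, r(x, y)\big)}{Z_\phi(x)},
  \qquad
  % The policy \pi_\theta is trained by minimizing the reverse KL to this target.
  \min_\theta \; D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x) \,\|\, p_\phi(\cdot \mid x)\big)
\]
```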
Community
- We propose FlowRL, a policy optimization algorithm that shifts from reward maximization to reward distribution matching via flow balance, encouraging diverse reasoning path exploration while addressing the inherent mode-collapse limitations of existing RL methods.
- We introduce length normalization and importance sampling to enable effective training on variable-length CoT reasoning, addressing gradient explosion and sampling mismatch issues (see the sketch after this list).
- FlowRL outperforms GRPO and PPO by 10.0% and 5.1% respectively across math benchmarks and demonstrates strong generalization on code reasoning tasks, with diversity analysis confirming substantially more diverse solution exploration.
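As a rough illustration of the second bullet, here is a hypothetical sketch of how length normalization and importance sampling could enter a flow-balance-style squared loss. The tensor shapes, the clipping constant, and the exact residual form are illustrative assumptions, not the paper's implementation.

```python
import torch


def flow_balance_loss(logprob_new,  # (B, T) per-token log-probs under the current policy
                      logprob_old,  # (B, T) per-token log-probs under the sampling policy
                      mask,         # (B, T) 1.0 for response tokens, 0.0 for padding
                      reward,       # (B,)   scalar reward for each sampled response
                      log_z,        # (B,)   log Z_phi(x) predicted from the prompt
                      beta=1.0,     # assumed reward-scaling temperature
                      clip=10.0):   # assumed upper bound on importance weights
    lengths = mask.sum(dim=-1).clamp(min=1.0)

    # Length normalization: average per-token log-probs instead of summing them,
    # keeping gradient magnitudes comparable across variable-length CoT rollouts.
    logp_new = (logprob_new * mask).sum(dim=-1) / lengths
    logp_old = (logprob_old * mask).sum(dim=-1) / lengths

    # Importance sampling: correct the mismatch between the sampling policy and
    # the current policy; weights are detached and clipped for stability.
    is_weight = torch.exp(logp_new - logp_old).detach().clamp(max=clip)

    # Flow-balance-style residual: push log Z_phi(x) + log pi_theta(y|x) toward
    # the scaled reward so that the policy matches the target distribution.
    residual = log_z + logp_new - beta * reward
    return (is_weight * residual.pow(2)).mean()
```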
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Revisiting LLM Reasoning via Information Bottleneck (2025)
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization (2025)
- Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR (2025)
- COPO: Consistency-Aware Policy Optimization (2025)
- Geometric-Mean Policy Optimization (2025)
- Inpainting-Guided Policy Optimization for Diffusion Large Language Models (2025)
- Posterior-GRPO: Rewarding Reasoning Processes in Code Generation (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend
What to feed into MLP for Z_φ(x) in decoder-only architecture?
The partition function Z_φ(x) is implemented as a 3-layer MLP that takes the prompt representation x as input. What exactly should be fed into the MLP for decoder-only models?
Options:
- Last prompt token - hidden state of the final token before generation starts
- Prompt pooling - mean/max pooling over all prompt token hidden states
- Separator token - add a special token between the prompt and the response
Which approach is most common for this use case?
Sorry for the confusing part about log Z! I'll detail this for you and update our paper ASAP.
From the flow perspective: log Z measures the probability flow out of the initial state S_0. Intuitively, it estimates a denominator, the sum of rewards across all possible paths, so we can convert rewards into a distribution via reward/Z.
From the implementation perspective: since it corresponds to the initial state, we use the prompt as encoded by the LM's last-layer hidden states. To turn a variable-length prompt into a single fixed-size input, we empirically take the mean over the token hidden states. There are definitely other approaches here that we haven't explored yet.
There's a thread discussing this implementation here on X: https://x.com/zhu_xuekai/with_replies. We got advice to use the prompt's last token to save compute; we'll try this in our scale-up experiments. Further discussion is welcome!
Thanks for the clarification on Z_φ(x) implementation!
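For readers who want to see this concretely, below is a minimal sketch of the setup described in the reply above: mean-pooling the prompt's last-layer hidden states and passing the result through a 3-layer MLP that outputs log Z_φ(x). The model name, the MLP widths, and the activation are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B"  # illustrative; any decoder-only LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

# 3-layer MLP mapping the pooled prompt representation to a scalar log Z_phi(x).
hidden = lm.config.hidden_size
log_z_head = nn.Sequential(
    nn.Linear(hidden, hidden), nn.GELU(),
    nn.Linear(hidden, hidden), nn.GELU(),
    nn.Linear(hidden, 1),
)

prompt = "Prove that the sum of two even numbers is even."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = lm(**inputs)
last_hidden = out.hidden_states[-1]          # (1, prompt_len, hidden)

# Mean-pool over prompt tokens, as described above; the alternative suggested
# on X is to use the final prompt token's hidden state: last_hidden[:, -1].
prompt_repr = last_hidden.mean(dim=1)        # (1, hidden)
log_z = log_z_head(prompt_repr).squeeze(-1)  # (1,) -> log Z_phi(x)
```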