SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning
Abstract
SimpleTIR stabilizes multi-turn Tool-Integrated Reasoning training by filtering out void turns, achieving state-of-the-art performance on math reasoning benchmarks.
Large Language Models (LLMs) can significantly improve their reasoning capabilities by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn scenarios using Reinforcement Learning (RL) is often hindered by training instability and performance collapse. We identify that such instability is primarily caused by a distributional drift from external tool feedback, leading to the generation of low-probability tokens. This issue compounds over successive turns, causing catastrophic gradient norm explosions that derail the training process. To address this challenge, we introduce SimpleTIR, a plug-and-play algorithm that stabilizes multi-turn TIR training. Its core strategy is to identify and filter out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. By removing these problematic trajectories from the policy update, SimpleTIR effectively blocks the harmful, high-magnitude gradients, thus stabilizing the learning dynamics. Extensive experiments show that SimpleTIR achieves state-of-the-art performance on challenging math reasoning benchmarks, notably elevating the AIME24 score from a text-only baseline of 22.1 to 50.5 when starting from the Qwen2.5-7B base model. Furthermore, by avoiding the constraints of supervised fine-tuning, SimpleTIR encourages the model to discover diverse and sophisticated reasoning patterns, such as self-correction and cross-validation.
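The void-turn filter described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Trajectory` type, the function names, and the detection rules (code in triple-backtick fences, final answers in `\boxed{...}`) are all assumptions made for illustration. The only part taken from the abstract is the definition of a void turn (a turn with neither a code block nor a final answer) and the rule that any trajectory containing one is dropped from the policy update.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumption: code blocks are delimited by triple backticks and final
# answers are written as \boxed{...}. The real output format may differ.
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\n.*?" + FENCE, re.DOTALL)
FINAL_ANSWER = re.compile(r"\\boxed\{")


@dataclass
class Trajectory:
    """Hypothetical container for one multi-turn rollout."""
    turns: List[str]  # model-generated text for each turn
    reward: float     # scalar outcome reward for the whole rollout


def is_void_turn(turn: str) -> bool:
    """A turn is void if it yields neither a code block nor a final answer."""
    has_code = CODE_BLOCK.search(turn) is not None
    has_answer = FINAL_ANSWER.search(turn) is not None
    return not (has_code or has_answer)


def filter_void_trajectories(batch: List[Trajectory]) -> List[Trajectory]:
    """Keep only trajectories in which no turn is void."""
    return [
        traj for traj in batch
        if not any(is_void_turn(turn) for turn in traj.turns)
    ]
```

In a training loop, a filter like this would presumably run after rollout collection and before the loss is computed, so that the dropped trajectories contribute no gradient.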
Community
🎉 SimpleTIR Paper is Now Available!
The official research paper for SimpleTIR has been released! This work advances end-to-end multi-turn reinforcement learning with tool use.
📄 Resources:
Thanks