
  • Project page: https://simpletir.notion.site/report (July 2, 2025)
  • Paper: https://arxiv.org/abs/2509.02479
  • Models: https://huggingface.co/collections/ZhenghaiXue/simpletir-686ce09ae6e1db33b375f03d
  • Code: https://github.com/ltzheng/SimpleTIR

    Papers
    arxiv:2509.02479

    SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

    Published on Sep 2, 2025
    · Submitted by Qian Liu on Sep 3, 2025
    Authors: Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, Bo An

    Abstract

    SimpleTIR stabilizes multi-turn Tool-Integrated Reasoning training by filtering out void turns, achieving state-of-the-art performance on math reasoning benchmarks.

    AI-generated summary

    Large Language Models (LLMs) can significantly improve their reasoning capabilities by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn scenarios using Reinforcement Learning (RL) is often hindered by training instability and performance collapse. We identify that such instability is primarily caused by a distributional drift from external tool feedback, leading to the generation of low-probability tokens. This issue compounds over successive turns, causing catastrophic gradient norm explosions that derail the training process. To address this challenge, we introduce SimpleTIR, a plug-and-play algorithm that stabilizes multi-turn TIR training. Its core strategy is to identify and filter out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. By removing these problematic trajectories from the policy update, SimpleTIR effectively blocks the harmful, high-magnitude gradients, thus stabilizing the learning dynamics. Extensive experiments show that SimpleTIR achieves state-of-the-art performance on challenging math reasoning benchmarks, notably elevating the AIME24 score from a text-only baseline of 22.1 to 50.5 when starting from the Qwen2.5-7B base model. Furthermore, by avoiding the constraints of supervised fine-tuning, SimpleTIR encourages the model to discover diverse and sophisticated reasoning patterns, such as self-correction and cross-validation.
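
The void-turn filter described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Trajectory` class, the `<code>...</code>` tool-call delimiter, and the `\boxed{...}` final-answer marker are all assumptions made for the example, and the actual code release may detect turns differently.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical conventions (assumed for this sketch, not taken from the paper):
# a tool call is wrapped in <code>...</code>, a final answer in \boxed{...}.
CODE_BLOCK = re.compile(r"<code>.*?</code>", re.DOTALL)
FINAL_ANSWER = re.compile(r"\\boxed\{")


@dataclass
class Trajectory:
    turns: List[str]  # one string per model turn
    reward: float     # scalar outcome reward for the whole trajectory


def is_void_turn(turn: str) -> bool:
    """A turn is void if it yields neither a code block nor a final answer."""
    return CODE_BLOCK.search(turn) is None and FINAL_ANSWER.search(turn) is None


def filter_void_trajectories(batch: List[Trajectory]) -> List[Trajectory]:
    """Keep only trajectories in which no turn is void."""
    return [t for t in batch if not any(is_void_turn(turn) for turn in t.turns)]
```

Only the trajectories returned by `filter_void_trajectories` would then contribute to the policy update; the dropped ones are simply excluded from the loss rather than penalized.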

    Community

    Qian Liu (paper author, paper submitter)

    🎉 SimpleTIR Paper is Now Available!

    The official research paper for SimpleTIR has been released! This work advances end-to-end multi-turn reinforcement learning with tool use.

    📄 Resources:

    Thanks


