arxiv:2402.15627

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Published on Feb 23, 2024
· Submitted by AK on Feb 27, 2024
Authors: Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu

Abstract

MegaScale is a production system designed to train large language models at scale using more than 10,000 GPUs, focusing on efficiency, stability, and fault tolerance through advanced observability and diagnostic tools.

AI-generated summary

We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.
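The headline number, 55.2% Model FLOPs Utilization (MFU) on 12,288 GPUs, follows the standard definition used for LLM training: the model FLOPs actually achieved (estimated from observed token throughput) divided by the aggregate peak FLOPs of the hardware. The sketch below is not from the paper; it uses the common 6-FLOPs-per-parameter-per-token approximation for dense transformers (ignoring attention FLOPs), an assumed A100 BF16 peak of 312 TFLOPS per GPU, and a hypothetical throughput chosen only so the example lands near the reported figure.

```python
# Minimal sketch of the standard MFU estimate for LLM training.
# Assumptions (not from the paper): the common "6 * params" FLOPs-per-token
# approximation and an A100 BF16 peak of 312 TFLOPS per GPU.

def model_flops_utilization(tokens_per_second: float,
                            num_params: float,
                            num_gpus: int,
                            peak_flops_per_gpu: float = 312e12) -> float:
    """Estimate MFU = achieved model FLOPs / aggregate peak hardware FLOPs."""
    # Forward + backward of a dense transformer costs roughly 6 FLOPs per
    # parameter per token (~2 forward, ~4 backward), ignoring attention terms.
    achieved_flops = tokens_per_second * 6 * num_params
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops

# Hypothetical throughput for a 175B-parameter model on 12,288 GPUs:
mfu = model_flops_utilization(tokens_per_second=2.0e6,
                              num_params=175e9,
                              num_gpus=12_288)
print(f"MFU ≈ {mfu:.1%}")  # ≈ 55% with these assumed inputs
```

Fuller accounting (as in Megatron-LM or PaLM) also counts attention FLOPs and uses the exact model configuration, so the real computation differs in detail; the ratio itself is the quantity the abstract reports.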

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

MegaScale: Unleashing LLM Training on 10,000+ GPUs!

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2402.15627 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2402.15627 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2402.15627 in a Space README.md to link it from this page.

Collections including this paper 27
