arxiv:2401.08671

DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

Published on Jan 9, 2024
· Submitted by AK on Jan 18, 2024

Abstract

The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-persistent and persistent deployment options, catering to diverse user scenarios from interactive sessions to long-running applications. We present a detailed benchmarking methodology, analyze the performance through latency-throughput curves, and investigate scalability via load balancing. Our evaluations demonstrate substantial improvements in throughput and latency across various models and hardware configurations. We discuss our roadmap for future enhancements, including broader model support and new hardware backends. The DeepSpeed-FastGen code is readily available for community engagement and contribution.

AI-generated summary

DeepSpeed-FastGen enhances the deployment of large language models by introducing Dynamic SplitFuse, improving throughput and latency compared to existing frameworks.
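The abstract describes Dynamic SplitFuse only at a high level: long prompts are decomposed into chunks and composed with in-flight generation so that every forward pass runs at a roughly constant token budget, which is what smooths token-level tail latency. The sketch below is a hypothetical simplification of that idea, not DeepSpeed's actual implementation; `TOKEN_BUDGET`, `Request`, and `build_batch` are invented names for illustration.

```python
# Illustrative sketch of Dynamic SplitFuse-style scheduling (hypothetical,
# not DeepSpeed's code). Long prompts are split into chunks and composed
# with single-token decode steps so each forward pass processes a roughly
# constant number of tokens.
from dataclasses import dataclass

TOKEN_BUDGET = 512  # assumed fixed per-forward-pass token budget

@dataclass
class Request:
    prompt_tokens_left: int  # prompt tokens not yet prefilled
    decoding: bool = False   # True once the prompt is fully consumed

def build_batch(requests: list[Request]) -> list[tuple[Request, int]]:
    """Greedily fill the token budget: one token per decoding request,
    then chunks of pending prompts for whatever budget remains."""
    batch, budget = [], TOKEN_BUDGET
    # Decode phase: each in-flight generation contributes exactly one token.
    for r in requests:
        if r.decoding and budget > 0:
            batch.append((r, 1))
            budget -= 1
    # Prefill phase: split long prompts so the pass never exceeds the budget.
    for r in requests:
        if not r.decoding and r.prompt_tokens_left > 0 and budget > 0:
            chunk = min(r.prompt_tokens_left, budget)
            batch.append((r, chunk))
            r.prompt_tokens_left -= chunk
            budget -= chunk
            if r.prompt_tokens_left == 0:
                r.decoding = True  # prompt fully prefilled; start generating

    return batch

# Example: a 1200-token prompt is prefilled across three passes while a
# short request keeps decoding one token per pass, uninterrupted.
reqs = [Request(prompt_tokens_left=1200), Request(prompt_tokens_left=0, decoding=True)]
while any(r.prompt_tokens_left > 0 for r in reqs):
    print([(id(r) % 100, n) for r, n in build_batch(reqs)])
```

The point of the fixed budget is that a single very long prompt no longer monopolizes a forward pass: decoding requests still advance every step, which is the behavior the paper's tail-latency numbers reflect.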
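The abstract also mentions non-persistent and persistent deployment options. Below is a minimal sketch of what each mode looks like through DeepSpeed-MII, following the calling conventions in the project's public README; the model name and generation parameters are placeholders, and exact signatures may have changed since the paper, so check the current DeepSpeed-MII documentation before relying on them.

```python
# Hedged sketch of DeepSpeed-FastGen's two deployment modes via DeepSpeed-MII.
# Calls follow the DeepSpeed-MII README conventions; the model name is a
# placeholder and signatures may differ in current releases.
import mii

# Non-persistent: the pipeline lives only for the duration of this process,
# suited to interactive sessions and one-off workloads.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
print(pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=64))

# Persistent: a long-running server that multiple clients can query,
# suited to production serving and load balancing across replicas.
client = mii.serve("mistralai/Mistral-7B-v0.1")
print(client.generate(["DeepSpeed is"], max_new_tokens=64))
client.terminate_server()
```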

Community


This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2401.08671 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2401.08671 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2401.08671 in a Space README.md to link it from this page.

Collections including this paper 11
