Abstract
SparQ Attention reduces memory bandwidth requirements in LLM attention blocks, enhancing inference throughput without accuracy loss.
Generative large language models (LLMs) have opened up numerous novel possibilities, but due to their significant computational requirements their ubiquitous use remains challenging. Some of the most useful applications require processing large numbers of samples at a time and using long contexts, both significantly increasing the memory communication load of the models. We introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by reducing the memory bandwidth requirements within the attention blocks through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show how SparQ Attention can decrease the attention memory bandwidth requirements up to eight times without any loss in accuracy by evaluating Llama 2 and Pythia models on a wide range of downstream tasks.
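To make the core idea concrete, below is a minimal NumPy sketch of the kind of selective KV-cache fetching the abstract describes: a cheap approximate scoring pass over a few query components, followed by exact attention over only the top-scoring cached positions. This is an illustrative simplification, not the authors' reference implementation; the function name, the choice of `r` and `k`, the softmax scaling, and the mean-value correction term are assumptions made here for clarity (the exact procedure is given in the paper).

```python
import numpy as np

def sparse_fetch_attention(q, K_cache, V_cache, v_mean, r=16, k=64):
    """Single-head, single-query decode step with selective KV fetching (sketch).

    q:        (d,)   current query vector
    K_cache:  (n, d) cached keys (conceptually in slow memory)
    V_cache:  (n, d) cached values (conceptually in slow memory)
    v_mean:   (d,)   running mean of cached values, kept in fast memory
    r:        number of query components used for the cheap approximation
    k:        number of cached positions whose full K/V rows are fetched
    """
    n, d = K_cache.shape

    # 1) Score the cache approximately using only the r largest-magnitude
    #    query components, so only r columns of K are read instead of all d.
    top_dims = np.argsort(np.abs(q))[-r:]
    approx_logits = K_cache[:, top_dims] @ q[top_dims] / np.sqrt(d)
    approx_scores = np.exp(approx_logits - approx_logits.max())
    approx_scores /= approx_scores.sum()

    # 2) Fetch full keys and values only for the k highest-scoring positions.
    top_pos = np.argsort(approx_scores)[-k:]
    K_sel, V_sel = K_cache[top_pos], V_cache[top_pos]

    # 3) Exact attention over the fetched subset.
    logits = K_sel @ q / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    out = weights @ V_sel

    # 4) Blend in the running mean value, weighted by the approximate
    #    probability mass that was left unfetched (an assumed correction
    #    term, simplified from the paper's description).
    alpha = approx_scores[top_pos].sum()
    return alpha * out + (1.0 - alpha) * v_mean
```

The bandwidth saving comes from steps 1 and 2: the full `K_cache`/`V_cache` rows are transferred only for the `k` selected positions, while the approximate pass touches just `r` columns of the keys, which is where the reported reduction in attention memory traffic would come from.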
Community
Game-Changer for Large Language Models: SparQ Attention Explained
Links 🔗:
👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/