Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan-Kelley

Published May 21, 2024 · https://huggingface.co/papers/2405.12981
AI-generated summary
Cross-Layer Attention modifies transformer-based autoregressive large language models to reduce KV cache size while maintaining accuracy, enabling longer sequence lengths and larger batch sizes during inference.

Abstract
Key-value (KV) caching plays an essential role in accelerating decoding for
transformer-based autoregressive large language models (LLMs). However, the
amount of memory required to store the KV cache can become prohibitive at long
sequence lengths and large batch sizes. Since the invention of the transformer,
two of the most effective interventions discovered for reducing the size of the
KV cache have been Multi-Query Attention (MQA) and its generalization,
Grouped-Query Attention (GQA). MQA and GQA both modify the design of the
attention block so that multiple query heads can share a single key/value head,
reducing the number of distinct key/value heads by a large factor while only
minimally degrading accuracy. In this paper, we show that it is possible to
take Multi-Query Attention a step further by also sharing key and value heads
between adjacent layers, yielding a new attention design we call Cross-Layer
Attention (CLA). With CLA, we find that it is possible to reduce the size of
the KV cache by another 2x while maintaining nearly the same accuracy as
unmodified MQA. In experiments training 1B- and 3B-parameter models from
scratch, we demonstrate that CLA provides a Pareto improvement over the
memory/accuracy tradeoffs which are possible with traditional MQA, enabling
inference with longer sequence lengths and larger batch sizes than would
otherwise be possible.
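
The abstract describes CLA only at a high level, so the following is a minimal PyTorch sketch of the core idea, assuming a sharing factor of 2: pairs of adjacent layers share a single MQA key/value head, so only every other layer contributes entries to the KV cache. All module and argument names here (MQABlock, CLAStack, has_kv, shared_kv) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed names, not the authors' code) of Cross-Layer
# Attention on top of Multi-Query Attention: only "producer" layers compute
# key/value projections; the adjacent "consumer" layer reuses them, so only
# half the layers contribute to the KV cache.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MQABlock(nn.Module):
    """Multi-Query Attention: many query heads, one shared key/value head."""

    def __init__(self, d_model: int, n_heads: int, has_kv: bool = True):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        if has_kv:
            # Single KV head (MQA); only producer layers own these weights.
            self.k_proj = nn.Linear(d_model, self.d_head, bias=False)
            self.v_proj = nn.Linear(d_model, self.d_head, bias=False)

    def forward(self, x, shared_kv=None):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        if shared_kv is None:
            # Producer layer: compute fresh K/V and expose them for reuse.
            k = self.k_proj(x).view(B, T, 1, self.d_head).transpose(1, 2)
            v = self.v_proj(x).view(B, T, 1, self.d_head).transpose(1, 2)
            shared_kv = (k, v)
        else:
            # Consumer layer (CLA): reuse the adjacent layer's K/V; nothing
            # new is added to the KV cache for this layer.
            k, v = shared_kv
        out = F.scaled_dot_product_attention(
            q,
            k.expand(B, self.n_heads, T, self.d_head),
            v.expand(B, self.n_heads, T, self.d_head),
            is_causal=True,
        )
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out), shared_kv


class CLAStack(nn.Module):
    """CLA with sharing factor 2: each adjacent layer pair shares one KV head."""

    def __init__(self, d_model: int = 256, n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            MQABlock(d_model, n_heads, has_kv=(i % 2 == 0)) for i in range(n_layers)
        )

    def forward(self, x):
        shared_kv = None
        for i, layer in enumerate(self.layers):
            # Even layers produce K/V; odd layers consume the shared pair.
            x, shared_kv = layer(x, shared_kv if i % 2 == 1 else None)
        return x


if __name__ == "__main__":
    model = CLAStack()
    y = model(torch.randn(2, 16, 256))
    print(y.shape)  # torch.Size([2, 16, 256])
```

During autoregressive decoding, only the producer layers would append to the KV cache, which is where the roughly 2x cache reduction relative to plain MQA comes from; the consumer layers also skip their key/value projection parameters. Residual connections, normalization, and the actual cache management are omitted from this sketch.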