arxiv:2405.12981

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

Published on May 21, 2024
· Submitted by AK on May 22, 2024
#2 Paper of the day
Authors:
William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan-Kelley
Abstract

AI-generated summary: Cross-Layer Attention modifies transformer-based autoregressive large language models to reduce KV cache size while maintaining accuracy, enabling longer sequence lengths and larger batch sizes during inference.

Key-value (KV) caching plays an essential role in accelerating decoding for transformer-based autoregressive large language models (LLMs). However, the amount of memory required to store the KV cache can become prohibitive at long sequence lengths and large batch sizes. Since the invention of the transformer, two of the most effective interventions discovered for reducing the size of the KV cache have been Multi-Query Attention (MQA) and its generalization, Grouped-Query Attention (GQA). MQA and GQA both modify the design of the attention block so that multiple query heads can share a single key/value head, reducing the number of distinct key/value heads by a large factor while only minimally degrading accuracy. In this paper, we show that it is possible to take Multi-Query Attention a step further by also sharing key and value heads between adjacent layers, yielding a new attention design we call Cross-Layer Attention (CLA). With CLA, we find that it is possible to reduce the size of the KV cache by another 2x while maintaining nearly the same accuracy as unmodified MQA. In experiments training 1B- and 3B-parameter models from scratch, we demonstrate that CLA provides a Pareto improvement over the memory/accuracy tradeoffs which are possible with traditional MQA, enabling inference with longer sequence lengths and larger batch sizes than would otherwise be possible.
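The memory pressure described in the abstract can be made concrete with a quick back-of-the-envelope calculation. The sketch below is illustrative rather than taken from the paper: the function name kv_cache_gib and the model configuration (32 layers, head dimension 128, fp16 cache) are hypothetical, chosen only to show how per-head sharing (MQA/GQA) and per-layer sharing (CLA with a sharing factor of 2) scale the cache.

```python
# Back-of-the-envelope KV-cache sizing. The model configuration below is
# hypothetical (not from the paper) and exists only to illustrate scaling.
#   bytes = 2 (K and V) * batch * seq_len * cached_layers * kv_heads * head_dim * bytes_per_elem

def kv_cache_gib(batch, seq_len, n_layers, n_kv_heads, head_dim,
                 bytes_per_elem=2, layer_share=1):
    """layer_share=2 approximates CLA with sharing factor 2: only every other layer stores K/V."""
    cached_layers = n_layers // layer_share
    return 2 * batch * seq_len * cached_layers * n_kv_heads * head_dim * bytes_per_elem / 2**30

cfg = dict(batch=8, seq_len=32_768, n_layers=32, head_dim=128)  # illustrative numbers only
print("MHA, 32 KV heads:", kv_cache_gib(n_kv_heads=32, **cfg), "GiB")                # 128.0
print("MQA, 1 KV head:  ", kv_cache_gib(n_kv_heads=1, **cfg), "GiB")                 #   4.0
print("MQA + CLA2:      ", kv_cache_gib(n_kv_heads=1, layer_share=2, **cfg), "GiB")  #   2.0
```

Under these illustrative assumptions, head sharing (MQA) shrinks the cache by the number of query heads, and layer sharing halves it again.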
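To show the mechanism itself, here is a minimal PyTorch sketch of the layer-sharing idea, not the authors' implementation: every other layer computes (and would cache) a single MQA-style key/value head, and the adjacent layer reuses that cache instead of producing its own. Class and argument names such as ClaAttention and owns_kv are invented for this illustration, and residual connections, normalization, and MLP blocks are omitted.

```python
# Minimal sketch of Cross-Layer Attention on top of Multi-Query Attention.
# Assumption: a sharing factor of 2 (even layers own K/V, odd layers reuse it).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClaAttention(nn.Module):
    """MQA-style attention; layers without their own K/V projections attend over
    the key/value tensors produced (and cached) by the preceding layer."""

    def __init__(self, d_model: int, n_heads: int, owns_kv: bool):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.owns_kv = owns_kv
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        if owns_kv:
            # MQA: one key head and one value head shared by all query heads.
            self.k_proj = nn.Linear(d_model, self.head_dim, bias=False)
            self.v_proj = nn.Linear(d_model, self.head_dim, bias=False)

    def forward(self, x, shared_kv=None):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        if self.owns_kv:
            # This layer computes (and, at inference time, would cache) its own K/V.
            k = self.k_proj(x).unsqueeze(1)  # (b, 1, t, head_dim): a single KV head
            v = self.v_proj(x).unsqueeze(1)
            shared_kv = (k, v)
        else:
            # CLA: reuse the adjacent layer's K/V; this layer stores nothing new.
            k, v = shared_kv
        # Broadcast the single KV head across all query heads (a view, no copy).
        k = k.expand(-1, self.n_heads, -1, -1)
        v = v.expand(-1, self.n_heads, -1, -1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out), shared_kv


# Toy 4-layer stack: layers 0 and 2 own K/V, layers 1 and 3 reuse it.
d_model, n_heads, n_layers = 256, 8, 4
layers = nn.ModuleList(
    [ClaAttention(d_model, n_heads, owns_kv=(i % 2 == 0)) for i in range(n_layers)]
)

x = torch.randn(2, 16, d_model)  # (batch, seq_len, d_model)
shared_kv = None
for layer in layers:
    x, shared_kv = layer(x, shared_kv)

kv_layers = sum(layer.owns_kv for layer in layers)
print(f"layers that cache K/V: {kv_layers}/{n_layers} -> ~{n_layers // kv_layers}x smaller cache than per-layer MQA")
```

Because only half of the layers store key and value tensors, the per-token cache footprint is roughly halved relative to plain MQA, matching the additional 2x reduction the abstract reports for CLA.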

Community

@librarian-bot recommend


This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* You Only Cache Once: Decoder-Decoder Architectures for Language Models (2024) https://huggingface.co/papers/2405.05254
* Layer-Condensed KV Cache for Efficient Inference of Large Language Models (2024) https://huggingface.co/papers/2405.10637
* Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (2024) https://huggingface.co/papers/2404.07143
* SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget (2024) https://huggingface.co/papers/2404.04793
* Improving Transformers with Dynamically Composable Multi-Head Attention (2024) https://huggingface.co/papers/2405.08553

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

How Cross-Layer Attention Reduces Transformer Memory Footprint

Video: https://cdn-uploads.huggingface.co/production/uploads/6186ddf6a7717cb375090c01/2bxCaT-EqCOIzUR7Mvmmu.mp4

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2405.12981 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2405.12981 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2405.12981 in a Space README.md to link it from this page.

Collections including this paper 14
