arxiv:2403.09636

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

Published on Mar 14, 2024
Authors: Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti

Abstract

AI-generated summary

Dynamic Memory Compression (DMC) improves large language model throughput with on-line key-value cache compression, preserving performance and enabling the handling of longer contexts and larger batches.

Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for on-line key-value cache compression at inference time. Most importantly, the model learns to apply different compression rates in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to ~3.7x throughput increase in auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. We find that DMC preserves the original downstream performance with up to 4x cache compression, outperforming up-trained grouped-query attention (GQA). GQA and DMC can even be combined to obtain compounded gains. As a result, DMC fits longer contexts and larger batches within any given memory budget.
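
The abstract describes the mechanism only at a high level, so below is a minimal sketch of the general idea behind on-line key-value cache compression: at each decoding step, a head either appends the new key/value pair to its cache or merges it into the most recent slot, which is what keeps the cache from growing linearly with sequence length. The `CompressedKVCache` class and the `decide_merge` threshold are illustrative assumptions, not the authors' implementation; in DMC the append/merge decision is learned per head and layer during continued pre-training, which is how different compression rates emerge.

```python
import torch


class CompressedKVCache:
    """Per-head cache whose slots can aggregate several consecutive tokens."""

    def __init__(self):
        self.keys = []     # list of (head_dim,) tensors
        self.values = []   # list of (head_dim,) tensors
        self.weights = []  # number of raw tokens aggregated into each slot

    def update(self, k, v, merge):
        if merge and self.keys:
            # Fold the new pair into the last slot with a running weighted
            # average, so one slot summarizes several consecutive tokens.
            w = self.weights[-1]
            self.keys[-1] = (w * self.keys[-1] + k) / (w + 1)
            self.values[-1] = (w * self.values[-1] + v) / (w + 1)
            self.weights[-1] = w + 1
        else:
            # Ordinary cache growth: append a fresh slot.
            self.keys.append(k)
            self.values.append(v)
            self.weights.append(1)

    def tensors(self):
        # Materialize the compressed cache for the attention computation.
        return torch.stack(self.keys), torch.stack(self.values)


def decide_merge(k):
    # Hypothetical stand-in for the learned gate: threshold one feature of
    # the key. In DMC the append/merge decision is learned, so different
    # heads and layers end up with different effective compression rates.
    return torch.sigmoid(k[0]).item() > 0.5


# Usage: simulate 16 decoding steps for a single head with head_dim = 64.
cache = CompressedKVCache()
for _ in range(16):
    k, v = torch.randn(64), torch.randn(64)
    cache.update(k, v, merge=decide_merge(k))

keys, values = cache.tensors()
print(f"raw tokens: 16, cache slots kept: {keys.shape[0]}")  # slots <= 16
```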

Community

@librarian-bot recommend

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM (2024): https://huggingface.co/papers/2403.05527
* Sequence can Secretly Tell You What to Discard (2024): https://huggingface.co/papers/2404.15949
* No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization (2024): https://huggingface.co/papers/2402.18096
* Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference (2024): https://huggingface.co/papers/2403.09054
* CHAI: Clustered Head Attention for Efficient LLM Inference (2024): https://huggingface.co/papers/2403.08058

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 4

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 4
