stereoplegic: `@librarian-bot` recommend
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:

* [ThinK: Thinner Key Cache by Query-Driven Pruning](https://huggingface.co/papers/2407.21018) (2024)
* [Eigen Attention: Attention in Low-Rank Space for KV Cache Compression](https://huggingface.co/papers/2408.05646) (2024)
* [Cross-layer Attention Sharing for Large Language Models](https://huggingface.co/papers/2408.01890) (2024)
* [Accelerating Large Language Model Inference with Self-Supervised Early Exits](https://huggingface.co/papers/2407.21082) (2024)
* [CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs](https://huggingface.co/papers/2409.12490) (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
\n","updatedAt":"2024-09-24T17:16:11.047Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7541744709014893},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false,"parentCommentId":"66f2f3d698d8654563aebb69"}}]}],"primaryEmailConfirmed":false,"paper":{"id":"2404.06954","authors":[{"_id":"6640e4c623dcace65d073631","name":"Yijin Liu","hidden":false},{"_id":"6640e4c623dcace65d073632","name":"Fandong Meng","hidden":false},{"_id":"6640e4c623dcace65d073633","name":"Jie Zhou","hidden":false}],"publishedAt":"2024-04-10T12:12:07.000Z","title":"Accelerating Inference in Large Language Models with a Unified Layer\n Skipping Strategy","summary":"Recently, dynamic computation methods have shown notable acceleration for\nLarge Language Models (LLMs) by skipping several layers of computations through\nelaborate heuristics or additional predictors. However, in the decoding process\nof existing approaches, different samples are assigned different computational\nbudgets, which cannot guarantee a stable and precise acceleration effect.\nFurthermore, existing approaches generally skip multiple contiguous layers at\nthe bottom or top of the layers, leading to a drastic change in the model's\nlayer-wise representations, and thus a consequent performance degeneration.\nTherefore, we propose a Unified Layer Skipping strategy, which selects the\nnumber of layers to skip computation based solely on the target speedup ratio,\nand then skips the corresponding number of intermediate layer computations in a\nbalanced manner. Since the Unified Layer Skipping strategy is independent of\ninput samples, it naturally supports popular acceleration techniques such as\nbatch decoding and KV caching, thus demonstrating more practicality for\nreal-world applications. Experimental results on two common tasks, i.e.,\nmachine translation and text summarization, indicate that given a target\nspeedup ratio, the Unified Layer Skipping strategy significantly enhances both\nthe inference performance and the actual model throughput over existing dynamic\napproaches.","upvotes":0,"discussionId":"6640e4c723dcace65d073684","ai_summary":"A Unified Layer Skipping strategy enhances inference performance and throughput in Large Language Models by dynamically and evenly skipping layers based on a target speedup ratio.","ai_keywords":["Large Language Models","dynamic computation methods","Unified Layer Skipping strategy","computational budget","layer-wise representations","batch decoding","KV caching","machine translation","text summarization","inference performance","model throughput"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[],"acceptLanguages":["*"]}">
AI-generated summary: A Unified Layer Skipping strategy enhances inference performance and throughput in Large Language Models by dynamically and evenly skipping layers based on a target speedup ratio.
Abstract: Recently, dynamic computation methods have shown notable acceleration for Large Language Models (LLMs) by skipping several layers of computation through elaborate heuristics or additional predictors. However, in the decoding process of existing approaches, different samples are assigned different computational budgets, which cannot guarantee a stable and precise acceleration effect. Furthermore, existing approaches generally skip multiple contiguous layers at the bottom or top of the model, leading to a drastic change in the model's layer-wise representations and consequent performance degradation. Therefore, we propose a Unified Layer Skipping strategy, which selects the number of layers to skip based solely on the target speedup ratio and then skips the corresponding number of intermediate layer computations in a balanced manner. Since the Unified Layer Skipping strategy is independent of input samples, it naturally supports popular acceleration techniques such as batch decoding and KV caching, making it more practical for real-world applications. Experimental results on two common tasks, machine translation and text summarization, indicate that, given a target speedup ratio, the Unified Layer Skipping strategy significantly improves both inference performance and actual model throughput over existing dynamic approaches.
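To make the "balanced skipping" idea concrete, below is a minimal sketch (not the authors' code) of how one might pick which decoder layers to execute for a given target speedup ratio. The function name `select_layers_to_run`, the choice to always retain the first and last layers, and the evenly spaced selection of the remaining layers are illustrative assumptions based on the abstract, not confirmed implementation details.

```python
def select_layers_to_run(num_layers: int, target_speedup: float) -> list[int]:
    """Choose which decoder layers to execute so that roughly
    num_layers / target_speedup layers remain, spread evenly across
    the model (illustrative sketch only, not the paper's code)."""
    assert target_speedup >= 1.0
    # Layer budget implied by the target speedup ratio.
    num_to_run = max(2, round(num_layers / target_speedup))
    if num_to_run >= num_layers:
        return list(range(num_layers))
    # Assumption: keep the first and last layers, and space the rest
    # uniformly over the intermediate layers instead of dropping a
    # contiguous block at the bottom or top.
    step = (num_layers - 1) / (num_to_run - 1)
    return sorted({round(i * step) for i in range(num_to_run)})


# Example: a 32-layer decoder with a 2x target speedup executes ~16 layers,
# always including layer 0 and layer 31, with the rest evenly spaced.
print(select_layers_to_run(32, 2.0))
```

Because the selected set depends only on the layer count and the target speedup ratio, not on the input, the same skip pattern applies to every token and every sample in a batch, which is what lets the strategy coexist with batch decoding and KV caching as described in the abstract.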