Self-Selected Attention Span for Accelerating Large Language Model Inference
Tian Jin, Wanzin Yazar, Zifei Xu, Sayeh Sharify, Xin Wang
Published on 2024-04-14 (arXiv: 2404.09336)
AI-generated summary
LLMs fine-tuned to identify minimal attention spans improve inference efficiency through sparse attention masks, enhancing throughput in real-world tasks.

Abstract
Large language models (LLMs) can solve challenging tasks. However, their inference computation on modern GPUs is highly inefficient due to the increasing number of tokens they must attend to as they generate new ones. To address this inefficiency, we capitalize on LLMs' problem-solving capabilities to optimize their own inference-time efficiency. We demonstrate with two specific tasks: (a) evaluating complex arithmetic expressions and (b) summarizing news articles. For both tasks, we create custom datasets to fine-tune an LLM. The goal of fine-tuning is twofold: first, to make the LLM learn to solve the evaluation or summarization task, and second, to train it to identify the minimal attention spans required for each step of the task. As a result, the fine-tuned model is able to convert these self-identified minimal attention spans into sparse attention masks on-the-fly during inference. We develop a custom CUDA kernel to take advantage of the reduced context to attend to. We demonstrate that using this custom CUDA kernel improves the throughput of LLM inference by 28%. Our work presents an end-to-end demonstration showing that training LLMs to self-select their attention spans speeds up autoregressive inference in solving real-world tasks.
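
To make the core idea concrete, here is a minimal sketch (not the authors' implementation) of how a self-selected attention span could reduce per-token work at decode time. It assumes a standard PyTorch-style KV cache and simply slices the cached keys and values down to the selected span before calling the built-in scaled_dot_product_attention; the function name sparse_decode_step and the specific span value are hypothetical, and the slicing stands in for the paper's custom CUDA kernel.

```python
# Hypothetical sketch, not the paper's code: one decoding step where the model
# has "self-selected" the span of context it still needs to attend to.
import torch
import torch.nn.functional as F


def sparse_decode_step(q, k_cache, v_cache, span_start):
    """One decoding step restricted to a self-selected attention span.

    q:          (batch, heads, 1, head_dim)       query for the token being generated
    k_cache:    (batch, heads, seq_len, head_dim) cached keys for the full context
    v_cache:    (batch, heads, seq_len, head_dim) cached values for the full context
    span_start: first context position the model decided it still needs
    """
    # Keep only keys/values inside the span; a shorter K/V means fewer dot
    # products per decoding step, which is where the throughput gain comes from.
    k = k_cache[:, :, span_start:, :]
    v = v_cache[:, :, span_start:, :]
    return F.scaled_dot_product_attention(q, k, v)


# Toy usage: 8 heads, 64-dim heads, 128 cached tokens, and a (hypothetical)
# self-selected span starting at position 100.
q = torch.randn(1, 8, 1, 64)
k_cache = torch.randn(1, 8, 128, 64)
v_cache = torch.randn(1, 8, 128, 64)
out = sparse_decode_step(q, k_cache, v_cache, span_start=100)
print(out.shape)  # torch.Size([1, 8, 1, 64])
```

In the paper's setting, the span would come from the fine-tuned model's own output during generation, and the custom CUDA kernel exploits the reduced context directly rather than materializing sliced tensors; the slice above is just the simplest way to show why a smaller self-selected attention span cuts per-token attention work.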