\n","updatedAt":"2024-06-08T22:23:42.998Z","author":{"_id":"6186ddf6a7717cb375090c01","avatarUrl":"/avatars/716b6a7d1094c8036b2a8a7b9063e8aa.svg","fullname":"Julien BLANCHON","name":"blanchon","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":142}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.533446192741394},"editors":["blanchon"],"editorAvatarUrls":["/avatars/716b6a7d1094c8036b2a8a7b9063e8aa.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2311.04934","authors":[{"_id":"654dbb1bca2f3d20259209be","user":{"_id":"63d31d672727d7888cbc2428","avatarUrl":"/avatars/1df66789dd7ed8f3c068f9d440860ed9.svg","isPro":false,"fullname":"In Gim","user":"ingim","type":"user"},"name":"In Gim","status":"admin_assigned","statusLastChangedAt":"2023-11-10T10:39:55.008Z","hidden":false},{"_id":"654dbb1bca2f3d20259209bf","user":{"_id":"6500693c051fae19fc4c2c1b","avatarUrl":"/avatars/10fe24e1eabde8016b7c95a53ac153b7.svg","isPro":false,"fullname":"Guojun Chen","user":"Leonana69","type":"user"},"name":"Guojun Chen","status":"claimed_verified","statusLastChangedAt":"2023-11-10T18:39:41.660Z","hidden":false},{"_id":"654dbb1bca2f3d20259209c0","user":{"_id":"64bec1fd9f94ea2554258be0","avatarUrl":"/avatars/13d8a12f937ab5a95c2c1bfdb136aee9.svg","isPro":false,"fullname":"Seung-seob Lee","user":"seungseob7lee","type":"user"},"name":"Seung-seob Lee","status":"admin_assigned","statusLastChangedAt":"2023-11-10T10:40:25.582Z","hidden":false},{"_id":"654dbb1bca2f3d20259209c1","user":{"_id":"64f9a619a5067f6b65542cbf","avatarUrl":"/avatars/d0ebeeb7123a8d5f432970e2efb6cd6a.svg","isPro":false,"fullname":"Nikhil Sarda","user":"diffoperator","type":"user"},"name":"Nikhil Sarda","status":"admin_assigned","statusLastChangedAt":"2023-11-10T10:40:32.989Z","hidden":false},{"_id":"654dbb1bca2f3d20259209c2","user":{"_id":"644d68cf5c1b4e14d0c20be9","avatarUrl":"/avatars/0dd280cbb75951834a013521734124ec.svg","isPro":false,"fullname":"Anurag Khandelwal ","user":"Anurag88","type":"user"},"name":"Anurag Khandelwal","status":"admin_assigned","statusLastChangedAt":"2023-11-10T10:40:39.747Z","hidden":false},{"_id":"654dbb1bca2f3d20259209c3","name":"Lin Zhong","hidden":false}],"publishedAt":"2023-11-07T18:17:05.000Z","submittedOnDailyAt":"2023-11-10T02:39:48.214Z","title":"Prompt Cache: Modular Attention Reuse for Low-Latency Inference","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"We present Prompt Cache, an approach for accelerating inference for large\nlanguage models (LLM) by reusing attention states across different LLM prompts.\nMany input prompts have overlapping text segments, such as system messages,\nprompt templates, and documents provided for context. Our key insight is that\nby precomputing and storing the attention states of these frequently occurring\ntext segments on the inference server, we can efficiently reuse them when these\nsegments appear in user prompts. Prompt Cache employs a schema to explicitly\ndefine such reusable text segments, called prompt modules. The schema ensures\npositional accuracy during attention state reuse and provides users with an\ninterface to access cached states in their prompt. Using a prototype\nimplementation, we evaluate Prompt Cache across several LLMs. 
We show that\nPrompt Cache significantly reduce latency in time-to-first-token, especially\nfor longer prompts such as document-based question answering and\nrecommendations. The improvements range from 8x for GPU-based inference to 60x\nfor CPU-based inference, all while maintaining output accuracy and without the\nneed for model parameter modifications.","upvotes":34,"discussionId":"654dbb1cca2f3d20259209d8","ai_summary":"Prompt Cache improves inference speed for large language models by reusing precomputed attention states of common text segments in user prompts.","ai_keywords":["attention states","large language models (LLM)","prompt modules","positional accuracy","time-to-first-token","document-based question answering","recommendations"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"649080b5df7c85dd93770a2e","avatarUrl":"/avatars/8f705ced3a47859061c733ed219f3adf.svg","isPro":false,"fullname":"Freeman N I","user":"noeyve-03","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"63d31d672727d7888cbc2428","avatarUrl":"/avatars/1df66789dd7ed8f3c068f9d440860ed9.svg","isPro":false,"fullname":"In Gim","user":"ingim","type":"user"},{"_id":"6362ddb7d3be91534c30bfd6","avatarUrl":"/avatars/dac76ebd3b8a08099497ec0b0524bc7c.svg","isPro":false,"fullname":"Art Atk","user":"ArtAtk","type":"user"},{"_id":"6548d97fc0bc1b5104edbd7b","avatarUrl":"/avatars/7714fdf20c4867856449f8dbe16bbdd1.svg","isPro":false,"fullname":"Rasmus Kjær Nielsen","user":"RasmusKN99","type":"user"},{"_id":"64b183300a54158d66dd3587","avatarUrl":"/avatars/55ee1f1a0bf57a9d045e739f6dbbaeed.svg","isPro":false,"fullname":"Rocl Jamez","user":"James62","type":"user"},{"_id":"64bae71b01f1983a86322fec","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64bae71b01f1983a86322fec/xLAyVy6atkMu8gWOJuEAB.jpeg","isPro":false,"fullname":"Sam Butler","user":"sambutler","type":"user"},{"_id":"6311bca0ae8896941da24e66","avatarUrl":"/avatars/48de64894fc3c9397e26e4d6da3ff537.svg","isPro":false,"fullname":"Fynn Kröger","user":"fynnkroeger","type":"user"},{"_id":"63d24d7d8700bc77c8e6fba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1678993823326-63d24d7d8700bc77c8e6fba4.jpeg","isPro":false,"fullname":"Eugene Klimov","user":"Slach","type":"user"},{"_id":"64438bcb1bc692d87b237c04","avatarUrl":"/avatars/19f9eeb9281b47b34f68c312092ca468.svg","isPro":false,"fullname":"Yu","user":"Yhyu13","type":"user"},{"_id":"636ac507e3ad78bc68b31cfe","avatarUrl":"/avatars/e6dd4027945909c7cf13c61807c78f23.svg","isPro":false,"fullname":"Anas Saeed","user":"SaeedAnas","type":"user"},{"_id":"6239ac869895e5c2a4345131","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6239ac869895e5c2a4345131/QErVhX2EdKUSg06eRrU7L.jpeg","isPro":false,"fullname":"Edd","user":"Erland","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":3}">
AI-generated summary
Prompt Cache improves inference speed for large language models by reusing precomputed attention states of common text segments in user prompts.
We present Prompt Cache, an approach for accelerating inference for large
language models (LLMs) by reusing attention states across different LLM prompts.
Many input prompts have overlapping text segments, such as system messages,
prompt templates, and documents provided for context. Our key insight is that
by precomputing and storing the attention states of these frequently occurring
text segments on the inference server, we can efficiently reuse them when these
segments appear in user prompts. Prompt Cache employs a schema to explicitly
define such reusable text segments, called prompt modules. The schema ensures
positional accuracy during attention state reuse and provides users with an
interface to access cached states in their prompt. Using a prototype
implementation, we evaluate Prompt Cache across several LLMs. We show that
Prompt Cache significantly reduces time-to-first-token latency, especially
for longer prompts such as those used in document-based question answering and
recommendation tasks. The improvements range from 8x for GPU-based inference to 60x
for CPU-based inference, all while maintaining output accuracy and without the
need for model parameter modifications.
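
To make the idea concrete, below is a minimal sketch of attention-state (KV cache) reuse for a shared prompt segment, written against the Hugging Face transformers API rather than the paper's prototype. The model name, the shared_segment text, and the answer helper are illustrative assumptions, and the sketch only covers prefix-style reuse; it does not reproduce the schema-driven, position-aware prompt modules that Prompt Cache itself defines.

```python
# Sketch: precompute the attention (KV) states of a frequently occurring text
# segment once, then reuse them across requests instead of re-encoding it.
# Uses the Hugging Face transformers cache-reuse pattern; not the paper's code.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder: any causal LM from the Hub works the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# A segment that recurs across many prompts (e.g. a system message or a shared
# context document) -- the kind of segment Prompt Cache would store server-side.
shared_segment = (
    "You are a helpful assistant. Answer questions about the provided document.\n"
)

# Precompute and store the segment's attention states once.
segment_inputs = tokenizer(shared_segment, return_tensors="pt")
with torch.no_grad():
    segment_cache = model(**segment_inputs, use_cache=True).past_key_values

def answer(user_prompt: str, max_new_tokens: int = 32) -> str:
    """Generate a completion, reusing the cached states for the shared segment."""
    full_inputs = tokenizer(shared_segment + user_prompt, return_tensors="pt")
    # Deep-copy so each request reuses the precomputed states without mutating them.
    cache = copy.deepcopy(segment_cache)
    output_ids = model.generate(
        **full_inputs,
        past_key_values=cache,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    new_tokens = output_ids[0, full_inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

print(answer("Q: What is the document about?\nA:"))
```

The difference in Prompt Cache proper is that its schema assigns each prompt module an explicit position range, so cached attention states stay positionally accurate even when a reused segment is not a strict prefix of the user's prompt.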