Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
— Librarian Bot (librarian-bot), 2023-11-03

Peiqing Jiang (jiangpq) commented on 2023-11-23:
The ARC-e and HellaSwag scores look weird. Probably the author swapped them by mistake?
Chien-Yu Lin (cylinbao) replied on 2023-11-23:

Hi @jiangpq, thanks for your interest in our work. I'm one of the co-authors of this paper. For this table, we used `lm-eval` version `0.3.0` to get our results, which is the stable version available via `pip`. We are aware that a different version of `lm-eval` can give different accuracy results. However, all results in this table were evaluated under the same environment, so we believe the table is still representative of each method's capability.
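Since scores can differ across harness versions, readers who want to reproduce the table can pin the same evaluation environment the authors describe. This is a minimal environment-setup fragment, not taken from the paper's artifact:

```shell
# Pin the evaluation harness to the version the authors report using
# for this table (lm-eval 0.3.0 from PyPI).
pip install lm-eval==0.3.0
```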
Paper: Atom: Low-bit Quantization for Efficient and Accurate LLM Serving (arXiv 2310.19102)
Authors: Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci
Published: 2023-10-29
AI-generated summary:
Atom is a low-bit quantization method that enhances LLM serving throughput with minimal accuracy loss by utilizing low-bit operators and a mixed-precision process.
The growing demand for Large Language Models (LLMs) in applications such as
content generation, intelligent chatbots, and sentiment analysis poses
considerable challenges for LLM service providers. To efficiently use GPU
resources and boost throughput, batching multiple requests has emerged as a
popular paradigm; to further speed up batching, LLM quantization techniques
reduce memory consumption and increase computing capacity. However, prevalent
quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully
leverage the capabilities of modern GPUs, such as 4-bit integer operators,
resulting in sub-optimal performance.
To maximize LLMs' serving throughput, we introduce Atom, a low-bit
quantization method that achieves high throughput improvements with negligible
accuracy loss. Atom significantly boosts serving throughput by using low-bit
operators and considerably reduces memory consumption via low-bit quantization.
It attains high accuracy by applying a novel mixed-precision and fine-grained
quantization process. We evaluate Atom on 4-bit weight-activation quantization
setups in the serving context. Atom improves end-to-end throughput by up to
7.73× compared to FP16 and by 2.53× compared to INT8
quantization, while maintaining the same latency target.
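The abstract's "mixed-precision and fine-grained quantization" can be sketched as per-group symmetric INT4 quantization where groups containing large "outlier" values are kept in full precision. This is an illustrative sketch only, not the paper's actual method or kernels; the group size and outlier threshold below are assumptions chosen for the example:

```python
# Illustrative sketch: fine-grained (per-group) symmetric 4-bit quantization
# with a simple mixed-precision rule that leaves outlier groups unquantized.
# NOT the paper's implementation; group size and threshold are arbitrary.

GROUP_SIZE = 4   # assumed group size for illustration
INT4_MAX = 7     # symmetric signed 4-bit range is [-8, 7]

def quantize_group(group):
    """Quantize one group of floats to INT4 with a shared per-group scale."""
    scale = max(abs(v) for v in group) / INT4_MAX or 1.0
    q = [max(-8, min(7, round(v / scale))) for v in group]
    return q, scale

def dequantize_group(q, scale):
    """Map INT4 codes back to floats using the group's scale."""
    return [v * scale for v in q]

def quantize_dequantize(values, outlier_threshold=8.0):
    """Round-trip per-group INT4; groups with large outliers stay in FP."""
    out = []
    for i in range(0, len(values), GROUP_SIZE):
        group = values[i:i + GROUP_SIZE]
        if max(abs(v) for v in group) > outlier_threshold:
            out.extend(group)              # keep outlier group in full precision
        else:
            q, s = quantize_group(group)
            out.extend(dequantize_group(q, s))
    return out

acts = [0.1, -0.4, 0.25, 0.33, 12.0, -0.2, 0.5, 0.1]
deq = quantize_dequantize(acts)
# The group containing the outlier 12.0 passes through unchanged; the
# quantized group is reconstructed to within one quantization step.
```

Per-group scales bound the quantization error by the largest value in each small group rather than in the whole tensor, which is why fine granularity helps accuracy at 4 bits.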