arxiv:2402.10171

Data Engineering for Scaling Language Models to 128K Context

Published on Feb 15, 2024
· Submitted by AK on Feb 16, 2024
Authors:
Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, Hao Peng

Abstract

We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than seen during training (e.g., 4K to 128K) through lightweight continual pretraining on appropriate data mixture. We investigate the quantity and quality of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize domain balance and length upsampling. Concretely, we find that naively upsampling longer data on certain domains like books, a common practice of existing work, gives suboptimal performance, and that a balanced domain mixture is important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.

AI-generated summary

Continual pretraining with domain-balanced and length-upsampled data effectively scales language models' context length to 128K without significant additional computational cost.
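
The recipe sketched in the abstract is mostly a data-sampling question: keep the original domain proportions, and upsample long documents within each domain rather than globally. Below is a minimal sketch of such a sampling step; the domain names, weights, length threshold, and boost factor are illustrative assumptions, not the paper's actual mixture.

    import random

    def build_mixture(domain_docs, domain_weights, n_docs,
                      long_threshold=32_000, long_boost=4.0, seed=0):
        """Sample a continual-pretraining mixture that keeps the original domain
        proportions but upsamples long documents within each domain.

        domain_docs:    dict mapping domain -> list of (doc_id, length_in_tokens)
        domain_weights: dict mapping domain -> fraction of the mixture (sums to 1.0)
        """
        rng = random.Random(seed)
        mixture = []
        for domain, weight in domain_weights.items():
            docs = domain_docs[domain]
            # Length upsampling: long documents get a higher sampling weight.
            probs = [long_boost if length >= long_threshold else 1.0
                     for _, length in docs]
            # Domain balance: each domain keeps its original share of the mixture.
            k = int(round(weight * n_docs))
            mixture += rng.choices(docs, weights=probs, k=k)
        rng.shuffle(mixture)
        return mixture

    # Illustrative usage with made-up domains, token lengths, and weights.
    domain_docs = {
        "web":   [("web_0", 2_000), ("web_1", 80_000), ("web_2", 4_000)],
        "code":  [("code_0", 1_000), ("code_1", 50_000)],
        "books": [("book_0", 200_000), ("book_1", 150_000)],
    }
    domain_weights = {"web": 0.6, "code": 0.3, "books": 0.1}
    print(build_mixture(domain_docs, domain_weights, n_docs=10))

The point of keeping the boost inside each domain is the one the abstract makes: naively upsampling long sources such as books shifts the domain mixture itself and hurts performance elsewhere.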

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

leegao19:

Is the idea here mainly:

  1. Data - (novel contribution) continual pretraining while preserving the pretraining data mixture (avoiding biased benchmark performance in other areas, in contrast to e.g. just training on long-form books)
  2. Architecture - minimal changes beyond Adjusted Base Frequency (changing the RoPE base from 10,000 to 500,000 à la Code LLaMA; see the sketch just after this comment).
  3. Training - with recent sub-quadratic memory optimizations (FlashAttention), brute-force training with long sequences is no longer prohibitively expensive, and a large part of the latency bottleneck has shifted to linear IO cost (for < ~50K sequences). I believe FlashAttention 2 proposes a double-buffering technique that can also help "overlap" these IO and GEMM costs to avoid serializing on them.

I believe https://arxiv.org/abs/2309.16039 also proposes something very similar (continual pretraining with ABF at 500,000 as the only minor architectural change), but using far more tokens for continual pretraining and without preserving the same pretraining data mixture.
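
On point 2 above, the Adjusted Base Frequency change really is a one-line edit to the RoPE construction: the rotation frequencies are theta_i = base^(-2i/d), so raising the base from 10,000 to 500,000 slows the rotations and keeps distant positions distinguishable over a 128K window. A minimal standalone sketch (not the paper's or Code LLaMA's actual code):

    import numpy as np

    def rope_frequencies(head_dim, base=10_000.0):
        """Per-pair rotation frequencies theta_i = base ** (-2i / head_dim)."""
        exponents = np.arange(0, head_dim, 2) / head_dim
        return base ** (-exponents)

    def rope_angles(positions, head_dim, base=10_000.0):
        """Rotation angles m * theta_i for every position m and frequency pair."""
        return np.outer(positions, rope_frequencies(head_dim, base))

    positions = np.arange(131_072)                                 # 128K positions
    last_10k  = rope_angles(positions, head_dim=128, base=10_000.0)[-1, -1]
    last_500k = rope_angles(positions, head_dim=128, base=500_000.0)[-1, -1]

    # Full rotations of the slowest frequency pair across 128K tokens:
    # roughly 2.4 turns with base 10,000 vs roughly 0.05 turns with base 500,000.
    print(last_10k / (2 * np.pi), last_500k / (2 * np.pi))

Only these frequencies change; the attention computation and the rest of the model are untouched, which is why the change pairs naturally with lightweight continual pretraining.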

·

yaofu (paper author):

I tend to view the contribution as data and data alone: not only the data composition but also the data scale.

When comparing this work with https://arxiv.org/abs/2309.16039, note that a fundamental difference is that we hypothesize the long-context capability is already within the base model, and one only needs very lightweight continual pretraining to unlock it, i.e., only about 5B tokens. This is good news for research and open source.

But https://arxiv.org/abs/2309.16039 (implicitly) holds the opposite belief that the long-context capability is NOT within the base model, and they continually pretrain on 400B tokens. This sends an inaccurate and costly message to the community, as it indicates long context can be as expensive as pretraining.

Consequently, imagine a company trying to build a long-context model. Before our paper, if they followed https://arxiv.org/abs/2309.16039, they might need to spend 128 A100s for two weeks. Knowing our result, they can reduce that to 8 A100s for 5 days. This is a million-dollar cost reduction.

And it already happened :)
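
For a rough sense of scale, the two setups quoted above differ by about 45x in GPU-time; the dollar figure will depend on actual A100 pricing and on how many experimental runs are needed:

    128 A100s x 14 days = 1,792 A100-days
      8 A100s x  5 days =    40 A100-days   (roughly a 45x reduction)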

·

leegao19:

> This sends an inaccurate and costly message to the community, as it indicates long context can be as expensive as pretraining.

I see what you mean, and it's true that the presumption for most has been that it's prohibitively expensive to do context extension via continued pretraining without some architectural changes to RoPE or attention (400B tokens, 100K steps, lots and lots of FLOPs).

I do see an ablation experiment in https://arxiv.org/abs/2309.16039 on the performance of LLaMA 2 over the number of continued training steps (i.e., training tokens), but I think the main thing there is that they're still trying to minimize training loss and they continue to see decreasing perplexity, which you (along with many other folks recently looking at context extension) mention is a poor substitute for downstream long-context performance (i.e., did the model achieve reasonable context extension).

Is the idea here that plateauing training loss (on which the 400B-token budget was chosen) is not a good judge of whether the model has achieved context extension, and that needle-in-a-haystack-style (long-context accuracy) evaluation instead lets us observe context extension significantly (80x) earlier?

·

yaofu (paper author):

I tend to view Needle-in-a-haystack as an entry barrier: you need to get an all-green figure first, then consider more realistic long-context tasks, such as multi-document question answering, where one optimizes user preference. The training loss / NLL does not reveal any of them :)

·

leegao19:

Makes sense, thanks for answering these questions. I think I understand what the paper is aiming for better now.

·

FlynnFlag:

The varying conclusions might be due to the specific strategies used to handle long sequences. Since Meta has not disclosed the detailed criteria for upsampling or discarding long sequences, they may not consider the specific field or scope mentioned in your paper; this is just referred to as "quality matters" in their paper.

Excellent job on your part! I look forward to seeing more research on how to effectively compose data mixes.
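
The needle-in-a-haystack check discussed above is simple to script: hide a short fact at a chosen depth inside long filler text, ask for it back, and mark the (length, depth) cell green if the answer contains the needle. A minimal sketch, with the model call abstracted behind a generate_fn you supply (no particular inference API is assumed, and the needle text is illustrative):

    NEEDLE = "The best thing to do in San Francisco is to sit in Dolores Park on a sunny day."
    QUESTION = "What is the best thing to do in San Francisco?"
    FILLER = "The grass is green. The sky is blue. The sun is bright. "

    def build_prompt(context_tokens, depth, chars_per_token=4):
        """Place the needle at a fractional depth (0.0 = start, 1.0 = end) of a
        haystack roughly context_tokens long, then append the question."""
        n_chars = context_tokens * chars_per_token
        haystack = (FILLER * (n_chars // len(FILLER) + 1))[:n_chars]
        cut = int(depth * len(haystack))
        return haystack[:cut] + " " + NEEDLE + " " + haystack[cut:] + "\n\n" + QUESTION

    def needle_grid(generate_fn, lengths=(8_000, 32_000, 128_000),
                    depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
        """Return {(length, depth): True/False}; True means the needle was retrieved."""
        return {(length, depth): "Dolores Park" in generate_fn(build_prompt(length, depth))
                for length in lengths for depth in depths}

    # Illustrative usage with a dummy "model" that always fails:
    print(needle_grid(lambda prompt: "I don't know."))

An "all-green figure" in the sense used above means every (length, depth) cell comes back True.

Sign up or log in to comment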

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2402.10171 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2402.10171 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2402.10171 in a Space README.md to link it from this page.

Collections including this paper 10
