Thomas Wolf (@thomwolf):
https://huggingface.co/papers/2406.11794 from the amazing @vaishaal and many many co-authors (dataset is also here: https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0)
https://huggingface.co/papers/2406.08446 from the AllenAI team @SaveBertAndGpt @OyvindTafjord @baileyk @JesseDodge

Quentin Lhoest (@lhoestq):
Some Spaces demos to query the dataset in SQL:
- https://huggingface.co/spaces/lhoestq/fineweb-sql
- https://huggingface.co/spaces/lhoestq/fineweb-edu-sql

Daniel van Strien (@davanstrien):
@librarian-bot recommend

Librarian Bot (@librarian-bot):
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* SWEb: A Large Web Dataset for the Scandinavian Languages (https://huggingface.co/papers/2410.04456) (2024)
* InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning (https://huggingface.co/papers/2409.12568) (2024)
* CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models (https://huggingface.co/papers/2410.18505) (2024)
* Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language (https://huggingface.co/papers/2410.23956) (2024)
* Data Processing for the OpenGPT-X Model Family (https://huggingface.co/papers/2410.08800) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
\n","updatedAt":"2024-11-17T19:25:37.485Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7499853372573853},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[{"reaction":"🤝","users":["davanstrien","mrkarol","lhoestq"],"count":3}],"isReport":false,"parentCommentId":"673a432ea7a311f5bee77959"}}]}],"primaryEmailConfirmed":false,"paper":{"id":"2406.17557","authors":[{"_id":"667baf1709faa9d48c05b503","user":{"_id":"62596f9e1c0a084224b93e00","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62596f9e1c0a084224b93e00/X2aLkJ0ofhkXwAg7lXvxD.jpeg","isPro":false,"fullname":"Guilherme Penedo","user":"guipenedo","type":"user"},"name":"Guilherme Penedo","status":"claimed_verified","statusLastChangedAt":"2024-06-26T08:03:02.663Z","hidden":false},{"_id":"667baf1709faa9d48c05b504","user":{"_id":"626ede24d2fa9e7d598c8709","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/626ede24d2fa9e7d598c8709/JKS8-Y2Jw87EgNQZBRswq.jpeg","isPro":false,"fullname":"Hynek Kydlicek","user":"hynky","type":"user"},"name":"Hynek Kydlíček","status":"admin_assigned","statusLastChangedAt":"2024-06-26T09:53:11.720Z","hidden":false},{"_id":"667baf1709faa9d48c05b505","user":{"_id":"61c141342aac764ce1654e43","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/61c141342aac764ce1654e43/81AwoT5IQ_Xdw0OVw7TKu.jpeg","isPro":false,"fullname":"Loubna Ben Allal","user":"loubnabnl","type":"user"},"name":"Loubna Ben allal","status":"claimed_verified","statusLastChangedAt":"2024-06-26T08:36:05.109Z","hidden":false},{"_id":"667baf1709faa9d48c05b506","user":{"_id":"602e6dee60e3dd96631c906e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1613655355830-noauth.png","isPro":false,"fullname":"Anton Lozhkov","user":"anton-l","type":"user"},"name":"Anton Lozhkov","status":"admin_assigned","statusLastChangedAt":"2024-06-26T09:53:20.691Z","hidden":false},{"_id":"667baf1709faa9d48c05b507","user":{"_id":"60c757ea5f9a76ab3f844f12","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1626214544196-60c757ea5f9a76ab3f844f12.png","isPro":false,"fullname":"Margaret Mitchell","user":"meg","type":"user"},"name":"Margaret Mitchell","status":"admin_assigned","statusLastChangedAt":"2024-06-26T09:53:30.450Z","hidden":false},{"_id":"667baf1709faa9d48c05b508","user":{"_id":"6079c29765b9d0165cb18392","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1618592397610-noauth.jpeg","isPro":false,"fullname":"Colin Raffel","user":"craffel","type":"user"},"name":"Colin Raffel","status":"admin_assigned","statusLastChangedAt":"2024-06-26T09:53:36.838Z","hidden":false},{"_id":"667baf1709faa9d48c05b509","user":{"_id":"5e48005437cb5b49818287a5","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/5e48005437cb5b49818287a5/4uCXGGui-9QifAT4qelxU.png","isPro":false,"fullname":"Leandro von Werra","user":"lvwerra","type":"user"},"name":"Leandro Von 
Werra","status":"admin_assigned","statusLastChangedAt":"2024-06-26T09:53:42.769Z","hidden":false},{"_id":"667baf1709faa9d48c05b50a","user":{"_id":"5df7e9e5da6d0311fd3d53f9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1583857746553-5df7e9e5da6d0311fd3d53f9.jpeg","isPro":true,"fullname":"Thomas Wolf","user":"thomwolf","type":"user"},"name":"Thomas Wolf","status":"admin_assigned","statusLastChangedAt":"2024-06-26T09:53:49.151Z","hidden":false}],"publishedAt":"2024-06-25T13:50:56.000Z","submittedOnDailyAt":"2024-06-26T04:44:38.772Z","title":"The FineWeb Datasets: Decanting the Web for the Finest Text Data at\n Scale","submittedOnDailyBy":{"_id":"5ff5d596f244529b3ec0fb89","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1624629516652-5ff5d596f244529b3ec0fb89.png","isPro":false,"fullname":"Philipp Schmid","user":"philschmid","type":"user"},"summary":"The performance of a large language model (LLM) depends heavily on the\nquality and size of its pretraining dataset. However, the pretraining datasets\nfor state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly\navailable and very little is known about how they were created. In this work,\nwe introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl\nsnapshots that produces better-performing LLMs than other open pretraining\ndatasets. To advance the understanding of how best to curate high-quality\npretraining datasets, we carefully document and ablate all of the design\nchoices used in FineWeb, including in-depth investigations of deduplication and\nfiltering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion\ntoken collection of educational text filtered from FineWeb. LLMs pretrained on\nFineWeb-Edu exhibit dramatically better performance on knowledge- and\nreasoning-intensive benchmarks like MMLU and ARC. 
Along with our datasets, we\npublicly release our data curation codebase and all of the models trained\nduring our ablation experiments.","upvotes":97,"discussionId":"667baf1809faa9d48c05b5ce","ai_summary":"FineWeb, a 15-trillion token dataset from Common Crawl snapshots, outperforms other open datasets in LLM training, and its educational subset, FineWeb-Edu, significantly improves performance on knowledge and reasoning benchmarks.","ai_keywords":["large language model","pretraining dataset","FineWeb","Common Crawl","deduplication","filtering strategies","FineWeb-Edu","MMLU","ARC"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"62596f9e1c0a084224b93e00","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62596f9e1c0a084224b93e00/X2aLkJ0ofhkXwAg7lXvxD.jpeg","isPro":false,"fullname":"Guilherme Penedo","user":"guipenedo","type":"user"},{"_id":"5ff5d596f244529b3ec0fb89","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1624629516652-5ff5d596f244529b3ec0fb89.png","isPro":false,"fullname":"Philipp Schmid","user":"philschmid","type":"user"},{"_id":"64054b351a3babee78e67622","avatarUrl":"/avatars/f4d8b92c8c70608b76299a28fedeb83d.svg","isPro":false,"fullname":"wangrui","user":"varuy322","type":"user"},{"_id":"60dc25da6155a8319f008a6f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1630322686754-60dc25da6155a8319f008a6f.jpeg","isPro":false,"fullname":"Wannaphong Phatthiyaphaibun","user":"wannaphong","type":"user"},{"_id":"619507e7b74b6c591f794340","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/619507e7b74b6c591f794340/JbPDoy6Ko1V1-6oJJwFV8.jpeg","isPro":false,"fullname":"Weiyun Wang","user":"Weiyun1025","type":"user"},{"_id":"630b4269e67c604e9b7a429c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/630b4269e67c604e9b7a429c/qsmA2ObMFfLwPIAyveo9F.jpeg","isPro":true,"fullname":"Steffen Röcker","user":"sroecker","type":"user"},{"_id":"62aff8128098f0e9defb8807","avatarUrl":"/avatars/d063a5a601ac2f4219c570febcb1f078.svg","isPro":false,"fullname":"daje kang","user":"daje","type":"user"},{"_id":"64c2998cc3a5b1606549bf56","avatarUrl":"/avatars/ddfd54f4cf90bdd615ef1ab409e26a62.svg","isPro":false,"fullname":"Piotr","user":"piotr-ai","type":"user"},{"_id":"638efcf4c67af472d316d424","avatarUrl":"/avatars/97a57859d7d87a3a8f1bb41d32a72bc2.svg","isPro":false,"fullname":"Ge Zhang","user":"zhangysk","type":"user"},{"_id":"60f2fc91b92afccb7c34b8ed","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60f2fc91b92afccb7c34b8ed/W2-Nay12Ef4Ltyaf8EKE9.jpeg","isPro":false,"fullname":"Gabriel Martín Blázquez","user":"gabrielmbmb","type":"user"},{"_id":"6464ad8634037a50516ad38d","avatarUrl":"/avatars/1a82d81fa0ff5c7aaa392c3e2b55f4b2.svg","isPro":false,"fullname":"Leon","user":"Leon-Leee","type":"user"},{"_id":"6459fa0f5b3111fbe83286e1","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6459fa0f5b3111fbe83286e1/E6Buqu8Wd9WmIHKOCZXCc.jpeg","isPro":false,"fullname":"Louis Brulé Naudet","user":"louisbrulenaudet","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":1}">

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, Thomas Wolf
Published on Jun 25, 2024
#1 Paper of the day
Abstract
FineWeb, a 15-trillion token dataset from Common Crawl snapshots, outperforms other open datasets in LLM training, and its educational subset, FineWeb-Edu, significantly improves performance on knowledge and reasoning benchmarks.
The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments.
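
The abstract notes that the datasets are publicly released. As a hedged sketch (the repository id below is an assumption, not given in the abstract), one way to peek at a few documents without downloading the full corpora is to stream them with the datasets library:

```python
# Minimal sketch: stream a handful of FineWeb documents with the `datasets`
# library. The repository id is an assumption about where the released data
# lives on the Hub; streaming=True avoids downloading the multi-terabyte corpus.
from datasets import load_dataset

fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

for i, doc in enumerate(fineweb):
    print(doc["url"], "-", doc["text"][:120].replace("\n", " "))
    if i == 4:  # only peek at the first five documents
        break
```

The same pattern would apply to the FineWeb-Edu subset by swapping in its dataset repository.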