Proof (via ablation studies) that an education-focused dataset significantly enhances model capabilities, independent of model size or architecture 🤩
Yesterday, FineWeb’s technical report was published. FYI: FineWeb (by 🤗) is currently the best open-source text dataset, capable of scaling model performance up to GPT-3 level.
While the proprietary datasets used to train models like GPT-4/Claude/LLaMA are crawled internally and never released, FineWeb builds on CommonCrawl (an open repository of crawled web data). The team preprocessed the data with their custom-built library datatrove (which they also open-sourced), then evaluated data quality on lighteval by training small “ablation models” with nanotron (a library for pretraining transformer models).
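If you just want to poke at the data, here’s a minimal sketch of streaming a few documents with the 🤗 datasets library. The Hub id "HuggingFaceFW/fineweb" and the "sample-10BT" config name are my assumptions about the released artifacts, not details quoted from the report.

```python
# Hedged sketch: stream a small slice of FineWeb without downloading the full corpus.
from datasets import load_dataset

fw = load_dataset(
    "HuggingFaceFW/fineweb",   # assumed Hub id of the released dataset
    name="sample-10BT",        # assumed name of a small sample config
    split="train",
    streaming=True,            # iterate lazily instead of downloading everything
)

for doc in fw.take(3):
    print(doc["text"][:200])   # each record carries the raw web text
```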
Of all the FineWeb subsets, FineWeb-Edu performs best. This is thanks to a new filtering technique: they used synthetic annotations to train a classifier that identifies educational content.
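For intuition, here’s a hedged sketch of scoring a document for “educational quality” with a sequence-classification model. The model id "HuggingFaceFW/fineweb-edu-classifier", the single regression-style score, and the keep threshold of 3 are my assumptions for illustration, not specifics taken from the report.

```python
# Hedged sketch: score one document for educational quality and apply a keep/drop filter.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "HuggingFaceFW/fineweb-edu-classifier"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Photosynthesis converts light energy into chemical energy in plants."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()  # higher = more educational

keep = score >= 3  # assumed threshold for keeping a document in the Edu subset
print(f"educational score: {score:.2f}, keep: {keep}")
```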
Turns out “Education is All You Need” :)