arxiv:2410.07064

Data Selection via Optimal Control for Language Models

Published on Oct 9 · Submitted by t1101675 on Oct 10

Authors: Yuxian Gu, Li Dong, Hongning Wang, Qingxiu Dong, Furu Wei

Abstract

This work investigates the selection of high-quality pre-training data from massive corpora to enhance LMs' capabilities for downstream usage. We formulate data selection as a generalized Optimal Control problem, which can be solved theoretically by Pontryagin's Maximum Principle (PMP), yielding a set of necessary conditions that characterize the relationship between optimal data selection and LM training dynamics. Based on these theoretical results, we introduce PMP-based Data Selection (PDS), a framework that approximates optimal data selection by solving the PMP conditions. In our experiments, we adopt PDS to select data from CommonCrawl and show that the PDS-selected corpus accelerates the learning of LMs and consistently boosts their performance on a wide range of downstream tasks across various model sizes. Moreover, the benefits of PDS extend to ~400B models trained on ~10T tokens, as evidenced by the extrapolation of the test loss curves according to the Scaling Laws. PDS also improves data utilization when the pre-training data is limited, reducing the data demand by a factor of 1.8, which mitigates the quick exhaustion of available web-crawled corpora. Our code, data, and model checkpoints can be found at https://github.com/microsoft/LMOps/tree/main/data_selection.
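
For readers who want the shape of the formulation, the block below sketches a discrete-time control problem of the kind the abstract describes, together with the PMP-style necessary conditions it leads to. All notation is an assumption made for illustration and is not taken from this page: γ is a vector of per-example data weights restricted to a feasible set U, θ_t are the model parameters under gradient descent with step size η, ℓ is the per-example training loss, J is a downstream loss, and λ_t is the co-state.

```latex
\begin{aligned}
&\text{Problem:}\quad
  \min_{\gamma \in U} \sum_{t=1}^{T} J(\theta_t)
  \quad \text{s.t.} \quad
  \theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t, \gamma),
  \qquad L(\theta, \gamma) = \sum_{n} \gamma_n \, \ell(x_n, \theta) \\[4pt]
&\text{State:}\quad
  \theta^*_{t+1} = \theta^*_t - \eta \nabla_\theta L(\theta^*_t, \gamma^*) \\[4pt]
&\text{Co-state:}\quad
  \lambda^*_t = \lambda^*_{t+1} + \nabla_\theta J(\theta^*_t)
  - \eta \nabla^2_\theta L(\theta^*_t, \gamma^*) \, \lambda^*_{t+1},
  \qquad \lambda^*_T = \nabla_\theta J(\theta^*_T) \\[4pt]
&\text{Control:}\quad
  \gamma^* = \arg\max_{\gamma \in U}
  \sum_{t=0}^{T-1} \sum_{n} \gamma_n \,
  {\lambda^*_{t+1}}^{\top} \nabla_\theta \ell(x_n, \theta^*_t)
\end{aligned}
```

Read this way, an example receives a large weight when its gradient along the training trajectory aligns with the co-state, which accumulates the sensitivity of all future downstream losses to the current parameters.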


Community

t1101675

Paper author · Paper submitter · 28 days ago (edited 28 days ago)

TL;DR:

We provide a novel theoretical perspective on data selection by formulating the problem as Optimal Control, which can be rigorously solved using Pontryagin's Maximum Principle (PMP).
Based on the theoretical results, we derive a scalable data selection framework, PMP-based Data Selection (PDS), to select pre-training data for LMs. PDS has a strong theoretical basis, offering an alternative to the ad-hoc trial-and-error practices that currently dominate LM pre-training.
Experiments show that PDS boosts LMs' downstream performance, saves pre-training computation, and improves pre-training data utilization. The benefits extend to ~400B LMs trained on ~10T tokens (the scale of LLaMA3.1), as evidenced by extrapolation with the Scaling Laws.
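
To make the idea concrete, here is a minimal toy sketch, not the released PDS code. It assumes a linear regression model in place of an LM, a single scoring pass with uniform data weights, and a small clean dev set standing in for the downstream objective. It runs gradient descent forward, propagates a co-state backward through the trajectory, and scores each training example by how well its gradient aligns with that co-state. The function name `pmp_style_scores` and all hyperparameters are hypothetical.

```python
import numpy as np

def pmp_style_scores(X, y, X_dev, y_dev, eta=0.1, steps=100):
    """Score training examples on a toy linear-regression problem.

    Assumption: per-example loss is 0.5 * (x . theta - y)^2 and the
    forward pass uses uniform data weights gamma_n = 1/n.
    """
    n, d = X.shape
    gamma = np.full(n, 1.0 / n)

    # Forward pass: gradient descent on the gamma-weighted training loss.
    theta = np.zeros(d)
    trajectory = [theta.copy()]
    for _ in range(steps):
        residual = X @ theta - y
        theta = theta - eta * (X.T @ (gamma * residual))
        trajectory.append(theta.copy())

    # Hessian of the weighted training loss (constant for linear regression).
    hessian = X.T @ (gamma[:, None] * X)

    def grad_dev(th):
        # Gradient of the mean squared-error dev loss (the downstream objective).
        return X_dev.T @ (X_dev @ th - y_dev) / len(y_dev)

    # Backward pass: co-state recursion, accumulating per-example scores.
    lam = grad_dev(trajectory[steps])  # co-state at the final step
    scores = np.zeros(n)
    for t in range(steps - 1, -1, -1):
        # Here lam holds the co-state for step t + 1.
        per_example_grads = (X @ trajectory[t] - y)[:, None] * X  # shape (n, d)
        scores += per_example_grads @ lam
        lam = lam + grad_dev(trajectory[t]) - eta * (hessian @ lam)
    return scores

# Toy data: 200 examples, the first 40 with sign-flipped (corrupted) labels.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X = rng.normal(size=(200, 5))
y = X @ w_true
corrupted = np.zeros(200, dtype=bool)
corrupted[:40] = True
y[corrupted] = -y[corrupted]
X_dev = rng.normal(size=(100, 5))
y_dev = X_dev @ w_true

scores = pmp_style_scores(X, y, X_dev, y_dev)
selected = np.argsort(scores)[::-1][:100]  # keep the 100 highest-scoring examples
print("mean score, clean:    ", scores[~corrupted].mean())
print("mean score, corrupted:", scores[corrupted].mean())
```

In this toy setting, examples whose labels agree with the dev objective should score higher on average than the corrupted ones, so a top-k cut on the scores acts as a data filter. At LM scale the exact Hessian and full-trajectory passes used here are not practical, so any real system has to approximate them; the sketch only illustrates the scoring principle.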