Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
In this paper[1], the authors introduce Web Rephrase Augmented Pre-training (WRAP), a method for making language model training more compute- and data-efficient by rephrasing web documents into styles such as Wikipedia articles or question-answer formats. The approach addresses the challenge of learning from noisy, unstructured web data, which typically requires significant compute and data resources.
Method Overview
WRAP uses an instruction-tuned model to rephrase web documents into various styles, creating synthetic data. Here's an overview of the method:
WRAP overview
This enables efficient learning from a blend of real and synthetic data and reduces the dependence on scarce, naturally high-quality web data. The process involves prompting a pre-trained, instruction-tuned LLM to generate paraphrases of web documents and combining these with the real data for model training; a minimal sketch of the core rephrasing step follows.
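To make the rephrasing step concrete, here is a minimal sketch that feeds one web document plus a style instruction to an instruction-tuned model and returns the paraphrase. The checkpoint name, prompt format, and decoding settings are illustrative assumptions, not the paper's exact configuration:

```python
# Minimal sketch of the core WRAP operation: one web document in, one paraphrase out.
# The checkpoint, prompt format, and decoding settings below are assumptions for
# illustration, not the paper's exact setup.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.1",  # any instruction-tuned LLM can stand in here
    device_map="auto",
)

def rephrase(document: str, style_instruction: str, max_new_tokens: int = 512) -> str:
    """Ask the instruction-tuned model to paraphrase a web document in a given style."""
    prompt = f"[INST] {style_instruction}\n\n{document} [/INST]"
    output = generator(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    # The pipeline echoes the prompt by default; keep only the newly generated text.
    return output[0]["generated_text"][len(prompt):].strip()
```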
Building on the observation that high-quality data, like Wikipedia, improves language modeling, WRAP employs a strategy to rephrase web documents into four distinct styles:
Easy - understandable even by a toddler
Medium - similar to Wikipedia articles
Hard - in terse and abstruse language
Q/A - in question-answering format
The prompts for each style are shown below:
Prompt templates for the 4 styles
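In code, the four style instructions might look like the following. The wording paraphrases the style descriptions above and is not the paper's verbatim prompt text:

```python
# Illustrative style instructions, one per rephrasing style.
# Paraphrased from the style descriptions above; not the paper's verbatim prompts.
STYLE_PROMPTS = {
    "easy": (
        "Paraphrase the following paragraph using a very small vocabulary and "
        "extremely simple sentences that even a toddler could understand."
    ),
    "medium": (
        "Paraphrase the following paragraph in high-quality English, written in the "
        "style of sentences from Wikipedia."
    ),
    "hard": (
        "Paraphrase the following paragraph using terse and abstruse language that "
        "only an erudite scholar would understand."
    ),
    "qa": (
        "Convert the following paragraph into a question-answering format, with "
        "multiple 'Question:' and 'Answer:' turns."
    ),
}
```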
WRAP generates the synthetic data with an instruction-tuned model, specifically Mistral-7B, and then combines it with real web data in a 1:1 ratio. The resulting mixture captures both the diversity (and realistic messiness) of internet content and the quality of the structured rephrasings, giving the model a rich training set; a sketch of the mixing step is shown below.
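One simple way to realize the 1:1 mix is to pair the real document pool with an equally sized pool of rephrases and shuffle them together. The sketch below is an assumption about the sampling scheme, not the paper's exact procedure; `rephrase` and `STYLE_PROMPTS` refer to the sketches above, and `load_c4_subset` is a hypothetical loader:

```python
import random

def build_wrap_mixture(real_docs, synthetic_docs, seed=0):
    """Combine real web documents and synthetic rephrases in a 1:1 ratio.

    Assumes equally sized pools; with unequal pools the smaller one would be
    upsampled to preserve the ratio.
    """
    n = min(len(real_docs), len(synthetic_docs))
    mixture = real_docs[:n] + synthetic_docs[:n]
    random.Random(seed).shuffle(mixture)
    return mixture

# Example: rephrase each real document in one style, then mix 1:1 for training.
# real_docs = load_c4_subset()                                      # hypothetical loader
# synthetic_docs = [rephrase(d, STYLE_PROMPTS["medium"]) for d in real_docs]
# training_corpus = build_wrap_mixture(real_docs, synthetic_docs)
```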
Results
Applying WRAP to the C4 dataset sped up pre-training by roughly 3x and improved model perplexity by more than 10% on average across various subsets of the Pile.
C4 WRAP results
It also enhanced zero-shot question-answering accuracy across 13 tasks by more than 2%.
WRAP results on various tasks
Conclusion
WRAP demonstrates significant improvements in the efficiency and effectiveness of language model training by leveraging synthetic rephrases of web data. For more details, please consult the full paper.
Congrats to the authors for their work!
[1] Maini, Pratyush, et al. "Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling." arXiv preprint arXiv:2401.16380 (2024).