---
tags:
- generated_from_trainer
model-index:
- name: out
  results: []
datasets:
- roneneldan/TinyStories
pipeline_tag: text-generation
language:
- en
---

# TinyStories-GPT2-3M

This is a tiny GPT-2 model (3M trainable parameters) pre-trained for 3 epochs on the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) V2 dataset.

## Model description

TinyStories-GPT2-3M is a replication of the TinyStories model that uses the GPT-2 architecture in place of GPT-Neo. This was a deliberate choice to accelerate research, as the GPT-2 architecture is more widely supported across tooling. We do not contribute any performance improvements of note, but, like the original model, we find a surprising degree of coherence in its output given its size.
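
For a quick check, the model can be loaded with the standard `transformers` text-generation APIs. The sketch below is illustrative only: `"./out"` is a placeholder matching the `--output_dir` of the training command further down, so substitute the actual checkpoint path or Hub repository id.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint location; "./out" matches --output_dir in the
# training command below. Replace with the real path or Hub repo id.
checkpoint = "./out"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "Once upon a time there was a little girl named Lucy."
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a short continuation; training used block_size=256, so keep the
# prompt plus generation comfortably below that length.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```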

## Intended uses & limitations

Research use only. NOT suitable for commercial use, per OpenAI's Terms of Service covering the use of their APIs to source training data.

Note that the vocabulary this model was trained on is quite minimal. Out-of-distribution inputs are not handled as well as they would be by a larger, more general-purpose model. To observe this behaviour, try generating a few tokens after a non-trivial word like "Biology". The model typically treats words that did not appear frequently in training as character names in a story.

All training data is English. As such, input in other languages is out of distribution and will result in the model treating the preceding input as character names, ignoring it entirely, or generating meaningless tokens.
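
One way to observe this behaviour is to compare a typical story opening with an out-of-distribution prompt; the snippet below reuses the same placeholder checkpoint path as the quick-start sketch above.

```python
from transformers import pipeline

# Same placeholder checkpoint location as above; replace as needed.
generator = pipeline("text-generation", model="./out")

prompts = [
    "Once upon a time, a little dog found a red ball.",  # in-distribution
    "Biology",  # rarely (if ever) seen during training
]
for prompt in prompts:
    text = generator(prompt, max_new_tokens=40, do_sample=True)[0]["generated_text"]
    print(f"--- {prompt!r}\n{text}\n")
```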

## Training and evaluation data

Trained for 3 epochs on the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) V2 dataset, produced by GPT-4.
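
The V2 splits are distributed as plain-text files in the dataset repository. A hedged sketch for fetching them (file names taken from the training command below; verify them against the dataset repo) could look like:

```python
from huggingface_hub import hf_hub_download

# File names are assumed to match those referenced in the training command;
# check the roneneldan/TinyStories dataset repository if the download fails.
for filename in ("TinyStoriesV2-GPT4-train.txt", "TinyStoriesV2-GPT4-valid.txt"):
    path = hf_hub_download(
        repo_id="roneneldan/TinyStories",
        filename=filename,
        repo_type="dataset",
        local_dir="data",
    )
    print(path)
```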

## Training procedure

Trained for 400k steps (~7 hours) on 2x H100 80GB PCIe GPUs with 32 vCPUs and 500 GB RAM on Runpod.

To replicate, download the GPT-4 V2 version of the TinyStories dataset alongside Hugging Face's `train_clm.py` script, then run the following:
```bash
#!/bin/bash

python train_clm.py \
    --model_type=gpt2 \
    --config_overrides=n_embd=64,n_layer=8,n_head=16 \
    --tokenizer_name=gpt2 \
    --train_file="data/TinyStoriesV2-GPT4-train.txt" \
    --validation_file="data/TinyStoriesV2-GPT4-valid.txt" \
    --block_size=256 \
    --preprocessing_num_workers=8 \
    --output_dir="out" \
    --logging_dir="./log" \
    --logging_steps=100 \
    --logging_strategy=steps \
    --save_steps=5000 \
    --save_total_limit=10 \
    --do_train
```

### Training hyperparameters

The following hyperparameters were used during training:

- n_embd: 64
- n_layer: 8
- n_head: 16
- learning_rate: 5e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0
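
For reference, the architecture overrides above correspond to constructing a GPT-2 config directly; the sketch below (an illustration, not part of the training script) builds that config and prints the resulting parameter counts.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Equivalent of --config_overrides=n_embd=64,n_layer=8,n_head=16;
# every other field keeps the default GPT-2 value.
config = GPT2Config(n_embd=64, n_layer=8, n_head=16)
model = GPT2LMHeadModel(config)

total = sum(p.numel() for p in model.parameters())
embedding = model.transformer.wte.weight.numel() + model.transformer.wpe.weight.numel()
print(f"total parameters: {total:,}")
print(f"non-embedding parameters: {total - embedding:,}")
```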

### Framework versions

- Transformers 4.35.0.dev0
- Pytorch 2.0.1+cu118
- Datasets 2.14.5
- Tokenizers 0.14.1