---
license: mit
---

# Pythia 12B SFT

This model card aims to be a base template for new models. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1).

# Model Details

## Model Description

- **Developed by:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** [More Information Needed]

## Model Sources [optional]

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

# Uses

## Direct Use

[More Information Needed]

## Downstream Use [optional]

[More Information Needed]

## Out-of-Scope Use

[More Information Needed]

# Bias, Risks, and Limitations

[More Information Needed]

## Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]
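A minimal sketch using 🤗 Transformers, assuming the checkpoint is published as a standard causal language model. The model id below is a placeholder (the base model, not this SFT checkpoint) and the prompt is illustrative only.

```python
# Minimal sketch: load the model and generate text with 🤗 Transformers.
# NOTE: "EleutherAI/pythia-12b-deduped" is the *base* model, used here as a
# placeholder id; replace it with the actual SFT checkpoint id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-12b-deduped"  # placeholder, not the SFT checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # the card reports fp16 training
    device_map="auto",          # requires `accelerate`
)

prompt = "Explain what supervised fine-tuning (SFT) does to a base language model."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that fine-tuning used a max sequence length of 512–520 tokens (see the hyperparameters below), so very long prompts fall outside the training regime.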
# Training Details

## Training Data

Training data includes the 2023-02-10 OpenAssistant unfiltered conversation tree dump.

## Training Procedure

Fine-tuning was run with `trainer_sft.py` under DeepSpeed:

```
deepspeed trainer_sft.py --configs defaults pythia-80 --deepspeed
```

### Preprocessing [optional]

[More Information Needed]

### Training Hyperparameters

Training used DeepSpeed ZeRO stage 2. The trainer configuration (the `defaults` block, with `pythia-80` overrides) is as follows:

```
defaults:
  learning_rate: 1e-5
  gradient_checkpointing: false
  gradient_accumulation_steps: 32
  per_device_train_batch_size: 2
  per_device_eval_batch_size: 2
  weight_decay: 0.00
  warmup_steps: 600
  eval_steps: 250
  save_steps: 250
  max_length: 512
  num_train_epochs: 2
  logging_steps: 10
  max_grad_norm: 2.0
  save_total_limit: 4
  fp16: true
  eval_accumulation_steps:
  freeze_layer:
  datasets:
    - gsm8k_hard
    - webgpt
    - squad_v2
    - adversarial_qa
    - private_tuning
    - oa_translated
    - prosocial_dialogue
    - math_qa
    - wikihow
    - joke
    - gsm8k
    - ted_trans_en-hi
    - ted_trans_de-ja
    - ted_trans_nl-en
    - ted_trans_en-ja
    - ted_trans_en-es
    - ted_trans_en-ms
    - xsum:
        fraction: 0.5
    - cnn_dailymail:
        fraction: 0.5
    - multi_news:
        fraction: 0.5
    - tldr_news:
        fraction: 0.5
    - scitldr:
        fraction: 0.5
    - samsum:
        fraction: 0.5
    - debate_sum:
        fraction: 0.5
    - billsum:
        fraction: 0.5
    - wmt2019_zh-en:
        fraction: 0.9
    - wmt2019_ru-en:
        fraction: 0.9
    - wmt2019_de-en:
        fraction: 0.9
    - wmt2019_fr-de:
        fraction: 0.9
    - essay_instruction
    - reddit_eli5
    - reddit_askh
    - reddit_asks
  loss_fn: CrossEntropyLoss
  log_dir: "base"
  quantization: false
  seq2seqmodel: false
  poly_eps: 1.0
  fuse_gelu: true
  log_wandb: true
  samples_mixing: true # uses a collator that mixes samples in the batch into a single sample, possibly containing multiple tasks
  verbose: false

pythia-80:
  learning_rate: 5e-6
  model_name: EleutherAI/pythia-12b-deduped
  weight_decay: 0.01
  max_length: 520
  warmup_steps: 1000
  gradient_checkpointing: false
  gradient_accumulation_steps: 20
  per_device_train_batch_size: 6
  per_device_eval_batch_size: 6
```

### Speeds, Sizes, Times [optional]

[More Information Needed]

# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

[More Information Needed]

### Factors

[More Information Needed]

### Metrics

[More Information Needed]

## Results

[More Information Needed]

### Summary

# Model Examination [optional]

[More Information Needed]

# Technical Specifications [optional]

## Model Architecture and Objective

Supervised fine-tune of the Pythia 12B deduped model (EleutherAI/pythia-12b-deduped), trained with a cross-entropy causal language modeling objective.

## Compute Infrastructure

Stability AWS Slurm cluster.

### Hardware

8 x A100 80GB

### Software

[More Information Needed]

# Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

# Glossary [optional]

[More Information Needed]

# More Information [optional]

[More Information Needed]

# Model Card Authors [optional]

[More Information Needed]

# Model Card Contact

[More Information Needed]