---
language:
- en
- ko
license: llama3
library_name: transformers
datasets:
- legacy-datasets/wikipedia
pipeline_tag: text-generation
---

## Model Details

This model was continually pretrained from [Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) using English and Korean datasets. The goal is to enhance its proficiency in Korean while maintaining the English capabilities of the original model.

### Datasets

We sampled 16B tokens from the following datasets for training:
| Sources | Tokens (Llama-3-8B tokenizer) |
|---------|------------------------------:|
| AI-Hub | 9.2B |
| Modu Corpus | 5.8B |
| Wikipedia | 5.4B |
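The exact sampling pipeline is not described in this card. As a rough illustration only, the sketch below streams the Wikipedia source listed in the metadata and counts Llama-3 tokens up to a target budget; the dump/config name is an assumption.

```python
# Illustrative sketch only -- the card does not describe the exact sampling pipeline.
# The dump/config name ("20220301.en") is an assumption, and newer `datasets`
# releases may require trust_remote_code=True for this script-based dataset.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Stream the Wikipedia source listed in the metadata and count Llama-3 tokens
# until a target budget is reached.
stream = load_dataset("legacy-datasets/wikipedia", "20220301.en",
                      split="train", streaming=True)

target_tokens = 5_400_000_000  # ~5.4B tokens, matching the Wikipedia row above
seen = 0
for example in stream:
    seen += len(tokenizer(example["text"], add_special_tokens=False)["input_ids"])
    if seen >= target_tokens:
        break
print(f"collected ~{seen:,} Llama-3 tokens")
```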
### Hyperparameters
| Learning rate | Optimizer | Betas | Weight decay | Warm-up ratio |
|--------------:|-----------|-------|-------------:|--------------:|
| 3e-5 | AdamW | (0.9, 0.95) | 0.1 | 0.05 |
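For reference, the hyperparameters above map onto `transformers.TrainingArguments` roughly as follows. Batch size, precision, and the output directory are assumptions, since the card does not specify them.

```python
# A minimal sketch of how the listed hyperparameters map onto
# `transformers.TrainingArguments`. Batch size, precision, and the output
# directory are assumptions -- the card does not specify them.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama3-en-ko-cpt",    # hypothetical output path
    optim="adamw_torch",                # AdamW
    learning_rate=3e-5,
    adam_beta1=0.9,
    adam_beta2=0.95,
    weight_decay=0.1,
    warmup_ratio=0.05,
    bf16=True,                          # assumption: bf16 mixed precision
    per_device_train_batch_size=1,      # assumption: not stated in the card
    gradient_accumulation_steps=16,     # assumption: not stated in the card
)
```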
## Intended Use

This model has not been fine-tuned, so you will need to train it on your own dataset before using it.
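As a starting point, here is a minimal fine-tuning sketch using the Hugging Face `Trainer`. The repository id, the `train.jsonl` file, and all training settings are placeholders rather than values from this card; adapt them to your own data and hardware.

```python
# A minimal fine-tuning sketch with the Hugging Face Trainer. The repository id,
# the train.jsonl file, and all training settings are placeholders rather than
# values from this card; adapt them to your own data and hardware.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "your-org/your-llama3-en-ko-base"  # placeholder repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama-3 ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Replace with your own instruction or domain data.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

def tokenize(batch):
    # 4k max length matches the context length used during continual pretraining.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./finetuned",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```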
## Evaluations

We evaluated this model on both English and Korean benchmarks and compared it with similar models that were continually pretrained from [Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B). MMLU, HellaSwag, GSM8K, and BBH are English benchmarks; KMMLU, HAE-RAE, and KoBEST are Korean benchmarks.

| Model | MMLU (5-shot) | HellaSwag (10-shot) | GSM8K (8-shot, CoT) | BBH (3-shot, CoT) | KMMLU (5-shot) | HAE-RAE (5-shot) | KoBEST (5-shot) |
|-------|--------------:|--------------------:|--------------------:|-------------------:|---------------:|----------------:|----------------:|
| meta-llama/Meta-Llama-3-8B | 65.1 | 82.1 | 52.0 | 61.9 | 40.2 | 61.1 | 69.2 |
| saltlux/Ko-Llama3-Luxia-8B | 57.1 | 77.1 | 32.3 | 51.8 | 39.4 | 69.2 | 71.9 |
| beomi/Llama-3-Open-Ko-8B | 56.2 | 77.4 | 31.5 | 46.8 | 40.3 | 68.1 | 72.1 |
| beomi/Llama-3-KoEn-8B | 52.5 | 77.7 | 21.2 | 43.2 | 40.8 | 71.3 | 73.8 |
| tesser-ai/Tesser-Llama-3-Ko-8B | 60.5 | 79.8 | 40.3 | 56.3 | 42.5 | 72.1 | 73.8 |
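Scores in this style of table are commonly obtained with EleutherAI's lm-evaluation-harness. The sketch below is not necessarily the authors' exact setup: task names, few-shot counts, and prompt formats vary between harness versions, and the repository id is a placeholder.

```python
# A rough sketch of approximating one of the English scores with EleutherAI's
# lm-evaluation-harness (v0.4+). Not the authors' exact evaluation setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/your-llama3-en-ko-base,dtype=bfloat16",
    tasks=["mmlu"],    # Korean tasks such as "kmmlu" or "kobest" depend on your harness version
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```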
## Limitations

We trained this model with a context length of 4k due to resource limitations and to maximize training speed. However, the original model was trained with a context length of 8k, so an 8k context length could still work well on downstream tasks.

## License

This model follows the original [Llama-3 license](https://llama.meta.com/llama3/license/).