---
datasets:
- infCapital/vnnews_corpus_100K
language:
- vi
---
## Base Model: LLaMA-2 7B Chat HF
+ Extended the vocabulary to 44,800 tokens for better Vietnamese coverage
+ Continually pre-trained on >2B Vietnamese tokens
+ Training profile: LoRA (rank=32, alpha=128, fp16), 1 epoch, block size = 512. Took ~300 GPU hours on an RTX 4090 24GB
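As a rough sketch of what the LoRA settings above imply, assuming the adapters target the four attention projections of LLaMA-2 7B (hidden size 4096, 32 layers) — the target modules and shapes are assumptions, not stated in this card:

```python
# LoRA adds low-rank factors A (r x d_in) and B (d_out x r) to each adapted
# weight, so trainable params per projection = r * (d_in + d_out).
rank, alpha = 32, 128
hidden, layers, attn_projs = 4096, 32, 4  # assumed LLaMA-2 7B attention shapes

scaling = alpha / rank                     # LoRA scales the update by alpha / r
per_proj = rank * (hidden + hidden)        # A and B for one 4096x4096 projection
trainable = per_proj * attn_projs * layers

print(f"scaling = {scaling}")                        # 4.0
print(f"trainable params ~ {trainable / 1e6:.1f}M")  # ~33.6M vs ~6.7B base weights
```

Even under these assumptions, only a few tens of millions of parameters are updated, which is what makes a 7B continual pre-train feasible on a single 24GB consumer GPU.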
## Suitable for
+ Further training / fine-tuning for Vietnamese tasks