Model Card for the Danoliterate Baseline 7B Model
A base model with the same architecture as Llama 2 7B, but trained from scratch on a combination of Danish datasets for 20K updates (655M tokens).
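As a rough sanity check on the figures above, the implied number of tokens per update can be worked out directly. This is only a sketch: the card gives rounded totals, and the power-of-two batch size used below is an assumption, not something the card states.

```python
# Figures quoted in the card (both presumably rounded).
updates = 20_000
total_tokens = 655_000_000

# Implied average number of tokens consumed per optimizer update.
tokens_per_update = total_tokens / updates

# Assumption (not stated in the card): the true batch was a power of two,
# 2**15 = 32,768 tokens, which would round to the quoted 655M total.
assumed_batch_tokens = 2 ** 15
implied_total = updates * assumed_batch_tokens

print(tokens_per_update)  # 32750.0
print(implied_total)      # 655360000
```

Under that assumption the totals agree to within rounding; the thesis sections referenced in this card are the authoritative source for the actual batch configuration.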
Model Details
Model Description
This is a test model developed as part of the thesis Are GLLMs Danoliterate? Benchmarking Generative NLP in Danish; the relevant details are in Sections 4.1, 5.1 and 6.1.
- Developed by: Søren Vejlgaard Holm under supervision from Lars Kai Hansen and Martin Carsten Nielsen.
- Model type: Base, autoregressive LLM with the Llama 2 7B architecture.
- Language(s) (NLP): Danish
- License: MIT
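To make the "7B" in the model type concrete, the parameter count of this architecture can be estimated from its layer shapes. The sketch below assumes the published Llama 2 7B hyperparameters, including its 32,000-entry vocabulary; the card only says the architecture is the same, so the vocabulary size in particular may differ if another tokenizer was used. The function name is hypothetical.

```python
# Assumed hyperparameters: the published Llama 2 7B configuration.
# The vocabulary size is an assumption; this card does not state it.
VOCAB_SIZE = 32_000
HIDDEN_SIZE = 4_096
INTERMEDIATE_SIZE = 11_008
NUM_LAYERS = 32


def llama_param_count(vocab: int, hidden: int, intermediate: int, layers: int) -> int:
    """Count the learned weights of a Llama-style decoder without biases."""
    embed = vocab * hidden            # input token embeddings
    attention = 4 * hidden * hidden   # q, k, v and output projections
    mlp = 3 * hidden * intermediate   # gate, up and down projections
    norms = 2 * hidden                # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    final_norm = hidden               # RMSNorm before the output head
    lm_head = vocab * hidden          # untied output projection
    return embed + layers * per_layer + final_norm + lm_head


total_params = llama_param_count(VOCAB_SIZE, HIDDEN_SIZE, INTERMEDIATE_SIZE, NUM_LAYERS)
print(f"{total_params:,}")  # 6,738,415,616
```

With these assumed values the count comes to roughly 6.7 billion weights, which is what the "7B" label refers to.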
Uses
This model is strictly a research artifact for investigating the effect of pre-training a model from scratch and is not intended to be applied directly.
Bias, Risks, and Limitations
The model has been trained on a large corpus of uncurated internet content and can thus possibly generate problematic content.
Training Details
Training Data
The pretraining mix contained the Danish Gigaword and Danish Reddit corpora, as compiled by the Danish Data Science Community, as well as the Danish subset of CulturaX. For more details, see Section 4.1 in the thesis.
Training Procedure
See Sections 5.1 and 6.1 in the thesis.
Evaluation
On the Danoliterate LLM Benchmark, this model gets an index score of 13 as of June 2024.
Model Card Contact
Contact Søren Vejlgaard Holm at [email protected] or [email protected].