language: Swedish
license: apache-2.0
KB-BERT distilled base model (cased)
This model is a distilled version of KB-BERT. It was distilled on Swedish data: the 2010-2015 portion of the Swedish Culturomics Gigaword Corpus. The code for the distillation process can be found here. The work was done as part of my Master's thesis, "Task-agnostic knowledge distillation of mBERT to Swedish".
Model description
This is a 6-layer version of KB-BERT, distilled using the LightMBERT distillation method but without freezing the embedding layer.
Intended uses & limitations
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task.
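As a rough illustration of the masked language modeling use mentioned above, the following sketch queries the model through the `fill-mask` pipeline of the Hugging Face `transformers` library. The value of `MODEL_ID` is a placeholder, not this model's real identifier, and the helper names `insert_mask` and `predict_masked` are hypothetical names introduced for this example only.

```python
from typing import List, Tuple

# Placeholder, NOT a real identifier: replace with this model's Hub ID or a local path.
MODEL_ID = "path/to/this-distilled-kb-bert"


def insert_mask(template: str, mask_token: str) -> str:
    """Replace the single '{mask}' placeholder in the template with the mask token."""
    if template.count("{mask}") != 1:
        raise ValueError("template must contain exactly one '{mask}' placeholder")
    return template.replace("{mask}", mask_token)


def predict_masked(
    template: str, model_id: str = MODEL_ID, top_k: int = 5
) -> List[Tuple[str, float]]:
    """Return the top_k (token, score) candidates for the masked position."""
    # Imported lazily so that insert_mask works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token instead of hard-coding it.
    text = insert_mask(template, fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(text, top_k=top_k)]
```

Once `MODEL_ID` points at the actual model, calling `predict_masked("Stockholm är huvudstaden i {mask}.")` should return a list of candidate tokens with their scores. For most real applications, fine-tuning on a downstream task is still the intended route.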
Training data
The data used for distillation was the 2010-2015 portion of the Swedish Culturomics Gigaword Corpus. The tokenized data had a file size of approximately 7.4 GB.
Evaluation results
When evaluated on the SUCX 3.0 dataset, the model achieved an average F1 score of 0.887, which is competitive with the 0.894 obtained by KB-BERT.
Additional results and comparisons are presented in my Master's thesis.
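To put the two SUCX 3.0 scores in proportion, the short calculation below computes the absolute gap and the fraction of KB-BERT's F1 score that the distilled model retains. It uses only the two figures reported above; the function name `f1_comparison` is a hypothetical helper for this example.

```python
def f1_comparison(student_f1: float, teacher_f1: float) -> dict:
    """Compare a distilled (student) model's F1 score against its teacher's."""
    return {
        "absolute_gap": teacher_f1 - student_f1,
        "retained_fraction": student_f1 / teacher_f1,
    }


# Scores reported above: distilled model 0.887, KB-BERT 0.894.
result = f1_comparison(0.887, 0.894)
print(f"absolute gap: {result['absolute_gap']:.3f}")   # absolute gap: 0.007
print(f"retained: {result['retained_fraction']:.1%}")  # retained: 99.2%
```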