Poro-34B-chat / README.md
jonabur
initial commit
e1590f5
|
raw
history blame
No virus
3.2 kB
metadata
license: apache-2.0
datasets:
  - LumiOpen/instruction-collection-fin
language:
  - fi
  - en

Poro 34B Chat

Poro 34b chat is a chat-tuned version of Poro 34B trained to follow instructions in both Finnish and English.

Because of the limited amount of instruction tuning available for Finnish, documents from the English datasets were machine-translated by the Poro 34B base model into Finnish, then used to train this chat version. We selected only datasets that are available for commercial use and only contain synthetic data if it was gathered in ToS-compliant fashion.

More information about the data selection and translation process for our Finnish dataset are available on the LumiOpen/instruction-collection-fin page.

Poro was created in a collaboration between SiloGen from Silo AI, the TurkuNLP group of the University of Turku, and High Performance Language Technologies (HPLT). Training was conducted on the LUMI supercomputer, using compute resources generously provided by CSC - IT Center for Science, Finland.

This project is part of an ongoing effort to create open source large language models for non-English and especially low resource languages like Finnish. Through the combination of English and Finnish training data we get a model that outperforms previous Finnish only models, while also being fluent in English and code, and capable of basic translation between English and Finnish.

Fine Tuning

Zephyr--??? TODO

Datasets

Finnish and Cross-lingual

English

Evaluations

We relied on the popular MTBench benchmark to evaluate multi-turn performance.

Since MTBench is an English only benchmark, we also release this fork of MTBench Finnish with multilingual support and machine translated Finnish prompts. Our scores for both benchmarks follow.

Eval Score
MTBench 5.93
MTBench Finnish 5.90

License

Poro 34B chat is released under the Apache 2.0 license.

Citation

@misc{luukkonen2024poro,
      title={Poro 34B and the Blessing of Multilinguality},
      author={Risto Luukkonen and Jonathan Burdge and Elaine Zosa and Aarne
Talman and Ville Komulainen and Väinö Hatanpää and Peter Sarlin and Sampo
Pyysalo},
      year={2024},
      eprint={2404.01856},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}