Initial commit
Browse files- README.md +110 -0
- config.json +43 -0
- pytorch_model.bin +3 -0
- source.spm +0 -0
- special_tokens_map.json +1 -0
- target.spm +0 -0
- tokenizer_config.json +1 -0
- vocab.json +0 -0
README.md
ADDED
@@ -0,0 +1,110 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
---
|
2 |
+
language:
|
3 |
+
- gmw
|
4 |
+
- gmw
|
5 |
+
|
6 |
+
tags:
|
7 |
+
- translation
|
8 |
+
|
9 |
+
license: cc-by-4.0
|
10 |
+
---
|
11 |
+
# opus-mt-tc-base-gmw-gmw
|
12 |
+
|
13 |
+
Neural machine translation model for translating from West Germanic languages to West Germanic languages.
|
14 |
+
|
15 |
+
This model is part of the [OPUS-MT project](https://github.com/Helsinki-NLP/Opus-MT), an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using the amazing framework of [Marian NMT](https://marian-nmt.github.io/), an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from [OPUS](https://opus.nlpl.eu/) and training pipelines use the procedures of [OPUS-MT-train](https://github.com/Helsinki-NLP/Opus-MT-train).
|
16 |
+
|
17 |
+
* Publications: [OPUS-MT – Building open translation services for the World](https://aclanthology.org/2020.eamt-1.61/) , [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/)
|
18 |
+
|
19 |
+
## Model info
|
20 |
+
|
21 |
+
* Release: 2021-02-23
|
22 |
+
* source language(s): afr deu eng fry gos hrx ltz nds nld pdc yid
|
23 |
+
* target language(s): afr deu eng fry nds nld
|
24 |
+
* valid target language labels: >>afr<< >>ang_Latn<< >>deu<< >>eng<< >>fry<< >>ltz<< >>nds<< >>nld<< >>sco<< >>yid<<
|
25 |
+
* model: transformer
|
26 |
+
* data: opus
|
27 |
+
* tokenization: SentencePiece (spm32k,spm32k)
|
28 |
+
* original model: [opus-2021-02-23.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/gmw-gmw/opus-2021-02-23.zip)
|
29 |
+
|
30 |
+
This is a multilingual translation model with multiple target languages. A sentence initial language token is required in the form of `>>id<<` (id = valid target language ID), e.g. `>>afr<<`
|
31 |
+
|
32 |
+
## Usage
|
33 |
+
|
34 |
+
You can use OPUS-MT models with the transformers pipelines, for example:
|
35 |
+
|
36 |
+
```python
|
37 |
+
from transformers import pipeline
|
38 |
+
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-base-gmw-gmw")
|
39 |
+
print(pipe(">>afr<< Replace this with text in an accepted source language."))
|
40 |
+
```
|
41 |
+
|
42 |
+
## Benchmarks
|
43 |
+
|
44 |
+
| langpair | testset | BLEU | chr-F | #sent | #words | BP |
|
45 |
+
|----------|---------|-------|-------|-------|--------|----|
|
46 |
+
| afr-deu | Tatoeba-test | 48.5 | 0.677 | 1583 | 9105 | 1.000 |
|
47 |
+
| afr-eng | Tatoeba-test | 58.7 | 0.727 | 1374 | 9622 | 0.995 |
|
48 |
+
| afr-nld | Tatoeba-test | 54.7 | 0.713 | 1056 | 6710 | 0.989 |
|
49 |
+
| deu-afr | Tatoeba-test | 52.4 | 0.697 | 1583 | 9507 | 1.000 |
|
50 |
+
| deu-eng | newssyscomb2009 | 25.4 | 0.527 | 502 | 11821 | 0.986 |
|
51 |
+
| deu-eng | news-test2008 | 23.9 | 0.519 | 2051 | 49380 | 0.992 |
|
52 |
+
| deu-eng | newstest2009 | 23.5 | 0.517 | 2525 | 65402 | 0.978 |
|
53 |
+
| deu-eng | newstest2010 | 26.1 | 0.548 | 2489 | 61724 | 1.000 |
|
54 |
+
| deu-eng | newstest2011 | 23.9 | 0.525 | 3003 | 74681 | 1.000 |
|
55 |
+
| deu-eng | newstest2012 | 25.0 | 0.533 | 3003 | 72812 | 1.000 |
|
56 |
+
| deu-eng | newstest2013 | 27.7 | 0.549 | 3000 | 64505 | 1.000 |
|
57 |
+
| deu-eng | newstest2014-deen | 27.4 | 0.549 | 3003 | 67337 | 0.977 |
|
58 |
+
| deu-eng | newstest2015-ende | 28.8 | 0.554 | 2169 | 46443 | 0.973 |
|
59 |
+
| deu-eng | newstest2016-ende | 33.7 | 0.598 | 2999 | 64126 | 1.000 |
|
60 |
+
| deu-eng | newstest2017-ende | 29.6 | 0.562 | 3004 | 64399 | 0.979 |
|
61 |
+
| deu-eng | newstest2018-ende | 36.3 | 0.611 | 2998 | 67013 | 0.977 |
|
62 |
+
| deu-eng | newstest2019-deen | 32.7 | 0.585 | 2000 | 39282 | 0.984 |
|
63 |
+
| deu-eng | Tatoeba-test | 44.7 | 0.629 | 10000 | 81233 | 0.975 |
|
64 |
+
| deu-nds | Tatoeba-test | 18.7 | 0.444 | 10000 | 76144 | 0.988 |
|
65 |
+
| deu-nld | Tatoeba-test | 48.7 | 0.672 | 10000 | 73546 | 0.969 |
|
66 |
+
| eng-afr | Tatoeba-test | 56.5 | 0.735 | 1374 | 10317 | 0.984 |
|
67 |
+
| eng-deu | newssyscomb2009 | 19.4 | 0.503 | 502 | 11271 | 0.991 |
|
68 |
+
| eng-deu | news-test2008 | 19.5 | 0.493 | 2051 | 47427 | 0.996 |
|
69 |
+
| eng-deu | newstest2009 | 18.8 | 0.499 | 2525 | 62816 | 0.993 |
|
70 |
+
| eng-deu | newstest2010 | 20.8 | 0.509 | 2489 | 61511 | 0.958 |
|
71 |
+
| eng-deu | newstest2011 | 19.2 | 0.493 | 3003 | 72981 | 0.980 |
|
72 |
+
| eng-deu | newstest2012 | 19.6 | 0.494 | 3003 | 72886 | 0.960 |
|
73 |
+
| eng-deu | newstest2013 | 22.8 | 0.518 | 3000 | 63737 | 0.974 |
|
74 |
+
| eng-deu | newstest2015-ende | 25.8 | 0.545 | 2169 | 44260 | 1.000 |
|
75 |
+
| eng-deu | newstest2016-ende | 30.3 | 0.581 | 2999 | 62670 | 0.989 |
|
76 |
+
| eng-deu | newstest2017-ende | 24.2 | 0.537 | 3004 | 61291 | 1.000 |
|
77 |
+
| eng-deu | newstest2018-ende | 35.5 | 0.616 | 2998 | 64276 | 1.000 |
|
78 |
+
| eng-deu | newstest2019-ende | 31.6 | 0.586 | 1997 | 48969 | 0.973 |
|
79 |
+
| eng-deu | Tatoeba-test | 37.8 | 0.591 | 10000 | 83347 | 0.991 |
|
80 |
+
| eng-nds | Tatoeba-test | 16.5 | 0.411 | 2500 | 18264 | 0.992 |
|
81 |
+
| eng-nld | Tatoeba-test | 50.3 | 0.677 | 10000 | 71436 | 0.979 |
|
82 |
+
| fry-deu | Tatoeba-test | 28.7 | 0.545 | 66 | 432 | 1.000 |
|
83 |
+
| fry-eng | Tatoeba-test | 31.9 | 0.496 | 205 | 1500 | 1.000 |
|
84 |
+
| fry-nld | Tatoeba-test | 43.0 | 0.634 | 233 | 1672 | 1.000 |
|
85 |
+
| gos-nld | Tatoeba-test | 15.9 | 0.409 | 1852 | 9903 | 0.959 |
|
86 |
+
| hrx-deu | Tatoeba-test | 24.7 | 0.487 | 471 | 2805 | 0.984 |
|
87 |
+
| ltz-deu | Tatoeba-test | 36.6 | 0.552 | 337 | 2144 | 1.000 |
|
88 |
+
| ltz-eng | Tatoeba-test | 31.4 | 0.477 | 283 | 1751 | 1.000 |
|
89 |
+
| ltz-nld | Tatoeba-test | 37.5 | 0.523 | 273 | 1567 | 1.000 |
|
90 |
+
| multi-multi | Tatoeba-test | 37.1 | 0.569 | 10000 | 73153 | 1.000 |
|
91 |
+
| nds-deu | Tatoeba-test | 34.5 | 0.572 | 10000 | 74571 | 1.000 |
|
92 |
+
| nds-eng | Tatoeba-test | 29.6 | 0.492 | 2500 | 17589 | 1.000 |
|
93 |
+
| nds-nld | Tatoeba-test | 42.2 | 0.621 | 1657 | 11490 | 0.994 |
|
94 |
+
| nld-afr | Tatoeba-test | 59.0 | 0.756 | 1056 | 6823 | 1.000 |
|
95 |
+
| nld-deu | Tatoeba-test | 50.6 | 0.688 | 10000 | 72438 | 1.000 |
|
96 |
+
| nld-eng | Tatoeba-test | 54.5 | 0.702 | 10000 | 69848 | 0.975 |
|
97 |
+
| nld-fry | Tatoeba-test | 23.3 | 0.462 | 233 | 1679 | 1.000 |
|
98 |
+
| nld-nds | Tatoeba-test | 21.7 | 0.462 | 1657 | 11711 | 0.998 |
|
99 |
+
| pdc-eng | Tatoeba-test | 24.3 | 0.402 | 53 | 399 | 1.000 |
|
100 |
+
| yid-nld | Tatoeba-test | 21.3 | 0.402 | 55 | 323 | 1.000 |
|
101 |
+
|
102 |
+
* test set translations: [opus-2021-02-23.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/gmw-gmw/opus-2021-02-23.test.txt)
|
103 |
+
* test set scores: [opus-2021-02-23.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/gmw-gmw/opus-2021-02-23.eval.txt)
|
104 |
+
|
105 |
+
## Model conversion info
|
106 |
+
|
107 |
+
* transformers version: 4.12.3
|
108 |
+
* OPUS-MT git hash: fc19512
|
109 |
+
* port time: Thu Jan 27 18:04:00 EET 2022
|
110 |
+
* port machine: LM0-400-22516.local
|
config.json
ADDED
@@ -0,0 +1,43 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
{
|
2 |
+
"activation_dropout": 0.0,
|
3 |
+
"activation_function": "swish",
|
4 |
+
"architectures": [
|
5 |
+
"MarianMTModel"
|
6 |
+
],
|
7 |
+
"attention_dropout": 0.0,
|
8 |
+
"bad_words_ids": [
|
9 |
+
[
|
10 |
+
35451
|
11 |
+
]
|
12 |
+
],
|
13 |
+
"bos_token_id": 0,
|
14 |
+
"classifier_dropout": 0.0,
|
15 |
+
"d_model": 512,
|
16 |
+
"decoder_attention_heads": 8,
|
17 |
+
"decoder_ffn_dim": 2048,
|
18 |
+
"decoder_layerdrop": 0.0,
|
19 |
+
"decoder_layers": 6,
|
20 |
+
"decoder_start_token_id": 35451,
|
21 |
+
"dropout": 0.1,
|
22 |
+
"encoder_attention_heads": 8,
|
23 |
+
"encoder_ffn_dim": 2048,
|
24 |
+
"encoder_layerdrop": 0.0,
|
25 |
+
"encoder_layers": 6,
|
26 |
+
"eos_token_id": 0,
|
27 |
+
"forced_eos_token_id": 0,
|
28 |
+
"init_std": 0.02,
|
29 |
+
"is_encoder_decoder": true,
|
30 |
+
"max_length": 512,
|
31 |
+
"max_position_embeddings": 512,
|
32 |
+
"model_type": "marian",
|
33 |
+
"normalize_embedding": false,
|
34 |
+
"num_beams": 6,
|
35 |
+
"num_hidden_layers": 6,
|
36 |
+
"pad_token_id": 35451,
|
37 |
+
"scale_embedding": true,
|
38 |
+
"static_position_embeddings": true,
|
39 |
+
"torch_dtype": "float16",
|
40 |
+
"transformers_version": "4.12.3",
|
41 |
+
"use_cache": true,
|
42 |
+
"vocab_size": 35452
|
43 |
+
}
|
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:a4586f60a829f1bd8331e4f877b3ddf493c333c821afe8bac9b84847681df2c7
|
3 |
+
size 161034627
|
source.spm
ADDED
Binary file (802 kB). View file
|
|
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
{"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>"}
|
target.spm
ADDED
Binary file (802 kB). View file
|
|
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
{"source_lang": "gmw", "target_lang": "gmw", "unk_token": "<unk>", "eos_token": "</s>", "pad_token": "<pad>", "model_max_length": 512, "sp_model_kwargs": {}, "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "marian-models/opus-2021-02-23/gmw-gmw", "tokenizer_class": "MarianTokenizer"}
|
vocab.json
ADDED
The diff for this file is too large to render.
See raw diff
|
|