---
license: apache-2.0
library_name: generic
tags:
- text2text-generation
- punctuation
- sentence-boundary-detection
- truecasing
language:
- af
- am
- ar
- bg
- bn
- de
- el
- en
- es
- et
- fa
- fi
- fr
- gu
- hi
- hr
- hu
- id
- is
- it
- ja
- kk
- kn
- ko
- ky
- lt
- lv
- mk
- ml
- mr
- nl
- or
- pa
- pl
- ps
- pt
- ro
- ru
- rw
- so
- sr
- sw
- ta
- te
- tr
- uk
- zh
---
# Model Overview
This is a fine-tuned `xlm-roberta` model that restores punctuation, true-cases (capitalizes),
and detects sentence boundaries (full stops) in 47 languages.
## Tokenizer
Instead of the hacky wrapper used by FairSeq and strangely ported (not fixed) by HuggingFace, the xlm-roberta SentencePiece model was adjusted to correctly encode
the text. Per HF's comments,
```python
# Original fairseq vocab and spm vocab must be "aligned":
# Vocab | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
# -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----
# fairseq | '<s>' | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's' | '▁de' | '-'
# spm | '<unk>' | '<s>' | '</s>' | ',' | '.' | '▁' | 's' | '▁de' | '-' | '▁a'
```
The SP model was un-hacked with the following snippet
(SentencePiece experts, let me know if there is a problem here):
```python
from sentencepiece import SentencePieceProcessor
from sentencepiece.sentencepiece_model_pb2 import ModelProto
# Load the original XLM-RoBERTa SentencePiece model as a protobuf.
m = ModelProto()
with open("/path/to/xlmroberta/sentencepiece.bpe.model", "rb") as f:
    m.ParseFromString(f.read())

# Replace the first three SentencePiece entries (<unk>, <s>, </s>) with the
# four fairseq special tokens in fairseq order, and append <mask> at the end.
pieces = list(m.pieces)
pieces = (
    [
        ModelProto.SentencePiece(piece="<s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<pad>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="</s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<unk>", type=ModelProto.SentencePiece.Type.UNKNOWN),
    ]
    + pieces[3:]
    + [ModelProto.SentencePiece(piece="<mask>", type=ModelProto.SentencePiece.Type.USER_DEFINED)]
)

# Overwrite the vocabulary in place and write out the new model.
del m.pieces[:]
m.pieces.extend(pieces)
with open("/path/to/new/sp.model", "wb") as f:
    f.write(m.SerializeToString())
```
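The effect of this reordering can be illustrated without SentencePiece at all. The following is a minimal sketch on plain Python lists, using the ten example pieces from the HF comment above; `align_pieces` is a hypothetical helper for illustration, not part of any library:

```python
# Hypothetical illustration of the vocabulary re-alignment on plain lists.
FAIRSEQ_SPECIALS = ["<s>", "<pad>", "</s>", "<unk>"]


def align_pieces(spm_pieces):
    """Drop the 3 leading spm specials, prepend the 4 fairseq specials, append <mask>."""
    return FAIRSEQ_SPECIALS + spm_pieces[3:] + ["<mask>"]


# The first ten spm pieces, as listed in the HF comment.
spm = ["<unk>", "<s>", "</s>", ",", ".", "▁", "s", "▁de", "-", "▁a"]
aligned = align_pieces(spm)

# Every regular piece moves up by exactly one id, matching the fairseq row.
for old_id in range(3, len(spm)):
    assert aligned[old_id + 1] == spm[old_id]
```

Because three special pieces are replaced by four, each regular piece shifts by one position, which is the offset the wrapper otherwise has to apply at lookup time.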
## Post-Punctuation Tokens
This model predicts one of the following punctuation tokens after each subword:
| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| \<NULL\> | No punctuation | All |
| \<ACRONYM\> | Every character in this subword is followed by a period | Primarily English, some European |
| . | Latin full stop | Many |
| , | Latin comma | Many |
| ? | Latin question mark | Many |
| ？ | Full-width question mark | Chinese, Japanese |
| ， | Full-width comma | Chinese, Japanese |
| 。 | Full-width full stop | Chinese, Japanese |
| 、 | Ideographic comma | Chinese, Japanese |
| ・ | Middle dot | Japanese |
| । | Danda | Hindi, Bengali, Oriya |
| ؟ | Arabic question mark | Arabic |
| ; | Greek question mark | Greek |
| ። | Ethiopic full stop | Amharic |
| ፣ | Ethiopic comma | Amharic |
| ፧ | Ethiopic question mark | Amharic |
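As a rough sketch of how a post-punctuation prediction might be applied to a single subword, consider the helper below. `apply_post_label` is a hypothetical function written for illustration, not the model's actual decoding code, and it assumes the SentencePiece word-start marker has already been stripped from the subword:

```python
NULL = "<NULL>"
ACRONYM = "<ACRONYM>"


def apply_post_label(subword: str, label: str) -> str:
    """Attach a predicted post-punctuation label to one subword (illustrative only)."""
    if label == NULL:
        return subword
    if label == ACRONYM:
        # Every character in the subword is followed by a period.
        return "".join(ch + "." for ch in subword)
    # Any other label is a literal punctuation character appended to the subword.
    return subword + label


print(apply_post_label("us", ACRONYM))
print(apply_post_label("hello", ","))
```

True-casing is a separate prediction, so the acronym case here stays lower-cased until casing is applied.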
## Pre-Punctuation Tokens
This model predicts the following set of punctuation tokens before each subword:
| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| \<NULL\> | No punctuation | All |
| ¿ | Inverted question mark | Spanish |
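To show how pre- and post-punctuation predictions fit together, here is a minimal sketch that reassembles text from subwords and per-subword labels. The subword split and the label sequences are made up for illustration, and `restore` is a hypothetical helper rather than the model's real inference code:

```python
NULL = "<NULL>"
WORD_START = "▁"  # SentencePiece marker for the start of a word


def restore(subwords, pre_labels, post_labels):
    """Join subwords, inserting predicted punctuation before and after each one."""
    out = []
    for piece, pre, post in zip(subwords, pre_labels, post_labels):
        starts_word = piece.startswith(WORD_START)
        text = piece.lstrip(WORD_START)
        if pre != NULL:
            text = pre + text
        if post != NULL:
            text = text + post
        out.append((" " if starts_word else "") + text)
    return "".join(out).strip()


# Hypothetical tokenization and predictions for a Spanish input.
subwords = ["▁hola", "▁mundo", "▁cómo", "▁está", "s"]
pre = [NULL, NULL, "¿", NULL, NULL]
post = [NULL, ",", NULL, NULL, "?"]
restored = restore(subwords, pre, post)
```

Predicting `¿` as a separate pre-punctuation label lets a single subword carry both an opening and a closing mark when needed.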
# Training Details
This model was trained in the NeMo framework.
## Training Data
This model was trained with News Crawl data from WMT.
1M lines of text were used for each language, except for a few low-resource languages, which may have used fewer.
Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data as judged by the author.
# Limitations
This model was trained on news data, and may not perform well on conversational or informal data.
Further, this model is unlikely to be of production quality.
It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.
This is also a base-sized model with many languages and many tasks, so capacity may be limited.
# Evaluation
In these metrics, keep in mind that
1. The data is noisy.
2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes incorrect.
   When conditioning on reference punctuation, true-casing and SBD are practically 100% for most languages.
3. Punctuation can be subjective. E.g.,
   `Hola mundo, ¿cómo estás?`
   or
   `Hola mundo. ¿Cómo estás?`
   When the sentences are longer and more practical, these ambiguities abound and affect all three metrics.
## Test Data and Example Generation
Each test example was generated using the following procedure:
1. Concatenate 10 random sentences
2. Lower-case the concatenated sentence
3. Remove all punctuation
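The three steps above can be sketched as follows. This is an approximation under stated assumptions, not the author's actual script: it treats every character in a Unicode punctuation category as punctuation (which also removes apostrophes and hyphens), and `make_example` and `strip_punctuation` are hypothetical names:

```python
import random
import unicodedata


def strip_punctuation(text: str) -> str:
    """Remove every character whose Unicode category is a punctuation class (P*)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def make_example(sentences, n=10, rng=random):
    """Return (model_input, reference) built from n randomly chosen sentences."""
    chosen = rng.sample(sentences, n)                # 1. pick n random sentences
    reference = " ".join(chosen)                     #    and concatenate them
    model_input = strip_punctuation(reference.lower())  # 2. lower-case, 3. remove punctuation
    return model_input, reference


source, target = make_example(["Hola mundo.", "¿Cómo estás?"], n=2, rng=random.Random(0))
```

The reference text keeps its original punctuation and casing, so it can serve as the target for all three tasks.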
The data is a held-out portion of News Crawl, which has been deduplicated.
3,000 lines of data per language were used, generating 3,000 unique examples of 10 sentences each.
The last 4 sentences of each example were randomly sampled from the 3,000 and may be duplicated.
Examples longer than the model's maximum length were truncated.
The number of affected sentences can be estimated from the "full stop" support: with 3,000
examples of 10 sentences each, we expect 30,000 full stop targets in total.
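As a small worked version of that estimate, the sketch below subtracts the reported "full stop" support from the expected total. It assumes the support counts exactly one target per sentence boundary; the support value in the usage line is invented for illustration and is not taken from any report:

```python
EXAMPLES_PER_LANGUAGE = 3_000
SENTENCES_PER_EXAMPLE = 10


def estimate_truncated_sentences(full_stop_support: int) -> int:
    """Estimate how many sentences were lost to truncation for one language."""
    expected = EXAMPLES_PER_LANGUAGE * SENTENCES_PER_EXAMPLE  # 30,000 targets
    return max(expected - full_stop_support, 0)


# Hypothetical support value, for illustration only.
lost = estimate_truncated_sentences(29_500)
```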
## Selected Language Evaluation Reports