---
license: apache-2.0
library_name: generic
tags:
- text2text-generation
- punctuation
- sentence-boundary-detection
- truecasing
- true-casing
language:
- af
- am
- ar
- bg
- bn
- de
- el
- en
- es
- et
- fa
- fi
- fr
- gu
- hi
- hr
- hu
- id
- is
- it
- ja
- kk
- kn
- ko
- ky
- lt
- lv
- mk
- ml
- mr
- nl
- or
- pa
- pl
- ps
- pt
- ro
- ru
- rw
- so
- sr
- sw
- ta
- te
- tr
- uk
- zh
---
|
|
|
# Model Overview |
|
This is an `xlm-roberta` model fine-tuned to restore punctuation, true-case (capitalize), and detect sentence boundaries (full stops) in 47 languages.
|
|
|
# Model Architecture |
|
This model implements the following graph, which allows punctuation, true-casing, and full-stop prediction in every language without language-specific behavior:
|
|
|
![graph.png](https://s3.amazonaws.com/moonup/production/uploads/62d34c813eebd640a4f97587/jpr-pMdv6iHxnjbN4QNt0.png) |
|
|
|
We start by tokenizing the text and encoding it with XLM-Roberta, which is the pre-trained portion of this graph. |
|
|
|
Then we predict punctuation before and after every subtoken. Predicting before each token allows for Spanish inverted question marks. Predicting after every token allows for all other punctuation, including punctuation within continuous-script languages and acronyms.
|
|
|
We use embeddings to represent the predicted punctuation tokens to inform the sentence boundary head of the punctuation that will be inserted into the text. This allows proper full stop prediction, since certain punctuation tokens (periods, question marks, etc.) are strongly correlated with sentence boundaries.
|
|
|
We then shift the full stop predictions to the right by one, to inform the true-casing head of where the beginning of each new sentence is. This is important since true-casing is strongly correlated with sentence boundaries.
|
|
|
For true-casing, we make `N` predictions per subtoken, where `N` is the number of characters in the subtoken. In practice, `N` is the maximum subtoken length, and extra predictions are ignored. Essentially, true-casing is modeled as a multi-label problem. This allows for upper-casing arbitrary characters, e.g., "NATO", "MacDonald", "mRNA", etc.
|
|
|
Applying all these predictions to the input text, we can punctuate, true-case, and split sentences in any language. |
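
To make the graph concrete, here is a minimal PyTorch sketch of the prediction heads. Everything in it is illustrative: the class, names, dimensions, and exact conditioning mechanisms are assumptions, not the actual implementation.

```python
# Illustrative sketch only; not the actual implementation.
import torch
import torch.nn as nn

class PunctCapSegHeads(nn.Module):
    """Hypothetical prediction heads on top of the XLM-R encoder outputs."""

    def __init__(self, hidden: int = 768, num_pre: int = 2, num_post: int = 17, max_chars: int = 16):
        super().__init__()
        self.pre_head = nn.Linear(hidden, num_pre)            # <NULL> / ¿
        self.post_head = nn.Linear(hidden, num_post)          # 17 post-punctuation labels
        self.punct_emb = nn.Embedding(num_post, hidden)       # embeds predicted punctuation
        self.seg_head = nn.Linear(2 * hidden, 2)              # NOSTOP / FULLSTOP
        self.stop_emb = nn.Embedding(2, hidden)               # embeds shifted full stops
        self.cap_head = nn.Linear(3 * hidden, 2 * max_chars)  # LOWER/UPPER per character
        self.max_chars = max_chars

    def forward(self, enc: torch.Tensor):
        # enc: (batch, seq, hidden) subtoken encodings from XLM-R
        pre_logits = self.pre_head(enc)
        post_logits = self.post_head(enc)
        # Sentence boundaries are conditioned on the predicted punctuation
        punct_vec = self.punct_emb(post_logits.argmax(dim=-1))
        seg_logits = self.seg_head(torch.cat([enc, punct_vec], dim=-1))
        # Shift full-stop decisions right by one: a full stop after token i
        # means token i + 1 begins a new sentence, which matters for casing
        stops = seg_logits.argmax(dim=-1).roll(shifts=1, dims=1)
        stops[:, 0] = 1  # the first token always starts a sentence
        # True-casing: one LOWER/UPPER prediction per character; predictions
        # beyond a subtoken's actual length are ignored
        cap_in = torch.cat([enc, punct_vec, self.stop_emb(stops)], dim=-1)
        cap_logits = self.cap_head(cap_in).view(*enc.shape[:2], self.max_chars, 2)
        return pre_logits, post_logits, seg_logits, cap_logits
```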
|
|
|
## Tokenizer |
|
|
|
Instead of the hacky wrapper used by fairseq (and strangely ported, not fixed, by HuggingFace), the `xlm-roberta` SentencePiece model was adjusted to correctly encode the text. Per HF's comments,
|
|
```python
# Original fairseq vocab and spm vocab must be "aligned":
# Vocab    |    0    |    1    |   2    |    3    |  4  |  5  |  6  |   7   |   8   |  9
# -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----
# fairseq  | '<s>'   | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's'   | '▁de' | '-'
# spm      | '<unk>' | '<s>'   | '</s>' | ','     | '.' | '▁' | 's' | '▁de' | '-'   | '▁a'
```
|
|
|
The SP model was un-hacked with the following snippet (SentencePiece experts, let me know if there is a problem here):
|
|
|
```python
from sentencepiece.sentencepiece_model_pb2 import ModelProto

m = ModelProto()
m.ParseFromString(open("/path/to/xlmroberta/sentencepiece.bpe.model", "rb").read())

pieces = list(m.pieces)
# Prepend fairseq's four specials in its expected order, drop spm's original
# three specials (<unk>, <s>, </s>), and append <mask> at the end of the vocab.
pieces = (
    [
        ModelProto.SentencePiece(piece="<s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<pad>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="</s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<unk>", type=ModelProto.SentencePiece.Type.UNKNOWN),
    ]
    + pieces[3:]
    + [ModelProto.SentencePiece(piece="<mask>", type=ModelProto.SentencePiece.Type.USER_DEFINED)]
)
del m.pieces[:]
m.pieces.extend(pieces)

with open("/path/to/new/sp.model", "wb") as f:
    f.write(m.SerializeToString())
```
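
As a quick sanity check (my own, not part of the original fix), the re-serialized model should map the special tokens to the IDs fairseq expects, with no wrapper:

```python
from sentencepiece import SentencePieceProcessor

sp = SentencePieceProcessor(model_file="/path/to/new/sp.model")
for token in ("<s>", "<pad>", "</s>", "<unk>", "<mask>"):
    print(token, sp.piece_to_id(token))
# Expected: <s>=0, <pad>=1, </s>=2, <unk>=3, <mask> at the end of the vocab
```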
|
|
|
|
|
## Post-Punctuation Tokens |
|
This model predicts the following set of punctuation tokens after each subtoken: |
|
|
|
| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| \<NULL\> | No punctuation | All |
| \<ACRONYM\> | Every character in this subword is followed by a period | Primarily English, some European |
| . | Latin full stop | Many |
| , | Latin comma | Many |
| ? | Latin question mark | Many |
| ？ | Full-width question mark | Chinese, Japanese |
| ， | Full-width comma | Chinese, Japanese |
| 。 | Full-width full stop | Chinese, Japanese |
| 、 | Ideographic comma | Chinese, Japanese |
| ・ | Middle dot | Japanese |
| । | Danda | Hindi, Bengali, Oriya |
| ؟ | Arabic question mark | Arabic |
| ، | Arabic comma | Arabic |
| ; | Greek question mark | Greek |
| ። | Ethiopic full stop | Amharic |
| ፣ | Ethiopic comma | Amharic |
| ፧ | Ethiopic question mark | Amharic |
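
To make the label semantics concrete, here is a hypothetical helper that applies post-punctuation labels to subwords. `apply_post_punct` is illustrative only and not part of the model's API:

```python
def apply_post_punct(subwords: list[str], labels: list[str]) -> str:
    """Append each predicted post-punctuation label to its subword."""
    out = []
    for subword, label in zip(subwords, labels):
        if label == "<NULL>":
            out.append(subword)
        elif label == "<ACRONYM>":
            # Every character is followed by a period, e.g. "USA" -> "U.S.A."
            out.append("".join(ch + "." for ch in subword))
        else:
            out.append(subword + label)
    return "".join(out).replace("▁", " ").strip()

print(apply_post_punct(["▁hola", "▁mundo", "▁como", "▁estas"], ["<NULL>", ",", "<NULL>", "?"]))
# -> "hola mundo, como estas?"
```

Pre-punctuation and true-casing labels would be applied analogously.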
|
|
|
|
|
## Pre-Punctuation Tokens |
|
This model predicts the following set of punctuation tokens before each subword: |
|
|
|
| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| \<NULL\> | No punctuation | All |
| ¿ | Inverted question mark | Spanish |
|
|
|
|
|
|
|
# Training Details |
|
This model was trained in the NeMo framework. |
|
|
|
## Training Data |
|
This model was trained with News Crawl data from WMT. |
|
|
|
1M lines of text were used for each language, except for a few low-resource languages, which may have used less.
|
|
|
Languages were chosen based on whether the News Crawl corpus contained enough data of reliable quality, as judged by the author.
|
|
|
# Limitations |
|
This model was trained on news data, and may not perform well on conversational or informal data. |
|
|
|
Further, this model is unlikely to be of production quality. It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.
|
|
|
This model over-predicts Spanish question marks, especially the inverted question mark `¿` (see metrics below). Since `¿` is a rare token, especially in the context of a 47-language model, Spanish questions were over-sampled by selecting more of these sentences from additional training data that was not otherwise used. However, this seems to have over-corrected the problem, and the model now predicts too many Spanish question marks. This can be fixed by exposing prior probabilities, but I'll fine-tune the model later to fix this the right way.
|
|
|
|
|
# Evaluation |
|
In these metrics, keep in mind that

1. The data is noisy.

2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and sometimes incorrect. When conditioning on reference punctuation, true-casing and SBD are practically 100% for most languages.

3. Punctuation can be subjective. E.g.,
|
|
|
`Hola mundo, ¿cómo estás?` |
|
|
|
or |
|
|
|
`Hola mundo. ¿Cómo estás?` |
|
|
|
When the sentences are longer and more practical, these ambiguities abound and affect all three metrics.
|
|
|
## Test Data and Example Generation |
|
Each test example was generated using the following procedure: |
|
|
|
1. Concatenate 11 random sentences (1 + 10 for each sentence in the test set)
2. Lower-case the concatenated sentence
3. Remove all punctuation
|
|
|
The data is a held-out portion of News Crawl, which has been deduplicated. 3,000 lines of data per language were used, generating 3,000 unique examples of 11 sentences each: example `i` begins with sentence `i` and is followed by 10 sentences selected at random from the 3,000-sentence test set.
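
A sketch of this procedure (the regex used to strip punctuation here is an assumption; the actual preprocessing may differ):

```python
import random
import re

def make_examples(sentences: list[str]) -> list[str]:
    """Build one 11-sentence example per test sentence, then normalize."""
    examples = []
    for i, first in enumerate(sentences):
        others = random.sample(sentences[:i] + sentences[i + 1:], k=10)
        text = " ".join([first] + others).lower()   # concatenate and lower-case
        examples.append(re.sub(r"[^\w\s]", "", text))  # remove punctuation
    return examples
```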
|
|
|
## Selected Language Evaluation Reports |
|
For now, metrics for a few selected languages are shown below. Given the amount of work required to collect and pretty-print metrics in 47 languages, I'll add more eventually.
|
|
|
Expand any of the following tabs to see metrics for that language. |
|
|
|
|
|
<details> |
|
<summary>English</summary> |
|
|
|
```text
punct_post test report:
label precision recall f1 support
<NULL> (label_id: 0) 99.18 98.47 98.82 538769
<ACRONYM> (label_id: 1) 66.03 78.63 71.78 571
. (label_id: 2) 90.66 93.68 92.14 30581
, (label_id: 3) 74.18 82.93 78.31 23230
? (label_id: 4) 78.10 80.08 79.07 1024
？ (label_id: 5) 0.00 0.00 0.00 0
， (label_id: 6) 0.00 0.00 0.00 0
。 (label_id: 7) 0.00 0.00 0.00 0
、 (label_id: 8) 0.00 0.00 0.00 0
・ (label_id: 9) 0.00 0.00 0.00 0
। (label_id: 10) 0.00 0.00 0.00 0
؟ (label_id: 11) 0.00 0.00 0.00 0
، (label_id: 12) 0.00 0.00 0.00 0
; (label_id: 13) 0.00 0.00 0.00 0
። (label_id: 14) 0.00 0.00 0.00 0
፣ (label_id: 15) 0.00 0.00 0.00 0
፧ (label_id: 16) 0.00 0.00 0.00 0
-------------------
micro avg 97.56 97.56 97.56 594175
macro avg 81.63 86.76 84.03 594175
weighted avg 97.70 97.56 97.62 594175
```

```text
cap test report:
label precision recall f1 support
LOWER (label_id: 0) 99.71 99.85 99.78 2036824
UPPER (label_id: 1) 96.40 93.27 94.81 87747
-------------------
micro avg 99.58 99.58 99.58 2124571
macro avg 98.06 96.56 97.30 2124571
weighted avg 99.57 99.58 99.58 2124571
```

```text
seg test report:
label precision recall f1 support
NOSTOP (label_id: 0) 99.97 99.98 99.98 564228
FULLSTOP (label_id: 1) 99.73 99.54 99.64 32947
-------------------
micro avg 99.96 99.96 99.96 597175
macro avg 99.85 99.76 99.81 597175
weighted avg 99.96 99.96 99.96 597175
```
|
|
|
</details> |
|
|
|
|
|
|
|
<details> |
|
<summary>Spanish</summary> |
|
|
|
```text
punct_pre test report:
label precision recall f1 support
<NULL> (label_id: 0) 99.96 99.76 99.86 609200
¿ (label_id: 1) 39.66 77.89 52.56 1221
-------------------
micro avg 99.72 99.72 99.72 610421
macro avg 69.81 88.82 76.21 610421
weighted avg 99.83 99.72 99.76 610421
```

```text
punct_post test report:
label precision recall f1 support
<NULL> (label_id: 0) 99.17 98.44 98.80 553100
<ACRONYM> (label_id: 1) 23.33 43.75 30.43 48
. (label_id: 2) 91.92 92.58 92.25 29623
, (label_id: 3) 73.07 82.04 77.30 26432
? (label_id: 4) 49.44 71.84 58.57 1218
？ (label_id: 5) 0.00 0.00 0.00 0
， (label_id: 6) 0.00 0.00 0.00 0
。 (label_id: 7) 0.00 0.00 0.00 0
、 (label_id: 8) 0.00 0.00 0.00 0
・ (label_id: 9) 0.00 0.00 0.00 0
। (label_id: 10) 0.00 0.00 0.00 0
؟ (label_id: 11) 0.00 0.00 0.00 0
، (label_id: 12) 0.00 0.00 0.00 0
; (label_id: 13) 0.00 0.00 0.00 0
። (label_id: 14) 0.00 0.00 0.00 0
፣ (label_id: 15) 0.00 0.00 0.00 0
፧ (label_id: 16) 0.00 0.00 0.00 0
-------------------
micro avg 97.39 97.39 97.39 610421
macro avg 67.39 77.73 71.47 610421
weighted avg 97.58 97.39 97.47 610421
```

```text
cap test report:
label precision recall f1 support
LOWER (label_id: 0) 99.82 99.86 99.84 2222062
UPPER (label_id: 1) 95.96 94.64 95.29 75940
-------------------
micro avg 99.69 99.69 99.69 2298002
macro avg 97.89 97.25 97.57 2298002
weighted avg 99.69 99.69 99.69 2298002
```

```text
seg test report:
label precision recall f1 support
NOSTOP (label_id: 0) 99.99 99.97 99.98 580519
FULLSTOP (label_id: 1) 99.52 99.81 99.66 32902
-------------------
micro avg 99.96 99.96 99.96 613421
macro avg 99.75 99.89 99.82 613421
weighted avg 99.96 99.96 99.96 613421
```
|
|
|
</details> |
|
|
|
|
|
<details> |
|
<summary>Amharic</summary> |
|
|
|
```text
punct_post test report:
label precision recall f1 support
<NULL> (label_id: 0) 99.81 99.40 99.60 729695
<ACRONYM> (label_id: 1) 0.00 0.00 0.00 0
. (label_id: 2) 0.00 0.00 0.00 0
, (label_id: 3) 0.00 0.00 0.00 0
? (label_id: 4) 0.00 0.00 0.00 0
？ (label_id: 5) 0.00 0.00 0.00 0
， (label_id: 6) 0.00 0.00 0.00 0
。 (label_id: 7) 0.00 0.00 0.00 0
、 (label_id: 8) 0.00 0.00 0.00 0
・ (label_id: 9) 0.00 0.00 0.00 0
। (label_id: 10) 0.00 0.00 0.00 0
؟ (label_id: 11) 0.00 0.00 0.00 0
، (label_id: 12) 0.00 0.00 0.00 0
; (label_id: 13) 0.00 0.00 0.00 0
። (label_id: 14) 91.44 97.78 94.50 25288
፣ (label_id: 15) 66.93 80.45 73.07 5774
፧ (label_id: 16) 72.14 77.01 74.49 1170
-------------------
micro avg 99.17 99.17 99.17 761927
macro avg 82.58 88.66 85.42 761927
weighted avg 99.24 99.17 99.19 761927
```

```text
cap test report:
label precision recall f1 support
LOWER (label_id: 0) 98.50 97.22 97.86 1150
UPPER (label_id: 1) 56.16 70.69 62.60 58
-------------------
micro avg 95.94 95.94 95.94 1208
macro avg 77.33 83.95 80.23 1208
weighted avg 96.47 95.94 96.16 1208
```

```text
seg test report:
label precision recall f1 support
NOSTOP (label_id: 0) 99.97 99.91 99.94 743103
FULLSTOP (label_id: 1) 97.16 99.04 98.09 21824
-------------------
micro avg 99.89 99.89 99.89 764927
macro avg 98.57 99.48 99.02 764927
weighted avg 99.89 99.89 99.89 764927
```
|
|
|
</details> |
|
|
|
|
|
<details> |
|
<summary>Chinese</summary> |
|
|
|
```text
punct_post test report:
label precision recall f1 support
<NULL> (label_id: 0) 99.47 97.46 98.45 414383
<ACRONYM> (label_id: 1) 0.00 0.00 0.00 0
. (label_id: 2) 0.00 0.00 0.00 0
, (label_id: 3) 0.00 0.00 0.00 0
? (label_id: 4) 0.00 0.00 0.00 0
？ (label_id: 5) 81.41 85.80 83.55 1444
， (label_id: 6) 74.93 92.79 82.91 34094
。 (label_id: 7) 96.35 96.86 96.60 30668
、 (label_id: 8) 0.00 0.00 0.00 0
・ (label_id: 9) 0.00 0.00 0.00 0
। (label_id: 10) 0.00 0.00 0.00 0
؟ (label_id: 11) 0.00 0.00 0.00 0
، (label_id: 12) 0.00 0.00 0.00 0
; (label_id: 13) 0.00 0.00 0.00 0
። (label_id: 14) 0.00 0.00 0.00 0
፣ (label_id: 15) 0.00 0.00 0.00 0
፧ (label_id: 16) 0.00 0.00 0.00 0
-------------------
micro avg 97.05 97.05 97.05 480589
macro avg 88.04 93.23 90.38 480589
weighted avg 97.47 97.05 97.19 480589
```

```text
cap test report:
label precision recall f1 support
LOWER (label_id: 0) 94.82 93.97 94.39 2786
UPPER (label_id: 1) 79.23 81.76 80.48 784
-------------------
micro avg 91.29 91.29 91.29 3570
macro avg 87.03 87.87 87.44 3570
weighted avg 91.40 91.29 91.34 3570
```

```text
seg test report:
label precision recall f1 support
NOSTOP (label_id: 0) 99.99 99.98 99.98 450589
FULLSTOP (label_id: 1) 99.75 99.81 99.78 33000
-------------------
micro avg 99.97 99.97 99.97 483589
macro avg 99.87 99.89 99.88 483589
weighted avg 99.97 99.97 99.97 483589
```
|
|
|
</details> |
|
|
|
|
|
<details> |
|
<summary>Japanese</summary> |
|
|
|
```text
punct_post test report:
label precision recall f1 support
<NULL> (label_id: 0) 99.32 95.84 97.55 387103
<ACRONYM> (label_id: 1) 0.00 0.00 0.00 0
. (label_id: 2) 0.00 0.00 0.00 0
, (label_id: 3) 0.00 0.00 0.00 0
? (label_id: 4) 0.00 0.00 0.00 0
？ (label_id: 5) 75.12 68.14 71.46 1378
， (label_id: 6) 0.00 0.00 0.00 0
。 (label_id: 7) 93.30 97.44 95.33 31110
、 (label_id: 8) 53.91 87.52 66.72 17710
・ (label_id: 9) 29.33 64.60 40.35 1048
। (label_id: 10) 0.00 0.00 0.00 0
؟ (label_id: 11) 0.00 0.00 0.00 0
، (label_id: 12) 0.00 0.00 0.00 0
; (label_id: 13) 0.00 0.00 0.00 0
። (label_id: 14) 0.00 0.00 0.00 0
፣ (label_id: 15) 0.00 0.00 0.00 0
፧ (label_id: 16) 0.00 0.00 0.00 0
-------------------
micro avg 95.46 95.46 95.46 438349
macro avg 70.20 82.71 74.28 438349
weighted avg 96.81 95.46 95.93 438349
```

```text
cap test report:
label precision recall f1 support
LOWER (label_id: 0) 92.64 92.67 92.65 4036
UPPER (label_id: 1) 80.75 80.70 80.73 1539
-------------------
micro avg 89.36 89.36 89.36 5575
macro avg 86.70 86.68 86.69 5575
weighted avg 89.36 89.36 89.36 5575
```

```text
seg test report:
label precision recall f1 support
NOSTOP (label_id: 0) 99.98 99.95 99.97 408349
FULLSTOP (label_id: 1) 99.36 99.78 99.57 33000
-------------------
micro avg 99.94 99.94 99.94 441349
macro avg 99.67 99.86 99.77 441349
weighted avg 99.94 99.94 99.94 441349
```
|
|
|
</details> |
|
|
|
|
|
<details> |
|
<summary>Hindi</summary> |
|
|
|
```text
punct_post test report:
label precision recall f1 support
<NULL> (label_id: 0) 99.73 99.47 99.60 533761
<ACRONYM> (label_id: 1) 0.00 0.00 0.00 0
. (label_id: 2) 0.00 0.00 0.00 0
, (label_id: 3) 70.69 76.48 73.47 7713
? (label_id: 4) 65.41 74.75 69.77 301
？ (label_id: 5) 0.00 0.00 0.00 0
， (label_id: 6) 0.00 0.00 0.00 0
。 (label_id: 7) 0.00 0.00 0.00 0
、 (label_id: 8) 0.00 0.00 0.00 0
・ (label_id: 9) 0.00 0.00 0.00 0
। (label_id: 10) 96.46 98.81 97.62 30641
؟ (label_id: 11) 0.00 0.00 0.00 0
، (label_id: 12) 0.00 0.00 0.00 0
; (label_id: 13) 0.00 0.00 0.00 0
። (label_id: 14) 0.00 0.00 0.00 0
፣ (label_id: 15) 0.00 0.00 0.00 0
፧ (label_id: 16) 0.00 0.00 0.00 0
-------------------
micro avg 99.11 99.11 99.11 572416
macro avg 83.07 87.38 85.11 572416
weighted avg 99.15 99.11 99.13 572416
```

```text
cap test report:
label precision recall f1 support
LOWER (label_id: 0) 97.46 96.50 96.98 2346
UPPER (label_id: 1) 89.01 91.84 90.40 723
-------------------
micro avg 95.41 95.41 95.41 3069
macro avg 93.23 94.17 93.69 3069
weighted avg 95.47 95.41 95.43 3069
```

```text
seg test report:
label precision recall f1 support
NOSTOP (label_id: 0) 100.00 100.00 100.00 542437
FULLSTOP (label_id: 1) 99.92 99.97 99.95 32979
-------------------
micro avg 99.99 99.99 99.99 575416
macro avg 99.96 99.98 99.97 575416
weighted avg 99.99 99.99 99.99 575416
```
|
|
|
</details> |
|
|