|
--- |
|
license: apache-2.0 |
|
library_name: generic |
|
tags: |
|
- text2text-generation |
|
- punctuation |
|
- sentence-boundary-detection |
|
- truecasing |
|
- true-casing |
|
language: |
|
- af |
|
- am |
|
- ar |
|
- bg |
|
- bn |
|
- de |
|
- el |
|
- en |
|
- es |
|
- et |
|
- fa |
|
- fi |
|
- fr |
|
- gu |
|
- hi |
|
- hr |
|
- hu |
|
- id |
|
- is |
|
- it |
|
- ja |
|
- kk |
|
- kn |
|
- ko |
|
- ky |
|
- lt |
|
- lv |
|
- mk |
|
- ml |
|
- mr |
|
- nl |
|
- or |
|
- pa |
|
- pl |
|
- ps |
|
- pt |
|
- ro |
|
- ru |
|
- rw |
|
- so |
|
- sr |
|
- sw |
|
- ta |
|
- te |
|
- tr |
|
- uk |
|
- zh |
|
--- |
|
|
|
# Model Overview |
|
This is a fine-tuned `xlm-roberta` model that restores punctuation, true-cases (capitalizes), and detects sentence boundaries (full stops) in 47 languages.
|
|
|
|
|
## Tokenizer |
|
|
|
Instead of relying on the hacky wrapper used by FairSeq and strangely ported (rather than fixed) by HuggingFace, the `xlm-roberta` SentencePiece model itself was adjusted to correctly encode the text. Per HF's comments:
|
|
|
```python
# Original fairseq vocab and spm vocab must be "aligned":
# Vocab    | 0       | 1       | 2      | 3       | 4   | 5   | 6   | 7     | 8     | 9
# -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----
# fairseq  | '<s>'   | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's'   | '▁de' | '-'
# spm      | '<unk>' | '<s>'   | '</s>' | ','     | '.' | '▁' | 's' | '▁de' | '-'   | '▁a'
```
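
The mismatch is easy to reproduce on the original SP model (the path is the same placeholder used below; the expected output is just the `spm` row from the comment above):

```python
from sentencepiece import SentencePieceProcessor

sp = SentencePieceProcessor()
sp.Load("/path/to/xlmroberta/sentencepiece.bpe.model")

# The raw SP model has no '<pad>' and starts with '<unk>', so every real
# piece is shifted relative to the fairseq/HF vocab shown above.
print([sp.IdToPiece(i) for i in range(10)])
# ['<unk>', '<s>', '</s>', ',', '.', '▁', 's', '▁de', '-', '▁a']
```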
|
|
|
The SP model was un-hacked with the following snippet (SentencePiece experts, let me know if there is a problem here):
|
|
|
```python
from sentencepiece import SentencePieceProcessor
from sentencepiece.sentencepiece_model_pb2 import ModelProto

m = ModelProto()
m.ParseFromString(open("/path/to/xlmroberta/sentencepiece.bpe.model", "rb").read())

# Prepend the four fairseq control tokens, drop the SP model's original
# '<unk>', '<s>', '</s>' entries, and append '<mask>' at the end.
pieces = list(m.pieces)
pieces = (
    [
        ModelProto.SentencePiece(piece="<s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<pad>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="</s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<unk>", type=ModelProto.SentencePiece.Type.UNKNOWN),
    ]
    + pieces[3:]
    + [ModelProto.SentencePiece(piece="<mask>", type=ModelProto.SentencePiece.Type.USER_DEFINED)]
)
del m.pieces[:]
m.pieces.extend(pieces)

with open("/path/to/new/sp.model", "wb") as f:
    f.write(m.SerializeToString())
```
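
A corresponding check on the rewritten model should now show the fairseq layout (again, the path is a placeholder):

```python
from sentencepiece import SentencePieceProcessor

sp = SentencePieceProcessor()
sp.Load("/path/to/new/sp.model")

# Ids 0-3 are now the control tokens, and the remaining pieces line up
# with the fairseq row of the table above.
print([sp.IdToPiece(i) for i in range(10)])
# ['<s>', '<pad>', '</s>', '<unk>', ',', '.', '▁', 's', '▁de', '-']
```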
|
|
|
|
|
## Post-Punctuation Tokens |
|
This model predicts the following set of punctuation tokens after each subtoken: |
|
|
|
| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| \<NULL\> | No punctuation | All |
| \<ACRONYM\> | Every character in this subword is followed by a period | Primarily English, some European |
| . | Latin full stop | Many |
| , | Latin comma | Many |
| ? | Latin question mark | Many |
| ？ | Full-width question mark | Chinese, Japanese |
| ， | Full-width comma | Chinese, Japanese |
| 。 | Full-width full stop | Chinese, Japanese |
| 、 | Ideographic comma | Chinese, Japanese |
| ・ | Middle dot | Japanese |
| । | Danda | Hindi, Bengali, Oriya |
| ؟ | Arabic question mark | Arabic |
| ; | Greek question mark | Greek |
| ። | Ethiopic full stop | Amharic |
| ፣ | Ethiopic comma | Amharic |
| ፧ | Ethiopic question mark | Amharic |
|
|
|
|
|
## Pre-Punctuation Tokens |
|
This model predicts the following set of punctuation tokens before each subword: |
|
|
|
| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| \<NULL\> | No punctuation | All |
| ¿ | Inverted question mark | Spanish |
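
Both label sets are applied per subword when decoding. The toy function below only illustrates how predicted tokens from the two tables could be stitched back into text; it is not the model's actual inference code, and the example inputs and labels are made up:

```python
def apply_punct(subwords, pre_labels, post_labels):
    """Attach predicted pre/post punctuation tokens to each subword."""
    out = []
    for sw, pre, post in zip(subwords, pre_labels, post_labels):
        piece = sw
        if post == "<ACRONYM>":
            # every character in this subword is followed by a period
            piece = "".join(c + "." for c in sw)
            post = "<NULL>"
        if pre != "<NULL>":
            piece = pre + piece
        if post != "<NULL>":
            piece = piece + post
        out.append(piece)
    return " ".join(out)

print(apply_punct(
    ["hola", "mundo", "cómo", "estás"],
    ["<NULL>", "<NULL>", "¿", "<NULL>"],
    ["<NULL>", ",", "<NULL>", "?"],
))
# hola mundo, ¿cómo estás?   (true-casing is a separate prediction, not shown here)
```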
|
|
|
|
|
|
|
# Training Details |
|
This model was trained in the NeMo framework. |
|
|
|
## Training Data |
|
This model was trained with News Crawl data from WMT. |
|
|
|
1M lines of text were used for each language, except for a few low-resource languages, for which less data may have been used.
|
|
|
Languages were chosen based on whether the News Crawl corpus contained enough data of reliable quality, as judged by the author.
|
|
|
# Limitations |
|
This model was trained on news data, and may not perform well on conversational or informal data. |
|
|
|
Further, this model is unlikely to be of production quality. |
|
It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data. |
|
|
|
|
|
|
|
# Evaluation |
|
In these metrics, keep in mind that

1. The data is noisy.
2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes incorrect.
   When conditioning on reference punctuation, true-casing and SBD are practically 100% for most languages.
3. Punctuation can be subjective, e.g.,
|
|
|
`Hola mundo, ¿cómo estás?` |
|
|
|
or |
|
|
|
`Hola mundo. ¿Cómo estás?` |
|
|
|
When the sentences are longer and more realistic, these ambiguities abound and affect all three metrics.
|
|
|
## Test Data and Example Generation |
|
Each test example was generated using the following procedure (a rough sketch in code follows the list):
|
|
|
1. Concatenate 10 random sentences |
|
2. Lower-case the concatenated sentence |
|
3. Remove all punctuation |
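
A rough sketch of this procedure (the punctuation-stripping regex is an approximation; the exact character set removed is not specified here):

```python
import random
import re

def make_test_example(corpus_sentences, n=10):
    # 1. Concatenate n random sentences
    text = " ".join(random.sample(corpus_sentences, n))
    # 2. Lower-case the concatenated sentence
    text = text.lower()
    # 3. Remove all punctuation (rough approximation of the character set)
    return re.sub(r"[^\w\s]", "", text)
```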
|
|
|
The data is a held-out portion of News Crawl, which has been deduplicated. |
|
3,000 lines of data per language were used, generating 3,000 unique examples of 10 sentences each.
|
The last 4 sentences of each example were randomly sampled from the 3,000 and may be duplicated. |
|
|
|
Examples longer than the model's maximum length were truncated. |
|
The number of affected sentences can be estimated from the "full stop" support: with 3,000 examples of 10 sentences each, we expect 30,000 full stop targets in total.
|
|
|
## Selected Language Evaluation Reports |
|
|
|
|
|
|