---
license: apache-2.0
library_name: generic
tags:
- text2text-generation
- punctuation
- sentence-boundary-detection
- truecasing
- true-casing
language:
- af
- am
- ar
- bg
- bn
- de
- el
- en
- es
- et
- fa
- fi
- fr
- gu
- hi
- hr
- hu
- id
- is
- it
- ja
- kk
- kn
- ko
- ky
- lt
- lv
- mk
- ml
- mr
- nl
- or
- pa
- pl
- ps
- pt
- ro
- ru
- rw
- so
- sr
- sw
- ta
- te
- tr
- uk
- zh
---

# Model Overview
This is a fine-tuned `xlm-roberta` model that restores punctuation, true-cases (capitalizes),
and detects sentence boundaries (full stops) in 47 languages.

# Model Architecture
This model implements the following graph, which allows punctuation, true-casing, and full stop prediction
in every language without language-specific behavior:

![graph.png](https://s3.amazonaws.com/moonup/production/uploads/62d34c813eebd640a4f97587/jpr-pMdv6iHxnjbN4QNt0.png)

We start by tokenizing the text and encoding it with XLM-Roberta, which is the pre-trained portion of this graph.

Then we predict punctuation before and after every subtoken.
Predicting before each subtoken allows for Spanish inverted question marks.
Predicting after every subtoken allows for all other punctuation, including punctuation within continuous-script
languages and acronyms.

We use embeddings to represent the predicted punctuation tokens, which informs the sentence boundary head of the
punctuation that will be inserted into the text. This allows proper full stop prediction, since certain punctuation
tokens (periods, question marks, etc.) are strongly correlated with sentence boundaries.

We then shift full stop predictions to the right by one, to inform the true-casing head of where the beginning
of each new sentence is. This is important since true-casing is strongly correlated with sentence boundaries.
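
As a minimal illustration of the shift described above, the sketch below turns per-subtoken sentence-boundary flags into sentence-start flags. The function name `shift_right` and the choice to treat the first subtoken as a sentence start are assumptions made for this example, not details taken from the model's implementation.

```python
def shift_right(is_boundary, first_is_start=True):
    """Convert "subtoken i ends a sentence" flags into "subtoken i starts a sentence" flags.

    `first_is_start` is an assumption of this sketch: the first subtoken
    of the input is treated as the beginning of a sentence.
    """
    if not is_boundary:
        return []
    # Drop the last flag and prepend the flag for the first position.
    return [first_is_start] + list(is_boundary[:-1])


# Subtokens 1 and 4 end sentences, so subtokens 0 and 2 start sentences.
boundaries = [False, True, False, False, True]
starts = shift_right(boundaries)
print(starts)
```

The true-casing head can then treat a `True` at position `i` as a strong hint that the first character of subtoken `i` should be upper-cased.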

For true-casing, we make `N` predictions per subtoken, where `N` is the number of characters in the subtoken.
In practice, `N` is the maximum subtoken length and extra predictions are ignored. Essentially, true-casing is
modeled as a multi-label problem. This allows for upper-casing arbitrary characters, e.g., "NATO", "MacDonald", "mRNA", etc.
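
The multi-label formulation can be pictured with a small sketch: each subtoken receives a fixed-length vector of per-character "upper-case this character" flags, and flags beyond the subtoken's length are ignored. The names `apply_true_case`, `flags`, and the value of `MAX_SUBTOKEN_LEN` are hypothetical and chosen only for this example.

```python
MAX_SUBTOKEN_LEN = 16  # hypothetical value for N; the real maximum is not stated here


def apply_true_case(subtoken, upper_flags):
    """Upper-case character i of `subtoken` wherever upper_flags[i] is True."""
    if len(upper_flags) < len(subtoken):
        raise ValueError("need at least one prediction per character")
    # zip stops at the end of the subtoken, so surplus predictions are ignored.
    return "".join(ch.upper() if flag else ch for ch, flag in zip(subtoken, upper_flags))


def flags(*upper_positions):
    """Build a fixed-length flag vector with the given positions set to True."""
    return [i in upper_positions for i in range(MAX_SUBTOKEN_LEN)]


print(apply_true_case("nato", flags(0, 1, 2, 3)))
print(apply_true_case("macdonald", flags(0, 3)))
print(apply_true_case("mrna", flags(1, 2, 3)))
```

Because every character has its own flag, mixed-case words need no special handling beyond setting the right positions.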

Applying all these predictions to the input text, we can punctuate, true-case, and split sentences in any language.

## Tokenizer

Instead of relying on the hacky wrapper used by FairSeq, which HuggingFace ported rather than fixed, the `xlm-roberta` SentencePiece model itself was adjusted to encode
the text correctly. Per HF's comments:

```python
# Original fairseq vocab and spm vocab must be "aligned":
# Vocab    |    0    |    1    |   2    |    3    |  4  |  5  |  6  |   7   |   8   |  9
# -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----
# fairseq  | '<s>'   | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's'   | '▁de' | '-'
# spm      | '<unk>' | '<s>'   | '</s>' | ','     | '.' | '▁' | 's' | '▁de' | '-'   | '▁a'
```

The SP model was un-hacked with the following snippet
(SentencePiece experts, let me know if there is a problem here):

```python
from sentencepiece.sentencepiece_model_pb2 import ModelProto

# Load the original XLM-RoBERTa SentencePiece model as a protobuf message.
m = ModelProto()
with open("/path/to/xlmroberta/sentencepiece.bpe.model", "rb") as f:
    m.ParseFromString(f.read())

# Replace the first three SP pieces with the four fairseq special tokens,
# keep all remaining pieces in their original order, and append <mask>.
pieces = list(m.pieces)
pieces = (
    [
        ModelProto.SentencePiece(piece="<s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<pad>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="</s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<unk>", type=ModelProto.SentencePiece.Type.UNKNOWN),
    ]
    + pieces[3:]
    + [ModelProto.SentencePiece(piece="<mask>", type=ModelProto.SentencePiece.Type.USER_DEFINED)]
)
del m.pieces[:]
m.pieces.extend(pieces)

# Write the re-aligned model to a new file.
with open("/path/to/new/sp.model", "wb") as f:
    f.write(m.SerializeToString())
```
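
To see what this re-ordering does to the token IDs, the same list manipulation can be replayed on the ten-entry toy vocabularies from the HF comment above. This is only an illustration on plain Python lists; it does not load or modify a real SentencePiece model, and the variable names are made up for the example.

```python
# Toy vocabularies copied from the alignment table in the HF comment.
fairseq_vocab = ["<s>", "<pad>", "</s>", "<unk>", ",", ".", "▁", "s", "▁de", "-"]
spm_vocab = ["<unk>", "<s>", "</s>", ",", ".", "▁", "s", "▁de", "-", "▁a"]

# Same manipulation as the snippet above: four special tokens in fairseq order,
# then every regular SP piece (index 3 onward), then <mask> at the end.
realigned = ["<s>", "<pad>", "</s>", "<unk>"] + spm_vocab[3:] + ["<mask>"]

# The re-aligned vocabulary now starts exactly like the fairseq vocabulary.
print(realigned[: len(fairseq_vocab)] == fairseq_vocab)

# Every regular piece moved up by one ID, which is the offset the wrapper applied.
for piece in spm_vocab[3:]:
    print(piece, spm_vocab.index(piece), "->", realigned.index(piece))
```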

## Post-Punctuation Tokens
This model predicts the following set of punctuation tokens after each subtoken:

| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| \<NULL\> | No punctuation | All |
| \<ACRONYM\> | Every character in this subtoken is followed by a period | Primarily English, some European |
| . | Latin full stop | Many |
| , | Latin comma | Many |
| ? | Latin question mark | Many |
| ？ | Full-width question mark | Chinese, Japanese |
| ， | Full-width comma | Chinese, Japanese |
| 。 | Full-width full stop | Chinese, Japanese |
| 、 | Ideographic comma | Chinese, Japanese |
| ・ | Middle dot | Japanese |
| । | Danda | Hindi, Bengali, Oriya |
| ؟ | Arabic question mark | Arabic |
| ; | Greek question mark | Greek |
| ። | Ethiopic full stop | Amharic |
| ፣ | Ethiopic comma | Amharic |
| ፧ | Ethiopic question mark | Amharic |

## Pre-Punctuation Tokens
This model predicts the following set of punctuation tokens before each subtoken:

| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| \<NULL\> | No punctuation | All |
| ¿ | Inverted question mark | Spanish |
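
How the pre- and post-punctuation predictions might be applied to a subtoken sequence can be sketched as follows. This is a hedged illustration, not the model's actual decoding code: the function name `apply_punctuation` is hypothetical, and the sketch assumes SentencePiece-style subtokens in which a leading `▁` marks the start of a new word.

```python
NULL = "<NULL>"
ACRONYM = "<ACRONYM>"


def apply_punctuation(subtokens, pre_preds, post_preds):
    """Rebuild text from subtokens plus one pre- and one post-prediction per subtoken."""
    out = []
    for tok, pre, post in zip(subtokens, pre_preds, post_preds):
        # Assumption: a leading "▁" marks a word boundary, as in SentencePiece output.
        starts_word = tok.startswith("▁")
        chars = tok[1:] if starts_word else tok
        if starts_word and out:
            out.append(" ")
        if pre != NULL:
            out.append(pre)
        if post == ACRONYM:
            # Every character in the subtoken is followed by a period.
            out.append("".join(ch + "." for ch in chars))
        else:
            out.append(chars)
            if post != NULL:
                out.append(post)
    return "".join(out)


spanish = apply_punctuation(
    ["▁hola", "▁mundo", "▁cómo", "▁está", "s"],
    [NULL, NULL, "¿", NULL, NULL],
    [NULL, ",", NULL, NULL, "?"],
)
acronym = apply_punctuation(
    ["▁the", "▁us", "▁army"],
    [NULL, NULL, NULL],
    [NULL, ACRONYM, NULL],
)
print(spanish)
print(acronym)
```

True-casing and sentence splitting would be applied on top of this output using the other heads' predictions.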

# Training Details
This model was trained in the NeMo framework.

## Training Data
This model was trained with News Crawl data from WMT.

1M lines of text were used for each language, except for a few low-resource languages, for which fewer lines may have been used.

Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data, as judged by the author.

# Limitations
This model was trained on news data, and may not perform well on conversational or informal data.

Further, this model is unlikely to be of production quality.
It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.

This model over-predicts the inverted Spanish question mark, `¿`. Since `¿` is a rare token, especially in the
context of a 47-language model, Spanish questions were over-sampled by selecting more of these sentences from
additional data that was not otherwise used for training. However, this seems to have "over-corrected" the problem,
and too many inverted question marks are now predicted.

# Evaluation
In these metrics, keep in mind that

1. The data is noisy.
2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes incorrect.
   When conditioning on reference punctuation, true-casing and SBD are practically 100% for most languages.
3. Punctuation can be subjective. E.g.,

   `Hola mundo, ¿cómo estás?`

   or

   `Hola mundo. ¿Cómo estás?`

   When the sentences are longer and more realistic, these ambiguities abound and affect all three metrics.

## Test Data and Example Generation
Each test example was generated using the following procedure:

1. Concatenate 11 sentences: one sentence from the test set followed by 10 random sentences from the same set
2. Lower-case the concatenated sentence
3. Remove all punctuation

The data is a held-out portion of News Crawl, which has been deduplicated.
3,000 lines of data per language were used, generating 3,000 unique examples of 11 sentences each.
Example `i` begins with sentence `i` and is followed by 10 random
sentences selected from the 3,000-sentence test set.
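
The procedure above can be sketched roughly as follows. Several details here are assumptions made for illustration and are not stated in this card: sentences are joined with a single space, the 10 extra sentences are sampled with replacement, and "punctuation" is taken to mean any character in a Unicode `P*` category. The names `make_examples` and `strip_punctuation` are hypothetical.

```python
import random
import unicodedata


def strip_punctuation(text):
    """Remove every character whose Unicode category starts with "P" (an assumption)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def make_examples(sentences, num_random=10, seed=0):
    """Return (model_input, reference) pairs, one per test sentence."""
    rng = random.Random(seed)
    examples = []
    for first in sentences:
        # Example i begins with sentence i, followed by randomly chosen sentences.
        extras = rng.choices(sentences, k=num_random)
        reference = " ".join([first] + extras)
        # Lower-case, then remove punctuation, to build the model input.
        model_input = strip_punctuation(reference.lower())
        examples.append((model_input, reference))
    return examples


test_sentences = ["Hello, world.", "How are you?", "Fine, thanks."]
examples = make_examples(test_sentences, num_random=2)
for model_input, reference in examples:
    print(repr(model_input), "<-", repr(reference))
```

The reference text keeps the original punctuation, casing, and sentence boundaries, so all three tasks can be scored against it.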

## Selected Language Evaluation Reports
For now, metrics for a few selected languages are shown below.
Given the amount of work required to collect pretty metrics in 47 languages, I'll add more eventually.

Expand any of the following tabs to see metrics for that language.