---
license: apache-2.0
library_name: generic
tags:
  - text2text-generation
  - punctuation
  - sentence-boundary-detection
  - truecasing
language:
  - af
  - am
  - ar
  - bg
  - bn
  - de
  - el
  - en
  - es
  - et
  - fa
  - fi
  - fr
  - gu
  - hi
  - hr
  - hu
  - id
  - is
  - it
  - ja
  - kk
  - kn
  - ko
  - ky
  - lt
  - lv
  - mk
  - ml
  - mr
  - nl
  - or
  - pa
  - pl
  - ps
  - pt
  - ro
  - ru
  - rw
  - so
  - sr
  - sw
  - ta
  - te
  - tr
  - uk
  - zh
---

# Model Overview
This is a fine-tuned `xlm-roberta` model that restores punctuation, true-cases (capitalizes), 
and detects sentence boundaries (full stops) in 47 languages.


## Tokenizer

Instead of the hacky ID-remapping wrapper used by FairSeq and ported as-is (not fixed) by HuggingFace, the `xlm-roberta` SentencePiece model itself was adjusted so that it encodes text to the correct IDs directly.
Per the comments in HF's tokenizer source,

```python
# Original fairseq vocab and spm vocab must be "aligned":
# Vocab    |    0    |    1    |   2    |    3    |  4  |  5  |  6  |   7   |   8   |  9
# -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----
# fairseq  | '<s>'   | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's'   | '▁de' | '-'
# spm      | '<unk>' | '<s>'   | '</s>' | ','     | '.' | '▁' | 's' | '▁de' | '-'   | '▁a'
```

The SP model was un-hacked with the following snippet 
(SentencePiece experts, let me know if there is a problem here):

```python
from sentencepiece.sentencepiece_model_pb2 import ModelProto

# Load the original XLM-R SentencePiece model as a protobuf message.
m = ModelProto()
with open("/path/to/xlmroberta/sentencepiece.bpe.model", "rb") as f:
    m.ParseFromString(f.read())

# The original model starts with <unk>, <s>, </s>. Swap those three pieces for the
# four fairseq specials so every regular piece shifts up by one, then append <mask>.
pieces = list(m.pieces)
pieces = (
    [
        ModelProto.SentencePiece(piece="<s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<pad>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="</s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<unk>", type=ModelProto.SentencePiece.Type.UNKNOWN),
    ]
    + pieces[3:]
    + [ModelProto.SentencePiece(piece="<mask>", type=ModelProto.SentencePiece.Type.USER_DEFINED)]
)
# Protobuf repeated fields cannot be reassigned, so clear and refill in place.
del m.pieces[:]
m.pieces.extend(pieces)

with open("/path/to/new/sp.model", "wb") as f:
    f.write(m.SerializeToString())
```
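
To see why dropping three pieces and prepending four lines the IDs up, the same slicing can be tried on the ten-entry toy vocabulary from the HF comment above. This is only an illustrative sketch on plain Python lists; `realign` is a hypothetical helper name, not part of SentencePiece or of this model's code.

```python
# Toy SentencePiece vocab, copied from the "spm" row of the alignment comment.
spm_vocab = ["<unk>", "<s>", "</s>", ",", ".", "▁", "s", "▁de", "-", "▁a"]


def realign(pieces):
    """Drop the three leading SPM specials, prepend the four fairseq specials,
    and append <mask> (hypothetical helper mirroring the snippet above)."""
    return ["<s>", "<pad>", "</s>", "<unk>"] + pieces[3:] + ["<mask>"]


fairseq_vocab = realign(spm_vocab)

# Print the first ten IDs for comparison with the "fairseq" row of the comment.
for new_id, piece in enumerate(fairseq_vocab[:10]):
    print(new_id, piece)
```

Every regular piece ends up exactly one position higher than in the original list, which is the same shift the wrapper applies at lookup time.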


## Post-Punctuation Tokens
This model predicts one of the following punctuation tokens after each subword:

| Token  | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| \<NULL\>    | No punctuation | All |
| \<ACRONYM\>    | Every character in this subword is followed by a period | Primarily English, some European |
| .    | Latin full stop | Many |
| ,    | Latin comma | Many |
| ?    | Latin question mark | Many |
| ？    | Full-width question mark | Chinese, Japanese |
| ，    | Full-width comma | Chinese, Japanese |
| 。    | Full-width full stop | Chinese, Japanese |
| 、    | Ideographic comma | Chinese, Japanese |
| ・    | Middle dot | Japanese |
| ।    | Danda | Hindi, Bengali, Oriya |
| ؟    | Arabic question mark | Arabic |
| ;    | Greek question mark | Greek |
| ።    | Ethiopic full stop | Amharic |
| ፣    | Ethiopic comma | Amharic |
| ፧    | Ethiopic question mark | Amharic |


## Pre-Punctuation Tokens
This model predicts the following set of punctuation tokens before each subword:

| Token  | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| \<NULL\>    | No punctuation | All |
| ¿    | Inverted question mark | Spanish |
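
As a rough illustration of how the pre- and post-punctuation predictions in the two tables could be combined with the subwords, here is a minimal sketch. It assumes one pre token and one post token per subword, the SentencePiece `▁` word-start marker, and space-separated output; `apply_punctuation` is a hypothetical helper and not this model's actual decoding code.

```python
NULL = "<NULL>"
ACRONYM = "<ACRONYM>"


def apply_punctuation(subwords, pre_preds, post_preds):
    """Rebuild text from subwords plus per-subword pre/post punctuation tokens.

    Assumption: the three lists are aligned and have equal length.
    """
    out = []
    for piece, pre, post in zip(subwords, pre_preds, post_preds):
        starts_word = piece.startswith("▁")
        text = piece[1:] if starts_word else piece
        if post == ACRONYM:
            # Every character in the subword is followed by a period.
            text = "".join(ch + "." for ch in text)
            post = NULL
        if pre != NULL:
            text = pre + text
        if post != NULL:
            text = text + post
        if starts_word and out:
            out.append(" ")
        out.append(text)
    return "".join(out)


spanish = apply_punctuation(
    ["▁hola", "▁mundo", "▁cómo", "▁estás"],
    [NULL, NULL, "¿", NULL],
    [NULL, ",", NULL, "?"],
)
print(spanish)
```

True-casing and sentence splitting are separate predictions and are not handled in this sketch.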



# Training Details
This model was trained in the NeMo framework.

## Training Data
This model was trained with News Crawl data from WMT.

1M lines of text were used for each language, except for a few low-resource languages, which may have used fewer.

Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data as judged by the author.

# Limitations
This model was trained on news data, and may not perform well on conversational or informal data.

Further, this model is unlikely to be of production quality. 
It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.
This is also a base-sized model with many languages and many tasks, so capacity may be limited.


# Evaluation
In these metrics, keep in mind that:
1. The data is noisy.
2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes incorrect.
   When conditioning on reference punctuation, true-casing and SBD are practically 100% for most languages.
3. Punctuation can be subjective. E.g.,
   
   `Hola mundo, ¿cómo estás?`
   
   or

   `Hola mundo. ¿Cómo estás?`

   With longer, more realistic sentences, these ambiguities abound and affect the metrics for all three tasks.

## Test Data and Example Generation
Each test example was generated using the following procedure:

1. Concatenate 10 random sentences
2. Lower-case the concatenated sentence
3. Remove all punctuation
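
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's actual script: punctuation is taken to mean any character in a Unicode `P*` category (which also strips apostrophes and hyphens), sentences are joined with spaces (which would not suit unsegmented scripts such as Chinese or Japanese), and `random.sample` stands in for whatever sampling scheme was really used. The names `normalize` and `make_example` are hypothetical.

```python
import random
import unicodedata


def normalize(text):
    """Lower-case the text and drop every Unicode punctuation character."""
    lowered = text.lower()
    kept = (ch for ch in lowered if not unicodedata.category(ch).startswith("P"))
    # Collapse any whitespace runs left behind by removed punctuation.
    return " ".join("".join(kept).split())


def make_example(sentences, num_sentences=10, seed=None):
    """Concatenate randomly chosen sentences and return (model input, reference)."""
    rng = random.Random(seed)
    reference = " ".join(rng.sample(sentences, num_sentences))
    return normalize(reference), reference


print(normalize("Hola mundo. ¿Cómo estás?"))
```

The untouched concatenation serves as the reference against which predicted punctuation, casing, and sentence boundaries can be scored.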

The data is a held-out portion of News Crawl, which has been deduplicated.
3,000 lines of data per language were used, generating 3,000 unique examples of 10 sentences each.
The last 4 sentences of each example were randomly sampled from the 3,000 lines and may be duplicated.

Examples longer than the model's maximum length were truncated. 
The number of affected sentences can be estimated from the "full stop" support: with 3,000
examples and 10 sentences per example, we expect 30,000 full stop targets total, so any shortfall from that figure reflects truncation.
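
The truncation estimate is simple arithmetic, sketched below. The support value passed in is a made-up placeholder for illustration only, not a number from any evaluation report, and `estimate_truncated` is a hypothetical helper name.

```python
def expected_full_stops(num_examples=3000, sentences_per_example=10):
    """Full stop targets expected if no example were truncated."""
    return num_examples * sentences_per_example


def estimate_truncated(observed_support, num_examples=3000, sentences_per_example=10):
    """Approximate number of sentences lost to truncation."""
    return expected_full_stops(num_examples, sentences_per_example) - observed_support


# Hypothetical support value, for illustration only.
print(estimate_truncated(29500))
```

For languages whose sentence-final mark is not the Latin full stop, the support of the corresponding token (for example the full-width full stop or the danda) would presumably be the figure to compare.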

## Selected Language Evaluation Reports