This is a fine-tuned `xlm-roberta` model that restores punctuation, true-cases (capitalizes),
and detects sentence boundaries (full stops) in 47 languages.

# Model Architecture

This model implements the following graph, which allows punctuation, true-casing, and full-stop prediction
in every language without language-specific behavior:

![graph.png](https://s3.amazonaws.com/moonup/production/uploads/62d34c813eebd640a4f97587/jpr-pMdv6iHxnjbN4QNt0.png)

We start by tokenizing the text and encoding it with XLM-Roberta, which is the pre-trained portion of this graph.
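
As an illustration of this step, here is a minimal sketch using the pre-trained base encoder from Hugging Face
`transformers`. The checkpoint name and shapes are illustrative of the encoder portion only, not the fine-tuned
graph itself:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative: the pre-trained base encoder this model fine-tunes from.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

batch = tokenizer("hola amigo cómo estás", return_tensors="pt")
with torch.no_grad():
    # Encoded subtokens: (batch=1, num_subtokens, hidden_dim=768)
    hidden = encoder(**batch).last_hidden_state
```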

Then we predict punctuation before and after every subtoken.
Predicting before each token allows for Spanish inverted question marks.
Predicting after every token allows for all other punctuation, including punctuation within continuous-script
languages and acronyms.
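
A sketch of what the two punctuation heads could look like as linear classifiers over the encoded subtokens;
the label inventory and module names here are hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical label set; the model's actual punctuation inventory differs.
PUNCT_LABELS = ["<null>", ".", ",", "?", "¿"]
HIDDEN_DIM = 768

punct_pre_head = nn.Linear(HIDDEN_DIM, len(PUNCT_LABELS))   # before each subtoken, e.g. "¿"
punct_post_head = nn.Linear(HIDDEN_DIM, len(PUNCT_LABELS))  # after each subtoken, e.g. "." or "?"

hidden = torch.randn(1, 8, HIDDEN_DIM)  # stand-in for the encoder output above
pre_logits = punct_pre_head(hidden)     # (1, 8, num_labels): e.g. "¿" before "cómo"
post_logits = punct_post_head(hidden)   # (1, 8, num_labels): e.g. "?" after "estás"
```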

We use embeddings to represent the predicted punctuation tokens to inform the sentence boundary head of the
punctuation that will be inserted into the text. This allows proper full stop prediction, since certain punctuation
tokens (periods, question marks, etc.) are strongly correlated with sentence boundaries.
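
One way to wire this up is sketched below; the module names and the additive combination of encoder states
and punctuation embeddings are assumptions, not the model's exact graph:

```python
import torch
import torch.nn as nn

HIDDEN_DIM, NUM_PUNCT = 768, 5

# Embed the predicted punctuation ids and mix them into the encoder states
# before the binary full stop (sentence boundary) head.
punct_embedding = nn.Embedding(NUM_PUNCT, HIDDEN_DIM)
fullstop_head = nn.Linear(HIDDEN_DIM, 1)

hidden = torch.randn(1, 8, HIDDEN_DIM)      # encoder output (stand-in)
post_logits = torch.randn(1, 8, NUM_PUNCT)  # punctuation head output (stand-in)

punct_ids = post_logits.argmax(dim=-1)           # which token will be inserted
boundary_in = hidden + punct_embedding(punct_ids)
fullstop_logits = fullstop_head(boundary_in)     # (1, 8, 1): sentence-final or not
```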

We then shift the full stop predictions to the right by one, to inform the true-casing head of where the beginning
of each new sentence is. This is important since true-casing is strongly correlated with sentence boundaries.
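
The shift itself is simple; here is a sketch assuming binary per-subtoken full stop decisions (treating
position 0 as always starting a sentence is an assumption):

```python
import torch

# Stand-in: binary full stop decisions for 8 subtokens.
fullstop_pred = torch.tensor([[0, 0, 0, 1, 0, 0, 0, 1]])

# Shift right by one so that position i now marks "a new sentence starts here".
sentence_start = torch.roll(fullstop_pred, shifts=1, dims=1)
sentence_start[:, 0] = 1  # assume the first subtoken always begins a sentence
# -> [[1, 0, 0, 0, 1, 0, 0, 0]], fed to the true-casing head
```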

For true-casing, we make `N` predictions per subtoken, where `N` is the number of characters in the subtoken.
In practice, `N` is the maximum subtoken length and extra predictions are ignored. Essentially, true-casing is
modeled as a multi-label problem. This allows for upper-casing arbitrary characters, e.g., "NATO", "MacDonald", "mRNA", etc.
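
A sketch of the multi-label true-casing head; the maximum subtoken length and the 0.5 threshold are illustrative:

```python
import torch
import torch.nn as nn

HIDDEN_DIM = 768
MAX_SUBTOKEN_LEN = 16  # illustrative cap; extra predictions are ignored

# One independent "upper-case this character?" decision per character slot.
truecase_head = nn.Linear(HIDDEN_DIM, MAX_SUBTOKEN_LEN)

hidden = torch.randn(1, 8, HIDDEN_DIM)         # encoder states (stand-in)
char_logits = truecase_head(hidden)            # (1, 8, MAX_SUBTOKEN_LEN)
upper_mask = torch.sigmoid(char_logits) > 0.5  # e.g. all True for "nato" -> "NATO"
# For a subtoken with k characters, only the first k of these flags are used.
```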

Applying all these predictions to the input text, we can punctuate, true-case, and split sentences in any language.
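
To make this final step concrete, here is a toy decoding loop over already-made predictions. All values below
are illustrative, and joining subtokens with spaces is a simplification (real subtokens carry word-boundary
markers):

```python
# Toy decode: apply per-character casing flags and post-token punctuation
# to plain subtoken strings.
subtokens = ["hello", "frank", "how", "are", "you"]
upper_mask = [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0]]
post_punct = ["", ".", "", "", "?"]

out = []
for tok, mask, punct in zip(subtokens, upper_mask, post_punct):
    cased = "".join(c.upper() if flag else c for c, flag in zip(tok, mask))
    out.append(cased + punct)

print(" ".join(out))  # -> "Hello Frank. How are you?"
```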

## Tokenizer