This is a fine-tuned `xlm-roberta` model that restores punctuation, true-cases (capitalizes),
and detects sentence boundaries (full stops) in 47 languages.

# Model Architecture

This model implements the following graph, which allows punctuation, true-casing, and full-stop prediction
in every language without language-specific behavior:

![graph.png](https://s3.amazonaws.com/moonup/production/uploads/62d34c813eebd640a4f97587/jpr-pMdv6iHxnjbN4QNt0.png)

We start by tokenizing the text and encoding it with XLM-Roberta, which is the pre-trained portion of this graph.
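
As an illustration of this step, here is a minimal sketch using the pre-trained base encoder from Hugging Face
`transformers`. The checkpoint name and shapes are illustrative of the encoder portion only, not the fine-tuned
graph itself:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative: the pre-trained base encoder this model fine-tunes from.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

batch = tokenizer("hola amigo cómo estás", return_tensors="pt")
with torch.no_grad():
    # Encoded subtokens: (batch=1, num_subtokens, hidden_dim=768)
    hidden = encoder(**batch).last_hidden_state
```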

Then we predict punctuation before and after every subtoken.
Predicting before each token allows for Spanish inverted question marks.
Predicting after every token allows for all other punctuation, including punctuation within continuous-script
languages and acronyms.
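
A sketch of what the two punctuation heads could look like as linear classifiers over the encoded subtokens;
the label inventory and module names here are hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical label set; the model's actual punctuation inventory differs.
PUNCT_LABELS = ["<null>", ".", ",", "?", "¿"]
HIDDEN_DIM = 768

punct_pre_head = nn.Linear(HIDDEN_DIM, len(PUNCT_LABELS))   # before each subtoken, e.g. "¿"
punct_post_head = nn.Linear(HIDDEN_DIM, len(PUNCT_LABELS))  # after each subtoken, e.g. "." or "?"

hidden = torch.randn(1, 8, HIDDEN_DIM)  # stand-in for the encoder output above
pre_logits = punct_pre_head(hidden)     # (1, 8, num_labels): e.g. "¿" before "cómo"
post_logits = punct_post_head(hidden)   # (1, 8, num_labels): e.g. "?" after "estás"
```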

We use embeddings to represent the predicted punctuation tokens to inform the sentence boundary head of the
punctuation that will be inserted into the text. This allows proper full stop prediction, since certain punctuation
tokens (periods, question marks, etc.) are strongly correlated with sentence boundaries.
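
One way to wire this up is sketched below; the module names and the additive combination of encoder states
and punctuation embeddings are assumptions, not the model's exact graph:

```python
import torch
import torch.nn as nn

HIDDEN_DIM, NUM_PUNCT = 768, 5

# Embed the predicted punctuation ids and mix them into the encoder states
# before the binary full stop (sentence boundary) head.
punct_embedding = nn.Embedding(NUM_PUNCT, HIDDEN_DIM)
fullstop_head = nn.Linear(HIDDEN_DIM, 1)

hidden = torch.randn(1, 8, HIDDEN_DIM)      # encoder output (stand-in)
post_logits = torch.randn(1, 8, NUM_PUNCT)  # punctuation head output (stand-in)

punct_ids = post_logits.argmax(dim=-1)           # which token will be inserted
boundary_in = hidden + punct_embedding(punct_ids)
fullstop_logits = fullstop_head(boundary_in)     # (1, 8, 1): sentence-final or not
```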

We then shift the full stop predictions to the right by one, to inform the true-casing head of where the beginning
of each new sentence is. This is important since true-casing is strongly correlated with sentence boundaries.
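
The shift itself is simple; here is a sketch assuming binary per-subtoken full stop decisions (treating
position 0 as always starting a sentence is an assumption):

```python
import torch

# Stand-in: binary full stop decisions for 8 subtokens.
fullstop_pred = torch.tensor([[0, 0, 0, 1, 0, 0, 0, 1]])

# Shift right by one so that position i now marks "a new sentence starts here".
sentence_start = torch.roll(fullstop_pred, shifts=1, dims=1)
sentence_start[:, 0] = 1  # assume the first subtoken always begins a sentence
# -> [[1, 0, 0, 0, 1, 0, 0, 0]], fed to the true-casing head
```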

For true-casing, we make `N` predictions per subtoken, where `N` is the number of characters in the subtoken.
In practice, `N` is the maximum subtoken length and extra predictions are ignored. Essentially, true-casing is
modeled as a multi-label problem. This allows for upper-casing arbitrary characters, e.g., "NATO", "MacDonald", "mRNA", etc.
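
A sketch of the multi-label true-casing head; the maximum subtoken length and the 0.5 threshold are illustrative:

```python
import torch
import torch.nn as nn

HIDDEN_DIM = 768
MAX_SUBTOKEN_LEN = 16  # illustrative cap; extra predictions are ignored

# One independent "upper-case this character?" decision per character slot.
truecase_head = nn.Linear(HIDDEN_DIM, MAX_SUBTOKEN_LEN)

hidden = torch.randn(1, 8, HIDDEN_DIM)         # encoder states (stand-in)
char_logits = truecase_head(hidden)            # (1, 8, MAX_SUBTOKEN_LEN)
upper_mask = torch.sigmoid(char_logits) > 0.5  # e.g. all True for "nato" -> "NATO"
# For a subtoken with k characters, only the first k of these flags are used.
```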

Applying all these predictions to the input text, we can punctuate, true-case, and split sentences in any language.
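
To make this final step concrete, here is a toy decoding loop over already-made predictions. All values below
are illustrative, and joining subtokens with spaces is a simplification (real subtokens carry word-boundary
markers):

```python
# Toy decode: apply per-character casing flags and post-token punctuation
# to plain subtoken strings.
subtokens = ["hello", "frank", "how", "are", "you"]
upper_mask = [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0]]
post_punct = ["", ".", "", "", "?"]

out = []
for tok, mask, punct in zip(subtokens, upper_mask, post_punct):
    cased = "".join(c.upper() if flag else c for c, flag in zip(tok, mask))
    out.append(cased + punct)

print(" ".join(out))  # -> "Hello Frank. How are you?"
```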

## Tokenizer