1-800-BAD-CODE
committed 23dc508 (1 parent: c224a7e)
make model card simpler

README.md CHANGED
@@ -69,8 +69,17 @@ and detect sentence boundaries (full stops) in 47 languages.
 
 # Usage
 
+If you just want to play with the model, the widget on this page will suffice. To use the model offline,
+the following snippets show how to use it both with a wrapper (which I wrote, available from `PyPI`)
+and manually (using the ONNX and SentencePiece models in this repo).
+
 ## Usage via `punctuators` package
 
+
+<details>
+
+<summary>Click to see usage with wrappers</summary>
+
 The easiest way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
 
 ```bash
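The actual wrapper snippet is elided by the diff context above; for orientation, usage through `punctuators` looks roughly like the sketch below. The class name `PunctCapSegModelONNX`, the pretrained identifier `pcs_47lang`, and the `infer` signature are assumptions based on the `punctuators` package, not lines from this commit.

```python
# Hedged sketch only: class name, pretrained identifier, and signatures are assumptions.
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Load the ONNX + SentencePiece bundle through the wrapper (identifier assumed).
model: PunctCapSegModelONNX = PunctCapSegModelONNX.from_pretrained("pcs_47lang")

# Unpunctuated, lower-cased inputs in any supported language.
input_texts: List[str] = [
    "hello friend how are you today",
    "hola amigo como estas hoy",
]

# Each input yields a list of punctuated, true-cased sentences.
results: List[List[str]] = model.infer(input_texts)
for sentences in results:
    for sentence in sentences:
        print(sentence)
```

Each inner list corresponds to one input text, split at the predicted sentence boundaries.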
@@ -180,6 +189,7 @@ Outputs:
 
 </details>
 
+</details>
 
 ## Manual Usage
 If you want to use the ONNX and SP models without wrappers, see the following example.
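The manual example referenced above is likewise outside this hunk; a minimal sketch of driving the ONNX and SentencePiece models directly, assuming `onnxruntime` and `sentencepiece` are installed, might look as follows. The file names (`sp.model`, `model.onnx`), the input tensor name, and the output layout are assumptions to be checked against the repo's own example.

```python
# Minimal sketch; file names, tensor names, and preprocessing details are assumptions.
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

# Load the SentencePiece model and ONNX graph shipped in the repo (paths assumed).
sp = spm.SentencePieceProcessor(model_file="sp.model")
session = ort.InferenceSession("model.onnx")

text = "hello friend how are you today"
ids = sp.encode(text, out_type=int)

# Batch of one sequence; the real example may add special tokens, padding, or masks.
input_ids = np.array([ids], dtype=np.int64)
outputs = session.run(None, {"input_ids": input_ids})

# The graph is described as predicting punctuation (pre/post), true-casing, and full stops;
# print the output names and shapes to see how they map.
for meta, value in zip(session.get_outputs(), outputs):
    print(meta.name, value.shape)
```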
@@ -305,11 +315,16 @@ Outputs:
 
 
 # Model Architecture
+
 This model implements the following graph, which allows punctuation, true-casing, and fullstop prediction
 in every language without language-specific behavior:
 
 ![graph.png](https://s3.amazonaws.com/moonup/production/uploads/62d34c813eebd640a4f97587/jpr-pMdv6iHxnjbN4QNt0.png)
 
+<details>
+
+<summary>Click to see graph explanations</summary>
+
 We start by tokenizing the text and encoding it with XLM-Roberta, which is the pre-trained portion of this graph.
 
 Then we predict punctuation before and after every subtoken.
@@ -330,8 +345,14 @@ modeled as a multi-label problem. This allows for upper-casing arbitrary charact
 
 Applying all these predictions to the input text, we can punctuate, true-case, and split sentences in any language.
 
+</details>
+
 ## Tokenizer
 
+<details>
+
+<summary>Click to see how the XLM-Roberta tokenizer was un-hacked</summary>
+
 Instead of the hacky wrapper used by FairSeq and strangely ported (not fixed) by HuggingFace, the `xlm-roberta` SentencePiece model was adjusted to correctly encode
 the text. Per HF's comments,
 
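The adjustment described above is completed in the next hunk, which ends with writing `/path/to/new/sp.model`. A hedged sketch of the mechanism, assuming the pip `sentencepiece` package exposes its model proto as `sentencepiece.sentencepiece_model_pb2`, with the actual vocabulary edits left out because they are not shown in this diff:

```python
# Mechanism-only sketch: the actual piece edits are not shown in this hunk.
# Assumes the pip package ships the proto as sentencepiece.sentencepiece_model_pb2.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

m = sp_pb2.ModelProto()
with open("/path/to/original/sp.model", "rb") as f:
    m.ParseFromString(f.read())

# Inspect the first few pieces to see how the vocabulary is laid out.
for piece in m.pieces[:5]:
    print(piece.piece, piece.type, piece.score)

# ... reorder or insert pieces here so SentencePiece IDs match the model's embeddings ...

# Serialize the adjusted model, matching the path used in the next hunk.
with open("/path/to/new/sp.model", "wb") as f:
    f.write(m.SerializeToString())
```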
@@ -373,6 +394,7 @@ with open("/path/to/new/sp.model", "wb") as f:
 
 Now we can use just the SP model without a wrapper.
 
+</details>
 
 ## Post-Punctuation Tokens
 This model predicts the following set of punctuation tokens after each subtoken:
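As a companion to "Now we can use just the SP model without a wrapper", a stand-alone encode/decode round trip might look like the sketch below; the file name `sp.model` is an assumption, and whether the resulting IDs line up with the ONNX graph's embedding table should be verified against the repo's example.

```python
# Stand-alone use of the adjusted SentencePiece model; file name assumed.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sp.model")

text = "hola amigo como estas hoy"
pieces = sp.encode(text, out_type=str)
ids = sp.encode(text, out_type=int)
print(pieces)
print(ids)

# Round-trip back to text; with the adjusted model, no wrapper-side ID shifting is needed.
print(sp.decode(ids))
```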