---
license: apache-2.0
language:
- en
tags:
- punctuation
- true casing
- sentence boundary detection
- token classification
- nlp
---
|
|
|
# Model Overview |
|
This model accepts as input lower-cased, unpunctuated English text and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation). |
|
|
|
In contrast to many similar models, this model can predict punctuated acronyms (e.g., "U.S.") via a special "acronym" class, as well as arbitrarily-capitalized words (NATO, McDonald's, etc.) via multi-label true-casing predictions.
|
|
|
|
|
# Usage |
|
The easiest way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
|
|
|
```bash |
|
pip install punctuators |
|
``` |
|
|
|
|
|
Running the following script should load this model and run it on a couple of example texts:
|
<details open> |
|
|
|
<summary>Example Usage</summary> |
|
|
|
```python
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Instantiate this model
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = PunctCapSegModelONNX.from_pretrained("pcs_en")

# Define some input texts to punctuate
input_texts: List[str] = [
    "hello friend how's it going it's snowing outside right now in connecticut a large storm is moving in",
    "i live in the us where george hw bush was once president",
]

# Run inference; each input produces a list of punctuated, true-cased sentences
results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()
```
|
|
|
</details> |
|
|
|
<details open> |
|
|
|
<summary>Expected Output</summary> |
|
|
|
```text |
|
|
|
``` |
|
|
|
Note that "Friend" in this context is a proper noun, which is why the model consistently upper-cases tokens in similar contexts. |
|
|
|
</details> |
|
|
|
# Model Details |
|
|
|
This model implements the graph shown below; brief descriptions of each step follow, and a minimal code sketch of the forward pass is given after the list.
|
|
|
![graph.png](https://s3.amazonaws.com/moonup/production/uploads/1678575121699-62d34c813eebd640a4f97587.png) |
|
|
|
|
|
1. **Encoding**:
The model begins by tokenizing the text with a subword tokenizer.
The tokenizer used here is a `SentencePiece` model with a vocabulary size of 32k.
Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512.

2. **Punctuation**:
The encoded sequence is then fed into a feed-forward classification network to predict punctuation tokens.
Punctuation is predicted once per subword to allow acronyms to be properly punctuated.
An indirect benefit of per-subword prediction is that it allows the model to run in a graph generalized for continuous-script languages, e.g., Chinese.

3. **Sentence boundary detection**
For sentence boundary detection, we condition the model on punctuation via embeddings.
Each punctuation prediction is used to select an embedding for that token, which is concatenated to the encoded representation.
The SBD head analyzes both the encoding of the un-punctuated sequence and the punctuation predictions, and predicts which tokens are sentence boundaries.

4. **Shift and concat sentence boundaries**
In English, the first character of each sentence should be upper-cased.
Thus, we feed the sentence boundary information into the true-case classification network.
Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence.
Therefore, we shift the binary sentence boundary decisions to the right by one: if token `N-1` is a sentence boundary, token `N` is the first word of a sentence.
After concatenating this with the encoded text, each time step encodes whether it is the first word of a sentence, as predicted by the SBD head.

5. **True-case prediction**
Armed with the knowledge of punctuation and sentence boundaries, a classification network predicts true-casing.
Since true-casing should be done on a per-character basis, the classification network makes `N` predictions per token, where `N` is the length of the subtoken.
(In practice, `N` is the length of the longest possible subword, and the extra predictions are ignored.)
This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".
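
To make the data flow concrete, below is a minimal, illustrative PyTorch sketch of this forward pass. It is not the actual implementation: the head structures, number of attention heads, punctuation-embedding size, maximum subword length, and the boundary-shift details are assumptions made for illustration; only the vocabulary size, depth, width, maximum length, and punctuation classes come from this card.

```python
import torch
import torch.nn as nn

# From the card: 32k vocabulary, 6 layers, model dim 512, max length 256,
# and 5 punctuation classes (NULL, ACRONYM, ".", ",", "?"). Everything else is assumed.
VOCAB, N_LAYERS, D_MODEL, MAX_LEN, N_PUNCT = 32_000, 6, 512, 256, 5
PUNCT_EMB = 32        # assumed size of the punctuation-conditioning embedding
MAX_SUBWORD_LEN = 16  # assumed length of the longest subword

class PunctCapSegSketch(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # 1. Encoding: subword embeddings + positions, then a 6-layer Transformer encoder
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        # 2. Punctuation head: one prediction per subword
        self.punct_head = nn.Linear(D_MODEL, N_PUNCT)
        # 3. SBD head sees the encodings concatenated with an embedding of the predicted punctuation
        self.punct_emb = nn.Embedding(N_PUNCT, PUNCT_EMB)
        self.sbd_head = nn.Linear(D_MODEL + PUNCT_EMB, 2)
        # 5. True-case head: per-character (multi-label) predictions for each subword
        self.case_head = nn.Linear(D_MODEL + PUNCT_EMB + 1, MAX_SUBWORD_LEN)

    def forward(self, subword_ids: torch.Tensor) -> dict:
        # 1. Encode the un-punctuated, lower-cased subword sequence
        positions = torch.arange(subword_ids.size(1), device=subword_ids.device)
        x = self.encoder(self.embed(subword_ids) + self.pos(positions))
        # 2. Predict punctuation once per subword
        punct_logits = self.punct_head(x)
        p_emb = self.punct_emb(punct_logits.argmax(dim=-1))
        # 3. Predict sentence boundaries, conditioned on the punctuation predictions
        sbd_logits = self.sbd_head(torch.cat([x, p_emb], dim=-1))
        boundaries = sbd_logits.argmax(dim=-1)  # 1 = this token ends a sentence
        # 4. Shift boundaries right by one: if token N-1 ends a sentence, token N starts one
        starts = torch.roll(boundaries, shifts=1, dims=1)
        starts[:, 0] = 1  # assume the first token always starts a sentence
        # 5. Predict true-casing per character, given punctuation and sentence starts
        case_in = torch.cat([x, p_emb, starts.unsqueeze(-1).float()], dim=-1)
        return {"punct": punct_logits, "sbd": sbd_logits, "case": torch.sigmoid(self.case_head(case_in))}

# Example: one batch of 12 random subword ids
out = PunctCapSegSketch()(torch.randint(0, VOCAB, (1, 12)))
print({name: tuple(t.shape) for name, t in out.items()})
```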
|
|
|
The model's maximum length is 256 subtokens, due to the limit of the trained embeddings. |
|
However, the [punctuators](https://github.com/1-800-BAD-CODE/punctuators) package |
|
as described above will transparently predict on overlapping subsegments of long inputs and fuse the results before returning output,
|
allowing inputs to be arbitrarily long. |
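
As a rough illustration of that idea (the package's actual chunking and fusion logic may differ, and the window and overlap sizes here are arbitrary), splitting a long token sequence into overlapping windows could look like this:

```python
from typing import List

def overlapping_windows(token_ids: List[int], max_len: int = 256, overlap: int = 32) -> List[List[int]]:
    """Split a long token sequence into overlapping chunks of at most max_len tokens."""
    if len(token_ids) <= max_len:
        return [token_ids]
    stride = max_len - overlap
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
    return windows

# A 600-token input becomes three overlapping windows of at most 256 tokens each.
print([len(w) for w in overlapping_windows(list(range(600)))])
```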
|
|
|
## Punctuation Tokens |
|
This model predicts the following set of punctuation tokens: |
|
|
|
| Token | Description |
| ---: | :---------- |
| NULL | Predict no punctuation |
| ACRONYM | A period follows every character in this subword |
| . | Latin full stop |
| , | Latin comma |
| ? | Latin question mark |
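
As an illustration of how the `ACRONYM` class composes with true-casing (this is not the package's actual post-processing code), an `ACRONYM` prediction for the subword "us" together with upper-case predictions for both characters yields "U.S.":

```python
def apply_acronym(subword: str, upper_flags: list) -> str:
    """Place a period after each character, upper-casing the characters flagged by the true-case head."""
    return "".join((ch.upper() if up else ch) + "." for ch, up in zip(subword, upper_flags))

print(apply_acronym("us", [True, True]))  # U.S.
print(apply_acronym("hw", [True, True]))  # H.W.
```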
|
|
|
# Training Details |
|
|
|
## Training Framework |
|
This model was trained on a forked branch of the [NeMo](https://github.com/NVIDIA/NeMo) framework. |
|
|
|
## Training Data |
|
This model was trained with News Crawl data from WMT. |
|
|
|
Approximately 10M lines were used from the years 2021 and 2012. |
|
The latter was included in an attempt to reduce topical bias: annual news is typically dominated by a few topics, and the 2021 data is dominated by COVID-related discussion.
|
|
|
# Limitations |
|
## Domain |
|
This model was trained on news data, and may not perform well on conversational or informal data. |
|
|
|
## Noisy Training Data |
|
The training data was noisy, and no manual cleaning was performed.
|
|
|
Acronyms and abbreviations are especially noisy; the tables below show how many variations of each token appear in the training data.
|
|
|
| Token | Count |
| -: | :- |
| Mr | 115232 |
| Mr. | 108212 |
|
|
|
| Token | Count |
| -: | :- |
| U.S. | 85324 |
| US | 37332 |
| U.S | 354 |
| U.s | 108 |
| u.S. | 65 |
|
|
|
Thus, the model's acronym and abbreviation predictions may be inconsistent.
|
|
|
|
|
Further, an assumption for sentence boundary detection targets is that each line of the input data is exactly one sentence. |
|
However, a non-negligible portion of the training data contains multiple sentences in one line. |
|
Thus, the SBD head may miss an obvious sentence boundary if it resembles one of these mislabeled boundaries in the training data.
|
|
|
|
|
# Evaluation |
|
In these metrics, keep in mind that

1. The data is noisy.

2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes incorrect.
When conditioning on reference punctuation, true-casing and SBD metrics are much higher w.r.t. the reference targets.

3. Punctuation can be subjective. E.g.,
|
|
|
`Hello Frank, how's it going?` |
|
|
|
or |
|
|
|
`Hello Frank. How's it going?` |
|
|
|
When sentences are longer and more realistic, these ambiguities abound and affect all three metrics.
|
|
|
## Test Data and Example Generation |
|
Each test example was generated using the following procedure (sketched in code after the list):
|
|
|
1. Concatenate 10 random sentences |
|
2. Lower-case the concatenated sentence |
|
3. Remove all punctuation |
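
A minimal sketch of this procedure is shown below; the held-out sentences are toy placeholders, and keeping apostrophes when stripping punctuation is an assumption based on the example inputs above.

```python
import random
import string

# Strip all punctuation except apostrophes (an assumption; the example inputs above retain them).
STRIP = str.maketrans("", "", string.punctuation.replace("'", ""))

def make_example(sentences, n=10):
    """Concatenate n random sentences, lower-case the result, and remove punctuation."""
    return " ".join(random.sample(sentences, n)).lower().translate(STRIP)

held_out = [
    "It's snowing in Connecticut.",          # toy placeholder sentences,
    "A large storm is moving in.",           # not actual News Crawl data
    "George H.W. Bush was once president.",
]
print(make_example(held_out, n=3))
```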
|
|
|
The data is a held-out portion of News Crawl, which has been deduplicated. |
|
2,000 lines of data were used, generating 2,000 unique examples of 10 sentences each.
|
|
|
Examples longer than the model's maximum length (256) were truncated. |
|
The number of affected sentences can be estimated from the "full stop" support: with 2,000 examples and 10 sentences per example, we expect 20,000 full stop targets in total, so any shortfall in the reported support reflects sentences lost to truncation.
|
|
|
|