Commit f852113 by 1-800-BAD-CODE: Update README.md
Parent: c7bbf57

README.md (changed):
The easy way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):

```
pip install punctuators
```

Running the following script should load this model and run some random texts I made up:

<details open>

<summary>Example Usage</summary>

```python
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Load the pre-trained English punctuation/true-casing/segmentation model
m = PunctCapSegModelONNX.from_pretrained("pcs_en")

# Define some input texts to punctuate
input_texts: List[str] = [
    "george hw bush was the president of the us for 8 years",
    "i woke up at 6 am and took the dog for a hike in the metacomet mountains we like to take morning adventures on the weekends",
    "despite being mid march it snowed overnight and into the morning here in connecticut it was snowier up in the mountains than in the farmington valley where i live",
    "i saw mr smith at the store he was shopping for a new lawn mower i suggested he get one of those new battery operated ones they're so much quieter",
]
results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    # Print each input alongside its punctuated, segmented sentences
    print(f"In: {input_text}")
    for text in output_texts:
        print(f"Out: {text}")
    print()
```

Exact output may vary based on the model version; here is the current output:
</details>

<details open>

<summary>Expected Output</summary>

```text
In: george hw bush was the president of the us for 8 years
Out: George H.W. Bush was the president of the U.S. for 8 years.

In: i woke up at 6 am and took the dog for a hike in the metacomet mountains we like to take morning adventures on the weekends
Out: I woke up at 6 a.m. and took the dog for a hike in the Metacomet Mountains.
Out: We like to take morning adventures on the weekends.

In: despite being mid march it snowed overnight and into the morning here in connecticut it was snowier up in the mountains than in the farmington valley where i live
Out: Despite being mid March, it snowed overnight and into the morning.
Out: Here in Connecticut, it was snowier up in the mountains than in the Farmington Valley where I live.

In: i saw mr smith at the store he was shopping for a new lawn mower i suggested he get one of those new battery operated ones they're so much quieter
Out: I saw Mr. Smith at the store he was shopping for a new lawn mower.
Out: I suggested he get one of those new battery operated ones.
Out: They're so much quieter.
```

Note that "Friend" in this context is a proper noun, which is why the model consistently upper-cases tokens in similar contexts.
7. **Shift and concat sentence boundaries**
   In English, the first character of each sentence should be upper-cased.
   Thus, we should feed the sentence boundary information to the true-case classification network.
   Since the true-case classification network is feed-forward and has no temporal context, each time step must embed whether it is the first word of a sentence.
   Therefore, we shift the binary sentence boundary decisions to the right by one: if token `N-1` is a sentence boundary, token `N` is the first word of a sentence.
   Concatenating this with the encoded text, each time step contains whether it is the first word of a sentence as predicted by the SBD head.
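The shift-and-concat step above can be sketched in a few lines of NumPy. This is only an illustration; the tensor names, shapes, and the choice to mark the first token as a sentence start are assumptions, not the model's actual code:

```python
import numpy as np

# Assumed shapes: T time steps, D encoder features.
T, D = 6, 4
encoded = np.random.rand(T, D).astype(np.float32)  # encoder output per token

# Binary sentence boundary decisions from the SBD head (1 = boundary).
boundaries = np.array([0, 0, 1, 0, 0, 1], dtype=np.float32)

# Shift right by one: if token N-1 is a boundary, token N starts a sentence.
# Assume the first token always starts a sentence.
first_word = np.empty_like(boundaries)
first_word[0] = 1.0
first_word[1:] = boundaries[:-1]

# Concatenate the flag onto each time step's encoding: (T, D + 1).
truecase_input = np.concatenate([encoded, first_word[:, None]], axis=1)

print(truecase_input.shape)  # (6, 5)
print(first_word)            # [1. 0. 0. 1. 0. 0.]
```

Each row of `truecase_input` now carries its own "first word of a sentence" flag, which is what a context-free feed-forward classifier needs.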
## Noisy Training Data
The training data was noisy, and no manual cleaning was utilized.

### Acronyms and Abbreviations
Acronyms and abbreviations are especially noisy; the table below shows how many variations of each token appear in the training data.

| Token | Count |
| - | - |

Thus, the model's acronym and abbreviation predictions may be a bit unpredictable.
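A variation count like the one tabulated above can be reproduced with a few lines of Python. The corpus here is a made-up stand-in, not the actual training data:

```python
from collections import Counter

# Toy corpus standing in for the (noisy) training data.
lines = [
    "NATO summit opens",
    "Nato leaders meet",
    "nato officials said",
    "NATO and EU officials",
]

# Count surface variants of each token, grouped by lower-cased form.
variants: dict[str, Counter] = {}
for line in lines:
    for token in line.split():
        variants.setdefault(token.lower(), Counter())[token] += 1

print(len(variants["nato"]))  # 3 distinct casings: NATO, Nato, nato
```

With inconsistent targets like these, the model sees conflicting labels for the same input token, which is why its acronym predictions are unpredictable.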
### Sentence Boundary Detection Targets
An assumption for sentence boundary detection targets is that each line of the input data is exactly one sentence.
However, a non-negligible portion of the training data contains multiple sentences per line.
Thus, the SBD head may miss an obvious sentence boundary if it's similar to an error seen in the training data.
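Under that one-sentence-per-line assumption, boundary targets might be derived as follows. This is a simplified sketch of the idea, not the actual training pipeline:

```python
# Each input line is assumed to be exactly one sentence, so the last token
# of each line is labeled as a sentence boundary (1) and all others as 0.
lines = ["hello world", "how are you"]

tokens: list[str] = []
targets: list[int] = []  # 1 = this token ends a sentence
for line in lines:
    words = line.split()
    tokens.extend(words)
    targets.extend([0] * (len(words) - 1) + [1])

print(tokens)   # ['hello', 'world', 'how', 'are', 'you']
print(targets)  # [0, 1, 0, 0, 1]
```

If a line actually holds two sentences, the inner boundary is labeled 0, which teaches the model to miss similar boundaries at inference time.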
We show here the cosine similarity between the embeddings of each token:

| | NULL | ACRONYM | . | , | ? |
| - | - | - | - | - | - |
| NULL | 1.00 | | | | |
| ACRONYM | -0.49 | 1.00 | | | |
| . | -1.00 | 0.48 | 1.00 | | |
| , | 1.00 | -0.48 | -1.00 | 1.00 | |
| ? | -1.00 | 0.49 | 1.00 | -1.00 | 1.00 |

Recall that these embeddings are used to predict sentence boundaries... thus we should expect full stops to cluster.
Next, we see that "`.`" and "`?`" are exactly the same, because w.r.t. SBD these tokens are equivalent.
Further, we see that "`.`" and "`?`" are exactly the opposite of `NULL`.
This is expected since these tokens typically imply sentence boundaries, whereas `NULL` and "`,`" never do.

Lastly, we see that `ACRONYM` is similar to, but not the same as, the full stops "`.`" and "`?`", and far from, but not the opposite of, `NULL` and "`,`".
Intuition suggests this is because acronyms can be full stops ("I live in the northern U.S. It's cold here.") or not ("It's 5 a.m. and I'm tired.").
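The relationships in the similarity table can be checked with a small cosine-similarity sketch. The 2-D vectors below are made up to reproduce the qualitative pattern (full stops opposite `NULL` and "`,`", `ACRONYM` in between); they are not the model's real embeddings:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 2-D embeddings illustrating the pattern in the table.
emb = {
    "NULL":    np.array([ 1.0, 0.0]),
    ".":       np.array([-1.0, 0.0]),
    "?":       np.array([-1.0, 0.0]),
    ",":       np.array([ 1.0, 0.0]),
    "ACRONYM": np.array([-0.5, 1.0]),
}

print(round(cosine(emb["."], emb["?"]), 2))           # 1.0   (identical)
print(round(cosine(emb["."], emb["NULL"]), 2))        # -1.0  (opposite)
print(round(cosine(emb["ACRONYM"], emb["NULL"]), 2))  # -0.45 (far, but not opposite)
```

Applying `cosine` pairwise over the model's actual punctuation-token embedding matrix is how a table like the one above would be produced.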