1-800-BAD-CODE committed
Commit fc68459
Parent(s): a7250c6
Update README.md

README.md CHANGED
@@ -177,13 +177,25 @@ This model was trained with News Crawl data from WMT.
 Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data as judged by the author.
 
 # Limitations
+
+## Sentence Boundaries / Fullstops
+Fullstop (sentence boundary) detection is near-perfect on news data, but misses obvious sentence boundaries
+when several short sentences appear contiguously.
+
+With News Crawl, SBD F1 is > 99.5%. With OpenSubtitles, SBD F1 drops unacceptably to < 90%.
+
+When I figure out why this is, I'll fine-tune the SBD head. It's likely due to pre-processing and domain mismatch.
+
+## Domain
 This model was trained on news data, and may not perform well on conversational or informal data. Notably,
 when presented with many short sentences, the model misses obvious sentence boundaries, since it was
 trained on relatively long sentences.
 
+## Quality
 Further, this model is unlikely to be of production quality.
 It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.
 
+## Excessive Predictions
 This model over-predicts Spanish question marks, especially the inverted question mark `¿` (see metrics below).
 Since `¿` is a rare token, especially in the context of a 47-language model, Spanish questions were over-sampled
 by selecting more of these sentences from additional training data that was not used. However, this seems to have