This model was trained with News Crawl data from WMT.
Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data, as judged by the author.

# Limitations

## Sentence Boundaries / Fullstops
Full-stop (sentence boundary) detection is near-perfect on news data, but the model misses obvious sentence
boundaries when several short sentences appear contiguously.

With News Crawl, SBD F1 is above 99.5%. With OpenSubtitles, SBD F1 drops unacceptably to below 90%.
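As a rough illustration of how an SBD F1 score like the figures above can be computed, here is a minimal sketch that scores predicted sentence-boundary positions against reference positions as a set-overlap F1. This is an assumption about the metric's shape, not the evaluation code actually used for the reported numbers.

```python
# Minimal sketch (assumed metric shape, not the author's evaluation code):
# treat each position where a sentence is predicted to end as a positive
# prediction, and score the set of predictions against the reference set.

def sbd_f1(predicted: set, reference: set) -> float:
    """F1 over boundary positions (e.g. character offsets of sentence ends)."""
    if not predicted or not reference:
        return 0.0
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: boundaries predicted after offsets 10 and 25, while the reference
# also has a boundary at 31 (one short sentence missed entirely).
score = sbd_f1({10, 25}, {10, 25, 31})
print(score)  # → 0.8 (precision 1.0, recall 2/3)
```

Under this framing, the OpenSubtitles drop corresponds mostly to recall loss: boundaries between contiguous short sentences simply go unpredicted.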

The OpenSubtitles drop is likely due to pre-processing and domain mismatch; once I figure out exactly why, I'll fine-tune the SBD head.

## Domain
This model was trained on news data and may not perform well on conversational or informal data. Notably,
when presented with many short sentences, the model misses obvious sentence boundaries, since it was
trained on relatively long sentences.

## Quality
This model is unlikely to be of production quality: it was trained with "only" 1M lines per language, and the
dev sets may have been noisy due to the nature of web-scraped news data.

## Excessive Predictions
This model over-predicts Spanish question marks, especially the inverted question mark `¿` (see metrics below).
Since `¿` is a rare token, especially in the context of a 47-language model, Spanish questions were over-sampled
by selecting more of these sentences from additional training data that was not used. However, this seems to have