1-800-BAD-CODE committed
Commit fc68459
Parent(s): a7250c6
Update README.md

README.md CHANGED
@@ -177,13 +177,25 @@ This model was trained with News Crawl data from WMT.
 Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data as judged by the author.
 
 # Limitations
+
+## Sentence Boundaries / Fullstops
+Fullstop (sentence boundary) detection is near-perfect on news data, but misses obvious sentence boundaries
+when several short sentences appear contiguously.
+
+With News Crawl, SBD F1 is > 99.5%. With OpenSubtitles, SBD F1 drops unacceptably to < 90%.
+
+When I figure out why this is, I'll fine-tune the SBD head. It's likely due to pre-processing and domain mismatch.
+
+## Domain
 This model was trained on news data, and may not perform well on conversational or informal data. Notably,
 when presented with many short sentences, the model misses obvious sentence boundaries, since it was
 trained on relatively long sentences.
 
+## Quality
 Further, this model is unlikely to be of production quality.
 It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.
 
+## Excessive Predictions
 This model over-predicts Spanish question marks, especially the inverted question mark `¿` (see metrics below).
 Since `¿` is a rare token, especially in the context of a 47-language model, Spanish questions were over-sampled
 by selecting more of these sentences from additional training data that was not used. However, this seems to have