1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase

Text2Text Generation

sentence-boundary-detection

1-800-BAD-CODE committed on Jun 2, 2023

Commit

affeb69

1 Parent(s): d44eb1a

Update README.md

Files changed (1)

README.md +2 -12

@@ -289,21 +289,11 @@ Languages were chosen based on whether the News Crawl corpus contained enough re
 # Limitations
-## Sentence Boundaries / Fullstops
-Fullstop (sentence boundary) detection is near-perfect with news data, but misses obvious sentence boundaries
-when several short sentences appear contiguously.
-With News crawl, SBD F1 is > 99.5%. With OpenSubtitles, SBD F1 drops unacceptably to < 90%.
-When I figure out why this is, I'll fine-tune the SBD head. It's likely due to pre-processing and domain mis-match.
 ## Domain
-This model was trained on news data, and may not perform well on conversational or informal data. Notably,
-when presented with many short sentences, the model misses obvious sentence boundaries since the model was
-trained on relatively-long sentences.
 ## Quality
-Further, this model is unlikely to be of production quality.
 It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.
 ## Excessive Predictions

 # Limitations
 ## Domain
+This model was trained on news data, and may not perform well on conversational or informal data.
 ## Quality
+This model is unlikely to be of production quality.
 It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.
 ## Excessive Predictions