1-800-BAD-CODE committed
Commit 3c7b25f • Parent: 0dc2ad3
Update README.md

README.md CHANGED
@@ -178,7 +178,8 @@ This model was trained on news data, and may not perform well on conversational
 Further, this model is unlikely to be of production quality.
 It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.
 
-This model over-predicts the inverted
+This model over-predicts Spanish question marks, especially the inverted question mark `¿` (see metrics below).
+Since `¿` is a rare token, especially in the
 context of a 47-language model, Spanish questions were over-sampled by selecting more of these sentences from
 additional training data that was not used. However, this seems to have "over-corrected" the problem and a lot
 of Spanish question marks are predicted. This can be fixed by exposing prior probabilities, but I'll fine-tune
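The prior-probability fix mentioned in the new README text can be sketched as standard class-prior correction: since over-sampling `¿` sentences inflates that class's training prior, dividing out the training prior and multiplying in the true prior (in log space) re-targets the classifier's decisions. This is a minimal illustration with made-up label names and made-up prior values — not the model's actual classes, numbers, or API:

```python
import numpy as np

# Hypothetical label set for a punctuation-restoration head; the actual
# model's classes are not documented in this diff.
CLASSES = ["<none>", ".", ",", "?", "¿"]

def adjust_for_priors(logits, train_priors, true_priors):
    """Re-target class logits from training priors to deployment priors.

    Since p(y|x) is proportional to p(x|y) * p(y), subtracting the log of
    the (over-sampled) training prior and adding the log of the true prior
    undoes the bias introduced by over-sampling `¿` sentences.
    """
    return logits + np.log(true_priors) - np.log(train_priors)

# Assumed numbers, purely for illustration: suppose `¿` was over-sampled
# to 20% of training data but occurs ~1% of the time in real text.
train_priors = np.array([0.65, 0.07, 0.05, 0.03, 0.20])
true_priors  = np.array([0.85, 0.07, 0.05, 0.02, 0.01])

logits = np.array([2.0, 0.5, 0.3, 1.0, 2.1])  # raw head scores for one token
adjusted = adjust_for_priors(logits, train_priors, true_priors)

print(CLASSES[int(np.argmax(logits))])    # ¿ — over-predicted raw decision
print(CLASSES[int(np.argmax(adjusted))])  # <none> — after prior correction
```

The same correction can be folded into the model once as a fixed per-class bias, which is presumably what "exposing prior probabilities" would let users tune without retraining.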