1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase

Text2Text Generation

sentence-boundary-detection


1-800-BAD-CODE committed on May 12, 2023

Commit 5548a75 • 1 Parent(s): cad4273

Update README.md


README.md +4 -0


@@ -178,6 +178,10 @@ This model was trained on news data, and may not perform well on conversational
 Further, this model is unlikely to be of production quality.
 It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.
+This model over-predicts the inverted Spanish question mark, `¿`. Because `¿` is a rare token, especially in
+a 47-language model, Spanish questions were over-sampled by drawing extra question sentences from
+training data that would otherwise have gone unused. However, this seems to have over-corrected the problem:
+the model now predicts `¿` too often.
 # Evaluation
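
The README change above describes over-sampling Spanish questions from otherwise-unused data, but not how the amount was chosen. As a minimal sketch only, and not the author's actual pipeline, the snippet below shows one way to make the over-sampling rate an explicit, tunable target. The function name `oversample_questions`, the `target_fraction` parameter, and the rule that a sentence counts as a Spanish question when it starts with `¿` are all assumptions made for illustration.

```python
import math
import random


def oversample_questions(base, extra_pool, target_fraction, seed=0):
    """Hypothetical helper: append question sentences from extra_pool to base
    until questions make up about target_fraction of the result, or until
    the pool has no more questions to offer."""
    if not 0.0 <= target_fraction < 1.0:
        raise ValueError("target_fraction must be in [0, 1)")

    # Assumption: a sentence is a Spanish question if it starts with "¿".
    def is_question(sentence):
        return sentence.lstrip().startswith("¿")

    n_base = len(base)
    n_questions = sum(1 for s in base if is_question(s))

    # Only question sentences from the unused pool are candidates.
    candidates = [s for s in extra_pool if is_question(s)]
    random.Random(seed).shuffle(candidates)

    # Solve (n_questions + k) / (n_base + k) >= target_fraction for k.
    needed = math.ceil(
        (target_fraction * n_base - n_questions) / (1.0 - target_fraction)
    )
    k = max(0, min(needed, len(candidates)))
    return list(base) + candidates[:k]


if __name__ == "__main__":
    base = ["Hola."] * 9 + ["¿Qué tal?"]
    pool = ["¿Dónde está?"] * 20 + ["Adiós."] * 5
    mixed = oversample_questions(base, pool, target_fraction=0.2)
    print(len(mixed), sum(s.startswith("¿") for s in mixed))
```

Keeping the target fraction as a single parameter makes it straightforward to lower the rate and retrain if a model ends up predicting `¿` too often, which is the failure the README describes.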