1-800-BAD-CODE committed
Commit 932cc97
Parent(s): e270816
Update README.md

README.md CHANGED
@@ -291,21 +291,16 @@ This model was trained on an A100 for approximately 9 hours.
 
 ## Training Data
 This model was trained with News Crawl data from WMT.
-
 1M lines of text for each language was used, except for a few low-resource languages which may have used less.
-
 Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data as judged by the author.
 
 # Limitations
 
-## Domain
 This model was trained on news data, and may not perform well on conversational or informal data.
 
-## Quality
 This model is unlikely to be of production quality.
 It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.
 
-## Excessive Predictions
 This model over-predicts Spanish question marks, especially the inverted question mark `¿` (see metrics below).
 Since `¿` is a rare token, especially in the context of a 47-language model, Spanish questions were over-sampled
 by selecting more of these sentences from additional training data that was not used. However, this seems to have