1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase

Text2Text Generation

sentence-boundary-detection

1-800-BAD-CODE committed on May 12, 2023

Commit cad4273 • 1 Parent(s): b629ea4

Update README.md

Files changed (1)

README.md +7 -7

@@ -198,18 +198,18 @@ In these metrics, keep in mind that
 ## Test Data and Example Generation
 Each test example was generated using the following procedure:
-1. Concatenate 10 random sentences
 2. Lower-case the concatenated sentence
 3. Remove all punctuation
 The data is a held-out portion of News Crawl, which has been deduplicated.
-3,000 lines of data per language was used, generating 3,000 unique examples of 10 sentences each.
-The last 4 sentences of each example were randomly sampled from the 3,000 and may be duplicated.
-Examples longer than the model's maximum length were truncated.
-The number of affected sentences can be estimated from the "full stop" support: with 3,000
-sentences and 10 sentences per example, we expect 30,000 full stop targets total.
 ## Selected Language Evaluation Reports

 ## Test Data and Example Generation
 Each test example was generated using the following procedure:
+1. Concatenate 11 random sentences (1 + 10 for each sentence in the test set)
 2. Lower-case the concatenated sentence
 3. Remove all punctuation
 The data is a held-out portion of News Crawl, which has been deduplicated.
+3,000 lines of data per language were used, generating 3,000 unique examples of 11 sentences each.
+We generate 3,000 examples, where example `i` begins with sentence `i` and is followed by 10 random
+sentences selected from the 3,000 sentence test set.
 ## Selected Language Evaluation Reports
+For now, metrics for a few selected languages are shown below.
+Given the amount of work required to collect pretty metrics in 47 languages, I'll add more eventually.
+Expand any of the following tabs to see metrics for that language.
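
The updated example-generation procedure in this diff (example `i` starts with sentence `i`, followed by 10 random sentences, then lower-casing and punctuation removal) could be sketched roughly as below. This is an illustrative reconstruction, not the author's actual script: the function name `make_examples`, sampling with replacement, joining sentences with a single space, and treating every Unicode category `P*` character as punctuation are all assumptions.

```python
import random
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop every character whose Unicode category starts with 'P' (assumed definition of punctuation)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def make_examples(sentences, num_extra=10, seed=0):
    """Build one example per test sentence (hypothetical helper, not from the repository).

    Example i begins with sentence i and is followed by `num_extra` sentences
    drawn at random (with replacement, an assumption) from the same test set.
    Returns a list of (model_input, target_sentences) pairs.
    """
    rng = random.Random(seed)
    examples = []
    for first in sentences:
        extra = [rng.choice(sentences) for _ in range(num_extra)]
        targets = [first] + extra
        # Step 1: concatenate (space-joined here; unsuitable for languages written without spaces)
        text = " ".join(targets)
        # Step 2: lower-case the concatenated text
        text = text.lower()
        # Step 3: remove all punctuation, then collapse any leftover double spaces
        text = " ".join(strip_punctuation(text).split())
        examples.append((text, targets))
    return examples


if __name__ == "__main__":
    toy = ["Hello, world.", "How are you?", "Fine, thanks!"]
    for model_input, targets in make_examples(toy, num_extra=2):
        print(model_input, "<-", targets)
```

With a 3,000-sentence test set and `num_extra=10`, this yields 3,000 examples of 11 sentences each, matching the counts stated in the updated README text.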