1-800-BAD-CODE committed
Commit 932cc97
Parent(s): e270816
Update README.md

README.md CHANGED
@@ -291,21 +291,16 @@ This model was trained on an A100 for approximately 9 hours.
 
 ## Training Data
 This model was trained with News Crawl data from WMT.
-
 1M lines of text for each language was used, except for a few low-resource languages which may have used less.
-
 Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data as judged by the author.
 
 # Limitations
 
-## Domain
 This model was trained on news data, and may not perform well on conversational or informal data.
 
-## Quality
 This model is unlikely to be of production quality.
 It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.
 
-## Excessive Predictions
 This model over-predicts Spanish question marks, especially the inverted question mark `¿` (see metrics below).
 Since `¿` is a rare token, especially in the context of a 47-language model, Spanish questions were over-sampled
 by selecting more of these sentences from additional training data that was not used. However, this seems to have