1-800-BAD-CODE committed
Commit affeb69 • 1 Parent(s): d44eb1a

Update README.md

README.md CHANGED
@@ -289,21 +289,11 @@ Languages were chosen based on whether the News Crawl corpus contained enough re
 
 # Limitations
 
-## Sentence Boundaries / Fullstops
-Fullstop (sentence boundary) detection is near-perfect with news data, but misses obvious sentence boundaries
-when several short sentences appear contiguously.
-
-With News crawl, SBD F1 is > 99.5%. With OpenSubtitles, SBD F1 drops unacceptably to < 90%.
-
-When I figure out why this is, I'll fine-tune the SBD head. It's likely due to pre-processing and domain mis-match.
-
 ## Domain
-This model was trained on news data, and may not perform well on conversational or informal data.
-when presented with many short sentences, the model misses obvious sentence boundaries since the model was
-trained on relatively-long sentences.
+This model was trained on news data, and may not perform well on conversational or informal data.
 
 ## Quality
-
+This model is unlikely to be of production quality.
 It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.
 
 ## Excessive Predictions