oliverguhr committed
Commit 16fd186 · 1 Parent(s): 8eeec9d
updated readme
README.md CHANGED
@@ -8,40 +8,27 @@ datasets: sonar
|
|
8 |
license: mit
|
9 |
widget:
|
10 |
- text: "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
|
11 |
-
example_title: "
|
12 |
metrics:
|
13 |
- f1
|
14 |
---
|
15 |
|
16 |
-
|
17 |
|
18 |
-
|
19 |
|
20 |
-
|
21 |
|
22 |
-
|
23 |
-
```
|
24 |
-
precision recall f1-score support
|
25 |
-
|
26 |
-
, 0.754384 0.687349 0.719308 3127454
|
27 |
-
- 0.848480 0.628337 0.722000 331849
|
28 |
-
. 0.856989 0.851786 0.854380 4941897
|
29 |
-
0 0.982454 0.989201 0.985816 73926815
|
30 |
-
: 0.738974 0.657906 0.696088 590946
|
31 |
-
? 0.730301 0.643325 0.684060 410416
|
32 |
|
33 |
-
|
34 |
-
macro avg 0.818597 0.742984 0.776942 83329377
|
35 |
-
weighted avg 0.962951 0.964233 0.963427 83329377
|
36 |
-
|
37 |
-
```
|
38 |
-
|
39 |
-
Usage:
|
40 |
|
41 |
```bash
|
42 |
pip install deepmultilingualpunctuation
|
43 |
```
|
44 |
-
|
45 |
```python
|
46 |
from deepmultilingualpunctuation import PunctuationModel
|
47 |
|
@@ -50,3 +37,45 @@ text = "hervatting van de zitting ik verklaar de zitting van het europees parlem
|
|
50 |
result = model.restore_punctuation(text)
|
51 |
print(result)
|
52 |
```
|
|
8 |
license: mit
|
9 |
widget:
|
10 |
- text: "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
|
11 |
+
example_title: "Dutch Sample"
|
12 |
metrics:
|
13 |
- f1
|
14 |
---
|
15 |
|
16 |
+
This model predicts the punctuation of Dutch texts. We developed it to restore the punctuation of transcribed spoken language.
|
17 |
|
18 |
+
This model was trained on the [SoNaR Dataset](http://hdl.handle.net/10032/tm-a2-h5).
|
19 |
|
20 |
+
The model restores the following punctuation markers: **"." "," "?" "-" ":"**
|
21 |
+
## Sample Code
|
22 |
+
We provide a simple Python package that allows you to process text of any length.
|
23 |
|
24 |
+
## Install
|
25 |
|
26 |
+
To get started, install the package from [PyPI](https://pypi.org/project/deepmultilingualpunctuation/):
|
27 |
|
28 |
```bash
|
29 |
pip install deepmultilingualpunctuation
|
30 |
```
|
31 |
+
### Restore Punctuation
|
32 |
```python
|
33 |
from deepmultilingualpunctuation import PunctuationModel
|
34 |
|
|
|
37 |
result = model.restore_punctuation(text)
|
38 |
print(result)
|
39 |
```
|
40 |
+
|
41 |
+
**output**
|
42 |
+
> hervatting van de zitting. ik verklaar de zitting van het europees parlement, die op vrijdag 17 december werd onderbroken, te zijn hervat.
|
43 |
+
|
44 |
+
|
45 |
+
### Predict Labels
|
46 |
+
```python
|
47 |
+
from deepmultilingualpunctuation import PunctuationModel
|
48 |
+
|
49 |
+
model = PunctuationModel()
|
50 |
+
text = "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
|
51 |
+
clean_text = model.preprocess(text)
|
52 |
+
labeled_words = model.predict(clean_text)
|
53 |
+
print(labeled_words)
|
54 |
+
```
|
55 |
+
|
56 |
+
**output**
|
57 |
+
|
58 |
+
> [['hervatting', '0', 0.99998724], ['van', '0', 0.9999784], ['de', '0', 0.99991274], ['zitting', '.', 0.6771242], ['ik', '0', 0.9999466], ['verklaar', '0', 0.9998566], ['de', '0', 0.9999783], ['zitting', '0', 0.9999809], ['van', '0', 0.99996245], ['het', '0', 0.99997795], ['europees', '0', 0.9999783], ['parlement', ',', 0.9908242], ['die', '0', 0.999985], ['op', '0', 0.99998224], ['vrijdag', '0', 0.9999831], ['17', '0', 0.99997985], ['december', '0', 0.9999827], ['werd', '0', 0.999982], ['onderbroken', ',', 0.9951485], ['te', '0', 0.9999677], ['zijn', '0', 0.99997723], ['hervat', '.', 0.9957053]]
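
The label output can be post-processed without the package. The following is a minimal sketch, assuming each entry has the `[word, label, score]` shape shown in the sample output above and that the label `0` means "no punctuation". The helper names `join_labeled_words` and `low_confidence` and the `0.9` threshold are hypothetical and not part of the package.

```python
# First six entries copied from the sample output above.
# Each entry is [word, label, score]; the label "0" means "no punctuation".
labeled_words = [
    ["hervatting", "0", 0.99998724],
    ["van", "0", 0.9999784],
    ["de", "0", 0.99991274],
    ["zitting", ".", 0.6771242],
    ["ik", "0", 0.9999466],
    ["verklaar", "0", 0.9998566],
]


def join_labeled_words(words):
    """Append each predicted marker to its word and join the words with spaces."""
    return " ".join(
        word + (label if label != "0" else "")
        for word, label, _score in words
    )


def low_confidence(words, threshold=0.9):
    """Return predicted markers whose score is below the (hypothetical) threshold."""
    return [
        (word, label, score)
        for word, label, score in words
        if label != "0" and score < threshold
    ]


print(join_labeled_words(labeled_words))
print(low_confidence(labeled_words))
```

Listing low-scoring markers this way can help decide which predicted punctuation to review by hand, such as the full stop after the first "zitting" in the sample.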
|
59 |
+
|
60 |
+
|
61 |
+
|
62 |
+
|
63 |
+
## Results
|
64 |
+
|
65 |
+
Performance differs across punctuation markers because hyphens and colons are often optional and can be replaced by a comma or a full stop. The model achieves the following F1 scores:
|
66 |
+
|
67 |
+
| Label | F1 Score |
|
68 |
+
| ------------- | -------- |
|
69 |
+
| 0 | 0.985816 |
|
70 |
+
| . | 0.854380 |
|
71 |
+
| ? | 0.684060 |
|
72 |
+
| , | 0.719308 |
|
73 |
+
| : | 0.696088 |
|
74 |
+
| - | 0.722000 |
|
75 |
+
| macro average | 0.776942 |
|
76 |
+
| weighted average | 0.963427 |
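
As a cross-check, the scores in the table can be recomputed from the precision, recall, and support values of the classification report in the previous README revision. This is a sketch using the standard definitions (F1 as the harmonic mean of precision and recall, macro average as the unweighted mean of the per-label F1 scores, and a support-weighted mean for the last row). It is not the evaluation code used by the authors.

```python
# (precision, recall, support) per label, copied from the classification
# report in the previous README revision.
report = {
    ",": (0.754384, 0.687349, 3127454),
    "-": (0.848480, 0.628337, 331849),
    ".": (0.856989, 0.851786, 4941897),
    "0": (0.982454, 0.989201, 73926815),
    ":": (0.738974, 0.657906, 590946),
    "?": (0.730301, 0.643325, 410416),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Per-label F1 scores.
f1 = {label: f1_score(p, r) for label, (p, r, _support) in report.items()}

# Macro average: unweighted mean over the labels.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: mean of the per-label F1 scores weighted by support.
total_support = sum(support for _p, _r, support in report.values())
weighted_f1 = sum(f1[label] * report[label][2] for label in report) / total_support

for label, score in f1.items():
    print(f"{label!r}: {score:.6f}")
print(f"macro: {macro_f1:.6f}")
print(f"weighted: {weighted_f1:.6f}")
```

Because the `0` label accounts for most of the support, the support-weighted score sits far above the macro average.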
|
77 |
+
|
78 |
+
## References
|
79 |
+
|
80 |
+
TBD
|
81 |
+
|