oliverguhr/fullstop-dutch-sonar-punctuation-prediction

@@ -8,40 +8,27 @@ datasets: sonar
 license: mit
 widget:
 - text: "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
-  example_title: "EuroParl Sample"
 metrics:
 - f1
 ---
-## Model
-Trained on Sonar corpus
-## Performance
-Evaluated on dutch SoNaR data set
-```
-              precision    recall  f1-score   support
-           ,   0.754384  0.687349  0.719308   3127454
-           -   0.848480  0.628337  0.722000    331849
-           .   0.856989  0.851786  0.854380   4941897
-           0   0.982454  0.989201  0.985816  73926815
-           :   0.738974  0.657906  0.696088    590946
-           ?   0.730301  0.643325  0.684060    410416
-    accuracy                       0.964233  83329377
-   macro avg   0.818597  0.742984  0.776942  83329377
-weighted avg   0.962951  0.964233  0.963427  83329377
-```
-Usage:
 ```bash
 pip install deepmultilingualpunctuation
 ```
 ```python
 from deepmultilingualpunctuation import PunctuationModel
@@ -50,3 +37,45 @@ text = "hervatting van de zitting ik verklaar de zitting van het europees parlem
 result = model.restore_punctuation(text)
 print(result)
 ```

 license: mit
 widget:
 - text: "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
+  example_title: "Dutch Sample"
 metrics:
 - f1
 ---
+This model predicts the punctuation of Dutch texts. We developed it to restore the punctuation of transcribed spoken language.
+This multilanguage model was trained on the [SoNaR Dataset](http://hdl.handle.net/10032/tm-a2-h5).
+The model restores the following punctuation markers: **"." "," "?" "-" ":"**
+## Sample Code
+We provide a simple Python package that allows you to process text of any length.
+## Install
+To get started, install the package from [PyPI](https://pypi.org/project/deepmultilingualpunctuation/):
 ```bash
 pip install deepmultilingualpunctuation
 ```
+### Restore Punctuation
 ```python
 from deepmultilingualpunctuation import PunctuationModel
 result = model.restore_punctuation(text)
 print(result)
 ```
+**output**
+> hervatting van de zitting. ik verklaar de zitting van het europees parlement, die op vrijdag 17 december werd onderbroken, te zijn hervat.
+### Predict Labels
+```python
+from deepmultilingualpunctuation import PunctuationModel
+model = PunctuationModel()
+text = "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
+clean_text = model.preprocess(text)
+labeled_words = model.predict(clean_text)
+print(labeled_words)
+```
+**output**
+> [['hervatting', '0', 0.99998724], ['van', '0', 0.9999784], ['de', '0', 0.99991274], ['zitting', '.', 0.6771242], ['ik', '0', 0.9999466], ['verklaar', '0', 0.9998566], ['de', '0', 0.9999783], ['zitting', '0', 0.9999809], ['van', '0', 0.99996245], ['het', '0', 0.99997795], ['europees', '0', 0.9999783], ['parlement', ',', 0.9908242], ['die', '0', 0.999985], ['op', '0', 0.99998224], ['vrijdag', '0', 0.9999831], ['17', '0', 0.99997985], ['december', '0', 0.9999827], ['werd', '0', 0.999982], ['onderbroken', ',', 0.9951485], ['te', '0', 0.9999677], ['zijn', '0', 0.99997723], ['hervat', '.', 0.9957053]]
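The label list above can be turned back into punctuated text with a few lines of plain Python. The following is a minimal sketch, not the package's own implementation: `join_labeled_words` and its `min_score` parameter are hypothetical names introduced here, and it assumes each entry has the form `[word, label, score]` with the label `0` meaning "no punctuation", as in the output above.

```python
def join_labeled_words(labeled_words, min_score=0.0):
    """Rebuild text from [word, label, score] entries.

    Assumption: label "0" means no punctuation follows the word.
    Labels whose score is below min_score are ignored (hypothetical option).
    """
    parts = []
    for word, label, score in labeled_words:
        if label != "0" and score >= min_score:
            parts.append(word + label)  # attach the predicted marker to the word
        else:
            parts.append(word)
    return " ".join(parts)


# First entries of the prediction shown above.
sample = [
    ["hervatting", "0", 0.99998724],
    ["van", "0", 0.9999784],
    ["de", "0", 0.99991274],
    ["zitting", ".", 0.6771242],
    ["ik", "0", 0.9999466],
    ["verklaar", "0", 0.9998566],
]

print(join_labeled_words(sample))
print(join_labeled_words(sample, min_score=0.9))
```

With `min_score=0.9`, the full stop after the first "zitting" is dropped, because its score (about 0.68) falls below the threshold. Filtering of this kind is one possible way to trade recall for precision.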
+## Results
+Performance varies across punctuation markers because hyphens and colons are, in many cases, optional and can be replaced by either a comma or a full stop. The model achieves the following F1 scores:
+| Label            | F1 Score |
+| ---------------- | -------- |
+| 0                | 0.985816 |
+| .                | 0.854380 |
+| ?                | 0.684060 |
+| ,                | 0.719308 |
+| :                | 0.696088 |
+| -                | 0.722000 |
+| macro average    | 0.776942 |
+| weighted average | 0.963427 |
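The two aggregate rows can be cross-checked against the per-label scores. The sketch below recomputes an unweighted (macro) mean and a support-weighted mean. The F1 values come from the table, and the support counts come from the classification report in the removed section of this diff. The function names are illustrative only.

```python
# Per-label (F1, support) pairs copied from the table and classification report.
report = {
    "0": (0.985816, 73926815),
    ".": (0.854380, 4941897),
    "?": (0.684060, 410416),
    ",": (0.719308, 3127454),
    ":": (0.696088, 590946),
    "-": (0.722000, 331849),
}


def macro_f1(report):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1 for f1, _ in report.values()) / len(report)


def weighted_f1(report):
    """Mean of the per-label F1 scores, weighted by label support."""
    total = sum(support for _, support in report.values())
    return sum(f1 * support for f1, support in report.values()) / total


print(f"macro average:    {macro_f1(report):.6f}")
print(f"weighted average: {weighted_f1(report):.6f}")
```

The macro mean should come out close to 0.776942 and the support-weighted mean close to 0.963427. Because the `0` label accounts for most of the tokens, the weighted figure is dominated by it, and the macro average gives a better picture of how well the actual punctuation markers are predicted.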
+## References
+TBD