vibhorag101/roberta-base-suicide-prediction-phr

Text Classification

Generated from Trainer

Inference Endpoints


vibhorag101 committed on Nov 25, 2023

Commit 37c5e21 • 1 Parent(s): 5b69f6b

Update README.md

Files changed (1)

README.md +14 -8


@@ -34,9 +34,6 @@ language:
 library_name: transformers
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
 # roberta-base-suicide-prediction-phr
 This model is a fine-tuned version of [roberta-base](https://huggingface.co/roberta-base) on a Suicide Prediction dataset sourced from Reddit.
@@ -48,15 +45,24 @@ It achieves the following results on the evaluation/validation set:
 - F1: {'f1': 0.9651921995935487}
 ## Model description
-More information needed
 ## Training and evaluation data
-The dataset is sourced from Reddit and is available on [Kaggle](https://www.kaggle.com/datasets/nikhileswarkomati/suicide-watch).
-The dataset contains text with binary labels for suicide or non-suicide.
-The evaluation set had ~23000 samples, while the training set had ~186k samples, i.e. 80:10:10 (train:test:val) split.
 ## Training procedure
 ### Training hyperparameters

 library_name: transformers
 ---
 # roberta-base-suicide-prediction-phr
 This model is a fine-tuned version of [roberta-base](https://huggingface.co/roberta-base) on a Suicide Prediction dataset sourced from Reddit.
 - F1: {'f1': 0.9651921995935487}
 ## Model description
+This model is a fine-tuned version of roberta-base that detects suicidal tendencies in a given text.
 ## Training and evaluation data
+- The dataset is sourced from Reddit and is available on [Kaggle](https://www.kaggle.com/datasets/nikhileswarkomati/suicide-watch).
+- The dataset was cleaned by applying the following steps:
+  - Converted to lowercase.
+  - Removed numbers and special characters.
+  - Removed URLs, emojis and accented characters.
+  - Removed any word contractions.
+  - Removed extra whitespace, including repeated spaces.
+  - Removed any consecutive characters repeated more than 3 times.
+  - Tokenised the text, lemmatised it, and then removed the stopwords (keeping "not").
+- The cleaned dataset can be found [here](https://huggingface.co/datasets/vibhorag101/suicide_prediction_dataset_phr)
+- The dataset contains text with binary labels for suicide or non-suicide.
+- The evaluation set had ~23k samples, while the training set had ~186k samples, i.e. an 80:10:10 (train:test:val) split.
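
The cleaning steps listed above could be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the author's actual script: the `clean_text` function, the tiny `CONTRACTIONS` and `STOPWORDS` tables, and the `lemmatize` hook are all hypothetical. It also assumes that contractions are expanded, that accents and emojis are dropped by ASCII folding, and that runs of a repeated letter are truncated to three.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny tables; the model card does not list the real ones.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",
    "'m": " am",
    "'re": " are",
    "'ve": " have",
    "'ll": " will",
}
# "not" is intentionally absent so that negation survives stopword removal.
STOPWORDS = {"a", "an", "the", "i", "am", "is", "are", "to", "of", "and", "it", "so", "this", "at"}


def clean_text(text, lemmatize=lambda token: token):
    """Apply the listed cleaning steps; `lemmatize` is a pluggable hook (identity by default)."""
    text = text.lower()
    # Drop URLs before punctuation is stripped, otherwise fragments of them remain.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # ASCII-fold: accented letters lose their accents, emojis are dropped (assumption).
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Expand contractions (assumed reading of "removed any word contractions").
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Replace digits and special characters with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Truncate runs of the same letter longer than 3 down to 3 (assumption).
    text = re.sub(r"([a-z])\1{3,}", r"\1\1\1", text)
    # split() tokenises on whitespace and discards the extra spaces in one step.
    tokens = [lemmatize(token) for token in text.split()]
    return " ".join(token for token in tokens if token not in STOPWORDS)


if __name__ == "__main__":
    print(clean_text("I'm sooooo tired, can't sleep!!! Visit https://example.com café 123"))
    # prints: sooo tired cannot sleep visit cafe
```

In practice the `lemmatize` hook would be backed by an NLP library's lemmatizer and a full stopword list; both are left out here so the sketch runs without extra dependencies.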
 ## Training procedure
+- The model was trained on an NVIDIA RTX A5000 GPU.
 ### Training hyperparameters