Commit b562e57 by gbyuvd
1 parent: a33d3d9

Update README.md

Files changed (1)
  1. README.md +4 -2
README.md CHANGED
@@ -41,7 +41,7 @@ tags:
 
  # ChemFIE-DTP (DrugTargetPrediction - 221 Classes)
 
- This model is a BERT-like sequence classification model for 221 human protein drug targets, based on [gbyuvd/chemselfies-base-bertmlm](https://huggingface.co/gbyuvd/chemselfies-base-bertmlm) fine-tuned on a dataset derived from ChemBL34 (Zdrazil et al. 2023). It predicts potential drug targets using chemical structures represented as SELFIES (Self-Referencing Embedded Strings). The model was trained on a selected and balanced dataset of around 154k examples covering 221 distinct human protein targets. Data selection criteria included specific activity types (IC50, Ki, EC50) with values ≤ 10 µM, assay confidence scores ≥ 7, and exact activity relations. Among all drug target classes found in ChemBL34, classes with at least 1000 examples are selected then capped at 1000 for those with more samples. Building upon the pre-trained base model's pre-existing knowledge of SELFIES, this model is originally intended to validate the capabilities of the light-weight base model to be fine-tuned for various tasks, and for this model case, it might be useful for tasks related to early-stage drug discovery and target prediction (e.g. compounds annotations) - though its performance and applicability should be carefully evaluated for specific use cases (see [Evaluation](#evaluation))
+ This model is a BERT-like sequence classifier for 221 human protein drug targets, fine-tuned from [gbyuvd/chemselfies-base-bertmlm](https://huggingface.co/gbyuvd/chemselfies-base-bertmlm) on a dataset derived from ChemBL34 (Zdrazil et al. 2023). It predicts potential drug targets using chemical structures represented as SELFIES (Self-Referencing Embedded Strings). The model was trained on a selected and balanced dataset of around 154k examples covering 221 distinct human protein targets. Data selection criteria included specific activity types (IC50, Ki, EC50) with values ≤ 10 µM, assay confidence scores ≥ 7, and exact activity relations. Among all drug target classes found in ChemBL34, classes with at least 1000 examples are selected then capped at 1000 for those with more samples. Building upon the pre-trained base model's pre-existing knowledge of SELFIES, this model is originally intended to validate the capabilities of the light-weight base model to be fine-tuned for various tasks, and for this model case, it might be useful for tasks related to early-stage drug discovery and target prediction (e.g. compounds annotations) - though its performance and applicability should be carefully evaluated for specific use cases (see [Evaluation](#evaluation))
 
  - List of classes available in the "label_dict.json"
  - Its performance on each classes available in "test_result.txt"
@@ -220,7 +220,7 @@ Bioactive compounds from [ChemBL34](https://ftp.ebi.ac.uk/pub/databases/chembl/C
  Dataset Details:
  - Total training examples: 154,700
  - Number of classes: 221 distinct human protein drug targets
- - Organism: Homo sapiens
+ - Organism: _Homo sapiens_
  - Number of train examples for each class: 700
  - Number of validation examples for each class: 100
  - Number of held out test examples for each class: 200
@@ -303,6 +303,8 @@ Both macro (unweighted mean of all classes) and weighted (weighted by class supp
  ### Results
 
  #### General
+ - Baseline with random guess (1/221): 0.0045
+
  - Accuracy: 0.6199
  - Macro F1: 0.6127
  - Weighted F1: 0.6127
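
The selection rules and dataset arithmetic quoted in this diff can be sanity-checked with a short sketch. This is a minimal illustration, not the author's pipeline: the record keys (`target`, `type`, `relation`, `value_um`, `confidence`) are hypothetical, "exact activity relations" is assumed to mean the relation symbol `=`, and random sampling is assumed as the way classes are capped, since the README does not say how the cap was applied.

```python
import random
from collections import defaultdict

# Values quoted in the README diff.
N_CLASSES = 221
TRAIN_PER_CLASS, VAL_PER_CLASS, TEST_PER_CLASS = 700, 100, 200
ACTIVITY_TYPES = {"IC50", "Ki", "EC50"}


def passes_filters(rec):
    """Apply the stated selection criteria to one activity record.

    The keys used here are hypothetical, and "exact activity relation"
    is assumed to mean the relation symbol "=".
    """
    return (
        rec["type"] in ACTIVITY_TYPES
        and rec["relation"] == "="
        and rec["value_um"] <= 10.0   # activity value of at most 10 µM
        and rec["confidence"] >= 7    # assay confidence score of at least 7
    )


def balance(records, min_per_class=1000, cap=1000, seed=0):
    """Keep targets with at least min_per_class records, capped at cap each.

    Random sampling for the cap is an assumption made for this sketch.
    """
    by_target = defaultdict(list)
    for rec in records:
        if passes_filters(rec):
            by_target[rec["target"]].append(rec)
    rng = random.Random(seed)
    return {
        target: rng.sample(recs, min(cap, len(recs)))
        for target, recs in by_target.items()
        if len(recs) >= min_per_class
    }


# The per-class splits add up to the per-class cap of 1000.
assert TRAIN_PER_CLASS + VAL_PER_CLASS + TEST_PER_CLASS == 1000

total_train = N_CLASSES * TRAIN_PER_CLASS  # 221 * 700, the "154,700" in the README
random_baseline = 1 / N_CLASSES            # uniform random guess over 221 classes

if __name__ == "__main__":
    print(total_train, round(random_baseline, 4))
```

Because every class has the same test support (200 examples), the support-weighted mean of per-class F1 reduces to the unweighted mean, which is consistent with the identical Macro F1 and Weighted F1 values reported above.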