---
library_name: transformers
license: cc-by-nc-nd-4.0
base_model: microsoft/mdeberta-v3-base
tags:
  - generated_from_trainer
  - pii
  - privacy
  - personaldata
  - redaction
  - piidetection
metrics:
  - precision
  - recall
  - f1
  - accuracy
model-index:
  - name: piiranha-1
    results: []
datasets:
  - ai4privacy/pii-masking-400k
language:
  - en
  - it
  - fr
  - de
  - nl
  - es
pipeline_tag: token-classification
---

# Piiranha-v1: Protect your personal information!

Piiranha (cc-by-nc-nd-4.0 license) is trained to detect 17 types of Personally Identifiable Information (PII) across six languages. It successfully catches 98.27% of PII tokens, with an overall classification accuracy of 99.44%. Piiranha is especially accurate at detecting passwords, emails (100%), phone numbers, and usernames.
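
For quick experimentation, the model can be loaded with the `transformers` token-classification pipeline. A minimal sketch, assuming the repo id shown here (substitute this model's actual Hugging Face path):

```python
from transformers import pipeline

# Repo id is an assumption -- replace with this model's actual Hugging Face path.
detector = pipeline(
    "token-classification",
    model="iiiorg/piiranha-v1-detect-personal-information",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entity spans
)

text = "Hi, I'm Jane Doe. Reach me at jane.doe@example.com or +1 555 0199."
for entity in detector(text):
    print(entity["entity_group"], repr(entity["word"]), round(entity["score"], 3))
```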

Performance on the binary PII vs. non-PII classification task:

- Precision: 98.48% (98.48% of tokens classified as PII are actually PII)
- Recall: 98.27% (correctly identifies 98.27% of PII tokens)
- Specificity: 99.84% (correctly identifies 99.84% of non-PII tokens)
*Piiranha was trained on H100 GPUs generously sponsored by the Akash Network.*

## Model Description

Piiranha is a fine-tuned version of microsoft/mdeberta-v3-base. The context length is 256 DeBERTa tokens; longer inputs should be split into chunks before inference, as in the sketch below.
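
A minimal chunking sketch, assuming the same repo id as above; `return_overflowing_tokens` splits long inputs into overlapping 256-token windows so entities near a window boundary still get some context:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL_ID = "iiiorg/piiranha-v1-detect-personal-information"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)

def classify_long_text(text, max_length=256, stride=32):
    # Split the input into overlapping windows of at most 256 tokens;
    # the stride carries shared context across window boundaries.
    enc = tokenizer(
        text,
        max_length=max_length,
        stride=stride,
        truncation=True,
        padding=True,
        return_overflowing_tokens=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(input_ids=enc["input_ids"],
                       attention_mask=enc["attention_mask"]).logits
    # One predicted label per token in each window.
    return [
        [model.config.id2label[i] for i in window]
        for window in logits.argmax(dim=-1).tolist()
    ]
```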

Supported languages: English, Spanish, French, German, Italian, Dutch

Supported PII types: Account Number, Building Number, City, Credit Card Number, Date of Birth, Driver's License, Email, First Name, Last Name, ID Card, Password, Social Security Number, Street Address, Tax Number, Phone Number, Username, Zipcode.

It achieves the following results on a test set of ~73,000 sentences containing PII:

- Accuracy: 99.44%
- Loss: 0.0173
- Precision: 93.16%
- Recall: 93.08%
- F1: 93.12%

Note that the above metrics are computed over all eighteen possible categories (17 PII types plus non-PII), so they are lower than the binary PII vs. non-PII metrics reported above.

## Performance by PII type

The per-entity metrics reported below are lower than the overall accuracy of 99.44% due to class imbalance (most tokens are not PII). In practice, though, the model is more useful than these numbers suggest: when it misclassifies one PII type as another, the token is still recognized as PII. For instance, the model sometimes confuses first names with last names, but the name is still flagged as PII, which is what matters for redaction, as the sketch below illustrates.
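
Because any PII-to-PII confusion still flags the token, a redaction pipeline can collapse the 17 fine-grained labels into a single binary decision. A minimal sketch, assuming the label set uses the usual `O` outside tag with `B-`/`I-` entity prefixes:

```python
def is_pii(label: str) -> bool:
    # Every label except the outside tag counts as PII, no matter
    # which of the 17 entity types the model picked.
    return label != "O"

# A token mislabeled I-GIVENNAME instead of I-SURNAME is still
# (correctly) treated as PII for redaction purposes.
assert is_pii("I-GIVENNAME") and is_pii("I-SURNAME") and not is_pii("O")
```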

| Entity           | Precision | Recall | F1-Score | Support |
|------------------|-----------|--------|----------|---------|
| ACCOUNTNUM       | 0.84      | 0.87   | 0.85     | 3575    |
| BUILDINGNUM      | 0.92      | 0.90   | 0.91     | 3252    |
| CITY             | 0.95      | 0.97   | 0.96     | 7270    |
| CREDITCARDNUMBER | 0.94      | 0.96   | 0.95     | 2308    |
| DATEOFBIRTH      | 0.93      | 0.85   | 0.89     | 3389    |
| DRIVERLICENSENUM | 0.96      | 0.96   | 0.96     | 2244    |
| EMAIL            | 1.00      | 1.00   | 1.00     | 6892    |
| GIVENNAME        | 0.87      | 0.93   | 0.90     | 12150   |
| IDCARDNUM        | 0.89      | 0.94   | 0.91     | 3700    |
| PASSWORD         | 0.98      | 0.98   | 0.98     | 2387    |
| SOCIALNUM        | 0.93      | 0.94   | 0.93     | 2709    |
| STREET           | 0.97      | 0.95   | 0.96     | 3331    |
| SURNAME          | 0.89      | 0.78   | 0.83     | 8267    |
| TAXNUM           | 0.97      | 0.89   | 0.93     | 2322    |
| TELEPHONENUM     | 0.99      | 1.00   | 0.99     | 5039    |
| USERNAME         | 0.98      | 0.98   | 0.98     | 7680    |
| ZIPCODE          | 0.94      | 0.97   | 0.95     | 3191    |
| micro avg        | 0.93      | 0.93   | 0.93     | 79706   |
| macro avg        | 0.94      | 0.93   | 0.93     | 79706   |
| weighted avg     | 0.93      | 0.93   | 0.93     | 79706   |
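
The table's layout (per-entity rows plus micro/macro/weighted averages with support counts) matches the report produced by the `seqeval` library; whether that library was used here is an assumption, but a small sketch with made-up tag sequences shows how to produce such a report on your own data:

```python
from seqeval.metrics import classification_report

# Toy gold and predicted tag sequences (one inner list per sentence).
y_true = [["O", "B-GIVENNAME", "B-SURNAME", "O", "B-EMAIL"]]
y_pred = [["O", "B-GIVENNAME", "B-GIVENNAME", "O", "B-EMAIL"]]
print(classification_report(y_true, y_pred))
```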

## Intended uses & limitations

Piiranha can be used to assist with redacting PII from text. Use it at your own risk; we accept no responsibility for incorrect model predictions.
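
A minimal redaction sketch built on the pipeline from the quick-start above; it splices a placeholder over each detected span using the character offsets the pipeline returns:

```python
def redact(text: str, detector) -> str:
    # Process spans right-to-left so earlier character offsets
    # remain valid while we splice in placeholders.
    entities = sorted(detector(text), key=lambda e: e["start"], reverse=True)
    for e in entities:
        text = text[: e["start"]] + f"[{e['entity_group']}]" + text[e["end"] :]
    return text

# redact("Call Jane at 555-0199.", detector)
# -> e.g. "Call [GIVENNAME] at [TELEPHONENUM]."
```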

## Training and evaluation data

Piiranha was trained and evaluated on the ai4privacy/pii-masking-400k dataset.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (see the `TrainingArguments` sketch after this list):

- learning_rate: 5e-05
- train_batch_size: 128
- eval_batch_size: 128
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 5
- mixed_precision_training: Native AMP
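
For reference, these settings map onto `transformers` `TrainingArguments` roughly as follows (the output path is a placeholder; the Adam betas and epsilon listed above are the optimizer defaults):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="piiranha-v1",        # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    seed=42,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,
    num_train_epochs=5,
    fp16=True,                       # Native AMP mixed precision
    # adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-8 are the defaults.
)
```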

### Training results

| Training Loss | Epoch  | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|---------------|--------|------|-----------------|-----------|--------|--------|----------|
| 0.2984        | 0.0983 | 250  | 0.1005          | 0.5446    | 0.6111 | 0.5759 | 0.9702   |
| 0.0568        | 0.1965 | 500  | 0.0464          | 0.7895    | 0.8459 | 0.8167 | 0.9849   |
| 0.0441        | 0.2948 | 750  | 0.0400          | 0.8346    | 0.8669 | 0.8504 | 0.9869   |
| 0.0368        | 0.3931 | 1000 | 0.0320          | 0.8531    | 0.8784 | 0.8656 | 0.9891   |
| 0.0323        | 0.4914 | 1250 | 0.0293          | 0.8779    | 0.8889 | 0.8834 | 0.9903   |
| 0.0287        | 0.5896 | 1500 | 0.0269          | 0.8919    | 0.8836 | 0.8877 | 0.9907   |
| 0.0282        | 0.6879 | 1750 | 0.0276          | 0.8724    | 0.9012 | 0.8866 | 0.9903   |
| 0.0268        | 0.7862 | 2000 | 0.0254          | 0.8890    | 0.9041 | 0.8965 | 0.9914   |
| 0.0264        | 0.8844 | 2250 | 0.0236          | 0.8886    | 0.9040 | 0.8962 | 0.9915   |
| 0.0243        | 0.9827 | 2500 | 0.0232          | 0.8998    | 0.9033 | 0.9015 | 0.9917   |
| 0.0213        | 1.0810 | 2750 | 0.0237          | 0.9115    | 0.9040 | 0.9077 | 0.9923   |
| 0.0213        | 1.1792 | 3000 | 0.0222          | 0.9123    | 0.9143 | 0.9133 | 0.9925   |
| 0.0217        | 1.2775 | 3250 | 0.0222          | 0.8999    | 0.9169 | 0.9083 | 0.9924   |
| 0.0209        | 1.3758 | 3500 | 0.0212          | 0.9111    | 0.9133 | 0.9122 | 0.9928   |
| 0.0204        | 1.4741 | 3750 | 0.0206          | 0.9054    | 0.9203 | 0.9128 | 0.9926   |
| 0.0183        | 1.5723 | 4000 | 0.0212          | 0.9126    | 0.9160 | 0.9143 | 0.9927   |
| 0.0191        | 1.6706 | 4250 | 0.0192          | 0.9122    | 0.9192 | 0.9157 | 0.9929   |
| 0.0185        | 1.7689 | 4500 | 0.0195          | 0.9200    | 0.9191 | 0.9196 | 0.9932   |
| 0.018         | 1.8671 | 4750 | 0.0188          | 0.9136    | 0.9215 | 0.9176 | 0.9933   |
| 0.0183        | 1.9654 | 5000 | 0.0191          | 0.9179    | 0.9212 | 0.9196 | 0.9934   |
| 0.0147        | 2.0637 | 5250 | 0.0188          | 0.9246    | 0.9242 | 0.9244 | 0.9937   |
| 0.0149        | 2.1619 | 5500 | 0.0184          | 0.9188    | 0.9254 | 0.9221 | 0.9937   |
| 0.0143        | 2.2602 | 5750 | 0.0193          | 0.9187    | 0.9224 | 0.9205 | 0.9932   |
| 0.014         | 2.3585 | 6000 | 0.0190          | 0.9246    | 0.9280 | 0.9263 | 0.9936   |
| 0.0146        | 2.4568 | 6250 | 0.0190          | 0.9225    | 0.9277 | 0.9251 | 0.9936   |
| 0.0148        | 2.5550 | 6500 | 0.0175          | 0.9297    | 0.9306 | 0.9301 | 0.9942   |
| 0.0136        | 2.6533 | 6750 | 0.0172          | 0.9191    | 0.9329 | 0.9259 | 0.9938   |
| 0.0137        | 2.7516 | 7000 | 0.0166          | 0.9299    | 0.9312 | 0.9306 | 0.9942   |
| 0.014         | 2.8498 | 7250 | 0.0167          | 0.9285    | 0.9313 | 0.9299 | 0.9942   |
| 0.0128        | 2.9481 | 7500 | 0.0166          | 0.9271    | 0.9326 | 0.9298 | 0.9943   |
| 0.0113        | 3.0464 | 7750 | 0.0171          | 0.9286    | 0.9347 | 0.9316 | 0.9946   |
| 0.0103        | 3.1447 | 8000 | 0.0172          | 0.9284    | 0.9383 | 0.9334 | 0.9945   |
| 0.0104        | 3.2429 | 8250 | 0.0169          | 0.9312    | 0.9406 | 0.9359 | 0.9947   |
| 0.0094        | 3.3412 | 8500 | 0.0166          | 0.9368    | 0.9359 | 0.9364 | 0.9948   |
| 0.01          | 3.4395 | 8750 | 0.0166          | 0.9289    | 0.9387 | 0.9337 | 0.9944   |
| 0.0099        | 3.5377 | 9000 | 0.0162          | 0.9335    | 0.9332 | 0.9334 | 0.9947   |
| 0.0099        | 3.6360 | 9250 | 0.0160          | 0.9321    | 0.9380 | 0.9350 | 0.9947   |
| 0.01          | 3.7343 | 9500 | 0.0168          | 0.9306    | 0.9389 | 0.9347 | 0.9947   |
| 0.0101        | 3.8325 | 9750 | 0.0159          | 0.9339    | 0.9350 | 0.9344 | 0.9947   |

## Contact

william (at) integrinet [dot] org

### Framework versions

- Transformers 4.44.2
- Pytorch 2.4.1+cu121
- Datasets 3.0.0
- Tokenizers 0.19.1