---
license: mit
language:
- en
pipeline_tag: text-generation
tags:
- LLM
- token classification
- nlp
- safetensor
base_model: microsoft/Phi-3-mini-4k-instruct
library_name: transformers

widget:
- text: "My name is Sylvain and I live in Paris"
  example_title: "Parisian"
- text: "My name is Sarah and I live in London"
  example_title: "Londoner"
---

# PII Detection Model - Phi3 Mini Fine-Tuned

This repository contains a fine-tuned version of [Phi3 Mini](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) for detecting personally identifiable information (PII). The model is trained to recognize a range of PII entities in text, which makes it useful for tasks such as data redaction, privacy protection, and compliance with data protection regulations.

## Model Overview

### Model Architecture

- **Base Model**: Phi3 Mini (`microsoft/Phi-3-mini-4k-instruct`)
- **Fine-Tuned For**: PII detection
- **Framework**: [Hugging Face Transformers](https://huggingface.co/transformers/)

### Detected PII Entities

The model can detect the following PII entities:

- **Personal Information**:
  - `firstname`
  - `middlename`
  - `lastname`
  - `sex`
  - `dob` (Date of Birth)
  - `age`
  - `gender`
  - `height`
  - `eyecolor`

- **Contact Information**:
  - `email`
  - `phonenumber`
  - `url`
  - `username`
  - `useragent`

- **Address Information**:
  - `street`
  - `city`
  - `state`
  - `county`
  - `zipcode`
  - `country`
  - `secondaryaddress`
  - `buildingnumber`
  - `ordinaldirection`

- **Geographical Information**:
  - `nearbygpscoordinate`

- **Organizational Information**:
  - `companyname`
  - `jobtitle`
  - `jobarea`
  - `jobtype`

- **Financial Information**:
  - `accountname`
  - `accountnumber`
  - `creditcardnumber`
  - `creditcardcvv`
  - `creditcardissuer`
  - `iban`
  - `bic`
  - `currency`
  - `currencyname`
  - `currencysymbol`
  - `currencycode`
  - `amount`

- **Unique Identifiers**:
  - `pin`
  - `ssn`
  - `imei` (Phone IMEI)
  - `mac` (MAC Address)
  - `vehiclevin` (Vehicle VIN)
  - `vehiclevrm` (Vehicle VRM)

- **Cryptocurrency Information**:
  - `bitcoinaddress`
  - `litecoinaddress`
  - `ethereumaddress`

- **Other Information**:
  - `ip` (IP Address)
  - `ipv4`
  - `ipv6`
  - `maskednumber`
  - `password`
  - `time`
  - `prefix`

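The entity labels are passed to the model as a comma-separated list inside an instruction prompt (the full prompt appears in the Usage section). As a minimal sketch, the helper below builds a prompt in that layout for a chosen set of labels. `PROMPT_TEMPLATE` and `build_pii_prompt` are hypothetical names that are not part of this repository, and it is an assumption that the model responds well when given fewer labels than the full list. The Usage prompt spells a few labels differently from the list above (for example `phoneimei`), so the helper passes labels through verbatim instead of validating them.

```python
from typing import Iterable

# Hypothetical template that mirrors the layout of the prompt in the Usage section
PROMPT_TEMPLATE = (
    "### Instruction:\n"
    "Identify and extract the following PII entities from the text, "
    "if present: {labels}. Return the output in JSON format.\n\n"
    "### Input:\n"
    "{text}\n\n"
    "### Output: "
)


def build_pii_prompt(text: str, labels: Iterable[str]) -> str:
    """Build an instruction prompt asking for the given entity labels."""
    # Strip whitespace, skip empty labels, and drop duplicates while keeping order
    cleaned = (label.strip() for label in labels)
    unique_labels = list(dict.fromkeys(label for label in cleaned if label))
    if not unique_labels:
        raise ValueError("at least one entity label is required")
    return PROMPT_TEMPLATE.format(labels=", ".join(unique_labels), text=text)


prompt = build_pii_prompt(
    "My name is Sarah and I live in London",
    ["firstname", "city", "firstname"],
)
print(prompt)
```

The resulting string can be tokenized and passed to the model in place of a hand-written prompt.
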
## Usage

### Installation

To use this model, install the `transformers` library together with PyTorch, which the inference example needs for `return_tensors="pt"`:

```bash
pip install transformers torch
```

### Run Inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use a GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and model. The example calls generate(), so the
# checkpoint is loaded as a causal language model.
tokenizer = AutoTokenizer.from_pretrained("ab-ai/PII-Model-Phi3-Mini")
model = AutoModelForCausalLM.from_pretrained("ab-ai/PII-Model-Phi3-Mini").to(device)

input_text = "Hi Abner, just a reminder that your next primary care appointment is on 23/03/1926. Please confirm by replying to this email [email protected]."

# Instruction-style prompt listing the entity labels to extract
model_prompt = f"""### Instruction:
Identify and extract the following PII entities from the text, if present: companyname, pin, currencyname, email, phoneimei, litecoinaddress, currency, eyecolor, street, mac, state, time, vehiclevin, jobarea, date, bic, currencysymbol, currencycode, age, nearbygpscoordinate, amount, ssn, ethereumaddress, zipcode, buildingnumber, dob, firstname, middlename, ordinaldirection, jobtitle, bitcoinaddress, jobtype, phonenumber, height, password, ip, useragent, accountname, city, gender, secondaryaddress, iban, sex, prefix, ipv4, maskednumber, url, username, lastname, creditcardcvv, county, vehiclevrm, ipv6, creditcardissuer, accountnumber, creditcardnumber. Return the output in JSON format.

### Input:
{input_text}

### Output: """

inputs = tokenizer(model_prompt, return_tensors="pt").to(device)

# Adjust max_new_tokens to fit the amount of PII you expect in the output
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=120)

# The decoded string contains the prompt followed by the generated output
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
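
Because the decoded response contains the prompt followed by the completion, downstream code usually has to isolate the text after `### Output:` and parse it. The sketch below does this with the standard library and then masks the detected values for redaction. It assumes the completion is a flat JSON object that maps each label to a string or a list of strings. This model card does not specify the output schema, so inspect a real response before relying on it. `extract_entities` and `redact` are hypothetical helper names, and `sample_response` is hand-written for illustration, not real model output.

```python
import json


def extract_entities(response: str) -> dict:
    """Parse the JSON object that follows the last '### Output:' marker."""
    completion = response.rsplit("### Output:", 1)[-1]
    start = completion.find("{")
    if start == -1:
        return {}
    try:
        # raw_decode stops at the end of the first JSON value, so any text
        # generated after the object is ignored
        parsed, _ = json.JSONDecoder().raw_decode(completion, start)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def redact(text: str, entities: dict) -> str:
    """Replace every detected value in text with a [LABEL] placeholder."""
    pairs = []
    for label, value in entities.items():
        values = value if isinstance(value, list) else [value]
        pairs.extend((str(v), label) for v in values if str(v).strip())
    # Replace longer values first so a short value cannot split a longer one
    for value, label in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(value, f"[{label.upper()}]")
    return text


# Hand-written stand-in for a decoded model response (assumed schema)
source_text = "Hi Abner, your appointment is on 23/03/1926."
sample_response = (
    f"### Input:\n{source_text}\n\n"
    '### Output: {"firstname": "Abner", "dob": "23/03/1926"}'
)

entities = extract_entities(sample_response)
print(redact(source_text, entities))
# Hi [FIRSTNAME], your appointment is on [DOB].
```

Returning an empty dictionary on malformed output keeps the pipeline running when sampling produces invalid JSON; a stricter pipeline might retry generation instead.
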