metadata

license: mit
language:
  - en
pipeline_tag: text-generation
tags:
  - LLM
  - token classification
  - nlp
  - safetensor
  - PyTorch
base_model: microsoft/Phi-3-mini-4k-instruct
library_name: transformers
widget:
  - text: My name is Sylvain and I live in Paris
    example_title: Parisian
  - text: My name is Sarah and I live in London
    example_title: Londoner

PII Detection Model - Phi3 Mini Fine-Tuned

This repository contains a version of Phi-3 Mini (microsoft/Phi-3-mini-4k-instruct) fine-tuned to detect personally identifiable information (PII). Given a piece of text and a list of entity types, the model returns the PII it finds as JSON-style output, which makes it useful for data redaction, privacy protection, and compliance with data protection regulations.

Model Overview

Model Architecture

Base Model: Phi-3 Mini (microsoft/Phi-3-mini-4k-instruct)
Fine-Tuned For: PII detection
Framework: Hugging Face Transformers

Detected PII Entities

The model can detect the following PII entities:

Personal Information:
- firstname
- middlename
- lastname
- sex
- dob (Date of Birth)
- age
- gender
- height
- eyecolor
Contact Information:
- email
- phonenumber
- url
- username
- useragent
Address Information:
- street
- city
- state
- county
- zipcode
- country
- secondaryaddress
- buildingnumber
- ordinaldirection
Geographical Information:
- nearbygpscoordinate
Organizational Information:
- companyname
- jobtitle
- jobarea
- jobtype
Financial Information:
- accountname
- accountnumber
- creditcardnumber
- creditcardcvv
- creditcardissuer
- iban
- bic
- currency
- currencyname
- currencysymbol
- currencycode
- amount
Unique Identifiers:
- pin
- ssn
- phoneimei (Phone IMEI)
- mac (MAC Address)
- vehiclevin (Vehicle VIN)
- vehiclevrm (Vehicle VRM)
Cryptocurrency Information:
- bitcoinaddress
- litecoinaddress
- ethereumaddress
Other Information:
- ip (IP Address)
- ipv4
- ipv6
- maskednumber
- password
- time
- date
- prefix

Prompt Format

### Instruction:
  Identify and extract the following PII entities from the text, if present: companyname, pin, currencyname, email, phoneimei, litecoinaddress, currency, eyecolor, street, mac, state, time, vehiclevin, jobarea, date, bic, currencysymbol, currencycode, age, nearbygpscoordinate, amount, ssn, ethereumaddress, zipcode, buildingnumber, dob, firstname, middlename, ordinaldirection, jobtitle, bitcoinaddress, jobtype, phonenumber, height, password, ip, useragent, accountname, city, gender, secondaryaddress, iban, sex, prefix, ipv4, maskednumber, url, username, lastname, creditcardcvv, county, vehiclevrm, ipv6, creditcardissuer, accountnumber, creditcardnumber. Return the output in JSON format.

### Input:
  Greetings, Mason! Let's celebrate another year of wellness on 14/01/1977. Don't miss the event at 176,Apt. 388.

### Output:
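
The template above can also be assembled programmatically. The following is a minimal sketch rather than the author's code: `build_prompt` and `DEFAULT_ENTITIES` are hypothetical names, the entity list is shortened for readability (in practice, pass the full list from the template), and the exact whitespace the model saw during fine-tuning is an assumption, so compare the result against the template before relying on it.

```python
# Hypothetical helper that fills in the instruction template shown above.
# The entity list is shortened for illustration; use the full list in practice.
DEFAULT_ENTITIES = ["firstname", "lastname", "email", "dob"]


def build_prompt(input_text, entities=DEFAULT_ENTITIES):
    """Return the prompt string for a single input text."""
    entity_list = ", ".join(entities)
    return (
        "### Instruction:\n"
        "  Identify and extract the following PII entities from the text, "
        f"if present: {entity_list}. Return the output in JSON format.\n\n"
        "### Input:\n"
        f"  {input_text}\n\n"
        "### Output: "  # assumption: generation continues right after this marker
    )


prompt = build_prompt("Greetings, Mason!")
print(prompt)
```

Keeping the template in one function means the instruction wording stays identical across calls, which matters because a fine-tuned model is usually sensitive to deviations from its training prompt.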

Usage

Installation

To use this model, install the transformers library together with PyTorch, which the inference example below relies on:

pip install transformers torch

Run Inference

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use a GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and model. generate() requires a causal LM head,
# so the model is loaded with AutoModelForCausalLM.
tokenizer = AutoTokenizer.from_pretrained("ab-ai/PII-Model-Phi3-Mini")
model = AutoModelForCausalLM.from_pretrained("ab-ai/PII-Model-Phi3-Mini").to(device)

input_text = "Hi Abner, just a reminder that your next primary care appointment is on 23/03/1926. Please confirm by replying to this email [email protected]."

model_prompt = f"""### Instruction:
    Identify and extract the following PII entities from the text, if present: companyname, pin, currencyname, email, phoneimei, litecoinaddress, currency, eyecolor, street, mac, state, time, vehiclevin, jobarea, date, bic, currencysymbol, currencycode, age, nearbygpscoordinate, amount, ssn, ethereumaddress, zipcode, buildingnumber, dob, firstname, middlename, ordinaldirection, jobtitle, bitcoinaddress, jobtype, phonenumber, height, password, ip, useragent, accountname, city, gender, secondaryaddress, iban, sex, prefix, ipv4, maskednumber, url, username, lastname, creditcardcvv, county, vehiclevrm, ipv6, creditcardissuer, accountnumber, creditcardnumber. Return the output in JSON format.

    ### Input:
    {input_text}

    ### Output: """

inputs = tokenizer(model_prompt, return_tensors="pt").to(device)
# Adjust max_new_tokens according to your needs
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=120)
# Decode only the newly generated tokens so the prompt is not echoed back
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(response)  # e.g. {'middlename': ['Abner'], 'dob': ['23/03/1926'], 'email': ['[email protected]']}
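
The generated reply is plain text, so it has to be parsed before it can drive redaction. The sketch below is one possible approach under stated assumptions, not part of the model's API: `parse_entities` and `redact` are hypothetical helpers, the reply is assumed to contain a single dictionary (possibly with single quotes, as in the sample output above), and redaction is done by simple string replacement.

```python
import ast
import json


def parse_entities(response):
    """Parse the model's reply into a dict of label -> list of strings."""
    # Keep only the outermost {...} in case extra text surrounds it
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        return {}
    payload = response[start:end + 1]
    try:
        parsed = json.loads(payload)  # strict JSON with double quotes
    except json.JSONDecodeError:
        try:
            parsed = ast.literal_eval(payload)  # Python-style dict with single quotes
        except (ValueError, SyntaxError):
            return {}
    return parsed if isinstance(parsed, dict) else {}


def redact(text, entities):
    """Replace each detected value with an upper-case [LABEL] placeholder."""
    pairs = []
    for label, values in entities.items():
        if isinstance(values, str):  # tolerate a bare string instead of a list
            values = [values]
        pairs.extend((v, label) for v in values if isinstance(v, str) and v)
    # Replace longer values first so a short value cannot split a longer one
    for value, label in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(value, f"[{label.upper()}]")
    return text


reply = "{'middlename': ['Abner'], 'dob': ['23/03/1926']}"
entities = parse_entities(reply)
redacted = redact("Hi Abner, your appointment is on 23/03/1926.", entities)
print(redacted)  # Hi [MIDDLENAME], your appointment is on [DOB].
```

String replacement redacts every occurrence of a detected value, including ones the model did not flag; if only specific positions should be masked, the text would need to be matched by offset instead.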