---
license: mit
language:
- en
pipeline_tag: text-generation
tags:
- LLM
- token classification
- nlp
- safetensor
- PyTorch
base_model: microsoft/Phi-3-mini-4k-instruct
library_name: transformers
widget:
- text: My name is Sylvain and I live in Paris
  example_title: Parisian
- text: My name is Sarah and I live in London
  example_title: Londoner
---


# PII Detection Model - Phi3 Mini Fine-Tuned

This repository contains [PII-Model-Phi3-Mini](https://huggingface.co/ab-ai/PII-Model-Phi3-Mini), a version of Phi-3 Mini fine-tuned to detect personally identifiable information (PII). The model is trained to recognize the PII entities listed in this card, which makes it useful for data redaction, privacy protection, and compliance with data protection regulations.

## Model Overview

### Model Architecture

- **Base Model**: Phi-3 Mini (`microsoft/Phi-3-mini-4k-instruct`)
- **Fine-Tuned For**: PII detection
- **Framework**: [Hugging Face Transformers](https://huggingface.co/transformers/)

### Detected PII Entities

The model can detect the following PII entities:

- **Personal Information**:
  - `firstname`
  - `middlename`
  - `lastname`
  - `sex`
  - `dob` (Date of Birth)
  - `age`
  - `gender`
  - `height`
  - `eyecolor`
  
- **Contact Information**:
  - `email`
  - `phonenumber`
  - `url`
  - `username`
  - `useragent`
  
- **Address Information**:
  - `street`
  - `city`
  - `state`
  - `county`
  - `zipcode`
  - `country`
  - `secondaryaddress`
  - `buildingnumber`
  - `ordinaldirection`
  
- **Geographical Information**:
  - `nearbygpscoordinate`
  
- **Organizational Information**:
  - `companyname`
  - `jobtitle`
  - `jobarea`
  - `jobtype`
  
- **Financial Information**:
  - `accountname`
  - `accountnumber`
  - `creditcardnumber`
  - `creditcardcvv`
  - `creditcardissuer`
  - `iban`
  - `bic`
  - `currency`
  - `currencyname`
  - `currencysymbol`
  - `currencycode`
  - `amount`
  
- **Unique Identifiers**:
  - `pin`
  - `ssn`
  - `phoneimei` (Phone IMEI)
  - `mac` (MAC Address)
  - `vehiclevin` (Vehicle VIN)
  - `vehiclevrm` (Vehicle VRM)
  
- **Cryptocurrency Information**:
  - `bitcoinaddress`
  - `litecoinaddress`
  - `ethereumaddress`
  
- **Other Information**:
  - `ip` (IP Address)
  - `ipv4`
  - `ipv6`
  - `maskednumber`
  - `password`
  - `time`
  - `prefix`

## Prompt Format
```text
### Instruction:
  Identify and extract the following PII entities from the text, if present: companyname, pin, currencyname, email, phoneimei, litecoinaddress, currency, eyecolor, street, mac, state, time, vehiclevin, jobarea, date, bic, currencysymbol, currencycode, age, nearbygpscoordinate, amount, ssn, ethereumaddress, zipcode, buildingnumber, dob, firstname, middlename, ordinaldirection, jobtitle, bitcoinaddress, jobtype, phonenumber, height, password, ip, useragent, accountname, city, gender, secondaryaddress, iban, sex, prefix, ipv4, maskednumber, url, username, lastname, creditcardcvv, county, vehiclevrm, ipv6, creditcardissuer, accountnumber, creditcardnumber. Return the output in JSON format.

### Input:
  Greetings, Mason! Let's celebrate another year of wellness on 14/01/1977. Don't miss the event at 176,Apt. 388.

### Output:

```
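The prompt can also be assembled programmatically. The following is a minimal sketch, not part of the released code: `PII_ENTITIES` and `build_prompt` are hypothetical names, the entity labels are copied from the instruction shown in the prompt format, and the two-space indentation is assumed to match that template.

```python
# Entity labels copied from the instruction in the prompt format.
# PII_ENTITIES and build_prompt are illustrative names, not part of the model repo.
PII_ENTITIES = (
    "companyname pin currencyname email phoneimei litecoinaddress currency "
    "eyecolor street mac state time vehiclevin jobarea date bic currencysymbol "
    "currencycode age nearbygpscoordinate amount ssn ethereumaddress zipcode "
    "buildingnumber dob firstname middlename ordinaldirection jobtitle "
    "bitcoinaddress jobtype phonenumber height password ip useragent "
    "accountname city gender secondaryaddress iban sex prefix ipv4 "
    "maskednumber url username lastname creditcardcvv county vehiclevrm ipv6 "
    "creditcardissuer accountnumber creditcardnumber"
).split()


def build_prompt(input_text: str, entities=PII_ENTITIES) -> str:
    """Fill the instruction/input/output template with the given text."""
    instruction = (
        "Identify and extract the following PII entities from the text, "
        f"if present: {', '.join(entities)}. Return the output in JSON format."
    )
    return (
        "### Instruction:\n"
        f"  {instruction}\n\n"
        "### Input:\n"
        f"  {input_text}\n\n"
        "### Output:\n"
    )


if __name__ == "__main__":
    print(build_prompt("Greetings, Mason! Don't miss the event at 176,Apt. 388."))
```

Passing a shorter `entities` list is possible with this helper, but whether the model behaves well with a reduced label set is untested here.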

## Usage

### Installation

To use this model, install the `transformers` library along with PyTorch:

```bash
pip install transformers torch
```

### Run Inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use a GPU when one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and model. The model generates text, so it is loaded
# as a causal language model (token-classification models have no generate()).
tokenizer = AutoTokenizer.from_pretrained("ab-ai/PII-Model-Phi3-Mini")
model = AutoModelForCausalLM.from_pretrained("ab-ai/PII-Model-Phi3-Mini").to(device)

input_text = "Hi Abner, just a reminder that your next primary care appointment is on 23/03/1926. Please confirm by replying to this email [email protected]."

model_prompt = f"""### Instruction:
    Identify and extract the following PII entities from the text, if present: companyname, pin, currencyname, email, phoneimei, litecoinaddress, currency, eyecolor, street, mac, state, time, vehiclevin, jobarea, date, bic, currencysymbol, currencycode, age, nearbygpscoordinate, amount, ssn, ethereumaddress, zipcode, buildingnumber, dob, firstname, middlename, ordinaldirection, jobtitle, bitcoinaddress, jobtype, phonenumber, height, password, ip, useragent, accountname, city, gender, secondaryaddress, iban, sex, prefix, ipv4, maskednumber, url, username, lastname, creditcardcvv, county, vehiclevrm, ipv6, creditcardissuer, accountnumber, creditcardnumber. Return the output in JSON format.

    ### Input:
    {input_text}

    ### Output: """

inputs = tokenizer(model_prompt, return_tensors="pt").to(device)
# Adjust max_new_tokens to fit the expected output length
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=120)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
# Sampling is enabled, so the exact output can vary between runs
print(response)  # e.g. {'middlename': ['Abner'], 'dob': ['23/03/1926'], 'email': ['[email protected]']}
```
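
The example output in the comment above is a Python-style dictionary with single quotes, which strict JSON parsers reject. The following is a minimal sketch of how the response might be parsed and used for redaction. `parse_pii_output` and `redact` are hypothetical helpers, not part of this repository, and the sketch assumes the response contains a single `{...}` block mapping each label to a list of strings.

```python
import ast
import json
import re


def parse_pii_output(response: str) -> dict:
    """Parse the first {...} block after '### Output:' into a dict.

    Tries strict JSON first, then falls back to a Python literal, because
    the example output uses single quotes. Returns {} if nothing parses.
    """
    generated = response.split("### Output:")[-1]
    match = re.search(r"\{.*\}", generated, flags=re.DOTALL)
    if match is None:
        return {}
    raw = match.group(0)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return {}
    return parsed if isinstance(parsed, dict) else {}


def redact(text: str, entities: dict) -> str:
    """Replace every detected value in the text with its upper-cased label."""
    pairs = []
    for label, values in entities.items():
        if isinstance(values, str):
            values = [values]
        pairs.extend((v, label) for v in values if isinstance(v, str) and v)
    # Replace longer values first so a short value cannot split a longer one.
    for value, label in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(value, f"[{label.upper()}]")
    return text


if __name__ == "__main__":
    sample = "### Output: {'middlename': ['Abner'], 'dob': ['23/03/1926']}"
    found = parse_pii_output(sample)
    print(redact("Hi Abner, your appointment is on 23/03/1926.", found))
    # prints: Hi [MIDDLENAME], your appointment is on [DOB].
```

Plain string replacement is used for simplicity. It redacts every occurrence of a detected value, which may be broader than intended when a value also appears in a non-sensitive context.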