	---
	license: mit
	language:
	- en
	pipeline_tag: text-generation
	tags:
	- LLM
	- token classification
	- nlp
	- safetensor
	- PyTorch
	base_model: microsoft/Phi-3-mini-4k-instruct
	library_name: transformers
	widget:
	- text: My name is Sylvain and I live in Paris
	  example_title: Parisian
	- text: My name is Sarah and I live in London
	  example_title: Londoner
	---


	# PII Detection Model - Phi3 Mini Fine-Tuned

	This repository contains [PII-Model-Phi3-Mini](https://huggingface.co/ab-ai/PII-Model-Phi3-Mini), a fine-tuned version of `microsoft/Phi-3-mini-4k-instruct` for detecting personally identifiable information (PII). Given a prompt that lists the entity types to look for, the model generates the PII it finds in the input text as JSON. This makes it suitable for tasks such as data redaction, privacy protection, and compliance with data protection regulations.

	## Model Overview

	### Model Architecture

	- Base Model: Phi-3 Mini (`microsoft/Phi-3-mini-4k-instruct`)
	- Fine-Tuned For: PII detection
	- Framework: [Hugging Face Transformers](https://huggingface.co/transformers/)

	### Detected PII Entities

	The model is capable of detecting the following PII entities:

	- Personal Information:
	  - `firstname`
	  - `middlename`
	  - `lastname`
	  - `sex`
	  - `dob` (Date of Birth)
	  - `age`
	  - `gender`
	  - `height`
	  - `eyecolor`

	- Contact Information:
	  - `email`
	  - `phonenumber`
	  - `url`
	  - `username`
	  - `useragent`

	- Address Information:
	  - `street`
	  - `city`
	  - `state`
	  - `county`
	  - `zipcode`
	  - `country`
	  - `secondaryaddress`
	  - `buildingnumber`
	  - `ordinaldirection`

	- Geographical Information:
	  - `nearbygpscoordinate`

	- Organizational Information:
	  - `companyname`
	  - `jobtitle`
	  - `jobarea`
	  - `jobtype`

	- Financial Information:
	  - `accountname`
	  - `accountnumber`
	  - `creditcardnumber`
	  - `creditcardcvv`
	  - `creditcardissuer`
	  - `iban`
	  - `bic`
	  - `currency`
	  - `currencyname`
	  - `currencysymbol`
	  - `currencycode`
	  - `amount`

	- Unique Identifiers:
	  - `pin`
	  - `ssn`
	  - `imei` (Phone IMEI)
	  - `mac` (MAC Address)
	  - `vehiclevin` (Vehicle VIN)
	  - `vehiclevrm` (Vehicle VRM)

	- Cryptocurrency Information:
	  - `bitcoinaddress`
	  - `litecoinaddress`
	  - `ethereumaddress`

	- Other Information:
	  - `ip` (IP Address)
	  - `ipv4`
	  - `ipv6`
	  - `maskednumber`
	  - `password`
	  - `time`
	  - `ordinaldirection`
	  - `prefix`
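
The grouping above can be mirrored in code when only some categories matter downstream, for example when only financial fields need review. The following is a minimal sketch, not part of the model's API: the short category keys, `LABEL_TO_CATEGORY`, and `group_by_category` are hypothetical names chosen for illustration, and the input is assumed to be a dictionary mapping each label to a list of detected strings. Labels that appear in the prompt but not in the list above (such as `date` and `phoneimei`) fall into an `unknown` bucket.

```python
# Category keys are illustrative short names for the headings listed above.
PII_CATEGORIES = {
    "personal": ["firstname", "middlename", "lastname", "sex", "dob", "age",
                 "gender", "height", "eyecolor"],
    "contact": ["email", "phonenumber", "url", "username", "useragent"],
    # `ordinaldirection` is listed under both Address and Other; kept here only.
    "address": ["street", "city", "state", "county", "zipcode", "country",
                "secondaryaddress", "buildingnumber", "ordinaldirection"],
    "geographical": ["nearbygpscoordinate"],
    "organizational": ["companyname", "jobtitle", "jobarea", "jobtype"],
    "financial": ["accountname", "accountnumber", "creditcardnumber",
                  "creditcardcvv", "creditcardissuer", "iban", "bic", "currency",
                  "currencyname", "currencysymbol", "currencycode", "amount"],
    "identifiers": ["pin", "ssn", "imei", "mac", "vehiclevin", "vehiclevrm"],
    "cryptocurrency": ["bitcoinaddress", "litecoinaddress", "ethereumaddress"],
    "other": ["ip", "ipv4", "ipv6", "maskednumber", "password", "time", "prefix"],
}

# Reverse lookup: label -> category
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in PII_CATEGORIES.items()
    for label in labels
}


def group_by_category(entities: dict) -> dict:
    """Group a {label: [values]} dict into {category: {label: [values]}}."""
    grouped = {}
    for label, values in entities.items():
        category = LABEL_TO_CATEGORY.get(label, "unknown")
        grouped.setdefault(category, {})[label] = values
    return grouped


# Hypothetical detection result, used only to show the shape of the output
detected = {"firstname": ["Mason"], "dob": ["14/01/1977"]}
print(group_by_category(detected))
```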

	## Prompt Format
	```text
	### Instruction:
	Identify and extract the following PII entities from the text, if present: companyname, pin, currencyname, email, phoneimei, litecoinaddress, currency, eyecolor, street, mac, state, time, vehiclevin, jobarea, date, bic, currencysymbol, currencycode, age, nearbygpscoordinate, amount, ssn, ethereumaddress, zipcode, buildingnumber, dob, firstname, middlename, ordinaldirection, jobtitle, bitcoinaddress, jobtype, phonenumber, height, password, ip, useragent, accountname, city, gender, secondaryaddress, iban, sex, prefix, ipv4, maskednumber, url, username, lastname, creditcardcvv, county, vehiclevrm, ipv6, creditcardissuer, accountnumber, creditcardnumber. Return the output in JSON format.

	### Input:
	Greetings, Mason! Let's celebrate another year of wellness on 14/01/1977. Don't miss the event at 176,Apt. 388.

	### Output:

	```
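
Since the template is fixed apart from the input text, it can be assembled by a small helper. This is a minimal sketch: `PII_ENTITIES` and `build_prompt` are hypothetical names, the entity labels are copied from the instruction line above, and the trailing `### Output: ` (with a single space) follows the inference example in this README rather than anything documented about the training format.

```python
from collections.abc import Sequence

# Entity labels copied from the instruction line of the prompt template.
PII_ENTITIES = (
    "companyname", "pin", "currencyname", "email", "phoneimei", "litecoinaddress",
    "currency", "eyecolor", "street", "mac", "state", "time", "vehiclevin",
    "jobarea", "date", "bic", "currencysymbol", "currencycode", "age",
    "nearbygpscoordinate", "amount", "ssn", "ethereumaddress", "zipcode",
    "buildingnumber", "dob", "firstname", "middlename", "ordinaldirection",
    "jobtitle", "bitcoinaddress", "jobtype", "phonenumber", "height", "password",
    "ip", "useragent", "accountname", "city", "gender", "secondaryaddress", "iban",
    "sex", "prefix", "ipv4", "maskednumber", "url", "username", "lastname",
    "creditcardcvv", "county", "vehiclevrm", "ipv6", "creditcardissuer",
    "accountnumber", "creditcardnumber",
)


def build_prompt(text: str, entities: Sequence[str] = PII_ENTITIES) -> str:
    """Fill the Instruction / Input / Output template with the given text."""
    instruction = (
        "Identify and extract the following PII entities from the text, "
        f"if present: {', '.join(entities)}. Return the output in JSON format."
    )
    return f"### Instruction:\n{instruction}\n\n### Input:\n{text}\n\n### Output: "


print(build_prompt("Greetings, Mason! Let's celebrate another year of wellness on 14/01/1977."))
```

Passing a shorter `entities` list changes the instruction text, and the model may not have seen such prompts during fine-tuning, so the full list is the safer default.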

	## Usage

	### Installation

	To use this model, install the `transformers` library together with PyTorch, which the inference example below relies on:

	```bash
	pip install transformers torch
	```

	### Run Inference
	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# Use a GPU when one is available, otherwise fall back to the CPU
	device = "cuda" if torch.cuda.is_available() else "cpu"

	# Load the tokenizer and model; generate() requires a causal language model head
	tokenizer = AutoTokenizer.from_pretrained("ab-ai/PII-Model-Phi3-Mini")
	model = AutoModelForCausalLM.from_pretrained("ab-ai/PII-Model-Phi3-Mini").to(device)

	input_text = "Hi Abner, just a reminder that your next primary care appointment is on 23/03/1926. Please confirm by replying to this email [email protected]."

	model_prompt = f"""### Instruction:
	Identify and extract the following PII entities from the text, if present: companyname, pin, currencyname, email, phoneimei, litecoinaddress, currency, eyecolor, street, mac, state, time, vehiclevin, jobarea, date, bic, currencysymbol, currencycode, age, nearbygpscoordinate, amount, ssn, ethereumaddress, zipcode, buildingnumber, dob, firstname, middlename, ordinaldirection, jobtitle, bitcoinaddress, jobtype, phonenumber, height, password, ip, useragent, accountname, city, gender, secondaryaddress, iban, sex, prefix, ipv4, maskednumber, url, username, lastname, creditcardcvv, county, vehiclevrm, ipv6, creditcardissuer, accountnumber, creditcardnumber. Return the output in JSON format.

	### Input:
	{input_text}

	### Output: """

	inputs = tokenizer(model_prompt, return_tensors="pt").to(device)
	# Adjust max_new_tokens to your needs. With do_sample=True the output can vary
	# between runs; set do_sample=False for deterministic (greedy) decoding.
	outputs = model.generate(**inputs, do_sample=True, max_new_tokens=120)
	# Decode only the newly generated tokens so the prompt is not echoed back
	prompt_length = inputs["input_ids"].shape[1]
	response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
	print(response)  # example output: {'middlename': ['Abner'], 'dob': ['23/03/1926'], 'email': ['[email protected]']}
	```
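
For redaction, the generated text has to be turned back into a dictionary and applied to the original input. The sketch below is one possible way to do that, not part of this repository: `parse_entities` and `redact` are hypothetical helpers. It assumes the reply contains a single dictionary mapping each label to a list of strings, and because the example output above uses single quotes (a Python-style literal rather than strict JSON), it falls back to `ast.literal_eval` when `json.loads` fails.

```python
import ast
import json
import re


def parse_entities(response: str) -> dict:
    """Parse the model's reply into {label: [values]}; return {} on failure."""
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        return {}
    raw = match.group(0)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        try:
            parsed = ast.literal_eval(raw)  # handles single-quoted dict output
        except (ValueError, SyntaxError):
            return {}
    return parsed if isinstance(parsed, dict) else {}


def redact(text: str, entities: dict) -> str:
    """Replace each detected value with an upper-case [LABEL] placeholder."""
    pairs = []
    for label, values in entities.items():
        if isinstance(values, str):
            values = [values]
        # Skip empty or non-string values, which cannot be matched in the text
        pairs.extend((v, label) for v in values if isinstance(v, str) and v)
    # Longest values first so a short value never clobbers part of a longer one
    for value, label in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(value, f"[{label.upper()}]")
    return text


reply = "{'middlename': ['Abner'], 'dob': ['23/03/1926']}"
text = "Hi Abner, your appointment is on 23/03/1926."
print(redact(text, parse_entities(reply)))
# Hi [MIDDLENAME], your appointment is on [DOB].
```

Plain string replacement only redacts values the model returned verbatim; if the reply is truncated by `max_new_tokens` or is not a well-formed dictionary, `parse_entities` returns an empty dictionary and the text is left unchanged, so callers may want to treat that case as a failure rather than as "no PII found".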