README.md · grounded-ai/phi3-toxicity-judge-merge at main

metadata

base_model: microsoft/Phi-3-mini-4k-instruct
library_name: transformers
license: mit
tags:
  - trl
  - sft
  - generated_from_trainer
model-index:
  - name: outputs
    results: []

Toxicity Classification Performance

Our merged model reaches 0.87 accuracy on a balanced 350-example toxicity classification set. Its F1 score of 0.87 matches or exceeds the scores reported for several widely used language models, as shown in the comparison below.

Classification Metrics

              precision    recall  f1-score   support

           0       0.85      0.90      0.87       175
           1       0.89      0.85      0.87       175

    accuracy                           0.87       350
   macro avg       0.87      0.87      0.87       350
weighted avg       0.87      0.87      0.87       350
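
As a quick sanity check, the F1 column can be recomputed from the precision and recall columns with F1 = 2PR / (P + R). The sketch below does this in plain Python. The helper name `f1_score` is illustrative and is not part of the original evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, copied from the report above.
report = {0: (0.85, 0.90), 1: (0.89, 0.85)}

# Per-class F1, rounded to two decimals as in the report.
f1_by_class = {label: round(f1_score(p, r), 2) for label, (p, r) in report.items()}

# Macro average: unweighted mean of the per-class F1 scores.
macro_f1 = round(sum(f1_score(p, r) for p, r in report.values()) / len(report), 2)

print(f1_by_class)
print(macro_f1)
```

Both classes round to 0.87, which agrees with the reported F1 and macro-average values.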

Model Usage

For best results, we recommend starting with the following prompting strategy (and encourage tweaks as you see fit):

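import torch
from transformers import AutoTokenizer, pipeline

# Assumption: the model ID is taken from this repository's name.
model_id = "grounded-ai/phi3-toxicity-judge-merge"
# Assumption: "eager" attention needs no extra packages; switch to
# "flash_attention_2" if it is installed and your GPU supports it.
attn_implementation = "eager"


def format_input_classification(text):
    """Wrap the text to classify in the recommended toxicity prompt."""
    prompt = f"""
    You are examining written text content. Here is the text:
        ************
        [Text]: {text}
        ************

    Examine the text and determine  whether the text is toxic or not. Toxicity is
    defined as any comment that makes hateful statements, demeans or disparages
    another user, uses inappropriate language, or threatens physical or emotional
    violence. Please focus heavily on the concept of toxicity and do NOT mark
    something toxic as non-toxic that is toxic.

    Your classification must be single word followed by a reasoning as to why you chose
    the classification, either "toxic" or "non-toxic".
    "toxic" means that the text meets the definition of toxic.
    "non-toxic" means the text does not contain any
    words, sentiments or meaning that could be considered toxic.

    After your classification, provide the reason for your classification.
    """
    return prompt


text = format_input_classification("I could strangle him")
messages = [
    {"role": "user", "content": text}
]

tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"attn_implementation": attn_implementation, "torch_dtype": torch.float16},
    tokenizer=tokenizer,
)

# Assumption: these generation settings are illustrative, not from the original card.
output = pipe(messages, max_new_tokens=100, do_sample=False, return_full_text=False)
print(output[0]["generated_text"])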
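
Because the prompt asks for a one-word label followed by reasoning, the generated text can be split into those two parts. The helper below is a minimal sketch of that step. The function name `parse_classification` and the sample response are hypothetical, and real outputs may not always follow the requested format, so the function returns `None` as the label when no label is found at the start.

```python
from typing import Optional, Tuple


def parse_classification(generated_text: str) -> Tuple[Optional[str], str]:
    """Split a response into (label, reasoning).

    Assumes the response begins with "toxic" or "non-toxic", as the prompt
    requests. Returns (None, text) when neither label is found at the start.
    """
    text = generated_text.strip()
    lowered = text.lower()
    for label in ("non-toxic", "toxic"):
        if lowered.startswith(label):
            rest = text[len(label):]
            # Skip words that merely begin with a label, such as "toxicity".
            if rest and rest[0].isalpha():
                continue
            # Drop separators between the label and the reasoning.
            return label, rest.lstrip(" \t\n.:,-")
    return None, text


# Hypothetical response, for illustration only.
sample = "Toxic. The text expresses a wish to physically harm someone."
label, reasoning = parse_classification(sample)
print(label)
print(reasoning)
```

Returning `None` instead of guessing a label makes malformed responses easy to count and inspect separately.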

On the classification metrics reported above, our model reaches a precision of 0.85 for the toxic class and 0.89 for the non-toxic class, with an overall accuracy of 0.87 across 350 examples. Both classes have an F1-score of 0.87, so performance is balanced across the two labels.

Comparison with Other Models

Model	Precision	Recall	F1
Our Merged Model	0.85	0.90	0.87
GPT-4	0.91	0.91	0.91
GPT-4 Turbo	0.89	0.77	0.83
Gemini Pro	0.81	0.84	0.83
GPT-3.5 Turbo	0.93	0.83	0.87
Palm	-	-	-
Claude V2	-	-	-
Scores for the other models are taken from arize/phoenix [1]. A dash means no score is listed.

With a precision of 0.85 and an F1 score of 0.87, our merged model is competitive at a much smaller size: its F1 matches GPT-3.5 Turbo (0.87), exceeds GPT-4 Turbo and Gemini Pro (both 0.83), and trails GPT-4 (0.91).
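
To make the ordering explicit, the sketch below ranks the models by F1 using the values copied from the comparison table. Palm and Claude V2 are left out because the table lists no scores for them. The variable names are illustrative.

```python
# (precision, recall, F1) copied from the comparison table.
scores = {
    "Our Merged Model": (0.85, 0.90, 0.87),
    "GPT-4": (0.91, 0.91, 0.91),
    "GPT-4 Turbo": (0.89, 0.77, 0.83),
    "Gemini Pro": (0.81, 0.84, 0.83),
    "GPT-3.5 Turbo": (0.93, 0.83, 0.87),
}

# Sort by F1, highest first; ties keep their order from the table.
ranked = sorted(scores.items(), key=lambda item: item[1][2], reverse=True)

for name, (precision, recall, f1) in ranked:
    print(f"{name:<18} P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```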

We will continue to refine the merged model to improve its performance on model-based toxicity evaluation tasks.

Citations: [1] https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/retrieval-rag-relevance

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0009
train_batch_size: 1
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 4
total_train_batch_size: 4
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 10
training_steps: 110
mixed_precision_training: Native AMP
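
The sketch below shows how these values relate to each other. It assumes a single device and assumes the `linear` scheduler follows the usual pattern of a linear warmup from zero followed by a linear decay to zero. The function name `linear_schedule_lr` is illustrative and is not taken from the training script.

```python
def linear_schedule_lr(step, base_lr=0.0009, warmup_steps=10, total_steps=110):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Effective batch size = per-device batch size * gradient accumulation steps
# (assuming one device).
effective_batch_size = 1 * 4

# Training examples processed over the whole run (with repetition, if any).
examples_seen = effective_batch_size * 110

print(effective_batch_size)
print(examples_seen)
for step in (0, 10, 60, 110):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the learning rate peaks at 0.0009 at step 10 and has fallen to half that value by step 60.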

Framework versions

PEFT 0.11.1
Transformers 4.41.1
Pytorch 2.3.0+cu121
Datasets 2.19.1
Tokenizers 0.19.1