---
license: apache-2.0
---

# Model Card for Deita Complexity Scorer

Deita is an open-source project designed to facilitate **Automatic Data Selection** for instruction tuning in Large Language Models (LLMs).

Deita Complexity Scorer is a tool for automatically annotating the instruction complexity of SFT data.

## Model description

- **Model type:** Model fine-tuned to automatically annotate instruction complexity
- **Language(s) (NLP):** Primarily English
- **Finetuned from model:** Llama-1-13b-hf

### Model Sources

- **Repository:** https://github.com/hkust-nlp/deita
- **Model Family:** Other models and the dataset are found in the [Deita collection](https://huggingface.co/collections/hkust-nlp/deita-6569c198c174808d94cf5bd4).

## Usage

Use the following prompt format and code to score the complexity of an instruction:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np
from scipy.special import softmax

model_name = "hkust-nlp/Deita-Complexity-Scorer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


def infer_complexity(model, tokenizer, input_text):
    # Prompt format used to query the scorer; keep it unchanged.
    complexity_template = ("You are a helpful assistant. Please identify the complexity score of the following user query. \n##Query: {instruction} \n##Complexity: ")
    user_input = complexity_template.format(instruction=input_text)
    input_ids = tokenizer.encode(user_input, return_tensors="pt")
    max_length = 512
    outputs = model.generate(input_ids, max_length=max_length, num_return_sequences=1, return_dict_in_generate=True, output_scores=True)
    # Vocabulary scores for the first generated token of the single returned sequence.
    logprobs_list = outputs.scores[0][0]
    # Token ids mapped to the score digits they represent.
    id2score = {
        29896: "1",
        29906: "2",
        29941: "3",
        29946: "4",
        29945: "5",
        29953: "6"
    }
    score_template = np.array([1, 2, 3, 4, 5, 6])
    # Collect the scores of the six digit tokens as plain Python floats.
    score_logits = np.array([logprobs_list[k].item() for k in id2score])
    # Softmax over the six scores, then the probability-weighted sum of 1..6.
    score_npy = softmax(score_logits, axis=0)
    score_npy = score_npy * score_template
    score_npy = np.sum(score_npy, axis=0)
    return score_npy


# example input
input_text = "write a performance review for a junior data scientist"
complexity_score = infer_complexity(model, tokenizer, input_text)

print(complexity_score)
```
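
The value returned by `infer_complexity` is a continuous score between 1 and 6: the scores of the six digit tokens are passed through a softmax and used to weight the ratings 1 to 6. The sketch below isolates that last step so it can be checked without downloading the model. The helper name `expected_score` and the input values are hypothetical, chosen only for illustration, and the softmax is written with NumPy directly rather than `scipy.special.softmax`.

```python
import numpy as np


def expected_score(logits, ratings=(1, 2, 3, 4, 5, 6)):
    """Softmax over the rating logits, then the probability-weighted sum of ratings."""
    logits = np.asarray(logits, dtype=np.float64)
    ratings = np.asarray(ratings, dtype=np.float64)
    # Subtract the maximum before exponentiating for numerical stability.
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    return float(np.sum(probs * ratings))


# Hypothetical logits, not real model output.
uniform = expected_score([0.0] * 6)            # equal probability on every rating
peaked = expected_score([0, 0, 0, 0, 0, 50])   # almost all probability on rating 6
print(uniform, peaked)
```

Equal logits give the midpoint of the scale, while a single dominant logit pulls the score toward that rating, which is why the scorer can return non-integer values.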