---
license: cc-by-nc-4.0
language:
- en
---

# Jellyfish-7B

<img src="https://i.imgur.com/E1vqCIw.png" alt="PicToModel" width="330"/>

Jellyfish models of other sizes are available here:

- [Jellyfish-8B](https://huggingface.co/NECOUDBFM/Jellyfish-8B)
- [Jellyfish-13B](https://huggingface.co/NECOUDBFM/Jellyfish-13B)

## Model Details

Jellyfish-7B is a large language model with 7 billion parameters.
We fine-tuned the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model using a subset of the [Jellyfish-Instruct](https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct) dataset.

In terms of interpretability, Jellyfish-7B achieves a win rate of 56.36% against GPT-3.5-turbo, as judged by GPT-4.

More details about the model can be found in the [Jellyfish paper](https://arxiv.org/abs/2312.01678).

- **Developed by:** Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada
- **Contact:** [email protected]
- **Funded by:** NEC Corporation, Osaka University
- **Language(s) (NLP):** English
- **License:** Non-Commercial Creative Commons license (CC BY-NC-4.0)
- **Finetuned from model:** [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

## Citation

If you find our work useful, please give us credit by citing:

```
@article{zhang2023jellyfish,
  title={Jellyfish: A Large Language Model for Data Preprocessing},
  author={Zhang, Haochen and Dong, Yuyang and Xiao, Chuan and Oyamada, Masafumi},
  journal={arXiv preprint arXiv:2312.01678},
  year={2023}
}
```

## Performance on seen tasks

| Task | Type | Dataset | Non-LLM SoTA<sup>1</sup> | GPT-3.5<sup>2</sup> | GPT-4<sup>2</sup> | GPT-4o | Table-GPT | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|------|------|---------|--------------------------|---------------------|-------------------|--------|-----------|--------------|--------------|---------------|
| Error Detection | Seen | Adult | *99.10* | 99.10 | 92.01 | 83.58 | -- | 77.40 | 73.74 | **99.33** |
| Error Detection | Seen | Hospital | 94.40 | **97.80** | 90.74 | 44.76 | -- | 94.51 | 93.40 | *95.59* |
| Error Detection | Unseen | Flights | 81.00 | -- | **83.48** | 66.01 | -- | 69.15 | 66.21 | *82.52* |
| Error Detection | Unseen | Rayyan | 79.00 | -- | *81.95* | 68.53 | -- | 75.07 | 81.06 | **90.65** |
| Data Imputation | Seen | Buy | 96.50 | 98.50 | **100** | **100** | -- | 98.46 | 98.46 | **100** |
| Data Imputation | Seen | Restaurant | 77.20 | 88.40 | **97.67** | 90.70 | -- | 89.53 | 87.21 | 89.53 |
| Data Imputation | Unseen | Flipkart | 68.00 | -- | **89.94** | 83.20 | -- | 87.14 | *87.48* | 81.68 |
| Data Imputation | Unseen | Phone | 86.70 | -- | **90.79** | 86.78 | -- | 86.52 | 85.68 | *87.21* |
| Schema Matching | Seen | MIMIC-III | 20.00 | -- | 40.00 | 29.41 | -- | **53.33** | *45.45* | 40.00 |
| Schema Matching | Seen | Synthea | 38.50 | 45.20 | **66.67** | 6.56 | -- | 55.56 | 47.06 | 56.00 |
| Schema Matching | Unseen | CMS | *50.00* | -- | 19.35 | 22.22 | -- | 42.86 | 38.10 | **59.29** |
| Entity Matching | Seen | Amazon-Google | 75.58 | 63.50 | 74.21 | 70.91 | 70.10 | **81.69** | *81.42* | 81.34 |
| Entity Matching | Seen | Beer | 94.37 | **100** | **100** | 90.32 | 96.30 | **100.00** | **100.00** | 96.77 |
| Entity Matching | Seen | DBLP-ACM | **98.99** | 96.60 | 97.44 | 95.87 | 93.80 | 98.65 | 98.77 | *98.98* |
| Entity Matching | Seen | DBLP-GoogleScholar | *95.70* | 83.80 | 91.87 | 90.45 | 92.40 | 94.88 | 95.03 | **98.51** |
| Entity Matching | Seen | Fodors-Zagats | **100** | **100** | **100** | 93.62 | **100** | **100** | **100** | **100** |
| Entity Matching | Seen | iTunes-Amazon | 97.06 | *98.20* | **100** | 98.18 | 94.30 | 96.30 | 96.30 | 98.11 |
| Entity Matching | Unseen | Abt-Buy | 89.33 | -- | **92.77** | 78.73 | -- | 86.06 | 88.84 | *89.58* |
| Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | **90.27** | 79.19 | 82.40 | 84.91 | 85.24 | *89.42* |
| Avg | | | 80.44 | -- | *84.17* | 72.58 | -- | 82.74 | 81.55 | **86.02** |

_For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. For Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets._

_We use accuracy as the metric for data imputation and the F1 score for the other tasks._

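
As a rough illustration of the two metrics named above, the sketch below computes accuracy and F1 from lists of predicted and gold answers. It assumes that F1 is taken over the positive class `"Yes"` and that scores are reported as percentages; the card does not spell out these details, so treat both as assumptions. The function names and the toy data are hypothetical.

```python
def accuracy(predictions, labels):
    """Percentage of predictions that exactly match the gold labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def f1_score(predictions, labels, positive="Yes"):
    """F1 (in percent) for the positive class; the choice of class is an assumption."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Toy answers, for illustration only.
predictions = ["Yes", "No", "Yes", "No"]
labels = ["Yes", "Yes", "No", "No"]
print(accuracy(predictions, labels))
print(f1_score(predictions, labels))
```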

1. Non-LLM SoTA baselines:
   - [HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection seen datasets
   - [RAHA](https://dl.acm.org/doi/10.1145/3299869.3324956) for Error Detection unseen datasets
   - [IPM](https://ieeexplore.ieee.org/document/9458712) for Data Imputation
   - [SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching
   - [Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
2. [Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)

## Performance on unseen tasks

### Column Type Annotation

| Dataset | RoBERTa (159 shots)<sup>1</sup> | GPT-3.5<sup>1</sup> | GPT-4 | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|---------|---------------------------------|---------------------|-------|--------|--------------|--------------|---------------|
| SOTAB | 79.20 | 89.47 | 91.55 | 65.05 | 83 | 76.33 | 82 |

_Few-shot is disabled for Jellyfish models._

1. Results from [Column Type Annotation using ChatGPT](https://arxiv.org/abs/2306.00745)

### Attribute Value Extraction

| Dataset | Stable Beluga 2 70B<sup>1</sup> | SOLAR 70B<sup>1</sup> | GPT-3.5<sup>1</sup> | GPT-4<sup>1</sup> | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|---------|---------------------------------|-----------------------|---------------------|-------------------|--------|--------------|--------------|---------------|
| AE-110k | 52.10 | 49.20 | 61.30 | 55.50 | 55.77 | 56.09 | 59.55 | 58.12 |
| OA-Mine | 50.80 | 55.20 | 62.70 | 68.90 | 60.20 | 51.98 | 59.22 | 55.96 |

_Few-shot is disabled for Jellyfish models._

1. Results from [Product Attribute Value Extraction using Large Language Models](https://arxiv.org/abs/2310.12537)

## Prompt Template

```
{system message}

[INST]:

{prompt} (without the {})

[\INST]]
```
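
The template can be filled programmatically. The helper below is a minimal sketch: `build_prompt` is a hypothetical name, and the closing marker is copied verbatim from the template rather than normalized. The backslash is written as `\\` so that Python does not treat `\I` as an escape sequence; the resulting string is the same.

```python
def build_prompt(system_message: str, user_message: str) -> str:
    """Fill the Jellyfish-7B prompt template with a system and a user message."""
    # "\\I" yields a literal backslash followed by "I", as in the template.
    return f"{system_message}\n\n[INST]:\n\n{user_message}\n\n[\\INST]]"


prompt = build_prompt(
    "You are an AI assistant that follows instruction extremely well.",
    "Hello, world.",
)
print(prompt)
```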

## Training Details

### Training Method

We used LoRA to speed up the training process, targeting the `q_proj`, `k_proj`, `v_proj`, and `o_proj` modules.

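
For readers unfamiliar with LoRA, the toy sketch below shows the idea in plain Python: the pretrained weight `W` stays frozen, and only a low-rank pair `A`, `B` is trained, giving an effective weight `W + (alpha / r) * B @ A`. This is not the training code; the matrix sizes, the rank `r`, and the scaling factor `alpha` are illustrative assumptions, since the card does not report them.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the effective weight of a LoRA-adapted layer."""
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# Toy 2x2 frozen weight (standing in for a projection such as q_proj), rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]    # r x d_in, trainable
B = [[0.0], [0.0]]   # d_out x r, trainable, zero-initialized
print(lora_merge(W, A, B, alpha=2, r=1))  # equal to W while B is still zero

B = [[1.0], [2.0]]   # pretend training has updated B
print(lora_merge(W, A, B, alpha=2, r=1))
```

Because only `A` and `B` receive gradients, the number of trainable parameters per adapted module drops from `d_out * d_in` to `r * (d_out + d_in)`.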

## Uses

To accelerate inference, we strongly recommend running Jellyfish using [vLLM](https://github.com/vllm-project/vllm).

### Python Script

We provide two simple Python code examples for inference using the Jellyfish model.

#### Using Transformers and Torch Modules

<div style="height: auto; max-height: 400px; overflow-y: scroll;">

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Model will be automatically downloaded from HuggingFace model hub if not cached.
# Model files will be cached in "~/.cache/huggingface/hub/models--NECOUDBFM--Jellyfish/" by default.
# You can also download the model manually and replace the model name with the path to the model files.
model = AutoModelForCausalLM.from_pretrained(
    "NECOUDBFM/Jellyfish",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NECOUDBFM/Jellyfish")

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test on.
user_message = "Hello, world."

# The backslash is escaped so Python keeps it literally; the prompt string is unchanged.
prompt = f"{system_message}\n\n[INST]:\n\n{user_message}\n\n[\\INST]]"
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(device)

# You can modify the sampling parameters according to your needs.
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.35,
    top_p=0.9,
)

with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=1024,
        pad_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.15,
    )

# Keep only the newly generated tokens (everything after the prompt).
output = generation_output[0]
response = tokenizer.decode(
    output[:, input_ids.shape[-1] :][0], skip_special_tokens=True
).strip()

print(response)
```

</div>

#### Using vLLM

<div style="height: auto; max-height: 400px; overflow-y: scroll;">

```python
from vllm import LLM, SamplingParams

# To use vllm for inference, you need to download the model files either using HuggingFace model hub or manually.
# You should modify the path to the model according to your local environment.
path_to_model = "/workspace/models/Jellyfish"

model = LLM(model=path_to_model)

# You can modify the sampling parameters according to your needs.
# Caution: The stop parameter should not be changed.
sampling_params = SamplingParams(
    temperature=0.35,
    top_p=0.9,
    max_tokens=1024,
    stop=["[INST]"],
)

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test on.
user_message = "Hello, world."

# The backslash is escaped so Python keeps it literally; the prompt string is unchanged.
prompt = f"{system_message}\n\n[INST]:\n\n{user_message}\n\n[\\INST]]"
outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text.strip()
print(response)
```

</div>

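
Several of the task prompts in this card ask the model to choose from `[Yes, No]`, so the raw `response` string usually needs to be mapped to a label before scoring. The helper below is a minimal sketch of one way to do that. `parse_yes_no` is a hypothetical name, and taking the first standalone "yes" or "no" in the text is an assumption about the output format, not something the card specifies.

```python
import re
from typing import Optional


def parse_yes_no(response: str) -> Optional[bool]:
    """Map a model response to True (Yes), False (No), or None if neither word appears."""
    match = re.search(r"\b(yes|no)\b", response, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


print(parse_yes_no("Yes"))
print(parse_yes_no(" no, the two records differ."))
print(parse_yes_no("I am not sure."))
```

Responses that come back as `None` are worth inspecting by hand rather than silently counting as "No".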

## Prompts

We provide the prompts used for both fine-tuning and inference.
You can structure your data according to these prompts.

### System Message

```
You are an AI assistant that follows instruction extremely well.
User will give you a question. Your task is to answer as faithfully as you can.
```

### For Error Detection

_There are two forms of the error detection task. In the first form, a complete record row is provided, and the task is to determine if a specific value is erroneous. In the second form, only the value of a specific attribute is given, and the decision about its correctness is based solely on the attribute's name and value. The two prompt examples below correspond to these two forms, respectively._

```
Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
The attributes may include {attribute 1}, {attribute 2}, ...
Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense given the context of the whole record.
Record [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
```

```
Your task is to determine if there is an error in the value of a specific attribute.
The attributes may belong to a {keyword} record and could be one of the following: {attribute 1}, {attribute 2}, ...
Errors can include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense for that attribute.
Note: Missing values (N/A or "nan") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
```
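
A minimal sketch of how the first (record-level) template might be filled from a Python dict. The function name and the sample record are hypothetical, and the exact line breaks and punctuation around the attribute list are assumptions, since the template leaves them open with "...".

```python
def build_error_detection_prompt(record: dict, target: str) -> str:
    """Fill the record-level error detection template for one attribute."""
    attribute_names = ", ".join(record)
    record_text = ", ".join(f"{name}: {value}" for name, value in record.items())
    lines = [
        "Your task is to determine if there is an error in the value of a specific "
        "attribute within the whole record provided.",
        f"The attributes may include {attribute_names}",
        "Errors may include, but are not limited to, spelling errors, inconsistencies, "
        "or values that don't make sense given the context of the whole record.",
        f"Record [{record_text}]",
        f"Attribute for Verification: [{target}: {record[target]}]",
        f"Question: Is there an error in the value of {target}? "
        "Choose your answer from: [Yes, No].",
    ]
    return "\n".join(lines)


# Hypothetical record, for illustration only.
record = {"age": "39", "workclass": "State-gov", "education": "Bachelors"}
print(build_error_detection_prompt(record, "age"))
```

The returned string is the user message; it still has to be wrapped in the prompt template together with the system message.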

### For Data Imputation

```
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
Answer only the value of {attribute X}.
```
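
The imputation template can be filled in the same spirit. In this sketch the missing attribute is dropped from the serialized record so that only the known fields are shown; that choice, the function name, and the sample product are assumptions made for illustration.

```python
def build_imputation_prompt(keyword: str, record: dict, missing: str) -> str:
    """Fill the data imputation template, leaving the missing attribute out of the record."""
    known = {name: value for name, value in record.items() if name != missing}
    fields = ", ".join(known)
    record_text = ", ".join(f"{name}: {value}" for name, value in known.items())
    lines = [
        f"You are presented with a {keyword} record that is missing a specific attribute: {missing}.",
        f"Your task is to deduce or infer the value of {missing} using the available information in the record.",
        f"You may be provided with fields like {fields} to help you in the inference.",
        f"Record: [{record_text}]",
        f"Based on the provided record, what would you infer is the value for the missing attribute {missing}?",
        f"Answer only the value of {missing}.",
    ]
    return "\n".join(lines)


# Hypothetical product record with an unknown manufacturer.
record = {"name": "PlayStation 2 Memory Card 8MB", "price": "9.99", "manufacturer": None}
print(build_imputation_prompt("product", record, "manufacturer"))
```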

### For Schema Matching

```
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
Each attribute will be provided by its name and a brief description.
Your goal is to assess if they refer to the same information based on these names and descriptions provided.
Attribute A is [name: {value of name}, description: {value of description}].
Attribute B is [name: {value of name}, description: {value of description}].
Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
```
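
Since the prompt judges one attribute pair at a time, matching two tables means issuing one prompt per candidate pair. The sketch below enumerates every cross-table pair with `itertools.product`. Enumerating all pairs is an assumption (the card does not say how candidates are chosen), and the table contents and names are hypothetical.

```python
from itertools import product

TEMPLATE = (
    "Your task is to determine if the two attributes (columns) are semantically "
    "equivalent in the context of merging two tables.\n"
    "Each attribute will be provided by its name and a brief description.\n"
    "Your goal is to assess if they refer to the same information based on these "
    "names and descriptions provided.\n"
    "Attribute A is [name: {name_a}, description: {desc_a}].\n"
    "Attribute B is [name: {name_b}, description: {desc_b}].\n"
    "Are Attribute A and Attribute B semantically equivalent? "
    "Choose your answer from: [Yes, No]."
)


def schema_matching_prompts(table_a: dict, table_b: dict):
    """Yield (name_a, name_b, prompt) for every cross-table attribute pair."""
    for (name_a, desc_a), (name_b, desc_b) in product(table_a.items(), table_b.items()):
        prompt = TEMPLATE.format(name_a=name_a, desc_a=desc_a, name_b=name_b, desc_b=desc_b)
        yield name_a, name_b, prompt


# Hypothetical schemas: attribute name -> short description.
table_a = {"dob": "date of birth of the patient", "sex": "gender of the patient"}
table_b = {"birthdate": "the date the patient was born"}
for name_a, name_b, prompt in schema_matching_prompts(table_a, table_b):
    print(name_a, "vs", name_b)
    print(prompt)
```

The number of prompts grows with the product of the two schema sizes, so larger tables may need a cheaper pre-filter before calling the model.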

### For Entity Matching

```
You are tasked with determining whether two records listed below are the same based on the information provided.
Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
Note that missing values (N/A or "nan") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Are record A and record B the same entity? Choose your answer from: [Yes, No].
```
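
A minimal sketch of filling the entity matching template for a pair of records. Writing Python `None` values as `nan` follows the template's note on missing values but is otherwise an assumption, as are the function names and the sample records.

```python
def serialize(record: dict) -> str:
    """Render a record as 'name: value' pairs, writing missing values as nan."""
    return ", ".join(
        f"{name}: {'nan' if value is None else value}" for name, value in record.items()
    )


def build_entity_matching_prompt(record_a: dict, record_b: dict) -> str:
    """Fill the entity matching template; both records are assumed to share attributes."""
    attributes = ", ".join(record_a)
    lines = [
        "You are tasked with determining whether two records listed below are the same "
        "based on the information provided.",
        f"Carefully compare the {attributes} for each record before making your decision.",
        'Note that missing values (N/A or "nan") should not be used as a basis for your decision.',
        f"Record A: [{serialize(record_a)}]",
        f"Record B: [{serialize(record_b)}]",
        "Are record A and record B the same entity? Choose your answer from: [Yes, No].",
    ]
    return "\n".join(lines)


# Hypothetical product records, for illustration only.
record_a = {"title": "Apple iPhone 12 64GB", "price": None}
record_b = {"title": "iPhone 12 (64 GB) by Apple", "price": "699.00"}
print(build_entity_matching_prompt(record_a, record_b))
```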

### For Column Type Annotation

We follow the prompt in [Column Type Annotation using ChatGPT](https://arxiv.org/abs/2306.00745) (text+inst+2-step).

### For Attribute Value Extraction

We follow the prompt in [Product Attribute Value Extraction using Large Language Models](https://arxiv.org/abs/2310.12537) (textual, w/o examples).