---
license: cc-by-nc-4.0
language:
  - en
---

# Jellyfish-7B


## Model Details

Jellyfish-7B is a large language model with 7 billion parameters. We fine-tuned the mistralai/Mistral-7B-Instruct-v0.2 model on datasets pertinent to data preprocessing tasks. The training data consist of two parts:

- Jellyfish-13B training data
- GPT-4-generated reasoning data for data preprocessing tasks

More details about the model can be found in the Jellyfish paper.

- **Developed by:** Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada
- **Contact:** [email protected]
- **Funded by:** NEC Corporation, Osaka University
- **Language(s) (NLP):** English
- **License:** Non-Commercial Creative Commons license (CC BY-NC-4.0)
- **Finetuned from model:** mistralai/Mistral-7B-Instruct-v0.2
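
The model can be used with the Hugging Face `transformers` library. Below is a minimal loading sketch; the repository id `NECOUDBFM/Jellyfish-7B` and the fp16/device settings are assumptions to verify against the model page and your environment.

```python
# Minimal loading sketch -- the repo id below is an assumption; verify it
# against the model page on the Hugging Face Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NECOUDBFM/Jellyfish-7B"  # assumed Hub repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision; a 7B model fits in ~14 GB this way
    device_map="auto",          # place weights on the available GPU(s)
)
```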

## Citation

If you find our work useful, please give us credit by citing:

```bibtex
@article{zhang2023jellyfish,
  title={Jellyfish: A Large Language Model for Data Preprocessing},
  author={Zhang, Haochen and Dong, Yuyang and Xiao, Chuan and Oyamada, Masafumi},
  journal={arXiv preprint arXiv:2312.01678},
  year={2023}
}
```

## Performance on seen tasks

| Task | Type | Dataset | Non-LLM SoTA¹ | GPT-3.5² | GPT-4² | Jellyfish-13B | Jellyfish-7B |
|------|------|---------|---------------|----------|--------|---------------|--------------|
| Entity Matching | Seen | Fodors-Zagats | 100 | 100 | 100 | 100 | 100 |
| Entity Matching | Seen | Beer | 94.37 | 96.30 | 100 | 96.77 | 96.55 |
| Entity Matching | Seen | iTunes-Amazon | 97.06 | 96.43 | 100 | 98.11 | 96.30 |
| Entity Matching | Seen | DBLP-ACM | 98.99 | 96.99 | 97.44 | 98.98 | 98.88 |
| Entity Matching | Seen | DBLP-GoogleScholar | 95.60 | 76.12 | 91.87 | 98.51 | 95.15 |
| Entity Matching | Seen | Amazon-Google | 75.58 | 66.53 | 74.21 | 81.34 | 80.83 |
| Entity Matching | Unseen | Walmart-Amazon | 86.76 | 86.17 | 90.27 | 89.42 | 85.64 |
| Entity Matching | Unseen | Abt-Buy | 89.33 | -- | 92.77 | 89.58 | 82.38 |
| Data Imputation | Seen | Restaurant | 77.20 | 94.19 | 97.67 | 94.19 | 88.37 |
| Data Imputation | Seen | Buy | 96.50 | 98.46 | 100 | 100 | 96.62 |
| Data Imputation | Unseen | Flipkart | 68.00 | -- | 89.94 | 81.68 | 79.44 |
| Data Imputation | Unseen | Phone | 86.70 | -- | 90.79 | 87.21 | 85.00 |
| Error Detection | Seen | Hospital | 94.40 | 90.74 | 90.74 | 95.59 | 96.27 |
| Error Detection | Seen | Adult | 99.10 | 92.01 | 92.01 | 99.33 | 91.96 |
| Error Detection | Unseen | Flights | 81.00 | -- | 83.48 | 82.52 | 66.92 |
| Error Detection | Unseen | Rayyan | 79.00 | -- | 81.95 | 90.65 | 69.82 |
| Schema Matching | Seen | Synthea | 38.50 | 57.14 | 66.67 | 36.36 | 44.44 |
| Schema Matching | Seen | MIMIC | 20.00 | -- | 40.00 | 40.00 | 40.00 |
| Schema Matching | Unseen | CMS | 50.00 | -- | 19.35 | 59.29 | 13.79 |

For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. For Jellyfish-13B and Jellyfish-Interpreter, the few-shot approach is disabled on seen datasets and enabled on unseen datasets.
We use accuracy as the metric for data imputation and the F1 score for the other tasks.

## Performance on unseen tasks

### Column Type Annotation

| Dataset | RoBERTa (159 shots)¹ | GPT-3.5¹ | GPT-4 | Jellyfish-13B | Jellyfish-7B |
|---------|----------------------|----------|-------|---------------|--------------|
| SOTAB | 79.20 | 89.47 | 91.55 | 82.00 | 80.89 |

Few-shot is disabled for Jellyfish-13B.

¹ Results from *Column Type Annotation using ChatGPT*

### Attribute Value Extraction

| Dataset | Stable Beluga 2 70B¹ | SOLAR 70B¹ | GPT-3.5¹ | GPT-4¹ | Jellyfish-13B | Jellyfish-7B |
|---------|----------------------|------------|----------|--------|---------------|--------------|
| AE-110k | 52.10 | 49.20 | 61.30 | 55.50 | 58.12 | 76.85 |
| OA-Mine | 50.80 | 55.20 | 62.70 | 68.90 | 55.96 | 76.04 |

## Prompt Template

```
[INST]

<prompt> (without the <>)

[/INST]
```
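
As a minimal sketch, assuming the `model` and `tokenizer` from the loading example above, a prompt can be wrapped in this template and passed to `generate`. The entity-matching instruction below is an illustrative placeholder, not the exact prompt used in the paper.

```python
# Wrap a task instruction in the [INST] ... [/INST] template and generate.
# Assumes `model` and `tokenizer` from the loading sketch above; the task
# text is illustrative, not the exact prompt from the paper.
task = (
    "Determine whether the two records refer to the same entity. "
    'Record A: [name: "iPhone 13, 128 GB"] '
    'Record B: [name: "Apple iPhone 13 128GB"] '
    "Answer with 'Yes' or 'No' only."
)
prompt = f"[INST]\n\n{task}\n\n[/INST]"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16, do_sample=False)

# Decode only the newly generated tokens, skipping the echoed prompt.
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```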