---
license: cc-by-nc-4.0
language:
- en
---
Jellyfish-7B
Model Details
Jellyfish-7B is a large language model with 7 billion parameters.
We fine-tuned the mistralai/Mistral-7B-Instruct-v0.2 model on datasets tailored to data preprocessing tasks.
The training data consist of two parts:
- the Jellyfish-13B training data
- GPT-4-generated reasoning data for data preprocessing tasks
More details about the model can be found in the Jellyfish paper.
- Developed by: Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada
- Contact: [email protected]
- Funded by: NEC Corporation, Osaka University
- Language(s) (NLP): English
- License: Non-Commercial Creative Commons license (CC BY-NC-4.0)
- Finetuned from model: mistralai/Mistral-7B-Instruct-v0.2
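As a quick reference, below is a minimal loading sketch using the standard transformers API. It assumes the model is published on the Hugging Face Hub under the repository id NECOUDBFM/Jellyfish-7B; adjust the id if it differs.

```python
# Minimal loading sketch (assumption: repository id is NECOUDBFM/Jellyfish-7B).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NECOUDBFM/Jellyfish-7B"  # assumed repository id; adjust if it differs
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the 7B model fits on a single GPU
    device_map="auto",          # requires the accelerate package
)
```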
Citation
If you find our work useful, please give us credit by citing:
```
@article{zhang2023jellyfish,
  title={Jellyfish: A Large Language Model for Data Preprocessing},
  author={Zhang, Haochen and Dong, Yuyang and Xiao, Chuan and Oyamada, Masafumi},
  journal={arXiv preprint arXiv:2312.01678},
  year={2023}
}
```
Performance on seen tasks
Task | Type | Dataset | Non-LLM SoTA | GPT-3.5 | GPT-4 | Jellyfish-13B | Jellyfish-7B |
---|---|---|---|---|---|---|---|
Entity Matching | Seen | Fodors-Zagats | 100 | 100 | 100 | 100 | 100 |
Entity Matching | Seen | Beer | 94.37 | 96.30 | 100 | 96.77 | 96.55 |
Entity Matching | Seen | iTunes-Amazon | 97.06 | 96.43 | 100 | 98.11 | 96.30 |
Entity Matching | Seen | DBLP-ACM | 98.99 | 96.99 | 97.44 | 98.98 | 98.88 |
Entity Matching | Seen | DBLP-GoogleScholar | 95.60 | 76.12 | 91.87 | 98.51 | 95.15 |
Entity Matching | Seen | Amazon-Google | 75.58 | 66.53 | 74.21 | 81.34 | 80.83 |
Entity Matching | Unseen | Walmart-Amazon | 86.76 | 86.17 | 90.27 | 89.42 | 85.64 |
Entity Matching | Unseen | Abt-Buy | 89.33 | -- | 92.77 | 89.58 | 82.38 |
Data Imputation | Seen | Restaurant | 77.20 | 94.19 | 97.67 | 94.19 | 88.37 |
Data Imputation | Seen | Buy | 96.50 | 98.46 | 100 | 100 | 96.62 |
Data Imputation | Unseen | Flipkart | 68.00 | -- | 89.94 | 81.68 | 79.44 |
Data Imputation | Unseen | Phone | 86.70 | -- | 90.79 | 87.21 | 85.00 |
Error Detection | Seen | Hospital | 94.40 | 90.74 | 90.74 | 95.59 | 96.27 |
Error Detection | Seen | Adult | 99.10 | 92.01 | 92.01 | 99.33 | 91.96 |
Error Detection | Unseen | Flights | 81.00 | -- | 83.48 | 82.52 | 66.92 |
Error Detection | Unseen | Rayyan | 79.00 | -- | 81.95 | 90.65 | 69.82 |
Schema Matching | Seen | Synthea | 38.50 | 57.14 | 66.67 | 36.36 | 44.44 |
Schema Matching | Seen | MIMIC | 20.00 | -- | 40.00 | 40.00 | 40.00 |
Schema Matching | Unseen | CMS | 50.00 | -- | 19.35 | 59.29 | 13.79 |
For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. For the Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets.
Accuracy is used as the metric for data imputation; the F1 score is used for the other tasks.
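As an illustration of how such scores can be computed from model outputs, the sketch below derives an F1 score from Yes/No matching decisions and an exact-match accuracy for imputed values. It is not the authors' evaluation script; the example data and the answer-parsing rule are assumptions for demonstration.

```python
# Illustrative metric sketch only; not the official Jellyfish evaluation code.
from sklearn.metrics import accuracy_score, f1_score

# Entity matching / error detection style outputs: binary Yes/No decisions.
gold_labels = [1, 0, 1, 1, 0]                      # hypothetical ground truth
model_answers = ["Yes", "No", "No", "Yes", "No"]   # hypothetical model outputs
predictions = [1 if a.strip().lower().startswith("yes") else 0 for a in model_answers]
print("F1:", f1_score(gold_labels, predictions))

# Data imputation: exact-match accuracy on the imputed value.
gold_values = ["new york", "tokyo", "osaka"]       # hypothetical ground truth
imputed_values = ["new york", "kyoto", "osaka"]    # hypothetical model outputs
print("Accuracy:", accuracy_score(gold_values, imputed_values))
```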
Performance on unseen tasks
Column Type Annotation
Dataset | RoBERTa (159 shots) | GPT-3.5 | GPT-4 | Jellyfish-13B | Jellyfish-7B |
---|---|---|---|---|---|
SOTAB | 79.20 | 89.47 | 91.55 | 82.00 | 80.89 |
Few-shot is disabled for the Jellyfish models.
The RoBERTa and GPT-3.5 results are taken from Column Type Annotation using ChatGPT.
Attribute Value Extraction
Dataset | Stable Beluga 2 70B | SOLAR 70B | GPT-3.5 | GPT-4 | Jellyfish-13B | Jellyfish-7B |
---|---|---|---|---|---|---|
AE-110k | 52.10 | 49.20 | 61.30 | 55.50 | 58.12 | 76.85 |
OA-Mine | 50.80 | 55.20 | 62.70 | 68.90 | 55.96 | 76.04 |
Prompt Template
```
[INST]
<prompt> (without the <>)
[/INST]
```
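Below is a minimal usage sketch of this template: the task prompt is wrapped in [INST] ... [/INST] and passed to the model for greedy decoding. The entity-matching prompt is an illustrative example rather than an official Jellyfish prompt, and the repository id NECOUDBFM/Jellyfish-7B is an assumption.

```python
# Usage sketch of the prompt template (prompt content and repo id are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NECOUDBFM/Jellyfish-7B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

task_prompt = (
    "You are tasked with determining whether two records refer to the same product.\n"
    'Record A: [name: "instant immersion spanish deluxe 2.0"]\n'
    'Record B: [name: "instant immersion spanish deluxe"]\n'
    "Answer only with 'Yes' or 'No'."
)
prompt = f"[INST] {task_prompt} [/INST]"  # wrap the task prompt in the template above

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
answer = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer.strip())
```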