yuyangdong committed
Commit ea8a21f
1 Parent(s): 945f77f
Update README.md
README.md CHANGED
@@ -1,3 +1,96 @@
---
license: cc-by-nc-4.0
language:
- en
---
# Jellyfish-7B
<!-- Provide a quick summary of what the model is/does. -->
<!--
<img src="https://i.imgur.com/d8Bl04i.png" alt="PicToModel" width="330"/>
-->
<img src="https://i.imgur.com/E1vqCIw.png" alt="PicToModel" width="330"/>


## Model Details
Jellyfish-7B is a large language model with 7 billion parameters.
We fine-tuned the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model on datasets for data preprocessing tasks.
The training data consists of two parts:
* Jellyfish-13B training data
* GPT-4-generated reasoning data for data preprocessing tasks

More details about the model can be found in the [Jellyfish paper](https://arxiv.org/abs/2312.01678).

- **Developed by:** Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada
- **Contact: [email protected]**
- **Funded by:** NEC Corporation, Osaka University
- **Language(s) (NLP):** English
- **License:** Non-Commercial Creative Commons license (CC BY-NC-4.0)
- **Finetuned from model:** [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
## Citation

If you find our work useful, please give us credit by citing:

```
@article{zhang2023jellyfish,
  title={Jellyfish: A Large Language Model for Data Preprocessing},
  author={Zhang, Haochen and Dong, Yuyang and Xiao, Chuan and Oyamada, Masafumi},
  journal={arXiv preprint arXiv:2312.01678},
  year={2023}
}
```

## Performance on seen tasks

| Task | Type | Dataset | Non-LLM SoTA<sup>1</sup> | GPT-3.5<sup>2</sup> | GPT-4<sup>2</sup> | Jellyfish-13B | Jellyfish-7B |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Entity Matching | Seen | Fodors-Zagats | 100 | 100 | 100 | 100 | 100 |
| Entity Matching | Seen | Beer | 94.37 | 96.30 | 100 | 96.77 | 96.55 |
| Entity Matching | Seen | iTunes-Amazon | 97.06 | 96.43 | 100 | 98.11 | 96.30 |
| Entity Matching | Seen | DBLP-ACM | 98.99 | 96.99 | 97.44 | 98.98 | 98.88 |
| Entity Matching | Seen | DBLP-GoogleScholar | 95.60 | 76.12 | 91.87 | 98.51 | 95.15 |
| Entity Matching | Seen | Amazon-Google | 75.58 | 66.53 | 74.21 | 81.34 | 80.83 |
| Entity Matching | Unseen | Walmart-Amazon | 86.76 | 86.17 | 90.27 | 89.42 | 85.64 |
| Entity Matching | Unseen | Abt-Buy | 89.33 | -- | 92.77 | 89.58 | 82.38 |
| Data Imputation | Seen | Restaurant | 77.20 | 94.19 | 97.67 | 94.19 | 88.37 |
| Data Imputation | Seen | Buy | 96.50 | 98.46 | 100 | 100 | 96.62 |
| Data Imputation | Unseen | Flipkart | 68.00 | -- | 89.94 | 81.68 | 79.44 |
| Data Imputation | Unseen | Phone | 86.70 | -- | 90.79 | 87.21 | 85.00 |
| Error Detection | Seen | Hospital | 94.40 | 90.74 | 90.74 | 95.59 | 96.27 |
| Error Detection | Seen | Adult | 99.10 | 92.01 | 92.01 | 99.33 | 91.96 |
| Error Detection | Unseen | Flights | 81.00 | -- | 83.48 | 82.52 | 66.92 |
| Error Detection | Unseen | Rayyan | 79.00 | -- | 81.95 | 90.65 | 69.82 |
| Schema Matching | Seen | Synthea | 38.50 | 57.14 | 66.67 | 36.36 | 44.44 |
| Schema Matching | Seen | MIMIC | 20.00 | -- | 40.00 | 40.00 | 40.00 |
| Schema Matching | Unseen | CMS | 50.00 | -- | 19.35 | 59.29 | 13.79 |

_For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. However, for Jellyfish-13B and Jellyfish-Interpreter, the few-shot approach is disabled on seen datasets and enabled on unseen datasets._
_Accuracy is the metric for data imputation; the F1 score is used for the other tasks._
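
For readers who want to see how the two metrics named in the note above are typically computed, here is a minimal sketch. It is not the authors' evaluation script: the function names, the `"Yes"`/`"No"` label strings, and the choice to compute F1 on the positive class only are assumptions made for illustration. Scores are scaled to 0-100 to match the tables.

```python
def accuracy(predictions, labels):
    """Percentage of predictions that exactly match the gold labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def f1_score(predictions, labels, positive="Yes"):
    """F1 on the positive class, scaled to 0-100.

    `positive` is a hypothetical label string; the README does not
    specify how answers are encoded.
    """
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)
```

Under these assumptions, accuracy would apply to data imputation (exact match of the imputed value) and F1 to the matching and detection tasks, where the positive class is usually rare.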

## Performance on unseen tasks

### Column Type Annotation

| Dataset | RoBERTa (159 shots)<sup>1</sup> | GPT-3.5<sup>1</sup> | GPT-4 | Jellyfish-13B | Jellyfish-7B |
| ---- | ---- | ---- | ---- | ---- | ---- |
| SOTAB | 79.20 | 89.47 | 91.55 | 82.00 | 80.89 |

_Few-shot is disabled for Jellyfish-13B._

1. Results from [Column Type Annotation using ChatGPT](https://arxiv.org/abs/2306.00745)

### Attribute Value Extraction

| Dataset | Stable Beluga 2 70B<sup>1</sup> | SOLAR 70B<sup>1</sup> | GPT-3.5<sup>1</sup> | GPT-4<sup>1</sup> | Jellyfish-13B | Jellyfish-7B |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| AE-110k | 52.10 | 49.20 | 61.30 | 55.50 | 58.12 | 76.85 |
| OA-Mine | 50.80 | 55.20 | 62.70 | 68.90 | 55.96 | 76.04 |


## Prompt Template
```
[INST]:

<prompt> (without the <>)

[\INST]]
```
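
As a rough illustration of how the template above could be filled in programmatically, the sketch below wraps a task description in the delimiters exactly as they are printed in this README. The helper name `build_prompt` and the example entity-matching text are hypothetical; the README does not prescribe any particular task wording.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a task description in the prompt template from this README.

    The opening and closing delimiters are copied verbatim from the
    template, including the backslash and the doubled closing bracket;
    they are intentionally not normalized here.
    """
    return "[INST]:\n\n" + user_message + "\n\n[\\INST]]"


# Hypothetical entity-matching request, for illustration only.
example = build_prompt(
    "Do the two product records refer to the same entity? "
    "Record A: [name: \"iPhone 13\"]. Record B: [name: \"Apple iPhone 13\"]."
)
print(example)
```

The resulting string would then be passed to whichever inference library is used to run the model.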