---
license: cc-by-nc-4.0
language:
- en
---
# Jellyfish-7B
<!-- Provide a quick summary of what the model is/does. -->
<!--
<img src="https://i.imgur.com/d8Bl04i.png" alt="PicToModel" width="330"/>
-->
<img src="https://i.imgur.com/E1vqCIw.png" alt="PicToModel" width="330"/>


## Model Details
Jellyfish-7B is a large language model with 7 billion parameters.   
We fine-tuned the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model on datasets pertinent to data preprocessing tasks.
The training data consist of two parts:
* the Jellyfish-13B training data
* GPT-4-generated reasoning data for data preprocessing tasks

The winning rate of Jellyfish-7B against GPT-3.5-turbo, as judged by GPT-4, is 56.36%.

More details about the model can be found in the [Jellyfish paper](https://arxiv.org/abs/2312.01678).

- **Developed by:** Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada  
- **Contact:** [email protected]  
- **Funded by:** NEC Corporation, Osaka University  
- **Language(s) (NLP):** English  
- **License:** Non-Commercial Creative Commons license (CC BY-NC-4.0)  
- **Finetuned from model:** [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) 
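
The snippet below is a minimal loading sketch using the `transformers` library; the repository id is an assumption and should be replaced with the actual Hugging Face path of this model.

```
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; substitute the actual Hugging Face path of Jellyfish-7B.
model_id = "NECOUDBFM/Jellyfish-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # requires the accelerate package
    torch_dtype="auto",  # load weights in the checkpoint's native precision
)
```
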
## Citation

If you find our work useful, please give us credit by citing:

```
@article{zhang2023jellyfish,
  title={Jellyfish: A Large Language Model for Data Preprocessing},
  author={Zhang, Haochen and Dong, Yuyang and Xiao, Chuan and Oyamada, Masafumi},
  journal={arXiv preprint arXiv:2312.01678},
  year={2023}
}
```

## Performance on seen tasks

| Task            | Type   | Dataset           | Non-LLM SoTA<sup>1</sup> | GPT-3.5<sup>2</sup> | GPT-4<sup>2</sup>  | GPT-4o | Table-GPT | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|-----------------|--------|-------------------|-----------------|--------|--------|--------|-----------|--------------|--------------|---------------|
| Error Detection | Seen   | Adult             | *99.10*         | 99.10  | 92.01  | 83.58  | --        | 77.40        | 73.74        | **99.33**     |
| Error Detection | Seen   | Hospital          | 94.40           | **97.80** | 90.74  | 44.76  | --        | 94.51        | 93.40        | *95.59*       |
| Error Detection | Unseen | Flights           | 81.00           | --     | **83.48** | 66.01  | --        | 69.15        | 66.21        | *82.52*       |
| Error Detection | Unseen | Rayyan            | 79.00           | --     | *81.95* | 68.53  | --        | 75.07        | 81.06        | **90.65**     |
| Data Imputation | Seen   | Buy               | 96.50           | 98.50  | **100** | **100** | --        | 98.46        | 98.46        | **100**       |
| Data Imputation | Seen   | Restaurant        | 77.20           | 88.40  | **97.67** | 90.70  | --        | 89.53        | 87.21        | 89.53         |
| Data Imputation | Unseen | Flipkart          | 68.00           | --     | **89.94** | 83.20  | --        | 87.14        | *87.48*      | 81.68         |
| Data Imputation | Unseen | Phone             | 86.70           | --     | **90.79** | 86.78  | --        | 86.52        | 85.68        | *87.21*       |
| Schema Matching | Seen   | MIMIC-III         | 20.00           | --     | 40.00   | 29.41  | --        | **53.33**    | *45.45*      | 40.00         |
| Schema Matching | Seen   | Synthea           | 38.50           | 45.20  | **66.67** | 6.56   | --        | 55.56        | 47.06        | 56.00         |
| Schema Matching | Unseen | CMS               | *50.00*         | --     | 19.35   | 22.22  | --        | 42.86        | 38.10        | **59.29**     |
| Entity Matching | Seen   | Amazon-Google     | 75.58           | 63.50  | 74.21  | 70.91  | 70.10     | **81.69**    | *81.42*      | 81.34         |
| Entity Matching | Seen   | Beer              | 94.37           | **100** | **100** | 90.32  | 96.30     | **100.00**   | **100.00**   | 96.77         |
| Entity Matching | Seen   | DBLP-ACM          | **98.99**       | 96.60  | 97.44  | 95.87  | 93.80     | 98.65        | 98.77        | *98.98*       |
| Entity Matching | Seen   | DBLP-GoogleScholar| *95.70*         | 83.80  | 91.87  | 90.45  | 92.40     | 94.88        | 95.03        | **98.51**     |
| Entity Matching | Seen   | Fodors-Zagats     | **100**         | **100** | **100** | 93.62  | **100**   | **100**      | **100**      | **100**       |
| Entity Matching | Seen   | iTunes-Amazon     | 97.06           | *98.20*| **100** | 98.18  | 94.30     | 96.30        | 96.30        | 98.11         |
| Entity Matching | Unseen | Abt-Buy           | 89.33           | --     | **92.77** | 78.73  | --        | 86.06        | 88.84        | *89.58*       |
| Entity Matching | Unseen | Walmart-Amazon    | 86.89           | 87.00  | **90.27** | 79.19  | 82.40     | 84.91        | 85.24        | *89.42*       |
| Avg             |        |                   | 80.44           | -      | *84.17* | 72.58  | -         | 82.74        | 81.55        | **86.02**     |

_For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. For the Jellyfish models, few-shot prompting is disabled on seen datasets and enabled on unseen datasets._   
_Accuracy is the metric for data imputation; the F1 score is used for all other tasks._ 

1. Non-LLM SoTA methods:  
   [Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching  
   [SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching  
   [HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection (seen datasets)  
   [RAHA](https://dl.acm.org/doi/10.1145/3299869.3324956) for Error Detection (unseen datasets)  
   [IPM](https://ieeexplore.ieee.org/document/9458712) for Data Imputation
2. [Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)
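
As a reference for the metrics mentioned above, here is a toy sketch of how accuracy and F1 could be computed with scikit-learn; the labels are made up, and this is not the evaluation script used in the paper.

```
from sklearn.metrics import accuracy_score, f1_score

# Toy gold labels and predictions for a binary matching task (1 = match, 0 = non-match).
# For data imputation, accuracy is computed analogously over exact matches of imputed values.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

print("Accuracy (data imputation metric):", accuracy_score(y_true, y_pred))
print("F1 score (metric for other tasks):", f1_score(y_true, y_pred))
```
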

## Performance on unseen tasks

### Column Type Annotation

| Dataset           | RoBERTa (159 shots)<sup>1</sup> | GPT-3.5<sup>1</sup> | GPT-4  | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|--------|-----------------|--------|--------|--------|--------------|--------------|---------------|
| SOTAB | 79.20 | 89.47 | 91.55 | 65.05 | 83 | 76.33 | 82 |

_Few-shot prompting is disabled for the Jellyfish models._   

1. Results from [Column Type Annotation using ChatGPT](https://arxiv.org/abs/2306.00745)

### Attribute Value Extraction

| Dataset | Stable Beluga 2 70B<sup>1</sup> | SOLAR 70B<sup>1</sup> | GPT-3.5<sup>1</sup> | GPT-4<sup>1</sup> | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|---------|--------------------------------|----------------------|---------------------|-------------------|--------|--------------|--------------|---------------|
| AE-110k | 52.10 | 49.20 | 61.30 | 55.50 | 55.77 | 56.09 | 59.55 | 58.12 |
| OA-Mine | 50.80 | 55.20 | 62.70 | 68.90 | 60.20 | 51.98 | 59.22 | 55.96 |

_Few-shot prompting is disabled for the Jellyfish models._   

1. Results from [Product Attribute Value Extraction using Large Language Models](https://arxiv.org/abs/2310.12537)


## Prompt Template
```
[INST]

<prompt> (without the <>)

[/INST]
```
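
A minimal end-to-end sketch that wraps a task prompt in the template above and generates a response; the repository id and the example instruction are illustrative assumptions, not taken from the paper.

```
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NECOUDBFM/Jellyfish-7B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# Illustrative entity-matching instruction, wrapped in the [INST] ... [/INST] template.
task = (
    "You are tasked with determining whether two products are the same. "
    'Product A: [name: "instant immersion spanish deluxe 2.0"] '
    'Product B: [name: "instant immersion spanish deluxe 2.0 (windows)"] '
    "Answer only Yes or No."
)
prompt = f"[INST] {task} [/INST]"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
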