Commit 32e6cae by chuanxiao1983 (1 parent: 6d4ddfd)
Update README.md
README.md CHANGED
@@ -10,17 +10,15 @@ language:
|
|
10 |
-->
|
11 |
<img src="https://i.imgur.com/E1vqCIw.png" alt="PicToModel" width="330"/>
|
12 |
|
13 |
-
|
14 |
[Jellyfish-8B](https://huggingface.co/NECOUDBFM/Jellyfish-8B)
|
15 |
[Jellyfish-13B](https://huggingface.co/NECOUDBFM/Jellyfish-13B)
|
16 |
|
17 |
## Model Details
|
18 |
Jellyfish-7B is a large language model equipped with 7 billion parameters.
|
19 |
-
We fine-tuned the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model using
|
20 |
-
a subset of the [Jellyfish-Instruct](https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct)
|
21 |
|
22 |
-
|
23 |
-
Jellyfish-7B vs GPT-3.5-turbo wining rate by GPT4 evaluation is 56.36%.
|
24 |
|
25 |
More details about the model can be found in the [Jellyfish paper](https://arxiv.org/abs/2312.01678).
|
26 |
|
@@ -68,15 +66,15 @@ If you find our work useful, please give us credit by citing:
|
|
68 |
| Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | **90.27** | 79.19 | 82.40 | 84.91 | 85.24 | *89.42* |
|
69 |
| Avg | | | 80.44 | - | *84.17* | 72.58 | - | 82.74 | 81.55 | **86.02** |
|
70 |
|
71 |
-
_For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets.
|
72 |
_Accuracy as the metric for data imputation and the F1 score for other tasks._
|
73 |
|
74 |
1.
|
75 |
-
[Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
|
76 |
-
[SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching
|
77 |
[HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection seen datasets
|
78 |
[RAHA](https://dl.acm.org/doi/10.1145/3299869.3324956) for Error Detection unseen datasets
|
79 |
-
[IPM](https://ieeexplore.ieee.org/document/9458712) for Data Imputation
|
80 |
2.
|
81 |
[Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)
|
82 |
|
@@ -117,7 +115,7 @@ _Few-shot is disabled for Jellyfish models._
|
|
117 |
|
118 |
## Prompts
|
119 |
|
120 |
-
We provide the prompts used for both
|
121 |
You can structure your data according to these prompts.
|
122 |
|
123 |
### System Message
|
@@ -126,36 +124,6 @@ You are an AI assistant that follows instruction extremely well.
|
|
126 |
User will give you a question. Your task is to answer as faithfully as you can.
|
127 |
```
|
128 |
|
129 |
-
### For Entity Matching
|
130 |
-
```
|
131 |
-
You are tasked with determining whether two records listed below are the same based on the information provided.
|
132 |
-
Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
|
133 |
-
Note that missing values (N/A or \"nan\") should not be used as a basis for your decision.
|
134 |
-
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
|
135 |
-
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
|
136 |
-
Are record A and record B the same entity? Choose your answer from: [Yes, No].
|
137 |
-
```
|
138 |
-
|
139 |
-
### For Data Imputation
|
140 |
-
```
|
141 |
-
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
|
142 |
-
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
|
143 |
-
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
|
144 |
-
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
|
145 |
-
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
|
146 |
-
Answer only the value of {attribute X}.
|
147 |
-
```
|
148 |
-
|
149 |
-
### For Data Imputation
|
150 |
-
```
|
151 |
-
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
|
152 |
-
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
|
153 |
-
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
|
154 |
-
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
|
155 |
-
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
|
156 |
-
Answer only the value of {attribute X}.
|
157 |
-
```
|
158 |
-
|
159 |
### For Error Detection
|
160 |
_There are two forms of the error detection task.
|
161 |
In the first form, a complete record row is provided, and the task is to determine if a specific value is erroneous.
|
@@ -178,6 +146,16 @@ Attribute for Verification: [{attribute X}: {attribute X value}]
|
|
178 |
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
|
179 |
```
|
180 |
|
181 |
### For Schema Matching
|
182 |
```
|
183 |
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
|
@@ -188,6 +166,16 @@ Attribute B is [name: {value of name}, description: {value of description}].
|
|
188 |
Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
|
189 |
```
|
190 |
|
191 |
### For Column Type Annotation
|
192 |
|
193 |
We follow the prompt in [Column Type Annotation using ChatGPT](https://arxiv.org/abs/2306.00745) (text+inst+2-step).
|
|
|
10 |
-->
|
11 |
<img src="https://i.imgur.com/E1vqCIw.png" alt="PicToModel" width="330"/>
|
12 |
|
13 |
+
Jellyfish models of other sizes are available here:
|
14 |
[Jellyfish-8B](https://huggingface.co/NECOUDBFM/Jellyfish-8B)
|
15 |
[Jellyfish-13B](https://huggingface.co/NECOUDBFM/Jellyfish-13B)
|
16 |
|
17 |
## Model Details
|
18 |
Jellyfish-7B is a large language model equipped with 7 billion parameters.
|
19 |
+
We fine-tuned the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model using a subset of the [Jellyfish-Instruct](https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct) dataset.
|
|
|
20 |
|
21 |
+
For interpretability, the winning rate of Jellyfish-7B against GPT-3.5-turbo (evaluated by GPT-4) is 56.36%.
|
|
|
22 |
|
23 |
More details about the model can be found in the [Jellyfish paper](https://arxiv.org/abs/2312.01678).
|
24 |
|
|
|
66 |
| Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | **90.27** | 79.19 | 82.40 | 84.91 | 85.24 | *89.42* |
|
67 |
| Avg | | | 80.44 | - | *84.17* | 72.58 | - | 82.74 | 81.55 | **86.02** |
|
68 |
|
69 |
+
_For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. For Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets._
|
70 |
_Accuracy as the metric for data imputation and the F1 score for other tasks._
|
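The two metrics named in the note above can be sketched as follows. This is a minimal illustration, not the evaluation script behind the table: the function names are hypothetical, and treating `"Yes"` as the positive class for F1 is an assumption based on the `[Yes, No]` answer format of the prompts.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the gold labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def f1_score(predictions, labels, positive="Yes"):
    """Binary F1 for the positive class (assumed here to be "Yes")."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be equal in length")
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Both functions return values in the range 0 to 1; the table appears to report them as percentages, so multiply by 100 to compare.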
71 |
|
72 |
1.
|
73 |
[HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection seen datasets
|
74 |
[RAHA](https://dl.acm.org/doi/10.1145/3299869.3324956) for Error Detection unseen datasets
|
75 |
+
[IPM](https://ieeexplore.ieee.org/document/9458712) for Data Imputation
|
76 |
+
[SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching
|
77 |
+
[Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
|
78 |
2.
|
79 |
[Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)
|
80 |
|
|
|
115 |
|
116 |
## Prompts
|
117 |
|
118 |
+
We provide the prompts used for both fine-tuning and inference.
|
119 |
You can structure your data according to these prompts.
|
120 |
|
121 |
### System Message
|
|
|
124 |
User will give you a question. Your task is to answer as faithfully as you can.
|
125 |
```
|
126 |
|
127 |
### For Error Detection
|
128 |
_There are two forms of the error detection task.
|
129 |
In the first form, a complete record row is provided, and the task is to determine if a specific value is erroneous.
|
|
|
146 |
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
|
147 |
```
|
148 |
|
149 |
+
### For Data Imputation
|
150 |
+
```
|
151 |
+
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
|
152 |
+
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
|
153 |
+
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
|
154 |
+
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
|
155 |
+
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
|
156 |
+
Answer only the value of {attribute X}.
|
157 |
+
```
|
158 |
+
|
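As an illustration of how the data imputation template might be filled from a record, here is a minimal sketch. The helper name `build_imputation_prompt` and the dictionary-based record format are assumptions made for this example; listing every known attribute name in the third sentence is also an interpretation of the `{attribute 1}, {attribute 2}, ...` placeholder.

```python
def build_imputation_prompt(keyword, record, missing_attribute):
    """Fill the data imputation template for one record (hypothetical helper)."""
    # Leave the attribute to be inferred out of the record shown to the model.
    known = {k: v for k, v in record.items() if k != missing_attribute}
    fields = ", ".join(known)
    record_text = ", ".join(f"{k}: {v}" for k, v in known.items())
    return "\n".join([
        f"You are presented with a {keyword} record that is missing a specific attribute: {missing_attribute}.",
        f"Your task is to deduce or infer the value of {missing_attribute} using the available information in the record.",
        f"You may be provided with fields like {fields} to help you in the inference.",
        f"Record: [{record_text}]",
        f"Based on the provided record, what would you infer is the value for the missing attribute {missing_attribute}?",
        f"Answer only the value of {missing_attribute}.",
    ])


prompt = build_imputation_prompt(
    "restaurant",
    {"name": "Joe's", "city": "Boston", "phone": None},
    "phone",
)
print(prompt)
```

The resulting string would be sent as the user message, together with the system message given earlier in the README.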
159 |
### For Schema Matching
|
160 |
```
|
161 |
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
|
|
|
166 |
Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
|
167 |
```
|
168 |
|
169 |
+
### For Entity Matching
|
170 |
+
```
|
171 |
+
You are tasked with determining whether two records listed below are the same based on the information provided.
|
172 |
+
Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
|
173 |
+
Note that missing values (N/A or \"nan\") should not be used as a basis for your decision.
|
174 |
+
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
|
175 |
+
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
|
176 |
+
Are record A and record B the same entity? Choose your answer from: [Yes, No].
|
177 |
+
```
|
178 |
+
|
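The entity matching template can be filled in a similar way. The sketch below is only an illustration: `format_record`, `build_entity_matching_prompt`, and `parse_yes_no` are hypothetical helpers, records are assumed to be dictionaries sharing the same attribute names, and the `\"nan\"` in the template is assumed to be an escaped form of `"nan"`.

```python
def format_record(record):
    """Render a record as [attr: value, attr: value, ...]."""
    return "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"


def build_entity_matching_prompt(record_a, record_b):
    """Fill the entity matching template for a pair of records."""
    # Attribute names are taken from record A; both records are assumed to share them.
    attributes = ", ".join(record_a)
    return "\n".join([
        "You are tasked with determining whether two records listed below are the same based on the information provided.",
        f"Carefully compare the {attributes} for each record before making your decision.",
        'Note that missing values (N/A or "nan") should not be used as a basis for your decision.',
        f"Record A: {format_record(record_a)}",
        f"Record B: {format_record(record_b)}",
        "Are record A and record B the same entity? Choose your answer from: [Yes, No].",
    ])


def parse_yes_no(answer):
    """Map a model reply to True/False, or None if neither option is recognized."""
    text = answer.strip().lower()
    if text.startswith("yes"):
        return True
    if text.startswith("no"):
        return False
    return None
```

Returning `None` for unrecognized replies keeps malformed outputs visible rather than silently counting them as "No"; how such cases are scored is left to the caller.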
179 |
### For Column Type Annotation
|
180 |
|
181 |
We follow the prompt in [Column Type Annotation using ChatGPT](https://arxiv.org/abs/2306.00745) (text+inst+2-step).
|