chuanxiao1983 committed on
Commit 32e6cae
1 Parent(s): 6d4ddfd

Update README.md

Files changed (1):
  1. README.md +28 -40
README.md CHANGED
@@ -10,17 +10,15 @@ language:
  -->
  <img src="https://i.imgur.com/E1vqCIw.png" alt="PicToModel" width="330"/>
 
- Other versions of Jellyfish:
  [Jellyfish-8B](https://huggingface.co/NECOUDBFM/Jellyfish-8B)
  [Jellyfish-13B](https://huggingface.co/NECOUDBFM/Jellyfish-13B)
 
  ## Model Details
  Jellyfish-7B is a large language model equipped with 7 billion parameters.
- We fine-tuned the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model using
- a subset of the [Jellyfish-Instruct](https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct)
 
-
- Jellyfish-7B vs GPT-3.5-turbo wining rate by GPT4 evaluation is 56.36%.
 
  More details about the model can be found in the [Jellyfish paper](https://arxiv.org/abs/2312.01678).
 
@@ -68,15 +66,15 @@ If you find our work useful, please give us credit by citing:
  | Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | **90.27** | 79.19 | 82.40 | 84.91 | 85.24 | *89.42* |
  | Avg | | | 80.44 | - | *84.17* | 72.58 | - | 82.74 | 81.55 | **86.02** |
 
- _For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. However, for Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets._
  _Accuracy as the metric for data imputation and the F1 score for other tasks._
 
  1.
- [Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
- [SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching
  [HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection seen datasets
  [RAHA](https://dl.acm.org/doi/10.1145/3299869.3324956) for Error Detection unseen datasets
- [IPM](https://ieeexplore.ieee.org/document/9458712) for Data Imputation
 
  2.
  [Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)
 
@@ -117,7 +115,7 @@ _Few-shot is disabled for Jellyfish models._
 
  ## Prompts
 
- We provide the prompts used for both the model's fine-tuning and inference.
  You can structure your data according to these prompts.
 
  ### System Message
@@ -126,36 +124,6 @@ You are an AI assistant that follows instruction extremely well.
  User will give you a question. Your task is to answer as faithfully as you can.
  ```
 
- ### For Entity Matching
- ```
- You are tasked with determining whether two records listed below are the same based on the information provided.
- Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
- Note that missing values (N/A or \"nan\") should not be used as a basis for your decision.
- Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
- Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
- Are record A and record B the same entity? Choose your answer from: [Yes, No].
- ```
-
- ### For Data Imputation
- ```
- You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
- Your task is to deduce or infer the value of {attribute X} using the available information in the record.
- You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
- Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
- Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
- Answer only the value of {attribute X}.
- ```
-
- ### For Data Imputation
- ```
- You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
- Your task is to deduce or infer the value of {attribute X} using the available information in the record.
- You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
- Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
- Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
- Answer only the value of {attribute X}.
- ```
-
  ### For Error Detection
  _There are two forms of the error detection task.
  In the first form, a complete record row is provided, and the task is to determine if a specific value is erroneous.
@@ -178,6 +146,16 @@ Attribute for Verification: [{attribute X}: {attribute X value}]
  Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
  ```
 
  ### For Schema Matching
  ```
  Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
@@ -188,6 +166,16 @@ Attribute B is [name: {value of name}, description: {value of description}].
  Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
  ```
 
  ### For Column Type Annotation
 
  We follow the prompt in [Column Type Annotation using ChatGPT](https://arxiv.org/abs/2306.00745) (text+inst+2-step).
 
  -->
  <img src="https://i.imgur.com/E1vqCIw.png" alt="PicToModel" width="330"/>
 
+ Jellyfish models of other sizes are available here:
  [Jellyfish-8B](https://huggingface.co/NECOUDBFM/Jellyfish-8B)
  [Jellyfish-13B](https://huggingface.co/NECOUDBFM/Jellyfish-13B)
 
  ## Model Details
  Jellyfish-7B is a large language model equipped with 7 billion parameters.
+ We fine-tuned the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model using a subset of the [Jellyfish-Instruct](https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct) dataset.
 
+ For reference, the winning rate of Jellyfish-7B against GPT-3.5-turbo, as judged by GPT-4, is 56.36%.
 
  More details about the model can be found in the [Jellyfish paper](https://arxiv.org/abs/2312.01678).
 
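Since the base model is Mistral-7B-Instruct-v0.2, prompts at inference time must follow Mistral's instruction format. The sketch below assumes the common convention of folding the system message into the user turn, because Mistral-Instruct has no separate system role; in practice, prefer the tokenizer's `apply_chat_template` over hand-built strings.

```python
def build_mistral_prompt(system_message: str, instruction: str) -> str:
    """Wrap a system message and a task instruction in Mistral's [INST] format.

    Mistral-Instruct defines no system role, so the system message is
    prepended to the user turn -- a common convention, not the only one.
    """
    return f"<s>[INST] {system_message}\n{instruction} [/INST]"

# Example: combine the Jellyfish system message with a task prompt.
prompt = build_mistral_prompt(
    "You are an AI assistant that follows instruction extremely well.",
    "Are record A and record B the same entity? Choose your answer from: [Yes, No].",
)
```

The model's generated answer follows the closing `[/INST]` tag.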
 
  | Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | **90.27** | 79.19 | 82.40 | 84.91 | 85.24 | *89.42* |
  | Avg | | | 80.44 | - | *84.17* | 72.58 | - | 82.74 | 81.55 | **86.02** |
 
+ _For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. For Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets._
  _Accuracy as the metric for data imputation and the F1 score for other tasks._
 
  1.
  [HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection seen datasets
  [RAHA](https://dl.acm.org/doi/10.1145/3299869.3324956) for Error Detection unseen datasets
+ [IPM](https://ieeexplore.ieee.org/document/9458712) for Data Imputation
+ [SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching
+ [Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
  2.
  [Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)
 
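The note above states that accuracy is the metric for data imputation and F1 for the other tasks. As a self-contained illustration (not the paper's evaluation code), a binary F1 over the model's Yes/No answers can be computed as:

```python
def binary_f1(y_true, y_pred, positive="Yes"):
    """F1 score over paired answer lists, treating `positive` as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```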
 
  ## Prompts
 
+ We provide the prompts used for both fine-tuning and inference.
  You can structure your data according to these prompts.
 
  ### System Message
  User will give you a question. Your task is to answer as faithfully as you can.
  ```
 
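The task prompts below all embed records in the form `[{attribute 1}: {attribute 1 value}, ...]`. A small helper (hypothetical, not part of the released code) for serializing a Python dict into that shape:

```python
def serialize_record(record: dict) -> str:
    """Render a record as "[attr: value, attr: value, ...]" per the prompt templates."""
    return "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"

serialize_record({"name": "iPhone 12", "price": "799"})
# -> "[name: iPhone 12, price: 799]"
```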
  ### For Error Detection
  _There are two forms of the error detection task.
  In the first form, a complete record row is provided, and the task is to determine if a specific value is erroneous.
  Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
  ```
 
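The second (attribute-verification) form of the error detection prompt can be assembled programmatically. This is a sketch based only on the template fragments visible in this diff; the full template in the repository may contain additional context lines.

```python
def error_detection_prompt(attribute: str, value) -> str:
    """Build the attribute-verification form of the error detection prompt (sketch)."""
    return (
        f"Attribute for Verification: [{attribute}: {value}]\n"
        f"Question: Is there an error in the value of {attribute}? "
        "Choose your answer from: [Yes, No]."
    )
```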
+ ### For Data Imputation
+ ```
+ You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
+ Your task is to deduce or infer the value of {attribute X} using the available information in the record.
+ You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
+ Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
+ Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
+ Answer only the value of {attribute X}.
+ ```
+
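A minimal sketch for filling the data imputation template from a Python record (the helper name is ours; the hint line is populated from the record's own keys):

```python
def imputation_prompt(keyword: str, record: dict, missing_attr: str) -> str:
    """Fill the data imputation template above with a record and the missing attribute."""
    fields = ", ".join(f"{k}: {v}" for k, v in record.items())
    hints = ", ".join(record.keys())
    return (
        f"You are presented with a {keyword} record that is missing a specific attribute: {missing_attr}.\n"
        f"Your task is to deduce or infer the value of {missing_attr} using the available information in the record.\n"
        f"You may be provided with fields like {hints} to help you in the inference.\n"
        f"Record: [{fields}]\n"
        f"Based on the provided record, what would you infer is the value for the missing attribute {missing_attr}?\n"
        f"Answer only the value of {missing_attr}."
    )
```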
  ### For Schema Matching
  ```
  Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
  Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
  ```
 
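A sketch for filling the schema matching template; the `Attribute A is [...]` line is inferred by symmetry with the `Attribute B is [name: ..., description: ...]` line visible earlier in this diff, so treat it as an assumption.

```python
def schema_matching_prompt(attr_a: dict, attr_b: dict) -> str:
    """Fill the schema matching template with two attribute descriptions (sketch)."""
    return (
        "Your task is to determine if the two attributes (columns) are semantically "
        "equivalent in the context of merging two tables.\n"
        f"Attribute A is [name: {attr_a['name']}, description: {attr_a['description']}].\n"
        f"Attribute B is [name: {attr_b['name']}, description: {attr_b['description']}].\n"
        "Are Attribute A and Attribute B semantically equivalent? "
        "Choose your answer from: [Yes, No]."
    )
```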
+ ### For Entity Matching
+ ```
+ You are tasked with determining whether two records listed below are the same based on the information provided.
+ Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
+ Note that missing values (N/A or \"nan\") should not be used as a basis for your decision.
+ Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
+ Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
+ Are record A and record B the same entity? Choose your answer from: [Yes, No].
+ ```
+
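A sketch for filling the entity matching template from two Python records (the helper is ours, and it assumes both records share the same attribute names):

```python
def entity_matching_prompt(record_a: dict, record_b: dict) -> str:
    """Fill the entity matching template above with two records (sketch)."""
    def fmt(record: dict) -> str:
        # Render a record as "[attr: value, ...]" per the template.
        return "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"

    attrs = ", ".join(record_a.keys())
    return (
        "You are tasked with determining whether two records listed below "
        "are the same based on the information provided.\n"
        f"Carefully compare the {attrs} for each record before making your decision.\n"
        'Note that missing values (N/A or "nan") should not be used as a basis for your decision.\n'
        f"Record A: {fmt(record_a)}\n"
        f"Record B: {fmt(record_b)}\n"
        "Are record A and record B the same entity? Choose your answer from: [Yes, No]."
    )
```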
  ### For Column Type Annotation
 
  We follow the prompt in [Column Type Annotation using ChatGPT](https://arxiv.org/abs/2306.00745) (text+inst+2-step).