chuanxiao1983 committed • Commit 077a231 • Parent(s): 78eff7e

Update README.md

README.md CHANGED
<img src="https://i.imgur.com/E1vqCIw.png" alt="PicToModel" width="330"/>

We have also built [Jellyfish-7B](https://huggingface.co/NECOUDBFM/Jellyfish-7B) and [Jellyfish-8B](https://huggingface.co/NECOUDBFM/Jellyfish-8B), lighter versions of Jellyfish!\
They retain excellent data preprocessing performance while delivering faster inference and better reasoning ability!

😄 We strongly recommend that users adopt the 7B or 8B model for its generalizability to unseen tasks and its reasoning ability!

## Model Details

Jellyfish-13B is a large language model with 13 billion parameters. It is tailored specifically for data preprocessing tasks, including error detection, data imputation, schema matching, and entity matching.

We fine-tuned the [Open-Orca/OpenOrca-Platypus2-13B](https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B) model on datasets pertinent to data preprocessing tasks.
Its performance is competitive, rivaling previous state-of-the-art algorithms and LLMs such as OpenAI's GPT-3.5 and GPT-4 ([as demonstrated in our earlier studies](https://arxiv.org/abs/2308.16361)).

As the names suggest, Jellyfish-13B is tailored to deliver precise, straightforward answers.
In contrast, Jellyfish-13B-Interpreter is fine-tuned with data that includes reasoning and sequential thought processes for handling data preprocessing tasks, distilling knowledge from GPT-4.

The two versions are designed for different application scenarios.
Jellyfish-13B is suitable for integration into larger data management systems, thanks to its simple, clear responses that can be easily transformed into code in a data management/analysis pipeline.
On the other hand, Jellyfish-13B-Interpreter is more user-oriented, with responses that provide in-depth data insights without requiring advanced coding skills or an intricate grasp of statistics.

More details about the model can be found in the [Jellyfish paper](https://arxiv.org/abs/2312.01678).

| Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | **90.27** | 79.19 | 82.40 | 84.91 | 85.24 | *89.42* |
| Avg | | | 80.44 | - | *84.17* | 72.58 | - | 82.74 | 81.55 | **86.02** |

_For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. For the Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets._
_We use accuracy as the metric for data imputation and the F1 score for the other tasks._

1. [HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection seen datasets;
   [RAHA](https://dl.acm.org/doi/10.1145/3299869.3324956) for Error Detection unseen datasets;
   [IPM](https://ieeexplore.ieee.org/document/9458712) for Data Imputation;
   [SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching;
   [Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
2. [Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)

### Training Data
We utilized the training and validation sets from the paper [Can Foundation Models Wrangle Your Data?](https://arxiv.org/abs/2205.09911) to fine-tune Jellyfish.
The original datasets are from [HazyResearch/fm_data_tasks](https://github.com/HazyResearch/fm_data_tasks), [RAHA](https://github.com/BigDaMa/raha), [SMAT](https://github.com/JZCS2018/SMAT), and [IPM](https://ieeexplore.ieee.org/document/9458712).
Based on these datasets, we constructed an instruction-tuning dataset for fine-tuning LLMs, mirroring the style of the [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca).
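
For illustration, a single training example in this style might look as follows; the field names and the concrete record here are placeholders chosen for exposition, not the exact schema or contents of our data:

```python
# Illustrative OpenOrca-style instruction example (field names and the
# concrete record are placeholders, not our dataset's exact schema).
example = {
    "system_prompt": "You are an AI assistant that specializes in data preprocessing tasks.",
    "question": (
        "You are presented with a restaurant record that is missing a specific attribute: city. "
        "Your task is to deduce or infer the value of city using the available information in the record. "
        "Record: [name: mario's pizza, phone: 415-777-0112, address: 3108 fillmore st.] "
        "Answer only the value of city."
    ),
    "response": "san francisco",
}
```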

### Training Method

We used LoRA to speed up the training process, targeting the q_proj, k_proj, v_proj, ... modules.
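
A minimal sketch of such a setup with Hugging Face's `peft` library is shown below; the rank, alpha, and dropout values are illustrative assumptions, not our actual training hyperparameters:

```python
# Sketch of a LoRA setup with Hugging Face peft; target_modules follows the
# sentence above, while r, lora_alpha, and lora_dropout are assumed values.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("Open-Orca/OpenOrca-Platypus2-13B")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights train
```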

## Uses

To accelerate inference, we strongly recommend running Jellyfish with [vLLM](https://github.com/vllm-project/vllm).
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Python Script
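
A minimal sketch of running Jellyfish-13B through vLLM's offline API is shown below; the sampling settings are assumptions, and the prompt placeholder should be filled with one of the task templates from the Prompts section:

```python
# Minimal vLLM inference sketch; sampling settings are assumed values.
from vllm import LLM, SamplingParams

llm = LLM(model="NECOUDBFM/Jellyfish-13B")
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)  # greedy decoding

prompt = "..."  # fill in one of the task prompts from the Prompts section
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```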

### Prompts

We provide the prompts used for both fine-tuning and inference.
You can structure your data according to these prompts.
Moreover, we encourage you to experiment with different prompts, which may yield even better generation quality.

### Jellyfish-13B

#### For Error Detection
_There are two forms of the error detection task.
In the first form, a complete record row is provided, and the task is to determine if a specific value is erroneous. ..._
```
...
Note: Missing values (N/A or "nan") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
```
#### For Data Imputation
```
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
Answer only the value of {attribute X}.
```
#### For Schema Matching
```
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
...
Attribute A is [name: {value of name}, description: {value of description}].
Attribute B is [name: {value of name}, description: {value of description}].
Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
```
#### For Entity Matching
```
You are tasked with determining whether two records listed below are the same based on the information provided.
Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
Note: Missing values (N/A or "nan") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Are record A and record B the same entity? Choose your answer from: [Yes, No].
```
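
For instance, a small helper (hypothetical, not part of this repository) can fill the entity matching template from two Python dicts:

```python
# Hypothetical helper that fills the entity matching template above.
def serialize(record):
    return "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"

def entity_matching_prompt(record_a, record_b, attributes):
    return (
        "You are tasked with determining whether two records listed below are the same "
        "based on the information provided. Carefully compare the "
        + ", ".join(attributes)
        + " for each record before making your decision. "
        'Note: Missing values (N/A or "nan") should not be used as a basis for your decision.\n'
        f"Record A: {serialize(record_a)}\n"
        f"Record B: {serialize(record_b)}\n"
        "Are record A and record B the same entity? Choose your answer from: [Yes, No]."
    )

print(entity_matching_prompt(
    {"title": "canon eos 5d mark iv", "price": "2499.00"},
    {"title": "Canon EOS 5D Mark IV DSLR Body", "price": "2,499.00"},
    ["title", "price"],
))
```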

### For Column Type Annotation

### Jellyfish-13B-Interpreter
#### For Error Detection
```
Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
...
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}?
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
```
#### For Data Imputation
```
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
After your reasoning, finish your response in a separate line with and ONLY with your final answer.
Your final answer should only consist of the value of {attribute X}.
```
#### For Schema Matching
```
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
...
Attribute A is [name: {value of name}, description: {value of description}].
Attribute B is [name: {value of name}, description: {value of description}].
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
```
#### For Entity Matching
```
You are tasked with determining whether two products listed below are the same based on the information provided.
Carefully examine all the attributes before making your decision.
Note: Missing values (N/A or "nan") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Are record A and record B the same entity?
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
```
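
Since every Jellyfish-13B-Interpreter prompt asks for the final answer alone on the last line, a tiny (hypothetical) post-processing helper is enough to separate the reasoning from the answer:

```python
# Hypothetical post-processor: the Interpreter prompts above require the
# final answer to appear alone on the last line of the response.
def extract_final_answer(response: str) -> str:
    lines = [line.strip() for line in response.strip().splitlines() if line.strip()]
    return lines[-1] if lines else ""

demo = "Both records describe the same camera body, and the prices match.\nYes"
assert extract_final_answer(demo) == "Yes"
```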

## Sample Responses from Jellyfish-13B-Interpreter
We provide a few sample responses from Jellyfish-13B-Interpreter to demonstrate its performance.