chuanxiao1983 committed
Commit 077a231 (1 parent: 78eff7e)

Update README.md

Files changed (1)
  1. README.md +54 -54
README.md CHANGED
@@ -12,14 +12,14 @@ language:
12
  <img src="https://i.imgur.com/E1vqCIw.png" alt="PicToModel" width="330"/>
13
 
14
 
15
- We also build [Jellyfish-7B](https://huggingface.co/NECOUDBFM/Jellyfish-7B), and [Jellyfish-8B](https://huggingface.co/NECOUDBFM/Jellyfish-8B), lighter versions of Jellyfish!\
16
- They keep the powerful data propcessing performance with faster inference speed and better reasoning ability!
17
 
18
- 😄 We strongly recommend users to use the 7B and 8B models for their general ability!
19
 
20
 
21
  ## Model Details
22
- Jellyfish-13B is a large language model equipped with 13 billion parameters. It's tailored specifically for data preprocessing tasks, including entity matching, data imputation, error detection, and schema matching.
23
 
24
  We fine-tuned the [Open-Orca/OpenOrca-Platypus2-13B](https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B) model using the datasets pertinent to data preprocessing tasks.
25
Its performance is competitive, rivaling previous state-of-the-art algorithms and LLMs such as OpenAI's GPT-3.5 and GPT-4 ([as demonstrated in our earlier studies](https://arxiv.org/abs/2308.16361)).
@@ -31,8 +31,8 @@ As the names suggest, Jellyfish-13B is tailored to deliver precise, straightforw
31
In contrast, Jellyfish-13B-Interpreter is fine-tuned with data that includes reasoning and sequential thought processes for handling data preprocessing tasks, distilling knowledge from GPT-4.
32
 
33
  The two versions are designed for different application scenarios.
34
- Jellyfish-13B is suitable for integration into larger data management systems due to its simple and clear responses that can be easily transformed into code.
35
- On the other hand, Jellyfish-13B-Interpreter is more user-oriented, with responses that provide them with in-depth data insights without the necessity for advanced coding skills or an intricate grasp of statistics.
36
 
37
  More details about the model can be found in the [Jellyfish paper](https://arxiv.org/abs/2312.01678).
38
 
@@ -80,15 +80,15 @@ If you find our work useful, please give us credit by citing:
80
  | Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | **90.27** | 79.19 | 82.40 | 84.91 | 85.24 | *89.42* |
81
  | Avg | | | 80.44 | - | *84.17* | 72.58 | - | 82.74 | 81.55 | **86.02** |
82
 
83
- _For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. However, for Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets._
84
_We use accuracy as the metric for data imputation and the F1 score for the other tasks._
85
 
86
  1.
87
- [Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
88
- [SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching
89
- [HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection seen datasets
90
- [RAHA](https://dl.acm.org/doi/10.1145/3299869.3324956) for Error Detection unseen datasets
91
- [IPM](https://ieeexplore.ieee.org/document/9458712) for Data Imputation
92
  2.
93
  [Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)
94
 
@@ -129,7 +129,7 @@ _Few-shot is disabled for Jellyfish models._
129
  ### Training Data
130
  We utilized the training and validation sets from the paper [Can Foundation Models Wrangle Your Data?](https://arxiv.org/abs/2205.09911) to fine-tune Jellyfish.
131
  The original datasets are from [HazyResearch/fm_data_tasks](https://github.com/HazyResearch/fm_data_tasks), [RAHA](https://github.com/BigDaMa/raha), [SMAT](https://github.com/JZCS2018/SMAT), and [IPM](https://ieeexplore.ieee.org/document/9458712).
132
- We revised this data and constructed an instruction tuning dataset suitable for fine-tuning LLM, mirroring the style of [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca).
133
 
134
  ### Training Method
135
 
@@ -137,7 +137,7 @@ We used LoRA to speed up the training process, targeting the q_proj, k_proj, v_p
137
 
138
  ## Uses
139
 
140
- For improved practical inference speed, we strongly recommend running Jellyfish using [vLLM](https://github.com/vllm-project/vllm).
141
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
142
 
143
  ### Python Script
@@ -240,29 +240,11 @@ print(response)
240
 
241
  ### Prompts
242
 
243
- We provide the prompts used for both the model's fine-tuning and inference.
244
  You can structure your data according to these prompts.
245
- However, we encourage experimenting with different prompts to potentially achieve optimal generation quality.
246
 
247
  ### JellyFish-13B
248
- #### For Entity Matching
249
- ```
250
- You are tasked with determining whether two records listed below are the same based on the information provided.
251
- Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
252
- Note: Missing values (N/A or \"nan\") should not be used as a basis for your decision.
253
- Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
254
- Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
255
- Are record A and record B the same entity? Choose your answer from: [Yes, No].
256
- ```
257
- #### For Data Imputation
258
- ```
259
- You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
260
- Your task is to deduce or infer the value of {attribute X} using the available information in the record.
261
- You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
262
- Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
263
- Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
264
- Answer only the value of {attribute X}.
265
- ```
266
  #### For Error Detection
267
  _There are two forms of the error detection task.
268
  In the first form, a complete record row is provided, and the task is to determine if a specific value is erroneous.
@@ -284,6 +266,15 @@ Note: Missing values (N/A or \"nan\") are not considered errors.
284
  Attribute for Verification: [{attribute X}: {attribute X value}]
285
  Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
286
```
287
  #### For Schema Matching
288
  ```
289
  Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
@@ -293,6 +284,15 @@ Attribute A is [name: {value of name}, description: {value of description}].
293
  Attribute B is [name: {value of name}, description: {value of description}].
294
  Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
295
```
296
 
297
  ### For Column Type Annotation
298
 
@@ -304,26 +304,6 @@ We follow the prompt in [Product Attribute Value Extraction using Large Language
304
 
305
 
306
  ### JellyFish-13B-Interpreter
307
- #### For Entity Matching
308
- ```
309
- You are tasked with determining whether two products listed below are the same based on the information provided.
310
- Carefully examine all the attributes before making your decision.
311
- Note: Missing values (N/A or \"nan\") should not be used as a basis for your decision.
312
- Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
313
- Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
314
- Are record A and record B the same entity?
315
- After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
316
- ```
317
- #### For Data Imputation
318
- ```
319
- You are presented with a {keyword} record that is missing a specific attribute {attribute X}.
320
- Your task is to deduce or infer the manufacturer of the product using the available information in the record.
321
- You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
322
- Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
323
- Based on the provided product record, what would you infer is the value for the missing attribute {attribute X}?
324
- After your reasoning, finish your response in a separate line with and ONLY with your final answer.
325
- Your final answer should only consist of the value of {attribute X}.
326
- ```
327
  #### For Error Detection
328
  ```
329
  Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
@@ -342,6 +322,16 @@ Attribute for Verification: [{attribute X}: {attribute X value}]
342
  Question: Is there an error in the value of {attribute X}?
343
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
344
```
345
  #### For Schema Matching
346
  ```
347
  Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
@@ -351,6 +341,16 @@ Attribute A is [name: {value of name}, description: {value of description}].
351
  Attribute B is [name: {value of name}, description: {value of description}].
352
  After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
353
```
354
 
355
  ## Sample Responses from Jellyfish-13B-Interpreter
356
  We provide a few sample responses from Jellyfish-13B-Interpreter to demonstrate its performance.
 
12
  <img src="https://i.imgur.com/E1vqCIw.png" alt="PicToModel" width="330"/>
13
 
14
 
15
+ We have also built [Jellyfish-7B](https://huggingface.co/NECOUDBFM/Jellyfish-7B) and [Jellyfish-8B](https://huggingface.co/NECOUDBFM/Jellyfish-8B), lighter versions of Jellyfish!\
16
+ They retain excellent data preprocessing performance while delivering faster inference and better reasoning ability!
17
 
18
+ 😄 We strongly recommend using the 7B and 8B models, given their generalizability to unseen tasks and their reasoning ability!
19
 
20
 
21
  ## Model Details
22
+ Jellyfish-13B is a large language model with 13 billion parameters. It is tailored specifically for data preprocessing tasks, including error detection, data imputation, schema matching, and entity matching.
23
 
24
  We fine-tuned the [Open-Orca/OpenOrca-Platypus2-13B](https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B) model using the datasets pertinent to data preprocessing tasks.
25
Its performance is competitive, rivaling previous state-of-the-art algorithms and LLMs such as OpenAI's GPT-3.5 and GPT-4 ([as demonstrated in our earlier studies](https://arxiv.org/abs/2308.16361)).
 
31
In contrast, Jellyfish-13B-Interpreter is fine-tuned with data that includes reasoning and sequential thought processes for handling data preprocessing tasks, distilling knowledge from GPT-4.
32
 
33
  The two versions are designed for different application scenarios.
34
+ Jellyfish-13B is suitable for integration into larger data management systems due to its simple and clear responses that can be easily transformed into code in a data management/analysis pipeline.
35
+ On the other hand, Jellyfish-13B-Interpreter is more user-oriented, with responses that provide in-depth data insights without requiring advanced coding skills or a deep grasp of statistics.
36
 
37
  More details about the model can be found in the [Jellyfish paper](https://arxiv.org/abs/2312.01678).
38
 
 
80
  | Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | **90.27** | 79.19 | 82.40 | 84.91 | 85.24 | *89.42* |
81
  | Avg | | | 80.44 | - | *84.17* | 72.58 | - | 82.74 | 81.55 | **86.02** |
82
 
83
+ _For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. For Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets._
84
_We use accuracy as the metric for data imputation and the F1 score for the other tasks._
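_Here, the F1 score is the harmonic mean of precision and recall: F1 = 2 × Precision × Recall / (Precision + Recall)._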
85
 
86
  1.
87
+ [HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection seen datasets
88
+ [RAHA](https://dl.acm.org/doi/10.1145/3299869.3324956) for Error Detection unseen datasets
89
+ [IPM](https://ieeexplore.ieee.org/document/9458712) for Data Imputation
90
+ [SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching
91
+ [Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
92
  2.
93
  [Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)
94
 
 
129
  ### Training Data
130
  We utilized the training and validation sets from the paper [Can Foundation Models Wrangle Your Data?](https://arxiv.org/abs/2205.09911) to fine-tune Jellyfish.
131
  The original datasets are from [HazyResearch/fm_data_tasks](https://github.com/HazyResearch/fm_data_tasks), [RAHA](https://github.com/BigDaMa/raha), [SMAT](https://github.com/JZCS2018/SMAT), and [IPM](https://ieeexplore.ieee.org/document/9458712).
132
+ Based on these datasets, we constructed an instruction-tuning dataset for fine-tuning LLMs, mirroring the style of the [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca).
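For illustration, a single instruction-tuning record in an OpenOrca-style layout might look roughly like this (the field names and values below are hypothetical, not the exact schema of the released data):

```python
# Hypothetical instruction-tuning record in an OpenOrca-style layout.
# Field names and contents are illustrative only.
example_record = {
    "system_prompt": "You are an AI assistant that specializes in data preprocessing tasks.",
    "question": (
        "You are tasked with determining whether two records listed below are the same "
        "based on the information provided. ... "
        "Are record A and record B the same entity? Choose your answer from: [Yes, No]."
    ),
    "response": "No",
}
```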
133
 
134
  ### Training Method
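We used LoRA targeting at least the q_proj, k_proj, and v_proj modules; a minimal, hypothetical PEFT configuration along those lines might look as follows (rank, alpha, dropout, and any additional target modules are illustrative assumptions, not the values actually used for Jellyfish):

```python
from peft import LoraConfig

# Hypothetical LoRA configuration; hyperparameter values are illustrative.
lora_config = LoraConfig(
    r=16,                                           # assumed rank
    lora_alpha=32,                                  # assumed scaling factor
    lora_dropout=0.05,                              # assumed dropout
    target_modules=["q_proj", "k_proj", "v_proj"],  # per the description; the full list may be longer
    bias="none",
    task_type="CAUSAL_LM",
)

# The config would then be applied to the base model, e.g.:
# from peft import get_peft_model
# model = get_peft_model(base_model, lora_config)
```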
135
 
 
137
 
138
  ## Uses
139
 
140
+ To accelerate inference, we strongly recommend running Jellyfish with [vLLM](https://github.com/vllm-project/vllm).
141
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
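A minimal, illustrative sketch of running the model with vLLM is shown below; the repository id and sampling parameters are assumptions, and the prompt should follow the templates in the Prompts section:

```python
from vllm import LLM, SamplingParams

# Illustrative sketch only: the model id and sampling parameters below are
# assumptions, not prescribed settings.
llm = LLM(model="NECOUDBFM/Jellyfish-13B")
sampling_params = SamplingParams(temperature=0.35, top_p=0.9, max_tokens=128)

# Build the prompt from the templates in the Prompts section.
prompt = "..."

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```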
142
 
143
  ### Python Script
 
240
 
241
  ### Prompts
242
 
243
+ We provide the prompts used for both fine-tuning and inference.
244
  You can structure your data according to these prompts.
245
+ Moreover, we encourage experimenting with different prompts to potentially achieve optimal generation quality.
246
 
247
### JellyFish-13B
248
  #### For Error Detection
249
  _There are two forms of the error detection task.
250
  In the first form, a complete record row is provided, and the task is to determine if a specific value is erroneous.
 
266
  Attribute for Verification: [{attribute X}: {attribute X value}]
267
  Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
268
  ```
269
+ #### For Data Imputation
270
+ ```
271
+ You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
272
+ Your task is to deduce or infer the value of {attribute X} using the available information in the record.
273
+ You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
274
+ Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
275
+ Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
276
+ Answer only the value of {attribute X}.
277
+ ```
278
  #### For Schema Matching
279
  ```
280
  Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
 
284
  Attribute B is [name: {value of name}, description: {value of description}].
285
  Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
286
  ```
287
+ #### For Entity Matching
288
+ ```
289
+ You are tasked with determining whether two records listed below are the same based on the information provided.
290
+ Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
291
+ Note: Missing values (N/A or \"nan\") should not be used as a basis for your decision.
292
+ Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
293
+ Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
294
+ Are record A and record B the same entity? Choose your answer from: [Yes, No].
295
+ ```
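For illustration, records can be slotted into the entity matching template above with a small helper like the following (attribute names and record values are made up):

```python
# Illustrative only: fill the entity matching template with two records.
record_a = {"name": "instant immersion spanish deluxe 2.0", "manufacturer": "topics entertainment", "price": "49.99"}
record_b = {"name": "instant immersion spanish deluxe", "manufacturer": "nan", "price": "36.11"}

def serialize(record):
    return "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"

attributes = ", ".join(record_a.keys())
prompt = (
    "You are tasked with determining whether two records listed below are the same based on the information provided. "
    f"Carefully compare the {attributes} for each record before making your decision.\n"
    'Note: Missing values (N/A or "nan") should not be used as a basis for your decision.\n'
    f"Record A: {serialize(record_a)}\n"
    f"Record B: {serialize(record_b)}\n"
    "Are record A and record B the same entity? Choose your answer from: [Yes, No]."
)
print(prompt)
```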
296
 
297
  ### For Column Type Annotation
298
 
 
304
 
305
 
306
### JellyFish-13B-Interpreter
307
  #### For Error Detection
308
  ```
309
  Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
 
322
  Question: Is there an error in the value of {attribute X}?
323
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
324
  ```
325
+ #### For Data Imputation
326
+ ```
327
+ You are presented with a {keyword} record that is missing a specific attribute {attribute X}.
328
+ Your task is to deduce or infer the manufacturer of the product using the available information in the record.
329
+ You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
330
+ Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
331
+ Based on the provided product record, what would you infer is the value for the missing attribute {attribute X}?
332
+ After your reasoning, finish your response in a separate line with and ONLY with your final answer.
333
+ Your final answer should only consist of the value of {attribute X}.
334
+ ```
335
  #### For Schema Matching
336
  ```
337
  Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
 
341
  Attribute B is [name: {value of name}, description: {value of description}].
342
  After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
343
  ```
344
+ #### For Entity Matching
345
+ ```
346
+ You are tasked with determining whether two products listed below are the same based on the information provided.
347
+ Carefully examine all the attributes before making your decision.
348
+ Note: Missing values (N/A or \"nan\") should not be used as a basis for your decision.
349
+ Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
350
+ Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
351
+ Are record A and record B the same entity?
352
+ After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
353
+ ```
354
 
355
  ## Sample Responses from Jellyfish-13B-Interpreter
356
  We provide a few sample responses from Jellyfish-13B-Interpreter to demonstrate its performance.