chuanxiao1983 committed • Commit 077a231 • Parent(s): 78eff7e

Update README.md

README.md CHANGED
<img src="https://i.imgur.com/E1vqCIw.png" alt="PicToModel" width="330"/>

We have also built [Jellyfish-7B](https://huggingface.co/NECOUDBFM/Jellyfish-7B) and [Jellyfish-8B](https://huggingface.co/NECOUDBFM/Jellyfish-8B), lighter versions of Jellyfish!\
They retain excellent data preprocessing performance while delivering faster inference and better reasoning ability!

😄 We strongly recommend that users adopt the 7B or 8B model for its generalizability to unseen tasks and its reasoning ability!

## Model Details

Jellyfish-13B is a large language model with 13 billion parameters. It is tailored specifically for data preprocessing tasks, including error detection, data imputation, schema matching, and entity matching.

We fine-tuned the [Open-Orca/OpenOrca-Platypus2-13B](https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B) model on datasets pertinent to data preprocessing tasks.
Its performance is competitive, rivaling previous state-of-the-art algorithms and LLMs such as OpenAI's GPT-3.5 and GPT-4 ([as demonstrated in our earlier studies](https://arxiv.org/abs/2308.16361)).

As the names suggest, Jellyfish-13B is tailored to deliver precise, straightforward answers.
In contrast, Jellyfish-13B-Interpreter is fine-tuned with data that includes reasoning and sequential thought processes for handling data preprocessing tasks, distilling knowledge from GPT-4.

The two versions are designed for different application scenarios.
Jellyfish-13B is suitable for integration into larger data management systems, thanks to its simple, clear responses that can be easily transformed into code in a data management/analysis pipeline.
On the other hand, Jellyfish-13B-Interpreter is more user-oriented, with responses that provide in-depth data insights without requiring advanced coding skills or an intricate grasp of statistics.

More details about the model can be found in the [Jellyfish paper](https://arxiv.org/abs/2312.01678).

| Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | **90.27** | 79.19 | 82.40 | 84.91 | 85.24 | *89.42* |
| Avg | | | 80.44 | - | *84.17* | 72.58 | - | 82.74 | 81.55 | **86.02** |

_For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. For the Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets._
_We use accuracy as the metric for data imputation and the F1 score for the other tasks._

1. [HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection seen datasets;
   [RAHA](https://dl.acm.org/doi/10.1145/3299869.3324956) for Error Detection unseen datasets;
   [IPM](https://ieeexplore.ieee.org/document/9458712) for Data Imputation;
   [SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching;
   [Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
2. [Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)

### Training Data
We utilized the training and validation sets from the paper [Can Foundation Models Wrangle Your Data?](https://arxiv.org/abs/2205.09911) to fine-tune Jellyfish.
The original datasets are from [HazyResearch/fm_data_tasks](https://github.com/HazyResearch/fm_data_tasks), [RAHA](https://github.com/BigDaMa/raha), [SMAT](https://github.com/JZCS2018/SMAT), and [IPM](https://ieeexplore.ieee.org/document/9458712).
Based on these datasets, we constructed an instruction-tuning dataset for fine-tuning LLMs, mirroring the style of the [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca).
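
For illustration, a single training example in this style might look as follows; the field names and the concrete record here are placeholders chosen for exposition, not the exact schema or contents of our data:

```python
# Illustrative OpenOrca-style instruction example (field names and the
# concrete record are placeholders, not our dataset's exact schema).
example = {
    "system_prompt": "You are an AI assistant that specializes in data preprocessing tasks.",
    "question": (
        "You are presented with a restaurant record that is missing a specific attribute: city. "
        "Your task is to deduce or infer the value of city using the available information in the record. "
        "Record: [name: mario's pizza, phone: 415-777-0112, address: 3108 fillmore st.] "
        "Answer only the value of city."
    ),
    "response": "san francisco",
}
```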

### Training Method

We used LoRA to speed up the training process, targeting the q_proj, k_proj, v_proj, ... modules.
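
A minimal sketch of such a setup with Hugging Face's `peft` library is shown below; the rank, alpha, and dropout values are illustrative assumptions, not our actual training hyperparameters:

```python
# Sketch of a LoRA setup with Hugging Face peft; target_modules follows the
# sentence above, while r, lora_alpha, and lora_dropout are assumed values.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("Open-Orca/OpenOrca-Platypus2-13B")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights train
```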

## Uses

To accelerate inference, we strongly recommend running Jellyfish with [vLLM](https://github.com/vllm-project/vllm).
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Python Script
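
A minimal sketch of running Jellyfish-13B through vLLM's offline API is shown below; the sampling settings are assumptions, and the prompt placeholder should be filled with one of the task templates from the Prompts section:

```python
# Minimal vLLM inference sketch; sampling settings are assumed values.
from vllm import LLM, SamplingParams

llm = LLM(model="NECOUDBFM/Jellyfish-13B")
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)  # greedy decoding

prompt = "..."  # fill in one of the task prompts from the Prompts section
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```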

### Prompts

We provide the prompts used for both fine-tuning and inference.
You can structure your data according to these prompts.
Moreover, we encourage you to experiment with different prompts, which may yield even better generation quality.

### Jellyfish-13B

#### For Error Detection
_There are two forms of the error detection task.
In the first form, a complete record row is provided, and the task is to determine if a specific value is erroneous. ..._
```
...
Note: Missing values (N/A or "nan") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
```
#### For Data Imputation
```
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
Answer only the value of {attribute X}.
```
#### For Schema Matching
```
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
...
Attribute A is [name: {value of name}, description: {value of description}].
Attribute B is [name: {value of name}, description: {value of description}].
Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
```
#### For Entity Matching
```
You are tasked with determining whether two records listed below are the same based on the information provided.
Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
Note: Missing values (N/A or "nan") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Are record A and record B the same entity? Choose your answer from: [Yes, No].
```
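
For instance, a small helper (hypothetical, not part of this repository) can fill the entity matching template from two Python dicts:

```python
# Hypothetical helper that fills the entity matching template above.
def serialize(record):
    return "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"

def entity_matching_prompt(record_a, record_b, attributes):
    return (
        "You are tasked with determining whether two records listed below are the same "
        "based on the information provided. Carefully compare the "
        + ", ".join(attributes)
        + " for each record before making your decision. "
        'Note: Missing values (N/A or "nan") should not be used as a basis for your decision.\n'
        f"Record A: {serialize(record_a)}\n"
        f"Record B: {serialize(record_b)}\n"
        "Are record A and record B the same entity? Choose your answer from: [Yes, No]."
    )

print(entity_matching_prompt(
    {"title": "canon eos 5d mark iv", "price": "2499.00"},
    {"title": "Canon EOS 5D Mark IV DSLR Body", "price": "2,499.00"},
    ["title", "price"],
))
```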

### For Column Type Annotation

### Jellyfish-13B-Interpreter
#### For Error Detection
```
Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
...
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}?
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
```
#### For Data Imputation
```
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
After your reasoning, finish your response in a separate line with and ONLY with your final answer.
Your final answer should only consist of the value of {attribute X}.
```
#### For Schema Matching
```
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
...
Attribute A is [name: {value of name}, description: {value of description}].
Attribute B is [name: {value of name}, description: {value of description}].
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
```
#### For Entity Matching
```
You are tasked with determining whether two products listed below are the same based on the information provided.
Carefully examine all the attributes before making your decision.
Note: Missing values (N/A or "nan") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Are record A and record B the same entity?
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
```
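
Since every Jellyfish-13B-Interpreter prompt asks for the final answer alone on the last line, a tiny (hypothetical) post-processing helper is enough to separate the reasoning from the answer:

```python
# Hypothetical post-processor: the Interpreter prompts above require the
# final answer to appear alone on the last line of the response.
def extract_final_answer(response: str) -> str:
    lines = [line.strip() for line in response.strip().splitlines() if line.strip()]
    return lines[-1] if lines else ""

demo = "Both records describe the same camera body, and the prices match.\nYes"
assert extract_final_answer(demo) == "Yes"
```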

## Sample Responses from Jellyfish-13B-Interpreter
We provide a few sample responses from Jellyfish-13B-Interpreter to demonstrate its performance.