yuyangdong committed
Commit • 18e92a9 • 1 Parent(s): 6fdce9f
Update README.md

README.md CHANGED
@@ -9,17 +9,33 @@ language:
 <img src="https://i.imgur.com/d8Bl04i.png" alt="PicToModel" width="330"/>

 ## Model Details
-Jellyfish-13B is a
-
-We fine-tuned [Open-Orca/OpenOrca-Platypus2-13B](https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B) using data preprocessing tasks
-Its performance is competitive, standing up well against prior state-of-the-art algorithms and OpenAI GPT 3.5 and GPT 4
+Jellyfish-13B is a large language model with 13 billion parameters, designed specifically for data management and preprocessing tasks, such as entity matching, data imputation, error detection, and schema matching.
+
+We fine-tuned [Open-Orca/OpenOrca-Platypus2-13B](https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B) on datasets related to data preprocessing tasks.
+Its performance is competitive, standing up well against prior state-of-the-art algorithms and LLMs such as OpenAI GPT-3.5 and GPT-4 (evaluated in our previous work, https://arxiv.org/abs/2205.09911).
+Note that Jellyfish is only a 13B model, so it can be run locally at low cost while keeping data secure.
+
+| Task | Dataset | Non-LLM SoTA | GPT-3.5 | GPT-4 | Jellyfish-13B | Jellyfish-13B-Reasoning |
+| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
+| Entity Matching | Fodors-Zagats | 100 | 100 | 100 | 100 | 100 |
+| Entity Matching | Beer | 94.37 | 96.30 | 100 | 93.33 | 100 |
+| Entity Matching | iTunes-Amazon | 97.06 | 96.43 | 100 | 96.30 | 96.15 |
+| Entity Matching | Walmart-Amazon | 86.76 | 86.17 | 90.27 | 80.71 | 85.16 |
+| Entity Matching | DBLP-ACM | 98.99 | 96.99 | 97.44 | 97.35 | 95.74 |
+| Entity Matching | DBLP-GoogleScholar | 95.60 | 76.12 | 91.87 | 92.83 | 89.45 |
+| Entity Matching | Amazon-Google | 75.58 | 66.53 | 74.21 | 72.69 | 56.64 |
+| Imputation | Restaurant | 77.20 | 94.19 | 97.67 | 94.19 | 93.02 |
+| Imputation | Buy | 96.50 | 98.46 | 100 | 100 | 100 |
+| Error Detection | Hospital | 99.10 | 90.74 | 90.74 | 92.21 | 65.66 |
+| Error Detection | Adult | 94.40 | 92.01 | 92.01 | 96.62 | 90.13 |
+| Schema Matching | Synthea | 38.50 | 57.14 | 66.67 | 36.36 | 30.77 |

 We have released two versions of Jellyfish: the Jellyfish-13B and Jellyfish-13B-Reasoning.
 As the names suggest, Jellyfish-13B focuses on providing accurate, direct answers.
-In contrast, Jellyfish-13B-Reasoning
-generated by
+In contrast, Jellyfish-13B-Reasoning distills knowledge from GPT-4: it is fine-tuned with data containing reasons and chain-of-thought responses for solving data preprocessing tasks,
+generated by GPT-4.

-Jellyfish paper will coming soon
+**The Jellyfish paper is coming soon!**

 - **Developed by:** Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada
 - **Contact: [email protected]**
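Since the card stresses that a 13B model can run locally, here is a minimal local-inference sketch with the Hugging Face transformers library. The repo id, dtype, and generation settings are illustrative assumptions, not values from the card:

```python
# Minimal sketch of running Jellyfish-13B locally with transformers.
# The repo id below is a hypothetical placeholder; substitute this
# repository's actual Hugging Face path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NECOUDBFM/Jellyfish-13B"  # assumed id, not confirmed by the card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so a 13B model fits on one GPU
    device_map="auto",
)

prompt = (
    "Record A: [name: Starbucks Coffee, city: Seattle]\n"
    "Record B: [name: Starbucks, city: Seattle]\n"
    "Are record A and record B the same entity? Choose your answer from: [Yes, No]"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```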
@@ -40,11 +56,9 @@ Jellyfish paper will coming soon!
 ## Training Details

 ### Training Data
-We utilized the training and validation sets
-
-
-
-Through meticulous prompt-engineering, we constructed our datasets suitable for fine-tuning LLM, mirroring the style of [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca).
+We utilized the training and validation sets from the paper [Can Foundation Models Wrangle Your Data?](https://arxiv.org/abs/2205.09911) to fine-tune Jellyfish.
+The original datasets are available at [HazyResearch/fm_data_tasks](https://github.com/HazyResearch/fm_data_tasks).
+We revised this data and constructed an instruction-tuning dataset suitable for fine-tuning LLMs, mirroring the style of [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca).

 ### Training Method

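The Training Method section (the next hunk's context notes that LoRA was used, targeting the q_proj and v_proj modules) can be sketched with the Hugging Face PEFT library. Only the base model and the target modules come from the card; rank, alpha, and dropout below are assumed values for illustration:

```python
# Illustrative LoRA setup in the spirit of the card's Training Method section.
# Only the base model and the q_proj/v_proj target modules are from the card;
# r, lora_alpha, and lora_dropout are assumptions, not the authors' settings.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Open-Orca/OpenOrca-Platypus2-13B")
lora = LoraConfig(
    r=16,                                 # assumed adapter rank
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,                    # assumed dropout
    target_modules=["q_proj", "v_proj"],  # modules named in the card
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the small adapter matrices train
```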
@@ -55,7 +69,7 @@ We used LoRA to speed up the training process, targeting the q_proj and v_proj m
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 Here are the prompts we used for both fine-tuning the model and for inference. Feel free to explore different prompts on your own to achieve the best generation quality.

-### For JellyFish-
+### For Jellyfish-13B
 ```
 You are tasked with determining whether two records listed below are the same based on the information provided. Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.

@@ -66,7 +80,7 @@ Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value
 Are record A and record B the same entity? Choose your answer from: [Yes, No]
 ```

-### For JellyFish-reasoning
+### For Jellyfish-13B-Reasoning
 ```
 You are tasked with determining whether two products listed below are the same based on the information provided. Carefully examine all the attributes before making your decision.

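To make the prompt templates concrete, here is a small, hypothetical helper that serializes a record pair into the entity-matching prompt. The wording follows the template above; the function name, attribute names, and record values are made up:

```python
# Hypothetical helper that fills the entity-matching prompt template above;
# the attribute names and record values are made-up examples.
def build_em_prompt(record_a: dict, record_b: dict) -> str:
    def serialize(rec: dict) -> str:
        return ", ".join(f"{k}: {v}" for k, v in rec.items())

    attrs = ", ".join(record_a)  # e.g. "name, city"
    return (
        "You are tasked with determining whether two records listed below are "
        "the same based on the information provided. Carefully compare the "
        f"{attrs} for each record before making your decision.\n\n"
        f"Record A: [{serialize(record_a)}]\n"
        f"Record B: [{serialize(record_b)}]\n\n"
        "Are record A and record B the same entity? Choose your answer from: [Yes, No]"
    )

print(build_em_prompt(
    {"name": "Starbucks Coffee", "city": "Seattle"},
    {"name": "Starbucks", "city": "Seattle"},
))
```

The model's reply can then be parsed for the final Yes/No token.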
@@ -82,14 +96,16 @@ After your reasoning, finish your response in a separate line with and ONLY with

 ## Bias, Risks, and Limitations

-<!-- This section is meant to convey both technical and sociotechnical limitations.
+<!-- This section is meant to convey both technical and sociotechnical limitations.
 As of now, we've tested Jellyfish exclusively with the test sets from the benchmark datasets mentioned earlier.

 We're in the process of assessing its performance on additional datasets.
+-->

 ## Citation

-
+
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section.

 ```bibtex
 @article{
@@ -98,6 +114,7 @@ We're in the process of assessing its performance on additional datasets.
 booktitle = {arXiv:2205.09911},
 year = {2022}
 }
+
 @software{hunterlee2023orcaplaty1,
 title = {OpenOrcaPlatypus: Llama2-13B Model Instruct-tuned on Filtered OpenOrcaV1 GPT-4 Dataset and Merged with divergent STEM and Logic Dataset Model},
 author = {Ariel N. Lee and Cole J. Hunter and Nataniel Ruiz and Bleys Goodson and Wing Lian and Guan Wang and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"},
@@ -157,7 +174,8 @@ We're in the process of assessing its performance on additional datasets.
 journal={CoRR},
 year={2021}
 }
-
+-->
+

 <!--**BibTeX:**
