Commit 18e92a9 by yuyangdong (parent: 6fdce9f): Update README.md

<img src="https://i.imgur.com/d8Bl04i.png" alt="PicToModel" width="330"/>

## Model Details
Jellyfish-13B is a large language model with 13 billion parameters, designed specifically for data management and preprocessing tasks, such as entity matching, data imputation, error detection, and schema matching.

We fine-tuned [Open-Orca/OpenOrca-Platypus2-13B](https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B) on datasets for data preprocessing tasks.
Its performance is competitive, standing up well against prior state-of-the-art algorithms and LLMs such as OpenAI GPT-3.5 and GPT-4 (as evaluated in our previous work, https://arxiv.org/abs/2205.09911).
Note that Jellyfish is only a 13B model and can be run locally for low cost and data security.

| Task | Dataset | Non-LLM SoTA | GPT-3.5 | GPT-4 | Jellyfish-13B | Jellyfish-13B-Reasoning |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Entity Matching | Fodors-Zagats | 100 | 100 | 100 | 100 | 100 |
| Entity Matching | Beer | 94.37 | 96.30 | 100 | 93.33 | 100 |
| Entity Matching | iTunes-Amazon | 97.06 | 96.43 | 100 | 96.30 | 96.15 |
| Entity Matching | Walmart-Amazon | 86.76 | 86.17 | 90.27 | 80.71 | 85.16 |
| Entity Matching | DBLP-ACM | 98.99 | 96.99 | 97.44 | 97.35 | 95.74 |
| Entity Matching | DBLP-GoogleScholar | 95.60 | 76.12 | 91.87 | 92.83 | 89.45 |
| Entity Matching | Amazon-Google | 75.58 | 66.53 | 74.21 | 72.69 | 56.64 |
| Imputation | Restaurant | 77.20 | 94.19 | 97.67 | 94.19 | 93.02 |
| Imputation | Buy | 96.50 | 98.46 | 100 | 100 | 100 |
| Error Detection | Hospital | 99.10 | 90.74 | 90.74 | 92.21 | 65.66 |
| Error Detection | Adult | 94.40 | 92.01 | 92.01 | 96.62 | 90.13 |
| Schema Matching | Synthea | 38.50 | 57.14 | 66.67 | 36.36 | 30.77 |
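To sketch what running Jellyfish locally can look like, here is a minimal example with Hugging Face `transformers`. The repo id, fp16 precision, and `device_map="auto"` are illustrative assumptions, not official loading instructions.

```python
# A minimal sketch of local inference with transformers.
# The repo id below is a placeholder assumption, not a confirmed checkpoint name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NECOUDBFM/Jellyfish"  # hypothetical repo id; substitute the released one

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 keeps a 13B model within ~26 GB of GPU memory
    device_map="auto",          # shard across available GPUs, offloading if needed
)

prompt = "Are record A and record B the same entity? Choose your answer from: [Yes, No]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```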
 
We have released two versions of Jellyfish: Jellyfish-13B and Jellyfish-13B-Reasoning.
As the names suggest, Jellyfish-13B focuses on providing accurate, direct answers.
In contrast, Jellyfish-13B-Reasoning distills knowledge from GPT-4: it is fine-tuned with data containing reasons and chain-of-thought responses, generated by GPT-4, for solving data preprocessing tasks.

**The Jellyfish paper is coming soon!**

- **Developed by:** Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada
- **Contact: [email protected]**
 
## Training Details

### Training Data
We utilized the training and validation sets from the paper [Can Foundation Models Wrangle Your Data?](https://arxiv.org/abs/2205.09911) to fine-tune Jellyfish.
The original datasets are available at [HazyResearch/fm_data_tasks](https://github.com/HazyResearch/fm_data_tasks).
We revised this data and constructed an instruction-tuning dataset suitable for fine-tuning LLMs, mirroring the style of [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca).
 
### Training Method

We used LoRA to speed up the training process, targeting the q_proj and v_proj modules.
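A minimal sketch of this setup with the `peft` library is shown below. Only the `q_proj`/`v_proj` target modules come from the description above; the rank, alpha, and dropout values are illustrative assumptions rather than the actual training configuration.

```python
# Illustrative LoRA setup with peft; hyperparameters are assumptions.
# Only the q_proj/v_proj target modules come from the model card text.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Open-Orca/OpenOrca-Platypus2-13B")
lora_config = LoraConfig(
    r=16,                                 # assumed rank
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,                    # assumed dropout
    target_modules=["q_proj", "v_proj"],  # as stated above
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```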

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
Here are the prompts we used both for fine-tuning the model and for inference. Feel free to explore different prompts on your own to achieve the best generation quality.

### For Jellyfish-13B
```
You are tasked with determining whether two records listed below are the same based on the information provided. Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.

Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]

Are record A and record B the same entity? Choose your answer from: [Yes, No]
```
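As a sketch of how this template can be filled programmatically, the hypothetical helper below assembles the prompt for one record pair; the attribute names and record values are illustrative.

```python
# Sketch: fill the entity-matching template above for one pair of records.
# The attribute names and record values are illustrative, not from a benchmark.
def build_em_prompt(record_a: dict, record_b: dict) -> str:
    attrs = ", ".join(record_a)  # assumes both records share the same attributes

    def fmt(record: dict) -> str:
        return ", ".join(f"{k}: {v}" for k, v in record.items())

    return (
        "You are tasked with determining whether two records listed below are "
        "the same based on the information provided. Carefully compare the "
        f"{attrs} for each record before making your decision.\n\n"
        f"Record A: [{fmt(record_a)}]\n"
        f"Record B: [{fmt(record_b)}]\n\n"
        "Are record A and record B the same entity? "
        "Choose your answer from: [Yes, No]"
    )

print(build_em_prompt(
    {"title": "iPhone 14 128GB", "price": "799"},
    {"title": "Apple iPhone 14 (128 GB)", "price": "799.00"},
))
```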

### For Jellyfish-13B-Reasoning
```
You are tasked with determining whether two products listed below are the same based on the information provided. Carefully examine all the attributes before making your decision.

Product A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
Product B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]

After your reasoning, finish your response in a separate line with and ONLY with your final answer.
```
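Because the response is instructed to end with the answer alone on a separate line, a caller can recover it with a simple parse; the sketch below uses an illustrative response.

```python
# Sketch: recover the final answer from a chain-of-thought response, relying on
# the instruction that the answer appears alone on the last line.
def parse_final_answer(response: str) -> str:
    return response.strip().splitlines()[-1].strip()

example = (
    "Both products have the same title and price, so they likely refer "
    "to the same item.\n"
    "Yes"
)
assert parse_final_answer(example) == "Yes"
```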
 
## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->
As of now, we've tested Jellyfish exclusively with the test sets of the benchmark datasets mentioned earlier.

We're in the process of assessing its performance on additional datasets.
 
## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

```bibtex
@article{narayan2022wrangle,
  title     = {Can Foundation Models Wrangle Your Data?},
  author    = {Avanika Narayan and Ines Chami and Laurel Orr and Christopher R{\'e}},
  booktitle = {arXiv:2205.09911},
  year      = {2022}
}

@software{hunterlee2023orcaplaty1,
  title  = {OpenOrcaPlatypus: Llama2-13B Model Instruct-tuned on Filtered OpenOrcaV1 GPT-4 Dataset and Merged with divergent STEM and Logic Dataset Model},
  author = {Ariel N. Lee and Cole J. Hunter and Nataniel Ruiz and Bleys Goodson and Wing Lian and Guan Wang and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"},
  year   = {2023}
}
 
@article{hu2021lora,
  title   = {LoRA: Low-Rank Adaptation of Large Language Models},
  author  = {Edward J. Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen},
  journal = {CoRR},
  year    = {2021}
}
```