HCZhang committed
Commit 249253a
1 Parent(s): 744c6f1

Update README.md

Files changed (1)
  1. README.md +7 -7
README.md CHANGED
@@ -11,9 +11,9 @@ language:
 ## Model Details
 Jellyfish-13B is a large language model equipped with 13 billion parameters. It's tailored specifically for data preprocessing tasks, including entity matching, data imputation, error detection, and schema matching.
 
-We fine-tuned [Open-Orca/OpenOrca-Platypus2-13B](https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B) using the datasets pertinent to data preprocessing tasks.
-Its performance is competitive, rivaling previous state-of-the-art algorithms and LLMs such as OpenAI's GPT 3.5 and GPT 4, ([as demonstrated in our earlier studies](https://arxiv.org/abs/2308.16361))
-It is notable that as a 13B model, Jellyfish allows for cost-effective local execution without compromising data security.
+We fine-tuned the [Open-Orca/OpenOrca-Platypus2-13B](https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B) model using the datasets pertinent to data preprocessing tasks.
+Its performance is competitive, rivaling previous state-of-the-art algorithms and LLMs such as OpenAI's GPT-3.5 and GPT-4 ([as demonstrated in our earlier studies](https://arxiv.org/abs/2308.16361)).
+It is notable that, as a 13B model, Jellyfish allows for cost-effective local execution without compromising data security.
 
 | Task | Dataset | Non-LLM SoTA<sup>1</sup> | GPT-3.5<sup>2</sup> | GPT-4<sup>2</sup> | Jellyfish-13B | Jellyfish-13B-Reasoning |
 | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
@@ -30,7 +30,7 @@ It is notable that as a 13B model, Jellyfish allows for cost-effective local exe
 | Error Detection | Adult | 94.40 | 92.01 | 92.01 | 96.62 | 90.13 |
 | Schema Matching | Synthea | 38.50 | 57.14 | 66.67 | 36.36 | 30.77 |
 
-_Accuracy as the metric for data imputation, and the F1 score for other tasks._
+_Accuracy as the metric for data imputation and the F1 score for other tasks._
 _For GPT-3.5 and GPT-4, we used the few-shot approach, while for Jellyfish and Jellyfish-Reasoning, the zero-shot approach was employed._
 1.
 [Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
@@ -71,8 +71,8 @@ On the other hand, Jellyfish-13B-Reasoning is more user-oriented, with responses
 
 ### Training Data
 We utilized the training and validation sets from the paper [Can Foundation Models Wrangle Your Data?](https://arxiv.org/abs/2205.09911) to fine-tune Jellyfish.
-The original datasets is [HazyResearch/fm_data_tasks](https://github.com/HazyResearch/fm_data_tasks).
-We revised this data and constructed an instruction tuning dataset suitable for fine-tuning LLM, mirroring the style of [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca).
+The original datasets are from [HazyResearch/fm_data_tasks](https://github.com/HazyResearch/fm_data_tasks).
+We revised this data and constructed an instruction-tuning dataset suitable for fine-tuning LLMs, mirroring the style of the [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca).
 
 ### Training Method
 
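The instruction-tuning conversion this hunk describes can be pictured with a small sketch. The example below is hypothetical: the record fields, prompt wording, and JSON keys are illustrative assumptions in the spirit of OpenOrca's system-prompt/question/response layout, not the exact format used to build Jellyfish's training data.

```python
# Hypothetical sketch: turn a raw entity-matching pair from an
# fm_data_tasks-style dataset into an OpenOrca-style instruction example.
# Field names and wording are assumptions, not Jellyfish's actual format.

def to_instruction_example(record_a: dict, record_b: dict, match: bool) -> dict:
    system = "You are an AI assistant that helps with data preprocessing tasks."
    question = (
        "Determine whether the two records below refer to the same real-world "
        "entity. Answer with 'Yes' or 'No'.\n"
        f"Record A: {record_a}\n"
        f"Record B: {record_b}"
    )
    return {
        "system_prompt": system,               # OpenOrca-style system message
        "question": question,                  # instruction plus serialized records
        "response": "Yes" if match else "No",  # gold answer as the target
    }

example = to_instruction_example(
    {"name": "iPhone 12 64GB", "brand": "Apple"},
    {"name": "Apple iPhone 12 (64 GB)", "brand": "Apple"},
    match=True,
)
print(example["question"])
```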
 
@@ -82,7 +82,7 @@ We used LoRA to speed up the training process, targeting the q_proj and v_proj m
 
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 We provide the prompts used for both the model's fine-tuning and inference.
-You can structure your data accordingly to these prompts.
+You can structure your data according to these prompts.
 However, we encourage experimenting with different prompts to potentially achieve optimal generation quality.
 
 ### JellyFish-13B
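The hunk header above quotes the Training Method section: LoRA fine-tuning targeting the q_proj and v_proj modules. A minimal sketch of that setup with the Hugging Face peft library follows; the rank, alpha, and dropout values are assumptions, since the README does not state the hyperparameters.

```python
# Sketch of a LoRA configuration targeting q_proj and v_proj, per the README.
# r, lora_alpha, and lora_dropout are assumed values, not the ones used.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Open-Orca/OpenOrca-Platypus2-13B"  # the base model named in the README
)
lora_config = LoraConfig(
    r=16,                                 # assumed LoRA rank
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,                    # assumed dropout
    target_modules=["q_proj", "v_proj"],  # stated in the README
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```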
 
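For completeness, here is a minimal zero-shot inference sketch matching the prompt-based usage the README describes. The repository id is a placeholder and the prompt wording is illustrative; substitute this model's actual repo id and the prompts provided in the README.

```python
# Hypothetical inference sketch; the model id below is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<jellyfish-13b-repo-id>"  # placeholder, not a real repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "You are an AI assistant that helps with data preprocessing tasks.\n"
    "Determine whether Record A and Record B refer to the same entity. "
    "Answer with 'Yes' or 'No'.\n"
    "Record A: [name: iPhone 12 64GB, brand: Apple]\n"
    "Record B: [name: Apple iPhone 12 (64 GB), brand: Apple]\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8, do_sample=False)
# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```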