Update README.md
README.md
CHANGED
@@ -12,10 +12,10 @@ language:
 Jellyfish-13B is a large language model with 13 billion parameters, designed specifically for data management and preprocessing tasks, such as entity matching, data imputation, error detection, and schema matching.
 
 We fine-tuned [Open-Orca/OpenOrca-Platypus2-13B](https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B) using datasets related to data preprocessing tasks.
-Its performance is competitive, standing up well against prior state-of-the-art algorithms and LLMs such as OpenAI GPT-3.5 and GPT-4 (evaluated by our previous work
+Its performance is competitive, standing up well against prior state-of-the-art algorithms and LLMs such as OpenAI GPT-3.5 and GPT-4 ([evaluated by our previous work](https://arxiv.org/abs/2308.16361)).
 Note that Jellyfish is only a 13B model and can be run locally for low cost and data security.
 
-| Task | Dataset | Non-LLM SoTA | GPT-3.5 | GPT-4 | Jellyfish-13B | Jellyfish-13B-Reasoning |
+| Task | Dataset | Non-LLM SoTA<sup>1</sup> | GPT-3.5<sup>2</sup> | GPT-4<sup>2</sup> | Jellyfish-13B | Jellyfish-13B-Reasoning |
 | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
 | Entity Matching | Fodors-Zagats | 100 | 100 | 100 | 100 | 100 |
 | Entity Matching | Beer | 94.37 | 96.30 | 100 | 93.33 | 100 |
@@ -24,12 +24,19 @@ Note that Jellyfish is only a 13B model and can be run locally for low cost and
 | Entity Matching | DBLP-ACM | 98.99 | 96.99 | 97.44 | 97.35 | 95.74 |
 | Entity Matching | DBLP-GoogleScholar | 95.60 | 76.12 | 91.87 | 92.83 | 89.45 |
 | Entity Matching | Amazon-Google | 75.58 | 66.53 | 74.21 | 72.69 | 56.64 |
-| Imputation
-| Imputation
+| Data Imputation | Restaurant | 77.20 | 94.19 | 97.67 | 94.19 | 93.02 |
+| Data Imputation | Buy | 96.50 | 98.46 | 100 | 100 | 100 |
 | Error Detection | Hospital | 99.10 | 90.74 | 90.74 | 92.21 | 65.66 |
 | Error Detection | Adult | 94.40 | 92.01 | 92.01 | 96.62 | 90.13 |
-| Schema Matching | Synthea | 38.50 | 57.14 | 66.67 | 36.36 | 30.77 |
-
+| Schema Matching | Synthea | 38.50 | 57.14 | 66.67 | 36.36 | 30.77 |
+
+_Accuracy is the metric for data imputation; the F1 score is used for the other tasks._
+1.
+[Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
+[SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching
+[HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection
+[HoloClean](https://arxiv.org/abs/1702.00820) for Data Imputation
+2. [Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)
 
 We release two versions of Jellyfish: Jellyfish-13B (the main branch) and Jellyfish-13B-Reasoning.
 As the names suggest, Jellyfish-13B focuses on providing accurate, direct answers.