metadata
license: cc-by-nc-4.0
language:
  - en

Jellyfish-13B


Model Details

Jellyfish-13B is a large language model with 13 billion parameters, designed specifically for data management and preprocessing tasks such as entity matching, data imputation, error detection, and schema matching.

We fine-tuned Open-Orca/OpenOrca-Platypus2-13B on datasets for data preprocessing tasks. Its performance is competitive with prior state-of-the-art algorithms and with LLMs such as OpenAI GPT-3.5 and GPT-4 (evaluated by our previous work, https://arxiv.org/abs/2205.09911). Because Jellyfish is only a 13B model, it can be run locally, which keeps costs low and helps with data security.

| Task | Dataset | Non-LLM SoTA | GPT-3.5 | GPT-4 | Jellyfish-13B | Jellyfish-13B-Reasoning |
|---|---|---|---|---|---|---|
| Entity Matching | Fodors-Zagats | 100 | 100 | 100 | 100 | 100 |
| Entity Matching | Beer | 94.37 | 96.30 | 100 | 93.33 | 100 |
| Entity Matching | iTunes-Amazon | 97.06 | 96.43 | 100 | 96.30 | 96.15 |
| Entity Matching | Walmart-Amazon | 86.76 | 86.17 | 90.27 | 80.71 | 85.16 |
| Entity Matching | DBLP-ACM | 98.99 | 96.99 | 97.44 | 97.35 | 95.74 |
| Entity Matching | DBLP-GoogleScholar | 95.60 | 76.12 | 91.87 | 92.83 | 89.45 |
| Entity Matching | Amazon-Google | 75.58 | 66.53 | 74.21 | 72.69 | 56.64 |
| Imputation | Restaurant | 77.20 | 94.19 | 97.67 | 94.19 | 93.02 |
| Imputation | Buy | 96.50 | 98.46 | 100 | 100 | 100 |
| Error Detection | Hospital | 99.10 | 90.74 | 90.74 | 92.21 | 65.66 |
| Error Detection | Adult | 94.40 | 92.01 | 92.01 | 96.62 | 90.13 |
| Schema Matching | Synthea | 38.50 | 57.14 | 66.67 | 36.36 | 30.77 |

We have released two versions of Jellyfish: Jellyfish-13B and Jellyfish-13B-Reasoning. As the names suggest, Jellyfish-13B focuses on providing accurate, direct answers. In contrast, Jellyfish-13B-Reasoning distills knowledge from GPT-4: it is fine-tuned on data containing reasons and chain-of-thought responses, generated by GPT-4, for solving data preprocessing tasks.

The Jellyfish paper is coming soon!

  • Developed by: Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada
  • Contact: [email protected]
  • Funded by: NEC Corporation, Osaka University
  • Language(s) (NLP): English
  • License: Non-Commercial Creative Commons license (CC BY-NC-4.0)
  • Finetuned from model: Open-Orca/OpenOrca-Platypus2-13B

Prompt Template

### Instruction:

<prompt> (without the <>)

### Response:
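
As a minimal sketch, the template above can be filled in with a small helper. The function name `build_prompt` and the exact blank-line spacing are assumptions read off the template as displayed here, not something the model card specifies.

```python
def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the Instruction/Response template.

    Assumption: one blank line separates each header from its body,
    matching the template as it is displayed in this README.
    """
    return f"### Instruction:\n\n{instruction}\n\n### Response:\n\n"


if __name__ == "__main__":
    # The model's generation is expected to continue after "### Response:".
    print(build_prompt("Are record A and record B the same entity?"))
```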

Training Details

Training Data

We used the training and validation sets from the paper Can Foundation Models Wrangle Your Data? to fine-tune Jellyfish. The original datasets are available at HazyResearch/fm_data_tasks. We revised this data and constructed an instruction-tuning dataset suitable for fine-tuning LLMs, mirroring the style of OpenOrca.

Training Method

We used LoRA to speed up the training process, targeting the q_proj and v_proj modules.
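
To give a rough sense of why this speeds up training, the sketch below counts the extra trainable parameters LoRA would add when only q_proj and v_proj are targeted. The rank of 8 is a hypothetical value (the rank is not stated here), and the hidden size (5120) and layer count (40) are the Llama-2-13B architecture values, assumed to carry over to the base model.

```python
# Module names come from the README; everything else is an assumption.
TARGET_MODULES = ("q_proj", "v_proj")
HIDDEN_SIZE = 5120   # assumed: Llama-2-13B hidden size
NUM_LAYERS = 40      # assumed: Llama-2-13B decoder layers
ASSUMED_RANK = 8     # hypothetical: the LoRA rank is not given in the README


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def total_lora_params(rank: int = ASSUMED_RANK) -> int:
    """Total adapter parameters when every layer's target modules get an adapter."""
    per_module = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, rank)
    return per_module * len(TARGET_MODULES) * NUM_LAYERS


if __name__ == "__main__":
    print(total_lora_params())  # 6553600 under the assumptions above
```

Under these assumptions the adapters hold about 6.6 million trainable parameters, a tiny fraction of the 13 billion frozen base weights.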

Uses

Here are the prompts we used for both fine-tuning and inference. Feel free to explore different prompts on your own to achieve the best generation quality.

For Jellyfish-13B

You are tasked with determining whether two records listed below are the same based on the information provided. Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.

Note: Missing values (N/A or "nan") should not be used as a basis for your decision.

Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
Product B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]

Are record A and record B the same entity? Choose your answer from: [Yes, No]
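
The Jellyfish-13B entity matching prompt above could be assembled from two records along these lines. This is only a sketch: the helper names and the sample records are made up for illustration, and the "Record A" / "Product B" labels are copied verbatim from the prompt as given.

```python
def format_record(record: dict) -> str:
    """Render a record as [attr: value, attr: value, ...]."""
    return "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"


def build_matching_prompt(record_a: dict, record_b: dict) -> str:
    """Fill the entity matching prompt; attribute names are taken from record_a."""
    attributes = ", ".join(record_a.keys())
    return (
        "You are tasked with determining whether two records listed below are "
        "the same based on the information provided. "
        f"Carefully compare the {attributes} for each record before making "
        "your decision.\n\n"
        'Note: Missing values (N/A or "nan") should not be used as a basis '
        "for your decision.\n\n"
        f"Record A: {format_record(record_a)}\n"
        f"Product B: {format_record(record_b)}\n\n"
        "Are record A and record B the same entity? "
        "Choose your answer from: [Yes, No]"
    )


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    a = {"name": "golden ale", "brewery": "acme"}
    b = {"name": "golden ale 6pk", "brewery": "nan"}
    print(build_matching_prompt(a, b))
```

The resulting string is the instruction body; it still needs to be wrapped in the prompt template from the Prompt Template section before being sent to the model.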

For Jellyfish-13B-Reasoning

You are tasked with determining whether two products listed below are the same based on the information provided. Carefully examine all the attributes before making your decision.

Note: Missing values (N/A or "nan") should not be used as a basis for your decision.

Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]
Product B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]

Are record A and record B the same entity?

After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
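
Because the reasoning prompt asks for the final answer on its own last line, a response can be parsed with a small helper such as the one below. This is a sketch, not part of the released model: `parse_final_answer` is a hypothetical name, and tolerating trailing punctuation or brackets is an assumption about how the model might format its answer.

```python
from typing import Optional


def parse_final_answer(response: str) -> Optional[bool]:
    """Return True for Yes, False for No, None if the last line is neither."""
    lines = [line.strip() for line in response.strip().splitlines() if line.strip()]
    if not lines:
        return None
    # Assumption: the answer may carry stray punctuation, quotes, or brackets.
    last = lines[-1].strip(" .\"'[]").lower()
    if last == "yes":
        return True
    if last == "no":
        return False
    return None


if __name__ == "__main__":
    # Made-up response text, for illustration only.
    print(parse_final_answer("The brewery names differ.\nNo"))
```

Returning None rather than guessing lets the caller decide how to handle responses that do not follow the requested format.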

Bias, Risks, and Limitations

Citation