metadata

license: cc-by-nc-4.0
language:
  - en

Jellyfish-13B

Model Details

Jellyfish-13B is a large language model with 13 billion parameters, designed specifically for data management and preprocessing tasks such as entity matching, data imputation, error detection, and schema matching.

We fine-tuned Open-Orca/OpenOrca-Platypus2-13B on datasets related to data preprocessing tasks. Its performance is competitive with prior state-of-the-art algorithms and with LLMs such as OpenAI GPT-3.5 and GPT-4 (evaluated by our previous work, https://arxiv.org/abs/2205.09911). Because Jellyfish is only a 13B model, it can be run locally, which keeps cost low and helps with data security.

Task	Dataset	Non-LLM SoTA	GPT-3.5	GPT-4	Jellyfish-13B	Jellyfish-13B-Reasoning
Entity Matching	Fodors-Zagats	100	100	100	100	100
Entity Matching	Beer	94.37	96.30	100	93.33	100
Entity Matching	iTunes-Amazon	97.06	96.43	100	96.30	96.15
Entity Matching	Walmart-Amazon	86.76	86.17	90.27	80.71	85.16
Entity Matching	DBLP-ACM	98.99	96.99	97.44	97.35	95.74
Entity Matching	DBLP-GoogleScholar	95.60	76.12	91.87	92.83	89.45
Entity Matching	Amazon-Google	75.58	66.53	74.21	72.69	56.64
Imputation	Restaurant	77.20	94.19	97.67	94.19	93.02
Imputation	Buy	96.50	98.46	100	100	100
Error Detection	Hospital	99.10	90.74	90.74	92.21	65.66
Error Detection	Adult	94.40	92.01	92.01	96.62	90.13
Schema Matching	Synthea	38.50	57.14	66.67	36.36	30.77

We have released two versions of Jellyfish: Jellyfish-13B and Jellyfish-13B-Reasoning. As the names suggest, Jellyfish-13B focuses on providing accurate, direct answers. In contrast, Jellyfish-13B-Reasoning distills knowledge from GPT-4: it is fine-tuned on data containing GPT-4-generated reasoning and chain-of-thought responses for solving data preprocessing tasks.

The Jellyfish paper is coming soon!

Developed by: Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada
Contact: [email protected]
Funded by: NEC Corporation, Osaka University
Language(s) (NLP): English
License: Non-Commercial Creative Commons license (CC BY-NC-4.0)
Finetuned from model: Open-Orca/OpenOrca-Platypus2-13B

Prompt Template

### Instruction:

<prompt> (without the <>)

### Response:
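
As a minimal sketch of how the template above could be filled in from code, the hypothetical helper `build_prompt` below wraps a task instruction in the two headers. The blank lines mirror the template as shown; the exact trailing whitespace after `### Response:` is an assumption.

```python
def build_prompt(instruction: str) -> str:
    """Wrap a task instruction in the Instruction/Response template.

    The layout follows the template shown above; the single trailing
    newline after "### Response:" is an assumption, not a documented detail.
    """
    return f"### Instruction:\n\n{instruction.strip()}\n\n### Response:\n"


if __name__ == "__main__":
    print(build_prompt("Are record A and record B the same entity?"))
```

The returned string can be passed to whatever text-generation interface is used to run the model.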

Training Details

Training Data

We used the training and validation sets from the paper Can Foundation Models Wrangle Your Data? to fine-tune Jellyfish. The original datasets are available at HazyResearch/fm_data_tasks. We revised this data and constructed an instruction-tuning dataset suitable for fine-tuning an LLM, mirroring the style of OpenOrca.

Training Method

We used LoRA to speed up the training process, targeting the q_proj and v_proj modules.
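
To illustrate what targeting only q_proj and v_proj means, here is a small sketch in plain Python. It assumes Llama-style dotted module paths (for example `model.layers.0.self_attn.q_proj`) and uses a hypothetical rank and layer width purely for the arithmetic; none of these numbers are the settings used to train Jellyfish.

```python
# Module names taken from the text above; everything else is illustrative.
TARGET_MODULES = ("q_proj", "v_proj")


def is_lora_target(module_name: str) -> bool:
    """True if the last component of a dotted module path is a target name."""
    return module_name.rsplit(".", 1)[-1] in TARGET_MODULES


def lora_extra_params(rank: int, d_in: int, d_out: int) -> int:
    """Trainable parameters added by one adapter.

    LoRA adds two low-rank matrices, A (rank x d_in) and B (d_out x rank),
    next to a frozen d_out x d_in weight.
    """
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # Assumed module paths for one attention block.
    modules = [
        "model.layers.0.self_attn.q_proj",
        "model.layers.0.self_attn.k_proj",
        "model.layers.0.self_attn.v_proj",
        "model.layers.0.self_attn.o_proj",
    ]
    print([m for m in modules if is_lora_target(m)])
    # Hypothetical rank 8 on an assumed 5120 x 5120 projection.
    print(lora_extra_params(rank=8, d_in=5120, d_out=5120))
```

Only the selected projections receive trainable low-rank matrices; all other weights stay frozen, which is where the training speed-up comes from.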

Uses

Here are the entity matching prompts we used for both fine-tuning and inference. Feel free to explore different prompts on your own to achieve the best generation quality.

For Jellyfish-13B

You are tasked with determining whether two records listed below are the same based on the information provided. Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.

Note: Missing values (N/A or "nan") should not be used as a basis for your decision.

Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]\nProduct B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]

Are record A and record B the same entity? Choose your answer from: [Yes, No]
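
The placeholders in the prompt above can be filled programmatically. The sketch below is one possible way to do so; the helper names `format_record` and `entity_matching_prompt` are hypothetical, and records are assumed to be dictionaries sharing the same attribute names. The second record is labeled "Product B" because that is how the template above is written.

```python
def format_record(record: dict) -> str:
    """Render a record as [attr: value, attr: value, ...]."""
    return "[" + ", ".join(f"{key}: {value}" for key, value in record.items()) + "]"


def entity_matching_prompt(record_a: dict, record_b: dict) -> str:
    """Fill the Jellyfish-13B entity matching template for two records."""
    attributes = ", ".join(record_a.keys())
    return (
        "You are tasked with determining whether two records listed below are "
        "the same based on the information provided. Carefully compare the "
        f"{attributes} for each record before making your decision.\n\n"
        'Note: Missing values (N/A or "nan") should not be used as a basis '
        "for your decision.\n\n"
        f"Record A: {format_record(record_a)}\n"
        # Label kept as in the template above.
        f"Product B: {format_record(record_b)}\n\n"
        "Are record A and record B the same entity? "
        "Choose your answer from: [Yes, No]"
    )


if __name__ == "__main__":
    a = {"name": "pale ale", "brewery": "example brewing", "abv": "5.2%"}
    b = {"name": "pale ale", "brewery": "example brewing co.", "abv": "nan"}
    print(entity_matching_prompt(a, b))
```

The result is the task instruction only; it still needs to be placed inside the Instruction/Response prompt template before being sent to the model.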

For Jellyfish-13B-Reasoning

You are tasked with determining whether two products listed below are the same based on the information provided. Carefully examine all the attributes before making your decision.

Note: Missing values (N/A or "nan") should not be used as a basis for your decision.

Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]\nProduct B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}...]

Are record A and record B the same entity?

After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
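
Because this prompt asks the model to put its final answer on a separate last line, the answer can be recovered with a small parser. The following is a sketch under that assumption; the helper name `extract_final_answer` is hypothetical, and real model output may need more tolerant handling.

```python
from typing import Optional


def extract_final_answer(response: str) -> Optional[bool]:
    """Read the Yes/No verdict from the last non-empty line of a response.

    Returns True for "Yes", False for "No", and None when the last line
    is neither (for example, if the model did not follow the format).
    """
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    if not lines:
        return None
    # Ignore surrounding quotes, spaces, and a trailing period.
    last = lines[-1].strip(" .\"'").lower()
    if last == "yes":
        return True
    if last == "no":
        return False
    return None


if __name__ == "__main__":
    sample = "The names match but the brewery differs slightly.\n\nYes"
    print(extract_final_answer(sample))
```

Returning None rather than guessing makes format violations visible, so they can be counted or retried separately.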
