
The Data Science Coder

Data Science Coder is a family of fine-tuned models designed to help with coding for data science applications. It comes in two variants: 1.3b and 6.7b. The models are fine-tuned from the DeepSeek Coder instruct versions. Fine-tuning was performed on the ed001/ds-coder-instruct-v1 dataset, which was constructed by filtering publicly available datasets on Hugging Face.

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

def build_instruction_prompt(instruction):
    return '''
    You are the Data Science Coder, a helpful AI assistant created by a man named Ed.
    You help people with data science coding and you answer questions about data science in a helpful manner.
    ### Instruction:
    {}
    ### Response:
    '''.format(instruction.strip()).lstrip()

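# Load the tokenizer and model, then move the model to the GPU (requires CUDA).
tokenizer = AutoTokenizer.from_pretrained("ed001/datascience-coder-6.7b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("ed001/datascience-coder-6.7b", trust_remote_code=True).cuda()
# Note: top_p only takes effect when sampling is enabled (do_sample=True).
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=1024, top_p=0.95)
result = pipe(build_instruction_prompt("Perform EDA on the Iris dataset"))
# generated_text contains the prompt followed by the model's completion.
print(result[0]['generated_text'])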
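
By default the text-generation pipeline returns the prompt together with the completion, so the printed text starts with the system message and instruction. A minimal sketch of pulling out only the answer is shown below. The helper name `extract_response` and the sample string are hypothetical and not part of the model's API; the helper simply assumes the `### Response:` marker from the prompt template above.

```python
def extract_response(generated_text: str, marker: str = "### Response:") -> str:
    """Return the text after the first response marker.

    Hypothetical helper: if the marker is missing, the full text is
    returned unchanged (apart from stripping surrounding whitespace).
    """
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()


# Illustrative stand-in for result[0]['generated_text'], not real model output.
sample = "### Instruction:\nPerform EDA on the Iris dataset\n### Response:\n  import pandas as pd\n"
answer = extract_response(sample)
print(answer)
```

Passing `return_full_text=False` when calling the pipeline should be an alternative that avoids the string handling altogether.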

Training Details

lora_r: 16
lora_alpha: 8
lora_dropout: 0.05
target_modules: q, k, v, o, gate_proj, down_proj, up_proj, lm_head
weight_decay: 0
optimizer: paged_adamw_32bit
lr: 1e-4
lr_scheduler: cosine
max_seq_len: 4096
batch_size: 4
max_grad_norm: 0.5
warmup_ratio: 0.05
num_epochs: 1

The model was trained on the Python subset of the ds-coder-instruct dataset.
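
To make the hyperparameters above more concrete, the sketch below derives two quantities from them: the LoRA scaling factor (`lora_alpha / lora_r` in the standard LoRA formulation) and the length of the warmup phase. The dataset size `num_examples` is a made-up number for illustration only. The sketch also assumes that `batch_size` is the effective batch size with no gradient accumulation and that warmup steps are rounded to the nearest integer; the actual training script may differ.

```python
import math

# Values copied from the Training Details list.
hparams = {
    "lora_r": 16,
    "lora_alpha": 8,
    "batch_size": 4,
    "warmup_ratio": 0.05,
    "num_epochs": 1,
}


def lora_scale(lora_alpha: int, lora_r: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA: alpha / r."""
    return lora_alpha / lora_r


def schedule_lengths(num_examples: int, batch_size: int, num_epochs: int,
                     warmup_ratio: float) -> tuple[int, int]:
    """Return (total optimizer steps, warmup steps).

    Assumes one optimizer step per batch (no gradient accumulation).
    """
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return total_steps, round(total_steps * warmup_ratio)


scale = lora_scale(hparams["lora_alpha"], hparams["lora_r"])

# Hypothetical dataset size, NOT the real size of the training subset.
num_examples = 10_000
total_steps, warmup = schedule_lengths(
    num_examples, hparams["batch_size"], hparams["num_epochs"], hparams["warmup_ratio"]
)
print(f"LoRA scale: {scale}, total steps: {total_steps}, warmup steps: {warmup}")
```

With `lora_alpha` smaller than `lora_r`, the low-rank update is scaled down by a factor of 0.5 before being added to the frozen weights.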

Samples

Contact

GitHub: Ea0011

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 41.99 |
| AI2 Reasoning Challenge (25-Shot) | 34.64 |
| HellaSwag (10-Shot) | 53.83 |
| MMLU (5-Shot) | 37.96 |
| TruthfulQA (0-Shot) | 44.82 |
| Winogrande (5-Shot) | 55.72 |
| GSM8k (5-Shot) | 24.94 |
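
The reported average appears consistent with an unweighted mean of the six benchmark scores. The short check below recomputes it from the values listed above; the dictionary keys are shortened labels chosen for this sketch.

```python
from statistics import mean

# Scores copied from the results listed above.
scores = {
    "ARC (25-shot)": 34.64,
    "HellaSwag (10-shot)": 53.83,
    "MMLU (5-shot)": 37.96,
    "TruthfulQA (0-shot)": 44.82,
    "Winogrande (5-shot)": 55.72,
    "GSM8k (5-shot)": 24.94,
}

average = mean(scores.values())
# The unrounded mean is about 41.985, which matches the reported 41.99
# to within rounding.
print(f"Mean of six benchmarks: {average:.3f}")
```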