File size: 5,866 Bytes

3b54ba5
 
 
 
 
 
7ff1bba
3b54ba5
342632f
 
3b54ba5
 
 
f9284f2
 
3b54ba5
 
7ff1bba
3b54ba5
7ff1bba
15c52cf
5fae20c
3b54ba5
7ff1bba
 
15c52cf
 
 
 
4532178
 
 
 
 
 
 
 
 
 
 
 
 
 
15c52cf
 
 
7ff1bba
15c52cf
 
 
4532178
15c52cf
7ff1bba
15c52cf
7ff1bba
15c52cf
 
3b54ba5
4532178
3b54ba5
 
 
15c52cf
3b54ba5
 
4532178
3b54ba5
 
 
 
 
 
 
4532178
3b54ba5
 
 
 
 
 
 
 
 
 
 
 
 
 
4532178
3b54ba5
6e38502
3b54ba5
 
20ece27
 
 
 
 
 
5f1cb13
 
20ece27
 
 
3b54ba5
 
 
4532178
3b54ba5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
4532178
3b54ba5
4532178
3b54ba5
 
f98fee6
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3b54ba5
4532178
3b54ba5
f98fee6
3b54ba5
f98fee6
 
 
 
 
3b54ba5
f98fee6
3b54ba5
f98fee6
 
 
 
 
 
 
3b54ba5
 
f98fee6
3b54ba5
f98fee6
 
 
 
3b54ba5

---
base_model: deepseek-ai/deepseek-coder-6.7b-instruct
tags:
- instruct
- finetune
model-index:
- name: NaturalSQL-6.7B-v0.1
  results: []
license: other
license_name: deepseek
language:
- en
datasets:
- cfahlgren1/wiki-sql-codellama-expanded
- cfahlgren1/natural-sql
---

# **NaturalSQL-6.7B-v0.1**

### NaturalSQL is a series of models with state-of-the-art performance on Text to SQL instructions.

<img src="https://cdn-uploads.huggingface.co/production/uploads/648a374f00f7a3374ee64b99/hafdsfrFCqrVbATIzV_EN.png" width="600">

**NaturalSQL** is a LLM that can translate natural language queries to SQL based on your schema. It is finetuned on 8k text to PostgreSQL Natural Language <> SQL pairs.
NaturalSQL matches the state of the art models in text to sql for it's size and produces the best in the field for complex questions. 

Here is a write up, small test done [here](https://chatdb.ai/post/naturalsql-vs-sqlcoder-for-text-to-sql).

# Table of Contents
1. [Benchmarks](#benchmarks)
   - [SQL-Eval on novel datasets not seen in training](#sql-eval-on-novel-datasets-not-seen-in-training)
   - [SQL-Eval by SQL Category](#sql-eval-by-sql-category)
2. [Future Improvements](#future-improvements)
3. [Usage](#usage)
   - [Loading the Model](#loading-the-model)
   - [Generating Text](#generating-text)
4. [SQL Generation Template](#sql-generation-template)
5. [Example SQL Output](#example-sql-output)
   - [Example Schemas](#example-schemas)
   - [Example SQL Outputs](#example-sql-outputs)
  
     
## Benchmarks

## SQL-Eval on novel datasets not seen in training

<img src="https://cdn-uploads.huggingface.co/production/uploads/648a374f00f7a3374ee64b99/IZpVHJJzmapQ0gKV0Fb3q.png" width="800">

<em>Big thanks to the [defog](https://huggingface.co/defog) team for open sourcing [sql-eval](https://github.com/defog-ai/sql-eval)</em>👏

## SQL-Eval by SQL Category

**NaturalSQL-6.7B-v0 matches or outperforms other industry leading models in multiple categories!**

<img src="https://cdn-uploads.huggingface.co/production/uploads/648a374f00f7a3374ee64b99/waXAf7crQQSqITp07BwEs.png" width="800">

_The **date** category will be a strong focus in the next iteration of `v1`._

## Future Improvements

- Much larger training set
- More complex schemas, questions, and queries
- Strong focus on Date Queries
- Reward modeling via DPO

# Usage

Make sure you have the correct version of the transformers library installed:

```sh
pip install transformers==4.35.2
```

### Loading the Model

Use the following Python code to load the model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("cfahlgren1/NaturalSQL-6.7B-v0")
model = AutoModelForCausalLM.from_pretrained(
    "cfahlgren1/NaturalSQL-6.7B-v0",
    device_map="auto",
    torch_dtype=torch.float16,
)
```

### Generating Text

To generate text, use the following Python code. [Here](https://gist.github.com/cfahlgren1/ba17f01cf688c4229686dc3dfb4d4549) is a full notebook with the SQL table prompt format to use.

```python
messages=[
    { 'role': 'user', 'content': prompt}
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# 32023 is the id of <|EOT|> token
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, top_p=0.95, num_return_sequences=1, eos_token_id=32023)

print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))

```


# SQL Generation Template

```
### Task 

Generate a SQL query to answer the following question: `{natural language question}` 

### Database Schema 

The query will run on a database with the following schema: 

'''
<SQL Table DDL Statements>
'''

### Answer 
Here is the SQL query that answers the question: `{natural language question}` 
'''sql
```

# Example SQL Output

### Example Schemas

```sql
CREATE TABLE users (
        user_id SERIAL PRIMARY KEY,
        username VARCHAR(50) NOT NULL,
        email VARCHAR(100) NOT NULL,
        password_hash TEXT NOT NULL,
        created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

CREATE TABLE projects (
    project_id SERIAL PRIMARY KEY,
    project_name VARCHAR(100) NOT NULL,
    description TEXT,
    start_date DATE,
    end_date DATE,
    owner_id INTEGER REFERENCES users(user_id)
);

CREATE TABLE tasks (
    task_id SERIAL PRIMARY KEY,
    task_name VARCHAR(100) NOT NULL,
    description TEXT,
    due_date DATE,
    status VARCHAR(50),
    project_id INTEGER REFERENCES projects(project_id)
);

CREATE TABLE taskassignments (
    assignment_id SERIAL PRIMARY KEY,
    task_id INTEGER REFERENCES tasks(task_id),
    user_id INTEGER REFERENCES users(user_id),
    assigned_date DATE NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE comments (
    comment_id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    task_id INTEGER REFERENCES tasks(task_id),
    user_id INTEGER REFERENCES users(user_id)
);
```
### Example SQL Outputs

**Question**: **Show me the day with the most users joining**
```sql
SELECT created_at::DATE AS day, COUNT(*) AS user_count
FROM users
GROUP BY day
ORDER BY user_count DESC
LIMIT 1;
```
**Question**: **Show me the project that has a task with the most comments**
```sql
SELECT p.project_name, t.task_name, COUNT(c.comment_id) AS comment_count
FROM projects p
JOIN tasks t ON p.project_id = t.project_id
JOIN comments c ON t.task_id = c.task_id
GROUP BY p.project_name, t.task_name
ORDER BY comment_count DESC
LIMIT 1;
```

**Question**: **What is the ratio of users with gmail addresses vs without?**
```sql
SELECT 
    SUM(CASE WHEN email ILIKE '%@gmail.com%' THEN 1 ELSE 0 END)::FLOAT / NULLIF(SUM(CASE WHEN email NOT ILIKE '%@gmail.com%' THEN 1 ELSE 0 END), 0) AS gmail_ratio
FROM 
    users;
```