---
base_model: deepseek-ai/deepseek-coder-6.7b-instruct
tags:
- SOLAR
- instruct
- finetune
model-index:
- name: NaturalQuery-Solar-6.7B-v0.1
results: []
license: apache-2.0
language:
- en
datasets:
- cfahlgren1/wiki-sql-codellama-expanded
- cfahlgren1/natural-sql
---
# **NaturalQuery-Solar-6.7B-v0.1**
**NaturalQuery** is an LLM that translates natural language questions into SQL for your own schema.
NaturalQuery-v0.1 is fine-tuned on 8k natural language <> PostgreSQL text-to-SQL pairs.
**Future Improvements**:
- Much larger training set
- More complex schemas, questions, and queries
- Reward modeling via DPO
- Benchmarking
# **Usage**
Make sure you have the correct version of the transformers library installed:
```sh
pip install transformers==4.35.2
```
### **Loading the Model**
Use the following Python code to load the model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("cfahlgren1/NaturalSQL-6.7B-v0")
model = AutoModelForCausalLM.from_pretrained(
"cfahlgren1/NaturalSQL-6.7B-v0",
device_map="auto",
torch_dtype=torch.float16,
)
```
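If GPU memory is tight, the model can also be loaded in 4-bit. This is an optional sketch, not part of the original instructions, and assumes `bitsandbytes` is installed (`pip install bitsandbytes`):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Optional 4-bit quantization config (reduces VRAM at a small quality cost)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("cfahlgren1/NaturalSQL-6.7B-v0")
model = AutoModelForCausalLM.from_pretrained(
    "cfahlgren1/NaturalSQL-6.7B-v0",
    device_map="auto",
    quantization_config=bnb_config,
)
```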
### **Generating Text**
To generate text, use the following Python code. [Here](https://gist.github.com/cfahlgren1/ba17f01cf688c4229686dc3dfb4d4549) is a full notebook with the SQL table prompt format to use.
```python
# `prompt` should contain the question and schema, formatted with the SQL Generation Template below
messages = [
    {'role': 'user', 'content': prompt}
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
# 32023 is the id of the <|EOT|> token; with do_sample=False decoding is greedy, so top_k/top_p have no effect
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, top_p=0.95, num_return_sequences=1, eos_token_id=32023)
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))
```
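Because the prompt ends with an opening `'''sql` fence (see the template below), the completion is usually the SQL query followed by a closing `'''`. A minimal, assumption-based way to pull out just the query:
```python
generated = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
# Keep only the text before the closing ''' fence, if the model emits one (assumption)
sql_query = generated.split("'''")[0].strip()
print(sql_query)
```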
# **SQL Generation Template**
```
### Task
Generate a SQL query to answer the following question: `{natural language question}`
### Database Schema
The query will run on a database with the following schema:
'''
<SQL Table DDL Statements>
'''
### Answer
Here is the SQL query that answers the question: `{natural language question}`
'''sql
```
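For reference, here is a minimal sketch of filling the template above in Python; the helper name `build_prompt` and the example question are illustrative only.
```python
def build_prompt(question: str, schema_ddl: str) -> str:
    # Fill the SQL Generation Template with the question and the schema DDL statements
    return f"""### Task
Generate a SQL query to answer the following question: `{question}`

### Database Schema
The query will run on a database with the following schema:
'''
{schema_ddl}
'''

### Answer
Here is the SQL query that answers the question: `{question}`
'''sql
"""

# Example usage (hypothetical): pass the result as `prompt` in the chat message above
schema = "CREATE TABLE users (user_id SERIAL PRIMARY KEY, username VARCHAR(50) NOT NULL);"
prompt = build_prompt("Show me the day with the most users joining", schema)
```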
# **Example SQL Output**
### **Example Schemas**
```sql
CREATE TABLE users (
user_id SERIAL PRIMARY KEY,
username VARCHAR(50) NOT NULL,
email VARCHAR(100) NOT NULL,
password_hash TEXT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE projects (
project_id SERIAL PRIMARY KEY,
project_name VARCHAR(100) NOT NULL,
description TEXT,
start_date DATE,
end_date DATE,
owner_id INTEGER REFERENCES users(user_id)
);
CREATE TABLE tasks (
task_id SERIAL PRIMARY KEY,
task_name VARCHAR(100) NOT NULL,
description TEXT,
due_date DATE,
status VARCHAR(50),
project_id INTEGER REFERENCES projects(project_id)
);
CREATE TABLE taskassignments (
assignment_id SERIAL PRIMARY KEY,
task_id INTEGER REFERENCES tasks(task_id),
user_id INTEGER REFERENCES users(user_id),
assigned_date DATE NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE comments (
comment_id SERIAL PRIMARY KEY,
content TEXT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
task_id INTEGER REFERENCES tasks(task_id),
user_id INTEGER REFERENCES users(user_id)
);
```
**Question**: **Show me the day with the most users joining**
```sql
SELECT created_at::DATE AS day, COUNT(*) AS user_count
FROM users
GROUP BY day
ORDER BY user_count DESC
LIMIT 1;
```
**Question**: **Show me the project that has a task with the most comments**
```sql
SELECT p.project_name, t.task_name, COUNT(c.comment_id) AS comment_count
FROM projects p
JOIN tasks t ON p.project_id = t.project_id
JOIN comments c ON t.task_id = c.task_id
GROUP BY p.project_name, t.task_name
ORDER BY comment_count DESC
LIMIT 1;
```
**Question**: **What is the ratio of users with gmail addresses vs without?**
```sql
SELECT
SUM(CASE WHEN email ILIKE '%@gmail.com%' THEN 1 ELSE 0 END)::FLOAT / NULLIF(SUM(CASE WHEN email NOT ILIKE '%@gmail.com%' THEN 1 ELSE 0 END), 0) AS gmail_ratio
FROM
users;
``` |