File size: 2,936 Bytes
3b54ba5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
20ece27
 
 
 
 
 
 
 
 
 
 
3b54ba5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
---
base_model: deepseek-ai/deepseek-coder-6.7b-instruct
tags:
- SOLAR
- instruct
- finetune
model-index:
- name: NaturalQuery-Solar-6.7B-v0.1
  results: []
license: apache-2.0
language:
- en
datasets:
- wikisql
---

# **NaturalQuery-Solar-6.7B-v0.1**

**NaturalQuery** is a LLM that can translate natural language queries to SQL based on your schema.

NaturalQuery-v0.1 is finetuned on 8k text to PostgreSQL Natural Language <> SQL pairs.

**Future Improvements**:

- Much larger training set
- More complex schemas, questions, and queries
- Reward modeling via DPO
- Benchmarking

# **Usage**

Make sure you have the correct version of the transformers library installed:

```sh
pip install transformers==4.35.2
```

### **Loading the Model**

Use the following Python code to load the model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("cfahlgren1/NaturalSQL-6.7B-v0")
model = AutoModelForCausalLM.from_pretrained(
    "cfahlgren1/NaturalSQL-6.7B-v0",
    device_map="auto",
    torch_dtype=torch.float16,
)
```

### **Generating Text**

To generate text, use the following Python code:

```python
messages=[
    { 'role': 'user', 'content': prompt}
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# 32021 is the id of <|EOT|> token
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, top_p=0.95, num_return_sequences=1, eos_token_id=32021)

print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))

```


# **SQL Generation Template**

```
### Task 

Generate a SQL query to answer the following question: `{natural language question}` 

### Database Schema 

The query will run on a database with the following schema: 

'''
<SQL Table DDL Statements>
'''

### Answer 
Here is the SQL query that answers the question: `{natural language question}` 
'''sql
```

# **Example SQL Output**

### **Example Schemas**

```sql
 CREATE TABLE
      table_1_11545282_6 (
        "No." numeric,
        Nationality text,
        "Years for Jazz" text
      );
    
    CREATE TABLE
      table_2_17383560_1 (
        Pick numeric,
        Round numeric,
        Player text,
        "School/Club Team" text,
        Position text
      );
    
    CREATE TABLE
      table_1_10581768_2 (
        Institution text,
        Enrollment numeric,
        Nickname text,
        Founded numeric
      );
```

**Question**: **What is the round of pick 63?**
```sql
SELECT "Round" FROM table_2_17383560_1 WHERE Pick=63;
```
**Question**: **What is the most popular position among players?**
```sql
SELECT COUNT("Position") FROM "table_2_17383560_1" GROUP BY "Position" ORDER BY COUNT("Position") DESC LIMIT 1;
```

**Question**: **What is the most recent year an institution was founded?**
```sql
SELECT MAX("Founded") FROM table_1_10581768_2;
```