NaturalSQL-6.7B-v0 / README.md

cfahlgren1 HF staff

Update README.md

15c52cf verified 9 months ago

preview code

raw

history blame

No virus

5.89 kB

	---
	base_model: deepseek-ai/deepseek-coder-6.7b-instruct
	tags:
	- instruct
	- finetune
	model-index:
	- name: NaturalQuery-6.7B-v0.1
	results: []
	license: other
	license_name: deepseek
	language:
	- en
	datasets:
	- cfahlgren1/wiki-sql-codellama-expanded
	- cfahlgren1/natural-sql
	---

	# NaturalQuery-6.7B-v0.1

	### NaturalQuery is a series of models with state-of-the-art performance on Text to SQL instructions.

	<img src="https://cdn-uploads.huggingface.co/production/uploads/648a374f00f7a3374ee64b99/hafdsfrFCqrVbATIzV_EN.png" width="600">

	NaturalQuery is a LLM that can translate natural language queries to SQL based on your schema. It is finetuned on 8k text to PostgreSQL Natural Language <> SQL pairs.
	NaturalQuery matches the state of the art models in text to sql for it's size and produces the best in the field for complex questions.

	Here is a write up, small test done [here](https://chatdb.ai/post/naturalsql-vs-sqlcoder-for-text-to-sql).

	# Table of Contents
	1. [Benchmarks](#Benchmarks)
	- [SQL-Eval on Novel Datasets](#SQL-Eval-on-Novel-Datasets)
	- [SQL-Eval by Category](#SQL-Eval-by-Category)
	2. [Future Improvements](#Future-Improvements)
	3. [Usage](#Usage)
	- [Installation](#Installation)
	- [Loading the Model](#Loading-the-Model)
	- [Generating Text](#Generating-Text)
	4. [SQL Generation Template](#SQL-Generation-Template)
	5. [Example SQL Output](#Example-SQL-Output)
	- [Example Schemas](#Example-Schemas)
	- [Example Queries](#Example-Queries)

	## Benchmarks

	## SQL-Eval on novel datasets not seen in training

	<img src="https://cdn-uploads.huggingface.co/production/uploads/648a374f00f7a3374ee64b99/omYzbu9ZO8hx7cnUtOZKg.png" width="600">

	<em>Big thanks to the [defog](https://huggingface.co/defog) team for open sourcing [sql-eval](https://github.com/defog-ai/sql-eval)</em>👏

	## SQL-Eval by SQL Category

	NaturalQuery-6.7B-v0 matches or outperforms other industry leading models in multiple categories!

	<img src="https://cdn-uploads.huggingface.co/production/uploads/648a374f00f7a3374ee64b99/G6lGyTbBUmEkKdmJ9r4jp.png" width="600">

	_The date category will be a strong focus in the next iteration of `v1`._

	Future Improvements in the next iteration:

	- Much larger training set
	- More complex schemas, questions, and queries
	- Strong focus on Date Queries
	- Reward modeling via DPO

	# Usage

	Make sure you have the correct version of the transformers library installed:

	```sh
	pip install transformers==4.35.2
	```

	### Loading the Model

	Use the following Python code to load the model:

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer
	tokenizer = AutoTokenizer.from_pretrained("cfahlgren1/NaturalSQL-6.7B-v0")
	model = AutoModelForCausalLM.from_pretrained(
	"cfahlgren1/NaturalSQL-6.7B-v0",
	device_map="auto",
	torch_dtype=torch.float16,
	)
	```

	### Generating Text

	To generate text, use the following Python code. [Here](https://gist.github.com/cfahlgren1/ba17f01cf688c4229686dc3dfb4d4549) is a full notebook with the SQL table prompt format to use.

	```python
	messages=[
	{ 'role': 'user', 'content': prompt}
	]

	inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

	# 32023 is the id of <\|EOT\|> token
	outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, top_p=0.95, num_return_sequences=1, eos_token_id=32023)

	print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))

	```


	# SQL Generation Template

	```
	### Task

	Generate a SQL query to answer the following question: `{natural language question}`

	### Database Schema

	The query will run on a database with the following schema:

	'''
	<SQL Table DDL Statements>
	'''

	### Answer
	Here is the SQL query that answers the question: `{natural language question}`
	'''sql
	```

	# Example SQL Output

	### Example Schemas

	```sql
	CREATE TABLE users (
	user_id SERIAL PRIMARY KEY,
	username VARCHAR(50) NOT NULL,
	email VARCHAR(100) NOT NULL,
	password_hash TEXT NOT NULL,
	created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
	);

	CREATE TABLE projects (
	project_id SERIAL PRIMARY KEY,
	project_name VARCHAR(100) NOT NULL,
	description TEXT,
	start_date DATE,
	end_date DATE,
	owner_id INTEGER REFERENCES users(user_id)
	);

	CREATE TABLE tasks (
	task_id SERIAL PRIMARY KEY,
	task_name VARCHAR(100) NOT NULL,
	description TEXT,
	due_date DATE,
	status VARCHAR(50),
	project_id INTEGER REFERENCES projects(project_id)
	);

	CREATE TABLE taskassignments (
	assignment_id SERIAL PRIMARY KEY,
	task_id INTEGER REFERENCES tasks(task_id),
	user_id INTEGER REFERENCES users(user_id),
	assigned_date DATE NOT NULL DEFAULT CURRENT_TIMESTAMP
	);

	CREATE TABLE comments (
	comment_id SERIAL PRIMARY KEY,
	content TEXT NOT NULL,
	created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
	task_id INTEGER REFERENCES tasks(task_id),
	user_id INTEGER REFERENCES users(user_id)
	);
	```

	Question: Show me the day with the most users joining
	```sql
	SELECT created_at::DATE AS day, COUNT(*) AS user_count
	FROM users
	GROUP BY day
	ORDER BY user_count DESC
	LIMIT 1;
	```
	Question: Show me the project that has a task with the most comments
	```sql
	SELECT p.project_name, t.task_name, COUNT(c.comment_id) AS comment_count
	FROM projects p
	JOIN tasks t ON p.project_id = t.project_id
	JOIN comments c ON t.task_id = c.task_id
	GROUP BY p.project_name, t.task_name
	ORDER BY comment_count DESC
	LIMIT 1;
	```

	Question: What is the ratio of users with gmail addresses vs without?
	```sql
	SELECT
	SUM(CASE WHEN email ILIKE '%@gmail.com%' THEN 1 ELSE 0 END)::FLOAT / NULLIF(SUM(CASE WHEN email NOT ILIKE '%@gmail.com%' THEN 1 ELSE 0 END), 0) AS gmail_ratio
	FROM
	users;
	```