Uploaded model
- Developed by: netsol
- License: apache-2.0
- Finetuned from model: unsloth/gemma-2-2b-it-bnb-4bit
Inference
vLLM - offline
The model can be used directly with vLLM's offline API using the following code snippet:
import json

from vllm import LLM, SamplingParams

# Load the model
llm = LLM(model="netsol/otoz-language-search-gemma-2-2b-it")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)

# Run inference on a natural-language vehicle search query
messages = [
    {"role": "user",
     "content": "Need a pink hyundai elantra that should be at least from 2023 and should cost no more than $2,000,000"},
]
outputs = llm.chat(
    messages,
    sampling_params=sampling_params,
    use_tqdm=False,
)

# Parse the model's JSON output
result = json.loads(outputs[0].outputs[0].text)
print(result)
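The snippet above assumes the model always emits valid JSON. Since sampling at temperature 0.8 can occasionally produce malformed output, a defensive parse may be worthwhile; a minimal sketch:

import json

raw = outputs[0].outputs[0].text
try:
    filters = json.loads(raw)
except json.JSONDecodeError:
    # Fall back so the caller can log the raw text or retry the request
    filters = None
    print(f"Model did not return valid JSON: {raw!r}")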
To apply the chat prompt template manually with the tokenizer and generate from token IDs:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 8192, 1
model_name = "netsol/otoz-language-search-gemma-2-2b-it"
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=4096)
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)

# One conversation per batch entry
message_list = [
    [{"role": "user",
      "content": "Need a pink hyundai elantra that should be at least from 2023 and should cost no more than $2,000,000"}],
]

# Render each conversation into prompt token IDs using the Gemma chat template
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in message_list]
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)
generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
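To see what the chat template actually produces before tokenization, apply_chat_template can also return the rendered prompt string. Continuing from the snippet above:

# Inspect the rendered prompt text instead of token IDs
prompt_text = tokenizer.apply_chat_template(
    message_list[0], tokenize=False, add_generation_prompt=True
)
print(prompt_text)  # shows the chat-template markup (e.g. Gemma's <start_of_turn> tags)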
vLLM - server
To expose the model through vLLM's OpenAI-compatible API server, launch:
python -m vllm.entrypoints.openai.api_server --model "netsol/otoz-language-search-gemma-2-2b-it" --dtype float16
Then query the running server from Python:
import json

import requests

query = "Honda city black 2022 1.2Vs variant"
url = "http://localhost:8000/v1/chat/completions"
headers = {
    "Content-Type": "application/json"
}
data = {
    "model": "netsol/otoz-language-search-gemma-2-2b-it",
    "messages": [{"role": "user", "content": query}]
}
response = requests.post(url, headers=headers, json=data)

# Extract the output text from the response
if response.status_code == 200:
    response_json = response.json()
    output_text = response_json.get('choices', [{}])[0].get('message', {}).get('content', '')
    print(json.loads(output_text))
else:
    print(f"Error: {response.status_code}, {response.text}")
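Because the vLLM server speaks the OpenAI chat-completions protocol, the official openai Python client works as well; a minimal sketch (the api_key value is a placeholder, since the local server does not check it by default):

import json

from openai import OpenAI

# Point the client at the local vLLM server; the key is unused but required
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="netsol/otoz-language-search-gemma-2-2b-it",
    messages=[{"role": "user", "content": "Honda city black 2022 1.2Vs variant"}],
    temperature=0.8,
)
print(json.loads(completion.choices[0].message.content))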
This Gemma 2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.