Uploaded model

  • Developed by: netsol
  • License: apache-2.0
  • Finetuned from model: unsloth/gemma-2b-it-bnb-4bit

Inference

vLLM - offline

The model can be used directly with vLLM for offline inference using the following code snippet:

# Load the model
from vllm import LLM, SamplingParams
import json

llm = LLM(model="netsol/otoz-language-search-gemma-2b-it")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)

# Run inference on a single user query
messages = [
    {"role": "user",
     "content": "Need a pink hyundai elantra that should be at least from 2023 should be no more than $2,000,000"},
]
outputs = llm.chat(
    messages,
    sampling_params=sampling_params,
    use_tqdm=False,
)

# Parse the model's JSON output
json.loads(outputs[0].outputs[0].text)
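
Because sampling is stochastic, the reply may occasionally not be valid JSON. A minimal guard, continuing the snippet above with only the standard library:

# Parse defensively in case the sampled output is not valid JSON
raw = outputs[0].outputs[0].text
try:
    filters = json.loads(raw)
except json.JSONDecodeError:
    print("Model did not return valid JSON:", raw)
else:
    print(filters)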

To apply the chat prompt template manually with the tokenizer:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 8192, 1
model_name = "netsol/otoz-language-search-gemma-2b-it"

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=4096)

tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
)

message_list = [
    [{"role": "user",
      "content": "Need a pink hyundai elantra that should be at least from 2023 should be no more than $2,000,000"}],
]

# apply_chat_template with the default tokenize=True returns token ids directly
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in message_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
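
To sanity-check what the template produces, you can render the prompt as a string instead of token ids (tokenize=False is part of the standard apply_chat_template API):

# Inspect the rendered Gemma chat prompt for the first conversation
prompt_str = tokenizer.apply_chat_template(message_list[0], tokenize=False, add_generation_prompt=True)
print(prompt_str)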

vLLM - server

To launch vLLM's OpenAI-compatible API server (built on FastAPI):

python -m vllm.entrypoints.openai.api_server --model "netsol/otoz-language-search-gemma-2b-it" --dtype float16
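
Once the server is up, you can confirm it is serving the model via the /v1/models endpoint (a minimal sketch; assumes the default port 8000):

import requests

resp = requests.get("http://localhost:8000/v1/models")
print(resp.json())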

Then query the server with:

import json
import requests

query = "Honda city black 2022 1.2Vs variant"

url = "http://localhost:8000/v1/chat/completions"
headers = {
    "Content-Type": "application/json"
}
data = {
    "model": "netsol/otoz-language-search-gemma-2b-it",
    "messages": [{"role": "user", "content": query}]
}

response = requests.post(url, headers=headers, json=data)

# Extract the output text from the response and parse it as JSON
if response.status_code == 200:
    response_json = response.json()
    output_text = response_json.get('choices', [{}])[0].get('message', {}).get('content', '')
    print(json.loads(output_text))
else:
    print(f"Error: {response.status_code}, {response.text}")
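
Because the endpoint is OpenAI-compatible, the official openai Python client works as well (a minimal sketch; vLLM ignores the API key by default, so any placeholder value is fine):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="netsol/otoz-language-search-gemma-2b-it",
    messages=[{"role": "user", "content": query}],
    temperature=0.8,
)
print(completion.choices[0].message.content)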

This Gemma model was trained 2x faster with Unsloth and Hugging Face's TRL library.
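
If vLLM is not available, the checkpoint should also load with plain transformers (a minimal sketch, assuming a bfloat16-capable GPU and the accelerate package; untested here):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "netsol/otoz-language-search-gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Honda city black 2022 1.2Vs variant"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.8, top_p=0.95)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))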
