Getting a runtime error when loading with llama-cpp in an HF Space with Nvidia A10G Large
#20 opened by Isaid-Silver
I don't know if I'm doing something wrong, but I'm trying to deploy a Gradio app using a Mixtral-8x7B GGUF and llama.cpp. My Space already has these environment variables set:
CMAKE_ARGS="-DLLAMA_CUBLAS=on"
FORCE_CMAKE="1"
This is my requirements.txt:
--extra-index-url https://download.pytorch.org/whl/cu113
torch
llama-cpp-python
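One thing I'm not sure about is whether pip actually compiles llama-cpp-python with cuBLAS on the Space or just pulls a prebuilt CPU-only wheel, so a quick check like this sketch could confirm it (it assumes the installed llama-cpp-python is recent enough to expose llama_supports_gpu_offload, which older releases may not have):
import llama_cpp

# Confirm which llama-cpp-python build got installed and whether it can offload to the GPU
print(f"llama-cpp-python version: {llama_cpp.__version__}")
try:
    print(f"GPU offload supported: {llama_cpp.llama_supports_gpu_offload()}")
except AttributeError:
    print("llama_supports_gpu_offload not exposed by this version")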
My app.py goes as follows:
import gradio as gr
from llama_cpp import Llama
from huggingface_hub import hf_hub_download
import os
import torch
print(f"Is CUDA available: {torch.cuda.is_available()}")
# True
print(f"CUDA device: {torch.cuda.get_device_name(torch.cuda.current_device())}")
print(f"CMAKE_ARGS={os.environ['CMAKE_ARGS']}")
print(f"FORCE_CMAKE={os.environ['FORCE_CMAKE']}")
print(f'Llama={Llama.__name__}')
os.makedirs('models/', exist_ok=True)
downloaded_model_path = hf_hub_download(repo_id="TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF",
                                        filename="mixtral-8x7b-instruct-v0.1.Q2_K.gguf", local_dir='models/')
print(f'Downloaded path: {downloaded_model_path}')
print('Initializing model...')
llm = Llama(
    model_path=downloaded_model_path,
    n_ctx=2048,
    n_threads=10,
    n_gpu_layers=25,
    temp=0.1,
    n_batch=512,
    n_predict=-1,
    n_keep=0
)
print('Model loaded.')
def mix_query(query, history):
    output = llm(
        f"[INST] {query} [/INST]",
        max_tokens=512,
        stop=["</s>"],
        echo=False
    )
    print(output['choices'][0]['text'])
    return output['choices'][0]['text']
demo = gr.ChatInterface(fn=mix_query,
                        examples=["Explain the Fermi paradox"], title="TARS",
                        theme="soft")
demo.launch()
As you can see, I added a lot of prints to check where the execution fails, and it fails during the initialization of llm = Llama(...). However, when I run this on my local machine it executes flawlessly. The issue is that I get no logs when it fails; the Space just shows a runtime error.
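One thing I could try to get more information is isolating the model load into a minimal script with llama.cpp's own logging left on, so its load messages show up in the Space logs. A rough sketch (verbose is the Llama constructor's logging switch and defaults to True, but I set it explicitly here):
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF",
    filename="mixtral-8x7b-instruct-v0.1.Q2_K.gguf",
    local_dir="models/",
)

# Bare model load with llama.cpp logging enabled, nothing else
llm = Llama(
    model_path=model_path,
    n_ctx=2048,
    n_gpu_layers=25,
    verbose=True,
)
print("Model loaded.")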
Has anyone run into something like this?
Isaid-Silver changed discussion status to closed
Isaid-Silver changed discussion status to open