Bllossom/llama-3-Korean-Bllossom-70B-gguf-Q4_K_M


limhyeonseok committed on May 10

Commit

f1a2785

1 Parent(s): da6eee8

Update README.md

Files changed (1)

README.md +40 -26


@@ -85,34 +85,48 @@ Refer to the [original model card](https://huggingface.co/Bllossom/llama-3-Korea
 ## Example code
-## Use with llama.cpp
-Install llama.cpp through brew.
-```bash
-brew install ggerganov/ggerganov/llama.cpp
-```
-Invoke the llama.cpp server or the CLI.
-CLI:
-```bash
-llama-cli --hf-repo Bllossom/llama-3-Korean-Bllossom-70B-gguf-Q4_K_M --model bllossom_llama3_70b.Q4_K_M.gguf -p "서울과학기술대학교 임경태 교수는 어떤연구를하니?"
-```
-Server:
-```bash
-llama-server --hf-repo Bllossom/llama-3-Korean-Bllossom-70B-gguf-Q4_K_M --model bllossom_llama3_70b.Q4_K_M.gguf -c 2048
-```
-Note: You can also use this checkpoint directly through the [usage steps](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage) listed in the Llama.cpp repo as well.
 ```
-git clone https://github.com/ggerganov/llama.cpp &&             cd llama.cpp &&             make &&             ./main -m bllossom_llama3_70b.Q4_K_M.gguf -n 128
-```

 ## Example code
+```python
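+# The next two lines are notebook shell commands; in a terminal, run them without the leading "!".
+# Newer llama-cpp-python builds may expect -DGGML_CUDA=on instead of -DLLAMA_CUDA=on.
+!CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
+!huggingface-cli download Bllossom/llama-3-Korean-Bllossom-70B-gguf-Q4_K_M --local-dir='YOUR-LOCAL-FOLDER-PATH'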
+from llama_cpp import Llama
+from transformers import AutoTokenizer
+model_id = 'Bllossom/llama-3-Korean-Bllossom-70B-gguf-Q4_K_M'
+tokenizer = AutoTokenizer.from_pretrained(model_id)
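+model = Llama(
+    # Point this at the .gguf file actually present in your download folder.
+    model_path='YOUR-LOCAL-FOLDER-PATH/llama-3-Korean-Bllossom-70B-RAG-gguf-Q4_K_M.gguf',
+    n_ctx=512,             # Context window in tokens; the prompt and the generated tokens share it
+    n_gpu_layers=-1        # Number of model layers to offload to GPU (-1 offloads all layers)
+)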
+PROMPT = \
+'''당신은 유용한 AI 어시스턴트입니다. 사용자의 질의에 대해 친절하고 정확하게 답변해야 합니다.
+You are a helpful AI assistant, you'll need to answer users' queries in a friendly and accurate manner.'''
+instruction = 'Your Instruction'
+messages = [
+    {"role": "system", "content": f"{PROMPT}"},
+    {"role": "user", "content": f"{instruction}"}
+    ]
+prompt = tokenizer.apply_chat_template(
+    messages,
+    tokenize = False,
+    add_generation_prompt=True
+)
+generation_kwargs = {
+    "max_tokens":512,
+    "stop":["<|eot_id|>"],
+    "echo":True, # Echo the prompt in the output
+    "top_k":1 # This is essentially greedy decoding, since the model will always return the highest-probability token. Set this value > 1 for sampling decoding
+}
+response_msg = model(prompt, **generation_kwargs)
+print(response_msg['choices'][0]['text'][len(prompt):])
 ```
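
The example above relies on `tokenizer.apply_chat_template` to turn `messages` into a single prompt string, stops generation at `<|eot_id|>`, and slices off the echoed prompt. The sketch below shows roughly what that string looks like and why the stop token and the slice are needed. It assumes the repo's tokenizer uses the standard Llama 3 Instruct chat layout; `build_llama3_prompt` and `strip_echo` are hypothetical helpers written for illustration, not part of `transformers` or `llama_cpp`, and the tokenizer's own template remains the authoritative source.

```python
def build_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama 3 Instruct chat layout as a plain string.

    Assumption: the model's tokenizer follows the standard Llama 3 template.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


def strip_echo(prompt, text):
    """Drop the echoed prompt from a completion produced with echo=True."""
    return text[len(prompt):] if text.startswith(prompt) else text


messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Your Instruction"},
]
prompt = build_llama3_prompt(messages)
print(prompt)
```

Because every turn ends with `<|eot_id|>`, passing it in `stop` ends generation once the assistant's turn is complete. Setting `"echo": False` in `generation_kwargs` would avoid the need for the slice altogether.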