unsubscribe committed
Commit dcfaa25
1 Parent(s): aaf74ae

add serving section in readme

Files changed (1):
  1. README.md +29 -2
README.md CHANGED
@@ -55,7 +55,34 @@ huggingface-cli download internlm/internlm2_5-7b-chat-gguf internlm2_5-7b-chat-f
 
 You can use `llama-cli` for conducting inference. For a detailed explanation of `llama-cli`, please refer to [this guide](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md).
 ```shell
-build/bin/llama-cli -m internlm2_5-7b-chat-fp16.gguf
+build/bin/llama-cli -m internlm2_5-7b-chat-fp16.gguf -ngl 32
 ```
 
-## Serving
+## Serving
+
+`llama.cpp` provides an OpenAI API-compatible server, `llama-server`. You can deploy `internlm2_5-7b-chat-fp16.gguf` as a service like this:
+
+```shell
+./build/bin/llama-server -m ./internlm2_5-7b-chat-fp16.gguf -ngl 32
+```
+
+On the client side, you can access the service through the OpenAI API:
+
+```python
+from openai import OpenAI
+client = OpenAI(
+    api_key='YOUR_API_KEY',
+    base_url='http://localhost:8080/v1'
+)
+model_name = client.models.list().data[0].id
+response = client.chat.completions.create(
+    model=model_name,
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "Provide three suggestions about time management."},
+    ],
+    temperature=0.8,
+    top_p=0.8
+)
+print(response)
+```
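
A quick note on the client snippet added above: because `llama-server` exposes the standard OpenAI chat-completions endpoint, streaming responses also work with the same client. Below is a minimal sketch, assuming the server from the diff is running on `localhost:8080` (the port used in the snippet above) and the `openai` Python package is installed; `YOUR_API_KEY` is a placeholder, as `llama-server` only validates keys when one is configured on the server side.

```python
from openai import OpenAI

# Assumption: the llama-server instance started in the diff above is
# listening on localhost:8080. The API key is a placeholder; llama-server
# only checks it if the server was configured with an API key.
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://localhost:8080/v1')

# llama-server reports the loaded GGUF file as its single available model.
model_name = client.models.list().data[0].id

# stream=True yields incremental chunks instead of one final response.
stream = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Provide three suggestions about time management."},
    ],
    temperature=0.8,
    top_p=0.8,
    stream=True,
)
for chunk in stream:
    # Each chunk carries a small delta of the generated text; content can
    # be None on boundary chunks, so guard before printing.
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```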