Arun Kumar Tiwary committed
Commit: fa413cc
Parent(s): cca0560
Update README.md

README.md CHANGED
@@ -80,3 +80,87 @@ options:
 phases of compilation. By default, this is set to `None`, indicating that debug dumping is disabled.
 (default: None)
 
+
+--------------------------------------------------------------------------------------------------------------------------------
+$mlc_llm serve --help
+usage: MLC LLM Serve CLI [-h] [--device DEVICE] [--model-lib MODEL_LIB] [--mode {local,interactive,server}]
+                         [--additional-models [ADDITIONAL_MODELS ...]] [--max-batch-size MAX_BATCH_SIZE]
+                         [--max-total-seq-length MAX_TOTAL_SEQ_LENGTH] [--prefill-chunk-size PREFILL_CHUNK_SIZE]
+                         [--max-history-size MAX_HISTORY_SIZE] [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
+                         [--speculative-mode {disable,small_draft,eagle,medusa}] [--spec-draft-length SPEC_DRAFT_LENGTH]
+                         [--enable-tracing] [--host HOST] [--port PORT] [--allow-credentials]
+                         [--allow-origins ALLOW_ORIGINS] [--allow-methods ALLOW_METHODS] [--allow-headers ALLOW_HEADERS]
+                         model
+
+positional arguments:
+  model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
+                        It can also be a link to an HF repository pointing to an MLC compiled model. (required)
+
+options:
+  -h, --help            show this help message and exit
+  --device DEVICE       The device used to deploy the model such as "cuda" or "cuda:0". Will detect from locally available
+                        GPUs if not specified. (default: "auto")
+  --model-lib MODEL_LIB
+                        The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
+                        the provided ``model`` to search over possible paths. If the model lib is not found, it will be
+                        compiled in a JIT manner. (default: "None")
+  --mode {local,interactive,server}
+                        The engine mode in MLC LLM. We provide three preset modes: "local", "interactive" and "server". The
+                        default mode is "local". The choice of mode decides the values of "--max-batch-size", "--max-total-
+                        seq-length" and "--prefill-chunk-size" when they are not explicitly specified. 1. Mode "local"
+                        refers to the local server deployment which has low request concurrency, so the max batch size will
+                        be set to 4, and max total sequence length and prefill chunk size are set to the context window size
+                        (or sliding window size) of the model. 2. Mode "interactive" refers to the interactive use of the
+                        server, which has at most 1 concurrent request, so the max batch size will be set to 1, and max
+                        total sequence length and prefill chunk size are set to the context window size (or sliding window
+                        size) of the model. 3. Mode "server" refers to the large server use case which may handle many
+                        concurrent requests and wants to use as much GPU memory as possible. In this mode, we will
+                        automatically infer the largest possible max batch size and max total sequence length. You can
+                        manually specify arguments "--max-batch-size", "--max-total-seq-length" and "--prefill-chunk-size"
+                        to override the automatically inferred values. (default: "local")
+  --additional-models [ADDITIONAL_MODELS ...]
+                        The model paths and (optional) model library paths of additional models (other than the main model).
+                        When the engine is enabled with speculative decoding, additional models are needed. The way of
+                        specifying additional models is: "--additional-models model_path_1 model_path_2 ..." or "--
+                        additional-models model_path_1:model_lib_1 model_path_2 ...". When the model lib of a model is not
+                        given, JIT model compilation will be activated to compile the model automatically.
+  --max-batch-size MAX_BATCH_SIZE
+                        The maximum batch size that the KV cache is set up to support concurrently.
+  --max-total-seq-length MAX_TOTAL_SEQ_LENGTH
+                        The KV cache total token capacity, i.e., the maximum total number of tokens that the KV cache
+                        supports. This decides the GPU memory size that the KV cache consumes. If not specified, the system
+                        will automatically estimate the maximum capacity based on the vRAM size of the GPU.
+  --prefill-chunk-size PREFILL_CHUNK_SIZE
+                        The maximum number of tokens the model passes for prefill each time. It should not exceed the
+                        prefill chunk size in the model config. If not specified, this defaults to the prefill chunk size in
+                        the model config.
+  --max-history-size MAX_HISTORY_SIZE
+                        The maximum history length for rolling back the RNN state. If unspecified, the default value is 1.
+                        KV cache does not need this.
+  --gpu-memory-utilization GPU_MEMORY_UTILIZATION
+                        A number in (0, 1) denoting the fraction of GPU memory used by the server in total. It is used to
+                        infer the maximum possible KV cache capacity. When it is unspecified, it defaults to 0.85. Under mode
+                        "local" or "interactive", the actual memory usage may be significantly smaller than this number.
+                        Under mode "server", the actual memory usage may be slightly larger than this number.
+  --speculative-mode {disable,small_draft,eagle,medusa}
+                        The speculative decoding mode. Right now three options are supported: "disable", where speculative
+                        decoding is not enabled; "small_draft", denoting the normal speculative decoding (small draft)
+                        style; and "eagle", denoting the eagle-style speculative decoding. The default mode is "disable".
+                        (default: "disable")
+  --spec-draft-length SPEC_DRAFT_LENGTH
+                        The number of draft tokens to generate in a speculative proposal. The default value is 4.
+  --enable-tracing      Enable Chrome Tracing for the server. After enabling, you can send a POST request to the
+                        "debug/dump_event_trace" entrypoint to get the Chrome Trace. For example, "curl -X POST
+                        http://127.0.0.1:8000/debug/dump_event_trace -H "Content-Type: application/json" -d '{"model":
+                        "dist/llama"}'"
+  --host HOST           host name (default: "127.0.0.1")
+  --port PORT           port (default: "8000")
+  --allow-credentials   allow credentials
+  --allow-origins ALLOW_ORIGINS
+                        allowed origins (default: "['*']")
+  --allow-methods ALLOW_METHODS
+                        allowed methods (default: "['*']")
+  --allow-headers ALLOW_HEADERS
+                        allowed headers (default: "['*']")
+
+
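
The preset modes interact with the KV-cache options above, so a worked invocation may help. Below is a minimal sketch, assuming a locally compiled model at a placeholder path and illustrative limits; every value other than the flag names themselves is an assumption, not something taken from this commit:

$ # Placeholder model path; substitute your own MLC-compiled model directory or HF link.
$ mlc_llm serve ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC \
      --mode server \
      --max-batch-size 32 --max-total-seq-length 16384 \
      --gpu-memory-utilization 0.9 \
      --host 0.0.0.0 --port 8000

Passing --max-batch-size and --max-total-seq-length explicitly overrides the values that mode "server" would otherwise infer automatically, as described in the --mode help text.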
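
Likewise, a hedged sketch of the speculative-decoding flags, where both model paths are placeholders and the draft model is assumed to be another MLC-compiled model:

$ # Both paths are placeholders; omitting a ":model_lib" suffix triggers JIT compilation.
$ mlc_llm serve ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC \
      --speculative-mode small_draft \
      --spec-draft-length 4 \
      --additional-models ./dist/TinyLlama-1.1B-Chat-q4f16_1-MLC

Per the --additional-models help text, a model library can be attached with the "model_path:model_lib" form when JIT compilation is not desired.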
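
Finally, the tracing workflow mentioned under --enable-tracing, again as a sketch; the model path is a placeholder, and the "dist/llama" value in the JSON body is simply the example from the help text above and should match the model the server was actually started with:

$ mlc_llm serve ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC --enable-tracing
$ # From another shell, dump the collected Chrome Trace:
$ curl -X POST http://127.0.0.1:8000/debug/dump_event_trace \
      -H "Content-Type: application/json" \
      -d '{"model": "dist/llama"}'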