Arun Kumar Tiwary committed
Commit fa413cc
1 Parent(s): cca0560

Update README.md

Files changed (1)
  1. README.md +84 -0
README.md CHANGED
@@ -80,3 +80,87 @@ options:
  phases of compilation. By default, this is set to `None`, indicating that debug dumping is disabled.
  (default: None)
+
+ --------------------------------------------------------------------------------------------------------------------------------
+ $mlc_llm serve --help
+ usage: MLC LLM Serve CLI [-h] [--device DEVICE] [--model-lib MODEL_LIB] [--mode {local,interactive,server}]
+ [--additional-models [ADDITIONAL_MODELS ...]] [--max-batch-size MAX_BATCH_SIZE]
+ [--max-total-seq-length MAX_TOTAL_SEQ_LENGTH] [--prefill-chunk-size PREFILL_CHUNK_SIZE]
+ [--max-history-size MAX_HISTORY_SIZE] [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
+ [--speculative-mode {disable,small_draft,eagle,medusa}] [--spec-draft-length SPEC_DRAFT_LENGTH]
+ [--enable-tracing] [--host HOST] [--port PORT] [--allow-credentials]
+ [--allow-origins ALLOW_ORIGINS] [--allow-methods ALLOW_METHODS] [--allow-headers ALLOW_HEADERS]
+ model
+
+ positional arguments:
+ model A path to ``mlc-chat-config.json``, or an MLC model directory that contains ``mlc-chat-config.json``.
+ It can also be a link to an HF repository pointing to an MLC compiled model. (required)
+
+ options:
+ -h, --help show this help message and exit
+ --device DEVICE The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available
+ GPUs if not specified. (default: "auto")
+ --model-lib MODEL_LIB
+ The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
+ the provided ``model`` to search over possible paths. If the model lib is not found, it will be
+ compiled in a JIT manner. (default: "None")
+ --mode {local,interactive,server}
+ The engine mode in MLC LLM. We provide three preset modes: "local", "interactive" and "server". The
+ default mode is "local". The choice of mode decides the values of "--max-batch-size", "--max-total-
+ seq-length" and "--prefill-chunk-size" when they are not explicitly specified. 1. Mode "local"
+ refers to the local server deployment which has low request concurrency. So the max batch size will
+ be set to 4, and max total sequence length and prefill chunk size are set to the context window size
+ (or sliding window size) of the model. 2. Mode "interactive" refers to the interactive use of the
+ server, which has at most 1 concurrent request. So the max batch size will be set to 1, and max
+ total sequence length and prefill chunk size are set to the context window size (or sliding window
+ size) of the model. 3. Mode "server" refers to the large server use case which may handle many
+ concurrent requests and wants to use as much GPU memory as possible. In this mode, we will
+ automatically infer the largest possible max batch size and max total sequence length. You can
+ manually specify the arguments "--max-batch-size", "--max-total-seq-length" and "--prefill-chunk-size"
+ to override the automatically inferred values. (default: "local")
+ --additional-models [ADDITIONAL_MODELS ...]
+ The model paths and (optional) model library paths of additional models (other than the main model).
+ When the engine is enabled with speculative decoding, additional models are needed. The way of
+ specifying additional models is: "--additional-models model_path_1 model_path_2 ..." or "--
+ additional-models model_path_1:model_lib_1 model_path_2 ...". When the model lib of a model is not
+ given, JIT model compilation will be activated to compile the model automatically.
+ --max-batch-size MAX_BATCH_SIZE
+ The maximum allowed batch size set for the KV cache to concurrently support.
+ --max-total-seq-length MAX_TOTAL_SEQ_LENGTH
+ The KV cache total token capacity, i.e., the maximum total number of tokens that the KV cache
+ supports. This decides the GPU memory size that the KV cache consumes. If not specified, the system
+ will automatically estimate the maximum capacity based on the vRAM size of the GPU.
+ --prefill-chunk-size PREFILL_CHUNK_SIZE
+ The maximum number of tokens the model passes for prefill each time. It should not exceed the
+ prefill chunk size in the model config. If not specified, this defaults to the prefill chunk size in
+ the model config.
+ --max-history-size MAX_HISTORY_SIZE
+ The maximum history length for rolling back the RNN state. If unspecified, the default value is 1.
+ KV cache does not need this.
+ --gpu-memory-utilization GPU_MEMORY_UTILIZATION
+ A number in (0, 1) denoting the fraction of GPU memory used by the server in total. It is used to
+ infer the maximum possible KV cache capacity. When it is unspecified, it defaults to 0.85. Under mode
+ "local" or "interactive", the actual memory usage may be significantly smaller than this number.
+ Under mode "server", the actual memory usage may be slightly larger than this number.
+ --speculative-mode {disable,small_draft,eagle,medusa}
+ The speculative decoding mode. Right now three options are supported: - "disable", where speculative
+ decoding is not enabled, - "small_draft", denoting the normal speculative decoding (small draft)
+ style, - "eagle", denoting the eagle-style speculative decoding. The default mode is "disable".
+ (default: "disable")
+ --spec-draft-length SPEC_DRAFT_LENGTH
+ The number of draft tokens to generate in a speculative proposal. The default value is 4.
+ --enable-tracing Enable Chrome Tracing for the server. After enabling, you can send a POST request to the
+ "debug/dump_event_trace" entrypoint to get the Chrome Trace. For example, "curl -X POST
+ http://127.0.0.1:8000/debug/dump_event_trace -H "Content-Type: application/json" -d '{"model":
+ "dist/llama"}'"
+ --host HOST host name (default: "127.0.0.1")
+ --port PORT port (default: "8000")
+ --allow-credentials allow credentials
+ --allow-origins ALLOW_ORIGINS
+ allowed origins (default: "['*']")
+ --allow-methods ALLOW_METHODS
+ allowed methods (default: "['*']")
+ --allow-headers ALLOW_HEADERS
+ allowed headers (default: "['*']")
+
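To make the options above concrete, here is a minimal usage sketch. The model path `dist/llama` is only a placeholder (borrowed from the tracing example in the help text); substitute your own MLC model directory or an HF link to a compiled MLC model.

    # Serve a local MLC model with the default "local" mode on 127.0.0.1:8000
    $mlc_llm serve dist/llama

    # Pin the device and run in interactive mode on a custom port
    $mlc_llm serve dist/llama --device cuda:0 --mode interactive --port 8080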
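For a higher-concurrency deployment, the preset can be switched to "server" and the automatically inferred limits overridden; speculative decoding additionally needs one or more draft models passed via --additional-models. The draft model path `dist/llama-draft` below is hypothetical and purely for illustration.

    # "server" mode with manual capacity overrides and a 90% GPU memory budget
    $mlc_llm serve dist/llama --mode server --gpu-memory-utilization 0.9 --max-batch-size 64 --max-total-seq-length 16384

    # Small-draft speculative decoding with a separate draft model (JIT-compiled if no model lib is given)
    $mlc_llm serve dist/llama --mode server --speculative-mode small_draft --spec-draft-length 4 --additional-models dist/llama-draft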
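Once the server is running on the configured host and port, it can be queried over HTTP. A minimal sketch, assuming the OpenAI-compatible `/v1/chat/completions` endpoint exposed by `mlc_llm serve`; the "model" field mirrors the model the server was started with.

    $curl -X POST http://127.0.0.1:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "dist/llama", "messages": [{"role": "user", "content": "Hello!"}]}'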