Arun Kumar Tiwary committed
Commit fa413cc
1 Parent(s): cca0560

Update README.md

Files changed (1)
  1. README.md +84 -0
README.md CHANGED
@@ -80,3 +80,87 @@ options:
  phases of compilation. By default, this is set to `None`, indicating that debug dumping is disabled.
  (default: None)
+
+ --------------------------------------------------------------------------------------------------------------------------------
+ $mlc_llm serve --help
+ usage: MLC LLM Serve CLI [-h] [--device DEVICE] [--model-lib MODEL_LIB] [--mode {local,interactive,server}]
+ [--additional-models [ADDITIONAL_MODELS ...]] [--max-batch-size MAX_BATCH_SIZE]
+ [--max-total-seq-length MAX_TOTAL_SEQ_LENGTH] [--prefill-chunk-size PREFILL_CHUNK_SIZE]
+ [--max-history-size MAX_HISTORY_SIZE] [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
+ [--speculative-mode {disable,small_draft,eagle,medusa}] [--spec-draft-length SPEC_DRAFT_LENGTH]
+ [--enable-tracing] [--host HOST] [--port PORT] [--allow-credentials]
+ [--allow-origins ALLOW_ORIGINS] [--allow-methods ALLOW_METHODS] [--allow-headers ALLOW_HEADERS]
+ model
+
+ positional arguments:
+ model A path to ``mlc-chat-config.json``, or an MLC model directory that contains ``mlc-chat-config.json``.
+ It can also be a link to an HF repository pointing to an MLC compiled model. (required)
+
+ options:
+ -h, --help show this help message and exit
+ --device DEVICE The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available
+ GPUs if not specified. (default: "auto")
+ --model-lib MODEL_LIB
+ The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
+ the provided ``model`` to search over possible paths. If the model lib is not found, it will be
+ compiled in a JIT manner. (default: "None")
+ --mode {local,interactive,server}
+ The engine mode in MLC LLM. We provide three preset modes: "local", "interactive" and "server". The
+ default mode is "local". The choice of mode decides the values of "--max-batch-size", "--max-total-
+ seq-length" and "--prefill-chunk-size" when they are not explicitly specified. 1. Mode "local"
+ refers to the local server deployment which has low request concurrency. So the max batch size will
+ be set to 4, and max total sequence length and prefill chunk size are set to the context window size
+ (or sliding window size) of the model. 2. Mode "interactive" refers to the interactive use of the
+ server, which has at most 1 concurrent request. So the max batch size will be set to 1, and max
+ total sequence length and prefill chunk size are set to the context window size (or sliding window
+ size) of the model. 3. Mode "server" refers to the large server use case which may handle many
+ concurrent requests and wants to use as much GPU memory as possible. In this mode, we will
+ automatically infer the largest possible max batch size and max total sequence length. You can
+ manually specify the arguments "--max-batch-size", "--max-total-seq-length" and "--prefill-chunk-size"
+ to override the automatically inferred values. (default: "local")
+ --additional-models [ADDITIONAL_MODELS ...]
+ The model paths and (optional) model library paths of additional models (other than the main model).
+ When the engine is enabled with speculative decoding, additional models are needed. The way of
+ specifying additional models is: "--additional-models model_path_1 model_path_2 ..." or "--
+ additional-models model_path_1:model_lib_1 model_path_2 ...". When the model lib of a model is not
+ given, JIT model compilation will be activated to compile the model automatically.
+ --max-batch-size MAX_BATCH_SIZE
+ The maximum allowed batch size set for the KV cache to concurrently support.
+ --max-total-seq-length MAX_TOTAL_SEQ_LENGTH
+ The KV cache total token capacity, i.e., the maximum total number of tokens that the KV cache
+ supports. This decides the GPU memory size that the KV cache consumes. If not specified, the system
+ will automatically estimate the maximum capacity based on the vRAM size of the GPU.
+ --prefill-chunk-size PREFILL_CHUNK_SIZE
+ The maximum number of tokens the model passes for prefill each time. It should not exceed the
+ prefill chunk size in the model config. If not specified, this defaults to the prefill chunk size in
+ the model config.
+ --max-history-size MAX_HISTORY_SIZE
+ The maximum history length for rolling back the RNN state. If unspecified, the default value is 1.
+ KV cache does not need this.
+ --gpu-memory-utilization GPU_MEMORY_UTILIZATION
+ A number in (0, 1) denoting the fraction of GPU memory used by the server in total. It is used to
+ infer the maximum possible KV cache capacity. When it is unspecified, it defaults to 0.85. Under mode
+ "local" or "interactive", the actual memory usage may be significantly smaller than this number.
+ Under mode "server", the actual memory usage may be slightly larger than this number.
+ --speculative-mode {disable,small_draft,eagle,medusa}
+ The speculative decoding mode. Right now three options are supported: - "disable", where speculative
+ decoding is not enabled, - "small_draft", denoting the normal speculative decoding (small draft)
+ style, - "eagle", denoting the eagle-style speculative decoding. The default mode is "disable".
+ (default: "disable")
+ --spec-draft-length SPEC_DRAFT_LENGTH
+ The number of draft tokens to generate in a speculative proposal. The default value is 4.
+ --enable-tracing Enable Chrome Tracing for the server. After enabling, you can send a POST request to the
+ "debug/dump_event_trace" entrypoint to get the Chrome Trace. For example, "curl -X POST
+ http://127.0.0.1:8000/debug/dump_event_trace -H "Content-Type: application/json" -d '{"model":
+ "dist/llama"}'"
+ --host HOST host name (default: "127.0.0.1")
+ --port PORT port (default: "8000")
+ --allow-credentials allow credentials
+ --allow-origins ALLOW_ORIGINS
+ allowed origins (default: "['*']")
+ --allow-methods ALLOW_METHODS
+ allowed methods (default: "['*']")
+ --allow-headers ALLOW_HEADERS
+ allowed headers (default: "['*']")
+
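To make the options above concrete, here is a minimal usage sketch. The model path `dist/llama` is only a placeholder (borrowed from the tracing example in the help text); substitute your own MLC model directory or an HF link to a compiled MLC model.

    # Serve a local MLC model with the default "local" mode on 127.0.0.1:8000
    $mlc_llm serve dist/llama

    # Pin the device and run in interactive mode on a custom port
    $mlc_llm serve dist/llama --device cuda:0 --mode interactive --port 8080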
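For a higher-concurrency deployment, the preset can be switched to "server" and the automatically inferred limits overridden; speculative decoding additionally needs one or more draft models passed via --additional-models. The draft model path `dist/llama-draft` below is hypothetical and purely for illustration.

    # "server" mode with manual capacity overrides and a 90% GPU memory budget
    $mlc_llm serve dist/llama --mode server --gpu-memory-utilization 0.9 --max-batch-size 64 --max-total-seq-length 16384

    # Small-draft speculative decoding with a separate draft model (JIT-compiled if no model lib is given)
    $mlc_llm serve dist/llama --mode server --speculative-mode small_draft --spec-draft-length 4 --additional-models dist/llama-draft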
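Once the server is running on the configured host and port, it can be queried over HTTP. A minimal sketch, assuming the OpenAI-compatible `/v1/chat/completions` endpoint exposed by `mlc_llm serve`; the "model" field mirrors the model the server was started with.

    $curl -X POST http://127.0.0.1:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "dist/llama", "messages": [{"role": "user", "content": "Hello!"}]}'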