|
# Set up MLC-LLM on CPU on Ubuntu 22.04 LTS
|
|
|
```sh
# Install the OpenCL ICD loader, clinfo and Vulkan tools
sudo apt update
sudo apt install ocl-icd-opencl-dev clinfo vulkan-tools

# Install the nightly MLC-LLM / MLC-AI Python wheels
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly

# Chat with a prebuilt 4-bit quantized Llama-3 model straight from Hugging Face
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
```
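
Before pulling a model, it can be worth a quick check that the OpenCL/Vulkan runtimes actually see a device. A minimal sanity check using the tools installed above (output varies by machine, and `--summary` may not exist on older `vulkaninfo` builds):

```sh
# Optional: confirm OpenCL and Vulkan device visibility
clinfo -l              # compact list of OpenCL platforms and devices
vulkaninfo --summary   # Vulkan device summary (fall back to plain `vulkaninfo` if needed)
```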
|
|
|
___________________________________________________________________________________________________________________________________________ |
|
```text
$ mlc_llm --help
|
usage: MLC LLM Command Line Interface. [-h] {compile,convert_weight,gen_config,chat,serve,bench,package} |
|
|
|
positional arguments: |
|
{compile,convert_weight,gen_config,chat,serve,bench,package} |
|
Subcommand to run. (choices: compile, convert_weight, gen_config, chat, serve, bench, package)
|
|
|
options: |
|
-h, --help show this help message and exit
```
|
|
|
|
|
____________________________________________________________________________________________________________________________________________ |
|
```text
$ mlc_llm chat --help
|
usage: MLC LLM Chat CLI [-h] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES] [--model-lib MODEL_LIB] model |
|
|
|
positional arguments: |
|
model A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`. |
|
It can also be a link to a HF repository pointing to an MLC compiled model. (required) |
|
|
|
options: |
|
-h, --help show this help message and exit |
|
--opt OPT Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2, |
|
O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme |
|
optimization that could potentially break the system. Meanwhile, optimization flags could be |
|
explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2") |
|
--device DEVICE The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available |
|
GPUs if not specified. (default: "auto") |
|
--overrides OVERRIDES |
|
Chat configuration override. Configurations to override ChatConfig. Supports `conv_template`, |
|
`context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`, |
|
`max_batch_size` and `tensor_parallel_shards`. Meanwhile, model chat could be explicitly specified |
|
via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "") |
|
--model-lib MODEL_LIB |
|
The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use |
|
the provided ``model`` to search over possible paths. If the model lib is not found, it will be

compiled in a JIT manner. (default: "None")
```
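
For illustration, the same prebuilt model from the setup block can be launched with an explicit device and a couple of chat-config overrides. The device string (`vulkan`, matching the tooling installed above) and the override values are only examples, not recommendations:

```sh
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
    --device vulkan \
    --overrides "context_window_size=2048;prefill_chunk_size=256"
```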
|
|
|
|
|
------------------------------------------------------------------------------------------------------------------------------------------ |
|
```text
$ mlc_llm compile --help
|
usage: mlc_llm compile [-h] |
|
[--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}] |
|
[--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}] |
|
[--device DEVICE] [--host HOST] [--opt OPT] [--system-lib-prefix SYSTEM_LIB_PREFIX] --output OUTPUT |
|
[--overrides OVERRIDES] [--debug-dump DEBUG_DUMP] |
|
model |
|
|
|
positional arguments: |
|
model A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`. |
|
It can also be a link to a HF repository pointing to an MLC compiled model. (required) |
|
|
|
options: |
|
-h, --help show this help message and exit |
|
--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16} |
|
The quantization mode we use to compile. If unprovided, will infer from `model`. (default: look up |
|
mlc-chat-config.json, choices: q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2, |
|
q4f16_autoawq, q4f16_ft, e5m2_e5m2_f16, e4m3_e4m3_f16) |
|
--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa} |
|
Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`. |
|
(default: "auto") |
|
--device DEVICE The GPU device to compile the model to. If not set, it is inferred from GPUs available locally. |
|
(default: "auto") |
|
--host HOST The host LLVM triple to compile the model to. If not set, it is inferred from the local CPU and OS. |
|
Examples of the LLVM triple: 1) iPhones: arm64-apple-ios; 2) ARM64 Android phones: aarch64-linux- |
|
android; 3) WebAssembly: wasm32-unknown-unknown-wasm; 4) Windows: x86_64-pc-windows-msvc; 5) ARM |
|
macOS: arm64-apple-darwin. (default: "auto") |
|
--opt OPT Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2, |
|
O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme |
|
optimization that could potentially break the system. Meanwhile, optimization flags could be |
|
explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2") |
|
--system-lib-prefix SYSTEM_LIB_PREFIX |
|
Adding a prefix to all symbols exported. Similar to "objcopy --prefix-symbols". This is useful when |
|
compiling multiple models into a single library to avoid symbol conflicts. Different from objcopy, |
|
this has no effect for shared libraries. (default: "auto")
|
--output OUTPUT, -o OUTPUT |
|
The path to the output file. The suffix determines if the output file is a shared library or |
|
objects. Available suffixes: 1) Linux: .so (shared), .tar (objects); 2) macOS: .dylib (shared), .tar |
|
(objects); 3) Windows: .dll (shared), .tar (objects); 4) Android, iOS: .tar (objects); 5) Web: .wasm |
|
(web assembly). (required) |
|
--overrides OVERRIDES |
|
Model configuration override. Configurations to override `mlc-chat-config.json`. Supports |
|
`context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`, |
|
`max_batch_size` and `tensor_parallel_shards`. Meanwhile, model config could be explicitly specified |
|
via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "") |
|
--debug-dump DEBUG_DUMP |
|
Specifies the directory where the compiler will store its IRs for debugging purposes during various |
|
phases of compilation. By default, this is set to `None`, indicating that debug dumping is disabled. |
|
(default: None)
```
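
As a sketch of a typical invocation (paths are placeholders), `compile` is pointed at a directory that already holds an `mlc-chat-config.json` (produced by `gen_config` below) and writes a shared library whose suffix matches the platform:

```sh
# Paths are illustrative; the output suffix (.so) follows the Linux convention above
mlc_llm compile ./dist/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json \
    --device vulkan \
    -o ./dist/libs/Llama-3-8B-Instruct-q4f16_1-vulkan.so
```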
|
|
|
____________________________________________________________________________________________________________________________________________ |
|
```text
$ mlc_llm convert_weight --help
|
usage: MLC AutoLLM Quantization Framework [-h] --quantization |
|
{q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16} |
|
[--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}] |
|
[--device DEVICE] [--source SOURCE] |
|
[--source-format {auto,huggingface-torch,huggingface-safetensor,awq}] --output |
|
OUTPUT |
|
config |
|
|
|
positional arguments: |
|
config 1) Path to a HuggingFace model directory that contains a `config.json` or 2) Path to `config.json` |
|
in HuggingFace format, or 3) The name of a pre-defined model architecture. A `config.json` file in |
|
HuggingFace format defines the model architecture, including the vocabulary size, the number of |
|
layers, the hidden size, number of attention heads, etc. Example: |
|
https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json. A HuggingFace directory |
|
often contains a `config.json` which defines the model architecture, the non-quantized model weights |
|
in PyTorch or SafeTensor format, tokenizer configurations, as well as an optional |
|
`generation_config.json` that provides additional default configuration for text generation. Example:
|
https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main. (required) |
|
|
|
options: |
|
-h, --help show this help message and exit |
|
--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16} |
|
The quantization mode we use to compile. If unprovided, will infer from `model`. (required, choices: |
|
q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq, q4f16_ft, |
|
e5m2_e5m2_f16, e4m3_e4m3_f16) |
|
--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa} |
|
Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`. |
|
(default: "auto") |
|
--device DEVICE The device used to do quantization such as "cuda" or "cuda:0". Will detect from local available GPUs |
|
if not specified. (default: "auto") |
|
--source SOURCE The path to original model weight, infer from `config` if missing. (default: "auto") |
|
--source-format {auto,huggingface-torch,huggingface-safetensor,awq} |
|
The format of source model weight, infer from `config` if missing. (default: "auto", choices: auto, |
|
huggingface-torch, huggingface-safetensor, awq)
|
--output OUTPUT, -o OUTPUT |
|
The output directory to save the quantized model weight. Will create `params_shard_*.bin` and |
|
`ndarray-cache.json` in this directory. (required)
```
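
A representative conversion run, assuming the original HuggingFace checkpoint has already been downloaded locally (directory names are illustrative):

```sh
mlc_llm convert_weight ./models/Meta-Llama-3-8B-Instruct \
    --quantization q4f16_1 \
    -o ./dist/Llama-3-8B-Instruct-q4f16_1-MLC
```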
|
|
|
-------------------------------------------------------------------------------------------------------------------------------- |
|
```text
$ mlc_llm serve --help
|
usage: MLC LLM Serve CLI [-h] [--device DEVICE] [--model-lib MODEL_LIB] [--mode {local,interactive,server}] |
|
[--additional-models [ADDITIONAL_MODELS ...]] [--max-batch-size MAX_BATCH_SIZE] |
|
[--max-total-seq-length MAX_TOTAL_SEQ_LENGTH] [--prefill-chunk-size PREFILL_CHUNK_SIZE] |
|
[--max-history-size MAX_HISTORY_SIZE] [--gpu-memory-utilization GPU_MEMORY_UTILIZATION] |
|
[--speculative-mode {disable,small_draft,eagle,medusa}] [--spec-draft-length SPEC_DRAFT_LENGTH] |
|
[--enable-tracing] [--host HOST] [--port PORT] [--allow-credentials] |
|
[--allow-origins ALLOW_ORIGINS] [--allow-methods ALLOW_METHODS] [--allow-headers ALLOW_HEADERS] |
|
model |
|
|
|
positional arguments: |
|
model A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`. |
|
It can also be a link to a HF repository pointing to an MLC compiled model. (required) |
|
|
|
options: |
|
-h, --help show this help message and exit |
|
--device DEVICE The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available |
|
GPUs if not specified. (default: "auto") |
|
--model-lib MODEL_LIB |
|
The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use |
|
the provided ``model`` to search over possible paths. If the model lib is not found, it will be
|
compiled in a JIT manner. (default: "None") |
|
--mode {local,interactive,server} |
|
The engine mode in MLC LLM. We provide three preset modes: "local", "interactive" and "server". The |
|
default mode is "local". The choice of mode decides the values of "--max-batch-size", "--max-total- |
|
seq-length" and "--prefill-chunk-size" when they are not explicitly specified. 1. Mode "local" |
|
refers to the local server deployment which has low request concurrency. So the max batch size will |
|
be set to 4, and max total sequence length and prefill chunk size are set to the context window size |
|
(or sliding window size) of the model. 2. Mode "interactive" refers to the interactive use of |
|
server, which has at most 1 concurrent request. So the max batch size will be set to 1, and max |
|
total sequence length and prefill chunk size are set to the context window size (or sliding window |
|
size) of the model. 3. Mode "server" refers to the large server use case which may handle many |
|
concurrent requests and wants to use GPU memory as much as possible. In this mode, we will
|
automatically infer the largest possible max batch size and max total sequence length. You can |
|
manually specify arguments "--max-batch-size", "--max-total-seq-length" and "--prefill-chunk-size" |
|
to override the automatically inferred values. (default: "local")
|
--additional-models [ADDITIONAL_MODELS ...] |
|
The model paths and (optional) model library paths of additional models (other than the main model). |
|
When engine is enabled with speculative decoding, additional models are needed. The way of |
|
specifying additional models is: "--additional-models model_path_1 model_path_2 ..." or "-- |
|
additional-models model_path_1:model_lib_1 model_path_2 ...". When the model lib of a model is not |
|
given, JIT model compilation will be activated to compile the model automatically. |
|
--max-batch-size MAX_BATCH_SIZE |
|
The maximum allowed batch size set for the KV cache to concurrently support. |
|
--max-total-seq-length MAX_TOTAL_SEQ_LENGTH |
|
The KV cache total token capacity, i.e., the maximum total number of tokens that the KV cache |
|
support. This decides the GPU memory size that the KV cache consumes. If not specified, system will |
|
automatically estimate the maximum capacity based on the vRAM size on GPU. |
|
--prefill-chunk-size PREFILL_CHUNK_SIZE |
|
The maximum number of tokens the model passes for prefill each time. It should not exceed the |
|
prefill chunk size in model config. If not specified, this defaults to the prefill chunk size in |
|
model config. |
|
--max-history-size MAX_HISTORY_SIZE |
|
The maximum history length for rolling back the RNN state. If unspecified, the default value is 1. |
|
KV cache does not need this. |
|
--gpu-memory-utilization GPU_MEMORY_UTILIZATION |
|
A number in (0, 1) denoting the fraction of GPU memory used by the server in total. It is used to |
|
infer the maximum possible KV cache capacity. When it is unspecified, it defaults to 0.85. Under mode
|
"local" or "interactive", the actual memory usage may be significantly smaller than this number. |
|
Under mode "server", the actual memory usage may be slightly larger than this number. |
|
--speculative-mode {disable,small_draft,eagle,medusa} |
|
The speculative decoding mode. Right now three options are supported: - "disable", where speculative |
|
decoding is not enabled, - "small_draft", denoting the normal speculative decoding (small draft) |
|
style, - "eagle", denoting the eagle-style speculative decoding. The default mode is "disable". |
|
(default: "disable") |
|
--spec-draft-length SPEC_DRAFT_LENGTH |
|
The number of draft tokens to generate in speculative proposal. The default value is 4.
|
--enable-tracing Enable Chrome Tracing for the server. After enabling, you can send POST request to the |
|
"debug/dump_event_trace" entrypoint to get the Chrome Trace. For example, "curl -X POST |
|
http://127.0.0.1:8000/debug/dump_event_trace -H "Content-Type: application/json" -d '{"model": |
|
"dist/llama"}'" |
|
--host HOST host name (default: "127.0.0.1") |
|
--port PORT port (default: "8000") |
|
--allow-credentials allow credentials |
|
--allow-origins ALLOW_ORIGINS |
|
allowed origins (default: "['*']") |
|
--allow-methods ALLOW_METHODS |
|
allowed methods (default: "['*']") |
|
--allow-headers ALLOW_HEADERS |
|
allowed headers (default: "['*']")
```
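
As an example, the server can be started on its defaults (127.0.0.1:8000) and then queried over HTTP. The `/v1/chat/completions` path below assumes the OpenAI-compatible REST API that `mlc_llm serve` exposes; adjust the `model` identifier to whatever was passed to `serve`:

```sh
mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --mode server

# In a second terminal (endpoint path assumes the OpenAI-compatible API):
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```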
|
|
|
_________________________________________________________________________________________________________________________________________ |
|
```text
$ mlc_llm gen_config --help
|
usage: MLC LLM Configuration Generator [-h] --quantization |
|
{q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16} |
|
[--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}] |
|
--conv-template |
|
{llama-3,custom,open_hermes_mistral,vicuna_v1.1,gorilla,gorilla-openfunctions-v2,llava,gpt2,minigpt,stablecode_completion,conv_one_shot,llama-2,stablelm-3b,guanaco,LM,rwkv_world,gpt_bigcode,codellama_instruct,phi-2,phi-3,wizardlm_7b,stablelm-2,mistral_default,redpajama_chat,oasst,stablelm,llama_default,moss,gemma_instruction,neural_hermes_mistral,rwkv,stablecode_instruct,codellama_completion,wizard_coder_or_math,chatml,orion,glm,dolly} |
|
[--context-window-size CONTEXT_WINDOW_SIZE] |
|
[--sliding-window-size SLIDING_WINDOW_SIZE] [--prefill-chunk-size PREFILL_CHUNK_SIZE] |
|
[--attention-sink-size ATTENTION_SINK_SIZE] |
|
[--tensor-parallel-shards TENSOR_PARALLEL_SHARDS] [--max-batch-size MAX_BATCH_SIZE] |
|
--output OUTPUT |
|
config |
|
|
|
positional arguments: |
|
config 1) Path to a HuggingFace model directory that contains a `config.json` or 2) Path to `config.json` |
|
in HuggingFace format, or 3) The name of a pre-defined model architecture. A `config.json` file in |
|
HuggingFace format defines the model architecture, including the vocabulary size, the number of |
|
layers, the hidden size, number of attention heads, etc. Example: |
|
https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json. A HuggingFace directory |
|
often contains a `config.json` which defines the model architecture, the non-quantized model weights |
|
in PyTorch or SafeTensor format, tokenizer configurations, as well as an optional |
|
`generation_config.json` that provides additional default configuration for text generation. Example:
|
https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main. (required) |
|
|
|
options: |
|
-h, --help show this help message and exit |
|
--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16} |
|
The quantization mode we use to compile. If unprovided, will infer from `model`. (required, choices: |
|
q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq, q4f16_ft, |
|
e5m2_e5m2_f16, e4m3_e4m3_f16) |
|
--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa} |
|
Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`. |
|
(default: "auto", choices: auto, llama, mistral, gemma, gpt2, mixtral, gpt_neox, gpt_bigcode, phi- |
|
msft, phi, phi3, qwen, qwen2, stablelm, baichuan, internlm, rwkv5, orion, llava, rwkv6, chatglm, |
|
eagle, bert, medusa) |
|
--conv-template {llama-3,custom,open_hermes_mistral,vicuna_v1.1,gorilla,gorilla-openfunctions-v2,llava,gpt2,minigpt,stablecode_completion,conv_one_shot,llama-2,stablelm-3b,guanaco,LM,rwkv_world,gpt_bigcode,codellama_instruct,phi-2,phi-3,wizardlm_7b,stablelm-2,mistral_default,redpajama_chat,oasst,stablelm,llama_default,moss,gemma_instruction,neural_hermes_mistral,rwkv,stablecode_instruct,codellama_completion,wizard_coder_or_math,chatml,orion,glm,dolly} |
|
Conversation template. It depends on how the model is tuned. Use "LM" for vanilla base model |
|
(required, choices: llama-3, custom, open_hermes_mistral, vicuna_v1.1, gorilla, gorilla- |
|
openfunctions-v2, llava, gpt2, minigpt, stablecode_completion, conv_one_shot, llama-2, stablelm-3b, |
|
guanaco, LM, rwkv_world, gpt_bigcode, codellama_instruct, phi-2, phi-3, wizardlm_7b, stablelm-2, |
|
mistral_default, redpajama_chat, oasst, stablelm, llama_default, moss, gemma_instruction, |
|
neural_hermes_mistral, rwkv, stablecode_instruct, codellama_completion, wizard_coder_or_math, |
|
chatml, orion, glm, dolly) |
|
--context-window-size CONTEXT_WINDOW_SIZE |
|
Option to provide the maximum sequence length supported by the model. This is usually explicitly |
|
shown as context length or context window in the model card. If this option is not set explicitly, |
|
by default, it will be determined by `context_window_size` or `max_position_embeddings` in |
|
`config.json`, and the latter is usually inaccurate for some models. (default: "None") |
|
--sliding-window-size SLIDING_WINDOW_SIZE |
|
(Experimental) The sliding window size in sliding window attention (SWA). This optional field |
|
overrides the `sliding_window_size` in config.json for those models that use SWA. Currently only |
|
useful when compiling Mistral. This flag is subject to future refactoring. (default: "None")
|
--prefill-chunk-size PREFILL_CHUNK_SIZE |
|
(Experimental) The chunk size during prefilling. By default, the chunk size is the same as sliding |
|
window or max sequence length. This flag is subject to future refactoring. (default: "None")
|
--attention-sink-size ATTENTION_SINK_SIZE |
|
(Experimental) The number of stored sinks. Currently only supported on Mistral. By default, the number of

sinks is 4. This flag is subject to future refactoring. (default: "None")
|
--tensor-parallel-shards TENSOR_PARALLEL_SHARDS |
|
Number of shards to split the model into in tensor parallelism multi-gpu inference. (default: |
|
"None") |
|
--max-batch-size MAX_BATCH_SIZE |
|
The maximum allowed batch size set for the KV cache to concurrently support. (default: "80") |
|
--output OUTPUT, -o OUTPUT |
|
The output directory for generated configurations, including `mlc-chat-config.json` and tokenizer |
|
configuration. (required)
```
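
A sample configuration-generation step (typically run together with `convert_weight`, before `compile`). The local checkpoint path and output directory are placeholders, while `q4f16_1` and `llama-3` come from the choices listed above:

```sh
mlc_llm gen_config ./models/Meta-Llama-3-8B-Instruct/config.json \
    --quantization q4f16_1 \
    --conv-template llama-3 \
    -o ./dist/Llama-3-8B-Instruct-q4f16_1-MLC
```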
|
________________________________________________________________________________________________________________________________________ |
|
```text
$ mlc_llm bench --help
|
usage: MLC LLM Chat CLI [-h] [--prompt PROMPT] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES] |
|
[--generate-length GENERATE_LENGTH] [--model-lib MODEL_LIB] |
|
model |
|
|
|
positional arguments: |
|
model A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`. |
|
It can also be a link to a HF repository pointing to an MLC compiled model. (required) |
|
|
|
options: |
|
-h, --help show this help message and exit |
|
--prompt PROMPT The prompt of the text generation. (default: "What is the meaning of life?") |
|
--opt OPT Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2, |
|
O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme |
|
optimization that could potentially break the system. Meanwhile, optimization flags could be |
|
explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2") |
|
--device DEVICE The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available |
|
GPUs if not specified. (default: "auto") |
|
--overrides OVERRIDES |
|
Chat configuration override. Configurations to override ChatConfig. Supports `conv_template`, |
|
`context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`, |
|
`max_batch_size` and `tensor_parallel_shards`. Meanwhile, model chat could be explicitly specified |
|
via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "") |
|
--generate-length GENERATE_LENGTH |
|
The target length of the text generation. (default: "256") |
|
--model-lib MODEL_LIB |
|
The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use |
|
the provided ``model`` to search over possible paths. If the model lib is not found, it will be

compiled in a JIT manner. (default: "None")
```
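
A quick benchmark sketch against the same prebuilt model, with an arbitrary prompt, generation length, and example device string:

```sh
mlc_llm bench HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
    --prompt "Explain the attention mechanism in one paragraph." \
    --generate-length 128 \
    --device vulkan
```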
|
|
|
__________________________________________________________________________________________________________________________________________ |
|
|
|
```text
$ mlc_llm package --help
|
usage: MLC LLM Package CLI [-h] [--package-config PACKAGE_CONFIG] [--mlc-llm-home MLC_LLM_HOME] [--output OUTPUT] |
|
|
|
options: |
|
-h, --help show this help message and exit |
|
--package-config PACKAGE_CONFIG |
|
The path to "mlc-package-config.json" which is used for package build. See "https://github.com/mlc- |
|
ai/mlc-llm/blob/main/ios/MLCChat/mlc-package-config.json" as an example. (default: "mlc-package- |
|
config.json") |
|
--mlc-llm-home MLC_LLM_HOME |
|
The source code path to MLC LLM. (default: the $MLC_LLM_HOME environment variable) |
|
--output OUTPUT, -o OUTPUT |
|
The path of output directory for the package build outputs. (default: "dist")
```
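
`package` targets app bundles (see the iOS `mlc-package-config.json` linked above). A minimal sketch, assuming an `mlc-package-config.json` in the current directory and `MLC_LLM_HOME` pointing at a local source checkout (the path below is a placeholder):

```sh
export MLC_LLM_HOME=/path/to/mlc-llm   # placeholder: local MLC LLM source checkout
mlc_llm package --package-config mlc-package-config.json -o dist
```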
|
|
|
|