Arun Kumar Tiwary committed
Commit 681be52
1 Parent(s): fa413cc

Update README.md

Files changed (1):
  1. README.md +169 -1

README.md CHANGED

sudo apt install ocl-icd-opencl-dev clinfo vulkan-tools
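If you want to sanity-check the GPU runtimes before installing the Python wheels, the `clinfo` and `vulkaninfo` tools installed by the packages above can be used. The exact output depends on your driver stack, and `--summary` needs a reasonably recent vulkan-tools; plain `vulkaninfo` works as a fallback:

$ clinfo | head
$ vulkaninfo --summary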
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC

___________________________________________________________________________________________________________________________________________
$ mlc_llm --help
usage: MLC LLM Command Line Interface. [-h] {compile,convert_weight,gen_config,chat,serve,bench,package}

positional arguments:
  {compile,convert_weight,gen_config,chat,serve,bench,package}
                        Subcommand to run. (choices: compile, convert_weight, gen_config, chat, serve, bench, package)

options:
  -h, --help            show this help message and exit

____________________________________________________________________________________________________________________________________________
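Taken together, the subcommands below form the usual pipeline: `convert_weight` quantizes the weights, `gen_config` writes `mlc-chat-config.json`, `compile` builds the model library, and `chat`/`serve`/`bench` consume the result (with JIT compilation as a fallback, per the `--model-lib` notes below). A rough end-to-end sketch, where every path is a placeholder rather than a file in this repo:

$ mlc_llm convert_weight ./Meta-Llama-3-8B-Instruct --quantization q4f16_1 -o ./dist/Llama-3-8B-Instruct-q4f16_1-MLC
$ mlc_llm gen_config ./Meta-Llama-3-8B-Instruct --quantization q4f16_1 --conv-template llama-3 -o ./dist/Llama-3-8B-Instruct-q4f16_1-MLC
$ mlc_llm chat ./dist/Llama-3-8B-Instruct-q4f16_1-MLC --device vulkan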
$ mlc_llm chat --help
usage: MLC LLM Chat CLI [-h] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES] [--model-lib MODEL_LIB] model

  [...]
  --model-lib MODEL_LIB
                        The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
                        the provided ``model`` to search over possible paths. If the model lib is not found, it will be
                        compiled in a JIT manner. (default: "None")

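For example, reusing the model id from the quick-start above (the override string is illustrative; the supported keys are the ones listed under `--overrides` in the `bench` section below):

$ mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --device vulkan --overrides "context_window_size=2048"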
------------------------------------------------------------------------------------------------------------------------------------------
$ mlc_llm compile --help
usage: mlc_llm compile [-h]
                       [--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}]
                       [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
  [...]
                        phases of compilation. By default, this is set to `None`, indicating that debug dumping is disabled.
                        (default: None)

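A minimal sketch of a compile invocation, assuming the converted model directory from the workflow above and the `--device`/`--output` flags that are elided from the truncated help text here (all paths are placeholders):

$ mlc_llm compile ./dist/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json --device vulkan -o ./dist/libs/Llama-3-8B-Instruct-q4f16_1-vulkan.so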
____________________________________________________________________________________________________________________________________________
$ mlc_llm convert_weight --help
usage: MLC AutoLLM Quantization Framework [-h] --quantization
                                          {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                                          [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
                                          [--device DEVICE] [--source SOURCE]
                                          [--source-format {auto,huggingface-torch,huggingface-safetensor,awq}] --output
                                          OUTPUT
                                          config

positional arguments:
  config                1) Path to a HuggingFace model directory that contains a `config.json` or 2) Path to `config.json`
                        in HuggingFace format, or 3) The name of a pre-defined model architecture. A `config.json` file in
                        HuggingFace format defines the model architecture, including the vocabulary size, the number of
                        layers, the hidden size, number of attention heads, etc. Example:
                        https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json. A HuggingFace directory
                        often contains a `config.json` which defines the model architecture, the non-quantized model weights
                        in PyTorch or SafeTensor format, tokenizer configurations, as well as an optional
                        `generation_config.json` provides additional default configuration for text generation. Example:
                        https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main. (required)

options:
  -h, --help            show this help message and exit
  --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                        The quantization mode we use to compile. If unprovided, will infer from `model`. (required, choices:
                        q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq, q4f16_ft,
                        e5m2_e5m2_f16, e4m3_e4m3_f16)
  --model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
                        Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`.
                        (default: "auto")
  --device DEVICE       The device used to do quantization such as "cuda" or "cuda:0". Will detect from local available GPUs
                        if not specified. (default: "auto")
  --source SOURCE       The path to original model weight, infer from `config` if missing. (default: "auto")
  --source-format {auto,huggingface-torch,huggingface-safetensor,awq}
                        The format of source model weight, infer from `config` if missing. (default: "auto", choices: auto,
                        huggingface-torch, huggingface-safetensor, awq")
  --output OUTPUT, -o OUTPUT
                        The output directory to save the quantized model weight. Will create `params_shard_*.bin` and
                        `ndarray-cache.json` in this directory. (required)

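For instance, converting safetensor weights from a local HuggingFace checkout (all paths are placeholders):

$ mlc_llm convert_weight ./Meta-Llama-3-8B-Instruct --quantization q4f16_1 --source-format huggingface-safetensor -o ./dist/Llama-3-8B-Instruct-q4f16_1-MLC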
--------------------------------------------------------------------------------------------------------------------------------
$ mlc_llm serve --help
  [...]
  --allow-headers ALLOW_HEADERS
                        allowed headers (default: "['*']")

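As a sketch, assuming defaults that the truncated help above does not show (host 127.0.0.1, port 8000, and the OpenAI-style /v1/chat/completions endpoint):

$ mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
$ curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "What is MLC LLM?"}]}'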
_________________________________________________________________________________________________________________________________________
$ mlc_llm gen_config --help
usage: MLC LLM Configuration Generator [-h] --quantization
                                       {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                                       [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
                                       --conv-template
                                       {llama-3,custom,open_hermes_mistral,vicuna_v1.1,gorilla,gorilla-openfunctions-v2,llava,gpt2,minigpt,stablecode_completion,conv_one_shot,llama-2,stablelm-3b,guanaco,LM,rwkv_world,gpt_bigcode,codellama_instruct,phi-2,phi-3,wizardlm_7b,stablelm-2,mistral_default,redpajama_chat,oasst,stablelm,llama_default,moss,gemma_instruction,neural_hermes_mistral,rwkv,stablecode_instruct,codellama_completion,wizard_coder_or_math,chatml,orion,glm,dolly}
                                       [--context-window-size CONTEXT_WINDOW_SIZE]
                                       [--sliding-window-size SLIDING_WINDOW_SIZE] [--prefill-chunk-size PREFILL_CHUNK_SIZE]
                                       [--attention-sink-size ATTENTION_SINK_SIZE]
                                       [--tensor-parallel-shards TENSOR_PARALLEL_SHARDS] [--max-batch-size MAX_BATCH_SIZE]
                                       --output OUTPUT
                                       config

positional arguments:
  config                1) Path to a HuggingFace model directory that contains a `config.json` or 2) Path to `config.json`
                        in HuggingFace format, or 3) The name of a pre-defined model architecture. A `config.json` file in
                        HuggingFace format defines the model architecture, including the vocabulary size, the number of
                        layers, the hidden size, number of attention heads, etc. Example:
                        https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json. A HuggingFace directory
                        often contains a `config.json` which defines the model architecture, the non-quantized model weights
                        in PyTorch or SafeTensor format, tokenizer configurations, as well as an optional
                        `generation_config.json` provides additional default configuration for text generation. Example:
                        https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main. (required)

options:
  -h, --help            show this help message and exit
  --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                        The quantization mode we use to compile. If unprovided, will infer from `model`. (required, choices:
                        q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq, q4f16_ft,
                        e5m2_e5m2_f16, e4m3_e4m3_f16)
  --model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
                        Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`.
                        (default: "auto", choices: auto, llama, mistral, gemma, gpt2, mixtral, gpt_neox, gpt_bigcode, phi-
                        msft, phi, phi3, qwen, qwen2, stablelm, baichuan, internlm, rwkv5, orion, llava, rwkv6, chatglm,
                        eagle, bert, medusa)
  --conv-template {llama-3,custom,open_hermes_mistral,vicuna_v1.1,gorilla,gorilla-openfunctions-v2,llava,gpt2,minigpt,stablecode_completion,conv_one_shot,llama-2,stablelm-3b,guanaco,LM,rwkv_world,gpt_bigcode,codellama_instruct,phi-2,phi-3,wizardlm_7b,stablelm-2,mistral_default,redpajama_chat,oasst,stablelm,llama_default,moss,gemma_instruction,neural_hermes_mistral,rwkv,stablecode_instruct,codellama_completion,wizard_coder_or_math,chatml,orion,glm,dolly}
                        Conversation template. It depends on how the model is tuned. Use "LM" for vanilla base model
                        (required, choices: llama-3, custom, open_hermes_mistral, vicuna_v1.1, gorilla, gorilla-
                        openfunctions-v2, llava, gpt2, minigpt, stablecode_completion, conv_one_shot, llama-2, stablelm-3b,
                        guanaco, LM, rwkv_world, gpt_bigcode, codellama_instruct, phi-2, phi-3, wizardlm_7b, stablelm-2,
                        mistral_default, redpajama_chat, oasst, stablelm, llama_default, moss, gemma_instruction,
                        neural_hermes_mistral, rwkv, stablecode_instruct, codellama_completion, wizard_coder_or_math,
                        chatml, orion, glm, dolly)
  --context-window-size CONTEXT_WINDOW_SIZE
                        Option to provide the maximum sequence length supported by the model. This is usually explicitly
                        shown as context length or context window in the model card. If this option is not set explicitly,
                        by default, it will be determined by `context_window_size` or `max_position_embeddings` in
                        `config.json`, and the latter is usually inaccurate for some models. (default: "None")
  --sliding-window-size SLIDING_WINDOW_SIZE
                        (Experimental) The sliding window size in sliding window attention (SWA). This optional field
                        overrides the `sliding_window_size` in config.json for those models that use SWA. Currently only
                        useful when compiling Mistral. This flag subjects to future refactoring. (default: "None")
  --prefill-chunk-size PREFILL_CHUNK_SIZE
                        (Experimental) The chunk size during prefilling. By default, the chunk size is the same as sliding
                        window or max sequence length. This flag subjects to future refactoring. (default: "None")
  --attention-sink-size ATTENTION_SINK_SIZE
                        (Experimental) The number of stored sinks. Only supported on Mistral yet. By default, the number of
                        sinks is 4. This flag subjects to future refactoring. (default: "None")
  --tensor-parallel-shards TENSOR_PARALLEL_SHARDS
                        Number of shards to split the model into in tensor parallelism multi-gpu inference. (default:
                        "None")
  --max-batch-size MAX_BATCH_SIZE
                        The maximum allowed batch size set for the KV cache to concurrently support. (default: "80")
  --output OUTPUT, -o OUTPUT
                        The output directory for generated configurations, including `mlc-chat-config.json` and tokenizer
                        configuration. (required)
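For illustration, generating the config for the weights converted earlier (paths are placeholders; `llama-3` is one of the listed conversation templates, and the context window value is just an example):

$ mlc_llm gen_config ./Meta-Llama-3-8B-Instruct --quantization q4f16_1 --conv-template llama-3 --context-window-size 8192 -o ./dist/Llama-3-8B-Instruct-q4f16_1-MLC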
________________________________________________________________________________________________________________________________________
$ mlc_llm bench --help
usage: MLC LLM Chat CLI [-h] [--prompt PROMPT] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES]
                        [--generate-length GENERATE_LENGTH] [--model-lib MODEL_LIB]
                        model

positional arguments:
  model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
                        It can also be a link to a HF repository pointing to an MLC compiled model. (required)

options:
  -h, --help            show this help message and exit
  --prompt PROMPT       The prompt of the text generation. (default: "What is the meaning of life?")
  --opt OPT             Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
                        O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
                        optimization that could potentially break the system. Meanwhile, optimization flags could be
                        explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
  --device DEVICE       The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available
                        GPUs if not specified. (default: "auto")
  --overrides OVERRIDES
                        Chat configuration override. Configurations to override ChatConfig. Supports `conv_template`,
                        `context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
                        `max_batch_size` and `tensor_parallel_shards`. Meanwhile, model chat could be explicitly specified
                        via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
  --generate-length GENERATE_LENGTH
                        The target length of the text generation. (default: "256")
  --model-lib MODEL_LIB
                        The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
                        the provided ``model`` to search over possible paths. If the model lib is not found, it will be
                        compiled in a JIT manner. (default: "None")

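For example, benchmarking the same prebuilt model with a shorter generation target (the values are arbitrary; the override keys follow the list documented above):

$ mlc_llm bench HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --generate-length 128 --overrides "prefill_chunk_size=1024"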
__________________________________________________________________________________________________________________________________________
$ mlc_llm package --help
usage: MLC LLM Package CLI [-h] [--package-config PACKAGE_CONFIG] [--mlc-llm-home MLC_LLM_HOME] [--output OUTPUT]

options:
  -h, --help            show this help message and exit
  --package-config PACKAGE_CONFIG
                        The path to "mlc-package-config.json" which is used for package build. See "https://github.com/mlc-
                        ai/mlc-llm/blob/main/ios/MLCChat/mlc-package-config.json" as an example. (default: "mlc-package-
                        config.json")
  --mlc-llm-home MLC_LLM_HOME
                        The source code path to MLC LLM. (default: the $MLC_LLM_HOME environment variable)
  --output OUTPUT, -o OUTPUT
                        The path of output directory for the package build outputs. (default: "dist")
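A sketch of a package build, assuming a local MLC LLM source checkout (the path is a placeholder) and the default package config name noted above:

$ MLC_LLM_HOME=/path/to/mlc-llm mlc_llm package --package-config ./mlc-package-config.json -o dist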