Arun Kumar Tiwary committed
Commit: 681be52
1 Parent(s): fa413cc
Update README.md

README.md CHANGED

sudo apt install ocl-icd-opencl-dev clinfo vulkan-tools
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
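The three commands above install the system OpenCL/Vulkan tooling and the nightly MLC wheels, then start an interactive chat session; the quantized Llama 3 weights are fetched from the HF repository and the model library is JIT-compiled on first use. As a hedged variation (the vulkan device name and the override value are illustrative, not taken from this README), the chat flags documented below can pin the device and context window explicitly:

mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --device vulkan --overrides "context_window_size=2048"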
____________________________________________________________________________________________________________________________
$ mlc_llm --help
usage: MLC LLM Command Line Interface. [-h] {compile,convert_weight,gen_config,chat,serve,bench,package}

positional arguments:
  {compile,convert_weight,gen_config,chat,serve,bench,package}
                        Subcommand to run. (choices: compile, convert_weight, gen_config, chat, serve, bench, package)

options:
  -h, --help            show this help message and exit

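The subcommands roughly follow the model-preparation pipeline documented in the sections below: convert_weight quantizes the weights, gen_config writes mlc-chat-config.json plus the tokenizer configuration, compile builds the model library (or it is JIT-compiled on demand), and chat, serve, and bench consume the result. A minimal sketch of that order, assuming a local HuggingFace checkout (the directory names and the q4f16_1/llama-3 choices are illustrative assumptions, not taken from this README):

mlc_llm convert_weight ./Llama-3-8B-Instruct --quantization q4f16_1 -o ./dist/Llama-3-8B-Instruct-q4f16_1-MLC
mlc_llm gen_config ./Llama-3-8B-Instruct --quantization q4f16_1 --conv-template llama-3 -o ./dist/Llama-3-8B-Instruct-q4f16_1-MLC
mlc_llm chat ./dist/Llama-3-8B-Instruct-q4f16_1-MLC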
____________________________________________________________________________________________________________________________
$ mlc_llm chat --help
usage: MLC LLM Chat CLI [-h] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES] [--model-lib MODEL_LIB] model
  ...
  --model-lib MODEL_LIB
                        The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
                        the provided ``model`` to search over possible paths. If the model lib is not found, it will be
                        compiled in a JIT manner. (default: "None")

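Because a missing model library is JIT-compiled automatically, --model-lib is only needed when a prebuilt library should be reused. A hedged example (the .so path and the vulkan device are hypothetical placeholders):

mlc_llm chat ./dist/Llama-3-8B-Instruct-q4f16_1-MLC --model-lib ./dist/libs/Llama-3-8B-Instruct-q4f16_1-vulkan.so --device vulkan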
--------------------------------------------------------------------------------------------------------------------------------
$ mlc_llm compile --help
usage: mlc_llm compile [-h]
                       [--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}]
                       [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
  ...
                        phases of compilation. By default, this is set to `None`, indicating that debug dumping is disabled.
                        (default: None)

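Only the head and tail of the compile help are shown above. As a hedged sketch of a typical invocation (the positional mlc-chat-config.json argument and the --device/-o flags are assumptions that do not appear in the truncated output; confirm them with the full mlc_llm compile --help):

mlc_llm compile ./dist/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json --device vulkan -o ./dist/libs/Llama-3-8B-Instruct-q4f16_1-vulkan.so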
____________________________________________________________________________________________________________________________
$ mlc_llm convert_weight --help
usage: MLC AutoLLM Quantization Framework [-h] --quantization
                                               {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                                               [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
                                               [--device DEVICE] [--source SOURCE]
                                               [--source-format {auto,huggingface-torch,huggingface-safetensor,awq}] --output
                                               OUTPUT
                                               config

positional arguments:
  config                1) Path to a HuggingFace model directory that contains a `config.json` or 2) Path to `config.json`
                        in HuggingFace format, or 3) The name of a pre-defined model architecture. A `config.json` file in
                        HuggingFace format defines the model architecture, including the vocabulary size, the number of
                        layers, the hidden size, number of attention heads, etc. Example:
                        https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json. A HuggingFace directory
                        often contains a `config.json` which defines the model architecture, the non-quantized model weights
                        in PyTorch or SafeTensor format, tokenizer configurations, as well as an optional
                        `generation_config.json`, which provides additional default configuration for text generation. Example:
                        https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main. (required)

options:
  -h, --help            show this help message and exit
  --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                        The quantization mode we use to compile. If unprovided, will infer from `model`. (required, choices:
                        q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq, q4f16_ft,
                        e5m2_e5m2_f16, e4m3_e4m3_f16)
  --model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
                        Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`.
                        (default: "auto")
  --device DEVICE       The device used to do quantization such as "cuda" or "cuda:0". Will detect from local available GPUs
                        if not specified. (default: "auto")
  --source SOURCE       The path to the original model weights, inferred from `config` if missing. (default: "auto")
  --source-format {auto,huggingface-torch,huggingface-safetensor,awq}
                        The format of the source model weights, inferred from `config` if missing. (default: "auto", choices:
                        auto, huggingface-torch, huggingface-safetensor, awq)
  --output OUTPUT, -o OUTPUT
                        The output directory to save the quantized model weight. Will create `params_shard_*.bin` and
                        `ndarray-cache.json` in this directory. (required)

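For example, quantizing a local copy of the CodeLlama checkpoint referenced in the help text above (the local paths and the q4f16_1 choice are illustrative):

mlc_llm convert_weight ./CodeLlama-7b-hf --quantization q4f16_1 --source-format huggingface-safetensor -o ./dist/CodeLlama-7b-hf-q4f16_1-MLC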
--------------------------------------------------------------------------------------------------------------------------------
$ mlc_llm serve --help
  ...
  --allow-headers ALLOW_HEADERS
                        allowed headers (default: "['*']")

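Only the CORS-related tail of the serve help is shown above. A hedged sketch of starting a server and probing it (the positional model argument, the 127.0.0.1:8000 default address, and the OpenAI-style /v1/chat/completions route are assumptions about mlc_llm serve, not taken from the truncated output; check the full serve --help and server docs for the exact request schema):

mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
# from another shell, assuming the server listens on 127.0.0.1:8000
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'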
____________________________________________________________________________________________________________________________
$ mlc_llm gen_config --help
usage: MLC LLM Configuration Generator [-h] --quantization
                                            {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                                            [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
                                            --conv-template
                                            {llama-3,custom,open_hermes_mistral,vicuna_v1.1,gorilla,gorilla-openfunctions-v2,llava,gpt2,minigpt,stablecode_completion,conv_one_shot,llama-2,stablelm-3b,guanaco,LM,rwkv_world,gpt_bigcode,codellama_instruct,phi-2,phi-3,wizardlm_7b,stablelm-2,mistral_default,redpajama_chat,oasst,stablelm,llama_default,moss,gemma_instruction,neural_hermes_mistral,rwkv,stablecode_instruct,codellama_completion,wizard_coder_or_math,chatml,orion,glm,dolly}
                                            [--context-window-size CONTEXT_WINDOW_SIZE]
                                            [--sliding-window-size SLIDING_WINDOW_SIZE] [--prefill-chunk-size PREFILL_CHUNK_SIZE]
                                            [--attention-sink-size ATTENTION_SINK_SIZE]
                                            [--tensor-parallel-shards TENSOR_PARALLEL_SHARDS] [--max-batch-size MAX_BATCH_SIZE]
                                            --output OUTPUT
                                            config

positional arguments:
  config                1) Path to a HuggingFace model directory that contains a `config.json` or 2) Path to `config.json`
                        in HuggingFace format, or 3) The name of a pre-defined model architecture. A `config.json` file in
                        HuggingFace format defines the model architecture, including the vocabulary size, the number of
                        layers, the hidden size, number of attention heads, etc. Example:
                        https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json. A HuggingFace directory
                        often contains a `config.json` which defines the model architecture, the non-quantized model weights
                        in PyTorch or SafeTensor format, tokenizer configurations, as well as an optional
                        `generation_config.json`, which provides additional default configuration for text generation. Example:
                        https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main. (required)

options:
  -h, --help            show this help message and exit
  --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                        The quantization mode we use to compile. If unprovided, will infer from `model`. (required, choices:
                        q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq, q4f16_ft,
                        e5m2_e5m2_f16, e4m3_e4m3_f16)
  --model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
                        Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`.
                        (default: "auto", choices: auto, llama, mistral, gemma, gpt2, mixtral, gpt_neox, gpt_bigcode, phi-
                        msft, phi, phi3, qwen, qwen2, stablelm, baichuan, internlm, rwkv5, orion, llava, rwkv6, chatglm,
                        eagle, bert, medusa)
  --conv-template {llama-3,custom,open_hermes_mistral,vicuna_v1.1,gorilla,gorilla-openfunctions-v2,llava,gpt2,minigpt,stablecode_completion,conv_one_shot,llama-2,stablelm-3b,guanaco,LM,rwkv_world,gpt_bigcode,codellama_instruct,phi-2,phi-3,wizardlm_7b,stablelm-2,mistral_default,redpajama_chat,oasst,stablelm,llama_default,moss,gemma_instruction,neural_hermes_mistral,rwkv,stablecode_instruct,codellama_completion,wizard_coder_or_math,chatml,orion,glm,dolly}
                        Conversation template. It depends on how the model is tuned. Use "LM" for a vanilla base model.
                        (required, choices: llama-3, custom, open_hermes_mistral, vicuna_v1.1, gorilla, gorilla-
                        openfunctions-v2, llava, gpt2, minigpt, stablecode_completion, conv_one_shot, llama-2, stablelm-3b,
                        guanaco, LM, rwkv_world, gpt_bigcode, codellama_instruct, phi-2, phi-3, wizardlm_7b, stablelm-2,
                        mistral_default, redpajama_chat, oasst, stablelm, llama_default, moss, gemma_instruction,
                        neural_hermes_mistral, rwkv, stablecode_instruct, codellama_completion, wizard_coder_or_math,
                        chatml, orion, glm, dolly)
  --context-window-size CONTEXT_WINDOW_SIZE
                        Option to provide the maximum sequence length supported by the model. This is usually explicitly
                        shown as context length or context window in the model card. If this option is not set explicitly,
                        by default, it will be determined by `context_window_size` or `max_position_embeddings` in
                        `config.json`, and the latter is usually inaccurate for some models. (default: "None")
  --sliding-window-size SLIDING_WINDOW_SIZE
                        (Experimental) The sliding window size in sliding window attention (SWA). This optional field
                        overrides the `sliding_window_size` in config.json for those models that use SWA. Currently only
                        useful when compiling Mistral. This flag is subject to future refactoring. (default: "None")
  --prefill-chunk-size PREFILL_CHUNK_SIZE
                        (Experimental) The chunk size during prefilling. By default, the chunk size is the same as sliding
                        window or max sequence length. This flag is subject to future refactoring. (default: "None")
  --attention-sink-size ATTENTION_SINK_SIZE
                        (Experimental) The number of stored sinks. Currently only supported on Mistral. By default, the
                        number of sinks is 4. This flag is subject to future refactoring. (default: "None")
  --tensor-parallel-shards TENSOR_PARALLEL_SHARDS
                        Number of shards to split the model into for tensor-parallel multi-GPU inference. (default:
                        "None")
  --max-batch-size MAX_BATCH_SIZE
                        The maximum allowed batch size set for the KV cache to concurrently support. (default: "80")
  --output OUTPUT, -o OUTPUT
                        The output directory for generated configurations, including `mlc-chat-config.json` and tokenizer
                        configuration. (required)

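For example, generating the chat configuration for the same CodeLlama checkout used above (the paths and the 4096 context window are illustrative; the conversation template must match how the model was tuned):

mlc_llm gen_config ./CodeLlama-7b-hf --quantization q4f16_1 --conv-template codellama_instruct --context-window-size 4096 -o ./dist/CodeLlama-7b-hf-q4f16_1-MLC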
____________________________________________________________________________________________________________________________
$ mlc_llm bench --help
usage: MLC LLM Chat CLI [-h] [--prompt PROMPT] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES]
                        [--generate-length GENERATE_LENGTH] [--model-lib MODEL_LIB]
                        model

positional arguments:
  model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
                        It can also be a link to a HF repository pointing to an MLC compiled model. (required)

options:
  -h, --help            show this help message and exit
  --prompt PROMPT       The prompt of the text generation. (default: "What is the meaning of life?")
  --opt OPT             Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
                        O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
                        optimization that could potentially break the system. Meanwhile, optimization flags could be
                        explicitly specified via detailed knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
  --device DEVICE       The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available
                        GPUs if not specified. (default: "auto")
  --overrides OVERRIDES
                        Chat configuration override. Configurations to override ChatConfig. Supports `conv_template`,
                        `context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
                        `max_batch_size` and `tensor_parallel_shards`. Meanwhile, the chat configuration could be explicitly
                        specified via detailed knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
  --generate-length GENERATE_LENGTH
                        The target length of the text generation. (default: "256")
  --model-lib MODEL_LIB
                        The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
                        the provided ``model`` to search over possible paths. If the model lib is not found, it will be
                        compiled in a JIT manner. (default: "None")

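For example, a quick single-prompt generation check against the quantized Llama 3 model from the quickstart (the prompt, generate length, and vulkan device are illustrative):

mlc_llm bench HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --prompt "Explain OpenCL in one paragraph." --generate-length 128 --device vulkan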
____________________________________________________________________________________________________________________________

$ mlc_llm package --help
usage: MLC LLM Package CLI [-h] [--package-config PACKAGE_CONFIG] [--mlc-llm-home MLC_LLM_HOME] [--output OUTPUT]

options:
  -h, --help            show this help message and exit
  --package-config PACKAGE_CONFIG
                        The path to "mlc-package-config.json" which is used for package build. See "https://github.com/mlc-
                        ai/mlc-llm/blob/main/ios/MLCChat/mlc-package-config.json" as an example. (default: "mlc-package-
                        config.json")
  --mlc-llm-home MLC_LLM_HOME
                        The source code path to MLC LLM. (default: the $MLC_LLM_HOME environment variable)
  --output OUTPUT, -o OUTPUT
                        The path of the output directory for the package build outputs. (default: "dist")
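For example, building package artifacts from a package config modeled on the linked example (the MLC LLM checkout path is a placeholder; MLC_LLM_HOME and the flags shown are taken from the help above):

export MLC_LLM_HOME=/path/to/mlc-llm
mlc_llm package --package-config ./mlc-package-config.json -o dist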