alexmarques committed
Commit c1fdfad • Parent: 1d11c7a
Update README.md

README.md CHANGED
@@ -13,19 +13,19 @@ license: llama2
 - **Output:** Text
 - **Model Optimizations:**
   - **Weight quantization:** INT4
-- **Intended Use Cases:** Intended for commercial and research use in English. Similarly to [Phi-3-medium-
+- **Intended Use Cases:** Intended for commercial and research use in English. Similarly to [Phi-3-medium-128k-instruct](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct), this model is intended for assistant-like chat.
 - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
 - **Release Date:** 7/11/2024
 - **Version:** 1.0
 - **License(s)**: [MIT](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/resolve/main/LICENSE)
 - **Model Developers:** Neural Magic
 
-Quantized version of [Phi-3-medium-
+Quantized version of [Phi-3-medium-128k-instruct](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct).
 It achieves an average score of 72.38 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 74.46.
 
 ### Model Optimizations
 
-This model was obtained by quantizing the weights of [Phi-3-medium-
+This model was obtained by quantizing the weights of [Phi-3-medium-128k-instruct](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct) to the INT4 data type.
 This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
 
 Only the weights of the linear operators within transformer blocks are quantized. Symmetric group-wise quantization is applied, in which a linear scaling per group maps the INT4 and floating point representations of the quantized weights.
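To make the group-wise scheme concrete: each group of consecutive weights in a linear layer shares a single scale, chosen so that the group's largest magnitude lands at the edge of the INT4 range. A minimal sketch follows; the group size of 128 and the helper name are illustrative assumptions, since the diff does not show these details.

```python
import torch

def quantize_group_symmetric(w: torch.Tensor, group_size: int = 128):
    """Sketch of symmetric group-wise INT4 quantization; group_size is an assumption."""
    groups = w.reshape(-1, group_size)               # assumes numel divisible by group_size
    scale = groups.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0  # one linear scale per group
    q = torch.clamp(torch.round(groups / scale), -8, 7)                   # INT4 codes in [-8, 7]
    w_hat = q * scale                                                     # floating-point reconstruction
    return q.to(torch.int8), scale, w_hat
```

Only the INT4 codes plus one 16-bit scale per group need to be stored, which is where the roughly 4x reduction in weight footprint comes from.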
@@ -120,7 +120,7 @@ from llmcompressor.modifiers.quantization import GPTQModifier
 from datasets import load_dataset
 import random
 
-model_id = "microsoft/Phi-3-medium-
+model_id = "microsoft/Phi-3-medium-128k-instruct"
 
 num_samples = 512
 max_seq_len = 4096
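This hunk only updates the `model_id` line of the README's calibration snippet. For orientation, here is a minimal sketch of how those fragments typically fit together in an llm-compressor GPTQ one-shot run; the calibration dataset, the recipe arguments, and the save directory are assumptions rather than contents of the diff.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot  # import path used by 2024 llm-compressor releases
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "microsoft/Phi-3-medium-128k-instruct"
num_samples = 512
max_seq_len = 4096

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Calibration data: chat-formatted prompts (ultrachat is an assumed choice).
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(num_samples))
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=max_seq_len, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# GPTQ recipe: INT4 weights with 16-bit activations, leaving lm_head unquantized.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

model.save_pretrained("Phi-3-medium-128k-instruct-quantized.w4a16")
tokenizer.save_pretrained("Phi-3-medium-128k-instruct-quantized.w4a16")
```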
@@ -184,7 +184,7 @@ lm_eval \
 <tr>
  <td><strong>Benchmark</strong>
  </td>
- <td><strong>Phi-3-medium-
+ <td><strong>Phi-3-medium-128k-instruct</strong>
  </td>
  <td><strong>Phi-3-medium-128k-instruct-quantized.w4a16 (this model)</strong>
  </td>
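The context line of the final hunk shows the README's `lm_eval` invocation. As a usage sketch, reproducing the OpenLLM v1 numbers with lm-evaluation-harness could look like the following; the vLLM backend, the `openllm` task group, and the repository id (taken from the table header) are assumptions, since the diff shows only the first line of the command.

```bash
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Phi-3-medium-128k-instruct-quantized.w4a16",dtype=auto,max_model_len=4096 \
  --tasks openllm \
  --batch_size auto
```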