---
license: apache-2.0
inference: false
---
# MegaBeam-Mistral-7B-300k-AWQ Model

MegaBeam-Mistral-7B-300k-AWQ is a version of the MegaBeam-Mistral-7B-300k model quantized with the AWQ method of Lin et al. (2023). The AWQ variants are approximately 70% smaller than the original MegaBeam-Mistral-7B-300k weights while maintaining comparable performance.

Please refer to the original MegaBeam-Mistral-7B-300k model card for details about the model preparation and training processes.
## MegaBeam-Mistral-7B-300k Variants

| Branch | Approx. Model Size | `q_group_size` | `w_bit` | `version` |
|---|---|---|---|---|
| main | 3.9 GB | 128 | 4 | GEMM |
| MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM | 4.0 GB | 64 | 4 | GEMM |
| MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM | 4.3 GB | 32 | 4 | GEMM |
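Each variant above is published on its own branch of this repository. As an illustration only (the `huggingface_hub` package is assumed to be installed; it is not listed under Dependencies below), a non-default variant can be fetched by passing the branch name as the `revision`:

```python
# Illustrative only: download the 64-group variant by pointing `revision`
# at the branch name from the table above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ",
    revision="MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM",
)
print(local_dir)
```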
## Dependencies

- `autoawq==0.2.5` – AutoAWQ was used to quantize the MegaBeam-Mistral-7B-300k model (a minimal quantization sketch follows this list).
- `vllm==0.4.2` – vLLM was used to host models for benchmarking.
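The exact quantization recipe used for this release is not reproduced here. The following is only a minimal sketch of how a comparable AWQ quantization could be performed with AutoAWQ 0.2.5, using the `q_group_size`, `w_bit`, and `version` values listed for the `main` branch; the base-model repo id, the `zero_point` setting, and the default calibration data are assumptions.

```python
# Minimal sketch (not the exact recipe used for this release): quantize the
# base model with AutoAWQ using the main-branch settings from the variants table.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "aws-prototyping/MegaBeam-Mistral-7B-300k"  # assumed base-model repo id
quant_path = "MegaBeam-Mistral-7B-300k-AWQ"              # local output directory

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# AutoAWQ runs activation-aware calibration on its default calibration set
# unless calibration data is supplied explicitly.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```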
## Evaluations

### InfiniteBench

This benchmark was developed by Zhang et al. (2024) and is available from https://github.com/OpenBMB/InfiniteBench.
See the original MegaBeam-Mistral-7B-300k model card for more details.
| Task Name | MegaBeam-Mistral-7B-300k-AWQ | MegaBeam-Mistral-7B-300k | Mistral-7B-Instruct-v0.2 | Llama-3-8B-Instruct-262k | Llama3-70B-1M | GPT-4-1106-preview | YaRN-Mistral-7B | Kimi-Chat | Claude 2 | Yi-6B-200K | Yi-34B-200K | Chatglm3-6B-128K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Retrieve.PassKey | 100% | 100% | 75.76% | 98.30% | 81.35% | 100% | 92.71% | 98.14% | 97.80% | 100.00% | 100.00% | 92.20% |
| Retrieve.Number | 92.7% | 96.10% | 25.25% | 97.79% | 97.62% | 100% | 56.61% | 95.42% | 98.14% | 94.92% | 100.00% | 80.68% |
| Retrieve.KV | 0% | 0% | 0% | 3.40% | 3% | 89.00% | < 5% | 53.60% | 65.40% | < 5% | < 5% | < 5% |
| En.Sum | 29.05% | 29.39% | 22.13% | 16.40% | 20.72% | 14.73% | 9.09% | 17.93% | 14.45% | < 5% | < 5% | < 5% |
| En.QA | 15.69% | 14.93% | 4.93% | 13.20% | 16.52% | 22.22% | 9.55% | 16.52% | 11.97% | 9.20% | 12.17% | < 5% |
| En.MC | 48.91% | 51.52% | 7.80% | 50.65% | 62% | 67.25% | 27.95% | 72.49% | 62.88% | 36.68% | 38.43% | 10.48% |
| En.Dia | 11.50% | 9.50% | 3.50% | 1% | 12.50% | 8.50% | 7.50% | 11.50% | 46.50% | < 5% | < 5% | < 5% |
| Zh.QA | 10.53% | 10.71% | 3.43% | 19.02% | 26% | 25.96% | 14.43% | 17.93% | 9.64% | 15.07% | 13.61% | < 5% |
| Code.Debug | 21.83% | 27.41% | 11.60% | 22.08% | 23.85% | 39.59% | < 5% | 18.02% | < 5% | < 5% | < 5% | < 5% |
| Code.Run | 1.25% | 1.75% | 0.25% | 0% | 0% | 23.25% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% |
| Math.Calc | 0% | 0% | 0% | 0% | 0% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% |
| Math.Find | 20.57% | 24.28% | 26.28% | 15.40% | 30% | 60.00% | 17.14% | 12.57% | 32.29% | < 5% | 25.71% | 7.71% |
| Average | 29.34% | 30.70% | 15.08% | 28.10% | 31.13% | 46.08% | 20.41% | 34.93% | 37.21% | 22.78% | 25.41% | 17.59% |
### Long Context
The following benchmark results are shown as accuracy (%) values, unless stated otherwise.
#### Topic Retrieval

See https://lmsys.org/blog/2023-06-29-longchat/

| Model Name | n_topics=05 | n_topics=10 | n_topics=15 | n_topics=20 | n_topics=25 |
|---|---|---|---|---|---|
| n_tokens (approx.) | 3048 | 5966 | 8903 | 11832 | 14757 |
| MegaBeam-Mistral-7B-300k | 100 | 100 | 100 | 100 | 100 |
| MegaBeam-Mistral-7B-300k-AWQ | 100 | 100 | 100 | 100 | 100 |
| MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM | 100 | 100 | 100 | 100 | 98 |
| MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM | 100 | 100 | 100 | 100 | 98 |
#### Line Retrieval

See https://lmsys.org/blog/2023-06-29-longchat/#longeval-results

| Model Name | n_lines=200 | n_lines=300 | n_lines=400 | n_lines=500 | n_lines=600 | n_lines=680 |
|---|---|---|---|---|---|---|
| n_tokens (approx.) | 4317 | 6415 | 8510 | 10610 | 12698 | 14373 |
| MegaBeam-Mistral-7B-300k | 98 | 98 | 92 | 98 | 90 | 90 |
| MegaBeam-Mistral-7B-300k-AWQ | 96 | 94 | 88 | 80 | 70 | 62 |
| MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM | 100 | 98 | 96 | 96 | 90 | 94 |
| MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM | 98 | 98 | 82 | 96 | 92 | 90 |
#### Pass Key Retrieval

See https://github.com/epfml/landmark-attention/blob/main/llama/run_test.py#L101

| Model Name | n_garbage=12000 | n_garbage=20000 | n_garbage=31000 | n_garbage=38000 | n_garbage=45000 | n_garbage=60000 |
|---|---|---|---|---|---|---|
| n_tokens (approx.) | 3272 | 5405 | 8338 | 10205 | 12071 | 16072 |
| MegaBeam-Mistral-7B-300k | 100 | 100 | 100 | 100 | 100 | 100 |
| MegaBeam-Mistral-7B-300k-AWQ | 100 | 100 | 100 | 100 | 100 | 100 |
| MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM | 100 | 100 | 100 | 100 | 100 | 100 |
| MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM | 100 | 100 | 100 | 100 | 100 | 100 |
#### QuALITY (Question Answering with Long Input Texts, Yes!)

See https://nyu-mll.github.io/quality/

| Model Name | Test set Accuracy | Hard subset Accuracy |
|---|---|---|
| MegaBeam-Mistral-7B-300k | 53.2 | 72 |
| MegaBeam-Mistral-7B-300k-AWQ | 51.3 | 71.3 |
| MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM | 52.4 | 72.1 |
| MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM | 53.1 | 71.3 |
## Usage

### Inference via vLLM HTTP Host

#### Launch Host

```bash
python -m vllm.entrypoints.openai.api_server \
    --model aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ \
    --quantization awq
```
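Depending on available GPU memory, it may be necessary to cap the served context length. vLLM's `--max-model-len` flag can be used for this; the value below is only an illustrative example.

```bash
python -m vllm.entrypoints.openai.api_server \
    --model aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ \
    --quantization awq \
    --max-model-len 32768  # example cap; raise it if GPU memory allows
```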
#### Query Host

```bash
curl -X POST http://localhost:8000/v1/completions \
     -H "Content-Type: application/json" \
     -d '{ "model": "aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ",
           "prompt": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
           "temperature": 0,
           "echo": false
         }'
```
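Because the endpoint is OpenAI-compatible, it can also be queried with the `openai` Python client (assumed to be installed separately; it is not listed under Dependencies). A minimal sketch:

```python
# Query the vLLM OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ",
    prompt="<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
    temperature=0,
    max_tokens=100,
)
print(completion.choices[0].text)
```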
### Inference via vLLM Offline Inference

```python
from vllm import LLM, SamplingParams

prompts = [
    "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
]
sampling_params = SamplingParams(temperature=0, max_tokens=100)

llm = LLM(model="aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
## License

Apache 2.0
## Limitations

Before using the MegaBeam-Mistral-7B-300k-AWQ model, it is important to perform your own independent assessment and to take measures to ensure that your use complies with your own specific quality control practices and standards, and with the local rules, laws, regulations, licenses and terms that apply to you and your content.