metadata
license: apache-2.0
inference: false

MegaBeam-Mistral-7B-300k-AWQ Model

MegaBeam-Mistral-7B-300k-AWQ is a version of the MegaBeam-Mistral-7B-300k model quantized with the AWQ method of Lin et al. (2023). The quantized model files are approximately 70% smaller than those of MegaBeam-Mistral-7B-300k while maintaining comparable performance on most of the evaluations reported below.

Please refer to the original MegaBeam-Mistral-7B-300k model card for details about the model preparation and training processes.

MegaBeam-Mistral-7B-300k Variants

| Branch | Approx. Model Size | q_group_size | w_bit | version |
|---|---|---|---|---|
| main | 3.9 GB | 128 | 4 | GEMM |
| MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM | 4.0 GB | 64 | 4 | GEMM |
| MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM | 4.3 GB | 32 | 4 | GEMM |
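
The size differences between branches are consistent with the per-group metadata that group-wise quantization stores alongside the 4-bit weights. As a rough sketch, assuming each group of `q_group_size` weights carries one 16-bit scale and one `w_bit`-bit zero point (an assumption about the packed layout, not something stated in this card), the storage cost per quantized weight can be estimated as below. Layers kept in 16-bit precision, such as embeddings, are ignored, and the function names are hypothetical.

```python
def effective_bits_per_weight(w_bit: int, q_group_size: int,
                              scale_bits: int = 16) -> float:
    """Estimate storage bits per quantized weight, including group metadata.

    Assumption: one scale of `scale_bits` bits and one zero point of
    `w_bit` bits are stored per group of `q_group_size` weights.
    """
    overhead_bits = scale_bits + w_bit
    return w_bit + overhead_bits / q_group_size


def reduction_vs_fp16(w_bit: int, q_group_size: int) -> float:
    """Fractional size reduction of a quantized layer relative to 16-bit weights."""
    return 1.0 - effective_bits_per_weight(w_bit, q_group_size) / 16.0


# Compare the three group sizes listed in the variants table.
for group_size in (128, 64, 32):
    bits = effective_bits_per_weight(4, group_size)
    saving = reduction_vs_fp16(4, group_size)
    print(f"q_group_size={group_size}: {bits:.3f} bits/weight, "
          f"{saving:.1%} smaller than fp16")
```

Under these assumptions, smaller groups cost more bits per weight, which matches the 32-group branch being the largest in the table, and the estimated reduction for the quantized layers stays between roughly 71% and 74%.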

Dependencies

Evaluations

InfiniteBench

This benchmark was developed by Zhang et al. (2024) and is available from https://github.com/OpenBMB/InfiniteBench.

See the original MegaBeam-Mistral-7B-300k model card for more details.

| Task Name | MegaBeam-Mistral-7B-300k-AWQ | MegaBeam-Mistral-7B-300k | Mistral-7B-Instruct-v0.2 | Llama-3-8B-Instruct-262k | Llama3-70B-1M | GPT-4-1106-preview | YaRN-Mistral-7B | Kimi-Chat | Claude 2 | Yi-6B-200K | Yi-34B-200K | Chatglm3-6B-128K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Retrieve.PassKey | 100% | 100% | 75.76% | 98.30% | 81.35% | 100% | 92.71% | 98.14% | 97.80% | 100.00% | 100.00% | 92.20% |
| Retrieve.Number | 92.7% | 96.10% | 25.25% | 97.79% | 97.62% | 100% | 56.61% | 95.42% | 98.14% | 94.92% | 100.00% | 80.68% |
| Retrieve.KV | 0% | 0% | 0% | 3.40% | 3% | 89.00% | < 5% | 53.60% | 65.40% | < 5% | < 5% | < 5% |
| En.Sum | 29.05% | 29.39% | 22.13% | 16.40% | 20.72% | 14.73% | 9.09% | 17.93% | 14.45% | < 5% | < 5% | < 5% |
| En.QA | 15.69% | 14.93% | 4.93% | 13.20% | 16.52% | 22.22% | 9.55% | 16.52% | 11.97% | 9.20% | 12.17% | < 5% |
| En.MC | 48.91% | 51.52% | 7.80% | 50.65% | 62% | 67.25% | 27.95% | 72.49% | 62.88% | 36.68% | 38.43% | 10.48% |
| En.Dia | 11.50% | 9.50% | 3.50% | 1% | 12.50% | 8.50% | 7.50% | 11.50% | 46.50% | < 5% | < 5% | < 5% |
| Zh.QA | 10.53% | 10.71% | 3.43% | 19.02% | 26% | 25.96% | 14.43% | 17.93% | 9.64% | 15.07% | 13.61% | < 5% |
| Code.Debug | 21.83% | 27.41% | 11.60% | 22.08% | 23.85% | 39.59% | < 5% | 18.02% | < 5% | < 5% | < 5% | < 5% |
| Code.Run | 1.25% | 1.75% | 0.25% | 0% | 0% | 23.25% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% |
| Math.Calc | 0% | 0% | 0% | 0% | 0% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% |
| Math.Find | 20.57% | 24.28% | 26.28% | 15.40% | 30% | 60.00% | 17.14% | 12.57% | 32.29% | < 5% | 25.71% | 7.71% |
| Average | 29.34% | 30.70% | 15.08% | 28.10% | 31.13% | 46.08% | 20.41% | 34.93% | 37.21% | 22.78% | 25.41% | 17.59% |
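
To see where quantization costs the most, the two MegaBeam columns can be compared task by task. The sketch below copies those columns from the table and, assuming the Average row is an unweighted mean over the twelve tasks (an assumption, though it reproduces the AWQ figure), recomputes the AWQ average. The variable and function names are illustrative only.

```python
# Scores (%) copied from the two MegaBeam columns of the InfiniteBench table.
TASKS = [
    "Retrieve.PassKey", "Retrieve.Number", "Retrieve.KV", "En.Sum",
    "En.QA", "En.MC", "En.Dia", "Zh.QA",
    "Code.Debug", "Code.Run", "Math.Calc", "Math.Find",
]
AWQ = [100, 92.7, 0, 29.05, 15.69, 48.91, 11.50, 10.53, 21.83, 1.25, 0, 20.57]
BASE = [100, 96.10, 0, 29.39, 14.93, 51.52, 9.50, 10.71, 27.41, 1.75, 0, 24.28]


def per_task_delta(quantized, original):
    """Return quantized-minus-original score differences in percentage points."""
    return [round(q - o, 2) for q, o in zip(quantized, original)]


deltas = dict(zip(TASKS, per_task_delta(AWQ, BASE)))

# Assumed aggregation: plain mean over all twelve tasks.
awq_mean = round(sum(AWQ) / len(AWQ), 2)

# List tasks from the largest drop to the largest gain.
for task, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{task:18s} {delta:+.2f}")
print("AWQ unweighted mean:", awq_mean)
```

On these figures the largest drop is on Code.Debug, while En.Dia and En.QA score slightly higher after quantization.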

Long Context

The following benchmark results are shown as accuracy (%) values, unless stated otherwise.

Topic Retrieval

See https://lmsys.org/blog/2023-06-29-longchat/

| Model Name | n_topics=05 | n_topics=10 | n_topics=15 | n_topics=20 | n_topics=25 |
|---|---|---|---|---|---|
| n_tokens (approx.) | 3048 | 5966 | 8903 | 11832 | 14757 |
| MegaBeam-Mistral-7B-300k | 100 | 100 | 100 | 100 | 100 |
| MegaBeam-Mistral-7B-300k-AWQ | 100 | 100 | 100 | 100 | 100 |
| MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM | 100 | 100 | 100 | 100 | 98 |
| MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM | 100 | 100 | 100 | 100 | 98 |

Line Retrieval

See https://lmsys.org/blog/2023-06-29-longchat/#longeval-results

| Model Name | n_lines=200 | n_lines=300 | n_lines=400 | n_lines=500 | n_lines=600 | n_lines=680 |
|---|---|---|---|---|---|---|
| n_tokens (approx.) | 4317 | 6415 | 8510 | 10610 | 12698 | 14373 |
| MegaBeam-Mistral-7B-300k | 98 | 98 | 92 | 98 | 90 | 90 |
| MegaBeam-Mistral-7B-300k-AWQ | 96 | 94 | 88 | 80 | 70 | 62 |
| MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM | 100 | 98 | 96 | 96 | 90 | 94 |
| MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM | 98 | 98 | 82 | 96 | 92 | 90 |

Pass Key Retrieval

See https://github.com/epfml/landmark-attention/blob/main/llama/run_test.py#L101

| Model Name | n_garbage=12000 | n_garbage=20000 | n_garbage=31000 | n_garbage=38000 | n_garbage=45000 | n_garbage=60000 |
|---|---|---|---|---|---|---|
| n_tokens (approx.) | 3272 | 5405 | 8338 | 10205 | 12071 | 16072 |
| MegaBeam-Mistral-7B-300k | 100 | 100 | 100 | 100 | 100 | 100 |
| MegaBeam-Mistral-7B-300k-AWQ | 100 | 100 | 100 | 100 | 100 | 100 |
| MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM | 100 | 100 | 100 | 100 | 100 | 100 |
| MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM | 100 | 100 | 100 | 100 | 100 | 100 |
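
For readers unfamiliar with the task, a pass key prompt hides a short number inside `n_garbage` characters of filler text and then asks for it back. The sketch below is loosely modeled on the linked script; the filler sentences, the question wording, the seeding, and the function names are illustrative assumptions rather than the exact setup used for the table.

```python
import random


def make_passkey_prompt(n_garbage: int, seed: int = 0) -> tuple[str, int]:
    """Build a prompt with a pass key hidden in n_garbage characters of filler."""
    rng = random.Random(seed)
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    repeated = filler * (n_garbage // len(filler) + 1)

    # Split the filler budget at a random point so the key position varies.
    split = rng.randint(0, n_garbage)
    prefix, suffix = repeated[:split], repeated[:n_garbage - split]

    pass_key = rng.randint(1, 50000)
    parts = [
        "There is important information hidden in a lot of irrelevant text. "
        "Find it and memorize it.",
        prefix,
        f"The pass key is {pass_key}. Remember it. {pass_key} is the pass key.",
        suffix,
        "What is the pass key? The pass key is",
    ]
    return "\n".join(parts), pass_key


def is_correct(generated_text: str, pass_key: int) -> bool:
    """Count a completion as correct if it contains the pass key digits."""
    return str(pass_key) in generated_text


prompt, key = make_passkey_prompt(n_garbage=12000)
print(len(prompt), "characters; hidden key:", key)
```

Accuracy for one column of the table would then be the share of such prompts, generated with different seeds, for which `is_correct` holds on the model's completion.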

QuALITY (Question Answering with Long Input Texts, Yes!)

See https://nyu-mll.github.io/quality/

| Model Name | Test set Accuracy | Hard subset Accuracy |
|---|---|---|
| MegaBeam-Mistral-7B-300k | 53.2 | 72 |
| MegaBeam-Mistral-7B-300k-AWQ | 51.3 | 71.3 |
| MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM | 52.4 | 72.1 |
| MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM | 53.1 | 71.3 |

Usage

Inference via vLLM HTTP Host

Launch Host

python -m vllm.entrypoints.openai.api_server \
    --model aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ \
    --quantization awq

Query Host

curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{ "model": "aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ",
          "prompt": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
          "temperature": 0,
          "echo": false
    }'
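
The same request can be sent from Python. This is a minimal sketch using only the standard library; it assumes the server launched above is listening on `localhost:8000`, and the helper names (`format_prompt`, `build_request`, `complete`) and the `max_tokens` default are illustrative choices, not part of the model or of vLLM.

```python
import json
import urllib.request

MODEL_ID = "aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ"


def format_prompt(user_message: str) -> str:
    """Wrap a single user turn in the prompt format used in the curl example."""
    return f"<|prompter|>{user_message}</s><|assistant|>"


def build_request(user_message: str,
                  base_url: str = "http://localhost:8000",
                  max_tokens: int = 100) -> urllib.request.Request:
    """Create the POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": format_prompt(user_message),
        "temperature": 0,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(user_message: str, **kwargs) -> str:
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(build_request(user_message, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example call (requires the server to be running):
# print(complete("What are the main challenges to support a long context for LLM?"))
```

Keeping `format_prompt` separate makes it easy to reuse the same template for offline inference.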

Inference via vLLM Offline Inference

from vllm import LLM, SamplingParams

prompts = [
   "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
]
sampling_params = SamplingParams(temperature=0, max_tokens=100)

llm = LLM(model="aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
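
To run one of the other branches listed under Variants, the branch name can be passed as the model revision. The following is a sketch that assumes the installed vLLM version accepts the `revision` and `quantization` keyword arguments on `LLM`; the `BRANCHES` mapping and the helper names are hypothetical conveniences built from the variants table.

```python
MODEL_ID = "aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ"

# q_group_size -> repository branch, as listed in the variants table.
BRANCHES = {
    128: "main",
    64: "MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM",
    32: "MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM",
}


def engine_kwargs(q_group_size: int = 128) -> dict:
    """Build the keyword arguments for loading one quantized variant."""
    return {
        "model": MODEL_ID,
        "revision": BRANCHES[q_group_size],
        "quantization": "awq",
    }


def load_variant(q_group_size: int = 128):
    """Instantiate a vLLM engine for the chosen variant."""
    # Imported lazily so engine_kwargs() can be used without vLLM installed.
    from vllm import LLM
    return LLM(**engine_kwargs(q_group_size))


# Example (downloads the weights and requires a supported GPU):
# llm = load_variant(64)
```

The returned object can be used with `llm.generate(prompts, sampling_params)` exactly as in the example above.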

License

Apache 2.0

Limitations

Before using the MegaBeam-Mistral-7B-300k-AWQ model, perform your own independent assessment and take measures to ensure that your use complies with your own quality control practices and standards, and with the local rules, laws, regulations, licenses, and terms that apply to you and your content.