llama2.c-stories15M-pruned50
This repo contains model files for llama2.c 15M TinyStories, optimized for NM-vLLM, a high-throughput serving engine for compressed LLMs.
This model was pruned to 50% sparsity with SparseGPT using llm-compressor.
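Because the pruned checkpoint is saved with dense (uncompressed) weights, it should load like any standard Hugging Face model. Below is a minimal generation sketch using plain transformers; "REPO_ID" is a hypothetical placeholder for this repository's path on the Hub:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace "REPO_ID" with this repository's path on the Hugging Face Hub
repo_id = "REPO_ID"

model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Generate a short continuation of a TinyStories-style prompt
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))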
Sparsification
Install llm-compressor:
pip install llmcompressor
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

# Source model to prune and dataset used for calibration
hf_model_stub = "Xenova/llama2.c-stories15M"
calibration_dataset = "open_platypus"
output_directory = f"{hf_model_stub.split('/')[-1]}-pruned_50.2of4-uncompressed"

# Load the dense model
model = SparseAutoModelForCausalLM.from_pretrained(
    hf_model_stub, torch_dtype="auto", device_map="auto"
)

# SparseGPT recipe: 50% sparsity with a 2:4 mask applied to the decoder layers
recipe = """
test_stage:
  obcq_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      sequential_update: true
      mask_structure: "2:4"
      targets: ['re:model.layers.\d*$']
"""

# Apply the recipe in one shot using the calibration dataset
oneshot(
    model=model,
    dataset=calibration_dataset,
    recipe=recipe,
    output_dir=output_directory,
)

# Save the pruned model with dense (uncompressed) weights
model.save_pretrained(output_directory, save_compressed=False)
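After the oneshot run it can be useful to confirm that the target layers were actually pruned. The following is a minimal sketch, assuming the output directory produced by the script above; it reloads the checkpoint with plain transformers and reports the fraction of zero-valued weights in each Linear module of the decoder layers:

import torch
from transformers import AutoModelForCausalLM

# Directory written by the sparsification script above
output_directory = "llama2.c-stories15M-pruned_50.2of4-uncompressed"
pruned = AutoModelForCausalLM.from_pretrained(output_directory)

# Report the share of zeroed weights per pruned Linear layer
for name, module in pruned.named_modules():
    if isinstance(module, torch.nn.Linear) and "model.layers" in name:
        weight = module.weight.detach()
        sparsity = (weight == 0).float().mean().item()
        print(f"{name}: {sparsity:.2%} zeros")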
Slack
For further support, and to discuss these models and AI in general, join Neural Magic's Slack Community.