metadata

datasets:
  - HuggingFaceH4/ultrachat_200k
language:
  - en
pipeline_tag: text-generation

SparseLlama-2-7b-ultrachat_200k-pruned_50.2of4

Model Overview

Model Architecture: Llama-2
- Input: Text
- Output: Text
Model Optimizations:
- Pruned: 50% 2:4
Release Date: 6/28/2024
Version: 1.0
Model Developers: Neural Magic

Compressed version of Llama-2-7b specialized for text-generation. This model was obtained by fine-tuning the Sparse Foundational model Sparse-Llama-2-7b-pruned_50.2of4 on the ultrachat_200k dataset, using [SquareHead distillation] (https://arxiv.org/abs/2310.06927) and Llama-2-7b-ultrachat200k as teacher. It achieves a win rate of 64.9% on the AlpacaEval benchmark (version 1.0) when using Llama-2-70b-chat as evaluator, whereas the dense Llama-2-7b-ultrachat200k model achieves 57.6% win rate.

This model was produced as part if Neural Magic's Sparse Foundational Models initiative, and demostrates the capability of Sparse Foundational Models to transfer to the text-generation domain.

Note: This model uses the chat template from zephyr-7b-beta.

Model Optimizations

This model is derived from the Sparse Foundational model Sparse-Llama-2-7b-pruned_50.2of4, which was obtained by applying the SparseGPT algorithm to prune Llama-2-7b to 50% sparsity with a 2:4 mask. This optimization reduces the number of parameters by 50%, reducing the disk size and FLOPs by the same level.

Evaluation

This model was evaluated in the AlpacaEval benchmark using Llama-2-70b-chat as evaluator.

Accuracy

Model	Win rate	Recovery
Llama-2-7b	3.7%	--
Llama-2-7b-ultrachat200k	57.6%	--
SparseLlama-2-7b-ultrachat_200k-pruned_50.2of4	64.9%	113%