|
--- |
|
license: cc-by-nc-4.0 |
|
tags: |
|
- exllamav2 |
|
- exl2 |
|
- Text Generation |
|
- not-for-all-audiences |
|
- nsfw |
|
- Transformers |
|
- llama |
|
- text-generation-inference |
|
--- |
|
|
|
# Amethyst 13B Mistral - EXL2 - 8bpw, hb8 |
|
- Model creator: [Undi](https://huggingface.co/Undi95) |
|
- Original model: [Amethyst 13B Mistral](https://huggingface.co/Undi95/Amethyst-13B-Mistral) |
|
|
|
## Description |
|
- 8 bits per weight. |
|
- 8 bits "for the lm_head (output) layer of the model," instead of the typical 6. |
|
- Runs fine on 24 GB of VRAM without flash-attention v2 under Windows.
|
- In my testing, it runs at about 64% of the speed of the 4-bit GPTQ version.
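
For reference, a quick way to try an EXL2 quant locally is the chat example that ships with the exllamav2 repo. The sketch below is an assumption on my part, not part of the original card: the local paths are placeholders, and the script name and flags should be checked against the repo linked below.

```sh
# A minimal sketch for trying the quant with the exllamav2 chat example.
# Paths are placeholders; verify the script's flags in the repo.
cd exllamav2
python examples/chat.py \
    -m /models/Amethyst-13B-Mistral-exl2-8bpw-hb8 \
    -mode llama
```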
|
|
|
I converted the model using the convert.py script from the exllamav2 repo: |
|
https://github.com/turboderp/exllamav2 |
|
The script's documentation:
|
https://github.com/turboderp/exllamav2/blob/master/doc/convert.md |
|
|
|
Measuring the model took 51 minutes; converting it took 18 minutes.
|
|
|
I used the WikiText-2-v1 dataset for calibration: |
|
https://huggingface.co/datasets/wikitext/blob/refs%2Fconvert%2Fparquet/wikitext-2-v1/test/0000.parquet |
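
For anyone who wants to reproduce the quant, the invocation looks roughly like the sketch below. The paths are placeholders of mine, and the flags should be double-checked against the convert.md documentation linked above; `-b` sets the bits per weight and `-hb` the lm_head bits described at the top of this card.

```sh
# Sketch of the conversion run; all paths are placeholders.
# -i  : directory containing the original FP16 model
# -o  : working directory used during measurement/quantization
# -cf : output directory for the finished quant
# -c  : calibration dataset (the WikiText-2 parquet linked above)
# -b  : target bits per weight
# -hb : bits for the lm_head layer (the default is 6)
python convert.py \
    -i /models/Amethyst-13B-Mistral \
    -o /tmp/exl2-work \
    -cf /models/Amethyst-13B-Mistral-exl2-8bpw-hb8 \
    -c /data/wikitext-2-v1-test.parquet \
    -b 8.0 \
    -hb 8
```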