mo137/Amethyst-13B-Mistral-8bpw-hb8-exl2

	---
	license: cc-by-nc-4.0
	tags:
	- exllamav2
	- exl2
	- Text Generation
	- not-for-all-audiences
	- nsfw
	- Transformers
	- llama
	- text-generation-inference
	---

	# Amethyst 13B Mistral - EXL2 - 8bpw, hb8
	- Model creator: [Undi](https://huggingface.co/Undi95)
	- Original model: [Amethyst 13B Mistral](https://huggingface.co/Undi95/Amethyst-13B-Mistral)

	## Description
	- 8 bits per weight.
	- 8 bits "for the lm_head (output) layer of the model," instead of the typical 6.
	- Works fine on 24 GB of VRAM under Windows, without flash attention v2.
	- On my machine it runs at about 64% of the speed of the 4-bit GPTQ version.
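As a rough sanity check on the 24 GB figure, the weight footprint can be estimated from the bits-per-weight value. This is only a back-of-the-envelope sketch: the parameter count of 13e9 is an assumption (a round number for a 13B model), and the estimate ignores the KV cache, activations, and any per-layer overhead.

```python
def estimate_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    total_bytes = n_params * bits_per_weight / 8  # bits -> bytes
    return total_bytes / 2**30                    # bytes -> GiB


# Assumption: roughly 13 billion parameters for a 13B model.
N_PARAMS = 13e9

for bpw in (4.0, 8.0):
    print(f"{bpw} bpw: ~{estimate_weight_gib(N_PARAMS, bpw):.1f} GiB of weights")
```

Under these assumptions the 8 bpw weights take about twice the space of a 4-bit quantization, which leaves the rest of the 24 GB for the cache and context.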

	I converted the model using the convert.py script from the exllamav2 repo:
	https://github.com/turboderp/exllamav2
	Its documentation:
	https://github.com/turboderp/exllamav2/blob/master/doc/convert.md
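For illustration, a conversion call along these lines could be assembled as below. This is a sketch, not the exact command used for this upload: the flag names (`-i`, `-o`, `-cf`, `-c`, `-b`, `-hb`) are as I recall them from convert.md and may differ between exllamav2 versions, so check the documentation linked above. All paths are hypothetical placeholders.

```python
import shlex


def build_convert_command(model_dir, work_dir, out_dir, cal_parquet,
                          bits=8.0, head_bits=8):
    """Build an argv list for exllamav2's convert.py (flag names assumed)."""
    return [
        "python", "convert.py",
        "-i", model_dir,       # input: directory with the original FP16 model
        "-o", work_dir,        # working directory for temporary files
        "-cf", out_dir,        # output directory for the finished model
        "-c", cal_parquet,     # calibration dataset in parquet format
        "-b", str(bits),       # target average bits per weight
        "-hb", str(head_bits), # bits for the lm_head (output) layer
    ]


# Hypothetical paths, for illustration only.
cmd = build_convert_command(
    "models/Amethyst-13B-Mistral",
    "work/amethyst",
    "models/Amethyst-13B-Mistral-8bpw-hb8-exl2",
    "data/wikitext-2-v1-test.parquet",
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell from the exllamav2 repo directory, or the list can be passed to `subprocess.run`.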

	Measuring the model took 51 minutes, and converting it took another 18 minutes.

	I used the WikiText-2-v1 dataset for calibration:
	https://huggingface.co/datasets/wikitext/blob/refs%2Fconvert%2Fparquet/wikitext-2-v1/test/0000.parquet