OOM on RTX 3090 with vLLM
#1 by willowill5 · opened
Weird bug with this model or vLLM: the original HF model loads fine on 24 GB, but the AWQ version OOMs under vLLM.
```
python -m vllm.entrypoints.api_server --model TheBloke/zephyr-7B-alpha-AWQ --quantization awq --dtype float16
```
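For context (not confirmed as the fix here): vLLM pre-allocates a large fraction of GPU memory for the KV cache by default, so an OOM at startup can occur even when the quantized weights themselves fit. A sketch of a more conservative launch, assuming the real `--gpu-memory-utilization` and `--max-model-len` flags of `vllm.entrypoints.api_server`, with illustrative values:

```shell
# Sketch: cap the KV-cache context length and the fraction of GPU
# memory vLLM reserves. The values 4096 and 0.90 are assumptions
# for a 24 GB card, not settings confirmed in this thread.
python -m vllm.entrypoints.api_server \
    --model TheBloke/zephyr-7B-alpha-AWQ \
    --quantization awq \
    --dtype float16 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90
```

Whether this resolves the OOM on an RTX 3090 would need to be verified.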