Update README.md
README.md
Amazingly quick to run inference on Ampere GPUs like the 3090 Ti in INT8. In VLLM I left i
Averaged over a second, that's 22.5k t/s prompt processing and 1.5k t/s generation.

Averaged over an hour, that's 81M input tokens and 5.5M output tokens. Peak generation speed I see is around 2.6k/2.8k t/s.
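For reference, the hourly totals follow directly from the per-second rates; a quick sanity check (the quoted 5.5M output tokens is a touch above a flat 1.5k t/s, consistent with generation sometimes peaking higher):

```python
# Sanity check of the throughput arithmetic, using the rates quoted above.
prompt_tps = 22_500   # prompt-processing tokens per second
gen_tps = 1_500       # generation tokens per second
seconds_per_hour = 3_600

input_per_hour = prompt_tps * seconds_per_hour   # 81,000,000 -> matches the 81M figure
output_per_hour = gen_tps * seconds_per_hour     # 5,400,000  -> ~5.4M; the text rounds to 5.5M

print(f"{input_per_hour:,} input t/h, {output_per_hour:,} output t/h")
```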
Quantized on H100. On 3090 Ti I was OOMing.

Creation script:

```python