adamo1139
/

Hermes-3-Llama-3.1-8B_W8A8

8-bit precision

compressed-tensors


adamo1139 committed 23 days ago

Commit

64bbe57

1 Parent(s): 2056caa

Update README.md

Files changed (1)

README.md +7 -3


@@ -1,3 +1,7 @@
----
-license: llama3.1
----

+---
+license: llama3.1
+---
+Amazingly quick at INT8 inference on Ampere GPUs like the 3090 Ti. In vLLM I left it running a task for 10 minutes with prompt caching: average fixed input around 2000 tokens, variable input around 200 tokens, and output around 200 tokens.
+Averaged per second, that's 22.5k t/s prompt processing and 1.5k t/s generation.
+Extrapolated to an hour, that's 81M input tokens and 5.5M output tokens.
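
The hourly figures in the README follow from the per-second rates by multiplying by 3600. A minimal sketch of that arithmetic is below; the function name `tokens_per_hour` is hypothetical, and the inputs are the rounded rates quoted above, so the generation total comes out slightly under the README's 5.5M (which would correspond to roughly 1.53k t/s before rounding).

```python
SECONDS_PER_HOUR = 3600

def tokens_per_hour(tokens_per_second):
    """Scale a sustained per-second token rate to an hourly total."""
    return tokens_per_second * SECONDS_PER_HOUR

# Rounded rates quoted in the README (tokens per second).
prompt_rate = 22_500
generation_rate = 1_500

prompt_per_hour = tokens_per_hour(prompt_rate)
generation_per_hour = tokens_per_hour(generation_rate)

print(f"prompt:     {prompt_per_hour / 1e6:.1f}M tokens/hour")
print(f"generation: {generation_per_hour / 1e6:.1f}M tokens/hour")
```

This assumes the rate measured over the 10-minute run holds steady for a full hour, which the README does not claim to have verified.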