Update README.md
---
license: llama3.1
---

Amazingly quick to run inference on in INT8 on Ampere GPUs like the 3090 Ti. In vLLM I left it on a task for 10 minutes with prompt caching enabled; the average fixed input was around 2,000 tokens, the variable input around 200, and the output around 200.

Averaged over a second, that's 22.5k t/s prompt processing and 1.5k t/s generation.

Averaged over an hour, that's 81M input tokens and 5.5M output tokens. The peak generation speed I see is around 2.6k to 2.8k t/s.
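The hourly totals follow directly from the per-second rates. A quick sanity check of the arithmetic (no assumptions beyond the rates quoted above):

```python
# Throughput figures quoted above, in tokens per second.
prompt_tps = 22_500  # prompt processing
gen_tps = 1_500      # generation, averaged over a second

seconds_per_hour = 3600

# Scale the per-second rates to an hour.
input_tokens_per_hour = prompt_tps * seconds_per_hour
output_tokens_per_hour = gen_tps * seconds_per_hour

print(f"{input_tokens_per_hour / 1e6:.1f}M input tokens/hour")   # prints "81.0M input tokens/hour"
print(f"{output_tokens_per_hour / 1e6:.1f}M output tokens/hour") # prints "5.4M output tokens/hour"
```

The scaled generation number comes out at 5.4M rather than the quoted 5.5M, which is consistent with the 1.5k t/s figure being rounded down from the true average rate.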