adamo1139 committed
Commit 957dd91
1 Parent(s): f485ece

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -1,7 +1,7 @@
 ---
 license: llama3.1
 ---
- Amazingly quick to inference on Ada GPUs like 3090 Ti. in INT8. In VLLM I left it on a task for 10 minutes with prompt caching, average fixed input around 2000, variable input around 200 and output around 200.
+ Amazingly quick to inference on Ampere GPUs like 3090 Ti. in INT8. In VLLM I left it on a task for 10 minutes with prompt caching, average fixed input around 2000, variable input around 200 and output around 200.
 Averaged over a second, that's 22.5k t/s prompt processing and 1.5k t/s generation.
 Averaged over an hour that's 81M input tokens and 5.5M output tokens. Peak generation speed I see is around 2.6k/2.8k t/s.
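
The vLLM run described in the README (INT8 weights, prompt caching, a roughly 2000-token fixed prefix plus a short variable tail) can be approximated with a minimal script like the sketch below. This is not the author's harness: the model path, prompt contents, and output length are placeholders.

```python
# Minimal vLLM sketch of the setup described above.
# Assumptions: the checkpoint path is a placeholder, and the prompt strings stand in
# for the ~2000-token fixed prefix and ~200-token variable input from the README.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/llama-3.1-int8",   # placeholder path to the INT8 checkpoint
    enable_prefix_caching=True,        # reuse the KV cache for the shared fixed prefix
)

sampling = SamplingParams(temperature=0.0, max_tokens=200)  # ~200-token outputs, as in the README

fixed_prefix = "<~2000 tokens of fixed instructions>"   # shared across requests, served from cache
variable_part = "<~200 tokens of per-request input>"

outputs = llm.generate([fixed_prefix + variable_part], sampling)
print(outputs[0].outputs[0].text)
```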
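The hourly totals follow directly from the per-second rates; a quick arithmetic check (plugging in the README's figures, not the author's measurement code):

```python
# Arithmetic check of the averaged figures quoted above.
prompt_tps = 22_500          # prompt-processing tokens per second
gen_tps = 1_500              # generation tokens per second
seconds_per_hour = 3_600

print(prompt_tps * seconds_per_hour)  # 81_000_000 -> the "81M input tokens" per hour
print(gen_tps * seconds_per_hour)     # 5_400_000  -> close to the quoted 5.5M output tokens;
                                      #   5.5M/h works out to ~1,528 t/s, so the 1.5k t/s
                                      #   generation figure is rounded down
```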