Model aya-expanse-8b inference is very slow
#16 · opened by blueqq1
I tried running the model with plain HF, HF + FlashAttention-2, and vLLM, but it was equally slow in every configuration. The prompt was "The future of AI is", with temperature=0.3, max_tokens=512, and dtype bfloat16. Generation took roughly 10 seconds on both an A40 and an A100 GPU.
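
For reference, here is a minimal sketch of the HF + FlashAttention-2 setup described above. The original post includes no code, so the model ID and the exact generation call are assumptions based on the parameters mentioned:

```python
# Minimal repro sketch (assumed setup; the post does not include code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/aya-expanse-8b"  # assumed model ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,             # dtype from the post
    attn_implementation="flash_attention_2",  # the HF+FA2 configuration
    device_map="auto",
)

inputs = tokenizer("The future of AI is", return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=512,  # "max_token=512" in the post
    temperature=0.3,
    do_sample=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

In vLLM, the equivalent settings would go through `SamplingParams(temperature=0.3, max_tokens=512)`. Note that with `max_new_tokens=512` and no stop condition, ~10 seconds for a full 512-token generation corresponds to roughly 50 tokens/s, so it may be worth checking how many tokens were actually produced before comparing configurations.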