model aya-expanse-8b inference is very slow

#16
by blueqq1

I attempted to run the model with plain HF Transformers, HF + FlashAttention 2, and vLLM, but the speed was roughly the same in each case. The prompt was "The future of AI is", with temperature=0.3 and max_token=512. Generation took approximately 10 seconds on both an A40 and an A100 GPU.

The dtype is bfloat16.
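
For reference, here is a minimal timing sketch of the HF + FlashAttention 2 setup described above, assuming the model id is `CohereForAI/aya-expanse-8b` and that `max_token=512` in the post corresponds to `max_new_tokens=512`:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/aya-expanse-8b"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,                 # dtype from the post
    attn_implementation="flash_attention_2",    # HF + FA2 configuration
    device_map="auto",
)

inputs = tokenizer("The future of AI is", return_tensors="pt").to(model.device)

start = time.perf_counter()
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.3,
)
elapsed = time.perf_counter() - start

# Report tokens/second so runs on different GPUs are comparable
generated = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```

And the equivalent vLLM run, under the same assumptions:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="CohereForAI/aya-expanse-8b", dtype="bfloat16")
params = SamplingParams(temperature=0.3, max_tokens=512)

outputs = llm.generate(["The future of AI is"], params)
print(outputs[0].outputs[0].text)
```

Measuring tokens/second rather than wall-clock time alone would help confirm whether the ~10 s is actually slow for 512 generated tokens, or expected single-request decoding speed.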
