The model is not optimized in terms of inference

#7
by Imran1 - opened

I loaded the model for inference on 2 H100 GPUs, but generation is very slow even with Flash Attention.
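For context, here is a minimal sketch of the kind of setup being described (the poster's actual code was never shared; the checkpoint ID, dtype, and generation settings below are assumptions):

```python
# Hypothetical reconstruction of the setup described above, not the
# poster's actual code. Requires: transformers, accelerate, flash-attn.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # half precision to fit across 2 GPUs
    device_map="auto",                        # shard layers over both H100s (via accelerate)
    attn_implementation="flash_attention_2",  # Flash Attention 2 kernels
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

One likely source of the slowness: `device_map="auto"` splits the layers sequentially across the two GPUs, so only one GPU is active at a time during generation. Dedicated serving stacks such as vLLM or Text Generation Inference, which support tensor parallelism, are usually what "optimized for inference" means in practice.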

Can you share your code?
