How to convert Llama-3 to ONNX?
I want to convert a variant of the Llama-3 model to ONNX. I tried this example but had no luck. How did you manage to convert Llama-3 successfully?
Can you tell me the hardware you’re using?
I'm using a machine with around 128 GB of RAM, without using a GPU. Is a GPU needed? I only have a small one.
What export format are you trying to produce? For example, which quantization and data format?
My end goal is to export as AWQ, but for now I'm trying float16 first to understand the process.
There are some more instructions you can try here.
https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/llama
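For reference, the exporter in that folder is usually run as a module, roughly like the sketch below. The model id, output directory, and exact flag spellings are assumptions based on that README, so check them against the current version of the script.

```python
# Minimal sketch of invoking the ONNX Runtime Llama exporter from the repo above.
# The model id, output directory, and flag spellings are assumptions based on the
# linked README; verify them against the current version of the script.
import subprocess

subprocess.run(
    [
        "python", "-m",
        "onnxruntime.transformers.models.llama.convert_to_onnx",
        "-m", "meta-llama/Meta-Llama-3-8B",  # hypothetical Hugging Face model id
        "--output", "llama3-8b-fp16",        # hypothetical output directory
        "--precision", "fp16",
        "--execution_provider", "cpu",       # matches your CPU-only machine
    ],
    check=True,
)
```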
Do tell me the exact issue you are facing, so the experts here can look into it.
I am not quite sure whether specific quantization methods like AWQ or GPTQ are available in ORT; I might be wrong.
Do look at the docs here: https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html
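If plain int8 is enough to get started, ORT's built-in dynamic quantization is simple to try. A minimal sketch, with placeholder paths (dynamic quantization is normally applied to an fp32 export):

```python
# Minimal sketch of ONNX Runtime's built-in dynamic int8 quantization.
# This is the generic quantizer from the docs above, not AWQ or GPTQ.
# Both paths below are placeholders.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="llama3-8b-fp32/model.onnx",   # placeholder: fp32 export
    model_output="llama3-8b-int8/model.onnx",  # placeholder: quantized output
    weight_type=QuantType.QInt8,               # quantize weights to signed int8
    use_external_data_format=True,             # needed for multi-GB models
)
```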
I tried the onnxruntime example to convert Llama-3 to fp16 using CPU only, but I got this error. Maybe GQA isn't supported on CPU?
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Node (/model/layers.0/self_attn/o_proj/MatMul) Op (MatMul) [ShapeInferenceError] Incompatible dimensions for matrix multiplication
There might be some bugs and issues; since Llama-3 is a newer model, use either the nightly or the latest builds of torch and onnxruntime.
This is an exporter bug, I guess.
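To narrow down where the shapes go wrong, you could run ORT's symbolic shape inference over the exported graph and inspect the inferred dimensions around that MatMul. A rough sketch, with a placeholder path (for a multi-GB model you may also need to save with external data):

```python
# Rough sketch: annotate the exported graph with inferred shapes using
# ONNX Runtime's symbolic shape inference, to localize the MatMul mismatch.
# The model path is a placeholder.
import onnx
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference

model = onnx.load("llama3-8b-fp16/model.onnx")  # placeholder path
inferred = SymbolicShapeInference.infer_shapes(model, auto_merge=True, verbose=1)
onnx.save(inferred, "llama3-8b-fp16/model_shapes.onnx")
```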
Group Query Attention is GPU-specific, as stated here:
https://github.com/microsoft/Olive/blob/main/examples/llama2/README.md
Group Query Attention might not bring significant performance benefits in CPU-specific workloads anyway.
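If you want to confirm whether your export actually contains the op, you can scan the graph for GroupQueryAttention nodes; a quick sketch (the path is a placeholder):

```python
# Quick check: does the exported graph contain the GroupQueryAttention contrib op?
# If it does, a CPU-only session may fail, per the Olive README linked above.
# The model path is a placeholder.
import onnx

model = onnx.load("llama3-8b-fp16/model.onnx")  # placeholder path
gqa = [n.name for n in model.graph.node if n.op_type == "GroupQueryAttention"]
print(f"Found {len(gqa)} GroupQueryAttention node(s)")
```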
I see. Thanks for your help!