How to convert Llama-3 to ONNX?
I want to convert a variant of the Llama-3 model to ONNX. I tried this example but had no luck. How did you manage to convert Llama-3 successfully?
Can you tell me the hardware you’re using?
I'm using a machine with around 128 GB of RAM, without using a GPU. Is a GPU needed? I only have a small one.
What export format are you trying to produce? For example, which quantization and data format?
My end goal is to export as AWQ, but for now I'm trying float16 first to understand the process.
There are some more instructions you can try here.
https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/llama
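For reference, the exporter in that folder is usually run as a module, roughly like the sketch below. The model id, output directory, and exact flag spellings are assumptions based on that README, so check them against the current version of the script.

```python
# Minimal sketch of invoking the ONNX Runtime Llama exporter from the repo above.
# The model id, output directory, and flag spellings are assumptions based on the
# linked README; verify them against the current version of the script.
import subprocess

subprocess.run(
    [
        "python", "-m",
        "onnxruntime.transformers.models.llama.convert_to_onnx",
        "-m", "meta-llama/Meta-Llama-3-8B",  # hypothetical Hugging Face model id
        "--output", "llama3-8b-fp16",        # hypothetical output directory
        "--precision", "fp16",
        "--execution_provider", "cpu",       # matches your CPU-only machine
    ],
    check=True,
)
```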
Do tell me the exact issue you are facing, so the experts here can look into it.
I am not quite sure whether specific quantization methods like AWQ or GPTQ are available in ORT; I might be wrong.
Do look at the docs here: https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html
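If plain int8 is enough to get started, ORT's built-in dynamic quantization is simple to try. A minimal sketch, with placeholder paths (dynamic quantization is normally applied to an fp32 export):

```python
# Minimal sketch of ONNX Runtime's built-in dynamic int8 quantization.
# This is the generic quantizer from the docs above, not AWQ or GPTQ.
# Both paths below are placeholders.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="llama3-8b-fp32/model.onnx",   # placeholder: fp32 export
    model_output="llama3-8b-int8/model.onnx",  # placeholder: quantized output
    weight_type=QuantType.QInt8,               # quantize weights to signed int8
    use_external_data_format=True,             # needed for multi-GB models
)
```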
I tried the onnxruntime example to convert Llama-3 to fp16 using CPU only, but I got this error. Maybe GQA isn't supported on CPU?
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Node (/model/layers.0/self_attn/o_proj/MatMul) Op (MatMul) [ShapeInferenceError] Incompatible dimensions for matrix multiplication
There might be some bugs and issues; since Llama-3 is a newer model, use either the nightly or the latest builds of torch and onnxruntime.
This is an exporter bug, I guess.
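To narrow down where the shapes go wrong, you could run ORT's symbolic shape inference over the exported graph and inspect the inferred dimensions around that MatMul. A rough sketch, with a placeholder path (for a multi-GB model you may also need to save with external data):

```python
# Rough sketch: annotate the exported graph with inferred shapes using
# ONNX Runtime's symbolic shape inference, to localize the MatMul mismatch.
# The model path is a placeholder.
import onnx
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference

model = onnx.load("llama3-8b-fp16/model.onnx")  # placeholder path
inferred = SymbolicShapeInference.infer_shapes(model, auto_merge=True, verbose=1)
onnx.save(inferred, "llama3-8b-fp16/model_shapes.onnx")
```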
Group Query Attention is GPU-specific, as stated here:
https://github.com/microsoft/Olive/blob/main/examples/llama2/README.md
Group Query Attention might not bring significant performance benefits in CPU-specific workloads anyway.
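If you want to confirm whether your export actually contains the op, you can scan the graph for GroupQueryAttention nodes; a quick sketch (the path is a placeholder):

```python
# Quick check: does the exported graph contain the GroupQueryAttention contrib op?
# If it does, a CPU-only session may fail, per the Olive README linked above.
# The model path is a placeholder.
import onnx

model = onnx.load("llama3-8b-fp16/model.onnx")  # placeholder path
gqa = [n.name for n in model.graph.node if n.op_type == "GroupQueryAttention"]
print(f"Found {len(gqa)} GroupQueryAttention node(s)")
```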
I see. Thanks for your help!