would be interesting to see how this performs on bigger models

#10 opened by snapo

Did you maybe run tests on the newly released models, such as:
Llama 3.1 405B Instruct (https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct)
Llama 3.1 70B Instruct (https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct)
Mistral Large 123B Instruct (https://huggingface.co/mistralai/Mistral-Large-Instruct-2407)

You were able to get nearly a 50% reduction. With such a reduction, the Mistral 123B model, for example, would be down to roughly 60B parameters, and with q4_0 quantization afterwards it should be possible to run it on two 24GB GPUs.
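As a rough back-of-the-envelope check of that claim (my own assumptions, not measured numbers: ~50% of parameters removed, q4_0 at roughly 4.5 bits per weight, weights only, no KV cache or activations):

```python
# Rough weight-only VRAM estimate after ~50% pruning + q4_0 quantization.
# Assumptions (mine, not from the paper): pruning removes ~50% of parameters,
# q4_0 costs ~4.5 bits per weight (4-bit values plus a scale per 32-weight block).

def estimated_weight_gb(params_billion: float,
                        pruned_fraction: float = 0.5,
                        bits_per_weight: float = 4.5) -> float:
    """Approximate size of the remaining quantized weights in GB."""
    remaining_params = params_billion * 1e9 * (1.0 - pruned_fraction)
    return remaining_params * bits_per_weight / 8 / 1e9

if __name__ == "__main__":
    for name, size_b in [("Mistral-Large 123B", 123),
                         ("Llama-3.1 70B", 70),
                         ("Llama-3.1 405B", 405)]:
        print(f"{name}: ~{estimated_weight_gb(size_b):.0f} GB of weights")
```

By that estimate the pruned 123B model would need on the order of 35 GB for weights, which leaves some headroom for KV cache within 2x24GB.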

Did you try other models / different sizes to see the maximum number of layers that can be removed?
