would be interesting to see how this performs on bigger models
#10 opened by snapo
Did you maybe run tests on the newly released models, like:
Llama 3.1 405B Instruct (https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct)
Llama 3.1 70B Instruct (https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct)
Mistral Large 123B Instruct (https://huggingface.co/mistralai/Mistral-Large-Instruct-2407)
You were able to get nearly a 50% reduction. With such a reduction the Mistral 123B model, for example, would be only ~60B parameters, and with q4_0 quantization afterwards it would be possible to run it on two 24GB GPUs.
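Here is a rough back-of-envelope sketch of how I arrived at that (my own estimate, not from the paper; it assumes q4_0 at roughly 4.5 bits per weight and a flat 50% parameter reduction, and it ignores KV cache and runtime overhead):

```python
# Rough VRAM estimate for a layer-pruned, q4_0-quantized model.
# Assumptions: ~4.5 bits/weight for q4_0, 50% parameter reduction,
# no KV cache or activation overhead counted.

def pruned_q4_0_size_gb(params_billions: float,
                        layer_reduction: float = 0.5,
                        bits_per_weight: float = 4.5) -> float:
    """Estimate the weight footprint in GB after pruning + q4_0 quantization."""
    remaining_params = params_billions * 1e9 * (1.0 - layer_reduction)
    return remaining_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

for name, size_b in [("Mistral Large 123B", 123),
                     ("Llama 3.1 70B", 70),
                     ("Llama 3.1 405B", 405)]:
    print(f"{name}: ~{pruned_q4_0_size_gb(size_b):.0f} GB after ~50% pruning + q4_0")
```

By that estimate the pruned 123B model lands around ~35 GB of weights, which is why two 24GB cards (48 GB total) look feasible.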
Did you try other models / different sizes to see the maximum amount of layers that can be removed?