Any plans to use MQA (multi-query attention) or GQA (grouped-query attention) in the future?

#9
by graefics - opened

This model uses MHA (multi-head attention, i.e. num_attention_heads == num_key_value_heads). This is unlike Llama, which uses GQA.
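A quick way to verify this is to read the two head counts off the model config. A minimal sketch (the model id below is a placeholder, not this repo):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("your-org/your-model")  # hypothetical model id
print("attention heads:", config.num_attention_heads)
print("KV heads:       ", getattr(config, "num_key_value_heads", config.num_attention_heads))

# MHA: num_key_value_heads == num_attention_heads
# GQA: 1 < num_key_value_heads < num_attention_heads
# MQA: num_key_value_heads == 1
```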

The problem with MHA is that the KV-caches are very big (because KV-cache size is proportional to num_key_value_heads).
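For a rough sense of the difference, here is a back-of-the-envelope calculation with illustrative (assumed) Llama-7B-like numbers in fp16, not the exact figures for this model:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_elem=2):
    # 2 = one K tensor + one V tensor per layer
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8,  head_dim=128, seq_len=4096)
print(f"MHA (32 KV heads): {mha / 2**30:.2f} GiB")  # 2.00 GiB
print(f"GQA ( 8 KV heads): {gqa / 2**30:.2f} GiB")  # 0.50 GiB
```

With everything else fixed, shrinking num_key_value_heads shrinks the KV cache by the same factor, which is the whole appeal of GQA/MQA for long contexts and large batch sizes.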

Therefore, do you have any plans to use MQA or GQA for future model releases? Thanks!
