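This makes tensor parallelism impossible in practice, since the number of attention heads has to be evenly divisible by the number of GPUs. This design choice doesn't make any sense: 71 is a prime number, so the heads can't be split evenly across 2, 4, or 8 devices.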
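To make the problem concrete, here is a minimal sketch. It assumes the usual head-wise sharding scheme, where each device must hold a whole number of attention heads. The helper name `valid_tp_degrees` and the `max_degree` cap are made up for illustration and are not part of any library.

```python
def valid_tp_degrees(num_heads: int, max_degree: int = 8) -> list[int]:
    """Return the tensor-parallel degrees (device counts) up to max_degree
    that split num_heads into equal whole-number shards."""
    return [d for d in range(1, max_degree + 1) if num_heads % d == 0]


# 71 is prime, so only a single device gets an even split.
print(valid_tp_degrees(71))  # [1]

# A power-of-two head count splits cleanly across common GPU counts.
print(valid_tp_degrees(64))  # [1, 2, 4, 8]
```

With 71 heads the only even splits are 1 or 71 devices, neither of which is a useful tensor-parallel setup.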
Would be interested to know as well.