Can someone tell me which token position in the output I should read the prediction from? As far as I know, a Transformer produces an output of the same length as its input (block_size). With Qwen2-Audio, however, I think the first half of the tokens represents the audio? Any help would be appreciated. Thank you very much.
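
To make the question concrete, here is a toy sketch of how next-token selection usually works in a causal decoder, assuming the audio positions come first in the sequence and the text positions follow. The shapes, values, and the helper `next_token_from_logits` are hypothetical and are not Qwen2-Audio's actual API. It only illustrates that the output has one row of logits per input position, and that the row at the last input position is the one normally used to choose the first new token.

```python
# Toy illustration only: hypothetical shapes and values, not Qwen2-Audio's real API.

def next_token_from_logits(logits):
    """Pick the next token id from per-position logits.

    logits: list of length seq_len, each entry a list of vocab_size scores.
    In a causal decoder, row i scores the token that follows position i,
    so the row for the last input position scores the first new token.
    """
    last_row = logits[-1]
    # Greedy choice: index of the highest score in the last row.
    return max(range(len(last_row)), key=lambda tok: last_row[tok])


# Assumed layout: 3 audio positions followed by 2 text positions, vocab of 4.
num_audio, num_text, vocab_size = 3, 2, 4
logits = [
    [0.1, 0.2, 0.3, 0.4],  # audio position 0
    [0.4, 0.3, 0.2, 0.1],  # audio position 1
    [0.0, 0.9, 0.0, 0.1],  # audio position 2
    [0.2, 0.1, 0.6, 0.1],  # text position 0
    [0.1, 0.1, 0.1, 0.7],  # text position 1 (last input position)
]

# The output has the same length as the input sequence (audio + text).
assert len(logits) == num_audio + num_text

next_token = next_token_from_logits(logits)
print(next_token)  # prints 3
```

If this picture is right, the rows at the audio positions are simply not used for prediction, and generation continues by appending the chosen token and reading the last row again.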