When torch.nn.functional.scaled_dot_product_attention falls back to _scaled_dot_product_attention_math, the model raises an error
#3 opened by Quasimodo0808
If the SDPA call in visual.py::attention_fn_default() uses the math kernel, its output is contiguous in the (B, num_heads, L, head_dim) layout. The output is then transpose()d, which makes it non-contiguous, and view() is called on it: https://huggingface.co/THUDM/cogvlm2-video-llama3-chat/blob/main/visual.py#L78. That view() call will raise an error.
You can try changing output = self.dense(out.view(B, L, -1))
to output = self.dense(out.reshape(B, L, -1))
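For reference, a minimal sketch of the failure and the fix (shapes are illustrative, not the actual CogVLM2 configuration, and it assumes PyTorch 2.3+ for torch.nn.attention.sdpa_kernel):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Illustrative shapes only, not the real model dimensions.
B, H, L, D = 2, 8, 16, 64
q = k = v = torch.randn(B, H, L, D)

# Force the math kernel, mirroring the fallback described above.
with sdpa_kernel(SDPBackend.MATH):
    out = F.scaled_dot_product_attention(q, k, v)  # contiguous in (B, H, L, D)

out_t = out.transpose(1, 2)       # (B, L, H, D): the transpose makes it non-contiguous
print(out_t.is_contiguous())      # False

# out_t.view(B, L, -1)            # RuntimeError: view size is not compatible with input tensor's size and stride
merged = out_t.reshape(B, L, -1)  # reshape() falls back to a copy when a view is impossible, so it succeeds
print(merged.shape)               # torch.Size([2, 16, 512])
```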
Is view() used here because you were assuming SDPA's flash-attention kernel?