I have just published my first blog post.
While FlashAttention is already integrated into HuggingFace Transformers, there are much larger gains to be had (at least in theory) when finetuning models on batches of examples with variable sequence lengths, since compute is otherwise wasted on padding tokens.
For a deeper dive, please check out my blog post at https://huggingface.co/blog/mayank-mishra/padding-free-transformer.
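
As a rough illustration of the idea the blog covers (this sketch is mine, not taken from the post): instead of padding every example in a batch to the longest sequence, variable-length examples can be concatenated into one flat token stream and described by cumulative sequence lengths, which FlashAttention's variable-length kernels accept in place of an attention mask. The example below only shows the packing step and how much compute padding would have wasted; the values and names like `seq_lens` are illustrative.

```python
import torch

# Token ids for a batch of variable-length examples (illustrative values).
examples = [
    torch.arange(7),    # length 7
    torch.arange(3),    # length 3
    torch.arange(12),   # length 12
]
seq_lens = torch.tensor([len(x) for x in examples], dtype=torch.int32)

# Padded batching: every example is padded to the longest sequence.
padded_tokens = len(examples) * int(seq_lens.max())

# Padding-free batching: examples are concatenated into one flat stream.
packed = torch.cat(examples)      # shape: (total_tokens,)
total_tokens = packed.numel()

# Cumulative sequence lengths mark example boundaries in the flat stream;
# the varlen attention kernel uses these instead of a padded attention mask.
cu_seqlens = torch.zeros(len(examples) + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)
print(cu_seqlens.tolist())                        # [0, 7, 10, 22]

wasted = 1 - total_tokens / padded_tokens
print(f"tokens spent on padding: {wasted:.0%}")   # ~39% for this toy batch
```

In an actual training loop, the flattened hidden states and `cu_seqlens` are what get passed to FlashAttention's variable-length kernel (`flash_attn_varlen_func`) rather than a padded batch plus attention mask; see the blog post for the details of how this is wired into the model.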