"The Case for Co-Designing Model Architectures with Hardware"
This is a long-overdue paper that we started discussing back when we were training BLOOM-176B.
Basically, this paper tells you how to choose your model's dimensions for optimal training throughput.
Fantastic!
Yours truly contributed the SwiGLU section ;)
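To give a feel for the kind of sizing this is about, here is a minimal sketch. It assumes, as an illustration rather than a rule quoted from the paper, that dimensions get padded up to a multiple such as 64, and that a SwiGLU MLP starts from the commonly used 8/3 * hidden_size intermediate width. The helper names `round_up` and `swiglu_ffn_dim` are hypothetical, and the right multiple depends on your GPU and dtype, so check the paper for the actual recommendations.

```python
def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def swiglu_ffn_dim(hidden_size: int, multiple: int = 64) -> int:
    """Pick a SwiGLU intermediate width near 8/3 * hidden_size.

    The 8/3 factor and the default multiple of 64 are assumptions made
    for this sketch, not values taken from the paper.
    """
    raw = 8 * hidden_size // 3  # 8/3 * h is rarely a "round" number
    return round_up(raw, multiple)


if __name__ == "__main__":
    print(swiglu_ffn_dim(4096, multiple=256))  # intermediate width for h=4096
    print(round_up(50257, 64))                 # padded vocabulary size
```

For example, with a multiple of 256, a hidden size of 4096 gives an intermediate width of 11008, and a 50257-token vocabulary pads to 50304 with a multiple of 64.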
https://twitter.com/QuentinAnthon15/status/1752393989813375119
https://arxiv.org/abs/2401.14489