arxiv:2406.16793

Adam-mini: Use Fewer Learning Rates To Gain More

Published on Jun 24 · Submitted by yushun0410 on Jun 27
#1 Paper of the day

Abstract

We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., 1/√v). We find that ≥ 90% of these learning rates in v could be harmlessly removed if we (1) carefully partition the parameters into blocks following our proposed principle on Hessian structure; (2) assign a single but good learning rate to each parameter block. We further find that, for each of these parameter blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out. We then provide one cost-effective way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 125M to 7B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs and CPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on 2× A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
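
To make the mechanism concrete, below is a minimal PyTorch-style sketch of the update for a single parameter block, written for this page rather than taken from the paper's code: the first moment is kept per coordinate exactly as in Adam, while the second moment is reduced to one scalar per block by averaging the squared gradients over the block before the EMA. All names here (adam_mini_block_step, state) are illustrative.

```python
import torch

def adam_mini_block_step(param, grad, state, lr=1e-3,
                         beta1=0.9, beta2=0.999, eps=1e-8):
    """Illustrative update for one parameter block in the spirit of Adam-mini.

    `param` and `grad` are plain tensors for a single block; `state` holds a
    per-coordinate first moment `m` (as in Adam), a *scalar* second moment `v`
    shared by the whole block, and a `step` counter.
    """
    state["step"] += 1
    t = state["step"]

    # First moment: identical to Adam, one entry per coordinate.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad

    # Second moment: average the squared gradients over the block, then keep
    # an EMA of that single scalar -- this is where the memory is saved.
    block_mean_sq = grad.pow(2).mean()
    state["v"] = beta2 * state["v"] + (1 - beta2) * block_mean_sq

    # Bias-corrected update with one shared adaptive step size for the block.
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    return param - lr * m_hat / (v_hat.sqrt() + eps)

# Example: treat one weight matrix as a single block.
w = torch.randn(256, 256)
g = torch.randn(256, 256)  # stand-in for the gradient of w
state = {"m": torch.zeros_like(w), "v": torch.tensor(0.0), "step": 0}
w = adam_mini_block_step(w, g, state)
```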

Community

Paper author Paper submitter

We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini also achieves 49.5% higher throughput than AdamW on Llama2-7B pre-training. The design of Adam-mini is inspired by certain Hessian structures we observed on Transformers. Code available at: https://github.com/zyushun/Adam-mini

[Attached figure: figure1.png]
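
For anyone who wants to plug it into an existing PyTorch training loop, here is a rough usage sketch. The import path and constructor arguments (Adam_mini, named_parameters, dim, n_heads) are assumptions made for this example; please check the repository README above for the exact, up-to-date API.

```python
import torch
import torch.nn as nn
from adam_mini import Adam_mini  # assumed import path; see the repo README

# A toy one-layer Transformer just to have named parameters to optimize.
model = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)

# The optimizer takes named parameters (not just .parameters()) so that it can
# decide how to partition each weight into blocks; argument names are assumed.
optimizer = Adam_mini(
    named_parameters=model.named_parameters(),
    lr=1e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
    dim=256,      # hidden size, assumed to be used for head-wise partitioning
    n_heads=4,    # number of attention heads
)

x = torch.randn(8, 16, 256)        # (batch, seq, hidden) dummy input
for _ in range(10):
    loss = model(x).pow(2).mean()  # dummy loss just to drive updates
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```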

So, if I understand correctly, Adam-mini is essentially just Adam, but with a block-wise mean thrown in between the "squaring the gradients to get the second moments" step and the "computing the EMA of the second moments" step? So you're only storing the EMA of the block-wise means of the second moments?

If so, congratulations on discovering such an effective yet elegant/simple improvement on an already elegant algorithm! Remarkable work.

·
Paper author
edited Jun 28

Yes, you are right! Thanks for your kind comments!
For completeness, we remark that the partition of blocks cannot be done arbitrarily: a bad partition will oversimplify the problem and cause training instability (Figures 6 and 7). We then propose a partition principle related to the Hessian structure. This principle works well on various tasks, including 7B model training.
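
To illustrate why the partition matters (this is a sketch in the spirit of the discussion above, not the paper's exact recipe): for a Query or Key projection, treating the whole weight matrix as one block averages second-moment information across all heads, whereas a head-wise partition keeps one scalar per head. Assuming a [hidden, hidden] weight whose rows are grouped by head:

```python
import torch

def per_head_second_moments(q_grad: torch.Tensor, n_heads: int) -> torch.Tensor:
    """Mean of squared gradients per attention head of a Query/Key projection.

    Assumes `q_grad` has shape [hidden, hidden] with rows grouped by head.
    Returns one scalar per head instead of one scalar for the whole matrix.
    """
    hidden = q_grad.shape[0]
    head_dim = hidden // n_heads
    per_head = q_grad.view(n_heads, head_dim, hidden)  # one slab of rows per head
    return per_head.pow(2).mean(dim=(1, 2))            # shape [n_heads]

g = torch.randn(256, 256)                    # stand-in gradient of a Query weight
v_per_head = per_head_second_moments(g, 4)   # finer partition: 4 scalars
v_whole = g.pow(2).mean()                    # coarser partition: 1 scalar
```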
