Inconsistency in effective batch size reporting

#1
by bjoernp - opened

In the model card you state that it was trained with a world size of 1024 and a micro batch size of 1, but in the training hyperparameters section you write an effective batch size of 4M tokens (2048x2048) instead of (1024x2048). Maybe there was a data entry error somewhere.

LumiOpen org

The effective batch size we're referring to is just the product of the global batch size and the sequence length, or in this case 2048 * 2048 = 4,194,304 tokens. We're running sequence and tensor parallelism with gradient accumulation.

Oh, I see, I must have missed the mention of gradient accumulation. Thanks for clarifying! It might be helpful to include this in the table (gradient accumulation steps = 2).

bjoernp changed discussion status to closed
LumiOpen org

I added a note to the training section about using GAS=16. Thanks for the feedback!
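
For reference, here is a minimal sketch of how the figures in this thread fit together. The model-parallel degree is not stated anywhere above; it is inferred here only as whatever factor makes the numbers consistent, so treat it as an assumption.

```python
# Reconciling the batch-size figures mentioned in this discussion.
world_size = 1024          # total GPUs (model card)
micro_batch_size = 1       # per-GPU micro batch (model card)
grad_accum_steps = 16      # GAS noted above
seq_len = 2048             # sequence length (model card)
global_batch_size = 2048   # sequences per optimizer step (model card)

# Data-parallel size implied by the global batch size.
dp_size = global_batch_size // (micro_batch_size * grad_accum_steps)  # 128

# Remaining factor of the world size would be tensor/sequence parallelism.
# This value is an inference, not something stated in the thread.
model_parallel_degree = world_size // dp_size                         # 8

# Effective batch size in tokens, as described above.
effective_tokens = global_batch_size * seq_len                        # 4,194,304

print(dp_size, model_parallel_degree, effective_tokens)
```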
