GradCache implementation?
Hi @MrLight ,
Your implementation in the Tevatron library only uses gradient accumulation; GradCache isn't supported. Is gradient accumulation good enough to enable large batch sizes instead of GradCache? Thanks.
Hi,
GradCache is not used in the original implementation, as the current GradCache does not support DeepSpeed yet.
Gradient accumulation would be good enough here.
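For reference, a minimal sketch of what plain gradient accumulation looks like in PyTorch; the `model`, `loss_fn`, `dataloader`, and `accumulation_steps` names are illustrative placeholders, not Tevatron's actual training loop:

```python
import torch

# Hypothetical placeholders: `model`, `loss_fn`, and `dataloader`
# stand in for whatever reranker, loss, and data pipeline you use.
accumulation_steps = 8  # effective batch = per-step batch * 8

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
optimizer.zero_grad()

for step, batch in enumerate(dataloader):
    loss = loss_fn(model(**batch), batch["labels"])
    # Scale the loss so the accumulated gradient matches one large batch.
    (loss / accumulation_steps).backward()

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

When the loss decomposes over examples (as it does for per-group reranker training), accumulating the scaled gradients over several small batches gives essentially the same update as one large batch, which is why it is a reasonable substitute when GradCache is unavailable.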
Xueguang
Thank you @MrLight. Btw, in the forward function of the RankLlama model, the target is set to zero:
Shouldn't this method accept a labels parameter, which would then be used to calculate the loss? As far as I can see, there isn't a way to signal "positive" vs. "negative" pairs to the model. Am I missing something?
Hi @serialcoder ,
`ranker_logits.view(self.train_batch_size, -1)`
The reranker logits are reshaped so that the first score in each group belongs to the positive pair,
so the target is set to index 0 for each group.
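For anyone reading later, here is a small self-contained sketch of that loss computation; the batch size, group size, and variable names are illustrative, not the exact Tevatron code:

```python
import torch
import torch.nn as nn

train_batch_size = 4   # number of groups per step (illustrative)
group_size = 8         # 1 positive + 7 negatives per group (illustrative)

# Flat per-pair scores, ordered so the positive pair comes first in each group.
ranker_logits = torch.randn(train_batch_size * group_size, 1)

# Reshape to (num_groups, group_size); column 0 holds the positive pair's score.
grouped_logits = ranker_logits.view(train_batch_size, -1)

# Target 0 for every group: cross-entropy pushes the positive's score
# above the scores of the in-group negatives.
target = torch.zeros(train_batch_size, dtype=torch.long)
loss = nn.CrossEntropyLoss()(grouped_logits, target)
print(loss)
```

Because the data is arranged with the positive first in every group, the constant target of 0 already encodes which pair is positive, so no explicit labels argument is needed.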