Pre-training script

#1 opened by philschmid

Hey @benjamin ,

Awesome work on this! Super nice to see investments in languages other than English. I checked out the paper and repository.
I couldn't find more information about the pre-training you did after initializing the embeddings with WECHSEL.
Could you share what training dataset you used and the scripts?

Hi, thanks for your interest!

Sure, the training data was a subset of the first 4GB of the OSCAR corpus for each respective language. There are some more details here: https://github.com/CPJKU/wechsel/issues/4. If I were to repeat training now, I'd recommend using a larger subset and mC4 instead.
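
In case it's useful, here is a minimal sketch of how such a subset could be materialized with the `datasets` library. This is illustrative only, not our exact setup: the config name, output path, and 4GB cutoff are placeholders.

```python
from datasets import load_dataset

# Stream the German portion of OSCAR and keep roughly the first 4GB of raw text.
# (Illustrative: config name, cutoff, and output file are placeholders.)
oscar = load_dataset("oscar", "unshuffled_deduplicated_de", split="train", streaming=True)

max_bytes = 4 * 1024**3
written = 0
with open("train.txt", "w", encoding="utf-8") as f:
    for example in oscar:
        line = example["text"].replace("\n", " ")
        f.write(line + "\n")
        written += len(line.encode("utf-8"))
        if written >= max_bytes:
            break

# A larger subset from mC4 could be collected the same way, e.g.:
# mc4 = load_dataset("mc4", "de", split="train", streaming=True)
```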

The training scripts are just HF's run_clm.py and run_mlm.py with the hyperparameters specified in the paper. There are also some training logs here: https://wandb.ai/llms-transfer-learning/main/runs/3300nggh. Hope that helps!
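
And in case it saves someone some digging, the overall flow is: initialize the model with WECHSEL (roughly as in the example in the repository README), save that checkpoint, and then point run_mlm.py (or run_clm.py) at it. Below is a sketch with German as the example language and placeholder hyperparameters; the actual values are in the paper.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForMaskedLM, AutoTokenizer
from wechsel import WECHSEL, load_embeddings

# 1) Initialize a target-language model with WECHSEL (German as an example,
#    roughly following the example in the WECHSEL repository README).
source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

raw = load_dataset("oscar", "unshuffled_deduplicated_de", split="train", streaming=True)
target_tokenizer = source_tokenizer.train_new_from_iterator(
    (example["text"] for example in raw.take(1_000_000)),
    vocab_size=len(source_tokenizer),
)

wechsel = WECHSEL(
    load_embeddings("en"),
    load_embeddings("de"),
    bilingual_dictionary="german",
)
target_embeddings, info = wechsel.apply(
    source_tokenizer,
    target_tokenizer,
    model.get_input_embeddings().weight.detach().numpy(),
)
model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)

# 2) Save the WECHSEL-initialized checkpoint so run_mlm.py can pick it up.
model.save_pretrained("wechsel-init-de")
target_tokenizer.save_pretrained("wechsel-init-de")

# 3) Continue with the stock script, roughly (placeholder hyperparameters,
#    see the paper for the values actually used):
#    python run_mlm.py \
#      --model_name_or_path wechsel-init-de \
#      --train_file train.txt \
#      --do_train \
#      --per_device_train_batch_size ... \
#      --learning_rate ... \
#      --max_steps ... \
#      --output_dir roberta-base-wechsel-german
```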

Thank you for sharing @benjamin ! Super helpful!

> If I were to repeat training now, I'd recommend using a larger subset and mC4 instead.

Would you only use a bigger dataset with the same number of steps, or also more steps?
Have you run any experiments to see whether performance improves or decreases with a bigger dataset and fewer iterations over that dataset?

> Would you only use a bigger dataset with the same number of steps, or also more steps?

For this model size, I believe the number of steps we used is sufficient, although models trained from scratch at this size usually train for longer (Table 1 in the paper). Unfortunately, we didn't have a chance to check scaling laws because we didn't have the compute. If you go larger, you should probably increase the number of steps, although maybe by less than e.g. Chinchilla would suggest, since there is some transfer from the source model.
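
For a rough sense of scale, here is a back-of-the-envelope using the often-quoted ~20 training tokens per parameter from the Chinchilla paper. The numbers are purely illustrative, not something we ran, and transfer from the source model should reduce the budget actually needed.

```python
# Back-of-the-envelope: compute-optimal token budget per the Chinchilla
# rule of thumb (~20 training tokens per model parameter). Illustrative only.
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return n_params * tokens_per_param

for n_params in (125e6, 355e6, 1.3e9):
    tokens = chinchilla_optimal_tokens(n_params)
    print(f"{n_params / 1e6:>6.0f}M params -> ~{tokens / 1e9:.1f}B tokens compute-optimal")
```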

> Have you run any experiments to see whether performance improves or decreases with a bigger dataset and fewer iterations over that dataset?

For encoder models like the German RoBERTa, it will probably not make a huge difference, since the exact same training sample is never seen twice anyway due to the stochastic masking, except maybe for knowledge-intensive tasks like NER. The CamemBERT paper has some interesting insights on that in Section 6.2.
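
To make the stochastic masking point concrete, here is a small sketch with the standard dynamic masking collator in transformers (nothing specific to our setup): the same sentence typically gets a different mask every time it is drawn, so a second epoch does not literally repeat the first.

```python
import torch
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("The quick brown fox jumps over the lazy dog.")
features = [{"input_ids": encoding["input_ids"]}]

# Two passes over the same example usually produce different masked versions.
torch.manual_seed(0)
print(tokenizer.decode(collator(features)["input_ids"][0]))
print(tokenizer.decode(collator(features)["input_ids"][0]))
```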

For decoder models, I think it makes more of a difference since there you do see the exact same sample more than once if you have > 1 epoch.

I don't have any experiments to back this up though.

I remembered that we actually do have somewhat related experiments in Appendix F of the paper. That was mainly to remove the confounding effect of different training dataset sizes on the performance improvement from WECHSEL, but it does also give some insight into how dataset size affects performance, at least for decoder models.

Thank you for the comments! Super helpful! 🤗

You're welcome! Just out of curiosity, are you planning to train anything specific with WECHSEL?
