Can you explain the purpose of merged_all.json?

by nlpguy - opened Apr 9

Apr 9

•

To me the axolotl config already looks like it includes all relevant data sources. After looking at previous Einstein models I suspect that the merged_all.json still contains data from those, in addition to being merged with all other datasets. But Is it still relevant? Wouldn't it be more efficient to exclude it from the training process?

Weyaxi

Owner Apr 9

•

edited Apr 9

merged_all.json is merged data of many alpaca format datasets. The other datasets in the data folder is mainly in sharegpt format. So merged_all.json doesn't contain any of the other data that's in the data folder.

nlpguy

Apr 9

Oh ok. Thanks for the info. Does it simply contain all the other datasets mentioned in the README datasets list but not the axolotl config?

Weyaxi

Owner Apr 9

Yes, you got it right!

Note that I filtered some of them :)

nlpguy

Apr 9

Cool, Thanks for the info and thank you for this new version of Einstein :)

nlpguy changed discussion status to closed Apr 9

Weyaxi

Owner Apr 9

@nlpguy , if you are more interested in the datasets I use, you can have a look at this link:

https://huggingface.co/datasets/Weyaxi/sci-datasets/tree/main

It may be slightly outdated for 1-2 datasets, but that's the main repository I use.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment