ctheodoris/Geneformer · MTL Classifier - Discrepancy between train and validation mappings

Good afternoon,

I would like to share a thought on how the task label mappings are currently created during training. As far as I can tell from the code, the mappings for the training and validation sets are created independently of each other in preload_and_process_data (mtl/data.py#L97), but in load_and_preprocess_data (mtl/data.py#L45) they are saved to the same file. If the mappings differ (for example, because some classes are missing from the validation set), this causes problems later in training and validation: the validation loss reported during hyperparameter tuning is wrong, and load_and_evaluate_test_model (mtl/eval_utils.py#L54) also fails if the test dataset has a different number of classes.
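
To make the failure mode concrete, here is a minimal sketch. It assumes each split's mapping is built by enumerating that split's own sorted unique labels; I have not confirmed that this is exactly what the code does. The helper build_mapping and the toy labels are hypothetical, not taken from Geneformer.

```python
def build_mapping(labels):
    """Map each distinct label to an integer index (hypothetical helper)."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


train_labels = ["B cell", "NK cell", "T cell", "T cell"]
val_labels = ["B cell", "T cell"]  # "NK cell" is absent from validation

train_mapping = build_mapping(train_labels)
val_mapping = build_mapping(val_labels)

print(train_mapping)  # {'B cell': 0, 'NK cell': 1, 'T cell': 2}
print(val_mapping)    # {'B cell': 0, 'T cell': 1}
```

Here "T cell" is index 2 in training but index 1 in validation, so whichever mapping is written to the shared file last decodes the other split incorrectly.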

Printing the mappings alone is not enough to surface this issue. I think one of the following should be implemented for clarity:

raise an error if the mappings do not agree, and do not proceed with training; or
create the mapping from the joint training, validation and test datasets, save it, and then load the same mapping for all three datasets.
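
Both options could look roughly like the sketch below. It is only an illustration under my own assumptions: the function names build_joint_mapping and check_labels_covered, the toy labels, and the use of JSON for storage are all hypothetical and do not reflect the current Geneformer API.

```python
import json


def build_joint_mapping(*label_lists):
    """Build one label-to-index mapping from the union of all splits."""
    joint = set()
    for labels in label_lists:
        joint.update(labels)
    return {label: idx for idx, label in enumerate(sorted(joint))}


def check_labels_covered(mapping, labels, split_name):
    """Raise instead of silently continuing when a split has unknown labels."""
    unknown = sorted(set(labels) - set(mapping))
    if unknown:
        raise ValueError(
            f"{split_name} contains labels missing from the mapping: {unknown}"
        )


# Toy labels standing in for one task's label column in each split.
train_labels = ["B cell", "NK cell", "T cell"]
val_labels = ["B cell", "T cell"]
test_labels = ["NK cell", "T cell"]

# Option 2: one shared mapping, serialized once and reused for every split.
mapping = build_joint_mapping(train_labels, val_labels, test_labels)
serialized = json.dumps(mapping)   # would be written to the mapping file
reloaded = json.loads(serialized)  # would be read back for each split

# Option 1: fail early if any split is not covered by the loaded mapping.
for name, labels in [("train", train_labels),
                     ("validation", val_labels),
                     ("test", test_labels)]:
    check_labels_covered(reloaded, labels, name)
```

With a shared mapping the number of output classes is fixed by the union of all splits, so a class missing from validation or test no longer shifts the indices.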

Best,
Milos