Consistency of perturbation analysis results

#433
by juginst - opened

Hi,

I used Geneformer fine-tuned on my data (2200 cells) for perturbation analysis (deletion of 1 gene). Originally, I used the 12L-2048 version before the 4096 versions were introduced. As a formal validation, I performed automated literature mining to see:
whether the predicted genes are indeed more frequently reported to be involved in the transcriptomic phenotype of the given cell type;
whether the predicted genes are indeed more often reported to be involved in the given disease.
The results were quite promising: genes predicted by Geneformer appeared in the relevant literature roughly 4 times more frequently than random genes expressed in the same cell type. The clear limitation was the context size, which I thought indirectly limited the number of predicted genes. So, when the new models with a 4096-token context size were introduced, I was very excited.
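For illustration, an enrichment comparison of this kind might look as follows (a minimal sketch; the file names and the literature-annotation source are hypothetical placeholders rather than my actual pipeline):

```python
# Rough sketch of a literature-enrichment check: are predicted genes more
# often annotated to the disease/phenotype than other expressed genes?
# File names and the annotation source are hypothetical placeholders.
from scipy.stats import fisher_exact

def read_genes(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

predicted  = read_genes("predicted_genes.txt")           # perturbation hits
background = read_genes("expressed_genes.txt")           # expressed in this cell type
annotated  = read_genes("literature_disease_genes.txt")  # literature-mined gene set

pred_hits = len(predicted & annotated)
other = background - predicted
table = [[pred_hits, len(predicted) - pred_hits],
         [len(other & annotated), len(other - annotated)]]
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"enrichment odds ratio = {odds_ratio:.2f}, p = {p_value:.2e}")
```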

With the new 12L-4096 model, I noticed that the clustering of embeddings was already somewhat meaningful without fine-tuning, for the same dataset (whereas embeddings from the 2048 model originally looked like one blob). Finally, the F1 score and validation loss of fine-tuning looked even better with the new model (F1 score > 0.99).

However, when I ran the same literature-based validation, I found that the perturbation analysis results obtained with the new model were no better than random. In fact, they actually looked random: each time I ran it, I got totally different predicted genes (e.g., out of ~150 genes, only 1-2 overlapped with another perturbation analysis run). This seems very counterintuitive. Based on the training metrics and the embeddings, the 12L-4096 model appears to have a better understanding of my dataset. Even if that is not the case, it seems strange that the predictions differ so much between runs.

Do you have any idea what could be going wrong? Could there be an issue with the calculations in the perturbation analysis itself, rather than with the model?

Thank you for the discussion!

The fact that you get different results with each run suggests that there is an error, because the in silico perturbation is inference only and the results should be the same each time. If you subset to a small number of cells and rerun it, are the results the same as when running with all cells? This will help confirm whether any subsampling is causing this.
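As a quick way to quantify this, you could compare the gene lists from two runs on the same cells; a minimal sketch, assuming each run was exported to a CSV sorted by effect size with a "Gene" column (adjust to however you saved the results):

```python
# Compare the predicted gene lists from two perturbation runs on the same
# cells. Assumes each run was exported to a CSV with one row per gene; the
# file names, the "Gene" column, and sorting by effect size are assumptions.
import pandas as pd

def top_genes(path, gene_col="Gene", n=150):
    df = pd.read_csv(path)          # expected to be sorted by effect size
    return set(df[gene_col].head(n))

run1 = top_genes("perturbation_run1.csv")
run2 = top_genes("perturbation_run2.csv")

shared = run1 & run2
jaccard = len(shared) / len(run1 | run2)
print(f"{len(shared)} genes shared, Jaccard similarity = {jaccard:.2f}")
# Inference-only perturbation on identical inputs should give a Jaccard of ~1.
```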

Regarding the genes appearing to be random, please confirm you are using the correct token dictionary for each model.
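For example, a quick sanity check along these lines can confirm which dictionary is being loaded; the pickle paths below are placeholders for the dictionary files distributed with each model:

```python
# Confirm the token dictionary matches the model in use. Geneformer token
# dictionaries are pickled dicts mapping Ensembl gene IDs to integer token
# IDs; the file paths below are hypothetical placeholders.
import pickle

def load_token_dict(path):
    with open(path, "rb") as f:
        return pickle.load(f)

dict_2048 = load_token_dict("token_dictionary_2048.pkl")
dict_4096 = load_token_dict("token_dictionary_4096.pkl")

gene = "ENSG00000141510"  # example Ensembl ID (TP53)
print(gene in dict_2048, gene in dict_4096)      # present in each vocabulary?
print(dict_2048.get(gene), dict_4096.get(gene))  # token IDs generally differ between models
print(len(dict_2048), len(dict_4096))            # vocabulary sizes
```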

If neither of these is the issue, please send the exact settings you are using to run the code so we can reproduce the problem.

When we compare results of in silico perturbation between the two models, the cosine shift values are correlated. The magnitude of the values may change, though: for example, since the 4096 model encodes more genes per cell, deleting 1 gene out of up to 4096 usually leads to a smaller cosine shift than deleting 1 out of 2048.
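If useful, you can check that correlation on your own outputs; a minimal sketch, assuming each model's statistics were exported to a CSV with "Ensembl_ID" and "Cosine_shift" columns (adjust the names to your export):

```python
# Correlate cosine-shift values between the two models for genes present in
# both result tables. File names and column names are assumptions about how
# the perturbation statistics were exported.
import pandas as pd
from scipy.stats import spearmanr

res_2048 = pd.read_csv("isp_stats_2048.csv")
res_4096 = pd.read_csv("isp_stats_4096.csv")

merged = res_2048.merge(res_4096, on="Ensembl_ID", suffixes=("_2048", "_4096"))
rho, p = spearmanr(merged["Cosine_shift_2048"], merged["Cosine_shift_4096"])
print(f"{len(merged)} shared genes, Spearman rho = {rho:.2f} (p = {p:.2e})")
# A reasonable rank correlation is expected even though the absolute
# magnitudes differ between the two context sizes.
```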

Also keep in mind that some genes are not present in both model dictionaries, so they can appear in the results of one model and not the other. Furthermore, genes ranked beyond the 2048 cutoff in a given cell will only be included in that cell's input for the 4096 model, so this can also cause different results.
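Both effects can be quantified directly; a minimal sketch, reusing the hypothetical dictionary paths from above and assuming a tokenized dataset saved to disk whose "input_ids" column holds each cell's rank-ordered token IDs:

```python
# Quantify the two sources of disagreement: genes present in only one
# vocabulary, and genes ranked beyond position 2048 in a cell (seen only by
# the 4096-context model). Paths and the "input_ids" column are assumptions.
import pickle
from datasets import load_from_disk

with open("token_dictionary_2048.pkl", "rb") as f:
    dict_2048 = pickle.load(f)
with open("token_dictionary_4096.pkl", "rb") as f:
    dict_4096 = pickle.load(f)

print(len(set(dict_2048) - set(dict_4096)), "genes only in the 2048 vocabulary")
print(len(set(dict_4096) - set(dict_2048)), "genes only in the 4096 vocabulary")

dataset = load_from_disk("my_dataset_4096.dataset")
beyond = [max(len(cell["input_ids"]) - 2048, 0) for cell in dataset]
print(f"average genes per cell beyond the 2048 cutoff: {sum(beyond)/len(beyond):.1f}")
```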
