cell type prediction returns cell sentence instead of cell type

#1
by lhl1bit - opened

Hi, could you please let me know how to do cell type prediction? I'm running the following code to predict cell type, where processed_genes is the cell sentence output from the sample code provided in your readme file:

import re
import torch

initial_prompt = "Identify the cell type most likely associated with these 100 highly expressed genes listed in descending order."
cell_sentence_prompt = ' '.join(str(x) for x in processed_genes[0:100])
prediction_prompt = "This is the cell type corresponding to these genes: "
ctp = initial_prompt + " " + cell_sentence_prompt + ". " + prediction_prompt

tokens = tokenizer(ctp, return_tensors='pt')
input_ids = tokens['input_ids'].to(torch.device("cuda"))
attention_mask = tokens['attention_mask'].to(torch.device("cuda"))

with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        do_sample=True,
        max_length=1024,
        top_k=50,
        top_p=0.95,
    )

output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Split on '?', '.', or ':' and drop the first segment (the instruction sentence).
predicted_cell_type = "".join(re.split(r"\?|\.|:", output_text)[1:]).strip()
predicted_cell_type

This is the predicted cell type I get:

'MALAT1 HSPA1A MT-CO1 EEF1A1 MT-CO3 MT-CO2 RPLP1 MT-CYB HSP90AA1 MT-ATP6 CCL5 DNAJB1 TPT1 RPS12 TMSB4X RPL13 MT-ND4 RPS19 RPL10 RPS3 RPL30 HSPE1 RPL32 B2M RPS4X RPL21 RPS2 RPS27A RPS15A RPL12 RPL28 RPS14 RPS23 RPS7 FTH1 RPS3A RPS13 HSP90AB1 RPL41 RPL39 JUND RPL17 RPS18 RPS15 RPL8 RPL7A PTMA RPL11 RPS27 HSPH1 RPL18A MT-ND3 HLA-B RPS29 RPS6 RPS8 RPL13A RPLP2 RPL3 RPS25 RPL9 TUBA1A RPL19 RPL23A RPL29 RPL5 MT-ND2 RPS9 RPL18 RPL10A RPL35A HSPA8 RPL26 HLA-A RPS11 RPS5 UBB KLF6 RPS16 RPS21 PNRC1 RPS24 RPL34 EIF1 RPL15 PTGES3 RPL6 NACA RPL22 HSPB1 ACTG1 RPS28 BTG1 ACTB TRBC1 MT-ND1 EEF1B2 UBC RPL36A DDX5 This is the cell type corresponding to these genes B RPL37 RPLP0 RPS20 PTMA RPL35A PABPC1 RPS10 RPSA CD7 PFN1 RPS3L RPS17 RPL35 GNAS RPL23 PPP1R15A TSC22D3 RPS12 H3-3B NAP1L1 RPL36 PPDPF HLA-C COTL1 EEF1D MT-ND5 TSPYL2 RBM39 MT-ND4L EEF1G YPEL5 DNAJA1 TMSB10 CD3E REL RPL4 SOD1 HSPD1 EEF2 HNRNPA2B1 ATP5MC2 VIM MYL12A PRRC2C IL32 CREM H3-3A NUCB2 UBA52 DDX3X OAZ1 HMGB2 ITM2B NEDD8 PFDN5 TOMM7 UQCRH HSP90B1 GNAI2 FOS CHASERR RBM4 C17orf49 HSPA5 FAU GAPDH DDX24 ZEB2 NUDT4 CYBA IDS SARAF SNHG29 SLC25A3 HMGB1 NCL ZFAS1 RPL37A RPL14 DUSP1 HLA-E FOSB PRDX5 CACYBP SRRM1 EIF4A1 TRBC2 ARHGDIB FTL STK17B SRGN DDX18 KLRC3 HNRNPA1 TNFAIP3 TOMM20 RPS26 ZNF331 SRP14 BCLAF1 LRRFIP1 BIRC6 RPL27 LINC-PINT JUN SELENOW NR3C1 EIF3G UBXN4 SON SF1 RAB5IF ANKRD44 CFL1 CITED2 GUK1 JMY DNAJB11 CCND3 G3BP2 AHI1 PRDX6 KRTCAP2 SLC25A6 CD3D KTN1 SYTL3 SPCS3 DSTN C1QBP TRAPPC1 CSRNP1 EIF3F RPL7 JAK1 ITGAE ANKRD11 LSP1 GGNBP2 SERP1 TAF7 ATP5MC3 NOP53 FUS RACK1 STK4 HNRNPK PRR13 JUNB CDC42 ZFP36L2 COX4I1 CD63 CDV3 HINT1 S100A10 TTC19 MDM4 PTPRC CALM1 ATP6V0C GOLGA7 SSU72 RPS27L UQCRB PRKCH DNAJB6 CUL3 ZFAND6 SYF2 PTOV1 CALM3 ARPC1B ATP1B3 STARD3NL NFKBIA CD2AP PTMS ZYX GPBP1 CD74 HMGCS1 BICDL1 SLC25A5 PRMT2 RHOA UBE2A TSC22D4 PTPN22 SNHG8 CNOT6L CHCHD2 H1-4 RPL27A RAP1'
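Note that the string above still contains the input gene list and the prediction prompt, because the split only drops the first sentence. Below is a minimal sketch of one way to isolate just the text generated after the prediction prompt. The helper name `extract_continuation` and the shortened `decoded` string are hypothetical, introduced only for illustration; they are not part of the model's README.

```python
def extract_continuation(output_text: str, marker: str) -> str:
    """Return the text that follows the last occurrence of `marker`."""
    # rpartition splits on the LAST occurrence of the marker; if the marker
    # is absent, `sep` is empty and we fall back to the whole string.
    _, sep, tail = output_text.rpartition(marker)
    return tail.strip() if sep else output_text.strip()


# Hypothetical, shortened stand-in for the decoded model output.
decoded = (
    "Identify the cell type most likely associated with these genes. "
    "MALAT1 HSPA1A MT-CO1. "
    "This is the cell type corresponding to these genes: B RPL37 RPLP0"
)

# The marker omits the trailing space so that whitespace differences
# introduced by tokenization and decoding do not prevent a match.
marker = "This is the cell type corresponding to these genes:"
continuation = extract_continuation(decoded, marker)
print(continuation)
```

Applied to the output shown above, this would leave only the part starting with `B RPL37 RPLP0 ...`, which is still a gene list and not a cell type label.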

Van Dijk Lab @ Yale org

Hi,

Thank you for pointing this out! This model was only trained on full cell generation, but we will release a model that can do cell type prediction and cell generation of variable lengths on the immune tissue dataset it was trained on. I will update once this is released.
