Question about inputs in Molmo

#13
by 2U1

I'm writing code for fine-tuning Molmo and I have some questions about the model's inputs.

1. It looks like you are using <|endoftext|> for BOS, because Qwen doesn't use a BOS token. Are you also using <|endoftext|> as the EOS token? Looking at the example code, it seems the sequence should end with <|endoftext|>.

2. Are there no separators for multi-turn conversations? When I preprocess the example input, it looks like '<|endoftext|><im_start><im_patch><im_patch>...<im_col><im_end> User: Describe this image. Assistant:'

3. What is the purpose of the image_mask? From modelling_molmo.py it looks like it tells the model which parts of the image tensor are padding. If that's right, is it used the same way during training?

  1. Yes, we use the same token for BOS and EOS. If you give the model an example of the generation it should produce, that example should end with EOS.
  2. The separators are just "User:" for user input and "Assistant:" for model output; for multi-turn conversations those prefixes should appear before each message (see the sketch below).
  3. That is correct, and the image_mask was used during training.
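To make the format concrete, here is a minimal sketch of how a multi-turn fine-tuning sequence could be assembled under the conventions stated above (the same <|endoftext|> token for BOS and EOS, and "User:" / "Assistant:" prefixes for each turn). The build_sequence helper and the example messages are hypothetical illustrations, not Molmo's actual preprocessing code, and the image_tokens placeholder stands in for the expanded <im_start><im_patch>...<im_col><im_end> block produced by the processor.

```python
EOT = "<|endoftext|>"  # Molmo uses the same token for both BOS and EOS

def build_sequence(turns, image_tokens=""):
    """Assemble one training sequence from a list of (user, assistant) turns."""
    text = EOT + image_tokens  # BOS first, then the expanded image tokens, if any
    for user_msg, assistant_msg in turns:
        # Each turn is prefixed with "User:" / "Assistant:" as described above
        text += f" User: {user_msg} Assistant: {assistant_msg}"
    return text + EOT  # the target the model should produce ends with EOS

example = build_sequence(
    [("Describe this image.", "A dog sitting on a porch."),
     ("What color is it?", "It is golden brown.")],
    image_tokens="<im_start><im_patch>...<im_col><im_end>",
)
print(example)
```

For fine-tuning, the loss would typically be computed only on the Assistant spans and the final EOS, with the user prompts and image tokens masked out of the labels.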
