Question about inputs in Molmo

#13
by 2U1

I'm writing code for fine-tuning Molmo and I have some questions about the model's inputs.

1. It looks like you are using <|endoftext|> for BOS, because Qwen doesn't use a BOS token. Are you also using <|endoftext|> as the EOS token? Looking at the example code, it seems the sequence should end with <|endoftext|>.

2. Are there no separators for multi-turn conversations? When I preprocess the example input, it looks like '<|endoftext|><im_start><im_patch><im_patch>...<im_col><im_end> User: Describe this image. Assistant:'

3. What is the purpose of the image_mask? From modelling_molmo.py it looks like it tells the model which parts of the image tensor are padding. If that's right, is it used the same way during training?

  1. Yes, we use the same token for BOS and EOS. If you give the model an example of the generation it should produce, that example should end with EOS.
  2. The separators are just "User:" for user input and "Assistant:" for model output; for multi-turn conversations those prefixes should appear before each message (see the sketch below).
  3. That is correct, and the image_mask was used during training.
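To make the format concrete, here is a minimal sketch of how a multi-turn fine-tuning sequence could be assembled under the conventions stated above (the same <|endoftext|> token for BOS and EOS, and "User:" / "Assistant:" prefixes for each turn). The build_sequence helper and the example messages are hypothetical illustrations, not Molmo's actual preprocessing code, and the image_tokens placeholder stands in for the expanded <im_start><im_patch>...<im_col><im_end> block produced by the processor.

```python
EOT = "<|endoftext|>"  # Molmo uses the same token for both BOS and EOS

def build_sequence(turns, image_tokens=""):
    """Assemble one training sequence from a list of (user, assistant) turns."""
    text = EOT + image_tokens  # BOS first, then the expanded image tokens, if any
    for user_msg, assistant_msg in turns:
        # Each turn is prefixed with "User:" / "Assistant:" as described above
        text += f" User: {user_msg} Assistant: {assistant_msg}"
    return text + EOT  # the target the model should produce ends with EOS

example = build_sequence(
    [("Describe this image.", "A dog sitting on a porch."),
     ("What color is it?", "It is golden brown.")],
    image_tokens="<im_start><im_patch>...<im_col><im_end>",
)
print(example)
```

For fine-tuning, the loss would typically be computed only on the Assistant spans and the final EOS, with the user prompts and image tokens masked out of the labels.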
