What do you think of "List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs"?

#2
by Shure-Dev - opened

https://arxiv.org/pdf/2404.16375

I want to know why you do not concatenate multiple images into a single image and solve the task with prompt engineering alone.

TIGER-Lab org

Those are the baseline results we compared against across all the benchmarks. Also, concatenating images makes co-reference almost impossible. We don't think that's the way to go.
