What do you think of "List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs"?
#2, opened by Shure-Dev
https://arxiv.org/pdf/2404.16375
I want to know why you do not concatenate multiple images into a single image and solve the task with prompt engineering alone.
That approach is the baseline we compared against across all the benchmarks. Also, concatenating images makes co-reference almost impossible, so we don't think that's the way to go.
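
For readers unfamiliar with the concatenation approach discussed here, the following is a minimal sketch, not code from this repository or from the linked paper. It assumes images are plain nested lists of grayscale values (a hypothetical stand-in for real image tensors), and the helper names `concat_horizontal` and `x_offsets` are made up for illustration. It shows one possible reason references get harder: once images are merged into a single canvas, the boundaries between them exist only as pixel offsets that have to be tracked separately.

```python
from typing import List

# Hypothetical stand-in for an image: rows of grayscale pixel values.
Image = List[List[int]]


def concat_horizontal(images: List[Image], pad: int = 0) -> Image:
    """Place images side by side, padding shorter ones at the bottom."""
    height = max(len(img) for img in images)
    rows: Image = []
    for y in range(height):
        row: List[int] = []
        for img in images:
            width = len(img[0])
            # Use the real row if the image is tall enough, else padding.
            row.extend(img[y] if y < len(img) else [pad] * width)
        rows.append(row)
    return rows


def x_offsets(images: List[Image]) -> List[int]:
    """Left edge of each source image inside the concatenated canvas."""
    offsets: List[int] = []
    x = 0
    for img in images:
        offsets.append(x)
        x += len(img[0])
    return offsets


a = [[1, 1], [1, 1]]  # 2 wide, 2 tall
b = [[2, 2, 2]]       # 3 wide, 1 tall
canvas = concat_horizontal([a, b])
offsets = x_offsets([a, b])
```

In this sketch, `canvas` is a single image with no record of where `a` ends and `b` begins. A prompt that says "the second image" has to rely on the model inferring that boundary, unless the offsets are passed along by some other means.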