Abstract
We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks; it outperforms Llama-2 on text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and it performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or the outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in unified modeling of full multimodal documents.
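For intuition about the early-fusion, token-based setup, here is a minimal sketch (my own illustration with placeholder tokenizer names and sentinel ids, not the released Chameleon code) of how images can be quantized into discrete codes and interleaved with text tokens into a single sequence for one shared transformer:

```python
from typing import List, Tuple


def build_mixed_modal_sequence(
    segments: List[Tuple[str, object]],   # e.g. [("text", "A photo of"), ("image", img)]
    text_tokenizer,                       # maps str -> List[int] (placeholder)
    image_tokenizer,                      # maps image -> List[int] codebook indices (placeholder)
    boi_id: int = 50001,                  # begin-of-image sentinel (placeholder id)
    eoi_id: int = 50002,                  # end-of-image sentinel (placeholder id)
) -> List[int]:
    """Interleave text tokens and discrete image codes into one token sequence."""
    tokens: List[int] = []
    for kind, payload in segments:
        if kind == "text":
            tokens.extend(text_tokenizer.encode(payload))
        else:  # "image": wrap the quantized image codes in sentinel tokens
            tokens.append(boi_id)
            tokens.extend(image_tokenizer.encode(payload))
            tokens.append(eoi_id)
    return tokens  # fed to an ordinary decoder-only transformer with shared weights
```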
Community
Nice paper! Tiny nit: it sounds like there is supposed to be a comparison to LLaVA-1.5, but it is missing from the image-to-text results table.
Its purpose is completely different from LLaVA's.
Maybe eventually; it seems to just be a paper right now.
Read it, good training strategies. Thanks
There's a plain-english rewrite of the paper up here: https://www.aimodels.fyi/papers/arxiv/chameleon-mixed-modal-early-fusion-foundation-models
Thanks
Great work! I like the discussion around training stability!
I had a few questions:
a/
We narrowed down the cause of the divergence to the softmax operation being problematic when training with multiple modalities of significantly varying entropy due to the translation-invariant property of softmax (i.e., softmax(z) = softmax(z + c)). Because we share all weights of the model across modalities, each modality will try to "compete" with the other by increasing its norms slightly
Can you expand on this explanation?
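For concreteness, here is a quick numeric check of that invariance (my own toy example, not from the paper): shifting all logits by a constant leaves the softmax output unchanged, so logit norms can grow without changing the predictions.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


z = np.array([1.0, 2.0, 3.0])
c = 100.0
print(np.allclose(softmax(z), softmax(z + c)))   # True: identical output distribution
print(np.linalg.norm(z), np.linalg.norm(z + c))  # the logit norm grows a lot, output unchanged
```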
b/ In Figure 6b, does "Training loss curve with image generation disabled does not suffer from instability issues" mean that the data is text-only, or that you do not compute the loss (and thus gradients) on the image tokens?
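(For clarity, the second reading I have in mind is something like this hypothetical masking sketch, where image tokens stay in the input but contribute no loss:)

```python
import torch
import torch.nn.functional as F


def loss_ignoring_image_tokens(logits, targets, is_image_token):
    """logits: (B, T, V); targets: (B, T) long; is_image_token: (B, T) bool mask."""
    targets = targets.clone()
    targets[is_image_token] = -100  # -100 is cross_entropy's default ignore_index
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```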
c/ One of the long-standing questions for these types of multimodal models is whether they are more sample-efficient (transfer between modalities) or learn something they would not be able to learn just from observing pure text. Do you have any insights into that question with the Chameleon models?
I have featured this paper in Ajith's AI Pulse https://ajithp.com/2024/05/26/chameleon-early-fusion-multimodal-ai-model-for-visual-and-textual-interaction/
The Future of AI: Chameleon’s Breakthrough in Multimodal Models
After reading the paper, I cannot find how to decode from your codebook embeddings back into an image. Is there a decoder trained jointly with the transformer architecture?
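For reference, the kind of decoding step I have in mind is a VQ-style one, roughly like this toy sketch (all names and sizes are placeholders, not the paper's actual tokenizer):

```python
import torch
import torch.nn as nn


class ToyVQDecoder(nn.Module):
    """Toy stand-in: embed predicted codebook indices, then run a conv decoder to pixels."""

    def __init__(self, codebook_size=8192, dim=256, grid=32):
        super().__init__()
        self.grid = grid
        self.codebook = nn.Embedding(codebook_size, dim)  # codebook from the image tokenizer
        self.decoder = nn.Sequential(                     # placeholder for a real conv decoder
            nn.ConvTranspose2d(dim, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, token_ids):                         # (B, grid*grid) indices from the LM
        b = token_ids.size(0)
        z = self.codebook(token_ids)                      # (B, grid*grid, dim)
        z = z.transpose(1, 2).reshape(b, -1, self.grid, self.grid)  # (B, dim, grid, grid)
        return self.decoder(z)                            # (B, 3, H, W) reconstructed pixels


# e.g. ToyVQDecoder()(torch.randint(0, 8192, (1, 32 * 32)))
```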