Abstract
We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks; it outperforms Llama-2 on text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and it performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or the outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in unified modeling of full multimodal documents.
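For intuition about the early-fusion, token-based setup, here is a minimal sketch (my own illustration with placeholder tokenizer names and sentinel ids, not the released Chameleon code) of how images can be quantized into discrete codes and interleaved with text tokens into a single sequence for one shared transformer:

```python
from typing import List, Tuple


def build_mixed_modal_sequence(
    segments: List[Tuple[str, object]],   # e.g. [("text", "A photo of"), ("image", img)]
    text_tokenizer,                       # maps str -> List[int] (placeholder)
    image_tokenizer,                      # maps image -> List[int] codebook indices (placeholder)
    boi_id: int = 50001,                  # begin-of-image sentinel (placeholder id)
    eoi_id: int = 50002,                  # end-of-image sentinel (placeholder id)
) -> List[int]:
    """Interleave text tokens and discrete image codes into one token sequence."""
    tokens: List[int] = []
    for kind, payload in segments:
        if kind == "text":
            tokens.extend(text_tokenizer.encode(payload))
        else:  # "image": wrap the quantized image codes in sentinel tokens
            tokens.append(boi_id)
            tokens.extend(image_tokenizer.encode(payload))
            tokens.append(eoi_id)
    return tokens  # fed to an ordinary decoder-only transformer with shared weights
```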
Community
Nice paper! Tiny nit: it sounds like there is supposed to be a comparison to LLaVA-1.5, but it is missing from the image-to-text results table.
Its purpose is completely different from LLaVA's.
Maybe eventually; it seems to just be a paper right now.
Read it, good training strategies. Thanks
There's a plain-english rewrite of the paper up here: https://www.aimodels.fyi/papers/arxiv/chameleon-mixed-modal-early-fusion-foundation-models
Thanks
Great work! I like the discussion around training stability!
I had a few questions:
a/
We narrowed down the cause of the divergence to the softmax operation being problematic when training with multiple modalities of significantly varying entropy due to the translation-invariant property of softmax (i.e., softmax(z) = softmax(z + c)). Because we share all weights of the model across modalities, each modality will try to "compete" with the other by increasing its norms slightly
Can you expand on this explanation?
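For concreteness, here is a quick numeric check of that invariance (my own toy example, not from the paper): shifting all logits by a constant leaves the softmax output unchanged, so logit norms can grow without changing the predictions.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


z = np.array([1.0, 2.0, 3.0])
c = 100.0
print(np.allclose(softmax(z), softmax(z + c)))   # True: identical output distribution
print(np.linalg.norm(z), np.linalg.norm(z + c))  # the logit norm grows a lot, output unchanged
```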
b/ In Figure 6b, does "Training loss curve with image generation disabled does not suffer from instability issues" mean that the data is text-only, or that you do not compute the loss (and thus gradients) on the image tokens?
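(For clarity, the second reading I have in mind is something like this hypothetical masking sketch, where image tokens stay in the input but contribute no loss:)

```python
import torch
import torch.nn.functional as F


def loss_ignoring_image_tokens(logits, targets, is_image_token):
    """logits: (B, T, V); targets: (B, T) long; is_image_token: (B, T) bool mask."""
    targets = targets.clone()
    targets[is_image_token] = -100  # -100 is cross_entropy's default ignore_index
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```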
c/ One of the long-standing questions for these types of multimodal models is whether they are more sample-efficient (transfer between modalities) or learn something they would not be able to learn just from observing pure text. Do you have any insights into that question with the Chameleon models?
I have featured this paper in Ajith's AI Pulse https://ajithp.com/2024/05/26/chameleon-early-fusion-multimodal-ai-model-for-visual-and-textual-interaction/
The Future of AI: Chameleon’s Breakthrough in Multimodal Models
After reading the paper, I cannot find how to decode from your codebook embeddings back into an image. Is there a decoder trained jointly with the transformer architecture?
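For reference, the kind of decoding step I have in mind is a VQ-style one, roughly like this toy sketch (all names and sizes are placeholders, not the paper's actual tokenizer):

```python
import torch
import torch.nn as nn


class ToyVQDecoder(nn.Module):
    """Toy stand-in: embed predicted codebook indices, then run a conv decoder to pixels."""

    def __init__(self, codebook_size=8192, dim=256, grid=32):
        super().__init__()
        self.grid = grid
        self.codebook = nn.Embedding(codebook_size, dim)  # codebook from the image tokenizer
        self.decoder = nn.Sequential(                     # placeholder for a real conv decoder
            nn.ConvTranspose2d(dim, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, token_ids):                         # (B, grid*grid) indices from the LM
        b = token_ids.size(0)
        z = self.codebook(token_ids)                      # (B, grid*grid, dim)
        z = z.transpose(1, 2).reshape(b, -1, self.grid, self.grid)  # (B, dim, grid, grid)
        return self.decoder(z)                            # (B, 3, H, W) reconstructed pixels


# e.g. ToyVQDecoder()(torch.randint(0, 8192, (1, 32 * 32)))
```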