Update model weights
Hello, I loaded the weights of the official Pixtral model mistralai/Pixtral-12B-2409 and compared them with your model's weights, and I found that for some layers the weights differ between the two models. Could you please update the model weights? I will leave the unmatched and matched layers below so you can check on your own (a rough sketch of the comparison is included after the list).
Matched pairs:
[('tok_embeddings.weight', 'language_model.model.embed_tokens.weight'),
('vision_encoder.patch_conv.weight', 'vision_tower.patch_conv.weight'),
('vision_encoder.ln_pre.weight', 'vision_tower.ln_pre.weight'),
('vision_encoder.transformer.layers.0.attention.wv.weight',
'vision_tower.transformer.layers.0.attention.v_proj.weight'),
('vision_encoder.transformer.layers.0.attention.wo.weight',
'vision_tower.transformer.layers.0.attention.o_proj.weight'),
('vision_encoder.transformer.layers.0.attention_norm.weight',
'vision_tower.transformer.layers.0.attention_norm.weight'),
('vision_encoder.transformer.layers.0.ffn_norm.weight',
'vision_tower.transformer.layers.0.ffn_norm.weight'),
('vision_encoder.transformer.layers.0.feed_forward.w1.weight',
'vision_tower.transformer.layers.0.feed_forward.gate_proj.weight'),
('vision_encoder.transformer.layers.0.feed_forward.w2.weight',
'vision_tower.transformer.layers.0.feed_forward.down_proj.weight'),
('vision_encoder.transformer.layers.0.feed_forward.w3.weight',
'vision_tower.transformer.layers.0.feed_forward.up_proj.weight'),
('vision_encoder.transformer.layers.1.attention.wv.weight',
'vision_tower.transformer.layers.1.attention.v_proj.weight'),
('vision_encoder.transformer.layers.1.attention.wo.weight',
'vision_tower.transformer.layers.1.attention.o_proj.weight'),
('vision_encoder.transformer.layers.1.attention_norm.weight',
'vision_tower.transformer.layers.1.attention_norm.weight'),
('vision_encoder.transformer.layers.1.ffn_norm.weight',
'vision_tower.transformer.layers.1.ffn_norm.weight'),
('vision_encoder.transformer.layers.1.feed_forward.w1.weight',
'vision_tower.transformer.layers.1.feed_forward.gate_proj.weight'),
('vision_encoder.transformer.layers.1.feed_forward.w2.weight',
'vision_tower.transformer.layers.1.feed_forward.down_proj.weight'),
('vision_encoder.transformer.layers.1.feed_forward.w3.weight',
'vision_tower.transformer.layers.1.feed_forward.up_proj.weight'),
('vision_encoder.transformer.layers.2.attention.wv.weight',
'vision_tower.transformer.layers.2.attention.v_proj.weight'),
('vision_encoder.transformer.layers.2.attention.wo.weight',
'vision_tower.transformer.layers.2.attention.o_proj.weight'),
('vision_encoder.transformer.layers.2.attention_norm.weight',
'vision_tower.transformer.layers.2.attention_norm.weight'),
('vision_encoder.transformer.layers.2.ffn_norm.weight',
'vision_tower.transformer.layers.2.ffn_norm.weight'),
('vision_encoder.transformer.layers.2.feed_forward.w1.weight',
'vision_tower.transformer.layers.2.feed_forward.gate_proj.weight'),
('vision_encoder.transformer.layers.2.feed_forward.w2.weight',
'vision_tower.transformer.layers.2.feed_forward.down_proj.weight'),
('vision_encoder.transformer.layers.2.feed_forward.w3.weight',
'vision_tower.transformer.layers.2.feed_forward.up_proj.weight'),
......................................................................................................................
......................................................................................................................
......................................................................................................................
......................................................................................................................
('vision_encoder.transformer.layers.23.attention.wv.weight',
'vision_tower.transformer.layers.23.attention.v_proj.weight'),
('vision_encoder.transformer.layers.23.attention.wo.weight',
'vision_tower.transformer.layers.23.attention.o_proj.weight'),
('vision_encoder.transformer.layers.23.attention_norm.weight',
'vision_tower.transformer.layers.23.attention_norm.weight'),
('vision_encoder.transformer.layers.23.ffn_norm.weight',
'vision_tower.transformer.layers.23.ffn_norm.weight'),
('vision_encoder.transformer.layers.23.feed_forward.w1.weight',
'vision_tower.transformer.layers.23.feed_forward.gate_proj.weight'),
('vision_encoder.transformer.layers.23.feed_forward.w2.weight',
'vision_tower.transformer.layers.23.feed_forward.down_proj.weight'),
('vision_encoder.transformer.layers.23.feed_forward.w3.weight',
'vision_tower.transformer.layers.23.feed_forward.up_proj.weight'),
('vision_language_adapter.w_in.weight',
'multi_modal_projector.linear_1.weight'),
('vision_language_adapter.w_in.bias', 'multi_modal_projector.linear_1.bias'),
('vision_language_adapter.w_out.weight',
'multi_modal_projector.linear_2.weight'),
('vision_language_adapter.w_out.bias', 'multi_modal_projector.linear_2.bias'),
('norm.weight', 'language_model.model.norm.weight'),
('output.weight', 'language_model.lm_head.weight'),
('layers.0.attention.wv.weight',
'language_model.model.layers.0.self_attn.v_proj.weight'),
('layers.0.attention.wo.weight',
'language_model.model.layers.0.self_attn.o_proj.weight'),
('layers.0.attention_norm.weight',
'language_model.model.layers.0.input_layernorm.weight'),
('layers.0.ffn_norm.weight',
'language_model.model.layers.0.post_attention_layernorm.weight'),
('layers.0.feed_forward.w1.weight',
'language_model.model.layers.0.mlp.gate_proj.weight'),
('layers.0.feed_forward.w2.weight',
'language_model.model.layers.0.mlp.down_proj.weight'),
('layers.0.feed_forward.w3.weight',
'language_model.model.layers.0.mlp.up_proj.weight'),
('layers.1.attention.wv.weight',
'language_model.model.layers.1.self_attn.v_proj.weight'),
('layers.1.attention.wo.weight',
'language_model.model.layers.1.self_attn.o_proj.weight'),
('layers.1.attention_norm.weight',
'language_model.model.layers.1.input_layernorm.weight'),
('layers.1.ffn_norm.weight',
'language_model.model.layers.1.post_attention_layernorm.weight'),
('layers.1.feed_forward.w1.weight',
'language_model.model.layers.1.mlp.gate_proj.weight'),
('layers.1.feed_forward.w2.weight',
'language_model.model.layers.1.mlp.down_proj.weight'),
('layers.1.feed_forward.w3.weight',
'language_model.model.layers.1.mlp.up_proj.weight'),
('layers.2.attention.wv.weight',
'language_model.model.layers.2.self_attn.v_proj.weight'),
('layers.2.attention.wo.weight',
'language_model.model.layers.2.self_attn.o_proj.weight'),
('layers.2.attention_norm.weight',
'language_model.model.layers.2.input_layernorm.weight'),
('layers.2.ffn_norm.weight',
'language_model.model.layers.2.post_attention_layernorm.weight'),
('layers.2.feed_forward.w1.weight',
'language_model.model.layers.2.mlp.gate_proj.weight'),
('layers.2.feed_forward.w2.weight',
'language_model.model.layers.2.mlp.down_proj.weight'),
('layers.2.feed_forward.w3.weight',
'language_model.model.layers.2.mlp.up_proj.weight'),
('layers.3.attention.wv.weight',
...........................................................................................................
...........................................................................................................
...........................................................................................................
...........................................................................................................
'language_model.model.layers.38.self_attn.v_proj.weight'),
('layers.38.attention.wo.weight',
'language_model.model.layers.38.self_attn.o_proj.weight'),
('layers.38.attention_norm.weight',
'language_model.model.layers.38.input_layernorm.weight'),
('layers.38.ffn_norm.weight',
'language_model.model.layers.38.post_attention_layernorm.weight'),
('layers.38.feed_forward.w1.weight',
'language_model.model.layers.38.mlp.gate_proj.weight'),
('layers.38.feed_forward.w2.weight',
'language_model.model.layers.38.mlp.down_proj.weight'),
('layers.38.feed_forward.w3.weight',
'language_model.model.layers.38.mlp.up_proj.weight'),
('layers.39.attention.wv.weight',
'language_model.model.layers.39.self_attn.v_proj.weight'),
('layers.39.attention.wo.weight',
'language_model.model.layers.39.self_attn.o_proj.weight'),
('layers.39.attention_norm.weight',
'language_model.model.layers.39.input_layernorm.weight'),
('layers.39.ffn_norm.weight',
'language_model.model.layers.39.post_attention_layernorm.weight'),
('layers.39.feed_forward.w1.weight',
'language_model.model.layers.39.mlp.gate_proj.weight'),
('layers.39.feed_forward.w2.weight',
'language_model.model.layers.39.mlp.down_proj.weight'),
('layers.39.feed_forward.w3.weight',
'language_model.model.layers.39.mlp.up_proj.weight')]
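For reference, here is a rough sketch of the kind of comparison behind the list above. It is not the exact script used; the filenames, the single merged state dict per checkpoint, and the brute-force value matching are assumptions for illustration:

import torch
from safetensors.torch import load_file

# Illustrative paths: one merged state dict per checkpoint.
official = load_file("Pixtral-12B-2409/consolidated.safetensors")   # mistralai/Pixtral-12B-2409
hf = load_file("pixtral-12b/merged.safetensors")                    # mistral-community/pixtral-12b, shards pre-merged

matched, unmatched = [], []
for off_name, off_w in official.items():
    # Look for an HF tensor with the same shape and identical values.
    hit = next(
        (hf_name for hf_name, hf_w in hf.items()
         if hf_w.shape == off_w.shape and torch.equal(hf_w.float(), off_w.float())),
        None,
    )
    (matched if hit else unmatched).append((off_name, hit))

print(f"{len(matched)} matched tensors, {len(unmatched)} unmatched tensors")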
Hey! This is expected because of RoPE!
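To expand on that: the q/k attention projections (exactly the tensors missing from the matched list) are stored in a different head-dimension ordering in the Transformers port because of how RoPE is applied, so they will not match element-wise even though they carry the same information. A minimal sketch, assuming the standard permute used by the HF LLaMA-style conversion scripts (the head counts and sizes below are illustrative):

import torch

def permute(w, n_heads, dim1, dim2):
    # Interleaved -> half-split head-dim reordering used when converting to the Transformers layout.
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

def unpermute(w, n_heads, dim1, dim2):
    # Inverse of permute(); apply to q_proj / k_proj from the HF checkpoint before comparing with wq / wk.
    return w.view(n_heads, 2, dim1 // n_heads // 2, dim2).transpose(1, 2).reshape(dim1, dim2)

# Round-trip sanity check on a random tensor (e.g. 32 heads, head_dim 128, hidden size 5120):
w = torch.randn(32 * 128, 5120)
assert torch.equal(unpermute(permute(w, 32, 32 * 128, 5120), 32, 32 * 128, 5120), w)

# Then, with `official` and `hf` state dicts as in the sketch above:
# q_hf = hf["language_model.model.layers.0.self_attn.q_proj.weight"]
# q_official = official["layers.0.attention.wq.weight"]
# print(torch.equal(unpermute(q_hf, 32, q_hf.shape[0], q_hf.shape[1]), q_official))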
Thank you. It seems the local model is not as good as the one on the website, because of RoPE and the 128 weight tensors that differ between the two models. I tried copying the original model's weights into your model, but the result was terrible, hahaha. Hopefully your team can release the same version as the original one, because the Mistral team's public model doesn't use the Transformers ecosystem, so it is not user-friendly.
Sorry, I think there is some confusion here 😅 this model works pretty well AFAIK! We tested it!
Do you have an example prompt that did not work?
My sample task is detecting sarcasm in a meme image.
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "mistral-community/pixtral-12b"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    # attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval()
processor = AutoProcessor.from_pretrained(model_id)

img = "../data/warn_up/warmup-images/bc24654fb4fba69b41b6b4dce15295fc4acc8ebce9b9bff452ef6a8890e04e72.jpg"
img = Image.open(img)

chat = [
    {
        "role": "user", "content": [
            {"type": "image"},
            {"type": "text", "content": "based on the text in this image explain why this image contain sarcasm meaning ?"},
        ]
    }
]

prompt = processor.apply_chat_template(chat)
inputs = processor(text=prompt, images=[img], return_tensors="pt").to(model.device)
with torch.no_grad():
    generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
The image is:
The text in the image is Vietnamese and means: "2 dead bodies found in a cemetery". I expect the model to explain that finding dead bodies in a cemetery is perfectly normal, which is what makes the image sarcastic. But when I run it locally I get the output below, and it cannot detect the joke:
The image contains text in Vietnamese that reads "Phát hiện hai xác chết trong nghĩa trang," which translates to "Discovering two corpses in the cemetery." The context of the image shows a group of people standing near a small, old building surrounded by overgrown vegetation in what appears to be a cemetery. The juxtaposition of the seemingly macabre discovery with the casual presence of the group of people might suggest a tone of dark humor or sarcasm, as it could be interpreted as a lighthearted or exaggerated reaction to a situation that is typically serious.
But when I try the web version, it can detect the joke: https://chat.mistral.ai/chat/c96f9b4e-4181-4f8b-bcbf-c337de9afa1a
Did you try with the latest silu activation instead? 🤗
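(Side note: if you want to check which projector activation your locally cached config is using, something along these lines should work; projector_hidden_act is the field on the top-level Llava config:)

from transformers import AutoConfig

config = AutoConfig.from_pretrained("mistral-community/pixtral-12b")
print(config.projector_hidden_act)   # activation used by the multi_modal_projector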
@nguyen-brat
@ArthurZ
This may be because the chat_template is not working properly. I was experimenting with it and found that when the chat_template is applied, no text is added.
For the code below:
text = "Give me comprehensive analysis of the intended meaning and interpretation of the political meme image provided"
messages = [
{"role": "user", "content": [
{"type": "image"},
{"type": "text", "text": text}
]}
]
input_text = processor.apply_chat_template(messages)
print(input_text)
I got this response:
<s>[INST][IMG][/INST]
The actual response should have been:
<s>[INST][IMG] text [/INST]
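A possible workaround until the template is sorted out: the first snippet in this thread passes the text under a "content" key and the model's answer does pick the text up, so the shipped chat template most likely reads "content" rather than "text". Either switch the key, or skip apply_chat_template and build the prompt string yourself in the [INST] ... [IMG] ... [/INST] format. A rough sketch (the key expected by the template and the manual prompt format are assumptions; double-check against the chat_template stored with the processor):

from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("mistral-community/pixtral-12b")
img = Image.open("path/to/meme.jpg")   # illustrative path
text = "Give me comprehensive analysis of the intended meaning and interpretation of the political meme image provided"

# Option 1: pass the text under "content" instead of "text".
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "content": text},
    ]}
]
print(processor.apply_chat_template(messages))   # should now include the text inside [INST] ... [/INST]

# Option 2: bypass the template and write the prompt directly.
# Depending on the tokenizer settings, the leading <s> may be added automatically.
prompt = f"[INST][IMG]{text}[/INST]"
inputs = processor(text=prompt, images=[img], return_tensors="pt")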