metadata

datasets:
  - MMInstruction/VLFeedback

Model Card for Silkie

Silkie is a visual language model trained by preference distillation on GPT-4V-annotated AI feedback. It is a fine-tuned version of Qwen/Qwen-VL-Chat, trained on our MMInstruction/VLFeedback dataset with direct preference optimization (DPO). Compared with the original model, Silkie achieves relative improvements of 6.9% and 9.5% on the MME benchmark for perception and cognition capabilities, respectively. Silkie also sets a new state-of-the-art score of 3.02 on MMHal-Bench for hallucination evaluation. Please refer to our project page for more details.
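
For readers unfamiliar with DPO, the per-example objective can be sketched as follows. This is a minimal illustration of the standard DPO loss, not the training code used for Silkie; the function name `dpo_loss` and the default `beta=0.1` are assumptions made for the example. Inputs are the summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio))).

    beta=0.1 is an illustrative default, not a value taken from this model card.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Raising the chosen response's likelihood relative to the reference lowers the loss.
improved = dpo_loss(-3.0, -7.0, -5.0, -5.0)
```

The loss falls as the policy assigns relatively more probability to the chosen response than the reference model does, which is how preference annotations steer the fine-tuned model.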

Model Sources

Project page: https://vlf-silkie.github.io/
Dataset: https://huggingface.co/datasets/MMInstruction/VLFeedback
Paper: Coming soon.
Repository: Coming soon.

Uses

Silkie is intended for research purposes, particularly for alignment research in multimodal models.

How to Get Started

Below is a simple Python code snippet to get started with the model.

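from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is needed because the tokenizer and model classes
# are defined in the model repository rather than in transformers itself.
tokenizer = AutoTokenizer.from_pretrained(
    "MMInstruction/Silkie", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "MMInstruction/Silkie", device_map="cuda", trust_remote_code=True
).eval()

# Build a single prompt from an image (URL or local path) and a text question.
query = tokenizer.from_list_format(
    [
        {"image": "https://farm8.staticflickr.com/137/383965780_db4815011c_o.jpg"},
        {"text": "Which wooden stool has a vase with red flower on it?"},
    ]
)

# history=None starts a new conversation; the returned history can be passed
# back in to continue it.
response, history = model.chat(tokenizer, query=query, history=None)
print(response)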
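
The list passed to `tokenizer.from_list_format` in the snippet above is a plain list of dictionaries, so it can be assembled programmatically. The sketch below uses a hypothetical helper, `build_query_items`, which is not part of the model's API; it assumes only the `{"image": ...}` and `{"text": ...}` item shapes shown above and does not load the model.

```python
def build_query_items(image_sources, question):
    """Return list-format items: one {"image": ...} per source, then the question.

    Hypothetical helper for illustration; the name is not part of the model's API.
    """
    items = [{"image": source} for source in image_sources]
    items.append({"text": question})
    return items


items = build_query_items(
    ["https://farm8.staticflickr.com/137/383965780_db4815011c_o.jpg"],
    "Which wooden stool has a vase with red flower on it?",
)
# With the tokenizer loaded as above, the prompt would then be built with:
# query = tokenizer.from_list_format(items)
```

Keeping prompt construction in a small pure function makes it easy to check the input structure before spending time on model inference.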

Citation

Coming soon.