# LLaVa3-Med
We train our model in three stages:
1. Pretraining: We use 600k image-text pairs from PMC and 60k medical references based on Mayo Clinic guidelines.
2. Instruction fine-tuning: We use the 60k LLaVA_Med instruction fine-tuning examples together with the PMC-VQA dataset for instruction learning.
3. Fine-tuning: The model is then fine-tuned on the downstream VQA datasets.
# Inference
```bash
CUDA_VISIBLE_DEVICES=0 python -m evaluation \
    --model-path model_path \
    --question-file data_path \
    --image-folder image_path \
    --answers-file result.jsonl \
    --temperature 0.7 \
    --conv-mode llama3
```
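The `--answers-file` is written in JSON Lines format, one record per question. Below is a minimal sketch of how such a file could be loaded and inspected; the field names (`question_id`, `text`) are assumptions and may differ from what the evaluation script actually writes.

```python
# Sketch for inspecting the generated answers file (field names assumed).
import json

def load_answers(path="result.jsonl"):
    """Read a JSON Lines answers file into a list of dicts."""
    answers = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                answers.append(json.loads(line))
    return answers

if __name__ == "__main__":
    answers = load_answers()
    print(f"Loaded {len(answers)} answers")
    for record in answers[:3]:
        # "question_id" and "text" are assumed keys, not confirmed by the repo.
        print(record.get("question_id"), "->", record.get("text"))
```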
# Results
Because GPT-4 has not been fine-tuned on these VQA tasks, its answers to open-ended questions differ significantly in style from the reference answers. We therefore used a few-shot approach to rewrite GPT-4's answers in the style of the reference answers (a sketch of this restyling step is shown below).
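The sketch below illustrates one way such a few-shot restyling prompt could be built. The exemplars and the prompt wording are illustrative assumptions, not the exact prompt used in our evaluation.

```python
# Illustrative few-shot prompt for shortening verbose GPT-4 answers into the
# terse style of the VQA reference answers. Exemplars are hypothetical.
FEW_SHOT_EXEMPLARS = [
    # (verbose GPT-4 style answer, reference-style answer)
    ("The image most likely depicts the liver, which appears enlarged.", "Liver."),
    ("No, there is no evidence of pneumothorax in this radiograph.", "No."),
]

def build_restyle_prompt(gpt4_answer: str) -> str:
    """Build a few-shot prompt asking the model to restyle an answer."""
    lines = ["Rewrite each answer in the concise style of the reference answers."]
    for verbose, concise in FEW_SHOT_EXEMPLARS:
        lines.append(f"Answer: {verbose}\nReference style: {concise}")
    lines.append(f"Answer: {gpt4_answer}\nReference style:")
    return "\n\n".join(lines)

print(build_restyle_prompt("Yes, the scan shows a small lesion in the left lung."))
```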
| Dataset   | Metric   | Med-Gemini | Med-PaLM-540B | GPT-4V | LLaVa3-Med |
|-----------|----------|------------|---------------|--------|------------|
| Slake-VQA | Token F1 | 87.5       | 89.3          | 76.8   | 89.8†      |
| Path-VQA  | Token F1 | 64.7       | 62.7          | 57.7   | 64.9†      |
Table 1 | Multimodal evaluation. Performance comparison of LLaVa3-Med versus state-of-the-art (SoTA) methods.
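For reference, a minimal sketch of the Token F1 metric reported in Table 1 (SQuAD-style token-level F1, i.e. the harmonic mean of precision and recall over shared tokens). Whitespace tokenization and lowercasing are assumptions; the actual evaluation may apply additional normalization.

```python
# Token-level F1 between a predicted answer and a reference answer.
# Tokenization/normalization here is a simplifying assumption.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Count tokens shared between prediction and reference.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the liver is enlarged", "enlarged liver"))  # ~0.667
```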