---
license: mit
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
pipeline_tag: visual-question-answering
---

# Model Card for Llava-Phi2

This is a multimodal implementation of the [Phi2](https://huggingface.co/microsoft/phi-2) model, inspired by [LLaVA-Phi](https://github.com/zhuyiche/llava-phi).

## Model Details
1. LLM Backbone: [Phi2](https://huggingface.co/microsoft/phi-2)
2. Vision Tower: [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)
3. Pretraining Dataset: [LAION-CC-SBU dataset with BLIP captions (200k samples)](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
4. Finetuning Dataset: [Instruct 150k dataset based on COCO](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)
5. Finetuned Model: [RaviNaik/Llava-Phi2](https://huggingface.co/RaviNaik/Llava-Phi2)
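
For a sense of scale, the vision tower's name encodes its geometry: `patch14-336` means 336×336 pixel inputs split into 14×14 pixel patches. The arithmetic below is a small sketch of what that implies. How many of these patch embeddings are actually passed to Phi2 depends on the projector and feature selection in llava-phi, which this card does not specify.

```python
# Geometry implied by the name "clip-vit-large-patch14-336".
IMAGE_SIZE = 336   # input resolution in pixels (square)
PATCH_SIZE = 14    # side length of each ViT patch in pixels

patches_per_side = IMAGE_SIZE // PATCH_SIZE   # 24
num_patches = patches_per_side ** 2           # 576 patch embeddings per image

print(patches_per_side, num_patches)
```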


### Model Sources

- **Original Repository:** [LLaVA-Phi](https://github.com/zhuyiche/llava-phi)
- **Paper:** [LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model](https://arxiv.org/pdf/2401.02330)
- **Demo:** [MultiModal-Phi2 Space](https://huggingface.co/spaces/RaviNaik/MultiModal-Phi2)


## How to Get Started with the Model

Follow the steps below to get started with the model.
1. Clone the [LLaVA-Phi](https://github.com/zhuyiche/llava-phi) repository and navigate to the `llava-phi` folder
```bash
git clone https://github.com/zhuyiche/llava-phi.git
cd llava-phi
```
2. Install the package
```bash
conda create -n llava_phi python=3.10 -y
conda activate llava_phi
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
3. Run the model
```bash
python llava_phi/eval/run_llava_phi.py --model-path="RaviNaik/Llava-Phi2" \
    --image-file="https://huggingface.co/RaviNaik/Llava-Phi2/resolve/main/people.jpg?download=true" \
    --query="How many people are there in the image?"
```
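
To ask several questions about the same image, the command above can be wrapped in a small Python helper. This is a minimal sketch and not part of the llava-phi repository: `build_command` and `ask` are hypothetical names, and it assumes the script is run from the `llava-phi` folder with the `llava_phi` environment active. It passes only the three flags shown above.

```python
import subprocess
import sys

# Path of the evaluation script, relative to the llava-phi checkout.
SCRIPT = "llava_phi/eval/run_llava_phi.py"
MODEL = "RaviNaik/Llava-Phi2"


def build_command(image_file, query, model_path=MODEL):
    """Return the argv list for one run of the script (hypothetical helper)."""
    return [
        sys.executable,  # the Python interpreter of the active environment
        SCRIPT,
        f"--model-path={model_path}",
        f"--image-file={image_file}",
        f"--query={query}",
    ]


def ask(image_file, query):
    """Run the script once and return everything it printed to stdout."""
    result = subprocess.run(
        build_command(image_file, query),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script exits non-zero
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with the `?` and spaces in the URL and query. Each call to `ask` starts a new process and reloads the model, so this is convenient for a handful of questions but slow for large batches.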

### Acknowledgement
This implementation is based on the wonderful work of: \
[LLaVA-Phi](https://github.com/zhuyiche/llava-phi) \
[LLaVA](https://github.com/haotian-liu/LLaVA) \
[Phi2](https://huggingface.co/microsoft/phi-2)