LLaVA-JP Model Card
This is a pretrained checkpoint; you can use it to instruction-tune your own multimodal models.
Check out the instructions here
Model details
Model type:
LLaVA-JP is a vision-language model that can converse about input images.
This model is an LVLM trained with google/siglip-so400m-patch14-384 as the image encoder and llm-jp/llm-jp-1.3b-v1.0 as the text decoder. It supports 768 x 768 high-resolution image input via the scaling_on_scales method.
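As a rough illustration only (not the card's official usage snippet), the two published components named above can be loaded with Hugging Face transformers. The projector and prompt format that tie them together ship with this checkpoint and are not reproduced here; the dummy image and the shapes printed are assumptions for demonstration.

```python
import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoModel,
    AutoModelForCausalLM,
    AutoTokenizer,
)

# Image encoder: SigLIP, as named in this card.
image_processor = AutoImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")
image_encoder = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")

# Text decoder: llm-jp 1.3B, as named in this card.
tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-1.3b-v1.0")
text_decoder = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-1.3b-v1.0")

# Encode a dummy 768 x 768 image at the encoder's native 384 x 384 resolution.
# (LLaVA-JP handles the full 768 x 768 input through scaling_on_scales, which also
# runs the encoder on 384 x 384 crops and merges the features; that step is omitted.)
image = Image.new("RGB", (768, 768))
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    image_features = image_encoder.vision_model(pixel_values).last_hidden_state
print(image_features.shape)  # (1, num_patches, hidden_size)
```

In the full model, these image features are projected into the text decoder's embedding space and prepended to the prompt tokens, as in other LLaVA-style architectures.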
Training dataset
Acknowledgement
License
Apache-2.0