File size: 4,258 Bytes
adb9ac2 a493f1d adb9ac2 1de6fd5 adb9ac2 8babac7 adb9ac2 1ca9d23 8babac7 adb9ac2 8babac7 adb9ac2 8babac7 adb9ac2 8babac7 adb9ac2 1a0f133 adb9ac2 8babac7 adb9ac2 8babac7 adb9ac2 8babac7 adb9ac2 8babac7 adb9ac2 1a0f133 adb9ac2 7d73888 a493f1d 7d73888 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 |
---
language:
- ja
tags:
- heron
- vision
- image-captioning
- VQA
pipeline_tag: image-to-text
license:
- cc-by-nc-4.0
inference: false
---
# Heron BLIP Japanese StableLM Base 7B
![heron](./heron_image.png)
## DEMO
You can play the demo of this model [here](https://huggingface.co/spaces/turing-motors/heron_chat_blip).
## Model Details
Heron BLIP Japanese StableLM Base 7B is a vision-language model that can converse about input images.<br>
This model was trained using [the heron library](https://github.com/turingmotors/heron). Please refer to the code for details.
## Usage
Follow [the installation guide](https://github.com/turingmotors/heron/tree/dev-0.0.1#1-clone-this-repository).
```python
import torch
from heron.models.video_blip import VideoBlipForConditionalGeneration, VideoBlipProcessor
from transformers import LlamaTokenizer
device_id = 0
device = f"cuda:{device_id}"
max_length = 512
MODEL_NAME = "turing-motors/heron-chat-blip-ja-stablelm-base-7b-v0"
model = VideoBlipForConditionalGeneration.from_pretrained(
MODEL_NAME, torch_dtype=torch.float16, ignore_mismatched_sizes=True
)
model = model.half()
model.eval()
model.to(device)
# prepare a processor
processor = VideoBlipProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
tokenizer = LlamaTokenizer.from_pretrained("novelai/nerdstash-tokenizer-v1", additional_special_tokens=['▁▁'])
processor.tokenizer = tokenizer
import requests
from PIL import Image
# prepare inputs
url = "https://www.barnorama.com/wp-content/uploads/2016/12/03-Confusing-Pictures.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = f"##human: この画像の面白い点は何ですか?\n##gpt: "
# do preprocessing
inputs = processor(
text=text,
images=image,
return_tensors="pt",
truncation=True,
)
inputs = {k: v.to(device) for k, v in inputs.items()}
inputs["pixel_values"] = inputs["pixel_values"].to(device, torch.float16)
# set eos token
eos_token_id_list = [
processor.tokenizer.pad_token_id,
processor.tokenizer.eos_token_id,
int(tokenizer.convert_tokens_to_ids("##"))
]
# do inference
with torch.no_grad():
out = model.generate(**inputs, max_length=256, do_sample=False, temperature=0., eos_token_id=eos_token_id_list, no_repeat_ngram_size=2)
# print result
print(processor.tokenizer.batch_decode(out))
```
## Model Details
* **Developed by**: [Turing Inc.](https://www.turing-motors.com/)
* **Adaptor type**: [BLIP2](https://arxiv.org/abs/2301.12597)
* **Lamguage Model**: [Japanese StableLM Base Alpha](https://huggingface.co/stabilityai/japanese-stablelm-base-alpha-7b)
* **Language(s)**: Japanese
* **License**: This model is licensed under [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
### Training
This model was initially trained with the Adaptor using STAIR Captions. In the second phase, it was fine-tuned with [LLaVA-Instruct-150K-JA](https://huggingface.co/datasets/turing-motors/LLaVA-Instruct-150K-JA) and Japanese Visual Genome using LoRA.
### Training Dataset
- [LLaVA-Instruct-150K-JA](https://huggingface.co/datasets/turing-motors/LLaVA-Instruct-150K-JA)
- [Japanese STAIR Captions](http://captions.stair.center/)
- [Japanese Visual Genome VQA dataset](https://github.com/yahoojapan/ja-vg-vqa)
## Use and Limitations
### Intended Use
This model is intended for use in chat-like applications and for research purposes.
### Limitations
The model may produce inaccurate or false information, and its accuracy is not guaranteed. It is still in the research and development stage.
## How to cite
```bibtex
@misc{BlipJapaneseStableLM,
url = {[https://huggingface.co/turing-motors/heron-chat-blip-ja-stablelm-base-7b-v0](https://huggingface.co/turing-motors/heron-chat-blip-ja-stablelm-base-7b-v0)},
title = {Heron BLIP Japanese StableLM Base 7B},
author = {Kotaro Tanahashi, Yuichi Inoue, and Yu Yamaguchi}
}
```
## Citations
```bibtex
@misc{JapaneseInstructBLIPAlpha,
url = {[https://huggingface.co/stabilityai/japanese-instructblip-alpha](https://huggingface.co/stabilityai/japanese-instructblip-alpha)},
title = {Japanese InstructBLIP Alpha},
author = {Shing, Makoto and Akiba, Takuya}
}
```
---
license: cc-by-nc-4.0
---
|