---
language:
- ja
tags:
- heron
- vision
- image-captioning
- VQA
pipeline_tag: image-to-text
license:
- cc-by-nc-4.0
inference: false
---
# Heron BLIP Japanese StableLM Base 7B

![heron](./heron_image.png)

## DEMO

You can play the demo of this model [here](https://huggingface.co/spaces/turing-motors/heron_chat_blip).

## Model Details
Heron BLIP Japanese StableLM Base 7B is a vision-language model that can converse about input images.<br>
This model was trained using [the heron library](https://github.com/turingmotors/heron). Please refer to the code for details.


## Usage

First, follow [the installation guide](https://github.com/turingmotors/heron/tree/dev-0.0.1#1-clone-this-repository) to set up the heron library, then run the example below.

```python
import torch
from heron.models.video_blip import VideoBlipForConditionalGeneration, VideoBlipProcessor
from transformers import LlamaTokenizer

device_id = 0
device = f"cuda:{device_id}"

max_length = 512
MODEL_NAME = "turing-motors/heron-chat-blip-ja-stablelm-base-7b-v0"

model = VideoBlipForConditionalGeneration.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, ignore_mismatched_sizes=True
)

model = model.half()
model.eval()
model.to(device)

# prepare a processor
processor = VideoBlipProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
tokenizer = LlamaTokenizer.from_pretrained("novelai/nerdstash-tokenizer-v1", additional_special_tokens=['▁▁'])
processor.tokenizer = tokenizer

import requests
from PIL import Image

# prepare inputs
url = "https://www.barnorama.com/wp-content/uploads/2016/12/03-Confusing-Pictures.jpg"
image = Image.open(requests.get(url, stream=True).raw)

text = f"##human: この画像の面白い点は何ですか?\n##gpt: "

# do preprocessing
inputs = processor(
    text=text,
    images=image,
    return_tensors="pt",
    truncation=True,
)

inputs = {k: v.to(device) for k, v in inputs.items()}
inputs["pixel_values"] = inputs["pixel_values"].to(device, torch.float16)

# stop generation at the pad/eos tokens or at "##", which begins the next turn marker
eos_token_id_list = [
    processor.tokenizer.pad_token_id,
    processor.tokenizer.eos_token_id,
    int(tokenizer.convert_tokens_to_ids("##"))
]

# do inference
with torch.no_grad():
    out = model.generate(**inputs, max_length=256, do_sample=False, temperature=0., eos_token_id=eos_token_id_list, no_repeat_ngram_size=2)

# print result
print(processor.tokenizer.batch_decode(out))
```
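The decoded output contains the prompt followed by the model's reply after the `##gpt: ` marker. Continuing from the example above, a minimal post-processing sketch might look like the following; the `extract_reply` helper and its cleanup rules are illustrative assumptions, not part of the official example.

```python
# Hypothetical helper: pull only the assistant's reply out of the decoded text.
# Assumes the "##human: ... ##gpt: " prompt format used above.
def extract_reply(decoded: str) -> str:
    reply = decoded.split("##gpt:")[-1]   # keep text after the last "##gpt:" marker
    reply = reply.split("##")[0]          # drop any trailing turn marker
    return reply.strip()

decoded = processor.tokenizer.batch_decode(out, skip_special_tokens=True)[0]
print(extract_reply(decoded))
```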


## Model Details
* **Developed by**: [Turing Inc.](https://www.turing-motors.com/)
* **Adaptor type**: [BLIP2](https://arxiv.org/abs/2301.12597)
* **Language Model**: [Japanese StableLM Base Alpha](https://huggingface.co/stabilityai/japanese-stablelm-base-alpha-7b)
* **Language(s)**: Japanese
* **License**: This model is licensed under [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Training
This model was initially trained with the Adaptor using STAIR Captions. In the second phase, it was fine-tuned with [LLaVA-Instruct-150K-JA](https://huggingface.co/datasets/turing-motors/LLaVA-Instruct-150K-JA) and Japanese Visual Genome using LoRA.
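Training itself is driven by the heron library's configuration files. Purely as an illustration of what the LoRA step involves, the sketch below attaches a LoRA adapter to a small stand-in model with the `peft` library; the stand-in model, rank, dropout, and target modules are assumptions chosen for demonstration, not the settings used to train this model.

```python
# Illustrative sketch only: how a LoRA adapter is attached with the peft library.
# The stand-in model and all hyperparameters below are assumptions for demonstration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lm = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # small stand-in base model

lora_config = LoraConfig(
    r=16,                                 # assumed rank
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections of the stand-in model
    bias="none",
    task_type="CAUSAL_LM",
)

lm = get_peft_model(lm, lora_config)
lm.print_trainable_parameters()           # only the LoRA weights remain trainable
```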

### Training Dataset

- [LLaVA-Instruct-150K-JA](https://huggingface.co/datasets/turing-motors/LLaVA-Instruct-150K-JA)
- [Japanese STAIR Captions](http://captions.stair.center/)
- [Japanese Visual Genome VQA dataset](https://github.com/yahoojapan/ja-vg-vqa)
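For reference, the instruction-tuning data listed above can be downloaded and inspected with the `datasets` library. This is a minimal sketch; the available splits and columns depend on the repository layout, and `data_files` may need to be passed explicitly if the repository stores raw JSON.

```python
# Minimal sketch: download and inspect LLaVA-Instruct-150K-JA from the Hub.
# If load_dataset cannot infer the format, pass data_files explicitly.
from datasets import load_dataset

ds = load_dataset("turing-motors/LLaVA-Instruct-150K-JA")
print(ds)                     # show the available splits
first_split = next(iter(ds.values()))
print(first_split[0])         # peek at the first example
```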

## Use and Limitations

### Intended Use

This model is intended for use in chat-like applications and for research purposes.

### Limitations

The model may produce inaccurate or false information, and its accuracy is not guaranteed. It is still in the research and development stage.

## How to cite
```bibtex
@misc{BlipJapaneseStableLM,
    url    = {https://huggingface.co/turing-motors/heron-chat-blip-ja-stablelm-base-7b-v0},
    title  = {Heron BLIP Japanese StableLM Base 7B},
    author = {Tanahashi, Kotaro and Inoue, Yuichi and Yamaguchi, Yu}
}
```

## Citations

```bibtex
@misc{JapaneseInstructBLIPAlpha,
    url    = {https://huggingface.co/stabilityai/japanese-instructblip-alpha},
    title  = {Japanese InstructBLIP Alpha},
    author = {Shing, Makoto and Akiba, Takuya}
}
```
