File size: 3,697 Bytes
00e2113
 
 
 
 
f8ff713
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
---
library_name: transformers
tags: []
---

# Malaysian TinyLlama + siglip-large-patch16-384

WanDB https://wandb.ai/huseinzol05/vision-tinyllama?workspace=user-huseinzol05

## how-to

```python
from modeling_vision import MM_LLMs, MM_LLMs_Config
from transformers import AutoTokenizer, AutoProcessor
from PIL import Image
import requests

model = MM_LLMs.from_pretrained(
    'mesolitica/malaysian-tinyllama-1.1b-siglip-large-384-vision',
    flash_attention = True,
    dtype = torch.bfloat16,
    torch_dtype = torch.bfloat16
)
_ = model.cuda()

image_processor = AutoProcessor.from_pretrained('google/siglip-large-patch16-384')
tokenizer = AutoTokenizer.from_pretrained('mesolitica/malaysian-tinyllama-1.1b-siglip-large-384-vision')

def prepare_dataset(messages, images: List[str] = None):
    if images is not None:
        images = [Image.open(f).convert('RGB') for f in images]
        image_output = image_processor(images=images, return_tensors='pt')['pixel_values']
    else:
        image_output = None
    
    prompt = tokenizer.apply_chat_template(messages, tokenize = False)
    outputs = tokenizer(
                    prompt,
                    return_tensors='pt',
                    return_overflowing_tokens=False,
                    return_length=False)

    outputs['images'] = image_output
    outputs['image_index'] = torch.tensor([0] * len(outputs['images']))
    outputs['image_starts'] = torch.tensor([tokenizer.convert_tokens_to_ids('<image>')] * len(outputs['images']))
    return outputs

with open('Persian-cat-breed.jpg', 'wb') as fopen:
    fopen.write(requests.get('https://cdn.beautifulnara.net/wp-content/uploads/2017/12/10201620/Persian-cat-breed.jpg').content)

with open('nasi-goreng-1-23.jpg', 'wb') as fopen:
    fopen.write(requests.get('https://www.jocooks.com/wp-content/uploads/2023/09/nasi-goreng-1-23.jpg').content)

messages = [
    {'role': 'user', 'content': '<image> </image> ini gambar apa'},
]
outputs = prepare_dataset(messages, images = ['Persian-cat-breed.jpg'])
outputs['images'] = outputs['images'].type(model.dtype)
for k in outputs.keys():
    if outputs[k] is not None:
        outputs[k] = outputs[k].cuda()

with torch.no_grad():
    model_inputs = model.prepare_inputs_for_generation(**outputs)
r = model_inputs.pop('input_ids', None)

generate_kwargs = dict(
    model_inputs,
    max_new_tokens=300,
    top_p=0.95,
    top_k=50,
    temperature=0.1,
    do_sample=True,
    num_beams=1,
)

r = model.llm.generate(**generate_kwargs)
print(tokenizer.decode(r[0]))
```

```
<s>Imej itu menunjukkan seekor kucing putih yang comel duduk di atas sofa hitam.</s>
```

```python
messages = [
    {'role': 'user', 'content': '<image> </image> <image> </image> apa kaitan 2 gambar ni'},
]
outputs = prepare_dataset(messages, images = ['Persian-cat-breed.jpg', 'nasi-goreng-1-23.jpg'])
outputs['images'] = outputs['images'].type(model.dtype)
for k in outputs.keys():
    if outputs[k] is not None:
        outputs[k] = outputs[k].cuda()

with torch.no_grad():
    model_inputs = model.prepare_inputs_for_generation(**outputs)
r = model_inputs.pop('input_ids', None)

generate_kwargs = dict(
    model_inputs,
    max_new_tokens=300,
    top_p=0.95,
    top_k=50,
    temperature=0.1,
    do_sample=True,
    num_beams=1,
)

r = model.llm.generate(**generate_kwargs)
print(tokenizer.decode(r[0]))
```

```
<s>Tiada hubungan yang jelas antara gambar 1 (anak kucing putih duduk di atas sofa) dan gambar 2 (foto penutup mangkuk mi telur dengan nasi dan cili). Gambar pertama ialah imej haiwan, manakala gambar kedua ialah imej makanan. Mereka tergolong dalam kategori yang berbeza dan tidak mempunyai hubungan antara satu sama lain.</s>
```