---
library_name: transformers
license: apache-2.0
language:
- en
---

# LeroyDyer/Mixtral_AI_Cyber_Q_Vision

A vision encoder-decoder model pairing a ViT-style image encoder with a Mistral-style text decoder for image-to-text generation (see the architecture sketch below).

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- **Developed by:** LeroyDyer
- **Model type:** Vision encoder-decoder (ViT-style image encoder with a Mistral-style text decoder)
- **Language(s) (NLP):** English





### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

```python
from transformers import AutoProcessor, VisionEncoderDecoderModel
import requests
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")
model = VisionEncoderDecoderModel.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")

# load an example image (a handwritten line from the IAM dataset)
url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# training: the decoder needs its start, padding, and vocabulary settings configured
model.config.decoder_start_token_id = processor.tokenizer.eos_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.vocab_size = model.config.decoder.vocab_size

pixel_values = processor(image, return_tensors="pt").pixel_values
text = "hello world"
labels = processor.tokenizer(text, return_tensors="pt").input_ids
outputs = model(pixel_values=pixel_values, labels=labels)  # forward pass computes the loss
loss = outputs.loss

# inference (generation)
with torch.no_grad():
    generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
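Generation can be tuned through the standard `generate` arguments. A minimal sketch, reusing `model`, `processor`, and `pixel_values` from the snippet above; the decoding settings are illustrative, not values tuned for this model:

```python
import torch

# illustrative decoding settings; reuses `model`, `processor`, and
# `pixel_values` from the quick-start snippet above
with torch.no_grad():
    generated_ids = model.generate(
        pixel_values,
        max_new_tokens=64,    # cap the length of the generated text
        num_beams=4,          # beam search often improves image-to-text quality
        early_stopping=True,  # stop once all beams are finished
    )
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```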


## Training Details

```python
from transformers import ViTImageProcessor, AutoTokenizer, VisionEncoderDecoderModel
from datasets import load_dataset

image_processor = ViTImageProcessor.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")
tokenizer = AutoTokenizer.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")

# the checkpoint is already a combined vision encoder-decoder model, so it is
# loaded directly rather than re-assembled from separate encoder and decoder
# checkpoints
model = VisionEncoderDecoderModel.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")

# Mistral-style tokenizers have no CLS token, so BOS serves as the decoder start
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]
pixel_values = image_processor(image, return_tensors="pt").pixel_values

labels = tokenizer(
    "an image of two cats chilling on a couch",
    return_tensors="pt",
).input_ids

# the forward pass automatically creates the correct decoder_input_ids from the labels
loss = model(pixel_values=pixel_values, labels=labels).loss
```
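The snippet above computes a single loss; a minimal sketch of one optimization step around it, reusing `model`, `pixel_values`, and `labels` from above. The optimizer choice and learning rate are illustrative assumptions, not the recipe this checkpoint was trained with:

```python
import torch

# illustrative optimizer and learning rate; not this model's actual training recipe
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
loss = model(pixel_values=pixel_values, labels=labels).loss
loss.backward()        # backpropagate through both decoder and encoder
optimizer.step()       # update the trainable parameters
optimizer.zero_grad()  # clear gradients before the next step
```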


### Model Architecture and Objective

```python
from transformers import MistralConfig, ViTConfig, VisionEncoderDecoderConfig, VisionEncoderDecoderModel

# Initializing a ViT & Mistral style configuration
config_encoder = ViTConfig()
config_decoder = MistralConfig()

config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)

# Initializing a ViT-Mistral model (with random weights) from the configurations above
model = VisionEncoderDecoderModel(config=config)

# Accessing the model configuration
config_encoder = model.config.encoder
config_decoder = model.config.decoder
# set the decoder config to causal LM with cross-attention to the encoder
config_decoder.is_decoder = True
config_decoder.add_cross_attention = True

# Saving the model, including its configuration
model.save_pretrained("my-model")

# Loading the model and config back from the saved folder
encoder_decoder_config = VisionEncoderDecoderConfig.from_pretrained("my-model")
model = VisionEncoderDecoderModel.from_pretrained("my-model", config=encoder_decoder_config)
```
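
Note that constructing the model from fresh configurations, as above, yields random weights; the pretrained behaviour comes only from loading a checkpoint with `from_pretrained`. Saving with `save_pretrained` writes both the combined configuration and the weights, so the round trip at the end restores the full encoder-decoder model.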