---
tags:
- RUDOLPH
- text-image
- image-text
- decoder
datasets:
- sberquad
---

# RUDOLPH-350M (Small)

RUDOLPH: One Hyper-Tasking Transformer Can Be Creative as DALL-E and GPT-3 and Smart as CLIP

<img src="https://raw.githubusercontent.com/sberbank-ai/ru-dolph/master/pics/RUDOLPH.png" width=60% border="2"/>


The model was trained by the [Sber AI](https://github.com/ai-forever) team.

# Model Description

**RU**ssian **D**ecoder **O**n **L**anguage **P**icture **H**yper-tasking (**RUDOLPH**) **350M** is a fast and light text-image-text transformer designed for quick and easy fine-tuning on a range of tasks: from generating images from text descriptions and image classification to visual question answering and more. This model demonstrates the power of Hyper-tasking Transformers.

*A hyper-tasking model is a generalized multi-tasking model, i.e., a model that can solve almost all tasks within its supported modalities, necessarily including mutual pairwise translations between modalities (two modalities in the case of RUDOLPH: images and Russian texts).*

* Tasks: `text2image generation, self reranking, text ranking, image ranking, image2text generation, zero-shot image classification, text2text generation, and so on`
* Language: `Russian`
* Type: `decoder`
* Num Parameters: `350M`
* Training Data Volume: `141 million text-image pairs, 7.6 million text paragraphs`

# Details of architecture

<img src="https://raw.githubusercontent.com/ai-forever/ru-dolph/master/pics/scheme-rudolph_350m.jpg" height="20" border="2"/>

The maximum sequence length this model supports depends on the modality: 64 tokens for the left text, 256 for the image, and 64 for the right text.

RUDOLPH 350M is a Transformer-based decoder model with the following parameters:

* num\_layers (24) — Number of hidden layers in the Transformer decoder.
* hidden\_size (1024) — Dimensionality of the hidden layers.
* num\_attention\_heads (16) — Number of attention heads for each attention layer.
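
These hyperparameters are roughly consistent with the stated 350M parameter count. As a sanity check, here is a back-of-the-envelope estimate for a standard decoder block (assuming the conventional 4x MLP expansion; embeddings, biases, and layer norms are ignored, so this is illustrative only):

```python
# Rough parameter-count estimate for a 24-layer decoder with hidden size 1024.
# Assumes a standard Transformer block with 4x MLP expansion; embedding
# tables, biases, and layer norms are omitted, so the total is approximate.
num_layers, hidden = 24, 1024
attn = 4 * hidden * hidden           # Q, K, V, and output projections
mlp = 2 * hidden * (4 * hidden)      # two linear layers with 4x expansion
total = num_layers * (attn + mlp)
print(f"{total / 1e6:.0f}M")         # ~302M; embeddings bring it near 350M
```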

# Sparse Attention Masks

The primary proposed method modifies the sparse transformer's attention mask to better control modalities. This allows computing transitions between modalities in both directions, unlike the similar DALL-E Transformer, which used only one direction, "text to image". The proposed "image to right text" direction is achieved by extending the sparse attention mask to the right for auto-regressive text generation conditioned on both the image and the left text.

<img src="https://raw.githubusercontent.com/sberbank-ai/ru-dolph/master/pics/attention_masks_350m.png" height="40" border="2"/>
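
To make the layout concrete, the following sketch builds a dense causal mask over the concatenated `[left text | image | right text]` sequence using the lengths from this card (64 / 256 / 64). This is a simplification: RUDOLPH's actual masks additionally sparsify the image block (row/column patterns, as shown in the figure above), which is omitted here.

```python
import numpy as np

# Segment lengths taken from the model card: 64 left-text tokens,
# 256 image tokens, 64 right-text tokens.
L_TEXT, IMAGE, R_TEXT = 64, 256, 64
total = L_TEXT + IMAGE + R_TEXT  # 384

# Dense causal (lower-triangular) mask: each token attends to itself and
# all earlier positions. The real RUDOLPH masks sparsify the image block;
# this sketch only illustrates the overall block structure.
mask = np.tril(np.ones((total, total), dtype=bool))

# The "image to right text" direction: every right-text token attends to
# the entire left text and the entire image.
first_right = L_TEXT + IMAGE
assert mask[first_right, :first_right].all()
```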

# Authors

+ Alex Shonenkov: [Github](https://github.com/shonenkov), [Kaggle GM](https://www.kaggle.com/shonenkov)
+ Michael Konstantinov: [Mishin Learning](https://t.me/mishin_learning), [Transformer Community](https://transformer.community/)