[**中文说明**](README_CN.md) | [**English**](README.md)
# Introduction
This project aims to provide a better Chinese CLIP model. The training data consist of publicly accessible image URLs and their associated Chinese text descriptions, totaling 400M pairs; after filtering, we used 100M of them for training.
The project was carried out at the QQ-ARC Joint Lab, Tencent PCG.
For more details, please refer to the [QA-CLIP project page](https://huggingface.co/TencentARC/QA-CLIP). We have also open-sourced the model on GitHub at [QA-CLIP](https://github.com/TencentARC-QQ/QA-CLIP); stars are welcome!
<br><br>

## Results
For image-text retrieval, we ran zero-shot evaluations on [MUGE Retrieval](https://tianchi.aliyun.com/muge), [Flickr30K-CN](https://github.com/li-xirong/cross-lingual-cap), and [COCO-CN](https://github.com/li-xirong/coco-cn).
For zero-shot image classification, we evaluated on the ImageNet dataset. The results are listed in the tables below; minimal sketches of the retrieval and classification protocols follow the corresponding tables.


**Flickr30K-CN Zero-shot Retrieval (Official Test Set)**:
<table border="1" width="120%">
	<tr align="center">
        <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
    </tr>
    <tr align="center">
        <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
    </tr>
	<tr align="center">
        <td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.8</td><td>76.0</td><td>84.6</td><td>60.0</td><td>85.9</td><td>92.0</td>
    </tr>
	<tr align="center", style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.5</b></td><td><b>77.4</b></td><td><b>86.1</b></td><td><b>67.1</b></td><td><b>87.9</b></td><td><b>93.2</b></td>
    </tr>
	<tr align="center">
        <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.7</td><td>86.9</td><td>92.8</td><td>74.6</td><td>93.5</td><td>97.1</td>
    </tr>  
	<tr align="center", style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>63.8</b></td><td><b>88.0</b></td><td><b>93.2</b></td><td><b>78.4</b></td><td><b>96.1</b></td><td><b>98.5</b></td>
    </tr> 
	<tr align="center">
        <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>68.0</td><td>89.7</td><td>94.4</td><td>80.2</td><td>96.6</td><td>98.2</td>
    </tr> 
	<tr align="center">
        <td width="120%">AltClip<sub>ViT-L/14</sub></td><td><b>69.7</b></td><td>90.1</td><td><b>94.8</b></td><td>84.8</td><td>97.7</td><td>99.1</td>
    </tr>
	<tr align="center", style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td>69.3</td><td><b>90.3</b></td><td>94.7</td><td><b>85.3</b></td><td><b>97.9</b></td><td><b>99.2</b></td>
    </tr>
</table>
<br>

**MUGE Zero-shot Retrieval (Official Validation Set)**:
<table border="1" width="120%">
	<tr align="center">
        <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
    </tr>
    <tr align="center">
        <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
    </tr>
	<tr align="center">
        <td width="120%">CN-CLIP<sub>RN50</sub></td><td>42.6</td><td>68.5</td><td>78.0</td><td>30.0</td><td>56.2</td><td>66.9</td>
    </tr>
	<tr align="center", style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>44.0</b></td><td><b>69.9</b></td><td><b>79.5</b></td><td><b>32.4</b></td><td><b>59.5</b></td><td><b>70.3</b></td>
    </tr>
	<tr align="center">
        <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>52.1</td><td>76.7</td><td>84.4</td><td>38.7</td><td>65.6</td><td>75.1</td>
    </tr>  
	<tr align="center", style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>53.2</b></td><td><b>77.7</b></td><td><b>85.1</b></td><td><b>40.7</b></td><td><b>68.2</b></td><td><b>77.2</b></td>
    </tr> 
	<tr align="center">
        <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>56.4</td><td>79.8</td><td>86.2</td><td>42.6</td><td>69.8</td><td>78.6</td>
    </tr> 
	<tr align="center">
        <td width="120%">AltClip<sub>ViT-L/14</sub></td><td>29.6</td><td>49.9</td><td>58.8</td><td>21.4</td><td>42.0</td><td>51.9</td>
    </tr>
	<tr align="center", style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>57.4</b></td><td><b>81.0</b></td><td><b>87.7</b></td><td><b>45.5</b></td><td><b>73.0</b></td><td><b>81.4</b></td>
    </tr>
</table>
<br>

**COCO-CN Zero-shot Retrieval (Official Test Set)**:
<table border="1" width="120%">
	<tr align="center">
        <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
    </tr>
    <tr align="center">
        <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
    </tr>
	<tr align="center">
        <td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.1</td><td>81.3</td><td>90.5</td><td>50.9</td><td>81.1</td><td>90.5</td>
    </tr>
	<tr align="center", style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.1</b></td><td><b>82.5</b></td><td><b>91.7</b></td><td><b>56.7</b></td><td><b>85.2</b></td><td><b>92.9</b></td>
    </tr>
	<tr align="center">
        <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.2</td><td>87.1</td><td>94.9</td><td>56.3</td><td>84.0</td><td>93.3</td>
    </tr>  
	<tr align="center", style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>62.9</b></td><td><b>87.7</b></td><td><b>94.7</b></td><td><b>61.5</b></td><td><b>87.6</b></td><td><b>94.8</b></td>
    </tr> 
	<tr align="center">
        <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>64.9</td><td>88.8</td><td>94.2</td><td>60.6</td><td>84.4</td><td>93.1</td>
    </tr> 
	<tr align="center">
        <td width="120%">AltClip<sub>ViT-L/14</sub></td><td>63.5</td><td>87.6</td><td>93.5</td><td>62.6</td><td><b>88.5</b></td><td><b>95.9</b></td>
    </tr>
	<tr align="center", style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>65.7</b></td><td><b>90.2</b></td><td><b>95.0</b></td><td><b>64.5</b></td><td>88.3</td><td>95.1</td>
    </tr>
</table>
<br>
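
In the retrieval tables above, Recall@K (R@1/R@5/R@10) is the fraction of queries whose ground-truth match appears among the top-K candidates ranked by image-text similarity. The following is a minimal sketch of how such a metric can be computed from cached, L2-normalized features; the function name and the one-ground-truth-per-query assumption are illustrative, not the exact benchmark evaluation code.

```python
import torch

def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    """Recall@K for a (num_queries, num_candidates) similarity matrix.

    Illustrative assumption: the ground-truth candidate for query i is
    candidate i; the actual benchmarks may pair one image with several
    ground-truth captions.
    """
    topk = similarity.topk(k, dim=1).indices                       # top-K candidate ids per query
    ground_truth = torch.arange(similarity.size(0)).unsqueeze(1)   # expected id per query
    return (topk == ground_truth).any(dim=1).float().mean().item()

# text-to-image retrieval with cached, L2-normalized features:
# similarity = text_features @ image_features.T
# r1, r5, r10 = (recall_at_k(similarity, k) for k in (1, 5, 10))
```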

**Zero-shot Image Classification on ImageNet**:
<table border="1" width="120%">
	<tr align="center">
        <th>Task</th><th colspan="1">ImageNet</th>
    </tr>
	<tr align="center">
        <td width="120%">CN-CLIP<sub>RN50</sub></td><td>33.5</td>
    </tr>
	<tr align="center", style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>35.5</b></td>
    </tr>
	<tr align="center">
        <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>48.4</td>
    </tr>  
	<tr align="center", style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>49.7</b></td>
    </tr> 
	<tr align="center">
        <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>54.7</td>
    </tr>
	<tr align="center", style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>55.8</b></td>
    </tr>
</table>
<br>
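
The ImageNet numbers above follow the standard zero-shot classification protocol: each class name is wrapped into a Chinese text prompt, the text features are compared with the image feature, and the most similar class is taken as the prediction. Below is a minimal sketch of this procedure with the released checkpoint; the prompt template, class names, and image path are illustrative assumptions rather than the exact evaluation setup.

```python
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")

# illustrative Chinese class names and a single prompt template
class_names = ["金鱼", "猫", "狗"]
texts = [f"一张{name}的照片" for name in class_names]

# encode all class prompts once and L2-normalize
text_inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**text_inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)

# encode the query image and L2-normalize
image = Image.open("example.jpg")  # any local image
image_inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**image_inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)

# cosine similarity -> predicted class
similarity = image_features @ text_features.T
print(class_names[similarity.argmax(dim=-1).item()])
```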

<br><br>


# Getting Started
## Inference Code
Example inference code:
```python
from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)
```
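
Note that the snippet computes similarity in two ways: `get_image_features`/`get_text_features` return embeddings you can cache for large-scale retrieval, while the full `model(**inputs)` forward pass also applies the model's learned temperature (logit scale) before the softmax. Continuing the example above, a minimal sketch of scoring directly with the cached, normalized features (variable names reused from the snippet):

```python
# cosine similarities between the cached, normalized features; shape (1, 4)
similarity = image_features @ text_features.T
# probs from the full forward pass differ from similarity.softmax(dim=1)
# because the forward pass scales the logits by the learned temperature first
print(texts[similarity.argmax(dim=-1).item()], probs)
```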
<br><br>

# Acknowledgements
The project code is built on <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b>. Many thanks to the authors for their excellent open-source work.
<br><br>