Kunyi committed on
Commit 1316154
1 Parent(s): a88b3eb

Upload 6 files

Files changed (6)
  1. README.md +282 -0
  2. README_CN.md +280 -0
  3. config.json +120 -0
  4. preprocessor_config.json +21 -0
  5. pytorch_model.bin +3 -0
  6. vocab.txt +0 -0
README.md CHANGED
@@ -1,3 +1,285 @@
---
license: apache-2.0
+ widget:
+ - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
+   candidate_labels: 音乐表演, 体育运动
+   example_title: 猫和狗
---
+ [**中文说明**](README_CN.md) | [**English**](README.md)
+ # Introduction
+ This project aims to provide a better Chinese CLIP model. The training data consists of publicly accessible image URLs with associated Chinese text descriptions, totaling 400 million pairs; after filtering, we ultimately used 100 million pairs for training.
+ This project was produced by the QQ-ARC Joint Lab, Tencent PCG.
+ <br><br>
+
+ # Models and Results
+ <span id="model_card"></span>
+ ## Model Card
+ QA-CLIP currently provides three open-source models of different sizes; their details and download links are listed in the table below:
+ <table border="1" width="100%">
+ <tr align="center">
+ <th>Model</th><th>Checkpoint</th><th>Params</th><th>Vision</th><th>Params of Vision</th><th>Text</th><th>Params of Text</th><th>Resolution</th>
+ </tr>
+ <tr align="center">
+ <td>QA-CLIP<sub>RN50</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-RN50.pt">Download</a></td><td>77M</td><td>ResNet50</td><td>38M</td><td>RBT3</td><td>39M</td><td>224</td>
+ </tr>
+ <tr align="center">
+ <td>QA-CLIP<sub>ViT-B/16</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-base.pt">Download</a></td><td>188M</td><td>ViT-B/16</td><td>86M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>224</td>
+ </tr>
+ <tr align="center">
+ <td>QA-CLIP<sub>ViT-L/14</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-large.pt">Download</a></td><td>406M</td><td>ViT-L/14</td><td>304M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>224</td>
+ </tr>
+ </table>
+ <br>
+
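+ The checkpoints above can also be fetched programmatically. Below is a minimal sketch using `huggingface_hub`; the `repo_id` and `filename` are taken from the download links in the table, so substitute the file that matches the model size you need:
+ ```python
+ # Sketch: download a QA-CLIP checkpoint from the Hugging Face Hub.
+ # repo_id/filename mirror the links in the table above.
+ from huggingface_hub import hf_hub_download
+
+ ckpt_path = hf_hub_download(repo_id="TencentARC/QA-CLIP", filename="QA-CLIP-base.pt")
+ print("Checkpoint downloaded to:", ckpt_path)
+ ```
+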
+ ## Results
+ We conducted zero-shot tests for image-text retrieval on the [MUGE Retrieval](https://tianchi.aliyun.com/muge), [Flickr30K-CN](https://github.com/li-xirong/cross-lingual-cap), and [COCO-CN](https://github.com/li-xirong/coco-cn) datasets. For zero-shot image classification, we tested on the ImageNet dataset. The results are shown in the tables below:
+
+ **Flickr30K-CN Zero-shot Retrieval (Official Test Set)**:
+ <table border="1" width="120%">
+ <tr align="center">
+ <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
+ </tr>
+ <tr align="center">
+ <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.8</td><td>76.0</td><td>84.6</td><td>60.0</td><td>85.9</td><td>92.0</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.5</b></td><td><b>77.4</b></td><td><b>86.1</b></td><td><b>67.1</b></td><td><b>87.9</b></td><td><b>93.2</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.7</td><td>86.9</td><td>92.8</td><td>74.6</td><td>93.5</td><td>97.1</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>63.8</b></td><td><b>88.0</b></td><td><b>93.2</b></td><td><b>78.4</b></td><td><b>96.1</b></td><td><b>98.5</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>68.0</td><td>89.7</td><td>94.4</td><td>80.2</td><td>96.6</td><td>98.2</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">AltCLIP<sub>ViT-L/14</sub></td><td><b>69.7</b></td><td>90.1</td><td>94.8</td><td>84.8</td><td>97.7</td><td>99.1</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td>69.3</td><td><b>90.3</b></td><td><b>94.7</b></td><td><b>85.3</b></td><td><b>97.9</b></td><td><b>99.2</b></td>
+ </tr>
+ </table>
+ <br>
+
+ **MUGE Zero-shot Retrieval (Official Validation Set)**:
+ <table border="1" width="120%">
+ <tr align="center">
+ <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
+ </tr>
+ <tr align="center">
+ <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>RN50</sub></td><td>42.6</td><td>68.5</td><td>78.0</td><td>30.0</td><td>56.2</td><td>66.9</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>44.0</b></td><td><b>69.9</b></td><td><b>79.5</b></td><td><b>32.4</b></td><td><b>59.5</b></td><td><b>70.3</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>52.1</td><td>76.7</td><td>84.4</td><td>38.7</td><td>65.6</td><td>75.1</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>53.2</b></td><td><b>77.7</b></td><td><b>85.1</b></td><td><b>40.7</b></td><td><b>68.2</b></td><td><b>77.2</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>56.4</td><td>79.8</td><td>86.2</td><td>42.6</td><td>69.8</td><td>78.6</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">AltCLIP<sub>ViT-L/14</sub></td><td>29.6</td><td>49.9</td><td>58.8</td><td>21.4</td><td>42.0</td><td>51.9</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>57.4</b></td><td><b>81.0</b></td><td><b>87.7</b></td><td><b>45.5</b></td><td><b>73.0</b></td><td><b>81.4</b></td>
+ </tr>
+ </table>
+ <br>
+
+ **COCO-CN Zero-shot Retrieval (Official Test Set)**:
+ <table border="1" width="120%">
+ <tr align="center">
+ <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
+ </tr>
+ <tr align="center">
+ <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.1</td><td>81.3</td><td>90.5</td><td>50.9</td><td>81.1</td><td>90.5</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.1</b></td><td><b>82.5</b></td><td><b>91.7</b></td><td><b>56.7</b></td><td><b>85.2</b></td><td><b>92.9</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.2</td><td>87.1</td><td>94.9</td><td>56.3</td><td>84.0</td><td>93.3</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>62.9</b></td><td><b>87.7</b></td><td><b>94.7</b></td><td><b>61.5</b></td><td><b>87.6</b></td><td><b>94.8</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>64.9</td><td>88.8</td><td>94.2</td><td>60.6</td><td>84.4</td><td>93.1</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">AltCLIP<sub>ViT-L/14</sub></td><td>63.5</td><td>87.6</td><td>93.5</td><td>62.6</td><td><b>88.5</b></td><td><b>95.9</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>65.7</b></td><td><b>90.2</b></td><td><b>95.0</b></td><td><b>64.5</b></td><td>88.3</td><td>95.1</td>
+ </tr>
+ </table>
+ <br>
+
+ **Zero-shot Image Classification on ImageNet**:
+ <table border="1" width="120%">
+ <tr align="center">
+ <th>Task</th><th colspan="1">ImageNet</th>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>RN50</sub></td><td>33.5</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>35.5</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>48.4</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>49.7</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>54.7</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>55.8</b></td>
+ </tr>
+ </table>
+ <br>
+
+ <br><br>
+
+
+ # Getting Started
+ ## Installation Requirements
+ Environment requirements:
+
+ * python >= 3.6.4
+ * pytorch >= 1.8.0 (with torchvision >= 0.9.0)
+ * CUDA Version >= 10.2
+
+ Install the required packages:
+ ```bash
+ cd /yourpath/QA-CLIP-main
+ pip install -r requirements.txt
+ ```
+
+ ## Inference Code
+ ```bash
+ export PYTHONPATH=/yourpath/QA-CLIP-main
+ ```
+ Inference code example:
+ ```python
+ import torch
+ from PIL import Image
+
+ import clip
+ from clip import load_from_name, available_models
+ print("Available models:", available_models())
+ # Available models: ['ViT-B-16', 'ViT-L-14', 'RN50']
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model, preprocess = load_from_name("ViT-B-16", device=device, download_root='./')
+ model.eval()
+ image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).to(device)
+ text = clip.tokenize(["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]).to(device)
+
+ with torch.no_grad():
+     image_features = model.encode_image(image)
+     text_features = model.encode_text(text)
+     # Normalize the features. Please use the normalized features for downstream tasks.
+     image_features /= image_features.norm(dim=-1, keepdim=True)
+     text_features /= text_features.norm(dim=-1, keepdim=True)
+
+     logits_per_image, logits_per_text = model.get_similarity(image, text)
+     probs = logits_per_image.softmax(dim=-1).cpu().numpy()
+
+ print("Label probs:", probs)
+ ```
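+
+ The example above uses the repository's bundled `clip` package. Since this commit also uploads `config.json`, `preprocessor_config.json`, `vocab.txt`, and `pytorch_model.bin` in the Hugging Face `ChineseCLIPModel` format, the same checkpoint should also be loadable directly with Hugging Face Transformers. The following is only a sketch; the repo id below is a placeholder for this model repository:
+ ```python
+ # Sketch: zero-shot classification with the HF-format weights added in this commit.
+ # "TencentARC/QA-CLIP-ViT-L-14" is a placeholder repo id; substitute this repository's id.
+ import torch
+ from PIL import Image
+ from transformers import ChineseCLIPModel, ChineseCLIPProcessor
+
+ model_id = "TencentARC/QA-CLIP-ViT-L-14"  # placeholder
+ model = ChineseCLIPModel.from_pretrained(model_id).eval()
+ processor = ChineseCLIPProcessor.from_pretrained(model_id)
+
+ image = Image.open("examples/pokemon.jpeg")
+ texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
+ inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+ probs = outputs.logits_per_image.softmax(dim=-1)
+ print("Label probs:", probs)
+ ```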
+ <br><br>
+
+ ## Prediction and Evaluation
+
+ ### Download the Image-text Retrieval Test Datasets
+ The test sets have already been preprocessed in the <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b> project; these are the download links they provide:
+
+ MUGE dataset: [download link](https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/MUGE.zip)
+
+ Flickr30K-CN dataset: [download link](https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/Flickr30k-CN.zip)
+
+ Additionally, the [COCO-CN](https://github.com/li-xirong/coco-cn) dataset must be requested from the original authors.
+
+ ### Download the ImageNet Dataset
+ Please download the raw images yourself; the [Chinese labels](http://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/ImageNet-1K/label_cn.txt) and [English labels](http://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/ImageNet-1K/label.txt) are provided by the <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b> project.
+ ### Image-text Retrieval Evaluation
+ The image-text retrieval evaluation can be run as follows:
+ ```bash
+ split=test # compute features for the valid or test split
+ resume=your_ckp_path
+ DATAPATH=your_DATAPATH
+ dataset_name=Flickr30k-CN
+ # dataset_name=MUGE
+
+ python -u eval/extract_features.py \
+     --extract-image-feats \
+     --extract-text-feats \
+     --image-data="${DATAPATH}/datasets/${dataset_name}/lmdb/${split}/imgs" \
+     --text-data="${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl" \
+     --img-batch-size=32 \
+     --text-batch-size=32 \
+     --context-length=52 \
+     --resume=${resume} \
+     --vision-model=ViT-B-16 \
+     --text-model=RoBERTa-wwm-ext-base-chinese
+
+ python -u eval/make_topk_predictions.py \
+     --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
+     --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
+     --top-k=10 \
+     --eval-batch-size=32768 \
+     --output="${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl"
+
+ python -u eval/make_topk_predictions_tr.py \
+     --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
+     --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
+     --top-k=10 \
+     --eval-batch-size=32768 \
+     --output="${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl"
+
+ python eval/evaluation.py \
+     ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl \
+     ${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl \
+     ${DATAPATH}/datasets/${dataset_name}/output1.json
+ cat ${DATAPATH}/datasets/${dataset_name}/output1.json
+
+ python eval/transform_ir_annotation_to_tr.py \
+     --input ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl
+
+ python eval/evaluation_tr.py \
+     ${DATAPATH}/datasets/${dataset_name}/${split}_texts.tr.jsonl \
+     ${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl \
+     ${DATAPATH}/datasets/${dataset_name}/output2.json
+ cat ${DATAPATH}/datasets/${dataset_name}/output2.json
+ ```
+
+ ### ImageNet Zero-shot Classification
+ The ImageNet zero-shot classification can be run as follows:
+ ```bash
+ bash scripts/zeroshot_eval.sh 0 \
+     ${DATAPATH} imagenet \
+     ViT-B-16 RoBERTa-wwm-ext-base-chinese \
+     ./pretrained_weights/QA-CLIP-base.pt
+ ```
+ <br><br>
+ # Acknowledgments
+ The project code is based on the implementation of <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b>, and we are very grateful for their outstanding open-source work.
+ <br><br>
README_CN.md ADDED
@@ -0,0 +1,280 @@
+ [**中文说明**](README_CN.md) | [**English**](README.md)
+ # Introduction
+ This project aims to provide a better Chinese CLIP model. The training data consists of publicly accessible image URLs with associated Chinese text descriptions, totaling 400 million pairs; after filtering, we ultimately used 100 million pairs for training.
+ This project was completed by the QQ-ARC Joint Lab, Tencent PCG.
+ <br><br>
+
+ # Models and Experiments
+ <span id="model_card"></span>
+ ## Model Sizes & Download Links
+ QA-CLIP currently open-sources three models of different sizes; their information and download links are listed in the table below:
+
+ <table border="1" width="100%">
+ <tr align="center">
+ <th>Model</th><th>Download</th><th>Params</th><th>Vision backbone</th><th>Vision params</th><th>Text backbone</th><th>Text params</th><th>Resolution</th>
+ </tr>
+ <tr align="center">
+ <td>QA-CLIP<sub>RN50</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-RN50.pt">Download</a></td><td>77M</td><td>ResNet50</td><td>38M</td><td>RBT3</td><td>39M</td><td>224</td>
+ </tr>
+ <tr align="center">
+ <td>QA-CLIP<sub>ViT-B/16</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-base.pt">Download</a></td><td>188M</td><td>ViT-B/16</td><td>86M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>224</td>
+ </tr>
+ <tr align="center">
+ <td>QA-CLIP<sub>ViT-L/14</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-large.pt">Download</a></td><td>406M</td><td>ViT-L/14</td><td>304M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>224</td>
+ </tr>
+ </table>
+ <br>
+
+ ## Results
+ For image-text retrieval, we ran zero-shot tests on [MUGE Retrieval](https://tianchi.aliyun.com/muge), [Flickr30K-CN](https://github.com/li-xirong/cross-lingual-cap), and [COCO-CN](https://github.com/li-xirong/coco-cn).
+ For zero-shot image classification, we tested on the ImageNet dataset. The results are shown in the tables below:
+
+
+ **Flickr30K-CN Zero-shot Retrieval (Official Test Set)**:
+ <table border="1" width="120%">
+ <tr align="center">
+ <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
+ </tr>
+ <tr align="center">
+ <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.8</td><td>76.0</td><td>84.6</td><td>60.0</td><td>85.9</td><td>92.0</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.5</b></td><td><b>77.4</b></td><td><b>86.1</b></td><td><b>67.1</b></td><td><b>87.9</b></td><td><b>93.2</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.7</td><td>86.9</td><td>92.8</td><td>74.6</td><td>93.5</td><td>97.1</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>63.8</b></td><td><b>88.0</b></td><td><b>93.2</b></td><td><b>78.4</b></td><td><b>96.1</b></td><td><b>98.5</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>68.0</td><td>89.7</td><td>94.4</td><td>80.2</td><td>96.6</td><td>98.2</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">AltCLIP<sub>ViT-L/14</sub></td><td><b>69.7</b></td><td>90.1</td><td>94.8</td><td>84.8</td><td>97.7</td><td>99.1</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td>69.3</td><td><b>90.3</b></td><td><b>94.7</b></td><td><b>85.3</b></td><td><b>97.9</b></td><td><b>99.2</b></td>
+ </tr>
+ </table>
+ <br>
+
+ **MUGE Zero-shot Retrieval (Official Validation Set)**:
+ <table border="1" width="120%">
+ <tr align="center">
+ <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
+ </tr>
+ <tr align="center">
+ <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>RN50</sub></td><td>42.6</td><td>68.5</td><td>78.0</td><td>30.0</td><td>56.2</td><td>66.9</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>44.0</b></td><td><b>69.9</b></td><td><b>79.5</b></td><td><b>32.4</b></td><td><b>59.5</b></td><td><b>70.3</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>52.1</td><td>76.7</td><td>84.4</td><td>38.7</td><td>65.6</td><td>75.1</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>53.2</b></td><td><b>77.7</b></td><td><b>85.1</b></td><td><b>40.7</b></td><td><b>68.2</b></td><td><b>77.2</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>56.4</td><td>79.8</td><td>86.2</td><td>42.6</td><td>69.8</td><td>78.6</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">AltCLIP<sub>ViT-L/14</sub></td><td>29.6</td><td>49.9</td><td>58.8</td><td>21.4</td><td>42.0</td><td>51.9</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>57.4</b></td><td><b>81.0</b></td><td><b>87.7</b></td><td><b>45.5</b></td><td><b>73.0</b></td><td><b>81.4</b></td>
+ </tr>
+ </table>
+ <br>
+
+ **COCO-CN Zero-shot Retrieval (Official Test Set)**:
+ <table border="1" width="120%">
+ <tr align="center">
+ <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
+ </tr>
+ <tr align="center">
+ <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.1</td><td>81.3</td><td>90.5</td><td>50.9</td><td>81.1</td><td>90.5</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.1</b></td><td><b>82.5</b></td><td><b>91.7</b></td><td><b>56.7</b></td><td><b>85.2</b></td><td><b>92.9</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.2</td><td>87.1</td><td>94.9</td><td>56.3</td><td>84.0</td><td>93.3</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>62.9</b></td><td><b>87.7</b></td><td><b>94.7</b></td><td><b>61.5</b></td><td><b>87.6</b></td><td><b>94.8</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>64.9</td><td>88.8</td><td>94.2</td><td>60.6</td><td>84.4</td><td>93.1</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">AltCLIP<sub>ViT-L/14</sub></td><td>63.5</td><td>87.6</td><td>93.5</td><td>62.6</td><td><b>88.5</b></td><td><b>95.9</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>65.7</b></td><td><b>90.2</b></td><td><b>95.0</b></td><td><b>64.5</b></td><td>88.3</td><td>95.1</td>
+ </tr>
+ </table>
+ <br>
+
+ **Zero-shot Image Classification on ImageNet**:
+ <table border="1" width="120%">
+ <tr align="center">
+ <th>Task</th><th colspan="1">ImageNet</th>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>RN50</sub></td><td>33.5</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>35.5</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>48.4</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>49.7</b></td>
+ </tr>
+ <tr align="center">
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>54.7</td>
+ </tr>
+ <tr align="center">
+ <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>55.8</b></td>
+ </tr>
+ </table>
+ <br>
+
+ <br><br>
+
+
+ # Getting Started
+ ## Installation Requirements
+ Environment requirements:
+
+ * python >= 3.6.4
+ * pytorch >= 1.8.0 (with torchvision >= 0.9.0)
+ * CUDA Version >= 10.2
+
+ Install the packages required by this project:
+ ```bash
+ cd /yourpath/QA-CLIP-main
+ pip install -r requirements.txt
+ ```
+
+ ## Inference Code
+ ```bash
+ export PYTHONPATH=/yourpath/QA-CLIP-main
+ ```
+ Inference code example:
+ ```python
+ import torch
+ from PIL import Image
+
+ import clip
+ from clip import load_from_name, available_models
+ print("Available models:", available_models())
+ # Available models: ['ViT-B-16', 'ViT-L-14', 'RN50']
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model, preprocess = load_from_name("ViT-B-16", device=device, download_root='./')
+ model.eval()
+ image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).to(device)
+ text = clip.tokenize(["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]).to(device)
+
+ with torch.no_grad():
+     image_features = model.encode_image(image)
+     text_features = model.encode_text(text)
+     # Normalize the features; use the normalized image/text features for downstream tasks.
+     image_features /= image_features.norm(dim=-1, keepdim=True)
+     text_features /= text_features.norm(dim=-1, keepdim=True)
+
+     logits_per_image, logits_per_text = model.get_similarity(image, text)
+     probs = logits_per_image.softmax(dim=-1).cpu().numpy()
+
+ print("Label probs:", probs)
+ ```
+ <br><br>
+
+ ## Prediction and Evaluation
+
+ ### Download the Image-text Retrieval Test Datasets
+ The test sets have already been preprocessed in the <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b> project; these are the download links they provide:
+
+ MUGE dataset: [download link](https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/MUGE.zip)
+
+ Flickr30K-CN dataset: [download link](https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/Flickr30k-CN.zip)
+
+ In addition, the [COCO-CN](https://github.com/li-xirong/coco-cn) dataset must be requested from the original authors.
+ ### Download the ImageNet Dataset
+ Please download the raw images yourself; the [Chinese labels](http://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/ImageNet-1K/label_cn.txt) and [English labels](http://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/ImageNet-1K/label.txt) are likewise provided by the <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b> project.
+ ### Image-text Retrieval Evaluation
+ The image-text retrieval evaluation can be run as follows:
+ ```bash
+ split=test # compute features for the valid or test split
+ resume=your_ckp_path
+ DATAPATH=your_DATAPATH
+ dataset_name=Flickr30k-CN
+ # dataset_name=MUGE
+
+ python -u eval/extract_features.py \
+     --extract-image-feats \
+     --extract-text-feats \
+     --image-data="${DATAPATH}/datasets/${dataset_name}/lmdb/${split}/imgs" \
+     --text-data="${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl" \
+     --img-batch-size=32 \
+     --text-batch-size=32 \
+     --context-length=52 \
+     --resume=${resume} \
+     --vision-model=ViT-B-16 \
+     --text-model=RoBERTa-wwm-ext-base-chinese
+
+ python -u eval/make_topk_predictions.py \
+     --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
+     --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
+     --top-k=10 \
+     --eval-batch-size=32768 \
+     --output="${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl"
+
+ python -u eval/make_topk_predictions_tr.py \
+     --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
+     --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
+     --top-k=10 \
+     --eval-batch-size=32768 \
+     --output="${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl"
+
+ python eval/evaluation.py \
+     ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl \
+     ${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl \
+     ${DATAPATH}/datasets/${dataset_name}/output1.json
+ cat ${DATAPATH}/datasets/${dataset_name}/output1.json
+
+ python eval/transform_ir_annotation_to_tr.py \
+     --input ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl
+
+ python eval/evaluation_tr.py \
+     ${DATAPATH}/datasets/${dataset_name}/${split}_texts.tr.jsonl \
+     ${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl \
+     ${DATAPATH}/datasets/${dataset_name}/output2.json
+ cat ${DATAPATH}/datasets/${dataset_name}/output2.json
+ ```
+
+ ### ImageNet Zero-shot Classification
+ The ImageNet zero-shot classification can be run as follows:
+ ```bash
+ bash scripts/zeroshot_eval.sh 0 \
+     ${DATAPATH} imagenet \
+     ViT-B-16 RoBERTa-wwm-ext-base-chinese \
+     ./pretrained_weights/QA-CLIP-base.pt
+ ```
+ <br><br>
+ # Acknowledgments
+ The project code is based on the implementation of <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b>; we are very grateful for their excellent open-source work.
+ <br><br>
config.json ADDED
@@ -0,0 +1,120 @@
+ {
+   "architectures": [
+     "ChineseCLIPModel"
+   ],
+   "initializer_factor": 1.0,
+   "logit_scale_init_value": 2.6592,
+   "model_type": "chinese_clip",
+   "projection_dim": 768,
+   "text_config": {
+     "architectures": [
+       "ChineseCLIPTextModel"
+     ],
+     "attention_probs_dropout_prob": 0.1,
+     "bos_token_id": 0,
+     "directionality": "bidi",
+     "eos_token_id": 2,
+     "hidden_act": "gelu",
+     "hidden_dropout_prob": 0.1,
+     "hidden_size": 768,
+     "initializer_range": 0.02,
+     "intermediate_size": 3072,
+     "layer_norm_eps": 1e-12,
+     "max_position_embeddings": 512,
+     "model_type": "chinese_clip_text_model",
+     "num_attention_heads": 12,
+     "num_hidden_layers": 12,
+     "output_past": true,
+     "pad_token_id": 0,
+     "pooler_fc_size": 768,
+     "pooler_num_attention_heads": 12,
+     "pooler_num_fc_layers": 3,
+     "pooler_size_per_head": 128,
+     "pooler_type": "first_token_transform",
+     "type_vocab_size": 2,
+     "vocab_size": 21128
+   },
+   "text_config_dict": null,
+   "torch_dtype": "float32",
+   "transformers_version": null,
+   "vision_config": {
+     "_name_or_path": "",
+     "add_cross_attention": false,
+     "architectures": null,
+     "attention_dropout": 0.0,
+     "bad_words_ids": null,
+     "bos_token_id": null,
+     "chunk_size_feed_forward": 0,
+     "cross_attention_hidden_size": null,
+     "decoder_start_token_id": null,
+     "diversity_penalty": 0.0,
+     "do_sample": false,
+     "dropout": 0.0,
+     "early_stopping": false,
+     "encoder_no_repeat_ngram_size": 0,
+     "eos_token_id": null,
+     "finetuning_task": null,
+     "forced_bos_token_id": null,
+     "forced_eos_token_id": null,
+     "hidden_act": "quick_gelu",
+     "hidden_size": 1024,
+     "id2label": {
+       "0": "LABEL_0",
+       "1": "LABEL_1"
+     },
+     "image_size": 224,
+     "initializer_factor": 1.0,
+     "initializer_range": 0.02,
+     "intermediate_size": 4096,
+     "is_decoder": false,
+     "is_encoder_decoder": false,
+     "label2id": {
+       "LABEL_0": 0,
+       "LABEL_1": 1
+     },
+     "layer_norm_eps": 1e-05,
+     "length_penalty": 1.0,
+     "max_length": 20,
+     "min_length": 0,
+     "model_type": "clip_vision_model",
+     "no_repeat_ngram_size": 0,
+     "num_attention_heads": 16,
+     "num_beam_groups": 1,
+     "num_beams": 1,
+     "num_hidden_layers": 24,
+     "num_return_sequences": 1,
+     "output_attentions": false,
+     "output_hidden_states": false,
+     "output_scores": false,
+     "pad_token_id": null,
+     "patch_size": 14,
+     "prefix": null,
+     "problem_type": null,
+     "projection_dim": 768,
+     "pruned_heads": {},
+     "remove_invalid_values": false,
+     "repetition_penalty": 1.0,
+     "return_dict": true,
+     "return_dict_in_generate": false,
+     "sep_token_id": null,
+     "task_specific_params": null,
+     "temperature": 1.0,
+     "tie_encoder_decoder": false,
+     "tie_word_embeddings": true,
+     "tokenizer_class": null,
+     "top_k": 50,
+     "top_p": 1.0,
+     "torch_dtype": null,
+     "torchscript": false,
+     "transformers_version": "4.16.0.dev0",
+     "use_bfloat16": false
+   },
+   "vision_config_dict": {
+     "hidden_size": 1024,
+     "intermediate_size": 4096,
+     "num_attention_heads": 16,
+     "num_hidden_layers": 24,
+     "patch_size": 14,
+     "projection_dim": 768
+   }
+ }
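
Note: the vision tower declared in this config (hidden size 1024, 24 layers, patch size 14) corresponds to ViT-L/14, and the text tower (12 layers, hidden size 768, vocab 21128) to the RoBERTa-wwm-Base encoder from the model card. A hedged sketch for sanity-checking the config with Hugging Face Transformers, assuming the file above is saved locally as `config.json`:

```python
# Sketch: parse the uploaded config.json and print the tower sizes.
from transformers import ChineseCLIPConfig

cfg = ChineseCLIPConfig.from_pretrained(".")  # directory containing config.json
print(cfg.vision_config.hidden_size, cfg.vision_config.num_hidden_layers, cfg.vision_config.patch_size)
print(cfg.text_config.hidden_size, cfg.text_config.num_hidden_layers, cfg.text_config.vocab_size)
```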
preprocessor_config.json ADDED
@@ -0,0 +1,21 @@
+ {
+   "do_center_crop": false,
+   "do_normalize": true,
+   "do_resize": true,
+   "feature_extractor_type": "ChineseCLIPFeatureExtractor",
+   "image_mean": [
+     0.48145466,
+     0.4578275,
+     0.40821073
+   ],
+   "image_std": [
+     0.26862954,
+     0.26130258,
+     0.27577711
+   ],
+   "resample": 3,
+   "size": {
+     "height": 224,
+     "width": 224
+   }
+ }
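
This preprocessor resizes images directly to 224×224 (center crop disabled), uses bicubic resampling (`"resample": 3` is PIL's `Image.BICUBIC`), and normalizes with the standard CLIP mean/std. A rough torchvision equivalent for reference only; the Transformers image processor built from this file remains the authoritative implementation:

```python
# Sketch: an approximate torchvision equivalent of the preprocessor config above.
from torchvision import transforms
from torchvision.transforms import InterpolationMode

preprocess = transforms.Compose([
    transforms.Resize((224, 224), interpolation=InterpolationMode.BICUBIC),  # do_resize, resample=3
    transforms.ToTensor(),  # scales pixel values to [0, 1]
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
                         std=[0.26862954, 0.26130258, 0.27577711]),
])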
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e7dff999a0942c7de062d608d5786d7aa7c75796031e8a122478cae7472ee3fd
+ size 1625135894
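
This is a Git LFS pointer; the roughly 1.6 GB weight file itself is stored via LFS. After downloading the real file, its integrity can be checked against the `oid` above, for example:

```python
# Sketch: verify a downloaded pytorch_model.bin against the LFS sha256 oid above.
import hashlib

expected = "e7dff999a0942c7de062d608d5786d7aa7c75796031e8a122478cae7472ee3fd"
h = hashlib.sha256()
with open("pytorch_model.bin", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
print("OK" if h.hexdigest() == expected else "hash mismatch")
```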
vocab.txt ADDED
The diff for this file is too large to render. See raw diff