Kunyi committed on
Commit 301941a
1 Parent(s): 4c887bd

Update README.md

Files changed (1)
  1. README.md +3 -108
README.md CHANGED
@@ -8,29 +8,9 @@ widget:
  [**中文说明**](README_CN.md) | [**English**](README.md)
  # Introduction
  This project aims to provide a better Chinese CLIP model. The training data used in this project consists of publicly accessible image URLs and their associated Chinese text descriptions, totaling 400 million pairs. After screening, we ultimately used 100 million pairs for training.
- This project is produced by QQ-ARC Joint Lab, Tencent PCG.
+ This project is produced by QQ-ARC Joint Lab, Tencent PCG. For more detailed information, please refer to the [main page of the QA-CLIP project](https://huggingface.co/TencentARC/QA-CLIP).
  <br><br>
 
- # Models and Results
- <span id="model_card"></span>
- ## Model Card
- QA-CLIP currently provides three open-source models of different sizes; their model details and download links are shown in the table below:
- <table border="1" width="100%">
-     <tr align="center">
-         <th>Model</th><th>Ckpt</th><th>Params</th><th>Vision</th><th>Params of Vision</th><th>Text</th><th>Params of Text</th><th>Resolution</th>
-     </tr>
-     <tr align="center">
-         <td>QA-CLIP<sub>RN50</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-RN50.pt">Download</a></td><td>77M</td><td>ResNet50</td><td>38M</td><td>RBT3</td><td>39M</td><td>224</td>
-     </tr>
-     <tr align="center">
-         <td>QA-CLIP<sub>ViT-B/16</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-base.pt">Download</a></td><td>188M</td><td>ViT-B/16</td><td>86M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>224</td>
-     </tr>
-     <tr align="center">
-         <td>QA-CLIP<sub>ViT-L/14</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-large.pt">Download</a></td><td>406M</td><td>ViT-L/14</td><td>304M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>224</td>
-     </tr>
- </table>
- <br>
-
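The checkpoints listed in the table above can also be fetched programmatically. The following is a minimal sketch using `huggingface_hub`, assuming the files remain hosted under the `TencentARC/QA-CLIP` repository as the download links above indicate:

```python
# Minimal sketch (not part of the original README): fetch a QA-CLIP checkpoint
# listed in the table above. Assumes the files remain hosted under the
# TencentARC/QA-CLIP repository, as the download links indicate.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="TencentARC/QA-CLIP",
    filename="QA-CLIP-base.pt",  # or "QA-CLIP-RN50.pt" / "QA-CLIP-large.pt"
)
print(ckpt_path)  # local path of the cached checkpoint file
```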
  ## Results
  We conducted zero-shot image-text retrieval tests on the [MUGE Retrieval](https://tianchi.aliyun.com/muge), [Flickr30K-CN](https://github.com/li-xirong/cross-lingual-cap), and [COCO-CN](https://github.com/li-xirong/coco-cn) datasets, and zero-shot image classification tests on the ImageNet dataset. The test results are shown below:
 
@@ -160,18 +140,6 @@ We conducted zero-shot tests on [MUGE Retrieval](https://tianchi.aliyun.com/muge
 
 
  # Getting Started
- ## Installation Requirements
- Environment configuration requirements:
-
- * python >= 3.6.4
- * pytorch >= 1.8.0 (with torchvision >= 0.9.0)
- * CUDA Version >= 10.2
-
- Install required packages:
- ```bash
- cd /yourpath/QA-CLIP-main
- pip install -r requirements.txt
- ```
 
  ## Inference Code
  Inference code example:
@@ -180,8 +148,8 @@ from PIL import Image
  import requests
  from transformers import ChineseCLIPProcessor, ChineseCLIPModel
 
- model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-B-16")
- processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-B-16")
+ model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
+ processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
 
  url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
  image = Image.open(requests.get(url, stream=True).raw)
@@ -206,79 +174,6 @@ probs = logits_per_image.softmax(dim=1)
  ```
  <br><br>
 
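Since the diff shows the inference example only in fragments (the hunk context indicates it ends with `probs = logits_per_image.softmax(dim=1)`), here is a minimal end-to-end sketch of the same usage pattern with the `transformers` ChineseCLIP classes; the candidate captions are illustrative and not taken from the README:

```python
# Minimal end-to-end sketch of the inference pattern shown above; the candidate
# captions below are illustrative and not part of the original README.
import requests
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]  # candidate Chinese captions (Pokémon names)

# Preprocess the image and captions, then compute image-text similarity logits.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image  # shape: [1, len(texts)]
probs = logits_per_image.softmax(dim=1)      # probabilities over the candidate captions
print(probs)
```

The processor tokenizes the Chinese captions and resizes the image, and `logits_per_image` holds the image-to-text similarity scores before the softmax.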
- ## Prediction and Evaluation
-
- ### Download the Image-text Retrieval Test Datasets
- In the <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b> project, the test sets have already been preprocessed. Here are the download links they provide:
-
- MUGE dataset: [download link](https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/MUGE.zip)
-
- Flickr30K-CN dataset: [download link](https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/Flickr30k-CN.zip)
-
- Additionally, obtaining the [COCO-CN](https://github.com/li-xirong/coco-cn) dataset requires applying to the original author.
-
- ### Download the ImageNet Dataset
- Please download the raw data yourself. The [Chinese labels](http://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/ImageNet-1K/label_cn.txt) and [English labels](http://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/ImageNet-1K/label.txt) are provided by the <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b> project.
- ### Image-text Retrieval Evaluation
- The image-text retrieval evaluation can be run as follows:
- ```bash
- split=test # specify whether to compute features for the valid or test set
- resume=your_ckp_path
- DATAPATH=your_DATAPATH
- dataset_name=Flickr30k-CN
- # dataset_name=MUGE
-
- python -u eval/extract_features.py \
-     --extract-image-feats \
-     --extract-text-feats \
-     --image-data="${DATAPATH}/datasets/${dataset_name}/lmdb/${split}/imgs" \
-     --text-data="${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl" \
-     --img-batch-size=32 \
-     --text-batch-size=32 \
-     --context-length=52 \
-     --resume=${resume} \
-     --vision-model=ViT-B-16 \
-     --text-model=RoBERTa-wwm-ext-base-chinese
-
- python -u eval/make_topk_predictions.py \
-     --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
-     --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
-     --top-k=10 \
-     --eval-batch-size=32768 \
-     --output="${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl"
-
- python -u eval/make_topk_predictions_tr.py \
-     --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
-     --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
-     --top-k=10 \
-     --eval-batch-size=32768 \
-     --output="${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl"
-
- python eval/evaluation.py \
-     ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl \
-     ${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl \
-     ${DATAPATH}/datasets/${dataset_name}/output1.json
- cat ${DATAPATH}/datasets/${dataset_name}/output1.json
-
- python eval/transform_ir_annotation_to_tr.py \
-     --input ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl
-
- python eval/evaluation_tr.py \
-     ${DATAPATH}/datasets/${dataset_name}/${split}_texts.tr.jsonl \
-     ${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl \
-     ${DATAPATH}/datasets/${dataset_name}/output2.json
- cat ${DATAPATH}/datasets/${dataset_name}/output2.json
- ```
-
- ### ImageNet Zero-shot Classification
- ImageNet zero-shot classification can be run as follows:
- ```bash
- bash scripts/zeroshot_eval.sh 0 \
-     ${DATAPATH} imagenet \
-     ViT-B-16 RoBERTa-wwm-ext-base-chinese \
-     ./pretrained_weights/QA-CLIP-base.pt
- ```
- <br><br>
  # Acknowledgments
  The project code is based on the implementation of <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b>, and we are very grateful for their outstanding open-source contributions.
  <br><br>
 