Update README.md
README.md CHANGED
@@ -8,29 +8,9 @@ widget:
 [**中文说明**](README_CN.md) | [**English**](README.md)
 # Introduction
 This project aims to provide a better Chinese CLIP model. The training data used in this project consists of publicly accessible image URLs and related Chinese text descriptions, totaling 400 million image-text pairs. After screening, we ultimately used 100 million of them for training.
-This project is produced by QQ-ARC Joint Lab, Tencent PCG.
+This project is produced by QQ-ARC Joint Lab, Tencent PCG. For more detailed information, please refer to the [main page of the QA-CLIP project](https://huggingface.co/TencentARC/QA-CLIP).
 <br><br>
 
-# Models and Results
-<span id="model_card"></span>
-## Model Card
-QA-CLIP currently has three different open-source models of different sizes, and their model information and download links are shown in the table below:
-<table border="1" width="100%">
-    <tr align="center">
-        <th>Model</th><th>Ckp</th><th>Params</th><th>Vision</th><th>Params of Vision</th><th>Text</th><th>Params of Text</th><th>Resolution</th>
-    </tr>
-    <tr align="center">
-        <td>QA-CLIP<sub>RN50</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-RN50.pt">Download</a></td><td>77M</td><td>ResNet50</td><td>38M</td><td>RBT3</td><td>39M</td><td>224</td>
-    </tr>
-    <tr align="center">
-        <td>QA-CLIP<sub>ViT-B/16</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-base.pt">Download</a></td><td>188M</td><td>ViT-B/16</td><td>86M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>224</td>
-    </tr>
-    <tr align="center">
-        <td>QA-CLIP<sub>ViT-L/14</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-large.pt">Download</a></td><td>406M</td><td>ViT-L/14</td><td>304M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>224</td>
-    </tr>
-</table>
-<br>
-
 ## Results
 We conducted zero-shot tests on [MUGE Retrieval](https://tianchi.aliyun.com/muge), [Flickr30K-CN](https://github.com/li-xirong/cross-lingual-cap), and [COCO-CN](https://github.com/li-xirong/coco-cn) datasets for image-text retrieval tasks. For the image zero-shot classification task, we tested on the ImageNet dataset. The test results are shown in the table below:
 
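The model card removed in the hunk above lists direct download links for three raw checkpoints hosted in the `TencentARC/QA-CLIP` repository. As a side note, a minimal sketch of fetching one of those files programmatically is shown below; it assumes the `huggingface_hub` client, which the README itself does not mention, and takes the repo id and filename from the table's URLs.

```python
# Hedged sketch: download one of the checkpoints listed in the removed model card.
# Assumption: huggingface_hub is installed; the README only gives direct download URLs.
from huggingface_hub import hf_hub_download

# Repo id and filename come from the table link
# https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-base.pt
ckpt_path = hf_hub_download(repo_id="TencentARC/QA-CLIP", filename="QA-CLIP-base.pt")
print(ckpt_path)  # local path of the cached QA-CLIP ViT-B/16 checkpoint
```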
@@ -160,18 +140,6 @@ We conducted zero-shot tests on [MUGE Retrieval](https://tianchi.aliyun.com/muge
 
 
 # Getting Started
-## Installation Requirements
-Environment configuration requirements:
-
-* python >= 3.6.4
-* pytorch >= 1.8.0 (with torchvision >= 0.9.0)
-* CUDA Version >= 10.2
-
-Install required packages:
-```bash
-cd /yourpath/QA-CLIP-main
-pip install -r requirements.txt
-```
 
 ## Inference Code
 Inference code example:
@@ -180,8 +148,8 @@ from PIL import Image
 import requests
 from transformers import ChineseCLIPProcessor, ChineseCLIPModel
 
-model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-
-processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-
+model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
+processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
 
 url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
 image = Image.open(requests.get(url, stream=True).raw)
@@ -206,79 +174,6 @@ probs = logits_per_image.softmax(dim=1)
 ```
 <br><br>
 
-## Prediction and Evaluation
-
-### Download Image-text Retrieval Test Dataset
-In Project <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b>, the test set has already been preprocessed. Here is the download link they provided:
-
-MUGE dataset: [download link](https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/MUGE.zip)
-
-Flickr30K-CN dataset: [download link](https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/Flickr30k-CN.zip)
-
-Additionally, obtaining the [COCO-CN](https://github.com/li-xirong/coco-cn) dataset requires applying to the original author.
-
-### Download ImageNet Dataset
-Please download the raw data yourself, [Chinese Label](http://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/ImageNet-1K/label_cn.txt) and [English Label](http://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/ImageNet-1K/label.txt) are provided by Project <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b>
-### Image-text Retrieval Evaluation
-The image-text retrieval evaluation code can be referred to as follows:
-```bash
-split=test # Designate the computation of features for the valid or test set
-resume=your_ckp_path
-DATAPATH=your_DATAPATH
-dataset_name=Flickr30k-CN
-# dataset_name=MUGE
-
-python -u eval/extract_features.py \
-    --extract-image-feats \
-    --extract-text-feats \
-    --image-data="${DATAPATH}/datasets/${dataset_name}/lmdb/${split}/imgs" \
-    --text-data="${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl" \
-    --img-batch-size=32 \
-    --text-batch-size=32 \
-    --context-length=52 \
-    --resume=${resume} \
-    --vision-model=ViT-B-16 \
-    --text-model=RoBERTa-wwm-ext-base-chinese
-
-python -u eval/make_topk_predictions.py \
-    --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
-    --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
-    --top-k=10 \
-    --eval-batch-size=32768 \
-    --output="${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl"
-
-python -u eval/make_topk_predictions_tr.py \
-    --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
-    --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
-    --top-k=10 \
-    --eval-batch-size=32768 \
-    --output="${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl"
-
-python eval/evaluation.py \
-    ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl \
-    ${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl \
-    ${DATAPATH}/datasets/${dataset_name}/output1.json
-cat ${DATAPATH}/datasets/${dataset_name}/output1.json
-
-python eval/transform_ir_annotation_to_tr.py \
-    --input ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl
-
-python eval/evaluation_tr.py \
-    ${DATAPATH}/datasets/${dataset_name}/${split}_texts.tr.jsonl \
-    ${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl \
-    ${DATAPATH}/datasets/${dataset_name}/output2.json
-cat ${DATAPATH}/datasets/${dataset_name}/output2.json
-```
-
-### ImageNet Zero-shot Classification
-The ImageNet zero-shot classification code can be referred to as follows
-```bash
-bash scripts/zeroshot_eval.sh 0 \
-    ${DATAPATH} imagenet \
-    ViT-B-16 RoBERTa-wwm-ext-base-chinese \
-    ./pretrained_weights/QA-CLIP-base.pt
-```
-<br><br>
 # Acknowledgments
 The project code is based on the implementation of <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b>, and we are very grateful for their outstanding open-source contributions.
 <br><br>
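The inference example appears only in fragments in the hunks above (the imports, the model and processor loading updated by this commit, and the closing `softmax` line referenced in the last hunk header). For context, a minimal self-contained sketch of that flow is given below; it uses the standard `transformers` ChineseCLIP API with the model id added in this commit, and the candidate captions are illustrative placeholders rather than text taken from the original README.

```python
# Self-contained sketch of the updated inference flow (illustrative, not a copy of the README).
import torch
import requests
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

# Model id taken from the lines added in this commit.
model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["皮卡丘", "一只猫", "一辆汽车"]  # placeholder candidate captions

# The processor handles both image preprocessing and Chinese text tokenization.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image  # shape: (1, len(texts))
probs = logits_per_image.softmax(dim=1)      # image-to-text matching probabilities
print(probs)
```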