---
license: apache-2.0
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
candidate_labels: 音乐表演, 体育运动
example_title: 猫和狗
---
[**中文说明**](README_CN.md) | [**English**](README.md)
# Introduction
This project aims to provide a better Chinese CLIP model. The training data consists of 400 million publicly accessible image URLs and their associated Chinese text descriptions; after filtering, we used 100 million image-text pairs for training.
This project is produced by the QQ-ARC Joint Lab, Tencent PCG. For more details, please refer to the [main page of the QA-CLIP project](https://huggingface.co/TencentARC/QA-CLIP). We have also open-sourced our code on GitHub at [QA-CLIP](https://github.com/TencentARC-QQ/QA-CLIP); feel free to star it!
<br><br>
## Results
We conducted zero-shot image-text retrieval tests on the [MUGE Retrieval](https://tianchi.aliyun.com/muge), [Flickr30K-CN](https://github.com/li-xirong/cross-lingual-cap), and [COCO-CN](https://github.com/li-xirong/coco-cn) datasets, and zero-shot image classification on ImageNet. The results are shown in the tables below.
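All retrieval numbers are Recall@K: the percentage of queries whose ground-truth match appears among the top K retrieved candidates. Below is a minimal sketch of the metric, assuming a precomputed query-candidate similarity matrix with one ground-truth match per query; the official benchmarks ship their own evaluation scripts, so treat this as illustrative only:
```python
import torch

def recall_at_k(sim: torch.Tensor, k: int) -> float:
    """sim[i, j]: similarity of query i to candidate j; the ground-truth
    match for query i is assumed to be candidate i."""
    topk = sim.topk(k, dim=1).indices                 # top-k candidate ids per query
    targets = torch.arange(sim.size(0)).unsqueeze(1)  # ground-truth id per query
    hits = (topk == targets).any(dim=1)               # True if target is in the top k
    return hits.float().mean().item() * 100           # percentage, as in the tables

sim = torch.eye(5)          # toy case: each query matches its own candidate perfectly
print(recall_at_k(sim, 1))  # 100.0
```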
**Flickr30K-CN Zero-shot Retrieval (Official Test Set)**:
<table border="1" width="120%">
<tr align="center">
<th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
</tr>
<tr align="center">
<td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.8</td><td>76.0</td><td>84.6</td><td>60.0</td><td>85.9</td><td>92.0</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.5</b></td><td><b>77.4</b></td><td><b>86.1</b></td><td><b>67.1</b></td><td><b>87.9</b></td><td><b>93.2</b></td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.7</td><td>86.9</td><td>92.8</td><td>74.6</td><td>93.5</td><td>97.1</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>63.8</b></td><td><b>88.0</b></td><td><b>93.2</b></td><td><b>78.4</b></td><td><b>96.1</b></td><td><b>98.5</b></td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>68.0</td><td>89.7</td><td>94.4</td><td>80.2</td><td>96.6</td><td>98.2</td>
</tr>
<tr align="center">
<td width="120%">AltClip<sub>ViT-L/14</sub></td><td><b>69.7</b></td><td>90.1</td><td><b>94.8</b></td><td>84.8</td><td>97.7</td><td>99.1</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td>69.3</td><td><b>90.3</b></td><td>94.7</td><td><b>85.3</b></td><td><b>97.9</b></td><td><b>99.2</b></td>
</tr>
</table>
<br>
**MUGE Zero-shot Retrieval (Official Validation Set)**:
<table border="1" width="120%">
<tr align="center">
<th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
</tr>
<tr align="center">
<td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>RN50</sub></td><td>42.6</td><td>68.5</td><td>78.0</td><td>30.0</td><td>56.2</td><td>66.9</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>44.0</b></td><td><b>69.9</b></td><td><b>79.5</b></td><td><b>32.4</b></td><td><b>59.5</b></td><td><b>70.3</b></td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>52.1</td><td>76.7</td><td>84.4</td><td>38.7</td><td>65.6</td><td>75.1</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>53.2</b></td><td><b>77.7</b></td><td><b>85.1</b></td><td><b>40.7</b></td><td><b>68.2</b></td><td><b>77.2</b></td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>56.4</td><td>79.8</td><td>86.2</td><td>42.6</td><td>69.8</td><td>78.6</td>
</tr>
<tr align="center">
<td width="120%">AltClip<sub>ViT-L/14</sub></td><td>29.6</td><td>49.9</td><td>58.8</td><td>21.4</td><td>42.0</td><td>51.9</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>57.4</b></td><td><b>81.0</b></td><td><b>87.7</b></td><td><b>45.5</b></td><td><b>73.0</b></td><td><b>81.4</b></td>
</tr>
</table>
<br>
**COCO-CN Zero-shot Retrieval (Official Test Set)**:
<table border="1" width="120%">
<tr align="center">
<th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
</tr>
<tr align="center">
<td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.1</td><td>81.3</td><td>90.5</td><td>50.9</td><td>81.1</td><td>90.5</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.1</b></td><td><b>82.5</b></td><td><b>91.7</b></td><td><b>56.7</b></td><td><b>85.2</b></td><td><b>92.9</b></td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.2</td><td>87.1</td><td>94.9</td><td>56.3</td><td>84.0</td><td>93.3</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>62.9</b></td><td><b>87.7</b></td><td><b>94.7</b></td><td><b>61.5</b></td><td><b>87.6</b></td><td><b>94.8</b></td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>64.9</td><td>88.8</td><td>94.2</td><td>60.6</td><td>84.4</td><td>93.1</td>
</tr>
<tr align="center">
<td width="120%">AltClip<sub>ViT-L/14</sub></td><td>63.5</td><td>87.6</td><td>93.5</td><td>62.6</td><td><b>88.5</b></td><td><b>95.9</b></td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>65.7</b></td><td><b>90.2</b></td><td><b>95.0</b></td><td><b>64.5</b></td><td>88.3</td><td>95.1</td>
</tr>
</table>
<br>
**Zero-shot Image Classification on ImageNet**:
<table border="1" width="120%">
<tr align="center">
<th>Task</th><th colspan="1">ImageNet</th>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>RN50</sub></td><td>33.5</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>35.5</b></td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>48.4</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>49.7</b></td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>54.7</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>55.8</b></td>
</tr>
</table>
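
Zero-shot classification wraps each class name in a Chinese prompt, encodes the prompts as text features, and assigns an image to the class whose prompt it matches best. Below is a minimal sketch using `transformers`; the prompt template and the three class names are illustrative assumptions, not the official ImageNet evaluation setup:
```python
import torch
import requests
from PIL import Image
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")

# Illustrative Chinese class names and prompt template; the official
# ImageNet evaluation uses its own label translations and templates.
class_names = ["金鱼", "拖拉机", "热气球"]  # goldfish, tractor, hot-air balloon
prompts = [f"一张{name}的照片。" for name in class_names]  # "a photo of a {}."

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, num_classes)
pred = logits.softmax(dim=1).argmax(dim=1).item()
print(class_names[pred])  # predicted class name
```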
<br>
<br><br>
# Getting Started
## Inference Code
Inference code example:
```python
from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel
model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True) # normalize
# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True) # normalize
# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)
```
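Running the example yields one probability per candidate caption, and the caption that matches the image content should receive the highest score. For instance:
```python
for text, p in zip(texts, probs[0].tolist()):
    print(f"{text}: {p:.4f}")
```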
<br><br>
# Acknowledgments
The project code is based on the implementation of <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b>, and we are very grateful for their outstanding open-source contribution.
<br><br> |