File size: 8,568 Bytes
1316154 2e0ed45 09208cc 1316154 f2285fa 1316154 f2285fa 1316154 642d8f0 1316154 f2285fa 642d8f0 1316154 f2285fa 1316154 f2285fa 1316154 f2285fa 1316154 f2285fa 1316154 f2285fa 1316154 f2285fa 1316154 f2285fa 1316154 f2285fa 1316154 f2285fa 1316154 d7e6ab3 4c887bd d7e6ab3 1316154 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 |
[**中文说明**](README_CN.md) | [**English**](README.md)
# 项目介绍
本项目旨在提供更好的中文CLIP模型。该项目使用的训练数据均为公开可访问的图像URL及相关中文文本描述,总量达到400M。经过筛选后,我们最终使用了100M的数据进行训练。
本项目于QQ-ARC Joint Lab, Tencent PCG完成。
更详细的信息可以参考[QA-CLIP项目的主页面](https://huggingface.co/TencentARC/QA-CLIP)。我们也在github上开源了模型,[QA-CLIP](https://github.com/TencentARC-QQ/QA-CLIP),welcome to star!
<br><br>
## 实验结果
针对图文检索任务,我们在[MUGE Retrieval](https://tianchi.aliyun.com/muge)、[Flickr30K-CN](https://github.com/li-xirong/cross-lingual-cap)和[COCO-CN](https://github.com/li-xirong/coco-cn)上进行了zero-shot测试。
针对图像零样本分类任务,我们在ImageNet数据集上进行了测试。测试结果见下表:
**Flickr30K-CN Zero-shot Retrieval (Official Test Set)**:
<table border="1" width="120%">
<tr align="center">
<th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
</tr>
<tr align="center">
<td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.8</td><td>76.0</td><td>84.6</td><td>60.0</td><td>85.9</td><td>92.0</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.5</b></td><td><b>77.4</b></td><td><b>86.1</b></td><td><b>67.1</b></td><td><b>87.9</b></td><td><b>93.2</b></td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.7</td><td>86.9</td><td>92.8</td><td>74.6</td><td>93.5</td><td>97.1</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>63.8</b></td><td><b>88.0</b></td><td><b>93.2</b></td><td><b>78.4</b></td><td><b>96.1</b></td><td><b>98.5</b></td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>68.0</td><td>89.7</td><td>94.4</td><td>80.2</td><td>96.6</td><td>98.2</td>
</tr>
<tr align="center">
<td width="120%">AltClip<sub>ViT-L/14</sub></td><td><b>69.7</b></td><td>90.1</td><td><b>94.8</b></td><td>84.8</td><td>97.7</td><td>99.1</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td>69.3</td><td><b>90.3</b></td><td>94.7</td><td><b>85.3</b></td><td><b>97.9</b></td><td><b>99.2</b></td>
</tr>
</table>
<br>
**MUGE Zero-shot Retrieval (Official Validation Set)**:
<table border="1" width="120%">
<tr align="center">
<th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
</tr>
<tr align="center">
<td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>RN50</sub></td><td>42.6</td><td>68.5</td><td>78.0</td><td>30.0</td><td>56.2</td><td>66.9</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>44.0</b></td><td><b>69.9</b></td><td><b>79.5</b></td><td><b>32.4</b></td><td><b>59.5</b></td><td><b>70.3</b></td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>52.1</td><td>76.7</td><td>84.4</td><td>38.7</td><td>65.6</td><td>75.1</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>53.2</b></td><td><b>77.7</b></td><td><b>85.1</b></td><td><b>40.7</b></td><td><b>68.2</b></td><td><b>77.2</b></td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>56.4</td><td>79.8</td><td>86.2</td><td>42.6</td><td>69.8</td><td>78.6</td>
</tr>
<tr align="center">
<td width="120%">AltClip<sub>ViT-L/14</sub></td><td>29.6</td><td>49.9</td><td>58.8</td><td>21.4</td><td>42.0</td><td>51.9</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>57.4</b></td><td><b>81.0</b></td><td><b>87.7</b></td><td><b>45.5</b></td><td><b>73.0</b></td><td><b>81.4</b></td>
</tr>
</table>
<br>
**COCO-CN Zero-shot Retrieval (Official Test Set)**:
<table border="1" width="120%">
<tr align="center">
<th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
</tr>
<tr align="center">
<td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.1</td><td>81.3</td><td>90.5</td><td>50.9</td><td>81.1</td><td>90.5</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.1</b></td><td><b>82.5</b></td><td><b>91.7</b></td><td><b>56.7</b></td><td><b>85.2</b></td><td><b>92.9</b></td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.2</td><td>87.1</td><td>94.9</td><td>56.3</td><td>84.0</td><td>93.3</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>62.9</b></td><td><b>87.7</b></td><td><b>94.7</b></td><td><b>61.5</b></td><td><b>87.6</b></td><td><b>94.8</b></td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>64.9</td><td>88.8</td><td>94.2</td><td>60.6</td><td>84.4</td><td>93.1</td>
</tr>
<tr align="center">
<td width="120%">AltClip<sub>ViT-L/14</sub></td><td>63.5</td><td>87.6</td><td>93.5</td><td>62.6</td><td><b>88.5</b></td><td><b>95.9</b></td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>65.7</b></td><td><b>90.2</b></td><td><b>95.0</b></td><td><b>64.5</b></td><td>88.3</td><td>95.1</td>
</tr>
</table>
<br>
**Zero-shot Image Classification on ImageNet**:
<table border="1" width="120%">
<tr align="center">
<th>Task</th><th colspan="1">ImageNet</th>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>RN50</sub></td><td>33.5</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>35.5</b></td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>48.4</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>49.7</b></td>
</tr>
<tr align="center">
<td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>54.7</td>
</tr>
<tr align="center", style="background-color: Honeydew;">
<td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>55.8</b></td>
</tr>
</table>
<br>
<br><br>
# 使用教程
## 推理代码
推理代码示例:
```python
from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel
model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True) # normalize
# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True) # normalize
# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)
```
<br><br>
# 致谢
项目代码基于<b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b>实现,非常感谢他们优秀的开源工作。
<br><br> |