---
language: ja
license: apache-2.0
tags:
- clip
- japanese-clip
pipeline_tag: feature-extraction
---

# clip-japanese-base

This is a Japanese [CLIP (Contrastive Language-Image Pre-training)](https://arxiv.org/abs/2103.00020) model developed by [LY Corporation](https://www.lycorp.co.jp/en/). The model was trained on approximately 1B web-collected image-text pairs and is applicable to various visual tasks, including zero-shot image classification, text-to-image retrieval, and image-to-text retrieval.

## How to use

1. Install packages

```
pip install pillow requests sentencepiece transformers torch timm
```

2. Run

```python
import io
import requests
from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

HF_MODEL_PATH = 'line-corporation/clip-japanese-base'
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True).to(device)

# Download an example image and preprocess it
image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = processor(image, return_tensors="pt").to(device)

# Tokenize candidate labels: "dog", "cat", "elephant"
text = tokenizer(["犬", "猫", "象"]).to(device)

with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # [[1., 0., 0.]]
```

## Model architecture

The model uses an [Eva02-B](https://huggingface.co/timm/eva02_base_patch16_clip_224.merged2b_s8b_b131k) Transformer as the image encoder and a 12-layer BERT as the text encoder. The text encoder was initialized from [rinna/japanese-clip-vit-b-16](https://huggingface.co/rinna/japanese-clip-vit-b-16).

## Evaluation

### Dataset

- [STAIR Captions](http://captions.stair.center/) (MSCOCO 2014 validation set) for image-to-text (i2t) and text-to-image (t2i) retrieval. We report R@1, the average of i2t and t2i recall@1.
- [Recruit Datasets](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset) for image classification.
- [ImageNet-1K](https://www.image-net.org/download.php) for image classification. We translated all class names into Japanese. The class names and prompt templates can be found in [ja-imagenet-1k-classnames.txt](https://huggingface.co/line-corporation/clip-japanese-base/blob/main/ja-imagenet-1k-classnames.txt) and [ja-imagenet-1k-templates.txt](https://huggingface.co/line-corporation/clip-japanese-base/blob/main/ja-imagenet-1k-templates.txt) (see the zero-shot sketch after this list).
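A common way to use such templates for zero-shot classification is to average each class's text embedding over all prompt templates, as in the original CLIP evaluation protocol. The sketch below illustrates that procedure with the API from the snippet above; it is a minimal example, not our exact evaluation script, and it assumes one entry per line in the classnames/templates files with a `{}` placeholder for the class name in each template. Adjust the parsing if the actual file format differs.

```python
# Minimal zero-shot classification sketch (not the exact evaluation script).
# Assumptions: one class name per line in ja-imagenet-1k-classnames.txt and one
# template per line in ja-imagenet-1k-templates.txt, each template containing a
# "{}" placeholder for the class name. Reuses tokenizer/model/processor/device
# from the usage snippet above.
import torch

def build_zeroshot_weights(classnames, templates, tokenizer, model, device):
    """Average the normalized text embedding over all prompt templates per class."""
    weights = []
    with torch.no_grad():
        for name in classnames:
            prompts = [t.format(name) for t in templates]
            tokens = tokenizer(prompts).to(device)
            emb = model.get_text_features(**tokens)
            emb = emb / emb.norm(dim=-1, keepdim=True)  # normalize each template embedding
            emb = emb.mean(dim=0)                       # average over templates
            weights.append(emb / emb.norm())            # re-normalize the class embedding
    return torch.stack(weights, dim=1)  # shape: (embed_dim, num_classes)

# Example usage:
# classnames = open("ja-imagenet-1k-classnames.txt", encoding="utf-8").read().splitlines()
# templates  = open("ja-imagenet-1k-templates.txt", encoding="utf-8").read().splitlines()
# W = build_zeroshot_weights(classnames, templates, tokenizer, model, device)
# image_features = model.get_image_features(**image)
# image_features = image_features / image_features.norm(dim=-1, keepdim=True)
# pred = (image_features @ W).argmax(dim=-1)  # predicted class index
```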
### Result

| Model | Image Encoder Params | Text Encoder Params | STAIR Captions (R@1) | Recruit Datasets (acc@1) | ImageNet-1K (acc@1) |
|-------------------|----------------------|---------------------|----------------------|--------------------------|---------------------|
| [Ours](https://huggingface.co/line-corporation/clip-japanese-base) | 86M (Eva02-B) | 100M (BERT) | **0.30** | **0.89** | 0.58 |
| [Stable-ja-clip](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16) | 307M (ViT-L) | 100M (BERT) | 0.24 | 0.77 | **0.68** |
| [Rinna-ja-clip](https://huggingface.co/rinna/japanese-clip-vit-b-16) | 86M (ViT-B) | 100M (BERT) | 0.13 | 0.54 | 0.56 |
| [Laion-clip](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k) | 632M (ViT-H) | 561M (XLM-RoBERTa) | **0.30** | 0.83 | 0.58 |
| [Hakuhodo-ja-clip](https://huggingface.co/hakuhodo-tech/japanese-clip-vit-h-14-bert-wider) | 632M (ViT-H) | 100M (BERT) | 0.21 | 0.82 | 0.46 |

## License

[The Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Citation

```
@misc{clip-japanese-base,
    title = {CLIP Japanese Base},
    author = {Shuhei Yokoo and Shuntaro Okada and Peifei Zhu and Shuhei Nishimura and Naoki Takayama},
    url = {https://huggingface.co/line-corporation/clip-japanese-base},
}
```