---
license: cc-by-nc-sa-4.0
language:
- ja
tags:
- clip
- ja
- japanese
- japanese-clip
pipeline_tag: feature-extraction
---

# Japanese CLIP ViT-H/14 (Base)

## Table of Contents

1. [Overview](#overview)
1. [Usage](#usage)
1. [Model Details](#model-details)
1. [Evaluation](#evaluation)
1. [Limitations and Biases](#limitations-and-biases)
1. [Citation](#citation)
1. [See Also](#see-also)
1. [Contact Information](#contact-information)

## Overview

* **Developed by**: [HAKUHODO Technologies Inc.](https://www.hakuhodo-technologies.co.jp/)
* **Model type**: Contrastive Language-Image Pre-trained Model
* **Language(s)**: Japanese
* **License**: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)

This is a Japanese [CLIP (Contrastive Language-Image Pre-training)](https://arxiv.org/abs/2103.00020) model that maps Japanese text and images into a shared embedding space. It can be used for multimodal tasks such as zero-shot image classification, text-to-image retrieval, and image-to-text retrieval, and it can also serve as a building block for other systems, such as image-to-text and text-to-image generation models.

## Usage

### Dependencies

```bash
python3 -m pip install pillow sentencepiece torch torchvision transformers
```

### Inference

The usage is similar to [`CLIPModel`](https://huggingface.co/docs/transformers/model_doc/clip) and [`VisionTextDualEncoderModel`](https://huggingface.co/docs/transformers/model_doc/vision-text-dual-encoder).

```python
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor, BatchEncoding

# Download
model_name = "hakuhodo-tech/japanese-clip-vit-h-14-bert-base"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Prepare raw inputs
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Process inputs
inputs = processor(
    text=["犬", "猫", "象"],
    images=image,
    return_tensors="pt",
    padding=True,
)

# Infer and output
outputs = model(**BatchEncoding(inputs).to(device))
probs = outputs.logits_per_image.softmax(dim=1)
print([f"{x:.2f}" for x in probs.flatten().tolist()])  # ['0.00', '1.00', '0.00']
```

## Model Details

### Components

The model consists of a frozen ViT-H/14 image encoder from [laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and a 12-layer, 12-head BERT text encoder initialized from [rinna/japanese-clip-vit-b-16](https://huggingface.co/rinna/japanese-clip-vit-b-16).

### Training

The model was trained by Zhi Wang on 8 A100 (80 GB) GPUs using [Locked-image Tuning (LiT)](https://arxiv.org/abs/2111.07991): the image encoder is kept frozen while the text encoder is trained contrastively. See [the paper](https://www.anlp.jp/proceedings/annual_meeting/2024/pdf_dir/B6-5.pdf) for more details.

### Dataset

The Japanese subset of the [laion2B-multi](https://huggingface.co/datasets/laion/laion2B-multi) dataset, containing roughly 120M image-text pairs.

## Evaluation

### Testing Data

The 5K evaluation set (val2017) of [MS-COCO](https://cocodataset.org/) with [STAIR Captions](http://captions.stair.center/).

### Metrics

Zero-shot image-to-text and text-to-image recall@1, recall@5, and recall@10.
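For reference, recall@k for both retrieval directions can be computed from the image-text similarity matrix. The sketch below is a simplified illustration, not the evaluation script used for the reported numbers; it assumes one ground-truth caption per image and uses random stand-in embeddings in place of real encoder outputs.

```python
import torch


def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    """Fraction of queries whose ground-truth match (index i for query i)
    appears among the top-k scored candidates."""
    topk = similarity.topk(k, dim=1).indices                 # (N, k) candidate ids
    targets = torch.arange(similarity.size(0)).unsqueeze(1)  # (N, 1) ground truth
    return (topk == targets).any(dim=1).float().mean().item()


# Stand-in embeddings; in practice these come from the image and text encoders.
image_embeds = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
text_embeds = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)

sim = image_embeds @ text_embeds.T  # (N_images, N_texts) cosine similarities
for k in (1, 5, 10):
    print(f"I2T R@{k}: {recall_at_k(sim, k):.3f}   "
          f"T2I R@{k}: {recall_at_k(sim.T, k):.3f}")
```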
### Results

| Model | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 |
| :--- | :------: | :------: | :------: | :------: | :------: | :------: |
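For retrieval use cases like those evaluated above, image and text embeddings can also be extracted separately and compared by cosine similarity. The sketch below assumes the model's remote code exposes the `get_image_features` / `get_text_features` methods of the `CLIPModel` / `VisionTextDualEncoderModel` interface it mirrors; check the model's implementation before relying on them. The Japanese query caption is an arbitrary example.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor, BatchEncoding

model_name = "hakuhodo-tech/japanese-clip-vit-h-14-bert-base"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Candidate image(s) and a Japanese text query
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_inputs = BatchEncoding(processor(images=[image], return_tensors="pt")).to(device)
text_inputs = BatchEncoding(
    processor(text=["ソファの上で寝ている猫"], return_tensors="pt", padding=True)
).to(device)

# Assumes CLIPModel-style feature methods are provided by the remote code
with torch.no_grad():
    image_embeds = model.get_image_features(pixel_values=image_inputs["pixel_values"])
    text_embeds = model.get_text_features(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs["attention_mask"],
    )

# Normalize and score: rows are text queries, columns are candidate images
image_embeds = torch.nn.functional.normalize(image_embeds, dim=-1)
text_embeds = torch.nn.functional.normalize(text_embeds, dim=-1)
scores = text_embeds @ image_embeds.T  # text-to-image retrieval scores
print(scores)
```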