metadata
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
language: ja
license: apache-2.0
datasets: reazon-research/reazonspeech
pipeline_tag: feature-extraction
inference: false
tags:
- wav2vec2
- speech
rinna/japanese-wav2vec2-base
Overview
This is a Japanese wav2vec 2.0 Base model trained by rinna Co., Ltd.
Model summary
The model architecture is the same as the original wav2vec 2.0 Base model, which contains 12 transformer layers with 12 attention heads. The model was trained using code from the official repository, and the detailed training configuration can be found in the same repository and the original paper.
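As a quick sanity check of that configuration, the following minimal sketch (not part of the original card, and assuming the checkpoint exposes a standard transformers Wav2Vec2Config) prints the relevant architecture fields:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("rinna/japanese-wav2vec2-base")
print(config.num_hidden_layers)    # 12 transformer layers
print(config.num_attention_heads)  # 12 attention heads
print(config.hidden_size)          # 768-dimensional hidden states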
Training
The model was trained on approximately 19,000 hours of the Japanese speech corpus ReazonSpeech v1.
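To inspect that corpus, the sketch below streams one example from the Hugging Face dataset; the "tiny" subset name and the trust_remote_code flag are assumptions about the dataset repository, not details given in this card.

from datasets import load_dataset

# Stream a single example from the ReazonSpeech corpus; the subset name is an assumption.
ds = load_dataset("reazon-research/reazonspeech", "tiny", split="train",
                  streaming=True, trust_remote_code=True)
sample = next(iter(ds))
print(sample.keys())  # audio and transcription fields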
Contributors
- Yukiya Hono
- Kentaro Mitsui
- Kei Sawada
How to use the model
import soundfile as sf
from transformers import AutoFeatureExtractor, AutoModel

model_name = "rinna/japanese-wav2vec2-base"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# Load a 16 kHz mono speech waveform; replace the placeholder path with your own audio file.
audio_file = "sample.wav"
raw_speech_16kHz, sr = sf.read(audio_file)

inputs = feature_extractor(
    raw_speech_16kHz,
    return_tensors="pt",
    sampling_rate=sr,
)
outputs = model(**inputs)
print(f"Input:  {inputs.input_values.size()}")  # [1, #samples]
print(f"Output: {outputs.last_hidden_state.size()}")  # [1, #frames, 768]
A fairseq checkpoint file is also available here.
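For fairseq users, a hedged sketch of loading that checkpoint with fairseq's standard checkpoint utilities; the local file name is a placeholder for wherever the downloaded checkpoint is stored, not a path stated in this card.

from fairseq import checkpoint_utils

# Path to the downloaded fairseq checkpoint (placeholder file name).
ckpt_path = "japanese-wav2vec2-base.pt"
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
w2v_model = models[0]
w2v_model.eval()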
How to cite
@misc{rinna-japanese-wav2vec2-base,
title={rinna/japanese-wav2vec2-base},
author={Hono, Yukiya and Mitsui, Kentaro and Sawada, Kei},
url={https://huggingface.co/rinna/japanese-wav2vec2-base}
}
Citations
@inproceedings{baevski2020wav2vec,
title={wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
author={Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael},
booktitle={Advances in Neural Information Processing Systems},
volume={33},
pages={12449--12460},
year={2020},
url={https://proceedings.neurips.cc/paper/2020/hash/92d1e1eb1cd6f9fba3227870bb6d7f07-Abstract.html}
}