---
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
language: ja
license: apache-2.0
datasets: reazon-research/reazonspeech
pipeline_tag: feature-extraction
inference: false
tags:
- wav2vec2
- speech
---

# `rinna/japanese-wav2vec2-base`

![rinna-icon](./rinna.png)

# Overview

This is a Japanese wav2vec 2.0 Base model trained by [rinna Co., Ltd.](https://rinna.co.jp/)

* **Model summary**

    The model architecture is the same as the [original wav2vec 2.0 Base model](https://huggingface.co/facebook/wav2vec2-base), which contains 12 transformer layers, each with 12 attention heads.

    The model was trained using code from the [official repository](https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec). The detailed training configuration can be found in the same repository and in the [original paper](https://proceedings.neurips.cc/paper/2020/hash/92d1e1eb1cd6f9fba3227870bb6d7f07-Abstract.html).

* **Training**

    The model was trained on approximately 19,000 hours of Japanese speech from the ReazonSpeech v1 corpus:

    - [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech)

* **Contributors**

    - [Yukiya Hono](https://huggingface.co/yky-h)
    - [Kentaro Mitsui](https://huggingface.co/Kentaro321)
    - [Kei Sawada](https://huggingface.co/keisawada)

---

# How to use the model

```python
import soundfile as sf
import torch
from transformers import AutoFeatureExtractor, AutoModel

model_name = "rinna/japanese-wav2vec2-base"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# Placeholder path: replace with your own mono, 16 kHz audio file.
audio_file = "path/to/audio.wav"
raw_speech_16kHz, sr = sf.read(audio_file)
inputs = feature_extractor(
    raw_speech_16kHz,
    return_tensors="pt",
    sampling_rate=sr,
)

# Gradients are not needed for feature extraction.
with torch.no_grad():
    outputs = model(**inputs)

print(f"Input:  {inputs.input_values.size()}")  # [1, #samples]
print(f"Output: {outputs.last_hidden_state.size()}")  # [1, #frames, 768]
```
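
The `#frames` dimension in the output shape above is determined by the convolutional feature encoder. As a rough sketch, assuming this checkpoint keeps the kernel sizes and strides of the original wav2vec 2.0 Base configuration (this page does not confirm it), the frame count for a given number of samples can be estimated as follows. `CONV_KERNELS`, `CONV_STRIDES`, and `num_frames` are illustrative names and not part of `transformers`. Under these assumptions, one second of 16 kHz audio maps to 49 frames, roughly one frame every 20 ms, and inputs shorter than 400 samples yield no frame at all.

```python
# Feature-encoder layout of the original wav2vec 2.0 Base model.
# Assumption: this checkpoint uses the same kernel sizes and strides.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples: int) -> int:
    """Estimate how many output frames a waveform of `num_samples` samples yields."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        if length < kernel:
            # Too short for this convolution (no padding): no frame is produced.
            return 0
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


# Frame counts for a few durations at a 16 kHz sampling rate.
for seconds in (1, 5, 10):
    samples = seconds * 16_000
    print(f"{seconds} s ({samples} samples) -> {num_frames(samples)} frames")
```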

A fairseq checkpoint file is also available [here](https://huggingface.co/rinna/japanese-wav2vec2-base/tree/main/fairseq).

---

# How to cite

```bibtex
@misc{rinna-japanese-wav2vec2-base,
    title={rinna/japanese-wav2vec2-base},
    author={Hono, Yukiya and Mitsui, Kentaro and Sawada, Kei},
    url={https://huggingface.co/rinna/japanese-wav2vec2-base}
}
```

---

# Citations

```bibtex
@inproceedings{baevski2020wav2vec,
    title={wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
    author={Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael},
    booktitle={Advances in Neural Information Processing Systems},
    volume={33},
    pages={12449--12460},
    year={2020},
    url={https://proceedings.neurips.cc/paper/2020/hash/92d1e1eb1cd6f9fba3227870bb6d7f07-Abstract.html}
}
```

---

# License

[The Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0)