metadata

license: apache-2.0
tags:
  - generated_from_trainer
metrics:
  - wer
  - cer
model-index:
  - name: hubert-large-japanese-asr
    results:
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Reazonspeech
          type: custom
          args: ja
        metrics:
          - name: Test WER
            type: wer
            value: 40.5197
          - name: Test CER
            type: cer
            value: 23.220979
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: common_voice_11_0
          type: common_voice
          args: ja
        metrics:
          - name: Test WER
            type: wer
            value: 22.705487
          - name: Test CER
            type: cer
            value: 9.39939
datasets:
  - reazon-research/reazonspeech
  - mozilla-foundation/common_voice_11_0
language:
  - ja

hubert-large-asr

This model is a version of rinna/japanese-hubert-large fine-tuned for automatic speech recognition (ASR). It was first fine-tuned on the Reazonspeech (small) dataset and then further fine-tuned on the common_voice_11_0 dataset.

Acknowledgments

The fine-tuning approach was inspired by, and draws on, the training methodology used in vumichien/wav2vec2-large-xlsr-japanese-hiragana.

Training procedure

The model was fine-tuned in two stages: first on the Reazonspeech dataset, then on the common_voice_11_0 dataset. The training steps and results for each stage are given below.

Training on Reazonspeech

The initial fine-tuning on the Reazonspeech (small) dataset produced the following metrics (WER is reported as a fraction, not a percentage):

Step	Training Loss	Validation Loss	WER
1000	12.29880	3.610288	1.00000
2000	3.601800	3.505306	1.00000
3000	2.80300	1.948012	0.722361
4000	1.961500	1.545842	0.558738
5000	1.712000	1.420027	0.509049
6000	1.565500	1.235171	0.466279
7000	1.504900	1.160565	0.461829
8000	1.409800	1.088012	0.427435
9000	1.358800	1.097211	0.409861
10000	1.318600	1.062294	0.403694
11000	1.258500	1.026783	0.385464
12000	1.245100	1.024860	0.379845
13000	1.217700	0.985201	0.375634
14000	1.187900	0.977686	0.367163
15000	1.168100	0.978529	0.363656
16000	1.135800	0.965668	0.363942
17000	1.140600	0.953237	0.360912

Training on common_voice_11_0

After fine-tuning on Reazonspeech, further fine-tuning was performed on the common_voice_11_0 dataset, leading to the following results:

Step	Training Loss	Validation Loss	WER
1000	1.08950	0.49275	0.302035
2000	0.86100	0.45113	0.266950
3000	0.76240	0.442281	0.244981
4000	0.70170	0.411666	0.234287
5000	0.66400	0.411769	0.227942
6000	0.63810	0.413067	0.225690
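The table above can be read programmatically to see which checkpoint scores best under each criterion. The sketch below copies the common_voice_11_0 rows into a list and reports the step with the lowest validation loss, the step with the lowest WER, and the relative WER reduction between the first and last logged steps. The helper names `best_step` and `relative_reduction` are illustrative and not part of the original training code.

```python
# Common Voice fine-tuning log, copied from the table above:
# (step, training loss, validation loss, WER as a fraction)
LOG = [
    (1000, 1.08950, 0.49275, 0.302035),
    (2000, 0.86100, 0.45113, 0.266950),
    (3000, 0.76240, 0.442281, 0.244981),
    (4000, 0.70170, 0.411666, 0.234287),
    (5000, 0.66400, 0.411769, 0.227942),
    (6000, 0.63810, 0.413067, 0.225690),
]


def best_step(log, column):
    """Return the step whose value in `column` (index into a row) is lowest."""
    return min(log, key=lambda row: row[column])[0]


def relative_reduction(first, last):
    """Fractional improvement from `first` to `last` (0.25 means 25% lower)."""
    return (first - last) / first


best_by_val_loss = best_step(LOG, 2)
best_by_wer = best_step(LOG, 3)
wer_gain = relative_reduction(LOG[0][3], LOG[-1][3])

print(f"lowest validation loss at step {best_by_val_loss}")
print(f"lowest WER at step {best_by_wer}")
print(f"relative WER reduction, step 1000 -> 6000: {wer_gain:.1%}")
```

On these rows, validation loss bottoms out at step 4000 while WER keeps improving through step 6000, so the two criteria would select different checkpoints.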

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-4
train_batch_size: 8
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 16
lr_scheduler_type: linear
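
To illustrate what `lr_scheduler_type: linear` means for the learning rate of 1e-4, here is a minimal sketch of a linear schedule with optional warmup. The number of warmup steps and the total number of training steps are not stated in this card, so `warmup_steps=0` and `TOTAL_STEPS = 10_000` are assumptions chosen only for illustration, and `linear_lr` is a hypothetical helper rather than a library function.

```python
def linear_lr(step, total_steps, base_lr=1e-4, warmup_steps=0):
    """Learning rate at `step` under a linear schedule.

    Rises linearly from 0 to `base_lr` over `warmup_steps` (if any),
    then decays linearly to 0 at `total_steps`.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed run length for illustration; the card does not state it.
TOTAL_STEPS = 10_000

for step in (0, 2_500, 5_000, 7_500, 10_000):
    print(step, linear_lr(step, TOTAL_STEPS))
```

With no warmup, the rate starts at the configured value and falls in a straight line to zero at the final step, so it is half the configured value at the midpoint of training.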

Test results

The final model was evaluated as follows:

On Reazonspeech:

WER: 40.519700%
CER: 23.220979%

On common_voice_11_0:

WER: 22.705487%
CER: 9.399390%
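
For readers unfamiliar with the two metrics, WER and CER are both edit-distance ratios: the number of substitutions, insertions and deletions needed to turn the hypothesis into the reference, divided by the reference length, counted in words for WER and in characters for CER. The sketch below is a plain-Python illustration, not the evaluation script used for this model. The example sentences are made up, and it assumes the text is already segmented into space-separated words and that spaces are ignored when counting characters. How the original evaluation tokenized Japanese text is not stated in this card.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Corpus-level error rate in percent for the given tokenizer."""
    errors = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in zip(refs, hyps))
    total = sum(len(tokenize(r)) for r in refs)
    return 100.0 * errors / total


def words(text):
    """Assumes words are already separated by spaces."""
    return text.split()


def chars(text):
    """Characters with spaces removed (an assumption for this sketch)."""
    return list(text.replace(" ", ""))


# Made-up example pair, pre-segmented into words for illustration.
references = ["今日 は 天気 が いい"]
hypotheses = ["今日 は 天気 が 悪い"]

wer = error_rate(references, hypotheses, words)
cer = error_rate(references, hypotheses, chars)
print(f"WER: {wer:.2f}%  CER: {cer:.2f}%")
```

One wrong word out of five also changes only one character out of eight here, which is why CER is typically lower than WER on the same output, as in the results above.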

Framework versions

Transformers 4.39.1
PyTorch 2.2.1+cu118
Datasets 2.17.1