---
license: apache-2.0
tags:
- generated_from_trainer
metrics:
- wer
- cer
model-index:
- name: hubert-large-japanese-asr
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Reazonspeech
      type: custom
      args: ja
    metrics:
    - name: Test WER
      type: wer
      value: 40.5197
    - name: Test CER
      type: cer
      value: 23.220979
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common_voice_11_0
      type: common_voice
      args: ja
    metrics:
    - name: Test WER
      type: wer
      value: 22.705487
    - name: Test CER
      type: cer
      value: 9.39939
datasets:
- reazon-research/reazonspeech
- mozilla-foundation/common_voice_11_0
language:
- ja
---



	# hubert-large-asr

This model is a fine-tuned version of [rinna/japanese-hubert-large](https://huggingface.co/rinna/japanese-hubert-large) for Japanese automatic speech recognition (ASR). It was first fine-tuned on the [Reazonspeech (small) dataset](https://huggingface.co/datasets/reazon-research/reazonspeech) and then further fine-tuned on the Japanese subset of the [common_voice_11_0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/viewer/ja).
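
A minimal inference sketch, assuming the checkpoint is published on the Hugging Face Hub together with its processor files and that `ffmpeg` is available for audio decoding. The helper name `transcribe` and its arguments are illustrative and not part of this card; the repository ID is deliberately left as a parameter because the card does not state it.

```python
def transcribe(audio_path: str, model_id: str) -> str:
    """Transcribe one audio file with a CTC speech recognition checkpoint.

    `model_id` is the Hub repository ID of this model (not stated in the card,
    so the caller has to supply it).
    """
    # Imported lazily so this helper can be defined without transformers installed.
    from transformers import pipeline

    # The ASR pipeline loads the model and processor, decodes the audio file,
    # and resamples it to the sampling rate the feature extractor expects.
    asr = pipeline("automatic-speech-recognition", model=model_id)
    result = asr(audio_path)
    return result["text"]
```

For example, `transcribe("sample.wav", "<namespace>/hubert-large-japanese-asr")`, where both the file name and `<namespace>` are placeholders.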

	## Acknowledgments

The fine-tuning approach for this model draws on the training methodology used in [vumichien/wav2vec2-large-xlsr-japanese-hiragana](https://huggingface.co/vumichien/wav2vec2-large-xlsr-japanese-hiragana).


	## Training procedure

The model was fine-tuned in two stages: first on the Reazonspeech (small) dataset, then on the common_voice_11_0 dataset. The training steps and results for each stage are listed below.

	### Training on Reazonspeech
The initial fine-tuning on the Reazonspeech (small) dataset produced the following metrics (WER is reported as a fraction, not a percentage):

| Step  | Training Loss | Validation Loss | WER      |
|-------|---------------|-----------------|----------|
| 1000  | 12.29880      | 3.610288        | 1.00000  |
| 2000  | 3.601800      | 3.505306        | 1.00000  |
| 3000  | 2.80300       | 1.948012        | 0.722361 |
| 4000  | 1.961500      | 1.545842        | 0.558738 |
| 5000  | 1.712000      | 1.420027        | 0.509049 |
| 6000  | 1.565500      | 1.235171        | 0.466279 |
| 7000  | 1.504900      | 1.160565        | 0.461829 |
| 8000  | 1.409800      | 1.088012        | 0.427435 |
| 9000  | 1.358800      | 1.097211        | 0.409861 |
| 10000 | 1.318600      | 1.062294        | 0.403694 |
| 11000 | 1.258500      | 1.026783        | 0.385464 |
| 12000 | 1.245100      | 1.024860        | 0.379845 |
| 13000 | 1.217700      | 0.985201        | 0.375634 |
| 14000 | 1.187900      | 0.977686        | 0.367163 |
| 15000 | 1.168100      | 0.978529        | 0.363656 |
| 16000 | 1.135800      | 0.965668        | 0.363942 |
| 17000 | 1.140600      | 0.953237        | 0.360912 |


	### Training on common_voice_11_0
	After fine-tuning on Reazonspeech, further fine-tuning was performed on the common_voice_11_0 dataset, leading to the following results:

| Step | Training Loss | Validation Loss | WER      |
|------|---------------|-----------------|----------|
| 1000 | 1.08950       | 0.49275         | 0.302035 |
| 2000 | 0.86100       | 0.45113         | 0.266950 |
| 3000 | 0.76240       | 0.442281        | 0.244981 |
| 4000 | 0.70170       | 0.411666        | 0.234287 |
| 5000 | 0.66400       | 0.411769        | 0.227942 |
| 6000 | 0.63810       | 0.413067        | 0.225690 |

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 1e-4
	- train_batch_size: 8
	- eval_batch_size: 8
	- seed: 42
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 10
	- lr_scheduler_type: linear
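
As a rough illustration of how these settings combine, the sketch below computes the effective batch size in the way the Hugging Face Trainer usually reports it (per-device batch size × gradient accumulation steps × number of devices) and the learning rate under a linear schedule. The device count, warmup length, and total step count are assumptions, since the card does not state them, and the function names are illustrative. With a single device the product is 16, which does not match the listed total of 10; the card does not say which value is off.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Samples contributing to one optimizer update."""
    return per_device * grad_accum * num_devices


def linear_lr(step: int, total_steps: int, base_lr: float = 1e-4, warmup_steps: int = 0) -> float:
    """Linear schedule: optional linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# train_batch_size=8 and gradient_accumulation_steps=2 come from the list above;
# num_devices=1 is an assumption.
total = effective_batch_size(per_device=8, grad_accum=2, num_devices=1)
print(total)
```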

	### Test results
	The final model was evaluated as follows:

	On Reazonspeech:
	- WER: 40.519700%
	- CER: 23.220979%

	On common_voice_11_0:
	- WER: 22.705487%
	- CER: 9.399390%
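
For reference, WER and CER are both edit-distance error rates: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. The sketch below is a plain-Python illustration of that definition, not the evaluation script used for this model. It assumes text that is already segmented into space-separated words for WER (Japanese has no spaces, and the card does not say which tokenizer was used) and ignores spaces for CER.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens: Sequence[str], hyp_tokens: Sequence[str]) -> float:
    """Edit distance divided by reference length, as a percentage."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(reference: str, hypothesis: str) -> float:
    # Characters are the tokens; dropping spaces is an assumption.
    return error_rate(list(reference.replace(" ", "")), list(hypothesis.replace(" ", "")))


def wer(reference: str, hypothesis: str) -> float:
    # Assumes both strings are already segmented into space-separated words.
    return error_rate(reference.split(), hypothesis.split())
```

For example, `cer("こんにちは", "こんばんは")` gives 40.0, because two of the five reference characters are substituted.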

	### Framework versions

	- Transformers 4.39.1
- PyTorch 2.2.1+cu118
	- Datasets 2.17.1