	---
	license: apache-2.0
	tags:
	- generated_from_trainer
	metrics:
	- wer
	- cer
	model-index:
	- name: hubert-large-japanese-asr
	  results:
	  - task:
	      name: Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Reazonspeech
	      type: custom
	      args: ja
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 40.5197
	    - name: Test CER
	      type: cer
	      value: 23.220979
	  - task:
	      name: Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: common_voice_11_0
	      type: common_voice
	      args: ja
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 22.705487
	    - name: Test CER
	      type: cer
	      value: 9.39939
	datasets:
	- reazon-research/reazonspeech
	- mozilla-foundation/common_voice_11_0
	language:
	- ja
	---



	# hubert-large-japanese-asr

	This model is a version of [rinna/japanese-hubert-large](https://huggingface.co/rinna/japanese-hubert-large) fine-tuned for automatic speech recognition (ASR). It was first fine-tuned on the [reazonspeech (small) dataset](https://huggingface.co/datasets/reazon-research/reazonspeech) and then further fine-tuned on the [common_voice_11_0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/viewer/ja).

	This model predicts Hiragana only; its transcriptions contain no kanji or katakana.
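
Because the output alphabet is Hiragana, any reference text compared against the model's predictions has to be converted to Hiragana first. Katakana can be mapped by a fixed Unicode offset, as in the minimal sketch below; kanji cannot, and need a dictionary-based reader such as the pykakasi converter used in the evaluation script. The function names are illustrative and not part of this model's code.

```python
# Illustrative helpers (hypothetical names, not part of the model repository).
KATAKANA_START = 0x30A1  # ァ
KATAKANA_END = 0x30F6    # ヶ
HIRAGANA_START = 0x3041  # ぁ
HIRAGANA_END = 0x3096    # ゖ
OFFSET = 0x60            # distance between the Katakana and Hiragana blocks


def katakana_to_hiragana(text: str) -> str:
    """Map each Katakana letter to its Hiragana counterpart; leave other characters unchanged."""
    return "".join(
        chr(ord(ch) - OFFSET) if KATAKANA_START <= ord(ch) <= KATAKANA_END else ch
        for ch in text
    )


def is_hiragana_only(text: str) -> bool:
    """True if every non-space character is a Hiragana letter or the long-vowel mark."""
    return all(
        ch.isspace() or HIRAGANA_START <= ord(ch) <= HIRAGANA_END or ch == "ー"
        for ch in text
    )


print(katakana_to_hiragana("カタカナ"))            # かたかな
print(is_hiragana_only("きょう は いい てんき"))  # True
print(is_hiragana_only("今日 は いい 天気"))      # False
```

A check like `is_hiragana_only` can be run over reference transcripts before scoring to catch sentences where the conversion left kanji behind.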

	## Acknowledgments

	The fine-tuning approach for this model was inspired by, and draws on, the training methodology used in [vumichien/wav2vec2-large-xlsr-japanese-hiragana](https://huggingface.co/vumichien/wav2vec2-large-xlsr-japanese-hiragana).


	## Training procedure

	The model was fine-tuned in two stages: first on the Reazonspeech dataset, then on the common_voice_11_0 dataset. The training steps and results for each stage are given below.

	### Training on Reazonspeech
	The initial fine-tuning on the Reazonspeech (small) dataset produced the following metrics:

	| Step  | Training Loss | Validation Loss | WER      |
	|-------|---------------|-----------------|----------|
	| 1000  | 12.29880      | 3.610288        | 1.00000  |
	| 2000  | 3.601800      | 3.505306        | 1.00000  |
	| 3000  | 2.80300       | 1.948012        | 0.722361 |
	| 4000  | 1.961500      | 1.545842        | 0.558738 |
	| 5000  | 1.712000      | 1.420027        | 0.509049 |
	| 6000  | 1.565500      | 1.235171        | 0.466279 |
	| 7000  | 1.504900      | 1.160565        | 0.461829 |
	| 8000  | 1.409800      | 1.088012        | 0.427435 |
	| 9000  | 1.358800      | 1.097211        | 0.409861 |
	| 10000 | 1.318600      | 1.062294        | 0.403694 |
	| 11000 | 1.258500      | 1.026783        | 0.385464 |
	| 12000 | 1.245100      | 1.024860        | 0.379845 |
	| 13000 | 1.217700      | 0.985201        | 0.375634 |
	| 14000 | 1.187900      | 0.977686        | 0.367163 |
	| 15000 | 1.168100      | 0.978529        | 0.363656 |
	| 16000 | 1.135800      | 0.965668        | 0.363942 |
	| 17000 | 1.140600      | 0.953237        | 0.360912 |


	### Training on common_voice_11_0
	After fine-tuning on Reazonspeech, further fine-tuning was performed on the common_voice_11_0 dataset, leading to the following results:

	| Step | Training Loss | Validation Loss | WER      |
	|------|---------------|-----------------|----------|
	| 1000 | 1.08950       | 0.49275         | 0.302035 |
	| 2000 | 0.86100       | 0.45113         | 0.266950 |
	| 3000 | 0.76240       | 0.442281        | 0.244981 |
	| 4000 | 0.70170       | 0.411666        | 0.234287 |
	| 5000 | 0.66400       | 0.411769        | 0.227942 |
	| 6000 | 0.63810       | 0.413067        | 0.225690 |
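
The validation loss and the WER do not bottom out at the same step, which is easy to miss when reading the table. The sketch below uses only the numbers from the common_voice_11_0 table to find the step with the lowest validation loss, the step with the lowest WER, and the relative WER reduction between the first and last logged steps. The variable names are illustrative.

```python
# (step, validation_loss, wer) rows copied from the common_voice_11_0 table.
rows = [
    (1000, 0.49275, 0.302035),
    (2000, 0.45113, 0.266950),
    (3000, 0.442281, 0.244981),
    (4000, 0.411666, 0.234287),
    (5000, 0.411769, 0.227942),
    (6000, 0.413067, 0.225690),
]

# Step at which each metric reaches its minimum.
best_loss_step = min(rows, key=lambda r: r[1])[0]
best_wer_step = min(rows, key=lambda r: r[2])[0]

# Relative WER reduction between the first and last logged steps.
first_wer, last_wer = rows[0][2], rows[-1][2]
relative_reduction = (first_wer - last_wer) / first_wer

print(f"lowest validation loss at step {best_loss_step}")  # 4000
print(f"lowest WER at step {best_wer_step}")                # 6000
print(f"relative WER reduction, step 1000 to 6000: {relative_reduction:.1%}")  # 25.3%
```

Validation loss is lowest at step 4000 while WER keeps falling through step 6000, so which checkpoint looks best depends on the metric being prioritized.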

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 1e-4
	- train_batch_size: 8
	- eval_batch_size: 8
	- seed: 42
	- gradient_accumulation_steps: 2
	- num_train_epochs: 10
	- lr_scheduler_type: linear
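
Two quantities follow from these settings: the effective batch size per optimizer step and the learning rate at a given step under a linear schedule. The sketch below assumes a single device and no warmup, neither of which is stated above, and uses a hypothetical total step count purely for illustration.

```python
LEARNING_RATE = 1e-4
TRAIN_BATCH_SIZE = 8
GRADIENT_ACCUMULATION_STEPS = 2

# Assumption: one device. With several devices the product is multiplied by their count.
effective_batch_size = TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS


def linear_lr(step: int, total_steps: int, base_lr: float = LEARNING_RATE,
              warmup_steps: int = 0) -> float:
    """Optional linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 6000  # hypothetical; the real total depends on dataset size and epochs

print("effective batch size:", effective_batch_size)
for step in (0, 3000, 6000):
    print(step, linear_lr(step, TOTAL_STEPS))
```

With no warmup the rate starts at the configured `learning_rate`, halves by the midpoint, and reaches zero at the final step.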

	### How to evaluate the model

	```python
	from transformers import HubertForCTC, Wav2Vec2Processor
	from datasets import load_dataset
	import torch
	import torchaudio
	import librosa
	import re
	import MeCab
	import pykakasi
	from evaluate import load

	# Run on GPU when available; `device` is used in evaluate() below.
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	model = HubertForCTC.from_pretrained("TKU410410103/hubert-large-japanese-asr").to(device)
	processor = Wav2Vec2Processor.from_pretrained("TKU410410103/hubert-large-japanese-asr")

	# load dataset, keeping only the audio and transcript columns
	test_dataset = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="test")
	remove_columns = [col for col in test_dataset.column_names if col not in ["audio", "sentence"]]
	test_dataset = test_dataset.remove_columns(remove_columns)

	# resample every clip to the 16 kHz rate the model expects
	def process_waveforms(batch):
	    speech_arrays = []
	    sampling_rates = []

	    for audio in batch["audio"]:
	        # Use the rate reported by the file rather than assuming 48 kHz.
	        speech_array, orig_sr = torchaudio.load(audio["path"])
	        speech_array_resampled = librosa.resample(speech_array[0].numpy(), orig_sr=orig_sr, target_sr=16000)
	        speech_arrays.append(speech_array_resampled)
	        sampling_rates.append(16000)

	    batch["array"] = speech_arrays
	    batch["sampling_rate"] = sampling_rates

	    return batch

	# hiragana: punctuation and symbols stripped from the reference text
	CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
	                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
	                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
	                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
	                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "'", "ʻ", "ˆ"]
	chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

	# MeCab splits each sentence into space-separated words; pykakasi converts
	# kanji ("J") and katakana ("K") to hiragana ("H").
	# Note: setMode/getConverter is the legacy pykakasi interface.
	wakati = MeCab.Tagger("-Owakati")
	kakasi = pykakasi.kakasi()
	kakasi.setMode("J", "H")
	kakasi.setMode("K", "H")
	kakasi.setMode("r", "Hepburn")
	conv = kakasi.getConverter()

	def prepare_char(batch):
	    batch["sentence"] = conv.do(wakati.parse(batch["sentence"]).strip())
	    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).strip()
	    return batch


	resampled_eval_dataset = test_dataset.map(process_waveforms, batched=True, batch_size=50, num_proc=4)
	eval_dataset = resampled_eval_dataset.map(prepare_char, num_proc=4)

	# begin the evaluation process
	wer = load("wer")
	cer = load("cer")

	def evaluate(batch):
	    inputs = processor(batch["array"], sampling_rate=16_000, return_tensors="pt", padding=True)
	    with torch.no_grad():
	        logits = model(inputs.input_values.to(device), attention_mask=inputs.attention_mask.to(device)).logits
	    # Greedy CTC decoding: most likely token per frame, then collapse and decode.
	    pred_ids = torch.argmax(logits, dim=-1)
	    batch["pred_strings"] = processor.batch_decode(pred_ids)
	    return batch

	columns_to_remove = [column for column in eval_dataset.column_names if column != "sentence"]
	batch_size = 16
	result = eval_dataset.map(evaluate, remove_columns=columns_to_remove, batched=True, batch_size=batch_size)

	wer_result = wer.compute(predictions=result["pred_strings"], references=result["sentence"])
	cer_result = cer.compute(predictions=result["pred_strings"], references=result["sentence"])

	# Six decimal places, matching the figures reported under "Test results".
	print("WER: {:.6f}%".format(100 * wer_result))
	print("CER: {:.6f}%".format(100 * cer_result))
	```
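
The `torch.argmax` plus `processor.batch_decode` step in the script is greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop the blank token. The pure-Python sketch below illustrates the idea on a toy vocabulary. The vocabulary, the blank index, and the function name are made up for the example and do not reflect this model's actual tokenizer.

```python
from itertools import groupby

# Toy vocabulary (hypothetical). Index 0 plays the role of the CTC blank.
VOCAB = {0: "<blank>", 1: "き", 2: "ょ", 3: "う", 4: " "}
BLANK_ID = 0


def ctc_greedy_decode(frame_ids: list[int]) -> str:
    """Collapse runs of identical ids, then remove blanks and map ids to text."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(VOCAB[i] for i in collapsed if i != BLANK_ID)


# Per-frame argmax ids, standing in for torch.argmax(logits, dim=-1) on one utterance.
frames = [0, 1, 1, 0, 2, 2, 3, 3, 3, 0, 0]
print(ctc_greedy_decode(frames))  # きょう

# A blank between two identical ids keeps both copies.
print(ctc_greedy_decode([1, 0, 1]))  # きき
```

Repeats are merged before blanks are removed, which is why a blank frame is what lets the same character appear twice in a row.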

	### Test results
	The final model was evaluated as follows:

	On reazonspeech(tiny):
	- WER: 40.519700%
	- CER: 23.220979%

	On common_voice_11_0:
	- WER: 22.705487%
	- CER: 9.399390%
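
Both metrics are edit-distance ratios: the number of substitutions, deletions, and insertions needed to turn the reference into the prediction, divided by the reference length, counted in words for WER and in characters for CER. Japanese is written without spaces, so the word boundaries come from the MeCab segmentation applied in the evaluation script, and WER depends on that segmentation. The sketch below is a simplified single-sentence version, not the `evaluate` implementation; in particular, counting spaces as characters in CER is an assumption.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (unit cost for each edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate for one sentence; words are whitespace-separated."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate for one sentence. Assumption: spaces count as characters."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "きょう は いい てんき"
hypothesis = "きょう わ いい てんき"

print(f"WER: {wer(reference, hypothesis):.4f}")  # WER: 0.2500
print(f"CER: {cer(reference, hypothesis):.4f}")  # CER: 0.0833
```

One wrong character costs a whole word in WER (1 of 4) but a single character in CER (1 of 12), which is why the WER figures above are consistently higher than the CER figures.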

	### Framework versions

	- Transformers 4.39.1
	- PyTorch 2.2.1+cu118
	- Datasets 2.17.1