	---
	language:
	- ja
	license: apache-2.0
	tags:
	- generated_from_trainer
	datasets:
	- mozilla-foundation/common_voice_11_0
	metrics:
	- wer
	- cer
	model-index:
	- name: uniTKU-hubert-japanese-asr
	  results:
	  - task:
	      type: automatic-speech-recognition
	      name: Speech Recognition
	    dataset:
	      name: common_voice_11_0
	      type: common_voice
	      args: ja
	    metrics:
	    - type: wer
	      value: 27.511982
	      name: Test WER
	    - type: cer
	      value: 11.563649
	      name: Test CER
	---

	# uniTKU-hubert-japanese-asr

	This model was fine-tuned on a dataset provided by uniTKU while retaining its original performance on the [common_voice_11_0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/viewer/ja).

	The model outputs Hiragana only.

	## Training Procedure

	Fine-tuning on the uniTKU dataset led to the following results:

	| Step | Training Loss | Validation Loss | WER      |
	|------|---------------|-----------------|----------|
	| 100  | 1.127100      | 1.089644        | 0.668508 |
	| 200  | 0.873500      | 0.682353        | 0.508287 |
	| 300  | 0.786200      | 0.482965        | 0.397790 |
	| 400  | 0.670400      | 0.345377        | 0.381215 |
	| 500  | 0.719500      | 0.387554        | 0.337017 |
	| 600  | 0.707700      | 0.371083        | 0.292818 |
	| 700  | 0.658300      | 0.236447        | 0.243094 |
	| 800  | 0.611100      | 0.207679        | 0.193370 |

	### Training hyperparameters

	The following hyperparameters were used throughout fine-tuning:

	- learning_rate: 1e-4
	- train_batch_size: 16
	- eval_batch_size: 16
	- gradient_accumulation_steps: 2
	- max_steps: 800
	- lr_scheduler_type: linear
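
As a rough illustration of what these settings imply, the sketch below works out the number of examples per optimizer update and the learning rate under a linear schedule. It assumes single-device training and zero warmup steps, neither of which is stated above. The helper names `effective_batch_size` and `linear_lr` are hypothetical and not part of any library.

```python
# Values taken from the hyperparameter list above.
LEARNING_RATE = 1e-4
TRAIN_BATCH_SIZE = 16
GRADIENT_ACCUMULATION_STEPS = 2
MAX_STEPS = 800


def effective_batch_size(per_device: int, accumulation: int, num_devices: int = 1) -> int:
    """Examples contributing to one optimizer update (num_devices=1 is an assumption)."""
    return per_device * accumulation * num_devices


def linear_lr(step: int, base_lr: float = LEARNING_RATE, max_steps: int = MAX_STEPS) -> float:
    """Linear decay from base_lr at step 0 to 0 at max_steps (no warmup assumed)."""
    remaining = max(0, max_steps - step)
    return base_lr * (remaining / max_steps)


batch = effective_batch_size(TRAIN_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS)
print("examples per update:", batch)
print("examples seen over training:", batch * MAX_STEPS)
for step in (0, 400, 800):
    print(f"step {step}: lr = {linear_lr(step):.2e}")
```

Under these assumptions each update sees 32 examples, so the 800 steps cover 25,600 examples in total.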

	### How to evaluate the model

	```python
	from transformers import HubertForCTC, Wav2Vec2Processor
	from datasets import load_dataset
	import torch
	import torchaudio
	import librosa
	import numpy as np
	import re
	import MeCab
	import pykakasi
	from evaluate import load

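	# Run on GPU when available; `device` is used again in evaluate() below
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	model = HubertForCTC.from_pretrained("TKU410410103/uniTKU-hubert-japanese-asr").to(device)
	model.eval()
	processor = Wav2Vec2Processor.from_pretrained("TKU410410103/uniTKU-hubert-japanese-asr")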

	# load dataset
	test_dataset = load_dataset('mozilla-foundation/common_voice_11_0', 'ja', split='test')
	remove_columns = [col for col in test_dataset.column_names if col not in ['audio', 'sentence']]
	test_dataset = test_dataset.remove_columns(remove_columns)

	# Load each clip and resample it to the 16 kHz rate used for evaluation
	def process_waveforms(batch):
	    speech_arrays = []
	    sampling_rates = []

	    for audio_path in batch['audio']:
	        # Use the clip's actual sampling rate rather than assuming 48 kHz
	        speech_array, orig_sr = torchaudio.load(audio_path['path'])
	        speech_array_resampled = librosa.resample(np.asarray(speech_array[0].numpy()), orig_sr=orig_sr, target_sr=16000)
	        speech_arrays.append(speech_array_resampled)
	        sampling_rates.append(16000)

	    batch["array"] = speech_arrays
	    batch["sampling_rate"] = sampling_rates

	    return batch

	# Reference text is normalized to Hiragana; these characters are stripped before scoring
	CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
	"؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
	"{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
	"、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
	"『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "'", "ʻ", "ˆ"]
	chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

	wakati = MeCab.Tagger("-Owakati")
	kakasi = pykakasi.kakasi()
	kakasi.setMode("J","H")
	kakasi.setMode("K","H")
	kakasi.setMode("r","Hepburn")
	conv = kakasi.getConverter()

	def prepare_char(batch):
	    # Segment with MeCab, convert Kanji and Katakana to Hiragana, then strip punctuation
	    batch["sentence"] = conv.do(wakati.parse(batch["sentence"]).strip())
	    batch["sentence"] = re.sub(chars_to_ignore_regex,'', batch["sentence"]).strip()
	    return batch


	resampled_eval_dataset = test_dataset.map(process_waveforms, batched=True, batch_size=50, num_proc=4)
	eval_dataset = resampled_eval_dataset.map(prepare_char, num_proc=4)

	# begin the evaluation process
	wer = load("wer")
	cer = load("cer")

	def evaluate(batch):
	    inputs = processor(batch["array"], sampling_rate=16_000, return_tensors="pt", padding=True)
	    with torch.no_grad():
	        logits = model(inputs.input_values.to(device), attention_mask=inputs.attention_mask.to(device)).logits
	    # Greedy decoding: pick the most likely token per frame, then decode to text
	    pred_ids = torch.argmax(logits, dim=-1)
	    batch["pred_strings"] = processor.batch_decode(pred_ids)
	    return batch

	columns_to_remove = [column for column in eval_dataset.column_names if column != "sentence"]
	batch_size = 16
	result = eval_dataset.map(evaluate, remove_columns=columns_to_remove, batched=True, batch_size=batch_size)

	wer_result = wer.compute(predictions=result["pred_strings"], references=result["sentence"])
	cer_result = cer.compute(predictions=result["pred_strings"], references=result["sentence"])

	print("WER: {:2f}%".format(100 * wer_result))
	print("CER: {:2f}%".format(100 * cer_result))
	```
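
The `torch.argmax` plus `processor.batch_decode` step in the script is a greedy CTC decode. The toy sketch below shows the idea under simplifying assumptions: the vocabulary `VOCAB`, the blank id `BLANK_ID`, and the function `ctc_greedy_decode` are all hypothetical, and the real tokenizer handles further details (such as word delimiters) that are omitted here.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one ships with the model's processor.
BLANK_ID = 0
VOCAB = {0: "<pad>", 1: "こ", 2: "ん", 3: "に", 4: "ち", 5: "は"}


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse consecutive repeated labels, then drop blank tokens."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(vocab[i] for i in collapsed if i != blank_id)


# One predicted label per audio frame, as argmax over the logits would give.
frames = [0, 1, 1, 0, 2, 2, 2, 0, 3, 4, 4, 0, 5]
print(ctc_greedy_decode(frames))  # こんにちは
```

A blank between two identical labels is what lets the decoder emit a doubled character: `[1, 1, 0, 1]` decodes to two characters, while `[1, 1, 1]` decodes to one.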

	### Test results
	The final model was evaluated as follows:

	On uniTKU Dataset:
	- WER: 19.003370%
	- CER: 11.027523%

	On common_voice_11_0:
	- WER: 27.511982%
	- CER: 11.563649%
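
For readers unfamiliar with the two metrics, the sketch below computes WER and CER from an edit distance over words and characters respectively. It is a minimal pure-Python illustration, not the implementation used by the `evaluate` library, and the function names and example sentences are hypothetical. For Japanese, WER depends on how the text is segmented into words, which the evaluation script delegates to MeCab.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def tokenize(text, unit):
    """Split on whitespace for words; keep every character (spaces included) otherwise."""
    return text.split() if unit == "word" else list(text)


def error_rate(references, predictions, unit="word"):
    """Total edit distance divided by total reference length, over the whole corpus."""
    errors = sum(
        edit_distance(tokenize(r, unit), tokenize(p, unit))
        for r, p in zip(references, predictions)
    )
    total = sum(len(tokenize(r, unit)) for r in references)
    return errors / total


# Hypothetical, already-segmented Hiragana sentences.
refs = ["きょう は いい てんき です"]
preds = ["きょう は いい てんき でした"]
print("WER: {:.2f}%".format(100 * error_rate(refs, preds, unit="word")))
print("CER: {:.2f}%".format(100 * error_rate(refs, preds, unit="char")))
```

In this example one of five words is wrong, so WER is 20%, while only two character edits are needed, which is why CER is typically lower than WER on the same output.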

	### Framework versions

	- Transformers 4.39.1
	- PyTorch 2.2.1+cu118
	- Datasets 2.17.1