TKU410410103 committed
Commit 7cb282c
Parent: ec133d6

Update README.md

Files changed (1): README.md (+85 -2)

README.md CHANGED
@@ -49,7 +49,7 @@ should probably proofread and complete it, then remove this comment. -->
 
 # hubert-large-asr
 
- This model is a fine-tuned version of [rinna/japanese-hubert-large](https://huggingface.co/rinna/japanese-hubert-large) ASR. Initially fine-tuned on the [Reazonspeech(small) dataset](https://huggingface.co/datasets/reazon-research/reazonspeech), it was subsequently further fine-tuned on the [common_voice_11_0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/viewer/ja) for ASR tasks.
+ This model is a fine-tuned version of [rinna/japanese-hubert-large](https://huggingface.co/rinna/japanese-hubert-large) for Japanese ASR. It was first fine-tuned on the [reazonspeech(small) dataset](https://huggingface.co/datasets/reazon-research/reazonspeech) and then further fine-tuned on the [common_voice_11_0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/viewer/ja).
 
 ## Acknowledgments
 
@@ -110,7 +110,7 @@ The following hyperparameters were used during training:
 ### Test results
 The final model was evaluated as follows:
 
- On Reazonspeech:
+ On reazonspeech(tiny):
  - WER: 40.519700%
  - CER: 23.220979%
 
@@ -118,6 +118,89 @@ On common_voice_11_0:
  - WER: 22.705487%
  - CER: 9.399390%
 
+ ### How to use the model
+ 
+ ```python
+ from transformers import HubertForCTC, Wav2Vec2Processor
+ from datasets import load_dataset
+ import torch
+ import torchaudio
+ import librosa
+ import numpy as np
+ import re
+ import MeCab
+ import pykakasi
+ from evaluate import load
+ 
+ model = HubertForCTC.from_pretrained('TKU410410103/hubert-large-japanese-asr')
+ processor = Wav2Vec2Processor.from_pretrained("TKU410410103/hubert-large-japanese-asr")
+ 
+ # evaluate() below sends tensors to `device`, so define it and move the model once here
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model = model.to(device)
+ model.eval()
+ 
+ # load the Japanese test split and keep only the 'audio' and 'sentence' columns
+ test_dataset = load_dataset('mozilla-foundation/common_voice_11_0', 'ja', split='test')
+ remove_columns = [col for col in test_dataset.column_names if col not in ['audio', 'sentence']]
+ test_dataset = test_dataset.remove_columns(remove_columns)
+ 
+ # resample: the source audio is 48 kHz, the model expects 16 kHz
+ def process_waveforms(batch):
+     speech_arrays = []
+     sampling_rates = []
+ 
+     for audio_path in batch['audio']:
+         speech_array, _ = torchaudio.load(audio_path['path'])
+         speech_array_resampled = librosa.resample(np.asarray(speech_array[0].numpy()), orig_sr=48000, target_sr=16000)
+         speech_arrays.append(speech_array_resampled)
+         sampling_rates.append(16000)
+ 
+     batch["array"] = speech_arrays
+     batch["sampling_rate"] = sampling_rates
+ 
+     return batch
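+ # Note: orig_sr=48000 is hard-coded on the assumption that every clip ships at
+ # 48 kHz, which holds for common_voice_11_0. An equivalent approach is
+ # test_dataset.cast_column("audio", datasets.Audio(sampling_rate=16000)),
+ # which lets the datasets library resample lazily instead.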
+ 
+ # normalize transcripts to hiragana and strip punctuation before scoring
+ CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", ";", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
+                    "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
+                    "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
+                    "、", "﹂", "﹁", "‧", "~", "﹏", ",", "{", "}", "(", ")", "[", "]", "【", "】", "‥", "〽",
+                    "『", "』", "〝", "〟", "⟨", "⟩", "〜", ":", "!", "?", "♪", "؛", "/", "\\", "º", "−", "^", "'", "ʻ", "ˆ"]
+ chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"
+ 
+ # MeCab splits sentences into words; pykakasi converts kanji/katakana to hiragana
+ wakati = MeCab.Tagger("-Owakati")
+ kakasi = pykakasi.kakasi()
+ kakasi.setMode("J","H")  # kanji -> hiragana
+ kakasi.setMode("K","H")  # katakana -> hiragana
+ kakasi.setMode("r","Hepburn")  # Hepburn romanization where romaji is produced
+ conv = kakasi.getConverter()
+ 
+ def prepare_char(batch):
+     batch["sentence"] = conv.do(wakati.parse(batch["sentence"]).strip())
+     batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).strip()
+     return batch
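+ # References become space-separated hiragana, so WER is computed over MeCab's word
+ # segmentation and CER over hiragana characters. (setMode/getConverter is pykakasi's
+ # legacy API; newer pykakasi releases prefer kakasi.convert() instead.)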
+ 
+ resampled_eval_dataset = test_dataset.map(process_waveforms, batched=True, batch_size=50, num_proc=4)
+ eval_dataset = resampled_eval_dataset.map(prepare_char, num_proc=4)
+ 
+ # begin the evaluation process
+ wer = load("wer")
+ cer = load("cer")
+ 
+ def evaluate(batch):
+     inputs = processor(batch["array"], sampling_rate=16_000, return_tensors="pt", padding=True)
+     with torch.no_grad():
+         logits = model(inputs.input_values.to(device), attention_mask=inputs.attention_mask.to(device)).logits
+     pred_ids = torch.argmax(logits, dim=-1)
+     batch["pred_strings"] = processor.batch_decode(pred_ids)
+     return batch
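+ # padding=True pads each batch to its longest clip, and the attention_mask keeps the
+ # model from treating that padding as speech.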
+ 
+ columns_to_remove = [column for column in eval_dataset.column_names if column != "sentence"]
+ batch_size = 16
+ result = eval_dataset.map(evaluate, remove_columns=columns_to_remove, batched=True, batch_size=batch_size)
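+ # map() runs evaluate() over batches of 16 and drops every column except 'sentence',
+ # so `result` holds just the references ('sentence') and predictions ('pred_strings').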
+ 
+ wer_result = wer.compute(predictions=result["pred_strings"], references=result["sentence"])
+ cer_result = cer.compute(predictions=result["pred_strings"], references=result["sentence"])
+ 
+ print("WER: {:.2f}%".format(100 * wer_result))
+ print("CER: {:.2f}%".format(100 * cer_result))
+ ```
 ### Framework versions
 
  - Transformers 4.39.1
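For a quick smoke test outside the full evaluation recipe above, a minimal sketch along these lines should transcribe a single clip (`sample.wav` is a hypothetical local file; the resample step assumes the model's 16 kHz input rate):

```python
import torch
import torchaudio
from transformers import HubertForCTC, Wav2Vec2Processor

model = HubertForCTC.from_pretrained("TKU410410103/hubert-large-japanese-asr")
processor = Wav2Vec2Processor.from_pretrained("TKU410410103/hubert-large-japanese-asr")
model.eval()

# load one clip and resample it to the 16 kHz the model expects
waveform, sr = torchaudio.load("sample.wav")  # hypothetical path
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

inputs = processor(waveform[0], sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# greedy CTC decoding, as in the evaluation script above
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```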