---
license: apache-2.0
base_model: openai/whisper-large-v3
tags:
- generated_from_trainer
metrics:
- wer
model-index:
- name: Hibiki_ASR_Phonemizer
  results: []
language:
- ja
---
# Hibiki ASR Phonemizer
This model is a phoneme-level speech recognition network, fine-tuned from openai/whisper-large-v3 on a mixture of different Japanese datasets.

In addition to transcribing speech into phonemes, it can:

- detect and transcribe non-speech sounds such as gasps, erotic moans, laughter, etc.
- add punctuation more faithfully.

A grapheme decoder head (i.e. one that outputs ordinary Japanese text) will probably be trained as well, though going directly from audio to phonemes gives a more accurate representation for Japanese.
It achieves the following results on the evaluation set:
- Loss: 0.2186
- Wer: 21.6707
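The WER above is computed on the model's phonemic transcriptions. If you want to score your own outputs the same way, here is a minimal sketch using the evaluate library (the example strings below are hypothetical, and jiwer must be installed as well):

```python
import evaluate

# Load the standard WER metric (requires: pip install evaluate jiwer).
wer_metric = evaluate.load("wer")

# Hypothetical phoneme strings for illustration; the numbers above
# come from the trainer's own evaluation loop.
predictions = ["ki ni ɕinai de kɯdasai"]
references = ["ki ni ɕinai de kɯdasai jo"]

print(100 * wer_metric.compute(predictions=predictions, references=references))
```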
## Inference and Post-processing
```python
# The post_fix function below was borrowed and modified from Aaron Yinghao Li,
# the author of the StyleTTS paper.
import re

import jaconv
from datasets import Dataset, Audio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Kana-to-phoneme mapping; only a few entries are shown here.
kana_mapper = dict([
    ("ゔぁ", "ba"),
    # ... take a look at the Notebook for the whole mapping.
    ("ぉ", " o"),
    ("ゎ", " ɯa"),
    ("を", "o"),
])

def post_fix(text):
    # Replace any leftover kana with their phonemic spellings.
    for k, v in kana_mapper.items():
        text = text.replace(k, v)
    return text

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("Respair/Hibiki_ASR_Phonemizer").to("cuda:0")

forced_decoder_ids = processor.get_decoder_prompt_ids(task="transcribe", language="japanese")

sample = Dataset.from_dict({"audio": ["/content/kl_chunk1987.wav"]}).cast_column("audio", Audio(16000))
sample = sample[0]["audio"]

# Ensure the input features are on the same device as the model.
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features.to("cuda:0")

# Generate token ids.
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids, repetition_penalty=1.2)

# Decode token ids to text.
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

# You can add your final adjustments here. A lookup dict would be cleaner,
# but this is just a quick demonstration.
if ' neɽitai ' in transcription[0]:
    transcription[0] = transcription[0].replace(' neɽitai ', "naɽitai")

if 'harɯdʑisama' in transcription[0]:
    transcription[0] = transcription[0].replace('harɯdʑisama', "arɯdʑisama")

if "ki ni ɕinai" in transcription[0]:
    transcription[0] = re.sub(r'(?<!\s)ki ni ɕinai', r' ki ni ɕinai', transcription[0])

if 'ʔt' in transcription[0]:
    transcription[0] = re.sub(r'(?<!\s)ʔt', r' ʔt', transcription[0])  # insert the missing leading space

if 'de aɽoɯ' in transcription[0]:
    transcription[0] = re.sub(r'(?<!\s)de aɽoɯ', r' de aɽoɯ', transcription[0])

# Safety net: if the model hallucinates kana, convert katakana to hiragana
# and map the result to phonemes.
print(post_fix(jaconv.kata2hira(transcription[0].lstrip())))
```
The full code -> Notebook
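If you don't need the manual post-processing above, the high-level pipeline API is a shorter route. This is a minimal sketch rather than the author's inference code; the device and chunk length are assumptions:

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="Respair/Hibiki_ASR_Phonemizer",
    device=0,           # assumption: one CUDA device; use device=-1 for CPU
    chunk_length_s=30,  # assumption: chunking for long-form audio
)

result = asr(
    "/content/kl_chunk1987.wav",
    generate_kwargs={"language": "japanese", "task": "transcribe", "repetition_penalty": 1.2},
)
print(result["text"])
```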
## Intended uses & limitations
No restrictions are imposed by me, but proceed at your own risk; you, the user, are entirely responsible for your own actions.
## Training and evaluation data
- Japanese Common Voice 17
- ehehe Corpus
- Custom Game and Anime dataset (around 8 hours)
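The Common Voice portion can be loaded with the datasets library. This is a hedged sketch: the dataset id below is the standard Mozilla repository (gated, so you must accept its terms on the Hub first), and the other two corpora are not publicly linked here:

```python
from datasets import load_dataset, Audio

# Japanese training split of Common Voice 17 (gated; accept the terms on the Hub first).
cv_ja = load_dataset("mozilla-foundation/common_voice_17_0", "ja", split="train")

# Whisper expects 16 kHz audio.
cv_ja = cv_ja.cast_column("audio", Audio(sampling_rate=16_000))
```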
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 24
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 6000
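For reference, these map onto Seq2SeqTrainingArguments roughly as sketched below. This is a reconstruction, not the original training script; output_dir and the evaluation cadence are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

# The default AdamW optimizer already uses betas=(0.9, 0.999) and epsilon=1e-8,
# matching the optimizer listed above.
training_args = Seq2SeqTrainingArguments(
    output_dir="./hibiki_asr_phonemizer",  # assumption
    learning_rate=1e-5,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=6000,
    bf16=True,                    # matches the BF16 note under Compute and Duration
    evaluation_strategy="steps",  # assumption
    eval_steps=1000,              # assumption, consistent with the results table below
    predict_with_generate=True,   # assumption, needed to report WER during eval
)
```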
### Training results
| Training Loss | Epoch  | Step | Validation Loss | Wer     |
|:-------------:|:------:|:----:|:---------------:|:-------:|
| 0.2101        | 0.8058 | 1000 | 0.2090          | 30.1840 |
| 0.1369        | 1.6116 | 2000 | 0.1837          | 27.6756 |
| 0.0838        | 2.4174 | 3000 | 0.1829          | 26.4036 |
| 0.0454        | 3.2232 | 4000 | 0.1922          | 20.9549 |
| 0.0434        | 4.0290 | 5000 | 0.2072          | 20.8898 |
| 0.021         | 4.8348 | 6000 | 0.2186          | 21.6707 |
### Compute and Duration
- 1x A100 (40GB)
- 64 GB RAM
- BF16
- 14 hours
### Framework versions
- Transformers 4.41.1
- Pytorch 2.4.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1