Phoneme recognition
Is it possible to use Whisper to output a phoneme transcription instead of a text transcription?
See related discussion: https://github.com/openai/whisper/discussions/318
Hi @sanchit-gandhi,
Thank you for pointing me towards the discussions page.
If I understand it correctly, Whisper currently cannot output phoneme transcriptions. However, one response said that one could train a Whisper model with audio + phoneme transcriptions instead of the recommended audio + text transcriptions. Is this possible? For fine-tuning Whisper with audio + phoneme transcriptions, I would be using the pretrained feature extractor and tokenizer as per your blog https://huggingface.co/blog/fine-tune-whisper.
Please let me know your thoughts on this.
Thanks!
Hey @dg96 - that's a cool proposition! I think we could fine-tune Whisper for phoneme transcriptions. The feature extractor can stay the same (we can pre-process the audio in the same way as before). We'd need to change the tokenizer to handle the new vocabulary. Namely, what we need to do is build a new tokenizer over the possible phonemes. For this, you can follow this guide: https://huggingface.co/learn/nlp-course/chapter6/8?fw=pt
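As a rough illustration, here's a minimal word-level sketch of that step, assuming the phoneme transcriptions are whitespace-separated symbols (the corpus, special tokens, and save path below are placeholders I've made up, not anything from the guide):

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# hypothetical corpus of whitespace-separated phoneme strings
phoneme_corpus = ["h ə l oʊ w ɜː l d", "ɡ ʊ d m ɔː r n ɪ ŋ"]

# word-level model: each whitespace-separated phoneme is one token
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()
trainer = trainers.WordLevelTrainer(special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(phoneme_corpus, trainer=trainer)

# wrap it so it can be saved and re-loaded through Transformers
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer, unk_token="[UNK]", pad_token="[PAD]"
)
wrapped_tokenizer.save_pretrained("./phoneme-tokenizer")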
You should then have a tokenizer that you can load with HF Transformers:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(...)
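For instance, pointing it at the hypothetical save directory from the sketch above:

tokenizer = AutoTokenizer.from_pretrained("./phoneme-tokenizer")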
Once we have built our new tokenizer, we need to make sure that the Whisper embedding layer has the same dimensionality as the number of tokens:
from transformers import WhisperForConditionalGeneration

# e.g. the checkpoint used in the fine-tuning blog
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
# new random embeddings for our phoneme tokens
model.resize_token_embeddings(len(tokenizer))
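As a quick sanity check (using the sketch tokenizer from above), the embedding matrix should now match the new vocabulary size:

assert model.get_input_embeddings().weight.shape[0] == len(tokenizer)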
Once we've done that, the Whisper model will be set up to predict phonemes instead of sub-word tokens. You can then fine-tune the model on an (audio, phoneme) dataset in exactly the same way as the fine-tuning blog describes. You might want to change the compute_metrics function to a metric more applicable to phoneme prediction than WER, such as a phoneme error rate (PER).
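For example, here's a sketch of such a compute_metrics, adapted from the blog's WER version; since the labels are whitespace-separated phonemes, re-using the WER implementation from the evaluate library over the decoded strings effectively gives a phoneme error rate (the "per" key and this re-use are my assumptions):

import evaluate

# edit distance over whitespace-separated phonemes = phoneme error rate
metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # replace -100 (positions ignored by the loss) with the pad token id
    label_ids[label_ids == -100] = tokenizer.pad_token_id
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    per = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"per": per}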
I am not an expert on Whisper, but a related use case needs timing data as well. For example, to control a 3D animated character's facial expressions, you need the phonemes plus timing data for each phoneme; otherwise the lip sync can drift out of alignment.