Error When Fine-Tuning the NVIDIA NeMo TTS FastPitch Model on a Custom Dataset
Description:
I am trying to fine-tune the FastPitch model from NVIDIA NeMo on a custom dataset, but I get the error below when running this command:
!(python fastpitch_finetune.py --config-name=fastpitch_align_v1.05.yaml \
train_dataset=./9017_manifest_train_dur_5_mins_local.json \
validation_datasets=./9017_manifest_dev_ns_all_local.json \
sup_data_path=./fastpitch_sup_data \
phoneme_dict_path=tts_dataset_files/cmudict-0.7b_nv22.10 \
heteronyms_path=tts_dataset_files/heteronyms-052722 \
exp_manager.exp_dir=./ljspeech_to_9017_no_mixing_5_mins \
+init_from_nemo_model=./tts_en_fastpitch_align.nemo \
+trainer.max_steps=1000 ~trainer.max_epochs \
trainer.check_val_every_n_epoch=25 \
model.train_ds.dataloader_params.batch_size=24 model.validation_ds.dataloader_params.batch_size=2 \
model.n_speakers=1 model.pitch_mean=152.3 model.pitch_std=64.0 \
model.pitch_fmin=30 model.pitch_fmax=512 model.optim.lr=2e-4 \
~model.optim.sched model.optim.name=adam trainer.devices=1 trainer.strategy=auto \
+model.text_tokenizer.add_blank_at=true \
)
RuntimeError:
The size of tensor a (128) must match the size of tensor b (122) at non-singleton dimension 2
Detailed Error Log:
[NeMo W 2024-06-20 13:08:38 nemo_logging:349] /home/rev9ai/anaconda3/envs/voice_my/lib/python3.10/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
[NeMo W 2024-06-20 13:09:13 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
[NeMo I 2024-06-20 13:09:13 save_restore_connector:263] Model FastPitchModel was successfully restored from /mnt/ssd/hasan/voice_my/tts_en_fastpitch_align.nemo.
...
RuntimeError: The size of tensor a (128) must match the size of tensor b (122) at non-singleton dimension 2
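One check that may help narrow this down, assuming (the log does not confirm this) that dimension 2 is the time axis and the mismatch is between mel-spectrogram frames and cached pitch frames in sup_data_path. The sketch below only illustrates how frame counts diverge when the same clip is processed with a different sample rate or hop length. The sample counts and the hop length of 256 are hypothetical values, not taken from my config.

```python
def stft_frame_count(num_samples: int, hop_length: int) -> int:
    """Number of frames from a centered STFT (as with torch.stft, center=True)."""
    return num_samples // hop_length + 1


def frame_mismatch(mel_frames: int, pitch_frames: int) -> int:
    """Difference between mel frames and cached pitch frames (0 means they agree)."""
    return mel_frames - pitch_frames


# Hypothetical one-second clip, seen at two different sample rates.
HOP_LENGTH = 256  # assumed value for illustration only
frames_22k = stft_frame_count(22050, HOP_LENGTH)
frames_44k = stft_frame_count(44100, HOP_LENGTH)

print("frames at 22050 Hz:", frames_22k)
print("frames at 44100 Hz:", frames_44k)
print("mismatch:", frame_mismatch(frames_44k, frames_22k))
```

If the audio sample rate or the STFT settings differ from those used when the supplementary data was cached, the two lengths would no longer agree, which would produce a size error of this kind.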
Manifest.json Data Sample:
{"audio_filepath": "audio/segment_5440.flac", "text": " Chapter 3", "duration": 1.03, "normalized_text": "chapter three"}
{"audio_filepath": "audio/segment_5441.flac", "text": " Old World Marketing vs. New World Marketing", "duration": 3.08, "normalized_text": "old world marketing versus new world marketing"}
{"audio_filepath": "audio/segment_5442.flac", "text": " Smart orthodontists in this economy research and commit to marketing strategies which are proven and which provide a high return on investment.", "duration": 8.46, "normalized_text": "smart orthodontists in this economy research and commit to marketing strategies which are proven and which provide a high return on investment."}
{"audio_filepath": "audio/segment_5443.flac", "text": " Time and time again I see orthodontists invest in marketing strategies that no longer work, which are not measurable in any format other than money out of their pockets. This chapter will put a stop to this nonsense once and for all. If you decide to listen and implement the proven money-making strategies for you and your practice.", "duration": 19.34, "normalized_text": "time and time again i see orthodontists invest in marketing strategies that no longer work, which are not measurable in any format other than money out of their pockets. this chapter will put a stop to this nonsense once and for all. if you decide to listen and implement the proven money-making strategies for you and your practice."}
{"audio_filepath": "audio/segment_5444.flac", "text": " Old world marketing refers to different media techniques and strategies which were once effective and may still be effective in today's quickly changing world.", "duration": 8.71, "normalized_text": "old world marketing refers to different media techniques and strategies which were once effective and may still be effective in today's quickly changing world."}
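A minimal sketch of a manifest sanity check for entries like the ones above, using only the standard library. It flags invalid JSON, missing keys, leading or trailing whitespace in "text", and clips longer than a threshold. The required-key list and the 15-second threshold are my own assumptions rather than NeMo requirements, and I do not know whether any of these conditions is related to the error. The second sample entry is hypothetical.

```python
import json

# Assumed set of keys for illustration; not taken from NeMo documentation.
REQUIRED_KEYS = ("audio_filepath", "text", "duration")


def check_manifest_lines(lines, max_duration=None):
    """Return a list of (line_number, problem) tuples for a JSON-lines manifest."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        for key in REQUIRED_KEYS:
            if key not in entry:
                problems.append((number, f"missing key: {key}"))
        text = entry.get("text", "")
        if isinstance(text, str) and text != text.strip():
            problems.append((number, "text has leading/trailing whitespace"))
        duration = entry.get("duration")
        if (
            max_duration is not None
            and isinstance(duration, (int, float))
            and duration > max_duration
        ):
            problems.append((number, f"duration {duration}s exceeds {max_duration}s"))
    return problems


sample = [
    # First entry copied from the manifest sample above.
    '{"audio_filepath": "audio/segment_5440.flac", "text": " Chapter 3", '
    '"duration": 1.03, "normalized_text": "chapter three"}',
    # Hypothetical entry, only to exercise the duration check.
    '{"audio_filepath": "audio/example.flac", "text": "ok", "duration": 19.34}',
]

problems = check_manifest_lines(sample, max_duration=15.0)
for number, problem in problems:
    print(f"line {number}: {problem}")
```

To run it on a real file, the lines of the manifest can be passed in directly, for example `check_manifest_lines(open(path, encoding="utf-8"))`, where `path` is the manifest location.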
Steps Taken:
First, I followed the FastPitch fine-tuning tutorial and encountered the tensor size mismatch error.
I then passed my data through the Data Preparation pipeline and completed the text and audio preprocessing as outlined there.
Next, I fed the preprocessed data to FastPitch_Finetuning.ipynb, but the same tensor size mismatch error persists.
Moreover, if I use the FastPitch_Data_Preparation.ipynb pipeline for fine-tuning as well, the results are not good.
I also configured the training parameters and paths as specified.
Environment:
1. Python 3.10.12
2. torch 2.0.1
3. torchvision 0.15.2
Any insights or suggestions for resolving this error would be greatly appreciated. Thanks!