F5 added to the TTS Arena fork, can anyone offer a better voice sample?

#32
by Pendrokar - opened

I added this Space to the TTS Arena fork, which is a duplicate of TTS-AGI/TTS-Arena.
https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena

I already got some heat from @cocktailpeanut, who claims that F5 is failing to beat other models only because I am trying to sabotage it. I have no such intention.

I am giving it the same voice sample that I give to XTTSv2, with the following transcript:

Hispaniola was rolling scuppers under in the ocean swell. The booms were tearing at the blocks, the rudder was banging to and fro, and the whole ship creaking, groaning, and jumping like a manufactory.

Gradio client test:
https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena/blob/main/test_tts_e2_f5_f5.py#L8-L13

XTTS can handle it well. F5 has so far been unstable: it sometimes includes the voice sample in the output*, sounds unemotional/monotone/depressed, and mispronounces words (awestruck).

You can view the vote table here and see the sentences in which F5's output gets rejected most often:
https://huggingface.co/datasets/Pendrokar/TTS_Arena/viewer/default/train?f[rejected][value]=%27mrfakename/E2-F5-TTS%27

If you have a suggestion for a voice sample better suited to the task, I am all ears.

[edit] * OK, I might have fixed the voice sample being included in the output. It relates to an age-old Gradio issue: the client passes the minimum value of slider parameters rather than their default. In this case the affected sliders were "speed" and "crossfade", so I had to override them.
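
For anyone hitting the same thing, here is a minimal sketch of the override. It is not the code from the test script linked above. The default values (`speed` 1.0, `crossfade` 0.15) and the keyword names passed to `predict()` are assumptions for illustration, and `build_slider_kwargs` and `synthesize` are hypothetical helpers. The real parameter names and defaults should be read from the Space's "Use via API" page.

```python
# Assumed slider defaults, for illustration only; verify against the Space's API page.
ASSUMED_SLIDER_DEFAULTS = {"speed": 1.0, "crossfade": 0.15}


def build_slider_kwargs(overrides=None, defaults=None):
    """Return an explicit value for every slider so none is left for the client to fill in."""
    kwargs = dict(ASSUMED_SLIDER_DEFAULTS if defaults is None else defaults)
    kwargs.update(overrides or {})
    return kwargs


def synthesize(space_id, api_name, text, ref_audio_path, ref_text, **overrides):
    """Call a TTS Space with every slider value passed explicitly.

    The keyword names sent to predict() are hypothetical; they must match
    the parameter names the Space actually exposes.
    """
    # Imported lazily so the helper above can be used without gradio_client installed.
    from gradio_client import Client, handle_file

    client = Client(space_id)
    return client.predict(
        ref_audio=handle_file(ref_audio_path),  # hypothetical parameter name
        ref_text=ref_text,                      # hypothetical parameter name
        gen_text=text,                          # hypothetical parameter name
        api_name=api_name,
        **build_slider_kwargs(overrides),       # sliders are always sent explicitly
    )
```

The point is only that every slider gets a value from the caller, so the result no longer depends on what the client would substitute for a missing one.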
