	---
	title: TTSDS Benchmark and Leaderboard
	emoji: 🥇
	colorFrom: green
	colorTo: indigo
	sdk: gradio
	app_file: app.py
	pinned: true
	license: mit
	tags:
	- leaderboard
	- submission:semiautomatic
	- test:public
	- judge:auto
	- modality:audio
	- eval:generation
	- tts
	short_description: Text-To-Speech (TTS) Evaluation using objective metrics.
	---

	# TTSDS Benchmark

	As many recent Text-to-Speech (TTS) models have shown, synthetic audio can be close to real human speech.
	However, traditional evaluation methods for TTS systems need an update to keep pace with these new developments.
	Our TTSDS benchmark assesses the quality of synthetic speech by considering factors like prosody, speaker identity, and intelligibility.
	By comparing the distribution of each factor in synthetic speech with both real speech and noise datasets, we can quantify how closely synthetic speech resembles real speech.
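
As a rough illustration of comparing a synthetic feature distribution against both a real-speech reference and a noise reference, the sketch below scores a single one-dimensional feature. Everything in it is an assumption made for illustration: the function names (`wasserstein_2_1d`, `distribution_score`), the choice of the 2-Wasserstein distance, the normalisation to a 0-100 range, and the toy data are hypothetical and are not taken from the paper or from the `ttsds` package.

```python
import math
import random


def wasserstein_2_1d(xs, ys):
    """2-Wasserstein distance between two equal-sized 1-D samples.

    For equal-sized 1-D samples, the optimal transport plan pairs the
    sorted values, so the distance reduces to an RMS difference.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    xs, ys = sorted(xs), sorted(ys)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


def distribution_score(synthetic, real, noise):
    """Hypothetical score in [0, 100].

    100 means the synthetic feature distribution coincides with the
    real-speech reference, 0 means it coincides with the noise reference.
    """
    d_real = wasserstein_2_1d(synthetic, real)
    d_noise = wasserstein_2_1d(synthetic, noise)
    if d_real + d_noise == 0:
        raise ValueError("real and noise references are indistinguishable")
    return 100.0 * d_noise / (d_real + d_noise)


rng = random.Random(0)
# Toy stand-ins for a per-utterance feature (made-up values, not real data).
real = [rng.gauss(200.0, 20.0) for _ in range(1000)]
noise = [rng.uniform(0.0, 1000.0) for _ in range(1000)]
good_tts = [rng.gauss(205.0, 22.0) for _ in range(1000)]
poor_tts = [rng.gauss(400.0, 150.0) for _ in range(1000)]

print(f"good TTS: {distribution_score(good_tts, real, noise):.1f}")
print(f"poor TTS: {distribution_score(poor_tts, real, noise):.1f}")
```

Using a noise reference alongside the real-speech reference gives the distance a scale: a system is judged by where it falls between the two anchors, not by a raw distance that is hard to interpret on its own.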

	## More information
	More details can be found in our paper [TTSDS -- Text-to-Speech Distribution Score](https://arxiv.org/abs/2407.12707).

	## Reproducibility
	To reproduce our results, see our [GitHub repository](https://github.com/ttsds/ttsds).

	## Credits


	This benchmark is inspired by [TTS Arena](https://huggingface.co/spaces/TTS-AGI/TTS-Arena), which instead focuses on the subjective evaluation of TTS models.
	Our benchmark would not be possible without the many open-source TTS models on Hugging Face and GitHub.
	Additionally, our benchmark uses the following datasets:
	- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/)
	- [LibriTTS](https://www.openslr.org/60/)
	- [VCTK](https://datashare.ed.ac.uk/handle/10283/2950)
	- [Common Voice](https://commonvoice.mozilla.org/)
	- [ESC-50](https://github.com/karolpiczak/ESC-50)

	It also uses the following metrics, representations, and tools:
	- [Wav2Vec2](https://arxiv.org/abs/2006.11477)
	- [HuBERT](https://arxiv.org/abs/2106.07447)
	- [WavLM](https://arxiv.org/abs/2110.13900)
	- [PESQ](https://en.wikipedia.org/wiki/Perceptual_Evaluation_of_Speech_Quality)
	- [VoiceFixer](https://arxiv.org/abs/2204.05841)
	- [WADA SNR](https://www.cs.cmu.edu/~robust/Papers/KimSternIS08.pdf)
	- [Whisper](https://arxiv.org/abs/2212.04356)
	- [Masked Prosody Model](https://huggingface.co/cdminix/masked_prosody_model)
	- [PyWorld](https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder)
	- [WeSpeaker](https://arxiv.org/abs/2210.17016)
	- [D-Vector](https://github.com/yistLin/dvector)

	Authors: Christoph Minixhofer, Ondřej Klejch, and Peter Bell
	of the University of Edinburgh.