metadata

title: TTSDS Benchmark
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: mit
tags:
  - leaderboard
  - submission:semiautomatic
  - test:public
  - judge:auto
  - modality:audio
  - eval:generation

TTSDS Benchmark

Recent Text-to-Speech (TTS) models show that synthetic audio can come close to real human speech, and traditional TTS evaluation methods need an update to keep pace. The TTSDS benchmark assesses the quality of synthetic speech through factors such as prosody, speaker identity, and intelligibility. Each factor is compared against both real speech and noise datasets, which indicates how closely the synthetic speech resembles real speech rather than noise.
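The comparison against real speech and noise can be illustrated with a small sketch. This is not the benchmark's implementation: the function names `wasserstein_1d` and `factor_score`, the use of a one-dimensional Wasserstein distance, and the normalization to a 0 to 100 range are all assumptions made for illustration, and the toy data stands in for features that a real pipeline would extract from audio.

```python
import random


def wasserstein_1d(a, b):
    """Empirical 1-D Wasserstein-1 distance between two equal-sized samples."""
    if len(a) != len(b):
        raise ValueError("samples must have the same size")
    # For equal-sized 1-D samples, the distance is the mean absolute
    # difference between the sorted values.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def factor_score(synthetic, real, noise):
    """Hypothetical score: 100 if synthetic matches real, 0 if it matches noise."""
    d_real = wasserstein_1d(synthetic, real)
    d_noise = wasserstein_1d(synthetic, noise)
    if d_real + d_noise == 0:
        raise ValueError("real and noise samples are indistinguishable")
    return 100.0 * d_noise / (d_real + d_noise)


# Toy 1-D "features" (assumed, e.g. a pitch statistic per utterance).
rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(500)]
noise = [rng.uniform(-10.0, 10.0) for _ in range(500)]
synthetic = [rng.gauss(0.2, 1.1) for _ in range(500)]

print(f"toy factor score: {factor_score(synthetic, real, noise):.1f}")
```

In this sketch a higher score means the synthetic feature distribution sits closer to real speech than to noise; a per-factor score of this kind could then be averaged across factors such as prosody, speaker identity, and intelligibility.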

More information

More details can be found in our paper TTSDS -- Text-to-Speech Distribution Score.

Reproducibility

To reproduce our results, check out our repository here.

Credits

This benchmark is inspired by TTS Arena, which instead focuses on the subjective evaluation of TTS models. Our benchmark would not be possible without the many open-source TTS models on Hugging Face and GitHub. Additionally, our benchmark uses the following datasets:

LJSpeech
LibriTTS
VCTK
Common Voice
ESC-50

And the following metrics/representations/tools:

Wav2Vec2
HuBERT
WavLM
PESQ
VoiceFixer
WADA SNR
Whisper
Masked Prosody Model
PyWorld
WeSpeaker
D-Vector

Authors: Christoph Minixhofer, Ondřej Klejch, and Peter Bell of the University of Edinburgh.