LLM_BENCHMARKS_TEXT = f"""
# About

As many recent Text-to-Speech (TTS) models have shown, synthetic audio can be close to real human speech.
However, traditional evaluation methods for TTS systems need an update to keep pace with these new developments.
Our TTSDS benchmark assesses the quality of synthetic speech by considering factors like prosody, speaker identity, and intelligibility.
By comparing these factors with both real speech and noise datasets, we can better understand how synthetic speech stacks up.

## More information
More details can be found in our paper [*TTSDS -- Text-to-Speech Distribution Score*](https://arxiv.org/abs/2407.12707).

## Reproducibility
To reproduce our results, check out our repository [here](https://github.com/ttsds/ttsds).

## Credits


This benchmark is inspired by [TTS Arena](https://huggingface.co/spaces/TTS-AGI/TTS-Arena) which instead focuses on the subjective evaluation of TTS models.
Our benchmark would not be possible without the many open-source TTS models on Hugging Face and GitHub.
Additionally, our benchmark uses the following datasets:
- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/)
- [LibriTTS](https://www.openslr.org/60/)
- [VCTK](https://datashare.ed.ac.uk/handle/10283/2950)
- [Common Voice](https://commonvoice.mozilla.org/)
- [ESC-50](https://github.com/karolpiczak/ESC-50)
It also builds on the following metrics, representations, and tools:
- [Wav2Vec2](https://arxiv.org/abs/2006.11477)
- [HuBERT](https://arxiv.org/abs/2106.07447)
- [WavLM](https://arxiv.org/abs/2110.13900)
- [PESQ](https://en.wikipedia.org/wiki/Perceptual_Evaluation_of_Speech_Quality)
- [VoiceFixer](https://arxiv.org/abs/2204.05841)
- [WADA SNR](https://www.cs.cmu.edu/~robust/Papers/KimSternIS08.pdf)
- [Whisper](https://arxiv.org/abs/2212.04356)
- [Masked Prosody Model](https://huggingface.co/cdminix/masked_prosody_model)
- [PyWorld](https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder)
- [WeSpeaker](https://arxiv.org/abs/2210.17016)
- [D-Vector](https://github.com/yistLin/dvector)

Authors: Christoph Minixhofer, Ondřej Klejch, and Peter Bell of the University of Edinburgh.
"""

EVALUATION_QUEUE_TEXT = """
## How to submit a TTS model to the leaderboard

### 1) Download the evaluation dataset
The evaluation dataset consists of paired .wav and .txt files.

You can download `speaker_text_pairs.tar.gz` from here:
https://huggingface.co/datasets/ttsds/speaker_text_pairs/blob/main/speaker_text_pairs.tar.gz
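
For example, one way to fetch and unpack the archive (a minimal sketch, assuming the `huggingface_hub` package is installed):

```python
import tarfile
from huggingface_hub import hf_hub_download

# Download the archive from the dataset repository.
archive_path = hf_hub_download(
    repo_id="ttsds/speaker_text_pairs",
    filename="speaker_text_pairs.tar.gz",
    repo_type="dataset",
)

# Extract the eval/ directory into the current folder.
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(".")
```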

The format of the dataset is as follows:
```
eval/
├── 0001.wav
├── 0001.txt
├── 0002.wav
├── 0002.txt
├── ...
```

Please note that each .wav file is the speaker reference and the corresponding .txt file is the prompt.

### 2) Create your TTS dataset
Generate a dataset with your TTS model from the evaluation dataset:
use each .wav file as the speaker reference and each .txt file as the prompt.
Package the results as a .tar.gz file, making sure to include both the .wav and .txt files.
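
For instance, a minimal sketch of this step might look like the following, where `synthesize(prompt, reference_wav)` is a placeholder for your own model's inference call (not part of this benchmark):

```python
import tarfile
from pathlib import Path

out_dir = Path("my_tts_dataset")
out_dir.mkdir(exist_ok=True)

for txt_path in sorted(Path("eval").glob("*.txt")):
    ref_wav = txt_path.with_suffix(".wav")        # speaker reference
    prompt = txt_path.read_text().strip()         # text prompt
    audio = synthesize(prompt, ref_wav)           # placeholder: your TTS model
    (out_dir / ref_wav.name).write_bytes(audio)   # assumes wav-encoded bytes
    (out_dir / txt_path.name).write_text(prompt)

with tarfile.open("my_tts_dataset.tar.gz", "w:gz") as tar:
    tar.add(out_dir, arcname=".")
```

The exact inference call will differ per model; the only requirement is that the submitted archive contains one .wav and one .txt file per evaluation item.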

### 3) Submit your TTS dataset
Submit your dataset below.
"""

CITATION_TEXT = """
@misc{minixhofer2024ttsds,
      title={TTSDS -- Text-to-Speech Distribution Score}, 
      author={Christoph Minixhofer and Ondřej Klejch and Peter Bell},
      year={2024},
      eprint={2407.12707},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2407.12707}, 
}
"""