update readme
README.md
CHANGED
@@ -16,36 +16,42 @@ tags:
- eval:generation
---

# TTSDS Benchmark
As many recent Text-to-Speech (TTS) models have shown, synthetic audio can be close to real human speech.
However, traditional evaluation methods for TTS systems need an update to keep pace with these new developments.
Our TTSDS benchmark assesses the quality of synthetic speech by considering factors like prosody, speaker identity, and intelligibility.
By comparing these factors with both real speech and noise datasets, we can better understand how synthetic speech stacks up.
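
The per-factor comparison can be pictured as asking whether the distribution of a feature extracted from the synthetic speech sits closer to that feature's distribution over real speech or over noise. The sketch below only illustrates that idea and is not the benchmark's actual implementation; the 1-D pitch-like feature, the use of SciPy's `wasserstein_distance`, and the exact normalization are assumptions made here for clarity (see the paper and repository linked below for the real formulation).

```python
# Illustrative sketch of a per-factor score (not the official TTSDS code):
# a system scores high when its feature distribution is closer to the real
# reference than to noise. The feature is a 1-D stand-in, e.g. pitch values.
import numpy as np
from scipy.stats import wasserstein_distance


def factor_score(synthetic, real, noise):
    """Return a score in [0, 100]; higher means closer to real speech than to noise."""
    d_real = wasserstein_distance(synthetic, real)
    d_noise = wasserstein_distance(synthetic, noise)
    return 100.0 * d_noise / (d_real + d_noise + 1e-12)


rng = np.random.default_rng(0)
real = rng.normal(120, 20, 1000)       # pitch-like values from real speech
noise = rng.uniform(0, 400, 1000)      # pitch-like values from a noise dataset
synthetic = rng.normal(125, 25, 1000)  # values from a fairly natural TTS system

print(f"factor score: {factor_score(synthetic, real, noise):.1f}")  # high, near 100
```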
## More information
More details can be found in our paper [*TTSDS -- Text-to-Speech Distribution Score*](https://arxiv.org/abs/2407.12707).
## Reproducibility
To reproduce our results, check out our repository [here](https://github.com/ttsds/ttsds).
## Credits

This benchmark is inspired by [TTS Arena](https://huggingface.co/spaces/TTS-AGI/TTS-Arena), which instead focuses on the subjective evaluation of TTS models.
Our benchmark would not be possible without the many open-source TTS models on Hugging Face and GitHub.
Additionally, our benchmark uses the following datasets:
- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/)
- [LibriTTS](https://www.openslr.org/60/)
- [VCTK](https://datashare.ed.ac.uk/handle/10283/2950)
- [Common Voice](https://commonvoice.mozilla.org/)
- [ESC-50](https://github.com/karolpiczak/ESC-50)

And the following metrics/representations/tools (an illustrative feature-extraction sketch follows the list):
- [Wav2Vec2](https://arxiv.org/abs/2006.11477)
- [HuBERT](https://arxiv.org/abs/2106.07447)
- [WavLM](https://arxiv.org/abs/2110.13900)
- [PESQ](https://en.wikipedia.org/wiki/Perceptual_Evaluation_of_Speech_Quality)
- [VoiceFixer](https://arxiv.org/abs/2204.05841)
- [WADA SNR](https://www.cs.cmu.edu/~robust/Papers/KimSternIS08.pdf)
- [Whisper](https://arxiv.org/abs/2212.04356)
- [Masked Prosody Model](https://huggingface.co/cdminix/masked_prosody_model)
- [PyWorld](https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder)
- [WeSpeaker](https://arxiv.org/abs/2210.17016)
- [D-Vector](https://github.com/yistLin/dvector)
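
As a concrete illustration of the kind of representation extraction these tools provide, the sketch below pulls frame-level Wav2Vec2 features with the Hugging Face `transformers` library; the checkpoint name, the resampling step, and the mean pooling are assumptions chosen for the example rather than the benchmark's exact configuration.

```python
# Illustrative sketch: frame-level Wav2Vec2 representations via transformers.
# "sample.wav" is a placeholder path; any mono speech clip works.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

waveform, sr = torchaudio.load("sample.wav")                # (channels, samples)
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state              # (1, frames, 768)

# One simple utterance-level summary: average the frame representations.
utterance_embedding = hidden.mean(dim=1).squeeze(0)
print(utterance_embedding.shape)                            # torch.Size([768])
```

In the benchmark itself, distributions of such features from a TTS system are compared against the real-speech and noise datasets listed above.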

Authors: Christoph Minixhofer, Ondřej Klejch, and Peter Bell of the University of Edinburgh.