NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Abstract
While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering that speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate each of them individually. Motivated by this, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models that generates natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate the attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms state-of-the-art TTS systems in quality, similarity, prosody, and intelligibility. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.
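The abstract's core design, factorizing speech into attribute-specific discrete subspaces before generation, can be illustrated with a small sketch. Below is a minimal PyTorch example of the factorized vector quantization (FVQ) idea: one projection and one codebook per attribute (content, prosody, timbre, acoustic details). All module names, dimensions, and the single-codebook-per-attribute simplification are illustrative assumptions, not the paper's actual codec architecture.

```python
# Minimal sketch (assumptions only): factorized vector quantization that routes one
# shared encoder representation through per-attribute codebooks, so a downstream
# factorized diffusion model can generate each attribute's tokens separately.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""

    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, frames, dim) -> quantized tensor of the same shape
        flat = z.reshape(-1, z.size(-1))
        codes = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        z_q = self.codebook(codes).view_as(z)
        return z + (z_q - z).detach()  # straight-through estimator


class FactorizedQuantizer(nn.Module):
    """One projection and one codebook per speech attribute (hypothetical configuration)."""

    def __init__(self, dim: int = 256, codebook_size: int = 1024):
        super().__init__()
        self.attributes = ["content", "prosody", "timbre", "acoustic_details"]
        self.projections = nn.ModuleDict({a: nn.Linear(dim, dim) for a in self.attributes})
        self.quantizers = nn.ModuleDict(
            {a: VectorQuantizer(codebook_size, dim) for a in self.attributes}
        )

    def forward(self, h: torch.Tensor) -> dict:
        # h: (batch, frames, dim) from a speech encoder; each attribute gets its own
        # discrete subspace that can be prompted and generated independently.
        return {a: self.quantizers[a](self.projections[a](h)) for a in self.attributes}


if __name__ == "__main__":
    tokens = FactorizedQuantizer()(torch.randn(2, 100, 256))
    print({name: t.shape for name, t in tokens.items()})
```

Note that this sketch quantizes all four attributes frame by frame purely for brevity; in the described system, timbre is handled by a dedicated extractor (as the discussion below also mentions) rather than as frame-level tokens.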
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech (2024)
- Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models (2024)
- ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering (2024)
- Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations (2024)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation (2024)
This work looks very promising. Do you believe it might be possible, with appropriate transcriptions in the dataset, to embed control over tone of voice/emotion, as has been shown to be possible with models such as Bark or Tortoise, or would the lack of transformer encoders for any aspect other than 'timbre extraction' (as worded in the paper) make this unlikely?
Models citing this paper 4
Datasets citing this paper 0