Inference

The pretrained model checkpoints are available at 🤗 Hugging Face and 🤖 Model Scope, or are downloaded automatically when you run the inference scripts.

A single generation currently supports up to 30s of audio, which is the total length of the prompt and the output audio combined. However, you can give infer_cli and infer_gradio longer text, and they will automatically generate it in chunks. Long reference audio will be clipped to ~15s.
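To picture the chunked generation mentioned above, here is a minimal sketch of splitting long text into sentence-aligned chunks. The function name `split_into_chunks` and the `max_chars` limit are hypothetical and chosen for illustration; the actual scripts may split text differently (for example, based on the reference audio length).

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Split after sentence-ending punctuation (ASCII and full-width).
    pieces = re.split(r"(?<=[.!?;。！？；])\s*", text)
    sentences = [p.strip() for p in pieces if p.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in split_into_chunks("One. Two! Three?", max_chars=10):
        print(chunk)
```

Each chunk could then be synthesized separately and the resulting audio concatenated. Note that this simple pattern also splits inside decimals such as 3.14, so a real splitter would need more care.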

To avoid possible inference failures, make sure you have read through the following instructions.

Uppercase letters will be uttered letter by letter, so use lowercase letters for normal words.
Add spaces (blank: " ") or punctuation (e.g. "," ".") to explicitly introduce pauses.
Convert numbers to Chinese characters if you want them read in Chinese; otherwise they are read in English.
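The text guidelines above can be applied before inference with a small preprocessing step. The sketch below is not part of F5-TTS; the helper names and the `KEEP_UPPERCASE` allow-list are hypothetical. It lowercases all-caps words (so they are not spelled out) and converts digits to Chinese characters one digit at a time, which suits things like years or phone numbers but does not produce place-value readings.

```python
import re

# Hypothetical allow-list: acronyms that should stay uppercase and be spelled out.
KEEP_UPPERCASE = frozenset({"TTS", "GPU"})

CHINESE_DIGITS = "零一二三四五六七八九"


def lowercase_shouted_words(text: str, keep: frozenset = KEEP_UPPERCASE) -> str:
    """Lowercase all-caps words of two or more letters unless listed in `keep`."""

    def fix(match: re.Match) -> str:
        word = match.group(0)
        return word if word in keep else word.lower()

    return re.sub(r"\b[A-Z]{2,}\b", fix, text)


def digits_to_chinese(text: str) -> str:
    """Replace each ASCII digit with its Chinese character, digit by digit."""
    return re.sub(r"[0-9]", lambda m: CHINESE_DIGITS[int(m.group(0))], text)


if __name__ == "__main__":
    print(lowercase_shouted_words("HELLO world, the TTS model is READY."))
    print(digits_to_chinese("2024年"))
```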

Gradio App

Currently supported features:

Basic TTS with Chunk Inference
Multi-Style / Multi-Speaker Generation
Voice Chat powered by Qwen2.5-3B-Instruct

The CLI command f5-tts_infer-gradio is equivalent to python src/f5_tts/infer/infer_gradio.py, which launches a Gradio app (web interface) for inference.

The script will load model checkpoints from Hugging Face. You can also download the files manually and update the path passed to load_model() in infer_gradio.py. Only the TTS models are loaded at first; the ASR model is loaded for transcription if ref_text is not provided, and the LLM is loaded if you use Voice Chat.

The app can also be used as a component of a larger application.

import gradio as gr
from f5_tts.infer.infer_gradio import app

with gr.Blocks() as main_app:
    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")

    # ... other Gradio components

    app.render()

main_app.launch()

CLI Inference

The CLI command f5-tts_infer-cli is equivalent to python src/f5_tts/infer/infer_cli.py, which is a command-line tool for inference.

The script will load model checkpoints from Hugging Face. You can also download the files manually and use --ckpt_file to specify the model you want to load, or update the path directly in infer_cli.py.

To use a different vocab.txt, pass your file with --vocab_file.

The basic way to run inference is with flags:

# Leaving --ref_text "" makes the ASR model transcribe the reference audio (extra GPU memory usage)
f5-tts_infer-cli \
--model "F5-TTS" \
--ref_audio "ref_audio.wav" \
--ref_text "The content, subtitle or transcription of reference audio." \
--gen_text "Some text you want TTS model generate for you."
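If you want to drive the same command from a Python script, one option is to build the argument list and hand it to `subprocess`. This is only a sketch: the helper `build_infer_command` is hypothetical, and it uses just the flags shown above.

```python
import shlex
import subprocess


def build_infer_command(
    ref_audio: str, ref_text: str, gen_text: str, model: str = "F5-TTS"
) -> list[str]:
    """Build the argument list for f5-tts_infer-cli from the flags shown above."""
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,
        # An empty string lets the ASR model transcribe the reference audio.
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]


command = build_infer_command("ref_audio.wav", "", "Hello there.")

if __name__ == "__main__":
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(command))
    # To actually run it (requires F5-TTS to be installed), uncomment:
    # subprocess.run(command, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when the text contains quotes or other special characters.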

A .toml file allows more flexible usage.

f5-tts_infer-cli -c custom.toml

For example, you can use a .toml file to pass in variables; see src/f5_tts/infer/examples/basic/basic.toml:

# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/basic/basic_ref_en.wav"
# If an empty "", transcribes the reference audio automatically.
ref_text = "Some call me nature, others call me mother nature."
gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
# File with text to generate. Ignores the text above.
gen_file = ""
remove_silence = false
output_dir = "tests"

You can also use a .toml file for multi-style generation; see src/f5_tts/infer/examples/multi/story.toml.

# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/multi/main.flac"
# If an empty "", transcribes the reference audio automatically.
ref_text = ""
gen_text = ""
# File with text to generate. Ignores the text above.
gen_file = "infer/examples/multi/story.txt"
remove_silence = true
output_dir = "tests"

[voices.town]
ref_audio = "infer/examples/multi/town.flac"
ref_text = ""

[voices.country]
ref_audio = "infer/examples/multi/country.flac"
ref_text = ""

Mark the text with [main], [town], or [country] wherever you want to change voice; see src/f5_tts/infer/examples/multi/story.txt.
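As a rough illustration of how voice tags divide a text, the sketch below splits a string into (voice, text) pairs at each [name] tag. The function `split_by_voice` and the sample sentence are hypothetical; the real parsing in infer_cli.py may differ.

```python
import re


def split_by_voice(text: str, default_voice: str = "main") -> list[tuple[str, str]]:
    """Return (voice, text) pairs, switching voice at each [name] tag."""
    # With a capturing group, re.split keeps the tag names in the result:
    # [text_before_first_tag, tag1, text1, tag2, text2, ...]
    parts = re.split(r"\[(\w+)\]", text)

    segments: list[tuple[str, str]] = []
    # Text before the first tag is spoken by the default voice.
    if parts[0].strip():
        segments.append((default_voice, parts[0].strip()))
    for tag, chunk in zip(parts[1::2], parts[2::2]):
        if chunk.strip():
            segments.append((tag, chunk.strip()))
    return segments


if __name__ == "__main__":
    sample = "Intro. [town] Hello! [country] Hi. [main] The end."
    for voice, chunk in split_by_voice(sample):
        print(f"{voice}: {chunk}")
```

Each pair can then be synthesized with the reference audio configured for that voice.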

Speech Editing

To test speech editing capabilities, use the following command:

python src/f5_tts/infer/speech_edit.py