Add Fish Speech

#48
by lengyue233 - opened

Hi everyone,

We are thrilled to announce that we have open-sourced our new text-to-speech model, Fish Speech 1, today! You can find the model and more details in our Hugging Face blog post: https://huggingface.co/blog/lengyue233/fish-speech-1.

We have prepared two demos for you to try out:

  • The medium pretrain demo, which excels at general speaking, can be found at Fish Audio.
  • The large SFT demo, which works particularly well on ACGN content, is available on Hugging Face Space.

To better understand our model's performance, we are eager to integrate the medium pretrain model into TTS Arena for evaluation. We believe this will provide valuable insights into how Fish Speech 1 compares to other state-of-the-art TTS models. If the TTS Arena team requires any assistance or support during the integration process, we are more than happy to provide any necessary resources or guidance.

Best regards,
The Fish Audio Team

TTS AGI org

Hi, congratulations on your launch!! Are there any plans to switch to an open source license?

Hi, Fish Speech is an open-source model. The code is available under the BSD-3-Clause license, and the model weights are released under the CC BY-NC-SA 4.0 license.
Feel free to use it for any non-commercial purposes.

TTS AGI org

Thanks! Are there any plans to release the weights under an open source license (see OSD)?

Currently, we don't have any plan to release the weights for commercial use.

We have a very strong release coming soon; it's close to ElevenLabs now. Some samples here:



With that kind of confident statement, I have to be honest here. While it is better than half of the current models in the Arena, I predict that it will score below StyleTTS and XTTS if added. Nowhere near ElevenLabs. It feels unstable, in that it always has a slight stutter. 😕

Of course that is for the voting public to decide.

What about now?

Better. I've added Fish Speech's HF Space to the Arena fork, which, unlike this Space, uses Hugging Face Gradio Spaces to generate the audio. A few of the cached samples (the ⚡ button) should be of Fish Speech.
https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena

I created and maintain the fork because I find that the TTS-AGI organization is not genuine about its stated goal.

BTW, did you use some reference audio (or timbre) for Fish Speech?

Reference audio. It is the one that OpenVoice used to use on this very Space. The zero-shot TTS Spaces use that voice.
https://huggingface.co/spaces/TTS-AGI/TTS-Arena/discussions/19#65e00cf8121aa0d0b49e8789

Multiple voices per model would be useful to avoid biased votes, since voters start to notice the connection between a model and its voice. That would not be hard to do with zero-shot TTS.
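
As a rough illustration of the idea, here is a minimal Python sketch that draws a reference voice at random for each vote. Everything in it is an assumption for illustration: the `build_matchup` helper and the clip file names are hypothetical, and this is not how the Arena or the fork is actually implemented.

```python
import random

# Hypothetical pool of reference clips; these file names are placeholders.
REFERENCE_VOICES = ["voice_a.wav", "voice_b.wav", "voice_c.wav"]

def build_matchup(models, rng, voices=REFERENCE_VOICES):
    """Draw two distinct models and one shared reference voice for a single vote."""
    model_a, model_b = rng.sample(models, 2)
    # Both models clone the same clip, so the voice itself cannot identify either one.
    voice = rng.choice(voices)
    return {"models": (model_a, model_b), "reference_voice": voice}

rng = random.Random()
matchup = build_matchup(["Fish Speech", "XTTS", "OpenVoice"], rng)
print(matchup)
```

Sharing one clip within a matchup while varying it across matchups is just one possible design: it keeps each comparison like-for-like and still prevents a voter from learning which voice belongs to which model.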

The issue is that the voice lacks energy and emotion, unlike Edge TTS. We'd expect Fish-Speech to mimic this behavior since it's not a semantic-based TTS model. It should mimic everything, not just timbre and some pitch/duration like XTTS or Tortoise. For best results, start with the English example in our space.
