---
datasets:
- mozilla-foundation/common_voice_17_0
- wasmdashai/db-arabic-f1-nn
language:
- ar
license: afl-3.0
pipeline_tag: text-to-speech
---
# Model Card for vits-ar
## Model Details
### Model Description
An Arabic text-to-speech (TTS) system built on the VITS architecture, using the pre-trained weights from Facebook's VITS Arabic (`ara`) model. The model is capable of:

- **Generating natural and realistic speech:** producing high-quality Arabic speech that closely mimics human voices, preserving intonation and linguistic nuances.
- **Understanding colloquial text:** processing text written in various Arabic dialects, including idiomatic expressions and local vocabulary.
### Model Architecture
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end speech synthesis model that predicts a speech waveform conditioned on an input text sequence. It is a conditional variational autoencoder (VAE) composed of a posterior encoder, a decoder, and a conditional prior.
A set of spectrogram-based acoustic features are predicted by the flow-based module, which is formed of a Transformer-based text encoder and multiple coupling layers. The spectrogram is decoded using a stack of transposed convolutional layers, much in the same style as the HiFi-GAN vocoder. Motivated by the one-to-many nature of the TTS problem, where the same text input can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows the model to synthesise speech with different rhythms from the same input text.
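Because the duration predictor is stochastic, two runs on the same text can differ in rhythm unless the random seed is fixed. The sketch below shows one way to control this with the Transformers VITS implementation. The helper name `synthesize` and its parameters are hypothetical, introduced only for illustration; `set_seed`, `speaking_rate`, and `noise_scale` come from the Transformers library, and the checkpoint ID is the one used elsewhere in this card.

```python
from typing import Optional


def synthesize(
    text: str,
    model_id: str = "wasmdashai/vits-ar",
    seed: int = 0,
    speaking_rate: Optional[float] = None,
    noise_scale: Optional[float] = None,
):
    """Return (waveform, sampling_rate) for `text`; hypothetical helper."""
    # Heavy dependencies are imported lazily so that defining this helper
    # does not load PyTorch or download anything.
    import torch
    from transformers import AutoTokenizer, VitsModel, set_seed

    model = VitsModel.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Override the checkpoint's settings only when a value is supplied.
    if speaking_rate is not None:
        model.speaking_rate = speaking_rate  # values above 1.0 speak faster
    if noise_scale is not None:
        model.noise_scale = noise_scale  # larger values add more variation

    inputs = tokenizer(text, return_tensors="pt")

    # Seed right before the forward pass: the duration predictor samples noise.
    set_seed(seed)
    with torch.no_grad():
        waveform = model(**inputs).waveform

    # Drop the batch dimension and hand back a 1-D NumPy array.
    return waveform.squeeze(0).cpu().numpy(), model.config.sampling_rate
```

Under these assumptions, calling `synthesize(text, seed=0)` twice on the same machine should return the same waveform, while a different `seed` or a `speaking_rate` such as `1.5` should change the rhythm or length of the output.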
## Usage
MMS-TTS is available in the 🤗 Transformers library from version 4.33 onwards. To use this checkpoint, first install the latest version of the library:
```
pip install transformers[torch]
```
Then run inference with the following code snippet:
```python
import torch
from IPython.display import Audio
from transformers import AutoTokenizer, VitsModel

# Load the Arabic VITS checkpoint and its tokenizer from the Hub.
model = VitsModel.from_pretrained("wasmdashai/vits-ar")
tokenizer = AutoTokenizer.from_pretrained("wasmdashai/vits-ar")

text = "السلام عليكم ورحمة الله وبركاتة ما الجديد ؟ "
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    full_generation = model(**inputs)

# Flatten the waveform tensor to a 1-D NumPy array.
full_generation_waveform = full_generation.waveform.cpu().numpy().reshape(-1)

# Play the audio in a notebook at the model's sampling rate.
Audio(full_generation_waveform, rate=model.config.sampling_rate)
```
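Outside a notebook, the generated waveform can be written to a WAV file instead of being played inline. The following is a minimal sketch using only the Python standard library. The helper `save_wav`, the output path, and the 16 kHz rate of the stand-in tone are assumptions made for illustration; with the model, pass `full_generation_waveform` and `model.config.sampling_rate` instead.

```python
import math
import struct
import wave


def save_wav(samples, sampling_rate, path):
    """Write float samples in [-1.0, 1.0] to a mono 16-bit PCM WAV file."""
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(pcm))


# Stand-in signal: one second of a 440 Hz tone at an assumed 16 kHz rate.
# With the model, use full_generation_waveform and model.config.sampling_rate.
rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
save_wav(tone, rate, "output.wav")
```

Converting to 16-bit PCM keeps the file playable in most audio tools; if SciPy is installed, `scipy.io.wavfile.write` is a common alternative.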
## Contact
You can also email us at [email protected]
## Arabic Dialect Generation Models
### Introduction
We are pleased to announce the upcoming release of a collection of Arabic dialect generation models. These models are designed with advanced AI techniques to deliver a natural and realistic text-to-speech experience across the various Arabic dialects.
### Model Table
| **Dialect** | **Model name** | **Description** | **Expected release** | **Audio quality** |
|-------------|----------------|-----------------|----------------------|-------------------|
| Arabic language | [vits-ar](https://huggingface.co/wasmdashai/vits-ar) | Text-to-speech model for the Yemeni dialect with fine detail. | Available | Medium |
| Yemeni dialect | [vits-ar-ye](https://huggingface.co/wasmdashai/vits-ar-ye) | Text-to-speech model for the Yemeni dialect with fine detail. | Coming soon | Medium |
| Saudi dialect | [vits-ar-sa](https://huggingface.co/wasmdashai/vits-ar-sa-huba) | Text-to-speech model for the Saudi dialect with high quality and fine detail. | Available | Medium |
| Egyptian dialect | [vits-ar-eg](https://huggingface.co/wasmdashai/vits-ar-eg) | Text-to-speech model for the Egyptian dialect in a natural, fluent style. | Coming soon | Medium |
| Lebanese dialect | [vits-ar-lb](https://huggingface.co/wasmdashai/vits-ar-lb) | Model specialized in the Lebanese dialect, generating speech with fine, realistic detail. | Coming soon | Medium |
| Moroccan dialect | [vits-ar-ma](https://huggingface.co/wasmdashai/vits-ar-ma) | Text-to-speech model for the Moroccan dialect, able to understand local terms. | Coming soon | Medium |
| Emirati dialect | [vits-ar-ae](https://huggingface.co/wasmdashai/vits-ar-ae) | Text-to-speech model for the Emirati dialect with realism and fine detail. | Coming soon | Medium |
| Jordanian dialect | [vits-ar-jo](https://huggingface.co/wasmdashai/vits-ar-jo) | Text-to-speech model for the Jordanian dialect with mastery of vocal details. | Coming soon | Medium |
| Iraqi dialect | [vits-ar-iq](https://huggingface.co/wasmdashai/vits-ar-iq) | Speech generation model for the Iraqi dialect with accurate pronunciation of common words and expressions. | Coming soon | Medium |
| Syrian dialect | [vits-ar-sy](https://huggingface.co/wasmdashai/vits-ar-sy) | Text-to-speech model for the Syrian dialect with clarity and a natural voice. | Coming soon | Medium |
| Palestinian dialect | [vits-ar-ps](https://huggingface.co/wasmdashai/vits-ar-ps) | Text-to-speech model for the Palestinian dialect with fine detail. | Coming soon | Medium |
| Sudanese dialect | [vits-ar-sd](https://huggingface.co/wasmdashai/vits-ar-sd) | Text-to-speech model for the Sudanese dialect with an understanding of local vocabulary. | Coming soon | Medium |
| Algerian dialect | [vits-ar-dz](https://huggingface.co/wasmdashai/vits-ar-dz) | Text-to-speech model for the Algerian dialect with accuracy and high quality. | Coming soon | Medium |
| Tunisian dialect | [vits-ar-tn](https://huggingface.co/wasmdashai/vits-ar-tn) | Text-to-speech model for the Tunisian dialect with mastery of local details. | Coming soon | Medium |
| Libyan dialect | [vits-ar-ly](https://huggingface.co/wasmdashai/vits-ar-ly) | Text-to-speech model for the Libyan dialect with accurate, realistic pronunciation. | Coming soon | Medium |
| Bahraini dialect | [vits-ar-bh](https://huggingface.co/wasmdashai/vits-ar-bh) | Text-to-speech model for the Bahraini dialect with high audio quality. | Coming soon | Medium |
| Omani dialect | [vits-ar-om](https://huggingface.co/wasmdashai/vits-ar-om) | Text-to-speech model for the Omani dialect with accurate, clear pronunciation. | Coming soon | Medium |
| Qatari dialect | [vits-ar-qa](https://huggingface.co/wasmdashai/vits-ar-qa) | Text-to-speech model for the Qatari dialect with fine, realistic detail. | Coming soon | Medium |
| Kuwaiti dialect | [vits-ar-kw](https://huggingface.co/wasmdashai/vits-ar-kw) | Text-to-speech model for the Kuwaiti dialect with high quality and clarity. | Coming soon | Medium |
| Mauritanian dialect | [vits-ar-mr](https://huggingface.co/wasmdashai/vits-ar-mr) | Text-to-speech model for the Mauritanian dialect with fine, realistic detail. | Coming soon | Medium |
### Technical Details
All models are based on the VITS architecture, an end-to-end text-to-speech model that generates realistic audio waveforms from text input. The models contain Transformer components that analyze the text and generate speech based on the local sound characteristics of each dialect.
### Future Upgrades
Regular updates will be provided to improve audio quality and to handle the different dialects more effectively. Follow us to learn more about the exact release dates for each model.
## Acknowledgements
This implementation is based on [tts-arabic](https://github.com/nipponjo/tts-arabic-pytorch), [VITS](https://github.com/jaywalnut310/vits), [Finetune VITS](https://github.com/ylacombe/finetune-hf-vits) and [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2). We appreciate their awesome work.