initial_commit

Fill out empty sections in auto-generated README

Update README.md

update readme

Files changed (15)

.gitattributes +35 -0
README.md +91 -0
added_tokens.json +4 -0
config.json +92 -0
generation_config.json +9 -0
model.safetensors +3 -0
preprocessor_config.json +19 -0
runs/Jan14_01-59-01_Christians-Desktop/events.out.tfevents.1705193943.Christians-Desktop.157984.0 +3 -0
runs/Jan14_01-59-01_Christians-Desktop/events.out.tfevents.1705281827.Christians-Desktop.157984.1 +3 -0
runs/Jan15_09-07-38_Christians-Desktop/events.out.tfevents.1705306078.Christians-Desktop.157984.2 +3 -0
runs/Jan15_09-07-38_Christians-Desktop/events.out.tfevents.1705367613.Christians-Desktop.157984.3 +3 -0
special_tokens_map.json +13 -0
spm_char.model +3 -0
tokenizer_config.json +63 -0
training_args.bin +3 -0

.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
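The patterns above decide which files in this commit are stored as Git LFS pointers rather than as plain text. A rough sketch of that check follows. It uses only a subset of the patterns and assumes that slash-free patterns are matched against the file's basename, approximated here with Python's `fnmatchcase`. The helper name `is_lfs_tracked` is made up for illustration, and real gitattributes matching has more rules than this.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A subset of the patterns from the .gitattributes hunk above.
LFS_PATTERNS = ["*.bin", "*.model", "*.safetensors", "*tfevents*"]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check: test slash-free patterns against the basename only."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


# Files taken from the "Files changed" list of this commit.
for path in ["model.safetensors", "spm_char.model", "training_args.bin", "config.json"]:
    print(path, is_lfs_tracked(path))
```

Under these assumptions the weights, the SentencePiece model, the training arguments and the TensorBoard event files are LFS-tracked, while the JSON files and README.md are stored as regular text, which matches the pointer-style hunks shown further down.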

README.md ADDED

	@@ -0,0 +1,91 @@

+---
+language:
+- da
+license: mit
+base_model: microsoft/speecht5_tts
+tags:
+- generated_from_trainer
+datasets:
+- alexandrainst/nst-da
+model-index:
+- name: speecht5_tts-finetuned-nst-da
+  results: []
+metrics:
+- mse
+pipeline_tag: text-to-speech
+---
+# speecht5_tts-finetuned-nst-da
+This model is a fine-tuned version of [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts) on the NST Danish ASR Database dataset.
+It achieves the following results on the evaluation set:
+- Loss: 0.3738
+## Model description
+Because Danish is a low-resource language, few open-source Danish text-to-speech synthesizers are available online. At the time of writing, the only other implementations available on 🤗 are [facebook/seamless-streaming](https://huggingface.co/facebook/seamless-streaming) and [audo/seamless-m4t-v2-large](https://huggingface.co/audo/seamless-m4t-v2-large). This model was developed to provide a simpler alternative that still performs reasonably well in terms of both output quality and inference time. Unlike those models, it also has an associated Space on 🤗 at [JackismyShephard/danish-speech-synthesis](https://huggingface.co/spaces/JackismyShephard/danish-speech-synthesis), which provides an easy interface for Danish text-to-speech synthesis along with optional speech enhancement.
+## Intended uses & limitations
+The model is intended for Danish text-to-speech synthesis.
+The model does not recognize special symbols such as "æ", "ø" and "å", as it uses the default tokenizer of [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts). The model performs best on short to medium-length input text and expects the input to contain no more than 600 vocabulary tokens. Additionally, for best performance the model should be given a Danish speaker embedding, ideally generated from an audio clip from the training split of [alexandrainst/nst-da](https://huggingface.co/datasets/alexandrainst/nst-da) using [speechbrain/spkrec-xvect-voxceleb](https://huggingface.co/speechbrain/spkrec-xvect-voxceleb).
+The output of the model is a log-mel spectrogram, which should be converted to a waveform using [microsoft/speecht5_hifigan](https://huggingface.co/microsoft/speecht5_hifigan). For better output quality, the resulting waveform can be enhanced using [ResembleAI/resemble-enhance](https://huggingface.co/ResembleAI/resemble-enhance).
+An example script showing how to use the model for inference can be found [here](https://github.com/JackismyShephard/hugging-face-audio-course/blob/main/finetuned_nst-da-inference.ipynb).
+## Training and evaluation data
+The model was trained and evaluated on [alexandrainst/nst-da](https://huggingface.co/datasets/alexandrainst/nst-da) using MSE as both loss and metric.
+## Training procedure
+The script used for training the model can be found [here](https://github.com/JackismyShephard/hugging-face-audio-course/blob/main/finetuned-nst-da-training.ipynb).
+### Training hyperparameters
+The following hyperparameters were used during training:
+- learning_rate: 1e-05
+- train_batch_size: 32
+- eval_batch_size: 32
+- seed: 42
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_ratio: 0.1
+- num_epochs: 20
+- mixed_precision_training: Native AMP
+### Training results
+| Training Loss | Epoch | Step  | Validation Loss |
+|:-------------:|:-----:|:-----:|:---------------:|
+| 0.463         | 1.0   | 4715  | 0.4169          |
+| 0.4302        | 2.0   | 9430  | 0.3963          |
+| 0.447         | 3.0   | 14145 | 0.3883          |
+| 0.4283        | 4.0   | 18860 | 0.3847          |
+| 0.394         | 5.0   | 23575 | 0.3830          |
+| 0.3934        | 6.0   | 28290 | 0.3812          |
+| 0.3928        | 7.0   | 33005 | 0.3795          |
+| 0.4123        | 8.0   | 37720 | 0.3781          |
+| 0.3851        | 9.0   | 42435 | 0.3785          |
+| 0.4234        | 10.0  | 47150 | 0.3783          |
+| 0.3781        | 11.0  | 51865 | 0.3759          |
+| 0.3951        | 12.0  | 56580 | 0.3782          |
+| 0.4073        | 13.0  | 61295 | 0.3757          |
+| 0.4278        | 14.0  | 66010 | 0.3768          |
+| 0.4172        | 15.0  | 70725 | 0.3747          |
+| 0.3854        | 16.0  | 75440 | 0.3753          |
+| 0.4876        | 17.0  | 80155 | 0.3741          |
+| 0.432         | 18.0  | 84870 | 0.3738          |
+| 0.4435        | 19.0  | 89585 | 0.3745          |
+| 0.4255        | 20.0  | 94300 | 0.3739          |
+### Framework versions
+- Transformers 4.37.0.dev0
+- Pytorch 2.1.2+cu118
+- Datasets 2.15.0
+- Tokenizers 0.15.0
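The hyperparameters in the README (learning rate 1e-05, linear scheduler, warmup ratio 0.1) and the step counts in the results table (4715 steps per epoch, 94300 in total) are enough to sketch the learning-rate curve. The snippet below is a minimal illustration, assuming the usual "linear warmup, then linear decay to zero" shape. The function name `linear_lr` is hypothetical, and the training notebook may configure the schedule differently.

```python
PEAK_LR = 1e-5                    # learning_rate from the README
STEPS_PER_EPOCH = 4715            # step count after epoch 1 in the results table
NUM_EPOCHS = 20
TOTAL_STEPS = STEPS_PER_EPOCH * NUM_EPOCHS   # 94300, the final step in the table
WARMUP_STEPS = TOTAL_STEPS // 10             # warmup ratio 0.1 (assumed rounding)


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step under the assumed schedule."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * (step / WARMUP_STEPS)
    # Linear decay from the peak down to 0 at the last step.
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * (remaining / (TOTAL_STEPS - WARMUP_STEPS))


for epoch in (1, 2, 10, 20):
    print(epoch, linear_lr(epoch * STEPS_PER_EPOCH))
```

Under this assumption the peak rate is reached at step 9430, that is, at the end of epoch 2, and the rate falls to zero at step 94300.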

added_tokens.json ADDED

	@@ -0,0 +1,4 @@

+{
+  "<ctc_blank>": 80,
+  "<mask>": 79
+}

config.json ADDED

	@@ -0,0 +1,92 @@

+{
+  "_name_or_path": "microsoft/speecht5_tts",
+  "activation_dropout": 0.1,
+  "apply_spec_augment": true,
+  "architectures": [
+    "SpeechT5ForTextToSpeech"
+  ],
+  "attention_dropout": 0.1,
+  "bos_token_id": 0,
+  "conv_bias": false,
+  "conv_dim": [
+    512,
+    512,
+    512,
+    512,
+    512,
+    512,
+    512
+  ],
+  "conv_kernel": [
+    10,
+    3,
+    3,
+    3,
+    3,
+    2,
+    2
+  ],
+  "conv_stride": [
+    5,
+    2,
+    2,
+    2,
+    2,
+    2,
+    2
+  ],
+  "decoder_attention_heads": 12,
+  "decoder_ffn_dim": 3072,
+  "decoder_layerdrop": 0.1,
+  "decoder_layers": 6,
+  "decoder_start_token_id": 2,
+  "encoder_attention_heads": 12,
+  "encoder_ffn_dim": 3072,
+  "encoder_layerdrop": 0.1,
+  "encoder_layers": 12,
+  "encoder_max_relative_position": 160,
+  "eos_token_id": 2,
+  "feat_extract_activation": "gelu",
+  "feat_extract_norm": "group",
+  "feat_proj_dropout": 0.0,
+  "guided_attention_loss_num_heads": 2,
+  "guided_attention_loss_scale": 10.0,
+  "guided_attention_loss_sigma": 0.4,
+  "hidden_act": "gelu",
+  "hidden_dropout": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "is_encoder_decoder": true,
+  "layer_norm_eps": 1e-05,
+  "mask_feature_length": 10,
+  "mask_feature_min_masks": 0,
+  "mask_feature_prob": 0.0,
+  "mask_time_length": 10,
+  "mask_time_min_masks": 2,
+  "mask_time_prob": 0.05,
+  "max_length": 1876,
+  "max_speech_positions": 1876,
+  "max_text_positions": 600,
+  "model_type": "speecht5",
+  "num_conv_pos_embedding_groups": 16,
+  "num_conv_pos_embeddings": 128,
+  "num_feat_extract_layers": 7,
+  "num_mel_bins": 80,
+  "pad_token_id": 1,
+  "positional_dropout": 0.1,
+  "reduction_factor": 2,
+  "scale_embedding": false,
+  "speaker_embedding_dim": 512,
+  "speech_decoder_postnet_dropout": 0.5,
+  "speech_decoder_postnet_kernel": 5,
+  "speech_decoder_postnet_layers": 5,
+  "speech_decoder_postnet_units": 256,
+  "speech_decoder_prenet_dropout": 0.5,
+  "speech_decoder_prenet_layers": 2,
+  "speech_decoder_prenet_units": 256,
+  "torch_dtype": "float32",
+  "transformers_version": "4.37.0.dev0",
+  "use_cache": false,
+  "use_guided_attention_loss": true,
+  "vocab_size": 81
+}
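The `conv_kernel` and `conv_stride` lists in this config appear to describe a stack of 1-D convolutions over raw waveform samples. This is probably the speech-input front end inherited from the base SpeechT5 configuration and may not be exercised on the text-to-speech path. As a worked example of how to read such a stack, the sketch below computes its receptive field and overall stride; the helper name `conv_stack_geometry` is made up for illustration.

```python
conv_kernel = [10, 3, 3, 3, 3, 2, 2]   # from config.json
conv_stride = [5, 2, 2, 2, 2, 2, 2]    # from config.json
SAMPLING_RATE = 16000                  # from preprocessor_config.json


def conv_stack_geometry(kernels, strides):
    """Return (receptive_field, total_stride) in input samples for stacked 1-D convs."""
    receptive_field, jump = 1, 1
    for kernel, stride in zip(kernels, strides):
        # Each layer widens the receptive field by (kernel - 1) steps of the current jump.
        receptive_field += (kernel - 1) * jump
        # The distance between neighbouring outputs grows by the layer's stride.
        jump *= stride
    return receptive_field, jump


field, stride = conv_stack_geometry(conv_kernel, conv_stride)
print(field, stride)                                              # samples
print(1000 * field / SAMPLING_RATE, 1000 * stride / SAMPLING_RATE)  # milliseconds
```

With these values, each output frame of the stack covers 400 input samples and successive frames are 320 samples apart, which is 25 ms and 20 ms at 16 kHz.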

generation_config.json ADDED

	@@ -0,0 +1,9 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 0,
+  "decoder_start_token_id": 2,
+  "eos_token_id": 2,
+  "max_length": 1876,
+  "pad_token_id": 1,
+  "transformers_version": "4.37.0.dev0"
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6fd7115b5fd692e5b4e818cec73302c5178be23b813a6666edfc2e62bb6b3365
+size 577789320
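The three lines above are a Git LFS pointer, not the weights themselves: a spec version, a SHA-256 object id, and the size in bytes of the real file. A minimal sketch of reading such a pointer follows. The helper name `parse_lfs_pointer` is hypothetical. The parameter estimate assumes 4 bytes per weight, in line with `"torch_dtype": "float32"` in config.json, and ignores the safetensors header, so it is only an upper bound.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its key/value fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer content copied from the model.safetensors hunk (without the leading "+").
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:6fd7115b5fd692e5b4e818cec73302c5178be23b813a6666edfc2e62bb6b3365
size 577789320
"""

info = parse_lfs_pointer(pointer)
approx_params = info["size"] // 4   # float32 weights: at most this many parameters
print(info["hash_algo"], info["size"], approx_params)
```

That puts the checkpoint at roughly 144 million parameters, under the stated assumption.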

preprocessor_config.json ADDED

	@@ -0,0 +1,19 @@

+{
+  "do_normalize": false,
+  "feature_extractor_type": "SpeechT5FeatureExtractor",
+  "feature_size": 1,
+  "fmax": 7600,
+  "fmin": 80,
+  "frame_signal_scale": 1.0,
+  "hop_length": 16,
+  "mel_floor": 1e-10,
+  "num_mel_bins": 80,
+  "padding_side": "right",
+  "padding_value": 0.0,
+  "processor_class": "SpeechT5Processor",
+  "reduction_factor": 2,
+  "return_attention_mask": true,
+  "sampling_rate": 16000,
+  "win_function": "hann_window",
+  "win_length": 64
+}
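The `win_length` and `hop_length` values in this file are easy to misread as sample counts. The sketch below converts them to samples under the assumption that they are expressed in milliseconds, which is how the SpeechT5 feature extractor documentation describes them, and derives how much audio a given number of spectrogram frames covers. The function names are made up for illustration.

```python
SAMPLING_RATE = 16000   # "sampling_rate"
WIN_LENGTH_MS = 64      # "win_length", assumed to be milliseconds
HOP_LENGTH_MS = 16      # "hop_length", assumed to be milliseconds


def ms_to_samples(milliseconds: int, sampling_rate: int = SAMPLING_RATE) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return milliseconds * sampling_rate // 1000


def frames_to_seconds(n_frames: int) -> float:
    """Approximate audio duration covered by n_frames spectrogram frames."""
    return n_frames * ms_to_samples(HOP_LENGTH_MS) / SAMPLING_RATE


print(ms_to_samples(WIN_LENGTH_MS))   # window size in samples
print(ms_to_samples(HOP_LENGTH_MS))   # hop size in samples
print(frames_to_seconds(100))         # seconds of audio for 100 frames
```

Under that reading, the analysis window is 1024 samples, the hop is 256 samples, and 100 log-mel frames correspond to about 1.6 seconds of 16 kHz audio.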

runs/Jan14_01-59-01_Christians-Desktop/events.out.tfevents.1705193943.Christians-Desktop.157984.0 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8b8efe0d0f024486a3b0b73b5c1b4e7bce006958d43be54ee95e609461f1c6fa
+size 2263807

runs/Jan14_01-59-01_Christians-Desktop/events.out.tfevents.1705281827.Christians-Desktop.157984.1 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c977686d449c8a0bc10d6de559a807799b3f699db62037bafacaab0c36a85700
+size 364

runs/Jan15_09-07-38_Christians-Desktop/events.out.tfevents.1705306078.Christians-Desktop.157984.2 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9f3d84c4307cb86f619288832906f710c7610f61781d26c0fc8c26ec132c12db
+size 762165

runs/Jan15_09-07-38_Christians-Desktop/events.out.tfevents.1705367613.Christians-Desktop.157984.3 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5e8960e4c085032fb63151881c6a320a53e81bfd578b6deabdab9b29c1556981
+size 364

special_tokens_map.json ADDED

	@@ -0,0 +1,13 @@

+{
+  "bos_token": "<s>",
+  "eos_token": "</s>",
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<pad>",
+  "unk_token": "<unk>"
+}

spm_char.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7fcc48f3e225f627b1641db410ceb0c8649bd2b0c982e150b03f8be3728ab560
+size 238473

tokenizer_config.json ADDED

	@@ -0,0 +1,63 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "79": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "80": {
+      "content": "<ctc_blank>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "eos_token": "</s>",
+  "mask_token": "<mask>",
+  "model_max_length": 600,
+  "normalize": false,
+  "pad_token": "<pad>",
+  "processor_class": "SpeechT5Processor",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "SpeechT5Tokenizer",
+  "unk_token": "<unk>"
+}
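Two constraints from this commit affect how input text should be prepared: the README notes that the tokenizer does not recognize "æ", "ø" and "å", and `model_max_length` here is 600. The sketch below shows one possible pre-processing step. The ASCII transliterations are a common Danish convention chosen for illustration and are not confirmed to be what the author's notebooks use. The length check assumes roughly one token per character plus one end-of-sequence token, which is a guess based on the character-level `spm_char.model`.

```python
# Hypothetical fallback spellings for characters the tokenizer does not know.
DANISH_FALLBACKS = {
    "æ": "ae", "ø": "oe", "å": "aa",
    "Æ": "Ae", "Ø": "Oe", "Å": "Aa",
}
MAX_TEXT_TOKENS = 600   # "model_max_length" in tokenizer_config.json


def prepare_text(text: str) -> str:
    """Replace unsupported Danish letters with ASCII approximations."""
    for source, replacement in DANISH_FALLBACKS.items():
        text = text.replace(source, replacement)
    return text


def fits_length_budget(text: str, limit: int = MAX_TEXT_TOKENS) -> bool:
    """Rough check: one token per character plus one for the </s> token."""
    return len(text) + 1 <= limit


sentence = prepare_text("Rødgrød med fløde på ærø")
print(sentence, fits_length_budget(sentence))
```

For an exact count, the text would need to be run through the actual tokenizer; the estimate here is only meant as a cheap guard before doing so.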

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2ab472663086f916e33cf2c755647d5168c72207604e204f721d54f6d82a7734
+size 4920