speecht5-asr-punctuation-sensitive
This model is part of SotoMedia's Automatic Video Dubbing project, which aims to build the first open-source video dubbing technology across a diverse range of languages. You can find more details about our project and our pipeline here.
Description:
The speecht5-asr-punctuation-sensitive model is an Automatic Speech Recognition (ASR) system designed to transcribe spoken English together with its punctuation. It is trained to recognize and preserve punctuation marks, which improves the fidelity of transcriptions in scenarios where punctuation is crucial for conveying meaning.
- Model type: Transformer encoder-decoder
- Language: English
- Base model: SpeechT5-ASR checkpoint
- Fine-tuning dataset: MuST-C-en_ar
Key Features:
- Punctuation Sensitivity: The model is specifically engineered to be sensitive to punctuation in spoken English, so that transcripts represent the speaker's intended meaning more accurately.
- New Vocabulary: The vocabulary is piece-level rather than character-level, with a vocabulary size of 500 pieces.
Usage
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model="seba3y/speecht5-asr-punctuation-sensitive")
# Load model directly
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
processor = AutoProcessor.from_pretrained("seba3y/speecht5-asr-punctuation-sensitive")
model = AutoModelForSpeechSeq2Seq.from_pretrained("seba3y/speecht5-asr-punctuation-sensitive")
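The pipeline object above can be wrapped in a small helper to transcribe a local audio file. This is a minimal sketch: the `transcribe` function and the `MODEL_ID` constant are names introduced here for illustration, and passing a file path assumes that ffmpeg is available so the pipeline can decode the audio.

```python
MODEL_ID = "seba3y/speecht5-asr-punctuation-sensitive"


def transcribe(audio_path: str) -> str:
    """Transcribe one local audio file and return the predicted text."""
    # Imported lazily so that defining this helper does not load the library.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    # Given a file path, the pipeline decodes the audio (this needs ffmpeg)
    # and returns a dict whose "text" entry holds the transcript.
    result = asr(audio_path)
    return result["text"]
```

Calling `transcribe("sample.wav")`, where `sample.wav` is a hypothetical local recording, downloads the checkpoint on first use and returns the transcript as a string.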
Fine-tuning & Evaluation Details
Dataset
MuST-C is a multilingual speech translation corpus whose size and quality will facilitate the training of end-to-end systems for SLT from English into several target languages. For each target language, MuST-C comprises several hundred hours of audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual transcriptions and translations.
Data splits:
Set | Talks | Sentences | Words (source) | Words (target) | Duration |
---|---|---|---|---|---|
dev | 11 | 1073 | 24274 | 21387 | 2h28m34s |
tst-COMMON | 27 | 2019 | 41955 | 36443 | 4h04m39s |
tst-HE | 12 | 578 | 13080 | 10912 | 1h26m51s |
train | 2412 | 212085 | 4520522 | 4000457 | 463h15m44s |
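To get a feel for the split statistics above, the durations can be converted to seconds and divided by the sentence counts. The snippet below is only an illustrative sketch; `SPLITS` and `to_seconds` are names introduced here, and the numbers are copied from the statistics above.

```python
import re

# Sentence counts and durations copied from the split statistics above.
SPLITS = {
    "dev": (1073, "2h28m34s"),
    "tst-COMMON": (2019, "4h04m39s"),
    "tst-HE": (578, "1h26m51s"),
    "train": (212085, "463h15m44s"),
}


def to_seconds(duration: str) -> int:
    """Convert a string such as '2h28m34s' into a number of seconds."""
    match = re.fullmatch(r"(\d+)h(\d+)m(\d+)s", duration)
    if match is None:
        raise ValueError(f"unexpected duration format: {duration!r}")
    hours, minutes, seconds = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds


for name, (sentences, duration) in SPLITS.items():
    total = to_seconds(duration)
    print(f"{name}: {total / 3600:.1f} h, {total / sentences:.1f} s per sentence")
```

In every split the average segment is only a few seconds long, so the model is fine-tuned on short, sentence-level utterances.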
Hyperparameters
Parameter | Value |
---|---|
per_device_train_batch_size | 6 |
per_device_eval_batch_size | 16 |
gradient_accumulation_steps | 12 |
eval_accumulation_steps | 16 |
dataloader_num_workers | 2 |
learning_rate | 5e-5 |
adafactor | True |
weight_decay | 0.08989525 |
max_grad_norm | 0.58585 |
num_train_epochs | 5 |
warmup_ratio | 0.7 |
lr_scheduler_type | constant_with_warmup |
fp16 | True |
gradient_checkpointing | True |
sortish_sampler | True |
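As a rough reading of the table, the effective batch size and the length of the warmup phase can be estimated as follows. This is a back-of-the-envelope sketch, not the training script: it assumes a single device (`num_devices` is a hypothetical variable) and that all 212085 training sentences were used, and the trainer's exact step arithmetic may differ slightly.

```python
import math

# Values copied from the hyperparameter table and the train split.
per_device_train_batch_size = 6
gradient_accumulation_steps = 12
num_train_epochs = 5
warmup_ratio = 0.7
train_sentences = 212085

# Assumption: one device and no filtering of training examples.
num_devices = 1

# Number of utterances that contribute to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
steps_per_epoch = math.ceil(train_sentences / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = round(total_steps * warmup_ratio)

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps: {total_steps} total, {warmup_steps} in warmup")
```

Under these assumptions each optimizer update sees 72 utterances, and roughly 70% of all updates fall within the warmup phase, after which `constant_with_warmup` holds the learning rate at 5e-5.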
Results
Train loss: 0.8925
Split | Word Error Rate (%) |
---|---|
dev | 44.8 |
tst-HE | 39.1 |
tst-COMMON | 43.2 |
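Word Error Rate is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The card does not state whether the scores above were computed with punctuation and casing kept, so the sketch below only illustrates how much that choice can matter; `word_error_rate` and `normalize` are helper names introduced here, not the evaluation code used for this model.

```python
import string


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def normalize(text: str) -> str:
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


reference = "Hello, world. How are you?"
hypothesis = "hello world how are you"

strict = word_error_rate(reference, hypothesis)
relaxed = word_error_rate(normalize(reference), normalize(hypothesis))
print(f"punctuation-sensitive WER: {strict:.2f}, normalized WER: {relaxed:.2f}")
```

A hypothesis with every word right but no punctuation or casing is penalized heavily under the strict comparison and not at all after normalization, which is worth keeping in mind when comparing the scores above with WER figures reported on normalized text.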
Citation
- MuST-C dataset
@InProceedings{mustc19,
author = "Di Gangi, Mattia Antonino and Cattoni, Roldano and Bentivogli, Luisa and Negri, Matteo and Turchi, Marco",
title = "{MuST-C: a Multilingual Speech Translation Corpus}",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)",
year = "2019",
address = "Minneapolis, MN, USA",
month = "June"
}
- SpeechT5-ASR
@inproceedings{ao-etal-2022-speecht5,
title = {{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {May},
year = {2022},
pages={5723--5738},
}