speecht5-asr-punctuation-sensitive

This model is part of SotoMedia's Automatic Video Dubbing project, which aims to build the first open-source video dubbing technology across a diverse range of languages. You can find more details about our project and our pipeline here.

Description:

The speecht5-asr-punctuation-sensitive model is an Automatic Speech Recognition (ASR) system that transcribes spoken English while preserving punctuation. It is fine-tuned to recognize and emit punctuation marks, which improves the fidelity of transcriptions in scenarios where punctuation is crucial for conveying meaning.

Model type: Transformer encoder-decoder
Language: English (en)
Base model: SpeechT5-ASR checkpoint
Fine-tuning dataset: MuST-C-en_ar

Key Features:

Punctuation Sensitivity: The model is fine-tuned to emit punctuation as part of the transcript, so that the output better reflects the speaker's intended meaning.

New Vocabulary: The vocabulary is piece-level (subword) rather than character-level, with a vocabulary size of 500 pieces.
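To illustrate why a piece-level vocabulary yields shorter target sequences than a character-level one, here is a minimal sketch. It uses a greedy longest-match segmenter over a tiny hypothetical vocabulary (`TOY_PIECES`, invented for this example). It is not the model's actual tokenizer, whose 500 pieces and segmentation algorithm are not reproduced here.

```python
# Toy comparison of character-level and piece-level tokenization.
# TOY_PIECES is a hypothetical vocabulary made up for this illustration;
# it is NOT the vocabulary shipped with the model.
TOY_PIECES = {"▁the", "▁cat", "▁sat", "▁", "s", "a", "t", "c", "h", "e", ",", "."}


def char_tokenize(text: str) -> list[str]:
    """Split text into single characters, marking spaces with '▁'."""
    return list(text.replace(" ", "▁"))


def piece_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation over a fixed piece vocabulary."""
    text = "▁" + text.replace(" ", "▁")
    max_len = max(len(piece) for piece in vocab)
    pieces, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocab:
                pieces.append(candidate)
                i += size
                break
        else:
            # No piece matched: emit the single character as its own piece.
            pieces.append(text[i])
            i += 1
    return pieces


sentence = "the cat sat."
chars = char_tokenize(sentence)
pieces = piece_tokenize(sentence, TOY_PIECES)
print(len(chars), chars)
print(len(pieces), pieces)
```

In this toy case the decoder would have to predict 12 character tokens but only 4 pieces, with the final period kept as its own token.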

Usage

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="seba3y/speecht5-asr-punctuation-sensitive")

# Load model directly
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("seba3y/speecht5-asr-punctuation-sensitive")
model = AutoModelForSpeechSeq2Seq.from_pretrained("seba3y/speecht5-asr-punctuation-sensitive")
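As a hedged follow-up to the snippets above, the sketch below wraps the pipeline in two small helpers and returns the transcript text for one audio file. The helper names `load_asr` and `transcribe` and the path `sample.wav` are placeholders introduced for this example. It assumes `transformers` is installed and that ffmpeg is available so the pipeline can decode a file path.

```python
def load_asr(model_id: str = "seba3y/speecht5-asr-punctuation-sensitive"):
    """Build the ASR pipeline. transformers is imported lazily so that
    importing this module does not require the library to be installed."""
    from transformers import pipeline

    return pipeline("automatic-speech-recognition", model=model_id)


def transcribe(asr, audio_path: str) -> str:
    """Run the pipeline on one audio file and return the transcript text.

    `asr` is any callable that behaves like the ASR pipeline, i.e. it
    returns a dict with a "text" key for a single input.
    """
    result = asr(audio_path)
    return result["text"].strip()


# Example usage (downloads the checkpoint on first run):
#   asr = load_asr()
#   print(transcribe(asr, "sample.wav"))  # "sample.wav" is a placeholder
```

Passing the pipeline in as an argument keeps `transcribe` easy to exercise with a stub, without downloading the checkpoint.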

Fine-tuning & Evaluation Details

Dataset

MuST-C is a multilingual speech translation corpus whose size and quality will facilitate the training of end-to-end systems for SLT from English into several target languages. For each target language, MuST-C comprises several hundred hours of audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual transcriptions and translations.

Data splits:

Set	Talks	Sentences	Words (src)	Words (tgt)	Time
dev	11	1073	24274	21387	2h28m34s
tst-COMMON	27	2019	41955	36443	4h04m39s
tst-HE	12	578	13080	10912	1h26m51s
train	2412	212085	4520522	4000457	463h15m44s



Hyperparameters

Parameter	Value
per_device_train_batch_size	6
per_device_eval_batch_size	16
gradient_accumulation_steps	12
eval_accumulation_steps	16
dataloader_num_workers	2
learning_rate	5e-5
adafactor	True
weight_decay	0.08989525
max_grad_norm	0.58585
num_train_epochs	5
warmup_ratio	0.7
lr_scheduler_type	constant_with_warmup
fp16	True
gradient_checkpointing	True
sortish_sampler	True
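The batch-size and warmup settings interact, so a rough calculation may help when reading the table. The sketch below estimates the effective batch size and the number of warmup steps. It assumes a single device and that all 212085 training sentences were used, neither of which is stated in this card, and the exact step count reported by the trainer can differ slightly.

```python
import math

# Values copied from the hyperparameter table above.
TRAINING_KWARGS = {
    "per_device_train_batch_size": 6,
    "gradient_accumulation_steps": 12,
    "num_train_epochs": 5,
    "warmup_ratio": 0.7,
}
# Assumption: the full MuST-C train split was used for fine-tuning.
TRAIN_SENTENCES = 212085


def effective_batch_size(kwargs: dict, num_devices: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


def approx_schedule(kwargs: dict, num_examples: int, num_devices: int = 1):
    """Approximate optimizer steps per epoch, total steps, and warmup steps."""
    batch = effective_batch_size(kwargs, num_devices)
    steps_per_epoch = math.ceil(num_examples / batch)
    total_steps = steps_per_epoch * kwargs["num_train_epochs"]
    warmup_steps = math.ceil(total_steps * kwargs["warmup_ratio"])
    return batch, steps_per_epoch, total_steps, warmup_steps


print(approx_schedule(TRAINING_KWARGS, TRAIN_SENTENCES))
```

Under these assumptions the effective batch size is 72 and about 70% of the optimizer steps are spent in warmup. With `constant_with_warmup`, the learning rate rises linearly during that phase and then stays at 5e-5.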

Results

Train loss: 0.8925

Split	Word Error Rate (%)
dev	44.8
tst-HE	39.1
tst-COMMON	43.2
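For readers unfamiliar with the metric, here is a minimal sketch of a word error rate computation in which punctuation and casing are left in the tokens, so a missing comma counts as an error. The card does not state which scoring tool or text normalization produced the numbers above, so this illustrates the metric and is not the evaluation script that was used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length, in percent.

    Tokens are split on whitespace only, so "world." and "world" differ.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


print(word_error_rate("so, what happened next?", "so what happened next?"))
```

A corpus-level score is usually obtained by summing edit distances and reference lengths over all utterances, not by averaging per-sentence rates.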

Citation

MuST-C dataset

@InProceedings{mustc19,
    author = "Di Gangi, Mattia Antonino and Cattoni, Roldano and Bentivogli, Luisa and Negri, Matteo and Turchi, Marco",
    title = "{MuST-C: a Multilingual Speech Translation Corpus}",
    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)",
    year = "2019",
    address = "Minneapolis, MN, USA",
    month = "June"
}

SpeechT5-ASR

@inproceedings{ao-etal-2022-speecht5,
    title = {{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
    author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
    booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    month = {May},
    year = {2022},
    pages={5723--5738},
}