	---
	language:
	- ne
	- np
	license: apache-2.0
	tags:
	- generated_from_trainer
	- automatic-speech-recognition
	- speech
	- openslr
	- nepali
	datasets:
	- spktsagar/openslr-nepali-asr-cleaned
	metrics:
	- wer
	model-index:
	- name: wav2vec2-large-xls-r-300m-nepali-openslr
	  results:
	  - task:
	      type: automatic-speech-recognition
	      name: Nepali Speech Recognition
	    dataset:
	      type: spktsagar/openslr-nepali-asr-cleaned
	      name: OpenSLR Nepali ASR
	      config: original
	      split: train
	    metrics:
	    - type: wer
	      value: 24.05
	      name: Test WER
	      verified: false


	---


	# wav2vec2-large-xls-r-300m-nepali-openslr

	This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the [OpenSLR Nepali ASR](https://huggingface.co/datasets/spktsagar/openslr-nepali-asr-cleaned) dataset.
	It achieves the following results on the evaluation set:
	- eval_loss: 0.1913
	- eval_wer: 0.2405
	- eval_runtime: 586.4075
	- eval_samples_per_second: 36.829
	- eval_steps_per_second: 4.604
	- epoch: 4.6
	- step: 17600
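
The `eval_wer` of 0.2405 above is the same figure as the 24.05 in the card metadata, expressed as a fraction rather than a percentage. As a reminder of what the metric measures, here is a minimal sketch of word error rate: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` is illustrative only, and this is not necessarily the implementation used to produce the numbers above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev_row[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr_row.append(min(
                prev_row[j] + 1,                      # deletion
                curr_row[j - 1] + 1,                  # insertion
                prev_row[j - 1] + substitution_cost,  # match or substitution
            ))
        prev_row = curr_row
    return prev_row[-1] / len(ref_words)
```

A WER of 0.2405 therefore means roughly one word-level error for every four reference words.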

	## Model description

	Wav2Vec2 is a pretrained model for Automatic Speech Recognition (ASR), released in September 2020 by Alexei Baevski, Michael Auli, and Alex Conneau. Soon after Wav2Vec2 demonstrated superior performance on LibriSpeech, one of the most popular English ASR datasets, Facebook AI presented a multilingual version of Wav2Vec2 called XLSR. XLSR stands for cross-lingual speech representations and refers to the model's ability to learn speech representations that are useful across multiple languages.

	## How to use?
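	1. Install transformers and librosa (a PyTorch installation is also required to run the model):
	```
	pip install librosa transformers torch
	```
	2. Run the following code, which loads your audio file, builds a pipeline with the preprocessor and model, and returns the transcription:
	```python
	import librosa
	from transformers import pipeline

	# Load the audio as a mono waveform resampled to 16 kHz, the rate the model expects
	audio, sample_rate = librosa.load("<path to your audio file>", sr=16000)
	# Build an ASR pipeline backed by this fine-tuned checkpoint
	recognizer = pipeline("automatic-speech-recognition", model="spktsagar/wav2vec2-large-xls-r-300m-nepali-openslr")
	# The pipeline returns a dict; the transcription is stored under the "text" key
	prediction = recognizer(audio)
	print(prediction["text"])
	```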

	## Intended uses & limitations

	The model is trained on the OpenSLR Nepali ASR dataset, which itself contains some incorrect transcriptions, so the model's predictions will not be perfect. In addition, because of Colab's resource limits, utterances longer than 5 seconds were filtered out of the dataset during training and evaluation. The model may therefore not perform as expected on audio input longer than 5 seconds.
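
One possible workaround for longer recordings is to split the waveform into pieces no longer than the 5-second training limit and transcribe each piece separately. The sketch below shows only the splitting step. The helper `split_into_chunks` is hypothetical and not part of this model or of `transformers`, and splitting at fixed positions can cut words in half at chunk boundaries.

```python
def split_into_chunks(samples, sample_rate=16000, max_seconds=5.0):
    """Split a 1-D sequence of audio samples into consecutive chunks
    of at most `max_seconds` each (the last chunk may be shorter)."""
    chunk_size = int(sample_rate * max_seconds)
    if chunk_size <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")
    # Slicing works for plain lists as well as NumPy arrays such as
    # the waveform returned by librosa.load.
    return [samples[start:start + chunk_size]
            for start in range(0, len(samples), chunk_size)]
```

Each chunk can then be passed to the recognizer on its own and the resulting texts joined with spaces. The `transformers` ASR pipeline also accepts a `chunk_length_s` argument at call time for chunked inference, which may handle boundaries more gracefully; neither approach has been evaluated for this model here.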

	## Training and evaluation data and Training procedure

	For dataset preparation and training code, please consult [my blog](https://sagar-spkt.github.io/posts/2022/08/finetune-xlsr-nepali/).

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.0003
	- train_batch_size: 16
	- eval_batch_size: 8
	- seed: 42
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 32
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_steps: 500
	- num_epochs: 10
	- mixed_precision_training: Native AMP
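
To make the schedule-related settings above concrete, here is a small sketch of how a linear scheduler with warmup typically shapes the learning rate, and how the total batch size of 32 follows from the per-device batch size and gradient accumulation. The function name `linear_schedule_lr` is illustrative, a single training device is assumed, and the default `total_steps` is only a rough estimate extrapolated from the 17600 steps reported at epoch 4.6; it is not a value taken from the training logs.

```python
def linear_schedule_lr(step, base_lr=3e-4, warmup_steps=500, total_steps=38_000):
    """Learning rate at `step` for linear warmup followed by linear decay to zero.

    total_steps=38_000 is an assumed estimate (about 17600 / 4.6 steps per
    epoch over 10 epochs), not a logged value.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr
        return base_lr * step / warmup_steps
    # Decay: fall linearly from base_lr at the end of warmup to 0 at total_steps
    remaining_steps = max(0, total_steps - step)
    return base_lr * remaining_steps / (total_steps - warmup_steps)


# Effective batch size per optimizer update, assuming one device:
# train_batch_size (16) * gradient_accumulation_steps (2)
effective_batch_size = 16 * 2
```

Under these assumptions the learning rate peaks at 0.0003 at step 500 and then decreases steadily for the rest of training.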

	### Framework versions

	- Transformers 4.23.1
	- PyTorch 1.11.0+cu113
	- Datasets 2.6.0
	- Tokenizers 0.13.1