metadata

language:
  - ne
  - np
license: apache-2.0
tags:
  - generated_from_trainer
  - automatic-speech-recognition
  - speech
  - openslr
  - nepali
datasets:
  - spktsagar/openslr-nepali-asr-cleaned
metrics:
  - wer
model-index:
  - name: wav2vec2-large-xls-r-300m-nepali-openslr
    results:
      - task:
          type: automatic-speech-recognition
          name: Nepali Speech Recognition
        dataset:
          type: spktsagar/openslr-nepali-asr-cleaned
          name: OpenSLR Nepali ASR
          config: original
          split: train
        metrics:
          - type: wer
            value: 24.05
            name: Test WER
            verified: false

wav2vec2-large-xls-r-300m-nepali-openslr

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the OpenSLR Nepali ASR dataset. It achieves the following results on the evaluation set (eval_wer is a fraction, so 0.2405 corresponds to a WER of 24.05%):

eval_loss: 0.1913
eval_wer: 0.2405
eval_runtime: 586.4075
eval_samples_per_second: 36.829
eval_steps_per_second: 4.604
epoch: 4.6
step: 17600
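
For readers unfamiliar with the metric, word error rate is the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words. The sketch below is a minimal illustration for a single utterance; `word_error_rate` is a hypothetical helper, not the code used to produce the numbers above, and a corpus-level WER is normally computed by summing errors and reference words over all utterances.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words
print(word_error_rate("a b c d", "a x c d"))
```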

Model description

Wav2Vec2 is a pretrained model for Automatic Speech Recognition (ASR), released in September 2020 by Alexei Baevski, Michael Auli, and Alex Conneau. Soon after Wav2Vec2 demonstrated superior performance on LibriSpeech, one of the most popular English ASR datasets, Facebook AI presented a multilingual version of Wav2Vec2 called XLSR. XLSR stands for cross-lingual speech representations and refers to the model's ability to learn speech representations that are useful across multiple languages.

How to use?

Install transformers and librosa

pip install librosa transformers

Run the following code, which loads your audio file, loads the model together with its preprocessor, and returns the prediction:

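import librosa
from transformers import pipeline

# Load the audio and resample it to 16 kHz, the sampling rate the model expects
audio, sample_rate = librosa.load("<path to your audio file>", sr=16000)

# Build the ASR pipeline; the model and its preprocessor are downloaded on first use
recognizer = pipeline("automatic-speech-recognition", model="spktsagar/wav2vec2-large-xls-r-300m-nepali-openslr")

# The pipeline returns a dict; the transcription is stored under the "text" key
prediction = recognizer(audio)
print(prediction["text"])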

Intended uses & limitations

The model is trained on the OpenSLR Nepali ASR dataset, which itself contains some incorrect transcriptions, so the model's predictions will not be perfect. In addition, because of Colab's resource limits, utterances longer than 5 seconds were filtered out of the dataset during training and evaluation. The model may therefore not perform as expected on audio input longer than 5 seconds.
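
One possible workaround for the 5-second limitation is to split longer recordings into short pieces before transcription. The sketch below is an assumption on my part rather than part of the training setup; `split_into_chunks` is a hypothetical helper, and naive fixed-length splitting can cut a word in half at a chunk boundary, so results near the boundaries may be worse.

```python
def split_into_chunks(audio, sample_rate=16000, max_seconds=5.0):
    """Split a 1-D sequence of samples into pieces of at most max_seconds each."""
    chunk_size = int(sample_rate * max_seconds)
    if chunk_size <= 0:
        raise ValueError("sample_rate * max_seconds must be at least one sample")
    # Slicing works for Python lists as well as NumPy arrays returned by librosa
    return [audio[start:start + chunk_size] for start in range(0, len(audio), chunk_size)]


# 12 seconds of silence at 16 kHz yields two full chunks and one shorter tail
chunks = split_into_chunks([0.0] * (16000 * 12))
print([len(chunk) for chunk in chunks])
```

Each chunk could then be passed to the recognizer from the usage section and the resulting texts joined with spaces. The transformers ASR pipeline also accepts a `chunk_length_s` argument that serves a similar purpose; whether either approach gives acceptable accuracy with this model is untested here.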

Training and evaluation data and Training procedure

For dataset preparation and training code, please consult my blog.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0003
train_batch_size: 16
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
num_epochs: 10
mixed_precision_training: Native AMP
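
To make the batch-size and scheduler settings above concrete, here is a plain-Python sketch of how they fit together. It is an approximation for illustration, not the trainer's actual code: `linear_lr` is a hypothetical helper, a single device is assumed (consistent with 16 x 2 = 32), and `total_steps=38_000` is a rough estimate extrapolated from the reported step 17600 at epoch 4.6 over 10 epochs.

```python
def linear_lr(step, base_lr=3e-4, warmup_steps=500, total_steps=38_000):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Effective batch size: per-device batch x accumulation steps x devices (assumed 1)
train_batch_size = 16
gradient_accumulation_steps = 2
num_devices = 1  # assumption, not stated in the card
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)

# Learning rate at a few points of the (estimated) schedule
for step in (0, 250, 500, 17600, 38_000):
    print(step, linear_lr(step))
```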

Framework versions

Transformers 4.23.1
Pytorch 1.11.0+cu113
Datasets 2.6.0
Tokenizers 0.13.1