
whisper-samll-hassanya

This model is a fine-tuned version of openai/whisper-small specifically developed to process the unique phonetic and linguistic features of the Hassaniya dialect, a variant of Arabic spoken predominantly in Mauritania and some parts of the Sahel. It achieves the following results on the evaluation set:

  • Loss: 1.0886
  • WER: 53.7052
  • CER: 19.4695

Whisper Small for Hassaniya

Built on Whisper, this model provides efficient and accurate automatic speech recognition for the Hassaniya dialect. The initiative addresses both a technological need and a cultural imperative to preserve a linguistically unique form of Arabic.

Intended Uses & Limitations

This model is intended for use in professional transcription services and linguistic research. It can facilitate the creation of accurate textual representations of Hassaniya speech, contributing to digital heritage preservation and linguistic studies. Users should note that performance may vary with audio quality and the speaker's accent.

Training and Evaluation Data

The model was trained on a curated dataset of Hassaniya audio recordings collected through AudioScribe, an application dedicated to high-quality data collection. The dataset is divided into three subsets with the following total audio lengths:

  • Training set: 5 hours 30 minutes
  • Testing set: 4 minutes
  • Evaluation set: 18 minutes

This diverse dataset includes various speech samples from native speakers across different age groups and genders to ensure robust model performance.
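The AudioScribe dataset itself is not published with this card, but a comparable corpus of local recordings and transcripts can be prepared for Whisper fine-tuning roughly as sketched below. The file paths, column names, and split ratio are illustrative assumptions, not the actual AudioScribe export:

from datasets import Dataset, Audio

# Hypothetical local files: each WAV paired with its Hassaniya transcript
data = {
    "audio": ["clips/rec_001.wav", "clips/rec_002.wav"],
    "sentence": ["transcript one", "transcript two"],
}

ds = Dataset.from_dict(data)
# Whisper expects 16 kHz audio; casting resamples lazily when each example is decoded
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

# Illustrative split; the 5 h 30 min / 4 min / 18 min subsets in this card were prepared upstream
ds = ds.train_test_split(test_size=0.05, seed=42)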

Model Performance

The model was evaluated before and after training to assess its accuracy. Below are the key performance metrics:

  • Pre-training Evaluation on Eval Set

    • Loss: 1.8927
    • WER: 109.5684
    • CER: 64.1758
  • Post-training Evaluation on Eval Set

    • Loss: 1.0886
    • WER: 53.7052
    • CER: 19.4695
  • Post-training Evaluation on Test Set

    • Loss: 1.1393
    • WER: 52.1739
    • CER: 18.5801

These results show a substantial improvement in all metrics over the course of training, particularly in Word Error Rate and Character Error Rate, demonstrating the model's improved accuracy in recognizing Hassaniya speech.
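WER and CER above are expressed as percentages. For reference, here is a minimal sketch of how both metrics can be computed with the Hugging Face evaluate library; this is illustrative and not necessarily the exact evaluation script used for this card:

import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

references = ["the reference transcript"]   # ground-truth text
predictions = ["the model transcript"]      # model output

# compute() returns a fraction; multiply by 100 to match the percentages reported above
wer = 100 * wer_metric.compute(predictions=predictions, references=references)
cer = 100 * cer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}  CER: {cer:.2f}")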

Training Procedure

Training Duration

The training of the model was completed in approximately 2 hours.

Training Hyperparameters

The following hyperparameters were used during training (a sketch mapping them to Seq2SeqTrainingArguments follows the list):

  • Learning Rate: 0.0001
  • Train Batch Size: 8
  • Eval Batch Size: 8
  • Seed: 42
  • Gradient Accumulation Steps: 4
  • Total Train Batch Size: 32
  • Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
  • LR Scheduler Type: Linear
  • Num Epochs: 30
  • Mixed Precision Training: Native AMP
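The training script is not included in this card, but the hyperparameters above map onto transformers' Seq2SeqTrainingArguments roughly as sketched below. The output directory is illustrative, and predict_with_generate is an assumption based on WER/CER being computed during evaluation:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./model_output",        # illustrative path
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,      # effective train batch size: 8 * 4 = 32
    num_train_epochs=30,
    lr_scheduler_type="linear",
    seed=42,
    fp16=True,                          # native AMP mixed precision
    predict_with_generate=True,         # decode during evaluation so WER/CER can be computed
)
# Adam with betas=(0.9, 0.999) and epsilon=1e-08 is the default optimizer in transformers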

Training results

| Training Loss | Epoch | Step | Validation Loss | WER | CER |
|---|---|---|---|---|---|
| 2.0952 | 0.9888 | 22 | 0.9536 | 83.7541 | 37.1504 |
| 0.6677 | 1.9775 | 44 | 0.7836 | 64.5765 | 23.9257 |
| 0.3515 | 2.9663 | 66 | 0.7909 | 58.8762 | 21.0989 |
| 0.1671 | 4.0 | 89 | 0.8284 | 58.0212 | 21.3793 |
| 0.0983 | 4.9888 | 111 | 0.9085 | 59.2020 | 21.1898 |
| 0.0712 | 5.9775 | 133 | 0.9241 | 65.2687 | 26.1538 |
| 0.0517 | 6.9663 | 155 | 0.9446 | 58.4283 | 21.6370 |
| 0.0371 | 8.0 | 178 | 1.0098 | 56.9218 | 21.4627 |
| 0.0304 | 8.9888 | 200 | 1.0016 | 55.2117 | 19.8181 |
| 0.024 | 9.9775 | 222 | 0.9747 | 57.7769 | 24.7594 |
| 0.0186 | 10.9663 | 244 | 1.0000 | 56.1482 | 20.1213 |
| 0.0129 | 12.0 | 267 | 1.0024 | 56.1889 | 20.2501 |
| 0.0091 | 12.9888 | 289 | 1.0274 | 55.6596 | 19.9545 |
| 0.0055 | 13.9775 | 311 | 1.0290 | 55.7410 | 20.0076 |
| 0.0044 | 14.9663 | 333 | 1.0400 | 57.8176 | 22.5161 |
| 0.0031 | 16.0 | 356 | 1.0504 | 54.8046 | 19.5908 |
| 0.002 | 16.9888 | 378 | 1.0569 | 54.6417 | 19.4922 |
| 0.0016 | 17.9775 | 400 | 1.0714 | 55.3339 | 19.6286 |
| 0.0017 | 18.9663 | 422 | 1.0604 | 56.2296 | 20.3638 |
| 0.0023 | 20.0 | 445 | 1.0661 | 54.9674 | 19.9621 |
| 0.0022 | 20.9888 | 467 | 1.0563 | 53.9902 | 19.8560 |
| 0.0012 | 21.9775 | 489 | 1.0757 | 54.1531 | 19.3937 |
| 0.0008 | 22.9663 | 511 | 1.0789 | 54.3974 | 19.7272 |
| 0.0006 | 24.0 | 534 | 1.0806 | 54.4788 | 19.6211 |
| 0.0006 | 24.9888 | 556 | 1.0818 | 54.1531 | 19.5377 |
| 0.0005 | 25.9775 | 578 | 1.0839 | 54.0717 | 19.5225 |
| 0.0005 | 26.9663 | 600 | 1.0862 | 53.9088 | 19.4847 |
| 0.0004 | 28.0 | 623 | 1.0876 | 53.6238 | 19.3861 |
| 0.0005 | 28.9888 | 645 | 1.0885 | 53.7052 | 19.4771 |
| 0.0004 | 29.6629 | 660 | 1.0886 | 53.7052 | 19.4695 |

Framework versions

  • Transformers 4.44.0
  • Pytorch 2.3.1+cu121
  • Datasets 2.21.0
  • Tokenizers 0.19.1
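A quick way to confirm that a local environment matches these versions (assuming the packages above are installed):

import transformers, torch, datasets, tokenizers

# Expected: 4.44.0, 2.3.1+cu121, 2.21.0, 0.19.1
print(transformers.__version__)
print(torch.__version__)
print(datasets.__version__)
print(tokenizers.__version__)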

How to Use

This model is accessible via two methods depending on the level of control you prefer: directly using the model and processor for fine-grained operations, or employing the high-level pipeline interface for simplicity.

Direct Model Control

For users who prefer direct control over model loading and audio processing, the following example demonstrates how to use the Whisper model for speech recognition tasks:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load the Whisper model and processor
processor = WhisperProcessor.from_pretrained("abscheik/whisper-samll-hassanya")
model = WhisperForConditionalGeneration.from_pretrained("abscheik/whisper-samll-hassanya")

def transcribe_audio(audio_path):
    """Function to transcribe audio using the Whisper model directly"""
    speech, sampling_rate = librosa.load(audio_path, sr=None)
    # Ensure the audio is in the correct format and sample rate for Whisper
    if sampling_rate != 16000:
        # Resample using librosa if the sampling rate is not 16000 Hz
        speech = librosa.resample(speech, orig_sr=sampling_rate, target_sr=16000)
        sampling_rate = 16000
    input_features = processor(speech, sampling_rate=sampling_rate, return_tensors="pt")
    generated_ids = model.generate(input_features.input_features)
    transcription = processor.decode(generated_ids[0], skip_special_tokens=True)
    return transcription

# Example usage
audio_file_path = 'path_to_your_hassaniya_audio_file.wav'
transcription = transcribe_audio(audio_file_path)
print("Transcription:", transcription)

Using Hugging Face Pipeline

For simpler integration, the Hugging Face pipeline interface provides a streamlined way to transcribe audio:

from transformers import pipeline

# Create a pipeline for automatic speech recognition
pipe = pipeline("automatic-speech-recognition", model="abscheik/whisper-samll-hassanya")

def transcribe_audio(audio_path):
    """Function to transcribe audio using a high-level Hugging Face pipeline"""
    transcription = pipe(audio_path)
    return transcription['text']

# Example usage
audio_file_path = 'path_to_your_hassaniya_audio_file.wav'
transcription = transcribe_audio(audio_file_path)
print("Transcription:", transcription)