
whisper-samll-hassanya

This model is a fine-tuned version of openai/whisper-small specifically developed to process the unique phonetic and linguistic features of the Hassaniya dialect, a variant of Arabic spoken predominantly in Mauritania and some parts of the Sahel. It achieves the following results on the evaluation set:

  • Loss: 1.0886
  • WER: 53.7052
  • CER: 19.4695

Whisper Small for Hassaniya

Built on Whisper, this model provides efficient and accurate automatic speech recognition for the Hassaniya dialect. The initiative addresses both a technological need and a cultural imperative to preserve a linguistically unique form of Arabic.

Intended Uses & Limitations

This model is intended for use in professional transcription services and linguistic research. It can facilitate the creation of accurate textual representations of Hassaniya speech, contributing to digital heritage preservation and linguistic studies. Users should note that performance may vary with audio quality and the speaker's accent.

Training and Evaluation Data

The model was trained on a curated dataset of Hassaniya audio recordings collected through AudioScribe, an application dedicated to high-quality data collection. The dataset is divided into three subsets with the following total audio lengths:

  • Training set: 5 hours 30 minutes
  • Testing set: 4 minutes
  • Evaluation set: 18 minutes

This diverse dataset includes various speech samples from native speakers across different age groups and genders to ensure robust model performance.
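The AudioScribe dataset itself is not published with this card, but a comparable corpus of local recordings and transcripts can be prepared for Whisper fine-tuning roughly as sketched below. The file paths, column names, and split ratio are illustrative assumptions, not the actual AudioScribe export:

from datasets import Dataset, Audio

# Hypothetical local files: each WAV paired with its Hassaniya transcript
data = {
    "audio": ["clips/rec_001.wav", "clips/rec_002.wav"],
    "sentence": ["transcript one", "transcript two"],
}

ds = Dataset.from_dict(data)
# Whisper expects 16 kHz audio; casting resamples lazily when each example is decoded
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

# Illustrative split; the 5 h 30 min / 4 min / 18 min subsets in this card were prepared upstream
ds = ds.train_test_split(test_size=0.05, seed=42)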

Model Performance

The model was evaluated before and after training to assess its accuracy. Below are the key performance metrics:

  • Pre-training Evaluation on Eval Set

    • Loss: 1.8927
    • WER: 109.5684
    • CER: 64.1758
  • Post-training Evaluation on Eval Set

    • Loss: 1.0886
    • WER: 53.7052
    • CER: 19.4695
  • Post-training Evaluation on Test Set

    • Loss: 1.1393
    • WER: 52.1739
    • CER: 18.5801

These results show a substantial improvement in all metrics over the course of training, particularly in Word Error Rate and Character Error Rate, demonstrating the model's improved accuracy in recognizing Hassaniya speech.
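WER and CER above are expressed as percentages. For reference, here is a minimal sketch of how both metrics can be computed with the Hugging Face evaluate library; this is illustrative and not necessarily the exact evaluation script used for this card:

import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

references = ["the reference transcript"]   # ground-truth text
predictions = ["the model transcript"]      # model output

# compute() returns a fraction; multiply by 100 to match the percentages reported above
wer = 100 * wer_metric.compute(predictions=predictions, references=references)
cer = 100 * cer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}  CER: {cer:.2f}")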

Training Procedure

Training Duration

The training of the model was completed in approximately 2 hours.

Training Hyperparameters

The following hyperparameters were used during training (a sketch mapping them to Seq2SeqTrainingArguments follows the list):

  • Learning Rate: 0.0001
  • Train Batch Size: 8
  • Eval Batch Size: 8
  • Seed: 42
  • Gradient Accumulation Steps: 4
  • Total Train Batch Size: 32
  • Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
  • LR Scheduler Type: Linear
  • Num Epochs: 30
  • Mixed Precision Training: Native AMP
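The training script is not included in this card, but the hyperparameters above map onto transformers' Seq2SeqTrainingArguments roughly as sketched below. The output directory is illustrative, and predict_with_generate is an assumption based on WER/CER being computed during evaluation:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./model_output",        # illustrative path
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,      # effective train batch size: 8 * 4 = 32
    num_train_epochs=30,
    lr_scheduler_type="linear",
    seed=42,
    fp16=True,                          # native AMP mixed precision
    predict_with_generate=True,         # decode during evaluation so WER/CER can be computed
)
# Adam with betas=(0.9, 0.999) and epsilon=1e-08 is the default optimizer in transformers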

Training results

| Training Loss | Epoch | Step | Validation Loss | WER | CER |
|---|---|---|---|---|---|
| 2.0952 | 0.9888 | 22 | 0.9536 | 83.7541 | 37.1504 |
| 0.6677 | 1.9775 | 44 | 0.7836 | 64.5765 | 23.9257 |
| 0.3515 | 2.9663 | 66 | 0.7909 | 58.8762 | 21.0989 |
| 0.1671 | 4.0 | 89 | 0.8284 | 58.0212 | 21.3793 |
| 0.0983 | 4.9888 | 111 | 0.9085 | 59.2020 | 21.1898 |
| 0.0712 | 5.9775 | 133 | 0.9241 | 65.2687 | 26.1538 |
| 0.0517 | 6.9663 | 155 | 0.9446 | 58.4283 | 21.6370 |
| 0.0371 | 8.0 | 178 | 1.0098 | 56.9218 | 21.4627 |
| 0.0304 | 8.9888 | 200 | 1.0016 | 55.2117 | 19.8181 |
| 0.024 | 9.9775 | 222 | 0.9747 | 57.7769 | 24.7594 |
| 0.0186 | 10.9663 | 244 | 1.0000 | 56.1482 | 20.1213 |
| 0.0129 | 12.0 | 267 | 1.0024 | 56.1889 | 20.2501 |
| 0.0091 | 12.9888 | 289 | 1.0274 | 55.6596 | 19.9545 |
| 0.0055 | 13.9775 | 311 | 1.0290 | 55.7410 | 20.0076 |
| 0.0044 | 14.9663 | 333 | 1.0400 | 57.8176 | 22.5161 |
| 0.0031 | 16.0 | 356 | 1.0504 | 54.8046 | 19.5908 |
| 0.002 | 16.9888 | 378 | 1.0569 | 54.6417 | 19.4922 |
| 0.0016 | 17.9775 | 400 | 1.0714 | 55.3339 | 19.6286 |
| 0.0017 | 18.9663 | 422 | 1.0604 | 56.2296 | 20.3638 |
| 0.0023 | 20.0 | 445 | 1.0661 | 54.9674 | 19.9621 |
| 0.0022 | 20.9888 | 467 | 1.0563 | 53.9902 | 19.8560 |
| 0.0012 | 21.9775 | 489 | 1.0757 | 54.1531 | 19.3937 |
| 0.0008 | 22.9663 | 511 | 1.0789 | 54.3974 | 19.7272 |
| 0.0006 | 24.0 | 534 | 1.0806 | 54.4788 | 19.6211 |
| 0.0006 | 24.9888 | 556 | 1.0818 | 54.1531 | 19.5377 |
| 0.0005 | 25.9775 | 578 | 1.0839 | 54.0717 | 19.5225 |
| 0.0005 | 26.9663 | 600 | 1.0862 | 53.9088 | 19.4847 |
| 0.0004 | 28.0 | 623 | 1.0876 | 53.6238 | 19.3861 |
| 0.0005 | 28.9888 | 645 | 1.0885 | 53.7052 | 19.4771 |
| 0.0004 | 29.6629 | 660 | 1.0886 | 53.7052 | 19.4695 |

Framework versions

  • Transformers 4.44.0
  • Pytorch 2.3.1+cu121
  • Datasets 2.21.0
  • Tokenizers 0.19.1
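A quick way to confirm that a local environment matches these versions (assuming the packages above are installed):

import transformers, torch, datasets, tokenizers

# Expected: 4.44.0, 2.3.1+cu121, 2.21.0, 0.19.1
print(transformers.__version__)
print(torch.__version__)
print(datasets.__version__)
print(tokenizers.__version__)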

How to Use

This model is accessible via two methods depending on the level of control you prefer: directly using the model and processor for fine-grained operations, or employing the high-level pipeline interface for simplicity.

Direct Model Control

For users who prefer direct control over model loading and audio processing, the following example demonstrates how to use the Whisper model for speech recognition tasks:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load the Whisper model and processor
processor = WhisperProcessor.from_pretrained("abscheik/whisper-samll-hassanya")
model = WhisperForConditionalGeneration.from_pretrained("abscheik/whisper-samll-hassanya")

def transcribe_audio(audio_path):
    """Function to transcribe audio using the Whisper model directly"""
    speech, sampling_rate = librosa.load(audio_path, sr=None)
    # Ensure the audio is in the correct format and sample rate for Whisper
    if sampling_rate != 16000:
        # Resample using librosa if the sampling rate is not 16000 Hz
        speech = librosa.resample(speech, orig_sr=sampling_rate, target_sr=16000)
        sampling_rate = 16000
    input_features = processor(speech, sampling_rate=sampling_rate, return_tensors="pt")
    generated_ids = model.generate(input_features.input_features)
    transcription = processor.decode(generated_ids[0], skip_special_tokens=True)
    return transcription

# Example usage
audio_file_path = 'path_to_your_hassaniya_audio_file.wav'
transcription = transcribe_audio(audio_file_path)
print("Transcription:", transcription)

Using Hugging Face Pipeline

For simpler integration, the Hugging Face pipeline interface provides a streamlined way to transcribe audio:

from transformers import pipeline

# Create a pipeline for automatic speech recognition
pipe = pipeline("automatic-speech-recognition", model="abscheik/whisper-samll-hassanya")

def transcribe_audio(audio_path):
    """Function to transcribe audio using a high-level Hugging Face pipeline"""
    transcription = pipe(audio_path)
    return transcription['text']

# Example usage
audio_file_path = 'path_to_your_hassaniya_audio_file.wav'
transcription = transcribe_audio(audio_file_path)
print("Transcription:", transcription)