---
license: apache-2.0
base_model: openai/whisper-small
tags:
- generated_from_trainer
- ASR
- Hassaniya
- Mauritanian Arabic
- Arabic Dialects
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
model-index:
- name: whisper-samll-hassanya
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Hassaniya Audio Dataset
type: private
metrics:
- name: Word Error Rate
value: 53.7052
type: wer
- name: Character Error Rate
value: 19.4695
type: cer
---
# whisper-samll-hassanya
This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) specifically developed to process the unique phonetic and linguistic features of the Hassaniya dialect, a variant of Arabic spoken predominantly in Mauritania and some parts of the Sahel.
It achieves the following results on the evaluation set:
- Loss: 1.0886
- Wer: 53.7052
- Cer: 19.4695
## Whisper Small for Hassaniya
This model adapts the Whisper small architecture to provide efficient and accurate automatic speech recognition for the Hassaniya dialect. The initiative addresses both a technological need and a cultural imperative to preserve a linguistically unique form of Arabic.
## Intended Uses & Limitations
This model is intended for use in professional transcription services and linguistic research. It can facilitate the creation of accurate textual representations of Hassaniya speech, contributing to digital heritage preservation and linguistic studies. Users should note that performance may vary based on the audio quality and the speaker's accent.
## Training and Evaluation Data
The model was trained on a curated dataset of Hassaniya audio recordings collected through AudioScribe, an application dedicated to high-quality data collection. The dataset is divided into three subsets with the following total audio lengths:
- **Training set**: 5 hours 30 minutes
- **Testing set**: 4 minutes
- **Evaluation set**: 18 minutes
The dataset includes speech samples from native speakers across different age groups and genders to improve the robustness of the model.
## Model Performance
The model was evaluated before and after training to measure the improvement from fine-tuning. The key metrics are listed below:
- **Pre-training Evaluation on Eval Set**
  - Loss: 1.8927
  - WER: 109.5684
  - CER: 64.1758
- **Post-training Evaluation on Eval Set**
  - Loss: 1.0886
  - WER: 53.7052
  - CER: 19.4695
- **Post-training Evaluation on Test Set**
  - Loss: 1.1393
  - WER: 52.1739
  - CER: 18.5801
These results show a substantial improvement across all metrics over the course of training, particularly in Word Error Rate and Character Error Rate, reflecting the model's improved accuracy in recognizing Hassaniya speech.
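The WER and CER figures above are reported as percentages. A minimal sketch of how such scores can be computed, assuming the Hugging Face `evaluate` library (the reference and prediction strings below are placeholders):
```python
import evaluate

# Load word- and character-error-rate metrics
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholder reference transcripts and model predictions
references = ["example reference transcript"]
predictions = ["example predicted transcript"]

# evaluate returns fractions; multiply by 100 to match the percentages above
wer = 100 * wer_metric.compute(references=references, predictions=predictions)
cer = 100 * cer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.4f}  CER: {cer:.4f}")
```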
## Training Procedure
### Training Duration
The training of the model was completed in approximately 2 hours.
### Training Hyperparameters
The following hyperparameters were used during training (see the sketch after this list for how they map onto `Seq2SeqTrainingArguments`):
- **Learning Rate**: 0.0001
- **Train Batch Size**: 8
- **Eval Batch Size**: 8
- **Seed**: 42
- **Gradient Accumulation Steps**: 4
- **Total Train Batch Size**: 32
- **Optimizer**: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- **LR Scheduler Type**: Linear
- **Num Epochs**: 30
- **Mixed Precision Training**: Native AMP
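For reproducibility, these settings correspond roughly to the `Seq2SeqTrainingArguments` below. This is a hedged reconstruction, not the exact training script: `output_dir`, the evaluation strategy, and `predict_with_generate` are assumptions, while the numeric values mirror the list above.
```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="model_output",           # assumed output directory
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,       # effective train batch size: 8 * 4 = 32
    num_train_epochs=30,
    lr_scheduler_type="linear",
    seed=42,
    fp16=True,                           # native AMP mixed precision
    eval_strategy="epoch",               # assumed; the card reports per-epoch eval
    predict_with_generate=True,          # assumed; required to compute WER/CER
)
```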
### Training results
| Training Loss | Epoch | Step | Validation Loss | Wer | Cer |
|:-------------:|:-------:|:----:|:---------------:|:-------:|:-------:|
| 2.0952 | 0.9888 | 22 | 0.9536 | 83.7541 | 37.1504 |
| 0.6677 | 1.9775 | 44 | 0.7836 | 64.5765 | 23.9257 |
| 0.3515 | 2.9663 | 66 | 0.7909 | 58.8762 | 21.0989 |
| 0.1671 | 4.0 | 89 | 0.8284 | 58.0212 | 21.3793 |
| 0.0983 | 4.9888 | 111 | 0.9085 | 59.2020 | 21.1898 |
| 0.0712 | 5.9775 | 133 | 0.9241 | 65.2687 | 26.1538 |
| 0.0517 | 6.9663 | 155 | 0.9446 | 58.4283 | 21.6370 |
| 0.0371 | 8.0 | 178 | 1.0098 | 56.9218 | 21.4627 |
| 0.0304 | 8.9888 | 200 | 1.0016 | 55.2117 | 19.8181 |
| 0.024 | 9.9775 | 222 | 0.9747 | 57.7769 | 24.7594 |
| 0.0186 | 10.9663 | 244 | 1.0000 | 56.1482 | 20.1213 |
| 0.0129 | 12.0 | 267 | 1.0024 | 56.1889 | 20.2501 |
| 0.0091 | 12.9888 | 289 | 1.0274 | 55.6596 | 19.9545 |
| 0.0055 | 13.9775 | 311 | 1.0290 | 55.7410 | 20.0076 |
| 0.0044 | 14.9663 | 333 | 1.0400 | 57.8176 | 22.5161 |
| 0.0031 | 16.0 | 356 | 1.0504 | 54.8046 | 19.5908 |
| 0.002 | 16.9888 | 378 | 1.0569 | 54.6417 | 19.4922 |
| 0.0016 | 17.9775 | 400 | 1.0714 | 55.3339 | 19.6286 |
| 0.0017 | 18.9663 | 422 | 1.0604 | 56.2296 | 20.3638 |
| 0.0023 | 20.0 | 445 | 1.0661 | 54.9674 | 19.9621 |
| 0.0022 | 20.9888 | 467 | 1.0563 | 53.9902 | 19.8560 |
| 0.0012 | 21.9775 | 489 | 1.0757 | 54.1531 | 19.3937 |
| 0.0008 | 22.9663 | 511 | 1.0789 | 54.3974 | 19.7272 |
| 0.0006 | 24.0 | 534 | 1.0806 | 54.4788 | 19.6211 |
| 0.0006 | 24.9888 | 556 | 1.0818 | 54.1531 | 19.5377 |
| 0.0005 | 25.9775 | 578 | 1.0839 | 54.0717 | 19.5225 |
| 0.0005 | 26.9663 | 600 | 1.0862 | 53.9088 | 19.4847 |
| 0.0004 | 28.0 | 623 | 1.0876 | 53.6238 | 19.3861 |
| 0.0005 | 28.9888 | 645 | 1.0885 | 53.7052 | 19.4771 |
| 0.0004 | 29.6629 | 660 | 1.0886 | 53.7052 | 19.4695 |
### Framework versions
- Transformers 4.44.0
- Pytorch 2.3.1+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1
## How to Use
This model is accessible via two methods depending on the level of control you prefer: directly using the model and processor for fine-grained operations, or employing the high-level `pipeline` interface for simplicity.
### Direct Model Control
For users who prefer direct control over model loading and audio processing, the following example demonstrates how to use the Whisper model for speech recognition tasks:
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
# Load the Whisper model and processor
processor = WhisperProcessor.from_pretrained("abscheik/whisper-samll-hassanya")
model = WhisperForConditionalGeneration.from_pretrained("abscheik/whisper-samll-hassanya")
def transcribe_audio(audio_path):
    """Function to transcribe audio using the Whisper model directly."""
    speech, sampling_rate = librosa.load(audio_path, sr=None)
    # Ensure the audio is in the correct format and sample rate for Whisper
    if sampling_rate != 16000:
        # Resample using librosa if the sampling rate is not 16000 Hz
        speech = librosa.resample(speech, orig_sr=sampling_rate, target_sr=16000)
        sampling_rate = 16000
    # Convert the waveform to input features, generate token ids, and decode them
    input_features = processor(speech, sampling_rate=sampling_rate, return_tensors="pt")
    generated_ids = model.generate(input_features.input_features)
    transcription = processor.decode(generated_ids[0], skip_special_tokens=True)
    return transcription
# Example usage
audio_file_path = 'path_to_your_hassaniya_audio_file.wav'
transcription = transcribe_audio(audio_file_path)
print("Transcription:", transcription)
```
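Depending on how the generation configuration was saved with this checkpoint, decoding may occasionally drift into another language. If that happens, Whisper's standard generation options can steer it toward Arabic transcription. A hedged sketch that would replace the `model.generate(...)` call inside `transcribe_audio` (the `"ar"` language code is an assumption for Hassaniya as an Arabic variety):
```python
# Optionally force Arabic transcription during generation
generated_ids = model.generate(
    input_features.input_features,
    language="ar",        # assumed language token for the Arabic family
    task="transcribe",
)
```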
### Using Hugging Face Pipeline
For ease of use, especially when simplification is key, the Hugging Face `pipeline` interface provides a streamlined way to transcribe audio:
```python
from transformers import pipeline
# Create a pipeline for automatic speech recognition
pipe = pipeline("automatic-speech-recognition", model="abscheik/whisper-samll-hassanya")
def transcribe_audio(audio_path):
    """Function to transcribe audio using a high-level Hugging Face pipeline."""
    transcription = pipe(audio_path)
    return transcription['text']

# Example usage
audio_file_path = 'path_to_your_hassaniya_audio_file.wav'
transcription = transcribe_audio(audio_file_path)
print("Transcription:", transcription)
```
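Whisper processes audio in roughly 30-second windows, so for longer recordings the pipeline can optionally chunk the input. A minimal sketch using the standard `chunk_length_s` pipeline option:
```python
# Optional: chunked decoding for recordings longer than ~30 seconds
pipe = pipeline(
    "automatic-speech-recognition",
    model="abscheik/whisper-samll-hassanya",
    chunk_length_s=30,
)
```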