---
license: apache-2.0
base_model: openai/whisper-small
tags:
- generated_from_trainer
- ASR
- Hassaniya
- Mauritanian Arabic
- Arabic Dialects
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
model-index:
- name: whisper-samll-hassanya
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Hassaniya Audio Dataset
type: private
metrics:
- name: Word Error Rate
value: 53.7052
type: wer
- name: Character Error Rate
value: 19.4695
type: cer
---
# whisper-samll-hassanya
This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) specifically developed to process the unique phonetic and linguistic features of the Hassaniya dialect, a variant of Arabic spoken predominantly in Mauritania and some parts of the Sahel.
It achieves the following results on the evaluation set:
- Loss: 1.0886
- Wer: 53.7052
- Cer: 19.4695
## Whisper Small for Hassaniya
This model adapts the Whisper small architecture to provide efficient and accurate automatic speech recognition for the Hassaniya dialect. The initiative addresses both a technological need and a cultural imperative to preserve a linguistically unique form of Arabic.
## Intended Uses & Limitations
This model is intended for use in professional transcription services and linguistic research. It can facilitate the creation of accurate textual representations of Hassaniya speech, contributing to digital heritage preservation and linguistic studies. Users should note that performance may vary based on the audio quality and the speaker's accent.
## Training and Evaluation Data
The model was trained on a curated dataset of Hassaniya audio recordings collected through AudioScribe, an application dedicated to high-quality data collection. The dataset is divided into three subsets with the following total audio lengths:
- **Training set**: 5 hours 30 minutes
- **Testing set**: 4 minutes
- **Evaluation set**: 18 minutes
The dataset includes speech samples from native speakers across different age groups and genders to improve the robustness of the model.
## Model Performance
The model was evaluated before and after training to measure the improvement from fine-tuning. The key metrics are listed below:
- **Pre-training Evaluation on Eval Set**
  - Loss: 1.8927
  - WER: 109.5684
  - CER: 64.1758
- **Post-training Evaluation on Eval Set**
  - Loss: 1.0886
  - WER: 53.7052
  - CER: 19.4695
- **Post-training Evaluation on Test Set**
  - Loss: 1.1393
  - WER: 52.1739
  - CER: 18.5801
These results show a substantial improvement across all metrics over the course of training, particularly in Word Error Rate and Character Error Rate, reflecting the model's improved accuracy in recognizing Hassaniya speech.
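The WER and CER figures above are reported as percentages. A minimal sketch of how such scores can be computed, assuming the Hugging Face `evaluate` library (the reference and prediction strings below are placeholders):
```python
import evaluate

# Load word- and character-error-rate metrics
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholder reference transcripts and model predictions
references = ["example reference transcript"]
predictions = ["example predicted transcript"]

# evaluate returns fractions; multiply by 100 to match the percentages above
wer = 100 * wer_metric.compute(references=references, predictions=predictions)
cer = 100 * cer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.4f}  CER: {cer:.4f}")
```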
## Training Procedure
### Training Duration
The training of the model was completed in approximately 2 hours.
### Training Hyperparameters
The following hyperparameters were used during training (see the sketch after this list for how they map onto `Seq2SeqTrainingArguments`):
- **Learning Rate**: 0.0001
- **Train Batch Size**: 8
- **Eval Batch Size**: 8
- **Seed**: 42
- **Gradient Accumulation Steps**: 4
- **Total Train Batch Size**: 32
- **Optimizer**: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- **LR Scheduler Type**: Linear
- **Num Epochs**: 30
- **Mixed Precision Training**: Native AMP
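For reproducibility, these settings correspond roughly to the `Seq2SeqTrainingArguments` below. This is a hedged reconstruction, not the exact training script: `output_dir`, the evaluation strategy, and `predict_with_generate` are assumptions, while the numeric values mirror the list above.
```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="model_output",           # assumed output directory
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,       # effective train batch size: 8 * 4 = 32
    num_train_epochs=30,
    lr_scheduler_type="linear",
    seed=42,
    fp16=True,                           # native AMP mixed precision
    eval_strategy="epoch",               # assumed; the card reports per-epoch eval
    predict_with_generate=True,          # assumed; required to compute WER/CER
)
```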
### Training results
| Training Loss | Epoch | Step | Validation Loss | Wer | Cer |
|:-------------:|:-------:|:----:|:---------------:|:-------:|:-------:|
| 2.0952 | 0.9888 | 22 | 0.9536 | 83.7541 | 37.1504 |
| 0.6677 | 1.9775 | 44 | 0.7836 | 64.5765 | 23.9257 |
| 0.3515 | 2.9663 | 66 | 0.7909 | 58.8762 | 21.0989 |
| 0.1671 | 4.0 | 89 | 0.8284 | 58.0212 | 21.3793 |
| 0.0983 | 4.9888 | 111 | 0.9085 | 59.2020 | 21.1898 |
| 0.0712 | 5.9775 | 133 | 0.9241 | 65.2687 | 26.1538 |
| 0.0517 | 6.9663 | 155 | 0.9446 | 58.4283 | 21.6370 |
| 0.0371 | 8.0 | 178 | 1.0098 | 56.9218 | 21.4627 |
| 0.0304 | 8.9888 | 200 | 1.0016 | 55.2117 | 19.8181 |
| 0.024 | 9.9775 | 222 | 0.9747 | 57.7769 | 24.7594 |
| 0.0186 | 10.9663 | 244 | 1.0000 | 56.1482 | 20.1213 |
| 0.0129 | 12.0 | 267 | 1.0024 | 56.1889 | 20.2501 |
| 0.0091 | 12.9888 | 289 | 1.0274 | 55.6596 | 19.9545 |
| 0.0055 | 13.9775 | 311 | 1.0290 | 55.7410 | 20.0076 |
| 0.0044 | 14.9663 | 333 | 1.0400 | 57.8176 | 22.5161 |
| 0.0031 | 16.0 | 356 | 1.0504 | 54.8046 | 19.5908 |
| 0.002 | 16.9888 | 378 | 1.0569 | 54.6417 | 19.4922 |
| 0.0016 | 17.9775 | 400 | 1.0714 | 55.3339 | 19.6286 |
| 0.0017 | 18.9663 | 422 | 1.0604 | 56.2296 | 20.3638 |
| 0.0023 | 20.0 | 445 | 1.0661 | 54.9674 | 19.9621 |
| 0.0022 | 20.9888 | 467 | 1.0563 | 53.9902 | 19.8560 |
| 0.0012 | 21.9775 | 489 | 1.0757 | 54.1531 | 19.3937 |
| 0.0008 | 22.9663 | 511 | 1.0789 | 54.3974 | 19.7272 |
| 0.0006 | 24.0 | 534 | 1.0806 | 54.4788 | 19.6211 |
| 0.0006 | 24.9888 | 556 | 1.0818 | 54.1531 | 19.5377 |
| 0.0005 | 25.9775 | 578 | 1.0839 | 54.0717 | 19.5225 |
| 0.0005 | 26.9663 | 600 | 1.0862 | 53.9088 | 19.4847 |
| 0.0004 | 28.0 | 623 | 1.0876 | 53.6238 | 19.3861 |
| 0.0005 | 28.9888 | 645 | 1.0885 | 53.7052 | 19.4771 |
| 0.0004 | 29.6629 | 660 | 1.0886 | 53.7052 | 19.4695 |
### Framework versions
- Transformers 4.44.0
- Pytorch 2.3.1+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1
## How to Use
This model is accessible via two methods depending on the level of control you prefer: directly using the model and processor for fine-grained operations, or employing the high-level `pipeline` interface for simplicity.
### Direct Model Control
For users who prefer direct control over model loading and audio processing, the following example demonstrates how to use the Whisper model for speech recognition tasks:
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
# Load the Whisper model and processor
processor = WhisperProcessor.from_pretrained("abscheik/whisper-samll-hassanya")
model = WhisperForConditionalGeneration.from_pretrained("abscheik/whisper-samll-hassanya")
def transcribe_audio(audio_path):
    """Function to transcribe audio using the Whisper model directly."""
    speech, sampling_rate = librosa.load(audio_path, sr=None)
    # Ensure the audio is in the correct format and sample rate for Whisper
    if sampling_rate != 16000:
        # Resample using librosa if the sampling rate is not 16000 Hz
        speech = librosa.resample(speech, orig_sr=sampling_rate, target_sr=16000)
        sampling_rate = 16000
    # Convert the waveform to input features, generate token ids, and decode them
    input_features = processor(speech, sampling_rate=sampling_rate, return_tensors="pt")
    generated_ids = model.generate(input_features.input_features)
    transcription = processor.decode(generated_ids[0], skip_special_tokens=True)
    return transcription
# Example usage
audio_file_path = 'path_to_your_hassaniya_audio_file.wav'
transcription = transcribe_audio(audio_file_path)
print("Transcription:", transcription)
```
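Depending on how the generation configuration was saved with this checkpoint, decoding may occasionally drift into another language. If that happens, Whisper's standard generation options can steer it toward Arabic transcription. A hedged sketch that would replace the `model.generate(...)` call inside `transcribe_audio` (the `"ar"` language code is an assumption for Hassaniya as an Arabic variety):
```python
# Optionally force Arabic transcription during generation
generated_ids = model.generate(
    input_features.input_features,
    language="ar",        # assumed language token for the Arabic family
    task="transcribe",
)
```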
### Using Hugging Face Pipeline
For ease of use, especially when simplification is key, the Hugging Face `pipeline` interface provides a streamlined way to transcribe audio:
```python
from transformers import pipeline
# Create a pipeline for automatic speech recognition
pipe = pipeline("automatic-speech-recognition", model="abscheik/whisper-samll-hassanya")
def transcribe_audio(audio_path):
    """Function to transcribe audio using a high-level Hugging Face pipeline."""
    transcription = pipe(audio_path)
    return transcription['text']

# Example usage
audio_file_path = 'path_to_your_hassaniya_audio_file.wav'
transcription = transcribe_audio(audio_file_path)
print("Transcription:", transcription)
```
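Whisper processes audio in roughly 30-second windows, so for longer recordings the pipeline can optionally chunk the input. A minimal sketch using the standard `chunk_length_s` pipeline option:
```python
# Optional: chunked decoding for recordings longer than ~30 seconds
pipe = pipeline(
    "automatic-speech-recognition",
    model="abscheik/whisper-samll-hassanya",
    chunk_length_s=30,
)
```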