---
pipeline_tag: automatic-speech-recognition
---
# Voice AI

![image](https://github.com/praveendecode/Voice_AI/assets/95226524/5b7e735b-2164-416d-84d4-20737181e434)
## Goal:

- The Voice AI project aims to implement a speech-to-text system using the Hugging Face Whisper ASR models.
- The primary objectives are accurate transcription of Marathi audio and fine-tuning the model for improved performance.
## Problem Statement:

- Accurate Marathi speech transcription is crucial for applications such as transcription services, voice assistants, and accessibility tools.
- Inaccurate transcription degrades the user experience and accessibility for Marathi speakers.
## Methodology:

- The project uses the Hugging Face Whisper ASR models for automatic speech recognition, combined with fine-tuning strategies.
- PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation) is explored for efficient training.
## Data Collection and Preprocessing:

- The Common Voice Marathi dataset from the Mozilla Foundation is used.
- Data preprocessing involves down-sampling audio to 16 kHz, feature extraction, and tokenization using the Whisper models' feature extractor and tokenizer.
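The preprocessing steps above can be sketched with the Whisper feature extractor and tokenizer from `transformers`. This is a minimal illustration using a dummy one-second silent clip in place of an actual Common Voice recording (loading the real dataset requires Hugging Face authentication):

```python
import numpy as np
from transformers import WhisperFeatureExtractor, WhisperTokenizer

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="Marathi", task="transcribe"
)

# One second of silence at 16 kHz stands in for a Common Voice clip that has
# already been down-sampled from 48 kHz (e.g. via
# dataset.cast_column("audio", Audio(sampling_rate=16_000)) in `datasets`).
waveform = np.zeros(16_000, dtype=np.float32)

# Log-Mel spectrogram features, padded to Whisper's fixed 30-second input window.
features = feature_extractor(waveform, sampling_rate=16_000).input_features[0]

# Tokenized transcription used as training labels.
labels = tokenizer("नमस्कार").input_ids

print(features.shape)  # (80, 3000): 80 Mel bins x 3000 frames
```

In real training the same two steps run inside a `dataset.map(...)` call over every example.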
## Model Architecture:

- The Whisper ASR models, specifically the Whisper Small and Large versions, serve as the primary architectures and are used for comparison.
- PEFT and LoRA adaptations are applied to improve training efficiency and task-specific adaptation.
## Training and Fine-Tuning:

- The Seq2SeqTrainingArguments and Seq2SeqTrainer from the Hugging Face Transformers library are used for model training.
- Fine-tuning strategies are applied to optimize model performance.
## Evaluation Metrics:

- Word Error Rate (WER) is employed as the primary metric for evaluating model performance.
- The goal is to minimize WER, ensuring accurate transcription of Marathi speech.
- Before fine-tuning, on the provided test dataset, Whisper large-v3 achieved an average WER of 73.8 and Whisper small an average WER of 93.3.
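For reference, WER is the word-level edit distance between the reference and the hypothesis divided by the number of reference words. A minimal stdlib-only implementation (libraries such as `jiwer` or Hugging Face `evaluate` provide the production version):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution over three words
```

Note that WER is conventionally reported as a percentage, so the averages above (73.8 and 93.3) correspond to ratios of 0.738 and 0.933.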
## Challenges Faced:

Challenges encountered during the project include GPU memory limitations, fine-tuning difficulties, and handling large models. Strategies to overcome these challenges are discussed below.

- Storage Constraints: The limited storage capacity in Google Colab prevented additional fine-tuning steps, as there was insufficient space for model checkpoints and intermediate results.
- Low GPU Resources: The free tier of Google Colab provided inadequate GPU capacity, hindering the fine-tuning of larger and more complex models. This limitation affected training efficiency and overall model performance.
- Model Complexity vs. Steps: Balancing increased model complexity against a lower number of fine-tuning steps was difficult. The compromise led to a higher Word Error Rate (WER), showing the impact of insufficient training steps on the model's language understanding and transcription accuracy.
65
+ ## Results:
66
+
67
+ - Due to storage and GPU limitations, the Voice AI project faced challenges, leading to incomplete fine-tuning, reduced model performance, and trade-offs in model size. These constraints may result in suboptimal transcription accuracy and language understanding .
68
+ - This Fine tuning was not working as expected. But I tried my best to perform tuning.
69
+
## Future Work:

- Future enhancements will involve exploring additional pre-trained models, incorporating more diverse datasets, and experimenting with alternative fine-tuning techniques given adequate GPU and storage resources.
## Credits:

- Datasets sourced from Mozilla Common Voice 11.0.
- Model tuning: Hugging Face's Whisper-Small (https://huggingface.co/openai/whisper-small)
## Project Execution:

- Compare the Word Error Rate of the large and small models with this notebook: [WER Comparison](https://github.com/praveendecode/Voice_AI/blob/main/Source/Base_Model_Word_Error_Rate.ipynb)
- Fine-tune using: [Fine-Tuning Process](https://github.com/praveendecode/Voice_AI/blob/main/Source/Fine_Tuning_Whisper_OpenAI_Small.ipynb)
- For inference, use: [Voice AI Inference Script](https://github.com/praveendecode/Voice_AI/blob/main/Source/voice_ai.py)
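A minimal inference sketch with the `transformers` ASR pipeline (the repository's `voice_ai.py` may differ in detail; `whisper-tiny` is used here only to keep the example light, and the silent clip is a placeholder for a real recording):

```python
import numpy as np
from transformers import pipeline

# Swap in "openai/whisper-small" or the fine-tuned checkpoint for real use;
# whisper-tiny keeps this sketch quick to download.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# One second of silence stands in for a recording; in practice pass a file
# path instead, e.g. asr("clip.wav").
result = asr({"array": np.zeros(16_000, dtype=np.float32), "sampling_rate": 16_000})
print(result["text"])
```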

Note: All code explanations are given in the Google Colab notebooks.