---
pipeline_tag: automatic-speech-recognition
---

# Voice AI

![image](https://github.com/praveendecode/Voice_AI/assets/95226524/5b7e735b-2164-416d-84d4-20737181e434)

## Goal:

- The Voice AI project implements a speech-to-text system using the Hugging Face Whisper ASR models.
- The primary objectives are accurate transcription of Marathi audio and fine-tuning the model for improved performance.

## Problem Statement:

- Accurate Marathi speech transcription is crucial for applications such as transcription services, voice assistants, and accessibility tools.
- Inaccurate transcription degrades the user experience and accessibility for Marathi speakers.

## Methodology:

- The project uses the Hugging Face Whisper ASR models for automatic speech recognition and explores fine-tuning strategies.
- PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation) is explored for efficient training.
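The LoRA idea can be sketched in plain Python: instead of updating a full weight matrix `W`, two small matrices `B` and `A` of rank `r` are trained, and the effective weight becomes `W + (alpha / r) * B @ A`. This is a toy illustration of the mechanism only, not the PEFT library's implementation:

```python
# Toy illustration of LoRA (pure Python, not the actual PEFT library):
# the frozen weight W (d_out x d_in) stays fixed, while two small trained
# matrices B (d_out x r) and A (r x d_in) with rank r << min(d_out, d_in)
# supply the update. Effective weight: W + (alpha / r) * B @ A.

def matmul(X, Y):
    """Naive matrix multiply for nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_effective_weight(W, A, B, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the LoRA rank."""
    r = len(A)                  # A has shape (r, d_in)
    scale = alpha / r
    delta = matmul(B, A)
    return [[W[i][j] + scale * delta[i][j]
             for j in range(len(W[0]))] for i in range(len(W))]

# Toy example: a 2x2 frozen identity weight with a rank-1 update.
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[1.0, 2.0]]            # (r=1, d_in=2)
B = [[0.5], [0.25]]         # (d_out=2, r=1)
print(lora_effective_weight(W, A, B, alpha=1.0))  # [[1.5, 1.0], [0.25, 1.5]]
```

Because only `A` and `B` are trained, the number of trainable parameters drops from `d_out * d_in` to `r * (d_out + d_in)`, which is what makes fine-tuning feasible on limited GPUs.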

## Data Collection and Preprocessing:

- The Common Voice Marathi dataset from the Mozilla Foundation is used.
- Data preprocessing involves down-sampling the audio to 16 kHz, feature extraction, and tokenization using the Whisper models' feature extractor and tokenizer.
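As a toy illustration of the down-sampling step (not the project's actual code, which presumably casts the audio column with the `datasets` library): Common Voice audio ships at 48 kHz while Whisper expects 16 kHz, so each block of 3 samples must be collapsed into one. The block-averaging below is the simplest possible scheme; a real pipeline would use a proper resampler.

```python
# Minimal sketch of down-sampling by an integer factor using block averaging.
# Common Voice audio is 48 kHz; Whisper's feature extractor expects 16 kHz,
# so every 3 input samples become 1 output sample. A real pipeline would use
# a proper resampler (e.g. casting with datasets' Audio(sampling_rate=16000)).

def downsample(samples, src_rate=48_000, dst_rate=16_000):
    """Down-sample by an integer factor via block averaging."""
    assert src_rate % dst_rate == 0, "sketch only handles integer factors"
    factor = src_rate // dst_rate
    return [sum(samples[i:i + factor]) / factor
            for i in range(0, len(samples) - factor + 1, factor)]

# Six input samples at 48 kHz become two samples at 16 kHz.
print(downsample([1, 2, 3, 4, 5, 6]))  # [2.0, 5.0]
```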

## Model Architecture:

- The Whisper ASR models, specifically the Small and Large versions, serve as the primary architecture and are used for comparison.
- PEFT and LoRA adaptations are applied to improve training efficiency and adaptation to specific tasks.

## Training and Fine-Tuning:

- The Seq2SeqTrainingArguments and Seq2SeqTrainer classes from the Hugging Face Transformers library are used for model training.
- Fine-tuning strategies are applied to optimize model performance.
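The training configuration can be sketched roughly as below. Every hyperparameter and path here is an illustrative assumption, not the exact value used in the project's notebook; the arguments object is then handed to a `Seq2SeqTrainer` together with the model, datasets, and a data collator.

```python
# Sketch of the Seq2SeqTrainingArguments configuration (all values are
# illustrative assumptions, not the project's actual settings).
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-mr",  # hypothetical checkpoint directory
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,    # effective batch of 16 on one GPU
    learning_rate=1e-3,               # LoRA typically tolerates higher LRs
    num_train_epochs=3,
    fp16=True,                        # mixed precision to fit Colab GPUs
    predict_with_generate=True,       # decode during eval so WER can be computed
    save_total_limit=1,               # keep one checkpoint to save Colab storage
)
# training_args is then passed to Seq2SeqTrainer(model=..., args=training_args,
# train_dataset=..., eval_dataset=..., data_collator=...) and trainer.train().
```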

## Evaluation Metrics:

- Word Error Rate (WER) is employed as the primary metric for evaluating model performance.
- The goal is to minimize WER, ensuring accurate transcription of Marathi speech.
- Before fine-tuning, on the provided test dataset, Whisper large-v3 achieved an average WER of 73.8 and Whisper small an average WER of 93.3.
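WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. In practice a library such as `evaluate` or `jiwer` would compute it, but the metric itself can be sketched in plain Python:

```python
# Minimal sketch of Word Error Rate: word-level Levenshtein distance
# (substitutions + insertions + deletions) divided by the number of
# reference words. Libraries such as `evaluate` or `jiwer` do this in practice.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # 1 substitution over 3 words
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why weak baselines (like the 93.3 average above, reported on a 0–100 scale) can approach or pass 100.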

## Challenges Faced:

Challenges encountered during the project include GPU memory limitations, fine-tuning difficulties, and handling large models:

- Storage Constraints: The limited storage capacity in Google Colab prevented additional fine-tuning steps, as there was insufficient space for model checkpoints and intermediate results.
- Low GPU Resources: The free tier of Google Colab provided inadequate GPU capacity, hindering the fine-tuning of larger and more complex models and limiting training efficiency and overall model performance.
- Model Complexity vs. Steps: Balancing increased model complexity against a lower number of fine-tuning steps was a challenge; the compromise led to a higher Word Error Rate (WER), reflecting the impact of insufficient training steps on the model's language understanding and transcription accuracy.

## Results:

- Due to storage and GPU limitations, fine-tuning remained incomplete, with reduced model performance and trade-offs in model size. These constraints may result in suboptimal transcription accuracy and language understanding.
- Fine-tuning did not work as expected, but I tuned the model as far as the available resources allowed.

## Future Work:

- Future enhancements will involve exploring additional pre-trained models, incorporating more diverse datasets, and experimenting with alternative fine-tuning techniques given adequate GPU and storage resources.

## Credits:

- Datasets sourced from Mozilla Common Voice 11.0.
- Model Tuning: Hugging Face's Whisper-Small (https://huggingface.co/openai/whisper-small)

## Project Execution:

- Compare the Word Error Rate of the large and small models with this notebook: [WER Comparison](https://github.com/praveendecode/Voice_AI/blob/main/Source/Base_Model_Word_Error_Rate.ipynb)
- Fine-tune using: [Fine-Tuning Process](https://github.com/praveendecode/Voice_AI/blob/main/Source/Fine_Tuning_Whisper_OpenAI_Small.ipynb)
- For inference, use: [Voice AI Inference Script](https://github.com/praveendecode/Voice_AI/blob/main/Source/voice_ai.py)

Note: All code explanations are given in the Google Colab notebooks.