This model is a fine-tuned version of facebook/wav2vec2-xls-r-1b on the openslr dataset. It achieves the following results on the evaluation set:
- Loss: 0.4239
- Wer: 0.4221
Evaluation results on OpenSLR "test" (self-split 10%) (Running ./eval.py):
- WER: 0.4490281634272114
- CER: 0.12198285179047481
Evaluation results on OpenSLR "test" with an n-gram LM (self-split 10%) (Running ./eval.py):
- WER: 0.32130107100357
- CER: 0.09345053678218891
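For readers unfamiliar with the two metrics, here is a minimal sketch of how WER and CER are commonly computed: the edit distance between reference and hypothesis, divided by the reference length. This is not the contents of ./eval.py; the function names are hypothetical, words are assumed to be whitespace-separated, and references are assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted word out of three, one substituted character out of eleven.
print(wer("the cat sat", "the cat sit"))
print(cer("the cat sat", "the cat sit"))
```

Note that Python iterates strings by code point, so for Khmer text this CER counts individual code points rather than visually combined clusters.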
Note
- Because the dataset is small (about 4 hours of voice recordings), we deliberately kept training short to avoid overfitting and poor generalization.
- This model performs worse than its 300M variant, possibly because we did not explore the hyperparameters enough.
Installation
Install the following libraries on top of HuggingFace Transformers to enable language model support.
pip install pyctcdecode
pip install https://github.com/kpu/kenlm/archive/master.zip
Usage
Approach 1: Use HuggingFace's pipeline, which covers everything end-to-end, from raw audio input to text output.
from transformers import pipeline
# Load the model
pipe = pipeline(model="vitouphy/wav2vec2-xls-r-1b-khmer")
# Process raw audio
output = pipe("sound_file.wav", chunk_length_s=10, stride_length_s=(4, 2))
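The `chunk_length_s=10` and `stride_length_s=(4, 2)` arguments ask the pipeline to process long audio in overlapping 10-second windows, with 4 seconds of overlap on the left and 2 seconds on the right. The sketch below is a simplified illustration of how such windows can be laid out and which part of each window is kept once the overlaps are dropped. It is not the library's internal code, and `chunk_layout` is a hypothetical helper.

```python
def chunk_layout(duration_s, chunk_length_s=10, stride_length_s=(4, 2)):
    """Return (chunk_start, chunk_end, kept_start, kept_end) tuples in seconds."""
    stride_left, stride_right = stride_length_s
    step = chunk_length_s - stride_left - stride_right
    if step <= 0:
        raise ValueError("chunk_length_s must exceed the sum of both strides")

    layout = []
    start = 0
    while start < duration_s:
        end = start + chunk_length_s
        is_last = end >= duration_s
        end = min(end, duration_s)
        # The first chunk has no left context to drop; the last has no right context.
        kept_start = start if start == 0 else start + stride_left
        kept_end = end if is_last else end - stride_right
        layout.append((start, end, kept_start, kept_end))
        if is_last:
            break
        start += step
    return layout


# Layout for a 25-second recording.
for row in chunk_layout(25):
    print(row)
```

The kept regions tile the recording without gaps, while each window still sees extra context on both sides.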
Approach 2: A more customizable way to run prediction, calling the processor and model directly.
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import librosa
import torch
# load model and processor
processor = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-1b-khmer")
model = Wav2Vec2ForCTC.from_pretrained("vitouphy/wav2vec2-xls-r-1b-khmer")
# Read and process the input
speech_array, sampling_rate = librosa.load("sound_file.wav", sr=16_000)
inputs = processor(speech_array, sampling_rate=16_000, return_tensors="pt", padding=True)
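# Run inference without tracking gradients
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
# Greedy decoding: pick the most likely token at each frame
predicted_ids = torch.argmax(logits, dim=-1)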
predicted_sentences = processor.batch_decode(predicted_ids)
print(predicted_sentences)
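Approach 2 decodes greedily: it takes the most likely token per frame, and `batch_decode` then merges the frame-level IDs into text. A minimal sketch of that CTC collapsing rule (merge consecutive repeats, then drop blanks) is shown below. The toy vocabulary and the blank ID of 0 are assumptions for illustration, not the model's actual vocabulary.

```python
def greedy_ctc_decode(ids, id_to_token, blank_id=0):
    """Collapse consecutive repeated IDs, then drop blank tokens."""
    tokens = []
    prev = None
    for i in ids:
        if i != prev and i != blank_id:
            tokens.append(id_to_token[i])
        prev = i
    return "".join(tokens)


# Hypothetical toy vocabulary; ID 0 plays the role of the CTC blank.
vocab = {0: "<pad>", 1: "c", 2: "a", 3: "t"}

# Repeated frames collapse into one character each.
print(greedy_ctc_decode([1, 1, 0, 2, 2, 2, 0, 0, 3], vocab))
```

A blank between two identical IDs keeps both characters, which is how CTC represents doubled letters.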
Intended uses & limitations
The data used for this model is only around 4 hours of recordings.
- We split the data 80/10/10, so the training set is only about 3.2 hours, which is very small.
- Even so, its performance is reasonable for such a small dataset. Feel free to try it out.
- Its limitations are:
  - Rare characters, e.g. ឬស្សី ឪឡឹក
  - Speech needs to be clear and articulate.
- More data covering more vocabulary and characters may help improve this system.
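The 3.2-hour figure follows from 4 hours × 0.8. As a rough illustration of an 80/10/10 split, the sketch below shuffles a list of items and cuts it by count. The actual split script is not part of this card, so the helper name, the seed, and the choice to split by utterance count rather than by duration are all assumptions.

```python
import random


def split_80_10_10(items, seed=42):
    """Shuffle items reproducibly and split them into train/validation/test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_valid = len(items) // 10
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]  # whatever remains, about 10%
    return train, valid, test


# Hypothetical example: 100 utterance IDs.
train, valid, test = split_80_10_10(range(100))
print(len(train), len(valid), len(test))
```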
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 2000
- num_epochs: 75
- mixed_precision_training: Native AMP
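Two of these values are linked: the total train batch size of 32 is the per-device batch size (8) times the gradient accumulation steps (4), assuming a single device. The sketch below also shows the shape of a linear schedule with 2000 warmup steps. The `total_steps` default of 5475 is an estimate (75 epochs at roughly 73 steps per epoch, inferred from the training results table), not a logged value, and `linear_schedule_lr` is a hypothetical helper.

```python
def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=2000, total_steps=5475):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0.0, (total_steps - step) / (total_steps - warmup_steps))
    return base_lr * remaining


# Effective batch size, assuming one device.
effective_batch_size = 8 * 4

# Learning rate at a few points along the (estimated) run.
for step in (0, 1000, 2000, 4000, 5475):
    print(step, linear_schedule_lr(step))
```

With these settings the learning rate is still warming up for the first 2000 of the estimated 5475 steps, which may help explain why the WER in the results table stays near 1.0 for the early checkpoints.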
Training results
Training Loss | Epoch | Step | Validation Loss | Wer |
---|---|---|---|---|
3.5671 | 5.47 | 400 | 12.0218 | 1.0 |
3.5159 | 10.95 | 800 | 10.6337 | 1.0 |
2.4543 | 16.43 | 1200 | 1.8256 | 0.9839 |
1.9437 | 21.91 | 1600 | 1.1237 | 0.9173 |
1.696 | 27.39 | 2000 | 0.8246 | 0.7700 |
1.5342 | 32.87 | 2400 | 0.6433 | 0.6594 |
1.4509 | 38.35 | 2800 | 0.5500 | 0.5787 |
1.3478 | 43.83 | 3200 | 0.5070 | 0.4907 |
1.3096 | 49.31 | 3600 | 0.4692 | 0.4726 |
1.2532 | 54.79 | 4000 | 0.4448 | 0.4479 |
1.2291 | 60.27 | 4400 | 0.4374 | 0.4366 |
1.196 | 65.75 | 4800 | 0.4314 | 0.4310 |
1.1862 | 71.23 | 5200 | 0.4239 | 0.4221 |
Framework versions
- Transformers 4.17.0.dev0
- Pytorch 1.10.2+cu102
- Datasets 1.18.2.dev0
- Tokenizers 0.11.0
Evaluation results
- Test WER on OpenSLR km (self-reported): 32.130
- Test CER on OpenSLR km (self-reported): 9.350
- Test WER on Robust Speech Event - Dev Data (self-reported): 32.130
- Test CER on Robust Speech Event - Dev Data (self-reported): 9.350