---
language:
- ko
license: apache-2.0
tags:
- generated_from_trainer
metrics:
- wer
pipeline_tag: automatic-speech-recognition
base_model: facebook/wav2vec2-xls-r-300m
model-index:
- name: wav2vec2-xls-r-phone-mfa_korean
results: []
---
# wav2vec2-xls-r-300m_phoneme-mfa_korean
This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on a phonetically balanced native Korean read-speech corpus.
- Maintained by: [excalibur12](https://huggingface.co/excalibur12)
# Training and Evaluation Data
## Training Data
- Data Name: Phonetically Balanced Native Korean Read-speech Corpus
- Num. of Samples: 54,000 (540 speakers)
- Audio Length: 108 hours

## Evaluation Data
- Data Name: Phonetically Balanced Native Korean Read-speech Corpus
- Num. of Samples: 6,000 (60 speakers)
- Audio Length: 12 hours
# Training Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.2
- num_epochs: 20 (early stopping with a patience of 5 epochs)
- mixed_precision_training: Native AMP
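The hyperparameters above fix the training schedule arithmetic. A minimal sketch of the implied step counts, assuming the full 54,000-sample training set is iterated each epoch and training runs all 20 epochs without early stopping:

```python
import math

# Values from the hyperparameter list and Training Data section above
train_batch_size = 8
gradient_accumulation_steps = 2
num_epochs = 20
warmup_ratio = 0.2
num_train_samples = 54_000

# Effective (total) train batch size
total_train_batch_size = train_batch_size * gradient_accumulation_steps  # 16

# Optimizer steps per epoch and over the whole run
steps_per_epoch = math.ceil(num_train_samples / total_train_batch_size)  # 3375
max_steps = steps_per_epoch * num_epochs                                 # 67500

# Linear warmup covers the first 20% of optimizer steps
warmup_steps = int(max_steps * warmup_ratio)                             # 13500

print(total_train_batch_size, steps_per_epoch, max_steps, warmup_steps)
```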
# Evaluation Results
- **Phone Error Rate: 3.88%**
- Monophthong-wise Error Rates: (To be posted)
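Phone Error Rate is computed like word error rate, but over phone sequences: the Levenshtein edit distance between the reference and hypothesis phone strings, normalized by the reference length. A minimal pure-Python sketch (the example phone sequences are illustrative, not drawn from the corpus):

```python
def phone_error_rate(ref, hyp):
    """Levenshtein distance over phone tokens, normalized by reference length."""
    m, n = len(ref), len(hyp)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m

# One substitution in five reference phones -> PER of 20%
ref = ["k", "a", "m", "s", "a"]
hyp = ["k", "a", "m", "s", "o"]
print(phone_error_rate(ref, hyp))  # 0.2
```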
# Output Examples
![output_examples](./output_examples.png)
# MFA-IPA Phoneset Tables
## Vowels
![mfa_ipa_chart_vowels](./mfa_ipa_chart_vowels.png)
## Consonants
![mfa_ipa_chart_consonants](./mfa_ipa_chart_consonants.png)
# Experimental Results
These results accompany the official implementation of the paper presented at [ICPhS 2023](https://www.icphs2023.org).
The table below shows major error patterns of L2 Korean speech from five different L1 backgrounds: Chinese (ZH), Vietnamese (VI), Japanese (JP), Thai (TH), and English (EN).
![Experimental Results](./ICPHS2023_table2.png)
# Framework versions
- Transformers 4.21.3
- PyTorch 1.12.1
- Datasets 2.4.0
- Tokenizers 0.12.1