---
license: apache-2.0
tags:
- generated_from_trainer
model-index:
- name: wav2vec2-xls-r-phone-mfa_korean
  results: []
language:
- ko
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---


# wav2vec2-xls-r-300m_phoneme-mfa_korean

This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on a phonetically balanced native Korean read-speech corpus.

# Training and Evaluation Data

Training Data
- Data Name: Phonetically Balanced Native Korean Read-speech Corpus
- Num. of Samples: 54,000
- Audio Length: 108 Hours

Evaluation Data
- Data Name: Phonetically Balanced Native Korean Read-speech Corpus
- Num. of Samples: 6,000
- Audio Length: 12 Hours

# Training Hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.2
- num_epochs: 20 (maximum; early stopping with a patience of 5 epochs)
- mixed_precision_training: Native AMP
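
The relationship between these values can be checked with a small sketch. It derives the effective batch size and a linear warmup/decay learning-rate schedule from the numbers listed above. The single-device assumption, the rounding of the warmup step count, and the helper name `lr_at` are illustrative assumptions rather than details taken from the training script. The schedule is computed for the full 20 epochs, so it is only an upper bound if early stopping ended training sooner.

```python
import math

# Values copied from the hyperparameter list above.
LEARNING_RATE = 1e-4
TRAIN_BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 2
WARMUP_RATIO = 0.2
NUM_EPOCHS = 20
NUM_TRAIN_SAMPLES = 54_000

# Assumption: one device, which is what 8 * 2 = 16 implies.
NUM_DEVICES = 1

# Samples consumed per optimizer update.
effective_batch = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_DEVICES

# Optimizer updates per epoch and over the full (un-stopped) run.
batches_per_epoch = math.ceil(NUM_TRAIN_SAMPLES / (TRAIN_BATCH_SIZE * NUM_DEVICES))
steps_per_epoch = batches_per_epoch // GRAD_ACCUM_STEPS
total_steps = steps_per_epoch * NUM_EPOCHS

# Assumption: warmup length is the rounded fraction of total steps.
warmup_steps = round(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper).

    Linear ramp from 0 to LEARNING_RATE during warmup, then linear
    decay back to 0 at total_steps.
    """
    if step < warmup_steps:
        return LEARNING_RATE * (step / max(1, warmup_steps))
    remaining = max(0, total_steps - step)
    return LEARNING_RATE * (remaining / max(1, total_steps - warmup_steps))


if __name__ == "__main__":
    print("effective batch size:", effective_batch)
    print("steps per epoch:", steps_per_epoch)
    print("total steps:", total_steps)
    print("warmup steps:", warmup_steps)
    print("peak learning rate:", lr_at(warmup_steps))
```

Under these assumptions the learning rate peaks at the end of warmup and reaches zero only if all 20 epochs are run.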

# Evaluation Result

Phone Error Rate (PER): 3.88%
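
This card does not state how the phone error rate was computed. The conventional definition is the edit distance (substitutions, deletions, and insertions) between the predicted and reference phone sequences, divided by the number of reference phones. The sketch below follows that definition; the function names and the example phone sequences are hypothetical and are not drawn from the evaluation set.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences."""
    # prev[j] holds the distance between the first i-1 reference phones
    # and the first j hypothesis phones.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phone
                curr[j - 1] + 1,     # insertion of a hypothesis phone
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Corpus-level PER: total edit operations / total reference phones."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of utterances")
    total_phones = sum(len(r) for r in refs)
    if total_phones == 0:
        raise ValueError("reference contains no phones")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_phones


if __name__ == "__main__":
    # Made-up sequences for illustration only.
    refs = [["k", "a", "m", "s", "a"], ["a", "n"]]
    hyps = [["k", "a", "m", "a"], ["a", "m"]]
    print(f"PER: {phone_error_rate(refs, hyps):.2%}")
```

Errors are pooled over the whole corpus before dividing, so longer utterances carry more weight than they would if per-utterance rates were averaged.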

# Output Examples
![output_examples](./output_examples.png)

# MFA-IPA Phoneset Tables

## Vowels
![mfa_ipa_chart_vowels](./mfa_ipa_chart_vowels.png)

## Consonants
![mfa_ipa_chart_consonants](./mfa_ipa_chart_consonants.png)

# Experimental Results
Official implementation of the paper (in review)  
Major error patterns of L2 Korean speech from five different L1s: Chinese (ZH), Vietnamese (VI), Japanese (JP), Thai (TH), English (EN)  
![Experimental Results](./ICPHS2023_table2.png)

# Framework versions

- Transformers 4.21.3
- PyTorch 1.12.1
- Datasets 2.4.0
- Tokenizers 0.12.1