---
language:
- ko
license: apache-2.0
tags:
- generated_from_trainer
metrics:
- wer
pipeline_tag: automatic-speech-recognition
base_model: facebook/wav2vec2-xls-r-300m
model-index:
- name: wav2vec2-xls-r-phone-mfa_korean
  results: []
---

# wav2vec2-xls-r-300m_phoneme-mfa_korean

This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on a phonetically balanced native Korean read-speech corpus.
- Maintained by: [excalibur12](https://huggingface.co/excalibur12)
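
A minimal inference sketch with 🤗 Transformers is shown below. The repo id and the 16 kHz mono input are assumptions based on this card, not verified details:

```python
# Hedged sketch: loading this checkpoint for CTC-based phone recognition.
# The repo id below is assumed from this card's name and maintainer.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "excalibur12/wav2vec2-xls-r-phone-mfa_korean"  # assumed repo id
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id).eval()

def transcribe_phones(waveform, sampling_rate=16_000):
    """Return the predicted phone string for a 1-D float waveform."""
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(pred_ids)[0]
```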

# Training and Evaluation Data

## Training Data
- Data Name: Phonetically Balanced Native Korean Read-speech Corpus
- Num. of Samples: 54,000 (540 speakers)
- Audio Length: 108 Hours

## Evaluation Data
- Data Name: Phonetically Balanced Native Korean Read-speech Corpus
- Num. of Samples: 6,000 (60 speakers)
- Audio Length: 12 Hours

# Training Hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.2
- num_epochs: 20 (with early stopping, patience of 5 epochs)
- mixed_precision_training: Native AMP
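
The batch and scheduler settings above can be sanity-checked with a short sketch. Note that `total_steps` below is an illustrative placeholder; the real step count depends on the corpus size and `num_epochs`:

```python
# Effective batch size, as listed above.
TRAIN_BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 2
TOTAL_TRAIN_BATCH_SIZE = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS  # 16

def linear_warmup_lr(step, base_lr=1e-4, total_steps=1000, warmup_ratio=0.2):
    """Linear warmup to base_lr, then linear decay to 0 (lr_scheduler_type=linear)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```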

# Evaluation Results

- **Phone Error Rate: 3.88%**
- Monophthong-wise Error Rates: (To be posted)
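
Phone error rate is an edit-distance metric over predicted and reference phone sequences. The sketch below illustrates how such a score can be computed; the helper names are illustrative, not the evaluation code used for this model:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (one-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1]

def phone_error_rate(refs, hyps):
    """PER = total phone edits / total reference phones, over space-separated phones."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    total = sum(len(r.split()) for r in refs)
    return errors / total
```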

# Output Examples
![output_examples](./output_examples.png)

# MFA-IPA Phoneset Tables

## Vowels
![mfa_ipa_chart_vowels](./mfa_ipa_chart_vowels.png)

## Consonants
![mfa_ipa_chart_consonants](./mfa_ipa_chart_consonants.png)

# Experimental Results
Official implementation of the paper presented at [ICPhS 2023](https://www.icphs2023.org).  
Major error patterns of L2 Korean speech from five different L1s: Chinese (ZH), Vietnamese (VI), Japanese (JP), Thai (TH), and English (EN).  
![Experimental Results](./ICPHS2023_table2.png)

# Framework Versions

- Transformers 4.21.3
- PyTorch 1.12.1
- Datasets 2.4.0
- Tokenizers 0.12.1