Update README.md
Add more instructions
README.md
CHANGED

---
license: mit
language:
- vi
pipeline_tag: token-classification
tags:
- vietnamese
- accents inserter
metrics:
- accuracy
---

# A Transformer model for inserting Vietnamese accent marks

This model is fine-tuned from XLM-Roberta Large.

Example input: Nhin nhung mua thu di

Target output: Nhìn những mùa thu đi

## Model training
This problem was modelled as a token classification problem. For each input token, the goal is to assign a "tag" that will transform it
to the accented token.

For more details on the training process, please refer to this
<a href="https://peterhung.org/tech/insert-vietnamese-accent-transformer-model/" target="_blank">blog post</a>.
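
As a concrete illustration of what a tag encodes (the semantics below are inferred from the tag names in *selected_tags_names.txt*; the blog post above has the full details), a tag of the form `source-target` asks to replace the unaccented letter group `source` in the raw token with the accented group `target`:

```python
# Illustration only (assumed tag semantics): a "<source>-<target>" tag replaces
# the first occurrence of <source> in the raw, unaccented token with <target>.
for token, tag in [("mua", "ua-ùa"), ("nang", "a-ắ"), ("di", "di-đi")]:
    source, target = tag.split("-", 1)
    print(token, "->", token.replace(source, target, 1))
# mua -> mùa
# nang -> nắng
# di -> đi
```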

## How to use this model
There are just a few steps:
- Step 1: Load the model as a token classification model (*AutoModelForTokenClassification*).
- Step 2: Run the input through the model to obtain the tag index for each input token.
- Step 3: Use the tags' indices to retrieve the actual tags in the file *selected_tags_names.txt*. Then, apply the conversion indicated by each tag to its token to obtain the accented tokens.

### Step 1: Load model
Note: Install the *transformers*, *torch*, and *numpy* packages first.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import numpy as np

def load_trained_transformer_model():
    model_path = "peterhung/transformer-vnaccent-marker"
    tokenizer = AutoTokenizer.from_pretrained(model_path, add_prefix_space=True)
    model = AutoModelForTokenClassification.from_pretrained(model_path)
    return model, tokenizer

model, tokenizer = load_trained_transformer_model()
```

### Step 2: Run input text through the model

```python
# only needed if it's run on GPU
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

# set to eval mode
model.eval()

def insert_accents(text, model, tokenizer):
    our_tokens = text.strip().split()

    # the tokenizer may further split our tokens
    inputs = tokenizer(our_tokens,
                       is_split_into_words=True,
                       truncation=True,
                       padding=True,
                       return_tensors="pt"
                       )
    input_ids = inputs['input_ids']
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    tokens = tokens[1:-1]

    with torch.no_grad():
        inputs.to(device)
        outputs = model(**inputs)

    predictions = outputs["logits"].cpu().numpy()
    predictions = np.argmax(predictions, axis=2)

    # exclude output at index 0 and the last index, which correspond to '<s>' and '</s>'
    predictions = predictions[0][1:-1]

    assert len(tokens) == len(predictions)

    return tokens, predictions


text = "Nhin nhung mua thu di, em nghe sau len trong nang."

tokens, predictions = insert_accents(text, model, tokenizer)
```

### Step 3: Obtain the accented words

3.1 Download the tags set file (*selected_tags_names.txt*) from this repo, then load it:
```python
def _load_tags_set(fpath):
    labels = []
    with open(fpath, 'r') as f:
        for line in f:
            line = line.strip()
            if line:
                labels.append(line)

    return labels

# point this path at wherever you saved the downloaded file
label_list = _load_tags_set("/content/training_data/vnaccent/corpus-title.train.selected_tags_names.txt")
assert len(label_list) == 528, f"Expected 528 tags but got {len(label_list)}"
```

3.2 Print out `tokens` and `predictions` obtained above to see what we are working with:
```python
print(tokens)
print(list(f"{pred} ({label_list[pred]})" for pred in predictions))
```
Output:
```python
['▁Nhi', 'n', '▁nhu', 'ng', '▁mua', '▁thu', '▁di', ',', '▁em', '▁nghe', '▁sau', '▁len', '▁trong', '▁nang', '.']
['217 (i-ì)', '217 (i-ì)', '388 (u-ữ)', '388 (u-ữ)', '407 (ua-ùa)', '378 (u-u)', '120 (di-đi)', '0 (-)', '185 (e-e)', '185 (e-e)', '41 (au-âu)', '188 (e-ê)', '302 (o-o)', '14 (a-ắ)', '0 (-)']
```
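
3.3 Finally, apply each predicted tag to its piece and merge the pieces back into words. The snippet below is only a rough sketch (the `apply_tag` helper and the merging rule are assumptions based on the tag format shown above, not code from this repo); pieces that start with the SentencePiece marker `▁` begin a new word:

```python
# Rough sketch (assumed tag semantics): apply each tag to its subword piece,
# then stitch the pieces back into whitespace-separated words at the "▁" marker.
def apply_tag(piece, tag):
    source, target = tag.split("-", 1)
    return piece.replace(source, target, 1) if source else piece

words = []
for token, pred in zip(tokens, predictions):
    piece = apply_tag(token, label_list[pred])
    if piece.startswith("▁") or not words:
        words.append(piece.lstrip("▁"))
    else:
        words[-1] += piece

print(" ".join(words))
# With the predictions shown above, this prints roughly:
# Nhìn những mùa thu đi, em nghe sâu lên trong nắng.
```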

## Limitations
- This model will accept a maximum of 512 tokens, which is a limitation inherited from the base pretrained XLM-Roberta model.
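
If your text may be longer than that, one simple workaround (a sketch under the assumption that accent marks only depend on nearby context, not code from this repo) is to split the input into chunks of whitespace-separated words and run each chunk through `insert_accents` separately:

```python
# Sketch only: work around the 512-token limit by accenting the text in chunks.
# max_words is a conservative bound, since each word may expand into several subword tokens.
def insert_accents_long(text, model, tokenizer, max_words=200):
    words = text.strip().split()
    all_tokens, all_predictions = [], []
    for i in range(0, len(words), max_words):
        chunk = " ".join(words[i:i + max_words])
        chunk_tokens, chunk_preds = insert_accents(chunk, model, tokenizer)
        all_tokens.extend(chunk_tokens)
        all_predictions.extend(chunk_preds)
    return all_tokens, all_predictions
```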