Update README.md
Add more instructions
README.md
CHANGED

---
license: mit
language:
- vi
pipeline_tag: token-classification
tags:
- vietnamese
- accents inserter
metrics:
- accuracy
---

# A Transformer model for inserting Vietnamese accent marks

This model is fine-tuned from XLM-Roberta Large.

Example input: Nhin nhung mua thu di

Target output: Nhìn những mùa thu đi

## Model training
This problem was modelled as a token classification problem. For each input token, the goal is to assign a "tag" that will transform it
to the accented token.

For more details on the training process, please refer to this
<a href="https://peterhung.org/tech/insert-vietnamese-accent-transformer-model/" target="_blank">blog post</a>.
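
As a concrete illustration of what a tag encodes (the semantics below are inferred from the tag names in *selected_tags_names.txt*; the blog post above has the full details), a tag of the form `source-target` asks to replace the unaccented letter group `source` in the raw token with the accented group `target`:

```python
# Illustration only (assumed tag semantics): a "<source>-<target>" tag replaces
# the first occurrence of <source> in the raw, unaccented token with <target>.
for token, tag in [("mua", "ua-ùa"), ("nang", "a-ắ"), ("di", "di-đi")]:
    source, target = tag.split("-", 1)
    print(token, "->", token.replace(source, target, 1))
# mua -> mùa
# nang -> nắng
# di -> đi
```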

## How to use this model
There are just a few steps:
- Step 1: Load the model as a token classification model (*AutoModelForTokenClassification*).
- Step 2: Run the input through the model to obtain the tag index for each input token.
- Step 3: Use the tags' indices to retrieve the actual tags in the file *selected_tags_names.txt*. Then, apply the conversion indicated by each tag to its token to obtain the accented tokens.

### Step 1: Load model
Note: Install the *transformers*, *torch*, and *numpy* packages first.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import numpy as np

def load_trained_transformer_model():
    model_path = "peterhung/transformer-vnaccent-marker"
    tokenizer = AutoTokenizer.from_pretrained(model_path, add_prefix_space=True)
    model = AutoModelForTokenClassification.from_pretrained(model_path)
    return model, tokenizer

model, tokenizer = load_trained_transformer_model()
```

### Step 2: Run input text through the model

```python
# only needed if it's run on GPU
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

# set to eval mode
model.eval()

def insert_accents(text, model, tokenizer):
    our_tokens = text.strip().split()

    # the tokenizer may further split our tokens
    inputs = tokenizer(our_tokens,
                       is_split_into_words=True,
                       truncation=True,
                       padding=True,
                       return_tensors="pt"
                       )
    input_ids = inputs['input_ids']
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    tokens = tokens[1:-1]

    with torch.no_grad():
        inputs.to(device)
        outputs = model(**inputs)

    predictions = outputs["logits"].cpu().numpy()
    predictions = np.argmax(predictions, axis=2)

    # exclude output at index 0 and the last index, which correspond to '<s>' and '</s>'
    predictions = predictions[0][1:-1]

    assert len(tokens) == len(predictions)

    return tokens, predictions


text = "Nhin nhung mua thu di, em nghe sau len trong nang."

tokens, predictions = insert_accents(text, model, tokenizer)
```

### Step 3: Obtain the accented words

3.1 Download the tags set file (*selected_tags_names.txt*) from this repo, then load it:
```python
def _load_tags_set(fpath):
    labels = []
    with open(fpath, 'r') as f:
        for line in f:
            line = line.strip()
            if line:
                labels.append(line)

    return labels

# point this path at wherever you saved the downloaded file
label_list = _load_tags_set("/content/training_data/vnaccent/corpus-title.train.selected_tags_names.txt")
assert len(label_list) == 528, f"Expected 528 tags but got {len(label_list)}"
```

3.2 Print out `tokens` and `predictions` obtained above to see what we are working with:
```python
print(tokens)
print(list(f"{pred} ({label_list[pred]})" for pred in predictions))
```
Output:
```python
['▁Nhi', 'n', '▁nhu', 'ng', '▁mua', '▁thu', '▁di', ',', '▁em', '▁nghe', '▁sau', '▁len', '▁trong', '▁nang', '.']
['217 (i-ì)', '217 (i-ì)', '388 (u-ữ)', '388 (u-ữ)', '407 (ua-ùa)', '378 (u-u)', '120 (di-đi)', '0 (-)', '185 (e-e)', '185 (e-e)', '41 (au-âu)', '188 (e-ê)', '302 (o-o)', '14 (a-ắ)', '0 (-)']
```
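
3.3 Finally, apply each predicted tag to its piece and merge the pieces back into words. The snippet below is only a rough sketch (the `apply_tag` helper and the merging rule are assumptions based on the tag format shown above, not code from this repo); pieces that start with the SentencePiece marker `▁` begin a new word:

```python
# Rough sketch (assumed tag semantics): apply each tag to its subword piece,
# then stitch the pieces back into whitespace-separated words at the "▁" marker.
def apply_tag(piece, tag):
    source, target = tag.split("-", 1)
    return piece.replace(source, target, 1) if source else piece

words = []
for token, pred in zip(tokens, predictions):
    piece = apply_tag(token, label_list[pred])
    if piece.startswith("▁") or not words:
        words.append(piece.lstrip("▁"))
    else:
        words[-1] += piece

print(" ".join(words))
# With the predictions shown above, this prints roughly:
# Nhìn những mùa thu đi, em nghe sâu lên trong nắng.
```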

## Limitations
- This model will accept a maximum of 512 tokens, which is a limitation inherited from the base pretrained XLM-Roberta model.
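
If your text may be longer than that, one simple workaround (a sketch under the assumption that accent marks only depend on nearby context, not code from this repo) is to split the input into chunks of whitespace-separated words and run each chunk through `insert_accents` separately:

```python
# Sketch only: work around the 512-token limit by accenting the text in chunks.
# max_words is a conservative bound, since each word may expand into several subword tokens.
def insert_accents_long(text, model, tokenizer, max_words=200):
    words = text.strip().split()
    all_tokens, all_predictions = [], []
    for i in range(0, len(words), max_words):
        chunk = " ".join(words[i:i + max_words])
        chunk_tokens, chunk_preds = insert_accents(chunk, model, tokenizer)
        all_tokens.extend(chunk_tokens)
        all_predictions.extend(chunk_preds)
    return all_tokens, all_predictions
```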