|
--- |
|
license: mit |
|
language: |
|
- vi |
|
pipeline_tag: token-classification |
|
tags: |
|
- vietnamese |
|
- accents inserter |
|
metrics: |
|
- accuracy |
|
--- |
|
|
|
# A Transformer model for inserting Vietnamese accent marks |
|
|
|
This model is fine-tuned from XLM-RoBERTa Large.
|
|
|
Example input: Nhin nhung mua thu di |
|
Target output: Nhìn những mùa thu đi |
|
|
|
## Model training |
|
This problem was modelled as a token classification problem: for each input token, the goal is to assign a "tag" that transforms it into the accented token.
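For illustration, each tag has the form `src-dst`, meaning (roughly) "rewrite the substring `src` of the raw token as `dst`", while the tag `-` leaves a token unchanged. The helper below is a hypothetical sketch of that idea, not the model's actual post-processing code:

```python
# Hypothetical illustration of the tag scheme: a tag "src-dst" rewrites the
# substring "src" of an unaccented token into "dst"; the tag "-" means "no change".
def apply_tag(token, tag):
    src, dst = tag.split("-", 1)
    return token.replace(src, dst, 1) if src else token

print(apply_tag("mua", "ua-ùa"))    # -> "mùa"
print(apply_tag("nhung", "u-ữ"))    # -> "những"
```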
|
|
|
For more details on the training process, please refer to this
<a href="https://peterhung.org/tech/insert-vietnamese-accent-transformer-model/" target="_blank">blog post</a>.
|
|
|
|
|
## How to use this model |
|
There are just a few steps: |
|
- Step 1: Load the model as a token classification model (*AutoModelForTokenClassification*). |
|
- Step 2: Run the input through the model to obtain the tag index for each input token. |
|
- Step 3: Use the tag indices to retrieve the actual tags from the file *selected_tags_names.txt*, then apply the conversion indicated by each tag to its token to obtain the accented tokens.
|
|
|
### Step 1: Load model |
|
Note: install the *transformers*, *torch*, and *numpy* packages first.
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForTokenClassification |
|
import torch |
|
import numpy as np |
|
|
|
def load_trained_transformer_model(): |
|
model_path = "peterhung/transformer-vnaccent-marker" |
|
tokenizer = AutoTokenizer.from_pretrained(model_path, add_prefix_space=True) |
|
model = AutoModelForTokenClassification.from_pretrained(model_path) |
|
return model, tokenizer |
|
|
|
model, tokenizer = load_trained_transformer_model() |
|
``` |
|
|
|
### Step 2: Run input text through the model |
|
|
|
```python |
|
# move the model to the GPU if one is available
|
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") |
|
model.to(device) |
|
|
|
# set to eval mode |
|
model.eval() |
|
|
|
def insert_accents(text, model, tokenizer): |
|
our_tokens = text.strip().split() |
|
|
|
# the tokenizer may further split our tokens |
|
inputs = tokenizer(our_tokens, |
|
is_split_into_words=True, |
|
truncation=True, |
|
padding=True, |
|
return_tensors="pt" |
|
) |
|
input_ids = inputs['input_ids'] |
|
tokens = tokenizer.convert_ids_to_tokens(input_ids[0]) |
|
tokens = tokens[1:-1] |
|
|
|
with torch.no_grad(): |
|
inputs.to(device) |
|
outputs = model(**inputs) |
|
|
|
predictions = outputs["logits"].cpu().numpy() |
|
predictions = np.argmax(predictions, axis=2) |
|
|
|
# exclude output at index 0 and the last index, which correspond to '<s>' and '</s>' |
|
predictions = predictions[0][1:-1] |
|
|
|
assert len(tokens) == len(predictions) |
|
|
|
return tokens, predictions |
|
|
|
|
|
text = "Nhin nhung mua thu di, em nghe sau len trong nang." |
|
|
|
tokens, predictions = insert_accents(text, model, tokenizer) |
|
``` |
|
|
|
### Step 3: Obtain the accented words
|
|
|
3.1 Download the tags file *selected_tags_names.txt* from this repo, then load it:
|
```python |
|
def _load_tags_set(fpath): |
|
labels = [] |
|
with open(fpath, 'r') as f: |
|
for line in f: |
|
line = line.strip() |
|
if line: |
|
labels.append(line) |
|
|
|
return labels |
|
|
|
label_list = _load_tags_set("selected_tags_names.txt")

assert len(label_list) == 528, f"Expected 528 tags, got {len(label_list)}"
|
``` |
|
|
|
3.2 Print out the `tokens` and `predictions` obtained above to see what they look like:
|
```python |
|
print(tokens) |
|
print(list(f"{pred} ({label_list[pred]})" for pred in predictions)) |
|
``` |
|
Output:
|
```python |
|
['▁Nhi', 'n', '▁nhu', 'ng', '▁mua', '▁thu', '▁di', ',', '▁em', '▁nghe', '▁sau', '▁len', '▁trong', '▁nang', '.'] |
|
['217 (i-ì)', '217 (i-ì)', '388 (u-ữ)', '388 (u-ữ)', '407 (ua-ùa)', '378 (u-u)', '120 (di-đi)', '0 (-)', '185 (e-e)', '185 (e-e)', '41 (au-âu)', '188 (e-ê)', '302 (o-o)', '14 (a-ắ)', '0 (-)'] |
|
``` |
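3.3 Apply the tags to the tokens. The steps above stop at the raw tags; the helper below is a minimal sketch of the final conversion (assuming the `src-dst` tag format shown above and taking the tag predicted for the first piece of each word). The author's exact post-processing is described in the blog post linked earlier.

```python
# Minimal sketch: merge the SentencePiece pieces back into words using the "▁"
# word-start marker, keep the tag predicted for the first piece of each word,
# then apply the "src-dst" rewrite to restore the accents.
def merge_tokens_and_apply_tags(tokens, predictions, label_list):
    words = []
    for token, pred in zip(tokens, predictions):
        tag = label_list[pred]
        if token.startswith("▁"):       # start of a new word
            words.append([token[1:], tag])
        elif words:                     # continuation piece: append its text only
            words[-1][0] += token
        else:                           # defensive: first piece without the marker
            words.append([token, tag])

    accented_words = []
    for word, tag in words:
        src, dst = tag.split("-", 1)
        if src:                         # the tag "-" leaves the word unchanged
            word = word.replace(src, dst, 1)
        accented_words.append(word)
    return " ".join(accented_words)

print(merge_tokens_and_apply_tags(tokens, predictions, label_list))
# -> "Nhìn những mùa thu đi, em nghe sâu lên trong nắng."
```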
|
|
|
|
|
## Limitations |
|
- This model accepts a maximum of 512 subword tokens per input, a limit inherited from the underlying pretrained XLM-RoBERTa model; longer texts need to be split into smaller chunks first (one possible approach is sketched below).
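
A minimal sketch of such chunking, built around the `insert_accents` function defined above (the chunk size is a hypothetical choice, kept well below 512 because each word can expand into several subword tokens):

```python
# Hypothetical pre-processing for long inputs: split the text into chunks of at
# most `max_words` words and run each chunk through insert_accents() separately.
def insert_accents_long(text, model, tokenizer, max_words=200):
    words = text.strip().split()
    all_tokens, all_predictions = [], []
    for i in range(0, len(words), max_words):
        chunk = " ".join(words[i:i + max_words])
        tokens, predictions = insert_accents(chunk, model, tokenizer)
        all_tokens.extend(tokens)
        all_predictions.extend(predictions)
    return all_tokens, all_predictions
```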