---
license: gpl-3.0
datasets:
- Mxode/BiST
language:
- en
- zh
pipeline_tag: translation
library_name: transformers
---
# NanoTranslator-XS
English | 简体中文
## Introduction

This is the x-small model of NanoTranslator, which currently supports English-to-Chinese translation only.
The ONNX version of the model is also available in the repository.
All models are collected in the NanoTranslator Collection.
| Model | P. | Arch. | Act. | V. | H. | I. | L. | A.H. | K.H. | Tie |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| XXL2 | 102 | LLaMA | SwiGLU | 16K | 1120 | 3072 | 6 | 16 | 8 | True |
| XXL | 100 | LLaMA | SwiGLU | 16K | 768 | 4096 | 8 | 24 | 8 | True |
| XL | 78 | LLaMA | GeGLU | 16K | 768 | 4096 | 6 | 24 | 8 | True |
| L | 49 | LLaMA | GeGLU | 16K | 512 | 2816 | 8 | 16 | 8 | True |
| M2 | 22 | Qwen2 | GeGLU | 4K | 432 | 2304 | 6 | 24 | 8 | True |
| M | 22 | LLaMA | SwiGLU | 8K | 256 | 1408 | 16 | 16 | 4 | True |
| S | 9 | LLaMA | SwiGLU | 4K | 168 | 896 | 16 | 12 | 4 | True |
| XS | 2 | LLaMA | SwiGLU | 2K | 96 | 512 | 12 | 12 | 4 | True |
- P. - Parameters (in millions)
- V. - vocab size
- H. - hidden size
- I. - intermediate size
- L. - num layers
- A.H. - num attention heads
- K.H. - num kv heads
- Tie - tie word embeddings
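
For illustration, the XS row above roughly maps onto a transformers `LlamaConfig` as sketched below. This is only an approximation reconstructed from the table (for example, it assumes "2K" means a vocabulary of 2048 and omits fields such as `max_position_embeddings` that the table does not list); the authoritative values are in the repository's `config.json`.

```python
from transformers import LlamaConfig

# Approximate XS configuration reconstructed from the table above.
# "2K" vocab is assumed to mean 2048; fields not shown in the table are
# omitted and should be read from the model's config.json instead.
xs_config = LlamaConfig(
    vocab_size=2048,
    hidden_size=96,
    intermediate_size=512,
    num_hidden_layers=12,
    num_attention_heads=12,
    num_key_value_heads=4,
    hidden_act="silu",          # SwiGLU activation in the LLaMA MLP
    tie_word_embeddings=True,
)
```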
## How to use
The prompt format is as follows:

```
<|im_start|> {English Text} <|endoftext|>
```
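
For example, building the prompt the same way the code below does, the sample sentence becomes:

```
<|im_start|>I love to watch my favorite TV series.<|endoftext|>
```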
### Directly using transformers
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = 'Mxode/NanoTranslator-XS'

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

def translate(text: str, model, **kwargs):
    # default generation settings, overridable via kwargs
    generation_args = dict(
        max_new_tokens = kwargs.pop("max_new_tokens", 512),
        do_sample = kwargs.pop("do_sample", True),
        temperature = kwargs.pop("temperature", 0.55),
        top_p = kwargs.pop("top_p", 0.8),
        top_k = kwargs.pop("top_k", 40),
        **kwargs
    )

    # wrap the English text in the expected prompt format
    prompt = "<|im_start|>" + text + "<|endoftext|>"
    model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

    generated_ids = model.generate(model_inputs.input_ids, **generation_args)
    # strip the prompt tokens, keeping only the newly generated translation
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

text = "I love to watch my favorite TV series."

response = translate(text, model, max_new_tokens=64, do_sample=False)
print(response)
```
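
The helper can be reused for multiple inputs as well. A minimal sketch reusing the `translate` function and `model` from above (the sentences are arbitrary examples):

```python
sentences = [
    "I love to watch my favorite TV series.",
    "The weather is nice today.",
]

for s in sentences:
    print(translate(s, model, max_new_tokens=64, do_sample=False))
```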
## ONNX

It has been measured that inference with the ONNX model is 2-10 times faster than inference with the transformers model directly.

You need to switch to the `onnx` branch manually and download the files to a local folder.

Reference docs: the Hugging Face Optimum documentation on ONNX Runtime inference.
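
One way to fetch the ONNX files locally is with `huggingface_hub`, as sketched below (the local directory name is just an example):

```python
from huggingface_hub import snapshot_download

# download the files from the `onnx` branch of the repository
onnx_path = snapshot_download(
    repo_id="Mxode/NanoTranslator-XS",
    revision="onnx",
    local_dir="nanotranslator-xs-onnx",   # example path
)
```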
### Using ORTModelForCausalLM
```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_path = "your/folder/to/onnx_model"

ort_model = ORTModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

text = "I love to watch my favorite TV series."

# reuses the translate() helper defined in the transformers example above
response = translate(text, ort_model, max_new_tokens=64, do_sample=False)
print(response)
```
### Using pipeline
```python
from optimum.pipelines import pipeline

model_path = "your/folder/to/onnx_model"
pipe = pipeline("text-generation", model=model_path, accelerator="ort")

text = "I love to watch my favorite TV series."

response = pipe(text, max_new_tokens=64, do_sample=False)
print(response)
```
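
Note that this pipeline call passes the raw English text. Depending on how the exported model and tokenizer handle special tokens, you may need to apply the same prompt format as in the sections above, for example:

```python
prompt = "<|im_start|>" + text + "<|endoftext|>"
response = pipe(prompt, max_new_tokens=64, do_sample=False)
print(response)
```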