Unable to translate a long sentence into Chinese completely

#37
by 9999tt - opened

Hi, I tried using NLLB to translate some English sentences into Chinese. All of my sentences are shorter than 60 tokens. However, most sentences longer than 30 tokens are not translated completely: only half of the sentence, or less, is generated.

I also tried the same code for English to French, and it works: all sentences are generated completely.

I also set min_length, but then sometimes, when the sentence is short, the last part of the sentence will be completely generated.
My code is here, please help:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    r"nllb-200-distilled-600M", token=True, src_lang="eng_Latn"
)
model = AutoModelForSeq2SeqLM.from_pretrained(r"nllb-200-distilled-600M", token=True)

input_path = r"eng_test_short.txt"
output_path = "./nllb_chn.txt"

input_file = open(input_path, 'r', encoding='utf-8')

with open(output_path, 'w', encoding='utf-8') as f:
    # Translate the input file one line at a time.
    for article in input_file:
        inputs = tokenizer(article, return_tensors="pt")

        print(article)
        print(inputs)

        translated_tokens = model.generate(
            # English -> French variant (output is complete):
            # **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"), max_length=200
            # English -> Chinese variant (output is cut short):
            **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("zho_Hans"), max_length=512
        )
        print(tokenizer.convert_tokens_to_ids("zho_Hans"))

        output = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True, model_max_length=512)[0]

        print(output)
        f.writelines(output + '\n')

The output looks like this:
input:
Politicians are loath to raise the tax even one penny when gas prices are high.
output:
政客们不愿意在高昂的燃油价格时,
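The truncated output above is far shorter than max_length=512, so the length limit is unlikely to be what stopped it. One way to check is to look at the raw generated ids before decoding. The helper below is a minimal sketch, not part of the original post: `stop_reason` is a hypothetical name, and the ids in the demo are made up.

```python
def stop_reason(token_ids, eos_token_id, pad_token_id, max_length):
    """Classify why one generated sequence ended (hypothetical helper)."""
    ids = list(token_ids)
    # Drop trailing padding that batched generation may append.
    while ids and ids[-1] == pad_token_id:
        ids.pop()
    # Look only at the final token: the sequence may also *start* with the
    # EOS id, because some seq2seq models use it as the decoder start token.
    if len(ids) > 1 and ids[-1] == eos_token_id:
        return "eos"
    if len(ids) >= max_length:
        return "max_length"
    return "unknown"


# Made-up ids for illustration: start token, language code, two tokens, EOS.
print(stop_reason([2, 256200, 10, 11, 2], eos_token_id=2, pad_token_id=1, max_length=512))
```

With the script above this could be called as `stop_reason(translated_tokens[0].tolist(), tokenizer.eos_token_id, tokenizer.pad_token_id, 512)`. A result of "eos" would mean the model itself chose to end the sentence early, so raising max_length would not help.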

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    r"facebook/nllb-200-distilled-600M", token=True, src_lang="eng_Latn"
)
model = AutoModelForSeq2SeqLM.from_pretrained(r"facebook/nllb-200-distilled-600M", token=True)

input_path = r"a.txt"
output_path = "/content/aa.txt"

input_file = open(input_path, 'r', encoding='utf-8')

with open(output_path, 'w', encoding='utf-8') as f:
    for article in input_file:
        inputs = tokenizer(article, return_tensors="pt")

        print(article)
        print(inputs)

        # Access the 'input_ids' tensor from the 'inputs' dictionary
        translated_tokens = model.generate(
            inputs['input_ids'],  # Pass the tensor to model.generate()
            forced_bos_token_id=tokenizer.convert_tokens_to_ids("zho_Hans"),
            max_length=512
        )
        print(tokenizer.convert_tokens_to_ids("zho_Hans"))

        output = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True, model_max_length=512)[0]

        print(output)
        f.writelines(output + '\n')

a.txt
Stable Diffusion 3.5 Large is a Multimodal Diffusion Transformer (MMDiT) text-to-image model that features improved performance in image quality, typography, complex prompt understanding, and resource-efficiency.

Please note: This model is released under the Stability Community License. Visit Stability AI to learn or contact us for commercial licensing details.

Model Description
Developed by: Stability AI
Model type: MMDiT text-to-image generative model
Model Description: This model generates images based on text prompts. It is a Multimodal Diffusion Transformer that use three fixed, pretrained text encoders, and with QK-normalization to improve training stability.
License
Community License: Free for research, non-commercial, and commercial use for organizations or individuals with less than $1M in total annual revenue. More details can be found in the Community License Agreement. Read more at https://stability.ai/license.
For individuals and organizations with annual revenue above $1M: please contact us to get an Enterprise License.

aa.txt
稳定扩散3.5大是一个多模式扩散变压器 (MMDiT) 文本到图像模型,其性能在图像质量,类型,复杂的快速理解和资源效率方面得到改善.
现在
请注意:本模型是根据"稳定社区许可证"发布的. 访问"稳定人工智能"了解商业许可证详情或联系我们.
现在
模型描述
开发者:稳定AI
模型类型:MMDiT文本到图像生成模型
模型描述:该模型基于文字提示生成图像.它是一种多模式扩散变压器,使用三个固定,预训练的文本编码器,并具有QK正常化来提高训练稳定性.
许可证
社区许可证:为每年总收入不到100万美元的组织或个人免费进行研究,非商业和商业使用. 详细信息可在社区许可协议中查看. 阅读更多在https://stability.ai/license.
对于超过100万美元的个人和组织:请联系我们获取企业许可证.
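Note that the standalone 现在 lines in aa.txt line up with the blank lines in a.txt: the loop sends every line to the model, including empty ones, and the model still produces some text for them. A minimal sketch of one way to avoid this, assuming blank lines should simply stay blank in the output (`translate_lines` is a hypothetical helper, and `str.upper` is only a stand-in for the real tokenizer + model.generate call):

```python
def translate_lines(lines, translate):
    """Translate non-blank lines; pass blank lines through as empty strings."""
    results = []
    for line in lines:
        text = line.strip()  # also removes the trailing newline
        results.append(translate(text) if text else "")
    return results


# Stand-in translator for the demo; the real one would wrap the model call.
demo = translate_lines(["Model Description\n", "\n", "License\n"], str.upper)
print(demo)  # ['MODEL DESCRIPTION', '', 'LICENSE']
```

Stripping each line also keeps the trailing newline out of the tokenizer input.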

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    r"facebook/nllb-200-distilled-600M", token=True, src_lang="eng_Latn"
)
model = AutoModelForSeq2SeqLM.from_pretrained(r"facebook/nllb-200-distilled-600M", token=True)

input_path = r"a.txt"
output_path = "/content/aaa.txt"

input_file = open(input_path, 'r', encoding='utf-8')

with open(output_path, 'w', encoding='utf-8') as f:
    for article in input_file:
        inputs = tokenizer(article, return_tensors="pt")

        print(article)
        print(inputs)

        # Access the 'input_ids' tensor from the 'inputs' dictionary
        translated_tokens = model.generate(
            inputs['input_ids'],  # Pass the tensor to model.generate()
            forced_bos_token_id=tokenizer.convert_tokens_to_ids("arb_Arab"),
            max_length=512
        )
        print(tokenizer.convert_tokens_to_ids("arb_Arab"))

        output = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True, model_max_length=512)[0]

        print(output)
        f.writelines(output + '\n')

إن Stable Diffusion 3.5 Large هو نموذج متعدد الحركات المتعددة للتنشر من نص إلى صورة يحتوي على أداء أفضل في جودة الصورة وتصنيفها وفهم سريع معقد وكفاءة الموارد.

  • لا.
    يرجى ملاحظة: هذا النموذج يتم إصداره تحت رخصة Stability Community. زيارة Stability AI للتعلم أو الاتصال بنا لمعلومات التراخيص التجارية.
  • لا.
    وصف النموذج
    طوّرتها: استقرار الذكاء الاصطناعي
    نوع النموذج: النموذج المولّد من النص إلى الصورة من MMDiT
    وصف النموذج: هذا النموذج يولد الصور على أساس طلبات النص. إنه محول انتشار متعدد الحركات يستخدم ثلاثة مُرمّحات نص ثابتة، تم تدريبها مسبقاً، ومع تطبيع QK لتحسين استقرار التدريب.
    رخصة
    ترخيص المجتمع: مجاني للبحث وغير التجاري والاستخدام التجاري للمنظمات أو الأفراد الذين لديهم أقل من 1 مليون دولار من الإيرادات السنوية الإجمالية. يمكن العثور على مزيد من التفاصيل في اتفاقية ترخيص المجتمع. اقرأ المزيد على https://stability.ai/license.
    بالنسبة للأفراد والمنظمات التي تتجاوز إيراداتها السنوية مليون دولار: يرجى الاتصال بنا للحصول على رخصة الشركة.
