60 languages?
#8
by
conan1024hao
- opened
In the model description, you said flan-t5 was trained on 60 languages (including Japanese, etc.). However, the vocab_size
is only 32138, so how could it handle 60 languages?
I think this is impossible with a SentencePiece vocabulary that small.
Same issue, the tokenizer doesn't understand Arabic.
Same issue, the tokenizer doesn't understand Chinese.
Neither Vietnamese!
Hello everyone, thanks for the issue and sorry for the confusion.
I think Google
has open-sourced only the English versions at the moment. We posted a ticket on their repository to track the issue: https://github.com/google-research/t5x/issues/1131
Same problem with Korean; the tokenizer can't recognize Korean tokens.
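The behavior everyone is reporting can be reproduced without downloading the checkpoint: a SentencePiece-style tokenizer whose vocabulary was built almost entirely from English text has no pieces covering CJK, Arabic, or Korean characters, so those inputs collapse to `<unk>`. The sketch below uses a toy English-only vocabulary and a greedy longest-match tokenizer as a stand-in (it is not flan-t5's actual vocabulary or algorithm) to illustrate the failure mode:

```python
# Toy simulation of an English-heavy subword vocabulary: any character
# with no matching piece falls back to <unk>, which is what the posters
# above observe for Japanese/Chinese/Arabic/Korean text.

def tokenize(text, vocab, unk="<unk>"):
    """Greedy longest-match tokenization over a fixed vocab set."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece starting at position i first.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No piece in the vocab covers this character.
            tokens.append(unk)
            i += 1
    return tokens

# Toy vocab: ASCII letters plus a couple of whole words, no CJK coverage.
vocab = {chr(c) for c in range(ord("a"), ord("z") + 1)} | {"hello", " "}

english = tokenize("hello world", vocab)   # real pieces, no <unk>
japanese = tokenize("こんにちは", vocab)    # every character becomes <unk>
print(english)
print(japanese)
```

With flan-t5's ~32k vocabulary the situation is the same in kind: the pieces that exist are overwhelmingly English subwords, so non-Latin scripts have nothing to match against.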