data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 69 column 3
#17 opened by sigridjineth
It seems there's no vocab.json in this repository.
When running my data module with PyTorch Lightning:
from transformers import AutoTokenizer
self.tokenizer = AutoTokenizer.from_pretrained(model, local_files_only=True)  # model: /root/jina-reranker-v2-base-multilingual
I am getting this error:
File "/root/venv/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 112, in __init__
fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 69 column 3
I just cloned the latest main branch of the HF repo to my local machine.
(venv) root@99074ab04cc2:~/FlagEmbedding/experiments/240710/jina# ls /root/jina-reranker-v2-base-multilingual
README.md embedding.py pytorch_model.bin xlm_padding.py
block.py mha.py special_tokens_map.json
config.json mlp.py tokenizer.json
configuration_xlm_roberta.py modeling_xlm_roberta.py tokenizer_config.json
I get the same error "data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 69 column 3"
Update tokenizers and transformers to the latest versions.
Indeed, I suspect that this should help:
pip install -U transformers tokenizers
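For context, the exception is raised by the Rust-based tokenizers library while deserializing the pre_tokenizer section of tokenizer.json: newer model repos can ship pre-tokenizer configurations (types or fields) that older tokenizers releases don't recognize, which produces exactly the "data did not match any variant of untagged enum" message, so upgrading usually resolves it. A minimal stdlib-only sketch (the JSON excerpt below is hypothetical, not taken from this repo) showing the kind of entry the library has to deserialize:

```python
import json

# Hypothetical excerpt of a tokenizer.json pre_tokenizer entry.
# Older `tokenizers` versions fail with "data did not match any variant
# of untagged enum ..." when they encounter a type or field (e.g. a
# newer Metaspace option) they don't know how to deserialize.
tokenizer_json = """
{
  "version": "1.0",
  "pre_tokenizer": {
    "type": "Metaspace",
    "replacement": "\\u2581",
    "prepend_scheme": "first"
  }
}
"""

config = json.loads(tokenizer_json)
# This is the enum variant the installed tokenizers version must support.
print(config["pre_tokenizer"]["type"])
```

If upgrading is not an option, re-exporting the tokenizer with the older library version installed on the loading side is the usual workaround.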
- Tom Aarsen