[please help!] I can't load the tokenizer

#218
by alvinshao - opened

When I use the following code to load the tokenizer, an error occurs. Does anyone know how to fix it?
Code:

model = "meta-llama/Meta-Llama-3-70B"
seqlen = 2048
hf_token = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

model = transformers.LlamaForCausalLM.from_pretrained(model, torch_dtype='auto',
                                                      token=hf_token,
                                                      low_cpu_mem_usage=True)
tokenizer = transformers.AutoTokenizer.from_pretrained(model, use_fast=False,
                                                       token=hf_token)

Error:

Loading checkpoint shards: 100%|███████| 30/30 [00:03<00:00, 9.20it/s]
Traceback (most recent call last):
  File "/home/shaoyuantian/anaconda3/envs/rllm/lib/python3.10/site-packages/transformers/utils/hub.py", line 402, in cached_file
    resolved_file = hf_hub_download(
  File "/home/shaoyuantian/anaconda3/envs/rllm/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
  File "/home/shaoyuantian/anaconda3/envs/rllm/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/home/shaoyuantian/anaconda3/envs/rllm/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: 'LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 8192)
    (layers): ModuleList(
      (0-79): 80 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=8192, out_features=8192, bias=False)
          (k_proj): Linear(in_features=8192, out_features=1024, bias=False)
          (v_proj): Linear(in_features=8192, out_features=1024, bias=False)
          (o_proj): Linear(in_features=8192, out_features=8192, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=8192, out_features=28672, bias=False)
          (up_proj): Linear(in_features=8192, out_features=28672, bias=False)
          (down_proj): Linear(in_features=28672, out_features=8192, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((8192,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((8192,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((8192,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=8192, out_features=128256, bias=False)
)'.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/shaoyuantian/program/RLLM/idea_test/test.py", line 20, in <module>
    tokenizer = transformers.AutoTokenizer.from_pretrained(model, use_fast=False,
  File "/home/shaoyuantian/anaconda3/envs/rllm/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 834, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/home/shaoyuantian/anaconda3/envs/rllm/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 666, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/home/shaoyuantian/anaconda3/envs/rllm/lib/python3.10/site-packages/transformers/utils/hub.py", line 466, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: 'LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 8192)
    (layers): ModuleList(
      (0-79): 80 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=8192, out_features=8192, bias=False)
          (k_proj): Linear(in_features=8192, out_features=1024, bias=False)
          (v_proj): Linear(in_features=8192, out_features=1024, bias=False)
          (o_proj): Linear(in_features=8192, out_features=8192, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=8192, out_features=28672, bias=False)
          (up_proj): Linear(in_features=8192, out_features=28672, bias=False)
          (down_proj): Linear(in_features=28672, out_features=8192, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((8192,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((8192,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((8192,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=8192, out_features=128256, bias=False)
)'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

Environment:
tokenizers 0.19.1
transformers 4.44.2
huggingface-hub 0.25.0

I also noticed that the tokenizer-related files are missing from the model's cache directory, while the model itself loads and runs normally.
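Note on the likely cause: the snippet reassigns `model` from the repo-id string to the loaded `LlamaForCausalLM` object, so `AutoTokenizer.from_pretrained(model, ...)` receives the model's string repr instead of a repo id. Keeping the id in a separate variable (e.g. `model_id`) and passing that string to both `from_pretrained` calls should avoid the error. A minimal sketch of why the repr fails validation; `looks_like_repo_id` here is my rough approximation, not huggingface_hub's actual `validate_repo_id`:

```python
import re

def looks_like_repo_id(s: str) -> bool:
    """Rough approximation of huggingface_hub's repo-id rules:
    alphanumerics plus '-', '_', '.', at most one '/', max length 96."""
    if not isinstance(s, str) or not s or len(s) > 96 or s.count("/") > 1:
        return False
    return all(re.fullmatch(r"[\w.-]+", part) is not None for part in s.split("/"))

# The repo-id string passes the check:
print(looks_like_repo_id("meta-llama/Meta-Llama-3-70B"))          # True

# But after the reassignment, the tokenizer call effectively receives
# str(model) -- the module repr -- which fails validation exactly as
# the traceback shows:
fake_repr = "LlamaForCausalLM(\n  (model): LlamaModel(...)\n)"
print(looks_like_repo_id(fake_repr))                              # False
```

So `transformers.AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B", use_fast=False, token=hf_token)` with the literal repo id should load the tokenizer.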

