Tokenizer doesn't add EOS token even when explicitly requested
#90
by johngiorgi · opened
I can't seem to get the tokenizer to add the EOS token, even when I explicitly request it. Reproduction below with a fresh download of the tokenizer:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    add_eos_token=True,
    force_download=True,
    token=True,
)
tokenizer.tokenize("this is a test", add_special_tokens=True)
>>> ['<|begin_of_text|>', 'this', 'Ġis', 'Ġa', 'Ġtest']
I am on transformers==4.40.1 and tokenizers==0.19.1.
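Until this is fixed upstream, one workaround is to append the EOS id yourself after encoding, rather than relying on `add_eos_token`. A minimal sketch (`append_eos` is a hypothetical helper name, not a transformers API; the usage lines are commented out because they need access to the gated Llama 3 checkpoint):

```python
def append_eos(tokenizer, text):
    """Encode `text` and ensure the id sequence ends with the tokenizer's EOS id.

    Works with any Hugging Face tokenizer that exposes `eos_token_id`;
    it does not depend on `add_eos_token` being honored.
    """
    ids = tokenizer(text, add_special_tokens=True)["input_ids"]
    if tokenizer.eos_token_id is not None and (
        not ids or ids[-1] != tokenizer.eos_token_id
    ):
        ids.append(tokenizer.eos_token_id)
    return ids

# Usage with the tokenizer from the snippet above (requires the gated
# meta-llama checkpoint, so commented out here):
# ids = append_eos(tokenizer, "this is a test")
# assert ids[-1] == tokenizer.eos_token_id
```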
This is awful. I hope it gets resolved soon!
Try this (full imports added; `model` is whatever checkpoint id you are loading, and `Settings.CACHE_DIR` here is my own config object, not a llama_index built-in):

import os

import torch
from llama_index.llms.huggingface import HuggingFaceLLM
from transformers import AutoTokenizer

model = "meta-llama/Meta-Llama-3-8B-Instruct"  # example checkpoint; substitute your own

tokenizer = AutoTokenizer.from_pretrained(
    model,
    token=os.environ["HUGGING_FACE_TOKEN"],
    cache_dir=Settings.CACHE_DIR,  # your own cache-dir setting
)

# Stop generation on either of the end-of-sequence tokens Llama 3 uses.
stopping_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

llm = HuggingFaceLLM(
    model_name=model,
    model_kwargs={
        "token": os.environ["HUGGING_FACE_TOKEN"],
        "torch_dtype": torch.bfloat16,
    },
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.01,
        "top_p": 0.9,
    },
    tokenizer_name=model,
    tokenizer_kwargs={"token": os.environ["HUGGING_FACE_TOKEN"]},
    stopping_ids=stopping_ids,
)
osanseviero changed discussion status to closed
@osanseviero Why was this closed, and with zero explanation? This is still an issue with both the base model and the instruct model:

['<|begin_of_text|>', 'this', 'Ġis', 'Ġa', 'Ġtest']