Bug in tokenize()/detokenize()/tokenize() cycle
The tokenizer for this model doesn't seem to work well with multibyte characters which encode to multiple tokens. Specifically:
from huggingface_hub import hf_hub_download
import llama_cpp
repo_id = "bartowski/Meta-Llama-3-8B-Instruct-GGUF"
filename = "Meta-Llama-3-8B-Instruct-IQ3_S.gguf"
downloaded_file = hf_hub_download(repo_id=repo_id, filename=filename)
llama_model = llama_cpp.Llama(model_path=downloaded_file, n_ctx=4096)
print("\n===========================\n")
sample_string = "歪"
sample_bytes = sample_string.encode()
print(f"{sample_bytes=}")
# Tokenize the raw UTF-8 bytes of the single character
tokens = llama_model.tokenize(sample_bytes, add_bos=False, special=True)
print(f"{tokens=}")
# Detokenize only the first of the resulting tokens...
tokenizer = llama_cpp.LlamaTokenizer(llama_model)
first_token = tokenizer.detokenize([tokens[0]])
print(f"{first_token=}")
# ...and then try to tokenize those bytes again
tokens_2 = tokenizer.tokenize(first_token, add_bos=False, special=True)
print(f"{tokens_2=}")
This results in a segfault (or OS-equivalent failure) on the final tokenizer.tokenize() call.
The output prior to the segfault is:
sample_bytes=b'\xe6\xad\xaa'
tokens=[15722, 103]
first_token=b'\xe6\xad'
So we can see that sample_string is being encoded to three bytes, but these are represented by two tokens. We can get the byte representation of the first token (which happens to be the first two of the three bytes) and then try to tokenize() just that... but that fails.
I had expected the final 'print' to give tokens_2=[15722].
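One observation that may be relevant (plain Python, no llama_cpp involved): the bytes returned for the first token are not a complete UTF-8 sequence on their own, so the detokenized output stops in the middle of the character:
partial = b'\xe6\xad'        # detokenized bytes of tokens[0]
full = b'\xe6\xad\xaa'       # all three bytes of "歪"
print(full.decode("utf-8"))  # prints 歪
try:
    partial.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # e.g. 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data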
I'm using Python 3.12, with llama_cpp_python 0.2.83 on Windows and llama_cpp_python 0.2.82 on Linux.
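For contrast, here is a sketch of what I would expect to work (it reuses the llama_model and tokenizer objects from the repro above; the values in the comments are my expectation, not verified output): detokenizing the complete token list and re-tokenizing those bytes should round-trip, since the byte string is then a whole UTF-8 character again.
all_bytes = tokenizer.detokenize(tokens)  # expected: b'\xe6\xad\xaa' (both tokens together)
tokens_3 = llama_model.tokenize(all_bytes, add_bos=False, special=True)
print(f"{tokens_3=}")  # expected: [15722, 103]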
NB: this was originally reported by a Guidance user; I have adapted their repro case above.