Update tokenization_chatglm.py
#7 by ksuriuri - opened
When running the following code:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("/home/oneway/ssd2t/model/ZhipuAI/glm-4-9b-chat", trust_remote_code=True)
new_str = tokenizer.decode(198)
print(new_str)
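it raises the error: TypeError: token should only be of type types or str
The cause is that the keys in the glm4 vocabulary are stored as bytes, and when a bytes object is iterated over in the _decode function of transformers, each element becomes an int.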
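A minimal illustration of that behavior, using a made-up token value rather than an actual entry from the glm4 vocabulary: iterating over a Python bytes object yields int values, not one-byte bytes objects.

```python
# Illustrative value only; not taken from the real glm4 vocabulary.
token = b"\n"

# Iterating over bytes produces integers (the byte values).
pieces = list(token)

print(pieces)                     # [10]
print(type(pieces[0]).__name__)   # int
```

Any code that loops over a single bytes token as if it were a list of tokens therefore ends up handing int values to convert_tokens_to_string.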
Modifying the convert_tokens_to_string function in tokenization_chatglm.py as follows resolves the problem:
def convert_tokens_to_string(tokens: List[Union[bytes, str, int]]) -> str:
    """
    Converts a sequence of tokens in a single string.
    """
    text = ""
    temp = b""
    for t in tokens:
        if isinstance(t, int):
            t = chr(t)
        if isinstance(t, str):
            if temp:
                text += temp.decode("utf-8", errors="replace")
                temp = b""
            text += t
        elif isinstance(t, bytes):
            temp += t
        else:
            raise TypeError("token should only be of type int, bytes or str")
    if temp:
        text += temp.decode("utf-8", errors="replace")
    return text
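The patched logic can be tried outside the tokenizer with a sketch like the one below. It assumes a standalone module-level copy of the function (not the method as it sits in tokenization_chatglm.py), and the input tokens are made up for illustration.

```python
from typing import List, Union


def convert_tokens_to_string(tokens: List[Union[bytes, str, int]]) -> str:
    """Standalone copy of the patched logic, for experimenting outside the tokenizer."""
    text = ""
    temp = b""  # buffer of bytes that have not been decoded yet
    for t in tokens:
        if isinstance(t, int):
            # An int is treated as a code point and turned into a one-character str.
            t = chr(t)
        if isinstance(t, str):
            if temp:
                # Flush buffered bytes before appending the str token.
                text += temp.decode("utf-8", errors="replace")
                temp = b""
            text += t
        elif isinstance(t, bytes):
            temp += t
        else:
            raise TypeError("token should only be of type int, bytes or str")
    if temp:
        text += temp.decode("utf-8", errors="replace")
    return text


# A multi-byte UTF-8 character split across two bytes tokens is decoded as a whole.
split_char = convert_tokens_to_string([b"\xe4\xbd", b"\xa0"])

# Integers produced by iterating over a bytes token are mapped through chr().
from_ints = convert_tokens_to_string(list(b"\n"))

# Mixed str and bytes tokens are concatenated in order.
mixed = convert_tokens_to_string(["a", b"b", "c"])

print(repr(split_char), repr(from_ints), repr(mixed))
```

One point to keep in mind: because each int goes through chr(), a bytes token holding a multi-byte UTF-8 character that is iterated into separate integers comes back as individual Latin-1 range characters rather than the original character. ASCII tokens such as a newline are unaffected.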
zRzRzRzRzRzRzR changed pull request status to merged