Update tokenization_chatglm.py
当运行如下代码:
```
from transformers import AutoTokenizer
# trust_remote_code is required: the GLM-4 tokenizer class is shipped inside
# the model repo (tokenization_chatglm.py), not inside transformers itself.
tokenizer = AutoTokenizer.from_pretrained("/home/oneway/ssd2t/model/ZhipuAI/glm-4-9b-chat", trust_remote_code=True)
# Decoding a single token id reproduces the reported
# `TypeError: token should only be of type types or str`.
new_str = tokenizer.decode(198)
print(new_str)
```
报错:`TypeError: token should only be of type types or str`
原因是glm4的词表中的key是以bytes类型存储,而bytes类型在transformers的_decode函数中被遍历会变成int类型。
对`tokenization_chatglm.py`中的`convert_tokens_to_string`函数作如下修改即可解决该问题:
```
def convert_tokens_to_string(tokens: List[Union[bytes, str, int]]) -> str:
    """
    Converts a sequence of tokens in a single string.

    Runs of ``bytes`` tokens are buffered and decoded as UTF-8 in one pass
    (undecodable sequences become U+FFFD); ``int`` tokens are mapped through
    ``chr`` and then handled like ``str`` tokens.
    """
    pieces = []
    pending = bytearray()

    def _flush() -> None:
        # Decode and emit any buffered byte tokens before a str boundary.
        if pending:
            pieces.append(pending.decode("utf-8", errors="replace"))
            pending.clear()

    for token in tokens:
        if isinstance(token, int):
            token = chr(token)
        if isinstance(token, str):
            _flush()
            pieces.append(token)
        elif isinstance(token, bytes):
            pending.extend(token)
        else:
            raise TypeError("token should only be of type int, bytes or str")
    _flush()
    return "".join(pieces)
```
- tokenization_chatglm.py +5 -3
```diff
@@ -62,14 +62,16 @@ class ChatGLM4Tokenizer(PreTrainedTokenizer):
         vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
         vocab.update(self.added_tokens_encoder)
         return vocab
-
-    def convert_tokens_to_string(self, tokens: List[Union[bytes, str]]) -> str:
+
+    def convert_tokens_to_string(self, tokens: List[Union[bytes, str, int]]) -> str:
         """
         Converts a sequence of tokens in a single string.
         """
         text = ""
         temp = b""
         for t in tokens:
+            if isinstance(t, int):
+                t = chr(t)
             if isinstance(t, str):
                 if temp:
                     text += temp.decode("utf-8", errors="replace")
@@ -78,7 +80,7 @@ class ChatGLM4Tokenizer(PreTrainedTokenizer):
             elif isinstance(t, bytes):
                 temp += t
             else:
-                raise TypeError("token should only be of type types or str")
+                raise TypeError("token should only be of type int, bytes or str")
         if temp:
             text += temp.decode("utf-8", errors="replace")
         return text
```