Calling QWenTokenizer.convert_tokens_to_string() fails with a missing byte_decoder error
#2
by
twang2218
- opened
Calling tokenizer.convert_tokens_to_string([k]) raises the following error, so the call cannot complete:
convert_tokens_to_string(b'ictionary') failed: 'QWenTokenizer' object has no attribute 'byte_decoder'
After reading through the code, I found this:
https://huggingface.co/Qwen/Qwen-7B-Chat/blob/main/tokenization_qwen.py#L197-L206
    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        """
        Converts a sequence of tokens in a single string. The most simple way to do it is `" ".join(tokens)` but we
        often want to remove sub-word tokenization artifacts at the same time.
        """
        text = "".join(tokens)
        text = bytearray([self.byte_decoder[c] for c in text]).decode(
            "utf-8", errors=self.errors
        )
        return text
The method does use self.byte_decoder[c], but neither QWenTokenizer nor PreTrainedTokenizer defines an attribute with that name.
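For comparison, GPT-2-style byte-level BPE tokenizers normally build this table themselves, as the inverse of a byte-to-character mapping. The sketch below shows that usual construction. It is only an illustration of what the missing attribute would typically hold; whether tokenization_qwen.py was adapted from such code is my assumption, and the module-level names here are not part of the Qwen tokenizer.

```python
def bytes_to_unicode():
    """Map every byte value (0-255) to a printable unicode character,
    following the scheme used by GPT-2-style byte-level BPE."""
    # Bytes that already correspond to printable characters map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (control characters, space, ...) are shifted past 255.
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
# The inverse table is what a `byte_decoder` attribute usually contains.
byte_decoder = {ch: b for b, ch in byte_encoder.items()}

# Decoding a string token the way convert_tokens_to_string() attempts to:
token = "Ġhello"  # "Ġ" stands for the space byte in this scheme
decoded = bytearray([byte_decoder[c] for c in token]).decode("utf-8", errors="replace")
```

With a table like this attached to the tokenizer, the lookup in the quoted method would succeed for string tokens. It would still not work for tokens that are already bytes objects, such as b'ictionary' in the error above, because iterating over bytes yields integers rather than characters.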
Thank you for raising this issue. This has been fixed; please try:
>>> tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True, force_download=True)
>>> tokenizer.convert_tokens_to_string([b'ictionary'])
'ictionary'
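For readers wondering how bytes tokens can be turned back into text without a byte_decoder table, here is a rough standalone sketch. It is illustrative only and not necessarily the code now in tokenization_qwen.py; the function name join_tokens and its errors parameter are hypothetical.

```python
from typing import List, Union


def join_tokens(tokens: List[Union[bytes, str]], errors: str = "replace") -> str:
    """Concatenate tokens into text, decoding runs of bytes tokens as UTF-8.

    Hypothetical helper: bytes tokens are buffered so that a multi-byte
    character split across several tokens is decoded as a whole.
    """
    text = ""
    pending = b""
    for tok in tokens:
        if isinstance(tok, bytes):
            pending += tok
        elif isinstance(tok, str):
            # Flush buffered bytes before appending a plain string token.
            if pending:
                text += pending.decode("utf-8", errors=errors)
                pending = b""
            text += tok
        else:
            raise TypeError(f"token must be bytes or str, got {type(tok).__name__}")
    if pending:
        text += pending.decode("utf-8", errors=errors)
    return text


result = join_tokens([b"ictionary"])
```

Buffering the bytes before decoding matters because a single character may be spread over more than one token; decoding each token separately would produce replacement characters.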
I'll close this for now. If there are other problems, please open a new one.
jklj077
changed discussion status to
closed