update tokenization_qwen.py

- README.md +5 -5
- tokenization_qwen.py +4 -10
README.md CHANGED

@@ -10,7 +10,7 @@ pipeline_tag: text-generation
 # Qwen-7B
 
 <p align="center">
-    <img src="
+    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo.jpg" width="400"/>
 <p>
 <br>
 
@@ -29,7 +29,7 @@ pipeline_tag: text-generation
 2. **强大的性能**:Qwen-7B在多个中英文下游评测任务上(涵盖常识推理、代码、数学、翻译等),效果显著超越现有的相近规模开源模型,甚至在部分指标上相比更大尺寸模型也有较强竞争力。具体评测结果请详见下文。
 3. **覆盖更全面的词表**:相比目前以中英词表为主的开源模型,Qwen-7B使用了约15万大小的词表。该词表对多语言更加友好,方便用户在不扩展词表的情况下对部分语种进行能力增强和扩展。
 
-如果您想了解更多关于通义千问7B开源模型的细节,我们建议您参阅Github
+如果您想了解更多关于通义千问7B开源模型的细节,我们建议您参阅[Github代码库](https://github.com/QwenLM/Qwen-7B)。
 
 **Qwen-7B** is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Aibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. This repository is the one for Qwen-7B.
 
@@ -39,7 +39,7 @@ The features of Qwen-7B include:
 2. **Competitive performance**: It significantly surpasses existing open-source models of similar scale on multiple Chinese and English downstream evaluation tasks (including commonsense, reasoning, code, mathematics, etc.), and even surpasses some larger-scale models in several benchmarks. See below for specific evaluation results.
 3. **More comprehensive vocabulary coverage**: Compared with other open-source models based on Chinese and English vocabularies, Qwen-7B uses a vocabulary of over 150K tokens. This vocabulary is more friendly to multiple languages, enabling users to directly further enhance the capability for certain languages without expanding the vocabulary.
 
-For more details about the open-source model of Qwen-7B, please refer to the Github code repository.
+For more details about the open-source model of Qwen-7B, please refer to the [Github](https://github.com/QwenLM/Qwen-7B) code repository.
 
 ## 依赖项 (Dependency)
 
@@ -83,9 +83,9 @@ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
 # 蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)...
 ```
 
-关于更多的使用说明,请参考我们的Github repo获取更多信息。
+关于更多的使用说明,请参考我们的[Github repo](https://github.com/QwenLM/Qwen-7B)获取更多信息。
 
-For more information, please refer to our Github repo for more information.
+For more information, please refer to our [Github repo](https://github.com/QwenLM/Qwen-7B) for more information.
 
 ## 模型细节 (Model)
 
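For context, the `@@ -83,9 +83,9 @@` hunk falls at the tail of the README's quick-start snippet: the `tokenizer.decode` call and the capital-cities completion appear above as unchanged context. Below is a minimal sketch of that kind of usage, assuming the Hub repo id `Qwen/Qwen-7B` and the standard `transformers` remote-code API; the README's exact prompt and generation settings may differ.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# tokenization_qwen.py ships inside the repo, so trust_remote_code=True is required
# for the custom QWenTokenizer to be loaded.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B", device_map="auto", trust_remote_code=True
).eval()

# Continuation prompt in the style of the README's capital-cities example.
prompt = "蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
pred = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
```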
tokenization_qwen.py CHANGED

@@ -20,7 +20,7 @@ from transformers import PreTrainedTokenizer, AddedToken
 
 logger = logging.getLogger(__name__)
 
-
+VOCAB_FILES_NAMES = {"vocab_file": "qwen.tiktoken"}
 
 
 class QWenTokenizer(PreTrainedTokenizer):
@@ -28,17 +28,11 @@ class QWenTokenizer(PreTrainedTokenizer):
 
     """NOTE: This tokenizer will not handle special tokens to avoid injection attacks"""
 
-
-    def from_pretrained(
-        cls, pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs
-    ):
-        merges_file = os.path.join(pretrained_model_name_or_path, TIKTOKEN_NAME)
-        tokenizer = cls(merges_file, *inputs, **kwargs)
-        return tokenizer
+    vocab_files_names = VOCAB_FILES_NAMES
 
     def __init__(
         self,
-
+        vocab_file,
         errors="replace",
         max_len=None,
         unk_token="<|endoftext|>",
@@ -113,7 +107,7 @@ class QWenTokenizer(PreTrainedTokenizer):
             )
         }
 
-        mergeable_ranks = load_tiktoken_bpe(
+        mergeable_ranks = load_tiktoken_bpe(vocab_file)
         special_tokens = {
             token: index
             for index, token in enumerate(special_tokens, start=len(mergeable_ranks))
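The net effect of the tokenization_qwen.py change: the hand-rolled `from_pretrained` override, which assembled the merges path locally with `os.path.join(pretrained_model_name_or_path, TIKTOKEN_NAME)`, is dropped in favour of the standard mechanism. Declaring `vocab_files_names = {"vocab_file": "qwen.tiktoken"}` lets the stock `PreTrainedTokenizer.from_pretrained` resolve (and, for Hub repos, download) the tiktoken merges file and hand its local path to `__init__` as `vocab_file`, which goes straight into `load_tiktoken_bpe`. A minimal sketch of that resolution done by hand, assuming the repo id `Qwen/Qwen-7B`:

```python
from huggingface_hub import hf_hub_download
from tiktoken.load import load_tiktoken_bpe

# What vocab_files_names = {"vocab_file": "qwen.tiktoken"} asks the base class to do,
# performed manually here: fetch the merges file from the Hub (repo id assumed).
vocab_file = hf_hub_download("Qwen/Qwen-7B", "qwen.tiktoken")

# load_tiktoken_bpe returns the BPE merge ranks as a dict of token bytes -> rank;
# this is the same call the updated __init__ makes with the resolved vocab_file.
mergeable_ranks = load_tiktoken_bpe(vocab_file)
print(len(mergeable_ranks))  # roughly the ~150K-token vocabulary the README describes
```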