update tokenization_qwen.py

- README.md +5 -5
- tokenization_qwen.py +4 -10
README.md CHANGED

@@ -10,7 +10,7 @@ pipeline_tag: text-generation
 # Qwen-7B
 
 <p align="center">
-    <img src="
+    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo.jpg" width="400"/>
 <p>
 <br>
 
@@ -29,7 +29,7 @@ pipeline_tag: text-generation
 2. **强大的性能**:Qwen-7B在多个中英文下游评测任务上(涵盖常识推理、代码、数学、翻译等),效果显著超越现有的相近规模开源模型,甚至在部分指标上相比更大尺寸模型也有较强竞争力。具体评测结果请详见下文。
 3. **覆盖更全面的词表**:相比目前以中英词表为主的开源模型,Qwen-7B使用了约15万大小的词表。该词表对多语言更加友好,方便用户在不扩展词表的情况下对部分语种进行能力增强和扩展。
 
-如果您想了解更多关于通义千问7B开源模型的细节,我们建议您参阅Github
+如果您想了解更多关于通义千问7B开源模型的细节,我们建议您参阅[Github代码库](https://github.com/QwenLM/Qwen-7B)。
 
 **Qwen-7B** is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Aibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. This repository is the one for Qwen-7B.
 
@@ -39,7 +39,7 @@ The features of Qwen-7B include:
 2. **Competitive performance**: It significantly surpasses existing open-source models of similar scale on multiple Chinese and English downstream evaluation tasks (including commonsense, reasoning, code, mathematics, etc.), and even surpasses some larger-scale models in several benchmarks. See below for specific evaluation results.
 3. **More comprehensive vocabulary coverage**: Compared with other open-source models based on Chinese and English vocabularies, Qwen-7B uses a vocabulary of over 150K tokens. This vocabulary is more friendly to multiple languages, enabling users to directly further enhance the capability for certain languages without expanding the vocabulary.
 
-For more details about the open-source model of Qwen-7B, please refer to the Github code repository.
+For more details about the open-source model of Qwen-7B, please refer to the [Github](https://github.com/QwenLM/Qwen-7B) code repository.
 
 ## 依赖项 (Dependency)
 
@@ -83,9 +83,9 @@ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
 # 蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)...
 ```
 
-关于更多的使用说明,请参考我们的Github repo获取更多信息。
+关于更多的使用说明,请参考我们的[Github repo](https://github.com/QwenLM/Qwen-7B)获取更多信息。
 
-For more information, please refer to our Github repo for more information.
+For more information, please refer to our [Github repo](https://github.com/QwenLM/Qwen-7B) for more information.
 
 ## 模型细节 (Model)
 
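For context, the `@@ -83,9 +83,9 @@` hunk falls at the tail of the README's quick-start snippet: the `tokenizer.decode` call and the capital-cities completion appear above as unchanged context. Below is a minimal sketch of that kind of usage, assuming the Hub repo id `Qwen/Qwen-7B` and the standard `transformers` remote-code API; the README's exact prompt and generation settings may differ.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# tokenization_qwen.py ships inside the repo, so trust_remote_code=True is required
# for the custom QWenTokenizer to be loaded.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B", device_map="auto", trust_remote_code=True
).eval()

# Continuation prompt in the style of the README's capital-cities example.
prompt = "蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
pred = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
```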
tokenization_qwen.py CHANGED

@@ -20,7 +20,7 @@ from transformers import PreTrainedTokenizer, AddedToken
 
 logger = logging.getLogger(__name__)
 
-
+VOCAB_FILES_NAMES = {"vocab_file": "qwen.tiktoken"}
 
 
 class QWenTokenizer(PreTrainedTokenizer):
@@ -28,17 +28,11 @@ class QWenTokenizer(PreTrainedTokenizer):
 
     """NOTE: This tokenizer will not handle special tokens to avoid injection attacks"""
 
-
-    def from_pretrained(
-        cls, pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs
-    ):
-        merges_file = os.path.join(pretrained_model_name_or_path, TIKTOKEN_NAME)
-        tokenizer = cls(merges_file, *inputs, **kwargs)
-        return tokenizer
+    vocab_files_names = VOCAB_FILES_NAMES
 
     def __init__(
         self,
-
+        vocab_file,
         errors="replace",
         max_len=None,
         unk_token="<|endoftext|>",
@@ -113,7 +107,7 @@ class QWenTokenizer(PreTrainedTokenizer):
             )
         }
 
-        mergeable_ranks = load_tiktoken_bpe(
+        mergeable_ranks = load_tiktoken_bpe(vocab_file)
         special_tokens = {
             token: index
             for index, token in enumerate(special_tokens, start=len(mergeable_ranks))
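The net effect of the tokenization_qwen.py change: the hand-rolled `from_pretrained` override, which assembled the merges path locally with `os.path.join(pretrained_model_name_or_path, TIKTOKEN_NAME)`, is dropped in favour of the standard mechanism. Declaring `vocab_files_names = {"vocab_file": "qwen.tiktoken"}` lets the stock `PreTrainedTokenizer.from_pretrained` resolve (and, for Hub repos, download) the tiktoken merges file and hand its local path to `__init__` as `vocab_file`, which goes straight into `load_tiktoken_bpe`. A minimal sketch of that resolution done by hand, assuming the repo id `Qwen/Qwen-7B`:

```python
from huggingface_hub import hf_hub_download
from tiktoken.load import load_tiktoken_bpe

# What vocab_files_names = {"vocab_file": "qwen.tiktoken"} asks the base class to do,
# performed manually here: fetch the merges file from the Hub (repo id assumed).
vocab_file = hf_hub_download("Qwen/Qwen-7B", "qwen.tiktoken")

# load_tiktoken_bpe returns the BPE merge ranks as a dict of token bytes -> rank;
# this is the same call the updated __init__ makes with the resolved vocab_file.
mergeable_ranks = load_tiktoken_bpe(vocab_file)
print(len(mergeable_ranks))  # roughly the ~150K-token vocabulary the README describes
```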