Update README.md
README.md CHANGED
@@ -1,6 +1,6 @@
 ---
 license: other
-license_name:
+license_name: qwen
 license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
 language:
 - en
@@ -61,7 +61,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name)
 
 prompt = "Give me a short introduction to large language model."
 messages = [
-    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
     {"role": "user", "content": prompt}
 ]
 text = tokenizer.apply_chat_template(
@@ -84,11 +84,25 @@ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
 
 ### Processing Long Texts
 
-
-
-
+The current `config.json` is set for context length up to 32,768 tokens.
+To handle extensive inputs exceeding 32,768 tokens, we utilize [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
+
+For supported frameworks, you could add the following to `config.json` to enable YaRN:
+```json
+{
+  ...,
+  "rope_scaling": {
+    "factor": 4.0,
+    "original_max_position_embeddings": 32768,
+    "type": "yarn"
+  }
+}
+```
 
-
+For deployment, we recommend using vLLM.
+Please refer to our [Documentation](https://qwen.readthedocs.io/en/latest/deployment/vllm.html) for usage if you are not familiar with vLLM.
+Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**.
+We advise adding the `rope_scaling` configuration only when processing long contexts is required.
 
 ## Evaluation & Performance
 
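For context on the system-prompt change above, the sketch below shows the updated message used end to end with `transformers`, following the quickstart example visible in the hunk context; the generation settings (for example `max_new_tokens=512`) are illustrative assumptions rather than values taken from this commit.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-72B-Instruct"

# Load the model and tokenizer as in the README's quickstart section.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language model."
messages = [
    # The commit replaces the generic system prompt with one that identifies the model as Qwen.
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

# Render the chat into a single prompt string using the model's chat template.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then drop the prompt tokens before decoding the reply.
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
generated_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```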
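As a worked example of the `config.json` edit the new section describes, the following sketch patches a locally downloaded copy of the config with the same `rope_scaling` block; the `./Qwen2.5-72B-Instruct` path is a hypothetical local checkout, and whether the block is honored depends on the inference framework, as the README notes.

```python
import json
from pathlib import Path

# Hypothetical path to a local download of the model repository.
config_path = Path("./Qwen2.5-72B-Instruct/config.json")
config = json.loads(config_path.read_text())

# Add the YaRN rope-scaling block from the README: a 4x extension of the
# original 32,768-token position embeddings (roughly 131k tokens of context).
config["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

config_path.write_text(json.dumps(config, indent=2))
print("rope_scaling set to:", config["rope_scaling"])
```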
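For the vLLM deployment path, a rough offline-inference sketch follows; it assumes the patched local model directory from the previous snippet, and the parallelism and length settings are placeholder guesses rather than values from the linked Qwen documentation. Because vLLM applies YaRN statically, keeping an unpatched config for short-context serving, as the README advises, avoids the quality cost on shorter inputs.

```python
from vllm import LLM, SamplingParams

# Serve the locally patched model copy (config.json now contains rope_scaling).
# tensor_parallel_size and max_model_len are illustrative: 4.0 * 32768 = 131072.
llm = LLM(
    model="./Qwen2.5-72B-Instruct",
    tensor_parallel_size=8,
    max_model_len=131072,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)

# A single long-document prompt; the "..." stands in for tens of thousands of tokens.
outputs = llm.generate(["Summarize the following report:\n..."], sampling_params)
print(outputs[0].outputs[0].text)
```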