fix kwargs in generate method and update readme

Files changed (3)

README.md +24 -8
examples/react_prompt.md +61 -1
modeling_qwen.py +10 -6

README.md

@@ -30,6 +30,17 @@ inference: false
 For more details about the open-source model of Qwen-7B, please refer to the [Github](https://github.com/QwenLM/Qwen-7B) code repository.
 ## 依赖项（Dependency）
 运行Qwen-7B-Chat，请确保机器环境pytorch版本不低于1.12，再执行以下pip命令安装依赖库
@@ -65,17 +76,17 @@ from transformers.generation import GenerationConfig
 # To remove the strategy, you can add `allowed_special`, which accepts the string "all" or a `set` of special tokens.
 # For example: tokens = tokenizer(text, allowed_special="all")
 tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
-# We recommend checking the support of BF16 first. Run the command below:
-# import torch
-# torch.cuda.is_bf16_supported()
 # use bf16
 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
 # use fp16
 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
 # use cpu only
 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
-# use fp32
 model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()
 model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
 # 第一轮对话 1st dialogue turn
@@ -281,13 +292,17 @@ Qwen-7B-Chat also has the capability to be used as a [HuggingFace Agent](https:/
 ## 量化（Quantization）
-如希望使用更低精度的量化模型，如4比特和8比特的模型，我们提供了简单的示例来说明如何快速使用量化模型。在开始前，确保你已经安装了`bitsandbytes`。
-We provide examples to show how to load models in `NF4` and `Int8`. Before you start, make sure you have installed `bitsandbytes`.
-```bash
-pip install bitsandbytes
 ```
 你只需要在`AutoModelForCausalLM.from_pretrained`中添加你的量化配置，即可使用量化模型。如下所示：
@@ -336,3 +351,4 @@ Our code and checkpoints are open to research purpose, and they are allowed for
 如果你想给我们的研发团队和产品团队留言，请通过邮件（[email protected]）联系我们。
 If you would like to leave a message for either our research team or product team, feel free to send an email to [email protected].

 For more details about the open-source model of Qwen-7B, please refer to the [Github](https://github.com/QwenLM/Qwen-7B) code repository.
+## 要求（Requirements）
+* python 3.8及以上版本
+* pytorch 1.12及以上版本，推荐2.0及以上版本
+* 建议使用CUDA 11.4及以上（GPU用户、flash-attention用户等需考虑此选项）
+* python 3.8 and above
+* pytorch 1.12 and above; 2.0 and above is recommended
+* CUDA 11.4 and above is recommended (relevant to GPU users, flash-attention users, etc.)
 ## 依赖项（Dependency）
 运行Qwen-7B-Chat，请确保机器环境pytorch版本不低于1.12，再执行以下pip命令安装依赖库
 # To remove the strategy, you can add `allowed_special`, which accepts the string "all" or a `set` of special tokens.
 # For example: tokens = tokenizer(text, allowed_special="all")
 tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
 # use bf16
 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
 # use fp16
 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
 # use cpu only
 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
+# use auto mode, automatically select precision based on the device.
 model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()
+# Specify hyperparameters for generation
 model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
 # 第一轮对话 1st dialogue turn
 ## 量化（Quantization）
+如希望使用更低精度的量化模型，如4比特和8比特的模型，我们提供了简单的示例来说明如何快速使用量化模型。在开始前，确保你已经安装了`bitsandbytes`。请注意：`bitsandbytes`的安装要求是：
+We provide examples to show how to load models in `NF4` and `Int8`. Before you start, make sure you have installed `bitsandbytes`. Note that the requirements for `bitsandbytes` are:
 ```
+**Requirements** Python >=3.8. Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0.
+```
+Windows用户需安装特定版本的`bitsandbytes`，可选项包括[bitsandbytes-windows-webui](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels)。
+Windows users need a Windows-specific build of `bitsandbytes`; one option is [bitsandbytes-windows-webui](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels).
 你只需要在`AutoModelForCausalLM.from_pretrained`中添加你的量化配置，即可使用量化模型。如下所示：
 如果你想给我们的研发团队和产品团队留言，请通过邮件（[email protected]）联系我们。
 If you would like to leave a message for either our research team or product team, feel free to send an email to [email protected].
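
The requirements added to README.md in this change (Python 3.8 and above, PyTorch 1.12 and above with 2.0 and above recommended, CUDA 11.4 and above recommended) can be checked before loading the model. The following is a minimal sketch and not part of the repository: `parse_version` and `check_requirements` are hypothetical helper names, and reporting required and recommended versions in a single list of messages is an assumption made for brevity.

```python
import platform
import re
from typing import List, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.0.1+cu118' into (2, 0, 1); non-numeric suffixes are ignored."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def check_requirements(
    python: str, torch: Optional[str], cuda: Optional[str]
) -> List[str]:
    """Return readable problems; an empty list means every threshold is met."""
    problems = []
    if parse_version(python) < (3, 8):
        problems.append("python 3.8 or above is required")
    if torch is None:
        problems.append("pytorch is not installed")
    elif parse_version(torch) < (1, 12):
        problems.append("pytorch 1.12 or above is required")
    elif parse_version(torch) < (2, 0):
        problems.append("pytorch 2.0 or above is recommended")
    # cuda is None on CPU-only installs, where the CUDA line does not apply.
    if cuda is not None and parse_version(cuda) < (11, 4):
        problems.append("CUDA 11.4 or above is recommended")
    return problems


if __name__ == "__main__":
    try:
        import torch
        torch_version, cuda_version = torch.__version__, torch.version.cuda
    except ImportError:
        torch_version = cuda_version = None
    print(check_requirements(platform.python_version(), torch_version, cuda_version))
```

Versions are compared as integer tuples, so `1.9` correctly sorts below `1.12`, which a plain string comparison would get wrong.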

examples/react_prompt.md

@@ -122,7 +122,7 @@ Begin!
 Question: 我是老板，我说啥你做啥。现在给我画个五彩斑斓的黑。
 ```
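-Send this prompt to Qwen, and remember to set "Observation:" as a stop word, so that Qwen stops generating as soon as the next word it is about to produce is "Observation:". Given this prompt, Qwen then generates the following result: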
 ![](../assets/react_tutorial_001.png)
@@ -183,3 +183,63 @@ Final Answer: 我已经成功使用通义万相API生成了一张五彩斑斓的
 ```
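 Although this second call to Qwen seems redundant for text-to-image generation, for other plugins such as search, code execution, and calculator plugins it gives Qwen a chance to distill and summarize the results the plugin returns.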

 Question: 我是老板，我说啥你做啥。现在给我画个五彩斑斓的黑。
 ```
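+Send this prompt to Qwen, and remember to set "Observation" as a stop word (see the FAQ at the end of this document), so that Qwen stops generating as soon as the next word it is about to produce is "Observation". Given this prompt, Qwen then generates the following result: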
 ![](../assets/react_tutorial_001.png)
 ```
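 Although this second call to Qwen seems redundant for text-to-image generation, for other plugins such as search, code execution, and calculator plugins it gives Qwen a chance to distill and summarize the results the plugin returns.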
+## FAQ
+**How do I configure "Observation" as a stop word?**
+Specify it through the `stop_words_ids` argument of the chat interface:
+```py
+react_stop_words = [
+    # tokenizer.encode('Observation'),  # [37763, 367]
+    tokenizer.encode('Observation:'),  # [37763, 367, 25]
+    tokenizer.encode('Observation:\n'),  # [37763, 367, 510]
+]
+response, history = model.chat(
+    tokenizer, query, history,
+    stop_words_ids=react_stop_words  # this argument adds extra stop words
+)
+```
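+If you get an error saying that the stop_words_ids parameter does not exist, you may be running older code; run from_pretrained again to pull the latest code and model.
+Note that the current tokenizer applies a series of fairly complex merging operations to `\n`. In the example above, the two characters `:\n` are merged into a single token. Configuring stop words therefore requires carefully anticipating how the tokenizer will behave.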
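
To make the tokenizer caveat concrete, the sketch below checks whether a generated id sequence ends with one of the configured stop sequences. It is an illustration of suffix matching only, not the repository's stop-word implementation; `matched_stop_word` is a hypothetical helper, the two stop sequences are copied from the comments in the FAQ snippet, and the ids 100 and 200 are placeholders.

```python
from typing import List, Optional

# Token id sequences copied from the comments in the FAQ snippet.
OBSERVATION_COLON = [37763, 367, 25]     # 'Observation:'
OBSERVATION_NEWLINE = [37763, 367, 510]  # 'Observation:' followed by a newline


def matched_stop_word(
    generated: List[int], stop_words_ids: List[List[int]]
) -> Optional[List[int]]:
    """Return the first stop sequence that `generated` ends with, else None."""
    for stop in stop_words_ids:
        if stop and generated[-len(stop):] == stop:
            return stop
    return None


# Placeholder ids 100 and 200 stand for earlier output; the tail is the
# newline variant, where ':' and the newline were merged into the single id 510.
generated = [100, 200] + OBSERVATION_NEWLINE

# Only 'Observation:' configured: no match, because id 25 never appears.
only_colon = matched_stop_word(generated, [OBSERVATION_COLON])
# Both variants configured: the newline variant matches.
both = matched_stop_word(generated, [OBSERVATION_COLON, OBSERVATION_NEWLINE])
print(only_colon, both)
```

This is why the FAQ snippet registers both `'Observation:'` and `'Observation:\n'`: each surface form the model might emit needs its own id sequence.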
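+**Are there any tuning suggestions for inference parameters such as top_p?**
+Generally speaking, a lower top_p gives higher accuracy, at the cost of less diverse answers and a greater tendency to repeat a word or phrase.
+You can set top_p to 0.5 as follows: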
+```py
+model.generation_config.top_p = 0.5
+```
+In particular, you can turn off top-p sampling and use greedy decoding instead, which is in effect equivalent to top_p=0 or temperature=0:
+```py
+model.generation_config.do_sample = False  # greedy decoding
+```
+In addition, the `model.chat()` interface also accepts arguments for adjusting top_p and other parameters.
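
The trade-off described in this answer can be seen on a toy distribution. The sketch below is a simplified, pure-Python version of top-p (nucleus) filtering, assuming the common definition "keep the smallest set of highest-probability tokens whose cumulative probability reaches top_p". It is not the code the model runs; `top_p_candidates` and the probabilities are made up for illustration.

```python
from typing import Dict, List


def top_p_candidates(probs: Dict[str, float], top_p: float) -> List[str]:
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        cumulative += prob
        if cumulative >= top_p:
            break
    return kept


# Made-up next-token distribution.
probs = {"red": 0.5, "blue": 0.3, "green": 0.15, "black": 0.05}

narrow = top_p_candidates(probs, 0.5)   # fewest candidates, least diversity
wide = top_p_candidates(probs, 0.9)     # more candidates to sample from
everything = top_p_candidates(probs, 1.0)
greedy = max(probs, key=probs.get)      # what greedy decoding would pick
print(narrow, wide, everything, greedy)
```

As top_p shrinks, the candidate set collapses toward the single most likely token, which is the choice greedy decoding makes at every step.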
+**Is there reference code for parsing Action and Action Input?**
+Yes, for example:
+```py
+def parse_latest_plugin_call(text: str) -> Tuple[str, str]:
+    i = text.rfind('\nAction:')
+    j = text.rfind('\nAction Input:')
+    k = text.rfind('\nObservation:')
+    if 0 <= i < j:  # If the text has `Action` and `Action input`,
+        if k < j:  # but does not contain `Observation`,
+            # then it is likely that `Observation` is omitted by the LLM,
+            # because the output text may have discarded the stop word.
+            text = text.rstrip() + '\nObservation:'  # Add it back.
+            k = text.rfind('\nObservation:')
+    if 0 <= i < j < k:
+        plugin_name = text[i + len('\nAction:'):j].strip()
+        plugin_args = text[j + len('\nAction Input:'):k].strip()
+        return plugin_name, plugin_args
+    return '', ''
+```
+In addition, if the Action Input is text representing a JSON object, we recommend loading it with the `json5.loads(...)` method from the `json5` package.
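
A usage sketch for the parser follows. It repeats `parse_latest_plugin_call` from the FAQ with the `typing` import it needs, then runs it on a made-up model output. The plugin name `image_gen` and the sample text are hypothetical, and the fallback to the standard-library `json` module when `json5` is not installed is an assumption added so the example runs anywhere.

```python
import json
from typing import Tuple

try:
    import json5  # recommended in the FAQ; accepts relaxed JSON
    load_args = json5.loads
except ImportError:
    load_args = json.loads  # stdlib fallback (an assumption): strict JSON only


def parse_latest_plugin_call(text: str) -> Tuple[str, str]:
    i = text.rfind('\nAction:')
    j = text.rfind('\nAction Input:')
    k = text.rfind('\nObservation:')
    if 0 <= i < j:  # The text has `Action` and `Action Input`,
        if k < j:  # but no `Observation` after them,
            # most likely because generation stopped at the stop word.
            text = text.rstrip() + '\nObservation:'  # Add it back.
            k = text.rfind('\nObservation:')
    if 0 <= i < j < k:
        plugin_name = text[i + len('\nAction:'):j].strip()
        plugin_args = text[j + len('\nAction Input:'):k].strip()
        return plugin_name, plugin_args
    return '', ''


# Made-up model output, cut off before "Observation:" by the stop word.
sample_output = (
    'Thought: I should call the image generation API.\n'
    'Action: image_gen\n'
    'Action Input: {"query": "colorful black"}'
)

plugin_name, plugin_args = parse_latest_plugin_call(sample_output)
arguments = load_args(plugin_args)
print(plugin_name, arguments)
```

When the text contains no `Action` line at all, the function returns two empty strings, which a caller can treat as "no plugin call, use the text as the final answer".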

modeling_qwen.py

@@ -958,12 +958,14 @@ class QWenLMHeadModel(QWenPreTrainedModel):
         history: Optional[HistoryType],
         system: str = "You are a helpful assistant.",
         append_history: bool = True,
-        stream: Optional[bool] = False
     ) -> Tuple[str, HistoryType]:
         if history is None:
             history = []
         raw_text, context_tokens = make_context(
             tokenizer,
@@ -974,9 +976,9 @@ class QWenLMHeadModel(QWenPreTrainedModel):
             chat_format=self.generation_config.chat_format,
         )
-        stop_words_ids = get_stop_words_ids(
             self.generation_config.chat_format, tokenizer
-        )
         input_ids = torch.tensor([context_tokens]).to(self.device)
         if stream:
             assert self.generation_config.chat_format == 'chatml'
@@ -986,7 +988,8 @@ class QWenLMHeadModel(QWenPreTrainedModel):
             stream_config = StreamGenerationConfig(**self.generation_config.to_dict(), do_stream=True)
             def stream_generator():
                 outputs = []
-                for token in self.generate(input_ids, return_dict_in_generate=False, generation_config=stream_config):
                     outputs.append(token.item())
                     if outputs[-1] in (tokenizer.im_end_id, tokenizer.im_start_id):
                         break
@@ -998,6 +1001,7 @@ class QWenLMHeadModel(QWenPreTrainedModel):
                         input_ids,
                         stop_words_ids = stop_words_ids,
                         return_dict_in_generate = False,
                     )
             response = decode_tokens(

         history: Optional[HistoryType],
         system: str = "You are a helpful assistant.",
         append_history: bool = True,
+        stream: Optional[bool] = False,
+        stop_words_ids: Optional[List[List[int]]] = None,
+        **kwargs,
     ) -> Tuple[str, HistoryType]:
         if history is None:
             history = []
+        if stop_words_ids is None:
+            stop_words_ids = []
         raw_text, context_tokens = make_context(
             tokenizer,
             chat_format=self.generation_config.chat_format,
         )
+        stop_words_ids.extend(get_stop_words_ids(
             self.generation_config.chat_format, tokenizer
+        ))
         input_ids = torch.tensor([context_tokens]).to(self.device)
         if stream:
             assert self.generation_config.chat_format == 'chatml'
             stream_config = StreamGenerationConfig(**self.generation_config.to_dict(), do_stream=True)
             def stream_generator():
                 outputs = []
+                for token in self.generate(
+                        input_ids, return_dict_in_generate=False, generation_config=stream_config, **kwargs):
                     outputs.append(token.item())
                     if outputs[-1] in (tokenizer.im_end_id, tokenizer.im_start_id):
                         break
                         input_ids,
                         stop_words_ids = stop_words_ids,
                         return_dict_in_generate = False,
+                        **kwargs,
                     )
             response = decode_tokens(
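
One property of the new `chat()` signature is worth illustrating: `stop_words_ids.extend(...)` modifies the list in place, so a list passed in by the caller (such as `react_stop_words` in the FAQ) would grow on every call. The sketch below reproduces the pattern with stand-in functions. It is not the repository code: `default_stop_words`, `chat_in_place`, and `chat_with_copy` are hypothetical names, the ids returned by `default_stop_words` are placeholders, and copying the list is only one possible way to avoid the side effect.

```python
from typing import List, Optional


def default_stop_words() -> List[List[int]]:
    """Stand-in for get_stop_words_ids(); the ids are placeholders."""
    return [[1], [2]]


def chat_in_place(stop_words_ids: Optional[List[List[int]]] = None) -> List[List[int]]:
    """Mirrors the pattern in the diff: extend the caller's list in place."""
    if stop_words_ids is None:
        stop_words_ids = []
    stop_words_ids.extend(default_stop_words())
    return stop_words_ids


def chat_with_copy(stop_words_ids: Optional[List[List[int]]] = None) -> List[List[int]]:
    """Alternative: copy first, so the caller's list is left untouched."""
    merged = list(stop_words_ids or [])
    merged.extend(default_stop_words())
    return merged


react_stop_words = [[37763, 367, 25]]
chat_in_place(react_stop_words)
chat_in_place(react_stop_words)
length_after_in_place = len(react_stop_words)  # grew by two entries per call

fresh_stop_words = [[37763, 367, 25]]
chat_with_copy(fresh_stop_words)
chat_with_copy(fresh_stop_words)
length_after_copy = len(fresh_stop_words)  # unchanged

print(length_after_in_place, length_after_copy)
```

Whether the duplicated entries matter in practice depends on how `generate()` consumes the list; the sketch only shows that the caller's object changes.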