trillionmonster committed
Commit 8fe0aed · 1 Parent(s): 93d33ed
Update README.md

README.md CHANGED
@@ -44,37 +44,25 @@ messages.append({"role": "user", "content": "世界上第二高的山峰是哪

Before:
response = model.chat(tokenizer, messages)
print(response)
```

- Here is an example of a conversation using Baichuan-13B-Chat; the correct output is "K2. The world's second highest peak - K2, also known as Mount Godwin-Austen or Chhogori, with an altitude of 8611 meters, is located on the China-Pakistan border in the Karakoram Range."
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig
- tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-13B-Chat", use_fast=False, trust_remote_code=True)
- model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-13B-Chat", device_map="auto", torch_dtype=torch.float16, trust_remote_code=True)
- model.generation_config = GenerationConfig.from_pretrained("baichuan-inc/Baichuan-13B-Chat")
messages = []
messages.append({"role": "user", "content": "Which mountain is the second highest one in the world?"})
response = model.chat(tokenizer, messages)
print(response)
```

- Baichuan-13B 支持 int8 和 int4 量化，用户只需在推理代码中简单修改两行即可实现。请注意，如果是为了节省显存而进行量化，应加载原始精度模型到 CPU 后再开始量化；避免在 `from_pretrained` 时添加 `device_map='auto'` 或者其它会导致把原始精度模型直接加载到 GPU 的行为的参数。
- Baichuan-13B supports int8 and int4 quantization; users only need to change two lines of the inference code. Please note that if quantization is done to save GPU memory, the original-precision model should be loaded onto the CPU before quantization begins: avoid passing `device_map='auto'` (or any other argument that loads the original-precision model directly onto the GPU) to `from_pretrained`.

- 使用 int8 量化 (To use int8 quantization):
```python
- model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-13B-Chat", torch_dtype=torch.float16, trust_remote_code=True)
- model = model.quantize(8).cuda()
- ```

- 同样的，如需使用 int4 量化 (Similarly, to use int4 quantization):
- ```python
- model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-13B-Chat", torch_dtype=torch.float16, trust_remote_code=True)
- model = model.quantize(4).cuda()
```
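The removed paragraph above stresses quantizing from a CPU-loaded, original-precision model specifically to save GPU memory. One hedged way to confirm the savings after `model.quantize(8).cuda()` is to query PyTorch's CUDA allocator; this is a minimal sketch, not taken from either version of the README:

```python
import torch

# After model = model.quantize(8).cuda(), report the GPU memory currently
# allocated by tensors, as a rough check that int8 reduced the footprint.
print(f"Allocated GPU memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
```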

## 模型详情 (Model Details)

After:

response = model.chat(tokenizer, messages)
print(response)
```

+ ## int8 量化部署 (int8 Quantized Deployment)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig
+ tokenizer = AutoTokenizer.from_pretrained("trillionmonster/Baichuan-13B-Chat-8bit", use_fast=False, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("trillionmonster/Baichuan-13B-Chat-8bit", device_map="auto", trust_remote_code=True)
+ model.generation_config = GenerationConfig.from_pretrained("trillionmonster/Baichuan-13B-Chat-8bit")
messages = []
messages.append({"role": "user", "content": "Which mountain is the second highest one in the world?"})
response = model.chat(tokenizer, messages)
print(response)
```

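To continue the conversation, the reply can be fed back into the history before the next turn. The sketch below assumes the chat history keeps the same role/content message format used above; the follow-up question is only illustrative:

```python
# Multi-turn sketch: append the assistant's reply, then ask a follow-up
# (the follow-up question is hypothetical, not from the README).
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "How high is the third highest one?"})
response = model.chat(tokenizer, messages)
print(response)
```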
+ 如需使用 int4 量化 (Similarly, to use int4 quantization):
```python
+ model = AutoModelForCausalLM.from_pretrained("trillionmonster/Baichuan-13B-Chat-8bit", device_map="auto", load_in_4bit=True, trust_remote_code=True)
```
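For reference, `load_in_4bit=True` goes through the `bitsandbytes` integration in `transformers`; an equivalent, more explicit spelling uses `BitsAndBytesConfig`. The sketch below is an assumption about how this could be written, and the compute dtype is not taken from the README:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Spell out the 4-bit options instead of passing load_in_4bit=True directly.
# bnb_4bit_compute_dtype=torch.float16 is an assumed choice, not from the README.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "trillionmonster/Baichuan-13B-Chat-8bit",
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=True,
)
```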

## 模型详情 (Model Details)