wenge-research committed
Commit 198b8d9
1 Parent(s): 00be6c9

Update README.md

Files changed (1)
  1. README.md +67 -1
README.md CHANGED
@@ -67,4 +67,70 @@ print(tokenizer.decode(response[0]))
  ## Acknowledgements
  - This project uses the weights of BigScience's [bloomz-7b-mt](https://huggingface.co/bigscience/bloomz-7b1-mt) model as initialization weights, with vocabulary expansion;
  - The training code of this project references Databricks' [dolly](https://github.com/databrickslabs/dolly) project and Huggingface's [transformers](https://github.com/huggingface/transformers) library;
- - Distributed training in this project uses Microsoft's [DeepSpeed](https://github.com/microsoft/deepspeed) distributed training tool and the [ZeRO stage 2](https://huggingface.co/docs/transformers/main_classes/deepspeed#zero2-config) configuration file from the Huggingface transformers documentation;
+ - Distributed training in this project uses Microsoft's [DeepSpeed](https://github.com/microsoft/deepspeed) distributed training tool and the [ZeRO stage 2](https://huggingface.co/docs/transformers/main_classes/deepspeed#zero2-config) configuration file from the Huggingface transformers documentation;
+
+
+ ---
+
+ # YaYi
+
+ ## Introduction
+ [YaYi](https://www.wenge.com/yayi/index.html) was fine-tuned on millions of manually constructed, high-quality domain instruction samples. The training data covers five key domains: media publicity, public opinion analysis, public safety, financial risk control, and urban governance, and spans more than a hundred natural language instruction tasks. Throughout YaYi's iterative development, from pre-trained initialization weights to the domain-specific model, we have steadily strengthened its foundational Chinese language capabilities and domain analysis capabilities, introduced multi-turn conversation enhancements, and integrated various plug-in capabilities. Furthermore, continuous manual feedback and optimization from hundreds of users during internal testing has allowed us to refine the model's performance and safety.
+
+ By open-sourcing the YaYi model, we contribute to the development of the open-source community around Chinese pre-trained large language models, and we look forward to building the YaYi model ecosystem together with every partner.
+
+ ## Run
+
+ Below is a simple example of invoking `yayi-7b` for downstream-task inference. It runs on a single GPU such as an A100/A800/3090 and occupies roughly 20 GB of GPU memory when performing inference at 16-bit precision (the example loads the model in bfloat16). If you need the training data or want to fine-tune the model based on `yayi-7b`, please refer to our [💻Github Repo](https://github.com/wenge-research/YaYi).
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
+ import torch
+
+ # Load the tokenizer and the model in bfloat16, placing it automatically on the available GPU(s)
+ yayi_7b_path = "wenge-research/yayi-7b"
+ tokenizer = AutoTokenizer.from_pretrained(yayi_7b_path)
+ model = AutoModelForCausalLM.from_pretrained(yayi_7b_path, device_map="auto", torch_dtype=torch.bfloat16)
+
+ # Wrap the user query in YaYi's instruction format: system prompt, human turn, then the assistant tag
+ prompt = "你好"
+ formatted_prompt = f"<|System|>:\nA chat between a human and an AI assistant named YaYi.\nYaYi is a helpful and harmless language model developed by Beijing Wenge Technology Co.,Ltd.\n\n<|Human|>:\n{prompt}\n\n<|YaYi|>:"
+ inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
+
+ # <|End|> marks the end of a YaYi response, so its token id is used as the EOS id
+ eos_token_id = tokenizer("<|End|>").input_ids[0]
+ generation_config = GenerationConfig(
+     eos_token_id=eos_token_id,
+     pad_token_id=eos_token_id,
+     do_sample=True,
+     max_new_tokens=100,
+     temperature=0.3,
+     repetition_penalty=1.1,
+     no_repeat_ngram_size=0
+ )
+ response = model.generate(**inputs, generation_config=generation_config)
+ print(tokenizer.decode(response[0]))
+ ```
+
+ Please note that a special token, `<|End|>`, was added as an end-of-sequence marker during model training; the `eos_token_id` in the `GenerationConfig` above should therefore be set to the token id of this marker.
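+
+ If you only want the model's reply rather than the full decoded sequence, one option is to drop the prompt tokens and cut the text at the `<|End|>` marker. The snippet below is a minimal post-processing sketch, not part of the original example; it reuses `inputs` and `response` from above, the variable names (`prompt_len`, `generated_text`, `answer`) are illustrative, and it assumes `<|End|>` appears as literal text in the decoded output.
+
+ ```python
+ # Hypothetical post-processing: keep only the newly generated tokens,
+ # then trim everything from the <|End|> marker onwards.
+ prompt_len = inputs["input_ids"].shape[1]
+ generated_text = tokenizer.decode(response[0][prompt_len:], skip_special_tokens=False)
+ answer = generated_text.split("<|End|>")[0].strip()
+ print(answer)
+ ```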
+
+ ## Related agreements
+
+ ### Limitations
+ The SFT model trained on the current data and base model still exhibits the following issues:
+
+ 1. It may generate factually incorrect responses to factual instructions.
+ 2. It struggles to reliably identify harmful instructions, which can lead to the generation of harmful content.
+ 3. Its capabilities in scenarios involving logical reasoning, code generation, scientific computation, and similar tasks still need improvement.
+
+ ### Disclaimer
+
+ Given the limitations described above, we ask that developers use the code, data, models, and any derivatives of this project for research purposes only, and refrain from using them for commercial purposes or any other purposes potentially harmful to society. Please evaluate and use content generated by the YaYi model with caution, and do not spread harmful content on the internet; the disseminator bears responsibility for any adverse consequences of doing so.
+
+ This project is intended for research purposes only, and the project developers bear no responsibility for any harm or loss arising from the use of this project, including but not limited to its data, models, and code. For details, please refer to the [Disclaimer](DISCLAIMER).
+
+ ### License
+
+ The code in this project is released under the [Apache-2.0](LICENSE) license, the data follows the [CC BY-NC 4.0](LICENSE_DATA) license, and use of the YaYi series model weights must adhere to the [Model License](LICENSE_MODEL).
+
+ ## Acknowledgements
+ - In this project, we used model weights from BigScience's [bloomz-7b1-mt](https://huggingface.co/bigscience/bloomz-7b1-mt) and Meta's [Llama 2](https://huggingface.co/meta-llama) series as initialization weights, with vocabulary expansion.
+ - The training code in this project was inspired by Databricks' [dolly](https://github.com/databrickslabs/dolly) project and Huggingface's [transformers](https://github.com/huggingface/transformers) library.
+ - Distributed training in this project used Microsoft's [DeepSpeed](https://github.com/microsoft/deepspeed) distributed training tool and the [ZeRO stage 2](https://huggingface.co/docs/transformers/main_classes/deepspeed#zero2-config) configuration from the Huggingface transformers documentation.