Stanislas committed on
Commit
7222297
1 Parent(s): 711c7bf

Initial commit

Files changed (3)
  1. MODEL_LICENSE +6 -38
  2. README.md +84 -3
  3. resources/codegeex_logo.png +0 -0
MODEL_LICENSE CHANGED
@@ -1,52 +1,20 @@
- The CodeGeeX2-6B License
-
- 1. 定义
-
- “许可方”是指分发其软件的 CodeGeeX2-6B 模型团队。
-
- “软件”是指根据本许可提供的 CodeGeeX2-6B 模型参数。
-
- 2. 许可授予
-
- 根据本许可的条款和条件,许可方特此授予您非排他性、全球性、不可转让、不可再许可、可撤销、免版税的版权许可。
-
- 上述版权声明和本许可声明应包含在本软件的所有副本或重要部分中。
-
- 3. 限制
-
- 您不得出于任何军事或非法目的使用、复制、修改、合并、发布、分发、复制或创建本软件的全部或部分衍生作品。
-
- 您不得利用本软件从事任何危害国家安全和国家统一、危害社会公共利益、侵犯人身权益的行为。
-
- 4. 免责声明
-
- 本软件“按原样”提供,不提供任何明示或暗示的保证,包括但不限于对适销性、特定用途的适用性和非侵权性的保证。 在任何情况下,作者或版权持有人均不对任何索赔、损害或其他责任负责,无论是在合同诉讼、侵权行为还是其他方面,由软件或软件的使用或其他交易引起、由软件引起或与之相关 软件。
-
- 5. 责任限制
-
- 除适用法律禁止的范围外,在任何情况下且根据任何法律理论,无论是基于侵权行为、疏忽、合同、责任或其他原因,任何许可方均不对您承担任何直接、间接、特殊、偶然、示范性、 或间接损害,或任何其他商业损失,即使许可人已被告知此类损害的可能性。
-
- 6.争议解决
-
- 本许可受中华人民共和国法律管辖并按其解释。 因本许可引起的或与本许可有关的任何争议应提交北京市海淀区人民法院。
-
- 请注意,许可证可能会更新到更全面的版本。 有关许可和版权的任何问题,请通过 [email protected] 与我们联系。
+ The CodeGeeX License

  1. Definitions

- “Licensor” means the CodeGeeX2-6B Model Team that distributes its Software.
+ “Licensor” means the CodeGeeX Model Team that distributes its Software.

- “Software” means the CodeGeeX2-6B model parameters made available under this license.
+ “Software” means the CodeGeeX model parameters made available under this license.

  2. License Grant

- Subject to the terms and conditions of this License, the Licensor hereby grants to you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license to use the Software.
+ Subject to the terms and conditions of this License, the Licensor hereby grants to you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license to use the Software solely for your non-commercial research purposes.

  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

  3. Restriction

- You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any military, or illegal purposes.
+ You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any commercial, military, or illegal purposes.

  You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings.

@@ -62,4 +30,4 @@ EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGA

  This license shall be governed and construed in accordance with the laws of People’s Republic of China. Any dispute arising from or in connection with this License shall be submitted to Haidian District People's Court in Beijing.

- Note that the license is subject to update to a more comprehensive version. For any questions related to the license and copyright, please contact us at license@zhipuai.cn.
+ Note that the license is subject to update to a more comprehensive version. For any questions related to the license and copyright, please contact us at report@aminer.cn.
README.md CHANGED
@@ -1,3 +1,84 @@
- ---
- license: apache-2.0
- ---
+ ![](resources/codegeex_logo.png)
+
+ <p align="center">
+ 🏠 <a href="https://codegeex.cn" target="_blank">Homepage</a>|🛠 Tools <a href="https://marketplace.visualstudio.com/items?itemName=aminer.codegeex" target="_blank">VS Code</a>, <a href="https://plugins.jetbrains.com/plugin/20587-codegeex" target="_blank">JetBrains</a>|🤗 <a href="https://huggingface.co/THUDM/codegeex2-6b" target="_blank">HF Repo</a>|📄 <a href="https://arxiv.org/abs/2303.17568" target="_blank">Paper</a>|👋 Join our <a href="https://wj.qq.com/s2/11274205/a15b/" target="_blank">WeChat</a>
+ </p>
+
+ # CodeGeeX2: 更强大的多语言代码生成模型 | A More Powerful Multilingual Code Generation Model
+
+ CodeGeeX2 是多语言代码生成模型 [CodeGeeX](https://github.com/THUDM/CodeGeeX) ([KDD’23](https://arxiv.org/abs/2303.17568)) 的第二代模型。CodeGeeX2 基于 [ChatGLM2](https://github.com/THUDM/ChatGLM2-6B) 架构加入代码预训练实现,得益于 ChatGLM2 的更优性能,CodeGeeX2 在多项指标上取得性能提升(+107% > CodeGeeX;仅60亿参数即超过150亿参数的 StarCoder-15B 近10%),更多特性包括:
+
+ * **更强大的代码能力**:基于 ChatGLM2-6B 基座语言模型,CodeGeeX2-6B 进一步经过了 600B 代码数据预训练,相比一代模型,在代码能力上全面提升,[HumanEval-X](https://huggingface.co/datasets/THUDM/humaneval-x) 评测集的六种编程语言均大幅提升 (Python +57%, C++ +71%, Java +54%, JavaScript +83%, Go +56%, Rust +321\%),在Python上达到 35.9\% 的 Pass@1 一次通过率,超越规模更大的 StarCoder-15B。
+ * **更优秀的模型特性**:继承 ChatGLM2-6B 模型特性,CodeGeeX2-6B 更好支持中英文输入,支持最大 8192 序列长度,推理速度较一代 CodeGeeX-13B 大幅提升,量化后仅需6GB显存即可运行,支持轻量级本地化部署。
+ * **更全面的AI编程助手**:CodeGeeX插件([VS Code](https://marketplace.visualstudio.com/items?itemName=aminer.codegeex), [Jetbrains](https://plugins.jetbrains.com/plugin/20587-codegeex))后端升级,支持超过100种编程语言,新增上下文补全、跨文件补全等实用功能。结合 Ask CodeGeeX 交互式AI编程助手,支持中英文对话解决各种编程问题,包括且不限于代码解释、代码翻译、代码纠错、文档生成等,帮助程序员更高效开发。
+ * **更开放的协议**:CodeGeeX2-6B 权重对学术研究完全开放,填写[问卷](https://open.bigmodel.cn/mla/form)申请商业使用。
+
+
+ CodeGeeX2 is the second generation of the multilingual code generation model [CodeGeeX](https://github.com/THUDM/CodeGeeX) ([KDD’23](https://arxiv.org/abs/2303.17568)). It is built on the [ChatGLM2](https://github.com/THUDM/ChatGLM2-6B) architecture and trained on more code data. Thanks to ChatGLM2's stronger performance, CodeGeeX2 improves coding capability across the board (+107% over CodeGeeX; with only 6B parameters, it surpasses the larger StarCoder-15B on some tasks). Its features include:
+
+ * **More Powerful Coding Capabilities**: Built on the ChatGLM2-6B base model, CodeGeeX2-6B is further pre-trained on 600B code tokens and improves comprehensively on the first generation. On the [HumanEval-X](https://huggingface.co/datasets/THUDM/humaneval-x) benchmark, all six languages improve significantly (Python +57%, C++ +71%, Java +54%, JavaScript +83%, Go +56%, Rust +321%), and in Python it reaches a Pass@1 of 35.9%, surpassing the larger StarCoder-15B.
+ * **More Useful Features**: Inheriting ChatGLM2-6B's features, CodeGeeX2-6B better supports both Chinese and English prompts and a maximum sequence length of 8192, and its inference is significantly faster than the first-generation CodeGeeX-13B. After quantization it needs only 6GB of GPU memory for inference, which allows lightweight local deployment.
+ * **Comprehensive AI Coding Assistant**: The backend of the CodeGeeX plugin ([VS Code](https://marketplace.visualstudio.com/items?itemName=aminer.codegeex), [JetBrains](https://plugins.jetbrains.com/plugin/20587-codegeex)) has been upgraded. It supports 100+ programming languages and adds practical functions such as infilling and cross-file completion. Combined with the "Ask CodeGeeX" interactive AI coding assistant, it can solve various programming problems through Chinese or English dialogue, including but not limited to code summarization, code translation, debugging, and comment generation, helping developers work more efficiently.
+ * **Open License**: CodeGeeX2-6B weights are fully open to academic research. To apply for commercial use, please fill in the [application form](https://open.bigmodel.cn/mla/form).
+
+
+ ## 软件依赖 | Dependency
+
+ ```shell
+ pip install protobuf transformers==4.30.2 cpm_kernels "torch>=2.0" gradio mdtex2html sentencepiece accelerate
+ ```
+
+ ## 快速开始 | Get Started
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+
+ # trust_remote_code=True runs the model code shipped with the checkpoint.
+ tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
+ model = AutoModel.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True, device='cuda')
+ model = model.eval()
+
+ # Remember to add a language tag for better performance.
+ prompt = "# language: python\n# write a bubble sort function\n"
+ inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
+ outputs = model.generate(inputs, max_length=256, top_k=1)
+ response = tokenizer.decode(outputs[0])
+
+ >>> print(response)
+ # language: python
+ # write a bubble sort function
+
+
+ def bubble_sort(list):
+     for i in range(len(list) - 1):
+         for j in range(len(list) - 1):
+             if list[j] > list[j + 1]:
+                 list[j], list[j + 1] = list[j + 1], list[j]
+     return list
+
+
+ print(bubble_sort([5, 2, 4, 6, 1, 3]))
+ ```
+
+ 关于更多的使用说明,请参考 CodeGeeX2 的 [Github Repo](https://github.com/THUDM/CodeGeeX2)。
+
+ For more information, please refer to CodeGeeX2's [Github Repo](https://github.com/THUDM/CodeGeeX2).
+
+ ## 协议 | License
+
+ 本仓库的代码依照 [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) 协议开源,模型的权重的使用则需要遵循 [Model License](MODEL_LICENSE)。
+
+ The code in this repository is open source under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license. The model weights are licensed under the [Model License](MODEL_LICENSE).
+
+ ## 引用 | Citation
+
+ 如果觉得我们的工作有帮助,欢迎引用以下论文:
+
+ If you find our work helpful, please feel free to cite the following paper:
+
+ ```
+ @inproceedings{zheng2023codegeex,
+   title={CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X},
+   author={Qinkai Zheng and Xiao Xia and Xu Zou and Yuxiao Dong and Shan Wang and Yufei Xue and Zihan Wang and Lei Shen and Andi Wang and Yang Li and Teng Su and Zhilin Yang and Jie Tang},
+   booktitle={KDD},
+   year={2023}
+ }
+ ```
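
Note on the Get Started snippet in README.md: the prompt there begins with a language tag line (`# language: python`) followed by the instruction as a comment, and the printed response echoes the prompt before the generated code. Below is a minimal sketch of how such prompts might be built and how the echoed prompt might be stripped. `build_prompt`, `strip_prompt`, and the `COMMENT_PREFIX` table are hypothetical helpers introduced here for illustration and are not part of this commit. Only the `# language: python` form appears in the README, so the tag format for other languages is an assumption.

```python
# Hypothetical helpers, not part of the CodeGeeX2 repository.
# Assumption: other languages use the same "language: <name>" tag behind
# their own line-comment prefix; the README only shows the Python form.
COMMENT_PREFIX = {
    "python": "#",
    "c++": "//",
    "java": "//",
    "javascript": "//",
    "go": "//",
    "rust": "//",
}


def build_prompt(language: str, instruction: str) -> str:
    """Return a language tag line followed by the instruction as a comment."""
    key = language.lower()
    if key not in COMMENT_PREFIX:
        raise ValueError(f"no comment prefix known for {language!r}")
    prefix = COMMENT_PREFIX[key]
    return f"{prefix} language: {key}\n{prefix} {instruction}\n"


def strip_prompt(prompt: str, response: str) -> str:
    """Drop the echoed prompt from a decoded response, if it is present."""
    if response.startswith(prompt):
        return response[len(prompt):]
    # The decoded text may not start with the prompt verbatim (for example
    # if the tokenizer adds special tokens), so fall back to the full text.
    return response
```

With these helpers, `build_prompt("python", "write a bubble sort function")` yields the same prompt string used in the README snippet, and `strip_prompt(prompt, response)` would leave only the generated function when the decoded output starts with the prompt.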
resources/codegeex_logo.png ADDED