ChloeAuYeung committed on
Commit
a6d67b5
1 Parent(s): 465974b

upload files

README.md CHANGED
@@ -1,3 +1,191 @@
1
  ---
2
  license: apache-2.0
3
+
4
+ inference: false
5
+
6
  ---
7
+
8
+ # XVERSE-MoE-A4.2B
9
+
10
+ ## 更新信息
11
+ - **[2024/04/02]** 发布 MoE 架构的 **XVERSE-MoE-A4.2B** 底座模型,Chat 对齐模型将在后续发布。
12
+
13
+ ## Update Information
14
+ - **[2024/04/02]** Released the **XVERSE-MoE-A4.2B** MoE base model; the Chat-aligned model will be released later.
15
+
16
+ ## 模型介绍
17
+
18
+ **XVERSE-MoE-A4.2B** 是由深圳元象科技自主研发的支持多语言的大语言模型(Large Language Model),使用混合专家模型(MoE,Mixture-of-experts)架构,模型的总参数规模为 258 亿,实际激活的参数量为 42 亿,本次开源的模型为底座模型 **XVERSE-MoE-A4.2B**,主要特点如下:
19
+
20
+ - **模型结构**:XVERSE-MoE-A4.2B 为 Decoder-only 的 Transformer 架构,将密集模型的 FFN 层扩展为专家层,不同于传统 MoE 中每个专家的大小与标准 FFN 相同(如Mixtral 8x7B ),使用了更细粒度的专家,每个专家是标准 FFN 大小的 1/4,并设置了共享专家(Shared Expert)和非共享专家(Non-shared Expert)两类,共享专家在计算时始终被激活,非共享专家通过 Router 选择性激活。
21
+ - **训练数据**:构建了 3.2 万亿 token 的高质量、多样化的数据对模型进行充分训练,包含中、英、俄、西等 40 多种语言,通过精细化设置不同类型数据的采样比例,使得中英两种语言表现优异,也能兼顾其他语言效果;模型使用 8K 长度的训练样本进行训练。
22
+ - **训练框架**:针对 MoE 模型中独有的专家路由和权重计算逻辑,进行了深入定制优化,开发出一套高效的融合算子,以提升计算效率。同时,为解决 MoE 模型显存占用和通信量大的挑战,设计了计算、通信和 CPU-Offload 的 Overlap 处理方式,从而提高整体吞吐量。
23
+
24
+ **XVERSE-MoE-A4.2B** 的模型大小、架构和学习率如下:
25
+
26
+ | total params | activated params | n_layers | d_model | n_heads | d_ff | n_non_shared_experts | n_shared_experts | top_k | lr |
27
+ | :----------: | :--------------: | :------: | :-----: | :-----: | :--: | :------------------: | :--------------: | :---: | :----: |
28
+ | 25.8B | 4.2B | 28 | 2560 | 32 | 1728 | 64 | 2 | 6 | 3.5e−4 |
29
+
30
+ ## Model Introduction
31
+
32
+ **XVERSE-MoE-A4.2B** is a multilingual large language model (LLM) independently developed by Shenzhen Yuanxiang Technology. It uses a Mixture-of-Experts (MoE) architecture with 25.8 billion total parameters, of which 4.2 billion are actually activated. The model released this time is the base model **XVERSE-MoE-A4.2B**. Its key features are as follows:
33
+
34
+ - **Model Structure**: XVERSE-MoE-A4.2B uses a decoder-only Transformer architecture in which the FFN layers of a dense model are extended into expert layers. Unlike traditional MoE models, where each expert is the same size as a standard FFN (as in Mixtral 8x7B), it uses finer-grained experts, each 1/4 the size of a standard FFN. Experts fall into two classes: shared experts, which are always activated during computation, and non-shared experts, which are selectively activated by a router.
35
+ - **Training Data**: The model was trained on a diverse, high-quality dataset of 3.2 trillion tokens covering more than 40 languages, including Chinese, English, Russian, and Spanish. The sampling ratios of the different data types were tuned so that performance is strong in Chinese and English while other languages are also accommodated. Training samples have a length of 8K tokens.
36
+ - **Training Framework**: We customized and optimized the expert routing and weight calculation logic that is unique to MoE models, developing a set of efficient fused operators to improve computational efficiency. To address the high memory consumption and communication volume of MoE models, we also designed a scheme that overlaps computation, communication, and CPU offloading to increase overall throughput.
37
+
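The split between always-on shared experts and router-selected experts described above can be illustrated with a small, dependency-free sketch. It only shows the selection step for one token: softmax over router logits followed by a top-k pick. The function name `route` and the random logits are hypothetical, and whether the selected weights are renormalized afterwards is not stated in the README, so the sketch leaves them as raw softmax probabilities.

```python
import math
import random

NUM_ROUTED = 64  # n_non_shared_experts in the README table
NUM_SHARED = 2   # n_shared_experts
TOP_K = 6        # top_k


def route(router_logits, top_k=TOP_K):
    """Return (expert_index, probability) pairs for the top_k routed experts."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:top_k]]


# Hypothetical router output for a single token.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]

chosen = route(logits)
# Shared experts are evaluated for every token, on top of the routed ones.
active_experts = NUM_SHARED + len(chosen)

print(chosen)
print("experts evaluated for this token:", active_experts)
```

With the values from the table, each token therefore passes through 8 of the 66 experts in a layer.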
38
+ The model size, architecture, and learning rate of **XVERSE-MoE-A4.2B** are shown below:
39
+
40
+ | total params | activated params | n_layers | d_model | n_heads | d_ff | n_non_shared_experts | n_shared_experts | top_k | lr |
41
+ | :----------: | :--------------: | :------: | :-----: | :-----: | :--: | :------------------: | :--------------: | :---: | :----: |
42
+ | 25.8B | 4.2B | 28 | 2560 | 32 | 1728 | 64 | 2 | 6 | 3.5e−4 |
43
+
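As a rough sanity check on the table above, the following sketch recomputes the total and activated parameter counts from the listed hyperparameters. It is a back-of-the-envelope estimate under assumptions the README does not state: each expert is taken to be a gated FFN with three bias-free projection matrices, attention is taken to have four bias-free `d_model x d_model` projections, and normalization weights are ignored. The vocabulary size and the untied embeddings come from the `config.json` in this commit.

```python
# Back-of-the-envelope parameter count from the hyperparameter table.
# Assumptions (not stated in the README): each expert is a gated FFN with
# three bias-free matrices, attention has four bias-free d_model x d_model
# projections, and normalization weights are ignored.
d_model, d_ff, n_layers = 2560, 1728, 28
n_routed, n_shared, top_k = 64, 2, 6
vocab_size = 100534  # from config.json; input and output embeddings are untied

expert = 3 * d_model * d_ff            # gate, up and down projections
attention = 4 * d_model * d_model      # q, k, v and output projections
router = d_model * n_routed            # one logit per routed expert
embeddings = 2 * vocab_size * d_model  # input embedding + output head

total = n_layers * ((n_routed + n_shared) * expert + attention + router) + embeddings
active = n_layers * ((top_k + n_shared) * expert + attention + router) + embeddings

print(f"total  ~ {total / 1e9:.2f}B")
print(f"active ~ {active / 1e9:.2f}B")
```

Under these assumptions the estimate comes to roughly 25.8B total and 4.2B activated parameters, which agrees with the first two columns of the table.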
44
+ ## 评测结果
45
+
46
+ 为了综合评估模型的性能,我们在一系列标准数据集上进行了全面测试,包括C-Eval、CMMLU、Gaokao-Bench、MMLU、AGIEval、RACE-M、CommonSenseQA、PIQA、GSM8K和HumanEval。这些评估覆盖了模型在多个领域的能力,具体包括中文问答、英文问答、语言理解、常识问答、逻辑推理、数学问题解答以及编程能力。评估结果如下:
47
+
48
+ | 数据集 | XVERSE-MoE-A4.2B-2.7T | XVERSE-13B-2-2.7T | Baichuan2-13B | Llama2-13B | Llama1-65B | XVERSE-7B | DeepSeek-7B | Mistral-7B | Gemma-7B |
49
+ | ------------------------ | :-------------------: | :---------------: | :-----------: | :--------: | :--------: | :-------: | :---------: | :--------: | :------: |
50
+ | C-Eval | 60.5 | 62.0 | 58.1 | 35.6 | 38.8 | 57.1 | 45.0 | 45.1 | 50.0 |
51
+ | CMMLU | 64.5 | 65.4 | 62.0 | 38.4 | 40.6 | 61.3 | 47.2 | 44.9 | 50.5 |
52
+ | Gaokao-Bench<sup>1</sup> | 60.3 | 65.3 | 54.3 | 35.4 | 38.9 | 61.7 | 35.4 | 40.2 | 42.3 |
53
+ | MMLU | 60.2 | 60.0 | 59.2 | 54.8 | 63.4 | 56.6 | 48.2 | 62.5 | 64.3 |
54
+ | AGIEval<sup>1</sup> | 48.0 | 52.4 | 48.2 | 33.4 | 42.4 | 46.9 | 26.4 | 41.2 | 41.7 |
55
+ | RACE-M | 75.4 | 82.4 | 68.9 | 63.0 | 67.9 | 79.0 | 63.2 | 67.5 | 80.2 |
56
+ | CommonSenseQA | 70.0 | 68.0 | 65.6 | 67.3 | 74.0 | 64.1 | 56.4 | 68.8 | 74.0 |
57
+ | PIQA | 81.4 | 79.8 | 78.5 | 80.5 | 82.8 | 76.7 | 79.2 | 82.2 | 81.2 |
58
+ | GSM8K | 51.2 | 52.7 | 52.7 | 28.7 | 50.9 | 19.3 | 17.4 | 35.4 | 46.4 |
59
+ | HumanEval | 29.9 | 32.3 | 17.1 | 18.3 | 23.7 | 10.4 | 26.2 | 26.2 | 32.3 |
60
+
61
+ > <sup>1:只针对其中的单项选择题进行测试,即排除了填空题、开放性问题和多项选择题</sup>
62
+
63
+ 对于上述所有比较模型,我们优先汇报其官方公布的结果。在缺少官方结果的情况下,我们采用了 [OpenCompass 榜单](https://opencompass.org.cn/leaderboard-llm)的报告结果。其他结果则来自于我们自行执行的评估流程所获得的数据。
64
+ 对于 MMLU ,我们采用作者提供的[评测工具](https://github.com/hendrycks/test),C-Eval、AGIEval、GAOKAO-Bench 与 MMLU 的评测方式相同,其余评测数据集使用 [OpenCompass 评估框架](https://github.com/open-compass/OpenCompass/)进行评估。
65
+
66
+ ## Model Evaluation
67
+
68
+ To comprehensively assess the model's performance, we ran extensive tests on a range of standard datasets: C-Eval, CMMLU, Gaokao-Bench, MMLU, AGIEval, RACE-M, CommonSenseQA, PIQA, GSM8K, and HumanEval. These evaluations cover multiple capabilities, including Chinese question answering, English question answering, language comprehension, commonsense question answering, logical reasoning, mathematical problem-solving, and coding. The results are as follows:
69
+
70
+ | Dataset | XVERSE-MoE-A4.2B-2.7T | XVERSE-13B-2-2.7T | Baichuan2-13B | Llama2-13B | Llama1-65B | XVERSE-7B | DeepSeek-7B | Mistral-7B | Gemma-7B |
71
+ | ------------------------ | :-------------------: | :---------------: | :-----------: | :--------: | :--------: | :-------: | :---------: | :--------: | :------: |
72
+ | C-Eval | 60.5 | 62.0 | 58.1 | 35.6 | 38.8 | 57.1 | 45.0 | 45.1 | 50.0 |
73
+ | CMMLU | 64.5 | 65.4 | 62.0 | 38.4 | 40.6 | 61.3 | 47.2 | 44.9 | 50.5 |
74
+ | Gaokao-Bench<sup>1</sup> | 60.3 | 65.3 | 54.3 | 35.4 | 38.9 | 61.7 | 35.4 | 40.2 | 42.3 |
75
+ | MMLU | 60.2 | 60.0 | 59.2 | 54.8 | 63.4 | 56.6 | 48.2 | 62.5 | 64.3 |
76
+ | AGIEval<sup>1</sup> | 48.0 | 52.4 | 48.2 | 33.4 | 42.4 | 46.9 | 26.4 | 41.2 | 41.7 |
77
+ | RACE-M | 75.4 | 82.4 | 68.9 | 63.0 | 67.9 | 79.0 | 63.2 | 67.5 | 80.2 |
78
+ | CommonSenseQA | 70.0 | 68.0 | 65.6 | 67.3 | 74.0 | 64.1 | 56.4 | 68.8 | 74.0 |
79
+ | PIQA | 81.4 | 79.8 | 78.5 | 80.5 | 82.8 | 76.7 | 79.2 | 82.2 | 81.2 |
80
+ | GSM8K | 51.2 | 52.7 | 52.7 | 28.7 | 50.9 | 19.3 | 17.4 | 35.4 | 46.4 |
81
+ | HumanEval | 29.9 | 32.3 | 17.1 | 18.3 | 23.7 | 10.4 | 26.2 | 26.2 | 32.3 |
82
+
83
+ > <sup>1: Tests are conducted only on single-answer multiple-choice questions, thus excluding fill-in-the-blanks, open-ended questions, and multiple-answer multiple-choice questions.</sup>
84
+
85
+ For all the comparison models above, we report their officially published results where available. In the absence of official results, we use the results reported on the [OpenCompass Leaderboard](https://opencompass.org.cn/leaderboard-llm). All remaining results come from our own evaluation pipeline.
86
+ For MMLU, we use the [evaluation tools](https://github.com/hendrycks/test) provided by the authors; C-Eval, AGIEval, and GAOKAO-Bench are evaluated in the same way as MMLU. The remaining datasets are evaluated with the [OpenCompass](https://github.com/open-compass/OpenCompass/) framework.
87
+
88
+ ## 使用方法
89
+
90
+ ### 环境安装
91
+
92
+ 1. 下载本仓库:
93
+
94
+ ```shell
95
+ git clone https://github.com/xverse-ai/XVERSE-MoE-A4.2B
96
+ cd XVERSE-MoE-A4.2B
97
+ ```
98
+
99
+ 2. 使用 pip 安装依赖:
100
+
101
+ ```shell
102
+ pip install -r requirements.txt
103
+ ```
104
+ ### Transformers 加载方式
105
+
106
+ 可通过以下代码加载 XVERSE-MoE-A4.2B 模型来进行推理:
107
+
108
+ ```python
109
+ import torch
110
+ from transformers import AutoTokenizer, AutoModelForCausalLM
111
+ tokenizer = AutoTokenizer.from_pretrained("xverse/XVERSE-MoE-A4.2B")
112
+ model = AutoModelForCausalLM.from_pretrained("xverse/XVERSE-MoE-A4.2B", trust_remote_code=True, torch_dtype=torch.bfloat16, device_map='auto')
113
+ model = model.eval()
114
+ inputs = tokenizer('北京的景点:故宫、天坛、万里长城等。\n深圳的景点:', return_tensors='pt').input_ids
115
+ inputs = inputs.cuda()
116
+ generated_ids = model.generate(inputs, max_new_tokens=64, eos_token_id=tokenizer.eos_token_id, repetition_penalty=1.1)
117
+ print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
118
+ ```
119
+
120
+ ### 网页 Demo
121
+
122
+ 可通过以下代码启动一个web server,在浏览器输入访问地址后,可使用 XVERSE-MoE-A4.2B 模型进行推理:
123
+
124
+ ```shell
125
+ python text_generation_demo.py --port='port' --model_path='/path/to/model/' --tokenizer_path='/path/to/tokenizer/'
126
+ ```
127
+
128
+ ## Usage
129
+
130
+ ### Environment Setup
131
+
132
+ 1. Clone this repository:
133
+
134
+ ```shell
135
+ git clone https://github.com/xverse-ai/XVERSE-MoE-A4.2B
136
+ cd XVERSE-MoE-A4.2B
137
+ ```
138
+
139
+ 2. Install the dependencies using pip:
140
+
141
+ ```shell
142
+ pip install -r requirements.txt
143
+ ```
144
+
145
+ ### Loading with Transformers
146
+
147
+ The XVERSE-MoE-A4.2B model can be loaded for inference using the following code:
148
+
149
+ ```python
150
+ import torch
151
+ from transformers import AutoTokenizer, AutoModelForCausalLM
152
+ tokenizer = AutoTokenizer.from_pretrained("xverse/XVERSE-MoE-A4.2B")
153
+ model = AutoModelForCausalLM.from_pretrained("xverse/XVERSE-MoE-A4.2B", trust_remote_code=True, torch_dtype=torch.bfloat16, device_map='auto')
154
+ model = model.eval()
155
+ inputs = tokenizer('北京的景点:故宫、天坛、万里长城等。\n深圳的景点:', return_tensors='pt').input_ids
156
+ inputs = inputs.cuda()
157
+ generated_ids = model.generate(inputs, max_new_tokens=64, eos_token_id=tokenizer.eos_token_id, repetition_penalty=1.1)
158
+ print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
159
+ ```
160
+
161
+ ### Web Demo
162
+
163
+ The following code can be used to start a web server. By entering the access address in the browser, you can perform inference with the XVERSE-MoE-A4.2B model:
164
+
165
+ ```shell
166
+ python chat_demo.py --port='port' --model_path='/path/to/model/' --tokenizer_path='/path/to/tokenizer/'
167
+ ```
168
+
169
+ ## 局限性与免责申明
170
+
171
+ XVERSE-MoE-A4.2B 与其他所有 LLM 一样,在某些情况下可能会产生不准确、有偏见或其他令人反感的内容。因此,请谨慎使用模型生成的内容,请勿将生成的有害内容进行传播,在部署任何 XVERSE-MoE-A4.2B 的应用之前,开发人员应根据其具体应用对模型进行安全测试和调优。
172
+
173
+ 我们强烈警告不要将 XVERSE-MoE-A4.2B 模型用于制造或传播有害信息,或进行任何可能损害公众、国家、社会安全或违反法规的活动。如果使用 XVERSE-MoE-A4.2B 模型产生任何问题,无论是数据安全问题、公共舆论风险,还是模型被误解、滥用、传播或不合规使用所引发的任何风险和问题,我们将不承担任何责任。
174
+
175
+ ## 模型开源协议
176
+
177
+ 使用本仓库的源码需要遵循 [Apache-2.0](LICENSE) 开源协议,使用 XVERSE-MoE-A4.2B 的模型权重则需要遵循[模型许可协议](MODEL_LICENSE.pdf)。
178
+
179
+ XVERSE-MoE-A4.2B 模型权重对学术研究**完全开放**,并且支持**免费商用**。如需申请商业许可证,请填写【[申请表](https://chat.xverse.cn/home/business.html)】,如有其他问题或合作,请联系 <[email protected]>。
180
+
181
+ ## Limitations and Disclaimer
182
+
183
+ Like all other Large Language Models (LLMs), XVERSE-MoE-A4.2B may produce inaccurate, biased, or otherwise offensive content under certain circumstances. Therefore, please use the content generated by the model with caution and refrain from disseminating harmful content. Before deploying any application of XVERSE-MoE-A4.2B, developers should conduct safety tests and optimization of the model according to its specific application.
184
+
185
+ We strongly warn against using the XVERSE-MoE-A4.2B model to produce or spread harmful information, or to conduct any activity that might harm the public, national security, or social security, or that violates regulations. We assume no responsibility for any problems arising from the use of the XVERSE-MoE-A4.2B model, whether data security issues, public opinion risks, or any risks and issues caused by the model being misunderstood, misused, disseminated, or used in a non-compliant manner.
186
+
187
+ ## Open Source License
188
+
189
+ The use of the source code in this repository must follow the [Apache-2.0](LICENSE) open-source license, while the use of the model weights of XVERSE-MoE-A4.2B needs to adhere to the [Model License Agreement](MODEL_LICENSE.pdf).
190
+
191
+ The XVERSE-MoE-A4.2B model weights are **fully open** to academic research and support **free commercial use**. To apply for a commercial license, please fill in the [application form](https://chat.xverse.cn/home/business.html). For other questions or collaborations, please contact <[email protected]>.
config.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "architectures": [
+     "XverseForCausalLM"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_xverse.XverseConfig",
+     "AutoModelForCausalLM": "modeling_xverse.XverseForCausalLM"
+   },
+   "pad_token_id": 1,
+   "bos_token_id": 2,
+   "eos_token_id": 3,
+   "hidden_act": "silu",
+   "hidden_size": 2560,
+   "initializer_range": 0.02,
+   "intermediate_size": 1728,
+   "max_position_embeddings": 8192,
+   "model_type": "xverse",
+   "num_attention_heads": 32,
+   "num_hidden_layers": 28,
+   "rms_norm_eps": 1e-06,
+   "tie_word_embeddings": false,
+   "rope_theta": 500000,
+   "moe_top_k": 6,
+   "num_experts": 64,
+   "num_shared_experts": 2,
+   "output_router_logits": false,
+   "router_aux_loss_coef": 0.01,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.38.2",
+   "use_cache": true,
+   "vocab_size": 100534
+ }
+
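To see how this file lines up with the hyperparameter table in the README, the sketch below parses a subset of the JSON and compares it column by column. The column-to-key mapping is an illustration inferred from matching values, not something the commit documents, and the per-head dimension assumes the common convention `hidden_size // num_attention_heads`.

```python
import json

# Subset of the config.json added in this commit, embedded so the example
# is self-contained.
CONFIG_JSON = """
{
  "hidden_size": 2560,
  "intermediate_size": 1728,
  "num_attention_heads": 32,
  "num_hidden_layers": 28,
  "moe_top_k": 6,
  "num_experts": 64,
  "num_shared_experts": 2
}
"""
cfg = json.loads(CONFIG_JSON)

# Values from the README hyperparameter table.
readme_table = {
    "n_layers": 28, "d_model": 2560, "n_heads": 32, "d_ff": 1728,
    "n_non_shared_experts": 64, "n_shared_experts": 2, "top_k": 6,
}

# Illustrative mapping from README column names to config.json keys.
column_to_key = {
    "n_layers": "num_hidden_layers",
    "d_model": "hidden_size",
    "n_heads": "num_attention_heads",
    "d_ff": "intermediate_size",
    "n_non_shared_experts": "num_experts",
    "n_shared_experts": "num_shared_experts",
    "top_k": "moe_top_k",
}

mismatches = {
    col: (readme_table[col], cfg[key])
    for col, key in column_to_key.items()
    if readme_table[col] != cfg[key]
}

# Assumed convention: the hidden size is split evenly across heads.
head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
experts_per_token = cfg["moe_top_k"] + cfg["num_shared_experts"]

print("mismatches:", mismatches)
print("head_dim:", head_dim)
print("experts per token:", experts_per_token)
```

An empty `mismatches` dictionary means the README table and the shipped configuration agree on every shared field.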
configuration_xverse.py ADDED
@@ -0,0 +1,204 @@
+ # coding=utf-8
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+ #
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+ # and OPT implementations in this library. It has been modified from its
+ # original forms to accommodate minor architectural differences compared
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ XVERSE model configuration"""
+
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.utils import logging
+
+
+ logger = logging.get_logger(__name__)
+
+ XVERSE_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
+
+
+ class XverseConfig(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`XverseModel`]. It is used to instantiate an Xverse
+     model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
+     defaults will yield a similar configuration to that of the XVERSE-13B.
+
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+
+
+     Args:
+         vocab_size (`int`, *optional*, defaults to 100278):
+             Vocabulary size of the XVERSE model. Defines the number of different tokens that can be represented by the
+             `inputs_ids` passed when calling [`XverseModel`]
+         hidden_size (`int`, *optional*, defaults to 5120):
+             Dimension of the hidden representations.
+         intermediate_size (`int`, *optional*, defaults to 13824):
+             Dimension of the MLP representations.
+         num_hidden_layers (`int`, *optional*, defaults to 40):
+             Number of hidden layers in the Transformer decoder.
+         num_attention_heads (`int`, *optional*, defaults to 40):
+             Number of attention heads for each attention layer in the Transformer decoder.
+         hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+             The non-linear activation function (function or string) in the decoder.
+         max_position_embeddings (`int`, *optional*, defaults to 8192):
+             The maximum sequence length that this model might ever be used with. Typically set this to something large
+             just in case (e.g., 512 or 1024 or 2048).
+         initializer_range (`float`, *optional*, defaults to 0.02):
+             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+         rms_norm_eps (`float`, *optional*, defaults to 1e-6):
+             The epsilon used by the rms normalization layers.
+         use_cache (`bool`, *optional*, defaults to `True`):
+             Whether or not the model should return the last key/values attentions (not used by all models). Only
+             relevant if `config.is_decoder=True`.
+         pad_token_id (`int`, *optional*):
+             Padding token id.
+         bos_token_id (`int`, *optional*, defaults to 1):
+             Beginning of stream token id.
+         eos_token_id (`int`, *optional*, defaults to 2):
+             End of stream token id.
+         pretraining_tp (`int`, *optional*, defaults to 1):
+             Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
+             document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to understand more about it. This value is
+             necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
+             issue](https://github.com/pytorch/pytorch/issues/76232).
+         tie_word_embeddings(`bool`, *optional*, defaults to `False`):
+             Whether to tie weight embeddings
+         rope_theta (`float`, *optional*, defaults to 10000.0):
+             The base period of the RoPE embeddings.
+         rope_scaling (`Dict`, *optional*):
+             Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
+             strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
+             `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
+             `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
+             these scaling strategies behave:
+             https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
+             experimental feature, subject to breaking API changes in future versions.
+         attention_bias (`bool`, *optional*, defaults to `False`):
+             Whether to use a bias in the query, key, value and output projection layers during self-attention.
+         attention_dropout (`float`, *optional*, defaults to 0.0):
+             The dropout ratio for the attention probabilities.
+         moe_top_k (`int`, *optional*, defaults to 6):
+             Number of routed experts selected per token.
+         num_experts (`int`, *optional*, defaults to 64):
+             Number of routed experts.
+         num_shared_experts (`int`, *optional*, defaults to 2):
+             Number of shared experts, None for no shared experts.
+         output_router_logits (`bool`, *optional*, defaults to `False`):
+             Whether or not to return the router logits.
+         router_aux_loss_coef (`float`, *optional*, defaults to 0.01):
+             The aux loss factor for the total loss.
+     Example:
+
+     ```python
+     >>> from transformers import XverseModel, XverseConfig
+
+     >>> # Initializing a Xverse XVERSE-13B style configuration
+     >>> configuration = XverseConfig()
+
+     >>> # Initializing a model from the XVERSE-13B style configuration
+     >>> model = XverseModel(configuration)
+
+     >>> # Accessing the model configuration
+     >>> configuration = model.config
+     ```"""
+     model_type = "xverse"
+     keys_to_ignore_at_inference = ["past_key_values"]
+
+     def __init__(
+         self,
+         vocab_size=100278,
+         hidden_size=5120,
+         intermediate_size=13824,
+         num_hidden_layers=40,
+         num_attention_heads=40,
+         num_key_value_heads=None,
+         hidden_act="silu",
+         max_position_embeddings=8192,
+         initializer_range=0.02,
+         rms_norm_eps=1e-6,
+         use_cache=True,
+         pad_token_id=None,
+         bos_token_id=1,
+         eos_token_id=2,
+         pretraining_tp=1,
+         tie_word_embeddings=False,
+         rope_theta=10000.0,
+         rope_scaling=None,
+         attention_bias=False,
+         attention_dropout=0.0,
+         moe_top_k=6,
+         num_experts=64,
+         num_shared_experts=2,
+         output_router_logits=False,
+         router_aux_loss_coef=0.01,
+         **kwargs,
+     ):
+         self.vocab_size = vocab_size
+         self.max_position_embeddings = max_position_embeddings
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+
+         # for backward compatibility
+         if num_key_value_heads is None:
+             num_key_value_heads = num_attention_heads
+
+         self.num_key_value_heads = num_key_value_heads
+         self.hidden_act = hidden_act
+         self.initializer_range = initializer_range
+         self.rms_norm_eps = rms_norm_eps
+         self.pretraining_tp = pretraining_tp
+         self.use_cache = use_cache
+         self.rope_theta = rope_theta
+         self.rope_scaling = rope_scaling
+         self._rope_scaling_validation()
+         self.attention_bias = attention_bias
+         self.attention_dropout = attention_dropout
+
+         self.moe_top_k = moe_top_k
+         self.num_experts = num_experts
+         self.num_shared_experts = num_shared_experts
+         self.output_router_logits = output_router_logits
+         self.router_aux_loss_coef = router_aux_loss_coef
+
+         super().__init__(
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
+
+     def _rope_scaling_validation(self):
+         """
+         Validate the `rope_scaling` configuration.
+         """
+         if self.rope_scaling is None:
+             return
+
+         if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
+             raise ValueError(
+                 "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
+                 f"got {self.rope_scaling}"
+             )
+         rope_scaling_type = self.rope_scaling.get("type", None)
+         rope_scaling_factor = self.rope_scaling.get("factor", None)
+         if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
+             raise ValueError(
+                 f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
+             )
+         if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
+             raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
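The `rope_scaling` checks above are stricter than they may look: the factor has to be a Python `float`, so an integer such as `2` is rejected even though it is greater than 1. The sketch below restates the same checks as a standalone function so they can be tried without installing `transformers`. The names `validate_rope_scaling` and `is_accepted` are hypothetical helpers introduced for this illustration, not part of the commit.

```python
def validate_rope_scaling(rope_scaling):
    """Standalone restatement of the checks in XverseConfig._rope_scaling_validation."""
    if rope_scaling is None:
        return
    if not isinstance(rope_scaling, dict) or len(rope_scaling) != 2:
        raise ValueError(
            f"`rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {rope_scaling}"
        )
    scaling_type = rope_scaling.get("type")
    factor = rope_scaling.get("factor")
    if scaling_type not in ("linear", "dynamic"):
        raise ValueError(f"`rope_scaling`'s type field must be 'linear' or 'dynamic', got {scaling_type}")
    # Note the isinstance check: an int factor fails even if it is > 1.
    if not isinstance(factor, float) or factor <= 1.0:
        raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {factor}")


def is_accepted(rope_scaling):
    """Return True if the candidate passes validation, False if it raises ValueError."""
    try:
        validate_rope_scaling(rope_scaling)
        return True
    except ValueError:
        return False


candidates = [
    None,                                   # scaling disabled
    {"type": "linear", "factor": 2.0},      # valid
    {"type": "linear", "factor": 2},        # int factor
    {"type": "dynamic", "factor": 1.0},     # factor not strictly greater than 1
    {"type": "yarn", "factor": 2.0},        # unsupported strategy name
]
for candidate in candidates:
    print(candidate, "->", is_accepted(candidate))
```

In practice this means a configuration should spell the factor as `2.0` rather than `2`.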
modeling_xverse.py ADDED
@@ -0,0 +1,1521 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """ PyTorch xverse model."""
21
+ import math
22
+ import warnings
23
+ from typing import List, Optional, Tuple, Union
24
+
25
+ import torch
26
+ import torch.nn.functional as F
27
+ import torch.utils.checkpoint
28
+ from torch import nn
29
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
30
+
31
+ from transformers.activations import ACT2FN
32
+ from transformers.cache_utils import Cache, DynamicCache, StaticCache
33
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
34
+ from transformers.modeling_outputs import (
35
+ MoeModelOutputWithPast,
36
+ MoeCausalLMOutputWithPast
37
+ )
38
+ from transformers.modeling_utils import PreTrainedModel
39
+ from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
40
+ from transformers.utils import (
41
+ add_start_docstrings,
42
+ add_start_docstrings_to_model_forward,
43
+ is_flash_attn_2_available,
44
+ is_flash_attn_greater_or_equal_2_10,
45
+ logging,
46
+ replace_return_docstrings,
47
+ )
48
+ from .configuration_xverse import XverseConfig
49
+
50
+
51
+ if is_flash_attn_2_available():
52
+ from flash_attn import flash_attn_func, flash_attn_varlen_func
53
+ from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
54
+
55
+
56
+ logger = logging.get_logger(__name__)
57
+
58
+ _CONFIG_FOR_DOC = "XverseConfig"
59
+
60
+ # Copied from transformers.models.mixtral.modeling_mixtral.load_balancing_loss_func
61
+ def load_balancing_loss_func(
62
+ gate_logits: torch.Tensor, num_experts: torch.Tensor = None, top_k=2, attention_mask: Optional[torch.Tensor] = None
63
+ ) -> float:
64
+ r"""
65
+ Computes auxiliary load balancing loss as in Switch Transformer - implemented in Pytorch.
66
+
67
+ See Switch Transformer (https://arxiv.org/abs/2101.03961) for more details. This function implements the loss
68
+ function presented in equations (4) - (6) of the paper. It aims at penalizing cases where the routing between
69
+ experts is too unbalanced.
70
+
71
+ Args:
72
+ gate_logits (Union[`torch.Tensor`, Tuple[torch.Tensor]):
73
+ Logits from the `gate`, should be a tuple of model.config.num_hidden_layers tensors of
74
+ shape [batch_size X sequence_length, num_experts].
75
+ attention_mask (`torch.Tensor`, None):
76
+ The attention_mask used in forward function
77
+ shape [batch_size X sequence_length] if not None.
78
+ num_experts (`int`, *optional*):
79
+ Number of experts
80
+
81
+ Returns:
82
+ The auxiliary loss.
83
+ """
84
+ if gate_logits is None or not isinstance(gate_logits, tuple):
85
+ return 0
86
+
87
+ if isinstance(gate_logits, tuple):
88
+ compute_device = gate_logits[0].device
89
+ concatenated_gate_logits = torch.cat([layer_gate.to(compute_device) for layer_gate in gate_logits], dim=0)
90
+
91
+ routing_weights = torch.nn.functional.softmax(concatenated_gate_logits, dim=-1)
92
+
93
+ _, selected_experts = torch.topk(routing_weights, top_k, dim=-1)
94
+
95
+ expert_mask = torch.nn.functional.one_hot(selected_experts, num_experts)
96
+
97
+ if attention_mask is None:
98
+ # Compute the percentage of tokens routed to each experts
99
+ tokens_per_expert = torch.mean(expert_mask.float(), dim=0)
100
+
101
+ # Compute the average probability of routing to these experts
102
+ router_prob_per_expert = torch.mean(routing_weights, dim=0)
103
+ else:
104
+ batch_size, sequence_length = attention_mask.shape
105
+ num_hidden_layers = concatenated_gate_logits.shape[0] // (batch_size * sequence_length)
106
+
107
+ # Compute the mask that masks all padding tokens as 0 with the same shape of expert_mask
108
+ expert_attention_mask = (
109
+ attention_mask[None, :, :, None, None]
110
+ .expand((num_hidden_layers, batch_size, sequence_length, top_k, num_experts))
111
+ .reshape(-1, top_k, num_experts)
112
+ .to(compute_device)
113
+ )
114
+
115
+ # Compute the percentage of tokens routed to each expert
116
+ tokens_per_expert = torch.sum(expert_mask.float() * expert_attention_mask, dim=0) / torch.sum(
117
+ expert_attention_mask, dim=0
118
+ )
119
+
120
+ # Compute the mask that masks all padding tokens as 0 with the same shape of tokens_per_expert
121
+ router_per_expert_attention_mask = (
122
+ attention_mask[None, :, :, None]
123
+ .expand((num_hidden_layers, batch_size, sequence_length, num_experts))
124
+ .reshape(-1, num_experts)
125
+ .to(compute_device)
126
+ )
127
+
128
+ # Compute the average probability of routing to these experts
129
+ router_prob_per_expert = torch.sum(routing_weights * router_per_expert_attention_mask, dim=0) / torch.sum(
130
+ router_per_expert_attention_mask, dim=0
131
+ )
132
+
133
+ overall_loss = torch.sum(tokens_per_expert * router_prob_per_expert.unsqueeze(0))
134
+ return overall_loss * num_experts
135
+
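The loss above can be hard to read through the tensor broadcasting, so here is a minimal pure-Python sketch of the same quantity for the unmasked case. The names `softmax` and `load_balancing_loss` are illustrative and not part of the model file; the sketch assumes one flat list of per-token logit rows and ignores padding.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of router logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(gate_logits, num_experts, top_k):
    """Illustrative re-statement of the auxiliary loss (no attention mask).

    gate_logits: list of rows, one per token, each with num_experts logits.
    """
    probs = [softmax(row) for row in gate_logits]
    n_tokens = len(probs)

    # tokens_per_expert[k][e]: fraction of tokens whose k-th choice is expert e
    tokens_per_expert = [[0.0] * num_experts for _ in range(top_k)]
    for p in probs:
        ranked = sorted(range(num_experts), key=lambda e: p[e], reverse=True)[:top_k]
        for k, e in enumerate(ranked):
            tokens_per_expert[k][e] += 1.0 / n_tokens

    # router_prob[e]: mean routing probability assigned to expert e
    router_prob = [sum(p[e] for p in probs) / n_tokens for e in range(num_experts)]

    overall = sum(
        tokens_per_expert[k][e] * router_prob[e]
        for k in range(top_k)
        for e in range(num_experts)
    )
    return overall * num_experts
```

With `top_k=1`, four tokens that each prefer a different one of four experts give a loss of about 1, while four tokens that all prefer the same expert push it towards 4, which is the imbalance the loss penalizes.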
136
+ # Copied from transformers.models.llama.modeling_llama._get_unpad_data
137
+ def _get_unpad_data(attention_mask):
138
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
139
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
140
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
141
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
142
+ return (
143
+ indices,
144
+ cu_seqlens,
145
+ max_seqlen_in_batch,
146
+ )
147
+
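As a rough illustration of what `_get_unpad_data` returns, the sketch below recomputes the same three values with plain Python lists. The function name `get_unpad_data_lists` is hypothetical, and the mask is assumed to be a list of equal-length 0/1 rows.

```python
def get_unpad_data_lists(attention_mask):
    """List-based sketch: flat indices of real tokens, cumulative lengths, max length."""
    width = len(attention_mask[0])
    seqlens = [sum(row) for row in attention_mask]

    # Positions of non-padding tokens in the flattened (batch * seq_len) layout
    indices = [
        r * width + c
        for r, row in enumerate(attention_mask)
        for c, v in enumerate(row)
        if v
    ]

    # Cumulative sequence lengths with a leading 0, as varlen attention kernels expect
    cu_seqlens = [0]
    for n in seqlens:
        cu_seqlens.append(cu_seqlens[-1] + n)

    return indices, cu_seqlens, max(seqlens)
```

For a left-padded batch such as `[[0, 1, 1], [1, 1, 1]]`, sequence `i` occupies the half-open range `cu_seqlens[i]:cu_seqlens[i + 1]` of the packed token list.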
148
+ # Copied from transformers.models.llama.modeling_llama.LlamaRMSNorm with Llama->Xverse
149
+ class XverseRMSNorm(nn.Module):
150
+ def __init__(self, hidden_size, eps=1e-6):
151
+ """
152
+ XverseRMSNorm is equivalent to T5LayerNorm
153
+ """
154
+ super().__init__()
155
+ self.weight = nn.Parameter(torch.ones(hidden_size))
156
+ self.variance_epsilon = eps
157
+
158
+ def forward(self, hidden_states):
159
+ input_dtype = hidden_states.dtype
160
+ hidden_states = hidden_states.to(torch.float32)
161
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
162
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
163
+ return self.weight * hidden_states.to(input_dtype)
164
+
165
+
166
+ ALL_LAYERNORM_LAYERS.append(XverseRMSNorm)
167
+
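For reference, the normalization performed by `XverseRMSNorm.forward` can be sketched on a single vector without any tensor library. `rms_norm` is an illustrative name, and the sketch leaves out the float32 upcast and dtype restoration done in the module.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal root-mean-square of its entries, then by weight."""
    variance = sum(v * v for v in x) / len(x)  # mean of squares, no mean subtraction
    scale = 1.0 / math.sqrt(variance + eps)
    return [w * v * scale for v, w in zip(x, weight)]
```

Unlike LayerNorm, no mean is subtracted and there is no bias term; with unit weights the output has a mean square of approximately 1.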
168
+ # Copied from transformers.models.llama.modeling_llama.LlamaRotaryEmbedding with Llama->Xverse
169
+ class XverseRotaryEmbedding(nn.Module):
170
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
171
+ super().__init__()
172
+ self.scaling_factor = scaling_factor
173
+ self.dim = dim
174
+ self.max_position_embeddings = max_position_embeddings
175
+ self.base = base
176
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
177
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
178
+ # For backward compatibility, register cached cos and sin buffers
179
+ self.max_seq_len_cached = max_position_embeddings
180
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
181
+ t = t / self.scaling_factor
182
+ freqs = torch.outer(t, self.inv_freq)
183
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
184
+ emb = torch.cat((freqs, freqs), dim=-1)
185
+ self.register_buffer("_cos_cached", emb.cos().to(torch.get_default_dtype()), persistent=False)
186
+ self.register_buffer("_sin_cached", emb.sin().to(torch.get_default_dtype()), persistent=False)
187
+
188
+ @property
189
+ def sin_cached(self):
190
+ logger.warning_once(
191
+ "The sin_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use "
192
+ "the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class"
193
+ )
194
+ return self._sin_cached
195
+
196
+ @property
197
+ def cos_cached(self):
198
+ logger.warning_once(
199
+ "The cos_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use "
200
+ "the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class"
201
+ )
202
+ return self._cos_cached
203
+
204
+ @torch.no_grad()
205
+ def forward(self, x, position_ids, seq_len=None):
206
+ if seq_len is not None:
207
+ logger.warning_once("The `seq_len` argument is deprecated and unused. It will be removed in v4.39.")
208
+
209
+ # x: [bs, num_attention_heads, seq_len, head_size]
210
+ inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
211
+ position_ids_expanded = position_ids[:, None, :].float()
212
+ # Force float32 since bfloat16 loses precision on long contexts
213
+ # See https://github.com/huggingface/transformers/pull/29285
214
+ device_type = x.device.type
215
+ device_type = device_type if isinstance(device_type, str) else "cpu"
216
+ with torch.autocast(device_type=device_type, enabled=False):
217
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
218
+ emb = torch.cat((freqs, freqs), dim=-1)
219
+ cos = emb.cos()
220
+ sin = emb.sin()
221
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
222
+
223
+
224
+ # Copied from transformers.models.llama.modeling_llama.LlamaLinearScalingRotaryEmbedding with Llama->Xverse
225
+ class XverseLinearScalingRotaryEmbedding(XverseRotaryEmbedding):
226
+ """XverseRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
227
+
228
+ def forward(self, x, position_ids, seq_len=None):
229
+ # difference to the original RoPE: a scaling factor is applied to the position ids
230
+ position_ids = position_ids.float() / self.scaling_factor
231
+ cos, sin = super().forward(x, position_ids, seq_len)
232
+ return cos, sin
233
+
234
+
235
+ # Copied from transformers.models.llama.modeling_llama.LlamaDynamicNTKScalingRotaryEmbedding with Llama->Xverse
236
+ class XverseDynamicNTKScalingRotaryEmbedding(XverseRotaryEmbedding):
237
+ """XverseRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
238
+
239
+ def forward(self, x, position_ids, seq_len=None):
240
+ # difference to the original RoPE: inv_freq is recomputed when the sequence length > original length
241
+ seq_len = torch.max(position_ids) + 1
242
+ if seq_len > self.max_position_embeddings:
243
+ base = self.base * (
244
+ (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
245
+ ) ** (self.dim / (self.dim - 2))
246
+ inv_freq = 1.0 / (
247
+ base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(x.device) / self.dim)
248
+ )
249
+ self.register_buffer("inv_freq", inv_freq, persistent=False) # TODO joao: this may break with compilation
250
+
251
+ cos, sin = super().forward(x, position_ids, seq_len)
252
+ return cos, sin
253
+
254
+
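The base adjustment used by the dynamic NTK variant can be isolated as a small scalar function. This is only a sketch of the formula in `XverseDynamicNTKScalingRotaryEmbedding.forward`; the helper name `dynamic_ntk_base` is an assumption and does not exist in the file.

```python
def dynamic_ntk_base(base, dim, seq_len, max_position_embeddings, scaling_factor):
    """Return the RoPE base to use for a given sequence length.

    The base is left unchanged while seq_len fits the trained context and
    grows once seq_len exceeds max_position_embeddings.
    """
    if seq_len <= max_position_embeddings:
        return base
    ratio = (scaling_factor * seq_len / max_position_embeddings) - (scaling_factor - 1)
    return base * ratio ** (dim / (dim - 2))
```

A larger base lowers every inverse frequency, so longer sequences are mapped onto rotation angles closer to those seen during training.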
255
+ # Copied from transformers.models.llama.modeling_llama.rotate_half
256
+ def rotate_half(x):
257
+ """Rotates half the hidden dims of the input."""
258
+ x1 = x[..., : x.shape[-1] // 2]
259
+ x2 = x[..., x.shape[-1] // 2 :]
260
+ return torch.cat((-x2, x1), dim=-1)
261
+
262
+
263
+ # Copied from transformers.models.llama.modeling_llama.apply_rotary_pos_emb
264
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
265
+ """Applies Rotary Position Embedding to the query and key tensors.
266
+
267
+ Args:
268
+ q (`torch.Tensor`): The query tensor.
269
+ k (`torch.Tensor`): The key tensor.
270
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
271
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
272
+ position_ids (`torch.Tensor`, *optional*):
273
+ Deprecated and unused.
274
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
275
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
276
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
277
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
278
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
279
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
280
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
281
+ Returns:
282
+ `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
283
+ """
284
+ cos = cos.unsqueeze(unsqueeze_dim)
285
+ sin = sin.unsqueeze(unsqueeze_dim)
286
+ q_embed = (q * cos) + (rotate_half(q) * sin)
287
+ k_embed = (k * cos) + (rotate_half(k) * sin)
288
+ return q_embed, k_embed
289
+
290
+
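To make the rotation concrete, here is a single-vector sketch of `rotate_half` and the cos/sin combination, written with plain lists. The helper names `rope` and `dot` are illustrative; the half-split pairing (element `i` with element `i + dim/2`) mirrors the `torch.cat((freqs, freqs))` layout used in the file.

```python
import math


def rotate_half(x):
    """Map [x1, x2] (two halves) to [-x2, x1]."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def rope(x, position, base=10000.0):
    """Apply rotary position embedding to one head vector at one position."""
    dim = len(x)
    inv_freq = [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]
    angles = [position * f for f in inv_freq] * 2  # same angles for both halves
    rotated = rotate_half(x)
    return [v * math.cos(a) + r * math.sin(a) for v, r, a in zip(x, rotated, angles)]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))
```

Because each pair of coordinates is rotated by an angle proportional to the position, the dot product of a rotated query and a rotated key depends only on the difference between their positions, and vector norms are unchanged.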
291
+ # Copied from transformers.models.llama.modeling_llama.LlamaMLP with Llama->Xverse
292
+ class XverseMLP(nn.Module):
293
+ def __init__(self, config, hidden_size=None, intermediate_size=None, hidden_act=None):
294
+ super().__init__()
295
+ self.config = config
296
+ self.hidden_size = config.hidden_size if hidden_size is None else hidden_size
297
+ self.intermediate_size = config.intermediate_size if intermediate_size is None else intermediate_size
298
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
299
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
300
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
301
+ self.act_fn = ACT2FN[config.hidden_act] if hidden_act is None else ACT2FN[hidden_act]
302
+
303
+ def forward(self, x):
304
+ if self.config.pretraining_tp > 1:
305
+ slice = self.intermediate_size // self.config.pretraining_tp
306
+ gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
307
+ up_proj_slices = self.up_proj.weight.split(slice, dim=0)
308
+ down_proj_slices = self.down_proj.weight.split(slice, dim=1)
309
+
310
+ gate_proj = torch.cat(
311
+ [F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1
312
+ )
313
+ up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
314
+
315
+ intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
316
+ down_proj = [
317
+ F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)
318
+ ]
319
+ down_proj = sum(down_proj)
320
+ else:
321
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
322
+
323
+ return down_proj
324
+
325
+ class XverseMoEMLP(nn.Module):
326
+ def __init__(
327
+ self,
328
+ config: XverseConfig,
329
+ hidden_size: int,
330
+ intermediate_size: int,
331
+ hidden_act: str,
332
+ ):
333
+ super().__init__()
334
+ self.config = config
335
+ self.top_k = config.moe_top_k
336
+ self.num_experts = config.num_experts
337
+ self.num_shared_experts = config.num_shared_experts
338
+
339
+ self.router = nn.Linear(hidden_size, self.num_experts, bias=False, dtype=torch.float)
340
+ self.experts = nn.ModuleList([XverseMLP(config, hidden_size, intermediate_size, hidden_act) for _ in range(self.num_experts)])
341
+ if self.num_shared_experts is not None:
342
+ self.shared_experts = XverseMLP(config, hidden_size, self.num_shared_experts * intermediate_size, hidden_act)
343
+
344
+ def forward(self, hidden_states):
345
+ batch_size, sequence_length, hidden_dim = hidden_states.shape
346
+
347
+ final_hidden_states = torch.zeros(
348
+ (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype, device=hidden_states.device
349
+ )
350
+
351
+ input_dtype = hidden_states.dtype
352
+ hidden_states = hidden_states.view(-1, hidden_dim).float()
353
+
354
+ router_logits = self.router(hidden_states)
355
+
356
+ routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
357
+ routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
358
+
359
+ expert_mask = torch.nn.functional.one_hot(selected_experts, num_classes=self.num_experts)
360
+ expert_mask = expert_mask.permute(2, 1, 0)
361
+
362
+ routing_weights /= (routing_weights.sum(dim=-1, keepdim=True) + 1e-06)
363
+
364
+ routing_weights = routing_weights.to(input_dtype)
365
+ hidden_states = hidden_states.to(input_dtype)
366
+
367
+ for expert_idx, expert_layer in enumerate(self.experts):
368
+ idx, top_x = torch.where(expert_mask[expert_idx])
369
+
370
+ if top_x.shape[0] == 0:
371
+ continue
372
+
373
+ top_x_list = top_x.tolist()
374
+ idx_list = idx.tolist()
375
+
376
+ current_state = hidden_states[None, top_x_list].view(-1, hidden_dim)
377
+ current_hidden_states = expert_layer(current_state)
378
+ current_hidden_states = current_hidden_states * routing_weights[top_x_list, idx_list, None]
379
+
380
+ final_hidden_states.index_add_(0, top_x, current_hidden_states)
381
+
382
+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)
383
+
384
+ if self.num_shared_experts is not None:
385
+ hidden_states = hidden_states.view(batch_size, sequence_length, hidden_dim)
386
+ shared_hidden = self.shared_experts(hidden_states)
387
+ final_hidden_states = final_hidden_states + shared_hidden
388
+
389
+ return final_hidden_states, router_logits
390
+
391
+
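The routing arithmetic in `XverseMoEMLP.forward` can be followed more easily on a single token. The sketch below uses scalar "experts" (plain callables) purely for illustration; `route` and `moe_output` are hypothetical names, and the per-expert batching via `index_add_` is replaced by a simple loop.

```python
import math


def route(router_logits, top_k, eps=1e-6):
    """Softmax over all experts, keep the top_k, renormalize the kept weights."""
    m = max(router_logits)
    exps = [math.exp(v - m) for v in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    selected = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
    kept = [probs[e] for e in selected]
    norm = sum(kept) + eps  # same epsilon guard as in the module
    return selected, [p / norm for p in kept]


def moe_output(x, experts, shared_expert, router_logits, top_k):
    """Weighted sum of the selected experts plus the always-active shared expert."""
    selected, weights = route(router_logits, top_k)
    out = shared_expert(x)
    for e, w in zip(selected, weights):
        out += w * experts[e](x)
    return out
```

The shared expert contributes with weight 1 regardless of the router, while the non-shared experts contribute with weights that sum to roughly 1 after renormalization.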
392
+ # Copied from transformers.models.llama.modeling_llama.repeat_kv
393
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
394
+ """
395
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
396
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
397
+ """
398
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
399
+ if n_rep == 1:
400
+ return hidden_states
401
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
402
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
403
+
404
+
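The head expansion done by `repeat_kv` amounts to an interleaved repeat along the head axis. A list-based sketch, with a hypothetical name `repeat_kv_heads` and each head treated as an opaque item:

```python
def repeat_kv_heads(kv_heads, n_rep):
    """Repeat each key/value head n_rep times, keeping copies adjacent."""
    return [head for head in kv_heads for _ in range(n_rep)]
```

With this ordering, query head `h` attends with key/value head `h // n_rep`, which is how grouped-query attention shares a smaller set of key/value heads across query heads.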
405
+ # Copied from transformers.models.llama.modeling_llama.LlamaAttention with Llama->Xverse
406
+ class XverseAttention(nn.Module):
407
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
408
+
409
+ def __init__(self, config: XverseConfig, layer_idx: Optional[int] = None):
410
+ super().__init__()
411
+ self.config = config
412
+ self.layer_idx = layer_idx
413
+ if layer_idx is None:
414
+ logger.warning_once(
415
+ f"Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will "
416
+ "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
417
+ "when creating this class."
418
+ )
419
+
420
+ self.attention_dropout = config.attention_dropout
421
+ self.hidden_size = config.hidden_size
422
+ self.num_heads = config.num_attention_heads
423
+ self.head_dim = self.hidden_size // self.num_heads
424
+ self.num_key_value_heads = config.num_key_value_heads
425
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
426
+ self.max_position_embeddings = config.max_position_embeddings
427
+ self.rope_theta = config.rope_theta
428
+ self.is_causal = True
429
+
430
+ if (self.head_dim * self.num_heads) != self.hidden_size:
431
+ raise ValueError(
432
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
433
+ f" and `num_heads`: {self.num_heads})."
434
+ )
435
+
436
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)
437
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
438
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
439
+ self.o_proj = nn.Linear(self.hidden_size, self.hidden_size, bias=config.attention_bias)
440
+ self._init_rope()
441
+
442
+ def _init_rope(self):
443
+ if self.config.rope_scaling is None:
444
+ self.rotary_emb = XverseRotaryEmbedding(
445
+ self.head_dim,
446
+ max_position_embeddings=self.max_position_embeddings,
447
+ base=self.rope_theta,
448
+ )
449
+ else:
450
+ scaling_type = self.config.rope_scaling["type"]
451
+ scaling_factor = self.config.rope_scaling["factor"]
452
+ if scaling_type == "linear":
453
+ self.rotary_emb = XverseLinearScalingRotaryEmbedding(
454
+ self.head_dim,
455
+ max_position_embeddings=self.max_position_embeddings,
456
+ scaling_factor=scaling_factor,
457
+ base=self.rope_theta,
458
+ )
459
+ elif scaling_type == "dynamic":
460
+ self.rotary_emb = XverseDynamicNTKScalingRotaryEmbedding(
461
+ self.head_dim,
462
+ max_position_embeddings=self.max_position_embeddings,
463
+ scaling_factor=scaling_factor,
464
+ base=self.rope_theta,
465
+ )
466
+ else:
467
+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
468
+
469
+ def forward(
470
+ self,
471
+ hidden_states: torch.Tensor,
472
+ attention_mask: Optional[torch.Tensor] = None,
473
+ position_ids: Optional[torch.LongTensor] = None,
474
+ past_key_value: Optional[Cache] = None,
475
+ output_attentions: bool = False,
476
+ use_cache: bool = False,
477
+ cache_position: Optional[torch.LongTensor] = None,
478
+ **kwargs,
479
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
480
+ bsz, q_len, _ = hidden_states.size()
481
+
482
+ if self.config.pretraining_tp > 1:
483
+ key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
484
+ query_slices = self.q_proj.weight.split(
485
+ (self.num_heads * self.head_dim) // self.config.pretraining_tp, dim=0
486
+ )
487
+ key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
488
+ value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
489
+
490
+ query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
491
+ query_states = torch.cat(query_states, dim=-1)
492
+
493
+ key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.config.pretraining_tp)]
494
+ key_states = torch.cat(key_states, dim=-1)
495
+
496
+ value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.config.pretraining_tp)]
497
+ value_states = torch.cat(value_states, dim=-1)
498
+
499
+ else:
500
+ query_states = self.q_proj(hidden_states)
501
+ key_states = self.k_proj(hidden_states)
502
+ value_states = self.v_proj(hidden_states)
503
+
504
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
505
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
506
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
507
+
508
+ past_key_value = getattr(self, "past_key_value", past_key_value)
509
+ cos, sin = self.rotary_emb(value_states, position_ids)
510
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
511
+
512
+ if past_key_value is not None:
513
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
514
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
515
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
516
+
517
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
518
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
519
+
520
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
521
+
522
+ if attention_mask is not None: # no matter the length, we just slice it
523
+ causal_mask = attention_mask
524
+ if cache_position is not None:
525
+ causal_mask = attention_mask[:, :, cache_position, : key_states.shape[-2]]
526
+ attn_weights = attn_weights + causal_mask
527
+
528
+ # upcast attention to fp32
529
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
530
+ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
531
+ attn_output = torch.matmul(attn_weights, value_states)
532
+
533
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
534
+ raise ValueError(
535
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
536
+ f" {attn_output.size()}"
537
+ )
538
+
539
+ attn_output = attn_output.transpose(1, 2).contiguous()
540
+
541
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
542
+
543
+ if self.config.pretraining_tp > 1:
544
+ attn_output = attn_output.split(self.hidden_size // self.config.pretraining_tp, dim=2)
545
+ o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.config.pretraining_tp, dim=1)
546
+ attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.config.pretraining_tp)])
547
+ else:
548
+ attn_output = self.o_proj(attn_output)
549
+
550
+ if not output_attentions:
551
+ attn_weights = None
552
+
553
+ return attn_output, attn_weights, past_key_value
554
+
555
+
556
+ # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2 with Llama->Xverse
557
+ class XverseFlashAttention2(XverseAttention):
558
+ """
559
+ Xverse flash attention module. This module inherits from `XverseAttention` as the weights of the module stay
560
+ untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
561
+ flash attention and deal with padding tokens in case the input contains any of them.
562
+ """
563
+
564
+ def __init__(self, *args, **kwargs):
565
+ super().__init__(*args, **kwargs)
566
+
567
+ # TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
568
+ # flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment, which was made the default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
569
+ # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
570
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
571
+
572
+ def forward(
573
+ self,
574
+ hidden_states: torch.Tensor,
575
+ attention_mask: Optional[torch.LongTensor] = None,
576
+ position_ids: Optional[torch.LongTensor] = None,
577
+ past_key_value: Optional[Cache] = None,
578
+ output_attentions: bool = False,
579
+ use_cache: bool = False,
580
+ cache_position: Optional[torch.LongTensor] = None,
581
+ **kwargs,
582
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
583
+ output_attentions = False
584
+
585
+ bsz, q_len, _ = hidden_states.size()
586
+
587
+ query_states = self.q_proj(hidden_states)
588
+ key_states = self.k_proj(hidden_states)
589
+ value_states = self.v_proj(hidden_states)
590
+
591
+ # Flash attention requires the input to have the shape
592
+ # batch_size x seq_length x num_heads x head_dim
593
+ # therefore we just need to keep the original shape
594
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
595
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
596
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
597
+
598
+ cos, sin = self.rotary_emb(value_states, position_ids)
599
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
600
+
601
+ past_key_value = getattr(self, "past_key_value", past_key_value)
602
+
603
+ if past_key_value is not None:
604
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
605
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
606
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
607
+
608
+ # TODO: These transpose are quite inefficient but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
609
+ # to be able to avoid many of these transpose/reshape/view.
610
+ query_states = query_states.transpose(1, 2)
611
+ key_states = key_states.transpose(1, 2)
612
+ value_states = value_states.transpose(1, 2)
613
+
614
+ dropout_rate = self.attention_dropout if self.training else 0.0
615
+
616
+ # In PEFT, usually we cast the layer norms in float32 for training stability reasons
617
+ # therefore the input hidden states get silently cast to float32. Hence, we need
618
+ # to cast them back to the correct dtype just to be sure everything works as expected.
619
+ # This might slow down training & inference so it is recommended to not cast the LayerNorms
620
+ # in fp32. (XverseRMSNorm handles it correctly)
621
+
622
+ input_dtype = query_states.dtype
623
+ if input_dtype == torch.float32:
624
+ if torch.is_autocast_enabled():
625
+ target_dtype = torch.get_autocast_gpu_dtype()
626
+ # Handle the case where the model is quantized
627
+ elif hasattr(self.config, "_pre_quantization_dtype"):
628
+ target_dtype = self.config._pre_quantization_dtype
629
+ else:
630
+ target_dtype = self.q_proj.weight.dtype
631
+
632
+ logger.warning_once(
633
+ f"The input hidden states seems to be silently casted in float32, this might be related to"
634
+ f" the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in"
635
+ f" {target_dtype}."
636
+ )
637
+
638
+ query_states = query_states.to(target_dtype)
639
+ key_states = key_states.to(target_dtype)
640
+ value_states = value_states.to(target_dtype)
641
+
642
+ attn_output = self._flash_attention_forward(
643
+ query_states, key_states, value_states, attention_mask, q_len, dropout=dropout_rate
644
+ )
645
+
646
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
647
+ attn_output = self.o_proj(attn_output)
648
+
649
+ if not output_attentions:
650
+ attn_weights = None
651
+
652
+ return attn_output, attn_weights, past_key_value
653
+
654
+ def _flash_attention_forward(
655
+ self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
656
+ ):
657
+ """
658
+ Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token
659
+ first unpad the input, then computes the attention scores and pad the final attention scores.
660
+
661
+ Args:
662
+ query_states (`torch.Tensor`):
663
+ Input query states to be passed to Flash Attention API
664
+ key_states (`torch.Tensor`):
665
+ Input key states to be passed to Flash Attention API
666
+ value_states (`torch.Tensor`):
667
+ Input value states to be passed to Flash Attention API
668
+ attention_mask (`torch.Tensor`):
669
+ The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
670
+ position of padding tokens and 1 for the position of non-padding tokens.
671
+ dropout (`float`, *optional*):
672
+ Attention dropout
673
+ softmax_scale (`float`, *optional*):
674
+ The scaling of QK^T before applying softmax. Defaults to 1 / sqrt(head_dim)
675
+ """
676
+ if not self._flash_attn_uses_top_left_mask:
677
+ causal = self.is_causal
678
+ else:
679
+ # TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1. For details, please see the comment in XverseFlashAttention2 __init__.
680
+ causal = self.is_causal and query_length != 1
681
+
682
+ # Contains at least one padding token in the sequence
683
+ if attention_mask is not None:
684
+ batch_size = query_states.shape[0]
685
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
686
+ query_states, key_states, value_states, attention_mask, query_length
687
+ )
688
+
689
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
690
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
691
+
692
+ attn_output_unpad = flash_attn_varlen_func(
693
+ query_states,
694
+ key_states,
695
+ value_states,
696
+ cu_seqlens_q=cu_seqlens_q,
697
+ cu_seqlens_k=cu_seqlens_k,
698
+ max_seqlen_q=max_seqlen_in_batch_q,
699
+ max_seqlen_k=max_seqlen_in_batch_k,
700
+ dropout_p=dropout,
701
+ softmax_scale=softmax_scale,
702
+ causal=causal,
703
+ )
704
+
705
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
706
+ else:
707
+ attn_output = flash_attn_func(
708
+ query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal
709
+ )
710
+
711
+ return attn_output
712
+
713
+ def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
714
+ indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
715
+ batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
716
+
717
+ key_layer = index_first_axis(
718
+ key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
719
+ )
720
+ value_layer = index_first_axis(
721
+ value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
722
+ )
723
+ if query_length == kv_seq_len:
724
+ query_layer = index_first_axis(
725
+ query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
726
+ )
727
+ cu_seqlens_q = cu_seqlens_k
728
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
729
+ indices_q = indices_k
730
+ elif query_length == 1:
731
+ max_seqlen_in_batch_q = 1
732
+ cu_seqlens_q = torch.arange(
733
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
734
+ ) # There is a memcpy here, that is very bad.
735
+ indices_q = cu_seqlens_q[:-1]
736
+ query_layer = query_layer.squeeze(1)
737
+ else:
738
+ # The -q_len: slice assumes left padding.
739
+ attention_mask = attention_mask[:, -query_length:]
740
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
741
+
742
+ return (
743
+ query_layer,
744
+ key_layer,
745
+ value_layer,
746
+ indices_q,
747
+ (cu_seqlens_q, cu_seqlens_k),
748
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
749
+ )
750
+
751
+
752
+ # Copied from transformers.models.llama.modeling_llama.LlamaSdpaAttention with Llama->Xverse
753
+ class XverseSdpaAttention(XverseAttention):
754
+ """
755
+ Xverse attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
756
+ `XverseAttention` as the weights of the module stay untouched. The only changes are on the forward pass to adapt to
757
+ SDPA API.
758
+ """
759
+
760
+ # Adapted from XverseAttention.forward
761
+ def forward(
762
+ self,
763
+ hidden_states: torch.Tensor,
764
+ attention_mask: Optional[torch.Tensor] = None,
765
+ position_ids: Optional[torch.LongTensor] = None,
766
+ past_key_value: Optional[Cache] = None,
767
+ output_attentions: bool = False,
768
+ use_cache: bool = False,
769
+ cache_position: Optional[torch.LongTensor] = None,
770
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
771
+ if output_attentions:
772
+ # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
773
+ logger.warning_once(
774
+ "XverseMoEModel is using XverseSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
775
+ 'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
776
+ )
777
+ return super().forward(
778
+ hidden_states=hidden_states,
779
+ attention_mask=attention_mask,
780
+ position_ids=position_ids,
781
+ past_key_value=past_key_value,
782
+ output_attentions=output_attentions,
783
+ use_cache=use_cache,
784
+ cache_position=cache_position,
785
+ )
786
+
787
+ bsz, q_len, _ = hidden_states.size()
788
+
789
+ query_states = self.q_proj(hidden_states)
790
+ key_states = self.k_proj(hidden_states)
791
+ value_states = self.v_proj(hidden_states)
792
+
793
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
794
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
795
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
796
+
797
+ cos, sin = self.rotary_emb(value_states, position_ids)
798
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
799
+
800
+ # In case static cache is used, it is an instance attribute.
801
+ past_key_value = getattr(self, "past_key_value", past_key_value)
802
+
803
+ if past_key_value is not None:
804
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
805
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
806
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
807
+
808
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
809
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
810
+
811
+ causal_mask = attention_mask
812
+ if attention_mask is not None and cache_position is not None:
813
+ causal_mask = causal_mask[:, :, cache_position, : key_states.shape[-2]]
814
+
815
+ # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
816
+ # Reference: https://github.com/pytorch/pytorch/issues/112577.
817
+ if query_states.device.type == "cuda" and causal_mask is not None:
818
+ query_states = query_states.contiguous()
819
+ key_states = key_states.contiguous()
820
+ value_states = value_states.contiguous()
821
+
822
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
823
+ query_states,
824
+ key_states,
825
+ value_states,
826
+ attn_mask=causal_mask,
827
+ dropout_p=self.attention_dropout if self.training else 0.0,
828
+ )
829
+
830
+ attn_output = attn_output.transpose(1, 2).contiguous()
831
+ attn_output = attn_output.view(bsz, q_len, self.hidden_size)
832
+
833
+ attn_output = self.o_proj(attn_output)
834
+
835
+ return attn_output, None, past_key_value
836
+
837
+
838
+ XVERSE_ATTENTION_CLASSES = {
839
+ "eager": XverseAttention,
840
+ "flash_attention_2": XverseFlashAttention2,
841
+ "sdpa": XverseSdpaAttention,
842
+ }
843
+
844
+
845
+ class XverseMoEDecoderLayer(nn.Module):
846
+ def __init__(self, config: XverseConfig, layer_idx: int):
847
+ super().__init__()
848
+ self.hidden_size = config.hidden_size
849
+
850
+ self.self_attn = XVERSE_ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)
851
+
852
+ self.mlp = XverseMoEMLP(
853
+ config=config,
854
+ hidden_size=self.hidden_size,
855
+ intermediate_size=config.intermediate_size,
856
+ hidden_act=config.hidden_act,
857
+ )
858
+ self.input_layernorm = XverseRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
859
+ self.post_attention_layernorm = XverseRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
860
+
861
+ def forward(
862
+ self,
863
+ hidden_states: torch.Tensor,
864
+ attention_mask: Optional[torch.Tensor] = None,
865
+ position_ids: Optional[torch.LongTensor] = None,
866
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
867
+ output_attentions: Optional[bool] = False,
868
+ output_router_logits: Optional[bool] = False,
869
+ use_cache: Optional[bool] = False,
870
+ cache_position: Optional[torch.LongTensor] = None,
871
+ **kwargs,
872
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
873
+ """
874
+ Args:
875
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
876
+ attention_mask (`torch.FloatTensor`, *optional*):
877
+ attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
878
+ query_sequence_length, key_sequence_length)` if default attention is used.
879
+ output_attentions (`bool`, *optional*):
880
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
881
+ returned tensors for more detail.
882
+ output_router_logits (`bool`, optional): Whether or not to return the router logits.
883
+ use_cache (`bool`, *optional*):
884
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
885
+ (see `past_key_values`).
886
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
887
+ """
888
+ if "padding_mask" in kwargs:
889
+ warnings.warn(
890
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
891
+ )
892
+
893
+ residual = hidden_states
894
+
895
+ hidden_states = self.input_layernorm(hidden_states)
896
+
897
+ # Self Attention
898
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
899
+ hidden_states=hidden_states,
900
+ attention_mask=attention_mask,
901
+ position_ids=position_ids,
902
+ past_key_value=past_key_value,
903
+ output_attentions=output_attentions,
904
+ use_cache=use_cache,
905
+ cache_position=cache_position,
906
+ **kwargs,
907
+ )
908
+ hidden_states = residual + hidden_states
909
+
910
+ # Fully Connected
911
+ residual = hidden_states
912
+ hidden_states = self.post_attention_layernorm(hidden_states)
913
+
914
+ hidden_states, router_logits = self.mlp(hidden_states)
915
+ # if isinstance(hidden_states, tuple):
916
+ # hidden_states, router_logits = hidden_states
917
+ # else:
918
+ # router_logits = None
919
+
920
+ hidden_states = residual + hidden_states
921
+
922
+ outputs = (hidden_states,)
923
+
924
+ if output_attentions:
925
+ outputs += (self_attn_weights,)
926
+
927
+ if use_cache:
928
+ outputs += (present_key_value,)
929
+
930
+ if output_router_logits:
931
+ outputs += (router_logits,)
932
+
933
+ return outputs
934
+
935
+
936
+ XVERSE_START_DOCSTRING = r"""
937
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
938
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
939
+ etc.)
940
+
941
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
942
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
943
+ and behavior.
944
+
945
+ Parameters:
946
+ config ([`XverseConfig`]):
947
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
948
+ load the weights associated with the model, only the configuration. Check out the
949
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
950
+ """
951
+
952
+
953
+ @add_start_docstrings(
954
+ "The bare Xverse Model outputting raw hidden-states without any specific head on top.",
955
+ XVERSE_START_DOCSTRING,
956
+ )
957
+ class XversePreTrainedModel(PreTrainedModel):
958
+ config_class = XverseConfig
959
+ base_model_prefix = "model"
960
+ supports_gradient_checkpointing = True
961
+ _no_split_modules = ["XverseMoEDecoderLayer"]
962
+ _skip_keys_device_placement = ["past_key_values"]
963
+ _supports_flash_attn_2 = True
964
+ _supports_sdpa = True
965
+ _supports_cache_class = True
966
+
967
+ def _init_weights(self, module):
968
+ std = self.config.initializer_range
969
+ if isinstance(module, nn.Linear):
970
+ module.weight.data.normal_(mean=0.0, std=std)
971
+ if module.bias is not None:
972
+ module.bias.data.zero_()
973
+ elif isinstance(module, nn.Embedding):
974
+ module.weight.data.normal_(mean=0.0, std=std)
975
+ if module.padding_idx is not None:
976
+ module.weight.data[module.padding_idx].zero_()
977
+
978
+ def _setup_cache(self, cache_cls, max_batch_size, max_cache_len: Optional[int] = None):
979
+ if self.config._attn_implementation == "flash_attention_2" and cache_cls == StaticCache:
980
+ raise ValueError(
981
+ "`static` cache implementation is not compatible with `attn_implementation==flash_attention_2` "
982
+ "make sure to use `sdpa` in the meantime, and open an issue at https://github.com/huggingface/transformers"
983
+ )
984
+
985
+ if max_cache_len > self.model.causal_mask.shape[-1] or self.device != self.model.causal_mask.device:
986
+ causal_mask = torch.full(
987
+ (max_cache_len, max_cache_len), fill_value=True, device=self.device, dtype=torch.bool
988
+ )
989
+ self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
990
+
991
+ for layer in self.model.layers:
992
+ weights = layer.self_attn.o_proj.weight
993
+ layer.self_attn.past_key_value = cache_cls(
994
+ self.config, max_batch_size, max_cache_len, device=weights.device, dtype=weights.dtype
995
+ )
996
+
997
+ def _reset_cache(self):
998
+ for layer in self.model.layers:
999
+ layer.self_attn.past_key_value = None
1000
+
1001
+
1002
+ XVERSE_INPUTS_DOCSTRING = r"""
1003
+ Args:
1004
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
1005
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
1006
+ it.
1007
+
1008
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
1009
+ [`PreTrainedTokenizer.__call__`] for details.
1010
+
1011
+ [What are input IDs?](../glossary#input-ids)
1012
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
1013
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
1014
+
1015
+ - 1 for tokens that are **not masked**,
1016
+ - 0 for tokens that are **masked**.
1017
+
1018
+ [What are attention masks?](../glossary#attention-mask)
1019
+
1020
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
1021
+ [`PreTrainedTokenizer.__call__`] for details.
1022
+
1023
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
1024
+ `past_key_values`).
1025
+
1026
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
1027
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
1028
+ information on the default strategy.
1029
+
1030
+ - 1 indicates the head is **not masked**,
1031
+ - 0 indicates the head is **masked**.
1032
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1033
+ Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
1034
+ config.n_positions - 1]`.
1035
+
1036
+ [What are position IDs?](../glossary#position-ids)
1037
+ past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
1038
+ Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
1039
+ blocks) that can be used to speed up sequential decoding. This typically consists of the `past_key_values`
1040
+ returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
1041
+
1042
+ Two formats are allowed:
1043
+ - a [`~cache_utils.Cache`] instance;
1044
+ - Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
1045
+ shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`). This is also known as the legacy
1046
+ cache format.
1047
+
1048
+ The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
1049
+ legacy cache format will be returned.
1050
+
1051
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
1052
+ have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
1053
+ of shape `(batch_size, sequence_length)`.
1054
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
1055
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
1056
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
1057
+ model's internal embedding lookup matrix.
1058
+ use_cache (`bool`, *optional*):
1059
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
1060
+ `past_key_values`).
1061
+ output_attentions (`bool`, *optional*):
1062
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
1063
+ tensors for more detail.
1064
+ output_hidden_states (`bool`, *optional*):
1065
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
1066
+ more detail.
1067
+ return_dict (`bool`, *optional*):
1068
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
1069
+ """
1070
+
1071
+
1072
+ @add_start_docstrings(
1073
+ "The bare Xverse Model outputting raw hidden-states without any specific head on top.",
1074
+ XVERSE_START_DOCSTRING,
1075
+ )
1076
+ class XverseMoEModel(XversePreTrainedModel):
1077
+ """
1078
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`XverseMoEDecoderLayer`]
1079
+
1080
+ Args:
1081
+ config: XverseConfig
1082
+ """
1083
+
1084
+ def __init__(self, config: XverseConfig):
1085
+ super().__init__(config)
1086
+ self.padding_idx = config.pad_token_id
1087
+ self.vocab_size = config.vocab_size
1088
+
1089
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
1090
+ self.layers = nn.ModuleList(
1091
+ [XverseMoEDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
1092
+ )
1093
+ self.norm = XverseRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
1094
+ self.gradient_checkpointing = False
1095
+
1096
+ # Register a causal mask to separate causal and padding mask creation. Merging happens in the attention class.
1097
+ # NOTE: This is not friendly with TorchScript, ONNX, ExportedProgram serialization for very large `max_position_embeddings`.
1098
+ causal_mask = torch.full(
1099
+ (config.max_position_embeddings, config.max_position_embeddings), fill_value=True, dtype=torch.bool
1100
+ )
1101
+ self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
1102
+ # Initialize weights and apply final processing
1103
+ self.post_init()
1104
+
1105
+ def get_input_embeddings(self):
1106
+ return self.embed_tokens
1107
+
1108
+ def set_input_embeddings(self, value):
1109
+ self.embed_tokens = value
1110
+
1111
+ @add_start_docstrings_to_model_forward(XVERSE_INPUTS_DOCSTRING)
1112
+ def forward(
1113
+ self,
1114
+ input_ids: torch.LongTensor = None,
1115
+ attention_mask: Optional[torch.Tensor] = None,
1116
+ position_ids: Optional[torch.LongTensor] = None,
1117
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1118
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1119
+ use_cache: Optional[bool] = None,
1120
+ output_attentions: Optional[bool] = None,
1121
+ output_hidden_states: Optional[bool] = None,
1122
+ output_router_logits: Optional[bool] = None,
1123
+ return_dict: Optional[bool] = None,
1124
+ cache_position: Optional[torch.LongTensor] = None,
1125
+ ) -> Union[Tuple, MoeModelOutputWithPast]:
1126
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1127
+ output_router_logits = (
1128
+ output_router_logits if output_router_logits is not None else self.config.output_router_logits
1129
+ )
1130
+ output_hidden_states = (
1131
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1132
+ )
1133
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
1134
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1135
+
1136
+ if (input_ids is None) ^ (inputs_embeds is not None):
1137
+ raise ValueError(
1138
+ "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
1139
+ )
1140
+
1141
+ if self.gradient_checkpointing and self.training and use_cache:
1142
+ logger.warning_once(
1143
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
1144
+ )
1145
+ use_cache = False
1146
+
1147
+ if inputs_embeds is None:
1148
+ inputs_embeds = self.embed_tokens(input_ids)
1149
+
1150
+ past_seen_tokens = 0
1151
+ if use_cache: # kept for BC (cache positions)
1152
+ if not isinstance(past_key_values, StaticCache):
1153
+ past_key_values = DynamicCache.from_legacy_cache(past_key_values)
1154
+ past_seen_tokens = past_key_values.get_seq_length()
1155
+
1156
+ if cache_position is None:
1157
+ if isinstance(past_key_values, StaticCache):
1158
+ raise ValueError("cache_position is a required argument when using StaticCache.")
1159
+ cache_position = torch.arange(
1160
+ past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
1161
+ )
1162
+
1163
+ if position_ids is None:
1164
+ position_ids = cache_position.unsqueeze(0)
1165
+
1166
+ causal_mask = self._update_causal_mask(attention_mask, inputs_embeds)
1167
+
1168
+ # embed positions
1169
+ hidden_states = inputs_embeds
1170
+
1171
+ # decoder layers
1172
+ all_hidden_states = () if output_hidden_states else None
1173
+ all_self_attns = () if output_attentions else None
1174
+ all_router_logits = () if output_router_logits else None
1175
+ next_decoder_cache = None
1176
+
1177
+ for decoder_layer in self.layers:
1178
+ if output_hidden_states:
1179
+ all_hidden_states += (hidden_states,)
1180
+
1181
+ if self.gradient_checkpointing and self.training:
1182
+ layer_outputs = self._gradient_checkpointing_func(
1183
+ decoder_layer.__call__,
1184
+ hidden_states,
1185
+ causal_mask,
1186
+ position_ids,
1187
+ past_key_values,
1188
+ output_attentions,
1189
+ output_router_logits,
1190
+ use_cache,
1191
+ cache_position,
1192
+ )
1193
+ else:
1194
+ layer_outputs = decoder_layer(
1195
+ hidden_states,
1196
+ attention_mask=causal_mask,
1197
+ position_ids=position_ids,
1198
+ past_key_value=past_key_values,
1199
+ output_attentions=output_attentions,
1200
+ output_router_logits=output_router_logits,
1201
+ use_cache=use_cache,
1202
+ cache_position=cache_position,
1203
+ )
1204
+
1205
+ hidden_states = layer_outputs[0]
1206
+
1207
+ if use_cache:
1208
+ next_decoder_cache = layer_outputs[2 if output_attentions else 1]
1209
+
1210
+ if output_attentions:
1211
+ all_self_attns += (layer_outputs[1],)
1212
+
1213
+ if output_router_logits:
1214
+ all_router_logits += (layer_outputs[-1],)
1215
+
1216
+ hidden_states = self.norm(hidden_states)
1217
+
1218
+ # add hidden states from the last decoder layer
1219
+ if output_hidden_states:
1220
+ all_hidden_states += (hidden_states,)
1221
+
1222
+ next_cache = None
1223
+ if use_cache:
1224
+ next_cache = (
1225
+ next_decoder_cache.to_legacy_cache() if isinstance(next_decoder_cache, Cache) else next_decoder_cache
1226
+ )
1227
+ if not return_dict:
1228
+ return tuple(v for v in [
1229
+ hidden_states, next_cache, all_hidden_states, all_self_attns,
1230
+ all_router_logits
1231
+ ] if v is not None)
1232
+
1233
+ return MoeModelOutputWithPast(
1234
+ last_hidden_state=hidden_states,
1235
+ past_key_values=next_cache,
1236
+ hidden_states=all_hidden_states,
1237
+ attentions=all_self_attns,
1238
+ router_logits=all_router_logits,
1239
+ )
1240
+
1241
+ # TODO: As of torch==2.2.0, the `attention_mask` passed to the model in `generate` is 2D and of dynamic length even when the static
1242
+ # KV cache is used. This is an issue for torch.compile which then recaptures cudagraphs at each decode steps due to the dynamic shapes.
1243
+ # (`recording cudagraph tree for symint key 13`, etc.), which is VERY slow. A workaround is `@torch.compiler.disable`, but this prevents using
1244
+ # `fullgraph=True`. See more context in https://github.com/huggingface/transformers/pull/29114
1245
+ def _update_causal_mask(self, attention_mask, input_tensor):
1246
+ if self.config._attn_implementation == "flash_attention_2":
1247
+ if attention_mask is not None and 0.0 in attention_mask:
1248
+ return attention_mask
1249
+ return None
1250
+
1251
+ batch_size, seq_length = input_tensor.shape[:2]
1252
+ dtype = input_tensor.dtype
1253
+ device = input_tensor.device
1254
+
1255
+ # support going beyond cached `max_position_embedding`
1256
+ if seq_length > self.causal_mask.shape[-1]:
1257
+ causal_mask = torch.full((2 * self.causal_mask.shape[-1], 2 * self.causal_mask.shape[-1]), fill_value=1)
1258
+ self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
1259
+
1260
+ # We use the current dtype to avoid any overflows
1261
+ min_dtype = torch.finfo(dtype).min
1262
+ causal_mask = self.causal_mask[None, None, :, :].repeat(batch_size, 1, 1, 1).to(dtype) * min_dtype
1263
+
1264
+ causal_mask = causal_mask.to(dtype=dtype, device=device)
1265
+ if attention_mask is not None and attention_mask.dim() == 2:
1266
+ mask_length = attention_mask.shape[-1]
1267
+ padding_mask = causal_mask[..., :mask_length].eq(0.0) * attention_mask[:, None, None, :].eq(0.0)
1268
+ causal_mask[..., :mask_length] = causal_mask[..., :mask_length].masked_fill(padding_mask, min_dtype)
1269
+
1270
+ if self.config._attn_implementation == "sdpa" and attention_mask is not None:
1271
+ # TODO: For dynamo, rather use a check on fullgraph=True once this is possible (https://github.com/pytorch/pytorch/pull/120400).
1272
+ is_tracing = (
1273
+ torch.jit.is_tracing()
1274
+ or isinstance(input_tensor, torch.fx.Proxy)
1275
+ or (hasattr(torch, "_dynamo") and torch._dynamo.is_compiling())
1276
+ )
1277
+ if not is_tracing and torch.any(attention_mask != 1):
1278
+ # Attend to all tokens in masked rows from the causal_mask, for example the relevant first rows when
1279
+ # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
1280
+ # Details: https://github.com/pytorch/pytorch/issues/110213
1281
+ causal_mask = causal_mask.mul(~torch.all(causal_mask == min_dtype, dim=-1, keepdim=True)).to(dtype)
1282
+
1283
+ return causal_mask
1284
+ class XverseForCausalLM(XversePreTrainedModel):
1285
+ _tied_weights_keys = ["lm_head.weight"]
1286
+
1287
+ def __init__(self, config):
1288
+ super().__init__(config)
1289
+ self.model = XverseMoEModel(config)
1290
+ self.vocab_size = config.vocab_size
1291
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1292
+
1293
+ self.router_aux_loss_coef = config.router_aux_loss_coef
1294
+ self.num_experts = config.num_experts
1295
+ self.moe_top_k = config.moe_top_k
1296
+ # Initialize weights and apply final processing
1297
+ self.post_init()
1298
+
1299
+ def get_input_embeddings(self):
1300
+ return self.model.embed_tokens
1301
+
1302
+ def set_input_embeddings(self, value):
1303
+ self.model.embed_tokens = value
1304
+
1305
+ def get_output_embeddings(self):
1306
+ return self.lm_head
1307
+
1308
+ def set_output_embeddings(self, new_embeddings):
1309
+ self.lm_head = new_embeddings
1310
+
1311
+ def set_decoder(self, decoder):
1312
+ self.model = decoder
1313
+
1314
+ def get_decoder(self):
1315
+ return self.model
1316
+
1317
+ @add_start_docstrings_to_model_forward(XVERSE_INPUTS_DOCSTRING)
1318
+ @replace_return_docstrings(output_type=MoeCausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1319
+ def forward(
1320
+ self,
1321
+ input_ids: torch.LongTensor = None,
1322
+ attention_mask: Optional[torch.Tensor] = None,
1323
+ position_ids: Optional[torch.LongTensor] = None,
1324
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1325
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1326
+ labels: Optional[torch.LongTensor] = None,
1327
+ use_cache: Optional[bool] = None,
1328
+ output_attentions: Optional[bool] = None,
1329
+ output_hidden_states: Optional[bool] = None,
1330
+ output_router_logits: Optional[bool] = None,
1331
+ return_dict: Optional[bool] = None,
1332
+ cache_position: Optional[torch.LongTensor] = None,
1333
+ ) -> Union[Tuple, MoeCausalLMOutputWithPast]:
1334
+ r"""
1335
+ Args:
1336
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1337
+ Labels for computing the causal language modeling loss. Indices should either be in `[0, ...,
1338
+ config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1339
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
1340
+
1341
+ Returns:
1342
+
1343
+ Example:
1344
+
1345
+ ```python
1346
+ >>> from transformers import AutoModelForCausalLM, AutoTokenizer
1347
+
1348
+ >>> model = AutoModelForCausalLM.from_pretrained("xverse/XVERSE-MoE-A4.2B", trust_remote_code=True)
1349
+ >>> tokenizer = AutoTokenizer.from_pretrained("xverse/XVERSE-MoE-A4.2B")
1350
+
1351
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
1352
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1353
+
1354
+ >>> # Generate
1355
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
1356
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1357
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
1358
+ ```"""
1359
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1360
+ output_router_logits = (
1361
+ output_router_logits if output_router_logits is not None else self.config.output_router_logits
1362
+ )
1363
+ output_hidden_states = (
1364
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1365
+ )
1366
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1367
+
1368
+ # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
1369
+ outputs = self.model(
1370
+ input_ids=input_ids,
1371
+ attention_mask=attention_mask,
1372
+ position_ids=position_ids,
1373
+ past_key_values=past_key_values,
1374
+ inputs_embeds=inputs_embeds,
1375
+ use_cache=use_cache,
1376
+ output_attentions=output_attentions,
1377
+ output_hidden_states=output_hidden_states,
1378
+ output_router_logits=output_router_logits,
1379
+ return_dict=return_dict,
1380
+ cache_position=cache_position,
1381
+ )
1382
+
1383
+ hidden_states = outputs[0]
1384
+ if self.config.pretraining_tp > 1:
1385
+ lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
1386
+ logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
1387
+ logits = torch.cat(logits, dim=-1)
1388
+ else:
1389
+ logits = self.lm_head(hidden_states)
1390
+ logits = logits.float()
1391
+
1392
+ loss = None
1393
+ if labels is not None:
1394
+ # Shift so that tokens < n predict n
1395
+ shift_logits = logits[..., :-1, :].contiguous()
1396
+ shift_labels = labels[..., 1:].contiguous()
1397
+ # Flatten the tokens
1398
+ loss_fct = CrossEntropyLoss()
1399
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1400
+ shift_labels = shift_labels.view(-1)
1401
+ # Enable model parallelism
1402
+ shift_labels = shift_labels.to(shift_logits.device)
1403
+ loss = loss_fct(shift_logits, shift_labels)
1404
+
1405
+ aux_loss = None
1406
+ if output_router_logits:
1407
+ aux_loss = load_balancing_loss_func(
1408
+ outputs.router_logits if return_dict else outputs[-1],
1409
+ self.num_experts,
1410
+ self.moe_top_k,
1411
+ attention_mask,
1412
+ )
1413
+ if labels is not None:
1414
+ loss += self.router_aux_loss_coef * aux_loss.to(loss.device) # make sure to reside in the same device
1415
+
1416
+ if not return_dict:
1417
+ output = (logits,) + outputs[1:]
1418
+ if output_router_logits:
1419
+ output = (aux_loss,) + output
1420
+ return (loss,) + output if loss is not None else output
1421
+
1422
+ return MoeCausalLMOutputWithPast(
1423
+ loss=loss,
1424
+ aux_loss=aux_loss,
1425
+ logits=logits,
1426
+ past_key_values=outputs.past_key_values,
1427
+ hidden_states=outputs.hidden_states,
1428
+ attentions=outputs.attentions,
1429
+ router_logits=outputs.router_logits,
1430
+ )
1431
+
1432
+ def prepare_inputs_for_generation(
1433
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, cache_position=None, **kwargs
1434
+ ):
1435
+ # With static cache, the `past_key_values` is None
1436
+ # TODO joao: standardize interface for the different Cache classes and remove this `if`
1437
+ has_static_cache = False
1438
+ if past_key_values is None:
1439
+ past_key_values = getattr(getattr(self.model.layers[0], "self_attn", {}), "past_key_value", None)
1440
+ has_static_cache = past_key_values is not None
1441
+
1442
+ past_length = 0
1443
+ if past_key_values is not None:
1444
+ if isinstance(past_key_values, Cache):
1445
+ past_length = cache_position[0] if cache_position is not None else past_key_values.get_seq_length()
1446
+ max_cache_length = (
1447
+ torch.tensor(past_key_values.get_max_length(), device=input_ids.device)
1448
+ if past_key_values.get_max_length() is not None
1449
+ else None
1450
+ )
1451
+ cache_length = past_length if max_cache_length is None else torch.min(max_cache_length, past_length)
1452
+ # TODO joao: remove this `else` after `generate` prioritizes `Cache` objects
1453
+ else:
1454
+ cache_length = past_length = past_key_values[0][0].shape[2]
1455
+ max_cache_length = None
1456
+
1457
+ # Keep only the unprocessed tokens:
1458
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
1459
+ # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
1460
+ # input)
1461
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
1462
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
1463
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
1464
+ # input_ids based on the past_length.
1465
+ elif past_length < input_ids.shape[1]:
1466
+ input_ids = input_ids[:, past_length:]
1467
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
1468
+
1469
+ # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
1470
+ if (
1471
+ max_cache_length is not None
1472
+ and attention_mask is not None
1473
+ and cache_length + input_ids.shape[1] > max_cache_length
1474
+ ):
1475
+ attention_mask = attention_mask[:, -max_cache_length:]
1476
+
1477
+ position_ids = kwargs.get("position_ids", None)
1478
+ if attention_mask is not None and position_ids is None:
1479
+ # create position_ids on the fly for batch generation
1480
+ position_ids = attention_mask.long().cumsum(-1) - 1
1481
+ position_ids.masked_fill_(attention_mask == 0, 1)
1482
+ if past_key_values:
1483
+ position_ids = position_ids[:, -input_ids.shape[1] :]
1484
+
1485
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1486
+ if inputs_embeds is not None and past_key_values is None:
1487
+ model_inputs = {"inputs_embeds": inputs_embeds}
1488
+ else:
1489
+ # The `contiguous()` here is necessary to have a static stride during decoding. torchdynamo otherwise
1490
+ # recompiles graphs as the stride of the inputs is a guard. Ref: https://github.com/huggingface/transformers/pull/29114
1491
+ # TODO: use `next_tokens` directly instead.
1492
+ model_inputs = {"input_ids": input_ids.contiguous()}
1493
+
1494
+ input_length = position_ids.shape[-1] if position_ids is not None else input_ids.shape[-1]
1495
+ if cache_position is None:
1496
+ cache_position = torch.arange(past_length, past_length + input_length, device=input_ids.device)
1497
+ else:
1498
+ cache_position = cache_position[-input_length:]
1499
+
1500
+ if has_static_cache:
1501
+ past_key_values = None
1502
+
1503
+ model_inputs.update(
1504
+ {
1505
+ "position_ids": position_ids,
1506
+ "cache_position": cache_position,
1507
+ "past_key_values": past_key_values,
1508
+ "use_cache": kwargs.get("use_cache"),
1509
+ "attention_mask": attention_mask,
1510
+ }
1511
+ )
1512
+ return model_inputs
1513
+
1514
+ @staticmethod
1515
+ def _reorder_cache(past_key_values, beam_idx):
1516
+ reordered_past = ()
1517
+ for layer_past in past_key_values:
1518
+ reordered_past += (
1519
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
1520
+ )
1521
+ return reordered_past
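The position-id handling in `prepare_inputs_for_generation` above can be sketched without PyTorch. This is a minimal illustration under stated assumptions, not the model's code: `position_ids_from_mask` and `keep_unprocessed` are hypothetical helper names, and plain Python lists stand in for tensors.

```python
from typing import List


def position_ids_from_mask(attention_mask: List[List[int]]) -> List[List[int]]:
    """List-based analogue of `attention_mask.long().cumsum(-1) - 1`
    followed by `masked_fill_(attention_mask == 0, 1)`."""
    result = []
    for row in attention_mask:
        running = 0
        ids = []
        for m in row:
            running += m
            # Real tokens count up from 0; padding slots get the dummy position 1.
            ids.append(running - 1 if m else 1)
        result.append(ids)
    return result


def keep_unprocessed(input_ids: List[List[int]], past_length: int) -> List[List[int]]:
    """Case 2 of the trimming logic: drop tokens already held in the KV cache.
    If the cache already covers the whole input, the input is returned as-is."""
    if past_length < len(input_ids[0]):
        return [row[past_length:] for row in input_ids]
    return input_ids


# First row is left-padded by two tokens, second row has no padding.
mask = [[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]]
position_ids = position_ids_from_mask(mask)
print(position_ids)
```

With left padding, the first real token of each row still receives position 0, which is why the diff derives positions from the mask rather than from raw indices.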
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bos_token": {
3
+ "content": "<|startoftext|>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|endoftext|>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "<pad>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ }
23
+ }
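As a rough sketch of how the `special_tokens_map.json` content above can be read, the snippet below parses the same structure with the standard library and extracts the token strings. The JSON is inlined rather than loaded from disk, and `token_strings` is a hypothetical helper name used only for illustration.

```python
import json

# Inline copy of the structure shown in the diff (same keys and values).
SPECIAL_TOKENS_JSON = """
{
  "bos_token": {"content": "<|startoftext|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
  "eos_token": {"content": "<|endoftext|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
  "pad_token": {"content": "<pad>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false}
}
"""


def token_strings(special_tokens_map: dict) -> dict:
    """Map each special-token role to its literal string."""
    return {role: spec["content"] for role, spec in special_tokens_map.items()}


tokens = token_strings(json.loads(SPECIAL_TOKENS_JSON))
print(tokens)
```

Because a dedicated `<pad>` token is defined, padding does not need to reuse the end-of-sequence token.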
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,5 @@
1
+ {
2
+ "clean_up_tokenization_spaces": true,
3
+ "model_max_length": 1000000000000000019884624838656,
4
+ "tokenizer_class": "PreTrainedTokenizerFast"
5
+ }