horiz94 committed on
Commit
f470c67
1 Parent(s): bd3213b

Upload 9 files

README.md CHANGED
@@ -1,3 +1,162 @@
1
  ---
2
  license: apache-2.0
3
  ---
4
+
5
+
6
+ # Tele-FLM
7
+ Tele-FLM (aka FLM-2) is a 52B open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgement capabilities.
8
+ Built upon the decoder-only transformer architecture, it has been trained on approximately 2T tokens.
9
+ Tele-FLM demonstrates strong performance for its scale and sometimes surpasses larger models.
10
+ In addition to sharing the model weights, we provide the core designs, engineering practices, and training details, anticipating their benefits for both academic and industrial communities.
11
+
12
+ ## Model Details
13
+
14
+ - **Developed by:** BAAI & TeleAI
15
+ - **Language(s):** English; Chinese; Other languages
16
+ - **License:** Apache 2.0
17
+
18
+
19
+
20
+ ## Bias, Risks, and Limitations
21
+
22
+ Although we've made extensive efforts to thoroughly clean and filter the training corpus for the model, due to the open nature of the dataset, the model may still have picked up on some unsafe examples. Consequently, the model may still generate unexpected content, including but not limited to discrimination, bias, or offensive language. We would like to strongly advise users not to spread any unsafe content generated by the model. The project developers cannot be held responsible for any repercussions stemming from the dissemination of harmful information.
23
+
24
+
25
+ ## Quick Start
26
+
27
+ Use the code below to get started with Tele-FLM.
28
+
29
+ ```python
30
+ import torch
31
+ from transformers import AutoTokenizer, AutoModelForCausalLM
32
+ tokenizer = AutoTokenizer.from_pretrained('CofeAI/Tele-FLM', trust_remote_code=True)
33
+ model = AutoModelForCausalLM.from_pretrained('CofeAI/Tele-FLM', torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, device_map="auto", trust_remote_code=True)
34
+ inputs = tokenizer('北京市是中国的首都', return_tensors='pt').to(model.device)
35
+ generated = model.generate(**inputs, max_new_tokens=128, repetition_penalty=1.03)
36
+ print(tokenizer.decode(generated.cpu()[0], skip_special_tokens=True))
37
+ ```
38
+
39
+ ## Training Details
40
+
41
+ ### Training Data
42
+ Our training dataset comprises a variety of domains, as detailed in the table below.
43
+ The total amount of data is roughly 2 trillion tokens, with English and Chinese data in a ratio of about 2:1.
44
+ In line with the methodology of GPT-4, we collected some instruction data and incorporated it into our pre-training data after removing the test sets of common datasets using a strict n-gram-based method. We deliberately avoid “training on the test set” or any other benchmark-oriented trick.
45
+ |Domain |Language|Sampling Prop. |Epochs |Disk Size |
46
+ |-------|:--------------:|:--------------:|:-------:|:-----------:|
47
+ | Webtext |en, zh | 75.21% | 1.0 | 5.9 TB |
48
+ | Code |code, zh | 9.81% | 1.0 | 528.1 GB |
49
+ | Book |en, zh | 7.17% | 0.8 | 647.6 GB |
50
+ | WorldKnowledge |multi, en, zh | 2.87% | 2.5 | 67.5 GB |
51
+ | QA |en, zh | 2.12% | 1.0 | 159.2 GB |
52
+ | AcademicPaper |en | 0.99% | 1.0 | 54.4 GB |
53
+ | Profession-Law |zh | 1.04% | 1.0 | 84.2 GB |
54
+ | Profession-Math |math | 0.62% | 2.0 | 6.1 GB |
55
+ | Profession-Patent |zh | 0.14% | 1.0 | 10.4 GB |
56
+ | Profession-Medical |zh | 0.02% | 1.0 | 1.2 GB |
57
+ | Classical Chinese |zh | 0.02% | 2.5 | 0.5 GB |
58
+
59
+
60
+ ### Model Architecture
61
+ We adopt the architecture of FLM-101B as the backbone for Tele-FLM, with several modifications:
62
+ - Rotary Positional Embedding (RoPE)
63
+ - RMSNorm for normalization
64
+ - SwiGLU for activation function
65
+ - Linear bias disabled
66
+ - Embedding and language model head untied
67
+
68
+ Consequently, Tele-FLM is largely compatible with Llama architecturally.
69
+ To maximize convenience for the community, we made minimal adjustments to Llama's code to adapt it to Tele-FLM and released it as open source.
70
+
71
+ In the pre-training stage, we employ μP for optimal hyperparameter search. The μP model (Tele-FLM_μP) is architecturally identical to Tele-FLM except for the model width (hidden size, FFN hidden size, and number of attention heads).
72
+ The architecture of Tele-FLM and Tele-FLM_μP is listed below.
73
+ For more details of μP, please refer to our technical report and the original Tensor Program papers.
74
+
75
+ | Models | layer<br>number | attention<br>heads| hidden<br>size | ffn hidden<br>size| vocab<br>size | context<br>length | param size<br>(M) |
76
+ |--------|--------------|----------------|-------------|----------------|------------|----------------|----------------|
77
+ | Tele-FLM | 64 | 64 | 8,192 | 21,824 | 80,000 | 4,096 | 52,850 |
78
+ | Tele-FLM_μP | 64 | 4 | 512 | 1,344 | 80,000 | 4,096 | 283 |
79
+
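The parameter sizes in the architecture table can be roughly cross-checked from the listed dimensions. The sketch below is an estimate under stated assumptions, not the authors' accounting: it assumes a Llama-style decoder with no linear biases, untied input embedding and LM head, a three-matrix SwiGLU MLP, full multi-head attention, and two RMSNorm weight vectors per layer plus a final norm. The helper name `estimate_params` is hypothetical.

```python
def estimate_params(layers, hidden, ffn_hidden, vocab):
    """Rough parameter count for a Llama-style decoder (hypothetical helper)."""
    attention = 4 * hidden * hidden      # q, k, v, o projections; assumes full multi-head attention
    mlp = 3 * hidden * ffn_hidden        # gate, up, and down projections of the SwiGLU MLP
    norms = 2 * hidden                   # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = 2 * vocab * hidden      # input embedding plus untied LM head
    final_norm = hidden                  # RMSNorm before the LM head
    return layers * per_layer + embeddings + final_norm


# Dimensions taken from the architecture table above.
tele_flm = estimate_params(layers=64, hidden=8192, ffn_hidden=21824, vocab=80000)
tele_flm_mup = estimate_params(layers=64, hidden=512, ffn_hidden=1344, vocab=80000)

print(f"Tele-FLM estimate:     {tele_flm / 1e6:,.0f}M parameters")
print(f"Tele-FLM_muP estimate: {tele_flm_mup / 1e6:,.0f}M parameters")
```

Under these assumptions both estimates land within roughly 1% of the sizes reported in the table; the small gap suggests the reported figures count a few terms differently.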
80
+
81
+
82
+
83
+ ### Training Hyperparameters
84
+
85
+ Due to the smaller size, Tele-FLM_μP allows for significantly more experimental runs within fixed time and resource constraints.
86
+ We searched six hyperparameters for pretraining. All the hyperparameters are shown below.
87
+
88
+
89
+ | Searched Hyperparameters ||| Non-Searched Hyperparameters ||
90
+ |--------------------------------------------|-|-|-|----------------------------------|
91
+ | Learning Rate | 1.5e-4 || LR Schedule Type | cosine |
92
+ | Matrix Learning Rate | 1.5e-4 || LR Schedule (tokens) | 2.5T |
93
+ | Minimum Learning Rate | 1.5e-5 || Warmup Step | 2,000 |
94
+ | Standard Deviation | 4e-3 || Clip Grad | 1.0 |
95
+ | Matrix Standard Deviation | 4.242e-3 || Weight Decay | 0.0 |
96
+ | Input Mult | 1.0 || Batch Size (tokens) | 5,505,024 |
97
+ | Output Mult | 3.125e-2 || RoPE Theta | 10,000 |
98
+
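To make the schedule-related entries in the table concrete, here is a minimal sketch of a cosine learning-rate schedule built from those values. The linear warmup shape and the function name `lr_at` are assumptions for illustration; the table only states the schedule type, the warmup step count, and the peak and minimum learning rates.

```python
import math

PEAK_LR = 1.5e-4                    # "Learning Rate" in the table
MIN_LR = 1.5e-5                     # "Minimum Learning Rate"
WARMUP_STEPS = 2_000                # "Warmup Step"
SCHEDULE_TOKENS = 2_500_000_000_000  # "LR Schedule (tokens)": 2.5T
BATCH_TOKENS = 5_505_024            # "Batch Size (tokens)"

# Number of optimizer steps the schedule spans.
TOTAL_STEPS = SCHEDULE_TOKENS // BATCH_TOKENS


def lr_at(step):
    """Learning rate at a given step: linear warmup (assumed), then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.3e}")
```

With these numbers the schedule spans a little over 454,000 steps, so the 2,000 warmup steps account for well under 1% of it.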
99
+
100
+ ### Training Loss
101
+
102
+
103
+ <p align="center" width="100%">
104
+ <a><img src="figures/train_loss.png" alt="Tele-FLM training loss curve" style="width: 90%; min-width: 500px; display: block; margin: auto;"></a>
105
+ </p>
106
+
107
+
108
+ #### Hardware
109
+
110
+ Tele-FLM is trained on a cluster of 112 A800 SXM4 GPU servers, each with 8 NVLink A800 GPUs and 2TB of RAM.
111
+ The nodes have varied CPU configurations: 96 nodes with Intel 8358 (128x 2.60GHz) CPUs and 16 nodes with AMD 7643 (96x 2.30GHz) CPUs.
112
+ All nodes are interconnected via InfiniBand (IB). The training process lasted around two months, including downtime due to unexpected factors.
113
+
114
+ #### Software
115
+
116
+ Tele-FLM utilizes 3D parallel training, combining the prevailing methodologies: data parallelism, tensor parallelism, and pipeline parallelism.
117
+ The parallel training setup for Tele-FLM is configured as follows: tensor parallel=4, pipeline parallel=2, and data parallel=112.
118
+
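The parallel layout can be checked against the hardware description with a little arithmetic. The sketch below uses only numbers stated in this README; the assumption that every training sequence fills the full 4,096-token context is mine, and how each replica splits its share into micro-batches is not specified here.

```python
# Hardware: 112 servers with 8 GPUs each.
servers, gpus_per_server = 112, 8
total_gpus = servers * gpus_per_server

# 3D parallel layout from the Software section.
tensor_parallel, pipeline_parallel, data_parallel = 4, 2, 112
gpus_used = tensor_parallel * pipeline_parallel * data_parallel

# Batch size and context length from the hyperparameter and architecture tables.
batch_tokens, context_length = 5_505_024, 4_096
sequences_per_batch = batch_tokens // context_length           # assumes full-length sequences
sequences_per_replica = sequences_per_batch // data_parallel   # share of each data-parallel replica

print(total_gpus, gpus_used, sequences_per_batch, sequences_per_replica)
```

The product of the three parallel degrees matches the GPU count of the cluster, and the global batch divides evenly across the data-parallel replicas.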
119
+
120
+
121
+
122
+ ## Evaluation
123
+
124
+ ### English
125
+
126
+ #### Open LLM Leaderboard
127
+ | Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | WinoGrande | GSM8K | HumanEval | BBH |
128
+ |------------|:-------:|:-------:|:---------:|:------:|:----------:|:---------:|:------:|:---------:|:------:|
129
+ | | | 25-shot | 10-shot | 5-shot | zero-shot | 5-shot | 5-shot | zero-shot | 3-shot |
130
+ | LLAMA2-70B | 63.39 | 67.32 | 87.33 | 69.83 | 44.92 | 83.74 | 54.06 | 46.95 | 52.94 |
131
+ | LLAMA2-13B | 50.29 | 59.39 | 82.13 | 55.77 | 37.38 | 76.64 | 22.82 | 28.66 | 39.52 |
132
+ | LLAMA-65B | 56.98 | 63.48 | 86.09 | 63.93 | 43.43 | 82.56 | 37.23 | 33.54 | 45.54 |
133
+ | LLAMA-13B | 46.20 | 56.23 | 80.93 | 47.67 | 39.48 | 76.24 | 7.58 | 23.78 | 37.72 |
134
+ | Tele-FLM | 56.60 | 59.47 | 82.25 | 64.00 | 43.09 | 79.40 | 45.19 | 34.76 | 44.60 |
135
+
136
+ ### Chinese
137
+
138
+ #### OpenCompass
139
+ | Model | Average | C-Eval | CMMLU | C3 | CHID | CSL |
140
+ |--------------|:-------:|:------:|:-----:|:-----:|:-----:|:-----:|
141
+ | GPT-4 | 76.64 | 69.90 | 71.00 | 95.10 | 82.20 | 65.00 |
142
+ | GPT-3.5 | 61.86 | 52.50 | 53.90 | 85.60 | 60.40 | 56.90 |
143
+ | Qwen1.5-72B | 80.45 | 83.72 | 83.09 | 81.86 | 91.09 | 62.50 |
144
+ | Qwen-72B | 83.00 | 83.30 | 83.60 | 95.80 | 91.10 | 61.20 |
145
+ | DeepSeek-67B | 73.46 | 66.90 | 70.40 | 77.80 | 89.10 | 63.10 |
146
+ | Tele-FLM | 71.13 | 65.48 | 66.98 | 66.25 | 92.57 | 64.38 |
147
+
148
+
149
+ ## Tech report
150
+ For more details on Tele-FLM's capabilities, see the [Tele-FLM Technical Report](https://arxiv.org/pdf/2404.16645).
151
+
152
+ If you find our work helpful, please consider citing it.
153
+ ```
154
+ @misc{li2024teleflm,
155
+ title={Tele-FLM Technical Report},
156
+ author={Xiang Li and Yiqun Yao and Xin Jiang and Xuezhi Fang and Chao Wang and Xinzhang Liu and Zihan Wang and Yu Zhao and Xin Wang and Yuyao Huang and Shuangyong Song and Yongxiang Li and Zheng Zhang and Bo Zhao and Aixin Sun and Yequan Wang and Zhongjiang He and Zhongyuan Wang and Xuelong Li and Tiejun Huang},
157
+ year={2024},
158
+ eprint={2404.16645},
159
+ archivePrefix={arXiv},
160
+ primaryClass={cs.CL}
161
+ }
162
+ ```
configuration_teleflm.py ADDED
@@ -0,0 +1,196 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """ Tele-FLM model configuration"""
21
+
22
+ from transformers.configuration_utils import PretrainedConfig
23
+ from transformers.utils import logging
24
+
25
+
26
+ logger = logging.get_logger(__name__)
27
+
28
+ TeleFLM_PRETRAINED_CONFIG_ARCHIVE_MAP={}
29
+
30
+
31
+ class TeleFLMConfig(PretrainedConfig):
32
+ r"""
33
+ This is the configuration class to store the configuration of a [`TeleFLM`]. It is used to instantiate a TeleFLM
34
+ model according to the specified arguments, defining the model architecture.
35
+
36
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
37
+ documentation from [`PretrainedConfig`] for more information.
38
+
39
+
40
+ Args:
41
+ vocab_size (`int`, *optional*, defaults to 32000):
42
+ Vocabulary size of the TeleFLM model. Defines the number of different tokens that can be represented by the
43
+ `input_ids` passed when calling [`TeleFLM`]
44
+ hidden_size (`int`, *optional*, defaults to 4096):
45
+ Dimension of the hidden representations.
46
+ intermediate_size (`int`, *optional*, defaults to 11008):
47
+ Dimension of the MLP representations.
48
+ num_hidden_layers (`int`, *optional*, defaults to 32):
49
+ Number of hidden layers in the Transformer decoder.
50
+ num_attention_heads (`int`, *optional*, defaults to 32):
51
+ Number of attention heads for each attention layer in the Transformer decoder.
52
+ num_key_value_heads (`int`, *optional*):
53
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
54
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
55
+ `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
56
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
57
+ by mean-pooling all the original heads within that group. For more details, check out [this
58
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
59
+ `num_attention_heads`.
60
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
61
+ The non-linear activation function (function or string) in the decoder.
62
+ max_position_embeddings (`int`, *optional*, defaults to 2048):
63
+ The maximum sequence length that this model might ever be used with. TeleFLM supports up to 4096 tokens.
64
+ initializer_range (`float`, *optional*, defaults to 0.02):
65
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
66
+ rms_norm_eps (`float`, *optional*, defaults to 1e-06):
67
+ The epsilon used by the rms normalization layers.
68
+ use_cache (`bool`, *optional*, defaults to `True`):
69
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
70
+ relevant if `config.is_decoder=True`.
71
+ pad_token_id (`int`, *optional*):
72
+ Padding token id.
73
+ bos_token_id (`int`, *optional*, defaults to 1):
74
+ Beginning of stream token id.
75
+ eos_token_id (`int`, *optional*, defaults to 2):
76
+ End of stream token id.
77
+ pretraining_tp (`int`, *optional*, defaults to 1):
78
+ Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
79
+ document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to understand more about it. This value is
80
+ necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
81
+ issue](https://github.com/pytorch/pytorch/issues/76232).
82
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
83
+ Whether to tie the input and output word embeddings.
84
+ rope_theta (`float`, *optional*, defaults to 10000.0):
85
+ The base period of the RoPE embeddings.
86
+ rope_scaling (`Dict`, *optional*):
87
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
88
+ strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
89
+ `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
90
+ `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
91
+ these scaling strategies behave:
92
+ https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
93
+ experimental feature, subject to breaking API changes in future versions.
94
+ attention_bias (`bool`, *optional*, defaults to `False`):
95
+ Whether to use a bias in the query, key, value and output projection layers during self-attention.
96
+ attention_dropout (`float`, *optional*, defaults to 0.0):
97
+ The dropout ratio for the attention probabilities.
98
+
99
+ ```python
100
+ >>> from transformers import TeleFLMModel, TeleFLMConfig
101
+
102
+ >>> # Initializing a TeleFLM configuration
103
+ >>> configuration = TeleFLMConfig()
104
+
105
+ >>> # Initializing a model from TeleFLM configuration
106
+ >>> model = TeleFLMModel(configuration)
107
+
108
+ >>> # Accessing the model configuration
109
+ >>> configuration = model.config
110
+ ```"""
111
+
112
+ model_type = "TeleFLM"
113
+ keys_to_ignore_at_inference = ["past_key_values"]
114
+
115
+ def __init__(
116
+ self,
117
+ vocab_size=32000,
118
+ hidden_size=4096,
119
+ intermediate_size=11008,
120
+ num_hidden_layers=32,
121
+ num_attention_heads=32,
122
+ num_key_value_heads=None,
123
+ hidden_act="silu",
124
+ max_position_embeddings=2048,
125
+ initializer_range=0.02,
126
+ rms_norm_eps=1e-6,
127
+ use_cache=True,
128
+ pad_token_id=None,
129
+ bos_token_id=1,
130
+ eos_token_id=2,
131
+ pretraining_tp=1,
132
+ tie_word_embeddings=False,
133
+ rope_theta=10000.0,
134
+ rope_scaling=None,
135
+ attention_bias=False,
136
+ attention_dropout=0.0,
137
+ use_mup=False,
138
+ mup_scale_factor=1.0,
139
+ output_mult=1.0,
140
+ input_mult=1.0,
141
+ **kwargs,
142
+ ):
143
+ self.vocab_size = vocab_size
144
+ self.max_position_embeddings = max_position_embeddings
145
+ self.hidden_size = hidden_size
146
+ self.intermediate_size = intermediate_size
147
+ self.num_hidden_layers = num_hidden_layers
148
+ self.num_attention_heads = num_attention_heads
149
+
150
+ # for backward compatibility
151
+ if num_key_value_heads is None:
152
+ num_key_value_heads = num_attention_heads
153
+
154
+ self.num_key_value_heads = num_key_value_heads
155
+ self.hidden_act = hidden_act
156
+ self.initializer_range = initializer_range
157
+ self.rms_norm_eps = rms_norm_eps
158
+ self.pretraining_tp = pretraining_tp
159
+ self.use_cache = use_cache
160
+ self.rope_theta = rope_theta
161
+ self.rope_scaling = rope_scaling
162
+ self._rope_scaling_validation()
163
+ self.attention_bias = attention_bias
164
+ self.attention_dropout = attention_dropout
165
+ self.use_mup=use_mup
166
+ self.mup_scale_factor=mup_scale_factor
167
+ self.output_mult=output_mult
168
+ self.input_mult=input_mult
169
+
170
+ super().__init__(
171
+ pad_token_id=pad_token_id,
172
+ bos_token_id=bos_token_id,
173
+ eos_token_id=eos_token_id,
174
+ tie_word_embeddings=tie_word_embeddings,
175
+ **kwargs,
176
+ )
177
+
178
+ def _rope_scaling_validation(self):
179
+ """
180
+ Validate the `rope_scaling` configuration.
181
+ """
182
+ if self.rope_scaling is None:
183
+ return
184
+
185
+ if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
186
+ raise ValueError(
187
+ "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, " f"got {self.rope_scaling}"
188
+ )
189
+ rope_scaling_type = self.rope_scaling.get("type", None)
190
+ rope_scaling_factor = self.rope_scaling.get("factor", None)
191
+ if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
192
+ raise ValueError(
193
+ f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
194
+ )
195
+ if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
196
+ raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
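The `rope_scaling` checks in `_rope_scaling_validation` can be exercised without installing `transformers`. The sketch below is a standalone mirror of that logic, written for illustration; `validate_rope_scaling` and `is_valid` are hypothetical helper names, not part of the uploaded code. One consequence worth noticing is that an integer factor such as `2` is rejected, because the check requires a `float`.

```python
def validate_rope_scaling(rope_scaling):
    """Standalone mirror of the checks in `_rope_scaling_validation` (hypothetical helper)."""
    if rope_scaling is None:
        return  # no scaling configured, nothing to validate
    if not isinstance(rope_scaling, dict) or len(rope_scaling) != 2:
        raise ValueError(f"`rope_scaling` must be a dict with `type` and `factor`, got {rope_scaling}")
    scaling_type = rope_scaling.get("type")
    factor = rope_scaling.get("factor")
    if scaling_type not in ("linear", "dynamic"):
        raise ValueError(f"`type` must be 'linear' or 'dynamic', got {scaling_type}")
    # The factor must be a float strictly greater than 1; ints are rejected.
    if not isinstance(factor, float) or factor <= 1.0:
        raise ValueError(f"`factor` must be a float > 1, got {factor}")


def is_valid(rope_scaling):
    """Return True if the configuration passes validation, False otherwise."""
    try:
        validate_rope_scaling(rope_scaling)
        return True
    except ValueError:
        return False


print(is_valid({"type": "linear", "factor": 2.0}))  # float factor above 1
print(is_valid({"type": "linear", "factor": 2}))    # int factor
print(is_valid({"type": "yarn", "factor": 2.0}))    # unsupported strategy name
```

When passing `rope_scaling` to the configuration, writing the factor as `2.0` rather than `2` avoids the type check failing.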
figures/._train_loss.png ADDED
figures/train_loss.png ADDED
modeling_teleflm.py ADDED
@@ -0,0 +1,1524 @@
1
+ # coding=utf-8
2
+ """ PyTorch Tele-FLM model, based on LLAMA implementation. """
3
+
4
+ import math
5
+ import warnings
6
+ from typing import List, Optional, Tuple, Union
7
+
8
+ import torch
9
+ import torch.nn.functional as F
10
+ import torch.utils.checkpoint
11
+ from torch import nn
12
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
13
+
14
+ from transformers.activations import ACT2FN
15
+ from transformers.cache_utils import Cache, DynamicCache, StaticCache
16
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
17
+ from transformers.modeling_outputs import (
18
+ BaseModelOutputWithPast,
19
+ CausalLMOutputWithPast,
20
+ QuestionAnsweringModelOutput,
21
+ SequenceClassifierOutputWithPast,
22
+ )
23
+ from transformers.modeling_utils import PreTrainedModel
24
+ from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
25
+ from transformers.utils import (
26
+ add_start_docstrings,
27
+ add_start_docstrings_to_model_forward,
28
+ is_flash_attn_2_available,
29
+ is_flash_attn_greater_or_equal_2_10,
30
+ logging,
31
+ replace_return_docstrings,
32
+ )
33
+ from .configuration_teleflm import TeleFLMConfig
34
+
35
+ if is_flash_attn_2_available():
36
+ from flash_attn import flash_attn_func, flash_attn_varlen_func
37
+ from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
38
+
39
+
40
+ logger = logging.get_logger(__name__)
41
+
42
+ _CONFIG_FOR_DOC = "TeleFLMConfig"
43
+
44
+
45
+ def _get_unpad_data(attention_mask):
46
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
47
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
48
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
49
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
50
+ return (
51
+ indices,
52
+ cu_seqlens,
53
+ max_seqlen_in_batch,
54
+ )
55
+
56
+
57
+ class TeleFLMRMSNorm(nn.Module):
58
+ def __init__(self, hidden_size, eps=1e-6):
59
+ """
60
+ TeleFLMRMSNorm is equivalent to T5LayerNorm
61
+ """
62
+ super().__init__()
63
+ self.weight = nn.Parameter(torch.ones(hidden_size))
64
+ self.variance_epsilon = eps
65
+
66
+ def forward(self, hidden_states):
67
+ input_dtype = hidden_states.dtype
68
+ hidden_states = hidden_states.to(torch.float32)
69
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
70
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
71
+ return self.weight * hidden_states.to(input_dtype)
72
+
73
+
74
+ ALL_LAYERNORM_LAYERS.append(TeleFLMRMSNorm)
75
+
76
+
77
+ class TeleFLMRotaryEmbedding(nn.Module):
78
+ def __init__(self, dim, max_position_embeddings=4096, base=10000, device=None, scaling_factor=1.0):
79
+ super().__init__()
80
+ self.scaling_factor = scaling_factor
81
+ self.dim = dim
82
+ self.max_position_embeddings = max_position_embeddings
83
+ self.base = base
84
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
85
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
86
+ # For BC we register cos and sin cached
87
+ self.max_seq_len_cached = max_position_embeddings
88
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
89
+ t = t / self.scaling_factor
90
+ freqs = torch.outer(t, self.inv_freq)
91
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
92
+ emb = torch.cat((freqs, freqs), dim=-1)
93
+ self.register_buffer("_cos_cached", emb.cos().to(torch.get_default_dtype()), persistent=False)
94
+ self.register_buffer("_sin_cached", emb.sin().to(torch.get_default_dtype()), persistent=False)
95
+
96
+
97
+ @torch.no_grad()
98
+ def forward(self, x, position_ids):
99
+ # x: [bs, num_attention_heads, seq_len, head_size]
100
+ inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
101
+ position_ids_expanded = position_ids[:, None, :].float()
102
+ # Force float32 since bfloat16 loses precision on long contexts
103
+ # See https://github.com/huggingface/transformers/pull/29285
104
+ device_type = x.device.type
105
+ device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
106
+ with torch.autocast(device_type=device_type, enabled=False):
107
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
108
+ emb = torch.cat((freqs, freqs), dim=-1)
109
+ cos = emb.cos()
110
+ sin = emb.sin()
111
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
112
+
113
+
114
+ class TeleFLMLinearScalingRotaryEmbedding(TeleFLMRotaryEmbedding):
115
+ """TeleFLMRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
116
+
117
+ def forward(self, x, position_ids):
118
+ # difference to the original RoPE: a scaling factor is applied to the position ids
119
+ position_ids = position_ids.float() / self.scaling_factor
120
+ cos, sin = super().forward(x, position_ids)
121
+ return cos, sin
122
+
123
+
124
+ class TeleFLMDynamicNTKScalingRotaryEmbedding(TeleFLMRotaryEmbedding):
125
+ """TeleFLMRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
126
+
127
+ def forward(self, x, position_ids):
128
+ # difference to the original RoPE: inv_freq is recomputed when the sequence length > original length
129
+ seq_len = torch.max(position_ids) + 1
130
+ if seq_len > self.max_position_embeddings:
131
+ base = self.base * (
132
+ (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
133
+ ) ** (self.dim / (self.dim - 2))
134
+ inv_freq = 1.0 / (
135
+ base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(x.device) / self.dim)
136
+ )
137
+ self.register_buffer("inv_freq", inv_freq, persistent=False) # TODO joao: this may break with compilation
138
+
139
+ cos, sin = super().forward(x, position_ids)
140
+ return cos, sin
141
+
142
+
143
+ def rotate_half(x):
144
+ """Rotates half the hidden dims of the input."""
145
+ x1 = x[..., : x.shape[-1] // 2]
146
+ x2 = x[..., x.shape[-1] // 2 :]
147
+ return torch.cat((-x2, x1), dim=-1)
148
+
149
+
150
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
151
+ """Applies Rotary Position Embedding to the query and key tensors.
152
+
153
+ Args:
154
+ q (`torch.Tensor`): The query tensor.
155
+ k (`torch.Tensor`): The key tensor.
156
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
157
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
158
+ position_ids (`torch.Tensor`, *optional*):
159
+ Deprecated and unused.
160
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
161
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
162
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
163
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
164
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
165
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
166
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
167
+ Returns:
168
+ `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
169
+ """
170
+ cos = cos.unsqueeze(unsqueeze_dim)
171
+ sin = sin.unsqueeze(unsqueeze_dim)
172
+ q_embed = (q * cos) + (rotate_half(q) * sin)
173
+ k_embed = (k * cos) + (rotate_half(k) * sin)
174
+ return q_embed, k_embed
175
+
176
+
177
+ class TeleFLMMLP(nn.Module):
178
+ def __init__(self, config):
179
+ super().__init__()
180
+ self.config = config
181
+ self.hidden_size = config.hidden_size
182
+ self.intermediate_size = config.intermediate_size
183
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
184
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
185
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
186
+ self.act_fn = ACT2FN[config.hidden_act]
187
+
188
+ def forward(self, x):
189
+ if self.config.pretraining_tp > 1:
190
+ slice = self.intermediate_size // self.config.pretraining_tp
191
+ gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
192
+ up_proj_slices = self.up_proj.weight.split(slice, dim=0)
193
+ down_proj_slices = self.down_proj.weight.split(slice, dim=1)
194
+
195
+ gate_proj = torch.cat(
196
+ [F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1
197
+ )
198
+ up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
199
+
200
+ intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
201
+ down_proj = [
202
+ F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)
203
+ ]
204
+ down_proj = sum(down_proj)
205
+ else:
206
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
207
+
208
+ return down_proj
209
+
210
+
211
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
212
+ """
213
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
214
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
215
+ """
216
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
217
+ if n_rep == 1:
218
+ return hidden_states
219
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
220
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
221
+
222
+
223
+ class TeleFLMAttention(nn.Module):
+     """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+     def __init__(self, config: TeleFLMConfig, layer_idx: Optional[int] = None):
+         super().__init__()
+         self.config = config
+         self.layer_idx = layer_idx
+         if layer_idx is None:
+             logger.warning_once(
+                 f"Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will "
+                 "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
+                 "when creating this class."
+             )
+
+         self.attention_dropout = config.attention_dropout
+         self.hidden_size = config.hidden_size
+         self.num_heads = config.num_attention_heads
+         self.head_dim = self.hidden_size // self.num_heads
+         self.num_key_value_heads = config.num_key_value_heads
+         self.num_key_value_groups = self.num_heads // self.num_key_value_heads
+         self.max_position_embeddings = config.max_position_embeddings
+         self.rope_theta = config.rope_theta
+         self.is_causal = True
+
+         if (self.head_dim * self.num_heads) != self.hidden_size:
+             raise ValueError(
+                 f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
+                 f" and `num_heads`: {self.num_heads})."
+             )
+
+         self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)
+         self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
+         self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
+         self.o_proj = nn.Linear(self.hidden_size, self.hidden_size, bias=config.attention_bias)
+         self._init_rope()
+
+     def _init_rope(self):
+         if self.config.rope_scaling is None:
+             self.rotary_emb = TeleFLMRotaryEmbedding(
+                 self.head_dim,
+                 max_position_embeddings=self.max_position_embeddings,
+                 base=self.rope_theta,
+             )
+         else:
+             scaling_type = self.config.rope_scaling["type"]
+             scaling_factor = self.config.rope_scaling["factor"]
+             if scaling_type == "linear":
+                 self.rotary_emb = TeleFLMLinearScalingRotaryEmbedding(
+                     self.head_dim,
+                     max_position_embeddings=self.max_position_embeddings,
+                     scaling_factor=scaling_factor,
+                     base=self.rope_theta,
+                 )
+             elif scaling_type == "dynamic":
+                 self.rotary_emb = TeleFLMDynamicNTKScalingRotaryEmbedding(
+                     self.head_dim,
+                     max_position_embeddings=self.max_position_embeddings,
+                     scaling_factor=scaling_factor,
+                     base=self.rope_theta,
+                 )
+             else:
+                 raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_value: Optional[Cache] = None,
+         output_attentions: bool = False,
+         use_cache: bool = False,
+         cache_position: Optional[torch.LongTensor] = None,
+         **kwargs,
+     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+         bsz, q_len, _ = hidden_states.size()
+
+         if self.config.pretraining_tp > 1:
+             key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
+             query_slices = self.q_proj.weight.split(
+                 (self.num_heads * self.head_dim) // self.config.pretraining_tp, dim=0
+             )
+             key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
+             value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
+
+             query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
+             query_states = torch.cat(query_states, dim=-1)
+
+             key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.config.pretraining_tp)]
+             key_states = torch.cat(key_states, dim=-1)
+
+             value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.config.pretraining_tp)]
+             value_states = torch.cat(value_states, dim=-1)
+
+         else:
+             query_states = self.q_proj(hidden_states)
+             key_states = self.k_proj(hidden_states)
+             value_states = self.v_proj(hidden_states)
+
+         query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+         key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+         value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+
+         past_key_value = getattr(self, "past_key_value", past_key_value)
+         cos, sin = self.rotary_emb(value_states, position_ids)
+         query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+
+         if past_key_value is not None:
+             # sin and cos are specific to RoPE models; cache_position needed for the static cache
+             cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+             key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+         key_states = repeat_kv(key_states, self.num_key_value_groups)
+         value_states = repeat_kv(value_states, self.num_key_value_groups)
+
+         attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
+
+         if attention_mask is not None:  # no matter the length, we just slice it
+             causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
+             attn_weights = attn_weights + causal_mask
+
+         # upcast attention to fp32
+         attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
+         attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
+         attn_output = torch.matmul(attn_weights, value_states)
+
+         if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
+             raise ValueError(
+                 f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
+                 f" {attn_output.size()}"
+             )
+
+         attn_output = attn_output.transpose(1, 2).contiguous()
+
+         attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
+
+         if self.config.pretraining_tp > 1:
+             attn_output = attn_output.split(self.hidden_size // self.config.pretraining_tp, dim=2)
+             o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.config.pretraining_tp, dim=1)
+             attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.config.pretraining_tp)])
+         else:
+             attn_output = self.o_proj(attn_output)
+
+         if not output_attentions:
+             attn_weights = None
+
+         return attn_output, attn_weights, past_key_value
+
+
+ class TeleFLMFlashAttention2(TeleFLMAttention):
+     """
+     Tele-FLM flash attention module. This module inherits from `TeleFLMAttention` as the weights of the module stay
+     untouched. The only required change is in the forward pass, where it needs to correctly call the public API of
+     flash attention and deal with padding tokens in case the input contains any of them.
+     """
+
+     def __init__(self, *args, **kwargs):
+         super().__init__(*args, **kwargs)
+
+         # TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
+         # flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment, which was made the default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
+         # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
+         self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_mask: Optional[torch.LongTensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_value: Optional[Cache] = None,
+         output_attentions: bool = False,
+         use_cache: bool = False,
+         cache_position: Optional[torch.LongTensor] = None,
+         **kwargs,
+     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+         # Flash attention never returns attention weights.
+         output_attentions = False
+
+         bsz, q_len, _ = hidden_states.size()
+
+         query_states = self.q_proj(hidden_states)
+         key_states = self.k_proj(hidden_states)
+         value_states = self.v_proj(hidden_states)
+
+         # Flash attention requires the input to have the shape
+         # batch_size x seq_length x num_heads x head_dim.
+         # The states are first transposed to batch_size x num_heads x seq_length x head_dim for RoPE and the
+         # KV cache, then transposed back before the flash attention call.
+         query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+         key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+         value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+
+         cos, sin = self.rotary_emb(value_states, position_ids)
+         query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+
+         past_key_value = getattr(self, "past_key_value", past_key_value)
+
+         if past_key_value is not None:
+             # sin and cos are specific to RoPE models; cache_position needed for the static cache
+             cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+             key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+         # TODO: These transposes are quite inefficient, but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
+         # to be able to avoid many of these transpose/reshape/view.
+         query_states = query_states.transpose(1, 2)
+         key_states = key_states.transpose(1, 2)
+         value_states = value_states.transpose(1, 2)
+
+         dropout_rate = self.attention_dropout if self.training else 0.0
+
+         # In PEFT, the layer norms are usually cast to float32 for training stability reasons,
+         # so the input hidden states get silently cast to float32. Hence, we need to
+         # cast them back to the correct dtype just to be sure everything works as expected.
+         # This might slow down training & inference, so it is recommended not to cast the LayerNorms
+         # to fp32. (TeleFLMRMSNorm handles it correctly)
+
+         input_dtype = query_states.dtype
+         if input_dtype == torch.float32:
+             if torch.is_autocast_enabled():
+                 target_dtype = torch.get_autocast_gpu_dtype()
+             # Handle the case where the model is quantized
+             elif hasattr(self.config, "_pre_quantization_dtype"):
+                 target_dtype = self.config._pre_quantization_dtype
+             else:
+                 target_dtype = self.q_proj.weight.dtype
+
+             logger.warning_once(
+                 "The input hidden states seem to have been silently cast to float32; this might be related to"
+                 " the fact that you have upcast embedding or layer norm layers to float32. We will cast the input back to"
+                 f" {target_dtype}."
+             )
+
+             query_states = query_states.to(target_dtype)
+             key_states = key_states.to(target_dtype)
+             value_states = value_states.to(target_dtype)
+
+         attn_output = self._flash_attention_forward(
+             query_states, key_states, value_states, attention_mask, q_len, dropout=dropout_rate
+         )
+
+         attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
+         attn_output = self.o_proj(attn_output)
+
+         if not output_attentions:
+             attn_weights = None
+
+         return attn_output, attn_weights, past_key_value
+
+     def _flash_attention_forward(
+         self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
+     ):
+         """
+         Calls the forward method of Flash Attention. If the input hidden states contain at least one padding token,
+         first unpads the input, then computes the attention scores and pads the final attention scores.
+
+         Args:
+             query_states (`torch.Tensor`):
+                 Input query states to be passed to Flash Attention API
+             key_states (`torch.Tensor`):
+                 Input key states to be passed to Flash Attention API
+             value_states (`torch.Tensor`):
+                 Input value states to be passed to Flash Attention API
+             attention_mask (`torch.Tensor`):
+                 The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
+                 position of padding tokens and 1 for the position of non-padding tokens.
+             query_length (`int`):
+                 Length of the query sequence (`q_len` in `forward`)
+             dropout (`float`):
+                 Attention dropout
+             softmax_scale (`float`, *optional*):
+                 The scaling of QK^T before applying softmax. Defaults to 1 / sqrt(head_dim)
+         """
+         if not self._flash_attn_uses_top_left_mask:
+             causal = self.is_causal
+         else:
+             # TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1. For details, please see the comment in TeleFLMFlashAttention2 __init__.
+             causal = self.is_causal and query_length != 1
+
+         # Contains at least one padding token in the sequence
+         if attention_mask is not None:
+             batch_size = query_states.shape[0]
+             query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
+                 query_states, key_states, value_states, attention_mask, query_length
+             )
+
+             cu_seqlens_q, cu_seqlens_k = cu_seq_lens
+             max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
+
+             attn_output_unpad = flash_attn_varlen_func(
+                 query_states,
+                 key_states,
+                 value_states,
+                 cu_seqlens_q=cu_seqlens_q,
+                 cu_seqlens_k=cu_seqlens_k,
+                 max_seqlen_q=max_seqlen_in_batch_q,
+                 max_seqlen_k=max_seqlen_in_batch_k,
+                 dropout_p=dropout,
+                 softmax_scale=softmax_scale,
+                 causal=causal,
+             )
+
+             attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
+         else:
+             attn_output = flash_attn_func(
+                 query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal
+             )
+
+         return attn_output
+
+     def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
+         indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
+         batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
+
+         key_layer = index_first_axis(
+             key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
+         )
+         value_layer = index_first_axis(
+             value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
+         )
+         if query_length == kv_seq_len:
+             query_layer = index_first_axis(
+                 query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
+             )
+             cu_seqlens_q = cu_seqlens_k
+             max_seqlen_in_batch_q = max_seqlen_in_batch_k
+             indices_q = indices_k
+         elif query_length == 1:
+             max_seqlen_in_batch_q = 1
+             cu_seqlens_q = torch.arange(
+                 batch_size + 1, dtype=torch.int32, device=query_layer.device
+             )  # There is a memcpy here, that is very bad.
+             indices_q = cu_seqlens_q[:-1]
+             query_layer = query_layer.squeeze(1)
+         else:
+             # The -q_len: slice assumes left padding.
+             attention_mask = attention_mask[:, -query_length:]
+             query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
+
+         return (
+             query_layer,
+             key_layer,
+             value_layer,
+             indices_q,
+             (cu_seqlens_q, cu_seqlens_k),
+             (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
+         )
+
+
+ class TeleFLMSdpaAttention(TeleFLMAttention):
+     """
+     Tele-FLM attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
+     `TeleFLMAttention` as the weights of the module stay untouched. The only changes are on the forward pass to adapt to
+     the SDPA API.
+     """
+
+     # Adapted from TeleFLMAttention.forward
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_value: Optional[Cache] = None,
+         output_attentions: bool = False,
+         use_cache: bool = False,
+         cache_position: Optional[torch.LongTensor] = None,
+     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+         if output_attentions:
+             # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
+             logger.warning_once(
+                 "TeleFLMModel is using TeleFLMSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
+                 'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
+             )
+             return super().forward(
+                 hidden_states=hidden_states,
+                 attention_mask=attention_mask,
+                 position_ids=position_ids,
+                 past_key_value=past_key_value,
+                 output_attentions=output_attentions,
+                 use_cache=use_cache,
+                 cache_position=cache_position,
+             )
+
+         bsz, q_len, _ = hidden_states.size()
+
+         query_states = self.q_proj(hidden_states)
+         key_states = self.k_proj(hidden_states)
+         value_states = self.v_proj(hidden_states)
+
+         query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+         key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+         value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+
+         cos, sin = self.rotary_emb(value_states, position_ids)
+         query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+
+         # In case static cache is used, it is an instance attribute.
+         past_key_value = getattr(self, "past_key_value", past_key_value)
+
+         if past_key_value is not None:
+             # sin and cos are specific to RoPE models; cache_position needed for the static cache
+             cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+             key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+         key_states = repeat_kv(key_states, self.num_key_value_groups)
+         value_states = repeat_kv(value_states, self.num_key_value_groups)
+
+         causal_mask = attention_mask
+         # if attention_mask is not None and cache_position is not None:
+         if attention_mask is not None:
+             causal_mask = causal_mask[:, :, :, : key_states.shape[-2]]
+
+         # SDPA with the memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs and a custom attn_mask.
+         # Reference: https://github.com/pytorch/pytorch/issues/112577.
+         if query_states.device.type == "cuda" and causal_mask is not None:
+             query_states = query_states.contiguous()
+             key_states = key_states.contiguous()
+             value_states = value_states.contiguous()
+
+         attn_output = torch.nn.functional.scaled_dot_product_attention(
+             query_states,
+             key_states,
+             value_states,
+             attn_mask=causal_mask,
+             dropout_p=self.attention_dropout if self.training else 0.0,
+         )
+
+         attn_output = attn_output.transpose(1, 2).contiguous()
+         attn_output = attn_output.view(bsz, q_len, self.hidden_size)
+
+         attn_output = self.o_proj(attn_output)
+
+         return attn_output, None, past_key_value
+
+
+ TELEFLM_ATTENTION_CLASSES = {
+     "eager": TeleFLMAttention,
+     "flash_attention_2": TeleFLMFlashAttention2,
+     "sdpa": TeleFLMSdpaAttention,
+ }
+
+
+ class TeleFLMDecoderLayer(nn.Module):
+     def __init__(self, config: TeleFLMConfig, layer_idx: int):
+         super().__init__()
+         self.hidden_size = config.hidden_size
+         self.self_attn = TELEFLM_ATTENTION_CLASSES.get(config._attn_implementation, TeleFLMAttention)(config=config, layer_idx=layer_idx)
+         self.mlp = TeleFLMMLP(config)
+         self.input_layernorm = TeleFLMRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+         self.post_attention_layernorm = TeleFLMRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_value: Optional[Tuple[torch.Tensor]] = None,
+         output_attentions: Optional[bool] = False,
+         use_cache: Optional[bool] = False,
+         cache_position: Optional[torch.LongTensor] = None,
+         **kwargs,
+     ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
+         """
+         Args:
+             hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
+             attention_mask (`torch.FloatTensor`, *optional*):
+                 attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
+                 query_sequence_length, key_sequence_length)` if default attention is used.
+             output_attentions (`bool`, *optional*):
+                 Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                 returned tensors for more detail.
+             use_cache (`bool`, *optional*):
+                 If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
+                 (see `past_key_values`).
+             past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
+         """
+         if "padding_mask" in kwargs:
+             warnings.warn(
+                 "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
+             )
+
+         residual = hidden_states
+
+         hidden_states = self.input_layernorm(hidden_states)
+
+         # Self Attention
+         hidden_states, self_attn_weights, present_key_value = self.self_attn(
+             hidden_states=hidden_states,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_value=past_key_value,
+             output_attentions=output_attentions,
+             use_cache=use_cache,
+             cache_position=cache_position,
+             **kwargs,
+         )
+         hidden_states = residual + hidden_states
+
+         # Fully Connected
+         residual = hidden_states
+         hidden_states = self.post_attention_layernorm(hidden_states)
+         hidden_states = self.mlp(hidden_states)
+         hidden_states = residual + hidden_states
+
+         outputs = (hidden_states,)
+
+         if output_attentions:
+             outputs += (self_attn_weights,)
+
+         if use_cache:
+             outputs += (present_key_value,)
+
+         return outputs
+
+
+ TELEFLM_START_DOCSTRING = r"""
+     This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+     library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+     etc.)
+
+     This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
+     and behavior.
+
+     Parameters:
+         config ([`TeleFLMConfig`]):
+             Model configuration class with all the parameters of the model. Initializing with a config file does not
+             load the weights associated with the model, only the configuration. Check out the
+             [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+ """
+
+
+ @add_start_docstrings(
+     "The bare Tele-FLM Model outputting raw hidden-states without any specific head on top.",
+     TELEFLM_START_DOCSTRING,
+ )
+ class TeleFLMPreTrainedModel(PreTrainedModel):
+     config_class = TeleFLMConfig
+     base_model_prefix = "model"
+     supports_gradient_checkpointing = True
+     _no_split_modules = ["TeleFLMDecoderLayer"]
+     _skip_keys_device_placement = ["past_key_values"]
+     _supports_flash_attn_2 = True
+     _supports_sdpa = True
+     _supports_cache_class = True
+
+     def _init_weights(self, module):
+         std = self.config.initializer_range
+         if isinstance(module, nn.Linear):
+             module.weight.data.normal_(mean=0.0, std=std)
+             if module.bias is not None:
+                 module.bias.data.zero_()
+         elif isinstance(module, nn.Embedding):
+             module.weight.data.normal_(mean=0.0, std=std)
+             if module.padding_idx is not None:
+                 module.weight.data[module.padding_idx].zero_()
+
+     def _setup_cache(self, cache_cls, max_batch_size, max_cache_len: Optional[int] = None):
+         if self.config._attn_implementation == "flash_attention_2" and cache_cls == StaticCache:
+             raise ValueError(
+                 "`static` cache implementation is not compatible with `attn_implementation==flash_attention_2`; "
+                 "make sure to use `sdpa` in the meantime, and open an issue at https://github.com/huggingface/transformers"
+             )
+
+         for layer in self.model.layers:
+             device = layer.input_layernorm.weight.device
+             if hasattr(self.config, "_pre_quantization_dtype"):
+                 dtype = self.config._pre_quantization_dtype
+             else:
+                 dtype = layer.self_attn.o_proj.weight.dtype
+             layer.self_attn.past_key_value = cache_cls(
+                 self.config, max_batch_size, max_cache_len, device=device, dtype=dtype
+             )
+
+     def _reset_cache(self):
+         for layer in self.model.layers:
+             layer.self_attn.past_key_value = None
+
+
+ TELEFLM_INPUTS_DOCSTRING = r"""
+     Args:
+         input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+             Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
+             it.
+
+             Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+             [`PreTrainedTokenizer.__call__`] for details.
+
+             [What are input IDs?](../glossary#input-ids)
+         attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+             Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+
+             - 1 for tokens that are **not masked**,
+             - 0 for tokens that are **masked**.
+
+             [What are attention masks?](../glossary#attention-mask)
+
+             Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+             [`PreTrainedTokenizer.__call__`] for details.
+
+             If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
+             `past_key_values`).
+
+             If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
+             and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
+             information on the default strategy.
+         position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+             Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
+             config.n_positions - 1]`.
+
+             [What are position IDs?](../glossary#position-ids)
+         past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
+             Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
+             blocks) that can be used to speed up sequential decoding. This typically consists of the `past_key_values`
+             returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
+
+             Two formats are allowed:
+             - a [`~cache_utils.Cache`] instance;
+             - Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
+             shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy
+             cache format.
+
+             The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
+             legacy cache format will be returned.
+
+             If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
+             have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
+             of shape `(batch_size, sequence_length)`.
+         inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+             Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
+             is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
+             model's internal embedding lookup matrix.
+         use_cache (`bool`, *optional*):
+             If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
+             `past_key_values`).
+         output_attentions (`bool`, *optional*):
+             Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+             tensors for more detail.
+         output_hidden_states (`bool`, *optional*):
+             Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+             more detail.
+         return_dict (`bool`, *optional*):
+             Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+         cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
+             Indices depicting the position of the input sequence tokens in the sequence. Contrary to `position_ids`,
+             this tensor is not affected by padding. It is used to update the cache in the correct position and to infer
+             the complete sequence length.
+ """
+
+
+ @add_start_docstrings(
+     "The bare Tele-FLM Model outputting raw hidden-states without any specific head on top.",
+     TELEFLM_START_DOCSTRING,
+ )
+ class TeleFLMModel(TeleFLMPreTrainedModel):
+     """
+     Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`TeleFLMDecoderLayer`]
+
+     Args:
+         config: TeleFLMConfig
+     """
+
+     def __init__(self, config: TeleFLMConfig):
+         super().__init__(config)
+         self.padding_idx = config.pad_token_id
+         self.vocab_size = config.vocab_size
+         # Mup
+         self.use_mup = config.use_mup
+         if self.use_mup:
+             self.input_mult = config.input_mult
+         self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+         self.layers = nn.ModuleList(
+             [TeleFLMDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+         )
+         self.norm = TeleFLMRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+         self.gradient_checkpointing = False
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def get_input_embeddings(self):
+         return self.embed_tokens
+
+     def set_input_embeddings(self, value):
+         self.embed_tokens = value
+
+     @add_start_docstrings_to_model_forward(TELEFLM_INPUTS_DOCSTRING)
+     def forward(
+         self,
+         input_ids: Optional[torch.LongTensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[List[torch.FloatTensor]] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+     ) -> Union[Tuple, BaseModelOutputWithPast]:
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+         use_cache = use_cache if use_cache is not None else self.config.use_cache
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         # Exactly one of `input_ids` and `inputs_embeds` must be provided.
+         if (input_ids is None) ^ (inputs_embeds is not None):
+             raise ValueError(
+                 "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
+             )
+
+         if self.gradient_checkpointing and self.training and use_cache:
+             logger.warning_once(
+                 "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
+             )
+             use_cache = False
+
+         if inputs_embeds is None:
+             inputs_embeds = self.embed_tokens(input_ids)
+
+         # Mup: scale the input embeddings by `input_mult`
+         if self.use_mup:
+             inputs_embeds = inputs_embeds * self.input_mult
+
+         past_seen_tokens = 0
+         if use_cache:  # kept for BC (cache positions)
+             if not isinstance(past_key_values, StaticCache):
+                 past_key_values = DynamicCache.from_legacy_cache(past_key_values)
+                 past_seen_tokens = past_key_values.get_seq_length()
+
+         if cache_position is None:
+             if isinstance(past_key_values, StaticCache):
+                 raise ValueError("cache_position is a required argument when using StaticCache.")
+             cache_position = torch.arange(
+                 past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
+             )
+
+         if position_ids is None:
+             position_ids = cache_position.unsqueeze(0)
960
+
961
+ causal_mask = self._update_causal_mask(attention_mask, inputs_embeds, cache_position)
962
+
963
+ # embed positions
964
+ hidden_states = inputs_embeds
965
+
966
+ # decoder layers
967
+ all_hidden_states = () if output_hidden_states else None
968
+ all_self_attns = () if output_attentions else None
969
+ next_decoder_cache = None
970
+
971
+ for decoder_layer in self.layers:
972
+ if output_hidden_states:
973
+ all_hidden_states += (hidden_states,)
974
+
975
+ if self.gradient_checkpointing and self.training:
976
+ layer_outputs = self._gradient_checkpointing_func(
977
+ decoder_layer.__call__,
978
+ hidden_states,
979
+ causal_mask,
980
+ position_ids,
981
+ past_key_values,
982
+ output_attentions,
983
+ use_cache,
984
+ cache_position,
985
+ )
986
+ else:
987
+ layer_outputs = decoder_layer(
988
+ hidden_states,
989
+ attention_mask=causal_mask,
990
+ position_ids=position_ids,
991
+ past_key_value=past_key_values,
992
+ output_attentions=output_attentions,
993
+ use_cache=use_cache,
994
+ cache_position=cache_position,
995
+ )
996
+
997
+ hidden_states = layer_outputs[0]
998
+
999
+ if use_cache:
1000
+ next_decoder_cache = layer_outputs[2 if output_attentions else 1]
1001
+
1002
+ if output_attentions:
1003
+ all_self_attns += (layer_outputs[1],)
1004
+
1005
+ hidden_states = self.norm(hidden_states)
1006
+
1007
+ # add hidden states from the last decoder layer
1008
+ if output_hidden_states:
1009
+ all_hidden_states += (hidden_states,)
1010
+
1011
+ next_cache = None
1012
+ if use_cache:
1013
+ next_cache = (
1014
+ next_decoder_cache.to_legacy_cache() if isinstance(next_decoder_cache, Cache) else next_decoder_cache
1015
+ )
1016
+ if not return_dict:
1017
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
1018
+ return BaseModelOutputWithPast(
1019
+ last_hidden_state=hidden_states,
1020
+ past_key_values=next_cache,
1021
+ hidden_states=all_hidden_states,
1022
+ attentions=all_self_attns,
1023
+ )
1024
+
1025
+ # TODO: As of torch==2.2.0, the `attention_mask` passed to the model in `generate` is 2D and of dynamic length even when the static
1026
+ # KV cache is used. This is an issue for torch.compile which then recaptures cudagraphs at each decode step due to the dynamic shapes.
1027
+ # (`recording cudagraph tree for symint key 13`, etc.), which is VERY slow. A workaround is `@torch.compiler.disable`, but this prevents using
1028
+ # `fullgraph=True`. See more context in https://github.com/huggingface/transformers/pull/29114
1029
+ def _update_causal_mask(self, attention_mask, input_tensor, cache_position):
1030
+ if self.config._attn_implementation == "flash_attention_2":
1031
+ if attention_mask is not None and 0.0 in attention_mask:
1032
+ return attention_mask
1033
+ return None
1034
+
1035
+ dtype, device = input_tensor.dtype, input_tensor.device
1036
+ min_dtype = torch.finfo(dtype).min
1037
+ sequence_length = input_tensor.shape[1]
1038
+ if hasattr(getattr(self.layers[0], "self_attn", {}), "past_key_value"): # static cache
1039
+ target_length = self.config.max_position_embeddings
1040
+ else: # dynamic cache
1041
+ target_length = (
1042
+ attention_mask.shape[-1] if isinstance(attention_mask, torch.Tensor) else cache_position[-1] + 1
1043
+ )
1044
+
1045
+ causal_mask = torch.full((sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device)
1046
+ if sequence_length != 1:
1047
+ causal_mask = torch.triu(causal_mask, diagonal=1)
1048
+ causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
1049
+ causal_mask = causal_mask[None, None, :, :].expand(input_tensor.shape[0], 1, -1, -1)
1050
+ if attention_mask is not None:
1051
+ causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
1052
+ if attention_mask.dim() == 2:
1053
+ mask_length = attention_mask.shape[-1]
1054
+ padding_mask = causal_mask[..., :mask_length].eq(0.0) * attention_mask[:, None, None, :].eq(0.0)
1055
+ causal_mask[..., :mask_length] = causal_mask[..., :mask_length].masked_fill(padding_mask, min_dtype)
1056
+ elif attention_mask.dim() == 4:
1057
+ # backwards compatibility: we allow passing a 4D attention mask shorter than the input length with
1058
+ # cache. In that case, the 4D attention mask attends to the newest tokens only.
1059
+ if attention_mask.shape[-2] < cache_position[0] + sequence_length:
1060
+ offset = cache_position[0]
1061
+ else:
1062
+ offset = 0
1063
+ mask_shape = attention_mask.shape
1064
+ mask_slice = (attention_mask.eq(0.0)).to(dtype=dtype) * min_dtype
1065
+ causal_mask[
1066
+ : mask_shape[0], : mask_shape[1], offset : mask_shape[2] + offset, : mask_shape[3]
1067
+ ] = mask_slice
1068
+
1069
+ if (
1070
+ self.config._attn_implementation == "sdpa"
1071
+ and attention_mask is not None
1072
+ and attention_mask.device.type == "cuda"
1073
+ ):
1074
+ # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
1075
+ # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
1076
+ # Details: https://github.com/pytorch/pytorch/issues/110213
1077
+ causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
1078
+
1079
+ return causal_mask
1080
+
1081
+
1082
+ class TeleFLMForCausalLM(TeleFLMPreTrainedModel):
1083
+ _tied_weights_keys = ["lm_head.weight"]
1084
+
1085
+ def __init__(self, config):
1086
+ super().__init__(config)
1087
+ self.model = TeleFLMModel(config)
1088
+ self.vocab_size = config.vocab_size
1089
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1090
+ self.use_mup = config.use_mup
1091
+ if self.use_mup:
1092
+ self.mup_scale_factor = config.mup_scale_factor
1093
+ self.output_mult = config.output_mult / self.mup_scale_factor
1094
+ # Initialize weights and apply final processing
1095
+ self.post_init()
1096
+
1097
+ def get_input_embeddings(self):
1098
+ return self.model.embed_tokens
1099
+
1100
+ def set_input_embeddings(self, value):
1101
+ self.model.embed_tokens = value
1102
+
1103
+ def get_output_embeddings(self):
1104
+ return self.lm_head
1105
+
1106
+ def set_output_embeddings(self, new_embeddings):
1107
+ self.lm_head = new_embeddings
1108
+
1109
+ def set_decoder(self, decoder):
1110
+ self.model = decoder
1111
+
1112
+ def get_decoder(self):
1113
+ return self.model
1114
+
1115
+ @add_start_docstrings_to_model_forward(TELEFLM_INPUTS_DOCSTRING)
1116
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1117
+ def forward(
1118
+ self,
1119
+ input_ids: torch.LongTensor = None,
1120
+ attention_mask: Optional[torch.Tensor] = None,
1121
+ position_ids: Optional[torch.LongTensor] = None,
1122
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1123
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1124
+ labels: Optional[torch.LongTensor] = None,
1125
+ use_cache: Optional[bool] = None,
1126
+ output_attentions: Optional[bool] = None,
1127
+ output_hidden_states: Optional[bool] = None,
1128
+ return_dict: Optional[bool] = None,
1129
+ cache_position: Optional[torch.LongTensor] = None,
1130
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1131
+ r"""
1132
+ Args:
1133
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1134
+ Labels for computing the causal language modeling loss. Indices should either be in `[0, ...,
1135
+ config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1136
+ (masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
1137
+
1138
+ Returns:
1139
+
1140
+ Example:
1141
+
1142
+ ```python
1143
+ >>> from transformers import AutoModelForCausalLM, AutoTokenizer
1144
+
1145
+ >>> model = AutoModelForCausalLM.from_pretrained("CofeAI/Tele-FLM", trust_remote_code=True)
1146
+ >>> tokenizer = AutoTokenizer.from_pretrained("CofeAI/Tele-FLM", trust_remote_code=True)
1147
+
1148
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
1149
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1150
+
1151
+ >>> # Generate
1152
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
1153
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1154
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
1155
+ ```"""
1156
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1157
+ output_hidden_states = (
1158
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1159
+ )
1160
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1161
+
1162
+ # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
1163
+ outputs = self.model(
1164
+ input_ids=input_ids,
1165
+ attention_mask=attention_mask,
1166
+ position_ids=position_ids,
1167
+ past_key_values=past_key_values,
1168
+ inputs_embeds=inputs_embeds,
1169
+ use_cache=use_cache,
1170
+ output_attentions=output_attentions,
1171
+ output_hidden_states=output_hidden_states,
1172
+ return_dict=return_dict,
1173
+ cache_position=cache_position,
1174
+ )
1175
+
1176
+ hidden_states = outputs[0]
1177
+ if self.config.pretraining_tp > 1:
1178
+ lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
1179
+ logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
1180
+ logits = torch.cat(logits, dim=-1)
1181
+ else:
1182
+ logits = self.lm_head(hidden_states)
1183
+ logits = logits.float()
1184
+ # Mup
1185
+ if self.use_mup:
1186
+ logits = logits * self.output_mult
1187
+
1188
+ loss = None
1189
+ if labels is not None:
1190
+ # Shift so that tokens < n predict n
1191
+ shift_logits = logits[..., :-1, :].contiguous()
1192
+ shift_labels = labels[..., 1:].contiguous()
1193
+ # Flatten the tokens
1194
+ loss_fct = CrossEntropyLoss()
1195
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1196
+ shift_labels = shift_labels.view(-1)
1197
+ # Enable model parallelism
1198
+ shift_labels = shift_labels.to(shift_logits.device)
1199
+ loss = loss_fct(shift_logits, shift_labels)
1200
+
1201
+ if not return_dict:
1202
+ output = (logits,) + outputs[1:]
1203
+ return (loss,) + output if loss is not None else output
1204
+
1205
+ return CausalLMOutputWithPast(
1206
+ loss=loss,
1207
+ logits=logits,
1208
+ past_key_values=outputs.past_key_values,
1209
+ hidden_states=outputs.hidden_states,
1210
+ attentions=outputs.attentions,
1211
+ )
1212
+
1213
+ def prepare_inputs_for_generation(
1214
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, cache_position=None, **kwargs
1215
+ ):
1216
+ # With static cache, the `past_key_values` is None
1217
+ # TODO joao: standardize interface for the different Cache classes and remove this `if`
1218
+ has_static_cache = False
1219
+ if past_key_values is None:
1220
+ past_key_values = getattr(getattr(self.model.layers[0], "self_attn", {}), "past_key_value", None)
1221
+ has_static_cache = past_key_values is not None
1222
+
1223
+ past_length = 0
1224
+ if past_key_values is not None:
1225
+ if isinstance(past_key_values, Cache):
1226
+ past_length = cache_position[0] if cache_position is not None else past_key_values.get_seq_length()
1227
+ max_cache_length = (
1228
+ torch.tensor(past_key_values.get_max_length(), device=input_ids.device)
1229
+ if past_key_values.get_max_length() is not None
1230
+ else None
1231
+ )
1232
+ cache_length = past_length if max_cache_length is None else torch.min(max_cache_length, past_length)
1233
+ # TODO joao: remove this `else` after `generate` prioritizes `Cache` objects
1234
+ else:
1235
+ cache_length = past_length = past_key_values[0][0].shape[2]
1236
+ max_cache_length = None
1237
+
1238
+ # Keep only the unprocessed tokens:
1239
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
1240
+ # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
1241
+ # input)
1242
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
1243
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
1244
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
1245
+ # input_ids based on the past_length.
1246
+ elif past_length < input_ids.shape[1]:
1247
+ input_ids = input_ids[:, past_length:]
1248
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
1249
+
1250
+ # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
1251
+ if (
1252
+ max_cache_length is not None
1253
+ and attention_mask is not None
1254
+ and cache_length + input_ids.shape[1] > max_cache_length
1255
+ ):
1256
+ attention_mask = attention_mask[:, -max_cache_length:]
1257
+
1258
+ position_ids = kwargs.get("position_ids", None)
1259
+ if attention_mask is not None and position_ids is None:
1260
+ # create position_ids on the fly for batch generation
1261
+ position_ids = attention_mask.long().cumsum(-1) - 1
1262
+ position_ids.masked_fill_(attention_mask == 0, 1)
1263
+ if past_key_values:
1264
+ position_ids = position_ids[:, -input_ids.shape[1] :]
1265
+
1266
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1267
+ if inputs_embeds is not None and past_key_values is None:
1268
+ model_inputs = {"inputs_embeds": inputs_embeds}
1269
+ else:
1270
+ # The `contiguous()` here is necessary to have a static stride during decoding. torchdynamo otherwise
1271
+ # recompiles graphs as the stride of the inputs is a guard. Ref: https://github.com/huggingface/transformers/pull/29114
1272
+ # TODO: use `next_tokens` directly instead.
1273
+ model_inputs = {"input_ids": input_ids.contiguous()}
1274
+
1275
+ input_length = position_ids.shape[-1] if position_ids is not None else input_ids.shape[-1]
1276
+ if cache_position is None:
1277
+ cache_position = torch.arange(past_length, past_length + input_length, device=input_ids.device)
1278
+ else:
1279
+ cache_position = cache_position[-input_length:]
1280
+
1281
+ if has_static_cache:
1282
+ past_key_values = None
1283
+
1284
+ model_inputs.update(
1285
+ {
1286
+ "position_ids": position_ids,
1287
+ "cache_position": cache_position,
1288
+ "past_key_values": past_key_values,
1289
+ "use_cache": kwargs.get("use_cache"),
1290
+ "attention_mask": attention_mask,
1291
+ }
1292
+ )
1293
+ return model_inputs
1294
+
1295
+ @staticmethod
1296
+ def _reorder_cache(past_key_values, beam_idx):
1297
+ reordered_past = ()
1298
+ for layer_past in past_key_values:
1299
+ reordered_past += (
1300
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
1301
+ )
1302
+ return reordered_past
1303
+
1304
+
1305
+ @add_start_docstrings(
1306
+ """
1307
+ The Tele-FLM Model transformer with a sequence classification head on top (linear layer).
1308
+
1309
+ [`TeleFLMForSequenceClassification`] uses the last token in order to do the classification, as other causal models
1310
+ (e.g. GPT-2) do.
1311
+
1312
+ Since it does classification on the last token, it needs to know the position of the last token. If a
1313
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
1314
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
1315
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
1316
+ each row of the batch).
1317
+ """,
1318
+ TELEFLM_START_DOCSTRING,
1319
+ )
1320
+ class TeleFLMForSequenceClassification(TeleFLMPreTrainedModel):
1321
+ def __init__(self, config):
1322
+ super().__init__(config)
1323
+ self.num_labels = config.num_labels
1324
+ self.model = TeleFLMModel(config)
1325
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
1326
+
1327
+ # Initialize weights and apply final processing
1328
+ self.post_init()
1329
+
1330
+ def get_input_embeddings(self):
1331
+ return self.model.embed_tokens
1332
+
1333
+ def set_input_embeddings(self, value):
1334
+ self.model.embed_tokens = value
1335
+
1336
+ @add_start_docstrings_to_model_forward(TELEFLM_INPUTS_DOCSTRING)
1337
+ def forward(
1338
+ self,
1339
+ input_ids: torch.LongTensor = None,
1340
+ attention_mask: Optional[torch.Tensor] = None,
1341
+ position_ids: Optional[torch.LongTensor] = None,
1342
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1343
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1344
+ labels: Optional[torch.LongTensor] = None,
1345
+ use_cache: Optional[bool] = None,
1346
+ output_attentions: Optional[bool] = None,
1347
+ output_hidden_states: Optional[bool] = None,
1348
+ return_dict: Optional[bool] = None,
1349
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
1350
+ r"""
1351
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1352
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1353
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
1354
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1355
+ """
1356
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1357
+
1358
+ transformer_outputs = self.model(
1359
+ input_ids,
1360
+ attention_mask=attention_mask,
1361
+ position_ids=position_ids,
1362
+ past_key_values=past_key_values,
1363
+ inputs_embeds=inputs_embeds,
1364
+ use_cache=use_cache,
1365
+ output_attentions=output_attentions,
1366
+ output_hidden_states=output_hidden_states,
1367
+ return_dict=return_dict,
1368
+ )
1369
+ hidden_states = transformer_outputs[0]
1370
+ logits = self.score(hidden_states)
1371
+
1372
+ if input_ids is not None:
1373
+ batch_size = input_ids.shape[0]
1374
+ else:
1375
+ batch_size = inputs_embeds.shape[0]
1376
+
1377
+ if self.config.pad_token_id is None and batch_size != 1:
1378
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
1379
+ if self.config.pad_token_id is None:
1380
+ sequence_lengths = -1
1381
+ else:
1382
+ if input_ids is not None:
1383
+ # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
1384
+ sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
1385
+ sequence_lengths = sequence_lengths % input_ids.shape[-1]
1386
+ sequence_lengths = sequence_lengths.to(logits.device)
1387
+ else:
1388
+ sequence_lengths = -1
1389
+
1390
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
1391
+
1392
+ loss = None
1393
+ if labels is not None:
1394
+ labels = labels.to(logits.device)
1395
+ if self.config.problem_type is None:
1396
+ if self.num_labels == 1:
1397
+ self.config.problem_type = "regression"
1398
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1399
+ self.config.problem_type = "single_label_classification"
1400
+ else:
1401
+ self.config.problem_type = "multi_label_classification"
1402
+
1403
+ if self.config.problem_type == "regression":
1404
+ loss_fct = MSELoss()
1405
+ if self.num_labels == 1:
1406
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
1407
+ else:
1408
+ loss = loss_fct(pooled_logits, labels)
1409
+ elif self.config.problem_type == "single_label_classification":
1410
+ loss_fct = CrossEntropyLoss()
1411
+ loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
1412
+ elif self.config.problem_type == "multi_label_classification":
1413
+ loss_fct = BCEWithLogitsLoss()
1414
+ loss = loss_fct(pooled_logits, labels)
1415
+ if not return_dict:
1416
+ output = (pooled_logits,) + transformer_outputs[1:]
1417
+ return ((loss,) + output) if loss is not None else output
1418
+
1419
+ return SequenceClassifierOutputWithPast(
1420
+ loss=loss,
1421
+ logits=pooled_logits,
1422
+ past_key_values=transformer_outputs.past_key_values,
1423
+ hidden_states=transformer_outputs.hidden_states,
1424
+ attentions=transformer_outputs.attentions,
1425
+ )
1426
+
1427
+
1428
+ @add_start_docstrings(
1429
+ """
1430
+ The Tele-FLM Model transformer with a span classification head on top for extractive question-answering tasks like
1431
+ SQuAD (a linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
1432
+ """,
1433
+ TELEFLM_START_DOCSTRING,
1434
+ )
1435
+ class TeleFLMForQuestionAnswering(TeleFLMPreTrainedModel):
1436
+ base_model_prefix = "transformer"
1437
+
1438
+ # Copied from transformers.models.bloom.modeling_bloom.BloomForQuestionAnswering.__init__ with Bloom->TeleFLM
1439
+ def __init__(self, config):
1440
+ super().__init__(config)
1441
+ self.transformer = TeleFLMModel(config)
1442
+ self.qa_outputs = nn.Linear(config.hidden_size, 2)
1443
+
1444
+ # Initialize weights and apply final processing
1445
+ self.post_init()
1446
+
1447
+ def get_input_embeddings(self):
1448
+ return self.transformer.embed_tokens
1449
+
1450
+ def set_input_embeddings(self, value):
1451
+ self.transformer.embed_tokens = value
1452
+
1453
+ @add_start_docstrings_to_model_forward(TELEFLM_INPUTS_DOCSTRING)
1454
+ def forward(
1455
+ self,
1456
+ input_ids: Optional[torch.LongTensor] = None,
1457
+ attention_mask: Optional[torch.FloatTensor] = None,
1458
+ position_ids: Optional[torch.LongTensor] = None,
1459
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1460
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1461
+ start_positions: Optional[torch.LongTensor] = None,
1462
+ end_positions: Optional[torch.LongTensor] = None,
1463
+ output_attentions: Optional[bool] = None,
1464
+ output_hidden_states: Optional[bool] = None,
1465
+ return_dict: Optional[bool] = None,
1466
+ ) -> Union[Tuple, QuestionAnsweringModelOutput]:
1467
+ r"""
1468
+ start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1469
+ Labels for position (index) of the start of the labelled span for computing the token classification loss.
1470
+ Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
1471
+ are not taken into account for computing the loss.
1472
+ end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1473
+ Labels for position (index) of the end of the labelled span for computing the token classification loss.
1474
+ Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
1475
+ are not taken into account for computing the loss.
1476
+ """
1477
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1478
+
1479
+ outputs = self.transformer(
1480
+ input_ids,
1481
+ attention_mask=attention_mask,
1482
+ position_ids=position_ids,
1483
+ past_key_values=past_key_values,
1484
+ inputs_embeds=inputs_embeds,
1485
+ output_attentions=output_attentions,
1486
+ output_hidden_states=output_hidden_states,
1487
+ return_dict=return_dict,
1488
+ )
1489
+
1490
+ sequence_output = outputs[0]
1491
+
1492
+ logits = self.qa_outputs(sequence_output)
1493
+ start_logits, end_logits = logits.split(1, dim=-1)
1494
+ start_logits = start_logits.squeeze(-1).contiguous()
1495
+ end_logits = end_logits.squeeze(-1).contiguous()
1496
+
1497
+ total_loss = None
1498
+ if start_positions is not None and end_positions is not None:
1499
+ # If we are on multi-GPU, the split adds a dimension
1500
+ if len(start_positions.size()) > 1:
1501
+ start_positions = start_positions.squeeze(-1).to(start_logits.device)
1502
+ if len(end_positions.size()) > 1:
1503
+ end_positions = end_positions.squeeze(-1).to(end_logits.device)
1504
+ # sometimes the start/end positions are outside our model inputs, we ignore these terms
1505
+ ignored_index = start_logits.size(1)
1506
+ start_positions = start_positions.clamp(0, ignored_index)
1507
+ end_positions = end_positions.clamp(0, ignored_index)
1508
+
1509
+ loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
1510
+ start_loss = loss_fct(start_logits, start_positions)
1511
+ end_loss = loss_fct(end_logits, end_positions)
1512
+ total_loss = (start_loss + end_loss) / 2
1513
+
1514
+ if not return_dict:
1515
+ output = (start_logits, end_logits) + outputs[2:]
1516
+ return ((total_loss,) + output) if total_loss is not None else output
1517
+
1518
+ return QuestionAnsweringModelOutput(
1519
+ loss=total_loss,
1520
+ start_logits=start_logits,
1521
+ end_logits=end_logits,
1522
+ hidden_states=outputs.hidden_states,
1523
+ attentions=outputs.attentions,
1524
+ )
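When no `position_ids` are passed, the `prepare_inputs_for_generation` method in the file above derives them from the attention mask: a cumulative sum minus one, with padded slots set to 1. The following is a minimal pure-Python sketch of that rule for a single row. The helper name `position_ids_from_mask` is hypothetical and not part of the repository; the real code operates on batched `torch` tensors.

```python
from itertools import accumulate


def position_ids_from_mask(attention_mask):
    """Mirror `mask.long().cumsum(-1) - 1` then `masked_fill_(mask == 0, 1)` for one row."""
    running = accumulate(attention_mask)  # cumulative count of real (unmasked) tokens
    # Real tokens get a zero-based position; padded slots get the placeholder value 1.
    return [count - 1 if keep else 1 for count, keep in zip(running, attention_mask)]


# Left-padded row: two pad slots followed by three real tokens.
print(position_ids_from_mask([0, 0, 1, 1, 1]))  # [1, 1, 0, 1, 2]
```

The placeholder value for padded slots is harmless because those positions are excluded by the attention mask anyway.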
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "</s>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "<pad>",
18
+ "lstrip": false,
19
+ "normalized": true,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "unk_token": {
24
+ "content": "<unk>",
25
+ "lstrip": false,
26
+ "normalized": true,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ }
30
+ }
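The `<pad>` entry matters beyond batching: `TeleFLMForSequenceClassification` pools the logits at the last non-padding token, which it locates from the first occurrence of `pad_token_id`. Below is a rough pure-Python sketch of that index rule for a single row. `last_token_index` is a hypothetical helper used only for illustration; the model itself computes `torch.eq(input_ids, pad_token_id).int().argmax(-1) - 1` followed by a modulo over the sequence length.

```python
def last_token_index(input_ids, pad_token_id):
    """Index of the token just before the first pad, wrapping to the final slot if no pad exists."""
    # argmax over a 0/1 vector returns the first 1, or 0 when the vector is all zeros.
    first_pad = next((i for i, tok in enumerate(input_ids) if tok == pad_token_id), 0)
    # The modulo turns -1 (no pad found, or pad at index 0) into the last index.
    return (first_pad - 1) % len(input_ids)


print(last_token_index([5, 6, 7, 0, 0], pad_token_id=0))  # right-padded row
print(last_token_index([5, 6, 7], pad_token_id=0))        # no padding at all
```

Because the rule searches for the first pad, it fits right-padded or unpadded rows; a left-padded row also resolves to the final slot, which is its last real token.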
tokenization_teleflm.py ADDED
@@ -0,0 +1,403 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+
21
+ """Tokenization classes for Tele-FLM."""
22
+ import os
23
+ from shutil import copyfile
24
+ from typing import Any, Dict, List, Optional, Tuple
25
+
26
+ import sentencepiece as spm
27
+ import re
28
+ from transformers.convert_slow_tokenizer import import_protobuf
29
+ from transformers import AddedToken, PreTrainedTokenizer
30
+ from transformers.utils import logging
31
+ from transformers.tokenization_utils_base import TextInput
32
+
33
+ logger = logging.get_logger(__name__)
34
+
35
+ VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}
36
+
37
+ PRETRAINED_VOCAB_FILES_MAP = {
38
+ "vocab_file": {},
39
+ "tokenizer_file": {},
40
+ }
41
+ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
42
+ "teleflm-tokenizer": 8192,
43
+ }
44
+ SPIECE_UNDERLINE = "▁"
45
+
46
+
47
+ class TeleFLMTokenizer(PreTrainedTokenizer):
48
+ """
49
+ Construct a Tele-FLM tokenizer. Based on byte-level Byte-Pair-Encoding. The default padding token is unset as there is
50
+ no padding token in the original model.
51
+
52
+ Args:
53
+ vocab_file (`str`):
54
+ Path to the vocabulary file.
55
+ unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<unk>"`):
56
+ The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
57
+ token instead.
58
+ bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<s>"`):
59
+ The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
60
+ eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"</s>"`):
61
+ The end of sequence token.
62
+ pad_token (`str` or `tokenizers.AddedToken`, *optional*):
63
+ A special token used to make arrays of tokens the same size for batching purpose. Will then be ignored by
64
+ attention mechanisms or loss computation.
65
+ sp_model_kwargs (`Dict[str, Any]`, `Optional`, *optional*):
66
+ Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
67
+ SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
68
+ to set:
69
+
70
+ - `enable_sampling`: Enable subword regularization.
71
+ - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
72
+
73
+ - `nbest_size = {0,1}`: No sampling is performed.
74
+ - `nbest_size > 1`: samples from the nbest_size results.
75
+ - `nbest_size < 0`: assumes that nbest_size is infinite and samples from all hypotheses (lattice)
76
+ using forward-filtering-and-backward-sampling algorithm.
77
+
78
+ - `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
79
+ BPE-dropout.
80
+
81
+ add_bos_token (`bool`, *optional*, defaults to `False`):
82
+ Whether or not to add a `bos_token` at the start of sequences.
83
+ add_eos_token (`bool`, *optional*, defaults to `False`):
84
+ Whether or not to add an `eos_token` at the end of sequences.
85
+ clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
86
+ Whether or not to cleanup spaces after decoding, cleanup consists in removing potential artifacts like
87
+ extra spaces.
88
+ spaces_between_special_tokens (`bool`, *optional*, defaults to `False`):
89
+ Whether or not to add spaces between special tokens.
90
+
91
+ """
92
+
93
+ vocab_files_names = VOCAB_FILES_NAMES
94
+ pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
95
+ max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
96
+ model_input_names = ["input_ids", "attention_mask"]
97
+
98
+ def __init__(
99
+ self,
100
+ vocab_file,
101
+ bos_token="<s>",
102
+ eos_token="</s>",
103
+ unk_token="<unk>",
104
+ pad_token=None,
105
+ sp_model_kwargs: Optional[Dict[str, Any]] = None,
106
+ add_bos_token=False,
107
+ add_eos_token=False,
108
+ clean_up_tokenization_spaces=False,
109
+ spaces_between_special_tokens=False,
110
+ **kwargs,
111
+ ):
112
+ self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
113
+ bos_token = AddedToken(bos_token, normalized=False, special=True) if isinstance(bos_token, str) else bos_token
114
+ eos_token = AddedToken(eos_token, normalized=False, special=True) if isinstance(eos_token, str) else eos_token
115
+ pad_token = AddedToken(pad_token, normalized=False, special=True) if isinstance(pad_token, str) else pad_token
116
+ self.vocab_file = vocab_file
117
+ self.add_bos_token = add_bos_token
118
+ self.add_eos_token = add_eos_token
119
+ self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
120
+ super().__init__(
121
+ bos_token=bos_token,
122
+ eos_token=eos_token,
123
+ unk_token=unk_token,
124
+ pad_token=pad_token,
125
+ add_bos_token=add_bos_token,
126
+ add_eos_token=add_eos_token,
127
+ sp_model_kwargs=self.sp_model_kwargs,
128
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
129
+ spaces_between_special_tokens=spaces_between_special_tokens,
130
+ **kwargs,
131
+ )
132
+
133
+ @property
134
+ def unk_token_length(self):
135
+ return len(self.sp_model.encode(str(self.unk_token)))
136
+
137
+ # Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.get_spm_processor
138
+ def get_spm_processor(self, from_slow=False):
139
+ tokenizer = spm.SentencePieceProcessor(**self.sp_model_kwargs)
140
+ with open(self.vocab_file, "rb") as f:
141
+ sp_model = f.read()
142
+ model_pb2 = import_protobuf(f"The new behaviour of {self.__class__.__name__} (with `self.legacy = False`)")
143
+ model = model_pb2.ModelProto.FromString(sp_model)
144
+ normalizer_spec = model_pb2.NormalizerSpec()
145
+ normalizer_spec.add_dummy_prefix = True
146
+ model.normalizer_spec.MergeFrom(normalizer_spec)
147
+ sp_model = model.SerializeToString()
148
+ tokenizer.LoadFromSerializedProto(sp_model)
149
+ return tokenizer
150
+
151
+ def __getstate__(self):
152
+ state = self.__dict__.copy()
153
+ state["sp_model"] = None
154
+ state["sp_model_proto"] = self.sp_model.serialized_model_proto()
155
+ return state
156
+
157
+ def __setstate__(self, d):
158
+ self.__dict__ = d
159
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
160
+ self.sp_model.LoadFromSerializedProto(self.sp_model_proto)
161
+
162
+ @property
163
+ def vocab_size(self):
164
+ """Returns vocab size"""
165
+ return self.sp_model.get_piece_size()
166
+
167
+ def get_vocab(self):
168
+ """Returns vocab as a dict"""
169
+ vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
170
+ vocab.update(self.added_tokens_encoder)
171
+ return vocab
172
+
173
+ def tokenize(self, text: TextInput, **kwargs) -> List[str]:
174
+ """
175
+ Converts a string into a sequence of tokens, using the tokenizer.
176
+
177
+ Split in words for word-based vocabulary or sub-words for sub-word-based vocabularies
178
+ (BPE/SentencePieces/WordPieces). Takes care of added tokens.
179
+
180
+ Args:
181
+ text (`str`):
182
+ The sequence to be encoded.
183
+ **kwargs (additional keyword arguments):
184
+ Passed along to the model-specific `prepare_for_tokenization` preprocessing method.
185
+
186
+ Returns:
187
+ `List[str]`: The list of tokens.
188
+ """
189
+ split_special_tokens = kwargs.pop("split_special_tokens", self.split_special_tokens)
190
+ remove_dummy_prefix = kwargs.pop("remove_dummy_prefix", False)
191
+
192
+ text, kwargs = self.prepare_for_tokenization(text, **kwargs)
193
+
194
+ if kwargs:
195
+ logger.warning(f"Keyword arguments {kwargs} not recognized.")
196
+
197
+ if hasattr(self, "do_lower_case") and self.do_lower_case:
198
+ # convert non-special tokens to lowercase. Might be super slow as well?
199
+ escaped_special_toks = [re.escape(s_tok) for s_tok in (self.all_special_tokens)]
200
+ escaped_special_toks += [
201
+ re.escape(s_tok.content)
202
+ for s_tok in (self._added_tokens_decoder.values())
203
+ if not s_tok.special and s_tok.normalized
204
+ ]
205
+ pattern = r"(" + r"|".join(escaped_special_toks) + r")|" + r"(.+?)"
206
+ text = re.sub(pattern, lambda m: m.groups()[0] or m.groups()[1].lower(), text)
207
+
208
+ if split_special_tokens:
209
+ no_split_token = []
210
+ tokens = [text]
211
+ else:
212
+ no_split_token = self._added_tokens_encoder.keys() # don't split on any of the added tokens
213
+ # "This is something<special_token_1> else"
214
+ tokens = self.tokens_trie.split(text)
215
+
216
+ # ["This is something", "<special_token_1>", " else"]
217
+ for i, token in enumerate(tokens):
218
+ if token in no_split_token:
219
+ tok_extended = self._added_tokens_decoder.get(self._added_tokens_encoder[token], None)
220
+ left = tokens[i - 1] if i > 0 else None
221
+ right = tokens[i + 1] if i < len(tokens) - 1 else None
222
+ if isinstance(tok_extended, AddedToken):
223
+ if tok_extended.rstrip and right:
224
+ # A bit counter-intuitive but we strip the left of the string
225
+ # since tok_extended.rstrip means the special token is eating all white spaces on its right
226
+ tokens[i + 1] = right.lstrip()
227
+ # Strip white spaces on the left
228
+ if tok_extended.lstrip and left:
229
+ tokens[i - 1] = left.rstrip() # Opposite here
230
+ if tok_extended.single_word and left and left[-1] != " ":
231
+ tokens[i - 1] += token
232
+ tokens[i] = ""
233
+ elif tok_extended.single_word and right and right[0] != " ":
234
+ tokens[i + 1] = token + tokens[i + 1]
235
+ tokens[i] = ""
236
+ else:
237
+ raise ValueError(
238
+ f"{tok_extended} cannot be tokenized because it was not properly added"
239
+ f" to the tokenizer. This means that it is not an `AddedToken` but a {type(tok_extended)}"
240
+ )
241
+ # ["This is something", "<special_token_1>", "else"]
242
+ tokenized_text = []
243
+ for token in tokens:
244
+ # Need to skip eventual empty (fully stripped) tokens
245
+ if not token:
246
+ continue
247
+ if token in no_split_token:
248
+ tokenized_text.append(token)
249
+ else:
250
+ tokenized_text.extend(self._tokenize(token, remove_dummy_prefix=remove_dummy_prefix))
251
+ # ["This", " is", " something", "<special_token_1>", "else"]
252
+ return tokenized_text
253
+
254
+ def _tokenize(self, text, **kwargs):
255
+ """
256
+ Returns a tokenized string.
257
+
258
+ We add an option to remove the dummy prefix during tokenization instead of changing the default behaviour of the sentencepiece tokenizer.
259
+ This is useful when two tokenized sentences are merged into one, as the second one would otherwise carry an extra dummy prefix, which results in an
260
+ inconsistent pattern.
261
+ """
262
+ tokens = self.sp_model.encode(text, out_type=str)
263
+ if text.startswith((SPIECE_UNDERLINE, " ")):
264
+ return tokens
265
+ if len(tokens) > 0 and kwargs.get("remove_dummy_prefix") is True:
266
+ tokens[0] = tokens[0].replace(SPIECE_UNDERLINE, "", 1)
267
+ return tokens
268
+
269
+ def _convert_token_to_id(self, token):
270
+ """Converts a token (str) to an id using the vocab."""
271
+ return self.sp_model.piece_to_id(token)
272
+
273
+ def _convert_id_to_token(self, index):
274
+ """Converts an index (integer) to a token (str) using the vocab."""
275
+ token = self.sp_model.IdToPiece(index)
276
+ return token
277
+
278
+ def convert_tokens_to_string(self, tokens):
279
+ """Converts a sequence of tokens (strings) into a single string."""
280
+ current_sub_tokens = []
281
+ out_string = ""
282
+ # prev_is_special = False
283
+ for i, token in enumerate(tokens):
284
+ # make sure that special tokens are not decoded using sentencepiece model
285
+ if token in self.all_special_tokens:
286
+ # if not prev_is_special and i != 0 and self.legacy:
287
+ # out_string += " "
288
+ out_string += self.sp_model.decode(current_sub_tokens) + token
289
+ # prev_is_special = True
290
+ current_sub_tokens = []
291
+ else:
292
+ current_sub_tokens.append(token)
293
+ # prev_is_special = False
294
+ out_string += self.sp_model.decode(current_sub_tokens)
295
+ return out_string
296
+
297
+ def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> Tuple[str]:
298
+ """
299
+ Save the vocabulary and special tokens file to a directory.
300
+
301
+ Args:
302
+ save_directory (`str`):
303
+ The directory in which to save the vocabulary.
304
+
305
+ Returns:
306
+ `Tuple(str)`: Paths to the files saved.
307
+ """
308
+ if not os.path.isdir(save_directory):
309
+ logger.error(f"Vocabulary path ({save_directory}) should be a directory")
310
+ return
311
+ out_vocab_file = os.path.join(
312
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
313
+ )
314
+
315
+ if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
316
+ copyfile(self.vocab_file, out_vocab_file)
317
+ elif not os.path.isfile(self.vocab_file):
318
+ with open(out_vocab_file, "wb") as fi:
319
+ content_spiece_model = self.sp_model.serialized_model_proto()
320
+ fi.write(content_spiece_model)
321
+
322
+ return (out_vocab_file,)
323
+
324
+ def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
325
+ bos_token_id = [self.bos_token_id] if self.add_bos_token else []
326
+ eos_token_id = [self.eos_token_id] if self.add_eos_token else []
327
+
328
+ output = bos_token_id + token_ids_0 + eos_token_id
329
+
330
+ if token_ids_1 is not None:
331
+ output = output + bos_token_id + token_ids_1 + eos_token_id
332
+
333
+ return output
334
+
335
+ def get_special_tokens_mask(
336
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
337
+ ) -> List[int]:
338
+ """
339
+ Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
340
+ special tokens using the tokenizer `prepare_for_model` method.
341
+
342
+ Args:
343
+ token_ids_0 (`List[int]`):
344
+ List of IDs.
345
+ token_ids_1 (`List[int]`, *optional*):
346
+ Optional second list of IDs for sequence pairs.
347
+ already_has_special_tokens (`bool`, *optional*, defaults to `False`):
348
+ Whether or not the token list is already formatted with special tokens for the model.
349
+
350
+ Returns:
351
+ `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
352
+ """
353
+ if already_has_special_tokens:
354
+ return super().get_special_tokens_mask(
355
+ token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
356
+ )
357
+
358
+ bos_token_id = [1] if self.add_bos_token else []
359
+ eos_token_id = [1] if self.add_eos_token else []
360
+
361
+ if token_ids_1 is None:
362
+ return bos_token_id + ([0] * len(token_ids_0)) + eos_token_id
363
+ return (
364
+ bos_token_id
365
+ + ([0] * len(token_ids_0))
366
+ + eos_token_id
367
+ + bos_token_id
368
+ + ([0] * len(token_ids_1))
369
+ + eos_token_id
370
+ )
371
+
372
+ def create_token_type_ids_from_sequences(
373
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
374
+ ) -> List[int]:
375
+ """
376
+ Creates a mask from the two sequences passed to be used in a sequence-pair classification task. The
377
+ sequence pair mask has the following format:
378
+
379
+ ```
380
+ 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
381
+ | first sequence | second sequence |
382
+ ```
383
+
384
+ If `token_ids_1` is None, only returns the first portion of the mask (0s).
385
+
386
+ Args:
387
+ token_ids_0 (`List[int]`):
388
+ List of ids.
389
+ token_ids_1 (`List[int]`, *optional*):
390
+ Optional second list of IDs for sequence pairs.
391
+
392
+ Returns:
393
+ `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
394
+ """
395
+ bos_token_id = [self.bos_token_id] if self.add_bos_token else []
396
+ eos_token_id = [self.eos_token_id] if self.add_eos_token else []
397
+
398
+ output = [0] * len(bos_token_id + token_ids_0 + eos_token_id)
399
+
400
+ if token_ids_1 is not None:
401
+ output += [1] * len(bos_token_id + token_ids_1 + eos_token_id)
402
+
403
+ return output
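
The `remove_dummy_prefix` option handled in `_tokenize` above can be illustrated without loading the SentencePiece model. The following is a minimal sketch, assuming the pieces have already been produced by the encoder; the helper name `strip_dummy_prefix` and the example piece lists are hypothetical and are not part of the uploaded file.

```python
SPIECE_UNDERLINE = "▁"


def strip_dummy_prefix(text, pieces, remove_dummy_prefix=False):
    """Mirror the post-processing step of `_tokenize` on an already-encoded piece list.

    `pieces` stands in for the output of `sp_model.encode(text, out_type=str)`.
    """
    pieces = list(pieces)  # work on a copy so the caller's list is untouched
    # Text that already starts with a space (or "▁") keeps its leading marker.
    if text.startswith((SPIECE_UNDERLINE, " ")):
        return pieces
    # Otherwise drop the first "▁" of the first piece, but only when asked to.
    if pieces and remove_dummy_prefix is True:
        pieces[0] = pieces[0].replace(SPIECE_UNDERLINE, "", 1)
    return pieces


# Hypothetical piece lists for two fragments that will be concatenated.
first = strip_dummy_prefix("Hello", ["▁Hello"])
second = strip_dummy_prefix("world", ["▁world"], remove_dummy_prefix=True)
merged = first + second
print(merged)
```

With the option enabled, the second fragment no longer starts with the dummy prefix, so the merged piece list does not contain a marker that was absent from the original text.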
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1e2bf2c2d38bab8a4d7107e36073be27be40a625b2f4e57f5a0609bdb70deed8
3
+ size 1159468
tokenizer_config.json ADDED
@@ -0,0 +1,54 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_eos_token": false,
4
+ "added_tokens_decoder": {
5
+ "0": {
6
+ "content": "<unk>",
7
+ "lstrip": false,
8
+ "normalized": true,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "1": {
14
+ "content": "<s>",
15
+ "lstrip": false,
16
+ "normalized": true,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "2": {
22
+ "content": "</s>",
23
+ "lstrip": false,
24
+ "normalized": true,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "3": {
30
+ "content": "<pad>",
31
+ "lstrip": false,
32
+ "normalized": true,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ }
37
+ },
38
+ "auto_map": {
39
+ "AutoTokenizer": [
40
+ "tokenization_teleflm.TeleFLMTokenizer",
41
+ null
42
+ ]
43
+ },
44
+ "bos_token": "<s>",
45
+ "clean_up_tokenization_spaces": false,
46
+ "eos_token": "</s>",
47
+ "model_max_length": 8192,
48
+ "pad_token": "<pad>",
49
+ "sp_model_kwargs": {},
50
+ "spaces_between_special_tokens": false,
51
+ "tokenizer_class": "TeleFLMTokenizer",
52
+ "unk_token": "<unk>",
53
+ "use_fast": false
54
+ }
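
To see how the `add_bos_token` and `add_eos_token` flags in this config interact with `build_inputs_with_special_tokens`, here is a minimal sketch that reimplements the same concatenation logic on top of an abridged copy of the config. The function `build_inputs`, the `CONFIG_EXCERPT` string, and the example token ids are illustrative assumptions; the real tokenizer resolves the ids through its vocabulary rather than through this lookup.

```python
import json

# Abridged excerpt of tokenizer_config.json (only the fields used below).
CONFIG_EXCERPT = """
{
  "add_bos_token": false,
  "add_eos_token": false,
  "added_tokens_decoder": {
    "0": {"content": "<unk>"},
    "1": {"content": "<s>"},
    "2": {"content": "</s>"},
    "3": {"content": "<pad>"}
  },
  "bos_token": "<s>",
  "eos_token": "</s>"
}
"""


def build_inputs(config, token_ids_0, token_ids_1=None):
    """Hypothetical stand-in for `build_inputs_with_special_tokens`."""
    # Invert added_tokens_decoder: token content -> integer id.
    token_to_id = {
        entry["content"]: int(idx)
        for idx, entry in config["added_tokens_decoder"].items()
    }
    bos = [token_to_id[config["bos_token"]]] if config["add_bos_token"] else []
    eos = [token_to_id[config["eos_token"]]] if config["add_eos_token"] else []
    output = bos + token_ids_0 + eos
    if token_ids_1 is not None:
        output = output + bos + token_ids_1 + eos
    return output


config = json.loads(CONFIG_EXCERPT)
# Both flags are false in the shipped config, so the ids pass through unchanged.
unchanged = build_inputs(config, [10, 11, 12])
# Overriding the flag prepends the id of "<s>" to every sequence.
with_bos = build_inputs({**config, "add_bos_token": True}, [10, 11, 12])
pair_with_bos = build_inputs({**config, "add_bos_token": True}, [10], [20])
print(unchanged, with_bos, pair_with_bos)
```

Because both flags are `false` here, callers that need a leading `<s>` have to add it themselves or override the flag when loading the tokenizer.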