philipp-zettl committed
Commit ed99ccc
1 Parent(s): 04934f4

Update README.md

Files changed (1)
  1. README.md +166 -25
README.md CHANGED
@@ -1,6 +1,32 @@
---
library_name: transformers
- tags: []
---

# Model Card for Model ID

@@ -17,21 +43,11 @@ tags: []

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]

## Uses

@@ -71,34 +87,159 @@ Users (both direct and downstream) should be made aware of the risks, biases and

Use the code below to get started with the model.

- [More Information Needed]

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- [More Information Needed]

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

- #### Preprocessing [optional]

- [More Information Needed]

- #### Training Hyperparameters

- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

- #### Speeds, Sizes, Times [optional]

- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

- [More Information Needed]

## Evaluation

---
library_name: transformers
+ datasets:
+ - google-research-datasets/tydiqa
+ license: apache-2.0
+ pipeline_tag: text2text-generation
+ base_model: google/flan-t5-small
+ widget:
+ - text: "question: What is the huggingface hub? context: The Hugging Face Hub is a
+     platform with over 350k models, 75k datasets, and 150k demo apps (Spaces),
+     all open source and publicly available, in an online platform where people
+     can easily collaborate and build ML together. The Hub works as a central
+     place where anyone can explore, experiment, collaborate, and build
+     technology with Machine Learning. Are you ready to join the path towards
+     open source Machine Learning? 🤗"
+   example_title: 🤗 Hub
+ - text: "question: What is huggingface datasets? context: 🤗 Datasets is a library
+     for easily accessing and sharing datasets for Audio, Computer Vision, and
+     Natural Language Processing (NLP) tasks. Load a dataset in a single line
+     of code, and use our powerful data processing methods to quickly get your
+     dataset ready for training in a deep learning model. Backed by the Apache
+     Arrow format, process large datasets with zero-copy reads without any
+     memory constraints for optimal speed and efficiency. We also feature a
+     deep integration with the Hugging Face Hub, allowing you to easily load
+     and share a dataset with the wider machine learning community. Find your
+     dataset today on the Hugging Face Hub, and take an in-depth look inside of
+     it with the live viewer."
+   example_title: 🤗 datasets
+
---

# Model Card for Model ID

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

+ - **Developed by:** [philipp-zettl](https://huggingface.co/philipp-zettl)
+ - **Model type:** Seq2Seq
+ - **Language(s) (NLP):** English
+ - **License:** Apache 2.0
+ - **Finetuned from model:** [google/flan-t5-small](https://huggingface.co/google/flan-t5-small)

## Uses

Use the code below to get started with the model.

+ ```python
+ # Load model directly
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ tokenizer = AutoTokenizer.from_pretrained("philipp-zettl/t5-small-tydiqa-en")
+ model = AutoModelForSeq2SeqLM.from_pretrained("philipp-zettl/t5-small-tydiqa-en").to(device)
+
+ question = "Some question?"
+ # For instance retrieved using similarity search
+ context = "A long context ..."
+
+ # The model expects inputs in the "question: ... context: ..." format
+ inputs = [f"question: {q} context: {c}" for q, c in [[question, context]]]
+ model_inputs = tokenizer(inputs, max_length=512, padding=True, truncation=True)
+ input_ids = torch.tensor(model_inputs['input_ids']).to(device)
+ attention_mask = torch.tensor(model_inputs['attention_mask']).to(device)
+
+ with torch.no_grad():
+     sample_output = model.generate(input_ids[:1], attention_mask=attention_mask[:1], max_length=100)
+
+ sample_output_text = tokenizer.decode(sample_output[0], skip_special_tokens=True)
+ input_text = tokenizer.decode(input_ids[0], skip_special_tokens=True)
+ print("Sample Input:", input_text)
+ print("Sample Output:", sample_output_text)
+ ```
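
Alternatively, since the card declares `pipeline_tag: text2text-generation`, the model can also be queried through the high-level `pipeline` API. A minimal sketch, using the same "question: ... context: ..." prompt format as above:

```python
from transformers import pipeline

qa = pipeline("text2text-generation", model="philipp-zettl/t5-small-tydiqa-en")
result = qa("question: Some question? context: A long context ...", max_length=100)
print(result[0]["generated_text"])
```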

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+ Trained on the English samples of [google-research-datasets/tydiqa](https://huggingface.co/datasets/google-research-datasets/tydiqa) using the following code:
+ ```python
+ from datasets import load_dataset
+
+ # Load the TyDi QA dataset (secondary_task = gold passage question answering)
+ tydiqa_dataset = load_dataset('google-research-datasets/tydiqa', 'secondary_task')
+
+ # Keep only the English examples for training and validation
+ train_dataset = tydiqa_dataset['train'].filter(lambda e: any([e['id'].startswith(lang) for lang in ['english']]))
+ validation_dataset = tydiqa_dataset['validation'].filter(lambda e: any([e['id'].startswith(lang) for lang in ['english']]))
+ ```

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

+ #### Preprocessing
+ Code for preprocessing:
+ ```python
+ def preprocess_batch(batch, tokenizer, max_input_length=512, max_output_length=128):
+     questions = batch['question']
+     contexts = batch['context']
+     answers = [answer['text'][0] for answer in batch['answers']]
+
+     # Build "question: ... context: ..." prompts and tokenize them
+     inputs = [f"question: {q} context: {c}" for q, c in zip(questions, contexts)]
+     model_inputs = tokenizer(inputs, max_length=max_input_length, padding=True, truncation=True)
+
+     # Tokenized answer texts serve as the target labels
+     labels = tokenizer(answers, max_length=max_output_length, padding=True, truncation=True)
+     model_inputs['labels'] = labels['input_ids']
+
+     return model_inputs
+
+ # Tokenize the dataset
+ # (teacher_tokenizer: the tokenizer of the base model, google/flan-t5-small, loaded earlier)
+ train_dataset = train_dataset.map(lambda batch: preprocess_batch(batch, teacher_tokenizer), batched=True)
+ validation_dataset = validation_dataset.map(lambda batch: preprocess_batch(batch, teacher_tokenizer), batched=True)
+
+ # Set format for PyTorch
+ train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
+ validation_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
+ ```
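
One refinement that is not part of the preprocessing above: `tokenizer(..., padding=True)` leaves real pad-token ids inside `labels`, so those positions contribute to the cross-entropy loss (`DataCollatorForSeq2Seq` only uses `-100` for the padding it adds itself). A minimal, hypothetical helper that would mask them before they are assigned to `model_inputs['labels']`:

```python
def mask_label_padding(label_ids, pad_token_id):
    # Replace pad-token ids with -100 so the loss ignores padded label positions
    return [
        [token if token != pad_token_id else -100 for token in sequence]
        for sequence in label_ids
    ]

# Hypothetical usage inside preprocess_batch:
# model_inputs['labels'] = mask_label_padding(labels['input_ids'], tokenizer.pad_token_id)
```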

+ #### Training Hyperparameters
+ Code of the training loop:
+ ```python
+ import torch
+ from tqdm import tqdm
+ from transformers import DataCollatorForSeq2Seq
+ from torch.utils.data import DataLoader
+ from torch.utils.tensorboard import SummaryWriter
+
+ # Assumes the base model and tokenizer were loaded beforehand, e.g.:
+ # teacher_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
+ # teacher_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
+ # device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ torch.cuda.empty_cache()
+
+ teacher_model.to(device)
+
+ # Training parameters
+ epochs = 3
+ learning_rate = 5e-5
+ temperature = 2.0  # not used in this loop
+ batch_size = 2
+ optimizer = torch.optim.AdamW(teacher_model.parameters(), lr=learning_rate)
+
+ # Create a data collator for padding and batching
+ data_collator = DataCollatorForSeq2Seq(tokenizer=teacher_tokenizer, model=teacher_model)
+
+ # Create DataLoaders with the data collator
+ train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=data_collator)
+ validation_dataloader = DataLoader(validation_dataset, batch_size=batch_size, collate_fn=data_collator)
+
+ writer = SummaryWriter('./logs', comment='t5-base')
+
+ print("Starting training...")
+
+ # Training loop
+ for epoch in range(epochs):
+     teacher_model.train()
+     total_loss = 0
+     print(f"Epoch {epoch+1}/{epochs}")
+
+     progress_bar = tqdm(train_dataloader, desc="Training", leave=False)
+
+     for step, batch in enumerate(progress_bar):
+         # Move the batch to the training device
+         input_ids = batch['input_ids'].to(device)
+         attention_mask = batch['attention_mask'].to(device)
+         labels = batch['labels'].to(device)
+
+         # Forward pass
+         teacher_outputs = teacher_model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+         teacher_logits = teacher_outputs.logits
+
+         # Calculate losses
+         loss = teacher_outputs.loss  # Cross-entropy loss
+         writer.add_scalar("Loss/train", loss, step)
+
+         # Backpropagation
+         optimizer.zero_grad()
+         loss.backward()
+         optimizer.step()
+
+         total_loss += loss.item()
+
+         # Verbose logging (every step)
+         if step % 1 == 0 or step == len(train_dataloader) - 1:
+             progress_bar.set_postfix({
+                 'step': step,
+                 'loss': loss.item(),
+             })
+
+             # Generate a sample output from the model being fine-tuned
+             teacher_model.eval()
+             with torch.no_grad():
+                 sample_output = teacher_model.generate(input_ids[:1], max_length=50)
+                 sample_output_text = teacher_tokenizer.decode(sample_output[0], skip_special_tokens=True)
+                 input_text = teacher_tokenizer.decode(input_ids[0], skip_special_tokens=True)
+                 writer.add_text("Sample Input", input_text, step)
+                 writer.add_text("Sample Output", sample_output_text, step)
+             teacher_model.train()
+
+     avg_loss = total_loss / len(train_dataloader)
+     print(f"Epoch {epoch+1} completed. Average Loss: {avg_loss:.4f}")
+     writer.add_scalar("AVG Loss/train", avg_loss, epoch)
+
+ print("Training complete.")
+ writer.close()
+ ```
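
The `validation_dataloader` created above is never used in the training loop. A minimal evaluation sketch, assuming the same `teacher_model`, `validation_dataloader`, and `device` as above (not part of the original code):

```python
import torch

def evaluate(model, dataloader, device):
    # Average cross-entropy loss over the validation set
    model.eval()
    total_loss = 0.0
    with torch.no_grad():
        for batch in dataloader:
            outputs = model(
                input_ids=batch['input_ids'].to(device),
                attention_mask=batch['attention_mask'].to(device),
                labels=batch['labels'].to(device),
            )
            total_loss += outputs.loss.item()
    return total_loss / len(dataloader)

print(f"Validation Loss: {evaluate(teacher_model, validation_dataloader, device):.4f}")
```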

## Evaluation
