license: mit
datasets:
- philipp-zettl/qg-tydiqa_squad2
language:
- en
library_name: transformers
pipeline_tag: text2text-generation
widget:
- text: >-
context: The Hugging Face Hub is a platform with over 350k models, 75k
datasets, and 150k demo apps (Spaces), all open source and publicly
available, in an online platform where people can easily collaborate and
build ML together. The Hub works as a central place where anyone can
explore, experiment, collaborate, and build technology with Machine
Learning. Are you ready to join the path towards open source Machine
Learning? π€
example_title: π€ Hub
- text: >-
context: π€ Datasets is a library for easily accessing and sharing
datasets for Audio, Computer Vision, and Natural Language Processing (NLP)
tasks. Load a dataset in a single line of code, and use our powerful data
processing methods to quickly get your dataset ready for training in a
deep learning model. Backed by the Apache Arrow format, process large
datasets with zero-copy reads without any memory constraints for optimal
speed and efficiency. We also feature a deep integration with the Hugging
Face Hub, allowing you to easily load and share a dataset with the wider
machine learning community. Find your dataset today on the Hugging Face
Hub, and take an in-depth look inside of it with the live viewer.
example_title: π€ datasets
Model Card for t5-small-qg
Model Details
Model Description
This model was trained to generate questions out of a given context.
- Developed by: philipp-zettl
- Model type: Transformer (T5)
- Language(s) (NLP): English
- License: M.I.T
- Finetuned from model [optional]: google/flan-t5-small
Model Sources [optional]
Fine-tune of the amazing google/flan-t5-small
Uses
It's intended to use the model to generate questions from given context. The context should not exceed the model's context length.
Bias, Risks, and Limitations
No bias evaluation was performed on this model.
How to Get Started with the Model
Use the code below to get started with the model.
context = "This is a long text based of multiple concatenated paragraphs."
model_inputs = tokenizer([f"context: {context}"], max_length=512, padding=True, truncation=True)
input_ids = torch.tensor(model_inputs['input_ids']).to(device)
attention_mask = torch.tensor(model_inputs['attention_mask']).to(device)
with torch.no_grad():
sample_output = model.generate(input_ids[:1], max_length=85)
sample_output_text = tokenizer.decode(sample_output[0], skip_special_tokens=True)
input_text = tokenizer.decode(input_ids[0], skip_special_tokens=True)
print(f"Sample Input:\n \"{input_text}\"\n\n")
print(f"Model Output: \"{sample_output_text}\"")
Training Details
Training Data
This model was trained on philipp-zettl/qg-tydiqa_squad2.
The training data was collected by combining philipp-zettl/tydiqa-task_2-english with nvidia/ChatQA-Training-Data#squad2.0.
From each base dataset we selected the context
and question
attributes of each sample. Then joined them together into philipp-zettl/qg-tydiqa_squad2.
Training Procedure
Below you can find the full training pipeline used to achieve this fine-tune.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# Base model (e.g., T5-large)
# https://huggingface.co/collections/google/flan-t5-release-65005c39e3201fff885e22fb
model_name = 'google/flan-t5-small'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Move only the student model to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
Load dataset
from datasets import load_dataset
# Load dataset
squad_dataset = load_dataset('philipp-zettl/qg-tydiqa_squad2')
# Split the dataset into training and validation
train_dataset = squad_dataset['train']
validation_dataset = squad_dataset['test']
Preprocessing: tokenize inputs and labels for faster training cycles, i.e. no need for tokenization during training anymore
def preprocess_batch(batch, tokenizer, max_input_length=512, max_output_length=128):
contexts = batch['context']
answers = batch['question']
inputs = [f"context: {c}" for c in contexts]
model_inputs = tokenizer(inputs, max_length=max_input_length, padding=True, truncation=True)
labels = tokenizer(answers, max_length=max_output_length, padding=True, truncation=True)
model_inputs['labels'] = labels['input_ids']
return model_inputs
# Tokenize the dataset
train_dataset = train_dataset.map(lambda batch: preprocess_batch(batch, tokenizer), batched=True)
validation_dataset = validation_dataset.map(lambda batch: preprocess_batch(batch, tokenizer), batched=True)
# Set format for PyTorch
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
validation_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
The train loop
from tqdm import tqdm
from transformers import AdamW, DataCollatorForSeq2Seq
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
torch.cuda.empty_cache()
model.to(device)
# Training parameters
epochs = 3
learning_rate = 5e-5
temperature = 2.0
batch_size = 8
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
# Create a data collator for padding and batching
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
# Create DataLoaders with the data collator
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=data_collator)
validation_dataloader = DataLoader(validation_dataset, batch_size=batch_size, collate_fn=data_collator)
writer = SummaryWriter(comment='t5-small-qg')
print("Starting training...")
# Training loop
for epoch in range(epochs):
model.train()
total_loss = 0
print(f"Epoch {epoch+1}/{epochs}")
progress_bar = tqdm(train_dataloader, desc="Training", leave=False)
for step, batch in enumerate(progress_bar):
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
labels = batch['labels'].to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
logits = outputs.logits
# Calculate losses
loss = outputs.loss # Cross-entropy loss
writer.add_scalar("Loss/train", loss, step)
# Backpropagation
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
# Verbose logging
if step % 100 == 1 or step == len(train_dataloader) - 1:
progress_bar.set_postfix({
'step': step,
'loss': loss.item(),
})
# Generate a sample output from the student model
model.eval()
with torch.no_grad():
sample_output = model.generate(input_ids[:1], max_length=50)
sample_output_text = tokenizer.decode(sample_output[0], skip_special_tokens=True)
input_text = tokenizer.decode(input_ids[0], skip_special_tokens=True)
writer.add_text(f"Sample Input", input_text, step)
writer.add_text(f"Sample Output", sample_output_text, step)
model.train()
avg_loss = total_loss / len(train_dataloader)
print(f"Epoch {epoch+1} completed. Average Loss: {avg_loss:.4f}")
writer.add_scalar("AVG Loss/train", avg_loss, epoch)
print("Training complete.")
writer.close()