library_name: transformers
datasets:
- google-research-datasets/tydiqa
license: apache-2.0
pipeline_tag: text2text-generation
base_model: google/flan-t5-small
widget:
- text: >-
question: What is the huggingface hub? context: The Hugging Face Hub is a
platform with over 350k models, 75k datasets, and 150k demo apps (Spaces),
all open source and publicly available, in an online platform where people
can easily collaborate and build ML together. The Hub works as a central
place where anyone can explore, experiment, collaborate, and build
technology with Machine Learning. Are you ready to join the path towards
open source Machine Learning? 🤗
example_title: 🤗 Hub
- text: >-
question: What is huggingface datasets? context: 🤗 Datasets is a library
for easily accessing and sharing datasets for Audio, Computer Vision, and
Natural Language Processing (NLP) tasks. Load a dataset in a single line
of code, and use our powerful data processing methods to quickly get your
dataset ready for training in a deep learning model. Backed by the Apache
Arrow format, process large datasets with zero-copy reads without any
memory constraints for optimal speed and efficiency. We also feature a
deep integration with the Hugging Face Hub, allowing you to easily load
and share a dataset with the wider machine learning community. Find your
dataset today on the Hugging Face Hub, and take an in-depth look inside of
it with the live viewer.
example_title: 🤗 datasets
Model Card for philipp-zettl/t5-small-tydiqa-en
Model Details
Model Description
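This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated. The model is google/flan-t5-small fine-tuned on the English samples of google-research-datasets/tydiqa: it takes a question together with a context passage and generates the answer as text.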
- Developed by: philipp-zettl
- Model type: Seq2Seq
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: google/flan-t5-small
Uses
Direct Use
[More Information Needed]
Downstream Use [optional]
[More Information Needed]
Out-of-Scope Use
[More Information Needed]
Bias, Risks, and Limitations
[More Information Needed]
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
How to Get Started with the Model
Use the code below to get started with the model.
# Load model directly
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use a GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("philipp-zettl/t5-small-tydiqa-en")
model = AutoModelForSeq2SeqLM.from_pretrained("philipp-zettl/t5-small-tydiqa-en").to(device)

question = "Some question?"
# For instance retrieved using similarity search
context = "A long context ..."

# The model expects prompts of the form "question: ... context: ..."
inputs = [f"question: {q} context: {c}" for q, c in [[question, context]]]
model_inputs = tokenizer(inputs, max_length=512, padding=True, truncation=True, return_tensors="pt")
input_ids = model_inputs["input_ids"].to(device)
attention_mask = model_inputs["attention_mask"].to(device)

# Generate the answer without tracking gradients
with torch.no_grad():
    sample_output = model.generate(input_ids[:1], attention_mask=attention_mask[:1], max_length=100)

sample_output_text = tokenizer.decode(sample_output[0], skip_special_tokens=True)
input_text = tokenizer.decode(input_ids[0], skip_special_tokens=True)
print("Sample Input", input_text)
print("Sample Output", sample_output_text)
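The "question: ... context: ..." input format used above can be wrapped in a small helper so that every caller builds prompts the same way. The following is a minimal sketch: build_prompt and build_prompts are hypothetical names that are not part of the model repository or of transformers, and the whitespace collapsing is an optional addition that the original snippet does not perform.

```python
from typing import Iterable, List, Tuple


def build_prompt(question: str, context: str) -> str:
    """Format one question/context pair in the "question: ... context: ..." layout."""
    # Collapse line breaks and repeated spaces so the prompt is a single line.
    # (Assumption: this normalization is an extra, not part of the model card code.)
    question = " ".join(question.split())
    context = " ".join(context.split())
    return f"question: {question} context: {context}"


def build_prompts(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Format several (question, context) pairs for batched tokenization."""
    return [build_prompt(q, c) for q, c in pairs]


prompts = build_prompts([
    ("What is the Hub?", "The Hugging Face Hub is a platform\nfor models and datasets."),
])
print(prompts[0])
```

The resulting list can be passed to the tokenizer in place of the inline list comprehension.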
Training Details
Training Data
Trained on the English samples of google-research-datasets/tydiqa (secondary_task configuration), selected with the following code:
from datasets import load_dataset

# Load the TyDi QA dataset (secondary_task configuration)
tydiqa_dataset = load_dataset('google-research-datasets/tydiqa', 'secondary_task')

# Keep only the English samples of the training and validation splits;
# the filter relies on each example id starting with the language name
train_dataset = tydiqa_dataset['train'].filter(lambda e: e['id'].startswith('english'))
validation_dataset = tydiqa_dataset['validation'].filter(lambda e: e['id'].startswith('english'))
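To make the language filter concrete, the sketch below applies the same id-prefix check to a few hand-written records instead of the real dataset. The ids are invented for illustration and only mimic the language-prefixed shape that the filter above relies on; is_english is a hypothetical helper name.

```python
def is_english(example: dict) -> bool:
    """Return True when the example id starts with the language name 'english'."""
    return example["id"].startswith("english")


# Invented records; only the 'id' prefix matters for the filter.
toy_examples = [
    {"id": "english-0001-0", "question": "Who wrote it?"},
    {"id": "finnish-0002-0", "question": "Kuka kirjoitti sen?"},
    {"id": "english-0003-1", "question": "Where is it?"},
]

english_only = [e for e in toy_examples if is_english(e)]
print([e["id"] for e in english_only])
```

A named predicate like this can be passed to the filter method of a datasets split in place of the lambda.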
Training Procedure
Preprocessing
Code for preprocessing:
# teacher_tokenizer is the tokenizer of the model being fine-tuned (its loading is not shown in this card)
def preprocess_batch(batch, tokenizer, max_input_length=512, max_output_length=128):
    questions = batch['question']
    contexts = batch['context']
    # Use the first reference answer of each example as the target
    answers = [answer['text'][0] for answer in batch['answers']]

    # Build prompts of the form "question: ... context: ..."
    inputs = [f"question: {q} context: {c}" for q, c in zip(questions, contexts)]
    model_inputs = tokenizer(inputs, max_length=max_input_length, padding=True, truncation=True)

    # Tokenize the answers and attach them as labels
    labels = tokenizer(answers, max_length=max_output_length, padding=True, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

# Tokenize the dataset
train_dataset = train_dataset.map(lambda batch: preprocess_batch(batch, teacher_tokenizer), batched=True)
validation_dataset = validation_dataset.map(lambda batch: preprocess_batch(batch, teacher_tokenizer), batched=True)

# Set format for PyTorch
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
validation_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
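One detail of the preprocessing above: the labels are padded with the tokenizer's pad token, so padded positions appear to count toward the cross-entropy loss. A common variation, which is not what the code above does, replaces pad ids in the labels with -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows that replacement on plain Python lists; mask_label_padding is a hypothetical helper, the token ids are made up, and pad_token_id=0 is the T5 convention assumed here.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_label_padding(label_ids: List[List[int]], pad_token_id: int = 0) -> List[List[int]]:
    """Return a copy of the label sequences with pad ids replaced by IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]


# Two made-up label sequences, padded to the same length with id 0.
padded_labels = [[363, 19, 1, 0, 0], [8, 5, 27, 9, 1]]
masked = mask_label_padding(padded_labels)
print(masked)
```

If used, such a step would sit just before the labels are assigned in preprocess_batch; whether it changes results for this model has not been measured here.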
Training Hyperparameters
Code of the training loop:
import torch
from tqdm import tqdm
from transformers import DataCollatorForSeq2Seq
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter

# teacher_model and teacher_tokenizer are the model being fine-tuned and its tokenizer (loading not shown)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.empty_cache()
teacher_model.to(device)

# Training parameters
epochs = 3
learning_rate = 5e-5
temperature = 2.0  # defined but not used by the loop below
batch_size = 2
optimizer = torch.optim.AdamW(teacher_model.parameters(), lr=learning_rate)

# Create a data collator for padding and batching
data_collator = DataCollatorForSeq2Seq(tokenizer=teacher_tokenizer, model=teacher_model)

# Create DataLoaders with the data collator
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=data_collator)
validation_dataloader = DataLoader(validation_dataset, batch_size=batch_size, collate_fn=data_collator)

writer = SummaryWriter('./logs', comment='t5-base')
print("Starting training...")

# Training loop
for epoch in range(epochs):
    teacher_model.train()
    total_loss = 0
    print(f"Epoch {epoch+1}/{epochs}")
    progress_bar = tqdm(train_dataloader, desc="Training", leave=False)

    for step, batch in enumerate(progress_bar):
        # Move the batch to the training device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Forward pass; passing labels makes the model return the cross-entropy loss
        teacher_outputs = teacher_model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = teacher_outputs.loss
        writer.add_scalar("Loss/train", loss.item(), step)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

        # Verbose logging (step % 1 is always 0, so this runs on every step)
        if step % 1 == 0 or step == len(train_dataloader) - 1:
            progress_bar.set_postfix({
                'step': step,
                'loss': loss.item(),
            })

            # Generate a sample output from the model being trained
            teacher_model.eval()
            with torch.no_grad():
                sample_output = teacher_model.generate(input_ids[:1], max_length=50)
            sample_output_text = teacher_tokenizer.decode(sample_output[0], skip_special_tokens=True)
            input_text = teacher_tokenizer.decode(input_ids[0], skip_special_tokens=True)
            writer.add_text("Sample Input", input_text, step)
            writer.add_text("Sample Output", sample_output_text, step)
            teacher_model.train()

    avg_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch+1} completed. Average Loss: {avg_loss:.4f}")
    writer.add_scalar("AVG Loss/train", avg_loss, epoch)

print("Training complete.")
writer.close()
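In the loop above, step restarts at 0 in every epoch, so the per-step values written to TensorBoard in later epochs land on the same x positions as the earlier ones. A common remedy is to log against a global step that keeps counting across epochs. The sketch below shows only the arithmetic; global_step is a hypothetical helper and the numbers are made up.

```python
def global_step(epoch: int, step: int, steps_per_epoch: int) -> int:
    """Map a (zero-based epoch, zero-based step) pair to one running counter."""
    return epoch * steps_per_epoch + step


steps_per_epoch = 4  # stands in for len(train_dataloader)
schedule = [
    global_step(epoch, step, steps_per_epoch)
    for epoch in range(2)
    for step in range(steps_per_epoch)
]
print(schedule)
```

In the training loop this would mean passing global_step(epoch, step, len(train_dataloader)) as the step argument of writer.add_scalar and writer.add_text.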
Evaluation
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
[More Information Needed]
Results
[More Information Needed]
Summary
Model Examination [optional]
[More Information Needed]
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Technical Specifications [optional]
Model Architecture and Objective
[More Information Needed]
Compute Infrastructure
[More Information Needed]
Hardware
[More Information Needed]
Software
[More Information Needed]
Citation [optional]
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
Glossary [optional]
[More Information Needed]
More Information [optional]
[More Information Needed]
Model Card Authors [optional]
[More Information Needed]
Model Card Contact
[More Information Needed]