Web register classification (English model)
A web register classifier for texts in English, fine-tuned from XLM-RoBERTa-large. The model is trained on the Corpus of Online Registers of English (CORE) to classify documents according to the CORE taxonomy. It is designed to support the development of open language models and to help linguists analyze register variation.
For a multilingual CORE classifier, see here.
Model Details
Model Description
- Developed by: TurkuNLP
- Funded by: The Research Council of Finland, Emil Aaltonen Foundation, University of Turku
- Shared by: TurkuNLP
- Model type: Language model
- Language(s) (NLP): English
- License: apache-2.0
- Finetuned from model: FacebookAI/xlm-roberta-large
Model Sources
- Repository: Coming soon!
- Paper: Coming soon!
Register labels and their abbreviations
Below is a list of the register labels predicted by the model. Note that some labels are hierarchical: when a sublabel is predicted, its parent label is also predicted (for example, a news report is labeled both ne and its parent NA). For a more detailed description of the label scheme, see here.
The main labels are uppercase. To restrict predictions to these main labels, filter the model's output to keep only the uppercase labels (see the sketch after this list).
- LY: Lyrical
- SP: Spoken
  - it: Interview
- ID: Interactive discussion
- NA: Narrative
  - ne: News report
  - sr: Sports report
  - nb: Narrative blog
- HI: How-to or instructions
  - re: Recipe
- IN: Informational description
  - en: Encyclopedia article
  - ra: Research article
  - dtp: Description of a thing or person
  - fi: Frequently asked questions
  - lt: Legal terms and conditions
- OP: Opinion
  - rv: Review
  - ob: Opinion blog
  - rs: Denominational religious blog or sermon
  - av: Advice
- IP: Informational persuasion
  - ds: Description with intent to sell
  - ed: News & opinion blog or editorial
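A minimal sketch of this main-label filtering, reusing the `model`, `probabilities`, and `threshold` objects defined in the quickstart example below:

```python
# Sketch: keep only main (uppercase) register labels. Assumes `model`,
# `probabilities`, and `threshold` as defined in the quickstart below.
id2label = model.config.id2label
predicted_main_labels = [
    label
    for idx, label in id2label.items()
    if label.isupper() and probabilities[idx] > threshold
]
print("Predicted main labels:", predicted_main_labels)
```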
How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "TurkuNLP/web-register-classification-en"

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Text to be categorized
text = "A text to be categorized"

# Tokenize text
inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Apply sigmoid to the logits to get probabilities (multi-label setup:
# each label is scored independently)
probabilities = torch.sigmoid(outputs.logits).squeeze()

# Threshold for predicting labels (0.40 was optimal on the test set;
# see Evaluation below)
threshold = 0.5
predicted_label_indices = (probabilities > threshold).nonzero(as_tuple=True)[0]

# Extract readable labels using id2label
id2label = model.config.id2label
predicted_labels = [id2label[idx.item()] for idx in predicted_label_indices]
print("Predicted labels:", predicted_labels)
```
Training Details
Training Data
The model was trained using the Multilingual CORE Corpora, which will be published soon.
Training Procedure
Training Hyperparameters
- Batch size: 8
- Epochs: 9
- Learning rate: 0.00003
- Precision: bfloat16 (non-mixed precision)
- TF32: Enabled
- Seed: 42
- Max sequence length: 512 tokens
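For illustration, the sketch below expresses these hyperparameters as transformers TrainingArguments. The actual training script is not yet published, so this mapping is an assumption, not the released code:

```python
from transformers import TrainingArguments

# Illustrative only: the reported hyperparameters expressed as
# TrainingArguments. Note that bf16=True enables bfloat16 *mixed*
# precision; the card reports non-mixed bfloat16, which may instead
# correspond to loading the model weights directly in torch.bfloat16.
training_args = TrainingArguments(
    output_dir="web-register-classification-en",  # hypothetical path
    per_device_train_batch_size=8,
    num_train_epochs=9,
    learning_rate=3e-5,
    bf16=True,
    tf32=True,
    seed=42,
)
```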
Inference time
Average inference time for a single example, measured over 1000 iterations on a single NVIDIA A100 GPU with a batch size of one, is 17 ms. With larger batches, inference can be considerably faster (see the sketch below).
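A sketch of batched inference, reusing `model`, `tokenizer`, and `device` from the quickstart above:

```python
# Sketch: scoring several documents in one forward pass. Padding to the
# longest sequence in the batch amortizes GPU overhead across examples.
texts = ["First document ...", "Second document ...", "Third document ..."]
inputs = tokenizer(texts, return_tensors="pt", padding=True,
                   truncation=True, max_length=512).to(device)
with torch.no_grad():
    batch_probabilities = torch.sigmoid(model(**inputs).logits)  # (batch, num_labels)
```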
Evaluation
Micro-averaged F1 scores and optimized prediction thresholds (test set):
| Language | F1 (all labels) | F1 (main labels) | Threshold |
|---|---|---|---|
| English | 0.74 | 0.76 | 0.40 |
Technical Specifications
Compute Infrastructure
- Mahti supercomputer (CSC - IT Center for Science, Finland)
- 1 x NVIDIA A100-SXM4-40GB
Software
- torch 2.2.1
- transformers 4.39.3
Citation
If you use this model, please cite the following publication:
@misc{henriksson2024untanglingunrestrictedwebautomatic,
title={Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers},
author={Erik Henriksson and Amanda Myntti and Anni Eskelinen and Selcen Erten-Johansson and Saara Hellström and Veronika Laippala},
year={2024},
eprint={2406.19892},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.19892},
}
Earlier related work includes the following:
@article{Laippala.etal2022,
title = {Register Identification from the Unrestricted Open {{Web}} Using the {{Corpus}} of {{Online Registers}} of {{English}}},
author = {Laippala, Veronika and R{\"o}nnqvist, Samuel and Oinonen, Miika and Kyr{\"o}l{\"a}inen, Aki-Juhani and Salmela, Anna and Biber, Douglas and Egbert, Jesse and Pyysalo, Sampo},
year = {2022},
journal = {Language Resources and Evaluation},
issn = {1574-0218},
doi = {10.1007/s10579-022-09624-1},
url = {https://doi.org/10.1007/s10579-022-09624-1},
}
@article{Skantsi_Laippala_2023,
title = {Analyzing the unrestricted web: The {{Finnish}} corpus of online registers},
doi = {10.1017/S0332586523000021},
journal = {Nordic Journal of Linguistics},
author = {Skantsi, Valtteri and Laippala, Veronika},
year = {2023},
pages = {1–31}
}
Model Card Contact
Erik Henriksson, Hugging Face username: erikhenriksson