metadata
license: mit
language: en
tags:
  - bart
  - cloze
  - distractor
  - generation
datasets:
  - dgen
widget:
  - text: The only known planet with large amounts of water is <mask>. </s> earth
  - text: The products of photosynthesis are glucose and <mask> else. </s> oxygen

cdgp-csg-bart-dgen

Model description

This model is a Candidate Set Generator in "CDGP: Automatic Cloze Distractor Generation based on Pre-trained Language Model", Findings of EMNLP 2022.

Its input is a stem and an answer, and its output is a candidate set of distractors. It is fine-tuned on the DGen dataset, starting from the facebook/bart-base model.
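
The widget examples in the metadata suggest the input layout: the stem with a single `<mask>` in place of the blank, then `</s>`, then the answer. A minimal sketch of assembling such a string follows; the helper name `build_csg_input` is hypothetical and not part of the released code, and the single-mask check is an assumption based on those examples.

```python
MASK_TOKEN = "<mask>"  # BART mask token, as used in the widget examples
SEP_TOKEN = "</s>"     # separator between stem and answer in the widget examples


def build_csg_input(stem: str, answer: str) -> str:
    """Join a masked stem and its answer into one input string.

    Hypothetical helper: the layout is inferred from the widget examples.
    """
    if stem.count(MASK_TOKEN) != 1:
        raise ValueError("stem must contain exactly one <mask> token")
    return f"{stem.strip()} {SEP_TOKEN} {answer.strip()}"


sent = build_csg_input(
    "The only known planet with large amounts of water is <mask>.",
    "earth",
)
print(sent)
```

The resulting string can be passed directly to the unmasker described under "How to use?".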

For more details, you can see our paper or GitHub.

How to use?

  1. Download the model with Hugging Face transformers.
from transformers import BartTokenizer, BartForConditionalGeneration, pipeline

tokenizer = BartTokenizer.from_pretrained("AndyChiang/cdgp-csg-bart-dgen")
csg_model = BartForConditionalGeneration.from_pretrained("AndyChiang/cdgp-csg-bart-dgen")
  2. Create an unmasker.
# top_k=10 returns the ten highest-scoring fillers for the <mask> position
unmasker = pipeline("fill-mask", tokenizer=tokenizer, model=csg_model, top_k=10)
  3. Use the unmasker to generate the candidate set of distractors.
# Input format: stem containing <mask>, then </s>, then the answer
sent = "The only known planet with large amounts of water is <mask>. </s> earth"
cs = unmasker(sent)
print(cs)
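
For an input with a single `<mask>`, the fill-mask pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str`, and `sequence`. The sketch below turns such a list into plain candidate words and drops the answer itself. The helper `extract_candidates` is hypothetical, and the sample list uses made-up scores and token ids for illustration; it is not real model output.

```python
def extract_candidates(fill_mask_output, answer):
    """Return candidate words, highest score first, excluding the answer.

    Hypothetical post-processing helper; assumes a single-mask input,
    so fill_mask_output is a flat list of dicts.
    """
    answer_key = answer.strip().lower()
    ranked = sorted(fill_mask_output, key=lambda d: d["score"], reverse=True)
    candidates, seen = [], set()
    for item in ranked:
        word = item["token_str"].strip()
        key = word.lower()
        # Skip empty strings, the answer itself, and case-insensitive repeats
        if not word or key == answer_key or key in seen:
            continue
        seen.add(key)
        candidates.append(word)
    return candidates


# Illustrative stand-in for `cs`; scores and token ids are invented
sample_cs = [
    {"score": 0.40, "token": 0, "token_str": " earth", "sequence": "..."},
    {"score": 0.30, "token": 1, "token_str": " Mars", "sequence": "..."},
    {"score": 0.20, "token": 2, "token_str": " mars", "sequence": "..."},
    {"score": 0.10, "token": 3, "token_str": " Venus", "sequence": "..."},
]
candidates = extract_candidates(sample_cs, "earth")
print(candidates)
```

With the real pipeline output, `extract_candidates(cs, "earth")` would be called the same way.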

Dataset

This model is fine-tuned on the DGen dataset, which covers multiple domains including science, vocabulary, common sense and trivia. It is compiled from a wide variety of datasets including SciQ, MCQL, AI2 Science Questions, etc. The details of the DGen dataset are shown below.

| DGen dataset | Train | Valid | Test | Total |
| --- | --- | --- | --- | --- |
| Number of questions | 2321 | 300 | 259 | 2880 |

You can also use the dataset we have already cleaned.

Training

We fine-tune the model with a special method called "Answer-Relating Fine-Tune". More details are in our paper.

Training hyperparameters

The following hyperparameters were used during training:

  • Pre-trained language model: facebook/bart-base
  • Optimizer: Adam
  • Learning rate: 0.0001
  • Max length of input: 64
  • Batch size: 64
  • Epochs: 1
  • Device: NVIDIA® Tesla T4 in Google Colab
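
For reuse in a script, the hyperparameters listed above can be collected in one place. The sketch below does that with a plain dataclass; the class name `CsgTrainingConfig` and the helper `optimizer_steps` are hypothetical and do not come from the released training code. The step estimate assumes one training example per training question (2321) and that the last partial batch is kept; if the fine-tuning procedure expands each question into several examples, the real count will be higher.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CsgTrainingConfig:
    """Hyperparameters copied from the list above (hypothetical container)."""
    pretrained_model: str = "facebook/bart-base"
    optimizer: str = "adam"
    learning_rate: float = 1e-4
    max_input_length: int = 64
    batch_size: int = 64
    epochs: int = 1


def optimizer_steps(num_examples: int, config: CsgTrainingConfig) -> int:
    """Estimate optimizer steps, keeping the final partial batch."""
    return math.ceil(num_examples / config.batch_size) * config.epochs


config = CsgTrainingConfig()
# 2321 is the number of training questions in DGen; treating each question
# as exactly one training example is an assumption made for illustration
steps = optimizer_steps(2321, config)
print(steps)
```

Under these assumptions a single epoch is 37 optimizer steps, which is consistent with training on a single Colab GPU.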

Testing

The evaluation results of this model as the Candidate Set Generator in CDGP are as follows:

| P@1 | F1@3 | MRR | NDCG@10 |
| --- | --- | --- | --- |
| 8.49 | 8.24 | 16.01 | 22.66 |
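
To make the metric names concrete, here is a minimal sketch of the textbook definitions of P@k, F1@k, reciprocal rank (averaged over questions to give MRR), and NDCG@k with binary relevance, applied to one ranked candidate list. This is an illustration only: the evaluation script used for the numbers above may differ in details such as normalization or tie handling, and the example words are invented.

```python
import math


def precision_at_k(ranked, gold, k):
    """Fraction of the top-k candidates that are gold distractors."""
    return sum(1 for c in ranked[:k] if c in gold) / k


def recall_at_k(ranked, gold, k):
    """Fraction of gold distractors found in the top-k candidates."""
    if not gold:
        return 0.0
    return sum(1 for c in ranked[:k] if c in gold) / len(gold)


def f1_at_k(ranked, gold, k):
    """Harmonic mean of precision@k and recall@k."""
    p, r = precision_at_k(ranked, gold, k), recall_at_k(ranked, gold, k)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


def reciprocal_rank(ranked, gold):
    """1 / rank of the first gold distractor, or 0 if none is ranked."""
    for rank, c in enumerate(ranked, start=1):
        if c in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, gold, k):
    """NDCG@k with binary relevance (1 if the candidate is gold, else 0)."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, c in enumerate(ranked[:k], start=1) if c in gold)
    idcg = sum(1.0 / math.log2(i + 1)
               for i in range(1, min(len(gold), k) + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Invented example: a ranked candidate set and the gold distractors
ranked = ["mars", "venus", "jupiter", "saturn"]
gold = {"venus", "saturn", "mercury"}
print(precision_at_k(ranked, gold, 1), f1_at_k(ranked, gold, 3),
      reciprocal_rank(ranked, gold), ndcg_at_k(ranked, gold, 10))
```

Averaging each function over all test questions gives corpus-level scores.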

Other models

Candidate Set Generator

Distractor Selector

fastText: cdgp-ds-fasttext

Citation

None