---
license: mit
language: en
tags:
- bert
- cloze
- distractor
- generation
datasets:
- cloth
widget:
- text: "I feel [MASK] now. [SEP] happy"
- text: "The old man was waiting for a ride across the [MASK]. [SEP] river"
---

# cdgp-csg-bert-cloth

## Model description

This model is a Candidate Set Generator in **"CDGP: Automatic Cloze Distractor Generation based on Pre-trained Language Model", Findings of EMNLP 2022**.

Its inputs are a cloze stem and its answer, and its output is a candidate set of distractors. It is fine-tuned on the [**CLOTH**](https://www.cs.cmu.edu/~glai1/data/cloth/) dataset, starting from the [**bert-base-uncased**](https://huggingface.co/bert-base-uncased) model.

For more details, you can see our **paper** or [**GitHub**](https://github.com/AndyChiangSH/CDGP).

## How to use

1. Download the model with Hugging Face Transformers.
```python
from transformers import BertTokenizer, BertForMaskedLM, pipeline

tokenizer = BertTokenizer.from_pretrained("AndyChiang/cdgp-csg-bert-cloth")
csg_model = BertForMaskedLM.from_pretrained("AndyChiang/cdgp-csg-bert-cloth")
```

2. Create an unmasker.
```python
unmasker = pipeline("fill-mask", tokenizer=tokenizer, model=csg_model, top_k=10)
```

3. Use the unmasker to generate the candidate set of distractors.
```python
sent = "I feel [MASK] now. [SEP] happy"
cs = unmasker(sent)
print(cs)
```
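
The `fill-mask` pipeline returns one dictionary per predicted token. A minimal sketch of collecting the predicted words as the candidate set (in the full CDGP pipeline, the final distractors are then picked by the Distractor Selector listed below):

```python
# Each entry in the pipeline output has "token_str" (the predicted word) and "score".
candidates = [c["token_str"] for c in cs]
print(candidates)  # 10 candidate words for the blank, given the answer "happy"
```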

## Dataset

This model is fine-tuned on the [CLOTH](https://www.cs.cmu.edu/~glai1/data/cloth/) dataset, a collection of nearly 100,000 cloze questions from middle school and high school English exams. The statistics of the CLOTH dataset are shown below.

| Number of questions | Train | Valid | Test  |
| ------------------- | ----- | ----- | ----- |
| Middle school       | 22056 | 3273  | 3198  |
| High school         | 54794 | 7794  | 8318  |
| Total               | 76850 | 11067 | 11516 |

You can also use the [dataset](https://huggingface.co/datasets/AndyChiang/cloth) we have already cleaned.
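
If you want to load the cleaned dataset directly, here is a minimal sketch with the `datasets` library (the split and column names are whatever the dataset card defines, so inspect them first):

```python
from datasets import load_dataset

# Load the cleaned CLOTH dataset from the Hugging Face Hub.
cloth = load_dataset("AndyChiang/cloth")

# Inspect the available splits and columns before building training examples.
print(cloth)
```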

## Training

We fine-tune the model with a method called **"Answer-Relating Fine-Tune"**. More details can be found in our paper.

### Training hyperparameters

The following hyperparameters were used during training:

- Pre-train language model: [bert-base-uncased](https://huggingface.co/bert-base-uncased)
- Optimizer: Adam
- Learning rate: 0.0001
- Max length of input: 64
- Batch size: 64
- Epoch: 1
- Device: NVIDIA® Tesla T4 in Google Colab 
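
The exact "Answer-Relating Fine-Tune" procedure is described in the paper; the sketch below only shows how the learning rate and max input length above could be wired into a standard masked-LM training step. It assumes the training text follows the same `stem [SEP] answer` format used at inference and that the target word at the `[MASK]` position is a gold distractor from CLOTH (both are assumptions, not the paper's exact recipe):

```python
from torch.optim import Adam
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
optimizer = Adam(model.parameters(), lr=1e-4)  # learning rate from the list above

# Hypothetical single training example: the blank is masked and the model is
# trained to predict a distractor ("sad") given the stem and the answer ("happy").
masked = "I feel [MASK] now. [SEP] happy"
target = "I feel sad now. [SEP] happy"  # hypothetical gold distractor

inputs = tokenizer(masked, max_length=64, padding="max_length",
                   truncation=True, return_tensors="pt")
labels = tokenizer(target, max_length=64, padding="max_length",
                   truncation=True, return_tensors="pt")["input_ids"]
labels[inputs["input_ids"] != tokenizer.mask_token_id] = -100  # score only the masked position

model.train()
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```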

## Testing

The evaluation of this model as a Candidate Set Generator in CDGP is as follows:

| P@1   | F1@3  | F1@10 | MRR   | NDCG@10 |
| ----- | ----- | ----- | ----- | ------- |
| 18.50 | 13.80 | 15.37 | 29.96 | 37.82   |

## Other models

### Candidate Set Generator

| Models      | CLOTH                                                                               | DGen                                                                             |
| ----------- | ----------------------------------------------------------------------------------- | -------------------------------------------------------------------------------- |
| **BERT**    | [*cdgp-csg-bert-cloth*](https://huggingface.co/AndyChiang/cdgp-csg-bert-cloth)        | [cdgp-csg-bert-dgen](https://huggingface.co/AndyChiang/cdgp-csg-bert-dgen)       |
| **SciBERT** | [cdgp-csg-scibert-cloth](https://huggingface.co/AndyChiang/cdgp-csg-scibert-cloth)  | [cdgp-csg-scibert-dgen](https://huggingface.co/AndyChiang/cdgp-csg-scibert-dgen) |
| **RoBERTa** | [cdgp-csg-roberta-cloth](https://huggingface.co/AndyChiang/cdgp-csg-roberta-cloth) | [cdgp-csg-roberta-dgen](https://huggingface.co/AndyChiang/cdgp-csg-roberta-dgen) |
| **BART**    | [cdgp-csg-bart-cloth](https://huggingface.co/AndyChiang/cdgp-csg-bart-cloth)        | [cdgp-csg-bart-dgen](https://huggingface.co/AndyChiang/cdgp-csg-bart-dgen)       |

### Distractor Selector

**fastText**: [cdgp-ds-fasttext](https://huggingface.co/AndyChiang/cdgp-ds-fasttext)


## Citation

None