---
language: de
datasets:
- Short-Answer-Feedback/saf_legal_domain_german
tags:
- generated_from_trainer
widget:
- text: "Antwort: Wird sich nicht an die Auflagen gehalten (unzureichende Eigenbemühung), droht eine Sperrzeit von 1-2 Wochen. Dadurch wird für die genannte zeit keine Leistung gezahlt, die Anspruchsdauer vermindert sich insgesamt. Bei wichtigen Gründen wird die Sperrzeit nicht verordnet. Lösung: Merkblatt 1 für Arbeitslose, S. 22: Erbringen Sie die Pflichten im Zusammenhang mit den Eigenbemühungen nicht, nicht rechtzeitig oder nicht vollständig, tritt eine Sperrzeit (0,75 p) ein. Merkblatt 1 für Arbeitslose, S. 55: Die Dauer einer Sperrzeit bei unzureichenden Eigenbemühungen beträgt zwei Wochen. (0,25 p). Frage: Mit welcher Folge und welcher Dauer müssen Sie rechnen, wenn Sie Ihre notwendigen Eigenbemühungen nicht rechtzeitig oder nicht vollständig erfüllen?"
---

# mbart-score-finetuned-saf-legal-domain

This model is a fine-tuned version of [facebook/mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) on the [saf_legal_domain_german](https://huggingface.co/datasets/Short-Answer-Feedback/saf_legal_domain_german) dataset for Short Answer Feedback (SAF).

## Model description

This model was built on top of [mBART](https://arxiv.org/abs/2001.08210), which is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages.

It expects inputs in the following format:
```
Antwort: [answer] Lösung: [reference_answer] Frage: [question]
```

In the template above, `[answer]`, `[reference_answer]` and `[question]` should be replaced by the provided answer, the reference answer and the question to which they refer, respectively.
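
The template can be filled in with a small helper. The sketch below is illustrative only: the function name `build_input` and the whitespace normalization are assumptions, not part of the original training code.

```python
def build_input(answer: str, reference_answer: str, question: str) -> str:
    """Join the three fields in the order given by the input template."""
    # Assumption: collapsing runs of whitespace keeps stray line breaks
    # out of the input; the original pre-processing may differ.
    parts = [" ".join(s.split()) for s in (answer, reference_answer, question)]
    return "Antwort: {} Lösung: {} Frage: {}".format(*parts)


# Hypothetical example values, not taken from the dataset.
example = build_input(
    answer="Eine Sperrzeit von zwei Wochen.",
    reference_answer="Es tritt eine Sperrzeit von zwei Wochen ein.",
    question="Mit welcher Folge müssen Sie rechnen?",
)
print(example)
```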


The outputs are formatted as follows:
```
[score] Feedback: [feedback]
```

Here, `[score]` is a numeric value between 0 and 1, and `[feedback]` is the textual feedback that the model generates for the given answer.
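
Because the score and the feedback arrive in a single string, downstream code usually needs to split them. The following is a minimal sketch, assuming the output always follows the pattern shown above; `parse_output` and the regular expression are hypothetical helpers, and a generated string that deviates from the pattern yields `None`.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: a number, the literal "Feedback:", then free text.
OUTPUT_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+Feedback:\s*(.*)$", re.DOTALL)


def parse_output(text: str) -> Optional[Tuple[float, str]]:
    """Return (score, feedback), or None if the text does not match."""
    match = OUTPUT_PATTERN.match(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).strip()


# Hypothetical model output used only for illustration.
parsed = parse_output("0.75 Feedback: Die Sperrzeit beträgt zwei Wochen.")
print(parsed)
```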

## Intended uses & limitations

This model is intended for Short Answer Feedback generation in the domain of German social law. It is therefore not expected to perform particularly well on questions and answers outside this scope.

The model underperforms when it is given a question that was not seen during training. In particular, it tends to classify most answers as correct and does not provide relevant feedback in such cases. This limitation could be partially overcome by extending the dataset with the desired question (and associated answers) and fine-tuning the model for a few epochs on the new data.

## Training and evaluation data

As mentioned previously, the model was trained on the [saf_legal_domain_german](https://huggingface.co/datasets/Short-Answer-Feedback/saf_legal_domain_german) dataset, which is divided into the following splits.

| Split                 | Number of examples |
| --------------------- | ------------------ |
| train                 | 1596               |
| validation            | 400                |
| test_unseen_answers   | 221                |
| test_unseen_questions | 275                |

Evaluation was performed on the `test_unseen_answers` and `test_unseen_questions` splits.

## Training procedure

The [Trainer API](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainer) was used to fine-tune the model. The code used for pre-processing and training was mostly adapted from the [summarization script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization) made available by Hugging Face.

Training was completed in a little over 1 hour on a GPU on Google Colab.

### Training hyperparameters

The following hyperparameters were used during training:
- num_epochs: 9
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- learning_rate: 6e-05
- lr_scheduler_type: linear
- train_batch_size: 1
- gradient_accumulation_steps: 4
- eval_batch_size: 4
- mixed_precision_training: Native AMP
- seed: 42
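
As a rough illustration of what these settings imply, the sketch below derives the effective batch size and the number of optimizer updates from the values listed above and the size of the train split. It assumes training on a single device; the variable names are illustrative and do not come from the training script.

```python
# Values copied from the hyperparameter list and the split table in this card.
train_examples = 1596
train_batch_size = 1
gradient_accumulation_steps = 4
num_epochs = 9

# Assumption: a single device, so there is no extra factor for device count.
effective_batch_size = train_batch_size * gradient_accumulation_steps

# One optimizer update happens after every `gradient_accumulation_steps` batches.
batches_per_epoch = train_examples // train_batch_size
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_epochs

print(effective_batch_size, updates_per_epoch, total_updates)
```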

### Framework versions

- Transformers 4.26.0
- Pytorch 1.13.1+cu116
- Datasets 2.9.0
- Tokenizers 0.13.2

## Evaluation results

The generated feedback was evaluated with the [SacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu), [ROUGE-2](https://huggingface.co/spaces/evaluate-metric/rouge), [METEOR](https://huggingface.co/spaces/evaluate-metric/meteor) and [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore) metrics from Hugging Face. The predicted scores were compared against the gold-standard scores using the [Root Mean Squared Error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error) (RMSE) computed with scikit-learn.

The following results were achieved.

| Split                 | SacreBLEU | ROUGE-2 | METEOR | BERTScore | RMSE  |
| --------------------- | :-------: | :-----: | :----: | :-------: | :---: |
| test_unseen_answers   | 39.4      | 42.3    | 54.3   | 52.6      | 0.190 |
| test_unseen_questions | 2.8       | 5.0     | 17.9   | 10.7      | 0.317 |

The script used to compute these metrics and perform evaluation can be found in the `evaluation.py` file in this repository.
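
To make the RMSE column concrete, the sketch below computes the metric by hand for a handful of made-up scores. It is not the code from `evaluation.py`, which relies on scikit-learn; the `rmse` helper and the sample values are assumptions for illustration only.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], gold: Sequence[float]) -> float:
    """Root mean squared error between predicted and gold scores."""
    if len(predicted) != len(gold) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - g) ** 2 for p, g in zip(predicted, gold)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical scores, not taken from the test splits.
predicted_scores = [0.75, 1.0, 0.0, 0.5]
gold_scores = [1.0, 1.0, 0.25, 0.5]
print(round(rmse(predicted_scores, gold_scores), 3))
```

A lower value means the predicted scores lie closer to the gold-standard scores, which is why the larger RMSE on `test_unseen_questions` indicates weaker scoring on unseen questions.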

## Usage

The example below shows how the model can be used to generate feedback for a given answer.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('Short-Answer-Feedback/mbart-score-finetuned-saf-legal-domain')
tokenizer = AutoTokenizer.from_pretrained('Short-Answer-Feedback/mbart-score-finetuned-saf-legal-domain')

example_input = 'Antwort: Wird sich nicht an die Auflagen gehalten (unzureichende Eigenbemühung), droht eine Sperrzeit von 1-2 Wochen. Dadurch wird für die genannte zeit keine Leistung gezahlt, die Anspruchsdauer vermindert sich insgesamt. Bei wichtigen Gründen wird die Sperrzeit nicht verordnet. Lösung: Merkblatt 1 für Arbeitslose, S. 22: Erbringen Sie die Pflichten im Zusammenhang mit den Eigenbemühungen nicht, nicht rechtzeitig oder nicht vollständig, tritt eine Sperrzeit (0,75 p) ein. Merkblatt 1 für Arbeitslose, S. 55: Die Dauer einer Sperrzeit bei unzureichenden Eigenbemühungen beträgt zwei Wochen. (0,25 p). Frage: Mit welcher Folge und welcher Dauer müssen Sie rechnen, wenn Sie Ihre notwendigen Eigenbemühungen nicht rechtzeitig oder nicht vollständig erfüllen?'
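# Tokenize the input, padding or truncating it to 256 tokens.
inputs = tokenizer(example_input, max_length=256, padding='max_length', truncation=True, return_tensors='pt')

# Generate an output sequence with a maximum length of 128 tokens.
generated_tokens = model.generate(
    inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    max_length=128,
)
# Decode the generated token IDs back into text and display the result.
output = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(output)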
```

The output produced by the model then looks as follows:

```
0.75 Feedback: Es ist richtig, dass Sie mit einer Sperrzeit rechnen müssen, in der Sie keine Leistung bekommen. Die gesetzlich vorgesehene Sperrzeit bei unzureichenden Eigenbemühungen beträgt jedoch zwei Wochen.
```