---
license: cc-by-sa-4.0
language:
- en
pipeline_tag: text-classification
tags:
- transformers
- negation
- evaluation
- metric
datasets:
- tum-nlp/cannot-dataset
---
# Model Card for NegBLEURT

NegBLEURT is a negation-aware version of the BLEURT metric for evaluating generated text. It scores a candidate sentence against a reference and, unlike standard BLEURT, is trained to penalize candidates that negate the meaning of the reference (see the example below).

### Direct Use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "tum-nlp/NegBLEURT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

references = ["Ray Charles is legendary.", "Ray Charles is legendary."]
candidates = ["Ray Charles is a legend.", "Ray Charles isn’t legendary."]

# Each (reference, candidate) pair is scored jointly; the regression logit is the NegBLEURT score.
tokenized = tokenizer(references, candidates, return_tensors='pt', padding=True)
print(model(**tokenized).logits)
# returns scores 0.8409 and 0.2601 for the two candidates
```
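
For larger evaluation sets, scoring in batches avoids padding every pair to the longest sentence in the corpus. A minimal sketch (the `score_pairs` helper and its batch size are our own, assuming PyTorch and the checkpoint above):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "tum-nlp/NegBLEURT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def score_pairs(references, candidates, batch_size=32):
    """Return one NegBLEURT score (float) per (reference, candidate) pair."""
    scores = []
    with torch.no_grad():
        for i in range(0, len(references), batch_size):
            batch = tokenizer(
                references[i:i + batch_size],
                candidates[i:i + batch_size],
                return_tensors="pt",
                padding=True,
                truncation=True,
            )
            # Single regression head -> logits of shape (batch, 1); flatten to floats.
            scores.extend(model(**batch).logits.squeeze(-1).tolist())
    return scores
```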

### Use with pipeline
```python
from transformers import pipeline

pipe = pipeline("text-classification", model="tum-nlp/NegBLEURT", function_to_apply="none") # set function_to_apply="none" for regression output!
pairwise_input = [
  [["Ray Charles is legendary.", "Ray Charles is a legend."]],
  [["Ray Charles is legendary.", "Ray Charles isn’t legendary."]]
]
print(pipe(pairwise_input))
# returns [{'label': 'NegBLEURT_score', 'score': 0.8408917784690857}, {'label': 'NegBLEURT_score', 'score': 0.26007288694381714}]
```
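
The extra nesting marks each inner `[reference, candidate]` list as a single sentence pair rather than a batch of two independent inputs; in recent transformers versions, dictionaries of the form `{"text": reference, "text_pair": candidate}` should work as a more explicit equivalent.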

## Training Details

The model is a fine-tuned version of the [bleurt-tiny](https://github.com/google-research/bleurt/tree/master/bleurt/test_checkpoint) checkpoint from the official BLEURT repository.
It was fine-tuned on the CANNOT dataset's train split for 500 steps using the [fine-tuning script](https://github.com/google-research/bleurt/blob/master/bleurt/finetune.py) provided by BLEURT.
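
For reference, a sketch of such a fine-tuning run, with flag names as documented in the BLEURT repository and placeholder file paths (BLEURT expects JSONL records with `candidate`, `reference`, and `score` fields):

```bash
# Placeholder paths; flag names follow the BLEURT repository's fine-tuning docs.
python -m bleurt.finetune \
  -init_bleurt_checkpoint=bleurt/test_checkpoint \
  -model_dir=negbleurt_checkpoint \
  -train_set=cannot_train.jsonl \
  -dev_set=cannot_dev.jsonl \
  -num_train_steps=500
```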



## Citation

Please cite our [INLG 2023 paper](https://arxiv.org/abs/2307.13989) if you use our model.
**BibTeX:**
```bibtex
@misc{anschütz2023correct,
      title={This is not correct! Negation-aware Evaluation of Language Generation Systems}, 
      author={Miriam Anschütz and Diego Miguel Lozano and Georg Groh},
      year={2023},
      eprint={2307.13989},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```