---
datasets:
- allenai/qasper
license: apache-2.0
widget:
- text: "Here is the the abstract for a scientific paper:\n<paste abstract here>\nWhat would be some questions that the paper could answer?\n"
---

# Model Card for TinyLlama-abs2qa
[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
<!-- Provide a quick summary of what the model is/does. -->

This model was an experiment to see if I could get a model to generate useful questions from a scientific paper's abstract. The answer was yes!

## Model Details

The base model is TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T. Thanks to the TinyLlama devs for training and releasing it!

As such, it has a context size of 4096 tokens.

The training data was a modified form of the QASPER train split, containing 1169 examples of abstracts paired with suitable questions for NLP papers.

### Model Description

I modified the QASPER dataset a little for this training. The original pairs each abstract with a set of questions and their answers.
For this test I only wanted to see whether I could generate questions from abstracts, so I extracted just those two parts and formatted them as an Alpaca-style instruction:

    {"instruction":"Here is the the abstract for a scientific paper:
      It has been shown that word embeddings derived from large corpora 
      tend to incorporate biases present in their training data. Various 
      methods for mitigating these biases have been proposed, but recent 
      work has demonstrated that these methods hide but fail to truly 
      remove the biases, which can still be observed in word 
      nearest-neighbor statistics. In this work we propose a probabilistic
      view of word embedding bias. We leverage this framework to present a 
      novel method for mitigating bias which relies on probabilistic 
      observations to yield a more robust bias mitigation algorithm. 
      We demonstrate that this method effectively reduces bias according 
      to three separate measures of bias while maintaining embedding quality 
      across various popular benchmark semantic tasks
    What would be some questions that the paper could answer?",
    "output":"How is embedding quality assessed?
      What are the three measures of bias which are reduced in experiments?
      What are the probabilistic observations which contribute to the more robust algorithm?"}
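
For reference, here is a minimal sketch of how that transformation could be reproduced with the `datasets` library. It assumes the Hugging Face `allenai/qasper` train split exposes `abstract` and `qas["question"]` fields (check the dataset card for the exact schema); the prompt string, including the doubled "the", mirrors the training prompt above.

```python
# Sketch only: rebuild abstract -> questions instruction pairs from QASPER.
# Assumes allenai/qasper exposes "abstract" and a "qas" struct whose
# "question" entry is a list of question strings; verify against the dataset card.
import json

from datasets import load_dataset

# The doubled "the" deliberately mirrors the prompt used in training.
PROMPT = (
    "Here is the the abstract for a scientific paper:\n{abstract}\n"
    "What would be some questions that the paper could answer?\n"
)

qasper = load_dataset("allenai/qasper", split="train")

records = []
for paper in qasper:
    questions = paper["qas"]["question"]
    if not questions:
        continue
    records.append(
        {
            "instruction": PROMPT.format(abstract=paper["abstract"]),
            "output": "\n".join(questions),
        }
    )

# Write JSONL that axolotl can consume as an alpaca-style dataset.
with open("qasper_abs2qa.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```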

I'm not sure how critical the exact instruction phrasing is, but with the prompt worded as in training,
this tiny model actually does a pretty good job on totally unseen NLP abstracts.

Training this model with [axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) took only 3 minutes on an A100.
Wrangling the environment to get axolotl working took considerably longer; if you can, I highly recommend using their Docker image.


- **Developed by:** Andrew Green
- **Model type:** Llama 2 architecture, 1.1B parameters
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
I intend to use this model or a derivative of it to screen papers for inclusion in literature summarisation tools in the future. 

Another thing I want to try is using this model to augment QASPER for other fields.

Since it is so fast to train, I think it will also be a useful testbed for trying out some other techniques like DPO and SPIN that I want to learn.

### Direct Use

Directly using this model should be possible, though you would need to test how sensitive it is to slightly different prompting styles. I also expect it
to generate ad infinitum, because I didn't use a chat template; fixing that is on my to-do list and should be quick enough.

From a few quick tests, the generated questions look at least plausible, though they may have questionable utility in the real world.
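
As a starting point, here is a minimal inference sketch with `transformers`. The repository id is a placeholder, the prompt is formatted exactly as in training, and `max_new_tokens` acts as a crude cap on the run-on generation mentioned above.

```python
# Sketch only: generate candidate questions from an unseen abstract.
# "your-username/TinyLlama-abs2qa" is a placeholder repository id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/TinyLlama-abs2qa"  # placeholder, replace with the real repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

abstract = "It has been shown that word embeddings derived from large corpora ..."

# The doubled "the" mirrors the prompt used in training.
prompt = (
    "Here is the the abstract for a scientific paper:\n"
    f"{abstract}\n"
    "What would be some questions that the paper could answer?\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Strip the prompt tokens and print only the generated questions.
generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```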

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
The model was finetuned on NLP scientific articles and on questions about those articles written by NLP experts. As such, it is quite likely the model
will not work well in other fields. In my limited testing, however, it does seem to generalise OK.

The same risks of misuse and malicious use apply as for any LLM, but in particular this model can generate questions from
an abstract, which could lead to it being misused in academia (e.g. to partially automate peer review). I think this would violate most publishers' terms.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->
This model is based on TinyLlama, which is a foundation model, so all of the same risks of out-of-scope use apply here.

The model is biased towards NLP abstracts, because those are contained in the QASPER dataset on which it is trained.

This is a very small model, so it is likely to be quite limited in its reasoning capabilities, which may lead to nonsense or irrelevant questions being generated.

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.