---
language:
- en
license: apache-2.0
library_name: transformers
base_model: gpt2
tags:
- law
- legal
- australia
- generated_from_trainer
datasets:
- umarbutler/open-australian-legal-corpus
widget:
- text: "Section 51 of the Constitution provides "
  example_title: "Text completion"
- text: "# Question\nWhat is a restraint of trade?\n\nAnswer: "
  example_title: "Question answering"
inference:
  parameters:
    temperature: 0
    seed: 42
    best_of: 4
---

# Open Australian Legal GPT2 ⚖️

Open Australian Legal GPT2 is the first open source language model trained on Australian law.

Naturally, as a finetune of [GPT2](https://huggingface.co/gpt2), the model may be used for any of the tasks for which [GPT2](https://huggingface.co/gpt2) is suitable, including text generation, text completion and question answering.

Trained on 37,560 laws and regulations, comprising 635,482,112 tokens, taken from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), the model is intended specifically to be finetuned for downstream natural language processing tasks applied to the Australian legal domain.

To ensure its accessibility to as wide an audience as possible, the model is issued under the [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).

Those interested in learning more about the model are encouraged to read Umar Butler's accompanying article, [How I built the first open LLM for Australian law](https://umarbutler.com/how-i-built-the-first-open-llm-for-australian-law/).

## Usage 👩‍💻
The code snippet below demonstrates just one of the many ways in which the model may be accessed:
```python
>>> from transformers import pipeline, set_seed

>>> set_seed(42) # We set a seed for reproducibility.
>>> generator = pipeline('text-generation', model='umarbutler/open-australian-legal-gpt2')
>>> generator('Under the Crimes Act 1914')
[{'generated_text': 'Under the Crimes Act 1914, a person who is liable to a payment of a benefit under the Act is also liable to pay'}]
```
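
For finer control over generation, the model may also be loaded directly via `transformers`' auto classes. The snippet below is a minimal sketch of that approach; the sampling parameters are illustrative rather than recommendations from the model's author:
```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> tokeniser = AutoTokenizer.from_pretrained('umarbutler/open-australian-legal-gpt2')
>>> model = AutoModelForCausalLM.from_pretrained('umarbutler/open-australian-legal-gpt2')
>>> inputs = tokeniser('Under the Crimes Act 1914', return_tensors='pt')
>>> outputs = model.generate(**inputs, max_new_tokens=24, do_sample=True, top_k=50) # Illustrative sampling parameters.
>>> tokeniser.decode(outputs[0], skip_special_tokens=True)
```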

## Creation 🧪
37,560 documents were sampled from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) by filtering for primary and secondary legislation that, when stripped of whitespace, was not empty. Those documents were then randomly shuffled and packed into blocks 1,024 tokens long, with GPT2's end-of-sequence token ('<|endoftext|>') serving both as a delimiter between documents and as padding at the end of the final block, resulting in a training dataset of 620,588 blocks, or 635,482,112 tokens.
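
As a rough illustration, the packing procedure might be sketched as follows, with `documents` standing in for the filtered and shuffled corpus texts (the author's actual preprocessing code is not reproduced in this card):
```python
from transformers import AutoTokenizer

tokeniser = AutoTokenizer.from_pretrained('gpt2')
BLOCK_SIZE = 1024

# `documents` stands in for the filtered, shuffled legislation texts.
documents = ['An Act relating to the criminal law...', 'An Act to amend the Corporations Act...']

# Concatenate every document into one token stream, delimited by '<|endoftext|>'.
ids = []
for document in documents:
    ids.extend(tokeniser(document)['input_ids'])
    ids.append(tokeniser.eos_token_id)

# Pad the tail with '<|endoftext|>' so the final block is also 1,024 tokens long.
if len(ids) % BLOCK_SIZE:
    ids.extend([tokeniser.eos_token_id] * (BLOCK_SIZE - len(ids) % BLOCK_SIZE))

# Split the stream into fixed-length training blocks.
blocks = [ids[i : i + BLOCK_SIZE] for i in range(0, len(ids), BLOCK_SIZE)]
```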

The training dataset was subsequently fed to [GPT2](https://huggingface.co/gpt2) via [`transformers.Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer) with the following hyperparameters:

| Hyperparameter | Value |
| --- | --- |
| Sequence length | 1,024 |
| Epochs | 3 |
| Optimiser | AdamW |
| Learning rate | 1e-5 |
| Learning rate scheduler | Linear with warmup |
| Batch size per device | 4 |
| Weight decay | 0.01 |
| Warmup ratio | 0.06 |
| Gradient accumulation steps | 1 |
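
Mapped onto [`transformers.TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments), those hyperparameters would look roughly like the sketch below; the exact arguments used to train the model are not published in this card, so treat the mapping as indicative only:
```python
from transformers import TrainingArguments

# Indicative mapping of the hyperparameter table onto TrainingArguments.
# AdamW is the Trainer's default optimiser, so it needs no explicit setting.
training_args = TrainingArguments(
    output_dir='open-australian-legal-gpt2',
    num_train_epochs=3,
    learning_rate=1e-5,
    lr_scheduler_type='linear',  # Linear decay with warmup.
    warmup_ratio=0.06,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    weight_decay=0.01,
)
```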

After training for 3 epochs, or 465,441 steps, over a period of ~25 hours on two GeForce RTX 4090s, the model achieved a loss of XX.

## Limitations 🚧
Although the model has not been tested for bias, one would expect it to exhibit many, if not all, of the biases of [GPT2](https://huggingface.co/gpt2).

One might also expect the model to exhibit a bias towards the type of language employed in legislation and regulations (its source materials) as well as towards Commonwealth law (the largest source of legislation in the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) at the time of the model's creation).

Finally, it is worth noting that the model may lack knowledge of Victorian, Northern Territory and Australian Capital Territory law, as licensing restrictions prevented their inclusion in the training data.

## Licence 📜
The model is issued under the [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).

## Citation 🔖
If you've relied on the model for your work, please cite:
```bibtex
@misc{butler-2023-open-australian-legal-gpt2,
    author = {Butler, Umar},
    year = {2023},
    title = {Open Australian Legal GPT2},
    publisher = {Hugging Face},
    version = {1.0.0},
    url = {https://huggingface.co/umarbutler/open-australian-legal-gpt2}
}
```

## Acknowledgements 🙏
In the spirit of reconciliation, the author acknowledges the Traditional Custodians of Country throughout Australia and their connections to land, sea and community. He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.

The author thanks the sources of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) for making their data available under open licences.

The author also acknowledges the developers of the many Python libraries relied upon in the training of the model, as well as the makers of [GPT2](https://huggingface.co/gpt2), which the model was built atop.

Finally, the author is eternally grateful for the endless support of his wife and her willingness to put up with many a late night spent writing code and quashing bugs.