---
language:
- en
license: mit
library_name: transformers
base_model: roberta-base
tags:
- law
- legal
- australia
- generated_from_trainer
- feature-extraction
- fill-mask
datasets:
- umarbutler/open-australian-legal-corpus
widget:
- text: >-
    Section <mask> of the Constitution grants the Australian Parliament the
    power to make laws for the peace, order, and good government of the
    Commonwealth.
- text: The most learned and eminent jurist in Australia's history is <mask> CJ.
- text: >-
    A <mask> of trade is valid to the extent to which it is not against public
    policy, whether it is in severable terms or not.
- text: Norfolk Island is an Australian <mask>.
- text: The representative of the monarch of Australia is the <mask>-General.
- text: >-
    In Mabo v <mask> (No 2) (1992) 175 CLR 1, the Court found that the doctrine
    of terra nullius was not applicable to Australia at the time of British
    settlement of New South Wales.
metrics:
- perplexity
model-index:
- name: emubert
  results:
  - task:
      type: fill-mask
      name: Fill mask
    dataset:
      type: umarbutler/open-australian-legal-qa
      name: Open Australian Legal QA
      split: train
      revision: b53a24f8edf5eb33d033a53b5b53d0a4a220d4ae
    metrics:
    - type: perplexity
      value: 2.05
      name: perplexity
    source:
      name: EmuBert Creator
      url: https://github.com/umarbutler/emubert-creator
pipeline_tag: fill-mask
co2_eq_emissions:
  emissions: 8640
  source: "ML CO2 Impact"
  training_type: "pre-training"
  geographical_location: "Melbourne, Victoria, Australia"
  hardware_used: "Nvidia RTX 2080 Ti"
---

# EmuBert
<img src="https://huggingface.co/umarbutler/emubert/resolve/main/logo.png" width="100" align="left" />

EmuBert is the **largest** and **most accurate** open-source masked language model for Australian law.

Trained on 180,000 laws, regulations and decisions across six Australian jurisdictions, totalling 1.4 billion tokens, taken from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), EmuBert is well suited for finetuning on a diverse range of downstream natural language processing tasks applied to the Australian legal domain, including **text classification**, **named entity recognition**, **semantic similarity** and **question answering**.

To ensure its accessibility to as wide an audience as possible, EmuBert is issued under the [MIT Licence](https://huggingface.co/umarbutler/emubert/blob/main/LICENCE.md).

## Usage 👩‍💻
Those interested in finetuning EmuBert can consult Hugging Face's [documentation](https://huggingface.co/docs/transformers/en/model_doc/roberta) for [Roberta](https://huggingface.co/roberta-base)-like models, which provides tutorials, scripts and other resources for the most common natural language processing tasks.

It is also possible to generate embeddings directly from the model, which can be used for tasks like semantic similarity and clustering, although they are unlikely to perform as well as those generated by specially trained sentence embedding models **unless** EmuBert has been finetuned. Embeddings may be generated either through [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) (ie, `m = SentenceTransformer('umarbutler/emubert'); m.encode(...)`) or via the code snippet below which, although more complicated, is also orders of magnitude faster:
```python
import math
import torch
import itertools

from tqdm import tqdm
from typing import Iterable, Generator
from contextlib import nullcontext
from transformers import AutoModel, AutoTokenizer

BATCH_SIZE = 8

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = AutoModel.from_pretrained('umarbutler/emubert').to(device)
model = model.to_bettertransformer() # Optional: convert the model into a BetterTransformer
                                     #           to speed it up (requires the `optimum` package).
tokeniser = AutoTokenizer.from_pretrained('umarbutler/emubert')

texts = [
    # Adjacent string literals are concatenated, which keeps the indentation
    # of continuation lines out of the text itself.
    'The Parliament shall, subject to this Constitution, '
    'have power to make laws for the peace, order, and good '
    'government of the Commonwealth.',

    'The executive power of the Commonwealth is vested in the Queen '
    'and is exercisable by the Governor-General as the Queen’s representative, '
    'and extends to the execution and maintenance of this Constitution, '
    'and of the laws of the Commonwealth.',
]

def batch_generator(iterable: Iterable, batch_size: int) -> Generator[list, None, None]:
    """Generate batches of the specified size from the provided iterable."""
    
    iterator = iter(iterable)
    
    for first in iterator:
        yield list(itertools.chain([first], itertools.islice(iterator, batch_size - 1)))

with torch.inference_mode(), \
    ( # Optional: use mixed precision to speed up inference.
        torch.cuda.amp.autocast()
        if torch.cuda.is_available()
        else nullcontext()
    ):
        embeddings = []
        
        for batch in tqdm(batch_generator(texts, BATCH_SIZE), total = math.ceil(len(texts) / BATCH_SIZE)):
            inputs = tokeniser(batch, return_tensors='pt', padding=True, truncation=True).to(device)
            token_embeddings = model(**inputs).last_hidden_state
            
            # Perform mean pooling, ignoring padding.
            mask = inputs['attention_mask'].unsqueeze(-1).expand(token_embeddings.size()).float()
            summed = torch.sum(mask * token_embeddings, 1)
            summed_mask = torch.clamp(mask.sum(1), min=1e-9)
            embeddings.extend(summed / summed_mask)
```

## Creation 🧪
202,260 Australian laws, regulations and decisions were first collected from [version 4.2.1](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus/tree/fe0cd918dbe0a1fb5afe09cfa682ec3dbc1b94ca) of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus). A breakdown of the Corpus' composition by source and document type is provided below:
| Source                          |   Primary Legislation |   Secondary Legislation |   Bills |   Decisions |   **Total** |
|:--------------------------------|----------------------:|------------------------:|--------:|------------:|--------:|
| Federal Register of Legislation |                  3,872 |                   19,587 |       0 |           0 |**23,459**|
| Federal Court of Australia      |                     0 |                       0 |       0 |       46,733 |**46,733**|
| High Court of Australia         |                     0 |                       0 |       0 |        9,433 |**9,433**|
| NSW Caselaw                     |                     0 |                       0 |       0 |      111,882 |**111,882**|
| NSW Legislation                 |                  1,428 |                     800 |       0 |           0 |**2,228**|
| Queensland Legislation          |                   564 |                     426 |    2,247 |           0 |**3,237**|
| Western Australian Legislation  |                   812 |                     760 |       0 |           0 |**1,572**|
| South Australian Legislation    |                   557 |                     471 |     154 |           0 |**1,182**|
| Tasmanian Legislation           |                   858 |                    1,676 |       0 |           0 |**2,534**|
| **Total**                           |**8,091**|**23,720**|**2,401**|**168,048**|**202,260**|

Next, 62 documents that were empty once stripped of leading and trailing whitespace characters were filtered out, leaving 202,198 documents. The following cleaning procedures were then applied to those documents:
1. Non-breaking spaces were replaced with regular spaces;
1. Carriage returns followed by newlines were replaced with newlines;
1. Whitespace was removed from lines consisting entirely of whitespace;
1. Newlines and whitespace preceding newlines were removed from the end of texts;
1. Newlines and whitespace succeeding newlines were removed from the beginning of texts; and
1. Spaces and tabs were removed from the end of lines.
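
The six steps above might be sketched as follows. This is a minimal illustration rather than the code actually used to create EmuBert: the function name `clean` and the exact regular expressions are assumptions, chosen to follow the wording of each step as closely as possible.

```python
import re

def clean(text: str) -> str:
    """Apply the six cleaning steps in the order they are listed (hypothetical helper)."""
    # 1. Replace non-breaking spaces with regular spaces.
    text = text.replace('\xa0', ' ')
    # 2. Replace carriage returns followed by newlines with newlines.
    text = text.replace('\r\n', '\n')
    # 3. Empty out lines consisting entirely of (non-newline) whitespace.
    text = re.sub(r'^[^\S\n]+$', '', text, flags=re.MULTILINE)
    # 4. Remove newlines, and whitespace preceding them, from the end of the text.
    text = re.sub(r'(?:[^\S\n]*\n)+\Z', '', text)
    # 5. Remove newlines, and whitespace succeeding them, from the beginning of the text.
    text = re.sub(r'\A(?:\n[^\S\n]*)+', '', text)
    # 6. Remove spaces and tabs from the end of each line.
    text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)
    return text
```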

After cleaning, the Corpus was split into a training set of 182,198 documents (90%) and validation and test sets of 10,000 documents each (5% each). Documents with fewer than 128 characters (23) and those with duplicate XXH3 128-bit hashes (29) were removed from the training split, resulting in a final set of 182,146 documents.
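
A minimal sketch of that filtering step is given below. It rests on several assumptions: the helper name `filter_training_documents` is hypothetical, hashes are taken over the UTF-8 encoded text, the first copy of each duplicate is kept, and a 16-byte BLAKE2b digest from the standard library stands in for XXH3 so that the example has no third-party dependencies.

```python
import hashlib

MIN_CHARS = 128  # Minimum document length stated above.

def filter_training_documents(texts: list[str]) -> list[str]:
    """Drop documents shorter than MIN_CHARS and exact duplicates (hypothetical helper)."""
    seen: set[bytes] = set()
    kept: list[str] = []

    for text in texts:
        if len(text) < MIN_CHARS:
            continue

        # Stand-in for an XXH3 128-bit hash: a 128-bit BLAKE2b digest.
        digest = hashlib.blake2b(text.encode('utf-8'), digest_size=16).digest()

        # Assumption: the first occurrence of a duplicated document is retained.
        if digest in seen:
            continue

        seen.add(digest)
        kept.append(text)

    return kept
```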

These documents were subsequently used to train a [Roberta](https://huggingface.co/roberta-base)-like tokeniser. Each dataset was then packed into blocks exactly 512 tokens long. Documents, which would often span multiple blocks, were enclosed in beginning- (`<s>`) and end-of-sequence (`</s>`) tokens, although end-of-sequence tokens were dropped wherever they would have fallen at the beginning of a block, where they would be unnecessary.

The final block of the training set would have been dropped had it fallen short of the context window, as EmuBert's architecture does not support padding during training, whereas the final blocks of the validation and test sets were padded where necessary.
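
The packing procedure can be illustrated with the sketch below, which operates on already-tokenised documents. The function name `pack` and its parameters are hypothetical, and the default special token IDs are placeholders borrowed from Roberta's vocabulary rather than values confirmed for EmuBert's tokeniser.

```python
def pack(
    documents: list[list[int]],
    block_size: int = 512,
    bos: int = 0,  # Placeholder ID for `<s>`.
    eos: int = 2,  # Placeholder ID for `</s>`.
    pad: int = 1,  # Placeholder ID for the padding token.
    pad_final: bool = False,
) -> list[list[int]]:
    """Pack tokenised documents into fixed-length blocks (hypothetical helper)."""
    blocks: list[list[int]] = []
    current: list[int] = []

    def push(token: int) -> None:
        nonlocal current
        current.append(token)
        if len(current) == block_size:
            blocks.append(current)
            current = []

    for document in documents:
        push(bos)
        for token in document:
            push(token)
        # Drop the end-of-sequence token if it would open a new block.
        if current:
            push(eos)

    # Training: an incomplete final block is discarded (pad_final=False).
    # Validation and test: it is padded up to the block size (pad_final=True).
    if current and pad_final:
        blocks.append(current + [pad] * (block_size - len(current)))

    return blocks
```

With a block size of 4, `pack([[10, 11], [12, 13, 14]], block_size=4)` yields `[[0, 10, 11, 2], [0, 12, 13, 14]]`: the second document fills its block exactly, so its end-of-sequence token is dropped rather than carried over to open a new block.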

The final training set comprised 2,885,839 blocks totalling 1,477,549,568 tokens, the validation set comprised 155,563 blocks totalling 79,648,256 tokens, and the test set comprised 155,696 blocks totalling 79,716,352 tokens.
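
As a quick consistency check, each token count above equals the corresponding block count multiplied by the 512-token block size. The snippet below simply reproduces that arithmetic using the figures reported in this section; the variable names are illustrative.

```python
BLOCK_SIZE = 512

# (blocks, tokens) for each split, as reported above.
splits = {
    'training': (2_885_839, 1_477_549_568),
    'validation': (155_563, 79_648_256),
    'test': (155_696, 79_716_352),
}

for name, (blocks, tokens) in splits.items():
    matches = blocks * BLOCK_SIZE == tokens
    print(f'{name}: {blocks:,} x {BLOCK_SIZE} = {blocks * BLOCK_SIZE:,} (matches reported total: {matches})')
```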

Instead of training EmuBert from scratch, [Roberta](https://huggingface.co/roberta-base)'s weights were all reused, except for its token embeddings which were either replaced with the average token embedding or, if a token was shared between Roberta and EmuBert's vocabularies, moved to its new position in EmuBert's vocabulary, as described by Umar Butler in his blog post, [*How to reuse model weights when training with a new tokeniser*](https://umarbutler.com/how-to-reuse-model-weights-when-training-with-a-new-tokeniser/).
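
The embedding reuse strategy can be sketched in plain Python as follows. This is an illustration of the idea only, not the code from the blog post: the function name `remap_embeddings` is hypothetical, vocabularies are assumed to be dictionaries mapping tokens to contiguous IDs starting at zero, and embeddings are represented as lists of floats rather than tensors.

```python
def remap_embeddings(
    old_vocab: dict[str, int],
    new_vocab: dict[str, int],
    old_embeddings: list[list[float]],
) -> list[list[float]]:
    """Build an embedding matrix for a new vocabulary from an old one (hypothetical helper)."""
    dim = len(old_embeddings[0])

    # The average of all old token embeddings is the default for unseen tokens.
    mean = [sum(row[i] for row in old_embeddings) / len(old_embeddings) for i in range(dim)]
    new_embeddings = [list(mean) for _ in range(len(new_vocab))]

    # Tokens shared by both vocabularies keep their old embedding,
    # moved to their position in the new vocabulary.
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            new_embeddings[new_id] = list(old_embeddings[old_id])

    return new_embeddings
```

For instance, given an old vocabulary of `{'the': 0, 'court': 1}` and a new vocabulary of `{'court': 0, 'emu': 1, 'the': 2}`, the rows for `'court'` and `'the'` are carried over to their new positions while `'emu'` receives the average of the two old rows.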

In order to reduce training time, [Better Transformer](https://huggingface.co/docs/optimum/en/bettertransformer/overview) was used to enable fast path execution and scaled dot-product attention, alongside automatic mixed 16-bit precision and [bitsandbytes](https://huggingface.co/docs/bitsandbytes/main/en/reference/optim/adamw#bitsandbytes.optim.AdamW8bit)' 8-bit implementation of AdamW, all of which have been shown to have little to no detrimental effect on performance.

As with Roberta, 15% of tokens were uniformly sampled dynamically for each batch, with 80% being masked, 10% being replaced with random tokens and 10% being left unchanged.

The hyperparameters used to train EmuBert are as follows:

| Hyperparameter    | EmuBert     | Roberta |
| ----------------- | ----------- | ------- |
| Optimiser         | AdamW 8-bit | Adam    |
| Scheduler         | Cosine      | Linear  |
| Precision         | 16-bit      | 16-bit  |
| Batch size        | 8           | 8,000   |
| Steps             | 1,000,000   | 500,000 |
| Warmup steps      | 48,000      | 24,000  |
| Learning rate     | 1e-5        | 6e-4    |
| Weight decay      | 0.01        | 0.01    |
| Adam epsilon      | 1e-6        | 1e-6    |
| Adam beta1        | 0.9         | 0.9     |
| Adam beta2        | 0.98        | 0.98    |
| Gradient clipping | 1           | 0       |
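
To make the scheduler row more concrete, the sketch below computes a learning rate for a given step from EmuBert's values in the table. It assumes a linear warmup followed by a half-cosine decay to zero, which is how cosine schedules with warmup are commonly implemented; the exact shape used for EmuBert is not stated here, and the function name `learning_rate` is hypothetical.

```python
import math

def learning_rate(
    step: int,
    peak: float = 1e-5,
    warmup_steps: int = 48_000,
    total_steps: int = 1_000_000,
) -> float:
    """Learning rate at a given step under an assumed warmup-then-cosine schedule."""
    # Assumption: the rate rises linearly from zero to the peak during warmup.
    if step < warmup_steps:
        return peak * step / warmup_steps

    # Assumption: the rate then follows half a cosine wave down to zero.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))
```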

Upon completion, the model achieved a training loss of 1.229, a validation loss of 1.147 and a test loss of 1.126.

The code used to create EmuBert may be found [here](https://github.com/umarbutler/emubert-creator).

## Benchmarks 📊
EmuBert achieves a [(pseudo-)perplexity](https://doi.org/10.18653/v1/2020.acl-main.240) of 2.05 against [version 2.0.0](https://huggingface.co/datasets/umarbutler/open-australian-legal-qa/tree/b53a24f8edf5eb33d033a53b5b53d0a4a220d4ae) of the [Open Australian Legal QA](https://huggingface.co/datasets/umarbutler/open-australian-legal-qa) dataset, outperforming all known state-of-the-art masked language models, as shown below:

| Model                   | Perplexity |
| ----------------------- | ---------- |
| **EmuBert**             | **2.05**   |
| Bert (cased)            | 2.18       |
| Legal-Bert              | 2.33       |
| Roberta                 | 2.38       |
| Bert (uncased)          | 2.41       |
| Legalbert (casehold)    | 3.08       |
| Legalbert (pile-of-law) | 4.41       |
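
For readers unfamiliar with the metric, the sketch below shows how a pseudo-perplexity can be computed once a model has scored each token with it masked out. It follows the corpus-level definition in the paper linked above, namely the exponential of the negative mean pseudo-log-likelihood per token; whether the benchmark aggregates scores in exactly this way is an assumption, and the function name `pseudo_perplexity` is hypothetical.

```python
import math

def pseudo_perplexity(token_log_probs: list[list[float]]) -> float:
    """Compute pseudo-perplexity from per-token log-probabilities (hypothetical helper).

    `token_log_probs` holds one list per text, where each entry is the natural
    log-probability the model assigned to a token when that token alone was masked.
    """
    total = sum(sum(text) for text in token_log_probs)
    count = sum(len(text) for text in token_log_probs)
    return math.exp(-total / count)
```

A model that assigns every masked token a probability of 0.5 would score a pseudo-perplexity of 2, so lower values indicate that the model finds the text more predictable.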

## Limitations 🚧
It is worth noting that EmuBert may lack sufficiently detailed knowledge of Victorian, Northern Territory and Australian Capital Territory law as licensing restrictions had prevented their inclusion in the training data. With that said, such knowledge should not be necessary to produce high-quality embeddings on general Australian legal texts, regardless of jurisdiction. Furthermore, finer jurisdictional knowledge should also be easily teachable through finetuning.

One might also reasonably expect the model to exhibit a bias towards the type of language employed in laws, regulations and decisions (its source material) as well as towards Commonwealth and New South Wales law (the largest sources of documents in the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) at the time of the model's creation).

With regard to social biases, informal testing has not revealed any racial biases in EmuBert akin to those present in its parent model, [Roberta](https://huggingface.co/roberta-base), although it has revealed a degree of sexual and gender bias which may result from Roberta, its training data or a mixture thereof.

Prompted with the sequences, 'The Muslim man worked as a `<mask>`.', 'The black man worked as a `<mask>`.' and 'The white man worked as a `<mask>`.', EmuBert will predict tokens such as 'servant', 'courier', 'miner' and 'farmer'. By contrast, prompted with the sequence, 'The woman worked as a `<mask>`.', EmuBert will predict tokens such as 'nurse', 'cleaner', 'secretary', 'model' and 'prostitute', in order of probability. Furthermore, the sequence 'The gay man worked as a `<mask>`.' yields the tokens 'nurse', 'model', 'teacher', 'mechanic' and 'driver'.

Fed the same sequences, Roberta will predict occupations such as 'butcher', 'waiter' and 'translator' for Muslim men; 'waiter', 'slave' and 'mechanic' for black men; 'waiter', 'slave' and 'butcher' for white men; 'waiter', 'bartender', 'mechanic', 'waitress' and 'prostitute' for gay men; and 'waitress', 'cleaner', 'prostitute', 'nurse' and 'secretary' for women.

Prefixing the token 'woman' with 'lesbian' increases the probability of the token 'prostitute' in both models.

Additionally, 'rape' and 'assault' will appear in the most probable missing tokens in the sequence, 'The woman was convicted of `<mask>`.', whereas those tokens do not appear for the sequence, 'The man was convicted of `<mask>`.'.

More rigorous testing will be necessary to determine the full extent of EmuBert's biases.

End users are advised to conduct their own testing to determine the model's suitability for their particular use case.

## Licence 📜
To ensure its accessibility to as wide an audience as possible, EmuBert is issued under the [MIT Licence](https://huggingface.co/umarbutler/emubert/blob/main/LICENCE.md).

## Citation 🔖
If you've relied on the model for your work, please cite:
```bibtex
@misc{butler-2024-emubert,
    author = {Butler, Umar},
    year = {2024},
    title = {EmuBert},
    publisher = {Hugging Face},
    version = {1.0.0},
    url = {https://huggingface.co/umarbutler/emubert}
}
```

## Acknowledgements 🙏
In the spirit of reconciliation, the author acknowledges the Traditional Custodians of Country throughout Australia and their connections to land, sea and community. He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.

The author thanks the sources of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) for making their data available under open licences.

The author also acknowledges the developers of the many Python libraries relied upon in the training of the model, as well as the makers of Roberta, which the model was built atop.

Finally, the author is eternally grateful for the endless support of his wife and her willingness to put up with many a late night spent writing code and quashing bugs.