---
language:
- en
license: other
license_name: microsoft-research-license
license_link: https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx
library_name: transformers
base_model: microsoft/phi-1_5
tags:
- law
- legal
- australia
- generated_from_trainer
datasets:
- umarbutler/open-australian-legal-corpus
inference: false
metrics:
- perplexity
model-index:
- name: open-australian-legal-llm
  results:
  - task:
      type: text-generation
      name: Text generation
    dataset:
      type: umarbutler/open-australian-legal-qa
      name: Open Australian Legal QA
      split: train
      revision: b53a24f8edf5eb33d033a53b5b53d0a4a220d4ae
    metrics:
    - type: perplexity
      value: 8.693482443009522
      name: Perplexity
    source:
      name: lmppl
      url: https://github.com/asahi417/lmppl
---

⚠️ This model has been superseded by the [Open Australian Legal LLM](https://huggingface.co/umarbutler/open-australian-legal-llm), the largest open source language model trained on Australian law. You are encouraged to use that model instead. ⚠️

# Open Australian Legal Phi-1.5 ⚖️

Open Australian Legal Phi-1.5 is an open source [Phi-1.5](https://huggingface.co/microsoft/phi-1_5) model trained on Australian law.

Naturally, as a finetune of [Phi-1.5](https://huggingface.co/microsoft/phi-1_5), the model may be used for any of the tasks for which [Phi-1.5](https://huggingface.co/microsoft/phi-1_5) is suitable, including text generation, text completion and question answering.

Trained on roughly 45,000 laws, regulations and decisions, comprising 422,373,888 tokens, taken from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), the model is intended specifically to be finetuned for downstream natural language processing tasks applied to the Australian legal domain.

The model is issued under the same licence as its parent model, namely the [Microsoft Research License](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx).

## Usage 👩‍💻

The code snippet below demonstrates just one of the many ways in which the model may be accessed:

```python
>>> from transformers import set_seed, AutoModelForCausalLM, AutoTokenizer, pipeline

>>> set_seed(42) # We set a seed for reproducibility.

>>> model = AutoModelForCausalLM.from_pretrained('umarbutler/open-australian-legal-phi-1_5', trust_remote_code=True) # `trust_remote_code=True` is required to load Phi-1.5.
>>> tokenizer = AutoTokenizer.from_pretrained('umarbutler/open-australian-legal-phi-1_5')
>>> generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
>>> generator('Section 51 of the Constitution provides', max_length=24)
[{'generated_text': 'Section 51 of the Constitution provides that the Parliament may make laws for the peace, order and good government of the Commonwealth.'}]
```

## Creation 🧪

50,000 laws, regulations and decisions were randomly sampled from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), excluding duplicate texts and documents that, when stripped of leading and trailing whitespace, were less than 128 characters long. The following cleaning procedures were then applied:
1. Non-breaking spaces were replaced with regular spaces;
1. Carriage returns followed by newlines were replaced with newlines;
1. Whitespace was removed from lines composed entirely of whitespace;
1. Newlines and whitespace preceding newlines were removed from the end of texts;
1. Newlines and whitespace succeeding newlines were removed from the beginning of texts; and
1. Spaces and tabs were removed from the end of lines.

After cleaning, the documents were packed into blocks 512 tokens long, with [Phi-1.5](https://huggingface.co/microsoft/phi-1_5)'s end-of-sequence token ('<|endoftext|>') being used as a delimiter as well as to pad the end of the final block.
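
The cleaning steps and block construction described above can be sketched roughly as follows. This is a minimal illustration and not the author's actual preprocessing code: the function names `clean_text` and `pack_into_blocks` are hypothetical, the regular expressions are one literal reading of the six steps, and the packing assumes that documents are concatenated into a single token stream in which a document may span block boundaries.

```python
import re


def clean_text(text: str) -> str:
    """Apply the six cleaning steps in order (illustrative reading only)."""
    # 1. Replace non-breaking spaces with regular spaces.
    text = text.replace('\xa0', ' ')
    # 2. Replace carriage returns followed by newlines with newlines.
    text = text.replace('\r\n', '\n')
    # 3. Empty out lines that consist entirely of whitespace.
    text = re.sub(r'^[^\S\n]+$', '', text, flags=re.MULTILINE)
    # 4. Remove trailing newlines and any whitespace preceding them.
    text = re.sub(r'(?:[^\S\n]*\n)+\Z', '', text)
    # 5. Remove leading newlines and any whitespace succeeding them.
    text = re.sub(r'\A(?:\n[^\S\n]*)+', '', text)
    # 6. Remove spaces and tabs from the end of every line.
    text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)
    return text


def pack_into_blocks(docs, eos_id, block_size=512):
    """Pack tokenised documents into fixed-length blocks.

    `docs` is a list of token ID lists. The end-of-sequence ID is appended
    after each document as a delimiter and also pads the final block.
    Assumption: documents are allowed to span block boundaries.
    """
    stream = []
    for ids in docs:
        stream.extend(ids)
        stream.append(eos_id)
    remainder = len(stream) % block_size
    if remainder:
        stream.extend([eos_id] * (block_size - remainder))
    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]
```

In practice the token IDs would come from the Phi-1.5 tokenizer and `eos_id` would be the ID it assigns to '<|endoftext|>'; plain integers work equally well for experimenting with the packing logic.
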
These blocks were then randomly shuffled and split into a training dataset of 742,454 blocks and a validation dataset of 82,495 blocks, or 380,136,448 and 42,237,440 tokens, respectively.

The training dataset was subsequently fed to [Phi-1.5](https://huggingface.co/microsoft/phi-1_5) with the following hyperparameters:
| Hyperparameter | Value |
| --- | --- |
| Sequence length | 512 |
| Epochs | 1 |
| Optimiser | AdamW |
| Learning rate | 2e-5 |
| Learning rate scheduler | Linear with warmup |
| Batch size per device | 4 |
| Weight decay | 0.1 |
| Warmup ratio | 0.03 |

After training for 1 epoch, or 185,614 steps, over a period of ~16 hours on a single GeForce RTX 4090, the model achieved a validation loss of 2.21.

## Limitations 🚧

Although the model has not been tested for bias, one would expect it to exhibit many, if not all, of the biases of [Phi-1.5](https://huggingface.co/microsoft/phi-1_5).

One might also expect the model to exhibit a bias towards the type of language employed in laws, regulations and decisions (its source material) as well as towards Commonwealth and New South Wales law (the largest sources of documents in the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) at the time of the model's creation).

Finally, it is worth noting that the model may lack knowledge of Victorian, Northern Territory and Australian Capital Territory law, as licensing restrictions prevented their inclusion in the training data.

## Licence 📜

The model is issued under the same licence as its parent model, namely the [Microsoft Research License](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx).

## Citation 🔖

If you've relied on the model for your work, please cite:
```bibtex
@misc{butler-2023-open-australian-legal-phi-1.5,
    author = {Butler, Umar},
    year = {2023},
    title = {Open Australian Legal Phi-1.5},
    publisher = {Hugging Face},
    version = {1.0.0},
    url = {https://huggingface.co/umarbutler/open-australian-legal-phi-1_5}
}
```

## Acknowledgements 🙏

In the spirit of reconciliation, the author acknowledges the Traditional Custodians of Country throughout Australia and their connections to land, sea and community. He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.

The author thanks the sources of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) for making their data available under open licences.

The author also acknowledges the developers of the many Python libraries relied upon in the training of the model, as well as the makers of [Phi-1.5](https://huggingface.co/microsoft/phi-1_5), which the model was built atop.

Finally, the author is eternally grateful for the endless support of his wife and her willingness to put up with many a late night spent writing code and quashing bugs.