---
license: mit
datasets:
- ymoslem/Law-StackExchange
language:
- en
metrics:
- f1
base_model:
- google/gemma-2-2b
library_name: mlx
tags:
- legal
widget:
  - text: |
      <start_of_turn>user
      ## Instructions
      You are a helpful AI assistant.
      ## User
      How to make scrambled eggs?<end_of_turn>
      <start_of_turn>model
---
# shellzero/gemma2-2b-ft-law-data-tag-generation
This model was converted to MLX format from [`google/gemma-2-2b`](https://huggingface.co/google/gemma-2-2b).
Refer to the [original model card](https://huggingface.co/google/gemma-2-2b) for more details on the model.

Install `mlx-lm` to run the model:

```zsh
pip install mlx-lm
```

The model was LoRA fine-tuned for 1,500 steps with `mlx` on the [ymoslem/Law-StackExchange](https://huggingface.co/datasets/ymoslem/Law-StackExchange) dataset and on synthetic data generated by GPT-4o and GPT-3.5-Turbo, using the prompt format below.

This fine-tune was one of the best runs with our data and achieved a high F1 score on our eval dataset (part of the NVIDIA hackathon).
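For reference, a comparable LoRA run can be launched with the `mlx_lm.lora` CLI. This is a sketch only: the data directory and hyperparameters below are assumptions, not the exact settings of this run (apart from the 1,500 iterations).

```zsh
# Sketch of an mlx_lm LoRA run; ./data is assumed to hold train.jsonl/valid.jsonl
# with examples in the prompt format shown below.
python -m mlx_lm.lora \
  --model google/gemma-2-2b \
  --train \
  --data ./data \
  --iters 1500
```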

```python
def format_prompt(system_prompt: str, title: str, question: str) -> str:
    "Format the question to the format of the dataset we fine-tuned to."
    return """<bos><start_of_turn>user
## Instructions
{}
## User
TITLE:
{}
QUESTION:
{}<end_of_turn>
<start_of_turn>model
""".format(
        system_prompt, title, question
    )
```
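For illustration, calling `format_prompt` with made-up inputs (the title and question below are hypothetical) produces a Gemma-style chat prompt:

```python
prompt = format_prompt(
    "You are a helpful AI assistant.",          # system prompt
    "Is a verbal agreement binding?",           # hypothetical title
    "My landlord agreed to repairs verbally.",  # hypothetical question
)
print(prompt)  # starts with "<bos><start_of_turn>user" and ends with "<start_of_turn>model"
```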

Here's an example of the `system_prompt` from the dataset:
```text
Read the following title and question about a legal issue and assign the most appropriate tag to it. All tags must be in lowercase, ordered lexicographically and separated by commas.
```
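Given that instruction, the model should answer with a single comma-separated line of lowercase tags in lexicographic order; the output below is a hypothetical example, not taken from the eval set:

```text
contract-law, landlord-tenant, verbal-agreement
```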
## Loading the model using `mlx_lm`

```python
from mlx_lm import generate, load

model, tokenizer = load("shellzero/gemma2-2b-ft-law-data-tag-generation")

# Hypothetical example inputs; replace with your own title and question.
system_prompt = (
    "Read the following title and question about a legal issue and assign "
    "the most appropriate tag to it. All tags must be in lowercase, ordered "
    "lexicographically and separated by commas."
)
title = "Is a verbal agreement binding?"
question = "My landlord agreed to repairs verbally but never did them."

response = generate(
    model,
    tokenizer,
    prompt=format_prompt(system_prompt, title, question),
    verbose=True,  # set to True to see the prompt and response
    temp=0.0,
    max_tokens=32,
)
```
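The response is a single comma-separated string, so a small post-processing step recovers a list of tags (a sketch; `response` is the string returned by `generate` above):

```python
# Split the comma-separated tags and normalize them back into a sorted list.
tags = sorted({t.strip() for t in response.split(",") if t.strip()})
print(tags)  # e.g. ['landlord-tenant', 'verbal-agreement'] (hypothetical)
```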