---

thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
license: gemma
language:
- ja
- en
tags:
- gemma2
- conversational
base_model:
- google/gemma-2-2b
- google/gemma-2-2b-it
- rinna/gemma-2-baku-2b
base_model_relation: merge
pipeline_tag: text-generation
library_name: transformers

---

[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)


# QuantFactory/gemma-2-baku-2b-it-GGUF
This is a quantized version of [rinna/gemma-2-baku-2b-it](https://huggingface.co/rinna/gemma-2-baku-2b-it), created using llama.cpp.

# Original Model Card



# `Gemma 2 Baku 2B Instruct (rinna/gemma-2-baku-2b-it)`

![rinna-icon](./rinna.png)

# Overview

The model is an instruction-tuned variant of [rinna/gemma-2-baku-2b](https://huggingface.co/rinna/gemma-2-baku-2b), utilizing Chat Vector and Odds Ratio Preference Optimization (ORPO) for fine-tuning. It adheres to the gemma-2 chat format.

| Size | Continual Pre-Training | Instruction-Tuning |
| :-   | :-                     | :-                 |
| 2B   | Gemma 2 Baku 2B [[HF]](https://huggingface.co/rinna/gemma-2-baku-2b) | Gemma 2 Baku 2B Instruct [[HF]](https://huggingface.co/rinna/gemma-2-baku-2b-it) |

* **Model architecture**

    A 26-layer, 2304-hidden-size transformer-based language model. Please refer to the [Gemma 2 Model Card](https://www.kaggle.com/models/google/gemma-2/) for detailed information on the model's architecture.

* **Training**

    **Model merging.** The base model was endowed with instruction-following capabilities through a chat vector addition process. The chat vector was derived by subtracting the parameter vectors of [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b) from [google/gemma-2-2b-it](https://huggingface.co/google/gemma-2-2b-it), as follows:

    ~~~~text
    rinna/gemma-2-baku-2b + 1.0 * (google/gemma-2-2b-it - google/gemma-2-2b)
    ~~~~

    During this process, the embedding layer was excluded from the subtraction and addition of the parameter vectors.
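
    Below is a minimal sketch of this chat-vector addition using plain `transformers` state dicts. It illustrates the arithmetic above and is not rinna's actual merging script; it assumes enough CPU memory to hold all three bfloat16 checkpoints at once.

    ~~~~python
    import torch
    from transformers import AutoModelForCausalLM

    # Load the continual-pretraining base and the two Gemma 2 endpoints.
    base = AutoModelForCausalLM.from_pretrained("rinna/gemma-2-baku-2b", torch_dtype=torch.bfloat16)
    inst = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it", torch_dtype=torch.bfloat16)
    plain = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b", torch_dtype=torch.bfloat16)

    inst_sd = inst.state_dict()
    plain_sd = plain.state_dict()

    with torch.no_grad():
        for name, param in base.state_dict().items():
            # The embedding layer (and its weight-tied lm_head) is excluded
            # from the chat-vector arithmetic, as described above.
            if "embed_tokens" in name or "lm_head" in name:
                continue
            param += 1.0 * (inst_sd[name] - plain_sd[name])

    base.save_pretrained("gemma-2-baku-2b-chat-vector")
    ~~~~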
    
    **ORPO** was then applied using a subset of the following dataset to further refine the performance of the merged model (see the sketch after the list).

    - rinna's internal dataset
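
    As an illustration only, a run of this kind could be set up with `trl`'s `ORPOTrainer` roughly as below. The actual training data is internal and not public, so `preference_pairs.jsonl` is a hypothetical placeholder with the standard `prompt`/`chosen`/`rejected` columns, and the hyperparameters shown are not rinna's.

    ~~~~python
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import ORPOConfig, ORPOTrainer

    # Start from the chat-vector-merged checkpoint produced above.
    model = AutoModelForCausalLM.from_pretrained("gemma-2-baku-2b-chat-vector")
    tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

    # ORPO trains on preference pairs: a prompt, a chosen response, and a
    # rejected response (placeholder file; the real dataset is internal).
    dataset = load_dataset("json", data_files="preference_pairs.jsonl", split="train")

    trainer = ORPOTrainer(
        model=model,
        args=ORPOConfig(output_dir="gemma-2-baku-2b-it-orpo", beta=0.1),
        train_dataset=dataset,
        tokenizer=tokenizer,
    )
    trainer.train()
    ~~~~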
  
* **Contributors**

    - [Xinqi Chen](https://huggingface.co/Keely0419)
    - [Toshiaki Wakatsuki](https://huggingface.co/t-w)
    - [Kei Sawada](https://huggingface.co/keisawada)

---

# Benchmarking

Please refer to [rinna's LM benchmark page](https://rinnakk.github.io/research/benchmarks/lm/index.html).

---

# How to use the model

~~~~python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "rinna/gemma-2-baku-2b-it"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
    attn_implementation="eager",  # see the note on attention below
)

# Build a prompt in the gemma-2 chat format from the tokenizer's chat template.
chat = [
    { "role": "user", "content": "西田幾多郎とはどんな人物ですか?" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
outputs = model.generate(
    input_ids,
    max_new_tokens=512,
)

# Decode only the newly generated tokens, excluding the prompt.
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
~~~~

It is recommended to use eager attention when conducting batch inference under bfloat16 precision.
Currently, Gemma 2 yields NaN values for padded input sequences when the default attention implementation (torch.nn.functional.scaled_dot_product_attention) is used together with bfloat16.
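
A minimal sketch of such batched inference, reusing the `model` and `tokenizer` from the example above (the second question is an illustrative placeholder). Left padding keeps the generated continuations aligned at the end of each prompt:

~~~~python
# Decoder-only models should be padded on the left for generation.
tokenizer.padding_side = "left"

questions = ["西田幾多郎とはどんな人物ですか?", "紫式部とはどんな人物ですか?"]
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": q}], tokenize=False, add_generation_prompt=True
    )
    for q in questions
]

batch = tokenizer(prompts, return_tensors="pt", padding=True, add_special_tokens=False).to(model.device)
outputs = model.generate(**batch, max_new_tokens=256)

# Strip the (padded) prompt tokens and decode only the generated part.
for row in outputs:
    print(tokenizer.decode(row[batch["input_ids"].shape[-1]:], skip_special_tokens=True))
~~~~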

---

# Tokenization
The model uses the original [google/gemma-2-2b-it](https://huggingface.co/google/gemma-2-2b-it) tokenizer.

---

# How to cite
```bibtex
@misc{rinna-gemma-2-baku-2b-it,
    title = {rinna/gemma-2-baku-2b-it},
    author = {Chen, Xinqi and Wakatsuki, Toshiaki and Sawada, Kei},
    url = {https://huggingface.co/rinna/gemma-2-baku-2b-it}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}
```
---

# References
```bibtex
@article{gemma-2-2024,
    title = {Gemma 2},
    url = {https://www.kaggle.com/models/google/gemma-2},
    publisher = {Kaggle},
    author = {Gemma Team},
    year = {2024}
}

@article{huang2023chat,
    title = {Chat Vector: A Simple Approach to Equip LLMs with Instruction Following and Model Alignment in New Languages},
    author = {Huang, Shih-Cheng and Li, Pin-Zu and Hsu, Yu-Chi and Chen, Kuang-Ming and Lin, Yu Tung and Hsiao, Shih-Kai and Tzong-Han Tsai, Richard and Lee, Hung-yi},
    year = {2023},
    url = {https://arxiv.org/abs/2310.04799}
}

@article{hong2024orpo,
    title = {ORPO: Monolithic Preference Optimization without Reference Model},
    author = {Hong, Jiwoo and Lee, Noah and Thorne, James},
    year = {2024},
    url = {https://arxiv.org/abs/2403.07691}
}
```
---

# License
[Gemma Terms of Use](https://ai.google.dev/gemma/terms)