---
license: apache-2.0
---

base_model: https://huggingface.co/google/gemma-2b

Chinese chat demo of gemma-2b:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63e4a2ce5bbdd8d44b504628/RVxNl9oMDMQ8s2lbjz4wh.png)

Model languages: Chinese and English

The following describes the process used to train a model that supports both Chinese and English, starting from gemma-2b (a language model that only supports English).

Step 1:
Train a SentencePiece (BPE) tokenizer on a Chinese corpus to obtain tokenizer.model and tokenizer.vocab.
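
A minimal sketch of this step; the corpus path, model prefix, and vocabulary size below are illustrative assumptions, not values from this card:

```python
# Sketch only: adjust the corpus path and vocab_size to your data.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="zh_corpus.txt",       # plain-text Chinese corpus, one sentence per line
    model_prefix="chinese_sp",   # produces chinese_sp.model and chinese_sp.vocab
    model_type="bpe",
    vocab_size=32000,            # illustrative size
    character_coverage=0.9995,   # high coverage helps with Chinese characters
)
```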

Step 2:
Merge the Chinese tokenizer.model with the original gemma-2b tokenizer.model.
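
One common way to do the merge is through the protobuf definitions shipped with the `sentencepiece` package; this is a sketch under that assumption, with illustrative file names:

```python
# Sketch only: appends Chinese pieces that are missing from the original vocabulary.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

orig = sp_pb2.ModelProto()
with open("gemma_tokenizer.model", "rb") as f:   # original gemma-2b tokenizer.model
    orig.ParseFromString(f.read())

zh = sp_pb2.ModelProto()
with open("chinese_sp.model", "rb") as f:        # tokenizer trained in step 1
    zh.ParseFromString(f.read())

existing = {p.piece for p in orig.pieces}
for p in zh.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0.0
        orig.pieces.append(new_piece)

with open("tokenizer.model", "wb") as f:         # merged model used in step 3
    f.write(orig.SerializeToString())
```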

Step 3:
Use the merged special_tokens_map.json, tokenizer.model, and tokenizer_config.json to replace the corresponding files of the original model (here gemma-2b).
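
For example (assumed local directory layout; file names follow the steps above):

```python
# Sketch only: copies the merged tokenizer files over the originals in a local
# copy of the base model directory.
import shutil

base_model_dir = "gemma-2b"        # local copy of the original model
merged_dir = "merged_tokenizer"    # directory holding the merged files

for name in ("tokenizer.model", "special_tokens_map.json", "tokenizer_config.json"):
    shutil.copy(f"{merged_dir}/{name}", f"{base_model_dir}/{name}")
```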

Step 4:
Use LLaMA-Factory for continued pre-training. Pay attention to the pre-training parameters: the vocabulary and the embedding matrix must be resized to match the merged tokenizer.
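
The resize can also be done manually with `transformers`, as in this sketch (paths are illustrative):

```python
# Sketch only: resizes the token embeddings (and tied output head) to the merged vocabulary.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gemma-2b")            # dir containing the merged tokenizer files
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

model.resize_token_embeddings(len(tokenizer))                    # rows for new tokens are randomly initialized
model.save_pretrained("gemma-2b-zh-resized")
tokenizer.save_pretrained("gemma-2b-zh-resized")
```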

Step 5:
Starting from the model pre-trained in step 4, perform instruction fine-tuning, which significantly improves the model's ability to understand and follow instructions.

Step 6:
Starting from the instruction-tuned model, run SFT on specific downstream tasks so that the model performs better on those tasks.