license: apache-2.0
base_model: https://huggingface.co/google/gemma-2b
Chinese chat demo based on gemma-2b:
Languages supported by the model: Chinese and English
The steps below describe how gemma-2b (a language model that originally supports only English) was extended into a model that supports both Chinese and English.
step 1: Train a SentencePiece (BPE) tokenizer on a Chinese corpus to obtain tokenizer.model and tokenizer.vocab.
step 2: Merge the Chinese tokenizer.model with the original tokenizer.model.
step 3: Replace the original model's (e.g. gemma-2b) special_tokens_map.json, tokenizer.model, and tokenizer_config.json with the merged versions.
step 4: Use LLaMA-Factory for pre-training. Pay attention to the pre-training parameters: because the merged tokenizer is larger than the original, the vocabulary and the embedding layer must both be resized.
step 5: Fine-tune the model pre-trained in step 4 on instruction data, which significantly improves its ability to understand and follow instructions.
step 6: Starting from the instruction-tuned model, run further SFT on specific tasks so that the model performs better on those tasks.
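The idea behind steps 2 and 4 can be illustrated with a toy sketch in plain Python. It is not the actual pipeline: the real merge operates on SentencePiece model files, and the real resize is handled by the training framework (for example, Hugging Face Transformers exposes `resize_token_embeddings` for this). The helper names `merge_vocab` and `resize_embeddings`, the sample pieces, and the choice to initialize new rows with the mean of the existing rows are all assumptions made for illustration.

```python
def merge_vocab(base_pieces, new_pieces):
    """Append pieces missing from the base vocabulary.

    Existing pieces keep their original IDs (their list index), so the
    pretrained embedding rows still line up with the base tokens.
    """
    merged = list(base_pieces)
    seen = set(base_pieces)
    for piece in new_pieces:
        if piece not in seen:
            merged.append(piece)
            seen.add(piece)
    return merged


def resize_embeddings(embeddings, new_size):
    """Grow an embedding table (a list of rows) to new_size rows.

    New rows are set to the mean of the existing rows. This is one
    possible heuristic (an assumption here), not the source's method.
    """
    if new_size < len(embeddings):
        raise ValueError("this sketch only grows the table")
    dim = len(embeddings[0])
    mean = [sum(row[d] for row in embeddings) / len(embeddings) for d in range(dim)]
    resized = [list(row) for row in embeddings]
    for _ in range(new_size - len(embeddings)):
        resized.append(list(mean))
    return resized


# Hypothetical sample data: a tiny "original" vocabulary and a tiny
# "Chinese" vocabulary that overlaps with it on one piece.
base_pieces = ["<pad>", "<eos>", "<bos>", "<unk>", "▁the", "▁model"]
chinese_pieces = ["▁the", "模型", "训练", "中文"]

merged = merge_vocab(base_pieces, chinese_pieces)

# Toy 2-dimensional embedding table, one row per base piece.
base_embeddings = [[float(i), float(i) + 0.5] for i in range(len(base_pieces))]
embeddings = resize_embeddings(base_embeddings, len(merged))

print(len(base_pieces), "->", len(merged))  # prints "6 -> 9"
```

Keeping the original IDs stable is what lets the pretrained embedding rows be reused unchanged. Only the appended rows have to be learned during the pre-training in step 4.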