Guanaco-leh-V2: A Multilingual Instruction-Following Language Model Based on LLaMA 7B

This model was trained with guanaco-lora, training the LoRA weights together with embed_tokens and lm_head.

The dataset combines alpaca-cleaned and guanaco. With the embed and head layers trained, the model performs better in Chinese and Japanese than the original LLaMA, and because it uses an instruction-based prompt, it is easier to use.

Since this model was trained on the guanaco dataset, you can also use it as a chatbot. Just use this format:

### Instruction:
User: <Message history>
Assistant: <Message history>

### Input:
System: <System response for next message, optional>
User: <Next message>

### Response:
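
As a minimal sketch, the format above could be assembled like this. The helper name `build_prompt`, its arguments, and the trailing newline after `### Response:` are assumptions for illustration, not part of guanaco-lora.

```python
def build_prompt(history, next_message, system=None):
    """Assemble a chat prompt in the format shown above.

    history: list of (user_message, assistant_message) pairs.
    system: optional system text for the next message.
    """
    lines = ["### Instruction:"]
    for user_msg, assistant_msg in history:
        lines.append(f"User: {user_msg}")
        lines.append(f"Assistant: {assistant_msg}")
    lines.append("")
    lines.append("### Input:")
    if system is not None:
        lines.append(f"System: {system}")
    lines.append(f"User: {next_message}")
    lines.append("")
    lines.append("### Response:")
    # Assumption: the model continues on the line after "### Response:".
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_prompt([("Hi", "Hello!")], "How are you?", system="Be brief."))
```

The model's reply is whatever it generates after the final `### Response:` line; append it to the history as the assistant message for the next turn.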

Tip: I removed the first line of the original prompt to reduce token consumption. Please consider removing it as well when you use this model.

Differences from the previous model

The main differences are:

the model is trained in bf16, not 8-bit
context cutoff length increased to 1024
larger dataset (latest guanaco + alpaca-cleaned = about 540k entries)
larger batch size (64 -> 128)

And since the training data contains more chat-based data, this model is better suited to chatbot use.

Try this model:

You can try this model with this colab, or use generate.py from guanaco-lora. All the examples were generated with guanaco-lora.

If you want to use the LoRA model from guanaco-7b-leh-v2-adapter/, remember to turn off load_in_8bit, or manually merge it into the 7B model!

Recommended generation parameters:

temperature: 0.5~0.7
top p: 0.65~1.0
top k: 30~50
repeat penalty: 1.03~1.17
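
A small sketch of how these ranges might be turned into sampling settings. The keyword names follow the Hugging Face transformers `generate` arguments (`temperature`, `top_p`, `top_k`, `repetition_penalty`, `do_sample`); mapping "repeat penalty" to `repetition_penalty` and picking the midpoint of each range are assumptions, not the author's settings.

```python
# Recommended ranges from the list above, as (low, high) pairs.
RECOMMENDED_RANGES = {
    "temperature": (0.5, 0.7),
    "top_p": (0.65, 1.0),
    "top_k": (30, 50),
    "repetition_penalty": (1.03, 1.17),
}


def midpoint_generation_kwargs():
    """Pick the midpoint of each recommended range as a starting point."""
    kwargs = {"do_sample": True}  # sampling must be on for these to matter
    for name, (low, high) in RECOMMENDED_RANGES.items():
        mid = (low + high) / 2
        # top_k must be an integer; the others are floats.
        kwargs[name] = int(mid) if name == "top_k" else round(mid, 3)
    return kwargs


if __name__ == "__main__":
    print(midpoint_generation_kwargs())
```

The resulting dict could be passed as `model.generate(**inputs, **midpoint_generation_kwargs())`; tune within the ranges from there.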

Training Setup

2x3090 with model parallel
batch size = bsz 8 * grad acc 16 = 128
ctx cut off length = 1024
train only on the output (with a loss mask)
group by length enabled
538k entries, 2 epochs (about 8400 steps)
lr 2e-4
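
Two items in the setup above can be sketched in a few lines: the loss mask and the step count. This is an illustration under assumptions, not the guanaco-lora code: the function names are hypothetical, and the mask uses the common convention of setting prompt labels to -100, the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # skipped by PyTorch's cross-entropy loss by default


def mask_prompt_labels(input_ids, prompt_len):
    """Return labels where prompt tokens are ignored and output tokens kept."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


def optimizer_steps(num_entries, epochs, micro_batch, grad_accum):
    """Approximate number of optimizer steps for the whole run."""
    effective_batch = micro_batch * grad_accum  # 8 * 16 = 128
    return num_entries * epochs / effective_batch


if __name__ == "__main__":
    # Toy token ids: the first 3 belong to the prompt, the last 2 to the output.
    print(mask_prompt_labels([11, 12, 13, 21, 22], prompt_len=3))
    # 538k entries, 2 epochs, batch size 128.
    print(optimizer_steps(538_000, 2, 8, 16))
```

With the numbers listed above, the step count comes out to 8406.25, which matches the "about 8400 steps" figure.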

Some Examples

(As you can see, although guanaco can reply fluently, the content is quite confusing, so you may want to add something to the system part.)

I used guanaco with an instruction to translate a Chinese article into JP/DE/EN, then used GPT-4 to score the translations and got this:

Some more information

Why use lora+embed+head

First, I think it is obvious that when an LLM isn't good at some language and you want to fine-tune it for that language, you should train the embed and head parts.
But the question is: "Why not just do a native (full) fine-tune?"
If you have looked into alpaca models or their training, you may have noticed that a lot of them have one problem: memorization.
The loss drops at the beginning of every epoch, which looks like a kind of overfitting.
In my opinion, this is because LLaMA has so many parameters that it simply memorizes all the training data.

But if I use LoRA for the attention part only (ignoring the MLP part), the number of trainable parameters is not large enough to memorize the training data, so the model is much less likely to memorize everything.
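
To give a rough sense of the scale involved, here is a back-of-the-envelope count. The LLaMA 7B dimensions (hidden size 4096, 32 layers) are standard, but the LoRA rank of 8 and the choice of all four attention projections (q, k, v, o) as targets are assumptions for illustration; the actual guanaco-lora settings are not stated here.

```python
HIDDEN_SIZE = 4096   # LLaMA 7B hidden dimension
NUM_LAYERS = 32      # LLaMA 7B transformer layers
ATTN_MATRICES = 4    # q, k, v, o projections (assumed LoRA targets)


def lora_attention_params(rank):
    """Trainable LoRA parameters when adapting the attention projections."""
    # Each adapted matrix gets two low-rank factors: (hidden x r) and (r x hidden).
    per_matrix = rank * (HIDDEN_SIZE + HIDDEN_SIZE)
    return NUM_LAYERS * ATTN_MATRICES * per_matrix


def full_attention_params():
    """Parameters in the full attention projection matrices."""
    return NUM_LAYERS * ATTN_MATRICES * HIDDEN_SIZE * HIDDEN_SIZE


if __name__ == "__main__":
    lora = lora_attention_params(rank=8)  # rank 8 is an assumed value
    full = full_attention_params()
    print(lora, full, lora / full)
```

Under these assumptions the adapters hold about 8.4M parameters against about 2.1B in the full attention matrices, a ratio of 1/256. The count covers only the LoRA adapters; embed_tokens and lm_head, which this model trains in full, are not included.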

KBlueLeaf/guanaco-7b-leh-v2