Question on the usage of TowerBlocks to train this model

#3
by vince62s - opened

Here are two examples found in TowerBlocks:

{'conversations': [{'from': 'human', 'value': 'Source: Orlando Bloom and Miranda Kerr still love each other\nTranslate the source text from English to German.\nTarget: '}, {'from': 'gpt', 'value': 'Orlando Bloom und Miranda Kerr lieben sich noch immer'}], 'lang': 'en-de', 'split': 'dev', 'dataset': 'general_mt_clean', 'task': 'machine_translation'}
{'conversations': [{'from': 'human', 'value': 'From English to German, translate the text:\nSource: The heavily armoured dinosaur used red and white camouflage to hide from predators, and employed a shielding technique known as counter-shading, which is also used by many modern-day animals.\nTarget: '}, {'from': 'gpt', 'value': 'Der stark gepanzerte Dinosaurier verwendete eine rote und weiße Tarnung, um sich vor Raubtieren zu verstecken, und verwendete eine Abschirmungstechnik, die als Gegenbeschattung bekannt ist, die auch von vielen modernen Tieren verwendet wird.'}], 'lang': 'en-de', 'split': 'dev', 'dataset': 'general_mt_clean', 'task': 'machine_translation'}

I have two big questions:

  1. In the dataset, the "from" fields use "human" and "gpt", but in the model card you mention "user" and "assistant". Did you actually transform all those fields to user/assistant to train TowerInstruct?

  2. The instructions vary a lot. In the first example the instruction "Translate" comes after the source text, and it uses "Source" and "Target" instead of "English" and "German" as per the model card. In the second example the instruction comes before the source, but not exactly as per the model card either.

Did you use the TowerBlocks dataset as is, or did you perform some kind of preprocessing?

Thanks

Unbabel org

Hey @vince62s ,
Models are trained with the ChatML format (on Axolotl), which uses "human" and "gpt" rather than "user" and "assistant". Hugging Face's apply_chat_template function uses a different nomenclature, but that should be fine. In the end, what you want to be passed to the model is something like: '<|im_start|>user\nTranslate the following English source text to Portuguese:\nEnglish: I am a language model for european languages. \nPortuguese: <|im_end|>\n<|im_start|>assistant\n'

Regarding the differences in the dataset... that's on purpose. The model is trained with a bunch of different templates so that it does not overfit to a particular type of prompt. In the model card we added one example, but you can use the model with many different prompts.
