SnakyMcSnekFace committed on
Commit
f327f9c
1 Parent(s): 9ef4b89

Upload README.md

  1. README.md +3 -1
README.md CHANGED
@@ -174,6 +174,8 @@ Half of the samples was generated by this model where prompts contained the adve
 
 [KTO](https://arxiv.org/abs/2402.01306) trainer from [Hugging Face TRL library](https://huggingface.co/docs/trl/en/kto_trainer) was employed for performing preference alignment. The LoRA adapter from the previous training stages was merged into the model, and a new LoRA adapter was created for the KTO training. The quantized base model serves as a reference.
 
+During the alignment, the model was encouraged to respect player's actions and agency, construct a coherent narrative, and use evocative language to describe the world and the outcome of the player's actions.
+
 #### QLoRa adapter configuration
 
 - Rank: 16
@@ -210,7 +212,7 @@ The model's performance in Adventure Mode has improved substantially. The writin
 ![Gradient Norm](img/kto_grad_norm.png)
 ![Learning rate](img/kto_learning_rate.png)
 ![Rewards](img/kto_train_rewards.png)
-![Log probabilities](img/train_logps.png)
+![Log probabilities](img/kto_train_logps.png)
 ![KL divergence](img/kto_train_kl_divergence.png)
 
 
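
For readers of this change, the quantities behind the plots named in the second hunk (rewards, log probabilities, KL divergence) can be tied together with a small sketch of the per-example KTO objective as described in the linked paper. This is not the TRL implementation and not code from this repository. The function name `kto_example_loss`, its parameter names, and the default values for `beta`, `lambda_d`, and `lambda_u` are assumptions chosen for illustration, and the KL term is treated as a fixed non-negative number supplied by the caller rather than estimated from a batch.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def kto_example_loss(
    policy_logp: float,   # log-probability of the completion under the trained model
    ref_logp: float,      # log-probability under the frozen reference model
    desirable: bool,      # True if the sample is labeled as desirable
    kl_estimate: float,   # reference point z0, assumed given and >= 0
    beta: float = 0.1,    # illustrative value, not taken from the README
    lambda_d: float = 1.0,  # weight for desirable samples (illustrative)
    lambda_u: float = 1.0,  # weight for undesirable samples (illustrative)
) -> float:
    """Per-example KTO loss: lambda_y - v(x, y), following the KTO paper."""
    # Implied reward: log-ratio between the policy and the reference model.
    reward = policy_logp - ref_logp
    if desirable:
        # Loss shrinks as the reward rises above the KL reference point.
        return lambda_d * (1.0 - sigmoid(beta * (reward - kl_estimate)))
    # Loss shrinks as the reward falls below the KL reference point.
    return lambda_u * (1.0 - sigmoid(beta * (kl_estimate - reward)))
```

In this sketch, a desirable sample whose log-probability increases relative to the reference yields a lower loss, while the same increase on an undesirable sample yields a higher one, which is the behavior the rewards plot would be expected to reflect.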