File size: 2,919 Bytes
a8966c5
 
 
 
 
 
 
465e398
 
 
a8966c5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
465e398
a8966c5
465e398
 
a8966c5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
---
tags:
- generated_from_trainer
model-index:
- name: out
  results: []
---
This is the Instruction Fine Tuned version of [Tiny Llama](https://github.com/jzhang38/TinyLlama) on [@Teknium1's](https://twitter.com/Teknium1) [openhermes](https://huggingface.co/datasets/teknium/openhermes) dataset.

`"The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs 🚀🚀. The training has started on 2023-09-01."`


<details><summary>See axolotl config</summary>

axolotl version: `0.3.0`
```yaml
base_model: ./TinyLlama-1.1B-intermediate-step-1431k-3T

model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: ./openhermes
    type: alpaca
dataset_prepared_path:
val_set_size: 0.05
output_dir: ./out

sequence_len: 4096
sample_packing: false

adapter: 
lora_model_dir:
lora_r: 
lora_alpha: 
lora_dropout:
lora_target_linear:
lora_fan_in_fan_out:

wandb_project: tinyllama-openhermes
wandb_entity: tensoic
wandb_watch:
wandb_name: 
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 8
num_epochs: 1
optimizer: adamw_bnb_8bit
adam_epsilon: 0.00001
max_grad_norm: 1.0
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:

warmup_steps: 100
evals_per_epoch: 4
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed: zero2.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

```

</details><br>



![image/png](https://cdn-uploads.huggingface.co/production/uploads/644bf6ef778ecbfb977e8e84/baBgY3cd4rUKWQITj3sNx.png)

The model achieves the following loss:
- Loss: 1.3647

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 128
- total_eval_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-05
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- num_epochs: 1
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 3.0006        | 0.0   | 1    | 1.6838          |
| 0.8195        | 0.25  | 451  | 1.4620          |
| 0.6836        | 0.5   | 902  | 1.4158          |
| 0.6811        | 0.75  | 1353 | 1.3647          |


### Framework versions

- Transformers 4.36.2
- Pytorch 2.0.1+cu117
- Datasets 2.15.0
- Tokenizers 0.15.0