lapp0 committed
Commit d581d32
1 Parent(s): 523afcc

End of training

README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
- - eval_enwikippl: 3607.3171
- - eval_frwikippl: 29425.125
- - eval_zhwikippl: 52510.3125
- - eval_tinystoriesppl: 1167.9218
- - eval_loss: 5.1093
- - eval_runtime: 6.5022
- - eval_samples_per_second: 76.897
- - eval_steps_per_second: 9.689
+ - eval_enwikippl: 172.3267
+ - eval_frwikippl: 37035.1875
+ - eval_zhwikippl: 194088.4531
+ - eval_tinystoriesppl: 10.7883
+ - eval_loss: 1.3421
+ - eval_runtime: 6.5007
+ - eval_samples_per_second: 76.914
+ - eval_steps_per_second: 9.691
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -47,7 +47,7 @@ More information needed
 The following hyperparameters were used during training:
 - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
 - train_embeddings: True
- - learning_rate: 0.0004
+ - learning_rate: 0.004
 - train_batch_size: 8
 - eval_batch_size: 8
 - seed: 42
@@ -62,20 +62,20 @@ Peak GPU Memory: 8.0568 GB
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 21321.3555 | 56774.5312 | 6.6010 | 6.4946 | 76.987 | 9.7 | 11289.9248 | 60744.7383 |
- | 500 | 0.0808 | 3754.7207 | 29512.3027 | 5.1110 | 6.4981 | 76.946 | 9.695 | 1235.4543 | 53915.7461 |
- | 1000 | 0.1616 | 3629.7410 | 29470.7617 | 5.1093 | 6.5256 | 76.621 | 9.654 | 1179.3701 | 52678.6953 |
- | 1500 | 0.2424 | 3604.8032 | 29425.125 | 5.1093 | 6.4874 | 77.073 | 9.711 | 1167.5359 | 52510.3125 |
- | 2000 | 0.3232 | 3604.8032 | 29425.125 | 5.1093 | 6.5089 | 76.818 | 9.679 | 1167.3427 | 52510.3125 |
- | 2500 | 0.4040 | 3607.3171 | 29425.125 | 5.1093 | 6.4993 | 76.931 | 9.693 | 1167.9218 | 52510.3125 |
- | 3000 | 0.4848 | 3607.3171 | 29425.125 | 5.1093 | 6.5037 | 76.879 | 9.687 | 1167.9218 | 52510.3125 |
- | 3500 | 0.5656 | 3607.3171 | 29425.125 | 5.1093 | 6.4875 | 77.071 | 9.711 | 1167.9218 | 52510.3125 |
- | 4000 | 0.6464 | 3607.3171 | 29425.125 | 5.1093 | 6.5141 | 76.757 | 9.671 | 1167.9218 | 52510.3125 |
- | 4500 | 0.7272 | 3607.3171 | 29425.125 | 5.1093 | 6.4963 | 76.967 | 9.698 | 1167.9218 | 52510.3125 |
- | 5000 | 0.8080 | 3607.3171 | 29425.125 | 5.1093 | 6.4977 | 76.95 | 9.696 | 1167.9218 | 52510.3125 |
- | 5500 | 0.8888 | 3607.3171 | 29425.125 | 5.1093 | 6.485 | 77.101 | 9.715 | 1167.9218 | 52510.3125 |
- | 6000 | 0.9696 | 3607.3171 | 29425.125 | 5.1093 | 6.5124 | 76.777 | 9.674 | 1167.9218 | 52510.3125 |
- | 6188 | 1.0 | 3607.3171 | 29425.125 | 5.1093 | 6.5022 | 76.897 | 9.689 | 1167.9218 | 52510.3125 |
+ | 0 | 0 | 21321.3555 | 56774.5312 | 6.6010 | 6.5203 | 76.684 | 9.662 | 11289.9248 | 60744.7383 |
+ | 500 | 0.0808 | 209.4913 | 62706.3438 | 1.4389 | 6.5207 | 76.678 | 9.661 | 11.3247 | 298068.5312 |
+ | 1000 | 0.1616 | 182.2395 | 44100.0312 | 1.3516 | 6.5162 | 76.732 | 9.668 | 10.7194 | 255063.875 |
+ | 1500 | 0.2424 | 174.5435 | 39099.0508 | 1.3434 | 6.5225 | 76.658 | 9.659 | 10.7310 | 199230.0 |
+ | 2000 | 0.3232 | 173.0893 | 37756.8164 | 1.3422 | 6.5133 | 76.766 | 9.672 | 10.7545 | 194918.7188 |
+ | 2500 | 0.4040 | 171.9666 | 36889.3945 | 1.3422 | 6.4938 | 76.996 | 9.702 | 10.7906 | 195543.75 |
+ | 3000 | 0.4848 | 171.2156 | 36931.0 | 1.3418 | 6.5177 | 76.714 | 9.666 | 10.7314 | 190904.3438 |
+ | 3500 | 0.5656 | 172.5805 | 37171.0625 | 1.3418 | 6.5331 | 76.533 | 9.643 | 10.8124 | 193984.8281 |
+ | 4000 | 0.6464 | 171.9800 | 37035.1875 | 1.3417 | 6.5151 | 76.744 | 9.67 | 10.7732 | 191414.4375 |
+ | 4500 | 0.7272 | 172.1532 | 37056.0664 | 1.3423 | 6.5089 | 76.818 | 9.679 | 10.7879 | 193984.8281 |
+ | 5000 | 0.8080 | 172.3400 | 37035.1875 | 1.3422 | 6.5009 | 76.912 | 9.691 | 10.7968 | 196799.8281 |
+ | 5500 | 0.8888 | 172.2065 | 37035.1875 | 1.3422 | 6.4968 | 76.961 | 9.697 | 10.7714 | 193984.8281 |
+ | 6000 | 0.9696 | 172.3267 | 37035.1875 | 1.3419 | 6.5099 | 76.806 | 9.678 | 10.7910 | 193984.8281 |
+ | 6188 | 1.0 | 172.3267 | 37035.1875 | 1.3421 | 6.5007 | 76.914 | 9.691 | 10.7883 | 194088.4531 |
 
 ### Framework versions
 - Distily 0.2.0
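
In the `distillation_objective` above, only the logits component is active (weight=1, loss_fn=kl); the hidden-state and attention components have weight 0 and contribute nothing. For reference, a minimal PyTorch sketch of such a logits-only KL objective follows. This is not Distily's implementation; the tensor shapes and the absence of a temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def logits_kl_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token KL(teacher || student) over the vocabulary.

    Mirrors the active part of the config above: logits component with
    weight=1 and loss_fn=kl; the hs and attn components have weight=0
    and are omitted. Shapes assumed: (batch, seq_len, vocab_size).
    """
    s = F.log_softmax(student_logits, dim=-1).flatten(0, 1)  # (batch*seq, vocab)
    t = F.log_softmax(teacher_logits, dim=-1).flatten(0, 1)
    # "batchmean" sums over the vocab dimension and averages over tokens.
    return F.kl_div(s, t, log_target=True, reduction="batchmean")

# Typical use in a training step, with a frozen teacher:
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# loss = logits_kl_loss(student(input_ids).logits, teacher_logits)
```

This direction of the KL divergence penalizes the student for assigning low probability where the teacher assigns high probability; whether Distily applies a temperature or additional scaling is not shown in this card.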
logs/dropout=0.1, learning_rate=0.004, weight_decay=0.1/events.out.tfevents.1723877699.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:547a586aa9753321c3a42c52333fad7a0e1f653aa97e8bd22622ae304be2ccfa
+ size 307
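
The eval_*ppl metrics in the card are perplexities on the respective corpora (enwiki, frwiki, zhwiki, TinyStories). A generic perplexity computation for a causal LM with `transformers` might look like the sketch below; the checkpoint id is a placeholder, and the exact eval corpora and windowing Distily uses are not reproduced here.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "path/to/student-checkpoint" is a placeholder, not the actual repo id.
model = AutoModelForCausalLM.from_pretrained("path/to/student-checkpoint")
tokenizer = AutoTokenizer.from_pretrained("path/to/student-checkpoint")
model.eval()

def perplexity(text: str) -> float:
    """exp(mean next-token cross-entropy) of `text` under the model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model compute the shifted next-token
        # cross-entropy internally and return it as .loss.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

print(perplexity("Once upon a time, there was a tiny robot who loved stories."))
```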