lapp0 committed on
Commit 4572dcc
1 Parent(s): 7bbfcd2

End of training

README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
  The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
  It achieves the following results on the evaluation set:
- - eval_enwikippl: 6023.2686
- - eval_frwikippl: 36635.6680
- - eval_zhwikippl: 63580.2773
- - eval_tinystoriesppl: 2521.4807
- - eval_loss: 5.0637
- - eval_runtime: 13.0817
- - eval_samples_per_second: 76.442
- - eval_steps_per_second: 9.555
 
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment.
@@ -62,23 +62,23 @@ Peak GPU Memory: 8.1729 GB
  | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 35320.8594 | 74921.0938 | 6.5227 | 13.0719 | 76.5 | 9.562 | 24085.9062 | 74141.7188 |
- | 1000 | 0.0808 | 6057.8965 | 36584.1016 | 5.0635 | 13.0614 | 76.561 | 9.57 | 2544.0928 | 63818.1680 |
- | 2000 | 0.1616 | 6023.2686 | 36635.6680 | 5.0635 | 13.099 | 76.342 | 9.543 | 2523.1487 | 63580.2773 |
- | 3000 | 0.2424 | 6023.2686 | 36635.6680 | 5.0635 | 13.0902 | 76.393 | 9.549 | 2523.1487 | 63580.2773 |
- | 4000 | 0.3232 | 6021.4019 | 36635.6680 | 5.0637 | 13.0916 | 76.385 | 9.548 | 2519.3967 | 63580.2773 |
- | 5000 | 0.4040 | 6023.2686 | 36635.6680 | 5.0637 | 13.0968 | 76.354 | 9.544 | 2521.4807 | 63580.2773 |
- | 6000 | 0.4848 | 6023.2686 | 36635.6680 | 5.0637 | 13.0689 | 76.517 | 9.565 | 2521.4807 | 63580.2773 |
- | 7000 | 0.5657 | 6017.6704 | 36656.3242 | 5.0637 | 13.0875 | 76.409 | 9.551 | 2513.5720 | 63546.3320 |
- | 8000 | 0.6465 | 6023.2686 | 36635.6680 | 5.0637 | 13.1185 | 76.228 | 9.529 | 2521.4807 | 63580.2773 |
- | 9000 | 0.7273 | 6023.2686 | 36635.6680 | 5.0637 | 13.0817 | 76.442 | 9.555 | 2521.4807 | 63580.2773 |
- | 10000 | 0.8081 | 6017.6704 | 36635.6680 | 5.0637 | 13.0767 | 76.472 | 9.559 | 2515.2351 | 63546.3320 |
- | 11000 | 0.8889 | 6023.2686 | 36635.6680 | 5.0637 | 13.0841 | 76.429 | 9.554 | 2522.7314 | 63580.2773 |
- | 12000 | 0.9697 | 6023.2686 | 36635.6680 | 5.0637 | 13.0805 | 76.45 | 9.556 | 2521.0635 | 63580.2773 |
- | 12375 | 1.0 | 6023.2686 | 36635.6680 | 5.0637 | 13.0623 | 76.556 | 9.57 | 2521.0635 | 63580.2773 |
 
  ### Framework versions
  - Distily 0.2.0
  - Transformers 4.44.0
  - Pytorch 2.3.0
- - Datasets 2.21.0
 
  The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
  It achieves the following results on the evaluation set:
+ - eval_enwikippl: 1882.2876
+ - eval_frwikippl: 38923.2266
+ - eval_zhwikippl: 63461.6641
+ - eval_tinystoriesppl: 451.2739
+ - eval_loss: 4.8257
+ - eval_runtime: 13.1445
+ - eval_samples_per_second: 76.078
+ - eval_steps_per_second: 9.51
 
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment.
 
  | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
+ | 0 | 0 | 10909.4980 | 77116.0 | 6.3550 | 13.1937 | 75.794 | 9.474 | 4267.7983 | 73081.2031 |
+ | 1000 | 0.0808 | 1884.7683 | 38923.2266 | 4.8260 | 13.1354 | 76.13 | 9.516 | 453.2929 | 63529.4258 |
+ | 2000 | 0.1616 | 1882.5793 | 38923.2266 | 4.8257 | 13.2412 | 75.522 | 9.44 | 451.5352 | 63461.6641 |
+ | 3000 | 0.2424 | 1882.5793 | 38923.2266 | 4.8257 | 13.2384 | 75.538 | 9.442 | 451.6844 | 63461.6641 |
+ | 4000 | 0.3232 | 1881.7043 | 38923.2266 | 4.8257 | 13.2242 | 75.619 | 9.452 | 450.9009 | 63461.6641 |
+ | 5000 | 0.4040 | 1883.1630 | 38923.2266 | 4.8257 | 13.1558 | 76.012 | 9.501 | 451.8337 | 63461.6641 |
+ | 6000 | 0.4848 | 1883.1630 | 38923.2266 | 4.8257 | 13.2198 | 75.644 | 9.456 | 451.8337 | 63461.6641 |
+ | 7000 | 0.5657 | 1884.4762 | 38923.2266 | 4.8257 | 13.2183 | 75.653 | 9.457 | 452.8433 | 63529.4258 |
+ | 8000 | 0.6465 | 1882.5793 | 38923.2266 | 4.8257 | 13.1236 | 76.198 | 9.525 | 451.4604 | 63461.6641 |
+ | 9000 | 0.7273 | 1882.2876 | 38923.2266 | 4.8257 | 13.1445 | 76.078 | 9.51 | 451.2739 | 63461.6641 |
+ | 10000 | 0.8081 | 1880.2477 | 38923.2266 | 4.8257 | 13.2204 | 75.641 | 9.455 | 450.4167 | 63461.6641 |
+ | 11000 | 0.8889 | 1882.5793 | 38923.2266 | 4.8257 | 13.267 | 75.375 | 9.422 | 451.7592 | 63461.6641 |
+ | 12000 | 0.9697 | 1883.1630 | 38923.2266 | 4.8257 | 13.182 | 75.861 | 9.483 | 451.8337 | 63461.6641 |
+ | 12375 | 1.0 | 1883.1630 | 38923.2266 | 4.8257 | 13.202 | 75.746 | 9.468 | 451.8337 | 63461.6641 |
 
  ### Framework versions
  - Distily 0.2.0
  - Transformers 4.44.0
  - Pytorch 2.3.0
+ - Datasets 2.20.0
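
The removed (`-`) and added (`+`) summary metrics in the README diff can be compared directly. The following is a minimal sketch of that arithmetic, using only the values shown in the diff; the dictionary and function names are illustrative and are not part of Distily.

```python
# Final evaluation metrics copied from the README diff
# ("-" summary lines vs. "+" summary lines).
before = {
    "enwikippl": 6023.2686,
    "frwikippl": 36635.6680,
    "zhwikippl": 63580.2773,
    "tinystoriesppl": 2521.4807,
    "loss": 5.0637,
}
after = {
    "enwikippl": 1882.2876,
    "frwikippl": 38923.2266,
    "zhwikippl": 63461.6641,
    "tinystoriesppl": 451.2739,
    "loss": 4.8257,
}


def relative_change(old: float, new: float) -> float:
    """Return the signed percentage change from old to new."""
    return (new - old) / old * 100.0


# Hypothetical helper structure: one percentage per metric name.
changes = {name: relative_change(before[name], after[name]) for name in before}

for name, pct in changes.items():
    print(f"{name:>15}: {before[name]:>11.4f} -> {after[name]:>11.4f} ({pct:+.1f}%)")
```

Lower is better for every metric listed here, so a negative percentage is an improvement. On these numbers the English-wiki and TinyStories perplexities drop sharply, while the French-wiki perplexity rises slightly.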
logs/attn_loss_fn=None, attn_weight=0, gradient_accumulation_steps=1, hs_loss_fn=mse, hs_weight=2.0, learning_rate=0.0004, lr_scheduler_kwargs=__num_cycles___4_, lr_scheduler_type=cosine_with_restarts, max/events.out.tfevents.1723834178.93d6cbb3ad53 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4dcf54c0f3ba80cf194440b99eed60de8b69bcd2db4045d4de3c07cf2414325f
+ size 307
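
The log directory name above records `learning_rate=0.0004`, `lr_scheduler_type=cosine_with_restarts` and `num_cycles` of 4, and the README table ends at step 12375. The sketch below shows what such a schedule could look like, modeled on the hard-restart cosine form used by the Transformers function `get_cosine_with_hard_restarts_schedule_with_warmup`. The warmup length is not stated anywhere in this diff, so `WARMUP_STEPS = 0` is an assumption, and `lr_at` is a hypothetical helper rather than Distily code.

```python
import math

BASE_LR = 0.0004      # learning_rate from the log directory name
NUM_CYCLES = 4        # num_cycles from lr_scheduler_kwargs in the log directory name
TOTAL_STEPS = 12375   # last step in the README results table
WARMUP_STEPS = 0      # assumption: warmup is not stated in this diff


def lr_at(step: int) -> float:
    """Learning rate at a given step under cosine decay with hard restarts."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to BASE_LR (unused when WARMUP_STEPS is 0).
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    if progress >= 1.0:
        return 0.0
    # Position inside the current cycle, in [0, 1); wraps at each restart.
    cycle_pos = (NUM_CYCLES * progress) % 1.0
    return BASE_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_pos)))


for step in (0, 1000, 3000, 3094, 6000, 12375):
    print(step, f"{lr_at(step):.6f}")
```

With four cycles over 12375 steps, each cycle spans roughly 3094 steps: the rate decays toward zero near the end of a cycle and jumps back to the base rate at the restart.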
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:4dfea5361764269b261b7bed75573bc5f866f868a496b5eeb6a9ef5d9600cd1a
  size 137033984
 
  version https://git-lfs.github.com/spec/v1
+ oid sha256:d0fb2a2484cd2ddfc1ca74f378aecd493ed96dc95efbcd19968d8b21725ce360
  size 137033984
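
The `model.safetensors` entry is a Git LFS pointer: only the `oid` changes in this commit, while the size stays at 137033984 bytes. A downloaded copy of the weights can be checked against such a pointer. The following is a minimal sketch using only the Python standard library; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the local file path in the usage note is an assumption.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict) -> bool:
    """Check a local file's size and SHA-256 digest against a parsed pointer."""
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large weight files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer contents for the new model.safetensors, copied from the diff.
NEW_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:d0fb2a2484cd2ddfc1ca74f378aecd493ed96dc95efbcd19968d8b21725ce360
size 137033984
"""

pointer = parse_lfs_pointer(NEW_POINTER)
print(pointer["algo"], pointer["size"])
```

Given a local download, `matches_pointer(Path("model.safetensors"), pointer)` would return `True` only if both the byte size and the SHA-256 digest agree with the pointer.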