End of training

Files changed (3)

README.md +23 -23
logs/attn_loss_fn=None, attn_weight=0, gradient_accumulation_steps=1, hs_loss_fn=mse, hs_weight=2.0, learning_rate=0.0004, lr_scheduler_kwargs=__num_cycles___4_, lr_scheduler_type=cosine_with_restarts, max/events.out.tfevents.1723834178.93d6cbb3ad53 +3 -0
model.safetensors +1 -1

README.md CHANGED

@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 It achieves the following results on the evaluation set:
-- eval_enwikippl: 6023.2686
-- eval_frwikippl: 36635.6680
-- eval_zhwikippl: 63580.2773
-- eval_tinystoriesppl: 2521.4807
-- eval_loss: 5.0637
-- eval_runtime: 13.0817
-- eval_samples_per_second: 76.442
-- eval_steps_per_second: 9.555
+- eval_enwikippl: 1882.2876
+- eval_frwikippl: 38923.2266
+- eval_zhwikippl: 63461.6641
+- eval_tinystoriesppl: 451.2739
+- eval_loss: 4.8257
+- eval_runtime: 13.1445
+- eval_samples_per_second: 76.078
+- eval_steps_per_second: 9.51
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -62,23 +62,23 @@ Peak GPU Memory: 8.1729 GB
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** |  | 169.9865 | 47377.9414 |  |  |  |  | 3.9789 | 4998.1294 |
-| 0 | 0 | 35320.8594 | 74921.0938 | 6.5227 | 13.0719 | 76.5 | 9.562 | 24085.9062 | 74141.7188 |
-| 1000 | 0.0808 | 6057.8965 | 36584.1016 | 5.0635 | 13.0614 | 76.561 | 9.57 | 2544.0928 | 63818.1680 |
-| 2000 | 0.1616 | 6023.2686 | 36635.6680 | 5.0635 | 13.099 | 76.342 | 9.543 | 2523.1487 | 63580.2773 |
-| 3000 | 0.2424 | 6023.2686 | 36635.6680 | 5.0635 | 13.0902 | 76.393 | 9.549 | 2523.1487 | 63580.2773 |
-| 4000 | 0.3232 | 6021.4019 | 36635.6680 | 5.0637 | 13.0916 | 76.385 | 9.548 | 2519.3967 | 63580.2773 |
-| 5000 | 0.4040 | 6023.2686 | 36635.6680 | 5.0637 | 13.0968 | 76.354 | 9.544 | 2521.4807 | 63580.2773 |
-| 6000 | 0.4848 | 6023.2686 | 36635.6680 | 5.0637 | 13.0689 | 76.517 | 9.565 | 2521.4807 | 63580.2773 |
-| 7000 | 0.5657 | 6017.6704 | 36656.3242 | 5.0637 | 13.0875 | 76.409 | 9.551 | 2513.5720 | 63546.3320 |
-| 8000 | 0.6465 | 6023.2686 | 36635.6680 | 5.0637 | 13.1185 | 76.228 | 9.529 | 2521.4807 | 63580.2773 |
-| 9000 | 0.7273 | 6023.2686 | 36635.6680 | 5.0637 | 13.0817 | 76.442 | 9.555 | 2521.4807 | 63580.2773 |
-| 10000 | 0.8081 | 6017.6704 | 36635.6680 | 5.0637 | 13.0767 | 76.472 | 9.559 | 2515.2351 | 63546.3320 |
-| 11000 | 0.8889 | 6023.2686 | 36635.6680 | 5.0637 | 13.0841 | 76.429 | 9.554 | 2522.7314 | 63580.2773 |
-| 12000 | 0.9697 | 6023.2686 | 36635.6680 | 5.0637 | 13.0805 | 76.45 | 9.556 | 2521.0635 | 63580.2773 |
-| 12375 | 1.0 | 6023.2686 | 36635.6680 | 5.0637 | 13.0623 | 76.556 | 9.57 | 2521.0635 | 63580.2773 |
+| 0 | 0 | 10909.4980 | 77116.0 | 6.3550 | 13.1937 | 75.794 | 9.474 | 4267.7983 | 73081.2031 |
+| 1000 | 0.0808 | 1884.7683 | 38923.2266 | 4.8260 | 13.1354 | 76.13 | 9.516 | 453.2929 | 63529.4258 |
+| 2000 | 0.1616 | 1882.5793 | 38923.2266 | 4.8257 | 13.2412 | 75.522 | 9.44 | 451.5352 | 63461.6641 |
+| 3000 | 0.2424 | 1882.5793 | 38923.2266 | 4.8257 | 13.2384 | 75.538 | 9.442 | 451.6844 | 63461.6641 |
+| 4000 | 0.3232 | 1881.7043 | 38923.2266 | 4.8257 | 13.2242 | 75.619 | 9.452 | 450.9009 | 63461.6641 |
+| 5000 | 0.4040 | 1883.1630 | 38923.2266 | 4.8257 | 13.1558 | 76.012 | 9.501 | 451.8337 | 63461.6641 |
+| 6000 | 0.4848 | 1883.1630 | 38923.2266 | 4.8257 | 13.2198 | 75.644 | 9.456 | 451.8337 | 63461.6641 |
+| 7000 | 0.5657 | 1884.4762 | 38923.2266 | 4.8257 | 13.2183 | 75.653 | 9.457 | 452.8433 | 63529.4258 |
+| 8000 | 0.6465 | 1882.5793 | 38923.2266 | 4.8257 | 13.1236 | 76.198 | 9.525 | 451.4604 | 63461.6641 |
+| 9000 | 0.7273 | 1882.2876 | 38923.2266 | 4.8257 | 13.1445 | 76.078 | 9.51 | 451.2739 | 63461.6641 |
+| 10000 | 0.8081 | 1880.2477 | 38923.2266 | 4.8257 | 13.2204 | 75.641 | 9.455 | 450.4167 | 63461.6641 |
+| 11000 | 0.8889 | 1882.5793 | 38923.2266 | 4.8257 | 13.267 | 75.375 | 9.422 | 451.7592 | 63461.6641 |
+| 12000 | 0.9697 | 1883.1630 | 38923.2266 | 4.8257 | 13.182 | 75.861 | 9.483 | 451.8337 | 63461.6641 |
+| 12375 | 1.0 | 1883.1630 | 38923.2266 | 4.8257 | 13.202 | 75.746 | 9.468 | 451.8337 | 63461.6641 |
 ### Framework versions
 - Distily 0.2.0
 - Transformers 4.44.0
 - Pytorch 2.3.0
-- Datasets 2.21.0
+- Datasets 2.20.0
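
As a reading aid for the README change, the sketch below computes the relative change of each headline eval metric between the removed and added values. The numbers are copied from this diff. The dictionary names and the `relative_change` helper are hypothetical and are not part of Distily.

```python
# Final eval metrics copied from the removed (-) and added (+) README lines.
old = {
    "enwikippl": 6023.2686,
    "frwikippl": 36635.6680,
    "zhwikippl": 63580.2773,
    "tinystoriesppl": 2521.4807,
    "loss": 5.0637,
}
new = {
    "enwikippl": 1882.2876,
    "frwikippl": 38923.2266,
    "zhwikippl": 63461.6641,
    "tinystoriesppl": 451.2739,
    "loss": 4.8257,
}


def relative_change(before: float, after: float) -> float:
    """Return (after - before) / before; negative means the metric dropped."""
    return (after - before) / before


# Hypothetical helper output: one relative change per metric name.
changes = {name: relative_change(old[name], new[name]) for name in old}

for name, change in changes.items():
    print(f"{name:>15}: {old[name]:>11.4f} -> {new[name]:>11.4f} ({change:+.1%})")
```

Lower perplexity is better, so a negative change on the `*ppl` rows is an improvement. In this commit the English and TinyStories perplexities fall sharply while the French one rises slightly.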

logs/attn_loss_fn=None, attn_weight=0, gradient_accumulation_steps=1, hs_loss_fn=mse, hs_weight=2.0, learning_rate=0.0004, lr_scheduler_kwargs=__num_cycles___4_, lr_scheduler_type=cosine_with_restarts, max/events.out.tfevents.1723834178.93d6cbb3ad53 ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4dcf54c0f3ba80cf194440b99eed60de8b69bcd2db4045d4de3c07cf2414325f
+size 307
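
The three added lines form a Git LFS pointer file: the repository stores this small text stub, and the `size` field gives the byte length of the real object (307 bytes here). A minimal sketch of reading such a pointer follows. `parse_lfs_pointer` is a hypothetical helper, not a Git LFS or Hugging Face API, and it assumes every line is a single `key value` pair as shown in this diff.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        # Split on the first space only; the value may contain ':' or '/'.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real object in bytes
    }


# Pointer contents copied from the added lines above.
pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:4dcf54c0f3ba80cf194440b99eed60de8b69bcd2db4045d4de3c07cf2414325f
size 307
"""

pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algo"], pointer["size"])
```

The same helper would apply to the `model.safetensors` pointer, where only the `oid` changes and the `size` stays at 137033984 bytes.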

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4dfea5361764269b261b7bed75573bc5f866f868a496b5eeb6a9ef5d9600cd1a
+oid sha256:d0fb2a2484cd2ddfc1ca74f378aecd493ed96dc95efbcd19968d8b21725ce360
 size 137033984