distily_bench_obj_cross_v2.8

This student model was distilled from the teacher model roneneldan/TinyStories-33M. The training dataset is unspecified.

The Distily library was used for this distillation.

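It achieves the following results on the evaluation set at the final evaluation step (49500), taken from the last row of the table below:

enwikippl: 13851.3447
frwikippl: 65639.0234
zhwikippl: 58159.7148
tinystoriesppl: 6957.4912
loss: 5.9770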

Training procedure

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
train_embeddings: True
learning_rate: 0.004
train_batch_size: 1
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 1.0
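
The distillation_objective above puts all of the weight on a KL loss over the logits; the hs and attn components have weight 0 and so contribute nothing. A minimal sketch of what such an objective computes for a single token position is shown here. It is an illustration in plain Python, not Distily's code: the function names are hypothetical, and the KL direction (teacher relative to student), the absence of a temperature, and the per-position reduction are all assumptions that may differ from the library.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_logits_loss(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary at one token position.

    Assumption: the teacher distribution is the target, as is common in
    knowledge distillation. The library's actual direction may differ.
    """
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))


def distillation_loss(teacher_logits, student_logits, logits_weight=1.0):
    """Weighted sum of loss components; only the logits term is active here
    because the hs and attn weights are 0 in this run."""
    return logits_weight * kl_logits_loss(teacher_logits, student_logits)


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]   # hypothetical 3-token vocabulary
    student = [1.0, 1.0, 0.0]
    print(distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise, which is why it can be used on its own as the training signal.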

Peak GPU Memory: 6.6058 GB

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		169.9865	47377.9414					3.9789	4998.1294
0	0	24265.9961	83952.2266	6.4532	6.5041	76.875	9.686	14059.6133	62337.3242
5000	0.1010	13782.8369	65787.1094	5.9830	6.5013	76.908	9.69	6904.7764	58004.7852
10000	0.2020	13782.8369	65676.0234	5.9770	6.4897	77.046	9.708	6916.2041	58066.7188
15000	0.3030	13804.2139	65639.0234	5.9770	6.4932	77.003	9.702	6925.3584	58066.7188
20000	0.4040	13812.7734	65639.0234	5.9770	6.511	76.793	9.676	6934.5249	58066.7188
25000	0.5051	13829.8955	65639.0234	5.9770	6.5	76.923	9.692	6944.8496	58097.6836
30000	0.6061	13834.1826	65639.0234	5.9765	6.5123	76.778	9.674	6949.4409	58128.7188
35000	0.7071	13834.1826	65639.0234	5.9765	6.4965	76.965	9.698	6952.8945	58159.7148
40000	0.8081	13842.7607	65639.0234	5.9765	6.5677	76.13	9.592	6957.4912	58159.7148
45000	0.9091	13851.3447	65639.0234	5.9770	6.5257	76.62	9.654	6957.4912	58159.7148
49500	1.0	13851.3447	65639.0234	5.9770	6.5191	76.698	9.664	6957.4912	58159.7148
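
The table can be summarized with a short calculation. The sketch below copies the teacher row and three student rows (steps 0, 5000, and 49500) from the table, then computes the relative change in loss between checkpoints and the ratio of student to teacher perplexity. The helper names are hypothetical and exist only for this illustration.

```python
# Values copied from the table above (teacher row and three student checkpoints).
TEACHER = {"enwikippl": 169.9865, "tinystoriesppl": 3.9789}
STUDENT = {
    0: {"loss": 6.4532, "enwikippl": 24265.9961, "tinystoriesppl": 14059.6133},
    5000: {"loss": 5.9830, "enwikippl": 13782.8369, "tinystoriesppl": 6904.7764},
    49500: {"loss": 5.9770, "enwikippl": 13851.3447, "tinystoriesppl": 6957.4912},
}


def relative_change(old, new):
    """Fractional change from old to new (negative means a decrease)."""
    return (new - old) / old


def gap_to_teacher(step, metric):
    """Student perplexity divided by teacher perplexity (1.0 means parity)."""
    return STUDENT[step][metric] / TEACHER[metric]


early = relative_change(STUDENT[0]["loss"], STUDENT[5000]["loss"])
late = relative_change(STUDENT[5000]["loss"], STUDENT[49500]["loss"])

if __name__ == "__main__":
    print(f"loss change, step 0 -> 5000:     {early:+.2%}")
    print(f"loss change, step 5000 -> 49500: {late:+.2%}")
    for metric in ("tinystoriesppl", "enwikippl"):
        print(f"{metric} student/teacher at step 49500: "
              f"{gap_to_teacher(49500, metric):.1f}x")
```

Read this way, nearly all of the loss reduction happens in the first 5000 steps, after which the loss and the perplexities stay essentially flat, and the student's perplexities remain far above the teacher's on every evaluation set.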