# distily_bench_obj_cross_v2.1
This student model was distilled from the teacher model roneneldan/TinyStories-33M using the Distily library; the training dataset is unspecified.
It achieves the following results on the evaluation set:
- eval_enwikippl: 299.6370
- eval_frwikippl: 276510.875
- eval_zhwikippl: 20158404.0
- eval_tinystoriesppl: 10.3161
- eval_loss: 1.2989
- eval_runtime: 6.5099
- eval_samples_per_second: 76.806
- eval_steps_per_second: 9.678
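All of the `*ppl` figures above are perplexities on the corresponding evaluation corpora. As a minimal sketch of how such a perplexity can be computed with `transformers` (the checkpoint and text sample below are illustrative placeholders, not the exact evaluation setup used by Distily):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any causal LM works the same way.
model_id = "roneneldan/TinyStories-33M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "Once upon a time, there was a little dog."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy
    # over the shifted next-token predictions.
    loss = model(**enc, labels=enc["input_ids"]).loss

# Perplexity is the exponential of the mean cross-entropy.
print(f"perplexity: {torch.exp(loss).item():.4f}")
```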
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a sketch of the logits KL objective follows this list):
- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=0, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
- train_embeddings: True
- learning_rate: 0.004
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant_with_warmup
- num_epochs: 1.0
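The objective above puts all of its weight on a KL-divergence loss over the logits (the hidden-state and attention components have weight 0). Below is a minimal sketch of that loss component, assuming teacher and student share a vocabulary; it illustrates the idea rather than reproducing Distily's exact reduction or temperature handling:

```python
import torch
import torch.nn.functional as F

def logits_kl_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary dimension.

    Both tensors have shape (batch, seq_len, vocab_size).
    """
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    teacher_logprobs = F.log_softmax(teacher_logits, dim=-1)
    # F.kl_div expects log-probabilities as input; log_target=True
    # means the target is also given in log space.
    return F.kl_div(student_logprobs, teacher_logprobs,
                    log_target=True, reduction="batchmean")

# Usage: loss = logits_kl_loss(student_out.logits, teacher_out.logits.detach())
```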
### Resource Usage

- Peak GPU Memory: 8.0568 GB
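The figure above is Distily's own report; as a hypothetical aside, a peak number like this is typically read from `torch.cuda`:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the training/eval loop here ...
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak GPU memory: {peak_gb:.4f} GB")
```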
### Eval-Phase Metrics
| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
| 0 | 0 | 28486.4824 | 70517.9062 | 6.3282 | 6.4922 | 77.016 | 9.704 | 14258.5928 | 39903.9922 |
| 500 | 0.6464 | 276.9369 | 227102.6875 | 1.3956 | 6.4832 | 77.123 | 9.717 | 11.3346 | 4592851.5 |
| 773 | 0.9994 | 299.6370 | 276510.875 | 1.2989 | 6.5099 | 76.806 | 9.678 | 10.3161 | 20158404.0 |
### Framework versions
- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0
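To try the distilled checkpoint, a standard `transformers` loading snippet should work (this assumes the tokenizer is bundled with the repo; if not, fall back to the teacher's tokenizer):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

repo_id = "lapp0/distily_bench_obj_cross_v2.1"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Once upon a time", max_new_tokens=40)[0]["generated_text"])
```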