When Babies Teach Babies: Peer Knowledge Sharing Beats Teacher-Guided Distillation in Small-Data LMs

This model was trained with weighted mutual learning (WML), which finds and trains distilled versions of a teacher model through peer-to-peer learning. It builds on the approach described in "Weighted Mutual Learning with Diversity-Driven Model Compression" (Zhang et al., 2022), with some key differences.

Approach

Peer Model Initialization

Unlike the original paper, which uses differential pruning of the teacher model, we use Bayesian optimization to initialize smaller peer models (a sketch follows the list below):

  • For example, if num_peers = 4, the target parameter counts are N/2, N/3, N/4, and N/5 (where N is the teacher's parameter count)
  • We optimize num_layers, attention_heads, and hidden_size so that each peer hits its target parameter count
  • This ensures diversity among the peers while also reducing model size
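A minimal sketch of this initialization step, assuming the `bayes_opt` package and a rough GPT-style parameter-count estimate. The `bayesian_init_points` and `bayesian_n_iter` hyperparameters listed below plausibly map to its `init_points` and `n_iter` arguments; the vocabulary size, teacher size, bounds, and counting formula are illustrative assumptions, not the exact training code:

```python
# Hypothetical sketch: pick peer architectures whose parameter counts hit the
# N/2 ... N/5 targets, using Bayesian optimization over the architecture knobs.
from bayes_opt import BayesianOptimization

VOCAB_SIZE = 50257            # assumed GPT-2-style vocabulary
TEACHER_PARAMS = 124_000_000  # N, the teacher parameter count (assumed)

def estimate_params(num_layers, hidden_size):
    """Rough GPT-style count: embeddings + ~12*h^2 per transformer block."""
    return VOCAB_SIZE * hidden_size + num_layers * 12 * hidden_size ** 2

def make_objective(target_params):
    def objective(num_layers, attention_heads, hidden_size):
        # Round the continuous suggestions to valid architecture values and
        # keep hidden_size divisible by the number of attention heads
        # (head count mainly constrains shape validity here).
        heads = int(round(attention_heads))
        hidden = int(round(hidden_size / heads)) * heads
        layers = int(round(num_layers))
        # Maximize the negative gap to the target parameter count.
        return -abs(estimate_params(layers, hidden) - target_params)
    return objective

peer_configs = []
for divisor in range(2, 6):            # targets N/2, N/3, N/4, N/5
    target = TEACHER_PARAMS // divisor
    optimizer = BayesianOptimization(
        f=make_objective(target),
        pbounds={"num_layers": (2, 12),
                 "attention_heads": (2, 12),
                 "hidden_size": (128, 768)},
        random_state=0,
    )
    optimizer.maximize(init_points=10, n_iter=100)  # cf. bayesian_init_points / bayesian_n_iter
    peer_configs.append(optimizer.max["params"])
```

Each returned configuration then seeds one peer model before mutual learning begins.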

The key difference is that pruning (as used in the original paper) only masks parameters, so the tensor shapes stay the same, while our distillation approach shrinks the architecture itself, yielding genuinely smaller peers. The contrast is illustrated below.
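A small illustration of that distinction, with assumed layer sizes rather than the project's actual modules:

```python
# Pruning keeps the original weight tensors and merely zeroes entries,
# whereas a distilled peer is built from a genuinely smaller configuration.
import torch
import torch.nn as nn

teacher_layer = nn.Linear(768, 768)

# Pruning: same 768x768 tensor, entries are only masked to zero.
mask = (torch.rand_like(teacher_layer.weight) > 0.5).float()
pruned_weight = teacher_layer.weight.data * mask
print(pruned_weight.shape)      # torch.Size([768, 768]) -- shape unchanged

# Distillation to a smaller peer: the layer itself is smaller.
peer_layer = nn.Linear(384, 384)
print(peer_layer.weight.shape)  # torch.Size([384, 384]) -- fewer parameters
```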

Weighted Mutual Learning

We use the bi-level optimization method from the paper to minimize the WML loss and ensemble loss:

  1. Inner loop: train each peer model with a weighted knowledge-distillation loss (cross entropy plus KL divergence toward the other peers)
  2. Outer loop: update the peer weights with mirror gradient descent to minimize the ensemble loss

This allows the framework to dynamically adjust the importance of each peer during training.
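The following sketch shows the shape of both loops in PyTorch. Here `alpha` corresponds to the `loss_alpha` hyperparameter below; the temperature, the detaching of peer logits, and the exact mirror-descent step are assumptions rather than the exact training code:

```python
# Minimal sketch of weighted mutual learning (assumed details, not the project's code).
import torch
import torch.nn.functional as F

def peer_loss(logits_i, peer_logits, targets, weights, i, alpha=0.5, T=1.0):
    """Inner loop: cross entropy + weighted KL divergence toward the other peers."""
    ce = F.cross_entropy(logits_i.view(-1, logits_i.size(-1)), targets.view(-1))
    kl = 0.0
    for j, logits_j in enumerate(peer_logits):
        if j == i:
            continue
        kl = kl + weights[j] * F.kl_div(
            F.log_softmax(logits_i / T, dim=-1),
            F.softmax(logits_j.detach() / T, dim=-1),
            reduction="batchmean",
        )
    return alpha * ce + (1 - alpha) * kl

def mirror_descent_step(weights, ensemble_grad, lr=0.05):
    """Outer loop: multiplicative (mirror) update of the peer weights."""
    new_w = weights * torch.exp(-lr * ensemble_grad)
    return new_w / new_w.sum()
```

The multiplicative update keeps the peer weights on the probability simplex, which is the standard mirror-descent choice for learning a weighting over peers.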

Hyperparameters of the champion peer model

| Hyperparameter | Value |
|---|---|
| weight_decay | 0.1 |
| beta1 | 0.9 |
| beta2 | 0.95 |
| bayesian_init_points | 10 |
| bayesian_n_iter | 100 |
| grad_clip | 1.0 |
| prune_importance | 'l1' |
| layer_bound | 0.9 |
| batch_size | 3 |
| block_size | 512 |
| num_epochs | 100 |
| loss_alpha | 0.5 |
| num_batches | 60 |
| warmup_iters | 5 |
| learning_rate | 0.05 |
| lr_decay_iters | 200 |
| min_lr | 0.005 |
| enable_early_stopping | True |
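One plausible reading of the schedule hyperparameters (warmup_iters, lr_decay_iters, min_lr) is a linear warmup followed by cosine decay to the floor learning rate; this is an inference from the names, not documented behavior:

```python
# Assumed learning-rate schedule implied by the hyperparameters above.
import math

def get_lr(it, learning_rate=0.05, min_lr=0.005, warmup_iters=5, lr_decay_iters=200):
    if it < warmup_iters:                # linear warmup
        return learning_rate * (it + 1) / warmup_iters
    if it > lr_decay_iters:              # after decay, hold at the floor
        return min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))   # cosine from 1 down to 0
    return min_lr + coeff * (learning_rate - min_lr)
```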

References

Zhang, M., Wang, L., Campos, D., Huang, W., Guo, C., & Yang, B. (2022). Weighted Mutual Learning with Diversity-Driven Model Compression. Advances in Neural Information Processing Systems, 35.

Model size: 27.8M parameters (F32, Safetensors)