# When Babies Teach Babies: Peer Knowledge Sharing Beats Teacher-Guided Distillation in Small-Data LMs

This model uses weighted mutual learning (WML) to find and train distilled versions of a teacher model through peer-to-peer learning. It builds on the approach described in "Weighted Mutual Learning with Diversity-Driven Model Compression" (Zhang et al., 2022), with some key differences.

## Approach

### Peer Model Initialization

Unlike the original paper, which uses differential pruning of the teacher model, we use Bayesian optimization to initialize smaller peer models:

- For example, if `num_peers = 4`, the target parameter counts are N/2, N/3, N/4, and N/5 (where N is the teacher model size)
- We optimize `num_layers`, `attention_heads`, and `hidden_size` to reach each target parameter count
- This ensures diversity among the peers while also reducing model size

The key difference is that pruning (as used in the original paper) only masks parameters, while our distillation approach actually shrinks the model architecture. A sketch of the initialization search is given in the appendix below.

### Weighted Mutual Learning

We use the bi-level optimization method from the paper to minimize the WML loss and the ensemble loss:

1. Inner loop: train the peer models with a weighted knowledge-distillation loss (cross-entropy + KL divergence)
2. Outer loop: update the peer weights with mirror gradient descent to optimize ensemble performance (ensemble loss)

This allows the framework to dynamically adjust the importance of each peer during training. A sketch of one bi-level step is given in the appendix below.

## Hyperparameters of the champion peer model

| Hyperparameter | Value |
|----------------|-------|
| weight_decay | 0.1 |
| beta1 | 0.9 |
| beta2 | 0.95 |
| bayesian_init_points | 10 |
| bayesian_n_iter | 100 |
| grad_clip | 1.0 |
| prune_importance | 'l1' |
| layer_bound | 0.9 |
| batch_size | 3 |
| block_size | 512 |
| num_epochs | 100 |
| loss_alpha | 0.5 |
| num_batches | 60 |
| warmup_iters | 5 |
| learning_rate | 0.05 |
| lr_decay_iters | 200 |
| min_lr | 0.005 |
| enable_early_stopping | True |

## References

Zhang, M., Wang, L., Campos, D., Huang, W., Guo, C., & Yang, B. (2022). Weighted Mutual Learning with Diversity-Driven Model Compression. Advances in Neural Information Processing Systems, 35.
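## Appendix: Code Sketches

The sketches below are illustrative, not excerpts from the training code; every function, variable, and bound is hypothetical unless it appears in the text above.

### Peer initialization via Bayesian optimization

The hyperparameter names `bayesian_init_points` and `bayesian_n_iter` match the `maximize(init_points=..., n_iter=...)` interface of the `bayesian-optimization` Python package, so this sketch assumes that library. It searches `num_layers`, `attention_heads`, and `hidden_size` for architectures whose estimated parameter counts hit the N/2, N/3, ... targets; the GPT-style parameter estimate and the search bounds are assumptions.

```python
# Illustrative sketch only; assumes `pip install bayesian-optimization`.
from bayes_opt import BayesianOptimization

VOCAB_SIZE = 50_304  # assumption: GPT-2-style vocabulary


def estimate_params(num_layers: int, hidden_size: int) -> int:
    """Rough GPT-style count: 12 * L * H^2 for the transformer blocks
    plus V * H for a tied token embedding."""
    return 12 * num_layers * hidden_size**2 + VOCAB_SIZE * hidden_size


def make_objective(target_params: int):
    """Score a candidate architecture by closeness to the target count."""
    def objective(num_layers, attention_heads, hidden_size):
        layers = int(round(num_layers))
        heads = max(1, int(round(attention_heads)))
        # head count must divide the hidden size, so snap to a multiple
        hidden = max(heads, int(round(hidden_size / heads)) * heads)
        return -abs(estimate_params(layers, hidden) - target_params)
    return objective


def init_peer_configs(teacher_params: int, num_peers: int = 4):
    """One architecture per peer, targeting N/2, N/3, N/4, N/5."""
    configs = []
    for k in range(num_peers):
        target = teacher_params // (k + 2)
        opt = BayesianOptimization(
            f=make_objective(target),
            pbounds={"num_layers": (2, 12),        # bounds are assumptions
                     "attention_heads": (2, 12),
                     "hidden_size": (128, 768)},
            random_state=k,
        )
        # bayesian_init_points = 10, bayesian_n_iter = 100 from the table above
        opt.maximize(init_points=10, n_iter=100)
        # round as in `objective` when instantiating the actual peer
        configs.append(opt.max["params"])
    return configs
```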
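### One bi-level WML step

A minimal sketch of the inner/outer structure described above, assuming PyTorch. The cross-entropy/distillation trade-off uses `loss_alpha` from the table; the outer update is an exponentiated-gradient step (mirror descent on the probability simplex), with a step size `eta` that is an assumption, as are the function names.

```python
# Minimal sketch, assuming PyTorch; not the repository's training loop.
import torch
import torch.nn.functional as F


def wml_step(peers, optimizers, weights, x, y, alpha=0.5, eta=0.1):
    """One inner/outer step of weighted mutual learning.

    peers      -- list of peer LMs, each mapping token ids to logits
    weights    -- 1-D tensor on the probability simplex, one entry per peer
    alpha      -- CE vs. distillation trade-off (loss_alpha in the table)
    eta        -- mirror-descent step size (assumption, not from the table)
    """
    weights = weights.detach()
    logits = [model(x) for model in peers]  # each: (batch, seq, vocab)

    # Inner loop: weighted knowledge-distillation loss for every peer.
    for i, (model, opt) in enumerate(zip(peers, optimizers)):
        ce = F.cross_entropy(logits[i].flatten(0, 1), y.flatten())
        kl = sum(
            weights[j] * F.kl_div(
                F.log_softmax(logits[i], dim=-1),
                F.softmax(logits[j], dim=-1).detach(),  # other peers are frozen targets
                reduction="batchmean",
            )
            for j in range(len(peers)) if j != i
        )
        loss = alpha * ce + (1 - alpha) * kl
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Outer loop: mirror descent (exponentiated gradient) on the peer weights,
    # driven by the loss of the weighted ensemble prediction.
    w = weights.clone().requires_grad_(True)
    ensemble = sum(w[i] * F.softmax(logits[i].detach(), dim=-1)
                   for i in range(len(peers)))
    ensemble_loss = F.nll_loss(torch.log(ensemble + 1e-9).flatten(0, 1), y.flatten())
    (grad,) = torch.autograd.grad(ensemble_loss, w)
    new_w = weights * torch.exp(-eta * grad)
    return new_w / new_w.sum()  # renormalize back onto the simplex
```

Peers whose predictions help the ensemble receive a smaller gradient and therefore a larger renormalized weight, which is how the framework shifts importance between peers during training.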
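### Learning-rate schedule implied by the table

The `warmup_iters`, `lr_decay_iters`, and `min_lr` entries follow the naming of the warmup-plus-cosine-decay schedule popularized by nanoGPT; the sketch below assumes that interpretation, which is not confirmed by the text above.

```python
# Assumed reading of warmup_iters / lr_decay_iters / min_lr
# (nanoGPT-style cosine schedule with linear warmup).
import math


def get_lr(it, learning_rate=0.05, warmup_iters=5, lr_decay_iters=200, min_lr=0.005):
    if it < warmup_iters:                        # linear warmup
        return learning_rate * (it + 1) / warmup_iters
    if it > lr_decay_iters:                      # past the decay horizon
        return min_lr
    # cosine decay from learning_rate down to min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```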