When Babies Teach Babies: Peer Knowledge Sharing Beats Teacher-Guided Distillation in Small-Data LMs

This model was trained with weighted mutual learning (WML), which finds and trains distilled versions of a teacher model through peer-to-peer learning. It builds on the approach described in "Weighted Mutual Learning with Diversity-Driven Model Compression" (Zhang et al., 2022), with some key differences.

Approach

Peer Model Initialization

Unlike the original paper, which uses differential pruning of the teacher model, we use Bayesian optimization to initialize smaller peer models (a sketch follows the list below):

  • For example, if num_peers = 4, the target parameter counts are N/2, N/3, N/4, and N/5 (where N is the teacher's parameter count)
  • We optimize num_layers, attention_heads, and hidden_size so that each peer hits its target parameter count
  • This ensures diversity among the peers while also reducing model size
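A minimal sketch of this initialization step, assuming the `bayes_opt` package and a rough GPT-style parameter-count estimate. The `bayesian_init_points` and `bayesian_n_iter` hyperparameters listed below plausibly map to its `init_points` and `n_iter` arguments; the vocabulary size, teacher size, bounds, and counting formula are illustrative assumptions, not the exact training code:

```python
# Hypothetical sketch: pick peer architectures whose parameter counts hit the
# N/2 ... N/5 targets, using Bayesian optimization over the architecture knobs.
from bayes_opt import BayesianOptimization

VOCAB_SIZE = 50257            # assumed GPT-2-style vocabulary
TEACHER_PARAMS = 124_000_000  # N, the teacher parameter count (assumed)

def estimate_params(num_layers, hidden_size):
    """Rough GPT-style count: embeddings + ~12*h^2 per transformer block."""
    return VOCAB_SIZE * hidden_size + num_layers * 12 * hidden_size ** 2

def make_objective(target_params):
    def objective(num_layers, attention_heads, hidden_size):
        # Round the continuous suggestions to valid architecture values and
        # keep hidden_size divisible by the number of attention heads
        # (head count mainly constrains shape validity here).
        heads = int(round(attention_heads))
        hidden = int(round(hidden_size / heads)) * heads
        layers = int(round(num_layers))
        # Maximize the negative gap to the target parameter count.
        return -abs(estimate_params(layers, hidden) - target_params)
    return objective

peer_configs = []
for divisor in range(2, 6):            # targets N/2, N/3, N/4, N/5
    target = TEACHER_PARAMS // divisor
    optimizer = BayesianOptimization(
        f=make_objective(target),
        pbounds={"num_layers": (2, 12),
                 "attention_heads": (2, 12),
                 "hidden_size": (128, 768)},
        random_state=0,
    )
    optimizer.maximize(init_points=10, n_iter=100)  # cf. bayesian_init_points / bayesian_n_iter
    peer_configs.append(optimizer.max["params"])
```

Each returned configuration then seeds one peer model before mutual learning begins.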

The key difference is that pruning (as used in the original paper) only masks parameters, so the tensor shapes stay the same, while our distillation approach shrinks the architecture itself, yielding genuinely smaller peers. The contrast is illustrated below.
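A small illustration of that distinction, with assumed layer sizes rather than the project's actual modules:

```python
# Pruning keeps the original weight tensors and merely zeroes entries,
# whereas a distilled peer is built from a genuinely smaller configuration.
import torch
import torch.nn as nn

teacher_layer = nn.Linear(768, 768)

# Pruning: same 768x768 tensor, entries are only masked to zero.
mask = (torch.rand_like(teacher_layer.weight) > 0.5).float()
pruned_weight = teacher_layer.weight.data * mask
print(pruned_weight.shape)      # torch.Size([768, 768]) -- shape unchanged

# Distillation to a smaller peer: the layer itself is smaller.
peer_layer = nn.Linear(384, 384)
print(peer_layer.weight.shape)  # torch.Size([384, 384]) -- fewer parameters
```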

Weighted Mutual Learning

We use the bi-level optimization method from the paper to minimize the WML loss and ensemble loss:

  1. Inner loop: train each peer model with a weighted knowledge-distillation loss (cross entropy plus KL divergence toward the other peers)
  2. Outer loop: update the peer weights with mirror gradient descent to minimize the ensemble loss

This allows the framework to dynamically adjust the importance of each peer during training.
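The following sketch shows the shape of both loops in PyTorch. Here `alpha` corresponds to the `loss_alpha` hyperparameter below; the temperature, the detaching of peer logits, and the exact mirror-descent step are assumptions rather than the exact training code:

```python
# Minimal sketch of weighted mutual learning (assumed details, not the project's code).
import torch
import torch.nn.functional as F

def peer_loss(logits_i, peer_logits, targets, weights, i, alpha=0.5, T=1.0):
    """Inner loop: cross entropy + weighted KL divergence toward the other peers."""
    ce = F.cross_entropy(logits_i.view(-1, logits_i.size(-1)), targets.view(-1))
    kl = 0.0
    for j, logits_j in enumerate(peer_logits):
        if j == i:
            continue
        kl = kl + weights[j] * F.kl_div(
            F.log_softmax(logits_i / T, dim=-1),
            F.softmax(logits_j.detach() / T, dim=-1),
            reduction="batchmean",
        )
    return alpha * ce + (1 - alpha) * kl

def mirror_descent_step(weights, ensemble_grad, lr=0.05):
    """Outer loop: multiplicative (mirror) update of the peer weights."""
    new_w = weights * torch.exp(-lr * ensemble_grad)
    return new_w / new_w.sum()
```

The multiplicative update keeps the peer weights on the probability simplex, which is the standard mirror-descent choice for learning a weighting over peers.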

Hyperparameters of the champion peer model

| Hyperparameter | Value |
|---|---|
| weight_decay | 0.1 |
| beta1 | 0.9 |
| beta2 | 0.95 |
| bayesian_init_points | 10 |
| bayesian_n_iter | 100 |
| grad_clip | 1.0 |
| prune_importance | 'l1' |
| layer_bound | 0.9 |
| batch_size | 3 |
| block_size | 512 |
| num_epochs | 100 |
| loss_alpha | 0.5 |
| num_batches | 60 |
| warmup_iters | 5 |
| learning_rate | 0.05 |
| lr_decay_iters | 200 |
| min_lr | 0.005 |
| enable_early_stopping | True |
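One plausible reading of the schedule hyperparameters (warmup_iters, lr_decay_iters, min_lr) is a linear warmup followed by cosine decay to the floor learning rate; this is an inference from the names, not documented behavior:

```python
# Assumed learning-rate schedule implied by the hyperparameters above.
import math

def get_lr(it, learning_rate=0.05, min_lr=0.005, warmup_iters=5, lr_decay_iters=200):
    if it < warmup_iters:                # linear warmup
        return learning_rate * (it + 1) / warmup_iters
    if it > lr_decay_iters:              # after decay, hold at the floor
        return min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))   # cosine from 1 down to 0
    return min_lr + coeff * (learning_rate - min_lr)
```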

References

Zhang, M., Wang, L., Campos, D., Huang, W., Guo, C., & Yang, B. (2022). Weighted Mutual Learning with Diversity-Driven Model Compression. Advances in Neural Information Processing Systems, 35.

Model size: 27.8M parameters (F32, Safetensors)