Production Efficient AI
Goal: Train a smaller (student) model that matches the performance of a larger (teacher) model
Model:
Dataset:
Loss: KL divergence between teacher and student output distributions (softened logits)
Steps:
Knowledge distillation
Idea: augment the ground-truth labels with a distribution of "soft probabilities" produced by the teacher
KD mechanism
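The soft-target mechanism above can be sketched as follows. This is a minimal numpy illustration, not a training loop; the function name `kd_loss` and the hyperparameters `T` (temperature) and `alpha` (mixing weight) are assumptions chosen for the example.

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-scaled softmax; higher T gives softer probabilities
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # soft targets: teacher distribution softened by temperature T
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    # KL(teacher || student) on the softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1).mean() * T * T
    # standard cross-entropy against the hard ground-truth labels
    p_hard = softmax(student_logits, 1.0)
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12).mean()
    # mix the soft (distillation) and hard (supervised) terms
    return alpha * kl + (1 - alpha) * ce
```

When the student already matches the teacher exactly, the KL term vanishes and only the hard-label cross-entropy remains.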
Example: DistilBERT
L = α L_mlm + β L_KD + γ L_cos, where
L_mlm is the original masked language modeling loss the teacher (T) was trained with
L_KD is the KL divergence between the S and T output distributions
L_cos is the cosine distance between S and T hidden states
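The cosine term and the combination of the three losses can be sketched like this. A minimal numpy sketch: the function names and the equal default weights are assumptions, and in practice the weights α, β, γ are tuned hyperparameters.

```python
import numpy as np

def cos_loss(h_s, h_t):
    # 1 - cosine similarity, averaged over positions: pushes student
    # hidden states to point in the same direction as the teacher's
    num = (h_s * h_t).sum(axis=-1)
    den = np.linalg.norm(h_s, axis=-1) * np.linalg.norm(h_t, axis=-1) + 1e-12
    return (1.0 - num / den).mean()

def triple_loss(l_mlm, l_kd, l_cos, alpha=1.0, beta=1.0, gamma=1.0):
    # linear combination of the three DistilBERT loss terms
    return alpha * l_mlm + beta * l_kd + gamma * l_cos
```

Identical hidden states give a cosine loss of zero; orthogonal ones give a loss of one.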
How to choose student model?
In practice, distillation is most efficient when:
the teacher and student share the same model type (architecture)
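One practical payoff of a shared architecture: the student can be initialized directly from the teacher's weights. DistilBERT, for instance, seeds its student by copying every other transformer layer from the teacher. A minimal sketch of that layer selection (the function name and `keep_every` parameter are assumptions for illustration):

```python
def init_student_layers(teacher_layers, keep_every=2):
    # because the student shares the teacher's architecture (just with
    # fewer layers), it can be seeded by copying every `keep_every`-th
    # layer from the teacher instead of training from scratch
    return [layer for i, layer in enumerate(teacher_layers) if i % keep_every == 0]
```

For a 12-layer teacher this yields a 6-layer student built from teacher layers 0, 2, 4, 6, 8, and 10.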