Production efficient AI

Goal: Train a smaller (student) model that matches the performance of a larger (teacher) model

Model

Dataset:  

Loss: KL divergence between the teacher's and the student's output distributions (softmax of their logits); see the sketch after this list

Steps:
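
As a sketch of this loss (the symbols here are assumed notation, not fixed above): with teacher logits $z_t$, student logits $z_s$, a temperature $T$, and a mixing weight $\alpha$, the standard distillation objective combines a hard-label term and a soft-label term:

$$
\mathcal{L} = (1-\alpha)\,\mathrm{CE}\big(y,\ \mathrm{softmax}(z_s)\big) + \alpha\, T^{2}\, \mathrm{KL}\big(\mathrm{softmax}(z_t/T)\,\|\,\mathrm{softmax}(z_s/T)\big)
$$

The $T^{2}$ factor keeps the soft-label gradients on the same scale as the hard-label ones when $T > 1$.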


Knowledge distillation

Idea: augment the ground-truth labels with a distribution of “soft probabilities” from the teacher
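
The “soft probabilities” are typically a temperature-scaled softmax of the teacher's logits (a sketch; $z$ and $T$ are assumed symbols):

$$
p_i^{(T)} = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
$$

A higher $T$ spreads probability mass over the non-target classes, so the student also sees which wrong classes the teacher considers plausible.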

KD mechanism
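
A minimal sketch of one distillation training step, assuming a PyTorch-style setup; the names (distillation_step, teacher, student, T, alpha) are illustrative assumptions, not anything fixed by these notes:

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, x, y, T=2.0, alpha=0.5):
    # Teacher is frozen: run it without building a gradient graph.
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(x)

    student_logits = student(x)

    # Hard-label term: ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, y)

    # Soft-label term: KL divergence between temperature-softened distributions.
    # F.kl_div expects log-probabilities as input and probabilities as target.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 rescales the soft gradients back to the hard-loss scale

    loss = (1 - alpha) * hard_loss + alpha * soft_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

alpha and T are the usual knobs: alpha close to 1 leans almost entirely on the teacher's soft targets, while T = 1 reduces the soft term to matching the teacher's ordinary output distribution.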

Example: DistilBERT

where 
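
For reference, DistilBERT's training objective is usually described as a weighted sum of three terms (a hedged sketch; the weights $\alpha_i$ are assumed symbols):

$$
\mathcal{L}_{\text{DistilBERT}} = \alpha_1 \mathcal{L}_{\text{ce}} + \alpha_2 \mathcal{L}_{\text{mlm}} + \alpha_3 \mathcal{L}_{\text{cos}}
$$

Here $\mathcal{L}_{\text{ce}}$ is the distillation loss against the teacher's temperature-softened output distribution, $\mathcal{L}_{\text{mlm}}$ is the standard masked-language-modeling loss on the hard labels, and $\mathcal{L}_{\text{cos}}$ is a cosine-embedding loss aligning the directions of teacher and student hidden states.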

How to choose the student model?

In practice, we observe efficiency gains when: