Production Efficient AI
Goal: Train a smaller (student) model that matches the performance of a larger (teacher) model
Model:
Dataset:
Loss: KL divergence between teacher and student output distributions (softened logits)
Steps:
Knowledge distillation
Idea: augment the ground-truth labels with a distribution of "soft probabilities" produced by the teacher
KD mechanism
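The soft-target mechanism above can be sketched as follows. This is a minimal numpy illustration, not a training loop; the function name `kd_loss` and the hyperparameters `T` (temperature) and `alpha` (mixing weight) are assumptions chosen for the example.

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-scaled softmax; higher T gives softer probabilities
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # soft targets: teacher distribution softened by temperature T
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    # KL(teacher || student) on the softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1).mean() * T * T
    # standard cross-entropy against the hard ground-truth labels
    p_hard = softmax(student_logits, 1.0)
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12).mean()
    # mix the soft (distillation) and hard (supervised) terms
    return alpha * kl + (1 - alpha) * ce
```

When the student already matches the teacher exactly, the KL term vanishes and only the hard-label cross-entropy remains.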
Example: DistilBERT
L = α L_mlm + β L_KD + γ L_cos, where
L_mlm is the original masked language modeling loss the teacher (T) was trained with
L_KD is the KL divergence between the S and T output distributions
L_cos is the cosine distance between S and T hidden states
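The cosine term and the combination of the three losses can be sketched like this. A minimal numpy sketch: the function names and the equal default weights are assumptions, and in practice the weights α, β, γ are tuned hyperparameters.

```python
import numpy as np

def cos_loss(h_s, h_t):
    # 1 - cosine similarity, averaged over positions: pushes student
    # hidden states to point in the same direction as the teacher's
    num = (h_s * h_t).sum(axis=-1)
    den = np.linalg.norm(h_s, axis=-1) * np.linalg.norm(h_t, axis=-1) + 1e-12
    return (1.0 - num / den).mean()

def triple_loss(l_mlm, l_kd, l_cos, alpha=1.0, beta=1.0, gamma=1.0):
    # linear combination of the three DistilBERT loss terms
    return alpha * l_mlm + beta * l_kd + gamma * l_cos
```

Identical hidden states give a cosine loss of zero; orthogonal ones give a loss of one.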
How to choose student model?
In practice, distillation is most efficient when:
the teacher and student share the same model type (architecture)
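One practical payoff of a shared architecture: the student can be initialized directly from the teacher's weights. DistilBERT, for instance, seeds its student by copying every other transformer layer from the teacher. A minimal sketch of that layer selection (the function name and `keep_every` parameter are assumptions for illustration):

```python
def init_student_layers(teacher_layers, keep_every=2):
    # because the student shares the teacher's architecture (just with
    # fewer layers), it can be seeded by copying every `keep_every`-th
    # layer from the teacher instead of training from scratch
    return [layer for i, layer in enumerate(teacher_layers) if i % keep_every == 0]
```

For a 12-layer teacher this yields a 6-layer student built from teacher layers 0, 2, 4, 6, 8, and 10.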