SLIDE 1

When does label smoothing help?

Rafael Müller, Simon Kornblith, Geoffrey Hinton

SLIDE 2

Label smoothing


Improves performance across different tasks and architectures. However, why it works is not well understood.

SLIDE 3

Preliminaries


Cross-entropy: $H(\mathbf{y}, \mathbf{p}) = -\sum_{k=1}^{K} y_k \log p_k$

Modified targets with label smoothing (smoothing parameter $\alpha$, $K$ classes): $y_k^{LS} = y_k (1 - \alpha) + \alpha / K$

[Figure: bar plots over classes 1–4 comparing the predictions, the hard target, and the label-smoothing target]
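As a concrete illustration, here is a minimal NumPy sketch of these two formulas (function and variable names are my own, not from the slides):

```python
import numpy as np

def smooth_targets(y_hard, alpha):
    """Mix one-hot targets with the uniform distribution over K classes."""
    K = y_hard.shape[-1]
    return y_hard * (1.0 - alpha) + alpha / K

def cross_entropy(targets, probs, eps=1e-12):
    """H(y, p) = -sum_k y_k * log(p_k), averaged over the batch."""
    return float(-(targets * np.log(probs + eps)).sum(axis=-1).mean())

# 4 classes, true class is class 2, alpha = 0.1
y = np.array([[0.0, 1.0, 0.0, 0.0]])
p = np.array([[0.1, 0.7, 0.1, 0.1]])
print(smooth_targets(y, 0.1))   # [[0.025 0.925 0.025 0.025]]
print(cross_entropy(smooth_targets(y, 0.1), p))
```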

SLIDE 4


Penultimate layer representations

SLIDE 5

Penultimate layer representations


k-th logit: $z_k = \mathbf{x}^\top \mathbf{w}_k$, where $\mathbf{x}$ is the activation vector of the penultimate layer and $\mathbf{w}_k$ is the last-layer weight vector for the k-th logit (the class's prototype).

Logits are therefore a measure of the (squared Euclidean) distance between the penultimate-layer activations and the class prototypes.
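One way to see this (a standard expansion, not written out on the slide):

```latex
\|\mathbf{x} - \mathbf{w}_k\|^2
  = \mathbf{x}^\top\mathbf{x} \;-\; 2\,\mathbf{x}^\top\mathbf{w}_k \;+\; \mathbf{w}_k^\top\mathbf{w}_k
```

The first term is shared by all classes and the last is typically close to constant across classes, so ranking classes by the logit $\mathbf{x}^\top\mathbf{w}_k$ approximately ranks them by closeness to the prototype $\mathbf{w}_k$.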

SLIDE 6

Projecting penultimate layer activations in 2-D


Pick 3 classes (k1, k2, k3) and their corresponding templates. Project the activations onto the plane connecting the 3 templates.

With label smoothing, the activation is close to the prototype of the correct class and equally distant from the prototypes of all remaining classes.

[Figure: 2-D projections of penultimate-layer activations, without label smoothing vs. with label smoothing]
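A minimal NumPy sketch of this projection, building an orthonormal basis of the plane through the three templates (all names here are my own):

```python
import numpy as np

def plane_basis(w1, w2, w3):
    """Orthonormal basis of the plane through three template vectors."""
    u = w2 - w1
    v = w3 - w1
    e1 = u / np.linalg.norm(u)
    v_orth = v - (v @ e1) * e1           # remove the component along e1
    e2 = v_orth / np.linalg.norm(v_orth)
    return np.stack([e1, e2])            # shape (2, d)

def project(acts, w1, basis):
    """Project activations of shape (n, d) to 2-D coordinates on the plane."""
    return (acts - w1) @ basis.T         # shape (n, 2)

# Usage with hypothetical last-layer weights W[k] and activations X:
# B = plane_basis(W[k1], W[k2], W[k3])
# coords = project(X, W[k1], B)   # then scatter-plot coords, colored by class
```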

SLIDE 7


Implicit Calibration

SLIDE 8

Calibration


A network is calibrated if, for a softmax confidence of X, the prediction is correct X·100% of the time.

A reliability diagram bins the network's confidence in its top prediction and computes the accuracy within each bin.

Modern neural networks are overconfident.

[Figure: reliability diagram on CIFAR-100]
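A minimal NumPy sketch of the reliability-diagram binning, together with the expected calibration error it induces (equal-width bins are an assumption, though a common one):

```python
import numpy as np

def reliability_bins(confidences, correct, n_bins=15):
    """Bin top-prediction confidences, compute accuracy per bin, and
    accumulate the expected calibration error (ECE)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    stats, ece = [], 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        conf = confidences[mask].mean()   # average confidence in the bin
        acc = correct[mask].mean()        # accuracy in the bin
        frac = mask.mean()                # fraction of samples in the bin
        ece += frac * abs(acc - conf)
        stats.append((conf, acc, frac))
    return stats, ece

# Usage with hypothetical model outputs `probs` (n, K) and labels (n,):
# conf = probs.max(axis=1)
# correct = (probs.argmax(axis=1) == labels).astype(float)
# stats, ece = reliability_bins(conf, correct)
```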

SLIDE 9

Calibration


A network is calibrated if, for a softmax confidence of X, the prediction is correct X·100% of the time.

A reliability diagram bins the network's confidence in its top prediction and computes the accuracy within each bin.

Modern neural networks are overconfident, but simple temperature scaling of the logits is surprisingly effective.
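Temperature scaling divides the logits by a single scalar T fit on held-out data. A minimal sketch (grid search stands in for the LBFGS fit usually used; that substitution is mine):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Negative log-likelihood of the true labels at temperature T."""
    p = softmax(logits / T)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the T that minimizes validation NLL (simple grid search)."""
    return min(grid, key=lambda T: nll(logits, labels, T))

# Usage with hypothetical held-out logits (n, K) and labels (n,):
# T = fit_temperature(val_logits, val_labels)
# calibrated_probs = softmax(test_logits / T)
```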

SLIDE 10

Calibration


A network is calibrated if, for a softmax confidence of X, the prediction is correct X·100% of the time.

A reliability diagram bins the network's confidence in its top prediction and computes the accuracy within each bin.

Modern neural networks are overconfident, but simple temperature scaling of the logits is surprisingly effective. Label smoothing has a similar effect to temperature scaling (green curve).

SLIDE 11

Calibration with beam-search

English-to-German translation using Transformer


Expected calibration error (ECE).

Beam search benefits from calibrated predictions (higher BLEU score).

Calibration partly explains why label smoothing helps translation (despite hurting perplexity).

[Figure: results with and without label smoothing]
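For reference, the standard definition of ECE over $M$ confidence bins $B_m$ and $n$ samples (this definition comes from the calibration literature; the slide only names the metric):

```latex
\mathrm{ECE} \;=\; \sum_{m=1}^{M} \frac{|B_m|}{n}\,
  \left|\, \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \,\right|
```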

SLIDE 12


Knowledge distillation

SLIDE 13

Knowledge distillation


Toy experiment on MNIST. Something goes seriously wrong with distillation when the teacher is trained with label smoothing: label smoothing improves the teacher's generalization but hurts knowledge transfer to the student.

[Figure: teacher/distilled-student accuracy gap without vs. with label smoothing; baseline shown for a narrow student trained without distillation]
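For context, the standard distillation objective (Hinton et al.) matches the student to the teacher's temperature-softened outputs. A minimal sketch of that loss, not the exact setup of the MNIST experiment:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mix cross-entropy on hard labels with KL to the teacher's soft targets."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    # KL(teacher || student) at temperature T, averaged over the batch
    soft = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=1).mean()
    p_hard = softmax(student_logits)
    hard = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * hard + (1 - alpha) * (T ** 2) * soft  # T^2 keeps gradient scales comparable
```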

SLIDE 14

Revisiting representations: training set


[Figure: training-set projections with hard targets vs. with label smoothing]

Information lost with label smoothing:

  • Confidence differences between examples of the same class
  • Similarity structure between classes
  • Examples become harder to distinguish, so there is less information for distillation!

SLIDE 15

Measuring how much the logit remembers the input


x => index of image from the training set
z => image
d() => random data augmentation
f() => image to difference between two logits (includes the neural network)
y => real-valued, single dimension

Approximate p(y|x) as a Gaussian with mean and variance calculated via Monte Carlo.
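A minimal sketch of that Monte Carlo Gaussian approximation of p(y|x); the augmentation d and network-based map f are passed in as stand-ins, and all names are my own:

```python
import numpy as np

def gaussian_fit_per_example(images, f, d, n_samples=64):
    """For each training image z, draw augmented copies d(z), push them
    through f (the logit difference), and fit a Gaussian to the samples."""
    means, variances = [], []
    for z in images:
        ys = np.array([f(d(z)) for _ in range(n_samples)])  # Monte Carlo samples of y given x
        means.append(ys.mean())
        variances.append(ys.var())
    return np.array(means), np.array(variances)

# p(y | x) is then approximated as N(means[x], variances[x]); these Gaussians
# feed the estimate of how much the logit remembers the input index x.
```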

SLIDE 16


Summary

SLIDE 17


Summary

Label smoothing attenuates differences between examples and classes.

Label smoothing helps:
1. Better accuracy across datasets and architectures
2. Implicitly calibrates the model's predictions
3. Calibration helps beam search
   a. partly explaining the success of label smoothing in translation

Label smoothing does not help:
1. Better teachers may distill worse, i.e. teachers trained with label smoothing distill poorly
   a. Explained visually and by the reduction in mutual information

Poster #164