When does label smoothing help?
Rafael Müller, Simon Kornblith, Geoffrey Hinton
Label smoothing
Label smoothing improves performance across many different tasks and architectures; however, why it works is not well understood.
Preliminaries
Cross-entropy with modified targets: label smoothing mixes the one-hot target with the uniform distribution over the $K$ classes, $y_k^{LS} = y_k(1 - \alpha) + \alpha/K$, where $\alpha$ is the smoothing parameter.
[Figure: bar charts over 4 classes comparing the network's predictions, the hard target, and the label smoothing target]
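To make the modified targets concrete, here is a minimal NumPy sketch of label-smoothed targets and the resulting cross-entropy; the shapes, logits, and alpha = 0.1 are illustrative choices, not values from the paper.

    import numpy as np

    def smooth_targets(labels, num_classes, alpha=0.1):
        # y_k^LS = y_k * (1 - alpha) + alpha / K: mix one-hot with uniform.
        one_hot = np.eye(num_classes)[labels]
        return one_hot * (1.0 - alpha) + alpha / num_classes

    def cross_entropy(logits, targets):
        # Numerically stable log-softmax, then -sum(targets * log p).
        z = logits - logits.max(axis=-1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
        return -(targets * log_probs).sum(axis=-1)

    logits = np.array([[2.0, 0.5, -1.0, 0.0]])                # one example, K = 4
    labels = np.array([0])
    print(cross_entropy(logits, smooth_targets(labels, 4)))   # smoothed targets
    print(cross_entropy(logits, np.eye(4)[labels]))           # hard targets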
Penultimate layer representations
The k-th logit is the inner product $x^\top w_k$ of the penultimate-layer activations $x$ with the last-layer weights $w_k$ for the k-th logit (the class's prototype, or template).
Logits are therefore an approximate measure of the distance between the penultimate-layer activations and the class prototypes, as the expansion below shows.
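The distance interpretation follows from standard algebra, consistent with the paper's argument:

$$\lVert x - w_k \rVert^2 = x^\top x - 2\,x^\top w_k + w_k^\top w_k$$

Here $x^\top x$ is the same for every class (it cancels in the softmax) and $w_k^\top w_k$ is usually roughly constant across classes, so ranking classes by the logit $x^\top w_k$ approximately ranks them by how close $x$ is to each template.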
Projecting penultimate layer activations in 2-D
1. Pick 3 classes (k1, k2, k3) and their corresponding templates.
2. Project the activations onto the plane connecting the 3 templates.
With label smoothing, each activation is close to the prototype of the correct class and equally distant from the prototypes of all remaining classes.
[Figure: 2-D projections of penultimate-layer activations, without label smoothing (left) vs. with label smoothing (right)]
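A minimal sketch of the projection step, assuming activations is an (N, D) array of penultimate-layer activations and W is a (K, D) array of last-layer weight vectors; orthonormalizing the plane with a QR decomposition is one reasonable choice here, not necessarily the authors' exact procedure.

    import numpy as np

    def project_to_template_plane(activations, W, classes):
        # Plane through the three templates, spanned by two difference vectors.
        t1, t2, t3 = (W[k] for k in classes)
        basis, _ = np.linalg.qr(np.stack([t2 - t1, t3 - t1], axis=1))  # (D, 2)
        # Coordinates of each activation in that orthonormal 2-D basis.
        return (activations - t1) @ basis                              # (N, 2)

    # Hypothetical shapes: 1000 examples, 64-D penultimate layer, 10 classes.
    acts = np.random.randn(1000, 64)
    W = np.random.randn(10, 64)
    print(project_to_template_plane(acts, W, classes=(0, 1, 2)).shape)  # (1000, 2)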
Implicit calibration
Calibration

A network is calibrated if, for a softmax confidence of X, the prediction is correct X*100% of the time. A reliability diagram bins the network's confidence in its max prediction and calculates the accuracy within each bin.

Modern neural networks are overconfident (shown on CIFAR-100), but simple logit temperature scaling is surprisingly effective, and label smoothing has a similar effect to temperature scaling (green curve).
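A rough sketch of the two ingredients above, temperature-scaled softmax and reliability-diagram binning; the temperature and bin count are illustrative placeholders, not the paper's settings.

    import numpy as np

    def softmax(logits, T=1.0):
        # Temperature T > 1 softens overconfident predictions.
        z = logits / T
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def reliability_bins(logits, labels, T=1.0, n_bins=15):
        # Per-bin (mean confidence, accuracy, count) for a reliability diagram.
        probs = softmax(logits, T)
        conf = probs.max(axis=-1)
        correct = probs.argmax(axis=-1) == labels
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        bins = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (conf > lo) & (conf <= hi)
            if mask.any():
                bins.append((conf[mask].mean(), correct[mask].mean(), mask.sum()))
        return bins

The expected calibration error used on the next slide is a weighted average of the per-bin gap between accuracy and confidence.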
Calibration with beam search
English-to-German translation using a Transformer
- Expected calibration error (ECE; defined below) measures miscalibration.
- Beam search benefits from calibrated predictions (higher BLEU score).
- Calibration partly explains why label smoothing helps translation (despite hurting perplexity).
[Figure/table: BLEU and ECE with vs. without label smoothing]
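For reference, the standard definition of ECE over the reliability-diagram bins, with $B$ bins, $n_b$ examples in bin $b$, and $N$ examples in total:

$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{N}\,\bigl|\mathrm{acc}(b) - \mathrm{conf}(b)\bigr|$$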
Knowledge distillation
Toy experiment on MNIST: something goes seriously wrong with distillation when the teacher is trained with label smoothing. Label smoothing improves the teacher's generalization but hurts knowledge transfer to the student.
[Figure: test error of the narrow student (no distillation) vs. the distilled student; teacher/distilled-student gap without label smoothing vs. with label smoothing]
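For context, a minimal sketch of the standard distillation loss used in such experiments; the temperature T = 2.0 is an illustrative choice, not the paper's setting.

    import numpy as np

    def softmax(z, T=1.0):
        z = z / T
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        # Cross-entropy between temperature-softened teacher and student.
        # The teacher's relative probabilities for wrong classes carry the
        # similarity structure that label smoothing erases.
        p_teacher = softmax(teacher_logits, T)
        log_p_student = np.log(softmax(student_logits, T) + 1e-12)
        return -(p_teacher * log_p_student).sum(axis=-1).mean()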
Revisiting penultimate-layer representations (training set)
[Figure: penultimate-layer projections of training examples, trained with hard targets vs. with label smoothing]

Information lost with label smoothing:
- Confidence differences between examples of the same class
- Similarity structure between classes
- Examples become harder to distinguish, so there is less information for distillation!
Measuring how much the logit remembers the input
- x: index of an image from the training set
- z: the image
- d(·): a random data augmentation
- f(·): maps an image to the difference between two logits (includes the neural network)
- y = f(d(z)): a real-valued scalar

Approximate p(y|x) as a Gaussian with mean and variance calculated via Monte Carlo over augmentations, and use it to estimate the mutual information between the input index X and the logit difference Y.
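One way to realize that recipe as code: a minimal sketch assuming f and d are Python callables matching the definitions above; the Gaussian-mixture entropy estimate is an assumption on my part, not necessarily the paper's exact estimator.

    import numpy as np

    def conditional_gaussian(f, d, z, n_samples=100):
        # p(y|x) ~ N(mu, var), moments from Monte Carlo over augmentations.
        ys = np.array([f(d(z)) for _ in range(n_samples)])
        return ys.mean(), ys.var() + 1e-8  # variance floor for stability

    def mutual_information(f, d, images, n_samples=100, n_eval=1000):
        # I(X; Y) = H(Y) - H(Y|X), with X the training index and Y = f(d(z_x)).
        stats = [conditional_gaussian(f, d, z, n_samples) for z in images]
        mus = np.array([m for m, _ in stats])
        vars_ = np.array([v for _, v in stats])
        # H(Y|X): average differential entropy of the per-example Gaussians.
        h_y_given_x = 0.5 * np.mean(np.log(2 * np.pi * np.e * vars_))
        # H(Y): Monte Carlo estimate under the mixture p(y) = mean_x p(y|x).
        idx = np.random.randint(len(images), size=n_eval)
        ys = np.random.normal(mus[idx], np.sqrt(vars_[idx]))
        dens = np.mean(
            np.exp(-(ys[:, None] - mus[None, :]) ** 2 / (2 * vars_[None, :]))
            / np.sqrt(2 * np.pi * vars_[None, :]),
            axis=1,
        )
        h_y = -np.mean(np.log(dens))
        return h_y - h_y_given_x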
Summary
Label smoothing attenuates differences between examples and classes.

Label smoothing helps:
1. Better accuracy across datasets and architectures
2. Implicit calibration of the model's predictions
3. Calibration helps beam search
   a. partly explaining the success of label smoothing in translation

Label smoothing does not help:
1. Better teachers may distill worse, i.e., a teacher trained with label smoothing distills poorly
   a. explained visually and by the reduction in mutual information between input and logits
Poster #164