Born Again Neural Networks - Tommaso Furlanello, Zachary C. Lipton, et al. - PowerPoint PPT Presentation



SLIDE 1

Born Again Neural Networks

Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Contact: furlanel@usc.edu, or for Twitter trolling → @furlanel

SLIDE 2

Born Again Neural Networks

Knowledge distillation between identical neural network architectures systematically improves the student's performance.
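The sequential procedure behind this claim can be sketched as follows. This is a minimal illustration, not the paper's implementation: `train_step` and `init_model` are hypothetical stand-ins for a real training loop and model factory.

```python
# Sketch of the born-again (BAN) procedure: train a teacher on the labels,
# then train a sequence of architecturally identical students, each one
# distilled from the previous generation. `train_step` and `init_model`
# are hypothetical stand-ins for a real training loop and model factory.

def born_again(train_step, init_model, generations=3):
    """Return the models of all generations (useful both for the final
    student and for inter-generational ensembling)."""
    teacher = train_step(init_model(), teacher=None)  # generation 0: labels only
    models = [teacher]
    for _ in range(generations):
        # Each student matches the previous generation's output distribution.
        student = train_step(init_model(), teacher=teacher)
        models.append(student)
        teacher = student
    return models
```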

SLIDE 3

Born Again Neural Networks

Why Born Again ???

SLIDE 4

Born Again Neural Networks

Why Born Again ???

SLIDE 5

Dark Knowledge Under the Light

The general interpretation of knowledge distillation is that it conveys some "dark knowledge" hidden in the output scores of the teacher, which reveals learned similarities between target categories.

SLIDE 6

Dark Knowledge Under the Light

[Figure: ground-truth labels P(Y) vs. model outputs F(X), showing each output value's contribution to the cross-entropy loss against the ground-truth baseline]

Cross-entropy loss function with one-hot labels:

  • Only the dimension corresponding to the correct category contributes to the loss function.
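In code, the one-hot baseline looks like this (a minimal numpy sketch, not the paper's implementation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def one_hot_xent(logits, label):
    """Cross-entropy with a one-hot label: only the log-probability of the
    correct category enters the loss; every other dimension contributes 0."""
    return -np.log(softmax(logits)[label])

# Per-dimension contributions: zero everywhere except the correct class.
logits = np.array([2.0, 1.0, 0.1])
one_hot = np.eye(3)[1]                          # ground truth is class 1
contributions = -one_hot * np.log(softmax(logits))
```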

SLIDE 7

Dark Knowledge Under the Light

Knowledge Distillation

[Figure: teacher outputs Ft(x) used as targets for the student outputs F(X), showing each output value's contribution to the cross-entropy loss]

Cross-entropy loss function with teacher outputs:

  • The error in the outputs of all categories contributes to the loss function.
  • If the teacher is highly accurate and certain, this is virtually identical to using the original labels.
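A minimal numpy sketch of this distillation loss (temperature scaling is omitted, and the helper names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits):
    """Cross-entropy of the student against the teacher's soft outputs:
    every output dimension contributes to the loss, so similarities the
    teacher learned between categories are passed on to the student."""
    p_t = softmax(teacher_logits)            # teacher soft targets Ft(x)
    log_p_s = np.log(softmax(student_logits))
    return -(p_t * log_p_s).sum(axis=-1)
```

When the teacher's distribution is nearly one-hot, the non-max terms vanish and the loss reduces to ordinary cross-entropy with the original labels, matching the slide's second bullet.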

SLIDE 8

CIFAR-100 Object Classification (100 Categories)

BAN - DenseNets

  • Students have systematically lower test error than the identical teacher.
  • The most complex baseline model, DenseNet-80-120 with 50.4M parameters, reaches a test error of 16.87.
  • The smallest BAN, DenseNet-112-33 with 6.3M parameters, reaches a test error of 16.59 after 3 generations, lower than the most complex baseline.

SLIDE 9

BAN - DenseNets

DenseNet-90-60 is used as the teacher, with students that share the same hidden-state size after each spatial transition but differ in depth and compression rate. BAN+L uses both labels and knowledge distillation. Inter-generational ensembles improve over the individual models.
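The BAN+L objective combines both signals. A minimal numpy sketch; the equal weighting of the two terms is an assumption for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ban_plus_l_loss(student_logits, label, teacher_logits):
    """BAN+L: sum of the ground-truth label term and the distillation term.
    The 1:1 weighting here is illustrative, not taken from the paper."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    label_ce = -np.log(p_s[label])           # one-hot label term
    kd_ce = -(p_t * np.log(p_s)).sum()       # teacher soft-target term
    return label_ce + kd_ce
```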

SLIDE 10

CIFAR-10 Object Classification (10 Categories)

BAN - CIFAR-10

SLIDE 11

Dark Knowledge Under the Light

Confidence Weighted by Teacher Max (CWTM); Dark Knowledge with Permuted Predictions (DKPP)

Two experimental treatments disentangle the contributions to the KD loss function of:

  • the single dimension corresponding to the teacher's predicted category;
  • the dimensions corresponding to the teacher's non-predicted categories.

SLIDE 12

Dark Knowledge Under the Light

Dark Knowledge with Permuted Predictions

[Figure: permuted teacher outputs Ft(x) as targets for the student outputs F(X), showing each output value's contribution to the cross-entropy loss]

Cross-entropy loss function with permuted teacher outputs for the non-max categories:

  • The error in the outputs of all categories contributes to the loss function.
  • Information in the non-max categories is permuted.
  • The contribution of the max dimension is isolated.
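The permutation treatment can be sketched as follows. Permuting the teacher's softmax outputs (rather than logits) and using a single example are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dkpp_targets(teacher_logits, rng):
    """Dark Knowledge with Permuted Predictions: keep the teacher's max
    dimension in place and randomly permute the remaining outputs. This
    destroys the teacher's per-category similarity information while
    preserving the max dimension's contribution to the loss."""
    p = softmax(teacher_logits).copy()
    k = int(np.argmax(p))
    rest = np.delete(np.arange(p.size), k)   # indices of non-max categories
    p[rest] = p[rng.permutation(rest)]       # shuffle only the non-max values
    return p
```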

SLIDE 13

Dark Knowledge Under the Light

Confidence Weighted by Teacher Max

[Figure: ground-truth labels P(Y) and student outputs F(X); the per-sample loss is shown for a high-confidence teacher vs. a low-confidence teacher]

Cross-entropy loss function with labels, re-weighted by the value of the teacher's max:

  • Only the dimension corresponding to the correct category contributes to the loss function.
  • The loss function of each sample is re-weighted by the teacher's max score.
  • This interprets knowledge distillation as importance weighting of samples, where importance is defined by the teacher's confidence.
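A sketch of CWTM under the assumption that per-sample weights are the teacher's max probabilities normalized over the batch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cwtm_loss(student_logits, labels, teacher_logits):
    """Confidence Weighted by Teacher Max: ordinary cross-entropy with the
    ground-truth labels, with each sample re-weighted by the teacher's max
    output, i.e. KD viewed as importance weighting by teacher confidence.
    Normalizing the weights over the batch is an assumption here."""
    p_s = softmax(student_logits)                        # (batch, classes)
    p_t = softmax(teacher_logits)
    w = p_t.max(axis=-1)
    w = w / w.sum()                                      # normalize over the batch
    nll = -np.log(p_s[np.arange(len(labels)), labels])   # per-sample CE
    return (w * nll).sum()
```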

SLIDE 14

Dark Knowledge Under the Light

We observe that the contribution of knowledge distillation depends on both the correct and incorrect output categories:

  • Best results on CIFAR-100 are obtained using simple KD with no labels.
  • Permuting the incorrect output categories (DKPP) results in systematic, but reduced, gains.
  • CWTM of samples gives more unstable results than DKPP, suggesting that higher-order information in the complete output distribution is important.
SLIDE 15

BAN - ResNets

BAN - LSTM

Penn Treebank val/test perplexities of BAN-LSTM language models

SLIDE 16

BAN - ResNets

BAN Wide-ResNet with identical teacher; BAN Wide-ResNet with DenseNet-90-60 teacher (student baseline: 17.69)

BAN - LSTM

Penn Treebank val/test perplexities of BAN-LSTM language models

SLIDE 17
Related Literature

  • Breiman, Leo, and Nong Shang. "Born again trees." University of California, Berkeley, Technical Report (1996).
  • Buciluǎ, Cristian, Rich Caruana, and Alexandru Niculescu-Mizil. "Model compression." Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006.
  • Vapnik, Vladimir, and Rauf Izmailov. "Learning using privileged information: similarity control and knowledge transfer." Journal of Machine Learning Research 16 (2015): 2023-2049.
  • Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).
  • Geras, Krzysztof J., et al. "Blending LSTMs into CNNs." arXiv preprint arXiv:1511.06433 (2015).
  • Zagoruyko, Sergey, and Nikos Komodakis. "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer." arXiv preprint arXiv:1612.03928 (2016).
  • Rusu, Andrei A., et al. "Policy distillation." arXiv preprint arXiv:1511.06295 (2015).
  • Yim, Junho, et al. "A gift from knowledge distillation: Fast optimization, network minimization and transfer learning." The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
  • Tarvainen, Antti, and Harri Valpola. "Mean teachers are better role models." Advances in Neural Information Processing Systems. 2017.
  • Schmitt, Simon, et al. "Kickstarting deep reinforcement learning." arXiv preprint arXiv:1803.03835 (2018).

SLIDE 18

SLIDE 19

Minsky thought it first :p

slide-20
SLIDE 20

This work was supported by the National Science Foundation (CCF-1317433 and CNS-1545089), the Office of Naval Research (N00014-13-1-0563), C-BRIC (one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA), Intel Corporation, and Amazon.com, Inc. The authors affirm that the views expressed herein are solely their own, and do not represent the views of the United States government or any agency thereof.

Extra credit for the conversations with: Pratik Chaudhari, Kamyar Azizzadenesheli, Seb Arnold, Rich Caruana, Samy Bengio, and all the participants of the NIPS 2017 Metalearning workshop.