Demystifying Dropout


  1. Demystifying Dropout
  Hongchang Gao (1,2), Jian Pei (3,4) and Heng Huang (1,2)
  (1) JD Finance America Corporation
  (2) Department of Electrical and Computer Engineering, University of Pittsburgh, USA
  (3) JD.com
  (4) School of Computing Science, Simon Fraser University, Canada
  June 13, 2019

  2. Outline
  1. Motivation
  2. Forward and Backward Dropout: Definition, Observations
  3. Augmented Dropout: Augmented Dropout, Results

  3. Motivation
  1. Dropout is a popular technique to alleviate overfitting.
     What is the underlying reason for its performance gain? Feature augmentation, regularization.
  2. Dropout happens in both the forward and the backward pass.
     Forward pass: feature augmentation. Backward pass: noisy gradient.
  3. Which pass accounts for the performance gain of dropout?

  4. Outline: 1. Motivation  2. Forward and Backward Dropout (Definition, Observations)  3. Augmented Dropout (Augmented Dropout, Results)

  5. Definition: Forward Dropout
  Forward dropout randomly drops features in the forward pass, but it does not drop features or backpropagated errors in the backward pass.
  Forward pass:
      z^(l+1) = W^(l) (h^(l) ⊙ ε^(l)) + b^(l),    (1)
      h^(l+1) = f_l(z^(l+1)).
  Backward pass:
      ∂J/∂W^(l) = δ^(l+1) (h^(l))^T,    (2)
      δ^(l) ≜ ∂J/∂z^(l) = ((W^(l))^T δ^(l+1)) ⊙ f'_l(z^(l)).
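To make concrete where the mask enters, here is a minimal NumPy sketch of one forward-dropout layer following Eqs. (1)-(2). The single-sample vector shapes, the ReLU choice for f_l, the function name, and the omission of the usual 1/(1-p) rescaling are illustrative assumptions; the slides specify only the equations themselves.

```python
import numpy as np

def forward_dropout_step(h, z_in, W, b, delta_next, p, rng):
    """One layer of *forward* dropout, following Eqs. (1)-(2).

    h          : (d_in,)  activations h^(l), where h = f(z_in)
    z_in       : (d_in,)  pre-activation that produced h (ReLU assumed)
    W, b       : (d_out, d_in), (d_out,) layer parameters
    delta_next : (d_out,) backpropagated error delta^(l+1)
    p          : dropout ratio; rng: np.random.Generator
    """
    eps = (rng.random(h.shape) > p).astype(h.dtype)  # Bernoulli mask epsilon^(l)

    # Forward pass: the features are dropped (Eq. 1).
    z_next = W @ (h * eps) + b
    h_next = np.maximum(z_next, 0.0)

    # Backward pass: neither the features nor the error is masked (Eq. 2).
    grad_W = np.outer(delta_next, h)          # uses the *undropped* h^(l)
    delta = (W.T @ delta_next) * (z_in > 0)   # f' is the ReLU derivative; no mask
    return h_next, grad_W, delta
```

The key point of the sketch is that the mask multiplies the activations only on the forward path; the weight gradient and the propagated error are computed from the undropped features.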

  6. Definition: Backward Dropout
  Backward dropout keeps all features in the forward pass, while dropping features and backpropagated errors, as in standard dropout, in the backward pass.
  Forward pass:
      z^(l+1) = W^(l) h^(l) + b^(l),    (3)
      h^(l+1) = f_l(z^(l+1)).
  Backward pass:
      ∂J/∂W^(l) = δ^(l+1) (h^(l) ⊙ ε^(l))^T,    (4)
      δ^(l) ≜ ∂J/∂z^(l) = (((W^(l))^T δ^(l+1)) ⊙ f'_l(z^(l))) ⊙ ε^(l).
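For contrast, a matching NumPy sketch of one backward-dropout layer following Eqs. (3)-(4), under the same illustrative assumptions as the forward-dropout sketch above (vector shapes, ReLU, no rescaling):

```python
import numpy as np

def backward_dropout_step(h, z_in, W, b, delta_next, p, rng):
    """One layer of *backward* dropout, following Eqs. (3)-(4)."""
    eps = (rng.random(h.shape) > p).astype(h.dtype)  # Bernoulli mask epsilon^(l)

    # Forward pass: all features are kept (Eq. 3).
    z_next = W @ h + b
    h_next = np.maximum(z_next, 0.0)

    # Backward pass: features and backpropagated error are masked (Eq. 4).
    grad_W = np.outer(delta_next, h * eps)              # dropped h^(l)
    delta = ((W.T @ delta_next) * (z_in > 0)) * eps     # error masked by eps
    return h_next, grad_W, delta
```

Here the forward prediction is the plain network's; only the gradient is made noisy.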

  7. Outline: 1. Motivation  2. Forward and Backward Dropout (Definition, Observations)  3. Augmented Dropout (Augmented Dropout, Results)

  8. Observations

  Table: The test accuracy of ConvNet-Quick for CIFAR10 (Plain network: 0.7579)

    Dropout Ratio       0.3      0.2      0.1      0.05     0.01     0.005
    Standard Dropout    0.7523   0.7617   0.7657   0.7647   0.7626   0.7608
    Forward Dropout     0.6908   0.7211   0.7482   0.7627   0.7612   0.7578
    Backward Dropout    0.7433   0.7557   0.7585   0.7593   0.7583   0.7599

  Table: The test accuracy of ResNet-20 for CIFAR10 (Plain network: 0.9143)

    Dropout Ratio       0.3      0.2      0.1      0.05     0.01     0.005
    Standard Dropout    0.9163   0.9176   0.9193   0.9174   0.9141   0.9154
    Forward Dropout     0.9007   0.9093   0.9169   0.9171   0.9141   0.9142
    Backward Dropout    0.9109   0.9130   0.9140   0.9142   0.9147   0.9146

  9. Observations
  1. When the dropout ratio is large (p = 0.2), the training loss of forward dropout is much larger than that of the other methods. In other words, when p is large, forward dropout perturbs the features heavily, which increases the model bias and causes underfitting; as a result, it degrades the performance of the plain network. Although standard dropout employs the same dropout ratio as forward dropout, its loss and accuracy are better. The possible reason is the implicit regularization of the noisy gradient in the backward pass, which helps the model escape local minima. Backward dropout, although it reaches a smaller training loss than standard dropout, cannot outperform it. The possible reason is that, unlike standard dropout, it provides no data augmentation in the forward pass, so its generalization performance is worse.

  10. Observations
  2. When the dropout ratio becomes moderately small (p = 0.05), the training losses of forward and backward dropout approach that of standard dropout, and their accuracy is slightly better than that of the plain network. Thus, the data augmentation in the forward pass and the noisy gradient in the backward pass caused by this mild noise help improve generalization. Additionally, forward dropout outperforms backward dropout; in other words, mild noise is more helpful in the forward pass.
  3. When the dropout ratio is very small (p = 0.005), forward dropout, as one would expect, has little effect on the performance of the plain deep neural network. Surprisingly, backward dropout performs better than the plain network, which means that the noisy gradient induced by even this small noise in the backward pass contributes to improving generalization.

  11. Outline: 1. Motivation  2. Forward and Backward Dropout (Definition, Observations)  3. Augmented Dropout (Augmented Dropout, Results)

  12. Augmented Dropout
  Based on the above observations, we propose augmented dropout, which employs a different dropout strategy in each of the two passes.
  Forward pass:
      ĥ^(l)_forward = h^(l) ⊙ ε^(l)_forward,
      z^(l+1) = W^(l) ĥ^(l)_forward + b^(l),    (5)
      h^(l+1) = f_l(z^(l+1)).
  Backward pass:
      ĥ^(l)_backward = h^(l) ⊙ ε^(l)_backward,
      ∂J/∂W^(l) = δ^(l+1) (ĥ^(l)_backward)^T,    (6)
      δ^(l) ≜ ∂J/∂z^(l) = ((W^(l))^T δ^(l+1)) ⊙ (f'_l(z^(l)) ⊙ ε^(l)_backward).
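A matching NumPy sketch of one augmented-dropout layer following Eqs. (5)-(6), again under the same illustrative assumptions as the earlier sketches (vector shapes, ReLU for f_l, no rescaling). The two masks are drawn independently with their own ratios, which is what allows a larger forward ratio to be paired with a much smaller backward ratio:

```python
import numpy as np

def augmented_dropout_step(h, z_in, W, b, delta_next, p_fwd, p_bwd, rng):
    """One layer of augmented dropout, following Eqs. (5)-(6):
    independent masks (and ratios) for the forward and backward passes."""
    eps_fwd = (rng.random(h.shape) > p_fwd).astype(h.dtype)  # epsilon_forward
    eps_bwd = (rng.random(h.shape) > p_bwd).astype(h.dtype)  # epsilon_backward

    # Forward pass uses the forward mask (Eq. 5).
    h_fwd = h * eps_fwd
    z_next = W @ h_fwd + b
    h_next = np.maximum(z_next, 0.0)

    # Backward pass uses its own, independently drawn mask (Eq. 6).
    h_bwd = h * eps_bwd
    grad_W = np.outer(delta_next, h_bwd)
    delta = (W.T @ delta_next) * ((z_in > 0) * eps_bwd)
    return h_next, grad_W, delta
```

With p_bwd set equal to p_fwd and the two masks tied, this reduces to standard dropout; with p_bwd = 0 it reduces to forward dropout, and with p_fwd = 0 to backward dropout.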

  13. Outline: 1. Motivation  2. Forward and Backward Dropout (Definition, Observations)  3. Augmented Dropout (Augmented Dropout, Results)

  14. Results

  Table: The test accuracy of ConvNet-Quick

    Dataset    Standard Dropout Ratio   Standard Acc   Augmented Acc   Augmented Dropout Ratio
    SVHN       0.1                      0.9240         0.9248          0.1 / 0.0002
    SVHN       0.2                      0.9231         0.9258          0.2 / 0.0002
    SVHN       0.3                      0.9249         0.9250          0.3 / 0.0002
    CIFAR10    0.1                      0.7657         0.7674          0.1 / 0.0002
    CIFAR10    0.2                      0.7617         0.7655          0.2 / 0.0002
    CIFAR10    0.3                      0.7523         0.7606          0.3 / 0.0002

  15. Results

  Table: The test accuracy of ResNet-20

    Dataset     Standard Dropout Ratio   Standard Acc   Augmented Acc   Augmented Dropout Ratio
    SVHN        0.1                      0.9618         0.9627          0.1 / 0.0002
    SVHN        0.2                      0.9626         0.9648          0.2 / 0.0002
    SVHN        0.3                      0.9655         0.9667          0.3 / 0.0002
    CIFAR10     0.1                      0.9193         0.9196          0.1 / 0.0001
    CIFAR10     0.2                      0.9176         0.9195          0.2 / 0.0001
    CIFAR10     0.3                      0.9163         0.9177          0.3 / 0.0001
    CIFAR100    0.1                      0.6762         0.6786          0.1 / 0.0001
    CIFAR100    0.2                      0.6748         0.6770          0.2 / 0.0001
    CIFAR100    0.3                      0.6686         0.6688          0.3 / 0.0001

  16. Thank You
