On Dropout and Nuclear Norm Regularization
Poorya Mianjy and Raman Arora
Johns Hopkins University
June 10, 2019

Motivation

◮ Algorithmic approaches endow deep learning systems with certain inductive biases that help generalization.
◮ In this paper we study dropout, one of the most popular algorithmic regularization techniques in deep learning.
◮ Deep linear networks with k hidden layers:
  f_w : x ↦ W_{k+1} · · · W_1 x,  W_i ∈ R^{d_i × d_{i−1}},
  where w = {W_i}_{i=1}^{k+1} is the set of weight matrices.
◮ x ∈ R^{d_0}, y ∈ R^{d_{k+1}}, (x, y) ∼ D. Assume E[xx⊤] = I.
◮ Learning problem: minimize the population risk
  L(w) := E_{(x,y)∼D}[‖y − f_w(x)‖²]
  based on i.i.d. samples from the distribution.

[Figure: a fully connected deep linear network with input layer x[1], x[2], …, x[d_0] and output layer y[1], y[2], …, y[d_{k+1}].]
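As a sanity check on the model class, the end-to-end map of a deep linear network collapses to a single matrix product. A minimal NumPy sketch (function name and dimensions are ours, for illustration only):

```python
import numpy as np

def linear_net_forward(weights, x):
    """Forward pass of a deep linear network f_w(x) = W_{k+1} ... W_1 x.

    weights: list [W_1, ..., W_{k+1}] with W_i of shape (d_i, d_{i-1}).
    """
    h = x
    for W in weights:
        h = W @ h
    return h

rng = np.random.default_rng(0)
dims = [20, 20, 20, 20]  # d_0, ..., d_3: k = 2 hidden layers of width 20
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
x = rng.standard_normal(dims[0])

# The layer-by-layer pass equals multiplying by the end-to-end matrix.
M = weights[2] @ weights[1] @ weights[0]
assert np.allclose(linear_net_forward(weights, x), M @ x)
```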
◮ Network perturbed by dropping hidden nodes at random, computing
  f̄_w(x) = W_{k+1} B_k W_k · · · B_1 W_1 x,
  where B_i(j, j) = 0 with probability 1 − θ, and 1/θ with probability θ.
◮ Dropout boils down to SGD on the dropout objective
  L_θ(w) := E_{{B_i},(x,y)}[‖y − f̄_w(x)‖²].

[Figure: the same network with dropout applied to every hidden layer.]
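The masks are scaled by 1/θ so that E[B_i] = I, making the perturbed pass unbiased: E[f̄_w(x)] = f_w(x). A minimal NumPy sketch of one stochastic pass (names are ours, not the paper's):

```python
import numpy as np

def dropout_forward(weights, x, theta, rng):
    """One stochastic pass f_bar(x) = W_{k+1} B_k W_k ... B_1 W_1 x.

    Each diagonal entry of B_i is 1/theta with probability theta and 0
    otherwise, so E[B_i] = I and the pass is unbiased in expectation."""
    h = weights[0] @ x                  # first layer: no mask on the input
    for W in weights[1:]:
        mask = (rng.random(h.shape) < theta) / theta  # diagonal of B_i
        h = W @ (mask * h)
    return h

rng = np.random.default_rng(1)
weights = [rng.standard_normal((5, 5)) for _ in range(3)]
x = rng.standard_normal(5)

sample = dropout_forward(weights, x, 0.5, rng)  # one random draw
# With theta = 1 every mask entry is exactly 1, recovering f_w(x).
assert np.allclose(dropout_forward(weights, x, 1.0, rng),
                   weights[2] @ weights[1] @ weights[0] @ x)
```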
◮ 3-layer network with width/input/output dimensionality = 20.

[Figure: singular values (y-axis, 5 to 35) against singular value index (x-axis, 1 to 16), with one curve added per slide: the true model, plain SGD, and dropout with rates 1 − θ ∈ {0.05, 0.10, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85}.]
Goal: give a full characterization of the explicit regularizer R(w) := L_θ(w) − L(w) and of the induced regularizer

  Θ(M) := min_{w : f_w = M} R(w).

[Figure: a path through hidden nodes i1, i2, i3 at pivot layers j1, j2, j3, with subproducts
  α_{j1,i1} := W_{j1→1}(i1, :),  β_1 := W_{j2→j1+1}(i2, i1),  β_2 := W_{j3→j2+1}(i3, i2),  γ_{j3,i3} := W_{k+1→j3+1}(:, i3).]

◮ Multi-dimensional output:
  Θ**(f_w) = ν_{d_i} ‖f_w‖²_*
◮ One-dimensional output:
  Θ(f_w) = Θ**(f_w) = ν_{d_i} ‖f_w‖²
◮ ν_{d_i} increases with depth and decreases with width: deeper and narrower networks are more biased towards low-rank solutions.
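For intuition, in the shallow case (k = 1, dropout on the single hidden layer, y = Mx, E[xx⊤] = I) the explicit regularizer has a simple closed form, R(w) = ((1 − θ)/θ) Σ_i ‖W2(:, i)‖² ‖W1(i, :)‖², coming from the variance of the independent mask entries. The slide above does not state this formula; we take it as an assumption and verify it numerically by enumerating all 2^d dropout masks exactly:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, theta = 4, 0.6
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))
M = rng.standard_normal((d, d))   # target map, y = M x

# With E[xx^T] = I:  E_x ||M x - A x||^2 = ||M - A||_F^2.
L = np.sum((M - W2 @ W1) ** 2)

# Exact L_theta: enumerate every mask pattern with its probability.
L_theta = 0.0
for bits in itertools.product([0, 1], repeat=d):
    p = np.prod([theta if b else 1 - theta for b in bits])
    B = np.diag(np.array(bits) / theta)   # kept units scaled by 1/theta
    L_theta += p * np.sum((M - W2 @ B @ W1) ** 2)

R = L_theta - L
# Assumed closed form (shallow case): variance (1-theta)/theta per unit.
closed_form = (1 - theta) / theta * sum(
    np.sum(W2[:, i] ** 2) * np.sum(W1[i, :] ** 2) for i in range(d)
)
assert np.isclose(R, closed_form)
```

Each product term ‖W2(:, i)‖² ‖W1(i, :)‖² weights one hidden unit's contribution, which is what ties R(w) to a (squared) nuclear-norm-like penalty on the end-to-end map.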