
SLIDE 1

Traditional and Heavy-Tailed Self Regularization in Neural Network Models

Charles H. Martin & Michael W. Mahoney ICML, June 2019

(charles@calculationconsulting.com & mmahoney@stat.berkeley.edu)


SLIDE 2

Motivations: towards a Theory of Deep Learning

Theoretical: deeper insight into Why Deep Learning Works?

• convex versus non-convex optimization?
• explicit/implicit regularization?
• is / why is / when is deep better?
• VC theory versus Statistical Mechanics theory?
• ...

Practical: use insights to improve the engineering of DNNs

• when is a network fully optimized?
• can we use labels and/or domain knowledge more efficiently?
• large batch versus small batch in optimization?
• designing better ensembles?
• ...


SLIDE 3

How we will study regularization

The Energy Landscape is determined by the layer weight matrices WL:

$$E_{DNN} = h_L(W_L \times h_{L-1}(W_{L-1} \times h_{L-2}(\cdots) + b_{L-1}) + b_L)$$

Traditional regularization is applied to WL:

$$\min_{W_l, b_l} \; \mathcal{L}\Big(\sum_i E_{DNN}(d_i) - y_i\Big) + \alpha \sum_l \|W_l\|$$

Different types of regularization, e.g., different norms ‖·‖, leave different empirical signatures on WL.

What we do (see the sketch below):
• Turn off "all" regularization.
• Systematically turn it back on, explicitly with α or implicitly with knobs/switches.
• Study the empirical properties of WL.
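To make the objective above concrete, here is a minimal sketch, assuming PyTorch, of adding an explicit norm penalty α Σ_l ‖W_l‖ to a training loss. `model` and `criterion` are hypothetical placeholders, not artifacts of the talk.

```python
# Minimal sketch (assumes PyTorch): explicit norm regularization on the
# 2-D layer weight matrices, matching the objective above. `model` and
# `criterion` are hypothetical placeholders.
import torch

def regularized_loss(model, criterion, outputs, targets, alpha=1e-4):
    data_loss = criterion(outputs, targets)
    # alpha * sum_l ||W_l|| -- Frobenius norm here; other norms (L1,
    # spectral, ...) leave different empirical signatures on W_l
    reg = sum(W.norm(p="fro")
              for W in model.parameters() if W.ndim == 2)
    return data_loss + alpha * reg
```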


SLIDE 4

ESD: detailed insight into WL

Empirical Spectral Density (ESD): the eigenvalues of $X = W_L^T W_L$ (computed in the sketch below).

[Figure: ESDs during training. Epoch 0: Random Matrix. Epoch 36: Random + Spikes.]

Entropy decrease corresponds to:
• modification (later, breakdown) of random structure, and
• onset of a new kind of self-regularization.
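A minimal NumPy sketch of computing an ESD; the Gaussian random W below is a stand-in for a layer's weight matrix, not data from the talk.

```python
# Minimal sketch: the ESD of a layer weight matrix W is the set of
# eigenvalues of X = W^T W. The Gaussian W below is a stand-in for an
# untrained (epoch-0-like) layer.
import numpy as np

def esd(W):
    return np.linalg.eigvalsh(W.T @ W)   # X is symmetric PSD

N, M = 1000, 500                         # aspect ratio Q = N/M = 2
W = np.random.randn(N, M) / np.sqrt(N)   # entries ~ N(0, 1/N), sigma = 1
eigs = esd(W)
# For random W the eigenvalues fill the Marchenko-Pastur bulk,
# lambda_(+/-) = sigma^2 (1 +/- 1/sqrt(Q))^2; spikes appear as
# training builds correlations into W.
```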


SLIDE 5

Random Matrix Theory 101: Wigner and Tracy-Widom

• Wigner: global bulk statistics approach a universal semi-circular form.
• Tracy-Widom: local edge statistics fluctuate in a universal way.

Problems with Wigner and Tracy-Widom (a semicircle check is sketched below):
• Weight matrices are usually not square.
• We typically do only a single training run.
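A minimal sketch, NumPy only, empirically checking Wigner's semicircle law on a symmetric Gaussian matrix:

```python
# Minimal sketch: Wigner's semicircle law. For a symmetric Gaussian
# (GOE-like) matrix scaled by 1/sqrt(2N), the ESD converges to the
# semicircle density rho(x) = sqrt(4 - x^2) / (2*pi) on [-2, 2].
import numpy as np

N = 2000
A = np.random.randn(N, N)
H = (A + A.T) / np.sqrt(2 * N)     # symmetrize and scale
eigs = np.linalg.eigvalsh(H)

hist, edges = np.histogram(eigs, bins=50, density=True)
x = 0.5 * (edges[:-1] + edges[1:])
rho = np.sqrt(np.clip(4 - x**2, 0.0, None)) / (2 * np.pi)
print(np.abs(hist - rho).max())    # small deviation for large N
```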


SLIDE 6

Random Matrix Theory 102: Marchenko-Pastur

[Figure: Marchenko-Pastur (MP) distributions. (a) Varying aspect ratios. (b) Varying variance parameters.]

Important points:
• Global bulk stats: the overall shape is deterministic, fixed by Q and σ.
• Local edge stats: the edge λ+ is very crisp, i.e., $\Delta\lambda_M = |\lambda_{max} - \lambda^+| \sim O(M^{-2/3})$, plus Tracy-Widom fluctuations.

We use both global bulk statistics and local edge statistics in our theory (the MP density and its edges are sketched below).
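A minimal NumPy sketch of the MP density and its edges as functions of Q and σ:

```python
# Minimal sketch: the Marchenko-Pastur density rho(lambda) and its bulk
# edges lambda_(+/-), fixed entirely by the aspect ratio Q = N/M and the
# variance parameter sigma^2.
import numpy as np

def mp_edges(Q, sigma=1.0):
    lam_minus = sigma**2 * (1 - 1 / np.sqrt(Q))**2
    lam_plus = sigma**2 * (1 + 1 / np.sqrt(Q))**2
    return lam_minus, lam_plus

def mp_density(lam, Q, sigma=1.0):
    lm, lp = mp_edges(Q, sigma)
    rho = np.zeros_like(lam)
    inside = (lam > lm) & (lam < lp)
    rho[inside] = (Q / (2 * np.pi * sigma**2 * lam[inside])) * \
        np.sqrt((lp - lam[inside]) * (lam[inside] - lm))
    return rho

lam = np.linspace(0.0, 4.0, 400)
print(mp_edges(Q=2.0))   # crisp bulk edge lambda_+ ~= 2.914 for Q = 2
```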


SLIDE 7

Random Matrix Theory 103: Heavy-tailed RMT

Go beyond the (relatively easy) Gaussian Universality class: model strongly-correlated systems (“signal”) with heavy-tailed random matrices.

| Generative Model (w/ elements from Universality class) | Finite-N Global shape ρ_N(λ) | Limiting Global shape ρ(λ), N → ∞ | Bulk edge Local stats, λ ≈ λ+ | (far) Tail Local stats, λ ≈ λmax |
| --- | --- | --- | --- | --- |
| Basic MP (Gaussian) | MP distribution | MP | TW | No tail. |
| Spiked-Covariance (Gaussian, + low-rank perturbations) | MP + Gaussian spikes | MP | TW | Gaussian |
| Heavy tail, 4 < µ ((Weakly) Heavy-Tailed) | MP + PL tail | MP | Heavy-Tailed∗ | Heavy-Tailed∗ |
| Heavy tail, 2 < µ < 4 ((Moderately) Heavy-Tailed, or "fat tailed") | PL∗∗ ∼ λ^−(aµ+b) | PL ∼ λ^−(µ/2+1) | No edge. | Frechet |
| Heavy tail, 0 < µ < 2 ((Very) Heavy-Tailed) | PL∗∗ ∼ λ^−(µ/2+1) | PL ∼ λ^−(µ/2+1) | No edge. | Frechet |

Table: Basic MP theory, and the spiked and Heavy-Tailed extensions we use, including known, empirically-observed, and conjectured relations between them. Boxes marked "∗" are best described as following "TW with large finite-size corrections" that are likely Heavy-Tailed, leading to bulk edge statistics and far tail statistics that are indistinguishable. Boxes marked "∗∗" are phenomenological fits, describing large (2 < µ < 4) or small (0 < µ < 2) finite-size corrections on N → ∞ behavior. A heavy-tailed ESD is sketched below.
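A minimal NumPy sketch of the heavy-tailed rows of the table: a matrix with Pareto-distributed entries whose ESD grows a power-law tail and loses its crisp bulk edge. The specific parameters are illustrative assumptions.

```python
# Minimal sketch: ESD of a heavy-tailed random matrix. Entries are
# Pareto-distributed with tail exponent mu; for 2 < mu < 4 the ESD
# develops a power-law tail (limiting exponent mu/2 + 1) and the crisp
# MP bulk edge disappears.
import numpy as np

N, M, mu = 1000, 500, 3.0
signs = np.random.choice([-1.0, 1.0], size=(N, M))
W = signs * np.random.pareto(mu, size=(N, M))   # heavy-tailed entries
eigs = np.linalg.eigvalsh(W.T @ W / N)
print(np.sort(eigs)[-5:])   # a few huge eigenvalues: no Tracy-Widom edge
```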

SLIDE 8

Phenomenological Theory: 5+1 Phases of Training

(a) Random-like. (b) Bleeding-out. (c) Bulk+Spikes. (d) Bulk-decay. (e) Heavy-Tailed. (f) Rank-collapse.

Figure: The 5+1 phases of learning we identified in DNN training.
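The Bulk+Spikes phase can be generated synthetically with a low-rank perturbation of a random matrix, per the spiked-covariance model of the previous slide. A minimal NumPy sketch, with an illustrative spike strength:

```python
# Minimal sketch: an ESD in the Bulk+Spikes phase, built by adding a
# rank-1 perturbation ("signal") to a Random-like matrix.
import numpy as np

N, M = 1000, 500
W = np.random.randn(N, M) / np.sqrt(N)   # Random-like bulk (MP)
u = np.random.randn(N, 1) / np.sqrt(N)   # unit-norm-ish directions
v = np.random.randn(M, 1) / np.sqrt(M)
W_spiked = W + 3.0 * (u @ v.T)           # strength-3 rank-1 spike
eigs = np.linalg.eigvalsh(W_spiked.T @ W_spiked)
print(np.sort(eigs)[-3:])   # one eigenvalue pops out above the MP edge
```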


SLIDE 9

Old/Small Models: Bulk+Spike ∼ Tikhonov regularization

λ+ is a simple scale threshold, as in Tikhonov regularization:

$$\hat{x} = (X + \alpha I)^{-1} W^T y$$

• Eigenvalues > α (Spikes) carry most of the signal/information.
• Smaller, older models like LeNet5 exhibit traditional regularization.

A solver sketch follows below.
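A minimal NumPy sketch of the Tikhonov (ridge) solution above; the synthetic W and y are placeholders, not data from the talk.

```python
# Minimal sketch: Tikhonov regularization as a scale threshold on the
# eigenvalues of X = W^T W. Directions with eigenvalue >> alpha pass
# through nearly untouched; directions with eigenvalue << alpha are damped.
import numpy as np

def tikhonov_solve(W, y, alpha):
    X = W.T @ W
    return np.linalg.solve(X + alpha * np.eye(X.shape[1]), W.T @ y)

# Toy usage with synthetic placeholders
W = np.random.randn(200, 50)
y = np.random.randn(200)
x_hat = tikhonov_solve(W, y, alpha=1.0)
```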


SLIDE 10

New/Large Models: Heavy-tailed Self-regularization

• W is strongly-correlated and highly non-random.
• We can model strongly-correlated systems by heavy-tailed random matrices.
• Then the RMT/MP ESD will also have heavy tails.
• Known results from RMT / polymer theory (Bouchaud, Potters, etc.).

AlexNet, ResNet50, InceptionV3, DenseNet201, ...

Larger, modern DNNs exhibit novel Heavy-tailed self-regularization
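A minimal sketch of measuring the heavy tail, assuming the `powerlaw` Python package; the Pareto sample and the exact fitting calls here are illustrative assumptions, not the talk's code.

```python
# Minimal sketch: estimate the power-law exponent of an ESD tail with the
# `powerlaw` package. The Pareto sample below is a hypothetical stand-in
# for the ESD of a trained layer from a modern DNN (e.g., AlexNet).
import numpy as np
import powerlaw

eigs = np.random.pareto(2.5, size=5000) + 1.0   # stand-in "ESD"
fit = powerlaw.Fit(eigs)
print(fit.power_law.alpha, fit.power_law.xmin)  # tail exponent and cutoff
```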


SLIDE 11

Uses, implications, and extensions

• Exhibit all phases of training by varying just the batch size ("explaining" the generalization gap).
• A Very Simple Deep Learning (VSDL) model (with load-like parameters α and temperature-like parameters τ) that exhibits a non-trivial phase diagram.
• Connections with minimizing frustration, energy landscape theory, and the spin glass of minimal frustration.
• A "rugged convexity," since local minima do not concentrate near the ground state of heavy-tailed spin glasses.
• A novel capacity control metric (the weighted sum of power-law exponents) to predict trends in generalization performance for state-of-the-art models.
• Use our tool: pip install weightwatcher (usage sketched below).
• Stop by the poster for more details ...
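A minimal usage sketch for weightwatcher, following the project's published interface; exact API details may differ across versions, and the ResNet50 choice is illustrative.

```python
# Minimal sketch: analyzing a pretrained model with weightwatcher
# (pip install weightwatcher). The calls follow the project's documented
# usage; details may vary across versions.
import weightwatcher as ww
import torchvision.models as models

model = models.resnet50(pretrained=True)   # any supported Keras/PyTorch model
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()                # per-layer ESDs and power-law fits
summary = watcher.get_summary(details)     # aggregate capacity metrics
print(summary)
```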
