Mehler's Formula, Branching Processes, and Compositional Kernels of Deep Neural Networks

Tengyuan Liang, Hai Tran-Bach
May 26, 2020
Motivations & Questions

◮ DNN and Kernels (Rahimi, Recht ’08; Belkin et al ’18; Jacot et al ’19). What role do the activation functions play in the connections between DNNs and kernels?
◮ Interpolation and Generalization (Zhang et al ’17; Belkin et al ’18; Liang, Rakhlin ’18). How does the activation function interplay with depth, sample size, and input dimensionality in terms of memorization capacity?
◮ Is there hope to design activation functions such that we can "compress" multiple layers?
Multi-Layer Perceptron with Random Weights

(Neal ’96; Rahimi, Recht ’08; Daniely et al ’16)

Input: $x^{(0)} := x \in \mathbb{R}^d$
Hidden layers: $x^{(\ell+1)} := \sigma\big( W^{(\ell)} x^{(\ell)} / \|x^{(\ell)}\| \big) \in \mathbb{R}^{d_{\ell+1}}$, for $0 \le \ell < L$
Random weights: $W^{(\ell)} \in \mathbb{R}^{d_{\ell+1} \times d_\ell}$, $W^{(\ell)} \sim \mathrm{MN}(0, I_{d_{\ell+1}} \otimes I_{d_\ell})$
Regime: $d_1, \ldots, d_L \to \infty$
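A minimal NumPy sketch of this forward pass (the helper name and the centered, variance-normalized ReLU are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def random_mlp_features(x, widths, sigma, seed=None):
    """One draw of the randomly weighted MLP:
    x^(l+1) = sigma(W^(l) x^(l) / ||x^(l)||), W^(l) with iid N(0,1) entries."""
    rng = np.random.default_rng(seed)
    h = np.asarray(x, dtype=float)
    for d_next in widths:
        W = rng.standard_normal((d_next, h.shape[0]))
        h = sigma(W @ h / np.linalg.norm(h))
    return h

# Centered, variance-normalized ReLU, so that sum_k alpha_k^2 = 1.
relu_c = lambda z: (np.maximum(z, 0) - 1/np.sqrt(2*np.pi)) / np.sqrt(0.5 - 1/(2*np.pi))
x = np.random.default_rng(0).standard_normal(50)   # input in R^d with d = 50
h = random_mlp_features(x, widths=[1000, 1000, 1000], sigma=relu_c, seed=1)
```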
Duality: Activation and Kernel

Activation: $\sigma(x) = \sum_{k=0}^{\infty} \alpha_k h_k(x)$, with $\sum_{k=0}^{\infty} \alpha_k^2 = 1$.

[Figure: the Hermite polynomials $h_0, \ldots, h_5$ on $[-2, 2]$.]

Dual kernel: $K(x_i, x_j) := \mathbb{E}_{w \sim N(0, I_d)}\big[\sigma(w^\top x_i/\|x_i\|)\,\sigma(w^\top x_j/\|x_j\|)\big] = \sum_{k=0}^{\infty} \alpha_k^2\, \rho_{ij}^k =: G(\rho_{ij})$, where $\rho_{ij} := \langle x_i/\|x_i\|,\ x_j/\|x_j\| \rangle$.

Compositional kernel: $K^{(L)}(x_i, x_j) = \underbrace{G \circ G \circ \cdots \circ G}_{L \text{ times}}(\rho_{ij}) =: G^{(L)}(\rho_{ij})$.
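The duality can be checked numerically: the sketch below estimates the Hermite coefficients $\alpha_k$ by Gauss-Hermite quadrature and compares the series $\sum_k \alpha_k^2 \rho^k$ against a Monte Carlo estimate of the dual kernel (helper names are ours; quadrature is one standard way to obtain the $\alpha_k$):

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermegauss, HermiteE

def hermite_coeffs(sigma, kmax=30, nodes=200):
    """alpha_k = E[sigma(Z) h_k(Z)], Z ~ N(0,1), with h_k = He_k / sqrt(k!),
    computed by Gauss-Hermite quadrature."""
    z, w = hermegauss(nodes)       # weight exp(-z^2/2); weights sum to sqrt(2*pi)
    w = w / np.sqrt(2*np.pi)       # renormalize to the standard Gaussian
    return np.array([np.sum(w * sigma(z) * HermiteE.basis(k)(z)) / np.sqrt(factorial(k))
                     for k in range(kmax + 1)])

relu_c = lambda z: (np.maximum(z, 0) - 1/np.sqrt(2*np.pi)) / np.sqrt(0.5 - 1/(2*np.pi))
a = hermite_coeffs(relu_c)
G = lambda rho: np.sum(a**2 * rho**np.arange(a.size))   # dual kernel G(rho)

# Monte Carlo check at rho = 0.3 via a correlated Gaussian pair:
rng = np.random.default_rng(0)
z1 = rng.standard_normal(10**6)
z2 = 0.3*z1 + np.sqrt(1 - 0.3**2)*rng.standard_normal(10**6)
print(G(0.3), np.mean(relu_c(z1)*relu_c(z2)))           # the two should nearly agree
```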
Branching Process and Compositional Kernels

Offspring distribution: $Y$, with $\mathbb{P}(Y = k) = \alpha_k^2$ and PGF $G$.
Galton-Watson process: $Z^{(L)}$, with offspring law $Y$ and PGFs $G^{(L)}$.
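A quick simulation, with a toy offspring law chosen for illustration, confirming that $\mathbb{E}[s^{Z^{(L)}}]$ matches the $L$-fold composition $G^{(L)}(s)$:

```python
import numpy as np

def gw_population(a2, L, runs=20000, seed=0):
    """Sample Z^(L) for a Galton-Watson process with offspring law P(Y=k) = a2[k]."""
    rng = np.random.default_rng(seed)
    ks = np.arange(len(a2))
    Z = np.ones(runs, dtype=int)                 # one ancestor per run
    for _ in range(L):
        Z = np.array([rng.choice(ks, size=z, p=a2).sum() for z in Z])
    return Z

a2 = np.array([0.0, 0.5, 0.3, 0.2])      # toy offspring law, sums to 1
G  = lambda s: np.polyval(a2[::-1], s)   # PGF G(s) = sum_k a2[k] s^k
GL = lambda s, L: s if L == 0 else GL(G(s), L - 1)
Z  = gw_population(a2, L=3)
s  = 0.7
print(np.mean(s**Z), GL(s, 3))           # E[s^(Z^(L))] vs G^(L)(s): nearly equal
```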
Rescaled Limit: Phase Transition

Theorem (Liang, Tran-Bach ’20)
Define $\mu := \sum_{k \ge 0} \alpha_k^2\, k$ and $\mu_\star := \sum_{k > 2} \alpha_k^2\, k \log k$. Then, for all $t > 0$:

1. If $\mu \le 1$,
$$\lim_{L \to \infty} K^{(L)}(e^{-t}) = \begin{cases} 1, & \text{if } \alpha_1 \neq 1 \\ e^{-t}, & \text{if } \alpha_1 = 1 \end{cases}$$

2. If $\mu > 1$,
$$\lim_{L \to \infty} K^{(L)}(e^{-t/\mu^L}) = \begin{cases} \xi + (1 - \xi)\, \mathbb{E}[e^{-tW}], & \text{if } \mu_\star < \infty \\ 0, & \text{if } \mu_\star = \infty \end{cases}$$

Here $\xi$ is the extinction probability of the Galton-Watson process and $W$ is the Kesten-Stigum limit of $Z^{(L)}/\mu^L$ conditioned on survival.
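A numerical illustration of the two regimes, with hypothetical coefficient sequences chosen so that $\mu < 1$ and $\mu > 1$:

```python
import numpy as np

def GL(G, s, L):
    # L-fold composition G^(L)(s) = G(G(...G(s)...))
    for _ in range(L):
        s = G(s)
    return s

t = 1.0
a2_sub = np.array([0.3, 0.5, 0.2])               # mu = 0.9 <= 1 (subcritical)
a2_sup = np.array([0.0, 0.5, 0.5])               # mu = 1.5 >  1 (supercritical)
G_sub = lambda s: np.polyval(a2_sub[::-1], s)
G_sup = lambda s: np.polyval(a2_sup[::-1], s)
mu = 1.5
for L in (1, 5, 10, 20, 40):
    print(L,
          GL(G_sub, np.exp(-t), L),              # -> 1 (since alpha_1 != 1)
          GL(G_sup, np.exp(-t/mu**L), L))        # -> xi + (1-xi) E[exp(-tW)]
```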
Kernel Limits Example: centered ReLU

Unscaled limit $K^{(L)}(t)$:
[Figure: $K^{(L)}$ for the centered ReLU at L = 1, 3, 5, 7, plotted on $[-1, 1]$; one panel per L, each marking the level $1/\mu^L$.]

Rescaled limit $K^{(L)}(e^{-t/\mu^L})$:
[Figure: the rescaled curves for L = 1, 3, 5, 7, converging as L grows.]
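The unscaled curves can be reproduced from the closed-form dual kernel of the ReLU (the degree-1 arc-cosine kernel of Cho & Saul), centered and variance-normalized; treating this as the slides' exact normalization is our assumption:

```python
import numpy as np

def G_relu(rho):
    """Dual kernel of the centered, normalized ReLU: the degree-1 arc-cosine
    kernel, centered by (E ReLU)^2 = 1/(2 pi) and scaled by
    Var(ReLU) = (pi - 1)/(2 pi)."""
    rho = np.clip(rho, -1.0, 1.0)
    return (np.sqrt(1 - rho**2) + rho*(np.pi - np.arccos(rho)) - 1) / (np.pi - 1)

rho = np.linspace(-1, 1, 9)
K = rho.copy()
for L in range(1, 8):
    K = G_relu(K)                 # K^(L)(rho) = G^(L)(rho)
    if L in (1, 3, 5, 7):
        print(L, np.round(K, 3))  # curves flatten toward 0 as L grows
```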
Memorization Capacity

◮ "small correlation": $\sup_{i \ne j} |\rho_{ij}| \approx 0$, e.g. $x_1, \ldots, x_n \overset{iid}{\sim} \mathrm{Unif}(S^{d-1})$ with $\log(n)/d \to 0$ (see the sketch below).
◮ "large correlation": $\sup_{i \ne j} |\rho_{ij}| \approx 1$, e.g. $x_1, \ldots, x_n$ a maximal packing of $S^{d-1}$ with $\log(n)/d \to \infty$.

[Figure: example point configurations in the small-correlation and large-correlation regimes.]
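A quick empirical check of the small-correlation regime; the $\sqrt{\log(n)/d}$ scale in the comment is the standard heuristic for iid points on the sphere:

```python
import numpy as np

# Small-correlation regime: for iid uniform points on S^{d-1},
# max_{i != j} |rho_ij| is on the order of sqrt(log(n)/d).
rng = np.random.default_rng(0)
n, d = 2000, 500                                  # log(n)/d is small
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # uniform on the sphere
R = X @ X.T
np.fill_diagonal(R, 0.0)
print(np.abs(R).max(), 2*np.sqrt(np.log(n)/d))    # comparable magnitudes
```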
Memorization Capacity Theorem

Theorem (Liang & Tran-Bach ’20)
It suffices to take the depth

$$L \gtrsim \frac{\log(n\kappa^{-1}) + \log\frac{\log n}{d}}{\log \alpha_1^{-2}} \quad \text{(small correlation)}, \qquad L \gtrsim \exp\!\Big(\frac{2\log n}{d}\Big)\, \frac{\log(n\kappa^{-1})}{\mu - 1} \quad \text{(large correlation)}$$

to memorize the data, in the sense that $1 - \kappa \le \lambda_i \le 1 + \kappa$, where $\lambda_i$ are the eigenvalues of $K := \{K(x_i, x_j)\}_{ij}$.
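A numerical sanity check of the theorem (not the paper's proof): iterate the centered-ReLU kernel on iid sphere data and watch the eigenvalues of $K$ concentrate around 1 as the depth grows:

```python
import numpy as np

def G_relu(rho):  # centered, normalized ReLU kernel, as in the sketch above
    rho = np.clip(rho, -1.0, 1.0)
    return (np.sqrt(1 - rho**2) + rho*(np.pi - np.arccos(rho)) - 1) / (np.pi - 1)

rng = np.random.default_rng(1)
n, d = 200, 400                                  # small-correlation regime
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
K = np.clip(X @ X.T, -1.0, 1.0)                  # rho_ij, with rho_ii = 1
for L in range(1, 11):
    K = G_relu(K)
    if L % 2 == 0:
        lam = np.linalg.eigvalsh(K)
        print(L, round(lam.min(), 3), round(lam.max(), 3))  # -> [1-kappa, 1+kappa]
```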
New Random Features Algorithm

Kernels                                    Activation Sampling
shift-invariant (Rahimi, Recht ’08)        cos, sin
≈ inner-product (Liang, Tran-Bach ’20)     ≈ Gaussian
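A simplified random-features sketch of the idea, with wide layers and fresh Gaussian weights per layer (the paper's algorithm refines the sampling, hence the "≈ Gaussian" entry above):

```python
import numpy as np

def features(X, L, width, sigma, seed=0):
    """Random features: L layers of fresh Gaussian weights on normalized inputs."""
    rng = np.random.default_rng(seed)
    H = X / np.linalg.norm(X, axis=1, keepdims=True)
    for _ in range(L):
        W = rng.standard_normal((H.shape[1], width))
        H = sigma(H @ W / np.linalg.norm(H, axis=1, keepdims=True))
    return H

relu_c = lambda z: (np.maximum(z, 0) - 1/np.sqrt(2*np.pi)) / np.sqrt(0.5 - 1/(2*np.pi))
X = np.random.default_rng(0).standard_normal((3, 50))
Phi = features(X, L=2, width=2000, sigma=relu_c)
print(Phi @ Phi.T / Phi.shape[1])   # approximates the 3x3 matrix K^(2)(rho_ij)
```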
Experiment: MNIST & CIFAR10

Activation:   ReLU   GeLU   Sigmoid   Swish
µ:            0.95   1.08   0.15      1.07
α₁²:          0.50   0.59   0.15      0.80

[Figure: performance on MNIST (L = 1, 2, 3) and CIFAR10 (L = 1, 2, 3) for ReLU, GeLU, Sigmoid, and Swish.]
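The µ and α₁² rows can be approximated by truncating the Hermite series of each activation; normalizing to $\mathbb{E}[\sigma(Z)^2] = 1$ and truncating at kmax = 30 are our choices, so values will differ slightly from the table:

```python
import numpy as np
from math import erf, factorial
from numpy.polynomial.hermite_e import hermegauss, HermiteE

z, w = hermegauss(300)
w = w / np.sqrt(2*np.pi)                     # quadrature for expectations under N(0,1)

def mu_and_a1sq(sigma, kmax=30):
    s = sigma(z)
    s = s / np.sqrt(np.sum(w * s**2))        # normalize so that sum_k a_k^2 = 1
    a2 = np.array([np.sum(w * s * HermiteE.basis(k)(z))**2 / factorial(k)
                   for k in range(kmax + 1)])
    return np.sum(np.arange(kmax + 1) * a2), a2[1]

Phi = np.vectorize(lambda x: 0.5*(1 + erf(x/np.sqrt(2))))   # standard normal CDF
acts = {"ReLU":    lambda x: np.maximum(x, 0),
        "GeLU":    lambda x: x * Phi(x),
        "Sigmoid": lambda x: 1/(1 + np.exp(-x)),
        "Swish":   lambda x: x/(1 + np.exp(-x))}
for name, f in acts.items():
    mu, a1sq = mu_and_a1sq(f)
    print(f"{name}: mu ~ {mu:.2f}, a_1^2 ~ {a1sq:.2f}")
```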
Conclusions

1. Additional Results:
   - Eigenvalues of Compositional Kernels: Generalization Error
   - Numerical Tricks with Guarantees: Stability under truncation
2. Summary of the role of activation functions in DNNs:
   - Composition of kernels as a Branching Process
   - Functionals of activations govern kernel limits
   - Depth bounds depending on activations for memorization
   - New Random Features Algorithm
3. Reference: T. Liang & H. Tran-Bach, "Mehler's Formula, Branching Process, and Compositional Kernels of Deep Neural Networks," arXiv:2004.04767.