1. Mehler’s Formula, Branching Processes, and Compositional Kernels of Deep Neural Networks. Tengyuan Liang, Hai Tran-Bach. May 26, 2020.

2. Motivations & Questions
◮ DNN and Kernels (Rahimi, Recht ’08; Belkin et al ’18; Jacot et al ’19). What role do the activation functions play in the connections between DNNs and kernels?
◮ Interpolation and Generalization (Zhang et al ’17; Belkin et al ’18; Liang, Rakhlin ’18). How does the activation function interplay with depth, sample size, and input dimensionality in terms of memorization capacity?
◮ Is there hope to design activation functions such that we can "compress" multiple layers?

3. Multi-Layer Perceptron with Random Weights (Neal ’96; Rahimi, Recht ’08; Daniely et al ’07)
Input: $x^{(0)} := x \in \mathbb{R}^d$
Hidden layers: $x^{(\ell+1)} := \sigma\big( W^{(\ell)} x^{(\ell)} / \| x^{(\ell)} \| \big) \in \mathbb{R}^{d_{\ell+1}}$, for $0 \le \ell < L$
Random weights: $W^{(\ell)} \in \mathbb{R}^{d_{\ell+1} \times d_\ell}$, $W^{(\ell)} \sim \mathcal{MN}(0, I_{d_{\ell+1}} \otimes I_{d_\ell})$
Regime: $d_1, \ldots, d_L \to \infty$
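As a concrete illustration of the forward pass above, a minimal NumPy sketch (not code from the talk); the layer widths and the specific centered-ReLU normalization are illustrative assumptions.

```python
import numpy as np

def random_mlp_features(x, widths, sigma, seed=0):
    """Forward pass x^(l+1) = sigma(W^(l) x^(l) / ||x^(l)||) with i.i.d. N(0,1) weights."""
    rng = np.random.default_rng(seed)
    h = np.asarray(x, dtype=float)
    for d_out in widths:
        W = rng.standard_normal((d_out, h.shape[0]))   # W^(l) with i.i.d. standard Gaussian entries
        h = sigma(W @ (h / np.linalg.norm(h)))         # normalize the layer input, then activate
    return h

# centered ReLU: max(x,0) shifted and scaled so that E[sigma(Z)] = 0 and E[sigma(Z)^2] = 1 for Z ~ N(0,1)
centered_relu = lambda x: (np.maximum(x, 0.0) - 1 / np.sqrt(2 * np.pi)) / np.sqrt(0.5 - 1 / (2 * np.pi))

features = random_mlp_features(np.ones(10), widths=[512, 512, 512], sigma=centered_relu)
print(features.shape)   # (512,): the last hidden layer x^(L)
```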

4. Duality: Activation and Kernel
Activation, expanded in Hermite polynomials:
$$\sigma(x) = \sum_{k=0}^{\infty} \alpha_k h_k(x), \qquad \text{with } \sum_{k=0}^{\infty} \alpha_k^2 = 1.$$
[Figure: the Hermite polynomials $h_0, \ldots, h_5$ on $[-2, 2]$.]
Dual kernel:
$$K(x_i, x_j) := \mathbb{E}_{w \sim \mathcal{N}(0, I_d)}\big[ \sigma(w^\top x_i / \|x_i\|)\, \sigma(w^\top x_j / \|x_j\|) \big] = \sum_{k=0}^{\infty} \alpha_k^2 \rho_{ij}^k =: G(\rho_{ij}), \qquad \rho_{ij} := \langle x_i/\|x_i\|,\, x_j/\|x_j\| \rangle.$$
Compositional kernel:
$$K^{(L)}(x_i, x_j) = \underbrace{G \circ G \circ \cdots \circ G}_{\text{composed } L \text{ times}}(\rho_{ij}) =: G^{(L)}(\rho_{ij}).$$
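A minimal numerical sketch of this duality and of the composition; the Hermite coefficients below are illustrative, not those of any particular activation from the talk.

```python
import numpy as np

def dual_kernel(alpha2, rho):
    """G(rho) = sum_k alpha_k^2 rho^k, the dual kernel of sigma(x) = sum_k alpha_k h_k(x)."""
    return np.sum(np.asarray(alpha2) * rho ** np.arange(len(alpha2)))

def compositional_kernel(alpha2, rho, L):
    """G^(L)(rho): the dual kernel composed L times, i.e. the kernel of an L-layer random MLP."""
    for _ in range(L):
        rho = dual_kernel(alpha2, rho)
    return rho

alpha2 = [0.0, 0.6, 0.3, 0.1]          # alpha_k^2, summing to 1 (illustrative)
print(compositional_kernel(alpha2, rho=0.5, L=3))
```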

5. Branching Process and Compositional Kernels
Distribution: $Y$, with $P(Y = k) = \alpha_k^2$ and PGF $G$.
Galton-Watson process: $Z^{(L)}$, with offspring $Y$ and PGFs $G^{(L)}$.
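The identification of $G^{(L)}$ with the probability generating function of $Z^{(L)}$ can be checked by simulation. A small sketch with illustrative offspring probabilities:

```python
import numpy as np

def pgf_L(a2, rho, L):
    """G^(L)(rho): the offspring PGF G(rho) = sum_k a2[k] rho^k, composed L times."""
    for _ in range(L):
        rho = np.sum(np.asarray(a2) * rho ** np.arange(len(a2)))
    return rho

def simulate_Z(a2, L, rng):
    """One realization of the Galton-Watson population Z^(L), starting from a single ancestor."""
    z = 1
    for _ in range(L):
        z = int(rng.choice(len(a2), size=z, p=a2).sum()) if z > 0 else 0
    return z

rng = np.random.default_rng(0)
a2, rho, L = [0.0, 0.6, 0.3, 0.1], 0.5, 3           # alpha_k^2, chosen for illustration
mc = np.mean([rho ** simulate_Z(a2, L, rng) for _ in range(20000)])
print(mc, pgf_L(a2, rho, L))                        # E[rho^{Z^(L)}] vs. G^(L)(rho): should agree
```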

6. Rescaled Limit: Phase Transition
Theorem (Liang, Tran-Bach ’20). Define
$$\mu := \sum_{k \ge 0} k\, \alpha_k^2, \qquad \mu_\star := \sum_{k > 2} \alpha_k^2\, k \log k.$$
Then, for all $t > 0$:
1. If $\mu \le 1$,
$$\lim_{L \to \infty} K^{(L)}(e^{-t}) = \begin{cases} 1, & \text{if } \alpha_1 \neq 1, \\ e^{-t}, & \text{if } \alpha_1 = 1. \end{cases}$$
2. If $\mu > 1$,
$$\lim_{L \to \infty} K^{(L)}(e^{-t/\mu^L}) = \begin{cases} \xi + (1-\xi)\, \mathbb{E}\big[e^{-tW}\big], & \text{if } \mu_\star < \infty, \\ 0, & \text{if } \mu_\star = \infty, \end{cases}$$
where $\xi$ is the extinction probability of the Galton-Watson process and $W$ is the limit of $Z^{(L)}/\mu^L$.
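A quick numerical illustration of the two regimes, with illustrative offspring probabilities rather than coefficients of a specific activation:

```python
import numpy as np

def K_L(a2, rho, L):
    """Apply the dual kernel G(rho) = sum_k a2[k] rho^k L times."""
    for _ in range(L):
        rho = np.sum(np.asarray(a2) * rho ** np.arange(len(a2)))
    return rho

t = 1.0
sub = [0.3, 0.5, 0.2]                                 # mu = 0.5 + 2*0.2 = 0.9 <= 1
sup = [0.0, 0.4, 0.6]                                 # mu = 0.4 + 2*0.6 = 1.6 > 1, mu_star finite
mu = sum(k * p for k, p in enumerate(sup))
for L in (1, 5, 20, 40):
    # the unscaled limit degenerates in the subcritical case; rescaling by mu^L keeps a nontrivial limit
    print(L, K_L(sub, np.exp(-t), L), K_L(sup, np.exp(-t / mu ** L), L))
```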

7. Kernel Limits. Example: centered ReLU
Unscaled limit: $K^{(L)}(t)$ for $L = 1, 3, 5, 7$.
[Figure: $K^{(L)}(t)$ on $t \in [-1, 1]$ for $L = 1, 3, 5, 7$; separate panels additionally mark the levels $1/\mu^1$, $1/\mu^3$, $1/\mu^5$, $1/\mu^7$.]
Rescaled limit: $K^{(L)}(e^{-t/\mu^L})$ for $L = 1, 3, 5, 7$.
[Figure: $K^{(L)}(e^{-t/\mu^L})$ on $t \in [0, 1]$ for $L = 1, 3, 5, 7$.]

8. Memorization Capacity
◮ "small correlation": $\sup_{ij} |\rho_{ij}| \approx 0$; $x_1, \ldots, x_n \overset{\mathrm{iid}}{\sim} \mathrm{Unif}(S^{d-1})$ and $\log(n)/d \to 0$.
◮ "large correlation": $\sup_{ij} |\rho_{ij}| \approx 1$; $x_1, \ldots, x_n$ a maximal packing of $S^{d-1}$ and $\log(n)/d \to \infty$.
[Figure: points on the sphere in the small-correlation vs. large-correlation regimes.]

9. Memorization Capacity Theorem
Theorem (Liang & Tran-Bach ’20). A depth of
$$L \gtrsim \frac{\log(n\kappa^{-1}) + \log\frac{\log n}{d}}{\log \alpha_1^{-2}} \quad \text{(small correlation)}, \qquad L \gtrsim \frac{\exp\!\big(\tfrac{2\log n}{d}\big)\, \log(n\kappa^{-1})}{\mu - 1} \quad \text{(large correlation)}$$
suffices to memorize the data in the sense that $1 - \kappa \le \lambda_i \le 1 + \kappa$, where $\lambda_i$ are the eigenvalues of $K := \{K(x_i, x_j)\}_{ij}$.
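To see the eigenvalue statement concretely, a small sketch in the small-correlation regime (random Gaussian data and illustrative coefficients, not the paper's experiment); the spectrum of the compositional kernel matrix tightens around 1 as $L$ grows.

```python
import numpy as np

def composed_gram(X, alpha2, L):
    """Gram matrix {G^(L)(rho_ij)} of the compositional kernel on row-normalized data X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    R = Xn @ Xn.T                                          # correlations rho_ij (diagonal = 1)
    for _ in range(L):
        R = sum(p * R ** k for k, p in enumerate(alpha2))  # apply G entrywise
    return R

rng = np.random.default_rng(0)
n, d = 200, 500                                            # "small correlation": log(n)/d is small
X = rng.standard_normal((n, d))
alpha2 = [0.0, 0.5, 0.3, 0.2]                              # hypothetical centered activation, alpha_1^2 = 0.5
for L in (1, 4, 16):
    lam = np.linalg.eigvalsh(composed_gram(X, alpha2, L))
    print(L, lam.min(), lam.max())                         # both endpoints approach 1 as L grows
```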

10. New Random Features Algorithm
Kernels | Activation, Sampling
shift-invariant (Rahimi, Recht ’08) | cos, sin
≈ inner-product (Liang, Tran-Bach ’20) | ≈ Gaussian
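For contrast, a minimal sketch of both flavors of random features, grounded in the duality of slide 4; this is not claimed to match the paper's exact algorithm, and tanh is only an illustrative activation.

```python
import numpy as np

def fourier_features(X, m, rng):
    """Rahimi-Recht features for the Gaussian (shift-invariant) kernel: cos/sin of Gaussian projections."""
    W = rng.standard_normal((X.shape[1], m))
    Z = X @ W
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(m)

def activation_features(X, m, sigma, rng):
    """Random features for the inner-product dual kernel G: activation applied to Gaussian projections."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)      # inputs normalized as in the MLP model
    W = rng.standard_normal((X.shape[1], m))
    return sigma(Xn @ W) / np.sqrt(m)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))
Phi = activation_features(X, m=20000, sigma=np.tanh, rng=rng)
print(Phi @ Phi.T)   # Monte Carlo approximation of the dual kernel matrix {G(rho_ij)}
```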

11. Experiment: MNIST & CIFAR10
Activation    | ReLU | GeLU | Sigmoid | Swish
$\mu$         | 0.95 | 1.08 | 0.15    | 1.07
$\alpha_1^2$  | 0.50 | 0.59 | 0.15    | 0.80
[Figure: accuracy curves on MNIST and CIFAR10 for L = 1, 2, 3, comparing ReLU, GeLU, Sigmoid, and Swish.]
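The $\mu$ and $\alpha_1^2$ values in the table can be estimated numerically from the Hermite expansion of slide 4. A minimal sketch using Gauss-Hermite quadrature, shown for tanh (an activation not in the table) and truncated at $K = 10$ coefficients:

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermegauss, HermiteE

def hermite_alpha2(sigma, K=10, nodes=100):
    """Normalized squared Hermite coefficients alpha_k^2 of sigma under the standard normal."""
    x, w = hermegauss(nodes)                 # quadrature nodes/weights for the weight exp(-x^2/2)
    w = w / np.sqrt(2 * np.pi)               # renormalize to the N(0,1) density
    a = np.array([np.sum(w * sigma(x) * HermiteE.basis(k)(x)) / np.sqrt(factorial(k))
                  for k in range(K)])
    return a ** 2 / np.sum(a ** 2)           # normalize so that sum_k alpha_k^2 = 1

alpha2 = hermite_alpha2(np.tanh)
mu = np.sum(np.arange(len(alpha2)) * alpha2)
print(alpha2[1], mu)                         # alpha_1^2 and mu (approximate, due to truncation at K)
```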

12. Conclusions
1. Additional Results:
