Mehler's Formula, Branching Processes, and Compositional Kernels of Deep Neural Networks

Tengyuan Liang, Hai Tran-Bach
May 26, 2020
Motivations & Questions

◮ DNN and Kernels (Rahimi, Recht ’08; Belkin et al ’18; Jacot et al ’19). What role do the activation functions play in the connections between DNNs and kernels?
◮ Interpolation and Generalization (Zhang et al ’17; Belkin et al ’18; Liang, Rakhlin ’18). How does the activation function interplay with depth, sample size, and input dimensionality in terms of memorization capacity?
◮ Is there hope to design activation functions such that we can "compress" multiple layers?
Multi-Layer Perceptron with Random Weights

(Neal ’96; Rahimi, Recht ’08; Daniely et al ’16)

Input: $x^{(0)} := x \in \mathbb{R}^d$
Hidden layers: $x^{(\ell+1)} := \sigma\big( W^{(\ell)} x^{(\ell)} / \|x^{(\ell)}\| \big) \in \mathbb{R}^{d_{\ell+1}}$, for $0 \le \ell < L$
Random weights: $W^{(\ell)} \in \mathbb{R}^{d_{\ell+1} \times d_\ell}$, $W^{(\ell)} \sim \mathrm{MN}(0, I_{d_{\ell+1}} \otimes I_{d_\ell})$
Regime: $d_1, \ldots, d_L \to \infty$
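A minimal NumPy sketch of this forward pass (the helper name and the centered, variance-normalized ReLU are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def random_mlp_features(x, widths, sigma, seed=None):
    """One draw of the randomly weighted MLP:
    x^(l+1) = sigma(W^(l) x^(l) / ||x^(l)||), W^(l) with iid N(0,1) entries."""
    rng = np.random.default_rng(seed)
    h = np.asarray(x, dtype=float)
    for d_next in widths:
        W = rng.standard_normal((d_next, h.shape[0]))
        h = sigma(W @ h / np.linalg.norm(h))
    return h

# Centered, variance-normalized ReLU, so that sum_k alpha_k^2 = 1.
relu_c = lambda z: (np.maximum(z, 0) - 1/np.sqrt(2*np.pi)) / np.sqrt(0.5 - 1/(2*np.pi))
x = np.random.default_rng(0).standard_normal(50)   # input in R^d with d = 50
h = random_mlp_features(x, widths=[1000, 1000, 1000], sigma=relu_c, seed=1)
```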
Duality: Activation and Kernel

Activation: $\sigma(x) = \sum_{k=0}^{\infty} \alpha_k h_k(x)$, with $\sum_{k=0}^{\infty} \alpha_k^2 = 1$.

[Figure: the Hermite polynomials $h_0, \ldots, h_5$ on $[-2, 2]$.]

Dual kernel: $K(x_i, x_j) := \mathbb{E}_{w \sim N(0, I_d)}\big[\sigma(w^\top x_i/\|x_i\|)\,\sigma(w^\top x_j/\|x_j\|)\big] = \sum_{k=0}^{\infty} \alpha_k^2\, \rho_{ij}^k =: G(\rho_{ij})$, where $\rho_{ij} := \langle x_i/\|x_i\|,\ x_j/\|x_j\| \rangle$.

Compositional kernel: $K^{(L)}(x_i, x_j) = \underbrace{G \circ G \circ \cdots \circ G}_{L \text{ times}}(\rho_{ij}) =: G^{(L)}(\rho_{ij})$.
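The duality can be checked numerically: the sketch below estimates the Hermite coefficients $\alpha_k$ by Gauss-Hermite quadrature and compares the series $\sum_k \alpha_k^2 \rho^k$ against a Monte Carlo estimate of the dual kernel (helper names are ours; quadrature is one standard way to obtain the $\alpha_k$):

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermegauss, HermiteE

def hermite_coeffs(sigma, kmax=30, nodes=200):
    """alpha_k = E[sigma(Z) h_k(Z)], Z ~ N(0,1), with h_k = He_k / sqrt(k!),
    computed by Gauss-Hermite quadrature."""
    z, w = hermegauss(nodes)       # weight exp(-z^2/2); weights sum to sqrt(2*pi)
    w = w / np.sqrt(2*np.pi)       # renormalize to the standard Gaussian
    return np.array([np.sum(w * sigma(z) * HermiteE.basis(k)(z)) / np.sqrt(factorial(k))
                     for k in range(kmax + 1)])

relu_c = lambda z: (np.maximum(z, 0) - 1/np.sqrt(2*np.pi)) / np.sqrt(0.5 - 1/(2*np.pi))
a = hermite_coeffs(relu_c)
G = lambda rho: np.sum(a**2 * rho**np.arange(a.size))   # dual kernel G(rho)

# Monte Carlo check at rho = 0.3 via a correlated Gaussian pair:
rng = np.random.default_rng(0)
z1 = rng.standard_normal(10**6)
z2 = 0.3*z1 + np.sqrt(1 - 0.3**2)*rng.standard_normal(10**6)
print(G(0.3), np.mean(relu_c(z1)*relu_c(z2)))           # the two should nearly agree
```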
Branching Process and Compositional Kernels

Offspring distribution: $Y$, with $\mathbb{P}(Y = k) = \alpha_k^2$ and PGF $G$.
Galton-Watson process: $Z^{(L)}$, with offspring law $Y$ and PGFs $G^{(L)}$.
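A quick simulation, with a toy offspring law chosen for illustration, confirming that $\mathbb{E}[s^{Z^{(L)}}]$ matches the $L$-fold composition $G^{(L)}(s)$:

```python
import numpy as np

def gw_population(a2, L, runs=20000, seed=0):
    """Sample Z^(L) for a Galton-Watson process with offspring law P(Y=k) = a2[k]."""
    rng = np.random.default_rng(seed)
    ks = np.arange(len(a2))
    Z = np.ones(runs, dtype=int)                 # one ancestor per run
    for _ in range(L):
        Z = np.array([rng.choice(ks, size=z, p=a2).sum() for z in Z])
    return Z

a2 = np.array([0.0, 0.5, 0.3, 0.2])      # toy offspring law, sums to 1
G  = lambda s: np.polyval(a2[::-1], s)   # PGF G(s) = sum_k a2[k] s^k
GL = lambda s, L: s if L == 0 else GL(G(s), L - 1)
Z  = gw_population(a2, L=3)
s  = 0.7
print(np.mean(s**Z), GL(s, 3))           # E[s^(Z^(L))] vs G^(L)(s): nearly equal
```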
Rescaled Limit: Phase Transition

Theorem (Liang, Tran-Bach ’20)
Define $\mu := \sum_{k \ge 0} \alpha_k^2\, k$ and $\mu_\star := \sum_{k > 2} \alpha_k^2\, k \log k$. Then, for all $t > 0$:

1. If $\mu \le 1$,
$$\lim_{L \to \infty} K^{(L)}(e^{-t}) = \begin{cases} 1, & \text{if } \alpha_1 \neq 1 \\ e^{-t}, & \text{if } \alpha_1 = 1 \end{cases}$$

2. If $\mu > 1$,
$$\lim_{L \to \infty} K^{(L)}(e^{-t/\mu^L}) = \begin{cases} \xi + (1 - \xi)\, \mathbb{E}[e^{-tW}], & \text{if } \mu_\star < \infty \\ 0, & \text{if } \mu_\star = \infty \end{cases}$$

Here $\xi$ is the extinction probability of the Galton-Watson process and $W$ is the Kesten-Stigum limit of $Z^{(L)}/\mu^L$ conditioned on survival.
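A numerical illustration of the two regimes, with hypothetical coefficient sequences chosen so that $\mu < 1$ and $\mu > 1$:

```python
import numpy as np

def GL(G, s, L):
    # L-fold composition G^(L)(s) = G(G(...G(s)...))
    for _ in range(L):
        s = G(s)
    return s

t = 1.0
a2_sub = np.array([0.3, 0.5, 0.2])               # mu = 0.9 <= 1 (subcritical)
a2_sup = np.array([0.0, 0.5, 0.5])               # mu = 1.5 >  1 (supercritical)
G_sub = lambda s: np.polyval(a2_sub[::-1], s)
G_sup = lambda s: np.polyval(a2_sup[::-1], s)
mu = 1.5
for L in (1, 5, 10, 20, 40):
    print(L,
          GL(G_sub, np.exp(-t), L),              # -> 1 (since alpha_1 != 1)
          GL(G_sup, np.exp(-t/mu**L), L))        # -> xi + (1-xi) E[exp(-tW)]
```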
Kernel Limits Example: centered ReLU

Unscaled limit $K^{(L)}(t)$:
[Figure: $K^{(L)}$ for the centered ReLU at L = 1, 3, 5, 7, plotted on $[-1, 1]$; one panel per L, each marking the level $1/\mu^L$.]

Rescaled limit $K^{(L)}(e^{-t/\mu^L})$:
[Figure: the rescaled curves for L = 1, 3, 5, 7, converging as L grows.]
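The unscaled curves can be reproduced from the closed-form dual kernel of the ReLU (the degree-1 arc-cosine kernel of Cho & Saul), centered and variance-normalized; treating this as the slides' exact normalization is our assumption:

```python
import numpy as np

def G_relu(rho):
    """Dual kernel of the centered, normalized ReLU: the degree-1 arc-cosine
    kernel, centered by (E ReLU)^2 = 1/(2 pi) and scaled by
    Var(ReLU) = (pi - 1)/(2 pi)."""
    rho = np.clip(rho, -1.0, 1.0)
    return (np.sqrt(1 - rho**2) + rho*(np.pi - np.arccos(rho)) - 1) / (np.pi - 1)

rho = np.linspace(-1, 1, 9)
K = rho.copy()
for L in range(1, 8):
    K = G_relu(K)                 # K^(L)(rho) = G^(L)(rho)
    if L in (1, 3, 5, 7):
        print(L, np.round(K, 3))  # curves flatten toward 0 as L grows
```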
Memorization Capacity

◮ "small correlation": $\sup_{i \ne j} |\rho_{ij}| \approx 0$, e.g. $x_1, \ldots, x_n \overset{iid}{\sim} \mathrm{Unif}(S^{d-1})$ with $\log(n)/d \to 0$ (see the sketch below).
◮ "large correlation": $\sup_{i \ne j} |\rho_{ij}| \approx 1$, e.g. $x_1, \ldots, x_n$ a maximal packing of $S^{d-1}$ with $\log(n)/d \to \infty$.

[Figure: example point configurations in the small-correlation and large-correlation regimes.]
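A quick empirical check of the small-correlation regime; the $\sqrt{\log(n)/d}$ scale in the comment is the standard heuristic for iid points on the sphere:

```python
import numpy as np

# Small-correlation regime: for iid uniform points on S^{d-1},
# max_{i != j} |rho_ij| is on the order of sqrt(log(n)/d).
rng = np.random.default_rng(0)
n, d = 2000, 500                                  # log(n)/d is small
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # uniform on the sphere
R = X @ X.T
np.fill_diagonal(R, 0.0)
print(np.abs(R).max(), 2*np.sqrt(np.log(n)/d))    # comparable magnitudes
```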
Memorization Capacity Theorem

Theorem (Liang & Tran-Bach ’20)
It suffices to take the depth

$$L \gtrsim \frac{\log(n\kappa^{-1}) + \log\frac{\log n}{d}}{\log \alpha_1^{-2}} \quad \text{(small correlation)}, \qquad L \gtrsim \exp\!\Big(\frac{2\log n}{d}\Big)\, \frac{\log(n\kappa^{-1})}{\mu - 1} \quad \text{(large correlation)}$$

to memorize the data, in the sense that $1 - \kappa \le \lambda_i \le 1 + \kappa$, where $\lambda_i$ are the eigenvalues of $K := \{K(x_i, x_j)\}_{ij}$.
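A numerical sanity check of the theorem (not the paper's proof): iterate the centered-ReLU kernel on iid sphere data and watch the eigenvalues of $K$ concentrate around 1 as the depth grows:

```python
import numpy as np

def G_relu(rho):  # centered, normalized ReLU kernel, as in the sketch above
    rho = np.clip(rho, -1.0, 1.0)
    return (np.sqrt(1 - rho**2) + rho*(np.pi - np.arccos(rho)) - 1) / (np.pi - 1)

rng = np.random.default_rng(1)
n, d = 200, 400                                  # small-correlation regime
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
K = np.clip(X @ X.T, -1.0, 1.0)                  # rho_ij, with rho_ii = 1
for L in range(1, 11):
    K = G_relu(K)
    if L % 2 == 0:
        lam = np.linalg.eigvalsh(K)
        print(L, round(lam.min(), 3), round(lam.max(), 3))  # -> [1-kappa, 1+kappa]
```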
New Random Features Algorithm

Kernels                                    Activation Sampling
shift-invariant (Rahimi, Recht ’08)        cos, sin
≈ inner-product (Liang, Tran-Bach ’20)     ≈ Gaussian
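A simplified random-features sketch of the idea, with wide layers and fresh Gaussian weights per layer (the paper's algorithm refines the sampling, hence the "≈ Gaussian" entry above):

```python
import numpy as np

def features(X, L, width, sigma, seed=0):
    """Random features: L layers of fresh Gaussian weights on normalized inputs."""
    rng = np.random.default_rng(seed)
    H = X / np.linalg.norm(X, axis=1, keepdims=True)
    for _ in range(L):
        W = rng.standard_normal((H.shape[1], width))
        H = sigma(H @ W / np.linalg.norm(H, axis=1, keepdims=True))
    return H

relu_c = lambda z: (np.maximum(z, 0) - 1/np.sqrt(2*np.pi)) / np.sqrt(0.5 - 1/(2*np.pi))
X = np.random.default_rng(0).standard_normal((3, 50))
Phi = features(X, L=2, width=2000, sigma=relu_c)
print(Phi @ Phi.T / Phi.shape[1])   # approximates the 3x3 matrix K^(2)(rho_ij)
```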
Experiment: MNIST & CIFAR10

Activation:   ReLU   GeLU   Sigmoid   Swish
µ:            0.95   1.08   0.15      1.07
α₁²:          0.50   0.59   0.15      0.80

[Figure: performance on MNIST (L = 1, 2, 3) and CIFAR10 (L = 1, 2, 3) for ReLU, GeLU, Sigmoid, and Swish.]
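The µ and α₁² rows can be approximated by truncating the Hermite series of each activation; normalizing to $\mathbb{E}[\sigma(Z)^2] = 1$ and truncating at kmax = 30 are our choices, so values will differ slightly from the table:

```python
import numpy as np
from math import erf, factorial
from numpy.polynomial.hermite_e import hermegauss, HermiteE

z, w = hermegauss(300)
w = w / np.sqrt(2*np.pi)                     # quadrature for expectations under N(0,1)

def mu_and_a1sq(sigma, kmax=30):
    s = sigma(z)
    s = s / np.sqrt(np.sum(w * s**2))        # normalize so that sum_k a_k^2 = 1
    a2 = np.array([np.sum(w * s * HermiteE.basis(k)(z))**2 / factorial(k)
                   for k in range(kmax + 1)])
    return np.sum(np.arange(kmax + 1) * a2), a2[1]

Phi = np.vectorize(lambda x: 0.5*(1 + erf(x/np.sqrt(2))))   # standard normal CDF
acts = {"ReLU":    lambda x: np.maximum(x, 0),
        "GeLU":    lambda x: x * Phi(x),
        "Sigmoid": lambda x: 1/(1 + np.exp(-x)),
        "Swish":   lambda x: x/(1 + np.exp(-x))}
for name, f in acts.items():
    mu, a1sq = mu_and_a1sq(f)
    print(f"{name}: mu ~ {mu:.2f}, a_1^2 ~ {a1sq:.2f}")
```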
Conclusions

1. Additional Results:
   - Eigenvalues of Compositional Kernels: Generalization Error
   - Numerical Tricks with Guarantees: Stability under truncation
2. Summary of the role of activation functions in DNNs:
   - Composition of kernels as a Branching Process
   - Functionals of activations govern kernel limits
   - Depth bounds depending on activations for memorization
   - New Random Features Algorithm
3. Reference: T. Liang & H. Tran-Bach, "Mehler's Formula, Branching Process, and Compositional Kernels of Deep Neural Networks," arXiv:2004.04767.