SLIDE 1

Mehler’s Formula, Branching Processes, and Compositional Kernels of Deep Neural Networks

Tengyuan Liang and Hai Tran-Bach, May 26, 2020

SLIDES 2-4

Motivations & Questions

◮ DNN and Kernels (Rahimi, Recht ’08; Belkin et al. ’18; Jacot et al. ’19). What role do the activation functions play in the connections between DNNs and kernels?

◮ Interpolation and Generalization (Zhang et al. ’17; Belkin et al. ’18; Liang, Rakhlin ’18). How does the activation function interact with depth, sample size, and input dimensionality in terms of memorization capacity?

◮ Is there hope to design activation functions such that we can “compress” multiple layers?

SLIDE 5

Multi-Layer Perceptron with Random Weights

(Neal ’96; Rahimi, Recht ’08; Daniely et al. ’16)

Input: $x^{(0)} := x \in \mathbb{R}^d$

Hidden Layers: $x^{(\ell+1)} := \sigma\!\left(W^{(\ell)} x^{(\ell)} / \|x^{(\ell)}\|\right) \in \mathbb{R}^{d_{\ell+1}}$, for $0 \le \ell < L$

Random Weights: $W^{(\ell)} \in \mathbb{R}^{d_{\ell+1} \times d_\ell}$, with $W^{(\ell)} \sim \mathrm{MN}(0,\, I_{d_{\ell+1}} \otimes I_{d_\ell})$

Regime: $d_1, \ldots, d_L \to \infty$
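Below is a minimal NumPy sketch of one draw of this random-weight forward pass; the widths, depth, and ReLU activation are illustrative choices, not the settings used in the paper.

```python
import numpy as np

def forward(x, widths, sigma, rng):
    """One draw of the random-weight MLP: x^(l+1) = sigma(W^(l) x^(l) / ||x^(l)||)."""
    h = np.asarray(x, dtype=float)
    for d_out in widths:
        W = rng.standard_normal((d_out, h.shape[0]))  # W^(l) with iid N(0, 1) entries
        h = sigma(W @ h / np.linalg.norm(h))          # normalize the current layer's input
    return h

rng = np.random.default_rng(0)
x = rng.standard_normal(16)                            # input x in R^d, d = 16
relu = lambda z: np.maximum(z, 0.0)
out = forward(x, widths=[512, 512, 512], sigma=relu, rng=rng)  # L = 3 hidden layers
print(out.shape)                                       # (512,)
```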

SLIDES 6-8

Duality: Activation and Kernel

Activation: $\sigma(x) = \sum_{k=0}^{\infty} \alpha_k h_k(x)$, with $\sum_{k=0}^{\infty} \alpha_k^2 = 1$ (the $h_k$ are the normalized Hermite polynomials).

[Figure: Hermite polynomials $h_0, \ldots, h_5$ on $[-2, 2]$.]

Dual Kernel: $K(x_i, x_j) := \mathbb{E}_{w \sim N(0, I_d)}\!\left[\sigma(w^\top x_i / \|x_i\|)\, \sigma(w^\top x_j / \|x_j\|)\right] = \sum_{k=0}^{\infty} \alpha_k^2\, \rho_{ij}^k =: G(\rho_{ij})$, where $\rho_{ij} := \langle x_i / \|x_i\|,\, x_j / \|x_j\| \rangle$.

Compositional Kernel: $K^{(L)}(x_i, x_j) = \underbrace{G \circ G \circ \cdots \circ G}_{L \text{ times}}(\rho_{ij}) =: G^{(L)}(\rho_{ij})$.
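The duality can be checked numerically. The sketch below estimates the Hermite coefficients $\alpha_k$ of an example activation by Gauss-Hermite quadrature (the quadrature approach and the truncation level K = 30 are my choices, not the paper's) and compares the series $G(\rho) = \sum_k \alpha_k^2 \rho^k$ against a Monte Carlo estimate of the dual kernel.

```python
import numpy as np
from numpy.polynomial import hermite, hermite_e
from math import factorial, sqrt

def gauss_expect(f, n=80):
    """E[f(Z)] for Z ~ N(0, 1) via Gauss-Hermite quadrature."""
    x, w = hermite.hermgauss(n)
    return float((w * f(np.sqrt(2.0) * x)).sum() / np.sqrt(np.pi))

def hermite_coeffs(sigma, K=30):
    """alpha_k = E[sigma(Z) h_k(Z)], h_k the normalized (probabilists') Hermite polynomials."""
    a = np.array([gauss_expect(lambda z, k=k: sigma(z)
                  * hermite_e.hermeval(z, np.eye(K + 1)[k]) / sqrt(factorial(k)))
                  for k in range(K + 1)])
    return a / np.linalg.norm(a)               # renormalize so that sum_k alpha_k^2 = 1

sigma = lambda z: np.maximum(z, 0.0)            # example activation: ReLU
alpha = hermite_coeffs(sigma)
G = lambda rho: float((alpha**2 * rho ** np.arange(alpha.size)).sum())

# Monte Carlo estimate of E_w[sigma(w.u_i) sigma(w.u_j)] / E[sigma(Z)^2], which should equal G(rho_ij)
rng = np.random.default_rng(1)
d = 8
xi, xj = rng.standard_normal(d), rng.standard_normal(d)
ui, uj = xi / np.linalg.norm(xi), xj / np.linalg.norm(xj)
W = rng.standard_normal((200_000, d))
mc = np.mean(sigma(W @ ui) * sigma(W @ uj)) / gauss_expect(lambda z: sigma(z) ** 2)
print(mc, G(ui @ uj))                           # the two numbers should nearly agree
```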

SLIDE 9

Branching Process and Compositional Kernels

Distribution: $Y$, with $P(Y = k) = \alpha_k^2$ and probability generating function (PGF) $G$.

Galton-Watson Process: $Z^{(L)}$, with offspring $Y$ and PGFs $G^{(L)}$, i.e. $\mathbb{E}\big[s^{Z^{(L)}}\big] = G^{(L)}(s)$.
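A quick simulation makes the correspondence concrete: with offspring distribution $P(Y = k) = \alpha_k^2$, the PGF of generation $L$ of the Galton-Watson process is exactly the $L$-fold composition $G^{(L)}$. The offspring probabilities below are illustrative numbers, not the coefficients of any particular activation.

```python
import numpy as np

# Example offspring distribution P(Y = k) = alpha_k^2 (illustrative numbers summing to 1)
p = np.array([0.2, 0.5, 0.2, 0.1])                       # k = 0, 1, 2, 3
G = lambda s: float((p * s ** np.arange(p.size)).sum())  # offspring PGF G(s) = sum_k p_k s^k

def G_L(s, L):                                            # L-fold composition G^(L)
    for _ in range(L):
        s = G(s)
    return s

def simulate_Z(L, rng):                                   # population at generation L, starting from 1
    z = 1
    for _ in range(L):
        z = int(rng.choice(p.size, size=z, p=p).sum()) if z > 0 else 0
    return z

rng = np.random.default_rng(0)
L, s = 3, 0.7
emp = np.mean([s ** simulate_Z(L, rng) for _ in range(20_000)])
print(emp, G_L(s, L))                                     # E[s^(Z^(L))] should be close to G^(L)(s)
```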

SLIDE 10

Rescaled Limit: Phase Transition

Theorem (Liang, Tran-Bach ’20)

Define
$$\mu := \sum_{k \ge 0} \alpha_k^2\, k, \qquad \mu_\star := \sum_{k > 2} \alpha_k^2\, k \log k.$$
Then, for all $t > 0$:

1. If $\mu \le 1$,
$$\lim_{L \to \infty} K^{(L)}(e^{-t}) =
\begin{cases} 1, & \text{if } \alpha_1 \ne 1 \\ e^{-t}, & \text{if } \alpha_1 = 1 \end{cases}$$

2. If $\mu > 1$,
$$\lim_{L \to \infty} K^{(L)}(e^{-t/\mu^L}) =
\begin{cases} \xi + (1 - \xi)\, \mathbb{E}\!\left[e^{-tW}\right], & \text{if } \mu_\star < \infty \\ 0, & \text{if } \mu_\star = \infty \end{cases}$$

Here $\xi$ denotes the extinction probability of the Galton-Watson process and $W$ the limit of $Z^{(L)}/\mu^L$.
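A minimal numeric check of case 1, with made-up coefficients: when $\mu \le 1$ and $\alpha_1 \ne 1$, the iterates $G^{(L)}(e^{-t})$ drift to 1, while $\alpha_1 = 1$ makes $G$ the identity and the kernel stays at $e^{-t}$. Case 2 involves the martingale limit $W$ and is not checked here.

```python
import numpy as np

def iterate(G, s, L):
    for _ in range(L):
        s = G(s)
    return s

t = 0.3
s0 = np.exp(-t)

# Subcritical example: alpha_0^2 = 0.3, alpha_1^2 = 0.6, alpha_2^2 = 0.1  =>  mu = 0.8 <= 1
p = np.array([0.3, 0.6, 0.1])
G_sub = lambda s: float((p * s ** np.arange(3)).sum())
print(iterate(G_sub, s0, 200))            # tends to 1, since alpha_1 != 1

# Degenerate case alpha_1 = 1: G is the identity, so the limit is e^{-t} itself
print(iterate(lambda s: s, s0, 200), s0)
```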

SLIDES 11-13

Kernel Limits Example: centered ReLU

Unscaled Limit: $K^{(L)}(t)$

[Figure: $K^{(L)}(t)$ for the centered ReLU activation on $t \in [-1, 1]$, for $L = 1, 3, 5, 7$; separate panels mark the level $1/\mu^L$ for each depth.]

Rescaled Limit: $K^{(L)}(e^{-t/\mu^L})$

[Figure: the rescaled kernel $K^{(L)}(e^{-t/\mu^L})$ for the centered ReLU activation, $L = 1, 3, 5, 7$.]
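Curves like those in the unscaled-limit figure can be reproduced with a short script. The sketch below uses the standard closed form for the ReLU dual kernel (the degree-1 arc-cosine kernel), centered and normalized so that $G(1) = 1$; that closed form is my assumption, consistent with the definitions on Slides 6-8, not a formula stated on this slide.

```python
import numpy as np
import matplotlib.pyplot as plt

def G_centered_relu(rho):
    """Dual kernel of the centered, variance-normalized ReLU (degree-1 arc-cosine kernel, centered)."""
    rho = np.clip(rho, -1.0, 1.0)
    raw = (np.sqrt(1.0 - rho**2) + rho * (np.pi - np.arccos(rho))) / (2.0 * np.pi)  # E[ReLU(u) ReLU(v)]
    return (raw - 1.0 / (2.0 * np.pi)) / (0.5 - 1.0 / (2.0 * np.pi))                # subtract mean^2, divide by variance

def G_L(rho, L):
    out = np.asarray(rho, dtype=float)
    for _ in range(L):
        out = G_centered_relu(out)
    return out

rho = np.linspace(-1.0, 1.0, 401)
for L in (1, 3, 5, 7):
    plt.plot(rho, G_L(rho, L), label=f"L = {L}")
plt.xlabel("correlation rho")
plt.ylabel("K^(L)")
plt.legend()
plt.show()
```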

SLIDES 14-15

Memorization Capacity

◮ “small correlation”: $\sup_{i \ne j} |\rho_{ij}| \approx 0$

  $x_1, \ldots, x_n \overset{\text{iid}}{\sim} \mathrm{Unif}(S^{d-1})$ and $\log(n)/d \to 0$

◮ “large correlation”: $\sup_{i \ne j} |\rho_{ij}| \approx 1$

  $x_1, \ldots, x_n$ a maximal packing of $S^{d-1}$ and $\log(n)/d \to \infty$

[Figure: the small-correlation and large-correlation regimes.]
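A quick sanity check of the small-correlation regime: for iid uniform points on $S^{d-1}$, $\sup_{i \ne j} |\rho_{ij}|$ shrinks as $\log(n)/d \to 0$. The sample sizes and dimensions below are illustrative.

```python
import numpy as np

def max_abs_corr(n, d, rng):
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # n iid points, uniform on S^{d-1}
    R = X @ X.T
    np.fill_diagonal(R, 0.0)
    return float(np.abs(R).max())

rng = np.random.default_rng(0)
n = 500
for d in (10, 100, 1000, 10000):
    print(d, np.log(n) / d, max_abs_corr(n, d, rng))  # sup_{i != j} |rho_ij| shrinks as log(n)/d -> 0
```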

SLIDES 16-19

Memorization Capacity Theorem

Theorem (Liang & Tran-Bach ’20)

A depth of order
$$L \asymp \frac{\log(n\kappa^{-1}) + \log\frac{\log n}{d}}{\log \alpha_1^{-2}} \quad \text{(small correlation)}, \qquad L \asymp \frac{\exp\!\left(2\,\frac{\log n}{d}\right)\log(n\kappa^{-1})}{\mu - 1} \quad \text{(large correlation)}$$
suffices to memorize the data, in the sense that $1 - \kappa \le \lambda_i \le 1 + \kappa$, where the $\lambda_i$ are the eigenvalues of $K := \{K^{(L)}(x_i, x_j)\}_{ij}$.
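As an illustration (not the theorem's exact constants or regimes), the sketch below builds the compositional kernel Gram matrix $K^{(L)} = \{G^{(L)}(\rho_{ij})\}$ for iid sphere data, reusing the centered-ReLU dual kernel closed form assumed earlier, and tracks how far the eigenvalues are from 1 as the depth grows.

```python
import numpy as np

def G_centered_relu(rho):
    rho = np.clip(rho, -1.0, 1.0)
    raw = (np.sqrt(1.0 - rho**2) + rho * (np.pi - np.arccos(rho))) / (2.0 * np.pi)
    return (raw - 1.0 / (2.0 * np.pi)) / (0.5 - 1.0 / (2.0 * np.pi))

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
K = X @ X.T                                   # rho_ij: the small-correlation regime

for L in range(1, 9):
    K = G_centered_relu(K)                    # apply G entrywise: K becomes K^(L)
    lam = np.linalg.eigvalsh(K)
    print(L, float(np.abs(lam - 1.0).max()))  # the kappa achieved at depth L
```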


SLIDES 20-21

New Random Features Algorithm

Kernels                                | Activation | Sampling
shift-invariant (Rahimi, Recht ’08)    | cos, sin   | ≈
inner-product (Liang, Tran-Bach ’20)   | ≈          | Gaussian
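The table cells above are only partly recoverable from the slide, so the code below is not the paper's algorithm. It is one natural single-layer random-features construction suggested by the duality on Slides 6-8 (essentially Mehler's formula): sample $w \sim N(0, I_d)$ and a Hermite degree $k$ with probability $\alpha_k^2$, and use the feature $h_k(\langle w, x/\|x\|\rangle)$; in expectation the product of two such features is $\sum_k \alpha_k^2 \rho^k = G(\rho)$. The $\alpha_k^2$ values, the sizes, and the helper name hermite_features are all illustrative.

```python
import numpy as np
from numpy.polynomial import hermite_e
from math import factorial, sqrt

def hermite_features(X, M, p, rng):
    """phi_m(x) = h_{k_m}(<w_m, x/||x||>), w_m ~ N(0, I_d), k_m ~ p; E[phi(x) phi(y)] = G(rho)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    W = rng.standard_normal((M, X.shape[1]))
    ks = rng.choice(p.size, size=M, p=p)
    Z = Xn @ W.T                                           # projections <w_m, x/||x||>
    Phi = np.empty_like(Z)
    for k in np.unique(ks):
        cols = ks == k
        coef = np.eye(p.size)[k]                            # coefficient vector selecting He_k
        Phi[:, cols] = hermite_e.hermeval(Z[:, cols], coef) / sqrt(factorial(int(k)))
    return Phi / np.sqrt(M)                                 # Phi @ Phi.T then estimates the kernel

p = np.array([0.0, 0.6, 0.3, 0.1])                          # illustrative alpha_k^2 (alpha_0 = 0: centered)
G = lambda rho: float((p * rho ** np.arange(p.size)).sum())

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 20))
Phi = hermite_features(X, M=200_000, p=p, rng=rng)
rho = float((X[0] / np.linalg.norm(X[0])) @ (X[1] / np.linalg.norm(X[1])))
print((Phi @ Phi.T)[0, 1], G(rho))                          # should roughly agree
```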

SLIDES 22-23

Experiment: MNIST & CIFAR10

Activation | ReLU | GeLU | Sigmoid | Swish
µ          | 0.95 | 1.08 | 0.15    | 1.07
α_1^2      | 0.50 | 0.59 | 0.15    | 0.80

[Figure: six panels of results on MNIST (L = 1, 2, 3) and CIFAR10 (L = 1, 2, 3), comparing ReLU, GeLU, Sigmoid, and Swish.]


SLIDES 24-32

Conclusions

1. Additional results:
   - Eigenvalues of compositional kernels: generalization error
   - Numerical tricks with guarantees: stability under truncation

2. Summary of the role of activation functions in DNNs:
   - Composition of kernels as a branching process
   - Functionals of the activation govern the kernel limits
   - Depth bounds for memorization depend on the activation
   - A new random features algorithm

3. Reference: T. Liang & H. Tran-Bach, “Mehler’s Formula, Branching Process, and Compositional Kernels of Deep Neural Networks”, arXiv:2004.04767.
