Understanding Priors in Bayesian Neural Networks at the Unit Level


SLIDE 1

Understanding Priors in Bayesian Neural Networks at the Unit Level

Mariia Vladimirova, Jakob Verbeek, Pablo Mesejo, Julyan Arbel Inria, Grenoble, France

mariia.vladimirova@inria.fr International Conference on Machine Learning June 13, 2019

1/9

SLIDE 2

2/9

Outline

  • Sub-Weibull distributions
  • Main result: prior on units gets heavier-tailed with depth
  • Regularization interpretation

SLIDE 3

3/9

Distribution families with respect to tail behavior

For all k ∈ N, the k-th root moment is ‖X‖_k = (E|X|^k)^(1/k). Writing F̄(x) for the survival function:

Distribution      Tail bound               Moment growth
Sub-Gaussian      F̄(x) ≤ e^(−λx²)         ‖X‖_k ≤ C √k
Sub-Exponential   F̄(x) ≤ e^(−λx)          ‖X‖_k ≤ C k
Sub-Weibull       F̄(x) ≤ e^(−λx^(1/θ))    ‖X‖_k ≤ C k^θ

  • θ > 0 is called the tail parameter
  • ‖X‖_k ≍ k^θ ⇒ X ∼ subW(θ), with θ called optimal
  • subW(1/2) = subG, subW(1) = subE
  • Larger θ implies a heavier tail
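The moment characterization can be sanity-checked by Monte Carlo: a Gaussian's root moments ‖X‖_k grow like √k (subW(1/2)), while an exponential's grow like k (subW(1)). A minimal sketch; the distributions, sample size, and moment orders are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def root_moment(x, k):
    # k-th root moment ||X||_k = (E|X|^k)^(1/k), estimated from samples
    return np.mean(np.abs(x) ** k) ** (1.0 / k)

gauss = rng.normal(size=n)       # sub-Gaussian: ||X||_k ≍ k^(1/2)
expo = rng.exponential(size=n)   # sub-exponential: ||X||_k ≍ k

for k in (2, 4, 8):
    print(k, root_moment(gauss, k), root_moment(expo, k))
```

The heavier-tailed exponential's root moments grow faster in k, in line with ‖X‖_k ≍ k^θ.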

Vladimirova et al. Understanding priors in Bayesian neural networks at the unit level

SLIDE 4

4/9

Outline

  • Sub-Weibull distributions
  • Main result: prior on units gets heavier-tailed with depth
  • Regularization interpretation

SLIDE 5

5/9

Assumptions on neural network

We analyze Bayesian neural networks satisfying the following assumptions.

(A1) Parameters. The weights w have i.i.d. Gaussian priors, w ∼ N(µ, σ²).
(A2) Nonlinearity. ReLU-like, with the envelope property: there exist c1, c2, d2 ≥ 0 and d1 > 0 such that
  |φ(u)| ≥ c1 + d1|u| for all u ∈ R+ or for all u ∈ R−,
  |φ(u)| ≤ c2 + d2|u| for all u ∈ R.

  • Examples: ReLU, ELU, PReLU, etc., but not bounded nonlinearities such as sigmoid and tanh.
  • The nonlinearity does not alter the distributional tail:

‖φ(X)‖_k ≍ ‖X‖_k, k ∈ N
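The envelope property (A2) can be checked numerically for ReLU; a minimal sketch, where the constants c1 = c2 = 0 and d1 = d2 = 1 are illustrative choices that happen to work for ReLU on R+:

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

u = np.linspace(-10.0, 10.0, 2001)
pos = u[u >= 0]

# Lower envelope |phi(u)| >= c1 + d1*|u| on R+, with c1 = 0, d1 = 1
assert np.all(np.abs(relu(pos)) >= 0.0 + 1.0 * np.abs(pos) - 1e-12)
# Upper envelope |phi(u)| <= c2 + d2*|u| on all of R, with c2 = 0, d2 = 1
assert np.all(np.abs(relu(u)) <= 0.0 + 1.0 * np.abs(u) + 1e-12)
print("ReLU satisfies the envelope property with c1 = c2 = 0, d1 = d2 = 1")
```

A bounded nonlinearity such as tanh fails the lower envelope on both half-lines, since no d1 > 0 keeps c1 + d1|u| below a bounded function for large |u|.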



SLIDE 9

6/9

Main theorem

Consider a Bayesian neural network with (A1) i.i.d. Gaussian priors on the weights and (A2) a nonlinearity satisfying the envelope property. Then, conditionally on the input x, the marginal prior distribution of a unit u(ℓ) in the ℓ-th hidden layer is sub-Weibull with optimal tail parameter θ = ℓ/2:

π(ℓ)(u) ∼ subW(ℓ/2)

[Figure: network diagram with input x and hidden units u(1), u(2), u(3), …, u(ℓ) distributed as subW(1/2), subW(1), subW(3/2), …, subW(ℓ/2); log-survival plot of log P(X ≥ x) comparing subW(1/2), subW(1), subW(3/2), subW(5), subW(50).]
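The theorem can be illustrated by simulating the prior: draw fresh Gaussian weights for each sample, propagate a fixed input through ReLU layers, and compare tail heaviness across depths via kurtosis. A minimal Monte Carlo sketch; the width, sample count, and all-ones input are illustrative assumptions, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_unit(depth, width=20, n=30_000):
    """Monte Carlo draws of one hidden unit at the given depth: fresh
    i.i.d. N(0, 1/width) weights per draw, ReLU nonlinearity, input
    fixed at x = (1, ..., 1)."""
    h = np.ones((n, width))
    for _ in range(depth):
        w = rng.normal(0.0, 1.0 / np.sqrt(width), size=(n, width, width))
        h = np.maximum(np.einsum('nij,ni->nj', w, h), 0.0)
    return h[:, 0]

def kurtosis(z):
    # 4th standardized moment; grows as tails get heavier (3 for a Gaussian)
    z = z - z.mean()
    return np.mean(z ** 4) / np.mean(z ** 2) ** 2

k1 = kurtosis(sample_unit(1))   # subW(1/2): rectified Gaussian
k3 = kurtosis(sample_unit(3))   # subW(3/2): noticeably heavier tail
print(f"kurtosis at depth 1: {k1:.1f}, at depth 3: {k3:.1f}")
```

The depth-3 unit shows a much larger kurtosis, consistent with the tail parameter growing as ℓ/2.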


SLIDE 10

7/9

Outline

  • Sub-Weibull distributions
  • Main result: prior on units gets heavier-tailed with depth
  • Regularization interpretation

SLIDE 11

8/9

Interpretation: shrinkage effect

Maximum a posteriori (MAP) estimation is a regularized optimization problem:

max_W π(W|D) ∝ L(D|W) π(W)
⇔ min_W −log L(D|W) − log π(W)
⇔ min_W L(W) + λ R(W)

where L(W) is a loss function and R(W) is a norm on R^p, the regularizer.

Weight prior: π(w) ≈ e^(−w²). ℓ-th layer unit prior: π(ℓ)(u) ≈ e^(−u^(2/ℓ)).

Layer   Penalty on W       Penalty on U
1       ‖W(1)‖₂², L2       ‖U(1)‖₂², L2 (weight decay)
2       ‖W(2)‖₂², L2       ‖U(2)‖₁, L1 (Lasso)
ℓ       ‖W(ℓ)‖₂², L2       ‖U(ℓ)‖_(2/ℓ)^(2/ℓ), L2/ℓ

[Figure: equidensity contours of the unit prior in the (U1, U2) plane for layers 1, 2, 3, and 10.]
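The sparsity-favoring effect of the L2/ℓ unit penalty can be seen by comparing a spread-out vector with a concentrated one of the same L2 norm: for ℓ = 1 (exponent q = 2) they cost the same, while for ℓ ≥ 2 (q ≤ 1) the concentrated vector is cheaper. A minimal sketch; the example vectors are illustrative:

```python
import numpy as np

def lq_penalty(u, q):
    # R(u) = sum_i |u_i|^q; the L2/l unit penalty corresponds to q = 2/l
    return np.sum(np.abs(u) ** q)

dense = np.array([0.5, 0.5, 0.5, 0.5])    # spread-out, L2 norm = 1
sparse = np.array([1.0, 0.0, 0.0, 0.0])   # concentrated, L2 norm = 1

for layer in (1, 2, 10):
    q = 2.0 / layer
    print(layer, lq_penalty(dense, q), lq_penalty(sparse, q))
```

As the layer index grows, q = 2/ℓ shrinks and the penalty gap between dense and sparse vectors widens, which is the sense in which deeper unit priors induce stronger shrinkage toward sparsity.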


SLIDE 12

9/9

Conclusion

(i) We define the notion of sub-Weibull distributions, characterized by tails at most as heavy as Weibull tails.
(ii) We prove that the marginal prior distributions of the units become heavier-tailed as depth increases.
(iii) We offer an interpretation from a regularization viewpoint.

Future directions:

  • Prove the Gaussian process limit of sub-Weibull distributions in the wide regime;
  • Investigate if the described regularization mechanism induces sparsity at the unit level.
