
Understanding Priors in Bayesian Neural Networks at the Unit Level. Mariia Vladimirova, Jakob Verbeek, Pablo Mesejo, Julyan Arbel. Inria, Grenoble, France. mariia.vladimirova@inria.fr. International Conference on Machine Learning, June 13, 2019.


1. Understanding Priors in Bayesian Neural Networks at the Unit Level. Mariia Vladimirova, Jakob Verbeek, Pablo Mesejo, Julyan Arbel. Inria, Grenoble, France. mariia.vladimirova@inria.fr. International Conference on Machine Learning, June 13, 2019.

2. Outline
• Sub-Weibull distributions
• Main result: the prior on units gets heavier-tailed with depth
• Regularization interpretation

3. Distribution families with respect to tail behavior
For all k ∈ ℕ, the k-th root moment is ‖X‖_k = (E|X|^k)^(1/k).

Distribution      Tail bound F̄(x)        Moment bound
Sub-Gaussian      F̄(x) ≤ e^(−λx²)        ‖X‖_k ≤ C√k
Sub-Exponential   F̄(x) ≤ e^(−λx)         ‖X‖_k ≤ Ck
Sub-Weibull       F̄(x) ≤ e^(−λx^(1/θ))   ‖X‖_k ≤ Ck^θ

• θ > 0 is called the tail parameter
• ‖X‖_k ≍ k^θ ⟹ X ∼ subW(θ), with θ called optimal
• subW(1/2) = subG, subW(1) = subE
• A larger θ implies a heavier tail
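The moment characterization can be checked by Monte Carlo. The sketch below (not from the slides; sample size and the range of k are our own choices) estimates ‖X‖_k = (E|X|^k)^(1/k) empirically and fits the slope of log ‖X‖_k against log k, which should separate sub-Gaussian samples (θ = 1/2) from sub-Exponential ones (θ = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)

def moment_growth_slope(samples, ks):
    # ||X||_k = (E|X|^k)^(1/k); for subW(theta) with optimal theta,
    # log||X||_k grows roughly like theta * log k
    norms = [np.mean(np.abs(samples) ** k) ** (1.0 / k) for k in ks]
    return float(np.polyfit(np.log(ks), np.log(norms), 1)[0])

ks = np.arange(1, 9)
slope_gauss = moment_growth_slope(z, ks)         # Gaussian: subW(1/2)
slope_exp = moment_growth_slope(z ** 2, ks[:4])  # squared Gaussian: subW(1)
```

At such small k the fitted slopes sit below their asymptotic values θ, but the ordering (squared-Gaussian slope clearly larger than Gaussian slope) already shows the heavier tail.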

4. Outline
• Sub-Weibull distributions
• Main result: the prior on units gets heavier-tailed with depth
• Regularization interpretation

5. Assumptions on the neural network
We analyze Bayesian neural networks that satisfy the following assumptions.
(A1) Parameters. The weights w have i.i.d. Gaussian priors, w ∼ N(µ, σ²).
(A2) Nonlinearity. ReLU-like with an envelope property: there exist c₁, c₂, d₂ ≥ 0 and d₁ > 0 such that |φ(u)| ≥ c₁ + d₁|u| for all u ∈ ℝ₊ or for all u ∈ ℝ₋, and |φ(u)| ≤ c₂ + d₂|u| for all u ∈ ℝ.
• Examples: ReLU, ELU, PReLU, etc., but not bounded nonlinearities such as sigmoid and tanh.
• The nonlinearity does not change the distributional tail: ‖φ(X)‖_k ≍ ‖X‖_k for k ∈ ℕ.
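The envelope property of (A2) can be sanity-checked numerically on a finite grid. This is a sketch under our own choices of grid range and default constants (c₁ = c₂ = 0, d₁ = d₂ = 1), not code from the paper; the lower bound only has to hold on one half-line, the upper bound everywhere:

```python
import numpy as np

def satisfies_envelope(phi, c1=0.0, d1=1.0, c2=0.0, d2=1.0):
    # (A2): |phi(u)| >= c1 + d1|u| on R+ or on R-, and
    #       |phi(u)| <= c2 + d2|u| on all of R (checked on a grid)
    grid = np.linspace(-50.0, 50.0, 10001)
    tol = 1e-9
    upper = np.all(np.abs(phi(grid)) <= c2 + d2 * np.abs(grid) + tol)
    pos, neg = grid[grid >= 0], grid[grid <= 0]
    lower_pos = np.all(np.abs(phi(pos)) >= c1 + d1 * np.abs(pos) - tol)
    lower_neg = np.all(np.abs(phi(neg)) >= c1 + d1 * np.abs(neg) - tol)
    return bool(upper and (lower_pos or lower_neg))

relu = lambda u: np.maximum(u, 0.0)
```

With these constants, `satisfies_envelope(relu)` holds (ReLU meets the lower bound with equality on ℝ₊), while `satisfies_envelope(np.tanh)` fails: a bounded function cannot dominate c₁ + d₁|u| on a half-line.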


9. Main theorem
Consider a Bayesian neural network with (A1) i.i.d. Gaussian priors on the weights and (A2) a nonlinearity satisfying the envelope property. Then, conditional on the input x, the marginal prior distribution of a unit u^(ℓ) of the ℓ-th hidden layer is sub-Weibull with optimal tail parameter θ = ℓ/2:
π^(ℓ)(u) ∼ subW(ℓ/2).
[Figure: log tail probabilities log P(X ≥ x) for subW(1/2), subW(1), subW(3/2), subW(5), subW(50), matching units u^(1), u^(2), u^(3), …; the tail gets heavier as θ = ℓ/2 grows with the layer ℓ.]
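The theorem can be illustrated by simulation. The sketch below is our own Monte Carlo setup (widths, sample counts, and the kurtosis diagnostic are arbitrary choices, not the paper's): draw i.i.d. Gaussian weights, propagate a fixed input through ReLU layers, and measure the tail heaviness of one unit's pre-activation via its kurtosis E[u⁴]/E[u²]². Layer-1 units are exactly Gaussian (kurtosis 3); deeper units should come out heavier-tailed:

```python
import numpy as np

rng = np.random.default_rng(42)
n_mc, width = 50_000, 10
x = rng.standard_normal(width)  # fixed input, shared by all MC draws

def relu(u):
    return np.maximum(u, 0.0)

kurtosis = []
h = np.broadcast_to(x, (n_mc, width))
for layer in range(3):
    # one independent weight matrix per MC draw, i.i.d. N(0, 1/width)
    W = rng.standard_normal((n_mc, width, width)) / np.sqrt(width)
    pre = np.einsum('nij,nj->ni', W, h)  # pre-activations of all units
    u = pre[:, 0]                        # track a single unit per layer
    kurtosis.append(float(np.mean(u**4) / np.mean(u**2) ** 2))
    h = relu(pre)
```

The list `kurtosis` should increase with depth, a finite-sample shadow of the subW(ℓ/2) result.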

10. Outline
• Sub-Weibull distributions
• Main result: the prior on units gets heavier-tailed with depth
• Regularization interpretation

11. Interpretation: shrinkage effect
Maximum a posteriori (MAP) estimation is a regularized optimization problem:
max_W π(W|D) ∝ L(D|W) π(W)
min_W −log L(D|W) − log π(W)
min_W L(W) + λ R(W),
where L(W) is a loss function and R(W) is a regularizer (a norm on ℝ^p).
Weight distribution π(w) ≈ e^(−w²) ⟹ ℓ-th layer unit distribution π^(ℓ)(u) ≈ e^(−u^(2/ℓ)).

Layer   Penalty on W                     Penalty on U
1       ‖W^(1)‖²₂, L2 (weight decay)     ‖U^(1)‖²₂, L2
2       ‖W^(2)‖²₂, L2                    ‖U^(2)‖₁, L1 (Lasso)
ℓ       ‖W^(ℓ)‖²₂, L2                    ‖U^(ℓ)‖^(2/ℓ)_(2/ℓ), L_(2/ℓ)

[Figure: unit balls of the induced penalty in the (U₁, U₂) plane for layers 1, 2, 3, and 10; the ball concentrates toward the axes as depth grows.]
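The induced unit penalties are easy to compute directly. This is a small sketch of the correspondence in the table (a plain elementwise power sum, following the slide's ‖U‖_(2/ℓ)^(2/ℓ) notation): layer 1 recovers squared-L2 weight decay, layer 2 the L1 Lasso penalty, and deeper layers increasingly nonconvex, sparsity-inducing penalties:

```python
def unit_penalty(u, layer):
    # ||U||_{2/l}^{2/l} = sum_i |u_i|^{2/l}, the penalty on units induced
    # by L2 weight decay at hidden layer l
    p = 2.0 / layer
    return sum(abs(v) ** p for v in u)

p1 = unit_penalty([3.0, 4.0], layer=1)  # squared L2: 9 + 16 = 25
p2 = unit_penalty([3.0, 4.0], layer=2)  # L1 (Lasso): 3 + 4 = 7
p4 = unit_penalty([3.0, 4.0], layer=4)  # L_{1/2}: sqrt(3) + sqrt(4)
```

For ℓ > 2 the exponent 2/ℓ drops below 1, so the penalty is concave near zero, which is what pushes units toward the axes in the unit-ball picture.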

12. Conclusion
(i) We define the notion of sub-Weibull distributions, which are characterized by tails lighter than (or as light as) those of Weibull distributions.
(ii) We prove that the marginal prior distributions of the units become heavier-tailed as depth increases.
(iii) We offer an interpretation from a regularization viewpoint.
Future directions:
• Prove the Gaussian process limit of sub-Weibull distributions in the wide regime;
• Investigate whether the described regularization mechanism induces sparsity at the unit level.
