Understanding Priors in Bayesian Neural Networks at the Unit Level


SLIDE 1

Understanding Priors in Bayesian Neural Networks at the Unit Level

Mariia Vladimirova, Jakob Verbeek, Pablo Mesejo, Julyan Arbel Inria, Grenoble, France

mariia.vladimirova@inria.fr International Conference on Machine Learning June 13, 2019

1/9

SLIDE 2

2/9

Outline

  • Sub-Weibull distributions
  • Main result: prior on units gets heavier-tailed with depth
  • Regularization interpretation

SLIDE 3

3/9

Distribution families with respect to tail behavior

For all k ∈ N, the k-th root moment is ‖X‖_k = (E|X|^k)^(1/k). Writing F̄(x) for the survival function:

Distribution      Tail bound               Moment growth
Sub-Gaussian      F̄(x) ≤ e^(−λx²)         ‖X‖_k ≤ C √k
Sub-Exponential   F̄(x) ≤ e^(−λx)          ‖X‖_k ≤ C k
Sub-Weibull       F̄(x) ≤ e^(−λx^(1/θ))    ‖X‖_k ≤ C k^θ

  • θ > 0 is called the tail parameter
  • ‖X‖_k ≍ k^θ ⇒ X ∼ subW(θ), with θ called optimal
  • subW(1/2) = subG, subW(1) = subE
  • Larger θ implies a heavier tail
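The moment characterization can be sanity-checked by Monte Carlo: a Gaussian's root moments ‖X‖_k grow like √k (subW(1/2)), while an exponential's grow like k (subW(1)). A minimal sketch; the distributions, sample size, and moment orders are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def root_moment(x, k):
    # k-th root moment ||X||_k = (E|X|^k)^(1/k), estimated from samples
    return np.mean(np.abs(x) ** k) ** (1.0 / k)

gauss = rng.normal(size=n)       # sub-Gaussian: ||X||_k ≍ k^(1/2)
expo = rng.exponential(size=n)   # sub-exponential: ||X||_k ≍ k

for k in (2, 4, 8):
    print(k, root_moment(gauss, k), root_moment(expo, k))
```

The heavier-tailed exponential's root moments grow faster in k, in line with ‖X‖_k ≍ k^θ.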

Vladimirova et al. Understanding priors in Bayesian neural networks at the unit level

SLIDE 4

4/9

Outline

  • Sub-Weibull distributions
  • Main result: prior on units gets heavier-tailed with depth
  • Regularization interpretation

SLIDE 5

5/9

Assumptions on neural network

We analyze Bayesian neural networks satisfying the following assumptions.

(A1) Parameters. The weights w have i.i.d. Gaussian priors, w ∼ N(µ, σ²).
(A2) Nonlinearity. ReLU-like, with the envelope property: there exist c1, c2, d2 ≥ 0 and d1 > 0 such that
  |φ(u)| ≥ c1 + d1|u| for all u ∈ R+ or for all u ∈ R−,
  |φ(u)| ≤ c2 + d2|u| for all u ∈ R.

  • Examples: ReLU, ELU, PReLU, etc., but not bounded nonlinearities such as sigmoid and tanh.
  • The nonlinearity does not alter the distributional tail:

‖φ(X)‖_k ≍ ‖X‖_k, k ∈ N
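The envelope property (A2) can be checked numerically for ReLU; a minimal sketch, where the constants c1 = c2 = 0 and d1 = d2 = 1 are illustrative choices that happen to work for ReLU on R+:

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

u = np.linspace(-10.0, 10.0, 2001)
pos = u[u >= 0]

# Lower envelope |phi(u)| >= c1 + d1*|u| on R+, with c1 = 0, d1 = 1
assert np.all(np.abs(relu(pos)) >= 0.0 + 1.0 * np.abs(pos) - 1e-12)
# Upper envelope |phi(u)| <= c2 + d2*|u| on all of R, with c2 = 0, d2 = 1
assert np.all(np.abs(relu(u)) <= 0.0 + 1.0 * np.abs(u) + 1e-12)
print("ReLU satisfies the envelope property with c1 = c2 = 0, d1 = d2 = 1")
```

A bounded nonlinearity such as tanh fails the lower envelope on both half-lines, since no d1 > 0 keeps c1 + d1|u| below a bounded function for large |u|.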



SLIDE 9

6/9

Main theorem

Consider a Bayesian neural network with (A1) i.i.d. Gaussian priors on the weights and (A2) a nonlinearity satisfying the envelope property. Then, conditionally on the input x, the marginal prior distribution of a unit u(ℓ) in the ℓ-th hidden layer is sub-Weibull with optimal tail parameter θ = ℓ/2:

π(ℓ)(u) ∼ subW(ℓ/2)

[Figure: network diagram with input x and hidden units u(1), u(2), u(3), …, u(ℓ) distributed as subW(1/2), subW(1), subW(3/2), …, subW(ℓ/2); log-survival plot of log P(X ≥ x) comparing subW(1/2), subW(1), subW(3/2), subW(5), subW(50).]
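The theorem can be illustrated by simulating the prior: draw fresh Gaussian weights for each sample, propagate a fixed input through ReLU layers, and compare tail heaviness across depths via kurtosis. A minimal Monte Carlo sketch; the width, sample count, and all-ones input are illustrative assumptions, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_unit(depth, width=20, n=30_000):
    """Monte Carlo draws of one hidden unit at the given depth: fresh
    i.i.d. N(0, 1/width) weights per draw, ReLU nonlinearity, input
    fixed at x = (1, ..., 1)."""
    h = np.ones((n, width))
    for _ in range(depth):
        w = rng.normal(0.0, 1.0 / np.sqrt(width), size=(n, width, width))
        h = np.maximum(np.einsum('nij,ni->nj', w, h), 0.0)
    return h[:, 0]

def kurtosis(z):
    # 4th standardized moment; grows as tails get heavier (3 for a Gaussian)
    z = z - z.mean()
    return np.mean(z ** 4) / np.mean(z ** 2) ** 2

k1 = kurtosis(sample_unit(1))   # subW(1/2): rectified Gaussian
k3 = kurtosis(sample_unit(3))   # subW(3/2): noticeably heavier tail
print(f"kurtosis at depth 1: {k1:.1f}, at depth 3: {k3:.1f}")
```

The depth-3 unit shows a much larger kurtosis, consistent with the tail parameter growing as ℓ/2.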


SLIDE 10

7/9

Outline

  • Sub-Weibull distributions
  • Main result: prior on units gets heavier-tailed with depth
  • Regularization interpretation

SLIDE 11

8/9

Interpretation: shrinkage effect

Maximum a posteriori (MAP) estimation is a regularized optimization problem:

max_W π(W|D) ∝ L(D|W) π(W)
⇔ min_W −log L(D|W) − log π(W)
⇔ min_W L(W) + λ R(W)

where L(W) is a loss function and R(W) is a norm on R^p, the regularizer.

Weight prior: π(w) ≈ e^(−w²). ℓ-th layer unit prior: π(ℓ)(u) ≈ e^(−u^(2/ℓ)).

Layer   Penalty on W       Penalty on U
1       ‖W(1)‖₂², L2       ‖U(1)‖₂², L2 (weight decay)
2       ‖W(2)‖₂², L2       ‖U(2)‖₁, L1 (Lasso)
ℓ       ‖W(ℓ)‖₂², L2       ‖U(ℓ)‖_(2/ℓ)^(2/ℓ), L2/ℓ

[Figure: equidensity contours of the unit prior in the (U1, U2) plane for layers 1, 2, 3, and 10.]
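The sparsity-favoring effect of the L2/ℓ unit penalty can be seen by comparing a spread-out vector with a concentrated one of the same L2 norm: for ℓ = 1 (exponent q = 2) they cost the same, while for ℓ ≥ 2 (q ≤ 1) the concentrated vector is cheaper. A minimal sketch; the example vectors are illustrative:

```python
import numpy as np

def lq_penalty(u, q):
    # R(u) = sum_i |u_i|^q; the L2/l unit penalty corresponds to q = 2/l
    return np.sum(np.abs(u) ** q)

dense = np.array([0.5, 0.5, 0.5, 0.5])    # spread-out, L2 norm = 1
sparse = np.array([1.0, 0.0, 0.0, 0.0])   # concentrated, L2 norm = 1

for layer in (1, 2, 10):
    q = 2.0 / layer
    print(layer, lq_penalty(dense, q), lq_penalty(sparse, q))
```

As the layer index grows, q = 2/ℓ shrinks and the penalty gap between dense and sparse vectors widens, which is the sense in which deeper unit priors induce stronger shrinkage toward sparsity.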


SLIDE 12

9/9

Conclusion

(i) We define the notion of sub-Weibull distributions, characterized by tails at most as heavy as Weibull tails.
(ii) We prove that the marginal prior distributions of the units become heavier-tailed as depth increases.
(iii) We offer an interpretation from a regularization viewpoint.

Future directions:

  • Prove the Gaussian process limit of sub-Weibull distributions in the wide regime;
  • Investigate if the described regularization mechanism induces sparsity at the unit level.
