 
Bayesian neural networks: a function space view tour
Yingzhen Li, Microsoft Research Cambridge
Neural networks 101
Let's say we want to classify different types of cats.
• $x$: input images; $y$: output label "cat"
• build a neural network (with parameters $W$): $p(y \mid x, W) = \mathrm{softmax}(f_W(x))$
A typical neural network:
$$f_W(x) = W_L \, \phi(W_{L-1} \, \phi(\dots \phi(W_1 x + b_1) \dots) + b_{L-1}) + b_L$$
For the $l$-th layer: $h_l = \phi(W_l h_{l-1} + b_l)$, with $h_1 = \phi(W_1 x + b_1)$.
Parameters: $W = \{W_1, b_1, \dots, W_L, b_L\}$; nonlinearity: $\phi(\cdot)$
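The layer recursion above can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes, the tanh nonlinearity, and the random initialisation are assumptions made for the example, not choices stated in the slides.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, params, phi=np.tanh):
    """Compute f_W(x) for params = [(W_1, b_1), ..., (W_L, b_L)]."""
    h = x
    for W, b in params[:-1]:
        h = phi(h @ W.T + b)      # h_l = phi(W_l h_{l-1} + b_l)
    W_L, b_L = params[-1]
    return h @ W_L.T + b_L        # last layer is linear and returns logits

rng = np.random.default_rng(0)
sizes = [4, 16, 16, 3]            # hypothetical: 4 inputs, two hidden layers, 3 classes
params = [(rng.normal(size=(o, i)) / np.sqrt(i), np.zeros(o))
          for i, o in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=(5, 4))       # a batch of 5 inputs
probs = softmax(forward(x, params))   # p(y | x, W), one row per input
```

Each row of `probs` is a categorical distribution over the three hypothetical classes.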
Typical deep learning solution: train the neural network weights.
• Maximum likelihood estimation (MLE) given a dataset $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$:
$$W^* = \arg\max_W \sum_{n=1}^N \log p(y_n \mid x_n, W)$$
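As a rough sketch of MLE in code, the snippet below runs gradient ascent on the log-likelihood of a linear softmax model, $f_W(x) = Wx$. The toy dataset, step size, and iteration count are all assumptions chosen for illustration; a real network would be trained with automatic differentiation and stochastic gradients.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def log_likelihood(W, X, y):
    # sum_n log p(y_n | x_n, W) for the linear softmax model f_W(x) = W x
    return log_softmax(X @ W.T)[np.arange(len(y)), y].sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # toy two-class labels

W = np.zeros((2, 2))
ll_start = log_likelihood(W, X, y)
for _ in range(200):
    p = np.exp(log_softmax(X @ W.T))             # (N, 2) class probabilities
    grad = (np.eye(2)[y] - p).T @ X              # gradient of the log-likelihood w.r.t. W
    W += 0.5 * grad / len(y)                     # ascent step: we maximise the log-likelihood
ll_end = log_likelihood(W, X, y)
```

Minimising the negative log-likelihood with gradient descent is the same procedure with the sign flipped.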
Bayesian neural networks 101
A Bayesian solution: put a prior distribution $p(W)$ over $W$.
• compute the posterior $p(W \mid \mathcal{D})$ given a dataset $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$:
$$p(W \mid \mathcal{D}) \propto p(W) \prod_{n=1}^N p(y_n \mid x_n, W)$$
• Bayesian predictive inference:
$$p(y^* \mid x^*, \mathcal{D}) = \mathbb{E}_{p(W \mid \mathcal{D})}\left[p(y^* \mid x^*, W)\right]$$
In practice, $p(W \mid \mathcal{D})$ is intractable.
• First find an approximation $q(W) \approx p(W \mid \mathcal{D})$
• In prediction, do Monte Carlo sampling:
$$p(y^* \mid x^* , \mathcal{D}) \approx \frac{1}{K} \sum_{k=1}^K p(y^* \mid x^*, W_k), \quad W_k \sim q(W)$$
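The Monte Carlo predictive can be sketched as follows. The snippet assumes, purely for illustration, a linear softmax model and a mean-field Gaussian $q(W)$ with made-up means and standard deviations; fitting $q(W)$ itself is not shown.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
D, C, K = 3, 4, 500                  # input dim, classes, MC samples (all hypothetical)

# Hypothetical mean-field Gaussian q(W) = prod_ij N(W_ij; mu_ij, sigma_ij^2)
mu = rng.normal(size=(C, D))
sigma = 0.3 * np.ones((C, D))

x_star = rng.normal(size=D)          # a single test input

W_samples = mu + sigma * rng.normal(size=(K, C, D))   # W_k ~ q(W), k = 1..K
probs_k = softmax(W_samples @ x_star)                 # p(y* | x*, W_k), shape (K, C)
pred = probs_k.mean(axis=0)                           # MC estimate of p(y* | x*, D)
```

The spread of `probs_k` across samples gives a simple picture of how uncertain the model is about the weights at this input.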
Applications of Bayesian neural networks
• Detecting adversarial examples (Li and Gal 2017)
• Image segmentation (Kendall and Gal 2017)
• Medical imaging (super resolution) (Tanno et al. 2019)
Bayesian neural networks vs Gaussian processes
Why learn about BNNs in a summer school about GPs?
• mean-field BNNs have GP limits
• approximate inference on GPs has links to BNNs
• approximate inference on BNNs can leverage GP techniques
BNN → GP
Bayesian neural networks → Gaussian process
Quick refresher: central limit theorem
Theorem. Let $x_1, \dots, x_N$ be i.i.d. samples from $p(x)$, where $p(x)$ has mean $\mu$ and covariance $\Sigma$. Then
$$\frac{1}{N} \sum_{n=1}^N x_n \xrightarrow{d} \mathcal{N}\left(\mu, \frac{1}{N}\Sigma\right), \quad N \to +\infty$$
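A quick numerical check of the theorem, assuming an Exponential(1) distribution for $p(x)$ (mean 1, variance 1). The distribution and sample sizes are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 1000, 5000
# x_n ~ Exponential(1): mean mu = 1, variance Sigma = 1, clearly non-Gaussian
samples = rng.exponential(scale=1.0, size=(trials, N))
means = samples.mean(axis=1)                 # one sample mean per trial

emp_mean = means.mean()                      # should be close to mu = 1
emp_var = means.var()                        # should be close to Sigma / N = 1e-3
# Standardised means should look like N(0, 1): about 95% within +-1.96
z = (means - 1.0) / np.sqrt(1.0 / N)
coverage = np.mean(np.abs(z) < 1.96)
```

Although each $x_n$ is strongly skewed, the sample means concentrate around $\mu$ with variance close to $\Sigma / N$.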
Consider a one-hidden-layer BNN with mean-field prior and bounded non-linearity (Radford Neal's derivation in his PhD thesis, 1994):
$$f(x) = \sum_{m=1}^M v_m \phi(w_m^T x + b_m),$$
$$W = \{W_1, b, W_2\}, \quad W_1 = [w_1, \dots, w_M]^T, \quad b = [b_1, \dots, b_M], \quad W_2 = [v_1, \dots, v_M]$$
Mean-field prior:
$$p(W) = p(W_1)\, p(b)\, p(W_2), \quad p(W_1) = \prod_m p(w_m), \quad p(b) = \prod_m p(b_m), \quad p(W_2) = \prod_m p(v_m)$$
The same prior for each connection weight/bias:
$$p(w_i) = p(w_j), \quad p(b_i) = p(b_j), \quad p(v_i) = p(v_j), \quad \forall i, j$$
The same prior for each connection weight/bias,
$$p(w_i) = p(w_j), \quad p(b_i) = p(b_j), \quad \forall i, j,$$
implies that the hidden unit outputs $h_i(x) = \phi(w_i^T x + b_i)$ are independent and identically distributed:
$$h_i(x) \perp h_j(x) \;\; (i \neq j), \quad h_i(x) \overset{d}{=} h_j(x),$$
i.e. $h_1(x), \dots, h_M(x)$ are i.i.d. samples from some implicitly defined distribution.
Mean-field prior with the same distribution for the second-layer connection weights:
$$v_i \perp W_1, b, \quad p(v_i) = p(v_j), \quad \forall i, j$$
$$\Rightarrow \quad v_i h_i(x) \perp v_j h_j(x) \;\; (i \neq j), \quad v_i h_i(x) \overset{d}{=} v_j h_j(x),$$
so $f(x)$ is a sum of i.i.d. random variables.
If we make $\mathbb{E}[v_m] = 0$ and let $\mathbb{V}[v_m] = \sigma_v^2 / M$ scale as $O(1/M)$:
$$\mathbb{E}[f(x)] = \sum_{m=1}^M \mathbb{E}[v_m]\, \mathbb{E}[h_m(x)] = 0$$
$$\mathbb{V}[f(x)] = \sum_{m=1}^M \mathbb{V}[v_m h_m(x)] = \sum_{m=1}^M \frac{\sigma_v^2}{M}\, \mathbb{E}[h_m(x)^2] = \sigma_v^2\, \mathbb{E}[h(x)^2]$$
The last step uses that the $h_m(x)$ are identically distributed, so the variance stays finite as $M \to \infty$.
With $\mathbb{E}[v_m] = 0$ and $\mathbb{V}[v_m] = \sigma_v^2 / M$, the covariance between the outputs at two inputs is
$$\mathrm{Cov}[f(x), f(x')] = \sum_{m=1}^M \frac{\sigma_v^2}{M}\, \mathbb{E}[h_m(x) h_m(x')] = \sigma_v^2\, \mathbb{E}[h(x) h(x')]$$
With $\mathbb{E}[v_m] = 0$ and $\mathbb{V}[v_m] = \sigma_v^2 / M$, the CLT gives
$$(f(x), f(x')) \xrightarrow{d} \mathcal{N}(0, K), \quad K(x, x') = \sigma_v^2\, \mathbb{E}[h(x) h(x')]$$
This holds for any $x, x'$ (and, by the same argument, for any finite set of inputs) $\Rightarrow$ $f \sim \mathcal{GP}(0, K(x, x'))$.
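The limit can be checked empirically by sampling wide one-hidden-layer networks from the prior. In the sketch below the standard normal priors on $w_m$ and $b_m$, the tanh non-linearity, and $\mathbb{V}[v_m] = \sigma_v^2 / M$ as the concrete $O(1/M)$ scaling are all assumptions for the illustration; the slides only require a mean-field prior and a bounded $\phi$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_v, M, n_draws = 1.0, 300, 5000
X = np.array([[1.0, 0.0], [0.5, 0.5]])           # two test inputs x, x'

# Assumed priors (not fixed by the text): w_m ~ N(0, I), b_m ~ N(0, 1)
w = rng.normal(size=(n_draws, M, 2))
b = rng.normal(size=(n_draws, M, 1))
v = rng.normal(scale=sigma_v / np.sqrt(M), size=(n_draws, M, 1))  # V[v_m] = sigma_v^2 / M

h = np.tanh(w @ X.T + b)                         # (n_draws, M, 2), bounded phi
f = (v * h).sum(axis=1)                          # (n_draws, 2): draws of (f(x), f(x'))

# Limit kernel K(x, x') = sigma_v^2 E[h(x) h(x')], estimated with fresh single units
w1 = rng.normal(size=(200_000, 2))
b1 = rng.normal(size=(200_000, 1))
h1 = np.tanh(w1 @ X.T + b1)
K = sigma_v**2 * h1.T @ h1 / len(h1)

emp_cov = np.cov(f.T)                            # empirical covariance of the BNN draws
```

`emp_cov` should agree with `K` up to Monte Carlo error, and a histogram of `f[:, 0]` should look close to Gaussian for large `M`.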
Recent extensions of Radford Neal's result:
• Deep and wide BNNs have GP limits (Matthews et al. 2018, Lee et al. 2018)
  • mean-field prior over weights
  • the activation function satisfies $|\phi(x)| \le c + A|x|$
  • hidden layer widths strictly increasing to infinity
• Bayesian CNNs have GP limits (Garriga-Alonso et al. 2019, Novak et al. 2019)
  • convolution in a CNN = fully connected layer applied to different locations in the image
  • number of channels in a CNN = number of hidden units in a fully connected NN
GP → BNN
Gaussian process → Bayesian neural networks
Exact GP inference can be very expensive. Predictive inference for GP regression:
$$p(f_* \mid X_*, X, y) = \mathcal{N}\left(f_*;\; K_{*n}(K_{nn} + \sigma^2 I)^{-1} y,\; K_{**} - K_{*n}(K_{nn} + \sigma^2 I)^{-1} K_{n*}\right)$$
$$K_{nn} \in \mathbb{R}^{N \times N}, \quad (K_{nn})_{ij} = K(x_i, x_j)$$
Inverting $K_{nn} + \sigma^2 I$ has $O(N^3)$ cost!
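The predictive equations can be sketched with a Cholesky factorisation, which is where the $O(N^3)$ cost appears. The squared-exponential kernel, the noise variance, and the toy data below are assumptions for the example.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    # Squared-exponential kernel K(x, x') = exp(-||x - x'||^2 / (2 l^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, X_star, noise_var=0.1):
    K_nn = rbf(X, X) + noise_var * np.eye(len(X))        # K_nn + sigma^2 I
    K_sn = rbf(X_star, X)                                # K_{*n}
    L = np.linalg.cholesky(K_nn)                         # the O(N^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K_nn + sigma^2 I)^{-1} y
    mean = K_sn @ alpha
    V = np.linalg.solve(L, K_sn.T)                       # L^{-1} K_{n*}
    cov = rbf(X_star, X_star) - V.T @ V                  # K_** - K_*n (.)^{-1} K_n*
    return mean, cov

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + np.sqrt(0.1) * rng.normal(size=30)  # toy noisy observations
X_star = np.linspace(-3, 3, 7)[:, None]
mean, cov = gp_predict(X, y, X_star)
```

For a few dozen points this is instant; the cubic cost only becomes a bottleneck once $N$ reaches the tens of thousands.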
Quick refresher: Fourier transform and its inverse (normalising constants omitted)
$$S(w) = \int s(t)\, e^{-itw}\, dt, \qquad s(t) = \int S(w)\, e^{itw}\, dw$$
Bochner's theorem (Fourier inverse transform)
Theorem. A (properly scaled) translation-invariant kernel $K(x, x') = K(x - x')$ can be represented as
$$K(x, x') = \mathbb{E}_{p(w)}\left[\sigma^2 e^{i w^T (x - x')}\right]$$
for some distribution $p(w)$.
• Real-valued kernel $\Rightarrow$ $\mathbb{E}_{p(w)}\left[\sigma^2 e^{i w^T (x - x')}\right] = \mathbb{E}_{p(w)}\left[\sigma^2 \cos(w^T (x - x'))\right]$
• $\cos(x - x') = 2\, \mathbb{E}_{p(b)}\left[\cos(x + b) \cos(x' + b)\right]$, $p(b) = \mathrm{Uniform}[0, 2\pi]$
Combining the two, with the constant factor 2 absorbed into $\sigma^2$:
$$K(x, x') = \mathbb{E}_{p(w) p(b)}\left[\sigma^2 \cos(w^T x + b) \cos(w^T x' + b)\right]$$
for some distribution $p(w)$ and $p(b) = \mathrm{Uniform}[0, 2\pi]$. (Rahimi and Recht 2007)
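The cosine identity used above follows from the product-to-sum formula; a short derivation, written out here as a reading aid:

```latex
\begin{align*}
\cos(x+b)\cos(x'+b) &= \tfrac{1}{2}\cos(x-x') + \tfrac{1}{2}\cos(x+x'+2b), \\
\mathbb{E}_{p(b)}\left[\cos(x+x'+2b)\right]
  &= \frac{1}{2\pi}\int_0^{2\pi}\cos(x+x'+2b)\,db = 0, \\
\Rightarrow\quad 2\,\mathbb{E}_{p(b)}\left[\cos(x+b)\cos(x'+b)\right] &= \cos(x-x').
\end{align*}
```

The second line holds because the integrand completes two full periods as $b$ runs over $[0, 2\pi]$.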
Monte Carlo approximation:
$$K(x, x') \approx \tilde{K}(x, x') = \frac{\sigma^2}{M} \sum_{m=1}^M \cos(w_m^T x + b_m) \cos(w_m^T x' + b_m), \quad w_m \sim p(w), \; b_m \sim p(b)$$
Define
$$h(x) = [h_1(x), \dots, h_M(x)], \quad h_m(x) = \cos(w_m^T x + b_m)$$
$$\Rightarrow \quad \tilde{K}(x, x') = \frac{\sigma^2}{M}\, h(x)^T h(x')$$
(Rahimi and Recht 2007)
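A minimal sketch of the random-feature approximation, assuming the squared-exponential kernel $\exp(-\|x - x'\|^2 / (2\ell^2))$, whose spectral distribution is $p(w) = \mathcal{N}(0, I/\ell^2)$. For this unit-variance kernel the slide's $\sigma^2$ equals 2, which is the explicit factor in the code. The dimensions and the number of features are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, lengthscale = 3, 20_000, 1.0

# For the RBF kernel exp(-||x - x'||^2 / (2 l^2)), p(w) = N(0, I / l^2)
w = rng.normal(scale=1.0 / lengthscale, size=(M, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=M)        # b_m ~ Uniform[0, 2 pi]

def features(X):
    # sqrt(2 / M) * cos(w_m^T x + b_m); the factor 2 comes from the cosine identity
    return np.sqrt(2.0 / M) * np.cos(X @ w.T + b)

X = rng.normal(size=(5, d))
K_approx = features(X) @ features(X).T           # (sigma^2 / M) h(x)^T h(x'), sigma^2 = 2
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-0.5 * d2 / lengthscale**2)
max_err = np.abs(K_approx - K_exact).max()       # shrinks roughly as 1 / sqrt(M)
```

Read as a network, `features` is a single hidden layer with cosine activations and random, fixed first-layer weights, which is the link back to BNNs.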