CS480/680 Machine Learning Lecture 11: February 11th, 2020 (Variational Inference)


1. CS480/680 Machine Learning, Lecture 11: February 11th, 2020. Variational Inference. Zahra Sheikhbahaee, University of Waterloo, CS480/680 Winter 2020. References: Variational Algorithms for Approximate Bayesian Inference (Beal 2003, chapter 2); Variational Inference: A Review for Statisticians (Blei et al. 2016).

2. Outline
• Variational lower bound derivation
• Variational mean field approximation

3. Full Bayesian Inference
• Training stage:
$p(\theta \mid X_{tr}, Y_{tr}) = \frac{p(Y_{tr} \mid X_{tr}, \theta)\, p(\theta)}{\int p(Y_{tr} \mid X_{tr}, \theta)\, p(\theta)\, d\theta}$
• Testing stage:
$p(y \mid x, X_{tr}, Y_{tr}) = \int p(y \mid x, \theta)\, p(\theta \mid X_{tr}, Y_{tr})\, d\theta$

4. Full Bayesian Inference (continued)
Both the normalizing integral of the training stage and the testing-stage integral may be intractable: posterior distributions can be calculated analytically only for simple conjugate models!
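As a concrete illustration (my own sketch, not from the slides), here is the testing-stage integral approximated by Monte Carlo for a Beta-Bernoulli coin model, a conjugate case where the exact answer is available to check against; the hyperparameters and data below are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: Bernoulli likelihood with a Beta prior, so the
# posterior over theta is available in closed form (a conjugate model).
alpha, beta = 2.0, 2.0                 # Beta prior hyperparameters
y_train = np.array([1, 0, 1, 1, 1, 0, 1])

# Training stage: p(theta | data) = Beta(alpha + k, beta + n - k)
k, n = y_train.sum(), len(y_train)
post_a, post_b = alpha + k, beta + n - k

# Testing stage: p(y=1 | data) = integral of p(y=1|theta) p(theta|data) dtheta,
# approximated by averaging p(y=1|theta) = theta over posterior samples.
theta_samples = rng.beta(post_a, post_b, size=100_000)
pred_mc = theta_samples.mean()

# Because this model is conjugate, the integral is also analytic:
pred_exact = post_a / (post_a + post_b)
print(pred_mc, pred_exact)             # should agree to ~3 decimals
```

For non-conjugate models the exact line is unavailable, which is exactly why the approximate-inference machinery of the following slides is needed.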

5. Choice Of Priors
• In any Bayesian inference model, what is essential is which type of prior knowledge (if any) is conveyed in the prior.
• Subjective priors: the prior encapsulates information as fully as possible, using previous experimental data or expert knowledge. Conjugate priors in the exponential family are subjective priors:
$g(\theta \mid \tilde{\nu}) = p(\theta \mid y) \propto g(\theta \mid \nu)\, p(y \mid \theta)$
The likelihood function of an exponential-family model, assuming the n data points arrive independent and identically distributed, is
$p(y_i \mid \theta) = g(\theta)\, f(y_i)\, e^{\phi(\theta)^\top u(y_i)}$
where $g(\theta)$ is a normalization constant, $\phi(\theta)$ is a vector of natural parameters, and $u(y_i)$ are the sufficient statistics. The conjugate prior is
$p(\theta \mid \eta, \nu) = h(\eta, \nu)\, g(\theta)^{\eta}\, e^{\phi(\theta)^\top \nu}$

6. Choice Of Priors (continued)
The posterior distribution in the conjugate exponential-family model:
$p(\theta \mid y) = p(\theta \mid \eta, \nu) \prod_{i=1}^{n} p(y_i \mid \theta) \propto g(\theta)^{\eta + n}\, e^{\phi(\theta)^\top \nu} \prod_{i=1}^{n} f(y_i)\, e^{\phi(\theta)^\top u(y_i)} \propto p(\theta \mid \tilde{\eta}, \tilde{\nu})$
with updated hyperparameters
$\tilde{\eta} = \eta + n, \qquad \tilde{\nu} = \nu + \sum_{i=1}^{n} u(y_i)$
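To make the update $\tilde{\eta} = \eta + n$, $\tilde{\nu} = \nu + \sum_i u(y_i)$ concrete, here is a minimal sketch (my own example, not from the slides) for the Bernoulli likelihood, where $u(y) = y$ and the conjugate prior is a Beta distribution, so the natural-parameter update reduces to counting:

```python
import numpy as np

def conjugate_update(eta, nu, y):
    """Exponential-family conjugate update: eta_new = eta + n,
    nu_new = nu + sum of sufficient statistics u(y_i).
    For Bernoulli data u(y) = y, so nu accumulates the count of ones."""
    y = np.asarray(y)
    return eta + len(y), nu + y.sum()

# Hypothetical prior hyperparameters (eta, nu); for the Bernoulli model
# this prior corresponds to Beta(nu + 1, eta - nu + 1).
eta, nu = 2.0, 1.0                       # i.e. a Beta(2, 2) prior
y = [1, 0, 1, 1, 0, 1]                   # 4 ones out of 6 observations
eta_new, nu_new = conjugate_update(eta, nu, y)
print(eta_new, nu_new)                   # 8.0, 5.0 -> Beta(6, 4) posterior
print((nu_new + 1) / (eta_new + 2))      # posterior mean of theta: 0.6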

7. Choice Of Priors (continued)
• Objective priors: instead of attempting to encapsulate rich knowledge in the prior, the objective Bayesian tries to impart as little information as possible, allowing the data to carry as much weight as possible in the posterior distribution. One class of noninformative priors is the class of reference priors.
• Hierarchical priors: utilize hierarchical modeling to transfer the reference-prior problem to a 'higher level' of the model. Hierarchical models allow a more "objective" approach to inference by estimating the parameters of prior distributions from data rather than requiring them to be specified using subjective information.

8. Approximate Inference
Probabilistic model: $p(y, \theta) = p(y \mid \theta)\, p(\theta)$
• Variational Inference: approximate the posterior by a tractable distribution, $p(\theta \mid y) \approx q(\theta) \in \mathcal{Q}$. Biased, but faster and more scalable.
• Markov Chain Monte Carlo: draw samples from the unnormalized $p(\theta \mid y)$. Unbiased, but needs a lot of samples.

9. Mathematical magic
Consider a model with hidden variables $\mathbf{x} = x_1, \dots, x_n$ and observed variables $\mathbf{y} = y_1, y_2, \dots, y_n$, where the stochastic dependency between variables is governed by parameters $\theta$:
$\mathcal{L}(\theta) \equiv \ln p(\mathbf{y} \mid \theta) = \sum_{i=1}^{n} \ln p(y_i \mid \theta) = \sum_{i=1}^{n} \ln \int dx_i\, p(x_i, y_i \mid \theta)$

10. Mathematical magic (continued)
Introduce an arbitrary distribution $q_{x_i}(x_i)$ over each hidden variable:
$\mathcal{L}(\theta) = \sum_{i=1}^{n} \ln \int dx_i\, q_{x_i}(x_i)\, \frac{p(x_i, y_i \mid \theta)}{q_{x_i}(x_i)} = \sum_{i=1}^{n} \ln \mathbb{E}_{q_{x_i}}\!\left[ \frac{p(x_i, y_i \mid \theta)}{q_{x_i}(x_i)} \right]$
Jensen's inequality for a concave function $g$ is given as $g(\mathbb{E}_q[x]) \ge \mathbb{E}_q[g(x)]$.

11. Mathematical magic (continued)
Applying Jensen's inequality (ln is concave):
$\mathcal{L}(\theta) \ge \sum_{i=1}^{n} \mathbb{E}_{q_{x_i}}\!\left[ \ln \frac{p(x_i, y_i \mid \theta)}{q_{x_i}(x_i)} \right]$

12. Mathematical magic (continued)
Writing the expectation as an integral:
$\mathcal{L}(\theta) \ge \sum_{i=1}^{n} \int dx_i\, q_{x_i}(x_i)\, \ln \frac{p(x_i, y_i \mid \theta)}{q_{x_i}(x_i)}$

13. Mathematical magic (continued)
Splitting the logarithm gives the variational lower bound:
$\mathcal{L}(\theta) \ge \sum_{i=1}^{n} \left[ \int dx_i\, q_{x_i}(x_i)\, \ln p(x_i, y_i \mid \theta) - \int dx_i\, q_{x_i}(x_i)\, \ln q_{x_i}(x_i) \right] \equiv \mathcal{F}(q_{x_1}(x_1), \dots, q_{x_n}(x_n), \theta)$
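A quick numerical sanity check of the bound (a sketch under a toy model of my own, not from the slides): take $x \sim N(0,1)$ and $y \mid x \sim N(x,1)$, so the evidence $p(y) = N(0,2)$ is available in closed form, and compare it with a Monte Carlo estimate of $\mathcal{F}$ for various choices of $q_x$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy latent-variable model (assumed for illustration):
#   x ~ N(0, 1),  y | x ~ N(x, 1)  =>  marginal p(y) = N(0, 2)
y = 1.5
log_evidence = norm.logpdf(y, loc=0.0, scale=np.sqrt(2.0))

def elbo(q_mean, q_std, n_samples=200_000):
    """Monte Carlo estimate of F(q, theta) = E_q[ln p(x, y) - ln q(x)]."""
    x = rng.normal(q_mean, q_std, size=n_samples)
    log_joint = norm.logpdf(x, 0.0, 1.0) + norm.logpdf(y, x, 1.0)
    log_q = norm.logpdf(x, q_mean, q_std)
    return np.mean(log_joint - log_q)

# Any q gives a lower bound; the exact posterior q = N(y/2, sqrt(1/2))
# makes the bound tight (the gap, a KL divergence, vanishes).
print(log_evidence)
print(elbo(0.0, 1.0))                # strictly below log_evidence
print(elbo(y / 2, np.sqrt(0.5)))     # approximately equal to log_evidence
```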

14. The Variational Lower Bound
The (negative) variational free energy $\mathcal{F}(q_x(x), \theta)$, or the evidence lower bound (ELBO), is the expected energy under $q_x(x)$ minus the entropy of $q_x(x)$:
$\mathcal{F}(q_x(x), \theta) = \sum_{i=1}^{n} \int dx_i\, q_{x_i}(x_i)\, \ln \frac{p(x_i, y_i \mid \theta)}{q_{x_i}(x_i)}$
$= \sum_{i} \int dx_i\, q_{x_i}(x_i)\, \ln p(y_i \mid \theta) + \sum_{i} \int dx_i\, q_{x_i}(x_i)\, \ln \frac{p(x_i \mid y_i, \theta)}{q_{x_i}(x_i)}$
$= \sum_{i} \ln p(y_i \mid \theta) - \sum_{i} D_{KL}\big( q_{x_i}(x_i) \,\|\, p(x_i \mid y_i, \theta) \big)$
The last term is the KL divergence that we need for VI.

15. ELBO = Evidence Lower BOund
$\ln p(y \mid \theta) = \mathcal{F}(q_x(x), \theta) + D_{KL}\big( q(x) \,\|\, p(x \mid y, \theta) \big)$
Here $\ln p(y \mid \theta)$ is the (log) evidence. By Bayes' rule,
$p(x \mid y, \theta) = \frac{p(y \mid x, \theta)\, p(x \mid \theta)}{\int p(y \mid x, \theta)\, p(x \mid \theta)\, dx} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}, \qquad \text{Evidence} = p(y \mid \theta)$
The evidence of the probabilistic model is the total probability of observing the data.
Lower bound: $D_{KL} \ge 0 \;\Rightarrow\; \ln p(y \mid \theta) \ge \mathcal{F}(q_x(x), \theta)$
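Continuing the toy Gaussian model from the earlier sketch (again my own example, not the lecture's), the identity evidence = ELBO + KL can be verified numerically, since the exact posterior, the ELBO, and the KL between two Gaussians are all available in closed form there:

```python
import numpy as np

# Toy model carried over from the previous sketch (assumed):
#   x ~ N(0,1), y|x ~ N(x,1), observed y = 1.5.
# Exact posterior p(x|y) = N(y/2, 1/2); evidence p(y) = N(0, 2).
y = 1.5
post_mean, post_var = y / 2.0, 0.5
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - y**2 / 4.0

def kl_gauss(m1, v1, m2, v2):
    # KL( N(m1,v1) || N(m2,v2) ) in closed form
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def elbo(m, v):
    # F(q) = E_q[ln p(x,y)] + entropy of q, both closed form for Gaussians
    expected_log_joint = (-np.log(2 * np.pi)
                          - (v + m**2) / 2.0
                          - (v + (y - m) ** 2) / 2.0)
    entropy = 0.5 * np.log(2 * np.pi * np.e * v)
    return expected_log_joint + entropy

m, v = 0.2, 1.3  # an arbitrary member of the variational family
print(np.isclose(log_evidence,
                 elbo(m, v) + kl_gauss(m, v, post_mean, post_var)))
# -> True: evidence = ELBO + KL, for any choice of q
```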

16. Kullback-Leibler Divergence
• Properties:
• $D_{KL}(p \| q) = 0$ if and only if (iff) $p = q$ (they may be different on sets of probability zero)
• $D_{KL}(p \| q) \ne D_{KL}(q \| p)$
• $D_{KL}(p \| q) \ge 0$
Nonnegativity follows from Jensen's inequality:
$-D_{KL}(q \| p) = \mathbb{E}_q\!\left[ -\log \frac{q}{p} \right] = \mathbb{E}_q\!\left[ \log \frac{p}{q} \right] \le \log \mathbb{E}_q\!\left[ \frac{p}{q} \right] = \log \int q(x)\, \frac{p(x)}{q(x)}\, dx = \log \int p(x)\, dx = 0$
[Figure. Blue: a fixed mixture of Gaussians $p$. Green: the (unimodal) Gaussian $q$ that minimises $D_{KL}(q \| p)$. Red: the (unimodal) Gaussian $q$ that minimises $D_{KL}(p \| q)$.]
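The asymmetry is easy to see numerically. This sketch (my own, mirroring the mode-seeking vs. mass-covering behaviour in the figure) computes both directions of the KL between a bimodal mixture and a unimodal Gaussian on a grid:

```python
import numpy as np

# Numerical KL between two densities on a grid (example densities assumed).
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# p: a bimodal mixture of Gaussians (like the blue curve on the slide)
p = 0.5 * gauss(x, -2.0, 0.7) + 0.5 * gauss(x, 2.0, 0.7)
# q: a single (unimodal) Gaussian sitting on one mode
q = gauss(x, 2.0, 0.7)

def kl(a, b):
    # D_KL(a || b) = integral of a * log(a / b); skip points where a ~ 0
    mask = a > 1e-300
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

print(kl(q, p))   # KL(q||p): small, q lies inside one mode of p
print(kl(p, q))   # KL(p||q): large, q misses the other mode entirely
```

This is why minimising $D_{KL}(q \| p)$ (the VI objective) tends to lock onto a single mode, while minimising $D_{KL}(p \| q)$ spreads $q$ across all the mass of $p$.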

17. Variational Inference
• Optimization problem with an intractable posterior distribution:
$q^* = \operatorname*{argmin}_{q(x) \in \mathcal{Q}} D_{KL}\big( q(x) \,\|\, p(x \mid y, \theta) \big)$
[Figure: the intractable posterior $p(x \mid y, \theta)$ and its tractable approximation $q(x)$.]
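As a closing sketch (my own minimal example, not the lecture's algorithm), here is that optimization problem solved directly for a 1-D target: since $D_{KL}(q \| p(x \mid y))$ equals the negative ELBO plus a constant, it suffices to know the posterior only up to its normalizer; the target density, variational family, and crude grid search below are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target log-density (assumed for illustration): the posterior
# is known only up to its normalizer, which is all VI needs.
def log_p_unnorm(x):
    return -0.5 * (x - 1.0) ** 2 / 0.25   # proportional to N(1, 0.5^2)

def neg_elbo(m, s, n=20_000):
    # KL(q || p) up to a constant = E_q[ln q(x) - ln p_unnorm(x)],
    # estimated by Monte Carlo with x ~ q = N(m, s^2)
    x = rng.normal(m, s, size=n)
    log_q = -0.5 * ((x - m) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)
    return np.mean(log_q - log_p_unnorm(x))

# Crude optimization by grid search over the variational parameters (m, s);
# practical implementations use stochastic gradients instead.
grid_m = np.linspace(-2.0, 3.0, 26)
grid_s = np.linspace(0.1, 2.0, 20)
best = min((neg_elbo(m, s), m, s) for m in grid_m for s in grid_s)
print(best[1], best[2])   # should land near the true m = 1.0, s = 0.5
```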
