

  1. Efficient sampling for Gaussian linear regression with arbitrary priors
     P. Richard Hahn (Arizona State), Jingyu He (Chicago Booth), Hedibert Lopes (INSPER)
     November 21, 2018

  2. Motivation
     Goal: run a Gaussian linear regression with a new prior.
     ◮ Do the math and try to write a new Gibbs sampler.
     ◮ It may be hard to find tractable conditional distributions.
     ◮ It probably requires data augmentation, adding lots of latent variables.
     Why not take advantage of the Gaussian likelihood? The elliptical slice sampler can be extended to arbitrary priors, as long as we can evaluate the prior density.

  3. Outline
     Brief review of (Bayesian) regularization
       Ridge and Lasso regressions
       Shrinkage priors
     Algorithm 1: Elliptical slice sampler
     Algorithm 2: Elliptical slice-within-Gibbs sampler
     Other issues
       Computational considerations
       Rank deficiency
       Comparison metrics
     Simulation exercise
       n > p
       n < p
     Empirical illustrations
       Illustration 1: Beauty and course evaluations
       Illustration 2: Diabetes dataset
       Illustration 3: Motorcycle data
     References

  4. Brief review of (Bayesian) regularization
     Consider the Gaussian linear model $(y \mid X, \beta, \sigma^2) \sim N(X\beta, \sigma^2 I_n)$, where $\beta$ is $p$-dimensional.
     Ridge regression: $\ell_2$ penalty on $\beta$:
     $$\hat\beta_R = \arg\min_\beta \{\|y - X\beta\|^2 + \lambda \|\beta\|_2^2\}, \qquad \lambda \ge 0,$$
     leading to $\hat\beta_{\text{ridge}} = (X'X + \lambda I)^{-1} X'y$.
     LASSO regression: $\ell_1$ penalty on $\beta$:
     $$\hat\beta_L = \arg\min_\beta \{\|y - X\beta\|^2 + \lambda \|\beta\|_1\}, \qquad \lambda \ge 0,$$
     which can be solved numerically, for example with quadratic programming or a coordinate descent algorithm.
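As a concrete companion to the closed-form ridge solution above, here is a minimal Python/numpy sketch; the function name, regularization value, and simulated data are illustrative choices, not from the slides.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Closed-form ridge estimate (X'X + lam * I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Toy usage with simulated data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=100)
print(ridge_estimate(X, y, lam=1.0))
```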

  5. Elastic net
     The Elastic net combines the Ridge and LASSO approaches:
     $$\hat\beta_{EN} = \arg\min_\beta \{\|y - X\beta\|^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2\}, \qquad \lambda_1 \ge 0, \ \lambda_2 \ge 0.$$
     The $\ell_1$ part of the penalty generates a sparse model. The $\ell_2$ part of the penalty
     ◮ removes the limitation on the number of selected variables;
     ◮ encourages a grouping effect;
     ◮ stabilizes the $\ell_1$ regularization path.
     R package: elasticnet

  6. Two-dimensional contour plots of the three penalty functions: Ridge (dot-dashed), LASSO (dashed) and Elastic net (solid). [Figure]

  7. Bayesian regularization
     ◮ Regularization and variable selection are done by assuming independent prior distributions from the scale mixture of normals (SMN) class:
     $$\beta \mid \psi \sim N(0, \psi) \quad \text{and} \quad \psi \sim p(\psi).$$
     ◮ The posterior mode, or maximum a posteriori (MAP) estimate, is
     $$\arg\max_\beta \{\log p(y \mid \beta) + \log p(\beta \mid \psi)\},$$
     which is equivalent to penalizing the log-likelihood $\log p(y \mid \beta)$ with a penalty equal to the log prior $\log p(\beta \mid \psi)$ when $\psi$ is held fixed.

  8. Bayesian regularization in linear regression problems
     The marginal prior distribution of $\beta$,
     $$p(\beta) = \int_0^\infty p_N(\beta; 0, \psi)\, p(\psi)\, d\psi,$$
     can assume many forms depending on the mixing distribution $p(\psi)$:

     Prior         p(ψ)                  p(β)
     Ridge         IG(α, δ)              Student's t
     Lasso         E(λ²/2)               Laplace
     NG prior      G(λ, 1/(2γ²))         Normal-Gamma
     Horseshoe     √ψ ∼ C⁺(0, 1)         No closed form

     where
     $$p_{NG}(\beta \mid \lambda, \gamma) = \frac{1}{\sqrt{\pi}\, 2^{\lambda - 1/2}\, \gamma^{\lambda + 1/2}\, \Gamma(\lambda)}\, |\beta|^{\lambda - 1/2} K_{\lambda - 1/2}(|\beta|/\gamma)$$
     and
     $$\log p_H(\beta) \approx \log\!\left(1 + \frac{4}{\beta^2}\right).$$
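Because the sampler introduced later only needs the prior density up to a normalizing constant, the priors in this table can be plugged in as simple log-density evaluations. Below is a hedged numpy sketch for two cases: the Laplace prior and the horseshoe via the log(1 + 4/β²) approximation shown above; the function names and the small eps guard are my own additions.

```python
import numpy as np

def log_prior_laplace(beta, lam=1.0):
    # Laplace (Lasso) prior, evaluated up to an additive constant:
    # log p(beta) = -lam * sum_j |beta_j| + const.
    return -lam * np.sum(np.abs(beta))

def log_prior_horseshoe_approx(beta, eps=1e-12):
    # Horseshoe prior via the approximation log p_H(beta_j) ~ log(1 + 4 / beta_j^2),
    # again up to an additive constant; eps guards against division by zero at beta_j = 0.
    return np.sum(np.log1p(4.0 / (beta**2 + eps)))
```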

  9. Comparing shrinkage priors
     [Figure: densities of the Ridge, Laplace, Elastic-Net, Horseshoe and Normal-Gamma priors over β ∈ [−4, 4].]

  10. Comparing shrinkage priors
     [Figure: log densities of the Ridge, Laplace, Elastic-Net, Horseshoe and Normal-Gamma priors over β ∈ [−4, 4].]

  11. Algorithm 1: Elliptical Slice Sampler
     ◮ The original elliptical slice sampler (Murray et al. [2010]) was designed to sample from a posterior arising from a normal prior and a general likelihood.
     ◮ It can also be used with a normal likelihood and a general prior, such as a shrinkage prior.
     ◮ Advantages:
       ◮ Flexible: it only requires evaluating the prior density, or an approximation to it (no special samplers are required).
       ◮ Fast: all coefficients are sampled simultaneously; there is no need to loop over variables.

  12. Advantages of our sampling scheme
     Flexibility: π(β) only needs to be evaluated up to a normalizing constant.
     MC efficiency: each MC iteration requires a single multivariate Gaussian draw and several univariate uniform draws.
     Acceptance rate: the size of the sampling region for θ shrinks rapidly with each rejected value, and the sampler is guaranteed to eventually accept.
     Single/block move: the basic strategy of the elliptical slice sampler can also be applied to smaller blocks.

  13. Algorithm 1: Elliptical slice sampler
     Goal: sample $\Delta$ from $p(\Delta) \propto p_N(\Delta; 0, V)\, L(\Delta)$.
     Key idea: for $v_0$ and $v_1$ iid $N(0, V)$ and any $\theta \in [0, 2\pi]$, it follows that
     $$\Delta = v_0 \sin\theta + v_1 \cos\theta \sim N(0, V).$$
     Parameter expansion: sampling from $p(v_0, v_1, \Delta, \theta) \propto p_N(v_0, v_1; 0, \Sigma_\theta)\, L(v_0 \sin\theta + v_1 \cos\theta)$ can be done via a two-block Gibbs sampler:
     ◮ Sample from $p(v_0, v_1 \mid \Delta, \theta)$:
       ◮ sample $v$ from $N(0, V)$;
       ◮ set $v_0 = \Delta \sin\theta + v \cos\theta$ and $v_1 = \Delta \cos\theta - v \sin\theta$.
     ◮ Sample from $p(\Delta, \theta \mid v_0, v_1)$:
       ◮ slice sample $\theta$ from $p(\theta \mid v_0, v_1) \propto L(v_0 \sin\theta + v_1 \cos\theta)$;
       ◮ set $\Delta = v_0 \sin\theta + v_1 \cos\theta$.
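The key identity is easy to check numerically. The sketch below (with an arbitrary covariance V and angle θ of my own choosing) draws many iid pairs v₀, v₁ ∼ N(0, V) and verifies that the sample covariance of ∆ = v₀ sin θ + v₁ cos θ is close to V.

```python
import numpy as np

rng = np.random.default_rng(1)
V = np.array([[1.0, 0.6],
              [0.6, 2.0]])          # arbitrary covariance matrix
L = np.linalg.cholesky(V)
theta = 1.234                       # any fixed angle in [0, 2*pi]

# Draw many iid pairs (v0, v1) ~ N(0, V) and rotate them.
v0 = rng.normal(size=(100_000, 2)) @ L.T
v1 = rng.normal(size=(100_000, 2)) @ L.T
delta = v0 * np.sin(theta) + v1 * np.cos(theta)

print(np.cov(delta.T))              # should be close to V
```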

  14. Slice sampling from $p(\theta \mid v_0, v_1) \propto L(v_0 \sin\theta + v_1 \cos\theta)$ [Figure]

  15. Second draw [Figure]

  16. Third draw [Figure]

  17. Elliptical slice sampler for Gaussian linear regression
     ◮ Regression model: $y \mid X, \beta, \sigma^2 \sim N(X\beta, \sigma^2 I_n)$.
     ◮ Posterior: $p(\beta \mid X, y, \sigma^2) \propto \underbrace{f(y \mid X, \beta, \sigma^2)}_{\text{normal}}\; \underbrace{\pi(\beta)}_{\text{arbitrary prior}}$.
     ◮ $f(y \mid X, \beta, \sigma^2)$ can be rewritten, in terms of the OLS estimate $\hat\beta$, as
     $$\pi_0(\beta \mid X, y, \sigma^2) \propto \exp\!\left( -\frac{1}{2\sigma^2} (\beta - \hat\beta)' X'X (\beta - \hat\beta) \right).$$
     ◮ The slice sampler of Murray et al. (2010) can be applied directly, using $\pi_0(\beta)$ as the Gaussian “prior” and $\pi(\beta)$ as the arbitrary “likelihood”.
     ◮ We actually sample $\Delta = \beta - \hat\beta$, which is centered at zero.

  18. Elliptical slice sampler for Gaussian linear regression
     For an initial value $\beta$, $\Delta = \beta - \hat\beta$, $\sigma^2$ fixed, and $\theta \in [0, 2\pi]$:
     1. Draw $v \sim N(0, \sigma^2 (X'X)^{-1})$. Set $v_0 = \Delta \sin\theta + v \cos\theta$ and $v_1 = \Delta \cos\theta - v \sin\theta$.
     2. Draw $\ell$ from $U[0, \pi(\hat\beta + v_0 \sin\theta + v_1 \cos\theta)]$. Initialize $a = 0$ and $b = 2\pi$.
        2.1 Sample $\theta'$ from $U[a, b]$.
        2.2 If $\pi(\hat\beta + v_0 \sin\theta' + v_1 \cos\theta') > \ell$, set $\theta \leftarrow \theta'$ and go to step 3. Otherwise, if $\theta' < \theta$ set $a \leftarrow \theta'$, else set $b \leftarrow \theta'$; go to step 2.1.
     3. Return $\Delta = v_0 \sin\theta + v_1 \cos\theta$ and $\beta = \hat\beta + \Delta$.
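A minimal Python sketch of the update above, under a few assumptions of my own: the prior π is supplied as an unnormalized log density (log_prior), the acceptance test is done on the log scale for numerical stability, and V_chol is a Cholesky factor of σ²(X′X)⁻¹ computed once outside the sampler.

```python
import numpy as np

def ess_update(beta, beta_hat, V_chol, log_prior, theta, rng):
    """One elliptical slice update of beta for the Gaussian linear model.
    beta_hat: OLS estimate; V_chol: Cholesky factor of sigma^2 (X'X)^{-1};
    log_prior: unnormalized log prior density; theta: current angle in [0, 2*pi]."""
    delta = beta - beta_hat

    # Step 1: draw v ~ N(0, sigma^2 (X'X)^{-1}) and form (v0, v1).
    v = V_chol @ rng.normal(size=beta.shape)
    v0 = delta * np.sin(theta) + v * np.cos(theta)
    v1 = delta * np.cos(theta) - v * np.sin(theta)

    # Step 2: slice level ell ~ U[0, pi(current beta)], handled as log(ell).
    log_ell = log_prior(beta_hat + v0 * np.sin(theta) + v1 * np.cos(theta)) \
              + np.log(rng.uniform())
    a, b = 0.0, 2.0 * np.pi
    while True:
        theta_new = rng.uniform(a, b)                              # step 2.1
        proposal = beta_hat + v0 * np.sin(theta_new) + v1 * np.cos(theta_new)
        if log_prior(proposal) > log_ell:                          # step 2.2: accept
            return proposal, theta_new                             # step 3
        if theta_new < theta:                                      # shrink the bracket
            a = theta_new
        else:
            b = theta_new
```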

  19. Algorithm 2: Elliptical Slice-within-Gibbs Sampler
     One problem with the naive slice sampler: if the number of regression coefficients p is large, the slice region (shown in green in the earlier figures) becomes so tiny that there are too many rejections before one acceptable sample.
     Solution:
     ◮ Since β has a jointly Gaussian likelihood and independent priors, it is natural to write a Gibbs sampler that updates a subset β_k given the other coefficients β_{−k} in each MCMC iteration.
     ◮ Apply the elliptical slice sampler to the conditional likelihood of β_k given β_{−k} instead of the full likelihood.
     ◮ Thus it is a “slice-within-Gibbs” sampler.

  20. Update of a subset $\beta_k \mid \beta_{-k}$
     Assume that $\beta$ can be split into $\beta_k$ and $\beta_{-k}$. Then
     $$\begin{pmatrix} \beta_k \\ \beta_{-k} \end{pmatrix} \sim N\!\left( \begin{pmatrix} \hat\beta_k \\ \hat\beta_{-k} \end{pmatrix},\ \sigma^2 \begin{pmatrix} \Sigma_{k,k} & \Sigma_{k,-k} \\ \Sigma_{-k,k} & \Sigma_{-k,-k} \end{pmatrix} \right), \qquad (1)$$
     where
     $$\begin{pmatrix} \hat\beta_k \\ \hat\beta_{-k} \end{pmatrix} = \hat\beta \quad \text{and} \quad \begin{pmatrix} \Sigma_{k,k} & \Sigma_{k,-k} \\ \Sigma_{-k,k} & \Sigma_{-k,-k} \end{pmatrix} = (X'X)^{-1}.$$
     Therefore, the conditional distribution of $\beta_k$ given $\beta_{-k}$ is $N(\tilde\beta_k, \tilde\Sigma_k)$, with
     $$\tilde\beta_k = \hat\beta_k + \Sigma_{k,-k} \Sigma_{-k,-k}^{-1} (\beta_{-k} - \hat\beta_{-k}) \qquad (2)$$
     $$\tilde\Sigma_k = \sigma^2 \left( \Sigma_{k,k} - \Sigma_{k,-k} \Sigma_{-k,-k}^{-1} \Sigma_{-k,k} \right). \qquad (3)$$
     Simulation shows that sampling one element of β at a time (block size 1) gives the highest effective sample size per second!
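A small numpy sketch (not the authors' code) of computing the conditional moments in (2) and (3) for a single coordinate k; the function name and argument layout are illustrative.

```python
import numpy as np

def conditional_moments(k, beta, beta_hat, Sigma, sigma2):
    """Mean and variance of beta_k | beta_{-k} under beta ~ N(beta_hat, sigma2 * Sigma),
    following equations (2) and (3), for a single index k."""
    idx = np.arange(len(beta)) != k
    # w = Sigma_{k,-k} Sigma_{-k,-k}^{-1}
    w = Sigma[k, idx] @ np.linalg.inv(Sigma[np.ix_(idx, idx)])
    mean_k = beta_hat[k] + w @ (beta[idx] - beta_hat[idx])      # equation (2)
    var_k = sigma2 * (Sigma[k, k] - w @ Sigma[k, idx])          # equation (3), Sigma symmetric
    return mean_k, var_k
```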

  21. Algorithm 2: Slice-within-Gibbs for linear regression
     For each k from 1 to K:
     1. Construct $\tilde\beta_k$ and $\tilde\Sigma_k$ as in expressions (2) and (3). Set $\Delta_k = \beta_k - \tilde\beta_k$. Draw $v \sim N(0, \tilde\Sigma_k)$. Set $v_0 = \Delta_k \sin\theta_k + v \cos\theta_k$ and $v_1 = \Delta_k \cos\theta_k - v \sin\theta_k$.
     2. Draw $\ell$ uniformly on $[0, \pi(\tilde\beta_k + \Delta_k)]$. Initialize $a = 0$ and $b = 2\pi$.
        2.1 Sample $\theta'$ uniformly on $[a, b]$.
        2.2 If $\pi(\tilde\beta_k + v_0 \sin\theta' + v_1 \cos\theta') > \ell$, set $\theta_k \leftarrow \theta'$ and go to step 3. Otherwise, shrink the support of $\theta'$ (if $\theta' < \theta_k$, set $a \leftarrow \theta'$; if $\theta' > \theta_k$, set $b \leftarrow \theta'$) and go to step 2.1.
     3. Return $\Delta_k = v_0 \sin\theta_k + v_1 \cos\theta_k$ and $\beta_k = \tilde\beta_k + \Delta_k$.
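A hedged sketch of one full sweep of Algorithm 2 with block size 1, reusing the conditional-moment computation above. Here log_prior takes the whole coefficient vector, which is equivalent to evaluating only the k-th marginal when the prior factorizes across coordinates; the thetas array holding one angle per coordinate is my own bookkeeping choice.

```python
import numpy as np

def slice_within_gibbs_sweep(beta, beta_hat, Sigma, sigma2, log_prior, thetas, rng):
    """One sweep over k = 1..p of the elliptical slice-within-Gibbs update (block size 1)."""
    beta = beta.copy()
    p = len(beta)
    for k in range(p):
        idx = np.arange(p) != k
        w = Sigma[k, idx] @ np.linalg.inv(Sigma[np.ix_(idx, idx)])  # precomputable, see next slide
        mean_k = beta_hat[k] + w @ (beta[idx] - beta_hat[idx])      # beta_tilde_k, eq. (2)
        sd_k = np.sqrt(sigma2 * (Sigma[k, k] - w @ Sigma[k, idx]))  # sqrt of Sigma_tilde_k, eq. (3)

        theta = thetas[k]
        delta_k = beta[k] - mean_k
        v = sd_k * rng.normal()
        v0 = delta_k * np.sin(theta) + v * np.cos(theta)
        v1 = delta_k * np.cos(theta) - v * np.sin(theta)

        log_ell = log_prior(beta) + np.log(rng.uniform())           # slice level, log scale
        a, b = 0.0, 2.0 * np.pi
        while True:
            theta_new = rng.uniform(a, b)
            proposal = beta.copy()
            proposal[k] = mean_k + v0 * np.sin(theta_new) + v1 * np.cos(theta_new)
            if log_prior(proposal) > log_ell:                       # accept and move on
                beta, thetas[k] = proposal, theta_new
                break
            if theta_new < theta:                                   # shrink the bracket
                a = theta_new
            else:
                b = theta_new
    return beta, thetas
```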

  22. Computational considerations
     Question: is the slice-within-Gibbs sampler (updating one element of β at a time) better than a regular, standard Gibbs sampler?
     Answer: yes! All of the conditional covariances are fixed across MCMC iterations, so all of the conditional likelihoods can be precomputed. In particular, we can precompute
     $$\Sigma_{k,-k} \Sigma_{-k,-k}^{-1}, \qquad \Sigma_{k,k} - \Sigma_{k,-k} \Sigma_{-k,-k}^{-1} \Sigma_{-k,k},$$
     and the Cholesky factors $L_k$, with $L_k L_k^T = \Sigma_{k,k} - \Sigma_{k,-k} \Sigma_{-k,-k}^{-1} \Sigma_{-k,k}$, for each $k = 1, \ldots, K$.
     By contrast, regular Gibbs samplers have full conditional updates of the form
     $$(\beta \mid X, y, \sigma^2) \sim N\!\left( (X'X + D)^{-1} X'y,\ \sigma^2 (X'X + D)^{-1} \right), \qquad (4)$$
     which require a costly Cholesky or eigenvalue decomposition of $(X'X + D)^{-1}$ at each iteration because $D$ is updated.
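To make the precomputation point concrete, here is a sketch (my own illustration, not the authors' implementation) that computes, once before the MCMC loop, the weights Σ_{k,−k}Σ_{−k,−k}⁻¹ and the Schur complements Σ_{k,k} − Σ_{k,−k}Σ_{−k,−k}⁻¹Σ_{−k,k} for every coordinate; inside the sampler the conditional variance is then just σ² times the stored Schur complement, so no decomposition needs to be repeated per iteration.

```python
import numpy as np

def precompute_conditionals(XtX_inv):
    """For each k, precompute w_k = Sigma_{k,-k} Sigma_{-k,-k}^{-1} and the Schur
    complement s_k = Sigma_{k,k} - w_k Sigma_{-k,k}, where Sigma = (X'X)^{-1}.
    Both are fixed across MCMC iterations; the conditional variance is sigma^2 * s_k."""
    p = XtX_inv.shape[0]
    weights, schur = [], []
    for k in range(p):
        idx = np.arange(p) != k
        w = XtX_inv[k, idx] @ np.linalg.inv(XtX_inv[np.ix_(idx, idx)])
        weights.append(w)
        schur.append(XtX_inv[k, k] - w @ XtX_inv[k, idx])
    return weights, np.array(schur)
```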
