Efficient sampling for Gaussian linear regression with arbitrary priors
P. Richard Hahn (Arizona State), Jingyu He (Chicago Booth), Hedibert Lopes (INSPER)
November 21, 2018
1
Motivation
Goal: run a Gaussian linear regression with a new prior.
◮ Do the math and try to write a new Gibbs sampler.
◮ It may be hard to find easy conditional distributions.
◮ It will probably require data augmentation, adding lots of latent variables.
2
3
Ridge and lasso estimates solve penalized least-squares problems:
β̂ ridge = argmin β {‖y − Xβ‖² + λ‖β‖²₂},
β̂ lasso = argmin β {‖y − Xβ‖² + λ‖β‖₁}, with λ ≥ 0.
4
The elastic net combines the two penalties:
β̂ EN = argmin β {‖y − Xβ‖² + λ1‖β‖₁ + λ2‖β‖²₂}, λ1 ≥ 0, λ2 ≥ 0.
Compared with the lasso, the elastic net:
◮ Removes the limitation on the number of selected variables;
◮ Encourages a grouping effect;
◮ Stabilizes the ℓ1 regularization path (a short code sketch follows).
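For concreteness, here is a minimal R sketch of these three fits using the glmnet package (an illustration; the slides do not name a package). glmnet writes the penalty as λ[(1 − α)‖β‖²₂/2 + α‖β‖₁], so α = 0 gives ridge, α = 1 the lasso, and intermediate α an elastic net.

# Ridge, lasso, and elastic-net fits on simulated data via glmnet.
library(glmnet)
set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 5), rep(0, p - 5))          # a sparse illustrative signal
y <- as.vector(X %*% beta + rnorm(n))
fit_ridge <- glmnet(X, y, alpha = 0)         # pure l2 penalty
fit_lasso <- glmnet(X, y, alpha = 1)         # pure l1 penalty
fit_enet  <- glmnet(X, y, alpha = 0.5)       # l1/l2 mixture (elastic net)
coef(fit_enet, s = 0.1)                      # coefficients at a chosen lambda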
5
6
◮ Regularization and variable selection are done by assuming a shrinkage prior on the coefficients β.
◮ The posterior mode, or maximum a posteriori (MAP) estimate, of β coincides with the corresponding penalized least-squares estimate (a quick numerical check follows).
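As a quick check of this correspondence (an illustration, not from the slides): under the Gaussian prior β ~ N(0, (σ²/λ) I), the posterior mode is exactly the ridge estimate (X⊤X + λI)⁻¹X⊤y, which the following R sketch verifies numerically.

# Posterior mode under a Gaussian prior equals the ridge estimate.
set.seed(2)
n <- 50; p <- 5; lambda <- 2; sigma2 <- 1
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X %*% rnorm(p) + rnorm(n, sd = sqrt(sigma2)))
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))   # closed form
neg_log_post <- function(b)                        # -log posterior, up to a constant
  sum((y - X %*% b)^2) / (2 * sigma2) + lambda * sum(b^2) / (2 * sigma2)
beta_map <- optim(rep(0, p), neg_log_post, method = "BFGS")$par         # numerical MAP
max(abs(beta_ridge - beta_map))                    # ~ 0: the two coincide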
7
Shrinkage priors
[Figure: prior densities over β for Ridge, Laplace, Elastic-Net, Horseshoe, and Normal-Gamma priors.]
9
Shrinkage priors
[Figure: prior log densities over β for Ridge, Laplace, Elastic-Net, Horseshoe, and Normal-Gamma priors.]
10
◮ The original elliptical slice sampler (Murray et al. [2010]) was designed for models with a Gaussian prior and a general likelihood.
◮ It can also be used with a normal likelihood and a general prior, such as the shrinkage priors above.
◮ Advantages:
◮ Flexible: it only requires evaluating the prior density, or an unnormalized version of it.
◮ Fast: all coefficients are sampled simultaneously; there is no need to loop over them one at a time.
11
12
◮ Sample from p(v0, v1 | ∆, θ):
◮ Sample v from N(0, V).
◮ Set v0 = ∆ sin θ + v cos θ and v1 = ∆ cos θ − v sin θ.
◮ Sample from p(∆, θ | v0, v1):
◮ Slice sample θ from p(θ | v0, v1) ∝ L(v0 sin θ + v1 cos θ).
◮ Set ∆ = v0 sin θ + v1 cos θ (an R sketch of this update follows).
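A minimal R sketch of one such update, written in the standard formulation of Murray et al. (2010) (equivalent to the (v0, v1, θ) parameterization above): given the current state ∆, a fresh draw v ~ N(0, V), and a function loglik giving the log of the non-Gaussian factor L(·), it returns a new state on the ellipse through ∆ and v.

# One elliptical slice sampling update.
ess_update <- function(delta, v, loglik) {
  logu  <- loglik(delta) + log(runif(1))          # slice level under the current state
  theta <- runif(1, 0, 2 * pi)                    # initial angle
  lo <- theta - 2 * pi; hi <- theta               # shrinking bracket
  repeat {
    prop <- delta * cos(theta) + v * sin(theta)   # point on the ellipse
    if (loglik(prop) > logu) return(prop)         # accept
    if (theta < 0) lo <- theta else hi <- theta   # shrink the bracket towards 0
    theta <- runif(1, lo, hi)
  }
}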
13
14
15
16
◮ Regression model: y = Xβ + ε, ε ∼ N(0, σ²In).
◮ Posterior: p(β | y, X, σ²) ∝ f(y | X, β, σ²) π(β).
◮ f(y | X, β, σ²) can be rewritten (based on the OLS estimates) as proportional to a Gaussian density in β with mean β̂ = (X⊤X)−1X⊤y and covariance σ²(X⊤X)−1.
◮ The slice sampler of Murray et al. (2010) can be applied directly, with this Gaussian factor playing the role of the N(0, V) prior and π(β) playing the role of the likelihood.
◮ We actually sample ∆ = β − β̂ and set β = β̂ + ∆ (a code sketch follows).
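A minimal sketch of this scheme in R, reusing the ess_update function above and holding σ² fixed for simplicity (the full sampler also updates σ²): the Gaussian factor N(β̂, σ²(X⊤X)−1) plays the role of the prior in the elliptical slice sampler, while the actual prior π(β) plays the role of the likelihood, and we work with ∆ = β − β̂.

# Draw beta | y, sigma2 under an arbitrary prior via elliptical slice sampling.
sample_beta <- function(X, y, sigma2, log_prior, n_draws = 2000) {
  XtX      <- crossprod(X)
  beta_hat <- as.vector(solve(XtX, crossprod(X, y)))   # OLS estimate
  V        <- sigma2 * solve(XtX)                      # covariance of the Gaussian factor
  Lchol    <- t(chol(V))                               # for drawing v ~ N(0, V)
  p        <- ncol(X)
  delta    <- rep(0, p)                                # delta = beta - beta_hat
  draws    <- matrix(NA_real_, n_draws, p)
  for (s in 1:n_draws) {
    v     <- as.vector(Lchol %*% rnorm(p))
    delta <- ess_update(delta, v, function(d) log_prior(beta_hat + d))
    draws[s, ] <- beta_hat + delta
  }
  draws
}
# Example prior: independent standard Cauchy densities on the coefficients.
# log_cauchy <- function(b) sum(dcauchy(b, log = TRUE))
# draws <- sample_beta(X, y, sigma2 = 1, log_prior = log_cauchy)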
17
18
◮ β has a jointly Gaussian likelihood and independent priors, so it is natural to update one coefficient (or block) at a time.
◮ Apply the elliptical slice sampler to the conditional likelihood of βk | β−k.
◮ Thus it is a “slice-within-Gibbs” sampler.
19
The conditional likelihood of βk given β−k is Gaussian, with mean
β̂k + Σk,−k Σ−1−k,−k (β−k − β̂−k)
and variance
Σk,k − Σk,−k Σ−1−k,−k Σ−k,k.
20
21
◮ The quantities Σk,−k Σ−1−k,−k and s²k = Σk,k − Σk,−k Σ−1−k,−k Σ−k,k can be precomputed for each k before sampling, so the per-coefficient updates stay cheap (a code sketch follows).
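A small R sketch of these conditional moments (illustrative; the indexing conventions are mine): given β̂ and Σ = σ²(X⊤X)−1, the function returns the mean and variance of the Gaussian conditional “likelihood” of βk given β−k.

# Conditional mean and variance of beta_k given beta_{-k} under N(beta_hat, Sigma).
cond_moments <- function(k, beta, beta_hat, Sigma) {
  mk <- -k
  A  <- Sigma[k, mk] %*% solve(Sigma[mk, mk])           # Sigma_{k,-k} Sigma_{-k,-k}^{-1}
  mu <- beta_hat[k] + A %*% (beta[mk] - beta_hat[mk])   # conditional mean
  s2 <- Sigma[k, k] - A %*% Sigma[mk, k]                # conditional variance
  list(mean = as.numeric(mu), var = as.numeric(s2))
}
# In the slice-within-Gibbs sampler, A and s2 are precomputed once for every k
# (as noted above) rather than recomputed with solve() at each update.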
22
π(β)
23
k=1 ρk
j
j
j=1(¯
j=1 β2 j
24
◮ Independent regressors: Xij are iid standard Gaussian.
◮ Factor structure: suppose there are k = p/5 factors, themselves iid standard Gaussian, and the regressors load on these factors.
◮ Noise: σ is set as a function of κ and Σj βj², where κ controls the noise level.
◮ Model: yi = xi⊤β + εi, εi ∼ N(0, σ²), for i = 1, . . . , n (simulated in the sketch below).
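An R sketch of this design (the coefficient vector and the exact noise rule are illustrative assumptions; here σ = κ·sqrt(Σj βj²) is one reading of “κ controls the noise level”, and each factor is assumed to drive five consecutive regressors).

# Simulated data roughly following the design above.
set.seed(3)
n <- 200; p <- 100; kappa <- 1
beta  <- c(rep(2, 5), rep(0, p - 5))                 # illustrative coefficient vector
sigma <- kappa * sqrt(sum(beta^2))                   # assumed noise rule
# Independent design: X_ij iid standard Gaussian.
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X %*% beta + rnorm(n, sd = sigma))
# Factor design: k = p/5 iid Gaussian factors, five regressors per factor (assumed).
k   <- p / 5
Fac <- matrix(rnorm(n * k), n, k)
X_factor <- Fac[, rep(1:k, each = 5)] + matrix(rnorm(n * p), n, p)
y_factor <- as.vector(X_factor %*% beta + rnorm(n, sd = sigma))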
25
Prior      p      ----------- RMSE ------------      --- ESS per second ---
                  OLS     slice   mono    Gibbs      slice   mono    Gibbs
Horseshoe  100    3.38%   1.52%   1.51%   1.51%      1399    613     567
Horseshoe  1000   1.05%   0.27%   0.27%   0.27%      91      5       5
Laplace    100    3.38%   2.39%   2.38%   –          2362    809     –
Laplace    1000   1.04%   0.63%   0.63%   –          168     8       –
Ridge      100    3.38%   3.20%   3.20%   –          3350    959     –
Ridge      1000   1.06%   0.99%   0.99%   –          178     5       –
26
Prior      p      ----------- Error ------------      --- ESS per second ---
                  OLS      1block   mono    Gibbs     1block   mono    Gibbs
Horseshoe  100    16.47%   6.06%    6.04%   6.03%     387      747     792
Horseshoe  1000   6.85%    1.64%    1.64%   1.64%     36       4       4
Laplace    100    17.06%   7.21%    7.15%   –         573      1257    –
Laplace    1000   6.77%    1.95%    1.94%   –         38       5       –
Ridge      100    16.90%   8.50%    8.75%   –         669      1668    –
Ridge      1000   6.85%    2.93%    3.09%   –         38       6       –
27
                     Running time          RMSE                 ESS per second
p     n     κ        J&O       Slice       J&O      Slice       J&O     Slice
1000  300   0.25     119.11    91.50       0.0041   0.0038      46.71   43.19
1000  600   0.25     394.02    88.61       0.0028   0.0026      14.68   47.26
1000  900   0.25     905.36    88.91       0.0021   0.0020      6.60    48.85
1000  300   1        127.33    90.19       0.0189   0.0189      43.92   39.25
1000  600   1        399.50    91.17       0.0129   0.0129      14.39   44.12
1000  900   1        927.96    91.58       0.0098   0.0099      6.35    46.09
1500  450   0.25     346.37    187.91      0.0029   0.0027      16.37   21.26
1500  900   0.25     1073.28   185.57      0.0022   0.0021      5.50    23.08
1500  1350  0.25     2629.52   183.68      0.0018   0.0017      2.27    24.04
1500  450   1        326.63    183.66      0.0164   0.0164      17.39   20.28
1500  900   1        1021.47   174.52      0.0100   0.0101      5.73    23.72
1500  1350  1        2515.37   176.51      0.0071   0.0071      2.36    24.78
3000  100   0.25     85.95     985.68      0.0067   0.0075      69.72   3.89
3000  500   0.25     575.92    983.64      0.0024   0.0022      9.85    4.17
28
29
f(x) = 1 / (π(1 + x²)) is the density of the standard Cauchy distribution.
[Figure: density over β (range −10 to 20).]
30
[Figure: density over β (range −10 to 10).]
31
32
[Figure: estimated coefficients for Intercept, age, sex, bmi, map, tc, ldl, hdl, tch, ltg, glu.]
33
[Figure: Ridge regression and Lasso coefficient paths, Coefficients vs log(1/lambda).]
34
[Figure: cross-validated Mean-Squared Error vs log(Lambda) for Ridge and Lasso.]
35
[Figure: estimated coefficients for Intercept, age, sex, bmi, map, tc, ldl, hdl, tch, ltg, glu: Ridge (RMSE = 0.678), Lasso (RMSE = 0.672), Horseshoe (RMSE = 0.69).]
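The comparison above can be reproduced along these lines; a minimal sketch assuming the Efron et al. diabetes data from the lars package (the variable names match, but the slides do not say which copy of the data is used), a 50% train/test split, glmnet for ridge and lasso, and bayeslm for the horseshoe. The last two lines assume the coefficient draws are stored in fit_hs$beta with the intercept first, as in the bayeslm documentation examples.

# Ridge, lasso (glmnet) and horseshoe (bayeslm) on the diabetes data.
library(lars); library(glmnet); library(bayeslm)
data(diabetes)
x <- as.matrix(diabetes$x)
y <- as.vector(scale(diabetes$y))
set.seed(4)
train <- sample(nrow(x), nrow(x) / 2)                      # 50% training split
cv_ridge <- cv.glmnet(x[train, ], y[train], alpha = 0)
cv_lasso <- cv.glmnet(x[train, ], y[train], alpha = 1)
fit_hs   <- bayeslm(y[train] ~ x[train, ])                 # horseshoe prior (bayeslm default)
rmse <- function(pred) sqrt(mean((y[-train] - pred)^2))
rmse(predict(cv_ridge, newx = x[-train, ], s = "lambda.min"))
rmse(predict(cv_lasso, newx = x[-train, ], s = "lambda.min"))
bhat <- colMeans(fit_hs$beta)                              # assumes draws in $beta, intercept first
rmse(as.vector(cbind(1, x[-train, ]) %*% bhat))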
36
[Figure: out-of-sample MSE for RIDGE, LASSO, and HORSESHOE; Train = 50%.]
37
[Figure: estimated coefficients for the model with squared terms and pairwise interactions of age, sex, bmi, map, tc, ldl, hdl, tch, ltg, glu.]
38
[Figure: Ridge regression and Lasso coefficient paths for the interaction model, Coefficients vs log(1/lambda).]
39
[Figure: cross-validated Mean-Squared Error vs log(Lambda) for Ridge and Lasso, interaction model.]
40
[Figure: estimated coefficients for the interaction model: Ridge (RMSE = 0.664), Lasso (RMSE = 0.674), Horseshoe (RMSE = 0.688).]
41
[Figure: out-of-sample RMSE for OLS, RIDGE, LASSO, and HORSESHOE; Train = 50%.]
42
43
[Figure: motorcycle data, head acceleration vs time after impact.]
44
[Figure: posterior densities of selected spline coefficients (beta 1, 5, 10, 15, 19, 24, 29, 34, 38, 43, 48, 53).]
45
[Figure: HORSESHOE, coefficients with 95% C.I.]
46
[Figure: OLS vs Horseshoe estimates.]
47
[Figure: OLS fit, head acceleration vs time after impact.]
48
[Figure: Horseshoe fit, head acceleration vs time after impact.]
49
[Figure: OLS and Horseshoe fits, head acceleration vs time after impact.]
50
install.packages("adlift") install.packages("bayeslm") require(splines) library("adlift") library("bayeslm") data(motorcycledata) y = motorcycledata[,2] y = (y-mean(y))/sqrt(var(y)) x = motorcycledata[,1] n = length(x) cuts = quantile(x,seq(0.02,0.98,by=0.02)) X = bs(x,knots=cuts) p = ncol(X) fit.ols <- lm(y~X) fit.hs = bayeslm(y~X)
51
◮ Gramacy and Pantaleo (2009). Shrinkage regression for multivariate inference with missing data, and an application to portfolio balancing.
◮ Hahn, He and Lopes (2018). Bayesian factor model shrinkage for linear IV regression with many instruments.
◮ Hahn, He and Lopes (2018). Efficient sampling for Gaussian linear regression with arbitrary priors.
◮ Johndrow, Orenstein and Bhattacharya (2017). Scalable MCMC for Bayes shrinkage priors.
◮ Murray, Adams and MacKay (2010). Elliptical slice sampling. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS).
52