SLIDE 1

Bayesian variable selection

Dr. Jarad Niemi

Iowa State University

September 4, 2017


SLIDE 2

Bayesian regression

Consider the model y = Xβ + ε with ε ∼ N(0, σ²I), where

  • y is a vector of length n,
  • β is an unknown vector of length p,
  • X is a known n × p design matrix, and
  • σ² is an unknown scalar.

For a given design matrix X, we are interested in the posterior p(β, σ²|y), but we may also be interested in which columns of X should be included, i.e. which explanatory variables we should keep in the model.


SLIDE 3

Default Bayesian regression

Assume the standard noninformative prior p(β, σ²) ∝ 1/σ². Then the posterior factors as p(β, σ²|y) = p(β|σ², y) p(σ²|y) with

β | σ², y ∼ N(β̂_MLE, σ² V_β)
σ² | y ∼ IG( (n−p)/2, (n−p)s²/2 )
β | y ∼ t_{n−p}(β̂_MLE, s² V_β)

where

V_β = (X⊤X)⁻¹
β̂_MLE = V_β X⊤ y
s² = (y − X β̂_MLE)⊤ (y − X β̂_MLE) / (n−p).

The posterior is proper if n > p and rank(X) = p.
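A minimal R sketch of these formulas, assuming a response vector y and design matrix X have already been built (the variable names here are illustrative, not from the slides):

library(MASS)  # for mvrnorm

# Posterior quantities under the noninformative prior p(beta, sigma^2) proportional to 1/sigma^2
n      <- length(y); p <- ncol(X)
V_beta <- solve(t(X) %*% X)
b_mle  <- drop(V_beta %*% t(X) %*% y)
s2     <- sum((y - X %*% b_mle)^2) / (n - p)

# Joint draws: sigma^2 | y ~ IG((n-p)/2, (n-p)s^2/2), then beta | sigma^2, y ~ N(b_mle, sigma^2 V_beta)
sigma2 <- 1 / rgamma(1e4, shape = (n - p) / 2, rate = (n - p) * s2 / 2)
beta   <- t(sapply(sigma2, function(s) mvrnorm(1, b_mle, s * V_beta)))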


SLIDE 4

Information about chirps per 15 seconds

Let Yi be the average number of chirps per 15 seconds and Xi be the temperature in Fahrenheit. If we assume Yi ∼ N(β0 + β1 Xi, σ²) independently, then β0 is the expected number of chirps at 0 degrees Fahrenheit and β1 is the expected increase in the number of chirps (per 15 seconds) for each one-degree increase in Fahrenheit.


SLIDE 5

Cricket chirps

As an example, consider the relationship between the number of cricket chirps (in 15 seconds) and temperature (in Fahrenheit), from the example in LearnBayes::blinreg.

[Scatterplot: chirps (per 15 seconds) vs. temp (°F).]


SLIDE 6

Default Bayesian regression

summary(m <- lm(chirps~temp))
##
## Call:
## lm(formula = chirps ~ temp)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.74107 -0.58123  0.02956  0.58250  1.50608
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.61521    3.14434  -0.196 0.847903
## temp         0.21568    0.03919   5.504 0.000102 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9849 on 13 degrees of freedom
## Multiple R-squared:  0.6997, Adjusted R-squared:  0.6766
## F-statistic: 30.29 on 1 and 13 DF,  p-value: 0.0001015

confint(m)  # Credible intervals
##                  2.5 %    97.5 %
## (Intercept) -7.4081577 6.1777286
## temp         0.1310169 0.3003406

SLIDE 7

Fully conjugate subjective Bayesian inference

If we assume the following normal-inverse-gamma prior

β | σ² ∼ N(b0, σ² B0),  σ² ∼ IG(a, b),

then the posterior is

β | σ², y ∼ N(bn, σ² Bn),  σ² | y ∼ IG(a′, b′)

with

Bn⁻¹ = B0⁻¹ + X⊤X
bn = Bn (B0⁻¹ b0 + X⊤ y)
a′ = a + n/2
b′ = b + ½ (y − X b0)⊤ (X B0 X⊤ + I)⁻¹ (y − X b0).
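A short R helper implementing this conjugate update (a sketch written for these notes; the function name and arguments are illustrative):

# Normal-inverse-gamma conjugate update for y = X beta + epsilon
nig_posterior <- function(y, X, b0, B0, a, b) {
  n  <- length(y)
  Bn <- solve(solve(B0) + t(X) %*% X)                   # posterior scale matrix
  bn <- drop(Bn %*% (solve(B0) %*% b0 + t(X) %*% y))    # posterior mean
  r  <- y - X %*% b0
  list(bn = bn, Bn = Bn,
       a = a + n / 2,                                   # IG shape
       b = b + 0.5 * drop(t(r) %*% solve(X %*% B0 %*% t(X) + diag(n)) %*% r))  # IG rate
}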


SLIDE 8

Information about chirps per 15 seconds

Let Yi be the average number of chirps per 15 seconds and Xi be the temperature in Fahrenheit. If we assume Yi ∼ N(β0 + β1 Xi, σ²) independently, then β0 is the expected number of chirps at 0 degrees Fahrenheit and β1 is the expected increase in the number of chirps (per 15 seconds) for each one-degree increase in Fahrenheit. Perhaps a reasonable prior is

p(β0, β1, σ²) ∝ N(β0; 0, 10²) N(β1; 0, 1²) (1/σ²).


SLIDE 9

Subjective Bayesian regression

m = arm::bayesglm(chirps ~ temp,
                  prior.mean.for.intercept  = 0,   # E[beta_0]
                  prior.scale.for.intercept = 10,  # SD[beta_0]
                  prior.df.for.intercept    = Inf, # normal prior for beta_0
                  prior.mean  = 0,                 # E[beta_1]
                  prior.scale = 1,                 # SD[beta_1]
                  prior.df    = Inf,               # normal prior
                  scaled      = FALSE)             # scale prior?

SLIDE 10

Subjective Bayesian regression

summary(m)
##
## Call:
## arm::bayesglm(formula = chirps ~ temp, prior.mean = 0, prior.scale = 1,
##     prior.df = Inf, prior.mean.for.intercept = 0, prior.scale.for.intercept = 10,
##     prior.df.for.intercept = Inf, scaled = FALSE)
##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -1.7450 -0.5795  0.0312  0.5846  1.5142
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.53636    2.99849  -0.179    0.861
## temp         0.21470    0.03738   5.743 6.79e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9701008)
##
##     Null deviance: 41.993  on 14  degrees of freedom
## Residual deviance: 12.611  on 13  degrees of freedom
## AIC: 45.966
##
## Number of Fisher Scoring iterations: 10

SLIDE 11

Subjective vs Default

# Default analysis
tmp = lm(chirps ~ temp)
tmp$coefficients
## (Intercept)        temp
##  -0.6152146   0.2156787
confint(tmp)
##                  2.5 %    97.5 %
## (Intercept) -7.4081577 6.1777286
## temp         0.1310169 0.3003406

# Subjective analysis
m$coefficients
## (Intercept)        temp
##  -0.5363623   0.2146971
confint(m)
##                  2.5 %    97.5 %
## (Intercept) -6.7792735 5.5475553
## temp         0.1388709 0.2925027

SLIDE 12

Subjective vs Default

[Scatterplot: chirps vs. temp, comparing the default and subjective regression fits.]


SLIDE 13

Shrinkage (as V[β1] gets smaller)

[Plot: posterior estimates of β0 and β1 versus the prior variance V (0.1 to 10).]


SLIDE 14

Shrinkage (as V[β1] gets smaller)

[Plot: chirps vs. temp with fitted regression lines as the prior variance V ranges from 1e−02 to 1e+02.]

SLIDE 15

Zellner’s g-prior

Let y = Xβ + ε, ε ∼ N(0, σ²I). If we choose the conjugate prior β ∼ N(b0, σ²B0), we still need to choose b0 and B0. It seems natural to set b0 = 0, which will shrink the estimates of β toward zero, i.e. toward no effect. But how should we choose B0? One option is Zellner’s g-prior, where B0 = g (X⊤X)⁻¹ and g is either set or learned.


SLIDE 16

Zellner’s g-prior posterior

Suppose y ∼ N(Xβ, σ²I) where X is n × p, you use Zellner’s g-prior β ∼ N(b0, g σ²(X⊤X)⁻¹), and you independently assume p(σ²) ∝ 1/σ². The posterior is then

β | σ², y ∼ N( [1/(1+g)] b0 + [g/(1+g)] β̂_MLE,  σ² [g/(1+g)] (X⊤X)⁻¹ ).
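A small R sketch of this posterior (an illustrative helper written for these notes, not from the slides):

# Posterior mean and (scaled) covariance of beta under Zellner's g-prior
gprior_posterior <- function(y, X, g, b0 = rep(0, ncol(X))) {
  XtX_inv <- solve(t(X) %*% X)
  b_mle   <- drop(XtX_inv %*% t(X) %*% y)
  w       <- g / (1 + g)                      # shrinkage factor toward b_mle
  list(mean      = (1 - w) * b0 + w * b_mle,  # E[beta | sigma^2, y]
       cov_scale = w * XtX_inv,               # Cov[beta | sigma^2, y] = sigma^2 * cov_scale
       shrinkage = w)
}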

SLIDE 17

Setting g

In Zellner’s g-prior, β ∼ N(b0, g σ²(X⊤X)⁻¹) with p(σ²) ∝ 1/σ², we need to determine how to set g. Here are some thoughts:

  • g → 0 makes the posterior equal to the prior,
  • g = 1 puts equal weight on the prior and the likelihood,
  • g = n means the prior has the equivalent weight of 1 observation,
  • g → ∞ recovers a uniform prior,
  • use the empirical Bayes estimate ĝ_EB = argmax_g p(y|g), or
  • put a prior on g and perform a fully Bayesian analysis.


SLIDE 18

Marginal likelihood

The marginal likelihood under Zellner’s g-prior is

p(y|g) = [ Γ((n−1)/2) / (π^((n−1)/2) n^(1/2)) ] · ||y − ȳ||^(−(n−1)) · (1 + g)^((n−p−1)/2) / (1 + g[1 − R²])^((n−1)/2)

where R² is the coefficient of determination. We use the marginal likelihood as evidence in favor of a model, i.e. when comparing models, those with higher marginal likelihoods should be preferred over the rest.
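A direct R translation of this formula on the log scale (a sketch; here X holds only the non-intercept explanatory variables, so p counts only those, and R² comes from an ordinary least-squares fit):

# log p(y | g) under Zellner's g-prior
log_marg_lik <- function(y, X, g) {
  n  <- length(y); p <- ncol(X)
  R2 <- summary(lm(y ~ X))$r.squared                 # coefficient of determination
  lgamma((n - 1) / 2) - ((n - 1) / 2) * log(pi) - 0.5 * log(n) -
    ((n - 1) / 2) * log(sum((y - mean(y))^2)) +      # -(n-1) * log ||y - ybar||
    ((n - p - 1) / 2) * log(1 + g) -
    ((n - 1) / 2) * log(1 + g * (1 - R2))
}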


SLIDE 19

Why the marginal likelihood?

By Bayes’ rule, we have

p(θ|y, M) = p(y|θ, M) p(θ|M) / p(y|M).

Rearranging yields

p(y|M) = p(y|θ, M) p(θ|M) / p(θ|y, M),

an identity that holds for any value of θ. Taking logarithms yields

log p(y|M) = log p(y|θ, M) + [ log p(θ|M) − log p(θ|y, M) ]

where log p(y|θ, M) is the likelihood and log p(θ|M) − log p(θ|y, M) is the penalty.


SLIDE 20

Model selection

If β is a vector of length p, let γ be a vector of binary elements indicating whether the corresponding component of β is non-zero, i.e. whether that explanatory variable is included. For example, γ = (1, 0, 1, 1, 0, 0, 0, 1) indicates that β is of length 8 and that its first, third, fourth, and eighth elements are non-zero. Then Xγ denotes the design matrix containing only the columns corresponding to the elements of γ that are 1, and βγ is the corresponding subset of β. We therefore have 2^p models Mγ of the form y = Xγ βγ + ε where ε ∼ N(0, σ²I). Two special cases are γnull = (0, . . . , 0) and γfull = (1, . . . , 1).
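A tiny R illustration of the γ indexing (the design matrix here is simulated, purely for illustration):

# Build X_gamma from an inclusion vector gamma
X       <- matrix(rnorm(100 * 8), 100, 8)        # illustrative 100 x 8 design matrix
gamma   <- c(1, 0, 1, 1, 0, 0, 0, 1)             # include columns 1, 3, 4, and 8
X_gamma <- X[, gamma == 1, drop = FALSE]         # design matrix for model M_gamma

# All 2^p inclusion vectors, one per row
all_gammas <- as.matrix(expand.grid(rep(list(0:1), ncol(X))))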


SLIDE 21

Model selection (cont.)

If we want to compare Mγ to Mnull using a common g, we can use the Bayes factor

BF(Mγ : Mnull) = p(y|Mγ, g) / p(y|Mnull, g) = (1 + g)^((n − pγ − 1)/2) / (1 + g[1 − R²γ])^((n−1)/2).

Then, for any two models with a common g, we can compare them using

BF(Mγ : Mγ′) = BF(Mγ : Mnull) / BF(Mγ′ : Mnull) = [p(y|Mγ, g)/p(y|Mnull, g)] / [p(y|Mγ′, g)/p(y|Mnull, g)] = p(y|Mγ, g) / p(y|Mγ′, g).

If the base model is the null model, then the common parameters amongst the models are σ² and possibly an intercept α. We can place an improper prior on these parameters, typically p(α, σ²) ∝ 1/σ², without affecting the Bayes factors.
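A one-function R sketch of this Bayes factor, reused on the later slides (the function name is ours, not from the slides):

# BF(M_gamma : M_null) for a common g, given the model's R^2, n, and number of predictors
bf_vs_null <- function(R2_gamma, n, p_gamma, g) {
  (1 + g)^((n - p_gamma - 1) / 2) / (1 + g * (1 - R2_gamma))^((n - 1) / 2)
}

# Two models sharing a common g are then compared by the ratio, e.g.
# bf_vs_null(R2_1, n, p_1, g) / bf_vs_null(R2_2, n, p_2, g)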


SLIDE 22

Zellner’s g-prior in R

library(BMS)
m0 = zlm(chirps ~ 1,           g = 'UIP') # g = n
m1 = zlm(chirps ~ scale(temp), g = 'UIP') # g = n
(bf = exp(m1$marg.lik - m0$marg.lik))
## [1] 438.2629
summary(m1)
## Coefficients
##                Exp.Val.   St.Dev.
## (Intercept)   16.633333        NA
## scale(temp)    1.358165 0.2839367
##
## Log Marginal Likelihood:
## -20.07976
## g-Prior: UIP
## Shrinkage Factor: 0.938

SLIDE 23

Zellner’s g-prior in R

library(bayess)
m = BayesReg(chirps, temp, g = length(chirps)) # explanatory variables are scaled
##
##           PostMean PostStError Log10bf EvidAgaH0
## Intercept  16.6333      0.2833
## x1          1.3121      0.2743  2.6417    (****)
##
##
## Posterior Mean of Sigma2: 1.2039
## Posterior StError of Sigma2: 1.7783

SLIDE 24

Limiting Bayes Factors

If the base model is the null model, then

BF(Mγ : Mnull) = (1 + g)^((n − pγ − 1)/2) / (1 + g[1 − R²γ])^((n−1)/2)

where pγ is the number of non-zero elements in γ, i.e. the number of explanatory variables included in the model.

  • As g → ∞, BF(Mγ : Mnull) → 0. (Lindley’s paradox)
  • As n → ∞, BF(Mγ : Mnull) → ∞.
  • As R²γ → 1, BF(Mγ : Mnull) → (1 + g)^((n − pγ − 1)/2), which is bounded for fixed g. (information paradox)

If M∗ is the true model, we would like BF(M∗ : Mγ) → ∞ almost surely for any other model Mγ. This is called model selection consistency.
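A quick numerical illustration of these limits using the bf_vs_null sketch from earlier (the R² and n values are purely illustrative):

# Lindley's paradox: for fixed data, BF(M_gamma : M_null) -> 0 as g -> infinity
sapply(c(1, 100, 1e6, 1e12, 1e18),
       function(g) bf_vs_null(R2_gamma = 0.7, n = 15, p_gamma = 1, g = g))

# Information paradox: even with R^2 near 1, the Bayes factor cannot exceed (1 + g)^((n - p_gamma - 1)/2)
bf_vs_null(R2_gamma = 0.999999, n = 15, p_gamma = 1, g = 15)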


SLIDE 25

Empirical Bayes

The empirical Bayes approach chooses g to maximize p(y|Mγ, g). It turns out that

ĝ_EB,γ = max(Fγ − 1, 0),  where  Fγ = [R²γ / pγ] / [(1 − R²γ) / (n − pγ − 1)].

Plugging this back into the expression for the Bayes factor, we find that BF_EB(Mγ : Mnull) → ∞ as R²γ → 1, and thus the empirical Bayes approach does not suffer from either paradox. The empirical Bayes approach is model selection consistent if the true model is not the null model, but is inconsistent if it is.
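Fγ here is the usual F statistic for the regression, so ĝ_EB is easy to compute. A small sketch plugging in the R² and n from the cricket fit on slide 6 (values entered by hand, purely illustrative):

# Empirical Bayes estimate of g for model M_gamma
g_eb <- function(R2_gamma, n, p_gamma) {
  F_gamma <- (R2_gamma / p_gamma) / ((1 - R2_gamma) / (n - p_gamma - 1))
  max(F_gamma - 1, 0)
}

g_eb(R2_gamma = 0.6997, n = 15, p_gamma = 1)   # roughly 29.3 for the cricket data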


SLIDE 26

Fully Bayesian

Alternatively, we can perform a fully Bayesian analysis by putting a prior on g. The Zellner-Siow prior is

g ∼ IG(1/2, n/2).

For this prior, BF(Mγ : Mnull) → ∞ as R²γ → 1, so we do not suffer from any paradoxes, and we have model selection consistency, i.e. BF(M∗ : Mγ) → ∞ almost surely for the true model M∗ compared to any other model Mγ. There are other priors for g that have these properties.
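One way to try this in R is the BAS package (not used in these slides, so treat the interface here as an assumption):

library(BAS)
# Zellner-Siow prior on g, uniform prior over the 2^p models
m_zs <- bas.lm(chirps ~ temp,
               prior      = "ZS-null",  # Zellner-Siow prior, centered on the null model
               modelprior = uniform())
summary(m_zs)  # posterior model and inclusion probabilities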
