Exploring Large Regression Model Spaces via Trans-dimensional Genetic Algorithms
Ricardo S. Ehlers, ICMC - USP
http://www.icmc.usp.br/~ehlers
ehlers@icmc.usp.br
Joint work with Marco A.R. Ferreira, University of Missouri.

UFSCar, April 2009

Searching for the “Best” Model(s)
• Suppose that the number M of alternative models is quite large. E.g. a linear model with 19 possible covariates has $2^{19} = 524288$ alternative models (with no interactions).
• Enumerating every possible model, estimating it and attaching a measure of fit and parsimony to it may not be the best strategy.
• How to compare competing models?
• How to make averaged inference using the competing models (or a subset of them)?
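The size of the model space quoted above can be checked directly. The following is a minimal sketch, assuming each model is coded as a 0/1 inclusion vector over the candidate covariates; the function names are illustrative and not taken from the slides.

```python
from itertools import product


def count_models(n_covariates):
    """Number of subset models when each covariate is either in or out."""
    return 2 ** n_covariates


def enumerate_models(n_covariates):
    """All 0/1 inclusion vectors; only practical for a small number of covariates."""
    return list(product((0, 1), repeat=n_covariates))


print(count_models(19))          # size of the model space in the slide's example
print(enumerate_models(3)[:4])   # first few inclusion vectors for 3 covariates
```

Full enumeration doubles in cost with every extra covariate, which is the motivation for a stochastic search.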

Bayesian Approach
• Models $M_1, \dots, M_k$ are assigned a priori probabilities $p(M_i)$.
• For each model, $\theta_i \in \mathbb{R}^{n_i}$ with:
  – a likelihood function $p(y \mid \theta_i, M_i)$
  – a prior distribution $p(\theta_i \mid M_i)$.
• By Bayes' theorem,
$$\pi(M_i, \theta_i) \propto p(y \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, p(M_i)$$
$$p(M_i \mid y) \propto p(y \mid M_i)\, p(M_i)$$
$$p(y \mid M_i) = \int p(y \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, d\theta_i$$
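Written with its normalising constant, the proportionality for $p(M_i \mid y)$ gives the posterior model probabilities, and these are the weights one would use for averaged inference. The quantity of interest $\Delta$ below is notation introduced here for illustration; it does not appear in the slides.

```latex
p(M_i \mid y) = \frac{p(y \mid M_i)\, p(M_i)}{\sum_{j=1}^{k} p(y \mid M_j)\, p(M_j)},
\qquad
p(\Delta \mid y) = \sum_{i=1}^{k} p(\Delta \mid M_i, y)\, p(M_i \mid y).
```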

Approaches
• Akaike (1974): $AIC(\hat\theta_i, M_i) = -2 \log p(y \mid \hat\theta_i, M_i) + 2 n_i$
• Schwarz (1978): $BIC(\hat\theta_i, M_i) = -2 \log p(y \mid \hat\theta_i, M_i) + n_i \log T$
• Spiegelhalter et al. (2002): $DIC(\theta_i, M_i) = -2 \log p(y \mid \theta_i, M_i) + 2 p_D$
• Gelfand and Ghosh (1998): $D_\gamma = \frac{\gamma}{\gamma + 1} \sum_{i=1}^{n} (\mu_i - y_{i,obs})^2 + \sum_{i=1}^{n} \sigma_i^2$
• George and McCulloch (1993): SSVS
• Chen (2005), Chib (1995), Chib and Jeliazkov (2001), Friel and Pettit (2008): estimating the marginal likelihood.
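The first two criteria only need the maximised log-likelihood, the number of parameters $n_i$ and the sample size $T$. A minimal sketch, assuming the maximised log-likelihood has already been obtained from some fitting routine (the function and argument names are illustrative):

```python
import math


def aic(max_loglik, n_params):
    """Akaike criterion: -2 log p(y | theta_hat, M) + 2 n."""
    return -2.0 * max_loglik + 2.0 * n_params


def bic(max_loglik, n_params, n_obs):
    """Schwarz criterion: -2 log p(y | theta_hat, M) + n log T."""
    return -2.0 * max_loglik + n_params * math.log(n_obs)


# Hypothetical fit: log-likelihood -10 with 3 parameters on 47 observations.
print(aic(-10.0, 3), bic(-10.0, 3, 47))
```

Since $\log T > 2$ once $T \geq 8$, BIC penalises each extra parameter more heavily than AIC for all but very small samples.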

Genetic Algorithms
Holland (1975), Chatterjee, Laudato, and Lynch (1996)

A population of M individuals, each of dimension L:
$$\begin{pmatrix}
x_{11} & \cdots & x_{1k} & \cdots & x_{1L} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ik} & \cdots & x_{iL} \\
\vdots &        & \vdots &        & \vdots \\
x_{j1} & \cdots & x_{jk} & \cdots & x_{jL} \\
\vdots &        & \vdots &        & \vdots \\
x_{M1} & \cdots & x_{Mk} & \cdots & x_{ML}
\end{pmatrix}$$
Apply genetic operators to transform the population: selection, crossover, mutation.

Trans-dimensional Jumps
Green (1995)
• Propose a jump from model $M_i$ to model $M_j$ w.p. $r_{ij}$,
• generate a vector $u$ of dimension $n_j - n_i$ from $q(\cdot)$,
• set $\theta_j = f_{ij}(\theta_i, u)$ where $f_{ij} : \Theta_i \times \mathbb{R}^{n_j - n_i} \to \Theta_j$ denotes a bijective function.
• Accept the jump w.p. $\min(1, A)$ where
$$A = \underbrace{\frac{\pi(\theta_j, M_j)}{\pi(\theta_i, M_i)}}_{\text{target ratio}} \; \underbrace{\frac{r_{ji}}{r_{ij}\, q(u)}}_{\text{proposal ratio}} \; \left| \frac{\partial f_{ij}(\theta_i, u)}{\partial (\theta_i, u)} \right|$$
The choice of the proposal distribution $q$ is crucial to cover the model and parameter spaces.
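As a worked special case, suppose the new parameters are simply appended to the current ones, i.e. $f_{ij}(\theta_i, u) = (\theta_i, u)$. This identity map is an assumption made here for illustration, not a choice stated in the slides. Its Jacobian determinant is 1, so the acceptance ratio reduces to the target ratio times the proposal ratio:

```latex
f_{ij}(\theta_i, u) = (\theta_i, u)
\;\Longrightarrow\;
\left| \frac{\partial f_{ij}(\theta_i, u)}{\partial (\theta_i, u)} \right| = 1,
\qquad
A = \frac{\pi\big((\theta_i, u), M_j\big)}{\pi(\theta_i, M_i)} \cdot \frac{r_{ji}}{r_{ij}\, q(u)}.
```

The reverse jump from $M_j$ to $M_i$ then drops $u$ deterministically and is accepted w.p. $\min(1, A^{-1})$.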

We assume that:
• $\theta_i \mid M_i$ is easy to estimate using standard methods and software.
• The posterior distribution on model space is well approximated by
$$P(M_k \mid y) \propto \exp\{-BIC(\hat\theta_k, k)/2\}, \qquad BIC(\hat\theta_k, k) = -2 \log p(y \mid \hat\theta_k, k) + n_k \log T,$$
where $\hat\theta_k$ is the maximum likelihood estimate under model $M_k$.
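Turning the approximation $P(M_k \mid y) \propto \exp\{-BIC/2\}$ into probabilities over a finite set of models only requires normalisation. A minimal sketch, assuming the BIC values of the models under comparison are already available; subtracting the smallest BIC before exponentiating is a numerical-stability choice made here, not something from the slides.

```python
import math


def bic_weights(bics):
    """Normalised weights proportional to exp(-BIC / 2)."""
    best = min(bics)
    # Shifting by the best BIC leaves the normalised weights unchanged
    # but avoids underflow when BIC values are large.
    raw = [math.exp(-(b - best) / 2.0) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]


print(bic_weights([100.0, 102.0, 110.0]))
```

A BIC difference of 2 between two models corresponds to odds of $e^{1} \approx 2.7$ in favour of the model with the smaller BIC.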

RJMCMC + Genetic Algorithms
$$g(E(Y)) = \beta_0 + \beta_{j_1} x_{j_1} + \cdots + \beta_{j_k} x_{j_k}, \qquad k = 0, \dots, k_{max}$$
Given a population of models $Z = (z_1, \dots, z_M)$ where $z_{ij} \in \{0, 1\}$,
1. propose a new population $z'$ via genetic operators (esp. mutation and crossover),
2. accept the new population with probability
$$\min\left\{1, \frac{\exp\{-BIC(z')/2\}}{\exp\{-BIC(z)/2\}} \, \frac{P(z', z)}{P(z, z')}\right\}$$
where $P(z, z') = \Pr(\text{proposing a jump from population } z \text{ to } z')$.
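To make the encoding concrete, here is a sketch of how the BIC of a single 0/1 vector $z_i$ could be evaluated for a Gaussian linear model with identity link. Several things are assumptions of this sketch rather than statements from the slides: ordinary least squares via the normal equations, an intercept that is always included, and a parameter count of coefficients plus the error variance. `solve_linear_system` and `gaussian_bic` are hypothetical helper names.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def gaussian_bic(z, X, y):
    """BIC of the linear model with an intercept and the columns of X flagged by z."""
    T = len(y)
    cols = [j for j, flag in enumerate(z) if flag == 1]
    design = [[1.0] + [row[j] for j in cols] for row in X]
    p = len(cols) + 1
    xtx = [[sum(d[r] * d[c] for d in design) for c in range(p)] for r in range(p)]
    xty = [sum(d[r] * y_i for d, y_i in zip(design, y)) for r in range(p)]
    beta = solve_linear_system(xtx, xty)
    rss = sum((y_i - sum(b * v for b, v in zip(beta, d))) ** 2
              for d, y_i in zip(design, y))
    sigma2 = rss / T  # maximum likelihood estimate of the error variance
    max_loglik = -0.5 * T * (math.log(2.0 * math.pi * sigma2) + 1.0)
    n_params = p + 1  # assumption: coefficients plus the error variance
    return -2.0 * max_loglik + n_params * math.log(T)
```

Calling `gaussian_bic((1, 0), X, y)` scores the model that uses only the first column of `X`. For other link functions the log-likelihood would come from the corresponding fitting routine instead.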

Crossover Move
Combine pairs of models to generate offspring, which are more likely to be accepted if they have high performance. Randomly choose a pair of individuals $z_i$, $z_j$ and propose a new population as follows:
1. select the positions at which the two models differ, $K = \{k : z_{ik} \neq z_{jk}\}$
2. randomly choose $k \in K$
3. set $z'_{ik} = z_{jk}$ and $z'_{jk} = z_{ik}$
4. accept this new population with probability
$$\min\left\{1, \frac{\exp(-BIC(z'_i)/2 - BIC(z'_j)/2)}{\exp(-BIC(z_i)/2 - BIC(z_j)/2)} \, \frac{P(z', z)}{P(z, z')}\right\}$$
Repeat this updating scheme for all $[M/2]$ pairs selected without replacement from the population.
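The four steps above can be sketched for one pair as follows. This is an illustration under stated assumptions, not the authors' implementation: `bic` is a user-supplied function returning the BIC of a 0/1 tuple, $k$ is drawn uniformly from $K$, and a pair of identical parents is left unchanged. Under uniform choice the swap leaves $K$ unchanged, so $P(z', z) = P(z, z')$ and the proposal ratio cancels in this sketch.

```python
import math
import random


def crossover_step(z_i, z_j, bic, rng):
    """One crossover proposal for a pair of models; returns the (possibly updated) pair."""
    differing = [k for k in range(len(z_i)) if z_i[k] != z_j[k]]
    if not differing:  # assumption: identical parents are left as they are
        return z_i, z_j
    k = rng.choice(differing)
    new_i, new_j = list(z_i), list(z_j)
    new_i[k], new_j[k] = z_j[k], z_i[k]
    new_i, new_j = tuple(new_i), tuple(new_j)
    # Log of the acceptance ratio; the proposal ratio is 1 (see lead-in).
    log_ratio = -0.5 * (bic(new_i) + bic(new_j)) + 0.5 * (bic(z_i) + bic(z_j))
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return new_i, new_j
    return z_i, z_j


rng = random.Random(0)
toy_bic = lambda z: 0.0  # flat toy criterion, so every proposal is accepted
print(crossover_step((1, 0, 1, 0), (0, 0, 1, 1), toy_bic, rng))
```

Working on the log scale avoids overflow when BIC differences are large.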

Mutation Move
Include a new regressor w.p. $w$, or delete an existing one w.p. $1 - w$. Suppose we are updating $z_i$ and propose an inclusion. Define $R_0 = \{j : z_{ij} = 0\}$ and $R_1 = \{j : z_{ij} = 1\}$. Then,
1. randomly choose $j \in R_0$ and set $z'_{ij} = 1$
2. accept this move w.p. $\min(1, A)$ where
$$A = \frac{\exp(-BIC(z'_i)/2)}{\exp(-BIC(z_i)/2)} \, \frac{(1 - w)\, |R_0|}{w\, (|R_1| + 1)}$$
and $|J|$ denotes the cardinality of $J$.
Likewise, if a deletion is proposed,
1. choose $j \in R_1$ and set $z'_{ij} = 0$
2. accept the move w.p. $\min(1, A^{-1})$.
Repeat this updating scheme for all $z_1, \dots, z_M$.
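A sketch of one mutation update follows, again with a user-supplied `bic` function on 0/1 tuples. For a deletion the ratio is written out explicitly as the reverse of the inclusion ratio, evaluated with $R_0$ and $R_1$ of the current model. Returning the model unchanged when the relevant set is empty is an assumption of this sketch, since the slides do not say how the empty and full models are handled, and $0 < w < 1$ is required.

```python
import math
import random


def mutation_step(z, bic, w, rng):
    """One inclusion/deletion proposal for a single model z (a 0/1 tuple)."""
    r0 = [j for j, flag in enumerate(z) if flag == 0]
    r1 = [j for j, flag in enumerate(z) if flag == 1]
    include = rng.random() < w
    candidates = r0 if include else r1
    if not candidates:  # assumption: nothing to add or remove, keep z
        return z
    j = rng.choice(candidates)
    proposal = list(z)
    proposal[j] = 1 if include else 0
    proposal = tuple(proposal)
    log_target = -0.5 * (bic(proposal) - bic(z))
    if include:
        # (1 - w) |R0| / (w (|R1| + 1)), as in the inclusion ratio A
        log_prop = math.log((1.0 - w) * len(r0)) - math.log(w * (len(r1) + 1))
    else:
        # reverse of an inclusion: w |R1| / ((1 - w) (|R0| + 1))
        log_prop = math.log(w * len(r1)) - math.log((1.0 - w) * (len(r0) + 1))
    log_a = log_target + log_prop
    if log_a >= 0.0 or rng.random() < math.exp(log_a):
        return proposal
    return z


rng = random.Random(1)
toy_bic = lambda z: 3.0 * sum(z)  # toy criterion that mildly favours small models
print(mutation_step((1, 0, 1, 0), toy_bic, 0.5, rng))
```

Each call changes at most one coordinate of `z`, so sweeping it over $z_1, \dots, z_M$ gives the full mutation update of the population.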

Example - Linear Regression
Effect of punishment regimes on crime rates in 47 US states, with 15 potential regressors (Raftery, Painter, and Volinsky 2005).
M: percentage of males aged 14-24
So: indicator variable for a southern state
Ed: mean years of schooling
Po1: police expenditure in 1960
Po2: police expenditure in 1959
LF: labour force participation rate
M.F: number of males per 1000 females
Pop: state population
NW: number of nonwhites per 1000 people
U1: unemployment rate of urban males 14-24
U2: unemployment rate of urban males 35-39
GDP: gross domestic product per head
Ineq: income inequality
Prob: probability of imprisonment
Time: average time served in state prisons

Probs  0.209  0.122  0.060  0.055  0.053  0.036  0.026  0.025  0.023  0.022  Prob.inc
M      1      1      1      1      1      1      1      1      1      1      0.9890
So     0      0      0      0      0      0      0      0      0      0      0.0549
Ed     1      1      1      1      1      1      1      1      1      1      1.0000
Po1    1      1      1      1      0      0      1      1      1      0      0.7714
Po2    0      0      0      0      1      1      0      0      0      1      0.2459
LF     0      0      0      0      0      0      0      0      0      0      0.0290
M.F    0      0      0      0      0      0      0      0      0      0      0.0347
Pop    0      0      0      1      0      0      0      0      0      0      0.2049
NW     1      1      1      1      1      1      1      1      0      1      0.9227
U1     0      0      0      0      0      0      0      1      0      0      0.0889
U2     1      1      1      1      1      1      0      1      1      1      0.8891
GDP    0      0      1      0      0      0      0      0      0      1      0.2414
Ineq   1      1      1      1      1      1      1      1      1      1      1.0000
Prob   1      1      1      1      1      1      1      1      1      1      0.9956
Time   1      0      1      0      0      1      1      1      0      1      0.4963
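An inclusion probability is the total probability of the models that contain a given regressor. The sketch below sums over the ten listed models only, using the Probs, Po1 and Po2 rows of the table. These partial sums are therefore lower bounds on the Prob.inc column, which presumably accounts for every model visited.

```python
def inclusion_probability(flags, probs):
    """Sum the probabilities of the models in which the regressor is included."""
    return sum(p for flag, p in zip(flags, probs) if flag == 1)


probs = [0.209, 0.122, 0.060, 0.055, 0.053, 0.036, 0.026, 0.025, 0.023, 0.022]
po1 = [1, 1, 1, 1, 0, 0, 1, 1, 1, 0]
po2 = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1]

print(round(sum(probs), 3))                          # mass covered by the ten models
print(round(inclusion_probability(po1, probs), 3))   # Po1, top ten models only
print(round(inclusion_probability(po2, probs), 3))   # Po2, top ten models only
```

Within these ten models each one contains exactly one of Po1 and Po2, so the two partial sums add up to the total mass of the ten models.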

[Figure: Models visited by GA-MCMC. Axes: covariate (M, So, Ed, Po1, Po2, LF, M.F, Pop, NW, U1, U2, GDP, Ineq, Prob, Time) and Model.]

Example - Logistic Regression
Risk factors associated with low infant birth weight (Hosmer and Lemeshow 1989).
$y_i \sim \text{Bernoulli}(\pi_i)$ where $\pi_i$ is the probability of low weight at birth for the $i$th baby. Under model $k$ this is associated with the covariates as
$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = X_i' \theta.$$
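The log-likelihood that enters the BIC for this model follows from the logit link: each observation contributes $y_i X_i'\theta - \log(1 + e^{X_i'\theta})$. A minimal sketch, assuming `X` is a list of covariate rows (including a leading 1 for the intercept where one is wanted) and `y` a list of 0/1 outcomes; maximising it over `theta` is left to a standard optimiser.

```python
import math


def logistic_loglik(theta, X, y):
    """Bernoulli log-likelihood under logit(pi_i) = X_i' theta."""
    total = 0.0
    for row, y_i in zip(X, y):
        eta = sum(t * x for t, x in zip(theta, row))
        # log(1 + exp(eta)) computed in a numerically stable way
        log1pexp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        total += y_i * eta - log1pexp
    return total


X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]  # toy data
y = [0, 0, 1, 1]
print(logistic_loglik([0.0, 0.0], X, y))  # every pi_i equals 0.5
print(logistic_loglik([0.0, 1.0], X, y))  # slope in the right direction fits better
```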

Model                  Covariates indicator                               Model
indicator              age    lwt    race   smoke  ptl    ht     ui     ftv    probability
35                     0      1      0      0      0      1      0      0      0.0962
99                     0      1      0      0      0      1      1      0      0.0673
51                     0      1      0      0      1      1      0      0      0.0600
43                     0      1      0      1      0      1      0      0      0.0599
107                    0      1      0      1      0      1      1      0      0.0333
3                      0      1      0      0      0      0      0      0      0.0294
115                    0      1      0      0      1      1      1      0      0.0287
17                     0      0      0      0      1      0      0      0      0.0239
19                     0      1      0      0      1      0      0      0      0.0202
47                     0      1      1      1      0      1      0      0      0.0202
Inclusion probability  0.190  0.696  0.140  0.381  0.349  0.659  0.376  0.081  –
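The model indicators in the table are consistent with the coding $1 + \sum_j z_j 2^{j-1}$ over the covariates in the order age, lwt, race, smoke, ptl, ht, ui, ftv. This rule is inferred here from the ten listed rows and is not stated in the slides. Under that assumption an indicator can be decoded as follows:

```python
COVARIATES = ("age", "lwt", "race", "smoke", "ptl", "ht", "ui", "ftv")


def encode(z):
    """Model indicator for a 0/1 vector, assuming indicator = 1 + sum z_j 2^(j-1)."""
    return 1 + sum(flag << j for j, flag in enumerate(z))


def decode(indicator, names=COVARIATES):
    """Names of the covariates included in the model with this indicator."""
    code = indicator - 1
    return tuple(name for j, name in enumerate(names) if (code >> j) & 1)


print(decode(35))                          # most probable model in the table
print(encode((0, 1, 0, 0, 0, 1, 0, 0)))    # its 0/1 row maps back to the indicator
```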

[Figure: Models visited by GA-MCMC. Axes: covariate (age, lwt, race, smoke, ptl, ht, ui, ftv) and Model.]

Example - Censored Survival Models
Survival times of patients with primary biliary cirrhosis,
$$h(t) = h_0(t) \exp(X_i' \theta).$$
age: in years
alb: serum albumin
alkphos: alkaline phosphatase
ascites: presence of ascites
bili: serum bilirubin
edtrt: edema treatment
hepmeg: enlarged liver
platelet: platelet count
protime: standardised blood clotting time
sex: 1=male
sgot: liver enzyme (now called AST)
spiders: blood vessel malformations in the skin
stage: histologic stage of disease (needs biopsy)
trt: 1/2/-9 for control, treatment, not randomised
copper: urine copper
