SLIDE 1

Sliced Inverse Regression with Interaction (SIRI) Detection for non-Gaussian BN learning

Jun S. Liu, Department of Statistics, Harvard University
Joint work with Bo Jiang

SLIDE 2

General: Regression and Classification

             Ind 1              Ind 2              …   Ind N
Responses    Y1                 Y2                 …   YN
Covariates   x11, x12, …, x1p   x21, x22, …, x2p   …   xN1, xN2, …, xNp


SLIDE 4–5

Variable Selection with Interaction

Let Y ∈ ℝ be a univariate response variable and X ∈ ℝ^p be a vector of p continuous predictor variables, e.g.

    Y = X1 × X2 + ε,  ε ~ N(0, σ²),  X ~ MVN(0, I_p)

Suppose p = 1000. How do we find X1 and X2?

One-step forward selection: ~500,000 interaction terms.

Is there any marginal relationship between Y and X1?
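A quick simulation makes the answer concrete (a hypothetical sketch, not code from the talk; the noise level σ = 0.5 and the seed are arbitrary choices):

```python
# Y = X1 * X2 + eps: Y is marginally uncorrelated with X1, so marginal
# (first-moment) screening cannot find X1 or X2.
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 1000
X = rng.standard_normal((n, p))
Y = X[:, 0] * X[:, 1] + 0.5 * rng.standard_normal(n)  # sigma = 0.5, arbitrary

print(np.corrcoef(Y, X[:, 0])[0, 1])  # ~ 0: no marginal correlation

# Yet X1 drives the conditional spread: Var(Y | X1 = x) = x**2 + 0.25,
# so observations with large |X1| show much larger variance in Y.
print(Y[np.abs(X[:, 0]) > 1].var(), Y[np.abs(X[:, 0]) < 1].var())
```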

SLIDE 6

[Y|X] ? [X|Y] ? Who is behind the bar?

SLIDE 7

General: Regression and Classification

(Same data table as Slide 2.)

    P(Y | X) = P(X | Y) P(Y) / P(X)

How to model this?

SLIDE 8

Naïve Bayes model

[Diagram: Y points to each of X1, X2, X3, …, Xm]

SLIDE 9

(Augmented) Naïve Bayes Model

  • BEAM: Bayesian Epistasis Association Mapping (Zhang and Liu 2007): discrete univariate response and discrete predictors
  • (Augmented) Naïve Bayes Classifier with Variable Selection and Interaction Detection (Yuan Yuan et al.): discrete univariate response and continuous (but discretized) predictors
  • Bayesian Partition Model for eQTL study (Zhang et al. 2010): continuous multivariate responses and discrete predictors
  • Sliced Inverse Regression with Interaction Detection (SIRI): continuous univariate response and continuous predictors

SLIDE 10

Tree-Augmented Naïve Bayes

TAN (tree-augmented naïve Bayes; Pearl 1988; Friedman 1997)

[Diagram: Y points to X1, …, X6, with additional tree edges among the X's]

SLIDE 11

Augmented Naïve Bayes

[Diagram: Y points to predictor groups — Group 0 (X01, X02), Group 1 (X11, X12, X13), Group 21 (X2.11, X2.12, X2.13), Group 22 (X2.21, X2.22)]

SLIDE 12

How about continuous covariates?

  • We may discretize Y, and discretize each X
  • Or discretize Y, assuming a joint Gaussian distribution on X?
  • Sound familiar?
SLIDE 13

An observation:

    Y = X1 × X2 + ε,  ε ~ N(0, σ²),  X ~ MVN(0, I_p)

[Scatter plot of y versus x1]

SLIDE 14

Sliced Inverse Regression (SIR, Li 1991)

SIR is a tool for dimension reduction in multivariate statistics.

Let Y ∈ ℝ be a univariate response variable and X ∈ ℝ^p be a vector of p continuous predictor variables, with

    Y = f(β_1ᵀ X, …, β_Kᵀ X, ε)

where f is an unknown function and ε is the error with finite variance. (The running example Y = X1 × X2 + ε is of this form with K = 2, β_1 = e1, β_2 = e2.)

How to identify the unknown projection vectors β_1, …, β_K?


SLIDE 31

SIR Algorithm

Let Σ_xx be the covariance matrix of X. Standardize X to

    Z = Σ_xx^{−1/2} (X − E[X]).

Divide the range of the y_i into S non-overlapping slices H_s, s ∈ {1, …, S}; n_s is the number of observations within slice s.

Compute the mean of the z_i within each slice, z̄_s = n_s^{−1} Σ_{i∈H_s} z_i, and calculate the estimate of Cov{E(X | Y)}:

    M̂ = n^{−1} Σ_{s=1}^{S} n_s z̄_s z̄_sᵀ.

Identify the K largest eigenvalues λ̂_k of M̂ and the corresponding eigenvectors η̂_k. Then

    β̂_k = Σ̂_xx^{−1/2} η̂_k,  k = 1, …, K.
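The algorithm translates almost line by line into NumPy. A minimal sketch under the slide's notation (my illustration; it assumes the predictor set is low-dimensional relative to n so that Σ̂_xx is invertible):

```python
# A line-by-line NumPy rendering of the SIR algorithm above.
import numpy as np

def sir(X, y, n_slices=10, K=2):
    """Return the top-K SIR eigenvalues and directions beta_k."""
    n, p = X.shape
    # Standardize: Z = Sigma_xx^{-1/2} (X - E[X])
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(Xc, rowvar=False).reshape(p, p)
    w, V = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
    Z = Xc @ Sigma_inv_sqrt
    # Divide the range of y into S non-overlapping slices H_s
    slices = np.array_split(np.argsort(y), n_slices)
    # M_hat = n^{-1} sum_s n_s zbar_s zbar_s^T, estimating Cov{E(X | Y)}
    M = np.zeros((p, p))
    for H in slices:
        zbar = Z[H].mean(axis=0)
        M += len(H) * np.outer(zbar, zbar)
    M /= n
    # Top-K eigenpairs of M_hat; map back by beta_k = Sigma^{-1/2} eta_k
    lam, eta = np.linalg.eigh(M)
    top = np.argsort(lam)[::-1][:K]
    return lam[top], Sigma_inv_sqrt @ eta[:, top]
```

On the running example Y = X1 × X2 + ε the returned eigenvalues are essentially zero, since E(X | Y) ≡ 0 there — exactly the failure mode the later slides address.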

SLIDE 32

SIR with Variable Selection

Only a subset of the predictors is relevant: β_1, …, β_K are sparse.

  • Shrinkage estimates of β_1, …, β_K using an L1 or L2 penalty: Regularized SIR (RSIR, Zhong et al. 2005); Sparse SIR (SSIR, Li 2007)
  • Correlation Pursuit (COP, Zhong et al. 2012): a forward-selection and backward-elimination procedure motivated by the F-test in stepwise regression,

        F_{1, n−d−1} = (n − d − 1)(R̂²_{d+1} − R̂²_d) / (1 − R̂²_{d+1})

  • Backward subset selection (Cook 2004; Li et al. 2005)

SLIDE 33–36

Correlation Pursuit (COP)

Let A be the current set of selected predictors and λ̂_k^A the k-th largest eigenvalue estimated by SIR based on the predictors in A. For the j-th predictor X_j (j ∉ A), define the statistic

    COP_k^{A+j} = n (λ̂_k^{A+j} − λ̂_k^A) / (1 − λ̂_k^{A+j}).

If j ∉ A, the COP_k^{A+j} (k = 1, …, K) are asymptotically i.i.d. χ²(1), and COP_{1:K}^{A+j} = Σ_{k=1}^{K} COP_k^{A+j} is asymptotically χ²(K).

The stepwise procedure is consistent if p = O(n^r), r < 1/2.

The dimension K and the thresholds in forward selection (backward elimination) are chosen by cross-validation.
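Given the sir() sketch above, the COP statistic is a few lines (again an assumed illustration; it presumes len(A) ≥ K so that K eigenvalues exist for both fits, and takes A as a list of column indices):

```python
# COP_{1:K}^{A+j} from two SIR fits, reusing the sir() sketch above.
import numpy as np

def cop_statistic(X, y, A, j, K=2, n_slices=10):
    n = len(y)
    lam_A, _ = sir(X[:, A], y, n_slices, K)         # eigenvalues on A
    lam_Aj, _ = sir(X[:, A + [j]], y, n_slices, K)  # eigenvalues on A + j
    cop_k = n * (lam_Aj - lam_A) / (1.0 - lam_Aj)   # COP_k^{A+j}, k = 1..K
    return cop_k.sum()  # asymptotically chi^2(K) when X_j is irrelevant
```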

SLIDE 37–44

SIR via MLE

Let A be the set of relevant predictors and C = Aᶜ, d = |A|:

    X_A | Y ∈ H_s ~ N(μ_s, Σ)
    X_C | X_A, Y ∈ H_s ~ N(X_A β, Σ_0)

Here μ_s = α + Γ γ_s, where γ_s ∈ ℝ^K and Γ is a d × K orthogonal matrix; that is, μ_s belongs to a K-dimensional affine space α + V_K (K < d).

The MLE of the span of the subspace V_K coincides with the SIR directions (Cook 2007; Szretter and Yohai 2009).

Given the current A and a predictor X_j ∉ A, we want to test

    H0: X_j is irrelevant  vs.  H1: X_j is relevant.

Since only the conditional law of X_j differs between the two models,

    P_{M1}(X | Y) / P_{M0}(X | Y) = P_{M1}(X_j | X_A, Y) / P_{M0}(X_j | X_A, Y),

and the likelihood-ratio statistic is

    LR_j = P_{M̂1}(X_j | X_A, Y) / P_{M̂0}(X_j | X_A, Y).

SLIDE 45–46

LR Test vs. COP

Given the current A, the likelihood-ratio (LR) test statistic of H0: X_j is irrelevant vs. H1: X_j is relevant satisfies

    2 LR_j = −n [ Σ_{k=1}^{K} log(1 − λ̂_k^{A+j}) − Σ_{k=1}^{K} log(1 − λ̂_k^A) ]
           = n Σ_{k=1}^{K} log( 1 + (λ̂_k^{A+j} − λ̂_k^A) / (1 − λ̂_k^{A+j}) ).

Under H0 (X_j is irrelevant),

    COP_k^{A+j} = n (λ̂_k^{A+j} − λ̂_k^A) / (1 − λ̂_k^{A+j}) → χ²(1) in distribution,  and  (λ̂_k^{A+j} − λ̂_k^A) / (1 − λ̂_k^{A+j}) →_p 0,

so

    2 LR_j →_p COP_{1:K}^{A+j} = Σ_{k=1}^{K} COP_k^{A+j} → χ²(K) in distribution.
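The last line follows from a first-order expansion of the logarithm; a short worked step (my elaboration of the slide's algebra):

```latex
% Let x_k = (\hat\lambda_k^{A+j} - \hat\lambda_k^A)/(1 - \hat\lambda_k^{A+j}),
% so n x_k = \mathrm{COP}_k^{A+j}. Under H_0, x_k \to_p 0, and
% n\,O_p(x_k^2) = \mathrm{COP}_k^{A+j}\cdot O_p(x_k) = o_p(1), hence
\begin{aligned}
2\,\mathrm{LR}_j
  &= n \sum_{k=1}^{K} \log(1 + x_k)
   = n \sum_{k=1}^{K} \bigl( x_k + O_p(x_k^2) \bigr) \\
  &= \sum_{k=1}^{K} \mathrm{COP}_k^{A+j} + o_p(1)
   \;\xrightarrow{\;d\;}\; \chi^2(K).
\end{aligned}
```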

SLIDE 47

Beyond the First-order

  • E(X1 | Y) = 0

[Scatter plot of y versus x1]
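A one-line reason why the first moment is silent here (my elaboration, assuming the running model Y = X1 X2 + ε with independent, symmetric components):

```latex
% Flipping signs, (X_1, X_2) \mapsto (-X_1, -X_2), leaves Y = X_1 X_2 + \epsilon
% unchanged while negating X_1, so (Y, X_1) and (Y, -X_1) share the same joint
% law. Hence the first inverse moment vanishes identically,
E(X_1 \mid Y) = 0,
% while the second inverse moment still carries signal,
\operatorname{Var}(X_1 \mid Y) = E(X_1^2 \mid Y) \neq \text{const},
% which is the kind of signal the augmented model on the next slide targets.
```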
SLIDE 48

An Augmented Model

SLIDE 49

Likelihood Ratio Test

SLIDE 51

Example revisited
SLIDES 53–59

[A sequence of scatter plots of x1 versus x2 illustrating the example]

SLIDE 60

Sure independence screening (SIS) when p ≫ n

SLIDE 61

SIRI: an interweaving strategy

SLIDE 62

Theoretical Properties

SLIDE 64

Consistency of the Stepwise Procedure

SLIDE 65

Implementation Issues

SLIDE 66

Simulation I (linear)

    Y = Xβ + ε,  ε ~ N(0, 1),  Cov(X_i, X_j) = 0.5^{|i−j|}

    n = 200, p = 1000, β = (3, 1.5, 1, 1, 2, 1, 0.9, 1, 1, 1, 0, …, 0)ᵀ

Method                                         FP (of 990)    FN (of 10)
SIRI-C [CV minimizing classification error]    1.86 (0.222)   1.66 (0.117)
SIRI-M [CV minimizing mean squared error]      0.76 (0.120)   1.75 (0.114)
COP                                            1.62 (0.165)   1.67 (0.118)
SIS-SCAD                                       0.10 (0.030)   0.64 (0.069)
LASSO                                          5.40 (0.188)   0.00 (0.000)
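For reference, the design is easy to reproduce (a sketch, assuming the AR-correlation reading Cov(X_i, X_j) = 0.5^{|i−j|} of the slide; the seed is arbitrary):

```python
# Simulation I design: n = 200, p = 1000, AR(0.5)-correlated predictors.
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 1000
beta = np.zeros(p)
beta[:10] = [3, 1.5, 1, 1, 2, 1, 0.9, 1, 1, 1]

idx = np.arange(p)
cov = 0.5 ** np.abs(np.subtract.outer(idx, idx))  # Cov(X_i, X_j) = 0.5^{|i-j|}
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(cov).T
Y = X @ beta + rng.standard_normal(n)             # eps ~ N(0, 1)
```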

SLIDE 67

Simulation II: hierarchical interactions

SLIDE 68

Simulation III: non-hierarchical interactions

SLIDE 69

Simulation IV: non-multiplicative interactions

SLIDE 70

Simulation V (heteroscedastic, single index)

    Y = 0.2 ε (1.5 + Σ_{j=1}^{8} X_j),  X_j ~ independent normal

    n = 1000, p = 1000

Method     FP (of 992)    FN (of 8)
SIRI-C     2.00 (0.163)   0.42 (0.138)
SIRI-M     0.43 (0.079)   4.60 (0.274)
COP        1.26 (0.128)   3.32 (0.192)
SIS-SCAD   3.23 (0.356)   8.00 (0.000)
LASSO      0.64 (0.255)   8.00 (0.000)

SLIDE 71

Simulation VI (hub with linear effect)

    Y = X1 + X1 × (X2 + X3) + 0.2ε,  X_j ~ independent normal

    n = 200, p = 1000

Method       FP (of 997)    FN (of 3)
SIRI-C       0.39 (0.115)   0.12 (0.046)
SIRI-M       0.03 (0.017)   0.04 (0.020)
SIS-SCAD-2   0.00 (0.000)   0.45 (0.068)

SLIDE 72

Simulation VII (three-way interaction)

    Y = X1 × X2 × X3 + 0.2ε,  X_j ~ independent normal

    n = 500, p = 1000

SLIDE 73

Bayesian Networks

SLIDE 75

Learning BN structures

  • Global approach (score/likelihood based):
  • Posterior inference:

        P(G | Data) ∝ p(G) ∫ P(Data | θ_G, G) p(θ_G | G) dθ_G

  • Or a score-based criterion:
  • AIC = −2 log P(Data | θ̂_G, G) + 2 p_G
  • BIC = −2 log P(Data | θ̂_G, G) + log(n) p_G
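As a concrete instance of the BIC line, here is a hedged sketch that scores a candidate DAG node by node under a linear-Gaussian model (the function, its decomposition over nodes, and the `parents` encoding are my assumptions, not the talk's code):

```python
# Node-by-node BIC for a candidate DAG G under a linear-Gaussian model;
# `parents` maps each node index to a list of its parent indices.
import numpy as np

def bic_score(data, parents):
    n = data.shape[0]
    score = 0.0
    for i, pa in parents.items():
        y = data[:, i]
        Z = np.column_stack([np.ones(n)] + [data[:, j] for j in pa])
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ coef
        sigma2 = resid @ resid / n
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)  # Gaussian MLE
        k = Z.shape[1] + 1                      # regression coefs + variance
        score += -2.0 * loglik + np.log(n) * k  # -2 log P(Data|theta,G) + log(n) p_G
    return score
```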

SLIDE 76

Learning structures

  • Local approaches: using conditional independence statements as constraints.
  • Represented by the "Inductive Causation" (IC) algorithm due to Pearl (2000):
  • 1. First, the skeleton of the network (the undirected graph underlying the network structure) is learned by recursively testing conditional independence between nodes.
  • 2. Set the directions of all arcs that are part of a v-structure, i.e., a triplet of nodes incident on a converging connection Xj → Xi ← Xk (a sketch of this step follows below).
  • 3. Set the directions of the other arcs as needed to satisfy the acyclicity constraint.
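Step 2 is mechanical once the skeleton and separating sets are in hand; a minimal sketch (hypothetical data structures: `skeleton` as a set of frozenset edges, `sepset[(j, k)]` keyed with j < k):

```python
# IC step 2: orient unshielded triples Xj - Xi - Xk as Xj -> Xi <- Xk whenever
# Xi is absent from the set that separated Xj and Xk in step 1.
def orient_v_structures(nodes, skeleton, sepset):
    directed = set()
    for j in nodes:
        for k in nodes:
            if j >= k or frozenset((j, k)) in skeleton:
                continue  # need a non-adjacent pair (j, k)
            for i in nodes:
                if (frozenset((j, i)) in skeleton
                        and frozenset((i, k)) in skeleton
                        and i not in sepset.get((j, k), ())):
                    directed.add((j, i))  # Xj -> Xi
                    directed.add((k, i))  # Xk -> Xi
    return directed
```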

SLIDE 77

Finding the Markov blanket of each node using the Grow-Shrink (GS) algorithm

  • It is like a stepwise regression. For each node Xi, we treat it as the response variable and (a) gradually add variables that are predictive of Xi; (b) backward-remove those "redundant" Xj's obtained from the growth phase. (A sketch follows below.)
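A minimal sketch of GS for one node (an assumed illustration; `ci_test` is a hypothetical conditional-independence oracle, e.g. a partial-correlation test):

```python
# Grow-Shrink for the Markov blanket of node i; ci_test(i, j, S, data) is a
# hypothetical callback returning True iff X_i is independent of X_j given X_S.
def grow_shrink(i, variables, data, ci_test):
    mb = []
    # Grow: add any variable still dependent on X_i given the current blanket
    changed = True
    while changed:
        changed = False
        for j in variables:
            if j != i and j not in mb and not ci_test(i, j, mb, data):
                mb.append(j)
                changed = True
    # Shrink: drop variables made redundant by the rest of the blanket
    for j in list(mb):
        rest = [k for k in mb if k != j]
        if ci_test(i, j, rest, data):
            mb.remove(j)
    return mb
```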
SLIDE 78

Discussion

  • Cross-validation to select the dimension and thresholds
  • Back to a full Bayesian model with dynamic slicing
  • We want flexibility in choosing the slicing boundaries
  • Connection with the Mutual-Information Criterion (MIC)
  • Many interesting possibilities
  • Robustness to the distribution of predictors

SLIDE 79

Acknowledgments

Bo Jiang – who did all the work

Dr. Tingting Zhang, Dr. Wenxuan Zhong

Joseph K. Blitzstein