SLIDE 1

Discovering Conditionally Salient Features with Statistical Guarantees

Jaime Roquero Gimenez, James Zou

Stanford University

1 / 11

SLIDE 2

Feature Selection

Setting the problem:
• Dataset with d features X1, . . . , Xd
• Response variable Y
• Goal: find the set of important variables H1 ⊂ {1, . . . , d}

A variable j ∈ H0 is null (i.e. irrelevant for predicting Y) if Xj ⊥⊥ Y | X−j. Otherwise, we say that j ∈ H1 is non-null.

Construct a procedure that outputs an estimate Ŝ of H1, with False Discovery Rate control as the statistical guarantee:

FDR = E[ |Ŝ ∩ H0| / (|Ŝ| ∨ 1) ]

2 / 11
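To make the definition concrete, here is a minimal sketch (mine, not from the slides) of the empirical false discovery proportion, whose expectation over repeated experiments is the FDR; the selected and null sets below are hypothetical.

```python
# Empirical false discovery proportion: |S_hat ∩ H0| / (|S_hat| ∨ 1).
# The FDR is the expectation of this quantity over repeated experiments.
def fdp(S_hat, H0):
    return len(set(S_hat) & set(H0)) / max(len(S_hat), 1)

# Hypothetical run with d = 100 and true non-nulls H1 = {1, 3}:
H0 = set(range(1, 101)) - {1, 3}
print(fdp({1, 3, 42}, H0))  # one of three selections is null -> FDP = 1/3
```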
SLIDE 3

Feature Selection in Linear Model

Fit a linear model to the data:

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + · · · + βdXd + ε

Which variables are important? Those whose coefficients are non-zero:
• β1, β3 ≠ 0 ⇒ 1, 3 ∈ H1
• β2 = β4 = · · · = βd = 0 ⇒ 2, 4, . . . , d ∈ H0

In this model, non-null features are global non-nulls: H1 = {1, 3} regardless of the value of X. (A lasso-based sketch of this selection rule follows below.)

3 / 11
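As an aside (a hedged sketch of mine, not the authors' code), the rule "keep the non-zero coefficients" can be mimicked with the lasso; the penalty alpha is an arbitrary assumption, and nothing in this snippet controls the FDR.

```python
# Sketch: select features whose fitted lasso coefficients are non-zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 500, 20
X = rng.standard_normal((n, d))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.standard_normal(n)  # true H1 = {1, 3}

coef = Lasso(alpha=0.1).fit(X, y).coef_
S_hat = np.flatnonzero(coef) + 1  # 1-indexed feature labels
print(S_hat)  # ideally [1 3]; this alone gives no FDR guarantee
```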

SLIDE 4

Global vs. Local non-nulls

What if whether a feature is non-null depends on the values of other features?

• Y = X2 + ε if X1 > c
• Y = X3 + ε if X1 ≤ c

⇒

• H1 = {1, 2} if X1 > c
• H1 = {1, 3} if X1 ≤ c

From a global perspective, H1 = {1, 2, 3}. Can we design a procedure that selects non-null features locally, while retaining statistical guarantees? Potentially yes, if we model the interactions in a parametric model of Y | X. But what if such models are not available? (A simulation sketch of this switch model follows below.)

4 / 11
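The switch model is easy to simulate; a small sketch (my own construction, with an arbitrary cutoff c = 0):

```python
# Simulate Y = X2 + eps if X1 > c, and Y = X3 + eps if X1 <= c (1-indexed).
import numpy as np

rng = np.random.default_rng(1)
n, c = 1000, 0.0
X = rng.standard_normal((n, 3))
y = np.where(X[:, 0] > c, X[:, 1], X[:, 2]) + 0.1 * rng.standard_normal(n)

# Globally, every feature matters somewhere, so H1 = {1, 2, 3}; but X2 is
# non-null only where X1 > c, and X3 only where X1 <= c.
```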

SLIDE 5

Local Definition of Null Variable

Recall that a variable j ∈ H0 is null if Xj ⊥⊥ Y | X−j. Locally, a variable j ∈ H0(x) is a local null at X = x if Xj ⊥⊥ Y | X−j = x−j.

We define / construct:
• the sets of local nulls H0(x) and local non-nulls H1(x) at points x in feature space,
• a procedure that returns a local estimate Ŝ(x) of the local non-nulls,
• a generalization of the FDR to a local FDR.

How can we retain FDR control in this local setting, without a parametric model for Y | X? (The toy function below spells out H1(x) for the switch model.)

5 / 11
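Purely as an illustration, the local non-null sets of the switch model from the previous slide can be written down explicitly (cutoff c = 0 assumed):

```python
# Local non-null sets H1(x) for the switch model (1-indexed labels):
# where x1 > c the response is driven by X2, elsewhere by X3, and X1
# is locally non-null on both sides because it controls the switch.
def local_non_nulls(x, c=0.0):
    return {1, 2} if x[0] > c else {1, 3}

print(local_non_nulls([0.7, 0.0, 0.0]))  # {1, 2}
```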

SLIDE 6

Knockoff Procedure

Most feature selection procedures construct a score Tj for each feature:

X1, X2, . . . , Xd, Y
↓
T1, T2, . . . , Td

The scores are then ranked, and a cutoff yields Ŝ. This requires a statistical model in order to have statistical guarantees on FDR:
• In high-dimensional settings, the statistical assumptions may fail.
• For local feature selection, subsetting the data limits power and breaks assumptions based on asymptotic behavior.

These limitations make local feature selection a hard problem for the usual methods.

6 / 11
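For concreteness, one common choice of score (an assumption of this sketch, not something the slides prescribe) is the absolute coefficient from a sparse fit:

```python
# One possible score: T_j = |beta_j| from a lasso fit on (X, y).
import numpy as np
from sklearn.linear_model import Lasso

def scores(X, y, alpha=0.1):
    return np.abs(Lasso(alpha=alpha).fit(X, y).coef_)
```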

SLIDE 7

Knockoff Procedure

The knockoff procedure generates a new, synthetic dataset X̃ and constructs scores as before:

X1, X2, . . . , Xd, X̃1, X̃2, . . . , X̃d, Y
↓
T1, T2, . . . , Td, T̃1, T̃2, . . . , T̃d

Ranking the differences Wj = Tj − T̃j allows us to select features with FDR control. This does not require modeling Y | X: the statistical guarantees only depend on the validity of the process that generates X̃. (A sketch of the resulting selection rule follows below.)

7 / 11
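A hedged sketch of the selection step, in the style of the Barber–Candès knockoff filter, assuming valid knockoffs X̃ have already been generated and the scores Wj computed:

```python
# Knockoff+ selection: pick tau = min{t > 0 :
#   (1 + #{j : W_j <= -t}) / max(#{j : W_j >= t}, 1) <= q}
# and return S_hat = {j : W_j >= tau}.
import numpy as np

def knockoff_select(W, q=0.1):
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return np.flatnonzero(W >= t)  # 0-indexed selections
    return np.array([], dtype=int)  # no threshold meets the target level q

# W_j = T_j - T_tilde_j, e.g. differences of lasso coefficient magnitudes.
```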

SLIDE 8

Localize the Knockoff Procedure

Our work generalizes the knockoff procedure to tackle local feature selection:
• Generalize the distributional properties of the knockoff variables X̃ to the local setting, without additional constraints.
• Generalize the construction of the scores to capture local dependence.

By generating X̃ as in the usual knockoff procedure, using the whole dataset, the statistical guarantees carry over to the localized procedure.

8 / 11
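As a loudly simplified stand-in for the localized scores (my own illustration, not necessarily the authors' construction): knockoffs are generated once from the full dataset, and the scores are made local by fitting only on the samples that fall in a region of feature space.

```python
# Localized W scores: lasso on the augmented matrix [X, X_tilde],
# restricted to the samples lying in the region of interest.
import numpy as np
from sklearn.linear_model import Lasso

def local_W(X, X_tilde, y, in_region, alpha=0.1):
    mask = np.array([in_region(x) for x in X])
    coef = Lasso(alpha=alpha).fit(
        np.hstack([X[mask], X_tilde[mask]]), y[mask]).coef_
    d = X.shape[1]
    return np.abs(coef[:d]) - np.abs(coef[d:])  # W_j = T_j - T_tilde_j

# e.g. in_region = lambda x: x[0] > 0 for the switch model's upper region.
```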

SLIDE 9

Example: Switch variable model

Three switch features Xs0, Xs1, Xs2 determine four different sets of local non-nulls S00, S01, S10, S11; within each region, Y has a linear response in the features of the active set Sij.

9 / 11
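The exact switch-variable construction is not reproduced here; as a loudly simplified stand-in (my assumption: the signs of two switch features index the four regions), one could simulate:

```python
# Simplified stand-in: signs of X1, X2 select one of four active sets.
import numpy as np

rng = np.random.default_rng(2)
n, d = 5000, 10
X = rng.standard_normal((n, d))
active = {(0, 0): [2, 3], (0, 1): [4, 5], (1, 0): [6, 7], (1, 1): [8, 9]}
bits = (X[:, 0] > 0).astype(int), (X[:, 1] > 0).astype(int)
y = np.zeros(n)
for (i, j), feats in active.items():
    mask = (bits[0] == i) & (bits[1] == j)
    y[mask] = X[mask][:, feats].sum(axis=1)  # linear response in S_ij
y += 0.1 * rng.standard_normal(n)
```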

SLIDE 10

Local FDR control

[Figure: two panels over 5,000–50,000 samples. Top panel (Power): Average Global Power (Full Space), Average Local Power at medium radius (2 partitions), and Average Local Power at small radius (4 partitions). Bottom panel (FDR): Average Global FDR and Average Local FDR for the same three settings.]

10 / 11

SLIDE 11

Thank you

11 / 11