

SLIDE 1

STK-IN4300 Statistical Learning Methods in Data Science

Riccardo De Bin

debin@math.uio.no

SLIDE 2

Outline of the lecture

• Feature Assessment when p ≫ N
  ◦ Feature Assessment and the Multiple Testing Problem
  ◦ The false discovery rate
• Stability Selection
  ◦ Introduction
  ◦ Selection probability
  ◦ Stability path
  ◦ Choice of regularization

SLIDE 3

Feature Assessment when p ≫ N: multiple testing problem

In the previous lecture:
• talked about the p ≫ N framework;
• focused on the construction of prediction models.

More basic goal:
• assess the significance of the M variables;
  ◦ in this lecture M is the number of variables (as in the book);
• e.g., identify the genes most related to cancer.

SLIDE 4

Feature Assessment when p ≫ N: multiple testing problem

Assessing the significance of a variable can be done:
• as a by-product of a multivariate model,
  ◦ selection by a procedure with the variable selection property;
  ◦ absolute value of a regression coefficient in the lasso;
  ◦ variable importance plots (boosting, random forests, ...);
• evaluating the variables one by one:
  ◦ univariate tests;
  → multiple hypothesis testing.

SLIDE 5

Feature Assessment when p ≫ N: multiple testing problem

Consider the data from Rieger et al. (2004):
• study on the sensitivity of cancer patients to ionizing radiation treatment;
• oligo-nucleotide microarray data (M = 12625);
• N = 58:
  ◦ 44 patients with a normal reaction;
  ◦ 14 patients who had a severe reaction.

SLIDE 6

Feature Assessment when p ≫ N: multiple testing problem

[figure]

SLIDE 7

Feature Assessment when p ≫ N: multiple testing problem

The simplest way to identify significant genes:
• two-sample t-statistic for each gene,

\[ t_j = \frac{\bar{x}_{2j} - \bar{x}_{1j}}{se_j} \]

where
  ◦ \( \bar{x}_{kj} = \sum_{i \in C_k} x_{ij} / N_k \);
  ◦ \( C_k \) are the indexes of the \( N_k \) observations of group \( k \);
  ◦ \( se_j = \hat{\sigma}_j \sqrt{\frac{1}{N_1} + \frac{1}{N_2}} \);
  ◦ \( \hat{\sigma}_j^2 = \frac{1}{N_1 + N_2 - 2} \Big( \sum_{i \in C_1} (x_{ij} - \bar{x}_{1j})^2 + \sum_{i \in C_2} (x_{ij} - \bar{x}_{2j})^2 \Big) \).
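A minimal numpy sketch of this computation; the array names and shapes (expr as an N × M expression matrix, groups as a 0/1 label vector) are assumptions for illustration, not from the slides.

```python
import numpy as np

def gene_t_statistics(expr, groups):
    """Two-sample t-statistic t_j for every gene (column) of expr,
    using the pooled variance estimate from the slide."""
    x1, x2 = expr[groups == 0], expr[groups == 1]
    n1, n2 = len(x1), len(x2)
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    # pooled variance: (SS_1 + SS_2) / (N_1 + N_2 - 2)
    sigma2 = (((x1 - m1) ** 2).sum(axis=0)
              + ((x2 - m2) ** 2).sum(axis=0)) / (n1 + n2 - 2)
    se = np.sqrt(sigma2) * np.sqrt(1 / n1 + 1 / n2)
    return (m2 - m1) / se
```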

SLIDE 8

Feature Assessment when p ≫ N: multiple testing problem

[figure: histogram of the 12625 t-statistics]

SLIDE 9

Feature Assessment when p ≫ N: multiple testing problem

From the histogram (12625 t-statistics):
• the values range from −4.7 to 5.0;
• assuming t_j ~ N(0, 1), significance at the 5% level when |t_j| ≥ 2;
• in the example, 1189 genes with |t_j| ≥ 2.

However:
• out of 12625 genes, many are significant by chance;
• supposing independence (which is not true here):
  ◦ expected number of falsely significant genes, 12625 · 0.05 = 631.25;
  ◦ standard deviation, √(12625 · 0.05 · (1 − 0.05)) ≈ 24.5 (see the derivation below);
• the actual 1189 is way out of range.
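Both figures are just the moments of a Binomial count of false positives; spelling out the step (not on the slide):

\[ V \sim \mathrm{Bin}(M, \alpha), \qquad \mathbb{E}[V] = 12625 \cdot 0.05 = 631.25, \qquad \mathrm{sd}(V) = \sqrt{12625 \cdot 0.05 \cdot 0.95} \approx 24.5 , \]

so the observed 1189 sits more than 20 standard deviations above the expected count.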

SLIDE 10

Feature Assessment when p ≫ N: multiple testing problem

Without assuming normality, permutation test:
• perform \( K = \binom{58}{14} \) permutations of the sample labels;
• compute the statistic \( t_j^{(k)} \) for each permutation k;
• the p-value for gene j is

\[ p_j = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\big( |t_j^{(k)}| > |t_j| \big) \]

(not all \( \binom{58}{14} \) permutations are needed; a random sample of K = 1000 is used).
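A hedged sketch of the random-sampling version, reusing the gene_t_statistics helper from the earlier sketch; K = 1000 as on the slide, while the seed is an arbitrary assumption.

```python
import numpy as np

def permutation_p_values(expr, groups, K=1000, seed=0):
    """Two-sided permutation p-values p_j from K random relabelings."""
    rng = np.random.default_rng(seed)
    t_obs = np.abs(gene_t_statistics(expr, groups))
    exceed = np.zeros_like(t_obs)
    for _ in range(K):
        shuffled = rng.permutation(groups)      # one random relabeling
        exceed += np.abs(gene_t_statistics(expr, shuffled)) > t_obs
    return exceed / K                           # p_j = (1/K) * sum of exceedances
```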

SLIDE 11

Feature Assessment when p ≫ N: multiple testing problem

For j ∈ {1, . . . , M} test the hypotheses:

  H_{0j}: treatment has no effect on gene j
  H_{1j}: treatment has an effect on gene j

H_{0j} is rejected at level α if p_j < α:
• α is the type-I error rate;
• the probability of falsely rejecting H_{0j} is α.

SLIDE 12

Feature Assessment when p ≫ N: family-wise error rate

Define A_j = {H_{0j} is falsely rejected}, so that Pr(A_j) = α. The family-wise error rate (FWER) is the probability of at least one false rejection,

\[ \Pr(A) = \Pr\Big( \bigcup_{j=1}^{M} A_j \Big) \]

• for M large, Pr(A) ≫ α;
• it depends on the correlation between the tests;
• if the tests are independent, Pr(A) = 1 − (1 − α)^M (computed for this example below);
• for tests with positive dependence, Pr(A) < 1 − (1 − α)^M;
  ◦ positive dependence is typical in genomic studies.
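Plugging this example's numbers into the independence formula (an illustrative computation, keeping in mind the slide's caveat that independence does not hold here):

\[ \Pr(A) = 1 - (1 - 0.05)^{12625} \approx 1 . \]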

SLIDE 13

Feature Assessment when p ≫ N: family-wise error rate

The simplest approach to correct the p-values for the multiplicity of the tests is the Bonferroni method:
• reject H_{0j} if p_j < α/M;
• it makes the individual tests more stringent;
• it controls the FWER,
  ◦ it is easy to show that FWER ≤ α (see the union bound below);
• it is very (too) conservative.

In the example:
• with α = 0.05, α/M = 0.05/12625 ≈ 3.9 × 10⁻⁶;
• no gene has a p-value so small.
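The claim FWER ≤ α is a one-line union bound; written out for completeness (not on the slide):

\[ \mathrm{FWER} = \Pr\Big( \bigcup_{j=1}^{M} A_j \Big) \le \sum_{j=1}^{M} \Pr(A_j) \le M \cdot \frac{\alpha}{M} = \alpha . \]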

SLIDE 14

Feature Assessment when p ≫ N: the false discovery rate

Instead of the FWER, we can control the false discovery rate (FDR):
• the expected proportion of genes incorrectly declared significant among those selected as significant;
• in formula, FDR = E[V/R], where V is the number of false rejections and R the total number of rejections;
• there is a procedure (Benjamini & Hochberg, 1995) that keeps the FDR below a user-defined α (a code sketch follows below).
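A minimal sketch of the Benjamini & Hochberg (1995) step-up procedure referred to above (and applied on slide 17); the function name and interface are assumptions chosen for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.15):
    """Reject the hypotheses with the j* smallest p-values, where j* is the
    largest j such that p_(j) <= alpha * j / M. Returns a boolean mask."""
    p = np.asarray(p_values)
    M = len(p)
    order = np.argsort(p)
    under_line = p[order] <= alpha * np.arange(1, M + 1) / M
    reject = np.zeros(M, dtype=bool)
    if under_line.any():
        j_star = np.nonzero(under_line)[0].max()   # last p_(j) under the line
        reject[order[: j_star + 1]] = True
    return reject
```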

SLIDE 15

Feature Assessment when p ≫ N: the false discovery rate

[figure]

SLIDE 16

Feature Assessment when p ≫ N: the false discovery rate

[figure]

SLIDE 17

Feature Assessment when p ≫ N: the false discovery rate

In the example:
• α = 0.15;
• the last ordered p-value p_(j) under the line α · (j/M) occurs at j = 11;
• the smallest 11 p-values are considered significant;
• in the example, p_(11) = 0.00012;
• the corresponding t-statistic is |t_(11)| = 4.101;
• a gene is declared relevant if its t-statistic is larger than 4.101 in absolute value.

SLIDE 18

Feature Assessment when p ≫ N: the false discovery rate

It can be proved (Benjamini & Hochberg, 1995) that

\[ \mathrm{FDR} \le \frac{M_0}{M}\,\alpha \le \alpha \]

• regardless of the number M_0 of true null hypotheses;
• regardless of the distribution of the p-values under H_1;
• assuming independent test statistics;
• for the case of dependence, see Benjamini & Yekutieli (2001).

SLIDE 19

Stability Selection: introduction

In general:
• the L1-penalty is often used to perform model selection;
• no oracle property (strict conditions are needed to have it);
• issues with selecting the proper amount of regularization.

Meinshausen & Bühlmann (2010) suggested a procedure that:
• is based on subsampling (it could work with bootstrapping as well);
• determines the amount of regularization so as to control the FWER;
• provides a new structure estimation or variable selection scheme;
• is presented here with the L1-penalty, but works in general.

SLIDE 20

Stability Selection: introduction

Setting:
• β is a p-dimensional vector of coefficients;
• S = {j : β_j ≠ 0}, with |S| < p;
• S^C = {j : β_j = 0};
• Z^{(i)} = (X^{(i)}, Y^{(i)}), i = 1, . . . , N, are the i.i.d. data,
  ◦ univariate response Y;
  ◦ N × p covariate matrix X;
• consider a linear model Y = Xβ + ε, with ε = (ε_1, . . . , ε_N) having i.i.d. components (a toy simulated instance follows below).
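To make the notation concrete, a minimal simulated instance of this setting; all the numbers (N, p, the size of S, the noise level) are illustrative assumptions, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, s = 100, 200, 5                    # s = |S|, the number of truly active variables
X = rng.standard_normal((N, p))
X /= np.linalg.norm(X, axis=0)           # scale columns so ||X_j||_2^2 = 1 (next slide)
beta = np.zeros(p)
beta[:s] = 2.0                           # S = {1, ..., s}; S^C is the rest
y = X @ beta + 0.1 * rng.standard_normal(N)   # Y = X beta + eps, i.i.d. errors
```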

SLIDE 21

Stability Selection: introduction

The goal is to infer S from the data. We saw that the lasso,

\[ \hat{\beta}^{\lambda} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \Big( \|Y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Big), \]

provides an estimate of S,

\[ \hat{S}^{\lambda} = \{ j : \hat{\beta}_j^{\lambda} \neq 0 \} \subseteq \{1, \ldots, p\}. \]

Remember:
• λ ∈ R⁺ is the regularization factor;
• the covariates are standardized, \( \|X_j\|_2^2 = \sum_{i=1}^{N} \big(x_j^{(i)}\big)^2 = 1 \).

SLIDE 22

Stability Selection: selection probability

Stability selection is built on the concept of selection probability.

Definition 1: Let I be a random subsample of {1, . . . , N} of size ⌊N/2⌋, drawn without replacement. The selection probability of a variable X_j is the probability that it belongs to Ŝ^λ(I),

\[ \hat{\Pi}_j^{\lambda} = \Pr{}^{*}\big[\, j \in \hat{S}^{\lambda}(I) \,\big]. \]

Note:
• Pr* is with respect to both the random subsampling and other sources of randomness, if Ŝ^λ is not deterministic;
• the size ⌊N/2⌋ is chosen for computational efficiency.
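A hedged sketch of how the selection probabilities can be estimated in practice, using scikit-learn's Lasso as the selector; the number of subsamples B is an arbitrary assumption, and sklearn's alpha differs from the slide's λ by a scaling constant.

```python
import numpy as np
from sklearn.linear_model import Lasso

def selection_probabilities(X, y, lambdas, B=100, seed=0):
    """Estimate Pi^lambda_j over B random subsamples of size floor(N/2).
    Returns a (len(lambdas), p) matrix of selection frequencies."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    counts = np.zeros((len(lambdas), p))
    for _ in range(B):
        I = rng.choice(N, size=N // 2, replace=False)   # subsample without replacement
        for l, lam in enumerate(lambdas):
            coef = Lasso(alpha=lam, max_iter=10_000).fit(X[I], y[I]).coef_
            counts[l] += coef != 0                      # was j selected on this subsample?
    return counts / B
```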

SLIDE 23

Stability Selection: stability path

Once we have the selection probabilities, we can define the stability path as the evolution of Π̂_j^λ as λ ∈ Λ varies:
• similar to the learning path plot of the lasso;
• it shows the selection probabilities of all variables;
• it is very useful for improved variable selection, especially in high-dimensional cases.

SLIDE 24

Stability Selection: stability path

[figure]
• left: lasso learning path;
• center: stability path of the lasso;
• right: stability path of the randomized lasso.

SLIDE 25

Stability Selection: stability path

Normally we would choose a specific λ:
• this selects a single element of the set {Ŝ^λ, λ ∈ Λ};
• S might not be a member of this set;
• even if it is, it is hard to find the right λ in high dimensions.

With stability selection:
• we do not simply select one model in {Ŝ^λ, λ ∈ Λ};
• the data are perturbed (e.g. by subsampling) many times;
• we choose all variables that occur in a large fraction of the resulting selection sets.

SLIDE 26

Stability Selection: stability selection

Definition 2: For a cut-off π_thr, with 0 < π_thr < 1, and a set of regularization parameters Λ, the set of stable variables is defined as

\[ \hat{S}^{\mathrm{stable}} = \Big\{ j : \max_{\lambda \in \Lambda} \hat{\Pi}_j^{\lambda} \ge \pi_{\mathrm{thr}} \Big\}. \]

In this way:
• we keep the variables with a high selection probability;
• we disregard those with low selection probabilities;
• the exact cut-off π_thr is a tuning parameter;
• the results vary surprisingly little for sensible choices of π_thr;
• the results do not strongly depend on the choice of λ or Λ.
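Continuing the earlier sketch, Definition 2 is a one-liner on the matrix of estimated selection probabilities; π_thr = 0.6 is just an assumed sensible value.

```python
import numpy as np

def stable_set(pi_hat, pi_thr=0.6):
    """pi_hat: (len(lambdas), p) matrix from selection_probabilities.
    Keep the variables whose selection probability exceeds the cut-off
    for at least one lambda in Lambda."""
    return np.nonzero(pi_hat.max(axis=0) >= pi_thr)[0]
```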

SLIDE 27

Stability Selection: choice of regularization

Let:
• Ŝ^Λ = ∪_{λ∈Λ} Ŝ^λ be the set of variables selected for at least one λ ∈ Λ;
• q_Λ = E[|Ŝ^Λ(I)|] be the average number of selected variables;
• V = |S^C ∩ Ŝ^stable| be the number of falsely selected variables with stability selection.

Theorem (Meinshausen & Bühlmann, 2010): Assuming that the distribution of {1_{j ∈ Ŝ^λ}} is exchangeable for all λ ∈ Λ, and that the procedure is not worse than random guessing, then

\[ \mathbb{E}[V] \le \frac{1}{2\pi_{\mathrm{thr}} - 1} \cdot \frac{q_{\Lambda}^{2}}{p} . \]

SLIDE 28

Stability Selection: choice of regularization

Therefore:
• π_thr is a tuning parameter whose influence is very small;
  ◦ sensible values are in (0.6, 0.9);
• once π_thr is decided, Λ is determined by the desired error control;
• specifically, for π_thr = 0.9,
  ◦ Λ : q_Λ = √(0.8p) → E[V] ≤ 1 (a worked check follows below);
  ◦ Λ : q_Λ = √(0.8αp) → Pr[V > 0] ≤ α;
• i.e., we need to find the Λ that gives a specific q_Λ,
  ◦ q is given by the number of variables which enter the model;
  ◦ for the lasso, find λ_min such that |∪_{λ_max ≥ λ ≥ λ_min} Ŝ^λ| ≤ q.
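A worked check (not on the slide) that the first recipe meets the theorem's bound: plugging π_thr = 0.9 and q_Λ = √(0.8p) into E[V] ≤ q_Λ² / ((2π_thr − 1) p) gives

\[ \mathbb{E}[V] \le \frac{1}{2 \cdot 0.9 - 1} \cdot \frac{0.8\,p}{p} = 1 ; \]

likewise q_Λ = √(0.8αp) gives E[V] ≤ α, and Pr[V > 0] ≤ E[V] ≤ α then follows from Markov's inequality.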

SLIDE 29

Stability Selection: choice of regularization

Final remarks:
• without stability selection, the choice of λ depends on the unknown noise level of the observations;
• the advantages of stability selection are:
  ◦ exact error control is possible;
  ◦ the method works fine even though the noise level is unknown;
• the real advantage is when p ≥ N (where the noise level is hard to estimate);
• consistency can be proved (see Meinshausen & Bühlmann, 2010, for the proof for the randomized lasso);
• the exchangeability assumption in the theorem is only needed for the proof.

SLIDE 30

References I

Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological) 57, 289–300.

Benjamini, Y. & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics 29, 1165–1188.

Meinshausen, N. & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72, 417–473.

Rieger, K. E., Hong, W.-J., Tusher, V. G., Tang, J., Tibshirani, R. & Chu, G. (2004). Toxicity from radiation therapy associated with abnormal transcriptional responses to DNA damage. Proceedings of the National Academy of Sciences 101, 6635–6640.
