1. STK-IN4300 Statistical Learning Methods in Data Science. Riccardo De Bin, debin@math.uio.no. Lecture 13.

2. Outline of the lecture
   • Feature Assessment when p ≫ N
     ◦ Feature assessment and the multiple testing problem
     ◦ The false discovery rate
   • Stability Selection
     ◦ Introduction
     ◦ Selection probability
     ◦ Stability path
     ◦ Choice of regularization

3. Feature Assessment when p ≫ N: multiple testing problem
   In the previous lecture we:
   • talked about the p ≫ N framework;
   • focused on the construction of prediction models.
   A more basic goal:
   • assess the significance of the M variables;
     ◦ in this lecture M is the number of variables (as in the book);
   • e.g., identify the genes most related to cancer.

4. Feature Assessment when p ≫ N: multiple testing problem
   Assessing the significance of a variable can be done:
   • as a by-product of a multivariate model:
     ◦ selection by a procedure with the variable selection property;
     ◦ absolute value of a regression coefficient in the lasso;
     ◦ variable importance plots (boosting, random forests, ...);
   • evaluating the variables one-by-one:
     ◦ univariate tests
       ⇒ multiple hypothesis testing.

5. Feature Assessment when p ≫ N: multiple testing problem
   Consider the data from Rieger et al. (2004):
   • study on the sensitivity of cancer patients to ionizing radiation treatment;
   • oligonucleotide microarray data ($M = 12625$);
   • $N = 58$ patients:
     ◦ 44 patients with a normal reaction;
     ◦ 14 patients who had a severe reaction.

6. Feature Assessment when p ≫ N: multiple testing problem
   [Figure slide: no recoverable text content.]

7. Feature Assessment when p ≫ N: multiple testing problem
   The simplest way to identify significant genes: a two-sample t-statistic for each gene,
   $$ t_j = \frac{\bar{x}_{2j} - \bar{x}_{1j}}{se_j}, $$
   where
   • $\bar{x}_{kj} = \sum_{i \in C_k} x_{ij} / N_k$;
   • $C_k$ is the set of indices of the $N_k$ observations in group $k$;
   • $se_j = \hat{\sigma}_j \sqrt{\frac{1}{N_1} + \frac{1}{N_2}}$;
   • $\hat{\sigma}_j^2 = \frac{1}{N_1 + N_2 - 2} \left( \sum_{i \in C_1} (x_{ij} - \bar{x}_{1j})^2 + \sum_{i \in C_2} (x_{ij} - \bar{x}_{2j})^2 \right)$.
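As a concrete illustration, here is a minimal NumPy sketch of this per-gene statistic; the expression matrix `X`, the 0/1 label vector `groups`, and the function name are illustrative assumptions, not part of the lecture material:

```python
import numpy as np

def two_sample_t(X, groups):
    """Two-sample t-statistic for each column (gene) of X, using the
    pooled standard deviation from the slide.
    X: (N, M) expression matrix; groups: length-N array of 0/1 labels."""
    X1, X2 = X[groups == 0], X[groups == 1]
    n1, n2 = len(X1), len(X2)
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled variance: within-group sums of squares over (N1 + N2 - 2)
    ss = ((X1 - xbar1) ** 2).sum(axis=0) + ((X2 - xbar2) ** 2).sum(axis=0)
    sigma = np.sqrt(ss / (n1 + n2 - 2))
    se = sigma * np.sqrt(1.0 / n1 + 1.0 / n2)
    return (xbar2 - xbar1) / se
```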

8. Feature Assessment when p ≫ N: multiple testing problem
   [Figure slide: histogram of the 12625 t-statistics, discussed on the next slide.]

9. Feature Assessment when p ≫ N: multiple testing problem
   From the histogram of the 12625 t-statistics:
   • the values range from $-4.7$ to $5.0$;
   • assuming $t_j \sim N(0, 1)$, significance at the 5% level when $|t_j| \ge 2$;
   • in the example, 1189 genes have $|t_j| \ge 2$.
   However:
   • out of 12625 genes, many are significant by chance;
   • supposing independence (which does not hold here):
     ◦ expected number of falsely significant genes: $12625 \cdot 0.05 = 631.25$;
     ◦ standard deviation: $\sqrt{12625 \cdot 0.05 \cdot (1 - 0.05)} \approx 24.5$;
   • the actual 1189 is way out of this range.

10. Feature Assessment when p ≫ N: multiple testing problem
    Without assuming normality, use a permutation test:
    • perform $K = \binom{58}{14}$ permutations of the sample labels;
    • compute the statistic $t_j^{[k]}$ for each permutation $k$;
    • the p-value for gene $j$ is
      $$ p_j = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\left( |t_j^{[k]}| > |t_j| \right). $$
    (Not all $\binom{58}{14}$ permutations are needed; a random sample of $K = 1000$ is used.)
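A sketch of the Monte Carlo version ($K$ random label permutations rather than all $\binom{58}{14}$), reusing the hypothetical `two_sample_t` from the sketch above:

```python
import numpy as np

def permutation_pvalues(X, groups, K=1000, seed=0):
    """Permutation p-values: reshuffle the group labels K times and
    count how often the permuted |t| exceeds the observed |t|."""
    rng = np.random.default_rng(seed)
    t_obs = two_sample_t(X, groups)
    exceed = np.zeros(X.shape[1])
    for k in range(K):
        t_perm = two_sample_t(X, rng.permutation(groups))
        exceed += np.abs(t_perm) > np.abs(t_obs)
    return exceed / K  # p_j = (1/K) * #{k : |t_j^[k]| > |t_j|}
```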

11. Feature Assessment when p ≫ N: multiple testing problem
    For $j = 1, \ldots, M$, test the hypotheses:
      $H_{0j}$: treatment has no effect on gene $j$;
      $H_{1j}$: treatment has an effect on gene $j$.
    $H_{0j}$ is rejected at level $\alpha$ if $p_j < \alpha$:
    • $\alpha$ is the type-I error rate;
    • the probability of falsely rejecting $H_{0j}$ is expected to be $\alpha$.

12. Feature Assessment when p ≫ N: family-wise error rate
    Define $A_j = \{ H_{0j} \text{ is falsely rejected} \}$, so that $\Pr(A_j) = \alpha$.
    The family-wise error rate (FWER) is the probability of at least one false rejection,
    $$ \Pr(A) = \Pr\left( \bigcup_{j=1}^{M} A_j \right). $$
    • For $M$ large, $\Pr(A) \gg \alpha$;
    • it depends on the correlation between the tests;
    • if the tests are independent, $\Pr(A) = 1 - (1 - \alpha)^M$;
    • for tests with positive dependence, $\Pr(A) < 1 - (1 - \alpha)^M$;
      ◦ positive dependence is typical in genomic studies.
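To get a feel for the numbers: under independence with $\alpha = 0.05$ and $M = 12625$ tests, $\Pr(A) = 1 - 0.95^{12625} \approx 1$, so at least one false rejection is practically certain.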

13. Feature Assessment when p ≫ N: family-wise error rate
    The simplest approach to correct the p-values for the multiplicity of the tests is the Bonferroni method:
    • reject $H_{0j}$ if $p_j < \alpha / M$;
    • it makes the individual tests more stringent;
    • it controls the FWER:
      ◦ it is easy to show that FWER $\le \alpha$;
    • it is very (too) conservative.
    In the example:
    • with $\alpha = 0.05$, $\alpha / M = 0.05 / 12625 \approx 3.9 \times 10^{-6}$;
    • no gene has a p-value that small.
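The correction itself is a one-liner; a small sketch (function name assumed for illustration):

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Reject H_0j when p_j < alpha / M; controls the FWER at level alpha."""
    pvals = np.asarray(pvals)
    return pvals < alpha / len(pvals)
```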

14. Feature Assessment when p ≫ N: the false discovery rate
    Instead of the FWER, we can control the false discovery rate (FDR):
    • the expected proportion of genes incorrectly called significant among those selected as significant;
    • in formula, FDR $= E[V / R]$, where $V$ is the number of false rejections and $R$ the total number of rejections;
    • there is a procedure that keeps the FDR smaller than a user-defined $\alpha$.

15. Feature Assessment when p ≫ N: the false discovery rate
    [Figure slide: no recoverable text content.]

16. Feature Assessment when p ≫ N: the false discovery rate
    [Figure slide; the next slide refers to the plot of the ordered p-values against the line $\alpha \cdot (j / M)$.]

17. Feature Assessment when p ≫ N: the false discovery rate
    In the example:
    • $\alpha = 0.15$;
    • the last ordered p-value $p_{(j)}$ under the line $\alpha \cdot (j / M)$ occurs at $j = 11$;
    • the 11 smallest p-values are considered significant;
    • in the example, $p_{(11)} = 0.00012$;
    • the corresponding t-statistic is $|t_{(11)}| = 4.101$;
    • a gene is declared relevant if its t-statistic is larger than 4.101 in absolute value.
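The step-up procedure used here can be written in a few lines; a minimal sketch (function name and return convention are my own):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.15):
    """Benjamini-Hochberg step-up: find the largest j with
    p_(j) <= alpha * j / M and reject the hypotheses with the
    j smallest p-values. Returns a boolean mask in the original order."""
    pvals = np.asarray(pvals)
    M = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, M + 1) / M
    reject = np.zeros(M, dtype=bool)
    if below.any():
        j = np.nonzero(below)[0].max()  # last p-value under the line
        reject[order[: j + 1]] = True
    return reject
```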

18. Feature Assessment when p ≫ N: the false discovery rate
    It can be proved (Benjamini & Hochberg, 1995) that
    $$ \text{FDR} \le \frac{M_0}{M} \alpha \le \alpha, $$
    where $M_0$ is the number of true null hypotheses:
    • regardless of the number of true null hypotheses;
    • regardless of the distribution of the p-values under $H_1$;
    • assuming independent test statistics;
    • for the case of dependence, see Benjamini & Yekutieli (2001).

19. Stability Selection: introduction
    In general:
    • the $L_1$-penalty is often used to perform model selection;
    • it has no oracle property (strict conditions are needed to obtain it);
    • there are issues with selecting the proper amount of regularization.
    Meinshausen & Bühlmann (2010) suggested a procedure that:
    • is based on subsampling (it could work with bootstrapping as well);
    • determines the amount of regularization needed to control the FWER;
    • provides a new structure estimation or variable selection scheme;
    • is presented here with the $L_1$-penalty, but works in general.

20. Stability Selection: introduction
    Setting:
    • $\beta$ is a $p$-dimensional vector of coefficients;
    • $S = \{ j : \beta_j \ne 0 \}$, with $|S| < p$;
    • $S^C = \{ j : \beta_j = 0 \}$;
    • $Z^{(i)} = (X^{(i)}, Y^{(i)})$, $i = 1, \ldots, N$, are the i.i.d. data:
      ◦ univariate response $Y$;
      ◦ $N \times p$ covariate matrix $X$;
    • consider a linear model $Y = X\beta + \epsilon$, with $\epsilon = (\epsilon_1, \ldots, \epsilon_N)$ having i.i.d. components.

21. Stability Selection: introduction
    The goal is to infer $S$ from the data. We saw that the lasso,
    $$ \hat{\beta}^{\lambda} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \left( \| Y - X\beta \|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right), $$
    provides an estimate of $S$,
    $$ \hat{S}^{\lambda} = \{ j : \hat{\beta}_j^{\lambda} \ne 0 \} \subseteq \{ 1, \ldots, p \}. $$
    Remember:
    • $\lambda \in \mathbb{R}^{+}$ is the regularization factor;
    • the covariates are standardized, $\| X_j \|_2^2 = \sum_{i=1}^{N} (x_j^{(i)})^2 = 1$.
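For concreteness, the estimated support $\hat{S}^{\lambda}$ can be extracted from any lasso solver; a sketch using scikit-learn (note that sklearn's `Lasso` scales the squared-error loss by $1/(2N)$, so its `alpha` is not numerically identical to the $\lambda$ above):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_support(X, y, lam):
    """Fit the lasso at penalty lam and return the indices of the
    estimated support S^lambda = {j : beta_j != 0}."""
    fit = Lasso(alpha=lam).fit(X, y)
    return np.flatnonzero(fit.coef_)
```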

22. Stability Selection: selection probability
    Stability selection is built on the concept of selection probability.
    Definition 1: Let $I$ be a random subsample of $\{ 1, \ldots, N \}$ of size $\lfloor N/2 \rfloor$, drawn without replacement. The selection probability of a variable $X_j$ is the probability of $j$ being in $\hat{S}^{\lambda}(I)$,
    $$ \hat{\Pi}_j^{\lambda} = \Pr^{*}\left[ j \in \hat{S}^{\lambda}(I) \right]. $$
    Note:
    • $\Pr^{*}$ is with respect to both the random subsampling and any other source of randomness if $\hat{S}^{\lambda}$ is not deterministic;
    • the size $\lfloor N/2 \rfloor$ is chosen for computational efficiency.
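A Monte Carlo sketch of $\hat{\Pi}_j^{\lambda}$, reusing the hypothetical `lasso_support` above: draw B subsamples of size $\lfloor N/2 \rfloor$ without replacement and record how often each variable enters the selected set (B = 100 is an arbitrary choice, not from the slides):

```python
import numpy as np

def selection_probabilities(X, y, lam, B=100, seed=0):
    """Estimate Pi_j^lambda by subsampling: the fraction of B
    half-samples on which the lasso selects variable j."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    counts = np.zeros(p)
    for b in range(B):
        I = rng.choice(N, size=N // 2, replace=False)  # subsample without replacement
        counts[lasso_support(X[I], y[I], lam)] += 1
    return counts / B
```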
