statistical methods in bioinformatics



  1. University of Copenhagen, March 31st, 2020. Faculty of Health Sciences. Statistical methods in bioinformatics: Brief introduction, statistical models, dimension reductions. Claus Thorn Ekstrøm, Biostatistics, University of Copenhagen. E-mail: ekstrom@sund.ku.dk. Slide 1/56

  2. Today's programme: Introduction to statistical methods for high-dimensional data, linear models, dimension reduction and regularization methods.
     1. Brief overview of molecular data
     2. Big-p small-n problems
     3. Multiple testing techniques (inference correction, false discovery rates, q-values)
     4. The correlation vs. causation and prediction vs. hypothesis differences
     5. Generalized linear models refresher
     6. Dimension reduction I: Penalized regression
     7. Dimension reduction II: Partial least squares, principal component regression

  3. “Classical” statistical analysis (diagram relating gene, gender and age to obesity). Could be analyzed with a multiple regression model: $\text{obesity}_i = \alpha + \beta_1 \cdot \text{gene}_i + \beta_2 \cdot \text{gender}_i + \beta_3 \cdot \text{age}_i + \varepsilon_i$

  4. The omics revolution

  5. The “joy” of *omics for an analyst: CACAC GCGTG AAGAT CAACC

  6. Examples: Sequence data. CACACGCGTGAAGATCAACCGAAA TCACTCATGCGGGCTTGACCATGT CGCCTACATGTCCTTCACACGCGT GAAGATCAACCGAAATCACTCATG CGGGCTTGACCATGTCGCCTACAT GTCCTTCACACGCGTGAAGATCAA CCGAAATCACTCATGCGGGCTTGA CCATGTCGCCTACATGTCCTTCAC ACGCGTGAAGATCAACCGAAATCA CTCATGCGGGCTTGACCATGTCGC CTACATGTCC

  7. Examples: Sequence data (continued). For the sequence above, evaluate $P(Y_i = \text{“gene”} \mid Y_1, \ldots, Y_{i-1})$. Do that for each $i$ and identify the nucleotides that have a high probability of being inside a gene.

  8. Examples: Proteomics

  9. Examples: Gene expression data

  10. Examples: Metabolite data

  11. Examples: Sequence data, metabolite data

  12. A bit of history:
      - 2000: one SNP
      - 2003: 10 SNPs
      - 2006: 500 SNPs
      - 2009: 22k SNPs
      - 2012: 2.5 million SNPs
      - 2013: 25 million SNPs (∼45 million imputed)

  13. Pattern recognition

  14. Prediction

  15. The $1000 genome

  16. Data sizes (figure: a response $y$ with a few predictors $x_1, x_2, x_3$)

  17. Data sizes (continued; figure: the same response $y$, now with a data matrix $X$ of predictors $x_1, x_2, \ldots, x_{99999}$). We need dimension reduction constantly:
      - Feature selection
      - Inference?

  18. The problem with multiple comparisons. $P$ predictors: let's do $P$ standard analyses! $P(\text{at least 1 false positive}) = 1 - (1 - \alpha)^P$
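To get a feel for how quickly this probability grows, here is a minimal Python sketch of the formula. It assumes the $P$ tests are independent and that every null hypothesis is true; the function name `fwer_independent` and the chosen values of $P$ are illustrative, not taken from the slides.

```python
def fwer_independent(alpha: float, n_tests: int) -> float:
    """P(at least one false positive) for n_tests independent tests,
    each run at level alpha, when all null hypotheses are true."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # Illustrative numbers of predictors (assumed values).
    for n_tests in (1, 10, 100, 1000):
        print(n_tests, round(fwer_independent(0.05, n_tests), 4))
```

Already with 10 tests at $\alpha = 0.05$ the chance of at least one false positive is about 40%, and with 100 tests it is above 99%.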

  19. Multiple comparison problems

  20. Multiple comparison problems. Possible errors committed when testing a single null hypothesis, $H_0$:

      | | $H_0$ is true | $H_0$ is false |
      |---|---|---|
      | Rejected | $\alpha$ | $1 - \beta$ |
      | Not rejected | $1 - \alpha$ | $\beta$ |
      | Total | 1 | 1 |

      $\alpha$ is the significance level, $1 - \beta$ is the power.

  21. Multiple comparison problems. Number of errors committed when testing $m$ null hypotheses:

      | | $H_0$ is true | $H_0$ is false | Total |
      |---|---|---|---|
      | Rejected | $V$ | $S$ | $R$ |
      | Not rejected | $U$ | $T$ | $m - R$ |
      | Total | $m_0$ | $m - m_0$ | $m$ |

      Here $R$, the number of rejected hypotheses/discoveries, can be seen as a random variable. $V$, $S$, $U$ and $T$ are unobserved.

  22. Multiple comparison problems (continued). With $V$ the number of true null hypotheses that are rejected among the $m$ tests, the family-wise error rate (FWER) is the probability of making at least one type I error (false positive): $\text{FWER} = P(V > 0) = 1 - P(V = 0)$

  23. Multiple comparison problems. The family-wise error rate (FWER) is the probability of making at least one type I error (false positive). For $m$ tests we have $\text{FWER} = 1 - P(V = 0) = 1 - (1 - \alpha)^m \le m\alpha$, where the second equality only holds under independence. (However, the inequality holds regardless, due to Boole's inequality.)
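The two claims on this slide can be spelled out in a short derivation. This is a sketch, assuming each true null hypothesis is rejected with probability at most $\alpha$; the index set $I_0$ and the events $A_i$ are notation introduced here for illustration and do not appear on the slides.

```latex
% I_0: indices of the m_0 true null hypotheses (assumed notation).
% A_i: the event that test i rejects its null hypothesis.
\begin{align*}
\text{FWER} = P(V > 0)
  &= P\Bigl(\bigcup_{i \in I_0} A_i\Bigr) \\
  &\le \sum_{i \in I_0} P(A_i) && \text{(Boole's inequality)} \\
  &\le m_0\,\alpha \;\le\; m\,\alpha .
\end{align*}
% Under independence, with all m hypotheses true (m_0 = m) and P(A_i) = \alpha:
\[
  P(V = 0) = \prod_{i=1}^{m} \bigl(1 - P(A_i)\bigr) = (1 - \alpha)^m .
\]
```

The first chain needs no assumption about dependence between the tests, which is why the bound $m\alpha$ is always available.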

  24. Bonferroni correction. The most conservative method, but it is free of dependence and distributional assumptions. $\text{FWER} = 1 - P(V = 0) = 1 - (1 - \alpha)^m \le m\alpha$. So instead set the significance level of each individual test at $\alpha / m$. In other words, we reject the $i$th hypothesis if $m \cdot p_i \le \alpha \Leftrightarrow p_i \le \frac{\alpha}{m}$

  25. Bonferroni correction (continued). Šidák (assume independence): we want overall significance level $\alpha^*$, so solve $1 - (1 - \alpha)^m = \alpha^* \Leftrightarrow \alpha = 1 - \sqrt[m]{1 - \alpha^*}$. Slightly less conservative than Bonferroni (but not much).
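The two per-test thresholds can be compared numerically. The following is a minimal Python sketch; the function names and the example p-values are hypothetical, chosen only so that one p-value falls between the two thresholds.

```python
def bonferroni_threshold(alpha_star: float, m: int) -> float:
    """Per-test level alpha*/m; needs no independence assumption."""
    return alpha_star / m


def sidak_threshold(alpha_star: float, m: int) -> float:
    """Per-test level 1 - (1 - alpha*)^(1/m); assumes independent tests."""
    return 1.0 - (1.0 - alpha_star) ** (1.0 / m)


def reject(p_values, threshold):
    """Flag every hypothesis whose p-value is at or below the threshold."""
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    p_values = [0.001, 0.0126, 0.04, 0.5]  # hypothetical p-values
    m = len(p_values)
    print("Bonferroni:", bonferroni_threshold(0.05, m), reject(p_values, bonferroni_threshold(0.05, m)))
    print("Sidak:     ", sidak_threshold(0.05, m), reject(p_values, sidak_threshold(0.05, m)))
```

With four tests the Šidák threshold is only marginally larger than the Bonferroni threshold of 0.0125, which illustrates the "but not much" remark.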

  26. Holm's correction. The Holm-Bonferroni correction:
      1. Compute and order the individual p-values: $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$.
      2. Find $\hat{k} = \min\{k : p_{(k)} > \frac{\alpha}{m + 1 - k}\}$.
      3. If $\hat{k}$ exists then reject the hypotheses corresponding to $p_{(1)}, \ldots, p_{(\hat{k} - 1)}$ (if no such $k$ exists, every hypothesis is rejected).

  27. Holm's correction (continued). Controls the FWER: Assume the (ordered) hypothesis $k$ is the first wrongly rejected true hypothesis. Then $k \le m - (m_0 - 1)$, because at most $m - m_0$ false hypotheses can come before it. Hypothesis $k$ was rejected, so $p_{(k)} \le \frac{\alpha}{m + 1 - k} \le \frac{\alpha}{m + 1 - (m - (m_0 - 1))} \le \frac{\alpha}{m_0}$. Since there are $m_0$ true hypotheses, the probability that one of them is significant at level $\alpha / m_0$ is at most $\alpha$ (Bonferroni argument), so the FWER is controlled.
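The step-down procedure is short enough to write out. Below is a minimal Python sketch of Holm's correction as described on these slides; the function name `holm_reject` and the example p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure.

    Returns a list of booleans, in the original order of p_values,
    where True means the corresponding hypothesis is rejected.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for k, idx in enumerate(order, start=1):
        # k-hat is the first rank k with p_(k) > alpha / (m + 1 - k).
        if p_values[idx] > alpha / (m + 1 - k):
            break  # stop: this and all larger p-values are not rejected
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    p_values = [0.01, 0.2, 0.015, 0.005]  # hypothetical p-values
    print(holm_reject(p_values))
```

For these four p-values Holm rejects three hypotheses, whereas plain Bonferroni (threshold 0.05/4 = 0.0125) would reject only the two smallest, so Holm is never less powerful than Bonferroni at the same FWER level.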

  28. Resampling methods: computer-intensive methods. Permutation methods: simulate data under $H_0$, compute the test statistic and compare it to the test statistic from the original data. Bootstrap: “simulate data under $H_a$”.
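As an illustration of the permutation idea, here is a minimal Python sketch for one specific setting that the slides do not spell out: comparing two groups with the absolute difference in means as test statistic. Shuffling the group labels simulates data under $H_0$ (no group difference). The function name, the data, the number of permutations and the add-one correction in the p-value are all assumptions made for this example.

```python
import random
from statistics import fmean


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sample permutation test on the absolute difference in means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(fmean(x) - fmean(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the pooled values reassigns group labels at random,
        # which mimics data generated under H0.
        rng.shuffle(pooled)
        stat = abs(fmean(pooled[:n_x]) - fmean(pooled[n_x:]))
        if stat >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0]  # hypothetical data
    group_b = [6.0, 6.2, 5.9, 6.1, 6.3, 5.8, 6.0, 6.1]  # hypothetical data
    print(permutation_p_value(group_a, group_b))
```

The p-value is the fraction of shuffled data sets whose statistic is at least as extreme as the one observed in the original data.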
