Statistical methods in bioinformatics


SLIDE 1

University of Copenhagen, March 31st, 2020

Faculty of Health Sciences

Statistical methods in bioinformatics

Brief introduction, statistical models, dimension reductions. Claus Thorn Ekstrøm

Biostatistics, University of Copenhagen E-mail: ekstrom@sund.ku.dk

Slide 1/56

SLIDE 2

Today’s programme

Introduction to statistical methods for high-dimensional data, linear models, dimension reduction and regularization methods.

1 Brief overview of molecular data
2 Big-p, small-n problems
3 Multiple testing techniques (inference correction, false discovery rates, q-values)
4 The correlation vs. causation and prediction vs. hypothesis differences
5 Generalized linear models refresher
6 Dimension reduction I: Penalized regression
7 Dimension reduction II: Partial least squares, principal component regression

SLIDE 3

“Classical” statistics analysis

[Diagram: obesity as outcome, with gene, gender and age as explanatory variables]

Could be analyzed with a multiple regression model:

obesityi = α + β1·genei + β2·genderi + β3·agei + εi

SLIDE 4

The omics revolution

SLIDE 5

The “joy” of *omics for an analyst

CACAC GCGTG AAGAT CAACC

SLIDE 6

Examples

Sequence data CACACGCGTGAAGATCAACCGAAA TCACTCATGCGGGCTTGACCATGT CGCCTACATGTCCTTCACACGCGT GAAGATCAACCGAAATCACTCATG CGGGCTTGACCATGTCGCCTACAT GTCCTTCACACGCGTGAAGATCAA CCGAAATCACTCATGCGGGCTTGA CCATGTCGCCTACATGTCCTTCAC ACGCGTGAAGATCAACCGAAATCA CTCATGCGGGCTTGACCATGTCGC CTACATGTCC

SLIDE 7

Examples

Sequence data

CACACGCGTGAAGATCAACCGAAA TCACTCATGCGGGCTTGACCATGT CGCCTACATGTCCTTCACACGCGT GAAGATCAACCGAAATCACTCATG CGGGCTTGACCATGTCGCCTACAT GTCCTTCACACGCGTGAAGATCAA CCGAAATCACTCATGCGGGCTTGA CCATGTCGCCTACATGTCCTTCAC ACGCGTGAAGATCAACCGAAATCA CTCATGCGGGCTTGACCATGTCGC CTACATGTCC

Evaluate P(Yi = “gene” | Y1, ..., Yi−1). Do that for each i and identify the nucleotides that have a high probability of being inside a gene.

SLIDE 8

Examples

Proteomics

SLIDE 9

Examples

Gene expression data

SLIDE 10

Examples

Metabolite data

SLIDE 11

Examples

Sequence data — metabolite data

SLIDE 12

A bit of history

  • 2000: one SNP
  • 2003: 10 SNPs
  • 2006: 500 SNPs
  • 2009: 22k SNPs
  • 2012: 2.5 million SNPs
  • 2013: 25 million SNPs, ∼45 million imputed

SLIDE 13

Pattern recognition

SLIDE 14

Prediction

SLIDE 15

The $1000 genome

SLIDE 16

Data sizes

[Schematic: outcome y with a few predictors x1, x2, x3]

SLIDE 17

Data sizes

[Schematic: outcome y with predictors x1, x2, ..., x99999; far more predictors than observations]

We need dimension reduction constantly:

  • Feature selection
  • Inference?

SLIDE 18

The problem with multiple comparisons

P predictors - let’s do P standard analyses!

P(at least 1 false positive) = 1 − (1 − α)^P
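The growth of the family-wise error rate with P is easy to evaluate; a minimal Python sketch (illustrative only; the examples in this deck use R):

```python
# FWER for P independent tests, each run at level alpha:
# P(at least one false positive) = 1 - (1 - alpha)^P.
def fwer(alpha, P):
    return 1 - (1 - alpha) ** P

print(fwer(0.05, 1))    # a single test: 0.05
print(fwer(0.05, 10))   # around 0.40
print(fwer(0.05, 100))  # essentially certain to see a false positive
```

Already at a few dozen tests, a false positive is more likely than not.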

SLIDE 19

Multiple comparison problems

SLIDE 20

Multiple comparison problems

Possible errors committed when testing a single null hypothesis, H0:

               H0 is true    H0 is false
Rejected       α             1 − β
Not rejected   1 − α         β
Total          1             1

α is the significance level; 1 − β is the power.

SLIDE 21

Multiple comparison problems

Number of errors committed when testing m null hypotheses:

               H0 is true    H0 is false    Total
Rejected       V             S              R
Not rejected   U             T              m − R
Total          m0            m − m0         m

Here R, the number of rejected hypotheses (discoveries), can be seen as a random variable. V, S, U and T are unobserved.

SLIDE 22

Multiple comparison problems

Number of errors committed when testing m null hypotheses:

               H0 is true    H0 is false    Total
Rejected       V             S              R
Not rejected   U             T              m − R
Total          m0            m − m0         m

Here R, the number of rejected hypotheses (discoveries), can be seen as a random variable. V, S, U and T are unobserved. The family-wise error rate (FWER) is the probability of making at least one type I error (false positive):

FWER = P(V > 0) = 1 − P(V = 0)

SLIDE 23

Multiple comparison problems

The family-wise error rate (FWER) is the probability of making at least one type I error (false positive). For m tests we have

FWER = 1 − P(V = 0) = 1 − (1 − α)^m ≤ mα

where the second equality only holds under independence. (The inequality, however, holds in general by Boole’s inequality.)

SLIDE 24

Bonferroni correction

The most conservative method, but it is free of dependence and distributional assumptions.

FWER = 1 − P(V = 0) = 1 − (1 − α)^m ≤ mα

So instead set the significance level of each individual test to α/m. In other words, we reject the ith hypothesis if

m·pi ≤ α  ⇔  pi ≤ α/m

SLIDE 25

Bonferroni correction

The most conservative method, but it is free of dependence and distributional assumptions.

FWER = 1 − P(V = 0) = 1 − (1 − α)^m ≤ mα

So instead set the significance level of each individual test to α/m. In other words, we reject the ith hypothesis if

m·pi ≤ α  ⇔  pi ≤ α/m

Šidák correction (assumes independence). Want overall significance level α*:

1 − (1 − α)^m = α*  ⇔  α = 1 − (1 − α*)^(1/m)

Slightly less conservative than Bonferroni (but not by much).
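The two per-test levels can be compared directly; a small Python sketch (illustrative; α* here denotes the desired overall level):

```python
def bonferroni_alpha(alpha_star, m):
    # Bonferroni: run each of the m tests at level alpha*/m.
    return alpha_star / m

def sidak_alpha(alpha_star, m):
    # Sidak (independence): solve 1 - (1 - a)^m = alpha* for a.
    return 1 - (1 - alpha_star) ** (1 / m)

m = 100
print(bonferroni_alpha(0.05, m))  # alpha*/m = 0.0005
print(sidak_alpha(0.05, m))       # slightly larger, about 0.000513
```

For realistic m the two thresholds are nearly indistinguishable, which is why Bonferroni is usually preferred for its simplicity.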

SLIDE 26

Holm’s correction

The Holm-Bonferroni correction:

1 Compute and order the individual p-values: p(1) ≤ p(2) ≤ ··· ≤ p(m).
2 Find k̂ = min{k : p(k) > α/(m + 1 − k)}.
3 If k̂ exists, reject the hypotheses corresponding to p(1), ..., p(k̂−1).
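The three steps above can be written out directly; a Python sketch (illustrative; the p-values are made up):

```python
def holm_reject(pvals, alpha=0.05):
    # Walk through the ordered p-values; stop at the first k with
    # p_(k) > alpha / (m + 1 - k) and reject everything before it.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    rejected = [False] * m
    for step, i in enumerate(order, start=1):
        if pvals[i] > alpha / (m + 1 - step):
            break
        rejected[i] = True
    return rejected

print(holm_reject([0.001, 0.04, 0.03, 0.005]))  # [True, False, False, True]
```

Note that the first threshold, α/m, equals the Bonferroni threshold; Holm is never less powerful than Bonferroni.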

SLIDE 27

Holm’s correction

The Holm-Bonferroni correction:

1 Compute and order the individual p-values: p(1) ≤ p(2) ≤ ··· ≤ p(m).
2 Find k̂ = min{k : p(k) > α/(m + 1 − k)}.
3 If k̂ exists, reject the hypotheses corresponding to p(1), ..., p(k̂−1).

Controls the FWER: assume that the (ordered) hypothesis k is the first wrongly rejected true hypothesis. Then k ≤ m − (m0 − 1). Hypothesis k was rejected, so

p(k) ≤ α/(m + 1 − k) ≤ α/(m + 1 − (m − (m0 − 1))) = α/m0

Since there are m0 true hypotheses, a Bonferroni argument gives that the probability that one of them is significant at level α/m0 is at most α, so the FWER is controlled.

SLIDE 28

Resampling methods

Computer-intensive methods:

  • Permutation methods. Simulate data under H0, compute the test statistic and compare it to the test statistic from the original data.
  • Bootstrap. “Simulate data under Ha”.

SLIDE 29

Practical problems

  • While the guarantee of FWER control is appealing, the resulting thresholds often suffer from low power. In practice, this tends to wipe out evidence of the most interesting effects.
  • FDR control offers a way to increase power while maintaining some principled bound on error.

SLIDE 30

False discovery rate

Number of errors committed when testing m null hypotheses:

               H0 is true    H0 is false    Total
Rejected       V             S              R
Not rejected   U             T              m − R
Total          m0            m − m0         m

The proportion of false discoveries is Q = V/R (set to 0 when R = 0). The false discovery rate is

FDR = E(Q) = E(V/R)

This focuses on a different problem than the FWER! How do we estimate E(Q) in practice?
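The false-discovery proportion Q has a one-line implementation once the true nulls are (hypothetically) known, which is useful in simulations. A Python sketch:

```python
def fdp(rejected, is_null):
    # Q = V/R: the share of rejections that are true nulls,
    # defined as 0 when nothing is rejected (R = 0).
    R = sum(rejected)
    V = sum(1 for r, null in zip(rejected, is_null) if r and null)
    return V / R if R > 0 else 0.0

# Hypothetical example: three rejections, one of which is a true null,
# so Q = 1/3.
print(fdp([True, True, True, False, False],
          [True, False, False, True, True]))
```

The FDR is the expectation of this quantity over repeated experiments, which is exactly why it must be estimated rather than read off a single data set.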

SLIDE 31

Estimating FDR

[Histogram of p-values: density (0 to 1.5) against p-value (0 to 1)]

SLIDE 32

Estimating FDR

[Histogram of p-values: density (0 to 1.5) against p-value (0 to 1)]

SLIDE 33

Estimating FDR

[Histogram of p-values with regions labelled: true negatives, false negatives, true positives, false positives]

SLIDE 34

Estimating FDR — BH step-up

Benjamini-Hochberg step-up procedure to control the FDR at level α:

1 Compute and order the individual p-values: p(1) ≤ p(2) ≤ ··· ≤ p(m).
2 Find k̂ = max{k : (m/k)·p(k) ≤ α}.
3 If k̂ exists, reject the hypotheses corresponding to p(1), ..., p(k̂).
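The step-up search for k̂ can be written out as follows (Python sketch with made-up p-values):

```python
def bh_reject(pvals, alpha=0.05):
    # Find the largest k with p_(k) <= alpha * k / m, then reject the
    # hypotheses with the k smallest p-values.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_hat = 0
    for k, i in enumerate(order, start=1):
        if pvals[i] <= alpha * k / m:
            k_hat = k
    rejected = [False] * m
    for i in order[:k_hat]:
        rejected[i] = True
    return rejected

print(bh_reject([0.001, 0.04, 0.03, 0.005]))  # all four rejected
```

At the same level α, BH rejects at least as many hypotheses as Holm, illustrating the extra power bought by controlling the FDR instead of the FWER.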

SLIDE 35

Estimating FDR — BH step-up

Benjamini-Hochberg step-up procedure to control the FDR at level α:

1 Compute and order the individual p-values: p(1) ≤ p(2) ≤ ··· ≤ p(m).
2 Find k̂ = max{k : (m/k)·p(k) ≤ α}.
3 If k̂ exists, reject the hypotheses corresponding to p(1), ..., p(k̂).

Equivalently, in terms of adjusted p-values:

p̃(m) = p(m)
p̃(m−1) = min{p̃(m), (m/(m−1))·p(m−1)}
...
p̃(1) = min{p̃(2), m·p(1)}

SLIDE 36

Estimating FDR — BH step-up

Note that each threshold in Holm’s method is smaller than or equal to the corresponding BH threshold, so the FWER-controlling Holm procedure is the more conservative of the two.

Assume the m0 true tests are iid (and all tests independent), ordered so that the m0 true tests come first. To see that the FDR is controlled at level q:

E(V/R) = Σ_{r=1}^{m} E[(V/r)·1{R=r}]
       = Σ_{r=1}^{m} (1/r)·E[V·1{R=r}]
       = Σ_{r=1}^{m} (1/r)·E[Σ_{i=1}^{m0} 1{pi ≤ qr/m}·1{R=r}]
       = Σ_{r=1}^{m} (m0/r)·E[1{p1 ≤ qr/m}·1{R=r}]
       = ···
       = q·(m0/m) ≤ q

SLIDE 37

q-values

The q-value is defined to be the FDR analogue of the p-value:

q-value(pi) = min_{t ≥ pi} FDR(t)

The q-value of an individual hypothesis test is the minimum FDR at which the test may be called significant.

SLIDE 38

q-values

  • When all m null hypotheses are true, FDR control is equivalent to FWER control.
  • The FDR approach generally gives more power than FWER control and fewer Type I errors than uncorrected testing.
  • The FDR bound holds for certain classes of dependent tests. In practice, it is quite hard to “break”.

SLIDE 39

Multiple regression

Regression with more than one explanatory variable. For the ith individual,

yi = β0 + β1·x1i + β2·x2i + ··· + βp·xpi + εi

  • Each coefficient, βj, expresses the expected change in the response y when xj changes one unit and the remaining explanatory variables are held fixed.
  • Hence βj is a measure of the effect of xj on the response, taking the effect of the other variables into account.
  • It might be that we have a significant effect of x1 in the univariate model while its effect in the multiple linear model is non-significant: its effect is explained by the other explanatory variables.

SLIDE 40

Generalized linear models — a refresher

Expand the multiple regression model by applying a transformation with a suitable link function. This gives a so-called generalized linear model:

g(E[Yi|X1i, ..., Xpi]) = β0 + β1·X1i + ··· + βp·Xpi

or equivalently

E[Yi|X1i, ..., Xpi] = g⁻¹(β0 + β1·X1i + ··· + βp·Xpi)

where g is in principle any function that maps the population mean onto the linear predictor; g is called the link function.
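For non-Gaussian outcomes the classic example is the logit link for binary data. A Python sketch of g and g⁻¹ (illustrative, not from the slides):

```python
import math

def logit(p):
    # Link g: maps a mean (here a probability) to the linear predictor.
    return math.log(p / (1 - p))

def inv_logit(eta):
    # Inverse link g^{-1}: maps the linear predictor back to a mean.
    return 1.0 / (1.0 + math.exp(-eta))

print(inv_logit(0.0))                    # linear predictor 0 -> mean 0.5
print(round(logit(inv_logit(1.7)), 6))   # round trip recovers 1.7
```

The identity link used below is simply g(μ) = μ, which brings us back to ordinary multiple regression.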

SLIDE 41

Generalized linear models — a refresher

Equivalently,

E[Yi|X1i, ..., Xpi] = g⁻¹(Xβ)

We are only going to consider Gaussian data with the identity link:

E[Yi|X1i, ..., Xpi] = Xβ  and  Y = Xβ + ε

SLIDE 42

Generalized linear models — a refresher

From first-year statistics, the ordinary least squares estimator is

β̂ = (XᵗX)⁻¹Xᵗy

Written as least squares: β̂ minimizes (y − Xβ)ᵗ(y − Xβ).
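For one predictor with an intercept, the normal equations (XᵗX)β = Xᵗy reduce to a 2×2 system that can be solved by hand; a Python sketch on made-up data:

```python
# Tiny worked example of beta-hat = (X'X)^{-1} X'y with X = [1, x].
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.1]   # roughly y = 2x

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Normal equations [[n, sx], [sx, sxx]] (b0, b1)' = (sy, sxy)',
# solved by Cramer's rule.
det = n * sxx - sx * sx
b0 = (sy * sxx - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det
print(round(b0, 6), round(b1, 6))  # intercept near 0, slope near 2.02
```

Note that the inverse (XᵗX)⁻¹ only exists when X has full column rank, which is exactly what fails when p > n.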

SLIDE 43

Generalized linear models — a refresher

From first-year statistics, the ordinary least squares estimator is

β̂ = (XᵗX)⁻¹Xᵗy

Written as least squares: β̂ minimizes (y − Xβ)ᵗ(y − Xβ).

The “old” problem we have been working towards: inference with many outcomes/predictors.

SLIDE 44

Penalized regression

Let Y = (Y1, ..., Yn) be the vector of outcomes in R, and let X hold a set of M predictors for each observation. Assume a linear mean effect: Y = Xβ + ε. The lasso estimates β by minimizing the penalized least squares function

Zn(β) = (1/n)·(Y − Xβ)ᵗ(Y − Xβ) + λn·‖β‖₁

so β̂ = argmin_{β ∈ R^M} Zn(β).
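The effect of the ℓ1 penalty on a single coefficient (under an orthonormal design) is soft-thresholding of the OLS estimate, which is why the lasso can set coefficients exactly to zero. A Python sketch:

```python
def soft_threshold(z, lam):
    # Lasso solution for one coefficient: shrink the OLS estimate z
    # toward zero by lam, and snap it to zero when |z| <= lam.
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

print(soft_threshold(2.0, 0.5))   # 1.5
print(soft_threshold(0.3, 0.5))   # 0.0  (selected out of the model)
print(soft_threshold(-1.0, 0.5))  # -0.5
```

Coordinate-descent solvers such as glmnet apply this update repeatedly, one coordinate at a time.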

SLIDE 45

Properties of regularized regression

Even for M > n or M ≫ n regularized regression methods can:

  • select a sparse model
  • lead to accurate prediction

Limitations of LASSO type regularization:

  • not consistent in variable selection
  • non-standard limiting distribution
  • no oracle property
  • multiple testing problem

SLIDE 46

Example

  • Outcome: depression (scale 0–40)
  • Input: 384 normalized gene expression values
  • 40 persons

> (cbind(y, x))[1:5, 1:7]
      y
[1,] 22 -0.63 -0.16 -0.57 -0.51  0.43  0.41
[2,] 15  0.18 -0.25 -0.14  1.34 -0.24  1.69
[3,] 21 -0.84  0.70  1.18 -0.21  1.06  1.59
[4,] 27  1.60  0.56 -1.52 -0.18  0.89 -0.33
[5,] 12  0.33 -0.69  0.59 -0.10 -0.62 -2.29

SLIDE 47

Example — 384 linear regressions

> head(sort(
+   pt(-abs(MESS::mfastLmCpp(y, x)$tstat), df=38)
+ ))
[1] 0.0003440182 0.0007709503 0.0013871463 0.0014807050 0
[6] 0.0026072106

Bonferroni, Holm and FDR “kill” all but variable 264 — however, it is still not significant!

SLIDE 48

Example — lasso

> library("glmnet")
> res <- glmnet(x, y)
> plot(res, lwd=2)

[Lasso coefficient paths: coefficients (−1.0 to 1.0) against the L1 norm; the number of nonzero coefficients is shown along the top axis]

SLIDE 49

Results

> coef(res, s=2)
385 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept) 16.909578504
V1           .
V2           .
V3           .
V4           .
V5           .
V6           .
V7           .
V8           .
V9           .
V10          .
V11          .
V12          .
V13          .
V14          .

SLIDE 50

Results

> pick <- which(coef(res, s=2) != 0)
> cbind((1:385)[pick], coef(res, s=2)[pick])
     [,1]         [,2]
[1,]    1 16.909578504
[2,]   43  0.006902524
[3,]  112 -0.067933348
[4,]  116 -0.294950702
[5,]  146  0.072430783
[6,]  229  0.024146820
[7,]  265  0.462496856
[8,]  297 -0.092082227

SLIDE 51

Evaluating regularized regression results

How do we evaluate the results from regularized regression?

  • We get a list of parameters: β̂(1), β̂(2), β̂(3), ..., 0, 0, ...
  • They are all shrunk and biased towards 0.
  • How do we know which of them are significant?

SLIDE 52

Evaluating regularized regression results

How do we evaluate the results from regularized regression?

  • We get a list of parameters: β̂(1), β̂(2), β̂(3), ..., 0, 0, ...
  • They are all shrunk and biased towards 0.
  • How do we know which of them are significant?

Classical approach: test the hypothesis that the jth predictor is significant,

H0: βj = 0

Potential problems with multiple testing, the selection algorithm, bias, lack of a small-sample test statistic, ...

SLIDE 53

Debiasing

> pick <- which(coef(res, s=2) != 0)
> select <- pick[-1]-1 ; select
[1]  42 111 115 145 228 264 296
> broom::tidy(lm(y ~ x[,select]))
# A tibble: 8 x 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    16.6       0.401    41.5   2.11e-29
2 x[, select]1    1.36      0.473     2.88  7.06e- 3
3 x[, select]2   -1.29      0.529    -2.43  2.08e- 2
4 x[, select]3   -1.64      0.435    -3.78  6.42e- 4
5 x[, select]4    0.303     0.509     0.594 5.57e- 1
6 x[, select]5    1.71      0.415     4.11  2.55e- 4
7 x[, select]6    1.57      0.433     3.61  1.02e- 3
8 x[, select]7   -1.21      0.361    -3.35  2.07e- 3

These naive post-selection p-values do not give valid inference!

SLIDE 54

Selective Inference

Compute p-values and CIs for the lasso estimate, at a fixed value of the tuning parameter λ.

> library("selectiveInference")
> sigma <- estimateSigma(x, y)
> lambda <- 2
> n <- 40
> beta <- coef(res, s=lambda/n,
+              exact=TRUE, x=x, y=y)[-1]
> fixedLassoInf(x, y, beta, lambda, sigma=sigma)

Caution: quite persnickety.

SLIDE 55

Penalized regression — II

How about using another penalty function? Ridge regression estimates β by minimizing the penalized least squares function

Zn(β) = (1/n)·(Y − Xβ)ᵗ(Y − Xβ) + ξn·‖β‖₂²

Ridge regression proceeds by adding a small value to the diagonal elements of the correlation matrix of the parameters. This introduces bias but reduces variation.
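In the simplest case of one standardized predictor the ridge estimate has a closed form, sxy/(sxx + ξ), which makes the shrinkage explicit. A Python sketch on made-up data:

```python
def ridge_1d(xs, ys, lam):
    # One-predictor ridge estimate: beta = sum(x*y) / (sum(x*x) + lam).
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)

xs = [-1.0, 0.0, 1.0]
ys = [-2.0, 0.0, 2.0]
print(ridge_1d(xs, ys, 0.0))  # lam = 0 gives OLS: 2.0
print(ridge_1d(xs, ys, 2.0))  # shrunk towards zero: 1.0
```

Unlike the lasso, the ridge estimate is never exactly zero for any finite penalty, so ridge does not perform variable selection.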

SLIDE 56

Penalized regression — III

How about using yet another penalty function? The elastic net estimates β by minimizing the penalized least squares function

Zn(β) = (1/n)·(Y − Xβ)ᵗ(Y − Xβ) + λn·‖β‖₁ + ξn·‖β‖₂²

This has the lasso and ridge regression as special cases and is estimated using an iterative two-step procedure.

SLIDE 57

Example — ridge

> res <- glmnet(x, y, alpha=0)  # Ridge regression
> plot(res, lwd=2)

[Ridge coefficient paths: all 384 coefficients shrink smoothly towards zero; none are set exactly to zero]

SLIDE 58

Example — elastic net

> res <- glmnet(x, y, alpha=.3)  # Elastic net
> plot(res, lwd=2)

[Elastic net coefficient paths: coefficients against the L1 norm; the number of nonzero coefficients is shown along the top axis]

SLIDE 59

How to choose the penalty?

> res <- cv.glmnet(x, y)  # 10-fold CV
> plot(res, lwd=2)

[Cross-validation curve: mean squared error against log(λ); the number of nonzero coefficients is shown along the top axis]

SLIDE 60

Principal component analysis

Dimension reduction of the covariates.

[Scatter plot of x[,2] against x[,1]]

SLIDE 61

Principal component analysis

General algorithm:

1 Compute the covariance matrix of the predictor data set X.
2 Calculate the eigenvalues and corresponding eigenvectors of this covariance matrix.
3 The eigenvectors correspond to orthogonal “directions”; sort them by eigenvalue.

SLIDE 62

Principal component analysis

General algorithm:

1 Compute the covariance matrix of the predictor data set X.
2 Calculate the eigenvalues and corresponding eigenvectors of this covariance matrix.
3 The eigenvectors correspond to orthogonal “directions”; sort them by eigenvalue.

To reduce dimensionality, pick a unit vector u and replace each data point x with its projection uᵗx. These new data points have variance uᵗΣu if Σ was the variance of x. Find u such that uᵗΣu is maximized: that u is exactly the eigenvector with the largest eigenvalue.
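The leading eigenvector can be found without a linear-algebra library by power iteration; a Python sketch for a 2×2 covariance matrix with known eigenvalues 3 and 1 (illustrative only):

```python
import math

def top_eigen(sigma, iters=200):
    # Power iteration: repeatedly apply Sigma and renormalize; the
    # iterate converges to the eigenvector with the largest eigenvalue.
    v = [1.0, 0.0]
    for _ in range(iters):
        w = [sigma[0][0] * v[0] + sigma[0][1] * v[1],
             sigma[1][0] * v[0] + sigma[1][1] * v[1]]
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]
    # Rayleigh quotient v' Sigma v recovers the corresponding eigenvalue.
    sv = [sigma[0][0] * v[0] + sigma[0][1] * v[1],
          sigma[1][0] * v[0] + sigma[1][1] * v[1]]
    return v, v[0] * sv[0] + v[1] * sv[1]

v, lam = top_eigen([[2.0, 1.0], [1.0, 2.0]])
print(round(lam, 6))  # 3.0
```

In practice one uses an eigendecomposition or SVD routine, but the principle of maximizing uᵗΣu is the same.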

SLIDE 63

Principal component analysis

Dimension reduction of the covariates.

[Scatter plot of x[,2] against x[,1]]

SLIDE 64

Principal component regression

  • Instead of smoothly shrinking the coordinates on the principal components, PCR either does not shrink a coordinate at all or shrinks it to zero.
  • Keep the k largest-eigenvalue components and use the k projections onto them as input to a GLM.
  • Discrete shrinkage effect compared to ridge regression.
  • Ridge regression shrinks the coefficients of the principal components, with relatively more shrinkage applied to the smaller components than the larger; principal component regression discards the p − k smallest-eigenvalue components.

SLIDE 65

Example — PCR

5 components

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  17.9500     1.3270  13.527 3.05e-15 ***
prPC1         0.3544     0.3245   1.092   0.2824
prPC2         0.1961     0.3363   0.583   0.5637
prPC3        -0.1120     0.3397  -0.330   0.7436
prPC4        -0.6515     0.3486  -1.869   0.0702 .
prPC5         0.1130     0.3526   0.320   0.7506

Residual standard error: 8.393 on 34 degrees of freedom
Multiple R-squared: 0.1335, Adjusted R-squared: 0.006065
F-statistic: 1.048 on 5 and 34 DF, p-value: 0.4061
