Statistical analysis of RNASeq Data
Introduction to RNA-seq data analysis
dominique-laurent.couturier@cruk.cam.ac.uk [Bioinformatics core]
(Source: O. Rueda, CRUK-CI; G. Marot, INRIA)

Introduction

Grand Picture of Statistics

◮ Idea: EGF is differentially expressed (DE) in luminal (L) and basal (B) cells
◮ Statistical hypotheses: H0: $\mu_B = \mu_L$ against H1: $\mu_B \neq \mu_L$
◮ Sample data (RNASeq counts): $(x_{B,1}; x_{B,2}; \ldots; x_{B,n_B})$ and $(x_{L,1}; x_{L,2}; \ldots; x_{L,n_L})$
◮ Point estimation: $\hat{\mu}_B - \hat{\mu}_L$
◮ Inference: $T_{obs} = \frac{\hat{\mu}_B - \hat{\mu}_L}{s_p \sqrt{\frac{1}{n_B} + \frac{1}{n_L}}} \sim St_{n_B + n_L - 2}$

Outline

◮ 1/ Analysis of gene expression measured with Microarrays
  ⊲ 1a/ Normal distribution
  ⊲ 1b/ Test of equality of means for two samples: T-test
  ⊲ 1c/ Test of equality of means for > 2 samples: ANOVA
  ⊲ 1d/ Test of equality of means for 2 categorical predictors: ANOVA
  ⊲ 1e/ Test of equality of means for > 2 predictors: Linear model
  ⊲ 1f/ Confounding
◮ 2/ Analysis of gene expression measured by RNAseq
  ⊲ Generalisation of the linear model: Negative Binomial regression
    ◮ 2a/ Negative Binomial distribution
    ◮ 2b/ Nuisance parameter estimation: Shrinkage estimator
    ◮ 2c/ Controlling for Library size: Offset
◮ 3/ Controlling for multiple testing
  ⊲ 3a/ Family-wise error rate
  ⊲ 3b/ False discovery rate

Part I: Analysis of gene expression measured with Microarrays

1a/ Normal distribution

$Y \sim N(\mu, \sigma^2)$, $f_Y(y) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\mu)^2}{2\sigma^2}}$, $E[Y] = \mu$, $Var[Y] = \sigma^2$

[Figure: probability density function $f_Y(y \mid \mu, \sigma)$; 68.27% of the probability lies within $\mu \pm \sigma$, 95.45% within $\mu \pm 2\sigma$ and 99.73% within $\mu \pm 3\sigma$]

◮ A suitable model for many variables

[Figure: gene expression values of gene 'X' of basal cells of 33 mice]
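The coverage percentages quoted for the normal density (68.27%, 95.45%, 99.73%) can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `normal_coverage` is illustrative and not part of the original slides. It relies on the identity $P(|Y - \mu| < k\sigma) = \mathrm{erf}(k / \sqrt{2})$.

```python
import math


def normal_coverage(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))


# Coverage for one, two and three standard deviations around the mean.
for k in (1, 2, 3):
    print(f"within mu +/- {k} sigma: {100 * normal_coverage(k):.2f}%")
```

The result does not depend on $\mu$ or $\sigma$, which is why the slide can label the axis in units of $\sigma$.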

1b/ Test of equality of means for two samples

[Figure: intensity expression of gene 'X' in basal (n=33) and luminal (n=43) cells]

We test H0: $\mu_B - \mu_L = 0$ against H1: $\mu_B - \mu_L \neq 0$.

We know:
◮ Student's t-test [assume $\sigma^2_B = \sigma^2_L$]: $\frac{\hat{\mu}_B - \hat{\mu}_L}{s_p \sqrt{\frac{1}{n_B} + \frac{1}{n_L}}} \sim t_{n_B + n_L - 2}$,
◮ $s_p = \sqrt{\frac{s^2_B (n_B - 1) + s^2_L (n_L - 1)}{n_B + n_L - 2}}$.

[Figure: densities of $T \sim St_2$, $St_{10}$, $St_{50}$ and $St_{100}$; the central 95% lies within ±4.303, ±2.228, ±2.009 and ±1.984 respectively]

    Two Sample t-test

    data:  Basal and Luminal
    t = 6.6751, df = 74, p-value = 3.941e-09
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     1.048457 1.940748
    sample estimates:
    mean of x mean of y
     2.923908  1.429305
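The reported t statistic can be reproduced from the summary numbers alone. The sketch below is not the author's code: the function names are illustrative, and the pooled standard deviation is assumed to equal the residual standard error (0.9675 on 74 degrees of freedom) that the linear-model output for the same data reports, since the per-group standard deviations are not given on the slides.

```python
import math


def pooled_sd(s_b: float, n_b: int, s_l: float, n_l: int) -> float:
    """Pooled standard deviation s_p from the two group standard deviations."""
    return math.sqrt((s_b**2 * (n_b - 1) + s_l**2 * (n_l - 1)) / (n_b + n_l - 2))


def t_statistic(mean_b: float, mean_l: float, s_p: float, n_b: int, n_l: int) -> float:
    """Student's two-sample t statistic under the equal-variance assumption."""
    return (mean_b - mean_l) / (s_p * math.sqrt(1.0 / n_b + 1.0 / n_l))


# Summary values quoted on the slides.
n_b, n_l = 33, 43
mean_b, mean_l = 2.923908, 1.429305
s_p = 0.9675  # assumption: residual standard error of the lm() fit, used as s_p

t_obs = t_statistic(mean_b, mean_l, s_p, n_b, n_l)
print(f"t = {t_obs:.3f} on {n_b + n_l - 2} degrees of freedom")
```

Up to rounding of the inputs this agrees with the reported t = 6.6751 on 74 degrees of freedom, which lies far outside the ±1.984 to ±2.009 range shown for 50 to 100 degrees of freedom.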

1b/ Test of equality of means for two samples

[Figure: intensity expression of gene 'X' in basal (n=33) and luminal (n=43) cells]

◮ Modelling 1:
  $Y_{i(B)} = \mu_B + \epsilon_i$
  $Y_{i(L)} = \mu_L + \epsilon_i$
◮ Modelling 2:
  $Y_i = \mu_B + \delta_L I(i \in L) + \epsilon_i = \beta_0 + \beta_1 X_1 + \epsilon_i$
where $i = 1, ..., n$; $\epsilon_i \sim N(0, \sigma^2)$.

1b/ Test of equality of means for two samples

[Figure: intensity expression of gene 'X' in basal (n=33) and luminal (n=43) cells]

◮ Modelling 1:
  $Y_i = \mu_B I(i \in B) + \mu_L I(i \in L) + \epsilon_i$
  $Y = X\beta + \epsilon$
where $i = 1, ..., n$; $\epsilon_i \sim N(0, \sigma^2)$.

    Call:
    lm(formula = expression ~ celltype - 1, data = microarrays)

    Residuals:
         Min       1Q   Median       3Q      Max
    -2.64401 -0.58586  0.01473  0.65051  2.47771

    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
    celltypeBasal     2.9239     0.1684  17.361  < 2e-16 ***
    celltypeLuminal   1.4293     0.1475   9.687 8.47e-15 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.9675 on 74 degrees of freedom
    Multiple R-squared:  0.8423,  Adjusted R-squared:  0.838
    F-statistic: 197.6 on 2 and 74 DF,  p-value: < 2.2e-16
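To make the matrix form of Modelling 1 concrete, the design matrix and the resulting estimates can be written out. This is a sketch under the assumption that the rows are ordered with the $n_B = 33$ basal observations first; the ordering is not stated on the slide and does not change the estimates.

```latex
X = \begin{pmatrix}
      1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\
      0 & 1 \\ \vdots & \vdots \\ 0 & 1
    \end{pmatrix},
\qquad
\beta = \begin{pmatrix} \mu_B \\ \mu_L \end{pmatrix},
\qquad
X^\top X = \begin{pmatrix} n_B & 0 \\ 0 & n_L \end{pmatrix}

\hat{\beta} = (X^\top X)^{-1} X^\top Y
            = \begin{pmatrix} \bar{Y}_B \\ \bar{Y}_L \end{pmatrix}

\widehat{\mathrm{se}}(\hat{\mu}_B) = \frac{\hat{\sigma}}{\sqrt{n_B}}
  = \frac{0.9675}{\sqrt{33}} \approx 0.1684,
\qquad
\widehat{\mathrm{se}}(\hat{\mu}_L) = \frac{\hat{\sigma}}{\sqrt{n_L}}
  = \frac{0.9675}{\sqrt{43}} \approx 0.1475
```

With this coding each coefficient is a group mean, and the standard errors follow from the residual standard error and the group sizes, matching the Std. Error column of the output.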

1b/ Test of equality of means for two samples

[Figure: intensity expression of gene 'X' in basal (n=33) and luminal (n=43) cells]

◮ Modelling 2:
  $Y_i = \mu_B + \delta_L I(i \in L) + \epsilon_i = \beta_0 + \beta_1 X_1 + \epsilon_i$
  $Y = X\beta + \epsilon$
where $i = 1, ..., n$; $\epsilon_i \sim N(0, \sigma^2)$.

    Call:
    lm(formula = expression ~ celltype, data = microarrays)

    Residuals:
         Min       1Q   Median       3Q      Max
    -2.64401 -0.58586  0.01473  0.65051  2.47771

    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
    (Intercept)       2.9239     0.1684  17.361  < 2e-16 ***
    celltypeLuminal  -1.4946     0.2239  -6.675 3.94e-09 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.9675 on 74 degrees of freedom
    Multiple R-squared:  0.3758,  Adjusted R-squared:  0.3674
    F-statistic: 44.56 on 1 and 74 DF,  p-value: 3.941e-09
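The two parameterisations describe the same fit: in Modelling 2 the intercept is the basal mean and the `celltypeLuminal` coefficient is the luminal mean minus the basal mean (1.4293 − 2.9239 = −1.4946 in the output). The sketch below illustrates this with a small least-squares solver written for the purpose. The data values and the helper `ols_two_columns` are hypothetical and are not the microarray data or the author's code.

```python
def ols_two_columns(x1, x2, y):
    """Least-squares fit of y = b1*x1 + b2*x2 by solving the 2x2 normal equations."""
    a11 = sum(u * u for u in x1)
    a12 = sum(u * v for u, v in zip(x1, x2))
    a22 = sum(v * v for v in x2)
    c1 = sum(u * w for u, w in zip(x1, y))
    c2 = sum(v * w for v, w in zip(x2, y))
    det = a11 * a22 - a12 * a12
    return (a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det


# Hypothetical expression values, for illustration only.
basal = [2.0, 3.0, 4.0]
luminal = [1.0, 1.5, 2.0, 1.5]
y = basal + luminal

is_basal = [1.0] * len(basal) + [0.0] * len(luminal)    # I(i in B)
is_luminal = [0.0] * len(basal) + [1.0] * len(luminal)  # I(i in L)
ones = [1.0] * len(y)                                   # intercept column

# Modelling 1: one mean per group, no intercept.
mu_b, mu_l = ols_two_columns(is_basal, is_luminal, y)

# Modelling 2: intercept (basal mean) plus luminal offset.
beta0, beta1 = ols_two_columns(ones, is_luminal, y)

print(mu_b, mu_l)    # group means
print(beta0, beta1)  # basal mean and difference luminal - basal
```

Because both design matrices span the same column space, the residuals and the residual standard error are identical in the two outputs; only the meaning of the coefficients, and hence of their t-tests, changes.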

1c/ Test of equality of means for > 2 samples

[Figure: intensity expression of gene 'X' for the Virgin, Pregnant and Lactating groups]

◮ One-way ANOVA hypotheses
  ⊲ H0: $\mu_L = \mu_P = \mu_V$,
  ⊲ H1: $\mu_k \neq \mu_l$ for at least one pair $(k, l)$.
◮ Modelling 1:
  $Y_{i(L)} = \mu_L + \epsilon_i$
  $Y_{i(P)} = \mu_P + \epsilon_i$
  $Y_{i(V)} = \mu_V + \epsilon_i$
  $Y_i = \mu_L I(i \in L) + \mu_P I(i \in P) + \mu_V I(i \in V) + \epsilon_i$
  $Y = X\beta + \epsilon$
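The slide states the one-way ANOVA hypotheses but not the test statistic. As an illustration, the usual F ratio (between-group mean square over within-group mean square) can be computed as follows. The function name and the three small samples are hypothetical; they are not the expression data behind the figure.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA on a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f_stat, k - 1, n - k


# Hypothetical values for the three groups, for illustration only.
virgin = [1.0, 2.0, 3.0]
pregnant = [2.0, 3.0, 4.0]
lactating = [4.0, 5.0, 6.0]

f_stat, df_between, df_within = one_way_anova_f([virgin, pregnant, lactating])
print(f"F = {f_stat:.2f} on {df_between} and {df_within} degrees of freedom")
```

Under H0 this ratio follows an F distribution with $k - 1$ and $n - k$ degrees of freedom; with two groups it reduces to the square of the pooled t statistic.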
