analyses biostatistiques de donn es rna seq
play

Analyses biostatistiques de donnes RNA-seq Nathalie Vialaneix (MIAT, - PowerPoint PPT Presentation

Analyses biostatistiques de donnes RNA-seq Nathalie Vialaneix (MIAT, INRA) en collaboration avec Ignacio Gonzles et Annick Moisan nathalie.vialaneix@inrae.fr http://www.nathalievialaneix.eu Toulouse, 4/5 novembre 2020 NV 2 (INRA)


  1. Dataset used in the examples These text files are loaded with: raw_counts <- read.table("D1-counts.txt", header = TRUE, row.names = 1) raw_counts <- as.matrix(raw_counts) design <- read.table("D1-targets.txt", header = TRUE, stringsAsFactors = FALSE) gene_lengths <- scan("D1-genesLength.txt") NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 16 / 67

  2. Count distribution The count distribution ( i.e. , the number of times a given count is obtained in the data) can be visualized with histograms (boxplots or violin plots can also be used): This distribution is highly skewed and it is better visualized using a log 2 transformation before it is displayed. NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 17 / 67

  3. Count distribution The count distribution ( i.e. , the number of times a given count is obtained in the data) can be visualized with histograms (boxplots or violin plots can also be used): This distribution is highly skewed and it is better visualized using a log 2 transformation before it is displayed. library size The library size is the sum of all counts in a given sample. NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 17 / 67

  4. Count distribution between samples The count distribution between different samples can be compared with parallel boxplots or violin plots: It is expected that, within a given condition (group), the count distributions are similar. The same is often also expected between conditions. NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 18 / 67

  5. Check reproducibility between samples MA plots can be used to visualize reproducibility between samples of an experiment (and thus check if normalization is needed). They plot the log-fold change (M-values) against the log-average (A-values): M-values: log of ratio between A-values: average log counts counts between two samples: between two samples: A g = log 2 ( K gj ) + log 2 ( K gj ′ ) M g = log 2 ( K gj ) − log 2 ( K gj ′ ) 2 where K gj stands for the counts for gene g in sample j . NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 19 / 67

  6. Check reproducibility between samples MA plots can be used to visualize reproducibility between samples of an experiment (and thus check if normalization is needed). They plot the log-fold change (M-values) against the log-average (A-values): NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 19 / 67

  7. Check similarity between samples Similarities between samples can be visualized with a HAC and a heatmap: perform hierarchical ascending classification (HAC) using Euclidean � � 2 distance between samples: δ ( j , j ′ ) = � log 2 ( K gj ) − log 2 ( K gj ′ ) g visualize the strength of the similarity with heatmap. NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 20 / 67

  8. Search for the main structures in the data: PCA PCA (on log 2 counts) can be used to project data into a small dimensional space and search for unexpected experimental effects in the data. (MDS is equivalent to PCA when used with the standard Euclidean distance) Remark: In DESeq , the function plotPCA performs PCA on the top genes with the highest variance. NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 21 / 67

  9. Outline Exploratory analysis 1 Introduction Experimental design Data exploration and quality assessment Normalization 2 Raw data filtering Interpreting read counts Differential Expression analysis 3 Hypothesis testing and correction for multiple tests Differential expression analysis for RNAseq data Interpreting and improving the analysis NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 22 / 67

  10. Raw data filtering Filtering consists in removing genes with low expression. Different strategies can be used: [Sultan et al., 2008]: filter out genes with a total read count smaller than a given threshold; [Bottomly et al., 2011]: filter out genes with zero count in an experimental condition; [Robinson and Oshlack, 2010]: filter out genes such that the number of samples with a CPM value (for this gene) smaller than a given threshold is larger than the smallest number of samples in a condition. With CPM: Count Per Million ( i.e. , raw count divived by library size, this strategy takes into account differences in library sizes). NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 23 / 67

  11. Raw data filtering Filtering consists in removing genes with low expression. Different strategies can be used: [Sultan et al., 2008]: filter out genes with a total read count smaller than a given threshold; [Bottomly et al., 2011]: filter out genes with zero count in an experimental condition; [Robinson and Oshlack, 2010]: filter out genes such that the number of samples with a CPM value (for this gene) smaller than a given threshold is larger than the smallest number of samples in a condition. With CPM: Count Per Million ( i.e. , raw count divived by library size, this strategy takes into account differences in library sizes). More sophisticated filtering To account for the fact that lowly expressed genes are almost never found differentially expressed, a more sophisticated filtering can be performed. NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 23 / 67

  12. Part III: Normalization NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 24 / 67

  13. Purpose of normalization identify and correct technical biases (due to sequencing process) to make counts comparable types of normalization: within sample normalization and between sample normalization NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 25 / 67

  14. Within sample normalization Example: (read counts) sample 1 sample 2 sample 3 gene A 752 615 1203 gene B 1507 1225 2455 counts for gene B are twice larger than counts for gene A because: NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 26 / 67

  15. Within sample normalization Example: (read counts) sample 1 sample 2 sample 3 gene A 752 615 1203 gene B 1507 1225 2455 counts for gene B are twice larger than counts for gene A because: gene B is expressed with a number of transcripts twice larger than gene A gene A gene B NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 26 / 67

  16. Within sample normalization Example: (read counts) sample 1 sample 2 sample 3 gene A 752 615 1203 gene B 1507 1225 2455 counts for gene B are twice larger than counts for gene A because: both genes are expressed with the same number of transcripts but gene B is twice longer than gene A gene A gene B NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 26 / 67

  17. Within sample normalization Purpose of within sample comparison: enabling comparisons of genes from a same sample Sources of variability: gene length, sequence composition (GC content) These differences need not to be corrected for a differential analysis and are not really relevant for data interpretation. NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 26 / 67

  18. Between sample normalization Example: (read counts) sample 1 sample 2 sample 3 gene A 752 615 1203 gene B 1507 1225 2455 counts in sample 3 are much larger than counts in sample 2 because: NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 27 / 67

  19. Between sample normalization Example: (read counts) sample 1 sample 2 sample 3 gene A 752 615 1203 gene B 1507 1225 2455 counts in sample 3 are much larger than counts in sample 2 because: gene A is more expressed in sample 3 than in sample 2 gene A in sample 2 gene A in sample 3 NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 27 / 67

  20. Between sample normalization Example: (read counts) sample 1 sample 2 sample 3 gene A 752 615 1203 gene B 1507 1225 2455 counts in sample 3 are much larger than counts in sample 2 because: gene A is expressed similarly in the two samples but sequencing depth is larger in sample 3 than in sample 2 ( i.e. , differences in library sizes) gene A in sample 2 gene A in sample 3 NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 27 / 67

  21. Between sample normalization Purpose of between sample comparison: enabling comparisons of a gene in different samples Sources of variability: library size, ... These differences must be corrected for a differential analysis and for data interpretation. NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 27 / 67

  22. Principles for sequencing depth normalization Basics choose an appropriate baseline for each sample 1 for a given gene, compare counts relative to the baseline rather than 2 raw counts NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 28 / 67

  23. Principles for sequencing depth normalization Basics choose an appropriate baseline for each sample 1 for a given gene, compare counts relative to the baseline rather than 2 raw counts In practice: Raw counts correspond to different sequencing depths control treated 5 1 0 0 4 0 0 Gene 1 0 2 1 2 1 0 0 Gene 2 Gene 3 92 161 76 70 140 88 70 : : : : : : : : : 15 25 9 5 20 14 17 Gene G NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 28 / 67

  24. Principles for sequencing depth normalization Basics choose an appropriate baseline for each sample 1 for a given gene, compare counts relative to the baseline rather than 2 raw counts In practice: A correction multiplicative factor is computed for every sample control treated 5 1 0 0 4 0 0 Gene 1 0 2 1 2 1 0 0 Gene 2 Gene 3 92 161 76 70 140 88 70 : : : : : : : : : 15 25 9 5 20 14 17 Gene G C j 1.1 1.6 0.6 0.7 1.4 0.7 0.8 NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 28 / 67

  25. Principles for sequencing depth normalization Basics choose an appropriate baseline for each sample 1 for a given gene, compare counts relative to the baseline rather than 2 raw counts In practice: Every counts is multiplied by the correction factor corresponding to its sample Gene 3 92 161 76 70 140 88 70 Cj 1.1 1.6 0.6 0.7 1.4 0.7 0.8 x 101.2 257.6 45.6 49 196 61.6 56 Gene 3 NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 28 / 67

  26. Principles for sequencing depth normalization Basics choose an appropriate baseline for each sample 1 for a given gene, compare counts relative to the baseline rather than 2 raw counts Consequences: Library sizes for normalized counts are roughly equal. control treated 5.5 1.6 0 0 5.6 0 0 Gene 1 Gene 2 0 3.2 0.6 1.4 1.4 0 0 Gene 3 101.2 257.6 45.6 49 196 61.6 56 : : : : : : : : : 16.5 40 5.4 5.5 28 9.8 13.6 Gene G + 13.1 13.0 13.2 13.1 13.2 13.0 13.1 x 10 5 Lib. size NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 28 / 67

  27. Principles for sequencing depth normalization Definition If K gj is the raw count for gene g in sample j then, the normalized counts is defined as: K gj � × 10 6 K gj = s j × D j in which: D j = � g K gj is the library size of sample j , s j is the correction factor of the library size for sample j and thus C j = 10 6 s j D j . NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 28 / 67

  28. Principles for sequencing depth normalization Definition If K gj is the raw count for gene g in sample j then, the normalized counts is defined as: K gj � × 10 6 K gj = s j × D j in which: D j = � g K gj is the library size of sample j , s j is the correction factor of the library size for sample j and thus C j = 10 6 s j D j . Three types of methods: distribution adjustment method taking length into account the “effective library size” concept NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 28 / 67

  29. Distribution adjustment Total read count adjustment [Mortazavi et al., 2008] K gj = K gj and thus: � × 10 6 s j = 1 D j (Counts Per Million). raw counts normalized counts 15 edgeR : log 2 (count + 1) Samples 10 Sample 1 cpm(..., Sample 2 normalized.lib.sizes=TRUE) 5 0 0 250 500 750 1000 0 250 500 750 1000 rank(mean) gene expression NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 29 / 67

  30. Distribution adjustment Total read count adjustment [Mortazavi et al., 2008] (Upper) Quartile normalization [Bullard et al., 2010] Q ( p ) j s j = � N l = 1 Q ( p ) 1 N l in which N is the number of samples and Q ( p ) is a given quantile j (generally 3rd quartile) of the count distribution in sample j . raw counts normalized counts raw counts normalized counts 20 20 15 15 log 2 (count + 1) log 2 (count + 1) Samples Samples Sample 1 Sample 1 10 10 Sample 2 Sample 2 5 5 0 0 0 250 500 750 1000 0 250 500 750 1000 0 250 500 750 1000 0 250 500 750 1000 rank(mean) gene expression rank(mean) gene expression edgeR : calcNormFactors(..., method = "upperquartile", p = 0.75) NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 30 / 67

  31. Method using gene lengths (intra & inter sample normalization) RPKM: Reads Per Kilobase per Million mapped Reads Assumptions: read counts are proportional to expression level, transcript length and sequencing depth D j L g s j = 10 3 × 10 6 in which L g is gene length (bp). edgeR : rpkm(..., gene.length = ...) Unbiaised estimation of number of reads but affect variability [Oshlack and Wakefield, 2009]. NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 31 / 67

  32. Relative Log Expression (RLE) [Anders and Huber, 2010], edgeR - DESeq - DESeq2 Method: compute a pseudo-reference sample: geometric mean across 1 samples   1 / N  �  N         R g =  K gj      j = 1 (geometric mean is less sensitive to extreme values than standard mean) 15 log 2 (count + 1) Samples 10 Sample 1 Sample 2 5 0 0 250 500 750 1000 rank(mean) gene expression NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 32 / 67

  33. Relative Log Expression (RLE) [Anders and Huber, 2010], edgeR - DESeq - DESeq2 Method: compute a pseudo-reference sample 1 center samples compared to the reference 2   1 / N  �  N   K gj = K gj     ˜   R g = with  K gj      R g j = 1 4 count / geo. mean 3 Samples Sample 1 2 Sample 2 1 0 250 500 750 1000 rank(mean) gene expression NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 32 / 67

  34. Relative Log Expression (RLE) [Anders and Huber, 2010], edgeR - DESeq - DESeq2 Method: compute a pseudo-reference sample 1 center samples compared to the reference 2 compute normalization factor: median of centered counts over the 3 genes � ˜ � ˜ s j s j = median ˜ s j = K gj factors multiply to 1: � 1 � � N g exp l = 1 log(˜ s l ) N with K gj = K gj 4 ˜ R g count / geo. mean 3 Samples Sample 1 and 2 Sample 2   1 / N   1 � N         R g =  K gj      0 250 500 750 1000 j = 1 rank(mean) gene expression NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 32 / 67

  35. Relative Log Expression (RLE) [Anders and Huber, 2010], edgeR - DESeq - DESeq2 Method: compute a pseudo-reference sample 1 center samples compared to the reference 2 compute normalization factor: median of centered counts over the 3 genes ## with edgeR 15 calcNormFactors(..., log 2 (count + 1) method="RLE") Samples 10 Sample 1 Sample 2 5 ## with DESeq estimateSizeFactors(...) 0 0 250 500 750 1000 rank(mean) gene expression NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 33 / 67

  36. Trimmed Mean of M-values (TMM) [Robinson and Oshlack, 2010], edgeR Assumptions behind the method the total read count strongly depends on a few highly expressed genes most genes are not differentially expressed NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 34 / 67

  37. Trimmed Mean of M-values (TMM) [Robinson and Oshlack, 2010], edgeR Assumptions behind the method the total read count strongly depends on a few highly expressed genes most genes are not differentially expressed ⇒ remove extreme data for fold-changed (M) and average intensity (A) � K gj � � K gr � � � K gj � � K gr �� A g ( j , r ) = 1 M g ( j , r ) = log 2 − log 2 log 2 + log 2 D j D r 2 D j D r 2 select as a reference sample, the 1 sample r with the upper quartile Trimmed M(j,r) 0 values closest to the average upper None −1 quartile −2 M- vs A-values −3 −20 −15 −10 −5 A(j,r) NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 34 / 67

  38. Trimmed Mean of M-values (TMM) [Robinson and Oshlack, 2010], edgeR Assumptions behind the method the total read count strongly depends on a few highly expressed genes most genes are not differentially expressed ⇒ remove extreme data for fold-changed (M) and average intensity (A) � K gj � � K gr � � � K gj � � K gr �� A g ( j , r ) = 1 M g ( j , r ) = log 2 − log 2 log 2 + log 2 D j D r 2 D j D r 2 1 Trimmed values M(j,r) 0 None Trim 30% on M-values 30% of Mg −1 −2 −3 −20 −15 −10 −5 A(j,r) NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 34 / 67

  39. Trimmed Mean of M-values (TMM) [Robinson and Oshlack, 2010], edgeR Assumptions behind the method the total read count strongly depends on a few highly expressed genes most genes are not differentially expressed ⇒ remove extreme data for fold-changed (M) and average intensity (A) � K gj � � K gr � � � K gj � � K gr �� A g ( j , r ) = 1 M g ( j , r ) = log 2 − log 2 log 2 + log 2 D j D r 2 D j D r 2 1 Trimmed values M(j,r) 0 None Trim 5% on A-values 30% of Mg 5% of Ag −1 −2 −3 −20 −15 −10 −5 A(j,r) NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 34 / 67

  40. Trimmed Mean of M-values (TMM) [Robinson and Oshlack, 2010], edgeR Assumptions behind the method the total read count strongly depends on a few highly expressed genes most genes are not differentially expressed On remaining data, compute the 2 weighted mean of M-values: 1 � Trimmed w g ( j , r ) M g ( j , r ) values 0 M(j,r) None 30% of Mg g : not trimmed 5% of Ag TMM ( j , r ) = −1 � w g ( j , r ) −2 g : not trimmed � � −3 −20 −15 −10 −5 D j − K gj D r − K gr with w g ( j , r ) = D j K gj + . A(j,r) D r K gr NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 34 / 67

  41. Trimmed Mean of M-values (TMM) [Robinson and Oshlack, 2010], edgeR Assumptions behind the method the total read count strongly depends on a few highly expressed genes most genes are not differentially expressed Correction factors: ˜ s j s j = 2 TMM ( j , r ) ˜ factors multiply to 1: s j = � 1 � � N l = 1 log(˜ exp s l ) N calcNormFactors(..., method="TMM") NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 35 / 67

  42. Comparison of the different approaches [Dillies et al., 2013], (6 simulated datasets) Purpose of the comparison: finding the “best” method for all cases is not a realistic purpose find an approach which is robust enough to provide relevant results in all cases Method: comparison based on several criteria to select a method which is valid for multiple objectives NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 36 / 67

  43. Comparison of the different approaches [Dillies et al., 2013], (6 simulated datasets) Effect on count distribution: RPKM and TC are very similar to raw data. NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 36 / 67

  44. Comparison of the different approaches [Dillies et al., 2013], (6 simulated datasets) Effect on differential analysis (DESeq v. 1.6): Inflated FPR for all methods except for TMM and DESeq (RLE). NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 36 / 67

  45. Comparison of the different approaches [Dillies et al., 2013], (6 simulated datasets) Conclusion: Differences appear based on data characteristics TMM and DESeq (RLE) are performant in a differential analysis context. NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 36 / 67

  46. Outline Exploratory analysis 1 Introduction Experimental design Data exploration and quality assessment Normalization 2 Raw data filtering Interpreting read counts Differential Expression analysis 3 Hypothesis testing and correction for multiple tests Differential expression analysis for RNAseq data Interpreting and improving the analysis NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 37 / 67

  47. Part IV: Differential expression analysis NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 38 / 67

  48. Different steps in hypothesis testing formulate an hypothesis H 0 : 1 H 0 : the average count for gene g in the control samples is the same that the average count in the treated samples which is tested against an alternative H 1 : the average count for gene g in the control samples is different from the average count in the treated samples NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 39 / 67

  49. Different steps in hypothesis testing formulate an hypothesis H 0 : 1 H 0 : the average count for gene g in the control samples is the same that the average count in the treated samples from observations, compute a test statistics ( e.g. , the mean in the two 2 samples) NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 39 / 67

  50. Different steps in hypothesis testing formulate an hypothesis H 0 : 1 H 0 : the average count for gene g in the control samples is the same that the average count in the treated samples from observations, compute a test statistics ( e.g. , the mean in the two 2 samples) find the theoretical distribution of the test statistics under H 0 3 NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 39 / 67

  51. Different steps in hypothesis testing formulate an hypothesis H 0 : 1 H 0 : the average count for gene g in the control samples is the same that the average count in the treated samples from observations, compute a test statistics ( e.g. , the mean in the two 2 samples) find the theoretical distribution of the test statistics under H 0 3 deduce the probability that the observations occur under H 0 : this is 4 called the p-value NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 39 / 67

  52. Different steps in hypothesis testing formulate an hypothesis H 0 : 1 H 0 : the average count for gene g in the control samples is the same that the average count in the treated samples from observations, compute a test statistics ( e.g. , the mean in the two 2 samples) find the theoretical distribution of the test statistics under H 0 3 deduce the probability that the observations occur under H 0 : this is 4 called the p-value conclude: if the p-value is low (usually below α = 5% as a 5 convention), H 0 is unlikely: we say that “H 0 is rejected”. We have that: α = P H 0 ( H 0 is rejected ) . NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 39 / 67

  53. Summary of the possible decisions Do not reject H 0 Reject H 0 NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 40 / 67

  54. Types of errors in tests Reality H 0 is true H 0 is false Correct decision Type II error Do not reject H 0 Decision � (True Negative) � (False Negative) Type I error Correct decision Reject H 0 � (False Positive) � (True Positive) P ( Type I error ) = α (risk) P ( Type II error ) = 1 − β ( β : power) NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 41 / 67

  55. Why performing a large number of tests might be a problem? Framework: Suppose you are performing G tests at level α . P ( at least one FP if H 0 is always true ) = 1 − ( 1 − α ) G Ex: for α = 5% and G = 20, P ( at least one FP if H 0 is always true ) ≃ 64%!!! NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 42 / 67

  56. Why performing a large number of tests might be a problem? Framework: Suppose you are performing G tests at level α . P ( at least one FP if H 0 is always true ) = 1 − ( 1 − α ) G Ex: for α = 5% and G = 20, P ( at least one FP if H 0 is always true ) ≃ 64%!!! Probability to have at least one false positive versus the number of tests performed when H 0 is true for all G tests 1.00 0.75 For more than 75 tests and if H 0 is Probability always true, the probability to have at 0.50 least one false positive is very close to 100%! 0.25 0 25 50 75 100 NV 2 (INRA) Number of tests Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 42 / 67

  57. Notation for multiple tests Number of decisions for G independent tests: True null False null Total hypotheses hypotheses Rejected U V R Not rejected G 0 − U G 1 − V G − R Total G 0 G 1 G NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 43 / 67

  58. Notation for multiple tests Number of decisions for G independent tests: True null False null Total hypotheses hypotheses Rejected U V R Not rejected G 0 − U G 1 − V G − R Total G 0 G 1 G Instead of the risk α , control: familywise error rate (FWER): FWER = P ( U > 0 ) ( i.e. , probability to have at least one false positive decision) false discovery rate (FDR): FDR = E ( Q ) with     U / R if R > 0  Q =     0 otherwise NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 43 / 67

  59. Adjusted p-values Settings: p-values p 1 , . . . , p G ( e.g. , corresponding to G tests on G different genes) Adjusted p-values adjusted p-values are ˜ p 1 , . . . , ˜ p G such that Rejecting tests such that ˜ P ( U > 0 ) ≤ α or E ( Q ) ≤ α p g < α ⇐⇒ NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 44 / 67

  60. Adjusted p-values Settings: p-values p 1 , . . . , p G ( e.g. , corresponding to G tests on G different genes) Adjusted p-values adjusted p-values are ˜ p 1 , . . . , ˜ p G such that Rejecting tests such that ˜ P ( U > 0 ) ≤ α or E ( Q ) ≤ α p g < α ⇐⇒ Calculating p-values order the p-values p ( 1 ) ≤ p ( 2 ) ≤ . . . ≤ p ( G ) 1 NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 44 / 67

  61. Adjusted p-values Settings: p-values p 1 , . . . , p G ( e.g. , corresponding to G tests on G different genes) Adjusted p-values adjusted p-values are ˜ p 1 , . . . , ˜ p G such that Rejecting tests such that ˜ P ( U > 0 ) ≤ α or E ( Q ) ≤ α p g < α ⇐⇒ Calculating p-values order the p-values p ( 1 ) ≤ p ( 2 ) ≤ . . . ≤ p ( G ) 1 compute ˜ p ( g ) = a g p ( g ) 2 ◮ with Bonferroni method: a g = G (FWER) ◮ with Benjamini & Hochberg method: a g = G / g (FDR) NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 44 / 67

  62. Adjusted p-values Settings: p-values p 1 , . . . , p G ( e.g. , corresponding to G tests on G different genes) Adjusted p-values adjusted p-values are ˜ p 1 , . . . , ˜ p G such that Rejecting tests such that ˜ P ( U > 0 ) ≤ α or E ( Q ) ≤ α p g < α ⇐⇒ Calculating p-values order the p-values p ( 1 ) ≤ p ( 2 ) ≤ . . . ≤ p ( G ) 1 compute ˜ p ( g ) = a g p ( g ) 2 ◮ with Bonferroni method: a g = G (FWER) ◮ with Benjamini & Hochberg method: a g = G / g (FDR) if adjusted p-values ˜ p ( g ) are larger than 1, correct ˜ p ( g ) ← min { ˜ p ( g ) , 1 } 3 NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 44 / 67

  63. Fisher’s exact test for contingency tables After normalization, one may build a contingency table like this one: treated control Total gene g n gA n gB n g other genes N A − n gA N B − n gB N − n g Total N A N B N Question: is the number of reads of gene g in the treated sample significatively different than in the control sample? NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 45 / 67

  64. Fisher’s exact test for contingency tables After normalization, one may build a contingency table like this one: treated control Total gene g n gA n gB n g other genes N A − n gA N B − n gB N − n g Total N A N B N Question: is the number of reads of gene g in the treated sample significatively different than in the control sample? Method Direct computation of the probability to obtain such a contingency table (or a “more extreme” contingency table) with: independency between the two columns of the contingency tables; the same marginals (“Total”). NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 45 / 67

  65. Example of results obtained with the Fisher test Genes declared significantly differentially expressed are in pink: 6 3 log 2 fold change Main remark: more 0 conservative for genes with a low expression −3 −6 0 5 10 15 mean log 2 expression NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 46 / 67

  66. Example of results obtained with the Fisher test Genes declared significantly differentially expressed are in pink: 6 3 log 2 fold change Main remark: more 0 conservative for genes with a low expression −3 −6 0 5 10 15 mean log 2 expression Limitation of Fisher test Highly expressed genes have a very large variance! As Fisher test does not estimate variance, it tends to detect false positives among highly expressed genes ⇒ do not use it! NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 46 / 67

  67. Basic principles of tests for count data: 2 conditions and replicates Notations: for gene g , K 1 g 1 , ..., K 1 gn 1 (condition 1) and K 2 g 1 , ..., K 2 gn 2 (condition 2) choose an appropriate distribution to model count data (discrete data, overdispersion) estimate its parameters for both conditions conclude by computing p-value NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 47 / 67

  68. Basic principles of tests for count data: 2 conditions and replicates Notations: for gene g , K 1 g 1 , ..., K 1 gn 1 (condition 1) and K 2 g 1 , ..., K 2 gn 2 (condition 2) choose an appropriate distribution to model count data (discrete data, overdispersion) K k gj ∼ NB ( s k j λ gk , φ g ) in which: ◮ s k j is library correction factor of sample j in condition k ◮ λ gk is the proportion of counts for gene g in condition k ◮ φ g is the (over)dispersion (parameter) of gene g (supposed to be identical for all samples) estimate its parameters for both conditions conclude by computing p-value NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 47 / 67

  69. Basic principles of tests for count data: 2 conditions and replicates Notations: for gene g , K 1 g 1 , ..., K 1 gn 1 (condition 1) and K 2 g 1 , ..., K 2 gn 2 (condition 2) choose an appropriate distribution to model count data (discrete data, overdispersion) K k gj ∼ NB ( s k j λ gk , φ g ) in which: ◮ s k j is library correction factor of sample j in condition k ◮ λ gk is the proportion of counts for gene g in condition k ◮ φ g is the (over)dispersion (parameter) of gene g (supposed to be identical for all samples) estimate its parameters for both conditions λ g 1 λ g 2 φ g conclude by computing p-value NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 47 / 67

  70. Basic principles of tests for count data: 2 conditions and replicates Notations: for gene g , K 1 g 1 , ..., K 1 gn 1 (condition 1) and K 2 g 1 , ..., K 2 gn 2 (condition 2) choose an appropriate distribution to model count data (discrete data, overdispersion) K k gj ∼ NB ( s k j λ gk , φ g ) in which: ◮ s k j is library correction factor of sample j in condition k ◮ λ gk is the proportion of counts for gene g in condition k ◮ φ g is the (over)dispersion (parameter) of gene g (supposed to be identical for all samples) estimate its parameters for both conditions λ g 1 λ g 2 φ g conclude by computing p-value ⇒ Test H0 : { λ g 1 = λ g 2 } NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 47 / 67

  71. First method: Exact Negative Binomial test [Robinson and Smyth, 2008] Normalization is performed to get equal size librairies ⇒ s K 1 g 1 + . . . + K 1 gn 1 ∼ NB ( s λ g 1 , φ g / n 1 ) (and similarly for the second condition) NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 48 / 67

  72. First method: Exact Negative Binomial test [Robinson and Smyth, 2008] Normalization is performed to get equal size librairies ⇒ s K 1 g 1 + . . . + K 1 gn 1 ∼ NB ( s λ g 1 , φ g / n 1 ) (and similarly for the second condition) λ g 1 and λ g 2 are estimated (mean of the distributions) 1 NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 48 / 67

  73. First method: Exact Negative Binomial test [Robinson and Smyth, 2008] Normalization is performed to get equal size librairies ⇒ s K 1 g 1 + . . . + K 1 gn 1 ∼ NB ( s λ g 1 , φ g / n 1 ) (and similarly for the second condition) λ g 1 and λ g 2 are estimated (mean of the distributions) 1 φ g is estimated independently of λ g 1 and λ g 2 , using different 2 approaches to account for small sample size NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 48 / 67

  74. First method: Exact Negative Binomial test [Robinson and Smyth, 2008] Normalization is performed to get equal size librairies ⇒ s K 1 g 1 + . . . + K 1 gn 1 ∼ NB ( s λ g 1 , φ g / n 1 ) (and similarly for the second condition) λ g 1 and λ g 2 are estimated (mean of the distributions) 1 φ g is estimated independently of λ g 1 and λ g 2 , using different 2 approaches to account for small sample size The test is performed similarly as for Fisher test (exact probability is 3 computed according to estimated paramters) NV 2 (INRA) Biostatistique RNA-seq Toulouse, 4/5 novembre 2020 48 / 67

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend