Statistical LeaRning Lydia Mller &Katja Nowick Bioinformatics - PowerPoint PPT Presentation

Statistical LeaRning Lydia Müller &Katja Nowick Bioinformatics group, Markus Kreuz IMISE

Differential Gene Expression Goals of today: Find out which genes are differentially expressed between human and chimpanzee brains • Using microarrays (Affymetrix arrays) • Using RNA-Seq (Illumina sequencing)

Gene Expression Analysis • Measure RNA levels for all or at least thousands of genes at a time for each sample • “Snapshot” approach A piece of tissue typically consists of different cell types Each cell has a unique set of thousands of different RNAs

The biology of gene expression Gene expression: mRNA = messenger RNA  protein ncRNAs = non-coding RNAs active as RNAs All these different RNA “species” are measured by RNA-Seq

The biology of gene expression Gene X consisting of several exons Form A mRNAs of Form B Gene X Form C The different mRNAs produced from gene X are called different transcripts or different isoforms

Dataset for the exercises 5 humans 5 chimpanzees Expression profiles using Affymetrix arrays: HG-U133Plus2 Designed for human samples ~ 54,000 probesets ~21,000 genes

Affymetrix Multiple (11-20) 25-base oligonucleotide probes represent one transcript PM MM

Affymetrix 5´ 3´ Published Gene Sequence Multiple (11-20) 25-base oligonucleotide probes Perfect Match Mismatch PM is exactly complementary to published sequence MM is changed on 13 th base

QC: 1. Affy chip image For Affy arrays, artifacts should be < 10% of image, otherwise summarized probeset values could be affected…

QC: 2. Batch effects • In good experimental design, you compare two groups that only differ in one factor. • Batch effect can occur when subsets of the replicates are handled separately at any stage of the process; handling group becomes in effect another factor. Avoid processing all or most of one factor level together if you can’t do all the samples at once.

QC: 3. RNA degradation • Probes are ordered by location from 5’ to 3’ of transcript • RNA degradation starts from 5’ end  signal intensity should decrease from 3’ to 5’ Slope should be <5

Data Preprocessing 1. Background correction 2. Normalization a. Within array b. Between arrays

Background and probe-specific correction Methods • MAS 5: PM-MM, only if PM>MM – Chip is divided into 16 sections – 2% lowest probes = background – PM and MM probes are corrected against background in their section – If MM>PM, adjusted so that MM<PM • RMA: PM = S + Noise; – Predicts signal based on assumption that it is the sum of the real signal and noise – Uses observed PM probes only – Integrates signal of all probes into model • GCRMA: based on amount + location of GC content of the probes

Normalization • Most methods of normalization assume: – Most genes are not changing – Changes are not all in same direction • This applies to both within-array and between- array normalization

Between-Array Normalization • Necessary when comparing intensity values between arrays • Mas5 does no between array normalization • RMA uses quantile normalization • Process all (and only) arrays of an experiment together! • Each chip will get the same empirical distribution of signal intensities  tends to underestimate fold change

Summarization PM MM • RMA: uses only PM “median polish”: linear model based on all arrays in the study  Expression values on log 2 scale • MAS5: uses PM-MM “1 - step Tukey biweight”, only for individual arrays

MAS 5 Presence Calls • Affymetrix has algorithm to determine presence/absence of transcripts • Despite problems with MM probes, they are useful in determining presence • A data set where all the transcripts present were known indicated that Affy’s P/M/A calls were 85% accurate with only 10% FDR ( Choe et al. 2005, Genome Biology 6: R16) • It makes only sense to analyze expressed transcripts

A word on technical replication… Generally have limited amount of replicates due to cost of the experiments or availability of samples Technical replication is seen by many statisticians as a waste of time and resources because they do not substantially increase your power to detect differences... Biological replicates: include the technical variation and give information on natural/biological variation to determine what amplitude of difference might be biologically relevant

What is different? 7 7 7 6.5 6.5 6.5 6 6 6 5.5 5.5 5.5 5 5 5 4.5 4.5 4.5 FC=2.28 FC=1.14 FC=1.16 4 4 4 P=0.001 P=0.07 P=0.04 3.5 3.5 3.5 3 3 3 7 7 7 6.5 6.5 6.5 6 6 6 5.5 5.5 5.5 5 5 5 4.5 4.5 4.5 FC=1.43 FC=1.81 FC=2.28 4 4 4 P=0.25 P=0.32 P=0.17 3.5 3.5 3.5 3 3 3 Be aware of effect of outliers or biological variation on FC! T-test is a common way to identify differentially expressed genes

Multiple Testing Problem • Most common statistical tests were good in the times before large scale datasets • P=0.01 means if you were repeating the same test 100x you would expect the same outcome only ones (1 in 100) • P<0.01 means you would not expect the same outcome even if you repeated the test 100 x • Problem: With microarrays, you are testing thousands of genes at the same time •  If you had 10,000 genes and none were different, you would expect 100 to have p<0.01 by chance alone

Multiple Testing Solutions (1) Adjust p-values to reflect multiple testing: • Bonferroni correction: – Multiply p-values by number of tests done – Very conservative; greatly increases false negative rate • Benjamini & Hochberg correction: – Less conservative, willing to accept some number of false positives in order to decrease false negative rate • Storey’s q-values – Controls false discovery rate; – Estiamtes height of uniform part of p-value distribution and corrects p-value based on this height Calculate False Discovery Rate: - Try multiple p-value cut-offs and use what you are willing to accept

Multiple Testing Solutions (2) Non-specific filtering to remove genes before analysis – Remove non-expressed genes (e.g. genes not expressed in all samples) – This typically reduces the number of tests by 50-75% genes

Multiple Testing Solutions (3) a priori specification of a set of genes that are likely to be differentially expressed – E.g. from results from other experiments or other prior knowledge – Instead of correcting for all genes, this reduces the multiple testing correction to just that number of a priori defined genes – But don’t define your expectation after analyzing your current dataset or change your expectation after your previous one didn’t work

Limitations of microarrays • Limited to genes for which probes are on the array  no de-novo gene discovery, only limited isoform information • Cross-hybridization • Only available for some species (unless you make custom arrays – expensive!) You can use arrays of highly related species (e.g. human arrays for chimpanzee samples) But regions with sequence difference would not hybridize equally well: X X Probeset Probeset Solutions: U se/create custom CDF files to “mask” probes you don’t want to consider Determine statistically the probes with unequal behavior between samples (Dannemann et al. 2009) RNA-Seq

RNA-Seq Quantification of gene expression 1. Biochemical amplification of all RNAs 2. Fragmentation of all RNAs 3. Biochemical conversion of all RNAs into cDNA 4. Sequencing of all cDNAs  Short reads (e.g. 100 nt long) 5. Mapping of reads to a reference genome / genes EXON EXON EXON EXON 6. Count how many reads map to the gene to determine its expression level Quantification can be done at the level of genes, transcripts, exons

Quantification of gene expression Long gene EXON EXON EXON EXON Short gene Has fewer counts EXON EXON EXON EXON EXON EXON Need to normalize for gene / transcript length!

RNA-Seq vs Microarray differential expression With the costs for NGS decreasing, RNA-Seq will replace microarrays Microarrays Continues numbers Transcript abundance: Is linearly related to the fluorescence level recovered after hybridization RNA-Seq Counts of reads (discrete numbers) Transcript abundance: Counts are considered to be linearly related to transcript abundance Calculate expression values for genes, transcripts (i.e. different isoforms), or exons?  Differential gene, transcript or exon expression?

Statistical LeaRning Lydia Mller &Katja Nowick Bioinformatics - PowerPoint PPT Presentation

Statistical LeaRning Lydia Mller &Katja Nowick Bioinformatics group, Markus Kreuz IMISE Differential Gene Expression Goals of today: Find out which genes are differentially expressed between human and chimpanzee brains Using

Statistical Statistical Statistical Model Statistical Model Model Checking Model Checking

Foundations of AI Why learning works 1 6 . Statistical Machine Learning Bayesian Learning and

Day 1: Introduction to Statistical Learning Lucas Leemann Essex Summer School Introduction to

Statistical graphics with Statistical graphics with ggplot2 ggplot2 Programming for Statistical

COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 23. PGM

Statistical Natural Language Processing Statistical models: learning, inference, estimation,

STK-IN4300 Statistical Learning Methods in Data Science Statistical Boosting Boosting as a

COMPLETE STATISTICAL THEORY OF LEARNING LEARNING USING STATISTICAL INVARIANTS Vladimir Vapnik

Statistical presentation Statistical presentation Statistical tabulations by age, sex and 3 digit

EFTA Statistical Cooperation & the European Statistical System EEA Seminar EEA Seminar

EFTA Statistical Cooperation & the European Statistical System EEA Seminar EEA Seminar

13 Jan, 2011 Statistical Literacy: Confounding UTSA Confounding 2011 1 2011 2 Statistical

STAT 401A - Statistical Methods for Research Workers Statistical Inference Jarad Niemi (Dr. J)

Statistics 435/535 Statistical Methods for Quality and Productivity Improvement / Statistical

Nov 2010 Statistical Literacy: Harper's Magazine Fall 2010 1 Fall 2010 2 Statistical

Statistical Machine Translation George Foster George Foster Statistical Machine Translation A

THE IMPACT OF SOCIAL SUPPORT ON KIDNEY TRANSPLANT LISTING DECISIONS: RESULTS FROM A DISCRETE

A Team Approach: Navigating the Journey to Kidney Transplant May 17, 2018 MEET OUR PANELISTS

Kidney Transplantation Committee Fall 2015 1 Policy & Guidance Implementation Dates Title

Approaching Evaluation Approaching Evaluation Using the Milestones: Using the Milestones: Step Away

Presentation at PROBUS North Club 9 April 2018 ~ 9:30 a.m. Photo Source:

PROBUS, Vancouver, B.C. 10 a.m., Tuesday, 10 January 2017 Photo Source:

Conference Remarks on Food security: International Experience and Cooperation by Sohail J.

What Public School Leaders Ought to Know About Recent Legislation and Related Legal Developments

Sambuz

Useful Links

Newsletter

Mail Us

Statistical LeaRning Lydia Mller &Katja Nowick Bioinformatics - PowerPoint PPT Presentation

Statistical LeaRning Lydia Mller &Katja Nowick Bioinformatics group, Markus Kreuz IMISE Differential Gene Expression Goals of today: Find out which genes are differentially expressed between human and chimpanzee brains Using

Statistical Statistical Statistical Model Statistical Model Model Checking Model Checking

Foundations of AI Why learning works 1 6 . Statistical Machine Learning Bayesian Learning and

Day 1: Introduction to Statistical Learning Lucas Leemann Essex Summer School Introduction to

Statistical graphics with Statistical graphics with ggplot2 ggplot2 Programming for Statistical

COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 23. PGM

Statistical Natural Language Processing Statistical models: learning, inference, estimation,

STK-IN4300 Statistical Learning Methods in Data Science Statistical Boosting Boosting as a

COMPLETE STATISTICAL THEORY OF LEARNING LEARNING USING STATISTICAL INVARIANTS Vladimir Vapnik

Statistical presentation Statistical presentation Statistical tabulations by age, sex and 3 digit

EFTA Statistical Cooperation &amp; the European Statistical System EEA Seminar EEA Seminar

EFTA Statistical Cooperation &amp; the European Statistical System EEA Seminar EEA Seminar

13 Jan, 2011 Statistical Literacy: Confounding UTSA Confounding 2011 1 2011 2 Statistical

STAT 401A - Statistical Methods for Research Workers Statistical Inference Jarad Niemi (Dr. J)

Statistics 435/535 Statistical Methods for Quality and Productivity Improvement / Statistical

Nov 2010 Statistical Literacy: Harper's Magazine Fall 2010 1 Fall 2010 2 Statistical

Statistical Machine Translation George Foster George Foster Statistical Machine Translation A

THE IMPACT OF SOCIAL SUPPORT ON KIDNEY TRANSPLANT LISTING DECISIONS: RESULTS FROM A DISCRETE

A Team Approach: Navigating the Journey to Kidney Transplant May 17, 2018 MEET OUR PANELISTS

Kidney Transplantation Committee Fall 2015 1 Policy &amp; Guidance Implementation Dates Title

Approaching Evaluation Approaching Evaluation Using the Milestones: Using the Milestones: Step Away

Presentation at PROBUS North Club 9 April 2018 ~ 9:30 a.m. Photo Source:

PROBUS, Vancouver, B.C. 10 a.m., Tuesday, 10 January 2017 Photo Source:

Conference Remarks on Food security: International Experience and Cooperation by Sohail J.

What Public School Leaders Ought to Know About Recent Legislation and Related Legal Developments

Sambuz

Useful Links

Newsletter

Mail Us

EFTA Statistical Cooperation & the European Statistical System EEA Seminar EEA Seminar

EFTA Statistical Cooperation & the European Statistical System EEA Seminar EEA Seminar

Kidney Transplantation Committee Fall 2015 1 Policy & Guidance Implementation Dates Title