Statistical Models for sequencing data: from Experimental Design to Generalized Linear Models

Best practices in the analysis of RNA-Seq data
28th-29th March 2018, University of Cambridge, Cambridge, UK
Statistical Models for sequencing data: from Experimental Design to Generalized Linear Models
Oscar M. Rueda
Breast Cancer Functional Genomics Group, CRUK Cambridge Research Institute (a.k.a. Li Ka Shing Centre)
Oscar.Rueda@cruk.cam.ac.uk

Outline
• Experimental Design
• Design and Contrast matrices
• Generalized linear models
• Models for counting data

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.
Sir Ronald Fisher (1890-1962) [evolutionary biologist, geneticist and statistician]

An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.
John Tukey (1915-2000) [statistician]

An unsophisticated forecaster uses statistics as a drunken man uses lamp-posts - for support rather than for illumination.
Andrew Lang (1844-1912) [poet, novelist and literary critic]

Experimental Design

Design of an experiment
• Select the biological questions of interest.
• Identify an appropriate measure to answer each question.
• Select additional variables or factors that can influence the result of the experiment.
• Select a sample size and the sample units.
• Assign samples to lanes/flow cells.

Principles of Statistical Design of Experiments
• R. A. Fisher:
– Replication
– Blocking
– Randomization
• They have been used in microarray studies from the beginning.
• Barcoding makes it easy to adapt them to NGS studies.

Unreplicated Data
Inferences at the RNA and fragment level can be obtained through Fisher's test, but they do not reflect biological variability.
Auer and Doerge. Genetics 185:405-416 (2010)
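As a rough sketch of what a Fisher test on unreplicated data could look like, the block below builds a 2x2 table for one gene (reads mapped to the gene versus all other reads, in each of two conditions) and computes a two-sided exact p-value. The counts and the function names are invented for illustration, and the two-sided rule used here (summing all tables no more probable than the observed one) is one common convention, assumed rather than taken from the slides.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def log_prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = log_prob(a)
    # Add up every table that is no more probable than the observed one.
    p_value = sum(
        exp(log_prob(x))
        for x in range(lowest, highest + 1)
        if log_prob(x) <= observed + 1e-7
    )
    return min(1.0, p_value)


# Hypothetical counts: rows are (gene of interest, all other genes),
# columns are (Treatment A lane, Control lane).
gene_a, gene_c = 30, 10
other_a, other_c = 970, 990
p = fisher_exact_two_sided(gene_a, gene_c, other_a, other_c)
print(f"p-value for the gene: {p:.4g}")
```

With a single lane per condition, this p-value only reflects sampling noise in the read counts, which is the limitation the slide points out.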

Replicated Data
Inferences for the treatment effect are obtained using generalized linear models.
Is this a good design? (More on this later.) We should randomize within each block!
Auer and Doerge. Genetics 185:405-416 (2010)

Balanced Block Designs
• Avoid confounding effects:
– Lane effects (any errors from the point where the sample is input to the flow cell until the data output). Examples: systematically bad sequencing cycles, errors in base calling…
– Batch effects (any errors after random fragmentation of the RNA until it is input to the flow cell). Examples: PCR amplification, reverse transcription artifacts…
– Other effects not related to treatment.
Auer and Doerge. Genetics 185:405-416 (2010)

Balanced blocks by multiplexing
Auer and Doerge. Genetics 185:405-416 (2010)

Benefits of a proper design
• NGS benefits from design principles.
• Technical replicates cannot replace biological replicates.
• It is possible to avoid multiplexing with enough biological replicates and sequencing lanes.
• The advantages of multiplexing outweigh the disadvantages (cost, loss of sequencing depth, bar-code bias…).
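One way to picture blocking plus randomization is the sketch below, where each lane is treated as a block that receives exactly one sample from every treatment, and both the choice of sample and the order within the lane are randomized. The function name, the sample labels and the "one sample per treatment per lane" layout are assumptions made for illustration; they are not the specific layout from Auer and Doerge.

```python
import random


def randomized_block_design(groups, seed=None):
    """Assign one sample per treatment to each block (lane), at random.

    groups maps a treatment name to its list of biological replicates.
    Returns a list of blocks; each block is a list of (treatment, sample).
    """
    rng = random.Random(seed)
    sizes = {len(samples) for samples in groups.values()}
    if len(sizes) != 1:
        raise ValueError("a balanced design needs equal replicates per treatment")
    n_blocks = sizes.pop()

    # Shuffle the replicates of each treatment independently.
    shuffled = {}
    for treatment, samples in groups.items():
        order = list(samples)
        rng.shuffle(order)
        shuffled[treatment] = order

    blocks = []
    for i in range(n_blocks):
        block = [(treatment, shuffled[treatment][i]) for treatment in groups]
        rng.shuffle(block)  # randomize the order within the block
        blocks.append(block)
    return blocks


# Hypothetical samples: three biological replicates per condition.
groups = {
    "Treatment A": ["A1", "A2", "A3"],
    "Control": ["C1", "C2", "C3"],
}
lanes = randomized_block_design(groups, seed=1)
for lane_number, block in enumerate(lanes, start=1):
    print(f"Lane {lane_number}: {block}")
```

Because every lane contains every treatment, a lane effect shifts both conditions together and is not confounded with the treatment effect.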

Design and contrast matrices

Statistical models
– We want to model the expected result of an outcome (dependent variable) under given values of other variables (independent variables):

\[ E(Y) = f(X) \]

where E(Y) is the expected value of the variable Y, f is an arbitrary function (any shape) and X is a set of k independent variables (also called factors).

\[ Y = f(X) + \varepsilon \]

where \varepsilon is the variability around the expected mean of Y.

Design matrix
– Represents the independent variables that influence the response variable, but also the way we have coded the information and the design of the experiment.
– For now, let's restrict ourselves to linear models:

\[ Y = X\beta + \varepsilon \]

where Y is the response variable, X is the design matrix, \beta is the parameter vector and \varepsilon is the stochastic error.

Types of designs considered
• Models with 1 factor
– Models with two treatments
– Models with several treatments
• Models with 2 factors
– Interactions
• Paired designs
• Models with categorical and continuous factors
• Time-course experiments
• Multifactorial models

Strategy
• Define our set of samples.
• Define the factors, the type of each factor (continuous, categorical), the number of levels…
• Define the set of parameters: the effects we want to estimate.
• Build the design matrix, which relates the information that each sample contains to the parameters.
• Estimate the parameters of the model: testing.
• Further estimation (and testing): contrast matrices.

Models with 1 factor, 2 levels

Sample     Treatment
Sample 1   Treatment A
Sample 2   Control
Sample 3   Treatment A
Sample 4   Control
Sample 5   Treatment A
Sample 6   Control

Number of samples: 6
Number of factors: 1 (Treatment)
Number of levels: 2
Possible parameters (what differences are important?):
- Effect of Treatment A
- Effect of Control

Design matrix for models with 1 factor, 2 levels
Samples 1 to 6 alternate between Treatment A and Control, as listed above. The columns of the design matrix correspond to Treat. A and Control; the parameters (coefficients) are the levels of the variable.

\[
\begin{pmatrix} S_1 \\ S_2 \\ S_3 \\ S_4 \\ S_5 \\ S_6 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} T \\ C \end{pmatrix}
\]

T is the mean expression of the treatment.
C is the mean expression of the control.
Equivalent to a t-test.
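To make the link between the design matrix and the estimated parameters concrete, here is a small sketch that builds the matrix from the treatment labels and solves the least-squares normal equations. The expression values in y are invented, the helper names are hypothetical, and ordinary least squares is used here only as the simplest illustration of a linear model (count data would call for the generalized linear models discussed later).

```python
def design_matrix(labels, levels):
    """One indicator column per level (no intercept)."""
    return [[1.0 if label == level else 0.0 for level in levels] for label in labels]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def least_squares(X, y):
    """Estimate beta from the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * value for row, value in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


labels = ["Treatment A", "Control"] * 3   # Samples 1 to 6
levels = ["Treatment A", "Control"]       # column order: T, C
y = [5.0, 3.0, 6.0, 2.0, 7.0, 4.0]        # made-up expression values

X = design_matrix(labels, levels)
T_hat, C_hat = least_squares(X, y)
print("T =", T_hat, "C =", C_hat)
```

With this coding each coefficient is simply the mean of its group, which is why the comparison of T and C reduces to a t-test.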


Intercepts
Different parameterization: using an intercept
Let's now consider this parameterization for the same six samples:
C = Baseline expression
T_A = Baseline expression + effect of treatment
So the set of parameters is:
C = Control (mean expression of the control)
a = T_A - Control (mean change in expression under treatment)

Intercept
Different parameterization: using an intercept
The columns of the design matrix are now Intercept and Treatment A; the parameters (coefficients) are \beta_0 and a.

\[
\begin{pmatrix} S_1 \\ S_2 \\ S_3 \\ S_4 \\ S_5 \\ S_6 \end{pmatrix}
=
\begin{pmatrix} 1 & 1 \\ 1 & 0 \\ 1 & 1 \\ 1 & 0 \\ 1 & 1 \\ 1 & 0 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ a \end{pmatrix}
\]

The intercept \beta_0 measures the baseline expression.
a now measures the differential expression between Treatment A and Control.
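A quick way to see how the intercept coding relates to the group-means coding is to write the change of parameters as a matrix M with (β0, a) = M (T, C). The sketch below checks that the two design matrices then describe the same fitted values. The numeric means are invented and the names are hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Group-means coding: columns are (Treatment A, Control).
X_means = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]]
# Intercept coding: columns are (Intercept, Treatment A).
X_intercept = [[1, 1], [1, 0], [1, 1], [1, 0], [1, 1], [1, 0]]

# Change of parameters: beta_0 = C and a = T - C.
M = [[0, 1],
     [1, -1]]

# The two codings span the same model: X_intercept * M equals X_means.
same_model = matmul(X_intercept, M) == X_means

# Made-up group means, written as a column vector (T, C).
beta_means = [[6.0], [3.0]]
beta_intercept = matmul(M, beta_means)           # (beta_0, a)
fitted_means = matmul(X_means, beta_means)
fitted_intercept = matmul(X_intercept, beta_intercept)

print("same model:", same_model)
print("beta_0, a:", beta_intercept)
```

Only the meaning of the coefficients changes between the two codings; the fitted value for every sample is identical.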

Contrast matrices
Are the two parameterizations equivalent?

\[
\begin{pmatrix} 1 & -1 \end{pmatrix}
\begin{pmatrix} \hat{T} \\ \hat{C} \end{pmatrix}
= \hat{T} - \hat{C}
\]

The row vector (1  -1) is the contrast matrix.
Contrast matrices allow us to estimate (and test) linear combinations of our coefficients.
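The sketch below shows one way a contrast could be estimated and tested in the group-means model, using the standard linear-model result Var(c'β̂) = σ² c'(X'X)⁻¹c, where (X'X)⁻¹ is diagonal with entries 1/n for each group. The data are invented, the function name is hypothetical, and the Gaussian linear model is assumed purely for illustration.

```python
from math import sqrt


def contrast_t(y, labels, levels, contrast):
    """Estimate a contrast of group means and its t statistic.

    Returns (estimate, standard_error, t_statistic, degrees_of_freedom).
    """
    groups = {
        level: [value for value, label in zip(y, labels) if label == level]
        for level in levels
    }
    means = [sum(groups[level]) / len(groups[level]) for level in levels]
    estimate = sum(c * m for c, m in zip(contrast, means))

    # Pooled residual variance around the group means.
    rss = sum((v - m) ** 2 for level, m in zip(levels, means) for v in groups[level])
    df = len(y) - len(levels)
    sigma2 = rss / df

    # c'(X'X)^{-1}c reduces to sum(c_g^2 / n_g) in the group-means model.
    scale = sum(c ** 2 / len(groups[level]) for c, level in zip(contrast, levels))
    se = sqrt(sigma2 * scale)
    return estimate, se, estimate / se, df


labels = ["Treatment A", "Control"] * 3
levels = ["Treatment A", "Control"]
y = [5.0, 3.0, 6.0, 2.0, 7.0, 4.0]   # made-up expression values

estimate, se, t_stat, df = contrast_t(y, labels, levels, contrast=[1, -1])
print(f"T - C = {estimate:.3f}, SE = {se:.3f}, t = {t_stat:.3f} on {df} df")
```

For the contrast (1, -1) this is the pooled-variance two-sample t statistic, so it gives the same test as the coefficient a in the intercept coding.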

Models with 1 factor, more than 2 levels
ANOVA models

Sample     Treatment
Sample 1   Treatment A
Sample 2   Treatment B
Sample 3   Control
Sample 4   Treatment A
Sample 5   Treatment B
Sample 6   Control

Number of samples: 6
Number of factors: 1 (Treatment)
Number of levels: 3
Possible parameters (what differences are important?):
- Effect of Treatment A
- Effect of Treatment B
- Effect of Control
- Differences between treatments?

Design matrix for ANOVA models
Samples 1 to 6 are Treatment A, Treatment B, Control, Treatment A, Treatment B, Control.

Without an intercept (one parameter per group mean):

\[
\begin{pmatrix} S_1 \\ S_2 \\ S_3 \\ S_4 \\ S_5 \\ S_6 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} T_A \\ T_B \\ C \end{pmatrix}
\]

With an intercept (Control as baseline):

\[
\begin{pmatrix} S_1 \\ S_2 \\ S_3 \\ S_4 \\ S_5 \\ S_6 \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ a \\ b \end{pmatrix}
\]
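Both codings can be generated from the treatment labels, as in the sketch below. It also shows how the question "differences between treatments?" becomes a contrast in each coding. The function names and the numeric group means are made up for illustration.

```python
def cell_means_design(labels, levels):
    """One indicator column per level, no intercept."""
    return [[1 if label == level else 0 for level in levels] for label in labels]


def intercept_design(labels, levels, reference):
    """Intercept column plus one indicator per non-reference level."""
    others = [level for level in levels if level != reference]
    return [[1] + [1 if label == level else 0 for level in others] for label in labels]


labels = ["Treatment A", "Treatment B", "Control"] * 2   # Samples 1 to 6
levels = ["Treatment A", "Treatment B", "Control"]

X_means = cell_means_design(labels, levels)
X_intercept = intercept_design(labels, levels, reference="Control")

# Made-up group means, to compare the two codings.
T_A, T_B, C = 6.0, 4.0, 3.0
beta_means = [T_A, T_B, C]
beta_intercept = [C, T_A - C, T_B - C]   # (beta_0, a, b)

# "Treatment A versus Treatment B" as a contrast in each coding.
contrast_means = [1, -1, 0]      # T_A - T_B
contrast_intercept = [0, 1, -1]  # a - b
diff_means = sum(c * b for c, b in zip(contrast_means, beta_means))
diff_intercept = sum(c * b for c, b in zip(contrast_intercept, beta_intercept))

for row_means, row_intercept in zip(X_means, X_intercept):
    print(row_means, row_intercept)
print("A - B:", diff_means, diff_intercept)
```

The intercept drops out of the A versus B comparison, so the contrast (0, 1, -1) on (β0, a, b) estimates the same quantity as (1, -1, 0) on (T_A, T_B, C).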
