Statistical Models for sequencing data: from Experimental Design to Generalized Linear Models

Best practices in the analysis of RNA-Seq and ChIP-Seq data, 4th-5th May 2017, University of Cambridge, Cambridge, UK.
Oscar M. Rueda, Breast Cancer Functional Genomics Group.


SLIDE 1

Statistical Models for sequencing data: from Experimental Design to Generalized Linear Models

Oscar M. Rueda

Breast Cancer Functional Genomics Group. CRUK Cambridge Research Institute (a.k.a. Li Ka Shing Centre) Oscar.Rueda@cruk.cam.ac.uk

Best practices in the analysis of RNA-Seq and ChIP-Seq data

4th-5th May 2017, University of Cambridge, Cambridge, UK

SLIDE 2

Outline

  • Experimental Design
  • Design and Contrast matrices
  • Generalized linear models
  • Models for counting data

SLIDE 3

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.

Sir Ronald Fisher (1890-1962) [evolutionary biologist, geneticist and statistician]

SLIDE 4

An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.

John Tukey (1915-2000) [Statistician]

SLIDE 5

An unsophisticated forecaster uses statistics as a drunken man uses lamp-posts - for support rather than for illumination.

Andrew Lang (1844-1912) [Poet, novelist and literary critic]

SLIDE 6

Experimental Design

SLIDE 7

Design of an experiment

  • Select the biological questions of interest
  • Identify an appropriate measure to answer that question
  • Select additional variables or factors that can have an influence on the result of the experiment
  • Select a sample size and the sample units
  • Assign samples to lanes/flow cells.

SLIDE 8

Principles of Statistical Design of Experiments

  • R. A. Fisher: replication, blocking, randomization.
  • They have been used in microarray studies from the beginning.
  • Bar coding makes it easy to adapt them to NGS studies.

SLIDE 9

Sampling hierarchy

There are three levels of sampling:

  • Subject sampling
  • RNA sampling
  • Fragment sampling.

Auer and Doerge. Genetics 185:405-416 (2010)

SLIDE 10

Unreplicated Data

Inferences at the RNA and fragment level can be obtained through Fisher's exact test, but they do not reflect biological variability.

Auer and Doerge. Genetics 185:405-416 (2010)

SLIDE 11

Replicated Data

Inferences for the treatment effect using generalized linear models (more on this later).

Auer and Doerge. Genetics 185:405-416 (2010)

Is this a good design? We should randomize within each block!

SLIDE 12

Balanced Block Designs

  • Avoid confounding effects:
    – Lane effects (any errors from the point where the sample is input to the flow cell until the data output). Examples: systematically bad sequencing cycles, errors in base calling…
    – Batch effects (any errors after random fragmentation of the RNA until it is input to the flow cell). Examples: PCR amplification, reverse transcription artifacts…
    – Other effects not related to treatment.

Auer and Doerge. Genetics 185:405-416 (2010)

SLIDE 13

Balanced blocks by multiplexing

Auer and Doerge. Genetics 185:405-416 (2010)

SLIDE 14

Balanced incomplete block design and blocking without multiplexing

  • Usually there are restrictions on the number of treatments, replicates, unique bar codes and available lanes for sequencing. For example:
  • 3 treatments (T1, T2, T3)
  • 1 subject per treatment (T11, T21, T31)
  • 2 technical replicates (T111, T112, T211, T212, T311, T312)

Auer and Doerge. Genetics 185:405-416 (2010)

SLIDE 15

Simulations

Auer and Doerge. Genetics 185:405-416 (2010)

SLIDE 16

Results

Auer and Doerge. Genetics 185:405-416 (2010)

SLIDE 17

Benefits of a proper design

  • NGS benefits from design principles
  • Technical replicates cannot replace biological replicates
  • It is possible to avoid multiplexing with enough biological replicates and sequencing lanes
  • The advantages of multiplexing outweigh the disadvantages (cost, loss of sequencing depth, bar-code bias…)

SLIDE 18

Design and contrast matrices

SLIDE 19

Statistical models

We want to model the expected result of an outcome (dependent variable) under given values of other variables (independent variables):

E(Y) = f(X)
Y = f(X) + ε

where E(Y) is the expected value of the variable Y, f is an arbitrary function (any shape), X is a set of k independent variables (also called factors), and ε is the variability around the expected mean of Y.

SLIDE 20

Design matrix

– Represents the independent variables that have an influence on the response variable, but also the way we have coded the information and the design of the experiment.
– For now, let's restrict ourselves to models of the form

Y = Xβ + ε

where Y is the response variable, X the design matrix, β the parameter vector and ε the stochastic error.

SLIDE 21

Types of designs considered

  • Models with 1 factor
    – Models with two treatments
    – Models with several treatments
  • Models with 2 factors
    – Interactions
  • Paired designs
  • Models with categorical and continuous factors
  • Time course experiments
  • Multifactorial models.

SLIDE 22

Strategy

  • Define our set of samples
  • Define the factors, type of factors (continuous, categorical), number of levels…
  • Define the set of parameters: the effects we want to estimate
  • Build the design matrix, which relates the information that each sample contains about the parameters
  • Estimate the parameters of the model: testing
  • Further estimation (and testing): contrast matrices.

SLIDE 23

Models with 1 factor, 2 levels

Sample     Treatment
Sample 1   Treatment A
Sample 2   Control
Sample 3   Treatment A
Sample 4   Control
Sample 5   Treatment A
Sample 6   Control

Number of samples: 6
Number of factors: 1 (Treatment, with 2 levels)

Possible parameters (what differences are important)?
  • Effect of Treatment A
  • Effect of Control

SLIDE 24

Design matrix for models with 1 factor, 2 levels

The design matrix X has one column per parameter (Treat. A, Control) and one row per sample:

      Treat.A  Control
S1       1        0
S2       0        1
S3       1        0
S4       0        1
S5       1        0
S6       0        1

with parameter vector (T, C)', so E(S) = X (T, C)'.

T is the mean expression of the treatment and C is the mean expression of the control. This model is equivalent to a t-test.
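The design above can be built and fitted in R; the expression values below are made up for illustration:

```r
# Cell-means design with 1 factor, 2 levels (toy expression values).
Treatment <- factor(rep(c("TreatmentA", "Control"), 3))
y <- c(5.1, 3.9, 4.8, 4.1, 5.3, 4.0)   # hypothetical expression values

X <- model.matrix(~ 0 + Treatment)     # one indicator column per level, no intercept
fit <- lm(y ~ 0 + Treatment)

coef(fit)                   # the coefficients are the group means C and T
tapply(y, Treatment, mean)  # identical values
```

Fitting the model with lm() and comparing against the group means makes the "equivalent to a t-test" point concrete: the estimated coefficients are exactly the per-group means.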


SLIDE 26

Intercepts

Different parameterization: using an intercept.

Sample     Treatment
Sample 1   Treatment A
Sample 2   Control
Sample 3   Treatment A
Sample 4   Control
Sample 5   Treatment A
Sample 6   Control

Let's now consider this parameterization: C = baseline expression; TA = baseline expression + effect of treatment. So the set of parameters is:
C = Control (mean expression of the control)
a = TA - Control (mean change in expression under treatment)

SLIDE 27

Intercept

The design matrix now has an intercept column and an indicator for Treatment A:

      Intercept  Treat.A
S1        1         1
S2        1         0
S3        1         1
S4        1         0
S5        1         1
S6        1         0

with parameter vector (β0, a)'. The intercept measures the baseline expression; a now measures the differential expression between Treatment A and Control.

SLIDE 28

Contrast matrices

Are the two parameterizations equivalent? Contrast matrices allow us to estimate (and test) linear combinations of our coefficients. For example, with the cell-means parameterization:

(1  -1) (T, C)' = T - C

applied to the estimated coefficients gives the estimated treatment effect.
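In R, a contrast is applied as a plain linear combination of the fitted coefficients; a minimal sketch with made-up data:

```r
# Estimating the contrast T - C from a cell-means fit (toy data).
Treatment <- factor(rep(c("TreatmentA", "Control"), 3))
y <- c(5.1, 3.9, 4.8, 4.1, 5.3, 4.0)

fit <- lm(y ~ 0 + Treatment)    # coefficients ordered as (Control, TreatmentA)
cont <- c(-1, 1)                # contrast vector: -C + T
est <- sum(cont * coef(fit))    # estimated difference between treatment and control
est
```

Packages such as limma wrap exactly this operation (e.g. makeContrasts and contrasts.fit), including the standard error of the contrast.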

SLIDE 29

Models with 1 factor, more than 2 levels

Sample     Treatment
Sample 1   Treatment A
Sample 2   Treatment B
Sample 3   Control
Sample 4   Treatment A
Sample 5   Treatment B
Sample 6   Control

Number of samples: 6
Number of factors: 1 (Treatment, with 3 levels)

Possible parameters (what differences are important)?
  • Effect of Treatment A
  • Effect of Treatment B
  • Effect of Control
  • Differences between treatments?

These are ANOVA models.


SLIDE 31

Design matrix for ANOVA models

Without intercept (cell means), with parameters (TA, TB, C)':

      TA  TB  C
S1     1   0  0
S2     0   1  0
S3     0   0  1
S4     1   0  0
S5     0   1  0
S6     0   0  1

With intercept, with parameters (β0, a, b)':

      Intercept  a  b
S1        1      1  0
S2        1      0  1
S3        1      0  0
S4        1      1  0
S5        1      0  1
S6        1      0  0

Sample     Treatment
Sample 1   Treatment A
Sample 2   Treatment B
Sample 3   Control
Sample 4   Treatment A
Sample 5   Treatment B
Sample 6   Control

Control = baseline; TA = baseline + a; TB = baseline + b.

SLIDE 32

Baseline levels

The model with intercept always takes one level as a baseline. Here the baseline is Treatment A, with parameters (β0, b, c)':

      Intercept  b  c
S1        1      0  0
S2        1      1  0
S3        1      0  1
S4        1      0  0
S5        1      1  0
S6        1      0  1

The coefficients are comparisons against the baseline! By default, R uses the first level of the factor as the baseline.

SLIDE 33

R code

R code:

> Treatment <- rep(c("TreatmentA", "TreatmentB", "Control"), 2)
> design.matrix <- model.matrix(~ Treatment)       # model with intercept
> design.matrix <- model.matrix(~ -1 + Treatment)  # model without intercept
> design.matrix <- model.matrix(~ 0 + Treatment)   # model without intercept
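A sketch of what these calls actually produce; the column names follow R's default naming, with the variable name pasted to the level name:

```r
# How the parameterization changes the design matrix columns.
Treatment <- rep(c("TreatmentA", "TreatmentB", "Control"), 2)

X1 <- model.matrix(~ Treatment)      # with intercept: first factor level is baseline
X2 <- model.matrix(~ 0 + Treatment)  # without intercept: one column per level

colnames(X1)  # "(Intercept)" plus one column per non-baseline level
colnames(X2)  # one column per treatment level

# Factor levels default to alphabetical order, so "Control" is the baseline here;
# relevel() changes the baseline explicitly.
colnames(model.matrix(~ relevel(factor(Treatment), ref = "TreatmentA")))
```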

SLIDE 34

Exercise

Build contrast matrices for all pairwise comparisons for this design (the model without intercept, with estimated coefficients (TA, TB, C)'):

      TA  TB  C
S1     1   0  0
S2     0   1  0
S3     0   0  1
S4     1   0  0
S5     0   1  0
S6     0   0  1

SLIDE 35

Exercise

For the model without intercept, each pairwise comparison is a row of the contrast matrix applied to (TA, TB, C)':

TA - TB:   ( 1  -1   0 )
TA - C:    ( 1   0  -1 )
TB - C:    ( 0   1  -1 )

SLIDE 36

Exercise

Build contrast matrices for all pairwise comparisons for this design (the model with intercept, with coefficients (β0, a, b)'):

      Intercept  a  b
S1        1      1  0
S2        1      0  1
S3        1      0  0
S4        1      1  0
S5        1      0  1
S6        1      0  0

SLIDE 37

Exercise

For the model with intercept, where a = TA - Control and b = TB - Control, the pairwise comparisons applied to (β0, a, b)' are:

TA - C:    ( 0  1   0 )
TB - C:    ( 0  0   1 )
TA - TB:   ( 0  1  -1 )

SLIDE 38

Models with 2 factors

Sample     Treatment     ER status
Sample 1   Treatment A   +
Sample 2   No Treatment  +
Sample 3   Treatment A   +
Sample 4   No Treatment  +
Sample 5   Treatment A   -
Sample 6   No Treatment  -
Sample 7   Treatment A   -
Sample 8   No Treatment  -

Number of samples: 8
Number of factors: 2 (Treatment, with 2 levels; ER, with 2 levels)

SLIDE 39

Understanding Interactions

[Figure: expression profiles illustrating a Treat effect, an ER effect, both effects, Treat + ER effects, and Treat x ER positive and negative interactions]

          No Treat   Treat A
ER -      S6, S8     S5, S7
ER +      S2, S4     S1, S3

Adapted from Natalie Thorne, Nuno L. Barbosa Morais

SLIDE 40

Models with 2 factors and no interaction

Model with no interaction: only main effects.
Number of coefficients (parameters): intercept + (♯levels Treat - 1) + (♯levels ER - 1) = 3.
If we remove the intercept, the additional parameter comes from the missing level in one of the variables, but in models with more than 1 factor it is a good idea to keep the intercept.

SLIDE 41

Models with 2 factors (no interaction)

      Intercept  a (Treat A)  er+
S1        1          1         1
S2        1          0         1
S3        1          1         1
S4        1          0         1
S5        1          1         0
S6        1          0         0
S7        1          1         0
S8        1          0         0

with parameters (β0, a, er+)'.

R code:
> design.matrix <- model.matrix(~ Treatment + ER)   # model with intercept
In R, the baseline for each variable is the first level.

          No Treat   Treat A
ER -      S6, S8     S5, S7
ER +      S2, S4     S1, S3
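A sketch of this additive design in R; the ER level names ("pos"/"neg") are made up for the example:

```r
# Additive two-factor design: intercept + Treatment effect + ER effect.
Treatment <- factor(rep(c("TreatmentA", "NoTreatment"), 4))
ER <- factor(rep(c("pos", "neg"), each = 4))

design.matrix <- model.matrix(~ Treatment + ER)
colnames(design.matrix)  # intercept, one Treatment column, one ER column
```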


SLIDE 43

Models with 2 factors and interaction

Model with interaction: main effects + interaction.
Number of coefficients (parameters): intercept + (♯levels Treat - 1) + (♯levels ER - 1) + (♯levels Treat - 1) x (♯levels ER - 1) = 4.

SLIDE 44

Models with 2 factors (interaction)

      Intercept  a  er+  a.er+
Y1        1      1   1     1
Y2        1      0   1     0
Y3        1      1   1     1
Y4        1      0   1     0
Y5        1      1   0     0
Y6        1      0   0     0
Y7        1      1   0     0
Y8        1      0   0     0

with parameters (β0, a, er+, a.er+)'. The interaction coefficient a.er+ is the "extra effect" of Treatment A on ER+ samples.

R code:
> design.matrix <- model.matrix(~ Treatment * ER)   # model with intercept

          No Treat   Treat A
ER -      S6, S8     S5, S7
ER +      S2, S4     S1, S3
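A sketch of the interaction design; the extra fourth column is the product of the two main-effect indicators:

```r
# Interaction design: ~ Treatment * ER expands to main effects plus interaction.
Treatment <- factor(rep(c("TreatmentA", "NoTreatment"), 4))
ER <- factor(rep(c("pos", "neg"), each = 4))

additive <- model.matrix(~ Treatment + ER)
with.int <- model.matrix(~ Treatment * ER)

ncol(additive)  # 3 columns
ncol(with.int)  # 4 columns; the extra one is the Treatment:ER product term
```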


SLIDE 46

2 by 3 factorial experiment

  • Identify DE genes that have different time profiles between different mutants. α = time effect, β = strain effect, αβ = interaction effect.

[Figure: four example expression profiles over time (0, 12, 24) for strains A and B: α > 0, β = 0, αβ = 0; α > 0, β > 0, αβ = 0; αβ > 0; α = 0, β > 0, αβ = 0]

Slide by Natalie Thorne, Nuno L. Barbosa Morais

SLIDE 47

Paired Designs

Ignoring the pairing:

Sample     Type
Sample 1   Tumour
Sample 2   Matched Normal
Sample 3   Tumour
Sample 4   Matched Normal
Sample 5   Tumour
Sample 6   Matched Normal
Sample 7   Tumour
Sample 8   Matched Normal

Number of samples: 8
Number of factors: 1 (Type, with 2 levels)

Taking the pairing into account:

Sample     Type
Sample 1   Tumour
Sample 1   Matched Normal
Sample 2   Tumour
Sample 2   Matched Normal
Sample 3   Tumour
Sample 3   Matched Normal
Sample 4   Tumour
Sample 4   Matched Normal

Number of samples: 4
Number of factors: 2 (Sample, with 4 levels; Type, with 2 levels)

SLIDE 48

Design matrix for Paired experiments

We can gain precision in our estimates with a paired design, because individual variability is removed when we compare the effect of the treatment within the same sample.

R code:
> design.matrix <- model.matrix(~ -1 + Type)            # unpaired; model without intercept
> design.matrix <- model.matrix(~ -1 + Sample + Type)   # paired; model without intercept

For the paired model, with parameters (S1, S2, S3, S4, t)':

      S1  S2  S3  S4  t
Y1     1   0   0   0  1
Y2     1   0   0   0  0
Y3     0   1   0   0  1
Y4     0   1   0   0  0
Y5     0   0   1   0  1
Y6     0   0   1   0  0
Y7     0   0   0   1  1
Y8     0   0   0   1  0

The sample effects S1-S4 only reflect biological differences not related to the tumour/normal effect.
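A sketch of both codings in R; the Sample factor in the paired model absorbs the between-individual variability:

```r
# Paired design: one blocking factor per individual plus the tumour/normal effect.
Sample <- factor(rep(paste0("S", 1:4), each = 2))
Type <- factor(rep(c("Tumour", "Normal"), 4), levels = c("Normal", "Tumour"))

unpaired <- model.matrix(~ -1 + Type)
paired <- model.matrix(~ -1 + Sample + Type)

ncol(unpaired)  # 2 columns: one per type
ncol(paired)    # 5 columns: S1..S4 plus a single tumour effect
```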

SLIDE 49

Analysis of covariance (models with categorical and continuous variables)

Sample     ER   Dose
Sample 1   +    37
Sample 2   -    52
Sample 3   +    65
Sample 4   -    89
Sample 5   +    24
Sample 6   -    19
Sample 7   +    54
Sample 8   -    67

Number of samples: 8
Number of factors: 2 (ER, with 2 levels; Dose, continuous)

SLIDE 50

Analysis of covariance (models with categorical and continuous variables)

      Intercept  er+  dose
Y1        1       1    37
Y2        1       0    52
Y3        1       1    65
Y4        1       0    89
Y5        1       1    24
Y6        1       0    19
Y7        1       1    54
Y8        1       0    67

with parameters (β0, er+, d)'.

R code:
> design.matrix <- model.matrix(~ ER + dose)

If we consider the effect of dose linear, we use 1 coefficient (degree of freedom). We can also model it as non-linear (using splines, for example).
SLIDE 51

Analysis of covariance (models with categorical and continuous variables)

      Intercept  er+  dose  er+.dose
Y1        1       1    37      37
Y2        1       0    52       0
Y3        1       1    65      65
Y4        1       0    89       0
Y5        1       1    24      24
Y6        1       0    19       0
Y7        1       1    54      54
Y8        1       0    67       0

with parameters (β0, er+, d, er+.d)'.

Interaction: is the effect of dose equal in ER+ and ER-? If the interaction is significant, the effect of dose is different depending on the level of ER.

R code:
> design.matrix <- model.matrix(~ ER * dose)
SLIDE 52

Time Course experiments

Sample     Time
Sample 1   0h
Sample 1   1h
Sample 1   4h
Sample 1   16h
Sample 2   0h
Sample 2   1h
Sample 2   4h
Sample 2   16h

Number of samples: 2
Number of factors: 2 (Sample, with 2 levels; Time: continuous or categorical?)

Main question: how does expression change over time? If we model time as categorical, we don't make assumptions about its effect, but we use too many degrees of freedom. If we model time as continuous, we use fewer degrees of freedom but we have to make assumptions about the type of effect. Intermediate solution: splines.

SLIDE 53

Time Course experiments: no assumptions

Modelling time as categorical, with parameters (S1, S2, T1, T4, T16)':

      S1  S2  T1  T4  T16
Y1     1   0   0   0   0
Y2     1   0   1   0   0
Y3     1   0   0   1   0
Y4     1   0   0   0   1
Y5     0   1   0   0   0
Y6     0   1   1   0   0
Y7     0   1   0   1   0
Y8     0   1   0   0   1

R code:
> design.matrix <- model.matrix(~ Sample + factor(Time))

We can use contrasts to test differences between time points.

SLIDE 54

Time Course experiments

Modelling time as continuous (linear), with parameters (S1, S2, X)':

      S1  S2  Time
Y1     1   0    0
Y2     1   0    1
Y3     1   0    4
Y4     1   0   16
Y5     0   1    0
Y6     0   1    1
Y7     0   1    4
Y8     0   1   16

R code:
> design.matrix <- model.matrix(~ Sample + Time)

We are assuming a linear effect of time.

[Figure: fitted linear time profiles with a small, a big and a large negative coefficient for time]

Intermediate models are possible: splines.
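The intermediate spline option can be sketched with the base splines package; the choice of 2 degrees of freedom here is an assumption for the example:

```r
library(splines)  # base R package providing ns()

Sample <- factor(rep(c("S1", "S2"), each = 4))
Time <- rep(c(0, 1, 4, 16), 2)

# Natural cubic spline with 2 df for time: fewer parameters than
# factor(Time) (3 df), more flexible than a single linear slope (1 df).
design.matrix <- model.matrix(~ Sample + ns(Time, df = 2))
ncol(design.matrix)  # intercept + Sample + 2 spline columns
```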

SLIDE 55

Multifactorial models

  • We can fit models with many variables
  • Sample size must be adequate for the number of factors
  • The same rules for building the design matrix apply:
    – There will be one column in the design matrix for the intercept
    – Continuous variables with a linear effect will need one column in the design matrix
    – Categorical variables will need ♯levels - 1 columns
    – Interactions will need (♯levels - 1) x (♯levels - 1) columns
  • It is possible to include interactions of more than 2 variables, but the number of samples needed to accurately estimate those interactions is large.

SLIDE 56

Generalized linear models

SLIDE 57

Statistical models

We want to model the expected result of an outcome (dependent variable) under given values of other variables (independent variables):

E(Y) = f(X)
Y = f(X) + ε

where E(Y) is the expected value of the variable Y, f is an arbitrary function (any shape), X is a set of k independent variables (also called factors), and ε is the variability around the expected mean of Y.

SLIDE 58

Fixed vs Random effects

– If we consider an independent variable Xi as fixed, that is, the set of observed values has been fixed by the design, then it is called a fixed factor.
– If we consider an independent variable Xj as random, that is, the set of observed values comes from a realization of a random process, it is called a random factor.
– Models that include random effects are called MIXED MODELS.
– In this course we will only deal with fixed factors.

SLIDE 59

Linear models

– The observed value of Y is a linear combination of the effects of the independent variables.
– If we include categorical variables the model is called a General Linear Model.

E(Y) = β0 + β1X1 + β2X2 + ... + βkXk           (arbitrary number of independent variables)
E(Y) = β0 + β1X1 + β2X1^2 + ... + βpX1^p       (polynomials are valid)
E(Y) = β0 + β1 log(X1) + β2 f(X2) + ... + βkXk (we can use functions of the variables if the effects are linear in the coefficients; smooth functions: not exactly the same as the so-called additive models)

SLIDE 60

Model Estimation

We use least squares estimation: given n observations (y1, ..., yn, x1, ..., xn), minimize the differences between the observed and the predicted values.

[Figure: scatterplot with fitted regression line; boxplots of Y for a Male/Female factor]

SLIDE 61

Model Estimation

Y = Xβ + ε

β: parameter of interest (effect of X on Y)
β̂: estimator of the parameter of interest
se(β̂): standard error of the estimator of the parameter of interest

β̂ = (X'X)^(-1) X'Y
se(β̂i) = σ √ci, where ci is the ith diagonal element of (X'X)^(-1)

ŷ = Xβ̂: fitted values (predicted by the model)
e = y - ŷ: residuals (observed errors)
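The closed-form estimator can be checked against lm() directly; a sketch with simulated data:

```r
# Least squares via the normal equations versus lm() (simulated data).
set.seed(1)
x <- 1:10
y <- 2 + 3 * x + rnorm(10)

X <- cbind(1, x)                                # design matrix with intercept column
beta.hat <- solve(t(X) %*% X) %*% (t(X) %*% y)  # (X'X)^(-1) X'Y

fit <- lm(y ~ x)
cbind(beta.hat, coef(fit))  # the two estimates agree
```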

SLIDE 62

Model Assumptions

In order to conduct statistical inferences on the parameters of the model, some assumptions must be made:

  • The observations 1, ..., n are independent
  • Normality of the errors: εi ~ N(0, σ^2)
  • Homoscedasticity: the variance is constant
  • Linearity.

SLIDE 63

Generalized linear models

– Extension of the linear model to other distributions and non-linearity in the structure (to some degree).
– Y must follow a probability distribution from the exponential family (Bernoulli, Binomial, Poisson, Gamma, Normal, ...).
– Parameter estimation must be performed using an iterative method (IWLS).

g(E(Y)) = Xβ, where g is the link function.

SLIDE 64

Example: Logistic Regression

– We want to study the relationship between the presence of an amplification in the ERBB2 gene and the size of the tumour in a specific type of breast cancer.
– Our dependent variable Y takes two possible values: "AMP", "NORMAL" ("YES", "NO").
– X (size) takes continuous values.

SLIDE 65

Example: Logistic Regression

It is very difficult to see the relationship. Let's model the "probability of success": in this case, the probability of amplification.

[Figure: ERBB2 amplification status (NO/YES) plotted against tumour size (5-20)]

SLIDE 66

Example: Logistic Regression

Some predictions are out of the possible range for a probability.

[Figure: linear fit of the probability of amplification against size, with predicted values outside the interval 0-1]

SLIDE 67

Example: Logistic Regression

We can transform the probabilities to a scale that goes from -Inf to Inf using log odds:

log odds = log( p / (1 - p) )

[Figure: log odds of amplification against size]

SLIDE 68

Example: Logistic Regression

How does this relate to the generalized linear model?

  • Y follows a Bernoulli distribution; it can take two values (YES or NO)
  • The expectation of Y, p, is the probability of YES (E(Y) = p)
  • We assume that there is a linear relationship between size and a function of the expected value of Y: the log odds (the link function)

logodds(prob. amplif) = β0 + β1 Size,  that is,  g(E(Y)) = Xβ
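This model can be fitted with glm(); the data here are simulated, not the slide's breast cancer dataset:

```r
# Logistic regression: Bernoulli response with a logit link (simulated data).
set.seed(42)
size <- runif(100, 5, 20)
p <- plogis(-8 + 0.5 * size)   # assumed true model: logodds = -8 + 0.5 * size
amplified <- rbinom(100, 1, p)

fit <- glm(amplified ~ size, family = binomial)  # logit link by default
coef(fit)                              # estimates on the log-odds scale
head(predict(fit, type = "response"))  # fitted probabilities, always in [0, 1]
```

Unlike the linear fit on the previous slide, the fitted probabilities never leave the 0-1 range.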

SLIDE 69

Binomial Distribution

  • It is the distribution of the number of events in a series of n independent Bernoulli experiments, each with a probability of success p.
  • Y can take integer values from 0 to n
  • E(Y) = np
  • Var(Y) = np(1 - p)

[Figure: binomial distribution with n = 10, p = 0.3]
SLIDE 70

Poisson Distribution

  • Let Y ~ B(n, p). If n is large and p is small then Y can be approximated by a Poisson distribution (law of rare events)
  • Y ~ P(λ)
  • E(Y) = λ
  • Var(Y) = λ

[Figure: Poisson distribution with λ = 2]

SLIDE 71

Negative Binomial Distribution

  • Let Y ~ NB(r, p)
  • Represents the number of successes in a Bernoulli experiment until r failures occur.
  • It is also the distribution of a continuous mixture of Poisson distributions where λ follows a Gamma distribution.
  • It can be seen as an overdispersed Poisson distribution.

In terms of the location parameter µ and the variance σ^2 (the overdispersion):

p = µ / σ^2
r = µ^2 / (σ^2 - µ)

[Figure: negative binomial distribution with r = 10, p = 0.3]
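The overdispersion is easy to see by simulation; R's rnbinom() mu/size parameterization is used here, where size plays the role of r:

```r
# Negative binomial counts: the variance exceeds the mean (overdispersed Poisson).
set.seed(1)
mu <- 10
size <- 2                                # dispersion parameter (r)
y <- rnbinom(1e5, mu = mu, size = size)

mean(y)  # close to mu
var(y)   # close to mu + mu^2/size, well above the mean, unlike a Poisson
```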
SLIDE 72

Hypothesis testing

  • Everything starts with a biological question to test:
    – What genes are differentially expressed under one treatment?
    – What genes are more commonly amplified in a class of tumours?
    – What promoters are methylated more frequently in cancer?
  • We must express this biological question in terms of a parameter in a model.
  • We then conduct an experiment, obtain data and estimate the parameter.
  • How do we take into account uncertainty in order to answer our question based on our estimate?

SLIDE 73

Sampling and testing

A box contains 10% red balls and 90% blue balls. We take a random sample of 10 balls from the box (discrete observations) and observe #red = 3.
When do I think that I am not sampling from this box anymore? How many reds could I expect to get just by chance alone?

Slide by Natalie Thorne, Nuno L. Barbosa Morais

SLIDE 74

A box contains 10% red balls and 90% blue balls. We take a random sample of 10 balls (discrete observations) and observe the test statistic #red = 3.
Null hypothesis: a statement about the population that is being sampled.
Rejection criteria: based on your observed sample, do you have evidence to reject the hypothesis that you sampled from the null population?

Slide by Natalie Thorne, Nuno L. Barbosa Morais

SLIDE 75

Hypothesis testing

  • Null hypothesis: our population follows a (known) distribution defined by a set of parameters: H0: X ~ f(θ1, ..., θk)
  • Take a random sample (X1, ..., Xn) = (x1, ..., xn) and observe the test statistic T(X1, ..., Xn) = t(x1, ..., xn)
  • The distribution of T under H0 is known (g(.))
  • p-value: probability under H0 of observing a result as extreme as t(x1, ..., xn)

SLIDE 76

Observed z-score: there is a high probability of getting a more extreme score just by chance, so the p-value is high!

Slide by Natalie Thorne, Nuno L. Barbosa Morais

SLIDE 77

Observed z-score: there is a low probability of getting a more extreme score just by chance, so the p-value is low. Reject the null hypothesis.

Slide by Natalie Thorne, Nuno L. Barbosa Morais

SLIDE 78

Type I and Type II errors

  • Type I error: probability of rejecting the null hypothesis when it is true. Usually, it is the significance level of the test. It is denoted α.
  • Type II error: probability of not rejecting the null hypothesis when it is false. It is denoted β.
  • Decreasing one type of error increases the other, so in practice we fix the type I error and choose the test that minimizes the type II error.

SLIDE 79

The power of a test

  • The power of a test is the probability of rejecting the null hypothesis at a given significance level when a specific alternative is true.
  • For a given significance level and a given alternative hypothesis in a given test, the power is a function of the sample size.
  • What is the difference between statistical significance and biological significance? With enough sample size we can detect any alternative hypothesis (if the estimator is consistent, its standard error converges to zero as the sample size increases).

[Figure: power of a t-test as a function of sample size; true difference 0.1, sd = 1, significance level 0.05]
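The figure's setting can be reproduced with base R's power.t.test(); treating 0.8 as the target power is an assumption for the example:

```r
# Power of a two-sample t-test: true difference 0.1, sd 1, significance level 0.05.
pw <- power.t.test(n = 2000, delta = 0.1, sd = 1, sig.level = 0.05)
pw$power      # power with 2000 observations per group

# Per-group sample size needed to reach 80% power:
nn <- power.t.test(delta = 0.1, sd = 1, sig.level = 0.05, power = 0.8)
ceiling(nn$n)
```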

SLIDE 80

The Likelihood Ratio Test (LRT)

  • We are working with models, therefore we would like to do hypothesis tests on coefficients or contrasts of those models.
  • We fit two models: M1 without the coefficient to test and M2 with the coefficient.
  • We compute the likelihoods of the two models (L1 and L2) and obtain LRT = -2 log(L1/L2), which has a known distribution under the null hypothesis that the two models are equivalent. This is also known as model selection.
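A sketch of the LRT for a single coefficient, using nested Poisson GLMs on simulated counts:

```r
# Likelihood ratio test between two nested Poisson GLMs (simulated counts).
set.seed(7)
group <- gl(2, 20)                                 # two groups of 20 observations
y <- rpois(40, lambda = ifelse(group == "1", 5, 10))

m1 <- glm(y ~ 1, family = poisson)      # model without the coefficient to test
m2 <- glm(y ~ group, family = poisson)  # model with it

lrt <- as.numeric(-2 * (logLik(m1) - logLik(m2)))
pval <- pchisq(lrt, df = 1, lower.tail = FALSE)  # chi-squared null distribution
c(lrt, pval)
```

anova(m1, m2, test = "Chisq") performs the same comparison from the fitted deviances.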

SLIDE 81

Multiple testing problem

  • In high-throughput experiments we are fitting one model for each gene/exon/sequence of interest, and therefore performing thousands of tests.
  • The overall type I error is not equal to the significance level of each individual test.
  • Multiple test corrections try to fix this problem (Bonferroni, FDR, ...)

SLIDE 82

Distribution of p-values

If the null hypothesis is true, the p-values from the repeated experiments come from a Uniform(0,1) distribution.

[Figure: flat histogram of p-values between 0 and 1]

Slide by Alex Lewin, Ernest Turro, Paul O'Reilly

SLIDE 83

Controlling the number of errors

                                 Null hypothesis true         Alternative hypothesis true   Total
Not significant (don't reject)   ♯True negatives              ♯False negatives (Type II)    N - R
Significant (reject)             ♯False positives (Type I)    ♯True positives               R
Total                            n0                           N - n0                        N

N = number of hypotheses tested; R = number of rejected hypotheses; n0 = number of true null hypotheses.

SLIDE 84

Bonferroni Correction

If the N tests are independent:
P(at least one false positive | all null hypotheses are true) = P(at least one p-value < α | all null hypotheses are true) = 1 - (1 - α)^N.
Bonferroni correction: reject each hypothesis at the α/N level. It is a very conservative method.

SLIDE 85

False Discovery Rate (FDR)

FDR aims to control the proportion of false positives among the rejected null hypotheses.

Family-wise error rate: FWER = P(V ≥ 1)
False discovery rate: FDR = E(V/R | R > 0) P(R > 0)

                                 Null hypothesis true             Alternative hypothesis true   Total
Not significant (don't reject)   ♯True negatives                  ♯False negatives (Type II)    N - R
Significant (reject)             V = ♯False positives (Type I)    ♯True positives               R = ♯Total rejections
Total                            n0                               N - n0                        N

N = number of hypotheses tested; R = number of rejected hypotheses; n0 = number of true null hypotheses.

SLIDE 86

Benjamini & Hochberg (BH) step-up method to control FDR

Benjamini & Hochberg proposed the idea of controlling FDR, and used a step-wise method for controlling it. Order the m p-values as p(1) ≤ ... ≤ p(m).

Step 1: compare the largest p-value to the specified significance level α: if p(m) > α then don't reject the corresponding null hypothesis.
Step 2: compare the second largest p-value to a modified threshold: if p(m-1) > α(m-1)/m then don't reject the corresponding null hypothesis.
Step 3: if p(m-2) > α(m-2)/m then don't reject the corresponding null hypothesis.
...
Stop when a p-value is lower than its modified threshold: that null hypothesis and all the others with smaller p-values are rejected.

Slide by Alex Lewin, Ernest Turro, Paul O’Reilly
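The step-up procedure translates directly into code. A minimal sketch in Python (the p-values below are made up for illustration):

```python
import numpy as np

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: find the largest j with
    p_ord(j) <= alpha * j / m and reject the j hypotheses
    with the smallest p-values."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        j_star = np.max(np.nonzero(passed)[0])   # largest passing index
        reject[order[:j_star + 1]] = True        # reject it and all smaller p-values
    return reject

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(bh_reject(p))  # only the two smallest p-values are rejected at alpha = 0.05
```

Note that p_ord(3) = 0.039 fails its threshold 0.05 × 3/10 = 0.015, so stepping up stops at j* = 2.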

slide-87
SLIDE 87

87

Adjusted p-values for BH FDR

The final threshold on p-values below which all null hypotheses are rejected is α j*/m, where j* is the index of the largest such p-value. BH: comparing p_i to α j*/m is equivalent to comparing m p_i/j* to α, so we can define 'adjusted p-values' as m p_i/j*. These adjusted p-values tell you the level of FDR being controlled (as opposed to the FWER in the Bonferroni and Holm cases).

Slide by Alex Lewin, Ernest Turro, Paul O’Reilly

slide-88
SLIDE 88

Multiple power problem

  • We have another problem, related to the power of each test. Each unit tested has a different test statistic that depends on the variance of its distribution, and this variance is usually different for each gene/transcript, …
  • This means that the probability of detecting a given difference is different for each gene: if a gene has low variability, we will reject the null hypothesis under a smaller difference.
  • Methods that shrink the variance (like the empirical Bayes approach in limma for microarrays) deal with this problem.

88
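As an illustration of variance shrinkage, this sketch uses the moderated-variance formula from limma's empirical Bayes approach, s̃² = (d₀s₀² + d s²)/(d₀ + d); the degrees of freedom and prior value below are made up:

```python
import numpy as np

# Gene-wise sample variances s2, each on d residual degrees of freedom,
# are pulled towards a common prior value s0_sq worth d0 prior df.
d, d0, s0_sq = 4, 3.0, 1.0                 # hypothetical values
s2 = np.array([0.1, 0.8, 1.0, 5.0])        # raw gene-wise variances

s2_tilde = (d0 * s0_sq + d * s2) / (d0 + d)
print(s2_tilde)  # extreme variances are moderated towards the prior 1.0
```

Genes with accidentally tiny variances no longer produce huge test statistics, which stabilises power across genes.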

slide-89
SLIDE 89

Models for counting data

slide-90
SLIDE 90

Microarray expression data

Adapted from slides by Benilton Carvalho

[Figure: densities of raw-scale and log-scale microarray intensities (RG densities)]

log y_ij ~ N(μ_j, σ_j²)

Data are color intensities

90

slide-91
SLIDE 91

Gene    Sample 1   Sample 2
ERBB2   45
MYC     14         23
ESR1    56         2

Sequencing data

Adapted from slides by Benilton Carvalho, Tom Hardcastle

  • Transcript (or sequence, or methylation mark) i in sample j is generated at a rate λij
  • A fragment attaches to the flow cell with a (small) probability pij
  • The number of observed tags yij follows a Poisson distribution with a rate proportional to λij pij

[Figure: density of the number of reads, approximately Poisson]

The variance in a Poisson distribu:on is equal to the mean

91
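The Poisson model above can be checked by simulation. A minimal sketch with a made-up generation rate and attachment probability:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical lambda_ij and p_ij; their product gives the Poisson rate.
lam, p_attach = 2e6, 2e-5
counts = rng.poisson(lam * p_attach, size=100_000)

# For Poisson counts the variance equals the mean (here both ~40),
# which is roughly what purely technical replicates show.
print(counts.mean(), counts.var())
```

Biological replicates, as the next slides show, are more variable than this, motivating the negative binomial.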

slide-92
SLIDE 92

Extra variability

Adapted from slides by Benilton Carvalho

[Figure: squared coefficient of variation]

92

slide-93
SLIDE 93

Negative binomial model for sequencing data

Adapted from slides by Benilton Carvalho

93
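The slide's figure is not reproduced here, but the negative binomial can be motivated as a gamma-Poisson mixture: the Poisson rate itself varies between biological replicates, giving Var = μ + αμ². A sketch with illustrative values μ = 40, α = 0.1:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, alpha = 40.0, 0.1                  # illustrative mean and dispersion

# Gamma-Poisson mixture: the Poisson rate varies between replicates,
# Gamma(shape = 1/alpha, scale = alpha * mu), so E[rates] = mu and the
# resulting counts are negative binomial.
rates = rng.gamma(shape=1 / alpha, scale=alpha * mu, size=200_000)
counts = rng.poisson(rates)

# Variance is mu + alpha * mu^2 = 200 here: well above the mean of 40,
# i.e. overdispersed relative to the Poisson model.
print(counts.mean(), counts.var())
```

The quadratic term αμ² is the biological component of variability that edgeR and DESeq estimate on the following slides.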

slide-94
SLIDE 94

Estimating Overdispersion with edgeR

  • edgeR (Robinson, McCarthy, Chen and Smyth)
  • Total CV² = Technical CV² + Biological CV²
  • Borrows information from all genes to estimate the BCV:

– Common dispersion for all tags
– Empirical Bayes to shrink each tag-wise dispersion towards the common dispersion

Biological CV²: variability in gene abundance between replicates. Technical CV²: decreases with sequencing depth.

94

slide-95
SLIDE 95

Estimating Overdispersion with DESeq

  • DESeq (Anders, Huber)
  • Var = sμ + α s²μ²
  • estimateDispersions():
  • 1. Estimates a dispersion value for each gene
  • 2. Fits a curve through the estimates
  • 3. Each gene gets an estimate between (1) and (2)

μ: expected normalized count value; s: size factor for the sample; α: dispersion of the gene

95

slide-96
SLIDE 96

Reproducibility

Slide by Wolfgang Huber

96

slide-97
SLIDE 97

A small number of genes gets most of the reads

Slide by Wolfgang Huber

97

slide-98
SLIDE 98

Effective library sizes

  • Also called normalization (although the counts themselves are not changed!)
  • We must estimate the effective library size of each sample, so that our counts are comparable between genes and samples
  • Gene lengths?
  • These library sizes are included in the model as an offset (a parameter with a fixed value)

98

slide-99
SLIDE 99

Estimating library size with edgeR

  • edgeR (Robinson, McCarthy, Chen and Smyth)
  • Adjusts for sequencing depth and RNA composition (total RNA output)
  • Chooses a set of genes with the same RNA composition between samples (using the log fold change of normalised counts) after trimming
  • Uses the total reads of that set as the estimate

99

slide-100
SLIDE 100

Estimating library size with DESeq

  • DESeq (Anders, Huber)
  • Adjusts for sequencing depth and RNA composition (total RNA output)
  • Computes the ratio between the log counts in each gene in each sample and the log mean for that gene across all samples
  • The median over all genes is the estimated library size

100
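The median-of-ratios idea can be sketched in a few lines. This is a simplified Python version of the estimator described in Anders & Huber (2010), not the implementation in DESeq's estimateSizeFactors:

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors: for each sample, the median over
    genes of the ratio of that sample's count to the gene's geometric
    mean across samples (genes with rows = genes, columns = samples)."""
    counts = np.asarray(counts, dtype=float)
    log_geo_means = np.mean(np.log(counts), axis=1)   # per-gene log geometric mean
    keep = np.isfinite(log_geo_means)                 # drop genes with any zero count
    log_ratios = np.log(counts[keep]) - log_geo_means[keep, None]
    return np.exp(np.median(log_ratios, axis=0))

# Two samples with identical composition, sample 2 sequenced twice as deep:
counts = np.array([[10, 20], [50, 100], [30, 60], [100, 200]])
print(size_factors(counts))  # factors differ by exactly 2 between the samples
```

Using the median (rather than total counts) makes the estimate robust to the handful of highly expressed genes that, as shown earlier, can dominate a library.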

slide-101
SLIDE 101

References

  • Anders and Huber. Genome Biology, 2010; 11:R106
  • Auer and Doerge. Genetics 2010, 185:405-416
  • Harrell. Regression Modeling Strategies
  • Robles et al. BMC Genomics 2012, 13:484
  • Venables and Ripley. Modern Applied Statistics with S

101