Comparison of Normalization Methods for cDNA Microarrays Liling - - PowerPoint PPT Presentation



slide-1
SLIDE 1

1

Comparison of Normalization Methods for cDNA Microarrays

Liling Warren, Ben Hui Liu

Bioinformatics Program, NCSU, Raleigh, NC
Bio-informatics Group, Inc., Cary, NC

slide-2
SLIDE 2

2

Topics of Discussion

  • Data flow in a microarray experiment
  • Describe different normalization methods
  • Evaluate different normalization methods
  • To normalize or not to normalize
  • Data quality
  • Experimental design
  • Conclusions

slide-3
SLIDE 3

3

Data flow in a microarray experiment

Samples and arrays → Hybridization → Scanned data → Normalized results → Analysis results

slide-4
SLIDE 4

4

Purpose of Data Normalization

To remove systematic errors introduced at various stages of a microarray experiment. Systematic effects include:

  • Array effect
  • Pin/block effect
  • Dye effect (Cy3/Cy5)
  • mRNA extraction effect
  • Dye labeling effect

slide-5
SLIDE 5

5

Systematic Errors – Array Effect

Figure: box plots of log2(s) by array, in two panels (arrays 1, 2, 5, 6, 9, 10, 13, 14, 17, 18, 21, 22 and arrays 3, 4, 7, 8, 11, 12, 15, 16, 19, 20, 23, 24).

Analysis of Variance (first table)

Source      DF      Sum of Squares   Mean Square   F Ratio    Prob > F
array       11      2977.951         270.723       288.6899   0.0000
Error       32757   30718.314        0.938
C. Total    32768   33696.265

Analysis of Variance (second table)

Source      DF      Sum of Squares   Mean Square   F Ratio    Prob > F
array       11      2067.326         187.939       181.4893   0.0000
Error       32754   33917.943        1.036
C. Total    32765   35985.269

Box plots and ANOVA tests show that between-array variation is highly significant (P < 0.0001) using either log ratios or log signal intensity.

slide-6
SLIDE 6

6

Systematic Errors – Block Effect

Box plots and ANOVA tests show that between-block variation is highly significant (P < 0.0001 and P = 0.0002 in the two ANOVA tables) using either log ratios or log signal intensity.

Figure: box plots of log2(s) by block and of M by block (blocks 1 to 16).

Analysis of Variance (first table)

Source      DF     Sum of Squares   Mean Square   F Ratio   Prob > F
block       15     98.3035          6.55357       6.1835    <.0001
Error       2715   2877.4692        1.05984
C. Total    2730   2975.7727

Analysis of Variance (second table)

Source      DF     Sum of Squares   Mean Square   F Ratio   Prob > F
block       15     77.2856          5.15237       2.8905    0.0002
Error       2715   4839.5028        1.78251
C. Total    2730   4916.7885

slide-7
SLIDE 7

7

Systematic Errors – Dye Effect

M vs. A plots reveal the dependency of log ratios on average signal intensity.

$$M = \log_2(R) - \log_2(G)$$

$$A = \tfrac{1}{2}\left(\log_2(R) + \log_2(G)\right)$$

Figure: M vs. A plots for blocks 1, 2, 5 and 6 in array 1 of the Kidney data (x-axis: average log signal intensity; y-axis: log ratio).
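The M and A values defined on this slide can be computed spot by spot. The following is a minimal sketch: the function name `ma_values` and the example intensities are illustrative assumptions, and R and G are taken to be positive, background-corrected red and green intensities.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot.

    M = log2(R) - log2(G)            (log ratio)
    A = (log2(R) + log2(G)) / 2      (average log signal intensity)
    """
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive to take log2")
    log_r, log_g = math.log2(red), math.log2(green)
    return log_r - log_g, 0.5 * (log_r + log_g)


# Illustrative (made-up) intensities, not data from the slides.
spots = [(1024.0, 256.0), (512.0, 512.0), (100.0, 400.0)]
ma = [ma_values(r, g) for r, g in spots]
```

Plotting the M values against the A values for all spots of a block gives the M vs. A plot described above.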

slide-8
SLIDE 8

8

Comparing Normalization Methods

Method #1: log ratio based, local smoothing method using the loess function

Figure: loess fits for array 1 and array 7 (x-axes: a1, a7). Red: M values (m1, m7); green: predicted values (p1, p7); blue: residual values (r1, r7).
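The idea behind Method #1 (predict M from A with a local smoother, then keep the residuals) can be sketched in plain Python. This is a simplified stand-in for a loess fit, not the authors' implementation: it fits a tricube-weighted straight line around each point and omits the robustness iterations that full loess implementations usually add. The function names, the `span` value and the demo data are assumptions.

```python
def tricube(u):
    """Tricube kernel weight for a scaled distance u (zero outside |u| < 1)."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear_fit(a_vals, m_vals, span=0.4):
    """Predicted M at every A from a locally weighted straight line."""
    n = len(a_vals)
    k = min(n, max(3, round(span * n)))      # points in each neighbourhood
    fitted = []
    for a0 in a_vals:
        dists = sorted(abs(a - a0) for a in a_vals)
        h = max(dists[k - 1], 1e-12)         # bandwidth: k-th nearest distance
        sw = swx = swy = swxx = swxy = 0.0
        for a, m in zip(a_vals, m_vals):
            w = tricube((a - a0) / h)
            dx = a - a0
            sw += w
            swx += w * dx
            swy += w * m
            swxx += w * dx * dx
            swxy += w * dx * m
        denom = sw * swxx - swx * swx
        if denom < 1e-12:
            fitted.append(swy / sw)          # degenerate case: weighted mean
        else:
            # Intercept of the weighted least-squares line at dx = 0.
            fitted.append((swxx * swy - swx * swxy) / denom)
    return fitted


# Demo with a purely linear intensity-dependent bias (made-up data):
a = [6.0 + 0.5 * i for i in range(17)]
m = [0.5 * x - 3.0 for x in a]
predicted = local_linear_fit(a, m)
residuals = [mi - pi for mi, pi in zip(m, predicted)]  # normalized log ratios
```

Because the demo bias is exactly linear, the local lines reproduce it and the residuals are zero up to rounding; on real data the residuals play the role of the blue points in the figure.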

slide-9
SLIDE 9

9

Comparing Normalization Methods

Method #2: log ratio based, block-specific global normalization

$$\tilde{y}_{ijk} = (y_{ijk} - \bar{y}_{ij}) / s_{ij}$$

where i = 1, …, 24; j = 1, …, 16; k = 1, …, n_ij, and

  • $\bar{y}_{ij}$: block-specific mean
  • $s_{ij}$: block-specific standard deviation
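The block-specific standardization of Method #2 amounts to centering and scaling the log ratios within each array-by-block group. A minimal sketch follows; the record layout `(array, block, y)` and the function name are assumptions, and the sample standard deviation (n - 1 denominator) is used because the slide does not say which form was applied.

```python
from collections import defaultdict
from statistics import mean, stdev


def standardize_by_block(records):
    """records: list of (array, block, y) tuples with y a log ratio.

    Returns a list of (array, block, y_tilde), where y_tilde is y minus
    the block-specific mean, divided by the block-specific standard deviation.
    Each (array, block) group needs at least two non-identical values.
    """
    groups = defaultdict(list)
    for array, block, y in records:
        groups[(array, block)].append(y)
    stats = {key: (mean(vals), stdev(vals)) for key, vals in groups.items()}
    result = []
    for array, block, y in records:
        y_bar, s = stats[(array, block)]
        result.append((array, block, (y - y_bar) / s))
    return result


# Made-up log ratios for two blocks on one array.
demo = [(1, 1, 1.0), (1, 1, 2.0), (1, 1, 3.0), (1, 2, 10.0), (1, 2, 14.0)]
normalized = standardize_by_block(demo)
```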

slide-10
SLIDE 10

10

Comparing Normalization Methods

Method #3: log ratio based, ANOVA normalization

$$y_{ijklm} = \mu + A_i + B_j + M_k + D_l + (AB)_{ij} + \varepsilon_{ijklm}$$

  • Random effects: A, B, AB
  • Fixed effects: M, D
  • Residuals are subsequently used as input for the gene-based ANOVA model

slide-11
SLIDE 11

11

Methods Omitting Normalization

Method #4: gene-based ANOVA, omitting normalization, using log ratios

$$y_{ijk} = \mu + m_i + d_j + (md)_{ij} + \varepsilon_{ijk}$$

Method #5: gene-based Analysis of Covariance, omitting normalization, using log signal intensity

$$y_{ijk} = \mu + m_i + d_j + (md)_{ij} + \beta (x_{ijk} - \bar{x}_{\cdot\cdot\cdot}) + \varepsilon_{ijk}$$

In Method #5, $y_{ijk}$ is the log signal intensity from the test sample and $x_{ijk}$ is the log signal intensity from the reference sample.

slide-12
SLIDE 12

12

“Project Normal” Data Analysis

Gene-based ANOVA model:

$$y_{ijk} = \mu + m_i + d_j + (md)_{ij} + \varepsilon_{ijk}$$

with i = 1 to 6, j = 1, 2 and k = 1 to 4.

Source       df   MS       EMS
Mouse        5    MSM      $\sigma^2 + 4\tau_m^2$
Dye          1    MSD      $\sigma^2 + 20\tau_d^2$
Mouse*Dye    5    MS(MD)   $\sigma^2 + 4\tau_{md}^2$
Error        12   MSE      $\sigma^2$

The null hypothesis of no mouse effect is tested with

$$F = MSM / MSE$$

with df1 = 5 and df2 = 12.
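The F ratio for the mouse effect can be illustrated for a single gene. The sketch below assumes a complete, balanced layout with the same number of replicate values in every mouse-by-dye cell; the dictionary layout, function name and demo values are assumptions rather than the authors' code or data, and only the statistic (not its P-value) is computed.

```python
from statistics import mean


def mouse_f_statistic(cells):
    """cells maps (mouse, dye) -> list of replicate values for one gene.

    Returns (F, df1, df2) with F = MSM / MSE from the two-way ANOVA
    with interaction, for a complete and balanced design.
    """
    mice = sorted({m for m, _ in cells})
    dyes = sorted({d for _, d in cells})
    n_rep = len(next(iter(cells.values())))
    if len(cells) != len(mice) * len(dyes):
        raise ValueError("every mouse-by-dye cell must be present")
    if n_rep < 2 or any(len(v) != n_rep for v in cells.values()):
        raise ValueError("need the same number (>= 2) of replicates per cell")

    grand = mean(y for v in cells.values() for y in v)
    ss_mouse = 0.0
    for m in mice:
        m_mean = mean(y for d in dyes for y in cells[(m, d)])
        ss_mouse += len(dyes) * n_rep * (m_mean - grand) ** 2
    # Error sum of squares: deviations of replicates from their cell mean.
    ss_error = sum((y - mean(v)) ** 2 for v in cells.values() for y in v)

    df_mouse = len(mice) - 1
    df_error = len(mice) * len(dyes) * (n_rep - 1)
    return (ss_mouse / df_mouse) / (ss_error / df_error), df_mouse, df_error


# Made-up values: 2 mice x 2 dyes x 2 replicates.
demo_cells = {
    (0, 0): [1.0, 3.0], (0, 1): [2.0, 4.0],
    (1, 0): [5.0, 7.0], (1, 1): [6.0, 8.0],
}
f_ratio, df1, df2 = mouse_f_statistic(demo_cells)
```

For the demo values the mouse sum of squares is 32 on 1 degree of freedom and the error sum of squares is 8 on 4, so the ratio is 16.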

slide-13
SLIDE 13

13

Comparison Results

Numbers of genes detected (Kidney data)

            Method 1   Method 2   Method 3   Method 4   Method 5
Method 1    129        315        451        243        182
Method 2    89         275        522        362        318
Method 3    80         155        402        409        410
Method 4    51         78         158        165        174
Method 5    32         42         77         76         85

On the diagonal: number of genes detected by the specific method. Upper triangle: detected by either of the two corresponding methods. Lower triangle: detected by both methods.
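The three parts of this table are tied together by inclusion-exclusion: the number of genes detected by either of two methods equals the sum of the two diagonal counts minus the number detected by both. The short check below uses only the kidney counts from the table; the variable and function names are illustrative.

```python
# Diagonal: genes detected by each method (Kidney data).
detected = {1: 129, 2: 275, 3: 402, 4: 165, 5: 85}

# Lower triangle: genes detected by both methods of a pair.
both = {
    (1, 2): 89, (1, 3): 80, (1, 4): 51, (1, 5): 32,
    (2, 3): 155, (2, 4): 78, (2, 5): 42,
    (3, 4): 158, (3, 5): 77, (4, 5): 76,
}

# Upper triangle: genes detected by either method of a pair.
either = {
    (1, 2): 315, (1, 3): 451, (1, 4): 243, (1, 5): 182,
    (2, 3): 522, (2, 4): 362, (2, 5): 318,
    (3, 4): 409, (3, 5): 410, (4, 5): 174,
}


def union_count(a, b):
    """Genes detected by method a or method b (inclusion-exclusion)."""
    return detected[a] + detected[b] - both[(a, b)]


mismatches = [pair for pair in both if union_count(*pair) != either[pair]]
```

Every pair in the table satisfies the identity, so `mismatches` is empty.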

slide-14
SLIDE 14

14

Power Comparison

Pair-wise power comparison: McNemar’s test

Power rank (highest to lowest): Method 3, Method 2, Method 4, Method 1, Method 5

slide-15
SLIDE 15

15

McNemar’s Test

                         Second Method
First Method    Reject    Accept    Total
Reject          n11       n12       n1.
Accept          n21       n22       n2.
Total           n.1       n.2       N

Test statistic:

$$\chi^2_1 = (n_{12} - n_{21})^2 / (n_{12} + n_{21})$$

Under H0: $\pi_{1+} = \pi_{+1}$. Reject H0 if $\chi^2_1 > 3.84$ at $\alpha = 0.05$.
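As a worked example of the statistic, the discordant counts for Methods 1 and 2 can be derived from the kidney comparison table (129 and 275 genes detected, 89 by both). The sketch below treats "reject" as "gene detected"; the function names are illustrative assumptions.

```python
def discordant_counts(n_first, n_second, n_both):
    """Return (n12, n21): genes detected only by the first method and
    only by the second method, given each method's total and the overlap."""
    return n_first - n_both, n_second - n_both


def mcnemar_chi2(n12, n21):
    """McNemar statistic (n12 - n21)^2 / (n12 + n21), 1 degree of freedom."""
    if n12 + n21 == 0:
        raise ValueError("no discordant pairs: statistic is undefined")
    return (n12 - n21) ** 2 / (n12 + n21)


# Methods 1 and 2, Kidney data, counts taken from the comparison table.
n12, n21 = discordant_counts(129, 275, 89)
chi2 = mcnemar_chi2(n12, n21)
reject_h0 = chi2 > 3.84   # critical value at alpha = 0.05, 1 df
```

Here n12 = 40 and n21 = 186, giving a statistic of about 94.3, which agrees with the value reported for this pair in the McNemar results table.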

slide-16
SLIDE 16

16

McNemar’s Test Results

            Method 1   Method 2   Method 3   Method 4
Method 2    94.3
Method 3    12         22.4
Method 4    6.8        42.6       223.8
Method 5    12.91      130.8      301.77     65.31

Pair-wise power comparisons show that all pairs of methods differ significantly in their power to detect the mouse effect.

slide-17
SLIDE 17

17

Why Do They Differ?

  • Data normalization
  • Data quality issues

Figure labels: genes significant before; genes significant after; genes not significant before; genes not significant after.

slide-18
SLIDE 18

18

Assessing Data Quality

Diagram labels: mice M1 to M6; tissues Kidney, Testis and Liver; Reference Sample.

slide-19
SLIDE 19

19

Assessing Data Quality

Diagram: for mice M1 to M6, each tissue is paired with the reference sample in four replicates (r1 to r4):

  • Kidney Test Sample + Reference Sample
  • Liver Test Sample + Reference Sample
  • Testis Test Sample + Reference Sample

slide-20
SLIDE 20

20

Assessing Data Quality

On a gene-by-gene basis:

$$\sum_{i=1}^{3}\sum_{j=1}^{6}\sum_{k=1}^{4} x_{ijk} = \sum_{i=1}^{3}\sum_{j=1}^{6}\sum_{k=1}^{4} y_{ijk}$$

  • $x_{ijk}$: Reference Samples (72)
  • $y_{ijk}$: Test Samples (72)

slide-21
SLIDE 21

21

Assessing Data Quality

Let

$$r = \sum_{i=1}^{3}\sum_{j=1}^{6}\sum_{k=1}^{4} x_{ijk} \Big/ \sum_{i=1}^{3}\sum_{j=1}^{6}\sum_{k=1}^{4} y_{ijk}$$

Examine the normalization effect within the set of genes where

  1) r < 0.5
  2) r > 2

388 genes in Kidney are significant by at least one method, among which 156 genes have r < 0.5 or r > 2.

Figure: histogram of r for all genes.
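The screening rule on this slide can be sketched per gene. The code assumes r is the total reference signal divided by the total test signal; because the cut-offs 0.5 and 2 are reciprocal, the same genes are flagged if the ratio is taken the other way round. Function names and the example signals are illustrative assumptions, not data from the slides.

```python
def signal_ratio(reference, test):
    """Ratio of summed reference signal to summed test signal for one gene."""
    total_test = sum(test)
    if total_test <= 0:
        raise ValueError("total test signal must be positive")
    return sum(reference) / total_test


def is_suspect(r, low=0.5, high=2.0):
    """True when r falls outside the [low, high] band used on the slide."""
    return r < low or r > high


# Made-up signals for two genes (a real gene would have 72 values per list).
balanced = signal_ratio([100.0, 200.0, 300.0], [150.0, 150.0, 300.0])
skewed = signal_ratio([1000.0, 1200.0], [300.0, 200.0])
flags = [is_suspect(balanced), is_suspect(skewed)]
```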

slide-22
SLIDE 22

22

Normalization Effect

Figure: P-values before and after normalization method 1 (x-axis: P-value before; y-axis: P-value after, method 1). Highlighted groups: genes significant before but not significant after normalization; genes not significant before but significant after normalization.

slide-23
SLIDE 23

23

Assessing Data Quality

  • When (foreground – background) < 0: no hybridization.
  • How about (foreground – background) > 0, but < 100?
  • 388 genes in Kidney are significant by at least one method, among which 124 genes have (foreground – background) < 100. The effect of normalization is examined in these genes.

slide-24
SLIDE 24

24

Assessing Data Quality

Rep #   Signal (test)   Signal (ref)
1       115             152
2       1077            1476
3       58              60
4       409             425

1       965             3453
2       865             2243
4       407             2194
3       94              1471

  • Low signal intensity
  • Low mRNA copy number?
  • Failed hybridization?
  • Due to spotting (if both numbers are small)
  • Due to labeling (one number is small)

slide-25
SLIDE 25

25

Assessing Data Quality

4 3 2 1 4 3 2 1 Rep # 753 572 753 596 77 84 426 525 364 365 6 10 145 575 151 610 Signal (ref) Signal (test)

  • These genes are affected by normalization
  • Affecting other genes in the same normalization group

slide-26
SLIDE 26

26

Normalization Effect

Figure: P-values before and after normalization method 1 (x-axis: P-value before; y-axis: P-value after, method 1). Highlighted groups: genes significant before but not significant after normalization; genes not significant before but significant after normalization.

slide-27
SLIDE 27

27

Examine Normalization Effect

Figure: Log (P-value before / P-value after) plotted against STD within block / STD among mice. Point groups: genes more significant after normalization; genes more significant before normalization; genes significant before and after normalization. Labelled cDNAs: 388019, 463951, 580674, 400713, 514599.

slide-28
SLIDE 28

28

Examine Normalization Effect

cDNA     P-value (before)   P-value (after)   STD among mice   STD within block   STD within block / STD among mice
388019   0.5597             0.000001          0.08             1.26               15.75
580674   0.6563             0.000008          0.08             1.26               15.75
463951   0.7182             0.00005           0.07             1.28               18.29
400713   <0.000001          0.06612           1.17             1.11               0.95
514599   <0.000001          0.03125           1.40             1.28               0.91

slide-29
SLIDE 29

29

Examine Normalization Effect

Diagram: treatment variation (small, large) against systematic variation (small, large). Outcomes of normalization shown:

  • Remove systematic errors
  • Create false positives, false negatives

slide-30
SLIDE 30

30

Method Comparison in Three Tissues

Tissue   Criteria   Method 1   Method 2   Method 3   Method 4   Method 5
Kidney   Raw_P      936        1440       1808       1253       1057
         Bonf_P     63         114        196        73         27
         FDR_P      488        1109       1551       757        441
Liver    Raw_P      464        867        809        705        654
         Bonf_P     12         31         25         1 (*)
         FDR_P      56         328        265        4          1
Testis   Raw_P      853        966        825        3090       3042
         Bonf_P     35         25         24         1956       1163
         FDR_P      272        407        232        3089       3038

(*) Only four values survive for this row; whether the last value, 1, belongs to Method 4 or Method 5 cannot be recovered.
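The slide does not say how the Bonf_P and FDR_P counts were obtained. As a hedged illustration of how such counts differ from raw P-value counts, the sketch below applies a Bonferroni adjustment and, as an assumption, the Benjamini-Hochberg step-up adjustment to a small list of made-up P-values and counts how many fall below 0.05 under each criterion.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted P-values: p * n, capped at 1."""
    n = len(pvals)
    return [min(1.0, p * n) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P-values (step-up, monotone)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def count_below(values, alpha=0.05):
    """Number of genes called significant at level alpha."""
    return sum(1 for v in values if v < alpha)


# Made-up gene-level P-values, not taken from the slides.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.6]
counts = {
    "Raw_P": count_below(raw_p),
    "Bonf_P": count_below(bonferroni(raw_p)),
    "FDR_P": count_below(benjamini_hochberg(raw_p)),
}
```

In general the Bonferroni count can never exceed the FDR count, which in turn can never exceed the raw count; the table shows the same ordering for every tissue and method.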

slide-31
SLIDE 31

31

Array Means in Testis Tissue

Figure: array means in testis tissue, plotted by array (1 to 4) within mouse m (1 to 6).

Analysis of Variance

Source      DF   Sum of Squares   Mean Square   F Ratio   Prob > F
m           5    11.150598        2.23012       7.1844    0.0007
Error       18   5.587398         0.31041
C. Total    23   16.737997

slide-32
SLIDE 32

32

Array Means in Liver and Kidney

Figure: array means in Liver and Kidney (two panels), each plotted by array (1 to 4) within mouse m (1 to 6).

slide-33
SLIDE 33

33

Normalization: to do or not to do

  • Significant systematic errors exist.
  • Normalization aims at removing such systematic errors.
  • To normalize: added noise can create false positives and false negatives.
  • Not to normalize: systematic errors can create false positives and false negatives.

slide-34
SLIDE 34

34

Some Possible Solutions

Quality Control

  • Ensure data quality by using both positive and negative controls
  • Perform multiple independent labeling reactions

Experimental Design

  • Replicate genes within and among blocks such that the block effect can be fit into gene-based ANOVA models
  • Two-stage experiment:
      • Pilot study: estimate variances; conduct power analysis to determine how many replicates, how many samples, etc. are needed for the experiment
      • Large scale experiment

slide-35
SLIDE 35

35

Conclusions

  • Normalization is an important step to remove systematic effects before data analysis.
  • Effective normalization needs to be done after data quality is ensured.
  • QC standards need to be established for large scale microarray experiments to ensure data quality.
  • Experimental design plays a crucial role in both data analysis and making normalization effective.
  • Labeling effect needs to be incorporated into the design.
  • Genes can have normal baseline variations: different positive controls need to be incorporated into experimental designs.

slide-36
SLIDE 36

36

Conclusions

  • Stage I (hypothesis generating): large number of genes, small number of samples
  • Stage II (hypothesis testing): small number of genes, large number of samples
  • Fully balanced experimental designs

slide-37
SLIDE 37

37

Acknowledgements

  • Dr. Bruce Weir
  • Dr. Ross Whetten
  • Dr. Yinghsuan Sun