Using non-parametric methods in the context of multiple testing to determine differentially expressed genes



SLIDE 1

Using non-parametric methods in the context of multiple testing to determine differentially expressed genes

Greg Grant, Elisabetta Manduchi, Chris Stoeckert
Penn Center for Bioinformatics
CAMDA 2000

SLIDE 2

Outline

  • Differential Expression
  • Biological Variability and Replicates
  • Gene Intensity Distributions
    – necessitate nonparametric methods
  • Applications of
    – PaGE
    – t-statistic combined with a permutation algorithm

SLIDE 3

The Dataset

  • Golub et al. (1999), Science, 286:531-537
  • ALL-AML: heterogeneous groups: source (B-cells, T-cells, 4 AML types), sex, success, etc.
  • Focus on B-cells (37 replicates) vs T-cells (9 replicates): combined the training and the test sets
  • Affymetrix
    – single sample hybridization
    – each signal is a composite of hybridizations to probes in a set
    – absent calls

SLIDE 4

Distribution Heterogeneity

SLIDE 5

“Deterministic” differential expression

Identifier: U23852, T-lymphocyte specific protein tyrosine kinase p56lck (lck) aberrant mRNA

[Figure, log scale; panels: B, T, B and T]

SLIDE 6

“Non-deterministic” differential expression

Identifier: M23323, T-cell surface glycoprotein CD3 epsilon chain precursor

[Figure, log scale; panels: B and T, B, T]

SLIDE 7

Absent calls

  • Consequence of including the absent calls: bimodal distributions and non-deterministic differential expression are introduced, which complicates the problem of assigning confidence to predictions of differential expression.
  • Only by including the absent calls do we see the difference in genes we expect to be differentially expressed, such as the T-cell antigen CD7 precursor (Id: D007499).

[Figure, log scale; panels: B and T, B-cell, T-cell]

SLIDE 8

t-statistics and adjusted p-values

Uses the method described in Dudoit et al. (2000):

  • Assigns a t-statistic to each gene.
  • p-values are obtained by permuting the columns rather than assuming t-distributions.
  • Corrects for multiple testing with the Westfall and Young step-down approach.
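The three steps above can be sketched roughly as follows. This is a minimal illustration, not the implementation of Dudoit et al.: the Welch form of the t-statistic, the step-down maxT variant, the function names and the default of 1000 permutations are all assumptions made for the example.

```python
import numpy as np


def welch_t(data, labels):
    """Two-sample Welch t-statistic per gene (rows = genes, columns = samples)."""
    a, b = data[:, labels == 0], data[:, labels == 1]
    var_a = a.var(axis=1, ddof=1) / a.shape[1]
    var_b = b.var(axis=1, ddof=1) / b.shape[1]
    return (b.mean(axis=1) - a.mean(axis=1)) / np.sqrt(var_a + var_b)


def maxt_adjusted_pvalues(data, labels, n_perm=1000, seed=0):
    """Step-down maxT adjusted p-values estimated from column (label) permutations."""
    rng = np.random.default_rng(seed)
    t_obs = np.abs(welch_t(data, labels))
    order = np.argsort(-t_obs)            # genes from most to least extreme
    counts = np.zeros(len(t_obs))
    for _ in range(n_perm):
        # permuting the labels is equivalent to permuting the columns
        t_perm = np.abs(welch_t(data, rng.permutation(labels)))[order]
        # successive maxima: u[k] = max over the genes ranked k, k+1, ..., last
        u = np.maximum.accumulate(t_perm[::-1])[::-1]
        counts += u >= t_obs[order]
    p_sorted = counts / n_perm
    # enforce monotonicity: a more extreme gene never gets a larger p-value
    p_sorted = np.maximum.accumulate(p_sorted)
    p_adj = np.empty_like(p_sorted)
    p_adj[order] = p_sorted
    return p_adj
```

Here `labels` would be an array of 0s (B-cell columns) and 1s (T-cell columns). Genes with zero variance in both groups would need special handling, which this sketch omits.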

SLIDE 9

B-cell vs. T-cell using t-statistic

Column "n,9" = fraction of the 100 comparisons, each between n randomly chosen B-cell experiments and all 9 T-cell experiments, in which the gene is up-regulated in T-cells.
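The resampling behind each "n,9" column can be sketched as below. The function names are hypothetical, and `mean_ratio_caller` is only a toy stand-in for the actual test (t-statistic or PaGE), used here so that the example runs on its own.

```python
import numpy as np


def upregulation_fraction(b_cells, t_cells, n, call_up, n_comparisons=100, seed=0):
    """Per gene, the fraction of comparisons in which `call_up` flags the gene.

    b_cells, t_cells: arrays of shape (genes, replicates).
    call_up: stand-in for the real test; takes two such arrays and returns
             a boolean array with one entry per gene.
    """
    rng = np.random.default_rng(seed)
    hits = np.zeros(b_cells.shape[0])
    for _ in range(n_comparisons):
        # draw n B-cell replicates without replacement, keep all T-cell replicates
        cols = rng.choice(b_cells.shape[1], size=n, replace=False)
        hits += call_up(b_cells[:, cols], t_cells)
    return hits / n_comparisons


def mean_ratio_caller(b, t, cutoff=2.0):
    """Toy caller (an assumption, not the slides' method): mean ratio above a cutoff."""
    return t.mean(axis=1) / b.mean(axis=1) > cutoff
```

Calling `upregulation_fraction(b_cells, t_cells, n=5, call_up=mean_ratio_caller)` would give one such column for n = 5.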

SLIDE 10

PaGE: Patterns from Gene Expression

  • PaGE assigns confidence measures to predictions of differential expression. Handles multiple testing in a nonparametric (and non-standard) way.
    – Does not use t-statistic.
  • Patterns are generated by comparison of groups of replicates to a reference group.
  • See Manduchi et al. (2000), Bioinformatics, 16:685-698.

SLIDE 11

PaGE: outline

  • Find C (the upper cutratio) such that, if a gene is chosen at random from the set of genes which are true negatives, then the probability that $X_{2,i}/X_{1,i} > C$ is small.
  • This C gives a cutoff for making predictions about up-regulation.
  • Similarly for down-regulation (find an appropriate c [lower cutratio], reverse the above inequality).
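As a rough illustration of choosing a cutoff from true-negative ratios, the sketch below picks C from an empirical sample of null ratios. It does not use PaGE's own approximation of the null distribution: as a plainly swapped-in stand-in, the null ratios come from random splits of a single group into two halves. Both function names are hypothetical.

```python
import numpy as np


def null_ratios_from_splits(group, n_splits=200, seed=0):
    """Ratios of half-group means for random splits of one group (genes x replicates).

    Both halves come from the same class, so every ratio is a true-negative ratio.
    """
    rng = np.random.default_rng(seed)
    n = group.shape[1]
    ratios = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        first, second = perm[: n // 2], perm[n // 2:]
        ratios.append(group[:, second].mean(axis=1) / group[:, first].mean(axis=1))
    return np.concatenate(ratios)


def upper_cutratio(null_ratios, false_positive_rate):
    """An observed ratio C with empirical Prob(ratio > C) <= false_positive_rate."""
    sorted_ratios = np.sort(null_ratios)
    k = int(np.ceil((1.0 - false_positive_rate) * len(sorted_ratios))) - 1
    return sorted_ratios[max(k, 0)]
```

A gene would then be predicted as up-regulated when its observed ratio exceeds the returned C; the lower cutratio c could be obtained the same way from the other tail.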

SLIDE 12

PaGE: approximations

The false positive rate is approximated by

$$\mathrm{Prob}\left( \frac{X_{2,i}/\mu_{2,i}}{X_{1,i}/\mu_{1,i}} > C \right)$$

After having shifted all intensities by an appropriate numerical constant, we approximate the unknown distribution of $X_{1,i}/\mu_{1,i}$ by that of

$$1 + \frac{X_{1,i,j}/X_{1,i} - 1}{n_1 - 1}$$

where i varies over the gene tags and j varies over the replicates for group 1. Similarly for group 2.

SLIDE 13

The effect of shifting

Hypothetical data:

  • assuming variance proportional to magnitude. No shift necessary.

Real data:

  • variance greater for low intensities.
  • absent calls increase this effect.
  • a moderate shift compensates and reduces false positives at low and high intensity.
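A small numerical sketch of why a shift damps spurious ratios at low intensity. The intensities and the shift of 100 are made-up values chosen only for illustration, and `shifted_ratio` is a hypothetical helper, not part of PaGE.

```python
def shifted_ratio(x2, x1, shift):
    """Ratio of two intensities after adding the same constant to both."""
    return (x2 + shift) / (x1 + shift)


# Low intensities: a difference of 20 units looks like a 3-fold change unshifted.
low_unshifted = shifted_ratio(30.0, 10.0, shift=0.0)        # 3.0
low_shifted = shifted_ratio(30.0, 10.0, shift=100.0)        # 130/110, about 1.18

# High intensities: the same shift barely changes a 3-fold difference.
high_shifted = shifted_ratio(3000.0, 1000.0, shift=100.0)   # 3100/1100, about 2.82
```

With these numbers, the low-intensity pair no longer passes a cutoff such as 2, while the high-intensity pair still does.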

SLIDE 14
SLIDE 15

The effect of shifting (cont.)

37 B-cell replicates vs 9 T-cell replicates ↓

SLIDE 16

B-cell vs. T-cell using PaGE

Column "n,9" = fraction of the 100 comparisons, each between n randomly chosen B-cell experiments and all 9 T-cell experiments, in which the gene is up-regulated in T-cells.

SLIDE 17

Effect of Number of Replicates on False Positives Due to Biological Subclassing

  • Comparisons of B-cell to B-cell.
    – Any predictions are false positives.
  • Table entries are empirical likelihoods of observing any false positives.
  • False positives are due to noise and/or biological subclassing, with the latter effect diminishing as the number of replicates increases.
  • Confidence was 90%. If PaGE was exact instead of conservative, the numbers in each column would converge to 0.1.
  • Tripling the number of independent genes does not dramatically worsen the multiple testing problem of subclassing.

        No. of Indep. Genes
Reps    1000    3000
5       0.39    0.44
10      0.15    0.19
20      0.10    0.11
30      0.06    0.07
40      0.06    0.02
50      0.03    0.04
SLIDE 18

Summary

  • How many replicates are needed?
    – Gene intensity distributions can be very irregular
    – Noise and multiple testing (False negatives)
        • t-statistic: adding replicates continues to reduce false negatives even at 25 replicates
        • PaGE: Much less conservative
    – Biological variability and multiple testing (False positives)
        • PaGE: Confidence measures assume that the variability of each class is fully represented in the replicates. If a class is very heterogeneous (e.g. B-cells) then many replicates might be needed to avoid over-representing a subclass by chance and therefore introducing false positives.
        • The more homogeneous the group, the fewer replicates are needed.
  • How do findings generalize to other platforms?
SLIDE 19

URLs

  • http://www.cbil.upenn.edu/
  • http://www.cbil.upenn.edu/PaGE
  • http://www.stat.berkeley.edu/users/terry/zarray/html/matt.html (Dudoit et al.)

  • http://www.cbil.upenn.edu/tpWY (implementation of Dudoit et al.)

Acknowledgements

Brian Brunk, Eugen Buehler, Jonathan Crabtree, Sue Davidson, Sharon Diskin, Georgi Kostov, Phillip Le, Joan Mazzarelli, Shannon McWeeney, Colleen Petrelli, Debbie Pinney, Angel Pizarro, Jonathan Schug, Jim Wolff

PCBI