FARMS: a probabilistic latent variable model for summarizing - - PowerPoint PPT Presentation


FARMS: a probabilistic latent variable model for summarizing Affymetrix array data at probe level. Djork-Arné Clevert, Sepp Hochreiter (Institute of Bioinformatics, Johannes Kepler University Linz); Willem Talloen, An De Bond, Hinrich Göhlmann


slide-1
SLIDE 1

FARMS: a probabilistic latent variable model for summarizing Affymetrix array data at probe level

Djork-Arné Clevert, Sepp Hochreiter

Institute of Bioinformatics, Johannes Kepler University Linz

Willem Talloen, An De Bond, Hinrich Göhlmann

Johnson & Johnson Pharmaceutical Research & Development, a division of Janssen Pharmaceutica n.v., Beerse, Belgium

slide-2
SLIDE 2

Overview

  • Introduction
  • Microarray technology
  • Model & assumption
  • Data sets & experiments
  • Results
  • FARMS I/NI-Calls
  • Results
  • Conclusion

slide-3
SLIDE 3

Microarrays

  • Microarrays simultaneously measure the cellular concentrations of thousands of mRNAs
  • mRNA concentration ~ activity of a gene
  • Activity of a gene = expression level
  • Basis for functional genome analysis

slide-4
SLIDE 4

Affymetrix technology

[Figure: diagram labels not recoverable from the extracted text]

slide-5
SLIDE 5

Microarray design

mRNA reference sequence (ends labelled 3' and 5')

Probe set: probe 4, probe 5

slide-6
SLIDE 6

Microarray design

mRNA reference sequence (ends labelled 3' and 5'):

TGTGATGGTGGGAATGGGTCAGAAGGACTCCTATGTGGGTGACGAGGCC

Probes:

TTACCCAGTCTTCCTGAGGATACAC   perfect match probe
TTACCCAGTCTTGCTGAGGATACAC   mismatch probe

Probe set: probe 4, probe 5

slide-7
SLIDE 7

Microarray design

mRNA reference sequence (ends labelled 3' and 5'):

TGTGATGGTGGGAATGGGTCAGAAGGACTCCTATGTGGGTGACGAGGCC

Probes:

TTACCCAGTCTTCCTGAGGATACAC   perfect match probe
TTACCCAGTCTTGCTGAGGATACAC   mismatch probe

Probe set: probe 4, probe 5

[Figure: fluorescence intensity image showing perfect match reporters and mismatch reporters for probe 4 and probe 5 of the probe set]
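
The relationship between the reference sequence and the two probes on this slide can be reproduced in a few lines. This is a minimal sketch, not Affymetrix's probe design procedure: the helper names and the offset `start=12` are assumptions, chosen so that the 25-base segment lines up with the probe printed on the slide, and the probe is written as the base-by-base complement of the reference segment, following the slide's layout. The mismatch probe differs from the perfect match probe only at the middle (13th) base, which is replaced by its complement.

```python
# Watson-Crick base pairing for DNA letters.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq):
    """Return the base-by-base complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in seq)


def perfect_match(reference, start, length=25):
    """PM probe: complement of a 25-base window of the reference (slide layout)."""
    return complement(reference[start:start + length])


def mismatch(pm_probe):
    """MM probe: same as the PM probe, with the middle base complemented."""
    mid = len(pm_probe) // 2
    return pm_probe[:mid] + COMPLEMENT[pm_probe[mid]] + pm_probe[mid + 1:]


reference = "TGTGATGGTGGGAATGGGTCAGAAGGACTCCTATGTGGGTGACGAGGCC"
pm = perfect_match(reference, start=12)  # offset 12 is an assumption (see text)
mm = mismatch(pm)
print(pm)
print(mm)
```

With the slide's reference sequence, the two printed strings are the perfect match and mismatch probes shown above.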

slide-8
SLIDE 8

Example: one PM probe set and six arrays

[Figure: per-array panels labelled z1, z2, ..., z6 over probes 1 to 11]

$x = \lambda z + \epsilon$
slide-9
SLIDE 9

Factor analysis

Generative model:

$x = \lambda z + \epsilon$

where

$x, \lambda \in \mathbb{R}^n, \quad z \sim \mathcal{N}(0, 1), \quad \epsilon \sim \mathcal{N}(0, \Psi)$

From this it follows that:

$x \sim \mathcal{N}(0, \lambda\lambda^T + \Psi)$

  • $z$ models the correlation between the data elements
  • $\epsilon$ accounts for the independent noise in the data
  • parameter estimation with the EM algorithm
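
The claim that the generative model implies a covariance of λλᵀ + Ψ can be checked by simulation. The sketch below is illustrative only: the loadings `lam` and noise variances `psi` are made-up values, Ψ is assumed to be diagonal (independent noise per probe), and the function names are hypothetical. It draws samples from x = λz + ε and compares the empirical covariance with the model covariance.

```python
import random


def sample_probe_set(lam, psi, rng):
    """Draw one x = lam * z + eps with z ~ N(0, 1) and eps_j ~ N(0, psi_j)."""
    z = rng.gauss(0.0, 1.0)
    return [l * z + rng.gauss(0.0, p ** 0.5) for l, p in zip(lam, psi)]


def model_covariance(lam, psi):
    """Covariance implied by the model: lam lam^T + diag(psi)."""
    n = len(lam)
    return [[lam[i] * lam[j] + (psi[i] if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def empirical_covariance(samples):
    """Unbiased sample covariance matrix of a list of equal-length vectors."""
    m, n = len(samples), len(samples[0])
    means = [sum(s[i] for s in samples) / m for i in range(n)]
    return [[sum((s[i] - means[i]) * (s[j] - means[j]) for s in samples) / (m - 1)
             for j in range(n)] for i in range(n)]


lam = [1.0, 0.8, 0.5]  # hypothetical loadings, one per probe
psi = [0.2, 0.3, 0.1]  # hypothetical noise variances (diagonal of Psi)
rng = random.Random(0)
samples = [sample_probe_set(lam, psi, rng) for _ in range(20000)]

expected = model_covariance(lam, psi)
observed = empirical_covariance(samples)
for row_expected, row_observed in zip(expected, observed):
    print([round(v, 2) for v in row_expected], [round(v, 2) for v in row_observed])
```

The off-diagonal entries come entirely from the shared factor z, while Ψ only adds to the diagonal.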

slide-10
SLIDE 10

Prior knowledge

  • Increasing mRNA concentration leads to larger signals
      negative values of $\lambda$ are not plausible
  • Observed variance in the data is often low
      high values of $\lambda$ are unlikely
  • Most genes on a chip are non-relevant
      most genes have a $\lambda$ of zero

slide-11
SLIDE 11

Bayesian posterior & prior

Bayesian posterior:

$p(\lambda, \Psi \mid \{x\}) \propto p(\{x\} \mid \lambda, \Psi) \, p(\lambda, \Psi)$

Prior distribution (rectified Gaussian):

$p(\lambda, \Psi) = p(\lambda)$
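
The slide names the prior only as a rectified Gaussian and does not give its parameters. One common construction, assumed for this sketch, takes a Gaussian draw and clips negative values to zero. This yields nonnegative values with a point mass at exactly zero, which fits the prior knowledge that negative loadings are implausible and that most genes have a loading of zero. The values of `mu` and `sigma` below are hypothetical.

```python
import random


def sample_rectified_gaussian(mu, sigma, rng):
    """Draw y ~ N(mu, sigma^2) and return max(0, y)."""
    return max(0.0, rng.gauss(mu, sigma))


rng = random.Random(1)
mu, sigma = 0.0, 1.0  # hypothetical prior parameters, not taken from the slides
draws = [sample_rectified_gaussian(mu, sigma, rng) for _ in range(10000)]

# Share of draws that land exactly on zero (the point mass of the prior).
share_zero = sum(d == 0.0 for d in draws) / len(draws)
print(f"smallest draw: {min(draws)}, share of exact zeros: {share_zero:.2f}")
```

With `mu = 0`, roughly half of the draws are exactly zero; a more negative `mu` would push more mass onto zero.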

slide-12
SLIDE 12

Data sets

Affymetrix spiked-in data set "A"

  • 59 arrays, HGU95A_v2
  • 14 artificially added cDNA fragments
  • Concentrations: 0, 0.25, 0.5, 1, 2, 4, 8, ... , 1024 pM

Affymetrix spiked-in data set "B"

  • 42 arrays, HGU133A
  • 42 artificially added cDNA fragments
  • Concentrations: 0, 0.125, 0.25, 0.5, 1, ... , 512 pM

slide-13
SLIDE 13

Preprocessing chain

Probe level data > Background correction > Normalisation > PM correction > Summarisation > Expression level

  • Background correction: RMA, MAS 5.0, none
  • Normalisation: quantiles, cyclic loess, constant, VSN
  • PM correction: PM only, PM-MM, IM
  • Summarisation: FARMS, median polish, Tukey biweight, Li-Wong, AverageDiff

slide-14
SLIDE 14

Results

Affycomp II benchmark (AUC, area under the curve):

Chip     Intensity   FARMS   RMA    GCRMA   MAS 5.0   MBEI
HGU133   Low         0.94    0.51   0.62    0.07      0.21
HGU133   Med         0.99    0.91   0.94    0.00      0.43
HGU133   High        1.00    0.64   0.59    0.00      0.16
HGU133   Mean        0.95    0.60   0.69    0.05      0.26
HGU95    Low         0.91    0.57   0.45    0.09      -
HGU95    Med         1.00    0.91   0.91    0.00      -
HGU95    High        0.98    0.96   0.92    0.00      -
HGU95    Mean        0.93    0.65   0.57    0.06      -

Computational costs for processing 60 arrays:

Method                   FARMS   RMA   MAS 5.0   MBEI
Computational time [s]   92      384   851       591

slide-15
SLIDE 15

Analysis of microarray data

  • Problem of multiple testing and over-fitting
      because of the high dimensionality of the data
      because of the technology (noise)
      because of the biology: most genes are non-informative
  • Informative pre-filtering is desired
  • Using array information to filter genes
      A/P calls: excluding probe sets that are always absent
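
Filtering on A/P calls as described here amounts to dropping every probe set whose detection call is "absent" on all arrays. The sketch below illustrates that rule on made-up data: the probe set names and calls are hypothetical, and treating a marginal call ("M") as not absent is an assumption of this example.

```python
def drop_always_absent(calls):
    """Keep probe sets that have at least one call other than 'A' (absent)."""
    return {probe_set: per_array
            for probe_set, per_array in calls.items()
            if any(call != "A" for call in per_array)}


# Hypothetical detection calls, one per array: P = present, M = marginal, A = absent.
calls = {
    "probe_set_1": ["A", "A", "A", "A"],
    "probe_set_2": ["P", "A", "P", "P"],
    "probe_set_3": ["A", "M", "A", "A"],
}
kept = drop_always_absent(calls)
print(sorted(kept))
```

Only `probe_set_1` is removed, because it is the only one that is absent on every array.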

slide-16
SLIDE 16

Internal consistency

  • Defined by the correlation of intensities between probes of the same probe set across chips
  • When intensities are high or low for all probes on an individual chip, there must be a strong correlation
  • Strong correlation = consistency
  • This means that all fragments of a gene tell the same story
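
Internal consistency in this sense can be illustrated with a plain Pearson correlation between probes of one probe set, computed across chips. This is only a sketch of the idea and not how FARMS itself scores a probe set; the intensities are invented and the function names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def mean_pairwise_correlation(probes):
    """Average correlation over all pairs of probes (rows) across chips (columns)."""
    pairs = [(i, j) for i in range(len(probes)) for j in range(i + 1, len(probes))]
    return sum(pearson(probes[i], probes[j]) for i, j in pairs) / len(pairs)


# Hypothetical log intensities: each row is a probe, each column a chip.
consistent = [[7.0, 8.0, 9.0, 10.0], [6.5, 7.5, 8.5, 9.5], [7.2, 8.1, 9.3, 10.1]]
inconsistent = [[7.0, 8.0, 9.0, 10.0], [9.0, 7.0, 9.5, 7.5], [8.0, 9.0, 7.0, 8.5]]

r_consistent = mean_pairwise_correlation(consistent)
r_inconsistent = mean_pairwise_correlation(inconsistent)
print(round(r_consistent, 3), round(r_inconsistent, 3))
```

In the first probe set all probes rise and fall together across chips, so the average correlation is close to 1; in the second the probes disagree and the average correlation is low.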

slide-17
SLIDE 17

Internal consistency

[Figure: panels comparing an informative gene with a non-informative gene; panel titles and axis ticks not recoverable from the extracted text]

Dots represent individual chips

slide-18
SLIDE 18

Background: I/NI-call

Variance of the extracted factor z given the data:

$\mathrm{var}(z \mid x) = \left(1 + \lambda^T \Psi^{-1} \lambda\right)^{-1}$

  • provides a measure of how much variation in the probe set data x is explained by the factor z
  • takes values in the interval [0, 1]
      var(z|x) = 0: data can be completely explained by z
      var(z|x) = 1: data cannot be explained by z
      var(z|x) = 0.5: signal-to-noise ratio = 1
  • criterion for unsupervised feature selection
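
The formula for var(z|x) is cheap to evaluate once λ and Ψ are known. The sketch below assumes a diagonal Ψ, given as a list of per-probe noise variances, and uses a cut-off of 0.5 (the point where the signal-to-noise ratio equals 1) to separate informative from non-informative probe sets. The function names, the example values, and the choice of cut-off are assumptions for illustration.

```python
def posterior_variance(lam, psi):
    """var(z | x) = 1 / (1 + lam^T Psi^-1 lam), with Psi diagonal."""
    signal_to_noise = sum(l * l / p for l, p in zip(lam, psi))
    return 1.0 / (1.0 + signal_to_noise)


def is_informative(lam, psi, threshold=0.5):
    """Call a probe set informative when the factor explains most of the data."""
    return posterior_variance(lam, psi) < threshold


# Hypothetical parameter estimates for three probe sets.
strong = posterior_variance([2.0, 2.0, 2.0], [0.5, 0.5, 0.5])  # large loadings
balanced = posterior_variance([1.0, 1.0], [2.0, 2.0])          # signal equals noise
silent = posterior_variance([0.0, 0.0], [1.0, 1.0])            # zero loadings
print(strong, balanced, silent)
```

Zero loadings give a variance of exactly 1 (nothing explained), and the variance moves towards 0 as λᵀΨ⁻¹λ grows.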

slide-19
SLIDE 19

I/NI-calls in action

  • clear bimodal distribution of var(z|x)
  • distinct modes for non-informative and informative genes

[Figure: density of var(z|x) for data set GSE6119, with modes labelled non-informative and informative; x-axis: var(z|x) from 0.0 to 1.0, y-axis: density]

slide-20
SLIDE 20

I/NI-calls vs. A/P-calls

[Figure: axes are variance across the arrays (log10) and expression level (log2)]

slide-21
SLIDE 21

Results I/NI-calls

  • On average: 84 (±1.5)% exclusion rate, applied to 30 real-life studies
  • A/P calls excluded only 33 (±1)%

Validation on spiked-in data

Chip       Informative   Non-informative   Exclusion rate   Detected spiked-ins   Detected pseudo spiked-ins
HGU133A    81            22219             99.63%           42/42                 28/28*
HGU95_V2   56            12570             99.56%           14/14                 5/5**

* McGee et al. 2006
** Wolfinger and Chu 2002; Cope et al. 2004
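
The exclusion rates in the validation table can be recomputed from the counts, assuming the exclusion rate is the share of non-informative probe sets among all probe sets (a definition the slide does not spell out). The sketch below uses only the counts from the table; the function name is hypothetical.

```python
def exclusion_rate(informative, non_informative):
    """Share of probe sets that are called non-informative and thus excluded."""
    return non_informative / (informative + non_informative)


# (informative, non-informative) counts as given in the table.
counts = {"HGU133A": (81, 22219), "HGU95_V2": (56, 12570)}
rates = {chip: 100.0 * exclusion_rate(*pair) for chip, pair in counts.items()}
for chip, rate in rates.items():
    print(f"{chip}: {rate:.3f}%")
```

Under this definition the rates come out at about 99.64% and 99.56%, which agrees with the table up to rounding in the last digit.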

slide-22
SLIDE 22

81 probe sets with I/NI-call

[Figure: one small plot per probe set; panel titles and axis ticks not recoverable from the extracted text]
slide-23
SLIDE 23

56 probe sets with I/NI-call

[Figure: one small plot per probe set; panel titles and axis ticks not recoverable from the extracted text]
slide-24
SLIDE 24

Conclusion

  • FARMS summarization outperforms all Affycomp II competitors (57) in terms of sensitivity and specificity (AUC)
  • I/NI calls make a critical contribution to overcoming the curse of dimensionality in the analysis of microarray data
  • I/NI calls filter informative genes in a statistically sound and objective manner
  • The smaller gene set contains fewer false positives

slide-25
SLIDE 25

Further information

  • Talloen W, Clevert DA, Hochreiter S, Amaratunga D, Bijnens L, Kass S and Göhlmann H: I/NI-calls for the exclusion of non-informative genes: a highly effective filtering tool for microarray data. Bioinformatics 2007, Advance Access published on October 5, 2007.
  • Hochreiter S, Clevert DA, Obermayer K: A new summarization method for Affymetrix probe level data. Bioinformatics 2006, 22: 943-949.
  • FARMS homepage: http://www.bioinf.jku.at/software/farms/farms.html
  • Affycomp II benchmark: http://affycomp.biostat.jhsph.edu/