Harnessing Crowd-Sourcing to Assess Genes based on Effect Size Using Visual Inference Methods
Di Cook, Monash University
Joint work with Niladri Roy Chowdhury, Eric Hare, Mahbub Majumder, Michelle Graham, Tengfei Yin, Heike Hofmann
VicBioStat 2016, Melbourne, Australia
[Figure: 25 gene panels, numbered 1 to 25: Glyma13g12080, Glyma13g11960, Glyma13g12010, Glyma06g03100, Glyma10g36890, Glyma16g29220, Glyma18g10330, Glyma03g06420, Glyma09g28100, Glyma16g05640, Glyma09g03270, Glyma09g29370, Glyma09g24780, Glyma14g34080, Glyma02g39150, Glyma02g03290, Glyma08g36390, Glyma20g26600, Glyma01g38130, Glyma18g01720, Glyma05g16350, Glyma18g07090, Glyma12g36140, Glyma12g03280, Glyma02g13850. x-axis: Fe (insufficient, sufficient); y-axis: log2(normalized counts + 1); legend: geno (Emptyvector, RPA).]
[Figure: lineup of 20 plots, numbered 1 to 20. x-axis: geno (Emptyvector, RPA); y-axis: log2(normalized counts + 1).]
In which of these plots do the two groups have the most vertical difference?
geno_1_5, 5/7
[Figure: lineup of 20 plots. x-axis: Fe (i = insufficient, s = sufficient); y-axis: log2(normalized counts + 1).]
In which of these plots is the green line the steepest, and the spread of the green points relatively small?
interaction_2_1, 4/5
Image from son’s T-shirt!
[Figure: lineups of 20 plots each. x-axis: HS and PT, with ticks 1 to 3 (1 to 2 in the last lineup); y-axis: log10(cpm).]
JSM 2014, Boston, MA
http://www.unomaha.edu/mahbubulmajumder/html/experiments.html
$$
P(X \ge x) = 1 - \mathrm{Binom}_{K,\,1/m}(x - 1) = \sum_{i=x}^{K} \binom{K}{i} \left(\frac{1}{m}\right)^{i} \left(\frac{m-1}{m}\right)^{K-i}
$$

K: number of independent observers
x: number of observers choosing the data plot
m: number of plots in the lineup
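As a rough illustration, the visual p-value can be evaluated directly in base R. This is a minimal sketch, not the authors' code: the function names `visual_p_value` and `visual_p_value_sum` are hypothetical, the default `m = 20` assumes a lineup of 20 plots, and the example reads a result such as "5/7" as 5 of 7 observers choosing the data plot, which is an assumption about the slide notation.

```r
# Visual p-value: P(X >= x) when each of K independent observers
# picks the data plot with probability 1/m under the null hypothesis.
visual_p_value <- function(x, K, m = 20) {
  stopifnot(x >= 0, x <= K, m >= 2)
  # Upper tail of the binomial: P(X > x - 1) = 1 - Binom_{K,1/m}(x - 1)
  pbinom(x - 1, size = K, prob = 1 / m, lower.tail = FALSE)
}

# The same quantity written out as the explicit sum in the formula.
visual_p_value_sum <- function(x, K, m = 20) {
  i <- x:K
  sum(choose(K, i) * (1 / m)^i * ((m - 1) / m)^(K - i))
}

# Assumed reading of "5/7": 5 of 7 observers chose the data plot.
p_tail <- visual_p_value(5, 7)
p_sum  <- visual_p_value_sum(5, 7)
print(c(tail = p_tail, sum = p_sum))
```

Under these assumptions the two functions agree, and the result is on the order of 6e-06, so 5 correct picks out of 7 would be very unlikely if the data plot were indistinguishable from the null plots.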
[Figure: visual p-value (0.00 to 1.00) against gene order, for genes ranked 2-11, 95-104, 995-1004 and 1995-2004.]
[Figure: Visual (0.00 to 1.00) against edgeR, one panel per gene-order group (2-11, 95-104, 995-1004, 1995-2004). The edgeR axis reaches about 8e-77, 2e-23, 7.75e-08 and 4.2e-05 in the respective panels.]
[Figure: lineup of 20 plots. x-axis: HS and PT, with ticks 1 to 3; y-axis: log10(cpm). Annotations shown across slide builds: FC=10.9, FDR=10^-183; FC=2.9, FDR=10^-14; FC=3.0, FDR=10^-11. Later builds add a scatterplot with axes FDR and logFC.]
[Figure: lineup of 20 plots. x-axis: HS and PT, with ticks 1 to 4; y-axis: log10(cpm). Annotations shown across slide builds: FC=0.78, FDR=10^-7; FC=5.9, FDR=10^-15; FC=-2.5, FDR=10^-14. Later builds add a scatterplot with axes FDR and logFC.]
library(ggenealogy)  # assumed source of getPath(), plotPath() and sbGeneal

# Assumed setup (not shown on the slide): soybean genealogy and its graph
data(sbGeneal)
sbIG <- dfToIG(sbGeneal)

# Path between two soybean varieties
pathTN <- getPath("Tokyo", "Narow", sbIG, sbGeneal)
pathTN
#> $pathVertices
#> [1] "Tokyo"    "Volstate" "Jackson"  "R66-873"  "Narow"
#>
#> $yearVertices
#> [1] "1907"   "1942"   "1954.5" "1971.5" "1985"

# Plot the path
plotPath(pathTN)
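library(readr)       # read_csv()
library(ggenealogy)  # assumed source of plotAncDes()

# Read the genealogy and name columns 2 and 3 as parent and child
hw <- read_csv("../data/hw-gen.csv")
names(hw)[2:3] <- c("parent", "child")

# Plot up to 6 generations of ancestors and 1 generation of descendants
plotAncDes("Hadley Alexander Wickham", hw, mAnc = 6, mDes = 1)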
Explore genetic signatures, genealogy and phenotypic changes of soybean breeding
Understand how the genome changed with the breeding
Data sources:
Next-generation sequencing: DNA-seq on 79 lines. DNA sequencing libraries were prepared using TruSeq DNA sample prep and NuGEN's unamplified prep kits (Illumina Inc., San Diego, CA and NuGEN Technologies Inc., San Carlos, CA).
Field yield trials: 30/79 + 138 ancestral lines
Breeding literature: which lines were bred to produce which line
Copy number variation (CNV): 2Gb of analysis files, annotations
Seven tabs containing different functionality
Four of the tabs (CNV Location, Copy Number, "Search CNVs by Location" and CNV List) are primarily concerned with exploring the identified copy number variants
The other three tabs (Phenotype Data, Genealogy and Methodology) provide additional information about the soybean cultivars and the experimental methodology
SNPs: 12Gb of data, 20 million SNPs, 1 million locations, 79 lines
Genealogy: shows the parent-to-child lineage
http://shiny.soybase.org/CNV/