Harnessing Crowd-Sourcing to Assess Genes based on Effect Size

SLIDE 1

Harnessing Crowd-Sourcing to Assess Genes based on Effect Size Using Visual Inference Methods

Di Cook, Monash University
Joint work with Niladri Roy Chowdhury, Eric Hare, Mahbub Majumder, Michelle Graham, Tengfei Yin, Heike Hofmann

SLIDE 2

VicBioStat 2016, Melbourne, Australia

Outline

  • Analysis outline, edgeR, … background
  • Our top genes: good, maybe, ugly
  • Why: video of dispersion
  • First experiment: is there any structure?
  • Re-analysis of published study

SLIDE 3

Our Data

  • RNA libraries sequenced by Illumina HiSeq2000
  • Alignment by bowtie
  • Rsamtools to import bam files, rtracklayer to import gff files
  • GenomicRanges to count reads
  • Negative binomial model using edgeR to compute differential expression
  • FDR control yields ~2000 significantly differentially expressed genes

SLIDE 4

The Good (✔), Maybe (?) & Ugly (✘)

TOP 25 GENES: ordered list of genes, 1 to 25

[Figure, repeated on slides 4 to 6: one panel per gene, each panel marked as good, maybe or ugly. x-axis: Fe (insufficient, sufficient); y-axis: log2(normalized counts + 1); colour: geno (Emptyvector, RPA). Panel titles: Glyma13g12080, Glyma13g11960, Glyma13g12010, Glyma06g03100, Glyma10g36890, Glyma16g29220, Glyma18g10330, Glyma03g06420, Glyma09g28100, Glyma16g05640, Glyma09g03270, Glyma09g29370, Glyma09g24780, Glyma14g34080, Glyma02g39150, Glyma02g03290, Glyma08g36390, Glyma20g26600, Glyma01g38130, Glyma18g01720, Glyma05g16350, Glyma18g07090, Glyma12g36140, Glyma12g03280, Glyma02g13850.]

Do you agree?

SLIDE 7

Dispersion: why?

[Video frames on dispersion, slides 7 to 14. Captions shown during the build:]

  • Level N inflates dispersion
  • Gene B inflates dispersion
  • In reality, gene B here inflates dispersion, making gene A not significant

SLIDE 15

[Figure, built up over slides 15 to 24: two linked plots, labelled cranvas and ggplot2. Left plot axes: log(counts per million) and tagwise dispersion, with the trended dispersion marked; each point is one gene. Right plot: classical interaction plot of one gene.]

Plots are linked: clicking on a point in the left plot shows the interaction plot for that gene.

SLIDE 25

So we ran a little experiment

  • Compare the results with random results
  • Take the experimental design, 2x2x3, and permute the labels
  • Re-run the analysis, record the most significant gene
  • Plot the results
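
The label permutation step can be sketched in base R. This is a minimal illustration, not the authors' code: it assumes the 2x2x3 design is 2 genotypes by 2 iron levels by 3 replicates (the slide gives only "2x2x3"), and `design` and `permute_labels` are hypothetical names.

```r
# Assumed layout of the 2x2x3 design: geno x Fe x 3 replicates = 12 samples.
design <- expand.grid(
  geno = c("Emptyvector", "RPA"),
  Fe   = c("insufficient", "sufficient"),
  rep  = 1:3
)

# Shuffle the treatment labels across samples. geno and Fe are moved
# together, so the design stays balanced while any link between labels
# and counts is broken.
permute_labels <- function(design) {
  idx <- sample(nrow(design))
  permuted <- design
  permuted$geno <- design$geno[idx]
  permuted$Fe   <- design$Fe[idx]
  permuted
}

set.seed(1)
perm <- permute_labels(design)
```

Each permuted design would then be passed through the same analysis as the real one, keeping the most significant gene from each run for plotting.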

SLIDE 26

In which of these plots do the two groups have the most vertical difference?

[Figure: lineup of 20 plots, numbered 1 to 20. x-axis: geno (Emptyvector, RPA); y-axis: log2(normalized counts + 1). Label: geno_1_5, 5/7]

SLIDE 27

In which of these plots is the green line the steepest, and the spread of the green points relatively small?

[Figure: lineup of 20 plots, numbered 1 to 20. x-axis: Fe (i, s); y-axis: log2(normalized counts + 1). Label: interaction_2_1, 4/5]

SLIDE 28

Experiment

  • Five different sets of null plots
  • Five different locations of the true data plot inside the lineup
  • Shown to a sample of Amazon Mechanical Turk workers
  • Overwhelmingly, in both cases, the true data plot is picked, slightly less so for the interaction

Data has SOME SIGNAL!

SLIDE 30

Human vs chimp

  • Data from “Sex-specific and lineage-specific alternative splicing in primates”, Blekhman, Marioni, Zumbo, Stephens, Gilad, Genome Research, 2010, 20: 180-189, http://genome.cshlp.org/content/suppl/2009/12/16/gr.099226.109.DC1.html
  • Human, chimp (and rhesus) liver RNA
  • 3x2 (M/F) individuals, 2 reps for each species

Image from son’s T-shirt!

SLIDE 31

Human vs chimp

  • Pairwise comparisons of species
  • Likelihoods compared, FDR<0.05

SLIDE 32

Human vs chimp

  • Re-analyzed using edgeR, exactTest (yes, not taking dependencies into account, but a quick re-do of the analysis was wanted)
  • Just human-chimp
  • Yields 3630 differentially expressed genes at FDR<0.01, mostly overlapping with published results
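
For readers unfamiliar with the FDR threshold, the counting step can be sketched with base R's `p.adjust`. The p-values below are simulated toy values, not results from the study, and the Benjamini-Hochberg method is an assumption here (it is the default adjustment in edgeR's `topTags`).

```r
set.seed(42)

# Toy p-values (NOT from the study): 900 null genes and 100 genes
# with a strong signal.
p <- c(runif(900), rbeta(100, 1, 1e4))

# Benjamini-Hochberg adjustment, then count genes below the threshold.
fdr  <- p.adjust(p, method = "BH")
n_de <- sum(fdr < 0.01)
```

The number of genes reported as differentially expressed is simply the count of adjusted p-values below the chosen cut-off.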

SLIDE 33

Visual testing

  • Create multiple sets of permutations of the labels of human, chimp
  • Conduct edgeR/exactTest on each of the permutations
  • Record the top 2500 genes based on p-value
  • Make lineups of the j’th ordered gene of the actual data against those of the permuted data
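
Assembling one lineup can be sketched as follows. This is a simplified illustration under stated assumptions: `make_lineup` is a hypothetical helper, the data are toy values, and the null panels here are made by shuffling the species labels of a single gene, whereas the talk uses the j'th ordered gene from each permuted analysis.

```r
# Place the actual data at a random position among m panels and fill
# the remaining m - 1 positions with null data sets.
make_lineup <- function(actual, nulls, m = 20) {
  stopifnot(length(nulls) == m - 1)
  pos <- sample(m, 1)
  panels <- vector("list", m)
  panels[[pos]] <- actual
  panels[-pos]  <- nulls
  list(panels = panels, position = pos)
}

set.seed(2016)

# Toy expression values for one gene: 6 human (HS) and 6 chimp (PT) samples.
actual <- data.frame(
  species = rep(c("HS", "PT"), each = 6),
  log_cpm = c(rnorm(6, mean = 2), rnorm(6, mean = 1))
)

# 19 null data sets with shuffled species labels (a stand-in for
# re-running the analysis on permuted labels).
nulls <- lapply(1:19, function(i) {
  d <- actual
  d$species <- sample(d$species)
  d
})

lu <- make_lineup(actual, nulls)
```

The recorded `position` is what an observer's pick is later compared against.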

SLIDE 34

You try

  • Pick one plot among the 20
  • “Which plot has the largest vertical difference between the two groups?”

Point your mobile device to this web page: goo.gl/gG60uR

SLIDE 35

[Figures, slides 35 to 38: four lineups of 20 plots each, titled Human-chimp 1, Human-chimp 2, Human-chimp 3 and Human-chimp 4. x-axis: species (HS, PT); y-axis: log10(cpm).]

SLIDE 39

JSM 2014, Boston, MA

Actual data is in positions 8, 5, 17, 18

SLIDE 40

Turk study

  • Lineups of the 2-11th, 95-104th, 995-1004th and 1995-2004th ordered genes
  • Two replicates of lineups made with different nulls, and different positions of the actual data
  • Turkers evaluated blocks of 10 randomly selected lineups
  • Combine results from turkers

http://www.unomaha.edu/mahbubulmajumder/html/experiments.html

SLIDE 41

Significance

  • If there is no difference in gene expression, the chance of one person detecting the actual plot out of 20 is 1/20 = 0.05
  • For multiple people it follows:

P(X \ge x) = 1 - \mathrm{Binom}_{K,1/m}(x - 1) = \sum_{i=x}^{K} \binom{K}{i} \left(\frac{1}{m}\right)^{i} \left(\frac{m-1}{m}\right)^{K-i}

where K is the number of independent observers, x is the number of observers choosing the data plot, and m is the number of plots in the lineup (here m = 20).
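
The binomial tail probability on this slide can be evaluated directly with base R's `pbinom`. A minimal sketch: `visual_p_value` is a hypothetical helper name, with x the number of observers choosing the data plot, K the number of independent observers and m the lineup size.

```r
# P(X >= x) for X ~ Binomial(K, 1/m): the chance that x or more of K
# independent observers pick the data plot if it is no different from
# the nulls.
visual_p_value <- function(x, K, m = 20) {
  pbinom(x - 1, size = K, prob = 1 / m, lower.tail = FALSE)
}

visual_p_value(1, K = 1)   # one observer, one correct pick: 0.05
visual_p_value(7, K = 10)  # 7 of 10 observers pick the data plot
```

With a single observer the value reduces to 1/m, matching the 1/20 = 0.05 quoted on the slide.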

SLIDE 46

Results

  • Genes in the top 10 have a clear difference
  • From the 900s down, the difference is consistent with randomness

[Figure: visual p-value (0 to 1) by gene order group: 2−11, 95−104, 995−1004, 1995−2004.]

SLIDE 47

Results

  • Visual p-value is 0 when the edgeR p-value is really tiny (top 20)
  • Positive association between the p-values for genes of order ~1000th and ~2000th

[Figure: visual p-value (0 to 1) against edgeR p-value, one panel per gene order group: 2−11, 95−104, 995−1004, 1995−2004.]

SLIDE 48

[Figure, built up over slides 48 to 61: lineup of 20 plots labelled “Human-chimp 1” and “2nd”. x-axis: species (HS, PT); y-axis: log10(cpm). A second plot, with axes labelled FDR and logFC, appears from slide 53.]

Annotations shown on individual plots during the build:

  • FC=10.9, FDR=10^-183
  • FC=2.9, FDR=10^-14
  • FC=3.0, FDR=10^-11

The final builds add the numbers 7, 1, 2, 3, 1 to the display.

SLIDE 62

[Figure, built up over slides 62 to 73: lineup of 20 plots labelled “1000th”. x-axis: species (HS, PT); y-axis: log10(cpm). A second plot, with axes labelled FDR and logFC, appears from slide 67.]

Annotations shown on individual plots during the build:

  • FC=0.78, FDR=10^-7
  • FC=-2.5, FDR=10^-14
  • FC=5.9, FDR=10^-15

The final builds add the numbers 4, 1 to the display.

SLIDE 74

What we learn

  • We see FDR-adjusted p-values as low as 10^-14 in random assignment of treatment labels
  • Would argue that the difference in expression maxes out around 100 genes

SLIDE 75

Costs

  • 82 lineups
  • 107 turkers
  • 1078 lineups evaluated
  • $115 total cost: $1/lineup, $10 Amazon fee, $5 for bonuses

SLIDE 76

Summary

  • The lineup provides a rigorous way to evaluate data plots
  • With the use of crowd-sourcing it could be used for high-throughput data
  • Consideration of scale used, similar to tagwise, trended or common
  • Need to get some funding to pursue this!

SLIDE 77

Other topics: R packages

  • genealogy: lineage relationships
  • shiny apps for big data

SLIDE 78

ggenealogy

  • shortest path
  • plotting ancestors and descendants
  • plotting distance matrix
  • using interaction

SLIDE 79

# sbIG (genealogy graph) and sbGeneal (genealogy data) are assumed to be already loaded
library(ggenealogy)
pathTN <- getPath("Tokyo", "Narow", sbIG, sbGeneal)
pathTN
#> $pathVertices
#> [1] "Tokyo"    "Volstate" "Jackson"  "R66-873"  "Narow"
#>
#> $yearVertices
#> [1] "1907"   "1942"   "1954.5" "1971.5" "1985"
plotPath(pathTN)

SLIDE 82

library(readr)       # read_csv
library(ggenealogy)  # plotAncDes
hw <- read_csv("../data/hw-gen.csv")
# Rename columns 2 and 3 to "parent" and "child"
names(hw)[2:3] <- c("parent", "child")
plotAncDes("Hadley Alexander Wickham", hw, mAnc=6, mDes=1)

For more information go to: https://github.com/dicook/SISBID-2016

SLIDE 83

Shiny apps for big data

  • Explore genetic signatures, genealogy and phenotypic changes of soybean breeding
  • Understand how the genome changed with the breeding of lines, and how this affected other traits

Data sources:

  • Next-generation sequencing, DNA-seq on 79 lines: DNA sequencing libraries were prepared using TruSeq DNA sample prep and NuGEN's unamplified prep kits (Illumina Inc., San Diego, CA and NuGEN Technologies Inc., San Carlos, CA)
  • Field yield trials: 30/79 + 138 ancestral lines
  • Breeding literature: which lines were bred to produce which line

SLIDE 84

  • Copy number variation (CNV): 2Gb of analysis files, annotations
  • Seven tabs containing different functionality
  • Four of the tabs (CNV Location, Copy Number, "Search CNVs by Location", and CNV List) are primarily concerned with exploring the identified copy number variants
  • The other three tabs (Phenotype Data, Genealogy, and Methodology) provide additional information about the soybean cultivars and the experimental methodology
  • SNPs: 12Gb of data, 20 million SNPs, 1 million locations, 79 lines
  • Genealogy: shows the parent-to-child lineage

Apps written by Dr Susan Vanderplas

http://shiny.soybase.org/CNV/

SLIDE 86

Inference References

  • Buja et al (2009) Phil. Trans. R. Soc. A
  • Hofmann et al (2012) IEEE TVCG
  • Majumder et al (2013) JASA
  • Yin et al (2013) J. Data Mining in Gen. & Prot.
  • Roy Chowdhury et al (2014) Computational Statistics