Re-inserting human interaction into cancer genome interpretation (PowerPoint presentation)



slide-1
SLIDE 1

Re-inserting human interaction into cancer genome interpretation

CYDNEY NIELSEN

UNIVERSITY OF BRITISH COLUMBIA BRITISH COLUMBIA CANCER AGENCY

slide-2
SLIDE 2

Outline

1 Visualization and its role in scientific discovery
2 Interactive cancer genomics visualization: why now?
3 Building a cancer genomics visualization platform

  • Flexible integration of views
  • Dynamic linking between views
  • Scalable to large data sets

4 Summary

slide-3
SLIDE 3

1

Visualization and its role in scientific discovery

slide-4
SLIDE 4

Discovery loop

Figure: a loop linking QUESTIONS, DATA, and INSIGHTS, with the steps labeled experiments, interpretation, and hypothesis generation.

slide-5
SLIDE 5

Discovery loop

Figure: a loop linking QUESTIONS, DATA, and INSIGHTS, with the steps labeled experiments and interpretation, plus a communication step leading to PUBLICATIONS.

slide-6
SLIDE 6

Discovery loop

Figure: a loop linking QUESTIONS, DATA, and INSIGHTS, with the steps labeled experiments and interpretation, plus a communication step leading to PUBLICATIONS. The interpretation step is annotated "computer automation + human expert".

slide-7
SLIDE 7

Intelligence Amplifying System > Artificial Intelligence System

That is, a machine and a mind can beat a mind-imitating machine working by itself.

(Frederick Brooks)
slide-8
SLIDE 8

Why visualization?

Visualization

  • Leverages our ability to visually recognize patterns and enhances our ability to reason about data
  • Can reveal a level of detail that may be missed in summary statistics alone

Anscombe's quartet

      I               II              III              IV
    x      y        x      y        x      y        x      y
   10   8.04       10   9.14       10   7.46        8   6.58
    8   6.95        8   8.14        8   6.77        8   5.76
   13   7.58       13   8.74       13  12.74        8   7.71
    9   8.81        9   8.77        9   7.11        8   8.84
   11   8.33       11   9.26       11   7.81        8   8.47
   14   9.96       14   8.10       14   8.84        8   7.04
    6   7.24        6   6.13        6   6.08        8   5.25
    4   4.26        4   3.10        4   5.39       19  12.50
   12  10.84       12   9.13       12   8.15        8   5.56
    7   4.82        7   7.26        7   6.42        8   7.91
    5   5.68        5   4.74        5   5.73        8   6.89
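The claim that summary statistics can hide detail can be checked directly against the Anscombe's quartet values on this slide. The following is a minimal sketch using only the Python standard library; the helper names `pearson` and `summarize` are illustrative and not part of the talk.

```python
from math import sqrt
from statistics import mean

# Anscombe's quartet, as tabulated on the slide.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(xs, ys):
    """Return the summary statistics that the four data sets nearly share."""
    return {"mean_x": mean(xs), "mean_y": mean(ys), "r": pearson(xs, ys)}


SUMMARY = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

if __name__ == "__main__":
    for name, stats in SUMMARY.items():
        print(f"{name:>3}: mean_x={stats['mean_x']:.2f} "
              f"mean_y={stats['mean_y']:.2f} r={stats['r']:.2f}")
```

The four sets agree on these statistics to two decimal places, yet their scatter plots look entirely different, which is the case for plotting before summarizing.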

slide-9
SLIDE 9

Why visualization?

Visualization

  • Is well suited to questions where the solution is too ill-defined to be automatically computed

Figure: the interpretation step between DATA and INSIGHTS.

slide-10
SLIDE 10

Why visualization?

www.apple.com

Example: Visual Information-Seeking Mantra

Overview first, zoom and filter, then details-on-demand. (Shneiderman 1996)

Visualization

  • Can be further enhanced with interactivity, which is key to dynamic data exploration

slide-11
SLIDE 11

Why visualization?

Visualization

  • Reduces the computational barrier posed by many data analysis workflows

slide-12
SLIDE 12

2

Interactive cancer genomics visualization: why now?

slide-13
SLIDE 13

Analogy: Human genome assembly

Computer automation

To reconstruct the human genome sequence from raw sequencing data

Human expert

To finish the genome: close gaps, correct mis-assemblies, improve error probabilities of the consensus bases

Consed | David Gordon and Phil Green

slide-14
SLIDE 14

Analogy: Human genome assembly

Computer automation Human expert

Consed | David Gordon and Phil Green

slide-15
SLIDE 15

Analogy: Human genome assembly

Consed | David Gordon and Phil Green

  • Some manual tasks become automated once they are better characterized (e.g. AutoFinish)
  • Computational analyses can be interactively focused by the user (e.g. local re-assembly)

Computer automation Human expert

slide-16
SLIDE 16

Cancer genomics data interpretation

Computer automation

To predict diverse features that differ between tumor and matched-normal sample pairs

  • Mutations (e.g. A > G)
  • Copy number (e.g. deletion)
  • Gene expression
  • Rearrangements (e.g. translocation)

Human expert

To integrate and interpret these features together with relevant patient metadata

slide-17
SLIDE 17

Cancer genomics data interpretation

  • Mutations (e.g. A > G)
  • Copy number (e.g. deletion)
  • Gene expression
  • Rearrangements (e.g. translocation)

Need interactive visualization tools to facilitate the human component and complement the computational one

Computer automation Human expert

slide-18
SLIDE 18

Genomics visualization

Review: Visualizing multidimensional cancer genomics data
Michael P Schroeder, Abel Gonzalez-Perez and Nuria Lopez-Bigas
Schroeder et al. Genome Medicine 2013, 5:9 http://genomemedicine.com/content/5/1/9

Figure: three visualization families (matrix heatmaps, genomic coordinates, networks), labeled with genes, samples, chromosomal coordinates, interactions, omics data, and clinical data.

slide-19
SLIDE 19

Genomics visualization

Review: Visualizing multidimensional cancer genomics data
Michael P Schroeder, Abel Gonzalez-Perez and Nuria Lopez-Bigas
Schroeder et al. Genome Medicine 2013, 5:9 http://genomemedicine.com/content/5/1/9

slide-20
SLIDE 20

3

Building a cancer genomics visualization platform

slide-21
SLIDE 21

Flexible integration of views

slide-22
SLIDE 22

Integrate multiple data types into one view

Mutations | MutationSeq (Ding et al., Bioinformatics, 2012)
Copy Number | Titan (Ha et al., Genome Research, 2014)

Example analysis: Examine a mutation in its copy number context

Figure labels: deletion, mutation

slide-23
SLIDE 23

Integrate multiple data types into one view

Mutations | MutationSeq (Ding et al., Bioinformatics, 2012)
Copy Number | Titan (Ha et al., Genome Research, 2014)

Example analysis: Examine a mutation in its copy number context

Figure labels: mutations, copy number
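Examining a mutation in its copy number context amounts to an interval lookup: find the copy number segment that overlaps the mutated position. The following is a minimal sketch of that idea. The class names, the inclusive segment ends, and the linear scan are all assumptions made for illustration, and nothing here reflects the actual MutationSeq or Titan output formats. The coordinates are borrowed from the example records shown later in the deck.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mutation:
    chrom: str
    position: int


@dataclass(frozen=True)
class Segment:
    chrom: str
    start: int
    end: int  # treated as inclusive here (an assumption)
    state: str


def copy_number_context(mutation: Mutation, segments: List[Segment]) -> Optional[Segment]:
    """Return the first segment overlapping the mutation, or None if there is none."""
    for seg in segments:
        if seg.chrom == mutation.chrom and seg.start <= mutation.position <= seg.end:
            return seg
    return None


segments = [
    Segment("1", 103_062, 109_114, "GAIN"),
    Segment("2", 69_064, 89_119, "DEL"),
]
mutations = [Mutation("1", 104_589), Mutation("2", 19_586)]

# Pair every mutation with its copy number context (None when no segment overlaps).
contexts = [(m, copy_number_context(m, segments)) for m in mutations]
```

A linear scan is fine for a handful of segments; for genome-scale data a sorted list with binary search or an interval tree would be the usual replacement.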

slide-24
SLIDE 24

Compare data filters on a single data set

MutationSeq predictions

Example analysis: Examine impact of MutationSeq probability threshold on coverage versus allele ratio distribution

slide-25
SLIDE 25

Explore views of different data types

MutationSeq predictions, Titan copy number predictions

Example analysis: Examine both the mutations and copy number alterations for a given sample

slide-26
SLIDE 26

Components

Data (d): sample(s) + data type

View (v): visual representation

Region Filter: on genomic range

Data Filter: on data parameters
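The four components can be read as a small data model. The following is a sketch under stated assumptions, not the platform's actual implementation: the class and attribute names are hypothetical, the view kinds are placeholders, and attaching filters to the view rather than to the data is a guess based on the "two views, one data set" layout used to compare data filters.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Data:
    """d: one or more samples plus a single data type."""
    samples: Tuple[str, ...]
    data_type: str


@dataclass(frozen=True)
class RegionFilter:
    """Restricts a view to a genomic range."""
    chrom: str
    start: int
    end: int


@dataclass
class View:
    """v: a visual representation of one or more Data components."""
    kind: str
    data: List[Data]
    region_filter: Optional[RegionFilter] = None
    # Data filter as parameter name -> minimum accepted value (hypothetical form).
    data_filter: Dict[str, float] = field(default_factory=dict)


mutations = Data(("SA091",), "mutations")
copy_number = Data(("SA091",), "copy number")

# Integrate multiple data types into one view: one v, two d.
integrated = View("genome track", [mutations, copy_number])

# Compare data filters on a single data set: two v sharing one d.
strict = View("scatter plot", [mutations], data_filter={"probability": 0.9})
lenient = View("scatter plot", [mutations], data_filter={"probability": 0.5})

# Explore views of different data types: one v per d.
side_by_side = [View("mutation track", [mutations]),
                View("copy number track", [copy_number])]
```

Keeping `Data` immutable and shareable is what lets two views point at the same data set while differing only in their filters.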
slide-27
SLIDE 27

Integrate multiple data types into one view

Diagram: one view (v) drawing on two data components (d): mutations and copy number.

slide-28
SLIDE 28

Compare data filters on a single data set

Diagram: two views (v) of one data component (d): MutationSeq predictions.

slide-29
SLIDE 29

Explore views of different data types

Diagram: two views (v), each with its own data component (d): MutationSeq predictions and Titan copy number predictions.

slide-30
SLIDE 30

Interface

slide-31
SLIDE 31

Create

Select a predefined structure

slide-32
SLIDE 32

Create

Add to an existing structure

slide-33
SLIDE 33

Define Data

Sample(s): query by project name / tumour type / sample id

Single data type: e.g. mutations, copy number, etc.

slide-34
SLIDE 34

Filter Data

Data filters depend on previously selected data type

slide-35
SLIDE 35

Filter Regions

Limit the view to genes or regions of interest

slide-36
SLIDE 36

Select a View

View types depend on previously selected data type

slide-37
SLIDE 37

Adjust View

slide-38
SLIDE 38

Inspect/Modify

slide-39
SLIDE 39

Dynamic linking between views

slide-40
SLIDE 40

Dynamically link views of different data types

Diagram: two linked views (v), each with its own data component (d): MutationSeq predictions and Titan copy number predictions.

slide-41
SLIDE 41

Dynamically link views of different data types

Diagram: two linked views (v), each with its own data component (d).

slide-42
SLIDE 42

Dynamically link views of different data types

Diagram: two linked views (v), each with its own data component (d).
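One common way to link views dynamically is a shared selection object that every view subscribes to, so a region brushed in one view is highlighted in the others. The following is a minimal sketch of that publish/subscribe pattern. It is an illustration only: the slides do not say how the platform implements linking, and the names `LinkedSelection`, `LinkedView`, and `brush` are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A selected genomic region: (chromosome, start, end).
Region = Tuple[str, int, int]


class LinkedSelection:
    """Shared selection state that notifies every subscribed view."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[Region], None]] = []

    def subscribe(self, listener: Callable[[Region], None]) -> None:
        self._listeners.append(listener)

    def select(self, region: Region) -> None:
        for listener in self._listeners:
            listener(region)


class LinkedView:
    """A view that highlights whatever region any linked view selects."""

    def __init__(self, name: str, link: LinkedSelection) -> None:
        self.name = name
        self.highlighted: Optional[Region] = None
        self._link = link
        link.subscribe(self._on_select)

    def _on_select(self, region: Region) -> None:
        self.highlighted = region

    def brush(self, region: Region) -> None:
        """Simulate the user selecting a region in this view."""
        self._link.select(region)


link = LinkedSelection()
mutation_view = LinkedView("MutationSeq predictions", link)
copy_number_view = LinkedView("Titan copy number predictions", link)

# Brushing in the mutation view updates the copy number view as well.
mutation_view.brush(("1", 103_062, 109_114))
```

In an interactive tool the `_on_select` callback would trigger a re-query and redraw rather than just storing the region.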

slide-43
SLIDE 43

Scalability

slide-44
SLIDE 44

Research on big data visualization must address two major challenges: perceptual and interactive scalability

Zhicheng Liu, Biye Jiang, Jeffrey Heer. imMens, EuroVis 2013

slide-45
SLIDE 45

Interactive scalability

How to enable dynamic querying and rendering of millions of data points in real time?

slide-46
SLIDE 46
Search

  • Optimized for text search across documents
  • All fields are indexed for fast retrieval (bag-of-terms approach)
  • Query performance is a function of the number of query matches, not the total data set size
  • Scales well as the data set size grows
  • Appropriate for load-once-read-many workflows
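The reason query cost tracks the number of matches rather than the data set size is the inverted index: every (field, value) pair maps to the set of documents containing it, so a query only touches the posting lists it names. The following toy version is a sketch of the idea, not of any particular search engine's internals; the class and field names are illustrative.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set, Tuple


class InvertedIndex:
    """Toy index mapping each (field, value) term to the ids of matching documents."""

    def __init__(self) -> None:
        self._docs: List[Dict[str, Any]] = []
        self._postings: Dict[Tuple[str, Any], Set[int]] = defaultdict(set)

    def add(self, doc: Dict[str, Any]) -> int:
        """Store a document and index every one of its fields."""
        doc_id = len(self._docs)
        self._docs.append(doc)
        for term in doc.items():
            self._postings[term].add(doc_id)
        return doc_id

    def search(self, **terms: Any) -> List[Dict[str, Any]]:
        """Return documents matching all given field=value terms."""
        posting_lists = [self._postings.get(term, set()) for term in terms.items()]
        if not posting_lists:
            return []
        # Only the posting lists named in the query are read, never the full corpus.
        posting_lists.sort(key=len)
        ids = set.intersection(*posting_lists)
        return [self._docs[i] for i in sorted(ids)]


index = InvertedIndex()
index.add({"type": "CNV", "sample_id": "SA091", "chrom": "1", "state": "GAIN"})
index.add({"type": "CNV", "sample_id": "SA091", "chrom": "2", "state": "DEL"})
index.add({"type": "mutation", "sample_id": "SA091", "chrom": "1"})

gains = index.search(sample_id="SA091", state="GAIN")
```

Building the index up front costs time and memory, which is why this design suits load-once-read-many workflows.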

slide-47
SLIDE 47

Elasticsearch

  • Chosen for ease of use (built on top of Apache Lucene)
  • Benefits include:
      • Built-in support for distributed data (manages shards across nodes)
      • Extensive caching
      • Sophisticated query language (DSL)
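To make the query language concrete: an Elasticsearch search request body is a JSON object written in the Query DSL. The sketch below builds one as a plain Python dict, so it runs without a cluster or client library. The `bool`, `filter`, `term`, and `range` clauses are standard Query DSL; the field names `sample_id`, `chrom`, `position`, and `probability` are assumptions modeled on the example records in this deck, not the platform's actual mapping.

```python
import json
from typing import Any, Dict


def mutation_query(sample_id: str, chrom: str, start: int, end: int,
                   min_probability: float) -> Dict[str, Any]:
    """Build a search body for confident mutations inside a genomic window."""
    return {
        "query": {
            "bool": {
                # Filter clauses match without scoring, which suits exact lookups.
                "filter": [
                    {"term": {"sample_id": sample_id}},
                    {"term": {"chrom": chrom}},
                    {"range": {"position": {"gte": start, "lte": end}}},
                    {"range": {"probability": {"gte": min_probability}}},
                ]
            }
        }
    }


body = mutation_query("SA091", "1", 103_062, 109_114, 0.9)

if __name__ == "__main__":
    # This JSON is what would be sent as the body of a search request.
    print(json.dumps(body, indent=2))
```

Whether `term` matches `chrom` as intended depends on how the field is mapped in the index (for example as a keyword), which the slides do not specify.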
slide-48
SLIDE 48

Storing data

mutation

  sample id: SA091
  chrom: 1
  position: 104,589
  ref_allele: A
  alt_allele: T
  probability: 0.91

CNV

  sample id: SA091
  chrom: 1
  start: 103,062
  end: 109,114
  state: GAIN

Figure labels: Documents (records), Fields
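The two example documents map naturally onto JSON objects, and Elasticsearch's bulk endpoint accepts them as newline-delimited JSON: one action line followed by one source line per document. The following is a sketch of that serialization using only the standard library. The index name `genomic_events`, the `event_type` field, and the underscore in `sample_id` are assumptions for illustration; the slides do not show the real mapping or loader.

```python
import json
from typing import Any, Dict, Iterable

# The two example records from the slide, as documents with fields.
mutation_doc = {
    "event_type": "mutation",  # hypothetical discriminator field
    "sample_id": "SA091",
    "chrom": "1",
    "position": 104589,
    "ref_allele": "A",
    "alt_allele": "T",
    "probability": 0.91,
}
cnv_doc = {
    "event_type": "CNV",
    "sample_id": "SA091",
    "chrom": "1",
    "start": 103062,
    "end": 109114,
    "state": "GAIN",
}


def to_bulk_ndjson(docs: Iterable[Dict[str, Any]], index_name: str) -> str:
    """Serialize documents as a bulk request body: action line, then source line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


payload = to_bulk_ndjson([mutation_doc, cnv_doc], "genomic_events")
```

Storing positions as integers rather than the comma-formatted strings on the slide is what allows range queries over genomic coordinates.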

slide-49
SLIDE 49

Storing data

Index: a collection of many mutation and CNV documents, for example:

mutation   sample id: SA091   chrom: 1   position: 104,589   ref_allele: A   alt_allele: T   probability: 0.91
mutation   sample id: SA091   chrom: 1   position: 104,589   ref_allele: A   alt_allele: T   probability: 0.95
mutation   sample id: SA091   chrom: 2   position: 19,586    ref_allele: G   alt_allele: G
CNV        sample id: SA091   chrom: 1   start: 103,062   end: 109,114   state: GAIN
CNV        sample id: SA091   chrom: 5   start: 2,062     end: 9,199     state: GAIN
CNV        sample id: SA091   chrom: 2   start: 69,064    end: 89,119    state: DEL

slide-50
SLIDE 50

Work-in-progress

  • Current index contains 800 million genomic events
  • Optimizations in how we store our data facilitate real-time linking
  • Queries sufficiently fast to support interaction
  • Working to standardize our data loaders and develop a query API for community use

slide-51
SLIDE 51

Perceptual scalability

What to do when we have more data points than pixels?

slide-52
SLIDE 52
Scalable views

  • Design views to present meaningful summaries
  • Sample or smooth data while preserving potentially relevant outliers
  • Exploit optimized Elasticsearch aggregation functions (e.g. counts in heatmaps computed during search)
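One simple way to reduce data while preserving outliers is min/max bucketing: split the series into as many buckets as there are pixels and keep only the extreme points of each bucket. The following is a minimal sketch of that technique, offered as an illustration rather than as the method the platform uses; the function name `downsample_min_max` is hypothetical.

```python
from typing import List, Sequence, Tuple


def downsample_min_max(values: Sequence[float], n_buckets: int) -> List[Tuple[int, float]]:
    """Keep the minimum and maximum of each bucket so extreme points survive.

    Returns (original_index, value) pairs in index order, at most 2 per bucket.
    """
    if n_buckets <= 0:
        raise ValueError("n_buckets must be positive")
    n = len(values)
    if n == 0:
        return []
    buckets = min(n_buckets, n)  # never create empty buckets
    kept: List[Tuple[int, float]] = []
    for b in range(buckets):
        lo = b * n // buckets
        hi = (b + 1) * n // buckets
        idx_min = min(range(lo, hi), key=values.__getitem__)
        idx_max = max(range(lo, hi), key=values.__getitem__)
        # A flat bucket yields the same index twice; the set removes the duplicate.
        for i in sorted({idx_min, idx_max}):
            kept.append((i, values[i]))
    return kept


# 1000 points reduced for a 10-pixel-wide display; the two spikes are retained.
signal = [0.0] * 1000
signal[417] = 50.0
signal[800] = -30.0
reduced = downsample_min_max(signal, 10)
```

Plain subsampling or averaging of the same signal would likely erase both spikes, which is exactly the loss the slide warns about.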

slide-53
SLIDE 53

4

Summary

slide-54
SLIDE 54

Summary

  • We need a flexible platform in order to tackle the diverse visualization demands of cancer genomics
  • Goal of this platform is to facilitate scientific discovery
  • Uses visualization as a means of supporting data exploration and insight generation
  • The insights are not necessarily the final product, but rather the beginning of a rigorous scientific process to further test the idea
  • Key challenges include:
      • Interactive and perceptual scalability
      • Interoperability
slide-55
SLIDE 55

Acknowledgements

Sohrab Shah

Samuel Aparicio
David Huntsman
Marco Marra
Janessa Laskin

Michael Smith Genome Sciences Centre
British Columbia Cancer Agency
Vancouver, Canada

Tom Jin
Kevin Wagner
Daniel Machev
Kelsey Hamer
Ali Bashashati

Shah Lab Development Team