Interactive Data Visualization in the Wild: Challenges of Big Data in Cancer Genomics



slide-1
SLIDE 1

Interactive Data Visualization in the Wild

Challenges of Big Data in Cancer Genomics

CYDNEY NIELSEN

UNIVERSITY OF BRITISH COLUMBIA
BRITISH COLUMBIA CANCER AGENCY

slide-2
SLIDE 2

Outline

1 Visualization and its role in scientific discovery
2 Introduction to cancer genomics
3 Cancer genomics visualization: building a scalable platform
4 Summary

slide-3
SLIDE 3

1

Visualization and its role in scientific discovery

slide-4
SLIDE 4

Discovery loop

Nodes: QUESTIONS, DATA, INSIGHTS
Steps: experiments, interpretation, hypothesis generation

slide-5
SLIDE 5

Discovery loop

Nodes: QUESTIONS, DATA, INSIGHTS, PUBLICATIONS
Steps: experiments, interpretation, communication

slide-6
SLIDE 6

Discovery loop

Nodes: QUESTIONS, DATA, INSIGHTS, PUBLICATIONS
Steps: experiments, interpretation, communication
Interpretation: computer automation + human expert

slide-7
SLIDE 7

Intelligence Amplifying System > Artificial Intelligence System

"That is, a machine and a mind can beat a mind-imitating machine working by itself."

  - Frederick Brooks
slide-8
SLIDE 8

Why visualization?

Visualization

  • Leverages our ability to visually recognize patterns and enhances our ability to reason about data
  • Can reveal a level of detail that may be missed in summary statistics alone

Anscombe’s quartet

       I             II            III            IV
   x      y      x      y      x      y      x      y
  10   8.04     10   9.14     10   7.46      8   6.58
   8   6.95      8   8.14      8   6.77      8   5.76
  13   7.58     13   8.74     13  12.74      8   7.71
   9   8.81      9   8.77      9   7.11      8   8.84
  11   8.33     11   9.26     11   7.81      8   8.47
  14   9.96     14   8.10     14   8.84      8   7.04
   6   7.24      6   6.13      6   6.08      8   5.25
   4   4.26      4   3.10      4   5.39     19  12.50
  12  10.84     12   9.13     12   8.15      8   5.56
   7   4.82      7   7.26      7   6.42      8   7.91
   5   5.68      5   4.74      5   5.73      8   6.89

[Figure: (a) the data table, (b) scatter plots of the four data sets]
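The claim that summary statistics can hide detail is easy to check on the quartet itself. The following is a minimal sketch using only the Python standard library; the `pearson` helper and the variable names are illustrative and not part of the talk. It computes the mean, sample variance, and correlation for each of the four data sets, which come out nearly identical even though the scatter plots look very different.

```python
from statistics import mean, variance

# Anscombe's quartet, as tabulated on the slide.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# One row of summary statistics per data set.
summary = {
    name: {
        "mean_x": mean(xs),
        "var_x": variance(xs),
        "mean_y": mean(ys),
        "var_y": variance(ys),
        "r": pearson(xs, ys),
    }
    for name, (xs, ys) in QUARTET.items()
}

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Only plotting the points reveals that II is curved, III has a single outlier on an otherwise straight line, and IV is driven by one extreme x value.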

slide-9
SLIDE 9

Why visualization?

DATA, interpretation, INSIGHTS

Visualization

  • Is well suited to questions where the solution is too ill-defined to be automatically computed

slide-10
SLIDE 10

Why visualization?

Visualization

  • Can be further enhanced with interactivity, which is key to dynamic data exploration

Example: Visual Information-Seeking Mantra

"Overview first, zoom and filter, then details-on-demand."

  - Shneiderman 1996

Image: www.apple.com

slide-11
SLIDE 11

Why visualization?

Visualization

  • Reduces the computational barrier posed by many data analysis workflows

slide-12
SLIDE 12

2

Cancer Genomics

slide-13
SLIDE 13

Human genome

Image from UCSF School of Medicine Office of Educational Technology

slide-14
SLIDE 14

Cancer – disease of the genome

Li Ding et al. Nature 2012

slide-15
SLIDE 15

DNA Sequencing

Input: DNA prepared from a population of cells from a tissue sample

Sequencer: Illumina HiSeq

Output: millions of sequencing reads

AGCGCAGATACAGACAGGTGAAACAGTACAG
TGACAACAGTACCAAGTCAGAGTCCACATAG
TAGAGGAGAGGCCAACATATAGACAACAGTT
TGACAACAGTACCACAGAGTACATAGAGGAG
AGCGCAGATACAGACAGGTGACAACAGAGAG

slide-16
SLIDE 16

Detecting genomic alterations from sequence

reference  TAGCGCAGATACAGACAGGTGACAACAGAGAGGTTACACCAG
reads        GCGCAGATACAGACAGATGACA
                 AGATACAGACAGGTGACAACAG
                   ATACAGACAGGTGACAACAGAG
                       AGACAGATGACAACAGAGAGGT
                        GACAGGTGACAACAGAGAGGTT
                          CAGATGACAACAGAGAGGTTAC
                           AGATGACAACAGAGAGGTTACA
                            GATGACAACAGAGAGGTTACAC

slide-17
SLIDE 17

Detecting genomic alterations from sequence

reference  TAGCGCAGATACAGACAGGTGACAACAGAGAGGTTACACCAG
reads        GCGCAGATACAGACAGATGACA
                 AGATACAGACAGGTGACAACAG
                   ATACAGACAGGTGACAACAGAG
                       AGACAGATGACAACAGAGAGGT
                        GACAGGTGACAACAGAGAGGTT
                          CAGATGACAACAGAGAGGTTAC
                           AGATGACAACAGAGAGGTTACA
                            GATGACAACAGAGAGGTTACAC
                             ^ Mutation

slide-18
SLIDE 18

Detecting genomic alterations from sequence

reference  TAGCGCAGATACAGACAGGTGACAACAGAGAGGTTACACCAG
reads        GCGCAGATACAGACAGGTGACA
                 AGATACAGACAGGTGACAACAG
                   ATACAGACAGGTGACAACAGAG
                       AGACAGATGACAACAGAGAGGT
                        GACAGGTGACAACAGAGAGGTT
                          CAGATGACAACAGAGAGGTTAC
                           AGATGACAACAGAGAGGTTACA
                            GATGACAACAGAGAGGTTACAC
                             ^ Mutation

Alleles at the mutated position: G and A; allele ratio = 0.5

Coverage: the number of reads overlapping a position
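Coverage and allele ratio can be computed directly from the reads on this slide. The sketch below is illustrative only: the read offsets were worked out by hand by matching each read against the reference, and the function names `pileup` and `call_variants` are hypothetical, not taken from any tool named in the talk. A real pipeline would get the offsets from a read aligner.

```python
from collections import Counter

REFERENCE = "TAGCGCAGATACAGACAGGTGACAACAGAGAGGTTACACCAG"

# (offset, read): offset is the 0-based start of the read on REFERENCE.
# Assumption: offsets are supplied by an aligner; here they are hand-derived.
ALIGNED_READS = [
    (2, "GCGCAGATACAGACAGGTGACA"),
    (6, "AGATACAGACAGGTGACAACAG"),
    (8, "ATACAGACAGGTGACAACAGAG"),
    (12, "AGACAGATGACAACAGAGAGGT"),
    (13, "GACAGGTGACAACAGAGAGGTT"),
    (15, "CAGATGACAACAGAGAGGTTAC"),
    (16, "AGATGACAACAGAGAGGTTACA"),
    (17, "GATGACAACAGAGAGGTTACAC"),
]


def pileup(reference, aligned_reads):
    """Count the bases observed at every reference position."""
    columns = [Counter() for _ in reference]
    for offset, read in aligned_reads:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return columns


def call_variants(reference, columns):
    """Return (position, ref_base, alt_base, coverage, allele_ratio) tuples.

    A position is reported when at least one read disagrees with the
    reference. No quality or probability model is applied.
    """
    calls = []
    for pos, (ref_base, counts) in enumerate(zip(reference, columns)):
        coverage = sum(counts.values())
        alts = {base: n for base, n in counts.items() if base != ref_base}
        if alts:
            alt_base, alt_count = max(alts.items(), key=lambda item: item[1])
            calls.append((pos, ref_base, alt_base, coverage, alt_count / coverage))
    return calls


calls = call_variants(REFERENCE, pileup(REFERENCE, ALIGNED_READS))
print(calls)  # [(18, 'G', 'A', 8, 0.5)]
```

All eight reads overlap position 18, four carry the reference G and four carry A, which gives the allele ratio of 0.5 shown on the slide.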

slide-19
SLIDE 19

Genomic alterations

Mutation (alleles G and A)

Copy number (example: deletion)

Rearrangement (example: translocation)

slide-20
SLIDE 20

Revolution in DNA sequencing technologies

slide-21
SLIDE 21

The promise of data

Green E. et al. Nature. February 10, 2011

slide-22
SLIDE 22

Cancer genomics data interpretation

Computer automation

To predict diverse genomic alterations

  • MutationSeq (Ding et al., Bioinformatics 2012)
  • Titan (Ha et al., Genome Research 2014)
  • deStruct

Human expert

To integrate and interpret these alterations together with relevant patient metadata

[Figure: mutation (G, A), deletion, translocation]

slide-23
SLIDE 23

Cancer genomics data interpretation

Computer automation and human expert

  • MutationSeq (Ding et al., Bioinformatics 2012)
  • Titan (Ha et al., Genome Research 2014)
  • deStruct

Need interactive visualization tools to facilitate the human component and complement the computational one

[Figure: mutation (G, A), deletion, translocation]

slide-24
SLIDE 24

3

Cancer Genomics Visualization

slide-25
SLIDE 25

Visualizing multidimensional cancer genomics data

Michael P Schroeder, Abel Gonzalez-Perez and Nuria Lopez-Bigas

REVIEW

Schroeder et al. Genome Medicine 2013, 5:9 http://genomemedicine.com/content/5/1/9

Many tools for many tasks

[Figure: visualization types: matrix heatmaps, genomic coordinates, networks. Panel labels: genes, samples, omics data, clinical data, chromosomal coordinates, interactions]

slide-26
SLIDE 26

Visualizing multidimensional cancer genomics data

Michael P Schroeder, Abel Gonzalez-Perez and Nuria Lopez-Bigas

REVIEW

Schroeder et al. Genome Medicine 2013, 5:9 http://genomemedicine.com/content/5/1/9

Many tools for many tasks

slide-27
SLIDE 27

http://www.cbioportal.org

slide-28
SLIDE 28
slide-29
SLIDE 29
slide-30
SLIDE 30

Key Feature 1 Flexible integration of views

slide-31
SLIDE 31

Integrate multiple data types into one view

Example analysis: Examine a mutation in its copy number context

[Figure labels: deletion, mutation]

slide-32
SLIDE 32

Integrate multiple data types into one view

Example analysis: Examine a mutation in its copy number context

[Figure labels: mutations, copy number]

slide-33
SLIDE 33

Compare data filters on a single data set

Data: MutationSeq predictions

Example analysis: Examine impact of MutationSeq probability threshold on coverage versus allele ratio distribution

slide-34
SLIDE 34

Explore views of different data types

Data: MutationSeq predictions, Titan copy number predictions

Example analysis: Examine both the mutations and copy number alterations for a given sample

slide-35
SLIDE 35

Components

  • Data (d): sample(s) + data type
  • View (v): visual representation
  • Region Filter: on genomic range
  • Data Filter: on data parameters
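The four components can be pictured as small record types that are wired together into view structures. This is a sketch under stated assumptions, not the platform's actual data model: every class name, field name, and view kind below is hypothetical, and attaching filters to the view rather than to the data is one possible reading of the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class RegionFilter:
    """Restricts a view to a genomic range."""
    chrom: str
    start: int
    end: int


@dataclass(frozen=True)
class DataFilter:
    """Restricts a view by a data parameter, e.g. a probability threshold."""
    parameter: str
    minimum: float


@dataclass
class Data:
    """Sample(s) plus a single data type."""
    samples: List[str]
    data_type: str


@dataclass
class View:
    """A visual representation of one or more Data components."""
    kind: str
    data: List[Data]
    region_filters: List[RegionFilter] = field(default_factory=list)
    data_filters: List[DataFilter] = field(default_factory=list)


mutations = Data(samples=["SA091"], data_type="mutations")
copy_number = Data(samples=["SA091"], data_type="copy number")

# One view over two data types (v over d, d).
integrated = View(kind="genome track", data=[mutations, copy_number])

# Two views over the same data set, differing only in their data filter (v, d, v).
lenient = View("scatterplot", [mutations], data_filters=[DataFilter("probability", 0.5)])
strict = View("scatterplot", [mutations], data_filters=[DataFilter("probability", 0.9)])

# Two views, each over its own data type (v, v over d, d).
side_by_side = [View("scatterplot", [mutations]), View("genome track", [copy_number])]
```

Keeping `Data` separate from `View` is what lets the same data set feed several views, which the three example structures in the talk rely on.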
slide-36
SLIDE 36

Integrate multiple data types into one view

Structure: one view (v) over two data components (d, d): mutations and copy number

slide-37
SLIDE 37

Compare data filters on a single data set

Structure: two views (v, v) over one data component (d): MutationSeq predictions

slide-38
SLIDE 38

Explore views of different data types

Structure: two views (v, v), each over its own data component (d, d): MutationSeq predictions and Titan copy number predictions

slide-39
SLIDE 39

Interface

Web application implemented using D3.js

slide-40
SLIDE 40

Create

Select a predefined structure

slide-41
SLIDE 41

Create

Add to an existing structure

slide-42
SLIDE 42

Define Data

  • Sample(s): query by project name / tumour type / sample id
  • Single data type: e.g. mutations, copy number, etc.

slide-43
SLIDE 43

Filter Data

Data filters depend on previously selected data type

slide-44
SLIDE 44

Filter Regions

Limit the view to genes or regions of interest

slide-45
SLIDE 45

Select a View

View types depend on previously selected data type

slide-46
SLIDE 46

Adjust View

slide-47
SLIDE 47

Inspect/Modify

slide-48
SLIDE 48

Key Feature 2 Dynamic linking between views

slide-49
SLIDE 49

Dynamically link views of different data types

Structure: two views (v, v), each over its own data component (d, d): MutationSeq predictions and Titan copy number predictions

slide-50
SLIDE 50

Dynamically link views of different data types

Structure: two views (v, v), each over its own data component (d, d)

slide-51
SLIDE 51

Dynamically link views of different data types

Structure: two views (v, v), each over its own data component (d, d)

[Figure labels: deletion, mutation]

slide-52
SLIDE 52

Key Feature 3 Scalability

slide-53
SLIDE 53

"Research on big data visualization must address two major challenges: perceptual and interactive scalability"

Zhicheng Liu, Biye Jiang, Jeffrey Heer. imMens, EuroVis 2013

slide-54
SLIDE 54

Interactive scalability

How to enable dynamic querying and rendering of millions of data points in real time?

slide-55
SLIDE 55
Search

  • Optimized for text search across documents
  • All fields are indexed for fast retrieval (bag-of-terms approach)
  • Query performance is a function of the number of query matches, not the total data set size
  • Scales well as the data set size grows
  • Appropriate for load-once-read-many workflows
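The reason query cost tracks the number of matches rather than the data set size is the inverted index: each (field, value) term maps straight to the set of documents containing it. The toy class below illustrates the idea only; it is a sketch, not how Lucene or Elasticsearch are implemented, and the class and method names are hypothetical.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy term index: every (field, value) pair maps to matching doc ids."""

    def __init__(self):
        self.documents = {}
        self.postings = defaultdict(set)

    def add(self, doc_id, document):
        """Store the document and index every one of its fields."""
        self.documents[doc_id] = document
        for field_name, value in document.items():
            self.postings[(field_name, value)].add(doc_id)

    def search(self, **terms):
        """Return documents matching all field=value terms.

        Only the posting lists of the queried terms are touched; documents
        that match none of the terms are never visited.
        """
        posting_lists = [self.postings.get(term, set()) for term in terms.items()]
        if not posting_lists:
            return []
        matching_ids = set.intersection(*posting_lists)
        return [self.documents[doc_id] for doc_id in sorted(matching_ids)]


index = InvertedIndex()
index.add(1, {"type": "mutation", "sample_id": "SA091", "chrom": "1"})
index.add(2, {"type": "cnv", "sample_id": "SA091", "chrom": "1"})
index.add(3, {"type": "mutation", "sample_id": "SA092", "chrom": "2"})

hits = index.search(type="mutation", sample_id="SA091")
print(hits)
```

Building the index is the expensive step, which is why this design suits load-once-read-many workflows.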

slide-56
SLIDE 56

Elasticsearch

  • Chosen for ease of use (built on top of Apache Lucene)
  • Benefits include:
      • Built-in support for distributed data (manages shards across nodes)
      • Extensive caching
      • Sophisticated query language (DSL)
      • REST API
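As an illustration of the query DSL, the sketch below builds a request body that selects high-confidence mutations for one sample within a genomic range. The field names (`sample_id`, `chrom`, `position`, `probability`) and the function name are assumptions, since the talk does not show its mapping. The body uses the `bool`/`filter` form accepted by recent Elasticsearch releases; the version used in the talk is not stated and may have required a different query form. Such a body is sent as JSON to an index's `_search` endpoint over the REST API.

```python
import json


def mutation_query(sample_id, chrom, start, end, min_probability, size=1000):
    """Build an Elasticsearch search body (hypothetical field names)."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Filter clauses select documents without relevance scoring.
                "filter": [
                    {"term": {"sample_id": sample_id}},
                    {"term": {"chrom": chrom}},
                    {"range": {"position": {"gte": start, "lte": end}}},
                    {"range": {"probability": {"gte": min_probability}}},
                ]
            }
        },
    }


body = mutation_query("SA091", "1", 100_000, 110_000, min_probability=0.9)
print(json.dumps(body, indent=2))
```

Changing the probability threshold in the user interface would then amount to reissuing the same query with a different `min_probability`.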
slide-57
SLIDE 57

Storing data

Relational Database    Elasticsearch
Database               Index
Table                  Type
Row                    Document
Column                 Field

slide-58
SLIDE 58

Storing data

Documents (records) and their fields:

mutation
  sample id: SA091
  chrom: 1
  position: 104,589
  ref_allele: A
  alt_allele: T
  probability: 0.91

CNV
  sample id: SA091
  chrom: 1
  start: 103,062
  end: 109,114
  state: GAIN
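In practice each document is a JSON object. The sketch below writes the two example records as Python dictionaries and serializes them in the newline-delimited layout that the Elasticsearch bulk API expects (an action line followed by the document source). The index name `genomic_events`, the underscore field names, and the use of a `type` field to distinguish mutation from CNV records are assumptions made for illustration.

```python
import json

# The two example documents from the slide; positions stored as integers.
mutation = {
    "type": "mutation",
    "sample_id": "SA091",
    "chrom": "1",
    "position": 104589,
    "ref_allele": "A",
    "alt_allele": "T",
    "probability": 0.91,
}
cnv = {
    "type": "cnv",
    "sample_id": "SA091",
    "chrom": "1",
    "start": 103062,
    "end": 109114,
    "state": "GAIN",
}


def to_bulk_body(index_name, documents):
    """Serialize documents as newline-delimited JSON for a bulk request."""
    lines = []
    for document in documents:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(document))
    # The bulk format requires a trailing newline.
    return "\n".join(lines) + "\n"


bulk_body = to_bulk_body("genomic_events", [mutation, cnv])
print(bulk_body)
```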

slide-59
SLIDE 59

Storing data

Index: a collection of documents

[Figure: an index drawn as a stack of mutation and CNV documents. The two documents below appear several times each.]

mutation
  sample id: SA091, chrom: 1, position: 104,589, ref_allele: A, alt_allele: T, probability: 0.91

CNV
  sample id: SA091, chrom: 1, start: 103,062, end: 109,114, state: GAIN

Other documents shown in the index:

mutation
  sample id: SA091, chrom: 1, position: 104,589, ref_allele: A, alt_allele: T, probability: 0.95

CNV
  sample id: SA091, chrom: 5, start: 2,062, end: 9,199, state: GAIN

CNV
  sample id: SA091, chrom: 2, start: 69,064, end: 89,119, state: DEL

mutation
  sample id: SA091, chrom: 2, position: 19,586, ref_allele: G, alt_allele: G

slide-60
SLIDE 60

Handling relational data

Challenge

  • Search is not optimized for handling relationships between records
  • To support coordinated views, we need to relate one data type to another
  • Example: identify which mutations fall within which copy number alteration based on genomic coordinates

slide-61
SLIDE 61

Handling relational data

Solution

  • Denormalize records to contain necessary terms to support real-time retrieval of information requested by the frontend
  • Exploit search engine’s very fast retrieval of terms
  • Example: a mutation record contains information about its overlapping copy number record

mutation
  sample id: SA091
  chrom: 1
  position: 104,589
  probability: 0.91
  events: (
    (type: titan,
     chrom: 1
     start: 102,005
     end: 117,902
     state: deletion)
  )
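Denormalization of this kind happens at load time: before indexing, each mutation is annotated with the copy number records that overlap it, so no join is needed at query time. The following is a minimal sketch using the example values from the slide. The function names and dictionary keys are hypothetical, and the nested loop is a deliberately naive overlap test; a loader handling hundreds of millions of events would presumably use sorted intervals or an interval tree instead.

```python
def overlapping_events(mutation, cnvs, caller="titan"):
    """Return copies of the CNV records that contain the mutation's position."""
    events = []
    for cnv in cnvs:
        same_locus = (
            cnv["sample_id"] == mutation["sample_id"]
            and cnv["chrom"] == mutation["chrom"]
        )
        if same_locus and cnv["start"] <= mutation["position"] <= cnv["end"]:
            events.append({
                "type": caller,
                "chrom": cnv["chrom"],
                "start": cnv["start"],
                "end": cnv["end"],
                "state": cnv["state"],
            })
    return events


def denormalize(mutations, cnvs):
    """Embed overlapping CNV events in each mutation document."""
    return [dict(m, events=overlapping_events(m, cnvs)) for m in mutations]


mutations = [
    {"sample_id": "SA091", "chrom": "1", "position": 104589, "probability": 0.91},
]
cnvs = [
    {"sample_id": "SA091", "chrom": "1", "start": 102005, "end": 117902, "state": "deletion"},
    {"sample_id": "SA091", "chrom": "2", "start": 69064, "end": 89119, "state": "deletion"},
]

documents = denormalize(mutations, cnvs)
print(documents[0]["events"])
```

The trade-off is duplicated data and a more expensive load, in exchange for queries that only ever touch a single document type.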

slide-62
SLIDE 62

Work-in-progress

  • Current index contains 800 million genomic events
  • Queries sufficiently fast to support interaction
  • Working to standardize our data loaders and develop a query API for community use

slide-63
SLIDE 63

Perceptual scalability

What to do when we have more data points than pixels?

slide-64
SLIDE 64
Scalable views

  • Design views to present meaningful summaries
  • Sample or smooth data while preserving potentially relevant outliers
  • Exploit optimized Elasticsearch aggregation functions (e.g. counts in heatmaps computed during search)
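A heatmap of counts only needs one number per cell, so the binning can be pushed into the search itself. The sketch below builds a nested `histogram` aggregation body for Elasticsearch and, for comparison, computes the same kind of 2D bin counts locally. The field names `coverage` and `allele_ratio`, the bin widths, and the function names are assumptions, not the platform's actual queries.

```python
import math
from collections import Counter


def heatmap_aggregation(x_field, x_interval, y_field, y_interval):
    """Search body returning 2D bin counts and no individual documents."""
    return {
        "size": 0,  # only the aggregation result is needed
        "aggs": {
            "x_bins": {
                "histogram": {"field": x_field, "interval": x_interval},
                "aggs": {
                    "y_bins": {
                        "histogram": {"field": y_field, "interval": y_interval}
                    }
                },
            }
        },
    }


def bin_locally(points, x_interval, y_interval):
    """Client-side equivalent: count points per (x bin, y bin) cell."""
    counts = Counter()
    for x, y in points:
        cell = (math.floor(x / x_interval), math.floor(y / y_interval))
        counts[cell] += 1
    return counts


body = heatmap_aggregation("coverage", 10, "allele_ratio", 0.1)

# Made-up (coverage, allele ratio) pairs, for illustration only.
points = [(12, 0.52), (18, 0.58), (35, 0.48)]
cells = bin_locally(points, 10, 0.1)
print(dict(cells))
```

With server-side aggregation the amount of data sent to the browser depends on the number of cells, not on the number of underlying data points.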

slide-65
SLIDE 65

4

Summary

slide-66
SLIDE 66

Summary

  • Prototyped a flexible web-based visualization platform to tackle the diverse visualization demands of cancer genomics; supports data integration and comparison
  • Search engine technology is proving useful for enabling responsive interactive querying and linking between views
  • Visualizations designed with scalability in mind (aggregate, sample, or filter where possible)
  • Open visualization challenges, for example:
      • Best way to summarize genomic alterations of different scales across tens or hundreds of genomes
      • How to additionally represent frequency of each genomic alteration within a tumour sample and over time

slide-67
SLIDE 67

Acknowledgements

Sohrab Shah

Samuel Aparicio
David Huntsman
Marco Marra
Janessa Laskin

Michael Smith Genome Sciences Centre

British Columbia Cancer Agency, Vancouver, Canada

Tom Jin
Kevin Wagner
Daniel Machev
Kelsey Hamer
Ali Bashashati

Shah Lab Development Team