SLIDE 1 Interactive Data Visualization in the Wild
Challenges of Big Data in Cancer Genomics
CYDNEY NIELSEN
UNIVERSITY OF BRITISH COLUMBIA
BRITISH COLUMBIA CANCER AGENCY
SLIDE 2 Outline
1 Visualization and its role in scientific discovery
2 Introduction to cancer genomics
3 Cancer genomics visualization – building a scalable platform
4 Summary
SLIDE 3
1
Visualization and its role in scientific discovery
SLIDE 4 Discovery loop
INSIGHTS, QUESTIONS, DATA, hypothesis generation, interpretation, experiments
SLIDE 5 Discovery loop
INSIGHTS, QUESTIONS, DATA, experiments, communication, PUBLICATIONS, interpretation
SLIDE 6 Discovery loop
INSIGHTS, QUESTIONS, DATA, experiments, communication, PUBLICATIONS, interpretation, computer automation + human expert
SLIDE 7 Intelligence Amplifying System > Artificial Intelligence System
That is, a machine and a mind can beat a mind-imitating machine working by itself.
SLIDE 8 Why visualization?
Visualization
- Leverages our ability to visually recognize patterns and enhances our ability to reason about data
- Can reveal a level of detail that may be missed in summary statistics alone
Figure 1a. Anscombe's quartet
       I             II            III            IV
   x      y      x      y      x      y      x      y
  10   8.04     10   9.14     10   7.46      8   6.58
   8   6.95      8   8.14      8   6.77      8   5.76
  13   7.58     13   8.74     13  12.74      8   7.71
   9   8.81      9   8.77      9   7.11      8   8.84
  11   8.33     11   9.26     11   7.81      8   8.47
  14   9.96     14   8.10     14   8.84      8   7.04
   6   7.24      6   6.13      6   6.08      8   5.25
   4   4.26      4   3.10      4   5.39     19  12.50
  12  10.84     12   9.13     12   8.15      8   5.56
   7   4.82      7   7.26      7   6.42      8   7.91
   5   5.68      5   4.74      5   5.73      8   6.89
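The claim that summary statistics can hide detail is easy to check numerically. The following is a minimal Python sketch, using the quartet values listed above; the helper name `pearson` and the `SUMMARY` dictionary are introduced here purely for illustration.

```python
from statistics import mean, variance

# Anscombe's quartet, values as listed on the slide.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# (mean x, mean y, sample variance x, sample variance y, correlation) per data set
SUMMARY = {
    name: (mean(xs), mean(ys), variance(xs), variance(ys), pearson(xs, ys))
    for name, (xs, ys) in QUARTET.items()
}

for name, stats in SUMMARY.items():
    print(name, [round(s, 2) for s in stats])
```

The four rows of output are nearly indistinguishable, although scatter plots of the four data sets look completely different.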
SLIDE 9 Why visualization?
INSIGHTS, DATA
interpretation
Visualization
- Is well suited to questions where the solution is too ill-defined to be automatically computed
SLIDE 10 Why visualization?
www.apple.com
Visualization
- Can be further enhanced with interactivity, which is key to dynamic data exploration
Example: Visual Information-Seeking Mantra
Overview first, zoom and filter, then details-on-demand.
SLIDE 11 Why visualization?
Visualization
- Reduces the computational barrier posed by many data analysis workflows
SLIDE 12
2
Cancer Genomics
SLIDE 13 Human genome
Image from UCSF School of Medicine Office of Educational Technology
SLIDE 14 Cancer – disease of the genome
Li Ding et al. Nature 2012
SLIDE 15 DNA Sequencing
AGCGCAGATACAGACAGGTGAAACAGTACAG
TGACAACAGTACCAAGTCAGAGTCCACATAG
TAGAGGAGAGGCCAACATATAGACAACAGTT
TGACAACAGTACCACAGAGTACATAGAGGAG
AGCGCAGATACAGACAGGTGACAACAGAGAG
Input: DNA prepared from a population of cells from a tissue sample
Output: Millions of sequencing reads
Illumina HiSeq
SLIDE 16 Detecting genomic alterations from sequence
reference  TAGCGCAGATACAGACAGGTGACAACAGAGAGGTTACACCAG
reads        GCGCAGATACAGACAGATGACA
                 AGATACAGACAGGTGACAACAG
                   ATACAGACAGGTGACAACAGAG
                       AGACAGATGACAACAGAGAGGT
                        GACAGGTGACAACAGAGAGGTT
                          CAGATGACAACAGAGAGGTTAC
                           AGATGACAACAGAGAGGTTACA
                            GATGACAACAGAGAGGTTACAC
SLIDE 17 Detecting genomic alterations from sequence
reference  TAGCGCAGATACAGACAGGTGACAACAGAGAGGTTACACCAG
reads        GCGCAGATACAGACAGATGACA
                 AGATACAGACAGGTGACAACAG
                   ATACAGACAGGTGACAACAGAG
                       AGACAGATGACAACAGAGAGGT
                        GACAGGTGACAACAGAGAGGTT
                          CAGATGACAACAGAGAGGTTAC
                           AGATGACAACAGAGAGGTTACA
                            GATGACAACAGAGAGGTTACAC
Mutation
SLIDE 18 Detecting genomic alterations from sequence
reference  TAGCGCAGATACAGACAGGTGACAACAGAGAGGTTACACCAG
reads        GCGCAGATACAGACAGGTGACA
                 AGATACAGACAGGTGACAACAG
                   ATACAGACAGGTGACAACAGAG
                       AGACAGATGACAACAGAGAGGT
                        GACAGGTGACAACAGAGAGGTT
                          CAGATGACAACAGAGAGGTTAC
                           AGATGACAACAGAGAGGTTACA
                            GATGACAACAGAGAGGTTACAC
Mutation (G > A): allele ratio = 0.5
coverage: 8 reads at the mutated position
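The coverage and allele ratio on this slide can be recomputed from the reads themselves. The sketch below is a toy illustration in Python, not how production variant callers such as MutationSeq work: it places each read on the reference at the ungapped offset with the fewest mismatches, then counts bases per column. The function names (`best_offset`, `pileup`, `call_variants`) are hypothetical, and positions are 0-based.

```python
# Reference and reads as shown on slide 18.
REFERENCE = "TAGCGCAGATACAGACAGGTGACAACAGAGAGGTTACACCAG"
READS = [
    "GCGCAGATACAGACAGGTGACA",
    "AGATACAGACAGGTGACAACAG",
    "GACAGGTGACAACAGAGAGGTT",
    "ATACAGACAGGTGACAACAGAG",
    "AGACAGATGACAACAGAGAGGT",
    "CAGATGACAACAGAGAGGTTAC",
    "AGATGACAACAGAGAGGTTACA",
    "GATGACAACAGAGAGGTTACAC",
]


def best_offset(read, reference):
    """Ungapped placement of `read` on `reference` with the fewest mismatches."""
    offsets = range(len(reference) - len(read) + 1)
    return min(offsets, key=lambda o: sum(a != b for a, b in zip(read, reference[o:])))


def pileup(reads, reference):
    """Collect, for every reference position, the read bases aligned to it."""
    columns = [[] for _ in reference]
    for read in reads:
        start = best_offset(read, reference)
        for i, base in enumerate(read):
            columns[start + i].append(base)
    return columns


def call_variants(reads, reference):
    """Map position -> (ref base, alt base, coverage, allele ratio) for mismatching columns."""
    calls = {}
    for pos, bases in enumerate(pileup(reads, reference)):
        alts = [b for b in bases if b != reference[pos]]
        if alts:
            alt = max(set(alts), key=alts.count)  # most frequent non-reference base
            calls[pos] = (reference[pos], alt, len(bases), alts.count(alt) / len(bases))
    return calls


CALLS = call_variants(READS, REFERENCE)
print(CALLS)
```

For these reads the only mismatching column is a G to A change covered by 8 reads, 4 of which carry the A, which gives the allele ratio of 0.5 shown on the slide.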
SLIDE 19 Genomic alterations
Mutation (e.g. G to A)
Copy number (e.g. deletion)
Rearrangement (e.g. translocation)
SLIDE 20
Revolution in DNA sequencing technologies
SLIDE 21 The promise of data
Green E. et al. Nature. February 10, 2011
SLIDE 22 Cancer genomics data interpretation
Computer automation
To predict diverse genomic alterations
- MutationSeq (Ding et al. Bioinformatics 2012)
- Titan (Ha et al. Genome Research 2014)
- deStruct
Human expert
To integrate and interpret these alterations together with relevant patient metadata
Diagram labels: G/A, deletion, translocation
SLIDE 23 Cancer genomics data interpretation
Computer automation
- MutationSeq (Ding et al. Bioinformatics 2012)
- Titan (Ha et al. Genome Research 2014)
- deStruct
Human expert
Need interactive visualization tools to facilitate the human component and complement the computational one
Diagram labels: G/A, deletion, translocation
SLIDE 24
3
Cancer Genomics Visualization
SLIDE 25 Visualizing multidimensional cancer genomics data
Many tools for many tasks
Michael P Schroeder, Abel Gonzalez-Perez and Nuria Lopez-Bigas
REVIEW
Schroeder et al. Genome Medicine 2013, 5:9 http://genomemedicine.com/content/5/1/9
Figure panels: Matrix heatmaps, Genomic coordinates, Networks
SLIDE 26 Visualizing multidimensional cancer genomics data
Schroeder et al. Genome Medicine 2013, 5:9 http://genomemedicine.com/content/5/1/9
Many tools for many tasks
SLIDE 27
http://www.cbioportal.org
SLIDE 28
SLIDE 29
SLIDE 30
Key Feature 1 Flexible integration of views
SLIDE 31 Integrate multiple data types into one view
Example analysis: Examine a mutation in its copy number context
deletion, mutation
SLIDE 32 Integrate multiple data types into one view
Example analysis: Examine a mutation in its copy number context
mutations, copy number
SLIDE 33 Compare data filters on a single data set
MutationSeq predictions
Example analysis: Examine impact of MutationSeq probability threshold on coverage versus allele ratio distribution
SLIDE 34 Explore views of different data types
MutationSeq predictions, Titan copy number predictions
Example analysis: Examine both the mutations and copy number alterations for a given sample
SLIDE 35 Components
d  Data: sample(s) + data type
v  View: visual representation
Region Filter
Data Filter
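One way to picture how these four components could fit together is as a small data model. The sketch below is an assumption made for illustration, not the platform's actual schema: the class names, the `Layer` link object, the view type names and the `min_probability` filter key are all hypothetical. Filters are attached to the link between a view and its data so that one data set can be shown under two different filters.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Data:
    """'d' on the slides: sample(s) plus a single data type."""
    samples: Tuple[str, ...]
    data_type: str


@dataclass
class Layer:
    """Hypothetical link between a view and one data set, carrying the filters."""
    data: Data
    data_filter: Dict[str, float] = field(default_factory=dict)
    region_filter: Optional[Tuple[str, int, int]] = None  # (chrom, start, end)


@dataclass
class View:
    """'v' on the slides: a visual representation of one or more data sets."""
    view_type: str
    layers: List[Layer]


mutations = Data(("SA091",), "mutation")
copy_number = Data(("SA091",), "copy_number")

# One view, two data types (v d d): a mutation in its copy number context.
integrated = View("genome_track", [Layer(mutations), Layer(copy_number)])

# Two views of the same data (v d v): compare two probability thresholds.
comparison = [
    View("scatter", [Layer(mutations, {"min_probability": 0.5})]),
    View("scatter", [Layer(mutations, {"min_probability": 0.9})]),
]

# Two views, two data types (v v d d): explore each data type in its own view.
side_by_side = [View("scatter", [Layer(mutations)]), View("heatmap", [Layer(copy_number)])]
```

Keeping `Data` immutable and shared makes it straightforward to tell when two views are looking at the same underlying records.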
SLIDE 36 Integrate multiple data types into one view
v d d
mutations, copy number
SLIDE 37 Compare data filters on a single data set
v d v
MutationSeq predictions
SLIDE 38 Explore views of different data types
v v d d
MutationSeq predictions, Titan copy number predictions
SLIDE 39 Interface
Web application implemented using D3.js
SLIDE 40 Create
Select a predefined structure
SLIDE 41 Create
Add to an existing structure
SLIDE 42 Define Data
Sample(s): query by project name / tumour type / sample id
Single data type: e.g. mutations, copy number, etc.
SLIDE 43 Filter Data
Data filters depend on selected data type
SLIDE 44 Filter Regions
Limit the view to genes or regions of interest
SLIDE 45 Select a View
View types depend on selected data type
SLIDE 46 Adjust View
SLIDE 47 Inspect/Modify
Key Feature 2 Dynamic linking between views
SLIDE 49 Dynamically link views of different data types
v v d d
MutationSeq predictions, Titan copy number predictions
SLIDE 50 Dynamically link views of different data types
v v d d
SLIDE 51 Dynamically link views of different data types
v v d d
deletion, mutation
SLIDE 52
Key Feature 3 Scalability
SLIDE 53 Research on big data visualization must address two major challenges: perceptual and interactive scalability
Zhicheng Liu, Biye Jiang, Jeffrey Heer. imMens, EuroVis 2013
SLIDE 54
Interactive scalability
How to enable dynamic querying and rendering of millions of data points in real time?
SLIDE 55 Search
- Optimized for text search across documents
- All fields are indexed for fast retrieval (bag-of-terms approach)
- Query performance is a function of the number of query matches, not the total data set size
- Scales well as the data set size grows
- Appropriate for load-once-read-many workflows
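The reason query time tracks the number of matches rather than the data set size is the inverted index: every (field, value) term points directly at the documents that contain it. The following Python sketch is a deliberately simplified model of that idea, not a description of Lucene's internals; `build_index`, `search` and the underscore field names are illustrative assumptions.

```python
from collections import defaultdict

# A few toy records using the field values shown later in the deck.
RECORDS = [
    {"type": "mutation", "sample_id": "SA091", "chrom": "1", "position": 104589},
    {"type": "CNV", "sample_id": "SA091", "chrom": "1", "start": 103062, "end": 109114, "state": "GAIN"},
    {"type": "CNV", "sample_id": "SA091", "chrom": "2", "start": 69064, "end": 89119, "state": "DEL"},
]


def build_index(records):
    """Map every (field, value) term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, record in enumerate(records):
        for field_name, value in record.items():
            index[(field_name, value)].add(doc_id)
    return index


def search(index, **terms):
    """Return ids of documents matching all terms by intersecting posting sets."""
    postings = [index.get((name, value), set()) for name, value in terms.items()]
    return sorted(set.intersection(*postings)) if postings else []


INDEX = build_index(RECORDS)
print(search(INDEX, type="CNV", chrom="1"))
```

Each lookup touches only the posting sets for the requested terms, so adding unrelated documents to the index does not slow the query down. Building the index is the expensive step, which is why the approach suits load-once-read-many workflows.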
SLIDE 56 Elasticsearch
- Chosen for ease of use (built on top of Apache Lucene)
- Benefits include:
  - Built-in support for distributed data (manages shards across nodes)
  - Extensive caching
  - Sophisticated query language (DSL)
  - REST API
SLIDE 57 Storing data
Relational Database    Elasticsearch
Database               Index
Table                  Type
Row                    Document
Column                 Field
SLIDE 58 Storing data
Documents (records) and fields
mutation
sample id: SA091; chrom: 1; position: 104,589; ref_allele: A; alt_allele: T; probability: 0.91
CNV
sample id: SA091; chrom: 1; start: 103,062; end: 109,114; state: GAIN
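Records like these map naturally onto JSON documents, and a data filter plus a region filter becomes a structured query. The Python sketch below writes the two documents as dictionaries and builds a request body using `bool`, `term` and `range` clauses from the Elasticsearch Query DSL as found in recent versions; the version used for this work may have needed a different form. The field name `sample_id`, the thresholds, and the local `matches` helper (which only mimics this restricted query shape so the example runs without a server) are assumptions.

```python
import json

MUTATION_DOC = {"sample_id": "SA091", "chrom": "1", "position": 104589,
                "ref_allele": "A", "alt_allele": "T", "probability": 0.91}
CNV_DOC = {"sample_id": "SA091", "chrom": "1", "start": 103062,
           "end": 109114, "state": "GAIN"}


def mutation_query(sample_id, chrom, start, end, min_probability):
    """Request body: mutations of one sample in a region above a probability threshold."""
    return {"query": {"bool": {"filter": [
        {"term": {"sample_id": sample_id}},
        {"term": {"chrom": chrom}},
        {"range": {"position": {"gte": start, "lte": end}}},
        {"range": {"probability": {"gte": min_probability}}},
    ]}}}


def matches(doc, body):
    """Local stand-in that evaluates only the term/range filters built above."""
    for clause in body["query"]["bool"]["filter"]:
        (kind, spec), = clause.items()
        (field_name, cond), = spec.items()
        value = doc.get(field_name)
        if kind == "term":
            ok = value == cond
        else:  # "range"
            ok = value is not None and cond.get("gte", value) <= value <= cond.get("lte", value)
        if not ok:
            return False
    return True


QUERY = mutation_query("SA091", "1", 100000, 110000, 0.9)
print(json.dumps(QUERY, indent=2))
print(matches(MUTATION_DOC, QUERY), matches(CNV_DOC, QUERY))
```

Changing only `min_probability` and re-running the query is the kind of interaction behind comparing data filters on a single data set.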
SLIDE 59 Storing data
Index
mutation
sample id: SA091; chrom: 1; position: 104,589; ref_allele: A; alt_allele: T; probability: 0.91
CNV
sample id: SA091; chrom: 1; start: 103,062; end: 109,114; state: GAIN
mutation
sample id: SA091; chrom: 1; position: 104,589; ref_allele: A; alt_allele: T; probability: 0.95
CNV
sample id: SA091; chrom: 5; start: 2,062; end: 9,199; state: GAIN
CNV
sample id: SA091; chrom: 2; start: 69,064; end: 89,119; state: DEL
mutation
sample id: SA091; chrom: 2; position: 19,586; ref_allele: G; alt_allele: G
SLIDE 60 Handling relational data
Challenge
- Search is not optimized for handling relationships between records
- To support coordinated views, we need to relate one data type to another
- Example: identify which mutations fall within which copy number alteration based on genomic coordinates
SLIDE 61 Handling relational data
Solution
- Exploit search engine's very fast retrieval of terms
- Denormalize records to contain necessary terms to support real-time retrieval of information requested by the frontend
- Example: a mutation record contains information about its overlapping copy number record
mutation
sample id: SA091; chrom: 1; position: 104,589; probability: 0.91; events: (type: titan; chrom: 1; start: 102,005; end: 117,902; state: deletion)
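Denormalization of this kind happens at load time: before indexing, each mutation record is annotated with the copy number events that overlap it, so no join is needed at query time. The Python sketch below shows one way to do that under stated assumptions. The function names and the `sample_id` field name are hypothetical, and the linear scan over copy number records is for clarity only; a real loader handling hundreds of millions of events would need a sorted sweep or an interval index.

```python
def overlapping_events(mutation, cnvs):
    """Copy number records on the same sample and chromosome that span the mutation."""
    return [
        {"type": "titan", "chrom": c["chrom"], "start": c["start"],
         "end": c["end"], "state": c["state"]}
        for c in cnvs
        if c["sample_id"] == mutation["sample_id"]
        and c["chrom"] == mutation["chrom"]
        and c["start"] <= mutation["position"] <= c["end"]
    ]


def denormalize(mutations, cnvs):
    """Return copies of the mutation records with an embedded `events` list."""
    return [dict(m, events=overlapping_events(m, cnvs)) for m in mutations]


# Values taken from the example record on the slide, plus one non-overlapping CNV.
MUTATIONS = [{"sample_id": "SA091", "chrom": "1", "position": 104589, "probability": 0.91}]
CNVS = [
    {"sample_id": "SA091", "chrom": "1", "start": 102005, "end": 117902, "state": "deletion"},
    {"sample_id": "SA091", "chrom": "2", "start": 69064, "end": 89119, "state": "deletion"},
]

DOCS = denormalize(MUTATIONS, CNVS)
print(DOCS[0]["events"])
```

The trade-off is a larger index and more work when data is loaded, in exchange for a single term lookup when a linked view asks which copy number state a mutation sits in.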
SLIDE 62 Work-in-progress
- Current index contains 800 million genomic events
- Queries sufficiently fast to support interaction
- Working to standardize our data loaders and develop a query API for community use
SLIDE 63
Perceptual scalability
What to do when we have more data points than pixels?
SLIDE 64 Scalable views
- Design views to present meaningful summaries
- Sample or smooth data while preserving potentially relevant outliers
- Exploit optimized Elasticsearch aggregation functions (e.g. counts in heatmaps computed during search)
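The idea behind aggregated views is to return one number per pixel-sized bin instead of one record per data point. The Python sketch below does this binning client-side to make the arithmetic concrete; on the platform the equivalent counts would come from the search engine's aggregations, and this code is not that implementation. The function name `bin_counts` and the sample positions are made up for illustration.

```python
from collections import Counter


def bin_counts(positions, region_start, region_end, n_bins):
    """Count positions per bin over the half-open region [region_start, region_end)."""
    span = region_end - region_start
    counts = Counter()
    for p in positions:
        if region_start <= p < region_end:
            # Integer arithmetic keeps the bin index exact and within 0..n_bins-1.
            counts[(p - region_start) * n_bins // span] += 1
    return [counts.get(i, 0) for i in range(n_bins)]


# Five events summarized into ten bins, e.g. for a ten-pixel-wide heatmap row.
POSITIONS = [5, 15, 17, 250, 999]
COUNTS = bin_counts(POSITIONS, 0, 1000, 10)
print(COUNTS)
```

The payload sent to the browser then depends on the number of bins, not on how many events fall inside the region.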
SLIDE 65
4
Summary
SLIDE 66 Summary
- Prototyped a flexible web-based visualization platform to tackle the diverse visualization demands of cancer genomics; it supports data integration and comparison
- Search engine technology is proving useful for enabling responsive interactive querying and linking between views
- Visualizations designed with scalability in mind (aggregate, sample, or filter where possible)
- Open visualization challenges, for example:
  - Best way to summarize genomic alterations of different scales across tens or hundreds of genomes
  - How to additionally represent frequency of each genomic alteration within a tumour sample and over time
SLIDE 67 Acknowledgements
Sohrab Shah
Samuel Aparicio, David Huntsman, Marco Marra, Janessa Laskin
Michael Smith Genome Sciences Centre
British Columbia Cancer Agency, Vancouver, Canada
Tom Jin, Kevin Wagner, Daniel Machev, Kelsey Hamer, Ali Bashashati
Shah Lab Development Team