CURRENT CHALLENGES IN GENOMIC DATA VISUALIZATION
Cydney Nielsen
BC Cancer Agency Genome Sciences Centre
Vancouver, Canada
The Data Deluge
~$5,000 in 2001 ~10¢ in 2011
Sequencing Experiments
De novo assembly Re-sequencing Enrichment
[Figure: short sequence reads (e.g. AGCTTCAGATGGACAGATAA) shown for each experiment type; panel labels: Reference Genome, Reference Genome, Genome Assembly]
Drew Sheneman, New Jersey - The Newark Star Ledger
Challenge 1
Large number of samples for comparison
“To systematically characterize the genomic changes in hundreds of tumors…and thousands of samples over the next five years”
The Cancer Genome Atlas, www.cancergenome.nih.gov
Genome Browsers
Stacked data tracks along a common genome x-axis
Axes: genome coordinate (x), data samples (y)
UCSC Cancer Genomics Heatmaps
Glioblastoma Copy Number Abnormality, Agilent 244A array (n=200)
Tumor vs normal
Gender
Recurrent deletion of all or part of chromosome 10, peak at PTEN locus
Axes: genome coordinate (x), data samples (y)
Zhu et al., Nature Methods, 2009
Heatmap provides a more condensed view
Challenge 1
Large number of samples for comparison
Consider what information is needed
e.g. replace the raw data with a biologically meaningful summary, such as significant change between samples
UCSC Cancer Genomics Heatmaps
Glioblastoma Copy Number Abnormality, Agilent 244A array (n=200)
Tumor vs normal
Gender
Recurrent deletion of all or part of chromosome 10, peak at PTEN locus
Zhu et al., Nature Methods, 2009
Example: Summary view (column averages)
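A minimal sketch of the column-average summary, assuming the heatmap is held as a matrix with one row per sample and one column per genomic bin. The function name and the values are hypothetical and are not taken from the UCSC Cancer Genomics Heatmaps data.

```python
def column_averages(matrix):
    """Return the mean of each column (genomic bin) across all rows (samples)."""
    if not matrix:
        return []
    n_samples = len(matrix)
    # zip(*matrix) walks the matrix column by column.
    return [sum(column) / n_samples for column in zip(*matrix)]


# Hypothetical copy-number values: 4 samples x 4 genomic bins.
samples = [
    [0.0, -1.0, -0.5, 0.25],
    [0.0, -0.5, -1.0, 0.0],
    [0.5, -1.0, -1.0, 0.25],
    [-0.5, -1.5, -0.5, 0.5],
]

summary = column_averages(samples)
print(summary)
```

The single summary row replaces many stacked sample rows, so a recurrent loss shows up as a consistently negative average in the affected bins. A plain mean is only one possible summary; a significance score per bin would follow the same shape.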
Challenge 2
Large number of data types
[Figure: SNU-C1 (colorectal), Chr 15. Tracks plotted against genomic location (Mb): copy number and allelic ratio. Rearrangements are drawn by type (deletion-type, tandem dup-type, head-to-head inverted, tail-to-tail inverted) and by orientation (non-inverted, inverted).]
Stephens et al., Cell, 2011
Genomic rearrangements in cancer (complex representation)
Keane et al., Nature, 2011
[Figure, panels a and b: circular plot of chromosomes 1-19 and X for each strain. Strains: CAST/EiJ, WSB/EiJ, PWK/PhJ, SPRET/EiJ, 129P2/OlaHsd, 129S1/SvImJ, 129S5/SvEvBrd, A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6NJ, CBA/J, DBA/2J, LP/J, NOD/ShiLtJ, NZO/HILtJ. Legend: SNPs, SVs, TEs, uncallable.]
17 mouse genomes (more compact representation)
Still difficult to represent many data types in a general tool
Challenge 2
Large number of data types
Compact, customized data encoding
[Figure, panels (a) and (b): inversion event in a human lymphoma genome; reference human genome]
Nielsen et al. Best Paper Award at InfoVis 2009
ABySS-Explorer
Represents sequence
- connectivity
- strand
- length
- mapping on reference
Interactively access
- sequence coverage
- scaffolding
Challenge 3
Genomic features are sparse
Genome Browsers
LOCAL VIEW
Human chr1, 1 pt corresponds to 480 kb, which is larger than 98% of all human genes!
- Martin Krzywinski
[Figure, panels a and b: chromatin states (1-9) along chromosome 3L. Annotated regions: pericentromeric heterochromatin, cluster of small expressed genes, PcG domains, heterochromatin-like domain, open chromatin domain.]
Hilbert Curve
GLOBAL VIEW
Kharchenko et al., Nature, 2011
Anders, Bioinformatics, 2009
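A minimal sketch of how a Hilbert curve folds a linear genome axis into a square, assuming the chromosome has already been split into 4^order equal bins. This is the textbook index-to-coordinate conversion, not code from the tools cited above; the function name is hypothetical.

```python
def hilbert_d2xy(order, d):
    """Map bin index d (0 .. 4**order - 1) to (x, y) on a 2**order-wide grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the quadrant so that sub-curves join end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


# Place the first 8 bins of a 64-bin chromosome on an 8 x 8 grid.
for bin_index in range(8):
    print(bin_index, hilbert_d2xy(3, bin_index))
```

Neighbouring bins on the genome always land in neighbouring grid cells, which is why a sparse feature that is a sliver on a linear track becomes a visible patch in the global view.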
Challenge 3
Genomic features are sparse
Need both overview and detail
Functional axis (perhaps not full genome)
H3K4me3 H3K9Ac H3K4me1 H3K36me3 H3K27me3 H3K9me3 MeDIP MRE
- 1. Focus on regions of interest (e.g. transcriptional start sites)
- 2. Extract data matrices
- 3. Cluster matrices
- 4. Interactive cluster visualization
Spark – a genomic data exploration tool
Nielsen et al. in preparation
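A rough sketch of the first three steps (regions of interest, data matrices, clustering), assuming each data track is a per-base list of signal values and each region is a (start, end) pair. This is not Spark's implementation: the function names, the toy signal values, and the choice of plain k-means seeded with the first k regions are all assumptions made for illustration.

```python
def extract_matrix(tracks, start, end, n_bins):
    """Average each track's signal into n_bins equal-width bins over [start, end)."""
    width = (end - start) / n_bins
    matrix = []
    for signal in tracks:
        row = []
        for b in range(n_bins):
            lo = start + int(b * width)
            hi = start + int((b + 1) * width)
            window = signal[lo:hi]
            row.append(sum(window) / len(window))
        matrix.append(row)
    return matrix


def flatten(matrix):
    """Concatenate the rows of a tracks x bins matrix into one feature vector."""
    return [value for row in matrix for value in row]


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(vectors, k, n_iter=20):
    """Plain k-means; centroids start at the first k vectors (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Two hypothetical signal tracks over 32 bases and four regions of interest.
track_a = [0, 0, 9, 9, 9, 9, 0, 0] + [1] * 8 + [0, 0, 8, 8, 8, 8, 0, 0] + [1] * 8
track_b = [0] * 8 + [5] * 8 + [0] * 8 + [5] * 8
regions = [(0, 8), (8, 16), (16, 24), (24, 32)]

vectors = [flatten(extract_matrix([track_a, track_b], s, e, n_bins=4))
           for s, e in regions]
labels = kmeans(vectors, k=2)
print(labels)
```

Regions that share a signal pattern across tracks receive the same label, so the interactive step only has to display one summary per cluster rather than every region along the genome.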
Challenge 4
No longer one genome but many
Single nucleotide variation
Ossowski et al. Genome Research, 2008
Single nucleotide variation
Integrative Genomics Viewer (IGV)
Robinson et al. Nature Biotechnology, 2011
Bhutkar et al., Genetics, 2008
Structural variation
Challenge 4
No longer one genome but many
Capture variation on a graph
Comeau et al., Mol. Biol. Evol., 2010
Sequence variation on a graph
Users may require more time to learn how to interpret graph representations, but such graphs are likely to scale better and may prove more powerful for analysis
Paten et al., Genome Research, 2011
Sequence variation on a graph
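A minimal sketch of the idea, assuming a small directed acyclic graph in which each node holds a stretch of sequence and each path spells one genome's version of the region. The node contents and names are hypothetical, and this is a generic illustration rather than the representation used in the cited papers.

```python
# Shared flanking sequence in nodes 1 and 4; nodes 2 and 3 are the two
# alleles of a single-nucleotide variant (hypothetical example data).
nodes = {1: "AGCTTC", 2: "A", 3: "G", 4: "GATGGA"}
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}


def spell_paths(nodes, edges, start, end):
    """Return the sequence spelled by every start-to-end path in the graph."""
    if start == end:
        return [nodes[start]]
    sequences = []
    for nxt in edges[start]:
        for suffix in spell_paths(nodes, edges, nxt, end):
            sequences.append(nodes[start] + suffix)
    return sequences


haplotypes = spell_paths(nodes, edges, start=1, end=4)
print(haplotypes)
```

The shared sequence is stored once and only the variant site branches, which is one reason a graph can scale to many genomes where stacked per-genome tracks cannot.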
Challenge 5
Human Judgement Computational Analysis
Consed Genome Assembly and Finishing Tool
David Gordon and Phil Green
Good example of integrated visualization and computational analysis functionality