CURRENT CHALLENGES IN GENOMIC DATA VISUALIZATION Cydney Nielsen BC - - PowerPoint PPT Presentation



slide-1
SLIDE 1

CURRENT CHALLENGES IN GENOMIC DATA VISUALIZATION

Cydney Nielsen

BC Cancer Agency Genome Sciences Centre Vancouver, Canada

slide-2
SLIDE 2

The Data Deluge

Sequencing cost per megabase: ~$5,000 in 2001, ~10¢ in 2011

slide-3
SLIDE 3

Sequencing Experiments

De novo assembly, Re-sequencing, Enrichment

[Figure: the same set of short sequence reads (e.g. AGCTTCAGATGGACAGATAA, GGCATACAGACTTAGACATA, CCAGACAAGACAGACACAGTA, TACAAGACATAAGCAATACAGA) shown for each experiment type. De novo assembly produces a genome assembly; re-sequencing and enrichment reads are placed on a reference genome.]

slide-4
SLIDE 4

Drew Sheneman, New Jersey - The Newark Star Ledger

slide-5
SLIDE 5

Challenge 1

Large number of samples for comparison

“To systematically characterize the genomic changes in hundreds of tumors…and thousands of samples over the next five years”

The Cancer Genome Atlas, www.cancergenome.nih.gov

slide-6
SLIDE 6

Genome Browsers

Stacked data tracks along a common genome x-axis

x-axis: genome coordinate; y-axis: data samples

slide-7
SLIDE 7


UCSC Cancer Genomics Heatmaps

Glioblastoma Copy Number Abnormality, Agilent 244A array (n=200)

Tumor vs normal; Gender

Recurrent deletion of all or part of chromosome 10, peak at PTEN locus

x-axis: genome coordinate; y-axis: data samples

Zhu et al., Nature Methods, 2009

Heatmap provides a more condensed view

slide-8
SLIDE 8

Challenge 1

Consider what information is needed

e.g. replace with biologically meaningful summary, such as significant change between samples

Large number of samples for comparison
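One way to read this advice, as a minimal sketch: instead of drawing one row per sample, compute a single statistic per genomic region that compares two groups of samples, and plot only that track. The data values, the tumor/normal grouping, and the choice of Welch's t-statistic are illustrative assumptions; the slides do not prescribe a particular test.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic for two independent groups with unequal variances."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def summarize_regions(tumor, normal):
    """Collapse two sample-by-region matrices into one t-statistic per region."""
    n_regions = len(tumor[0])
    return [
        welch_t([row[i] for row in tumor], [row[i] for row in normal])
        for i in range(n_regions)
    ]


# Hypothetical log2 copy-number ratios: rows are samples, columns are regions.
tumor = [[0.1, -1.0, 0.0], [0.0, -1.2, 0.1], [-0.1, -0.8, -0.1]]
normal = [[0.0, 0.1, 0.1], [0.1, -0.1, 0.0], [-0.1, 0.0, -0.1]]

summary = summarize_regions(tumor, normal)
# Region 1 (index 1) stands out with a large negative statistic.
```

The six sample rows collapse to one value per region, so the display cost no longer grows with the number of samples.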

slide-9
SLIDE 9


UCSC Cancer Genomics Heatmaps

Glioblastoma Copy Number Abnormality, Agilent 244A array (n=200)

Tumor vs normal; Gender

Recurrent deletion of all or part of chromosome 10, peak at PTEN locus

Zhu et al., Nature Methods, 2009

Example: Summary view (column averages)
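A column-average summary can be sketched in a few lines, assuming the heatmap is stored as a matrix with one row per sample and one column per genome position. The matrix values and the function name are hypothetical.

```python
def column_averages(matrix):
    """Average each column (genome position) over all rows (samples)."""
    n_samples = len(matrix)
    return [sum(column) / n_samples for column in zip(*matrix)]


# Hypothetical copy-number values: 3 samples (rows) x 3 positions (columns).
heatmap = [
    [0.0, -1.0, 0.5],
    [0.2, -0.8, 0.5],
    [-0.2, -1.2, 0.5],
]

profile = column_averages(heatmap)
# The middle position averages to about -1.0, i.e. a recurrent loss.
```

The resulting profile has the same width as the heatmap but a height of one row, regardless of how many samples were loaded.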

slide-10
SLIDE 10

Challenge 2

Large number of data types

slide-11
SLIDE 11

[Figure: copy number and allelic ratio plotted against genomic location (Mb), with rearrangements drawn in non-inverted or inverted orientation. Rearrangement types: deletion-type, tandem dup-type, head-to-head inverted, tail-to-tail inverted.]

SNU-C1 (colorectal): Chr 15

Stephens et al., Cell, 2011

Genomic rearrangements in cancer (complex representation)

slide-12
SLIDE 12

Keane et al., Nature, 2011

[Figure, panels a and b: chromosome axes labeled 1 to 19 and X. Strains: CAST/EiJ, WSB/EiJ, PWK/PhJ, SPRET/EiJ, 129P2/OlaHsd, 129S1/SvImJ, 129S5/SvEvBrd, A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6NJ, CBA/J, DBA/2J, LP/J, NOD/ShiLtJ, NZO/HlLtJ. Legend: SNPs, SVs, TEs, Uncallable. Scale labels: >100,000, 742, 179, 836.]

17 mouse genomes (more compact representation)

Still difficult to represent many data types in a general tool

slide-13
SLIDE 13

Challenge 2

Large number of data types

Compact, customized data encoding

slide-14
SLIDE 14

[Figure, panels (a) and (b): reference human genome; inversion event in a human lymphoma genome]

Nielsen et al. Best Paper Award at InfoVis 2009

ABySS-Explorer

Represents sequence

  • connectivity
  • strand
  • length
  • mapping on reference

Interactively access

  • sequence coverage
  • scaffolding
slide-15
SLIDE 15

Challenge 3

Genomic features are sparse

slide-16
SLIDE 16

Genome Browsers

LOCAL VIEW

Human chr1, 1 pt corresponds to 480 kb, which is larger than 98% of all human genes!

  • Martin Krzywinski
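The scale on this slide can be checked with simple arithmetic. The sketch below assumes the GRCh37/hg19 length of chromosome 1 (249,250,621 bp) and uses a hypothetical 30 kb gene; only the 480 kb per point figure comes from the slide.

```python
CHR1_LENGTH_BP = 249_250_621  # assumption: chr1 length in GRCh37/hg19
KB_PER_POINT = 480            # from the slide: 1 pt corresponds to 480 kb
POINTS_PER_INCH = 72          # typographic points per inch


def width_in_points(length_bp, kb_per_point):
    """Drawn width, in points, of a feature of the given length."""
    return length_bp / (kb_per_point * 1000)


chr1_points = width_in_points(CHR1_LENGTH_BP, KB_PER_POINT)
chr1_inches = chr1_points / POINTS_PER_INCH

# Hypothetical 30 kb gene: far narrower than a single point at this scale.
gene_points = width_in_points(30_000, KB_PER_POINT)
```

Under these assumptions the whole chromosome spans roughly 519 points (a little over 7 inches), while the example gene occupies about 0.06 points, which is why individual genes vanish in a whole-chromosome local view.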
slide-17
SLIDE 17

[Figure, panels a and b: chromatin states 1 to 9 on chromosome 3L, with strands labeled 5′ to 3′. Annotated regions: pericentromeric heterochromatin, cluster of small expressed genes, PcG domains, heterochromatin-like domain, open chromatin domain.]

Hilbert Curve

GLOBAL VIEW

Kharchenko et al., Nature, 2011 Anders, Bioinformatics, 2009
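A Hilbert curve display folds the one-dimensional genome axis into a square so that positions close together on the chromosome stay close together on screen. The sketch below uses the standard index-to-coordinate conversion for a Hilbert curve; the function names and the binning of genome positions into curve cells are illustrative assumptions, not the implementation used in the cited work.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order x 2**order grid."""
    x = y = 0
    t = d
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the quadrant before rotating it.
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def position_to_cell(pos, chrom_length, order):
    """Bin a 0-based genome position into one of 4**order cells, then place it on the curve."""
    n_cells = 4 ** order
    index = pos * n_cells // chrom_length
    return hilbert_d2xy(order, index)


# An order-3 curve has 64 cells on an 8 x 8 grid.
path = [hilbert_d2xy(3, d) for d in range(64)]
```

Consecutive cells in `path` are always grid neighbours, which is the property that keeps a contiguous genomic domain visible as a compact patch in the global view.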

slide-18
SLIDE 18

Challenge 3

Genomic features are sparse

Need both overview and detail

Functional axis (perhaps not full genome)

slide-19
SLIDE 19

H3K4me3 H3K9Ac H3K4me1 H3K36me3 H3K27me3 H3K9me3 MeDIP MRE

  • 1. Focus on regions of interest (e.g. transcriptional start sites)
  • 2. Extract data matrices
  • 3. Cluster matrices
  • 4. Interactive cluster visualization

Spark – a genomic data exploration tool

Nielsen et al. in preparation
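The first three steps listed for Spark can be sketched as follows. This is not Spark's implementation: the single synthetic signal track, the window and bin sizes, and the use of plain k-means seeded with the first k rows are all assumptions made for illustration.

```python
def extract_window(signal, center, half_width, n_bins):
    """Average a per-base signal into n_bins bins across a window around center."""
    start = center - half_width
    bin_size = (2 * half_width) // n_bins
    return [
        sum(signal[start + b * bin_size:start + (b + 1) * bin_size]) / bin_size
        for b in range(n_bins)
    ]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, n_iter=10):
    """Plain k-means seeded with the first k rows; returns one cluster label per row."""
    centroids = [list(row) for row in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Synthetic per-base signal with enrichment upstream of two sites and downstream of two others.
signal = [0.0] * 400
for start in (30, 230, 150, 350):
    signal[start:start + 20] = [1.0] * 20

regions = [50, 150, 250, 350]  # 1. hypothetical regions of interest (e.g. TSS positions)
matrix = [extract_window(signal, tss, 20, 4) for tss in regions]  # 2. extract data matrix
labels = kmeans(matrix, k=2)  # 3. cluster the rows
```

Each cluster groups regions with the same signal shape around the site, so a viewer can browse a handful of cluster profiles instead of scanning the genome linearly.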

slide-20
SLIDE 20

Challenge 4

No longer one genome but many

slide-21
SLIDE 21

Single nucleotide variation

Ossowski et al. Genome Research, 2008

slide-22
SLIDE 22

Single nucleotide variation

Integrative Genomics Viewer (IGV)

Robinson et al. Nature Biotechnology, 2011

slide-23
SLIDE 23

Bhutkar et al., Genetics, 2008

Structural variation

slide-24
SLIDE 24

Challenge 4

No longer one genome but many Capture variation on a graph

slide-25
SLIDE 25

Comeau et al., Mol. Biol. Evol., 2010

Sequence variation on a graph

Users may require more time to learn how to interpret graph representations, but such graphs are likely to scale better and may prove more powerful for analysis
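The scaling argument can be illustrated with a toy graph in which each position holds one node per observed base, so shared sequence is stored once and each variant adds only one extra node. This is a minimal sketch restricted to single nucleotide variants; the data structure and function names are hypothetical and do not reproduce the method of any cited paper.

```python
def build_variation_graph(reference, snvs):
    """Build a graph of (position, base) nodes; snvs maps 0-based position to alternate bases."""
    nodes_at = []
    for pos, ref_base in enumerate(reference):
        alleles = [ref_base] + sorted(snvs.get(pos, ()))
        nodes_at.append([(pos, base) for base in alleles])
    edges = {}
    for left, right in zip(nodes_at, nodes_at[1:]):
        for node in left:
            edges[node] = list(right)
    for node in nodes_at[-1]:
        edges[node] = []  # nodes at the last position have no successors
    return nodes_at[0], edges


def enumerate_haplotypes(starts, edges):
    """Spell the sequence of every path from a start node to a node without successors."""
    sequences = []
    stack = [(node, node[1]) for node in starts]
    while stack:
        node, seq = stack.pop()
        if not edges[node]:
            sequences.append(seq)
        for nxt in edges[node]:
            stack.append((nxt, seq + nxt[1]))
    return sorted(sequences)


# Hypothetical 4-base reference with variants at positions 1 and 3.
starts, edges = build_variation_graph("ACGT", {1: {"T"}, 3: {"A"}})
haplotypes = enumerate_haplotypes(starts, edges)
```

In this example six nodes encode four distinct sequences; adding another variant adds one node, while the number of sequences the graph can represent doubles.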

slide-26
SLIDE 26

Paten et al., Genome Research, 2011

Sequence variation on a graph

slide-27
SLIDE 27

Challenge 5

Human Judgement

Computational Analysis

slide-28
SLIDE 28

Consed Genome Assembly and Finishing Tool

David Gordon and Phil Green

Good example of integrated visualization and computational analysis functionality

slide-29
SLIDE 29

Challenge 5

Need to integrate computation

High interactivity, low memory overhead

Avoid storing large data sets locally

Popularity of web-based tools

Evolving sequencing technologies

slide-30
SLIDE 30

Summary

1. Large number of samples for comparison
2. Large number of data types
3. Genomic features are sparse
4. No longer one genome but many
5. Need to integrate computational analysis
