In The Beginning Data. Lots of it. eg. VCF, BAM files In The - - PowerPoint PPT Presentation



SLIDE 1
SLIDE 2

In The Beginning

Data. Lots of it.

  • e.g. VCF, BAM files
SLIDE 3

In The Beginning

  • Goal. Build a web-based interface on top of a fast backend to help navigate and explore the data

esv

SLIDE 4

Origin: Prototype

SLIDE 5

Origin: Challenges

Linking: All views should be interactive

SLIDE 6

Origin: Challenges

Scalability: Creating, editing, and linking should be fast to drive data discovery

SLIDE 7

Origin: Challenges

Interface: Exploring data should be natural, informative, and easy to follow

SLIDE 8

Structures

View: Visual representation

Filter:

  • on genomic positions, genes
  • on data parameters (e.g. threshold, experiment type)

Data: Underlying data source (i.e. by sample ID, project)
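The three structures on this slide can be sketched as plain data classes. This is only an illustration of how a view, a filter, and a data source might relate; all class names, field names, and the `min_probability` parameter are hypothetical and are not taken from the ESV code base.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class DataSource:
    """Underlying data source, selected here by sample ID and project."""
    sample_ids: List[str]
    project: Optional[str] = None


@dataclass
class Filter:
    """Restriction on genomic position and on data parameters."""
    chrom: Optional[str] = None
    start: Optional[int] = None
    end: Optional[int] = None
    params: Dict[str, Any] = field(default_factory=dict)

    def accepts(self, record: Dict[str, Any]) -> bool:
        # Genomic position constraints.
        if self.chrom is not None and record.get("chrom") != self.chrom:
            return False
        pos = record.get("position")
        if self.start is not None and (pos is None or pos < self.start):
            return False
        if self.end is not None and (pos is None or pos > self.end):
            return False
        # Data parameter constraint (hypothetical threshold).
        min_prob = self.params.get("min_probability")
        if min_prob is not None and record.get("probability", 0.0) < min_prob:
            return False
        return True


@dataclass
class View:
    """A visual representation bound to one data source and one filter."""
    name: str
    kind: str  # e.g. "table" or "scatter" (illustrative values)
    source: DataSource
    record_filter: Filter = field(default_factory=Filter)

    def visible(self, records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        return [
            r for r in records
            if r.get("sample_id") in self.source.sample_ids
            and self.record_filter.accepts(r)
        ]


records = [
    {"sample_id": "DG1155", "chrom": "01", "position": 104589, "probability": 0.91},
    {"sample_id": "DG1155", "chrom": "01", "position": 200000, "probability": 0.95},
    {"sample_id": "OTHER", "chrom": "01", "position": 104589, "probability": 0.99},
]
view = View(
    name="mutations",
    kind="table",
    source=DataSource(sample_ids=["DG1155"]),
    record_filter=Filter(chrom="01", start=103062, end=109114,
                         params={"min_probability": 0.9}),
)
shown = view.visible(records)
```

Keeping the filter separate from the view is one way to let several views share or exchange filters, which is what linking between views requires.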

SLIDE 9

Progress

Major Highlights

  • Redesigned interface and editor
  • New query engine
  • Improved views / visualizations to support linking and interaction
  • Supertable
  • Data denormalization contributions
SLIDE 10

Live Demo

ESV Demonstration

SLIDE 11

Data Denormalization: Why?

  • ElasticSearch is an extremely fast text-search engine, but it is schema-free
      ○ No set column names, no defined structure
  • How do we find relations then?
SLIDE 12

Data Denormalization: Why?

TITAN Dataset, MutationSeq Dataset

How do we know which mutations fall within which copy number alterations at a given genomic coordinate?

SLIDE 13

Data Denormalization: How?

MutationSeq
  sample id: DG1155
  chrom: 01
  position: 104,589
  ref_allele: A
  alt_allele: T
  probability: 0.91
  ...

TITAN
  sample id: DG1155
  chrom: 01
  start: 103,062
  end: 109,114
  state: GAIN
  ...

SLIDE 14

Data Denormalization: How?

MutationSeq
  sample id: DG1155
  chrom: 01
  position: 104,589
  ref_allele: A
  alt_allele: T
  probability: 0.91
  events: {...}
  ...

TITAN
  sample id: DG1155
  chrom: 01
  start: 103,062
  end: 109,114
  state: GAIN
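One way to fill the `events` field is to copy the overlapping TITAN segment into each MutationSeq record before indexing. The sketch below assumes records are plain dictionaries with the field names shown on the slide (with `sample_id` written with an underscore and positions as integers); the function names are hypothetical, and the linear scan is chosen for clarity rather than speed.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def overlaps(mutation: Record, segment: Record) -> bool:
    """True if the mutation lies inside the segment for the same sample and chromosome."""
    return (
        mutation["sample_id"] == segment["sample_id"]
        and mutation["chrom"] == segment["chrom"]
        and segment["start"] <= mutation["position"] <= segment["end"]
    )


def denormalize(mutations: List[Record], segments: List[Record]) -> List[Record]:
    """Return copies of the mutation records with the overlapping segment embedded."""
    docs = []
    for mutation in mutations:
        doc = dict(mutation)  # copy so the input records are left untouched
        match = next((s for s in segments if overlaps(mutation, s)), None)
        if match is not None:
            doc["events"] = {
                key: match[key] for key in ("chrom", "start", "end", "state")
            }
        docs.append(doc)
    return docs


mutations = [{
    "sample_id": "DG1155", "chrom": "01", "position": 104589,
    "ref_allele": "A", "alt_allele": "T", "probability": 0.91,
}]
segments = [{
    "sample_id": "DG1155", "chrom": "01",
    "start": 103062, "end": 109114, "state": "GAIN",
}]
docs = denormalize(mutations, segments)
```

For large inputs, sorting segments by start position or using an interval index would avoid comparing every mutation against every segment.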

SLIDE 15

Data Denormalization: How?

MutationSeq
  sample id: DG1155
  chrom: 01
  position: 104,589
  ref_allele: A
  alt_allele: T
  probability: 0.91
  events: {
    chrom: 01
    start: 103,062
    end: 109,114
    state: GAIN
    ...
  }

  • Unlike Facebook or Twitter, our data is mainly static
  • Exploit ElasticSearch’s very fast query term search
  • Ask questions like: Find me all the TITAN segments that overlap a particular MutationSeq event
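With the segment embedded in each mutation document, a question about both datasets reduces to term and range filters on a single document type. The sketch below only builds a request body in ElasticSearch's query DSL; the field names (`sample_id`, `events.state`, and so on) and the helper name are assumptions, and it further assumes these fields are indexed as exact, non-analyzed values so that term matching behaves as expected.

```python
import json
from typing import Any, Dict


def mutations_in_state(sample_id: str, chrom: str, start: int, end: int,
                       state: str) -> Dict[str, Any]:
    """Build a query body for mutations in a region whose embedded segment has the given state."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"sample_id": sample_id}},
                    {"term": {"chrom": chrom}},
                    # Field of the embedded TITAN segment (hypothetical name).
                    {"term": {"events.state": state}},
                    {"range": {"position": {"gte": start, "lte": end}}},
                ]
            }
        }
    }


body = mutations_in_state("DG1155", "01", 100000, 110000, "GAIN")
payload = json.dumps(body)  # what would be sent as the search request body
```

Each matching document already carries its overlapping segment in `events`, so no second lookup against the TITAN data is needed to answer the question.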

SLIDE 16

Data Denormalization: Result

SLIDE 17

To Infinity and Beyond

  • Applications to other areas of research and/or industry in the future, as ESV was designed to be as general as possible
  • Addition of new datasets/datatypes (e.g. single-sample MutationSeq)
  • User-contributed views and additional default views
SLIDE 18

Summary

Over the past 3 months:

  • Redesigned interface to support integration of complex views
  • Added support to easily add new views
  • Realtime search and filtering through ElasticSearch
  • Integrated and improved views/visualizations
  • Used denormalized data to support linking between any number of views

http://cbioportal.mo.bccrc.ca:8000/

SLIDE 19

Acknowledgements

Sohrab Shah
Cydney Nielsen
Development Team
Daniel Machev
Kelsey Hamer
Ali Bashashati
Kevin Wagner
Shah Lab