


SLIDE 1

Tools, Techniques and Methods for Integrative Data Analytics Joel Saltz MD, PhD Director Center for Comprehensive Informatics

SLIDE 2

Center for Comprehensive Informatics

Contributions

  • Computer Science: Methods and middleware for analysis, classification of very large datasets from low dimensional spatio-temporal sensors; methods to carry out comparisons and change detection between sensor datasets
  • Biomedical: Mine whole slide image datasets to better predict outcome and response to treatments, generate basic insights into pathophysiology and identify new treatment targets
  • CFD: Quantitative characterization of spatio-temporal features generated by large scale simulations, comparisons with experimental results, uncertainty quantification

SLIDE 3

Extreme Spatio-Temporal Data Analytics

  • Leverage exascale data and computer resources to squeeze the most out of image, sensor or simulation data
  • Run lots of different algorithms to derive same features
  • Run lots of algorithms to derive complementary features
  • Data models and data management infrastructure to manage data products, feature sets and results from classification and machine learning algorithms

SLIDE 4

Application Targets

  • Multi-dimensional spatial-temporal datasets
      – Microscopy image analyses
      – Biomass monitoring using satellite imagery
      – Weather prediction using satellite and ground sensor data
      – Large scale simulations
  • Can we analyze 100,000+ microscopy images per hour?
  • Correlative and cooperative analysis of data from multiple sensor modalities and sources
  • What-if scenarios and multiple design choices or initial conditions

SLIDE 5

Core Transformations

  • Data Cleaning and Low Level Transformations
  • Data Subsetting, Filtering, Subsampling
  • Spatio-temporal Mapping and Registration
  • Object Segmentation
  • Feature Extraction, Object Classification
  • Spatio-temporal Aggregation
  • Change Detection, Comparison, and Quantification
SLIDE 6

Digital Pathology Analytics

  • Anaplastic Astrocytoma (WHO grade III)
  • Glioblastoma (WHO grade IV)

SLIDE 7

Morphological Tissue Classification

Nuclei Segmentation Cellular Features

Lee Cooper, Jun Kong

Whole Slide Imaging

SLIDE 8

Whole Slide Imaging: Scale

SLIDE 9

Analysis of Computational Data; Uncertainty Quantification, Comparisons with Experimental Results

SLIDE 10

Pathology Computer Assisted Diagnosis

Shimada, Gurcan, Kong, Saltz

SLIDE 11

Computerized Classification System for Grading Neuroblastoma

  • Background Identification
  • Image Decomposition (Multi-resolution levels)
  • Image Segmentation (EMLDA)
  • Feature Construction (2nd order statistics, Tonal Features)
  • Feature Extraction (LDA) + Classification (Bayesian)
  • Multi-resolution Layer Controller (Confidence Region)

[Flowchart with TRAINING and TESTING paths. Training nodes: Training Tiles, Down-sampling, Segmentation, Feature Construction, Feature Extraction, Classifier Training. Testing nodes: Image Tile, Initialization (I = L), Create Image I(L), Segmentation, Feature Construction, Feature Extraction, Classification, Label. Decision nodes: Background?, Within Confidence Region?, I > 1? Loop step: I = I - 1.]
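
One possible reading of the multi-resolution layer controller on this slide is a coarse-to-fine loop: classify a tile at the coarsest level L, accept the label if the classifier is confident enough, and otherwise move to the next finer level until level 1 is reached. The Python sketch below illustrates that control flow only. The function names, the `classify_level` and `is_background` callbacks, and the use of a single scalar threshold in place of the slide's "confidence region" are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Any, Callable, Sequence, Tuple


def classify_multiresolution(
    pyramid: Sequence[Any],
    classify_level: Callable[[Any, int], Tuple[str, float]],
    confidence_threshold: float = 0.9,
    is_background: Callable[[Any], bool] = lambda image: False,
) -> Tuple[str, int, float]:
    """Coarse-to-fine classification of one image tile (hypothetical sketch).

    pyramid[0] is level 1 (finest) and pyramid[-1] is level L (coarsest).
    classify_level(image, level) returns (label, confidence); it stands in
    for segmentation, feature construction, feature extraction and
    classification at that level.
    """
    level = len(pyramid)  # I = L: start at the coarsest resolution
    if is_background(pyramid[level - 1]):
        return ("background", level, 1.0)
    while True:
        label, confidence = classify_level(pyramid[level - 1], level)
        # Accept when confident enough, or when no finer level remains.
        if confidence >= confidence_threshold or level == 1:
            return (label, level, confidence)
        level -= 1  # I = I - 1: retry at the next finer resolution
```

With a three-level pyramid and a classifier that only becomes confident at level 2, the loop stops at level 2 and never touches the full-resolution image, which is the point of deferring expensive fine-level work.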

SLIDE 12

Direct Study of Relationship Between vs

SLIDE 13

Consensus clustering of morphological signatures

  • Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients
  • Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering

[Figure: Nuclear Features Used to Classify GBMs. Plot axes: # Clusters vs. Silhouette Area; Silhouette Value by Cluster (1, 2, 3).]
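
The co-clustering idea on this slide can be sketched as follows: run K-means many times from different random starts and record, for every pair of samples, the fraction of runs in which the two land in the same cluster. The Python below is a small standard-library illustration under stated assumptions. The function names, the plain Lloyd-style K-means, and the toy data are hypothetical; the study's actual feature vectors, initialization and iteration scheme are not given in the source.

```python
import random


def _nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centers[c])),
    )


def kmeans(points, k, rng, max_iter=100):
    """Plain Lloyd K-means; initial centers are k distinct sample points."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [_nearest(p, centers) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def co_clustering(points, k, n_runs=50, seed=0):
    """Fraction of K-means runs in which each pair of points shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_runs):
        labels = kmeans(points, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / n_runs for c in row] for row in counts]


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (10.0, 10.0), (10.1, 10.0), (10.2, 10.1)]
matrix = co_clustering(points, k=2, n_runs=20)
```

Entries near 1 mark pairs that cluster together consistently and entries near 0 mark pairs that are consistently separated. Repeating this for several values of k gives one way to compare candidate cluster counts.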

SLIDE 14

Clustering identifies three morphological groups

  • Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)
  • Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB)

  • Prognostically-significant (logrank p=4.5e-4)

[Figure: Feature Indices by group (CC, CM, PB); survival curves for CC, CM and PB, axes: Days vs. Survival.]

SLIDE 15

Novel Pathology Modalities

  • Imaging: Excellent Spatial Resolution, Limited Molecular Resolution
  • Genomics: Excellent Molecular Resolution, Limited Spatial Resolution

1000’s of genes

SLIDE 16

SLIDE 17

Extreme DataCutter Prototype

DataCutter

  • Pipeline of filters connected through logical streams
  • In transit processing
  • Flow control between filters and streams
  • Developed 1990s-2000s; led to IBM System S

Extreme DataCutter

  • Two level hierarchical pipeline framework
  • In transit processing
  • Coarse grained components coordinated by a Manager that coordinates work on pipeline stages between nodes
  • Fine grained pipeline operations managed at the node level
  • Both levels employ the filter/stream paradigm
  • Bottom line: everything ends up as DAGs
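
The filter/stream paradigm described above can be illustrated in a few lines of Python: each filter runs in its own thread, filters are connected by bounded queues acting as streams, and a full queue blocks the upstream filter, which gives simple flow control. This is a generic sketch, not DataCutter's API. The names `run_filter`, `run_pipeline` and `stream_capacity` are hypothetical, and it models a linear pipeline on one node rather than the two-level, multi-node design.

```python
import queue
import threading

_END = object()  # sentinel that marks the end of a stream


def run_filter(fn, inbox, outbox):
    """Apply fn to every item read from inbox and write results to outbox."""
    while True:
        item = inbox.get()
        if item is _END:
            outbox.put(_END)  # propagate end-of-stream downstream
            return
        outbox.put(fn(item))


def run_pipeline(source, filters, stream_capacity=4):
    """Run items from source through a linear chain of filters."""
    # One bounded stream before each filter plus one after the last filter.
    streams = [queue.Queue(maxsize=stream_capacity) for _ in range(len(filters) + 1)]
    workers = [
        threading.Thread(target=run_filter, args=(fn, streams[i], streams[i + 1]))
        for i, fn in enumerate(filters)
    ]

    def feed():
        for item in source:
            streams[0].put(item)  # blocks when the first stream is full
        streams[0].put(_END)

    feeder = threading.Thread(target=feed)
    for t in workers + [feeder]:
        t.start()

    results = []
    while True:
        item = streams[-1].get()
        if item is _END:
            break
        results.append(item)

    for t in workers + [feeder]:
        t.join()
    return results


# Two toy filters: add one, then double.
output = run_pipeline(range(10), [lambda x: x + 1, lambda x: x * 2])
```

Because each stream is bounded, a slow downstream filter throttles the filters ahead of it instead of letting data pile up in memory, and results stay in input order because each stage is a single thread.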

SLIDE 18

Extreme DataCutter – Two Level Model

SLIDE 19

Node Level Work Scheduling

SLIDE 20

Brain Tumor Pipeline Scaling on Keeneland (100 Nodes)

SLIDE 21

Structured/Unstructured Grid Calculations with Unpredictable Runtime Dependencies

Key Kernel in Distance Transform, Morphological Reconstruction, Delaunay Triangulation

SLIDE 22

Control Structures for Handling Fine Grained/Runtime Dependent Parallelism in GPUs

Morphological Reconstruction:

8-15 fold speedup vs. one CPU core (Intel i7 2.66 GHz) on NVIDIA C2070 and GTX580 GPUs
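
To show why morphological reconstruction has runtime-dependent parallelism, here is a sequential, CPU-only Python sketch of grayscale reconstruction by dilation using a work queue. A pixel is re-queued only when its value changes, so the amount and location of work depend on the data. This is a textbook-style worklist formulation written for illustration. It assumes 4-connectivity and list-of-lists images, and it is not the GPU implementation whose speedups are reported on this slide.

```python
from collections import deque


def reconstruct_by_dilation(marker, mask):
    """Grayscale morphological reconstruction of mask from marker.

    Both images are equal-sized lists of lists of numbers. The marker is
    clipped to the mask, then values propagate to 4-connected neighbours,
    never exceeding the mask, until nothing changes.
    """
    rows, cols = len(mask), len(mask[0])
    out = [[min(marker[r][c], mask[r][c]) for c in range(cols)] for r in range(rows)]
    # Start with every pixel queued; later only changed pixels are re-queued.
    work = deque((r, c) for r in range(rows) for c in range(cols))
    while work:
        r, c = work.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                candidate = min(out[r][c], mask[nr][nc])
                if candidate > out[nr][nc]:
                    out[nr][nc] = candidate
                    work.append((nr, nc))  # the wavefront advances
    return out


# The marker seeds the left plateau; the zero in the mask blocks propagation.
result = reconstruct_by_dilation([[3, 0, 0, 0, 0]], [[3, 3, 0, 4, 4]])
```

In the example the value 3 spreads across the left plateau of the mask but cannot cross the zero-valued pixel, so the right plateau stays at 0.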

SLIDE 23

“Speedup” relative to single CPU core

SLIDE 24

Large Scale Data Management

  • Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc.
  • Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships
  • Highly optimized spatial query and analyses
  • Implemented in a variety of ways including optimized CPU/GPU, Hadoop/HDFS and IBM DB2
SLIDE 25

Spatial Centric – Pathology Imaging “GIS”

  • Point query: human marked point inside a nucleus
  • Window query: return markups contained in a rectangle
  • Spatial join query: algorithm validation/comparison
  • Containment query: nuclear feature aggregation in tumor regions
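
Two of the query types listed above can be illustrated with plain geometry. The Python sketch below implements a point query with the standard ray-casting test and a window query that returns the markups whose vertices all fall inside a rectangle. The function names and the dictionary of named polygons are assumptions for illustration. The system described in these slides runs such queries in a spatial database or on CPU/GPU and Hadoop back ends, not with code like this.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point lies strictly inside the polygon.

    polygon is a list of (x, y) vertices in order; the last vertex is
    implicitly connected back to the first.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def window_query(markups, window):
    """Names of markups whose vertices all lie inside the rectangle.

    markups maps a name to a list of (x, y) vertices;
    window is (xmin, ymin, xmax, ymax).
    """
    xmin, ymin, xmax, ymax = window
    return [
        name
        for name, polygon in markups.items()
        if all(xmin <= x <= xmax and ymin <= y <= ymax for x, y in polygon)
    ]


# Hypothetical markups: one square nucleus and one triangle far away.
markups = {
    "nucleus_1": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "nucleus_2": [(10, 10), (12, 10), (12, 12)],
}
hit = point_in_polygon((2, 2), markups["nucleus_1"])
in_window = window_query(markups, (-1, -1, 5, 5))
```

A containment query can be built from the same pieces by testing each nucleus against a tumor-region polygon, and a production system would add a spatial index so that most polygons are never tested at all.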

SLIDE 26

Algorithm Validation: Intersection between Two Result Sets (Spatial Join)

PAIS: Example Queries


SLIDE 27

VLDB 2012

Change Detection, Comparison, and Quantification

SLIDE 28

CPU/GPU Methods for Comparing Many Polygons

  • Cross-compare two sets of polygons, segmented by different algorithms or the same algorithm with different parameters
  • Jaccard similarity of P and Q, two sets of polygons representing the spatial boundaries of objects generated by two methods from the same image.
  • PixelBox accepts an array of polygon pairs as input and computes their areas of intersection and union.
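
For a single polygon pair (p, q), the Jaccard similarity is area(p ∩ q) / area(p ∪ q). The Python sketch below estimates both areas by counting unit pixels whose centers fall inside each polygon, then applies that ratio to an array of pairs. It illustrates only the pixel-counting idea. The function names are hypothetical, the loop is a naive CPU version, and it should not be read as the PixelBox algorithm itself, whose details are not given in these slides.

```python
import math


def _inside(x, y, polygon):
    """Ray-casting point-in-polygon test for a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def intersection_union_area(p, q):
    """Pixel-count estimates of the intersection and union areas of p and q."""
    xs = [x for x, _ in p + q]
    ys = [y for _, y in p + q]
    inter = union = 0
    # Visit every unit pixel in the joint bounding box of the two polygons.
    for ix in range(math.floor(min(xs)), math.ceil(max(xs))):
        for iy in range(math.floor(min(ys)), math.ceil(max(ys))):
            in_p = _inside(ix + 0.5, iy + 0.5, p)
            in_q = _inside(ix + 0.5, iy + 0.5, q)
            inter += in_p and in_q
            union += in_p or in_q
    return inter, union


def jaccard_pairs(pairs):
    """Jaccard similarity for each (p, q) polygon pair in the input list."""
    scores = []
    for p, q in pairs:
        inter, union = intersection_union_area(p, q)
        scores.append(inter / union if union else 0.0)
    return scores


# Two 4x4 squares overlapping by half: intersection 8, union 24.
square_a = [(0, 0), (4, 0), (4, 4), (0, 4)]
square_b = [(2, 0), (6, 0), (6, 4), (2, 4)]
scores = jaccard_pairs([(square_a, square_b), (square_a, square_a)])
```

Each pixel test is independent of every other, which is what makes this style of computation a natural fit for GPUs.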

SLIDE 29

Performance Improvement from PixelBox (VLDB 2012)

SLIDE 30

Summary and Perspective

  • Extreme spatio-temporal data analytics
  • Quantitative characterization of spatio-temporal features generated by large scale simulations, comparisons with experimental results
  • Methods and tools for extreme scale data analysis pipelines
  • Uncertainty quantification, comparison with experimental results

SLIDE 31

Thanks to:

  • In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director)
  • caGrid Knowledge Center: Joel Saltz, Mike Caliguiri, Steve Langella co-Directors; Tahsin Kurc, Himanshu Rathod Emory leads
  • caBIG In vivo imaging team: Eliot Siegel, Paul Mulhern, Adam Flanders, David Channon, Daniel Rubin, Fred Prior, Larry Tarbox and many others
  • In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz
  • Emory ATC Supplement team: Tim Fox, Ashish Sharma, Tony Pan, Edi Schreibmann, Paul Pantalone
  • Digital Pathology R01: Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers)
  • NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe
  • ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos
  • NSF Scientific Workflow Collaboration: Vijay Kumar, Yolanda Gil, Mary Hall, Ewa Deelman, Tahsin Kurc, P. Sadayappan, Gaurang Mehta, Karan Vahi

SLIDE 32

Thanks!